An NLP pipeline that uses the MPI communicator model to build a distributed environment for scalable, efficient data preprocessing and machine-learning model training.
- Parallel Data Preprocessing 🚀: Uses MPI to distribute and preprocess large datasets across multiple processes, enabling faster execution.
- Flexible NLP Pipelines 📖: Includes functions for text cleaning (stopword removal, punctuation stripping, URL elimination, and more).
- Decision Tree Classification 🌳: Implements a basic machine learning pipeline with sklearn’s DecisionTreeClassifier.
- Efficient Dataset Handling 📊: Supports large-scale datasets through intelligent splitting and gathering using MPI.
- Cross-platform 🌍: Compatible with any system supporting MPI and Python.
NLP-Distributed-System-MPI
|-- Project.py # Main Python script
|-- twitter_training.csv # Example dataset
Clone the repository:
git clone https://github.com/Abdelrhman-Ellithy/NLP-Distributed-System-MPI.git
cd NLP-Distributed-System-MPI
Install dependencies:
- It’s recommended to use a virtual environment:
python3 -m venv env
source env/bin/activate  # On Windows, use `env\Scripts\activate`
- Install required packages:
pip install -r requirements.txt
Ensure MPI is installed:
- For Linux:
sudo apt-get install mpich
- For macOS (using Homebrew):
brew install open-mpi
- Verify installation:
mpiexec --version
- Run the MPI script:
mpiexec -n <num_processes> python Project.py
Replace `<num_processes>` with the number of parallel processes you want to use.
- remove_stopwords: Eliminates common stopwords to improve data quality.
- remove_punc: Strips punctuation marks.
- remove_digits: Removes numerical characters.
- remove_html_tags: Cleans HTML content.
- remove_url: Filters out URLs.
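The cleaning helpers above can be sketched with plain `re`/`string` operations. This is a hypothetical re-implementation for illustration; `Project.py` draws its stopword list from nltk, whereas the tiny hardcoded set here is just a stand-in:

```python
import re
import string

# Small stand-in stopword set; the project uses nltk's full English list.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}

def remove_stopwords(text):
    # Drop common words that carry little signal
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def remove_punc(text):
    # Strip all punctuation characters
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_digits(text):
    # Remove numerical characters
    return re.sub(r"\d+", "", text)

def remove_html_tags(text):
    # Strip anything that looks like an HTML tag
    return re.sub(r"<[^>]+>", "", text)

def remove_url(text):
    # Filter out http(s) and www-style URLs
    return re.sub(r"https?://\S+|www\.\S+", "", text)

sample = "Visit https://example.com <b>now</b>: the offer ends in 3 days!"
for fn in (remove_url, remove_html_tags, remove_digits, remove_punc, remove_stopwords):
    sample = fn(sample)
print(sample)  # -> Visit now offer ends days
```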
- Load Dataset: Reads and prepares the dataset for processing.
- Preprocess with MPI: Distributes preprocessing tasks across multiple processes.
- Vectorize Text: Converts preprocessed text into numerical features using sklearn’s CountVectorizer.
- Train Model: Fits a Decision Tree classifier.
- Evaluate: Outputs a confusion matrix and classification report.
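The non-MPI half of this workflow can be sketched with scikit-learn alone. The chunk-splitting helper, the toy texts, and the labels below are illustrative assumptions, not taken from `Project.py`; only the `CountVectorizer` → `DecisionTreeClassifier` → report sequence mirrors the steps listed above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def split_into_chunks(items, n):
    """Split a list into n near-equal chunks (the shape of data handed to MPI scatter)."""
    k, r = divmod(len(items), n)
    out, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        out.append(items[start:end])
        start = end
    return out

# Toy stand-in for the (already preprocessed) Twitter dataset.
texts = ["good game", "bad lag", "love this game",
         "terrible update", "great fun", "awful bug"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# Vectorize, split, train, evaluate.
X = CountVectorizer().fit_transform(texts)
Xtr, Xte, ytr, yte = train_test_split(X, labels, test_size=0.33, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
pred = clf.predict(Xte)
print(confusion_matrix(yte, pred))
print(classification_report(yte, pred))
```

In the real pipeline, `split_into_chunks` style partitioning happens before the MPI scatter, and only the cleaned, gathered text reaches the vectorizer on the root process.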
- Speedup: Parallel processing significantly reduces preprocessing time for large datasets.
- Scalability: Easily scale the system by increasing the number of processes.
This project is licensed under the MIT License. See the LICENSE file for details.
- MPI4Py: For enabling Python-based MPI implementations.
- scikit-learn: For powerful machine learning tools.
- Pandas: For efficient data manipulation.
- nltk: For NLP-specific preprocessing utilities.
- Integration with advanced classifiers (e.g., Random Forest, Gradient Boosting).
- Adding support for GPU-based preprocessing.
- Extending compatibility with cloud-based environments.
Happy coding! 🎉