An NLP pipeline that uses the MPI communicator model to build a distributed environment for scalable, efficient data preprocessing and machine-learning model training.
- Parallel Data Preprocessing 🚀: Uses MPI to distribute and preprocess large datasets across multiple processes, enabling faster execution.
- Flexible NLP Pipelines 📖: Includes functions for text cleaning (stopword removal, punctuation stripping, URL elimination, and more).
- Decision Tree Classification 🌳: Implements a basic machine learning pipeline with sklearn’s DecisionTreeClassifier.
- Efficient Dataset Handling 📊: Supports large-scale datasets through intelligent splitting and gathering using MPI.
- Cross-platform 🌍: Compatible with any system supporting MPI and Python.
NLP-Distributed-System-MPI
|-- Project.py # Main Python script
|-- twitter_training.csv # Example dataset
Clone the repository:
git clone https://github.com/Abdelrhman-Ellithy/NLP-Distributed-System-MPI.git
cd NLP-Distributed-System-MPI
Install dependencies:
- It’s recommended to use a virtual environment:
python3 -m venv env
source env/bin/activate  # On Windows, use `env\Scripts\activate`
- Install required packages:
pip install -r requirements.txt
Ensure MPI is installed:
- For Linux:
sudo apt-get install mpich
- For macOS (using Homebrew):
brew install open-mpi
- Verify installation:
mpiexec --version
- Run the MPI script:
mpiexec -n <num_processes> python Project.py
Replace `<num_processes>` with the number of parallel processes you want to use.
- remove_stopwords: Eliminates common stopwords to improve data quality.
- remove_punc: Strips punctuation marks.
- remove_digits: Removes numerical characters.
- remove_html_tags: Cleans HTML content.
- remove_url: Filters out URLs.
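The cleaning helpers above can be sketched with plain `re`/`string` operations. This is a hypothetical re-implementation for illustration; `Project.py` draws its stopword list from nltk, whereas the tiny hardcoded set here is just a stand-in:

```python
import re
import string

# Small stand-in stopword set; the project uses nltk's full English list.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}

def remove_stopwords(text):
    # Drop common words that carry little signal
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def remove_punc(text):
    # Strip all punctuation characters
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_digits(text):
    # Remove numerical characters
    return re.sub(r"\d+", "", text)

def remove_html_tags(text):
    # Strip anything that looks like an HTML tag
    return re.sub(r"<[^>]+>", "", text)

def remove_url(text):
    # Filter out http(s) and www-style URLs
    return re.sub(r"https?://\S+|www\.\S+", "", text)

sample = "Visit https://example.com <b>now</b>: the offer ends in 3 days!"
for fn in (remove_url, remove_html_tags, remove_digits, remove_punc, remove_stopwords):
    sample = fn(sample)
print(sample)  # -> Visit now offer ends days
```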
- Load Dataset: Reads and prepares the dataset for processing.
- Preprocess with MPI: Distributes preprocessing tasks across multiple processes.
- Vectorize Text: Converts preprocessed text into numerical features using sklearn’s CountVectorizer.
- Train Model: Fits a Decision Tree classifier.
- Evaluate: Outputs a confusion matrix and classification report.
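The non-MPI half of this workflow can be sketched with scikit-learn alone. The chunk-splitting helper, the toy texts, and the labels below are illustrative assumptions, not taken from `Project.py`; only the `CountVectorizer` → `DecisionTreeClassifier` → report sequence mirrors the steps listed above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def split_into_chunks(items, n):
    """Split a list into n near-equal chunks (the shape of data handed to MPI scatter)."""
    k, r = divmod(len(items), n)
    out, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        out.append(items[start:end])
        start = end
    return out

# Toy stand-in for the (already preprocessed) Twitter dataset.
texts = ["good game", "bad lag", "love this game",
         "terrible update", "great fun", "awful bug"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# Vectorize, split, train, evaluate.
X = CountVectorizer().fit_transform(texts)
Xtr, Xte, ytr, yte = train_test_split(X, labels, test_size=0.33, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
pred = clf.predict(Xte)
print(confusion_matrix(yte, pred))
print(classification_report(yte, pred))
```

In the real pipeline, `split_into_chunks` style partitioning happens before the MPI scatter, and only the cleaned, gathered text reaches the vectorizer on the root process.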
- Speedup: Parallel processing significantly reduces preprocessing time for large datasets.
- Scalability: Easily scale the system by increasing the number of processes.
This project is licensed under the MIT License. See the LICENSE file for details.
- MPI4Py: For enabling Python-based MPI implementations.
- scikit-learn: For powerful machine learning tools.
- Pandas: For efficient data manipulation.
- nltk: For NLP-specific preprocessing utilities.
- Integration with advanced classifiers (e.g., Random Forest, Gradient Boosting).
- Adding support for GPU-based preprocessing.
- Extending compatibility with cloud-based environments.
Happy coding! 🎉