The Web Mining Algorithm is a Python-based tool designed to extract, analyze, and visualize information from websites. It leverages web scraping techniques, IP geolocation, and natural language processing to gather data such as location, names, emails, and social media links from a list of target websites.
- Web Scraping: Utilizes `requests` and `BeautifulSoup` to fetch and parse website content.
- IP Geolocation: Determines the geographical location of websites using IP addresses via the `ipapi.co` API.
- Named Entity Recognition (NER): Employs `spaCy` to identify and extract names from website content.
- Social Media Link Extraction: Collects links to social media profiles, including Twitter, Facebook, LinkedIn, Instagram, and YouTube.
- Email Extraction: Identifies and extracts email addresses present on the website.
- Data Presentation: Outputs the extracted information in a structured table format using `PrettyTable`.
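As an illustration of the email extraction feature, here is a minimal sketch using only the standard `re` module. The pattern and the `extract_emails` helper are assumptions made for this example and are not taken from `main.py`; the pattern is deliberately simple and will miss some valid addresses.

```python
import re

# Simplified pattern (an assumption for illustration, not a full RFC 5322 matcher):
# local part, "@", domain labels, and a top-level domain of two or more letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return unique, lowercased email addresses in order of first appearance."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(text):
        address = match.lower()
        if address not in seen:
            seen.add(address)
            emails.append(address)
    return emails


if __name__ == "__main__":
    sample = "Write to Info@Example.com or sales@example.co.uk. Again: info@example.com"
    print(extract_emails(sample))
```

Lowercasing before de-duplication keeps the same address from appearing twice when a page spells it with different capitalization.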
- Programming Language: Python 3.x
- Libraries:
  - `requests`
  - `beautifulsoup4`
  - `socket` (part of the Python standard library, so no installation is needed)
  - `spacy`
  - `prettytable`
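To show how `socket` and the `ipapi.co` API could fit together, the sketch below resolves a hostname to an IP address and then looks up its location. It is not the code from `main.py`: the helper names (`hostname_of`, `format_location`, `locate`) are hypothetical, and it uses the standard library's `urllib` in place of `requests` so that it runs without extra packages. The `city`, `region`, and `country_name` keys are fields of the `ipapi.co` JSON response.

```python
import json
import socket
import urllib.request
from urllib.parse import urlparse


def hostname_of(site: str) -> str:
    """Return the bare hostname for an entry such as 'https://example.com/about'."""
    parsed = urlparse(site if "://" in site else "//" + site)
    return parsed.hostname or ""


def format_location(info: dict) -> str:
    """Join the city, region, and country fields that are present; 'N/A' if none."""
    parts = [info.get(key) for key in ("city", "region", "country_name")]
    parts = [part for part in parts if part]
    return ", ".join(parts) if parts else "N/A"


def locate(site: str, timeout: float = 5.0) -> str:
    """Resolve the site's IP and query ipapi.co; return 'N/A' on any failure."""
    try:
        ip = socket.gethostbyname(hostname_of(site))
        url = f"https://ipapi.co/{ip}/json/"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return format_location(json.load(response))
    except (OSError, ValueError):
        # Covers DNS errors, HTTP/network errors, and malformed JSON.
        return "N/A"
```

Returning `"N/A"` on failure mirrors the placeholder used in the output table, so one unreachable site does not stop the whole run.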
- Python 3.x: Ensure Python is installed on your system.
- Install Required Libraries: Use `pip` to install the necessary libraries, then download the spaCy English model:

      pip install requests beautifulsoup4 spacy prettytable
      python -m spacy download en_core_web_sm
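Once the `en_core_web_sm` model is downloaded, name extraction could look roughly like the sketch below. The function names are hypothetical and the de-duplication step is an assumption; only `spacy.load`, `doc.ents`, and the `PERSON` label come from spaCy itself.

```python
def unique_in_order(items):
    """Normalize whitespace, then drop blanks and duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        cleaned = " ".join(item.split())
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


def extract_names(text, model="en_core_web_sm"):
    """Return the PERSON entities that spaCy finds in `text`."""
    import spacy  # imported lazily so unique_in_order works without spaCy installed

    nlp = spacy.load(model)
    doc = nlp(text)
    return unique_in_order(ent.text for ent in doc.ents if ent.label_ == "PERSON")
```

Loading the model is slow, so a script that processes many sites would normally call `spacy.load` once and reuse the resulting `nlp` object rather than reloading it per page.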
Clone the repository and change into its directory:

    git clone https://github.com/ozeraysenur/WebMiningAlgorithm.git
    cd WebMiningAlgorithm

Create a text file named `websites.txt` containing the list of target websites, each on a new line.

Run the main Python script:

    python main.py

The script will display a table with the extracted information, including website, location, names, emails, and social media links.
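For reference, reading `websites.txt` into a list of targets can be sketched as follows. The helper names are hypothetical, and skipping blank lines and lines starting with `#` is an assumption of this example, not documented behavior of `main.py`.

```python
from pathlib import Path


def parse_site_list(text: str) -> list[str]:
    """Return non-empty, non-comment lines, stripped of surrounding whitespace."""
    sites = []
    for line in text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            sites.append(entry)
    return sites


def load_sites(path: str = "websites.txt") -> list[str]:
    """Read the input file and return the list of target websites."""
    return parse_site_list(Path(path).read_text(encoding="utf-8"))
```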

    WebMiningAlgorithm/
    ├── websites.txt   # Input file containing list of target websites
    ├── main.py        # Main script for web mining
    └── README.md      # Project documentation

After running the script, you can expect an output similar to:

    +-------------------+----------+---------------------+-------------------+-----------+---------+----------+----------+-----------+
    |      Website      | Location |        Names        |       Email       | Instagram | YouTube | LinkedIn | Twitter  | Facebook  |
    +-------------------+----------+---------------------+-------------------+-----------+---------+----------+----------+-----------+
    | somewebsite1.com  |   N/A    |  ['Ayse Nur Ozer']  |   contact@site1   |    N/A    |   N/A   |   N/A    |   N/A    |    N/A    |
    | somewebsite2.com  |   N/A    |    ['John Doe']     |    info@site2     |    N/A    |   N/A   |   N/A    |   N/A    |    N/A    |
    +-------------------+----------+---------------------+-------------------+-----------+---------+----------+----------+-----------+

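A table of this shape can be produced with `PrettyTable` along the lines of the sketch below. The column names are taken from the sample output; the `to_row` and `render_table` helpers and the dictionary-per-site record format are assumptions made for illustration.

```python
COLUMNS = ["Website", "Location", "Names", "Email",
           "Instagram", "YouTube", "LinkedIn", "Twitter", "Facebook"]


def to_row(record: dict) -> list:
    """Order one site's results by column, filling missing or empty values with 'N/A'."""
    return [record.get(column) or "N/A" for column in COLUMNS]


def render_table(records: list) -> str:
    """Build the table text for a list of per-site result dictionaries."""
    from prettytable import PrettyTable  # lazy import: to_row needs no third-party package

    table = PrettyTable()
    table.field_names = COLUMNS
    for record in records:
        table.add_row(to_row(record))
    return table.get_string()
```

Calling `print(render_table(records))` would then display one row per website, with `N/A` wherever nothing was found.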
- Adding More Social Media Platforms: To extend the script's capabilities to extract additional social media links, you can modify the `extract_social_media_links` function in `main.py` to include patterns for other platforms.
- Enhancing Named Entity Recognition: To improve the accuracy of name extraction, consider training `spaCy` with a custom dataset tailored to your specific requirements.
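One way to make new platforms easy to add is to keep the patterns in a single mapping. The sketch below is a hypothetical stand-in for `extract_social_media_links`, whose actual body is not shown here. It uses the standard library's `html.parser` rather than `BeautifulSoup` so that it runs without dependencies, and the `TikTok` entry is only an example of an added platform.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Platform name -> domains that count as that platform.
# Supporting another platform means adding one entry; "TikTok" is illustrative.
SOCIAL_DOMAINS = {
    "Twitter": ("twitter.com",),
    "Facebook": ("facebook.com",),
    "LinkedIn": ("linkedin.com",),
    "Instagram": ("instagram.com",),
    "YouTube": ("youtube.com",),
    "TikTok": ("tiktok.com",),
}


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_social_links(page_html: str) -> dict:
    """Return the first link found for each known platform."""
    collector = _HrefCollector()
    collector.feed(page_html)
    found = {}
    for href in collector.hrefs:
        host = (urlparse(href).hostname or "").lower()
        for platform, domains in SOCIAL_DOMAINS.items():
            matches = any(host == d or host.endswith("." + d) for d in domains)
            if matches and platform not in found:
                found[platform] = href
    return found
```

Matching on the parsed hostname instead of searching the raw URL string avoids false positives such as a link whose path merely contains `facebook.com`.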
Contributions are welcome! To contribute:
- Fork the Repository: Click the 'Fork' button on the GitHub page.
- Create a New Branch: Create a branch for your feature or bug fix:

      git checkout -b feature-name

- Make Changes: Implement your feature or fix.
- Commit Changes: Commit your modifications with a descriptive message:

      git commit -m "Description of feature or fix"

- Push to Branch: Push your changes to GitHub:

      git push origin feature-name

- Submit Pull Request: Open a pull request on GitHub with a detailed description of your changes.

Special thanks to the developers and maintainers of the open-source libraries utilized in this project, including requests, BeautifulSoup, spaCy, and PrettyTable.
For any questions or support, please open an issue in this repository.
Happy coding! 💻✨