
Need access to Twitter data? Struggling with managing a developer account? This repository will help you get started and gain access to almost 20 years of historical tweets.


Extract Historical Twitter Data


getTwitterData.ipynb

  • Extracts Twitter metadata for a given timeline and a search query (one or more words) in most widely used languages.
  • Queries Twitter's search endpoint through snscrape with a formatted payload.
  • No need for an API key.
API Payload:
start date, end date, search query (a usage sketch follows the attribute list below)

Output Attributes:
- date: Tweet timestamp
- tweet: Tweet content
- lang: Language classifier assigned by the parent API
- retweetCount: Number of times the tweet was retweeted
- likeCount: Number of likes the tweet received
- replyCount: Number of replies to the original tweet
- username: Author's profile handle
- user_followersCount: Number of followers the author has (a rough popularity index for the tweet)
- user_friendsCount: Number of friends the author has
- verifiedStatus: Whether the author is verified (i.e. pays 8 bucks every month!)
- tweet_url: URL of the original tweet
- hashtags: Any hashtags used (hashtags are important for search and information retrieval, functioning much like anchor text on the web)
- chr_count: Number of characters in the original tweet
- topic: The keywords you used to search for the tweet (i.e. labels)
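
For reference, here is a minimal sketch of how such a query can be issued with snscrape's Python API. This is an illustration, not the notebook's exact code: the query string, the 100-tweet cap, and the column mapping are assumptions, and attribute names on the tweet object (e.g. content vs. rawContent) can vary between snscrape versions.

# Minimal sketch (illustrative, not the notebook's exact code):
import pandas as pd
import snscrape.modules.twitter as sntwitter

# Payload: start date, end date, search query
query = '"data science" since:2021-01-01 until:2021-12-31'

rows = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 100:  # cap the sample size for this sketch
        break
    rows.append({
        "date": tweet.date,
        "tweet": tweet.content,
        "lang": tweet.lang,
        "retweetCount": tweet.retweetCount,
        "likeCount": tweet.likeCount,
        "replyCount": tweet.replyCount,
        "username": tweet.user.username,
        "user_followersCount": tweet.user.followersCount,
        "user_friendsCount": tweet.user.friendsCount,
        "verifiedStatus": tweet.user.verified,
        "tweet_url": tweet.url,
        "hashtags": tweet.hashtags,
    })

df = pd.DataFrame(rows)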

NLP_basics_preprocessing_vectorization_similarity.ipynb

  • Basic and advanced pre-processing units for general English text cleaning. Works well on short texts as well as large corpora (informed by practical experience and fine-tuned with the pipeline used in OpenAI's GPT-3 series in mind). A minimal cleaning sketch follows this list.
  • Contains a general vectorizer pipeline streamlined for most common embedding models, such as Word2Vec, WMD, GloVe, GoogleNews vectors, BERT, FastText, and Google USE.
  • You will need to store and load your own embedding models in order to run this.
  • Contains a detailed comparison of various industrial techniques for syntactic and semantic similarity estimation.
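
As an illustration of the kind of basic pre-processing the notebook covers, here is a minimal, hypothetical cleaning function. The function name and the particular cleaning steps are illustrative only and do not mirror the notebook's exact pipeline.

# Hypothetical sketch of a basic cleaning step (not the notebook's exact code):
import re
import string

def basic_clean(text: str) -> str:
    text = text.lower()                                   # normalize case
    text = re.sub(r"https?://\S+", " ", text)             # strip URLs
    text = re.sub(r"[@#]\w+", " ", text)                  # strip mentions/hashtags
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace

print(basic_clean("Check this out!! https://t.co/xyz #NLP @someone"))
# -> "check this out"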

Advised/Supervised by:

  • Dr. Hyuckchul Jung (Principal Data Science Lead, Meta; Ph.D. in CS, Machine Learning, University of Southern California)
  • Dr. Jason Li (Data Science Lead, Morgan Stanley)

Setup

pip install git+https://github.com/JustAnotherArchivist/snscrape.git 
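
To confirm the install worked (the repo's Conda_env_python38forTextAnalytics.yml suggests a Python 3.8 environment, which snscrape's development version requires), a quick import check is enough:

# Quick sanity check that the git install is importable:
import snscrape.modules.twitter as sntwitter
print("snscrape imported successfully")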

Directory Structure

.
├── data
│   └── Resources
│       ├── chatwords.txt
│       └── ...
├── models
│   ├── ...
│   └── en_core_web_lg
│       ├── __init__.py
│       ├── meta.json
│       └── en_core_web_lg-2.2.5
│           ├── config.cfg
│           ├── meta.json
│           └── ...
├── Requirements.txt
├── Conda_env_python38forTextAnalytics.yml
├── getTwitterData.ipynb
└── NLP_basics_preprocessing_vectorization_similarity.ipynb

Model Path Setting

# PATH:
import os

root_dir = os.path.abspath("./")
data_dir = os.path.join(root_dir, "data")
output_dir = os.path.join(root_dir, "outputs")
PATH_SPACY_MODEL = os.path.join(root_dir, "models", "en_core_web_lg", "en_core_web_lg-2.2.5")
PATH_RES_DIR = os.path.join(data_dir, "Resources")  # case must match the directory name

Model Loading

# 1. Load spaCy model:
import spacy
from spacy import displacy
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = spacy.load(PATH_SPACY_MODEL, disable=["ner"])  # NER is not needed for this pipeline
print("Spacy loaded.")

# 2. Point at the NLP resources (chatwords.txt, etc.):
resources_dir_path = PATH_RES_DIR

# 3. Point at the sentence-BERT model; PATH_BERT_MODEL must be set to the
#    location of your own stored embedding model (see the note above):
bert_model_fp = PATH_BERT_MODEL
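
A hypothetical sketch of what loading and using the sentence-BERT model from that path might look like, assuming the sentence-transformers package (which this repo does not pin) and a locally stored model directory:

# Hypothetical sketch; sentence-transformers is an assumption, not a pinned dependency:
from sentence_transformers import SentenceTransformer

bert_model = SentenceTransformer(bert_model_fp)      # load from the local model path
embeddings = bert_model.encode(["a short tweet to embed"])
print(embeddings.shape)                              # (1, embedding_dim)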

Credits:
