Utility scripts to find article records within the Web of Science data set.
This code provides a few wrapper objects for Web of Science (WOS) JSON data files. The Article class wraps an individual JSON record, which occupies one line of a WOS JSON article data file.
article = Article(raw_json)
print(article['id'], article['title'])
The ArticleCollection object wraps the files themselves and behaves like an iterator.
filepath = '/path/to/articles.json'
for article in ArticleCollection(filepath):
    print(article['id'], article['title'])
Because references are an important field, calling article.references() is preferable to using the dict-style accessor article['references']. The method form always returns a list: even when the raw JSON has a null value for references, it returns an empty list, so the field is always safe to iterate over.
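The null-safe behavior described above can be illustrated with a minimal sketch. The names SketchArticle and iter_articles are hypothetical stand-ins, not the package's own classes, and the shape of a reference entry (a dict with id and year keys) is an assumption made only for this example.

```python
import json


class SketchArticle(dict):
    """Hypothetical stand-in for Article; not the package's own class."""

    def references(self):
        # `or []` turns a missing key or a JSON null into an empty list.
        return self.get("references") or []


def iter_articles(lines):
    """Yield one wrapped record per non-empty line of JSON text."""
    for line in lines:
        if line.strip():
            yield SketchArticle(json.loads(line))


# Two sample records, one per line; the reference shape is assumed.
sample_lines = [
    '{"id": "WOS:A", "title": "First", "references": null}',
    '{"id": "WOS:B", "title": "Second", "references": [{"id": "WOS:A", "year": "1998"}]}',
]
articles = list(iter_articles(sample_lines))
for article in articles:
    # Safe to call len() or iterate even when the raw value was null.
    print(article["id"], len(article.references()))
```

In this sketch the dict-style accessor still returns None for the first record, which is the case the method form is meant to guard against.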
The ReferenceList object is a data structure that wraps the references for either an Article or an ArticleCollection object. It behaves as an iterable wrapper around a dictionary in which the keys are the years when references were published and each year's value is a set of WOS ids.
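The year-to-ids mapping can be pictured with plain Python containers. This is only a sketch of the data shape, assuming references arrive as (year, id) pairs; group_ids_by_year is a hypothetical helper and does not reflect how ReferenceList is actually implemented.

```python
from collections import defaultdict


def group_ids_by_year(references):
    """Build {year: set of ids} from (year, wos_id) pairs."""
    ids_by_year = defaultdict(set)
    for year, wos_id in references:
        # A set drops repeated ids within the same year.
        ids_by_year[year].add(wos_id)
    return dict(ids_by_year)


pairs = [("1998", "WOS:A"), ("1996", "WOS:B"), ("1998", "WOS:C"), ("1998", "WOS:A")]
grouped = group_ids_by_year(pairs)
for year in sorted(grouped):
    print(year, sorted(grouped[year]))
```

Using a set per year means an id cited several times in the same year is stored once.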
ReferenceList objects also have convenience methods to return all their years or ids.
filepath = '/path/to/articles.json'
print(ArticleCollection(filepath).reference_list().years())
# => ['1996', '1997', '1998', '1998', '1999', '2000', '2001', '2003']
The wos_explorer package depends on NLTK, the Natural Language Toolkit, specifically for the word tokenizer and n-grams functionality behind WOS Explorer's phrase searching. Note that the NLTK word_tokenize() function depends on NLTK's "punkt" data set, which is not downloaded automatically by the pip install nltk command. You may need to run the following commands in a Python terminal session:
>>> import nltk
>>> nltk.download("punkt")
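To give a rough idea of how tokenizing and n-grams combine into phrase searching, here is a dependency-free sketch. It deliberately swaps in a simple regular-expression tokenizer and a hand-written n-gram helper in place of NLTK, so it runs without the "punkt" data; the function names are hypothetical and the package's actual matching logic may differ.

```python
import re


def tokenize(text):
    # Crude lowercase word split; a stand-in for NLTK's word tokenizer.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    # Every run of n consecutive tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contains_phrase(text, phrase):
    """Return True if the phrase's tokens appear consecutively in the text."""
    phrase_tokens = tuple(tokenize(phrase))
    if not phrase_tokens:
        return False
    return phrase_tokens in ngrams(tokenize(text), len(phrase_tokens))


title = "Deep learning for natural language processing"
print(contains_phrase(title, "natural language"))
print(contains_phrase(title, "language natural"))
```

Comparing whole n-grams rather than individual words is what makes word order matter, so the second call does not match.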