comp530-group-project

PRISMA-ScR Automated Research Pipeline

This project provides an integrated, automated pipeline for conducting scoping reviews using the PRISMA-ScR methodology. The pipeline retrieves, screens, ranks, and summarizes research papers from multiple open-access databases, leveraging both traditional NLP models and state-of-the-art large language models (LLMs).

Features

- Automated Literature Search: Searches Europe PMC, DOAJ, and Semantic Scholar for open-access research papers based on user-supplied keywords or research questions.
- PDF Download: Downloads available open-access PDFs for all retrieved papers.
- Abstract Screening & Ranking: Uses BM25, SBERT, SPLADE, and ensemble LLMs (Gemini, Mistral) to rank abstracts for relevance.
- Ensemble Summarization: Generates summaries of top-ranked papers using both Gemini and Mistral LLMs, and synthesizes a final ensemble summary.
- Evaluation: Optionally evaluates summaries using BERTScore against a reference summary.
- Reproducible & Modular: All steps are contained in a single notebook for easy modification and reproducibility.
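To make the lexical ranking step concrete, here is a minimal sketch of Okapi BM25 scoring written with the standard library only. It is not the notebook's code: the pipeline lists the `rank_bm25` package for this step, and the function names, the whitespace tokenizer, and the `k1`/`b` defaults below are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_abstracts(query, abstracts):
    """Return (index, score) pairs sorted from most to least relevant."""
    scores = bm25_scores(query, abstracts)
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    abstracts = [
        "Mobile health apps improve medication adherence in adults.",
        "Climate change affects crop yields in coastal regions.",
        "Text messaging interventions for mobile health in rural clinics.",
    ]
    for index, score in rank_abstracts("mobile health interventions", abstracts):
        print(f"{score:.3f}  {abstracts[index]}")
```

BM25 only rewards exact term overlap, which is presumably why the pipeline pairs it with embedding-based rankers (SBERT, SPLADE) and LLM judgments.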

Requirements

- Python 3.8+
- Jupyter Notebook or Google Colab
- The following Python packages (install via pip if running locally):
  - python-dotenv
  - requests
  - concurrent.futures (standard library; no installation needed)
  - beautifulsoup4
  - bert-score
  - mistralai
  - rank_bm25
  - sentence-transformers
  - transformers
  - torch
  - scikit-learn

Tip: The notebook will attempt to install missing packages automatically if run in Colab.
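When running locally, it can help to check which of the packages above are missing before opening the notebook. The following is a small sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the notebook, and the import names are the ones these packages are commonly imported under.

```python
import importlib.util

# Maps pip distribution names to the module names they are imported under.
REQUIRED = {
    "python-dotenv": "dotenv",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "bert-score": "bert_score",
    "mistralai": "mistralai",
    "rank_bm25": "rank_bm25",
    "sentence-transformers": "sentence_transformers",
    "transformers": "transformers",
    "torch": "torch",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```

The check only looks for the module on the import path; it does not import heavy packages such as `torch`, so it runs quickly.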

Setup

Clone or Download the Repository
API Keys:
- Obtain API keys for Gemini and Mistral LLMs.
- Create a .env file (e.g., Semantic_key.env) in your working directory with the following format:
```
GEMINI_API_KEY=your_gemini_api_key_here
MISTRAL_API_KEY=your_mistral_api_key_here
```
- If using the Semantic Scholar API, add the following line to the same file:
```
API_KEY=your_semantic_scholar_api_key_here
```
Open IntegratedPipeline.ipynb in Jupyter or Colab.
Upload your .env file when prompted (if running in Colab).
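As a quick way to confirm that the key file is well formed, the sketch below parses the `KEY=value` format shown above and fails early if an expected key is absent. It uses only the standard library so it runs anywhere; in the notebook itself the usual route would be `load_dotenv` from python-dotenv. The helper names `read_env_file` and `require_keys` are hypothetical.

```python
import os
import tempfile


def read_env_file(path):
    """Parse KEY=value lines, skipping blank lines and '#' comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value.
            values[key.strip()] = value.strip().strip("'\"")
    return values


def require_keys(values, names):
    """Return the requested keys, or raise if any is missing or empty."""
    missing = [name for name in names if not values.get(name)]
    if missing:
        raise KeyError("Missing keys in .env file: " + ", ".join(missing))
    return {name: values[name] for name in names}


if __name__ == "__main__":
    # Demo with a throwaway file so no real keys are needed.
    with tempfile.TemporaryDirectory() as tmp:
        env_path = os.path.join(tmp, "Semantic_key.env")
        with open(env_path, "w", encoding="utf-8") as handle:
            handle.write("GEMINI_API_KEY=demo-gemini\nMISTRAL_API_KEY=demo-mistral\n")
        keys = require_keys(read_env_file(env_path), ["GEMINI_API_KEY", "MISTRAL_API_KEY"])
        print(sorted(keys))
```

Failing at load time gives a clearer message than an authentication error raised later, in the middle of a summarization call.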

Usage

Run all cells in IntegratedPipeline.ipynb.
When prompted, enter your research question or keywords.
The pipeline will:
- Search databases and download papers
- Screen and rank abstracts
- Generate and display ensemble summaries
- Optionally evaluate summaries if a reference is provided
Results (rankings, summaries) are saved in the results/ directory.
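For readers curious what the search step might look like for one of the sources, here is a hedged sketch against the Europe PMC REST search endpoint. It is not taken from the notebook: the function names are hypothetical, and the query filter (`OPEN_ACCESS:y`), the `resultType=core` parameter, and the `title`/`abstractText` response fields reflect the public Europe PMC API as I understand it, so check them against the official documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_europe_pmc_url(keywords, page_size=25):
    """Build a search URL restricted to open-access records."""
    query = f"({keywords}) AND OPEN_ACCESS:y"
    params = {
        "query": query,
        "format": "json",
        "resultType": "core",  # assumed to be needed for abstracts in the response
        "pageSize": page_size,
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def extract_records(payload):
    """Pull title and abstract out of a decoded search response."""
    results = payload.get("resultList", {}).get("result", [])
    return [
        {"title": item.get("title", ""), "abstract": item.get("abstractText", "")}
        for item in results
    ]


def search_europe_pmc(keywords, page_size=25, timeout=30):
    """Run the search over the network and return simplified records."""
    url = build_europe_pmc_url(keywords, page_size)
    with urlopen(url, timeout=timeout) as response:
        return extract_records(json.load(response))


if __name__ == "__main__":
    # Printing the URL keeps this demo offline; call search_europe_pmc to query.
    print(build_europe_pmc_url("mobile health"))
```

Separating URL construction and response parsing from the network call makes both easy to test without hitting the API.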

Output

- Ranked Abstracts: JSON file with ranked papers and relevance scores
- Summaries: Model-specific and ensemble summaries for top papers
- BERTScore Evaluation: (Optional) Precision, recall, and F1 for summaries
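The exact schema of the ranked-abstracts file is not documented here, so the following is only an illustration of how such a file could be written. The file name `ranked_abstracts.json`, the `score` and `rank` fields, and the `save_rankings` helper are all assumptions, not the notebook's actual output format.

```python
import json
import tempfile
from pathlib import Path


def save_rankings(papers, output_dir="results", filename="ranked_abstracts.json"):
    """Sort papers by score (highest first), add a 1-based rank, write JSON."""
    ordered = sorted(papers, key=lambda paper: paper["score"], reverse=True)
    payload = [
        dict(paper, rank=position)
        for position, paper in enumerate(ordered, start=1)
    ]
    target_dir = Path(output_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Write to a temporary directory so the demo leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        path = save_rankings(
            [
                {"title": "Paper A", "score": 0.42},
                {"title": "Paper B", "score": 0.87},
            ],
            output_dir=tmp,
        )
        print(path.read_text(encoding="utf-8"))
```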

Notes

- The pipeline is designed for open-access literature and may not retrieve paywalled content.
- LLM API usage may incur costs and is subject to rate limits.
- For best results, use in Google Colab with GPU enabled.
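Because LLM and search APIs are rate limited, wrapping calls in a retry with exponential backoff is a common safeguard. The sketch below shows the general pattern; it is not the notebook's implementation, and `call_with_backoff` and its parameters are hypothetical. In real use, `retry_on` would be narrowed to the specific rate-limit exception raised by the client library in question.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then retry.

    The last failure is re-raised once max_attempts calls have been made.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_request():
        """Stand-in for an API call that fails twice before succeeding."""
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("simulated rate limit")
        return "summary text"

    # A short base delay keeps the demo fast.
    print(call_with_backoff(flaky_request, base_delay=0.01))
```

Passing `sleep` as a parameter lets tests replace the real delay with a recorder, so the retry logic can be checked without waiting.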

Vzaldat/NLP_Large_Text_Analysis_For_Scoping_Review_Process
