Kaggle AI Agent Capstone Project - Human Genomics AI Agent allowing technical enquiries, clinical research and mutation analysis
This notebook implements SAPIEN AI, a fully automated multi-agent genomics analysis system built for the Kaggle Agents Intensive Capstone (Google × Kaggle, Nov–Dec 2025). It turns complex genomics workflows into a single conversational interface, powered by a coordinated team of domain-specific AI agents. It is intended for Kaggle, GitHub, educational, research, and workshop practice purposes only. The current version is:
- based on demo tools only,
- using simplified mock or publicly accessible data,
- designed for educational and prototyping purposes,
- not tested or verified on real patient VCFs,
- not reviewed under clinical pipelines or ISO standards.

Future versions may expand or improve the functionality, but this project must not be interpreted as a clinical workflow or used to make health-related decisions.
Genomic interpretation normally requires many separate steps:
- VCF parsing
- VEP annotation
- ClinVar clinical significance lookup
- Gene metadata retrieval
- PubMed literature review
- Final scientific report writing
This notebook unifies all of these into one conversation by constructing a multi-agent architecture in which each agent is responsible for a specific domain task and a Supervisor coordinates their execution.
The result is an end-to-end genomics intelligence system that produces a research-style Markdown report for any user question.
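To make the first step in the list above concrete, here is a minimal sketch of VCF parsing in plain Python. It is not the notebook's own parser: the `Variant` class, the `parse_vcf_lines` function, and the sample record are all hypothetical and made up for illustration. It assumes the standard tab-separated VCF layout with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alts: list
    info: dict


def parse_vcf_lines(lines):
    """Yield a Variant for each data line; skip headers, blanks, and short lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # "##" meta lines and the "#CHROM" header carry no variants
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # malformed line: fewer than the eight fixed columns
        chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
        info_map = {}
        if info != ".":
            for item in info.split(";"):
                key, _, value = item.partition("=")
                info_map[key] = value  # flag entries such as "DB" map to ""
        yield Variant(chrom, int(pos), ref, alt.split(","), info_map)


# Synthetic, made-up record used only to exercise the parser.
SAMPLE_LINES = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tG\tA,T\t50\tPASS\tDP=100;DB",
]

variants = list(parse_vcf_lines(SAMPLE_LINES))
for variant in variants:
    print(variant)
```

A real pipeline would more likely rely on an established VCF library and would also handle genotype columns, which this sketch ignores.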
| Agent | Purpose |
|---|---|
| GeneExpert | Retrieves Ensembl gene metadata using `ensembl_gene_lookup`. |
| VariantAnalyst | Parses VCF files, then runs VEP annotation, ClinVar lookup, and gene inference. |
| LiteratureExpert | Uses hybrid RAG to synthesize PubMed and Semantic Scholar (S2) literature. |
| ChiefScientist | Produces the final, unified Markdown research report. |
| Supervisor | Executes the multi-step orchestration and performs Agent-to-Agent (A2A) routing. |
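As an illustration of the kind of tool that sits behind an agent such as GeneExpert, the sketch below queries the public Ensembl REST `lookup/symbol` endpoint. It is an assumption about how a gene lookup could be written, not the notebook's actual `ensembl_gene_lookup`; the function names `build_lookup_url`, `fetch_gene`, and `summarize_gene` are hypothetical. The network call is defined but not executed here.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def build_lookup_url(symbol, species="homo_sapiens"):
    """Build the Ensembl REST URL that looks a gene up by its symbol."""
    path = f"/lookup/symbol/{urllib.parse.quote(species)}/{urllib.parse.quote(symbol)}"
    return f"{ENSEMBL_REST}{path}?content-type=application/json"


def fetch_gene(symbol, species="homo_sapiens", timeout=10):
    """Fetch the raw JSON record for a gene symbol (requires network access)."""
    with urllib.request.urlopen(build_lookup_url(symbol, species), timeout=timeout) as response:
        return json.load(response)


def summarize_gene(record):
    """Keep a few commonly used fields; missing fields become None."""
    keys = ("id", "display_name", "biotype", "seq_region_name", "start", "end")
    return {key: record.get(key) for key in keys}


print(build_lookup_url("BRCA1"))
# Typical use, left commented out because it needs network access:
# print(summarize_gene(fetch_gene("BRCA1")))
```

Keeping URL construction separate from the network call makes the lookup easy to test offline and to wrap in retry or rate-limit handling later.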
Agents are not just LLM prompts: they are called programmatically using ADK’s `delegate()` API.
The Supervisor:
- Detects the type of query
- Decides which agents should run
- Executes them in the correct order
- Passes all outputs downstream (A2A)
- Assembles a final report via ChiefScientist
Outputs flow in a fixed order: VariantAnalyst → GeneExpert → LiteratureExpert → ChiefScientist.
All downstream modules receive upstream outputs to ensure a coherent final report.
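The Supervisor steps above can be sketched in plain Python without any agent framework. This is a simplified illustration, not the notebook's ADK-based Supervisor: the agents are stub functions, the regular expressions are a naive heuristic (any short all-caps token is treated as a gene symbol, so a word like "DNA" would also match), and every name in the block is hypothetical. The `VA`, `GENE`, and `LIT` keys mirror the block names used elsewhere in this README.

```python
import re

VCF_RE = re.compile(r"\S+\.vcf(?:\.gz)?\b", re.IGNORECASE)
GENE_RE = re.compile(r"\b[A-Z][A-Z0-9]{1,9}\b")  # naive gene-symbol heuristic


def classify(query):
    """Detect a VCF file name and a gene-like token in the query."""
    vcf = VCF_RE.search(query)
    remainder = VCF_RE.sub(" ", query)  # do not mistake file names for genes
    gene = GENE_RE.search(remainder)
    return {
        "vcf": vcf.group(0) if vcf else None,
        "gene": gene.group(0) if gene else None,
    }


# Stub agents: each reads the shared context and returns its output block.
def variant_analyst(ctx):
    return f"variants parsed from {ctx['vcf']}"


def gene_expert(ctx):
    return f"metadata for {ctx['gene']}"


def literature_expert(ctx):
    return f"literature for {ctx['query']!r}"


def chief_scientist(ctx):
    blocks = [ctx[key] for key in ("VA", "GENE", "LIT") if key in ctx]
    return "\n".join(blocks)


def run_supervisor(query):
    """Plan, run agents in order, and pass every output downstream."""
    ctx = {"query": query, **classify(query)}
    plan = []
    if ctx["vcf"]:
        plan.append(("VA", variant_analyst))
    if ctx["gene"]:
        plan.append(("GENE", gene_expert))
    plan.append(("LIT", literature_expert))
    plan.append(("REPORT", chief_scientist))
    trace = []
    for key, agent in plan:
        ctx[key] = agent(ctx)  # later agents see all earlier outputs
        trace.append(key)
    return trace, ctx


trace, ctx = run_supervisor("analyse sample.vcf with BRCA1")
print(trace)
print(ctx["REPORT"])
```

Passing one shared context dictionary through the plan is a simple way to model the A2A hand-off: each agent can read everything produced upstream, and the final agent only needs to check which blocks exist.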
ChiefScientist produces a multi-section genomic analysis report containing:
- Genes Analyzed
- Variant Tables (VEP + ClinVar)
- Literature-Derived Insights
- Summary & Disclaimer
The report is suitable for educational and research-only workflows.
- If no VCF → VariantAnalyst is skipped
- If no gene symbol → GeneExpert is skipped
- LiteratureExpert always runs
- ChiefScientist accepts empty blocks (no errors)
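The report layout and the empty-block rule above can be illustrated with a small Markdown builder. This is a sketch under assumptions, not the ChiefScientist agent itself (which the notebook implements as an LLM-backed agent): the `build_report` function, its parameters, the placeholder text, and the table columns are hypothetical. Only the four section titles come from this README.

```python
EMPTY = "_No data available._"
DISCLAIMER = (
    "For education and research only. "
    "Not a medical device; not for clinical decisions."
)


def variant_table(rows):
    """Render variant dicts as a Markdown table; missing keys become blank cells."""
    lines = ["| Variant | VEP consequence | ClinVar significance |", "|---|---|---|"]
    for row in rows:
        lines.append(
            f"| {row.get('variant', '')} | {row.get('consequence', '')} "
            f"| {row.get('clinvar', '')} |"
        )
    return "\n".join(lines)


def build_report(genes=None, variants=None, literature=None, summary=None):
    """Assemble the four report sections, tolerating any missing block."""
    literature = (literature or "").strip()
    sections = [
        ("Genes Analyzed", ", ".join(genes) if genes else EMPTY),
        ("Variant Tables (VEP + ClinVar)", variant_table(variants) if variants else EMPTY),
        ("Literature-Derived Insights", literature or EMPTY),
        ("Summary & Disclaimer", f"{summary or ''}\n\n{DISCLAIMER}".strip()),
    ]
    parts = ["# Genomic Analysis Report"]
    for title, body in sections:
        parts.append(f"## {title}\n\n{body}")
    return "\n\n".join(parts) + "\n"


# A literature-only query: the gene and variant blocks are simply empty.
print(build_report(literature="Placeholder literature summary."))
```

Substituting a visible placeholder for each missing block keeps the section structure stable, so a skipped agent never causes a missing-context error downstream.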
A dedicated testing cell validates pipeline behavior, ensuring:
- Correct routing
- Correct A2A hand-off of upstream outputs
- No missing-context errors
Use the provided A2A Diagnostic Suite to validate end-to-end functionality.
Example test cases:
- `liver disease`
- `BRCA1 gene`
- `my_sample.vcf`
- `analyse sample.vcf`
- `analyse sample.vcf with BRCA1`
This confirms that:
- VariantAnalyst runs only when appropriate
- GeneExpert activates only on gene symbols
- LiteratureExpert always runs
- ChiefScientist receives VA / GENE / LIT blocks correctly
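A table-driven check along these lines could look like the sketch below. It is not the notebook's A2A Diagnostic Suite: `plan_agents` is a small stand-in planner included only so the example runs, `run_diagnostics` is a hypothetical helper, and the expected agent lists are inferred from the routing rules stated in this README rather than taken from recorded test output.

```python
import re


def plan_agents(query):
    """Stand-in planner that follows the routing rules described above."""
    plan = []
    if re.search(r"\.vcf\b", query, re.IGNORECASE):
        plan.append("VariantAnalyst")
    without_vcf = re.sub(r"\S+\.vcf\b", " ", query, flags=re.IGNORECASE)
    if re.search(r"\b[A-Z][A-Z0-9]{1,9}\b", without_vcf):  # naive gene heuristic
        plan.append("GeneExpert")
    plan += ["LiteratureExpert", "ChiefScientist"]
    return plan


# Expected routing per test query, inferred from the stated rules.
CASES = {
    "liver disease": ["LiteratureExpert", "ChiefScientist"],
    "BRCA1 gene": ["GeneExpert", "LiteratureExpert", "ChiefScientist"],
    "my_sample.vcf": ["VariantAnalyst", "LiteratureExpert", "ChiefScientist"],
    "analyse sample.vcf": ["VariantAnalyst", "LiteratureExpert", "ChiefScientist"],
    "analyse sample.vcf with BRCA1": [
        "VariantAnalyst", "GeneExpert", "LiteratureExpert", "ChiefScientist",
    ],
}


def run_diagnostics(planner, cases=CASES):
    """Return (query, expected, actual) for every case the planner gets wrong."""
    failures = []
    for query, expected in cases.items():
        actual = planner(query)
        if actual != expected:
            failures.append((query, expected, actual))
    return failures


failures = run_diagnostics(plan_agents)
print("all routing checks passed" if not failures else failures)
```

Because `run_diagnostics` takes the planner as an argument, the same table can be pointed at a real Supervisor's planning function once one is available.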
| Section | Description |
|---|---|
| Cell 1 – Environment Setup | Loads ADK, tools, keys, and supporting libraries. |
| Cell 2 – Tool Definitions | PubMed, Ensembl, VEP, ClinVar, RAG, etc. |
| Cell 3 – Multi-Agent System | Builds all agents, the Supervisor that actually executes them, the App, and the Runner. |
| Cell 4 – Interactive Mode | Fully conversational genomics assistant. |
| Cell X – A2A Diagnostic Suite | Validates multi-agent routing & pipeline integration. |
Once the notebook shows `SAPIEN AI – INTERACTIVE MODE ACTIVATED`, you can type questions such as:
- `common lung disease`
- `analyse mydata.vcf`
- `TP53 function`
- `final report`

The Supervisor routes each question to one or more of:
- VCF analysis
- Gene metadata
- Literature synthesis
- or general biomedical reasoning

This system is for education and research only. It is not a medical device and should not be used for clinical decisions.
Variant annotations rely on external databases and may not be fully complete or up to date.
- Ensembl REST API
- ClinVar variation services
- PubMed & Semantic Scholar
- VEP (Variant Effect Predictor)
- Google ADK (Agent Development Kit)
This project is released for educational and research use under the Kaggle Agents Intensive rules.