Python toolkit for genomic data analysis and bioinformatics

amritasule/python-for-genomics

Python for Genomics

A comprehensive Python toolkit for genomic data analysis and bioinformatics, created as part of the Coursera "Python for Genomic Data Science" course.

📚 Overview

This repository provides everything you need to learn Python for genomics:

  • Tutorials - Learn Python fundamentals with genomics examples
  • Modules - Reusable tools for DNA/RNA analysis
  • Examples - Complete genomics workflows
  • Practice - Exercises to test your skills
  • Quick Reference - Fast lookup cheat sheet

🗂️ Repository Structure

python-for-genomics/
├── tutorials/            # 📗 Python fundamentals (START HERE!)
│   ├── 01_strings_and_dna.py
│   ├── 02_lists_and_sequences.py
│   ├── 03_dictionaries_and_codons.py
│   ├── 04_conditionals_in_genomics.py
│   ├── 05_loops_and_iteration.py
│   └── 06_file_io_genomics.py
├── modules/              # Core reusable modules
│   ├── dna_tools.py
│   ├── sequence_analysis.py
│   └── file_parsers.py
├── examples/             # Complete genomics workflows
│   ├── 00_basic_operations.py
│   ├── 01_my_first_analysis.py
│   ├── 02_interactive_analyzer.py
│   ├── 03_compare_sequences.py
│   ├── 04_gc_content_analysis.py
│   ├── 05_orf_finder.py
│   └── 06_fasta_file_operations.py
├── practice/             # 💪 Test your skills
│   ├── exercises.py
│   └── solutions.py
├── data/                 # Sample data files
├── QUICK_REFERENCE.md    # 📖 Cheat sheet
└── README.md

🚀 Getting Started

For Complete Beginners

Start with the tutorials! They teach Python fundamentals using genomics examples.

cd tutorials
python3 01_strings_and_dna.py

Work through tutorials 01-06 in order. Each tutorial is a complete, runnable script with explanations.

For Those Who Know Python

Jump straight to the examples to see complete genomics workflows:

cd examples
python3 00_basic_operations.py

📚 Learning Path

Step 1: Fundamentals (tutorials/)

Learn Python basics with genomics context:

  1. Strings and DNA - String operations for sequences
  2. Lists - Working with multiple sequences
  3. Dictionaries - Genetic code and mappings
  4. Conditionals - Sequence validation
  5. Loops - Iterating through data
  6. File I/O - Reading and writing FASTA files

Each tutorial includes:

  • Clear explanations
  • Code examples
  • Genomics applications
  • Try-it-yourself exercises

Step 2: Examples (examples/)

See complete workflows in action:

📗 Beginner

  • 00: Basic DNA operations
  • 01: Your first analysis
  • 02: Interactive analyzer
  • 03: Compare sequences

📙 Intermediate

  • 04: GC content analysis
  • 05: ORF finding

📕 Advanced

  • 06: Complete FASTA operations

Step 3: Practice (practice/)

Test your skills with exercises:

  • 12 exercises covering all concepts
  • Solutions provided
  • Real genomics problems

Step 4: Reference (QUICK_REFERENCE.md)

Fast lookup for Python syntax and common patterns.

💻 Installation

git clone https://github.com/amritasule/python-for-genomics.git
cd python-for-genomics

No external dependencies! The code uses only the Python standard library.

📖 Module Documentation

dna_tools.py - Core DNA Functions

| Function | Description |
| --- | --- |
| validate_dna(seq) | Check if valid DNA |
| gc_content(seq) | Calculate GC percentage |
| complement(seq) | Get DNA complement |
| reverse_complement(seq) | Get reverse complement |
| transcribe(dna) | Convert DNA to RNA |
| translate(dna) | Translate to protein |
| count_nucleotides(seq) | Count each base |
| has_start_codon(seq) | Check for ATG |
| has_stop_codon(seq) | Check for stop codons |
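
To make the table more concrete, here is a minimal sketch of how a few of these functions could be written with only the standard library. The function names follow the table, but the bodies are assumptions for illustration and may differ from what modules/dna_tools.py actually does (for example, in how lowercase input or ambiguous bases such as N are handled).

```python
# Illustrative stand-ins only; not the repository's actual implementations.

VALID_BASES = set("ACGT")
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def validate_dna(seq):
    """Return True if seq is non-empty and contains only A, C, G, T."""
    seq = seq.upper()
    return bool(seq) and set(seq) <= VALID_BASES


def gc_content(seq):
    """Return the percentage of G and C bases (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def complement(seq):
    """Return the complementary strand, read in the same direction."""
    return seq.upper().translate(COMPLEMENT_TABLE)


def reverse_complement(seq):
    """Return the complementary strand, reversed so it reads 5' to 3'."""
    return complement(seq)[::-1]


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    seq = "ATGCGCTAGGGTAA"
    print(validate_dna(seq), gc_content(seq), reverse_complement(seq))
```

Using str.translate for the complement avoids a per-base loop, which keeps the sketch short and reasonably fast for long sequences.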

sequence_analysis.py - Advanced Analysis

| Function | Description |
| --- | --- |
| find_motif(seq, motif) | Find pattern occurrences |
| find_orfs(seq) | Find open reading frames |
| calculate_melting_temp(seq) | Calculate Tm |
| hamming_distance(seq1, seq2) | Calculate differences |
| gc_content_window(seq, size) | Sliding window GC |
| find_repeats(seq, min_len) | Find repeated sequences |
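
As a rough illustration of what these analyses involve, the sketch below implements three of them in plain Python, plus one possible melting-temperature estimate. Return types, 0-based positions, and the choice of the Wallace rule (Tm = 2(A+T) + 4(G+C), a rule of thumb for short oligos) are all assumptions; the hypothetical name wallace_melting_temp is used because the formula behind calculate_melting_temp in modules/sequence_analysis.py is not stated here.

```python
# Illustrative stand-ins only; not the repository's actual implementations.


def find_motif(seq, motif):
    """Return 0-based start positions of every match, including overlaps."""
    if not motif:
        return []
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping matches are also found.
        start = seq.find(motif, start + 1)
    return positions


def hamming_distance(seq1, seq2):
    """Count mismatched positions between two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def gc_content_window(seq, size):
    """Return the GC percentage of each window of `size` bases, step 1."""
    seq = seq.upper()
    values = []
    for i in range(len(seq) - size + 1):
        window = seq[i:i + size]
        values.append(100.0 * (window.count("G") + window.count("C")) / size)
    return values


def wallace_melting_temp(seq):
    """Estimate Tm in degrees Celsius with the Wallace rule (short oligos)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    print(find_motif("ATATAT", "ATA"))
    print(hamming_distance("GATTACA", "GACTATA"))
    print(gc_content_window("GGCCAATT", 4))
```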

file_parsers.py - File I/O

| Function | Description |
| --- | --- |
| read_fasta(filename) | Read FASTA file |
| write_fasta(seqs, filename) | Write FASTA file |
| read_fastq(filename) | Read FASTQ file |
| write_fastq(records, filename) | Write FASTQ file |
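
For readers new to the format, the following sketch shows one way FASTA reading and writing could work. It assumes records are kept in a dict mapping header text (without the leading ">") to sequence, which matches how the Quick Start example below iterates with .items(). The parse_fasta helper and the width parameter are hypothetical additions for illustration and are not part of the documented file_parsers API.

```python
# Illustrative stand-ins only; not the repository's actual implementations.


def parse_fasta(lines):
    """Build {header: sequence} from an iterable of FASTA-formatted lines."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def read_fasta(filename):
    """Read a FASTA file into a {header: sequence} dict."""
    with open(filename) as handle:
        return parse_fasta(handle)


def write_fasta(seqs, filename, width=60):
    """Write {header: sequence} to a FASTA file, wrapping lines at `width`."""
    with open(filename, "w") as handle:
        for header, seq in seqs.items():
            handle.write(f">{header}\n")
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    demo = [">seq1 example", "ATGC", "GGTA", ">seq2", "TTAA"]
    print(parse_fasta(demo))
```

Separating parse_fasta from read_fasta means the parsing logic can be exercised on a list of strings without touching the file system.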

🎯 Quick Start Examples

Analyze a sequence

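import sys
sys.path.append('modules')  # run from the repository root so 'modules' resolves
import dna_tools

sequence = "ATGCGCTAGGGTAA"
# Report GC percentage and the translated protein for the sequence
print(f"GC Content: {dna_tools.gc_content(sequence):.2f}%")
print(f"Protein: {dna_tools.translate(sequence)}")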

Read FASTA file

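import sys
sys.path.append('modules')  # run from the repository root so 'modules' resolves
import file_parsers

# read_fasta is used here as a mapping of header -> sequence
sequences = file_parsers.read_fasta('data/sample_sequences.fasta')
for header, seq in sequences.items():
    print(f"{header}: {len(seq)} bp")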

Find ORFs

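import sys
sys.path.append('modules')  # run from the repository root so 'modules' resolves
import sequence_analysis

# Each result is unpacked as (start, end, ORF sequence)
orfs = sequence_analysis.find_orfs("ATGCGCGCGTAGGGTAA")
for start, end, orf in orfs:
    print(f"ORF: {orf}")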
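
To show what an ORF search does under the hood, here is a minimal sketch that scans the three forward reading frames for an ATG followed by an in-frame stop codon. It returns (start, end, orf) tuples to mirror the loop above, with 0-based start and exclusive end as an assumption. It is not the repository's implementation, which may also scan the reverse strand or apply a minimum length.

```python
# Illustrative sketch only; sequence_analysis.find_orfs may behave differently.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq):
    """Return (start, end, orf) for each ATG...stop ORF in the forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the current open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Include the stop codon in the reported ORF.
                orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


if __name__ == "__main__":
    for start, end, orf in find_orfs("ATGCGCGCGTAGGGTAA"):
        print(start, end, orf)  # prints: 0 12 ATGCGCGCGTAG
```

Running the same function on the reverse complement of the sequence would cover the remaining three reading frames.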

✨ Features

  • ✅ No external dependencies (pure Python)
  • ✅ Complete tutorials with genomics examples
  • ✅ Runnable code examples
  • ✅ Practice exercises with solutions
  • ✅ FASTA/FASTQ file support
  • ✅ Complete genetic code table
  • ✅ ORF finding in all reading frames
  • ✅ Quick reference guide
  • ✅ Portfolio-ready quality
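
As an illustration of the genetic code table mentioned above, the sketch below builds the standard 64-codon table from a compact string and uses it to translate DNA. The to_stop parameter and the use of "X" for unrecognized codons are assumptions made for this example, and the repository's dna_tools.translate may format its output differently.

```python
# Illustrative sketch only; not the repository's actual implementation.
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, to_stop=False):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    protein = []
    # Trailing bases that do not form a full codon are ignored.
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGCGCTAGGGTAA"))                # MR*G
    print(translate("ATGCGCTAGGGTAA", to_stop=True))  # MR
```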

🎓 Topics Covered

Python Fundamentals

  • Variables and data types
  • Strings, lists, dictionaries
  • If/elif/else statements
  • For and while loops
  • Functions and modules
  • File I/O operations

Bioinformatics Concepts

  • DNA sequence manipulation
  • GC content calculation
  • Sequence complement
  • Transcription and translation
  • ORF finding
  • FASTA file parsing
  • Sequence comparison
  • Pattern matching

🔬 Use Cases

  • Education - Learn Python and bioinformatics
  • Research - Quick sequence analysis
  • Pipelines - Building blocks for workflows
  • Prototyping - Test ideas before scaling

📝 Examples of What You Can Do

After completing the tutorials, you'll be able to:

βœ“ Analyze DNA sequences (GC content, composition, etc.)
βœ“ Read and write FASTA files
βœ“ Find open reading frames (ORFs)
βœ“ Translate DNA to protein
βœ“ Compare sequences
βœ“ Filter sequences by criteria
βœ“ Search for patterns and motifs
βœ“ Build analysis pipelines
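
As one example of filtering sequences by criteria, the sketch below keeps only records that meet a minimum length and fall inside a GC range. The filter_sequences helper and its parameters are hypothetical and are not part of the repository's modules; a local gc_content is defined so the example runs on its own.

```python
# Hypothetical helper for illustration; not part of the repository's modules.


def gc_content(seq):
    """Return the percentage of G and C bases (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def filter_sequences(sequences, min_length=0, min_gc=0.0, max_gc=100.0):
    """Keep records that are long enough and whose GC% lies in the range."""
    return {
        header: seq
        for header, seq in sequences.items()
        if len(seq) >= min_length and min_gc <= gc_content(seq) <= max_gc
    }


if __name__ == "__main__":
    records = {"a": "GGCC", "b": "ATAT", "c": "GC"}
    # Keeps only "a": "b" has 0% GC and "c" is shorter than 4 bases.
    print(filter_sequences(records, min_length=4, min_gc=50.0))
```

The input dict has the same header-to-sequence shape used in the FASTA Quick Start example, so a filter like this could sit between a read step and a write step in a small pipeline.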

🤝 Contributing

Contributions welcome! Feel free to:

  • Add new features
  • Improve documentation
  • Report bugs
  • Suggest enhancements

📞 Resources

  • tutorials/README.md - Detailed tutorial guide
  • practice/README.md - Exercise instructions
  • QUICK_REFERENCE.md - Python syntax cheat sheet
  • examples/ - Working code examples

🙏 Acknowledgments

Created as part of the Coursera course "Python for Genomic Data Science" offered by Johns Hopkins University.

📝 License

MIT License - feel free to use this code for learning and research.


Ready to start? Head to tutorials/ and begin with 01_strings_and_dna.py!

Happy coding! 🧬
