ngs-course.github.io/Course_Materials/alignment/tutorial/example.html at master · ngs-course/ngs-course.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <meta name="author" content="DNA and RNA-seq NGS alignment" />
  <title>NGS data analysis course</title>
  <style type="text/css">code{white-space: pre;}</style>
  <link rel="stylesheet" href="../../../Commons/css_template_for_examples.css" type="text/css" />
</head>
<body>
<div id="header">
<h1 class="title"><a href="http://ngs-course.github.io/">NGS data analysis course</a></h1>
<h2 class="author"><strong>DNA and RNA-seq NGS alignment</strong></h2>
<h3 class="date"><em>(updated 22-02-2015)</em></h3>
</div>
<!-- COMMON LINKS HERE -->

<h1 id="preliminaries">Preliminaries</h1>
<p>In this hands-on will learn how to align DNA and RNA-seq data with most widely used software today. Building a whole genome index requires a lot of RAM memory and almost one hour in a typical workstation, for this reason <strong>in this tutorial we will work with chromosome 21</strong> to speed up the exercises. The same steps would be done for a whole genome alignment. Two different datasets, high and low quality have been simulated for DNA, high quality contains 0.1% of mutations and low quality contains 1%. For RNA-seq a 100bp and 150bp datasets have been simulated.</p>
<h3 id="ngs-aligners-used">NGS aligners used:</h3>
<ul>
<li><a href="http://bio-bwa.sourceforge.net/" title="BWA">BWA</a> : BWA is a software package for mapping <strong>DNA</strong> low-divergent sequences against a large reference genome, such as the human genome. The new project repository is available at <a href="https://github.com/lh3/bwa">GitHub BWA</a></li>
<li><a href="https://github.com/opencb/hpg-aligner/wiki/" title="HPG Aligner">HPG Aligner</a> : HPG Aligner is a new NGS aligner for mapping both <strong>DNA Genomic</strong> and <strong>RNA-seq</strong> data against a large reference genome. It’s has been designed for having a high sensitivity and performance.</li>
<li><a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml" title="Bowtie2">Bowtie2</a> : <em>Bowtie 2</em> is an ultrafast and memory-efficient tool for aligning <strong>DNA</strong> sequencing reads to long reference sequences.</li>
<li><a href="http://ccb.jhu.edu/software/tophat/index.shtml" title="TopHat2">TopHat2</a> : <em>TopHat</em> is a fast splice junction mapper for RNA-Seq reads. It aligns <strong>RNA-Seq</strong> reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.</li>
<li><a href="https://code.google.com/p/rna-star/" title="STAR">STAR</a> : <em>STAR</em> aligns <strong>RNA-seq</strong> reads to a reference genome using uncompressed suffix arrays.</li>
</ul>
<h3 id="other-software-used-in-this-hands-on">Other software used in this hands-on:</h3>
<ul>
<li><a href="http://www.htslib.org/" title="SAMtools">SAMTools</a> : SAM Tools <strong>provide various utilities</strong> for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.</li>
<li><a href="http://sourceforge.net/apps/mediawiki/dnaa/index.php?title=Whole_Genome_Simulation" title="dwgsim">dwgsim</a> (optional): dwgsim can perform whole <strong>genome simulation</strong>.</li>
<li><a href="http://www.cbil.upenn.edu/BEERS/" title="BEERS">BEERS</a> (optional): BEERS is a <strong>simulation engine</strong> for generating <strong>RNA-Seq</strong> data.</li>
</ul>
<h3 id="file-formats-explored">File formats explored:</h3>
<ul>
<li><a href="http://samtools.sourceforge.net/SAMv1.pdf">SAM</a>: Sequence alignment format, plain text.</li>
<li><a href="http://www.broadinstitute.org/igv/bam">BAM</a>: Binary and compressed version of SAM</li>
</ul>
<h3 id="data-used-in-this-practical">Data used in this practical</h3>
<p>Create a <code>data</code> folder in your <strong>working directory</strong> and download the <strong>reference genome sequence</strong> to be used (human chromosome 21) and <em>simulated datasets</em> from <strong>Dropbox</strong> <a href="https://www.dropbox.com/sh/4qkqch7gyt888h7/AABD_i9ShwryfAqGeJ0yqqF3a">data</a>. For the rest of this tutorial the <strong>working directory</strong> will be <strong>cambridge_mda15</strong> and all the <strong>paths</strong> will be relative to that working directory:</p>
<pre><code>cd cambridge_mda15
mkdir data</code></pre>
<h5 id="download-reference-genome-from-ensembl">Download reference genome from <a href="http://www.ensembl.org/index.html" title="Ensembl">Ensembl</a></h5>
<p>Working with NGS data requires a high-end workstations and time for building the reference genome indexes and alignment. During this tutorial we will work only with chromosome 21 to speed up the runtimes. You can download it from <strong>Dropbox</strong> <a href="https://www.dropbox.com/sh/4qkqch7gyt888h7/AABD_i9ShwryfAqGeJ0yqqF3a">data</a> or from the <em>Download</em> link at the top of <a href="http://www.ensembl.org/index.html" title="Ensembl">Ensembl</a> website and then to <em>Download data via FTP</em>, you get it in only one step by going to:</p>
<p><a href="http://grch37.ensembl.org/info/data/ftp/index.html">Ensembl GRCh37 http://grch37.ensembl.org/info/data/ftp/index.html</a></p>
<p><a href="http://www.ensembl.org/info/data/ftp/index.html">Ensembl GRCh38 http://www.ensembl.org/info/data/ftp/index.html</a></p>
<p>You should see a species table with a Human (<em>Homo sapiens</em>) row and a <em>DNA (FASTA)</em> column or click at <a href="ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/">ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/</a>, download the chromosome 21 file (<em>Homo_sapiens.GRCh37.75.dna.chromosome.21.fa.gz</em>) and move it from your browser download folder to your <code>data</code> folder:</p>
<pre><code>mv Homo_sapiens.GRCh37.75.dna.chromosome.21.fa.gz path_to_local_data</code></pre>
<p><strong>NOTE:</strong> For working with the whole reference genome the file to be downloaded is <em>Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz</em></p>
<h5 id="copy-simulated-datasets">Copy simulated datasets</h5>
<p>For this hands-on we are going to use small DNA and RNA-seq datasets simulated from chromosome 21. Data has been already simulated using <em>dwgsim</em> software from SAMtools for DNA and <em>BEERS</em> for RNA-seq. You can copy from the shared resources from <strong>Dropbox</strong> <a href="https://www.dropbox.com/sh/4qkqch7gyt888h7/AABD_i9ShwryfAqGeJ0yqqF3a">data</a> into your <code>data</code> directory for this practical session. Preparing the data directory:</p>
<pre><code>cp path_to_course_materials/alignment/* your_local_data/</code></pre>
<p>Notice that the name of the folders and files describe the dataset, ie. <code>dna_chr21_100_hq</code> stands for: <em>DNA</em> type of data from <em>chromosome 21</em> with <em>100</em>nt read lengths of <em>high</em> quality. Where <em>hq</em> quality means 0.1% mutations and <em>lq</em> quality 1% mutations. Take a few minutes to understand the different files.</p>
<p><strong>NOTE:</strong> If you want to learn how to simulate DNA and RNA-seq for other conditions go down to the end of this tutorial.</p>
<h5 id="real-datasets">Real datasets</h5>
<p>For those with access to high-end nodes clusters you can index and simulated whole genome datasets or download real datasets from this sources: - <a href="http://www.1000genomes.org/">1000genomes project</a> - <a href="https://www.ebi.ac.uk/ena/">European Nucleotide Archive (ENA)</a> - <a href="http://www.ncbi.nlm.nih.gov/sra">Sequence Read Archive (SRA)</a></p>
<h3 id="installing-samtools-optional-already-installed">Installing SAMtools (Optional, already installed)</h3>
<p>Check that it is not installed by executing</p>
<pre><code>samtools</code></pre>
<p>A list of commands should be printed. If not then proceed with the installation.</p>
<p>Download <a href="http://www.htslib.org/" title="SAMtools">SAMtools</a> from <em>SF Download Page</em> link and move to the working directory, then uncompress it.</p>
<pre><code>mv samtools-1.2.tar.bz2 working_directory
cd working_directory
tar -jxvf samtools-1.2.tar.bz2
cd samtools-1.2
make</code></pre>
<p>Check that is correct by executing it with no arguments, the different commands available should be printed. You can also copy it to your <code>bin</code> folder in your home directory, if bin folder exist, to make it available to the PATH:</p>
<pre><code>samtools
cp samtools ~/bin</code></pre>
<h1 id="exercise-1-ngs-genomic-dna-aligment">Exercise 1: NGS Genomic DNA aligment</h1>
<p>In this exercise we’ll learn how to download, install, build the reference genome index and align in single-end and paired-end mode with the two most widely DNA aligners: <em>BWA</em> and <em>Bowtie2</em>. But first, create an <code>aligners</code> folder to store the software, and an <code>results</code> folder to store the alignment results, create those folders in your <em>working directory</em> next to <code>data</code>, you can create both folders by executing:</p>
<pre><code>mkdir aligners
mkdir results</code></pre>
<p>Now go to <code>aligners</code> and <code>results</code> folders and create subfolders for <em>bwa</em> and <em>bowtie</em> to store the indexes and alignments results:</p>
<pre><code>cd aligners
mkdir bwa hpg-aligner bowtie</code></pre>
<p>and</p>
<pre><code>cd results
mkdir bwa hpg-aligner bowtie</code></pre>
<p><strong>NOTE:</strong> Now your working directory must contain 3 folders: data (with the reference genome of chrom. 21 and simulated datasets), aligners and results. Your working directory should be similar to this (notice that aligners have not been downloaded), execute ‘tree -L 2’:</p>
<pre><code>.
├── aligners
│   ├── bowtie
│   ├── bwa
│   ├── hpg-aligner
├── results
│   ├── bowtie
│   ├── bwa
│   ├── hpg-aligner
├── data
│   ├── dna_chr21_100_hq_read1.fastq
│   ├── dna_chr21_100_hq_read2.fastq
│   ├── dna_chr21_100_lq_read1.fastq
│   ├── dna_chr21_100_lq_read2.fastq
│   ├── Homo_sapiens_cDNAs_chr21.fa
│   ├── Homo_sapiens.GRCh37.75.dna.chromosome.21.fa
│   ├── rna_chr21_100_hq_read1.fastq
│   └── rna_chr21_100_hq_read2.fastq
│   ├── rna_chr21_150_lq_read1.fastq
│   └── rna_chr21_150_lq_read2.fastq

</code></pre>
<h3 id="bwa">BWA</h3>
<p><a href="http://bio-bwa.sourceforge.net/" title="BWA">BWA</a> is probably the most used aligner for DNA. AS the documentation states it consists of three different algorithms: <em>BWA</em>, <em>BWA-SW</em> and <em>BWA-MEM</em>. The first algorithm, which is the oldest, is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA for 70-100bp Illumina reads.</p>
<p>All these three algorithms come in the same binary so only one download and installation is needed.</p>
<h5 id="download-and-install-optional-already-installed">Download and install (Optional, already installed)</h5>
<p>First check that bwa is not currently installed by executing:</p>
<pre><code>bwa</code></pre>
<p>A list of commands will be printed if already installed. If not you can continue with the installation.</p>
<p>You can click on <code>SF download page</code> link in the <a href="http://bio-bwa.sourceforge.net/" title="BWA">BWA</a> page or click directly to:</p>
<p><a href="http://sourceforge.net/projects/bio-bwa/files">http://sourceforge.net/projects/bio-bwa/files</a></p>
<p>Click in the last version of BWA and wait for a few seconds, as the time of this tutorial last version is <strong>bwa-0.7.10.tar.bz2</strong>, the download will start. When downloaded go to your browser download folder and move it to aligners folder, uncompress it and compile it:</p>
<pre><code>mv bwa-0.7.12.tar.gz working_directory/aligners/bwa
tar -jxvf bwa-0.7.12.tar.gz
cd bwa-0.7.12
make
cp bwa ~/bin</code></pre>
<p>You can check that everything is allright by executing:</p>
<pre><code>bwa</code></pre>
<p>Some information about the software and commands should be listed.</p>
<h5 id="build-the-index">Build the index</h5>
<p>Create a folder inside <code>aligners/bwa</code> folder called <code>index</code> to store the BWA index and copy the reference genome into it:</p>
<pre><code>mkdir index
cp  data/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa  aligners/bwa/index/   (this path can be different!)</code></pre>
<p>Now you can create the index by executing:</p>
<pre><code>bwa index aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa</code></pre>
<p>Some files will be created in the <code>index</code> folder, those files constitute the index that BWA uses.</p>
<p><strong>NOTE:</strong> The index must created only once, it will be used for all the different alignments with BWA.</p>
<h5 id="aligning-with-new-bwa-mem-in-both-single-end-se-and-paired-end-pe-modes">Aligning with new BWA-MEM in both single-end (SE) and paired-end (PE) modes</h5>
<p>BWA-MEM is the recommended algorithm to use now. You can check the options by executing:</p>
<pre><code>bwa mem</code></pre>
<p>To align <strong>SE</strong> with BWA-MEM execute:</p>
<pre><code>bwa mem -t 4 -R &quot;@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1&quot; aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/dna_chr21_100_hq_read1.fastq &gt; results/bwa/dna_chr21_100_hq_se.sam</code></pre>
<p>Now you can use SAMtools to create the BAM file from the <em>alignment/bwa</em> folder:</p>
<pre><code>cd results/bwa
samtools view -b dna_chr21_100_hq_se.sam -o dna_chr21_100_hq_se.bam</code></pre>
<p>To align <strong>PE</strong> with BWA-MEM just execute the same command line with the two FASTQ files:</p>
<pre><code>bwa mem -t 4 -R &quot;@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1&quot; aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/dna_chr21_100_hq_read1.fastq data/dna_chr21_100_hq_read2.fastq &gt; results/bwa/dna_chr21_100_hq_pe.sam</code></pre>
<p>Now you can use SAMtools to create the BAM file from the <em>alignment/bwa</em> folder:</p>
<pre><code>cd results/bwa
samtools view -b dna_chr21_100_hq_pe.sam -o dna_chr21_100_hq_pe.bam</code></pre>
<p>Now you can do the same for the <strong>low</strong> quality datasets.</p>
<h5 id="aligning-with-old-bwa-algorithm-two-command-lines-aln-and-samsesampe-in-se-and-pe-modes-optional-exercise">Aligning with old BWA algorithm, two command lines: ALN and SAMSE/SAMPE in SE and PE modes (Optional exercise)</h5>
<p>Now we are going to align SE and PE the <strong>high</strong> quality dataset. Single-end alignment with BWA requires 2 executions. The first uses <code>aln</code> command and takes the <code>fastq</code> file and creates a <code>sai</code> file; the second execution uses <code>samse</code> and the <code>sai</code> file and create the <code>sam</code> file. Results are stored in <code>results</code> folder:</p>
<pre><code>bwa aln aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa -t 4 data/dna_chr21_100_hq_read1.fastq -f results/bwa/dna_chr21_100_hq_se.sai
bwa samse aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa results/bwa/dna_chr21_100_hq_se.sai data/dna_chr21_100_hq_read1.fastq -f results/bwa/dna_chr21_100_hq_se.sam</code></pre>
<p>For paired-end alignments with BWA 3 executions are needed: 2 for <code>aln</code> command and 1 for <code>sampe</code> command:</p>
<pre><code>bwa aln aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa -t 4 data/dna_chr21_100_hq_read1.fastq -f results/bwa/dna_chr21_100_hq_pe1.sai
bwa aln aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa -t 4 data/dna_chr21_100_hq_read2.fastq -f results/bwa/dna_chr21_100_hq_pe2.sai
bwa sampe aligners/bwa/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa results/bwa/dna_chr21_100_hq_pe1.sai results/bwa/dna_chr21_100_hq_pe2.sai data/dna_chr21_100_hq_read1.fastq data/dna_chr21_100_hq_read2.fastq -f results/bwa/dna_chr21_100_hq_pe.sam</code></pre>
<p>Now you can use SAMtools to create the BAM file from the <em>results/bwa</em> folder:</p>
<pre><code>cd results/bwa
samtools view -b dna_chr21_100_hq_se.sam -o dna_chr21_100_hq_se.bam
samtools view -b dna_chr21_100_hq_pe.sam -o dna_chr21_100_hq_pe.bam</code></pre>
<p>Now you can do the same for the <strong>low</strong> quality datasets.</p>
<h3 id="hpg-aligner">HPG Aligner</h3>
<p><a href="https://github.com/opencb/hpg-aligner/wiki/" title="HPG Aligner">HPG Aligner</a> is an ultrafast and high sensitivity tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of any size. HPG Aligner 2.0 indexes the genome using a Suffix Arrays. HPG Aligner 2 supports gapped, local, and paired-end alignment modes, together with INDEL realignment and recalibration.</p>
<h5 id="build-the-index-1">Build the index</h5>
<p>Create a folder inside HPG Aligner program called <code>index</code> to store the HPG Aligner index and copy the reference genome into it:</p>
<pre><code>cd hpg-aligner   (_inside aligners folder_)
mkdir index
cp data/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa aligners/bwa/index/</code></pre>
<p>Now you can create the index by executing:</p>
<pre><code>hpg-aligner build-sa-index -g aligners/hpg-aligner/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa -i aligners/hpg-aligner/index/</code></pre>
<p>Some files will be created in the <code>index</code> folder, those files constitute the index that HPG Aligner uses.</p>
<p><strong>NOTE:</strong> The index must created only once, it will be used for all the different alignments with HPG Aligner.</p>
<h5 id="aligning-in-se-and-pe-modes">Aligning in SE and PE modes</h5>
<p>Mapping <strong>SE</strong> with HPG Aligner requires only 1 execution, for aligning the <strong>high</strong> in SE mode execute:</p>
<pre><code>hpg-aligner dna --cpu-threads 4 -i aligners/hpg-aligner/index/ -f data/dna_chr21_100_hq_read1.fastq -o results/hpg-aligner/ --prefix dna_chr21_100_hq_se</code></pre>
<p>And create the BAM file using SAMtools, you could create the BAM file adding <em>–bam-format</em> to the previous command line:</p>
<pre><code>cd results/hpg-aligner
samtools view -b dna_chr21_100_hq_se_out.sam -o dna_chr21_100_hq_se.bam</code></pre>
<p>Mapping in <strong>PE</strong> also requires only one execution:</p>
<pre><code>hpg-aligner dna --cpu-threads 4 -i aligners/hpg-aligner/index/ -f data/dna_chr21_100_hq_read1.fastq -j data/dna_chr21_100_hq_read2.fastq -o results/hpg-aligner --prefix dna_chr21_100_hq_pe</code></pre>
<p>And create the BAM file using SAMtools:</p>
<pre><code>cd results/hpg-aligner
samtools view -b dna_chr21_100_hq_pe_out.sam -o dna_chr21_100_hq_pe.bam</code></pre>
<p>Repeat the same steps for the <strong>low</strong> quality dataset.</p>
<h3 id="bowtie2">Bowtie2</h3>
<p><a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml" title="Bowtie2">Bowtie2</a> as documentation states is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to few 100s. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.</p>
<h5 id="download-and-install-optional-already-installed-1">Download and install (Optional, already installed)</h5>
<p>First check that bwa is not currently installed by executing:</p>
<pre><code>bowtie2</code></pre>
<p>A list of commands will be printed if already installed. If not you can continue with the installation.</p>
<p>From <a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml" title="Bowtie2">Bowtie2</a> go to <code>Latest Release</code> and download the program or go directly to:</p>
<p><a href="http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.3/">http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.3/</a></p>
<p>Click in the Linux version of Bowtie2 and wait for a few seconds, as the time of this tutorial last version is <strong>bowtie2-2.2.3-linux-x86_64.zip</strong>, the download will start. When downloaded go to your browser download folder and move it to aligners folder and uncompress it. No need to compile if you downloaded the Linux version:</p>
<pre><code>mv bowtie2-2.2.3-linux-x86_64.zip working_directory/aligners/bowtie
unzip bowtie2-2.2.3-linux-x86_64.zip
cd bowtie2-2.2.3</code></pre>
<p>You can check that everything is allright by executing:</p>
<pre><code>bowtie2</code></pre>
<p>Big information about the software and commands should be listed.</p>
<h5 id="build-the-index-2">Build the index</h5>
<p>Create a folder inside Bowtie2 program called <code>index</code> to store the Bowtie2 index and copy the reference genome into it:</p>
<pre><code>cd bowtie   (_inside aligners folder_)
mkdir index
cp data/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa aligners/bowtie/index/</code></pre>
<p>Now you can create the index by executing:</p>
<pre><code>bowtie2-build aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa</code></pre>
<p>Some files will be created in the <code>index</code> folder, those files constitute the index that Bowtie2 uses.</p>
<p><strong>NOTE:</strong> The index must created only once, it will be used for all the different alignments with Bowtie2.</p>
<h5 id="aligning-in-se-and-pe-modes-1">Aligning in SE and PE modes</h5>
<p>Mapping <strong>SE</strong> with Bowtie2 requires only 1 execution, for aligning the <strong>high</strong> in SE mode execute:</p>
<pre><code>bowtie2 -q -p 4 -x aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa -U data/dna_chr21_100_hq_read1.fastq -S results/bowtie/dna_chr21_100_hq_se.sam</code></pre>
<p>And create the BAM file using SAMtools;</p>
<pre><code>cd results/bowtie
samtools view -b dna_chr21_100_hq_se.sam -o dna_chr21_100_hq_se.bam</code></pre>
<p>Mapping in <strong>PE</strong> also requires only one execution:</p>
<pre><code>bowtie2 -q -p 4 -x aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa -1 data/dna_chr21_100_hq_read1.fastq -2 data/dna_chr21_100_hq_read2.fastq -S results/bowtie/dna_chr21_100_hq_pe.sam</code></pre>
<p>And create the BAM file using SAMtools;</p>
<pre><code>cd results/bowtie
samtools view -b dna_chr21_100_hq_pe.sam -o dna_chr21_100_hq_pe.bam</code></pre>
<p>Repeat the same steps for the <strong>low</strong> quality dataset.</p>
<h3 id="more-exercises">More exercises</h3>
<ul>
<li>Try to simulate datasets with longer reads and more mutations to study which aligner behaves better</li>
<li>Test the aligner sensitivity to INDELS</li>
<li>Try BWA-MEM algorithm and compare sensitivity. The same index is valid, only one execution for the SAM file <code>./bwa mem index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa ../../data/dna_chr21_100_low/dna_chr21_100_low.bwa.read1.fastq</code></li>
</ul>
<h1 id="exercise-2-ngs-rna-seq-aligment">Exercise 2: NGS RNA-seq aligment</h1>
<p>In this exercise we’ll learn how to download, install, build the reference genome index and align in single-end and paired-end mode with the two most widely RNA-seq aligner: <em>TopHat2</em>. TopHat2 uses Bowtie2 as an aligner.</p>
<p><strong>NOTE:</strong> Two others commonly used RNA-seq aligners are <a href="https://code.google.com/p/rna-star/" title="STAR">STAR</a> and <a href="http://www.netlab.uky.edu/p/bioinfo/MapSplice2" title="MapSplice2">MapSplice2</a>, no guided exercises have been documented in this tutorials, but users are encouraged to follow the instructions of their web sites.</p>
<p>Go to <code>results</code> folder and create to folders for <em>tophat</em> to store alignments results:</p>
<pre><code>cd results
mkdir tophat</code></pre>
<p><strong>NOTE:</strong> No index is needed for TopHat as it uses Bowtie2 for alignment.</p>
<h3 id="tophat2">TopHat2</h3>
<p><a href="http://ccb.jhu.edu/software/tophat/index.shtml" title="TopHat2">TopHat2</a> states to be a <em>fast</em> splice junction mapper for RNA-Seq reads, which is not always completrly true. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.</p>
<h5 id="download-and-install-optional-already-installed-2">Download and install (Optional, already installed)</h5>
<p>First check that bwa is not currently installed by executing:</p>
<pre><code>tophat2</code></pre>
<p>A list of commands will be printed if already installed. If not you can continue with the installation.</p>
<p>From <a href="http://ccb.jhu.edu/software/tophat/index.shtml" title="TopHat2">TopHat2</a> go to <code>Releases</code> and download the Linux program by clicking in <em>Linux x86_64 binary</em> link.</p>
<p>As the time of this tutorial last version is <strong>tophat-2.0.10.Linux_x86_64.tar.gz</strong>, the download will start. When downloaded go to your browser download folder and move it to aligners folder and uncompress it. No need to compile if you downloaded the Linux version:</p>
<pre><code>mv tophat-2.0.10.Linux_x86_64.tar.gz working_directory/aligners/tophat
tar -zxvf tophat-2.0.10.Linux_x86_64.tar.gz
cd tophat-2.0.10.Linux_x86_64</code></pre>
<p>You can check that everything is allright by executing:</p>
<pre><code>tophat2</code></pre>
<p>Big information about the software and commands should be listed.</p>
<p><strong>NOTE:</strong> TopHat uses Bowtie as the read aligner. You can use either Bowtie 2 (the default) or Bowtie (–bowtie1) and you will need the following Bowtie 2 (or Bowtie) programs in your PATH. Index must be created with Bowtie not TopHat. So, copy Bowtie2 into ~/bin:</p>
<pre><code>cd bowtie2   (bowtie 2.2 does not work)
cp bowtie* ~/bin</code></pre>
<h5 id="aligning-in-se-and-pe-modes-2">Aligning in SE and PE modes</h5>
<p>To align in SE mode:</p>
<pre><code>tophat2 -o results/tophat/rna_chr21_100_hq_se aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/rna_chr21_100_hq_read1.fastq</code></pre>
<p>And for PE:</p>
<pre><code>tophat2 -o results/tophat/rna_chr21_100_hq_pe/ aligners/bowtie/index/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa data/rna_chr21_100_hq_read1.fastq data/rna_chr21_100_hq_read2.fastq</code></pre>
<p>Now align the rna dataset of 150bp with low quality and compare stats.</p>
<h3 id="star-and-mapsplice2">STAR and MapSplice2</h3>
<p><a href="https://code.google.com/p/rna-star/" title="STAR">STAR</a> and <a href="http://www.netlab.uky.edu/p/bioinfo/MapSplice2" title="MapSplice2">MapSplice2</a> are two others interesting RNA-seq aligners. <a href="https://code.google.com/p/rna-star/" title="STAR">STAR</a> offer a great performance while still have good sensitivity. <a href="http://www.netlab.uky.edu/p/bioinfo/MapSplice2" title="MapSplice2">MapSplice2</a> shows usually a better sensitivity but is several times slower.</p>
<h5 id="star-installation-optional-already-installed">STAR installation (Optional, already installed)</h5>
<p>STAR comes compiled for Linux, you only have to download the <em>tarball</em>:</p>
<pre><code>tar -zxvf STAR_2.3.0e.Linux_x86_64_static.tgz</code></pre>
<p>Read the documentation and try to align the simulated dataset.</p>
<h5 id="mapsplice2-installation-optional-already-installed">MapSplice2 installation (Optional, already installed)</h5>
<p>MapSplice must be unizpped and compiled:</p>
<pre><code>unzip MapSplice-v2.1.6.zip
cd MapSplice-v2.1.6
make</code></pre>
<p>Read the documentation and try to align the simulated dataset.</p>
<h1 id="exercise-3-simulating-ngs-datasets-optional">Exercise 3: Simulating NGS datasets (Optional)</h1>
<h3 id="dna">DNA</h3>
<p>Download <a href="http://sourceforge.net/apps/mediawiki/dnaa/index.php?title=Whole_Genome_Simulation" title="dwgsim">dwgsim</a> from http://sourceforge.net/projects/dnaa/files/ to the <em>working_directory</em> and uncompress it and compile it:</p>
<pre><code>tar -zxvf dwgsim-0.1.10.tar.gz
cd dwgsim-0.1.10
make</code></pre>
<p>Check options by executing:</p>
<pre><code>./dwgsim</code></pre>
<p>Then you can simulate 2 million reads of 150bp with a 2% if mutation executing:</p>
<pre><code>./dwgsim-0.1.11/dwgsim -1 150 -2 150 -y 0 -N 2000000 -r 0.02 ../data/Homo_sapiens.GRCh37.75.dna.chromosome.21.fa ../data/dna_chr21_100_low/dna_chr21_100_verylow</code></pre>
<h3 id="rna-seq">RNA-seq</h3>
<p><a href="http://www.cbil.upenn.edu/BEERS/" title="BEERS">BEERS</a> is a perl-based program, no compilation is needed, just download it from here http://www.cbil.upenn.edu/BEERS and uncompress it:</p>
<pre><code>tar xvf beers.tar</code></pre>
</body>
</html>