You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
echo"##CADD $GENOMEBUILD-$VERSION (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health 2013-2019. All rights reserved.";
Copy file name to clipboardExpand all lines: README.md
+22-28Lines changed: 22 additions & 28 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -24,7 +24,7 @@ Please check our [website for updates and further information](http://cadd.gs.wa
24
24
25
25
## Offline Installation
26
26
27
-
This section describes how users can setup CADD version 1.5 on their own system. Please note that this requires between 100 GB - 1 TB of disc space and at least 12 GB of RAM.
27
+
This section describes how users can setup CADD version 1.6 on their own system. Please note that this requires between 100 GB - 1 TB of disc space and at least 12 GB of RAM.
*Note: You can also install CADD without conda by installing the dependencies otherwise. You can find the list of tools (VEP) and (python) libraries in the [`src/environment.yml`](https://github.com/kircherlab/CADD-scripts/blob/master/src/environment.yml) file (for CADD GRCh38-v1.5 use [`src/environment_v1.5.yml`](https://github.com/kircherlab/CADD-scripts/blob/master/src/environment_v1.5.yml)). In this case, you will also have to disable the line `source activate cadd-env` in `CADD.sh`*
40
-
41
-
*Note2: If you are using an existing conda installation, please make sure it is [a version >=4.4.0](https://github.com/conda/conda/issues/3200).*
43
+
*Note2: If you are using an existing conda installation, please make sure it is [a version >=4.4.0](https://github.com/conda/conda/issues/3200). Make also sure to use snakemake >= 4.0 as some command line parameters are not available in earlier versions. *
42
44
43
45
### Setup
44
46
@@ -58,48 +60,50 @@ This is the easier way of installing CADD, just run:
58
60
./install.sh
59
61
```
60
62
61
-
You first state which parts you want to install (the environment as well as at least one genome build including annotation tracks are neccessary for a quick start) and the script should manage loading and unpacking the neccessary files.
63
+
You first state which parts you want to install (the environments as well as at least one genome build including annotation tracks are neccessary for a quick start) and the script should manage loading and unpacking the neccessary files.
62
64
63
65
#### Manual installation
64
66
65
-
Running CADD depends on three big building blocks (plus the repository containing this README which we assume you already downloaded):
67
+
Running CADD depends on four big building blocks (plus the repository containing this README which we assume you already downloaded):
66
68
69
+
- snakemake
67
70
- dependencies
68
71
- genome annotations
69
72
- prescored variants
70
73
71
74
**Installing dependencies**
72
75
73
-
As stated already in the Prerequisite you can install CADD dependencies without conda, although we heavily recommend doing so. This is because managing the various parts becomes very handy by relying it. To setup the neccessary environment, we only need to run the command:
76
+
As of this version, dependencies have to be installed via conda and snakemake. This is because we are using two different enviroments for python2 and python3.
After the installing process (which will take a few minutes), the CADD conda environment will be loaded (via `source activate cadd-env` automatically in the `CADD.sh` script) and CADD can run without further settings.
83
+
Please note that we installing both conda environments in the CADD subdirectory `envs` via `--conda-prefix envs`. If you do not want this behavior (we do this in order to not install the environments in all active directories you run CADD from), adjust or remove this parameter.
80
84
81
85
**Installing annotations**
82
86
83
87
Both version of CADD (for the different genome builds) rely on a big number of genomic annotations. Depending on which genome build you require you can get them from our website (be careful where you put them as these are really big files and have identical filenames) via:
As those files are about 100 and 200 GB in size, downloads can take long (depending on your internet connection). We recommend to setup the process in the background and using a tool (like `wget -c` mentioned above) that allows you to continue an interrupted download.
93
97
94
-
To make sure you downloaded the files correctly, we recommend downloading md5 hash files from our website (e.g. `wget http://krishna.gs.washington.edu/download/CADD/v1.4/GRCh37/MD5SUMs`) and checking for completness (via `md5sum -c`).
98
+
To make sure you downloaded the files correctly, we recommend downloading md5 hash files from our website (e.g. `wget wget -c http://krishna.gs.washington.edu/download/CADD/v1.6/MD5SUMs`) and checking for completness (via `md5sum -c`).
95
99
96
100
The annotation files are finally put in the folder `data/annotations` and unpacked:
97
101
98
102
```bash
99
103
cd data/annotations
100
-
tar -zxvf annotationsGRCh37.tar.gz
104
+
tar -zxvf annotationsGRCh37_v1.6.tar.gz
101
105
mv GRCh37 GRCh37_v1.4
102
-
tar -zxvf annotationsGRCh38.tar.gz
106
+
tar -zxvf annotationsGRCh38_v1.6.tar.gz
103
107
cd$OLDPWD
104
108
```
105
109
@@ -109,35 +113,25 @@ At this point you are ready to go, but if you want a faster version of CADD, you
109
113
110
114
### Running CADD
111
115
112
-
You run CADD via the script `CADD.sh` which technically only requieres an either vcf or vcf.gz input file as last argument. You can further specify the genome build via `-g`, CADD version via `-v`, request a fully annotated output (`-a` flag) and specify a seperate output file via `-o` (else inputfile name `.tsv.gz` is used). I.e:
116
+
You run CADD via the script `CADD.sh` which technically only requieres an either vcf or vcf.gz input file as last argument. You can further specify the genome build via `-g`, CADD version via `-v` (deprecated, the new version of the scripts only support v1.6), request a fully annotated output (`-a` flag) and specify a seperate output file via `-o` (else inputfile name `.tsv.gz` is used). I.e:
113
117
114
118
```bash
115
119
./CADD.sh test/input.vcf
116
120
117
-
./CADD.sh -a -g GRCh37 -v v1.4 -o output_inclAnno_GRCh37.tsv.gz test/input.vcf
121
+
./CADD.sh -a -g GRCh37 -o output_inclAnno_GRCh37.tsv.gz test/input.vcf
118
122
```
119
123
120
124
You can test whether your CADD is set up properly by comparing to the example files in the `test` directory.
121
125
122
126
### Update
123
127
124
-
Between versions 1.4 and 1.5, we adjusted the CADD repository slightly. If you used CADD before and obviously do not want to download all v1.4 files again, please proceed as follows:
125
-
126
-
1. update the repository (just overwriting is fine, however this dublicates some moved files so `git pull` is prefered)
127
-
2. rename the annotation (and prescored) folder from `data/annotations/$GENOMEBUILD` to `data/annotations/${GENOMEBUILD}_${VERSION}`
Version 1.6 includes some changes in comparison to v1.5. Next to the obvious switch of the pipeline into a Snakemake workflow which became necessary due to the ongoin issues with `conda activate`, the new models for v1.6 are extended by more specialized annotations for splicing variants, as well as a few minor changes in some other annotations (most prominent: fixed gerp for GRCh38) and changes in consequence categories which make this scripts incompatible with CADD v1.4 and v1.5. If you are still using those version, please use [version 1.5 of this repository](https://github.com/kircherlab/CADD-scripts/archive/CADD1.5.zip).
0 commit comments