Commit 1fc34c4

Various bug fixes and extensions:
* Added --tmp-path option for sfm.
* Fixed /dev/stdin and /dev/stdout for sfm.
* Added a command for merging intermediate metrics files.
* Fixed detection of directory path names in sfm.
1 parent f6a70a4 commit 1fc34c4

12 files changed, +262 -41 lines changed
README.md

Lines changed: 62 additions & 2 deletions
@@ -289,6 +289,33 @@ This option is used to specify the number of levels for quantizing quality score

 This option is used to indicate to use static quantized quality scores to a given number of levels during base quality score recalibration (--bqsr). This list should be of the form "[nr, nr, nr]". The default value is [].

+### --mark-optical-duplicates-intermediate file
+
+This option is used in the context of filtering files created using the elprep split command. It is used internally by
+the elprep sfm command, but can be used when writing your own split/filter/merge scripts.
+
+This option tells elPrep to perform optical duplicate marking and to write the result to an intermediate metrics file.
+The intermediate metrics file generated this way can later be merged with other intermediate metrics files, see the
+merge-optical-duplicates-metrics command.
+
+### --bqsr-tables-only table-file
+
+This option is used in the context of filtering files created using the elprep split command. It is used internally by
+the elprep sfm command, but can be used when writing your own split/filter/merge scripts.
+
+This option tells elPrep to perform base quality score recalibration and to write the result of the recalibration to an
+intermediate table file. This table file will need to be merged with other intermediate recalibration results during the
+application of the base quality score recalibration. See the --bqsr-apply option.
+
+### --bqsr-apply path
+
+This option is used when filtering files created by the elprep split command. It is used internally by the elprep sfm
+command, and can be used when writing your own split/filter/merge scripts.
+
+This option is used for applying base quality score recalibration on an input file. It expects a path parameter that
+refers to a directory that contains intermediate recalibration results for multiple files created using the
+--bqsr-tables-only option.
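For readers writing their own split/filter/merge scripts, the two recalibration passes described above could be sketched roughly as follows. This is an illustration in Go, not elPrep's own code: the helper names, the `.table` suffix, the directory layout, and the idea that the second pass reads the first pass's output are all assumptions, and options a real run would also need (such as the reference and known-sites options) are left out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// tablesOnlyArgs builds a first-pass command line for one split file.
// The ".table" suffix is an assumption made for this sketch.
func tablesOnlyArgs(in, out, tablesDir string) []string {
	name := strings.TrimSuffix(filepath.Base(in), filepath.Ext(in))
	table := filepath.Join(tablesDir, name+".table")
	return []string{"elprep", "filter", in, out, "--bqsr-tables-only", table}
}

// applyArgs builds a second-pass command line that points --bqsr-apply at
// the directory holding all intermediate tables.
func applyArgs(in, out, tablesDir string) []string {
	return []string{"elprep", "filter", in, out, "--bqsr-apply", tablesDir}
}

func main() {
	// Hypothetical split files; a real script would list the split directory.
	splits := []string{"split/chr1.bam", "split/chr2.bam"}

	// Pass 1: every split file contributes one intermediate table.
	for _, s := range splits {
		tmp := filepath.Join("pass1", filepath.Base(s))
		fmt.Println(strings.Join(tablesOnlyArgs(s, tmp, "tables"), " "))
	}
	// Pass 2: each file is recalibrated against all tables together.
	for _, s := range splits {
		tmp := filepath.Join("pass1", filepath.Base(s))
		out := filepath.Join("pass2", filepath.Base(s))
		fmt.Println(strings.Join(applyArgs(tmp, out, "tables"), " "))
	}
}
```

The sketch only prints the command lines. The point is the ordering: all tables must exist before any file is recalibrated with --bqsr-apply.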
+
 ## Sorting Command Options

 ### --sorting-order [keep | unknown | unsorted | queryname | coordinate]
@@ -384,6 +411,7 @@ The elprep split command can be used to split up .sam files into smaller files t
 Splitting the .sam file into smaller files for processing "per chromosome" is useful for reducing the memory pressure as these split files are typically significantly smaller than the input file as a whole. Splitting also makes it possible to parallelize the processing of a single .sam file by distributing the different split files across different processing nodes.

 We provide an sfm command that executes a pipeline while silently using the elprep filter and split/merge tools. It is of course possible to write scripts to combine the filter and split/merge tools yourself.
+We provide a recipe for writing your own split/filter/merge scripts on our github wiki.

 ## Name

@@ -395,8 +423,6 @@ We provide an sfm command that executes a pipeline while silently using the elpr

 elprep sfm input.bam output.bam --mark-duplicates --mark-optical-duplicates output.metrics --sorting-order coordinate --bqsr output.recal --bqsr-reference hg38.elfasta --known-sites dbsnp_138.hg38.elsites

-elprep sfm --mark-duplicates --mark-optical-duplicates output.metrics --sorting-order coordinate --bqsr output.recal --bqsr-reference hg38.elfasta --known-sites dbsnp_138.hg38.elsites
-
 ## Description

 The elprep sfm command is a drop-in replacement for the elprep filter command that minimises the use of RAM. For this, it silently calls the elprep split and merge tools to split up the data "per chromosome" for processing, which requires less RAM than processing a .sam/.bam file as a whole (see Split and Merge tools).
@@ -409,6 +435,10 @@ The elprep sfm command has the same options as the elprep filter command, with t

 This command option sets the format of the split files. By default, elprep uses the same format as the input file for the split files. Changing the intermediate file output type may improve either runtime (.sam) or reduce peak disk usage (.bam).

+### --tmp-path
+
+This command option is used to specify a path where elPrep can store temporary files that are created (and deleted) by the split and merge commands that are silently called by the elprep sfm command. The default path is the folder from where you call elprep sfm.
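The default described above can be illustrated with a small Go sketch. It is not taken from elPrep's sources: the helper name `resolveTmpPath` is hypothetical, and the sketch only shows the fallback to the current working directory when no path is given.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveTmpPath is a hypothetical helper (not part of elPrep). It returns an
// absolute directory for temporary files: the given path if it is non-empty,
// otherwise the directory the program was started from.
func resolveTmpPath(tmpPath string) (string, error) {
	if tmpPath == "" {
		return os.Getwd()
	}
	return filepath.Abs(tmpPath)
}

func main() {
	dir, err := resolveTmpPath("")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("temporary files would go under:", dir)
}
```

Pointing the temporary path at a fast or large scratch disk is the typical reason to override the default, since the split files can be sizeable.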
+
 ### --single-end

 Use this command option to indicate the sfm command is processing single-end data. This information is important for the split/merge tools to operate correctly. For more details, see the description of the elprep split and elprep merge commands.
@@ -439,6 +469,8 @@ Choosing the value 1 for the --contig-group-size tells elprep split to split the

 The elprep split command requires two arguments: 1) the input file or a path to multiple input files and 2) a path to a directory where elPrep can store the split files. The input file(s) can be .sam or .bam. It is also possible to use /dev/stdin as the input for using Unix pipes. There are no structural requirements on the input file(s) for using elprep split. For example, it is not necessary to sort the input file, nor is it necessary to convert to .bam or index the input file.

+Warning: If you pass a path to multiple input files to the elprep split command, elprep assumes that they all have the same (or compatible) headers, and just picks the first header it finds as the header for all input files. elprep currently does not make an attempt to resolve potential conflicts between headers, especially with regard to the @SQ, @RG, or @PG header records. We will include proper merging of different SAM/BAM files in a future version of elprep. In the meantime, if you need proper merging of SAM/BAM files, please use samtools merge, Picard MergeSamFiles, or a similar tool. (If such a tool produces a SAM file as output, it can be piped into elprep using Unix pipes.)
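Because elprep does not reconcile headers itself, a script may want to check them before calling elprep split. The Go sketch below is one possible pre-check, not something elPrep provides: it treats two headers as compatible only when their @SQ records are identical and in the same order, which is a stricter assumption than the warning requires, and it ignores @RG and @PG records entirely. The function names and the sample headers are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sqRecords returns the @SQ lines of a SAM header, in order of appearance.
func sqRecords(header string) []string {
	var records []string
	scanner := bufio.NewScanner(strings.NewReader(header))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "@SQ\t") {
			records = append(records, line)
		}
	}
	return records
}

// sameSQ reports whether two headers list identical @SQ records in the same
// order. Other record types are not compared.
func sameSQ(a, b string) bool {
	ra, rb := sqRecords(a), sqRecords(b)
	if len(ra) != len(rb) {
		return false
	}
	for i := range ra {
		if ra[i] != rb[i] {
			return false
		}
	}
	return true
}

func main() {
	// Sample headers for illustration only.
	h1 := "@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:2000\n"
	h2 := "@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000\n"
	fmt.Println("h1 vs h1:", sameSQ(h1, h1))
	fmt.Println("h1 vs h2:", sameSQ(h1, h2))
}
```

If the check fails, the tools named in the warning (samtools merge, Picard MergeSamFiles) remain the safer route.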
+
 elPrep creates the output directory denoted by the output path, unless the directory already exists, in which case elPrep may overwrite the existing files in that directory. Please make sure elPrep has the correct permissions for writing that directory.

 By default, the elprep split command assumes it is processing paired-end data. The flag --single-end can be used for processing single-end data. The output will look different for paired-end and single-end data.
@@ -524,6 +556,34 @@ Sets the path for writing a log file.

 The --contig-group-size parameter for the elprep merge command is deprecated since version 4.1.1. The elprep merge command now correctly processes the split files without that information.

+## Name
+
+### elprep merge-optical-duplicates-metrics - a commandline tool for merging intermediate metrics files created by the --mark-optical-duplicates-intermediate option
+
+## Synopsis
+
+elprep merge-optical-duplicates-metrics input-file output-file metrics-file /path/to/intermediate/metrics --remove-duplicates
+
+## Description
+
+The elprep merge-optical-duplicates-metrics command requires four arguments:
+the names of the original input and output .sam/.bam files for which the metrics are calculated,
+the metrics file to which the merged metrics should be written, and a path to the intermediate metrics files that need
+to be merged (and were generated using --mark-optical-duplicates-intermediate).
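To give an idea of what merging intermediate metrics can involve, here is a minimal Go sketch. It assumes that the counters collected for each split file can simply be added per library. The `tally` type and its fields are hypothetical and need not match elPrep's `filters.DuplicatesCtrMap` or the on-disk format of the intermediate files.

```go
package main

import "fmt"

// tally is a hypothetical per-library counter set; the fields are chosen for
// illustration and need not match elPrep's internal representation.
type tally struct {
	ReadPairsExamined     int
	ReadPairDuplicates    int
	OpticalPairDuplicates int
}

// mergeTallies adds up the per-library counters of several partial results.
func mergeTallies(parts ...map[string]tally) map[string]tally {
	merged := make(map[string]tally)
	for _, part := range parts {
		for library, t := range part {
			sum := merged[library] // zero value if the library is new
			sum.ReadPairsExamined += t.ReadPairsExamined
			sum.ReadPairDuplicates += t.ReadPairDuplicates
			sum.OpticalPairDuplicates += t.OpticalPairDuplicates
			merged[library] = sum
		}
	}
	return merged
}

func main() {
	// Made-up counters for two split files of the same library.
	chr1 := map[string]tally{"libA": {ReadPairsExamined: 100, ReadPairDuplicates: 10, OpticalPairDuplicates: 2}}
	chr2 := map[string]tally{"libA": {ReadPairsExamined: 50, ReadPairDuplicates: 5, OpticalPairDuplicates: 1}}
	fmt.Printf("%+v\n", mergeTallies(chr1, chr2)["libA"])
}
```

Keying the merge by library keeps counters from different libraries apart, which is how duplicate metrics are usually reported.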
+
+## Options
+
+### --nr-of-threads number
+
+This command option sets the number of threads that elPrep uses during execution for parsing/outputting .sam/.bam data. The default number of threads is equal to the number of cpu threads.
+
+It is normally not necessary to set this option. elPrep by default allocates the optimal number of threads.
+
+### --remove-duplicates
+
+Pass this option if the metrics were generated for a file for which the duplicates were removed. This information will
+be included in the merged metrics file.
+
 # Extending elPrep

 If you wish to extend elPrep, for example by adding your own filters, please consult our [API documentation](https://godoc.org/github.com/ExaScience/elprep).
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+// elPrep: a high-performance tool for preparing SAM/BAM files.
+// Copyright (c) 2017-2019 imec vzw.
+
+// This program is free software: you can redistribute it and/or modify
+// it under the terms of the GNU Affero General Public License as
+// published by the Free Software Foundation, either version 3 of the
+// License, or (at your option) any later version, and Additional Terms
+// (see below).
+
+// This program is distributed in the hope that it will be useful, but
+// WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+// Affero General Public License for more details.
+
+// You should have received a copy of the GNU Affero General Public
+// License and Additional Terms along with this program. If not, see
+// <https://github.com/ExaScience/elprep/blob/master/LICENSE.txt>.
+
+package cmd
+
+import (
+	"bytes"
+	"flag"
+	"fmt"
+	"log"
+	"os"
+	"path/filepath"
+	"runtime"
+
+	"github.com/exascience/elprep/v4/filters"
+)
+
+// MergeOpticalDuplicatesMetricsHelp is the help string for this command.
+const MergeOpticalDuplicatesMetricsHelp = "\nmerge-optical-duplicates-metrics parameters:\n" +
+	"elprep merge-optical-duplicates-metrics sam-input-file sam-output-file metrics-file /path/to/intermediate/metrics\n" +
+	"[--remove-duplicates]\n" +
+	"[--nr-of-threads nr]\n" +
+	"[--timed]\n" +
+	"[--log-path path]\n"
+
+// MergeOpticalDuplicatesMetrics implements the elprep merge-optical-duplicates-metrics command.
+func MergeOpticalDuplicatesMetrics() error {
+	var (
+		profile, logPath        string
+		nrOfThreads             int
+		timed, removeDuplicates bool
+	)
+
+	var flags flag.FlagSet
+
+	flags.IntVar(&nrOfThreads, "nr-of-threads", 0, "number of worker threads")
+	flags.BoolVar(&timed, "timed", false, "measure the runtime")
+	flags.BoolVar(&removeDuplicates, "remove-duplicates", false, "use when duplicates were removed during duplicate marking")
+	flags.StringVar(&profile, "profile", "", "write a runtime profile to the specified file(s)")
+	flags.StringVar(&logPath, "log-path", "", "write log files to the specified directory")
+
+	parseFlags(flags, 6, MergeOpticalDuplicatesMetricsHelp)
+
+	input := getFilename(os.Args[2], MergeOpticalDuplicatesMetricsHelp)
+	output := getFilename(os.Args[3], MergeOpticalDuplicatesMetricsHelp)
+	metrics := getFilename(os.Args[4], MergeOpticalDuplicatesMetricsHelp)
+	intermediateMetrics := getFilename(os.Args[5], MergeOpticalDuplicatesMetricsHelp)
+
+	setLogOutput(logPath)
+
+	// sanity checks
+
+	var sanityChecksFailed bool
+
+	if !checkExist("", input) {
+		log.Println("Warning: Input file does not exist: ", input)
+	}
+
+	if !checkExist("", intermediateMetrics) {
+		sanityChecksFailed = true
+	}
+
+	if profile != "" && !checkCreate("--profile", profile) {
+		sanityChecksFailed = true
+	}
+
+	metricsDir, err := filepath.Abs(intermediateMetrics)
+	if err != nil {
+		return err
+	}
+
+	if nrOfThreads < 0 {
+		sanityChecksFailed = true
+		log.Println("Error: Invalid nr-of-threads: ", nrOfThreads)
+	}
+
+	if sanityChecksFailed {
+		fmt.Fprint(os.Stderr, MergeOpticalDuplicatesMetricsHelp)
+		os.Exit(1)
+	}
+
+	// building output command line
+
+	var command bytes.Buffer
+	fmt.Fprint(&command, os.Args[0], " merge-optical-duplicates-metrics ", input, " ", output, " ", metrics, " ", intermediateMetrics)
+	if nrOfThreads > 0 {
+		runtime.GOMAXPROCS(nrOfThreads)
+		fmt.Fprint(&command, " --nr-of-threads ", nrOfThreads)
+	}
+	if timed {
+		fmt.Fprint(&command, " --timed ")
+	}
+	if logPath != "" {
+		fmt.Fprint(&command, " --log-path ", logPath)
+	}
+	if removeDuplicates {
+		fmt.Fprint(&command, " --remove-duplicates")
+	}
+
+	// executing command
+
+	log.Println("Executing command:\n", command.String())
+
+	var ctr filters.DuplicatesCtrMap
+
+	// merge intermediate metrics files
+	err = timedRun(timed, profile, "Loading and combining duplicate metrics.", 1, func() error {
+		ctr = filters.LoadAndCombineDuplicateMetrics(metricsDir)
+		return ctr.Err()
+	})
+	if err != nil {
+		return err
+	}
+	return timedRun(timed, profile, "Printing combined duplicate metrics.", 2, func() error {
+		return filters.PrintDuplicatesMetrics(input, output, metrics, removeDuplicates, ctr)
+	})
+}

cmd/merge.go

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 // elPrep: a high-performance tool for preparing SAM/BAM files.
-// Copyright (c) 2017, 2018 imec vzw.
+// Copyright (c) 2017-2019 imec vzw.

 // This program is free software: you can redistribute it and/or modify
 // it under the terms of the GNU Affero General Public License as
@@ -84,7 +84,7 @@ func Merge() error {
 	if err != nil {
 		return err
 	}
-	filesToMerge, err := internal.Directory(fullInputPath)
+	fullInputPath, filesToMerge, err := internal.Directory(fullInputPath)
 	if err != nil {
 		log.Printf("Given directory %v causes error %v.\n", input, err)
 		sanityChecksFailed = true
