Tataki is a command-line tool designed primarily for detecting file formats in the bioinformatics field. The tool comes with the following features:
- Supports various file formats mainly used in bioinformatics
  - Bioinformatics file formats
    - bam
    - bcf
    - bed
    - cram
    - fasta
    - fastq
    - gff3
    - gtf
    - sam
    - vcf
  - Compression formats
    - gzip
    - bzip2
  - will be added in the future
    - Bioinformatics file formats
- Allows for the invocation of a CWL document and enables users to define their own complex criteria for detection.
- Can target local files, remote URLs and standard input
- Compatible with EDAM ontology
Bioinformatics workflows often fail silently due to malformed, truncated, or inconsistent intermediate files, and many tools do not reliably indicate such issues through exit codes. These silent errors propagate through pipelines, consuming computational resources and making failures difficult to diagnose. Tataki addresses this problem by examining actual file contents with strict, domain-aware parsers, detecting anomalies before they propagate to downstream steps. By inserting Tataki between workflow steps, researchers can improve workflow robustness, catch format-related errors early, and ensure more reliable automated analyses.
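As a rough illustration of placing Tataki between workflow steps, the Python sketch below parses a Tataki CSV report and raises an error when a file is not detected as the expected format. It assumes the report columns are File Path, Edam ID, Label, Decompressed ID and Decompressed Label, and that for compressed inputs the decompressed label is the one worth checking. The helpers `parse_report`, `assert_format` and `run_tataki` are hypothetical names and are not part of Tataki.

```python
import csv
import io
import subprocess


def parse_report(csv_text):
    """Map each file path in a Tataki CSV report to its row (a dict of columns)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["File Path"]: row for row in reader}


def assert_format(report, path, expected_label):
    """Raise ValueError unless `path` was detected as `expected_label`.

    Assumption: for compressed inputs, the decompressed label is compared
    instead of the label of the compression format.
    """
    row = report[path]
    detected = row["Decompressed Label"] or row["Label"]
    if detected != expected_label:
        raise ValueError(
            f"{path}: expected {expected_label}, got {detected or 'no label'}"
        )


def run_tataki(path):
    """Run tataki on one file and return its CSV report.

    Assumes the tataki binary is on PATH; -q and -f csv are the flags
    listed in `tataki --help`.
    """
    result = subprocess.run(
        ["tataki", path, "-q", "-f", "csv"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Sample report in the CSV layout described above.
    sample = (
        "File Path,Edam ID,Label,Decompressed ID,Decompressed Label\n"
        "path/to/unknown/file.txt,http://edamontology.org/format_2572,BAM,,\n"
    )
    report = parse_report(sample)
    assert_format(report, "path/to/unknown/file.txt", "BAM")  # no error raised
```

In a real pipeline, `run_tataki` would supply the report text between two steps, and the raised error would stop the workflow before the next tool consumes a malformed file.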
A single binary is available for Linux x86_64 and aarch64.
curl -fsSL -o ./tataki https://github.com/sapporo-wes/tataki/releases/latest/download/tataki-$(uname -m)
chmod +x ./tataki
./tataki --help
You can also install tataki using cargo.
cargo install -f --git https://github.com/sapporo-wes/tataki.git
Or, you can run tataki using Docker.
docker run --rm -v $PWD:$PWD -w $PWD ghcr.io/sapporo-wes/tataki:latest --help
In case you want to execute the CWL document with external extension mode, please make sure to mount docker.sock, /tmp and any other necessary directories.
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/tmp -v $PWD:$PWD -w $PWD ghcr.io/sapporo-wes/tataki:latest --help
Determine the file format of a local file. By default, tataki checks the first --num-records records of the input:
$ tataki path/to/unknown/file.txt -q
File Path,Edam ID,Label,Decompressed ID,Decompressed Label
path/to/unknown/file.txt,http://edamontology.org/format_2572,BAM,,
Determine the file format of a remote file and output the result in YAML format:
$ tataki https://path/to/unknown/file.txt -q -f yaml
https://path/to/unknown/file.txt:
  label: GZIP format
  id: http://edamontology.org/format_3989
  decompressed:
    id: http://edamontology.org/format_1930
    label: FASTQ
Specify the paths of the files as arguments to tataki. Local file paths, remote URLs, and standard input (-) are supported.
tataki <FILE|URL|'-'>...
For more details:
$ tataki --help
Usage: tataki [OPTIONS] [FILE|URL|'-']...
Arguments:
[FILE|URL|'-']... Path to the file, URL, or "-" to read from standard input. Multiple inputs can be specified
Options:
-o, --output <FILE> Path to the output file [default: stdout]
-f <OUTPUT_FORMAT> [default: csv] [possible values: yaml, tsv, csv, json]
-C, --cache-dir <DIR> Specify the directory in which to create a temporary directory. If this option is not provided, a temporary directory will be created in the default system temporary directory (/tmp)
-c, --conf <FILE> Specify the tataki configuration file. If this option is not provided, the default configuration will be used. The option `--dry-run` shows the default configuration file
-t, --tidy Attempt to read the whole lines from the input files
--no-decompress Do not try to decompress the input file when detecting the file format
-n, --num-records <NUM_RECORDS> Number of records to read from the input file. Recommended to set it to a multiple of 4 to prevent false negatives. Conflicts with `--tidy` option [default: 100000]
--dry-run Output the configuration file in yaml format and exit the program. If `--conf` option is not provided, the default configuration file will be shown
-v, --verbose Show verbose log messages
-q, --quiet Suppress all log messages
-h, --help Print help
-V, --version Print version
Version: v0.5.1
Table of Contents
- Tataki
- Installation
- Usage
- Detailed Usage
- Potentially Unexpected Behaviors
- Contributing
- Reporting Issues
- Getting Help
- License
Read from standard input by specifying - as the file path.
cat <FILE> | tataki -
By default, Tataki reads the first 100,000 records of the input. You can change this number with the -n|--num-records=<NUM_RECORDS> option.
tataki <FILE|URL|'-'> -n 1000
With the -t|--tidy option, Tataki attempts to read every line of the input. This option helps when the input is truncated or its end is corrupted, and it avoids misidentifying the file formats of corrupted files.
tataki <FILE|URL> -t
Tataki attempts to automatically decompress the input when detecting the file format. Currently, gzip and bzip2 are supported.
$ tataki foo.fastq.gz -q -f yaml
foo.fastq.gz:
  label: GZIP format
  id: http://edamontology.org/format_3989
  decompressed:
    id: http://edamontology.org/format_1930
    label: FASTQ
If you want to disable decompression, use the --no-decompress option.
$ tataki foo.fastq.gz -q -f yaml --no-decompress
foo.fastq.gz:
  label: GZIP format
  id: http://edamontology.org/format_3989
  decompressed:
    id: null
    label: null
BGZF compressed files, such as BCF, BAM, or anything compressed with BGZF, are handled slightly differently. When BGZF files are given as input, Tataki does not attempt to decompress them and passes them directly to the parsers.
$ tataki foo.bam -q -f yaml
foo.bam:
  label: BAM
  id: http://edamontology.org/format_2572
  decompressed:
    label: null
    id: null
Using the -c|--conf=<FILE> option allows you to change the order or set of file formats to check for.
The configuration file is in YAML format. For the schema, please refer to the default configuration shown below. To specify a CWL document in the configuration file, see Add Path to Configuration File.
The default configuration can be displayed with the --dry-run option.
# $ tataki --dry-run
order:
- bam
- bcf
- bed
- cram
- fasta
- fastq
- gff3
- gtf
- sam
- vcf
Tataki can also be used to execute a CWL document with external extension mode. This is useful when determining file formats that are not supported in pre-built mode or when you want to re-use the existing software to parse the input file.
This mode is dependent on Docker, so please ensure that 'docker' is in your PATH.
Here are the steps to execute a CWL document with external extension mode.
Tataki accepts a CWL document in a specific format. The following is an example of a CWL document that executes samtools view.
edam_id and label are the two required fields for the CWL document. Both must carry the tataki prefix, which is declared in the $namespaces section of the document.
cwlVersion: v1.2
class: CommandLineTool
requirements:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/samtools:1.18--h50ea8bc_1
  InlineJavascriptRequirement: {}
baseCommand: [samtools, view]
successCodes: [0, 139]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs: {}
$namespaces:
  tataki: https://github.com/sapporo-wes/tataki
tataki:edam_id: http://edamontology.org/format_2573
tataki:label: SAM
Insert a path to the CWL document in the configuration file. The example shown below executes the CWL document followed by SAM and BAM format detection.
order:
- ./path/to/cwl_document.cwl
- sam
- bam
Then, execute tataki with the -c|--conf=<FILE> option. Remember to use the --tidy option when executing a CWL document, because the tool in the CWL document needs every line of the input in order to parse it.
tataki <FILE|URL|`-`> -c <CONFIG_FILE> --tidy
Also, consider using the --no-decompress option when you prefer to pass the input without decompression.
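The configuration step can also be scripted. The Python sketch below renders a configuration in which a CWL document is checked ahead of the built-in formats. It assumes the configuration needs only the `order` key shown in the examples above; `render_config` is a hypothetical helper, not part of Tataki.

```python
def render_config(entries):
    """Render a Tataki configuration as YAML text.

    `entries` is the ordered list of built-in format names or paths to CWL
    documents; Tataki tries them in this order.
    """
    lines = ["order:"]
    lines += [f"  - {entry}" for entry in entries]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # CWL document first, then the built-in SAM and BAM parsers.
    print(render_config(["./path/to/cwl_document.cwl", "sam", "bam"]), end="")
```

Writing the returned text to a file gives a configuration that can be passed with -c <CONFIG_FILE>, together with --tidy as noted above.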
These are the tricky cases where the result of tataki may not be as expected. Please see issue #6 for the examples of these cases. If you encounter any unusual behavior like these examples, please consider posting to issue #6.
- Files with only header lines
Tataki reports the first format, in the order listed in the configuration file, whose header-line specification matches the input. If you run tataki with the default configuration and the input file uses # as the comment delimiter, the file will be detected as a BED file.
- Gzipped binary files
Gzipped binary files, such as *.bam.gz, are currently not supported by tataki. Detection fails with the following error message:
Error: stream did not contain valid UTF-8
- BGZF format
As shown in BGZF Compressed Files, BGZF compressed files are not decompressed by tataki and are treated as is. Please be aware of this when parsing BGZF compressed files that have a .gz file extension, such as *.vcf.gz.
$ tataki SAMPLE_01.pass.vcf.gz -f yaml
SAMPLE_01.pass.vcf.gz:
  id: http://edamontology.org/format_3016
  label: VCF
  decompressed:
    label: null
    id: null
Please see our CONTRIBUTING.md for details on:
- How to add new file format parsers
- How to add CWL documents for external extension mode
If you encounter any bugs or issues, please report them on our GitHub Issues page. When reporting an issue, please include:
- Your operating system, version and architecture
- Tataki version (tataki --version)
- Steps to reproduce the issue
- Expected vs actual behavior
- If possible, a sample file that reproduces the issue
If you have general questions or need support, please open a GitHub Issue.
The contents of this repository are licensed under the Apache License 2.0, with the exception noted below. See the LICENSE file. The following file is licensed under Creative Commons Attribution Share Alike 4.0 International (https://spdx.org/licenses/CC-BY-SA-4.0.html):
- ./src/EDAM_1.25.id_label.csv
  - Source: https://github.com/edamontology/edamontology/releases/download/1.25/EDAM_1.25.csv
  - Processed with extract_id_label.sh to remove lines not related to 'format' and columns other than 'Preferred Label' and 'Class ID'