Kaiju 2016


Introduction

Kaiju is a software tool for fast and accurate taxonomic classification of high-throughput sequencing reads, for example from metagenomic samples. Reads are translated into amino acid sequences and compared against a reference database of protein sequences, so classification happens at the protein level.

Installed version(s)

The following versions are installed and currently available...

... on environment hpc-env/8.3:

  • Kaiju/1.9.2-GCCcore-8.3.0

Loading Kaiju

To load the desired version of the module, use the module load command, e.g.

module load hpc-env/8.3
module load Kaiju

Always remember: these commands are case-sensitive!

Using Kaiju

To find out how to use Kaiju, type kaiju without arguments after loading the module. This prints a help text to get you started:

Kaiju 1.9.2
Copyright 2015-2022 Peter Menzel, Anders Krogh
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

Usage:
   kaiju -t nodes.dmp -f kaiju_db.fmi -i reads.fastq [-j reads2.fastq]

Mandatory arguments:
   -t FILENAME   Name of nodes.dmp file
   -f FILENAME   Name of database (.fmi) file
   -i FILENAME   Name of input file containing reads in FASTA or FASTQ format

Optional arguments:
   -j FILENAME   Name of second input file for paired-end reads
   -o FILENAME   Name of output file. If not specified, output will be printed to STDOUT
   -z INT        Number of parallel threads for classification (default: 1)
   -a STRING     Run mode, either "mem"  or "greedy" (default: greedy)
   -e INT        Number of mismatches allowed in Greedy mode (default: 3)
   -m INT        Minimum match length (default: 11)
   -s INT        Minimum match score in Greedy mode (default: 65)
   -E FLOAT      Minimum E-value in Greedy mode (default: 0.01)
   -x            Enable SEG low complexity filter (enabled by default)
   -X            Disable SEG low complexity filter
   -p            Input sequences are protein sequences
   -v            Enable verbose output
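
The -z option in the help text sets the number of threads. Inside a batch job it usually makes sense to match it to the number of allocated cores. The following is a minimal sketch, assuming jobs run under Slurm (this page does not state the scheduler); kaiju_threads is a hypothetical helper, not part of Kaiju.

```shell
# Pick a thread count for kaiju's -z option.
# Assumption: the job runs under Slurm, which sets SLURM_CPUS_PER_TASK when
# --cpus-per-task is requested. Otherwise fall back to kaiju's default of 1.
kaiju_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Example call (file names are placeholders taken from the help text):
#   kaiju -z "$(kaiju_threads)" -t nodes.dmp -f kaiju_db.fmi -i reads.fastq -o kaiju_output.txt
```

Reading the value from the environment keeps the kaiju call in step with the resources requested in the job script.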

The kaiju command is used to classify reads with the Kaiju software. Here are some examples:

  • Build database:
kaiju-makedb -s viruses -t 2

This builds the "viruses" reference database using 2 threads. The resulting database (.fmi) file and nodes.dmp are the inputs for the example commands below.


  • Classify single-end Illumina reads:
kaiju -t nodes.dmp -f kaiju_db.fmi -i reads.fastq.gz -o kaiju_output.txt

This command classifies single-end Illumina reads in file "reads.fastq.gz" using the database built with "kaiju-makedb" and writes the results to "kaiju_output.txt". Here "kaiju_db.fmi" stands for the .fmi file that kaiju-makedb created.

  • Classify paired-end Illumina reads:
kaiju -t nodes.dmp -f kaiju_db.fmi -i reads_1.fastq.gz -j reads_2.fastq.gz -o kaiju_output.txt

This command classifies paired-end Illumina reads in files "reads_1.fastq.gz" and "reads_2.fastq.gz" using the same database (.fmi) file and writes the results to "kaiju_output.txt".

  • Classify reads with a specified minimum match length:
kaiju -t nodes.dmp -f kaiju_db.fmi -i reads.fastq.gz -o kaiju_output.txt -m 20

This command classifies reads in file "reads.fastq.gz", but only accepts matches to the database with a minimum length of 20 amino acids (default: 11), specified with the "-m" option.

  • Classify reads with a specified number of mismatches:
kaiju -t nodes.dmp -f kaiju_db.fmi -i reads.fastq.gz -o kaiju_output.txt -e 2

This command classifies reads in file "reads.fastq.gz" in the default Greedy mode, but allows at most 2 mismatches (default: 3), specified with the "-e" option.
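
By default, each line of the Kaiju output file has three tab-separated columns: C or U (classified or unclassified), the read name, and the NCBI taxon ID of the assigned taxon. A quick sanity check on a finished run can be sketched as follows; kaiju_summary is a hypothetical helper, not part of Kaiju.

```shell
# Count classified (C) and unclassified (U) reads in a Kaiju output file.
# Assumes Kaiju's tab-separated output with the status letter in column 1.
kaiju_summary() {
    awk -F'\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END { printf "classified=%d unclassified=%d\n", c, u }
    ' "$1"
}

# Example call:
#   kaiju_summary kaiju_output.txt
```

A very low share of classified reads often points to a mismatch between the sample and the chosen reference database.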


Besides the kaiju command, the installation directory contains further binaries and helper scripts:

$ ls $EBROOTKAIJU/bin
kaiju        kaiju2table          kaiju-convertMAR.py  kaiju-excluded-accessions.txt  kaiju-makedb        kaiju-mkbwt  kaiju-multi  kaiju-taxonlistEuk.tsv
kaiju2krona  kaiju-addTaxonNames  kaiju-convertNR      kaiju-gbk2faa.pl               kaiju-mergeOutputs  kaiju-mkfmi  kaijup       kaijux
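
Two of the tools listed above are commonly used after classification: kaiju2table summarises an output file at a chosen taxonomic rank, and kaiju-addTaxonNames appends taxon names to each classified read. Both need names.dmp in addition to nodes.dmp. The sketch below only prints the command lines so they can be checked before running. The function name, the output file naming and the default rank are assumptions made for illustration.

```shell
# Print (do not run) typical post-processing commands for a Kaiju output file.
# kaiju_postprocess_cmds is a hypothetical helper; nodes.dmp and names.dmp are
# assumed to be in the current directory.
kaiju_postprocess_cmds() {
    out="$1"            # Kaiju output file, e.g. kaiju_output.txt
    rank="${2:-genus}"  # taxonomic rank for the summary table
    base="${out%.txt}"  # strip a trailing .txt for the derived file names
    echo "kaiju2table -t nodes.dmp -n names.dmp -r $rank -o ${base}_${rank}.tsv $out"
    echo "kaiju-addTaxonNames -t nodes.dmp -n names.dmp -i $out -o ${base}_names.txt"
}

# Example call:
#   kaiju_postprocess_cmds kaiju_output.txt species
```

Once the printed commands look right, they can be pasted into the shell or a job script.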

Documentation

The full documentation can be found in the Kaiju repository on GitHub: https://github.com/bioinformatics-centre/kaiju