Wtdbg 2016

Introduction

Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output. Wtdbg2 is able to assemble the human and even the 32Gb Axolotl genome at a speed tens of times faster than CANU and FALCON while producing contigs of comparable base accuracy.

During assembly, wtdbg2 chops reads into 1024bp segments, merges similar segments into a vertex, and connects vertices based on segment adjacency on the reads. The resulting graph is called a fuzzy Bruijn graph (FBG). It is akin to a De Bruijn graph but permits mismatches and gaps, and it keeps read paths when collapsing k-mers. The use of the FBG distinguishes wtdbg2 from the majority of long-read assemblers. [1]
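
The chopping and adjacency steps can be illustrated with a toy shell sketch. It is not wtdbg2's implementation: the read, the segment length of 4 and the function names are made up for illustration, and the key step of merging similar segments into one vertex is left out entirely.

```shell
# Toy sketch only. wtdbg2 works on 1024bp segments and merges similar ones;
# here we just chop one read and list which segments are neighbours.
SEG_LEN=4                 # hypothetical toy segment length (wtdbg2 uses 1024)
READ="ACGTACGTTTGACCAT"   # hypothetical 16-base read

# Chop a sequence into consecutive fixed-length segments, one per line.
# A shorter trailing piece is kept as its own line in this sketch.
chop_read() {
    printf '%s\n' "$1" | fold -w "$2"
}

# Print every pair of neighbouring segments. In the graph, such pairs
# become edges between the vertices that hold the segments.
adjacent_pairs() {
    chop_read "$1" "$2" | awk 'NR > 1 { print prev " -> " $0 } { prev = $0 }'
}

adjacent_pairs "$READ" "$SEG_LEN"
```

In the real assembler, segments from different reads that are similar enough end up in the same vertex, which is what lets the graph tolerate mismatches and gaps.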

Installed version(s)

The following versions are installed and currently available...

... on environment hpc-env/8.3:

  • wtdbg2/2.5-GCCcore-8.3.0

Loading / Using wtdbg2

To load the desired version of the module, use the module load command, e.g.

module load hpc-env/8.3
module load wtdbg2

Always remember: this command is case sensitive!
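
If you want to pin the exact version listed above rather than the default, the full module name can be given. The following is a minimal sketch; the guard around `module` is only there so the snippet fails gracefully on machines without the module system, and the variable name is our own choice.

```shell
# Full module name as listed under "Installed version(s)".
WTDBG2_MODULE="wtdbg2/2.5-GCCcore-8.3.0"

if command -v module >/dev/null 2>&1; then
    module load hpc-env/8.3
    module load "$WTDBG2_MODULE"
    wtdbg2 -V    # print version information and exit
else
    echo "module command not found; run this on the cluster" >&2
fi
```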


To find out how to use wtdbg2, run wtdbg2 without any arguments. This prints a help text to get you started:

$ wtdbg2
WTDBG: De novo assembler for long noisy sequences
Author: Jue Ruan <ruanjue@gmail.com>
Version: 2.5 (20190621)
Usage: wtdbg2 [options] -i <reads.fa> -o <prefix> [reads.fa ...]
Options:
 -i <string> Long reads sequences file (REQUIRED; can be multiple), []
 -o <string> Prefix of output files (REQUIRED), []
 -t <int>    Number of threads, 0 for all cores, [4]
 -f          Force to overwrite output files
 -x <string> Presets, comma delimited, []
            preset1/rsII/rs: -p 21 -S 4 -s 0.05 -L 5000
                    preset2: -p 0 -k 15 -AS 2 -s 0.05 -L 5000
                    preset3: -p 19 -AS 2 -s 0.05 -L 5000
                  sequel/sq
               nanopore/ont:
            (genome size < 1G: preset2) -p 0 -k 15 -AS 2 -s 0.05 -L 5000
            (genome size >= 1G: preset3) -p 19 -AS 2 -s 0.05 -L 5000
      preset4/corrected/ccs: -p 21 -k 0 -AS 4 -K 0.05 -s 0.5
 -g <number> Approximate genome size (k/m/g suffix allowed) [0]
 -X <float>  Choose the best <float> depth from input reads(effective with -g) [50.0]
 -L <int>    Choose the longest subread and drop reads shorter than <int> (5000 recommended for PacBio) [0]
             Negative integer indicate tidying read names too, e.g. -5000.
 -k <int>    Kmer fsize, 0 <= k <= 23, [0]
 -p <int>    Kmer psize, 0 <= p <= 23, [21]
             k + p <= 25, seed is <k-mer>+<p-homopolymer-compressed>
 -K <float>  Filter high frequency kmers, maybe repetitive, [1000.05]
             >= 1000 and indexing >= (1 - 0.05) * total_kmers_count
 -S <float>  Subsampling kmers, 1/(<-S>) kmers are indexed, [4.00]
             -S is very useful in saving memeory and speeding up
             please note that subsampling kmers will have less matched length
 -l <float>  Min length of alignment, [2048]
 -m <float>  Min matched length by kmer matching, [200]
 -R          Enable realignment mode
 -A          Keep contained reads during alignment
 -s <float>  Min similarity, calculated by kmer matched length / aligned length, [0.05]
 -e <int>    Min read depth of a valid edge, [3]
 -q          Quiet
 -v          Verbose (can be multiple)
 -V          Print version information and then exit
 --help      Show more options
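
As described in the introduction, an assembly takes two stages: wtdbg2 assembles the raw reads, then a consensus tool builds the contig sequences from the intermediate output. The sketch below follows the two-step pattern from the upstream wtdbg2 README. The input file name, output prefix, genome size and choice of the `rs` preset are placeholders, and the `run` helper with its `DRY_RUN` switch is our own addition so that the commands can be inspected before anything is executed.

```shell
# Hypothetical inputs: replace with your own reads, prefix and genome size.
READS="reads.fa.gz"
PREFIX="asm"
GENOME_SIZE="4.6m"
THREADS=4
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stage 1: assemble the raw reads; writes files named <prefix>.*
# Stage 2: derive the consensus from the contig layout, only if stage 1 succeeded.
if run wtdbg2 -x rs -g "$GENOME_SIZE" -t "$THREADS" -i "$READS" -fo "$PREFIX"; then
    run wtpoa-cns -t "$THREADS" -i "$PREFIX.ctg.lay.gz" -fo "$PREFIX.raw.fa"
fi
```

Run it once as is to check the printed commands, then set `DRY_RUN=0` to start the assembly. Pick the preset for `-x` that matches your sequencing technology, as listed in the help text above.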


The software's root directory contains two folders that may be useful to you: the bin folder, which contains all executables, and the scripts folder, which contains additional helper scripts:

$ ls $EBROOTWTDBG2/bin
kbm2  pgzf  wtdbg2  wtdbg2.pl  wtdbg-cns  wtpoa-cns

$ ls $EBROOTWTDBG2/scripts
best_kbm_hit.pl             dbm_index_fa.pl  fa2tab.pl         hlcolor                     num_n50.pl    runit.pl         split_seqs_3.pl
best_minimap_hit.pl         dbm_read_dot.pl  first_n_bases.pl  longest_pacbio_subreads.pl  rename_fa.pl  sam2dbgcns.pl    wtdbg-dot2gfa.pl
best_sam_hits4longreads.pl  dbm_read_fa.pl   first_n_seqs.pl   mmpoa.pl                    rename_fq.pl  seq_n50.pl
dbm_index_dot.pl            fa2fq.pl         fq2fa.pl          mum_assess.sh               rev_seq.pl    split_seqs_2.pl
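
To refer to one of these helper scripts by its full path, a small function like the following can be used. It is only a sketch: the function name is our own, and it relies on the `EBROOTWTDBG2` variable that is set once the module is loaded.

```shell
# Print the full path of a helper script shipped with the wtdbg2 module.
wtdbg2_script() {
    if [ -z "${EBROOTWTDBG2:-}" ]; then
        echo "EBROOTWTDBG2 is not set; load the wtdbg2 module first" >&2
        return 1
    fi
    printf '%s/scripts/%s\n' "$EBROOTWTDBG2" "$1"
}

# Example: locate one of the scripts from the listing above.
if script_path=$(wtdbg2_script fq2fa.pl); then
    echo "found: $script_path"
fi
```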


Documentation

The full documentation can be found here.