TransDecoder 2016

Revision as of 14:50, 9 September 2021 by Schwietzer

Introduction

TransDecoder (Find Coding Regions Within Transcripts)

TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using Tophat and Cufflinks.

TransDecoder identifies likely coding sequences based on the following criteria:

  • a minimum-length open reading frame (ORF) is found in a transcript sequence.
  • a log-likelihood score, similar to what is computed by the GeneID software, is > 0.
  • the above coding score is greatest when the ORF is scored in the 1st reading frame, as compared to the scores in the other 2 forward reading frames.
  • if a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc.).
  • a position-specific scoring matrix (PSSM) is built/trained/used to refine the start codon prediction.
  • optionally, the putative peptide has a match to a Pfam domain above the noise cutoff score. [1]

Installed version(s)

The following versions are installed and currently available...

... on environment hpc-env/8.3:

  • TransDecoder/5.5.0-intel-2019b-Perl-5.30.0

Loading / Using TransDecoder

To load the desired version of the module, use the module load command, e.g.

module load hpc-env/8.3
module load TransDecoder 

Always remember: this command is case sensitive!
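
Because module names are case sensitive, a mistyped name can leave the TransDecoder scripts missing from your PATH. The following is a minimal sketch of a check you could run after loading the module. The function name check_tools is an assumption made for this example; it is not part of TransDecoder or of the module system.

```shell
# Sketch: report whether the given commands are on the PATH.
# check_tools is a helper name invented for this example.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Return 0 only if every command was found.
    return "$missing"
}

# After "module load TransDecoder", one would call:
#   check_tools TransDecoder.LongOrfs TransDecoder.Predict
```

If one of the scripts is reported as missing, the module was probably not loaded, or was loaded under a differently spelled name.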


TransDecoder is mainly used through its two included scripts, TransDecoder.LongOrfs and TransDecoder.Predict. To see how to use them, run either script without any options or arguments. For example, this is the output of TransDecoder.Predict:

$ TransDecoder.Predict 

########################################################################################
#             ______                 ___                  __
#            /_  __/______ ____ ___ / _ \___ _______  ___/ /__ ____
#             / / / __/ _ `/ _\(_-</ // / -_) __/ _ \/ _  / -_) __/
#            /_/ /_/ \_,_/_//_/___/____/\__/\__/\___/\_,_/\__/_/   .Predict
#
########################################################################################
#
#  Transdecoder.LongOrfs|http://transdecoder.github.io> - Transcriptome Protein Prediction
#
#
#  Required:
#
#   -t <string>                            transcripts.fasta
#
#  Common options:
#
#
#   --retain_long_orfs_mode <string>        'dynamic' or 'strict' (default: dynamic)
#                                        In dynamic mode, sets range according to 1%FDR in random sequence of same GC content.
#
# 
#   --retain_long_orfs_length <int>         under 'strict' mode, retain all ORFs found that are equal or longer than these many nucleotides even if no other evidence 
#                                         marks it as coding (default: 1000000) so essentially turned off by default.)
#
#   --retain_pfam_hits <string>            domain table output file from running hmmscan to search Pfam (see transdecoder.github.io for info)     
#                                        Any ORF with a pfam domain hit will be retained in the final output.
# 
#   --retain_blastp_hits <string>          blastp output in '-outfmt 6' format.
#                                        Any ORF with a blast match will be retained in the final output.
#
#   --single_best_only                     Retain only the single best orf per transcript (prioritized by homology then orf length)
#
#   --output_dir | -O  <string>            output directory from the TransDecoder.LongOrfs step (default: basename( -t val ) + ".transdecoder_dir")
#
#   -G <string>                            genetic code (default: universal; see PerlDoc; options: Euplotes, Tetrahymena, Candida, Acetabularia, ...)
#
#   --no_refine_starts                     start refinement identifies potential start codons for 5' partial ORFs using a PWM, process on by default.
#
##  Advanced options
#
#    -T <int>                            Top longest ORFs to train Markov Model (hexamer stats) (default: 500)
#                                        Note, 10x this value are first selected for removing redundancies,
#                                        and then this -T value of longest ORFs are selected from the non-redundant set.
#  Genetic Codes
#
#
#   --genetic_code <string>                Universal (default)
#
#        Genetic Codes (derived from: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
#
#
Acetabularia
Candida
Ciliate
Dasycladacean
Euplotid
Hexamita
Mesodinium
Mitochondrial-Ascidian
Mitochondrial-Chlorophycean
Mitochondrial-Echinoderm
Mitochondrial-Flatworm
Mitochondrial-Invertebrates
Mitochondrial-Protozoan
Mitochondrial-Pterobranchia
Mitochondrial-Scenedesmus_obliquus
Mitochondrial-Thraustochytrium
Mitochondrial-Trematode
Mitochondrial-Vertebrates
Mitochondrial-Yeast
Pachysolen_tannophilus
Peritrich
SR1_Gracilibacteria
Tetrahymena
Universal
#
#  --version                           show version (5.5.0)
#
#########################################################################################
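
Based on the help text above, a typical run can be sketched as two steps: TransDecoder.LongOrfs first, then TransDecoder.Predict on the same transcript file. The sketch below is an illustration under assumptions, not a recommended pipeline. The function names, the file names in the usage comment, and the DRY_RUN variable are made up for this example; only the -t, --retain_pfam_hits and --retain_blastp_hits options are taken from the help output.

```shell
# Sketch only: DRY_RUN is a helper variable invented for this example,
# not a TransDecoder feature. With DRY_RUN=1 commands are printed, not run.
run_step() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

transdecoder_run() {
    transcripts="$1"      # transcript FASTA, passed via the required -t option
    pfam_hits="${2:-}"    # optional: hmmscan domain table (Pfam search)
    blastp_hits="${3:-}"  # optional: blastp output in '-outfmt 6' format

    # Step 1: extract the long ORFs; Predict reads this step's output directory.
    run_step TransDecoder.LongOrfs -t "$transcripts" || return 1

    # Step 2: predict coding regions, adding homology evidence only if given.
    set -- -t "$transcripts"
    if [ -n "$pfam_hits" ]; then
        set -- "$@" --retain_pfam_hits "$pfam_hits"
    fi
    if [ -n "$blastp_hits" ]; then
        set -- "$@" --retain_blastp_hits "$blastp_hits"
    fi
    run_step TransDecoder.Predict "$@"
}

# Example calls (placeholder file names):
#   transdecoder_run transcripts.fasta
#   transdecoder_run transcripts.fasta pfam.domtblout blastp.outfmt6
```

Setting DRY_RUN=1 before calling transdecoder_run prints the two command lines without executing them, which is handy for checking the arguments before submitting a job. According to the help text, the LongOrfs output directory defaults to the base name of the -t file plus ".transdecoder_dir".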



Documentation

The full documentation can be found on the TransDecoder website referenced in the help output above (transdecoder.github.io).