Difference between revisions of "BRAKER2 2016"
Schwietzer (talk | contribs) |
Revision as of 13:24, 22 February 2022
Introduction
BRAKER2 is a combination of GeneMark and AUGUSTUS that uses genomic and RNA-Seq data to automatically generate full gene structure annotations in novel genomes. BRAKER2 is an extension of BRAKER1 that allows for fully automated training of the gene prediction tools GeneMark-EX R14, R15, R17, F1 and AUGUSTUS from RNA-Seq and/or protein homology information, and that integrates the extrinsic evidence from RNA-Seq and protein homology information into the prediction (source: https://github.com/Gaius-Augustus/BRAKER#what-is-braker).
GeneMark and AUGUSTUS are independently installed as dependencies for BRAKER2.
Installed Version
... on environment hpc-env/8.3:
- BRAKER/2.1.6-foss-2019b-Python-3.7.4-GeneMark-ET-4.69
... on environment hpc-env/6.4:
- BRAKER2/2.1.2
- BRAKER2/2.1.4
Loading BRAKER2
In order to use BRAKER2, first switch to the newer environment and load the module:
module load hpc-env/8.3
module load BRAKER
or, if you prefer to work in the older environment hpc-env/6.4:
module load hpc-env/6.4
module load BRAKER2
Please note that BRAKER(2) runs with Python 3.
While the module is loading, two small scripts are run by different dependency modules:
- AUGUSTUS looks for the folder augustus_config in your $HOME directory and sets the environment variable $AUGUSTUS_CONFIG_PATH to that folder. If the folder is not found, it is created and the AUGUSTUS configuration files are copied into it. For some calculations, BRAKER needs write access to these configuration files, which is why this script is part of the loading process.
- The same goes for the key file .gm_key, which GeneMark needs in order to work properly.
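The two prerequisites described above can be checked by hand after loading the module. The following is a minimal sketch, not part of the module itself: the function name check_braker_user_setup is hypothetical, and it assumes that the key file .gm_key lives directly in $HOME and that the config folder is either $AUGUSTUS_CONFIG_PATH or, if that variable is unset, $HOME/augustus_config.

```shell
#!/usr/bin/env bash
# Sketch: verify the per-user files that the module load step is described
# as setting up. check_braker_user_setup is a hypothetical helper name.
# Assumption: .gm_key is expected directly in the home directory.

check_braker_user_setup() {
    local home_dir="${1:-$HOME}"
    # Prefer the variable set by the module; fall back to the default folder.
    local config_dir="${AUGUSTUS_CONFIG_PATH:-$home_dir/augustus_config}"
    local status=0

    # BRAKER needs write access to the AUGUSTUS config directory.
    if [ -d "$config_dir" ] && [ -w "$config_dir" ]; then
        echo "OK: $config_dir exists and is writable"
    else
        echo "MISSING: $config_dir is absent or not writable"
        status=1
    fi

    # GeneMark needs its key file.
    if [ -f "$home_dir/.gm_key" ]; then
        echo "OK: $home_dir/.gm_key found"
    else
        echo "MISSING: $home_dir/.gm_key not found"
        status=1
    fi

    return "$status"
}
```

Calling check_braker_user_setup without arguments inspects your own $HOME; a non-zero return status suggests that the module should be loaded again so the setup scripts can run.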
Using BRAKER2
To get a quick overview of the options available for BRAKER2, type braker.pl --help into the console after loading the module. You will then get a summary of the options for BRAKER2:
$ ml hpc-env/8.3
$ ml BRAKER
setting AUGUSTUS_CONFIG_PATH to /user/erle1100/augustus_config
$ braker.pl --help

DESCRIPTION

braker.pl   Pipeline for predicting genes with GeneMark-EX and AUGUSTUS with
            RNA-Seq and/or proteins

SYNOPSIS

braker.pl [OPTIONS] --genome=genome.fa {--bam=rnaseq.bam | --prot_seq=prot.fa}

INPUT FILE OPTIONS

--genome=genome.fa
    fasta file with DNA sequences
--bam=rnaseq.bam
    bam file with spliced alignments from RNA-Seq
--prot_seq=prot.fa
    A protein sequence file in multi-fasta format used to generate protein
    hints. Unless otherwise specified, braker.pl will run in "EP mode" which
    uses ProtHint to generate protein hints and GeneMark-EP+ to train
    AUGUSTUS.
--hints=hints.gff
    Alternatively to calling braker.pl with a bam or protein fasta file, it
    is possible to call it with a .gff file that contains introns extracted
    from RNA-Seq and/or protein hints (most frequently coming from
    ProtHint). If you wish to use the ProtHint hints, use its
    "prothint_augustus.gff" output file. This flag also allows the usage of
    hints from additional extrinsic sources for gene prediction with
    AUGUSTUS. To consider such additional extrinsic information, you need to
    use the flag --extrinsicCfgFiles to specify parameters for all sources
    in the hints file (including the source "E" for intron hints from
    RNA-Seq)
--prot_aln=prot.aln
    Alignment file generated from aligning protein sequences against the
    genome with either Exonerate (--prg=exonerate), or Spaln (--prg=spaln),
    or GenomeThreader (--prg=gth). This option can be used as an alternative
    to --prot_seq file or protein hints in the --hints file.
    To prepare alignment file, run Spaln2 with the following command:
        spaln -O0 ... > spalnfile
    To prepare alignment file, run Exonerate with the following command:
        exonerate --model protein2genome \
            --showtargetgff T ... > exfile
    To prepare alignment file, run GenomeThreader with the following
    command:
        gth -genomic genome.fa -protein \
            protein.fa -gff3out \
            -skipalignmentout ... -o gthfile
    A valid option prg=... must be specified in combination with
    --prot_aln. Generating tool will not be guessed. Currently, hints from
    protein alignment files are only used in the prediction step with
    AUGUSTUS.

FREQUENTLY USED OPTIONS

--species=sname
    Species name. Existing species will not be overwritten. Uses Sp_1 etc.,
    if no species is assigned
--AUGUSTUS_ab_initio
    output ab initio predictions by AUGUSTUS in addition to predictions with
    hints by AUGUSTUS
--softmasking
    Softmasking option for soft masked genome files. (Disabled by default.)
--esmode
    Run GeneMark-ES (genome sequence only) and train AUGUSTUS on long genes
    predicted by GeneMark-ES. Final predictions are ab initio
--epmode
    Run ProtHint to generate protein hints (if not already specified with
    --hints option) and use the hints in GeneMark-EP+ to create a training
    set for AUGUSTUS.
--etpmode
    Use RNA-Seq and protein hints in GeneMark-ETP+ to create a training set
    for AUGUSTUS. The protein hints are generated by ProtHint (see
    --epmode).
--gff3
    Output in GFF3 format (default is gtf format)
--cores
    Specifies the maximum number of cores that can be used during
    computation. Be aware: optimize_augustus.pl will use max. 8 cores;
    augustus will use max. nContigs in --genome=file cores.
--workingdir=/path/to/wd/
    Set path to working directory. In the working directory results and
    temporary files are stored
--nice
    Execute all system calls within braker.pl and its submodules with bash
    "nice" (default nice value)
--alternatives-from-evidence=true
    Output alternative transcripts based on explicit evidence from hints
    (default is true).
--fungus
    GeneMark-EX option: run algorithm with branch point model (most useful
    for fungal genomes)
--crf
    Execute CRF training for AUGUSTUS; resulting parameters are only kept
    for final predictions if they show higher accuracy than HMM parameters.
--keepCrf
    keep CRF parameters even if they are not better than HMM parameters
--UTR=on
    create UTR training examples from RNA-Seq coverage data; requires
    options --bam=rnaseq.bam and --softmasking. Alternatively, if UTR
    parameters already exist, training step will be skipped and those
    pre-existing parameters are used.
--addUTR=on
    Adds UTRs from RNA-Seq coverage data to augustus.hints.gtf file. Does
    not perform training of AUGUSTUS or gene prediction with AUGUSTUS and
    UTR parameters.
--prg=gth|exonerate|spaln
    Specify an alternative method for generating hints from similarity of
    protein sequence data to genome data (alternative to the default
    --epmode/--etpmode in which ProtHint is used to generate the protein
    hints). Available methods are: gth (GenomeThreader), exonerate
    (Exonerate), or spaln (Spaln2). Note that this option is suitable only
    for proteins of closely related species (while the --epmode is generally
    applicable). This option is required in case --prot_aln option is used.
--gth2traingenes
    Generate training gene structures for AUGUSTUS from GenomeThreader
    alignments. (These genes can either be used for training AUGUSTUS alone
    with --trainFromGth; or in addition to GeneMark-ET training genes if
    also a bam-file is supplied.)
--trainFromGth
    No GeneMark-Training, train AUGUSTUS from GenomeThreader alignments
--makehub
    Create track data hub with make_hub.py for visualizing BRAKER results
    with the UCSC GenomeBrowser
--email
    E-mail address for creating track data hub
--version
    Print version number of braker.pl
--help
    Print this help message

CONFIGURATION OPTIONS (TOOLS CALLED BY BRAKER)

--AUGUSTUS_CONFIG_PATH=/path/
    Set path to config directory of AUGUSTUS (if not specified as
    environment variable). BRAKER1 will assume that the directories ../bin
    and ../scripts of AUGUSTUS are located relative to the
    AUGUSTUS_CONFIG_PATH. If this is not the case, please specify
    AUGUSTUS_BIN_PATH (and AUGUSTUS_SCRIPTS_PATH if required). The braker.pl
    commandline argument --AUGUSTUS_CONFIG_PATH has higher priority than the
    environment variable with the same name.
--AUGUSTUS_BIN_PATH=/path/
    Set path to the AUGUSTUS directory that contains binaries, i.e. augustus
    and etraining. This variable must only be set if AUGUSTUS_CONFIG_PATH
    does not have ../bin and ../scripts of AUGUSTUS relative to its location
    i.e. for global AUGUSTUS installations. BRAKER1 will assume that the
    directory ../scripts of AUGUSTUS is located relative to the
    AUGUSTUS_BIN_PATH. If this is not the case, please specify
    --AUGUSTUS_SCRIPTS_PATH.
--AUGUSTUS_SCRIPTS_PATH=/path/
    Set path to AUGUSTUS directory that contains scripts, i.e.
    splitMfasta.pl. This variable must only be set if AUGUSTUS_CONFIG_PATH
    or AUGUSTUS_BIN_PATH do not contains the ../scripts directory of
    AUGUSTUS relative to their location, i.e. for special cases of a global
    AUGUSTUS installation.
--BAMTOOLS_PATH=/path/to/
    Set path to bamtools (if not specified as environment BAMTOOLS_PATH
    variable). Has higher priority than the environment variable.
--GENEMARK_PATH=/path/to/
    Set path to GeneMark-ET (if not specified as environment GENEMARK_PATH
    variable). Has higher priority than environment variable.
--SAMTOOLS_PATH=/path/to/
    Optionally set path to samtools (if not specified as environment
    SAMTOOLS_PATH variable) to fix BAM files automatically, if necessary.
    Has higher priority than environment variable.
--PROTHINT_PATH=/path/to/
    Set path to the directory with prothint.py. (if not specified as
    PROTHINT_PATH environment variable). Has higher priority than
    environment variable.
--ALIGNMENT_TOOL_PATH=/path/to/tool
    Set path to alignment tool (GenomeThreader, Spaln, or Exonerate) if not
    specified as environment ALIGNMENT_TOOL_PATH variable. Has higher
    priority than environment variable.
--DIAMOND_PATH=/path/to/diamond
    Set path to diamond, this is an alternative to NCIB blast; you only need
    to specify one out of DIAMOND_PATH or BLAST_PATH, not both. DIAMOND is a
    lot faster that BLAST and yields highly similar results for BRAKER.
--BLAST_PATH=/path/to/blastall
    Set path to NCBI blastall and formatdb executables if not specified as
    environment variable. Has higher priority than environment variable.
--PYTHON3_PATH=/path/to
    Set path to python3 executable (if not specified as envirnonment
    variable and if executable is not in your $PATH).
--JAVA_PATH=/path/to
    Set path to java executable (if not specified as environment variable
    and if executable is not in your $PATH), only required with flags
    --UTR=on and --addUTR=on
--GUSHR_PATH=/path/to
    Set path to gushr.py exectuable (if not specified as an environment
    variable and if executable is not in your $PATH), only required with the
    flags --UTR=on and --addUTR=on
--MAKEHUB_PATH=/path/to
    Set path to make_hub.py (if option --makehub is used).
--CDBTOOLS_PATH=/path/to
    cdbfasta/cdbyank are required for running
    fix_in_frame_stop_codon_genes.py. Usage of that script can be skipped
    with option '--skip_fixing_broken_genes'.

EXPERT OPTIONS

--augustus_args="--some_arg=bla"
    One or several command line arguments to be passed to AUGUSTUS, if
    several arguments are given, separate them by whitespace, i.e.
    "--first_arg=sth --second_arg=sth".
--skipGeneMark-ES
    Skip GeneMark-ES and use provided GeneMark-ES output (e.g. provided with
    --geneMarkGtf=genemark.gtf)
--skipGeneMark-ET
    Skip GeneMark-ET and use provided GeneMark-ET output (e.g. provided with
    --geneMarkGtf=genemark.gtf)
--skipGeneMark-EP
    Skip GeneMark-EP and use provided GeneMark-EP output (e.g. provided with
    --geneMarkGtf=genemark.gtf)
--skipGeneMark-ETP
    Skip GeneMark-ETP and use provided GeneMark-ETP output (e.g. provided
    with --geneMarkGtf=genemark.gtf)
--geneMarkGtf=file.gtf
    If skipGeneMark-ET is used, braker will by default look in the working
    directory in folder GeneMarkET for an already existing gtf file.
    Instead, you may provide such a file from another location. If
    geneMarkGtf option is set, skipGeneMark-ES/ET/EP/ETP is automatically
    also set. Note that gene and transcript ids in the final output may not
    match the ids in the input genemark.gtf because BRAKER internally
    re-assigns these ids.
--rounds
    The number of optimization rounds used in optimize_augustus.pl
    (default 5)
--skipAllTraining
    Skip GeneMark-EX (training and prediction), skip AUGUSTUS training, only
    runs AUGUSTUS with pre-trained and already existing parameters (not
    recommended). Hints from input are still generated. This option
    automatically sets --useexisting to true.
--useexisting
    Use the present config and parameter files if they exist for 'species';
    will overwrite original parameters if BRAKER performs an AUGUSTUS
    training.
--filterOutShort
    It may happen that a "good" training gene, i.e. one that has intron
    support from RNA-Seq in all introns predicted by GeneMark-EX, is in fact
    too short. This flag will discard such genes that have supported introns
    and a neighboring RNA-Seq supported intron upstream of the start codon
    within the range of the maximum CDS size of that gene and with a
    multiplicity that is at least as high as 20% of the average intron
    multiplicity of that gene.
--skipOptimize
    Skip optimize parameter step (not recommended).
--skipIterativePrediction
    Skip iterative prediction in --epmode (does not affect other modes,
    saves a bit of runtime)
--skipGetAnnoFromFasta
    Skip calling the python3 script getAnnoFastaFromJoingenes.py from the
    AUGUSTUS tool suite. This script requires python3, biopython and re
    (regular expressions) to be installed. It produces coding sequence and
    protein FASTA files from AUGUSTUS gene predictions and provides
    information about genes with in-frame stop codons. If you enable this
    flag, these files will not be produced and python3 and the required
    modules will not be necessary for running braker.pl.
--skip_fixing_broken_genes
    If you do not have python3, you can choose to skip the fixing of stop
    codon including genes (not recommended).
--eval=reference.gtf
    Reference set to evaluate predictions against (using evaluation scripts
    from GaTech)
--eval_pseudo=pseudo.gff3
    File with pseudogenes that will be excluded from accuracy evaluation
    (may be empty file)
--AUGUSTUS_hints_preds=s
    File with AUGUSTUS hints predictions; will use this file as basis for
    UTR training; only UTR training and prediction is performed if this
    option is given.
--flanking_DNA=n
    Size of flanking region, must only be specified if
    --AUGUSTUS_hints_preds is given (for UTR training in a separate
    braker.pl run that builds on top of an existing run)
--verbosity=n
    0 -> run braker.pl quiet (no log)
    1 -> only log warnings
    2 -> also log configuration
    3 -> log all major steps
    4 -> very verbose, log also small steps
--downsampling_lambda=d
    The distribution of introns in training gene structures generated by
    GeneMark-EX has a huge weight on single-exon and few-exon genes.
    Specifying the lambda parameter of a poisson distribution will make
    braker call a script for downsampling of training gene structures
    according to their number of introns distribution, i.e. genes with none
    or few exons will be downsampled, genes with many exons will be kept.
    Default value is 2. If you want to avoid downsampling, you have to
    specify 0.
--checkSoftware
    Only check whether all required software is installed, no execution of
    BRAKER
--nocleanup
    Skip deletion of all files that are typically not used in an annotation
    project after running braker.pl. (For tracking any problems with a
    braker.pl run, you might want to keep these files, therefore nocleanup
    can be activated.)

DEVELOPMENT OPTIONS (PROBABLY STILL DYSFUNCTIONAL)

--splice_sites=patterns
    list of splice site patterns for UTR prediction; default: GTAG, extend
    like this: --splice_sites=GTAG,ATAC,... this option only affects UTR
    training example generation, not gene prediction by AUGUSTUS
--overwrite
    Overwrite existing files (except for species parameter files) Beware,
    currently not implemented properly!
--extrinsicCfgFiles=file1,file2,...
    Depending on the mode in which braker.pl is executed, it may require one
    ore several extrinsicCfgFiles. Don't use this option unless you know
    what you are doing!
--stranded=+,-,+,-,...
    If UTRs are trained, i.e.~strand-specific bam-files are supplied and
    coverage information is extracted for gene prediction, create stranded
    ep hints. The order of strand specifications must correspond to the
    order of bam files. Possible values are +, -, . If stranded data is
    provided, ONLY coverage data from the stranded data is used to generate
    UTR examples! Coverage data from unstranded data is used in the
    prediction step, only. The stranded label is applied to coverage data,
    only. Intron hints are generated from all libraries treated as
    "unstranded" (because splice site filtering eliminates intron hints from
    the wrong strand, anyway).
--optCfgFile=ppx.cfg
    Optional custom config file for AUGUSTUS for running PPX (currently not
    implemented)
--grass
    Switch this flag on if you are using braker.pl for predicting genes in
    grasses with GeneMark-EX. The flag will enable GeneMark-EX to handle
    GC-heterogenicity within genes more properly. NOTHING IMPLEMENTED FOR
    GRASS YET!
--transmasked_fasta=file.fa
    Transmasked genome FASTA file for GeneMark-EX (to be used instead of the
    regular genome FASTA file).
--min_contig=INT
    Minimal contig length for GeneMark-EX, could for example be set to 10000
    if transmasked_fasta option is used because transmasking might introduce
    many very short contigs.
--translation_table=INT
    Change translation table from non-standard to something else. DOES NOT
    WORK YET BECAUSE BRAKER DOESNT SWITCH TRANSLATION TABLE FOR GENEMARK-EX,
    YET!
--gc_probability=DECIMAL
    Probablity for donor splice site pattern GC for gene prediction with
    GeneMark-EX, default value is 0.001
--gm_max_intergenic=INT
    Adjust maximum allowed size of intergenic regions in GeneMark-EX. If not
    used, the value is automatically determined by GeneMark-EX.

EXAMPLE

To run with RNA-Seq

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --bam=accepted_hits.bam
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=rnaseq.gff

To run with protein sequences

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --prot_seq=proteins.fa
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=prothint_augustus.gff
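When preparing a run, it can help to assemble the braker.pl call in one place and print it for review before executing it. The sketch below does this for the RNA-Seq case from the help text. The function name build_braker_cmd, the default of 8 cores and the default working directory braker_out are illustrative assumptions; the file and species names are placeholders. Only flags listed in the help output are used, and the --cores=N spelling assumes that the option accepts its value in the same --name=value form as the other options.

```shell
#!/usr/bin/env bash
# Sketch: build a braker.pl command line for an RNA-Seq run and print it.
# build_braker_cmd is a hypothetical helper; the defaults are illustrative.
# Assumption: --cores takes its value as --cores=N.

build_braker_cmd() {
    local genome="$1"
    local bam="$2"
    local species="$3"
    local cores="${4:-8}"             # assumed default, adjust to your job
    local workdir="${5:-braker_out}"  # assumed default working directory

    printf 'braker.pl --genome=%s --bam=%s --species=%s --cores=%s --workingdir=%s\n' \
        "$genome" "$bam" "$species" "$cores" "$workdir"
}
```

For example, build_braker_cmd genome.fa accepted_hits.bam speciesname prints the complete command with the default core count and working directory, which can then be checked and pasted into a job script.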
Documentation
For further information, visit the developers' website.