CESAR 2016
Introduction
CESAR 2.0 (Coding Exon Structure Aware Realigner 2.0) is a method to realign coding exons or genes to DNA sequences using a Hidden Markov Model. Compared to its predecessor [2], CESAR 2.0 is 77 times faster on average (132 times faster for large exons) and requires 30 times less memory. In addition, CESAR 2.0 improves the accuracy of comparative gene annotation through two new features. First, CESAR 2.0 substantially improves the identification of splice sites that have shifted over a larger distance, which improves the accuracy of detecting the correct exon boundaries. Second, CESAR 2.0 provides a new gene mode that realigns entire genes at once. This mode is able to recognize complete intron deletions and will annotate larger joined exons that arose by intron deletion events. ¹
Installed version(s)
This version is installed and currently available
on environment hpc-env/8.3:
- CESAR/2.0-/intel-2019b_25-08-2020
Since CESAR is not properly versioned, we installed the state from the 25th of September 2020, which can be found and reproduced with this GitHub commit. If the developers have made changes that you are interested in, let us know so that we can install a more recent CESAR commit.
Loading CESAR modules
To load the desired version of the module, use the following commands:
module load hpc-env/8.3
module load CESAR
Always remember: this command is case sensitive!
If you want to find out more about CESAR on the HPC cluster, you can use the command
module spider CESAR
This will show you basic information, e.g. a short description and the currently installed version.
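After loading the module, it can be useful to confirm that the helper scripts used in the test run further down are actually reachable. The following is a minimal sketch, assuming the module places these scripts on your PATH (not verified here); `check_tools` is a hypothetical helper name, not part of CESAR or the module system.

```shell
# Hypothetical helper: report which of the given commands are found on PATH.
# Returns 0 if all were found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Script names taken from the test run on this page.
check_tools formatGenePred.pl annotateGenesViaCESAR.pl bed2GenePred.pl \
    || echo "At least one script was not found; check that the CESAR module is loaded."
```

If a script is reported as missing, reloading the module and checking the exact spelling of the module name is a sensible first step.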
Using CESAR (test sequence)
The developers of CESAR provide a test data set. Since it illustrates how CESAR is typically used, we have copied the steps below.
# create directory on HPC
mkdir $WORK/CESAR_TEST
cd $WORK/CESAR_TEST

# download the data (7 GB in total)
wget -r -nH --cut-dirs=2 --reject "index.html*" https://bds.mpi-cbg.de/hillerlab/CESAR2.0_Example .
gzip -d multiz_5way.maf.gz

# define a few environment variables (see below for details)
export inputGenes=ensGene.gp
export reference=hg38
export twoBitDir=2bitDir
export alignment=multiz_5way.bb
export querySpecies=mm10,oryAfe1,galGal4,falChe1
export outputDir=CESARoutput
export resultsDir=geneAnnotation
# this will set the maxMemory to the amount of RAM in your machine - 1 Gb
export maxMemory=`grep MemTotal /proc/meminfo | awk '{print int($2/1000000)-1}'`

# create all CESAR jobs
formatGenePred.pl ${inputGenes} ${inputGenes}.forCESAR ${inputGenes}.discardedTranscripts -longest
for transcript in `cut -f1 ${inputGenes}.forCESAR`; do
  echo "annotateGenesViaCESAR.pl ${transcript} ${alignment} ${inputGenes}.forCESAR ${reference} ${querySpecies} ${outputDir} ${twoBitDir} ${profilePath} -maxMemory ${maxMemory}"
done > jobList

# realign all genes by executing the jobList or push it to your compute cluster
chmod +x jobList
./jobList
# This will run over night (10.5 h on a single core)

# Convert results into genePred format (excluding 0-bp exons due to complete deletions)
for species in `echo $querySpecies | sed 's/,/ /g'`; do
  echo "bed2GenePred.pl $species $outputDir /dev/stdout | awk '{if (\$4 != \$5) print \$0}' > $resultsDir/$species.gp"
done > jobListGenePred
chmod +x jobListGenePred
mkdir $resultsDir
./jobListGenePred
# This will take ~15 minutes

# cleanup
rm -rf $outputDir
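The two awk one-liners in the steps above can be tried in isolation before starting the long run. The sketch below feeds them made-up input: the `MemTotal` value and the two genePred lines are hypothetical, and the variable names (`sample_meminfo`, `toy_genepred`, and so on) exist only for this illustration. It assumes the standard genePred column order, in which columns 4 and 5 are the transcript start and end.

```shell
# Sample line in /proc/meminfo format (the value is made up for illustration).
sample_meminfo='MemTotal:       16384000 kB'

# Same arithmetic as the maxMemory line above:
# kB divided by 1,000,000 is roughly GB; 1 GB is subtracted as headroom.
max_memory=$(printf '%s\n' "$sample_meminfo" | awk '{print int($2/1000000)-1}')
echo "maxMemory would be: ${max_memory}"

# Two toy genePred lines (hypothetical data). Assumed columns:
# name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds
toy_genepred=$(printf 'tx1\tchr1\t+\t100\t250\t100\t250\t1\t100,\t250,\ntx2\tchr1\t+\t300\t300\t300\t300\t1\t300,\t300,\n')

# Same filter as in the genePred conversion step:
# keep a line only if column 4 differs from column 5, i.e. the entry is not 0 bp long.
kept=$(printf '%s\n' "$toy_genepred" | awk '{if ($4 != $5) print $0}')
printf '%s\n' "$kept"
```

Only the `tx1` line survives the filter, because `tx2` starts and ends at the same coordinate.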
Documentation
We have quoted parts of the official documentation above; the full documentation can be found here.