CESAR 2016


Introduction

CESAR 2.0 (Coding Exon Structure Aware Realigner 2.0) is a method to realign coding exons or genes to DNA sequences using a Hidden Markov Model. Compared to its predecessor [2], CESAR 2.0 is 77 times faster on average (132 times faster for large exons) and requires 30 times less memory. In addition, CESAR 2.0 improves the accuracy of comparative gene annotation through two new features. First, it substantially improves the identification of splice sites that have shifted over a larger distance, which makes the detection of the correct exon boundaries more accurate. Second, it provides a new gene mode that realigns entire genes at once. This mode is able to recognize complete intron deletions and annotates the larger joined exons that arose from such intron deletion events. ¹


Installed version(s)

This version is installed and currently available

on environment hpc-env/8.3:

  • CESAR/2.0-/intel-2019b_25-08-2020

Since CESAR is not versioned properly, we installed the version from the 25th of September 2020, which can be found and reproduced with this GitHub commit. If the developers have since made changes that you are interested in, let us know so that we can install a more recent commit.

Loading CESAR modules

To load the desired version of the module, use, for example, the commands

module load hpc-env/8.3
module load CESAR

Always remember: these commands are case-sensitive!


If you want to find out more about CESAR on the HPC cluster, you can use the command

module spider CESAR

This will show you basic information, e.g. a short description and the currently installed version.
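After loading the modules, it can be useful to confirm that the helper scripts used in the test run further down are actually reachable. The following is a minimal sketch, assuming that the CESAR module places formatGenePred.pl, annotateGenesViaCESAR.pl and bed2GenePred.pl on your PATH (the test run below calls them without a path, but we have not verified this for every setup). The function name check_tools is our own and not part of CESAR.

```shell
#!/usr/bin/env bash
# check_tools is a hypothetical helper, not part of CESAR or the module system.
# It prints, for every name given, whether a command of that name is on PATH,
# and returns 1 if at least one of them is missing.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Script names taken from the test run on this page.
check_tools formatGenePred.pl annotateGenesViaCESAR.pl bed2GenePred.pl \
  || echo "Hint: run 'module load hpc-env/8.3' and 'module load CESAR' first."
```

If any script is reported as missing, load the modules as described above and run the check again.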

Using CESAR (test sequence)

The developers of CESAR provide a test data set. Since it illustrates well how CESAR is used, we reproduce the steps below.

# create directory on HPC
mkdir $WORK/CESAR_TEST
cd $WORK/CESAR_TEST

# download the data (7 GB in total)
wget -r -nH --cut-dirs=2 --reject "index.html*" https://bds.mpi-cbg.de/hillerlab/CESAR2.0_Example .
gzip -d multiz_5way.maf.gz

# define a few environment variables (see below for details)
export inputGenes=ensGene.gp
export reference=hg38
export twoBitDir=2bitDir
export alignment=multiz_5way.bb
export querySpecies=mm10,oryAfe1,galGal4,falChe1
export outputDir=CESARoutput
export resultsDir=geneAnnotation
# this sets maxMemory to the amount of RAM in your machine minus 1 GB (MemTotal is reported in kB)
export maxMemory=`grep MemTotal /proc/meminfo | awk '{print int($2/1000000)-1}'`

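# create all CESAR jobs
# NOTE: ${profilePath} is used below but is not set in the variables above; export it first
# (assumption: it should point to the CESAR 2.0 installation directory)
: "${profilePath:?profilePath is not set}"
formatGenePred.pl ${inputGenes} ${inputGenes}.forCESAR ${inputGenes}.discardedTranscripts -longest
for transcript in `cut -f1 ${inputGenes}.forCESAR`; do 
   echo "annotateGenesViaCESAR.pl ${transcript} ${alignment} ${inputGenes}.forCESAR ${reference} ${querySpecies} ${outputDir} ${twoBitDir} ${profilePath} -maxMemory ${maxMemory}"
done > jobList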

# realign all genes by executing the jobList or push it to your compute cluster
chmod +x jobList
./jobList
# This will run overnight (10.5 h on a single core)

# Convert results into genePred format (excluding 0-bp exons due to complete deletions)
for species in `echo $querySpecies | sed 's/,/ /g'`; do 
  echo "bed2GenePred.pl $species $outputDir /dev/stdout | awk '{if (\$4 != \$5) print \$0}' > $resultsDir/$species.gp"
done > jobListGenePred
chmod +x jobListGenePred
mkdir $resultsDir
./jobListGenePred
# This will take ~15 minutes

# cleanup
rm -rf $outputDir
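
The steps above mention that the jobList can be pushed to a compute cluster instead of being run on a single core. One way to prepare for that is to split the jobList into several smaller job lists. The following is a minimal sketch, assuming that the individual jobs are independent of each other (one job per transcript). The variable nChunks and the file prefix jobList.part. are illustrative choices, and the toy jobList only contains harmless echo commands in place of the real annotateGenesViaCESAR.pl calls. How the chunks are submitted depends on your scheduler and is not shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-in for the real jobList: 10 harmless commands, one per line.
workdir=$(mktemp -d)
cd "$workdir"
for i in $(seq 1 10); do
  echo "echo job ${i}"
done > jobList

# Distribute the lines round-robin over nChunks files
# (jobList.part.00, jobList.part.01, ...). Both names are illustrative.
nChunks=4
awk -v n="$nChunks" '{
  out = sprintf("jobList.part.%02d", (NR - 1) % n)
  print > out
}' jobList

# Each chunk is itself a small job list and can be run independently,
# e.g. with "bash jobList.part.00".
for part in jobList.part.*; do
  echo "${part}: $(wc -l < "$part") jobs"
done
```

Round-robin assignment keeps the chunks at nearly the same number of jobs. It does not balance run time, since individual transcripts can take very different amounts of time.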


Documentation

Although we have quoted parts of the official documentation above, more details can be found in the official documentation here.