SHAPEIT4 is a fast and accurate method for estimating haplotypes (also known as phasing) from SNP array and high-coverage sequencing data. Version 4 is a refactored and improved version of the SHAPEIT algorithm with several key additional features:
- It includes a Positional Burrows-Wheeler Transform (PBWT) based approach to quickly select a small set of informative conditioning haplotypes to be used when updating the phase of an individual.
- We have changed the way in which phase information in sequencing reads is input into the model. We now recommend the use of the WhatsHap tool as a pre-processing step to extract phase information from a BAM file.
- It accounts for sets of pre-phased genotypes (i.e. a haplotype scaffold). The scaffold can be derived either from family data or from large reference panels.
- It reads and writes files using HTSlib for better I/O performance in either VCF or BCF formats.
- The genotype graph and HMM routines have been re-implemented for better hardware usage and performance.
The following versions are installed and currently available...
... on environment hpc-env/8.3:
Loading / Using SHAPEIT4
To load the desired version of the module, use the module load command, e.g.
module load hpc-env/8.3
module load SHAPEIT4
Always remember: this command is case sensitive!
To find out how to use SHAPEIT4, run shapeit4.2 --help to print a help text that gets you started:
$ shapeit4.2 --help

Basic options:
  --help                      Produce help message
  --seed arg (=15052011)      Seed of the random number generator
  -T [ --thread ] arg (=1)    Number of thread used

Input files:
  -I [ --input ] arg          Genotypes to be phased in VCF/BCF format
  -H [ --reference ] arg      Reference panel of haplotypes in VCF/BCF format
  -S [ --scaffold ] arg       Scaffold of haplotypes in VCF/BCF format
  -M [ --map ] arg            Genetic map
  -R [ --region ] arg         Target region
  --use-PS arg                Informs phasing using PS field from read based phasing
  --sequencing                Default parameter setting for sequencing data (this divides by 50 the default value of --pbwt-modulo)

MCMC parameters:
  --mcmc-iterations arg (=5b,1p,1b,1p,1b,1p,5m)
                              Iteration scheme of the MCMC
  --mcmc-prune arg (=0.999)   Pruning threshold for genotype graphs

PBWT parameters:
  --pbwt-modulo arg (=0.02)   Storage frequency of PBWT indexes in cM (i.e. storage every 0.02 cM by default)
  --pbwt-depth arg (=4)       Depth of PBWT indexes to condition on
  --pbwt-mac arg (=2)         Minimal Minor Allele Count at which PBWT is evaluated
  --pbwt-mdr arg (=0.5)       Maximal Missing Data Rate at which PBWT is evaluated
  --pbwt-disable-init         Disable initialization by PBWT sweep

IBD2 parameters [DEPRECATED]:
  --ibd2-length arg (=3)      DEPRECATED
  --ibd2-maf arg (=0.01)      DEPRECATED
  --ibd2-mdr arg (=0.5)       DEPRECATED
  --ibd2-count arg (=100)     DEPRECATED
  --ibd2-output arg           DEPRECATED

HMM parameters:
  -W [ --window ] arg (=2.5)  Minimal size of the phasing window in cM
  --effective-size arg (=15000)
                              Effective size of the population

Output files:
  -O [ --output ] arg         Phased haplotypes in VCF/BCF format
  --bingraph arg              Phased haplotypes in BIN format [Useful to sample multiple likely haplotype configurations per sample]
  --log arg                   Log file
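As a starting point, the options listed above can be combined into a single phasing call. The following is only a minimal sketch: the file names, the region and the thread count are hypothetical placeholders, not files provided on the cluster, and the script only uses options that appear in the help text. It prints the assembled command and runs it only if shapeit4.2 is found on the PATH (for example after loading the module).

```shell
#!/bin/sh
# Sketch of a SHAPEIT4 phasing call built from options in the help text.
# All values below are hypothetical placeholders; replace them with your own data.
INPUT=target.chr20.vcf.gz    # genotypes to be phased (--input)
MAP=chr20.gmap.gz            # genetic map (--map)
REGION=20                    # target region (--region)
OUTPUT=phased.chr20.bcf      # phased haplotypes (--output)
THREADS=4                    # number of threads (--thread)

# Keep the command in the positional parameters so every argument stays one word.
set -- shapeit4.2 --input "$INPUT" --map "$MAP" --region "$REGION" \
       --output "$OUTPUT" --thread "$THREADS" --log phasing.log

# Print the command for inspection before anything is executed.
PHASING_CMD="$*"
echo "$PHASING_CMD"

# Execute only where the binary is available, e.g. after `module load SHAPEIT4`.
if command -v shapeit4.2 >/dev/null 2>&1; then
    "$@"
fi
```

For sequencing data, the help text suggests adding --sequencing, which adjusts the default value of --pbwt-modulo. A reference panel or a haplotype scaffold could be supplied with --reference or --scaffold in the same way.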
The full documentation can be found here.