SHAPEIT4 2016


Introduction

SHAPEIT4 is a fast and accurate method for estimating haplotypes (also known as phasing) from SNP array and high-coverage sequencing data. Version 4 is a refactored and improved version of the SHAPEIT algorithm with several key additional features:

  • It includes a Positional Burrows-Wheeler Transform (PBWT) based approach to quickly select a small set of informative conditioning haplotypes to be used when updating the phase of an individual.
  • We have changed the way in which phase information in sequencing reads is input into the model. We now recommend the use of the WhatsHap tool as a pre-processing step to extract phase information from a BAM file.
  • It accounts for sets of pre-phased genotypes (i.e. a haplotype scaffold). The scaffold can be derived either from family data or from large reference panels.
  • It reads and writes files using HTSlib for better I/O performance in either VCF or BCF formats.
  • The genotype graph and HMM routines have been re-implemented for better hardware usage and performance. [1]


Installed version(s)

The following versions are installed and currently available...

... on environment hpc-env/8.3:

  • SHAPEIT4/4.2.2-foss-2019b

Loading / Using SHAPEIT4

To load the desired version of the module, use the module load command, e.g.

module load hpc-env/8.3
module load SHAPEIT4

Always remember: this command is case-sensitive!
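
If you want your jobs to keep using the same build even after newer versions are installed, you can load the module by its full name instead of the default. The following is a minimal sketch: the module name is taken from the list of installed versions above, while check_shapeit is a hypothetical helper added here only to confirm that the shapeit4.2 binary is on your PATH.

```shell
# Full module name as listed under "Installed version(s)".
SHAPEIT_MODULE="SHAPEIT4/4.2.2-foss-2019b"

# Hypothetical helper: report whether the shapeit4.2 binary is on PATH.
check_shapeit() {
    if command -v shapeit4.2 >/dev/null 2>&1; then
        echo "shapeit4.2 found"
    else
        echo "shapeit4.2 not found; load the module first"
        return 1
    fi
}

# The `module` command only exists on the cluster, so guard the calls.
if command -v module >/dev/null 2>&1; then
    module load hpc-env/8.3
    module load "$SHAPEIT_MODULE"
fi

check_shapeit || true
```

On a login node with the module loaded, the helper should print the "found" message; anywhere else it prints the hint to load the module.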


To find out how to use SHAPEIT4, run shapeit4.2 --help to print a help text that will get you started:

$ shapeit4.2 --help

Basic options:
  --help                                Produce help message
  --seed arg (=15052011)                Seed of the random number generator
  -T [ --thread ] arg (=1)              Number of thread used

Input files:
  -I [ --input ] arg                    Genotypes to be phased in VCF/BCF 
                                        format
  -H [ --reference ] arg                Reference panel of haplotypes in 
                                        VCF/BCF format
  -S [ --scaffold ] arg                 Scaffold of haplotypes in VCF/BCF 
                                        format
  -M [ --map ] arg                      Genetic map
  -R [ --region ] arg                   Target region
  --use-PS arg                          Informs phasing using PS field from 
                                        read based phasing
  --sequencing                          Default parameter setting for 
                                        sequencing data (this divides by 50 the
                                        default value of --pbwt-modulo)

MCMC parameters:
  --mcmc-iterations arg (=5b,1p,1b,1p,1b,1p,5m)
                                        Iteration scheme of the MCMC
  --mcmc-prune arg (=0.999)             Pruning threshold for genotype graphs

PBWT parameters:
  --pbwt-modulo arg (=0.02)             Storage frequency of PBWT indexes in cM
                                        (i.e. storage every 0.02 cM by default)
  --pbwt-depth arg (=4)                 Depth of PBWT indexes to condition on
  --pbwt-mac arg (=2)                   Minimal Minor Allele Count at which 
                                        PBWT is evaluated
  --pbwt-mdr arg (=0.5)                 Maximal Missing Data Rate at which PBWT
                                        is evaluated
  --pbwt-disable-init                   Disable initialization by PBWT sweep

IBD2 parameters [DEPRECATED]:
  --ibd2-length arg (=3)                DEPRECATED
  --ibd2-maf arg (=0.01)                DEPRECATED
  --ibd2-mdr arg (=0.5)                 DEPRECATED
  --ibd2-count arg (=100)               DEPRECATED
  --ibd2-output arg                     DEPRECATED

HMM parameters:
  -W [ --window ] arg (=2.5)            Minimal size of the phasing window in 
                                        cM
  --effective-size arg (=15000)         Effective size of the population

Output files:
  -O [ --output ] arg                   Phased haplotypes in VCF/BCF format
  --bingraph arg                        Phased haplotypes in BIN format [Useful
                                        to sample multiple likely haplotype 
                                        configurations per sample]
  --log arg                             Log file
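
Putting the options above together, a basic phasing run might look like the sketch below. It uses only options that appear in the help text (--input, --map, --region, --output, --thread, --log). All file names, the region and the thread count are hypothetical placeholders that you need to replace with your own values. By default the script only prints the command; set DRY_RUN to an empty value to actually run it.

```shell
# Hypothetical inputs: replace with your own files and region.
INPUT="unphased.chr20.bcf"     # genotypes to be phased (VCF/BCF)
MAP="chr20.gmap.gz"            # genetic map
REGION="20"                    # target region
OUTPUT="phased.chr20.bcf"      # phased haplotypes (VCF/BCF)
LOG="phased.chr20.log"         # log file
THREADS=4                      # number of threads

# DRY_RUN=1 (default) prints the command instead of executing it.
DRY_RUN="${DRY_RUN-1}"

run_shapeit() {
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty.
    ${DRY_RUN:+echo} shapeit4.2 \
        --input "$INPUT" \
        --map "$MAP" \
        --region "$REGION" \
        --output "$OUTPUT" \
        --thread "$THREADS" \
        --log "$LOG"
}

run_shapeit
```

To phase against a reference panel or a scaffold, the same pattern applies with --reference or --scaffold added, each pointing to a VCF/BCF file as described in the help text.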


Documentation

The full documentation can be found here.