CellRanger 2016


Introduction

Cell Ranger is a set of analysis pipelines that process Chromium single-cell data to align reads, generate feature-barcode matrices, perform clustering and other secondary analysis, and more. Cell Ranger includes five pipelines relevant to the 3' and 5' Single Cell Gene Expression Solutions and related products:

  • cellranger mkfastq demultiplexes raw base call (BCL) files generated by Illumina sequencers into FASTQ files. It is a wrapper around Illumina's bcl2fastq, with additional features that are specific to 10x libraries and a simplified sample sheet format.
  • cellranger count takes FASTQ files from cellranger mkfastq and performs alignment, filtering, barcode counting, and UMI counting. It uses the Chromium cellular barcodes to generate feature-barcode matrices, determine clusters, and perform gene expression analysis. The count pipeline can take input from multiple sequencing runs on the same GEM well. cellranger count also processes Feature Barcode data alongside Gene Expression reads.
  • cellranger aggr aggregates outputs from multiple runs of cellranger count, normalizing those runs to the same sequencing depth and then recomputing the feature-barcode matrices and analysis on the combined data. The aggr pipeline can be used to combine data from multiple samples into an experiment-wide feature-barcode matrix and analysis.
  • cellranger reanalyze takes feature-barcode matrices produced by cellranger count or cellranger aggr and reruns the dimensionality reduction, clustering, and gene expression algorithms using tunable parameter settings.
  • cellranger multi is used to analyze Cell Multiplexing data. It inputs FASTQ files from cellranger mkfastq and performs alignment, filtering, barcode counting, and UMI counting. It uses the Chromium cellular barcodes to generate feature-barcode matrices, determine clusters, and perform gene expression analysis. The cellranger multi pipeline also supports the analysis of Feature Barcode data.
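
As an illustration of how these pipelines connect, a minimal workflow first demultiplexes the sequencer output and then counts the resulting FASTQ files. All IDs, paths and sample names below are placeholders that you need to replace with your own data:

# demultiplex the raw BCL run folder into FASTQ files
cellranger mkfastq --id=my_fastqs --run=/path/to/bcl_run_folder --csv=samplesheet.csv

# align and count the FASTQ files of one sample against a reference transcriptome
cellranger count --id=my_sample_run --transcriptome=/path/to/refdata --fastqs=my_fastqs/outs/fastq_path --sample=my_sample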

Installed version(s)

The following versions are currently available...

... on environment hpc-env/8.3:

  • CellRanger/6.1.1
  • CellRanger/7.1.0

... on environment hpc-env/6.4:

  • CellRanger/6.1.1
  • CellRanger/7.1.0

... on environment hpc-uniol-env:

  • CellRanger/6.1.1
  • CellRanger/7.1.0

Loading / Using CellRanger

To load the desired version of the module, use the module load command, e.g.

module load hpc-env/8.3
module load CellRanger/6.1.1

Always remember: this command is case sensitive!
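
For example, to see which CellRanger versions are installed in the currently loaded environment and to verify that the module has been loaded correctly, you can use:

module avail CellRanger        # list the CellRanger modules available in this environment
cellranger --version           # print the version of the cellranger binary found in your PATH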


To find out how to use CellRanger, you can simply type cellranger -h to print a help text that gets you started:

$ cellranger -h
cellranger cellranger-6.1.1
Process 10x Genomics Gene Expression, Feature Barcode, and Immune Profiling
data

USAGE:
    cellranger <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    count               Count gene expression (targeted or whole-
                        transcriptome) and/or feature barcode reads from a
                        single sample and GEM well
    multi               Analyze multiplexed data or combined gene
                        expression/immune profiling/feature barcode data
    vdj                 Assembles single-cell VDJ receptor sequences from
                        10x Immune Profiling libraries
    aggr                Aggregate data from multiple Cell Ranger runs
    reanalyze           Re-run secondary analysis (dimensionality
                        reduction, clustering, etc)
    targeted-compare    Analyze targeted enrichment performance by
                        comparing a targeted sample to its cognate parent
                        WTA sample (used as input for targeted gene
                        expression)
    targeted-depth      Estimate targeted read depth values (mean reads
                        per cell) for a specified input parent WTA sample
                        and a target panel CSV file
    mkvdjref            Prepare a reference for use with CellRanger VDJ
    mkfastq             Run Illumina demultiplexer on sample sheets that
                        contain 10x-specific sample index sets
    testrun             Execute the 'count' pipeline on a small test
                        dataset
    mat2csv             Convert a gene count matrix to CSV format
    mkref               Prepare a reference for use with 10x analysis
                        software. Requires a GTF and FASTA
    mkgtf               Filter a GTF file by attribute prior to creating a
                        10x reference
    upload              Upload analysis logs to 10x Genomics support
    sitecheck           Collect linux system configuration information
    help                Prints this message or the help of the given
                        subcommand(s)


Additionally, we have included some reference files (References - 2020-A (July 7, 2020)), which are located in the folder data inside the software path $EBROOTCELLRANGER.
To make access to these files easier, we created the environment variable $CELLRANGER_DATA, which points to that directory:

$ ls $CELLRANGER_DATA 
chromium-shared-sample-indexes-plate.csv
chromium-shared-sample-indexes-plate.json
chromium-single-cell-sample-indexes-plate-v1.csv
chromium-single-cell-sample-indexes-plate-v1.json
gemcode-single-cell-sample-indexes-plate.csv
gemcode-single-cell-sample-indexes-plate.json
refdata-gex-GRCh38-2020-A
refdata-gex-GRCh38-and-mm10-2020-A
refdata-gex-mm10-2020-A
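
The refdata-gex-* directories are prebuilt reference transcriptomes. For example, to inspect the human reference and to see the path you would pass to the --transcriptome option of cellranger count:

echo $CELLRANGER_DATA                              # print the path to the shared reference directory
ls $CELLRANGER_DATA/refdata-gex-GRCh38-2020-A      # have a look at the prebuilt human (GRCh38) reference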

A simple test command can be executed with

cellranger testrun --id=tiny

which runs for a few minutes and creates a directory tiny containing the results from the test run.

Job Script vs. Cluster Mode

There are in principle two options to run CellRanger on the cluster:

1. Using a Job Script

As explained in the documentation, you can use a standard Slurm job script to run CellRanger on a compute node. For the test run above, the job script could look like this:

#!/usr/bin/env bash
# =============================================================================
# Slurm Options (modify to your needs)
# =============================================================================
#SBATCH -J CellRangerTestrun
#SBATCH --partition carl.p
#SBATCH --time 0-24:00:00             # time format d-hh:mm:ss
#SBATCH --nodes=1 --ntasks=1          # do not change
#SBATCH --cpus-per-task=4             # adjust as needed
#SBATCH --signal=2
#SBATCH --no-requeue
#SBATCH --mem=20G                     # adjust as needed
#SBATCH -o CellRanger_%j.out          # log file for STDOUT, %j is job id
#SBATCH -e CellRanger_%j.err          # log file for STDERR

# calculate memory limit in GB (90% of the memory allocated by Slurm; SLURM_MEM_PER_NODE is given in MB)
MEM_GB=$((9*SLURM_MEM_PER_NODE/10240))

# pipeline command (replace with the command you would like to run)
# keep the options --jobmode and --local*
cellranger testrun --id=tiny --jobmode=local --localcores=${SLURM_CPUS_PER_TASK} --localmem=${MEM_GB}

The job script above can be saved e.g. as CellRanger_testrun.sh and then submitted with

sbatch CellRanger_testrun.sh

For real applications you can replace the testrun command with the pipeline command you want to run, as in the example below. The --jobmode and --local* options allow CellRanger to use the resources allocated for the job. Their values are taken automatically from the SBATCH options --cpus-per-task and --mem and can be adjusted there as needed. Note that you may need to run some benchmark tests to find an optimal value of --cpus-per-task for the different steps of a pipeline.
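
For instance, a count run using one of the prebuilt references from $CELLRANGER_DATA could replace the testrun line in the job script above. The FASTQ path and sample name are placeholders for your own data:

# pipeline command (replaces the testrun line in the job script above)
cellranger count --id=my_sample_run \
                 --transcriptome=$CELLRANGER_DATA/refdata-gex-GRCh38-2020-A \
                 --fastqs=/path/to/fastqs \
                 --sample=my_sample \
                 --jobmode=local \
                 --localcores=${SLURM_CPUS_PER_TASK} \
                 --localmem=${MEM_GB}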

2. Using the Cluster Mode

Instead of writing a job script, you can also use CellRanger's built-in cluster mode. Here you execute the following commands on a login node:

screen                         # start a screen terminal (it allows you to disconnect from the cluster while the pipeline is running)
module load CellRanger         # maybe specify the version as well
cellranger testrun --id=tiny --jobmode=slurm

This will start the testrun pipeline, but instead of running it locally, CellRanger will create around 200 smaller jobs which are submitted to the cluster. The job submission is done with a template in /cm/shared/uniol/scripts/CellRanger, and CellRanger automatically sets the number of cores and the amount of memory to use. The example runs for almost an hour because of the overhead created by submitting the 200 jobs and the extra time it takes Slurm to manage them. In real applications this mode can still be beneficial because several jobs can run in parallel, whereas in the job script above all steps are executed one after another.

While the pipeline is running, you can check its status in the terminal. You can also use the screen mechanism to detach your session: press CTRL-A D and log out from the cluster, but remember which hpcl00x login node you were connected to. When you log back in to the same login node, you can reattach your screen session with screen -r and check the progress.
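
The jobs created by the cluster mode also show up in the usual Slurm tools, so you can monitor them from any terminal on a login node, e.g. with:

squeue -u $USER        # list your pending and running jobs, including those submitted by CellRanger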

If you want, you can also modify the template script. Copy it from the location above to e.g. your $HOME, modify it, and use the option --jobmode=$HOME/slurm.template. That way you can, for example, use a different partition.
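
A possible way to do this, assuming the template file in that directory is called slurm.template (as suggested by the --jobmode example above), is:

# copy the template to your home directory and adjust it, e.g. to use a different partition
cp /cm/shared/uniol/scripts/CellRanger/slurm.template $HOME/slurm.template
nano $HOME/slurm.template                                   # or any other editor
cellranger testrun --id=tiny --jobmode=$HOME/slurm.template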

Documentation

More information and a tutorial can be found in the official 10x Genomics Cell Ranger documentation.