Ipyrad

From HPC users

Latest revision as of 16:53, 12 November 2020

Introduction

The software ipyrad is an interactive toolkit for assembly and analysis of restriction-site associated genomic data sets (e.g., RAD, ddRAD, GBS) for population genetic and phylogenetic studies. [1]

Formerly, there was no global installation of ipyrad on our cluster, which is why this page also contains instructions for installing the package with conda in your own user environment.

Currently, the only installed version of ipyrad is available in the environment hpc-env/8.3 as:

  • ipyrad/0.9.62-foss-2019b-Python-3.7.4

Local User Installation

You can easily skip this step if you want to use our installed module of ipyrad on the cluster. We wrote this section at a time when ipyrad was not yet available as a module.


To install ipyrad you first need to load a module for Anaconda3. In this example, we use Anaconda3/2020.02 which can be found in hpc-env/8.3 (if you want to use a different version/environment you can search with module av Anaconda3 or module spider Anaconda3):

[carl]$ module load hpc-env/8.3
[carl]$ module load Anaconda3/2020.02

The next step is to create a new environment for ipyrad with the command:

[carl]$ conda create --name ipyrad
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##
  environment location: /user/abcd1234/.conda/envs/ipyrad

Proceed ([y]/n)?

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

The name of the environment can be chosen freely. The environment is created once you confirm by pressing Enter (optionally typing y first). You may see a warning about an outdated conda, which you can safely ignore (or, if you wish, you can switch to a newer Anaconda3 module if one is available).

The new environment can now be activated. We recommend using the command(*):

[carl]$ source activate ipyrad
(ipyrad) [carl]$

You will notice the change of the command-line prompt to indicate the active environment. Packages that are now installed with conda install will be installed in this environment and not interfere with other software installations.

(*) The alternative conda activate requires you to use the command conda init bash first which modifies your .bashrc and more or less forces you to always use the same version of Anaconda3.
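
Before installing packages, a script or interactive session can check that the intended environment is really active. The following is a minimal sketch, assuming that conda exports the variable CONDA_DEFAULT_ENV with the name of the active environment; the function name require_conda_env is only illustrative and not part of conda.

```shell
#!/bin/bash
# Sketch: succeed only if the named conda environment is active.
# Assumption: conda sets CONDA_DEFAULT_ENV to the active environment's name.
require_conda_env() {
    local wanted="$1"
    if [ "${CONDA_DEFAULT_ENV:-}" = "$wanted" ]; then
        return 0
    fi
    echo "conda environment '$wanted' is not active (current: '${CONDA_DEFAULT_ENV:-none}')" >&2
    return 1
}

# Example (commented out so that the block has no side effects):
# require_conda_env ipyrad && conda install ipyrad -c bioconda
```

Used this way, a mistyped or missing activation stops the installation instead of placing packages into another environment.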

Now you can install ipyrad along with the package mpi4py for parallel computing:

conda install ipyrad -c bioconda
conda install mpi4py -c conda-forge

These commands will take a moment to complete, but after that ipyrad is ready to use. The next time you log in, or in a job script, you only need the commands

module load hpc-env/8.3
module load Anaconda3/2020.02
source activate ipyrad

to get started. If you want to leave the environment you can always type

conda deactivate

which should return you to the normal command-line prompt.

Using ipyrad on CARL

Important notice!

The following instructions were written for the user-installed version of ipyrad. You can follow the same tutorial with the globally installed ipyrad module by loading the corresponding environment and module:

ml hpc-env/8.3
ml ipyrad

If you do so, you should not load Anaconda and must not use the command source activate ipyrad!


Setting up ipcluster

ipyrad needs an additional program on the computing cluster, called ipcluster, which has to be started before every ipyrad computation. ipcluster allocates the given CPU resources and makes them available to ipyrad. After ipcluster has been started, it may take a while for the virtual cluster to allocate the resources, so we should wait at least two minutes before calling ipyrad. Additionally, ipyrad must always be pointed to this cluster with the option --ipcluster:

module load hpc-env/8.3
module load ipyrad

ipcluster start -n 24 --daemonize                   # let ipcluster allocate 24 cores in the background
sleep 120                                           # give ipcluster time to initialize to prevent errors at the next step
ipyrad <commands, options> --ipcluster              # let ipyrad compute with the resources given by ipcluster

If you write a job script and allocate more than one node, ipcluster should be initialized like this:

 ipcluster start -n=${SLURM_NTASKS} --daemonize     # make use of every core of each node
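
The fixed two-minute pause is the approach used on this page. As an alternative, one could poll until the engines have registered and continue as soon as they are ready. The helper below is a generic sketch: wait_until is a hypothetical name, and the readiness check shown in the comment (counting engines through ipyparallel.Client().ids) is an assumption that has not been tested on CARL.

```shell
#!/bin/bash
# Sketch: retry a command until it succeeds or a timeout (in seconds) is reached.
# usage: wait_until <timeout> <interval> <command> [args...]
wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local waited=0
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1            # give up: the command never succeeded
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Possible use in place of "sleep 120" (assumed readiness check, adapt as needed):
# ipcluster start -n 24 --daemonize
# wait_until 120 5 python -c "import ipyparallel as ipp; assert len(ipp.Client().ids) >= 24"
```

If the check never succeeds, the function returns a non-zero exit code, so the calling script can decide whether to abort or to fall back to a fixed pause.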

Preparations and First Tests

Following the Introductory tutorial to the CLI you can start by downloading some test data and creating a parameter file in a new directory under $WORK:

mkdir $WORK/ipyrad_test
cd $WORK/ipyrad_test
curl -LkO https://eaton-lab.org/data/ipsimdata.tar.gz

      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 11.8M  100 11.8M    0     0  8514k      0  0:00:01  0:00:01 --:--:-- 8508k

tar -xzf ipsimdata.tar.gz
ipyrad -n iptest

    New file 'params-iptest.txt' created in /gss/work/abcd1234/ipyrad_test

The resulting file params-iptest.txt has to be opened in a text editor to add the locations of the raw non-demultiplexed fastq file and the barcodes file. With the test data the first couple of lines should look like this:

head params-iptest.txt
 
   ------- ipyrad params file (v.0.9.53)-------------------------------------------
   iptest                         ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
   /gss/work/abcd1234/ipyrad_test      ## [1] [project_dir]: Project dir (made in curdir if not present)
   /gss/work/abcd1234/ipyrad_test/ipsimdata/rad_example_R1_.fastq.gz      ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
   /gss/work/abcd1234/ipyrad_test/ipsimdata/rad_example_barcodes.txt      ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
   denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference)

We recommend using absolute file names, including the full path to each file, which allows you to move the parameter file to other locations (e.g. a job-specific directory).
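
Instead of opening a text editor, the two paths can also be written into the parameter file from the command line. The following is a sketch under assumptions: set_param is a hypothetical helper (not part of ipyrad), and it relies on every parameter line containing the marker "## [N] " as in the listing above. Values containing the characters |, & or \ would need extra escaping.

```shell
#!/bin/bash
# Sketch: replace the value in front of "## [index]" in an ipyrad params file.
# usage: set_param <params-file> <index> <new-value>
set_param() {
    local file="$1" index="$2" value="$3"
    local tmp
    tmp="$(mktemp)" || return 1
    # Everything before the marker "## [index] " is replaced by the new value.
    sed "s|^.*\(## \[${index}] \)|${value}    \1|" "$file" > "$tmp" || return 1
    cat "$tmp" > "$file"        # keep the original file's permissions
    rm -f "$tmp"
}

# Example (commented out; paths follow the test data used on this page):
# set_param params-iptest.txt 2 "$WORK/ipyrad_test/ipsimdata/rad_example_R1_.fastq.gz"
# set_param params-iptest.txt 3 "$WORK/ipyrad_test/ipsimdata/rad_example_barcodes.txt"
```

Because $WORK is expanded by the shell, the file ends up with absolute paths, as recommended above.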

A simple test can be performed with this command:

ipcluster start --daemonize         # ipyrad will use 8 cores for this test; see the section "Setting up ipcluster" above
sleep 120
ipyrad -p params-iptest.txt -s 1 -c 8 --ipcluster
   -------------------------------------------------------------
    ipyrad [v.0.9.53]
    Interactive assembly and analysis of RAD-seq data
   -------------------------------------------------------------
    Parallel connection | hpcl004: 8 cores

   Step 1: Demultiplexing fastq data to Samples

The program performs the first step of the workflow (-s 1) using a total of 8 cores (-c 8) on the login node. We can now remove the newly created data files and directories with

rm -r iptest_fastqs/ iptest.json

to avoid error messages in the next steps.

Job Script for Using Multiple Cores on a Single Compute Node

With the preparations from the previous step, we can now use a job script to run ipyrad on the compute nodes. The first example uses a single compute node with multiple cores. The job script could look like this:

#!/bin/bash
#SBATCH --partition carl.p
#SBATCH --nodes 1
#SBATCH --tasks-per-node 1
#SBATCH --cpus-per-task 12
#SBATCH --time 0-01:00:00
#SBATCH --mem-per-cpu 5000
#SBATCH --job-name ipyrad
#SBATCH --output ipyrad_output_%j.txt

## assembly name
assembly_name="iptest"

## load environment and module
module load hpc-env/8.3
module load ipyrad

## change into the directory where your params file resides
cd $WORK/ipyrad_test

## create, prepare and change to a job specific dir
jobdir="ipyrad_${SLURM_JOB_ID}"
params="params-${assembly_name}.txt"
mkdir $jobdir
sed "s#$(pwd) #$(pwd)/$jobdir#" $params > $jobdir/$params
cd $jobdir

## setting the number of available cores
cores=1
if [ -z "$SLURM_CPUS_PER_TASK" ]
then
    cores=${SLURM_NTASKS}
else
    cores=$((SLURM_NTASKS*SLURM_CPUS_PER_TASK))
fi

## start ipcluster, allocate all cores available and give it time to initialize
ipcluster start --n $cores --daemonize
sleep 120

## call ipyrad on your params file and perform 7 steps from the workflow
cmd="ipyrad -p $params -s 1234567 -c $cores --ipcluster"
echo "== starting ipyrad on $(hostname) at $(date) =="
echo "== command: $cmd"
eval $cmd
retval=$?
if [ $retval -ne 0 ]
then
   echo "Warning: exit code for command $cmd is non-zero (=$retval)"
fi
echo "== completed ipyrad on $(hostname) at $(date) =="
exit $retval

Some explanations:

  • the job script requests a single task with a total of 12 CPU cores (--cpus-per-task 12). Depending on the partition, the number of cores can be chosen between 1 and the maximum number of cores available per node. For carl.p this is 24. The number of cores selected is automatically passed to ipyrad on the command line near the end of the script.
  • other resources can be requested as needed, here e.g. 5000 MB of RAM per core.
  • the script assumes a base directory ($WORK/ipyrad_test) where the file params-iptest.txt and the job script are located. A job-specific subdirectory is created and params-iptest.txt is copied there (by the sed command, which also changes the project dir). If you do not want a job-specific directory, you can comment out the lines that contain jobdir (with or without the $).
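
The core calculation in the job script can also be written more compactly with default values. This is only a sketch of the same logic, assuming that Slurm sets SLURM_NTASKS and, when --cpus-per-task is requested, SLURM_CPUS_PER_TASK; the function name count_cores is illustrative.

```shell
#!/bin/bash
# Sketch: total number of cores = tasks x cpus per task,
# where an unset variable counts as 1.
count_cores() {
    local tasks="${SLURM_NTASKS:-1}"
    local cpus="${SLURM_CPUS_PER_TASK:-1}"
    echo $(( tasks * cpus ))
}

# Example (commented out): replace the if/else block in the job script with
# cores=$(count_cores)
```

With the request from the script above (1 task, 12 CPUs per task) this yields 12; with 2 nodes and 24 tasks per node (and no --cpus-per-task) it yields 48.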

When you save the job script as ipyrad_job.sh you can submit a job with

[carl]$ sbatch ipyrad_job.sh

Note that you do not need to activate the conda environment before submitting the job. If you are using a self-built version of ipyrad, the environment will be (and has to be) activated within the job script.

Job Script for Using Multiple Compute Nodes

To run ipyrad on multiple compute nodes you will have to install the package mpi4py in your conda environment as explained above. Then, the job script from the previous example only needs a few modifications. The changed lines are shown below:

...
#SBATCH --nodes 2
#SBATCH --tasks-per-node 24
###SBATCH --cpus-per-task 12

...

cmd="ipyrad -p $params -s 1234567 -c $cores --MPI"
...

So, instead of requesting CPUs per task you are now requesting nodes and tasks per node (each task will use a single core with --cpus-per-task commented out or set to 1). The option --MPI tells ipyrad to use MPI to distribute the processes.
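
If you prefer to keep a single job script for both cases, the option could be chosen from the number of allocated nodes. The sketch below assumes that Slurm exports SLURM_JOB_NUM_NODES inside the job; the function name parallel_flag is hypothetical, and the rule itself (more than one node means --MPI) is just one way to encode the two examples on this page.

```shell
#!/bin/bash
# Sketch: pick the ipyrad parallelization option from the node count.
# An unset SLURM_JOB_NUM_NODES is treated as a single node.
parallel_flag() {
    if [ "${SLURM_JOB_NUM_NODES:-1}" -gt 1 ]; then
        echo "--MPI"
    else
        echo "--ipcluster"
    fi
}

# Example (commented out): build the command line in the job script with
# cmd="ipyrad -p $params -s 1234567 -c $cores $(parallel_flag)"
```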

Notes on Parallel Computing with ipyrad

The toolkit ipyrad combines several steps of a workflow for a genome assembly. Running the full workflow in parallel might speed up the process; however, individual steps might not benefit at all. In that case, some of the requested resources would be wasted. For example, let us assume that step 5 parallelizes nicely: if it takes 1 hour on a single core, it runs in 10 minutes on 6 cores. On the other hand, step 6 might not parallelize well: if it takes 1 hour on one core, it also takes 1 hour on 6 cores. Running both steps on 6 cores would take 1:10 h, but 5 cores would be idle for 1 hour.
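
The example can be expressed in core-minutes to make the waste explicit. The numbers below are the hypothetical ones from the paragraph above, not measurements, and core_minutes is just an illustrative helper.

```shell
#!/bin/bash
# Worked example with the hypothetical timings from the text (not measurements).
core_minutes() {            # usage: core_minutes <cores> <minutes>
    echo $(( $1 * $2 ))
}

# One job on 6 cores: step 5 takes 10 min, step 6 takes 60 min (70 min in total).
reserved=$(core_minutes 6 70)                               # 420 core-minutes
used=$(( $(core_minutes 6 10) + $(core_minutes 1 60) ))     # 120 core-minutes
idle=$(( reserved - used ))                                 # 300 = 5 cores x 60 min

# Two jobs: step 5 on 6 cores, step 6 on a single core.
split=$(( $(core_minutes 6 10) + $(core_minutes 1 60) ))    # 120 core-minutes

echo "single job: reserved=$reserved used=$used idle=$idle core-minutes"
echo "split jobs: reserved=$split core-minutes"
```

In this example the single job reserves 420 core-minutes but uses only 120 of them, while two separate jobs reserve 120 core-minutes for the same wall-clock time of 1:10 h.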

Therefore, if you are planning to run many assemblies with ipyrad, you might want to check how well the different steps scale by running some benchmark calculations (you can probably assume that this does not depend too much on the input). If you find significant differences, you should separate the steps of the workflow into different jobs, each requesting resources that can be used efficiently. Using job dependencies or job chains can help you to keep the workflow automated.
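
A job chain of this kind could be submitted with a small helper such as the one below. It is a sketch, not a tested recipe for CARL: it assumes the sbatch options --parsable and --dependency=afterok:<jobid>, the function name submit_chain and the variable SBATCH_CMD are hypothetical, and the script names in the example are placeholders for job scripts that each run one part of the workflow.

```shell
#!/bin/bash
# Sketch: submit job scripts one after another so that each job starts only
# after the previous one has finished successfully.
# SBATCH_CMD can be overridden (e.g. for a dry run); it defaults to sbatch.
submit_chain() {
    local submit="${SBATCH_CMD:-sbatch}"
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid="$("$submit" --parsable "$script")" || return 1
        else
            jobid="$("$submit" --parsable --dependency="afterok:${prev}" "$script")" || return 1
        fi
        jobid="${jobid%%;*}"    # --parsable may print "jobid;cluster"
        echo "submitted $script as job $jobid"
        prev="$jobid"
    done
}

# Example (commented out; placeholder script names):
# submit_chain ipyrad_steps_1-5.sh ipyrad_step_6.sh ipyrad_step_7.sh
```

Each script in the chain can then request its own number of cores, for example many cores for a step that scales well and a single core for one that does not.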