Cdbfasta 2016

Introduction

CDB (Constant DataBase) indexing and retrieval tools for FASTA files

cdbfasta and cdbyank are platform-independent, file-based hashing tools that create indices for quick retrieval of individual sequences from large multi-FASTA files. ¹

Installed version(s)

The following versions are installed and currently available...

... on environment hpc-env/8.3:

cdbfasta/0.99-GCC-8.3.0

Loading cdbfasta

To load the desired version of the module, use the module load command, e.g.

module load hpc-env/8.3
module load cdbfasta

Always remember: these commands are case-sensitive!

Using cdbfasta and cdbyank

To see how to use cdbfasta, run cdbfasta without arguments after loading the module; it prints a help text to get you started. The same applies to cdbyank, which is bundled with cdbfasta:

$ cdbfasta 
Usage:
  cdbfasta <fastafile> [-o <index_file>] [-r <record_delimiter>]
   [-z <compressed_db>] [-i] [-m|-n <numkeys>|-f<LIST>]|-c|-C]
    [-w <stopwords_list>] [-s <stripendchars>] [-v]
   
   Creates an index file for records from a multi-fasta file.
   By default (without -m/-n/-c/-C option), only the first 
   space-delimited token from the defline is used as a key.
  
   <fastafile> is the multi-fasta file to index; 
   -o the index file will be named <index_file>; if not given,
      the index filename is database name plus the suffix '.cidx'
   -r <record_delimiter> a string of characters at the beginning of line
      marking the start of a record (default: '>')
   -Q treat input as fastq format, i.e. with '@' as record delimiter
      and with records expected to have at least 4 lines
   -z database is compressed into the file <compressed_db>
      before indexing (<fastafile> can be "-" or "stdin" 
      in order to get the input records from stdin)
   -s strip extraneous characters from *around* the space delimited
      tokens, for the multikey options below (-m,-n,-f);
      Default <stripendchars> set is: '",`.(){}/[]!:;~|><+-
   -m ("multi-key" option) create hash entries pointing to 
      the same record for all tokens found in
      the defline
   -n <numkeys> same as -m, but only takes the first <numkeys>
      tokens from the defline
   -f indexes *space* delimited tokens (fields) in the defline as given
      by LIST of fields or fields ranges (the same syntax as UNIX 'cut')
   -w <stopwordslist> exclude from indexing all the words found
      in the file <stopwordslist> (for options -m, -n and -k)
   -i do case insensitive indexing (i.e. create additional keys for 
      all-lowercase tokens used for indexing from the defline 
   -c for deflines in the format: db1|accession1|db2|accession2|...,
      only the first db-accession pair ('db1|accession1') is taken as key
   -C like -c, but also subsequent db|accession constructs are indexed,
      along with the full (default) token; additionally,
      all nrdb concatenated accessions found in the defline 
      are parsed and stored (assuming 0x01 or '^|^' as separators)
   -a accession mode: like -C option, but indexes the 'accession'
      part for all 'db|accession' constructs found
   -A like -a and -C together (both accessions and 'db|accession'
      constructs are used as keys
   -v show program version and exit

$ cdbyank 
Usage:
  cdbyank <index_file> [-d <fasta_file>] [-a <key>|-n|-l|-s]
      [-o <outfile>] [-q <char>|-Q][-F] [-R] [-P] [-x] [-w] 
      [-z <dbfasta.cdbz>

    <index_file> is the index file created previously with cdbfasta
       (usually having a ".cidx" suffix)
    -a <key> the sequence name (accession) for a fasta record to be
       retrieved; if not given, a list of accessions is expected
       at stdin
    -d <fasta_file> is the fasta file to pull records from; 
       if not specified, cdbyank will look in the same directory
       where <index_file> resides, for a file with the same name
       but without the ".cidx" suffix
    -o the records found are written to file <outfile> instead of stdout
    -x allows retrieval of multiple records per key, if the indexed 
       database had records with the same key (non-unique keys);
       (without -x only one record for a given key is retrieved)
    -i case insensitive query (expects the <index_file> to have been 
       created with cdbfasta -i option)
    -Q output the query key surrounded by character '%' before the
       corresponding record
    -q same as -Q but use character <char> instead of '%'
    -w enable warnings (sent to stderr) when a key is not found
    -F pulls only the defline for each record (discard the sequence)
    -P only displays the position(s) (file offset) within the 
       database file, for the requested record(s)
    -R sequence range extraction: expects the input <key(s)> to have 
       the format: '<seq_name> <start> <end>'
       and pulls only the specified sequence range
    -z decompress the entire file <dbfasta.cdbz>
       (assumes it was built using cdbfasta with '-z' option)
    -v show version number and exit
    
    Index file statistics (no database file needed):
    -n display the number of records indexed
    -l list all keys stored in <index_file>
    -s display indexing summary info
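
The options listed above can be combined into a typical index-then-retrieve workflow. The following is a minimal sketch, not an official recipe: the file names demo.fa and picked.fa and the keys seqA and seqB are made up for illustration, and only options shown in the help texts above are used. The tools are called only if they are found in the current shell, i.e. after the module has been loaded.

```shell
#!/bin/sh
# Sketch only: demo.fa, picked.fa, seqA and seqB are hypothetical names.

# Create a tiny multi-FASTA file with two records.
cat > demo.fa <<'EOF'
>seqA first demo record
ACGTACGTACGTACGTACGT
>seqB second demo record
TTGGCCAATTGGCCAATTGG
EOF

# Run the tools only if the module has been loaded in this shell.
if command -v cdbfasta >/dev/null 2>&1 && command -v cdbyank >/dev/null 2>&1; then
    # Build the index; without -o it is named demo.fa.cidx.
    cdbfasta demo.fa

    # Fetch one record by key (by default the first token of the defline).
    cdbyank demo.fa.cidx -a seqB

    # Fetch several records: without -a, keys are read from stdin.
    printf 'seqA\nseqB\n' | cdbyank demo.fa.cidx -o picked.fa

    # Range extraction: with -R each input line is '<seq_name> <start> <end>'.
    echo 'seqA 5 12' | cdbyank demo.fa.cidx -R

    # Index statistics: number of records, then the list of keys.
    cdbyank demo.fa.cidx -n
    cdbyank demo.fa.cidx -l
else
    echo "cdbfasta/cdbyank not found; load the module first" >&2
fi
```

Because the index stores positions within the FASTA file, it is safest to rebuild it with cdbfasta whenever the FASTA file changes.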

Documentation

The full documentation can be found at the project page.
