Cdbfasta 2016
Introduction
CDB (Constant DataBase) indexing and retrieval tools for FASTA files
cdbfasta and cdbyank are platform-independent, file-based hashing tools for creating indices that allow quick retrieval of individual sequences from large multi-FASTA files. [1]
Installed version(s)
The following versions are installed and currently available...
... on environment hpc-env/8.3:
- cdbfasta/0.99-GCC-8.3.0
Loading cdbfasta
To load the desired version of the module, use the module load command, e.g.
module load hpc-env/8.3
module load cdbfasta
Always remember: these commands are case-sensitive!
Using cdbfasta and cdbyank
To find out how to use cdbfasta, type cdbfasta without any arguments after loading the module; this prints a help text to get you started. The same goes for cdbyank, which is bundled with cdbfasta:
$ cdbfasta
Usage:
cdbfasta <fastafile> [-o <index_file>] [-r <record_delimiter>]
   [-z <compressed_db>] [-i] [-m|-n <numkeys>|-f<LIST>]|-c|-C]
   [-w <stopwords_list>] [-s <stripendchars>] [-v]

   Creates an index file for records from a multi-fasta file.
   By default (without -m/-n/-c/-C option), only the first
   space-delimited token from the defline is used as a key.

  <fastafile> is the multi-fasta file to index;
  -o the index file will be named <index_file>; if not given,
     the index filename is database name plus the suffix '.cidx'
  -r <record_delimiter> a string of characters at the beginning of line
     marking the start of a record (default: '>')
  -Q treat input as fastq format, i.e. with '@' as record delimiter
     and with records expected to have at least 4 lines
  -z database is compressed into the file <compressed_db>
     before indexing (<fastafile> can be "-" or "stdin"
     in order to get the input records from stdin)
  -s strip extraneous characters from *around* the space delimited
     tokens, for the multikey options below (-m,-n,-f);
     Default <stripendchars> set is: '",`.(){}/[]!:;~|><+-
  -m ("multi-key" option) create hash entries pointing to
     the same record for all tokens found in the defline
  -n <numkeys> same as -m, but only takes the first <numkeys>
     tokens from the defline
  -f indexes *space* delimited tokens (fields) in the defline as given
     by LIST of fields or fields ranges (the same syntax as UNIX 'cut')
  -w <stopwordslist> exclude from indexing all the words found
     in the file <stopwordslist> (for options -m, -n and -k)
  -i do case insensitive indexing (i.e. create additional keys for
     all-lowercase tokens used for indexing from the defline
  -c for deflines in the format: db1|accession1|db2|accession2|...,
     only the first db-accession pair ('db1|accession1') is taken as key
  -C like -c, but also subsequent db|accession constructs are indexed,
     along with the full (default) token; additionally,
     all nrdb concatenated accessions found in the defline
     are parsed and stored (assuming 0x01 or '^|^' as separators)
  -a accession mode: like -C option, but indexes the 'accession'
     part for all 'db|accession' constructs found
  -A like -a and -C together (both accessions and 'db|accession'
     constructs are used as keys
  -v show program version and exit
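As a minimal sketch of the indexing step described in the help text above, the following creates a small multi-FASTA file and indexes it with the default settings. The file name demo.fa and the sequence names are made up for illustration, and the cdbfasta call is skipped if the module has not been loaded.

```shell
# Hypothetical demo input: three short records in one multi-FASTA file.
cat > demo.fa <<'EOF'
>seq1 first demo record
ACGTACGTAC
>seq2 second demo record
GGGTTTAAAC
>seq3 third demo record
TTGACCATGA
EOF

# Index the file only if cdbfasta is available (i.e. the module is loaded).
if command -v cdbfasta >/dev/null 2>&1; then
    # Default mode: the first space-delimited token of each defline
    # (seq1, seq2, seq3) is used as the key. Without -o, the index
    # is named after the input file plus the suffix .cidx.
    cdbfasta demo.fa
else
    echo "cdbfasta not found; load the module first" >&2
fi
```

According to the help text, adding -o <index_file> would choose a different index name, and -m would index every token of the defline instead of only the first one.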
$ cdbyank
Usage:
cdbyank <index_file> [-d <fasta_file>] [-a <key>|-n|-l|-s]
   [-o <outfile>] [-q <char>|-Q][-F] [-R] [-P] [-x] [-w]
   [-z <dbfasta.cdbz>

  <index_file> is the index file created previously with cdbfasta
     (usually having a ".cidx" suffix)
  -a <key> the sequence name (accession) for a fasta record to be
     retrieved; if not given, a list of accessions is expected
     at stdin
  -d <fasta_file> is the fasta file to pull records from;
     if not specified, cdbyank will look in the same directory
     where <index_file> resides, for a file with the same name
     but without the ".cidx" suffix
  -o the records found are written to file <outfile> instead of stdout
  -x allows retrieval of multiple records per key, if the indexed
     database had records with the same key (non-unique keys);
     (without -x only one record for a given key is retrieved)
  -i case insensitive query (expects the <index_file> to have been
     created with cdbfasta -i option)
  -Q output the query key surrounded by character '%' before the
     corresponding record
  -q same as -Q but use character <char> instead of '%'
  -w enable warnings (sent to stderr) when a key is not found
  -F pulls only the defline for each record (discard the sequence)
  -P only displays the position(s) (file offset) within the
     database file, for the requested record(s)
  -R sequence range extraction: expects the input <key(s)> to have
     the format: '<seq_name> <start> <end>'
     and pulls only the specified sequence range
  -z decompress the entire file <dbfasta.cdbz>
     (assumes it was built using cdbfasta with '-z' option)
  -v show version number and exit

 Index file statistics (no database file needed):
  -n display the number of records indexed
  -l list all keys stored in <index_file>
  -s display indexing summary info
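The retrieval options listed above can be sketched as follows. This is only an illustration under assumptions: demo.fa, keys.txt, subset.fa and the sequence names are made up, and the tool calls are skipped if cdbfasta and cdbyank are not on the PATH.

```shell
# Hypothetical demo input: three short records in one multi-FASTA file.
cat > demo.fa <<'EOF'
>seq1 first demo record
ACGTACGTAC
>seq2 second demo record
GGGTTTAAAC
>seq3 third demo record
TTGACCATGA
EOF

# Keys to fetch, one per line; cdbyank reads keys from stdin when -a is not given.
printf 'seq1\nseq3\n' > keys.txt

if command -v cdbfasta >/dev/null 2>&1 && command -v cdbyank >/dev/null 2>&1; then
    # Build the index (demo.fa.cidx) next to the FASTA file.
    cdbfasta demo.fa

    # Retrieve a single record by key and print it to stdout.
    cdbyank demo.fa.cidx -a seq2

    # Retrieve several records listed in keys.txt and write them to a file.
    cdbyank demo.fa.cidx -o subset.fa < keys.txt

    # Range extraction: with -R the input has the form '<seq_name> <start> <end>'.
    echo 'seq1 2 5' | cdbyank demo.fa.cidx -R

    # Index statistics: number of indexed records and the list of stored keys.
    cdbyank demo.fa.cidx -n
    cdbyank demo.fa.cidx -l
else
    echo "cdbfasta/cdbyank not found; load the module first" >&2
fi
```

Because the index sits next to demo.fa and has the same name plus the .cidx suffix, -d is not needed here; it would be required if the FASTA file were stored elsewhere.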
Documentation
The full documentation can be found on the project page.