Build a reference database
Using ecoPCR
One way to build the reference database is to use the ecoPCR program to simulate a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two primers (TTAGATACCCCACTATGC and TAGAACAGGCTCCTCTAG in this example) used for PCR amplification.
The full list of steps for building this reference database would then be:
-
Download the whole set of EMBL sequences (available from: ftp://ftp.ebi.ac.uk/pub/databases/embl/release/)
-
Download the NCBI taxonomy (available from: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz)
-
Format them into the ecoPCR format (see obiconvert for how you can produce ecoPCR compatible files)
-
Use ecoPCR to simulate amplification and build a reference database based on putatively amplified barcodes together with their recorded taxonomic information
As step 1 and step 3 can be really time-consuming (about one day), we already provide the reference database produced by the following commands so that you can skip its construction. Note that as the EMBL database and taxonomic data can evolve daily, if you run the following commands you may end up with quite different results.
Any utility allowing file downloading from a ftp site can be used. In the following commands, we use the commonly used wget Unix command.
Download the sequences
mkdir EMBL
cd EMBL
wget -nH --cut-dirs=4 -Arel_std_\*.dat.gz -m ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
cd ..
Download the taxonomy
mkdir TAXO
cd TAXO
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz
cd ..
Format the data
obiconvert --embl -t ./TAXO --ecopcrDB-output=embl_last ./EMBL/*.dat
Use ecoPCR to simulate an in silico` PCR
ecoPCR -d ./ECODB/embl_last -e 3 -l 50 -L 150 \
TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05.ecopcr
Note that the primers must be in the same order both in wolf_diet_ngsfilter.txt and in the ecoPCR command.
Clean the database
-
filter sequences so that they have a good taxonomic description at the species, genus, and family levels (obigrep command below).
-
remove redundant sequences (obiuniq command below).
-
ensure that the dereplicated sequences have a taxid at the family level (obigrep command below).
-
ensure that sequences each have a unique identification (obiannotate command below)
obigrep -d embl_last --require-rank=species \
--require-rank=genus --require-rank=family v05.ecopcr > v05_clean.fasta
obiuniq -d embl_last \
v05_clean.fasta > v05_clean_uniq.fasta
obigrep -d embl_last --require-rank=family \
v05_clean_uniq.fasta > v05_clean_uniq_clean.fasta
obiannotate --uniq-id v05_clean_uniq_clean.fasta > db_v05.fasta