Question

Do DECIPHER::IdTaxa / dada2::assignTaxonomy work with an ambiguous reference training set?

fabian.roger • 0

Hi,

I am trying to get a taxonomic assignment for amplicon sequences of a COI barcode (insect-specific but quite degenerate primers: BF3, BR2; Elbrecht et al., 2019).

As there are not really any well-curated reference databases available for COI, I am trying to adapt existing databases for use with DECIPHER::IdTaxa and/or dada2::assignTaxonomy.

I am trying two approaches:

1) I downloaded all BINs for Arthropods from BOLD

2) I downloaded this database that mines COI genes from GenBank.

Both databases have a lot of sequences (~4M for BOLD, 1.2M for GenBankDB) but far fewer unique species (400K for BOLD, 110K for GenBankDB), where a species is defined as a BIN in BOLD and as an identical taxonomy in GenBankDB.

My plan was thus to

1) Align the seqs within each cluster / BIN / "species" (or a random subsample of 200 seqs if there are more)

2) Find a majority consensus sequence ( DECIPHER::ConsensusSequence( , threshold = 0.5) )

3) Assign consensus taxonomy to consensus seq

and then use this as the reference file for taxonomic assignment.
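To make step 2 concrete, here is a minimal Python sketch of a column-wise majority consensus over sequences that are already aligned. It is a simplification for illustration only and does not reproduce DECIPHER::ConsensusSequence, whose threshold semantics and handling of IUPAC ambiguity codes may differ. The function name, the strict "more than `threshold`" rule, and the use of `N` for unresolved columns are all assumptions.

```python
from collections import Counter


def majority_consensus(aligned, threshold=0.5, ambiguous="N"):
    """Column-wise majority consensus of equal-length aligned sequences.

    A base is emitted only if its frequency among the non-gap characters
    of a column strictly exceeds `threshold`; otherwise `ambiguous` is
    emitted. Columns that consist solely of gaps are dropped.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for column in zip(*aligned):
        bases = [b for b in column if b != "-"]
        if not bases:
            continue  # all-gap column: nothing to report
        base, count = Counter(bases).most_common(1)[0]
        out.append(base if count / len(bases) > threshold else ambiguous)
    return "".join(out)


# Toy cluster: column 5 has no majority base, so it becomes ambiguous.
cluster = ["ACGTAC", "ACGTTC", "ACCTGC", "ACGT-C"]
consensus = majority_consensus(cluster)
print(consensus)  # ACGTNC
```

Even in this toy cluster one column ends up ambiguous, which is exactly the situation question 2 below asks about.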

While trying this, some questions arose:

1) Is that approach useful / legitimate?

2) Even the majority consensus seq can have ambiguous bases. Can a reference database have ambiguous bases for dada2::assignTaxonomy / DECIPHER::IdTaxa?

3) For DECIPHER::LearnTaxa, is there any information about how it scales (timewise) with the size of the database?

Thank you for your help!

Fabian

score 0 · Answer 1 · 2020-02-12

Partial answer, covering the DADA2 part only:

1) Is that approach useful / legitimate?

Yes. In fact, the naive Bayesian classifier algorithm that DADA2 implements in assignTaxonomy has largely been evaluated on reference databases that have been subsetted in a similar fashion, i.e. by "clustering" identical or highly similar sequences and choosing a representative from the cluster to be in the reference database. Raw databases with large numbers of identical sequences have the potential to induce inaccurate taxonomic assignment by overwhelming the bootstrap-based confidence evaluation step with sheer numeric replication.

That said, thoughtful consideration and perhaps evaluation of the details of your clustering method/thresholds would not be unwarranted.

2) Even the majority consensus seq can have ambiguous bases. Can a reference database have ambiguous bases for dada2::assignTaxonomy

Yes. The reference sequences are shredded into kmers as part of the assignTaxonomy method, and the kmers with ambiguous nucleotides are simply ignored.
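As a rough illustration of what "shredded into kmers" with ambiguous kmers ignored means, the following Python sketch collects only the k-mers made up entirely of A, C, G and T. It is not the dada2 implementation; the function name and the choice of `k` in the example are assumptions made for the illustration.

```python
def unambiguous_kmers(seq, k):
    """Return the set of k-mers in `seq` that contain only A, C, G, T."""
    seq = seq.upper()
    valid = set("ACGT")
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:  # skip any k-mer touching an ambiguous base
            kmers.add(kmer)
    return kmers


# A single N in the middle removes every k-mer that overlaps it.
reference = "ACGTNACGTA"
kmers = unambiguous_kmers(reference, k=4)
print(sorted(kmers))  # ['ACGT', 'CGTA']
```

Note that one ambiguous base knocks out up to `k` overlapping k-mers, so a consensus sequence with many ambiguous positions contributes correspondingly less information to the classifier.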

score 0 · Answer 2 · 2020-02-12

Answering the parts about IDTAXA:

1) Is that approach useful / legitimate?

It is legitimate to cluster sequences and select a representative. However, I suggest clustering sequences and selecting a subset of each cluster for building the reference database. There is no need to only input one consensus sequence, and this would likely work worse than having a limited number (10-100) of representatives of a group. Note that all k-mer based algorithms (RDP and IDTAXA included) ignore ambiguous k-mers.
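The idea of keeping a limited number of representatives per group, rather than a single consensus, can be sketched in Python as follows. This is only an illustration of the subsampling step, not DECIPHER code; the function name, the cap of 100, the fixed seed, and the removal of exact duplicates before sampling are assumptions.

```python
import random
from collections import defaultdict


def subsample_per_group(records, max_per_group=100, seed=0):
    """Keep at most `max_per_group` distinct sequences per group label.

    `records` is an iterable of (group_label, sequence) pairs, where the
    label could be a BIN or a full taxonomy string.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for label, seq in records:
        groups[label].append(seq)
    subset = {}
    for label, seqs in groups.items():
        # Drop exact duplicates; sorting makes the sampling reproducible.
        unique = sorted(set(seqs))
        if len(unique) > max_per_group:
            unique = rng.sample(unique, max_per_group)
        subset[label] = unique
    return subset


records = [
    ("BIN_A", "ACGTAA"), ("BIN_A", "ACGTAC"), ("BIN_A", "ACGTAG"),
    ("BIN_A", "ACGTAT"), ("BIN_A", "ACGTAA"), ("BIN_B", "TTGTAA"),
]
subset = subsample_per_group(records, max_per_group=3)
print({label: len(seqs) for label, seqs in subset.items()})
# {'BIN_A': 3, 'BIN_B': 1}
```

Small groups are kept whole, so rare taxa are not lost while large groups are capped.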

2) Even the majority consensus seq can have ambiguous bases. Can a reference database have ambiguous bases for dada2::assignTaxonomy / DECIPHER::IdTaxa?

Yes, both programs allow ambiguous bases.

3) For DECIPHER::LearnTaxa, is there any information about how it scales (timewise) with the size of the database?

LearnTaxa() scales in time roughly with the size of the reference taxonomy (i.e., taxonomic tree). Note that you only need to run LearnTaxa() once per reference set, and the output can be reused with IdTaxa() for classification.

score 0 · Answer 3 · 2021-06-02

Hi again,

Sorry for coming back to this 15 months later, but I have a follow-up question related to the question above. When I have a training set where some species have missing taxonomies at the Species / Genus level, how should this be formatted?

Should it be NA or just missing (;genus;; - for missing species) or must all sequences have a name at all ranks?