Mapping Probes to a Genome
3
0
Entering edit mode
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Last seen 8 hours ago
Australia

Is there a simple way to map a list of probe sequences to a genome ? The use case is probes of a custom-designed NanoString assay. The RLF file provided by NanoString has a gene symbol and probe sequence, but not the genomic coordinates of the probe. I don't want the hassle of running bowtie for a small set of sequences (about 200) which map to the genome with no mismatches and I'd like to include the mapping procedure in an R Markdown document without using a complex aligner through system() calls.

For another similar question asked five years ago, it was recommended to map to the transcriptome with vwhichPDict but I want the genomic coordinates, so I'd like to map with BSgenome.Hsapiens.UCSC.hg19 (both strands) and obtain a GRangesList result.

BSgenome biostrings probe mapping • 2.9k views
ADD COMMENT
1
Entering edit mode
holgerbrandl ▴ 10
@holgerbrandl-8467
Last seen 7.6 years ago
Germany

If you don't have to may probes of differeing length (which does not allow for direct usage of vmatchPDict you could aksi simply iterate with purr/dplyr

require(purrr)
require(dplyr)
require(Biostrings)
require(BSgenome.Dmelanogaster.UCSC.dm3)

mappedPrimers = primerInfo %>% 
   mutate(mapped=map(primer_seq, ~ as.data.frame(vmatchPDict(DNAStringSet(.x), Dmelanogaster)))) %>%
   unnest(mapped)

 

ADD COMMENT
0
Entering edit mode
@hotz-hans-rudolf-3951
Last seen 4.3 years ago
Switzerland

Hi Dario

Have you looked into the QuasR package (or just the Rbowtie package)? This allows you to use a 'complex aligner' without using a system call.

 

Regards, Hans-Rudolf

 

ADD COMMENT
0
Entering edit mode

This solution involves formatting the reads into a FASTA file, generating a genome index, and writing to disk BAM files and importing them into R. The other proposed answer solves the problem more directly and without reading and writing of results to and from the disk.

ADD REPLY
0
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

vmatchPDict() returns a GRanges with mcols 'index' for the corresponding probe. I guess you'd match the probes and their reverse complement.

ADD COMMENT
0
Entering edit mode

This solution does not work because the probes do not all have the same lengths and vmatchPDict results in an error when it tries to coerce the pdict variable into a PDict object. I need to create a PDict object with tb.start and tb.end settings because of this, but if I do, vmatchPDict gives an error, because it only accepts a DNAStringSet for the pdict parameter, not PDict objects.

e.g. (artificial)

> vmatchPDict(DNAStringSet(c("GATC", "TAG")), Hsapiens)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,  :
  element 2 in Trusted Band has a different length than first element
> vmatchPDict(PDict(DNAStringSet(c("GATC", "TAG")), tb.start=1, tb.end =3), Hsapiens)
Error in .local(pdict, subject, max.mismatch, min.mismatch, fixed, algorithm,  :
  'pdict' must be a DNAStringSet object

It's bizarre that the parameter is named pdict, but it results in an error if the user provides a PDict object.

Could an example also be added to the Examples section of the documentation of how to restrict the search to only chromosomes of interest ? There doesn't seem to be any way to only search chr1, ..., chr22, chrX, chrY, chrM for my human analysis scenario.

Actually, vmatchPDict is just a code stub in BSgenome version 1.40.1. Why is it even present in the release version of BSgenome ?

> vmatchPDict(DNAStringSet("GATC", Hsapiens))
Error in .local(pdict, subject, max.mismatch, min.mismatch, with.indels,  :
  vmatchPDict() is not ready yet, sorry

Code which doesn't work should only be in the development version (and be finished before the next Bioconductor release version).

ADD REPLY
0
Entering edit mode

group probes by length and analyze separately? 

vmatchPDict is a generic, generic arguments can be given generic names like 'x' or 'object' which lack semantic content. Often though argument names are influenced by a combination of most-common use, historical constraint (to avoid breaking user code), or developer aesthetic (e.g., symmetry across related functions).

It's not true that vmatchPDict is a code stub, it is a generic with 4 methods, including the method (for signature BSgenome) that you invoke successfully on a BSgenome subject.

ADD REPLY
0
Entering edit mode

It might also be fun to try the non-Bioconductor package AhoCorasickTrie::AhoCorasickSearch().

ADD REPLY

Login before adding your answer.

Traffic: 971 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6