Hello,
I am trying to extract the genomic coordinates (chromosome name, start position, end position) from an MIndex object but can't figure out how to do it.
Can someone help?
Here is the code - first, I load the library, read a genome sequence into memory (it's tiny), and create DNAString objects out of a column of short sequences stored in my data frame data
:
library(Biostrings)
genome=readDNAStringSet("O_sativa_japonica_Oct_2011.fa")
query_seqs = sapply(data$Sequence,DNAString)
A bit more info:
The object data
is a data frame with one column (Sequence
) that contains short oligonucleotide sequences - character data. These are guide RNA sequences I got from a collaborator who needs to map them onto the genomic sequence in rice. We only care about exact matches at this point.
I then locate them in the genomic sequence with:
matches = lapply(query_seqs,vmatchPattern,genome)
rc_matches = lapply(rc_query_seqs,vmatchPattern,genome)
The above two commands return lists. Interrogating one item from one list:
m = matches[[1]]
> m
MIndex object of length 16
$Chr1
IRanges object with 0 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
$Chr2
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 4496975 4496994 20
$Chr3
IRanges object with 0 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
...
<13 more elements>
As you can see from the above code, object m
is an MIndex
. That sounds fine, and now I need to get some data out of it, so that I can then write out a BED format file and look at the data in a genome browser. (For example, I need to interactively check that the data I got from my collaborator fits his assumption about it.)
For each location in the genome that matched, I want to retrieve the name of the sequence (chromosome) that matched, the start position of the match, and the end position of the match, so that I can fill in the proper fields of the BED file.
How do I get that data from this MIndex
thing and the IRanges
objects it appears to contain?
I've been looking through the documentation and can't figure out what methods to use or how to do it.
Thanks in advance for any help you can provide!
-Ann
Thanks! Got it now :-)
Btw, it might be more convenient to use the
vmatchPDict()
function in BSgenome, i.e., something likevmatchPDict(PDict(query_seqs)), BSgenome.Osativa.MSU.MSU7)
. That gives you a GRanges directly, with a column indicating the query sequence for each hit.