how to get genomic coordinates from an MIndex object
1
0
Entering edit mode
Ann Loraine ▴ 110
@ann-loraine-3863
Last seen 3.3 years ago
United States

Hello,

I am trying to extract the genomic coordinates (chromosome name, start position, end position) from an MIndex object but can't figure out how to do it.

Can someone help?

Here is the code - first, I load the library, read a genome sequence into memory (it's tiny), and create DNAString objects out of a column of short sequences stored in my data frame data:

library(Biostrings)
genome=readDNAStringSet("O_sativa_japonica_Oct_2011.fa")
query_seqs = sapply(data$Sequence,DNAString)

A bit more info:

The object data is a data frame with one column (Sequence) that contains short oligonucleotide sequences - character data. These are guide RNA sequences I got from a collaborator who needs to map them onto the genomic sequence in rice. We only care about exact matches at this point.

I then locate them in the genomic sequence with:

matches = lapply(query_seqs,vmatchPattern,genome)
rc_matches = lapply(rc_query_seqs,vmatchPattern,genome)

The above two commands return lists. Interrogating one item from one list:

m = matches[[1]]
> m
MIndex object of length 16
$Chr1
IRanges object with 0 ranges and 0 metadata columns:
       start       end     width
   <integer> <integer> <integer>

$Chr2
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]   4496975   4496994        20

$Chr3
IRanges object with 0 ranges and 0 metadata columns:
       start       end     width
   <integer> <integer> <integer>

...
<13 more elements>

As you can see from the above code, object m is an MIndex. That sounds fine, and now I need to get some data out of it, so that I can then write out a BED format file and look at the data in a genome browser. (For example, I need to interactively check that the data I got from my collaborator fits his assumption about it.)

For each location in the genome that matched, I want to retrieve the name of the sequence (chromosome) that matched, the start position of the match, and the end position of the match, so that I can fill in the proper fields of the BED file.

How do I get that data from this MIndex thing and the IRanges objects it appears to contain?

I've been looking through the documentation and can't figure out what methods to use or how to do it.

Thanks in advance for any help you can provide!

-Ann

IRanges MIndex • 1.4k views
ADD COMMENT
2
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States

You can convert to a GRanges with GRanges(m), and then use its accessors.

ADD COMMENT
0
Entering edit mode

Thanks! Got it now :-)

ADD REPLY
0
Entering edit mode

Btw, it might be more convenient to use the vmatchPDict() function in BSgenome, i.e., something like vmatchPDict(PDict(query_seqs)), BSgenome.Osativa.MSU.MSU7). That gives you a GRanges directly, with a column indicating the query sequence for each hit.

ADD REPLY

Login before adding your answer.

Traffic: 521 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6