Question

how to get genomic coordinates from an MIndex object

0

Entering edit mode

Ann Loraine ▴ 110

@ann-loraine-3863

Last seen 3.3 years ago

United States

Hello,

I am trying to extract the genomic coordinates (chromosome name, start position, end position) from an MIndex object but can't figure out how to do it.

Can someone help?

Here is the code - first, I load the library, read a genome sequence into memory (it's tiny), and create DNAString objects out of a column of short sequences stored in my data frame data:

library(Biostrings)
genome=readDNAStringSet("O_sativa_japonica_Oct_2011.fa")
query_seqs = sapply(data$Sequence,DNAString)

A bit more info:

The object data is a data frame with one column (Sequence) that contains short oligonucleotide sequences - character data. These are guide RNA sequences I got from a collaborator who needs to map them onto the genomic sequence in rice. We only care about exact matches at this point.

I then locate them in the genomic sequence with:

matches = lapply(query_seqs,vmatchPattern,genome)
rc_matches = lapply(rc_query_seqs,vmatchPattern,genome)

The above two commands return lists. Interrogating one item from one list:

m = matches[[1]]
> m
MIndex object of length 16
$Chr1
IRanges object with 0 ranges and 0 metadata columns:
       start       end     width
   <integer> <integer> <integer>

$Chr2
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]   4496975   4496994        20

$Chr3
IRanges object with 0 ranges and 0 metadata columns:
       start       end     width
   <integer> <integer> <integer>

...
<13 more elements>

As you can see from the above code, object m is an MIndex. That sounds fine, and now I need to get some data out of it, so that I can then write out a BED format file and look at the data in a genome browser. (For example, I need to interactively check that the data I got from my collaborator fits his assumption about it.)

For each location in the genome that matched, I want to retrieve the name of the sequence (chromosome) that matched, the start position of the match, and the end position of the match, so that I can fill in the proper fields of the BED file.

How do I get that data from this MIndex thing and the IRanges objects it appears to contain?

I've been looking through the documentation and can't figure out what methods to use or how to do it.

Thanks in advance for any help you can provide!

-Ann

IRanges MIndex • 1.4k views

ADD COMMENT • link updated 3.3 years ago by Michael Lawrence ★ 11k • written 3.3 years ago by Ann Loraine ▴ 110

score 2 · Answer 1 · 2021-08-03

2

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.0 years ago

United States

You can convert to a GRanges with GRanges(m), and then use its accessors.

ADD COMMENT • link 3.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks! Got it now :-)

ADD REPLY • link 3.3 years ago Ann Loraine ▴ 110

0

Entering edit mode

Btw, it might be more convenient to use the vmatchPDict() function in BSgenome, i.e., something like vmatchPDict(PDict(query_seqs)), BSgenome.Osativa.MSU.MSU7). That gives you a GRanges directly, with a column indicating the query sequence for each hit.

ADD REPLY • link 3.3 years ago Michael Lawrence ★ 11k