I am trying to get all of the data I can from the incorrect reading frame according to all the human RefSeq transcripts. Specifically, I want to locate all the termination codons that are in the -1/+1 frames of every human transcript and see which are within 55 base pairs of the penultimate exon.
Here are some steps maybe you'd want to explore to get you started:
Build the Human transcriptome as a DNAStringSet object. See ?extractTranscriptSeqs() in the GenomicFeatures package for how to do this (look in particular at 2. A REAL EXAMPLE in the Examples section).
Use the string matching facilities in Biostrings to search for occurrences of stop codons in the transcriptome. The vmatchPattern() function seems particularly appropriate for locating all the stop codons of every human transcript (i.e. for every DNA sequence in the DNAStringSet object representing the transcriptome). You'll have to do that step for each of the 3 stop codons in The Standard Genetic Code (names(GENETIC_CODE)[GENETIC_CODE == "*"] is a programmatic way to get the 3 stop codons). Let's say you have now 3 MIndex objects representing the 3 results you got from your 3 calls to vmatchPattern().
To see which stop codons are within 55 base pairs of the penultimate exon you must first get the locations of the penultimate exon of each transcript. The following function will give you that:
## 'ex_by_tx' must be the GRangesList object obtained by
## exonsBy(txdb, use.names=TRUE).
## Return an IRanges object parallel to 'ex_by_tx' containing the location
## (in transcript coordinates) of the penultimate exon of each transcript.
penultimateExon <- function(ex_by_tx)
{
ex_by_tx2 <- phead(ex_by_tx, n=-1)
pen_ex_end <- sum(width(ex_by_tx2))
pen_ex_width <- as.integer(width(ptail(ex_by_tx2, n=1)))
pen_ex_width[is.na(pen_ex_width)] <- 0L
IRanges(end=pen_ex_end, width=pen_ex_width, names=names(ex_by_tx))
}
Do pen_exon <- penultimateExon(ex_by_tx) using the same ex_by_tx you used earlier with extractTranscriptSeqs() to build the transcriptome.
For each of the 3 MIndex objects you obtained before (each of them being parallel to pen_exon), compute the distance between the i-th element in the MIndex object (an IRanges) and the i-th range in pen_exon. This can be achieved by calling distance() in an lapply() loop. The result of this lapply() will be a list of integer vectors, each of them giving the distance of each stop codon to the penultimate exon of each transcript. For more convenience, turn this list into an IntegerList object. This IntegerList object should have the same shape as the corresponding MIndex object (i.e. their elementLengths() should be identical).
Doing <= 55 on the IntegerList object will return a LogicalList indicating which matches in the MIndex object are within 55 base pairs of the penultimate exon.
So you're almost there but depending on how you want to represent the results and what you want to do next there might still be a few extra steps to perform.