Hi,
I'm not aware of any BioC package dedicated to finding Tandem Repeats.
However Bioconductor provides masked genomes (e.g. the BSgenome.Hsapiens.UCSC.hg19.masked package). In a masked genome the chromosome sequences are the same as in a non-masked genome (e.g. the BSgenome.Hsapiens.UCSC.hg19 package) except that masks have been added on top of them to mask different kinds of regions. For example in BSgenome.Hsapiens.UCSC.hg19.masked each sequence has 4 masks and the 4th one is the "TRF mask" which masks the Tandem Repeats of period <= 12:
library(BSgenome.Hsapiens.UCSC.hg19.masked)
genome <- BSgenome.Hsapiens.UCSC.hg19.masked
## Display Tandem Repeats on chr17.
Views(unmasked(genome$chr17), masks(genome$chr17)$TRF)
# Views on a 81195210-letter DNAString subject
# subject: AAGCTTCTCACCCTGTTCCTGCATAGATAATTG...GGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT
# views:
# start end width
# [1] 30966 30996 31 [GGGCTGGGACCCGGGCTGGGGCCCGGGCTGG]
# [2] 37773 37798 26 [TTTTTTTTTTTTTTTTTTTTTTTTTT]
# [3] 40174 40389 216 [AAAGAAGGAAAGAAAGAAGGA...AGAAAGAAAGAAGAAAGAAA]
# [4] 68984 69008 25 [TTTTTTTTTTTTTTTTTTTTTTTTT]
# [5] 82621 82678 58 [GTGTGTGTGTGTGTGTGTGTG...GTGTGTGTGTGTGTGTGTGT]
# ... ... ... ... ...
# [11977] 81162540 81162582 43 [AAAAAAAAGAAAGAAAGAAAGAAAAGAAAAGAAAAAAAAGAAA]
# [11978] 81166318 81166372 55 [TGAGTGTGTATGACTGTGTGT...TTTGTGTGTGTGAGTGTGTA]
# [11979] 81173535 81173562 28 [TGATTACCAGGCTGATTACCAGGCTGAT]
# [11980] 81185759 81185799 41 [ACACACACACACACACACACACACACACACACACACACACA]
# [11981] 81194905 81195206 302 [GTGAGGGTGAGGGTGAGGGTG...GTGGTGTGTGGGTGTGGGTG]
See ?BSgenome.Hsapiens.UCSC.hg19.masked
and http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ for more information
You can also implement your own little Tandem Repeats finder function:
## Find all *exact* tandem repeats with a period equal to or a divisor of
## 'period.multiple'.
.findTandemRepeats0 <- function(subject, period.multiple=2,
include.period1=FALSE, min.length=24)
{
if (!isSingleNumber(period.multiple))
stop("'period.multiple' must be a single integer")
if (!is.integer(period.multiple))
period.multiple <- as.integer(period.multiple)
if (period.multiple < 2L)
stop("'period.multiple' must be >= 2")
if (!isTRUEorFALSE(include.period1))
stop("'include.period1' must be TRUE or FALSE")
if (!isSingleNumber(min.length))
stop("'min.length' must be a single integer")
if (!is.integer(min.length))
min.length <- as.integer(min.length)
if (min.length < 12L)
stop("'min.length' must be >= 12")
s1 <- subseq(subject, start=1L+period.multiple)
s2 <- subseq(subject, end=-1L-period.multiple)
ir <- as(as.raw(s1) == as.raw(s2), "NormalIRanges")
ir <- ir[width(ir) >= period.multiple]
end(ir) <- end(ir) + period.multiple
ir <- ir[width(ir) >= min.length]
repeats <- Views(subject, ir)
af <- alphabetFrequency(repeats, baseOnly=TRUE)
ok <- af[ , "other"] == 0L # has no IUPAC ambiguities
if (!include.period1)
ok <- ok & rowSums(af[ , DNA_BASES] != 0L) >= 2L
repeats[which(ok)]
}
## Find all *exact* tandem repeats of period <= 12.
findTandemRepeats <- function(subject, include.period1=FALSE)
{
trs_list <- lapply(7:12,
function(period.multiple)
ranges(.findTandemRepeats0(subject,
period.multiple,
include.period1))
)
trs <- sort(unique(do.call("c", trs_list)))
Views(subject, trs)
}
Then:
chr17_trs <- findTandemRepeats(unmasked(genome$chr17), include.period1=TRUE)
chr17_trs
# Views on a 81195210-letter DNAString subject
# subject: AAGCTTCTCACCCTGTTCCTGCATAGATAATTG...GGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT
# views:
# start end width
# [1] 37773 37798 26 [TTTTTTTTTTTTTTTTTTTTTTTTTT]
# [2] 40289 40328 40 [AGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAA]
# [3] 68984 69008 25 [TTTTTTTTTTTTTTTTTTTTTTTTT]
# [4] 82621 82678 58 [GTGTGTGTGTGTGTGTGTGTG...GTGTGTGTGTGTGTGTGTGT]
# [5] 90847 90891 45 [AATAATAATAATAATAATAAT...ATAATAATAATAATAATAAT]
# ... ... ... ... ...
# [11413] 81194994 81195034 41 [TTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG]
# [11414] 81195037 81195083 47 [AGGGTGAGGGTGAGGGTGAGG...GTGAGGGTGAGGGTGAGGGT]
# [11415] 81195087 81195127 41 [GGGTTGGGGTTGGGGTTGGGGTTGGGGTTGGGGTTGGGGTT]
# [11416] 81195123 81195162 40 [GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGT]
# [11417] 81195159 81195190 32 [GGGTGTGGGTGTGGGTGTGGGTGTGGGTGTGG]
Even though the results are very close to the Tandem Repeats Finder locations provided by UCSC, there are some small differences. I didn't take a close look but it seems that one of the reasons for these differences is that the Tandem Repeats from UCSC are allowed to contain some fuzziness (i.e. the pattern of a Tandem Repeat is not repeated in an exact way), whereas findTandemRepeats()
finds exact Tandem Repeats.
Cheers,
H.
Hi,
The little
findTandemRepeats()
function I provided above is designed to work on a DNAString object so all you need to do is load the Biostrings package and turn your raw sequence into a DNAString object before callingfindTandemRepeats
on it:Cheers,
H.