getting masked sequence from BSgenome: can I use Views or getSeq?
2
0
Entering edit mode
Janet Young ▴ 740
@janet-young-2360
Last seen 5.1 years ago
Fred Hutchinson Cancer Research Center,…
Hi there, I have some ranges of interest in the human genome, specified as a GRanges object (a bunch of them, in a list), and I'd like to use alphabetFrequency on their RepeatMasked sequences. I've found an inefficient way to do this, but I wondered (a) whether I'm missing a better way to do it, or (b) whether it'd be possible for you to implement some version of Views that returns masked sequences, rather than dropping the masks My inefficient (in memory) version is to create a hard-masked version of the genome like this: allMaskedSeqsHsapiens <- list() for (thisChr in seqnames(Hsapiens)) { cat("chr",thisChr,"\n") myChr <- Hsapiens [[thisChr]] active(masks(myChr)) <- rep(TRUE, 4) myChr <- injectHardMask(myChr) allMaskedSeqsHsapiens[[thisChr]] <- myChr } and then for each of the GRanges objects I'm looking at (each has ranges only on a single chromosome), I get the sequence using Views on the relevant hard-masked sequence from allMaskedSeqsHsapiens[[thisChr]]. It seems I shouldn't have to create the hard-masked genome list - is there a direct way to do this? Views always seems to drop the masks, and I don't think getSeq works for MaskedDNAString objects. Maybe I should be doing some roundabout coercion to get the MaskedDNAString chromosome into a DNAString format, retaining the masks? But perhaps it would be nice to have be able to use Views or getSeq directly to get masked sequences. Motivation: I want to calculating repeat-masked intronic GC content for each human gene. thanks very much, Janet [[alternative HTML version deleted]]
• 1.8k views
ADD COMMENT
0
Entering edit mode
Malcolm Cook ★ 1.6k
@malcolm-cook-6293
Last seen 4 months ago
United States
> Hi there, > > I have some ranges of interest in the human genome, specified as a GRanges object (a bunch of them, in a list), and I'd like to use > alphabetFrequency on their RepeatMasked sequences. I've found an inefficient way to do this, but I wondered > (a) whether I'm missing a better way to do it, or > (b) whether it'd be possible for you to implement some version of Views that returns masked sequences, rather than dropping the > masks Hi Janet - perhaps this will help: getSeqHardMasked <- function(BSg,GR,maskList=NULL,letter='N') { ### PURPOSE: return list of DNAString sequences extracted from the ### BSgenome <bsg> corresponding to each location in GenomicRange ### <gr>, and masked with <letter> according to the masks named in ### <masklist> (which are encoded following BSParams convention). ### ### USE CASE - write fasta file of hard masked regions, using ### RepeatMasker (RM) and Tandem Repeat Finder (TRF): ### ### GR <- GRanges('chr2L',IRanges(c(1,100),c(15,125))) ### writeFASTA(getSeqHardMasked(BSgenome, GR, c(RM=TRUE,TRF=TRUE), "N") ### ,"myExtractForGR.fa" ### ,paste(seqnames(GR),start(GR),end(GR),strand(GR),sep=':') ### ) ### ### AUTHOR: malcolm_cook at stowers.org ### ### NB: The implementation was coded 'pay(ing) attention to memory ### management' following suggestions from Herve in: ### https://stat.ethz.ch/pipermail/bioconductor/2011-March/038143.html. ### In particular, the inidividual chromosomes and their ### subseq(uences) should be garbage collectable after the function ### exits and they go out of scope, (although the chromosomes _are_ ### all simultaneously loaded which I think is unavoidable if the ### results are to preserve the arbitrary order of GR). ### ### NB: My initial implementation FAILed as it used bsapply & BSParams ### whose FUN can not 'know' the name of the sequence (which was ### needed to know which subseqs to extract). ### ### TODO: factor out the creation of masked BSGenome seqlist for use ### when this needs to be done to multiple sets of GRanges for the ### same genome. ### ### TODO? - provide options to: filter out COMPLETELY masked regions? ### trim ends that are masked? ']]' <- ## utility to subscript b by a. function(a,b) b[[a]] Vsubseq <- ## vectorized version of subseq. Vectorize(subseq) VinjectHardMask <- ## vectorized version of injectHardMask. Vectorize(injectHardMask) activeMask <- ## A logical vector indicating whether each mask should be ON or ## OFF to be applied to each chromosome in BSg. masknames(BSg) %in% names(maskList[which(maskList)]) BSg_seqList <- ## BSg as a list of named MaskedDNAString (one per chromosome)... sapply(seqnames(BSg),']]',BSg) BSg_seqList <- ## ... with the masks for each chromosome activated. sapply(BSg_seqList,function(x) {active(masks(x)) <- activeMask;x}) GR_seq <- ## the MaskedDNAString corresponding to GR sapply(as.character(seqnames(GR)),']]',BSg_seqList) VinjectHardMask(Vsubseq(GR_seq,start(GR),end(GR)),letter=letter) } > > My inefficient (in memory) version is to create a hard-masked version of the genome like this: > > allMaskedSeqsHsapiens <- list() > for (thisChr in seqnames(Hsapiens)) { > cat("chr",thisChr,"\n") > myChr <- Hsapiens [[thisChr]] > active(masks(myChr)) <- rep(TRUE, 4) > myChr <- injectHardMask(myChr) > allMaskedSeqsHsapiens[[thisChr]] <- myChr > } > > and then for each of the GRanges objects I'm looking at (each has ranges only on a single chromosome), I get the sequence using > Views on the relevant hard-masked sequence from allMaskedSeqsHsapiens[[thisChr]]. > > It seems I shouldn't have to create the hard-masked genome list - is there a direct way to do this? Views always seems to drop the > masks, and I don't think getSeq works for MaskedDNAString objects. Maybe I should be doing some roundabout coercion to get the > MaskedDNAString chromosome into a DNAString format, retaining the masks? But perhaps it would be nice to have be able to use > Views or getSeq directly to get masked sequences. > > Motivation: I want to calculating repeat-masked intronic GC content for each human gene. > > thanks very much, > > Janet > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
kielpinski • 0
@kielpinski-8492
Last seen 9.1 years ago
Denmark

Dear Malcolm,

Just to warn future users:

getSeqHardMasked 

is not strand aware.

 

ADD COMMENT

Login before adding your answer.

Traffic: 499 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6