Hi Dips,
I'm cc'ing the mailing list so others can benefit.
On 09/04/2014 11:34 AM, deepti anand wrote:
> Hi H,
>
> Thank you for the code. I run the code but the extractUpstremSeqs()
does
> not seem to work. Here are the codes:
>
>> ids.ok
> [1] "67665" "13198" "110196" "15368"
> > txdb<- TxDb.Mmusculus.UCSC.mm10.knownGene
> > gn<- genes(txdb)[ids.ok]
> > gn
> GRanges with 4 ranges and 1 metadata column:
> seqnames ranges strand | gene_id
> <rle> <iranges> <rle> | <characterlist>
> 67665 chr18 [ 60526221, 60558762] + | 67665
> 13198 chr10 [127290793, 127296288] + | 13198
> 110196 chr3 [ 89093588, 89101967] - | 110196
> 15368 chr8 [ 75093618, 75100593] + | 15368
> ---
> seqlengths:
> chr1 chr2 chr3
...
> chrUn_GL456394 chrUn_GL456396 chrUn_JH584304
> 195471971 182113224 160039680
...
> 24323 21240 114452
> > genome <- BSgenome.Mmusculus.UCSC.mm10
> > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500)
> Error: could not find function "extractUpstreamSeqs"
>
> Could you suggest how should I proceed further?
Oops, extractUpstreamSeqs() is only available in the current devel
version of Bioconductor (BioC 3.0, which will be released in October).
Sorry. With the current release version (BioC 2.14), you can obtain
'gn_up_seqs' with:
gn_up_seqs <- getPromoterSeq(gn, genome, upstream=1500,
downstream=0)
And 'TSS_up_seqs' with:
TSS_up_seqs <- relist(getPromoterSeq(
unlist(TSS_by_gn, use.names=FALSE),
genome,
upstream=1500, downstream=0),
TSS_by_gn)
Please let me know if you run into other issues.
Thanks,
H.
PS: Yesterday I added a "resize" method for GRangesList objects in
BioC devel. It's in GenomicRanges 1.17.37 (which will become available
via biocLite() in the next 24 hours or so).
>
> Best,
>
> -Dips-
>
>
> > Date: Wed, 3 Sep 2014 12:14:49 -0700
> > From: hpages at fhcrc.org
> > To: anand.deepti at outlook.com; bioconductor at r-project.org
> > Subject: Re: [BioC] DNAStringsSet - remove multiple entries with
same
> name
> >
> > Hi Dips,
> >
> > On 09/03/2014 10:30 AM, deepti anand wrote:
> > > Hi All,
> > >
> > > I am trying to extract promoter sequences for a few ENTREZ IDS.
The
> problem I am having is that there exists multiple transcripts for
same
> gene. So this gives me multiple promoter sequences for same gene.
Can I
> filter out the redundant promoter sequences?
> > >
> > > Here is my code:
> > > ids.ok = c("67665" ,"13198" ,"110196","15368")## Obtain
coordinates
> of transcript #####>grl <-
> transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene")
> [ids.ok]>promoter.seqs <- getPromoterSeq(grl,Mmusculus,
> upstream=1500,downstream=0)>promoter.seqs<- unlist(promoter.seqs)>
> promoter.seqs A DNAStringSet instance of length 8 width seq names
[1]
> 1500
> CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCG
CCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> 67665.67665[2] 1500
> CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCG
CCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> 67665.67665[3] 1500
> CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCG
CCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> 67665.67665[4] 1500
> CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCG
CCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTC
> > C!
> > > TC 67665.67665[5] 1500
> CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCC...CCCCCCCCCAG
GAGGGGCCGGACAGCATAAAGGATACTCGCTCTCCGCTCTTGA
> 13198.13198[6] 1500
> CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGC
CGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG
> 110196.110196[7] 1500
> CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGC
CGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG
> 110196.110196[8] 1500
> GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTT...TGGGCGGCCAC
CACGTGACCCGCGTACTTAAAGGGCTGGCGCGGGCAGCTGCTC
> 15368.15368
> > >
> > > In example above, there are four sequences for same gene
> '67665.67665'. How can I remove these entries?
> > > I would appreciate any help
> >
> > In this particular example, the TSS (transcription starting site)
is
> > the same for all the transcripts in any of your genes. This
explains
> > why you get the same promoter sequence. However, this won't
always be
> > the case. So in order to handle the general situation, we need to
> > choose between 2 behaviors:
> >
> > 1) Return 1 sequence per gene (even for genes with more than 1
unique
> > TSS).
> >
> > 2) Return 1 sequence per unique TSS per gene.
> >
> > To do 1) we need to choose a particular TSS for each gene. A
common
> > choice is to pick-up the most upstream TSS. In this case, genes()
> > can be used to extract the gene ranges. genes() will return a
GRanges
> > object with 1 genomic range per gene. This genomic range is the
union
> > of all the transcript ranges in the gene.
> >
> > library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> > txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> > ids.ok <- c("67665", "13198", "110196", "15368")
> > gn <- genes(txdb)[ids.ok]
> >
> > Then:
> >
> > > gn
> > GRanges with 4 ranges and 1 metadata column:
> > seqnames ranges strand | gene_id
> > <rle> <iranges> <rle> | <characterlist>
> > 67665 chr18 [ 60526221, 60558762] + | 67665
> > 13198 chr10 [127290793, 127296288] + | 13198
> > 110196 chr3 [ 89093588, 89101967] - | 110196
> > 15368 chr8 [ 75093618, 75100593] + | 15368
> > ---
> > seqlengths:
> > chr1 chr2 ... chrUn_JH584304
> > 195471971 182113224 ... 114452
> >
> > Note that, by default, genes() will exclude genes that don't have
all
> > their transcripts on the same chromosome and strand (because in
that
> > case the gene cannot be represented with a unique genomic range).
> > See ?genes for more information.
> >
> > Then use extractUpstreamSeqs() to extract the upstream sequences
for
> > those genes:
> >
> > library(BSgenome.Mmusculus.UCSC.mm10)
> > genome <- BSgenome.Mmusculus.UCSC.mm10
> > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500)
> >
> > Then:
> >
> > > gn_up_seqs
> > A DNAStringSet instance of length 4
> > width seq names
> >
> > [1] 1500 CTGCTGTAAAGTTACATTCCTGC...GGTGCGCCGGGCGTTGGCTCCTC 67665
> > [2] 1500 CAGCCCTAAAAGATGAAAGTCGC...GGATACTCGCTCTCCGCTCTTGA 13198
> > [3] 1500 CACGTCGGCCTGCCTATCAGGGA...TTATGTTCTGCCTCCTGAGCCTG 110196
> > [4] 1500 GTTAGTATTTAATATTTAAAGCT...AGGGCTGGCGCGGGCAGCTGCTC 15368
> >
> > For doing 2) we can use transcriptsBy() to extract all transcript
ranges
> > grouped by gene:
> >
> > tx_by_gn <- transcriptsBy(txdb, by="gene")[ids.ok]
> >
> > Then we can obtain the unique TSS for each gene by first resizing
each
> > list element in 'tx_by_gn' and then by passing the result thru
unique():
> >
> > TSS_by_gn <- unique(endoapply(tx_by_gn, resize, width=1))
> >
> > (Note that we use endoapply() here because we don't have a
"resize"
> > method for GRangesList objects. However we should definitely add
> > one because doing endoapply() is not efficient.)
> >
> > Then we can get the upstream sequence for each TSS with:
> >
> > TSS_up_seqs <- relist(extractUpstreamSeqs(
> > genome,
> > unlist(TSS_by_gn, use.names=FALSE),
> > width=1500),
> > TSS_by_gn)
> >
> > This time the upstream sequences are returned in a
DNAStringSetList
> > object with 1 list element (DNAStringSet) per gene:
> >
> > > TSS_up_seqs
> > DNAStringSetList of length 4
> > [["67665"]]
> >
CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAATCTATATAAGT...
> > [["13198"]]
> >
CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCCGGGTAGTGTGC...
> > [["110196"]]
> >
CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAACCATCTTAAT...
> > [["15368"]]
> >
GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTTGGTTTTTCTGG...
> >
> > The number of sequences in each list element is the number of
unique
> > TSS in the gene. Of course, there is no benefit to this second
approach
> > in *your* case because none of your genes has more than 1 unique
TSS.
> > But some genes do have more than 1 unique TSS (e.g. 100017,
100019,
> > 100036521, and many more...).
> >
> > Hope this helps,
> > H.
> >
> > > Dips
> > >
> > >
> > >
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at r-project.org
> > >
https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
> > >
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fhcrc.org
> > Phone: (206) 667-5791
> > Fax: (206) 667-1319
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319