Fwd: summarizeOverlaps behavior on RNA-Seq paired end stranded data
1
0
Entering edit mode
@abhishek-pratap-6167
Last seen 10.4 years ago
Hi All Just wanted to check on the expected behavior of the summarizeOverlaps function in the GenomicRanges package for the following cases. 1. In the case of paired end data does it count each read pair as 1 or 2 against a feature. 2. For stranded protocols of RNA-Seq does it take it account the opposite strand of the mate of a read when counting or does it take only one read matching the strand into consideration. 3. For second strand RNA-Seq protocol where the read-1 matches to the opposite strand of gene will summarizeOverlap work ? Thanks! -Abhi [[alternative HTML version deleted]]
GenomicRanges GenomicRanges • 2.0k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 3.0 years ago
United States
Hi, On 10/02/2013 12:35 AM, Abhishek Pratap wrote: > Hi All > > Just wanted to check on the expected behavior of the summarizeOverlaps > function in the GenomicRanges package for the following cases. > > > 1. In the case of paired end data does it count each read pair as 1 or 2 > against a feature. Pairs are counted as a single 'hit' reguardless if one or both mates overlap the feature. > > 2. For stranded protocols of RNA-Seq does it take it account the opposite > strand of the mate of a read when counting or does it take only one read > matching the strand into consideration. When counting, pairs are treated as though they are from the same strand. In general, you can ignore the strand when counting by setting the 'ignore.strand=TRUE'. > > 3. For second strand RNA-Seq protocol where the read-1 matches to the > opposite strand of gene will summarizeOverlap work ? I'm not sure what you mean. Please provide an example. summarizeOverlaps() reads the data from a BAM into a GAlignmentPairs or GAlignmentsList container for counting. You can investigate the behavior using a small test case. ## paired-end record ga1 <- GAlignments("chr1", 1L, "10M", strand("+")) ga2 <- GAlignments("chr1", 15L, "11M", strand("-")) galp <- GAlignmentPairs(ga1, ga2, TRUE) ## annotation ann <- GRanges("chr1", IRanges(c(1, 5, 12, 20), c(25, 20, 14, 30)), "-") ## all with mode="Union" se1 <- summarizeOverlaps(ann[1], galp, ignore.strand=TRUE) > assays(se1)$counts reads [1,] 1 se2 <- summarizeOverlaps(ann[1], galp, ignore.strand=FALSE) > assays(se2)$counts reads [1,] 0 se3 <- summarizeOverlaps(ann[4], galp, ignore.strand=TRUE) > assays(se3)$counts reads [1,] 1 Valerie > > Thanks! > -Abhi > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
Hi On Wed, Oct 2, 2013 at 10:28 AM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > Hi, > > > On 10/02/2013 12:35 AM, Abhishek Pratap wrote: > >> Hi All >> >> Just wanted to check on the expected behavior of the summarizeOverlaps >> function in the GenomicRanges package for the following cases. >> >> >> 1. In the case of paired end data does it count each read pair as 1 or 2 >> against a feature. >> > > Pairs are counted as a single 'hit' reguardless if one or both mates > overlap the feature. > > > >> 2. For stranded protocols of RNA-Seq does it take it account the opposite >> strand of the mate of a read when counting or does it take only one read >> matching the strand into consideration. >> > > When counting, pairs are treated as though they are from the same strand. > In general, you can ignore the strand when counting by setting the > 'ignore.strand=TRUE'. > > > >> 3. For second strand RNA-Seq protocol where the read-1 matches to the >> opposite strand of gene will summarizeOverlap work ? >> > > I'm not sure what you mean. Please provide an example. > > > summarizeOverlaps() reads the data from a BAM into a GAlignmentPairs or > GAlignmentsList container for counting. You can investigate the behavior > using a small test case. > > ## paired-end record > ga1 <- GAlignments("chr1", 1L, "10M", strand("+")) > ga2 <- GAlignments("chr1", 15L, "11M", strand("-")) > galp <- GAlignmentPairs(ga1, ga2, TRUE) > > ## annotation > ann <- GRanges("chr1", IRanges(c(1, 5, 12, 20), c(25, 20, 14, 30)), "-") > > ## all with mode="Union" > se1 <- summarizeOverlaps(ann[1], galp, ignore.strand=TRUE) > > assays(se1)$counts > reads > [1,] 1 > se2 <- summarizeOverlaps(ann[1], galp, ignore.strand=FALSE) > > assays(se2)$counts > reads > [1,] 0 > se3 <- summarizeOverlaps(ann[4], galp, ignore.strand=TRUE) > > assays(se3)$counts > reads > [1,] 1 > > > Valerie > > I think thats a good idea to test it on a artificial set. Just a quick question. I cant seem to use the constructor GAlingments(). Is that something to do with the version ? library("GenomicRanges") library("Rsamtools") GAlignments("chr1",1L,"10M",strand("+")) >Error: could not find function "GAlignments" Partial output from sessionInfo() other attached packages: [1] Rsamtools_1.12.4 Biostrings_2.28.0 GenomicRanges_1.12.5 IRanges_1.18.4 BiocGenerics_0.6.0 Thanks! -Abhi > > >> Thanks! >> -Abhi >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
On 10/02/2013 04:31 PM, Abhishek Pratap wrote: > Hi > > > On Wed, Oct 2, 2013 at 10:28 AM, Valerie Obenchain <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> wrote: > > Hi, > > > On 10/02/2013 12:35 AM, Abhishek Pratap wrote: > > Hi All > > Just wanted to check on the expected behavior of the > summarizeOverlaps > function in the GenomicRanges package for the following cases. > > > 1. In the case of paired end data does it count each read pair > as 1 or 2 > against a feature. > > > Pairs are counted as a single 'hit' reguardless if one or both mates > overlap the feature. > > > > 2. For stranded protocols of RNA-Seq does it take it account the > opposite > strand of the mate of a read when counting or does it take only > one read > matching the strand into consideration. > > > When counting, pairs are treated as though they are from the same > strand. In general, you can ignore the strand when counting by > setting the 'ignore.strand=TRUE'. > > > > 3. For second strand RNA-Seq protocol where the read-1 matches > to the > opposite strand of gene will summarizeOverlap work ? > > > I'm not sure what you mean. Please provide an example. > > > summarizeOverlaps() reads the data from a BAM into a GAlignmentPairs > or GAlignmentsList container for counting. You can investigate the > behavior using a small test case. > > ## paired-end record > ga1 <- GAlignments("chr1", 1L, "10M", strand("+")) > ga2 <- GAlignments("chr1", 15L, "11M", strand("-")) > galp <- GAlignmentPairs(ga1, ga2, TRUE) > > ## annotation > ann <- GRanges("chr1", IRanges(c(1, 5, 12, 20), c(25, 20, 14, 30)), "-") > > ## all with mode="Union" > se1 <- summarizeOverlaps(ann[1], galp, ignore.strand=TRUE) > > assays(se1)$counts > reads > [1,] 1 > se2 <- summarizeOverlaps(ann[1], galp, ignore.strand=FALSE) > > assays(se2)$counts > reads > [1,] 0 > se3 <- summarizeOverlaps(ann[4], galp, ignore.strand=TRUE) > > assays(se3)$counts > reads > [1,] 1 > > > Valerie > > > I think thats a good idea to test it on a artificial set. Just a quick > question. I cant seem to use the constructor GAlingments(). Is that > something to do with the version ? > > > library("GenomicRanges") > library("Rsamtools") > GAlignments("chr1",1L,"10M",strand("+")) > >Error: could not find function "GAlignments" The class is called GappedAlignments in release and GAlignments in devel (name changed this cycle). For an example of constructing one, see the summarizeOverlaps man page. ?summarizeOverlaps Valerie > > > Partial output from sessionInfo() > other attached packages: > [1] Rsamtools_1.12.4 Biostrings_2.28.0 GenomicRanges_1.12.5 > IRanges_1.18.4 BiocGenerics_0.6.0 > > > Thanks! > -Abhi > > > > Thanks! > -Abhi > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > >
ADD REPLY
0
Entering edit mode
Hi Valerie Here are follow up questions. Let me ask them through examples. #creating the feature gene1 <- GRanges("chr1", IRanges(c(1,40),c(20,60)),"+") annotation <- GRangesList("gene1"=gene1) 1. even if one mate of the read falls outside the annotation feature it is still counted .. r1 <- GAlignments("chr1",1L,"10M",strand("+")) r2 <- GAlignments("chr1",65L,"10M",strand("-")) pair <- GAlignmentPairs(r1,r2,isProperPair=TRUE) result <- summarizeOverlaps(annotation,pair,ignore.strand=TRUE) > assays(result)$count reads gene1 1 2. if the read 1 is on the opp strand compared to annotation (which can happen in second strand RNA-Seq protocol) the counting is not proper unless we ignore the strand. I guess the counting is only done based on the strand of the read 1 and ignores the read 2 strand ...? r1 <- GAlignments("chr1",1L,"10M",strand("-")) r2 <- GAlignments("chr1",65L,"10M",strand("+")) pair <- GAlignmentPairs(r1,r2,isProperPair=TRUE) result <- summarizeOverlaps(annotation,pair,ignore.strand=FALSE) > assays(result)$count reads gene1 0 I am a bit skeptical if we should turn on the non stranded counting for stranded paired RNA-Seq data. I guess a proper RNA-Seq read pair should have reads on the opposite strand for it be counted. They can be either +- or -+ depending on the protocol and checking this sanity as part of the counting process would help. I understand this conversation could lead to confusion over email. Let me know if you want to meet in person and discuss. Cheers! -Abhi On Thu, Oct 3, 2013 at 8:27 AM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > On 10/02/2013 04:31 PM, Abhishek Pratap wrote: > >> Hi >> >> >> On Wed, Oct 2, 2013 at 10:28 AM, Valerie Obenchain <vobencha@fhcrc.org>> <mailto:vobencha@fhcrc.org>> wrote: >> >> Hi, >> >> >> On 10/02/2013 12:35 AM, Abhishek Pratap wrote: >> >> Hi All >> >> Just wanted to check on the expected behavior of the >> summarizeOverlaps >> function in the GenomicRanges package for the following cases. >> >> >> 1. In the case of paired end data does it count each read pair >> as 1 or 2 >> against a feature. >> >> >> Pairs are counted as a single 'hit' reguardless if one or both mates >> overlap the feature. >> >> >> >> 2. For stranded protocols of RNA-Seq does it take it account the >> opposite >> strand of the mate of a read when counting or does it take only >> one read >> matching the strand into consideration. >> >> >> When counting, pairs are treated as though they are from the same >> strand. In general, you can ignore the strand when counting by >> setting the 'ignore.strand=TRUE'. >> >> >> >> 3. For second strand RNA-Seq protocol where the read-1 matches >> to the >> opposite strand of gene will summarizeOverlap work ? >> >> >> I'm not sure what you mean. Please provide an example. >> >> >> summarizeOverlaps() reads the data from a BAM into a GAlignmentPairs >> or GAlignmentsList container for counting. You can investigate the >> behavior using a small test case. >> >> ## paired-end record >> ga1 <- GAlignments("chr1", 1L, "10M", strand("+")) >> ga2 <- GAlignments("chr1", 15L, "11M", strand("-")) >> galp <- GAlignmentPairs(ga1, ga2, TRUE) >> >> ## annotation >> ann <- GRanges("chr1", IRanges(c(1, 5, 12, 20), c(25, 20, 14, 30)), >> "-") >> >> ## all with mode="Union" >> se1 <- summarizeOverlaps(ann[1], galp, ignore.strand=TRUE) >> > assays(se1)$counts >> reads >> [1,] 1 >> se2 <- summarizeOverlaps(ann[1], galp, ignore.strand=FALSE) >> > assays(se2)$counts >> reads >> [1,] 0 >> se3 <- summarizeOverlaps(ann[4], galp, ignore.strand=TRUE) >> > assays(se3)$counts >> reads >> [1,] 1 >> >> >> Valerie >> >> >> I think thats a good idea to test it on a artificial set. Just a quick >> question. I cant seem to use the constructor GAlingments(). Is that >> something to do with the version ? >> >> >> library("GenomicRanges") >> library("Rsamtools") >> GAlignments("chr1",1L,"10M",**strand("+")) >> >Error: could not find function "GAlignments" >> > > The class is called GappedAlignments in release and GAlignments in devel > (name changed this cycle). For an example of constructing one, see the > summarizeOverlaps man page. > > ?summarizeOverlaps > > > Valerie > > > >> >> Partial output from sessionInfo() >> other attached packages: >> [1] Rsamtools_1.12.4 Biostrings_2.28.0 GenomicRanges_1.12.5 >> IRanges_1.18.4 BiocGenerics_0.6.0 >> >> >> Thanks! >> -Abhi >> >> >> >> Thanks! >> -Abhi >> >> [[alternative HTML version deleted]] >> >> ______________________________**___________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> https://stat.ethz.ch/mailman/_**_listinfo/bioconductor<http s:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https="" :="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> > >> Search the archives: >> http://news.gmane.org/gmane.__**science.biology.informatics.__** >> conductor<http: news.gmane.org="" gmane.__science.biology.informatics="" .__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> > >> >> >> >> > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi, On 10/03/2013 12:16 PM, Abhishek Pratap wrote: > Hi Valerie > > Here are follow up questions. Let me ask them through examples. > > > #creating the feature > gene1 <- GRanges("chr1", IRanges(c(1,40),c(20,60)),"+") > annotation <- GRangesList("gene1"=gene1) > > > 1. even if one mate of the read falls outside the annotation feature it > is still counted .. Yes. This is intended behavior. > r1 <- GAlignments("chr1",1L,"10M",strand("+")) > r2 <- GAlignments("chr1",65L,"10M",strand("-")) > pair <- GAlignmentPairs(r1,r2,isProperPair=TRUE) > result <- summarizeOverlaps(annotation,pair,ignore.strand=TRUE) > > assays(result)$count > reads > gene1 1 > > > 2. if the read 1 is on the opp strand compared to annotation (which can > happen in second strand RNA-Seq protocol) the counting is not proper > unless we ignore the strand. I guess the counting is only done based on > the strand of the read 1 and ignores the read 2 strand ...? Yes, overlaps is done based on the strand of the first read (aka first segment in template). How pairs are paired and how strand is determined is described in detail on the man page ?GAlignmentPairs (?GappedAlignmentPairs in release). Specifically, see the 'Details' and section on 'strand(x)'. > r1 <- GAlignments("chr1",1L,"10M",strand("-")) > r2 <- GAlignments("chr1",65L,"10M",strand("+")) > pair <- GAlignmentPairs(r1,r2,isProperPair=TRUE) > result <- summarizeOverlaps(annotation,pair,ignore.strand=FALSE) > > assays(result)$count > reads > gene1 0 > > > I am a bit skeptical if we should turn on the non stranded counting for > stranded paired RNA-Seq data. I guess a proper RNA-Seq read pair should > have reads on the opposite strand for it be counted. They can be either > +- or -+ depending on the protocol and checking this sanity as part of > the counting process would help. Part of the pairing process does involve checking that mates came from opposite strands. See these pages for details of the pairing algorithm. ?readBamGappedAlignmentPairs ?findMateAlignment The question of exactly how stranded paired RNA-Seq data should be counted is better answered by others on the list. Specifically, should you be looking for overlap on the '+' or '-' or both strands. > > I understand this conversation could lead to confusion over email. Let > me know if you want to meet in person and discuss. Sure, stop by anytime if you want to discuss in person. Valerie > > Cheers! > -Abhi > > > On Thu, Oct 3, 2013 at 8:27 AM, Valerie Obenchain <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> wrote: > > On 10/02/2013 04:31 PM, Abhishek Pratap wrote: > > Hi > > > On Wed, Oct 2, 2013 at 10:28 AM, Valerie Obenchain > <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">>> wrote: > > Hi, > > > On 10/02/2013 12:35 AM, Abhishek Pratap wrote: > > Hi All > > Just wanted to check on the expected behavior of the > summarizeOverlaps > function in the GenomicRanges package for the following > cases. > > > 1. In the case of paired end data does it count each > read pair > as 1 or 2 > against a feature. > > > Pairs are counted as a single 'hit' reguardless if one or > both mates > overlap the feature. > > > > 2. For stranded protocols of RNA-Seq does it take it > account the > opposite > strand of the mate of a read when counting or does it > take only > one read > matching the strand into consideration. > > > When counting, pairs are treated as though they are from > the same > strand. In general, you can ignore the strand when counting by > setting the 'ignore.strand=TRUE'. > > > > 3. For second strand RNA-Seq protocol where the read-1 > matches > to the > opposite strand of gene will summarizeOverlap work ? > > > I'm not sure what you mean. Please provide an example. > > > summarizeOverlaps() reads the data from a BAM into a > GAlignmentPairs > or GAlignmentsList container for counting. You can > investigate the > behavior using a small test case. > > ## paired-end record > ga1 <- GAlignments("chr1", 1L, "10M", strand("+")) > ga2 <- GAlignments("chr1", 15L, "11M", strand("-")) > galp <- GAlignmentPairs(ga1, ga2, TRUE) > > ## annotation > ann <- GRanges("chr1", IRanges(c(1, 5, 12, 20), c(25, 20, > 14, 30)), "-") > > ## all with mode="Union" > se1 <- summarizeOverlaps(ann[1], galp, ignore.strand=TRUE) > > assays(se1)$counts > reads > [1,] 1 > se2 <- summarizeOverlaps(ann[1], galp, ignore.strand=FALSE) > > assays(se2)$counts > reads > [1,] 0 > se3 <- summarizeOverlaps(ann[4], galp, ignore.strand=TRUE) > > assays(se3)$counts > reads > [1,] 1 > > > Valerie > > > I think thats a good idea to test it on a artificial set. Just a > quick > question. I cant seem to use the constructor GAlingments(). Is that > something to do with the version ? > > > library("GenomicRanges") > library("Rsamtools") > GAlignments("chr1",1L,"10M",__strand("+")) > >Error: could not find function "GAlignments" > > > The class is called GappedAlignments in release and GAlignments in > devel (name changed this cycle). For an example of constructing one, > see the summarizeOverlaps man page. > > ?summarizeOverlaps > > > Valerie > > > > > Partial output from sessionInfo() > other attached packages: > [1] Rsamtools_1.12.4 Biostrings_2.28.0 GenomicRanges_1.12.5 > IRanges_1.18.4 BiocGenerics_0.6.0 > > > Thanks! > -Abhi > > > > Thanks! > -Abhi > > [[alternative HTML version deleted]] > > ___________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/____listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: > http://news.gmane.org/gmane.____science.biology.informatics. ____conductor > <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor=""> > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > > >
ADD REPLY

Login before adding your answer.

Traffic: 671 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6