edgeR calcNormFactors for paired counts


Christopher T Gregg ▴ 210

@christopher-t-gregg-4973

Last seen 10.6 years ago

Hi,

We are examining the use of edgeR to analyze allele-specific count data from RNASeq experiments. In these studies, each biological replicate (n=18) has two columns: one with counts from the maternal allele and the other with counts from the paternal allele for each gene. The data are therefore paired, since both counts are parsed from the same data for each replicate. We wish to fit a GLM that tests for a main effect of allele (counts ~ replicate + allele) to find genes that exhibit a significant allelic expression bias.

My question is how best to handle normalization of the counts in this case. edgeR's calcNormFactors computes a separate factor for each column, which disrupts the maternal:paternal count ratio for each gene in each sample. We would be grateful for advice on how best to analyze this type of data.

Best wishes,
Chris
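For concreteness, the paired layout and the additive model counts ~ replicate + allele described above might be laid out as follows. This is a minimal sketch in plain Python, not edgeR code; the function name `paired_design`, the column ordering, and the choice of reference levels (replicate 1, maternal allele) are illustrative assumptions rather than anything stated in the post.

```python
def paired_design(n_replicates):
    """Build a design matrix for counts ~ replicate + allele.

    Rows are ordered rep1-maternal, rep1-paternal, rep2-maternal, ...
    Columns are: intercept, indicators for replicates 2..n,
    and a final indicator for the paternal allele.
    """
    labels = []
    rows = []
    for rep in range(1, n_replicates + 1):
        for allele in ("maternal", "paternal"):
            row = [1]  # intercept
            # Replicate 1 is the reference level, so it gets no column.
            row += [1 if rep == k else 0 for k in range(2, n_replicates + 1)]
            # Maternal is the reference level; this column carries the allele effect.
            row.append(1 if allele == "paternal" else 0)
            labels.append(f"rep{rep}_{allele}")
            rows.append(row)
    return labels, rows


labels, design = paired_design(18)
print(len(design), "rows x", len(design[0]), "columns")
```

With 18 replicates this gives 36 rows (one per count column) and 19 coefficients. The last coefficient is the one a test for allelic bias would target, while the replicate columns absorb differences between individuals.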

RNASeq Normalization edgeR • 1.9k views

ADD COMMENT • link updated 10.9 years ago by Ryan C. Thompson ★ 7.9k • written 10.9 years ago by Christopher T Gregg ▴ 210


Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

Hi Chris,

I think what you want to do here is normalize at the level of individuals. To that end, I would generate the full count matrix for each individual at the gene level (including all reads for each individual, not just ones that cover heterozygous loci) and use that to compute library sizes and normalization factors. Then I would propagate those library sizes and normalization factors to your allele count matrix.

This will ensure that both alleles of each individual have the same normalization, and it will also ensure that all loci are normalized relative to the total RNA, which is not biased by where heterozygous alleles happen to occur.

-Ryan
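The arithmetic behind this suggestion can be sketched on toy numbers. The snippet below is plain Python rather than edgeR code, and every name and value in it is made up for illustration. In particular, `norm_factors` stands in for factors that would be computed from the total count matrix (for example by calcNormFactors) and is simply assumed here.

```python
# Total (allele-agnostic) gene-level counts: gene -> [individual 1, individual 2]
total_counts = {
    "geneA": [1000, 2000],
    "geneB": [500, 1500],
    "geneC": [8500, 16500],
}

# Allele-specific counts at genes with heterozygous sites:
# gene -> [(maternal, paternal) for each individual]
allele_counts = {
    "geneA": [(60, 40), (130, 70)],
    "geneB": [(20, 20), (45, 55)],
}

# Assumed per-individual normalization factors derived from total_counts.
norm_factors = [1.0, 0.95]


def library_sizes(counts):
    """Sum each individual's column over all genes."""
    n_individuals = len(next(iter(counts.values())))
    return [sum(row[i] for row in counts.values()) for i in range(n_individuals)]


# One effective library size per individual, from the total counts only.
lib_sizes = library_sizes(total_counts)
effective = [size * factor for size, factor in zip(lib_sizes, norm_factors)]


def allele_cpm(counts, effective_sizes):
    """Scale both allele columns of an individual by the same effective size."""
    return {
        gene: [(m / e * 1e6, p / e * 1e6) for (m, p), e in zip(pairs, effective_sizes)]
        for gene, pairs in counts.items()
    }


cpm = allele_cpm(allele_counts, effective)
m, p = cpm["geneA"][0]
print(lib_sizes, m / p)
```

Because the maternal and paternal columns of an individual share one scaling, the maternal:paternal ratio of every gene is unchanged by normalization. In edgeR terms this would presumably mean giving both allele columns of an individual the same `lib.size` and `norm.factors` entries in the samples table, taken from the total-count analysis.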

ADD COMMENT • link 10.9 years ago Ryan C. Thompson ★ 7.9k


Terrific, Ryan. This was our thought as well, and we are very grateful to have your expert advice.

Thank you,
Chris

ADD REPLY • link 10.9 years ago Christopher T Gregg ▴ 210


Hi Chris,

I wouldn't go so far as to call myself an expert on the subject of normalization, but that is what makes the most sense to me. My other worry would be that the negative binomial GLM framework is not appropriate for the allele-specific expression (ASE) context, since you are in principle dealing with two alleles at each locus that each represent a fraction of the total expression at that locus, and these fractions must add to 1 at each locus.

You may be better off using another method designed for analyzing allele-specific expression. There are several such packages available on Bioconductor. I haven't done any ASE work myself, so I can't recommend any of them in particular. But if you want to evaluate edgeR for ASE, normalizing to allele-agnostic per-individual gene counts is how I would do it. I would be interested to know if this performs as well as models custom-built to handle ASE data.

-Ryan
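To illustrate the point that the two allelic fractions must add to 1: conditional on the total count at a locus, the maternal count can be treated as a binomial draw, and the null hypothesis of no allelic bias is a maternal fraction of 0.5. The sketch below is a plain-Python exact binomial test on a single locus in a single individual. It is an illustration of that constraint only, not a method recommended in this thread, and it ignores the between-replicate variability that a dedicated ASE model would account for. The function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k maternal reads out of n total.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one. Intended for modest n; very large totals would need
    a log-scale implementation to avoid numeric overflow.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # The small tolerance keeps ties (e.g. k and n - k when p = 0.5) together.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Toy locus: 60 maternal and 40 paternal reads.
maternal, paternal = 60, 40
p_value = binomial_two_sided_p(maternal, maternal + paternal)
print(p_value)
```

Only one of the two counts is free once the total is fixed, which is the sense in which the paired columns are not two independent negative binomial observations.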

ADD REPLY • link 10.9 years ago Ryan C. Thompson ★ 7.9k


Thank you, Ryan. We are comparing edgeR to some of these other options. I appreciate your point.

Best wishes,
Chris

ADD REPLY • link 10.9 years ago Christopher T Gregg ▴ 210
