edgeR: a question about library size


raffaele calogero ▴ 500

@raffaele-calogero-294

Last seen 9.4 years ago

Italy/Turin/University of Torino

Hi,

I am using edgeR to detect differential expression in NGS experiments. I have a brief question about what I should consider as the "total size of my libraries". In my case, I have a set of samples with quite a large variation in library size:

Sample  Total reads  Mapped reads
     1     11076283       8736308
     2      5881045       4006468
     3      7139703       5108608
     4      9089153       5643701
     5      9723103       8457914
     6     15570265       8706332
     7     15844448      12056310
     8     13375681       8663496
     9     14997114       8799752
    10     15744584       8555922
    11      4642056       3201515
    12      6458028       4277204
    13     13206724       9466118
    14      3035032       2148730

Should I pass as the lib.size parameter the values for the real size of the libraries (Total reads), or simply the number of mapped reads (Mapped reads)?

Thanks for the help,
Raffaele

Prof. Raffaele A. Calogero
Bioinformatics and Genomics Unit
Dipartimento di Scienze Cliniche e Biologiche
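To make the difference between the two candidate library sizes concrete, here is a small sketch in plain Python (not edgeR) that computes the mapped fraction for each sample in the table from the question. The variable names are chosen for illustration only. The fraction ranges from roughly 54% to 87%, so the two definitions rank and scale the samples differently.

```python
# Total and mapped read counts per sample, copied from the table in the question.
total_reads = [11076283, 5881045, 7139703, 9089153, 9723103, 15570265, 15844448,
               13375681, 14997114, 15744584, 4642056, 6458028, 13206724, 3035032]
mapped_reads = [8736308, 4006468, 5108608, 5643701, 8457914, 8706332, 12056310,
                8663496, 8799752, 8555922, 3201515, 4277204, 9466118, 2148730]

# Fraction of each library that mapped.
mapped_fraction = [m / t for m, t in zip(mapped_reads, total_reads)]

for sample, frac in enumerate(mapped_fraction, start=1):
    print(f"sample {sample:2d}: {frac:.1%} mapped")

# Ratio of the largest to the smallest library under each definition.
total_ratio = max(total_reads) / min(total_reads)
mapped_ratio = max(mapped_reads) / min(mapped_reads)
print(f"total reads,  max/min: {total_ratio:.2f}")
print(f"mapped reads, max/min: {mapped_ratio:.2f}")
```

Because the mapped fraction is far from constant, choosing total reads over mapped reads is not just a uniform rescaling of the library sizes.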

edgeR edgeR • 1.6k views

Updated 14.8 years ago by Mark Robinson ★ 1.1k • written 14.8 years ago by raffaele calogero ▴ 500


Mark Robinson ★ 1.1k

@mark-robinson-2171

Last seen 10.6 years ago

Hi Raffaele.

In my experience, you're better off with the number of mapped reads. A safer approach, though, is to do something data-driven. For example, TMM normalization (http://genomebiology.com/2010/11/3/R25) is implemented in the calcNormFactors() function. See also the docs and the user's guide.

Hope that helps.

Cheers,
Mark

Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
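As a rough illustration of what a data-driven scaling factor does, here is a simplified sketch of the TMM idea in plain Python. It is not edgeR's calcNormFactors() implementation: the real function also chooses a reference sample and rescales the factors, among other details that are left out here. The function name tmm_factor and the counts are invented for the example. The observed sample is sequenced at twice the depth of the reference, but one gene is strongly up-regulated and inflates its raw library size.

```python
import math

def tmm_factor(obs, ref, trim_m=0.30, trim_a=0.05):
    """Simplified TMM scaling factor of `obs` relative to `ref` (per-gene counts)."""
    n_obs, n_ref = sum(obs), sum(ref)
    rows = []
    for y_o, y_r in zip(obs, ref):
        if y_o == 0 or y_r == 0:
            continue  # log-ratios are undefined for zero counts
        p_o, p_r = y_o / n_obs, y_r / n_ref
        m = math.log2(p_o / p_r)        # log fold change of proportions
        a = 0.5 * math.log2(p_o * p_r)  # average log abundance
        # Approximate variance of the log-ratio; its inverse is the weight.
        var = (n_obs - y_o) / (n_obs * y_o) + (n_ref - y_r) / (n_ref * y_r)
        rows.append((m, a, 1.0 / var))
    n = len(rows)

    def kept(values, trim):
        # Indices that survive trimming a fraction `trim` from each tail.
        order = sorted(range(n), key=lambda i: values[i])
        cut = int(n * trim)
        return set(order[cut:n - cut])

    keep = kept([r[0] for r in rows], trim_m) & kept([r[1] for r in rows], trim_a)
    weighted_sum = sum(rows[i][2] * rows[i][0] for i in keep)
    weight_total = sum(rows[i][2] for i in keep)
    return 2 ** (weighted_sum / weight_total)

# Invented toy counts for 20 genes (not data from this thread).
ref = [50, 80, 120, 200, 310, 45, 60, 75, 90, 150,
       400, 500, 35, 220, 130, 95, 640, 70, 110, 100]
obs = [2 * c for c in ref]  # same composition at twice the depth ...
obs[-1] = 50 * ref[-1]      # ... except one strongly up-regulated gene

factor = tmm_factor(obs, ref)
raw_ratio = sum(obs) / sum(ref)
effective_ratio = factor * sum(obs) / sum(ref)
print(f"raw library size ratio:       {raw_ratio:.3f}")
print(f"TMM factor:                   {factor:.3f}")
print(f"effective library size ratio: {effective_ratio:.3f}")
```

In this toy case the raw library sizes suggest the observed sample is about 3.4 times deeper, while the trimmed estimate ignores the outlier gene and recovers the ratio of 2 that applies to the unchanged genes.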

Written 14.8 years ago by Mark Robinson ★ 1.1k


I do not understand how we can do a multiplicative read count adjustment.

If K is Poisson(v), then E(K) = Var(K) = v. If we do twice the sequencing effort, then E(K) = Var(K) = 2v. But if we multiply the counts by 2, then E(2K) = 2v and Var(2K) = 4v. So how can this type of adjustment work properly?

--Naomi

Naomi S. Altman
Associate Professor, Dept. of Statistics
Penn State University

Written 14.8 years ago by Naomi Altman ★ 6.0k


Hi Naomi.

Well, you don't actually adjust the read counts. Basically, you are estimating the offset (from all the data) that gets used in the generalized linear model.

Cheers,
Mark
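To illustrate the offset idea for the Poisson case Naomi raised, here is a minimal sketch in plain Python. It assumes a two-group design with the model log(mu_i) = log(N_i) + beta_group, where N_i is the (effective) library size of sample i. For that model the maximum-likelihood rate of each group has a closed form: the summed counts divided by the summed library sizes. edgeR itself fits a negative binomial model, so this is only an analogy, and the counts, library sizes, and function name are invented for the example.

```python
import math

def poisson_rate_mle(counts, lib_sizes):
    """MLE of the per-read rate for one group when y_i ~ Poisson(lib_size_i * rate)."""
    return sum(counts) / sum(lib_sizes)

# Invented counts for a single gene; library sizes differ within each group.
counts_a, libs_a = [40, 80], [4_000_000, 8_000_000]
counts_b, libs_b = [100, 200], [5_000_000, 10_000_000]

rate_a = poisson_rate_mle(counts_a, libs_a)
rate_b = poisson_rate_mle(counts_b, libs_b)
log2_fc = math.log2(rate_b / rate_a)

# Fitted means live on the scale of the raw counts: mu_i = N_i * rate.
# Under the Poisson model each fitted variance equals its fitted mean,
# so a deeper library gets a larger mean and variance without scaling the counts.
fitted_a = [n * rate_a for n in libs_a]
fitted_b = [n * rate_b for n in libs_b]

print(f"rate A = {rate_a:.2e}, rate B = {rate_b:.2e}")
print(f"log2 fold change (B vs A) = {log2_fc:.3f}")
print("fitted means, group A:", fitted_a)
print("fitted means, group B:", fitted_b)
```

The counts are never multiplied by anything. The library size only changes the expected value that each raw count is compared against, which is why the Var(2K) = 4v problem does not arise.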

Written 14.8 years ago by Mark Robinson ★ 1.1k
