Can the normalization factors be too far apart in edgeR analyses?

Hoskins, Jason NIH/NCI [F] ▴ 10

@hoskins-jason-nihnci-f-5413

Last seen 10.6 years ago

Hello,

I have RNA-seq data from 10 normal samples and 8 tumor samples, which I am analyzing with edgeR for differential expression (DE) between the tumors and the normals. I have basically followed the workflow in section 3.3 of the edgeR user's guide. It is known that there is a large RNA compositional bias in these normal tissue samples (i.e. the top 25 genes by raw counts account for 50-80% of the total reads) that is not present in the tumor samples, so normalization via edgeR's calcNormFactors() is presumably very important. The results from calcNormFactors() are printed below with anonymized sample names.

          group     lib.size  norm.factors
Sample1   Normals  136765371     1.0567240
Sample2   Normals  116803340     0.5898912
Sample3   Normals   88783007     0.5880073
Sample4   Normals  314426955     0.6871909
Sample5   Normals  289961788     0.5574136
Sample6   Normals  296455983     0.3413478
Sample7   Normals  260923863     0.7353922
Sample8   Normals  118870482     0.7742314
Sample9   Normals  237556345     0.5113664
Sample10  Normals  126493394     0.3916818
Sample11  Tumors    90611059     1.7934781
Sample12  Tumors    93423641     2.0290747
Sample13  Tumors   122360083     1.9691099
Sample14  Tumors    80575136     1.9405350
Sample15  Tumors   104183711     1.7019891
Sample16  Tumors   112372313     2.0484955
Sample17  Tumors   102789103     1.8569770
Sample18  Tumors    96733614     2.0323221

My first question: what is used as the reference in the default TMM method's calculation of the normalization factors? The user's guide and other documentation state that the reference is "the sample whose 75%-ile (of library-scale-scaled counts) is closest to the mean of 75%-iles." Presumably the normalization factor for the reference sample should be 1.0, but none of my samples has a normalization factor of 1.0 (the closest is Sample1 with 1.0567240).

My second question: should I be concerned about the large variation in normalization factors within the normals group, and the even larger difference in normalization factors between the normals and the tumors? I guess it's not all that surprising that the normalization factors are very different between normals and tumors given the huge compositional bias in the normal samples, but is the TMM method robust enough to handle these differences?

Is TMM the best method for this type of normalization?

Thanks for your help!

-Jason

Normalization edgeR • 1.2k views

updated 12.8 years ago by Mark Robinson ▴ 880 • written 12.8 years ago by Hoskins, Jason NIH/NCI [F] ▴ 10

Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 6.5 years ago

Hi Jason,

Some comments below.

On 20.07.2012, at 23:16, Hoskins, Jason (NIH/NCI) [F] wrote:

> Hello,
>
> I have RNA-seq data from 10 normal samples and 8 tumor samples, which I am analyzing with edgeR for differential expression (DE) between the tumors and the normals. I have basically followed the workflow in section 3.3 of the edgeR user's guide. It is known that there is a large RNA compositional bias in these normal tissue samples (i.e. the top 25 genes by raw counts account for 50-80% of the total reads) that is not present in the tumor samples, so normalization via edgeR's calcNormFactors() is presumably very important. The results from calcNormFactors() are printed below with anonymized sample names.
>
>           group     lib.size  norm.factors
> Sample1   Normals  136765371     1.0567240
> Sample2   Normals  116803340     0.5898912
> Sample3   Normals   88783007     0.5880073
> Sample4   Normals  314426955     0.6871909
> Sample5   Normals  289961788     0.5574136
> Sample6   Normals  296455983     0.3413478
> Sample7   Normals  260923863     0.7353922
> Sample8   Normals  118870482     0.7742314
> Sample9   Normals  237556345     0.5113664
> Sample10  Normals  126493394     0.3916818
> Sample11  Tumors    90611059     1.7934781
> Sample12  Tumors    93423641     2.0290747
> Sample13  Tumors   122360083     1.9691099
> Sample14  Tumors    80575136     1.9405350
> Sample15  Tumors   104183711     1.7019891
> Sample16  Tumors   112372313     2.0484955
> Sample17  Tumors   102789103     1.8569770
> Sample18  Tumors    96733614     2.0323221
>
> My first question: what is used as the reference in the default TMM method's calculation of the normalization factors? The user's guide and other documentation state that the reference is "the sample whose 75%-ile (of library-scale-scaled counts) is closest to the mean of 75%-iles." Presumably the normalization factor for the reference sample should be 1.0, but none of my samples has a normalization factor of 1.0 (the closest is Sample1 with 1.0567240).

Read a bit further in the docs and it says: "For symmetry, normalization factors are adjusted to multiply to 1. The effective library size is then the original library size multiplied by the scaling factor." That's why there is no sample with factor=1.

> My second question: should I be concerned about the large variation in normalization factors within the normals group, and the even larger difference in normalization factors between the normals and the tumors? I guess it's not all that surprising that the normalization factors are very different between normals and tumors given the huge compositional bias in the normal samples, but is the TMM method robust enough to handle these differences?

It's tough to know whether to be concerned based on these numbers alone. I suggest having a look at some pairwise MA-plots: within normals, within cancers, and between the two groups. Sample6 versus Sample16, for example, is the most extreme pair. I will say that these factors are amongst the most extreme that I've seen, but it really depends on the data.

> Is TMM the best method for this type of normalization?

Questions regarding which method is "best" are not easy to answer, and the answer is often dataset-dependent. TMM is good at what it does: removing a systematic bias between samples. It doesn't account for everything (e.g. sample-specific GC content effects), so if your data exhibit these, consider looking at the BioC packages cqn and EDASeq.

Best,
Mark
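To make the "adjusted to multiply to 1" point concrete, here is a minimal Python sketch (plain arithmetic on the numbers posted in the question, not edgeR code). It checks that the posted factors multiply to roughly 1, computes effective library sizes as library size times factor, and finds the pair of samples with the most extreme factors. The variable names are just illustrative.

```python
import math

# (lib.size, norm.factors) exactly as posted in the question.
samples = {
    "Sample1": (136765371, 1.0567240),
    "Sample2": (116803340, 0.5898912),
    "Sample3": (88783007, 0.5880073),
    "Sample4": (314426955, 0.6871909),
    "Sample5": (289961788, 0.5574136),
    "Sample6": (296455983, 0.3413478),
    "Sample7": (260923863, 0.7353922),
    "Sample8": (118870482, 0.7742314),
    "Sample9": (237556345, 0.5113664),
    "Sample10": (126493394, 0.3916818),
    "Sample11": (90611059, 1.7934781),
    "Sample12": (93423641, 2.0290747),
    "Sample13": (122360083, 1.9691099),
    "Sample14": (80575136, 1.9405350),
    "Sample15": (104183711, 1.7019891),
    "Sample16": (112372313, 2.0484955),
    "Sample17": (102789103, 1.8569770),
    "Sample18": (96733614, 2.0323221),
}

norm_factors = {name: nf for name, (_, nf) in samples.items()}

# If the factors were rescaled to multiply to 1, this product should be ~1
# (up to rounding in the printed values), so no single sample needs to sit at 1.0.
product = math.prod(norm_factors.values())

# Effective library size = original library size * normalization factor.
effective = {name: lib * nf for name, (lib, nf) in samples.items()}

# The pair with the smallest and largest factors is the most extreme comparison.
lowest = min(norm_factors, key=norm_factors.get)
highest = max(norm_factors, key=norm_factors.get)
ratio = norm_factors[highest] / norm_factors[lowest]

print(f"product of all factors: {product:.5f}")
print(f"most extreme pair: {lowest} vs {highest} (factor ratio {ratio:.2f})")
for name in (lowest, highest):
    print(f"{name}: effective library size {effective[name]:,.0f}")
```

The lowest/highest pair it reports is the one worth inspecting first in an MA-plot.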

12.8 years ago • Mark Robinson ▴ 880
