limma-voom

1

Entering edit mode

Ming ▴ 380

@ming-yi-6363

Last seen 2.6 years ago

United States

Hi, Dr. Gordon and all: Just follow up on my RNAseq analysis on the TCGA RSEM processed so- called raw counts using limma-voom package as Dr. Gordon suggested due to the issue of so-called RSEM raw counts but not true raw counts that edgeR required. just try to get your opinion on it. Here are the commands I used (only show most relevant ones) > library(edgeR); > library(limma) > show(head(targets)); SampleName Subject Type 1 T4626_01A TCGA_38_4626 Tumor 2 N4626_11A TCGA_38_4626 Normal 3 T2668_01A TCGA_44_2668 Tumor 4 N2668_11A TCGA_44_2668 Normal > Patient<-factor(targets$Subject); > Type<-factor(targets$Type); > y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); > show(dim(y)); [1] 20531 58 ##Keep genes with least 1 count-per-million reads (cpm) in at least 20 samples: > isexpr <- rowSums(cpm(y)>1) >= 20; > y2 <- y[isexpr,] > show(dim(y2)); [1] 14735 58 > #Recompute the library sizes: > y2$samples$lib.size <- colSums(y2$counts) > #Use Entrez Gene IDs as row names: > rownames(y2$counts) <- rownames(y2$genes) <- y2$genes$entrezID; > ##Apply TMM normalization > y2 <- calcNormFactors(y2,method=c("TMM")); > plotMDS(y2,main="y2"); > show(data.frame(Sample=colnames(y2),Patient,Type)); Sample Patient Type 1 T4626_01A TCGA_38_4626 Tumor 2 N4626_11A TCGA_38_4626 Normal 3 T2668_01A TCGA_44_2668 Tumor 4 N2668_11A TCGA_44_2668 Normal ... > design<-model.matrix(~Patient+Type); > rownames(design) <- colnames(y2); > show(design[1:6,]); (Intercept) PatientTCGA_38_4626 PatientTCGA_38_4627 T4626_01A 1 1 0 N4626_11A 1 1 0 T2668_01A 1 0 0 ... > v <- voom(y2,design,plot=TRUE); > plotMDS(v, top=50,main="common for v of y2",labels=paste(substr(Type ,1,1),substr(Patient,9,12)),col=ifelse(Type=="Tumor","red","blue"),gen e.selection="common"); > plotMDS(v, top=50,main="pairwise for v of y2",labels=paste(substr(Ty pe,1,1),substr(Patient,9,12)),col=ifelse(Type=="Tumor","red","blue"),g ene.selection="pairwise"); > fit <- lmFit(v,design) > fit <- eBayes(fit) > options(digits=3) > Rslt<-topTable(fit,coef="TypeTumor",n=Inf,sort="p"); > show(summary(de <- decideTests(fit))); (Intercept) PatientTCGA_38_4626 PatientTCGA_38_4627 PatientTCGA_38_4632 -1 126 1 3 1 0 1805 14732 14725 14720 1 12804 2 7 14 ... PatientTCGA_91_6836 TypeTumor -1 4 3917 0 14718 5426 1 13 5392 Here are my questions: 1. I used isexpr <- rowSums(cpm(y)>1) >= 20 (y2 filtering) to do the filtering (based on one example in the package guide). I have 29 paired of matched tumor and normal samples (total 58 samples), is this reasonble to use for filtering, I asked this last time as well? I have a bit concern, since after filtering, the number of genes went from 20531 to 14735. Shall I use something like y3<-y[rowSums(y$counts)>=50,] (y3 filtering) to filter (this filtering leaves about 18215 genes, about 4k more genes than the 14735 from the other filtering way)? 2. I did plot with plotMDS(y2,main="y2"); v <- voom(y2,design,plot=TRUE); plotMDS(v, top=50,main...). The plotMDS all show very nice separation of tumors vs normals in dim1. However, the v <- voom(y2,design,plot=TRUE) plot slightly different between filtering with isexpr <- rowSums(cpm(y)>1) >= 20 vs filtering with y3<-y[rowSums(y$counts)>=50,]. The first filtering way seems got similar plot as the one on page 116 of the user guide, but the second way seems got bending toward to the down side around log2(count size +0.5) between 0 to 5, and the data points also shift to 0 side due to more of lower reads data points (the first option only have very few data points between 0 to 5 in x-axis). Does this suggest the first filtering way (y2) is better? 3. the above command show(summary(de <- decideTests(fit))); showed y2 filtering got 3917 +5392=9309 DEGs at FDR5% (7661 at FDR 1%), whereas the y3 filtering (similar commands as y2, not shown) got final 4917+6022=10939 DEGs at FDR5%, both filtering seem getting too many DEGs, any issues here? 4. using edgeR and the same data, I got more than 13k DEGs (total 18215 genes in the dataset after filtering) with FDR <0.05 (filtering like y3), now using limma voom, I still got quite a lot of DEGs (10939 DEGs at FDR5%) as shown in my question 3, although slight less than edgeR. Any potenital issue? Thanks so much in advance for your input and advice! Ming > From: yi02@hotmail.com > To: smyth@wehi.edu.au > Date: Thu, 30 Jan 2014 03:07:56 +0000 > CC: bioconductor@r-project.org > Subject: Re: [BioC] limma-voom > > Hi, Dr. Smyth: > > Thanks for the info. I used limma a lot but not much on voom. I will give a shot on voom. Thanks again and best Ming > > > Date: Thu, 30 Jan 2014 13:05:57 +1100 > > From: smyth@wehi.EDU.AU > > To: yi02@hotmail.com > > CC: bioconductor@r-project.org > > Subject: limma-voom > > > > Hi Ming, > > > > voom is part of the limma package. There is a voom case study in the > > limma User's Guide with complete working code. It is the last case study > > in the users guide. > > > > voom works fine with either counts, or fractional counts, or scaled > > counts. > > > > We are currently finalizing additional voom code to detect differential > > splicing. That code isn't in the public package, but will be soon. > > > > Best wishes > > Gordon > > > > On Thu, 30 Jan 2014, Ming Yi wrote: > > > > > I found the following documents > > > > > > http://stuff.mit.edu/afs/athena/software/r/r_v2.15.1/lib/R/libra ry/limma/html/voom.html > > > http://www.bioconductor.org/packages/release/bioc/vignettes/limm a/inst/doc/usersguide.pdf > > > http://www.statsci.org/smyth/pubs/VoomPreprint.pdf > > > > > > But I could not find the real limma-voom package document. Is voom just > > > as normalization function prior to using limma linear model? Any good > > > examples of case study of working commands available as example? > > > > > > Thanks again! > > > Ming > > > > ______________________________________________________________________ > > The information in this email is confidential and inte...{{dropped:9}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

RNASeq Normalization limma edgeR RNASeq Normalization limma edgeR • 4.9k views

ADD COMMENT • link updated 3.1 years ago by Gordon Smyth 52k • written 11.2 years ago by Ming ▴ 380

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 4 hours ago

WEHI, Melbourne, Australia

Just follow up on my RNAseq analysis on the TCGA RSEM processed so-called raw counts using limma-voom package as Dr. Gordon suggested due to the issue of so-called RSEM raw counts but not true raw counts that edgeR required.

edgeR will work OK if you simply round the RSEM expected counts to integers before running edgeR. However voom is slightly preferable because it can use the fractional counts as they are.

(1.) I used isexpr <- rowSums(cpm(y)>1) >= 20 (y2 filtering) to do the filtering (based on one example in the package guide). I have 29 paired of matched tumor and normal samples (total 58 samples), is this reasonble to use for filtering, I asked this last time as well? I have a bit concern, since after filtering, the number of genes went from 20531 to 14735.

Why would this be a concern? One wouldn't expect the whole genome to be expressed at worthwhile levels in each particular cell type. Some genes are specific to cell types.

(2.) I did plot with plotMDS(y2,main="y2"); v <- voom(y2,design,plot=TRUE); plotMDS(v, top=50,main...). The plotMDS all show very nice separation of tumors vs normals in dim1.

I have analysed similar data from TCGA, and the separation is really much too good. Furthermore there is little evidence of matching. It would appear that the matched normal cells are not really comparable cells to the tumours. The matched normal samples may actually be profiles of whole blood, a cell type which has a radically different expression profile to epithelial cells. I really wonder what one can hope to learn about cancer by comparing epithelial tumours to blood.

However, the v <- voom(y2,design,plot=TRUE) plot slightly different between filtering with isexpr <- rowSums(cpm(y)>1) >= 20 vs filtering with y3<-y[rowSums(y$counts)>=50,]. The first filtering way seems got similar plot as the one on page 116 of the user guide, but the second way seems got bending toward to the down side around log2(count size +0.5) between 0 to 5, and the data points also shift to 0 side due to more of lower reads data points (the first option only have very few data points between 0 to 5 in x-axis). Does this suggest the first filtering way (y2) is better?

Yes, but one will get good results either way.

(3.) the above command show(summary(de <- decideTests(fit))); showed y2 filtering got 3917 +5392=9309 DEGs at FDR5% (7661 at FDR 1%), whereas the y3 filtering (similar commands as y2, not shown) got final 4917+6022=10939 DEGs at FDR5%, both filtering seem getting too many DEGs, any issues here?

This is what I would expect to get if I compared epithelial tumour cells to normal blood. If this is "too many DEGs", then the problem is with the data rather than the analysis.

(4.) using edgeR and the same data, I got more than 13k DEGs (total 18215 genes in the dataset after filtering) with FDR <0.05 (filtering like y3), now using limma voom, I still got quite a lot of DEGs (10939 DEGs at FDR5%) as shown in my question 3, although slight less than edgeR. Any potenital issue?

The edgeR analysis on fractional counts is not correct. edgeR is a likelihood based package. It evaluates the negative binomial likelihood, which is always exactly zero at non-integers. So there is no point in comparing the edgeR results to the voom results.

Best wishes Gordon

Postscript

Some years after this answer was posted, edgeR does now support fractional counts.

ADD COMMENT • link 11.2 years ago • updated 3.1 years ago Gordon Smyth 52k

0

Entering edit mode

Hi, Just wanted to comment on the "paired" TCGA data: On Wed, Feb 5, 2014 at 2:38 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: [snip] > I have analysed similar data from TCGA, and the separation is really much > too good. Furthermore there is little evidence of matching. It would > appear that the matched normal cells are not really comparable cells to the > tumours. The matched normal samples may actually be profiles of whole > blood, a cell type which has a radically different expression profile to > epithelial cells. I really wonder what one can hope to learn about cancer > by comparing epithelial tumours to blood. There are samples where you have the "correct" matched sample for gene expression, as indeed comparing a solid tumor to "normal blood" to identify changes in gene expression would make very little sense. You just need to be careful in parsing the sample barcodes to ensure you are working with the correct stuff. An explanation of the barcode scheme is here: https://wiki.nci.nih.gov/display/TCGA/TCGA+Barcode You are interested in decoding the "sample" value, which ranges from 01-29. Explanation is here: https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTab le=Sample%20type There are samples taken from "normal blood" (code = 10), however code=11 are solid tissue normals (presumably from the correct tissue ;-) The preponderance of blood samples is, I believe, primarily to be used in order to identify somatic mutations -- presumably using the exome-seq from "normal blood" would be cleaner than primary tissue, depending on how close to the tumor the tissue was taken from. To figure out what was extracted from the "sample" (DNA, RNA, etc), you just decode the "Analyte" part of the barcode: https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTab le=portion%20analyte HTH, -steve -- Steve Lianoglou Computational Biologist Genentech

ADD REPLY • link 11.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi, Just wanted to add in a little more on the "normal" adjacent samples that TCGA uses. I work with breast cancer samples, and it has appeared that some(if not many) of these "normal" adjacent samples are really similar to the tumour themselves. This has been reported earlier (see http://cancerres.aacrjournals.org/cgi/content/meeting_abstract/72/24_M eetingAbstracts/P1-03-02 ). Just thought it will be worth sharing. Cheers, Jeremy On Thu, Feb 6, 2014 at 7:23 AM, Steve Lianoglou <lianoglou.steve@gene.com>wrote: > Hi, > > Just wanted to comment on the "paired" TCGA data: > > On Wed, Feb 5, 2014 at 2:38 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote: > [snip] > > I have analysed similar data from TCGA, and the separation is really much > > too good. Furthermore there is little evidence of matching. It would > > appear that the matched normal cells are not really comparable cells to > the > > tumours. The matched normal samples may actually be profiles of whole > > blood, a cell type which has a radically different expression profile to > > epithelial cells. I really wonder what one can hope to learn about > cancer > > by comparing epithelial tumours to blood. > > There are samples where you have the "correct" matched sample for gene > expression, as indeed comparing a solid tumor to "normal blood" to > identify changes in gene expression would make very little sense. > > You just need to be careful in parsing the sample barcodes to ensure > you are working with the correct stuff. > > An explanation of the barcode scheme is here: > https://wiki.nci.nih.gov/display/TCGA/TCGA+Barcode > > You are interested in decoding the "sample" value, which ranges from 01-29. > > Explanation is here: > > https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeT able=Sample%20type > > There are samples taken from "normal blood" (code = 10), however > code=11 are solid tissue normals (presumably from the correct tissue > ;-) > > The preponderance of blood samples is, I believe, primarily to be used > in order to identify somatic mutations -- presumably using the > exome-seq from "normal blood" would be cleaner than primary tissue, > depending on how close to the tumor the tissue was taken from. > > To figure out what was extracted from the "sample" (DNA, RNA, etc), > you just decode the "Analyte" part of the barcode: > > https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeT able=portion%20analyte > > > HTH, > -steve > > -- > Steve Lianoglou > Computational Biologist > Genentech > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 11.2 years ago Jeremy Ng ▴ 180

0

Entering edit mode

Yes, "normal" is a euphemism. Very often, asymptomatic cells will turn out to have microenvironmental cues that lead to co-opting by tumor cells or blasts. But they're nearby, and they looked normal at the time of debulking/collection, and so it goes. The blood samples are for germline genotyping in solid tumors (obviously that wouldn't have worked for leukemia, so in that case skin punches were used as normals). As it happens, that's more interesting than somatic mutations in some cases. But that's another topic :-) TCGA had some issues early on, but mistaking blood for solid tissue was not one of them. You have to use the barcode (or, later, the UUID) to obtain full information on a given sample. --t > On Feb 5, 2014, at 7:19 PM, Jeremy Ng <jeremy.ng.wk1990 at="" gmail.com=""> wrote: > > Hi, > > Just wanted to add in a little more on the "normal" adjacent samples that > TCGA uses. > > I work with breast cancer samples, and it has appeared that some(if not > many) of these "normal" adjacent samples are really similar to the tumour > themselves. This has been reported earlier (see > http://cancerres.aacrjournals.org/cgi/content/meeting_abstract/72/24 _MeetingAbstracts/P1-03-02 > ). > > Just thought it will be worth sharing. > > Cheers, > Jeremy > > > > On Thu, Feb 6, 2014 at 7:23 AM, Steve Lianoglou <lianoglou.steve at="" gene.com="">wrote: > >> Hi, >> >> Just wanted to comment on the "paired" TCGA data: >> >>> On Wed, Feb 5, 2014 at 2:38 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: >>> [snip] >>> I have analysed similar data from TCGA, and the separation is really much >>> too good. Furthermore there is little evidence of matching. It would >>> appear that the matched normal cells are not really comparable cells to >> the >>> tumours. The matched normal samples may actually be profiles of whole >>> blood, a cell type which has a radically different expression profile to >>> epithelial cells. I really wonder what one can hope to learn about >> cancer >>> by comparing epithelial tumours to blood. >> >> There are samples where you have the "correct" matched sample for gene >> expression, as indeed comparing a solid tumor to "normal blood" to >> identify changes in gene expression would make very little sense. >> >> You just need to be careful in parsing the sample barcodes to ensure >> you are working with the correct stuff. >> >> An explanation of the barcode scheme is here: >> https://wiki.nci.nih.gov/display/TCGA/TCGA+Barcode >> >> You are interested in decoding the "sample" value, which ranges from 01-29. >> >> Explanation is here: >> >> https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?code Table=Sample%20type >> >> There are samples taken from "normal blood" (code = 10), however >> code=11 are solid tissue normals (presumably from the correct tissue >> ;-) >> >> The preponderance of blood samples is, I believe, primarily to be used >> in order to identify somatic mutations -- presumably using the >> exome-seq from "normal blood" would be cleaner than primary tissue, >> depending on how close to the tumor the tissue was taken from. >> >> To figure out what was extracted from the "sample" (DNA, RNA, etc), >> you just decode the "Analyte" part of the barcode: >> >> https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?code Table=portion%20analyte >> >> >> HTH, >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Genentech >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.2 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Thanks a lot for all of your inputs, which are all very helpful! Indeed, I did use the barcode to get the matched normals for the tumors of the same patients. My normals are all code 11, which are supposed to be solid tissues as Steve mentioned. Also the plotMDS plots seem showing a very separation of normal vs tumors in all cases, which seem promising. I agreed with Tim, there are always concern or precaution about what the normal tissues are and how they are collected, which however are not under our own controls. Thanks again for the great info and discussions! Ming -----Original Message----- From: Tim Triche, Jr. [mailto:tim.triche@gmail.com] Sent: Wednesday, February 05, 2014 11:16 PM To: Jeremy Ng Cc: Steve Lianoglou; Gordon K Smyth; Bioconductor mailing list Subject: Re: [BioC] limma-voom Yes, "normal" is a euphemism. Very often, asymptomatic cells will turn out to have microenvironmental cues that lead to co-opting by tumor cells or blasts. But they're nearby, and they looked normal at the time of debulking/collection, and so it goes. The blood samples are for germline genotyping in solid tumors (obviously that wouldn't have worked for leukemia, so in that case skin punches were used as normals). As it happens, that's more interesting than somatic mutations in some cases. But that's another topic :-) TCGA had some issues early on, but mistaking blood for solid tissue was not one of them. You have to use the barcode (or, later, the UUID) to obtain full information on a given sample. --t > On Feb 5, 2014, at 7:19 PM, Jeremy Ng <jeremy.ng.wk1990 at="" gmail.com=""> wrote: > > Hi, > > Just wanted to add in a little more on the "normal" adjacent samples that > TCGA uses. > > I work with breast cancer samples, and it has appeared that some(if not > many) of these "normal" adjacent samples are really similar to the tumour > themselves. This has been reported earlier (see > http://cancerres.aacrjournals.org/cgi/content/meeting_abstract/72/24 _MeetingAbstracts/P1-03-02 > ). > > Just thought it will be worth sharing. > > Cheers, > Jeremy > > > > On Thu, Feb 6, 2014 at 7:23 AM, Steve Lianoglou <lianoglou.steve at="" gene.com="">wrote: > >> Hi, >> >> Just wanted to comment on the "paired" TCGA data: >> >>> On Wed, Feb 5, 2014 at 2:38 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: >>> [snip] >>> I have analysed similar data from TCGA, and the separation is really much >>> too good. Furthermore there is little evidence of matching. It would >>> appear that the matched normal cells are not really comparable cells to >> the >>> tumours. The matched normal samples may actually be profiles of whole >>> blood, a cell type which has a radically different expression profile to >>> epithelial cells. I really wonder what one can hope to learn about >> cancer >>> by comparing epithelial tumours to blood. >> >> There are samples where you have the "correct" matched sample for gene >> expression, as indeed comparing a solid tumor to "normal blood" to >> identify changes in gene expression would make very little sense. >> >> You just need to be careful in parsing the sample barcodes to ensure >> you are working with the correct stuff. >> >> An explanation of the barcode scheme is here: >> https://wiki.nci.nih.gov/display/TCGA/TCGA+Barcode >> >> You are interested in decoding the "sample" value, which ranges from 01-29. >> >> Explanation is here: >> >> https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?code Table=Sample%20type >> >> There are samples taken from "normal blood" (code = 10), however >> code=11 are solid tissue normals (presumably from the correct tissue >> ;-) >> >> The preponderance of blood samples is, I believe, primarily to be used >> in order to identify somatic mutations -- presumably using the >> exome-seq from "normal blood" would be cleaner than primary tissue, >> depending on how close to the tumor the tissue was taken from. >> >> To figure out what was extracted from the "sample" (DNA, RNA, etc), >> you just decode the "Analyte" part of the barcode: >> >> https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?code Table=portion%20analyte >> >> >> HTH, >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Genentech >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.2 years ago Yi, Ming NIH/NCI [C] ▴ 100

Login before adding your answer.