EdgeR: paired samples together with independent samples


Maria Keays ▴ 30

@maria-keays-5590


Hello,

I read this thread and related user guide material with interest because I am working with a very similar data set with paired samples. However, I'm having trouble which I think stems from my data being unbalanced? I have four patients with a disease and three without, and within that for some patients I have replicates but for others I do not. I've created a design matrix as described on p32 of the 27 October 2012 edgeR user's guide, but when I try to estimate the common dispersion using estimateGLMCommonDisp() it tells me:

"Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset) :
  Design matrix not of full rank. The following coefficients not estimable:
  DiseaseHealthy:Patient4"

I guess because I have 4 patients in the diseased set and only 3 in the healthy set? If I remove Patient4 and try again, I'm able to continue the analysis successfully, but I'd obviously like to be able to include all the data -- is that possible? If so, could you explain how to do it?

The original annotations for my data are below:

Disease   Patient  Treatment
disease1  1        control
disease1  1        control
disease1  1        control
disease1  2        control
disease1  3        control
disease1  3        control
disease1  4        control
disease1  1        treat
disease1  1        treat
disease1  1        treat
disease1  2        treat
disease1  3        treat
disease1  3        treat
disease1  4        treat
healthy   5        control
healthy   6        control
healthy   6        control
healthy   6        control
healthy   7        control
healthy   7        control
healthy   5        treat
healthy   6        treat
healthy   6        treat
healthy   6        treat
healthy   7        treat
healthy   7        treat

As I was following the user's guide I amended the "Patient" labels so it looked like this when I created the design matrix:

Disease   Patient  Treatment
disease1  1        control
disease1  1        control
disease1  1        control
disease1  2        control
disease1  3        control
disease1  3        control
disease1  4        control
disease1  1        treat
disease1  1        treat
disease1  1        treat
disease1  2        treat
disease1  3        treat
disease1  3        treat
disease1  4        treat
healthy   1        control
healthy   2        control
healthy   2        control
healthy   2        control
healthy   3        control
healthy   3        control
healthy   1        treat
healthy   2        treat
healthy   2        treat
healthy   2        treat
healthy   3        treat
healthy   3        treat

Thanks!
Maria

On 25/10/2012 06:18, Gordon K Smyth wrote:
> Dear Anna,
>
> You are right to recognise that the analysis of this sort of design is
> more complex than many other experiments, because it includes
> comparisons both within and between patients. I have included a new
> section in the edgeR User's Guide based on your experiment that
> describes the analysis. This will appear in the official release of
> edgeR in a couple of days. In the meantime, see pages 31-33 of:
>
> http://bioinf.wehi.edu.au/software/edgeR/edgeRUsersGuide.pdf
>
> Best wishes
> Gordon
>
>> Date: Tue, 23 Oct 2012 06:37:44 -0700 (PDT)
>> From: "anna [guest]" <guest at bioconductor.org>
>> To: bioconductor at r-project.org, m.nadira at yahoo.fr
>> Subject: [BioC] EdgeR: paired samples together with independant samples
>>
>> Hello,
>> I am using EdgeR to analyse my RNAseq data.
>>
>> I have:
>>
>> cells from 3 healthy patients, either treated or not with a hormone.
>>
>> cells from 3 patients with disease D1, either treated or not with the
>> hormone
>>
>> cells from 3 patients with disease D2, either treated or not with the
>> hormone.
>>
>> I would like to know what is wrong in the response to the hormone in
>> patients with disease D1 and D2.
>>
>> I don't know how to combine paired comparisons, with pairwise
>> comparisons, in a unique glm analysis.
>>
>> thank you very much,
>> anna
>>
>> -- output of sessionInfo():
>>
>> R version 2.15.1 (2012-06-22)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252
>> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
>> [5] LC_TIME=French_France.1252
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.15.1


Gordon Smyth 52k

@gordon-smyth


WEHI, Melbourne, Australia

Dear Maria,

Thanks for the specific reference to the documentation that you've followed.

Yes, you are correct, the error is arising because there is no 4th patient in the healthy group. If you have a look at your design matrix, you will see that there is a column called DiseaseHealthy:Patient4 that consists entirely of zeros. It should be column 8, but check:

design[,8]

The easiest way to proceed is simply to remove that column manually from the design matrix:

design2 <- design[,-8]
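
Because the position of the empty column depends on how the factor levels are ordered, it may be safer to find it by content rather than by index. The following is a minimal sketch in base R, not code from the thread. It assumes the design formula `~ Disease + Disease:Patient + Disease:Treatment` from the User's Guide section mentioned above, and it assumes level labels (`Healthy`, `Disease1`, `Control`, `Treat`) chosen to match the coefficient names in the error message. The sample annotation is the renumbered table from the question.

```r
# Renumbered sample annotation from the question (patients numbered within group).
# ASSUMPTION: level labels and level order are chosen for illustration only.
Disease <- factor(rep(c("Disease1", "Healthy"), times = c(14, 12)),
                  levels = c("Healthy", "Disease1"))
Patient <- factor(c(rep(c(1, 1, 1, 2, 3, 3, 4), 2),   # disease1: control, then treat
                    rep(c(1, 2, 2, 2, 3, 3), 2)))     # healthy: control, then treat
Treatment <- factor(c(rep(c("Control", "Treat"), each = 7),
                      rep(c("Control", "Treat"), each = 6)),
                    levels = c("Control", "Treat"))

# ASSUMPTION: this is the formula used to build the design matrix.
design <- model.matrix(~ Disease + Disease:Patient + Disease:Treatment)

# A column that is zero for every sample corresponds to a
# disease/patient combination that does not exist in the data.
empty <- colnames(design)[colSums(design != 0) == 0]
print(empty)

# Drop the empty column(s) by name instead of by position.
design2 <- design[, !(colnames(design) %in% empty), drop = FALSE]

# The reduced matrix should now have as many independent columns as columns.
print(qr(design2)$rank == ncol(design2))
```

Selecting by name gives the same result as `design[,-8]` when the empty column happens to be the eighth, and still works if a different level order moves it elsewhere.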

Your experiment has another issue, in that you have repeat samples on several of the patients. Are these biological replicates? If not, if they are just technical replicates, then they should be collapsed into one library before analysis.
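
For readers who do have technical replicates, collapsing means summing the counts of the replicate libraries gene by gene. The sketch below uses base R only. The count matrix and the `sample_id` grouping are made-up toy values for illustration and are not data from this thread.

```r
# Toy count matrix: 3 genes x 4 libraries (made-up numbers, illustration only).
counts <- matrix(c(10, 0, 5,
                   12, 1, 4,
                   30, 2, 9,
                    8, 0, 3),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("lib1", "lib2", "lib3", "lib4")))

# HYPOTHETICAL grouping: lib1 and lib2 are technical replicates of sample S1.
sample_id <- c("S1", "S1", "S2", "S3")

# rowsum() adds up rows that share a group label, so transpose first
# (libraries as rows), sum within each sample, then transpose back.
collapsed <- t(rowsum(t(counts), group = sample_id))
print(collapsed)
```

Recent versions of edgeR also provide `sumTechReps()` for the same purpose. Biological replicates, by contrast, should stay as separate libraries.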

Best wishes
Gordon


Dear Gordon,

Thanks very much for the helpful advice. I'm treating them as biological replicates -- they are cell cultures and it's just that I have multiple separately treated/untreated pairs of cultures from some patients and only one treated/untreated pair for others. So although some cultures came from the same patient, they were all treated separately and then RNA was extracted from each culture. Would you say that's the right thing to do?

Thanks and best wishes,
Maria


Dear Maria,

Sounds ok from what you say not to collapse libraries. However, if the three treated cultures and three untreated cultures for one patient are truly three pairs, then this pairing should be reflected in the analysis. You can handle this by numbering the samples by paired culture from 1 to 7 instead of numbering by patient.

An MDS plot could guide you in judging whether there are baseline differences between the different pairs for one patient, and hence whether your pairing should be by culture instead of by patient.
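
Pairing by culture could be set up as in the following sketch, which is an illustration in base R rather than code from the thread. It assumes that, within each disease group, the k-th control library and the k-th treated library in the posted annotation come from the same culture pair; the thread does not state which libraries pair up, so this must be checked against the actual sample sheet. The factor name `Pair` and the level labels are also assumptions.

```r
# 14 disease1 libraries (7 control, 7 treat) and 12 healthy libraries (6 + 6).
Disease <- factor(rep(c("Disease1", "Healthy"), times = c(14, 12)),
                  levels = c("Healthy", "Disease1"))
Treatment <- factor(c(rep(c("Control", "Treat"), each = 7),
                      rep(c("Control", "Treat"), each = 6)),
                    levels = c("Control", "Treat"))

# ASSUMPTION: k-th control and k-th treated library in a group form pair k.
# Pairs are numbered within group: 1..7 for disease1, 1..6 for healthy.
Pair <- factor(c(rep(1:7, 2), rep(1:6, 2)))

design <- model.matrix(~ Disease + Disease:Pair + Disease:Treatment)

# The healthy group has no 7th pair, so one column is all zeros again.
empty <- colSums(design != 0) == 0
print(colnames(design)[empty])

design.pair <- design[, !empty, drop = FALSE]
print(dim(design.pair))
```

Blocking on pair uses more coefficients than blocking on patient, so fewer residual degrees of freedom remain for estimating dispersion. That is one reason to check first whether the pairs within a patient really differ at baseline.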

Best wishes
Gordon


Dear Gordon,

I have another question about this analysis. Previously I performed an analysis on the same data but without incorporating effects of patient. My design matrix had columns: "Disease1.Treat", "Disease1.Control", "Healthy.Treat", "Healthy.Control", and I then tested for genes showing a significant interaction between disease and treatment using the contrast ((Disease1.Treat - Disease1.Control) - (Healthy.Treat - Healthy.Control)). I think this is what is explained on pages 25-26 of the edgeR users guide (Oct 27 2012 version).

Now I want to take into account patient effects as well, so I have my design matrix with columns:

[1] "(Intercept)"                    "DiseaseDisease1"
[3] "DiseaseHealthy:Patient2"        "DiseaseDisease1:Patient2"
[5] "DiseaseHealthy:Patient3"        "DiseaseDisease1:Patient3"
[7] "DiseaseDisease1:Patient4"       "DiseaseHealthy:TreatmentTreat"
[9] "DiseaseDisease1:TreatmentTreat"

Reading the explanation on pages 32-33 of the users guide, to do the equivalent contrast to find genes showing significant interaction between disease and treatment, should I simply use the following?

lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,-1,1))

I think this is what the guide is saying, but I just want to make sure...

Thanks and best wishes,
Maria
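
The contrast in the question above can also be written by coefficient name, which makes it easier to see what is being compared and avoids counting positions by hand. This is a sketch in base R that only builds the vector; it is not a reply from the thread. The column names are copied from the question, and the `glmLRT()` call is left as a comment because it needs a fitted edgeR object (`fit`) that is not available here.

```r
# Column names of the reduced design matrix, as listed in the question.
cols <- c("(Intercept)", "DiseaseDisease1",
          "DiseaseHealthy:Patient2", "DiseaseDisease1:Patient2",
          "DiseaseHealthy:Patient3", "DiseaseDisease1:Patient3",
          "DiseaseDisease1:Patient4",
          "DiseaseHealthy:TreatmentTreat", "DiseaseDisease1:TreatmentTreat")

# Start from all zeros, then set the two treatment-effect coefficients.
contrast <- setNames(numeric(length(cols)), cols)
contrast["DiseaseDisease1:TreatmentTreat"] <- 1   # treatment effect in disease1
contrast["DiseaseHealthy:TreatmentTreat"] <- -1   # minus treatment effect in healthy
print(contrast)

# With a fitted model, the test would then be requested as in the question:
# lrt <- glmLRT(fit, contrast = contrast)   # not run: 'fit' is not defined here
```

Written this way, the vector is the same as `c(0,0,0,0,0,0,0,-1,1)`: the treatment coefficient for the disease group minus the treatment coefficient for the healthy group.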