Dear all,
I need your help because I was not able to obtain a list of DEGs with limma analysis from two GSE datasets.
Basically, I want to compare two sets of data generated on the same platform (GPL2112; single channel experiment) and corresponding to two GSEs (10 samples from GSE39589 (GSM972638 to GSM972647) and 10 samples from GSE49505 (GSM1200157 to GSM1200166)). There is no GDS for these datasets. In fact, I want to compare 10 granulosa cells obtain from small follicles to 10 thecal cells from small follicles.
I downloaded the two ExpressionSets (values were already RMA normalized and log2 transformed) from GEOQuery, combine then and directly performs limma analysis (see code below)
However the results of my limma analysis was quite surprising because high number of DEGs was obtained. I supposed that something went wrong with normalization (see boxplots) so I decided to use the CEL file from GEO, I untared , and used the affy package to read the CEL files. I performed rma normalization and limma analysis and all genes are differentially expressed….
I supposed that I missed something in the pipeline and any help will be greatly appreciate…
Thanks in advance
Regards
Carine
############################################################### # Differential expression Limma rm(list=ls()) library(Biobase) library(GEOquery) library(limma) workDir <- "C:/Users/cagenet/Documents/B/Bovin" setwd(workDir) Granu <- getGEO("GSE39589")[[1]] Theca <- getGEO("GSE49505")[[1]] ## combine ExpressionSets, granulosa versus theca eset <- combine(Granu[,1:10], Theca[,6:15]) dim(eset) ## get rid of extra featureSet columns that we don't need fData(eset) <- fData(eset)[,c(1,18,8)] ## design matrix accounting for two different tissues tissue <- c("Granulosa","Granulosa","Granulosa","Granulosa","Granulosa","Granulosa","Granulosa","Granulosa","Granulosa","Granulosa", "Theca","Theca","Theca","Theca","Theca","Theca","Theca","Theca","Theca","Theca") tissue ## design matrix accounting for two different tissues design <- model.matrix(~factor(tissue)) design colnames(design)<-c("Gran","thequevsGran") ## filter out and take off AFFX probes eset <- eset[-grep("AFFX", featureNames(eset)),] ## fit model fit <- lmFit(eset, design) dim(eset) fit2 <- eBayes(fit) #save all results tT <- topTable(fit2,sort="none", n=Inf) write.table(tT, file="all.bov.txt",sep="\t") ######################################################## # downloaded CEL files corresponding to the 20 samples and put it in workdir library("affy", lib.loc="~/R/win-library/3.3") databov<-ReadAffy() estbov<-rma(databov) head(estbov) boxplot(databov) design<-model.matrix(~0+factor(c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2))) colnames(design)<-c("theque","granu") fit<-lmFit(estbov,design) dim(estbov) fit2 <- eBayes(fit) head(estbov) write.exprs(estbov[1:24128,],file="esetbov.txt") tT <- topTable(fit2,sort="none", n=Inf) write.table(tT, file="affybov.txt",sep="\t")
> sessionInfo() R version 3.3.3 (2017-03-06) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) locale: [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C [5] LC_TIME=French_France.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] bovinecdf_2.18.0 limma_3.30.12 GEOquery_2.40.0 affy_1.52.0 Biobase_2.34.0 BiocGenerics_0.20.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.10 IRanges_2.8.2 XML_3.98-1.5 digest_0.6.12 bitops_1.0-6 [6] R6_2.2.0 DBI_0.6 stats4_3.3.3 RSQLite_1.1-2 BiocInstaller_1.24.0 [11] httr_1.2.1 zlibbioc_1.20.0 affyio_1.44.0 S4Vectors_0.12.2 preprocessCore_1.36.0 [16] tools_3.3.3 RCurl_1.95-4.8 rsconnect_0.7 AnnotationDbi_1.36.2 memoise_1.0.0
Hello James,
Thanks again for your help and explanation.I 'm not fluent with statistical models so your lights are very welcome.
In fact these two cells type were obtained from the same cows, in the same lab, by the same people, with the same lot of chips ...and was published in 2015 in Plos one. The authors identified cells markers for the two cell types. They used one way ANOVA using method of moment estimation. And as they only provided DEGs with 4<FC >4, and I need the whole list of cell markers, that’s why I want to analyse again their data. I want to obtain a whole genome view of the possible contamination between the two tissues which are close to each others.
So in light of your explanation, I can used :
The factor effects model and the coding will be :
Or the contrast matrix effect and the code will be :
Do you validate the code ?
Thanks again and have a nice day
Carine
I think this example from the lmFit doc might help you. As stated by James, simply add the coef=2 argument to the topTable call in your first code. Hope I am right ;-)
The underlying cause is that the OP has renamed the columns of the design matrix after calling
model.matrix
. If the column names were not changed,topTable
would automatically detect the intercept and avoid using it for testing.Aaron, please, could you tell me what "OP" means and what it does?Thanks for the tip. The topTable is clearly looking for "(Intercept)" in the coefficients, and removes it making the computation straightforward.
Original poster, see https://en.wikipedia.org/wiki/OP#Internet.
Thank you SamGG and Aaron Lun. I Finally obtained the DEG list ...which is not at all the same find in the article...so maybe I will discard these data.
Regards
Carine