Apologies for this lengthy post...
A colleague asked me to analyze an Agilent 2-colour dataset that is available at GEO:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32210
I am using limma to do so. This dataset has a common reference design; on all arrays the different samples were always labelled with Cy3, and the reference pool always with Cy5 (so no dye-swaps).
I preprocess the data using standard limma recommendations (removal of positive control probes, bg-correction using normexp() with offset=50, normalizeWithinArrays() with method=loess, normalizeBetweenArrays() with method=Aquantile, whereafter all probes are removed with intensity <75% percentile of the neg control probes, and the neg control probes themselves). I end up with an object called "MA.q3"
> class(MA.q3) [1] "MAList" attr(,"package") [1] "limma" >
So far, so good!
However, I am confused how to properly setup the design for this experiment.
When I build the design using limma's function modelMatrix(), in which I set the reference pool as ref, I noticed all respective samples are assigned using a negative number 1 (-1). This is what confuses me.
> design <- modelMatrix(targets, ref="mouse_reference_RNA") Found unique target names: Adult_liver Adult_Pancreas BAT CCl4_Liver_day13 CCl4_Liver_day3 DM_Liver_organoid EM_Liver_organoid Hepatocytes Lgr5_sorted_cells Liver mouse_reference_RNA Muscle Sorted_duct WAT > design Adult_liver Adult_Pancreas BAT CCl4_Liver_day13 CCl4_Liver_day3 GSM1047592_US22502677_251486837777_S01_GE2_105_Dec08_1_1 0 0 0 -1 0 GSM1047593_US22502677_251486837777_S01_GE2_105_Dec08_1_2 0 0 0 -1 0 GSM1047594_US22502677_251486837777_S01_GE2_105_Dec08_1_3 0 0 0 0 -1 GSM1047595_US22502677_251486837777_S01_GE2_105_Dec08_1_4 0 0 0 0 -1 <<snip>> GSM797991_US22502677_251486826593_S01_GE2_105_Dec08_1_4 0 0 -1 0 0 GSM797992_US22502677_251486825703_S01_GE2_105_Dec08_1_1 0 0 -1 0 0 GSM797993_US22502677_251486825703_S01_GE2_105_Dec08_1_4 0 0 -1 0 0 GSM797994_US22502677_251486830305_S01_GE2_105_Dec08_1_1 -1 0 0 0 0 GSM797995_US22502677_251486833702_S01_GE2_105_Dec08_1_1 -1 0 0 0 0 GSM797996_US22502677_251486830304_S01_GE2_105_Dec08_1_2 0 0 0 0 0 GSM797997_US22502677_251486834188_S01_GE2_105_Dec08_1_2 0 0 0 0 0 GSM797998_US22502677_251486834188_S01_GE2_105_Dec08_1_1 0 0 0 0 0 GSM797999_US22502677_251486834188_S01_GE2_105_Dec08_1_4 0 -1 0 0 0 GSM798000_US22502677_251486835936_S01_GE2_105_Dec08_1_4 0 -1 0 0 0 <<snip>>
I realize this happens because all samples are expressed relative to the reference, which is in the Cy5 channel.
But is such design (with "negative" group assignments) subsequently properly analyzed? Or should I first convert the design matrix to include only positive values?
For now I continued with the (negative) design matrix:
> colnames(design) [1] "Adult_liver" "Adult_Pancreas" "BAT" "CCl4_Liver_day13" "CCl4_Liver_day3" "DM_Liver_organoid" [7] "EM_Liver_organoid" "Hepatocytes" "Lgr5_sorted_cells" "Liver" "Muscle" "Sorted_duct" [13] "WAT" > > cont.matrix <- makeContrasts( + EM_DM= (EM_Liver_organoid - DM_Liver_organoid), + AdultLiver_DM = (Adult_liver - DM_Liver_organoid), + Liver_DM = (Liver - DM_Liver_organoid), + Hepatocytes_DM = (Hepatocytes - DM_Liver_organoid), + levels=design) > cont.matrix Contrasts Levels EM_DM AdultLiver_DM Liver_DM Hepatocytes_DM Adult_liver 0 1 0 0 Adult_Pancreas 0 0 0 0 BAT 0 0 0 0 CCl4_Liver_day13 0 0 0 0 CCl4_Liver_day3 0 0 0 0 DM_Liver_organoid -1 -1 -1 -1 EM_Liver_organoid 1 0 0 0 Hepatocytes 0 0 0 1 Lgr5_sorted_cells 0 0 0 0 Liver 0 0 1 0 Muscle 0 0 0 0 Sorted_duct 0 0 0 0 WAT 0 0 0 0 >
> > fit <- lmFit(MA.q3, design) > fit2 <- contrasts.fit(fit, cont.matrix) > fit3 <- eBayes(fit2, trend=TRUE, robust=TRUE) > > topTable(fit3,coef=1) Row Col ControlType ProbeName SystematicName logFC AveExpr t P.Value adj.P.Val B 31040 367 75 0 A_52_P410495 NM_026336 -7.000364 6.471554 -45.13807 1.646844e-21 6.182251e-17 34.82217 9652 115 7 0 A_51_P422751 NM_017474 -8.485838 6.758554 -37.27941 7.081291e-20 1.329158e-15 32.56920 31911 378 14 0 A_52_P670766 NM_027339 -5.972531 6.422511 -32.83067 8.527083e-19 1.067022e-14 30.88386 25110 297 63 0 A_52_P3070 NM_178669 -4.789018 6.468355 -31.16513 2.356391e-18 2.211473e-14 30.15357 8588 102 43 0 A_51_P229403 NM_027339 -6.958043 6.699792 -30.04494 4.809562e-18 3.611019e-14 29.62718 15296 181 72 0 A_51_P511000 NM_181323 -4.221551 6.577204 -26.42710 5.811238e-17 3.635898e-13 27.70479 18082 214 64 0 A_51_P490840 NM_146077 -5.140845 6.526875 -25.04015 1.647681e-16 8.836278e-13 26.86462 26826 318 5 0 A_51_P492047 NM_020622 -5.232215 6.649090 -23.36079 6.274612e-16 2.880247e-12 25.75803 32009 379 28 0 A_51_P134993 NM_207208 -4.540090 6.624202 -23.24466 6.905227e-16 2.880247e-12 25.67760 34718 411 30 0 A_51_P362292 NM_023529 -4.421324 6.754218 -23.03258 8.234346e-16 3.091173e-12 25.52937 >
By doing so, do I correctly interpret that in when comparing the EM samples vs the DM samples (1st contrast; EM_DM), the most significantly regulated gene is NM_026336, which is down-regulated in the EM versus DM samples (log2 FC= -7.00), and is NOT up-regulated because of the "negative" design matrix?
I am asking because this is not what I concluded when inspecting the M-values of the respective samples; there it seems that NM_0026336 (rownumber =26261) is actually higher expressed in the EM samples compared to the DM samples...
> colnames(MA.q3$A) <- targets$Cy3 > colnames(MA.q3$M) <- targets$Cy3 > MA.q3$M[26261,] CCl4_Liver_day13 CCl4_Liver_day13 CCl4_Liver_day3 CCl4_Liver_day3 Sorted_duct Hepatocytes Liver -0.05987034 0.14093043 -0.05372743 -0.16743075 -0.48689356 -0.03920677 -0.05760740 Lgr5_sorted_cells Sorted_duct Hepatocytes Liver Lgr5_sorted_cells Muscle Muscle 0.12492195 -0.41947946 0.07274417 0.09072236 0.17494198 -0.09277732 -0.32188229 Muscle WAT WAT WAT BAT BAT BAT -0.25082762 -0.02929802 -0.01729277 0.03040873 0.02812201 0.03574133 0.01385537 Adult_liver Adult_liver EM_Liver_organoid EM_Liver_organoid DM_Liver_organoid Adult_Pancreas Adult_Pancreas -0.13853968 0.14293178 -0.11902644 -0.38656530 -7.25316029 -0.07174871 0.07526561 > MA.q3$A[26261,] CCl4_Liver_day13 CCl4_Liver_day13 CCl4_Liver_day3 CCl4_Liver_day3 Sorted_duct Hepatocytes Liver 6.256466 6.333689 6.342524 6.274528 7.544673 6.446924 6.256481 Lgr5_sorted_cells Sorted_duct Hepatocytes Liver Lgr5_sorted_cells Muscle Muscle 6.553766 6.574924 6.525871 6.288565 6.356142 6.228113 6.294111 Muscle WAT WAT WAT BAT BAT BAT 6.341889 6.299306 6.346015 6.234404 6.284631 6.315308 6.288100 Adult_liver Adult_liver EM_Liver_organoid EM_Liver_organoid DM_Liver_organoid Adult_Pancreas Adult_Pancreas 6.302537 6.342125 6.292963 6.203313 9.078806 6.293387 6.303954 >
Your insights on this are appreciated!
Thanks,
Guido
To be complete; here the relevant part of the targets file: