Confusing: identify differentially expression genes (DEG) perturbed by a drug in Limma
1
1
Entering edit mode
kevin.m.hao ▴ 10
@kevinmhao-12183
Last seen 7.3 years ago

Hello,

I would like to identify the Differentially Expressed Genes (DEG) perturbed by a drug
'metformin'.

The experiment is one-color microarray as shown below. The 'controlFileName' refers to the array without
the drug (metformin here) treatment, and 'perturbationFileName' refers to the corresponding array with
the drug (metformin here) treatment. The original file like this

> orgFile
    drugName concentration cellLine perturbationFileName controlFileName
1  metformin       0.00001     MCF7          treat_1.CEL   control_1.CEL
2  metformin       0.00001     MCF7          treat_2.CEL   control_1.CEL
3  metformin     0.0000001     MCF7          treat_3.CEL   control_1.CEL
4  metformin         0.001     MCF7          treat_4.CEL   control_1.CEL
5  metformin       0.00001     MCF7          treat_5.CEL   control_2.CEL
6  metformin     0.0000242     MCF7          treat_6.CEL   control_3.CEL
7  metformin     0.0000242     MCF7          treat_6.CEL   control_4.CEL
8  metformin     0.0000242      PC3          treat_7.CEL   control_5.CEL
9  metformin     0.0000242      PC3          treat_7.CEL   control_6.CEL
10 metformin     0.0000242     HL60          treat_8.CEL   control_7.CEL
11 metformin     0.0000242     HL60          treat_8.CEL   control_8.CEL

####################################
# produce the original file
orgFile <- rbind(
  c("metformin", "0.00001", "MCF7", "treat_1.CEL", "control_1.CEL"),
  c("metformin", "0.00001", "MCF7", "treat_2.CEL", "control_1.CEL"),
  c("metformin", "0.0000001", "MCF7", "treat_3.CEL", "control_1.CEL"),
  c("metformin", "0.001", "MCF7", "treat_4.CEL", "control_1.CEL"),
  c("metformin", "0.00001", "MCF7", "treat_5.CEL", "control_2.CEL"),
  c("metformin", "0.0000242", "MCF7", "treat_6.CEL", "control_3.CEL"),
  c("metformin", "0.0000242", "MCF7", "treat_6.CEL", "control_4.CEL"),
  c("metformin", "0.0000242", "PC3", "treat_7.CEL", "control_5.CEL"),
  c("metformin", "0.0000242", "PC3", "treat_7.CEL", "control_6.CEL"),
  c("metformin", "0.0000242", "HL60", "treat_8.CEL", "control_7.CEL"),
  c("metformin", "0.0000242", "HL60", "treat_8.CEL", "control_8.CEL")
)
colnames(orgFile) <- c("drugName", "concentration", "cellLine", "perturbationFileName", "controlFileName")
orgFile <- as.data.frame(orgFile)
####################################

I have serveral questions confusing me.

1.
The 'concentration' is an important factor for the microarray analysis? If so, why one same 'control file'
can be used for different concentrations (such as: 'control_1.CEL' for concentration of '0.00001', '0.0000001' and
 '0.001' respectively)? If I understand right, each concentration should have their own 'control file', right?

2.
Should I perform limma analysis by considering ('concentraion'+'cellLine') or just by ('cellLine') ignoring concentration? 

3.
If I analyze the microarray by just considering ('cellLine'), then for MCF7, the DEG I identified are 
for example, 'geneA', 'geneB', 'genC' and for PC3, the DEG for example are 'geneB', 'geneC' and 
for HL60 the DEG are 'geneD'. Can I conclude that for 'metformin' perturbation,
the total DEG are: 'geneA', 'geneB', 'geneC', 'geneD' which is union for the three analysis? if not, what is better ways to get DEG
for the 'metformin' drug perturbation?

> totDEG
[1] "geneA" "geneB" "geneC" "geneD"

#####################################
# the code like this
mcf7 <- c("geneA", "geneB", "geneC")
pc3 <- c("geneB", "geneC")
hl60 <- c("geneD")
totDEG <- Reduce(union, list(mcf7, pc3, hl60))
totDEG
#####################################

4.
Can I ignore both 'concentration' and 'cellLine' factors and do Limma analysis using all 'perturbationFileNames' and 'controlFileName',
which can get the DEG 'just one time' instead of that did in my Third question (analyze by different cellLine and get union them)?

is this analysis reasonable, since I have already ignored the 'concentration and cellLine' factors?

> targetFrame
        FileName   Group
1    treat_1.CEL   treat
2    treat_2.CEL   treat
3    treat_3.CEL   treat
4    treat_4.CEL   treat
5  control_1.CEL control
6    treat_5.CEL   treat
7  control_2.CEL control
8    treat_6.CEL   treat
9  control_3.CEL control
10 control_4.CEL control
11   treat_7.CEL   treat
12 control_5.CEL control
13 control_6.CEL control
14   treat_8.CEL   treat
15 control_7.CEL control
16 control_8.CEL control

 

######################################
# the code like this
targetFrame <- data.frame(
  FileName = c("treat_1.CEL", "treat_2.CEL", "treat_3.CEL", "treat_4.CEL",
    "control_1.CEL", "treat_5.CEL", "control_2.CEL", "treat_6.CEL", "control_3.CEL",
    "control_4.CEL", "treat_7.CEL", "control_5.CEL", "control_6.CEL",
    "treat_8.CEL", "control_7.CEL", "control_8.CEL"),
  Group = c("treat", "treat", "treat", "treat", "control", "treat", "control",
    "treat", "control", "control", "treat", "control", "control", "treat",
    "control", "control")
)
targetFrame
library(limma)
group <- factor(targetFrame$Group)
design <- model.matrix(~ 0 + group)
colnames(design) <- c("control", "treat")
# eset is log2 data for example
fit <- lmFit(eset, design)
cm <- makeContrasts(treat - control)
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)
#######################################

5. 
Given the 'orgFile', what is the proper way to to identify the Differentially Expressed Genes (DEG) perturbed by 'metformin'?

Any help will be appreciated!

limma cmap microarrays differential gene expression • 1.6k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 23 minutes ago
WEHI, Melbourne, Australia

The starting point for any analysis is to make a proper targets frame. You obviously have data from only 16 microarrays, so the targets frame should have 16 rows (not 17 as you have above). The targets frame should also include columns for "cell line" and "concentration".

There is no reason why you need to ignore cell line or concentration in your analysis. Making decisions like that though is up to you based on what your aims are.

The "original file" that you show is not useful and you cannot do any sensible analysis using it. There is no point in asking us any questions about the "original file" because we don't know where you got it from or what it was supposed to be used for.

I cannot see any reason why each concentration should need a separate control. The control simply means no treatment, in other words the control files correspond to a concentration of zero. Trying to pair up particular treatment files with particular control files, as in your "original file", is unnecessary and makes little sense.

ADD COMMENT
0
Entering edit mode

Thanks Gordon. I modified the typos I made before. Now it is 16 microarrays.

In fact, my goad is that: I would like to compare DEGs induced by two kinds of perturbations (e.g., one is drugA, and another is rnaA). However, I can not do wet experiment for that, so I just want to analyze the exisiting microarray data (for example, data deposited in CMAP and NCBI GEO).

In principal, these microarrays should use the same cell line and same concentration (if any) and other same factors except that the perturbations are different, right?

However, all these microarrays may be from different researchers, so the factors can not be ensured to be same. At this time, I need determine how to use these microarrays to ensure the comparison of DEGs for different perturbations tends to be reasonable.

If drugA was tested in MCF7 cell line and rnaA was tested in PC3 cell line, is there a reasonable way to compare DEGs for these two perturbations? since they were tesed in different cell lines.

What is the bottom line for comparing DEGs for two different perturbations? (MUST be same cell line? same concentration?) But if analyzing public microarray data, the limitations can be ensured, how to do? 

Sorry for asking so many (naive) questions, but I am new in microarray analysis field.

Thanks so much.

 

 

ADD REPLY
1
Entering edit mode

The sensibility of comparing DEG lists depends on the biological context of the contrasts. Do your perturbations affect the same pathway? Are the cell lines related in terms of their originating cell type? If you can't answer these questions, you won't be able to interpret any similarities or differences in the DEG lists. For example, MCF7 is derived from the mammary epithelium while PC3 is derived from prostate tissue. Their regulatory networks are probably completely different - how do you plan to interpret the effects of two different perturbations on each of the two tissues?

Good experimental design obviously plays a big role here. If both contrasts involved the same cell line, then you could interpret similarities or differences in the lists as those caused by the perturbations. Alternatively, if both contrasts involved the same perturbation, then similarities/differences could be attributed to the cell line. However, if you aren't involved in planning the experiment - and you wouldn't be, for public data - then there's little you can do - see Fisher's quote:

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

ADD REPLY
0
Entering edit mode

Thanks Aaron.

If I just know drugA and rnaA were both tested in the same cell lines such as MCF7 (homo), but I do not know if other experimental factors (such as concentrations, time etc) are same (due to that they are from different researchers' experiment), can I compare the DEG lists perturbed by drugA and rnaA?

If nope, the large scale of comparison of DEG lists (perturbed saying by 1000 drugs and 1000 RNAs, which is a 1000-by-1000 similarity matrix if interpreting the similairities in the lists) in public data seems not easy, since it is difficult for computational researchers to pick up all same experimental factors of microarrays deposited by different researchers, right?

If I just want to interpret the similarities in the DEG lists (in a large scale) as those caused by the perturbations, what is a reasonable way to do it?

Thanks!

ADD REPLY
1
Entering edit mode

"Can I compare the DEG lists perturbed by drugA and rnaA?" That is not a question that is easily answered without further biological context. If one perturbation affects DNA replication, and another perturbation affects protein synthesis, what exactly are you hoping to find when you compare the DEG lists?

If the perturbations affect the same pathway, perhaps the comparison is more understandable. However, there will always be the possibility of additional confounding variables when you try to combine results from different studies. Never mind concentration and time of perturbation, what about culture conditions? Stage of growth? That's not even considering mycoplasma contamination or misidentification of cell lines.

If you're concerned about additional variables, the only way to control them is to redo the experiment yourself. If you won't or can't do that, you'll just have to accept that the comparison of DEG lists is confounded by these experimental factors, and proceed cautiously with the interpretation of the results.

ADD REPLY
0
Entering edit mode

Thanks Aaron.

By comparison of the DEG lists by computing the gene list similarity using for example 'biological process, BP', one may identify if the two perturbations have the same action to certain diseases.

If so, drug repositioning may be generated, right?

I am still confusing that how to extract the DEGs more reasonably,if one perturbation was tested in multiple microarrays of different cell lines.

 

ADD REPLY

Login before adding your answer.

Traffic: 736 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6