Question

Comparing across sets of regulated genes -- Strategies?

Calin-Jageman, Robert ▴ 30

I'm looking for some comments on different strategies to compare different sets of regulated genes. I've seen different approaches, and would like to get some insight into pros/cons of each.

In my specific application, I have a microarray analysis of gene expression 1 hour after learning, and then a later experiment of tissue collected 24 hours after learning. I'd like to compare the regulation observed at each time point:

Which genes are similarly regulated at both time points?
Which genes are distinctly regulated (only at one time point)?
Overall, how similar is regulation across these different conditions?

I think this is a specific example that would generalize to comparing regulation across groups (say: comparing regulated genes across men vs. women, etc.)

Here are the strategies I'm exploring with what I can tell of the pros and cons:

1) Venn diagram of the two lists of regulated transcripts. I see this a lot, but it seems to me to be completely inadequate. It assumes that the difference between significant and non-significant is itself significant, which is... well, wrong. A gene could be missing from one list just due to lack of power, not due to a meaningful difference in the degree of regulation.

2) Re-run the analysis over both experiments looking for overall regulated transcripts and interaction-significant transcripts. It seems the question of "genes regulated at both time points" requires an overall analysis to see if the transcript is significant when examined over both conditions. And it seems the "distinctly regulated genes" question is really about an interaction (in my case between learning-regulation and time-point). This feels right, but it seems there could be two weak points:

I worry that "overall regulated" may not be as stringent as "regulated at each time point, regardless of the other". That is, maybe strong regulation at one time point pulls a gene up to overall regulated even if the evidence for regulation at the other time point is weak. The Venn diagram approach seems a bit more stringent in this case for identifying transcripts clearly regulated under both conditions.
It also seems that the interaction question could end up being low-powered, leaving many transcripts in a gray zone (neither a significant interaction nor overall regulated). One way I've tried to address this is to do the overall analysis only on the set of genes already marked as regulated at one or the other time point: fewer comparisons, and so higher power. Not sure if this makes sense, though.

3) Scatter plot? Both of the above strategies make qualitative judgements, but I'd also like to get an overall feel for the degree to which regulation is similar between time points. So I've also tried creating a scatterplot of the FCs from each time point. It shows almost no correlation (r = 0.03), even though overall expression levels were very consistent across experiments. This seems like reasonable evidence that regulation is fairly distinct across these time points. The weakness of this approach, though, might be that it counts tons of non-regulated transcripts, most of which show FCs that consist entirely of noise, and that could be washing out a real relationship that exists among the genes that are actually regulated. So I've tried color-coding my scatterplot by whether the gene was significant at one, the other, or both time points, so that any subtrends might be obvious, though here there is probably a restriction-of-range problem. I've copied in the scatterplot at the end of this post.

I'm guessing there are no "right" ways to answer these questions--but I was hoping to collect some feedback and comments to help guide me in making some informed choices. Any input more than welcome. Thanks,

Bob

The scatter on the left shows the overall expression levels across the two studies, which indicates good reliability of measurement. The one on the right shows the LFCs for the two studies, coded by whether the transcript was regulated in one study (triangle), the other (square), both (x), or neither (circle). There are very few dots well away from both axes, which is where a transcript strongly regulated in both conditions would be (whether consistently or inconsistently regulated). Overall, there is almost no correlation between the LFCs... so my interpretation is that these processes are largely distinct... but is this a reasonable claim?
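As a rough illustration of the "washing out" concern raised under strategy 3, here is a small simulation sketch. Everything in it is a made-up assumption rather than a description of the actual data: 5% of genes are regulated, a regulated gene has exactly the same true logFC at both time points, and measurement noise is independent between the two experiments.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(n_genes=20000, frac_regulated=0.05, effect_sd=1.0, noise_sd=0.5, seed=1):
    """Simulate observed logFCs at two time points (all settings are hypothetical)."""
    rng = random.Random(seed)
    lfc1, lfc2, regulated = [], [], []
    for _ in range(n_genes):
        is_reg = rng.random() < frac_regulated
        # Assumption: a regulated gene has the SAME true logFC at both time points.
        true_effect = rng.gauss(0.0, effect_sd) if is_reg else 0.0
        lfc1.append(true_effect + rng.gauss(0.0, noise_sd))
        lfc2.append(true_effect + rng.gauss(0.0, noise_sd))
        regulated.append(is_reg)
    return lfc1, lfc2, regulated

lfc1, lfc2, regulated = simulate()
r_all = pearson(lfc1, lfc2)
r_reg = pearson([a for a, k in zip(lfc1, regulated) if k],
                [b for b, k in zip(lfc2, regulated) if k])
print(f"all genes: r = {r_all:.2f}; truly regulated genes only: r = {r_reg:.2f}")
```

Under these settings the expected correlation over all genes is about 0.17, versus about 0.8 among the truly regulated genes, even though regulation is identical at both time points. The sketch uses the true regulation status, which is never known in practice; selecting genes by significance instead would bring in the restriction-of-range problem mentioned above.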

limma differential expression • 1.7k views

updated 8.2 years ago by Lluís Revilla Sancho ▴ 760 • written 8.2 years ago by Calin-Jageman, Robert ▴ 30

What do you mean by "regulated"? Do you mean differentially expressed? If so, relative to what? Your plots suggest that you've got an untrained control, so a more complete description of your experimental design would be helpful.

written 8.2 years ago by Aaron Lun ★ 28k

Sorry this wasn't clear. Each experiment compares tissue from a set of trained animals to matched untrained tissue. So, by regulated, I mean there is an LFC (trained vs. untrained) that is both statistically and practically significant.

I think this is just a specific case of an issue that often comes up. For example, I saw a talk recently that looked at changes in gene expression with a traumatic brain injury. The presenters broke the animals down by gender and compared the list of regulated transcripts (regulated by injury) across the males and females. They happened to use the Venn diagram approach, and because there were almost no transcripts in common, they made the claim that males and females have different transcriptional responses to brain injury... Again, just an example, to give the flavor of what I'm talking about. It seems to me the Venn diagram approach is completely unsatisfactory, so I'm asking for advice about better approaches.
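To put rough numbers on why a small Venn overlap is weak evidence of a different response, here is a minimal sketch. It assumes a simple two-sided z-test per gene, two independent studies, and an identical true effect in both groups; the effect size, standard error, and alpha are hypothetical.

```python
from statistics import NormalDist

def power_two_sided_z(effect, se, alpha=0.05):
    """Power of a two-sided z-test when the true effect is `effect` with standard error `se`."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)

def venn_expectation(power_a, power_b):
    """Expected share of truly regulated genes per Venn region, for two independent studies."""
    return {
        "both": power_a * power_b,
        "only_a": power_a * (1 - power_b),
        "only_b": (1 - power_a) * power_b,
        "neither": (1 - power_a) * (1 - power_b),
    }

# Hypothetical numbers: the same true logFC of 1.0 in both groups, SE of 0.5 in each.
p = power_two_sided_z(effect=1.0, se=0.5)
regions = venn_expectation(p, p)
for name, share in regions.items():
    print(f"{name:8s} {share:.2f}")
```

With power near 50% in each study, roughly a quarter of the genes with identical true regulation are expected in the intersection and about half in the "only one list" regions. A genome-wide significance threshold is far stricter than alpha = 0.05, so real power, and hence real overlap, would typically be lower still.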

written 8.2 years ago by Calin-Jageman, Robert ▴ 30

score 2 · Answer 1 · 2017-02-15

There are lots of options for getting at your questions. Here are a few of them:

For your third question, regarding the overall similarity of the logFC values across all genes in two contrasts, there's a function in limma called genas that can do that. It will give you the estimate of biological correlation between the two contrasts, as well as a p-value for the hypothesis that this correlation is different from zero. If you use it, make sure to pay attention to the "subset" argument, as different values for this can have a large effect on the result. I'd probably recommend the Fpval mode. This function also generates the corresponding scatter plot. If both contrasts have a large number of DE genes and genas does not find a significant correlation between the two, I think you can confidently claim that the two contrasts are measuring distinct processes.

For finding individual genes with significantly different logFC values in the two contrasts, this is exactly the question answered by testing the interaction, as you suggest. If you want to find genes with similar fold changes in both contrasts, you could try the equivalence test implemented in DESeq2 (i.e. DESeq2::results() with altHypothesis="lessAbs"; see DESeq2 manual section 3.9). You would also run this test on the interaction contrast. This kind of test tends to have pretty low power, however. This test will also identify genes that are not changing in either contrast, since if both logFC values are zero, then the interaction term is also zero. So you probably want to restrict it to genes that are DE (using a fairly relaxed significance threshold, e.g. 20% FDR) in at least one of the two contrasts, as measured by a simultaneous test of both contrasts.
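The logic of the two tests can be sketched on a single gene with a normal approximation. This is not DESeq2's or limma's implementation: the function names, the margin, and the example values are hypothetical, and the two logFC estimates are assumed to be independent (as they would be for two separate experiments).

```python
import math
from statistics import NormalDist

_Z = NormalDist()

def interaction_p(lfc1, se1, lfc2, se2):
    """Two-sided p-value for H0: the two logFCs are equal (difference = 0)."""
    diff = lfc1 - lfc2
    se = math.sqrt(se1 ** 2 + se2 ** 2)
    return 2 * (1 - _Z.cdf(abs(diff) / se))

def equivalence_p(lfc1, se1, lfc2, se2, margin):
    """p-value for H0: |difference| >= margin (two one-sided tests; the larger p is reported)."""
    diff = lfc1 - lfc2
    se = math.sqrt(se1 ** 2 + se2 ** 2)
    p_upper = _Z.cdf((diff - margin) / se)      # H0: diff >= margin
    p_lower = 1 - _Z.cdf((diff + margin) / se)  # H0: diff <= -margin
    return max(p_upper, p_lower)

# Hypothetical genes: (logFC at 1 h, logFC at 24 h), SE of 0.15 for each estimate.
examples = {"similar": (1.5, 1.4), "distinct": (1.5, 0.0), "unchanged": (0.0, 0.0)}
for name, (a, b) in examples.items():
    print(name,
          f"interaction p = {interaction_p(a, 0.15, b, 0.15):.3g}",
          f"equivalence p = {equivalence_p(a, 0.15, b, 0.15, margin=0.5):.3g}")
```

The "unchanged" gene passes the equivalence test just as the "similar" one does, which is why the test is best restricted to genes that are DE in at least one contrast. Larger standard errors quickly push the equivalence p-value above any usual threshold, which reflects the low power noted above.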

That's all the tricks I can think of off the top of my head. I'm sure there are others. There's really no silver bullet for these kinds of questions, and you always have to keep in mind that the number of significant genes is not synonymous with the degree of biological effect. Even if only 10 genes are in common, it's possible those are the only 10 that are relevant to the treatment or effect you're studying. So ultimately, I don't think this is a question that can be answered purely with statistics.

score 0 · Answer 2 · 2017-02-15

You seem to use "regulated" as a synonym for having a (significant?) fold change, rather than to mean different interactions at the promoter, which could affect expression levels, or different interactions at the transcription level.

One way to see if genes are similarly expressed at both time points would be to use the set of "regulated" genes from one comparison as a gene set to be evaluated in the other comparison (and the other way round). You could also draw smaller subsets of genes by bootstrapping to see if there is any difference between subgroups of genes.
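A very simple version of this idea is a gene-sampling test: take the genes called regulated in one comparison and ask whether their statistics in the other comparison are more extreme than those of random gene sets of the same size. The sketch below uses toy data and a hypothetical function name; it is not a specific package's method, and because it samples genes it ignores correlation between genes.

```python
import random

def gene_set_permutation_p(stats, gene_set, n_perm=2000, seed=0):
    """Share of random same-size gene sets whose mean |statistic| is at least that of `gene_set`."""
    rng = random.Random(seed)
    genes = list(stats)
    members = [g for g in genes if g in gene_set]
    observed = sum(abs(stats[g]) for g in members) / len(members)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(members))
        if sum(abs(stats[g]) for g in sample) / len(sample) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero

# Toy data (hypothetical): 1000 genes with noise-only statistics at 24 h,
# except the first 30, whose statistics are shifted.
rng = random.Random(42)
stats_24h = {f"g{i}": rng.gauss(0.0, 1.0) + (2.5 if i < 30 else 0.0) for i in range(1000)}
set_from_1h = {f"g{i}" for i in range(30)}          # pretend these were significant at 1 h
unrelated_set = {f"g{i}" for i in range(500, 530)}  # an arbitrary comparison set

p_shared = gene_set_permutation_p(stats_24h, set_from_1h)
p_unrelated = gene_set_permutation_p(stats_24h, unrelated_set)
print(f"1 h set tested at 24 h: p = {p_shared:.4f}; unrelated set: p = {p_unrelated:.4f}")
```

Running the same test in the other direction (the 24 h list evaluated against the 1 h statistics) gives the "other way round" comparison.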

Another way to assess whether a group of genes is differentially regulated is the GSAR package, which has a function (GSNCAtest) that tests whether the relationships among the genes of a given gene set differ between two groups/contrasts.