Question

Comparison of DE lists from different datasets of same microarray platform

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Italy

Dear All,

I would like to ask a more "beginner" question about an initial comparison of DE lists that I obtained from limma for two microarray datasets. Although the platform is the same (Agilent), the comparisons differ somewhat in their time point: in one dataset I compared bystander samples vs controls at 4 hours, whereas in the other I performed the same comparison at 30 min. The same cell type (IMR-90 human lung fibroblasts) was used in both experiments. I understand that, because of the different time points, general comparisons might be inappropriate. However, I am very interested in finding DE genes common to both time points, which could indicate interesting patterns or groups of genes.

Thus, for a start, could I compare the DE probe sets (i.e. with adjusted p-value < 0.05) in a Venn diagram? Or should I also compare the final gene symbols, in case I miss different DE probe sets in the two datasets that are annotated to the same gene symbol?

Finally, would a scatter plot also be helpful for this purpose? And if my understanding is correct, should I use the logFCs or the t-statistics from the common probe sets/gene symbols?

Thank you,

Konstantinos

microarray DE-comparison meta-analysis scatter-plot limma • 2.6k views

ADD COMMENT • link 9.1 years ago Konstantinos Yeles ▴ 80

score 1 · Answer 1 · 2016-03-18

I don't see any problem with intersecting the DE sets. Each comparison is performed within each data set, so any batch effects should cancel out. To do the intersection, the two choices - probe set ID or gene symbol - would only matter if you have more than one probe set per gene on your platform. I would be inclined to use the probe sets, as they're easier to match up between data sets when the platform is the same. Otherwise, you could end up with a situation where you find that different probe sets for the same gene are DE in different comparisons - this would lead to the question of why the same probe sets aren't DE across your comparisons, which may be interesting to answer (e.g., isoform differences?) or not (spurious differences from technical noise).
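As a rough illustration of the two intersection choices, here is a minimal sketch. It assumes each topTable result is a data frame with a probe ID column, a gene symbol column and adj.P.Val; the tables tt_05h and tt_4h, the column names ProbeName and GeneSymbol, and all IDs and values are hypothetical stand-ins, not the poster's data.

```r
# Hypothetical stand-ins for two limma topTable() results after duplicate
# removal; column names and all values are made up for illustration.
tt_05h <- data.frame(
  ProbeName  = c("P1", "P2", "P3", "P4", "P5"),
  GeneSymbol = c("MMP3", "MT1E", "CXCL2", "GENEA", "GENEB"),
  adj.P.Val  = c(0.001, 0.01, 0.2, 0.03, 0.8),
  stringsAsFactors = FALSE
)
tt_4h <- data.frame(
  ProbeName  = c("P1", "P2", "P3", "P6", "P5"),  # GENEA kept a different probe here
  GeneSymbol = c("MMP3", "MT1E", "CXCL2", "GENEA", "GENEB"),
  adj.P.Val  = c(0.002, 0.3, 0.01, 0.04, 0.9),
  stringsAsFactors = FALSE
)

# DE sets at adjusted p-value < 0.05, computed within each dataset
de_05h <- tt_05h[tt_05h$adj.P.Val < 0.05, ]
de_4h  <- tt_4h[tt_4h$adj.P.Val < 0.05, ]

# Intersection by probe ID and by gene symbol
common_probes <- intersect(de_05h$ProbeName, de_4h$ProbeName)
common_genes  <- intersect(de_05h$GeneSymbol, de_4h$GeneSymbol)

# Genes that are shared only at the symbol level, i.e. via different probes
genes_of_common_probes <- de_05h$GeneSymbol[de_05h$ProbeName %in% common_probes]
symbol_only <- setdiff(common_genes, genes_of_common_probes)

print(common_probes)
print(common_genes)
print(symbol_only)
```

Genes that land in symbol_only are the cases worth a second look, since they are DE in both comparisons but through different probes.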

It is also common to make a logFC-logFC scatter plot to compare two DE comparisons. I think log-fold changes are more useful than t-statistics for such a plot, as the former gives you the biological effect size while the latter gives you statistical significance (which is probably less interesting to visualize).
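A minimal sketch of such a logFC-logFC plot follows. It uses simulated data: the tables tt_05h and tt_4h and the ProbeName column are hypothetical, standing in for full (unfiltered) topTable results such as those returned with number=Inf. The point is only that the two tables should be matched by probe ID before plotting, since their row order generally differs.

```r
set.seed(1)
n <- 200
probes <- paste0("P", seq_len(n))

# Simulated logFCs: a shared component plus independent noise per dataset
shared <- rnorm(n)
tt_05h <- data.frame(ProbeName = probes,
                     logFC = shared + rnorm(n, sd = 0.5),
                     stringsAsFactors = FALSE)
tt_4h  <- data.frame(ProbeName = probes,
                     logFC = shared + rnorm(n, sd = 0.5),
                     stringsAsFactors = FALSE)
tt_4h  <- tt_4h[sample(n), ]  # shuffle rows, as sorted result tables would be

# Match the two tables by probe ID; suffixes distinguish the logFC columns
both <- merge(tt_05h, tt_4h, by = "ProbeName", suffixes = c(".05H", ".4H"))

plot(both$logFC.05H, both$logFC.4H,
     xlab = "logFC (30 min)", ylab = "logFC (4 h)",
     pch = 16, cex = 0.6)
abline(h = 0, v = 0, lty = 2)  # reference lines at zero effect

r <- cor(both$logFC.05H, both$logFC.4H)
print(signif(r, 2))
```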

score 0 · Answer 2 · 2016-03-18

Dear Aaron,

Thank you once again for your thoughts on this matter. By "the same platform" I mean one specific Agilent platform. In both datasets, after topTable I end up with several duplicate probe IDs that map to the same gene symbol (in both datasets I removed the duplicates using the MAD metric mentioned in my previous post). Following your point about focusing on the probe sets, I made Venn diagrams with both DE gene symbols and probe set IDs, and in a very few (but very interesting) cases a gene shared by both lists was represented by different probe set IDs in the two datasets. As you mentioned, this could be attributed to various reasons.

Thus, about your last suggestion: should I take the common (intersected) probe sets from above for both studies and use something like the following?

plot(LFC1, LFC2, ...) # where LFC1 is the vector of logFCs from the first study and LFC2 the one from the second

Finally, could I also compute the correlation for the above scatter plot, with something like:

cor(LFC1, LFC2) ?

ADD COMMENT • link 9.1 years ago Konstantinos Yeles ▴ 80

Yes, and yes. Also, use the "Add comment" to respond to answers, don't make a new answer.

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k

Dear Aaron,

please excuse me for returning to this matter, but I would like to ask two specific questions about the interpretation of the resulting scatter plot. The data frame below holds the logFCs for the common probe sets, with the gene symbols as row names:

> head(dat2)
      LFCS.05H  LFCS.4H
MMP3  2.581726 1.932401
MT1E  2.574222 2.165657
MT1B  2.421996 1.905444
MT1L  2.364336 1.931646
CXCL2 2.426432 2.644356
MT1H  2.380918 2.001439

Then, I first used:

plot(dat2$LFCS.05H, dat2$LFCS.4H)

# There is also a relatively high correlation:

signif(cor(dat2$LFCS.05H, dat2$LFCS.4H), 2)
[1] 0.8

But how could I add a regression line to make the plot more interpretable? Or can I already state from the plot (link below) that there is an obvious correlation between the two vectors of logFCs? (Just to point out: in both comparisons the common probe sets are all up-regulated, which is interesting for further investigation.)

Also, the link to the figure of the scatterplot:

https://www.dropbox.com/s/3bkz4ltgu78pni6/Rplot.png?dl=0

Thank you,

Konstantinos

ADD REPLY • link 9.0 years ago Konstantinos Yeles ▴ 80

It's generally more informative to make the logFC-logFC plot using all features, rather than just those that are in the intersection. If you restrict the plot to genes that are DE in both comparisons, you'll generally be selecting for points in the corners of the plot; this can result in some spuriously large correlations. Anyway, as to adding a line, I'll point you in one direction; try using lm to perform a linear regression, and then supply the coefficients to abline.
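To make both points concrete, here is a small simulated sketch. The data frame d, its column names, the true correlation of 0.4 and the |logFC| > 1.5 cutoff (used as a crude proxy for "DE in both comparisons") are all assumptions for illustration. It compares the correlation over all features with the correlation over the selected corner points, then fits lm and passes the fit to abline.

```r
set.seed(42)
n   <- 10000
rho <- 0.4  # assumed modest true correlation between the two sets of logFCs

# Simulated logFCs for all features
x <- rnorm(n)
d <- data.frame(lfc.05H = x,
                lfc.4H  = rho * x + sqrt(1 - rho^2) * rnorm(n))

# Correlation over all features
r_all <- cor(d$lfc.05H, d$lfc.4H)

# Crude proxy for "DE in both": large absolute logFC in both comparisons,
# which keeps only points in the corners of the plot
sel   <- abs(d$lfc.05H) > 1.5 & abs(d$lfc.4H) > 1.5
r_sel <- cor(d$lfc.05H[sel], d$lfc.4H[sel])

print(c(all = r_all, selected = r_sel))

# Regression line: response (y-axis) ~ predictor (x-axis)
fit <- lm(lfc.4H ~ lfc.05H, data = d)

plot(d$lfc.05H, d$lfc.4H, pch = 16, cex = 0.3, col = "grey60",
     xlab = "logFC (30 min)", ylab = "logFC (4 h)")
points(d$lfc.05H[sel], d$lfc.4H[sel], pch = 16, cex = 0.5, col = "red")
abline(fit, lwd = 2)  # abline accepts the fitted lm object directly
```

In the formula, the variable on the left of ~ should be the one plotted on the y-axis, otherwise the line drawn by abline will not correspond to the axes of the scatter plot.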

ADD REPLY • link 9.0 years ago Aaron Lun ★ 28k

Aaron, thank you again for your recommendation. I wrongly thought at first that plotting only the common DE probe sets would be the most interesting. So you suggest using all the DE genes, not only those common to both lists? I ask because one of the two DE lists is considerably bigger (~400 genes vs ~60 genes). Or could it still be informative?

Moreover, regarding lm, do you mean something like the following?

fit <- lm(LFCS.4H ~ LFCS.05H, data = dat2)
abline(fit, lwd = 2)

But doesn't the formula in lm normally take a dependent variable and a predictor?

ADD REPLY • link 9.0 years ago Konstantinos Yeles ▴ 80