It should be okay to use ROAST in this application, as the definition of the gene set (from one experiment) is independent of the application of the gene set in ROAST (in the second experiment). This means that you're not doing any inappropriate data snooping, e.g., by running ROAST on the same data that you used to define the gene set.
In practice, I would define my DGE list from the first experiment, and then split it into two separate DGE lists based on the sign of the log-fold change. I would then use each of the two sublists in ROAST for the comparison in the second experiment. If the experiments have similar DE profiles, then the "up" sublist should have a consistent direction of "up" in the ROAST output, and similarly for the "down" sublist.
Besides intersection, no "sophisticated approaches" come to mind. One (visual) alternative might be to simply plot the log-fold changes of one experiment against the log-fold change of the other experiment for all genes. If you have similar DE profiles between experiments, then you should see a more-or-less straight line through the origin. You can also compute the correlation coefficient to get some quantitative measure of similarity. You don't have to worry about technical correlations as you've got independent experiments.
Edit: roast
can now accept a vector of gene.weights
. This is intended to accept log-fold changes associated with the gene set, and can have both positive or negative entries. Thus, you don't have to split the DGE list from the first experiment to define your up/down sets. Rather, you can use the entire set in ROAST, along with the vector of log-fold changes for all genes in the set.
Let's say that you have a vector of probeset IDs from your first experiment, named
dge.set
. This is associated with a vector of log-fold changes, nameddge.fc
. Now, I'll assume you have a suitable matrix-like object for your second experiment namedy
, for which the probeset IDs are stored in the row names. We should be able to run:Of course, this assumes that the probeset IDs are directly comparable between your two chips. You can check this by looking at the
hgu133a.db
andhgu133plus2.db
annotation packages. A brief look suggests that the probeset IDs are shared and match up to the same Entrez IDs, etc., so it's probably okay.Dear Aaron,
firstly thank you for your code and your recommendations. Hovever, i need to ask you one more question that looks not important, but it concerns me.
So, based on your above approach, i have the top table data frame named top with me selected DE probesets from the first experiment and based on your above: set <- rownames(top)-so these are the probesetIDs of the first DEG list.
Secondly, i assume i have the Eset2 that is the ExpressionSet of the second dataset which we want to test the first DEG list. Thus, about your above code it should be:
matched <- match(set, rownames(Eset2) ?
and after the rest of the above code you wrote ?
Also, as my DEGlist has no duplicate probesets, should i also filter my Eset2 from my second dataset for any duplicates or after limma annotate mt topTable and use the rownames from this as rownames of my Eset2 ?
Thank you again for your consideration on this matter !!
That looks right; you should be able to just replace
y
in my code withEset2
in your code.As for your duplicate probesets, you can use
avereps
from thelimma
package on yourEset2
object. This will merge duplicates by replacing all duplicated probesets with a single row representing their average expression in each library. This seems better than just filtering out duplicates, which would result in the loss of information. It's also better than manually replacing row names, which could result in book-keeping errors if you're not careful with what you're doing.Dear Aaron,
thank you again for your recommendations; regarding the duplicates matter, i tend to use not the average, as i have read in other posts/threads that i should treat each probeset as unique and not average, because duplicates of one transcript might represent alternative transcripts or isoforms and thus should not consider the average or median-instead, i use the MAD(Median Absolut Deviation) to keep the probeset with the bigger one from other duplicates. Also i have heard the custom cdf from Brain Array, but the Annotation approach is a whole procedure by itself and thus many different approaches exist.
Dear Aaron,
i used your code as above and i also implemented the argument contrast=2 to specify the specific contrast i need from my design matrix:
genes <- set$PROBEID # where set is the data.frame with the DE probesetIDs from my first experiment
matched <- match(genes, rownames(data.trusted.eset))
matched <- matched[!is.na(matched)]
But when i used :
roast(data.trusted.eset, index=matched, design, contrast=2,gene.weights=genes.fc[matched])
# where data.trusted.eset is the expressionset from the second experiment, and also design the design matrix of the data.trusted.eset, it gave me the following result:
Active.Prop P.Value
Down NA NA
Up NA NA
UpOrDown NA NA
Mixed NA NA
Any ideas or help ?
matched
is empty. If so, the set's empty, so there's no point runningroast
.genes.fc[matched]
containsNA
values. If so, subsetmatched
so that you get rid of those entries:data.trusted.eset
is okay. Does the following code generateNA
's (assuming there's at least 10 genes in your second data set)?Dear Aaron,
yes you are right: when i write hgu133a$logFC[matched] where hgu133a the dataframe for topTable regarding the DEG genes and other info of the first dataset used to use it for the DEG genes ->
head(hgu133a)
PROBEID GENE_SYMBOL
1 209735_at ABCG2
2 207502_at GUCA2B
3 206784_at AQP8
4 206209_s_at CA4
5 217546_at MT1M
6 204475_at MMP1
GENE_NAME
1 ATP-binding cassette, sub-family G (WHITE), member 2 (Junior blood group)
2 guanylate cyclase activator 2B (uroguanylin)
3 aquaporin 8
4 carbonic anhydrase IV
5 metallothionein 1M
6 matrix metallopeptidase 1 (interstitial collagenase)
ENTREZID logFC adjusted.P.Value
1 9429 -6.432513 0.006563296
2 2981 -6.597180 0.006563296
3 343 -8.451506 0.006563296
4 762 -7.728190 0.006563296
5 4499 -6.294264 0.008197294
6 4312 5.620516 0.009569328
and it gave me many NAs:
hgu133a$logFC[matched]
[1] -6.427696 -6.279190 -6.294264 -2.354063 -3.191032 -2.176708 -6.091865
[8] -3.196283 -3.294334 2.372149 -5.675410 -2.697132 1.365858 1.516165
[15] -3.776092 NA NA 2.354714 -2.588221 -3.528173 1.693774
[22] 3.619838 1.937780 -2.273544 NA -3.732733 -2.245229 1.283509
[29] NA -1.080846 NA 2.146243 NA -1.761236 2.467160
[36] -3.808038 -1.667233 1.634998 2.101009 NA 1.598870 1.578977
[43] -3.426366 NA 1.014390 1.268782 -2.829184 2.166192 NA
[50] NA -3.652755 1.419136 -1.928969 -1.911653 2.624400 NA.....
thus i wrote your function above:
length(matched) :208
and then roast(data.trusted.eset, index=matched, design, contrast=2, gene.weights=hgu133a$logFC[matched]):
Active.Prop P.Value
Down 0.2500000 1.0000000000
Up 0.3509615 0.0005002501
UpOrDown 0.3509615 0.0010000000
Mixed 0.6009615 0.0030000000
But i find a bit difficult to explain the above results
Check out the value and details section in
?roast
for an explanation of the various fields. In your case, the test result for "Up" is significant. This indicates that, generally, the up- and down-regulated genes in your first set are also significantly up- and down-regulated, respectively, in your second experiment (specifically, 35% of all genes in the set have the same changes between experiments). If I remember correctly, the interpretation above is because thegene.weights
are used to adjust the sign of the log-fold changes in the second experiment, so if you have negative log-FCs from both the first and second experiment, the final log-FC will be seen as positive - hence, "up".