Dear Community,
in addition to one of my previous posts about mroast testing for findind "differentially activated" KEGG pathways, i would like to make two very important questions regarding some useful implementations about my current analysis in a illumina microarray dataset(Human HT-12 v4 beadchip) which i have analyzed and pre-processed with limma. My first question is if i could similarly test for multiple gene sets regarding Gene Ontology via mroast. My design matrix for my expressionset is also mentioned in other post(C: Correct construction of design matrix in limma for multiple contrasts for gene e)
My first approach is similar to my post for KEGG with mroast(C: Appropriate use of the function mroast to find KEGG pathways for differential ex) :
library(illuminaHumanv4.db)
x <- illuminaHumanv4GO2PROBE
mapped_probes <- mappedkeys(x)
xx <- as.list(x[mapped_probes])
indices <- ids2indices(xx, rownames(filtered.2)) # filtered.2= normalized & filtered-EListRaw-class
res <- mroast(filtered.2, indices, design, contrast=6) # contrast=6 in the design matrix in the above link represents the difference between the combined therapy of two compounds vs DMSO(control group)
My second approach is based on an edx lesson(PH525x series) for gene set testing in R, which i slightly modified-the motivation for the second approach is that i wanted to define more the minimum length for each gene set for testing:
go2probe <- as.list(illuminaHumanv4GO2PROBE)
govector <- unlist(go2probe)
golengths <- sapply(go2probe, length)
idxvector <- match(govector, rownames(filtered.2))
idx <- split(idxvector, rep(names(go2probe),golengths))
idxclean <- lapply(idx,function(x) x[is.na(x)])
idxlengths <- sapply(idxclean,length)
idxsub <- idxclean[idxlengths>=10]
res <- mroast(filtered.2,idxsub,design,contrast=6)
and then for both cases use the GO.db database to find each GO term to which specific term is annotated(i.e. MF, BP or CC).
To sum up, my first question is (because i use R for the past 6 months) if both the above methodologies are appropriate for the use of mroast for finding "DE GO terms" ?
Moreover, my second and also very crusial question, is if any of these above methods could be also used with other curated gene-sets as inputs for mroast in R, such as the molecular signature databases in the Broad Institute (http://bioinf.wehi.edu.au/software/MSigDB/) ?? My main reason for asking is that besides various pathway analysis methodologies and repositories like KEGG or Reactome, as my current dataset describes the use of combinatorial inhibitors in a specific metastatic breast cancer cell line, i found extremely interesting and possibly useful to use specific molecular signatures from the Broad Institute, such as the C6 oncogenic signatures. Or these are completely different and irrelevant for my specific analysis ?
Thank you in advance !!!
Dear Aaron,
thank you for your useful recommendations !! Regarding the second part of my question, for instance if i choose the C6 oncogenic signatures, how i can imprort and correctly use the "human_c6_v4.rdata" ?? I used the function load("C:/Users/Efstathios/Desktop/human_c6_v4.rdata") but i couldnt manage to read the data object
That command should work; it should load into memory a list called
Hs.c6
, where each element of the list is a named set that contains Entrez gene IDs. You probably need to match your Illumina probe IDs with their Entrez IDs first:and then match the IDs with the elements in
Hs.c6
:---
Edit: to account for duplicated probe IDs, you'll need to match them back to the row names. This will give you the index of
filtered.2
corresponding to each probe ID and, subsequently, each Entrez ID in the set.Dear Aaron,
your code works fine and i get the match of the probe ENTREZIDs with these of each gene set. Only i have two considerations i would like to ask you and discuss:
1. Firstly, as my filtered.2 eset contains duplicate probeIDs, after the function select i get both duplicate PROBEIDs, as duplicate ENTREZIDs. In your opinion sould i use the filtered.2 eset after i have removed any duplicate and probeIDs matching to NAs, or this should not be a problem for mroast ? I mean, if there are double probeIDs counted in each gene set or it is not something i have to concern ?
2. Secondly, regarding the output of mroast (i.e the gene set $GLI1_UP.V1_UP), i searched that for the example of C6 gene set, additional information can be found at (http://www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=C6). Thus, for example if the total direction for an example gene set like AKT_UP.V1_DN is up(upregulated), and from description "Genes down-regulated in mouse prostate by transgenic expression of human AKT1 gene[GeneID=207] vs controls" it means that this specific group of genes found downregulated in prostate cancer are totally upregulated in my expression set ? Or i should interpret differently the results ?
Thank you for your condideration on this matter
mroast
automatically handles duplicate probe IDs for the same gene. This is because the correlations between probes from the same gene effectively downweights their effect. Even if a gene has many probes in the set, the contribution of those duplicate probes to the ROAST result will be limited if the probes are highly correlated (as later probes provide no more information than the first). Manual removal of duplicate probes is unnecessary. In fact, removal is unwise in the presence of weak correlations, where different probes may actually provide additional information.For your second question, your interpretation seems correct. If the overall direction reported by
mroast
is "Up", then the genes in the set are generally up-regulated across your specified contrast. Whether this is consistent with the annotation of the set depends on the nature of your contrast and the biology of your system.Thank you again Aaron for your answers !! To sum up,i just need to run the code with ids2indices as you described above, then order the results of mroast(i.e with PValue mixed or adjustedPvalue) and then based on the info of each gene set found DE to interprent the description of the set with the total perturbation in my expressionset.
Dear Aaron,
one more naive question:
regarding the use of molecular signatures from the Broad Insitute:
based on your code:
load("C:/Users/Efstathios/Desktop/human_c6_v4.rdata")
entrezid <- select(illuminaHumanv4.db, keytype="PROBEID", column="ENTREZID",
keys=rownames(filtered.2))
indices <- ids2indices(Hs.c6, entrezid$ENTREZID)
set.size <- sapply(indices, FUN=length)
indices2 <- indices[set.size >= 10]
res <- mroast(filtered.2, indices2, design, contrast=4)
But after the final function mroast it gives me this error:
Error in y[index, , drop = FALSE] : subscript out of bounds
I also used another number of contrast like 3 or 2, as my design matrix except from the first column which is the intercept has other 6 colums which represent various coefficients.
Moreover, my filtered.2 object is of class "EList", and with rownames:
head(rownames(filtered.2))
[1] "ILMN_3241953" "ILMN_1755321" "ILMN_1698554" "ILMN_2061446" "ILMN_1676336" "ILMN_3237396"
Any ideas or suggestions regarding this specific error ??
See my edits to C: Appropriate methodologies for using various inputs(GOs, gene-sets) for mroast te.
i think i have found the problem:
dim(filtered.2)
[1] 13358 12
sum(duplicated(entrezid$PROBEID))
[1] 806
-So there are except ENTREZIDs also duplicate PROBEIDs mathing to the same entrezid-
entrezid <- entrezid[!duplicated(entrezid[,1]),]
when i removed the duplicated probeIDs then the code worked fine, but do you agree with my approach ??
Don't do it that way, because if you remove duplicates in
entrezid
, you'll eliminate some relationships that might be interesting. For example, consider a probe A that is associated with genes X and Y. This shows up as duplicate rows A:X and A:Y in theentrezid
table. However, if you eliminate duplicate entries of A, you'll throw out the A:Y relationship that might be interesting. More specifically, gene sets containing Y would fail to include probe A after processing intoindices
.I understand that it is more complicated that it seemed because i have used this simple code from above to remove duplicates for annotation terms, which is a lot different- Thank you again for your valuable correction !!