Hello all,
I am currently doing some GO enrichment using topGO on an RNA-seq experiment that was analyzed using tophat/cufflinks.
Having used topGO before, I am fairly well versed in its proper use. I would like to do GSEA using the KS test with the elim algorithm, but I want to be sure I am processing the data from the tophat pipeline correctly. I have no experience with tophat, as this part of the pipeline was done by someone else. The tophat output I received comprises ~13,000 genes, but I am noticing that a fair number of them are lncRNAs, miRNAs, etc. In addition, near the bottom of the list there are ~2,500 genes with q-value > 0.9 whose logFC is negligible. When I run my enrichment protocol in topGO, most of my top hits are high in the GO hierarchy, like GO:0008150 (biological_process), and not all that meaningful. Should I be filtering my tophat data in any way before passing it to topGO, for example by keeping only protein-coding genes? Or would that obscure the data? Just to be sure, here is my topGO code:
GO_data_BP <- new("topGOdata",
                  ontology = "BP",
                  allGenes = all_genes,
                  geneSelectionFun = function(p) p < p_value,
                  description = "GO enrichment analysis",
                  annot = annFUN.org,
                  mapping = "org.Mm.eg.db",
                  ID = "entrez",
                  nodeSize = 5)
result_KS_elim_BP <- runTest(GO_data_BP, algorithm = "elim", statistic = "ks")
where all_genes is my named vector of q-values with Entrez IDs as names.
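For reference, here is a minimal base-R sketch of how a vector like all_genes could be built, including the protein-coding filter asked about above. The data frame `de`, its column names (`entrez`, `q_value`, `biotype`), the toy values, and the 0.05 cutoff are all assumptions for illustration; the real cufflinks table will look different and may not carry a biotype column at all. It only shows the shape topGO expects (a named numeric vector plus a selection function), not whether filtering is the right choice.

```r
# Toy stand-in for the differential-expression table (hypothetical columns and values).
de <- data.frame(
  entrez  = c("11287", "11298", NA, "11302", "11302"),
  q_value = c(0.001, 0.04, 0.30, 0.95, 0.95),
  biotype = c("protein_coding", "lncRNA", "miRNA", "protein_coding", "protein_coding"),
  stringsAsFactors = FALSE
)

# Keep protein-coding rows that have an Entrez ID, dropping repeated IDs
# (the first occurrence of each ID is retained).
keep <- de$biotype == "protein_coding" & !is.na(de$entrez) & !duplicated(de$entrez)

# Named numeric vector: q-values as scores, Entrez IDs as names.
all_genes <- setNames(de$q_value[keep], de$entrez[keep])

# Selection function in the same form as the geneSelectionFun above
# (the cutoff is an assumed value).
p_value <- 0.05
select_significant <- function(p) p < p_value

print(all_genes)
print(select_significant(all_genes))
```

Comparing the enrichment results obtained with and without the `biotype` condition in `keep` would show directly how much the filter changes the top terms.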
What do you want to learn by using the KS test with the elim algorithm? What are the functions of your 13k genes?