Hi All,
I am using GOseq as part of the Trinity pipeline to perform gene ontology testing for genes in a de novo transcriptome assembly. I used edgeR to perform DGE analysis comparing groups that were treated with a chemical to those that went untreated, and my output looks like the following:
> dim(go.df) [1] 50 11 > head(go.df) Gene sampleA sampleB logFC logCPM LR PValue FDR Transcriptome tissue 1 TRINITY_DN165600_c9_g1 control treatment 9.785901 1.3670613 25.66781 4.06e-07 8.839530e-04 meta_ge1 brain 2 TRINITY_DN169366_c12_g1 control treatment 9.574774 3.7314640 23.72559 1.11e-06 2.034114e-03 meta_ge1 brain 3 TRINITY_DN236596_c2_g2 control treatment 8.619980 2.9329807 21.75304 3.10e-06 5.069438e-03 meta_ge1 brain 4 TRINITY_DN138776_c1_g1 control treatment 8.542091 3.4077197 58.47233 2.06e-14 4.720000e-10 meta_ge1 brain 5 TRINITY_DN217803_c2_g1 control treatment 8.530312 0.1282462 31.91309 1.61e-08 5.680000e-05 meta_ge1 brain 6 TRINITY_DN85324_c0_g1 control treatment 7.463319 -0.7732755 26.05991 3.31e-07 8.120070e-04 meta_ge1 brain Comparison 1 LM 2 LM 3 LM 4 LM 5 LM 6 LM
In total, I had a very small number of DEGs (50) for this comparison and I have both positive and negative LogFC values. My transcriptome is large (composed of roughly 1.3 million transcripts) with ~83,000 transcripts having associated GO terms. After reading in the appropriate files (gene lengths, matrix containing background gene IDs, GO annotations from the Trinotate pipeline), I ran GOseq and found a good number (67) of over-represented GO categories, but I had zero under-represented categories, which makes me think something went awry in my R script. My general question is this: is it reasonable to expect to have a large number of over-represented categories and few to no under-represented categories?
Admittedly, I am new to using GOseq and GO term analysis and I would greatly appreciate any insight on this issue. Below is a section of my code where the over- and under-represented categories are decided. I am also happy to provide the entire block of code (154 lines) if the section below isn't sufficient.
Cheers,
Alex
# perform functional enrichment testing for each category. filename = c("enriched","depleted") for (feature_cat in factor_list) { message('Processing category: ', feature_cat) gene_ids_in_feature_cat = rownames(factor_labeling)[factor_labeling$type == feature_cat] cat_genes_vec = as.integer(sample_set_gene_ids %in% gene_ids_in_feature_cat) pwf$DEgenes = cat_genes_vec res = goseq(pwf,gene2cat=GO_info_listed, use_genes_without_cat=TRUE) ## over-represented categories: pvals = res$over_represented_pvalue pvals[pvals > 1 - 1e-10] = 1 - 1e-10 q = qvalue(pvals) res$over_represented_FDR = q$qvalues go_enrich_filename = paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2],'.GOseq.enriched.txt', sep='') result_table = res[res$over_represented_pvalue<=0.05,] descr = unlist(lapply(result_table$category, get_GO_term_descr)) result_table$go_term = descr; result_table$gene_ids = do.call(rbind, lapply(result_table$category, function(x) { gene_list = GO_to_gene_list[[x]] gene_list = gene_list[gene_list %in% rownames(factor_labeling)] paste(gene_list, collapse=', '); }) ) write.table(result_table[order(result_table$over_represented_pvalue),], file=go_enrich_filename, sep=' ', quote=F, row.names=F) #search GO database to get more information for GO terms in enriched or depleted database; send it to text file sink(paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2],"_","GOterms_treatment_enriched_description.txt", sep = "")) for(go in result_table$category){ print(GOTERM[[go]]) cat("-----------------------------------\n") } sink() ## under-represented categories: pvals = res$under_represented_pvalue pvals[pvals>1-1e-10] = 1 - 1e-10 q = qvalue(pvals) res$under_represented_FDR = q$qvalues go_depleted_filename = paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2], '.GOseq.depleted.txt', sep='') result_table = res[res$under_represented_pvalue<=0.05,] descr = unlist(lapply(result_table$category, get_GO_term_descr)) result_table$go_term = descr; write.table(result_table[order(result_table$under_represented_pvalue),], file=go_depleted_filename, sep=' ', quote=F, row.names=F) #search GO database to get more information for GO terms in enriched or depleted database; send it to text file sink(paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2],"_","GOterms_treatment_depleted_description.txt", sep = "")) for(go in result_table$category){ print(GOTERM[[go]]) cat("-----------------------------------\n") } sink() > sessionInfo() R version 3.4.2 (2017-09-28) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.5 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] qvalue_2.8.0 GO.db_3.4.1 AnnotationDbi_1.38.2 goseq_1.28.0 [5] geneLenDataBase_1.12.0 BiasedUrn_1.07 colorspace_1.3-2 heatmap3_1.1.1 [9] Glimma_1.4.0 ape_5.0 gplots_3.0.1 ctc_1.50.0 [13] amap_0.8-14 DESeq2_1.16.1 SummarizedExperiment_1.6.5 DelayedArray_0.2.7 [17] matrixStats_0.53.1 Biobase_2.36.2 GenomicRanges_1.28.6 GenomeInfoDb_1.12.3 [21] IRanges_2.10.5 S4Vectors_0.14.7 BiocGenerics_0.22.1 edgeR_3.18.1 [25] limma_3.32.10