Question

GOseq No Underrepresented Categories

0

Entering edit mode

mart1139 • 0

@mart1139-14954

Last seen 5.4 years ago

Purdue University

Hi All,

I am using GOseq as part of the Trinity pipeline to perform gene ontology testing for genes in a de novo transcriptome assembly. I used edgeR to perform DGE analysis comparing groups that were treated with a chemical to those that went untreated, and my output looks like the following:

> dim(go.df)
[1] 50 11
> head(go.df)
                     Gene sampleA   sampleB    logFC     logCPM       LR   PValue          FDR Transcriptome tissue
1  TRINITY_DN165600_c9_g1 control treatment 9.785901  1.3670613 25.66781 4.06e-07 8.839530e-04      meta_ge1  brain
2 TRINITY_DN169366_c12_g1 control treatment 9.574774  3.7314640 23.72559 1.11e-06 2.034114e-03      meta_ge1  brain
3  TRINITY_DN236596_c2_g2 control treatment 8.619980  2.9329807 21.75304 3.10e-06 5.069438e-03      meta_ge1  brain
4  TRINITY_DN138776_c1_g1 control treatment 8.542091  3.4077197 58.47233 2.06e-14 4.720000e-10      meta_ge1  brain
5  TRINITY_DN217803_c2_g1 control treatment 8.530312  0.1282462 31.91309 1.61e-08 5.680000e-05      meta_ge1  brain
6   TRINITY_DN85324_c0_g1 control treatment 7.463319 -0.7732755 26.05991 3.31e-07 8.120070e-04      meta_ge1  brain
  Comparison
1         LM
2         LM
3         LM
4         LM
5         LM
6         LM

In total, I had a very small number of DEGs (50) for this comparison and I have both positive and negative LogFC values. My transcriptome is large (composed of roughly 1.3 million transcripts) with ~83,000 transcripts having associated GO terms. After reading in the appropriate files (gene lengths, matrix containing background gene IDs, GO annotations from the Trinotate pipeline), I ran GOseq and found a good number (67) of over-represented GO categories, but I had zero under-represented categories, which makes me think something went awry in my R script. My general question is this: is it reasonable to expect to have a large number of over-represented categories and few to no under-represented categories?

Admittedly, I am new to using GOseq and GO term analysis and I would greatly appreciate any insight on this issue. Below is a section of my code where the over- and under-represented categories are decided. I am also happy to provide the entire block of code (154 lines) if the section below isn't sufficient.

Cheers,

Alex

# perform functional enrichment testing for each category.

filename = c("enriched","depleted")



for (feature_cat in factor_list) {

  message('Processing category: ', feature_cat)

  gene_ids_in_feature_cat = rownames(factor_labeling)[factor_labeling$type == feature_cat]

  cat_genes_vec = as.integer(sample_set_gene_ids %in% gene_ids_in_feature_cat)

  pwf$DEgenes = cat_genes_vec

  res = goseq(pwf,gene2cat=GO_info_listed, use_genes_without_cat=TRUE)

  ## over-represented categories:

  pvals = res$over_represented_pvalue

  pvals[pvals > 1 - 1e-10] = 1 - 1e-10

  q = qvalue(pvals)

  res$over_represented_FDR = q$qvalues

  go_enrich_filename = paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2],'.GOseq.enriched.txt', sep='')

  result_table = res[res$over_represented_pvalue<=0.05,]

  descr = unlist(lapply(result_table$category, get_GO_term_descr))

  result_table$go_term = descr;

  result_table$gene_ids = do.call(rbind, lapply(result_table$category, function(x) { 

    gene_list = GO_to_gene_list[[x]]

    gene_list = gene_list[gene_list %in% rownames(factor_labeling)]

    paste(gene_list, collapse=', ');

  }) )

  write.table(result_table[order(result_table$over_represented_pvalue),], file=go_enrich_filename, sep=' ', quote=F, row.names=F)

  #search GO database to get more information for GO terms in enriched or depleted database; send it to text file

  sink(paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2],"_","GOterms_treatment_enriched_description.txt", sep = ""))

  for(go in result_table$category){

    print(GOTERM[[go]])

    cat("-----------------------------------\n")

  }

  sink()

  ## under-represented categories:

  pvals = res$under_represented_pvalue

  pvals[pvals>1-1e-10] = 1 - 1e-10

  q = qvalue(pvals)

  res$under_represented_FDR = q$qvalues

  go_depleted_filename = paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2], '.GOseq.depleted.txt', sep='')

  result_table = res[res$under_represented_pvalue<=0.05,]

  descr = unlist(lapply(result_table$category, get_GO_term_descr))

  result_table$go_term = descr;

  write.table(result_table[order(result_table$under_represented_pvalue),], file=go_depleted_filename, sep=' ', quote=F, row.names=F)

  #search GO database to get more information for GO terms in enriched or depleted database; send it to text file

  sink(paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2],"_","GOterms_treatment_depleted_description.txt", sep = ""))

  for(go in result_table$category){

    print(GOTERM[[go]])

    cat("-----------------------------------\n")

  }

  sink()

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] qvalue_2.8.0               GO.db_3.4.1                AnnotationDbi_1.38.2       goseq_1.28.0              
 [5] geneLenDataBase_1.12.0     BiasedUrn_1.07             colorspace_1.3-2           heatmap3_1.1.1            
 [9] Glimma_1.4.0               ape_5.0                    gplots_3.0.1               ctc_1.50.0                
[13] amap_0.8-14                DESeq2_1.16.1              SummarizedExperiment_1.6.5 DelayedArray_0.2.7        
[17] matrixStats_0.53.1         Biobase_2.36.2             GenomicRanges_1.28.6       GenomeInfoDb_1.12.3       
[21] IRanges_2.10.5             S4Vectors_0.14.7           BiocGenerics_0.22.1        edgeR_3.18.1              
[25] limma_3.32.10

goseq go enrichment • 1.1k views

ADD COMMENT • link 6.6 years ago mart1139 • 0