Question

Finding enriched pathways from a gene list.

1

omarrafiqued ▴ 50

@omarrafiqued-21833

Last seen 12 months ago

India

I have a gene list, and I now want to use GO or KEGG for enrichment analysis of the top differentially expressed genes. The problem I am facing is with the data set. It is a matrix with approximately 570 samples and 12000 genes. The sample names are in the standard format, e.g. "TCGA-3C-AALK-01A-11R-A41B-07"; I understand these. But the gene names are something I don't understand. For example, the first five genes in the data set are named "1", "87769", "144568", "2", "53947"... I don't know if they are Entrez IDs or some other gene naming format. Could someone please clarify this? Furthermore, could someone provide R code to do an enrichment analysis using the above naming format? For clarification, I have provided the first 100 gene names in the data set below.

   [1] "1"         "87769"     "144568"    "2"         "53947"     "65985"     "51166"    
   [8] "79719"     "22848"     "57505"     "80755"     "16"        "60496"     "132949"   
  [15] "10157"     "26574"     "9625"      "18"        "10349"     "79963"     "26154"    
  [22] "650655"    "19"        "20"        "21"        "24"        "23461"     "23460"    
  [29] "10347"     "10351"     "10350"     "23456"     "5243"      "5244"      "10058"    
  [36] "11194"     "23457"     "89845"     "85320"     "4363"      "1244"      "8714"     
  [43] "10257"     "10057"     "730013"    "368"       "6833"      "10060"     "215"      
  [50] "225"       "5825"      "5826"      "6059"      "9619"      "9429"      "83451"    
  [57] "26090"     "84945"     "25864"     "84836"     "116236"    "84696"     "11057"    
  [64] "171586"    "63874"     "51099"     "57406"     "79575"     "10152"     "25890"    
  [71] "51225"     "27"        "3983"      "84448"     "22885"     "28"        "26"       
  [78] "29"        "80325"     "25841"     "30"        "10449"     "31"        "32"       
  [85] "80724"     "84129"     "27034"     "34"        "36"        "35"        "37"       
  [92] "176"       "9744"      "23527"     "116983"    "38"        "39"        "64746"    
  [99] "79777"     "91452"

Thanks.

microarray limma kegg ENTREZ enrichment analysis • 5.9k views

ADD COMMENT • link updated 5.1 years ago by Gordon Smyth 52k • written 5.1 years ago by omarrafiqued ▴ 50

1

They do appear to be Entrez IDs; however, please quote the exact source of your data (and check there yourself) to help confirm this.
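One quick local check is sketched below. It first confirms that the names are purely numeric strings, as Entrez Gene IDs are, and then looks up gene symbols, but only if the org.Hs.eg.db annotation package happens to be installed. The variable name `ids` is an assumption for illustration; in practice it would be the row names of the expression matrix.

```r
# First five row names from the question (hypothetical variable name)
ids <- c("1", "87769", "144568", "2", "53947")

# Entrez Gene IDs are positive integers stored as character strings
looks_like_entrez <- grepl("^[0-9]+$", ids)
print(all(looks_like_entrez))

# Optional: map the IDs to gene symbols if the annotation package is installed.
# The lookup uses a local database, so it runs offline once installed.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  symbols <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                                   keys = ids,
                                   keytype = "ENTREZID",
                                   column = "SYMBOL")
  print(symbols)
}
```

If most IDs map to a symbol, that supports the Entrez interpretation; many `NA` values would suggest a different identifier scheme.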

For the enrichment work itself, you could use:

topGO
KEGGprofile

Both are Bioconductor packages, and both accept Entrez IDs.

ADD REPLY • link 5.1 years ago Kevin Blighe ★ 4.0k

0

Thanks for the answer.

Do the R libraries you listed above require an internet connection?

I downloaded the data from the following link:

http://gdac.broadinstitute.org/runs/stddata_201504_02/data/BRCA/20150402/

The file name is:

gdac.broadinstitute.orgBRCA.MergernaseqilluminahiseqrnasequnceduLevel3geneexpression_data.Level3.2015040200.0.0.tar.gz

ADD REPLY • link 5.1 years ago omarrafiqued ▴ 50

1

Thanks. Then yes, they are likely Entrez IDs. For your other question, I believe they require an Internet connection. Can you not check that yourself?

ADD REPLY • link 5.1 years ago Kevin Blighe ★ 4.0k

0

I have very limited access to the internet...so I had to ask...Thanks for the answers.

ADD REPLY • link 5.1 years ago omarrafiqued ▴ 50

score 3 · Accepted Answer · 2020-02-15

The row.names you show do appear to be human Entrez Gene Ids.

You added a limma tag to your question and I am guessing that you have used limma before. So you would know that there are several pathway analysis functions provided by limma and they all work with Entrez Gene Ids. Of all the gene set testing functions provided by limma (roast, fry, camera, wilcoxGST, goana and kegga), only kegga requires an internet connection. The kegga internet requirement is unavoidable because of KEGG's licensing restrictions.

For example, if the top row of genes in your question was your gene list, a GO analysis could be done by

library(limma)
Genes <- c("1","87769","144568","2","53947","65985","51166")
g <- goana(Genes)
topGO(g)

provided you have the Bioconductor limma, GO.db and org.Hs.eg.db packages installed.
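A KEGG analysis follows the same pattern with kegga and topKEGG. The sketch below is only an illustration: it assumes limma is installed and, because kegga downloads its annotation from KEGG, it is wrapped in `tryCatch` so that it fails quietly on a machine without internet access.

```r
# Same illustrative gene list as in the GO example
Genes <- c("1","87769","144568","2","53947","65985","51166")

if (requireNamespace("limma", quietly = TRUE)) {
  # kegga needs an internet connection to fetch KEGG pathway annotation;
  # return NULL instead of stopping if the download fails
  k <- tryCatch(limma::kegga(Genes, species = "Hs"),
                error = function(e) NULL)
  if (!is.null(k)) {
    print(limma::topKEGG(k, number = 10))
  }
}
```

Both goana and kegga also accept a `universe` argument, which could be set to the row names of the full expression matrix so that the background is restricted to the genes actually measured.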

There are plenty of examples showing how to do pathway enrichment analyses in the context of a limma or edgeR differential expression analysis, for example

Of course you can't do an enrichment analysis until you have a gene list and you won't have a gene list until you undertake a differential expression analysis. At the moment you haven't mentioned any analysis.
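As a rough sketch of what such an analysis could look like, the code below derives tumour and normal groups from TCGA barcodes, fits a limma model, and shows where goana would come in. Everything except the first barcode and the gene IDs is made up for illustration: the other barcodes are hypothetical, and the expression matrix is simulated noise standing in for properly normalised log-expression values. Real RNA-seq data would need appropriate normalisation and transformation first.

```r
# Sample barcodes: the first is from the question, the rest are made up
barcodes <- c("TCGA-3C-AALK-01A-11R-A41B-07",
              "TCGA-XX-0001-01A-11R-XXXX-07",
              "TCGA-XX-0002-01A-11R-XXXX-07",
              "TCGA-XX-0003-11A-11R-XXXX-07",
              "TCGA-XX-0004-11A-11R-XXXX-07",
              "TCGA-XX-0005-11A-11R-XXXX-07")

# Characters 14-15 of a TCGA barcode encode the sample type:
# "01" = primary tumour, "11" = solid tissue normal
sample_type <- substr(barcodes, 14, 15)
group <- factor(ifelse(sample_type == "11", "Normal", "Tumour"),
                levels = c("Normal", "Tumour"))

# Simulated log-expression matrix (rows = Entrez IDs, columns = samples)
set.seed(1)
gene_ids <- c("1","87769","144568","2","53947","65985","51166")
y <- matrix(rnorm(length(gene_ids) * length(barcodes)),
            nrow = length(gene_ids),
            dimnames = list(gene_ids, barcodes))

if (requireNamespace("limma", quietly = TRUE)) {
  design <- model.matrix(~ group)
  fit <- limma::eBayes(limma::lmFit(y, design))
  print(limma::topTable(fit, coef = "groupTumour"))
  # With GO.db and org.Hs.eg.db installed, the fitted model could be passed
  # directly to goana, which selects DE genes at the given FDR cutoff:
  # g <- limma::goana(fit, coef = "groupTumour", species = "Hs")
  # limma::topGO(g)
}
```

Passing the fitted model object to goana, rather than a hand-picked vector of IDs, keeps the gene list tied to an explicit differential expression test.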