Question

Getting Bioconductor genome wide annotation for Candida auris

0

Javan Okendo • 0

@javan-okendo-22941

Last seen 3.5 years ago

University of Cape Town

I would like to run an enrichment analysis with clusterProfiler on Candida auris proteomics data. I have looked for a genome-wide annotation for Candida auris but have not been able to find one. Could you point me in the right direction to obtain a Bioconductor genome-wide annotation for Candida auris?

AnnotationFilter annotationTools AnnotationHub AnnotationData • 3.2k views

ADD COMMENT • link updated 3.5 years ago by Guido Hooiveld ★ 4.1k • written 3.5 years ago by Javan Okendo • 0

1

You have tagged AnnotationHub in your question, so I presume you know about that. Is there no C. auris data available? Also, what do you mean by 'genome wide annotation'? Are you looking for genetic position information or gene type annotations?

ADD REPLY • link 3.5 years ago James W. MacDonald 68k

0

@James I am following this clusterProfiler tutorial for the enrichment analysis: https://learn.gencore.bio.nyu.edu/rna-seq-analysis/gene-set-enrichment-analysis/ and it requires you to provide the organism's genome annotation from Bioconductor.

ADD REPLY • link 3.5 years ago Javan Okendo • 0

0

Since you would like to use clusterProfiler with proteomics data, I would like to refer you to one of my previous posts: GO enrichment analysis on Solanum lycopersicum proteomics dataset (UniProt IDs). The key is that you can make use of the UniProt-based Gene Ontology annotation information compiled by the GOA group. Luckily, such GO annotations (in GAF format) are also available for your organism! It is the file 4144447.[_auris_B8441.goa, available here (direct link: http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/4144447.[_auris_B8441.goa). Please note that there is apparently a spelling error in the file name: the [ should obviously be a C!

After importing the *.goa file I was able to perform a GO overrepresentation analysis (using the code I linked to and some random UniProt IDs).

GAF <- readGAF(filename="4144447.[_auris_B8441.goa")

<<snip>>

dotplot(res.GO.ora, showCategory=10)
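
The snipped step between reading the GAF file and plotting is not reproduced here. As a rough, hedged sketch of one way to get from a GAF file to an enrichment input, the base R code below builds the two-column term-to-gene table that clusterProfiler::enricher() accepts through its TERM2GENE argument. It relies only on the standard GAF layout (column 2 = DB object ID, column 4 = qualifier, column 5 = GO ID, column 9 = aspect). The function name gaf_to_term2gene, the toy accessions and the variable my_uniprot_ids are made up for illustration and are not taken from the linked post.

```r
# Hypothetical helper: convert GAF lines into a TERM2GENE data frame.
# aspect is "P" (biological process), "F" (molecular function) or "C" (component).
gaf_to_term2gene <- function(gaf_lines, aspect = "P") {
  # Drop empty lines and GAF header/comment lines, which start with "!"
  gaf_lines <- gaf_lines[nzchar(gaf_lines) & !startsWith(gaf_lines, "!")]
  fields <- strsplit(gaf_lines, "\t", fixed = TRUE)
  gaf <- data.frame(
    gene      = vapply(fields, `[`, character(1), 2),
    qualifier = vapply(fields, `[`, character(1), 4),
    term      = vapply(fields, `[`, character(1), 5),
    aspect    = vapply(fields, `[`, character(1), 9),
    stringsAsFactors = FALSE
  )
  # Keep the requested ontology and skip negated ("NOT") annotations
  keep <- gaf$aspect == aspect & !grepl("NOT", gaf$qualifier, fixed = TRUE)
  unique(gaf[keep, c("term", "gene")])
}

# Toy GAF content with made-up accessions (not real C. auris proteins)
gaf_row <- function(id, qualifier, go, aspect) {
  paste("UniProtKB", id, "sym", qualifier, go, "GO_REF:0000002", "IEA", "",
        aspect, sep = "\t")
}
toy <- c(
  "!gaf-version: 2.2",
  gaf_row("X00001", "involved_in", "GO:0006412", "P"),
  gaf_row("X00002", "involved_in", "GO:0006412", "P"),
  gaf_row("X00002", "located_in", "GO:0005737", "C"),
  gaf_row("X00003", "NOT|involved_in", "GO:0006412", "P")
)
term2gene <- gaf_to_term2gene(toy)
print(term2gene)

# With the real file and your own IDs (not run here; my_uniprot_ids is a placeholder):
# term2gene <- gaf_to_term2gene(readLines("4144447.[_auris_B8441.goa"))
# res <- clusterProfiler::enricher(gene = my_uniprot_ids, TERM2GENE = term2gene)
```

Passing a TERM2NAME table as well would give readable GO term names in the plots; without it the results are labelled by GO ID only.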

clusterProfiler also allows you to easily analyze your data using pathway information from the KEGG database. Although your organism is 'known' to KEGG (here; organism id = caur), the C. auris UniProt IDs have somehow not been translated to KEGG identifiers (nor, it seems, are C. auris-specific pathways available). Compare, for example, this (empty) conversion output for C. auris with the same output for S. cerevisiae. In other words, KEGG pathway analysis for C. auris is not possible.

ADD REPLY • link 3.5 years ago Guido Hooiveld ★ 4.1k

score 1 · Answer 1 · 2021-10-19

1

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 days ago

United States

There isn't an OrgDb for C. auris on the AnnotationHub, so you will need to generate one yourself, using makeOrgPackageFromNCBI in the AnnotationForge package. You can emulate the example in ?makeOrgPackageFromNCBI, substituting Candida and auris for genus and species, and 498019 for the tax_id. You could also use your actual email and name, but it doesn't really matter so long as the maintainer field has a name and email, and the email is bracketed by < and >. If you don't do that the package won't install.

Make sure you have the current version of AnnotationForge (AnnotationForge_1.35.2). I patched a bug recently that would cause it to error out on a species like this. And be prepared to wait for a while - that function downloads and parses a huge amount of data.

Once you have the package built you can then do install.packages("org.Cauris.eg.db", repos = NULL) and if you are on Windows, add a type = "source" to the install.packages arguments.
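
A minimal sketch of the build described above, following the argument names used in the ?makeOrgPackageFromNCBI example. The maintainer string is a placeholder, and has_bracketed_email and build_cauris_orgdb are hypothetical helper names introduced only for this illustration. The heavy call is wrapped in a function and left uncalled because it downloads and parses a large amount of data.

```r
# Placeholder identity; replace with your own name and e-mail
maintainer <- "Your Name <your.name@example.org>"

# Hypothetical check: the maintainer field needs a name followed by an
# e-mail address wrapped in < and >
has_bracketed_email <- function(x) {
  grepl("^.+ <[^<>[:space:]]+@[^<>[:space:]]+>$", x)
}
stopifnot(has_bracketed_email(maintainer))

# Hypothetical wrapper around the AnnotationForge call for C. auris
build_cauris_orgdb <- function(out_dir = ".") {
  AnnotationForge::makeOrgPackageFromNCBI(
    version    = "0.1",
    author     = maintainer,
    maintainer = maintainer,
    outputDir  = out_dir,
    tax_id     = "498019",
    genus      = "Candida",
    species    = "auris"
  )
}

# Not run here: slow, and needs AnnotationForge plus network access
# build_cauris_orgdb()
# install.packages("./org.Cauris.eg.db", repos = NULL, type = "source")
```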

ADD COMMENT • link 3.5 years ago James W. MacDonald 68k

0

@James this is the clusterProfiler R script I use for enrichment analysis. The data being analyzed are Candida auris proteomics data, which can be downloaded from differential analysis data. I would appreciate it if you could help solve this problem.

setwd("C:\\Users\\Javan\\Desktop\\NelsonSoares\\candidaProject\\DifferentialsPx")
library(clusterProfiler)
#library(org.Hs.eg.db)
library(enrichplot)
library(dplyr)
library(pathview)
library(proteus)
library(org.Sc.sgd.db)
#library(org.Mm.eg.db)
keytypes(org.Sc.sgd.db) #Show the database keytypes

data <- read.csv("SA01-SB01.csv",header = T,sep = ',')

colnames(data)

data <- dplyr::select(data, X,EffectSize,pValue) ; dim(data)

data = subset(data,EffectSize >= 1.5 | pValue < 0.05 ) ;dim(data)#| EffectSize <= -1);dim(data)

gene <- data$X# extract Gene names

# this translates the protein IDs to ENTREZID
gene.df <- bitr(gene, fromType = "UNIPROT", toType = "ENTREZID",OrgDb = org.Sc.sgd.db) ; dim(gene.df) # This is the stage which is failing. 


# Make a geneList for some future functions
geneList <- gene.df$ENTREZID
names(geneList) <- as.character(gene.df$UNIPROT)
geneList <- sort(geneList, decreasing = TRUE)

# gene enrichment analysis cnplots are commented out as they look crazy with a large number of proteins
## BP
ego_BP2 <- enrichGO(gene = gene.df$ENTREZID,
                    OrgDb = org.Sc.sgd.db,
                    ont = "BP",
                    pAdjustMethod = "BH",
                    readable = TRUE,
                    pvalueCutoff  = 0.01,
                    qvalueCutoff  = 0.05)

head(ego_BP2,10) #check the first 10 entries from ego_BP

df = as.data.frame(ego_BP2)

write.csv(df,"LTBI_B1vsPPD_GO_BP.csv")

ego2 <- simplify(ego_BP2) ; dim(ego2) # remove redundant GO terms first 

dotplot(ego2, showCategory=24,x="count",font.size = 9,title=" ")

#Barplot
barplot(ego2, 
        drop = TRUE, 
        showCategory = 20, 
        title = " ",
        font.size = 9,
        x="count")

ADD REPLY • link 3.5 years ago Javan Okendo • 0

0

If you have C. auris data, then you need to create an OrgDb, as I have already noted in my previous answer. The org.Sc.sgd.db package contains data for S. cerevisiae. While both are yeasts, I would imagine that the UniProt IDs for S. cerevisiae are not the same as those for C. auris. You probably need to follow my existing advice and build an OrgDb package for the actual species you are working with.
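
One quick way to confirm a species mismatch like this before calling bitr() is to check what fraction of the input IDs are valid keys in the annotation package. The sketch below uses base R with made-up accessions; fraction_known is a hypothetical helper name, and the commented lines show how it might be applied to the script's gene vector.

```r
# Hypothetical helper: fraction of distinct, non-missing IDs found among the keys
fraction_known <- function(ids, known_keys) {
  ids <- unique(ids[!is.na(ids)])
  if (length(ids) == 0) return(NA_real_)
  mean(ids %in% known_keys)
}

# Toy illustration with made-up accessions (not real UniProt IDs)
my_ids     <- c("X00001", "X00002", "X00003", "X00003")
known_keys <- c("X00003", "Y00001")
print(fraction_known(my_ids, known_keys))

# With a real OrgDb (not run here):
# known_keys <- AnnotationDbi::keys(org.Sc.sgd.db, keytype = "UNIPROT")
# fraction_known(gene, known_keys)
```

A value at or near zero suggests the IDs belong to a different organism than the one the package annotates.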

ADD REPLY • link 3.5 years ago James W. MacDonald 68k

0

@James I did create org.Cauris.eg.db, but the UniProt keytype is missing when I run keytypes(org.Cauris.eg.db), as shown below. Is there a way I can add the UniProt information to the package? FYI, this is my first Bioconductor package and I am happy.

keytypes(org.Cauris.eg.db)
 [1] "BLASTX_SUMMARY"  "CGD_ORTHOLOG"    "DESCRIPTION"     "EVIDENCE"        "EVIDENCEALL"    
 [6] "GID"             "GO"              "GOALL"           "ONTOLOGY"        "ONTOLOGYALL"    
[11] "ORTHOLOG_SOURCE" "PFAM_ID"         "PFAM_TERM"       "POSITION"        "STRAND"

When you do the same on the human database, you see UNIPROT plus other keytypes, as follows:

keytypes(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
 [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [9] "EVIDENCEALL"  "GENENAME"     "GENETYPE"     "GO"          
[13] "GOALL"        "IPI"          "MAP"          "OMIM"        
[17] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"      
[25] "UCSCKG"       "UNIPROT"

ADD REPLY • link 3.5 years ago Javan Okendo • 0

0

Ah, I get it. No, the UniProt data aren't added to a package generated using makeOrgPackageFromNCBI. That would require an additional download step, and it's already painful enough as it is. Looking at uniprot.org, it doesn't seem like many (if any) of the genes have an NCBI Gene ID, or a GID for that matter.

What do you get from

length(keys(org.Cauris.eg.db))
## and
head(keys(org.Cauris.eg.db))

If you have UniProt KB IDs, you can do a test to see what UniProt has for them, by doing something like

## use a subset of your genes
genesub <- gene[1:500] ## or something smaller
URL <- paste0("https://www.uniprot.org/mapping/?from=ACC&to=P_ENTREZGENEID&format=tab&query=", paste(genesub, collapse = "%20"))
read.table(URL, sep = "\t", fill = TRUE, header = TRUE)

If you just get something like

[1] From To  
<0 rows> (or 0-length row.names)

That means UniProt doesn't have the mappings, which makes it tough to do.

ADD REPLY • link 3.5 years ago James W. MacDonald 68k

0

@James thanks for being patient with me. The org.Cauris.eg.db package I created is available here: https://github.com/javanOkendo/Candida_aurisPackage

length(keys(org.Cauris.eg.db)) gives 5585

head(keys(org.Cauris.eg.db))  indicates "B9J08_000001" "B9J08_000002" "B9J08_000003" "B9J08_000004" "B9J08_000005" "B9J08_000006"

genesub <- gene[1:500] ## or something smaller
URL <- paste0("https://www.uniprot.org/mapping/?from=ACC&to=P_ENTREZGENEID&format=tab&query=", paste(genesub, collapse = "%20"))
read.table(URL, sep = "\t", fill = TRUE, header = TRUE)

Gives:

[1] From To  
<0 rows> (or 0-length row.names)