Question

Trying to convert ATACseq annotations of non-model species into human orthologs, but not sure of what to do.

0

Entering edit mode

ronin • 0

@73e6e70d

Last seen 22 hours ago

Estonia

I have a list of genes from an ATACseq experiment in an Excel file which was annotated with HOMER. Here is an example of what it looks like:

atacseq peak annotations

My goal is to convert these Entrez ID's in my Excel document into human orthologs that I can use for GO purposes. I'm dealing with tens of thousands of IDs, so I am hoping to find a way to use some code to do this. However, I do not have the slightest idea of how I can do this!

Ideally I'd have a column added to the file that had the corresponding human ortholog for each row, with a blank space where no gene was found. My study species is non-model, the Eurasian perch (Perca fluviatilis). I'm feeling pretty lost and stuck right now, so any code or advice that could be provided would be immensely appreciated. Cheers!

genomes ATACSeq • 395 views

ADD COMMENT • link 2 days ago • updated 22 hours ago ronin • 0

0

Entering edit mode

Not an answer to your question, but it seems the content of the column Entrez ID are actually no entrez ids but rather aliases... Entrez ids consist of numbers only, but your ids start with PFLUV_.

[Added: the ids starting with PFLUV_ are no aliases, but locus tags (see for example https://www.ncbi.nlm.nih.gov/gene/120575494); like James also noted below.]

See: https://www.ncbi.nlm.nih.gov/gene/?term=txid8168[Organism:noexp]

You will then also notice that NCBI has info on ~32k genes of this species, and the function makeOrgPackageFromNCBI from the package AnnotationForge will allow you to create a Bioconductor-compatibe annotation package for your species. See e.g. here for a recent thread on the use of this function: # Invalid keytype: GOALL. Please use the keytypes method to see a listing of valid arguments.

ADD REPLY • link 2 days ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

There's already an OrgDb for this species on the AnnotationHub that hypothetically would be ideal, but there are almost no GO mappings, so it won't be very useful. Making an OrgDb might work better though? I don't know how the OrgDb packages are made for the AnnotationHub, but I sort of presume it's via makeOrgPackageFromNCBI.

ADD REPLY • link 2 days ago James W. MacDonald 67k

score 1 · Answer 1 · 2024-09-17

It's probably not going to work that well using that species, but you can give it a shot. Note that you don't appear to have any good IDs. The closest is the column labeled EntrezID, which has LocusTags in it. You can convert those IDs using EDirect

esearch -db gene -query "Perca fluvialitis" | efetch -format docsum | xtract -pattern DocumentSummary -element Name Id OtherAliases OtherDesignations > tmp.txt

You can then use R

## just the first few - you will have to read them all in
> ids <- paste0("PFLUV_G", sprintf("%08d", c(184910,27280,227510,64610,172300,144370)))
> ids
[1] "PFLUV_G00184910" "PFLUV_G00027280" "PFLUV_G00227510" "PFLUV_G00064610"
[5] "PFLUV_G00172300" "PFLUV_G00144370"

## read in the file from EDirect and map
> mapper <- read.delim("tmp.txt", header = FALSE)
> head(mapper)
       V1        V2                             V3
1     saa 114559349                EPR50_G00004000
2 hsd11b2 120556064                PFLUV_G00038130
3  socs1a 114569781                  SOCS-1, socs1
4  socs3a 114569548 EPR50_G00158030, SOCS-3, socs3
5    ahr2 114550691           EPR50_G00241230, AhR
6    tp53 114545465                EPR50_G00197220
                                                                                                               V4
1                                                                                  serum amyloid A-5 protein-like
2 LOW QUALITY PROTEIN: corticosteroid 11-beta-dehydrogenase isozyme 2|11-beta-hydroxysteroid dehydrogenase type 2
3                                                                              suppressor of cytokine signaling 1
4                                                                              suppressor of cytokine signaling 3
5                                                                                  aryl hydrocarbon receptor-like
6                                                            cellular tumor antigen p53-like|tumor suppressor p53
> egids <- mapper[match(ids, mapper[,3]),2]
> egids
[1] 120575335 120555898 120547723 120558550 120574289        NA

## now convert to Human using Orthology.eg.db

> library(Orthology.eg.db)
> select(Orthology.eg.db, egids, "Homo.sapiens","Perca.fluvialitis")
  Perca.fluvialitis Homo.sapiens
1         120575335           NA
2         120555898           NA
3         120547723           NA
4         120558550           NA
5         120574289           NA
6                NA           NA

It may simply be that the IDs I chose are not mapping (they are mostly LOC genes, meaning that they are probably a gene, but not identified), and you will get better results with the whole set, but it's probably going to be tough sledding.

There is an OrgDb for P. fluviatilis, but only a handful of the Gene IDs are mapped to any GO term, so that's a bust as well.

score 1 · Answer 2 · 2024-09-17

I would use the mapper I note above, and build your own OrgDb, then do the GO analysis directly.

> library(AnnotationForge)
> options(timeout = 1e6)
> makeOrgPackageFromNCBI("0.0.1","me <me@mine.org>", "me",".", "8168","Perca", "fluviatilis")
## this previous step takes forever, because you have to download big files and parse them.
## an alternative is to download all the gene2xxx.gz and the gene_info files directly from
## https://ftp.ncbi.nlm.nih.gov/gene/DATA/ and then run makeOrgPackageFromNCBI
## in the same dir as those downloaded files

## the type = "source" in the line below is only necessary if on Windows
> install.packages("org.Pfluviatilis.eg.db/", repos = NULL, type = "source")
## this part is just inside baseball to show the GO mappings
> library(DBI)
> con <- org.Pfluviatilis.eg_dbconn()
> dbGetQuery(con, "select count(distinct(GID)) from go_bp_all inner join genes using (_id);")
  count(distinct(GID))
1                16350

Having 16350 genes mapping to GO terms is reasonable, so this should be OK for a direct GO analysis.