The positional location in the OrgDb
packages are a relic of an earlier time, before we had the TxDb
packages. I have been meaning to remove those data for a while now, but as you can see that would break some long-standing packages, so that might be a difficult thing to do at this point.
Anyway, you should really be using a TxDb
package for that sort of thing, because there is a guarantee as to what genome build the positions came from. The OrgDb
packages aren't tied to a build, and I don't even remember where the data come from, so there's a question of provenance. If I were to recapitulate what is done in the example for nearestTSS
, I would do this:
> library(Homo.sapiens)
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
## put the newest TxDb in there...
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene
> tx <- transcripts(Homo.sapiens, columns = c("SYMBOL","GENEID"))
'select()' returned 1:1 mapping between keys and columns
> tss <- resize(tx, width = 1, fix = "start")
> gr <- GRanges(paste0("chr", c("1","1")), IRanges(c(1000000,2000000), width = 1))
> tss[nearest(gr, tss)]
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | GENEID SYMBOL
<Rle> <IRanges> <Rle> | <CharacterList> <CharacterList>
[1] chr1 999981 - | 57801 HES4
[2] chr1 2003714 - | 85452 CFAP74
-------
seqinfo: 640 sequences (1 circular) from hg38 genome
>
> distanceToNearest(gr, tss)
Hits object with 2 hits and 1 metadata column:
queryHits subjectHits | distance
<integer> <integer> | <integer>
[1] 1 10974 | 18
[2] 2 11259 | 3713
-------
queryLength: 2 / subjectLength: 258145
## as compared to
> nearestTSS(chr = c("1","1"), locus = c(1000000,2000000))
gene_id symbol width tss strand distance
1 57801 HES4 1134 1000097 - -97
2 85452 CFAP74 81830 2003786 - -3786
But you have pig data, so it's a bit more complicated. Here's how I would do that.
> library(AnnotationHub)
> hub <- AnnotationHub()
|======================================================================| 100%
snapshotDate(): 2021-10-20
> query(hub, c("txdb","sus scrofa"))
AnnotationHub with 13 records
# snapshotDate(): 2021-10-20
# $dataprovider: UCSC
# $species: Sus scrofa
# $rdataclass: TxDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH52273"]]'
title
AH52273 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite
AH61788 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite
AH61800 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite
AH66180 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite
AH66181 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite
... ...
AH75768 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite
AH79598 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite
AH79599 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite
AH84144 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite <-------------------------- Probably the newest one right there
AH84145 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite
> txdb <- hub[["AH84144"]]
downloading 1 resources
retrieving 1 resource
|======================================================================| 100%
loading from cache
> query(hub, c("orgdb","sus scrofa"))
AnnotationHub with 1 record
# snapshotDate(): 2021-10-20
# names(): AH95961
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Sus scrofa
# $rdataclass: OrgDb
# $rdatadateadded: 2021-10-08
# $title: org.Ss.eg.db.sqlite
# $description: NCBI gene ID based annotations about Sus scrofa
# $taxonomyid: 9823
# $genome: NCBI genomes
# $sourcetype: NCBI/ensembl
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl.org/p...
# $sourcesize: NA
# $tags: c("NCBI", "Gene", "Annotation")
# retrieve record with 'object[["AH95961"]]'
> orgdb <- hub[["AH95961"]]
downloading 1 resources
retrieving 1 resource
|======================================================================| 100%
loading from cache
> pigtx <- transcripts(txdb)
## now add in symbol and NCBI Gene ID
> mcols(pigtx)$SYMBOL <- mapIds(orgdb, pigtx$tx_name, "SYMBOL", "REFSEQ")
'select()' returned 1:1 mapping between keys and columns
> mcols(pigtx)$GENEID <- mapIds(orgdb, pigtx$tx_name, "ENTREZID", "REFSEQ")
'select()' returned 1:1 mapping between keys and columns
> pigtss <- resize(pigtx, 1)
> pigtss[nearest(gr, pigtss)]
GRanges object with 2 ranges and 4 metadata columns:
seqnames ranges strand | tx_id tx_name SYMBOL GENEID
<Rle> <IRanges> <Rle> | <integer> <character> <character> <character>
[1] chr1 1013802 - | 178 NR_128500 MIR9815 104796093
[2] chr1 1665643 - | 180 NR_128509 MIR9824 104796099
-------
seqinfo: 613 sequences (1 circular) from susScr11 genome
>
From the man page of
org.Ss.eg.db
here it doesn't look like we report CHRLOC information for S. scrofa.