I'm attempting to use bumphunter::annotateNearest() with Canis familiaris 3.1 (see https://github.com/ririzarr/bumphunter/issues/1). annotateNearest() appears to accept an object with a set of transcripts and annotate against that, even if the transcripts are from a non-human genome, but there isn't specific documentation about how to do this. My attempt is below along with the issue I've bumped up against.
I built my transcripts object like so:
library(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
library(RMySQL)
library(bumphunter)
tx <- transcripts(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
names(tx) <- unlist(mcols(tx)[,2])
con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user="genome", dbname="canFam3")
refGene <- dbGetQuery(con, "select * from xenoRefGene;")
refGene <- refGene[match(names(tx), refGene$name),]
Nexons = sapply(strsplit(refGene$exonStarts, ","), length)
Exons <- IRangesList(start = lapply(strsplit(refGene$exonStarts, ","), function(x) x[x != "" & !is.na(x)]), end = lapply(strsplit(refGene$exonEnds, ","), function(x) x[x != "" & !is.na(x)]))
mcols(tx) <- DataFrame(CSS = refGene$cdsStart, CSE = refGene$cdsEnd, Tx = refGene$name, Gene = refGene$name2, Nexons = Nexons, Exons = Exons)
I annotated like so:
annotation <- annotateNearest(regions$regions, tx)
(where regions$regions is derfinder output -- I am happy to provide more info there if anyone is curious).
The results are somewhat annotated:
> head(annotation)
dist matchIndex type amountOverlap insideDist size1 size2
1 0 15454 inside NA 59 125 5819
2 0 10018 inside NA -1811 141 15559
3 0 10017 inside NA -34060 118 68867
4 0 13154 inside NA -22676 93 169910
5 0 876 inside NA -30307 59 157801
6 0 13167 inside NA 2095 114 12769
I am surprised that the annotation does not include the gene name or RefSeq ID, as the docs suggest it will. My suspicion is that I did not hand enough information along in my tx object, but the object does contain seqnames, and those weren't passed back in the annotation object.
> tx
GRanges object with 254017 ranges and 6 metadata columns:
seqnames ranges strand | CSS CSE
<Rle> <IRanges> <Rle> | <numeric> <numeric>
NM_011767 chr1 [363561, 368894] + | 75038343 75122798
NM_001011177 chr1 [363601, 367239] + | 363606 367239
NM_001090507 chr1 [363601, 367239] + | 363606 367239
NM_199558 chr1 [363604, 367325] + | 363671 367325
NM_001044283 chr1 [440183, 441303] + | 65252835 65294349
... ... ... ... ... ... ...
NM_001090227 chrX [123325049, 123390403] - | 123325048 123390403
NM_001102830 chrX [123325049, 123390403] - | 123325048 123390403
NM_001135068 chrX [123325049, 123390403] - | 123325048 123390403
NM_001012575 chrX [123325049, 123408056] - | 123325048 123408056
NM_001184797 chrX [123326032, 123438981] - | 123326060 123408176
Tx Gene Nexons
<character> <character> <integer>
NM_011767 NM_011767 Zfr 30
NM_001011177 NM_001011177 zfr 16
NM_001090507 NM_001090507 zfr 17
NM_199558 NM_199558 zfr 15
NM_001044283 NM_001044283 Snx3 10
... ... ... ...
NM_001090227 NM_001090227 MGC68497 6
NM_001102830 NM_001102830 tmlhe 7
NM_001135068 NM_001135068 tmlhe 7
NM_001012575 NM_001012575 TMLHE 7
NM_001184797 NM_001184797 TMLHE 7
Exons
<IRangesList>
NM_011767 [75038283, 75038380] [75038682, 75038775] [75056588, 75056878] ...
NM_001011177 [363600, 363660] [363682, 363706] [363729, 364308] ...
NM_001090507 [363600, 363660] [363682, 363706] [363729, 364308] ...
NM_199558 [363603, 363703] [363723, 364293] [364310, 364412] ...
NM_001044283 [65252221, 65252345] [65252377, 65252423] [65252439, 65252532] ...
... ...
-------
seqinfo: 3268 sequences (1 circular) from canFam3 genome
I would love advice about how to proceed -- is bumphunter behaving as expected (and it's my problem) or is this a bug?
session_info to follow.
Thanks,
Jessica