It looks like you could relatively easily create what you need for C. familiaris, based on the 'TT' data that is supplied by bumphunter.
If you do
library(bumphunter)
data(TT)
TT
> lapply(TT, class)
$txdb
[1] "packageDescription"
$org
[1] "packageDescription"
$transcripts
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"
You can see that the TT data object is a list. The first two items point to installed packages (the TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db packages), and the last list item is a GRanges. If we look at ?TT we get
TT package:bumphunter R Documentation
HG19 Transcripts
Description:
Database of transcripts associated with "known" hg19 genes, namely
those with Entrez ID gene identifiers associated.
Usage:
data(TT)
Format:
A ‘list’ (see ‘known_transcripts’) whose ‘transcripts’ slot is a
‘GRanges’ object with 8 metadata columns, "CSS" (coding start),
"CSE" (coding end), "Tx" (transcript name), "Entrez" (Entrez ID),
"Gene" (gene name), "Refseq" (Refseq number), "Nexons" (number of
exons), and "Exons" (the exons themselves, as ‘IRanges’ objects).
Note that CSS is always less than CSE, even on minus strands where
their "start" and "end" meanings are reversed. Similarly with the
exons.
Source:
the value of ‘bumphunter::known_transcripts()’
So hypothetically you could use makeTxDbPackageFromUCSC() to make and install a TxDb package, then use makeOrgPackageFromNCBI() (from the AnnotationForge package) to make and install an org.Cf.eg.db package. Now for the tricky part: you need a GRanges object with some extra data that I can't seem to extract from a TxDb object.
However, these data are available on UCSC in the refGene table, and we can get at them using RMySQL:
> library(RMySQL)
> con <- dbConnect(MySQL(), host="genome-mysql.cse.ucsc.edu", user="genome", dbname="canFam3")
> tab <- dbGetQuery(con, "select * from refGene;")
> head(tab, 1)
bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount
1 840 NM_001003183 chr21 + 33472196 33474472 33472794 33473719 4
exonStarts exonEnds
1 33472196,33472773,33473044,33473400, 33472322,33472892,33473194,33474472,
score name2 cdsStartStat cdsEndStat exonFrames
1 0 ADM cmpl cmpl -1,0,2,2,
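Note that exonStarts and exonEnds come back as comma-separated strings with a trailing comma, so they need to be split and converted before they can be used as ranges. A minimal base-R sketch, using only the row printed above; parse_ucsc_field is a hypothetical helper name of mine, not something from RMySQL or bumphunter:

```r
# One refGene row, copied from the head(tab, 1) output above
exonStarts <- "33472196,33472773,33473044,33473400,"
exonEnds   <- "33472322,33472892,33473194,33474472,"

# Hypothetical helper: split a UCSC comma-separated field into integers,
# dropping any empty pieces left by the trailing comma
parse_ucsc_field <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  as.integer(parts[parts != ""])
}

starts <- parse_ucsc_field(exonStarts)
ends   <- parse_ucsc_field(exonEnds)

# Sanity checks before building ranges from these vectors
stopifnot(length(starts) == length(ends), all(starts < ends))

n_exons <- length(starts)  # should agree with the exonCount column (4 here)
```

The same function could be applied over the whole column with lapply() once the full table is loaded.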
And the TT$transcripts object has in it:
> TT$transcripts[1,]
GRanges object with 1 range and 8 metadata columns:
seqnames ranges strand | CSS CSE
<Rle> <IRanges> <Rle> | <integer> <integer>
uc002qsd.4 chr19 [58858172, 58864865] - | 58858388 58864803
Tx Entrez Gene Refseq Nexons
<character> <Rle> <Rle> <Rle> <integer>
uc002qsd.4 uc002qsd.4 1 A1BG NM_130786 NP_570602 8
Exons
<IRangesList>
uc002qsd.4 [58858172, 58858395] [58858719, 58859006] [58861736, 58862017] ...
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
Edit to add building TxDb package
And then
> makeTxDbPackageFromUCSC("3.10.0", "me <me@mine.org>", "me", genome="canFam3", tablename="refGene")
> install.packages("TxDb.Cfamiliaris.UCSC.canFam3.refGene/", repos=NULL)
> library(TxDb.Cfamiliaris.UCSC.canFam3.refGene)
> tx <- transcripts(TxDb.Cfamiliaris.UCSC.canFam3.refGene)
> names(tx) <- unlist(mcols(tx)[,2])
> tab <- tab[match(names(tx), tab$name),]
> Nexons = sapply(strsplit(tab$exonStarts, ","), length)
> Exons <- IRangesList(start = lapply(strsplit(tab$exonStarts, ","), function(x) as.integer(x[x != ""])), end = lapply(strsplit(tab$exonEnds, ","), function(x) as.integer(x[x != ""])))
> tab2 <- DataFrame(CSS = tab$cdsStart, CSE = tab$cdsEnd, Tx = tab$name, Gene = tab$name2, Nexons = Nexons, Exons = Exons)
> mcols(tx) <- tab2
> tx
GRanges object with 1594 ranges and 6 metadata columns:
seqnames ranges strand | CSS CSE
<Rle> <IRanges> <Rle> | <numeric> <numeric>
NM_001193298 chr1 [ 5070925, 5106668] + | 5070956 5106645
NM_001002949 chr1 [13733849, 13900653] + | 13734503 13900450
NM_001080724 chr1 [16131829, 16132827] + | 16131828 16132827
NM_001003268 chr1 [20060788, 20417367] + | 20065495 20416955
NM_001252259 chr1 [26899859, 26936993] + | 26928381 26936545
... ... ... ... ... ... ...
NM_001251943 chrX [ 92445049, 92446007] - | 44192036 44203202
NM_001048125 chrX [117494989, 117515419] - | 117494988 117515419
NM_001271782 chrX [122852041, 122878835] - | 122852115 122878791
NM_001003212 chrX [122897024, 123043178] - | 122897136 123043178
NM_001195154 chrX [123268867, 123291467] - | 123269260 123291233
Tx Gene Nexons
<character> <character> <numeric>
NM_001193298 NM_001193298 CYB5A 5
NM_001002949 NM_001002949 BCL2 5
NM_001080724 NM_001080724 MC4R 1
NM_001003268 NM_001003268 TCF4 19
NM_001252259 NM_001252259 TBPL1 7
... ... ... ...
NM_001251943 NM_001251943 EIF4E2 6
NM_001048125 NM_001048125 IDS 9
NM_001271782 NM_001271782 MPP1 12
NM_001003212 NM_001003212 F8 26
NM_001195154 NM_001195154 CLIC2 6
Exons
<IRangesList>
NM_001193298 [5070924, 5071085] [5098788, 5098917] [5100771, 5100801] ...
NM_001002949 [13733848, 13733995] [13734221, 13734659] [13734742, 13734811] ...
NM_001080724 [16131828, 16132827]
NM_001003268 [20060787, 20061314] [20065495, 20065568] [20181826, 20181888] ...
NM_001252259 [26899858, 26900050] [26928337, 26928516] [26930859, 26930942] ...
... ...
NM_001251943 [44192036, 44192151] [44193800, 44193935] [44199498, 44199603] ...
NM_001048125 [117494988, 117495479] [117498997, 117499171] [117501975, 117502102] ...
NM_001271782 [122852040, 122852292] [122853336, 122853411] [122853566, 122853769] ...
NM_001003212 [122897023, 122897292] [122906044, 122906221] [122907219, 122907368] ...
NM_001195154 [123268866, 123269422] [123270189, 123270371] [123271021, 123271128] ...
-------
seqinfo: 3268 sequences (1 circular) from canFam3 genome
At which point I think you can just make a 'TT' type object (assuming you have made the org.Cf.eg.db package)
myTT <- list(txdb = packageDescription("TxDb.Cfamiliaris.UCSC.canFam3.refGene"), org = packageDescription("org.Cf.eg.db"), transcripts = tx)
and then I presume you can go forward. But do note that only 1594 transcripts are annotated here, so the nearest transcript can be pretty far away.
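One caveat about the match() step: match() returns only the first hit, so if an accession appears more than once in refGene (for example, aligned to two loci), every copy in tx gets the metadata of the first row. The NM_001251943 line in the output above, whose CSS falls outside its range, looks consistent with that, though I have not verified it. A toy base-R sketch of the effect and one possible workaround; the data and the composite key here are made up for illustration:

```r
# Toy stand-in for the refGene table; values are invented for illustration
tab <- data.frame(
  name    = c("NM_A", "NM_B", "NM_A"),
  chrom   = c("chr1", "chr2", "chrX"),
  txStart = c(100L, 200L, 900L),
  stringsAsFactors = FALSE
)

# Toy stand-ins for names(tx) and the chromosome of each transcript
tx_names <- c("NM_B", "NM_A", "NM_A")
tx_chrom <- c("chr2", "chr1", "chrX")

# Matching on name alone: both NM_A transcripts are pointed at row 1
idx_name <- match(tx_names, tab$name)

# Matching on a composite key (name plus chromosome) keeps the copies apart
idx_key <- match(paste(tx_names, tx_chrom), paste(tab$name, tab$chrom))

idx_name  # 2 1 1
idx_key   # 2 1 3
```

If duplicates can also occur on the same chromosome, the key would need another field such as the start position, keeping in mind that the coordinates in the table and in the GRanges appear to differ by one in the output above.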
Note that I have edited the previous answer twice now, once to add in the creation of the TxDb package, and once to correct the code to create Nexons.
This gives me something to go forward with -- thank you!
1500 transcripts is not all that useful, and I'm surprised, because many more have been annotated in dog. It seems like my best next steps are to start playing with the code and to contact the GenomicFeatures folks to ask how to make a richer tx database (but if you have a better idea for me, I'm all ears :)
Thanks for all the help,
Jessica