Question

Using bumphunter with non-human genomes

0

Entering edit mode

jessica.hekman ▴ 40

@jessicahekman-6877

Last seen 9.6 years ago

United States

I am using derfinder with the Canis familiaris genome; derfinder uses bumphunter::annotateNearest() to annotate its results; and bumphunter only annotates against hg19. (see derfinder: analyzeChr() breaks on chromosomes > 22)

I'm willing to build an annotation object for Canis familiaris, but am not sure how to go about doing so. Any help would be appreciated!

> devtools::session_info()
Session info-------------------------------------------------------------------
 setting  value                       
 version  R version 3.1.1 (2014-07-10)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Chicago             

Packages-----------------------------------------------------------------------
 package       * version date       source        
 BiocGenerics  * 0.12.0  2014-10-15 Bioconductor  
 bumphunter    * 1.6.0   2014-10-15 Bioconductor  
 codetools       0.2.9   2014-08-21 CRAN (R 3.1.1)
 devtools      * 1.6.1   2014-10-07 CRAN (R 3.1.1)
 digest          0.6.4   2013-12-03 CRAN (R 3.0.2)
 doRNG           1.6     2014-03-07 CRAN (R 3.1.1)
 foreach       * 1.4.2   2014-04-11 CRAN (R 3.1.1)
 GenomeInfoDb  * 1.2.2   2014-10-30 Bioconductor  
 GenomicRanges * 1.18.1  2014-10-30 Bioconductor  
 IRanges       * 2.0.0   2014-10-15 Bioconductor  
 iterators     * 1.0.7   2014-04-11 CRAN (R 3.1.1)
 lattice         0.20.29 2014-04-04 CRAN (R 3.1.1)
 locfit        * 1.5.9.1 2013-04-20 CRAN (R 3.0.2)
 matrixStats     0.10.3  2014-10-15 CRAN (R 3.1.1)
 pkgmaker        0.22    2014-05-14 CRAN (R 3.1.1)
 R.methodsS3     1.6.1   2014-01-05 CRAN (R 3.1.1)
 registry        0.2     2012-01-24 CRAN (R 3.1.1)
 rngtools        1.2.4   2014-03-06 CRAN (R 3.1.1)
 rstudioapi      0.1     2014-03-27 CRAN (R 3.1.1)
 S4Vectors     * 0.4.0   2014-10-15 Bioconductor  
 stringr         0.6.2   2012-12-06 CRAN (R 3.0.2)
 xtable          1.7.4   2014-09-12 CRAN (R 3.1.1)
 XVector         0.6.0   2014-10-15 Bioconductor

Thanks,

Jessica

bumphunter • 3.4k views

ADD COMMENT • link updated 10.2 years ago by Leonardo Collado Torres ★ 1.1k • written 10.4 years ago by jessica.hekman ▴ 40

1

Entering edit mode

rafa ▴ 60

@rafa-7160

Last seen 6.4 years ago

United States

Hi I updated all this. I will make a general announcement.

ADD COMMENT • link 10.3 years ago rafa ▴ 60

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

> library(GenomicFeatures)
> z <- makeTranscriptDbFromUCSC("canFam3", "refGene")
Download the refGene table ... OK
Download the refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
  UCSC data anomaly in 150 transcript(s): the cds cumulative length is
  not a multiple of 3 for transcripts ‘NM_001286967’ ‘NM_001002949’
  ‘NM_001003268’ ‘NM_001003372’ ‘NM_001012395’ ‘NM_001003021’
  ‘NM_001097552’ ‘NM_001197156’ ‘NM_001190429’ ‘NM_001024636’
  ‘NM_001003011’ ‘NM_001097552’ ‘NM_001003148’ ‘NM_001003195’
  ‘NM_001003184’ ‘NM_001247976’ ‘NM_001003321’ ‘NM_001097547’
  ‘NM_001284483’ ‘NM_001003140’ ‘NM_001003300’ ‘NM_001197035’
  ‘NM_001252229’ ‘NM_001284472’ ‘NM_001003125’ ‘NM_001003083’
  ‘NM_001251942’ ‘NM_001003207’ ‘NM_001014282’ ‘NM_001002958’
  ‘NM_001006646’ ‘NM_001161713’ ‘NM_001195845’ ‘NM_001252035’
  ‘NM_001193299’ ‘NM_001006650’ ‘NM_001197118’ ‘NM_001003077’
  ‘NM_001003341’ ‘NM_001195512’ ‘NM_001002995’ ‘NM_001006123’
  ‘NM_001002939’ ‘NM_001005262’ ‘NM_001002960’ ‘NM_001113711� [... truncated]
2: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
3: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
4: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
5: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
6: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
7: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
> z
TxDb object:
| Db type: TxDb
| Supporting package: GenomicFeatures
| Data source: UCSC
| Genome: canFam3
| Organism: Canis familiaris
| UCSC Table: refGene
| Resource URL: http://genome.ucsc.edu/
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| miRBase build ID: NA
| transcript_nrow: 1594
| exon_nrow: 14115
| cds_nrow: 13653
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2014-11-10 10:12:25 -0800 (Mon, 10 Nov 2014)
| GenomicFeatures version at creation time: 1.18.2
| RSQLite version at creation time: 1.0.0
| DBSCHEMAVERSION: 1.0

> genes(z)
GRanges object with 1496 ranges and 1 metadata column:
            seqnames                 ranges strand   |     gene_id
               <Rle>              <IRanges>  <Rle>   | <character>
  100034663     chr9 [ 21823976,  21829557]      +   |   100034663
  100048999     chr1 [111812658, 111834717]      +   |   100048999
  100049000     chr1 [111745413, 111756011]      +   |   100049000
  100049001     chr6 [ 39505485,  39506909]      +   |   100049001
  100049002     chr9 [  4101050,   4123864]      +   |   100049002
        ...      ...                    ...    ... ...         ...
     612956    chr21 [ 34010884,  34015974]      -   |      612956
     612966    chr28 [ 29947731,  29996625]      -   |      612966
     613014     chrX [123046451, 123064580]      +   |      613014
     664717     chr5 [ 56421687,  56423120]      +   |      664717
     751768    chr29 [ 26384493,  26422575]      -   |      751768
  -------

Or if you think Ensembl has a better build than NCBI/UCSC, you can use makeTranscriptDbFromBiomart()

> zz <- makeTranscriptDbFromBiomart("ensembl","cfamiliaris_gene_ensembl")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
2: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
3: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
4: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
5: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
6: 'dbBeginTransaction' is deprecated.
Use 'dbBegin' instead.
See help("Deprecated")
> zz
TxDb object:
| Db type: TxDb
| Supporting package: GenomicFeatures
| Data source: BioMart
| Organism: Canis familiaris
| Resource URL: www.biomart.org:80
| BioMart database: ensembl
| BioMart database version: ENSEMBL GENES 77 (SANGER UK)
| BioMart dataset: cfamiliaris_gene_ensembl
| BioMart dataset description: Canis familiaris genes (CanFam3.1)
| BioMart dataset version: CanFam3.1
| Full dataset: yes
| miRBase build ID: NA
| transcript_nrow: 29884
| exon_nrow: 218268
| cds_nrow: 203663
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2014-11-10 10:17:53 -0800 (Mon, 10 Nov 2014)
| GenomicFeatures version at creation time: 1.18.2
| RSQLite version at creation time: 1.0.0
| DBSCHEMAVERSION: 1.0

But you might need to add in seqinfo data, and convert to UCSC genome specifications for bumphunter(). Not sure about that one:

> genes(zz)
GRanges object with 24580 ranges and 1 metadata column:
                     seqnames                 ranges strand   |
                        <Rle>              <IRanges>  <Rle>   |
  ENSCAFG00000000001        1       [247829, 322180]      -   |
  ENSCAFG00000000002        1       [363607, 364548]      +   |
  ENSCAFG00000000005        1       [509125, 565905]      +   |
  ENSCAFG00000000006        1       [571409, 571749]      +   |
  ENSCAFG00000000007        1       [600162, 635340]      -   |
                 ...      ...                    ...    ... ...
  ENSCAFG00000032763       27 [ 37124095,  37130390]      -   |
  ENSCAFG00000032764        1 [105074091, 105074616]      +   |
  ENSCAFG00000032765       11 [ 66234288,  66235340]      -   |
  ENSCAFG00000032766       15 [ 37112304,  37112412]      +   |
  ENSCAFG00000032767        8 [ 71773821,  71775125]      +   |
                                gene_id
                            <character>
  ENSCAFG00000000001 ENSCAFG00000000001
  ENSCAFG00000000002 ENSCAFG00000000002
  ENSCAFG00000000005 ENSCAFG00000000005
  ENSCAFG00000000006 ENSCAFG00000000006
  ENSCAFG00000000007 ENSCAFG00000000007
                 ...                ...
  ENSCAFG00000032763 ENSCAFG00000032763
  ENSCAFG00000032764 ENSCAFG00000032764
  ENSCAFG00000032765 ENSCAFG00000032765
  ENSCAFG00000032766 ENSCAFG00000032766
  ENSCAFG00000032767 ENSCAFG00000032767
  -------
  seqinfo: 3268 sequences (1 circular) from an unspecified genome

ADD COMMENT • link 10.4 years ago James W. MacDonald 68k

0

Entering edit mode

From https://github.com/ririzarr/bumphunter/issues/1 it seems that you will need to add some data to `zz`.

Digging further, from https://github.com/ririzarr/bumphunter/blob/master/R/matchGenes.R#L34-L36 and https://github.com/ririzarr/bumphunter/blob/master/R/matchGenes.R#L53-L55 it seems that you need metadata variables `Tx`, `Exons`, `Nexons`, `CSS`, `CSE`, `Gene`, and `Refseq`.

ADD REPLY • link 10.4 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Plus matchGenes() will fail if the genome isn't hg19. See the top of annotateNearest(), which calls matchGenes():

https://github.com/ririzarr/bumphunter/blob/master/R/annotateNearest.R

ADD REPLY • link 10.4 years ago James W. MacDonald 68k

0

Entering edit mode

That's true if you are using a character vector of length 1 for `subject`. As is, the function will work if you supply a `GRanges` object. However, the output won't be the same you get for `subject='hg19'` because the rest of the code for `annotateNearest()` doesn't use the metadata.

This is clearer looking at:

> library(GenomicRanges)
> library(bumphunter)
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
http://www.revolutionanalytics.com
Loading required package: iterators
Loading required package: locfit
locfit 1.5-9.1      2013-03-22
> gr <- GRanges('chr1', IRanges(1, 5000))
> gr
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1 [1, 5000]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> annotateNearest(gr, 'hg19')
Matching regions to genes.
nearestgene: loading bumphunter hg19 transcript database
finding nearest transcripts...
AnnotatingDone.
              name annotation description region distance
uc001aaa.3 DDX11L1  NR_046018        <NA>   <NA>        0
           subregion insidedistance exonnumber nexons  UTR
uc001aaa.3      <NA>             NA         NA      3 <NA>
           strand geneL codingL
uc001aaa.3      +     0       0
> data(TT, package = "bumphunter", envir = environment())
> annotateNearest(gr, TT$transcripts)
  dist matchIndex     type amountOverlap insideDist size1 size2
1 6874       1037 disjoint            NA         NA  5000  2536
> devtools::session_info()
Session info---------------------------------------------------------------------------------------------------------------------------------
 setting  value                                             
 version  R Under development (unstable) (2014-11-01 r66923)
 system   x86_64, darwin10.8.0                              
 ui       AQUA                                              
 language (EN)                                              
 collate  en_US.UTF-8                                       
 tz       America/New_York                                  

Packages-------------------------------------------------------------------------------------------------------------------------------------
 package       * version date       source        
 BiocGenerics  * 0.13.0  2014-10-14 Bioconductor  
 bumphunter    * 1.7.0   2014-10-14 Bioconductor  
 codetools       0.2.9   2014-08-21 CRAN (R 3.2.0)
 devtools        1.6.1   2014-10-07 CRAN (R 3.2.0)
 digest          0.6.4   2013-12-03 CRAN (R 3.2.0)
 doRNG           1.6     2014-03-07 CRAN (R 3.2.0)
 foreach       * 1.4.2   2014-04-11 CRAN (R 3.2.0)
 GenomeInfoDb  * 1.3.6   2014-10-26 Bioconductor  
 GenomicRanges * 1.19.3  2014-11-03 Bioconductor  
 IRanges       * 2.1.4   2014-10-24 Bioconductor  
 iterators     * 1.0.7   2014-04-11 CRAN (R 3.2.0)
 lattice         0.20.29 2014-04-04 CRAN (R 3.2.0)
 locfit        * 1.5.9.1 2013-04-20 CRAN (R 3.2.0)
 matrixStats     0.10.3  2014-10-15 CRAN (R 3.2.0)
 pkgmaker        0.22    2014-05-14 CRAN (R 3.2.0)
 R.methodsS3     1.6.1   2014-01-05 CRAN (R 3.2.0)
 registry        0.2     2012-01-24 CRAN (R 3.2.0)
 rngtools        1.2.4   2014-03-06 CRAN (R 3.2.0)
 rstudioapi      0.1     2014-03-27 CRAN (R 3.2.0)
 S4Vectors     * 0.5.2   2014-10-18 Bioconductor  
 stringr         0.6.2   2012-12-06 CRAN (R 3.2.0)
 xtable          1.7.4   2014-09-12 CRAN (R 3.2.0)
 XVector         0.7.1   2014-10-21 Bioconductor  
>

The results are different even though the same `GRanges` object is used.

In other words, I think that with some docs on what the metadata columns mean (CSE, CSS?), and some minor modifications to annotateNearest(), you could build your own annotation object and feed it to this function to get similar results to `subject='hg19'` for other organisms.

ADD REPLY • link 10.4 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

It sounds like the consensus is that the code James suggested will NOT be sufficient without additional metadata added. I don't mind trying it just in case but am not sure what I would do with the z/zz object he describes. How would I hand that to derfinder to pass on to bumphunter to make things work with CF3?

ADD REPLY • link 10.4 years ago jessica.hekman ▴ 40

0

Entering edit mode

Leonardo Collado Torres ★ 1.1k

@lcolladotor

Last seen 9 weeks ago

United States

Solved in bumphunter 1.7.3 See A: bumphunter::annotateNearest with non-human genome for the full answer

ADD COMMENT • link 10.2 years ago Leonardo Collado Torres ★ 1.1k

score 2 · Accepted Answer · 2014-11-11

It looks like you could relatively easily create what you need for C. familiaris, based on the 'TT' data that is supplied by bumphunter.

If you do

library(bumphunter)
data(TT)
TT
> lapply(TT, class)
$txdb
[1] "packageDescription"
$org
[1] "packageDescription"
$transcripts
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"

You can see that the TT data object is a list. The first two items point to installed packages (the TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db packages), and the last list item is a GRanges. If we look at ?TT we get

TT                 package:bumphunter                  R Documentation

HG19 Transcripts

Description:

     Database of transcripts associated with "known" hg19 genes, namely
     those with Entrez ID gene identifiers associated.

Usage:

     data(TT)
     
Format:

     A ‘list’ (see ‘known_transcripts’) whose ‘transcripts’ slot is a
     ‘GRanges’ object with 8 metadata columns, "CSS" (coding start),
     "CSE" (coding end), "Tx" (transcript name), "Entrez" (Entrez ID),
     "Gene" (gene name), "Refseq" (Refseq number), "Nexons" (number of
     exons), and "Exons" (the exons themselves, as ‘IRanges’ objects).
     Note that CSS is always less than CSE, even on minus strands where
     their "start" and "end" meanings are reversed.  Similarly with the
     exons.

Source:

     the value of ‘bumphunter::known_transcripts()’

So hypothetically you could use makeTxPackageFromUCSC() to make and install a TxDb package. Then use makeOrgPackageFromNCBI() (from AnnotationForge package) to make and install an org.Cf.eg.db package. Now for the tricky part; you need a GRanges object with some extra data that I can't seem to extract from a TxDb object.

However, these data are available on UCSC in the refGene table, and we can get at them using RMySQL:

> library(RMySQL)
> con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user="genome", dbname="canFam3")
> tab <- dbGetQuery(con, "select * from refGene;")
> head(tab, 1)
  bin         name chrom strand  txStart    txEnd cdsStart   cdsEnd exonCount
1 840 NM_001003183 chr21      + 33472196 33474472 33472794 33473719         4
                            exonStarts                             exonEnds
1 33472196,33472773,33473044,33473400, 33472322,33472892,33473194,33474472,
  score name2 cdsStartStat cdsEndStat exonFrames
1     0   ADM         cmpl       cmpl  -1,0,2,2,

And the TT$transcripts object has in it:

> TT$transcripts[1,]
GRanges object with 1 range and 8 metadata columns:
             seqnames               ranges strand |       CSS       CSE
                <Rle>            <IRanges>  <Rle> | <integer> <integer>
  uc002qsd.4    chr19 [58858172, 58864865]      - |  58858388  58864803
                      Tx Entrez  Gene              Refseq    Nexons
             <character>  <Rle> <Rle>               <Rle> <integer>
  uc002qsd.4  uc002qsd.4      1  A1BG NM_130786 NP_570602         8
                                                                          Exons
                                                                  <IRangesList>
  uc002qsd.4 [58858172, 58858395] [58858719, 58859006] [58861736, 58862017] ...
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Edit to add building TxDb package

And then

> makeTxDbPackageFromUCSC("3.10.0", "me <me@mine.org>", "me", genome="canFam3", tablename="refGene")
> install.packages("TxDb.Cfamiliaris.UCSC.canFam3.refGene/", repos=NULL)
> tx <- transcripts(TxDb.Cfamiliaris.UCSC.canFam3.refGene) 
> names(tx) <- unlist(mcols(tx)[,2])) 
> tab <- tab[match(names(tx), tab$name),] 
> Nexons = sapply(strsplit(tab$exonStarts, ","), length)
> Exons <- IRangesList(start = lapply(strsplit(tab$exonStarts, ","), function(x) x[x != ""]), end = lapply(strsplit(tab$exonEnds, ","), function(x) x[x != ""])) 
> tab2 <- DataFrame(CSS = tab$cdsStart, CSE = tab$cdsEnd, Tx = tab$name, Gene = tab$name2, Nexons = Nexons, Exons = Exons) 
> mcols(tx) <- tab2 

> tx
GRanges object with 1594 ranges and 6 metadata columns:
               seqnames                 ranges strand   |       CSS       CSE
                  <Rle>              <IRanges>  <Rle>   | <numeric> <numeric>
  NM_001193298     chr1   [ 5070925,  5106668]      +   |   5070956   5106645
  NM_001002949     chr1   [13733849, 13900653]      +   |  13734503  13900450
  NM_001080724     chr1   [16131829, 16132827]      +   |  16131828  16132827
  NM_001003268     chr1   [20060788, 20417367]      +   |  20065495  20416955
  NM_001252259     chr1   [26899859, 26936993]      +   |  26928381  26936545
           ...      ...                    ...    ... ...       ...       ...
  NM_001251943     chrX [ 92445049,  92446007]      -   |  44192036  44203202
  NM_001048125     chrX [117494989, 117515419]      -   | 117494988 117515419
  NM_001271782     chrX [122852041, 122878835]      -   | 122852115 122878791
  NM_001003212     chrX [122897024, 123043178]      -   | 122897136 123043178
  NM_001195154     chrX [123268867, 123291467]      -   | 123269260 123291233
                         Tx        Gene    Nexons
                <character> <character> <numeric>
  NM_001193298 NM_001193298       CYB5A         5
  NM_001002949 NM_001002949        BCL2         5
  NM_001080724 NM_001080724        MC4R         1
  NM_001003268 NM_001003268        TCF4        19
  NM_001252259 NM_001252259       TBPL1         7
           ...          ...         ...       ...
  NM_001251943 NM_001251943      EIF4E2         6
  NM_001048125 NM_001048125         IDS         9
  NM_001271782 NM_001271782        MPP1        12
  NM_001003212 NM_001003212          F8        26
  NM_001195154 NM_001195154       CLIC2         6
                                                                                  Exons
                                                                          <IRangesList>
  NM_001193298             [5070924, 5071085] [5098788, 5098917] [5100771, 5100801] ...
  NM_001002949       [13733848, 13733995] [13734221, 13734659] [13734742, 13734811] ...
  NM_001080724                                                     [16131828, 16132827]
  NM_001003268       [20060787, 20061314] [20065495, 20065568] [20181826, 20181888] ...
  NM_001252259       [26899858, 26900050] [26928337, 26928516] [26930859, 26930942] ...
           ...                                                                      ...
  NM_001251943       [44192036, 44192151] [44193800, 44193935] [44199498, 44199603] ...
  NM_001048125 [117494988, 117495479] [117498997, 117499171] [117501975, 117502102] ...
  NM_001271782 [122852040, 122852292] [122853336, 122853411] [122853566, 122853769] ...
  NM_001003212 [122897023, 122897292] [122906044, 122906221] [122907219, 122907368] ...
  NM_001195154 [123268866, 123269422] [123270189, 123270371] [123271021, 123271128] ...
  -------
  seqinfo: 3268 sequences (1 circular) from canFam3 genome

At which point I think you can just make a 'TT' type object (assuming you have made the org.Cf.eg.db package)

myTT <- list(packageDescription("TxDb.Cfamiliaris.UCSC.canFam3.refGene"), packageDescription("org.Cf.eg.db"), tx)

and then I presume you can go forward. But do note that there are only like 1500 transcripts annotated here, so the nearest transcript can be pretty far away.