bumphunter::annotateNearest with non-human genome
3
1
Entering edit mode
@jessicahekman-6877
Last seen 9.3 years ago
United States

I'm attempting to use bumphunter::annotateNearest() with Canis familiaris 3.1 (see https://github.com/ririzarr/bumphunter/issues/1). annotateNearest() appears to accept an object with a set of transcripts and annotate against that, even if the transcripts are from a non-human genome, but there isn't specific documentation about how to do this. My attempt is below along with the issue I've bumped up against.

I built my transcripts object like so:

library(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
library(RMySQL)
library(bumphunter)
tx <- transcripts(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
names(tx) <- unlist(mcols(tx)[,2])

con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user="genome", dbname="canFam3")
refGene <- dbGetQuery(con, "select * from xenoRefGene;")
refGene <- refGene[match(names(tx), refGene$name),]
Nexons = sapply(strsplit(refGene$exonStarts, ","), length)
Exons <- IRangesList(start = lapply(strsplit(refGene$exonStarts, ","), function(x) x[x != "" & !is.na(x)]), end = lapply(strsplit(refGene$exonEnds, ","), function(x) x[x != "" & !is.na(x)]))
mcols(tx) <- DataFrame(CSS = refGene$cdsStart, CSE = refGene$cdsEnd, Tx = refGene$name, Gene = refGene$name2, Nexons = Nexons, Exons = Exons)

I annotated like so:

annotation <- annotateNearest(regions$regions, tx)

(where regions$regions is derfinder output -- I am happy to provide more info there if anyone is curious).

The results are somewhat annotated:

> head(annotation)
dist matchIndex   type amountOverlap insideDist size1  size2
1    0      15454 inside            NA         59   125   5819
2    0      10018 inside            NA      -1811   141  15559
3    0      10017 inside            NA     -34060   118  68867
4    0      13154 inside            NA     -22676    93 169910
5    0        876 inside            NA     -30307    59 157801
6    0      13167 inside            NA       2095   114  12769

 

I am surprised that the annotation does not include the gene name or RefSeq ID, as the docs suggest it will. My suspicion is that I did not hand enough information along in my tx object, but the object does contain seqnames, and those weren't passed back in the annotation object.

> tx
GRanges object with 254017 ranges and 6 metadata columns:
               seqnames                 ranges strand   |       CSS       CSE
                  <Rle>              <IRanges>  <Rle>   | <numeric> <numeric>
     NM_011767     chr1       [363561, 368894]      +   |  75038343  75122798
  NM_001011177     chr1       [363601, 367239]      +   |    363606    367239
  NM_001090507     chr1       [363601, 367239]      +   |    363606    367239
     NM_199558     chr1       [363604, 367325]      +   |    363671    367325
  NM_001044283     chr1       [440183, 441303]      +   |  65252835  65294349
           ...      ...                    ...    ... ...       ...       ...
  NM_001090227     chrX [123325049, 123390403]      -   | 123325048 123390403
  NM_001102830     chrX [123325049, 123390403]      -   | 123325048 123390403
  NM_001135068     chrX [123325049, 123390403]      -   | 123325048 123390403
  NM_001012575     chrX [123325049, 123408056]      -   | 123325048 123408056
  NM_001184797     chrX [123326032, 123438981]      -   | 123326060 123408176
                         Tx        Gene    Nexons
                <character> <character> <integer>
     NM_011767    NM_011767         Zfr        30
  NM_001011177 NM_001011177         zfr        16
  NM_001090507 NM_001090507         zfr        17
     NM_199558    NM_199558         zfr        15
  NM_001044283 NM_001044283        Snx3        10
           ...          ...         ...       ...
  NM_001090227 NM_001090227    MGC68497         6
  NM_001102830 NM_001102830       tmlhe         7
  NM_001135068 NM_001135068       tmlhe         7
  NM_001012575 NM_001012575       TMLHE         7
  NM_001184797 NM_001184797       TMLHE         7
                                                                                  Exons
                                                                          <IRangesList>
     NM_011767       [75038283, 75038380] [75038682, 75038775] [75056588, 75056878] ...
  NM_001011177                   [363600, 363660] [363682, 363706] [363729, 364308] ...
  NM_001090507                   [363600, 363660] [363682, 363706] [363729, 364308] ...
     NM_199558                   [363603, 363703] [363723, 364293] [364310, 364412] ...
  NM_001044283       [65252221, 65252345] [65252377, 65252423] [65252439, 65252532] ...
           ...                                                                      ...
  -------
  seqinfo: 3268 sequences (1 circular) from canFam3 genome

 

I would love advice about how to proceed -- is bumphunter behaving as expected (and it's my problem) or is this a bug?

session_info to follow.

Thanks,
Jessica

 
bumphunter • 2.4k views
ADD COMMENT
0
Entering edit mode
> devtools::session_info()
Session info-------------------------------------------------------------------
 setting  value                       
 version  R version 3.1.1 (2014-07-10)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Chicago             

Packages-----------------------------------------------------------------------
 package                                   * version   date      
 acepack                                     1.3.3.3   2013-05-03
 annotate                                    1.44.0    2014-10-15
 AnnotationDbi                             * 1.28.1    2014-10-30
 base64enc                                   0.1.2     2014-06-26
 BatchJobs                                   1.5       2014-10-30
 BBmisc                                      1.8       2014-10-30
 Biobase                                   * 2.26.0    2014-10-15
 BiocGenerics                              * 0.12.1    2014-11-19
 BiocParallel                              * 1.0.0     2014-11-11
 biomaRt                                     2.22.0    2014-10-15
 Biostrings                                  2.34.0    2014-10-15
 bitops                                      1.0.6     2013-08-17
 brew                                        1.0.6     2011-04-13
 bumphunter                                * 1.6.0     2014-10-15
 checkmate                                   1.5.0     2014-10-19
 cluster                                     1.15.3    2014-09-04
 codetools                                   0.2.9     2014-08-21
 colorspace                                  1.2.4     2013-09-30
 DBI                                       * 0.3.1     2014-09-24
 DESeq2                                      1.6.2     2014-11-20
 devtools                                    1.6.1     2014-10-07
 digest                                      0.6.4     2013-12-03
 doRNG                                       1.6       2014-03-07
 fail                                        1.2       2013-09-19
 foreach                                   * 1.4.2     2014-04-11
 foreign                                     0.8.61    2014-03-28
 Formula                                     1.1.2     2014-07-13
 genefilter                                  1.48.1    2014-10-30
 geneplotter                                 1.44.0    2014-10-15
 GenomeInfoDb                              * 1.2.3     2014-11-17
 GenomicAlignments                           1.2.1     2014-11-10
 GenomicFeatures                           * 1.18.2    2014-10-30
 GenomicRanges                             * 1.18.3    2014-11-19
 ggplot2                                     1.0.0     2014-05-21
 gtable                                      0.1.2     2012-12-05
 Hmisc                                       3.14.5    2014-09-12
 IRanges                                   * 2.0.0     2014-10-15
 iterators                                 * 1.0.7     2014-04-11
 lattice                                     0.20.29   2014-04-04
 latticeExtra                                0.6.26    2013-08-15
 locfit                                    * 1.5.9.1   2013-04-20
 MASS                                        7.3.35    2014-09-30
 matrixStats                                 0.10.3    2014-10-15
 munsell                                     0.4.2     2013-07-11
 nnet                                        7.3.8     2014-03-28
 pkgmaker                                    0.22      2014-05-14
 plyr                                        1.8.1     2014-02-26
 proto                                       0.3.10    2012-12-22
 R.methodsS3                                 1.6.1     2014-01-05
 RColorBrewer                                1.0.5     2011-06-17
 Rcpp                                        0.11.3    2014-09-29
 RcppArmadillo                               0.4.500.0 2014-10-30
 RCurl                                       1.95.4.3  2014-07-29
 registry                                    0.2       2012-01-24
 reshape2                                    1.4       2014-04-23
 RMySQL                                    * 0.9.3     2012-01-17
 rngtools                                    1.2.4     2014-03-06
 rpart                                       4.1.8     2014-03-28
 Rsamtools                                   1.18.2    2014-11-11
 RSQLite                                     1.0.0     2014-10-25
 rstudioapi                                  0.1       2014-03-27
 rtracklayer                                 1.26.2    2014-11-11
 S4Vectors                                 * 0.4.0     2014-10-15
 scales                                      0.2.4     2014-04-22
 sendmailR                                   1.2.1     2014-09-21
 stringr                                     0.6.2     2012-12-06
 survival                                    2.37.7    2014-01-22
 TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene * 3.10.0    2014-11-24
 XML                                         3.98.1.1  2013-06-20
 xtable                                      1.7.4     2014-09-12
 XVector                                     0.6.0     2014-10-15
 zlibbioc                                    1.12.0    2014-10-15

ADD REPLY
1
Entering edit mode
rafa ▴ 60
@rafa-7160
Last seen 6.1 years ago
United States

Jessica, I've made some major changes to bumphunter to accommodate this. If you install the latest devel version 1.7.3 (still being tested) the matchGenes and annotateNearest will no longer assume hg19. Here is an example:

library(bumphunter)
islands=makeGRangesFromDataFrame(read.delim("http://rafalab.jhsph.edu/CGI/model-based-cpg-islands-hg19.txt")[1:100,])

library("TxDb.Hsapiens.UCSC.hg19.knownGene")

genes <- annotateTranscripts(TxDb.Hsapiens.UCSC.hg19.knownGene, annotationPackage="org.Hs.eg.db")
tab<- matchGenes(islands,genes)

Note that if you don't specify the annotationPackage argument the function will try to figure out what library to load using the species method. If it can't find an annotation package it will continue and return an object with no gene annotation but all the other components. You should be able to substitute the human TxDb for TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene

Please try it out and report back if you run into trouble.

Note we are not making this back compatible. Apologies if this breaks code, but the previous way had to many problems to keep around. 

ps - You might need to install from gitbub like this devtools::install_github("ririzarr/bumphunter")

ADD COMMENT
1
Entering edit mode
@lcolladotor
Last seen 1 day ago
United States

Hi Jessica,

You can now use matchGenes() (similar to what annotateTranscripts used to do) with non-human genomes as Rafa already mentioned. Basically, use:

library('GenomicFeatures')
library('bumphunter')
can_txdb <- makeTranscriptDbFromUCSC("canFam3", "refGene")
can_genes <- annotateTranscripts(can_txdb, 'org.Cf.eg.db')
annotation <- matchGenes(x, can_genes)

where 'x' is a data.frame or GRanges. For example:

> matchGenes(head(can_genes), can_genes)
          name                                          annotation
1        KRT12                           NM_001082421 NP_001075890
2     CEACAM23                           NM_001097552 NP_001091021
3     CEACAM23                           NM_001097552 NP_001091021
4     CEACAM28 NM_001097546 NP_001091015 XM_005615767 XP_005615824
5     CEACAM24                           NM_001097554 NP_001091023
6 LOC100049001                           NM_001097555 NP_001091024
     description region distance      subregion insideDistance
1         covers covers        0 covers exon(s)             NA
2         covers covers        0 covers exon(s)             NA
3         covers covers        0 covers exon(s)             NA
4 covers exon(s) inside        5 covers exon(s)              0
5         covers covers        0 covers exon(s)             NA
6         covers covers        0 covers exon(s)             NA
  exonnumber nexons            UTR strand geneL codingL    Entrez
1         NA      8           <NA>      +  5581    5190 100034663
2         NA      9           <NA>      + 16264   16245 100048998
3         NA      9           <NA>      - 32617   16245 100048998
4          1      9 overlaps 3'UTR      + 22076   22029    484473
5         NA      6           <NA>      + 10598   10434 100049000
6         NA      5           <NA>      +  1424    1266 100049001
  subjectHits
1           1
2           2
3           3
4        1740
5           5
6           6
> 

Note that I proposed some fixes for some minor bugs at https://github.com/ririzarr/bumphunter/pull/3

Cheers,

Leo

 

> devtools::session_info()
Session info------------------------------------------------------------
 setting  value                                             
 version  R Under development (unstable) (2014-11-01 r66923)
 system   x86_64, darwin10.8.0                              
 ui       AQUA                                              
 language (EN)                                              
 collate  en_US.UTF-8                                       
 tz       America/New_York                                  

Packages----------------------------------------------------------------
 package           * version  date       source        
 AnnotationDbi     * 1.29.13  2015-01-14 Bioconductor  
 base64enc           0.1.2    2014-06-26 CRAN (R 3.2.0)
 BatchJobs           1.4      2014-09-24 CRAN (R 3.2.0)
 BBmisc              1.7      2014-06-21 CRAN (R 3.2.0)
 Biobase           * 2.27.1   2014-12-20 Bioconductor  
 BiocGenerics      * 0.13.4   2014-12-31 Bioconductor  
 BiocParallel        1.1.11   2015-01-10 Bioconductor  
 biomaRt             2.23.5   2014-11-22 Bioconductor  
 Biostrings          2.35.7   2014-12-13 Bioconductor  
 bitops              1.0.6    2013-08-17 CRAN (R 3.2.0)
 brew                1.0.6    2011-04-13 CRAN (R 3.2.0)
 bumphunter        * 1.7.4    2015-01-15 Bioconductor  
 checkmate           1.5.0    2014-10-19 CRAN (R 3.2.0)
 codetools           0.2.9    2014-08-21 CRAN (R 3.2.0)
 DBI               * 0.3.1    2014-09-24 CRAN (R 3.2.0)
 devtools            1.6.1    2014-10-07 CRAN (R 3.2.0)
 digest              0.6.4    2013-12-03 CRAN (R 3.2.0)
 doRNG               1.6      2014-03-07 CRAN (R 3.2.0)
 fail                1.2      2013-09-19 CRAN (R 3.2.0)
 foreach           * 1.4.2    2014-04-11 CRAN (R 3.2.0)
 GenomeInfoDb      * 1.3.12   2014-12-22 Bioconductor  
 GenomicAlignments   1.3.24   2015-01-14 Bioconductor  
 GenomicFeatures   * 1.19.10  2015-01-09 Bioconductor  
 GenomicRanges     * 1.19.33  2015-01-13 Bioconductor  
 IRanges           * 2.1.35   2015-01-07 Bioconductor  
 iterators         * 1.0.7    2014-04-11 CRAN (R 3.2.0)
 lattice             0.20.29  2014-04-04 CRAN (R 3.2.0)
 locfit            * 1.5.9.1  2013-04-20 CRAN (R 3.2.0)
 matrixStats         0.10.3   2014-10-15 CRAN (R 3.2.0)
 org.Cf.eg.db      * 3.0.0    2014-09-26 Bioconductor  
 pkgmaker            0.22     2014-05-14 CRAN (R 3.2.0)
 R.methodsS3         1.6.1    2014-01-05 CRAN (R 3.2.0)
 RCurl               1.95.4.3 2014-07-29 CRAN (R 3.2.0)
 registry            0.2      2012-01-24 CRAN (R 3.2.0)
 rngtools            1.2.4    2014-03-06 CRAN (R 3.2.0)
 Rsamtools           1.19.26  2015-01-13 Bioconductor  
 RSQLite           * 1.0.0    2014-10-25 CRAN (R 3.2.0)
 rstudioapi          0.1      2014-03-27 CRAN (R 3.2.0)
 rtracklayer         1.27.7   2015-01-14 Bioconductor  
 S4Vectors         * 0.5.16   2015-01-07 Bioconductor  
 sendmailR           1.2.1    2014-09-21 CRAN (R 3.2.0)
 stringr             0.6.2    2012-12-06 CRAN (R 3.2.0)
 XML                 3.98.1.1 2013-06-20 CRAN (R 3.2.0)
 xtable              1.7.4    2014-09-12 CRAN (R 3.2.0)
 XVector           * 0.7.3    2014-11-24 Bioconductor  
 zlibbioc            1.13.0   2014-10-14 Bioconductor 
ADD COMMENT
0
Entering edit mode

Amazing that it took me this long to get back to this! (In the meantime I went to Siberia, passed my prelims, presented my research twice. and helped submit an NSF proposal... All of which prevented me from getting to my own research.)

Anyways: It works! Thanks again to Leo and Rafa for your support.

Leo, I suggest that you modify the derfinder advanced vignette section on non-human data to add the suggestion to use code like this (this is what worked for me):

 

library('GenomicFeatures')
library('bumphunter')

can_txdb <- makeTranscriptDbFromUCSC("canFam3", "refGene")
can_genes <- annotateTranscripts(can_txdb, 'org.Cf.eg.db')

# Perhaps a note here that makeTranscriptDbFromUCSC is network-intensive and so it makes sense to run it once and then save the output and load it again in subsequent runs

results <- analyzeChr(..., runAnnotation=FALSE)

annotation <- matchGenes(results$regions$regions, can_genes)

 

By the way, I have been unable to figure out how to make an appropriate txdb object to hand to analyzeChr() so that this can all be done in one step. I'm happy with how things work for me now so I'm not intending to spend much time on it, but if you want to have a cleaner example in the advanced vignette, you could suggest some code for me to try and I'd be happy to make sure it works before you put it in there.

My hope is that I will now get to move forward with analyzing the implications of my DERs, in concert with some DEGs I got from DESeq2. If and when this is published I'll let you know (of course I would reference derfinder).

Thanks,

Jessica

ADD REPLY
0
Entering edit mode
@jessicahekman-6877
Last seen 9.3 years ago
United States

Thank you so much! I am in the middle of final exams at the moment but I will check it out in the next few days and report back.

I really appreciate the work.

Jessica

ADD COMMENT

Login before adding your answer.

Traffic: 847 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6