Question

Assign genes to CpG islands

0

Entering edit mode

José Luis Lavín ▴ 10

@jose-luis-lavin-5531

Last seen 7.1 years ago

Spain

Dear all,

I'm looking for a Bioconductor/R package that enables me to annotate a table of CpG island coordinates (downloaded fon the UCSC table browser) to the nearest gene.

My table has the following fields:

#bin chrom chromStart chromEnd CpGname length cpgNum gcNum perCpg perGc obsExp

After a first search on th internet I tried the following code to create a Granges object and then annotate it to the nearest gene and to the preceding gene.

library(TxDb.Mmusculus.UCSC.mm10.ensGene)

genes = genes(TxDb.Mmusculus.UCSC.mm10.ensGene)

cpg<-read.table("mm10_CpG_coords.txt" ,header=TRUE, stringsAsFactors=TRUE, sep ="\t"  )
gr = makeGRangesFromDataFrame(cpg)
gr = GRanges(cpg$chrom, IRanges(cpg$chromStart, cpg$chromEnd))
head(gr)

GRanges with 6 ranges and 0 metadata columns:
      seqnames                 ranges strand
         <Rle>              <IRanges>  <Rle>
  [1]     chr1 [ 84934572,  84935054]      *
  [2]     chr1 [ 63176547,  63177427]      *
  [3]     chr1 [125435174, 125435976]      *
  [4]     chr1 [183368926, 183369826]      *
  [5]     chr1 [  3531624,   3531843]      *
  [6]     chr1 [  3670619,   3671074]      *
  ---
  seqlengths:
                   chr1                chr10 ...                 chrX                 chrY
                     NA                   NA ...                   NA                   NA


near<-genes[nearest(gr, genes)]

pre<-genes[precede(gr, genes)]

But I get the following error related to what I think is due to the Granges object I obtain, which doesn't seem to have any seqlenghts information for the different chromosomes...

Error in IRanges:::normalizeSingleBracketSubscript(i, x) :
  subscript contains NAs or out of bounds indices

Would you please help me with this issue.

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C                   LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] XVector_0.4.0                           hiAnnotator_1.0.0                      
 [3] TxDb.Mmusculus.UCSC.mm10.ensGene_2.14.0 GenomicFeatures_1.16.3                 
 [5] AnnotationDbi_1.26.1                    Biobase_2.24.0                         
 [7] GenomicRanges_1.16.4                    GenomeInfoDb_1.0.2                     
 [9] IRanges_1.22.10                         BiocGenerics_0.10.0                    

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.6           BBmisc_1.9              BiocParallel_0.6.1     
 [5] biomaRt_2.20.0          Biostrings_2.32.1       bitops_1.0-6            brew_1.0-6             
 [9] BSgenome_1.32.0         codetools_0.2-11        colorspace_1.2-6        checkmate_1.6.0        
[13] DBI_0.3.1               digest_0.6.8            fail_1.2                foreach_1.4.2          
[17] GenomicAlignments_1.0.6 ggplot2_1.0.1           grid_3.1.2              gtable_0.1.2           
[21] iterators_1.0.7         magrittr_1.5            MASS_7.3-42             munsell_0.4.2          
[25] plyr_1.8.3              proto_0.3-10            Rcpp_0.11.6             RCurl_1.95-4.7         
[29] reshape2_1.4.1          Rsamtools_1.16.1        RSQLite_1.0.0           rtracklayer_1.24.2     
[33] scales_0.2.5            sendmailR_1.2-1         stats4_3.1.2            stringi_0.5-5          
[37] stringr_1.0.0           tools_3.1.2             XML_3.98-1.3            zlibbioc_1.10.0

Best wishes

JL

annotation genomicranges bsgenome • 2.9k views

ADD COMMENT • link updated 9.4 years ago by Ou, Jianhong ★ 1.3k • written 9.4 years ago by José Luis Lavín ▴ 10

score 2 · Answer 1 · 2015-10-26

I think you want have a try of ChIPpeakAnno.

## you can load the cpg islands like Michael said,
library(rtracklayer)
session <- browserSession()
genome(session) <- "mm10"
cpg <- session[["CpG Islands"]]
## Or in case you want to import the local file to save the metadata, 
## you can have a try toGRange
library(ChIPpeakAnno)
cpg <- toGRanges("mm10_CpG_coords.txt" ,header=TRUE, stringsAsFactors=TRUE, sep ="\t", colNames=c("bin", "sapce", "start", "end", "CpGname", "length", "cpgNum", "gcNum", "perCpg", "perGc", "obsExp")
## Then annotate the cpgs by annotatePeakInBatch
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
annoData <- annoGR(TxDb.Mmusculus.UCSC.mm10.ensGene)
cpg.anno <- annotatePeakInBatch(cpg, AnnotationData=annoData, output="nearestLocation")
head(cpg.anno)
## You can also have a try EnsDb.Mmusculus.v79
library(EnsDb.Mmusculus.v79)
ensData <- annoGR(EnsDb.Mmusculus.v79)
cpg.anno1 <- annotatePeakInBatch(cpg, AnnotationData=ensData, output="nearestLocation")

Let me know if you have any question.

score 1 · Answer 2 · 2015-10-26

1

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.2 years ago

United States

Might be easier to grab that table from UCSC like this:

library(rtracklayer)
session <- browserSession()
genome(session) <- "mm10"
cpg <- session[["CpG Islands"]]

Then, you have your seqlengths.

As the error message clearly states, you have NAs in your subscript, since there are CPG islands on contigs without any genes. Since those islands are unlikely to be of interest, just restrict the islands to the standard chromosomes:

cpg <- keepStandardChromosomes(cpg)

ADD COMMENT • link 9.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Dear Michael,

Thanks for your solution, although there's a little problem. When I tried your code, R fails to set session to genome mm10. Yielding the following error:

Error in `genome<-`(`*tmp*`, value = "mm10") :
  Failed to set session genome to 'mm10'

I think this issue may be related to that described in this post: Failed to set session genome to ....

Is there a way fix this issue?

Best wishes

JL

> sessionInfo()   
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C                   LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rtracklayer_1.26.3                     TxDb.Mmusculus.UCSC.mm10.ensGene_3.0.0 GenomicFeatures_1.18.7                
 [4] AnnotationDbi_1.28.2                   Biobase_2.26.0                         GenomicAlignments_1.2.2               
 [7] Rsamtools_1.18.3                       Biostrings_2.34.1                      BiocInstaller_1.16.5                  
[10] XVector_0.6.0                          GenomicRanges_1.18.4                   GenomeInfoDb_1.2.5                    
[13] IRanges_2.0.1                          S4Vectors_0.4.0                        BiocGenerics_0.12.1                   

loaded via a namespace (and not attached):
 [1] base64enc_0.1-3    BatchJobs_1.6      BBmisc_1.9         BiocParallel_1.0.3 biomaRt_2.22.0     bitops_1.0-6      
 [7] brew_1.0-6         codetools_0.2-14   checkmate_1.6.3    DBI_0.3.1          digest_0.6.8       fail_1.3          
[13] foreach_1.4.3      iterators_1.0.8    magrittr_1.5       RCurl_1.95-4.7     RSQLite_1.0.0      sendmailR_1.2-1   
[19] stringi_1.0-1      stringr_1.0.0      tools_3.1.2        XML_3.98-1.3       zlibbioc_1.12.0

ADD REPLY • link 9.4 years ago José Luis Lavín ▴ 10

0

Entering edit mode

I updated that support post to indicate that the solution is to update your Bioconductor. You are two releases behind now.