Dear all,
I'm looking for a Bioconductor/R package that enables me to annotate a table of CpG island coordinates (downloaded fon the UCSC table browser) to the nearest gene.
My table has the following fields:
#bin chrom chromStart chromEnd CpGname length cpgNum gcNum perCpg perGc obsExp
After a first search on th internet I tried the following code to create a Granges object and then annotate it to the nearest gene and to the preceding gene.
library(TxDb.Mmusculus.UCSC.mm10.ensGene) genes = genes(TxDb.Mmusculus.UCSC.mm10.ensGene) cpg<-read.table("mm10_CpG_coords.txt" ,header=TRUE, stringsAsFactors=TRUE, sep ="\t" ) gr = makeGRangesFromDataFrame(cpg) gr = GRanges(cpg$chrom, IRanges(cpg$chromStart, cpg$chromEnd)) head(gr) GRanges with 6 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [ 84934572, 84935054] * [2] chr1 [ 63176547, 63177427] * [3] chr1 [125435174, 125435976] * [4] chr1 [183368926, 183369826] * [5] chr1 [ 3531624, 3531843] * [6] chr1 [ 3670619, 3671074] * --- seqlengths: chr1 chr10 ... chrX chrY NA NA ... NA NA near<-genes[nearest(gr, genes)] pre<-genes[precede(gr, genes)]
But I get the following error related to what I think is due to the Granges object I obtain, which doesn't seem to have any seqlenghts information for the different chromosomes...
Error in IRanges:::normalizeSingleBracketSubscript(i, x) : subscript contains NAs or out of bounds indices
Would you please help me with this issue.
> sessionInfo() R version 3.1.2 (2014-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 [4] LC_NUMERIC=C LC_TIME=Spanish_Spain.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] XVector_0.4.0 hiAnnotator_1.0.0 [3] TxDb.Mmusculus.UCSC.mm10.ensGene_2.14.0 GenomicFeatures_1.16.3 [5] AnnotationDbi_1.26.1 Biobase_2.24.0 [7] GenomicRanges_1.16.4 GenomeInfoDb_1.0.2 [9] IRanges_1.22.10 BiocGenerics_0.10.0 loaded via a namespace (and not attached): [1] base64enc_0.1-2 BatchJobs_1.6 BBmisc_1.9 BiocParallel_0.6.1 [5] biomaRt_2.20.0 Biostrings_2.32.1 bitops_1.0-6 brew_1.0-6 [9] BSgenome_1.32.0 codetools_0.2-11 colorspace_1.2-6 checkmate_1.6.0 [13] DBI_0.3.1 digest_0.6.8 fail_1.2 foreach_1.4.2 [17] GenomicAlignments_1.0.6 ggplot2_1.0.1 grid_3.1.2 gtable_0.1.2 [21] iterators_1.0.7 magrittr_1.5 MASS_7.3-42 munsell_0.4.2 [25] plyr_1.8.3 proto_0.3-10 Rcpp_0.11.6 RCurl_1.95-4.7 [29] reshape2_1.4.1 Rsamtools_1.16.1 RSQLite_1.0.0 rtracklayer_1.24.2 [33] scales_0.2.5 sendmailR_1.2-1 stats4_3.1.2 stringi_0.5-5 [37] stringr_1.0.0 tools_3.1.2 XML_3.98-1.3 zlibbioc_1.10.0
Best wishes
JL
Thank you Ou, Jianhong, your solution works very well and is easy to implement!
Best wishes
JL