I am trying to obtain the promoter sequences of several genes using the GRCh38 genome and the information about transcripts locations from the TxDb package. Unlike hg19 assemblies, the genome package for the GRCh38 assembly is from NCBI and the transcript package from UCSF:
TxDb.Hsapiens.UCSC.hg38.knownGene
I have tried the following without success:
> library("BSgenome.Hsapiens.NCBI.GRCh38") > library("TxDb.Hsapiens.UCSC.hg38.knownGene") > Hsapiens <- BSgenome.Hsapiens.NCBI.GRCh38 > txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # get transcript locations by gene: > chr_loc <- transcriptsBy(txdb, by = "gene") # get sequence 300 bp upstream of TSS for 10 first genes. > pro_seq <- getPromoterSeq(chr_loc[1:10], Hsapiens, upstream=300, downstream=0) Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width, : sequence chr19 not found In addition: Warning message: In .Seqinfo.mergexy(x, y) : The 2 combined objects have no sequence levels in common. (Use suppressWarnings() to suppress this warning.) # Change chromosome naming style: > seqlevelsStyle(Hsapiens) = "UCSC" # Try again: > pro_seq <- getPromoterSeq(chr_loc[1:10], Hsapiens, upstream=300, downstream=0) Error in mergeNamedAtomicVectors(genome(x), genome(y), what = c("sequence", : sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM have incompatible genomes: - in 'x': GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38, GRCh38 - in 'y': hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38, hg38 # Try change genome name: > genome(Hsapiens) <- "hg38" Error in `seqinfo<-`(`*tmp*`, value = <S4 object of class "Seqinfo">) : 'new2old' must be specified when replacing the 'seqinfo' of a BSgenome object
The same occurs if I try to modify the genome information in the TxDb object. The help page for ?genome
suggests this is the correct usage. Is there anything I am missing here? How can I get promoter sequences for the GRCh38 genome?
Incidentally, the description of the SNPlocs.Hsapiens.dbSNP141.GRCh38 package suggests there exists a `BSgenome.Hsapiens.UCSC.hg38` package that would solve this issue. However this package is not available:
> biocLite("BSgenome.Hsapiens.UCSC.hg38", type="source") BioC_mirror: http://bioconductor.org Using Bioconductor version 3.0 (BiocInstaller 1.16.1), R version 3.1.2. Installing package(s) 'BSgenome.Hsapiens.UCSC.hg38' Warning message: package ‘BSgenome.Hsapiens.UCSC.hg38’ is not available (for R version 3.1.2)
The output of sessionInfo()
> sessionInfo() R version 3.1.2 (2014-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) locale: [1] C/UTF-8/C/C/C/C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.0.0 [2] GenomicFeatures_1.18.3 [3] AnnotationDbi_1.28.1 [4] Biobase_2.26.0 [5] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 [6] BSgenome_1.34.0 [7] rtracklayer_1.26.2 [8] Biostrings_2.34.1 [9] XVector_0.6.0 [10] GenomicRanges_1.18.3 [11] GenomeInfoDb_1.2.3 [12] IRanges_2.0.1 [13] S4Vectors_0.4.0 [14] BiocGenerics_0.12.1 loaded via a namespace (and not attached): [1] BBmisc_1.8 BatchJobs_1.5 BiocParallel_1.0.0 [4] DBI_0.3.1 GenomicAlignments_1.2.1 RCurl_1.95-4.5 [7] RSQLite_1.0.0 Rsamtools_1.18.2 XML_3.98-1.1 [10] base64enc_0.1-2 biomaRt_2.22.0 bitops_1.0-6 [13] brew_1.0-6 checkmate_1.5.1 codetools_0.2-9 [16] digest_0.6.6 fail_1.2 foreach_1.4.2 [19] iterators_1.0.7 sendmailR_1.2-1 stringr_0.6.2 [22] tools_3.1.2 zlibbioc_1.12.0
Thanks Herve, this works.