How to add chromosomes to a BSgenome ?
1
0
Entering edit mode
@charles-plessy-7857
Last seen 14 months ago
Japan

I have aligned some transcriptome reads on a custom genome made of hg38 supplemented by viral genomes.  Unfortunately, in some downstream analyses (using CAGEr), the reads aligning on the viral sequences are ignored because the chromosome name is not in `seqnames(BSgenome.Hsapiens.UCSC.hg38)`.  I am thinking of adding the viral sequences to the BSgenome object at run time, but did not find how.  Can somebody give me a hint ?

bsgenome virus • 2.2k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 11 minutes ago
Seattle, WA, United States

Hi,

You can't add chromosomes to a BSgenome object at the moment. What you could do instead is forge a new BSgenome data package that contains all the original sequences plus the additional sequences. For this you will first need to manage to put all the sequences on disk (in a 2bit or FASTA file). Then see the BSgenomeForge vignette in the BSgenome package for how to forge the BSgenome data package using this file.

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Many thanks !  After studying the vignette and the source code, I ended up forging a package with the following Rmarkdown template.  Comments from you and others are welcome!

---
title: "BSgenome object containing human and HPV genome sequences."
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r get_sequences}
setwd("seqs_srcdir")

system("wget --quiet --continue http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz")
system("tar xvf hg38.chromFa.tar.gz --strip-components 2")
unlink("hg38.chromFa.tar.gz")

system("curl --silent 'http://getentry.ddbj.nig.ac.jp/getentry/na/K02718.1/?format=fasta'| sed 's/K02718|K02718.1/HPV16REF.1/' > HPV16REF.1.fa")

system("curl --silent 'http://getentry.ddbj.nig.ac.jp/getentry/na/X05015.1/?format=fasta' | sed 's/X05015|X05015.1/HPV18REF.1/' > HPV18REF.1.fa")

setwd("..")
```

```{r forge_package}
BSseed <- list(
  Package           = "BSgenome.Hsapiens.plusHPV.hg38"
, Title             = "Full genome sequences for Homo sapiens (UCSC version hg38), plus HPV"
, Description       = "Full genome sequences for Homo sapiens (Human) as provided by UCSC (hg38, Dec. 2013), plus HPV sequences provided by the Papillomavirus Episteme, and stored in Biostrings objects."
, Version           = "38.8.2016-09-05.1"
, Suggests          = "TxDb.Hsapiens.UCSC.hg38.knownGene"
, organism          = "Homo sapiens"
, common_name       = "Human"
, provider          = "custom"
, provider_version  = "hg38+HPV"
, release_date      = "Sep. 2016"
, release_name      = "Genome Reference Consortium GRCh38, plus HPV"
, source_url        = "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/ and https://pave.niaid.nih.gov/"
, organism_biocview = "Homo_sapiens"
, seqnames          = "sub('.fa', '', list.files('seqs_srcdir/', pattern = '*.fa'))"

, circ_seqs         = "c('chrM', sub('.fa', '', list.files('seqs_srcdir/', pattern = 'HPV.*.fa')))"
, BSgenomeObjname   = "Hsapiens"
)

unlink("BSgenome.Hsapiens.plusHPV.hg38", recursive = TRUE)

BSgenome::forgeBSgenomeDataPkg( BSseed
                    , seqs_srcdir = "seqs_srcdir"
                    , destdir = "."
                    , verbose = TRUE)
```

Minor edit: added ".1" in HPV reference file names for consistency with our internal projects.  To readers finding this through search engines, please note that K02718 and X05015 were downloaded from DDBJ for convenience of the example, but updated HPV16 and 18 sequences are available in PaVE.

ADD REPLY
0
Entering edit mode

Glad to see you discovered you could replace the seed file with an ordinary list. Nicely done!

H.

ADD REPLY

Login before adding your answer.

Traffic: 600 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6