Question

How to add chromosomes to a BSgenome ?

0

Entering edit mode

Charles Plessy ▴ 180

@charles-plessy-7857

Last seen 15 months ago

Japan

I have aligned some transcriptome reads on a custom genome made of hg38 supplemented by viral genomes. Unfortunately, in some downstream analyses (using CAGEr), the reads aligning on the viral sequences are ignored because the chromosome name is not in `seqnames(BSgenome.Hsapiens.UCSC.hg38)`. I am thinking of adding the viral sequences to the BSgenome object at run time, but did not find how. Can somebody give me a hint ?

bsgenome virus • 2.2k views

ADD COMMENT • link updated 8.3 years ago by Hervé Pagès 16k • written 8.3 years ago by Charles Plessy ▴ 180

score 2 · Accepted Answer · 2016-09-02

2

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 7 days ago

Seattle, WA, United States

Hi,

You can't add chromosomes to a BSgenome object at the moment. What you could do instead is forge a new BSgenome data package that contains all the original sequences plus the additional sequences. For this you will first need to manage to put all the sequences on disk (in a 2bit or FASTA file). Then see the BSgenomeForge vignette in the BSgenome package for how to forge the BSgenome data package using this file.

Hope this helps,

H.

ADD COMMENT • link 8.3 years ago Hervé Pagès 16k

0

Entering edit mode

Many thanks ! After studying the vignette and the source code, I ended up forging a package with the following Rmarkdown template. Comments from you and others are welcome!

---
title: "BSgenome object containing human and HPV genome sequences."
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r get_sequences}
setwd("seqs_srcdir")

system("wget --quiet --continue http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz")
system("tar xvf hg38.chromFa.tar.gz --strip-components 2")
unlink("hg38.chromFa.tar.gz")

system("curl --silent 'http://getentry.ddbj.nig.ac.jp/getentry/na/K02718.1/?format=fasta'| sed 's/K02718|K02718.1/HPV16REF.1/' > HPV16REF.1.fa")

system("curl --silent 'http://getentry.ddbj.nig.ac.jp/getentry/na/X05015.1/?format=fasta' | sed 's/X05015|X05015.1/HPV18REF.1/' > HPV18REF.1.fa")

setwd("..")
```

```{r forge_package}
BSseed <- list(
  Package           = "BSgenome.Hsapiens.plusHPV.hg38"
, Title             = "Full genome sequences for Homo sapiens (UCSC version hg38), plus HPV"
, Description       = "Full genome sequences for Homo sapiens (Human) as provided by UCSC (hg38, Dec. 2013), plus HPV sequences provided by the Papillomavirus Episteme, and stored in Biostrings objects."
, Version           = "38.8.2016-09-05.1"
, Suggests          = "TxDb.Hsapiens.UCSC.hg38.knownGene"
, organism          = "Homo sapiens"
, common_name       = "Human"
, provider          = "custom"
, provider_version  = "hg38+HPV"
, release_date      = "Sep. 2016"
, release_name      = "Genome Reference Consortium GRCh38, plus HPV"
, source_url        = "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/ and https://pave.niaid.nih.gov/"
, organism_biocview = "Homo_sapiens"
, seqnames          = "sub('.fa', '', list.files('seqs_srcdir/', pattern = '*.fa'))"

, circ_seqs         = "c('chrM', sub('.fa', '', list.files('seqs_srcdir/', pattern = 'HPV.*.fa')))"
, BSgenomeObjname   = "Hsapiens"
)

unlink("BSgenome.Hsapiens.plusHPV.hg38", recursive = TRUE)

BSgenome::forgeBSgenomeDataPkg( BSseed
                    , seqs_srcdir = "seqs_srcdir"
                    , destdir = "."
                    , verbose = TRUE)
```

Minor edit: added ".1" in HPV reference file names for consistency with our internal projects. To readers finding this through search engines, please note that K02718 and X05015 were downloaded from DDBJ for convenience of the example, but updated HPV16 and 18 sequences are available in PaVE.

ADD REPLY • link 8.3 years ago Charles Plessy ▴ 180

0

Entering edit mode

Glad to see you discovered you could replace the seed file with an ordinary list. Nicely done!

H.

ADD REPLY • link 8.3 years ago Hervé Pagès 16k