Hello,
I am using the cleanUpdTSeq package to analyze a set of potential 3' UTRs in the parasite Giardia lamblia. I am working from a dataframe with the coordinates of putative 3' UTRs, as per the package requirements. The issue arises with the function "buildFeatureVector" which adds the upstream and downstream sequences around the coordinates in the data frame. Here is the code I am using and the error message:
> peaks_grange <- BED2GRangesSeq(peaks, header = TRUE, withSeq = FALSE)
> peaks_feature <- buildFeatureVector(peaks_grange, BSgenomeName = Glamblia, sampleType = 'unknkown', method = 'NaiveBayes', fetchSeq = TRUE)
Error in gsub("^chr", chr, ignore.case = TRUE) : argument "x" is missing, with no default
I believe the issue comes from a compatibility issue with the Glamblia genome. I prepared the package for use with BSgenome by following the instructions here: https://bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf
The genome is forged from a single fasta file which I named gl34.fa
for the species name and genome version. The instructions speak of assembling a genome from multiple files; one for each chromosome, and I believe the issue comes from my single file containing all the data. Instead of having a file name for each chromosome, as follows (from the genome used in package examples) :
> seqnames(Drerio)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chr23" "chr24" "chr25"
[26] "chrM"
I get the following:
> seqnames(Glamblia)
[1] "gl34"
The genome is only available as a single fasta file (no twoBit format) which contains all 5 chromosomes (GLCHR01 through GLCHR05) and a few extra fragments (ex: AACB02000013_frag).
Below is the seed file used to forge/build the genome:
Package: BSgenome.Glamblia.GiardiaDB.35
Title: Genome sequence for Giardia lamblia (GiardiaDB version 35)
Version: 0.1.0
License: GPL
Description: BSGenome compatible Giardia genome for use in CleanupDTSeq package
Author: D.Y.Bilodeau
Maintainer: D.Y.Bilodeau <danielle.bilodeau@ucdenver.edu>
organism: Giardia lamblia
common_name: Giardia
provider: GiardiaDB
provider_version: 35
release_date: 2017-08-15
release_name: release-35
source_url: http://giardiadb.org/common/downloads/release-35/GintestinalisAssemblageAWB/fasta/data/
organism_biocview: Giardia_lamblia
BSgenomeObjname: Glamblia
seqs_srcdir: /Users/DanB/R/BSGenome
seqnames: paste("gl", "34", sep = "")
Is there a way to forge the genome that would make it compatible with this package? There also seems to be an issue with the use of gsub above where there is no x argument (I believe this line comes from the "getUpstreamSequence" function.
Below is my session info.
Thanks!
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome.Glamblia.GiardiaDB.35_0.1.0 cleanUpdTSeq_1.20.0 e1071_1.7-0.1 seqinr_3.4-5 BSgenome.Drerio.UCSC.danRer7_1.4.0
[6] BSgenome_1.50.0 rtracklayer_1.42.1 Biostrings_2.50.2 XVector_0.22.0 GenomicRanges_1.34.0
[11] GenomeInfoDb_1.18.1 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] knitr_1.21 MASS_7.3-51.1 zlibbioc_1.28.0 GenomicAlignments_1.18.1 BiocParallel_1.16.5 lattice_0.20-38 tools_3.5.2
[8] SummarizedExperiment_1.12.0 grid_3.5.2 Biobase_2.42.0 xfun_0.4 class_7.3-15 matrixStats_0.54.0 ade4_1.7-13
[15] yaml_2.2.0 Matrix_1.2-15 GenomeInfoDbData_1.2.0 bitops_1.0-6 RCurl_1.95-4.11 DelayedArray_0.8.0 compiler_3.5.2
Hi Julie,
Here's the output:
Danielle,
If you look at the seqnames column, you will see that your peak ranges are located in different chromosomes such as GLCHR03 and GL50803. In order to obtain sequence from these different chromosomes, the BSgenome needs to have the matched chromosome name. Dose it make sense? Thanks!
Best regards,
Julie
Hi Julie,
It definitely makes sense, but I'm not sure why the BSgenome wouldn't have those chromosome names. The seqnames in the GRange object were taken directly from the same genome I used to forge the BSgenome.
I suppose my question then is, where does the BSgenome obtain the chromosome names from? Does it read them directly from the fasta file provided and could an issue with the fasta file itself be causing this?
Thanks,
Danielle
Danielle,
I believe your guess is correct, i.e., it reads the chromosome names directly from the fasta file provided. There are probably issues with the fasta file itself. You can open the fasta file to see what is the first line.
Alternatively, you can use grep to find all the chromosome names in the fasta file.
more yourfastafile |grep ">"
Best,
Julie
Thanks for the grep tip! Here's what I get when I look for ">" in my fasta file (I removed the middle lines to make it shorter):
It looks like all the chromosome names are there. If I understand correctly, the BSgenome should then have all the seqnames, correct?
Thanks for continuing to work with me on this!
Danielle
Danielle,
You will need to remove anything after | and any spaces before that.
For example, use GLCHR01 to replace “
GLCHR01 | organism=GiardiaAssemblageAisolateWB | version=2013-02-08 | length=1491578 | SO=chromosome
=2013-02-08 | length=863 | SO=supercontig”
Best,
Julie
Hi Julie,
Thanks for your patience while I made the change you recommended. I now have a new fasta file with only the chromosome name in each header. Here's an excerpt:
I used this new fasta file to generate a custom BSgenome but got the following output:
It looks like the function recognized that there were 211 seqnames, but only used the first one. I tried installing the package anyway just in case, but I still get the same error:
And here's the output looking at the seqnames for the newly forged genome:
I am now wondering if the forgeBSgenome function gets the seqnames from the filenames, rather than from the fasta file itself. In that case, how do I communicate the seqnames to it if I only have the one file?
Thanks again for your help!
Best,
Danielle