Question

Unable to generate feature vector with custom genome in cleanUpdTSeq

0

Entering edit mode

danielle.bilodeau • 0

@daniellebilodeau-20102

Last seen 6.1 years ago

Hello,

I am using the cleanUpdTSeq package to analyze a set of potential 3' UTRs in the parasite Giardia lamblia. I am working from a dataframe with the coordinates of putative 3' UTRs, as per the package requirements. The issue arises with the function "buildFeatureVector" which adds the upstream and downstream sequences around the coordinates in the data frame. Here is the code I am using and the error message:

> peaks_grange <- BED2GRangesSeq(peaks, header = TRUE, withSeq = FALSE)
> peaks_feature <- buildFeatureVector(peaks_grange, BSgenomeName = Glamblia, sampleType = 'unknkown', method = 'NaiveBayes', fetchSeq = TRUE)
Error in gsub("^chr", chr, ignore.case = TRUE) : argument "x" is missing, with no default

I believe the issue comes from a compatibility issue with the Glamblia genome. I prepared the package for use with BSgenome by following the instructions here: https://bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf

The genome is forged from a single fasta file which I named gl34.fa for the species name and genome version. The instructions speak of assembling a genome from multiple files; one for each chromosome, and I believe the issue comes from my single file containing all the data. Instead of having a file name for each chromosome, as follows (from the genome used in package examples) :

> seqnames(Drerio)
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chr23" "chr24" "chr25"
[26] "chrM"

I get the following:

> seqnames(Glamblia)
[1] "gl34"

The genome is only available as a single fasta file (no twoBit format) which contains all 5 chromosomes (GLCHR01 through GLCHR05) and a few extra fragments (ex: AACB02000013_frag).

Below is the seed file used to forge/build the genome:

Package: BSgenome.Glamblia.GiardiaDB.35
Title: Genome sequence for Giardia lamblia (GiardiaDB version 35)
Version: 0.1.0
License: GPL
Description: BSGenome compatible Giardia genome for use in CleanupDTSeq package
Author: D.Y.Bilodeau
Maintainer: D.Y.Bilodeau <danielle.bilodeau@ucdenver.edu>
organism: Giardia lamblia
common_name: Giardia
provider: GiardiaDB
provider_version: 35
release_date: 2017-08-15
release_name: release-35
source_url: http://giardiadb.org/common/downloads/release-35/GintestinalisAssemblageAWB/fasta/data/
organism_biocview: Giardia_lamblia
BSgenomeObjname: Glamblia
seqs_srcdir: /Users/DanB/R/BSGenome
seqnames: paste("gl", "34", sep = "")

Is there a way to forge the genome that would make it compatible with this package? There also seems to be an issue with the use of gsub above where there is no x argument (I believe this line comes from the "getUpstreamSequence" function.

Below is my session info.

Thanks!

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Glamblia.GiardiaDB.35_0.1.0 cleanUpdTSeq_1.20.0                  e1071_1.7-0.1                        seqinr_3.4-5                         BSgenome.Drerio.UCSC.danRer7_1.4.0  
 [6] BSgenome_1.50.0                      rtracklayer_1.42.1                   Biostrings_2.50.2                    XVector_0.22.0                       GenomicRanges_1.34.0                
[11] GenomeInfoDb_1.18.1                  IRanges_2.16.0                       S4Vectors_0.20.1                     BiocGenerics_0.28.0                 

loaded via a namespace (and not attached):
 [1] knitr_1.21                  MASS_7.3-51.1               zlibbioc_1.28.0             GenomicAlignments_1.18.1    BiocParallel_1.16.5         lattice_0.20-38             tools_3.5.2                
 [8] SummarizedExperiment_1.12.0 grid_3.5.2                  Biobase_2.42.0              xfun_0.4                    class_7.3-15                matrixStats_0.54.0          ade4_1.7-13                
[15] yaml_2.2.0                  Matrix_1.2-15               GenomeInfoDbData_1.2.0      bitops_1.0-6                RCurl_1.95-4.11             DelayedArray_0.8.0          compiler_3.5.2

cleanUpdTSeq BSgenome BSgenomeForge • 2.0k views

ADD COMMENT • link updated 6.1 years ago by Julie Zhu ★ 4.3k • written 6.1 years ago by danielle.bilodeau • 0

score 0 · Answer 1 · 2019-03-05

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 17 months ago

United States

Danielle, Could you please type the following and post the output? I would like to see the chromosome names in the peaks_grange. Thanks!

peaks_grange

Best regards, Julie

ADD COMMENT • link 6.1 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

Hi Julie,

Here's the output:

> peaks_grange
GRanges object with 3850 ranges and 1 metadata column:
                     seqnames    ranges strand |     score
                        <Rle> <IRanges>  <Rle> | <numeric>
    GL50803_10013_42  GLCHR03   1211025      + |         1
   GL50803_10014_486  GLCHR03   1212644      + |         1
   GL50803_10027_239  GLCHR05   1202275      + |         1
   GL50803_10034_900  GLCHR05   1198559      + |         1
   GL50803_10034_909  GLCHR05   1198568      + |         1
                 ...      ...       ...    ... .       ...
  GL50803_99576_2616 CH991778     13089      + |         1
  GL50803_99670_1289  GLCHR05   2084485      + |         1
    GL50803_99712_52  GLCHR01    177566      + |         1
   GL50803_9979_4092  GLCHR03   1202316      + |         1
   GL50803_9979_4442  GLCHR03   1202666      + |         1
  -------
  seqinfo: 16 sequences from an unspecified genome; no seqlengths

ADD REPLY • link 6.1 years ago danielle.bilodeau • 0

0

Entering edit mode

Danielle,

If you look at the seqnames column, you will see that your peak ranges are located in different chromosomes such as GLCHR03 and GL50803. In order to obtain sequence from these different chromosomes, the BSgenome needs to have the matched chromosome name. Dose it make sense? Thanks!

Best regards,

Julie

ADD REPLY • link 6.1 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

Hi Julie,

It definitely makes sense, but I'm not sure why the BSgenome wouldn't have those chromosome names. The seqnames in the GRange object were taken directly from the same genome I used to forge the BSgenome.

I suppose my question then is, where does the BSgenome obtain the chromosome names from? Does it read them directly from the fasta file provided and could an issue with the fasta file itself be causing this?

Thanks,

Danielle

ADD REPLY • link 6.1 years ago danielle.bilodeau • 0

0

Entering edit mode

Danielle,

I believe your guess is correct, i.e., it reads the chromosome names directly from the fasta file provided. There are probably issues with the fasta file itself. You can open the fasta file to see what is the first line.

Alternatively, you can use grep to find all the chromosome names in the fasta file.

more yourfastafile |grep ">"

Best,

Julie

ADD REPLY • link 6.1 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

Thanks for the grep tip! Here's what I get when I look for ">" in my fasta file (I removed the middle lines to make it shorter):

$ grep ">" gl34.fa 
>GLCHR01 | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=1491578 | SO=chromosome
>GLCHR02 | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=1519671 | SO=chromosome
>GLCHR03 | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=1958359 | SO=chromosome
>GLCHR04 | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=2762469 | SO=chromosome
>GLCHR05 | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=4471437 | SO=chromosome
>AACB02000013_frag | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=56441 | SO=contig
>AACB02000029_frag | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=19973 | SO=contig
>AACB02000066 | organism=Giardia_Assemblage_A_isolate_WB | version=2010-03-10 | length=41384 | SO=contig
[....]
>CH991849 | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=1039 | SO=supercontig
>CH991850 | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=863 | SO=supercontig
>CH991851 | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=1165 | SO=supercontig
>CH991852 | organism=Giardia_Assemblage_A_isolate_WB | version=2013-02-08 | length=1140 | SO=supercontig

It looks like all the chromosome names are there. If I understand correctly, the BSgenome should then have all the seqnames, correct?

Thanks for continuing to work with me on this!

Danielle

ADD REPLY • link 6.1 years ago danielle.bilodeau • 0

0

Entering edit mode

Danielle,

You will need to remove anything after | and any spaces before that.

For example, use GLCHR01 to replace “

GLCHR01 | organism=GiardiaAssemblageAisolateWB | version=2013-02-08 | length=1491578 | SO=chromosome

=2013-02-08 | length=863 | SO=supercontig”

Best,

Julie

ADD REPLY • link 6.1 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

Hi Julie,

Thanks for your patience while I made the change you recommended. I now have a new fasta file with only the chromosome name in each header. Here's an excerpt:

$ grep ">" glamblia34.fa 
>GLCHR01
>GLCHR02
>GLCHR03
>GLCHR04
>GLCHR05
>AACB02000013
>AACB02000029

I used this new fasta file to generate a custom BSgenome but got the following output:

>forgeBSgenomeDataPkg('/Users/DanB/R/BSGenome/modified_BSgenome/BSgenome.Glamblia.GiardiaDB.35.new-seed')
Creating package in ./BSgenome.GlambliaN.GiardiaDB.35.new 
Loading 'glamblia34' sequence from FASTA file '/Users/DanB/R/BSGenome/modified_BSgenome/glamblia34.fa' ... DONE
Writing all sequences to './BSgenome.GlambliaN.GiardiaDB.35.new/inst/extdata/single_sequences.2bit' ... DONE
Warning message:
In .forgeTwobitFileFromFastaFiles(seqnames, prefix, suffix, seqs_srcdir,  :
  file contains 211 sequences, using the first sequence only

It looks like the function recognized that there were 211 seqnames, but only used the first one. I tried installing the package anyway just in case, but I still get the same error:

> peaks_grange <- BED2GRangesSeq(peakses, header = TRUE, withSeq = FALSE)
> peaks_feature <- buildFeatureVector(peaks_grange, BSgenomeName = GlambliaN, sampleType = 'unknkown', method = 'NaiveBayes', fetchSeq = TRUE)
Not all the sequence names are in the genome.
Try to convert.
Error in gsub("^chr", chr, ignore.case = TRUE) : 
  argument "x" is missing, with no default

And here's the output looking at the seqnames for the newly forged genome:

>seqnames(GlambliaN)
[1] "glamblia34"

I am now wondering if the forgeBSgenome function gets the seqnames from the filenames, rather than from the fasta file itself. In that case, how do I communicate the seqnames to it if I only have the one file?

Thanks again for your help!

Best,

Danielle

ADD REPLY • link 6.1 years ago danielle.bilodeau • 0

score 0 · Answer 2 · 2019-03-05

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 17 months ago

United States

Danielle, Could you please type the following and post the output? I would like to see the chromosome names in the peaks_grange. Thanks!

peaks_grange

Best regards, Julie

ADD COMMENT • link 6.1 years ago Julie Zhu ★ 4.3k