Question

CRISPRseek with (rough) Draft Genome

0

Entering edit mode

rajewski23 • 0

@rajewski23-10883

Last seen 8.8 years ago

I'm trying to design a CRISPR gRNA and search for off-targets with CRISPRseek.

My organism has a very rough draft genome that is organized into ~160,000 scaffolds. I've been able to forge a BSgenome package where I placed a single scaffold into seqnames and all of the scaffolds into the mseqnames category in the seed file. This works fine for creating the BSgenome package. When I run offTargetAnalysis() though, it seems that CRISPRseek only deals with the single BSgenome seq sequence and ignores the scaffolds under mseq.

I'll post the code and error produced from the offTargetAnalysis() along with session info below.

I'm wondering if there is a way to:

get offTargetAnalysis() to not ignore these mseq sequences,
make a BSgenome object with 160,000 seqnames in the seedfile, or
perhaps run offTargetAnalysis() on a DNAStringSet object and bypass the BSgenome step entirely?

Thanks for any help,

Alex Rajewski

> offTargetAnalysis("~/Downloads/Target.fa", findgRNAs = TRUE, findgRNAsWithREcutOnly = FALSE, enable.multicore = TRUE, BSgenomeName = BSgenome.Ntom.SolGenomics.1.0, outputDir = "~/Downloads/CRISPRseek", overwrite = TRUE, annotateExon = FALSE)
Validating input ...
Searching for gRNAs ...
>>> Finding all hits in sequence Ntom_ASAG01000001.1.fa ...
>>> DONE searching
Building feature vectors for scoring ...
Error in buildFeatureVectorForScoring(hits = hits, canonical.PAM = PAM,  : 
  Empty hits!
In addition: Warning message:
In searchHits(gRNAs = gRNAs, PAM = PAM.pattern, BSgenomeName = BSgenomeName,  :
  No matching found, please check your input sequence, and make
            sure you are using the right genome. You can also alter your 
            search criteria such as increasing max.mismatch!

> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] CRISPRseek_1.10.3                 BSgenome.Ntom.SolGenomics.1.0_1.0
 [3] BSgenome_1.38.0                   rtracklayer_1.30.4               
 [5] Biostrings_2.38.4                 XVector_0.10.0                   
 [7] GenomicRanges_1.22.4              GenomeInfoDb_1.6.3               
 [9] IRanges_2.4.8                     S4Vectors_0.8.11                 
[11] BiocGenerics_0.16.1              

loaded via a namespace (and not attached):
 [1] zlibbioc_1.16.0            GenomicAlignments_1.6.3    BiocParallel_1.4.3        
 [4] tools_3.2.4                SummarizedExperiment_1.0.2 data.table_1.9.6          
 [7] Biobase_2.30.0             lambda.r_1.1.7             futile.logger_1.4.1       
[10] seqinr_3.1-5               ade4_1.7-4                 futile.options_1.0.0      
[13] bitops_1.0-6               RCurl_1.95-4.8             BiocInstaller_1.20.3      
[16] Rsamtools_1.22.0           XML_3.98-1.4               chron_2.3-47

crisprseek grna searching bsgenome forge offtarget analysis • 1.5k views

ADD COMMENT • link updated 8.8 years ago by Hervé Pagès 16k • written 8.8 years ago by rajewski23 • 0

1

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 17 months ago

United States

Alex, Could you please try compare2Sequences workflow by setting searchDirection="1to2", inputFile1Path = "gRNA.fa", inputFile2Path = "RoughDraftGenome.fa"? Thanks! Regarding making a BSgenome object with 160,000 seqnames in the seed file, Herve might have a good suggestion for this. Best, Julie

ADD COMMENT • link updated 8.8 years ago by Dan Tenenbaum ★ 8.2k • written 8.8 years ago by Julie Zhu ★ 4.3k

score 2 · Accepted Answer · 2016-06-13

Hi Alex,

You can forge a BSgenome package with 160,000 seqnames but in that case it's better to not list the seqnames in the seed file (this might actually not work due to some limitations in the read.dcf() function used internally to read the seed file).

As explained in the BSgenomeForge vignette, the seqnames field is only needed when the sequences are stored in individual files (i.e. 1 file per sequence). If all the sequences are stored in a single file then the seqnames field is optional. Note that it can still be used if you want to select a subset of sequences and/or re-order them.

The preferred format for storing all the sequences together is the 2bit format from UCSC. It outperforms compressed indexed FASTA (aka razip format, fa.rz extension) in all aspects (furthermore people have reported issues when using razip on Windows). If your sequences are not in a 2bit file, just import them with import() (from the rtracklayer package) and export the DNAStringSet object as 2bit with export(x, "some_file_name.2bit"). Then no need to put a seqnames field in your seed file but you need to add this line:

seqfile_name: some_file_name.2bit

The seqfile_name field should specify the name of the file only, not the path to it. The path to the folder containing the file should be specified thru the seqs_srcdir field, as usual. See the BSgenomeForge vignette for a description of the seqs_srcdir and seqfile_name fields.

Many seed files in the extdata/GentlemanLab folder in the BSgenome package do the above. See for example BSgenome.Btaurus.UCSC.bosTau8-seed or BSgenome.Drerio.UCSC.danRer10-seed.

Hope this helps,

H.