Bioconductor,
I am working on an RNA sequencing experiment on Arabidopsis
(Illumina,
100bp, single end), using tophat to map the reads, and R for the rest.
Many (9 of 12) libraries worked with out error. Three of the
libraries
had no `accepted_hit.bam` file, while they did have a junctions.bed,
and
had an error of "Error: could not convert to BAM with samtools". in
the
standard error output.
Additionally in tophat_out/log/accepted_hits_sam_to_bam.log
[samopen] SAM header is present: 7 sequences.
Parse error at line 19971338: CIGAR and sequence length are
inconsistent
Obviously, somewhere the CIGAR or sequence lengths were altered and
no
longer 'mesh'.
I run :
library(ShortRead)
reads <- readFastq("myFile")
treads <- trimEnds(reads, a = "5") #trim at a quality score of 20
nreads <- treads[width(reads) > 21] # keep only reads longer than
21 bp
writeFastq(nreads, "myNewFile.fastq")
then a 'normal' (few options changed) tophat call
tophat -p 4 -o outdir -G my.gtf BowtieIndex
myNewFile.fastq
Any ideas on what went wrong, and why it didn't work some of the time?
Should I just rerun the trimming scripts and see if they map
afterwards?
Thanks in advance!
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods
[8] base
other attached packages:
[1] ShortRead_1.18.0 latticeExtra_0.6-24 RColorBrewer_1.0-5
[4] Rsamtools_1.12.3 lattice_0.20-15 Biostrings_2.28.0
[7] GenomicRanges_1.12.3 IRanges_1.18.1 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] Biobase_2.20.0 bitops_1.0-5 grid_3.0.0 hwriter_1.3
stats4_3.0.0
[6] zlibbioc_1.6.0
--
Sam McInturf
[[alternative HTML version deleted]]
On 05/29/2013 04:02 PM, Sam McInturf wrote:
> Bioconductor,
> I am working on an RNA sequencing experiment on Arabidopsis
(Illumina,
> 100bp, single end), using tophat to map the reads, and R for the
rest.
> Many (9 of 12) libraries worked with out error. Three of the
libraries
> had no `accepted_hit.bam` file, while they did have a junctions.bed,
and
> had an error of "Error: could not convert to BAM with samtools". in
the
> standard error output.
>
> Additionally in tophat_out/log/accepted_hits_sam_to_bam.log
> [samopen] SAM header is present: 7 sequences.
> Parse error at line 19971338: CIGAR and sequence length are
inconsistent
Hi Sam -- maybe you can discover what this line of the sam file looks
like, and
compare the sequence there to the sequence in the fastq file, both
before and
after trimming. Martin
>
> Obviously, somewhere the CIGAR or sequence lengths were altered and
no
> longer 'mesh'.
> I run :
>
> library(ShortRead)
> reads <- readFastq("myFile")
> treads <- trimEnds(reads, a = "5") #trim at a quality score of
20
> nreads <- treads[width(reads) > 21] # keep only reads longer than
21 bp
> writeFastq(nreads, "myNewFile.fastq")
>
> then a 'normal' (few options changed) tophat call
> tophat -p 4 -o outdir -G my.gtf BowtieIndex
myNewFile.fastq
>
> Any ideas on what went wrong, and why it didn't work some of the
time?
> Should I just rerun the trimming scripts and see if they map
afterwards?
>
>
> Thanks in advance!
>
>> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods
> [8] base
>
> other attached packages:
> [1] ShortRead_1.18.0 latticeExtra_0.6-24 RColorBrewer_1.0-5
> [4] Rsamtools_1.12.3 lattice_0.20-15 Biostrings_2.28.0
> [7] GenomicRanges_1.12.3 IRanges_1.18.1 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.20.0 bitops_1.0-5 grid_3.0.0 hwriter_1.3
> stats4_3.0.0
> [6] zlibbioc_1.6.0
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793