I'm working with some Solexa/Illumina data in ShortRead for the first
time and have a couple of questions relating to QA:
1. Memory requirements: My data consists of 7 s_N_export.txt files, each
containing 10-20 million aligned reads. If I try to run qa() over the
whole directory, my machine rapidly grinds to a halt. Tackling each file
individually keeps the machine running, but takes more than an hour per
file. The ShortRead vignette says evaluating a single lane can take
'several minutes', so I'm wondering if anyone can offer any clues as to
why I'm struggling so much. The machine in question has 6GB of RAM; do I
just need more?
2. Read distribution: The QA results I'm getting for the 'read
distribution' section don't quite look like those presented in the
example ShortRead Solexa QA report. My interpretation is that this is
because my data is actually rather high quality, but I'd appreciate a
second opinion.
To quote from the ShortRead QA report:
'Ideally, the cumulative proportion of reads will transition sharply
from low to high. Portions to the left of the transition might
correspond roughly to sequencing or sample processing errors, and
correspond to reads that are represented relatively infrequently [...].
Portions to the right of the transition represent reads that are
over-represented compared to expectation.'
Typically the read distribution plots I'm seeing look like this:
http://dl.dropbox.com/u/419878/readOccurences.jpg
There is a sharp transition, but no portion to the left. I interpret
this as a good sign: most of the reads are seen a small number of times
(<10), and there are relatively few over-represented reads. Is there
anything there that would worry more experienced heads?
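
For anyone checking the interpretation, here is a minimal sketch of how a
cumulative read-occurrence curve of this kind could be computed. It is
written in Python purely for illustration and is not ShortRead's code. It
assumes the y-axis is the proportion of all reads that belong to sequences
seen at most n times; the function name and the toy reads are hypothetical.

```python
from collections import Counter


def cumulative_read_proportion(reads):
    """Return [(occurrences, cumulative proportion of reads)], sorted by occurrences.

    Assumption: the curve counts reads (not distinct sequences), so a
    sequence seen 6 times contributes 6 reads at x = 6.
    """
    per_sequence = Counter(reads)               # sequence -> times seen
    per_count = Counter(per_sequence.values())  # times seen -> number of distinct sequences
    total = len(reads)
    running = 0
    curve = []
    for occurrences in sorted(per_count):
        running += occurrences * per_count[occurrences]
        curve.append((occurrences, running / total))
    return curve


# Toy data: two singletons, one sequence seen twice, one seen six times.
reads = ["ACGT"] + ["TTGA"] + ["GGCC"] * 2 + ["CATG"] * 6
curve = cumulative_read_proportion(reads)
for occurrences, proportion in curve:
    print(occurrences, proportion)
```

Under that assumption, a curve that is already high at small occurrence
counts means most reads come from sequences seen only a few times, and a
long flat tail to the right means few reads come from heavily repeated
sequences.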
--
Alex Gutteridge