Entering edit mode
I agree with Nicolas that counting reads is much more subtle than
people
think before they consider it in detail. There is a detailed
discussion
of multiple counting issues and solutions in:
http://www.ncbi.nlm.nih.gov/pubmed/24227677
The solutions are implemented in the RSubread package and support
paired
end reads and so on.
Best wishes
Gordon
> Date: Mon, 2 Dec 2013 10:23:08 +0100
> From: Nicolas Delhomme <nicolas.delhomme at="" umu.se="">
> To: Gabriele Zoppoli <zoppoli at="" gmail.com="">
> Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
> Subject: Re: [BioC] easyRNASeq packages: overlapping exons warning
> unsolved (?) and use of paired end information
>
> Dear Gabriele,
>
> Sorry I never came to see that post before. Not an excuse, but Bioc
has such a huge traffic - I made a rule now to sort these emails.
>
> On 06 Jun 2013, at 16:52, Gabriele Zoppoli <zoppoli at="" gmail.com="">
wrote:
>
>> Dear Bioconductor mailing list,
>>
>> I am trying to load and summarize ~ 60 paired-read BAM files from
human
>> cancer RNA-seq experiments using Illumina 2x50 protocol, for
downstream use
>> with edgeR and DESeq.
>>
>> First question: I noticed several posts have been issued on the
same topic,
>> i.e. the way to solve the warning: "There are [any number]
synthetic exons
>> as determined from your annotation that overlap! This implies that
some
>> reads will be counted more than once! Is that really what you
want?" when
>> using easyRNASeq.
>>
>> So far, I haven't seen any answer that doesn't pass through the use
of
>> GenomicRanges, or even any answer at all for some decently written
posts.
>
> Sadly that happens when some threads go off-list. I may miss times
when this happen, as I receive numerous inquiries off-list. Many have
to do with topics that have already been discussed on the list or are
trivial so I do not consider it worth to bring these back to the list.
Too much information kill the information.
>
>> The easyRNASeq vignette is not entirely clear on that point. I'm
therefore
>> wondering whether anybody has come up with a solution and posted it
in a
>> plain and reproducible fashion.
>
> Point taken, I?ve been discussing this a lot and I agree the
vignette is the best place for that sort of information. However, the
plain and reproducible is going to be difficult, because there are
many different annotation formats and many different organisms
specificity. It is very frequently necessary to make custom
annotations because users have different questions or different usage.
I?m finalising some edition to the vignettes with regards to this.
Should be in version 1.8.4 by the end of this week. Note that there?s
already some information on that in the vignette, in particular
section 7.1.
>
> To elaborate, the point of this warning is to make sure that you -
the user - understand what you are doing. Most people think that
counting reads is easy, but it?s all the contrary. And if you do not
not understand how counting happens, you are bound to make mistakes
that will eventually flaw your analysis. As there are many ways of
counting and as users might have different opinions or different usage
for read counting, there is no single optimal solution that can be
implemented - things are standardising though and I?m working on
having solution(s) for this in a next version of easyRNASeq. Anyway,
at the time I designed the package, reporting a warning and letting
the user deal with it as its convenience was the optimal thing to do.
Hence, the warning is not for me to solve but for you to weight how
much this might affect the analysis you are planning.
>
> Now, if you look at the vignette in the version 1.8.3 of the
package, you?ll see that section 7.1 describes how you can get read
counts per ?gene? minimising cases where ?multiple counting? might
occur; what I call generating ?synthetic transcripts?. Even when doing
this, there will still be cases where it occurs, i.e. if genes on
different strands are overlapping, which is what the warning message
you are seeing above is telling you. To decide whether this is
critical or not depends on your organisms and on the cases where these
occur; e.g. in drosophila, poplar, these are often due to overlapping
UTRs and I deal with these simply by shortening the UTRs to avoid
multiple counting. Hence what you need to do after you perform the
steps in section 7.1 is to check for ?synthetic transcripts? overlaps.
Here?s how you can get the list of those:
>
> obj <- as(syntheticGeneModel,?RangedData)
> ovl <- findOverlaps(obj,ignoreSelf=TRUE,ignoreRedundant=TRUE)
>
> and then investigate the ?ovl? HitLists object. I?m adding some
details about this as well in the vignette; should be out by the end
of the week. Moreover, I have added a flag in easyRNASeq that reports
which annotation entry has an overlap, e.g. to flag the genes for
downstream analyses. This information is only accessible with the
outputFormat=?RNAseq? and the details will be in the vignette.
>
> Finally, a note on easyRNASeq counting. EasyRNASeq assumes that you
know what you are doing, i.e. at the moment it does not check if you
are providing multiple mapping reads, chimeric reads (non properly
paired reads), etc. You need to check what your aligner of choice
reports, my settings drop multiple mapping and chimeric reads but I?ve
investigated what consequences this has on my reference organism so I
know which genes are affected by this. Actually, this behaviour of
easyRNASeq will change in the next development version; checks will be
performed on the data to assess strand specificity, presence of
multiple mapping, chimeric reads, etc., which should make its use more
intuitive and less error-prone.
>
>> Second question: does anybody know whether the aforementioned
easyRNASeq
>> package makes use of the "properly paired" reads for summarization?
I
>> really couldn't find anything on that either, even after a month of
>> googling around.
>
> There?s no reason not to email me directly, so do not hesitate doing
so whenever you have questions; please mind the changed email address.
I can feel your frustration through this email and I understand it if
you?ve been looking for answers for so long before emailing. Once
again, I?m very sorry that I did not see that email earlier. I very
welcome questions, comments, criticisms because eventually that
feedback benefits me and others.
> Anyway, the reason why there is no answer out there, is that I
believe this is the first time that question is asked as you state it.
The short answer is ?not at the moment?. The long answer is that
easyRNASeq use a coverage approach - i.e. looking at the coverage of
every bp and dividing that by the fractions of the fragments
overlapping a given feature (e.g. an exon). I?m currently re-
implementing the whole package so as to better support pairing,
strandedness, parallel processing, etc. Again to re-iterate the
comments above, at the moment to use easyRNASeq in a sensible way,
knowing the content of your BAM file - i.e. which reads are reported
by your aligner - is essential for accurate counting. This should be
significantly simplified, once I?m done with the easyRNASeq re-
implementation.
>
> Once again, sorry for not seeing that email before.
>
> Hope it helps anyway,
>
> Cheers,
>
> Nico
>
>
>>
>> That's what I've done so far:
>>
>> countTable <- easyRNASeq(filesDirectory=getwd(),
>> organism="Hsapiens",
>> annotationMethod="rda",
>> annotationFile="gAnnot.rda",
>> gapped=TRUE, count="genes",
>> summarization="geneModels",
>> filesDirectory=getwd(),
>> filenames=BAM_files,
>> outputFormat="RNAseq",
>> nBcores=4)
>> Checking arguments...
>> Fetching annotations...
>> Computing gene models...
>> Summarizing counts...
>> Processing sample1.bam
>> Updating the read length information.
>> The alignments are gapped.
>> Minimum length of 1 bp.
>> Maximum length of 51 bp.
>> [...]
>> Preparing output
>>
>> Warning messages:
>> [...]
>> 2: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens",
>> annotationMethod = "rda", :
>> There are 16816 synthetic exons as determined from your annotation
that
>> overlap! This implies that some reads will be counted more than
once! Is
>> that really what you want?
>> [...]
>>
>> ##rda file is derived from a previous iteration of the same command
using
>> annotationMethod="biomaRt" and then doing
>>
>> gAnnot <- genomicAnnotation(count.genes)
>> gAnnot <- gAnnot[space(gAnnot) %in%
>> paste("chr",c(1:22,"X","Y","M"),sep=""),]
>> save(gAnnot,file="gAnnot.rda")
>>
>> as suggested by Nicholas Delhomme
>>
>> Thanks in advance!
>>
>> sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets
methods
>> [8] base
>>
>> other attached packages:
>> [1] easyRNASeq_1.6.0 ShortRead_1.18.0
latticeExtra_0.6-24
>> [4] RColorBrewer_1.0-5 Rsamtools_1.12.3 DESeq_1.12.0
>> [7] lattice_0.20-15 locfit_1.5-9.1 BSgenome_1.28.0
>> [10] GenomicRanges_1.12.4 Biostrings_2.28.0 IRanges_1.18.1
>> [13] edgeR_3.2.3 limma_3.16.5 biomaRt_2.16.0
>> [16] Biobase_2.20.0 genomeIntervals_1.16.0
BiocGenerics_0.6.0
>> [19] intervals_0.14.0
>>
>> loaded via a namespace (and not attached):
>> [1] annotate_1.38.0 AnnotationDbi_1.22.6 bitops_1.0-5
>> [4] DBI_0.2-7 genefilter_1.42.0 geneplotter_1.38.0
>> [7] grid_3.0.1 hwriter_1.3 RCurl_1.95-4.1
>> [10] RSQLite_0.11.4 splines_3.0.1 stats4_3.0.1
>> [13] survival_2.37-4 XML_3.96-1.1 xtable_1.7-1
>> [16] zlibbioc_1.6.0
>>
>>
>>
>>
>> --
>> Gabriele Zoppoli, MD
>> Ph.D., Clinical and Experimental Oncology and Hematology
>> Visiting Researcher, BCTRL J.C. Heuson, Institut J. Bordet,
Bruxelles BE
>> Internal Medicine Resident, DiMI, IRCCS AOU San Martino IST,
Genova, IT
>> Former Guest Researcher, LMP, CCR, NCI, NIH, Bethesda MD
>>
>>
>> Tel: +39 010 353 7968
>> Mobile 1: +32 478 337 942
>> Mobile 2: +39 349 617 0129
>> Email: gabriele.zoppoli at unige.it
>> Alt. Email: zoppoli at gmail.com
>> Alt. Email 2: gzoppoli at libero.it
>> Alt. Email 3: gabriele.zoppoli at bordet.be
>> ----------------------------------------------------------
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}