Entering edit mode
Dear Gabriele,
Sorry I never came to see that post before. Not an excuse, but Bioc
has such a huge traffic - I made a rule now to sort these emails.
On 06 Jun 2013, at 16:52, Gabriele Zoppoli <zoppoli at="" gmail.com="">
wrote:
> Dear Bioconductor mailing list,
>
> I am trying to load and summarize ~ 60 paired-read BAM files from
human
> cancer RNA-seq experiments using Illumina 2x50 protocol, for
downstream use
> with edgeR and DESeq.
>
> First question: I noticed several posts have been issued on the same
topic,
> i.e. the way to solve the warning: "There are [any number] synthetic
exons
> as determined from your annotation that overlap! This implies that
some
> reads will be counted more than once! Is that really what you want?"
when
> using easyRNASeq.
>
> So far, I haven't seen any answer that doesn't pass through the use
of
> GenomicRanges, or even any answer at all for some decently written
posts.
Sadly that happens when some threads go off-list. I may miss times
when this happen, as I receive numerous inquiries off-list. Many have
to do with topics that have already been discussed on the list or are
trivial so I do not consider it worth to bring these back to the list.
Too much information kill the information.
> The easyRNASeq vignette is not entirely clear on that point. I'm
therefore
> wondering whether anybody has come up with a solution and posted it
in a
> plain and reproducible fashion.
Point taken, I?ve been discussing this a lot and I agree the vignette
is the best place for that sort of information. However, the plain and
reproducible is going to be difficult, because there are many
different annotation formats and many different organisms specificity.
It is very frequently necessary to make custom annotations because
users have different questions or different usage. I?m finalising some
edition to the vignettes with regards to this. Should be in version
1.8.4 by the end of this week. Note that there?s already some
information on that in the vignette, in particular section 7.1.
To elaborate, the point of this warning is to make sure that you - the
user - understand what you are doing. Most people think that counting
reads is easy, but it?s all the contrary. And if you do not not
understand how counting happens, you are bound to make mistakes that
will eventually flaw your analysis. As there are many ways of counting
and as users might have different opinions or different usage for read
counting, there is no single optimal solution that can be implemented
- things are standardising though and I?m working on having
solution(s) for this in a next version of easyRNASeq. Anyway, at the
time I designed the package, reporting a warning and letting the user
deal with it as its convenience was the optimal thing to do. Hence,
the warning is not for me to solve but for you to weight how much this
might affect the analysis you are planning.
Now, if you look at the vignette in the version 1.8.3 of the package,
you?ll see that section 7.1 describes how you can get read counts per
?gene? minimising cases where ?multiple counting? might occur; what I
call generating ?synthetic transcripts?. Even when doing this, there
will still be cases where it occurs, i.e. if genes on different
strands are overlapping, which is what the warning message you are
seeing above is telling you. To decide whether this is critical or not
depends on your organisms and on the cases where these occur; e.g. in
drosophila, poplar, these are often due to overlapping UTRs and I deal
with these simply by shortening the UTRs to avoid multiple counting.
Hence what you need to do after you perform the steps in section 7.1
is to check for ?synthetic transcripts? overlaps. Here?s how you can
get the list of those:
obj <- as(syntheticGeneModel,?RangedData)
ovl <- findOverlaps(obj,ignoreSelf=TRUE,ignoreRedundant=TRUE)
and then investigate the ?ovl? HitLists object. I?m adding some
details about this as well in the vignette; should be out by the end
of the week. Moreover, I have added a flag in easyRNASeq that reports
which annotation entry has an overlap, e.g. to flag the genes for
downstream analyses. This information is only accessible with the
outputFormat=?RNAseq? and the details will be in the vignette.
Finally, a note on easyRNASeq counting. EasyRNASeq assumes that you
know what you are doing, i.e. at the moment it does not check if you
are providing multiple mapping reads, chimeric reads (non properly
paired reads), etc. You need to check what your aligner of choice
reports, my settings drop multiple mapping and chimeric reads but I?ve
investigated what consequences this has on my reference organism so I
know which genes are affected by this. Actually, this behaviour of
easyRNASeq will change in the next development version; checks will be
performed on the data to assess strand specificity, presence of
multiple mapping, chimeric reads, etc., which should make its use more
intuitive and less error-prone.
> Second question: does anybody know whether the aforementioned
easyRNASeq
> package makes use of the "properly paired" reads for summarization?
I
> really couldn't find anything on that either, even after a month of
> googling around.
There?s no reason not to email me directly, so do not hesitate doing
so whenever you have questions; please mind the changed email address.
I can feel your frustration through this email and I understand it if
you?ve been looking for answers for so long before emailing. Once
again, I?m very sorry that I did not see that email earlier. I very
welcome questions, comments, criticisms because eventually that
feedback benefits me and others.
Anyway, the reason why there is no answer out there, is that I believe
this is the first time that question is asked as you state it. The
short answer is ?not at the moment?. The long answer is that
easyRNASeq use a coverage approach - i.e. looking at the coverage of
every bp and dividing that by the fractions of the fragments
overlapping a given feature (e.g. an exon). I?m currently re-
implementing the whole package so as to better support pairing,
strandedness, parallel processing, etc. Again to re-iterate the
comments above, at the moment to use easyRNASeq in a sensible way,
knowing the content of your BAM file - i.e. which reads are reported
by your aligner - is essential for accurate counting. This should be
significantly simplified, once I?m done with the easyRNASeq re-
implementation.
Once again, sorry for not seeing that email before.
Hope it helps anyway,
Cheers,
Nico
>
> That's what I've done so far:
>
> countTable <- easyRNASeq(filesDirectory=getwd(),
> organism="Hsapiens",
> annotationMethod="rda",
> annotationFile="gAnnot.rda",
> gapped=TRUE, count="genes",
> summarization="geneModels",
> filesDirectory=getwd(),
> filenames=BAM_files,
> outputFormat="RNAseq",
> nBcores=4)
> Checking arguments...
> Fetching annotations...
> Computing gene models...
> Summarizing counts...
> Processing sample1.bam
> Updating the read length information.
> The alignments are gapped.
> Minimum length of 1 bp.
> Maximum length of 51 bp.
> [...]
> Preparing output
>
> Warning messages:
> [...]
> 2: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens",
> annotationMethod = "rda", :
> There are 16816 synthetic exons as determined from your annotation
that
> overlap! This implies that some reads will be counted more than
once! Is
> that really what you want?
> [...]
>
> ##rda file is derived from a previous iteration of the same command
using
> annotationMethod="biomaRt" and then doing
>
> gAnnot <- genomicAnnotation(count.genes)
> gAnnot <- gAnnot[space(gAnnot) %in%
> paste("chr",c(1:22,"X","Y","M"),sep=""),]
> save(gAnnot,file="gAnnot.rda")
>
> as suggested by Nicholas Delhomme
>
> Thanks in advance!
>
> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods
> [8] base
>
> other attached packages:
> [1] easyRNASeq_1.6.0 ShortRead_1.18.0
latticeExtra_0.6-24
> [4] RColorBrewer_1.0-5 Rsamtools_1.12.3 DESeq_1.12.0
> [7] lattice_0.20-15 locfit_1.5-9.1 BSgenome_1.28.0
> [10] GenomicRanges_1.12.4 Biostrings_2.28.0 IRanges_1.18.1
> [13] edgeR_3.2.3 limma_3.16.5 biomaRt_2.16.0
> [16] Biobase_2.20.0 genomeIntervals_1.16.0
BiocGenerics_0.6.0
> [19] intervals_0.14.0
>
> loaded via a namespace (and not attached):
> [1] annotate_1.38.0 AnnotationDbi_1.22.6 bitops_1.0-5
> [4] DBI_0.2-7 genefilter_1.42.0 geneplotter_1.38.0
> [7] grid_3.0.1 hwriter_1.3 RCurl_1.95-4.1
> [10] RSQLite_0.11.4 splines_3.0.1 stats4_3.0.1
> [13] survival_2.37-4 XML_3.96-1.1 xtable_1.7-1
> [16] zlibbioc_1.6.0
>
>
>
>
> --
> Gabriele Zoppoli, MD
> Ph.D., Clinical and Experimental Oncology and Hematology
> Visiting Researcher, BCTRL J.C. Heuson, Institut J. Bordet,
Bruxelles BE
> Internal Medicine Resident, DiMI, IRCCS AOU San Martino IST, Genova,
IT
> Former Guest Researcher, LMP, CCR, NCI, NIH, Bethesda MD
>
>
> Tel: +39 010 353 7968
> Mobile 1: +32 478 337 942
> Mobile 2: +39 349 617 0129
> Email: gabriele.zoppoli at unige.it
> Alt. Email: zoppoli at gmail.com
> Alt. Email 2: gzoppoli at libero.it
> Alt. Email 3: gabriele.zoppoli at bordet.be
> ----------------------------------------------------------
>
>
> ??? ????? ???? ?? ????? ??' ????? ???? ??????,
>
> ??????? ?' ??????, ??? ?' ??????????? ???????:
>
> ?? ?? ???? ??? ???????, ???? ?? ??? ?????? ?????.
> *Father Zeus, at least deliver the sons of Acheans from the gloom,*
> *And make clear the air, and give it to our eyes to see.*
> *In the light destroy us, since to do thus pleases you. (Il. 17,
645-7)
> *
>
>
> ----------------------------------------------------------
>
> CONFIDENTIALITY NOTICE\ \ This e-mail message is
intende...{{dropped:14}}
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
---------------------------------------------------------------
Nicolas Delhomme
Nathaniel Street Lab
Department of Plant Physiology
Ume? Plant Science Center
Tel: +46 90 786 7989
Email: nicolas.delhomme at plantphys.umu.se
SLU - Ume? universitet
Ume? S-901 87 Sweden
---------------------------------------------------------------