Entering edit mode
Alejandro Reyes
★
1.9k
@alejandro-reyes-5124
Last seen 4 months ago
Novartis Institutes for BioMedical Reseā¦
Hi Jose,
I have an e-mail answering to this thread on the 24.03.2014, maybe you
missed it or did I write your e-mail wrong?
http://permalink.gmane.org/gmane.science.biology.informatics.conductor
/53937
Your concern is answered by the second point that I describe there. If
you
look at your "fitted splicing" plot, you can see this. The extreme
case
is the coefficients fitted
for your exon E032, it has a value of ~40,000 in one of your
conditions
and a value of ~800 on
your other conditions. This will affect the estimation of relative
exon
abundances from
your other exons. As I mentioned before, this is a limitation of the
DEXSeq model,
but luckily, genes like this cases seem to be exceptions rather than
the
rule
(at least in my experience!).
About using the output from voom to test for DEU, I have not explored
that option,
but maybe the maintainers/authors of that package are able to guide
you
better.
Hope it is useful,
Alejandro
> Dear Alejandro,
> Have you had time to take a look at my problem (please see below)?
> I am now using DEXSeq 1.9 to analyze the same ecs objects I had
> analyzed with 1.8 but it produces the very same results. The problem
> was regarding too many exons with very low log2FC and very low
> p.values. I send here an object with the subsetByGene (ecs.one) with
> one particular gene. The E029 has a very low p.value with a very low
> log2FC. Either the log2FCs are not OK or the p.values. I cannot
> understand how such low log2FC for the DEU analysis can give those
low
> p.values. Indeed the complete original ecs gave me 98000 exons with
> table(res.48$padjust<0.05).
> However the same analysis (ecs object 4CTRLS vs 4 TREATED) gave me
> nice results when analysed with DEXSeq 1.6 on the complete design
> without splitting into two.
> Here's the picture with expression and splicing values:
> Immagine in linea 2
>
> Here's the design of the ecs object created with CTRL vs HYPOXIA
(only
> at 48h):
> > design(ecs.one)
> sampleName fileName
condition
> N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam
CTRL
> N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam
CTRL
> CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam
CTRL
> CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam
CTRL
> HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam
HYPOXIA
> HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam
HYPOXIA
>
> Maybe the sampleNames?? N1andN2 come from another batch but it is
> still a CTRL. If they were different I would expect higher
dispersions
> and hence higher p.values not lower ones, wouldn't I?
> I have tried to trace the problem a bit with these gene:
>
> modelFrame<-constructModelFrame(ecs.one)
> formula0 = ~sample + exon
> formula1 = ~sample + exon + condition:exon
> mm0<-DEXSeq:::rmDepCols(model.matrix(formula0,modelFrame))
> mm1<-DEXSeq:::rmDepCols(model.matrix(formula1,modelFrame))
> count<-DEXSeq:::getCountVector(ecs=ecs.one,geneID="ENSG00000170017",
"E029")
>
> > mm0
> (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1
> sampleN2 exonothers
> 1 1 0 0 0 1 0
> 0
> 2 1 0 0 0 0 1
> 0
> *3 1 0 0 0 0
> 0 0*
> 4 1 1 0 0 0 0
> 0
> 5 1 0 1 0 0 0
> 0
> 6 1 0 0 1 0 0
> 0
> 7 1 0 0 0 1 0
> 1
> 8 1 0 0 0 0 1
> 1
> 9 1 0 0 0 0 0
> 1
> 10 1 1 0 0 0 0
> 1
> 11 1 0 1 0 0 0
> 1
> 12 1 0 0 1 0 0
> 1
> attr(,"assign")
> [1] 0 1 1 1 1 1 2
> attr(,"contrasts")
> attr(,"contrasts")$sample
> [1] "contr.treatment"
>
> attr(,"contrasts")$exon
> [1] "contr.treatment"
>
> Does it seem OK to you? I guess the intercept is CTRL2 (in bold) but
> why? Does it have to do with the 'CTRL' string in the sampleName? I
> tried to change the sample names to CTRL1,CTRL2... but the result
was
> the same.
>
> Here's the mm1
> > mm1
> (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1
> sampleN2 exonothers exonthis:conditionHYPOXIA
> 1 1 0 0 0 1 0
> 0 0
> 2 1 0 0 0 0 1
> 0 0
> 3 1 0 0 0 0 0
> 0 0
> 4 1 1 0 0 0 0
> 0 0
> 5 1 0 1 0 0 0
> 0 1
> 6 1 0 0 1 0 0
> 0 1
> 7 1 0 0 0 1 0
> 1 0
> 8 1 0 0 0 0 1
> 1 0
> 9 1 0 0 0 0 0
> 1 0
> 10 1 1 0 0 0 0
1
> 0
> 11 1 0 1 0 0 0
1
> 0
> 12 1 0 0 1 0 0
1
> 0
> > sessionInfo()
> R Under development (unstable) (2014-02-09 r64949)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods
> base
>
> other attached packages:
> [1] edgeR_3.5.28 limma_3.19.28 DEXSeq_1.9.5
> Biobase_2.23.6 BiocGenerics_0.9.3 vimcom.plus_0.9-93 setwidth_1.0-3
> colorout_1.0-2
>
> loaded via a namespace (and not attached):
> [1] AnnotationDbi_1.25.15 biomaRt_2.19.3 Biostrings_2.31.15
> bitops_1.0-6 DBI_0.2-7 GenomeInfoDb_0.99.19
> GenomicRanges_1.15.39
> [8] hwriter_1.3 IRanges_1.21.34 RCurl_1.95-4.1
> Rsamtools_1.15.33 RSQLite_0.11.4 statmod_1.4.18
stats4_3.1.0
> [15] stringr_0.6.2 tools_3.1.0 XML_3.98-1.1
> XVector_0.3.7 zlibbioc_1.9.0
>
>
> I hope you can give me some hints since I am a bit confused and
stuck
> with these results.
> By the way, for the other Bioc, I know limma/voom used on exons can
> also work nicely. Is there an easy way to implement a sort of DEU
test
> with limma voom counts? I guess the annotation gtf used to count
them
> should be used to construct models and include it in a similar way
> with formulae in linearmodels as DEXSeq does with glmnb.fit
function.
> It would be perfect to have it straight as a function also in limma
to
> compare results.
>
> Thanks again for your efforts,
>
> Looking forward to hearing to your comments.
> Jose
>
>
>
>
> 2014-03-20 16:07 GMT+01:00 Jose Garcia
> <garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>:
>
> Hi Alejandro,
> I solved the problem by re-creating the object ecs.24. I had
made
> one DEXSeq analysis up to the end by first creating an ecs
object.
> Then I just split the ecs object, which already had p.value and
> other info, and re-run the analysis from sizeFactors on onto the
> new split ecs.24 object.
> Now it has worked.
> However, I have obtained a much harder to interpret result which
> points to something wrong I do not know why. And it is present
in
> both the split and the original ecs.24 and ecs objects.
> From scratch:
> I made dexseq_prepare_annotation.py with the script from DEXSeq
> 1.6 which contained the '-r' parameter in order to avoid
counting
> exons overlapping different genes. Then I tried to count using
the
> new dexseq_count.py in the same package but it gave me an error
> because it had been introduced a check for NH tag in the bam
that
> I do not have because I use SOAPSplice. You suggested to use the
> old dexseq_count.py whithout the check (from DEXSeq 1.4).
> It worked and then I used the following script:
>
> sampleFiles.R_ExonOUT<-Files
> sampleName.R_ExonOUT<-Names
> sampleCondition.R_ExonOUT<-c(rep("HYPOXIA",2),rep("CTRL",4),rep(
"HYPOXIA",2))
> sampleExperiment.R_ExonOUT<-c(rep("RUN_2",4),rep("RUN_1",4))
> sampleTable.R_ExonOUT <- data.frame(sampleName =
sampleName.R_ExonOUT,
> fileName = sampleFiles.R_ExonOUT,
> condition =
sampleCondition.R_ExonOUT,
> experiment =
sampleExperiment.R_ExonOUT)
> inDir = getwd()
> annotationfile = file.path
> ("/lustre1/genomes/hg19/annotation","Homo_sapiens.ensembl72.DEXS
eq.gff")
>
> ecs = read.HTSeqCounts(countfiles = file.path(inDir,
> sampleTable.R_ExonOUT$fileName),design = sampleTable.R_ExonOUT,
> flattenedfile = annotationfile)
>
> sampleNames(ecs) = sampleTable.R_ExonOUT$sampleName
> ecs <- estimateSizeFactors(ecs)
> library(parallel)
> ecs <- estimateDispersions(ecs,nCores=8)
> ecs <- fitDispersionFunction(ecs)
> ecs <- testForDEU(ecs, nCores=8)
> ecs <- estimatelog2FoldChanges(ecs, nCores=8)
> res<- DEUresultTable(ecs)
>
> The problem is that I have some exons with a ridiculous log2FC
but
> with a very good p.adjust.
> Same thing with the ecs.24 or ecs.48 split objects. Here an
example:
>
> head(res.48[which(res.48$geneID=="ENSG00000148516"),])
>
> geneID exonID dispersion pvalue padjust meanBase
> log2fold(HYPOXIA/CTRL)
>
> ENSG00000148516:E036 ENSG00000148516 E036 0.014798679
> 2.873434e-16 5.646223e-14 171.5313 -6.075811e-01
>
> ENSG00000148516:E049 ENSG00000148516 E049 0.011425856
> 2.461690e-14 2.846653e-12 414.4351 -1.907197e-01
>
> ENSG00000148516:E039 ENSG00000148516 E039 0.014486497
> 2.332678e-13 2.043916e-11 181.3705 -4.226252e-01
>
> *ENSG00000148516:E050 ENSG00000148516 E050 0.009733072
> 1.131825e-12 8.326638e-11 1432.6492 -1.278668e-05*
>
> ENSG00000148516:E033 ENSG00000148516 E033 0.037143254
> 3.483915e-12 2.236853e-10 514.5010 -5.273017e-01
>
> ENSG00000148516:E034 ENSG00000148516 E034 0.019826955
> 4.660942e-12 2.896722e-10 113.6851 -6.541261e-01
>
>
> If you look at the plot (just a few exons around E50)
>
> plotDEXSeq(ecs.48,"ENSG00000148516",splicing=T)
>
>
> Immagine in linea 3
>
> It seems clear that all those p-values cannot come from those
> log2FC that are adjusted for expression of all exons coming from
> the same gene.
>
> I have checked the design and formula:
>
> design(ecs.48)
>
> sampleName fileName condition experiment
>
> N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL
RUN_2
>
> N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL
RUN_2
>
> CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL
RUN_1
>
> CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL
RUN_1
>
> HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam
> HYPOXIA RUN_1
>
> HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam
> HYPOXIA RUN_1
>
>
> formula(ecs.48)
>
> $formulaDispersion
>
> [1] "~sample + exon + condition:exon"
>
>
> $formula0
>
> [1] "~sample + exon"
>
>
> $formula1
>
> [1] "~sample + exon + condition:exon"
>
>
> So, I am a bit stuck with it. I guess everything comes from
having
> used different versions but I cannot come across it.
Summarizing:
>
> SOASPSplice
>
> dexseq_prepare_annotation.py (From DEXSeq 1.6) with Ensembl72
> (hg19) -r no
>
> dexseq_count.py (From DEXSeq 1.4)
>
> Analysis (DEXSeq 1.8)
>
> Thanks for the help,
>
>
> Jose
>
>
>
> sessionInfo()
>
> R version 3.0.1 (2013-05-16)
>
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
>
> locale:
>
> [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US
>
> [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US
>
> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
>
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US
LC_IDENTIFICATION=C
>
>
> attached base packages:
>
> [1] parallel stats graphics grDevices utils datasets
methods
>
> [8] base
>
>
> other attached packages:
>
> [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0
>
>
> loaded via a namespace (and not attached):
>
> [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6
>
> [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6
>
> [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18
>
> [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1
>
> [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0
>
>
>
> 2014-03-19 13:18 GMT+01:00 Jose Garcia
> <garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>:
>
> Hi Alejandro,
> I am analyzing with DEXSeq my data. 4 CTRLs and 2 Treated
> samples. My design is the following:
>
> design(ecs.24)
>
> sampleName fileName condition
experiment
>
> H1 H1 Exon_Martelli_Sample_Martelli_H_1.bam
> HYPOXIA RUN_2
>
> H2 H2 Exon_Martelli_Sample_Martelli_H_2.bam
> HYPOXIA RUN_2
>
> N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam
CTRL
> RUN_2
>
> N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam
CTRL
> RUN_2
>
> CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam
> CTRL RUN_1
>
> CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam
> CTRL RUN_1
>
> When I follow the vignette:
>
> ecs.24 <- estimateDispersions(ecs.24,nCores=8)
>
> ....Done
>
> ecs.24 <- fitDispersionFunction(ecs.24)
>
> ....Done
>
> ecs.24 <- testForDEU(ecs.24, nCores=8)
>
> .....
>
> ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8)
>
> *Error in `row.names<-.data.frame`(`*tmp*`, value =
> c("geneID", "exonID", : *
>
> * duplicate 'row.names' are not allowed*
>
> *Calls: estimatelog2FoldChanges ... pData<- -> pData<- ->
> row.names<- -> row.names<-.data.frame*
>
> *In addition: Warning message:*
>
> *non-unique value when setting 'row.names':
> 'log2fold(CTRL/HYPOXIA)' *
>
>
> I checked for duplication as you had suggested elsewhere
>
> any(duplicated(featureNames(ecs.24)))
>
> [1] FALSE
>
>
any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":")))
>
> [1] FALSE
>
>
> I also checked that the condition in design where factors:
>
>
> is.factor(pData(ecs.24)$condition)
>
> [1] TRUE
>
>
> The only explanation I could come to is the fact that I have
> an even number of samples for control and treated? or that I
> have the 'experiment' column in the design, but it would be
> irrelevant since the default formula is only taking
condition
> into consideration, isn't it?
>
> What could be the origin of the error?
>
> Thanks again!
>
> Jose
>
>
>
> > sessionInfo()
>
> R version 3.0.1 (2013-05-16)
>
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
>
> locale:
>
> [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US
>
> [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US
>
> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
>
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C
>
>
> attached base packages:
>
> [1] parallel stats graphics grDevices utils
datasets
> methods
>
> [8] base
>
>
> other attached packages:
>
> [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0
>
>
> loaded via a namespace (and not attached):
>
> [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6
>
> [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6
>
> [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18
>
> [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1
>
> [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0
>
>
>
> 2014-03-13 16:32 GMT+01:00 Alejandro Reyes
> <alejandro.reyes at="" embl.de="" <mailto:alejandro.reyes="" at="" embl.de="">>:
>
> Dear Xiayu Rao,
>
> Thanks for your interest in DEXSeq.
> That looks strange, does it happen when you use files
> different from the
> one you generated by your own? Could you maybe send me
> (offline) your
> gtf file and the first 1000 lines of one of your sam
files?
>
> Bests,
> Alejandro
>
> > Hello,
> >
> > DEXSeq is a great tool. Thank you for that. I recently
> generate my own gtf file with the same format as the
> exon.gff generated by dexseq_prepare_annotation.py.
> However, the output is strange. I tried to find the
reason
> with no success. Could you please provide any idea about
> that problem? Thank you very much in advance!
> >
> > Note: I used the latest dexseq, and the sam files had
> been sorted.
> >
> > 1 genes.gtf exonic_part 12228 12594 .
> + . exonic_part_number "001"; gene_id
> "ENSG00000223972"
> > 1 genes.gtf exonic_part 12722 12974 .
> + . exonic_part_number "002"; gene_id
> "ENSG00000223972"
> > 1 genes.gtf exonic_part 13053 13220 .
> + . exonic_part_number "003"; gene_id
> "ENSG00000223972"
> > 1 genes.gtf exonic_part 14830 14969 .
> - . exonic_part_number "001"; gene_id
> "ENSG00000223972+ENSG00000227232"
> > .............
> >
> >
> > ==> SRR791043_counts.txt <==
> > :001G00027000003"
> > :002G00021000003"
> > :003G00070000003"
> > :004G00040000003"
> > :005G00060000003"
> > :006G00030000003"
> > :007G00019000003"
> > :008G00045600003"
> > :009G00020400003"
> > :001G00000000005"
> >
> >
> > Thanks,
> > Xiayu
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> --
> Jose M. Garcia Manteiga PhD
> Research Assistant - Data Analysis in Functional Genomics
> Center for Translational Genomics and BioInformatics
> Dibit2-Basilica, 3A3
> San Raffaele Scientific Institute
> Via Olgettina 58, 20132 Milano (MI), Italy
>
> Tel: +39-02-2643-4751
> e-mail: garciamanteiga.josemanuel at hsr.it
> <mailto:garciamanteiga.josemanuel at="" hsr.it="">
>
>
>
>
> --
> Jose M. Garcia Manteiga PhD
> Research Assistant - Data Analysis in Functional Genomics
> Center for Translational Genomics and BioInformatics
> Dibit2-Basilica, 3A3
> San Raffaele Scientific Institute
> Via Olgettina 58, 20132 Milano (MI), Italy
>
> Tel: +39-02-2643-4751
> e-mail: garciamanteiga.josemanuel at hsr.it
> <mailto:garciamanteiga.josemanuel at="" hsr.it="">
>
>
>
>
> --
> Jose M. Garcia Manteiga PhD
> Research Assistant - Data Analysis in Functional Genomics
> Center for Translational Genomics and BioInformatics
> Dibit2-Basilica, 3A3
> San Raffaele Scientific Institute
> Via Olgettina 58, 20132 Milano (MI), Italy
>
> Tel: +39-02-2643-4751
> e-mail: garciamanteiga.josemanuel at hsr.it
> <mailto:garciamanteiga.josemanuel at="" hsr.it="">