DEXSeq output - count file

0

Entering edit mode

Alejandro Reyes ★ 1.9k

@alejandro-reyes-5124

Last seen 4 months ago

Novartis Institutes for BioMedical Rese…

Hi Jose, I have an e-mail answering to this thread on the 24.03.2014, maybe you missed it or did I write your e-mail wrong? http://permalink.gmane.org/gmane.science.biology.informatics.conductor /53937 Your concern is answered by the second point that I describe there. If you look at your "fitted splicing" plot, you can see this. The extreme case is the coefficients fitted for your exon E032, it has a value of ~40,000 in one of your conditions and a value of ~800 on your other conditions. This will affect the estimation of relative exon abundances from your other exons. As I mentioned before, this is a limitation of the DEXSeq model, but luckily, genes like this cases seem to be exceptions rather than the rule (at least in my experience!). About using the output from voom to test for DEU, I have not explored that option, but maybe the maintainers/authors of that package are able to guide you better. Hope it is useful, Alejandro > Dear Alejandro, > Have you had time to take a look at my problem (please see below)? > I am now using DEXSeq 1.9 to analyze the same ecs objects I had > analyzed with 1.8 but it produces the very same results. The problem > was regarding too many exons with very low log2FC and very low > p.values. I send here an object with the subsetByGene (ecs.one) with > one particular gene. The E029 has a very low p.value with a very low > log2FC. Either the log2FCs are not OK or the p.values. I cannot > understand how such low log2FC for the DEU analysis can give those low > p.values. Indeed the complete original ecs gave me 98000 exons with > table(res.48$padjust<0.05). > However the same analysis (ecs object 4CTRLS vs 4 TREATED) gave me > nice results when analysed with DEXSeq 1.6 on the complete design > without splitting into two. > Here's the picture with expression and splicing values: > Immagine in linea 2 > > Here's the design of the ecs object created with CTRL vs HYPOXIA (only > at 48h): > > design(ecs.one) > sampleName fileName condition > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam HYPOXIA > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam HYPOXIA > > Maybe the sampleNames?? N1andN2 come from another batch but it is > still a CTRL. If they were different I would expect higher dispersions > and hence higher p.values not lower ones, wouldn't I? > I have tried to trace the problem a bit with these gene: > > modelFrame<-constructModelFrame(ecs.one) > formula0 = ~sample + exon > formula1 = ~sample + exon + condition:exon > mm0<-DEXSeq:::rmDepCols(model.matrix(formula0,modelFrame)) > mm1<-DEXSeq:::rmDepCols(model.matrix(formula1,modelFrame)) > count<-DEXSeq:::getCountVector(ecs=ecs.one,geneID="ENSG00000170017", "E029") > > > mm0 > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1 > sampleN2 exonothers > 1 1 0 0 0 1 0 > 0 > 2 1 0 0 0 0 1 > 0 > *3 1 0 0 0 0 > 0 0* > 4 1 1 0 0 0 0 > 0 > 5 1 0 1 0 0 0 > 0 > 6 1 0 0 1 0 0 > 0 > 7 1 0 0 0 1 0 > 1 > 8 1 0 0 0 0 1 > 1 > 9 1 0 0 0 0 0 > 1 > 10 1 1 0 0 0 0 > 1 > 11 1 0 1 0 0 0 > 1 > 12 1 0 0 1 0 0 > 1 > attr(,"assign") > [1] 0 1 1 1 1 1 2 > attr(,"contrasts") > attr(,"contrasts")$sample > [1] "contr.treatment" > > attr(,"contrasts")$exon > [1] "contr.treatment" > > Does it seem OK to you? I guess the intercept is CTRL2 (in bold) but > why? Does it have to do with the 'CTRL' string in the sampleName? I > tried to change the sample names to CTRL1,CTRL2... but the result was > the same. > > Here's the mm1 > > mm1 > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1 > sampleN2 exonothers exonthis:conditionHYPOXIA > 1 1 0 0 0 1 0 > 0 0 > 2 1 0 0 0 0 1 > 0 0 > 3 1 0 0 0 0 0 > 0 0 > 4 1 1 0 0 0 0 > 0 0 > 5 1 0 1 0 0 0 > 0 1 > 6 1 0 0 1 0 0 > 0 1 > 7 1 0 0 0 1 0 > 1 0 > 8 1 0 0 0 0 1 > 1 0 > 9 1 0 0 0 0 0 > 1 0 > 10 1 1 0 0 0 0 1 > 0 > 11 1 0 1 0 0 0 1 > 0 > 12 1 0 0 1 0 0 1 > 0 > > sessionInfo() > R Under development (unstable) (2014-02-09 r64949) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] edgeR_3.5.28 limma_3.19.28 DEXSeq_1.9.5 > Biobase_2.23.6 BiocGenerics_0.9.3 vimcom.plus_0.9-93 setwidth_1.0-3 > colorout_1.0-2 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.25.15 biomaRt_2.19.3 Biostrings_2.31.15 > bitops_1.0-6 DBI_0.2-7 GenomeInfoDb_0.99.19 > GenomicRanges_1.15.39 > [8] hwriter_1.3 IRanges_1.21.34 RCurl_1.95-4.1 > Rsamtools_1.15.33 RSQLite_0.11.4 statmod_1.4.18 stats4_3.1.0 > [15] stringr_0.6.2 tools_3.1.0 XML_3.98-1.1 > XVector_0.3.7 zlibbioc_1.9.0 > > > I hope you can give me some hints since I am a bit confused and stuck > with these results. > By the way, for the other Bioc, I know limma/voom used on exons can > also work nicely. Is there an easy way to implement a sort of DEU test > with limma voom counts? I guess the annotation gtf used to count them > should be used to construct models and include it in a similar way > with formulae in linearmodels as DEXSeq does with glmnb.fit function. > It would be perfect to have it straight as a function also in limma to > compare results. > > Thanks again for your efforts, > > Looking forward to hearing to your comments. > Jose > > > > > 2014-03-20 16:07 GMT+01:00 Jose Garcia > <garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>: > > Hi Alejandro, > I solved the problem by re-creating the object ecs.24. I had made > one DEXSeq analysis up to the end by first creating an ecs object. > Then I just split the ecs object, which already had p.value and > other info, and re-run the analysis from sizeFactors on onto the > new split ecs.24 object. > Now it has worked. > However, I have obtained a much harder to interpret result which > points to something wrong I do not know why. And it is present in > both the split and the original ecs.24 and ecs objects. > From scratch: > I made dexseq_prepare_annotation.py with the script from DEXSeq > 1.6 which contained the '-r' parameter in order to avoid counting > exons overlapping different genes. Then I tried to count using the > new dexseq_count.py in the same package but it gave me an error > because it had been introduced a check for NH tag in the bam that > I do not have because I use SOAPSplice. You suggested to use the > old dexseq_count.py whithout the check (from DEXSeq 1.4). > It worked and then I used the following script: > > sampleFiles.R_ExonOUT<-Files > sampleName.R_ExonOUT<-Names > sampleCondition.R_ExonOUT<-c(rep("HYPOXIA",2),rep("CTRL",4),rep( "HYPOXIA",2)) > sampleExperiment.R_ExonOUT<-c(rep("RUN_2",4),rep("RUN_1",4)) > sampleTable.R_ExonOUT <- data.frame(sampleName = sampleName.R_ExonOUT, > fileName = sampleFiles.R_ExonOUT, > condition = sampleCondition.R_ExonOUT, > experiment = sampleExperiment.R_ExonOUT) > inDir = getwd() > annotationfile = file.path > ("/lustre1/genomes/hg19/annotation","Homo_sapiens.ensembl72.DEXS eq.gff") > > ecs = read.HTSeqCounts(countfiles = file.path(inDir, > sampleTable.R_ExonOUT$fileName),design = sampleTable.R_ExonOUT, > flattenedfile = annotationfile) > > sampleNames(ecs) = sampleTable.R_ExonOUT$sampleName > ecs <- estimateSizeFactors(ecs) > library(parallel) > ecs <- estimateDispersions(ecs,nCores=8) > ecs <- fitDispersionFunction(ecs) > ecs <- testForDEU(ecs, nCores=8) > ecs <- estimatelog2FoldChanges(ecs, nCores=8) > res<- DEUresultTable(ecs) > > The problem is that I have some exons with a ridiculous log2FC but > with a very good p.adjust. > Same thing with the ecs.24 or ecs.48 split objects. Here an example: > > head(res.48[which(res.48$geneID=="ENSG00000148516"),]) > > geneID exonID dispersion pvalue padjust meanBase > log2fold(HYPOXIA/CTRL) > > ENSG00000148516:E036 ENSG00000148516 E036 0.014798679 > 2.873434e-16 5.646223e-14 171.5313 -6.075811e-01 > > ENSG00000148516:E049 ENSG00000148516 E049 0.011425856 > 2.461690e-14 2.846653e-12 414.4351 -1.907197e-01 > > ENSG00000148516:E039 ENSG00000148516 E039 0.014486497 > 2.332678e-13 2.043916e-11 181.3705 -4.226252e-01 > > *ENSG00000148516:E050 ENSG00000148516 E050 0.009733072 > 1.131825e-12 8.326638e-11 1432.6492 -1.278668e-05* > > ENSG00000148516:E033 ENSG00000148516 E033 0.037143254 > 3.483915e-12 2.236853e-10 514.5010 -5.273017e-01 > > ENSG00000148516:E034 ENSG00000148516 E034 0.019826955 > 4.660942e-12 2.896722e-10 113.6851 -6.541261e-01 > > > If you look at the plot (just a few exons around E50) > > plotDEXSeq(ecs.48,"ENSG00000148516",splicing=T) > > > Immagine in linea 3 > > It seems clear that all those p-values cannot come from those > log2FC that are adjusted for expression of all exons coming from > the same gene. > > I have checked the design and formula: > > design(ecs.48) > > sampleName fileName condition experiment > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL RUN_2 > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL RUN_2 > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL RUN_1 > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL RUN_1 > > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam > HYPOXIA RUN_1 > > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam > HYPOXIA RUN_1 > > > formula(ecs.48) > > $formulaDispersion > > [1] "~sample + exon + condition:exon" > > > $formula0 > > [1] "~sample + exon" > > > $formula1 > > [1] "~sample + exon + condition:exon" > > > So, I am a bit stuck with it. I guess everything comes from having > used different versions but I cannot come across it. Summarizing: > > SOASPSplice > > dexseq_prepare_annotation.py (From DEXSeq 1.6) with Ensembl72 > (hg19) -r no > > dexseq_count.py (From DEXSeq 1.4) > > Analysis (DEXSeq 1.8) > > Thanks for the help, > > > Jose > > > > sessionInfo() > > R version 3.0.1 (2013-05-16) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > locale: > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods > > [8] base > > > other attached packages: > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > loaded via a namespace (and not attached): > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > 2014-03-19 13:18 GMT+01:00 Jose Garcia > <garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>: > > Hi Alejandro, > I am analyzing with DEXSeq my data. 4 CTRLs and 2 Treated > samples. My design is the following: > > design(ecs.24) > > sampleName fileName condition experiment > > H1 H1 Exon_Martelli_Sample_Martelli_H_1.bam > HYPOXIA RUN_2 > > H2 H2 Exon_Martelli_Sample_Martelli_H_2.bam > HYPOXIA RUN_2 > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL > RUN_2 > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL > RUN_2 > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam > CTRL RUN_1 > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam > CTRL RUN_1 > > When I follow the vignette: > > ecs.24 <- estimateDispersions(ecs.24,nCores=8) > > ....Done > > ecs.24 <- fitDispersionFunction(ecs.24) > > ....Done > > ecs.24 <- testForDEU(ecs.24, nCores=8) > > ..... > > ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8) > > *Error in `row.names<-.data.frame`(`*tmp*`, value = > c("geneID", "exonID", : * > > * duplicate 'row.names' are not allowed* > > *Calls: estimatelog2FoldChanges ... pData<- -> pData<- -> > row.names<- -> row.names<-.data.frame* > > *In addition: Warning message:* > > *non-unique value when setting 'row.names': > 'log2fold(CTRL/HYPOXIA)' * > > > I checked for duplication as you had suggested elsewhere > > any(duplicated(featureNames(ecs.24))) > > [1] FALSE > > any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":"))) > > [1] FALSE > > > I also checked that the condition in design where factors: > > > is.factor(pData(ecs.24)$condition) > > [1] TRUE > > > The only explanation I could come to is the fact that I have > an even number of samples for control and treated? or that I > have the 'experiment' column in the design, but it would be > irrelevant since the default formula is only taking condition > into consideration, isn't it? > > What could be the origin of the error? > > Thanks again! > > Jose > > > > > sessionInfo() > > R version 3.0.1 (2013-05-16) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > locale: > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets > methods > > [8] base > > > other attached packages: > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > loaded via a namespace (and not attached): > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > 2014-03-13 16:32 GMT+01:00 Alejandro Reyes > <alejandro.reyes at="" embl.de="" <mailto:alejandro.reyes="" at="" embl.de="">>: > > Dear Xiayu Rao, > > Thanks for your interest in DEXSeq. > That looks strange, does it happen when you use files > different from the > one you generated by your own? Could you maybe send me > (offline) your > gtf file and the first 1000 lines of one of your sam files? > > Bests, > Alejandro > > > Hello, > > > > DEXSeq is a great tool. Thank you for that. I recently > generate my own gtf file with the same format as the > exon.gff generated by dexseq_prepare_annotation.py. > However, the output is strange. I tried to find the reason > with no success. Could you please provide any idea about > that problem? Thank you very much in advance! > > > > Note: I used the latest dexseq, and the sam files had > been sorted. > > > > 1 genes.gtf exonic_part 12228 12594 . > + . exonic_part_number "001"; gene_id > "ENSG00000223972" > > 1 genes.gtf exonic_part 12722 12974 . > + . exonic_part_number "002"; gene_id > "ENSG00000223972" > > 1 genes.gtf exonic_part 13053 13220 . > + . exonic_part_number "003"; gene_id > "ENSG00000223972" > > 1 genes.gtf exonic_part 14830 14969 . > - . exonic_part_number "001"; gene_id > "ENSG00000223972+ENSG00000227232" > > ............. > > > > > > ==> SRR791043_counts.txt <== > > :001G00027000003" > > :002G00021000003" > > :003G00070000003" > > :004G00040000003" > > :005G00060000003" > > :006G00030000003" > > :007G00019000003" > > :008G00045600003" > > :009G00020400003" > > :001G00000000005" > > > > > > Thanks, > > Xiayu > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it="">

Annotation limma DEXSeq Annotation limma DEXSeq • 1.7k views

ADD COMMENT • link updated 10.6 years ago by Jose M Garcia Manteiga ▴ 310 • written 10.6 years ago by Alejandro Reyes ★ 1.9k

0

Entering edit mode

Jose M Garcia Manteiga ▴ 310

@jose-m-garcia-manteiga-6046

Last seen 7.7 years ago

Italy

Hi Alejandro, I apologize, I did not see the answer in http://permalink.gmane.org/gmane.science.biology.informatics.conductor /53937 I was waiting for a Bioc... sorry about that. Ok, then those exons with very low log2FC and low p.value would belong, if I got it right, to genes with many differentially used exons and some with a very high log2FC, in that case, the linear model would recognise wrongly as DU the complementary set of exons that actually are not. Since you say that luckily these cases are exceptions but I have ~98000 exons with p.adjust<0.05, I could have something really interesting or a terrible flaw in my 48h compared to my 24 h treated samples. If the former case, I could run DEXSeq at the gene-level to identify the genes and trust, which log2FC? or which p.values? to detect interesting exons? I first thought to put a threshold of log2FC, but the "volcano" was strange with few volcano-like behaviour. or better should I make gene-level DEXSeq and then filter out those genes with very huge log2FC exons. Thanks again for your help Jose 2014-04-03 13:15 GMT+02:00 Alejandro Reyes <alejandro.reyes@embl.de>: > Hi Jose, > > I have an e-mail answering to this thread on the 24.03.2014, maybe you > missed it or did I write your e-mail wrong? > > > http://permalink.gmane.org/gmane.science.biology.informatics.conduct or/53937 > > Your concern is answered by the second point that I describe there. If you > look at your "fitted splicing" plot, you can see this. The extreme case > is the coefficients fitted > for your exon E032, it has a value of ~40,000 in one of your conditions > and a value of ~800 on > your other conditions. This will affect the estimation of relative exon > abundances from > your other exons. As I mentioned before, this is a limitation of the > DEXSeq model, > but luckily, genes like this cases seem to be exceptions rather than the > rule > (at least in my experience!). > > About using the output from voom to test for DEU, I have not explored > that option, > but maybe the maintainers/authors of that package are able to guide you > better. > > Hope it is useful, > Alejandro > > > > > > Dear Alejandro, > > Have you had time to take a look at my problem (please see below)? > > I am now using DEXSeq 1.9 to analyze the same ecs objects I had > > analyzed with 1.8 but it produces the very same results. The problem > > was regarding too many exons with very low log2FC and very low > > p.values. I send here an object with the subsetByGene (ecs.one) with > > one particular gene. The E029 has a very low p.value with a very low > > log2FC. Either the log2FCs are not OK or the p.values. I cannot > > understand how such low log2FC for the DEU analysis can give those low > > p.values. Indeed the complete original ecs gave me 98000 exons with > > table(res.48$padjust<0.05). > > However the same analysis (ecs object 4CTRLS vs 4 TREATED) gave me > > nice results when analysed with DEXSeq 1.6 on the complete design > > without splitting into two. > > Here's the picture with expression and splicing values: > > Immagine in linea 2 > > > > Here's the design of the ecs object created with CTRL vs HYPOXIA (only > > at 48h): > > > design(ecs.one) > > sampleName fileName condition > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL > > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam HYPOXIA > > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam HYPOXIA > > > > Maybe the sampleNames?? N1andN2 come from another batch but it is > > still a CTRL. If they were different I would expect higher dispersions > > and hence higher p.values not lower ones, wouldn't I? > > I have tried to trace the problem a bit with these gene: > > > > modelFrame<-constructModelFrame(ecs.one) > > formula0 = ~sample + exon > > formula1 = ~sample + exon + condition:exon > > mm0<-DEXSeq:::rmDepCols(model.matrix(formula0,modelFrame)) > > mm1<-DEXSeq:::rmDepCols(model.matrix(formula1,modelFrame)) > > > count<-DEXSeq:::getCountVector(ecs=ecs.one,geneID="ENSG00000170017", "E029") > > > > > mm0 > > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1 > > sampleN2 exonothers > > 1 1 0 0 0 1 0 > > 0 > > 2 1 0 0 0 0 1 > > 0 > > *3 1 0 0 0 0 > > 0 0* > > 4 1 1 0 0 0 0 > > 0 > > 5 1 0 1 0 0 0 > > 0 > > 6 1 0 0 1 0 0 > > 0 > > 7 1 0 0 0 1 0 > > 1 > > 8 1 0 0 0 0 1 > > 1 > > 9 1 0 0 0 0 0 > > 1 > > 10 1 1 0 0 0 0 > > 1 > > 11 1 0 1 0 0 0 > > 1 > > 12 1 0 0 1 0 0 > > 1 > > attr(,"assign") > > [1] 0 1 1 1 1 1 2 > > attr(,"contrasts") > > attr(,"contrasts")$sample > > [1] "contr.treatment" > > > > attr(,"contrasts")$exon > > [1] "contr.treatment" > > > > Does it seem OK to you? I guess the intercept is CTRL2 (in bold) but > > why? Does it have to do with the 'CTRL' string in the sampleName? I > > tried to change the sample names to CTRL1,CTRL2... but the result was > > the same. > > > > Here's the mm1 > > > mm1 > > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1 > > sampleN2 exonothers exonthis:conditionHYPOXIA > > 1 1 0 0 0 1 0 > > 0 0 > > 2 1 0 0 0 0 1 > > 0 0 > > 3 1 0 0 0 0 0 > > 0 0 > > 4 1 1 0 0 0 0 > > 0 0 > > 5 1 0 1 0 0 0 > > 0 1 > > 6 1 0 0 1 0 0 > > 0 1 > > 7 1 0 0 0 1 0 > > 1 0 > > 8 1 0 0 0 0 1 > > 1 0 > > 9 1 0 0 0 0 0 > > 1 0 > > 10 1 1 0 0 0 0 1 > > 0 > > 11 1 0 1 0 0 0 1 > > 0 > > 12 1 0 0 1 0 0 1 > > 0 > > > sessionInfo() > > R Under development (unstable) (2014-02-09 r64949) > > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > > > locale: > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods > > base > > > > other attached packages: > > [1] edgeR_3.5.28 limma_3.19.28 DEXSeq_1.9.5 > > Biobase_2.23.6 BiocGenerics_0.9.3 vimcom.plus_0.9-93 setwidth_1.0-3 > > colorout_1.0-2 > > > > loaded via a namespace (and not attached): > > [1] AnnotationDbi_1.25.15 biomaRt_2.19.3 Biostrings_2.31.15 > > bitops_1.0-6 DBI_0.2-7 GenomeInfoDb_0.99.19 > > GenomicRanges_1.15.39 > > [8] hwriter_1.3 IRanges_1.21.34 RCurl_1.95-4.1 > > Rsamtools_1.15.33 RSQLite_0.11.4 statmod_1.4.18 stats4_3.1.0 > > [15] stringr_0.6.2 tools_3.1.0 XML_3.98-1.1 > > XVector_0.3.7 zlibbioc_1.9.0 > > > > > > I hope you can give me some hints since I am a bit confused and stuck > > with these results. > > By the way, for the other Bioc, I know limma/voom used on exons can > > also work nicely. Is there an easy way to implement a sort of DEU test > > with limma voom counts? I guess the annotation gtf used to count them > > should be used to construct models and include it in a similar way > > with formulae in linearmodels as DEXSeq does with glmnb.fit function. > > It would be perfect to have it straight as a function also in limma to > > compare results. > > > > Thanks again for your efforts, > > > > Looking forward to hearing to your comments. > > Jose > > > > > > > > > > 2014-03-20 16:07 GMT+01:00 Jose Garcia > > <garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>>: > > > > Hi Alejandro, > > I solved the problem by re-creating the object ecs.24. I had made > > one DEXSeq analysis up to the end by first creating an ecs object. > > Then I just split the ecs object, which already had p.value and > > other info, and re-run the analysis from sizeFactors on onto the > > new split ecs.24 object. > > Now it has worked. > > However, I have obtained a much harder to interpret result which > > points to something wrong I do not know why. And it is present in > > both the split and the original ecs.24 and ecs objects. > > From scratch: > > I made dexseq_prepare_annotation.py with the script from DEXSeq > > 1.6 which contained the '-r' parameter in order to avoid counting > > exons overlapping different genes. Then I tried to count using the > > new dexseq_count.py in the same package but it gave me an error > > because it had been introduced a check for NH tag in the bam that > > I do not have because I use SOAPSplice. You suggested to use the > > old dexseq_count.py whithout the check (from DEXSeq 1.4). > > It worked and then I used the following script: > > > > sampleFiles.R_ExonOUT<-Files > > sampleName.R_ExonOUT<-Names > > > sampleCondition.R_ExonOUT<-c(rep("HYPOXIA",2),rep("CTRL",4),rep("HYP OXIA",2)) > > sampleExperiment.R_ExonOUT<-c(rep("RUN_2",4),rep("RUN_1",4)) > > sampleTable.R_ExonOUT <- data.frame(sampleName = > sampleName.R_ExonOUT, > > fileName = sampleFiles.R_ExonOUT, > > condition = sampleCondition.R_ExonOUT, > > experiment = sampleExperiment.R_ExonOUT) > > inDir = getwd() > > annotationfile = file.path > > > ("/lustre1/genomes/hg19/annotation","Homo_sapiens.ensembl72.DEXSeq.g ff") > > > > ecs = read.HTSeqCounts(countfiles = file.path(inDir, > > sampleTable.R_ExonOUT$fileName),design = sampleTable.R_ExonOUT, > > flattenedfile = annotationfile) > > > > sampleNames(ecs) = sampleTable.R_ExonOUT$sampleName > > ecs <- estimateSizeFactors(ecs) > > library(parallel) > > ecs <- estimateDispersions(ecs,nCores=8) > > ecs <- fitDispersionFunction(ecs) > > ecs <- testForDEU(ecs, nCores=8) > > ecs <- estimatelog2FoldChanges(ecs, nCores=8) > > res<- DEUresultTable(ecs) > > > > The problem is that I have some exons with a ridiculous log2FC but > > with a very good p.adjust. > > Same thing with the ecs.24 or ecs.48 split objects. Here an example: > > > > head(res.48[which(res.48$geneID=="ENSG00000148516"),]) > > > > geneID exonID dispersion pvalue padjust meanBase > > log2fold(HYPOXIA/CTRL) > > > > ENSG00000148516:E036 ENSG00000148516 E036 0.014798679 > > 2.873434e-16 5.646223e-14 171.5313 -6.075811e-01 > > > > ENSG00000148516:E049 ENSG00000148516 E049 0.011425856 > > 2.461690e-14 2.846653e-12 414.4351 -1.907197e-01 > > > > ENSG00000148516:E039 ENSG00000148516 E039 0.014486497 > > 2.332678e-13 2.043916e-11 181.3705 -4.226252e-01 > > > > *ENSG00000148516:E050 ENSG00000148516 E050 0.009733072 > > 1.131825e-12 8.326638e-11 1432.6492 -1.278668e-05* > > > > ENSG00000148516:E033 ENSG00000148516 E033 0.037143254 > > 3.483915e-12 2.236853e-10 514.5010 -5.273017e-01 > > > > ENSG00000148516:E034 ENSG00000148516 E034 0.019826955 > > 4.660942e-12 2.896722e-10 113.6851 -6.541261e-01 > > > > > > If you look at the plot (just a few exons around E50) > > > > plotDEXSeq(ecs.48,"ENSG00000148516",splicing=T) > > > > > > Immagine in linea 3 > > > > It seems clear that all those p-values cannot come from those > > log2FC that are adjusted for expression of all exons coming from > > the same gene. > > > > I have checked the design and formula: > > > > design(ecs.48) > > > > sampleName fileName condition experiment > > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL RUN_2 > > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL RUN_2 > > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL > RUN_1 > > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL > RUN_1 > > > > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam > > HYPOXIA RUN_1 > > > > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam > > HYPOXIA RUN_1 > > > > > > formula(ecs.48) > > > > $formulaDispersion > > > > [1] "~sample + exon + condition:exon" > > > > > > $formula0 > > > > [1] "~sample + exon" > > > > > > $formula1 > > > > [1] "~sample + exon + condition:exon" > > > > > > So, I am a bit stuck with it. I guess everything comes from having > > used different versions but I cannot come across it. Summarizing: > > > > SOASPSplice > > > > dexseq_prepare_annotation.py (From DEXSeq 1.6) with Ensembl72 > > (hg19) -r no > > > > dexseq_count.py (From DEXSeq 1.4) > > > > Analysis (DEXSeq 1.8) > > > > Thanks for the help, > > > > > > Jose > > > > > > > > sessionInfo() > > > > R version 3.0.1 (2013-05-16) > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > locale: > > > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > > > > > attached base packages: > > > > [1] parallel stats graphics grDevices utils datasets methods > > > > [8] base > > > > > > other attached packages: > > > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > > > > loaded via a namespace (and not attached): > > > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > > > > > 2014-03-19 13:18 GMT+01:00 Jose Garcia > > <garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>>: > > > > Hi Alejandro, > > I am analyzing with DEXSeq my data. 4 CTRLs and 2 Treated > > samples. My design is the following: > > > > design(ecs.24) > > > > sampleName fileName condition experiment > > > > H1 H1 Exon_Martelli_Sample_Martelli_H_1.bam > > HYPOXIA RUN_2 > > > > H2 H2 Exon_Martelli_Sample_Martelli_H_2.bam > > HYPOXIA RUN_2 > > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL > > RUN_2 > > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL > > RUN_2 > > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam > > CTRL RUN_1 > > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam > > CTRL RUN_1 > > > > When I follow the vignette: > > > > ecs.24 <- estimateDispersions(ecs.24,nCores=8) > > > > ....Done > > > > ecs.24 <- fitDispersionFunction(ecs.24) > > > > ....Done > > > > ecs.24 <- testForDEU(ecs.24, nCores=8) > > > > ..... > > > > ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8) > > > > *Error in `row.names<-.data.frame`(`*tmp*`, value = > > c("geneID", "exonID", : * > > > > * duplicate 'row.names' are not allowed* > > > > *Calls: estimatelog2FoldChanges ... pData<- -> pData<- -> > > row.names<- -> row.names<-.data.frame* > > > > *In addition: Warning message:* > > > > *non-unique value when setting 'row.names': > > 'log2fold(CTRL/HYPOXIA)' * > > > > > > I checked for duplication as you had suggested elsewhere > > > > any(duplicated(featureNames(ecs.24))) > > > > [1] FALSE > > > > any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":"))) > > > > [1] FALSE > > > > > > I also checked that the condition in design where factors: > > > > > > is.factor(pData(ecs.24)$condition) > > > > [1] TRUE > > > > > > The only explanation I could come to is the fact that I have > > an even number of samples for control and treated? or that I > > have the 'experiment' column in the design, but it would be > > irrelevant since the default formula is only taking condition > > into consideration, isn't it? > > > > What could be the origin of the error? > > > > Thanks again! > > > > Jose > > > > > > > > > sessionInfo() > > > > R version 3.0.1 (2013-05-16) > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > locale: > > > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > > > > > attached base packages: > > > > [1] parallel stats graphics grDevices utils datasets > > methods > > > > [8] base > > > > > > other attached packages: > > > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > > > > loaded via a namespace (and not attached): > > > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > > > > > 2014-03-13 16:32 GMT+01:00 Alejandro Reyes > > <alejandro.reyes@embl.de <mailto:alejandro.reyes@embl.de="">>: > > > > Dear Xiayu Rao, > > > > Thanks for your interest in DEXSeq. > > That looks strange, does it happen when you use files > > different from the > > one you generated by your own? Could you maybe send me > > (offline) your > > gtf file and the first 1000 lines of one of your sam files? > > > > Bests, > > Alejandro > > > > > Hello, > > > > > > DEXSeq is a great tool. Thank you for that. I recently > > generate my own gtf file with the same format as the > > exon.gff generated by dexseq_prepare_annotation.py. > > However, the output is strange. I tried to find the reason > > with no success. Could you please provide any idea about > > that problem? Thank you very much in advance! > > > > > > Note: I used the latest dexseq, and the sam files had > > been sorted. > > > > > > 1 genes.gtf exonic_part 12228 12594 . > > + . exonic_part_number "001"; gene_id > > "ENSG00000223972" > > > 1 genes.gtf exonic_part 12722 12974 . > > + . exonic_part_number "002"; gene_id > > "ENSG00000223972" > > > 1 genes.gtf exonic_part 13053 13220 . > > + . exonic_part_number "003"; gene_id > > "ENSG00000223972" > > > 1 genes.gtf exonic_part 14830 14969 . > > - . exonic_part_number "001"; gene_id > > "ENSG00000223972+ENSG00000227232" > > > ............. > > > > > > > > > ==> SRR791043_counts.txt <== > > > :001G00027000003" > > > :002G00021000003" > > > :003G00070000003" > > > :004G00040000003" > > > :005G00060000003" > > > :006G00030000003" > > > :007G00019000003" > > > :008G00045600003" > > > :009G00020400003" > > > :001G00000000005" > > > > > > > > > Thanks, > > > Xiayu > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org <mailto:> Bioconductor@r-project.org> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > -- > > Jose M. Garcia Manteiga PhD > > Research Assistant - Data Analysis in Functional Genomics > > Center for Translational Genomics and BioInformatics > > Dibit2-Basilica, 3A3 > > San Raffaele Scientific Institute > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > Tel: +39-02-2643-4751 > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > > > > > > > > > -- > > Jose M. Garcia Manteiga PhD > > Research Assistant - Data Analysis in Functional Genomics > > Center for Translational Genomics and BioInformatics > > Dibit2-Basilica, 3A3 > > San Raffaele Scientific Institute > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > Tel: +39-02-2643-4751 > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > > > > > > > > > -- > > Jose M. Garcia Manteiga PhD > > Research Assistant - Data Analysis in Functional Genomics > > Center for Translational Genomics and BioInformatics > > Dibit2-Basilica, 3A3 > > San Raffaele Scientific Institute > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > Tel: +39-02-2643-4751 > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > -- Jose M. Garcia Manteiga PhD Research Assistant - Data Analysis in Functional Genomics Center for Translational Genomics and BioInformatics Dibit2-Basilica, 3A3 San Raffaele Scientific Institute Via Olgettina 58, 20132 Milano (MI), Italy Tel: +39-02-2643-4751 e-mail: garciamanteiga.josemanuel@hsr.it [[alternative HTML version deleted]]

ADD COMMENT • link 10.6 years ago Jose M Garcia Manteiga ▴ 310

0

Entering edit mode

Hi Jose, 98,000 hits!!?? Would if be possible for you to send me your raw input files offline? (via e.g. dropbox, ftp, etc: count files and DEXSeq flattened gtf file), so I can have a closer look at your data? Best regards, Alejandro > Hi Alejandro, > I apologize, I did not see the answer in > http://permalink.gmane.org/gmane.science.biology.informatics.conduct or/53937 > I was waiting for a Bioc... sorry about that. > > Ok, then those exons with very low log2FC and low p.value would > belong, if I got it right, to genes with many differentially used > exons and some with a very high log2FC, in that case, the linear model > would recognise wrongly as DU the complementary set of exons that > actually are not. Since you say that luckily these cases are > exceptions but I have ~98000 exons with p.adjust<0.05, I could have > something really interesting or a terrible flaw in my 48h compared to > my 24 h treated samples. > If the former case, I could run DEXSeq at the gene-level to identify > the genes and trust, which log2FC? or which p.values? to detect > interesting exons? > I first thought to put a threshold of log2FC, but the "volcano" was > strange with few volcano-like behaviour. > or better should I make gene-level DEXSeq and then filter out those > genes with very huge log2FC exons. > Thanks again for your help > Jose > > > > 2014-04-03 13:15 GMT+02:00 Alejandro Reyes <alejandro.reyes at="" embl.de=""> <mailto:alejandro.reyes at="" embl.de="">>: > > Hi Jose, > > I have an e-mail answering to this thread on the 24.03.2014, maybe you > missed it or did I write your e-mail wrong? > > http://permalink.gmane.org/gmane.science.biology.informatics.con ductor/53937 > > Your concern is answered by the second point that I describe > there. If you > look at your "fitted splicing" plot, you can see this. The > extreme case > is the coefficients fitted > for your exon E032, it has a value of ~40,000 in one of your > conditions > and a value of ~800 on > your other conditions. This will affect the estimation of > relative exon > abundances from > your other exons. As I mentioned before, this is a limitation of the > DEXSeq model, > but luckily, genes like this cases seem to be exceptions rather > than the > rule > (at least in my experience!). > > About using the output from voom to test for DEU, I have not explored > that option, > but maybe the maintainers/authors of that package are able to > guide you > better. > > Hope it is useful, > Alejandro > > > > > > Dear Alejandro, > > Have you had time to take a look at my problem (please see below)? > > I am now using DEXSeq 1.9 to analyze the same ecs objects I had > > analyzed with 1.8 but it produces the very same results. The problem > > was regarding too many exons with very low log2FC and very low > > p.values. I send here an object with the subsetByGene (ecs.one) with > > one particular gene. The E029 has a very low p.value with a very low > > log2FC. Either the log2FCs are not OK or the p.values. I cannot > > understand how such low log2FC for the DEU analysis can give > those low > > p.values. Indeed the complete original ecs gave me 98000 exons with > > table(res.48$padjust<0.05). > > However the same analysis (ecs object 4CTRLS vs 4 TREATED) gave me > > nice results when analysed with DEXSeq 1.6 on the complete design > > without splitting into two. > > Here's the picture with expression and splicing values: > > Immagine in linea 2 > > > > Here's the design of the ecs object created with CTRL vs HYPOXIA > (only > > at 48h): > > > design(ecs.one) > > sampleName fileName condition > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam > CTRL > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam > CTRL > > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam > HYPOXIA > > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam > HYPOXIA > > > > Maybe the sampleNames?? N1andN2 come from another batch but it is > > still a CTRL. If they were different I would expect higher > dispersions > > and hence higher p.values not lower ones, wouldn't I? > > I have tried to trace the problem a bit with these gene: > > > > modelFrame<-constructModelFrame(ecs.one) > > formula0 = ~sample + exon > > formula1 = ~sample + exon + condition:exon > > mm0<-DEXSeq:::rmDepCols(model.matrix(formula0,modelFrame)) > > mm1<-DEXSeq:::rmDepCols(model.matrix(formula1,modelFrame)) > > > count<-DEXSeq:::getCountVector(ecs=ecs.one,geneID="ENSG000001700 17","E029") > > > > > mm0 > > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1 > > sampleN2 exonothers > > 1 1 0 0 0 1 0 > > 0 > > 2 1 0 0 0 0 1 > > 0 > > *3 1 0 0 0 0 > > 0 0* > > 4 1 1 0 0 0 0 > > 0 > > 5 1 0 1 0 0 0 > > 0 > > 6 1 0 0 1 0 0 > > 0 > > 7 1 0 0 0 1 0 > > 1 > > 8 1 0 0 0 0 1 > > 1 > > 9 1 0 0 0 0 0 > > 1 > > 10 1 1 0 0 0 0 > > 1 > > 11 1 0 1 0 0 0 > > 1 > > 12 1 0 0 1 0 0 > > 1 > > attr(,"assign") > > [1] 0 1 1 1 1 1 2 > > attr(,"contrasts") > > attr(,"contrasts")$sample > > [1] "contr.treatment" > > > > attr(,"contrasts")$exon > > [1] "contr.treatment" > > > > Does it seem OK to you? I guess the intercept is CTRL2 (in bold) but > > why? Does it have to do with the 'CTRL' string in the sampleName? I > > tried to change the sample names to CTRL1,CTRL2... but the > result was > > the same. > > > > Here's the mm1 > > > mm1 > > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1 > > sampleN2 exonothers exonthis:conditionHYPOXIA > > 1 1 0 0 0 1 0 > > 0 0 > > 2 1 0 0 0 0 1 > > 0 0 > > 3 1 0 0 0 0 0 > > 0 0 > > 4 1 1 0 0 0 0 > > 0 0 > > 5 1 0 1 0 0 0 > > 0 1 > > 6 1 0 0 1 0 0 > > 0 1 > > 7 1 0 0 0 1 0 > > 1 0 > > 8 1 0 0 0 0 1 > > 1 0 > > 9 1 0 0 0 0 0 > > 1 0 > > 10 1 1 0 0 0 0 1 > > 0 > > 11 1 0 1 0 0 0 1 > > 0 > > 12 1 0 0 1 0 0 1 > > 0 > > > sessionInfo() > > R Under development (unstable) (2014-02-09 r64949) > > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > > > locale: > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods > > base > > > > other attached packages: > > [1] edgeR_3.5.28 limma_3.19.28 DEXSeq_1.9.5 > > Biobase_2.23.6 BiocGenerics_0.9.3 vimcom.plus_0.9-93 setwidth_1.0-3 > > colorout_1.0-2 > > > > loaded via a namespace (and not attached): > > [1] AnnotationDbi_1.25.15 biomaRt_2.19.3 Biostrings_2.31.15 > > bitops_1.0-6 DBI_0.2-7 GenomeInfoDb_0.99.19 > > GenomicRanges_1.15.39 > > [8] hwriter_1.3 IRanges_1.21.34 RCurl_1.95-4.1 > > Rsamtools_1.15.33 RSQLite_0.11.4 statmod_1.4.18 > stats4_3.1.0 > > [15] stringr_0.6.2 tools_3.1.0 XML_3.98-1.1 > > XVector_0.3.7 zlibbioc_1.9.0 > > > > > > I hope you can give me some hints since I am a bit confused and > stuck > > with these results. > > By the way, for the other Bioc, I know limma/voom used on exons can > > also work nicely. Is there an easy way to implement a sort of > DEU test > > with limma voom counts? I guess the annotation gtf used to count > them > > should be used to construct models and include it in a similar way > > with formulae in linearmodels as DEXSeq does with glmnb.fit > function. > > It would be perfect to have it straight as a function also in > limma to > > compare results. > > > > Thanks again for your efforts, > > > > Looking forward to hearing to your comments. > > Jose > > > > > > > > > > 2014-03-20 16:07 GMT+01:00 Jose Garcia > > <garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>>: > > > > Hi Alejandro, > > I solved the problem by re-creating the object ecs.24. I had > made > > one DEXSeq analysis up to the end by first creating an ecs > object. > > Then I just split the ecs object, which already had p.value and > > other info, and re-run the analysis from sizeFactors on onto the > > new split ecs.24 object. > > Now it has worked. > > However, I have obtained a much harder to interpret result which > > points to something wrong I do not know why. And it is > present in > > both the split and the original ecs.24 and ecs objects. > > From scratch: > > I made dexseq_prepare_annotation.py with the script from DEXSeq > > 1.6 which contained the '-r' parameter in order to avoid > counting > > exons overlapping different genes. Then I tried to count > using the > > new dexseq_count.py in the same package but it gave me an error > > because it had been introduced a check for NH tag in the bam > that > > I do not have because I use SOAPSplice. You suggested to use the > > old dexseq_count.py whithout the check (from DEXSeq 1.4). > > It worked and then I used the following script: > > > > sampleFiles.R_ExonOUT<-Files > > sampleName.R_ExonOUT<-Names > > > sampleCondition.R_ExonOUT<-c(rep("HYPOXIA",2),rep("CTRL",4),rep( "HYPOXIA",2)) > > sampleExperiment.R_ExonOUT<-c(rep("RUN_2",4),rep("RUN_1",4)) > > sampleTable.R_ExonOUT <- data.frame(sampleName = > sampleName.R_ExonOUT, > > fileName = sampleFiles.R_ExonOUT, > > condition = > sampleCondition.R_ExonOUT, > > experiment = > sampleExperiment.R_ExonOUT) > > inDir = getwd() > > annotationfile = file.path > > > ("/lustre1/genomes/hg19/annotation","Homo_sapiens.ensembl72.DEXS eq.gff") > > > > ecs = read.HTSeqCounts(countfiles = file.path(inDir, > > sampleTable.R_ExonOUT$fileName),design = sampleTable.R_ExonOUT, > > flattenedfile = annotationfile) > > > > sampleNames(ecs) = sampleTable.R_ExonOUT$sampleName > > ecs <- estimateSizeFactors(ecs) > > library(parallel) > > ecs <- estimateDispersions(ecs,nCores=8) > > ecs <- fitDispersionFunction(ecs) > > ecs <- testForDEU(ecs, nCores=8) > > ecs <- estimatelog2FoldChanges(ecs, nCores=8) > > res<- DEUresultTable(ecs) > > > > The problem is that I have some exons with a ridiculous > log2FC but > > with a very good p.adjust. > > Same thing with the ecs.24 or ecs.48 split objects. Here an > example: > > > > head(res.48[which(res.48$geneID=="ENSG00000148516"),]) > > > > geneID exonID dispersion pvalue padjust meanBase > > log2fold(HYPOXIA/CTRL) > > > > ENSG00000148516:E036 ENSG00000148516 E036 0.014798679 > > 2.873434e-16 5.646223e-14 171.5313 -6.075811e-01 > > > > ENSG00000148516:E049 ENSG00000148516 E049 0.011425856 > > 2.461690e-14 2.846653e-12 414.4351 -1.907197e-01 > > > > ENSG00000148516:E039 ENSG00000148516 E039 0.014486497 > > 2.332678e-13 2.043916e-11 181.3705 -4.226252e-01 > > > > *ENSG00000148516:E050 ENSG00000148516 E050 0.009733072 > > 1.131825e-12 8.326638e-11 1432.6492 -1.278668e-05* > > > > ENSG00000148516:E033 ENSG00000148516 E033 0.037143254 > > 3.483915e-12 2.236853e-10 514.5010 -5.273017e-01 > > > > ENSG00000148516:E034 ENSG00000148516 E034 0.019826955 > > 4.660942e-12 2.896722e-10 113.6851 -6.541261e-01 > > > > > > If you look at the plot (just a few exons around E50) > > > > plotDEXSeq(ecs.48,"ENSG00000148516",splicing=T) > > > > > > Immagine in linea 3 > > > > It seems clear that all those p-values cannot come from those > > log2FC that are adjusted for expression of all exons coming from > > the same gene. > > > > I have checked the design and formula: > > > > design(ecs.48) > > > > sampleName fileName condition experiment > > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL RUN_2 > > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL RUN_2 > > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL > RUN_1 > > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL > RUN_1 > > > > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam > > HYPOXIA RUN_1 > > > > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam > > HYPOXIA RUN_1 > > > > > > formula(ecs.48) > > > > $formulaDispersion > > > > [1] "~sample + exon + condition:exon" > > > > > > $formula0 > > > > [1] "~sample + exon" > > > > > > $formula1 > > > > [1] "~sample + exon + condition:exon" > > > > > > So, I am a bit stuck with it. I guess everything comes from > having > > used different versions but I cannot come across it. > Summarizing: > > > > SOASPSplice > > > > dexseq_prepare_annotation.py (From DEXSeq 1.6) with Ensembl72 > > (hg19) -r no > > > > dexseq_count.py (From DEXSeq 1.4) > > > > Analysis (DEXSeq 1.8) > > > > Thanks for the help, > > > > > > Jose > > > > > > > > sessionInfo() > > > > R version 3.0.1 (2013-05-16) > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > locale: > > > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > > > > > attached base packages: > > > > [1] parallel stats graphics grDevices utils datasets > methods > > > > [8] base > > > > > > other attached packages: > > > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > > > > loaded via a namespace (and not attached): > > > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > > > > > 2014-03-19 13:18 GMT+01:00 Jose Garcia > > <garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>>: > > > > Hi Alejandro, > > I am analyzing with DEXSeq my data. 4 CTRLs and 2 Treated > > samples. My design is the following: > > > > design(ecs.24) > > > > sampleName fileName condition experiment > > > > H1 H1 Exon_Martelli_Sample_Martelli_H_1.bam > > HYPOXIA RUN_2 > > > > H2 H2 Exon_Martelli_Sample_Martelli_H_2.bam > > HYPOXIA RUN_2 > > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam > CTRL > > RUN_2 > > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam > CTRL > > RUN_2 > > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam > > CTRL RUN_1 > > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam > > CTRL RUN_1 > > > > When I follow the vignette: > > > > ecs.24 <- estimateDispersions(ecs.24,nCores=8) > > > > ....Done > > > > ecs.24 <- fitDispersionFunction(ecs.24) > > > > ....Done > > > > ecs.24 <- testForDEU(ecs.24, nCores=8) > > > > ..... > > > > ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8) > > > > *Error in `row.names<-.data.frame`(`*tmp*`, value = > > c("geneID", "exonID", : * > > > > * duplicate 'row.names' are not allowed* > > > > *Calls: estimatelog2FoldChanges ... pData<- -> pData<- -> > > row.names<- -> row.names<-.data.frame* > > > > *In addition: Warning message:* > > > > *non-unique value when setting 'row.names': > > 'log2fold(CTRL/HYPOXIA)' * > > > > > > I checked for duplication as you had suggested elsewhere > > > > any(duplicated(featureNames(ecs.24))) > > > > [1] FALSE > > > > any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":"))) > > > > [1] FALSE > > > > > > I also checked that the condition in design where factors: > > > > > > is.factor(pData(ecs.24)$condition) > > > > [1] TRUE > > > > > > The only explanation I could come to is the fact that I have > > an even number of samples for control and treated? or that I > > have the 'experiment' column in the design, but it would be > > irrelevant since the default formula is only taking > condition > > into consideration, isn't it? > > > > What could be the origin of the error? > > > > Thanks again! > > > > Jose > > > > > > > > > sessionInfo() > > > > R version 3.0.1 (2013-05-16) > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > locale: > > > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > > > > > attached base packages: > > > > [1] parallel stats graphics grDevices utils > datasets > > methods > > > > [8] base > > > > > > other attached packages: > > > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > > > > loaded via a namespace (and not attached): > > > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > > > > > 2014-03-13 16:32 GMT+01:00 Alejandro Reyes > > <alejandro.reyes at="" embl.de=""> <mailto:alejandro.reyes at="" embl.de=""> <mailto:alejandro.reyes at="" embl.de=""> <mailto:alejandro.reyes at="" embl.de="">>>: > > > > Dear Xiayu Rao, > > > > Thanks for your interest in DEXSeq. > > That looks strange, does it happen when you use files > > different from the > > one you generated by your own? Could you maybe send me > > (offline) your > > gtf file and the first 1000 lines of one of your sam > files? > > > > Bests, > > Alejandro > > > > > Hello, > > > > > > DEXSeq is a great tool. Thank you for that. I recently > > generate my own gtf file with the same format as the > > exon.gff generated by dexseq_prepare_annotation.py. > > However, the output is strange. I tried to find the > reason > > with no success. Could you please provide any idea about > > that problem? Thank you very much in advance! > > > > > > Note: I used the latest dexseq, and the sam files had > > been sorted. > > > > > > 1 genes.gtf exonic_part 12228 12594 . > > + . exonic_part_number "001"; gene_id > > "ENSG00000223972" > > > 1 genes.gtf exonic_part 12722 12974 . > > + . exonic_part_number "002"; gene_id > > "ENSG00000223972" > > > 1 genes.gtf exonic_part 13053 13220 . > > + . exonic_part_number "003"; gene_id > > "ENSG00000223972" > > > 1 genes.gtf exonic_part 14830 14969 . > > - . exonic_part_number "001"; gene_id > > "ENSG00000223972+ENSG00000227232" > > > ............. > > > > > > > > > ==> SRR791043_counts.txt <== > > > :001G00027000003" > > > :002G00021000003" > > > :003G00070000003" > > > :004G00040000003" > > > :005G00060000003" > > > :006G00030000003" > > > :007G00019000003" > > > :008G00045600003" > > > :009G00020400003" > > > :001G00000000005" > > > > > > > > > Thanks, > > > Xiayu > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > -- > > Jose M. Garcia Manteiga PhD > > Research Assistant - Data Analysis in Functional Genomics > > Center for Translational Genomics and BioInformatics > > Dibit2-Basilica, 3A3 > > San Raffaele Scientific Institute > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> > > > > > > > > > > -- > > Jose M. Garcia Manteiga PhD > > Research Assistant - Data Analysis in Functional Genomics > > Center for Translational Genomics and BioInformatics > > Dibit2-Basilica, 3A3 > > San Raffaele Scientific Institute > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> > > > > > > > > > > -- > > Jose M. Garcia Manteiga PhD > > Research Assistant - Data Analysis in Functional Genomics > > Center for Translational Genomics and BioInformatics > > Dibit2-Basilica, 3A3 > > San Raffaele Scientific Institute > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> > > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it="">

ADD REPLY • link 10.6 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hi again Jose, In the meantime, I just noticed that you mention that N1 and N2 are from a different batch. Did you noticed any obvious batch effect (GC content?)? Any difference in the library preparation or during the sequencing run? The problem with this comparison is that you are getting the combined biological + potential batch effects, this could explain the differences... anyway once I have your data I can explore if DEXSeq is doing something strange. Alejandro > Hi Jose, > > 98,000 hits!!?? Would if be possible for you to send me your raw input > files > offline? (via e.g. dropbox, ftp, etc: count files and DEXSeq flattened > gtf file), so I can > have a closer look at your data? > > Best regards, > Alejandro > > > > > > >> Hi Alejandro, >> I apologize, I did not see the answer in >> http://permalink.gmane.org/gmane.science.biology.informatics.conduc tor/53937 >> I was waiting for a Bioc... sorry about that. >> >> Ok, then those exons with very low log2FC and low p.value would >> belong, if I got it right, to genes with many differentially used >> exons and some with a very high log2FC, in that case, the linear >> model would recognise wrongly as DU the complementary set of exons >> that actually are not. Since you say that luckily these cases are >> exceptions but I have ~98000 exons with p.adjust<0.05, I could have >> something really interesting or a terrible flaw in my 48h compared to >> my 24 h treated samples. >> If the former case, I could run DEXSeq at the gene-level to identify >> the genes and trust, which log2FC? or which p.values? to detect >> interesting exons? >> I first thought to put a threshold of log2FC, but the "volcano" was >> strange with few volcano-like behaviour. >> or better should I make gene-level DEXSeq and then filter out those >> genes with very huge log2FC exons. >> Thanks again for your help >> Jose >> >> >> >> 2014-04-03 13:15 GMT+02:00 Alejandro Reyes <alejandro.reyes at="" embl.de="">> <mailto:alejandro.reyes at="" embl.de="">>: >> >> Hi Jose, >> >> I have an e-mail answering to this thread on the 24.03.2014, >> maybe you >> missed it or did I write your e-mail wrong? >> >> http://permalink.gmane.org/gmane.science.biology.informatics.conduc tor/53937 >> >> Your concern is answered by the second point that I describe >> there. If you >> look at your "fitted splicing" plot, you can see this. The >> extreme case >> is the coefficients fitted >> for your exon E032, it has a value of ~40,000 in one of your >> conditions >> and a value of ~800 on >> your other conditions. This will affect the estimation of >> relative exon >> abundances from >> your other exons. As I mentioned before, this is a limitation of the >> DEXSeq model, >> but luckily, genes like this cases seem to be exceptions rather >> than the >> rule >> (at least in my experience!). >> >> About using the output from voom to test for DEU, I have not >> explored >> that option, >> but maybe the maintainers/authors of that package are able to >> guide you >> better. >> >> Hope it is useful, >> Alejandro >> >> >> >> >> > Dear Alejandro, >> > Have you had time to take a look at my problem (please see below)? >> > I am now using DEXSeq 1.9 to analyze the same ecs objects I had >> > analyzed with 1.8 but it produces the very same results. The >> problem >> > was regarding too many exons with very low log2FC and very low >> > p.values. I send here an object with the subsetByGene (ecs.one) >> with >> > one particular gene. The E029 has a very low p.value with a >> very low >> > log2FC. Either the log2FCs are not OK or the p.values. I cannot >> > understand how such low log2FC for the DEU analysis can give >> those low >> > p.values. Indeed the complete original ecs gave me 98000 exons >> with >> > table(res.48$padjust<0.05). >> > However the same analysis (ecs object 4CTRLS vs 4 TREATED) gave me >> > nice results when analysed with DEXSeq 1.6 on the complete design >> > without splitting into two. >> > Here's the picture with expression and splicing values: >> > Immagine in linea 2 >> > >> > Here's the design of the ecs object created with CTRL vs HYPOXIA >> (only >> > at 48h): >> > > design(ecs.one) >> > sampleName fileName condition >> > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam >> CTRL >> > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam >> CTRL >> > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam >> CTRL >> > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam >> CTRL >> > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam >> HYPOXIA >> > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam >> HYPOXIA >> > >> > Maybe the sampleNames?? N1andN2 come from another batch but it is >> > still a CTRL. If they were different I would expect higher >> dispersions >> > and hence higher p.values not lower ones, wouldn't I? >> > I have tried to trace the problem a bit with these gene: >> > >> > modelFrame<-constructModelFrame(ecs.one) >> > formula0 = ~sample + exon >> > formula1 = ~sample + exon + condition:exon >> > mm0<-DEXSeq:::rmDepCols(model.matrix(formula0,modelFrame)) >> > mm1<-DEXSeq:::rmDepCols(model.matrix(formula1,modelFrame)) >> > >> count<-DEXSeq:::getCountVector(ecs=ecs.one,geneID="ENSG00000170017" ,"E029") >> > >> > > mm0 >> > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1 >> > sampleN2 exonothers >> > 1 1 0 0 0 1 0 >> > 0 >> > 2 1 0 0 0 0 1 >> > 0 >> > *3 1 0 0 0 0 >> > 0 0* >> > 4 1 1 0 0 0 0 >> > 0 >> > 5 1 0 1 0 0 0 >> > 0 >> > 6 1 0 0 1 0 0 >> > 0 >> > 7 1 0 0 0 1 0 >> > 1 >> > 8 1 0 0 0 0 1 >> > 1 >> > 9 1 0 0 0 0 0 >> > 1 >> > 10 1 1 0 0 0 0 >> > 1 >> > 11 1 0 1 0 0 0 >> > 1 >> > 12 1 0 0 1 0 0 >> > 1 >> > attr(,"assign") >> > [1] 0 1 1 1 1 1 2 >> > attr(,"contrasts") >> > attr(,"contrasts")$sample >> > [1] "contr.treatment" >> > >> > attr(,"contrasts")$exon >> > [1] "contr.treatment" >> > >> > Does it seem OK to you? I guess the intercept is CTRL2 (in >> bold) but >> > why? Does it have to do with the 'CTRL' string in the >> sampleName? I >> > tried to change the sample names to CTRL1,CTRL2... but the >> result was >> > the same. >> > >> > Here's the mm1 >> > > mm1 >> > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1 >> > sampleN2 exonothers exonthis:conditionHYPOXIA >> > 1 1 0 0 0 1 0 >> > 0 0 >> > 2 1 0 0 0 0 1 >> > 0 0 >> > 3 1 0 0 0 0 0 >> > 0 0 >> > 4 1 1 0 0 0 0 >> > 0 0 >> > 5 1 0 1 0 0 0 >> > 0 1 >> > 6 1 0 0 1 0 0 >> > 0 1 >> > 7 1 0 0 0 1 0 >> > 1 0 >> > 8 1 0 0 0 0 1 >> > 1 0 >> > 9 1 0 0 0 0 0 >> > 1 0 >> > 10 1 1 0 0 0 0 1 >> > 0 >> > 11 1 0 1 0 0 0 1 >> > 0 >> > 12 1 0 0 1 0 0 1 >> > 0 >> > > sessionInfo() >> > R Under development (unstable) (2014-02-09 r64949) >> > Platform: x86_64-apple-darwin10.8.0 (64-bit) >> > >> > locale: >> > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> > >> > attached base packages: >> > [1] parallel stats graphics grDevices utils datasets methods >> > base >> > >> > other attached packages: >> > [1] edgeR_3.5.28 limma_3.19.28 DEXSeq_1.9.5 >> > Biobase_2.23.6 BiocGenerics_0.9.3 vimcom.plus_0.9-93 >> setwidth_1.0-3 >> > colorout_1.0-2 >> > >> > loaded via a namespace (and not attached): >> > [1] AnnotationDbi_1.25.15 biomaRt_2.19.3 Biostrings_2.31.15 >> > bitops_1.0-6 DBI_0.2-7 GenomeInfoDb_0.99.19 >> > GenomicRanges_1.15.39 >> > [8] hwriter_1.3 IRanges_1.21.34 RCurl_1.95-4.1 >> > Rsamtools_1.15.33 RSQLite_0.11.4 statmod_1.4.18 >> stats4_3.1.0 >> > [15] stringr_0.6.2 tools_3.1.0 XML_3.98-1.1 >> > XVector_0.3.7 zlibbioc_1.9.0 >> > >> > >> > I hope you can give me some hints since I am a bit confused and >> stuck >> > with these results. >> > By the way, for the other Bioc, I know limma/voom used on exons >> can >> > also work nicely. Is there an easy way to implement a sort of >> DEU test >> > with limma voom counts? I guess the annotation gtf used to count >> them >> > should be used to construct models and include it in a similar way >> > with formulae in linearmodels as DEXSeq does with glmnb.fit >> function. >> > It would be perfect to have it straight as a function also in >> limma to >> > compare results. >> > >> > Thanks again for your efforts, >> > >> > Looking forward to hearing to your comments. >> > Jose >> > >> > >> > >> > >> > 2014-03-20 16:07 GMT+01:00 Jose Garcia >> > <garciamanteiga.josemanuel at="" hsr.it="">> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> >> > <mailto:garciamanteiga.josemanuel at="" hsr.it="">> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>>: >> > >> > Hi Alejandro, >> > I solved the problem by re-creating the object ecs.24. I had >> made >> > one DEXSeq analysis up to the end by first creating an ecs >> object. >> > Then I just split the ecs object, which already had p.value >> and >> > other info, and re-run the analysis from sizeFactors on >> onto the >> > new split ecs.24 object. >> > Now it has worked. >> > However, I have obtained a much harder to interpret result >> which >> > points to something wrong I do not know why. And it is >> present in >> > both the split and the original ecs.24 and ecs objects. >> > From scratch: >> > I made dexseq_prepare_annotation.py with the script from >> DEXSeq >> > 1.6 which contained the '-r' parameter in order to avoid >> counting >> > exons overlapping different genes. Then I tried to count >> using the >> > new dexseq_count.py in the same package but it gave me an >> error >> > because it had been introduced a check for NH tag in the bam >> that >> > I do not have because I use SOAPSplice. You suggested to >> use the >> > old dexseq_count.py whithout the check (from DEXSeq 1.4). >> > It worked and then I used the following script: >> > >> > sampleFiles.R_ExonOUT<-Files >> > sampleName.R_ExonOUT<-Names >> > >> sampleCondition.R_ExonOUT<-c(rep("HYPOXIA",2),rep("CTRL",4),rep("HY POXIA",2)) >> > sampleExperiment.R_ExonOUT<-c(rep("RUN_2",4),rep("RUN_1",4)) >> > sampleTable.R_ExonOUT <- data.frame(sampleName = >> sampleName.R_ExonOUT, >> > fileName = sampleFiles.R_ExonOUT, >> > condition = >> sampleCondition.R_ExonOUT, >> > experiment = >> sampleExperiment.R_ExonOUT) >> > inDir = getwd() >> > annotationfile = file.path >> > >> ("/lustre1/genomes/hg19/annotation","Homo_sapiens.ensembl72.DEXSeq. gff") >> > >> > ecs = read.HTSeqCounts(countfiles = file.path(inDir, >> > sampleTable.R_ExonOUT$fileName),design = >> sampleTable.R_ExonOUT, >> > flattenedfile = annotationfile) >> > >> > sampleNames(ecs) = sampleTable.R_ExonOUT$sampleName >> > ecs <- estimateSizeFactors(ecs) >> > library(parallel) >> > ecs <- estimateDispersions(ecs,nCores=8) >> > ecs <- fitDispersionFunction(ecs) >> > ecs <- testForDEU(ecs, nCores=8) >> > ecs <- estimatelog2FoldChanges(ecs, nCores=8) >> > res<- DEUresultTable(ecs) >> > >> > The problem is that I have some exons with a ridiculous >> log2FC but >> > with a very good p.adjust. >> > Same thing with the ecs.24 or ecs.48 split objects. Here an >> example: >> > >> > head(res.48[which(res.48$geneID=="ENSG00000148516"),]) >> > >> > geneID exonID dispersion pvalue padjust meanBase >> > log2fold(HYPOXIA/CTRL) >> > >> > ENSG00000148516:E036 ENSG00000148516 E036 0.014798679 >> > 2.873434e-16 5.646223e-14 171.5313 -6.075811e-01 >> > >> > ENSG00000148516:E049 ENSG00000148516 E049 0.011425856 >> > 2.461690e-14 2.846653e-12 414.4351 -1.907197e-01 >> > >> > ENSG00000148516:E039 ENSG00000148516 E039 0.014486497 >> > 2.332678e-13 2.043916e-11 181.3705 -4.226252e-01 >> > >> > *ENSG00000148516:E050 ENSG00000148516 E050 0.009733072 >> > 1.131825e-12 8.326638e-11 1432.6492 -1.278668e-05* >> > >> > ENSG00000148516:E033 ENSG00000148516 E033 0.037143254 >> > 3.483915e-12 2.236853e-10 514.5010 -5.273017e-01 >> > >> > ENSG00000148516:E034 ENSG00000148516 E034 0.019826955 >> > 4.660942e-12 2.896722e-10 113.6851 -6.541261e-01 >> > >> > >> > If you look at the plot (just a few exons around E50) >> > >> > plotDEXSeq(ecs.48,"ENSG00000148516",splicing=T) >> > >> > >> > Immagine in linea 3 >> > >> > It seems clear that all those p-values cannot come from those >> > log2FC that are adjusted for expression of all exons coming >> from >> > the same gene. >> > >> > I have checked the design and formula: >> > >> > design(ecs.48) >> > >> > sampleName fileName condition experiment >> > >> > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL >> RUN_2 >> > >> > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL >> RUN_2 >> > >> > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL >> RUN_1 >> > >> > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL >> RUN_1 >> > >> > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam >> > HYPOXIA RUN_1 >> > >> > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam >> > HYPOXIA RUN_1 >> > >> > >> > formula(ecs.48) >> > >> > $formulaDispersion >> > >> > [1] "~sample + exon + condition:exon" >> > >> > >> > $formula0 >> > >> > [1] "~sample + exon" >> > >> > >> > $formula1 >> > >> > [1] "~sample + exon + condition:exon" >> > >> > >> > So, I am a bit stuck with it. I guess everything comes from >> having >> > used different versions but I cannot come across it. >> Summarizing: >> > >> > SOASPSplice >> > >> > dexseq_prepare_annotation.py (From DEXSeq 1.6) with Ensembl72 >> > (hg19) -r no >> > >> > dexseq_count.py (From DEXSeq 1.4) >> > >> > Analysis (DEXSeq 1.8) >> > >> > Thanks for the help, >> > >> > >> > Jose >> > >> > >> > >> > sessionInfo() >> > >> > R version 3.0.1 (2013-05-16) >> > >> > Platform: x86_64-unknown-linux-gnu (64-bit) >> > >> > >> > locale: >> > >> > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US >> > >> > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US >> > >> > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C >> > >> > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C >> > >> > >> > attached base packages: >> > >> > [1] parallel stats graphics grDevices utils datasets >> methods >> > >> > [8] base >> > >> > >> > other attached packages: >> > >> > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 >> > >> > >> > loaded via a namespace (and not attached): >> > >> > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 >> > >> > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 >> > >> > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 >> > >> > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 >> > >> > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 >> > >> > >> > >> > 2014-03-19 13:18 GMT+01:00 Jose Garcia >> > <garciamanteiga.josemanuel at="" hsr.it="">> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> >> > <mailto:garciamanteiga.josemanuel at="" hsr.it="">> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>>: >> > >> > Hi Alejandro, >> > I am analyzing with DEXSeq my data. 4 CTRLs and 2 Treated >> > samples. My design is the following: >> > >> > design(ecs.24) >> > >> > sampleName fileName condition experiment >> > >> > H1 H1 Exon_Martelli_Sample_Martelli_H_1.bam >> > HYPOXIA RUN_2 >> > >> > H2 H2 Exon_Martelli_Sample_Martelli_H_2.bam >> > HYPOXIA RUN_2 >> > >> > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam >> CTRL >> > RUN_2 >> > >> > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam >> CTRL >> > RUN_2 >> > >> > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam >> > CTRL RUN_1 >> > >> > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam >> > CTRL RUN_1 >> > >> > When I follow the vignette: >> > >> > ecs.24 <- estimateDispersions(ecs.24,nCores=8) >> > >> > ....Done >> > >> > ecs.24 <- fitDispersionFunction(ecs.24) >> > >> > ....Done >> > >> > ecs.24 <- testForDEU(ecs.24, nCores=8) >> > >> > ..... >> > >> > ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8) >> > >> > *Error in `row.names<-.data.frame`(`*tmp*`, value = >> > c("geneID", "exonID", : * >> > >> > * duplicate 'row.names' are not allowed* >> > >> > *Calls: estimatelog2FoldChanges ... pData<- -> pData<- -> >> > row.names<- -> row.names<-.data.frame* >> > >> > *In addition: Warning message:* >> > >> > *non-unique value when setting 'row.names': >> > 'log2fold(CTRL/HYPOXIA)' * >> > >> > >> > I checked for duplication as you had suggested elsewhere >> > >> > any(duplicated(featureNames(ecs.24))) >> > >> > [1] FALSE >> > >> > any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":"))) >> > >> > [1] FALSE >> > >> > >> > I also checked that the condition in design where factors: >> > >> > >> > is.factor(pData(ecs.24)$condition) >> > >> > [1] TRUE >> > >> > >> > The only explanation I could come to is the fact that I >> have >> > an even number of samples for control and treated? or >> that I >> > have the 'experiment' column in the design, but it >> would be >> > irrelevant since the default formula is only taking >> condition >> > into consideration, isn't it? >> > >> > What could be the origin of the error? >> > >> > Thanks again! >> > >> > Jose >> > >> > >> > >> > > sessionInfo() >> > >> > R version 3.0.1 (2013-05-16) >> > >> > Platform: x86_64-unknown-linux-gnu (64-bit) >> > >> > >> > locale: >> > >> > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US >> > >> > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US >> > >> > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C >> > >> > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US >> LC_IDENTIFICATION=C >> > >> > >> > attached base packages: >> > >> > [1] parallel stats graphics grDevices utils >> datasets >> > methods >> > >> > [8] base >> > >> > >> > other attached packages: >> > >> > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 >> > >> > >> > loaded via a namespace (and not attached): >> > >> > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 >> > >> > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 >> > >> > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 >> > >> > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 >> > >> > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 >> > >> > >> > >> > 2014-03-13 16:32 GMT+01:00 Alejandro Reyes >> > <alejandro.reyes at="" embl.de="">> <mailto:alejandro.reyes at="" embl.de=""> <mailto:alejandro.reyes at="" embl.de="">> <mailto:alejandro.reyes at="" embl.de="">>>: >> > >> > Dear Xiayu Rao, >> > >> > Thanks for your interest in DEXSeq. >> > That looks strange, does it happen when you use files >> > different from the >> > one you generated by your own? Could you maybe send me >> > (offline) your >> > gtf file and the first 1000 lines of one of your sam >> files? >> > >> > Bests, >> > Alejandro >> > >> > > Hello, >> > > >> > > DEXSeq is a great tool. Thank you for that. I >> recently >> > generate my own gtf file with the same format as the >> > exon.gff generated by dexseq_prepare_annotation.py. >> > However, the output is strange. I tried to find the >> reason >> > with no success. Could you please provide any idea >> about >> > that problem? Thank you very much in advance! >> > > >> > > Note: I used the latest dexseq, and the sam files >> had >> > been sorted. >> > > >> > > 1 genes.gtf exonic_part 12228 12594 . >> > + . exonic_part_number "001"; gene_id >> > "ENSG00000223972" >> > > 1 genes.gtf exonic_part 12722 12974 . >> > + . exonic_part_number "002"; gene_id >> > "ENSG00000223972" >> > > 1 genes.gtf exonic_part 13053 13220 . >> > + . exonic_part_number "003"; gene_id >> > "ENSG00000223972" >> > > 1 genes.gtf exonic_part 14830 14969 . >> > - . exonic_part_number "001"; gene_id >> > "ENSG00000223972+ENSG00000227232" >> > > ............. >> > > >> > > >> > > ==> SRR791043_counts.txt <== >> > > :001G00027000003" >> > > :002G00021000003" >> > > :003G00070000003" >> > > :004G00040000003" >> > > :005G00060000003" >> > > :006G00030000003" >> > > :007G00019000003" >> > > :008G00045600003" >> > > :009G00020400003" >> > > :001G00000000005" >> > > >> > > >> > > Thanks, >> > > Xiayu >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> <mailto:bioconductor at="" r-project.org="">> <mailto:bioconductor at="" r-project.org="">> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > >> > >> > >> > -- >> > Jose M. Garcia Manteiga PhD >> > Research Assistant - Data Analysis in Functional Genomics >> > Center for Translational Genomics and BioInformatics >> > Dibit2-Basilica, 3A3 >> > San Raffaele Scientific Institute >> > Via Olgettina 58, 20132 Milano (MI), Italy >> > >> > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> >> > e-mail: garciamanteiga.josemanuel at hsr.it >> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> >> > <mailto:garciamanteiga.josemanuel at="" hsr.it="">> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> >> > >> > >> > >> > >> > -- >> > Jose M. Garcia Manteiga PhD >> > Research Assistant - Data Analysis in Functional Genomics >> > Center for Translational Genomics and BioInformatics >> > Dibit2-Basilica, 3A3 >> > San Raffaele Scientific Institute >> > Via Olgettina 58, 20132 Milano (MI), Italy >> > >> > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> >> > e-mail: garciamanteiga.josemanuel at hsr.it >> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> >> > <mailto:garciamanteiga.josemanuel at="" hsr.it="">> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> >> > >> > >> > >> > >> > -- >> > Jose M. Garcia Manteiga PhD >> > Research Assistant - Data Analysis in Functional Genomics >> > Center for Translational Genomics and BioInformatics >> > Dibit2-Basilica, 3A3 >> > San Raffaele Scientific Institute >> > Via Olgettina 58, 20132 Milano (MI), Italy >> > >> > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> >> > e-mail: garciamanteiga.josemanuel at hsr.it >> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> >> > <mailto:garciamanteiga.josemanuel at="" hsr.it="">> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> >> >> >> >> >> -- >> Jose M. Garcia Manteiga PhD >> Research Assistant - Data Analysis in Functional Genomics >> Center for Translational Genomics and BioInformatics >> Dibit2-Basilica, 3A3 >> San Raffaele Scientific Institute >> Via Olgettina 58, 20132 Milano (MI), Italy >> >> Tel: +39-02-2643-4751 >> e-mail: garciamanteiga.josemanuel at hsr.it >> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> >

ADD REPLY • link 10.6 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hi Alejandro, Yes, there is a batch effect. There are two experiments, exp1 with Controls and Hypoxia samples at 24h and a second at 48h. However the controls were always at 24h. In DESeq2 (and a limma/voom on exons I have to fetch the plot, I'll send it to you later) pca, you clearly see that the controls cluster well together(N1,N2,CTRL2,CTRL3) while 24 and 48 hours were separated. A DESeq2 analysis of the kind ~condition + experiment (but also ~condition ) gave good results. Also DEXSeq CTRL vs HYPOXIA (not including experiment as other factor) But in anycase, I would have expected higher p.values rather than lower. I will explore potential differences in terms of GC and other and will send you a dropbox link for the RData of the ecs and res objects. Thanks a lot! J 2014-04-03 14:47 GMT+02:00 Alejandro Reyes <alejandro.reyes@embl.de>: > Hi Jose, > > 98,000 hits!!?? Would if be possible for you to send me your raw input > files > offline? (via e.g. dropbox, ftp, etc: count files and DEXSeq flattened > gtf file), so I can > have a closer look at your data? > > Best regards, > Alejandro > > > > > > > > Hi Alejandro, > > I apologize, I did not see the answer in > > > http://permalink.gmane.org/gmane.science.biology.informatics.conduct or/53937 > > I was waiting for a Bioc... sorry about that. > > > > Ok, then those exons with very low log2FC and low p.value would > > belong, if I got it right, to genes with many differentially used > > exons and some with a very high log2FC, in that case, the linear model > > would recognise wrongly as DU the complementary set of exons that > > actually are not. Since you say that luckily these cases are > > exceptions but I have ~98000 exons with p.adjust<0.05, I could have > > something really interesting or a terrible flaw in my 48h compared to > > my 24 h treated samples. > > If the former case, I could run DEXSeq at the gene-level to identify > > the genes and trust, which log2FC? or which p.values? to detect > > interesting exons? > > I first thought to put a threshold of log2FC, but the "volcano" was > > strange with few volcano-like behaviour. > > or better should I make gene-level DEXSeq and then filter out those > > genes with very huge log2FC exons. > > Thanks again for your help > > Jose > > > > > > > > 2014-04-03 13:15 GMT+02:00 Alejandro Reyes <alejandro.reyes@embl.de> > <mailto:alejandro.reyes@embl.de>>: > > > > Hi Jose, > > > > I have an e-mail answering to this thread on the 24.03.2014, maybe > you > > missed it or did I write your e-mail wrong? > > > > > http://permalink.gmane.org/gmane.science.biology.informatics.conduct or/53937 > > > > Your concern is answered by the second point that I describe > > there. If you > > look at your "fitted splicing" plot, you can see this. The > > extreme case > > is the coefficients fitted > > for your exon E032, it has a value of ~40,000 in one of your > > conditions > > and a value of ~800 on > > your other conditions. This will affect the estimation of > > relative exon > > abundances from > > your other exons. As I mentioned before, this is a limitation of the > > DEXSeq model, > > but luckily, genes like this cases seem to be exceptions rather > > than the > > rule > > (at least in my experience!). > > > > About using the output from voom to test for DEU, I have not explored > > that option, > > but maybe the maintainers/authors of that package are able to > > guide you > > better. > > > > Hope it is useful, > > Alejandro > > > > > > > > > > > Dear Alejandro, > > > Have you had time to take a look at my problem (please see below)? > > > I am now using DEXSeq 1.9 to analyze the same ecs objects I had > > > analyzed with 1.8 but it produces the very same results. The > problem > > > was regarding too many exons with very low log2FC and very low > > > p.values. I send here an object with the subsetByGene (ecs.one) > with > > > one particular gene. The E029 has a very low p.value with a very > low > > > log2FC. Either the log2FCs are not OK or the p.values. I cannot > > > understand how such low log2FC for the DEU analysis can give > > those low > > > p.values. Indeed the complete original ecs gave me 98000 exons with > > > table(res.48$padjust<0.05). > > > However the same analysis (ecs object 4CTRLS vs 4 TREATED) gave me > > > nice results when analysed with DEXSeq 1.6 on the complete design > > > without splitting into two. > > > Here's the picture with expression and splicing values: > > > Immagine in linea 2 > > > > > > Here's the design of the ecs object created with CTRL vs HYPOXIA > > (only > > > at 48h): > > > > design(ecs.one) > > > sampleName fileName condition > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam > > CTRL > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam > > CTRL > > > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam > > HYPOXIA > > > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam > > HYPOXIA > > > > > > Maybe the sampleNames?? N1andN2 come from another batch but it is > > > still a CTRL. If they were different I would expect higher > > dispersions > > > and hence higher p.values not lower ones, wouldn't I? > > > I have tried to trace the problem a bit with these gene: > > > > > > modelFrame<-constructModelFrame(ecs.one) > > > formula0 = ~sample + exon > > > formula1 = ~sample + exon + condition:exon > > > mm0<-DEXSeq:::rmDepCols(model.matrix(formula0,modelFrame)) > > > mm1<-DEXSeq:::rmDepCols(model.matrix(formula1,modelFrame)) > > > > > > count<-DEXSeq:::getCountVector(ecs=ecs.one,geneID="ENSG00000170017", "E029") > > > > > > > mm0 > > > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1 > > > sampleN2 exonothers > > > 1 1 0 0 0 1 0 > > > 0 > > > 2 1 0 0 0 0 1 > > > 0 > > > *3 1 0 0 0 0 > > > 0 0* > > > 4 1 1 0 0 0 > 0 > > > 0 > > > 5 1 0 1 0 0 0 > > > 0 > > > 6 1 0 0 1 0 0 > > > 0 > > > 7 1 0 0 0 1 0 > > > 1 > > > 8 1 0 0 0 0 1 > > > 1 > > > 9 1 0 0 0 0 0 > > > 1 > > > 10 1 1 0 0 0 0 > > > 1 > > > 11 1 0 1 0 0 0 > > > 1 > > > 12 1 0 0 1 0 0 > > > 1 > > > attr(,"assign") > > > [1] 0 1 1 1 1 1 2 > > > attr(,"contrasts") > > > attr(,"contrasts")$sample > > > [1] "contr.treatment" > > > > > > attr(,"contrasts")$exon > > > [1] "contr.treatment" > > > > > > Does it seem OK to you? I guess the intercept is CTRL2 (in bold) > but > > > why? Does it have to do with the 'CTRL' string in the sampleName? I > > > tried to change the sample names to CTRL1,CTRL2... but the > > result was > > > the same. > > > > > > Here's the mm1 > > > > mm1 > > > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 sampleN1 > > > sampleN2 exonothers exonthis:conditionHYPOXIA > > > 1 1 0 0 0 1 0 > > > 0 0 > > > 2 1 0 0 0 0 1 > > > 0 0 > > > 3 1 0 0 0 0 0 > > > 0 0 > > > 4 1 1 0 0 0 0 > > > 0 0 > > > 5 1 0 1 0 0 0 > > > 0 1 > > > 6 1 0 0 1 0 0 > > > 0 1 > > > 7 1 0 0 0 1 0 > > > 1 0 > > > 8 1 0 0 0 0 1 > > > 1 0 > > > 9 1 0 0 0 0 0 > > > 1 0 > > > 10 1 1 0 0 0 0 1 > > > 0 > > > 11 1 0 1 0 0 0 1 > > > 0 > > > 12 1 0 0 1 0 0 1 > > > 0 > > > > sessionInfo() > > > R Under development (unstable) (2014-02-09 r64949) > > > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > > > > > locale: > > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > > > > > attached base packages: > > > [1] parallel stats graphics grDevices utils datasets methods > > > base > > > > > > other attached packages: > > > [1] edgeR_3.5.28 limma_3.19.28 DEXSeq_1.9.5 > > > Biobase_2.23.6 BiocGenerics_0.9.3 vimcom.plus_0.9-93 setwidth_1.0-3 > > > colorout_1.0-2 > > > > > > loaded via a namespace (and not attached): > > > [1] AnnotationDbi_1.25.15 biomaRt_2.19.3 Biostrings_2.31.15 > > > bitops_1.0-6 DBI_0.2-7 GenomeInfoDb_0.99.19 > > > GenomicRanges_1.15.39 > > > [8] hwriter_1.3 IRanges_1.21.34 RCurl_1.95-4.1 > > > Rsamtools_1.15.33 RSQLite_0.11.4 statmod_1.4.18 > > stats4_3.1.0 > > > [15] stringr_0.6.2 tools_3.1.0 XML_3.98-1.1 > > > XVector_0.3.7 zlibbioc_1.9.0 > > > > > > > > > I hope you can give me some hints since I am a bit confused and > > stuck > > > with these results. > > > By the way, for the other Bioc, I know limma/voom used on exons can > > > also work nicely. Is there an easy way to implement a sort of > > DEU test > > > with limma voom counts? I guess the annotation gtf used to count > > them > > > should be used to construct models and include it in a similar way > > > with formulae in linearmodels as DEXSeq does with glmnb.fit > > function. > > > It would be perfect to have it straight as a function also in > > limma to > > > compare results. > > > > > > Thanks again for your efforts, > > > > > > Looking forward to hearing to your comments. > > > Jose > > > > > > > > > > > > > > > 2014-03-20 16:07 GMT+01:00 Jose Garcia > > > <garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>>>: > > > > > > Hi Alejandro, > > > I solved the problem by re-creating the object ecs.24. I had > > made > > > one DEXSeq analysis up to the end by first creating an ecs > > object. > > > Then I just split the ecs object, which already had p.value and > > > other info, and re-run the analysis from sizeFactors on onto > the > > > new split ecs.24 object. > > > Now it has worked. > > > However, I have obtained a much harder to interpret result > which > > > points to something wrong I do not know why. And it is > > present in > > > both the split and the original ecs.24 and ecs objects. > > > From scratch: > > > I made dexseq_prepare_annotation.py with the script from DEXSeq > > > 1.6 which contained the '-r' parameter in order to avoid > > counting > > > exons overlapping different genes. Then I tried to count > > using the > > > new dexseq_count.py in the same package but it gave me an error > > > because it had been introduced a check for NH tag in the bam > > that > > > I do not have because I use SOAPSplice. You suggested to use > the > > > old dexseq_count.py whithout the check (from DEXSeq 1.4). > > > It worked and then I used the following script: > > > > > > sampleFiles.R_ExonOUT<-Files > > > sampleName.R_ExonOUT<-Names > > > > > > sampleCondition.R_ExonOUT<-c(rep("HYPOXIA",2),rep("CTRL",4),rep("HYP OXIA",2)) > > > sampleExperiment.R_ExonOUT<-c(rep("RUN_2",4),rep("RUN_1",4)) > > > sampleTable.R_ExonOUT <- data.frame(sampleName = > > sampleName.R_ExonOUT, > > > fileName = sampleFiles.R_ExonOUT, > > > condition = > > sampleCondition.R_ExonOUT, > > > experiment = > > sampleExperiment.R_ExonOUT) > > > inDir = getwd() > > > annotationfile = file.path > > > > > > ("/lustre1/genomes/hg19/annotation","Homo_sapiens.ensembl72.DEXSeq.g ff") > > > > > > ecs = read.HTSeqCounts(countfiles = file.path(inDir, > > > sampleTable.R_ExonOUT$fileName),design = sampleTable.R_ExonOUT, > > > flattenedfile = annotationfile) > > > > > > sampleNames(ecs) = sampleTable.R_ExonOUT$sampleName > > > ecs <- estimateSizeFactors(ecs) > > > library(parallel) > > > ecs <- estimateDispersions(ecs,nCores=8) > > > ecs <- fitDispersionFunction(ecs) > > > ecs <- testForDEU(ecs, nCores=8) > > > ecs <- estimatelog2FoldChanges(ecs, nCores=8) > > > res<- DEUresultTable(ecs) > > > > > > The problem is that I have some exons with a ridiculous > > log2FC but > > > with a very good p.adjust. > > > Same thing with the ecs.24 or ecs.48 split objects. Here an > > example: > > > > > > head(res.48[which(res.48$geneID=="ENSG00000148516"),]) > > > > > > geneID exonID dispersion pvalue padjust meanBase > > > log2fold(HYPOXIA/CTRL) > > > > > > ENSG00000148516:E036 ENSG00000148516 E036 0.014798679 > > > 2.873434e-16 5.646223e-14 171.5313 -6.075811e-01 > > > > > > ENSG00000148516:E049 ENSG00000148516 E049 0.011425856 > > > 2.461690e-14 2.846653e-12 414.4351 -1.907197e-01 > > > > > > ENSG00000148516:E039 ENSG00000148516 E039 0.014486497 > > > 2.332678e-13 2.043916e-11 181.3705 -4.226252e-01 > > > > > > *ENSG00000148516:E050 ENSG00000148516 E050 0.009733072 > > > 1.131825e-12 8.326638e-11 1432.6492 -1.278668e-05* > > > > > > ENSG00000148516:E033 ENSG00000148516 E033 0.037143254 > > > 3.483915e-12 2.236853e-10 514.5010 -5.273017e-01 > > > > > > ENSG00000148516:E034 ENSG00000148516 E034 0.019826955 > > > 4.660942e-12 2.896722e-10 113.6851 -6.541261e-01 > > > > > > > > > If you look at the plot (just a few exons around E50) > > > > > > plotDEXSeq(ecs.48,"ENSG00000148516",splicing=T) > > > > > > > > > Immagine in linea 3 > > > > > > It seems clear that all those p-values cannot come from those > > > log2FC that are adjusted for expression of all exons coming > from > > > the same gene. > > > > > > I have checked the design and formula: > > > > > > design(ecs.48) > > > > > > sampleName fileName condition experiment > > > > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam CTRL > RUN_2 > > > > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam CTRL > RUN_2 > > > > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam CTRL > > RUN_1 > > > > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam CTRL > > RUN_1 > > > > > > HYPOXIA2 HYPOXIA2 Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam > > > HYPOXIA RUN_1 > > > > > > HYPOXIA3 HYPOXIA3 Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam > > > HYPOXIA RUN_1 > > > > > > > > > formula(ecs.48) > > > > > > $formulaDispersion > > > > > > [1] "~sample + exon + condition:exon" > > > > > > > > > $formula0 > > > > > > [1] "~sample + exon" > > > > > > > > > $formula1 > > > > > > [1] "~sample + exon + condition:exon" > > > > > > > > > So, I am a bit stuck with it. I guess everything comes from > > having > > > used different versions but I cannot come across it. > > Summarizing: > > > > > > SOASPSplice > > > > > > dexseq_prepare_annotation.py (From DEXSeq 1.6) with Ensembl72 > > > (hg19) -r no > > > > > > dexseq_count.py (From DEXSeq 1.4) > > > > > > Analysis (DEXSeq 1.8) > > > > > > Thanks for the help, > > > > > > > > > Jose > > > > > > > > > > > > sessionInfo() > > > > > > R version 3.0.1 (2013-05-16) > > > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > > > > locale: > > > > > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > > > > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > > > > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > > > > > > > > attached base packages: > > > > > > [1] parallel stats graphics grDevices utils datasets > > methods > > > > > > [8] base > > > > > > > > > other attached packages: > > > > > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > > > > > > > loaded via a namespace (and not attached): > > > > > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > > > > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > > > > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > > > > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > > > > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > > > > > > > > > 2014-03-19 13:18 GMT+01:00 Jose Garcia > > > <garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>>>: > > > > > > Hi Alejandro, > > > I am analyzing with DEXSeq my data. 4 CTRLs and 2 Treated > > > samples. My design is the following: > > > > > > design(ecs.24) > > > > > > sampleName fileName condition experiment > > > > > > H1 H1 Exon_Martelli_Sample_Martelli_H_1.bam > > > HYPOXIA RUN_2 > > > > > > H2 H2 Exon_Martelli_Sample_Martelli_H_2.bam > > > HYPOXIA RUN_2 > > > > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam > > CTRL > > > RUN_2 > > > > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam > > CTRL > > > RUN_2 > > > > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam > > > CTRL RUN_1 > > > > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam > > > CTRL RUN_1 > > > > > > When I follow the vignette: > > > > > > ecs.24 <- estimateDispersions(ecs.24,nCores=8) > > > > > > ....Done > > > > > > ecs.24 <- fitDispersionFunction(ecs.24) > > > > > > ....Done > > > > > > ecs.24 <- testForDEU(ecs.24, nCores=8) > > > > > > ..... > > > > > > ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8) > > > > > > *Error in `row.names<-.data.frame`(`*tmp*`, value = > > > c("geneID", "exonID", : * > > > > > > * duplicate 'row.names' are not allowed* > > > > > > *Calls: estimatelog2FoldChanges ... pData<- -> pData<- -> > > > row.names<- -> row.names<-.data.frame* > > > > > > *In addition: Warning message:* > > > > > > *non-unique value when setting 'row.names': > > > 'log2fold(CTRL/HYPOXIA)' * > > > > > > > > > I checked for duplication as you had suggested elsewhere > > > > > > any(duplicated(featureNames(ecs.24))) > > > > > > [1] FALSE > > > > > > any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":"))) > > > > > > [1] FALSE > > > > > > > > > I also checked that the condition in design where factors: > > > > > > > > > is.factor(pData(ecs.24)$condition) > > > > > > [1] TRUE > > > > > > > > > The only explanation I could come to is the fact that I > have > > > an even number of samples for control and treated? or that > I > > > have the 'experiment' column in the design, but it would be > > > irrelevant since the default formula is only taking > > condition > > > into consideration, isn't it? > > > > > > What could be the origin of the error? > > > > > > Thanks again! > > > > > > Jose > > > > > > > > > > > > > sessionInfo() > > > > > > R version 3.0.1 (2013-05-16) > > > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > > > > locale: > > > > > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > > > > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > > > > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US > LC_IDENTIFICATION=C > > > > > > > > > attached base packages: > > > > > > [1] parallel stats graphics grDevices utils > > datasets > > > methods > > > > > > [8] base > > > > > > > > > other attached packages: > > > > > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > > > > > > > loaded via a namespace (and not attached): > > > > > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > > > > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > > > > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > > > > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > > > > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > > > > > > > > > 2014-03-13 16:32 GMT+01:00 Alejandro Reyes > > > <alejandro.reyes@embl.de> > <mailto:alejandro.reyes@embl.de> <mailto:alejandro.reyes@embl.de> > <mailto:alejandro.reyes@embl.de>>>: > > > > > > Dear Xiayu Rao, > > > > > > Thanks for your interest in DEXSeq. > > > That looks strange, does it happen when you use files > > > different from the > > > one you generated by your own? Could you maybe send me > > > (offline) your > > > gtf file and the first 1000 lines of one of your sam > > files? > > > > > > Bests, > > > Alejandro > > > > > > > Hello, > > > > > > > > DEXSeq is a great tool. Thank you for that. I > recently > > > generate my own gtf file with the same format as the > > > exon.gff generated by dexseq_prepare_annotation.py. > > > However, the output is strange. I tried to find the > > reason > > > with no success. Could you please provide any idea > about > > > that problem? Thank you very much in advance! > > > > > > > > Note: I used the latest dexseq, and the sam files had > > > been sorted. > > > > > > > > 1 genes.gtf exonic_part 12228 12594 . > > > + . exonic_part_number "001"; gene_id > > > "ENSG00000223972" > > > > 1 genes.gtf exonic_part 12722 12974 . > > > + . exonic_part_number "002"; gene_id > > > "ENSG00000223972" > > > > 1 genes.gtf exonic_part 13053 13220 . > > > + . exonic_part_number "003"; gene_id > > > "ENSG00000223972" > > > > 1 genes.gtf exonic_part 14830 14969 . > > > - . exonic_part_number "001"; gene_id > > > "ENSG00000223972+ENSG00000227232" > > > > ............. > > > > > > > > > > > > ==> SRR791043_counts.txt <== > > > > :001G00027000003" > > > > :002G00021000003" > > > > :003G00070000003" > > > > :004G00040000003" > > > > :005G00060000003" > > > > :006G00030000003" > > > > :007G00019000003" > > > > :008G00045600003" > > > > :009G00020400003" > > > > :001G00000000005" > > > > > > > > > > > > Thanks, > > > > Xiayu > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > > <mailto:bioconductor@r-project.org> > <mailto:bioconductor@r-project.org>> > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > > > -- > > > Jose M. Garcia Manteiga PhD > > > Research Assistant - Data Analysis in Functional Genomics > > > Center for Translational Genomics and BioInformatics > > > Dibit2-Basilica, 3A3 > > > San Raffaele Scientific Institute > > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>> > > > > > > > > > > > > > > > -- > > > Jose M. Garcia Manteiga PhD > > > Research Assistant - Data Analysis in Functional Genomics > > > Center for Translational Genomics and BioInformatics > > > Dibit2-Basilica, 3A3 > > > San Raffaele Scientific Institute > > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>> > > > > > > > > > > > > > > > -- > > > Jose M. Garcia Manteiga PhD > > > Research Assistant - Data Analysis in Functional Genomics > > > Center for Translational Genomics and BioInformatics > > > Dibit2-Basilica, 3A3 > > > San Raffaele Scientific Institute > > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>> > > > > > > > > > > -- > > Jose M. Garcia Manteiga PhD > > Research Assistant - Data Analysis in Functional Genomics > > Center for Translational Genomics and BioInformatics > > Dibit2-Basilica, 3A3 > > San Raffaele Scientific Institute > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > Tel: +39-02-2643-4751 > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > -- Jose M. Garcia Manteiga PhD Research Assistant - Data Analysis in Functional Genomics Center for Translational Genomics and BioInformatics Dibit2-Basilica, 3A3 San Raffaele Scientific Institute Via Olgettina 58, 20132 Milano (MI), Italy Tel: +39-02-2643-4751 e-mail: garciamanteiga.josemanuel@hsr.it [[alternative HTML version deleted]]

ADD REPLY • link 10.6 years ago Jose M Garcia Manteiga ▴ 310

0

Entering edit mode

Hi Jose, Thanks for sending your data! I had a look at it and the DEXSeq code seems to be doing what it is suppose to do. The reason you are getting low p-values (instead of high p-values) is because the samples within the same group are very, very similar to each other (the dispersion estimates reflect this) and very,very different from the rest of the samples. You can already see in a pairs plot of the normalized counts. What exactly is HYPOXIA2 and HYPOXIA3, are these independent experiments (cell-lines?) treated with the same condition? I understand these large differences are difficult to explain biologically, I would first go back to the experimentalist and make sure everything during the library preparation was different between your groups. Alejandro > Hi Alejandro, > Yes, there is a batch effect. There are two experiments, exp1 with > Controls and Hypoxia samples at 24h and a second at 48h. However the > controls were always at 24h. In DESeq2 (and a limma/voom on exons I > have to fetch the plot, I'll send it to you later) pca, you clearly > see that the controls cluster well together(N1,N2,CTRL2,CTRL3) while > 24 and 48 hours were separated. A DESeq2 analysis of the kind > ~condition + experiment (but also ~condition ) gave good results. Also > DEXSeq CTRL vs HYPOXIA (not including experiment as other factor) > But in anycase, I would have expected higher p.values rather than lower. > I will explore potential differences in terms of GC and other and will > send you a dropbox link for the RData of the ecs and res objects. > Thanks a lot! > J > > > > 2014-04-03 14:47 GMT+02:00 Alejandro Reyes <alejandro.reyes at="" embl.de=""> <mailto:alejandro.reyes at="" embl.de="">>: > > Hi Jose, > > 98,000 hits!!?? Would if be possible for you to send me your raw input > files > offline? (via e.g. dropbox, ftp, etc: count files and DEXSeq flattened > gtf file), so I can > have a closer look at your data? > > Best regards, > Alejandro > > > > > > > > Hi Alejandro, > > I apologize, I did not see the answer in > > > http://permalink.gmane.org/gmane.science.biology.informatics.con ductor/53937 > > I was waiting for a Bioc... sorry about that. > > > > Ok, then those exons with very low log2FC and low p.value would > > belong, if I got it right, to genes with many differentially used > > exons and some with a very high log2FC, in that case, the linear > model > > would recognise wrongly as DU the complementary set of exons that > > actually are not. Since you say that luckily these cases are > > exceptions but I have ~98000 exons with p.adjust<0.05, I could have > > something really interesting or a terrible flaw in my 48h > compared to > > my 24 h treated samples. > > If the former case, I could run DEXSeq at the gene-level to identify > > the genes and trust, which log2FC? or which p.values? to detect > > interesting exons? > > I first thought to put a threshold of log2FC, but the "volcano" was > > strange with few volcano-like behaviour. > > or better should I make gene-level DEXSeq and then filter out those > > genes with very huge log2FC exons. > > Thanks again for your help > > Jose > > > > > > > > 2014-04-03 13:15 GMT+02:00 Alejandro Reyes > <alejandro.reyes at="" embl.de="" <mailto:alejandro.reyes="" at="" embl.de=""> > > <mailto:alejandro.reyes at="" embl.de="" <mailto:alejandro.reyes="" at="" embl.de="">>>: > > > > Hi Jose, > > > > I have an e-mail answering to this thread on the 24.03.2014, > maybe you > > missed it or did I write your e-mail wrong? > > > > > http://permalink.gmane.org/gmane.science.biology.informatics.con ductor/53937 > > > > Your concern is answered by the second point that I describe > > there. If you > > look at your "fitted splicing" plot, you can see this. The > > extreme case > > is the coefficients fitted > > for your exon E032, it has a value of ~40,000 in one of your > > conditions > > and a value of ~800 on > > your other conditions. This will affect the estimation of > > relative exon > > abundances from > > your other exons. As I mentioned before, this is a > limitation of the > > DEXSeq model, > > but luckily, genes like this cases seem to be exceptions rather > > than the > > rule > > (at least in my experience!). > > > > About using the output from voom to test for DEU, I have not > explored > > that option, > > but maybe the maintainers/authors of that package are able to > > guide you > > better. > > > > Hope it is useful, > > Alejandro > > > > > > > > > > > Dear Alejandro, > > > Have you had time to take a look at my problem (please see > below)? > > > I am now using DEXSeq 1.9 to analyze the same ecs objects > I had > > > analyzed with 1.8 but it produces the very same results. > The problem > > > was regarding too many exons with very low log2FC and very low > > > p.values. I send here an object with the subsetByGene > (ecs.one) with > > > one particular gene. The E029 has a very low p.value with > a very low > > > log2FC. Either the log2FCs are not OK or the p.values. I > cannot > > > understand how such low log2FC for the DEU analysis can give > > those low > > > p.values. Indeed the complete original ecs gave me 98000 > exons with > > > table(res.48$padjust<0.05). > > > However the same analysis (ecs object 4CTRLS vs 4 TREATED) > gave me > > > nice results when analysed with DEXSeq 1.6 on the complete > design > > > without splitting into two. > > > Here's the picture with expression and splicing values: > > > Immagine in linea 2 > > > > > > Here's the design of the ecs object created with CTRL vs > HYPOXIA > > (only > > > at 48h): > > > > design(ecs.one) > > > sampleName fileName condition > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam > CTRL > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam > CTRL > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam > > CTRL > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam > > CTRL > > > HYPOXIA2 HYPOXIA2 > Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam > > HYPOXIA > > > HYPOXIA3 HYPOXIA3 > Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam > > HYPOXIA > > > > > > Maybe the sampleNames?? N1andN2 come from another batch > but it is > > > still a CTRL. If they were different I would expect higher > > dispersions > > > and hence higher p.values not lower ones, wouldn't I? > > > I have tried to trace the problem a bit with these gene: > > > > > > modelFrame<-constructModelFrame(ecs.one) > > > formula0 = ~sample + exon > > > formula1 = ~sample + exon + condition:exon > > > mm0<-DEXSeq:::rmDepCols(model.matrix(formula0,modelFrame)) > > > mm1<-DEXSeq:::rmDepCols(model.matrix(formula1,modelFrame)) > > > > > > count<-DEXSeq:::getCountVector(ecs=ecs.one,geneID="ENSG000001700 17","E029") > > > > > > > mm0 > > > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 > sampleN1 > > > sampleN2 exonothers > > > 1 1 0 0 0 1 0 > > > 0 > > > 2 1 0 0 0 0 1 > > > 0 > > > *3 1 0 0 0 0 > > > 0 0* > > > 4 1 1 0 0 0 0 > > > 0 > > > 5 1 0 1 0 0 0 > > > 0 > > > 6 1 0 0 1 0 0 > > > 0 > > > 7 1 0 0 0 1 0 > > > 1 > > > 8 1 0 0 0 0 1 > > > 1 > > > 9 1 0 0 0 0 0 > > > 1 > > > 10 1 1 0 0 0 0 > > > 1 > > > 11 1 0 1 0 0 0 > > > 1 > > > 12 1 0 0 1 0 0 > > > 1 > > > attr(,"assign") > > > [1] 0 1 1 1 1 1 2 > > > attr(,"contrasts") > > > attr(,"contrasts")$sample > > > [1] "contr.treatment" > > > > > > attr(,"contrasts")$exon > > > [1] "contr.treatment" > > > > > > Does it seem OK to you? I guess the intercept is CTRL2 (in > bold) but > > > why? Does it have to do with the 'CTRL' string in the > sampleName? I > > > tried to change the sample names to CTRL1,CTRL2... but the > > result was > > > the same. > > > > > > Here's the mm1 > > > > mm1 > > > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 > sampleN1 > > > sampleN2 exonothers exonthis:conditionHYPOXIA > > > 1 1 0 0 0 1 0 > > > 0 0 > > > 2 1 0 0 0 0 1 > > > 0 0 > > > 3 1 0 0 0 0 0 > > > 0 0 > > > 4 1 1 0 0 0 0 > > > 0 0 > > > 5 1 0 1 0 0 0 > > > 0 1 > > > 6 1 0 0 1 0 0 > > > 0 1 > > > 7 1 0 0 0 1 0 > > > 1 0 > > > 8 1 0 0 0 0 1 > > > 1 0 > > > 9 1 0 0 0 0 0 > > > 1 0 > > > 10 1 1 0 0 0 0 1 > > > 0 > > > 11 1 0 1 0 0 0 1 > > > 0 > > > 12 1 0 0 1 0 0 1 > > > 0 > > > > sessionInfo() > > > R Under development (unstable) (2014-02-09 r64949) > > > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > > > > > locale: > > > [1] > en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > > > > > attached base packages: > > > [1] parallel stats graphics grDevices utils datasets > methods > > > base > > > > > > other attached packages: > > > [1] edgeR_3.5.28 limma_3.19.28 DEXSeq_1.9.5 > > > Biobase_2.23.6 BiocGenerics_0.9.3 vimcom.plus_0.9-93 > setwidth_1.0-3 > > > colorout_1.0-2 > > > > > > loaded via a namespace (and not attached): > > > [1] AnnotationDbi_1.25.15 biomaRt_2.19.3 Biostrings_2.31.15 > > > bitops_1.0-6 DBI_0.2-7 GenomeInfoDb_0.99.19 > > > GenomicRanges_1.15.39 > > > [8] hwriter_1.3 IRanges_1.21.34 RCurl_1.95-4.1 > > > Rsamtools_1.15.33 RSQLite_0.11.4 statmod_1.4.18 > > stats4_3.1.0 > > > [15] stringr_0.6.2 tools_3.1.0 XML_3.98-1.1 > > > XVector_0.3.7 zlibbioc_1.9.0 > > > > > > > > > I hope you can give me some hints since I am a bit > confused and > > stuck > > > with these results. > > > By the way, for the other Bioc, I know limma/voom used on > exons can > > > also work nicely. Is there an easy way to implement a sort of > > DEU test > > > with limma voom counts? I guess the annotation gtf used to > count > > them > > > should be used to construct models and include it in a > similar way > > > with formulae in linearmodels as DEXSeq does with glmnb.fit > > function. > > > It would be perfect to have it straight as a function also in > > limma to > > > compare results. > > > > > > Thanks again for your efforts, > > > > > > Looking forward to hearing to your comments. > > > Jose > > > > > > > > > > > > > > > 2014-03-20 16:07 GMT+01:00 Jose Garcia > > > <garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> > > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>>>: > > > > > > Hi Alejandro, > > > I solved the problem by re-creating the object ecs.24. > I had > > made > > > one DEXSeq analysis up to the end by first creating an ecs > > object. > > > Then I just split the ecs object, which already had > p.value and > > > other info, and re-run the analysis from sizeFactors > on onto the > > > new split ecs.24 object. > > > Now it has worked. > > > However, I have obtained a much harder to interpret > result which > > > points to something wrong I do not know why. And it is > > present in > > > both the split and the original ecs.24 and ecs objects. > > > From scratch: > > > I made dexseq_prepare_annotation.py with the script > from DEXSeq > > > 1.6 which contained the '-r' parameter in order to avoid > > counting > > > exons overlapping different genes. Then I tried to count > > using the > > > new dexseq_count.py in the same package but it gave me > an error > > > because it had been introduced a check for NH tag in > the bam > > that > > > I do not have because I use SOAPSplice. You suggested > to use the > > > old dexseq_count.py whithout the check (from DEXSeq 1.4). > > > It worked and then I used the following script: > > > > > > sampleFiles.R_ExonOUT<-Files > > > sampleName.R_ExonOUT<-Names > > > > > > sampleCondition.R_ExonOUT<-c(rep("HYPOXIA",2),rep("CTRL",4),rep( "HYPOXIA",2)) > > > sampleExperiment.R_ExonOUT<-c(rep("RUN_2",4),rep("RUN_1",4)) > > > sampleTable.R_ExonOUT <- data.frame(sampleName = > > sampleName.R_ExonOUT, > > > fileName = > sampleFiles.R_ExonOUT, > > > condition = > > sampleCondition.R_ExonOUT, > > > experiment = > > sampleExperiment.R_ExonOUT) > > > inDir = getwd() > > > annotationfile = file.path > > > > > > ("/lustre1/genomes/hg19/annotation","Homo_sapiens.ensembl72.DEXS eq.gff") > > > > > > ecs = read.HTSeqCounts(countfiles = file.path(inDir, > > > sampleTable.R_ExonOUT$fileName),design = > sampleTable.R_ExonOUT, > > > flattenedfile = annotationfile) > > > > > > sampleNames(ecs) = sampleTable.R_ExonOUT$sampleName > > > ecs <- estimateSizeFactors(ecs) > > > library(parallel) > > > ecs <- estimateDispersions(ecs,nCores=8) > > > ecs <- fitDispersionFunction(ecs) > > > ecs <- testForDEU(ecs, nCores=8) > > > ecs <- estimatelog2FoldChanges(ecs, nCores=8) > > > res<- DEUresultTable(ecs) > > > > > > The problem is that I have some exons with a ridiculous > > log2FC but > > > with a very good p.adjust. > > > Same thing with the ecs.24 or ecs.48 split objects. > Here an > > example: > > > > > > head(res.48[which(res.48$geneID=="ENSG00000148516"),]) > > > > > > geneID exonID dispersion pvalue padjust meanBase > > > log2fold(HYPOXIA/CTRL) > > > > > > ENSG00000148516:E036 ENSG00000148516 E036 0.014798679 > > > 2.873434e-16 5.646223e-14 171.5313 -6.075811e-01 > > > > > > ENSG00000148516:E049 ENSG00000148516 E049 0.011425856 > > > 2.461690e-14 2.846653e-12 414.4351 -1.907197e-01 > > > > > > ENSG00000148516:E039 ENSG00000148516 E039 0.014486497 > > > 2.332678e-13 2.043916e-11 181.3705 -4.226252e-01 > > > > > > *ENSG00000148516:E050 ENSG00000148516 E050 0.009733072 > > > 1.131825e-12 8.326638e-11 1432.6492 -1.278668e-05* > > > > > > ENSG00000148516:E033 ENSG00000148516 E033 0.037143254 > > > 3.483915e-12 2.236853e-10 514.5010 -5.273017e-01 > > > > > > ENSG00000148516:E034 ENSG00000148516 E034 0.019826955 > > > 4.660942e-12 2.896722e-10 113.6851 -6.541261e-01 > > > > > > > > > If you look at the plot (just a few exons around E50) > > > > > > plotDEXSeq(ecs.48,"ENSG00000148516",splicing=T) > > > > > > > > > Immagine in linea 3 > > > > > > It seems clear that all those p-values cannot come > from those > > > log2FC that are adjusted for expression of all exons > coming from > > > the same gene. > > > > > > I have checked the design and formula: > > > > > > design(ecs.48) > > > > > > sampleName fileName condition experiment > > > > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam > CTRL RUN_2 > > > > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam > CTRL RUN_2 > > > > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam > CTRL > > RUN_1 > > > > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam > CTRL > > RUN_1 > > > > > > HYPOXIA2 HYPOXIA2 > Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam > > > HYPOXIA RUN_1 > > > > > > HYPOXIA3 HYPOXIA3 > Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam > > > HYPOXIA RUN_1 > > > > > > > > > formula(ecs.48) > > > > > > $formulaDispersion > > > > > > [1] "~sample + exon + condition:exon" > > > > > > > > > $formula0 > > > > > > [1] "~sample + exon" > > > > > > > > > $formula1 > > > > > > [1] "~sample + exon + condition:exon" > > > > > > > > > So, I am a bit stuck with it. I guess everything comes > from > > having > > > used different versions but I cannot come across it. > > Summarizing: > > > > > > SOASPSplice > > > > > > dexseq_prepare_annotation.py (From DEXSeq 1.6) with > Ensembl72 > > > (hg19) -r no > > > > > > dexseq_count.py (From DEXSeq 1.4) > > > > > > Analysis (DEXSeq 1.8) > > > > > > Thanks for the help, > > > > > > > > > Jose > > > > > > > > > > > > sessionInfo() > > > > > > R version 3.0.1 (2013-05-16) > > > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > > > > locale: > > > > > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > > > > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > > > > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US > LC_IDENTIFICATION=C > > > > > > > > > attached base packages: > > > > > > [1] parallel stats graphics grDevices utils datasets > > methods > > > > > > [8] base > > > > > > > > > other attached packages: > > > > > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > > > > > > > loaded via a namespace (and not attached): > > > > > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > > > > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > > > > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > > > > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > > > > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > > > > > > > > > 2014-03-19 13:18 GMT+01:00 Jose Garcia > > > <garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> > > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>>>: > > > > > > Hi Alejandro, > > > I am analyzing with DEXSeq my data. 4 CTRLs and 2 > Treated > > > samples. My design is the following: > > > > > > design(ecs.24) > > > > > > sampleName fileName condition experiment > > > > > > H1 H1 Exon_Martelli_Sample_Martelli_H_1.bam > > > HYPOXIA RUN_2 > > > > > > H2 H2 Exon_Martelli_Sample_Martelli_H_2.bam > > > HYPOXIA RUN_2 > > > > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam > > CTRL > > > RUN_2 > > > > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam > > CTRL > > > RUN_2 > > > > > > CTRL2 CTRL2 > Exon_Martelli_Sample_Martelli_CTRL_2.bam > > > CTRL RUN_1 > > > > > > CTRL3 CTRL3 > Exon_Martelli_Sample_Martelli_CTRL_3.bam > > > CTRL RUN_1 > > > > > > When I follow the vignette: > > > > > > ecs.24 <- estimateDispersions(ecs.24,nCores=8) > > > > > > ....Done > > > > > > ecs.24 <- fitDispersionFunction(ecs.24) > > > > > > ....Done > > > > > > ecs.24 <- testForDEU(ecs.24, nCores=8) > > > > > > ..... > > > > > > ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8) > > > > > > *Error in `row.names<-.data.frame`(`*tmp*`, value = > > > c("geneID", "exonID", : * > > > > > > * duplicate 'row.names' are not allowed* > > > > > > *Calls: estimatelog2FoldChanges ... pData<- -> > pData<- -> > > > row.names<- -> row.names<-.data.frame* > > > > > > *In addition: Warning message:* > > > > > > *non-unique value when setting 'row.names': > > > 'log2fold(CTRL/HYPOXIA)' * > > > > > > > > > I checked for duplication as you had suggested > elsewhere > > > > > > any(duplicated(featureNames(ecs.24))) > > > > > > [1] FALSE > > > > > > > any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":"))) > > > > > > [1] FALSE > > > > > > > > > I also checked that the condition in design where > factors: > > > > > > > > > is.factor(pData(ecs.24)$condition) > > > > > > [1] TRUE > > > > > > > > > The only explanation I could come to is the fact > that I have > > > an even number of samples for control and treated? > or that I > > > have the 'experiment' column in the design, but it > would be > > > irrelevant since the default formula is only taking > > condition > > > into consideration, isn't it? > > > > > > What could be the origin of the error? > > > > > > Thanks again! > > > > > > Jose > > > > > > > > > > > > > sessionInfo() > > > > > > R version 3.0.1 (2013-05-16) > > > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > > > > locale: > > > > > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > > > > > [4] LC_COLLATE=en_US LC_MONETARY=en_US > LC_MESSAGES=en_US > > > > > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US > LC_IDENTIFICATION=C > > > > > > > > > attached base packages: > > > > > > [1] parallel stats graphics grDevices utils > > datasets > > > methods > > > > > > [8] base > > > > > > > > > other attached packages: > > > > > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > > > > > > > loaded via a namespace (and not attached): > > > > > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > > > > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > > > > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > > > > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > > > > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > > > > > > > > > 2014-03-13 16:32 GMT+01:00 Alejandro Reyes > > > <alejandro.reyes at="" embl.de=""> <mailto:alejandro.reyes at="" embl.de=""> > > <mailto:alejandro.reyes at="" embl.de=""> <mailto:alejandro.reyes at="" embl.de="">> <mailto:alejandro.reyes at="" embl.de=""> <mailto:alejandro.reyes at="" embl.de=""> > > <mailto:alejandro.reyes at="" embl.de=""> <mailto:alejandro.reyes at="" embl.de="">>>>: > > > > > > Dear Xiayu Rao, > > > > > > Thanks for your interest in DEXSeq. > > > That looks strange, does it happen when you > use files > > > different from the > > > one you generated by your own? Could you maybe > send me > > > (offline) your > > > gtf file and the first 1000 lines of one of > your sam > > files? > > > > > > Bests, > > > Alejandro > > > > > > > Hello, > > > > > > > > DEXSeq is a great tool. Thank you for that. > I recently > > > generate my own gtf file with the same format > as the > > > exon.gff generated by > dexseq_prepare_annotation.py. > > > However, the output is strange. I tried to > find the > > reason > > > with no success. Could you please provide any > idea about > > > that problem? Thank you very much in advance! > > > > > > > > Note: I used the latest dexseq, and the sam > files had > > > been sorted. > > > > > > > > 1 genes.gtf exonic_part 12228 > 12594 . > > > + . exonic_part_number "001"; gene_id > > > "ENSG00000223972" > > > > 1 genes.gtf exonic_part 12722 > 12974 . > > > + . exonic_part_number "002"; gene_id > > > "ENSG00000223972" > > > > 1 genes.gtf exonic_part 13053 > 13220 . > > > + . exonic_part_number "003"; gene_id > > > "ENSG00000223972" > > > > 1 genes.gtf exonic_part 14830 > 14969 . > > > - . exonic_part_number "001"; gene_id > > > "ENSG00000223972+ENSG00000227232" > > > > ............. > > > > > > > > > > > > ==> SRR791043_counts.txt <== > > > > :001G00027000003" > > > > :002G00021000003" > > > > :003G00070000003" > > > > :004G00040000003" > > > > :005G00060000003" > > > > :006G00030000003" > > > > :007G00019000003" > > > > :008G00045600003" > > > > :009G00020400003" > > > > :001G00000000005" > > > > > > > > > > > > Thanks, > > > > Xiayu > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">> > > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org=""> > > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">>> > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > > > -- > > > Jose M. Garcia Manteiga PhD > > > Research Assistant - Data Analysis in Functional > Genomics > > > Center for Translational Genomics and BioInformatics > > > Dibit2-Basilica, 3A3 > > > San Raffaele Scientific Institute > > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > <tel:%2b39-02-2643-4751> > > > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> > > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>> > > > > > > > > > > > > > > > -- > > > Jose M. Garcia Manteiga PhD > > > Research Assistant - Data Analysis in Functional Genomics > > > Center for Translational Genomics and BioInformatics > > > Dibit2-Basilica, 3A3 > > > San Raffaele Scientific Institute > > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > <tel:%2b39-02-2643-4751> > > > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> > > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>> > > > > > > > > > > > > > > > -- > > > Jose M. Garcia Manteiga PhD > > > Research Assistant - Data Analysis in Functional Genomics > > > Center for Translational Genomics and BioInformatics > > > Dibit2-Basilica, 3A3 > > > San Raffaele Scientific Institute > > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > <tel:%2b39-02-2643-4751> > > > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> > > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">>> > > > > > > > > > > -- > > Jose M. Garcia Manteiga PhD > > Research Assistant - Data Analysis in Functional Genomics > > Center for Translational Genomics and BioInformatics > > Dibit2-Basilica, 3A3 > > San Raffaele Scientific Institute > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> > > <mailto:garciamanteiga.josemanuel at="" hsr.it=""> <mailto:garciamanteiga.josemanuel at="" hsr.it="">> > > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it="">

ADD REPLY • link 10.6 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hi Alejandro, Thanks for your inspection and troubleshooting. The experiment is made upon HUVEC cells, human endothelial cells. CTRLs are all treated with normoxic condition (~20% O2) for 24 hours, but there are two samples from one batch(plating, treating, RNA isolating, library preparing, and sequencing) and the other two from the other batch. The ecs.48 object contains all 4 controls from 2 runs and 2 experiments but all for 24h. And the HYPOXIA2 and HYPOXIA3 are two replicates from the batch number 1 where HUVEC cells were treated with low O2 tension (1%) for 48 hours. Biologically it is supposed to give drammatic changes and indeed DESeq2 has already said so with ~2000 genes UP/DOWN with padj<0.05 with good biological meaning. Besides, when look at them separatedly, 48h treatment induces many genes and covers most of the phenotypic response compared to 24h low O2. That said, however, I got really surprised to get 98000 exons of the 48 h, because this would mean that I have > 1000 genes with exons differentially used upon hypoxia. Although there is background that points to AS relevance in transcriptional response to hypoxia (and that is why we wanted to use DEXSEq on these samples), the previous data using exon arrays gave very limited results( just a handful of exons). It is hard to 'believe', at least. Of course, we came from ~175M reads samples and a robust design, but still... If there is no bug nor flaw and it is only a matter of 'super' sensitivity combined with 'super' changes, would you recommend then just to focus on those exons showing higher log2FC, like >1 (~9000) or even >2 (~2000)? Thanks again for you help Jose 2014-04-08 11:14 GMT+02:00 Alejandro Reyes <alejandro.reyes@embl.de>: > Hi Jose, > > Thanks for sending your data! > > I had a look at it and the DEXSeq code seems to be doing what it is > suppose to do. > > The reason you are getting low p-values (instead of high p-values) is > because the samples within the same group are very, very similar to each > other (the dispersion estimates reflect this) and very,very different > from the rest of the samples. You can already see in a pairs plot of the > normalized counts. What exactly is HYPOXIA2 and HYPOXIA3, are these > independent experiments (cell-lines?) treated with the same condition? I > understand these large differences are difficult to explain > biologically, I would first go back to the experimentalist and make sure > everything during the library preparation was different between your > groups. > > Alejandro > > > > > > Hi Alejandro, > > Yes, there is a batch effect. There are two experiments, exp1 with > > Controls and Hypoxia samples at 24h and a second at 48h. However the > > controls were always at 24h. In DESeq2 (and a limma/voom on exons I > > have to fetch the plot, I'll send it to you later) pca, you clearly > > see that the controls cluster well together(N1,N2,CTRL2,CTRL3) while > > 24 and 48 hours were separated. A DESeq2 analysis of the kind > > ~condition + experiment (but also ~condition ) gave good results. Also > > DEXSeq CTRL vs HYPOXIA (not including experiment as other factor) > > But in anycase, I would have expected higher p.values rather than lower. > > I will explore potential differences in terms of GC and other and will > > send you a dropbox link for the RData of the ecs and res objects. > > Thanks a lot! > > J > > > > > > > > 2014-04-03 14:47 GMT+02:00 Alejandro Reyes <alejandro.reyes@embl.de> > <mailto:alejandro.reyes@embl.de>>: > > > > Hi Jose, > > > > 98,000 hits!!?? Would if be possible for you to send me your raw > input > > files > > offline? (via e.g. dropbox, ftp, etc: count files and DEXSeq > flattened > > gtf file), so I can > > have a closer look at your data? > > > > Best regards, > > Alejandro > > > > > > > > > > > > > > > Hi Alejandro, > > > I apologize, I did not see the answer in > > > > > > http://permalink.gmane.org/gmane.science.biology.informatics.conduct or/53937 > > > I was waiting for a Bioc... sorry about that. > > > > > > Ok, then those exons with very low log2FC and low p.value would > > > belong, if I got it right, to genes with many differentially used > > > exons and some with a very high log2FC, in that case, the linear > > model > > > would recognise wrongly as DU the complementary set of exons that > > > actually are not. Since you say that luckily these cases are > > > exceptions but I have ~98000 exons with p.adjust<0.05, I could have > > > something really interesting or a terrible flaw in my 48h > > compared to > > > my 24 h treated samples. > > > If the former case, I could run DEXSeq at the gene-level to > identify > > > the genes and trust, which log2FC? or which p.values? to detect > > > interesting exons? > > > I first thought to put a threshold of log2FC, but the "volcano" was > > > strange with few volcano-like behaviour. > > > or better should I make gene-level DEXSeq and then filter out those > > > genes with very huge log2FC exons. > > > Thanks again for your help > > > Jose > > > > > > > > > > > > 2014-04-03 13:15 GMT+02:00 Alejandro Reyes > > <alejandro.reyes@embl.de <mailto:alejandro.reyes@embl.de=""> > > > <mailto:alejandro.reyes@embl.de <mailto:alejandro.reyes@embl.de=""> >>>: > > > > > > Hi Jose, > > > > > > I have an e-mail answering to this thread on the 24.03.2014, > > maybe you > > > missed it or did I write your e-mail wrong? > > > > > > > > > http://permalink.gmane.org/gmane.science.biology.informatics.conduct or/53937 > > > > > > Your concern is answered by the second point that I describe > > > there. If you > > > look at your "fitted splicing" plot, you can see this. The > > > extreme case > > > is the coefficients fitted > > > for your exon E032, it has a value of ~40,000 in one of your > > > conditions > > > and a value of ~800 on > > > your other conditions. This will affect the estimation of > > > relative exon > > > abundances from > > > your other exons. As I mentioned before, this is a > > limitation of the > > > DEXSeq model, > > > but luckily, genes like this cases seem to be exceptions rather > > > than the > > > rule > > > (at least in my experience!). > > > > > > About using the output from voom to test for DEU, I have not > > explored > > > that option, > > > but maybe the maintainers/authors of that package are able to > > > guide you > > > better. > > > > > > Hope it is useful, > > > Alejandro > > > > > > > > > > > > > > > > Dear Alejandro, > > > > Have you had time to take a look at my problem (please see > > below)? > > > > I am now using DEXSeq 1.9 to analyze the same ecs objects > > I had > > > > analyzed with 1.8 but it produces the very same results. > > The problem > > > > was regarding too many exons with very low log2FC and very > low > > > > p.values. I send here an object with the subsetByGene > > (ecs.one) with > > > > one particular gene. The E029 has a very low p.value with > > a very low > > > > log2FC. Either the log2FCs are not OK or the p.values. I > > cannot > > > > understand how such low log2FC for the DEU analysis can give > > > those low > > > > p.values. Indeed the complete original ecs gave me 98000 > > exons with > > > > table(res.48$padjust<0.05). > > > > However the same analysis (ecs object 4CTRLS vs 4 TREATED) > > gave me > > > > nice results when analysed with DEXSeq 1.6 on the complete > > design > > > > without splitting into two. > > > > Here's the picture with expression and splicing values: > > > > Immagine in linea 2 > > > > > > > > Here's the design of the ecs object created with CTRL vs > > HYPOXIA > > > (only > > > > at 48h): > > > > > design(ecs.one) > > > > sampleName fileName condition > > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam > > CTRL > > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam > > CTRL > > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam > > > CTRL > > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam > > > CTRL > > > > HYPOXIA2 HYPOXIA2 > > Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam > > > HYPOXIA > > > > HYPOXIA3 HYPOXIA3 > > Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam > > > HYPOXIA > > > > > > > > Maybe the sampleNames?? N1andN2 come from another batch > > but it is > > > > still a CTRL. If they were different I would expect higher > > > dispersions > > > > and hence higher p.values not lower ones, wouldn't I? > > > > I have tried to trace the problem a bit with these gene: > > > > > > > > modelFrame<-constructModelFrame(ecs.one) > > > > formula0 = ~sample + exon > > > > formula1 = ~sample + exon + condition:exon > > > > mm0<-DEXSeq:::rmDepCols(model.matrix(formula0,modelFrame)) > > > > mm1<-DEXSeq:::rmDepCols(model.matrix(formula1,modelFrame)) > > > > > > > > > > count<-DEXSeq:::getCountVector(ecs=ecs.one,geneID="ENSG00000170017", "E029") > > > > > > > > > mm0 > > > > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 > > sampleN1 > > > > sampleN2 exonothers > > > > 1 1 0 0 0 1 0 > > > > 0 > > > > 2 1 0 0 0 0 1 > > > > 0 > > > > *3 1 0 0 0 0 > > > > 0 0* > > > > 4 1 1 0 0 0 > 0 > > > > 0 > > > > 5 1 0 1 0 0 0 > > > > 0 > > > > 6 1 0 0 1 0 0 > > > > 0 > > > > 7 1 0 0 0 1 0 > > > > 1 > > > > 8 1 0 0 0 0 1 > > > > 1 > > > > 9 1 0 0 0 0 0 > > > > 1 > > > > 10 1 1 0 0 0 0 > > > > 1 > > > > 11 1 0 1 0 0 0 > > > > 1 > > > > 12 1 0 0 1 0 0 > > > > 1 > > > > attr(,"assign") > > > > [1] 0 1 1 1 1 1 2 > > > > attr(,"contrasts") > > > > attr(,"contrasts")$sample > > > > [1] "contr.treatment" > > > > > > > > attr(,"contrasts")$exon > > > > [1] "contr.treatment" > > > > > > > > Does it seem OK to you? I guess the intercept is CTRL2 (in > > bold) but > > > > why? Does it have to do with the 'CTRL' string in the > > sampleName? I > > > > tried to change the sample names to CTRL1,CTRL2... but the > > > result was > > > > the same. > > > > > > > > Here's the mm1 > > > > > mm1 > > > > (Intercept) sampleCTRL3 sampleHYPOXIA2 sampleHYPOXIA3 > > sampleN1 > > > > sampleN2 exonothers exonthis:conditionHYPOXIA > > > > 1 1 0 0 0 1 0 > > > > 0 0 > > > > 2 1 0 0 0 0 1 > > > > 0 0 > > > > 3 1 0 0 0 0 0 > > > > 0 0 > > > > 4 1 1 0 0 0 0 > > > > 0 0 > > > > 5 1 0 1 0 0 0 > > > > 0 1 > > > > 6 1 0 0 1 0 0 > > > > 0 1 > > > > 7 1 0 0 0 1 0 > > > > 1 0 > > > > 8 1 0 0 0 0 1 > > > > 1 0 > > > > 9 1 0 0 0 0 0 > > > > 1 0 > > > > 10 1 1 0 0 0 0 1 > > > > 0 > > > > 11 1 0 1 0 0 0 1 > > > > 0 > > > > 12 1 0 0 1 0 0 1 > > > > 0 > > > > > sessionInfo() > > > > R Under development (unstable) (2014-02-09 r64949) > > > > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > > > > > > > locale: > > > > [1] > > en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > > > > > > > attached base packages: > > > > [1] parallel stats graphics grDevices utils datasets > > methods > > > > base > > > > > > > > other attached packages: > > > > [1] edgeR_3.5.28 limma_3.19.28 DEXSeq_1.9.5 > > > > Biobase_2.23.6 BiocGenerics_0.9.3 vimcom.plus_0.9-93 > > setwidth_1.0-3 > > > > colorout_1.0-2 > > > > > > > > loaded via a namespace (and not attached): > > > > [1] AnnotationDbi_1.25.15 biomaRt_2.19.3 Biostrings_2.31.15 > > > > bitops_1.0-6 DBI_0.2-7 GenomeInfoDb_0.99.19 > > > > GenomicRanges_1.15.39 > > > > [8] hwriter_1.3 IRanges_1.21.34 RCurl_1.95-4.1 > > > > Rsamtools_1.15.33 RSQLite_0.11.4 statmod_1.4.18 > > > stats4_3.1.0 > > > > [15] stringr_0.6.2 tools_3.1.0 XML_3.98-1.1 > > > > XVector_0.3.7 zlibbioc_1.9.0 > > > > > > > > > > > > I hope you can give me some hints since I am a bit > > confused and > > > stuck > > > > with these results. > > > > By the way, for the other Bioc, I know limma/voom used on > > exons can > > > > also work nicely. Is there an easy way to implement a sort of > > > DEU test > > > > with limma voom counts? I guess the annotation gtf used to > > count > > > them > > > > should be used to construct models and include it in a > > similar way > > > > with formulae in linearmodels as DEXSeq does with glmnb.fit > > > function. > > > > It would be perfect to have it straight as a function also in > > > limma to > > > > compare results. > > > > > > > > Thanks again for your efforts, > > > > > > > > Looking forward to hearing to your comments. > > > > Jose > > > > > > > > > > > > > > > > > > > > 2014-03-20 16:07 GMT+01:00 Jose Garcia > > > > <garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>> > > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>>>>: > > > > > > > > Hi Alejandro, > > > > I solved the problem by re-creating the object ecs.24. > > I had > > > made > > > > one DEXSeq analysis up to the end by first creating an > ecs > > > object. > > > > Then I just split the ecs object, which already had > > p.value and > > > > other info, and re-run the analysis from sizeFactors > > on onto the > > > > new split ecs.24 object. > > > > Now it has worked. > > > > However, I have obtained a much harder to interpret > > result which > > > > points to something wrong I do not know why. And it is > > > present in > > > > both the split and the original ecs.24 and ecs objects. > > > > From scratch: > > > > I made dexseq_prepare_annotation.py with the script > > from DEXSeq > > > > 1.6 which contained the '-r' parameter in order to avoid > > > counting > > > > exons overlapping different genes. Then I tried to count > > > using the > > > > new dexseq_count.py in the same package but it gave me > > an error > > > > because it had been introduced a check for NH tag in > > the bam > > > that > > > > I do not have because I use SOAPSplice. You suggested > > to use the > > > > old dexseq_count.py whithout the check (from DEXSeq 1.4). > > > > It worked and then I used the following script: > > > > > > > > sampleFiles.R_ExonOUT<-Files > > > > sampleName.R_ExonOUT<-Names > > > > > > > > > > sampleCondition.R_ExonOUT<-c(rep("HYPOXIA",2),rep("CTRL",4),rep("HYP OXIA",2)) > > > > sampleExperiment.R_ExonOUT<-c(rep("RUN_2",4),rep("RUN_1",4)) > > > > sampleTable.R_ExonOUT <- data.frame(sampleName = > > > sampleName.R_ExonOUT, > > > > fileName = > > sampleFiles.R_ExonOUT, > > > > condition = > > > sampleCondition.R_ExonOUT, > > > > experiment = > > > sampleExperiment.R_ExonOUT) > > > > inDir = getwd() > > > > annotationfile = file.path > > > > > > > > > > ("/lustre1/genomes/hg19/annotation","Homo_sapiens.ensembl72.DEXSeq.g ff") > > > > > > > > ecs = read.HTSeqCounts(countfiles = file.path(inDir, > > > > sampleTable.R_ExonOUT$fileName),design = > > sampleTable.R_ExonOUT, > > > > flattenedfile = annotationfile) > > > > > > > > sampleNames(ecs) = sampleTable.R_ExonOUT$sampleName > > > > ecs <- estimateSizeFactors(ecs) > > > > library(parallel) > > > > ecs <- estimateDispersions(ecs,nCores=8) > > > > ecs <- fitDispersionFunction(ecs) > > > > ecs <- testForDEU(ecs, nCores=8) > > > > ecs <- estimatelog2FoldChanges(ecs, nCores=8) > > > > res<- DEUresultTable(ecs) > > > > > > > > The problem is that I have some exons with a ridiculous > > > log2FC but > > > > with a very good p.adjust. > > > > Same thing with the ecs.24 or ecs.48 split objects. > > Here an > > > example: > > > > > > > > head(res.48[which(res.48$geneID=="ENSG00000148516"),]) > > > > > > > > geneID exonID dispersion pvalue padjust meanBase > > > > log2fold(HYPOXIA/CTRL) > > > > > > > > ENSG00000148516:E036 ENSG00000148516 E036 0.014798679 > > > > 2.873434e-16 5.646223e-14 171.5313 -6.075811e-01 > > > > > > > > ENSG00000148516:E049 ENSG00000148516 E049 0.011425856 > > > > 2.461690e-14 2.846653e-12 414.4351 -1.907197e-01 > > > > > > > > ENSG00000148516:E039 ENSG00000148516 E039 0.014486497 > > > > 2.332678e-13 2.043916e-11 181.3705 -4.226252e-01 > > > > > > > > *ENSG00000148516:E050 ENSG00000148516 E050 0.009733072 > > > > 1.131825e-12 8.326638e-11 1432.6492 -1.278668e-05* > > > > > > > > ENSG00000148516:E033 ENSG00000148516 E033 0.037143254 > > > > 3.483915e-12 2.236853e-10 514.5010 -5.273017e-01 > > > > > > > > ENSG00000148516:E034 ENSG00000148516 E034 0.019826955 > > > > 4.660942e-12 2.896722e-10 113.6851 -6.541261e-01 > > > > > > > > > > > > If you look at the plot (just a few exons around E50) > > > > > > > > plotDEXSeq(ecs.48,"ENSG00000148516",splicing=T) > > > > > > > > > > > > Immagine in linea 3 > > > > > > > > It seems clear that all those p-values cannot come > > from those > > > > log2FC that are adjusted for expression of all exons > > coming from > > > > the same gene. > > > > > > > > I have checked the design and formula: > > > > > > > > design(ecs.48) > > > > > > > > sampleName fileName condition experiment > > > > > > > > N1 N1 Exon_Martelli_Sample_Martelli_N_1.bam > > CTRL RUN_2 > > > > > > > > N2 N2 Exon_Martelli_Sample_Martelli_N_2.bam > > CTRL RUN_2 > > > > > > > > CTRL2 CTRL2 Exon_Martelli_Sample_Martelli_CTRL_2.bam > > CTRL > > > RUN_1 > > > > > > > > CTRL3 CTRL3 Exon_Martelli_Sample_Martelli_CTRL_3.bam > > CTRL > > > RUN_1 > > > > > > > > HYPOXIA2 HYPOXIA2 > > Exon_Martelli_Sample_Martelli_HYPOXIA_2.bam > > > > HYPOXIA RUN_1 > > > > > > > > HYPOXIA3 HYPOXIA3 > > Exon_Martelli_Sample_Martelli_HYPOXIA_3.bam > > > > HYPOXIA RUN_1 > > > > > > > > > > > > formula(ecs.48) > > > > > > > > $formulaDispersion > > > > > > > > [1] "~sample + exon + condition:exon" > > > > > > > > > > > > $formula0 > > > > > > > > [1] "~sample + exon" > > > > > > > > > > > > $formula1 > > > > > > > > [1] "~sample + exon + condition:exon" > > > > > > > > > > > > So, I am a bit stuck with it. I guess everything comes > > from > > > having > > > > used different versions but I cannot come across it. > > > Summarizing: > > > > > > > > SOASPSplice > > > > > > > > dexseq_prepare_annotation.py (From DEXSeq 1.6) with > > Ensembl72 > > > > (hg19) -r no > > > > > > > > dexseq_count.py (From DEXSeq 1.4) > > > > > > > > Analysis (DEXSeq 1.8) > > > > > > > > Thanks for the help, > > > > > > > > > > > > Jose > > > > > > > > > > > > > > > > sessionInfo() > > > > > > > > R version 3.0.1 (2013-05-16) > > > > > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > > > > > > > locale: > > > > > > > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > > > > > > > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > > > > > > > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > > > > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US > > LC_IDENTIFICATION=C > > > > > > > > > > > > attached base packages: > > > > > > > > [1] parallel stats graphics grDevices utils datasets > > > methods > > > > > > > > [8] base > > > > > > > > > > > > other attached packages: > > > > > > > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > > > > > > > > > > loaded via a namespace (and not attached): > > > > > > > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > > > > > > > [4] GenomicRanges_1.14.3 hwriter_1.3 IRanges_1.20.6 > > > > > > > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > > > > > > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > > > > > > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > > > > > > > > > > > > > 2014-03-19 13:18 GMT+01:00 Jose Garcia > > > > <garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>> > > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>>>>: > > > > > > > > Hi Alejandro, > > > > I am analyzing with DEXSeq my data. 4 CTRLs and 2 > > Treated > > > > samples. My design is the following: > > > > > > > > design(ecs.24) > > > > > > > > sampleName fileName condition experiment > > > > > > > > H1 H1 > Exon_Martelli_Sample_Martelli_H_1.bam > > > > HYPOXIA RUN_2 > > > > > > > > H2 H2 > Exon_Martelli_Sample_Martelli_H_2.bam > > > > HYPOXIA RUN_2 > > > > > > > > N1 N1 > Exon_Martelli_Sample_Martelli_N_1.bam > > > CTRL > > > > RUN_2 > > > > > > > > N2 N2 > Exon_Martelli_Sample_Martelli_N_2.bam > > > CTRL > > > > RUN_2 > > > > > > > > CTRL2 CTRL2 > > Exon_Martelli_Sample_Martelli_CTRL_2.bam > > > > CTRL RUN_1 > > > > > > > > CTRL3 CTRL3 > > Exon_Martelli_Sample_Martelli_CTRL_3.bam > > > > CTRL RUN_1 > > > > > > > > When I follow the vignette: > > > > > > > > ecs.24 <- estimateDispersions(ecs.24,nCores=8) > > > > > > > > ....Done > > > > > > > > ecs.24 <- fitDispersionFunction(ecs.24) > > > > > > > > ....Done > > > > > > > > ecs.24 <- testForDEU(ecs.24, nCores=8) > > > > > > > > ..... > > > > > > > > ecs.24 <- estimatelog2FoldChanges(ecs.24, nCores=8) > > > > > > > > *Error in `row.names<-.data.frame`(`*tmp*`, value = > > > > c("geneID", "exonID", : * > > > > > > > > * duplicate 'row.names' are not allowed* > > > > > > > > *Calls: estimatelog2FoldChanges ... pData<- -> > > pData<- -> > > > > row.names<- -> row.names<-.data.frame* > > > > > > > > *In addition: Warning message:* > > > > > > > > *non-unique value when setting 'row.names': > > > > 'log2fold(CTRL/HYPOXIA)' * > > > > > > > > > > > > I checked for duplication as you had suggested > > elsewhere > > > > > > > > any(duplicated(featureNames(ecs.24))) > > > > > > > > [1] FALSE > > > > > > > > > > any(duplicated(paste(geneIDs(ecs.24),exonIDs(ecs.24),sep=":"))) > > > > > > > > [1] FALSE > > > > > > > > > > > > I also checked that the condition in design where > > factors: > > > > > > > > > > > > is.factor(pData(ecs.24)$condition) > > > > > > > > [1] TRUE > > > > > > > > > > > > The only explanation I could come to is the fact > > that I have > > > > an even number of samples for control and treated? > > or that I > > > > have the 'experiment' column in the design, but it > > would be > > > > irrelevant since the default formula is only taking > > > condition > > > > into consideration, isn't it? > > > > > > > > What could be the origin of the error? > > > > > > > > Thanks again! > > > > > > > > Jose > > > > > > > > > > > > > > > > > sessionInfo() > > > > > > > > R version 3.0.1 (2013-05-16) > > > > > > > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > > > > > > > > > locale: > > > > > > > > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > > > > > > > > [4] LC_COLLATE=en_US LC_MONETARY=en_US > > LC_MESSAGES=en_US > > > > > > > > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > > > > > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US > > LC_IDENTIFICATION=C > > > > > > > > > > > > attached base packages: > > > > > > > > [1] parallel stats graphics grDevices utils > > > datasets > > > > methods > > > > > > > > [8] base > > > > > > > > > > > > other attached packages: > > > > > > > > [1] DEXSeq_1.8.0 Biobase_2.22.0 BiocGenerics_0.8.0 > > > > > > > > > > > > loaded via a namespace (and not attached): > > > > > > > > [1] biomaRt_2.18.0 Biostrings_2.30.1 bitops_1.0-6 > > > > > > > > [4] GenomicRanges_1.14.3 hwriter_1.3 > IRanges_1.20.6 > > > > > > > > [7] RCurl_1.95-4.1 Rsamtools_1.14.2 statmod_1.4.18 > > > > > > > > [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1 > > > > > > > > [13] XML_3.98-1.1 XVector_0.2.0 zlibbioc_1.8.0 > > > > > > > > > > > > > > > > 2014-03-13 16:32 GMT+01:00 Alejandro Reyes > > > > <alejandro.reyes@embl.de> > <mailto:alejandro.reyes@embl.de> > > > <mailto:alejandro.reyes@embl.de> > <mailto:alejandro.reyes@embl.de>> <mailto:alejandro.reyes@embl.de> > <mailto:alejandro.reyes@embl.de> > > > <mailto:alejandro.reyes@embl.de> > <mailto:alejandro.reyes@embl.de>>>>: > > > > > > > > Dear Xiayu Rao, > > > > > > > > Thanks for your interest in DEXSeq. > > > > That looks strange, does it happen when you > > use files > > > > different from the > > > > one you generated by your own? Could you maybe > > send me > > > > (offline) your > > > > gtf file and the first 1000 lines of one of > > your sam > > > files? > > > > > > > > Bests, > > > > Alejandro > > > > > > > > > Hello, > > > > > > > > > > DEXSeq is a great tool. Thank you for that. > > I recently > > > > generate my own gtf file with the same format > > as the > > > > exon.gff generated by > > dexseq_prepare_annotation.py. > > > > However, the output is strange. I tried to > > find the > > > reason > > > > with no success. Could you please provide any > > idea about > > > > that problem? Thank you very much in advance! > > > > > > > > > > Note: I used the latest dexseq, and the sam > > files had > > > > been sorted. > > > > > > > > > > 1 genes.gtf exonic_part 12228 > > 12594 . > > > > + . exonic_part_number "001"; gene_id > > > > "ENSG00000223972" > > > > > 1 genes.gtf exonic_part 12722 > > 12974 . > > > > + . exonic_part_number "002"; gene_id > > > > "ENSG00000223972" > > > > > 1 genes.gtf exonic_part 13053 > > 13220 . > > > > + . exonic_part_number "003"; gene_id > > > > "ENSG00000223972" > > > > > 1 genes.gtf exonic_part 14830 > > 14969 . > > > > - . exonic_part_number "001"; gene_id > > > > "ENSG00000223972+ENSG00000227232" > > > > > ............. > > > > > > > > > > > > > > > ==> SRR791043_counts.txt <== > > > > > :001G00027000003" > > > > > :002G00021000003" > > > > > :003G00070000003" > > > > > :004G00040000003" > > > > > :005G00060000003" > > > > > :006G00030000003" > > > > > :007G00019000003" > > > > > :008G00045600003" > > > > > :009G00020400003" > > > > > :001G00000000005" > > > > > > > > > > > > > > > Thanks, > > > > > Xiayu > > > > > > > > _______________________________________________ > > > > Bioconductor mailing list > > > > Bioconductor@r-project.org > > <mailto:bioconductor@r-project.org> > > <mailto:bioconductor@r-project.org> > <mailto:bioconductor@r-project.org>> > > > <mailto:bioconductor@r-project.org> > <mailto:bioconductor@r-project.org> > > > <mailto:bioconductor@r-project.org> > <mailto:bioconductor@r-project.org>>> > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > Search the archives: > > > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > > > > > > > > -- > > > > Jose M. Garcia Manteiga PhD > > > > Research Assistant - Data Analysis in Functional > > Genomics > > > > Center for Translational Genomics and BioInformatics > > > > Dibit2-Basilica, 3A3 > > > > San Raffaele Scientific Institute > > > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > <tel:%2b39-02-2643-4751> > > > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>> > > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>>> > > > > > > > > > > > > > > > > > > > > -- > > > > Jose M. Garcia Manteiga PhD > > > > Research Assistant - Data Analysis in Functional Genomics > > > > Center for Translational Genomics and BioInformatics > > > > Dibit2-Basilica, 3A3 > > > > San Raffaele Scientific Institute > > > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > <tel:%2b39-02-2643-4751> > > > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>> > > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>>> > > > > > > > > > > > > > > > > > > > > -- > > > > Jose M. Garcia Manteiga PhD > > > > Research Assistant - Data Analysis in Functional Genomics > > > > Center for Translational Genomics and BioInformatics > > > > Dibit2-Basilica, 3A3 > > > > San Raffaele Scientific Institute > > > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > <tel:%2b39-02-2643-4751> > > > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>> > > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>>> > > > > > > > > > > > > > > > -- > > > Jose M. Garcia Manteiga PhD > > > Research Assistant - Data Analysis in Functional Genomics > > > Center for Translational Genomics and BioInformatics > > > Dibit2-Basilica, 3A3 > > > San Raffaele Scientific Institute > > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > > > Tel: +39-02-2643-4751 <tel:%2b39-02-2643-4751> > > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > > <mailto:garciamanteiga.josemanuel@hsr.it> > <mailto:garciamanteiga.josemanuel@hsr.it>> > > > > > > > > > > -- > > Jose M. Garcia Manteiga PhD > > Research Assistant - Data Analysis in Functional Genomics > > Center for Translational Genomics and BioInformatics > > Dibit2-Basilica, 3A3 > > San Raffaele Scientific Institute > > Via Olgettina 58, 20132 Milano (MI), Italy > > > > Tel: +39-02-2643-4751 > > e-mail: garciamanteiga.josemanuel@hsr.it > > <mailto:garciamanteiga.josemanuel@hsr.it> > > -- Jose M. Garcia Manteiga PhD Research Assistant - Data Analysis in Functional Genomics Center for Translational Genomics and BioInformatics Dibit2-Basilica, 3A3 San Raffaele Scientific Institute Via Olgettina 58, 20132 Milano (MI), Italy Tel: +39-02-2643-4751 e-mail: garciamanteiga.josemanuel@hsr.it [[alternative HTML version deleted]]

ADD REPLY • link 10.6 years ago Jose M Garcia Manteiga ▴ 310

Login before adding your answer.