Question

DEXseq: Making exonCountSet object

0

Entering edit mode

Elena Sorokin ▴ 160

@elena-sorokin-4659

Last seen 10.6 years ago

Hello again, As a result of my recent post, I'm now delving into the DEXseq package. But I am having struggling with building my exonCountSet object, as described in the vignette/ R documentation. First I went through the example in the vignette, and was able to complete the pasilla analysis. With my own data, though, the differential expression testing doesn't work, and additionally, the annotation GFF file doesn't seem to have loaded. Could somebody kindly write out the exact code for reading in the HTSeq counts files using the pasilla input files (p.11 of the vignette), or better yet, point out where my code below is wrong? My goal is to be able to use the neat graphics as well as DE testing on my dataset, so I don't want to do the bare-bones loading. My code is posted below... I apologize that it's lengthy. Many thanks in advance to anybody out there who can help me, Elena _____________________ # For reference, I have four counts files, called veh1_counts.txt, veh2_counts.txt, drug1_counts.txt, drug2_counts.txt that I made using, for example: *python dexseq_count.py DEXSeq_annotations.gff <samfile> veh1_counts.txt * and one GFF file, called DEXseq_annotations.gff, that I made using: * python dexseq_prepare_annotation.py Ce.WS220.65_copy.gtf DEXSeq_annotations.gff* # load annotation file > annotationfile = file.path("DEXSeq_annotations.gff") > annotationfile [1] "DEXSeq_annotations.gff" # I made a data frame called "samples": > samples condition replicate veh1 vehicle 1 veh2 drug 1 drug1 vehicle 2 drug2 drug 2 # make exon count set object ecs = read.HTSeqCounts(countfiles = file.path("C:\\Rdata\\DEXseq", paste(paste(rownames(samples),"counts", sep="_"), "txt", sep=".")), design = samples, flattenedfile = annotationfile) # I thought everything seemed OK, so went on... # Size factors > ecs <- estimateSizeFactors(ecs) # Dispersion > ecs <- estimateDispersions(ecs) # Fit Dispersion > ecs <- fitDispersionFunction(ecs) # Plot Individual exons via mean expression > meanvalues <- rowMeans(counts(pasillaExons)) > plot(meanvalues, fData(pasillaExons)$dispBeforeSharing, log="xy", main="mean vs CR dispersion") Error in plot.window(...) : need finite 'ylim' values In addition: Warning messages: 1: In xy.coords(x, y, xlabel, ylabel, log) : 28 x values <= 0 omitted from logarithmic plot 2: In min(x) : no non-missing arguments to min; returning Inf 3: In max(x) : no non-missing arguments to max; returning -Inf # Test for Expression Differences > test <- testGeneForDEU(ecs) Error in geneIDs(ecs) == geneID : 'geneID' is missing # There must be something wrong with the geneIDs column in my exon count set > head(geneIDs(ecs)) 2L52.1:001 2L52.1:002 2L52.1:003 2L52.1:004 2L52.1:005 2L52.1:006 2L52.1 2L52.1 2L52.1 2L52.1 2L52.1 2L52.1 41112 Levels: 2L52.1 2L52.2 2RSSE.1 2RSSE.2 2RSSE.3 2RSSE.4 2RSSE.5 2RSSE.6 2RSSE.7 2RSSE.8 3R5.1+K08E3.8 3R5.2 4R79.1 4R79.2 4R79.4 4R79.5 6R55.2 AC3.1 ... ZK994.t3 # Are there any other problems with my exon count set? Yes, also the annotation file is not loaded. > ecs ExonCountSet (storageMode: environment) assayData: 176762 features, 4 samples element names: counts protocolData: none phenoData sampleNames: veh1 veh2 drug1 drug2 varLabels: sizeFactor condition replicate type varMetadata: labelDescription featureData featureNames: 2L52.1:001 2L52.1:002 ... ZK994.t3:001 (176762 total) fvarLabels: geneID exonID ... transcripts (13 total) fvarMetadata: labelDescription experimentData: use 'experimentData(object)' Annotation: # Looking at the fData, I note that annotation info is missing. > head(fData(ecs)) geneID exonID testable dispBeforeSharing dispFitted dispersion pvalue padjust chr start end strand transcripts 2L52.1:001 2L52.1 E001 FALSE NA 0.5118 0.5118 NA NA <na> NA NA <na> <na> 2L52.1:002 2L52.1 E002 TRUE 1.15e-09 0.0884 0.0884 NA NA <na> NA NA <na> <na> 2L52.1:003 2L52.1 E003 TRUE 1.40e-02 0.1418 0.1418 NA NA <na> NA NA <na> <na> 2L52.1:004 2L52.1 E004 TRUE 1.54e-09 0.2004 0.2004 NA NA <na> NA NA <na> <na> 2L52.1:005 2L52.1 E005 TRUE 6.67e-10 0.0963 0.0963 NA NA <na> NA NA <na> <na> 2L52.1:006 2L52.1 E006 TRUE 2.16e-09 0.1222 0.1222 NA NA <na> NA NA <na> <na> > sessionInfo() R version 2.14.0 (2011-10-31) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] pasilla_0.2.11 DESeq_1.6.1 locfit_1.5-6 lattice_0.20-0 akima_0.5-7 DEXSeq_1.0.2 Biobase_2.14.0 loaded via a namespace (and not attached): [1] annotate_1.32.1 AnnotationDbi_1.16.11 DBI_0.2-5 genefilter_1.36.0 geneplotter_1.32.1 grid_2.14.0 hwriter_1.3 [8] IRanges_1.12.6 plyr_1.7.1 RColorBrewer_1.0-5 RSQLite_0.11.1 splines_2.14.0 statmod_1.4.14 stringr_0.6 [15] survival_2.36-10 tools_2.14.0 xtable_1.6-0 > [[alternative HTML version deleted]]

Annotation Annotation • 1.7k views

ADD COMMENT • link updated 13.1 years ago by Alejandro Reyes ▴ 30 • written 13.2 years ago by Elena Sorokin ▴ 160

score 0 · Answer 1 · 2012-02-11

Hi Elena On 2012-02-11 02:06, Elena Sorokin wrote: > With my own data, though, the differential expression testing doesn't > work, and additionally, the annotation GFF file doesn't seem to have > loaded. Could somebody kindly write out the exact code for reading in > the HTSeq counts files using the pasilla input files (p.11 of the > vignette), or better yet, point out where my code below is wrong? Probably something went wrong here: > # make exon count set object > ecs = read.HTSeqCounts(countfiles = file.path("C:\\Rdata\\DEXseq", > paste(paste(rownames(samples),"counts", sep="_"), "txt", sep=".")), > design = samples, > flattenedfile = annotationfile) First, look at the return value of the call to file.path: file.path("C:\\Rdata\\DEXseq", paste(paste(rownames(samples),"counts", sep="_"), "txt", sep=".")) If you type this without the rest around it, do you get a list of your file names? Are they really correct? Also check with 'head(count(ecs))' whether the count data really made it into the object. Similarly, check with 'pData(ecs)' and 'head(fData(ecd))' whether the annotation is there. And, after calling 'estimatSizeFactors', look at the result with 'sizeFactors(ecs)'. This should help pinpoint where things go wrong. Simon

score 0 · Answer 2 · 2012-02-11

Dear Simon, Thanks for your reply, even on a Saturday. =) You were right about the incorrect file path- that was problem. (To anybody else out there struggling with general issues getting their files into R, my code is below. Hope I save you some time). Another issue I was hoping you might help me with: I have no differentially-expressed exons (p-value = NA for all exons) at the end of the analysis!! I followed your other recommendations, and have spent some time trying to figure this out, but am still unsure. The fact that there are no integers in the p-value column has me worried. Maybe the problem is that the column names in counts(ecs) do not word- for-word match up with my rownames in the samples dataframe. I didn't know how to change the colnames in counts(ecs), so I left them as they were, noting that the sample identity seemed to have been preserved. Any advice on how to change the count(ecs) headers, or other advice about my non-existent p-values would be appreciated! =) Many thanks, Elena P.S. The graphics w/ gene models in DEXseq are very useful and convenient! ______________________________ library(DEXSeq) options(digits=3) setwd("C:\\Rdata\\DEXseq") library(DEXSeq) rm(list=ls()) # this GFF file is an output from Simon's script dexseq_prepare_annotations.py annotationfile = file.path("/Rdata/DEXseq/realFioles/DEXSeq_annotations.gff") annotationfile [1] "/Rdata/DEXseq/realFioles/DEXSeq_annotations.gff" > samples = data.frame( + condition = c("drug","drug","vehicle","vehicle"), + replicate = c(1:2,1:2), + row.names = c("drug1","drug2","veh1","veh2"), + stringsAsFactors = TRUE, + check.names = FALSE + ) > samples condition replicate drug1 drug 1 drug2 drug 2 veh1 vehicle 1 veh2 vehicle 2 > fullFilenames<- list.files("C:/Rdata/DEXseq/realFioles/",full.names =TRUE,pattern="counts.txt") > fullFilenames [1] "C:/Rdata/DEXseq/realFioles/drug1_counts.txt" "C:/Rdata/DEXseq/realFioles/drug2_counts.txt" "C:/Rdata/DEXseq/realFioles/veh1_counts.txt" [4] "C:/Rdata/DEXseq/realFioles/veh2_counts.txt" > ecs<- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile) > head(counts(ecs)) C:/Rdata/DEXseq/realFioles/drug1_counts.txt C:/Rdata/DEXseq/realFioles/drug2_counts.txt C:/Rdata/DEXseq/realFioles/veh1_counts.txt 2L52.1:001 0 2 2 2L52.1:002 4 12 13 2L52.1:003 7 8 7 2L52.1:004 6 4 4 2L52.1:005 9 6 16 2L52.1:006 6 4 13 C:/Rdata/DEXseq/realFioles/veh2_counts.txt 2L52.1:001 4 2L52.1:002 20 2L52.1:003 8 2L52.1:004 7 2L52.1:005 14 2L52.1:006 12 > head(fData(ecs)) geneID exonID testable dispBeforeSharing dispFitted dispersion pvalue padjust chr start end strand transcripts 2L52.1:001 2L52.1 E001 FALSE NA 0.5145 0.5145 NA NA chrII 1867 1911 + 2L52.1 2L52.1:002 2L52.1 E002 TRUE 1.39e-01 0.0899 0.1391 NA NA chrII 2506 2694 + 2L52.1 2L52.1:003 2L52.1 E003 TRUE 9.56e-10 0.1436 0.1436 NA NA chrII 2738 2888 + 2L52.1 2L52.1:004 2L52.1 E004 TRUE 2.52e-10 0.2022 0.2022 NA NA chrII 2931 3036 + 2L52.1 2L52.1:005 2L52.1 E005 TRUE 2.16e-09 0.0979 0.0979 NA NA chrII 3406 3552 + 2L52.1 2L52.1:006 2L52.1 E006 TRUE 2.46e-09 0.1238 0.1238 NA NA chrII 3802 3984 + 2L52.1 # Size factors > ecs<- estimateSizeFactors(ecs) sizeFactors(ecs) # Dispersion > ecs<- estimateDispersions(ecs) # Fit Dispersion > ecs<- fitDispersionFunction(ecs) # Plot Individual exons via mean expression > meanvalues<- rowMeans(counts(ecs)) > plot(meanvalues, fData(ecs)$dispBeforeSharing, log="xy", main="mean vs CR dispersion") > x<- 0.01:max(meanvalues) > y<- ecs at dispFitCoefs[1] + ecs at dispFitCoefs[2] / x > lines(x, y, col="red") # Plot looks good # Test for Expression Difference > test<- testForDEU(ecs) # Seems to work > ecs<- estimatelog2FoldChanges(ecs) > res1<- DEUresultTable(ecs) head(res1) geneID exonID dispersion pvalue padjust meanBase log2fold(drug/vehicle) 2L52.1:001 2L52.1 E001 0.5145 NA NA 2.01 -0.844 2L52.1:002 2L52.1 E002 0.1391 NA NA 12.25 -0.304 2L52.1:003 2L52.1 E003 0.1436 NA NA 7.45 0.738 2L52.1:004 2L52.1 E004 0.2022 NA NA 5.22 0.595 2L52.1:005 2L52.1 E005 0.0979 NA NA 11.18 -0.264 2L52.1:006 2L52.1 E006 0.1238 NA NA 8.71 -0.586 > table(res1$padjust< 0.1) character(0) # Are there any p-values? > table(res1$pvalue< 1) character(0)

score 0 · Answer 3 · 2012-02-13

Hi Elena, Thanks for posting the detail email. The p-values and p-adjusted values should be in your variable "test". You did: test <- testForDEU( ecs ) The function "testForDEU" will return an ExonCountSet object, your variable "ecs" with the columns of the fData for the pvalues and padjusted already filled. Then you will notice that "test" is an ExonCountSet object, so if you do: res1<- DEUresultTable(test) You will get your pvalues! Bext wishes, Alejandro Dear Simon, Thanks for your reply, even on a Saturday. =) You were right about the incorrect file path- that was problem. (To anybody else out there struggling with general issues getting their files into R, my code is below. Hope I save you some time). Another issue I was hoping you might help me with: I have no differentially-expressed exons (p-value = NA for all exons) at the end of the analysis!! I followed your other recommendations, and have spent some time trying to figure this out, but am still unsure. The fact that there are no integers in the p-value column has me worried. Maybe the problem is that the column names in counts(ecs) do not word-for-word match up with my rownames in the samples dataframe. I didn't know how to change the colnames in counts(ecs), so I left them as they were, noting that the sample identity seemed to have been preserved. Any advice on how to change the count(ecs) headers, or other advice about my non-existent p-values would be appreciated! =) Many thanks, Elena P.S. The graphics w/ gene models in DEXseq are very useful and convenient! ______________________________ library(DEXSeq) options(digits=3) setwd("C:\\Rdata\\DEXseq") library(DEXSeq) rm(list=ls()) # this GFF file is an output from Simon's script dexseq_prepare_annotations.py annotationfile = file.path("/Rdata/DEXseq/realFioles/DEXSeq_annotations.gff") annotationfile [1] "/Rdata/DEXseq/realFioles/DEXSeq_annotations.gff" > samples = data.frame( + condition = c("drug","drug","vehicle","vehicle"), + replicate = c(1:2,1:2), + row.names = c("drug1","drug2","veh1","veh2"), + stringsAsFactors = TRUE, + check.names = FALSE + ) > samples condition replicate drug1 drug 1 drug2 drug 2 veh1 vehicle 1 veh2 vehicle 2 > fullFilenames<- list.files("C:/Rdata/DEXseq/realFioles/",full.name s=TRUE,pattern="counts.txt") > fullFilenames [1] "C:/Rdata/DEXseq/realFioles/drug1_counts.txt" "C:/Rdata/DEXseq/realFioles/drug2_counts.txt" "C:/Rdata/DEXseq/realFioles/veh1_counts.txt" [4] "C:/Rdata/DEXseq/realFioles/veh2_counts.txt" > ecs<- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile) > head(counts(ecs)) C:/Rdata/DEXseq/realFioles/drug1_counts.txt C:/Rdata/DEXseq/realFioles/drug2_counts.txt C:/Rdata/DEXseq/realFioles/veh1_counts.txt 2L52.1:001 0 2 2 2L52.1:002 4 12 13 2L52.1:003 7 8 7 2L52.1:004 6 4 4 2L52.1:005 9 6 16 2L52.1:006 6 4 13 C:/Rdata/DEXseq/realFioles/veh2_counts.txt 2L52.1:001 4 2L52.1:002 20 2L52.1:003 8 2L52.1:004 7 2L52.1:005 14 2L52.1:006 12 > head(fData(ecs)) geneID exonID testable dispBeforeSharing dispFitted dispersion pvalue padjust chr start end strand transcripts 2L52.1:001 2L52.1 E001 FALSE NA 0.5145 0.5145 NA NA chrII 1867 1911 + 2L52.1 2L52.1:002 2L52.1 E002 TRUE 1.39e-01 0.0899 0.1391 NA NA chrII 2506 2694 + 2L52.1 2L52.1:003 2L52.1 E003 TRUE 9.56e-10 0.1436 0.1436 NA NA chrII 2738 2888 + 2L52.1 2L52.1:004 2L52.1 E004 TRUE 2.52e-10 0.2022 0.2022 NA NA chrII 2931 3036 + 2L52.1 2L52.1:005 2L52.1 E005 TRUE 2.16e-09 0.0979 0.0979 NA NA chrII 3406 3552 + 2L52.1 2L52.1:006 2L52.1 E006 TRUE 2.46e-09 0.1238 0.1238 NA NA chrII 3802 3984 + 2L52.1 # Size factors > ecs<- estimateSizeFactors(ecs) sizeFactors(ecs) # Dispersion > ecs<- estimateDispersions(ecs) # Fit Dispersion > ecs<- fitDispersionFunction(ecs) # Plot Individual exons via mean expression > meanvalues<- rowMeans(counts(ecs)) > plot(meanvalues, fData(ecs)$dispBeforeSharing, log="xy", main="mean vs CR dispersion") > x<- 0.01:max(meanvalues) > y<- ecs@dispFitCoefs[1] + ecs@dispFitCoefs[2] / x > lines(x, y, col="red") # Plot looks good # Test for Expression Difference > test<- testForDEU(ecs) # Seems to work > ecs<- estimatelog2FoldChanges(ecs) > res1<- DEUresultTable(ecs) head(res1) geneID exonID dispersion pvalue padjust meanBase log2fold(drug/vehicle) 2L52.1:001 2L52.1 E001 0.5145 NA NA 2.01 -0.844 2L52.1:002 2L52.1 E002 0.1391 NA NA 12.25 -0.304 2L52.1:003 2L52.1 E003 0.1436 NA NA 7.45 0.738 2L52.1:004 2L52.1 E004 0.2022 NA NA 5.22 0.595 2L52.1:005 2L52.1 E005 0.0979 NA NA 11.18 -0.264 2L52.1:006 2L52.1 E006 0.1238 NA NA 8.71 -0.586 > table(res1$padjust< 0.1) character(0) # Are there any p-values? > table(res1$pvalue< 1) character(0) [[alternative HTML version deleted]]