Question

DESeq2: different number of significant genes between DESeqDataSetFromMatrix and DESeqDataSetFromHTSeqCount

0

Entering edit mode

jillian.waters • 0

@jillianwaters-14491

Last seen 4.8 years ago

Hello,

I've been running in to some troubles with my data analysis with DESeq2, and one thing I decided to confirm was whether I was checking the difference in my results when I used an input count table for DESeqDataSetFromMatrix or when I followed the guidelines for DESeqDataSetFromHTSeqCount. In the former example I have ~1000 genes that are significant, whereas in the latter I have only 1. I am surprised considering that in theory I would think this is virtually identical input, although I am guessing it is an error on my part.

The same sample files were used as direct input into DESeqDataSetFromHTSeqCount that were used to generate the count table used in DESeqDataSetFromMatrix. I have also spot checked some of these files, and the numbers within the individual file match what is in the geneCounts data.frame.

I have entered the code below. Any help would be greatly appreciated. Thank you in advance!

directory <- "/ebio//abt3_projects//Christensenella_Project//CM12//CM12_RNA-seq//WAT_v1//HTSeq//20171006_ALL"
sampleFiles <- grep("htseq_count",list.files(directory),value=TRUE)
sampleTable <- read.table('/ebio//abt3_projects//Christensenella_Project//CM12//CM12_RNA-seq//WAT_v1//CM12_WAT_Design_v11.txt',header=TRUE,sep="\t")
sample_table2=data.frame(sampleName = sampleTable$SampleID,
                          fileName = sampleFiles,
                          condition = sampleTable$Condition)

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sample_table2,
                                       directory = directory,
                                       design= ~ condition)

contrasts = c('condition', 'Blautia', 'Cminuta') 
dds <- DESeq(dds)
res <- results(dds, contrast=contrasts)
summary(res)

out of 37999 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 0, 0% 
LFC < 0 (down)   : 1, 0.0026% 
outliers [1]     : 804, 2.1% 
low counts [2]   : 2, 0.0053% 
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

The input to generate the count table for DEseqDataSetfromMatrix is below

Cmin1=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S1_htseq_count_V1.txt",sep="\t",header=FALSE)
Cmin2=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S5_htseq_count_V1.txt", sep="\t",header=FALSE)
Cmin3=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S10_htseq_count_V1.txt",sep="\t",header=FALSE)
Cmin4=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S15_htseq_count_V1.txt",sep="\t",header=FALSE)
Cmin5=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S20_htseq_count_V1.txt",sep="\t",header=FALSE)
Bhyd1=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S13_htseq_count_V1.txt",sep="\t",header=FALSE)
Bhyd2=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S19_htseq_count_V1.txt",sep="\t",header=FALSE)
Bhyd3=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S2_htseq_count_V1.txt",sep="\t",header=FALSE)
Bhyd4=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S4_htseq_count_V1.txt",sep="\t",header=FALSE)
Bhyd5=read.table("/ebio/abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/HTSeq/20171006_ALL/S8_htseq_count_V1.txt",sep="\t",header=FALSE)


geneCounts=data.frame(Cmin1[,2],Cmin2[,2],Cmin3[,2],Cmin4[,2],Cmin5[,2],Bhyd1[,2],Bhyd2[,2],Bhyd3[,2],Bhyd4[,2],Bhyd5[,2],Bhyd6[,2],Bhyd7[,2],Bhyd8[,2],Bhyd9[,2],Bhyd10[,2],BiAssoc1[,2],BiAssoc2[,2],BiAssoc3[,2],BiAssoc4[,2],BiAssoc5[,2],BiAssoc6[,2],BiAssoc7[,2],BiAssoc8[,2],BiAssoc9[,2],GF1[,2],GF2[,2])
row.names(geneCounts)=Cmin1[,1]
sizeGeneCounts=dim(geneCounts)
geneCounts = geneCounts[1:(sizeGeneCounts[1]-5),]
condition=c(rep("C_minuta",5),rep("B_hydrogenotrophica",10),rep("BiAssociated",9),rep("Germfree",2))
sampleNames=c("Cmin1","Cmin2","Cmin3","Cmin4","Cmin5","Bhyd1","Bhyd2","Bhyd3","Bhyd4","Bhyd5","Bhyd6","Bhyd7","Bhyd8","Bhyd9","Bhyd10","BiAssoc1","BiAssoc2","BiAssoc3","BiAssoc4","BiAssoc5","BiAssoc6","BiAssoc7","BiAssoc8","BiAssoc9","GF1","GF2")
colnames(geneCounts) = sampleNames

write.table(geneCounts, "geneCounts_20171009_V1.txt", sep="\t",quote=FALSE)

Input for DESeqDataSetFromMatrix

count_file='/ebio//abt3_projects//Christensenella_Project/CM12/CM12_RNA-seq//WAT_v1//HTSeq//20171006_ALL/geneCounts_20171009_V1.txt'
meta_file='/ebio//abt3_projects/Christensenella_Project/CM12/CM12_RNA-seq/WAT_v1/CM12_WAT_Design_v11.txt'

meta = read.delim(meta_file, sep='\t')
counts = read.delim(count_file, sep='\t')
# ordering
rownames(meta) = meta$SampleID
meta = meta[colnames(counts),]

dds_old_2 = DESeqDataSetFromMatrix(countData=counts,
                             colData=meta,
                             design=~Condition)

contrasts = c('Condition', 'Blautia', 'Cminuta') 
dds_old_2 <- DESeq(dds_old_2)
res_old_2 <- results(dds_old_2, contrast=contrasts)
summary(res_old_2)

out of 37999 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 145, 0.38% 
LFC < 0 (down)   : 932, 2.5% 
outliers [1]     : 142, 0.37% 
low counts [2]   : 13714, 36% 
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

SessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /ebio/abt3_projects/Christensenella_Project/anaconda3/envs/R_DEseq2/lib/R/lib/libRblas.so
LAPACK: /ebio/abt3_projects/Christensenella_Project/anaconda3/envs/R_DEseq2/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BiocParallel_1.12.0        DESeq2_1.18.1             
 [3] SummarizedExperiment_1.8.0 DelayedArray_0.4.1        
 [5] matrixStats_0.52.2         Biobase_2.38.0            
 [7] GenomicRanges_1.30.0       GenomeInfoDb_1.14.0       
 [9] IRanges_2.12.0             S4Vectors_0.16.0          
[11] BiocGenerics_0.24.0        ggplot2_2.2.1             
[13] tidyr_0.7.2                dplyr_0.7.4               

loaded via a namespace (and not attached):
 [1] bit64_0.9-7             jsonlite_1.5            splines_3.4.1          
 [4] Formula_1.2-2           assertthat_0.2.0        latticeExtra_0.6-28    
 [7] blob_1.1.0              GenomeInfoDbData_0.99.1 RSQLite_2.0            
[10] backports_1.1.1         lattice_0.20-35         glue_1.2.0             
[13] uuid_0.1-2              digest_0.6.12           RColorBrewer_1.1-2     
[16] XVector_0.18.0          checkmate_1.8.5         colorspace_1.3-2       
[19] htmltools_0.3.6         Matrix_1.2-12           plyr_1.8.4             
[22] XML_3.98-1.9            pkgconfig_2.0.1         genefilter_1.60.0      
[25] zlibbioc_1.24.0         purrr_0.2.4             xtable_1.8-2           
[28] scales_0.5.0            tibble_1.3.4            htmlTable_1.9          
[31] annotate_1.56.1         repr_0.12.0             nnet_7.3-12            
[34] lazyeval_0.2.1          survival_2.41-3         magrittr_1.5           
[37] crayon_1.3.4            memoise_1.1.0           evaluate_0.10.1        
[40] foreign_0.8-69          tools_3.4.1             data.table_1.10.4-3    
[43] stringr_1.2.0           locfit_1.5-9.1          munsell_0.4.3          
[46] cluster_2.0.6           AnnotationDbi_1.40.0    bindrcpp_0.2           
[49] compiler_3.4.1          rlang_0.1.4             grid_3.4.1             
[52] RCurl_1.95-4.8          pbdZMQ_0.2-6            IRkernel_0.7.1         
[55] htmlwidgets_0.9         bitops_1.0-6            base64enc_0.1-3        
[58] gtable_0.2.0            DBI_0.7                 R6_2.2.2               
[61] gridExtra_2.3           knitr_1.17              bit_1.1-12             
[64] bindr_0.1               Hmisc_4.0-3             stringi_1.1.6          
[67] IRdisplay_0.4.4         Rcpp_0.12.14            geneplotter_1.56.0     
[70] rpart_4.1-11            acepack_1.4.1

deseq2 deseqdatasetfrommatrix deseqdataset • 1.5k views

ADD COMMENT • link updated 7.2 years ago by Michael Love 43k • written 7.2 years ago by jillian.waters • 0

score 1 · Answer 1 · 2017-12-05

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

Can you run a test to show that counts(dds) is identical?

ADD COMMENT • link 7.2 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

I ran this, and they are clearly not identical to one another. At first glance the order of the columns is different, as well as some of the values.

After a more thorough look at the input sample information, I can now see that the discrepancy is that in my input sample_table_2 the file name does not match the correct file name, and this is what caused the error.

Apologies for having missed this before posting, and thank you so much for your response!

ADD REPLY • link 7.2 years ago jillian.waters • 0