I am having trouble figuring out why 2 sets of paired samples are affecting my data so severely. I have ~100 samples, each paired (so ~200 sets of count files). I aligned my fastq files with STAR, counted genes with featureCounts, and am using DESeq2 for differential expression analysis. My design is ~sex + sex:nested + sex:condition, where nested is the pairing factor.
At first pass, I get 716 significant genes (padj < 0.05), yet many genes have count plots like this: http://i.imgur.com/2Yrd410.png
The seemingly outlier sample is present in many of the genes, so I removed the pre and post counts for that sample.
I then got 18 sig genes, most of which look like this: http://i.imgur.com/wGE53DH.png
The seemingly outlier sample is present in all of the 18 genes, so I removed the pre and post counts for that sample. Now I get no significant genes. It seems strange that 2 samples would have such a large effect on an otherwise large data set. I would appreciate any ideas and guidance as to how best proceed! Thank you!
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] IHW_1.2.0 DESeq2_1.14.1
[3] SummarizedExperiment_1.4.0 Biobase_2.34.0
[5] GenomicRanges_1.24.3 GenomeInfoDb_1.10.2
[7] IRanges_2.8.1 S4Vectors_0.12.1
[9] BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 RColorBrewer_1.1-2 plyr_1.8.4
[4] XVector_0.14.0 bitops_1.0-6 tools_3.3.1
[7] zlibbioc_1.20.0 digest_0.6.12 rpart_4.1-10
[10] memoise_1.0.0 RSQLite_1.1-2 annotate_1.52.1
[13] tibble_1.2 gtable_0.2.0 lattice_0.20-34
[16] Matrix_1.2-7.1 lpsymphony_1.2.0 DBI_0.5-1
[19] gridExtra_2.2.1 genefilter_1.56.0 cluster_2.0.5
[22] locfit_1.5-9.1 grid_3.3.1 nnet_7.3-12
[25] data.table_1.10.2 AnnotationDbi_1.36.2 fdrtool_1.2.15
[28] XML_3.98-1.5 survival_2.39-5 BiocParallel_1.8.1
[31] foreign_0.8-67 latticeExtra_0.6-28 Formula_1.2-1
[34] geneplotter_1.52.0 ggplot2_2.2.1 Hmisc_3.17-4
[37] scales_0.4.1 splines_3.3.1 assertthat_0.1
[40] colorspace_1.3-1 xtable_1.8-2 acepack_1.4.1
[43] RCurl_1.95-4.8 lazyeval_0.2.0 munsell_0.4.3
[46] slam_0.1-40