DESeq2 influence of 2 samples
1
0
Entering edit mode
@gregorylstone-12225
Last seen 6.0 years ago

I am having trouble figuring out why 2 sets of paired samples are affecting my data so severely. I have ~100 samples, each paired (so ~200 sets of count files). I aligned my fastq files with STAR, counted genes with featureCounts, and am using DESeq2 for differential expression analysis. My design is ~sex + sex:nested + sex:condition, where nested is the pairing factor.

At first pass, I get 716 significant genes (padj < 0.05), yet many genes have count plots like this: http://i.imgur.com/2Yrd410.png

The seemingly outlier sample is present in many of the genes, so I removed the pre and post counts for that sample.

I then got 18 sig genes, most of which look like this: http://i.imgur.com/wGE53DH.png

The seemingly outlier sample is present in all of the 18 genes, so I removed the pre and post counts for that sample. Now I get no significant genes. It seems strange that 2 samples would have such a large effect on an otherwise large data set. I would appreciate any ideas and guidance as to how best proceed! Thank you!

deseq2 outliers differential gene expression • 1.9k views
ADD COMMENT
0
Entering edit mode

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] IHW_1.2.0                  DESeq2_1.14.1
[3] SummarizedExperiment_1.4.0 Biobase_2.34.0
[5] GenomicRanges_1.24.3       GenomeInfoDb_1.10.2
[7] IRanges_2.8.1              S4Vectors_0.12.1
[9] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8          RColorBrewer_1.1-2   plyr_1.8.4
 [4] XVector_0.14.0       bitops_1.0-6         tools_3.3.1
 [7] zlibbioc_1.20.0      digest_0.6.12        rpart_4.1-10
[10] memoise_1.0.0        RSQLite_1.1-2        annotate_1.52.1
[13] tibble_1.2           gtable_0.2.0         lattice_0.20-34
[16] Matrix_1.2-7.1       lpsymphony_1.2.0     DBI_0.5-1
[19] gridExtra_2.2.1      genefilter_1.56.0    cluster_2.0.5
[22] locfit_1.5-9.1       grid_3.3.1           nnet_7.3-12
[25] data.table_1.10.2    AnnotationDbi_1.36.2 fdrtool_1.2.15
[28] XML_3.98-1.5         survival_2.39-5      BiocParallel_1.8.1
[31] foreign_0.8-67       latticeExtra_0.6-28  Formula_1.2-1
[34] geneplotter_1.52.0   ggplot2_2.2.1        Hmisc_3.17-4
[37] scales_0.4.1         splines_3.3.1        assertthat_0.1
[40] colorspace_1.3-1     xtable_1.8-2         acepack_1.4.1
[43] RCurl_1.95-4.8       lazyeval_0.2.0       munsell_0.4.3
[46] slam_0.1-40

ADD REPLY
0
Entering edit mode
@mikelove
Last seen 23 hours ago
United States

Can you post your full code, I want to make sure it's not something going wrong with the construction of the model matrix.

ADD COMMENT
0
Entering edit mode

Yes of course. Here is my code:

library(DESeq2)

library(IHW)

samples <- read.csv(sampleTable.csv)

sampleTable <- data.frame(sample = samples$PID, fileName = samples$featureCounts_PID, sex = samples$Sex, condition = samples$Condition, nested = samples$Nested)

sampleTable$sex <- relevel(sampleTable$sex, ref = "Male")
sampleTable$condition <- relevel(sampleTable$condition, ref = "pre")

#even though counts are from featureCounts, formatted results to be same as HTSeq-count results to use following import method

dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ 1)

model = model.matrix(~ sex + sex:nested + sex:condition, colData(dds))

apply(model, 2, function(x) all(x == 0))

results_source <- DESeq(dds, full = model, betaPrior = FALSE)

sex_diff = results(results_source, contrast = list("sexFemale.conditionpost", "sexMale.conditionpost"))

 

Thank you for the help!

 

 

ADD REPLY
0
Entering edit mode

What is the output of the apply() call?

Please post the code where you remove samples and rerun DESeq2 also.

ADD REPLY
0
Entering edit mode

The code is the same for when I remove samples. I just modify what csv file I read in.

The output of the apply() call is as follows:

(Intercept)               sexFemale          sexMale:nested        sexFemale:nested   sexMale:conditionpost
                  FALSE                   FALSE                   FALSE                   FALSE                   FALSE
sexFemale:conditionpost
                  FALSE

 

Thanks!

ADD REPLY
0
Entering edit mode

And you modify the count matrix as well?

Just want to make sure you're running a comparable analysis.

ADD REPLY
0
Entering edit mode

Yes, I do modify the count matrix as well

ADD REPLY
0
Entering edit mode

I finally can see what's going wrong, since you posted this output.

It's a common R problem: you are coding the patient as a numeric vector instead of as a factor (categorical). So patient 1 + patient 2 = patient 3, etc. 

After coding the patient as a factor, you will get ~100 of coefficients in the model.matrix for the patients, which is the point: you want to control for all the patient differences, and to estimate and test the sex-specific treatment effect.

ADD REPLY
0
Entering edit mode

Wow I'm dumb. Of course it would be something like that! Thank you so much for the help, and for being so generous with your time!

ADD REPLY

Login before adding your answer.

Traffic: 639 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6