I'm hoping to use DEseq2 to look for regions of open chromatin that differ between cell types. It seems like the general statistical framework should be highly analogous between RNA-seq and DNase-seq (or, in my case, ATAC-seq). However, I wanted to check that some of the assumptions underlying DEseq2 do in fact translate, as some of my diagnostic plots differ from what I expected from reading the (excellent) DEseq2 documentation.
Data source/background: ATAC-seq was performed on three different cell types within a single tissue (two biological replicates per cell type). Open chromatin peaks were called on independent samples (~30,000 peaks per sample), and I then calculated the per region coverage in the regions defined by the union of all peaks (~50,000 peaks) for each sample. These counts (50,00 regions x 6 samples) constitute the count table I load into DEseq2. I then ran DEseq2 using design = ~ cell_type to identify regions of differentially open chromatin across the three cell types.
Of the 50,000 peaks I examined, 23,000 pass padj < 0.1 (comparing the first two cell types). I get a similar number of differences between the second cell-type comparison, and ~2,000 differences between the third pair (these are qualitatively what I expect for these cell types, but seem quantitatively high). Here's the MA plot for the first cell-type comparison (23,000 differences):
https://www.dropbox.com/s/14nkp0bqtd318jz/Screen%20Shot%202014-09-22%20at%2011.17.48%20AM.png?dl=0
Many of these differences are small or identified in regions with low mean read counts--I can easily filter on LFC to get a more manageable list (setting lfcThreshold=1 and altHypothesis="greaterAbs" reduces this to 1034 regions). When I look at top hits, I get interesting and tissue-relevant results, so I think this is generally working. But I'm worried that getting back so many significant differences might mean my data aren't appropriate for DEseq2 (e.g. too noisy). Additional specific questions:
1) Is the DEseq2 normalization framework appropriate for open chromatin data? Specifically, the normalization by scale factors derived from the median of "pseudo-references" seems to assume similar overall distributions of reads between conditions ("most genes don't change" for RNA-seq, or "most open chromatin doesn't change" if looking at DHS peaks). Given that I get so many significant differences, is this normalization not appropriate for my data, and are there specific alternatives I should consider (total read depth seems inadequate given different signal to noise ratios between samples)? Is part of the problem that RNA-seq considers the entire transcriptome (lots of low-level background), whereas my approach to feature selection has enriched for regions that have higher "expression."
2) I expected some of the lower read depth regions to be removed by independent filtering. But the independent filtering threshold is set at 0%.
https://www.dropbox.com/s/l88z0ajku1ol5k5/Screen%20Shot%202014-09-22%20at%2011.56.21%20AM.png?dl=0
When I test for larger differences (setting lfcThreshold=1 and altHypothesis="greaterAbs"), the independent filtering threshold moves to 40%.
3) Dispersion fit: this isn't related to my concern about too many hits, but parametric fitting of my dispersion curve failed. Local fitting was successful, but I don't see any asymptotic shift at higher read depths as I've seen in other example dispersion fit plots. Is this an acceptable fit or--again--does this suggest a problem with applying DEseq2 to a different type of dataset (do my data have a weaker mean-dispersion relationship and will this undermine my results)?
https://www.dropbox.com/s/8m4xs4s1bjuo4r1/Screen%20Shot%202014-09-22%20at%2011.59.48%20AM.png?dl=0
Please let me know if I can clarify question(s) provide additional information/if I've posted this in the wrong format/forum. I consulted the posting guide--I haven't provided sample code because nothing in DEseq2 is reporting any errors, but I am happy to provide sample data/code if it will help better illustrate my question. I absolutely appreciate any input.
>sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] DESeq2_1.4.5 RcppArmadillo_0.4.400.0 Rcpp_0.11.2
[4] GenomicRanges_1.16.4 GenomeInfoDb_1.0.2 IRanges_1.22.10
[7] BiocGenerics_0.10.0
loaded via a namespace (and not attached):
[1] annotate_1.42.1 AnnotationDbi_1.26.0 Biobase_2.24.0
[4] DBI_0.3.0 genefilter_1.46.1 geneplotter_1.42.0
[7] grid_3.1.1 lattice_0.20-29 locfit_1.5-9.1
[10] RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.1.1
[13] stats4_3.1.1 survival_2.37-7 tools_3.1.1
[16] XML_3.98-1.1 xtable_1.7-4 XVector_0.4.0
Awesome--thanks for your response.
1) "It doesn't necessarily assume similar distributions, but that the median ratio will capture the size relationship."
This is very encouraging (not nearly as strong as "most genes don't change.") It's much more in line with what I would have expected from the mechanics of the scale factor estimation process.
2) "So many rows can be DE as long as there is some balancing of up/down, or if there are only one kind (say, "up-regulation"), that this be less than an majority of rows."
I do think there is genuine imbalance in the number of true peaks in my different cells--many genes "up-regulated" in two cell types relative to the third. This may not be obvious from the original MA plot I linked to, but is more pronounced when I filter on LFC:
https://www.dropbox.com/s/3zj3oist0clezl2/Screen%20Shot%202014-09-22%20at%2011.17.17%20AM.png?dl=0
I think these "are less than a majority of rows" (28,000/50,000 regions aren't called as having a significant difference without filtering, 49,000/50,000 when filtering on LFC=1), but I'm not certain. Would there be any value to estimating scale factors from the intersection of peaks instead of the union (would this prevent the effect of having more peaks in one sample/group of samples vs. another from skewing the median)? I would expect this to enrich for regions that don't harbor a difference. At the very least I'm interested to try it and see if it makes much of a difference to my original scale factor estimates.
3) "If you have a method for finding those rows which you think are non-DE, you can use the following code..."
I don't think I have a robust method for doing this--I don't have an apples to apples reference set for regions I expect to be identical in these cells using this assay. I could look for open chromatin peaks in/around genes I don't expect to be differentially expressed, but I worry about making too many leaps. I could also look at constitutively open peaks from ENCODE datasets, or try to derive a set of control regions in silico using RUVSeq. See also my question above re: looking at the intersection of peaks across samples (peaks called in every sample), which I think could bias factor estimation (in a good way) towards non-DE regions.
tl;dr: Scale factor normalization is probably appropriate for these data, and DESeq2 works out of the box. I have low dispersion between biological replicates, which enhances my power to identify small differences between regions. Filtering by LFC is likely to help me more efficiently identify regions harboring biologically meaningful differences in chromatin state.
"Would there be any value to estimating scale factors from the intersection of peaks instead of the union (would this prevent the effect of having more peaks in one sample/group of samples vs. another from skewing the median)? I would expect this to enrich for regions that don't harbor a difference."
I would try this and look at an MA plot, to look if you see a horizontal line of points (they don't have to be on the x-axis, because you haven't yet performed library size normalization). You can do a quick check with:
Where grp1idx identifies the samples in group 1.
It looks like estimating size factors over the intersection doesn't change much vs. the original list of regions (union):
Original size factors (sample 1,2,...,6): 50,000 regions
Size factors from intersection (sample 1,2,...,6): 679 regions
[I did this manually: read in counts for regions in intersection (679 rows x 6 columns), calculate geometric mean per row (region), divide each count by each row geometric mean, calculate median for each column (sample)]
I think this is a good thing/size factor estimation seems to be stable.
How tight should this horizontal line be? It's hard to tell from the raw points, but it seems to level out if I fit a line to it (this is just using the default stat_smooth, generalized additive model wtih y ~ s(x, bs = "cs")).
https://www.dropbox.com/s/m1dbcprc7r91vrg/Screen%20Shot%202014-09-24%20at%2011.06.39%20AM.png?dl=0
Thanks again for your help!
If you make an MA plot with the normalized count matrix, either by running DESeq() and then plotMA(), or
...you would want the x-axis to go through the points on the right.
Will this be different from the first MA plot (which you noted has points on the right side that generally fall below the x-axis)? Generating the plots you've described--log(cell type mean(normalized read counts)) vs. sample mean(log(normalized read counts))--for all three pairs of cell types, I see the right side generally converging near but not necessarily on the x-axis.
https://www.dropbox.com/s/1ituzbtaluuyd7z/Screen%20Shot%202014-09-24%20at%202.25.17%20PM.png?dl=0
Is this enough to conclude my size factors are not well (adequately?) estimated/can I quantify how well they are estimated? Or do I really need to work out a well-defined set of non-DE regions to make firm conclusions?
Those plots look good. The point is then to estimate the size factors from the set of rows which give plots like that, and then apply those size factors in whatever other analysis you do. It seems like the intersection of peaks is a good set of rows for estimating size factors for your dataset.
OK--thanks again. Those plots actually include all 50,000 regions, so I think the initial size factor estimation worked adequately well. I very much appreciate all of your help.