Question

DiffBind for ATACseq without replicates

0

Entering edit mode

CodeAway ▴ 70

@codeaway-12991

Last seen 4.4 years ago

Hi Rory,

I very well realize the importance of replicates as well as your advice in this forum multiple times to other questions earlier that no meaningful statistics can be done without replicates.

However, the data I have received to analyze has bad replicates. There are three matched pairs for two conditions and the biologist says that they can be loosely considered as replicates, but the initial correlation heatmaps indicate that they are quite far apart. As a result, when I use DiffBind, I don't see more than 1 differential binding site with default FDR. So, I set th=1 to get the full report.

However, my question is that if I wish to treat the three pairs separately, two at a time and just compare them without treating them as replicates, what can I do?

I see that the documentation for "dba.contrast" says the following:

minMembers when automatically generating contrasts, minimum number of unique samples in a group. Must be at least 2, as replicates are strongly advised. If you wish to do an analysis with no replicates, you can set the group1 and group2 parameters explicitly.

So, if I have just two samples, one per condition, is there a way to just get the output of fold changes? That is, process them using counts and then get the scores and outputs with fold changes? I can then potentially give these results to the biologist.

Any advice would be greatly appreciated.

Thanks a lot!

DiffBind ATACseq replicate • 3.4k views

ADD COMMENT • link updated 4.4 years ago by Rory Stark ★ 5.2k • written 4.5 years ago by CodeAway ▴ 70

0

Entering edit mode

Better tell the biologist to repeat the experiment, unreplicated "results" are guesses rather then results, only creating ghosts to chase rather than a true hypothesis. ATAC-seq is (by experience) a very robust experiment, so if some replicats fail that means that probably the "surviving" replicates are also not of good quality, so the unreliability of unreplicated results is confounded even more by (probably) bad sample quality.

ADD REPLY • link 4.5 years ago ATpoint ★ 4.8k

0

Entering edit mode

Thanks for the comment. Point taken, but, as you know, it is not so easy to tell them to simply repeat the experiement. Costs, time, etc. come inbetween. So, I can tell her that this is the best we can do by saying that it's not possible to get pValues without good replicates. However, like usually done in RNAseq DE, they expect me to at least give a list of fold changes for all the potentially differentially bound regions.

So, it would be great if I can get the steps to just compare two samples without considering them as replicates and get some fold changes.

ADD REPLY • link 4.5 years ago CodeAway ▴ 70

score 0 · Answer 1 · 2020-10-22

0

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 3 months ago

Cambridge, UK

It really depends if the matched pairs are different because they are biologically diverse, or because there were serious technical issues.

If the matched pairs are biologically diverse, you can used a multi-factor design to model this by setting the block parameter to first indicate the matching pairs, and this could still be a robust result. (In the soon-to-be-released version of DiffBind, you would do this by explicitly setting a multi-factor design formula).

If you just need to get some fold calculations and ignore the confidence statistics, you can set the score=DBA_SCORE_READS (or score=DBA_READS_MINUS), then retrieve the matrix if counts (using dba.peakset() with bRetrieve=TRUE), and simply divide the read counts for one sample column over the column for the paired sample to get a fold change calculation. (You can log this value for log fold changes).

ADD COMMENT • link 4.5 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Thank you very much! I will try that.

From your answer here https://support.bioconductor.org/p/67307/ ,

I see that, by default, the score is normalized counts, and not raw counts. Is it better to use the normalized counts instead of the raw counts for my fold change calculation? Also, what exactly is the normalization computation?

By the way, I realized that I forgot to mention that my samples are actually ATACseq, and not ChIPseq. So, is there anything different I should do for ATACseq peaks, from the standard procedure mentioned in the vignette?

My peaks are the output of the ENCODE ATACseq pipeline (narrowPeak files).

Thanks a lot for your answers.

ADD REPLY • link 4.5 years ago CodeAway ▴ 70

0

Entering edit mode

With these data, I would stick to a simple normalization. You can use score=DBA_SCORE_RPKM to get read counts normalized to the depth of sequencing and peak width.

The main thing to do differently in ATAC-seq is to change the value of summits when calling dba.count() to re-center the peaks and make them stadrd width. For ATAC-seq I usually use summits=100.

ADD REPLY • link 4.5 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Thank you very much for these tips! I have been quite a bit confused about all the possible things to try out. I use DiffBind version ‘2.12.0’.

Here is a snapshot of the PCA plot (with default settings for dba.count):

gfp <- dba.count(gfp, summits=250)
dba.plotPCA(gfp, DBA_CONDITION, label=DBA_ID)

enter image description here

First off, before even looking at this PCA plot, I guess naively, I tried using all purples as replicates and all pink as replicates of the two conditions (as advised by the biologist). I got no significant differential peaks (by DESeq2, i.e., whatever is the default).

But now, after looking at this, I guess we will have to at least drop the samples S03 and S04. If I drop those two, then I guess I will still have a sensible case on my hands. I tried that, but with the default DESeq2, I still got no significant differential peaks.

Then, I went through many of your answers in this thread, and I learnt that sometimes DESeq2 and EdgeR give very different results. I haven't tried EdgeR yet.

In addition, if I try the tips you gave me, including summit=100 and RPKM normalization, then maybe I will get something significant. I will still drop samples S03 and S04. I will have two replicates for each condition then.

Does this look like a reasonable strategy? If any other tips strike you, Rory, I would very much appreciate it.

Thanks a lot!!

ADD REPLY • link 4.5 years ago CodeAway ▴ 70

1

Entering edit mode

In addition, for ATACseq, specifically, is one of DESeq2 or EdgeR more preferable than the other?

ADD REPLY • link 4.5 years ago CodeAway ▴ 70

0

Entering edit mode

Deciding when to drop a replicate is complicated. Just dropping a sample that looks like an outlier, without understanding why it differs from the other samples, can be a mistake. If there is a clear technical issue, that is one thing, but if one sample presents a different signal it may be reflecting true biological variation.

You are using the default score for these plots, which is TMM normalized -- does the clustering change when using RPKM?

Also, the S04 sample isn't really that different. The way we plot PCAs, it visually gives equal weight to the X and Y axes, but we need to look beyond that. S04 sits right amongst the other sample on the first principal component, which account for 93% of the variation. It is separated in the second component, but this accounts for only 3% of the variation. To me it seems consistent. . PCAs focus on the differences between samples, even when they aren't actually very different. I would look at the correlation heatmap (and the correlation values returned when you call dba.plotHeatmap()) to see how closely correlated these samples are. I recall they are paired, so it would be good to look at where the matched pairs sit as well.

Generally, edgeR and DESeq2 aren't systematically different enough for one to be preferable for ATAC-seq. Other choices, especially regarding normalization, have a much greater impact on results. I generally run both analyses just to confirm that they are largely agreeing; if they do not, then I explore what is driving the differences (usually normalization, although edgeR's TMM and DESeq2's RLE are usually pretty close on the same, good data).

ADD REPLY • link 4.4 years ago Rory Stark ★ 5.2k

score 0 · Answer 2 · 2020-10-28

0

Entering edit mode

Ishrat Jahan • 0

@ishrat-jahan-24022

Last seen 4.4 years ago

true must be length 1 (length of condition), not 9 I am getting this error. but couldn't find any solution. does anyone tell me how can i use multiple statements if_else()?