Question

csaw: how to hack filtering windows by negative input controls?

0

Entering edit mode

Jenny Drnevich ★ 2.0k

@jenny-drnevich-2812

Last seen 4 months ago

United States

Hi all,

Aaron - thank you for the excellent csaw package and extensive users guide! I am analyzing a ChIPSeq experiment that has 60 IP samples + 60 inputs. There are 3 treatments X 4 brain regions = 12 groups X 5 reps each. It seems a shame not to use the input samples and I was following section 3.5.3 to compare the scaled average of the IPs to the scaled average of the inputs. However, I'm worried that binding sites that only occur in one treatmentXregion group or even one brain region will not be enough to make the (average IP vs average input) > 3 FC. The inputs are good negative controls in that there is no correlation between an individual's IP and input, so I had the idea to do something similar to how I filter out "unexpressed" genes in RNA-Seq: compare each IP's scaled value to the average scaled input, and then require any 5 samples to have > 3 FC.

Assuming you think this is a reasonable approach, I can't quite figure out how to hack your functions to calculate this. filterWindows() returns the average background abundances but I'm not sure exactly what to pull out of scaledAverage() to get individual scaled averages for the IP. Or do I not even need to worry about scaledAverage() because for IP scale = 1, but instead call cpm() in something like this:

library(csaw)
library(edgeR)

#set up params and count individual windows
param <- readParam( dedup = T, minq = 30)
dt <- windowCounts(bam_files, ext = 250, width = 150, filter = 0, param = param)

#get binned counts
dt_bin <- windowCounts(bam_files, bin = TRUE, width = 10000, filter = 0, param = param)

#separate binned windows & get scale factors for input controls
dt_con_bin <- dt_bin[, grepl("input", bam_files)]
dt_exp_bin <- dt_bin[, !grepl("input", bam_files)]

dt_scale_inf <- scaleControlFilter(dt_exp_bin, dt_con_bin)

#separate individual windows
dt_con <- dt[, grepl("input", bam_files)]
dt_exp <- dt[, !grepl("input", bam_files)]

#run filterWindows to get average scaled background
dt_filt <- filterWindows(dt_exp, dt_con, type = "control", prior.count = 5, scale.info = dt_scale_inf)

#run cpm to get individual IP abundances
IPcpm <- cpm(asDGEList(dt_exp, assay = 1), prior.count = 5, log = TRUE)

# set up filter
filter.stat <- IPcpm - dt_filt$back.abundances
rep.size <- 5
keep <- rowSums(filter.stat > log2(3)) >= rep.size

#filter windows
dt_filtered  <- dt_exp[keep, ]

Does this seem reasonable? One last question - should I re-bin the window counts with only IP samples in order to estimate normalization factors, or is it fine to use the object dt_exp_bin?

Thanks!

csaw • 1.3k views

ADD COMMENT • link updated 6.8 years ago by Aaron Lun ★ 28k • written 6.8 years ago by Jenny Drnevich ★ 2.0k

score 0 · Answer 1 · 2018-07-19

I haven't tried looking at "at least N" filtering strategies in more complex experimental designs. However, I have the feeling that the statistical problems with this filtering strategy in a 2-vs-2 experiment will not go away with more replicates and groups. Certainly, tweaking some of my existing simulations (see here) suggests that requiring "at least 5" to retain a peak will inflate the dispersions and make the results much more conservative.

I would instead recommend adapting your filtering strategy to the comparisons of interest. The average across all libraries is (approximately) independent of the p-value for any given comparison, assuming your design matrix can be formulated to have an intercept column. (As an aside: the design matrix doesn't have to have an intercept, it just has to have columns that are linearly dependent with the intercept.) Thus, the average abundance filter can be generally used for any comparison. For comparisons between two groups, you can use the average of the samples in those groups, as the lack of independence with other groups doesn't matter. This should allow you to retain more appropriate sites for specific comparisons of interest.

So in conclusion: I would apply a relaxed filter using the average of all libraries to get rid of low-abundance regions, and then apply more specific filters after running edgeR (but just before correcting for multiple testing) depending on what comparison you're performing. This allows you to have your cake and eat it too. Keep in mind, though, that there are some interactions between filtering and normalization when you're dealing with efficiency biases. So in summary, something like this:

Apply a relaxed filter based on the average abundance of all libraries, giving a filtered object A.
If normalizing for efficiency biases, applying a stringent filter to get another object B, compute the normalization factors from B, and then assign the normalization factors back to A.
Perform dispersion estimation, DE testing, etc. with edgeR on A.
Apply comparison-specific filters for each contrast to further subset the windows to an object C (and also the test results), and proceed to mergeWindows, combineTests, etc. on C.

If you have blocks of comparisons (e.g., group X is compared to Y, and Z is compared to W), you could take the average of X and Y and the average of Z and W separately, and require your initial filtering step to keep any window where the X/Y or Z/W average was above the threshold. This would still preserve independence within each block of comparison, while simplifying the procedure down to a single filter. However, it doesn't work if every group could be compared to every other group.

It's true that this is a bit of a pain, which is why I never put it in the user's guide. But I would also say that if you have multiple comparisons of interest, it's impossible to have a single independent filter that's optimal for all of them.

And using dt_exp_bin is fine, the counts and retention of windows are independent when bin=TRUE.