Hello I'd like to use package "csaw" to analyze ATAC-seq data and identify regions with differential DNA accessibility. I am aware that csaw was developed for ChIP-seq, where the entire paired-end fragments inform us on the genomic location of a protein of interest. I have seen some usage of csaw with ATAC-seq where users were either filtering fragments to only retain those smaller than one nucleosome (thus the entire fragment can be considered "accessible"), or using all fragments regardless of their sizes (thus possibly spanning nucleosomes or other inaccessible DNA).
I feel like this is suboptimal and instead my desire is to employ the full extent of my dataset by looking at sites of Tn5 insertion. I would like to quantitate Tn5 insertion frequency at all its insertion sites, basically counting only both extremities of the paired-end fragments regardless of fragment length.
My problem is this seems impossible to achieve with the readParam and windowCounts options in csaw.
I though of the following workaround: -filter BAM files to my liking, e.g. for mapq, proper pairing, duplication, etc -split BAM files in the forward and the reverse mates and save as BED (with samtools view and bedtools bamtobed) -combine both BED files -shift the reads (in a strand-aware fashion) so their starts are 5 bp earlier on the chromosome (shifting towards 5') using bedtools shift -shorten the reads so they are only 10 bp long, using awk '{OFS="\t" print $1,$2,$3+10}' (I don't think bedtools slop can take negative values to shorten my reads) -convert back to BAM using bedtools bedtobam
I would like to know if this is a viable option. I do realize a lot of information is lost when the bam goes to bed format, so I need to know if there is any reason to believe that such an input would violate some assumptions made by csaw.
Also if there a simpler workaround to look at insertion sites, I would gladly consider it!
Thanks Alex
I personally always count the 5' ends of every forward and reverse read as this represents the Tn insertion event. I do not see the point in only counting fragments of a certain size since a regulatory element (so basically an ATAC-seq peak) is (functionally or "operationally"-defined) the union of both the nucleosome-free region and the signal that comes from the adjacent histones which are also in open (but not nucleosome-free) chromatin. The histone modifications on the histones that flank the nucleosome-free regions are important regulators, so why excluding signals from there? Counting only fragments of e.g. TLEN < 100bp throws away more than half of the available reads for that regions, and that reduces power. I personally use featureCounts (
--read2pos 5
option) but as Aaron says, simply settingext
to 1 should do the same thing in the csaw universe. Towards that 5bp shift, well, on the total scale of a peak (say somewhat 200-1000bp) or a window (100bp or whatever you define when using csaw) I don't think it matters, this is more important when working on TF footprinting where basepair-resolution is an issue.Thanks ATpoint, Your reasoning matches exactly mine. Previously I used to call peaks (with MACS2 tricked to use both ends of pairs, or Genrich), then use featureCounts (with the options you described), and analyze the results in edgeR pretty much as if it were RNAseq data. My concern came from realizing that peaks are often fairly broad, and I have noticed in my data some areas WITHIN a single peak, where DNA accessibility is different between two treatments. I feel the differential accessibility in such cases occurs on just a small fraction of the peak and may be missed if we look at whole peaks, but a window-based approach like csaw might catch it.
I am not overtly worried about the 5' shift and its impact on resolution. If there is a simple way to implement it, then I will. I am assuming that Tn5 possibly requires enough free DNA to sit properly before it can cut, so I am giving it a footprint of 10 bp (VERY arbitrary call by me). Alex
I have been realizing the same issue. Particularly when a consensus peak with a larger length is generated (by merging peaks from multiple samples). I'm also thinking of trying out this csaw approach for the same issues raised by alexandre.