Question

csaw for ATAC, quantitation of Tn5 insertion sites only

1

Entering edit mode

alexandre.blais ▴ 50

@alexandreblais-23753

Last seen 3 months ago

Canada

Hello I'd like to use package "csaw" to analyze ATAC-seq data and identify regions with differential DNA accessibility. I am aware that csaw was developed for ChIP-seq, where the entire paired-end fragments inform us on the genomic location of a protein of interest. I have seen some usage of csaw with ATAC-seq where users were either filtering fragments to only retain those smaller than one nucleosome (thus the entire fragment can be considered "accessible"), or using all fragments regardless of their sizes (thus possibly spanning nucleosomes or other inaccessible DNA).

I feel like this is suboptimal and instead my desire is to employ the full extent of my dataset by looking at sites of Tn5 insertion. I would like to quantitate Tn5 insertion frequency at all its insertion sites, basically counting only both extremities of the paired-end fragments regardless of fragment length.

My problem is this seems impossible to achieve with the readParam and windowCounts options in csaw.

I though of the following workaround: -filter BAM files to my liking, e.g. for mapq, proper pairing, duplication, etc -split BAM files in the forward and the reverse mates and save as BED (with samtools view and bedtools bamtobed) -combine both BED files -shift the reads (in a strand-aware fashion) so their starts are 5 bp earlier on the chromosome (shifting towards 5') using bedtools shift -shorten the reads so they are only 10 bp long, using awk '{OFS="\t" print $1,$2,$3+10}' (I don't think bedtools slop can take negative values to shorten my reads) -convert back to BAM using bedtools bedtobam

I would like to know if this is a viable option. I do realize a lot of information is lost when the bam goes to bed format, so I need to know if there is any reason to believe that such an input would violate some assumptions made by csaw.

Also if there a simpler workaround to look at insertion sites, I would gladly consider it!

Thanks Alex

csaw ATAC-seq Tn5 insertions • 3.3k views

ADD COMMENT • link updated 4.3 years ago by Antonio Ahn ▴ 10 • written 4.4 years ago by alexandre.blais ▴ 50

0

Entering edit mode

I personally always count the 5' ends of every forward and reverse read as this represents the Tn insertion event. I do not see the point in only counting fragments of a certain size since a regulatory element (so basically an ATAC-seq peak) is (functionally or "operationally"-defined) the union of both the nucleosome-free region and the signal that comes from the adjacent histones which are also in open (but not nucleosome-free) chromatin. The histone modifications on the histones that flank the nucleosome-free regions are important regulators, so why excluding signals from there? Counting only fragments of e.g. TLEN < 100bp throws away more than half of the available reads for that regions, and that reduces power. I personally use featureCounts (--read2pos 5 option) but as Aaron says, simply setting ext to 1 should do the same thing in the csaw universe. Towards that 5bp shift, well, on the total scale of a peak (say somewhat 200-1000bp) or a window (100bp or whatever you define when using csaw) I don't think it matters, this is more important when working on TF footprinting where basepair-resolution is an issue.

ADD REPLY • link 4.4 years ago ATpoint ★ 4.7k

1

Entering edit mode

Thanks ATpoint, Your reasoning matches exactly mine. Previously I used to call peaks (with MACS2 tricked to use both ends of pairs, or Genrich), then use featureCounts (with the options you described), and analyze the results in edgeR pretty much as if it were RNAseq data. My concern came from realizing that peaks are often fairly broad, and I have noticed in my data some areas WITHIN a single peak, where DNA accessibility is different between two treatments. I feel the differential accessibility in such cases occurs on just a small fraction of the peak and may be missed if we look at whole peaks, but a window-based approach like csaw might catch it.

I am not overtly worried about the 5' shift and its impact on resolution. If there is a simple way to implement it, then I will. I am assuming that Tn5 possibly requires enough free DNA to sit properly before it can cut, so I am giving it a footprint of 10 bp (VERY arbitrary call by me). Alex

ADD REPLY • link 4.4 years ago alexandre.blais ▴ 50

0

Entering edit mode

My concern came from realizing that peaks are often fairly broad, and I have noticed in my data some areas WITHIN a single peak, where DNA accessibility is different between two treatments. I feel the differential accessibility in such cases occurs on just a small fraction of the peak and may be missed if we look at whole peaks, but a window-based approach like csaw might catch it.

I have been realizing the same issue. Particularly when a consensus peak with a larger length is generated (by merging peaks from multiple samples). I'm also thinking of trying out this csaw approach for the same issues raised by alexandre.

ADD REPLY • link 4.3 years ago Antonio Ahn ▴ 10

score 1 · Answer 1 · 2020-10-15

I've never dealt with ATAC-seq data myself, but IIUC, the 5' end of each read corresponds to the Tn5 insertion site. So if you want to test for differential accessibility at the resolution of insertion sites, you could just treat the paired-end reads as single-end reads. No need to bother making inferences on the interval between the two paired reads. This process only makes sense in ChIP-seq because we know that the binding site must be somewhere inside the fragment, but for ATAC-seq, that reasoning is not applicable because the activity of interest is happening on the ends of the fragment rather than in the middle.

So, you could set readParam() to trick it into thinking it's single-end data, set ext= to some arbitrarily short length (maybe 1 bp?) and slide a window of your desired width across the genome. A 150 bp window seems like a sensible place to start, though I don't really have a good feeling of the resolution of ATAC-seq data. For ChIP-seq, I was happy enough knowing the location of a DB peak within ~100 bp.

On a related note, I don't think it's possible to implement the 5 bp shift in csaw - are you working at resolutions where this matters? If your windows are ~100 bp wide, it doesn't really matter whether you give or take a few bp here or there.