I'm currently using esATAC to process a number of ATAC-Seq samples that are done in two biological replicates across four sample groups. I'm interested in using DiffBind to calculate read counts from the resulting PeakCallingFseq.bed and Bowtie2Mapping.bam files. Ultimately, I would like to analyze differential binding using these read counts in edgeR.
I'm using the following information in a dba sample sheet:
- SampleID: a unique ID that is composed of the content from Factor and Replicate columns (8 samples)
- Factor: A combination of content from Condition and Treatment columns (4 sample groups)
- Condition: A cell type (2 conditions)
- Treatment: A genotype (2 treatments)
- Replicate: 1 or 2 (2 replicates per sample group)
- bamReads: file path to the Bowtie2Mapping.bam esATAC file
- bamControl: I'm unsure if this is pertaining to our wild-type control genotype, but I don't believe so. I don't wish to have the control samples directly adjust non-control read counts outside of the necessary normalization procedure; hence I've omitted this column.
- ControlID: I believe I can omit this column, but I saw an example as having this column present with "input" for each sample. I have included this column, but seeing as I don't have a bamControl column perhaps it's not needed.
- Peaks: file path to the PeakCallingFseq.bed esATAC file
- PeakCaller: bed (peak score is in 5th column)
I'm able to extact read counts using the below code (R 3.6.0; DiffBind 2.12.0):
#Create DBA object:
dba.in = dba(sampleSheet="sampleSheet.csv")
#Calculate counts per sample:
dba.count.norm = dba.count(dba.in)
#Export out a dataframe of TMM-normalized reads for input into edgeR:
counts.norm = dba.peakset(dba.count.norm, bRetrieve = T, DataType = DBA_DATA_FRAME)
I believe the steps I have taken are correct, but I've noticed an issue with the counts that I'm wondering may either be attributed to an error made in the steps above or possibly a limitation of TMM normalization. Our project contains samples that have varying library sizes - some samples are twice the size of others. As normalization has already been performed on the counts via the dba.count function, I've omitted normalization in edgeR (with calcNormFactors(method="none")), yet I appear to be seeing bias in our analyses that may be attributed to the vastly different library sizes. If this issue isn't due to an error made in the steps above, I'm wondering if it possible to prevent the dba.count function from applying a default TMM normalization, so that I can use various normalization methods within edgeR, such as TMMwsp or RLE.
Thank you kindly for your feedback and time!
-Tim-