DESeq2 baseMean from DiffBind summarizedExperiment
1
0
Entering edit mode
rbronste ▴ 60
@rbronste-12189
Last seen 5.0 years ago

So I am using DiffBind to create a count matrix as summarizedExperiment and then running this through DESeq2 and I was wondering about ways to filter for specific baseMean ranges within this data? My script looks like the following:

library("DESeq2")
library("ggplot2")
library("BiocParallel")
library("DiffBind")

parallel=TRUE
BPPARAM=MulticoreParam(4)

# Create a dba.count object, important that dba.count be unnormalized (DBA_SCORE_READS).

ATAC_adult_MA<- dba(sampleSheet = "Adult_ATAC_MA1.csv", peakFormat="narrow")

ATAC_Adult_MA_count<-dba.count(ATAC_adult_MA)

# Retrieve counts from dba.count

rangedCounts <- dba.peakset(ATAC_Adult_MA_count, bRetrieve=TRUE)

nrows <- 1095036
ncols <- 12
counts <- matrix(runif(nrows * ncols, 1, 1e5), nrows)
rowRanges<-GRanges(rangedCounts)

sampleNames<-c("MMV1",    "MMV2",    "MMV3",    "FMV1",    "FMV2",    "FMV3", "MME7", "MME8", "MME9", "FME1", "FME2", "FME3")
sampleSex<-c("male", "male", "male", "female", "female", "female", "male", "male", "male", "female", "female", "female")
colData<-data.frame(sampleName=sampleNames, sex=sampleSex)

# Retrieve counts from dba.count without the site interval ranges

counts <- as.matrix(mcols(rangedCounts))

# Construct a SummarizedExperiment

se<-SummarizedExperiment(assays=list(counts=counts),rowRanges=rowRanges, colData=colData)

# DESeq2 analysis

table(sampleSex)

dds<-DESeqDataSet(se, design= ~sex)

dds$group <- factor(paste0(dds$sex))

design(dds) <- ~0 + group

dds <- dds[ rowSums(counts(dds)) > 1, ]

dds <- DESeq(dds, betaPrior = FALSE)
deseq2 diffbind basemean summarizedexperiment • 2.2k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 4 hours ago
United States

I don’t follow the question. You want to pre-filter the dds based on the mean of counts? You can pre-filter the object using dds[keep,] where keep may be defined as rows with mean of normalized count greater than some value or rows where x or more samples have a count greater than or equal to y. This is up to you.

ADD COMMENT
0
Entering edit mode

Yes actually that answers part of it, however an even deeper and perhaps slightly naive question is how the baseMean called by DESeq2 relates to the original counts and how does that change with normalization within DESeq2 - trying to determine if the problem is with the stringency of called peaks by MACS2 or something happening to during the differential peak calling. An example being that a lot of shared peaks among samples may have a high baseMean such as those at promoters with the differential peaks having low baseMeans relatively - therefore the overall change in the differential peaks is small. Thanks again!

ADD REPLY
0
Entering edit mode

Yes this is quite helpful however I guess I am still a bit confused as what baseMean represents in an RNA-seq experiment vs the experiment I am indicating with the code above? The binding matrix in question above is made from ATAC-seq data. Thanks.

ADD REPLY
1
Entering edit mode

The column that DESeq2 creates called “baseMean” is rowMeans(counts(dds, normalized=TRUE)) regardless of the type of input data.

ADD REPLY
0
Entering edit mode

Sure of course, I just mean that baseMean in gene expression data represents to some approximation the transcript abundance however in ChIP or ATAC data merely represents the fragment pileup at a specific site, unless I am incorrect? 

ADD REPLY
1
Entering edit mode

I'll leave it to the DiffBind package authors I suppose

ADD REPLY
1
Entering edit mode

In this case, where the count score is DBA_SCORE_READS, the binding matrix does not have pileup scores, but rather the total number of reads that align overlapping each (consensus) peak. This is similar to what you get in RNA data, where the expression matrix has the raw number of reads that overlap the gene (exons), only in the ATAC case, instead of transcripts, they are overlapping open chromatin regions.

ADD REPLY
0
Entering edit mode

Thanks! That explains it really nicely. One additional question is whether in ATAC data using DBA_SCORE_READS this consensus means anything like it does with the RNA-seq data where differences in expression of a gene are inferred. Im assuming it simply means smaller or larger open chromatin regions however what that amounts to functionally, unlike the gene expression data, is not well defined. That is really what I was getting at with what baseMean really represents across the two experimental types.  

ADD REPLY

Login before adding your answer.

Traffic: 714 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6