[DESeq2] Introduce a bias in DESeq2
1
0
Entering edit mode
@3bb2ad9e
Last seen 6 weeks ago
France

Dear Bioconductor community,

First of all, thank you for this forum that helped me many times during my PhD years. This is the first time I'm posting a question, but I've been following the forum for a long time now.

My problem is the following : I would like to introduce bias coefficients into DESeq2, before differential expression analysis.

Basically, what I would like to do is to multiply all my raw counts in a given sample by a coefficient k.

So let's say I have two samples A and B, for two genes, a and b. I would like to transform my initial count matrix

   A   B 
a x1 x2 
b x3 x4  

in

    A       B 
a x1*k1 x2*k2 
b x3*k1 x4*k2

However, my current problem is that would only be accounted as sequecing depth, thus normalized by DESeq2. Is there a way to introduce a bias after normalization ? Does that even make sense to use DESeq2 ?

For those interested in why, it would be to use DESeq2 to compare absolute production of RNA (let's say two cell type population, with different abundance, thus not only comparing the relative enrichment of a gene in each cell type, but the total enrichment of that gene in a mix of these two cells with known cell type abundance).

Thank you very much for your help and kind support and excuse me if my question is not perfectly posed (yet!),

David Benacom

DifferentialExpression DESeq2 BatchEffect • 881 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 6 hours ago
United States

This is often called an offset in statistical modeling. We have some support site posts about how to do this in DESeq2. Search for posts mentioning 'normMatrix'.

ADD COMMENT
0
Entering edit mode

Dear Micheal,

Thank you for your answer, as I missed that feature while looking in the documentation. Here is the toy example I made using the normFactors.

For that, I calculated the sizeFactors, then I used them as an initial value that I multiplied by the inverse of my coefficients (to scale up), then, I normalized again the size factors with the geometric mean.

It behaves as I want to, i.e it gives me a statistical value for enrichment, that is modulated by my coefficients. It also corrects for library size. Does it sound statistically sound ? Thank you,

    ```{r}

# Load necessary library
library(DESeq2)
library(dplyr)

# Set seed for reproducibility
set.seed(123)

# Parameters
num_genes <- 1000
num_samples <- 10
conditions <- c("A", "B", "C")
base_count <- 100

# Generate count matrix with Poisson distribution and noise
counts <- matrix(rpois(num_genes * num_samples, lambda = base_count), nrow = num_genes, ncol = num_samples)

noise <- matrix(rnorm(num_genes * num_samples, mean = 0, sd = base_count * 0.05), nrow = num_genes, ncol = num_samples)

counts <- counts + noise
counts[counts < 0] <- 0  # Ensure no negative counts

# Round counts to the nearest integer
counts <- round(counts)


# Assign column names (sample IDs)
colnames(counts) <- paste0("sample", 1:num_samples)

# Generate coldata
coldata <- data.frame(
  sample_id = colnames(counts),
  condition = rep(conditions, length.out = num_samples)
)
rownames(coldata) <- coldata$sample_id


# Add a row to counts with expression 1 in 'A' , 'B' samples, but 1000 in 'C' samples to mimic a true "C" DE gene 
new_row <- rep(1, num_samples)
new_row[coldata$condition == "C"] <- 1000
counts <- rbind(counts, new_row)

# Double the counts for samples where condition is "B" to mimic change in library size 
counts[, coldata$condition == "B"] <- counts[, coldata$condition == "B"] * 2


# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition)

dds <- estimateSizeFactors(dds)

# Initialize normFactors with size factors
normFactors <- matrix(sizeFactors(dds), nrow = num_genes+1, ncol = num_samples, byrow = TRUE)
print(head(normFactors))


#A has thrice the absolute expression. 
scaling_factor <- 3 
# Multiply columns of condition A by 1/scaling_factor
normFactors[, c(1, 4, 7, 10)] <- normFactors[, c(1, 4, 7, 10)] * (1/scaling_factor)


#Norm factors to 1 geometric mean 
normFactors <- normFactors / exp(rowMeans(log(normFactors)))

print(head(normFactors))

# Assign normFactors. 
normalizationFactors(dds) <- normFactors

#Rerun with these factors 
dds <- DESeq(dds)

MA plots of the results

ADD REPLY
0
Entering edit mode

Did you read the support site posts that use the normMatrix argument of estimateSizeFactors?

ADD REPLY
0
Entering edit mode

Dear Micheal,

This example is really clear. As I understand, here you normalize effectively for multicopy, so that a gene that is inflated by a variation in chromosome copy number is effectively normalized for expression, and thus does not appear in the DE genes in the "results" part, because the number of DNA copies is effectively four time more.

For my part, what I want to do is give more 'weight' to some samples, so that their genes effectively appear as DE. So for instance, my NormMatrix should offset all the genes in sample A, so that they appear 3 times more expressed that they really are. But I cannot just multiply all my counts by three in my weighted samples, because then it would appear as a library size that is three time bigger, which get normalized (in the next example)

counts <- matrix(rpois(400, 100), ncol=4)
dnacopy <- matrix(1, nrow=100, ncol=4)

# counts[,1:2] <- counts[,1:2] * 3
OR 
# dnacopy[,1:2] <- 3

coldata <- data.frame(x=factor(c(1,1,2,2)))
dds <- DESeqDataSetFromMatrix(counts, coldata, ~x)
dds <- estimateSizeFactors(dds, # normMatrix=dnacopy)
## using 'normMatrix', adding normalization factors which correct for library size
dds <- DESeq(dds, quiet=TRUE)
res <- results(dds)
plotMA(res)

Similarly, if I just set the norm matrix to 3 for all genes of the first two samples, it also gets normalized.

Does that make any sense ?

ADD REPLY
0
Entering edit mode

It sounds from your description like you want the offset to go in the other direction then?

E.g. say you have a count of 100. If you supply normMatrix and that sample gets a value 2x higher than the other samples, then it's halving the 100, so it appears more like a 50 to the rest of the regression formula (in that we model E(count) = X beta + offset).

ADD REPLY

Login before adding your answer.

Traffic: 535 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6