Question

Combining 2 datasets from 2 different RNAseq experiments

l.s.wijaya • 0

@lswijaya-21856

Hi All,

This question may already have been asked. If so, please point me to the link and close this thread. If not, here is my question.

I have performed 2 different RNA-seq experiments in the same cell lines. The 2 experiments were done at different times and under different conditions (treatment, concentration, etc.). In the 2nd experiment (which was done later), I included several samples identical to ones in the first experiment as batch controls. The problem is that when I run PCA on these identical samples, I can clearly see a batch effect: the samples form two separate clouds, one per batch.

This batch effect appears with both of the merging methods I tried. In the first method, I combined the count data from the 2 experiments and then normalized the combined data (CPM). In the second method, I normalized each data set separately (CPM) and then combined the log-normalized values. I have not calculated log2FC yet.

My question is: what is the most appropriate strategy for combining the 2 data sets so that the batch effect is diminished? Should I combine the raw data and then perform normalization and log2FC calculation on everything together (which is what I plan to do), or process the experiments separately and combine the data at the log2FC level? I use DESeq2 to calculate log2FC. Is it possible to add, for instance, "batch" to the design even though I already have "meanID" in it (meanID is a sample-specific ID consisting of the sample condition and the batch ID)?

Thanks in advance for the answers.

Lukas
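A small sketch of why the two merging orders described above can end up looking the same. It assumes plain CPM (each sample divided by its own library size, with no TMM or other cross-sample normalization factors) and the same genes in both count tables. The counts and the helper name cpm_simple are made up for illustration.

```r
# Toy counts: 4 genes x 2 samples per experiment (made-up numbers).
genes <- paste0("gene", 1:4)
exp1 <- matrix(c(10, 200, 35, 0,
                 12, 180, 40, 3),
               nrow = 4, dimnames = list(genes, c("b1_s1", "b1_s2")))
exp2 <- matrix(c(50, 900, 160, 5,
                 45, 950, 150, 2),
               nrow = 4, dimnames = list(genes, c("b2_s1", "b2_s2")))

# Plain CPM: every column is divided by its own library size.
cpm_simple <- function(counts) {
  sweep(counts, 2, colSums(counts), "/") * 1e6
}

# Method 1: merge the raw counts, then normalize.
merged_then_cpm <- cbind(exp1, exp2)
merged_then_cpm <- cpm_simple(merged_then_cpm)

# Method 2: normalize each experiment separately, then merge.
cpm_then_merged <- cbind(cpm_simple(exp1), cpm_simple(exp2))

# Under the assumptions above, both routes give the same matrix.
print(all.equal(merged_then_cpm, cpm_then_merged))
```

Because this kind of CPM only rescales each sample on its own, it cannot remove a difference between batches, whichever order is used.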

deseq2 normalization probe • 5.4k views

ADD COMMENT • link updated 4.9 years ago by Michael Love 43k • written 4.9 years ago by l.s.wijaya • 0

score 0 · Answer 1 · 2020-03-26

Michael Love 43k

@mikelove

Do you want a single estimate of the LFC, averaged across batches? If so, use the design ~batch + condition.
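As a rough illustration of that design, here is a minimal sketch on simulated data. The data set comes from DESeq2's makeExampleDESeqDataSet(), and the batch labels are invented for the example, so neither reflects the real experiment. It also assumes that every condition is present in both batches.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for the real data: 1000 genes, 12 samples,
# with a two-level 'condition' factor (A, B) created by DESeq2.
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Hypothetical batch labels, alternating so that each condition
# has samples in both batches.
dds$batch <- factor(rep(c("1", "2"), times = 6))

# Batch is adjusted for; condition is the effect of interest.
design(dds) <- ~ batch + condition
dds <- DESeq(dds)

# One log2 fold change per gene for B vs A, adjusted for batch.
res <- results(dds, contrast = c("condition", "B", "A"))
head(res)
```

With real data, batch and condition would be two separate columns of the sample table passed as colData.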

ADD COMMENT • link 4.9 years ago Michael Love 43k

Dear Dr. Michael Love,

I tried adding batch to my design, but I got this error: "the model matrix is not full rank, so the model cannot be fit as specified." I think this is because the condition name (meanID) already contains the name of the batch. My meanID has the form: compoundconcentrationbatch. Another possibility, based on your tutorial, is that the samples in the 2nd batch are not the same as those in the first batch: I only added 10 identical samples to the 2nd batch, and the rest of the samples are different. Do you think it is still possible to check the batch effect in this situation? I was not planning to calculate an average. I was planning to calculate the log2FC separately and then check the batch effect with PCA, etc. At the normalized-count level, I see the batch effect. I am currently working at the log2FC level to see whether the batch effect is still there. Thanks for the answers.
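The rank problem described above can be checked in base R before running DESeq2. This is only a sketch: the sample sheet is hypothetical, the helper is_full_rank is not part of DESeq2, and it assumes the failing design was something like ~ batch + meanID.

```r
# Hypothetical sample sheet: two replicates per treatment/batch combination.
coldata <- data.frame(
  treatment = rep(c("control", "treated", "control", "treated"), each = 2),
  batch     = rep(c("1", "2"), each = 4),
  stringsAsFactors = FALSE
)
# A combined label like the meanID column described above.
coldata$meanID <- paste0(coldata$treatment, coldata$batch)

# Helper (not part of DESeq2): TRUE if the design matrix has full column rank.
is_full_rank <- function(formula, data) {
  mm <- model.matrix(formula, data)
  qr(mm)$rank == ncol(mm)
}

# Batch is already encoded in meanID, so the batch column is redundant.
is_full_rank(~ batch + meanID, coldata)     # FALSE
# With treatment and batch in separate columns the design can be fit.
is_full_rank(~ batch + treatment, coldata)  # TRUE
```

In the first design the four meanID groups already determine the batch, so the model matrix has five columns but only rank four.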

ADD REPLY • link 4.9 years ago l.s.wijaya • 0

Can you write out the column data? I'm having a hard time parsing your text. E.g.:

batch, condition
1, control
1, treated
2, control
2, treated

ADD REPLY • link 4.9 years ago Michael Love 43k

I apologize if my explanation is confusing; I often have a hard time writing things down. In my metadata file, I made a new column called meanID, which contains the name of the condition and the batch. For instance, for the condition control in batch 1, I write control1 in meanID. I then use this meanID column in the design: design = ~meanID. Maybe if I split meanID into 2 columns, I can calculate a single log2FC value across the 2 batches by changing my design to design = ~batch + condition. By the way, in a PCA at the log2FC level, the batch effect seems to be gone, since I don't see any clear separation between identical samples from the 2 batches. However, I am not sure whether my approach is correct.

ADD REPLY • link 4.9 years ago l.s.wijaya • 0

I’m going to have to just ask again for a sample table. The posting guide also recommends this as an efficient way to communicate your experimental design.

ADD REPLY • link 4.9 years ago Michael Love 43k

Here is an example of the table:

TREATMENT, EXP_ID, meanID
ActA, Batch2, ActA_Batch2
Bef, Batch2, Bef_Batch2
ActA, Batch1, ActA_Batch1

Here I use meanID as the design, so for the treatment ActA I get 2 log2FC values, one per batch. The purpose is to see whether ActA shows inter-batch variation between those 2 batches.
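A sketch of how a table like the one above could be set up with treatment and batch as separate columns, using base R only. Each row is repeated twice as a stand-in for replicates, which is an assumption, since the real replicate numbers are not given here. The point it illustrates is that an additive design can still be full rank when only some treatments (here ActA) are shared between batches.

```r
# Rows from the table above, each repeated twice as a stand-in for replicates.
coldata <- data.frame(
  TREATMENT = rep(c("ActA", "Bef", "ActA"), each = 2),
  EXP_ID    = rep(c("Batch2", "Batch2", "Batch1"), each = 2),
  stringsAsFactors = FALSE
)
coldata$meanID <- paste(coldata$TREATMENT, coldata$EXP_ID, sep = "_")

# Going the other way: recover the two columns from a combined label.
parts <- do.call(rbind, strsplit(coldata$meanID, "_", fixed = TRUE))
head(parts)

# Additive design: ActA is the only treatment present in both batches,
# which is enough to separate the batch term from the Bef term.
mm <- model.matrix(~ EXP_ID + TREATMENT, coldata)
colnames(mm)
qr(mm)$rank == ncol(mm)
```

With design = ~meanID instead, the comparison of ActA between the two batches would be a contrast between two meanID levels, for example results(dds, contrast = c("meanID", "ActA_Batch2", "ActA_Batch1")) in DESeq2, where dds is a fitted object that is not shown here.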

ADD REPLY • link 4.9 years ago l.s.wijaya • 0

Sorry, I can’t help you here. I need the full information, and you’re just giving me snippets. I’m also pretty busy with teaching right now. I’d recommend consulting a statistician at your institute.