Hi, I would like to use DESeq2 to analyse three paired bulk RNA-seq samples, but I am trying to figure out which model is valid here. I used tximport to import kallisto's transcript-level abundance estimates and summarise them to gene level for use with DESeq2.
In the paired samples, the treatment is overexpression of gene A. Sample information is as follows:
          condition  patient_id
BT12CONT  Control    BT1
BT12OE    OverExp    BT1
BT53CONT  Control    BT53
BT53OE    OverExp    BT53
GBM5CONT  Control    GBM5
GBM5OE    OverExp    GBM5
I am interested in looking at the condition effect while accounting for sample pairs so I thought a model like the following would be enough:
> ~ condition + patient_id
The PCA for these samples shows that they separate by patient_id rather than by condition.
Is this simple model enough to estimate the condition (treatment) effect?
Thanks! Puks
Your samples clearly cluster by cell line, not by treatment, so it seems unfortunate to use them as biological replicates. From a biological standpoint this is quite normal for cell lines. During cell line establishment a lot of things change inside the cell: particular clones start to grow out, and the cells may acquire all kinds of alterations that help them grow. It is therefore not unexpected to see large differences between cell lines (or even between different clones of the same cell line). I do not think this setup is a good choice to get the information you want. You should probably have used a single cell line and performed the overexpression study in that line in a replicated manner. This would give you the power to detect significant changes within the cell line. Comparing these results with the same replicated experiment in the other two cell lines would then tell you how reproducible the findings are from a biological standpoint.
Thanks ATpoint! You are correct, there should have been replicates for each cell line but unfortunately the person who performed the experiment did not do it.
I have to disagree with ATpoint here. It is actually a good design to use cell lines derived from multiple patients. This ensures that the list of differentially expressed genes that OP finds is not specific to one (arbitrarily chosen) patient but has some generality, and hence is likely to overlap well with the list one would find if one tried again with different patients.
The fact that the difference between patients is larger than that between treatment and control indicates that the treatment has only a small effect overall: either a small effect on many genes, or a large one on only a few genes. If the latter is the case, including "patient_id" in the model will allow you to find these genes (because DESeq2 will look at the differences between treatment and control within each sample pair).
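To illustrate why the patient term helps, here is a minimal base-R sketch for a single gene. The log2 expression values are made up for illustration, and an ordinary linear model stands in for DESeq2's negative binomial GLM, so this is only an analogy. In this balanced layout both models give the same estimate for the condition effect, but the model with patient_id judges it against the within-pair scatter rather than the large between-patient differences.

```r
# Hypothetical log2 expression of one gene (made-up numbers, not real data):
# large patient-to-patient offsets, small but consistent treatment effect.
d <- data.frame(
  patient_id = factor(rep(c("BT1", "BT53", "GBM5"), each = 2)),
  condition  = factor(rep(c("Control", "OverExp"), times = 3),
                      levels = c("Control", "OverExp")),
  log2expr   = c(5.0, 6.0, 9.0, 10.1, 2.0, 2.9)
)

# Model ignoring the pairing vs. model accounting for it.
fit_unpaired <- lm(log2expr ~ condition, data = d)
fit_paired   <- lm(log2expr ~ patient_id + condition, data = d)

# Within-pair differences (OverExp minus Control, one per patient).
diffs <- with(d, log2expr[condition == "OverExp"] -
                 log2expr[condition == "Control"])

# Estimate, standard error, t value and p-value for the condition term.
coef(summary(fit_unpaired))["conditionOverExp", ]
coef(summary(fit_paired))["conditionOverExp", ]
mean(diffs)
```

Comparing the two printed rows shows the same estimate with a much smaller standard error in the paired fit.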
If, however, the treatment causes genes to change only slightly, the experiment is underpowered with just three patients and will return nothing. Performing it with many replicates from the same patient, by contrast, would produce many hits, but these may not be very useful, because they could be specific to that one patient.
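For reference, a paired analysis along these lines could be sketched as follows. The counts are simulated placeholders so that the code runs on its own; with real data one would presumably build the object from the tximport result instead (shown as a comment, where `txi` is an assumed variable name). The fitted model is the same whichever way the two terms are ordered, but `results()` by default reports the last variable in the design, so it is safer either to put condition last or to request the contrast explicitly, as done here.

```r
library(DESeq2)

# Sample table as in the question (row names are the sample IDs).
coldata <- data.frame(
  condition  = factor(rep(c("Control", "OverExp"), times = 3),
                      levels = c("Control", "OverExp")),
  patient_id = factor(rep(c("BT1", "BT53", "GBM5"), each = 2)),
  row.names  = c("BT12CONT", "BT12OE", "BT53CONT", "BT53OE",
                 "GBM5CONT", "GBM5OE")
)

# With real data (assuming `txi` holds the tximport() output):
# dds <- DESeqDataSetFromTximport(txi, colData = coldata,
#                                 design = ~ patient_id + condition)

# Simulated placeholder counts, only so that this sketch is runnable.
set.seed(1)
counts <- matrix(rnbinom(1000 * 6, mu = 100, size = 5), ncol = 6,
                 dimnames = list(paste0("gene", 1:1000), rownames(coldata)))

# patient_id absorbs the pairing; condition is the effect of interest.
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ patient_id + condition)
dds <- DESeq(dds)

# Request the OverExp vs Control comparison explicitly.
res <- results(dds, contrast = c("condition", "OverExp", "Control"))
summary(res)
```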