Context: I am currently analyzing some Illumina RNA-seq data. These data are from treating two different genotypes (WT & HET) with four different treatments. There are THREE biological replicates for each treatment group.
- WT treated with negative control (NC)
- WT treated with a solution of some concentration (Ace)
- WT treated with the same solution but more concentrated (MAce)
- WT treated with positive control (PC)
Same for HET.
What I have done: I used Salmon to quantify the reads and also create 100 bootstrapped samples, and then I used fishpond to perform differential gene analysis, where I was able to compare within a treatment group. For example, I was able to find differential expressed genes between A and B in the first treatment group, using A as a reference.
Question: The analysis I have done so far does not allow me to compare across treatment groups because the estimates of abundance are normalized within a treatment group. In other words, I cannot compare the log2(foldchange) from Treatment 1 to that from Treatment 2. However, I want to be able to compare across treatments. I want to be able to establish a basal level to which I can compare different treatments. But I do not know how to do so.
Attempt: Upon reading more into the fishpond documentation, I realized I could use genotype to match samples across conditions and then use interaction = True
in swish. In an attempt to follow the Interaction designs in the documentation, I am stuck at creating paired samples, specifically, y2$pair <- as.numeric(factor(y2$line))
. Is line in the macrophage dataset equivalent to genotype in my dataset?
Additional Question: The mapping rates from salmon for all samples range from 68 to 81 with the majority being in the 70s. Is this typical?
R Code: Comparing Treatment Group 1 and Treatment Group 2 based on genotypes
se <- tximeta(coldata)
gse <- summarizeToGene(se)
gy <- gse
gy <- gy[,gy$condition %in% c("WT-NC", "WT-Ace", "HET-NC", "HET-Ace")]
gy$genotype <- factor(ifelse(grepl("WT", gy$condition), "WT", "HET"))
gy$condition <- factor(ifelse(grepl("NC", gy$condition), "NC", "Ace"))
with(colData(gy),
table(condition, genotype)
)
This is my first time analyzing RNA-seq data, so any additional advice is appreciated! Thank you :)
Thank you so much! I actually found way fewer DEGs across treatment groups than within treatment groups. One comparison did not even output any DEGs. But I have four follow-up questions.
1) How come you suggested I do this without pairing? Originally, I was thinking that the samples were paired by genotypes across treatments. For example, the triplicate WT samples in Treatment 1 are paired with the triplicate WT samples in Treatment 2.
2) How does fishpond deal with biological replicates in the dataset? Does it just take the average of every value from all replicates?
3) You mentioned that by using
interation=True
, it compares the LFC of one group to another group by creating pseudo-pairs? What does that mean exactly? Does it normalize the estimates of abundance using all the WT samples in Treatment 1 and Treatment 2 and create some sort of basal level to which fishpond can compare the HET samples?4) When I first started analyzing the data, I was actually using sleuth instead of fishpond. It also seems like fishpond is not very commonly used, even though fishpond gets more updates on GitHub than sleuth. Is there a reason for this? Is sleuth better than fishpond? What are some advantages and disadvantages?
Finding less differences in differences is expected. After all it had to be differential first within one group to have a change in the effect across groups.
Check out the Details section of ?swish and also the vignette section talking about the permutation schemes for interaction designs. Here we motivate how the interaction test is performed.
Biological replicates are used in the computation of the test statistic. The biological variability will decrease the test statistic as it increases with respect to the difference in expression across groups. I’d say see the Swish paper published in NAR for more about the method details.
As far as why one method is cited or downloaded more, there are many factors, but I can just note they Swish has performed well across a number of comparisons including now by external groups. Swish has been published fairly recently, and I’ve only given a handful of talks on it (I haven’t been able to attend many conferences during the pandemic), but I do know of many groups that are using it for DE in cases with some features having high inferential variance.