Hello all,
I had to remove samples from my data and I end up with comparing 3 samples (cond1) vs. 2 samples (cond2) — I know that I'm not in an ideal situation. I want to get the most DEGs. I have run kallisto and import the data in R using `tximport`.
The problem is that when I compare 2 vs 1 I dont't have the same results that when I compare 1 vs 2... o.O I can share any data with whoever interested.
Here are my files: samples 13 and 14 correspond to cond1, and 16 to 18 to cond2 (I removed file 15 because I was not satisfied with it).
sample13 "TTLN13/abundance.tsv" sample14 "TTLN14/abundance.tsv" sample16 "TTLN16/abundance.tsv" sample17 "TTLN17/abundance.tsv" sample18 "TTLN18/abundance.tsv"
I import them, no problem:
txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene) pairwise #some french headers #pairwise is a subset of a bigger dataframe sample stade ponte jour lane 1 S1_1 S1 OP9 1 1 2 S1_2 S1 OP9 1 2 4 S2_1 S2 OP9 4 4 5 S2_2 S2 OP9 2 1 6 S2_3 S2 OP9 2 2 > dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, pairwise, ~stade) factor levels were dropped which had no samples using counts and average transcript lengths from tximport > dds class: DESeqDataSet dim: 21877 5 metadata(1): version assays(2): counts avgTxLength rownames(21877): '' aaas ... zzef1 zzz3 rowData names(0): colnames(5): sample13 sample14 sample16 sample17 sample18 colData names(5): sample stade ponte jour lane > dds <- dds[ rowSums(counts(dds)) > 1, ] > dds <- DESeq(dds) estimating size factors using 'avgTxLength' from assays(dds), correcting for library size estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing > counts <- counts(dds,norm=TRUE) > counts <- as.data.frame(counts) > res <- results(dds) > markers <- res[abs(res$log2FoldChange)>2 & res$padj < 0.05 & !is.na(res$padj), ] > markers log2 fold change (MAP): stade S2 vs S1 Wald test p-value: stade S2 vs S1 DataFrame with 9 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> cpa1 6444.24007 2.441987 0.2732529 8.936729 4.008585e-19 1.303057e-15 kif14 251.38984 2.088346 0.2748661 7.597683 3.014796e-14 5.345508e-11 LOC103353161 162.16751 3.518848 0.2916084 12.067033 1.577185e-33 3.076141e-29 LOC103358230 17178.41284 2.127152 0.2182673 9.745631 1.925790e-22 9.390154e-19 LOC103359291 17611.56096 2.092522 0.2768940 7.557124 4.120793e-14 6.697663e-11 LOC103361754 82.92328 2.646414 0.3090550 8.562923 1.100413e-17 2.384717e-14 LOC103373239 460.27637 2.124864 0.2425778 8.759518 1.960866e-18 4.780592e-15 LOC103373302 18387.41060 2.354269 0.2583432 9.112950 8.017192e-20 3.127346e-16 LOC103376355 103.27335 3.573329 0.3130216 11.415599 3.494866e-30 3.408193e-26
Now, I do the same as before, but I reverse the order of the samples in the condition dataframe --> the results are different!
> pairwise2 <- pairwise > pairwise2[1:3,] <- pairwise[3:5,] > pairwise2[4:5,] <- pairwise[1:2,] > pairwise2 sample stade ponte jour lane 1 S2_1 S2 OP9 4 4 2 S2_2 S2 OP9 2 1 4 S2_3 S2 OP9 2 2 5 S1_1 S1 OP9 1 1 6 S1_2 S1 OP9 1 2 > dds2 <- DESeqDataSetFromTximport(txi.kallisto.tsv, pairwise2, ~stade) factor levels were dropped which had no samples using counts and average transcript lengths from tximport > dds2 class: DESeqDataSet dim: 21877 5 metadata(1): version assays(2): counts avgTxLength rownames(21877): '' aaas ... zzef1 zzz3 rowData names(0): colnames(5): sample13 sample14 sample16 sample17 sample18 colData names(5): sample stade ponte jour lane > dds2 <- dds2[ rowSums(counts(dds2)) > 1, ] > class(counts(dds2)) [1] "matrix" > dds2 <- DESeq(dds2) estimating size factors using 'avgTxLength' from assays(dds), correcting for library size estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing > counts2 <- counts(dds2,norm=TRUE) > counts2 <- as.data.frame(counts2) > res2 <- results(dds2) > markers2 <- res2[abs(res2$log2FoldChange)>2 & res2$padj < 0.05 & !is.na(res2$padj), ] > markers2 log2 fold change (MAP): stade S2 vs S1 Wald test p-value: stade S2 vs S1 DataFrame with 3 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> LOC103361292 81.60711 3.288638 0.3039721 10.818880 2.801800e-27 2.793535e-23 LOC103361620 109.93052 2.289659 0.3047828 7.512427 5.804097e-14 3.857983e-10 LOC103376346 441.07780 -4.524589 0.2573114 -17.584097 3.261354e-69 6.503466e-65
###
I tried this with a symmetrical design (n=3 in both conditions), and there is no problem: results are identical when comparing cond1 to cond2 and when comparing cond2 to cond1. My problem seems to be due to an asymmetrical design. Would you have any idea of what is happening?
Many thanks!
Thanks Michael! Please correct me if I'm wrong. What I should do is import a sampleTable from a file with the file location info.
Then, using this table (19 rows in total), import the whole kallisto dataset:
Then import this dataset with the desired design
And finally compare what I would like to compare (the so-called contrasts):
Only then will my result be consistent from one comparison (S2 vs S1) to the other (S1 vs S2)?
Yes, and above you had switched the sample table around but not switched the count data (and other associated data compiled by tximport). DESeqDataSet and other Bioconductor objects link the assay data (counts) and the sample information.