Question

DESeq2 for mulitple tissues under 2 conditions and contrast options

0

Entering edit mode

bioinf ▴ 10

@bioinf-12080

Last seen 11 weeks ago

United States

Hi,

I have paired rnaseq data from multiple samples, counted with featureCounts, now planning to use DESeq2 and trying to design it. I have gone through DESEq2 comparison with mulitple cell types under 2 conditions. However, I would like to confirm if my design is correct or not?

Here is the sample coldata:

    tissue condition
sample1_WA1 WA1 Wild
sample2_WA2 WA2 Wild
sample3_WA3 WA3 Wild
sample4_WB1 WB1 Wild
sample5_WB2 WB2 Wild
sample6_WB3 WB3 Wild
sample7_WC1 WC1 Wild
sample8_WC2 WC2 Wild
sample9_WC3 WC3 Wild
sample10_MA1 MA1 Mutant
sample11_MA2 MA2 Mutant
sample12_MA3 MA3 Mutant
sample13_MB1 MB1 Mutant
sample14_MB2 MB2 Mutant
sample15_MB3 MB3 Mutant
sample16_MC1 MC1 Mutant
sample17_MC2 MC2 Mutant
sample18_MC3 MC3 Mutant
sample19_WE1 WE1 Wild
sample20_WE2 WE2 Wild
sample21_WE3 WE3 Wild
sample22_WD1 WD1 Wild
sample23_WD2 WD2 Wild
sample24_WD3 WD3 Wild

where A,B,C,D,E are five tissue types and D and E are from wild condition only. 1,2 and 3 are biological replicates.

I want to perform:

(i) comparison of differentially expressed genes between all tissue types in wild

(ii) comparison of differentially expressed genes between all tissue types in mutant

(iii) comparison of differentially expressed genes between for all tissue types between wild versus mutant

(iv) comparison of differentially expressed genes between between D and E

How should I setup the design with replicates? Is this correct:

dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ tissue)

dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Warning message:
In checkForExperimentalReplicates(object, modelMatrix) :
  same number of samples and coefficients to fit,
  estimating dispersion by treating samples as replicates.
  read the ?DESeq section on 'Experiments without replicates'

Please guide Michael Love

Thanks!

deseq2 rnaseq featurecounts DESeqDataSetFromMatrix • 4.9k views

ADD COMMENT • link updated 7.2 years ago by Michael Love 43k • written 7.3 years ago by bioinf ▴ 10

1

Entering edit mode

I'm guess that e.g. WA1 is replicate one of WA. To get your analysis started, you'll at least need to separate out that using eg tidyr::extract(coldata, c("Tissue","Replicate", "(..)([123])") . Then you'll probably need to clarify what exactly you mean by "between all tissue types" - do you want a genelist for each pair or tissues, or do you want a single one that lists genes that have at least one tissue that is different from the others; or comparisons against a common 'baseline' tissue. (iii) is even more ambiguous: I think you'll only be able to use the common tissues, but do you want three lists of tissue-specific mutant vs wt, or genes that have a consistent mutant vs wt effect size across all tissues, or ...

You should also make your replication structure clear. Is there a 'batch' effect in that WA1 is closely related to WB1 (e.g. from the same individual - possible), and MB1 (unlikely from same individual, but maybe from 'batch 1'). Currently there's not really enough detail in your question to allow us to answer it.

ADD REPLY • link 7.3 years ago Gavin Kelly ▴ 690

0

Entering edit mode

@ Gavin Kelly Yes, you are right WA1 is the replicate 1 of WA.

A,B,C,D,E are five different tissues, while 1, 2 and 3 indicate biological replicates.

Would you please elaborate more on how I can rearrange coldata to make it better accessible for DESeqDataSetFromMatrix ? This is the first time I am going to use DESeq2 and confused with the design step.

More details:

(i) First of all, I want a gene list and clustered heatmap in 24 samples at padj < 0.05 and gene names in the heatmap

(ii) Gene list and clustered heatmap (padj < 0.05) between all tissue types in wild (including the replicate information in the heatmap): WA (1,2,3) vs. WB (1,2,3) vs. WC (1,2,3)

(iii) Gene list and clustered heatmap (padj < 0.05) between all tissue types in mutant (including the replicate information in the heatmap): NA (1,2,3) vs. NB (1,2,3) vs. NC (1,2,3)

(iv) Gene list and clustered heatmap (padj < 0.05) between all tissue types in wild and mutant (including the replicate information in the heatmap): WA (1,2,3) vs. WB (1,2,3) vs. WC (1,2,3) vs. NA (1,2,3) vs. NB (1,2,3) vs. NC (1,2,3)

(v) Gene list and clustered heatmap (padj < 0.05) between D and E tissue from wild (including the replicate information in the heatmap): WD (1,2,3) vs. WE (1,2,3)

WA, WB, WC (and their replicates) belong to pool 1, NA, NB, NC (and their replicates) belong to pool 2 and WD, WE (and their replicates) belong to pool 3. Each pool consists of 10 individuals.

Samples run:

Lanes Samples
L1 WA1 WA2 WA3 WB1 WB2 WB3 WC1 WC2 WC3
L2 NA1 NA2 NA3 NB1 NB2 NB3 NC1 NC2 NC3
L3 WD1 WD2 WD3 WE1 WE2 WE3

ADD REPLY • link updated 7.3 years ago by Michael Love 43k • written 7.3 years ago by bioinf ▴ 10

0

Entering edit mode

Michael Love

# I tried the following for (1):

dds4group <- dds
dds4group$group <- factor(paste0(dds4group$tissue, dds4group$condition))
design(dds4group) <- ~ group
dds4group_deseq <- DESeq(dds4group)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

resultsNames(dds4group_deseq)
[1] "Intercept"  "group_Awild_vs_Amutant"   "group_Cmutant_vs_Amutant" "group_Cwild_vs_Amutant"  

[5] "group_Bmutant_vs_Amutant" "group_Bwild_vs_Amutant"   "group_Dwild_vs_Amutant"   "group_Ewild_vs_Amutant"

# It showed me the different combinations but the ones I want are not there:

"group_Awild_vs_Cwild"  
"group_Awild_vs_Bwild"
"group_Bwild_vs_Cwild"
"group_Dwild_vs_Ewild"

# Then I did:

dds4group2 <- dds
dds4group2$group2 <- factor(paste0(dds4group2$tissue, dds4group2$wild))
design(dds4group2) <- ~ group2
dds4group2_deseq <- DESeq(dds4group2)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 29 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
resultsNames(dds4group2_deseq)
[1] "Intercept"     "group2_C_vs_A" "group2_B_vs_A" "group2_D_vs_A" "group2_E_vs_A"

# Still "group2_B_vs_C" and "group2_D_vs_E" are absent.

ADD REPLY • link 7.2 years ago bioinf ▴ 10

1

Entering edit mode

When you use ‘group’ you’re not restricted to what is in resultsNames(), just use ‘contrast’ to compare the levels of ‘group’ that you want to compare.

ADD REPLY • link 7.2 years ago Michael Love 43k

0

Entering edit mode

Thanks Michael Love ! I did:

res_ACwild <- results(dds4group_deseq, contrast=c("group","Awild","Cwild"))

After subset and ordered by padj, I am interested to annotate the list of genes with ensembl and then heatmap, so I wanted to have a normalized dataframe, but it gave me a dataframe including other samples as well which I don't want:

res_ACwild_0.05_order_norm <- merge(as.data.frame(res_ACwild_0.05_order), as.data.frame(counts(dds4group_deseq,normalized =TRUE)), by = 'row.names', sort = FALSE)

Please guide.

Thanks.

ADD REPLY • link 7.2 years ago bioinf ▴ 10

0

Entering edit mode

I’d recommend getting some help from someone with R experience, or just follow code from our vignette or workflow. Here you never subset the counts matrix. And there are safer ways to go at this.

ADD REPLY • link 7.2 years ago Michael Love 43k

0

Entering edit mode

Vignette says:

select <- order(rowMeans(counts(dds,normalized=TRUE))

which is for whole data.

I am confused on how I can get normalized count dataframe for the samples belong to the group mentioned in the contrast. :(

ADD REPLY • link 7.2 years ago bioinf ▴ 10

0

Entering edit mode

That is one example of 'select', picking the top 20 genes by the mean count.

If you want a heatmap of the top genes by adjusted p-value in a results table, you could use:

idx <- head(order(res$padj), n)

You can then use 'idx' to subset the rows of any matrix or results table produced by DESeq2, e.g. counts(dds), counts(dds, normalized=TRUE), res, assay(vsd), etc. As long as you haven't yet re-ordered or subsetted those matrices/tables, then you can index each of them with the same vector 'idx'. I wouldn't use merge() here, it could lead to some bugs.

ADD REPLY • link 7.2 years ago Michael Love 43k

0

Entering edit mode

Thanks Michael Love. I think I have been unable to make my query clear. I will explain with an example.

Whether I use any of the three below mentioned options, my output is same.

head(assay(rld), 5)                 # For my whole data
head(assay(rld_dds4group_deseq), 5) # From my previous posts in the thread 
head(assay(rld_dds4group2_deseq), 5) # From my previous posts in the thread

For my whole data, normalized dataframe (padj <0.05 and ordered) has columns for all samples:

colnames(resOrdered_all_norm)

 [1] "gene"               "baseMean"           "log2FoldChange"     "lfcSE"              "stat"              
 [6] "pvalue"             "padj"               "sample1_WA1" "sample1_WA2" 
[11] 
[16] 
[21] 
[26] 
[31] 
[36] "sample24_WD3"

But, when I use contrast, for e.g:

dds4group <- dds
dds4group$group <- factor(paste0(dds4group$tissue, dds4group$condition))
design(dds4group) <- ~ group
dds4group_deseq <- DESeq(dds4group)
res_ACwild <- results(dds4group_deseq, contrast=c("group","Awild","Cwild"))
res_ACwild_0.05 <- subset(res_ACwild, res_ACwild$padj < 0.05)
res_ACwild_0.05_order <- res_ACwild_0.05[order(res_ACwild_0.05$padj),]
res_ACwild_0.05_order

log2 fold change (MLE): group Awild vs Cwild
Wald test p-value: group Awild vs Cwild
DataFrame with 70 rows and 6 columns
baseMean log2FoldChange     lfcSE       stat       pvalue         padj
<numeric>      <numeric> <numeric>  <numeric>    <numeric>    <numeric>
ENS...   159.20760      -5.626086 0.3587791 -15.681195 2.033900e-55 5.517768e-51
...                       ...            ...       ...        ...          ...       
ENS...  99.092907       1.966425 0.5159984   3.810913 0.0001384543   0.04268325

I got 70 genes for Awild versus Cwild groups, which are 6 samples in total.

Therefore, my desired normalized dataframe should have only 6 samples however, it is showing all samples for 70 genes.

This is what I am unable to figure out. :(

I have to make many other pairwise comparisons and I am stuck at the first one only since long days.

ADD REPLY • link 7.2 years ago bioinf ▴ 10

0

Entering edit mode

So you need to subset the columns. In R you subset a matrix by columns using a paradigm: [,idx]

Might be good to lookup an R reference for basic object manipulation. I’ve found “Quick R” to be a good one. (Search google for “Quick R”)

ADD REPLY • link 7.2 years ago Michael Love 43k

0

Entering edit mode

Thanks Michael Love. I am aware of that and was just wondering if inside DESeq2 there is some automated way while using contrast.

Anyways, thank you so much for answering my queries. :)

ADD REPLY • link 7.2 years ago bioinf ▴ 10

score 1 · Answer 1 · 2017-11-02

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 16 hours ago

United States

First you need to clean up your colData so that you have:

condition: "wild" or "mutant" (make wild the reference level of condition, see vignette for instructions)
tissue: "A", "B", "C", "D", or "E"

I can't exactly tell what you are looking for in (iii), maybe you can restate. For the others, you can use a design of ~tissue + condition, and then simply use results() with an appropriate 'contrast', e.g.:

contrast=c("condition","mutant","wild")

or

contrast=c("tissue","E","D")

ADD COMMENT • link 7.3 years ago Michael Love 43k

0

Entering edit mode

So I can use:

dds$condition <- factor(dds$condition, levels = c("wild","mutant"))

How can I keep replicate numbers also during analysis and heatmap so that I can see if there is any variation between replicates?

In (iii) I want to compare between different tissue types from mutant.

ADD REPLY • link 7.3 years ago bioinf ▴ 10

1

Entering edit mode

I don't follow, what's the problem. The replicate numbers contribute nothing to the differential expression, unless, are you saying that all the replicate 1's have something to do with each other? A heatmap will allow you to see the variation among replicates. We have example code for heatmaps in our documentation.

"between different tissue types from mutant" : the model I'm suggesting above has a tissue effect which is independent of wild or mutant status. If you want to compare all possible combinations of tissue x condition, you should follow the advice in this first code chunk:

http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions

ADD REPLY • link 7.3 years ago Michael Love 43k

0

Entering edit mode

Thanks Michael Love. I cleaned my data as per your suggestions and ran DESeq2. Its awesome. :-)

I did:

dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~tissue + condition)
dds$condition <- factor(dds$condition, levels = c("wild","mutant"))
dds_deseq <- DESeq(dds)
res <- results(dds_deseq)

Now, I would like to do comparison between

1) "A" vs."B" tissues under "wild" condition

I tried the following:

res_ad <- results(dds_deseq, contrast=c("tissue","A","B"))
res_ad_0.05 <- subset(res_ad, res_ad$padj < 0.05)

But, it takes "A","B" from both "wild" and "mutant" while I want only for "wild".

2) Together AC vs. B ("AC" vs. "B" tissues under "wild" condition)

I have already annotated my dataframe for all samples with ensembl and is there any possibility to use that dataframe for comparisons (e.g. to find list of differentially expressed genes for some samples?

Please guide.

Thanks.

ADD REPLY • link 7.2 years ago bioinf ▴ 10