DESeq: Question about contrast and group-wise subsetting.
1
0
Entering edit mode
hrishi27n ▴ 20
@hrishi27n-11821
Last seen 3.4 years ago
United States

Hello All,

I am performing a DEG analysis using DESeq2 and would really appreciate your advice on the results(contrast) section of my analysis. 

I have 5 disease categories(RM, RP, SP_progress, SP_stable and healthy) for each category we extracted monocytes and treated these monocytes with different concentration of a chemical (10nm,50nm,100nm,200nm and 0nm/none). The goal is to perform DE analysis within each group by comparing monocytes with different treatment compositions (Eg: For disease category RM, I am interested in comparing 10nm vs 100nm, 10nm vs 50nm, 100nm vs none..etc).  My code is as follows:

ddsComplete <- DESeqDataSetFromMatrix(countData=FilteredCounts, colData=phenoFile,design=~0+Monocytes:Disease)

ddsComplete <- DESeq(ddsComplete) 

 

PhenoFile:

head(phenoFile)

  Position Disease Monocytes

1       A1      RM    None

2       B1      RM      10nm

3       C1      RM      50nm

4       D1      RP      10nm

5       E1      RP      50nm

6       F1      RP     100nm

 

Results:

resultsNames(ddsComplete)

[1] "Monocytes100nm.DiseaseHealthy"       

[2] "Monocytes10nm.DiseaseHealthy"        

[3] "Monocytes50nm.DiseaseHealthy"        

[4] "MonocytesNone.DiseaseHealthy"      

[5] "Monocytes200nm.DiseaseHealthy"    

[6] "Monocytes100nm.DiseaseRM"            

[7] "Monocytes10nm.DiseaseRM"             

[8] "Monocytes50nm.DiseaseRM"             

[9] "MonocytesNone.DiseaseRM"           

[10] "Monocytes200nm.DiseaseRM"         

[11] "Monocytes100nm.DiseaseRP"            

[12] "Monocytes10nm.DiseaseRP"             

[13] "Monocytes50nm.DiseaseRP"             

[14] "MonocytesNone.DiseaseRP"           

[15] "Monocytes200nm.DiseaseRP"         

[16] "Monocytes100nm.DiseaseSP_progress"   

[17] "Monocytes10nm.DiseaseSP_progress"    

[18] "Monocytes50nm.DiseaseSP_progress"    

[19] "MonocytesNone.DiseaseSP_progress"

[20] "Monocytes200nm.DiseaseSP_progress"

[21] "Monocytes100nm.DiseaseSP_stable"     

[22] "Monocytes10nm.DiseaseSP_stable"      

[23] "Monocytes50nm.DiseaseSP_stable"      

[24] "MonocytesNone.DiseaseSP_stable"    

[25] "Monocytes200nm.DiseaseSP_stable"

 

In order to extract results for e.g. comparing 100nm vs 200nm for Disease category RP, I wrote the following contrast:

RP100_vs_200 = results(ddsComplete, contrast=c('Monocytes','100nm.DiseaseRP','200nm.DiseaseRP'))

>RP100_vs_200

log2 fold change (MLE): Monocytes 100nm.DiseaseRP vs 200nm.DiseaseRP

Wald test p-value: Monocytes 100nm.DiseaseRP vs 200nm.DiseaseRP

DataFrame with 16770 rows and 6 columns

           baseMean log2FoldChange      lfcSE        stat     pvalue      padj

          <numeric>      <numeric>  <numeric>   <numeric>  <numeric> <numeric>

A1BG     0.77580621     -0.4489085  0.5952707 -0.75412502 0.45077415 0.9999971

A1BG-AS1 0.14957383      0.3799384  1.7488089  0.21725552 0.82800922 0.9999971

A1CF     0.01171596     -0.3450653 10.0207498 -0.03443507 0.97253022 0.9999971

A2M      0.46024794     -1.5849584  0.7909503 -2.00386595 0.04508442 0.9999971

A2M-AS1  0.01496816     -0.3450659 10.0207498 -0.03443514 0.97253016 0.9999971

...             ...            ...        ...         ...        ...       ...

ZYG11A    1.3799655      0.1101611  0.4711617   0.2338074  0.8151345 0.9999971

ZYG11B   29.3911744     -0.2092063  0.5674405  -0.3686841  0.7123632 0.9999971

ZYX       0.8827047      0.2417958  0.6352289   0.3806436  0.7034677 0.9999971

ZZEF1     0.4399459     -0.3553409  0.7771330  -0.4572459  0.6474943 0.9999971

ZZZ3      0.5939009     -0.5578059  0.6982475  -0.7988655  0.4243684 0.9999971

 Could someone please help me understand how the contrast works? For the above result(RP100_vs_200) does the contrast function internally subsets groups and then generates table based upon the inputs given to it? If I was to subset my countData and phenoFile to just include disease category RP and re-run the analysis using design=~Monocytes, would the results for 100nm_vs_200nm  be similiar? I expect the results to be little different for RP since  the estimates would be different for subset RP considering the sample size. 

Hope the question is clear! Thank you!!!

deseq2 deseq • 2.7k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

Yes, your above setup works to contrast the groups.

"does the contrast function internally subsets groups and then generates table based upon the inputs given to it?"

The results() function constructs the relevant linear combination of coefficients to achieve the desired contrast.

"If I was to subset my countData and phenoFile to just include disease category RP and re-run the analysis using design=~Monocytes, would the results for 100nm_vs_200nm  be similiar?"

Yes, roughly similar, although subsetting will change the dispersion estimates, and so the p-values would not be identical. (This is discussed in a FAQ in the vignette, with the recommendation to not subset in general, but use the whole dataset, then use results() to make pairwise comparisons.)

I'd suggest looking at plotCounts() for the top genes in each contrast so you can confirm to yourself that you are obtaining the genes you want.

ADD COMMENT
0
Entering edit mode

Micheal, thank you for your response. I am uploading a PCA plot of my full data below. You can clearly see from the PCA that the condition "None" (untreated) is clearly separating regardless of the disease category. I tried investigating how the results looked when all samples were used in the model vs disease specific subsets, surprisingly the results looked very different. I understand the baseMean, p-values and the lfcSE to be different but even the log2FoldChange are very different. I suspect that this could be due to "None" condition, but I am not very sure about this. What are your thoughts on this? Should this dataset be analyzed by disease subset? 

https://i.imgur.com/rdWzMKc.png

ADD REPLY
0
Entering edit mode

"even the log2FoldChange are very different"

Can you make a plot of the LFC for a category using the two analysis approaches: first with all samples using the interaction term, and second with only samples from that category in the 'dds'?

You can subset to just the genes with small padj, to filter out noisy LFCs (another option would be to shrink LFCs across the range, but this would require you to change the design a bit, and also I don't know which version of DESeq2 you are using).

If you have res1 and res2:

either <- which(res1$padj < .1 | res2$padj < .1)
plot(res1$log2FoldChange[either], res2$log2FoldChange[either], cex=.1)

 

ADD REPLY
0
Entering edit mode

Micheal,

I have attached the plot below. 

This is my overall code to generate the plot:  (My data has technical replicates which I have collapsed together)

Instead of using the padj I have used pval to generate the plot, most my padj value where 0.99 for the RP : 200nm vs 50nm comparison. 

library('DESeq2')

countData <- read.csv("collapsedData.csv", header=T,sep=",",row.names=1)

phenoData <- read.csv("collapsedPheno.csv", header=T,sep=",")

countData <- countData[rowSums(countData) > 0,]

ddsAll <- DESeqDataSetFromMatrix(countData=countData,colData=phenoData,design=~0+Monocytes:Disease)

ddsAll <- DESeq(ddsAll,parallel=TRUE)

All_RM_50nm_vs_200nm <- results(ddsAll,contrast=c('Monocytes','200nm.DiseaseRM','50nm.DiseaseRM'))

############################################

RM_pheno <- phenoData[phenoData$Disease=='RM',]

RM_Frame <- subset(countData,select=RM_pheno$Position)

RM_Frame <- RM_Frame[rowSums(RM_Frame) > 0,]

ddsRM <- DESeqDataSetFromMatrix(countData=RM_Frame,colData=RM_pheno,design=~Monocytes)

ddsRM <- DESeq(ddsRM,parallel=TRUE)

Subset_RM50nm_vs_200nm <- results(ddsRM,contrast=c('Monocytes','200nm','50nm'))

either <- which(Subset_RM50nm_vs_200nm$pvalue < .05 | All_RM_50nm_vs_200nm$pvalue < .05)

plot(Subset_RM50nm_vs_200nm$log2FoldChange[either],All_RM_50nm_vs_200nm$log2FoldChange[either],cex=.2)

https://i.imgur.com/gFi64yq.png

 

ADD REPLY
1
Entering edit mode

Can you confirm that the counts in ddsRM are the same as in ddsAll[, ddsAll$Disease == "RM"] ? I'm concerned that something may be going wrong during the subsetting procedure, which mixes up the samples. A safer way to subset a DESeqDataSet (or any such matrix-like Bioconductor object) is simply:

ddsRM <- ddsAll[, ddsAll$Disease == "RM"]

Then you can reassign a design with:

design(ddsRM) <- ~Monocytes

 

ADD REPLY
0
Entering edit mode

You are right it was indeed the subsetting. The plot now shows a straight line. Thank you very much Micheal. 

Here is the plot: https://i.imgur.com/1sXmMYZ.png

ADD REPLY

Login before adding your answer.

Traffic: 462 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6