too many genes with low read-counts as differential expressed in DESeq2
2
4
Entering edit mode
anand_mt ▴ 90
@anand_mt-10016
Last seen 3.7 years ago

Hi all,

I am analyzing an rna-seq dataset (treated vs control) with three replicates in each. After performing differential expression with DESeq2, there are way too many differentially expressed genes ! Most of them are with low readcounts (<5). Here is MA plot, vertical line at readcount 5.

Here is my summary of results:

>summary(dds.24h.res, alpha = 0.01)

out of 41405 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 7163, 17% 
LFC < 0 (down)   : 6146, 15% 
outliers [1]     : 0, 0% 
low counts [2]   : 11791, 28% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

 

Then I looked at metadata of results for threshold:

>metadata(dds.24h.res)$filterThreshold

48.74028% 
0.7043008 

 

Its too low! Then I tried turning of independentFiltering false, yet I have the same results (too many DEGs).

> summary(res.24.noFilt, alpha = 0.01)

out of 41405 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 6910, 17% 
LFC < 0 (down)   : 6056, 15% 
outliers [1]     : 0, 0% 
low counts [2]   : 0, 0% 
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

 

I also manually filtered raw counts (rowmean > 5) before running DESeq, but still I have many DEGs. PCA results show nice clustering of samples and replicates so no problem there. 

 

Is there any thing I am missing ?

Thanks.

-Anand.

deseq2 rna-seq • 9.2k views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 3 days ago
United States

hi Anand,

"there are way too many differentially expressed genes ! Most of them are with low readcounts (<5)"

There is nothing here which suggests to me that the DE calls or the filtering is too low. I would guess that your treatment induces very strong differences, and that the amount of biological and technical variation among replicates is low compared to the differences across treatment. You can get a sense of this by looking at a PCA plot.

I don't think that most of the DE genes have read count < 5, you could actually quantify this by binning the genes which low p-value by mean count. There is such a plot in the vignette.

A gene with a mean count of 5 can still have strong DE signal, for example if you have 0,0,0 vs 10,10,10 in an experiment with low biological and technical variation among replicates.

Also, regarding, too many DE genes, there was a recent similar question on the support site, and I recommended that they take a look at the DESeq2 paper, specifically the part on the use of tests incorporating a log2 fold change threshold.

ADD COMMENT
0
Entering edit mode

Hi Anand and Michael,

To illustrate that I also have some troubles with few genes that have low read-counts but are DE with a high log2FC threshold. Here is my top DE genes for padj <0.01 and |log2FC| >2. See For example Gene G1 : lo2FC= - 22.01 BUT mean condition 1 (BC.mean)= 59.93 and 0 for condition2 (BV.mean).

So, First : How to deal with these "low read-counts" genes ?

Clearly here the filtering according to Log2FC is not relevant. Moreover It is not possible to validate these genes with QPCR because of low reads.

Second : These genes are DE or not according to the model used in DESeq2 : my design is two conditions (virus-infected versus control) two genotypes B and A. When I analysed the data according to the "Example 2" from DEseq2 vignette (which fit adequately to my design). I obtained these genes. Whereas when I analysed  my design as 2 conditions only B Virus/  B control,  these genes are not DE.

Any idea ? Comment ?

names

   

BC.m

BV.m

AC.m

AV.m

log2FC

 

padj

G1

   

59

0,00

1048,90

1263,14

-22,01

 

1,30E-15

G4

   

8,9

0,41

62,91

110,18

-4,35

 

2,87E-03

G5

   

4,40

0,00

110,33

102,44

-5,28

 

1,56E-03

G6

   

97,07

0,00

1893,44

1897,97

-9,61

 

5,94E-04

G7

   

24,81

2,58

302,67

147,67

-3,27

 

1,88E-03

G8

   

67,37

16,34

35,24

60,79

-2,05

 

8,63E-05

G9

   

63,68

0,00

1530,28

1982,28

-9,00

 

2,79E-04

G10

   

18,91

0,00

365,41

336,67

-7,26

 

4,10E-03

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode
anand_mt ▴ 90
@anand_mt-10016
Last seen 3.7 years ago

Hi Michael,

Thank you for the reply and you were rite about the affect of treatment.

I have few more questions. My data here comes from kallisto, I have summarized counts and TPM values to gene level. However, when I see for few genes TPM values are very diffrent between two conditions , but still the gene is not differentially expressed or log2Foldchage is too low in  DESeq results. For ex here, TPM values differ by 2 fold change between treatment and control, but normalized counts show no diffrenece. Then I calculated FPKM values with `fpkm()`. 

 

condition counts normalizedCounts TPM FPKM
treat1
17334
11569.34
85.75486
36.79019
treat2
8242
15883.92
130.43380
50.51046
treat3
8777
15182.55
121.49322
48.28013
control1
41283
13487.88
222.30643
42.89112
control2
11569
14882.25
259.38771
47.32518
control3
14657
14369.24
252.28415
45.69382

 

I am bit confused how to explain this difference between TPM values and DESeq result. I know TPM/FPKM are not comparable between samples, but if I have draw heat-map, which values are better? TPM or rlog transfomed ?

Thanks again,

-Anand.

ADD COMMENT
2
Entering edit mode
Are you using the tximport package to build the DESeqDataSet? You should if you want to make a reasonable comparison here. See the tximport or DESeq2 vignette.
ADD REPLY
0
Entering edit mode

Hi Michael,

Thanks for suggestion. I used tximport and repeated DESeq but still result is same as above. 

txi = tximport(files = u87.coldf$file, type = 'kallisto', tx2gene = t2g, reader = fread)

txi.dds = DESeqDataSetFromTximport(txi = txi, colData = u87.coldf, design = ~Batch + Condition)

txi.res = results(object = txi.dds, contrast = c('Condition', 'treated', 'control'))

Table for same gene:

condition counts normalizedCounts TPM FPKM
treat1
17334
11841.23
85.75486
103.3063
treat2 8242
15588.66
130.43380
135.9999
treat3
8777
14805.13
121.49322
129.1642
control1
41283
14278.97
222.30643
124.5738

control2

control3

11569
14657
15200.54
14798.43
259.38771
252.28415
132.6139
129.1057

If looking at TPM values, this genes expression changes by over 2 fold. 

ADD REPLY
0
Entering edit mode

Why are the counts exactly the same after using tximport? Are you sure you are computing from a new DESeqDataSet?

Also, what does the batch look like? You can't just look at the normalized counts alone if you have batch variables in the design. What does the full colData(dds) look like?

ADD REPLY
0
Entering edit mode

Hi Michael,

Yes I am using proper DESeqDataSet generated from tximport. There are two batches, batch one contains samples treat1 and control1 which were sequened in May at high coverage, whereas batch two contains rest of samples sequenced at relatively low coverage in July. coldata looks like this,

condition  batch                   files
control_1   control batch1 control_1/abundance.tsv
control_2   control batch2 control_2/abundance.tsv
control_3   control batch2 control_3/abundance.tsv
treated_1   treated batch1 treated_1/abundance.tsv
treated_2   treated batch2 treated_2/abundance.tsv
treated_3   treated batch2 treated_3/abundance.tsv

 

I am also sharing the raw data and reproducible code that I used in case you want to take a look. Here: https://www.dropbox.com/s/ffswpwv6qa9ctft/bioconductor_85509.zip?dl=0  (Its about 20mb)

I am confused regarding desecrepency between deseq results and tpm values. 

ADD REPLY
1
Entering edit mode

DESeq2 (and edgeR and limma) use a robust estimation procedure for library size (which also takes into account feature length if you use the tximport pipeline), compared to TPM which cannot be described as robust. An example is if you have large changes in expression across condition for some genes -- as it seems you do from the MA plots -- this will induce large changes on the TPMs of all other genes, even if the true expression of these genes did not change, because the TPM is restricted to sum to 1e6. In contrast, the library size procedure in DESeq2 is robust to large differential expression for some but not all genes. If there is a discrepancy between TPM and normalized counts (where the normalized counts are derived using the tximport pipeline), I'd guess it has to do with other genes with large differences across condition.

Another point to make is that here treatment seems to induce very large changes across many genes, and so any effort at normalization becomes difficult. Is your goal to identify any genes which changed at all? If you increase your sample size by 1 or 2 per group, you would likely gain thousands of genes given the MA plot. You might therefore consider using a log fold change threshold to define a more manageable set of differentially expressed genes (see DESeq2 paper and vignette).

ADD REPLY
0
Entering edit mode

Hi Michael,

Thanks for detailed clarification! Yes, the drug being used here is global repressor, so huge change is expected, and we were looking for robust DEGs. For now I am using fold change threshold on top of fdr to filter out the list. I have one more question,

Since most of the postdocs in lab are used to expression units (tpm/fpkm), how do you suggest to estimate such values here? In my opinion using `DESeq2::fpkm()` on tximport dds seems to give reasonable expression (proportional to normalized counts/DESeq results). But same cannot be said for fpm (it's still messy).

EDIT: I was looking at vignette on rlog transofrmation (section 2.1.1 Blind dispersion estimation) and it says - 

However, blind dispersion estimation is not the appropriate choice if one expects
that many or the majority of genes (rows) will have large differences in counts which
are explainable by the experimental design, and one wishes to transform the data
for downstream analysis.

This appears to be my case, so blind should be FALSE for better heatmap/downstream processes ?

ADD REPLY
1
Entering edit mode

I don't have a strong preference between TPM and FPKM. TPM are much easier to explain. The least confusing thing would be to present the gene-level TPM derived from the upstream software if you are using tximport. But if you're trying to explain about the difference between what you see with TPM and what you see in the MA plot above, you can use this contrived example to try to demonstrate the difference:

Suppose we have 5 genes only, and we are using TPT = transcripts per ten, as a measure of expression. Let's say the control sample has true transcript counts [1,1,1,1,6]. And then there is a treated sample with true transcript counts [1,1,1,1,12]. The first sample is already normalized as TPT because it sums to 10, but the second sample has TPT of [5/8, 5/8, 5/8, 5/8, 7 1/2]. Note that there is nothing wrong with the TPT, it accurately represents the proportion each gene contributes to the total, scaled such that the total is ten. However, increases in transcription for the genes with the highest TPT (or TPM) will affect the values for all the other genes, even if those genes did not undergo differences in the amount of transcription upon treatment.

Yes, I would recommend blind=FALSE here, and for most visualizations.

 

ADD REPLY
0
Entering edit mode

Excellent! A lot to think about. Sorry one last question, in one of the comment above, you mentioned - 

You can't just look at the normalized counts alone if you have batch variables in the design. 

I understand that the batch variables are not reflected in normalized counts. Whats the rational behind the sentence ?

ADD REPLY
0
Entering edit mode

I just wanted to remind you that when you have a batch variable in the design, the LFC is not simply, e.g. the average of group 2 normalized counts divided by group 1 normalized counts. The LFC is calculated within the GLM and takes into account differences due to batch.

ADD REPLY
0
Entering edit mode

Thank you. This has been informative.

ADD REPLY

Login before adding your answer.

Traffic: 887 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6