Entering edit mode
Hi, Mike:
I aksed you some questions about DESeq2 input data about a week or so
as shown below. Then based on some other's experience and suggestions,
I did use the rounded RSEM raw counts to run DESeq2, due to the
dataset has matched tumors vs normals, I did try to use either or
paired test to derive the DEGs. But I noticed a strange observation:
For 7 genes of our interest, if I used paired test, I got the
following:
entrezID symbols gene_id baseMean log2FoldChange
lfcSE
6440 6440 SFTPC SFTPC|6440 560695.51884 -4.309999
0.4811095
7476 7476 WNT7A WNT7A|7476 210.61484 -2.609732
0.3639806
8328 8328 GFI1B GFI1B|8328 11.25229 -2.445345
0.3040805
27255 27255 CNTN6 CNTN6|27255 297.51966 -5.049929
0.3276390
89780 89780 WNT3A WNT3A|89780 295.59498 -5.037872
0.2976513
256815 256815 C10orf67 C10orf67|256815 84.88613 -4.841887
0.2890447
388939 388939 C2orf71 C2orf71|388939 51.81894 -4.567162
0.3601929
pvalue padj
6440 3.292485e-19 5.363882e-18
7476 7.501115e-13 6.306398e-12
8328 8.854943e-16 1.023742e-14
27255 1.336693e-53 4.232359e-51
89780 2.922717e-64 2.206764e-61
256815 5.536370e-63 3.881589e-60
388939 7.655804e-37 6.679604e-35
However, if I used pooled test, I got the following:
entrezID symbols gene_id baseMean log2FoldChange
lfcSE
6440 6440 SFTPC SFTPC|6440 560695.51884 -3.654053
0.4720463
7476 7476 WNT7A WNT7A|7476 210.61484 -1.826788
0.3769339
8328 8328 GFI1B GFI1B|8328 11.25229 -1.822568
0.3423880
27255 27255 CNTN6 CNTN6|27255 297.51966 -4.103604
0.3509631
89780 89780 WNT3A WNT3A|89780 295.59498 -4.424951
0.3262553
256815 256815 C10orf67 C10orf67|256815 84.88613 -4.293802
0.3183049
388939 388939 C2orf71 C2orf71|388939 51.81894 -4.068668
0.3828477
pvalue padj
6440 NA NA
7476 NA NA
8328 NA NA
27255 NA NA
89780 NA NA
256815 NA NA
388939 NA NA
Noticed that when I used paired test, the p-values of these genes look
quite good, whereas if used pooled test, the pvalues are all NAs (most
of other genes in dataset seem OK with valid p-values at least not NA
values). almost all commands are the same for these two settings,
except for the modeling step:
Pooled test:
dds <- DESeqDataSetFromMatrix(countData = countD,colData = colD,design
= ~Type);
Paired test:
dds <- DESeqDataSetFromMatrix(countData = countD,colData = colD,design
= ~Subject+Type);
The Subject holds the sample names with matched tumor and normal
samples.
I did pull out the counts data for these genes, which seem not
uncommon (i posted below FYI). T is tumors, N is normals,T4626_01A
mactched with N4626_11A for the sample patient as 4626, and so on so
forth. any idea why caused this, any issue with my data or bug(s) in
the package?
Here is the count data for these genes:
entrezID T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A
N6145_11A
6440 6440 938904 1271864 963 1642706 10488
1032959
7476 7476 117 252 53 177 8
260
8328 8328 4 30 1 19 2
9
27255 27255 399 425 6 461 8
473
89780 89780 276 500 23 310 10
547
256815 256815 53 175 3 98 4
141
388939 388939 7 227 3 202 3
116
T6146_01A N6146_11A T6776_01A N6776_11A T6777_01A N6777_11A
T6744_01A
6440 46915 610409 39544 855546 5535 658155
52865
7476 65 207 2 253 2 191
81
8328 3 9 2 13 2 11
3
27255 81 329 3 722 27 351
47
89780 34 330 8 555 3 465
31
256815 7 70 1 199 2 53
5
388939 1 22 1 73 18 116
9
N6744_11A T5932_01A N5932_11A T5933_01A N5933_11A T5936_01A
N5936_11A
6440 667035 21596 690368 11 880241 184
858491
7476 235 29 258 1 240 55
323
8328 7 1 5 0 10 0
6
27255 501 4 545 0 676 9
674
89780 361 9 457 0 409 1
565
256815 85 2 97 0 175 6
173
388939 98 4 84 0 31 0
106
T6828_01A N6828_11A T6836_01A N6836_11A T4632_01A N4632_11A
T2662_01A
6440 44239 737756 68 529280 2859 485130
13
7476 6 257 126 192 2 94
3
8328 0 14 1 9 0 10
2
27255 12 626 0 413 4 176
2
89780 11 587 8 214 0 14
3
256815 4 217 3 143 1 16
3
388939 6 94 1 57 2 0
0
N2662_11A T6147_01A N6147_11A T6743_01A N6743_11A T6829_01A
N6829_11A
6440 1526175 16149 892008 9150 1138397 55
824547
7476 794 4 281 3 317 20
174
8328 19 5 16 2 5 5
8
27255 591 4 288 7 313 232
327
89780 1173 50 710 12 402 1
320
256815 254 68 330 0 114 18
80
388939 50 4 130 1 77 0
326
T6831_01A N6831_11A T4627_01A N4627_11A T4490_01A N4490_11A
T6595_01A
6440 294 306899 8082 708377 115 1293737
81
7476 0 179 239 240 209 211
511
8328 1 12 6 37 3 5
2
27255 3 418 6 392 2 430
2
89780 1 333 67 488 7 298
8
256815 3 122 0 89 0 174
8
388939 0 10 0 63 1 31
0
N6595_11A T6835_01A N6835_11A T2657_01A N2657_11A T4676_01A
N4676_11A
6440 1145375 41647 367263 334072 1082719 42168
959756
7476 728 283 196 10 244 18
282
8328 84 7 7 18 23 2
21
27255 975 88 348 13 448 19
606
89780 1479 31 185 145 631 23
383
256815 310 12 124 1 77 6
146
388939 82 1 6 1 108 6
31
T2661_01A N2661_11A T4625_01A N4625_11A T6745_01A N6745_11A
T2665_01A
6440 413649 844142 202136 2070763 76113 1441306
82
7476 34 510 137 387 323 360
247
8328 72 32 4 44 3 8
6
27255 24 718 11 509 19 484
5
89780 15 1021 18 531 14 398
8
256815 17 121 1 152 8 192
1
388939 4 90 2 267 12 51
13
N2665_11A T2655_01A N2655_11A
6440 2200556 278151 1687880
7476 572 117 1096
8328 12 2 51
27255 895 89 921
89780 998 65 1490
256815 167 18 220
388939 123 88 142
Thanks in advance!
Ming
From: michaelisaiahlove@gmail.com
Date: Wed, 29 Jan 2014 13:01:40 -0500
Subject: Re: DESeq2 - input data questions
To: yi02@hotmail.com
CC: bioconductor@r-project.org
hi Ming,
We do not recommend using rounded estimated values of read counts with
DESeq2 (although I found an email from myself to the list one year ago
contradicting this in the case that someone had no access to the raw
data)â. Counts of reads which are proportionally assigned to genes
and then rounded can be a bad fit for distributions like the Negative
Binomial (and Poisson). For example, this procedure could generate
values arising from a distribution that has variance less than the
mean. As a rule, and to help avoid erroneous results, users should
produce a matrix containing integer counts of reads uniquely aligned
to features.
Mike
On Wed, Jan 29, 2014 at 11:16 AM, Ming Yi <yi02@hotmail.com> wrote:
Hi, Dear Michael:
Sorry to directly jump in a quick question, since I saw you respond to
some DESeq questions yesterday and I know you are the author of DESeq2
package.
I run into some RNAseq data issue with edgeR package (see my post in
BioC yesterday), since the data I used are not integer counts of
genes, but RSEM processed raw_count from TCGA RNA-seq data, in fact,
are not integers, although TCGA named it as raw_count (I showed some
piece of such a file from TCGA as below).
file name: "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem.
genes.results"
its content looks below:
gene_id raw_count scaled_estimate transcript_id
1 ?|100130426 0.00 0.000000e+00 uc011lsn.1
2 ?|100133144 4.67 1.794813e-07 uc010unu.1,uc010uoa.1
3 ?|100134869 15.33 4.271899e-07 uc002bgz.2,uc002bic.2
4 ?|10357 218.79 1.933490e-05 uc010zzl.1
5 ?|10431 1255.00 5.033411e-05 uc001jiu.2,uc010qhg.1
6 ?|136542 0.00 0.000000e+00 uc011krn.1
I used the raw_count of this file for edgeR analysis and run into some
warning messages which indicating potential issues, and had been
pointed out by Dr. Smyth as you may have seen in my post yesterday.
Reading the user guide from both DESeq and DESeq2, both packages
mentioned the input was expected to be count data in a form of a
matrix of integer values, not even (rounded) normalized counts, which
would lead to nonsensical results.
The TCGA data above apparently are not true raw counts but some kind
of RSEM processed raw_counts. Although both RSEM authors and TCGA data
staff all suggested the raw_count can be used for edgeR and DESeq (or
DESeq2) analysis, because of the issues I run into in edgeR. I wonder
what would be your opinion on this data for analysis in DESeq2?
Any advice would be appreciated very much!
Thanks and best
Ming Yi
NIH
Maryland,USA
On Wed, Jan 29, 2014 at 11:16 AM, Ming Yi <yi02@hotmail.com> wrote:
Hi, Dear Michael:
Sorry to directly jump in a quick question, since I saw you respond to
some DESeq questions yesterday and I know you are the author of DESeq2
package.
I run into some RNAseq data issue with edgeR package (see my post in
BioC yesterday), since the data I used are not integer counts of
genes, but RSEM processed raw_count from TCGA RNA-seq data, in fact,
are not integers, although TCGA named it as raw_count (I showed some
piece of such a file from TCGA as below).
file name: "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem.
genes.results"
its content looks below:
gene_id raw_count scaled_estimate transcript_id
1 ?|100130426 0.00 0.000000e+00 uc011lsn.1
2 ?|100133144 4.67 1.794813e-07 uc010unu.1,uc010uoa.1
3 ?|100134869 15.33 4.271899e-07 uc002bgz.2,uc002bic.2
4 ?|10357 218.79 1.933490e-05 uc010zzl.1
5 ?|10431 1255.00 5.033411e-05 uc001jiu.2,uc010qhg.1
6 ?|136542 0.00 0.000000e+00 uc011krn.1
I used the raw_count of this file for edgeR analysis and run into some
warning messages which indicating potential issues, and had been
pointed out by Dr. Smyth as you may have seen in my post yesterday.
Reading the user guide from both DESeq and DESeq2, both packages
mentioned the input was expected to be count data in a form of a
matrix of integer values, not even (rounded) normalized counts, which
would lead to nonsensical results.
The TCGA data above apparently are not true raw counts but some kind
of RSEM processed raw_counts. Although both RSEM authors and TCGA data
staff all suggested the raw_count can be used for edgeR and DESeq (or
DESeq2) analysis, because of the issues I run into in edgeR. I wonder
what would be your opinion on this data for analysis in DESeq2?
Any advice would be appreciated very much!
Thanks and best
Ming Yi
NIH
Maryland,USA
[[alternative HTML version deleted]]