extracting CPM from a DGElist after normalization in edgeR
2
0
Entering edit mode
@alessandroguffantigenomniacom-4436
Last seen 10.2 years ago
Dear colleagues and dear Mark: We want to extract the CPM from a DGElist object after TMM normalization, common and tagwise dispersion evaluation. Here is a sample of the code, it's just a standard edgeR DEG workflow: samplesheet1 <- read.delim("sample_sheet.txt", stringsAsFactors = FALSE) comparison <- c("Tumor_1","Tumor_2") RG1 <- readDGE(samplesheet1,header=FALSE) colnames(RG1) <- gsub(".mature_cnt.txt","",RG1$sample$files,perl=TRUE) N<-length(colnames(RG1))/2 K<-10 keep <- rowSums(cpm(RG1) > K) >= N difflist1 <- RG1[keep,] difflist1$samples$lib.size <- colSums(difflist1$counts) difflist1 <- calcNormFactors(difflist1) difflist1 <- estimateCommonDisp(difflist1) difflist1 <- estimateTagwiseDisp(difflist1) currentDiff <- difflist1 de.com <- exactTest(currentDiff,pair=comparison) ... then there is the creation of an Excel output dataframes, including the CPM worksheet: addDataFrame(*round(cpm(currentDiff)*[,currentDiff$samples$group==comp arison[1] | currentDiff$samples$group==comparison[2]]), ucpm_sheet, startRow=2, startColumn=1) Now, the CPM content leaves us quite puzzled because it is very variable according to the syntax we use to extract it, and this particular syntax seems to give values which do not sum up to 1 million as expected. This is the syntax coherent with the Apparently the only CPM values which do sum up correctly to 1 million es expected are those extracted with the following instruction head(cpm(currentDiff$counts)) But what are then these CPM values, which are different from the ones extracted with the command above ? head(cpm(currentDiff)) These CPM values are different also from the values obtained with this command head(cpm(currentDiff$pseudo.counts)) Many thanks in advance for any clarification ! Alessandro, Moreno & Paolo ---- > head(cpm(currentDiff)) LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P LT2P hsa-miR-140-5p 306.550 209.4272 233.03 273.53 581.762 263.281 140.452 285.775 633.920 679.240 232.223 396.27 203.828 hsa-miR-133a-3p 28.945 14.5648 51.61 44.04 4.116 5.495 33.441 55.981 4.393 18.394 2.609 29.45 2.474 hsa-miR-665 11.695 21.7297 36.35 16.71 4.785 18.865 46.817 3.762 12.213 19.308 20.162 3.22 32.849 hsa-miR-1250-5p 2.047 0.1175 13.03 10.21 3.344 13.462 3.541 6.396 2.357 4.166 3.202 28.35 4.123 hsa-miR-381-3p 19.223 24.1963 84.53 44.74 16.104 10.898 33.834 182.992 43.926 37.092 1.186 25.37 10.308 hsa-miR-374b-3p 39.543 7.6348 10.46 27.62 31.077 18.224 25.966 17.457 39.747 29.572 7.353 22.46 13.607 LT3P ST2P ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB hsa-miR-140-5p 401.843 309.98 408.091 231.87 469.49 664.349 213.087 551.31 870.579 326.37 hsa-miR-133a-3p 2.974 16.17 5.822 14.44 43.60 48.011 27.482 27.06 13.031 41.21 hsa-miR-665 15.179 46.63 18.831 15.42 4.69 3.359 7.923 11.81 3.425 4.57 hsa-miR-1250-5p 10.904 12.05 18.831 19.21 27.76 1.976 21.870 48.22 67.411 12.49 hsa-miR-381-3p 11.957 44.73 10.007 16.40 21.45 23.907 80.795 21.46 32.661 75.00 hsa-miR-374b-3p 30.048 30.12 15.829 21.17 23.99 46.924 13.122 28.86 36.587 19.18 > head(cpm(currentDiff$counts)) LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P LT2P hsa-miR-140-5p 215.684 196.8196 237.43 307.42 605.485 285.767 142.576 288.156 546.467 593.42 153.0286 458.283 167.040 hsa-miR-133a-3p 20.365 13.6880 52.59 49.49 4.284 5.964 33.947 56.448 3.787 16.07 1.7194 34.058 2.027 hsa-miR-665 8.228 20.4215 37.04 18.78 4.980 20.476 47.525 3.794 10.529 16.87 13.2864 3.724 26.920 hsa-miR-1250-5p 1.440 0.1104 13.28 11.47 3.481 14.611 3.594 6.449 2.032 3.64 2.1102 32.786 3.379 hsa-miR-381-3p 13.525 22.7397 86.13 50.28 16.761 11.828 34.346 184.517 37.866 32.41 0.7816 29.335 8.448 hsa-miR-374b-3p 27.822 7.1751 10.66 31.05 32.344 19.780 26.359 17.602 34.264 25.84 4.8456 25.975 11.151 LT3P ST2P ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB hsa-miR-140-5p 418.080 330.59 487.258 231.94 532.639 720.513 260.36 599.79 929.610 342.404 hsa-miR-133a-3p 3.094 17.25 6.952 14.44 49.461 52.070 33.58 29.44 13.915 43.239 hsa-miR-665 15.792 49.73 22.484 15.42 5.321 3.643 9.68 12.85 3.657 4.795 hsa-miR-1250-5p 11.345 12.85 22.484 19.22 31.491 2.143 26.72 52.46 71.982 13.100 hsa-miR-381-3p 12.441 47.70 11.948 16.40 24.338 25.928 98.72 23.34 34.876 78.687 hsa-miR-374b-3p 31.263 32.12 18.899 21.17 27.216 50.891 16.03 31.40 39.068 20.121 > head(cpm(currentDiff$pseudo.counts)) LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P LT2P hsa-miR-140-5p 215.686 196.824 237.42 307.42 605.506 285.765 142.74 288.149 546.461 593.414 153.0351 458.284 167.054 hsa-miR-133a-3p 20.362 13.698 52.59 49.49 4.256 5.961 33.80 56.483 3.793 16.071 1.7286 34.088 2.048 hsa-miR-665 8.221 20.425 37.04 18.79 4.953 20.476 47.03 3.781 10.532 16.869 13.2858 3.712 26.891 hsa-miR-1250-5p 1.432 0.121 13.29 11.47 3.461 14.617 3.71 6.449 2.037 3.642 2.1169 32.804 3.399 hsa-miR-381-3p 13.517 22.749 86.14 50.28 16.735 11.826 34.44 184.584 37.867 32.407 0.7897 29.345 8.461 hsa-miR-374b-3p 27.838 7.186 10.65 31.05 32.352 19.779 26.37 17.595 34.259 25.835 4.8545 25.973 11.165 LT3P ST2P ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB hsa-miR-140-5p 418.084 330.60 487.258 231.96 532.637 720.512 260.347 599.80 929.622 342.403 hsa-miR-133a-3p 3.073 17.26 6.948 14.45 49.475 52.067 33.578 29.45 13.909 43.238 hsa-miR-665 15.786 49.68 22.485 15.42 5.309 3.645 9.673 12.85 3.649 4.794 hsa-miR-1250-5p 11.334 12.86 22.482 19.23 31.489 2.145 26.717 52.48 72.001 13.099 hsa-miR-381-3p 12.430 47.70 11.945 16.41 24.330 25.929 98.743 23.34 34.875 78.686 hsa-miR-374b-3p 31.272 32.12 18.897 21.18 27.212 50.889 16.025 31.40 39.072 20.120 > colSums(cpm(currentDiff)) LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P LT2P LT3P ST2P 1421292 1064057 981465 889765 960819 921314 985099 991736 1160034 1144623 1517511 864691 1220229 961164 937648 ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB 837525 999688 881438 922050 818447 919173 936499 953165 > colSums(cpm(currentDiff$counts)) LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P LT2P LT3P ST2P ST10P ST12P ST7P ST8P ST4P 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 ST5P ST11P SB 1e+06 1e+06 1e+06 > colSums(cpm(currentDiff$pseudo.counts)) LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P LT2P LT3P ST2P ST10P ST12P ST7P ST8P ST4P 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 ST5P ST11P SB 1e+06 1e+06 1e+06 > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 LC_MONETARY=Italian_Italy.1252 [4] LC_NUMERIC=C LC_TIME=Italian_Italy.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] xlsx_0.5.5 xlsxjars_0.5.0 rJava_0.9-6 edgeR_3.4.2 limma_3.18.9 loaded via a namespace (and not attached): [1] tools_3.0.2 ----------------------------------------------------------- Il Contenuto del presente messaggio potrebbe contenere informazioni confidenziali a favore dei soli destinatari del messaggio stesso. Qualora riceviate per errore questo messaggio siete pregati di cancellarlo dalla memoria del computer e di contattare i numeri sopra indicati. Ogni utilizzo o ritrasmissione dei contenuti del messaggio da parte di soggetti diversi dai destinatari è da considerarsi vietato ed abusivo. The information transmitted is intended only for the per...{{dropped:10}}
edgeR edgeR • 2.5k views
ADD COMMENT
0
Entering edit mode
Devon Ryan ▴ 200
@devon-ryan-6054
Last seen 8.9 years ago
Germany
The difference in output when running cpm() on a matrix and a DGEList is due to the latter using the normalized library size. In other words, the latter method does: t(t(currentDiff$counts)/(1e-6*currentDiff$samples$lib.size*currentDiff $samples$norm.factors)) rather than t(t(currentDiff$counts)/(1e-6*colSums(currentDiff$counts))) Utilizing the normalization factor would seem to be a much better idea in most cases. Best, Devon -- Devon Ryan, Ph.D. Email: dpryan@dpryan.com Laboratory for Molecular and Cellular Cognition German Centre for Neurodegenerative Diseases (DZNE) Ludwig-Erhard-Allee 2 53175 Bonn Germany <devon.ryan@dzne.de> On Thu, Apr 3, 2014 at 11:58 AM, alessandro.guffanti@genomnia.com < alessandro.guffanti@genomnia.com> wrote: > Dear colleagues and dear Mark: > > We want to extract the CPM from a DGElist object after TMM > normalization, common and tagwise dispersion evaluation. > > Here is a sample of the code, it's just a standard edgeR DEG workflow: > > samplesheet1 <- read.delim("sample_sheet.txt", stringsAsFactors = FALSE) > comparison <- c("Tumor_1","Tumor_2") > RG1 <- readDGE(samplesheet1,header=FALSE) > colnames(RG1) <- gsub(".mature_cnt.txt","",RG1$sample$files,perl=TRUE) > N<-length(colnames(RG1))/2 > K<-10 > keep <- rowSums(cpm(RG1) > K) >= N > difflist1 <- RG1[keep,] > difflist1$samples$lib.size <- colSums(difflist1$counts) > difflist1 <- calcNormFactors(difflist1) > difflist1 <- estimateCommonDisp(difflist1) > difflist1 <- estimateTagwiseDisp(difflist1) > currentDiff <- difflist1 > de.com <- exactTest(currentDiff,pair=comparison) > > ... then there is the creation of an Excel output dataframes, including > the CPM worksheet: > > > addDataFrame(*round(cpm(currentDiff)*[,currentDiff$samples$group==co mparison[1] > | currentDiff$samples$group==comparison[2]]), ucpm_sheet, startRow=2, > startColumn=1) > > Now, the CPM content leaves us quite puzzled because it is very variable > according to the syntax we use > to extract it, and this particular syntax seems to give values which do > not sum up to 1 million as expected. > This is the syntax coherent with the > > Apparently the only CPM values which do sum up correctly to 1 million es > expected are those extracted with the following > instruction > > head(cpm(currentDiff$counts)) > > But what are then these CPM values, which are different from the ones > extracted with the command above ? > > head(cpm(currentDiff)) > > > These CPM values are different also from the values obtained with this > command > > > head(cpm(currentDiff$pseudo.counts)) > > > Many thanks in advance for any clarification ! > > Alessandro, Moreno & Paolo > > ---- > > > > head(cpm(currentDiff)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C > ST8C ST11C LT1P LT2P > > hsa-miR-140-5p 306.550 209.4272 233.03 273.53 581.762 263.281 140.452 > 285.775 633.920 679.240 232.223 396.27 203.828 > > hsa-miR-133a-3p 28.945 14.5648 51.61 44.04 4.116 5.495 33.441 > 55.981 4.393 18.394 2.609 29.45 2.474 > > hsa-miR-665 11.695 21.7297 36.35 16.71 4.785 18.865 46.817 3.762 > 12.213 19.308 20.162 3.22 32.849 > > hsa-miR-1250-5p 2.047 0.1175 13.03 10.21 3.344 13.462 3.541 > 6.396 2.357 4.166 3.202 28.35 4.123 > > hsa-miR-381-3p 19.223 24.1963 84.53 44.74 16.104 10.898 33.834 > 182.992 43.926 37.092 1.186 25.37 10.308 > > hsa-miR-374b-3p 39.543 7.6348 10.46 27.62 31.077 18.224 25.966 > 17.457 39.747 29.572 7.353 22.46 13.607 > > LT3P ST2P ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB > > hsa-miR-140-5p 401.843 309.98 408.091 231.87 469.49 664.349 213.087 > 551.31 870.579 326.37 > > hsa-miR-133a-3p 2.974 16.17 5.822 14.44 43.60 48.011 27.482 > 27.06 13.031 41.21 > > hsa-miR-665 15.179 46.63 18.831 15.42 4.69 3.359 7.923 11.81 > 3.425 4.57 > > hsa-miR-1250-5p 10.904 12.05 18.831 19.21 27.76 1.976 21.870 > 48.22 67.411 12.49 > > hsa-miR-381-3p 11.957 44.73 10.007 16.40 21.45 23.907 80.795 > 21.46 32.661 75.00 > > hsa-miR-374b-3p 30.048 30.12 15.829 21.17 23.99 46.924 13.122 > 28.86 36.587 19.18 > > > > head(cpm(currentDiff$counts)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C > ST8C ST11C LT1P LT2P > > hsa-miR-140-5p 215.684 196.8196 237.43 307.42 605.485 285.767 142.576 > 288.156 546.467 593.42 153.0286 458.283 167.040 > > hsa-miR-133a-3p 20.365 13.6880 52.59 49.49 4.284 5.964 33.947 > 56.448 3.787 16.07 1.7194 34.058 2.027 > > hsa-miR-665 8.228 20.4215 37.04 18.78 4.980 20.476 47.525 3.794 > 10.529 16.87 13.2864 3.724 26.920 > > hsa-miR-1250-5p 1.440 0.1104 13.28 11.47 3.481 14.611 3.594 > 6.449 2.032 3.64 2.1102 32.786 3.379 > > hsa-miR-381-3p 13.525 22.7397 86.13 50.28 16.761 11.828 34.346 > 184.517 37.866 32.41 0.7816 29.335 8.448 > > hsa-miR-374b-3p 27.822 7.1751 10.66 31.05 32.344 19.780 26.359 > 17.602 34.264 25.84 4.8456 25.975 11.151 > > LT3P ST2P ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB > > hsa-miR-140-5p 418.080 330.59 487.258 231.94 532.639 720.513 260.36 > 599.79 929.610 342.404 > > hsa-miR-133a-3p 3.094 17.25 6.952 14.44 49.461 52.070 33.58 > 29.44 13.915 43.239 > > hsa-miR-665 15.792 49.73 22.484 15.42 5.321 3.643 9.68 12.85 > 3.657 4.795 > > hsa-miR-1250-5p 11.345 12.85 22.484 19.22 31.491 2.143 26.72 > 52.46 71.982 13.100 > > hsa-miR-381-3p 12.441 47.70 11.948 16.40 24.338 25.928 98.72 > 23.34 34.876 78.687 > > hsa-miR-374b-3p 31.263 32.12 18.899 21.17 27.216 50.891 16.03 > 31.40 39.068 20.121 > > > > head(cpm(currentDiff$pseudo.counts)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C > ST8C ST11C LT1P LT2P > > hsa-miR-140-5p 215.686 196.824 237.42 307.42 605.506 285.765 142.74 > 288.149 546.461 593.414 153.0351 458.284 167.054 > > hsa-miR-133a-3p 20.362 13.698 52.59 49.49 4.256 5.961 33.80 > 56.483 3.793 16.071 1.7286 34.088 2.048 > > hsa-miR-665 8.221 20.425 37.04 18.79 4.953 20.476 47.03 3.781 > 10.532 16.869 13.2858 3.712 26.891 > > hsa-miR-1250-5p 1.432 0.121 13.29 11.47 3.461 14.617 3.71 > 6.449 2.037 3.642 2.1169 32.804 3.399 > > hsa-miR-381-3p 13.517 22.749 86.14 50.28 16.735 11.826 34.44 > 184.584 37.867 32.407 0.7897 29.345 8.461 > > hsa-miR-374b-3p 27.838 7.186 10.65 31.05 32.352 19.779 26.37 > 17.595 34.259 25.835 4.8545 25.973 11.165 > > LT3P ST2P ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB > > hsa-miR-140-5p 418.084 330.60 487.258 231.96 532.637 720.512 260.347 > 599.80 929.622 342.403 > > hsa-miR-133a-3p 3.073 17.26 6.948 14.45 49.475 52.067 33.578 > 29.45 13.909 43.238 > > hsa-miR-665 15.786 49.68 22.485 15.42 5.309 3.645 9.673 12.85 > 3.649 4.794 > > hsa-miR-1250-5p 11.334 12.86 22.482 19.23 31.489 2.145 26.717 > 52.48 72.001 13.099 > > hsa-miR-381-3p 12.430 47.70 11.945 16.41 24.330 25.929 98.743 > 23.34 34.875 78.686 > > hsa-miR-374b-3p 31.272 32.12 18.897 21.18 27.212 50.889 16.025 > 31.40 39.072 20.120 > > > > colSums(cpm(currentDiff)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C > ST8C ST11C LT1P LT2P LT3P ST2P > > 1421292 1064057 981465 889765 960819 921314 985099 991736 1160034 > 1144623 1517511 864691 1220229 961164 937648 > > ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB > > 837525 999688 881438 922050 818447 919173 936499 953165 > > > > colSums(cpm(currentDiff$counts)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P > LT2P LT3P ST2P ST10P ST12P ST7P ST8P ST4P > > 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 > 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 > > ST5P ST11P SB > > 1e+06 1e+06 1e+06 > > > > colSums(cpm(currentDiff$pseudo.counts)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P > LT2P LT3P ST2P ST10P ST12P ST7P ST8P ST4P > > 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 > 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 > > ST5P ST11P SB > > 1e+06 1e+06 1e+06 > > > > > sessionInfo() > > R version 3.0.2 (2013-09-25) > > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > > [1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 > LC_MONETARY=Italian_Italy.1252 > > [4] LC_NUMERIC=C LC_TIME=Italian_Italy.1252 > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > > [1] xlsx_0.5.5 xlsxjars_0.5.0 rJava_0.9-6 edgeR_3.4.2 limma_3.18.9 > > loaded via a namespace (and not attached): > > [1] tools_3.0.2 > > > > > > ----------------------------------------------------------- > Il Contenuto del presente messaggio potrebbe contenere informazioni > confidenziali a favore dei > soli destinatari del messaggio stesso. Qualora riceviate per errore questo > messaggio siete pregati > di cancellarlo dalla memoria del computer e di contattare i numeri sopra > indicati. Ogni utilizzo o > ritrasmissione dei contenuti del messaggio da parte di soggetti diversi > dai destinatari č da > considerarsi vietato ed abusivo. > > The information transmitted is intended only for the per...{{dropped:10}} > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Thanks for the kind answer, Alessandro & colleagues Il 03/04/2014 14:51, Devon Ryan ha scritto: > The difference in output when running cpm() on a matrix and a DGEList > is due to the latter using the normalized library size. In other > words, the latter method does: > > t(t(currentDiff$counts)/(1e-6*currentDiff$samples$lib.size*currentDi ff$samples$norm.factors)) > > rather than > > t(t(currentDiff$counts)/(1e-6*colSums(currentDiff$counts))) > > Utilizing the normalization factor would seem to be a much better idea > in most cases. > > Best, > Devon > > -- > Devon Ryan, Ph.D. > Email: dpryan@dpryan.com <mailto:dpryan@dpryan.com> > Laboratory for Molecular and Cellular Cognition > German Centre for Neurodegenerative Diseases (DZNE) > Ludwig-Erhard-Allee 2 > 53175 Bonn > Germany > ----------------------------------------------------------- Il Contenuto del presente messaggio potrebbe contenere informazioni confidenziali a favore dei soli destinatari del messaggio stesso. Qualora riceviate per errore questo messaggio siete pregati di cancellarlo dalla memoria del computer e di contattare i numeri sopra indicati. Ogni utilizzo o ritrasmissione dei contenuti del messaggio da parte di soggetti diversi dai destinatari è da considerarsi vietato ed abusivo. The information transmitted is intended only for the per...{{dropped:10}}
0
Entering edit mode
@gordon-smyth
Last seen 3 hours ago
WEHI, Melbourne, Australia
Dear Alessandro, I see that Devon Ryan has answered your question, but the answer is also available directly from the help system. If you type help("cpm") the first line of Details says: "CPM or RPKM values are useful descriptive measures for the expression level of a gene or transcript. By default, the normalized library sizes are used in the computation for DGEList objects but simple column sums for matrices." Best wishes Gordon > Date: Thu, 03 Apr 2014 11:58:57 +0200 > From: "alessandro.guffanti at genomnia.com" > <alessandro.guffanti at="" genomnia.com=""> > To: Bioconductor mailing list <bioconductor at="" r-project.org=""> > Subject: [BioC] extracting CPM from a DGElist after normalization in > edgeR > > Dear colleagues and dear Mark: > > We want to extract the CPM from a DGElist object after TMM > normalization, common and tagwise dispersion evaluation. > > Here is a sample of the code, it's just a standard edgeR DEG workflow: > > samplesheet1 <- read.delim("sample_sheet.txt", stringsAsFactors = FALSE) > comparison <- c("Tumor_1","Tumor_2") > RG1 <- readDGE(samplesheet1,header=FALSE) > colnames(RG1) <- gsub(".mature_cnt.txt","",RG1$sample$files,perl=TRUE) > N<-length(colnames(RG1))/2 > K<-10 > keep <- rowSums(cpm(RG1) > K) >= N > difflist1 <- RG1[keep,] > difflist1$samples$lib.size <- colSums(difflist1$counts) > difflist1 <- calcNormFactors(difflist1) > difflist1 <- estimateCommonDisp(difflist1) > difflist1 <- estimateTagwiseDisp(difflist1) > currentDiff <- difflist1 > de.com <- exactTest(currentDiff,pair=comparison) > > ... then there is the creation of an Excel output dataframes, including > the CPM worksheet: > > addDataFrame(*round(cpm(currentDiff)*[,currentDiff$samples$group==co mparison[1] > | currentDiff$samples$group==comparison[2]]), ucpm_sheet, startRow=2, > startColumn=1) > > Now, the CPM content leaves us quite puzzled because it is very variable > according to the syntax we use > to extract it, and this particular syntax seems to give values which do > not sum up to 1 million as expected. > This is the syntax coherent with the > > Apparently the only CPM values which do sum up correctly to 1 million es > expected are those extracted with the following > instruction > > head(cpm(currentDiff$counts)) > > But what are then these CPM values, which are different from the ones > extracted with the command above ? > > head(cpm(currentDiff)) > > > These CPM values are different also from the values obtained with this > command > > > head(cpm(currentDiff$pseudo.counts)) > > > Many thanks in advance for any clarification ! > > Alessandro, Moreno & Paolo > > ---- > > >> head(cpm(currentDiff)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C > ST8C ST11C LT1P LT2P > > hsa-miR-140-5p 306.550 209.4272 233.03 273.53 581.762 263.281 140.452 > 285.775 633.920 679.240 232.223 396.27 203.828 > > hsa-miR-133a-3p 28.945 14.5648 51.61 44.04 4.116 5.495 33.441 > 55.981 4.393 18.394 2.609 29.45 2.474 > > hsa-miR-665 11.695 21.7297 36.35 16.71 4.785 18.865 46.817 3.762 > 12.213 19.308 20.162 3.22 32.849 > > hsa-miR-1250-5p 2.047 0.1175 13.03 10.21 3.344 13.462 3.541 > 6.396 2.357 4.166 3.202 28.35 4.123 > > hsa-miR-381-3p 19.223 24.1963 84.53 44.74 16.104 10.898 33.834 > 182.992 43.926 37.092 1.186 25.37 10.308 > > hsa-miR-374b-3p 39.543 7.6348 10.46 27.62 31.077 18.224 25.966 > 17.457 39.747 29.572 7.353 22.46 13.607 > > LT3P ST2P ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB > > hsa-miR-140-5p 401.843 309.98 408.091 231.87 469.49 664.349 213.087 > 551.31 870.579 326.37 > > hsa-miR-133a-3p 2.974 16.17 5.822 14.44 43.60 48.011 27.482 > 27.06 13.031 41.21 > > hsa-miR-665 15.179 46.63 18.831 15.42 4.69 3.359 7.923 11.81 > 3.425 4.57 > > hsa-miR-1250-5p 10.904 12.05 18.831 19.21 27.76 1.976 21.870 > 48.22 67.411 12.49 > > hsa-miR-381-3p 11.957 44.73 10.007 16.40 21.45 23.907 80.795 > 21.46 32.661 75.00 > > hsa-miR-374b-3p 30.048 30.12 15.829 21.17 23.99 46.924 13.122 > 28.86 36.587 19.18 > > > > head(cpm(currentDiff$counts)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C > ST8C ST11C LT1P LT2P > > hsa-miR-140-5p 215.684 196.8196 237.43 307.42 605.485 285.767 142.576 > 288.156 546.467 593.42 153.0286 458.283 167.040 > > hsa-miR-133a-3p 20.365 13.6880 52.59 49.49 4.284 5.964 33.947 > 56.448 3.787 16.07 1.7194 34.058 2.027 > > hsa-miR-665 8.228 20.4215 37.04 18.78 4.980 20.476 47.525 3.794 > 10.529 16.87 13.2864 3.724 26.920 > > hsa-miR-1250-5p 1.440 0.1104 13.28 11.47 3.481 14.611 3.594 > 6.449 2.032 3.64 2.1102 32.786 3.379 > > hsa-miR-381-3p 13.525 22.7397 86.13 50.28 16.761 11.828 34.346 > 184.517 37.866 32.41 0.7816 29.335 8.448 > > hsa-miR-374b-3p 27.822 7.1751 10.66 31.05 32.344 19.780 26.359 > 17.602 34.264 25.84 4.8456 25.975 11.151 > > LT3P ST2P ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB > > hsa-miR-140-5p 418.080 330.59 487.258 231.94 532.639 720.513 260.36 > 599.79 929.610 342.404 > > hsa-miR-133a-3p 3.094 17.25 6.952 14.44 49.461 52.070 33.58 > 29.44 13.915 43.239 > > hsa-miR-665 15.792 49.73 22.484 15.42 5.321 3.643 9.68 12.85 > 3.657 4.795 > > hsa-miR-1250-5p 11.345 12.85 22.484 19.22 31.491 2.143 26.72 > 52.46 71.982 13.100 > > hsa-miR-381-3p 12.441 47.70 11.948 16.40 24.338 25.928 98.72 > 23.34 34.876 78.687 > > hsa-miR-374b-3p 31.263 32.12 18.899 21.17 27.216 50.891 16.03 > 31.40 39.068 20.121 > > > > head(cpm(currentDiff$pseudo.counts)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C > ST8C ST11C LT1P LT2P > > hsa-miR-140-5p 215.686 196.824 237.42 307.42 605.506 285.765 142.74 > 288.149 546.461 593.414 153.0351 458.284 167.054 > > hsa-miR-133a-3p 20.362 13.698 52.59 49.49 4.256 5.961 33.80 > 56.483 3.793 16.071 1.7286 34.088 2.048 > > hsa-miR-665 8.221 20.425 37.04 18.79 4.953 20.476 47.03 3.781 > 10.532 16.869 13.2858 3.712 26.891 > > hsa-miR-1250-5p 1.432 0.121 13.29 11.47 3.461 14.617 3.71 > 6.449 2.037 3.642 2.1169 32.804 3.399 > > hsa-miR-381-3p 13.517 22.749 86.14 50.28 16.735 11.826 34.44 > 184.584 37.867 32.407 0.7897 29.345 8.461 > > hsa-miR-374b-3p 27.838 7.186 10.65 31.05 32.352 19.779 26.37 > 17.595 34.259 25.835 4.8545 25.973 11.165 > > LT3P ST2P ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB > > hsa-miR-140-5p 418.084 330.60 487.258 231.96 532.637 720.512 260.347 > 599.80 929.622 342.403 > > hsa-miR-133a-3p 3.073 17.26 6.948 14.45 49.475 52.067 33.578 > 29.45 13.909 43.238 > > hsa-miR-665 15.786 49.68 22.485 15.42 5.309 3.645 9.673 12.85 > 3.649 4.794 > > hsa-miR-1250-5p 11.334 12.86 22.482 19.23 31.489 2.145 26.717 > 52.48 72.001 13.099 > > hsa-miR-381-3p 12.430 47.70 11.945 16.41 24.330 25.929 98.743 > 23.34 34.875 78.686 > > hsa-miR-374b-3p 31.272 32.12 18.897 21.18 27.212 50.889 16.025 > 31.40 39.072 20.120 > > > > colSums(cpm(currentDiff)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C > ST8C ST11C LT1P LT2P LT3P ST2P > > 1421292 1064057 981465 889765 960819 921314 985099 991736 1160034 > 1144623 1517511 864691 1220229 961164 937648 > > ST10P ST12P ST7P ST8P ST4P ST5P ST11P SB > > 837525 999688 881438 922050 818447 919173 936499 953165 > > > > colSums(cpm(currentDiff$counts)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P > LT2P LT3P ST2P ST10P ST12P ST7P ST8P ST4P > > 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 > 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 > > ST5P ST11P SB > > 1e+06 1e+06 1e+06 > > > > colSums(cpm(currentDiff$pseudo.counts)) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P > LT2P LT3P ST2P ST10P ST12P ST7P ST8P ST4P > > 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 > 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 > > ST5P ST11P SB > > 1e+06 1e+06 1e+06 > > > >> sessionInfo() > > R version 3.0.2 (2013-09-25) > > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > > [1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 > LC_MONETARY=Italian_Italy.1252 > > [4] LC_NUMERIC=C LC_TIME=Italian_Italy.1252 > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > > [1] xlsx_0.5.5 xlsxjars_0.5.0 rJava_0.9-6 edgeR_3.4.2 limma_3.18.9 > > loaded via a namespace (and not attached): > > [1] tools_3.0.2 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENT
0
Entering edit mode
Hello - thanks also for this second clarification. I actually read this help line, but it was a bit obscure to me Let me try to summarize: > colSums*(**cpm(currentDiff$counts)*) , where currentDiff is a DGEList object after normalization LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P LT2P LT3P ST2P ST10P ST12P ST7P ST8P ST4P 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 ==> the *count **matrix (matrix of cpm)* in this case does not use the values normalized by library size, so the values add up to 1 million, correct ? in this case, though, I can compare directly values between samples. > colSums(*cpm(currentDiff)*) LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P LT2P LT3P ST2P 1421292 1064057 981465 889765 960819 921314 985099 991736 1160034 1144623 1517511 864691 1220229 961164 937648 ST10P ST12P ST7P ST8P ST4P 837525 999688 881438 922050 818447 ==> these are the same count values, but normalized by library sizes, so the CPM will not add up to 1.000.000 (roughly) per sample, correct ? this is also the way in which CPM are extracted in the manual. But I don't understand one thing: we scale up (or don't scale up) the libraries by size, then we calculate the CPM. Still the CPM should add up to 1 million for each sample in the two categories, so that every gene can be compared directly between samples Am I missing something fundamental here or the scaling is done *after* the CPM calculation ? Let me know, cheers, Alessandro PS A naive question: what is the role (roughly) for pesudocounts ? Many thanks for your feedback, Alessandro & Co. -- Dear Alessandro, I see that Devon Ryan has answered your question, but the answer is also available directly from the help system. If you type help("cpm") the first line of Details says: "CPM or RPKM values are useful descriptive measures for the expression level of a gene or transcript. By default, the normalized library sizes are used in the computation for DGEList objects but simple column sums for matrices." Best wishes Gordon ----------------------------------------------------------- Il Contenuto del presente messaggio potrebbe contenere informazioni confidenziali a favore dei soli destinatari del messaggio stesso. Qualora riceviate per errore questo messaggio siete pregati di cancellarlo dalla memoria del computer e di contattare i numeri sopra indicati. Ogni utilizzo o ritrasmissione dei contenuti del messaggio da parte di soggetti diversi dai destinatari è da considerarsi vietato ed abusivo. The information transmitted is intended only for the per...{{dropped:10}}
0
Entering edit mode
Hi Alessandro, I think you might not be understanding what scale normalization is. Have a read of the section on normalization in the edgeR User's Guide. That will also answer your question on pseudo-counts. Best wishes Gordon On Fri, 4 Apr 2014, alessandro.guffanti at genomnia.com wrote: > Hello - thanks also for this second clarification. I actually read this help > line, but it > was a bit obscure to me > > Let me try to summarize: > >> colSums*(**cpm(currentDiff$counts)*) , where currentDiff is a DGEList > object after normalization > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C ST8C ST11C LT1P > LT2P LT3P ST2P ST10P ST12P ST7P ST8P ST4P > > 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 > 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 > > ==> the *count **matrix (matrix of cpm)* in this case does not use the values > normalized by library size, > so the values add up to 1 million, correct ? in this case, though, I can > compare directly values between > samples. > >> colSums(*cpm(currentDiff)*) > > LT1C LT2C LT3C ST2C ST4C ST10C ST12C ST5C ST7C > ST8C ST11C LT1P LT2P LT3P ST2P > > 1421292 1064057 981465 889765 960819 921314 985099 991736 1160034 > 1144623 1517511 864691 1220229 961164 937648 > > ST10P ST12P ST7P ST8P ST4P > > 837525 999688 881438 922050 818447 > > ==> these are the same count values, but normalized by library sizes, so the > CPM will not add up to 1.000.000 (roughly) > per sample, correct ? this is also the way in which CPM are extracted in the > manual. > > But I don't understand one thing: we scale up (or don't scale up) the > libraries by size, then we calculate the CPM. > Still the CPM should add up to 1 million for each sample in the two > categories, so that every gene can be compared > directly between samples > > Am I missing something fundamental here or the scaling is done *after* the > CPM calculation ? > > Let me know, cheers, > > Alessandro > > PS > > A naive question: what is the role (roughly) for pesudocounts ? > > Many thanks for your feedback, > > Alessandro & Co. > > -- > > Dear Alessandro, > > I see that Devon Ryan has answered your question, but the answer is also > available directly from the help system. If you type help("cpm") the first > line of Details says: > > "CPM or RPKM values are useful descriptive measures for the expression level > of a gene or transcript. By default, the normalized library sizes are used in > the computation for DGEList objects but simple column sums for matrices." > > Best wishes > Gordon > > > > > > ----------------------------------------------------------- > Il Contenuto del presente messaggio potrebbe contenere informazioni > confidenziali a favore dei > soli destinatari del messaggio stesso. Qualora riceviate per errore questo > messaggio siete pregati di cancellarlo dalla memoria del computer e di > contattare i numeri sopra indicati. Ogni utilizzo o ritrasmissione dei > contenuti del messaggio da parte di soggetti diversi dai destinatari ? da > considerarsi vietato ed abusivo. > > The information transmitted is intended only for the p...{{dropped:15}}
ADD REPLY

Login before adding your answer.

Traffic: 686 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6