Hi, I want to analyze TCGA data with the DESeq2 package. As you know, this database provides three types of RNA-seq expression data. This thread (http://seqanswers.com/forums/showthread.php?t=42911) describes them:
1- raw counts: The (first) RSEM paper explains that the program calculates two values. One represents the (estimated) number of reads that aligned to a transcript. This value is not an integer because RSEM only reports an estimate of how many ambiguously mapping reads belong to a transcript/gene. This number is what TCGA slightly misleadingly calls raw counts.
2- scaled estimate: The scaled estimate value on the other hand is the estimated frequency of the gene/transcript amongst the total number of transcripts that were sequenced. Newer versions of RSEM call this value (multiplied by 1e6) TPM - Transcripts Per Million. It's closely related to FPKM, as explained on the RSEM website. The important point is that TPM, like FPKM, is independent of transcript length, whereas "raw" counts are not!
3- normalized_results: The *.normalized_results files just contain a scaled version of the raw_counts column. The values are divided by the 75th percentile and multiplied by 1000. This should make the values a bit more comparable between experiments.
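To make the two rescalings described above concrete, here is a minimal Python sketch. It is not the TCGA pipeline code: the function names are hypothetical, the toy numbers are made up for illustration, and taking the 75th percentile over all genes with linear interpolation is an assumption (the real pipeline may compute the quantile differently, for example by excluding zero-count genes).

```python
import math


def scaled_estimate_to_tpm(scaled_estimates):
    """Convert RSEM scaled estimates (fractions of all transcripts) to TPM."""
    return [s * 1e6 for s in scaled_estimates]


def percentile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def upper_quartile_normalize(raw_counts, scale=1000.0):
    """Divide each value by the sample's 75th percentile, then multiply by scale.

    Assumption: the percentile is taken over all genes of one sample.
    """
    uq = percentile(raw_counts, 0.75)
    if uq == 0:
        raise ValueError("75th percentile is zero; cannot normalize")
    return [c / uq * scale for c in raw_counts]


# Toy values for a single sample (illustrative only, not real TCGA data).
raw_counts = [0.0, 10.0, 20.0, 30.0, 40.0]
scaled_estimates = [0.5, 0.25, 0.25]

tpm = scaled_estimate_to_tpm(scaled_estimates)
normalized = upper_quartile_normalize(raw_counts)
```

Both outputs are rescaled versions of RSEM's estimates; only the raw_counts column itself stays on the scale of estimated read counts.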
I read the related Biostars and support.bioconductor.org posts but unfortunately did not find an answer to my question: which of these data types is best for finding DEGs (differentially expressed genes)?
Thank you.
I have re-processed most of the TCGA RNA-seq data, originally from the HTSeq raw counts (when they were the only data available) and, more recently, from the RSEM expression levels. You can import the RSEM files into DESeq2 via tximport: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#rsem
This paper and dataset might also be relevant: https://www.ncbi.nlm.nih.gov/pubmed/26209429 (GSE62944 in Gene Expression Omnibus). It provides raw counts for most of TCGA.