Finding DEGs

Question

Which of these TCGA data types (raw count, scaled estimate, normalized_results) is best for finding differentially expressed genes (DEGs) with the DESeq2 package?

roohallah1435 • 0

Hi, I want to analyze TCGA data with the DESeq2 package. As you know, this database offers three types of data. This thread (http://seqanswers.com/forums/showthread.php?t=42911) describes them:

1- raw counts: The (first) RSEM paper explains that the program calculates two values. One represents the (estimated) number of reads that aligned to a transcript. This value is not an integer because RSEM only reports an estimate of how many ambiguously mapping reads belong to a transcript/gene. This number is what TCGA, slightly misleadingly, calls raw counts.

2- scaled estimate: The scaled estimate, on the other hand, is the estimated frequency of the gene/transcript among the total number of transcripts that were sequenced. Newer versions of RSEM call this value (multiplied by 1e6) TPM, Transcripts Per Million. It is closely related to FPKM, as explained on the RSEM website. The important point is that TPM, like FPKM, is independent of transcript length, whereas "raw" counts are not!

3- normalized_results: The *.normalized_results files just contain a scaled version of the raw_counts column. The values are divided by the 75th percentile and multiplied by 1000. This should make the values a bit more comparable between experiments.
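To make the relationship between the three values concrete, here is a small base-R sketch on made-up numbers. The column names `raw_count` and `scaled_estimate` and the toy values are assumptions for illustration, and the 75th percentile is taken over all genes in the sample, which may not match the exact rule used by the TCGA pipeline.

```r
# Toy RSEM-style values for four genes in one sample (made-up numbers).
rsem <- data.frame(
  gene_id         = c("geneA", "geneB", "geneC", "geneD"),
  raw_count       = c(10.5, 200.0, 0.0, 89.5),  # expected counts, not integers
  scaled_estimate = c(1e-6, 4e-5, 0, 9e-6)      # fraction of sequenced transcripts
)

# TPM is the scaled estimate multiplied by one million.
rsem$tpm <- rsem$scaled_estimate * 1e6

# Upper-quartile scaling as described above: divide by the 75th percentile
# of the sample, then multiply by 1000.
# Assumption: the percentile is computed over all genes, zeros included.
uq <- unname(quantile(rsem$raw_count, probs = 0.75))
rsem$normalized <- rsem$raw_count / uq * 1000

print(rsem)
```

Neither `tpm` nor `normalized` is a count any more, which is why the choice of input matters for a count-based method such as DESeq2.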

I read the related posts on Biostars and support.bioconductor.org, but unfortunately they did not answer my question. Which of these data types is best for finding differentially expressed genes (DEGs)?

thank you

deseq2 • 3.8k views

ADD COMMENT • link 4.9 years ago roohallah1435 • 0

I have re-processed most of the TCGA RNA-seq data, originally from the HTSeq raw counts (when they were the only data available) and, more recently, from the RSEM expression levels. You can import the RSEM files into DESeq2 via tximport: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#rsem
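A rough sketch of that import route is below. The helper name `import_rsem_for_deseq2`, the file paths in the usage comment, and the `group` column are hypothetical. It also assumes the files are in the standard RSEM `*.genes.results` layout that tximport's `type = "rsem"` reader expects, so TCGA files with different column names may need adapting first.

```r
# Hypothetical helper: read RSEM gene-level results and build a DESeq2 object.
# files:        named character vector of paths to RSEM *.genes.results files
# sample_table: data.frame with one row per file and a factor column `group`
import_rsem_for_deseq2 <- function(files, sample_table) {
  # Gene-level input and gene-level output, as in the tximport vignette.
  txi <- tximport::tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)

  # RSEM can report an effective length of 0, which DESeq2 does not accept;
  # the tximport vignette sets such lengths to 1.
  txi$length[txi$length == 0] <- 1

  DESeq2::DESeqDataSetFromTximport(txi, colData = sample_table, design = ~ group)
}

# Usage (paths are placeholders, not real files):
# files <- c(s1 = "rsem/s1.genes.results", s2 = "rsem/s2.genes.results")
# sample_table <- data.frame(group = factor(c("normal", "Tumor")),
#                            row.names = names(files))
# dds <- import_rsem_for_deseq2(files, sample_table)
```

Going through tximport lets DESeq2 use the estimated counts together with the gene lengths, instead of a pre-scaled table.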

ADD REPLY • link 4.9 years ago Kevin Blighe ★ 4.0k

This paper and dataset might also be relevant: https://www.ncbi.nlm.nih.gov/pubmed/26209429 (GSE62944 in Gene Expression Omnibus). It provides raw counts for most of TCGA.

ADD REPLY • link 4.9 years ago Stephen Piccolo ▴ 600

score 0 · Answer 1 · 2020-04-01

Michael Love 43k

What kind of DE are you looking for? What groups of samples?

ADD COMMENT • link 4.9 years ago Michael Love 43k

Comparison between normal and tumor samples.

1- Is the use of normalized data correct or is it recommended to use raw data?

2- Is my code correct?

The R script for my analysis is:

library(readxl)

library(DESeq2)

library(ggplot2)

options(stringsAsFactors = FALSE)

GBM_normalized <- read_excel("GBM_normalized.xlsx")

GBMdata <- as.data.frame(GBM_normalized)

rownames(GBMdata) <- GBMdata$Genenames

GBMdata <- GBMdata[, -1]

GBMdata <- as.matrix(GBMdata)

mode(GBMdata) <- "integer"

GBMdata_nt <- GBMdata[, 1:161]

gr_nt <- factor(c(rep("normal", 5), rep("Tumor", 156)))

colData_nt <- data.frame(group = gr_nt, type = "paired-end")

cds_nt <- DESeqDataSetFromMatrix(GBMdata_nt, colData_nt, design = ~ group)

cds_nt <- DESeq(cds_nt)

cnt_nt <- log2(1 + counts(cds_nt, normalized = TRUE))

# Find DEGs

res_nt <- data.frame(results(cds_nt, contrast = c("group", "Tumor", "normal")))

res_nt$genename <- rownames(res_nt)

res_nt$padj <- p.adjust(res_nt$pvalue, method = "BH")

res_nt <- res_nt[order(res_nt$padj), ]

ggplot(res_nt, aes(log2FoldChange, -log10(padj), color = log2FoldChange)) + geom_point() + theme_bw()

ADD REPLY • link 4.9 years ago roohallah1435 • 0

score 0 · Answer 2 · 2020-04-02

1- Is the use of normalized data correct or is it recommended to use raw data?

The DESeq2 vignette and manual pages state the recommended input: raw (un-normalized) counts.
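As a minimal sketch of what that looks like for a tumor versus normal comparison, assuming you already have a genes x samples matrix of un-normalized integer counts: the helper name `tumor_vs_normal` and its arguments are hypothetical, and only standard DESeq2 calls are used.

```r
# Hypothetical helper: tumor vs normal comparison from a raw count matrix.
# counts: genes x samples matrix of un-normalized integer counts
# group:  factor with levels "normal" and "Tumor", one entry per column
tumor_vs_normal <- function(counts, group) {
  col_data <- data.frame(group = group, row.names = colnames(counts))

  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData   = col_data,
                                        design    = ~ group)

  # DESeq() estimates size factors and dispersions itself, so the input
  # must not be normalized beforehand.
  dds <- DESeq2::DESeq(dds)

  # log2 fold change of Tumor over normal; padj is already BH-adjusted.
  res <- DESeq2::results(dds, contrast = c("group", "Tumor", "normal"))
  res[order(res$padj), ]
}
```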

roohallah1435, what is contained in GBM_normalized.xlsx, and why is the data in an Excel file at all? Storing data in Excel format can introduce numerous formatting issues. If you want help, then please help us: this is the very first time that you have mentioned the file GBM_normalized.xlsx.

Judging by the description in your original post and the code that you have since provided, you are taking scaled raw counts, forcing them back to integers, and then normalising them again in DESeq2 (?). This does not seem correct to me.
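To see what the integer coercion does to scaled values, here is a small base-R illustration on made-up numbers (the matrix and gene names are hypothetical):

```r
# Made-up upper-quartile-scaled values for three genes in two samples.
normalized <- matrix(
  c(0.7, 1.9, 1523.6,
    0.2, 2.5,  980.4),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)

forced <- normalized
mode(forced) <- "integer"  # same coercion as in the script above

print(forced)
#          s1  s2
# geneA     0   0
# geneB     1   2
# geneC  1523 980
```

The coercion truncates toward zero, so low values collapse to 0 or 1, and the result is still a scaled quantity rather than a read count.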

Another part that makes little sense is the call to p.adjust(): DESeq2 already performs p-value adjustment for you, and results() returns the adjusted values in the padj column.
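By default, results() also applies independent filtering before the Benjamini-Hochberg adjustment, so overwriting padj with p.adjust() on every p-value changes the result. The base-R toy example below only illustrates that general effect. The p-values and the `keep` filter are made up, and DESeq2's own filtering rule is more involved.

```r
# Made-up p-values for six genes.
pvalue <- c(0.001, 0.01, 0.02, 0.04, 0.5, 0.9)

# Assumption: a filter (for example, on mean count) keeps the first four genes.
keep <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)

# BH adjustment over all six tests.
padj_all <- p.adjust(pvalue, method = "BH")

# BH adjustment over the four tests that pass the filter; the rest become NA.
padj_filtered <- rep(NA_real_, length(pvalue))
padj_filtered[keep] <- p.adjust(pvalue[keep], method = "BH")

print(data.frame(pvalue, padj_all, padj_filtered))
```

With fewer tests, the adjusted values for the retained genes are smaller, so recomputing the adjustment over all genes discards that gain in power.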

Please read my other comment and start from the RSEM files.