Question

Too many significant genes when integrating GTEx and TCGA

Reza • 0

@fd964599

Last seen 11 months ago

United Kingdom

I have been working with data from the recount3 project to integrate GTEx and TCGA data and perform differential expression analysis using DESeq2. However, I am getting too many significant genes when using datasets with large sample sizes, such as TCGA-COAD and colon tissue from GTEx.

This phenomenon is also mentioned here (PMID: 35199033), which reports that the differentially expressed genes (DEGs) detected between TCGA primary tumor and GTEx normal colon tissue samples account for 92% of all input genes.

When using limma for analysis, the treat function can help address this issue by computing empirical Bayes moderated-t p-values relative to a minimum fold-change threshold.

Now I have two questions:

1. Is there a similar solution available for DESeq2?
2. Is there a better approach to address this problem? Even after using treat, many significant genes remain.

RNA-seq DESeq2 • 1.7k views

ADD COMMENT • link updated 8 weeks ago by Jane • 0 • written 11 months ago by Reza • 0

These two datasets are from completely different experiments / batches. It is utterly meaningless to compare them. I would suggest a comparative analysis within subtypes, using only TCGA data.

ADD REPLY • link 11 months ago ATpoint ★ 4.8k

It's true. However, many cancers in TCGA have no normal samples, and in that case one can borrow normal samples from the GTEx dataset. I am facing the same situation, and when I include a batch variable I get a "matrix is not full rank" error (as usual). I came across this publication, "New functionalities in the TCGAbiolinks package for the study and integration of cancer data from GDC and GTEx" (DOI: 10.1371/journal.pcbi.1006701), which, to some extent, can address these kinds of issues.

ADD REPLY • link 10 months ago manwar • 0

It's perfectly confounded, you cannot correct for anything. Nothing will ever change that. People are just ignoring this simple fact because they "need" to analyze data, pretending comparability.
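
To make the confounding concrete, here is a minimal sketch in plain Python (not DESeq2 code) of why adding a batch term produces the "not full rank" error. The sample layout, the `design_matrix` helper and the `matrix_rank` helper are hypothetical and exist only for illustration. When every tumor comes from TCGA and every normal from GTEx, the condition column and the batch column are identical, so the design matrix has fewer independent columns than coefficients.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gauss-Jordan elimination on exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(condition, batch):
    """Columns: intercept, condition == 'tumor', batch == 'TCGA'."""
    return [[1, int(c == "tumor"), int(b == "TCGA")]
            for c, b in zip(condition, batch)]


# Hypothetical toy layout: every tumor is from TCGA, every normal from GTEx.
confounded = design_matrix(
    ["tumor", "tumor", "normal", "normal"],
    ["TCGA", "TCGA", "GTEx", "GTEx"],
)

# Hypothetical layout in which one normal sample also comes from TCGA.
crossed = design_matrix(
    ["tumor", "tumor", "normal", "normal"],
    ["TCGA", "TCGA", "TCGA", "GTEx"],
)

print("columns:", 3)
print("rank, confounded layout:", matrix_rank(confounded))
print("rank, crossed layout:", matrix_rank(crossed))
```

In the confounded layout the rank is lower than the number of columns, so the condition and batch effects cannot be separated by any model. Only a layout in which at least some samples of both conditions share a batch gives a full-rank design.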

ADD REPLY • link 10 months ago ATpoint ★ 4.8k

I encountered the same issue: too many DEGs after limma analysis. It is sad that papers combining GTEx and TCGA are still being published today, e.g. this paper: Identification of ferroptosis-related signature predicting prognosis and therapeutic responses in pancreatic cancer.

ADD REPLY • link 8 weeks ago Jane • 0

score 1 · Answer 1 · 2024-05-03

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 days ago

United States

See ?results, in particular the lfcThreshold argument.
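
In DESeq2 itself this is done by passing `lfcThreshold` (optionally with `altHypothesis = "greaterAbs"`) to `results()`. The idea behind a thresholded test can be sketched in plain Python as below. This is a conceptual illustration under simplifying assumptions (a normal approximation and a conservative two-sided p-value), not DESeq2's actual implementation, and the function names and gene values are hypothetical.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def wald_pvalue(lfc, se, threshold=0.0):
    """Two-sided Wald-type p-value for the null |LFC| <= threshold.

    With threshold = 0 this reduces to the usual test of LFC == 0.
    The p-value is capped at 1 (assumed conservative form).
    """
    z = (abs(lfc) - threshold) / se
    return min(1.0, 2.0 * upper_tail(z))


# Hypothetical gene A: tiny effect, but very precisely estimated (large n).
p_a_zero = wald_pvalue(0.3, 0.05)
p_a_thresh = wald_pvalue(0.3, 0.05, threshold=1.0)

# Hypothetical gene B: large effect, estimated with more uncertainty.
p_b_zero = wald_pvalue(2.0, 0.2)
p_b_thresh = wald_pvalue(2.0, 0.2, threshold=1.0)

print("gene A:", p_a_zero, p_a_thresh)
print("gene B:", p_b_zero, p_b_thresh)
```

Gene A is highly significant against a null of zero change but not against a null of "at most a two-fold change" (threshold of 1 on the log2 scale), while gene B remains significant under both. This differs from filtering on the estimated fold change after testing, because the threshold is part of the null hypothesis.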

ADD COMMENT • link 11 months ago James W. MacDonald 68k

Yes, this was one of the aspects we highlighted in the 2014 paper, and it's also in the workflow. Check these places first.

Also take a step back and consider: with many samples, you are asking whether the null hypothesis holds that gene expression is constant regardless of tumor/normal status. Of course the test will reject that null for the majority of genes.
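
A rough way to see the sample-size effect is to compute the approximate power of a simple two-sample z-test for a fixed, very small true difference. The sketch below uses only the Python standard library; the effect size (0.1 on the log2 scale), the standard deviation (0.5) and the group sizes are made-up values for illustration, and the z-test is a stand-in for, not a description of, the DESeq2 model.

```python
import math


def normal_cdf(x):
    """P(Z <= x) for a standard normal Z."""
    return 0.5 * math.erfc(-x / math.sqrt(2))


def approx_power(delta, sd, n_per_group, z_crit=1.959964):
    """Approximate power of a two-sided two-sample z-test at alpha = 0.05.

    delta: true difference in means (e.g. a log2 fold change)
    sd: per-sample standard deviation, assumed equal in both groups
    """
    se = sd * math.sqrt(2.0 / n_per_group)
    shift = abs(delta) / se
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)


# Hypothetical small effect: 0.1 on the log2 scale, sd of 0.5.
for n in (5, 50, 500, 5000):
    print(n, round(approx_power(0.1, 0.5, n), 3))
```

With a handful of samples per group such a small difference is almost never detected, whereas with hundreds or thousands of samples it is detected almost always, even though the effect itself has not changed.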

Maybe there is a better approach to your biological question than null hypothesis testing.

ADD REPLY • link 11 months ago Michael Love 43k