Analysis of large RNA-Seq Datasets
2
0
Entering edit mode
KokiBio • 0
@a7f26943
Last seen 6 weeks ago
United States

Hello,

I am attempting to find the DEGs between Mesenchymal glioblastoma and healthy tissue samples using raw counts. However, PCA indicates no distinct clustering of healthy vs. tumor samples. In general, I am wondering how RNA-seq analysis of multiple in-vivo samples is conducted given their inherent heterogeneity. The code is not included here, but batch correction with sva also resulted in negligible changes.

All samples are from unique individuals, so there are no pairs of healthy vs. tumor samples from the same patient. The location of healthy sample collection is not known.

In general, can I extract any useful information from this RNA-seq dataset? How is RNA-seq analysis of large datasets with unlisted batch conditions handled? Any help is appreciated. Thank you!

PCA



library(tidyverse)
library(conflicted)
library(org.Hs.eg.db)
library(DESeq2)
library(ggplot2)
library(tools)
library(biomaRt)
library(GOSemSim)
library(RColorBrewer)
library(TCGAbiolinks)
library(sva)

query.GBM <- GDCquery(project = "TCGA-GBM",
                      data.category = "Transcriptome Profiling",
                      experimental.strategy = "RNA-Seq",
                      workflow.type = "STAR - Counts")

GDCdownload(query = query.GBM)
data <- GDCprepare(query = query.GBM, save = TRUE, save.filename = "GDC_TCGA_GBM.rda")

# Match sample barcodes to GlioVis clinical data format
colnames(counts) <- substring(colnames(counts), 1, 12)
colnames(counts) <- gsub("-", "." ,colnames(counts))

# Remove patients with multiple samples
duplicates <- colnames(counts)[duplicated(colnames(counts))]
counts <- counts[, !colnames(counts) %in% duplicates]

# Set "Subtype" of non-tumor samples to "Healthy"
pheno$Subtype[pheno$Histology == "Non-tumor"] <- "Healthy"

# filter pheno to include healthy and MES samples
pheno.MES <- pheno[pheno$Subtype == "Mesenchymal" | 
                     pheno$Subtype == "Healthy", ]

counts.MES <- counts[, colnames(counts) %in% pheno.MES$Sample]

# Create counts metadata
col.data.MES <- pheno.MES[pheno.MES$Sample %in% colnames(counts.MES), c("Sample", "Subtype")]
col.data.MES$Subtype <- factor(col.data.MES$Subtype, 
                               levels = c("Healthy", "Mesenchymal"))

col.data.MES <- col.data.MES %>%
  group_by(Subtype) %>%
  mutate(Sample_ID = Sample,
         Sample = paste0(ifelse(Subtype == "Healthy", "Healthy_", "MES_"), row_number())) %>%
  ungroup()

colnames(counts.MES) <- col.data.MES$Sample

# Remove genes with a total of less than 20 counts combined
counts.MES <- counts.MES[rowSums(counts.MES) > 20, ]

# Create DESeq object
dds.MES <- DESeqDataSetFromMatrix(countData = counts.MES, 
                                  colData = col.data.MES,
                                  design = ~Subtype)
dds.MES <- DESeq(dds.MES)

# Plot PCA and check dispersion
vs.dds.MES <- vst(dds.MES, blind=FALSE)

plotPCA(vs.dds.MES, intgroup = c("Subtype")) +ggtitle("MES Biplot")
plotDispEsts(dds.MES)
DESeq2 • 1.0k views
ADD COMMENT
1
Entering edit mode
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Last seen 9 days ago
Australia

If you have trust in housekeeping genes, RUVg will do the trick. However, the ultimate way is technical replicates across batches.

ADD COMMENT
1
Entering edit mode
ATpoint ★ 4.6k
@atpoint-13662
Last seen 10 hours ago
Germany

You can take inspiration from this GitHub repo of the developer towards using RUVseq (https://github.com/mikelove/preNivolumabOnNivolumab) for these sorts of datasets. But with only n=4 for one condition, it is obviously underpowered, so don't expect too much insights. For general guidance, consider collaborating with a local statistician.

ADD COMMENT
0
Entering edit mode

Thanks for pointing to that repo ATpoint

The key insight with RUV is that, even with large count outliers, adding the RUV factors prevents FP and the model is really misspecified otherwise. The RUV factors soak up the variance from aberrant samples.

The only limitation is when batch and condition are totally confounded, then factor analysis can't help you.

ADD REPLY

Login before adding your answer.

Traffic: 618 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6