Hi, I am trying to do comparative RNA-seq analysis with DESeq2.
My purposes are: 1. combine transcripts into genes 2. detect gene expression difference under different conditions 3. obtain a summarized gene expression table
- combine transcripts into genes I basically followed these commands: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#kallisto
library(tximport)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
names(txi)
library(tximportData)
dir <- system.file("/Users", package = "tximportData")
samples <- read.table(file.path("/Users", "samplelist.csv"), header = TRUE)
tx2gene1 <- read.csv(file.path("/Users", "transcript_gene.csv"), header = TRUE)
all(file.exists(files))
files <- file.path("/Users", "kallisto", samples$sample, "abundance.tsv")
names(files) <- samples$sample
library(tximportData)
txi <- tximport(files, type = "kallisto", tx2gene = tx2gene1)
Is that a valid way? I found the sum of the gene expression of each sample is largely diverse.
- detect gene expression difference under different conditions I followed this: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-un-normalized-counts
Since it describes that "As input, the DESeq2 package expects count data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of integer values", I used "txt" above as input without normalization.
dds <- DESeqDataSetFromMatrix(countData = txi,
colData = colData,
design = ~condition)
result <- results(dds, contrast=c("condition","conditionA","conditionB"))
Is that a valid way to do it?
- obtain a summarized gene expression table
dds <- estimateSizeFactors(dds)
normalized <-counts(dds, normalized=TRUE)
However, I found the sum of the gene expression of each sample is largely diverse. Something like: (two samples for condition A and B)
sample A1 A2 B1 B2
gene1 10 11 1 1
gene2 20 19 2 1
gene3 30 32 3 4
gene4 40 38 4 4
It looks odd and I wonder how to obtain a gene expression table of all the samples, which are normalized by both gene length and between-sample bias. Or, is there a significant gene expression depletion in sample B1 and B2 overall?
Thank you for your help.
Thank you for the prompt response.
I tried
DESeqDataSetFromTximport
this time, but the last problem remains.What I was trying to ask is: I expected the sum of all gene expression in the tximport output is same across the samples, like some other normalization methods, e.g., "all TPMs (transcript-level or gene-level) should always equal 1,000,000." However, the observed sum of all gene (I assumed gene1-4 is all the genes in the genome in this example above) expression in the tximport output is quite different between the samples (in this case, the sum is 100 for sample A1 and 10 for sample B1), even after the
normalized <-counts(dds, normalized=TRUE)
.Is that appropriate to calculate fold change over this dataset? If this is not properly normalized, how can I obtain the normalized data table?
I appreciate your help,
Can you show me the column sum of normalized counts for your real data?
The actual data is like this:
sample stage1_treatment1 stage1_treatment2 stage1_treatment2 stage1_treatment3 stage1_treatment2 stage1_treatment3 stage1_treatment2 stage1_treatment1 stage1_treatment3 stage1_treatment3 stage1_treatment3 stage1_treatment3 stage1_treatment1 stage2_treatment1 stage2_treatment2 stage2_treatment1 stage2_treatment3 stage2_treatment2 stage2_treatment3 stage2_treatment1 stage2_treatment1 stage2_treatment3 stage2_treatment1 stage2_treatment3 stage2_treatment1 stage2_treatment3 stage2_treatment2 sum 11360882.22 15284394.07 19442307.11 19787325.58 17012527.38 15219009.92 15389659.54 16551581.26 15560148.45 16184824.93 17221220.16 14641597.56 12890883.85 7180978.134 7209581.174 7462163.292 8744727.234 7069309.349 7538834.546 6157925.432 6281529.7 7788110.682 8102544.576 8004473.424 7499249.027 7546127.962 7360558.875
seems like it ranges from 6 to 20 million? Yes this is strange if those are colSums of normalized counts. Is there something very different about these samples? Are they perhaps experiencing massive changes in the distribution of expressed genes?
I see. Thank you very much for your response. Yes, it could be possible; there can be a huge difference between the condition. I will discuss with my collaborators and will also search for studies with a similar setting. Actually, this is my first experience observing such an extreme diversity between samples although I have analyzed several RNAseq datasets with different experimental designs.