Question

[deseq2] normalization between samples after merging transcripts into genes

0

Entering edit mode

marie • 0

@marie-19926

Last seen 3.8 years ago

Norway

Hi, I am trying to do comparative RNA-seq analysis with DESeq2.

My purposes are: 1. combine transcripts into genes 2. detect gene expression difference under different conditions 3. obtain a summarized gene expression table

combine transcripts into genes I basically followed these commands: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#kallisto

library(tximport)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
names(txi)
library(tximportData)

dir <- system.file("/Users", package = "tximportData")
samples <- read.table(file.path("/Users", "samplelist.csv"), header = TRUE)
tx2gene1 <- read.csv(file.path("/Users", "transcript_gene.csv"), header = TRUE)
all(file.exists(files))
files <- file.path("/Users", "kallisto", samples$sample, "abundance.tsv")
names(files) <- samples$sample

library(tximportData)
 txi <- tximport(files, type = "kallisto", tx2gene = tx2gene1)

Is that a valid way? I found the sum of the gene expression of each sample is largely diverse.

detect gene expression difference under different conditions I followed this: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-un-normalized-counts

Since it describes that "As input, the DESeq2 package expects count data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of integer values", I used "txt" above as input without normalization.

dds <- DESeqDataSetFromMatrix(countData = txi,
                              colData = colData,
                              design = ~condition)
result <- results(dds, contrast=c("condition","conditionA","conditionB"))

Is that a valid way to do it?

obtain a summarized gene expression table

dds <- estimateSizeFactors(dds)
normalized <-counts(dds, normalized=TRUE)

However, I found the sum of the gene expression of each sample is largely diverse. Something like: (two samples for condition A and B)

sample A1 A2 B1 B2
gene1  10  11  1  1
gene2  20  19  2  1
gene3  30  32  3  4
gene4  40  38  4  4

It looks odd and I wonder how to obtain a gene expression table of all the samples, which are normalized by both gene length and between-sample bias. Or, is there a significant gene expression depletion in sample B1 and B2 overall?

Thank you for your help.

deseq2 tximport • 2.4k views

ADD COMMENT • link updated 6.2 years ago by Michael Love 43k • written 6.2 years ago by marie • 0

score 0 · Answer 1 · 2019-02-21

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 6 hours ago

United States

Yes that is how to summarize to gene with tximport.

You should use DESeqDataSetFromTximport which is specifically designed for importing from tximport, as the name suggests. See here:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#transcript-abundance-files-and-tximport-input

That table looks like there is DE for those genes. I don't follow what you are trying to show me?

ADD COMMENT • link 6.2 years ago Michael Love 43k

0

Entering edit mode

Thank you for the prompt response.

I tried DESeqDataSetFromTximport this time, but the last problem remains.

What I was trying to ask is: I expected the sum of all gene expression in the tximport output is same across the samples, like some other normalization methods, e.g., "all TPMs (transcript-level or gene-level) should always equal 1,000,000." However, the observed sum of all gene (I assumed gene1-4 is all the genes in the genome in this example above) expression in the tximport output is quite different between the samples (in this case, the sum is 100 for sample A1 and 10 for sample B1), even after the normalized <-counts(dds, normalized=TRUE).

Is that appropriate to calculate fold change over this dataset? If this is not properly normalized, how can I obtain the normalized data table?

I appreciate your help,

ADD REPLY • link 6.2 years ago marie • 0

0

Entering edit mode

Can you show me the column sum of normalized counts for your real data?

ADD REPLY • link 6.2 years ago Michael Love 43k

0

Entering edit mode

The actual data is like this:

sample stage1_treatment1 stage1_treatment2 stage1_treatment2 stage1_treatment3 stage1_treatment2 stage1_treatment3 stage1_treatment2 stage1_treatment1 stage1_treatment3 stage1_treatment3 stage1_treatment3 stage1_treatment3 stage1_treatment1 stage2_treatment1 stage2_treatment2 stage2_treatment1 stage2_treatment3 stage2_treatment2 stage2_treatment3 stage2_treatment1 stage2_treatment1 stage2_treatment3 stage2_treatment1 stage2_treatment3 stage2_treatment1 stage2_treatment3 stage2_treatment2 sum 11360882.22 15284394.07 19442307.11 19787325.58 17012527.38 15219009.92 15389659.54 16551581.26 15560148.45 16184824.93 17221220.16 14641597.56 12890883.85 7180978.134 7209581.174 7462163.292 8744727.234 7069309.349 7538834.546 6157925.432 6281529.7 7788110.682 8102544.576 8004473.424 7499249.027 7546127.962 7360558.875

ADD REPLY • link 6.2 years ago marie • 0

0

Entering edit mode

seems like it ranges from 6 to 20 million? Yes this is strange if those are colSums of normalized counts. Is there something very different about these samples? Are they perhaps experiencing massive changes in the distribution of expressed genes?

ADD REPLY • link 6.2 years ago Michael Love 43k

0

Entering edit mode

I see. Thank you very much for your response. Yes, it could be possible; there can be a huge difference between the condition. I will discuss with my collaborators and will also search for studies with a similar setting. Actually, this is my first experience observing such an extreme diversity between samples although I have analyzed several RNAseq datasets with different experimental designs.

ADD REPLY • link 6.2 years ago marie • 0