Question

Strange dispersion plot

0

Entering edit mode

Moose123 • 0

@moose123-13761

Last seen 6.8 years ago

Hi,

To explain our data, we are doing an exploratory network analysis, I have a dataset with 154 individuals (RIL population) which all have only one replicate. Also two parents, with 4 replicates each. We are working with around 50.000 genes. For the network analysis I use DESeq2 transformation with all samples (162), however I get a very strange dispersion plot (see right pic). When I try with only 16 samples this plot looks as expected (see left pic). I tried taking out samples which seem to be outliers when doing sample clustering, this did not change the result.

In addition, when performing either rlog or vst after dispersion calculation, the variance of the genes among the different samples is very low (the genes with highest variance over the samples have a difference of max 2 with rlog for instance).

This is for instance a small selection of the genes with highest variance over a few samples:

	sample 1	sample 10	sample100	sample102	sample103 sample	104 sample	105
Gene 1	9.209826	8.079033	8.051399	8.064809	8.053316	8.065023	8.053341
Gene 2	9.929695	10.00144	8.320844	9.718413	9.772473	9.776905	8.331892
Gene 3	9.666449	8.260443	9.616898	9.587368	9.4724	9.714794	9.295183
Gene 4	8.604385	8.512491	8.597288	9.579488	8.50671	8.510553	8.507916
Gene 5	8.765191	8.768521	8.761995	8.762326	9.477285	10.09543	8.762831
Gene 6	8.475988	8.471243	9.257903	9.651275	9.574287	8.470999	9.381431
Gene 7	8.69598	9.9309	9.869177	8.675978	9.659681	8.68028	8.67665
Gene 8	8.678918	9.761749	8.785855	9.596582	9.603226	9.633314	9.654142

Anyone an idea of what might be causing this strange dispersion plot and very low variance between different samples of the genes? Is it the big number of samples?

Also I did downstream network analysis, and it did give biologically meaningfull results, so I'm not sure if it is still reliable or not.

deseq2 dispersion • 2.5k views

ADD COMMENT • link 6.8 years ago Moose123 • 0

score 0 · Answer 1 · 2018-02-26

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 6 days ago

United States

Can you post your code? Relevant for example is the design you are using.

ADD COMMENT • link 6.8 years ago Michael Love 43k

0

Entering edit mode

Thank you for your comment. As design I consider every individual ("Line") a different condition, this because they all carry a different combination of the parental genomes, and each parent is considered one condition. We don't have 'real' different conditions like environmental stress.

The code I use:

#read expression data

countDataGene <- as.matrix(read.csv("gene_count_matrix.csv", row.names="gene_id"))

#### Filter non expressed genes ####

#remove genes that are not very informative, remove genes which have <10 counts in 90% of samples

countDataGene1 <- countDataGene

m90 <- countDataGene1

percents_90 <- rowMeans(m90 > 10)

percents_90

m90_percents = m90[percents_90>0.1,]

head(m90_percents,100)

dim(m90_percents)

countDataGene <- m90_percents

#read in the experimental design

colData <- as.matrix(read.csv("Experdesign.csv", header= TRUE))

head(colData)

#all(rownames(Design_data) %in% colnames(countDataGene))

all(rownames(colData) %in% colnames(countDataGene))

#Create a DESeqDataSet from count matrix and labels

dds <- DESeqDataSetFromMatrix(countData = countDataGene, colData = colData, design = ~ Line)

dds

###### normalization #######

# Estimate size factors #

dds = estimateSizeFactors(dds)

#The correction (size) factors can be retrieved using:

sizeFactors(dds)

#estimate dispersions

dds <- estimateDispersions(dds)

plotDispEsts(dds)

######RLOG#######

rld <- rlog(dds, blind=FALSE)

rld_count_data <- assay(rld)

head(rld_count_data)

dim(rld_count_data)

#write.csv(rld_count_data, file ="Rlog_gene_count_data.csv", row.names = TRUE)

meanSdPlot(assay(rld))

plotDispEsts(dds)

#with top 30000 Transcripts with highest variance

topVarGenes <- head( order( rowVars( assay(rld) ), decreasing=TRUE ),30000)

topVarGenes <- assay(rld)[ topVarGenes, ]

ADD REPLY • link 6.8 years ago Moose123 • 0

0

Entering edit mode

I'm not sure exactly what is going on here, but the variance is very high for lots of genes. You say you are working with 50,000 genes, do you mean expressed genes? How many genes are expressed here? Is this typical RNA-seq at the gene-level?

ADD REPLY • link 6.8 years ago Michael Love 43k

0

Entering edit mode

Yes around 50,000 expressed genes ( I filtered out genes that have less than 10 counts in 90% of the samples), we are working with wheat. It is typical RNA-seq data at gene level. I made frequency plots of vst and rlog data, and the data does seem to be distributed normal, although shrinked a lot possibly due to the high sample size?

For instance here, the first picture is a gene with low raw read counts, and the pic below a gene with higher raw read counts.

ADD REPLY • link 6.7 years ago Moose123 • 0

0

Entering edit mode

hi,

Can you show the following:

ntd <- normTransform(dds)
plotPCA(ntd)

I'm wondering if there is some substantial structure in your data such that ~line is not sufficient. The dispersion estimates above are very high, indicating something like bimodal data within condition.

ADD REPLY • link 6.7 years ago Michael Love 43k

0

Entering edit mode

It gives the following:

> ntd <- normTransform(dds)
> plotPCA(ntd)
Error in .local(object, ...) :
the argument 'intgroup' should specify columns of colData(dds)

ADD REPLY • link 6.7 years ago Moose123 • 0

0

Entering edit mode

Oops, I forgot you aren't working with "condition" as the design. You should put plotPCA(ntd, "line")

ADD REPLY • link 6.7 years ago Michael Love 43k

0

Entering edit mode

That worked! It is a bit as expected, the parents (of which we have four reps) at each side, and all offspring lines are in between (that have only 1 rep each). But very low PCs.

ADD REPLY • link 6.7 years ago Moose123 • 0

0

Entering edit mode

I really don't know what's going on, why the dispersion is so high here, and I'd have to dive in to the counts to have a better idea, but I don't have much time at the moment. You might try instead limma-voom which is very fast for large datasets because it uses linear models, and additionally by working on log scaled counts, it can be less sensitive to outliers.

ADD REPLY • link 6.7 years ago Michael Love 43k

0

Entering edit mode

No problem, thanks a lot for your time and help.

ADD REPLY • link 6.7 years ago Moose123 • 0

score 0 · Answer 2 · 2018-02-26

0

Entering edit mode

Moose123 • 0

@moose123-13761

Last seen 6.8 years ago

*edit: My excuses, this message can be deleted. I did not reply directly on your question.

ADD COMMENT • link 6.8 years ago Moose123 • 0