Hi,
To explain our data, we are doing an exploratory network analysis, I have a dataset with 154 individuals (RIL population) which all have only one replicate. Also two parents, with 4 replicates each. We are working with around 50.000 genes. For the network analysis I use DESeq2 transformation with all samples (162), however I get a very strange dispersion plot (see right pic). When I try with only 16 samples this plot looks as expected (see left pic). I tried taking out samples which seem to be outliers when doing sample clustering, this did not change the result.
In addition, when performing either rlog or vst after dispersion calculation, the variance of the genes among the different samples is very low (the genes with highest variance over the samples have a difference of max 2 with rlog for instance).
This is for instance a small selection of the genes with highest variance over a few samples:
sample 1 | sample 10 | sample100 | sample102 | sample103 sample | 104 sample | 105 | |
Gene 1 | 9.209826 | 8.079033 | 8.051399 | 8.064809 | 8.053316 | 8.065023 | 8.053341 |
Gene 2 | 9.929695 | 10.00144 | 8.320844 | 9.718413 | 9.772473 | 9.776905 | 8.331892 |
Gene 3 | 9.666449 | 8.260443 | 9.616898 | 9.587368 | 9.4724 | 9.714794 | 9.295183 |
Gene 4 | 8.604385 | 8.512491 | 8.597288 | 9.579488 | 8.50671 | 8.510553 | 8.507916 |
Gene 5 | 8.765191 | 8.768521 | 8.761995 | 8.762326 | 9.477285 | 10.09543 | 8.762831 |
Gene 6 | 8.475988 | 8.471243 | 9.257903 | 9.651275 | 9.574287 | 8.470999 | 9.381431 |
Gene 7 | 8.69598 | 9.9309 | 9.869177 | 8.675978 | 9.659681 | 8.68028 | 8.67665 |
Gene 8 | 8.678918 | 9.761749 | 8.785855 | 9.596582 | 9.603226 | 9.633314 | 9.654142 |
Anyone an idea of what might be causing this strange dispersion plot and very low variance between different samples of the genes? Is it the big number of samples?
Also I did downstream network analysis, and it did give biologically meaningfull results, so I'm not sure if it is still reliable or not.
Thank you for your comment. As design I consider every individual ("Line") a different condition, this because they all carry a different combination of the parental genomes, and each parent is considered one condition. We don't have 'real' different conditions like environmental stress.
The code I use:
#read expression data
countDataGene <- as.matrix(read.csv("gene_count_matrix.csv", row.names="gene_id"))
#### Filter non expressed genes ####
#remove genes that are not very informative, remove genes which have <10 counts in 90% of samples
countDataGene1 <- countDataGene
m90 <- countDataGene1
percents_90 <- rowMeans(m90 > 10)
percents_90
m90_percents = m90[percents_90>0.1,]
head(m90_percents,100)
dim(m90_percents)
countDataGene <- m90_percents
#read in the experimental design
colData <- as.matrix(read.csv("Experdesign.csv", header= TRUE))
head(colData)
#all(rownames(Design_data) %in% colnames(countDataGene))
all(rownames(colData) %in% colnames(countDataGene))
#Create a DESeqDataSet from count matrix and labels
dds <- DESeqDataSetFromMatrix(countData = countDataGene, colData = colData, design = ~ Line)
dds
###### normalization #######
# Estimate size factors #
dds = estimateSizeFactors(dds)
#The correction (size) factors can be retrieved using:
sizeFactors(dds)
#estimate dispersions
dds <- estimateDispersions(dds)
plotDispEsts(dds)
######RLOG#######
rld <- rlog(dds, blind=FALSE)
rld_count_data <- assay(rld)
head(rld_count_data)
dim(rld_count_data)
#write.csv(rld_count_data, file ="Rlog_gene_count_data.csv", row.names = TRUE)
meanSdPlot(assay(rld))
plotDispEsts(dds)
#with top 30000 Transcripts with highest variance
topVarGenes <- head( order( rowVars( assay(rld) ), decreasing=TRUE ),30000)
topVarGenes <- assay(rld)[ topVarGenes, ]
I'm not sure exactly what is going on here, but the variance is very high for lots of genes. You say you are working with 50,000 genes, do you mean expressed genes? How many genes are expressed here? Is this typical RNA-seq at the gene-level?
Yes around 50,000 expressed genes ( I filtered out genes that have less than 10 counts in 90% of the samples), we are working with wheat. It is typical RNA-seq data at gene level. I made frequency plots of vst and rlog data, and the data does seem to be distributed normal, although shrinked a lot possibly due to the high sample size?
For instance here, the first picture is a gene with low raw read counts, and the pic below a gene with higher raw read counts.
hi,
Can you show the following:
I'm wondering if there is some substantial structure in your data such that ~line is not sufficient. The dispersion estimates above are very high, indicating something like bimodal data within condition.
It gives the following:
> ntd <- normTransform(dds)
> plotPCA(ntd)
Error in .local(object, ...) :
the argument 'intgroup' should specify columns of colData(dds)
Oops, I forgot you aren't working with "condition" as the design. You should put plotPCA(ntd, "line")
That worked! It is a bit as expected, the parents (of which we have four reps) at each side, and all offspring lines are in between (that have only 1 rep each). But very low PCs.
I really don't know what's going on, why the dispersion is so high here, and I'd have to dive in to the counts to have a better idea, but I don't have much time at the moment. You might try instead limma-voom which is very fast for large datasets because it uses linear models, and additionally by working on log scaled counts, it can be less sensitive to outliers.
No problem, thanks a lot for your time and help.