Transcriptional amplification is the phenomenon where the majority of genes in a sample are increased in expression. An overview of the phenomenon and its statistical implications was published in Cell. Recent research found that about 30% of primary cancers show evidence of genome doubling (tetraploidy). Now that I have a data set with matched RNA-seq and whole-genome DNA sequencing (WGS), I notice that many samples have an estimated copy number of 4 for about 70% to 80% of their chromosomes, based on the DNA sequencing analysis.
I am considering how to correctly estimate size factors for the samples before model fitting. edgeR has calcNormFactors, which (by default, via TMM) tries to equalise some summary of the fold changes between samples, such as a trimmed mean of the gene-wise log fold changes. Could a more sophisticated version of it be developed in the future? For example, it might look like
# Hypothetical extension: targetLFC and whichGenes are not existing parameters.
# Assume that the SCC15 cancer sample is tetraploid and SCC22 is diploid.
genesList <- list(SCC15normal = allGenes, SCC15cancer = SCC15tetraploidGenes,
                  SCC22normal = allGenes, SCC22cancer = allGenes)
calcNormFactors(countMatrix, targetLFC = c(0, 1, 0, 0), whichGenes = genesList)
Similarly, DESeq2 has a function named estimateSizeFactors. Would providing a matrix of genes' copy numbers (rows) by samples (columns) as normMatrix allow the accurate estimation of size factors?
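For reference, the kind of call I have in mind is sketched below; countMatrix, sampleInfo and cnMatrix are placeholder names for my inputs.

library(DESeq2)
# Placeholder inputs: countMatrix (genes x samples), sampleInfo (per-sample
# annotation) and cnMatrix, a genes x samples matrix of absolute copy numbers.
dds <- DESeqDataSetFromMatrix(countData = countMatrix,
                              colData = sampleInfo,
                              design = ~ condition)
# normMatrix should exclude the library-size component, so the copy numbers
# are scaled to a diploid baseline; size factors are then estimated on counts
# divided by this matrix, and the product is stored as gene- and
# sample-specific normalization factors.
dds <- estimateSizeFactors(dds, normMatrix = cnMatrix / 2)
head(normalizationFactors(dds))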
Could the vignettes of the edgeR and DESeq2 packages have a section listing all of the functions whose parameters need to be set by the user to correctly model transcriptional amplification? Neither vignette discusses the crucial assumptions required for the default workflows to work well, such as the assumption that most genes are not differentially expressed, nor the impact of violating those assumptions.
With some samples being almost entirely diploid and others almost entirely tetraploid, Approach 1 wouldn't have many genes to use, since it requires a subset of genes that are diploid in all samples. I am doubtful about looping across the samples for Approach 2: wouldn't that mean running calcNormFactors on one sample at a time, so the resulting factor is always 1 (a quick check below suggests so)? Approach 3 seems like a good choice, but the duplications often don't span entire chromosomes, only a sizeable fraction of them, so partitioning the genome into regions to loop over would be difficult.
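As a quick check of that suspicion (simulated data, purely illustrative): the returned factors are rescaled to have a geometric mean of 1, so a lone sample always gets a factor of exactly 1.

library(edgeR)
# A single simulated sample: there is nothing to compare against, and the
# factors are rescaled to a unit geometric mean, so the result is always 1.
one <- matrix(rpois(1000, lambda = 50), ncol = 1)
calcNormFactors(one)
# [1] 1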
Actually, my original motivation for calculating normalisation factors was to scale the raw counts between samples so that diploid and tetraploid pairs would show most genes at a two-fold difference, for exploratory purposes such as drawing a heatmap (log-scale, of course) to highlight how transcriptional amplification affects the majority of genes. But you raise a good point about whether such an overall shift should be expected at all.
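Roughly what I had in mind, as a sketch; ploidy is a placeholder vector of per-sample average ploidies from the WGS calls. Since CPM normalisation removes a global shift along with sequencing depth, the assumed amplification has to be restored from the DNA side before plotting.

library(edgeR)
# Placeholder input: ploidy, e.g. c(SCC15normal = 2, SCC15cancer = 4,
#                                   SCC22normal = 2, SCC22cancer = 2).
logCPM <- cpm(countMatrix, log = TRUE, prior.count = 2)
# Restore the assumed global shift: +1 log2 unit for tetraploid samples.
shifted <- sweep(logCPM, 2, log2(ploidy / 2), "+")
# Log-scale heatmap of the most variable genes.
topVar <- order(apply(shifted, 1, var), decreasing = TRUE)[1:500]
heatmap(shifted[topVar, ], scale = "none")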
Approach 2 clarification:
I'd imagine that you'd probably want to use scaleOffset to add offsets to a DGEList, to get them on the same scale as the log-transformed effective library sizes for proper interpretation, e.g., by aveLogCPM.
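For instance, a minimal sketch, assuming a genes-by-samples copy-number matrix cnMatrix is available (the input names are placeholders):

library(edgeR)
# Placeholder inputs: countMatrix and cnMatrix (genes x samples, 2 = diploid).
y <- DGEList(counts = countMatrix)
# Combine the copy-number effect with library size on the natural log scale;
# scaleOffset() then adjusts the offsets to be consistent with the average
# log effective library size, so that aveLogCPM() and friends interpret them
# correctly.
off <- sweep(log(cnMatrix), 2, log(y$samples$lib.size), "+")
y <- scaleOffset(y, off)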
Approach 3 comments:
Depending on exactly how far down the rabbit hole you're willing to go, it's theoretically possible to accommodate gene-by-gene differences in the copy number profile. For the NB dispersion estimation, you just loop across all genes, fit a GLM to each gene separately to get the adjusted profile likelihood values across the dispersion grid, and then aggregate these values across genes with WLEB to get the trend. Repeat this for the QL dispersion estimation, passing a vector of gene-specific residual d.f. to fitFDist (or fitFDistRobustly). Repeat again when testing the contrast of interest with glmQLFTest, picking up p-values for whatever non-copy-number-related contrast you're interested in. You'd have to dive deep into edgeR's guts to do this, but it can be done by someone who is sufficiently determined/knowledgeable/desperate.
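To make the first step concrete, a rough sketch of the per-gene APL loop, with counts, group and cnMatrix as placeholder inputs; this only illustrates the idea and is not a drop-in replacement for estimateDisp.

library(edgeR)
# Placeholder inputs: counts (genes x samples), group (sample factor),
# cnMatrix (genes x samples copy numbers). Genes with a constant copy number
# across samples would need the covariate dropped to avoid rank deficiency.
lib.size <- colSums(counts)
grid <- seq(log2(0.005), log2(5), length.out = 21)  # grid of log2 dispersions
apl <- matrix(0, nrow(counts), length(grid))
for (g in seq_len(nrow(counts))) {
    # Gene-specific design absorbing this gene's own copy-number profile.
    design.g <- model.matrix(~ group + log2(cnMatrix[g, ]))
    for (i in seq_along(grid)) {
        apl[g, i] <- adjustedProfileLik(2^grid[i],
                                        y = counts[g, , drop = FALSE],
                                        design = design.g,
                                        offset = log(lib.size))
    }
}
# Aggregate the per-gene APLs into an abundance-dependent trend, in the same
# spirit as edgeR's internal dispersion estimation.
trend <- WLEB(theta = grid, loglik = apl,
              covariate = aveLogCPM(counts, lib.size = lib.size))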
Additional remarks:
Indeed, copy number changes can have unpredictable effects. Take Hi-C, for example: if you double the copies of two pieces of DNA, how does their contact frequency change? One might expect 2-fold if they were already in a tight interaction, but one might alternatively expect 4-fold under a random collision model. And these are just low-level physical interactions; one shudders to think of predicting the higher-order changes from a large-scale perturbation of the transcriptional regulatory network.