Question

CNV-expression testing with edgeR

0

Entering edit mode

Ludwig Geistlinger ▴ 70

@ludwig-geistlinger-3939

Last seen 2.7 years ago

USA/Boston/HMS

Dear edgeR-Team,

I have around 200 samples for which I have

1) CNV data, ie. a matrix X (regions x samples) in which a cell is either 2 (normal diploid), 0/1 (loss), or 3/4/5/... (gain),

2) RNA-seq data, ie. a matrix Y (genes x samples) storing the raw RNA-seq read counts.

Now I want to find for each CNV region (rows of X) DE genes between sample groups formed by CN state,.

For each CNV region, the control group is formed by samples that are diploid (2) in this region, against we are comparing one or more sample groups, each representing a deviating state (eg. 3; corresponding to a one copy gain in that region).

That essentally means that the rows of X are different versions of my group vector by which I am testing for differential expression.

Accordingly I could go ahead (reasonable or not), applying edgeR's standard workflow for each row of X

corresponding to testing all CNVs against all genes:

# already filtered for genes that are not sufficiently expressed
y <- edgeR::DGEList(counts=Y)
y <- edgeR::calcNormFactors(y)

edgeIt <- function(group)
{   
    nr.states <- length(unique(group))

    group <- as.factor(group)
    y$samples$group <- group
    f <- stats::formula("~group")
    design <- stats::model.matrix(f)
    y <- edgeR::estimateDisp(y, design, robust=TRUE)
    fit <- edgeR::glmQLFit(y, design, robust=TRUE)
    qlf <- edgeR::glmQLFTest(fit, coef=2:nr.states)
    return(qlf)
}

res <- apply(X, 1, edgeIt)

Now, due to runtime and power considerations, that's not really the optimal approach that you can think off.

First question: Testing various rows of X (many different group vectors), it figures that results of

y <- edgeR::estimateDisp(y, design, robust=TRUE)

hardly change, and it seems to be ok to do this just once after

y <- edgeR::calcNormFactors(y)

i.e. moving the dispersion estimate out of the function? (see code below)

Second question:

As for eQTL studies, you rather want to restrict testing to the genes close-by to a CNV, let's say in a 1 Mbp window.

Let's say I have a pre-computed list cnv2genes giving me for each CNV the surrounding genes. Would it then be appropriate to just test these genes for each CNV for DE? I.e.

# already filtered for genes that are not sufficiently expressed
y <- edgeR::DGEList(counts=Y)
y <- edgeR::calcNormFactors(y)

## computing dispersion only once, independent of design
y <- edgeR::estimateDisp(y, robust=TRUE)

edgeIt2 <- function(group, genes)
{
    nr.states <- length(unique(group))

    group <- as.factor(group)
    y$samples$group <- group
    f <- stats::formula("~group")
    design <- stats::model.matrix(f)

    # testing only genes in the neighborhood of a CNV
    fit <- edgeR::glmQLFit(y[genes,], design, robust=TRUE)
    qlf <- edgeR::glmQLFTest(fit, coef=2:nr.states)
    return(qlf)
}

res <- lapply(rownames(X), function(cnv) edgeIt2(X[cnv,], cnv2genes[[cnv]]))

However, if I'm doing the second version, it runs through for some CNVs, but for others I encounter errors like:

Error in makeCompressedMatrix(glmfit$var.post, dim(glmfit$counts), byrow = FALSE) : 
  'dims[1]' should equal length of 'x'

or

Error in squeezeVar(s2, df = df.residual, covariate = AveLogCPM, robust = robust,  : 
  Could not estimate prior df

Many thanks!

edgeR • 1.3k views

ADD COMMENT • link 6.6 years ago Ludwig Geistlinger ▴ 70

score 4 · Answer 1 · 2018-08-25

For your first question: if your design matrix is changing, then you really should re-estimate your dispersions. If it doesn't make a difference, then it also implies that the choice of design matrix doesn't make a difference; in which case, why bother fitting different design matrices at all? Conversely, if you have CNV-induced DE, then failing to model that with the appropriate design matrix will inflate the dispersions and reduce power.

In fact, one could argue that your design matrices aren't complex enough. Your current setup implicitly assumes that CNVs in any part of the genome could have an effect on any gene in the transcriptome (otherwise you wouldn't bother testing all CNVs against all genes). If you fit a design matrix for CNV region A but a gene is affected by a CNV region B that has a completely different profile across cells, then that gene's variability will be inflated in the A model. You might think that this is okay because you don't care about this gene in the context of A anyway, but empirical Bayes shrinkage will share information across genes and drag up the dispersion estimates for every gene in the data set. This will globally reduce power for your DE analysis for A.

There are also some issues with interpretation when different CNVs have non-orthogonal profiles across the data set. If CNV regions A and A' have similar profiles, DE genes for A could be ascribed to A' or vice versa. This makes it difficult to interpret the results, as you need to consider whether the gene is DE for all other CNVs with a similar profile.

A slightly better approach would probably look something like this:

Perform PCA on the log-CNV matrix and take the first few PCs. An appropriate choice could be made by looking at the scree plot, though there are also more sophisticated methods like parallel analysis.
Use the PC loadings to relate each PC to each CNV region. If you are fortunate, each PC will correspond to clear groups of CNV regions with the same pattern of change across cells. Alternatively, you could test each CNV region for "DE" with respect to each PC in order to identify the contributors to each PC, though this requires some care to avoid circularity.
Use the chosen PCs to form a design matrix, and use that for the DE analysis on the genes. You will now have genes that are DE with respect to each PC, i.e., each CNV group.

This ensures that the dispersion is properly estimated, as all relevant aspects of CNV-induced systematic variation are modelled; and simplifies interpretation by considering groups of similar CNV regions corresponding to each PC. If you want DE that is specific to a single CNV region of interest, you can use all CNV regions except the one of interest to construct the PCs; create the design matrix with the PCs and the profile of the CNV region of interest; and test for the effect of the CNV term, treating the PCs as blocking factors. This has less power if your CNV regions are highly colinear, but the corresponding DE genes would be difficult to interpret in any case.

For your second question: you'll have to be more specific about the context in which these errors occur. I'll guess that the first error is caused by some subsetting error in y[genes,]; these may have been present in previous releases, so make sure you're using the latest version of edgeR. The second error is probably caused by NAs from fitFDistRobustly, though this may be the correct behaviour rather than a bug, e.g., if you have too few genes in a region for any distribution fitting.

score 0 · Answer 2 · 2018-08-27

Thanks a lot for this comprehensive answer and these great suggestions.

Although there are scenarios where a CNV can affect gene expression on a genome-wide

scale (eg by modulating the expression of a coinciding transcription factor, which

in turn regulates distal genes), you rarely have the power to detect that in an

all-CNVs-against-all-genes setting - given the multiple testing problem that you

are running into.

I like your PCA-approach as it not only simplifies interpretation but also

greatly reduces the number of tests that you conduct.

However, my question is more going towards testing each CNV in isolation for effects

on genes within the respective CNV region or close-by (dosage and neighborhood effects), given the clear spatial indication that you have here.

As such, I'm wondering which part of the edgeR pipeline to carry out on the global

scale and which part anew for each cnv and associated subset of proximal genes

(which might come down to testing just a single gene for one CNV at a time).

More general, my question is: I have my RNA-seq transcriptomic measurements, but

I'm only interested in testing a subset of genes, or even just a single one, in

a particular constellation of my sample groups.

Obviously, calcNormFactors needs to be calculated on the complete expression

matrix, and just once.

For estimateDisp apparently for each CNV anew - or at least for each CNV group

as determined with your PCA approach - but again over the complete expression matrix.

Likewise for glmQLFit.

I would then subset the fit to the genes of interest (genes in the vicinity of the cnv under study)

and carry out glmQLFTest on this cnv-specific subset of genes.

Does this sound about right? Many thanks!