I am trying to analyze a barcode sequencing dataset. The experimental technique that generates the data is described here: Chemical genomic profiling via barcode sequencing to predict compound mode of action. Briefly, we sequence a pool of gene knock-out mutants of a yeast or bacterium, all grown together. Each mutant has one gene knocked out and is labelled by a unique barcode sequence, and we count those barcodes in each sample. The goal is to determine which knock-outs grow better or worse under different conditions, i.e. to assess each gene's effect on fitness. The result is much like an RNA-seq count matrix: we have a count for each gene in each sample. I have been trying to use limma to analyze the results. However, there is one difference: the knockout library contains many genes that are represented more than once, i.e. there are two or more barcodes that map to the same gene. I am thinking of either summing or averaging these, so that I get a single count value per gene. This could be done as the first step, before calculating normalization factors in edgeR and running voom. Is this the right thing to do? And if I compute an average count for a gene, do I need to round it to an integer?
It's probably fine to sum the counts for all barcodes corresponding to a single gene prior to running voom. This would be analogous to summing exon counts to get a single gene count per sample in a standard DE analysis. You'll end up with larger counts per gene, which should give you more power to detect differences upon culturing under different conditions. Don't take the mean of counts, as this would make it difficult to model the variances. (The variance of the mean depends on the number of barcodes you added together, which isn't something that voom knows about. While this dependence is also present in the variance of the sum, the size of the sum is proportional to the number of barcodes, so voom can figure it out based on the size of the summed counts.)
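To make the summation concrete, here is a minimal sketch in plain Python (not the R/limma workflow itself). The barcode names, gene names and counts are made up for illustration, and `sum_counts_by_gene` is a hypothetical helper, not a function from any package. In R, something like base `rowsum()` on the count matrix should do the same job before computing normalization factors.

```python
def sum_counts_by_gene(barcode_counts, barcode_to_gene):
    """Sum per-sample barcode counts into a single row per gene."""
    gene_counts = {}
    for barcode, counts in barcode_counts.items():
        gene = barcode_to_gene[barcode]
        if gene not in gene_counts:
            # First barcode seen for this gene: copy its counts.
            gene_counts[gene] = list(counts)
        else:
            # Further barcodes: add sample by sample.
            gene_counts[gene] = [a + b for a, b in zip(gene_counts[gene], counts)]
    return gene_counts


# Hypothetical toy data: three barcodes, two samples.
barcode_counts = {
    "bc1": [10, 20],
    "bc2": [5, 7],
    "bc3": [100, 80],
}
barcode_to_gene = {"bc1": "geneA", "bc2": "geneA", "bc3": "geneB"}

gene_counts = sum_counts_by_gene(barcode_counts, barcode_to_gene)
print(gene_counts)  # {'geneA': [15, 27], 'geneB': [100, 80]}
```

Because the sums of integer counts are themselves integers, the rounding question does not arise with this approach.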
That said, summation assumes that the barcodes for each gene behave similarly and can be aggregated into a single value. If this isn't the case, you might lose power when you add things together, e.g., because strong DE for one barcode is "diluted" by weak DE for another barcode. There are also the strange cases where two barcodes for the same gene respond in opposite directions upon culturing; I'm not sure how to interpret them. I don't know whether such inconsistencies are common in chemical genomics, so it'd be worth checking the behaviour of individual barcodes in your top set of DE genes.
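One rough way to do that check, sketched in plain Python: given a log-fold change per barcode (e.g. from a barcode-level fit), flag genes whose barcodes disagree in sign. The log-fold changes and names below are invented, and `genes_with_mixed_direction` is a hypothetical helper. A sign check alone ignores effect size and significance, so treat any hits as candidates for a closer look rather than as a formal test.

```python
def genes_with_mixed_direction(barcode_logfc, barcode_to_gene):
    """Return genes whose barcodes have both positive and negative log-fold changes."""
    signs = {}
    for barcode, lfc in barcode_logfc.items():
        if lfc == 0:
            continue  # an exactly zero change carries no direction
        signs.setdefault(barcode_to_gene[barcode], set()).add(lfc > 0)
    return sorted(gene for gene, s in signs.items() if len(s) > 1)


# Hypothetical per-barcode log-fold changes.
barcode_logfc = {"bc1": 1.2, "bc2": -0.8, "bc3": 2.0, "bc4": 1.5}
barcode_to_gene = {"bc1": "geneA", "bc2": "geneA", "bc3": "geneB", "bc4": "geneB"}

mixed = genes_with_mixed_direction(barcode_logfc, barcode_to_gene)
print(mixed)  # ['geneA']
```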
Alternatively, you can keep each barcode separate, analyze them separately, and then aggregate their statistics at the end of the statistical analysis, e.g., using Simes' method (test for any barcodes for a gene being DE) or an intersection-union test (test for all barcodes for a gene being DE). This is analogous to testing for differential expression of individual exons.
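As a sketch of what these two combinations compute, here they are in plain Python for the p-values of one gene's barcodes. The function names and example p-values are assumptions for illustration only. Simes' method takes the minimum of n * p_(i) / i over the sorted p-values, and the intersection-union test here is the simple version that takes the largest p-value.

```python
def simes(pvalues):
    """Simes' combined p-value: evidence that at least one barcode is DE."""
    ordered = sorted(pvalues)
    n = len(ordered)
    combined = min(n * p / rank for rank, p in enumerate(ordered, start=1))
    return min(1.0, combined)


def intersection_union(pvalues):
    """Intersection-union p-value: evidence that every barcode is DE."""
    return max(pvalues)


# Hypothetical p-values for three barcodes of the same gene.
pvals = [0.01, 0.04, 0.5]
p_any = simes(pvals)               # about 0.03, driven by the smallest p-value
p_all = intersection_union(pvals)  # 0.5, limited by the weakest barcode
print(p_any, p_all)
```

In this toy case the gene looks significant under the "any barcode" test but not under the "all barcodes" test, which is the kind of disagreement that would merit checking the individual barcodes.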
Thanks, Aaron! This is very helpful. Looks like I can use diffSplice() to test if there are any genes where different mutants behave differently. Is it generally a good idea to aggregate data at gene level, e.g. by summing the counts? In particular, I am concerned that failing to do so would affect gene set enrichment tests. Introducing multiple entries for the same gene where there ought to be a single one would skew the size and composition of gene sets. Is that a valid concern?
In the absence of any evidence suggesting that it's a bad idea, I would sum the counts. This simplifies your analysis (no need to figure out how to report redundant barcodes in the result list), and will give you more power to detect differences. If you're concerned, pick out a couple of genes in your DE list and check that their barcodes are behaving consistently.
As for the gene set tests, whether or not there's a problem depends on whether the testing machinery accounts for the correlations between features (i.e., barcodes for the same gene). If you're using ROAST, for example, the correlations are modelled so any extra barcodes for a gene will be recognised as being redundant and downweighted appropriately. If you're using other tests that assume independence, then you'd be in trouble if each gene were represented more than once.
Thanks again!
Yury