I have Illumina-sequenced mRNA libraries amplified from single cells using the SMART protocol. Each sample required a different number of PCR cycles to get enough cDNA for library prep, and the quality of the cDNA was also variable (i.e. read length distribution and amount).
The goal of the experiment is to see whether the cells differ in terms of gene expression, and whether cells with a similar morphology are also more similar to each other in their expression profiles (e.g. cluster together in a PCA). Since we don't know these things yet, it is hard to say whether we have replicates or not, but we also hope to identify differentially expressed genes between the different cell types.
But I don’t know how to best normalize these data. I was thinking to normalize based on housekeeping genes. But maybe it is better to take into account the number of PCR-cycles (but there is also an amplification step in the library prep)? Or simply just normalize based on the number of mapped reads in total? And which criteria can I use to evaluate which procedure performs the best?
Thanks! Jon
I think it at least partly depends on what you consider 'similar' gene expression in this context. Question: if you have two cells with precisely the same transcripts expressed in precisely the same relative amounts, *but* one cell expresses everything ten times more than the other, do you consider those cells similar or not? If you do, then normalising by total read count seems sufficient. For many applications I would argue that it is reasonable to consider those two cells identical: total RNA content is often simply a function of cell size, which is often not really of biological interest. If you do not consider those cells identical, then I think you need some kind of spike-in (like ERCC) to normalise against. Do you have those included in your design?
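To make the total-read-count idea concrete, here is a minimal sketch in plain Python. The gene names and counts are made up for illustration, and `counts_per_million` is just a hypothetical helper name, not a function from any particular package. It only shows that scaling each cell to a common total makes two cells with the same relative expression look identical, even if one has ten times more reads.

```python
import math

def counts_per_million(counts):
    """Scale one cell's raw read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cell has no mapped reads")
    return {gene: 1e6 * n / total for gene, n in counts.items()}

# Hypothetical toy data: cell_b has the same relative expression as cell_a,
# but ten times as many reads for every gene.
cell_a = {"geneA": 50, "geneB": 150, "geneC": 300}
cell_b = {gene: 10 * n for gene, n in cell_a.items()}

cpm_a = counts_per_million(cell_a)
cpm_b = counts_per_million(cell_b)

# After scaling, the two profiles agree gene by gene.
same = all(math.isclose(cpm_a[g], cpm_b[g]) for g in cell_a)
print(same)
```

Note that this scaling deliberately throws away any global difference in total RNA content, so it only fits the case where such a difference is not of interest.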
Thanks for the feedback. In my case, the two cells in your example would be regarded as identical. I guess it would also be very hard to distinguish genuinely elevated levels of gene expression from differences in the PCR amplification. About the spike-in, we didn't think about that. But that is definitely something we should have added...