Hello all,
I have a couple of theoretical/analysis questions about my data analysis. I am spiking an equal amount of the ERCC control mix into each of my samples so that I can derive a normalization factor, since my libraries differ considerably in composition. I would then like to generate TMM-normalized values for all of the genes in my data set based on the ERCC normalization factor and use those values downstream in some custom algorithms/scripts I am designing. I am using edgeR and the calcNormFactors() function on the reads that align to the ERCC transcripts to generate a normalization factor I can apply to the raw read counts in my data set. I have imported my files into a DGEList and generated normalization factors in two different ways, listed below:
A) including only those reads that map to the ERCC transcripts:
files                    group lib.size norm.factors
1-1_Elution_ERCC_counts      1     4452    0.9584354
1-1_Total_ERCC_counts        2      872    0.9997297
GFP_Elution_ERCC_counts      3     4209    1.0072572
GFP_Total_ERCC_counts        4      778    1.0361298
or
B) using all the reads in a sample (both human RNA reads and ERCC control reads):
files                    group lib.size norm.factors
1-1_Elution_ERCC_counts      1   194732    2.3671725
1-1_Total_ERCC_counts        2   184785    0.5019733
GFP_Elution_ERCC_counts      3   185957    2.2148415
GFP_Total_ERCC_counts        4   206232    0.3799678
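For reference, the only difference between the two runs is which rows go into the DGEList before calling calcNormFactors(). Here is a minimal sketch with simulated counts (the matrix, row names, and sample count here are made up placeholders, not my real count files):

```r
library(edgeR)

set.seed(1)
# Toy count matrix: 200 human "genes" plus 92 ERCC spike-in rows, 4 samples
# (all values simulated -- stand-ins for my real count files).
counts <- matrix(rnbinom(292 * 4, mu = 50, size = 10), nrow = 292, ncol = 4)
rownames(counts) <- c(paste0("GENE_", 1:200),
                      paste0("ERCC-", sprintf("%05d", 1:92)))
is.ercc <- grepl("^ERCC-", rownames(counts))

# Option A: TMM factors computed from the ERCC rows only
dA <- calcNormFactors(DGEList(counts = counts[is.ercc, ]))

# Option B: TMM factors computed from the full matrix (genes + spike-ins)
dB <- calcNormFactors(DGEList(counts = counts))

dA$samples$norm.factors  # differ from dB because both lib.size and the
dB$samples$norm.factors  # set of rows entering the trimmed mean change
```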
At this point I feel it important to note that I recognize how small my library sizes are. This was taken from a pilot experiment I am analyzing right now. My full sequencing experiment will be at a much larger depth.
I know that calcNormFactors() is sensitive to library size, so it does not surprise me that I get different results from these two methods, but I am wondering which is the correct way to generate a normalization factor for my purposes. I anticipated that the number of ERCC reads would be higher in my Elution samples than in my Total samples, so I also expected the norm.factors to differ between the Totals and Elutions. Additionally, because I want to apply the norm.factors to all of my other genes, I think I need to take into account the total size of the library (option B), as opposed to option A, where the library size reflects only the ERCC reads. In essence, I am leaning towards option B, but I wanted to get some thoughts from people with more expertise in this subject.
Once I have this settled, I was going to generate TMM-normalized values using the equation normalized count = raw read count / (library size * normalization factor). Does this yield the same values as the $pseudo.counts from the estimateCommonDisp() function? I tried researching this online and in the edgeR manual, but was confused because the manual states the authors do not recommend using pseudo-counts as normalized values.
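To make the arithmetic concrete: what I mean is dividing each count by the effective library size (lib.size * norm.factors), which, multiplied by 1e6, is what edgeR's cpm() reports. A base-R sketch using the option B numbers above (the single gene's raw counts are invented for illustration):

```r
# Effective library size = lib.size * norm.factors (option B values from above)
lib.size     <- c(194732, 184785, 185957, 206232)
norm.factors <- c(2.3671725, 0.5019733, 2.2148415, 0.3799678)
eff.lib      <- lib.size * norm.factors

raw    <- c(120, 95, 110, 88)    # one gene's raw counts across 4 samples (made up)
normed <- raw / eff.lib * 1e6    # scaled to counts per million, as cpm() would give
```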
Thanks for your help!
Adam
Thanks for the clarification, Aaron. For the basic DE analysis we will assume that the majority of genes are non-DE and use them for normalization. But we also want to look at mRNA half-life and are doing a fractionation to do so (the elution and total samples). This is where we feel a spike-in control is necessary, because the composition/complexity of the total and elution samples differ after fractionation. Furthermore, our calculation relies on using equivalent volumes of each sample to generate accurate half-lives, but in order to maintain efficient cluster generation during sequencing we normalize samples by weight, which throws off the volume ratios we care about.
Thanks for clarifying the pseudo-counts as well. I will take a look at the q2qnbinom function in further detail. Correct me if I am wrong, but you are saying that the pseudo-counts are calculated using the q2q function, and that this is a better representation of the normalized data than simple scaling.
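For anyone who finds this thread later, here is my understanding of the distinction in code form, on toy simulated data (the counts and group labels are made up): the pseudo-counts come from a quantile-to-quantile mapping onto a common library size, which is not the same thing as dividing by the effective library size.

```r
library(edgeR)

set.seed(2)
# Toy data: 100 "genes" x 4 samples, two groups (all simulated)
counts <- matrix(rnbinom(400, mu = 30, size = 5), nrow = 100, ncol = 4)
d <- DGEList(counts = counts, group = factor(c(1, 1, 2, 2)))
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)

# Simple scaling by effective library size (cpm-style):
scaled <- t(t(d$counts) / (d$samples$lib.size * d$samples$norm.factors)) * 1e6

head(d$pseudo.counts)  # quantile-adjusted via q2qnbinom, on a common lib size
head(scaled)           # simple scaling; the two are not equal
```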