I'm going to be using frmaTools to generate experiment-specific fRMA vectors from a set of training data. These vectors will be used to normalize the training data itself, as well as individual samples that will become available over time (all arrays are, or will be, run by the same group). I have a couple of questions about the specifics of implementing this approach.
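For reference, here is the rough workflow I have in mind. This is only a sketch: the directory name and batch assignments are placeholders, and I'm going from my reading of the frmaTools vignette for the makeVectorsAffyBatch() interface.

```r
library(affy)       # for list.celfiles()
library(frmaTools)

## training-set CEL files (placeholder directory name)
cel.files <- list.celfiles("training_data", full.names = TRUE)

## placeholder batch assignments: 4 equal-sized batches of 5 arrays.
## As I understand the vignette, frmaTools expects equal-sized batches,
## so I may need to subsample the same number of arrays from each batch.
batch.id <- rep(1:4, each = 5)

## build the experiment-specific fRMA vectors from the training arrays
vecs <- makeVectorsAffyBatch(cel.files, batch.id)
```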
First, I actually have two datasets: the first is all blood samples, while the second is kidney biopsy samples. So each dataset is a single tissue, but with several different conditions (healthy transplant and several types of transplant dysfunction/rejection). The datasets will be analyzed independently, but I wonder whether it makes sense to pool them for the purpose of generating the fRMA vectors. Pooling would result in a single set of vectors based on a larger dataset, and therefore, hopefully, a more robust one. But with only two tissues and a handful of conditions, would pooling them in this way cause problems for the analysis?
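Concretely, these are the two options I'm weighing (the file lists and batch ids below are again placeholders; if I pool, I assume the batch ids need to stay distinct across the two datasets):

```r
library(affy)
library(frmaTools)

blood.files    <- list.celfiles("blood",  full.names = TRUE)
kidney.files   <- list.celfiles("kidney", full.names = TRUE)
blood.batches  <- rep(1:3, each = 5)  # placeholder assignments
kidney.batches <- rep(4:6, each = 5)  # ids distinct from the blood batches

## option (a): one pooled vector set trained on both tissues
vecs.pooled <- makeVectorsAffyBatch(c(blood.files, kidney.files),
                                    c(blood.batches, kidney.batches))

## option (b): a separate vector set per tissue
vecs.blood  <- makeVectorsAffyBatch(blood.files,  blood.batches)
vecs.kidney <- makeVectorsAffyBatch(kidney.files, kidney.batches)
```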
Second, what is the proper way to define a batch? Is there one correct way? Specifically, since my samples are all the same tissue, should I include the sample condition in my definition of batches in addition to technical variables like run date? What are the implications of this choice of batch definition, conceptually?
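To make the two candidate definitions concrete, here is what I mean in code (the metadata below is made up: 12 arrays, 2 run dates, 3 conditions):

```r
## hypothetical training metadata: 2 run dates x 3 conditions x 2 arrays
run.date  <- rep(c("2012-03-01", "2012-04-15"), each = 6)
condition <- rep(rep(c("healthy", "dysfunction", "rejection"), each = 2), 2)

## definition 1: batch = technical variable only (2 batches of 6 arrays)
batch.tech <- as.integer(factor(run.date))

## definition 2: batch = run date x condition (6 batches of 2 arrays)
batch.cond <- as.integer(factor(paste(run.date, condition, sep = ".")))
```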
Finally, fRMA has single-array and multi-array normalization modes. Am I correct in assuming that I should use single-array mode for all arrays, since the test arrays will need to be normalized in single-array mode as they come in?
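In code, I'm assuming that looks like the following for a newly arrived array (the file name is a placeholder, vecs is the vector list from makeVectorsAffyBatch() above, and robust weighted average is, as I understand it, the single-array summarization):

```r
library(affy)
library(frma)

## read and normalize one new array against the frozen vectors
new.array <- ReadAffy(filenames = "new_sample.CEL")
fit <- frma(new.array, summarize = "robust_weighted_average",
            input.vecs = vecs)
expr <- exprs(fit)  # normalized expression values for this single array
```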
Ok, so the goal is to define batches such that probes that are unreliable in my experiment will have large residuals and be down-weighted. That makes sense. I think you've answered my questions.