Question

RUVSeq empirical negative controls: how many to take to span the unwanted variation?

anthonycolombo60 • 0

Hi.

I am using RUVg(eset, k=1, ...) and choosing the in-silico (empirical) negative controls as the genes or transcripts whose Benjamini-Hochberg adjusted p-value is greater than 0.55. A great many highly non-significant entries pass this threshold.

My question: I am not sure how many of these non-significant entries to pass to RUVg as empirical negative controls.

My first thought was to take the 10% least significant genes/transcripts, with p-values near 0.8. That gives a few hundred entries, far fewer than the flat p-value threshold of 0.55.

Then I read the RUVSeq manual, where the authors take as controls every gene that is not among the top 5000 genes returned by edgeR.

I do notice a difference in the estimated weights depending on how the negative controls are selected, but I am not sure which elements (and how many) the factor analysis needs in order to best estimate the space spanned by the unwanted variation.

Any suggestions are greatly appreciated.

Sincerely,

Anthony C.

ruvseq RUV ruvnormalize ruvg • 1.9k views

score 2 · Accepted Answer · 2016-07-12

Hi Anthony,

When selecting a set of negative controls, there is a trade-off between having enough genes and having genes that are not affected by the biological factor of interest. Selecting more genes will in principle lead to more stable estimates of the unwanted variation (UV) factors, but carries the risk of including genes that are actually differentially expressed (DE).

In practice, a few hundred genes are usually enough, so I think your approach of selecting only the bottom 10% of genes ranked by p-value should be fine. However, if the results differ a lot between sets of negative controls, you may want to look more closely at how these genes behave, to see whether the smaller set fails to fully capture the batch effects or the larger set captures some of the biological signal of interest.
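To make the comparison between candidate control sets concrete, here is a minimal sketch of the three selection rules mentioned in this thread (flat threshold, least-significant fraction, everything outside the top N), plus a simple overlap measure. It is written in plain Python on simulated p-values so that it runs without Bioconductor. The function names, the cutoff of 500 (a scaled-down stand-in for the 5000 in the manual) and the toy data are all illustrative assumptions, not part of RUVSeq or edgeR.

```python
import random


def above_threshold(pvalues, threshold=0.55):
    """Genes whose p-value exceeds a flat cutoff."""
    return {gene for gene, p in pvalues.items() if p > threshold}


def bottom_fraction(pvalues, fraction=0.10):
    """The given fraction of genes with the largest p-values (least evidence of DE)."""
    ranked = sorted(pvalues, key=pvalues.get, reverse=True)
    n_keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:n_keep])


def all_but_top(pvalues, n_top):
    """Every gene except the n_top most significant ones."""
    ranked = sorted(pvalues, key=pvalues.get)  # smallest p-value first
    return set(ranked[n_top:])


def jaccard(a, b):
    """Overlap between two gene sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Simulated p-values for 1000 hypothetical genes (toy data, not real results).
rng = random.Random(0)
pvalues = {f"gene{i}": rng.random() for i in range(1000)}

candidates = {
    "p > 0.55": above_threshold(pvalues, 0.55),
    "bottom 10%": bottom_fraction(pvalues, 0.10),
    "all but top 500": all_but_top(pvalues, 500),
}

for name, genes in candidates.items():
    print(f"{name}: {len(genes)} genes")

names = list(candidates)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        overlap = jaccard(candidates[first], candidates[second])
        print(f"{first} vs {second}: Jaccard = {overlap:.2f}")
```

In a real analysis, each candidate set would be passed in turn to RUVg as the control genes, and the estimated factors compared across sets.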

The easiest way to check this is to plot the samples in the space of the first principal components, color-coded by biology and possibly by other factors that you know may influence the experiment.
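As a rough illustration of that check, the sketch below computes the first two principal component scores per sample, which can then be plotted and colored by condition or batch. It uses plain Python with power iteration on the sample-by-sample Gram matrix, only so that it runs without extra packages. The helper names and the simulated two-batch data are assumptions made for illustration. In practice you would run your usual PCA routine on the normalized expression values obtained with each control set.

```python
import math
import random


def center_columns(rows):
    """Subtract each gene's mean across samples (rows = samples, columns = genes)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def gram(rows):
    """Sample-by-sample matrix of inner products."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def top_eigen(matrix, iters=200, seed=0):
    """Leading eigenvalue and eigenvector of a symmetric matrix by power iteration."""
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in matrix]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iters):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    value = sum(v * sum(m * w for m, w in zip(row, vec)) for v, row in zip(vec, matrix))
    return value, vec


def pc_scores(rows, n_components=2):
    """Principal component scores, one tuple per sample."""
    g = gram(center_columns(rows))
    n = len(g)
    scores = []
    for _ in range(n_components):
        value, vec = top_eigen(g)
        scores.append([math.sqrt(max(value, 0.0)) * v for v in vec])
        # Deflate: remove the component just found before looking for the next one.
        g = [[g[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return list(zip(*scores))


# Toy data: 6 hypothetical samples in two batches, 50 genes, batch B shifted upwards.
rng = random.Random(1)
batches = ["A", "A", "A", "B", "B", "B"]
expr = [[rng.gauss(0.0, 1.0) + (3.0 if b == "B" else 0.0) for _ in range(50)]
        for b in batches]

for label, (pc1, pc2) in zip(batches, pc_scores(expr)):
    print(f"batch {label}: PC1 = {pc1:7.2f}, PC2 = {pc2:7.2f}")
```

If the samples still separate by batch on the leading components after normalization, the control set has probably not captured the unwanted variation. If they stop separating by biology, the control set may be absorbing signal of interest.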

I hope this helps.