Jenny Drnevich
Hello all,
I'm analyzing a data set that turns out to be a little unusual, but
related to the recent discussions on what to do when a large proportion
(>40%) of genes is changing. I'd like some advice on my approach,
particularly from the point of view of a manuscript reviewer...
Here's the scenario: I get a set of 6 Affymetrix chips to analyze: 2
treatments, 3 independent reps each. The QC on the chips is outstanding,
and the distributions of intensities within each set of reps are very
similar, but the "Inf" treatment has slightly lower expression values
overall than the "Non" treatment, based on boxplot() and hist(). I use
GCRMA for preprocessing and limma functions for the two-group
comparison. Results: about half of the genes are differentially
expressed at FDR=0.05, and twice as many are downregulated as
upregulated. I am now worried about the normalization, because quantile
normalization (and just about every other normalization method) assumes
that only a small proportion of genes (~20%-40% at most) are changing.
So I ask the researcher whether she would expect a large number of genes
to be changing, and whether most of them would be decreasing, and she
says "yes, of course". It turns out her treatments on the cell line are
mock-infected (control) and infected with a virus that takes over the
cell completely to produce viral RNA and eventually kills the cell. The
infected treatment was harvested right when the first cells started
dying, so there should be broad-scale down-regulation of host mRNAs due
to infection. This corresponds to the lower overall intensities in the
"Inf" group; extraction efficiencies were equivalent for all the
samples, and equal volumes of labeled RNA were hybridized to each chip,
so I assume the remainder of the RNA in the "Inf" samples was viral. The
viral RNA did not appear to have much effect on non-specific binding,
because MM distributions were extremely similar across all arrays,
although again slightly lower for the "Inf" replicates.
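To make the worry about quantile normalization concrete, here is a toy sketch in Python rather than the R/Bioconductor functions used for the real analysis. The `quantile_normalize` helper and the five-gene data are made up for illustration and are not the GCRMA implementation; ties are broken by position, which real implementations handle more carefully. It shows that when every gene is lower by the same amount on one array, normalizing across both arrays removes the difference entirely.

```python
from statistics import mean

def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of log2 intensities.

    Each value is replaced by the mean, across arrays, of the values
    that share its rank. Ties are broken by position (a simplification).
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean(s[i] for s in sorted_arrays) for i in range(n)]
    out = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # gene indices by rank
        normed = [0.0] * n
        for rank, idx in enumerate(order):
            normed[idx] = rank_means[rank]
        out.append(normed)
    return out

# Made-up toy data: 5 genes, one "Non" array and one "Inf" array.
non = [8.0, 9.0, 10.0, 11.0, 12.0]
inf = [7.0, 8.0, 9.0, 10.0, 11.0]  # every gene is 1 log2 unit lower

non_q, inf_q = quantile_normalize([non, inf])
diffs = [i - n for i, n in zip(inf_q, non_q)]
print(diffs)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

Before normalization every gene differs by -1 between the two arrays; afterwards both arrays carry the identical values 7.5, 8.5, 9.5, 10.5, 11.5, so a genuine global decrease would be invisible to the downstream comparison.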
What is the best way to normalize these data? Suggestions in the
Bioconductor archives for dealing with disparate groups mostly involved
samples from different tissue types, and the consensus seemed to be to
normalize within each group separately. However, there were cautions
that the values across tissue types may not be comparable, and that
scaling each array to the same mean/median intensity might be a good
solution. In this case, though, I don't think scaling is appropriate,
because there is reason to believe that the mean/median intensity is not
the same between the treatments. I remember a paper discussing
normalization assumptions that mentioned a case where programmed cell
death was being assayed, so most transcripts were going way down.
However, I can't remember what they advised in this case, nor which
paper it was. Does anyone know?
This situation also turns out to be very similar to the spike-in
experiment of Choe et al. (Genome Biology 2005, 6:R16), where they
spiked in ~2500 RNA species at the same concentration for two groups (C
and S), and another ~1300 RNA species at various concentrations, all
higher in the S group; to make up for the difference in overall RNA
concentration, they added an appropriate amount of unlabeled poly(C) RNA
to the C group. So in total, ~3800 RNA species were present out of the
~14,000 probe sets on the Affy DrosGenome1 chip. Even though fewer than
10% of all the probe sets were changed, because they were all
"up-regulated", the typical normalization routines resulted in apparent
"down-regulation" of many probe sets that were spiked in at the same
level. Their solution was to normalize to the probe sets corresponding
to the unchanged RNAs, so they could evaluate variants of other
pre-processing steps and analysis methods. Obviously, we cannot do this.
There are only 4 external spike-in controls, so I am hesitant to
normalize to them either.
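The mechanism behind that apparent "down-regulation" can be sketched with made-up numbers. This is a toy Python illustration, not the analysis from Choe et al.: the `scale_to_common_mean` helper, the 8 probe sets, and the simple mean-scaling step are all assumptions chosen only to show how one-directional change biases the unchanged probe sets.

```python
from statistics import mean

def scale_to_common_mean(arrays):
    """Shift each array (log2 scale) so that all arrays share the same mean."""
    target = mean(mean(a) for a in arrays)
    return [[x - mean(a) + target for x in a] for a in arrays]

# Made-up data: 8 probe sets on the log2 scale.
# The first 6 are identical in both groups; the last 2 are 2 units higher in S.
c_array = [10.0] * 8
s_array = [10.0] * 6 + [12.0] * 2

c_scaled, s_scaled = scale_to_common_mean([c_array, s_array])
log_fc = [s - c for s, c in zip(s_scaled, c_scaled)]
print(log_fc)  # [-0.5, -0.5, -0.5, -0.5, -0.5, -0.5, 1.5, 1.5]
```

The six probe sets that were truly unchanged now appear to be down by 0.5, and the two that were up by 2 appear to be up by only 1.5, because the scaling step absorbed part of the one-sided change.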
Here is what I propose to do to account for both a large proportion of
genes changing, and most of them changing in one direction, along with
justification that I hope is acceptable:
Background correction was performed based on GC content of the probes
(Wu et al. 2004). Because infection is expected to cause a large
proportion of genes to change, normalization across all arrays was not
performed, since most normalization methods assume that only a small
fraction of genes are changing (refs). Instead, quantile normalization
was performed separately for each treatment group, as has been suggested
for disparate samples such as different tissue types. Additionally, the
amount of host RNA in the infected cells is expected to decrease, so the
two sets of arrays were not scaled to the same median but were instead
left unscaled; in this experiment, the extremely high correlation and
consistency of array values suggest that the arrays can be directly
compared.
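As a rough sketch of what the proposed scheme does, the toy Python example below quantile-normalizes each group on its own and applies no scaling between groups. It is not the R/GCRMA code used for the real data; the `quantile_normalize` helper, the 4 genes, and all intensity values are made up, and ties are broken by position.

```python
from statistics import mean, median

def quantile_normalize(arrays):
    """Replace each value by the across-array mean of the values at its rank."""
    n = len(arrays[0])
    rank_means = [mean(vals) for vals in zip(*(sorted(a) for a in arrays))]
    out = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # gene indices by rank
        normed = [0.0] * n
        for rank, idx in enumerate(order):
            normed[idx] = rank_means[rank]
        out.append(normed)
    return out

# Made-up log2 intensities: 3 replicate arrays per group, 4 genes each.
non_group = [[12.0, 8.0, 14.0, 10.0],
             [13.0, 9.0, 15.0, 11.0],
             [14.0, 10.0, 16.0, 12.0]]
inf_group = [[10.0, 7.0, 13.0, 8.0],
             [11.0, 6.0, 12.0, 9.0],
             [12.0, 8.0, 14.0, 10.0]]

# Normalize within each group separately; no between-group scaling.
non_q = quantile_normalize(non_group)
inf_q = quantile_normalize(inf_group)

shift = median(inf_q[0]) - median(non_q[0])
print(non_q[0], inf_q[0], shift)
# [13.0, 9.0, 15.0, 11.0] [11.0, 7.0, 13.0, 9.0] -2.0
```

Within each group the replicates end up with one shared distribution, while the overall difference between groups (here a median shift of -2) is left in the data. Whether that remaining shift is biology or technical artifact is exactly the assumption a reviewer would need to accept.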
What do you think? Would this pass muster with you if you were the
reviewer?
Thanks,
Jenny
Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA
ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu