Jeff Gentry
Forwarded at Rafael's request.
Given the heavy usage of affy by members of this list, it might be of
interest.
---------- Forwarded message ----------
Date: Fri, 23 Aug 2002 00:07:09 -0400 (EDT)
From: Rafael A. Irizarry <ririzarr@jhsph.edu>
Reply-To: rafa@jhu.edu
To: biocore@stat.math.ethz.ch
Subject: affy 2.0
hi! for the next version of affy i would like to have just one main
class.
because the pkg is a merge of two, we have redundancy: there are two
approaches for storing probe level data. this is extra work because we
have to make sure methods work for both.
regardless of the approach we decide on, we will have the same methods,
so the user should not see the difference. i need help deciding which
approach is more convenient.
i'll use "chips" instead of "arrays" so that we don't get confused with
what R calls arrays.
approach 1: for each chip we keep a matrix (Cel) where the row 10,
column 12 entry represents the probe intensity read from the physical
row 10, column 12 position on the chip. we then keep three dimensional
arrays to represent multiple chip experiments. to know what position
goes with what gene, a separate class (Cdf) is defined that contains a
matrix with the gene names for each entry in the probe intensity
matrix. so the row 10, column 12 entry in the Cdf matrix gives the gene
name for the probe in the row 10, column 12 entry in the Cel matrix.
the Cdf class contains the necessary info to know what's PM and what's
MM.
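a rough sketch of approach 1 may help make the layout concrete. it is
written in Python with plain nested lists rather than R, the 2 x 3 chip
and its values are made up, and storing a (gene name, "pm"/"mm") pair in
each Cdf entry is only one assumed way to carry the PM/MM information.
the helper names probe_at and intensities_for are hypothetical.

```python
# Hypothetical toy data: a 2 x 3 chip, not a real Affymetrix layout.
# cel[r][c] is the intensity read at physical row r, column c.
cel = [
    [120.0, 95.0, 300.0],
    [80.0, 60.0, 210.0],
]

# cdf[r][c] describes the probe at the same position.
# Assumption: each entry is a (gene name, "pm" or "mm") pair.
cdf = [
    [("geneA", "pm"), ("geneB", "pm"), ("geneA", "pm")],
    [("geneA", "mm"), ("geneB", "mm"), ("geneA", "mm")],
]


def probe_at(cel, cdf, row, col):
    """Return (gene, probe type, intensity) for one physical position."""
    gene, kind = cdf[row][col]
    return gene, kind, cel[row][col]


def intensities_for(cel, cdf, gene, kind):
    """Scan the whole chip and collect one gene's PM or MM intensities."""
    return [
        cel[r][c]
        for r in range(len(cel))
        for c in range(len(cel[r]))
        if cdf[r][c] == (gene, kind)
    ]


# A multi-chip experiment: a stack of Cel matrices that share one Cdf
# (the three dimensional array described in the text).
experiment = [cel, [[130.0, 90.0, 310.0], [85.0, 55.0, 200.0]]]
```

note that getting the PM values for a gene means scanning the whole
grid with the Cdf in hand, which is the cost of staying close to the
raw chip layout.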
approach 2: keeps the pm data in a matrix with rows representing
probes and columns representing chips. similarly for mm. to know what
row goes with what gene we keep a vector with the gene names. to know
what gene is in row, say, 10 we simply look at the 10th entry in the
name vector. similarly we have vectors with the probe numbers, x
positions, and y positions.
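approach 2 could be sketched along these lines. again this is Python
with plain lists rather than R, all values (including the x and y
positions) are invented, and the names gene_of_row and subset_by_gene
are hypothetical helpers, not existing affy functions.

```python
# Hypothetical toy data: 4 probes (rows) by 2 chips (columns).
pm = [
    [120.0, 130.0],
    [300.0, 310.0],
    [95.0, 90.0],
    [210.0, 205.0],
]
mm = [
    [80.0, 85.0],
    [210.0, 200.0],
    [60.0, 55.0],
    [150.0, 140.0],
]

# Parallel vectors: entry i describes row i of both pm and mm.
gene_names = ["geneA", "geneA", "geneB", "geneB"]
probe_numbers = [1, 2, 1, 2]
x_pos = [0, 2, 1, 3]  # made-up physical positions
y_pos = [0, 0, 0, 0]


def gene_of_row(i):
    """The gene for row i is simply the i-th entry of the name vector."""
    return gene_names[i]


def subset_by_gene(gene):
    """Select the rows of one gene from every matrix and vector at once."""
    rows = [i for i, g in enumerate(gene_names) if g == gene]
    return {
        "pm": [pm[i] for i in rows],
        "mm": [mm[i] for i in rows],
        "gene_names": [gene_names[i] for i in rows],
        "probe_numbers": [probe_numbers[i] for i in rows],
        "x_pos": [x_pos[i] for i in rows],
        "y_pos": [y_pos[i] for i in rows],
    }
```

subsetting is a plain row selection here, but the position vectors have
to travel with the data if we ever want to map a probe back to the chip.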
an advantage of approach 1 is that we don't need to keep the x,y
(physical position on the chip) information. a disadvantage is that
subsetting by genes and creating "fake" instances can be confusing
because we need to control 2 classes (Cel, Cdf).
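to illustrate the two-class problem, here is one possible way subsetting
by gene might look under approach 1. it is a Python sketch on a made-up
2 x 2 chip; blanking unwanted positions with None is an assumption chosen
for illustration, and subset_by_gene is a hypothetical name.

```python
# Hypothetical toy chip as plain nested lists.
cel = [[120.0, 95.0], [80.0, 60.0]]
cdf = [["geneA", "geneB"], ["geneA", "geneB"]]


def subset_by_gene(cel, cdf, gene):
    """Keep one gene by blanking every other position in BOTH matrices."""
    keep = [[name == gene for name in row] for row in cdf]
    new_cel = [
        [value if k else None for value, k in zip(value_row, keep_row)]
        for value_row, keep_row in zip(cel, keep)
    ]
    new_cdf = [
        [name if k else None for name, k in zip(name_row, keep_row)]
        for name_row, keep_row in zip(cdf, keep)
    ]
    return new_cel, new_cdf


sub_cel, sub_cdf = subset_by_gene(cel, cdf, "geneA")
```

the subset still has the full chip shape, and the Cel and Cdf pieces
must always be modified together or they stop agreeing.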
an advantage of approach 2 is that the pms and mms are readily
available and subsetting by genes is easy. as a consequence creating
"fake" instances is easy. a disadvantage is that we need extra slots to
keep the physical position information and that we are a bit farther
away from the raw data.
at first i was leaning toward approach 1 because it's closer to the
raw data... now i'm a bit worried about difficulties with subsetting by
genes, and how it affects "genes for hire".
any opinions? suggestions?
rafael
_______________________________________________
Biocore mailing list
Biocore@stat.math.ethz.ch
http://www.stat.math.ethz.ch/mailman/listinfo/biocore