Heidi Dvinge
Dear all,
I'm currently looking at some Mouse Gene 1.0 ST arrays, and have used
the makecdfenv package to build a cdf environment based on the file
MoGene-1_0-st-v1.r3.cdf from the Affymetrix website.
That worked without any problems, but out of curiosity I tried taking
a closer look at the format of the array, to see how many probes were
in each probe set etc.
I'm aware that some probes map to multiple probe sets and are removed
when the cdfenv is produced, which seems to be the case for about 8%
of the probes. My question is how exactly this happens. I would
expect the multiple-mapping probes to be removed from all probe sets,
but this doesn't seem to be the case.
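As a rough illustration of how the share of multi-mapping probes could be estimated from the tab-delimited cdf table, here is a minimal sketch. It assumes a data frame with the same two columns used below (QUAL for the probe set ID, INDEX for the probe identifier); the toy values and the names cdf_toy, sets_per_probe, multi and frac_multi are made up for the example and are not taken from the real MoGene file.

```r
# Toy stand-in for the tab-delimited cdf table (hypothetical values).
# QUAL is the probe set ID, INDEX the unique probe identifier.
cdf_toy <- data.frame(
  QUAL  = c("A", "A", "A", "B", "B", "C"),
  INDEX = c(1, 2, 3, 2, 3, 4)
)

# Number of distinct probe sets each probe is assigned to
sets_per_probe <- tapply(cdf_toy$QUAL, cdf_toy$INDEX,
                         function(q) length(unique(q)))

# Probes that occur in more than one probe set
multi <- names(sets_per_probe)[sets_per_probe > 1]

# Fraction of distinct probes that map to multiple probe sets
frac_multi <- length(multi) / length(sets_per_probe)
```

Run against the real table instead of cdf_toy, frac_multi would be the figure to compare with the roughly 8% mentioned above.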
Here is an example with the two overlapping probe sets 10344719 and
10353008, where "raw" is my AffyBatch, "cdf" is the raw cdf file
converted to tab-delimited text and read into R, and "INDEX" is a
unique probe identifier (equal to the index in the cdf env minus 1):
> cdf[cdf$QUAL=="10344719","INDEX"]
[1] 7543 661828 575792 962890 963940 140756 337977 510591 860722 968182 387524 386474
[13] 385518 384468 1076441 1075391 850724 51881 957657 100610 862535 506651 505601 82272
[25] 83322 692860 691810 494417 932343 689216 836826 894914 715393 421443 92496 485600
[37] 253868 352083 594288 1049892 370822 369772 416675 928371 505790 506840 135781
> cdf[cdf$QUAL=="10353008","INDEX"]
[1] 506840 505790 928371 416675 369772 370822 1049892 485600 92496 421443 715393 894914
[13] 1073586 110809 836826 689216 932343 494417 691810 83322 82272 505601 506651 862535
[25] 100610 957657 51881 850724 1075391 1076441 384468 385518 386474 387524 968182 860722
[37] 510591 337977 140756 963940 962890 575792 661828 7543
> indexProbes(raw, genenames="10344719")
$`10344719`
[1] 692861 253869 352084 594289 135782
> indexProbes(raw, genenames="10353008")
$`10353008`
[1] 506841 505791 928372 416676 369773 370823 1049893 485601 92497 421444 715394 894915
[13] 1073587 110810 836827 689217 932344 494418 691811 83323 82273 505602 506652 862536
[25] 100611 957658 51882 850725 1075392 1076442 384469 385519 386475 387525 968183 860723
[37] 510592 337978 140757 963941 962891 575793 661829 7544
So 10344719 and 10353008 have 47 and 44 probes respectively, 42 of
which are shared. In the cdf environment, 10344719 appears to have
had the 42 shared probes removed (only its 5 unique probes remain),
but all 44 probes, including the shared ones, are still present in
10353008.
A similar situation is seen for e.g. the overlapping probe sets
10461391 and 10487930 with 41 probes each, 40 of which are identical:
> cdf[cdf$QUAL=="10461391","INDEX"]
[1] 483268 1022846 409057 703153 328783 372162 882399 569942 765746 868615 948367 413614
[13] 830931 434763 970910 600221 599171 135798 6746 455659 799186 912319 469313 145393
[25] 872191 126758 801051 774196 773146 965810 272742 19445 585800 999188 1012776 823868
[37] 156514 210874 645037 799505 1075142
> cdf[cdf$QUAL=="10487930","INDEX"]
[1] 1075142 799505 645037 210874 156514 823868 1012776 999188 585800 19445 272742 965810
[13] 773146 774196 801051 126758 872191 145393 469313 912319 799186 839098 6746 135798
[25] 599171 600221 970910 434763 830931 413614 948367 868615 765746 569942 882399 372162
[37] 328783 703153 409057 1022846 483268
> indexProbes(raw, genenames="10461391")
$`10461391`
[1] 455660
> indexProbes(raw, genenames="10487930")
$`10487930`
[1] 1075143 799506 645038 210875 156515 823869 1012777 999189 585801 19446 272743 965811
[13] 773147 774197 801052 126759 872192 145394 469314 912320 799187 839099 6747 135799
[25] 599172 600222 970911 434764 830932 413615 948368 868616 765747 569943 882400 372163
[37] 328784 703154 409058 1022847 483269
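The overlap counts quoted for this pair can be checked directly from the INDEX values. The following is only a sketch of that check using base R set operations; the vectors are copied from the output above, and the variable names (ps_10461391, ps_10487930, shared, only_first, only_second, env_10461391) are chosen for the example.

```r
# INDEX values from the cdf table for the two probe sets (copied from above)
ps_10461391 <- c(483268, 1022846, 409057, 703153, 328783, 372162, 882399,
                 569942, 765746, 868615, 948367, 413614, 830931, 434763,
                 970910, 600221, 599171, 135798, 6746, 455659, 799186,
                 912319, 469313, 145393, 872191, 126758, 801051, 774196,
                 773146, 965810, 272742, 19445, 585800, 999188, 1012776,
                 823868, 156514, 210874, 645037, 799505, 1075142)
ps_10487930 <- c(1075142, 799505, 645037, 210874, 156514, 823868, 1012776,
                 999188, 585800, 19445, 272742, 965810, 773146, 774196,
                 801051, 126758, 872191, 145393, 469313, 912319, 799186,
                 839098, 6746, 135798, 599171, 600221, 970910, 434763,
                 830931, 413614, 948367, 868615, 765746, 569942, 882399,
                 372162, 328783, 703153, 409057, 1022846, 483268)

# Probes shared by both probe sets, and probes unique to each
shared      <- intersect(ps_10461391, ps_10487930)
only_first  <- setdiff(ps_10461391, ps_10487930)
only_second <- setdiff(ps_10487930, ps_10461391)

length(shared)  # 40
only_first      # 455659
only_second     # 839098

# indexProbes() result for 10461391 (copied from above).
# The env index is INDEX + 1, so subtracting 1 should give the unique probe.
env_10461391 <- 455660
env_10461391 - 1 == only_first  # TRUE
```

This matches the pattern described: the only probe left for 10461391 in the cdf environment is the one it does not share with 10487930.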
Any comments on this or on exactly how the cdf environment is created
would be much appreciated.
Thanks
\Heidi
> sessionInfo()
R version 2.7.0 Under development (unstable) (2008-02-12 r44439)
i386-apple-darwin8.10.1
locale:
en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] tools stats graphics grDevices utils datasets methods base
other attached packages:
[1] makecdfenv_1.17.0 affy_1.17.3 preprocessCore_1.1.5 affyio_1.7.17
[5] Biobase_1.99.4
------------<<>>------------
Heidi Dvinge
EMBL-European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge
CB10 1SD
Mail: heidi@ebi.ac.uk
Phone: +44 (0) 1223 494 444
------------<<>>------------