Pd info package affy 10K array


Michael Gormley ▴ 60

@michael-gormley-2866

Last seen 6.8 years ago

United States

I get an error when running the makePdInfoPackage function to make a PdInfo package for the 10K mapping array. The output from the function reads:

    > makePdInfoPackage(pkg,destDir=".")
    Creating package in ./pd.mapping10k.xba142
    loadUnitsByBatch took 22.86 sec
    loadAffyCsv took 2.79 sec
    Error in sqliteExecStatement(con, statement, bind.data) :
      RS-DBI driver: (RS_SQLite_exec: could not execute: PRIMARY KEY must be unique)
    In addition: Warning messages:
    1: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
    2: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
    3: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
    Timing stopped at: 0.36 0.01 0.44
    > traceback()
    12: .Call("RS_SQLite_exec", conId, statement, bind.data, PACKAGE = .SQLitePkgName)
    11: sqliteExecStatement(con, statement, bind.data)
    10: sqliteQuickSQL(conn, statement, bind.data, ...)
    9: dbGetPreparedQuery(db, sql, bind.data = mmdf)
    8: dbGetPreparedQuery(db, sql, bind.data = mmdf)
    7: loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size)
    6: eval(expr, envir, enclos)
    5: eval(expr, envir = loc.frame)
    4: ST(loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size))
    3: buildPdInfoDb(object@cdfFile, object@csvAnnoFile, object@csvSeqFile,
           dbFilePath, seqMatFile, batch_size = batch_size, verbose = !quiet)
    2: makePdInfoPackage(pkg, destDir = ".")
    1: makePdInfoPackage(pkg, destDir = ".")

I noticed a prior post that suggested that this may be due to entering a record into a table with a Feature ID that is already in the table. Is this the case? Is there a work-around here?

Thanks,
Mike Gormley

• 1.5k views

ADD COMMENT • link updated 16.8 years ago by James W. MacDonald 68k • written 16.8 years ago by Michael Gormley ▴ 60


James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 days ago

United States

Hi Michael,

I have spent some time looking at this, and it appears that the problem is due to inconsistencies between the cdf and probe sequence files. As far as I can tell there are many probe locations ((x, y) coordinates) in the cdf that don't exist in the probe sequence file, and vice versa.

The function loadAffySeqCsv() reads in a chunk of data from the probe sequence file, then matches the indices (computed from the (x, y) coordinates) of these data with the indices that were generated using the cdf data. In the first chunk of 1000 probesets, there are only 8223 probes that match between the two data sources. I don't think this would normally be a problem, except for the fact that 1000 probesets from the sequence file should *exactly* line up with what we got from the cdf.

But the real problem that arises is this:

The computation of indices is based on the dimensions of the chip. If we query the cdf to find what the dimensions are we get this:

    readCdfHeader(cdfFile)
    $ncols
    [1] 658

    $nrows
    [1] 658

So we compute the indices thus:

    index <- x + 1 + y * ncols

This will give unique indices for all (x, y) coordinates on the chip, assuming we agree that the dimensions of the chip are 658 x 658. However, the sequence file doesn't agree:

    pmdf[pmdf$fid == 9264,]
             fset.name   x  y offset                       seq tstrand type tallele  fid
    7077 SNP_A-1507675 709 13      0 TGCCCTGAATGTTTCAGCACATCTA       r   PM       T 9264

The above is one line from the first 1000 probesets. Note that the (x, y) coordinates are (709, 13)! When we calculate the index (fid) we get 9264. Unfortunately, if we use (51, 14) we also get 9264. Because the sequence file isn't playing by the rules, we end up with a total of 25 duplicate indices. Since the index values are the primary key for the table we are trying to populate, we get an error because you can't have duplicated primary keys.

So long story short, the sequence file for this chip is broken - the apparent maximum (x, y) coordinate is (710, 707), which is well beyond what the cdf claims. Or maybe the cdf is broken - I don't really know. The end result is that this will never work until Affy comes up with some consistent information for the chip.

Best,
Jim
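The collision described above can be reproduced with a few lines of base R. This is only a minimal sketch: the helper name `xy2index` and the two-row data frame are made up for illustration, and only the formula, the 658-column width, and the two coordinate pairs come from the discussion.

```r
# Linear feature index as given in the thread: index <- x + 1 + y * ncols.
# xy2index is a hypothetical helper name, not a pdInfoBuilder function.
xy2index <- function(x, y, ncols) {
  x + 1 + y * ncols
}

ncols <- 658  # width reported by the CDF header for this chip

# The out-of-range coordinate from the sequence file and an on-chip one.
probes <- data.frame(x = c(709, 51), y = c(13, 14))
probes$fid <- xy2index(probes$x, probes$y, ncols)
probes$in_range <- probes$x < ncols

# Rows whose index is shared with at least one other row.
dup <- duplicated(probes$fid) | duplicated(probes$fid, fromLast = TRUE)
print(probes[dup, ])  # both rows end up with fid 9264
```

Any x at or beyond `ncols` spills into the following row of the index space, which is why an oversized coordinate can land on an index that a legitimate probe already owns.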

ADD COMMENT • link 16.8 years ago James W. MacDonald 68k


Note that there are two different Affymetrix 10K chip types, namely Mapping10K_Xba131 (aka 'Mapping 10K Array') and Mapping10K_Xba142 (aka 'Mapping 10K Array 2.0'). The probe sequence file you refer to seems to be for the former, which is a larger chip. Details on the official Affymetrix CDFs (converted to binary though):

    > library(aroma.affymetrix)
    > cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba142")
    > cdf
    AffymetrixCdfFile:
    Path: annotationData/chipTypes/Mapping10K_Xba142
    Filename: Mapping10K_Xba142.cdf
    Filesize: 9.53MB
    Chip type: Mapping10K_Xba142
    RAM: 0.00MB
    File format: v4 (binary; XDA)
    Dimension: 658x658
    Number of cells: 432964
    Number of units: 10208
    Cells per unit: 42.41
    Number of QC units: 9

    > cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba131")
    > cdf
    AffymetrixCdfFile:
    Path: annotationData/chipTypes/Mapping10K_Xba131
    Filename: Mapping10K_Xba131.cdf
    Filesize: 10.79MB
    Chip type: Mapping10K_Xba131
    RAM: 0.00MB
    File format: v4 (binary; XDA)
    Dimension: 712x712
    Number of cells: 506944
    Number of units: 11564
    Cells per unit: 43.84
    Number of QC units: 9

FYI: I try to collect information about various Affymetrix chip types at:

http://groups.google.com/group/aroma-affymetrix/web/documentation-on-chip-types

Final comment: I would like to emphasize the difference between 'chip type' and 'CDF'; a chip type refers to a unique product coming out of Affymetrix, whereas a CDF refers to an annotation of a chip type. There can be many different CDFs for each chip type, but only one chip type per CDF.

Cheers

Henrik
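Given the two chip dimensions above, a quick range check on the probe coordinates would show whether a sequence file fits a given CDF before any package build is attempted. The sketch below uses base R only; `outOfRange` is a hypothetical helper, the coordinates are assumed to be zero-based, and apart from the SNP_A-1507675 row quoted earlier in the thread the rows are invented toy values.

```r
# Hypothetical helper: TRUE where a probe lies outside an ncols x nrows chip.
# Assumes zero-based coordinates, i.e. valid x is 0..(ncols - 1).
outOfRange <- function(x, y, ncols, nrows) {
  x < 0 | x >= ncols | y < 0 | y >= nrows
}

# Toy stand-in for a few rows of a probe sequence file.
# Only the first row is taken from the thread; the others are invented.
seqdf <- data.frame(
  fset.name = c("SNP_A-1507675", "toy_probe_1", "toy_probe_2"),
  x = c(709, 51, 657),
  y = c(13, 14, 657),
  stringsAsFactors = FALSE
)

# Check against the Mapping10K_Xba142 dimensions (658 x 658).
bad142 <- outOfRange(seqdf$x, seqdf$y, ncols = 658, nrows = 658)
print(seqdf[bad142, ])

# The same rows all fit the larger Mapping10K_Xba131 chip (712 x 712).
bad131 <- outOfRange(seqdf$x, seqdf$y, ncols = 712, nrows = 712)
print(any(bad131))
```

In practice the dimensions would come from the CDF header (for example the `ncols` and `nrows` values shown by readCdfHeader() earlier in this thread) rather than being typed in by hand.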

ADD REPLY • link 16.8 years ago Henrik Bengtsson ★ 2.4k


Interesting.

To test the problems Michael was having, I simply went to Affy's product support page and downloaded the library file, annotation file, and sequence file. So it appears they have things mixed up on that page, and there isn't anything obvious about the sequence file that would inform anybody it is wrong:

    > dir(pattern = "^Mapping")
    [1] "Mapping10K_probe_tab"             "Mapping10K_Xba142.CDF"
    [3] "Mapping10K_Xba142.na25.annot.csv"

Best,
Jim

ADD REPLY • link 16.8 years ago James W. MacDonald 68k


This is the same source where I obtained the files originally. I have brought this issue to the attention of Affy technical support, and I am hoping they can get me the correct probe sequence file.

ADD REPLY • link 16.8 years ago Michael Gormley ▴ 60


Hi,

I can confirm that the probe sequence file for Mapping10K_Xba142 [http://www.affymetrix.com/Auth/analysis/downloads/data/Mapping10Kv2_probe_tab.zip] linked to at the 'Mapping 10K 2.0 Array - Support Materials' page [http://www.affymetrix.com/support/technical/byproduct.affx?product=10k-20] does indeed look like it is for Mapping10K_Xba131, e.g. the available X and Y positions are in [1,710] and [1,707], which is clearly outside the 658x658 dimension of the Mapping10K_Xba142 chip type.

Did you post this in the Affymetrix Forum (https://www.affymetrix.com/community/forums/index.jspa) or directly to support? Is there a thread where I can post a follow-up?

-Henrik
>>>>> 9: dbGetPreparedQuery(db, sql, bind.data = mmdf) >>>>> 8: dbGetPreparedQuery(db, sql, bind.data = mmdf) >>>>> 7: loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size) >>>>> 6: eval(expr, envir, enclos) >>>>> 5: eval(expr, envir = loc.frame) >>>>> 4: ST(loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size)) >>>>> 3: buildPdInfoDb(object at cdfFile, object at csvAnnoFile, object at csvSeqFile, >>>>> dbFilePath, seqMatFile, batch_size = batch_size, verbose = !quiet) >>>>> 2: makePdInfoPackage(pkg, destDir = ".") >>>>> 1: makePdInfoPackage(pkg, destDir = ".") >>>>> >>>>> I noticed a prior post that suggested that this may be due to entering >>>>> a >>>>> record into a table with a Feature ID that is already in the table. Is >>>>> this >>>>> the case? Is there a work-around here? >>>>> >>>>> Thanks, >>>>> Mike Gormley >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> -- >>>> James W. MacDonald, M.S. >>>> Biostatistician >>>> Affymetrix and cDNA Microarray Core >>>> University of Michigan Cancer Center >>>> 1500 E. Medical Center Drive >>>> 7410 CCGC >>>> Ann Arbor MI 48109 >>>> 734-647-5623 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> >> -- >> James W. MacDonald, M.S. >> Biostatistician >> Affymetrix and cDNA Microarray Core >> University of Michigan Cancer Center >> 1500 E. Medical Center Drive >> 7410 CCGC >> Ann Arbor MI 48109 >> 734-647-5623 > >
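
For anyone hitting the same error, the index collision Jim describes in the quoted message can be reproduced in a few lines of base R. This is only a sketch under stated assumptions: `feature_index`, `probes` and the other variable names are made up for illustration, the coordinates are assumed to be zero-based (as the `x + 1` in the quoted formula suggests), and the range check at the end is a hypothetical pre-flight test, not something makePdInfoPackage is known to run.

```r
# Feature index as quoted in the thread: index <- x + 1 + y * ncols
# (assumption: x and y are zero-based probe coordinates).
feature_index <- function(x, y, ncols) x + 1 + y * ncols

# Chip widths reported in the thread for the two 10K chip types.
ncols_xba142 <- 658
ncols_xba131 <- 712

# Two coordinates from the thread: (709, 13) from the sequence file and
# (51, 14), which is a valid position on a 658-wide chip.
probes <- data.frame(x = c(709, 51), y = c(13, 14))

idx_142 <- feature_index(probes$x, probes$y, ncols_xba142)
idx_131 <- feature_index(probes$x, probes$y, ncols_xba131)

# Do the two probes map to the same index under each chip width?
collide_142 <- anyDuplicated(idx_142) > 0
collide_131 <- anyDuplicated(idx_131) > 0

# Hypothetical pre-flight check: flag coordinates that fall outside the
# 658 x 658 grid the CDF claims.
out_of_range <- probes$x >= ncols_xba142 | probes$y >= ncols_xba142

print(idx_142)
print(c(collide_142 = collide_142, collide_131 = collide_131))
print(out_of_range)
```

With a width of 658 both coordinates give 9264, which is the duplicated primary key; with 712 they map to different indices, which fits the sequence file having been made for the larger Mapping10K_Xba131 chip. Running a range check like `out_of_range` over the whole sequence file before building the package would show the mismatch directly, without waiting for the SQLite error.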

ADD REPLY • link 16.8 years ago Henrik Bengtsson ★ 2.4k
