Oligo: cannot fit robust Probe Level linear Models / fitPLM when using a custom CDF
2
0
Entering edit mode
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Last seen 3 days ago
Wageningen University, Wageningen, the …

Hello,

I would like to process a set of mouse gene st v1.1 arrays in Oligo with the function fitProbeLevelModel(), in analogy of using the function fitPLM (from the library affyPLM). However, this won't work when i would like to summarize the probes into probesets using a 'custom CDF'. Any pointers on how to solve this would be appreciated (or is it a bug?). Please note that it works as expected when using the 'official' PdInfo package.

 

Thanks,

Guido

 

> #Preliminaries
> source("http://bioconductor.org/biocLite.R")
Bioconductor version 3.2 (BiocInstaller 1.20.1), ?biocLite for help
> biocLite("pd.mogene.1.1.st.v1")
> install.packages("http://mbni.org/customcdf/20.0.0/entrezg.download/pd.mogene11st.mm.entrezg_20.0.0.tar.gz")
>
> library(oligo)
> celFiles <- list.celfiles(full.names = TRUE)
> affy.data <- read.celfiles(celFiles, pkgname = "pd.mogene11st.mm.entrezg") #using remapped, up-to-date chip design/annotation.
Loading required package: pd.mogene11st.mm.entrezg
Loading required package: RSQLite
Loading required package: DBI
Platform design info loaded.
Reading in : ./G168_A5_1_WT_COLON_LFD_2.CEL
<<snip>>
>
> x.norm.plm <- fitProbeLevelModel(affy.data)
Error in sqliteSendQuery(con, statement, bind.data) :
  error in statement: no such table: corepm
>
> affy.data
GenericFeatureSet (storageMode: lockedEnvironment)
assayData: 1178100 features, 25 samples
  element names: exprs
protocolData
  rowNames: G168_A5_1_WT_COLON_LFD_2.CEL (25 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: G168_A5_1_WT_COLON_LFD_2.CEL (25 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.mogene11st.mm.entrezg
>
>
>
>#Works fine when using 'official' PdInfo package...
> affy.data <- read.celfiles(celFiles)
Loading required package: pd.mogene.1.1.st.v1
Platform design info loaded.
Reading in : ./G168_A5_1_WT_COLON_LFD_2.CEL
<<snip>>
> x.norm.plm <- fitProbeLevelModel(affy.data)
Background correcting... OK
Normalizing... OK
Summarizing... OK
Extracting...
  Estimates... OK
  StdErrors... OK
  Weights..... OK
  Residuals... OK
  Scale....... OK
>
>
> sessionInfo()
R version 3.2.2 Patched (2015-11-03 r69595)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] pd.mogene.1.1.st.v1_3.14.1     pd.mogene11st.mm.entrezg_0.0.1
 [3] RSQLite_1.0.0                  DBI_0.3.1                     
 [5] oligo_1.34.2                   Biostrings_2.38.4             
 [7] XVector_0.10.0                 IRanges_2.4.8                 
 [9] S4Vectors_0.8.11               Biobase_2.30.0                
[11] oligoClasses_1.32.0            BiocGenerics_0.16.1           
[13] BiocInstaller_1.20.1          

loaded via a namespace (and not attached):
 [1] affxparser_1.42.0          splines_3.2.2             
 [3] GenomicRanges_1.22.4       zlibbioc_1.16.0           
 [5] bit_1.1-12                 foreach_1.4.3             
 [7] GenomeInfoDb_1.6.3         tools_3.2.2               
 [9] SummarizedExperiment_1.0.2 ff_2.2-13                 
[11] iterators_1.0.8            preprocessCore_1.32.0     
[13] affyio_1.40.0              codetools_0.2-14          
>

 

 

oligo pdinfobuilder pd.mogene.1.1.st customcdf • 1.9k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

It's a bug. There is some code that builds a SQL query and I think the logic is based on an earlier iteration of what the various table names would be. I don't know how long it will take to get a fix, but it's easy enough for you to make changes on the fly to get it to work for now. Just do this:

> library(oligo)
> debug(getProbeInfo)
> x.norm.plm <- fitProbeLevelModel(affy.data)

Then step through the debugger until you see this:

Browse[3]>
debug: if (class(object) == "GenericFeatureSet") {
    probeTypeTable <- paste0(probeType, "feature")
    metaTable <- paste0(target, probeType)
    featureTable <- paste0("featureSet", substr(target, 4, 4))
    fields <- c(sprintf("%s.fid as fid", metaTable), sprintf("%s.man_fsetid as man_fsetid",
        featureTable), sprintf("%s.fsetid as fsetid", metaTable))
    fields <- paste(fields, collapse = ", ")
    sql <- paste("SELECT", fields, "FROM", metaTable, "INNER JOIN",
        featureTable, "USING(fsetid)")
    rm(fields, probeTypeTable, metaTable, featureTable)
} else {
    fields <- unique(c("fid", "man_fsetid", field))
    fields[fields == "fsetid"] <- paste(probeTable, "fsetid",
        sep = ".")
    if (isST | class(object) == "TilingFeatureSet")
        fields[fields == "man_fsetid"] <- paste(probeTable, "fsetid as man_fsetid",
            sep = ".")
    fields <- paste(fields, collapse = ", ")
    sql <- paste("SELECT", fields, "FROM", probeTable, "INNER JOIN featureSet",
        "USING(fsetid)")
    rm(fields)
}

Now you have to pay attention. Keep stepping through until you see this line:

Browse[3]>
debug: fields <- c(sprintf("%s.fid as fid", metaTable), sprintf("%s.man_fsetid as man_fsetid",
    featureTable), sprintf("%s.fsetid as fsetid", metaTable))

Before you hit Enter the next time, you need to do this:

Browse[3]> metaTable <- "mps1pm"
Browse[3]> featureTable <- "featureSet1"

To correct the metaTable and featureTable objects. Now if you hit Enter two more times, you should be able to inspect the SQL query:

Browse[3]>
debug: sql <- paste("SELECT", fields, "FROM", metaTable, "INNER JOIN",
    featureTable, "USING(fsetid)")
Browse[3]>
debug: rm(fields, probeTypeTable, metaTable, featureTable)
Browse[3]> sql
[1] "SELECT mps1pm.fid as fid, featureSet1.man_fsetid as man_fsetid, mps1pm.fsetid as fsetid FROM mps1pm INNER JOIN featureSet1 USING(fsetid)"

And if your SQL query looks like that, just hit 'c' to get the debugger to continue.

ADD COMMENT
0
Entering edit mode

Thanks James, working fine now!

Outcome:

> x.norm.plm
Probe Level Model
Method..............: plm
# Samples...........: 25
# Probes (processed): 601827
# Probesets.........: 20778
Annotation..........: pd.mogene11st.mm.entrezg
>
ADD REPLY
0
Entering edit mode

Dear James/Benilton,

Using the current oligo version I experienced the same problem as described before (though for a set of mogene 2.0 ST arrays), but I was able to get it running applying the 'fix' listed above by James. Nevertheless, with the upcoming BioC fall release in mind, I think it would be nice if this could be fixed permanently. Would you mind having a look at this?

In addition, how to best 'extract' the normalized data from an oligoPLM object? I am asking because the accessor exprs() doesn't work on such objects.

> celFiles <- list.celfiles(full.names = TRUE, listGzipped=TRUE)
> celFiles
[1] "./GSM2028011_Ctrl1.CEL.gz" "./GSM2028012_Ctrl2.CEL.gz"
[3] "./GSM2028013_Ctrl3.CEL.gz" "./GSM2028014_LPS1.CEL.gz"
[5] "./GSM2028015_LPS2.CEL.gz"  "./GSM2028016_LPS3.CEL.gz"
> affy.data <- read.celfiles(celFiles, pkgname = "pd.mogene20st.mm.entrezg")
Loading required package: pd.mogene20st.mm.entrezg
Loading required package: RSQLite
Loading required package: DBI
Platform design info loaded.
Reading in : ./GSM2028011_Ctrl1.CEL.gz
Reading in : ./GSM2028012_Ctrl2.CEL.gz
Reading in : ./GSM2028013_Ctrl3.CEL.gz
Reading in : ./GSM2028014_LPS1.CEL.gz
Reading in : ./GSM2028015_LPS2.CEL.gz
Reading in : ./GSM2028016_LPS3.CEL.gz
> x.norm.plm <- fitProbeLevelModel(affy.data)
Error in sqliteSendQuery(con, statement, bind.data) :
  error in statement: no such table: corepm

 

After applying the work-around mentioned above:

 

> x.norm.plm
Probe Level Model
Method..............: plm
# Samples...........: 6
# Probes (processed): 481590
# Probesets.........: 23748
Annotation..........: pd.mogene20st.mm.entrezg
> head(exprs(x.norm.plm))
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘exprs’ for signature ‘"oligoPLM"’

> sessionInfo()
R version 3.3.1 Patched (2016-06-28 r70853)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                         
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods 
[9] base     

other attached packages:
[1] pd.mogene20st.mm.entrezg_0.0.1 RSQLite_1.0.0                
[3] DBI_0.5-1                      oligo_1.36.1                 
[5] Biostrings_2.40.2              XVector_0.12.1               
[7] IRanges_2.6.1                  S4Vectors_0.10.3             
[9] Biobase_2.32.0                 oligoClasses_1.34.0          
[11] BiocGenerics_0.18.0            limma_3.28.21                
[13] BiocInstaller_1.22.3          

loaded via a namespace (and not attached):
[1] affxparser_1.44.0          splines_3.3.1            
[3] GenomicRanges_1.24.3       zlibbioc_1.18.0          
[5] bit_1.1-12                 foreach_1.4.3            
[7] GenomeInfoDb_1.8.7         tools_3.3.1              
[9] SummarizedExperiment_1.2.3 ff_2.2-13                
[11] iterators_1.0.8            preprocessCore_1.34.0    
[13] affyio_1.42.0              codetools_0.2-14         
>
ADD REPLY
0
Entering edit mode

 

<<EDIT>>

Never mind my 2nd question; coef() should be used to extract the normalized expression data...

> head(coef(x.norm.plm))
             GSM2028011_Ctrl1.CEL.gz GSM2028012_Ctrl2.CEL.gz
100009600_at                5.452908                5.345068
100009609_at                3.636248                3.638303
100009614_at                4.856107                4.789826
100009664_at                5.250991                5.607274
100012_at                   4.973587                5.113751
100017_at                  10.579906               10.675544
             GSM2028013_Ctrl3.CEL.gz GSM2028014_LPS1.CEL.gz GSM2028015_LPS2.CEL.gz
100009600_at                5.503566               5.790891               5.534350
100009609_at                3.587186               3.620297               3.680230
100009614_at                4.950158               5.170772               4.920435
100009664_at                5.308606               5.459623               5.421679
100012_at                   5.057979               4.429790               4.850578
100017_at                  10.613880              10.099777              10.004951
             GSM2028016_LPS3.CEL.gz
100009600_at               5.573990
100009609_at               3.477622
100009614_at               5.178364
100009664_at               5.356755
100012_at                  4.981260
100017_at                  9.315269
>

 

ADD REPLY
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

Hey Guido,

My bad - it's not a bug. It's a feature. Normally there are a set of 'meta' tables in a pdInfoPackage that map probes to probesets, and they are named things like 'core' 'full', 'extended' etc. But in the case of a GenericPDInfoPackage like what you get from MBNI, they are called 'mps1' -  'mpsX' and it's up to you to know what you want. Luckily for you, the MBNI summarization is just to a single level, so there is just an mps1 table. So here is how you do things:

> dat <- read.celfiles(list.celfiles(), pkgname = "pd.mogene20st.mm.entrezg")
Platform design info loaded.

> eset <- fitProbeLevelModel(dat, target = "mps1")
Background correcting... OK
Normalizing... OK
Summarizing... OK
Extracting...
  Estimates... OK
  StdErrors... OK
  Weights..... OK
  Residuals... OK
  Scale....... OK
 
ADD COMMENT
0
Entering edit mode

Hi James,

Sorry for my delayed reply, but your suggestion is indeed working nicely! Just for my info: how exactly to extract the name(s) of the relevant meta tables? I tried the following, but I am not sure this is the way to do it because "mps1" is not explicitly returned, but it is rather named "featureSet1".

> pd.mogene20st.mm.entrezg@tableInfo
          tbl row_count
1 featureSet1     23748
2      mps1pm    481590
3   pmfeature    481590
>
ADD REPLY
1
Entering edit mode

Well, that's the trick, isn't it?

The oligo and pdInfoBuilder packages have evolved over the years to accommodate various arrays that were not considered at the outset, and as often happens in these situations, the original table naming conventions didn't really apply to the new arrays being accommodated.

The original names used for tables that map probesets (or PSRs in Affy lingo) to other summarization levels were a takeoff on the Affy file names. So the core_mps, full_mps, etc, tables were named after the .mps files that Affy uses to make those mappings. But the MBNI arrays don't have that convention, and more to the point, the arbitrary future array designs that Benilton wanted to accommodate may or may not either. So it didn't make sense to force the existing table names on these new sets of tables.

Instead, Benilton went with a scheme that is independent of the input file names, but that regrettably requires the end user to know more about what they are doing. So the new table names are just pairs or triplets of the form

featureSetN
mpsNpm
mpsNmm

where N is any integer, starting at 1, and  you only have an mpdNmm table if the array has MM probes. So you can now feed pdInfoBuilder a table that does any number of arbitrary mappings of probes to probesets, and those mappings will be encapsulated in a set of N tables. But you have to know what, e.g., featureSet1/mps1p/mps1mm is doing, because there are no longer any hints. And if you have a set of featureSet2/mps2pm/mps2mm tables, you have to know a priori what those tables are doing as well.

So we gained the ability to use MBNI re-mapped probesets, and pretty much any other random changes that Affy might spring on us, but at the cost of complexity.

Anyway, long story short, you use "mps1" because that is the 'basename' of the tables that map probes to probesets. If there were a featureSet2/mps2pm set of tables, you could use "mps2" to summarize at the 'mps2' level.

ADD REPLY

Login before adding your answer.

Traffic: 576 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6