Question

Inconsistent results when normalizing HTA2.0 arrays at "level=core"

0

Entering edit mode

Guido Hooiveld ★ 4.1k

@guido-hooiveld-2020

Last seen 27 days ago

Wageningen University, Wageningen, the …

I noticed an intriguing discrepancy when I normalized a set of HTA2.0 arrays (GSE111156, here). Specifically, a different number of probesets is returned when using the rma or fitProbeLevelModel function in oligo, even if in both cases the argument target=core is used. I initially noticed this when I tried to annotate the normalized data at the transcript level, which returned only NAs when processed through fitProbeLevelModel, but when rma was used this was not the case. Then I noticed that the probesetIDs that are returned are not identical (though both time a 'core analysis' was performed). Any idea what causes this discrepancy?

> library(oligo)
> library(affycoretools)
> library(hta20transcriptcluster.db)
>
> path <- "/mnt/files/GSE111156"
> affy.data <- read.celfiles(filenames = list.celfiles(path,  full.names=TRUE, listGzipped=TRUE)[1:5] )
Reading in : /mnt/files/GSE111156/GSM3024184_Case_1_N.CEL.gz
Reading in : /mnt/files/GSE111156/GSM3024185_Case_1_A.CEL.gz
Reading in : /mnt/files/GSE111156/GSM3024186_Case_1_T.CEL.gz
<<snip>>

> normalized.data.plm <- fitProbeLevelModel(affy.data, target = "core")
Background correcting... OK
Normalizing... OK
Summarizing... OK
Extracting...
  Estimates... OK
  StdErrors... OK
  Weights..... OK
  Residuals... OK
  Scale....... OK
> 
> normalized.data.plm
Probe Level Model
Method..............: plm
# Samples...........: 5
# Probes (processed): 7576209
# Probesets.........: 925032
Annotation..........: pd.hta.2.0
> 
> # Note: 925032 probesets
>
> #annotate
> normalized.data.plm <- opset2eset(normalized.data.plm) #1st convert PLMset to eSet!
> normalized.data.plm <- annotateEset(normalized.data.plm,  hta20transcriptcluster.db)
> 
> #check
> normalized.data.plm
ExpressionSet (storageMode: lockedEnvironment)
assayData: 925032 features, 5 samples 
  element names: exprs, se.exprs 
protocolData
  sampleNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz
    ... GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  sampleNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz
    ... GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData
  featureNames: 2824539_st 2824540_st ... TCONS_l2_00005064_st (925032
    total)
  fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: pd.hta.2.0 
>
> # NO probeset could be annotated...??
> sumis.na(fData(normalized.data.plm)$ENTREZID))
[1] 925032
> 
> featureNames(normalized.data.plm)[1:5]
[1] "2824539_st" "2824540_st" "2824541_st" "2824542_st" "2824543_st"
>
> 
>
> # now process same input using rma function
> normalized.data.rma <- rma(affy.data, target = "core")
Background correcting
Normalizing
Calculating Expression
> normalized.data.rma
ExpressionSet (storageMode: lockedEnvironment)
assayData: 70523 features, 5 samples 
  element names: exprs 
protocolData
  rowNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz ...
    GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz ...
    GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hta.2.0 
> 
> # Note: 70523 probesets
>
> Annotate
> normalized.data.rma <- annotateEset(normalized.data.rma,  hta20transcriptcluster.db)
>
> # 44% of the probesets could now be annotated...!?
> sumis.na(fData(normalized.data.rma)$ENTREZID))
[1] 39696
>
> featureNames(normalized.data.rma)[1:5]
[1] "2824546_st" "2824549_st" "2824551_st" "2824554_st" "2827992_st"

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Thirty)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] pd.hta.2.0_3.12.2               DBI_1.0.0                      
 [3] RSQLite_2.1.1                   hta20transcriptcluster.db_8.7.0
 [5] org.Hs.eg.db_3.8.2              AnnotationDbi_1.46.0           
 [7] affycoretools_1.56.0            oligo_1.48.0                   
 [9] Biostrings_2.52.0               XVector_0.24.0                 
[11] IRanges_2.18.1                  S4Vectors_0.22.0               
[13] Biobase_2.44.0                  oligoClasses_1.46.0            
[15] BiocGenerics_0.30.0

oligo HTA2.0 affycoretools RMA fitProbeLevelModel • 1.8k views

ADD COMMENT • link updated 5.8 years ago by James W. MacDonald 68k • written 5.8 years ago by Guido Hooiveld ★ 4.1k

score 1 · Accepted Answer · 2019-07-03

I don't think fitProbeLevelModel has been updated to correctly process the HTA arrays. If you run RMA, the mapping of probes to probesets is done by the internal function getFidMetaProbesetCore, which does

> oligo:::getFidMetaProbesetCore
function (object, sortBy = "fsetid") 
{
    conn <- db(object)
    on.exit(dbDisconnect(conn))
    if (!is.null(sortBy)) 
        sortBy <- match.arg(sortBy, c("fid", "fsetid"))
    if (!is(object, "HTAFeatureSet")) {
        sql <- "SELECT fid, meta_fsetid as fsetid FROM pmfeature INNER JOIN core_mps USING(fsetid)"
    }
    else {
        sql <- "SELECT fid, core_mps.transcript_cluster_id as fsetid FROM pmfeature INNER JOIN core_mps USING(fsetid)"
    }
    featureInfo <- dbGetQuery(conn, sql)
    if (!is.null(sortBy)) {
        featureInfo <- featureInfo[order(featureInfo[[sortBy]]), 
            ]
        rownames(featureInfo) <- NULL
    }
    return(featureInfo)
}

The operating principle here being the SQL for HTA arrays being different from the other ST type arrays. If you use fitProbeLevelModel, it calls getProbeInfo, which if you follow the if/else tree, ends up doing the following,

        fields <- unique(c("fid", "man_fsetid", field))
        fields[fields == "fsetid"] <- paste(probeTable, "fsetid", 
            sep = ".")
        if (isST | class(object) == "TilingFeatureSet") 
            fields[fields == "man_fsetid"] <- paste(probeTable, 
                "fsetid as man_fsetid", sep = ".")
        fields <- paste(fields, collapse = ", ")
        sql <- paste("SELECT", fields, "FROM", probeTable, "INNER JOIN featureSet", 
            "USING(fsetid)")

Resulting in

Browse[2]> sql
[1] "SELECT fid, man_fsetid, pmfeature.fsetid FROM pmfeature INNER JOIN featureSet USING(fsetid)"

Which is (obviously) a different SQL query, and will not correctly map the probes to probesets. I can put this on my TODO list, but it's not going to be a high priority, given that rma does work correctly, and so far as I can tell, Affy/Fisher have never really got people to use these arrays anyway.