## Load package
> library(pd.mirna.4.0)
## create a connection to the underlying SQLite DB
> con <- db(pd.mirna.4.0)
## Check what probeset types there are
> z <- dbGetQuery(con, "select * from featureSet;")
> table(z$type)
    1     7    12    13    14
36205    95    27    17     9
## and figure out what the type IDs are
> dbGetQuery(con, "select * from type_dict;")
   type                    type_id
1     1                       main
2     2            main->junctions
3     3                 main->psrs
4     4               main->rescue
5     5              control->affx
6     6              control->chip
7     7  control->bgp->antigenomic
8     8      control->bgp->genomic
9     9             normgene->exon
10   10           normgene->intron
11   11   rescue->FLmRNA->unmapped
12   12   control->affx->bac_spike
13   13             oligo_spike_in
14   14            r1_bac_spike_at
15   15 control->affx->polya_spike
16   16        control->affx->ercc
17   17  control->affx->ercc->step
## So we want 12, 13, and 14, which are all spike-in probes (maybe not 14? you decide).
> spikes <- dbGetQuery(con, "select fid, fsetid, type from featureSet inner join pmfeature using(fsetid) where type in ('12','13','14');")
> head(spikes)
     fid   fsetid type
1 150727 20538329   14
2 150186 20538329   14
3 149107 20538329   14
4 149648 20538329   14
5 149103 20538329   14
6 149644 20538329   14
The fid in this table corresponds to the pmIndex for the underlying probes. Normally one would do the normalization first, at the probe level, and then summarize (this is what RMA does). So if you want to do something clever, you can first background correct and then normalize the data using a loess fit at the probe level, and then do the summarization using rma, specifying that you don't want to do any background correction or normalization.
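In oligo, that final summarization step would look something like this (a sketch; normFS is a hypothetical FeatureSet assumed to already hold background-corrected, loess-normalized probe-level intensities):
## Sketch: summarization only; background correction and normalization
## are assumed to have already been done at the probe level (normFS is
## hypothetical).
> library(oligo)
> eset <- rma(normFS, background = FALSE, normalize = FALSE)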
I suppose you could also background correct and summarize (but not normalize) using rma, and then use the spike-ins for the normalization, but that's probably not the best idea (you probably want to summarize already-normalized data, not the other way around). But it's your project, so you make the call.
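For completeness, a sketch of that summarize-first alternative (hedged assumptions: raw is a hypothetical FeatureSet from read.celfiles; the featureSet table is assumed to carry a man_fsetid column matching the summarized feature names, as in other pd packages; normalize.loess is from affy, and rma output is already on the log2 scale, hence log.it = FALSE):
## background correct and summarize, but skip normalization ...
> eset <- rma(raw, background = TRUE, normalize = FALSE)
## ... then loess-normalize at the probeset level against the spike-ins
> spikesets <- dbGetQuery(con, "select distinct man_fsetid from featureSet where type in ('12','13','14');")
> idx <- which(featureNames(eset) %in% spikesets$man_fsetid)
> exprs(eset) <- affy::normalize.loess(exprs(eset), subset = idx, log.it = FALSE)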
I should also clarify that the pmIndex for the probes is the row number. So, as an example, the first row above has an fid of 150727, which means that probe is in row 150727 of the raw probe-level data.
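For instance (a sketch; raw is again a hypothetical ExpressionFeatureSet from read.celfiles):
## intensities for the probe with fid 150727, across all arrays
> exprs(raw)[150727, ]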
You're awesome. Thanks for the clarifications.
One additional question that I presume you can help me with: how would you suggest doing the background correction without summarization? I tried using the bg.correct method in affy, but both the mas and rma options produce either an error or a segfault. I think it has to do with the design of the CDF I generated, as the Dilution dataset provided by the affydata package works as expected:
> library(makecdfenv)
> library(affxparser)
> convertCdf("miRNA-4_0-st-v1.cdf", "mirna40cdf", version=4, verbose=TRUE)
> pkgpath <- tempdir()
> make.cdf.package("mirna40cdf", version = packageDescription("makecdfenv", fields = "Version"),
+                   species = "", unlink = TRUE, compress = FALSE, package.path = pkgpath)
> ## install the package from the DOS shell: R CMD INSTALL "path to package"
> library(mirna40cdf)
> affys <- ReadAffy()
> bg.correct(affys, method="mas")
Error in matrix(.C("affy_background_adjust_R", as.double(as.vector(allintensities)), :
NA/NaN/Inf in foreign function call (arg 1)
> bg.correct(affys, method="rma")
*** caught segfault ***
address 0x7f6984b30000, cause 'invalid permissions'
You can't use the affy/makecdfenv pipeline for these arrays. Instead you use oligo, which relies on the pd.mirna.4.0 package we have already talked about. You can use the backgroundCorrect and normalize functions in oligo to do what you want. The tricky part is getting normalize to use the probes you care about. You want to do something like the sketch below, and it's tricky because you have to dig through the code to know that the subset argument exists for the (hidden) loessNorm function that you end up using.
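Something along these lines (a sketch: that the loess method is selected via method = "loess", and that subset is passed through to loessNorm, are assumptions based on this discussion):
## background correct, then loess-normalize at the probe level,
## fitting the curves only to the spike-in probes found above
> library(oligo)
> raw <- read.celfiles(list.celfiles())
> bgc <- backgroundCorrect(raw, method = "rma")
> nrm <- normalize(bgc, method = "loess", subset = unique(spikes$fid))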
It doesn't appear that subset is implemented yet:
So my thought on getting around this would be to extract the exprs from the background-corrected data, normalize with the affy library, reconstruct an ExpressionFeatureSet object with the normalized data, and then perform the summarization using oligo's rma function with the background and normalize flags set to FALSE. Is this reasonable? I'm not sure, because I don't know off the top of my head whether the original FeatureSet object can be modified in place to swap in the normalized data, or whether I have to break the object down into its respective components (e.g., phenotype, annotation, and assay data) and then generate a new object from the supplied pieces. Any suggestions?
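A sketch of that plan, reusing raw and bgc from the sketch above (hedged: normalize.loess from affy returns values on the original intensity scale by default, so they can be swapped straight back in; the in-place swap assumes an exprs<- replacement method exists for the class, with Biobase::assayDataElementReplace as a fallback):
## normalize outside oligo, swap the values back in, then summarize
> library(affy)
> mat <- normalize.loess(exprs(bgc), subset = unique(spikes$fid))
> exprs(bgc) <- mat  ## in place; no need to rebuild the object
## (fallback: bgc <- Biobase::assayDataElementReplace(bgc, "exprs", mat))
> eset <- rma(bgc, background = FALSE, normalize = FALSE)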
edit:
I checked the loessNorm function you are referring to, and you are correct: subset is being passed down to it and incorporated into the normalization. So the error I was getting is just caused by a flag in some parent functions, and the normalization is actually performed as intended, with the loess curve fit only to the specified spike-in probes. Great.
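For reference, a quick way to inspect the unexported function without knowing its namespace:
## show the definition of the hidden loessNorm function
> getAnywhere("loessNorm")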
Appreciate all of the help.
Marty
Never mind, I see it in the documentation.