## Load package
> library(pd.mirna.4.0)
## create a connection to the underlying SQLite DB
> con <- db(pd.mirna.4.0)
## Check what probeset types there are
> z <- dbGetQuery(con, "select * from featureSet;")
> table(z$type)
    1     7    12    13    14
36205    95    27    17     9
## and figure out what the type IDs are
> dbGetQuery(con, "select * from type_dict;")
   type                    type_id
1     1                       main
2     2            main->junctions
3     3                 main->psrs
4     4               main->rescue
5     5              control->affx
6     6              control->chip
7     7  control->bgp->antigenomic
8     8      control->bgp->genomic
9     9             normgene->exon
10   10           normgene->intron
11   11   rescue->FLmRNA->unmapped
12   12   control->affx->bac_spike
13   13             oligo_spike_in
14   14            r1_bac_spike_at
15   15 control->affx->polya_spike
16   16        control->affx->ercc
17   17  control->affx->ercc->step
## So we want 12, 13, and 14, which are all spike-in probes (maybe not 14? you decide).
> spikes <- dbGetQuery(con, "select fid, fsetid, type from featureSet inner join pmfeature using(fsetid) where type in ('12','13','14');")
> head(spikes)
     fid   fsetid type
1 150727 20538329   14
2 150186 20538329   14
3 149107 20538329   14
4 149648 20538329   14
5 149103 20538329   14
6 149644 20538329   14
The fid in this table corresponds to the pmIndex for the underlying probes. Normally one would do the normalization first, at the probe level, and then summarize (this is what RMA does). So if you want to do something clever, you can first background correct and then normalize the data using a loess fit at the probe level, and then do the summarization using rma, specifying that you don't want to do any background correction or normalization.
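In oligo, that final summarization step would look something like this (a sketch; normFS is a hypothetical FeatureSet assumed to already hold background-corrected, loess-normalized probe-level intensities):
## Sketch: summarization only; background correction and normalization
## are assumed to have already been done at the probe level (normFS is
## hypothetical).
> library(oligo)
> eset <- rma(normFS, background = FALSE, normalize = FALSE)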
I suppose you could also background correct and summarize (but not normalize) using rma, and then use the spike-ins for the normalization, but that's probably not the best idea (you probably want to summarize already-normalized data, not the other way around). But it's your project, so you make the call.
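For completeness, a sketch of that summarize-first alternative (hedged assumptions: raw is a hypothetical FeatureSet from read.celfiles; the featureSet table is assumed to carry a man_fsetid column matching the summarized feature names, as in other pd packages; normalize.loess is from affy, and rma output is already on the log2 scale, hence log.it = FALSE):
## background correct and summarize, but skip normalization ...
> eset <- rma(raw, background = TRUE, normalize = FALSE)
## ... then loess-normalize at the probeset level against the spike-ins
> spikesets <- dbGetQuery(con, "select distinct man_fsetid from featureSet where type in ('12','13','14');")
> idx <- which(featureNames(eset) %in% spikesets$man_fsetid)
> exprs(eset) <- affy::normalize.loess(exprs(eset), subset = idx, log.it = FALSE)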
I should also clarify that the pmIndex for the probes is the row number. So, as an example, the first row above has an fid of 150727, which means that probe is in row 150727 of the raw probe-level data.
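For instance (a sketch; raw is again a hypothetical ExpressionFeatureSet from read.celfiles):
## intensities for the probe with fid 150727, across all arrays
> exprs(raw)[150727, ]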
You're awesome. Thanks for the clarifications.
One additional question that I presume you can help me with: how would you suggest doing the background correction without summarization? I tried using the bg.correct method in affy, but both the mas and rma options produce either an error or a segfault. I think it has to do with the design of the CDF I generated, as the Dilution dataset provided by the affydata package works as expected:
> library(makecdfenv)
> library(affxparser)
> convertCdf("miRNA-4_0-st-v1.cdf", "mirna40cdf", version=4, verbose=TRUE)
> pkgpath <- tempdir()
> make.cdf.package("mirna40cdf", version = packageDescription("makecdfenv", fields = "Version"),
+                   species = "", unlink = TRUE, compress = FALSE, package.path = pkgpath)
> ## install the package from the DOS shell: R CMD INSTALL "path to package"
> library(mirna40cdf)
> affys <- ReadAffy()
> bg.correct(affys, method="mas")
Error in matrix(.C("affy_background_adjust_R", as.double(as.vector(allintensities)), :
NA/NaN/Inf in foreign function call (arg 1)
> bg.correct(affys, method="rma")
*** caught segfault ***
address 0x7f6984b30000, cause 'invalid permissions'
You can't use the affy/makecdfenv pipeline for these arrays. Instead you use oligo, which relies on the pd.mirna.4.0 package we have already talked about. You can use the backgroundCorrect and normalize functions in oligo to do what you want. The tricky part is getting normalize to use the probes you care about. You want to do something like the sketch below, and it's tricky because you have to dig through the code to know that the subset argument exists for the (hidden) loessNorm function that you end up using.
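Something along these lines (a sketch: that the loess method is selected via method = "loess", and that subset is passed through to loessNorm, are assumptions based on this discussion):
## background correct, then loess-normalize at the probe level,
## fitting the curves only to the spike-in probes found above
> library(oligo)
> raw <- read.celfiles(list.celfiles())
> bgc <- backgroundCorrect(raw, method = "rma")
> nrm <- normalize(bgc, method = "loess", subset = unique(spikes$fid))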
It doesn't appear that subset is implemented yet:
So my thought on getting around this would be to extract the exprs from the background-corrected data, normalize with the affy library, reconstruct an ExpressionFeatureSet object with the normalized data, and then perform the summarization using oligo's rma function with the background and normalize flags set to FALSE. Is this reasonable? I'm not sure, because I don't know off the top of my head whether the original FeatureSet object can be modified in place to swap in the normalized data, or whether I have to break the object down into its respective components (e.g., phenotype, annotation, and assay data) and then generate a new object from the supplied pieces. Any suggestions?
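A sketch of that plan, reusing raw and bgc from the sketch above (hedged: normalize.loess from affy returns values on the original intensity scale by default, so they can be swapped straight back in; the in-place swap assumes an exprs<- replacement method exists for the class, with Biobase::assayDataElementReplace as a fallback):
## normalize outside oligo, swap the values back in, then summarize
> library(affy)
> mat <- normalize.loess(exprs(bgc), subset = unique(spikes$fid))
> exprs(bgc) <- mat  ## in place; no need to rebuild the object
## (fallback: bgc <- Biobase::assayDataElementReplace(bgc, "exprs", mat))
> eset <- rma(bgc, background = FALSE, normalize = FALSE)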
edit:
I checked the loessNorm function you are referring to, and you are correct: subset is being passed down to it and incorporated into the normalization. So the error I was getting is just caused by a flag in some parent functions, and the normalization is actually performed as intended, with the loess curve fit only to the specified spike-in probes. Great.
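For reference, a quick way to inspect the unexported function without knowing its namespace:
## show the definition of the hidden loessNorm function
> getAnywhere("loessNorm")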
Appreciate all of the help.
Marty
Never mind, I see it in the documentation.