Question

How to use brainarray custom cdf for HTA platform?

0

Entering edit mode

pbachali ▴ 50

@pbachali-9651

Last seen 5.3 years ago

Hi,

I am trying to analyze HTA platform. I am able to use Affy CDF. I am trying to use Brainarray CDF. I could install packages hta20hsentrezgcdf, hta20hsentrezgprobe, hta20hsentrezg.db in R. I am trying to use function read.celfiles but it is throwing me the error.

affyGeneFS <- read.celfiles(geneCELs, pkgname = "hta20hsentrezgcdf")

The error is

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘kind’ for signature ‘"environment"’.

I even tried

cdf="hta_Hs_ENTREZG"

affyGeneFS <- read.celfiles(geneCELs, cdfname=cdf). It threw me the following error.

Error: These do not exist:

"hta_Hs_ENTREZG".

I believe readAffy function uses cdfname argument. I am not sure how to use the brainarray cdf in oligo package.

Thanks,

Prat

hta customcdf • 3.0k views

ADD COMMENT • link updated 8.2 years ago by James W. MacDonald 68k • written 8.2 years ago by pbachali ▴ 50

score 0 · Answer 1 · 2017-02-10

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 days ago

United States

There isn't a pdInfoPackage for oligo, so I think you are stuck with the affy package. In which case you want

abatch <- ReadAffy(cdfname = "hta20hsentrezgcdf")
eset <- rma(abatch)

ADD COMMENT • link 8.2 years ago James W. MacDonald 68k

0

Entering edit mode

I am unable to use affy package for these array type since it is HTA platform. I have used oligo package and used read.celfiles functions and I am successful in mapping the annotations. As part of our pipeline we use custom CDF (Brainarray CDF) too. So, I am trying to use "hta20hsentrezgcdf". As you mentioned this isn't a pdInfoPackage. I did some reserach and I found the following script.

library(pdInfoBuilder)
library(ff)
library(doMC)
registerDoMC(10)

download.file("http://mbni.org/customcdf/21.0.0/ense.download/hta20_Hs_ENSE_21.0.0.zip","tmp.zip")
unzip("tmp.zip")
dir()
z <- cdf2table("hta20_Hs_ENSE.cdf")

I am trying to run this script and it is throwing me the following error in R.

R studio quit unexpectedly.

I am not sure if it memory issue. I have tried the same amazon cloud server too. It is throwing the same error there. I am wondering is there is anyway to use brainarray cdf with oligo package.

Prat

ADD REPLY • link 8.2 years ago pbachali ▴ 50

0

Entering edit mode

Recently I have analyzed a HTA2.0 array data set. Indeed, remapped PdInfo objects are not available for this (and some other) arrays. According to Manhong (maintainer of the Custom CDFs): "It is very slow to use BioC's built-in method to build those pdInfo packages, some of them even give me coredump. This is why some Pd packages are missing".

You therefore limited to use the 'old fashioned' affy package to do this. However, since people are discouraged to actually use affy for these generation of arrays, a warning (for the Gene ST 1.x) or error (for the Gene ST 2.x, HTA, MTA, Clariom, etc) is returned when trying to load these arrays using ReadAffy(). This warning/error has been disabled in the affy library available at the MBNI site here (check for "Modified 'affy_1.52.0' Package", available as source or binary). Only use this package if you know what you are doing!

To analyze the data, first download and install the required files (for EntrezGene-based):

install.packages("http://mbni.org/customcdf/21.0.0/entrezg.download/hta20hsentrezgcdf_21.0.0.tar.gz", type = "source", repos = NULL)
install.packages("http://mbni.org/customcdf/21.0.0/entrezg.download/hta20hsentrezgprobe_21.0.0.tar.gz", type = "source", repos = NULL)
install.packages("http://mbni.org/customcdf/21.0.0/entrezg.download/hta20hsentrezg.db_21.0.0.tar.gz", type = "source", repos = NULL)

and then proceed like James said above (with a small modification: note that the 'cdf'' part is omitted in the name when specifying the cdf [hta20hsentrezg]).

>library(affy) #using modified library!
>abatch <- ReadAffy(cdfname = 'hta20hsentrezg')

> abatch
AffyBatch object
size of arrays=2572x2680 features (34 kb)
cdf=hta20hsentrezg (32531 affyids)
number of samples=47
number of genes=32531
annotation=hta20hsentrezg
notes=
>

etc.

ADD REPLY • link 8.2 years ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

And to add further to what Guido has said, and what you have noted. When you get a core dump from cdf2table (which is what Manhong Dai notes as well), what is happening is that the C code, which is based on Affy's Calvin software is segfaulting! So code that comes directly from the manufacturer of these arrays cannot read in the cdf that Manhong has produced.

I also tried to read in the cdf using the code in affyio, which is not to my knowledge based on Calvin, but is code that Ben Bolstad wrote before Affy released their Calvin software. And this software segfaults as well! As a matter of fact, read.cdffile in the makecdfenv package segfaults when trying to read this cdf, so it's not clear to me how Manhong was able to generate any cdf packages for this array.

It's possible that this is just some little thing in the cdf file that the C code underlying all these functions doesn't like, but then again maybe not. It's not clear why the cdf is causing segfaults, and I don't care enough to try to track this down. But there is some unexplained problem with this cdf, and you are assuming that it is not a big deal when you use the MBNI cdf package.

ADD REPLY • link 8.2 years ago James W. MacDonald 68k

0

Entering edit mode

Hi James,

That's a good observation with some potential serious implications! Thanks for making me/us aware of this.

Just to add my experience: first of all, I also don't know what the exact procedure is for Manhong when he generates these custom CDFs, and how he creates the CDF files themselves. However, my preferred way of creating a CDF file from an environment is using the functionality available in aroma.affymetrix (env2Cdf(); see here). NB: For this you also need a CEL file.
If I do this for the HTA2.0 array, I can make a CDF file, which in turn can be used by cdf2table() without crashing R, and ultimately a corresponding PdInfo package can be created... I realize, though, that I use as 'source' the custom CDF library that *may* have some problem...

> library("aroma.affymetrix")
> env2Cdf("hta20hsentrezgcdf", "ct_1562_1.CEL", overwrite=TRUE)
Reading environment: hta20hsentrezgcdf.
Reading CEL file header.
Creating CDF list for 32531 units.
Adding group names to CDF list.
Adding temporary suffix from file...
 Pathname: ./hta20hsentrezg.cdf
 Suffix: .tmp
 Rename existing file?: FALSE
 Temporary pathname: ./hta20hsentrezg.cdf.tmp
Adding temporary suffix from file...done
Writing CDF file...
  Pathname: ./hta20hsentrezg.cdf.tmp
  Writes CDF header and unit names...
  Writes CDF header and unit names...done
  Writes QC units...
  Writes QC units...done
  Writes 32531 units...
  Writes 32531 units...done
Writing CDF file...done
Dropping temporary suffix from file...
 Temporary pathname: ./hta20hsentrezg.cdf.tmp
 Suffix: .tmp
 Regular expression for suffix: \.tmp$
 Pathname: ./hta20hsentrezg.cdf
 Renaming existing file...
  Result: TRUE
 Renaming existing file...done
Dropping temporary suffix from file...done
>

Now create a PdInfo package from CDF file generated above (recycling code form James available here A: How to use brainarray custom cdf with oligo package?). Remarkable, this code James posted dealt also with the HTA2.0 array, but then worked nicely (difference is that v19 of the custom CDF was then used as source....)

> library(pdInfoBuilder)
> library(ff)
> library(foreach)
> z <- cdf2table("hta20hsentrezg.cdf")
Warning message:
executing %dopar% sequentially: no parallel backend registered
> head(z)
  fs1_man_fsetid   fs1_type fs1_direction    x    y     fid atom probe_type
1   100147813_at expression         sense 1342  450 1158743    0         pm
2   100147813_at expression         sense 1342  450 1158743    0         mm
3   100147813_at expression         sense 2521 2063 5308558    1         pm
4   100147813_at expression         sense 2521 2063 5308558    1         mm
5   100147813_at expression         sense 1112 1794 4615281    2         pm
6   100147813_at expression         sense 1112 1794 4615281    2         mm
> seed <- new("GenericPDInfoPkgSeed", table=z, author = "me", email = "me@mine.org", species = "Homo sapiens", pkgName = "pd.hta20.hs.entrezg")
>
>  makePdInfoPackage(seed)
==================================================================================
Building annotation for Generic Array
==================================================================================
Creating package in ./pd.hta20.hs.entrezg
Inserting 32531 rows into table featureSet1... OK
Inserting 3562835 rows into table mps1mm... OK
Inserting 3562835 rows into table mps1pm... OK
Inserting 3562835 rows into table mmfeature... OK
Inserting 3562835 rows into table pmfeature... OK
Counting rows in featureSet1
Counting rows in mmfeature
Counting rows in mps1mm
Counting rows in mps1pm
Counting rows in pmfeature
[1] TRUE
> install.packages("pd.hta20.hs.entrezg/", type="source", repos = NULL)
* installing *source* package 'pd.hta20.hs.entrezg' ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (pd.hta20.hs.entrezg)
>

ADD REPLY • link 8.2 years ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Thanks, much Guido Hooiveld. It perfectly worked. My primary goal to use the Brianarray CDF is differential gene expression. I have one last question. As part of probe filtering, I use a histogram to plot my probe intensities and then choose a certain threshold where most of the low-intensity probes are located and filter them. But with this array type, I could use only "rma" normalization and my histograms have many platos. I am wondering if there is any stringent way to perform probe filtering for this array type. I am trying to use genefilter function now.

Prat

ADD REPLY • link 8.2 years ago pbachali ▴ 50

0

Entering edit mode

If I filter my data set, which I usually do not, then most of the time I do this based on the IQR (interquartile range).

See below for the signal distribution of my HTA2.0 data set (RMA normalized; RNA was from cell lines).

http://imgur.com/BfcLLIl

ADD REPLY • link 8.2 years ago Guido Hooiveld ★ 4.1k