Hi Lei,
Without exploring further I have to make guesses. However, my best
guess is that you are losing probesets for which the probes completely
overlap between probesets.
Remember that this array is pretty crazy. Mature miRNA transcripts are
only 21-23 bases long, and the Affy probes are all 25-mers, so a
probeset usually consists of something like 9 identical probes. In
addition, an miRNA for one species is often identical to the miRNA from
another, so you could hypothetically have two probesets that are
supposed to measure miRNA from two different species, but the two
probesets would just be made up of the same 9 identical probes!
So here is an example. If I use the affxparser package to read in the
cdf you gave me, I have the 36k probesets you are expecting. If I then
use makecdfenv to create a cdf package, I only have the 25k you are
seeing.
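A rough sketch of that comparison, in case it is useful. It assumes the affxparser and makecdfenv packages are installed; `compare_probeset_counts` is a made-up helper name for illustration, not part of either package, and the file name and directory in the example call are the ones from the original post.

```r
# Hypothetical helper (not part of affxparser or makecdfenv): count the
# probesets in the raw CDF and in the environment makecdfenv builds from it.
compare_probeset_counts <- function(cdf_file, cdf_dir = ".") {
  cdf_dir <- path.expand(cdf_dir)
  cdf_path <- file.path(cdf_dir, cdf_file)

  # Probeset (unit) names exactly as stored in the CDF file
  cdf_names <- affxparser::readCdfUnitNames(cdf_path)

  # The same CDF turned into an environment keyed by probeset name
  cdf_env <- makecdfenv::make.cdf.env(cdf_file, cdf.path = cdf_dir)
  env_names <- ls(cdf_env)

  list(
    n_cdf   = length(cdf_names),             # probesets in the CDF
    n_env   = length(env_names),             # probesets in the environment
    dropped = setdiff(cdf_names, env_names)  # names that did not survive
  )
}

# Example call (paths taken from the original post):
# res <- compare_probeset_counts(
#   "miRNA-4_0-st-v1.cdf",
#   cdf_dir = "~/Documents/Project/mirna/miRNA-4_0-st-v1_CDF"
# )
# res$n_cdf; res$n_env; head(res$dropped)
```

The `dropped` element lists the probeset IDs that are in the CDF but not in the environment, which is a convenient starting point for looking at individual cases.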
Now if we look at let-7-5p in miRBase, we have this page:
http://www.mirbase.org/cgi-bin/query.pl?terms=let-7-5p&submit=Search
You can see that there are 7 mature miRNAs, from seven different
species. All of these exist on the cdf, but when we run it through
makecdfenv, the only one that survives is MIMAT0030474_st.
If I write a little function to parse out the (x,y) coordinates for all
seven of these probesets, I get this:
$MIMAT0000001_st
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
x 450 143 279 51 381 163 497 16 366
y 57 60 164 276 277 390 390 493 495
$MIMAT0000396_st
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
x 13 480 125 234 434 29 445 174 317
y 62 85 89 186 300 309 382 393 528
$MIMAT0004190_st
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
x 13 480 125 234 434 29 445 174 317
y 62 85 89 186 300 309 382 393 528
$MIMAT0008354_st
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
x 447 206 250 36 400 202 438 352 255
y 85 98 174 280 324 425 435 498 540
$MIMAT0014366_st
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
x 255 50 376 97 531 225 413 125 473
y 61 202 203 312 317 411 428 510 530
$MIMAT0021417_st
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
x 450 143 279 51 381 163 497 16 366
y 57 60 164 276 277 390 390 493 495
$MIMAT0030474_st
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
x 289 468 153 25 302 413 41 172 458
y 103 198 201 205 328 425 434 536 536
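A function of this kind could look roughly like the following. This is a minimal sketch, not necessarily the code that produced the output above: `get_probeset_xy` is a made-up name, it relies on affxparser's `readCdfUnitNames()` and `readCdfUnits()`, and it assumes each probeset has a single group in the CDF.

```r
# Hypothetical helper: return a 2 x n matrix of (x, y) probe coordinates
# for each requested probeset ID, read straight from the CDF file.
get_probeset_xy <- function(cdf_path, ids) {
  # Map probeset IDs to their positions (unit indices) in the CDF
  all_names <- affxparser::readCdfUnitNames(cdf_path)
  idx <- match(ids, all_names)
  if (anyNA(idx)) {
    stop("Not found in CDF: ", paste(ids[is.na(idx)], collapse = ", "))
  }

  # Read only the requested units, including probe coordinates
  units <- affxparser::readCdfUnits(cdf_path, units = idx, readXY = TRUE)

  # Assumption: one group per probeset, so take the first group only
  lapply(units, function(u) {
    g <- u$groups[[1]]
    rbind(x = g$x, y = g$y)
  })
}

# Example call (IDs from the let-7-5p example; cdf_path is whatever
# path points at the CDF file):
# get_probeset_xy(cdf_path, c("MIMAT0000001_st", "MIMAT0021417_st"))
```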
So at least some of these probesets are really just the same thing
twice. Without going deeper, I can't say for sure that this is the only
thing going on, but you now see how crazy this array really is.
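The duplication can be checked directly from the coordinates printed above. The sketch below hard-codes those seven matrices and groups probesets whose sets of (x,y) pairs are identical; `xy_matrix`, `probe_key`, and the other names are illustrative only.

```r
# Coordinates copied from the output above (row x, row y; one column per probe)
xy_matrix <- function(x, y) rbind(x = x, y = y)

xy <- list(
  MIMAT0000001_st = xy_matrix(c(450, 143, 279, 51, 381, 163, 497, 16, 366),
                              c(57, 60, 164, 276, 277, 390, 390, 493, 495)),
  MIMAT0000396_st = xy_matrix(c(13, 480, 125, 234, 434, 29, 445, 174, 317),
                              c(62, 85, 89, 186, 300, 309, 382, 393, 528)),
  MIMAT0004190_st = xy_matrix(c(13, 480, 125, 234, 434, 29, 445, 174, 317),
                              c(62, 85, 89, 186, 300, 309, 382, 393, 528)),
  MIMAT0008354_st = xy_matrix(c(447, 206, 250, 36, 400, 202, 438, 352, 255),
                              c(85, 98, 174, 280, 324, 425, 435, 498, 540)),
  MIMAT0014366_st = xy_matrix(c(255, 50, 376, 97, 531, 225, 413, 125, 473),
                              c(61, 202, 203, 312, 317, 411, 428, 510, 530)),
  MIMAT0021417_st = xy_matrix(c(450, 143, 279, 51, 381, 163, 497, 16, 366),
                              c(57, 60, 164, 276, 277, 390, 390, 493, 495)),
  MIMAT0030474_st = xy_matrix(c(289, 468, 153, 25, 302, 413, 41, 172, 458),
                              c(103, 198, 201, 205, 328, 425, 434, 536, 536))
)

# One string key per probeset, built from its sorted "x,y" pairs, so that
# two probesets with the same probes get the same key regardless of order
probe_key <- function(m) {
  paste(sort(paste(m["x", ], m["y", ], sep = ",")), collapse = ";")
}
keys <- vapply(xy, probe_key, character(1))

# Group probeset IDs by key and keep the groups with more than one member
groups <- split(names(keys), keys)
shared <- unname(Filter(function(g) length(g) > 1, groups))

length(unique(keys))  # number of distinct probe sets among the seven
shared                # probesets that are made of exactly the same probes
```

On these seven probesets this gives five distinct probe sets: MIMAT0000001_st matches MIMAT0021417_st, and MIMAT0000396_st matches MIMAT0004190_st. That by itself does not account for only one of the seven surviving makecdfenv, which fits with the caveat that there may be more going on.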
Best,
Jim
On Wednesday, January 15, 2014 3:53:25 PM, Huang, Lei [BSD] - CRI wrote:
> Thanks a lot Jim! Do you think the problems you found also
> contribute to the missing probesets when building the cdf package with
> makecdfenv?
>
> Best,
>
> Lei
> On Jan 15, 2014, at 2:30 PM, James W. MacDonald <jmacdon at u.washington.edu> wrote:
>
>> Hi Lei,
>>
>> It turns out that there are at least two differences between the
>> miRNA 4.0 array and those that came before it.
>>
>> First, there are now no MM probes at all (for the 3.0, for example,
>> there were 180 MM probes). This is the cause of the error you see when
>> trying to make the pd.mirna.4.0 package. The code expects MM probes
>> and thus tries to put those probes into the 'mmfeature' table of the
>> database, and errors when there are none. This is pretty easy to fix -
>> you can just put a test for MM probes into the code, and if there are
>> no MM probes you just skip that step. Hypothetically I could have
>> patched the code and sent you a pd.mirna.4.0 package that would work
>> (and then sent the patch to Benilton Carvalho).
>>
>> However, there is a bigger problem that will require more effort,
>> and should be handled by Benilton. Prior versions of the miRNA arrays
>> never shared probes between probesets, so the code for building the pd
>> package for the existing miRNA arrays is a modification of the code
>> used to create pd packages for the Exon ST arrays, which also never
>> share probes among probesets.
>>
>> The miRNA 4.0 array is now like the Gene ST arrays, which also
>> share probes between probesets, so the code will have to be modified
>> to account for that fact. This will take more than a couple of simple
>> changes, so you (we) will have to wait for Benilton to fix it.
>>
>> Best,
>>
>> Jim
>>
>> On 1/15/2014 1:15 AM, Lei Huang [guest] wrote:
>>> Dear all,
>>>
>>> I am working on a set of Affymetrix GeneChip miRNA 4.0 microarray
>>> data and would like to perform differential expression analysis using
>>> Bioconductor packages. Since this is a fairly new platform, no CDF or
>>> annotation packages are available in the Bioconductor repository at the
>>> moment. The Affymetrix folks kindly provided me with the miRNA 4.0 CDF
>>> file as well as sample CEL data, so I decided to create a CDF package on
>>> my own using make.cdf.package() from the makecdfenv package. I was able
>>> to make the package and install it without trouble. However, after I read
>>> the raw CEL files and normalized the affybatch with vsnrma()/rma(), I
>>> found that the number of probesets is only 25065, while the original
>>> Affymetrix miRNA 4.0 CDF file contains 36249. I am aware that from
>>> version 4, Affymetrix changed their naming convention for the probeset
>>> IDs, but this shouldn't cause the problem of missing probesets. What did
>>> I do wrong? I would really appreciate it if anyone could give me some
>>> hints or advice on solving this problem.
>>>
>>> -Lei
>>>
>>> --
>>> Lei Huang
>>> Center for Research Informatics
>>> Biological Science Division
>>> University of Chicago
>>> http://cri.uchicago.edu
>>> --
>>>
>>> P.S. The following are the code and output from my R session:
>>>
>>>> setwd("~/Documents/Project/mirna/GeneChip 4-0 Array Sample Data")
>>>> library(affy)
>>>> library(makecdfenv)
>>> Loading required package: affyio
>>>> pkgpath <- tempdir()
>>>> pname <- cleancdfname(whatcdf("20131118_Human-Brain-AM7962-130ng_rep1_(miRNA-4_0).CEL"))
>>>> make.cdf.package("miRNA-4_0-st-v1.cdf", cdf.path="~/Documents/Project/mirna/miRNA-4_0-st-v1_CDF",
>>> + compress=FALSE, species = "", packagename=pname, package.path = pkgpath)
>>> Reading CDF file.
>>> Creating CDF environment
>>> Wait for about 251 dots...........................................
......................................................................
......................................................................
......................................................................
>>> Creating package in /var/folders/rh/rrlg3bcs6kgcj89zm4mgjjxh0000gq/T//RtmpRos3Be/mirna40cdf
>>>
>>> README PLEASE:
>>> A source package has now been produced in
>>> /var/folders/rh/rrlg3bcs6kgcj89zm4mgjjxh0000gq/T//RtmpRos3Be/mirna40cdf.
>>> Before using this package it must be installed via 'R CMD INSTALL'
>>> at a terminal prompt (or DOS command shell).
>>> If you are using Windows, you will need to get set up to install packages.
>>> See the 'R Installation and Administration' manual, specifically
>>> Section 6 'Add-on Packages' as well as 'Appendix E: The Windows Toolset'
>>> for more information.
>>>
>>> Alternatively, you could use make.cdf.env(), which will not require you to install a package.
>>> However, this environment will only persist for the current R session
>>> unless you save() it.
>>>
>>> ## install the cdf package from shell
>>> ## cd to mirna40cdf location
>>> ## R CMD INSTALL mirna40cdf
>>>
>>>> library(limma)
>>>> library(vsn)
>>>> library(mirna40cdf)
>>>>
>>>> affybatch <- ReadAffy(filenames=list.files())
>>>> affybatch@cdfName
>>> [1] "miRNA-4_0"
>>>
>>> ## normalization
>>>> eset.norm <- vsnrma(affybatch)
>>> vsn2: 292681 x 8 matrix (1 stratum).
>>> Please use 'meanSdPlot' to verify the fit.
>>> Calculating Expression
>>>
>>> ## only 25,065 probesets, the original Affymetrix cdf file contains 36,249 probesets
>>>> dim(eset.norm)
>>> Features Samples
>>> 25065 8
>>>
>>>
>>> -- output of sessionInfo():
>>>
>>>> sessionInfo()
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] mirna40cdf_1.38.0 AnnotationDbi_1.24.0 vsn_3.30.0
>>> [4] limma_3.18.9 makecdfenv_1.38.0 affyio_1.30.0
>>> [7] affy_1.40.0 Biobase_2.22.0 BiocGenerics_0.8.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] BiocInstaller_1.12.0 compiler_3.0.2 DBI_0.2-7
>>> [4] grid_3.0.2 IRanges_1.20.6 lattice_0.20-24
>>> [7] preprocessCore_1.24.0 RSQLite_0.11.4 stats4_3.0.2
>>> [10] tools_3.0.2 zlibbioc_1.8.0
>>>
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>>
>
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099