This is not going to be as short as I wish it would be, but here
goes...
1) That data is NOT in Final Report format. (I did run KIRC 450k through
GenomeStudio to check our work, but our internal pipeline uses
methylumIDAT on raw scanner output.) Older archives like 27k COAD are
structured so that you can work either from level 3 data (masked beta
values) or from level 1 data (M and U intensities, which are fed as
matrices to methylumi or minfi). Each file represents one sample (tumor,
normal, or cell line control), the details of which are included in the
MAGE-TAB directory.
2) Newer archives (primarily 450k data, but also some 27k data, such as
LAML and KIRC) include IDAT files and a mapping from sample name to IDAT
barcode, both in the MAGE-TAB experiment description and in the AUX
directory. I always suggest using those.
3) Older archives (primarily 27k data; I can't think of a single 450k
archive like this) are most easily processed using the level 3 data,
which is to say, beta values that have been masked to NA for SNPs and
for detection p-values > 0.05. Converting beta values to M-values
(log2(M/U), which equals log2(beta / (1 - beta)) when beta = M / (M + U))
is trivial; the only objection I have to using level 3 data is that it
doesn't recapitulate the entire process.
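
As a rough sketch of the masking and beta-to-M-value conversion described
in 3), the base R snippet below starts from toy M and U intensity matrices.
It assumes beta is computed as M / (M + U) with no offset, uses made-up
intensities and detection p-values (not TCGA data), and leaves out SNP
masking. The object names (M, U, pval, beta, mvals) are illustrative only.

```r
# Toy level 1-style inputs for 3 probes x 2 samples. All values are made up.
M <- matrix(c(3000, 500, 1000, 2000, 100, 1500), nrow = 3,
            dimnames = list(paste0("probe", 1:3), c("tumor", "normal")))
U <- matrix(c(1000, 1500, 1000, 2000, 300, 500), nrow = 3,
            dimnames = dimnames(M))
pval <- matrix(c(0.001, 0.2, 0.01, 0.03, 0.0001, 0.04), nrow = 3,
               dimnames = dimnames(M))

# Beta values, assuming no offset in the denominator.
beta <- M / (M + U)

# Mask unreliable measurements to NA, as in the level 3 files
# (detection p-value > 0.05; SNP masking is omitted here).
beta[pval > 0.05] <- NA

# Convert masked betas to M-values; NA entries stay NA.
mvals <- log2(beta / (1 - beta))

# Where beta is not masked, this agrees with log2(M / U).
ok <- !is.na(mvals)
stopifnot(isTRUE(all.equal(mvals[ok], log2(M / U)[ok])))
```

If the beta values were computed with an offset in the denominator, the
beta-based M-values will differ slightly from log2(M/U).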
My preference would have been to use IDAT files right from the beginning,
but I only got involved in packaging last summer, and at that point there
were a number of "data freeze" events that needed to be taken care of. By
the time we put BRCA up (the largest 450k dataset within TCGA), the levels
(IDATs as level 1, M/U/p as level 2, betas as level 3) had solidified to
the current structure. The IDF and SDRF files do represent the
experimental design as faithfully as we are able, given the time
constraints and the MAGE-TAB spec.
One sensible thing to do here is to make sure that an up-to-date vignette,
using one 27k tumor and one 450k tumor for preprocessing, is included in
methylumi for the upcoming release. The primary goal here is for every
step in any TCGA paper to be easily reproducible. Packages that predate my
involvement in packaging did not (in my opinion) make that very easy, so I
lobbied for the format changes.
On Wed, Feb 29, 2012 at 2:48 PM, Ed Siefker <ebs15242@gmail.com> wrote:
> I am trying to read Level 1 methylation data
> from the TCGA into bioconductor. The
> platform is HumanMethylation27, which
> is supported by lumi, right?
>
> Here is my R session:
>
> > library(lumi)
> Loading required package: methylumi
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
> Vignettes contain introductory material. To view, type
> 'browseVignettes()'. To cite Bioconductor, see
> 'citation("Biobase")' and for packages 'citation("pkgname")'.
>
> Loading required package: nleqslv
> KernSmooth 2.23 loaded
> Copyright M. P. Wand 1997-2009
>
> Attaching package: lumi
>
> The following object(s) are masked from package:methylumi:
>
> estimateM, getHistory
>
> Warning message:
> found methods to import for function as.list but not the generic itself
> >
> > fileName <- 'jhu-usc.edu_COAD.HumanMethylation27.1.lvl-1.TCGA-AA-3555-01A-01D-0820-05.txt'
> > example.lumi <- lumiR(fileName)
> Error in gregexpr("\t", dataLine1)[[1]] : subscript out of bounds
> >
>
--
*A model is a lie that helps you see the truth.*
Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>