On Tue, Aug 24, 2010 at 10:17 AM, Alex Levitchi <alex.levitchi@cbm.fvg.it> wrote:
> Dear Sean Davis,
> Since my last letter, I have managed to do almost everything.
> Unfortunately, I do not fully understand the purpose of organizing
> microarray data into GSEs and GDSs, in the sense that GEOquery uses
> different approaches to load and convert them. So, in creating a tool, I
> probably also need to take all these aspects into account and allow
> different loading steps, corresponding to the level of data organization
> (GSM to GPL), converting each into an ExpressionSet.
>
Hi, Alex.
Your understanding is correct. GSE and GDS contain different information
and so are dealt with differently by GEOquery.
> Also, there is another problem: GPLs, GDSs and GSEs can contain tables of
> different sizes (different numbers of probes / rows), which makes the
> analysis less straightforward. I am not sure, but I suppose that, e.g., if
> a GSE consists of GSMs from different platforms, the expression and
> phenotypic data are split into several parts according to the GPL. Thus,
> in the example I've sent
>
Again, I think your understanding is correct.
>
> > gse=getGEO(idata,GSEMatrix=TRUE) # 'idata': the name of the dataset, especially a GSE or user-created table
> >columns=c('title','type','source_name_ch1','platform_id')
>
This is probably about right for 1-color data, but may well not be
directly useful for 2-color data or for sequencing data. Also, this
minimal set of columns may not capture the appropriate information for
every experiment. If all the phenotype data is carried ONLY in
source_name_ch1, then you will be fine, but that will not be the case for
many experimental designs.
> >pdata=pData(gse[[1]])[,columns]
> >expression=exprs(gse[[1]])
> >colnames(expression)=as.vector(pdata[colnames(expression),3])
>
This assumes that the source_name_ch1 column has unique entries. They need
not be unique.
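
To illustrate the point about non-unique labels, here is a minimal sketch in base R. The matrix and data frame are toy stand-ins for exprs(gse[[1]]) and pData(gse[[1]]); the accessions and values are made up. It shows one possible workaround (make.unique), not necessarily the best choice for every design.

```r
# Toy stand-ins for exprs(gse[[1]]) and pData(gse[[1]]); all values are made up.
expression <- matrix(1:6, nrow = 2,
                     dimnames = list(c("probe1", "probe2"),
                                     c("GSM1", "GSM2", "GSM3")))
pdata <- data.frame(source_name_ch1 = c("liver", "liver", "kidney"),
                    row.names = c("GSM1", "GSM2", "GSM3"),
                    stringsAsFactors = FALSE)

# Look up the label for each column by its GSM accession ...
labels <- as.character(pdata[colnames(expression), "source_name_ch1"])

# ... and disambiguate repeated labels before using them as column names.
colnames(expression) <- make.unique(labels, sep = "_")
```

make.unique() leaves the first occurrence unchanged and adds a numeric suffix to later repeats. Keeping the GSM accessions as column names and carrying the labels only in the phenotype table avoids the problem altogether.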
>
> I suppose gse[[1]] represents the information extracted only for the
> first GPL in the 'platform_id' column of the phenodata, and, if there are
> 2 or more GPLs, there should be a 'gse[[2]]' and so on.
> Unfortunately, I did not find any article or manual which describes these
> peculiarities.
>
This is described in the help page for getGEO. getGEO with GSEMatrix=TRUE
returns a list of ExpressionSets.
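
A small sketch of how one might inspect that list. The helper name platforms_in_series is hypothetical, and it assumes the phenotype table carries a platform_id column, as in the code quoted above. It needs Biobase installed, and the usage lines need GEOquery and network access.

```r
# Hypothetical helper: list the platform(s) recorded for each ExpressionSet
# in the list returned by getGEO(..., GSEMatrix = TRUE).
# Assumes pData() has a 'platform_id' column, as in the quoted code.
platforms_in_series <- function(esets) {
  lapply(esets, function(eset) {
    unique(as.character(Biobase::pData(eset)$platform_id))
  })
}

# Usage (not run here; 'idata' is the accession variable from the quoted code):
# gse <- GEOquery::getGEO(idata, GSEMatrix = TRUE)
# length(gse)               # how many ExpressionSets came back
# platforms_in_series(gse)  # platform_id values seen in each one
```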
> Please give me a hint whether I am right and whether I am interpreting
> the microarray data structure correctly in order to prepare the data for
> later analysis.
> The information I always need is:
> 1 - the expression values table, with
> 2 - rows as probe_ids and columns as the name of each sample
> 3 - the GPL name, to use for downloading the corresponding
> Bioconductor annotation package.
>
>
In fact, what you are asking for is an ExpressionSet. getGEO() returns a
list of those directly, so there is no need for any further
post-processing when getting GSEs. For GDS data, you can simply use
GDS2eSet(getGEO("GDSXXXX")) and you will get an ExpressionSet. Both
methods will load the featureData slot with the full GPL data table, so
you can use that for annotation. If you want to use the Bioconductor
annotation packages instead, see the GEOmetadb package, which has mappings
from GPL accessions to Bioconductor data packages.
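
The two routes can be sketched roughly as follows. The wrapper names load_gse and load_gds are hypothetical, and "GDSXXXX" is the placeholder from the paragraph above, not a real accession. The calls need GEOquery and Biobase installed plus network access, so the usage lines are left commented out.

```r
# Hypothetical wrappers around the two routes described above.
load_gse <- function(accession) {
  # Returns a list of ExpressionSets.
  GEOquery::getGEO(accession, GSEMatrix = TRUE)
}

load_gds <- function(accession) {
  # Converts the GDS object into a single ExpressionSet.
  GEOquery::GDS2eSet(GEOquery::getGEO(accession))
}

# Usage (not run here; replace the placeholder with a real accession):
# eset   <- load_gds("GDSXXXX")
# expr   <- Biobase::exprs(eset)   # probes in rows, samples in columns
# pheno  <- Biobase::pData(eset)   # sample-level (phenotype) data
# probes <- Biobase::fData(eset)   # featureData: the GPL data table
```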
Sean
> Kind regards,
> Alex Levitchi
> PhD in Genetics,
> Bioinformatician at Laboratory of Bioinformatics
> CBM, Area Science Park, Trieste, Italy
> http://www.cbm.fvg.it/laboratories/bioinformatics_research
>
> scientific researcher,
> Center of Molecular Biology,
> University of Academy of Sciences of Moldova
> www.edu.asm.md
>
>
> ----- Original Message -----
> From: "Sean Davis" <sdavis2@mail.nih.gov>
> To: "Alex Levitchi" <alex.levitchi@cbm.fvg.it>
> Cc: bioconductor@stat.math.ethz.ch
> Sent: Friday, July 23, 2010 19:53:47 GMT +01:00 Amsterdam, Berlin, Bern, Vienna, Rome, Stockholm
> Subject: Re: [BioC] downloading different kinds of microarray data
>
> Hi, Alex. You are definitely thinking correctly that you want to be
> using ExpressionSets. I would focus your attention on learning to
> construct an ExpressionSet for each case you outline.
>
> Sean
>
> On Jul 23, 2010 10:12 AM, "Alex Levitchi" <alex.levitchi@cbm.fvg.it>
> wrote:
>
> Dear Bioconductors,
> I am working on the development of a tool used to download microarray
> data and then connect it to Bioconductor annotation packages.
> My specific question is about how to manage downloading different kinds
> of microarray data, which can be:
> - GSE
> - several GSMs
> - user data (Excel or tab-delimited file).
> I use the GEOquery package.
> My tool works fine if I am using just a GSE file, which has a good
> structure, and I know how to extract the expression values, platform
> (GPL) and sample names.
>
> > gse=getGEO(idata,GSEMatrix=TRUE)
> >columns=c('title','type','source_name_ch1','platform_id')
> >pdata=pData(gse[[1]])[,columns]
> >expression=exprs(gse[[1]])
> >colnames(expression)=as.vector(pdata[colnames(expression),3])
>
> But I feel confused when I think about how to handle several GSMs or
> user data.
> Applying the getGEO function to a GSM, I then have to use
> Table(gse)$VALUE to extract expression values and Meta(gse)$platform_id
> to find the GPL. I understand how to do this easily when I have just 1
> GSM. How should I manage several GSMs?
> From the start I planned to use something like this:
>
> >gse=do.call("cbind",lapply(list_of_GSMs,function(x) { # list_of_GSMs: vector of GSM accessions
> >getGEO(as.character(x),GSEMatrix=TRUE)
> >}))
> but this way I get just the expression values matrix, and I still don't
> know the GPL and sample names.
>
> Another idea (I have not checked it yet, as I am not sure it is correct)
> is to try to create an ExpressionSet (also for user data, after loading
> them through 'read.table'), but I also don't know how to create a
> phenoData file: simply manually, or is there a way to build it in code?
> Having an ExpressionSet, I suppose I will be able to use the "pData"
> function as in the case of a GSE.
> Doing all this, I would like to be able to download and arrange the data
> in such a way that I can use the rest of the functions which come after
> 'gse=....' in the example presented above.
>
> Please give me some hints on at least one of these points.
>
> Thanks for your nice work.
> Cheers
>
> Alexei Levitchi
> PhD in Genetics,
> Bioinformatician at Laboratory of Bioinformatics
> CBM, Area Science Park, Trieste, Italy
> http://www.cbm.fvg.it/laboratories/bioinformatics_research
>
> scientific researcher,
> Center of Molecular Biology,
> University of Academy of Sciences of Moldova
> www.edu.asm.md
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>