Hi Pau,
On Mon, Sep 15, 2014 at 10:40 AM, Pau Marc MuĂąoz Torres
<paumarc@gmail.com>
wrote:
> Good afternoon to everybody,
>
> I'm just doing my firsts steps working with affymetrix data and i
have
> some questions.
>
> I started to working with CEL files by moving the data contained
in them
> to a csv file. Then I tried to relate affymetrix codes with uniprot
codes.
> I performed it as follow:
> library(affy)
> library("xxxx.db")
> setwd("/home/paumarc/affy/Data/Exp/Cel")
> data <- ReadAffy()
> my_frame <- data.frame(exprs(data))
> Annot <- data.frame(ACCNUM=sapply(contents(xxxACCNUM), paste,
collapse=",
> "), SYMBOL=sapply(contents(xxxSYMBOL), paste, collapse=", "),
> DESC=sapply(contents(xxxGENENAME), paste, collapse=",
> "),DESC=sapply(contents(xxxUNIPROT), paste, collapse=", "))
> all <- merge(Annot, my_frame, by.x=0, by.y=0, all=T)
> write.csv(all, file = "xxx.csv")
>
It is common for people new to Bioconductor to want to write things
out to
disk. You should resist that urge. There is almost no reason to write
data
to disk, and every reason to keep data in the structures designed to
hold
them.
In addition, you have not summarized the data, so there won't be a
one-to-one correspondence between the contents of my_frame, and Annot.
In
other words, for the arrays you are analyzing, there are (mostly) 11
probes
that are intended to interrogate a single transcript (the collection
of
probes for a given transcript is called a probeset). If you were to do
eset <- rma(data)
then the ExpressionSet called 'eset' would contain the data after
summarizing each probeset to give a single expression value per
transcript.
At this point, you could hypothetically extract the data from the
ExpressionSet and write to disk, but there is no profit in doing that.
Presumably the next step is to make some comparisons, and all the
software
in Bioconductor that you can use to make those comparisons (limma,
multtest, siggenes, samr, etc) will happily use an ExpressionSet as
input.
If you want to add annotation to the ExpressionSet, then you can use
the
select method:
annot <- select(xxx.db, keys(xxx.db),
c("REFSEQ","SYMBOL","GENENAME","UNIPROT"))
But note that you will get a warning:
Warning message:
In .generateExtraRows(tab, keys, jointype) :
'select' resulted in 1:many mapping between keys and return rows
which is because there are any number of probesets on these arrays
that can
measure more than one thing. In addition, using either ACCNUM or
REFSEQ
isn't the best idea, as there are any number of genes that have like a
bazillion different transcripts. For example, 1007_s_at measures DDR1
(and
an internal miRNA, miR-4640). If you get all the ACCNUM, you will have
370(!) rows for that gene that you will be concatenating. If instead
you
use ENTREZID, which is the more 'usual' thing to do, you get this:
ENTREZID SYMBOL GENENAME
UNIPROT
1 780 DDR1 discoidin domain receptor tyrosine kinase 1
Q08345
2 780 DDR1 discoidin domain receptor tyrosine kinase 1
Q96T62
3 780 DDR1 discoidin domain receptor tyrosine kinase 1
Q96T61
4 100616237 MIR4640 microRNA 4640
<na>
Which is still a bit crazy, given the replication in UNIPROT, but not
as
extreme as ACCNUM or REFSEQ. You can then concatenate as you did
above:
annotlst <- split(annot[,-1], annot[,1])
annot <- do.call("rbind", lapply(annotlst, function(x) apply(x, 2,
function(y) paste(unique(y), collapse = ","))))
Best,
Jim
> where XXX is one of the follwing database
>
> hgu133a.db
> hgu133b.db
> pd.hg.u133.plus.2
>
This isn't an annotation database (the pd package above), so you don't
want
to be using that for this purpose.
> hgu133plus2.db
>
> unfortuntally the codes cointained at the CEL files and the
database do
> not match well. Some examples are:
>
> 137707 208130 NA NA NA NA 202.5 192 209.3
313.8
> 137708 208130_s_at NM_030984 TBXAS1 thromboxane A
synthase 1
> (platelet) P24557, Q53F23, B4DJG6, Q16843 NA NA NA NA
> 137709 208131 NA NA NA NA 187.5 194.5 167.5
224
> 137710 208131_s_at NM_000961 PTGIS prostaglandin I2
> (prostacyclin) synthase Q16647 NA NA NA NA
>
> I suppose that 208131_s_at should mach with 208131 and
208130_s_at
> with 208130, is that supposition correct? does some body knows to
which
> affymetrics database correspond the identification codes 208130_s_at
and
> 208131_s_at
> Thank you
>
> pau
>
> Pau Marc MuĂąoz Torres
> skype: pau_marc
>
http://www.linkedin.com/in/paumarc
>
http://www.researchgate.net/profile/Pau_Marc_Torres3/info/
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
[[alternative HTML version deleted]]
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor