Dear list,
I have a data frame with three columns. The first column holds probe
set IDs, the second the associated gene symbol, and the third a
p-value statistic:
hgu133a ID   Gene Symbol   Combined p-value
217757_at    A2M           0.787923912
214440_at    NAT1          0.240689023
206797_at    NAT2          0.497092074
202376_at    SERPINA3      3.88E-13
Etc....
I would like to end up with a data frame where each row is a unique
Gene Symbol. When a gene symbol appears in more than one row, I want
to keep the row with the lowest Combined p-value. The above case would
transform into:
hgu133a ID   Gene Symbol   Combined p-value
217757_at    A2M           0.787923912
214440_at    NAT1          0.240689023
202376_at    SERPINA3      3.88E-13
Etc....
Could someone point me to a function which would help me in this
regard? If this is more of an R mailing list post I apologize and
will post there.
Thanks,
> sessionInfo()
R version 2.6.0 (2007-10-03)
i386-pc-mingw32
locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
attached base packages:
[1] splines   tools     stats     graphics  grDevices utils     datasets
[8] methods   base
other attached packages:
[1] lumi_1.4.0 mgcv_1.3-29 affycoretools_1.10.0
[4] annaffy_1.10.0 KEGG_2.0.0 GO_2.0.0
[7] gcrma_2.10.0 matchprobes_1.10.0 biomaRt_1.12.0
[10] RCurl_0.8-1 GOstats_2.4.0 Category_2.4.0
[13] genefilter_1.16.0 survival_2.32 RBGL_1.14.0
[16] annotate_1.16.0 xtable_1.5-1 GO.db_2.0.0
[19] AnnotationDbi_1.0.4 RSQLite_0.6-3 DBI_0.2-3
[22] graph_1.16.1 affy_1.16.0 preprocessCore_1.0.0
[25] affyio_1.6.0 Biobase_1.16.0 limma_2.12.0
loaded via a namespace (and not attached):
[1] cluster_1.11.10 XML_1.93-2.2
Scott A. Ochsner, Ph.D.
NURSA Bioinformatics
Molecular and Cellular Biology
Baylor College of Medicine
Houston, TX. 77030
phone: 713-798-6227
Hi Scott,
Setting aside the question of whether this is the ideal way of
combining multiple probe sets per gene, I do not think you need a
special function for this purpose. Basic R functions will suffice.
Let A be your data.frame, then

# first reorder the rows of your data.frame by ascending p-value
A <- A[order(A$"Combined p-value"),]
# then drop every row whose gene symbol already appeared in an earlier
# row; the first (lowest p-value) row for each symbol is kept
B <- A[!duplicated(A$"Gene Symbol"),]
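
To see the two steps above on concrete data, here is a minimal sketch. It assumes the column names contain spaces exactly as in the post, which requires check.names = FALSE when building the data frame. The first four rows come from the original message; the fifth row (probe ID "made_up_at", p-value 0.9) is invented purely so that one gene symbol occurs twice.

```r
# Toy data: rows 1-4 are taken from the post; row 5 is hypothetical and
# exists only to give NAT1 a second, worse-scoring probe set.
A <- data.frame(
  "hgu133a ID"       = c("217757_at", "214440_at", "206797_at",
                         "202376_at", "made_up_at"),
  "Gene Symbol"      = c("A2M", "NAT1", "NAT2", "SERPINA3", "NAT1"),
  "Combined p-value" = c(0.787923912, 0.240689023, 0.497092074,
                         3.88e-13, 0.9),
  check.names = FALSE,        # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Sort rows by ascending p-value, so the best row per gene comes first.
A <- A[order(A[["Combined p-value"]]), ]

# duplicated() flags every occurrence of a symbol after the first one;
# negating it keeps only the first, i.e. lowest p-value, row per symbol.
B <- A[!duplicated(A[["Gene Symbol"]]), ]

print(B)
```

B then holds one row per gene symbol, ordered by p-value, and the invented NAT1 row is dropped in favour of 214440_at. Note that order() places NA p-values last by default, so a row with a missing p-value is kept only if its gene symbol has no other row.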
Regards,
Joern