How to check if the microarray data is from codelink
@agaz-hussain-wani-7620
India

I want to check whether a dataset is from CodeLink before using `library(codelink)` to generate expression values and p-values. I was checking for the .TXT file extension, but found that raw CodeLink files do not always use .TXT; the extension can also be .txt. Compare, for example, [GSE9490](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9490) vs [GSE9334](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9334).

I also tried searching the file for the keyword "CodeLink Expression Analysis", but some files do not contain it; again compare [GSE9490](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9490) vs [GSE9334](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9334). What is always present in a CodeLink file that can be used to check reliably whether the data belongs to the `codelink` platform?

codelink microarray • 1.6k views
@sean-davis-490
United States

The official (supplied by codelink) GEO Platform (GPL) is:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL2895

The GPL record contains information about associated samples and series. The following will provide the series IDs associated with the codelink platform:

library(GEOquery)
gpl = getGEO("GPL2895")
Meta(gpl)$series_id
 [1] "GSE3578"  "GSE4106"  "GSE4609"  "GSE4812"  "GSE4846"  "GSE5108"  "GSE5216"  "GSE5350"  "GSE6213" 
[10] "GSE6304"  "GSE6585"  "GSE6630"  "GSE6692"  "GSE7330"  "GSE8353"  "GSE8604"  "GSE9332"  "GSE9490" 
[19] "GSE10064" "GSE10123" "GSE10145" "GSE12530" "GSE13857" "GSE14797" "GSE14808" "GSE15829" "GSE16523"
[28] "GSE16717" "GSE16944" "GSE17470" "GSE18124" "GSE18464" "GSE19834" "GSE20167" "GSE22812" "GSE24519"
[37] "GSE24591" "GSE24807" "GSE25431" "GSE26326" "GSE27448" "GSE29002" "GSE29136" "GSE29763" "GSE31075"
[46] "GSE32191" "GSE32403" "GSE32902" "GSE33133" "GSE33651" "GSE35499" "GSE36007" "GSE37186" "GSE37187"
[55] "GSE38542" "GSE40007" "GSE44172" "GSE44187" "GSE44736" "GSE55768" "GSE56739" "GSE60602" "GSE79189"
[64] "GSE80347" "GSE94318"

There are three additional, alternative GPLs (supplied by other submitters) noted on that webpage. GEO adds that information to the GPL record as simple text annotations (not ideal, but the information is there).

Meta(gpl)$relation
[1] "Alternative to: GPL11010"                         
[2] "Alternative to: GPL8060"                          
[3] "Alternative to: GPL18134 ([DISCOVERY PROBE_TYPE])"

Each of these GPL records can be treated the same way to get a complete list of GSEs (or GSMs, if that is the goal).
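As a rough sketch of that idea, the helper below pulls the GPL accessions out of the relation strings and gathers the series IDs for the main platform and its alternatives. The function names `extract_gpl_ids` and `collect_series_ids` are made up for illustration and are not part of GEOquery; the sketch also assumes that every relation entry of interest contains a `GPL` accession, as in the output shown above.

```r
# Hypothetical helper: extract GPL accessions from relation strings
# such as "Alternative to: GPL11010".
extract_gpl_ids <- function(relations) {
  relations <- as.character(relations)
  if (length(relations) == 0) return(character(0))
  unique(unlist(regmatches(relations, gregexpr("GPL[0-9]+", relations))))
}

# Hypothetical helper: series IDs for a GPL plus the GPLs named in its
# relation field. Needs GEOquery installed and network access to GEO.
collect_series_ids <- function(gpl_id) {
  gpl <- GEOquery::getGEO(gpl_id)
  all_ids <- unique(c(gpl_id, extract_gpl_ids(GEOquery::Meta(gpl)$relation)))
  series <- lapply(all_ids, function(id) {
    GEOquery::Meta(GEOquery::getGEO(id))$series_id
  })
  sort(unique(unlist(series)))
}
```

Calling `collect_series_ids("GPL2895")` would then download each platform record in turn, so it can take a while.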

Alternatively, each GSE record has an associated platform, stored in the annotation slot of an ExpressionSet. More concretely:

gse = getGEO('GSE3578')[[1]]
# gse is an ExpressionSet
gse

Note the Annotation below shows "GPL2895".

ExpressionSet (storageMode: lockedEnvironment)
assayData: 54359 features, 156 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM82284 GSM82285 ... GSM128604 (156 total)
  varLabels: title geo_accession ... data_row_count (31 total)
  varMetadata: labelDescription
featureData
  featureNames: 1001 1002 ... 504109 (54359 total)
  fvarLabels: ID LOGICAL_ROW ... GI_LIST (9 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL2895

Returning to the original question, checking to see if a GSE belongs to a specific platform is just this check:

annotation(gse) == 'GPL2895'
[1] TRUE
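A slightly more general version of this check might look like the sketch below. It assumes that the three alternative GPLs listed above should also count as CodeLink, and it allows for a series that spans several platforms (in which case `getGEO()` returns more than one ExpressionSet). The names `codelink_gpls`, `gse_platforms` and `is_codelink_platform` are illustrative only.

```r
# Assumption: GPL2895 plus the three "Alternative to" platforms listed above.
codelink_gpls <- c("GPL2895", "GPL11010", "GPL8060", "GPL18134")

# Hypothetical helper: platform accession of every ExpressionSet in a series.
# Needs GEOquery and Biobase installed and network access to GEO.
gse_platforms <- function(gse_id) {
  esets <- GEOquery::getGEO(gse_id)
  vapply(esets, Biobase::annotation, character(1))
}

# Hypothetical helper: TRUE if any of the platforms is a known CodeLink GPL.
is_codelink_platform <- function(platforms, known = codelink_gpls) {
  any(platforms %in% known)
}
```

For example, `is_codelink_platform(gse_platforms("GSE3578"))` performs the same comparison as above, but against the whole list of GPLs.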

EDIT: On rereading, this is perhaps not a complete answer to the original question, which seems to focus on parsing text files. Indeed, matching files to formats is a challenging problem.

Diego Diez ▴ 760
@diego-diez-4520
Japan

Unfortunately, it is not possible to identify Codelink files by their extension. They usually end in TXT or txt, but that is not very informative because the same extension is commonly used for ordinary text files. A Codelink file must contain a header formatted in a particular way, and it begins with the text "CodeLink Expression Analysis 5.0.0.18008", although the software version may differ. For example, the GEO dataset GSE9490 is in Codelink format but GSE9334 is not.
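A minimal sketch of such a header check in base R is shown below. It only looks for the "CodeLink Expression Analysis" prefix in the first few lines and ignores the version number; it does not validate the rest of the header, so a file that passes may still fail to load with the codelink package. The function name `looks_like_codelink` and the default of five lines are assumptions made for illustration.

```r
# Hypothetical helper: TRUE if one of the first `n` lines starts with the
# CodeLink header text quoted above (any software version).
looks_like_codelink <- function(path, n = 5) {
  # readLines() opens the file in text mode, which also reads gzip files
  header <- readLines(path, n = n, warn = FALSE)
  any(grepl("^[[:space:]]*CodeLink Expression Analysis", header,
            ignore.case = TRUE, useBytes = TRUE))
}
```

It could be applied to every candidate file in a directory, for example with `vapply(list.files("raw_dir", full.names = TRUE), looks_like_codelink, logical(1))`, where `raw_dir` stands for wherever the raw files were unpacked.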

You can use the codelink package to read and preprocess one but not the other. Alternatively you can use the GEOquery package to read them directly from GEO into an ExpressionSet object.

Note that the codelink package will help you to read and preprocess Codelink files, not to "generate a p-value". For that you need some other package for statistical analysis, like the limma package.


EDIT

To clarify my post:

To find out whether a particular dataset in GEO is from the Codelink platform, the approach described by Sean is a very good way to go. Once you have some Codelink datasets, you can either import them into R using getGEO() or download the RAW data (e.g. the GSE9490_RAW link at the bottom of the page) and read it with the codelink package. The advantage of using the codelink package is that you have more control over what you do with the data. The disadvantage is that this is not always possible, because the data uploaded to GEO sometimes does not conform to the Codelink format (even though the extension is TXT or txt). In that case, I feel that using getGEO() is the simplest option.

Regardless, of the two datasets mentioned in the OP, one is in Codelink format and the other is not (so using the codelink package with the latter is not an option).
