Question

Pipeline for analyzing a microarray data set (GSE files from NCBI GEO)

0

adR ▴ 40

@do-it-23093

Germany, München

Hi dear friends,

Would you please give me a few minutes for my question? I am not sure whether I can ask it here. In any case, could you please direct me on how to download and process a microarray data set (GSE) from GEO? Thank you so much!

microarray affy • 4.9k views

ADD COMMENT • link updated 4.8 years ago by Kevin Blighe ★ 4.0k • written 4.8 years ago by adR ▴ 40

score 1 · Answer 1 · 2020-06-25

1

Kevin Blighe ★ 4.0k

@kevin

Republic of Ireland

Hi, for most microarray datasets, there should be a blue Analyze with GEO2R button on the main accession page. If you click on that, and then go to the R script tab, you should see code that allows you to obtain the data, usually via the Series Matrix File.

These data should already be normalised, but GEO cannot guarantee this for every study, so you need to verify from the authors' methods and the other information in the GEO record that the data are normalised. Plotting the data as histograms and box-and-whisker plots can help, too: if the data are microarray and RMA-normalised, the quantile-normalised distribution is easy to identify in a box-and-whisker plot.
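As a rough illustration of what to look for, the sketch below simulates four samples with different distributions and applies a hand-written quantile normalisation in base R. The data are simulated and `quantile_normalise()` is a hypothetical helper written for this example, not part of GEOquery or of RMA itself. After normalisation every sample has the same distribution, so the boxes line up exactly.

```r
# Simulated log2-scale intensities for four samples with different
# locations and spreads (illustrative data only, not from GEO).
set.seed(1)
raw <- sapply(1:4, function(j) rnorm(1000, mean = 6 + j, sd = 1 + 0.2 * j))
colnames(raw) <- paste0("sample", 1:4)

# Hypothetical helper: replace each value by the mean of the values
# that share its rank across samples.
quantile_normalise <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

qn <- quantile_normalise(raw)

# Quartiles differ between samples before normalisation and agree after it.
round(apply(raw, 2, quantile, probs = c(0.25, 0.5, 0.75)), 2)
round(apply(qn, 2, quantile, probs = c(0.25, 0.5, 0.75)), 2)

# Identical boxes across samples are the signature of quantile normalisation.
boxplot(qn, main = "Quantile-normalised (simulated)")
```

Boxes that differ visibly in median or spread on real data suggest that the values have not been quantile-normalised.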

For example, this accession page has the button: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE112811

Essentially, the package to use is GEOquery.
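A minimal sketch of the GEOquery route, using the example accession above. It assumes that GEOquery and Biobase are installed, that an internet connection is available, and that the first platform in the returned list is the one of interest (a series can contain more than one).

```r
library(GEOquery)  # Bioconductor: BiocManager::install("GEOquery")
library(Biobase)   # provides exprs() and pData()

# getGEO() with GSEMatrix = TRUE fetches the Series Matrix File(s) and
# returns a list with one ExpressionSet per platform.
gset <- getGEO("GSE112811", GSEMatrix = TRUE)
length(gset)

# Assumption: the first element is the platform we want.
eset <- gset[[1]]
expr <- exprs(eset)   # probes in rows, samples in columns
dim(expr)

# Sample annotation as deposited by the authors.
head(pData(eset)[, 1:3])

# Quick check of the per-sample distributions.
boxplot(expr, outline = FALSE, las = 2)
```

Whether the values in `expr` are already normalised still has to be checked against the GEO record and the associated publication.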

Kevin

ADD COMMENT • link 4.8 years ago Kevin Blighe ★ 4.0k

0

Thanks, but the microarray dataset I am looking at does not have this "Analyze with GEO2R" button.

ADD REPLY • link 4.8 years ago adR ▴ 40

0

What are the IDs?

ADD REPLY • link 4.8 years ago Kevin Blighe ★ 4.0k

0

Here is the ID and the link: GSE117134 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi

ADD REPLY • link 4.8 years ago adR ▴ 40

1

That's not a microarray data set. It's RNA-Seq.

ADD REPLY • link 4.8 years ago James W. MacDonald 68k

0

Thanks James, and sorry for the misunderstanding! However, how can I get the data into my R environment? Is there a Bioconductor package that helps me download the files? Best, AD

ADD REPLY • link 4.8 years ago adR ▴ 40

3

in principle, as Kevin said, GEOquery should download the processed expression profiles using the function getGEO(). however, i'm afraid that in the case of GSE117134 this is not going to work, because those expression profiles seem to be stored as a supplementary file in GEO. to access them you can do the following:

# download the supplementary file with the processed expression profiles;
# mode="wb" keeps the gzipped file intact on Windows
download.file("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE117134&format=file&file=GSE117134%5Fcasava%5Fgene%5Fexpression%2Etsv%2Egz", "GSE117134_casava_gene_expression.tsv.gz", mode="wb")
# read.csv() reads gzip-compressed files transparently
y <- read.csv("GSE117134_casava_gene_expression.tsv.gz", sep="\t")
dim(y)
## [1] 18079    55
y[1:5, 1:5]
##     Gene WT_ZT00_BR1 WT_ZT00_BR2 WT_ZT00_BR3 WT_ZT04_BR1
## 1   Xkr4     0.01436     0.00000     0.00955     0.00834
## 2  Sox17     0.61960     0.42963     0.38794     0.41122
## 3 Mrpl15    15.12894    17.25888    14.15717    12.10911
## 4 Lypla1    53.79726    76.55921    69.43295    58.77200
## 5  Tcea1     5.47423     5.52237     5.04543     5.20710

because read.csv() outputs a data.frame object you might want to transform it into a matrix as follows:

genesymbols <- y$Gene
# drop the gene symbol column and keep only the numeric expression values
y <- as.matrix(y[, -1])
rownames(y) <- genesymbols
y[1:5, 1:5]
##        WT_ZT00_BR1 WT_ZT00_BR2 WT_ZT00_BR3 WT_ZT04_BR1 WT_ZT04_BR2
## Xkr4       0.01436     0.00000     0.00955     0.00834     0.00000
## Sox17      0.61960     0.42963     0.38794     0.41122     0.47293
## Mrpl15    15.12894    17.25888    14.15717    12.10911    16.77634
## Lypla1    53.79726    76.55921    69.43295    58.77200    82.90862
## Tcea1      5.47423     5.52237     5.04543     5.20710     5.84447

and now you're ready to analyze the gene expression data matrix in y. A starting point for a beginner may be any of the available workflows for this kind of data, such as this one or this other one. Take into account, though, that your starting point is not RNA-seq raw counts but RNA-seq processed expression profiles in some kind of continuous units of expression; check the associated publication to learn what those units are and how the data have been processed.
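Most workflows also need a sample annotation table. The sketch below derives one from the column names shown above, on the assumption (not confirmed by the GEO record) that every column follows a genotype_timepoint_replicate pattern such as WT_ZT00_BR1. The `pheno` table and its column names are hypothetical.

```r
# Column names copied from the output above; with the real matrix
# you would use colnames(y) instead.
sample_names <- c("WT_ZT00_BR1", "WT_ZT00_BR2", "WT_ZT00_BR3",
                  "WT_ZT04_BR1", "WT_ZT04_BR2")

# Split each name on "_" into its three assumed components.
parts <- do.call(rbind, strsplit(sample_names, "_", fixed = TRUE))

# Hypothetical annotation table, one row per sample.
pheno <- data.frame(
  genotype  = parts[, 1],
  timepoint = as.integer(sub("^ZT", "", parts[, 2])),
  replicate = as.integer(sub("^BR", "", parts[, 3])),
  row.names = sample_names
)
pheno
```

If some column names do not follow this pattern, the `rbind` step will not produce three clean columns, so it is worth checking `table(lengths(strsplit(colnames(y), "_")))` first.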

ADD REPLY • link 4.8 years ago Robert Castelo ★ 3.4k

0

Thank you so much! It was a great help!

ADD REPLY • link 4.8 years ago adR ▴ 40

0

Beware that these appear to be RPKM values rather than read counts, so they don't input directly into any of the Bioconductor workflows. See https://support.bioconductor.org/p/56275/ for discussion and work-around suggestions.
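One common way of handling RPKM-like values, sketched here as an assumption and not necessarily what the linked thread recommends, is to filter out very lowly expressed genes and work on a log2 scale with a small offset. The offset of 0.1 and the filtering threshold of 1 are arbitrary illustrative choices, and the toy matrix reuses a few values from the output shown earlier.

```r
# Toy matrix built from the first rows and columns shown earlier.
rpkm <- matrix(
  c( 0.01436,  0.00000,  0.00955,
     0.61960,  0.42963,  0.38794,
    15.12894, 17.25888, 14.15717,
    53.79726, 76.55921, 69.43295,
     5.47423,  5.52237,  5.04543),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Xkr4", "Sox17", "Mrpl15", "Lypla1", "Tcea1"),
    c("WT_ZT00_BR1", "WT_ZT00_BR2", "WT_ZT00_BR3")
  )
)

# Arbitrary small offset so that zeros do not become -Inf.
offset <- 0.1
logrpkm <- log2(rpkm + offset)

# Arbitrary threshold: keep genes with a mean RPKM above 1.
keep <- rowMeans(rpkm) > 1
logrpkm[keep, ]

# The filtered log2 values could then go into a linear-model analysis,
# for instance limma's lmFit() followed by eBayes(trend = TRUE).
```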

ADD REPLY • link 4.8 years ago Gordon Smyth 52k