Question

Downloading microarray data for DEG analysis!

Nithisha ▴ 10

@nithisha-14272

Last seen 7.1 years ago

Hello,

I am currently following this protocol:

https://www.bioconductor.org/packages//2.7/bioc/vignettes/oligo/inst/doc/V5ExonGene.pdf

to download Microarray data from GEO.

Firstly, when I try the command for rma analysis, I see this error.

> genePS <- rma(affyGeneFS, target="probeset")
Error in .local(object, ...) : unused argument (target = "probeset")

I then specified no target and it seemed to work.

> genePS <- rma(affyGeneFS)
Background correcting
Normalizing
Calculating Expression

However, in the next step, to obtain featureData, I see yet another error.

> featureData(genePS) = getNetAffx(genePS)
Error in getNetAffx(genePS) : NetAffx Annotation not available in 'pd.mouse430.2'. Consider using 'biomaRt'.

I have 2 questions.

1) Why did specifying no target work?

2) How can I substitute the getNetAffx command with biomaRt in order to obtain the featureData?

Thank you!

affymetrix microarrays • 2.2k views

updated 7.5 years ago by James W. MacDonald 68k • written 7.5 years ago by Nithisha ▴ 10

score 3 · Accepted Answer · 2017-10-30

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 5 days ago

United States

The arrays you are trying to analyze are some old school Mouse 430_2 arrays. These arrays were based on an IVT procedure that uses oligo-dT as the primer, so by definition you are querying only the 3' region of the transcript (i.e. oligo-dT binds to the poly-A tail of the mature transcript, so synthesis starts at the very 3' end of the transcript, and it generally proceeds only so far). More modern Affy arrays use a random primer, so you get cDNA from across the transcript rather than just at the 3' end.

With the more modern arrays (the Gene ST and Exon ST arrays that are being described in that workflow), the probes are dispersed across the entire transcript, so you can summarize subsets of the probes to get measures for individual exons, or you can summarize all of the probes for a given transcript to get a transcript-level measurement. But this isn't possible with the 3'-biased arrays like the Mouse 430_2 arrays you are working with. For those arrays there is only one level at which you can summarize, so there is no 'target' argument to specify different summarization levels.
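As a rough sketch of the point above, one could pass 'target' to rma only when the raw-data object belongs to an array type that has several summarization levels. The helper names has_target_levels and run_rma are hypothetical, and the list of classes is an illustrative assumption rather than a complete list of what oligo supports.

```r
# Raw-data classes assumed here to accept 'target' in oligo's rma()
# (Gene ST and Exon ST arrays). Illustrative, not exhaustive.
st_classes <- c("GeneFeatureSet", "ExonFeatureSet")

# TRUE if the object comes from an array with several summarization levels
has_target_levels <- function(fs) inherits(fs, st_classes)

# Hypothetical wrapper: only pass 'target' when the array type supports it
run_rma <- function(fs, target = "core") {
  if (has_target_levels(fs)) {
    oligo::rma(fs, target = target)
  } else {
    # 3'-biased arrays such as Mouse 430_2 have a single summarization level
    oligo::rma(fs)
  }
}
```

With the objects from this thread, run_rma(affyGeneFS) would fall through to the plain rma(affyGeneFS) call, which is the one that worked.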

Another difference has to do with what annotation data are encapsulated in the pdInfoPackage. For the Gene and Exon ST arrays, there are a couple of annotation files that are part of the pdInfoPackage, and you can use getNetAffx to parse those files. This isn't true of the older arrays. My affycoretools package has a convenience function, annotateEset, that is intended to automatically annotate your ExpressionSet. For your arrays you would have to install both the affycoretools and mouse4302.db packages, and then you can do:

library(affycoretools)

library(mouse4302.db)

genePS <- annotateEset(genePS, mouse4302.db)
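For the installation step mentioned above, a minimal sketch using BiocManager might look like the following. The helper names missing_pkgs and install_if_missing are hypothetical, and the sketch assumes a working internet connection and a Bioconductor release that still carries both packages.

```r
# Return the packages from 'pkgs' that are not installed yet
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install whatever is missing via BiocManager
install_if_missing <- function(pkgs) {
  todo <- missing_pkgs(pkgs)
  if (length(todo) > 0) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install(todo)
  }
  invisible(todo)
}

# Only run the installation in an interactive session
if (interactive()) {
  install_if_missing(c("affycoretools", "mouse4302.db"))
}
```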

written 7.5 years ago by James W. MacDonald 68k

This was extremely informative and makes my understanding a lot clearer. I used the code you provided and it worked perfectly. Thanks so much, appreciate it James!

written 7.5 years ago by Nithisha ▴ 10

Hi James,

If I could just extend this question: I have another dataset that contains microarray data, and the GEO platform mentioned is [MoGene-1_0-st] Affymetrix Mouse Gene 1.0 ST Array [transcript (gene) version]. I am not sure if that is useful information, but I used the aforementioned protocol: https://www.bioconductor.org/packages//2.7/bioc/vignettes/oligo/inst/doc/V5ExonGene.pdf to download the microarray data, and pData(featureData(geneCore)) does not seem to have any gene names. In the end, I get gene IDs such as 10338001, 10338006 etc. as my limma DEG analysis output. Is there a way for me to get gene names/symbols? Do I have to use BioMart for this?

Thanks in advance.

written 7.5 years ago by Nithisha ▴ 10

The same recommendation applies, only now the annotation package is the mogene10sttranscriptcluster.db package. Alternatively, since this is a Gene ST array, you could use the built-in Affy data:

library(affycoretools)

library(mogene10sttranscriptcluster.db)

geneCore <- annotateEset(geneCore, mogene10sttranscriptcluster.db)

OR

library(pd.mogene.1.0.st.v1)

geneCore <- annotateEset(geneCore, pd.mogene.1.0.st.v1)

And do note that there are help pages for functions. ?annotateEset should be instructive, and if not, please tell me why so I can improve it.

written 7.5 years ago by James W. MacDonald 68k

Hi James,

Thank you for your reply. I went through ?annotateEset and have a few questions that I hope are not too basic.

1) Firstly, what would be the difference between mogene10sttranscriptcluster.db and pd.mogene.1.0.st.v1? And how would I know to look for these packages within the library "affycoretools"? Would this be based on the package written in the GEO link from where I get the dataset?

2) Before running your code, pData(featureData(geneCore)) returned several columns such as probe_set_id, gene_assignment, pathways etc. After running the code above, pData(featureData(geneCore)) only shows 4 columns: probe_id, entrez_id, symbol and gene_name. I was wondering if it is possible to retain the original information provided in the featureData of geneCore as well?

3) After running the above code, some of the gene probe IDs seemed to be mapped to gene_names which display "NA". Is this to be expected?

4) Also, can BioMart be used to map the gene probe IDs to gene names? I know BioMart is related to Ensembl so I was wondering if Ensembl contained Affy information.

I apologize for the number of questions asked but I am new to this and would very much appreciate your advice!

Thank you.

written 7.5 years ago by Nithisha ▴ 10

1) The mogene10sttranscriptcluster.db package provides annotation data that maps the Affy probeset IDs to genes and whatnot. If you summarize at the probeset level (not really recommended, btw), there is also a mogene10stprobeset.db package. The pd.mogene.1.0.st.v1 package is used by the oligo package to map probes to probesets when you run rma. It also happens to contain some annotation data, which annotateEset will extract and process. A further alternative is to note that GEO will automatically populate your featureData slot, and you could use those data as well. But do note that what GEO puts in there is pretty messy, and you have to parse out the useful bits yourself. A workflow showing all the ins and outs of this might be useful.

2) Yes. As I mentioned above, you automatically get some data. If you like what you get, then just go with that.

3) Yes. Not all probesets actually measure something. Some are background or QC probes, and some are just speculative content that never really worked out.

4) Yes, you can use the biomaRt package to annotate as well. But do note that biomaRt is querying a database, so you won't get the data back in the same order as your query, and you have to make sure to re-order correctly. Nobody loves it when people spend lots of time trying to validate Gene X, only to find that it was mis-annotated, and is actually Gene Y! That can lead to some uncomfortable lab meetings. Not that I have ever experienced that myself....
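The re-ordering caveat in the last point can be illustrated with base R alone. This is only a sketch: bm_result is a hand-made stand-in for what a biomaRt query might return, and the probeset IDs and gene symbols are made-up placeholders, not real annotations.

```r
# Probeset IDs in the order they appear in the expression data (placeholder values)
probe_ids <- c("10338001", "10344614", "10344624", "10344633")

# Stand-in for a query result: rows come back in arbitrary order, and
# probesets without annotation are simply absent. Symbols are made up.
bm_result <- data.frame(
  probe_id = c("10344633", "10344614", "10344624"),
  symbol   = c("GeneC", "GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# For each probeset, find the row of bm_result that annotates it (NA if none)
idx <- match(probe_ids, bm_result$probe_id)

# Build an annotation table that follows the original probeset order
annotated <- data.frame(
  probe_id = probe_ids,
  symbol   = bm_result$symbol[idx],
  stringsAsFactors = FALSE
)

# Sanity check: every annotated row really belongs to its probeset
stopifnot(all(is.na(idx) | bm_result$probe_id[idx] == probe_ids))
```

Note that match() keeps only the first hit, so a probeset that maps to several genes would need extra handling before this step.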