Annotating featureData to an affymetrix Xcel microarray
1
0
Entering edit mode
Adam • 0
@15f2ecea
Last seen 2.7 years ago
United Kingdom

Enter the body of text here Hi, I am looking for some help in annotating my feature data. I am using Affy and have annotated the phenodata using annotatedDataFrame function but cant seem to do the same thing with the feature data. Therefore i am struggling to summarise the probe level data into gene level data for the downstream differential gene expression analysis. When i come to parse my annotatedDataFrame object into ReadAffy it errors on me. I have also tried to parse in the featureData object at the time of RMA but again am unable to.

Does anyone have any experience here please?

Code should be placed in three backticks as shown below


# include your problematic code here with any corresponding output 
# please also include the results of running the following in an R session 

sessionInfo( )
Microarray • 1.9k views
ADD COMMENT
0
Entering edit mode

Unless you show code it's impossible to help you.

But in general you should not be making separate AnnotatedDataFrames and trying to put them into an AffyBatch. You might want to annotate after running RMA, but usually not before.

ADD REPLY
0
Entering edit mode

\\raw.data <- ReadAffy(celfile.path = celpath, phenoData = AnnotatedDataFrame)

raw.data@featureData <- new("AnnotatedDataFrame", data = annotate)

eset <- expresso(raw.data, bgcorrect.method = "rma", normalize = "constant", summary.method = "avgdiff", pmcorrect.method = "pmonly")

This is the code i have been using. Sorry for not including before.

ADD REPLY
0
Entering edit mode

Sorry i should have said i am also using the makecdfenv function as per

make.cdf.package("Xcel", packagename = "xcelcdf", cdf.path = "path", package.path = , species = "Homo_sapiens") cdf <- read.cdffile("path")

ADD REPLY
0
Entering edit mode

in addition, this runs fine with no errors but it does not give me data that is gene expression data but still probe expression data. What am i doing wrong here so that i does summarise the probesets into gene level data?

ADD REPLY
0
Entering edit mode
@james-w-macdonald-5106
Last seen 3 hours ago
United States

Are you using expresso that way because it's what you want, or because that is what is in the help page? Normally you would just use rma instead, as it's the de facto method for Affy arrays and is arguably the best method to use. If you plan to publish, saying you did something else is problematic because the reviewers are going to wonder why you didn't use rma which you will then have to explain/defend.

Anyway, a couple of things. If you find yourself using the @ operator, you should rethink what you are doing. I mean it works, but it is not meant to be used directly like that. The main idea behind the various S4 methods in Bioconductor is to abstract away all the details of the underlying object so you don't have to mess with low level accessors or understand the underlying data structure. Usually there is a public method that uses the @ accessor for you, and which is meant to allow the underlying representation to change without requiring that you understand all the details. In this case, if you really need to add featureData to your ExpressionSet, you can simply use the featureData accessor.

It's also probably easier to build an annotation package for your array, which will then allow you to easily annotate. You need the annotation csv file, which I assume you already have.

> df <- read.csv("Xcel.na36.annot.csv", comment.char = "#")
> probF <- df[,c("Probe.Set.ID","Entrez.Gene")]
> names(probF) <- c("probes","genes")
> makeChipPackage("xcel", probF, "org.Hs.eg.db", "0.0.1", "me <me@mine.org>", "me", ".", "9606", "Homo", "sapiens")
Populating genes table:
Populating map_metadata table:
probes table filled and indexed
table metadata filled
table metadata filled
Creating package in ./xcel.db 
Now deleting temporary database file
[1] "./xcel.db"
> install.packages("xcel.db", repos = NULL, type = "source") ## only need type argument if on Windows, as I am
> library(xcel.db)
> abatch <- ReadAffy()
## here i am using some Xcel files I got from GEO
> eset <- rma(abatch)
> library(affycoretools)
> eset <- annotateEset(eset, xcel.db)
> head(fData(eset))
                PROBEID ENTREZID SYMBOL
200000_s_at 200000_s_at    10594  PRPF8
200001_at     200001_at      826 CAPNS1
200002_at     200002_at    11224  RPL35
200003_s_at 200003_s_at     6158  RPL28
200004_at     200004_at     1982 EIF4G2
200005_at     200005_at     8664  EIF3D
                                                        GENENAME
200000_s_at                         pre-mRNA processing factor 8
200001_at                                calpain small subunit 1
200002_at                                  ribosomal protein L35
200003_s_at                                ribosomal protein L28
200004_at     eukaryotic translation initiation factor 4 gamma 2
200005_at   eukaryotic translation initiation factor 3 subunit D
ADD COMMENT
0
Entering edit mode

Thanks for this. All works perfectly. But i still missing something. I run the code below and still gives me expression by probe rather than summary gene expresison. How do i perform the summary step? Really appreciate your help, thanks in advance.

ph <- eset@phenoData
f <- factor(ph@data$NFkB.Cluster,
            labels = c("Atypical", "Canonical", "Non_canonical"),
            ordered = T)

design <- model.matrix(~ 0 + f)
colnames(design) = levels(f)
design
data.fit = lmFit(eset, design)

cont.matrix <- makeContrasts(CanonicalvsAtypical=Canonical-Atypical,
                             AtypicalvsNon_canonical=Atypical-Non_canonical,
                             CanonicalvsNon_canonical=Canonical-Non_canonical,levels=design)
data.contr = contrasts.fit(data.fit,cont.matrix)
data.fit.eb = eBayes(data.contr)

head(data.fit.eb$coefficients)



Contrasts
              CanonicalvsAtypical AtypicalvsNon_canonical CanonicalvsNon_canonical
  200000_s_at          -0.1253895             -0.10704128               -0.2324307
  200001_at             0.2730138              0.18391004                0.4569239
  200002_at             0.5021192              0.30566054                0.8077797
  200003_s_at           0.2642715              0.13043072                0.3947022
  200004_at             0.6152136              0.55398937                1.1692030
  200005_at             0.1121613              0.09711398                0.2092753
ADD REPLY
0
Entering edit mode

Those data are summarized by probeset, which is +/- by gene expression. If you look at my example, you can see that the probesets correspond to (in order), PRPF8, CAPNS1, RPL35, RPL28, EIF4G2, and EIF3D.

I feel compelled to warn you (again!) about mucking around in objects that have accessors. You can get the phenoData from your eset object using the pData accessor, rather than extracting with the @ accessor. And once you have an MArrayLM object, you usually just want to get the results out using the topTable accessor, in which case if you had followed my example above, your results would be automatically annotated. Just getting the top 6 coefficients for all three contrasts is not useful. But something like

topTable(data.fit.eb, 1, Inf, p.value = 0.05)

will give you all the genes (annotated!) at an FDR < 0.05.

ADD REPLY
0
Entering edit mode

Thanks James. Appologies for not updating. New code p <- pData(eset) f <- factor(p$NFkB.Cluster, labels = c("Atypical", "Canonical", "Non_canonical"), ordered = T)

I think, because of my inexperience, i am not explaining myself/not understanding what you are saying. Your example works well and I have followed it. I get a list of probes with genes annotated with all the associated MArrayLM data. What i cant seem to do is summarise the multiple annotated logFC per gene into a single reading for that gene and do this across the whole microarray. Is there a specific accesor to do this? Or is this summarisation step done at an earlier stage.

Once again appreciate your time in answering my questions.

ADD REPLY
0
Entering edit mode

OK, now I'm confused. What do you mean by 'summarize the multiple annotated logFC per gene into a single reading'? Or maybe it's better to just ask exactly what you are trying to do. You have three comparisons, right? That will generate three coefficients (the logFC), and each coefficient is only meaningful for the one comparison. You could hypothetically do an F-test to test that any of the individual comparisons is significant, by simply neglecting to include a coefficient in your call to topTable (or conversely by using a range, like 1:3). But that isn't a super common thing to do in this context. Each individual comparison will be more powerful, and more directed, so people usually just do three individual comparisons.

ADD REPLY
0
Entering edit mode

I am now convinced i am not explaining myself properly. apologies!

Each gene has multiple data points which correspond (correct me if i am wrong) to individual probes. So when i call the first coefficient i get the contrast between Canonical and atypical signalling. But in this there are multiple entries for each gene. I thought the term for combining probesets into gene was summarising hence why i used this term. so what i want to do is take each probe reading for a particular gene and combine them (i understand there are different ways to combine them) to get a single output for that gene.

This is why i tried expresso as there was a specific bit of code there to summarise.

ADD REPLY
0
Entering edit mode

OK, I think I get it. There are different levels of summarization. At the first level, there are a certain number of 25-mer probes that interrogate different portions of a given gene's transcript, and what expresso and rma do is to summarize those values to a 'probeset' level, which is meant to be correlated with the underlying expression of that gene. For most people that's sufficient and they just analyze the probeset levels and go on from there.

However, some of the probesets measure the same gene. Or at least they are annotated to the same gene, and sometimes people only want one measurement for each gene. I don't think that is your concern, but if it is, you can use the avereps function in limma, along with the gene symbols to compute an average for any replicated gene. Another alternative is to use an MBNI remapped CDF, which maps each probe to a particular gene, so when you summarize using rma, you only have one measurement per gene.

But if you are worried that you are using each of the 25-mer probes separately, that's not the case. After either expresso or rma, all you have is a summary value for all the probes for a given probeset, which is intended to measure a single gene.

ADD REPLY
0
Entering edit mode

Thanks again. The rma summarised values are intersting because the multiple probesets for a given gene are variable.

I ran this bit of code.

a <- assayData(eset) f <- fData(eset) eset2 <- limma::avereps(a$exprs, ID =f$SYMBOL) Worked perfectly. Thanks so much for your help.

ADD REPLY

Login before adding your answer.

Traffic: 723 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6