Question

Codelink differential expression and annotation

1

Entering edit mode

Agaz Hussain Wani ▴ 260

@agaz-hussain-wani-7620

Last seen 7.0 years ago

India

I perform differential expression on codelink data and output is shown below. Now i am looking to annotate the probes and i use `ACCN` column from hwgcod.db to annotate. I am trying to get the ACCN's from probeName column ,about 7000 probe names and ACCN  does not match, which leads to loss of information. Can i get the ACCN instead of probeName as shown below as the second column. 

Note: I am using the infromation from probeName to get ACCN which i can use to annotate, say something like "1691167CB1" "NM_002575"  "3251556CB1" "NM_001045"  "201678"     "8187031CB1" "1019621"    "1023071".

      probeName       probeType logicalRow logicalCol   meanSNR     logFC
17143  1691167CB1_PROBE1 DISCOVERY        278         26 0.5780484 -6.767287
1460  NM_002575.1_PROBE1 DISCOVERY         26         23 0.6308975  6.655497
19529  3251556CB1_PROBE1 DISCOVERY        315         57 0.6364621 -4.456800
19025 NM_001045.1_PROBE1 DISCOVERY        307         66 0.5722841  5.984381
5311    201678.10_PROBE1 DISCOVERY         87         40 0.5945943 -5.464377
7264   8187031CB1_PROBE1 DISCOVERY        118         23 0.6700745  6.160234
9024    1019621.1_PROBE1 DISCOVERY        146          5 0.7937480 -7.693983
6481    1023071.1_PROBE1 DISCOVERY        105         69 0.7680499 -6.333866
       AveExpr         t      P.Value adj.P.Val         B
17143 1.379396 -19.89355 8.231218e-06 0.1629205 -2.990755
1460  1.468584  14.45828 3.746444e-05 0.3707668 -3.020015
19529 2.990133 -13.15169 5.856035e-05 0.3806185 -3.032661
19025 1.065737  11.83323 9.614977e-05 0.3806185 -3.049624
5311  1.250057 -11.28423 1.200560e-04 0.3960448 -3.058378
7264  2.368912  10.32842 1.812488e-04 0.4675345 -3.076801
9024  4.481637 -17.38061 8.927544e-05 0.3806185 -3.659384
6481  4.973425 -14.24512 1.889696e-04 0.4675345 -3.668655

My sample data is here https://www.dropbox.com/s/kz12cwgqes10oqu/GSM108290.TXT?dl=0 , which has ACCN column.

The second confusion i want to clear is, is there a single annotation package for affymetrix, illumina and codelink data platforms to annotate from. Thanks

r codelink differential expression hwgcod • 2.0k views

ADD COMMENT • link updated 9.8 years ago by Diego Diez ▴ 760 • written 9.8 years ago by Agaz Hussain Wani ▴ 260

0

Entering edit mode

In the file you link there is no ACCN column. Are you reading this file with readCodelinkSet() or from GEO with GEOquery? How are you getting the ACCN information (i.e. write the actual code)? This question is not about differential expression (as the title and tags suggests), only about annotation.

ADD REPLY • link 9.8 years ago Diego Diez ▴ 760

0

Entering edit mode

I can find " Annotation_NCBI-Acc"

in the above attached file as the sixth column, If that's not ACCN column , then what is that. And i have data already downloaded so i use readCodelinkSet() instead of GEOquery().

Please find the code which i use to extract information from probeNames() to annotate for h20kcod.db

codelink_processed <- list()
files = list.files(pattern = "TXT")
codelink_data <- function(data)
{
  for(i in 1:length(files))
  {
    codset = readCodelinkSet(filename = f[[i]])
    codset = codCorrect(codset, method = "half", offset = 0)
    codset = codNormalize(codset, method = "loess", weights = getWeight(codset), loess.method = "fast")
    exprs <- exprs(codset)
    probes <- probeNames(codset)
    snr  <- getSNR(codset)
     
    ## To get the information from probes compatible with ACCNUM
    newids <- gsub("\\..*|_PROBE.*", "",probes)
    
    ## Annotating the newids
    keys <- select(h20kcod.db, newids, c("SYMBOL"), "ACCNUM")
    
    dataFrame <- cbind(keys,exprs,snr)
    colnames(dataFrame) <- c("Probe","Gene Symbol","SignalIntensity", "SNR")
    codelink_processed[[length(codelink_processed)+1]] <- head(dataFrame,50)
  }
  return(codelink_processed)
}

dataframes <- codelink_data(files)

ADD REPLY • link 9.8 years ago Agaz Hussain Wani ▴ 260

0

Entering edit mode

I do not know why you assume that we should read "ACCN" column as "Annotation_NCBI-Acc"- they may or may not refer to the same thing. Please, be specific when referring to data columns (or in general, to anything you want others to help you with) or define before hand any alias you wish to use. I am preparing a workaround for your issue and will post it as an answer soon.

ADD REPLY • link 9.8 years ago Diego Diez ▴ 760

score 1 · Answer 1 · 2015-06-17

As we discussed in another question, the problem with the specific dataset you are using is that it reports the legacy probe names, instead of the current ones. Therefore, you can't directly use the h20kcod.db annotation package. You tried to workaround this problem by obtaining the accession number that is found in the legacy probe names. But this has the disadvantage that many ids cannot be found anymore. This is because many of the original mappings have changed, due to improved genome assembly and annotation or some of the accession numbers beeing discontinued.

Anyway, your only hope is to remap the legacy probe names to the new ones. In the GEO page for the file you linked (GSM108290; http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM108290) we can find information about the array design in the GPL file: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL2891. Unfortunately, in this case the current codelink probe names are reported, and there is not the correspondence with the legacy ones. However, there is the row and col location of the probes (LOGICAL_ROW and LOGICAL_COL) and we can use this information for the mapping. Below is the code I used to hack this:

NOTE: you need to download the GPL file into a text file. For that I clicked on the View full table button that lead me to this page: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?view=data&acc=GPL2891&id=12727&db=GeoDb_blob06)

library(codelink)

# read codelink file (NOTE: this workaround work also if you read all your files here)
codset <- readCodelinkSet("GSM108290.TXT")

# read GPL file
gpl <- read.table("GPL2891.txt", sep = "\t", header = TRUE, fill = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)
head(gpl)

# obtain feature data.
fdata <- fData(codset)
head(fdata)

# fix extra white.
fdata$logicalRow <- trimWhiteSpace(fdata$logicalRow)
fdata$logicalCol <- trimWhiteSpace(fdata$logicalCol)

# check dimensions (they do not match!)
dim(gpl)
dim(codset) # in the codelink file BLANK probes are missing (!?)

# match each spot.
# NOTE: slow.
tmp <- apply(fdata, 1, function(x) {
  cr <- x["logicalRow"]
  cc <- x["logicalCol"]
  sel <- gpl$LOGICAL_ROW == cr & gpl$LOGICAL_COL == cc
  data.frame(id = gpl$ID[sel], probeName = gpl$PROBE_NAME[sel], ori = x["probeName"], row.names = NULL, stringsAsFactors = FALSE)
})
head(tmp)

# sanity check: one match per spot.
any(sapply(tmp, nrow) != 1)

# rbind all rows.
# NOTE: slow.
tmp <- do.call(rbind, tmp)
head(tmp)

# bind old and new data.
fdata2 <- cbind(fdata, tmp)

# rename original probe name column (contains legacy probe names).
colnames(fdata2)[1] <- "probeName_old"
head(fdata2)

# sanity check (everything OK?)
all(fdata2$probeName_old == fdata2$ori)

# assign new fdata.
fData(codset) <- fdata2
probeNames(codset)

# now we can get annotations the traditional way:
library(h20kcod.db)
# most probes have a matching ACCNUM:
h20kcod() # h20kcodACCNUM has 18673 mapped keys (of 19993 keys)
# get annotations for 10 first probes:
select(h20kcod.db, keys = probeNames(codset)[1:10], columns = c("ENTREZID", "SYMBOL"))
   PROBEID ENTREZID   SYMBOL
1  GE53815    23541  SEC14L2
2  GE59986     9053     MAP7
3  GE53908    57538    ALPK3
4  GE60064     5018    OXA1L
5  GE53952    54480    CHPF2
6  GE60188    10483   SEC23B
7  GE53801   400961   PAIP2B
8  GE59978     1506     CTRL
9  GE53905    57535 KIAA1324
10 GE60052    11099   PTPN21

This should work. I will consider adding a way to automatically map old legacy probes to the new ones. But this will not be immediately available, as so far this is the first case with this problem I found since the package was released (9 years ago).