Question

SPIA error - names of de are not in reference array

0

Entering edit mode

A Eelmur • 0

@-a-eelmur-6956

Last seen 10.4 years ago

United States

Hi,

I'm trying to use the SPIA package in R to complete a pathway analysis of differentially expressed genes that considers fold change and ORA. I'm new to R generally, and continually run into an error that states my de names are not in the reference array (exact wording below) when I run the analysis.

I've done many checks, my data is clear of duplicate entries, and NAs. The entrezgene IDs are valid. In fact, I can run the analysis on Graphite's online version of SPIA using the "all_genes" and "sig_genes" vectors/files listed below without any problem. I'd like to use R for the analysis, though as it gives better graphing options.

I think the problem has something to do with my creation of the named vector sig_genes(below), which contains the log2FoldChange values named with Entrezgene IDs.

I've searched around and can't seem to find a solution online that fixes the problem.

I've posted the columns of my original data frame (which is very large), my code and sessioninfo() below

Does anyone see an error in my code or session info () that might be leading to this problem?

Thanks

A. Eelmur

**

My original data frame (DF) contains 12000+ rows (not sure if a sample would, therefore, be useful here), but columns names are as so

>head(DF)

baseMean log2FoldChange lfcSE stat pvalue padj ensembl_gene_id

Here is the code and error.

> #SPIA
library(Biobase)
library(GEOquery)
library(limma)
library(SPIA)
library(plyr)
library(biomaRt)

> # read in the data

DF = read.delim("DF.txt", header=T, row.names=1)

>#define mart

ensembl <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")

># get entrez gene IDs for the DF file
getgenes <- getBM(attributes = c("ensembl_gene_id", "entrezgene"), filters = "ensembl_gene_id", values = DF$ensembl_gene_id, mart = ensembl, uniqueRows=FALSE)

>#remove duplicates
getgenes2 = getgenes[!duplicated(getgenes[,1]),]

>#join entrez genes and DE data frame by ensemble gene ID

DF_with_EntrezID= join(DF, getgenes2, type = "left")

>#subset DF_with_EntrezID to just Entrez gene ID, log2 Fold Change and the padj columns and omit any NAs

DF_with_EntrezID_subset <- na.omit(subset(DF_with_EntrezID, select=c(entrezgene, log2FoldChange, padj))

>#make a vector of log2FoldChange values with a padj <0.1

sig_genes <- subset(DF_with_EntrezID_subset, padj<0.1)$log2FoldChange

>#change sig_genes to numeric

sig_genes <- as.numeric(sig_genes)

>#make a vector of Entregene IDs for DE genes with a padj <0.1

n <- subset(DF_with_EntrezID_subset, padj<0.1)$entrezgene

># change n to character

n <- as.character(n)

>#create named vector of log2FoldChange values with Entrezgene ID names

names(sig_genes)<- n

>#make reference array based on all Entrezgene IDs

all_genes <- as.character(DF_with_EntrezID_subsetentrezgene)

>res=spia(de=sig_genes, all=all_genes, organism="hsa",plots=TRUE)

Error in spia(de = DE, all = all_genes, organism = "hsa", plots = TRUE) :
de must be a vector of log2 fold changes. The names of de should be included in the refference array!

Here is my sessioninfo()

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] plyr_1.8.1 SPIA_2.16.0 KEGGgraph_1.22.1 graph_1.42.0 XML_3.98-1.1 limma_3.20.9
[7] GEOquery_2.30.1 Biobase_2.24.0 BiocGenerics_0.10.0 biomaRt_2.20.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.26.1 DBI_0.3.1 GenomeInfoDb_1.0.2 IRanges_1.22.10 Rcpp_0.11.3
[6] RCurl_1.95-4.3 RSQLite_1.0.0 stats4_3.1.1 tools_3.1.1

SPIA named vectors pathway analysis • 2.4k views

ADD COMMENT • link updated 2.6 years ago by a.ghosh • 0 • written 10.4 years ago by A Eelmur • 0

0

Entering edit mode

I am getting the same error. Did you find any solution ?

ADD REPLY • link 2.6 years ago a.ghosh • 0