Hi,
I'm trying to use the SPIA package in R to complete a pathway analysis of differentially expressed genes that considers fold change and ORA. I'm new to R generally, and continually run into an error that states my de names are not in the reference array (exact wording below) when I run the analysis.
I've done many checks, my data is clear of duplicate entries, and NAs. The entrezgene IDs are valid. In fact, I can run the analysis on Graphite's online version of SPIA using the "all_genes" and "sig_genes" vectors/files listed below without any problem. I'd like to use R for the analysis, though as it gives better graphing options.
I think the problem has something to do with my creation of the named vector sig_genes(below), which contains the log2FoldChange values named with Entrezgene IDs.
I've searched around and can't seem to find a solution online that fixes the problem.
I've posted the columns of my original data frame (which is very large), my code and sessioninfo() below
Does anyone see an error in my code or session info () that might be leading to this problem?
Thanks
A. Eelmur
**
My original data frame (DF) contains 12000+ rows (not sure if a sample would, therefore, be useful here), but columns names are as so
>head(DF)
baseMean log2FoldChange lfcSE stat pvalue padj ensembl_gene_id
Here is the code and error.
> #SPIA
library(Biobase)
library(GEOquery)
library(limma)
library(SPIA)
library(plyr)
library(biomaRt)
> # read in the data
DF = read.delim("DF.txt", header=T, row.names=1)
>#define mart
ensembl <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
># get entrez gene IDs for the DF file
getgenes <- getBM(attributes = c("ensembl_gene_id", "entrezgene"), filters = "ensembl_gene_id", values = DF$ensembl_gene_id, mart = ensembl, uniqueRows=FALSE)
>#remove duplicates
getgenes2 = getgenes[!duplicated(getgenes[,1]),]
>#join entrez genes and DE data frame by ensemble gene ID
DF_with_EntrezID= join(DF, getgenes2, type = "left")
>#subset DF_with_EntrezID to just Entrez gene ID, log2 Fold Change and the padj columns and omit any NAs
DF_with_EntrezID_subset <- na.omit(subset(DF_with_EntrezID, select=c(entrezgene, log2FoldChange, padj))
>#make a vector of log2FoldChange values with a padj <0.1
sig_genes <- subset(DF_with_EntrezID_subset, padj<0.1)$log2FoldChange
>#change sig_genes to numeric
sig_genes <- as.numeric(sig_genes)
>#make a vector of Entregene IDs for DE genes with a padj <0.1
n <- subset(DF_with_EntrezID_subset, padj<0.1)$entrezgene
># change n to character
n <- as.character(n)
>#create named vector of log2FoldChange values with Entrezgene ID names
names(sig_genes)<- n
>#make reference array based on all Entrezgene IDs
all_genes <- as.character(DF_with_EntrezID_subsetentrezgene)
>res=spia(de=sig_genes, all=all_genes, organism="hsa",plots=TRUE)
Error in spia(de = DE, all = all_genes, organism = "hsa", plots = TRUE) :
de must be a vector of log2 fold changes. The names of de should be included in the refference array!
Here is my sessioninfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_1.8.1 SPIA_2.16.0 KEGGgraph_1.22.1 graph_1.42.0 XML_3.98-1.1 limma_3.20.9
[7] GEOquery_2.30.1 Biobase_2.24.0 BiocGenerics_0.10.0 biomaRt_2.20.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.26.1 DBI_0.3.1 GenomeInfoDb_1.0.2 IRanges_1.22.10 Rcpp_0.11.3
[6] RCurl_1.95-4.3 RSQLite_1.0.0 stats4_3.1.1 tools_3.1.1
I am getting the same error. Did you find any solution ?