Hi everyone!
I'm trying to apply pathway analysis to Array Express experiments. For that I downloaded the processed data, along with the sample and data relationship files (e.g. https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-15641/). I made every value an integer (I know this isn't how the proper analysis is made, but this is just a fast way to corraborate suspicions on something) and prepared bouth countData and colData files (both .csv).
colData:
sample | type |
GSM391107 | normal human kidney tissue |
countData:
ID_REF | GSM391107 | GSM391108 | GSM391109 | GSM391110 | GSM391111 |
AFFX-TrpnX-M_at | 24 | 30 | 25 | 21 | 22 |
Now in R Studio I run the following commands:
library(tidyverse)
files <- list.files()
metadata <- read_csv(files[1])
counts <- read_csv(files[2])
library(DESeq2)
mycounts <- as.data.frame(counts)
metadata <- as.data.frame(metadata)
dds <- DESeqDataSetFromMatrix(countData=mycounts,colData=metadata,design=~type,tidy=TRUE)
dds <- DESeq(dds)
res <- results(dds, tidy=TRUE)
res <- tbl_df(res)
res %>%
filter(padj<0.05) %>%
write_csv("DGE.csv")
DGE.csv:
row | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
AFFX-TrpnX-5_at | 51.4388012667095 | 0.422155144223 | 0.0677142709644 | 6.2343058862228 | 4.53785621436E-10 | 2.52792643745E-09 |
So now I want to run pathway analysis according to this guide: http://www.gettinggeneticsdone.com/2015/12/tutorial-rna-seq-differential.html The experiment was made with Affymetrix Gene Chip HGU133A, so I:
library("AnnotationDbi") library("hgu133a.db")
And when I'm trying to merge annotation data to my matrix with:
res$entrez = mapIds(hgu133a.db, keys=row.names(res), column="ENTREZID", keytype="PROBEID", multiVals="first")
Error in .testForValidKeys(x, keys, keytype, fks) : None of the keys entered are valid keys for 'PROBEID'. Please use the keys method to see a listing of valid arguments.
If I just type: res
# A tibble: 22,283 x 7 row baseMean log2FoldChange lfcSE stat pvalue padj <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 AFFX-TrpnX-… 25.7 0.134 0.0882 1.52 1.28e- 1 1.84e- 1
> sessionInfo() R version 3.4.4 (2018-03-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.1 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=es_AR.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=es_AR.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=es_AR.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] hugene10sttranscriptcluster.db_8.7.0 [2] annotate_1.56.2 [3] XML_3.98-1.16 [4] hgu133a.db_3.2.3 [5] hgu95av2.db_3.2.3 [6] org.Hs.eg.db_3.5.0 [7] AnnotationDbi_1.40.0 [8] BiocInstaller_1.28.0 [9] DESeq2_1.18.1 [10] SummarizedExperiment_1.8.1 [11] DelayedArray_0.4.1 [12] matrixStats_0.54.0 [13] Biobase_2.38.0 [14] GenomicRanges_1.30.3 [15] GenomeInfoDb_1.14.0 [16] IRanges_2.12.0 [17] S4Vectors_0.16.0 [18] BiocGenerics_0.24.0 [19] forcats_0.3.0 [20] stringr_1.3.1 [21] dplyr_0.7.6 [22] purrr_0.2.5 [23] readr_1.1.1 [24] tidyr_0.8.1 [25] tibble_1.4.2 [26] ggplot2_3.0.0 [27] tidyverse_1.2.1 loaded via a namespace (and not attached): [1] nlme_3.1-131 bitops_1.0-6 lubridate_1.7.4 [4] bit64_0.9-7 RColorBrewer_1.1-2 httr_1.3.1 [7] tools_3.4.4 backports_1.1.2 utf8_1.1.4 [10] R6_2.2.2 rpart_4.1-13 Hmisc_4.1-1 [13] DBI_1.0.0 lazyeval_0.2.1 colorspace_1.3-2 [16] nnet_7.3-12 withr_2.1.2 tidyselect_0.2.4 [19] gridExtra_2.3 bit_1.1-14 compiler_3.4.4 [22] cli_1.0.1 rvest_0.3.2 htmlTable_1.12 [25] xml2_1.2.0 scales_1.0.0 checkmate_1.8.5 [28] genefilter_1.60.0 digest_0.6.17 foreign_0.8-69 [31] XVector_0.18.0 base64enc_0.1-3 pkgconfig_2.0.2 [34] htmltools_0.3.6 htmlwidgets_1.3 rlang_0.2.2 [37] readxl_1.1.0 rstudioapi_0.7 RSQLite_2.1.1 [40] bindr_0.1.1 jsonlite_1.5 BiocParallel_1.12.0 [43] acepack_1.4.1 RCurl_1.95-4.11 magrittr_1.5 [46] GenomeInfoDbData_1.0.0 Formula_1.2-3 Matrix_1.2-14 [49] fansi_0.3.0 Rcpp_0.12.19 munsell_0.5.0 [52] stringi_1.2.4 yaml_2.2.0 zlibbioc_1.24.0 [55] plyr_1.8.4 blob_1.1.1 grid_3.4.4 [58] crayon_1.3.4 lattice_0.20-35 haven_1.1.2 [61] splines_3.4.4 hms_0.4.2 locfit_1.5-9.1 [64] knitr_1.20 pillar_1.3.0 geneplotter_1.56.0 [67] glue_1.3.0 latticeExtra_0.6-28 data.table_1.11.8 [70] modelr_0.1.2 cellranger_1.1.0 gtable_0.2.0 [73] assertthat_0.2.0 xtable_1.8-3 broom_0.5.0 [76] survival_2.41-3 memoise_1.1.0 bindrcpp_0.2.2 [79] cluster_2.0.7-1
I've been trying lots of guides and I can't get past through the annotation. When I type:
select(hgu133a.db, c("74694_s_at"), c("SYMBOL","ENTREZID", "GENENAME")) 'select()' returned 1:1 mapping between keys and columns PROBEID SYMBOL ENTREZID 1 74694_s_at RABEP2 79874 GENENAME 1 rabaptin, RAB GTPase binding effector protein 2
I'm new to R, so I'm betting this has something to do with a very basic deal, hope you don't mind telling me.
Thank you for your time!
Maxi
"I know this isn't how the proper analysis is made, but this is just a fast way ..."
You can't save time by doing something nonsensical, and treating normalized microarray intensities as if they were counts is in that category. You're following a guide that isn't at all designed for the type of data you have, as James has noted.