Question

DGE matrix annotation

0

Entering edit mode

maxininfa • 0

@maxininfa-17622

Last seen 6.3 years ago

Hi everyone!

I'm trying to apply pathway analysis to Array Express experiments. For that I downloaded the processed data, along with the sample and data relationship files (e.g. https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-15641/). I made every value an integer (I know this isn't how the proper analysis is made, but this is just a fast way to corraborate suspicions on something) and prepared bouth countData and colData files (both .csv).

colData:

sample	type
GSM391107	normal human kidney tissue

countData:

ID_REF	GSM391107	GSM391108	GSM391109	GSM391110	GSM391111
AFFX-TrpnX-M_at	24	30	25	21	22

Now in R Studio I run the following commands:
library(tidyverse)
files <- list.files()
metadata <- read_csv(files[1])
counts <- read_csv(files[2])

library(DESeq2)
mycounts <- as.data.frame(counts)
metadata <- as.data.frame(metadata)

dds <- DESeqDataSetFromMatrix(countData=mycounts,colData=metadata,design=~type,tidy=TRUE)
dds <- DESeq(dds)
res <- results(dds, tidy=TRUE)

res <- tbl_df(res)

res %>%

filter(padj<0.05) %>%

write_csv("DGE.csv")

DGE.csv:

row	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
AFFX-TrpnX-5_at	51.4388012667095	0.422155144223	0.0677142709644	6.2343058862228	4.53785621436E-10	2.52792643745E-09

So now I want to run pathway analysis according to this guide: http://www.gettinggeneticsdone.com/2015/12/tutorial-rna-seq-differential.html

The experiment was made with Affymetrix Gene Chip HGU133A, so I:

library("AnnotationDbi") 
library("hgu133a.db")

And when I'm trying to merge annotation data to my matrix with:

res$entrez = mapIds(hgu133a.db, keys=row.names(res), column="ENTREZID", keytype="PROBEID", multiVals="first")

Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'PROBEID'. Please use the keys method to see a listing of valid arguments.

If I just type: res

# A tibble: 22,283 x 7
   row          baseMean log2FoldChange  lfcSE     stat   pvalue     padj
   <chr>           <dbl>          <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
 1 AFFX-TrpnX-…     25.7        0.134   0.0882   1.52   1.28e- 1 1.84e- 1

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_AR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=es_AR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_AR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] hugene10sttranscriptcluster.db_8.7.0
 [2] annotate_1.56.2                     
 [3] XML_3.98-1.16                       
 [4] hgu133a.db_3.2.3                    
 [5] hgu95av2.db_3.2.3                   
 [6] org.Hs.eg.db_3.5.0                  
 [7] AnnotationDbi_1.40.0                
 [8] BiocInstaller_1.28.0                
 [9] DESeq2_1.18.1                       
[10] SummarizedExperiment_1.8.1          
[11] DelayedArray_0.4.1                  
[12] matrixStats_0.54.0                  
[13] Biobase_2.38.0                      
[14] GenomicRanges_1.30.3                
[15] GenomeInfoDb_1.14.0                 
[16] IRanges_2.12.0                      
[17] S4Vectors_0.16.0                    
[18] BiocGenerics_0.24.0                 
[19] forcats_0.3.0                       
[20] stringr_1.3.1                       
[21] dplyr_0.7.6                         
[22] purrr_0.2.5                         
[23] readr_1.1.1                         
[24] tidyr_0.8.1                         
[25] tibble_1.4.2                        
[26] ggplot2_3.0.0                       
[27] tidyverse_1.2.1                     

loaded via a namespace (and not attached):
 [1] nlme_3.1-131           bitops_1.0-6           lubridate_1.7.4       
 [4] bit64_0.9-7            RColorBrewer_1.1-2     httr_1.3.1            
 [7] tools_3.4.4            backports_1.1.2        utf8_1.1.4            
[10] R6_2.2.2               rpart_4.1-13           Hmisc_4.1-1           
[13] DBI_1.0.0              lazyeval_0.2.1         colorspace_1.3-2      
[16] nnet_7.3-12            withr_2.1.2            tidyselect_0.2.4      
[19] gridExtra_2.3          bit_1.1-14             compiler_3.4.4        
[22] cli_1.0.1              rvest_0.3.2            htmlTable_1.12        
[25] xml2_1.2.0             scales_1.0.0           checkmate_1.8.5       
[28] genefilter_1.60.0      digest_0.6.17          foreign_0.8-69        
[31] XVector_0.18.0         base64enc_0.1-3        pkgconfig_2.0.2       
[34] htmltools_0.3.6        htmlwidgets_1.3        rlang_0.2.2           
[37] readxl_1.1.0           rstudioapi_0.7         RSQLite_2.1.1         
[40] bindr_0.1.1            jsonlite_1.5           BiocParallel_1.12.0   
[43] acepack_1.4.1          RCurl_1.95-4.11        magrittr_1.5          
[46] GenomeInfoDbData_1.0.0 Formula_1.2-3          Matrix_1.2-14         
[49] fansi_0.3.0            Rcpp_0.12.19           munsell_0.5.0         
[52] stringi_1.2.4          yaml_2.2.0             zlibbioc_1.24.0       
[55] plyr_1.8.4             blob_1.1.1             grid_3.4.4            
[58] crayon_1.3.4           lattice_0.20-35        haven_1.1.2           
[61] splines_3.4.4          hms_0.4.2              locfit_1.5-9.1        
[64] knitr_1.20             pillar_1.3.0           geneplotter_1.56.0    
[67] glue_1.3.0             latticeExtra_0.6-28    data.table_1.11.8     
[70] modelr_0.1.2           cellranger_1.1.0       gtable_0.2.0          
[73] assertthat_0.2.0       xtable_1.8-3           broom_0.5.0           
[76] survival_2.41-3        memoise_1.1.0          bindrcpp_0.2.2        
[79] cluster_2.0.7-1

I've been trying lots of guides and I can't get past through the annotation. When I type:

select(hgu133a.db, c("74694_s_at"), c("SYMBOL","ENTREZID", "GENENAME"))
'select()' returned 1:1 mapping between keys and columns
     PROBEID SYMBOL ENTREZID
1 74694_s_at RABEP2    79874
                                         GENENAME
1 rabaptin, RAB GTPase binding effector protein 2

I'm new to R, so I'm betting this has something to do with a very basic deal, hope you don't mind telling me.

Thank you for your time!

Maxi

DGE DESeq2 AnnotationDbi • 2.2k views

ADD COMMENT • link updated 6.5 years ago by James W. MacDonald 68k • written 6.5 years ago by maxininfa • 0

0

Entering edit mode

"I know this isn't how the proper analysis is made, but this is just a fast way ..."

You can't save time by doing something nonsensical, and treating normalized microarray intensities as if they were counts is in that category. You're following a guide that isn't at all designed for the type of data you have, as James has noted.

ADD REPLY • link 6.5 years ago Gordon Smyth 52k

score 3 · Accepted Answer · 2018-10-02

3

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 3 days ago

United States

First off, DESeq2 is not the right package to use for Affymetrix arrays! Did you get that idea somewhere? It's a bad idea, so you should stop doing that and use limma.

Secondly, you are using an old version of R/Bioconductor so you should update.

Third, if I am not mistaken, using row.names on a tibble returns the row names rather than a column called 'row'. I could be wrong, as I avoid as much of the tidyverse as possible; my go-to move when confronted with a tibble is to wrap it in as.data.frame, so I can interact with it like a civilized person. But to my jaundiced eye, it looks like the row names for that tibble are just 1:22283.

Anyway, if you get an error from mapIds that says the keys you are using aren't valid keys, that means that the keys you are using aren't the things you think they are. Your next move should be to look at the keys and try to figure out what the problem is. My guess is you are extracting 1:22282, which are not Affy IDs, and which is causing the error.

ADD COMMENT • link 6.5 years ago James W. MacDonald 68k

0

Entering edit mode

Hi James!
1) I didn't know that there was such a restricted way to do things regarding this data, thank you for pointing it out.

2) I think I have this particulars versions because there are some packages that arent supported in newer versions (now I can't remember which, but it truly rings a bell in my brain)
3) Im such a newby that I never dug on what a tibble was, I simply followed the guide I found. I find your phrase "I can interact with it like a civilized person" made me laugh hard, my PhD lab partner and mentor always scold me on how caveman my scripting was (still is). You are right on that the row.names I'm getting are just 1:22283. When I run:

> (res[[1]])
   [1] "AFFX-TrpnX-M_at"             "AFFX-TrpnX-5_at"            
   [3] "AFFX-TrpnX-3_at"             "AFFX-ThrX-M_at"

So as you can see, the column is there, it's just that I'm not accessing it right.

I will get into the data I'm using so I can deal with this, but if I can't I will follow your advice and try wrapping it in as.data.frame

Thank you for your time!

ADD REPLY • link 6.5 years ago maxininfa • 0

0

Entering edit mode

1.) Any time you are doing inferential statistics, the first thing you have to do is figure out the null distribution of your test statistic. In other words, you want to be able to say what the likely range and frequency of your test statistic will be, under the null distribution, because you will compare your observed results to that null distribution in order to say how likely it is that your observed statistic is something you expect to see (under the null) or not. For count data, like what you get from doing RNA-Seq, the canonical null distribution is the Poisson distribution, and for over-dispersed counts like RNA-Seq, you might use the negative binomial distribution. That is what DESeq uses. But you don't have count data! You have data, that upon taking logs, is reasonably 'hump-shaped', in which case it's conventional to say the null distribution is the t-distribution, which is what the limma package uses.

So the restriction here is that the underlying null distribution for DESeq2 isn't reasonable for microarray data, which are continuously distributed rather than being discrete counts.

2.) If there are packages that aren't supported in the new version of R, that implies that they are no longer supported, in which case you should strongly consider finding a package that is supported that will accomplish the task you want to accomplish. Using busted old unsupported packages is a recipe for disaster, IMO.

ADD REPLY • link 6.5 years ago James W. MacDonald 68k