Hi all,
I have been developing a script to pull data from GEO, format it, and output user-defined subgroups. However, every now and then getGEO returns a series that is not parsed correctly, which produces a badly malformed expression matrix.
An example can be seen below:
library(GEOquery)
data <- getGEO("GSE2193")
> Found 5 file(s) GSE2193-GPL1823_series_matrix.txt.gz Using locally
> cached version:
> C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GSE2193-GPL1823_series_matrix.txt.gz
> Parsed with column specification: cols( .default = col_double() )
> See spec(...) for full column specifications. Using locally cached
> version of GPL1823 found here:
> C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GPL1823.soft
> GSE2193-GPL1824_series_matrix.txt.gz Using locally cached version:
> C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GSE2193-GPL1824_series_matrix.txt.gz
> Parsed with column specification: cols( .default = col_double() )
> See spec(...) for full column specifications. Using locally cached
> version of GPL1824 found here:
> C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GPL1824.soft
> GSE2193-GPL1825_series_matrix.txt.gz Using locally cached version:
> C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GSE2193-GPL1825_series_matrix.txt.gz
> Parsed with column specification: cols( .default = col_double() )
> See spec(...) for full column specifications. Using locally cached
> version of GPL1825 found here:
> C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GPL1825.soft
> GSE2193-GPL1826_series_matrix.txt.gz Using locally cached version:
> C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GSE2193-GPL1826_series_matrix.txt.gz
> Parsed with column specification: cols( `1` = col_double(),
> `-.954` = col_double(), `.104` = col_double(), `-1.08` =
> col_double(), X5 = col_double(), `-1.6` = col_double(), X7 =
> col_double(), `-.14` = col_double(), `-.256` = col_double(),
> `.929` = col_double(), `.205` = col_double(), `-.939` =
> col_double() ) Using locally cached version of GPL1826 found here:
> C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GPL1826.soft
> GSE2193-GPL1827_series_matrix.txt.gz Using locally cached version:
> C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GSE2193-GPL1827_series_matrix.txt.gz
> Parsed with column specification: cols( `1` = col_double(),
> `-.226` = col_double(), `.85` = col_double(), `.239` =
> col_double(), `.239_1` = col_double(), `.239_2` = col_double(),
> `.239_3` = col_double(), `.239_4` = col_double(), `.597` =
> col_double() ) Using locally cached version of GPL1827 found here:
> C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GPL1827.soft Warning
> messages: 1: Missing column names filled in: 'X3' [3], 'X12' [12],
> 'X24' [24] 2: Missing column names filled in: 'X11' [11], 'X25' [25],
> 'X28' [28], 'X29' [29], 'X33' [33], 'X36' [36], 'X40' [40] 3: Missing
> column names filled in: 'X2' [2], 'X8' [8], 'X9' [9], 'X11' [11],
> 'X19' [19], 'X20' [20], 'X24' [24], 'X25' [25], 'X26' [26], 'X31' [31]
> 4: Missing column names filled in: 'X5' [5], 'X7' [7] 5: Duplicated
> column names deduplicated: '.239' => '.239_1' [5], '.239' => '.239_2'
> [6], '.239' => '.239_3' [7], '.239' => '.239_4' [8]
As you can see, it doesn't appear to find the column names and so uses the first row of values as the header instead. This produces an expression matrix like the one below (note: only one platform is shown as an example).
head(data[["GSE2193-GPL1823_series_matrix.txt.gz"]]@assayData[["exprs"]])
> -1.587 X3 1.225 -.195 -1.002 1.519 1.894 -.881 -.354 -.463 X12 -.982 -.393 .047 -.268 .401
>2 -0.771 -0.049 NA 0.353 -1.880 -0.785 -0.965 NA -1.866 -1.807 NA -1.936 0.062 -1.257 -0.663 -1.878
>3 -1.753 -1.470 0.320 -0.499 -0.290 -1.026 1.175 1.396 1.291 1.032 -0.679 -1.995 -0.008 -0.525 -0.094 0.399
>4 0.563 0.195 0.006 1.214 1.506 0.931 0.405 0.178 0.242 1.476 0.357 -0.226 0.588 0.549 1.129 0.008
>5 1.292 0.864 -1.452 0.866 -0.298 0.492 2.379 2.310 0.012 0.784 -0.502 0.573 0.369 -0.171 0.259 -1.411
>6 -1.004 -0.202 NA 0.617 -2.138 -0.436 -0.620 NA -0.428 -0.026 NA -0.844 0.686 -0.505 -0.353 -1.745
>7 1.764 1.416 NA 0.325 -1.795 -1.535 4.634 6.749 0.313 0.427 -3.887 1.149 -1.297 -0.928 -1.351 NA
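To see which of the returned platforms are affected, a quick check on the column names can help. This is only a minimal sketch: `has_gsm_colnames` is a hypothetical helper of my own (not part of GEOquery), and it assumes that correctly parsed sample columns are always named GSM followed by digits.

```r
# Hypothetical helper (not part of GEOquery): TRUE only if every column
# name of the matrix looks like a GEO sample accession (GSM + digits).
has_gsm_colnames <- function(m) {
  cn <- colnames(m)
  !is.null(cn) && length(cn) > 0 && all(grepl("^GSM[0-9]+$", cn))
}

# Possible usage on the list returned by getGEO(), assuming each element
# is an ExpressionSet:
# sapply(data, function(es) has_gsm_colnames(Biobase::exprs(es)))
```

Any element that comes back FALSE would be one where the header row was apparently lost during parsing.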
It seems the downloaded .txt.gz files do contain column headers (i.e. GSM###), but they are not being parsed when the matrix is created. I have tried re-running and removing any cached versions, but with no success.
Is this a bug?
I could find a workaround (i.e. saving the files locally, reading them in as text files and then reformatting), but I am hoping to avoid such a hassle.
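For reference, the workaround I have in mind would look roughly like the sketch below. It uses base R only and assumes the standard series matrix layout, where the expression table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` lines and its first column is the probe ID. `read_series_matrix` is a hypothetical name of my own, not a GEOquery function.

```r
# Hypothetical helper (not part of GEOquery): read the expression table
# from a GEO series matrix file (.txt or .txt.gz) using base R only.
read_series_matrix <- function(path) {
  con <- gzfile(path)            # gzfile() also reads uncompressed files
  on.exit(close(con))
  lines <- readLines(con)

  # Locate the table markers; everything else in the file is "!" metadata.
  begin <- grep("^!series_matrix_table_begin", lines)
  end   <- grep("^!series_matrix_table_end", lines)
  if (length(begin) != 1 || length(end) != 1 || end <= begin + 1) {
    stop("Could not locate the series matrix table in ", path)
  }

  # The first line after the begin marker is the header (ID_REF, GSM...).
  tab <- read.delim(
    text = lines[(begin + 1):(end - 1)],
    header = TRUE,
    check.names = FALSE,         # keep sample names exactly as written
    row.names = 1                # first column (ID_REF) becomes row names
  )
  as.matrix(tab)
}

# Possible usage, with the path to one of the downloaded files:
# m <- read_series_matrix("path/to/GSE2193-GPL1823_series_matrix.txt.gz")
# head(m)
```

This only recovers the expression matrix; the sample and platform annotation would still have to be attached separately, which is the part I would rather not redo by hand.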
Many thanks,
Andy
> sessionInfo()
> R version 3.5.3 (2019-03-11)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 8.1 x64 (build 9600)
>
> Matrix products: default
>
> locale: [1] LC_COLLATE=English_United Kingdom.1252
> LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United
> Kingdom.1252 LC_NUMERIC=C [5]
> LC_TIME=English_United Kingdom.1252
>
> attached base packages: [1] parallel stats graphics grDevices
> utils datasets methods base
>
> other attached packages: [1] xml2_1.2.0 data.table_1.12.2
> ggplot2_3.1.1 DT_0.5 shiny_1.3.2 [6]
> plyr_1.8.4 GEOquery_2.50.5 Biobase_2.42.0
> BiocGenerics_0.28.0
>
> loaded via a namespace (and not attached): [1] Rcpp_1.0.1
> pillar_1.3.1 compiler_3.5.3 later_0.8.0 tools_3.5.3
> digest_0.6.18 [7] tibble_2.1.1 gtable_0.3.0
> pkgconfig_2.0.2 rlang_0.3.4 rstudioapi_0.10 curl_3.3
> [13] withr_2.1.2 dplyr_0.8.0.1 htmlwidgets_1.3 hms_0.4.2
> grid_3.5.3 tidyselect_0.2.5 [19] glue_1.3.1 R6_2.4.0
> limma_3.38.3 tidyr_0.8.3 readr_1.3.1 purrr_0.3.2
> [25] magrittr_1.5 scales_1.0.0 promises_1.0.1
> htmltools_0.3.6 assertthat_0.2.1 colorspace_1.4-1 [31] mime_0.6
> xtable_1.8-4 httpuv_1.5.1 stringi_1.4.3 lazyeval_0.2.2
> munsell_0.5.0 [37] crayon_1.3.4