recount3 problem downloading data from source sra
1
0
Entering edit mode
@d2e5cc7f
Germany

Hello,

I have used recount3 several times to download TCGA and GTEx data and it worked perfectly. Thanks for the package! I now want to download a different dataset. I located it in the study explorer and used the R code that it generates to access the data:

library("recount3")
rse_data <- recount3::create_rse_manual(
  project = "SRP045225",
  project_home = "data_sources/sra",
  organism = "human",
  annotation = "gencode_v29",
  type = "gene"
)

traceback()
sessionInfo()

Unfortunately, I run into the following error and, as a result, the data cannot be downloaded. I have tried the different annotations available, but I keep getting the same error messages and no data is downloaded. This did not happen when I downloaded TCGA and GTEx data, or other datasets from data_sources/sra, so I am wondering whether there is a problem with this particular dataset. Has anybody had a similar error? Any suggestions for solving it? Many thanks in advance to everybody and to the package developers/maintainers!

Here is the evaluated code:

> rse_data <- recount3::create_rse_manual(
+   project = "SRP045225",
+   project_home = "data_sources/sra",
+   organism = "human",
+   annotation = "gencode_v29",
+   type = "gene"
+ )
2024-05-21 16:34:55.952222 downloading and reading the metadata.
2024-05-21 16:34:56.595507 caching file sra.sra.SRP045225.MD.gz.
adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz'
Error in BiocFileCache::bfcrpath(bfc, url, exact = TRUE, verbose = verbose) : 
  not all 'rnames' found or unique.
In addition: Warning messages:
1: download failed
  web resource path: 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz'
  local file path: 'C:\Users\Garcia\AppData\Local/R/cache/R/recount3/5f5410c23fc1_sra.sra.SRP045225.MD.gz'
  reason: Received HTTP/0.9 when not allowed 
2: bfcadd() failed; resource removed
  rid: BFC352
  fpath: 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz'
  reason: download failed 
3: In value[[3L]](cond) : 
trying to add rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz' produced error:
  bfcadd() failed; see warnings()
> traceback()
8: stop("not all 'rnames' found or unique.")
7: BiocFileCache::bfcrpath(bfc, url, exact = TRUE, verbose = verbose)
6: BiocFileCache::bfcrpath(bfc, url, exact = TRUE, verbose = verbose)
5: FUN(X[[i]], ...)
4: vapply(url, file_retrieve, character(1), bfc = bfc, verbose = verbose)
3: file_retrieve(url = locate_url(project = project, project_home = project_home, 
       type = "metadata", organism = organism, annotation = annotation, 
       recount3_url = recount3_url), bfc = bfc, verbose = verbose)
2: read_metadata(file_retrieve(url = locate_url(project = project, 
       project_home = project_home, type = "metadata", organism = organism, 
       annotation = annotation, recount3_url = recount3_url), bfc = bfc, 
       verbose = verbose))
1: recount3::create_rse_manual(project = "SRP045225", project_home = "data_sources/sra", 
       organism = "human", annotation = "gencode_v29", type = "gene")
> sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8    LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] recount3_1.12.0             SummarizedExperiment_1.32.0 Biobase_2.62.0              GenomicRanges_1.54.1       
 [5] GenomeInfoDb_1.38.8         IRanges_2.36.0              S4Vectors_0.40.1            BiocGenerics_0.48.1        
 [9] MatrixGenerics_1.14.0       matrixStats_1.3.0          

loaded via a namespace (and not attached):
 [1] rjson_0.2.21             lattice_0.22-6           vctrs_0.6.5              tools_4.3.3              bitops_1.0-7            
 [6] generics_0.1.3           curl_5.2.1               parallel_4.3.3           tibble_3.2.1             fansi_1.0.6             
[11] RSQLite_2.3.6            blob_1.2.4               pkgconfig_2.0.3          R.oo_1.26.0              Matrix_1.6-5            
[16] data.table_1.15.4        dbplyr_2.5.0             lifecycle_1.0.4          GenomeInfoDbData_1.2.11  compiler_4.3.3          
[21] Rsamtools_2.18.0         Biostrings_2.70.2        codetools_0.2-20         RCurl_1.98-1.14          yaml_2.3.8              
[26] pillar_1.9.0             crayon_1.5.2             R.utils_2.12.3           BiocParallel_1.36.0      DelayedArray_0.28.0     
[31] cachem_1.0.8             sessioninfo_1.2.2        abind_1.4-5              tidyselect_1.2.1         purrr_1.0.2             
[36] dplyr_1.1.4              restfulr_0.0.15          fastmap_1.1.1            grid_4.3.3               cli_3.6.2               
[41] SparseArray_1.2.4        magrittr_2.0.3           S4Arrays_1.2.1           XML_3.99-0.16.1          utf8_1.2.4              
[46] withr_3.0.0              filelock_1.0.3           bit64_4.0.5              XVector_0.42.0           httr_1.4.7              
[51] bit_4.0.5                R.methodsS3_1.8.2        memoise_2.0.1            BiocIO_1.12.0            BiocFileCache_2.10.2    
[56] rtracklayer_1.62.0       rlang_1.1.3              glue_1.7.0               DBI_1.2.2                rstudioapi_0.16.0       
[61] R6_2.5.1                 GenomicAlignments_1.38.2 zlibbioc_1.48.0
@lcolladotor
United States

Hi,

I wasn't able to reproduce the issue on either bioc 3.19 (the current release version) or 3.18 (closer to your session info). My guess is that you have some corrupted files in your recount3 cache directory, or that the connection was failing when you tried to download the files (and it isn't right now). Thus, I recommend trying the download again. You could use recount3::recount3_cache(), like I did in the bioc 3.18 test, to test with a completely new recount3 cache. If that works but it keeps failing with your original recount3 cache, you could use recount3::recount3_cache_rm() (https://research.libd.org/recount3/reference/recount3_cache_rm.html) to wipe out your original cache.
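
As a rough sketch of the "completely new cache" test described above: the directory name is hypothetical, and the call assumes recount3 is installed and the network is reachable. It passes a fresh recount3_cache() through the bfc argument and reports the error instead of stopping, so it is easy to see whether the failure depends on the existing cache.

```r
# Sketch: retry the download with a brand-new, empty recount3 cache.
# The cache location is a hypothetical choice; any empty directory works.
fresh_cache_dir <- file.path(tempdir(), "test_recount3_cache")

if (requireNamespace("recount3", quietly = TRUE)) {
    rse_data <- tryCatch(
        recount3::create_rse_manual(
            project = "SRP045225",
            project_home = "data_sources/sra",
            organism = "human",
            annotation = "gencode_v29",
            type = "gene",
            # Use the fresh cache instead of the default one
            bfc = recount3::recount3_cache(cache_dir = fresh_cache_dir)
        ),
        error = function(e) {
            # Report the failure and return NULL instead of stopping
            message("Still failing with a fresh cache: ", conditionMessage(e))
            NULL
        }
    )
}
```

If this succeeds while the default cache keeps failing, the default cache is the likely culprit. If it fails in the same way, the cache is probably not the cause.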

Best, Leo

PS I don't think that this is related to https://github.com/curl/curl/issues/13725, but your error messages could be related to issues with curl. Just in case, I've included the output of curl::curl_version() for me. I'm using curl 8.6.0 and it's not triggering curl issue 13725, which is why I don't think that the issues are related.
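
To make it easier to compare curl setups between two machines, something like the following could be used. It is only a sketch: summarise_curl() is a hypothetical helper, and it just picks a few fields out of the list that curl::curl_version() returns.

```r
# Sketch: collect the libcurl details most relevant when comparing machines.
# summarise_curl() is a hypothetical helper, not part of any package.
summarise_curl <- function(info) {
    # 'info' is expected to look like the list returned by curl::curl_version()
    c(
        libcurl = info$version,
        ssl = info$ssl_version,
        http2 = as.character(info$http2)
    )
}

if (requireNamespace("curl", quietly = TRUE)) {
    print(summarise_curl(curl::curl_version()))
}
```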


Hello Leonardo,

Thanks for suggesting possible solutions. Comparing with your session info, I noticed that I was using the previous version of R. I first installed the newest version of R and bioc 3.19, updated, and then tried your suggested solutions. However, I still get the same error:

> library("recount3")
> args(recount3_cache)
function (cache_dir = getOption("recount3_cache", NULL)) 
NULL
> rse_data <- recount3::create_rse_manual(
+  project = "SRP045225",
+  project_home = "data_sources/sra",
+  organism = "human",
+  annotation = "gencode_v29",
+  type = "gene",
+  bfc = recount3_cache(cache_dir = "~/Desktop/test_recount3_bioc3.19")
+ )
2024-05-22 10:21:08.943834 downloading and reading the metadata.
2024-05-22 10:21:09.985189 caching file sra.sra.SRP045225.MD.gz.
adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz'
Error in BiocFileCache::bfcrpath(bfc, url, exact = TRUE, verbose = verbose) : 
  not all 'rnames' found or unique.
In addition: Warning messages:
1: download failed
  web resource path: 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz'
  local file path: '~/Desktop/test_recount3_bioc3.19/20d466d8f1_sra.sra.SRP045225.MD.gz'
  reason: Received HTTP/0.9 when not allowed 
2: bfcadd() failed; resource removed
  rid: BFC2
  fpath: 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz'
  reason: download failed 
3: In value[[3L]](cond) : 
trying to add rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz' produced error:
  bfcadd() failed; see warnings()
> traceback()
8: stop("not all 'rnames' found or unique.")
7: BiocFileCache::bfcrpath(bfc, url, exact = TRUE, verbose = verbose)
6: BiocFileCache::bfcrpath(bfc, url, exact = TRUE, verbose = verbose)
5: FUN(X[[i]], ...)
4: vapply(url, file_retrieve, character(1), bfc = bfc, verbose = verbose)
3: file_retrieve(url = locate_url(project = project, project_home = project_home, 
       type = "metadata", organism = organism, annotation = annotation, 
       recount3_url = recount3_url), bfc = bfc, verbose = verbose)
2: read_metadata(file_retrieve(url = locate_url(project = project, 
       project_home = project_home, type = "metadata", organism = organism, 
       annotation = annotation, recount3_url = recount3_url), bfc = bfc, 
       verbose = verbose))
1: recount3::create_rse_manual(project = "SRP045225", project_home = "data_sources/sra", 
       organism = "human", annotation = "gencode_v29", type = "gene", 
       bfc = recount3_cache(cache_dir = "~/Desktop/test_recount3_bioc3.19"))
> sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8    LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] recount3_1.14.0             SummarizedExperiment_1.34.0 Biobase_2.64.0              GenomicRanges_1.56.0       
 [5] GenomeInfoDb_1.40.0         IRanges_2.38.0              S4Vectors_0.42.0            BiocGenerics_0.50.0        
 [9] MatrixGenerics_1.16.0       matrixStats_1.3.0          

loaded via a namespace (and not attached):
 [1] rjson_0.2.21             lattice_0.22-6           vctrs_0.6.5              tools_4.4.0              bitops_1.0-7            
 [6] generics_0.1.3           parallel_4.4.0           curl_5.2.1               tibble_3.2.1             fansi_1.0.6             
[11] RSQLite_2.3.6            blob_1.2.4               pkgconfig_2.0.3          R.oo_1.26.0              Matrix_1.7-0            
[16] data.table_1.15.4        dbplyr_2.5.0             lifecycle_1.0.4          GenomeInfoDbData_1.2.12  compiler_4.4.0          
[21] Rsamtools_2.20.0         Biostrings_2.72.0        codetools_0.2-20         yaml_2.3.8               RCurl_1.98-1.14         
[26] pillar_1.9.0             crayon_1.5.2             R.utils_2.12.3           BiocParallel_1.38.0      DelayedArray_0.30.1     
[31] cachem_1.1.0             sessioninfo_1.2.2        abind_1.4-5              tidyselect_1.2.1         purrr_1.0.2             
[36] dplyr_1.1.4              restfulr_0.0.15          fastmap_1.2.0            grid_4.4.0               cli_3.6.2               
[41] SparseArray_1.4.5        magrittr_2.0.3           S4Arrays_1.4.1           XML_3.99-0.16.1          utf8_1.2.4              
[46] withr_3.0.0              filelock_1.0.3           UCSC.utils_1.0.0         bit64_4.0.5              XVector_0.44.0          
[51] httr_1.4.7               bit_4.0.5                R.methodsS3_1.8.2        memoise_2.0.1            BiocIO_1.14.0           
[56] BiocFileCache_2.12.0     rtracklayer_1.64.0       rlang_1.1.3              glue_1.7.0               DBI_1.2.2               
[61] rstudioapi_0.16.0        jsonlite_1.8.8           R6_2.5.1                 GenomicAlignments_1.40.0 zlibbioc_1.50.0         
> curl::curl_version()
$version
[1] "8.3.0"

$ssl_version
[1] "(OpenSSL/3.1.2) Schannel"

$libz_version
[1] "1.3"

$libssh_version
[1] "libssh2/1.11.0"

$libidn_version
[1] NA

$host
[1] "x86_64-w64-mingw32"

$protocols
 [1] "dict"    "file"    "ftp"     "ftps"    "gopher"  "gophers" "http"    "https"   "imap"    "imaps"   "ldap"    "ldaps"  
[13] "mqtt"    "pop3"    "pop3s"   "rtsp"    "scp"     "sftp"    "smb"     "smbs"    "smtp"    "smtps"   "telnet"  "tftp"   

$ipv6
[1] TRUE

$http2
[1] TRUE

$idn
[1] TRUE

I tried using recount3_cache_rm() and still got the error:

> library("recount3")
> recount3::recount3_cache_rm()
> rse_data <- recount3::create_rse_manual(
+  project = "SRP045225",
+  project_home = "data_sources/sra",
+  organism = "human",
+  annotation = "gencode_v29",
+  type = "gene"
+ )
2024-05-22 10:26:21.399864 downloading and reading the metadata.
2024-05-22 10:26:22.278906 caching file sra.sra.SRP045225.MD.gz.
adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz'
Error in BiocFileCache::bfcrpath(bfc, url, exact = TRUE, verbose = verbose) : 
  not all 'rnames' found or unique.
In addition: Warning messages:
1: download failed
  web resource path: 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz'
  local file path: 'C:\Users\Garcia\AppData\Local/R/cache/R/recount3/20d427471292_sra.sra.SRP045225.MD.gz'
  reason: Received HTTP/0.9 when not allowed 
2: bfcadd() failed; resource removed
  rid: BFC357
  fpath: 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz'
  reason: download failed 
3: In value[[3L]](cond) : 
trying to add rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz' produced error:
  bfcadd() failed; see warnings()
> traceback()
8: stop("not all 'rnames' found or unique.")
7: BiocFileCache::bfcrpath(bfc, url, exact = TRUE, verbose = verbose)
6: BiocFileCache::bfcrpath(bfc, url, exact = TRUE, verbose = verbose)
5: FUN(X[[i]], ...)
4: vapply(url, file_retrieve, character(1), bfc = bfc, verbose = verbose)
3: file_retrieve(url = locate_url(project = project, project_home = project_home, 
       type = "metadata", organism = organism, annotation = annotation, 
       recount3_url = recount3_url), bfc = bfc, verbose = verbose)
2: read_metadata(file_retrieve(url = locate_url(project = project, 
       project_home = project_home, type = "metadata", organism = organism, 
       annotation = annotation, recount3_url = recount3_url), bfc = bfc, 
       verbose = verbose))
1: recount3::create_rse_manual(project = "SRP045225", project_home = "data_sources/sra", 
       organism = "human", annotation = "gencode_v29", type = "gene")

Do you have any other guesses as to what might be going on? Thanks a lot for the help!


Hi,

What do you get if you run the following httr::HEAD() command? I noticed from curl::curl_version()$version that you have libcurl version 8.3.0. I installed version 8.8.0 earlier today (see https://github.com/Bioconductor/BiocFileCache/issues/48#issuecomment-2124935008 for all the details), but when I replied earlier I was using version 8.6.0. Given what I learned recently about BiocFileCache's internal functions, it eventually uses httr::HEAD(), so I'm curious whether that's where the issue lies, given the error message you get: "reason: Received HTTP/0.9 when not allowed".

libcurl version 8.8.0 is available from https://github.com/curl/curl/releases/tag/curl-8_8_0, and https://snyk.io/blog/how-to-update-curl/ has some instructions for Windows users. Note that it says that "anything less than 8.4.0 will need to be updated", so maybe there's a known issue with version 8.3.0 that I'm unaware of (I'm far from being a libcurl connoisseur).
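
A quick way to check an installed libcurl against that 8.4.0 threshold is sketched here. is_older_libcurl() is a hypothetical helper, and the threshold comes from the blog post quoted above; it is not a confirmed cause of this error.

```r
# Sketch: compare a libcurl version string against a minimum version.
# is_older_libcurl() is a hypothetical helper; "8.4.0" is the threshold
# quoted from the blog post, not a confirmed fix for this issue.
is_older_libcurl <- function(installed, minimum = "8.4.0") {
    numeric_version(installed) < numeric_version(minimum)
}

is_older_libcurl("8.3.0")  # TRUE: the version in the Windows session here
is_older_libcurl("8.8.0")  # FALSE

# Check the libcurl that R is actually using, if the curl package is installed
if (requireNamespace("curl", quietly = TRUE)) {
    print(is_older_libcurl(curl::curl_version()$version))
}
```

numeric_version() compares the version components numerically, so "8.10.0" is correctly treated as newer than "8.4.0", which a plain string comparison would get wrong.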

Best, Leo

> httr::HEAD("http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz")
Response [https://recount-opendata.s3.amazonaws.com/recount3/release/human/data_sources/sra/metadata/25/SRP045225/sra.sra.SRP045225.MD.gz]
  Date: 2024-05-22 21:08
  Status: 200
  Content-Type: text/markdown
<EMPTY BODY>
> curl::curl_version()
$version
[1] "8.8.0"

$ssl_version
[1] "SecureTransport"

$libz_version
[1] "1.2.12"

$libssh_version
[1] NA

$libidn_version
[1] "2.3.7"

$host
[1] "aarch64-apple-darwin"

$protocols
 [1] "dict"    "file"    "ftp"     "ftps"    "gopher"  "gophers" "http"
 [8] "https"   "imap"    "imaps"   "ldap"    "ldaps"   "mqtt"    "pop3"
[15] "pop3s"   "rtsp"    "smb"     "smbs"    "smtp"    "smtps"   "telnet"
[22] "tftp"

$ipv6
[1] TRUE

$http2
[1] TRUE

$idn
[1] TRUE

> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] httr_1.4.7     compiler_4.4.0 R6_2.5.1       curl_5.2.1
>