I have been trying to use the rhdf5 package to access S3 data from NASA's earthaccess, but without success. Can you help me out here? I need direct cloud access because I am using their cloud computing system and the machine is in the same region, so it can read the data very fast without downloading it.
I'm using this code (with the correct username and password), but it is throwing an error saying it was unable to open the file.
########################
# Get the s3 credentials
########################
user = "username"
password = "password"
library(httr2)
req <- request("https://data.nsidc.earthdatacloud.nasa.gov/s3credentials") %>%
  req_auth_basic(user, password)
res = req %>% httr2::req_options(followlocation=TRUE) %>% httr2::req_perform()
obj = res %>% httr2::resp_body_string() %>% jsonlite::parse_json()
s3_cred <- list(
  aws_region = "us-west-2",
  access_key_id = obj$accessKeyId,
  secret_access_key = obj$secretAccessKey
)
#############
## Search for the link of some data within NASA earthdata
#############
searchUrl = 'https://cmr.earthdata.nasa.gov/search/granules.json?short_name=ATL08&version=006&bounding_box=-44.1503575306397,-13.7583117535016,-44.1006630227247,-13.712436430646'
res = httr2::request(searchUrl) %>% httr2::req_perform()
granules = res %>% httr2::resp_body_json()
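# Get the s3Link for the data
library(tidyr)
# Take the first granule of the search result (assumes the CMR JSON layout feed$entry)
granuleATL08 = granules$feed$entry[[1]]
(s3Link = sapply(granuleATL08$links, `[[`, 'href') %>% grep(pattern = "s3:.*h5$", value=TRUE))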
## 's3://nsidc-cumulus-prod-protected/ATLAS/ATL08/006/2018/10/19/ATL08_20181019181413_03220114_006_02.h5'
library(rhdf5)
rhdf5::h5ls(
  file = s3Link,
  s3 = TRUE,
  s3credentials = s3_cred
)
# Error in H5Fopen(file, flags = flags, fapl = fapl, native = native): HDF5. File accessibility. Unable to open file.
# Traceback:
#
# 1. h5ls(file = s3Link, s3 = TRUE, s3credentials = s3_cred)
# 2. h5checktypeOrOpenLocS3(file, readonly = TRUE, fapl = fapl, native = native)
# 3. H5Fopen(file, flags = flags, fapl = fapl, native = native)
But if I use exactly the same credentials obj with s3fs, it works fine:
s3 = s3fs::S3FileSystem$new(
  aws_access_key_id = obj$accessKeyId,
  aws_secret_access_key = obj$secretAccessKey,
  aws_session_token = obj$sessionToken
)
s3$is_file(s3Link)
## TRUE
I wonder if it is because of the s3:// protocol instead of https://, but I did try changing it to https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com and it didn't work either.
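For anyone wanting to rule out the URL form in the same way, here is a minimal sketch of the s3:// to https:// conversion. The helper name s3_to_https is hypothetical and the default region is an assumption taken from this thread. As reported above, the converted URL did not fix the rhdf5 error, so this only helps confirm that the URL scheme is not the cause.

```r
# Hypothetical helper: convert an s3:// URI into a virtual-hosted-style HTTPS URL.
# The default region ("us-west-2") is an assumption based on this thread.
s3_to_https <- function(s3_uri, region = "us-west-2") {
  stopifnot(grepl("^s3://[^/]+/.+", s3_uri))
  parts  <- sub("^s3://", "", s3_uri)   # drop the scheme
  bucket <- sub("/.*$", "", parts)      # text before the first slash
  key    <- sub("^[^/]+/", "", parts)   # everything after the bucket name
  sprintf("https://%s.s3.%s.amazonaws.com/%s", bucket, region, key)
}

s3Link <- "s3://nsidc-cumulus-prod-protected/ATLAS/ATL08/006/2018/10/19/ATL08_20181019181413_03220114_006_02.h5"
httpsLink <- s3_to_https(s3Link)
```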
Yes, I am running the code within the us-west-2 region; maybe that's why it isn't working on your end with s3fs. Anyway, I can use s3fs$file_copy myself, but that won't work for me because it downloads the entire file, and I really just want to open a specific dataset within the HDF5 file.
I can do that through Python, but I was hoping I could get it done in R without using something like reticulate, because I already have a lot of functions written in R to process similar data. Actually, I am one of the developers of rGEDI, and I was thinking about extending it with the capability of opening files directly from the cloud so that users would not need to download the entire file.
I know I could try to fetch only the header bytes from the cloud and then skip directly to the bytes of the target dataset, but I don't know the HDF5 format deeply enough and I do not have the time to dig into it. I was hoping that it could be done with a simple patch to the rhdf5 package.
I did try to understand rhdf5, but I can't find the code where the package actually makes the connection to S3.
It looks like the HDF5 library is adding support for temporary access credentials to the current developmental version (1.15.0). Take a look at the release notes here:
Unfortunately we're only on version 1.10.7 in the Rhdf5lib/rhdf5 world, and it's quite a jump to do the update as there's a large number of API changes that need to be considered. However this is finally a use case that really indicates a need to update, so it'll move up my TODO list now.
Unfortunately I don't think there's a way to do this with rhdf5 as it stands. Sorry.
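A minimal sketch of how one might check whether a credentials object is of the temporary kind that needs a session token. The function names are hypothetical; the field names accessKeyId and sessionToken are the ones used in the code earlier in this thread. The prefix rule relies on the AWS convention that temporary (STS) access key IDs start with "ASIA" and long-term ones with "AKIA"; the example values are placeholders, not real credentials.

```r
# Hypothetical helper: classify an AWS access key id by its prefix.
# "ASIA" = temporary (STS) credentials, "AKIA" = long-term access key.
credential_kind <- function(access_key_id) {
  prefix <- substr(access_key_id, 1, 4)
  switch(prefix, ASIA = "temporary", AKIA = "long-term", "unknown")
}

# Hypothetical helper: TRUE when the credentials look temporary, i.e. a
# key id/secret pair alone is unlikely to be enough to authenticate.
needs_session_token <- function(cred) {
  credential_kind(cred$accessKeyId) == "temporary" || !is.null(cred$sessionToken)
}

# Placeholder values for illustration only.
example_cred <- list(
  accessKeyId = "ASIAEXAMPLEKEYID0000",
  secretAccessKey = "example-secret",
  sessionToken = "example-token"
)
needs_session_token(example_cred)
```

If this returns TRUE for the obj returned by the s3credentials endpoint, that would be consistent with the explanation above: the key and secret alone are passed to rhdf5, while s3fs also receives the session token.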
That's great, Mike, thank you for the information. For some reason the h5py Python package does work, and it uses version 1.10.4 of the HDF5 library. I just need to open the dataset through earthaccess, NASA's package, and it returns a granule object which I can pass to h5py as if it were a regular File object.
It looks like they open it through the s3fs Python package, which creates an adapter class for S3 that inherits from fsspec.AsyncFileSystem, so it just behaves like a "regular" file. I don't know how, or whether, we can do the same in R and pass its pointer to the HDF5 C library so that it acts as if it were a regular open file. I believe we could do it using the s3fs C library (https://github.com/tongwang/s3fs-c), but I don't know if you are open to using another third-party library.
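To illustrate the "read only the bytes you need" idea that a file-like adapter depends on, here is a minimal sketch on a local file. It is not how rhdf5, h5py, or s3fs work internally; read_range and has_hdf5_signature are hypothetical names. Against S3 the same read would presumably be an HTTP Range request, which is not shown. The sketch only checks the 8-byte HDF5 format signature and assumes the superblock sits at offset 0.

```r
# The 8-byte HDF5 format signature: \x89 H D F \r \n \x1a \n
hdf5_signature <- as.raw(c(0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a))

# Hypothetical helper: read `length` bytes starting at byte `offset` (0-based).
read_range <- function(path, offset, length) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  seek(con, where = offset, origin = "start")
  readBin(con, what = "raw", n = length)
}

# Hypothetical helper: check the signature at offset 0 only.
# (The HDF5 format also allows the superblock at offsets 512, 1024, ...)
has_hdf5_signature <- function(path) {
  identical(read_range(path, 0, 8), hdf5_signature)
}

# Demo on a stand-in file: the signature followed by zero padding.
demo_file <- tempfile(fileext = ".h5")
writeBin(c(hdf5_signature, as.raw(rep(0, 24))), demo_file)
has_hdf5_signature(demo_file)
```

Locating an actual dataset would still require parsing the superblock and object headers, which is the part the HDF5 library normally handles.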