rhdf5: error trying to access S3 cloud files from NASA earthaccess
Caio

I am trying to use the rhdf5 package to access S3 data from NASA's earthaccess, but without success. Can you help me out here? I need to use direct cloud access because I am running on their cloud computing system and the machine is in the same region, so it can read the data very fast without having to download it.

I'm using this code (with the correct username and password), but it is throwing an error saying it was unable to open the file.

########################
# Get the s3 credentials
########################
user = "username"
password = "password"

library(httr2)
req <- request("https://data.nsidc.earthdatacloud.nasa.gov/s3credentials") %>% 
    req_auth_basic(user, password) 

res = req %>% httr2::req_options(followlocation=TRUE) %>% httr2::req_perform()
obj = res %>% httr2::resp_body_string() %>% jsonlite::parse_json()

s3_cred <- list(
    aws_region = "us-west-2",
    access_key_id = obj$accessKeyId,
    secret_access_key = obj$secretAccessKey
)


#############
## Search for the link of some data within NASA earthdata
#############
searchUrl = 'https://cmr.earthdata.nasa.gov/search/granules.json?short_name=ATL08&version=006&bounding_box=-44.1503575306397,-13.7583117535016,-44.1006630227247,-13.712436430646'

res = httr2::request(searchUrl) %>% httr2::req_perform()
granules = res %>% httr2::resp_body_json()


# Get the s3Link for the data
library(tidyr)  # provides the %>% pipe
granuleATL08 <- granules$feed$entry[[1]]   # first matching granule from the CMR response
(s3Link = sapply(granuleATL08$links, `[[`, 'href') %>% grep(pattern = "s3:.*h5$", value=TRUE))
##  's3://nsidc-cumulus-prod-protected/ATLAS/ATL08/006/2018/10/19/ATL08_20181019181413_03220114_006_02.h5'
library(rhdf5)

rhdf5::h5ls(
     file = s3Link,
     s3 = TRUE,
     s3credentials = s3_cred
)
# Error in H5Fopen(file, flags = flags, fapl = fapl, native = native): HDF5. File accessibility. Unable to open file.
# Traceback:
# 
# 1. h5ls(file = s3Link, s3 = TRUE, s3credentials = s3_cred)
# 2. h5checktypeOrOpenLocS3(file, readonly = TRUE, fapl = fapl, native = native)
# 3. H5Fopen(file, flags = flags, fapl = fapl, native = native)

But if I use exactly the same credentials object (obj) with the s3fs package, it works fine:

s3 = s3fs::S3FileSystem$new(
    aws_access_key_id = obj$accessKeyId,
    aws_secret_access_key = obj$secretAccessKey,
    aws_session_token = obj$sessionToken
)

s3$is_file(s3Link)
## TRUE

I wonder if it is because of the s3:// protocol instead of https://, but I did try changing it to https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com and it didn't work either.
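For reference, the https attempt looked roughly like this (assuming the standard virtual-hosted-style S3 URL for the bucket); it fails with the same "unable to open file" error:

# Rewrite the s3:// link as a virtual-hosted-style https URL (assumed form)
httpsLink <- sub(
    "^s3://nsidc-cumulus-prod-protected/",
    "https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com/",
    s3Link
)

rhdf5::h5ls(file = httpsLink, s3 = TRUE, s3credentials = s3_cred)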

Mike Smith (EMBL Heidelberg)

I suspect this is a limitation of the HDF5 S3 virtual file driver, which doesn't let you provide a session token. It looks like a token is required to access this bucket, but there's no way to pass it to the HDF5 library, so rhdf5 won't work.
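To make that concrete: the s3credentials argument in rhdf5 is a list of exactly three values, so there is nowhere to pass the sessionToken that the Earthdata credentials endpoint returns alongside the key pair:

# Everything rhdf5 can currently hand to the ROS3 driver:
list(
    aws_region        = "us-west-2",
    access_key_id     = obj$accessKeyId,
    secret_access_key = obj$secretAccessKey
)
# obj$sessionToken has no corresponding slot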

I tried using the s3fs approach to see if there was any way I could modify rhdf5 to work with that, but I kept running into permissions issues. Running the code you provided with my own NASA credentials, and then running s3$file_copy(s3Link, "/tmp/new.h5"), gets me a file containing:

<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>MHN75DG76917N5XG</RequestId>
<HostId>KAAozK6G/EtMkkv9OSUkAj9GrFQJuZCxIYS8slLQExZWfdJzz4uqH/bnKSLwKLDMFzXUBTlHqd9HIbx5BR/K7A==</HostId>
</Error>

Most of the documentation suggests you have to be running code in the AWS us-west-2 region to access the bucket, which I am not doing. Are you doing that in the example that works? That would help me understand whether I'm doing something wrong on my end or it's just the location I'm running the code from.

Caio
Yes, I am running the code within the us-west-2 region, so maybe that's why it isn't working on your end with s3fs. Anyway, I could use s3$file_copy() myself, but that won't work for me because it downloads the entire file, and I really just want to open a specific dataset within the HDF5 file.

I can do that through Python, but I was hoping to get it done in R without something like reticulate, because I already have a lot of functions written in R to process similar data. I am actually one of the developers of rGEDI, and I was thinking about extending it with the capability of opening files directly from the cloud so users would not need to download the entire file.

I know I could try to fetch only the header bytes from the cloud and then skip directly to the bytes of the target dataset, but I don't know the HDF5 format deeply enough and don't have the time to dig into that. I was hoping it could be done with a simple patch to the rhdf5 package.
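Just to illustrate the byte-range idea, something like the following should fetch only the first few bytes of the object from R (this uses the paws.storage package, which I would rather not build a full HDF5 reader on top of; the bucket and key are taken from the s3 link above):

# Sketch of a ranged GET with the temporary credentials via paws.storage
library(paws.storage)
svc <- s3(config = list(
    credentials = list(creds = list(
        access_key_id     = obj$accessKeyId,
        secret_access_key = obj$secretAccessKey,
        session_token     = obj$sessionToken
    )),
    region = "us-west-2"
))
resp <- svc$get_object(
    Bucket = "nsidc-cumulus-prod-protected",
    Key    = "ATLAS/ATL08/006/2018/10/19/ATL08_20181019181413_03220114_006_02.h5",
    Range  = "bytes=0-7"   # an HDF5 file starts with the signature \x89HDF\r\n\x1a\n
)
resp$Body   # raw vector containing just those 8 bytes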

I did try to read through the rhdf5 source, but I couldn't find exactly where the package actually makes the connection to S3.

Mike Smith
It looks like the HDF5 library is adding support for temporary access credentials in the current development version (1.15.0). Take a look at the release notes here:

  - Implemented support for temporary security credentials for the Read-Only
  S3 (ROS3) file driver.

  When using temporary security credentials, one also needs to specify a
  session/security token next to the access key id and secret access key.
  This token can be specified by the new API function H5Pset_fapl_ros3_token().
  The API function H5Pget_fapl_ros3_token() can be used to retrieve
  the currently set token.

Unfortunately we're only on version 1.10.7 in the Rhdf5lib/rhdf5 world, and it's quite a jump to do the update as there are a large number of API changes to consider. However, this is finally a use case that really demonstrates the need to update, so it'll move up my TODO list now.
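Just to sketch where that could lead once Rhdf5lib moves to a release with H5Pset_fapl_ros3_token(): the s3credentials argument could conceivably grow a fourth element for the token. To be clear, this is a hypothetical sketch, not a current or promised rhdf5 interface:

# HYPOTHETICAL -- rhdf5 does not currently accept a session token
rhdf5::h5ls(
    file = s3Link,
    s3 = TRUE,
    s3credentials = list(
        aws_region        = "us-west-2",
        access_key_id     = obj$accessKeyId,
        secret_access_key = obj$secretAccessKey,
        session_token     = obj$sessionToken   # hypothetical fourth element
    )
)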

Unfortunately I don't think there's a way to do this with rhdf5 as it stands. Sorry.

Caio
That's great Mike, thank you for the information. For some reason the h5py Python package does work, even though it uses version 1.10.4 of the HDF5 library. I actually just need to open the dataset through NASA's earthaccess package, which returns a granule object that I can pass to h5py as if it were a regular File object.

It looks like they open it through the s3fs Python package, which creates an adapter class for S3 that inherits from fsspec.AsyncFileSystem, so it just behaves like a "regular" file. I don't know how, or whether, we could do the same in R and pass its pointer to the HDF5 C library so it acts as if it were a regularly opened file. I believe we could do it using the s3fs C library (https://github.com/tongwang/s3fs-c), but I don't know if you are open to using another third-party library.
