Authors: Anna Windle (NASA, SSAI), Ian Carroll (NASA, UMBC), Carina Poulin (NASA, SSAI)
An Earthdata Login account is required to access data from the NASA Earthdata system, including NASA ocean color data.
In this example we will use the earthaccess package to search for OCI products on NASA Earthdata. The earthaccess package, published on the [Python Package Index][pypi] and [conda-forge][conda], facilitates discovery and use of all NASA Earth Science data products by providing an abstraction layer for NASA’s [Common Metadata Repository (CMR) API][cmr] and by simplifying requests to NASA’s [Earthdata Cloud][edcloud]. Searching for data is more approachable with earthaccess than with low-level HTTP requests, and the same goes for S3 requests.
In short, earthaccess helps authenticate with Earthdata Login, makes search easier, and provides a streamlined way to load data into xarray containers. For more on earthaccess, visit the [documentation][earthaccess-docs] site. Be aware that earthaccess is under active development.
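To understand the discussions below on downloading and opening data, we need to clearly understand where our notebook is running. There are three cases to distinguish: the notebook is running on a local host, on a remote host that does not have direct access to the NASA Earthdata Cloud, or on a remote host that does have direct access to the NASA Earthdata Cloud.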
At the end of this notebook you will know how to use earthaccess to search for OCI data using search filters.
We begin by importing the only package used in this notebook. If you have created an environment following the guidance provided with this tutorial, then the import will be successful.
import earthaccess
We also need pathlib for directory creation, at least until earthaccess version 0.9.1 is available.
import pathlib
Next, we authenticate using our Earthdata Login credentials. Authentication is not needed to search publicly available collections in Earthdata, but is always needed to access data. We can use the login method from the earthaccess package. This will create an authenticated session when we provide a valid Earthdata Login username and password. The earthaccess package will search for credentials defined by environment variables or within a .netrc file saved in the home directory. If credentials are not found, an interactive prompt will allow you to input credentials.
The persist=True argument ensures any discovered credentials are stored in a .netrc file, so the argument is not necessary (but it’s also harmless) for subsequent calls to earthaccess.login.
auth = earthaccess.login(persist=True)
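The credential lookup described above can be sketched with the standard library alone. This is a minimal illustration, not how earthaccess is implemented: the helper name `find_credential_source` is hypothetical, and the environment variable names (`EARTHDATA_USERNAME`, `EARTHDATA_PASSWORD`) and the lookup order (environment first, then .netrc, then a prompt) are assumptions about the default behavior.

```python
import netrc
import os
from pathlib import Path

# Hostname of Earthdata Login, used as the "machine" entry in a .netrc file.
EDL_HOST = "urs.earthdata.nasa.gov"


def find_credential_source(env=None, netrc_path=None):
    """Report where Earthdata Login credentials would be found, if anywhere.

    Hypothetical helper for illustration; it never calls earthaccess.
    """
    env = os.environ if env is None else env
    # Assumption: these are the variable names earthaccess looks for.
    if env.get("EARTHDATA_USERNAME") and env.get("EARTHDATA_PASSWORD"):
        return "environment"
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    if path.is_file():
        try:
            entry = netrc.netrc(str(path)).authenticators(EDL_HOST)
        except netrc.NetrcParseError:
            entry = None
        if entry is not None:
            return "netrc"
    # Nothing found: this is the case where a prompt would be needed.
    return "interactive"


# Demonstration with an explicit, made-up environment mapping.
demo_env = {"EARTHDATA_USERNAME": "user", "EARTHDATA_PASSWORD": "secret"}
print(find_credential_source(env=demo_env))
```

A check like this can be handy before running a notebook non-interactively, where an unexpected prompt would stall execution.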
Collections on NASA Earthdata are discovered with the search_datasets function, which accepts an instrument filter as an easy way to get started. Each of the items in the list of collections returned has a “short-name”.
results = earthaccess.search_datasets(instrument="oci")
Datasets found: 20
for item in results:
    summary = item.summary()
    print(summary["short-name"])
PACE_OCI_L0_SCI
PACE_OCI_L1A_SCI
PACE_OCI_L1B_SCI
PACE_OCI_L1C_SCI
PACE_OCI_L2_AOP_NRT
PACE_OCI_L2_BGC_NRT
PACE_OCI_L2_IOP_NRT
PACE_OCI_L2_PAR_NRT
PACE_OCI_L3B_CHL_NRT
PACE_OCI_L3B_IOP_NRT
PACE_OCI_L3B_KD_NRT
PACE_OCI_L3B_PAR_NRT
PACE_OCI_L3B_POC_NRT
PACE_OCI_L3B_RRS_NRT
PACE_OCI_L3M_CHL_NRT
PACE_OCI_L3M_IOP_NRT
PACE_OCI_L3M_KD_NRT
PACE_OCI_L3M_PAR_NRT
PACE_OCI_L3M_POC_NRT
PACE_OCI_L3M_RRS_NRT
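With twenty collections, it can help to group the short names by processing level before choosing one. The sketch below works on a plain list of strings copied from the output above. The helper `group_by_level` is hypothetical, and it assumes the names follow the pattern PACE_OCI_<level>_..., which is inferred from this list rather than from any documented convention.

```python
from collections import defaultdict

# A few short names copied from the search output above.
short_names = [
    "PACE_OCI_L1B_SCI",
    "PACE_OCI_L2_AOP_NRT",
    "PACE_OCI_L2_BGC_NRT",
    "PACE_OCI_L3M_CHL_NRT",
]


def group_by_level(names):
    """Group short names by their third underscore-separated field."""
    groups = defaultdict(list)
    for name in names:
        parts = name.split("_")
        # Assumption: names look like PACE_OCI_<level>_<suite...>.
        level = parts[2] if len(parts) > 2 else "unknown"
        groups[level].append(name)
    return dict(groups)


levels = group_by_level(short_names)
print(levels["L2"])
```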
Next, we use the search_data function to find granules within a collection. Let’s use the short_name for the PACE/OCI Level-2 near real time (NRT) product for biogeochemical properties (although you can search for granules across collections too).
The short name can also be found on Earthdata Search, directly under the collection name, after clicking on the “i” button for a collection in any search result.
The count argument limits the number of granules returned and stored in the results list, not the number of granules found.
results = earthaccess.search_data(
    short_name="PACE_OCI_L2_BGC_NRT",
    count=1,
)
Granules found: 7960
We can refine our search by passing more parameters that describe the spatiotemporal domain of our use case. Here, we use the temporal parameter to request a date range and the bounding_box parameter to request granules that intersect with a bounding box. We can even provide a cloud_cover range to keep only granules whose percentage of cloud cover falls within it. We do not provide a count, so we’ll get all granules that satisfy the constraints.
tspan = ("2024-05-01", "2024-05-16")
bbox = (-76.75, 36.97, -75.74, 39.01)
clouds = (0, 50)
results = earthaccess.search_data(
    short_name="PACE_OCI_L2_BGC_NRT",
    temporal=tspan,
    bounding_box=bbox,
    cloud_cover=clouds,
)
Granules found: 3
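Swapped coordinates or a reversed date range tend to produce an empty result rather than an error, so it can be worth validating the filters first. The sketch below is one way to do that. The helper `build_search_kwargs` is hypothetical, and it assumes the bounding box is ordered (west, south, east, north) and the cloud cover range is a (min, max) pair of percentages.

```python
from datetime import date


def build_search_kwargs(short_name, tspan, bbox, clouds=None):
    """Validate search filters and return them as keyword arguments.

    Hypothetical helper; the returned dict could be passed along as
    earthaccess.search_data(**kwargs).
    """
    start, end = (date.fromisoformat(day) for day in tspan)
    if start > end:
        raise ValueError("temporal range must be (start, end)")
    west, south, east, north = bbox
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must lie within [-180, 180]")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    kwargs = {
        "short_name": short_name,
        "temporal": tuple(tspan),
        "bounding_box": tuple(bbox),
    }
    if clouds is not None:
        low, high = clouds
        if not 0 <= low <= high <= 100:
            raise ValueError("cloud cover must satisfy 0 <= min <= max <= 100")
        kwargs["cloud_cover"] = (low, high)
    return kwargs


kwargs = build_search_kwargs(
    "PACE_OCI_L2_BGC_NRT",
    ("2024-05-01", "2024-05-16"),
    (-76.75, 36.97, -75.74, 39.01),
    (0, 50),
)
print(kwargs)
```

The west longitude is deliberately allowed to exceed the east longitude, since a box that crosses the antimeridian would look that way.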
Displaying results shows the direct download link: try it! The link will download one granule to your local machine, which may or may not be what you want to do. Even if you are running the notebook on a remote host, this download link will open a new browser tab or window and offer to save a file to your local machine. If you are running the notebook locally, this may be of use. However, in the next section we’ll see how to download all the results with one command.
results[0]
results[1]
results[2]
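Instead of clicking through each displayed result, the links can also be gathered programmatically. The sketch below assumes each search result exposes a `data_links()` method that returns a list of URLs, as earthaccess granule results do at the time of writing. The `FakeGranule` class and the example.com URLs are hypothetical stand-ins so the snippet runs without a network connection.

```python
def collect_links(granules):
    """Flatten the data links of several granules into one list."""
    links = []
    for granule in granules:
        # Assumption: each result provides data_links() returning URLs.
        links.extend(granule.data_links())
    return links


class FakeGranule:
    """Hypothetical stand-in for a search result, for offline use only."""

    def __init__(self, links):
        self._links = links

    def data_links(self):
        return list(self._links)


demo = [
    FakeGranule(["https://example.com/a.nc"]),
    FakeGranule(["https://example.com/b.nc"]),
]
print(collect_links(demo))
```

With real search results, `collect_links(results)` would be the equivalent call.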
An upcoming tutorial will need access to Level-1 files, whether or not we have direct access to the Earthdata Cloud, so let’s go ahead and download a couple of granules. As always, we start with an earthaccess.search_data call.
results = earthaccess.search_data(
    short_name="PACE_OCI_L1B_SCI",
    temporal=tspan,
    bounding_box=bbox,
    count=2,
)
Granules found: 23
First, we need to understand the alternative to downloading granules, since you may be surprised that there is an alternative at all. The earthaccess.open function accepts the list of results from earthaccess.search_data and returns a list of file-like objects. No actual files are transferred.
paths = earthaccess.open(results)
Opening 2 granules, approx size: 3.47 GB using endpoint: https://obdaac-tea.earthdatacloud.nasa.gov/s3credentials
The file-like objects held in paths can each be read like a normal file. Here we load the first few bytes without any specialized reader.
with paths[0] as file:
    line = file.readline().strip()
line
b'\x89HDF'
Of course that doesn’t mean anything (or does it? 😉), because this is a binary file that needs a reader which understands the file format.
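Those four bytes are in fact the start of the HDF5 format signature, which is the eight-byte sequence `\x89HDF\r\n\x1a\n`. Calling `readline()` stops at the first newline and `strip()` removes the trailing `\r\n`, leaving `b'\x89HDF'`. The sketch below reproduces this on an in-memory stand-in rather than a real granule. The helper `looks_like_hdf5` is hypothetical and assumes the signature sits at the very start of the file.

```python
import io

# The HDF5 format signature, expected at the start of the file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(fileobj):
    """Return True if the stream begins with the HDF5 signature."""
    fileobj.seek(0)
    return fileobj.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


# An in-memory stand-in for a remote file (not real granule content).
fake_file = io.BytesIO(HDF5_SIGNATURE + b"remaining bytes")
first_line = fake_file.readline().strip()
print(first_line)
print(looks_like_hdf5(fake_file))
```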
The earthaccess.open function is used when you want to read bytes directly from a remote filesystem, but not download a whole file. When running code on a host with direct access to the NASA Earthdata Cloud, you don’t need to download the data and earthaccess.open is the way to go.
Now, let’s look at the earthaccess.download function, which is used to copy files onto a filesystem local to the machine executing the code. For this function, provide the output of earthaccess.search_data along with a directory where earthaccess will store downloaded granules.
Even if you only want to read a slice of the data, and downloading seems unnecessary, performance will be very poor if you use earthaccess.open while not running on a remote host with direct access to the NASA Earthdata Cloud. This is not a problem with “the cloud” or with earthaccess; it has to do with the data format and may soon be resolved.
Let’s continue by downloading the list of granules!
directory = pathlib.Path("L1B")
directory.mkdir(exist_ok=True)
paths = earthaccess.download(results, directory)
Getting 2 granules, approx download size: 3.47 GB
Accessing cloud dataset using dataset endpoint credentials: https://obdaac-tea.earthdatacloud.nasa.gov/s3credentials
Downloaded: L1B/PACE_OCI.20240501T165311.L1B.nc
Downloaded: L1B/PACE_OCI.20240501T165811.L1B.nc
The paths list now contains paths to actual files on the local filesystem.
paths
[PosixPath('L1B/PACE_OCI.20240501T165311.L1B.nc'), PosixPath('L1B/PACE_OCI.20240501T165811.L1B.nc')]
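Because these are ordinary paths, the usual pathlib checks apply. As a small sketch, the hypothetical helper `split_existing` below separates a list of paths into files that are present on disk and files that are not, which can be useful for confirming a download before moving on. The file names in the demonstration are made up.

```python
from pathlib import Path


def split_existing(paths):
    """Separate paths into files that exist locally and files that do not."""
    existing, missing = [], []
    for path in map(Path, paths):
        (existing if path.is_file() else missing).append(path)
    return existing, missing


# Demonstration with made-up file names that are not expected to exist.
existing, missing = split_existing(["no_such_dir/a.nc", "no_such_dir/b.nc"])
print(len(existing), len(missing))
```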
Anywhere in any of these notebooks where paths = earthaccess.open(...) is used to read data directly from the NASA Earthdata Cloud, you need to substitute paths = earthaccess.download(..., local_path) before running the notebook on a local host or a remote host that does not have direct access to the NASA Earthdata Cloud.
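One way to make that substitution explicit is to route the choice through a single helper. The sketch below is illustrative only: the host labels, `recommended_function`, and `get_paths` are hypothetical names, and the caller has to state which kind of host the notebook is running on, since nothing here detects it.

```python
from pathlib import Path

# Hypothetical labels for the places a notebook can run, mapped to the
# earthaccess function this tutorial recommends for each.
RECOMMENDED = {
    "local": "download",
    "remote-no-direct-access": "download",
    "remote-direct-access": "open",
}


def recommended_function(host_kind):
    """Return 'open' or 'download' for a given host label."""
    try:
        return RECOMMENDED[host_kind]
    except KeyError:
        raise ValueError(f"unknown host kind: {host_kind!r}") from None


def get_paths(results, host_kind, local_path="data"):
    """Return file-like objects or local paths, depending on the host."""
    import earthaccess  # lazy import: the mapping above works without it

    if recommended_function(host_kind) == "open":
        return earthaccess.open(results)
    Path(local_path).mkdir(parents=True, exist_ok=True)
    return earthaccess.download(results, local_path)


print(recommended_function("local"))
```

With this in place, a notebook would call `get_paths(results, "remote-direct-access")` on a cloud host and change only the label when run elsewhere.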
You have completed the notebook on downloading and opening datasets. We now suggest starting the notebook on File Structure at Three Processing Levels.