{ "cells": [ { "cell_type": "markdown", "id": "43bf567d-6b45-427c-b70c-8d72be833aeb", "metadata": {}, "source": [ "# File Structure at Three Processing Levels for the Ocean Color Instrument (OCI)\n", "\n", "**Authors:** Anna Windle (NASA, SSAI), Ian Carroll (NASA, UMBC), Carina Poulin (NASA, SSAI)\n", "\n", "
\n", "\n", "The following notebooks are **prerequisites** for this tutorial.\n", "\n", "- Learn with OCI: [Data Access][oci-data-access]\n", "\n", "
\n", "\n", "
\n", "\n", "An [Earthdata Login][edl] account is required to access data from the NASA Earthdata system, including NASA ocean color data.\n", "\n", "
\n", "\n", "[edl]: https://urs.earthdata.nasa.gov/\n", "[oci-data-access]: https://oceancolor.gsfc.nasa.gov/resources/docs/tutorials/notebooks/oci_data_access/\n", "\n", "## Summary\n", "\n", "In this example we will use the `earthaccess` package to access OCI Level-1B (L1B), Level-2 (L2), and Level-3 (L3) NetCDF files and open them using `xarray`.\n", "\n", "**NetCDF** ([Network Common Data Format][netcdf]) is a binary file format for storing multidimensional scientific data (variables). It is optimized for array-oriented data access and support a machine-independent format for representing scientific data. Files ending in `.nc` are NetCDF files.\n", "\n", "**XArray** is a [package][xarray] that supports the use of multi-dimensional arrays in Python. It is widely used to handle Earth observation data, which often involves multiple dimensions — for instance, longitude, latitude, time, and channels/bands.\n", "\n", "[netcdf]: https://www.unidata.ucar.edu/software/netcdf/\n", "[xarray]: https://docs.xarray.dev/\n", "\n", "## Learning Objectives\n", "\n", "At the end of this notebok you will know:\n", "* How to find groups in a NetCDF file\n", "* How to use `xarray` to open OCI data\n", "* What key variables are present in the groups within OCI L1B, L2, and L3 files\n", "\n", "## Contents\n", "\n", "1. [Setup](#setup)\n", "1. [Explore L1B File Structure](#l1b)\n", "1. [Explore L2 File Structure](#l2)\n", "1. [Explore L3 File Structure](#l3)\n", "\n", "" ] }, { "cell_type": "markdown", "id": "3fed6190-246b-4189-84d5-1f511477678b", "metadata": {}, "source": [ "## 1. Setup\n", "\n", "Begin by importing all of the packages used in this notebook. If your kernel uses an environment defined following the guidance on the [tutorials] page, then the imports will be successful.\n", "\n", "[tutorials]: https://oceancolor.gsfc.nasa.gov/resources/docs/tutorials" ] }, { "cell_type": "code", "execution_count": null, "id": "b832fcd3", "metadata": {}, "outputs": [], "source": [ "import cartopy.crs as ccrs\n", "import earthaccess\n", "import h5netcdf\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import xarray as xr\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "c6082bc6", "metadata": {}, "source": [ "Set (and persist to your user profile on the host, if needed) your Earthdata Login credentials." ] }, { "cell_type": "code", "execution_count": null, "id": "0444d002", "metadata": {}, "outputs": [], "source": [ "auth = earthaccess.login(persist=True)" ] }, { "cell_type": "markdown", "id": "9207edf5-d1fa-4d0b-a18e-f7f976ba4fdb", "metadata": {}, "source": [ "[back to top](#contents) " ] }, { "cell_type": "markdown", "id": "35c141a6", "metadata": {}, "source": [ "## 2. Explore L1B File Structure\n", "\n", "Let's use `xarray` to open up a OCI L1B NetCDF file using `earthaccess`. We will use the same search method used in OCI Data Access. Note that L1B files do not include cloud coverage metadata, so we cannot use that filter." ] }, { "cell_type": "code", "execution_count": null, "id": "459a7523", "metadata": {}, "outputs": [], "source": [ "tspan = (\"2024-05-01\", \"2024-05-16\")\n", "bbox = (-76.75, 36.97, -75.74, 39.01)\n", "\n", "results = earthaccess.search_data(\n", " short_name=\"PACE_OCI_L1B_SCI\",\n", " temporal=tspan,\n", " bounding_box=bbox,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "c2b953dd", "metadata": {}, "outputs": [], "source": [ "paths = earthaccess.open(results)" ] }, { "cell_type": "markdown", "id": "c6d2d341", "metadata": {}, "source": [ "We want to confirm we are running code on a remote host with direct access to the NASA Earthdata Cloud. The next cell has\n", "no effect if we are, and otherwise raises an error. If there is an error, consider the substitution explained in the [Data Access][data-access] notebook.\n", "\n", "[data-access]: https://oceancolor.gsfc.nasa.gov/resources/docs/tutorials/notebooks/oci_data_access/" ] }, { "cell_type": "code", "execution_count": null, "id": "bdac5add", "metadata": {}, "outputs": [], "source": [ "try:\n", " paths[0].f.bucket\n", "except AttributeError:\n", " raise \"The result opened without an S3FileSystem.\"" ] }, { "cell_type": "markdown", "id": "de65a373", "metadata": {}, "source": [ "Let's open the first file of the L1B files list:" ] }, { "cell_type": "code", "execution_count": null, "id": "943fc648", "metadata": {}, "outputs": [], "source": [ "dataset = xr.open_dataset(paths[0])\n", "dataset" ] }, { "cell_type": "markdown", "id": "cbcbfc4b", "metadata": {}, "source": [ "Notice that this `xarray.Dataset` has nothing but \"Attributes\". We cannot use `xarray` to open multi-group hierarchies or list groups within a NetCDF file, but it can open a specific group if you know its path. The `xarray-datatree` package is going to be merged into `xarray` in the not too distant future, which will allow `xarray` to open the entire hieerarchy. In the meantime, we can use a lower level reader to see the top-level groups." ] }, { "cell_type": "code", "execution_count": null, "id": "c598d9bb", "metadata": {}, "outputs": [], "source": [ "with h5netcdf.File(paths[0]) as file:\n", " groups = list(file)\n", "groups" ] }, { "cell_type": "markdown", "id": "7210f2a5", "metadata": {}, "source": [ "Let's open the \"observation_data\" group, which contains the core science variables." ] }, { "cell_type": "code", "execution_count": null, "id": "29cd2e42", "metadata": {}, "outputs": [], "source": [ "dataset = xr.open_dataset(paths[0], group=\"observation_data\")\n", "dataset" ] }, { "cell_type": "markdown", "id": "82539346", "metadata": {}, "source": [ "Now you can view the Dimensions, Coordinates, and Variables of this group. To show/hide attributes, press the paper icon on the right hand side of each variable. To show/hide data representation, press the cylinder icon. For instance, you could check the attributes on \"rhot_blue\" to see that this variable is the \"Top of Atmosphere Blue Band Reflectance\".\n", "\n", "The dimensions of the \"rhot_blue\" variable are (\"blue_bands\", \"number_of_scans\", \"ccd_pixels\"), and it has shape (119, 1709, 1272). The `sizes` attribute of a variable gives us that information as a dictionary." ] }, { "cell_type": "code", "execution_count": null, "id": "2474335e", "metadata": {}, "outputs": [], "source": [ "dataset[\"rhot_blue\"].sizes" ] }, { "cell_type": "markdown", "id": "3802b01a", "metadata": {}, "source": [ "Let's plot the reflectance at postion 100 in the \"blue_bands\" dimension." ] }, { "cell_type": "code", "execution_count": null, "id": "d8e2ecc2-f6c1-482c-ae48-3a5f79dbfedc", "metadata": {}, "outputs": [], "source": [ "plot = dataset[\"rhot_blue\"].sel({\"blue_bands\": 100}).plot()" ] }, { "cell_type": "markdown", "id": "4787e534-fe8d-4a1a-92b0-d4fa7db1b6a6", "metadata": {}, "source": [ "[back to top](#contents) " ] }, { "cell_type": "markdown", "id": "3bfb4de3", "metadata": {}, "source": [ "## 3. Explore L2 File Structure\n", "\n", "OCI L2 files include retrievals of geophysical variables, such as Apparent Optical Properties (AOP), for each L1 swath. We'll use the same `earthaccess` search for L2 AOP data. Although now we can use `cloud_cover` too." ] }, { "cell_type": "code", "execution_count": null, "id": "ff545a3a", "metadata": {}, "outputs": [], "source": [ "tspan = (\"2024-05-01\", \"2024-05-16\")\n", "bbox = (-76.75, 36.97, -75.74, 39.01)\n", "clouds = (0, 50)\n", "\n", "results = earthaccess.search_data(\n", " short_name=\"PACE_OCI_L2_AOP_NRT\",\n", " temporal=tspan,\n", " bounding_box=bbox,\n", " cloud_cover=clouds,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "ba7b7b84", "metadata": {}, "outputs": [], "source": [ "paths = earthaccess.open(results)" ] }, { "cell_type": "code", "execution_count": null, "id": "a0e22bf9", "metadata": {}, "outputs": [], "source": [ "with h5netcdf.File(paths[0]) as file:\n", " groups = list(file)\n", "groups" ] }, { "cell_type": "markdown", "id": "e4991c9c", "metadata": {}, "source": [ "Let's look at the \"geophysical_data\" group, which is a new group generated by the level 2 processing." ] }, { "cell_type": "code", "execution_count": null, "id": "d3ff84df", "metadata": {}, "outputs": [], "source": [ "dataset = xr.open_dataset(paths[0], group=\"geophysical_data\")\n", "rrs = dataset[\"Rrs\"]\n", "rrs" ] }, { "cell_type": "code", "execution_count": null, "id": "8fd2252b", "metadata": {}, "outputs": [], "source": [ "rrs.sizes" ] }, { "cell_type": "markdown", "id": "8c1100ce", "metadata": {}, "source": [ "The Rrs variable has length 184 in the wavelength dimension, so the blue, red, and SWIR wavelengths have been combined. Let's map the Rrs at \"wavelength_3d\" position 100." ] }, { "cell_type": "code", "execution_count": null, "id": "c0494a5c", "metadata": {}, "outputs": [], "source": [ "plot = rrs.sel({\"wavelength_3d\": 100}).plot(cmap=\"viridis\")" ] }, { "cell_type": "markdown", "id": "bf90998b", "metadata": {}, "source": [ "Right now, the scene is being plotted using `number_of_lines` and `pixels_per_line` as \"x\" and \"y\", respectively. We need to add latitude and longitude values to create a true map. To do this, we will create a merged `xarray.Dataset` that pulls in information from the \"navigation_data\" group." ] }, { "cell_type": "code", "execution_count": null, "id": "7a5a9849", "metadata": {}, "outputs": [], "source": [ "dataset = xr.open_dataset(paths[0], group=\"navigation_data\")\n", "dataset = dataset.set_coords((\"longitude\", \"latitude\"))\n", "dataset = dataset.rename({\"pixel_control_points\": \"pixels_per_line\"})\n", "dataset = xr.merge((rrs, dataset.coords))\n", "dataset" ] }, { "cell_type": "markdown", "id": "df410885", "metadata": {}, "source": [ "Although we now have coordinates, they won't immediately help because the data are not gridded by latitude and longitude.\n", "The Level 2 data cover the original instrument swath and have not been resampled to a regular grid. Therefore latitude\n", "and longitude are known, but cannot be used immediately to \"look-up\" values like you can along an array's dimensions.\n", "\n", "Let's make a scatter plot of the pixel locations so we can see the irregular spacing. By selecting a `slice` with a step size larger than one, we get a subset of the locations for better visualization." ] }, { "cell_type": "code", "execution_count": null, "id": "48170e48", "metadata": {}, "outputs": [], "source": [ "plot = dataset.sel(\n", " {\n", " \"number_of_lines\": slice(None, None, 1720 // 20),\n", " \"pixels_per_line\": slice(None, None, 1272 // 20),\n", " },\n", ").plot.scatter(x=\"longitude\", y=\"latitude\")" ] }, { "cell_type": "markdown", "id": "7cb9f382", "metadata": {}, "source": [ "Let's plot this new `xarray.Dataset` the same way as before, but add latitude and longitude." ] }, { "cell_type": "code", "execution_count": null, "id": "abd1f45c", "metadata": {}, "outputs": [], "source": [ "rrs = dataset[\"Rrs\"].sel({\"wavelength_3d\": 100})\n", "plot = rrs.plot(x=\"longitude\", y=\"latitude\", cmap=\"viridis\", vmin=0)" ] }, { "cell_type": "markdown", "id": "39c9a8db", "metadata": {}, "source": [ "Now you can project the data onto a grid. If you wanna get fancy, add a coastline." ] }, { "cell_type": "code", "execution_count": null, "id": "7cfde548", "metadata": {}, "outputs": [], "source": [ "fig = plt.figure()\n", "ax = plt.axes(projection=ccrs.PlateCarree())\n", "rrs.plot(x=\"longitude\", y=\"latitude\", cmap=\"viridis\", vmin=0, ax=ax)\n", "ax.coastlines()\n", "ax.gridlines(draw_labels={\"left\": \"y\", \"bottom\": \"x\"})\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "b01838a6", "metadata": {}, "source": [ "Let's plot the full \"Rrs\" spectrum for individual pixels. A visualization with all the pixels\n", "wouldn't be useful, but limiting to a bounding box gives a simple way to subset pixels. Note that,\n", "since we still don't have gridded data (i.e. our latitude and longitude coordinates are two-dimensional),\n", "we can't `slice` on a built-in index. Without getting into anything complex, we can just box it out." ] }, { "cell_type": "code", "execution_count": null, "id": "94a991b3", "metadata": {}, "outputs": [], "source": [ "rrs_box = dataset[\"Rrs\"].where(\n", " (\n", " (dataset[\"latitude\"] > 37.52)\n", " & (dataset[\"latitude\"] < 37.55)\n", " & (dataset[\"longitude\"] > -75.46)\n", " & (dataset[\"longitude\"] < -75.43)\n", " ),\n", " drop=True,\n", ")\n", "rrs_box.sizes" ] }, { "cell_type": "markdown", "id": "b8b56c92", "metadata": {}, "source": [ "The line plotting method will only draw a line plot for 1D data, which we can get by stacking\n", "our two spatial dimensions and choosing to show the new \"pixel dimension\" as different colors." ] }, { "cell_type": "code", "execution_count": null, "id": "e1eec1ee", "metadata": {}, "outputs": [], "source": [ "rrs_stack = rrs_box.stack(\n", " {\"pixel\": [\"number_of_lines\", \"pixels_per_line\"]},\n", " create_index=False,\n", ")\n", "plot = rrs_stack.plot.line(hue=\"pixel\")" ] }, { "cell_type": "markdown", "id": "832dc749", "metadata": {}, "source": [ "We will go over how to plot Rrs spectra with accurate wavelength values on the x-axis in an upcoming notebook.\n", "\n", "[back to top](#contents) " ] }, { "cell_type": "markdown", "id": "98156f66-d662-478e-bd3b-6b9b2ffd578a", "metadata": {}, "source": [ "## 4. Explore L3 File Structure\n", "\n", "At Level-3 there are binned (B) and mapped (M) products available for OCI. The L3M remote sensing reflectance (Rrs) files contain global maps of Rrs. We'll use the same `earthaccess` method to find the data." ] }, { "cell_type": "code", "execution_count": null, "id": "d97b48c8", "metadata": {}, "outputs": [], "source": [ "tspan = (\"2024-05-01\", \"2024-05-16\")\n", "bbox = (-76.75, 36.97, -75.74, 39.01)\n", "\n", "results = earthaccess.search_data(\n", " short_name=\"PACE_OCI_L3M_RRS_NRT\",\n", " temporal=tspan,\n", " bounding_box=bbox,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "71f375ec", "metadata": {}, "outputs": [], "source": [ "paths = earthaccess.open(results)" ] }, { "cell_type": "markdown", "id": "b308e418", "metadata": {}, "source": [ "OCI L3 data do not have any groups, so we can open the dataset without the `group` argument.\n", "Let's take a look at the first file." ] }, { "cell_type": "code", "execution_count": null, "id": "4d29f91f", "metadata": {}, "outputs": [], "source": [ "dataset = xr.open_dataset(paths[0])\n", "dataset" ] }, { "cell_type": "markdown", "id": "801d4800", "metadata": {}, "source": [ "Notice that OCI L3M data has `lat` and `lon` coordinates, so it's easy to slice out a bounding box and map the \"Rrs_442\" variable." ] }, { "cell_type": "code", "execution_count": null, "id": "556b4225-2bb8-4d79-8901-7cfe4df4e41a", "metadata": {}, "outputs": [], "source": [ "rrs_442 = dataset[\"Rrs_442\"].sel({\"lat\": slice(-25, -45), \"lon\": slice(10, 30)})\n", "rrs_442" ] }, { "cell_type": "code", "execution_count": null, "id": "d0293716-0a37-4898-abb5-9ed7b349e61f", "metadata": {}, "outputs": [], "source": [ "fig = plt.figure()\n", "ax = plt.axes(projection=ccrs.PlateCarree())\n", "ax.coastlines()\n", "ax.gridlines(draw_labels={\"left\": \"y\", \"bottom\": \"x\"})\n", "rrs_442.plot(cmap=\"viridis\", vmin=0, ax=ax)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "59bd1c7a-df29-4be7-ad46-f02beaacc53a", "metadata": {}, "source": [ "Also becuase the L3M variables have `lat` and `lon` coordinates, it's possible to stack multiple granules along a\n", "new dimension that corresponds to time.\n", "Instead of `xr.open_dataset`, we use `xr.open_mfdataset` to create a single `xarray.Dataset` (the \"mf\" in `open_mfdataset` stands for multiple files) from an array of paths.\n", "\n", "We also use a new search filter available in `search_data`: the `granule_name` argument accepts strings with the \"*\" wildcard. We need this to distinguish daily (\"DAY\") from eight-day (\"8D\") composites, as well as to get the 0.1 degree resolution projections." ] }, { "cell_type": "code", "execution_count": null, "id": "0ac643cf", "metadata": {}, "outputs": [], "source": [ "tspan = (\"2024-05-01\", \"2024-05-8\")\n", "results = earthaccess.search_data(\n", " short_name=\"PACE_OCI_L3M_CHL_NRT\",\n", " temporal=tspan,\n", " granule_name=\"*.DAY.*.0p1deg.*\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "701ad5eb", "metadata": {}, "outputs": [], "source": [ "paths = earthaccess.open(results)" ] }, { "cell_type": "markdown", "id": "9ecc47be", "metadata": {}, "source": [ "The `paths` list is sorted temporally by default, which means the shape of the `paths` array specifies the way we need to tile the files together into larger arrays. We specify `combine=\"nested\"` to combine the files according to the shape of the array of files (or file-like objects), even though `paths` is not a \"nested\" list in this case. The `concat_dim=\"date\"` argument generates a new dimension in the combined dataset, because \"date\" is not an existing dimension in the individual files." ] }, { "cell_type": "code", "execution_count": null, "id": "61459815", "metadata": {}, "outputs": [], "source": [ "dataset = xr.open_mfdataset(\n", " paths,\n", " combine=\"nested\",\n", " concat_dim=\"date\",\n", ")" ] }, { "cell_type": "markdown", "id": "4ceb68a5", "metadata": {}, "source": [ "Add a date dimension using the dates from the netCDF files." ] }, { "cell_type": "code", "execution_count": null, "id": "5e1c2c3f", "metadata": {}, "outputs": [], "source": [ "dates = [xr.open_dataset(a).attrs[\"time_coverage_end\"] for a in paths]\n", "dt = pd.to_datetime(dates)\n", "dataset = dataset.assign_coords(date=dt.values)\n", "dataset" ] }, { "cell_type": "markdown", "id": "2a3a66cc", "metadata": {}, "source": [ "A common reason to generate a single dataset from multiple, daily images is to create a composite. Compare the map from a single day ..." ] }, { "cell_type": "code", "execution_count": null, "id": "b7810952", "metadata": {}, "outputs": [], "source": [ "chla = np.log10(dataset[\"chlor_a\"])\n", "chla.attrs.update(\n", " {\n", " \"units\": f'log({dataset[\"chlor_a\"].attrs[\"units\"]})',\n", " }\n", ")\n", "plot = chla.sel(date = \"2024-05-02\").plot(aspect=2, size=4, cmap=\"GnBu_r\")" ] }, { "cell_type": "markdown", "id": "e6a6defc", "metadata": {}, "source": [ "... to a map of average values, skipping \"NaN\" values that result from clouds and the OCI's tilt maneuver." ] }, { "cell_type": "code", "execution_count": null, "id": "92204dec", "metadata": {}, "outputs": [], "source": [ "chla_avg = chla.mean(\"date\", keep_attrs=True)\n", "plot = chla_avg.plot(aspect=2, size=4, cmap=\"GnBu_r\")" ] }, { "cell_type": "markdown", "id": "55c50c87", "metadata": {}, "source": [ "We can also create a time series of mean values over the whole region." ] }, { "cell_type": "code", "execution_count": null, "id": "d3b00f29", "metadata": {}, "outputs": [], "source": [ "chla_avg = chla.mean(dim=[\"lon\", \"lat\"], keep_attrs=True)\n", "plot = chla_avg.plot(linestyle='-', marker='o', color='b')" ] }, { "cell_type": "markdown", "id": "eacc7bcc", "metadata": {}, "source": [ "[back to top](#contents)\n", "\n", "
\n", "\n", "You have completed the notebook on OCI file structure.\n", "\n", "
" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }