Basic Dataset Example
Finding, subsetting, and loading data in the NNJA-AI archive
This notebook shows a simple workflow to find a dataset, subset to a time/variable range of interest, and then fetch and plot the data.
DataCatalog
The DataCatalog holds references to all the datasets in the NNJA-AI archive. It supports basic string search over dataset names and descriptions, and individual datasets can be accessed through standard dictionary indexing, using the dataset name.
from nnja_ai import DataCatalog
catalog = DataCatalog(mirror="gcp_brightband")
print("catalog json:", catalog.catalog_uri)
print("datasets in catalog:")
catalog.list_datasets()
catalog json: gs://nnja-ai/data/v1/catalog.json
datasets in catalog:
['amsua-1bamua-NC021023', 'atms-atms-NC021203', 'mhs-1bmhs-NC021027', 'cris-crisf4-NC021206', 'iasi-mtiasi-NC021241', 'geo-ahicsr-NC021044', 'geo-gsrasr-NC021045', 'geo-gsrcsr-NC021046', 'seviri-sevasr-NC021042', 'conv-adpsfc-NC000001', 'conv-adpsfc-NC000002', 'conv-adpsfc-NC000007', 'conv-adpsfc-NC000101', 'conv-adpupa-NC002001']
catalog.search("amsu")
[NNJADataset(name='amsua-1bamua-NC021023', description='AMSU-A Level 1B brightness temperature data from NOAA-15,-16,-17,-18, -19 (ATOVS), METOP-2,-1)']
amsu_ds = catalog["amsua-1bamua-NC021023"]
type(amsu_ds)
nnja.dataset.NNJADataset
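To make the two access patterns above (search and dictionary indexing) concrete, here is a minimal stand-in written in plain Python. ToyCatalog and ToyDataset are hypothetical names invented for illustration and are not part of nnja_ai; the case-insensitive substring match over name, description, and tags is an assumption about how a basic search could behave, not a description of the library's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a catalog entry (not the real NNJADataset)."""
    name: str
    description: str
    tags: list = field(default_factory=list)


class ToyCatalog:
    """Hypothetical stand-in for a catalog: a dict of datasets keyed by name."""

    def __init__(self, datasets):
        self._datasets = {ds.name: ds for ds in datasets}

    def __getitem__(self, name):
        # Dictionary-style access by dataset name.
        return self._datasets[name]

    def list_datasets(self):
        return list(self._datasets)

    def search(self, query):
        # Assumed matching rule: case-insensitive substring match
        # against the name, the description, and each tag.
        q = query.lower()
        return [
            ds for ds in self._datasets.values()
            if q in ds.name.lower()
            or q in ds.description.lower()
            or any(q in tag.lower() for tag in ds.tags)
        ]


toy = ToyCatalog([
    ToyDataset(
        name="amsua-1bamua-NC021023",
        description="AMSU-A Level 1B brightness temperature data",
        tags=["amsu", "brightness temperature", "satellite"],
    ),
    # The description of this dataset is not shown in this notebook.
    ToyDataset(
        name="atms-atms-NC021203",
        description="(description not shown in this notebook)",
    ),
])

print([ds.name for ds in toy.search("amsu")])
print(toy["amsua-1bamua-NC021023"].description)
```

Keying the datasets by name is what makes the catalog["..."] lookup a plain dictionary access, while search stays a simple linear scan, which is adequate for a catalog of this size.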
NNJADataset
An NNJADataset object represents a dataset for a single sensor/source/message. It is a light abstraction around an underlying parquet dataset stored on GCS. A user can either access those parquet files directly (their URIs are available through dataset.manifest), or use this library to explore and subset the dataset's time range and variables.
A straightforward workflow might be to select a dataset, (lazily) select a time range, select variables of interest, and then load the data using pandas, dask, or polars.
print(amsu_ds.info())
Loading manifest for dataset 'amsua-1bamua-NC021023'...
Dataset 'amsua-1bamua-NC021023': AMSU-A Level 1B brightness temperature data from NOAA-15,-16,-17,-18, -19 (ATOVS), METOP-2,-1
Tags: amsu, brightness temperature, satellite, METOP, NOAA
Files: 7947 files in manifest
Variables: 49
print(amsu_ds.manifest.file)
OBS_DATE
1998-10-25 00:00:00+00:00 gs://nnja-ai/data/v1/amsua/1bamua/NC021023/OBS...
1998-10-26 00:00:00+00:00 gs://nnja-ai/data/v1/amsua/1bamua/NC021023/OBS...
1998-10-27 00:00:00+00:00 gs://nnja-ai/data/v1/amsua/1bamua/NC021023/OBS...
1998-10-28 00:00:00+00:00 gs://nnja-ai/data/v1/amsua/1bamua/NC021023/OBS...
1998-10-29 00:00:00+00:00 gs://nnja-ai/data/v1/amsua/1bamua/NC021023/OBS...
...
2025-03-27 00:00:00+00:00 gs://nnja-ai/data/v1/amsua/1bamua/NC021023/OBS...
2025-03-28 00:00:00+00:00 gs://nnja-ai/data/v1/amsua/1bamua/NC021023/OBS...
2025-03-29 00:00:00+00:00 gs://nnja-ai/data/v1/amsua/1bamua/NC021023/OBS...
2025-03-30 00:00:00+00:00 gs://nnja-ai/data/v1/amsua/1bamua/NC021023/OBS...
2025-03-31 00:00:00+00:00 gs://nnja-ai/data/v1/amsua/1bamua/NC021023/OBS...
Name: file, Length: 7947, dtype: object
Time subsetting
Datasets can be subset in time using a slice, a list, or a single value; each value may be a datetime or a string that pd.to_datetime() can parse. This is simply a convenience for subsetting dataset.manifest. Naive datetimes are assumed to be in UTC, as the warnings below show.
amsu_ds = amsu_ds.sel(time=slice("2021-01-01", "2021-01-01"))
ds2 = amsu_ds.sel(time="2021-01-01")
ds3 = amsu_ds.sel(time=["2021-01-01"])
assert ds2.manifest.equals(ds3.manifest) and ds2.manifest.equals(amsu_ds.manifest)
/Users/hans/code/nnja-ai/src/nnja/dataset.py:339: UserWarning: Naive datetime 2021-01-01 00:00:00 assumed to be in UTC
warnings.warn(f"Naive datetime {dt} assumed to be in UTC", UserWarning)
/Users/hans/code/nnja-ai/src/nnja/dataset.py:339: UserWarning: Naive datetime 2021-01-01 00:00:00 assumed to be in UTC
warnings.warn(f"Naive datetime {dt} assumed to be in UTC", UserWarning)
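The equivalence of the three selection forms, and the UTC warning shown above, can be illustrated with a plain-Python sketch. This is not the library's code: as_utc, select_time, and toy_manifest are hypothetical names, the manifest is modelled as a dict rather than the pandas object the library uses, the URIs are placeholders, and the slice is assumed to be inclusive at both ends (consistent with the assertion above, where a slice from a day to itself selects that day).

```python
import warnings
from datetime import datetime, timezone


def as_utc(value):
    """Return a timezone-aware UTC datetime from an ISO string or a datetime."""
    dt = value if isinstance(value, datetime) else datetime.fromisoformat(value)
    if dt.tzinfo is None:
        # Mirrors the warning text seen in the notebook output.
        warnings.warn(f"Naive datetime {dt} assumed to be in UTC", UserWarning)
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def select_time(manifest, time):
    """Filter a {date: uri} mapping by a slice, a list, or a single value."""
    if isinstance(time, slice):
        # Assumes both slice bounds are given; bounds are inclusive.
        start, stop = as_utc(time.start), as_utc(time.stop)
        keep = lambda d: start <= d <= stop
    elif isinstance(time, (list, tuple)):
        wanted = {as_utc(t) for t in time}
        keep = lambda d: d in wanted
    else:
        target = as_utc(time)
        keep = lambda d: d == target
    return {d: uri for d, uri in manifest.items() if keep(d)}


# Hypothetical three-day manifest; the URIs are placeholders, not real paths.
toy_manifest = {
    datetime(2020, 12, 31, tzinfo=timezone.utc): "placeholder://2020-12-31",
    datetime(2021, 1, 1, tzinfo=timezone.utc): "placeholder://2021-01-01",
    datetime(2021, 1, 2, tzinfo=timezone.utc): "placeholder://2021-01-02",
}

by_slice = select_time(toy_manifest, slice("2021-01-01", "2021-01-01"))
by_value = select_time(toy_manifest, "2021-01-01")
by_list = select_time(toy_manifest, ["2021-01-01"])
print(by_slice == by_value == by_list)
```

Passing a timezone-aware value, for example "2021-01-01T00:00:00+00:00", takes the astimezone branch and avoids the warning in this sketch.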
Variable subsetting
Datasets in the NNJA-AI archive may contain dozens or hundreds of individual variables, each representing a column of the underlying parquet dataset. These can be accessed directly using dataset.variables. Variables are grouped into four categories: primary data (e.g. brightness temperature), secondary data (e.g. standard deviation of brightness temperature), primary descriptors (e.g. latitude), and secondary descriptors (e.g. field of view number). These groupings are a subjective assignment based on whether the data describes the world or the observer, and on how important the data is.
Depending on the backend used to load data, subsetting by variable before loading can significantly reduce load times; of course, you can also load the full data and subset columns afterwards.
list(amsu_ds.variables.values())
[NNJAVariable("MSG_TYPE" (secondary descriptors), Source message type),
NNJAVariable("MSG_DATE" (secondary descriptors), Message valid timestamp),
NNJAVariable("MSG_IDX" (secondary descriptors), Message index),
NNJAVariable("SUBSET_IDX" (secondary descriptors), Subset index),
NNJAVariable("SRC_FILENAME" (secondary descriptors), Source filename),
NNJAVariable("LAT" (primary descriptors), Latitude of the observation),
NNJAVariable("LON" (primary descriptors), Longitude of the observation),
NNJAVariable("SAID" (primary descriptors), Satellite identifier) [code table: 001007],
NNJAVariable("SIID" (secondary descriptors), Satellite instruments) [code table: 002019],
NNJAVariable("FOVN" (secondary descriptors), Field of view number),
NNJAVariable("LSQL" (secondary descriptors), Land/sea qualifier) [code table: 008012],
NNJAVariable("SAZA" (secondary descriptors), Satellite zenith angle),
NNJAVariable("SOZA" (secondary descriptors), Solar zenith angle),
NNJAVariable("HOLS" (secondary descriptors), Height of land surface),
NNJAVariable("HMSL" (secondary descriptors), Height or altitude),
NNJAVariable("SOLAZI" (secondary descriptors), Solar azimuth),
NNJAVariable("BEARAZ" (secondary descriptors), Bearing or azimuth),
NNJAVariable("OBS_TIMESTAMP" (primary descriptors), Observation timestamp),
NNJAVariable("BRITCSTC.TMBR_00001" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00002" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00003" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00004" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00005" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00006" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00007" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00008" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00009" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00010" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00011" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00012" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00013" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00014" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.TMBR_00015" (primary data), Brightness temperature),
NNJAVariable("BRITCSTC.CSTC_00001" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00002" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00003" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00004" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00005" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00006" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00007" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00008" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00009" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00010" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00011" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00012" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00013" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00014" (secondary data), Cold space temperature correction),
NNJAVariable("BRITCSTC.CSTC_00015" (secondary data), Cold space temperature correction),
NNJAVariable("OBS_DATE" (primary descriptors), Date of the observation)]
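The four-way grouping in the listing above lends itself to selecting columns by category. The sketch below works on a plain list of (column name, category) pairs copied from that listing, with the 15 TMBR and 15 CSTC channel names generated from their numbering; it deliberately does not touch NNJAVariable objects, because their attribute names are not shown here. The names variables, by_category, and columns are illustrative only.

```python
from collections import defaultdict

# (column name, category) pairs copied from the listing above.
variables = [
    ("MSG_TYPE", "secondary descriptors"),
    ("MSG_DATE", "secondary descriptors"),
    ("MSG_IDX", "secondary descriptors"),
    ("SUBSET_IDX", "secondary descriptors"),
    ("SRC_FILENAME", "secondary descriptors"),
    ("LAT", "primary descriptors"),
    ("LON", "primary descriptors"),
    ("SAID", "primary descriptors"),
    ("SIID", "secondary descriptors"),
    ("FOVN", "secondary descriptors"),
    ("LSQL", "secondary descriptors"),
    ("SAZA", "secondary descriptors"),
    ("SOZA", "secondary descriptors"),
    ("HOLS", "secondary descriptors"),
    ("HMSL", "secondary descriptors"),
    ("SOLAZI", "secondary descriptors"),
    ("BEARAZ", "secondary descriptors"),
    ("OBS_TIMESTAMP", "primary descriptors"),
]
# Channels 1..15 follow the zero-padded pattern seen in the listing.
variables += [(f"BRITCSTC.TMBR_{i:05d}", "primary data") for i in range(1, 16)]
variables += [(f"BRITCSTC.CSTC_{i:05d}", "secondary data") for i in range(1, 16)]
variables.append(("OBS_DATE", "primary descriptors"))

# Group column names by category, preserving listing order.
by_category = defaultdict(list)
for name, category in variables:
    by_category[category].append(name)

for category, names in by_category.items():
    print(f"{category}: {len(names)} columns")

# Example selection: everything marked primary (descriptors plus data).
columns = by_category["primary descriptors"] + by_category["primary data"]
print(columns[:3], "...", columns[-1])
```

A list built this way could then be passed to sel(variables=...) in the same manner as the hand-written three-column list used in this notebook.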
plot_col = "BRITCSTC.TMBR_00001"
amsu_ds_sub = amsu_ds.sel(variables=["LAT", "LON", plot_col])
print(amsu_ds_sub.info())
Dataset 'amsua-1bamua-NC021023': AMSU-A Level 1B brightness temperature data from NOAA-15,-16,-17,-18, -19 (ATOVS), METOP-2,-1
Tags: amsu, brightness temperature, satellite, METOP, NOAA
Files: 1 files in manifest
Variables: 3
This dataset now contains only one parquet partition (for the selected date) and 3 variables, so it should be very fast to load, even with a tabular data tool that doesn't load lazily, such as pandas. Let's load it and plot a day of data:
df = amsu_ds_sub.load_dataset(backend="pandas")
import matplotlib.pyplot as plt
def plot_df(df, plot_col):
    fig, ax = plt.subplots(figsize=(15, 10))
    subsample = 2  # plot every 2nd observation to keep the scatter light
    ax.scatter(df["LON"][::subsample], df["LAT"][::subsample], s=2, c=df[plot_col][::subsample], cmap="viridis")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title(f"AMSU Brightness Temperature for {plot_col}")
    plt.show()
plot_df(df, plot_col)
Compressing all that down, this is what it takes to open a dataset, slice it down to just the subset of interest, load that locally, and plot it:
from nnja_ai import DataCatalog
catalog = DataCatalog(mirror="gcp_brightband")
amsu_ds = catalog["amsua-1bamua-NC021023"].sel(time="2021-01-01 00Z", variables=["LAT", "LON", "BRITCSTC.TMBR_00001"])
df = amsu_ds.load_dataset(backend="pandas")
plot_df(df, "BRITCSTC.TMBR_00001")
Loading manifest for dataset 'amsua-1bamua-NC021023'...