nnja_ai

Modules:

catalog –
dataset –
exceptions –
io –
variable –

Classes:

DataCatalog –

DataCatalog class for finding and loading NNJA datasets.
NNJADataset –

NNJADataset class for handling dataset metadata and loading data.
NNJAVariable –

A class to represent a variable in a NNJADataset.

DataCatalog

DataCatalog(
    mirror=DEFAULT_MIRROR, base_path=None, catalog_json=None
)

DataCatalog class for finding and loading NNJA datasets.

The DataCatalog represents a collection of NNJADataset objects, and provides some basic search/list functionality.

Attributes:

base_path (str) –

Base path for resolving relative URIs.
catalog_uri (str) –

Full URI to the catalog JSON file.
catalog_metadata (dict) –

Metadata of the catalog, loaded from the JSON file.
datasets (dict) –

Dictionary of dataset instances or subtypes.

Parameters:

mirror (Optional[str], default: DEFAULT_MIRROR ) –

Name of predefined mirror to use (e.g., 'gcp_nodd', 'aws_opendata').
base_path (Optional[str], default: None ) –

Custom base path for resolving relative URIs. Cannot be used with mirror.
catalog_json (Optional[str], default: None ) –

Custom catalog JSON path. Cannot be used with mirror.

Note

Dataset manifests are now loaded lazily on first access for better performance.

Raises:

ValueError –

If mirror is specified together with base_path or catalog_json.

Methods:

__getitem__ –

Fetch a specific dataset by name.
info –

Provide information about the catalog.
list_datasets –

List all dataset groups.
search –

Search datasets by name, tags, description, or variables.

__getitem__

__getitem__(dataset_name)

Fetch a specific dataset by name.

Parameters:

dataset_name (str) –

The name of the dataset to fetch.

Returns:

NNJADataset –

The dataset object.

info

info()

Provide information about the catalog.

list_datasets

list_datasets()

List all dataset groups.

search

search(query_term)

Search datasets by name, tags, description, or variables.

Parameters:

query_term (str) –

The term to search for.

Returns:

list –

A list of NNJADataset objects matching the search term.
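
As a rough usage sketch of the catalog methods above: the helper below relies only on the documented behaviour that `search` returns a list of dataset objects and that each dataset has a `name` attribute. It accepts any object with that shape, so it can be tried without network access. The helper name `find_dataset_names` and the search term in the commented usage are illustrative assumptions, not part of the library.

```python
from typing import Any, List


def find_dataset_names(catalog: Any, query_term: str) -> List[str]:
    """Return the names of all datasets that match query_term.

    Uses only catalog.search(query_term) and the `name` attribute of the
    dataset objects it returns.
    """
    return [dataset.name for dataset in catalog.search(query_term)]


# Hypothetical usage against the real package (needs nnja_ai installed and
# access to the chosen mirror; the search term is only an example, and the
# lookup assumes catalog keys are the dataset names):
#
#     from nnja_ai import DataCatalog
#     catalog = DataCatalog()
#     for name in find_dataset_names(catalog, "temperature"):
#         catalog[name].info()
```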

NNJADataset

NNJADataset(json_uri, base_path='')

NNJADataset class for handling dataset metadata and loading data.

The NNJADataset class is primarily meant to aid in navigating dataset metadata and loading data, with some support for data subsetting. The intent is that this class is used to find the appropriate dataset and variable(s) of interest, and then load the data into whichever library (e.g., pandas, polars, dask) is most appropriate for the user's needs.

Attributes:

name (str) –

Name of the dataset.
description (str) –

Description of the dataset.
tags (list) –

List of tags associated with the dataset.
parquet_root_path (str) –

Directory containing the dataset's parquet files.
manifest (DataFrame) –

DataFrame containing the dataset's manifest of parquet partitions.
dimensions (dict) –

Dict of dimensions parsed from metadata.
variables (dict) –

Dict of NNJAVariable objects representing the dataset's variables.

Parameters:

json_uri (str) –

Path or URI to the dataset's JSON metadata.
base_path (str, default: '' ) –

Base path for resolving relative parquet_root_path.

Note

The manifest is now loaded lazily on first access for better performance.

Methods:

__getitem__ –

Fetch a specific variable by ID, or subset the dataset by a list of variable names.
__repr__ –

Return a concise string representation of the dataset.
info –

Provide a summary of the dataset.
list_variables –

List all variables with their descriptions.
load_dataset –

Load the dataset into a DataFrame using the specified library.
load_manifest –

Explicitly load the dataset's manifest of parquet partitions.
sel –

Select data based on the provided keywords.

manifest `property` `writable`

manifest

Get the dataset's manifest of parquet partitions, loading it if needed.
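
The "load on first access" behaviour of a writable property can be pictured with the general pattern below. This is a minimal sketch using a stand-in class, not the NNJADataset implementation; `LazyManifestExample`, `fake_loader`, and the file names are hypothetical.

```python
from typing import Callable, List, Optional

load_calls: List[str] = []  # records every time the loader actually runs


def fake_loader() -> List[str]:
    """Hypothetical loader standing in for reading the parquet manifest."""
    load_calls.append("load")
    return ["part-0.parquet", "part-1.parquet"]


class LazyManifestExample:
    """Stand-in class showing a lazily loaded, writable property."""

    def __init__(self, loader: Callable[[], List[str]]) -> None:
        self._loader = loader
        self._manifest: Optional[List[str]] = None  # nothing loaded yet

    @property
    def manifest(self) -> List[str]:
        # Load on first access, then reuse the cached value.
        if self._manifest is None:
            self._manifest = self._loader()
        return self._manifest

    @manifest.setter
    def manifest(self, value: List[str]) -> None:
        # Writable: callers may replace the manifest, e.g. with a subset.
        self._manifest = value


example = LazyManifestExample(fake_loader)
first = example.manifest   # triggers the load
second = example.manifest  # served from the cached value
```

Constructing the object does no loading; the loader runs once, on the first read of `manifest`.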

__getitem__

__getitem__(key)

Fetch a specific variable by ID, or subset the dataset by a list of variable names.

If a single variable ID is provided, return the variable object. If a list of variable names is provided, return a new dataset object with only the specified variables.

Parameters:

key (Union[str, List[str]]) –

The ID of the variable to fetch or a list of variable names to subset.

Returns:

Union[NNJAVariable, NNJADataset] –

NNJAVariable or NNJADataset: The variable object if a single ID is provided, or a new dataset object containing only the specified variables if a list of variable names is provided.
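
The two behaviours of `__getitem__` (string key returns one variable, list key returns a narrowed copy) can be illustrated with a small stand-in. `MiniDataset` below is hypothetical and stores plain strings instead of NNJAVariable objects; it only mirrors the dispatch described above and is not the library's code.

```python
from typing import Dict, List, Union


class MiniDataset:
    """Stand-in class: str key -> one variable, list key -> new subset."""

    def __init__(self, variables: Dict[str, str]) -> None:
        self.variables = variables

    def __getitem__(self, key: Union[str, List[str]]) -> Union[str, "MiniDataset"]:
        if isinstance(key, str):
            # Single ID: return the variable itself.
            return self.variables[key]
        # List of names: return a new object; the original is left unchanged.
        return MiniDataset({name: self.variables[name] for name in key})


ds = MiniDataset({"lat": "latitude", "lon": "longitude", "time": "observation time"})
single = ds["lat"]           # one variable
subset = ds[["lat", "lon"]]  # new dataset-like object with two variables
```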

__repr__

__repr__()

Return a concise string representation of the dataset.

info

info()

Provide a summary of the dataset.

list_variables

list_variables()

List all variables with their descriptions.

load_dataset

load_dataset(backend='pandas', **backend_kwargs)

Load the dataset into a DataFrame using the specified library.

Parameters:

backend (Backend, default: 'pandas' ) –

The library to use for loading the dataset ('pandas', 'polars', etc.).
**backend_kwargs –

Additional keyword arguments to pass to the backend loader.

Returns:

DataFrame –

The loaded dataset.
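
A common pattern is to narrow the dataset to a few variables before loading it. The helper below is a sketch of that flow using only the documented `__getitem__` and `load_dataset` calls; the name `load_columns` and the variable names in the commented usage are placeholders.

```python
from typing import Any, List


def load_columns(
    dataset: Any,
    variable_names: List[str],
    backend: str = "pandas",
    **backend_kwargs: Any,
) -> Any:
    """Subset `dataset` to `variable_names`, then load it with `backend`."""
    subset = dataset[variable_names]  # list key -> new, narrower dataset
    return subset.load_dataset(backend=backend, **backend_kwargs)


# Hypothetical usage with a dataset obtained from a DataCatalog:
#
#     df = load_columns(dataset, ["lat", "time"], backend="polars")
```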

load_manifest

load_manifest()

Explicitly load the dataset's manifest of parquet partitions.

Returns:

NNJADataset –

The dataset object with the manifest loaded.

sel

sel(**kwargs)

Select data based on the provided keywords.

Allows for three types of selection:

'variables' or 'columns': Subset the dataset by a list of variable names.
'time': Subset the dataset by a time range.
Any extra dimensions in self.dimensions: Subset the dataset by a specific value of the dimension.

Multiple keywords can be provided to perform multiple selections.

Parameters:

**kwargs –

Keywords for subsetting. Valid keywords are 'variables', 'columns', 'time', and any extra dimensions in self.dimensions.

Returns:

NNJADataset –

A new dataset object with the subsetted data.
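
Since the set of valid keywords depends on each dataset's dimensions, it can help to check keywords before calling `sel`. The sketch below encodes only the rule stated above ('variables', 'columns', 'time', plus any key in the dataset's dimensions). The helper name `unknown_sel_keywords` is hypothetical, and how `sel` itself reacts to an unknown keyword is not specified here.

```python
from typing import Any, Dict, List

# Keywords that sel() is documented to accept for every dataset.
BUILTIN_SEL_KEYWORDS = {"variables", "columns", "time"}


def unknown_sel_keywords(dimensions: Dict[str, Any], **kwargs: Any) -> List[str]:
    """Return, sorted, the keywords that sel() is not documented to accept."""
    allowed = BUILTIN_SEL_KEYWORDS | set(dimensions)
    return sorted(key for key in kwargs if key not in allowed)


# Example with a made-up dimensions dict that has a single 'channel' dimension.
bad = unknown_sel_keywords({"channel": None}, variables=["lat"], channel=7, level=500)
```

With a real dataset, `dataset.dimensions` would be passed as the first argument, and the keywords would go to `dataset.sel(...)` once the returned list is empty.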

NNJAVariable

NNJAVariable(variable_metadata, full_id, dim_val=None)

A class to represent a variable in a NNJADataset.

Many datasets in the NNJA archive have a large number of variables, and the parquet metadata doesn't provide enough flexibility to organize and describe them. We've organized variables into four categories, referenced by the 'category' attribute:

- "primary data": The main data variables in the dataset that most users will use (e.g., brightness temperature, precipitation, radiance).
- "primary descriptors": Key descriptor variables that are useful for most users (e.g., time, latitude, longitude, satellite ID).
- "secondary data": Additional data variables that are included for completeness, but contain little useful information for most users (e.g., data quality flags, variables that are null for most observations).
- "secondary descriptors": Additional descriptor variables that are included for completeness, but contain little useful information for most users (e.g., processing station, scan number).
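
The categories make it easy to narrow a long variable list. The sketch below filters a `variables` dict (variable ID mapped to an object with a `category` attribute, as on NNJADataset) down to one category. It assumes the attribute holds the category names exactly as quoted above; the helper name `variables_in_category` is hypothetical.

```python
from typing import Any, Dict


def variables_in_category(variables: Dict[str, Any], category: str) -> Dict[str, Any]:
    """Keep only the variables whose `category` attribute equals `category`."""
    return {
        var_id: var
        for var_id, var in variables.items()
        if var.category == category
    }


# Hypothetical usage with a dataset obtained from a DataCatalog:
#
#     primary = variables_in_category(dataset.variables, "primary data")
#     print(list(primary))
```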

Additionally, some variables have a 'dimension' attribute, which represents additional information about the variable that can be used to subset the data (e.g., 'channel' for a satellite with many channels, or pressure level for some soundings). Because the data is stored in parquet files, the 'dimension' attribute lets us provide some additional subsetting features.

Attributes:

id (str) –

The fully expanded variable ID, corresponding to the parquet column name.
base_id (str) –

The original variable ID; same as id unless the variable has a dimension.
description (str) –

Description of the variable.
category (str) –

Category of the variable.
dimension (optional) –

Dimension of the variable, if available.
extra_metadata (dict) –

Additional metadata for the variable.

Parameters:

variable_metadata (dict) –

Metadata for the variable.
full_id (str) –

The fully expanded variable ID (e.g., 'brightness_temp_00007' or 'lat').
dim_val (Optional[Union[float, int]], default: None ) –

The value of the dimension for this variable, if applicable.

Methods:

__repr__ –

Provide a string representation of the variable.
info –

Provide a summary of the variable.

is_code_or_flag_table `property`

is_code_or_flag_table

Return True if the variable is associated with a code or flag table.

__repr__

__repr__()

Provide a string representation of the variable.

info

info()

Provide a summary of the variable.
