dataset

Classes:

  • NNJADataset

    NNJADataset class for handling dataset metadata and loading data.

NNJADataset

NNJADataset(json_uri, base_path='')

NNJADataset class for handling dataset metadata and loading data.

NNJADataset is primarily meant for navigating dataset metadata and loading data, with some support for subsetting. The intended workflow is to use this class to find the dataset and variable(s) of interest, then load the data into whichever library (e.g., pandas, polars, dask) best fits the user's needs.

Attributes:

  • name (str) –

    Name of the dataset.

  • description (str) –

    Description of the dataset.

  • tags (list) –

    List of tags associated with the dataset.

  • parquet_root_path (str) –

    Directory containing the dataset's parquet files.

  • manifest (DataFrame) –

    DataFrame containing the dataset's manifest of parquet partitions.

  • dimensions (dict) –

    Dict of dimensions parsed from metadata.

  • variables (dict) –

    Dict of NNJAVariable objects representing the dataset's variables.

Parameters:

  • json_uri (str) –

    Path or URI to the dataset's JSON metadata.

  • base_path (str, default: '' ) –

    Base path used to resolve a relative parquet_root_path.

Note

The manifest is loaded lazily on first access, so constructing an NNJADataset does not read it. Call load_manifest() to load it explicitly.

Methods:

  • __getitem__

    Fetch a specific variable by ID, or subset the dataset by a list of variable names.

  • __repr__

    Return a concise string representation of the dataset.

  • info

    Provide a summary of the dataset.

  • list_variables

    List all variables with their descriptions.

  • load_dataset

    Load the dataset into a DataFrame using the specified library.

  • load_manifest

    Explicitly load the dataset's manifest of parquet partitions.

  • sel

    Select data based on the provided keywords.

manifest property writable

manifest

Get the dataset's manifest of parquet partitions, loading it if needed.
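The lazy-loading behaviour of a writable property can be illustrated with a small stand-in class. This is a minimal sketch of the general pattern, not the NNJADataset implementation: LazyManifestDemo, fake_loader, and the file names are hypothetical, and the manifest is a plain list rather than a DataFrame so the example runs with the standard library only.

```python
from typing import Callable, List, Optional


class LazyManifestDemo:
    """Hypothetical stand-in showing a lazily loaded, writable property."""

    def __init__(self, loader: Callable[[], List[str]]):
        self._loader = loader  # called at most once, on first access
        self._manifest: Optional[List[str]] = None

    @property
    def manifest(self) -> List[str]:
        if self._manifest is None:  # first access triggers the load
            self._manifest = self._loader()
        return self._manifest

    @manifest.setter
    def manifest(self, value: List[str]) -> None:
        self._manifest = value  # "writable": callers may replace the manifest

    def load_manifest(self) -> "LazyManifestDemo":
        _ = self.manifest  # force the load now
        return self  # returning self allows method chaining


calls = []  # records how often the (fake) loader actually runs


def fake_loader() -> List[str]:
    calls.append(1)
    return ["part-0.parquet", "part-1.parquet"]


demo = LazyManifestDemo(fake_loader)
calls_before_access = len(calls)  # nothing has been loaded yet
first = demo.manifest  # triggers the loader
second = demo.manifest  # served from the cached value
```

Under these assumptions the loader runs once, no matter how many times the property is read, and load_manifest() only moves that one load to a point the caller chooses.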

__getitem__

__getitem__(key)

Fetch a specific variable by ID, or subset the dataset by a list of variable names.

If a single variable ID is provided, return the variable object. If a list of variable names is provided, return a new dataset object with only the specified variables.

Parameters:

  • key (Union[str, List[str]]) –

    The ID of the variable to fetch or a list of variable names to subset.

Returns:

  • Union[NNJAVariable, NNJADataset]

    NNJAVariable or NNJADataset: The variable object if a single ID is provided, or a new dataset object containing only the specified variables if a list of variable names is provided.
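The two behaviours of __getitem__ (a string returns one variable, a list returns a new, smaller dataset) can be sketched with a stand-in class. MiniDataset is hypothetical and stores plain description strings instead of NNJAVariable objects; it only illustrates dispatching on the key type and is not the library's code.

```python
from typing import Dict, List, Union


class MiniDataset:
    """Hypothetical stand-in; variables map an ID to a description string."""

    def __init__(self, variables: Dict[str, str]):
        self.variables = dict(variables)

    def __getitem__(self, key: Union[str, List[str]]):
        if isinstance(key, str):
            # Single ID: return the variable itself.
            return self.variables[key]
        if isinstance(key, list):
            # List of names: return a new dataset holding only those variables.
            subset = {name: self.variables[name] for name in key}
            return MiniDataset(subset)
        raise TypeError(
            f"key must be a str or a list of str, got {type(key).__name__}"
        )


ds = MiniDataset({"lat": "latitude", "lon": "longitude", "temp": "temperature"})
one = ds["lat"]  # a single variable
sub = ds[["lat", "temp"]]  # a new MiniDataset with two variables
```

In this sketch the original object is left untouched, which matches the documented behaviour of returning a new dataset object for list keys.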

__repr__

__repr__()

Return a concise string representation of the dataset.

info

info()

Provide a summary of the dataset.

list_variables

list_variables()

List all variables with their descriptions.

load_dataset

load_dataset(backend='pandas', **backend_kwargs)

Load the dataset into a DataFrame using the specified library.

Parameters:

  • backend (Backend, default: 'pandas' ) –

    The library to use for loading the dataset ('pandas', 'polars', etc.).

  • **backend_kwargs

    Additional keyword arguments to pass to the backend loader.

Returns:

  • DataFrame

    The loaded dataset.
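One common way to implement a backend argument with pass-through keyword arguments is a lookup table of loader functions. The sketch below assumes that pattern; load_dataset_sketch, LOADERS, and the stub loaders are hypothetical and do not import pandas or polars, so they return a plain dict describing the call instead of a DataFrame.

```python
from typing import Any, Callable, Dict, List


def _pandas_stub(files: List[str], **kwargs: Any) -> Dict[str, Any]:
    # Stub only: a real loader would hand the files to the library's parquet reader.
    return {"backend": "pandas", "files": files, "options": kwargs}


def _polars_stub(files: List[str], **kwargs: Any) -> Dict[str, Any]:
    # Stub only: same shape as above so the dispatch logic can be checked.
    return {"backend": "polars", "files": files, "options": kwargs}


# Maps the backend name to the function that performs the load.
LOADERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "pandas": _pandas_stub,
    "polars": _polars_stub,
}


def load_dataset_sketch(
    files: List[str], backend: str = "pandas", **backend_kwargs: Any
) -> Dict[str, Any]:
    try:
        loader = LOADERS[backend]
    except KeyError:
        raise ValueError(
            f"Unsupported backend {backend!r}; choose from {sorted(LOADERS)}"
        ) from None
    # Extra keyword arguments are forwarded to the chosen loader unchanged.
    return loader(files, **backend_kwargs)


result = load_dataset_sketch(["a.parquet"], backend="polars", columns=["lat"])
default = load_dataset_sketch(["a.parquet"])
```

The table keeps the set of supported backends in one place, so an unknown name fails with a clear error rather than falling through silently.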

load_manifest

load_manifest()

Explicitly load the dataset's manifest of parquet partitions.

Returns:

  • NNJADataset

    The dataset object with the manifest loaded.

sel

sel(**kwargs)

Select data based on the provided keywords.

Allows for three types of selection:
  • 'variables' or 'columns': Subset the dataset by a list of variable names.
  • 'time': Subset the dataset by a time range.
  • Any extra dimensions in self.dimensions: Subset the dataset by a specific value of the dimension.

Multiple keywords can be provided to perform multiple selections.

Parameters:

  • **kwargs

    Keywords for subsetting. Valid keywords are 'variables', 'columns', 'time', and any extra dimensions in self.dimensions.

Returns:

  • NNJADataset

    A new dataset object with the subsetted data.
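The keyword handling described for sel can be sketched as a loop over the keyword arguments that applies each selection to a copy. MiniSelectable is a hypothetical stand-in, not the library's class; in particular, representing the time range as a (start, end) tuple of strings is an assumption, since the accepted form of the 'time' value is not stated here.

```python
from typing import Any, Dict, List, Optional, Tuple


class MiniSelectable:
    """Hypothetical stand-in that records selections instead of loading data."""

    def __init__(
        self,
        variables: List[str],
        dimensions: Dict[str, List[Any]],
        time_range: Optional[Tuple[str, str]] = None,
        dim_filters: Optional[Dict[str, Any]] = None,
    ):
        self.variables = list(variables)
        self.dimensions = dict(dimensions)
        self.time_range = time_range
        self.dim_filters = dict(dim_filters or {})

    def sel(self, **kwargs: Any) -> "MiniSelectable":
        # Work on a copy so the original object is never modified.
        new = MiniSelectable(
            self.variables, self.dimensions, self.time_range, self.dim_filters
        )
        for key, value in kwargs.items():
            if key in ("variables", "columns"):
                missing = [v for v in value if v not in new.variables]
                if missing:
                    raise KeyError(f"Unknown variables: {missing}")
                new.variables = list(value)
            elif key == "time":
                new.time_range = tuple(value)  # assumed (start, end) form
            elif key in new.dimensions:
                if value not in new.dimensions[key]:
                    raise ValueError(f"{value!r} is not a value of {key!r}")
                new.dim_filters[key] = value
            else:
                raise KeyError(f"Unknown selection keyword: {key!r}")
        return new


ds = MiniSelectable(["lat", "lon", "temp"], {"channel": [1, 2, 3]})
picked = ds.sel(
    variables=["temp"], time=("2021-01-01", "2021-01-02"), channel=2
)
```

Several keywords in one call are applied one after another, and the result is a new object, mirroring the documented return of a new dataset with the subsetted data.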