io

Functions:

  • load_manifest

    Load the manifest file from the parquet directory to a DataFrame.

  • load_parquet

    Load parquet files using the specified backend; lazy if supported by the backend.

  • read_json

    Read and validate a JSON file from a URI.

load_manifest

load_manifest(parquet_dir)

Load the manifest file from the parquet directory to a DataFrame.

The dataset is assumed to use Hive-style partitioning (key=value directory names) on GCS. The returned DataFrame holds the partition keys and the file paths.

Parameters:

  • parquet_dir (str) –

    Top-level directory containing the Hive-partitioned dataset.

Returns:

  • DataFrame

    A pandas DataFrame with partition keys, values, and file paths.
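
As a rough illustration of what Hive-style partition parsing involves, the sketch below extracts key=value segments from file paths and builds one record per file. It is not the library's implementation: the helper names `parse_hive_partitions` and `build_manifest_records`, the bucket name, and the partition keys are all hypothetical, and the sketch returns plain dictionaries rather than a DataFrame.

```python
from typing import Dict, List


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Extract key=value partition segments from a Hive-style path (hypothetical helper)."""
    # Drop the scheme (e.g. "gs://") so it is not mistaken for a path segment.
    _, _, remainder = path.rpartition("://")
    partitions: Dict[str, str] = {}
    # The last segment is the file name, so it is skipped.
    for segment in remainder.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


def build_manifest_records(paths: List[str]) -> List[Dict[str, str]]:
    """Return one record per file: its partition keys plus the file path."""
    return [{**parse_hive_partitions(p), "path": p} for p in paths]


# Made-up file listing; the bucket and partition names are assumptions.
example_paths = [
    "gs://example-bucket/dataset/year=2023/month=01/part-0.parquet",
    "gs://example-bucket/dataset/year=2023/month=02/part-0.parquet",
]
records = build_manifest_records(example_paths)
```

Records of this shape could be passed to `pd.DataFrame(records)` to obtain a table with one column per partition key and one for the file path.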

load_parquet

load_parquet(
    parquet_uris,
    columns,
    backend="pandas",
    **backend_kwargs,
)

Load parquet files using the specified backend; lazy if supported by the backend.

In the current implementation, polars and dask load lazily and preserve any Hive partitions and columns, while pandas loads eagerly and concatenates the resulting DataFrames.

Parameters:

  • parquet_uris (List[str]) –

    List of URIs pointing to the parquet files.

  • columns (List[str]) –

    List of columns to load from the parquet files.

  • backend (Backend, default: 'pandas' ) –

    Backend to use for loading the parquet files. Valid options are "pandas", "polars", and "dask".

  • **backend_kwargs

    Additional keyword arguments to pass to the backend's loading function.

Returns:

  • Union[pd.DataFrame, pl.LazyFrame, dd.DataFrame]

    The loaded data: a pandas DataFrame, a polars LazyFrame, or a dask DataFrame, depending on the backend.

Raises:

  • ValueError

    If an unsupported backend is specified.
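
The backend dispatch described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the actual implementation: the name `load_parquet_sketch` is hypothetical, the third-party imports are deferred so that only the selected backend needs to be installed, and `backend_kwargs` are simply forwarded to `pd.read_parquet`, `pl.scan_parquet`, or `dd.read_parquet`.

```python
from typing import Any, List

SUPPORTED_BACKENDS = ("pandas", "polars", "dask")


def load_parquet_sketch(
    parquet_uris: List[str],
    columns: List[str],
    backend: str = "pandas",
    **backend_kwargs: Any,
):
    """Hypothetical stand-in for load_parquet, for illustration only."""
    # Validate first, so an unsupported backend fails before any import.
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported backend {backend!r}; expected one of {SUPPORTED_BACKENDS}"
        )

    if backend == "pandas":
        import pandas as pd

        # Eager: read each file, then concatenate into one DataFrame.
        frames = [
            pd.read_parquet(uri, columns=columns, **backend_kwargs)
            for uri in parquet_uris
        ]
        return pd.concat(frames, ignore_index=True)

    if backend == "polars":
        import polars as pl

        # Lazy: nothing is read until the LazyFrame is collected.
        return pl.scan_parquet(parquet_uris, **backend_kwargs).select(columns)

    import dask.dataframe as dd

    # Lazy: dask builds a task graph and reads on compute().
    return dd.read_parquet(parquet_uris, columns=columns, **backend_kwargs)
```

With the lazy backends, the caller decides when to materialize the data, for example with `.collect()` on a polars LazyFrame or `.compute()` on a dask DataFrame.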

read_json

read_json(json_uri, schema_path=None)

Read and validate a JSON file from a URI.

Supports local and cloud storage URIs. If a JSON schema path is provided, the JSON file will be validated against the schema.

Parameters:

  • json_uri (str) –

    URI pointing to the JSON file.

  • schema_path (Optional[str], default: None ) –

    Path to the JSON schema file for validation.

Returns:

  • dict –

    The loaded JSON data.
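
The read-then-validate flow can be sketched as below. This is an illustration under assumptions, not the library's code: `read_json_sketch` is a hypothetical name, it handles local paths only (cloud URIs would need a filesystem layer such as fsspec), and it assumes validation is done with the third-party `jsonschema` package, which is imported only when a schema path is given.

```python
import json
import os
import tempfile
from typing import Optional


def read_json_sketch(json_path: str, schema_path: Optional[str] = None) -> dict:
    """Hypothetical local-only stand-in for read_json."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    if schema_path is not None:
        # Assumption: validation via the third-party jsonschema package.
        import jsonschema

        with open(schema_path, "r", encoding="utf-8") as f:
            schema = json.load(f)
        # Raises jsonschema.ValidationError if the data does not conform.
        jsonschema.validate(instance=data, schema=schema)

    return data


# Usage: write a small JSON file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump({"name": "example", "version": 1}, f)
    loaded = read_json_sketch(config_path)
```

Validating immediately after loading means that a malformed file fails at read time rather than later, when a missing key is first accessed.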