io

Functions:

load_manifest – Load the manifest file from the parquet directory to a DataFrame.

load_parquet – Load parquet files using the specified backend; lazy if supported by the backend.

read_json – Read and validate a JSON file from a URI.

load_manifest

load_manifest(parquet_dir)

Load the manifest file from the parquet directory to a DataFrame.

The dataset is assumed to use Hive-style partitioning on GCS. The resulting DataFrame holds the partition keys and the file paths.

Parameters:

parquet_dir (str) –

Top-level directory containing the Hive-partitioned dataset.

Returns:

DataFrame –

A pandas DataFrame with partition keys, values, and file paths.
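
To illustrate what Hive-style partitioning means for the manifest, the sketch below shows one way the `key=value` directory segments of a file path could be turned into manifest rows. It is not the implementation of load_manifest: the helper names `parse_hive_path` and `manifest_rows`, the `path` column name, and the example bucket are all hypothetical, and the listing of files under `parquet_dir` is left out.

```python
from typing import Dict, List


def parse_hive_path(file_path: str, root: str) -> Dict[str, str]:
    """Return the key=value partition segments of `file_path` below `root`.

    Hypothetical helper, for illustration only.
    """
    prefix = root.rstrip("/") + "/"
    if not file_path.startswith(prefix):
        raise ValueError(f"{file_path!r} is not under {root!r}")

    # Everything between the root and the file name is a partition directory.
    *directories, _file_name = file_path[len(prefix):].split("/")

    row: Dict[str, str] = {}
    for segment in directories:
        key, sep, value = segment.partition("=")
        if sep:  # ignore directories that are not key=value pairs
            row[key] = value
    row["path"] = file_path  # assumed column name for the file path
    return row


def manifest_rows(root: str, file_paths: List[str]) -> List[Dict[str, str]]:
    """Build one manifest row per file path (partition values stay strings)."""
    return [parse_hive_path(path, root) for path in file_paths]


rows = manifest_rows(
    "gs://example-bucket/dataset",
    [
        "gs://example-bucket/dataset/year=2024/month=01/part-0.parquet",
        "gs://example-bucket/dataset/year=2024/month=02/part-0.parquet",
    ],
)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` to obtain a table with one column per partition key plus the file path.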

load_parquet

load_parquet(
    parquet_uris,
    columns,
    backend="pandas",
    **backend_kwargs,
)

Load parquet files using the specified backend; lazy if supported by the backend.

In the current implementation, the polars and dask backends load lazily and preserve any Hive partitions and columns, while the pandas backend loads eagerly and concatenates the resulting DataFrames.

Parameters:

parquet_uris (List[str]) –

List of URIs pointing to the parquet files.
columns (List[str]) –

List of columns to load from the parquet files.
backend (Backend, default: 'pandas' ) –

Backend to use for loading the parquet files. Valid options are "pandas", "polars", and "dask".
**backend_kwargs –

Additional keyword arguments to pass to the backend's loading function.

Returns:

Union[pd.DataFrame, pl.LazyFrame, dd.DataFrame] –

The loaded data: a pandas DataFrame, a polars LazyFrame, or a dask DataFrame, depending on the backend.

Raises:

ValueError –

If an unsupported backend is specified.
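
The backend selection described above can be sketched as a simple dispatch. This is a minimal sketch under assumptions, not the library's code: the name `load_parquet_sketch` and the `SUPPORTED_BACKENDS` tuple are hypothetical, the backend libraries are imported only inside the branch that needs them, and any special handling of Hive partition columns is omitted.

```python
from typing import Any, List

# Backend names taken from the parameter description above.
SUPPORTED_BACKENDS = ("pandas", "polars", "dask")


def load_parquet_sketch(
    parquet_uris: List[str],
    columns: List[str],
    backend: str = "pandas",
    **backend_kwargs: Any,
):
    """Hypothetical dispatch over the three documented backends."""
    # Reject unknown backends before importing anything.
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported backend {backend!r}; expected one of {SUPPORTED_BACKENDS}"
        )

    if backend == "pandas":
        import pandas as pd

        # Eager: read every file now, then concatenate into one DataFrame.
        frames = [
            pd.read_parquet(uri, columns=columns, **backend_kwargs)
            for uri in parquet_uris
        ]
        return pd.concat(frames, ignore_index=True)

    if backend == "polars":
        import polars as pl

        # Lazy: builds a query plan; nothing is read until .collect().
        return pl.scan_parquet(parquet_uris, **backend_kwargs).select(columns)

    import dask.dataframe as dd

    # Lazy: builds a task graph; nothing is read until .compute().
    return dd.read_parquet(parquet_uris, columns=columns, **backend_kwargs)
```

With the lazy backends, the caller decides when to materialize the data, for example by calling `.collect()` on a polars LazyFrame or `.compute()` on a dask DataFrame.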

read_json

read_json(json_uri, schema_path=None)

Read and validate a JSON file from a URI.

Supports local and cloud storage URIs. If a JSON schema path is provided, the JSON file will be validated against the schema.

Parameters:

json_uri (str) –

URI pointing to the JSON file.
schema_path (Optional[str], default: None ) –

Path to the JSON schema file for validation.

Returns:

dict –

The loaded JSON data.
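
As an illustration of the read-then-validate flow, the sketch below handles local paths only and validates with the third-party `jsonschema` package when a schema path is given. Both choices are assumptions: the name `read_json_sketch` is hypothetical, the source does not say which validation library or filesystem layer read_json uses, and cloud storage URIs are not handled here.

```python
import json
from typing import Optional


def read_json_sketch(json_uri: str, schema_path: Optional[str] = None) -> dict:
    """Hypothetical local-only variant: load JSON, optionally validate it."""
    with open(json_uri, "r", encoding="utf-8") as handle:
        data = json.load(handle)

    if schema_path is not None:
        # Assumption: validation via the third-party `jsonschema` package,
        # imported lazily so the no-schema path needs only the stdlib.
        import jsonschema

        with open(schema_path, "r", encoding="utf-8") as handle:
            schema = json.load(handle)
        # Raises jsonschema.ValidationError if `data` does not match.
        jsonschema.validate(instance=data, schema=schema)

    return data
```

Validating immediately after loading means a malformed file fails at the read site rather than later, when a missing or mistyped key is first accessed.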