io
Functions:
- load_manifest – Load the manifest file from the parquet directory to a DataFrame.
- load_parquet – Load parquet files using the specified backend; lazy if supported by the backend.
- read_json – Read and validate a JSON file from a URI.
load_manifest
load_manifest(parquet_dir)
Load the manifest file from the parquet directory to a DataFrame.
Assumes Hive-style partitioning on GCS (directories named key=value) and builds a DataFrame from the partition keys and file paths.
Parameters:
- parquet_dir (str) – Top-level directory containing the Hive-partitioned dataset.
Returns:
- pd.DataFrame – DataFrame with partition keys, values, and file paths.
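As an illustration of what "partition keys and file paths" means for a Hive-style layout, here is a minimal sketch that extracts key=value segments from file paths. The helper names (`parse_hive_partitions`, `build_manifest_rows`) and the example bucket are hypothetical and are not part of this module; the real `load_manifest` may read a manifest file and behave differently.

```python
from typing import Dict, List


def parse_hive_partitions(file_path: str) -> Dict[str, str]:
    """Extract key=value partition segments from a Hive-style path (hypothetical helper)."""
    partitions: Dict[str, str] = {}
    # Skip the last segment: it is the file name, not a partition directory.
    for segment in file_path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


def build_manifest_rows(file_paths: List[str]) -> List[Dict[str, str]]:
    """Build one row per file: its partition keys plus the file path."""
    rows = []
    for path in file_paths:
        row = parse_hive_partitions(path)
        row["file_path"] = path
        rows.append(row)
    return rows


# Made-up paths for illustration only.
example_paths = [
    "gs://example-bucket/dataset/year=2023/month=01/part-0.parquet",
    "gs://example-bucket/dataset/year=2023/month=02/part-0.parquet",
]
rows = build_manifest_rows(example_paths)
```

Passing `rows` to `pd.DataFrame(rows)` would give a table with one column per partition key and a `file_path` column, which is the general shape described above.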
load_parquet
load_parquet(
parquet_uris,
columns,
backend="pandas",
**backend_kwargs,
)
Load parquet files using the specified backend; lazy if supported by the backend.
With the current implementation, polars and dask load lazily and preserve any Hive partitions and columns, while pandas loads eagerly and concatenates the resulting DataFrames.
Parameters:
- parquet_uris (List[str]) – List of URIs pointing to the parquet files.
- columns (List[str]) – List of columns to load from the parquet files.
- backend (Backend, default: 'pandas') – Backend to use for loading the parquet files. Valid options are "pandas", "polars", and "dask".
- **backend_kwargs – Additional keyword arguments to pass to the backend's loading function.
Returns:
- Union[pd.DataFrame, pl.LazyFrame, dd.DataFrame] – The loaded data: a pandas DataFrame, a polars LazyFrame, or a dask DataFrame, depending on the backend.
Raises:
- ValueError – If an unsupported backend is specified.
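The backend behaviour described above could be dispatched roughly as follows. This is a sketch under stated assumptions, not the module's actual code: the name `load_parquet_sketch` is hypothetical, and it assumes a recent polars release in which `pl.scan_parquet` accepts a list of sources and a `hive_partitioning` flag.

```python
from typing import Any, List


def load_parquet_sketch(
    parquet_uris: List[str],
    columns: List[str],
    backend: str = "pandas",
    **backend_kwargs: Any,
):
    """Hypothetical dispatch over the three documented backends."""
    if backend == "pandas":
        import pandas as pd

        # Eager: read each file, then concatenate into a single DataFrame.
        frames = [
            pd.read_parquet(uri, columns=columns, **backend_kwargs)
            for uri in parquet_uris
        ]
        return pd.concat(frames, ignore_index=True)
    elif backend == "polars":
        import polars as pl

        # Lazy: nothing is read until .collect() is called on the result.
        lazy = pl.scan_parquet(parquet_uris, hive_partitioning=True, **backend_kwargs)
        return lazy.select(columns)
    elif backend == "dask":
        import dask.dataframe as dd

        # Lazy: returns a dask DataFrame backed by a task graph.
        return dd.read_parquet(parquet_uris, columns=columns, **backend_kwargs)
    else:
        raise ValueError(
            f"Unsupported backend {backend!r}; expected 'pandas', 'polars', or 'dask'."
        )
```

Importing each library inside its own branch means only the chosen backend needs to be installed, and an unknown backend fails fast with a ValueError before any file is touched.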
read_json
read_json(json_uri, schema_path=None)
Read and validate a JSON file from a URI.
Supports local and cloud storage URIs. If a JSON schema path is provided, the JSON file will be validated against the schema.
Parameters:
- json_uri (str) – URI pointing to the JSON file.
- schema_path (Optional[str], default: None) – Path to the JSON schema file for validation.
Returns:
- dict – The loaded JSON data.
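The read-then-validate flow can be sketched as below. This is an assumption-laden illustration rather than the module's implementation: `read_json_sketch` is a hypothetical name, it handles local paths only (the real function also supports cloud URIs), and it assumes the third-party `jsonschema` package is the validator.

```python
import json
from typing import Any, Dict, Optional


def read_json_sketch(json_uri: str, schema_path: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical local-only version of read-and-validate."""
    with open(json_uri, "r", encoding="utf-8") as f:
        data = json.load(f)

    if schema_path is not None:
        # Assumed validator; imported only when a schema is supplied.
        import jsonschema

        with open(schema_path, "r", encoding="utf-8") as f:
            schema = json.load(f)
        # Raises jsonschema.ValidationError if the data does not match.
        jsonschema.validate(instance=data, schema=schema)

    return data
```

When `schema_path` is omitted, the function simply returns the parsed JSON, which mirrors the optional validation described above.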