Dataset Overview
Basic stats on all datasets, from the manifest
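We can iterate over all the datasets in the catalog and, for each one, report the number of variables, the first and last day in the manifest, and the number of days between them. Comparing that span with the number of distinct days actually present in the manifest gives the number of missing days. Variables fall into four categories (primary data, secondary data, primary descriptors, secondary descriptors); the loop below prints a single count, len(ds.variables), rather than a per-category breakdown.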
In [1]:
from nnja_ai import DataCatalog
In [ ]:
import os
os.environ["NNJA_USE_AUTH"] = "true"
catalog = DataCatalog()
print("catalog json:", catalog.catalog_uri)
print("datasets in catalog:")
catalog.list_datasets()
catalog json: gs://bb-nnja-ai-dev/data/v1/catalog.json
datasets in catalog:
Out[ ]:
['amsua-1bamua-NC021023', 'atms-atms-NC021203', 'mhs-1bmhs-NC021027', 'cris-crisf4-NC021206', 'iasi-mtiasi-NC021241', 'geo-ahicsr-NC021044', 'geo-gsrasr-NC021045', 'geo-gsrcsr-NC021046', 'seviri-sevasr-NC021042', 'conv-adpsfc-NC000001', 'conv-adpsfc-NC000002', 'conv-adpsfc-NC000007', 'conv-adpsfc-NC000101', 'conv-adpupa-NC002001']
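The names in this list share a hyphen-separated pattern, so a quick way to see how many datasets belong to each family is to group on the text before the first hyphen. The following is a minimal sketch that assumes this prefix is a meaningful grouping key, which the catalog itself does not state; `group_by_family` is a hypothetical helper, not part of `nnja_ai`, and the names are a subset copied from the output above.

```python
from collections import defaultdict


def group_by_family(names):
    """Group dataset names by the text before the first hyphen.

    Assumption: the leading token (e.g. 'geo', 'conv') identifies a family.
    """
    groups = defaultdict(list)
    for name in names:
        family = name.split("-", 1)[0]
        groups[family].append(name)
    return dict(groups)


# A subset of the names printed by catalog.list_datasets() above.
names = [
    "amsua-1bamua-NC021023",
    "geo-ahicsr-NC021044",
    "geo-gsrasr-NC021045",
    "conv-adpsfc-NC000001",
    "conv-adpupa-NC002001",
]

families = group_by_family(names)
for family, members in families.items():
    print(f"{family}: {len(members)} dataset(s)")
```

With a live catalog, the same helper could be called on `catalog.list_datasets()` directly.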
In [ ]:
# For each dataset, load the manifest and count the variables. The manifest is
# indexed by OBS_DATE, so its min and max give the first and last day.
for dataset in catalog.list_datasets():
    ds = catalog[dataset]
    manifest = ds.manifest
    first_date = manifest.index.min()
    last_date = manifest.index.max()
    # Span between the first and last day. Note: this difference excludes the
    # end date, so a gap-free record has expected_days + 1 distinct days.
    expected_days = (last_date - first_date).days
    # Number of distinct days actually present in the manifest.
    actual_days = len(manifest.index.unique())
    print(" first date:", first_date.strftime("%Y-%m-%d"))
    print(" last date:", last_date.strftime("%Y-%m-%d"))
    print(f" days of data: {expected_days} (missing {expected_days - actual_days} days)")
    print(f" number of variables: {len(ds.variables)}")
 first date: 1998-10-25 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 9654 (missing 1707 days)
 number of variables: 49

Loading manifest for dataset 'atms-atms-NC021203'...
 first date: 2012-02-15 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 4793 (missing 17 days)
 number of variables: 199

Loading manifest for dataset 'mhs-1bmhs-NC021027'...
 first date: 2007-02-27 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 6607 (missing 29 days)
 number of variables: 29

Loading manifest for dataset 'cris-crisf4-NC021206'...
 first date: 2018-01-16 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 2631 (missing 15 days)
 number of variables: 474

Loading manifest for dataset 'iasi-mtiasi-NC021241'...
 first date: 2008-01-01 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 6299 (missing 46 days)
 number of variables: 648

Loading manifest for dataset 'geo-ahicsr-NC021044'...
 first date: 2019-12-01 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 1947 (missing 152 days)
 number of variables: 147

Loading manifest for dataset 'geo-gsrasr-NC021045'...
 first date: 2019-12-01 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 1947 (missing 199 days)
 number of variables: 182

Loading manifest for dataset 'geo-gsrcsr-NC021046'...
 first date: 2019-12-01 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 1947 (missing 175 days)
 number of variables: 219

Loading manifest for dataset 'seviri-sevasr-NC021042'...
 first date: 2022-03-01 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 1126 (missing 0 days)
 number of variables: 265

Loading manifest for dataset 'conv-adpsfc-NC000001'...
 first date: 1979-01-01 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 16891 (missing 2177 days)
 number of variables: 87

Loading manifest for dataset 'conv-adpsfc-NC000002'...
 first date: 2005-08-18 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 7165 (missing 187 days)
 number of variables: 71

Loading manifest for dataset 'conv-adpsfc-NC000007'...
 first date: 2005-08-18 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 7165 (missing 127 days)
 number of variables: 63

Loading manifest for dataset 'conv-adpsfc-NC000101'...
 first date: 2020-10-22 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 1621 (missing 124 days)
 number of variables: 101

Loading manifest for dataset 'conv-adpupa-NC002001'...
 first date: 2009-12-31 00:00:00+00:00
 last date: 2025-03-31 00:00:00+00:00
 days of data: 5569 (missing 124 days)
 number of variables: 265
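The "days of data" figure above is the difference between the last and first date, which does not count the end date itself. A gap-free daily record from day A to day B therefore has one more distinct day than that difference. The sketch below shows both counts side by side on a small synthetic index, so it runs without catalog access. `summarize_days` is a hypothetical helper (not part of `nnja_ai`), and it assumes the manifest index holds one timestamp per observation day, as the `OBS_DATE` index suggests.

```python
import pandas as pd


def summarize_days(index: pd.DatetimeIndex) -> dict:
    """Summarize daily coverage of a DatetimeIndex (hypothetical helper)."""
    # Collapse timestamps to midnight and drop duplicates so each day counts once.
    days = index.normalize().unique()
    first, last = days.min(), days.max()
    span_days = (last - first).days   # excludes the end date
    calendar_days = span_days + 1     # includes both endpoints
    return {
        "first": first.strftime("%Y-%m-%d"),
        "last": last.strftime("%Y-%m-%d"),
        "span_days": span_days,
        "calendar_days": calendar_days,
        "present_days": len(days),
        "missing_days": calendar_days - len(days),
    }


# Synthetic stand-in for a manifest index: ten calendar days, two removed.
full = pd.date_range("2025-03-01", "2025-03-10", freq="D", tz="UTC")
toy_index = full.delete([3, 7])

summary = summarize_days(toy_index)
print(pd.DataFrame([summary], index=["toy-dataset"]))
```

Under the inclusive count, each "missing" figure in the output above would be one larger than printed, provided the manifest index really is one entry per day.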
In [5]:
ds.manifest.index
Out[5]:
DatetimeIndex(['1998-10-25 00:00:00+00:00', '1998-10-26 00:00:00+00:00',
'1998-10-27 00:00:00+00:00', '1998-10-28 00:00:00+00:00',
'1998-10-29 00:00:00+00:00', '1998-10-30 00:00:00+00:00',
'1998-10-31 00:00:00+00:00', '1998-11-01 00:00:00+00:00',
'1998-11-02 00:00:00+00:00', '1998-11-03 00:00:00+00:00',
...
'2025-03-22 00:00:00+00:00', '2025-03-23 00:00:00+00:00',
'2025-03-24 00:00:00+00:00', '2025-03-25 00:00:00+00:00',
'2025-03-26 00:00:00+00:00', '2025-03-27 00:00:00+00:00',
'2025-03-28 00:00:00+00:00', '2025-03-29 00:00:00+00:00',
'2025-03-30 00:00:00+00:00', '2025-03-31 00:00:00+00:00'],
dtype='datetime64[ns, UTC]', name='OBS_DATE', length=7947, freq=None)
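An index like the one above has fewer entries than calendar days in its range, and it can be useful to see which days are absent rather than only how many. One possible approach, sketched here on a synthetic index, is to build the full daily range with `pd.date_range` and take the set difference. `missing_dates` and `gap_runs` are hypothetical helpers introduced for illustration; with a real dataset, `ds.manifest.index` would be passed in place of `toy_index`, assuming it is a daily `DatetimeIndex` as shown.

```python
import pandas as pd


def missing_dates(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Return the calendar days between the first and last entry that are absent."""
    days = index.normalize().unique()
    expected = pd.date_range(days.min(), days.max(), freq="D")
    return expected.difference(days)


def gap_runs(missing: pd.DatetimeIndex) -> list:
    """Collapse consecutive missing days into (start, end) pairs."""
    if len(missing) == 0:
        return []
    s = missing.to_series()
    # A new run starts wherever the step from the previous missing day is not one day.
    run_id = (s.diff() != pd.Timedelta(days=1)).cumsum()
    return [(g.iloc[0], g.iloc[-1]) for _, g in s.groupby(run_id)]


# Synthetic stand-in for a manifest index: ten days with three removed.
full = pd.date_range("2025-03-01", "2025-03-10", freq="D", tz="UTC")
toy_index = full.delete([3, 4, 7])

gaps = missing_dates(toy_index)
runs = gap_runs(gaps)
for start, end in runs:
    print(f"{start:%Y-%m-%d} to {end:%Y-%m-%d}: {(end - start).days + 1} day(s)")
```

Grouping into runs makes it easier to tell a handful of long outages from many scattered single-day gaps.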