netCDF4 Files
Writing Files
pysat includes support for creating netCDF4 files suitable for public scientific
distribution. Both the data and metadata attached to a pysat.Instrument object
are used to create a file that both humans and machines may understand and
parse without any outside information. This process is built with a variety of
options to help meet the range of needs of the scientific community.
For many users, a netCDF4 file suitable for distribution to research colleagues may be created using default parameters, as shown below.
import datetime as dt
import pysat
# Instantiate Instrument object
inst = pysat.Instrument('pysat', 'testing')
stime = dt.datetime(2009, 1, 1)
# Load data into Instrument
inst.load(date=stime)
# Create netCDF4 file
fname = stime.strftime('example/file/path/name/test_%Y%j.nc')
pysat.utils.io.inst_to_netcdf(inst, fname)
This process writes all of the data within inst.data to a netCDF4 file,
including the metadata stored in inst.meta.data and inst.meta.header. It
also adds a variety of supplemental attributes to the file indicating the
file's conventions, creation date, and more.
pysat's default conventions are a simplified implementation of the standards developed as part of NASA's Ionospheric Connections (ICON) Explorer Mission. ICON's standards were generated by creating the most compatible combination of parameters from other existing standards and software implementations within the community. The primary underlying standard comes from the Space Physics Data Facility (SPDF) International Solar Terrestrial Physics / Inter-Agency Consultative Group (ISTP/IACG) guidelines. That standard formally applies to NASA Common Data Format (CDF) files officially distributed to the public via government systems. The standard has been modified, as noted above, to accommodate and include basic netCDF4 standards. While the overlap between standards results in some duplicated information, pysat's default user-facing configuration minimizes this duplication.
A table of attributes written to every netCDF file is shown below. Any
pysat.Instrument attributes added by a user are also written to the file,
as demonstrated in the example following the table.
File Attribute | Description
---|---
acknowledgements | Acknowledgements from Instrument
Conventions | File metadata convention name
Date_End | Timestamp of last data entry
Date_Start | Timestamp of first data entry
File | Original filepath and filename
File_Date | Timestamp of last data entry
Generation_Date | YearMonthDay of file creation
inst_id | pysat.Instrument inst_id
Logical_File_ID | Filename without any path or type
name | pysat.Instrument name
pysat_version | pysat version information
platform | pysat.Instrument platform
references | Journal references from Instrument
tag | pysat.Instrument tag
Text_Supplement | Supplement string
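Any additional attribute attached directly to the Instrument object is written as a global file attribute alongside the defaults in the table; the new_thing value that appears in the file dump later in this section was added this way. A minimal sketch, reusing the inst and fname objects from the example above:
# Attach a custom attribute to the Instrument; it is written to the file
# as a global attribute along with the pysat defaults.
inst.new_thing = 1
pysat.utils.io.inst_to_netcdf(inst, fname)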
Metadata is also provided for each variable. An example of the default
metadata stored within a variable, as directly loaded using netCDF4, is
included below for the variable longitude. Note that pysat added the
Format, Depend_0, Display_Type, and Var_Type metadata parameters,
part of the SPDF standard.
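The variable-level output shown below may be reproduced with something along these lines, assuming fname points at the file written above:
import netCDF4

# Open the pysat-produced file and print the metadata attached to longitude
with netCDF4.Dataset(fname) as data:
    print(data.variables['longitude'])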
<class 'netCDF4._netCDF4.Variable'>
float64 longitude(Epoch)
units: degrees
long_name: Longitude
notes:
desc:
value_min: 0.0
value_max: 360.0
_FillValue : nan
FillVal : nan
fill: nan
Format: f8
Var_Type: data
Depend_0: Epoch
Display_Type: Time Series
unlimited dimensions: Epoch
current shape = (86400,)
filling on, default _FillValue of 9.969209968386869e+36 used
An example of the output produced when loading a pysat-produced file directly via netCDF4 is included below. Note that the pysat-produced file attributes are present along with some user-defined values, such as references and acknowledgements, that are attached to the pysat testing Instrument object. Further, for basic netCDF standards, as well as community compatibility, the fill metadata information is automatically replicated as fill, _FillValue, and FillVal.
netCDF4.Dataset(fname)
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
acknowledgements: Test instruments provided through the pysat project.
https://www.github.com/pysat/pysat
new_thing: 1
references: Stoneback, Russell, et al. (2021).
pysat/pysat v3.0 (Version v3.0). Zenodo.
http://doi.org/10.5281/zenodo.1199703
test_clean_kwarg:
test_init_kwarg:
test_preprocess_kwarg:
pysat_version: 3.0.1
Conventions: pysat-simplified SPDF ISTP/IACG for NetCDF
Text_Supplement:
Date_End: Thu, 01 Jan 2009, 2009-01-01T23:59:59.000 UTC
Date_Start: Thu, 01 Jan 2009, 2009-01-01T00:00:00.000 UTC
File: ['.', 'test.nc']
File_Date: Thu, 01 Jan 2009, 2009-01-01T23:59:59.000 UTC
Generation_Date: 20211022
Logical_File_ID: test
dimensions(sizes): Epoch(86400)
variables(dimensions): int64 Epoch(Epoch), float64 uts(Epoch),
float64 mlt(Epoch), float64 slt(Epoch), float64 longitude(Epoch),
float64 latitude(Epoch), float64 altitude(Epoch), int64 orbit_num(Epoch),
int64 dummy1(Epoch), int64 dummy2(Epoch), float64 dummy3(Epoch),
float64 dummy4(Epoch), <class 'str'> string_dummy(Epoch),
<class 'str'> unicode_dummy(Epoch), int8 int8_dummy(Epoch),
int16 int16_dummy(Epoch), int32 int32_dummy(Epoch), int64 int64_dummy(Epoch)
groups:
When writing files, pysat processes metadata for both xarray and pandas before writing the file. For xarray, pysat leverages xarray's built-in file writing capabilities. For pandas, pysat interfaces with netCDF4 directly to translate both 1D and higher-dimensional data into netCDF4.
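The same writing interface applies to an xarray-based Instrument. A minimal sketch, assuming the xarray test instrument is available as the ('pysat', 'testing_xarray') platform/name pair in this version of pysat:
# Load the xarray-based test Instrument and write it with the same call
inst_xr = pysat.Instrument('pysat', 'testing_xarray')
inst_xr.load(date=stime)
fname_xr = stime.strftime('example/file/path/name/test_xr_%Y%j.nc')
pysat.utils.io.inst_to_netcdf(inst_xr, fname_xr)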
Translating Metadata
Compatible file formats, such as those used by ICON, may achieve that compatibility by simultaneously adopting multiple standards. As different file standards may attempt to cover the same functionality, this can result in duplicated information. To minimize the impact of working with duplicated metadata, pysat includes support for automatically translating the metadata labels used at the Instrument level into one or more different labels used when writing the file. Thus, simple metadata labels may be maintained throughout a user's code, but when writing files the metadata labels will be expanded to maintain standards compatibility.
Consider the following example. The current metadata labels used by an
Instrument are accessed programmatically and used to define the range of
keys for a meta label translation table. Thus, regardless of the label settings
at runtime, the current metadata keys will be assigned appropriately.
The targets for the metadata labels at the file level are defined as the values
for each key in the dictionary. Fill metadata values, inst.meta.labels.fill_val,
will be written to the file as both '_FillValue' and 'FillVal'. Similarly, the
maximum and minimum supported variable values, inst.meta.labels.max_val and
inst.meta.labels.min_val, will be written to the file as 'ValidMax',
'Valid_Max' and 'ValidMin', 'Valid_Min', respectively. If no translation table
is provided then pysat will use a default translation that maps
inst.meta.labels.fill_val to '_FillValue', 'FillVal', and 'fill'.
# Define translation between metadata labels currently in use by
# the Instrument object (inst.meta.labels.*) and those that will
# be used when writing the netCDF file.
meta_translation_table = {inst.meta.labels.fill_val: ['_FillValue',
'FillVal'],
inst.meta.labels.desc: ['CatDesc'],
inst.meta.labels.name: ['Long_Name'],
inst.meta.labels.units: ['Units'],
inst.meta.labels.max_val: ['ValidMax',
'Valid_Max'],
inst.meta.labels.min_val: ['ValidMin',
'Valid_Min'],
inst.meta.labels.notes: ['Var_Notes']}
# Write netCDF file
pysat.utils.io.inst_to_netcdf(inst, fname,
meta_translation=meta_translation_table)
As noted above, pysat will add some metadata for variables as part of pysat's file standard. To further ensure compatibility with netCDF formats, boolean values are translated to integers (1/0 for True/False), and fill and range metadata for string variables is removed.
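As a small illustration of the boolean handling, a metadata value assigned as True at the Instrument level appears in the file as the integer 1; the 'calibrated_flag' label below is purely hypothetical:
# Hypothetical boolean metadata entry; written to the file as the integer 1
inst.meta['dummy1'] = {'calibrated_flag': True}
pysat.utils.io.inst_to_netcdf(inst, fname)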
The export_nan keyword in pysat.utils.io.inst_to_netcdf() controls which of
the metadata labels are allowed to transfer NaN values to the file. By default,
the fill_val, min_val, and max_val labels support NaN values.
Similarly, the check_type keyword accepts a list of metadata labels
where the type of the metadata value is compared against the data type of the
variable. By default, the fill_val, min_val, and max_val labels are checked.
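A minimal sketch passing both keywords explicitly, reproducing behavior equivalent to the defaults described above:
# Only these labels may carry NaN into the file, and only these labels have
# their metadata values type-checked against each variable's data type.
nan_labels = [inst.meta.labels.fill_val, inst.meta.labels.min_val,
              inst.meta.labels.max_val]
pysat.utils.io.inst_to_netcdf(inst, fname, export_nan=nan_labels,
                              check_type=nan_labels)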
Custom metadata labels, in addition to pysat's defaults, can be written to the
file by adding the information to a pysat.Instrument. The simplest
method is shown below. The case of the label is retained when writing to the
file.
# Add additional metadata to cover a default plot label, as used by ICON.
# Default values of '' for 'FieldNam' are added for all remaining variables.
# Remaining metadata labels for 'longitude' other than 'FieldNam' are left
# unchanged.
inst.meta['longitude'] = {'FieldNam': 'Geographic Longitude'}
# Create netCDF4 file
fname = stime.strftime('example/file/path/name/test_%Y%j.nc')
pysat.utils.io.inst_to_netcdf(inst, fname)
For the most general method of adding additional metadata, it is recommended
that a pysat.Instrument be instantiated with the additional
metadata labels, including the type.
import numpy as np

# Define SPDF metadata labels
labels = {'units': ('units', str), 'name': ('long_name', str),
'notes': ('notes', str), 'desc': ('desc', str),
'plot': ('plot_label', str), 'axis': ('axis', str),
'scale': ('scale', str),
'min_val': ('value_min', np.float64),
'max_val': ('value_max', np.float64),
'fill_val': ('fill', np.float64)}
# Instantiate instrument
inst = pysat.Instrument('pysat', 'testing', labels=labels)
# Define translation of pysat metadata labels to those in the netCDF file
meta_translation_table = {inst.meta.labels.fill_val: ['_FillValue',
'FillVal'],
inst.meta.labels.desc: ['CatDesc'],
inst.meta.labels.name: ['Long_Name'],
inst.meta.labels.units: ['Units'],
inst.meta.labels.max_val: ['ValidMax',
'Valid_Max'],
inst.meta.labels.min_val: ['ValidMin',
'Valid_Min'],
inst.meta.labels.notes: ['Var_Notes'],
inst.meta.labels.scale: ['ScaleTyp'],
inst.meta.labels.plot: ['FieldNam'],
inst.meta.labels.axis: ['LablAxis']}
# Write netCDF file
pysat.utils.io.inst_to_netcdf(inst, fname,
                              meta_translation=meta_translation_table)
The final opportunity to modify metadata before it is written to a file is
provided by the meta_processor keyword. This keyword accepts a function
that will receive a dictionary with all metadata, modify it as needed, and
return the modified dictionary. The returned dictionary
will then be written to the netCDF file. The function itself provides
an opportunity for developers to add, modify, or delete metadata in any manner.
Note that the processor function is applied as the last step in pysat's
metadata processing; all translations, filtering, or other modifications
to metadata are applied before the meta_processor.
def example_processor(meta_dict):
    """Example meta processor function.

    Parameters
    ----------
    meta_dict : dict
        Dictionary with all metadata information, keyed by variable name.

    Returns
    -------
    meta_dict : dict
        Updated metadata information.

    """

    for variable in meta_dict.keys():
        for label in meta_dict[variable].keys():
            # Metadata values may not be strings, so cast before joining.
            fstr = ''.join(['Information for variable: ', variable,
                            ' and label: ', label, ' is easily accessible.',
                            ' Value is: ', str(meta_dict[variable][label])])
            print(fstr)

    return meta_dict
# Write netCDF file
pysat.utils.io.inst_to_netcdf(inst, fname,
                              meta_translation=meta_translation_table,
                              meta_processor=example_processor)
Loading Files
pysat includes support for loading netCDF4 files, particularly those produced
by pysat, directly into compatible pandas and xarray formats. These routines
will load the data and metadata into the appropriate structures. pysat netCDF
files may also be directly loaded into a general pysat.Instrument.
Loading functions are provided under pysat.utils.io and include a
general data-independent interface, pysat.utils.io.load_netcdf(), as well
as pandas and xarray specific readers
(pysat.utils.io.load_netcdf_pandas() and
pysat.utils.io.load_netcdf_xarray()). These functions are intended to
be used within a pysat.Instrument support module, particularly the
load() function.
For example, consider the complete Instrument load function needed (for a single data set) when loading a pysat-produced file into pandas. For more information on adding a new data set to pysat, see Adding a New Instrument.
def load(fnames, tag='', inst_id=''):
    """Load the example Instrument pysat-produced data files.

    Parameters
    ----------
    fnames : list
        List of filenames
    tag : str
        Instrument tag (accepts '' or a string to change the behaviour of
        certain instrument aspects for testing). (default='')
    inst_id : str
        Instrument ID (accepts ''). (default='')

    Returns
    -------
    data : pds.DataFrame
        Instrument data
    meta : pysat.Meta
        Metadata

    """

    return pysat.utils.io.load_netcdf_pandas(fnames)
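Outside of an Instrument support module, the same reader may be called directly; a minimal sketch, assuming fnames is a list of paths to pysat-produced netCDF files:
# Read data and metadata directly, without an Instrument support module.
data, meta = pysat.utils.io.load_netcdf_pandas(fnames)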
Now consider loading the file written in the example shown in
Section Writing Files. Because this pysat.Instrument
module may support either pandas or xarray data, the expected type must be
specified upon pysat.Instrument instantiation. pysat also expects
all filenames to have some type of date format. However, by using the
data_dir keyword argument, we can easily load files outside of
the standard pysat data paths.
import datetime as dt
import pysat
stime = dt.datetime(2009, 1, 1)
test_inst = pysat.Instrument("pysat", "netcdf", pandas_format=True,
data_dir='/example/file/path/name',
file_format='test_{year:04}{day:03}.nc')
test_inst.load(date=stime)
To enable support for a wider variety of netCDF file standards, pysat also
provides support for translating, dropping, and modifying metadata
information after it is loaded from the file but before it is input into a
pysat.Meta instance. We will use the file with SPDF standards
as an example.
The general order of metadata operations is: load from file, remove
netCDF4-specific metadata ('Format', 'Var_Type', 'Depend_0'),
apply table translations, apply the meta processor, apply a
meta array expander (pysat does not support array elements within metadata),
and finally load the metadata information into a pysat.Meta instance.
import numpy as np
# Define metadata labels. The keys are labels used by pysat,
# while the values are the labels in the file and their type.
# Only one type is currently supported for each metadata label.
labels = {'units': ('Units', str), 'name': ('Long_Name', str),
'notes': ('Var_Notes', str), 'desc': ('CatDesc', str),
'plot': ('FieldNam', str), 'axis': ('LablAxis', str),
'scale': ('ScaleTyp', str),
'min_val': ('Valid_Min', np.float64),
'max_val': ('Valid_Max', np.float64),
'fill_val': ('FillVal', np.float64)}
# Both 'ValidMin' and 'Valid_Min' are in the file with the same
# content; only one is needed.
drop_labels = ['ValidMin', 'ValidMax']
# Instantiate generic Instrument and pass in modification options
test_inst = pysat.Instrument("pysat", "netcdf", pandas_format=True,
data_dir='/example/file/path/name',
file_format='test_{year:04}{day:03}.nc',
load_labels=labels,
drop_meta_labels=drop_labels)
# Load data
test_inst.load(date=stime)
# Feedback on metadata
print(list(test_inst.meta.attrs()))
['FieldNam', 'LablAxis', 'ScaleTyp', 'units', 'long_name', 'notes',
'desc', 'value_min', 'value_max', 'fill']
Metadata labels for units, long name, notes, description, value min/max, and fill
were all translated to the default metadata labels of test_inst. The default
metadata labels don't include entries for all SPDF parameters, thus 'FieldNam',
'LablAxis', and 'ScaleTyp' retain the values in the file. While users can apply
their own labels when instantiating a pysat.Instrument, for
non-default metadata labels we recommend developers apply a translation table
to map the labels in the file to a more user-friendly label.
# Define metadata labels. The keys are labels used by pysat,
# while the values are the labels from the file and their type.
# The labels are applied last in the loading process.
# Only one type is currently supported for each metadata label.
labels = {'units': ('Units', str), 'name': ('Long_Name', str),
'notes': ('Var_Notes', str), 'desc': ('CatDesc', str),
'plot': ('plot', str), 'axis': ('axis', str),
'scale': ('scale', str),
'min_val': ('Valid_Min', np.float64),
'max_val': ('Valid_Max', np.float64),
'fill_val': ('fill', np.float64)}
# Generate custom meta translation table. When left unspecified the default
# table handles the multiple values for fill. We must recreate that
# functionality in our table. The targets for meta_translation should
# map to values in `labels` above.
meta_translation = {'FieldNam': 'plot', 'LablAxis': 'axis',
'ScaleTyp': 'scale', 'ValidMin': 'Valid_Min',
'Valid_Min': 'Valid_Min', 'ValidMax': 'Valid_Max',
'Valid_Max': 'Valid_Max', '_FillValue': 'fill',
'FillVal': 'fill'}
# Instantiate generic Instrument and pass in modification options.
test_inst = pysat.Instrument("pysat", "netcdf", pandas_format=True,
data_dir='/example/file/path/name',
file_format='test_{year:04}{day:03}.nc',
load_labels=labels,
meta_translation=meta_translation)
# Load data
test_inst.load(date=stime)
# Feedback on metadata
print(list(test_inst.meta.attrs()))
['fill', 'plot', 'axis', 'scale', 'units', 'long_name', 'notes', 'desc',
'value_min', 'value_max']
Note that drop_labels is no longer used. Instead, multiple metadata labels
in the file are mapped to a single label using the meta_translation keyword.
If there is an inconsistency in values during this process, a warning is issued.
The example below demonstrates how users can control the labels used to access metadata.
# Define metadata labels. The keys are labels used by pysat,
# while the values are the labels from the file and their type.
# The labels are applied last in the loading process.
# Only one type is currently supported for each metadata label.
local_labels = {'units': ('UNITS', str), 'name': ('LongEST_Name', str),
'notes': ('FLY', str), 'desc': ('DIGits', str),
'plot': ('plottER', str), 'axis': ('axisER', str),
'scale': ('scalER', str),
'min_val': ('INVALIDmin', np.float64),
'max_val': ('invalidMAX', np.float64),
'fill_val': ('fillerest', np.float64)}
# Instantiate generic Instrument and pass in modification options
test_inst = pysat.Instrument("pysat", "netcdf", pandas_format=True,
data_dir='/example/file/path/name',
file_format='test_{year:04}{day:03}.nc',
load_labels=labels, labels=local_labels,
meta_translation=meta_translation)
# Load data
test_inst.load(date=stime)
# Feedback on metadata
print(list(test_inst.meta.attrs()))
['UNITS', 'LongEST_Name', 'FLY', 'DIGits', 'INVALIDmin', 'invalidMAX',
'fillerest', 'plottER', 'axisER', 'scalER']
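With these custom labels in place, runtime metadata access uses the same names; a short sketch using the test_inst object from the example above:
# Access the units metadata for longitude via the custom label directly,
# or programmatically through the labels attribute.
print(test_inst.meta['longitude', 'UNITS'])
print(test_inst.meta['longitude', test_inst.meta.labels.units])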