netCDF4 Files

Writing Files

pysat includes support for creating netCDF4 files suitable for public scientific distribution. Both the data and metadata attached to a pysat.Instrument object are used to create a file that both humans and machines may understand and parse without any outside information. This process is built with a variety of options to help meet the range of needs of the scientific community.

For many users a netCDF4 file suitable for distribution to research colleagues may be created using default parameters as shown below.

import datetime as dt
import pysat

# Instantiate Instrument object
inst = pysat.Instrument('pysat', 'testing')
stime = dt.datetime(2009, 1, 1)

# Load data into Instrument
inst.load(date=stime)

# Create netCDF4 file
fname = stime.strftime('example/file/path/name/test_%Y%j.nc')
pysat.utils.io.inst_to_netcdf(inst, fname)

This process writes all of the data within inst.data to a netCDF4 file, including the metadata stored at inst.meta.data and inst.meta.header. It also adds a variety of supplemental attributes to the file indicating the file’s conventions, creation date, and more.

pysat’s default conventions are a simplified implementation of the standards developed for NASA’s Ionospheric Connections (ICON) Explorer Mission. ICON’s standards were generated by combining the most compatible parameters from existing standards and software implementations within the community. The primary underlying standard comes from the Space Physics Data Facility (SPDF) International Solar Terrestrial Physics (ISTP)/Inter-Agency Consultative Group (IACG) guidelines. That standard formally applies to NASA Common Data Format (CDF) files officially distributed to the public via government systems; it has been modified, as noted above, to accommodate and include basic netCDF4 standards. While the overlap between standards results in some duplicated information, pysat’s default user-facing configuration minimizes this duplication.

A table of attributes written to every netCDF file is shown below. Any pysat.Instrument attributes added by a user are also written to the file.

File Attribute     Description
-----------------  ----------------------------------
acknowledgements   Acknowledgements from Instrument
Conventions        File metadata convention name
Date_End           Timestamp of last data entry
Date_Start         Timestamp of first data entry
File               Original filepath and filename
File_Date          Timestamp of last data entry
Generation_Date    YearMonthDay of file creation
inst_id            pysat.Instrument inst_id
Logical_File_ID    Filename without any path or type
name               pysat.Instrument name
pysat_version      pysat version information
platform           pysat.Instrument platform
references         Journal references from Instrument
tag                pysat.Instrument tag
Text_Supplement    Supplement string

Metadata is also provided for each variable. An example of the default metadata stored with a variable, as loaded directly using netCDF4, is included below for the variable longitude. Note that pysat added the Format, Depend_0, Display_Type, and Var_Type metadata parameters, which are part of the SPDF standard.

<class 'netCDF4._netCDF4.Variable'>
float64 longitude(Epoch)
    units: degrees
    long_name: Longitude
    notes:
    desc:
    value_min: 0.0
    value_max: 360.0
    _FillValue: nan
    FillVal: nan
    fill: nan
    Format: f8
    Var_Type: data
    Depend_0: Epoch
    Display_Type: Time Series
unlimited dimensions: Epoch
current shape = (86400,)
filling on, default _FillValue of 9.969209968386869e+36 used

An example of the output produced when loading a pysat-produced file directly via netCDF4 is included below. Note that the pysat-produced file attributes are present along with some user-defined values, such as references and acknowledgements, that are attached to the pysat testing Instrument object. Further, for basic netCDF standards, as well as community compatibility, the fill metadata information is automatically replicated as fill, _FillValue, and FillVal.

netCDF4.Dataset(fname)

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    acknowledgements: Test instruments provided through the pysat project.
        https://www.github.com/pysat/pysat
    new_thing: 1
    references: Stoneback, Russell, et al. (2021).
        pysat/pysat v3.0 (Version v3.0). Zenodo.
        http://doi.org/10.5281/zenodo.1199703
    test_clean_kwarg:
    test_init_kwarg:
    test_preprocess_kwarg:
    pysat_version: 3.0.1
    Conventions: pysat-simplified SPDF ISTP/IACG for NetCDF
    Text_Supplement:
    Date_End: Thu, 01 Jan 2009,  2009-01-01T23:59:59.000 UTC
    Date_Start: Thu, 01 Jan 2009,  2009-01-01T00:00:00.000 UTC
    File: ['.', 'test.nc']
    File_Date: Thu, 01 Jan 2009,  2009-01-01T23:59:59.000 UTC
    Generation_Date: 20211022
    Logical_File_ID: test
    dimensions(sizes): Epoch(86400)
    variables(dimensions): int64 Epoch(Epoch), float64 uts(Epoch),
        float64 mlt(Epoch), float64 slt(Epoch), float64 longitude(Epoch),
        float64 latitude(Epoch), float64 altitude(Epoch), int64 orbit_num(Epoch),
        int64 dummy1(Epoch), int64 dummy2(Epoch), float64 dummy3(Epoch),
        float64 dummy4(Epoch), <class 'str'> string_dummy(Epoch),
        <class 'str'> unicode_dummy(Epoch), int8 int8_dummy(Epoch),
        int16 int16_dummy(Epoch), int32 int32_dummy(Epoch), int64 int64_dummy(Epoch)
    groups:

When writing files, pysat processes metadata for both xarray and pandas before writing. For xarray, pysat leverages xarray’s built-in file-writing capabilities. For pandas, pysat interfaces with netCDF4 directly to translate the data into netCDF4.

Translating Metadata

Compatible file formats, such as those used by ICON, may achieve that compatibility by simultaneously adopting multiple standards. As different file standards may cover the same functionality, this can result in duplicated information. To minimize the impact of working with duplicated metadata, pysat includes support for automatically translating the metadata labels used at the Instrument level into one or more different labels used when writing the file. Thus, simple metadata labels may be maintained throughout a user’s code but expanded when writing files to maintain standards compatibility.
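The label expansion itself is straightforward to picture. The sketch below uses plain Python with hypothetical dictionaries, not pysat internals, to show one Instrument-level label fanned out to several file-level labels:

```python
# Hypothetical Instrument-level metadata for one variable
var_meta = {'fill': -999.0, 'units': 'degrees'}

# Translation table: one working label maps to one or more file labels
translation = {'fill': ['_FillValue', 'FillVal', 'fill'],
               'units': ['Units']}

def expand_labels(meta, table):
    """Fan each working label out to its file-level label names."""
    out = {}
    for label, value in meta.items():
        # Labels without a translation entry pass through unchanged
        for file_label in table.get(label, [label]):
            out[file_label] = value
    return out

file_meta = expand_labels(var_meta, translation)
# file_meta now holds the same fill value under three different names
```

The reverse mapping, collapsing several file labels back onto one working label, is what pysat performs when loading such a file.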

Consider the following example. The current metadata labels used by an Instrument are accessed programmatically and used to define the keys of a metadata label translation table. Thus, regardless of the label settings at runtime, the current metadata keys will be assigned appropriately. The targets for the metadata labels at the file level are defined as the values for each key in the dictionary. Fill metadata values, inst.meta.labels.fill_val, will be written to the file as both ‘_FillValue’ and ‘FillVal’. Similarly, the maximum and minimum supported variable values, inst.meta.labels.max_val and inst.meta.labels.min_val, will be written to the file as ‘ValidMax’ and ‘Valid_Max’, and as ‘ValidMin’ and ‘Valid_Min’, respectively. If no translation table is provided, pysat will use a default translation that maps inst.meta.labels.fill_val to ‘_FillValue’, ‘FillVal’, and ‘fill’.

# Define translation between metadata labels currently in use by
# the Instrument object (inst.meta.labels.*) and those that will
# be used when writing the netCDF file.
meta_translation_table = {inst.meta.labels.fill_val: ['_FillValue',
                                                      'FillVal'],
                          inst.meta.labels.desc: ['CatDesc'],
                          inst.meta.labels.name: ['Long_Name'],
                          inst.meta.labels.units: ['Units'],
                          inst.meta.labels.max_val: ['ValidMax',
                                                     'Valid_Max'],
                          inst.meta.labels.min_val: ['ValidMin',
                                                     'Valid_Min'],
                          inst.meta.labels.notes: ['Var_Notes']}

# Write netCDF file
pysat.utils.io.inst_to_netcdf(inst, fname,
                              meta_translation=meta_translation_table)

As noted above pysat will add some metadata for variables as part of pysat’s file standard. To further ensure compatibility with netCDF formats, boolean values are translated to integers (1/0 for True/False), and fill and range metadata for string variables is removed.
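The boolean handling can be pictured as a simple pre-write pass over the metadata values. This is an illustrative sketch, not pysat's implementation:

```python
def netcdf_safe(value):
    """Translate booleans to integers for netCDF compatibility."""
    if isinstance(value, bool):
        return int(value)
    return value

# True becomes 1; everything else passes through unchanged
cleaned = {label: netcdf_safe(v)
           for label, v in {'flag': True, 'fill': -999.0}.items()}
```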

The export_nan keyword in pysat.utils.io.inst_to_netcdf() controls which metadata labels are allowed to transfer NaN values to the file. By default, the fill_val, min_val, and max_val labels support NaN values.

Similarly, the check_type keyword accepts a list of metadata labels where the type of the metadata value is compared against the data type of the variable. By default, the fill_val, min_val, and max_val labels are checked.
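A sketch of what such a type check might look like is shown below. This helper is illustrative only; pysat's actual comparison logic may differ:

```python
def type_matches(meta_value, data_value):
    """Illustrative check that a metadata value suits the data's type."""
    # Treat any numeric metadata as compatible with numeric data,
    # excluding booleans, which are handled separately.
    if isinstance(data_value, (int, float)) and not isinstance(data_value, bool):
        return isinstance(meta_value, (int, float)) and \
            not isinstance(meta_value, bool)
    return isinstance(meta_value, type(data_value))

ok = type_matches(-999.0, 1.5)   # float fill value for float data
bad = type_matches('n/a', 1.5)   # string fill value for float data
```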

Custom metadata labels, in addition to pysat’s defaults, can be written to the file by adding the information to a pysat.Instrument. The simplest method is shown below. The case of the label is retained when writing to the file.

# Add additional metadata to cover default plot label, like used by ICON.
# Default values of '' for 'FieldNam' are added for all remaining variables.
# Remaining metadata labels for 'longitude' other than 'FieldNam' are left
# unchanged.
inst.meta['longitude'] = {'FieldNam': 'Geographic Longitude'}

# Create netCDF4 file
fname = stime.strftime('example/file/path/name/test_%Y%j.nc')
pysat.utils.io.inst_to_netcdf(inst, fname)

For the most general method of adding additional metadata, it is recommended that a pysat.Instrument be instantiated with the additional metadata labels, including the type.

import numpy as np

# Define SPDF metadata labels
labels = {'units': ('units', str), 'name': ('long_name', str),
          'notes': ('notes', str), 'desc': ('desc', str),
          'plot': ('plot_label', str), 'axis': ('axis', str),
          'scale': ('scale', str),
          'min_val': ('value_min', np.float64),
          'max_val': ('value_max', np.float64),
          'fill_val': ('fill', np.float64)}

# Instantiate instrument
inst = pysat.Instrument('pysat', 'testing', labels=labels)

# Define translation of pysat metadata labels to those in the netCDF file
meta_translation_table = {inst.meta.labels.fill_val: ['_FillValue',
                                                      'FillVal'],
                          inst.meta.labels.desc: ['CatDesc'],
                          inst.meta.labels.name: ['Long_Name'],
                          inst.meta.labels.units: ['Units'],
                          inst.meta.labels.max_val: ['ValidMax',
                                                     'Valid_Max'],
                          inst.meta.labels.min_val: ['ValidMin',
                                                     'Valid_Min'],
                          inst.meta.labels.notes: ['Var_Notes'],
                          inst.meta.labels.scale: ['ScaleTyp'],
                          inst.meta.labels.plot: ['FieldNam'],
                          inst.meta.labels.axis: ['LablAxis']}

# Write netCDF file
pysat.utils.io.inst_to_netcdf(inst, fname,
                              meta_translation=meta_translation_table)

The final opportunity to modify metadata before it is written to a file is provided by the meta_processor keyword. This keyword accepts a function that receives a dictionary with all metadata, modifies it as needed, and returns the modified dictionary. The returned dictionary is then written to the netCDF file. The function provides an opportunity for developers to add, modify, or delete metadata in any manner. Note that the processor function is applied as the last step in pysat’s metadata processing; thus all translations, filtering, and other modifications to the metadata are applied before the meta_processor.

def example_processor(meta_dict):
    """Example meta processor function.

    Parameters
    ----------
    meta_dict : dict
        Dictionary with all metadata information, keyed by variable name.

    Returns
    -------
    meta_dict : dict
        Updated metadata information.

    """

    for variable in meta_dict.keys():
        for label in meta_dict[variable].keys():
            fstr = ''.join(['Information for variable: ', variable,
                            ' and label: ', label, ' is easily accessible.',
                            ' Value is: ', str(meta_dict[variable][label])])
            print(fstr)

    return meta_dict

# Write netCDF file
pysat.utils.io.inst_to_netcdf(inst, fname,
                              meta_translation=meta_translation,
                              meta_processor=example_processor)

Loading Files

pysat includes support for loading netCDF4 files, particularly those produced by pysat, directly into compatible pandas and xarray formats. These routines load the data and metadata into the appropriate structures. pysat netCDF files may also be loaded directly into a general pysat.Instrument. Loading functions are provided under pysat.utils.io and include a general, data-independent interface, pysat.utils.io.load_netcdf(), as well as pandas- and xarray-specific readers (pysat.utils.io.load_netcdf_pandas() and pysat.utils.io.load_netcdf_xarray()). These functions are intended to be used within a pysat.Instrument support module, particularly the load() function.

For example, consider the complete instrument load function needed (single dataset) when loading a pysat produced file into pandas. For more information on adding a new dataset to pysat, see Adding a New Instrument.

def load(fnames, tag='', inst_id=''):
    """Load the example Instrument pysat produced data files.

    Parameters
    ----------
    fnames : list
        List of filenames
    tag : str
        Instrument tag (accepts '' or a string to change the behaviour of
        certain instrument aspects for testing). (default='')
    inst_id : str
        Instrument ID (accepts ''). (default='')

    Returns
    -------
    data : pds.DataFrame
        Instrument data
    meta : pysat.Meta
        Metadata

    """

    return pysat.utils.io.load_netcdf_pandas(fnames)

Now consider loading the file written in the example shown in Section Writing Files. Because this pysat.Instrument module may support either pandas or xarray data, the expected type must be specified upon pysat.Instrument instantiation. pysat also expects all filenames to have some type of date format. However, by using the data_dir keyword argument, we can easily load files outside of the standard pysat data paths.

import datetime as dt
import pysat

stime = dt.datetime(2009, 1, 1)
test_inst = pysat.Instrument("pysat", "netcdf", pandas_format=True,
                             data_dir='/example/file/path/name',
                             file_format='test_{year:04}{day:03}.nc')
test_inst.load(date=stime)

To enable support for a wider variety of netCDF file standards, pysat also provides support for translating, dropping, and modifying metadata information after it is loaded from a file but before it is input into a pysat.Meta instance. We will use the file with SPDF standards as an example.

The general order of metadata operations is:

1. Load the metadata from the file.
2. Remove netCDF4-specific metadata (‘Format’, ‘Var_Type’, ‘Depend_0’).
3. Apply table translations.
4. Apply the meta processor.
5. Apply a meta array expander (pysat does not support array elements within metadata).
6. Load the metadata information into a pysat.Meta instance.
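The first steps of that pipeline can be sketched in plain Python. The dictionaries below are hypothetical, and this is not pysat's implementation:

```python
# Hypothetical per-variable metadata as read from a file
raw = {'Units': 'degrees', 'Format': 'f8', 'Var_Type': 'data',
       'Depend_0': 'Epoch', 'ValidMin': 0.0}

# Step 1: remove netCDF4-specific metadata
for key in ('Format', 'Var_Type', 'Depend_0'):
    raw.pop(key, None)

# Step 2: apply a table translation (file label -> working label)
translation = {'ValidMin': 'Valid_Min'}
translated = {translation.get(key, key): value for key, value in raw.items()}
```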

import numpy as np

# Define metadata labels, the keys are labels used by pysat,
# while the values are the labels in the file and type.
# Only one type is currently supported for each metadata label.
labels = {'units': ('Units', str), 'name': ('Long_Name', str),
          'notes': ('Var_Notes', str), 'desc': ('CatDesc', str),
          'plot': ('FieldNam', str), 'axis': ('LablAxis', str),
          'scale': ('ScaleTyp', str),
          'min_val': ('Valid_Min', np.float64),
          'max_val': ('Valid_Max', np.float64),
          'fill_val': ('FillVal', np.float64)}

# Both 'ValidMin' and 'Valid_Min' are in the file with the same
# content. Only need one.
drop_labels = ['ValidMin', 'ValidMax']

# Instantiate generic Instrument and pass in modification options
test_inst = pysat.Instrument("pysat", "netcdf", pandas_format=True,
                             data_dir='/example/file/path/name',
                             file_format='test_{year:04}{day:03}.nc',
                             load_labels=labels,
                             drop_meta_labels=drop_labels)
# Load data
test_inst.load(date=stime)

# Feedback on metadata
print(list(test_inst.meta.attrs()))

['FieldNam', 'LablAxis', 'ScaleTyp', 'units', 'long_name', 'notes',
 'desc', 'value_min', 'value_max', 'fill']

Metadata labels for units, long name, notes, description, value min/max, and fill were all translated to the default metadata labels of test_inst. The default metadata labels don’t include entries for all SPDF parameters, thus ‘FieldNam’, ‘LablAxis’, and ‘ScaleTyp’ retain the values from the file. While users can apply their own labels when instantiating a pysat.Instrument, for non-default metadata labels we recommend that developers apply a translation table to map the labels in the file to more user-friendly labels.

# Define metadata labels, the keys are labels used by pysat,
# while the values are the labels from the file and type.
# The labels are applied last in the loading process.
# Only one type is currently supported for each metadata label.
labels = {'units': ('Units', str), 'name': ('Long_Name', str),
          'notes': ('Var_Notes', str), 'desc': ('CatDesc', str),
          'plot': ('plot', str), 'axis': ('axis', str),
          'scale': ('scale', str),
          'min_val': ('Valid_Min', np.float64),
          'max_val': ('Valid_Max', np.float64),
          'fill_val': ('fill', np.float64)}

# Generate custom meta translation table. When left unspecified the default
# table handles the multiple values for fill. We must recreate that
# functionality in our table. The targets for meta_translation should
# map to values in `labels` above.
meta_translation = {'FieldNam': 'plot', 'LablAxis': 'axis',
                    'ScaleTyp': 'scale', 'ValidMin': 'Valid_Min',
                    'Valid_Min': 'Valid_Min', 'ValidMax': 'Valid_Max',
                    'Valid_Max': 'Valid_Max', '_FillValue': 'fill',
                    'FillVal': 'fill'}

# Instantiate generic Instrument and pass in modification options.
test_inst = pysat.Instrument("pysat", "netcdf", pandas_format=True,
                             data_dir='/example/file/path/name',
                             file_format='test_{year:04}{day:03}.nc',
                             load_labels=labels,
                             meta_translation=meta_translation)
# Load data
test_inst.load(date=stime)

# Feedback on metadata
print(list(test_inst.meta.attrs()))

['fill', 'plot', 'axis', 'scale', 'units', 'long_name', 'notes', 'desc',
 'value_min', 'value_max']

Note that drop_labels is no longer used. Instead, multiple metadata labels in the file are mapped to a single label using the meta_translation keyword. If an inconsistency in values is found during this process, a warning is issued.
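The many-to-one mapping with its consistency check can be sketched as follows. The helper is illustrative, not pysat's implementation:

```python
import warnings

def merge_labels(var_meta, translation):
    """Collapse several file labels onto one working label per variable."""
    out = {}
    for file_label, value in var_meta.items():
        target = translation.get(file_label, file_label)
        # Warn if two file labels map to the same target with different values
        if target in out and out[target] != value:
            warnings.warn("Inconsistent values found for label "
                          "'{:}'".format(target))
        out[target] = value
    return out

# 'ValidMin' and 'Valid_Min' both collapse onto 'Valid_Min'
merged = merge_labels({'ValidMin': 0.0, 'Valid_Min': 0.0, 'Units': 'degrees'},
                      {'ValidMin': 'Valid_Min'})
```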

The example below demonstrates how users can control the labels used to access metadata.

# Define metadata labels, the keys are labels used by pysat,
# while the values are the labels from the file and type.
# The labels are applied last in the loading process.
# Only one type is currently supported for each metadata label.
local_labels = {'units': ('UNITS', str), 'name': ('LongEST_Name', str),
                'notes': ('FLY', str), 'desc': ('DIGits', str),
                'plot': ('plottER', str), 'axis': ('axisER', str),
                'scale': ('scalER', str),
                'min_val': ('INVALIDmin', np.float64),
                'max_val': ('invalidMAX', np.float64),
                'fill_val': ('fillerest', np.float64)}

# Instantiate generic Instrument and pass in modification options
test_inst = pysat.Instrument("pysat", "netcdf", pandas_format=True,
                             data_dir='/example/file/path/name',
                             file_format='test_{year:04}{day:03}.nc',
                             load_labels=labels, labels=local_labels,
                             meta_translation=meta_translation)
# Load data
test_inst.load(date=stime)

# Feedback on metadata
print(list(test_inst.meta.attrs()))

['UNITS', 'LongEST_Name', 'FLY', 'DIGits', 'INVALIDmin', 'invalidMAX',
 'fillerest', 'plottER', 'axisER', 'scalER']