Saving and Loading Data
Data objects can be saved to and loaded from HDF5
files. HDF5 is a hierarchical binary data format that supports reading chunks of
a dataset from disk without loading all of it into memory (RAM), which gives us
an efficient way to work with large datasets.
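To make the partial-read idea concrete, here is a minimal sketch that uses the h5py library directly rather than the Data API described on this page. The file name, the dataset name "signal", and the array size are arbitrary choices made for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a one-million-element array to a temporary HDF5 file.
# "demo.h5" and "signal" are illustrative names, not part of the Data API.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("signal", data=np.arange(1_000_000, dtype=np.float64))

with h5py.File(path, "r") as f:
    dset = f["signal"]          # a handle to the on-disk dataset, no values read yet
    total_shape = dset.shape    # metadata only
    window = dset[1000:1010]    # reads just these 10 values into a NumPy array

print(total_shape, window.shape)
```

Indexing the dataset handle with a slice is what triggers the read, so memory use scales with the requested window rather than with the size of the file.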
Saving
To save a data object to disk, use the save() method:
>>> import numpy as np
>>> from torch_brain.data import RegularTimeSeries, IrregularTimeSeries, Data
>>> # Create a complex data object
>>> session = Data(
...     spikes=IrregularTimeSeries(
...         timestamps=np.array([1.2, 2.3, 3.1]),
...         unit_id=np.array([1, 2, 1]),
...     ),
...     behavior=RegularTimeSeries(
...         sampling_rate=100.0,
...         hand_vel=np.random.randn(400, 2),
...         eye_pos=np.random.randn(400, 2),
...         pupil_size=np.random.randn(400),
...     ),
...     domain="auto",
... )
>>> # Save to an HDF5 file on disk
>>> session.save("neural_data.h5")
Loading
To load data back from disk, use Data.load().
Let’s first load it “non-lazily” by passing lazy=False:
>>> # Read neural data from HDF5 file on disk
>>> session = Data.load("neural_data.h5", lazy=False)
>>> # Access neural data
>>> session.spikes.timestamps
array([1.2, 2.3, 3.1])
>>> session.behavior.sampling_rate
np.float64(100.0)
>>> # Slice
>>> sliced = session.slice(2., 4.)
>>> sliced
Data(
  behavior=RegularTimeSeries(
    eye_pos=[200, 2],
    hand_vel=[200, 2],
    pupil_size=[200]
  ),
  spikes=IrregularTimeSeries(
    timestamps=[2],
    unit_id=[2]
  ),
)
By setting lazy=False, we load the entire dataset into memory upfront.
This quickly becomes infeasible for large datasets (a few hundred GB
to a few TB). To address this, we provide a lazy-loading mode.
Lazy Loading
Under lazy loading:
Data is read from disk only when an attribute is accessed.
Slicing is also deferred: only the attributes you access get sliced, and only at the moment you access them. Importantly, only the sliced portion is read from disk, not the whole array, so slicing a small window out of a huge recording is cheap in both disk I/O and memory usage.
The same goes for masking (via methods such as ArrayDict.select_by_mask()): the mask is applied only when the attribute is actually requested.
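The deferred behavior listed above can be sketched with a toy model. CountingStore and LazySlice are hypothetical classes written for this illustration; they are not the library's LazyRegularTimeSeries or LazyIrregularTimeSeries, and the real implementation may differ. The store simply counts how many elements were "read from disk".

```python
import numpy as np


class CountingStore:
    """Hypothetical stand-in for an on-disk array that counts elements read."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self.elements_read = 0

    def read(self, start, stop):
        # Simulate a partial disk read of the half-open index range [start, stop).
        self.elements_read += stop - start
        return self._data[start:stop].copy()


class LazySlice:
    """Hypothetical lazy view: remembers an index range, reads only on access."""

    def __init__(self, store, start, stop):
        self._store = store
        self._start = start
        self._stop = stop
        self._cache = None

    def slice(self, start, stop):
        # Compose the requested range with the current one; nothing is read here.
        new_start = self._start + start
        new_stop = min(self._start + stop, self._stop)
        return LazySlice(self._store, new_start, new_stop)

    @property
    def values(self):
        # The first access triggers the read; later accesses reuse the cache.
        if self._cache is None:
            self._cache = self._store.read(self._start, self._stop)
        return self._cache


store = CountingStore(np.arange(1000))
lazy = LazySlice(store, 0, 1000)

window = lazy.slice(200, 400)       # deferred: no elements read yet
read_before_access = store.elements_read

first_value = window.values[0]      # triggers a read of 200 elements only
read_after_access = store.elements_read
```

In this sketch, slicing only adjusts the stored index range, and the cost of the read is paid once, on first access, for the sliced range alone.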
To load data in lazy mode, simply omit the lazy=False flag we used above:
>>> # omit lazy=False to load lazily
>>> session = Data.load("neural_data.h5")
>>> session
Data(
  behavior=LazyRegularTimeSeries(
    eye_pos=<HDF5 dataset "eye_pos": shape (400, 2), type "<f8">,
    hand_vel=<HDF5 dataset "hand_vel": shape (400, 2), type "<f8">,
    pupil_size=<HDF5 dataset "pupil_size": shape (400,), type "<f8">
  ),
  spikes=LazyIrregularTimeSeries(
    timestamps=<HDF5 dataset "timestamps": shape (3,), type "<f8">,
    unit_id=<HDF5 dataset "unit_id": shape (3,), type "<i8">
  ),
)
First, note that the internal objects are now LazyRegularTimeSeries and
LazyIrregularTimeSeries. Second, the presence of <HDF5 dataset...>
indicates that the arrays have not been loaded yet. Let’s see what happens when we
access eye_pos:
>>> session.behavior.eye_pos
array([[ 1.14566523, 0.22616446],
[-0.03963849, 0.11477352],
...
>>> session
Data(
  behavior=LazyRegularTimeSeries(
    eye_pos=[400, 2],
    hand_vel=<HDF5 dataset "hand_vel": shape (400, 2), type "<f8">,
    pupil_size=<HDF5 dataset "pupil_size": shape (400,), type "<f8">
  ),
  spikes=LazyIrregularTimeSeries(
    timestamps=<HDF5 dataset "timestamps": shape (3,), type "<f8">,
    unit_id=<HDF5 dataset "unit_id": shape (3,), type "<i8">
  ),
)
We can see that eye_pos has been loaded, while the remaining attributes
are still lazy. If we also access hand_vel and pupil_size, every attribute of
behavior is loaded, and behavior turns into a RegularTimeSeries object:
>>> session.behavior.hand_vel
array([[ 1.86527056e-01, 1.54714182e-01],
[ 2.75861600e-01, -5.30891532e-01],
...
>>> session.behavior.pupil_size
array([ 1.66121169e-01, 2.06565774e-01, -7.85847571e-01, ...])
>>> session
Data(
  behavior=RegularTimeSeries(
    eye_pos=[400, 2],
    hand_vel=[400, 2],
    pupil_size=[400]
  ),
  spikes=LazyIrregularTimeSeries(
    timestamps=<HDF5 dataset "timestamps": shape (3,), type "<f8">,
    unit_id=<HDF5 dataset "unit_id": shape (3,), type "<i8">
  ),
)
We can also slice a lazy object:
>>> sliced = session.slice(2., 4.)
>>> sliced
Data(
  behavior=RegularTimeSeries(
    eye_pos=[200, 2],
    hand_vel=[200, 2],
    pupil_size=[200]
  ),
  spikes=LazyIrregularTimeSeries( # Note that this remains lazy!!
    timestamps=<HDF5 dataset "timestamps": shape (3,), type "<f8">,
    unit_id=<HDF5 dataset "unit_id": shape (3,), type "<i8">
  ),
)
>>> sliced.spikes.timestamps
array([0.3, 1.1])
Here, spikes stayed lazy after slicing. When we finally access
sliced.spikes.timestamps, only the two timestamps that fall within the
[2, 4) window are read from disk, not the full timestamps array. The returned
values, 0.3 and 1.1, are the original timestamps 2.3 and 3.1 expressed relative
to the start of the window.
This is what makes lazy loading efficient: you only pay for the slice you ask for.
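The numbers in the sliced outputs can be reproduced with plain NumPy. This is a sketch of the arithmetic only, not the library's implementation. The helper names slice_irregular and regular_sample_count are hypothetical. It assumes a half-open window [start, end), timestamps shifted so the window starts at zero (as the output above suggests), and a regular series that begins at time 0.

```python
import numpy as np


def slice_irregular(timestamps, start, end):
    """Keep timestamps in [start, end) and shift them so the window starts at 0."""
    mask = (timestamps >= start) & (timestamps < end)
    return timestamps[mask] - start


def regular_sample_count(sampling_rate, start, end):
    """Number of samples in [start, end), assuming the series starts at time 0."""
    first_index = int(round(start * sampling_rate))
    last_index = int(round(end * sampling_rate))
    return last_index - first_index


timestamps = np.array([1.2, 2.3, 3.1])
sliced_timestamps = slice_irregular(timestamps, 2.0, 4.0)  # approximately [0.3, 1.1]
n_samples = regular_sample_count(100.0, 2.0, 4.0)          # 200 samples at 100 Hz

print(sliced_timestamps, n_samples)
```

The sample count matches the [200, 2] and [200] shapes reported for behavior after slicing, and the shifted timestamps match the values returned by sliced.spikes.timestamps up to floating-point rounding.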