Dataset

class Dataset(dataset_dir, recording_ids=None, transform=None, keep_files_open=True, namespace_attributes=None)

Bases: torch.utils.data.Dataset

PyTorch Dataset for loading time-slices of neural data recordings from HDF5 files.

The dataset is indexed by a DatasetIndex object, which contains a recording ID, a start time, and an end time.

This deviates from the standard PyTorch Dataset definition. Here, the Dataset by itself does not provide samples; it provides the means to access and work flexibly with complete recordings. Within this framework, it is the job of the sampler to provide the indices that slice the dataset into samples (see Samplers).

Loading is lazy in two respects:

time: only the requested time interval is loaded, without having to load the entire recording into memory, and
attributes: attributes are not loaded until they are requested, which is useful when only a small subset of the attributes is actually needed.
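The division of labor described above (the dataset holds whole recordings, the sampler decides which time windows become samples) can be illustrated with a minimal, library-free sketch. ToyIndex, ToyDataset, and toy_sampler are hypothetical stand-ins written for this illustration and are not part of torch_brain; the half-open window [start, end) is also an assumption of the sketch rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass
class ToyIndex:
    """Hypothetical stand-in mirroring the documented DatasetIndex fields."""
    recording_id: str
    start: float
    end: float


class ToyDataset:
    """Hypothetical stand-in: holds whole recordings and slices them on request."""

    def __init__(self, recordings):
        # recordings: dict mapping recording ID -> sorted list of event timestamps
        self._recordings = recordings

    def __getitem__(self, index):
        times = self._recordings[index.recording_id]
        # Keep only events inside the window [start, end) (assumed half-open).
        return [t for t in times if index.start <= t < index.end]


def toy_sampler(recording_id, domain_start, domain_end, window):
    """Yield back-to-back, non-overlapping windows that fit inside the domain."""
    start = domain_start
    while start + window <= domain_end:
        yield ToyIndex(recording_id, start, start + window)
        start += window


dataset = ToyDataset({"rec_a": [0.1, 0.4, 1.2, 1.9, 2.5]})
samples = [dataset[idx] for idx in toy_sampler("rec_a", 0.0, 3.0, window=1.0)]
print(samples)  # [[0.1, 0.4], [1.2, 1.9], [2.5]]
```

Swapping in a different sampler (for example, one that draws random windows) changes the samples without touching the dataset, which is the point of keeping the two roles separate.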

Parameters:

dataset_dir (str | Path) – Path to the directory containing HDF5 recording files.
recording_ids (Optional[list[str]]) – Optional list of recording IDs to include. These correspond to the filenames of the HDF5 files in the dataset directory. If None, all *.h5 files in the dataset directory will be used.
transform (Optional[Callable]) – Optional transform to apply to each data sample.
keep_files_open (bool) – If True, keeps HDF5 files open in memory for faster access. If False, files are opened on-demand. Default is True.
namespace_attributes (Optional[list[str]]) – List of nested attribute paths (e.g., “session.id”) that should be namespaced when recordings are loaded under a NestedDataset. See Namespacing. No namespacing is performed if set to None.

Subclassing:

Users are encouraged to subclass Dataset and optionally override:

get_recording_hook() to run light-weight custom post-processing on recordings just before get_recording() returns.
get_sampling_intervals() to customize how time-domain intervals are computed.
apply_namespace() to change how namespacing is applied to attributes. See Namespacing.

Namespacing:

When operating under a NestedDataset, “namespacing” automatically prefixes attribute values (e.g., session.id, subject.id) with the dataset name to avoid naming collisions when combining multiple datasets. The list of attributes that are to be namespaced can be set with namespace_attributes.

Example: With namespace_attributes=["session.id", "subject.id"], suppose you create a nested dataset from two datasets named ds1 and ds2. When you load a recording from ds1, the recording’s session.id and subject.id attributes will be prefixed with ds1/.

Subclasses can override apply_namespace() to customize how a namespace is applied.
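The prefixing described in this section can be sketched without the library. The helper below, toy_apply_namespace, is a hypothetical illustration and not the torch_brain implementation; it uses SimpleNamespace objects in place of a temporaldata.Data object and assumes each listed attribute is either a string or a sequence of strings.

```python
from types import SimpleNamespace


def toy_apply_namespace(data, namespace, namespace_attributes):
    """Prefix each dotted attribute path in-place (hypothetical helper)."""
    for path in namespace_attributes:
        *parents, leaf = path.split(".")
        obj = data
        # Walk down to the object that owns the final attribute.
        for name in parents:
            obj = getattr(obj, name)
        value = getattr(obj, leaf)
        if isinstance(value, str):
            setattr(obj, leaf, namespace + value)
        else:
            # Assumed to be a sequence of strings; prefix each element.
            setattr(obj, leaf, [namespace + v for v in value])
    return data


recording = SimpleNamespace(
    session=SimpleNamespace(id="2024-01-15"),
    subject=SimpleNamespace(id="monkey_c"),
)
toy_apply_namespace(recording, "ds1/", ["session.id", "subject.id"])
print(recording.session.id)  # ds1/2024-01-15
print(recording.subject.id)  # ds1/monkey_c
```

With this scheme, two datasets that both contain a subject called monkey_c end up with distinct IDs (ds1/monkey_c and ds2/monkey_c) once combined.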

property recording_ids: list[str]

Sorted list of recording IDs in the dataset.

get_recording(recording_id, _namespace='')

Get a lazy-loaded temporaldata.Data object for a recording.

Parameters:

recording_id (str) – The ID of the recording to load (one of the values in the recording_ids property).
_namespace (str) – Optional namespace prefix to apply to attributes.

Return type:

Data

Returns:

Lazy temporaldata.Data object containing the full recording.

get_sampling_intervals(*args, **kwargs)

Returns a dictionary of sampling intervals, keyed by recording ID. These are the intervals that can be sampled from each recording.

This dictionary will be used by torch_brain’s Samplers to know where to sample from.

The default method returns intervals containing the entire domain of each recording. This behavior can be overridden by subclasses to give out custom sampling intervals.

Return type:

dict[str, Interval]

Returns:

Dictionary mapping recording IDs to their time domain intervals.
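A subclass override of this method might look like the following sketch. ToyBase and SkipWarmup are hypothetical stand-ins rather than torch_brain classes, and plain (start, end) tuples are used in place of Interval objects so that the example runs without the library.

```python
class ToyBase:
    """Hypothetical stand-in for a dataset that knows each recording's domain."""

    def __init__(self, domains):
        # domains: dict mapping recording ID -> (start, end) in seconds
        self._domains = domains

    def get_sampling_intervals(self):
        # Default behavior: the entire domain of each recording.
        return dict(self._domains)


class SkipWarmup(ToyBase):
    """Excludes the first `warmup` seconds of every recording from sampling."""

    warmup = 5.0

    def get_sampling_intervals(self):
        full = super().get_sampling_intervals()
        return {
            # Clamp so the start never moves past the end of a short recording.
            rid: (min(start + self.warmup, end), end)
            for rid, (start, end) in full.items()
        }


ds = SkipWarmup({"rec_a": (0.0, 60.0), "rec_b": (10.0, 12.0)})
intervals = ds.get_sampling_intervals()
print(intervals)  # {'rec_a': (5.0, 60.0), 'rec_b': (12.0, 12.0)}
```

The same pattern applies to other restrictions, such as returning only the intervals of a training split.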

apply_namespace(data, namespace)

Apply a namespace prefix to specified nested attributes in the data.

This method modifies the data object in-place by prepending the namespace to string attributes or string arrays specified in namespace_attributes.

Can be overridden by subclasses to apply the namespace in a custom way.

Parameters:

data (Data) – The Data object to modify.
namespace (str) – The namespace prefix to prepend (e.g., “experiment1/”).

Return type:

Data

Returns:

The modified temporaldata.Data object (same instance, modified in-place).

get_recording_hook(data)

Hook method called after loading a recording in get_recording().

Subclasses can override this method to perform custom processing on recordings after they are loaded but before they are returned.

Parameters:

data (Data) – The Data object that was just loaded.

Return type:

None
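Because the hook returns None, any post-processing has to modify the data object in-place. The sketch below shows the pattern with hypothetical stand-ins: ToyBase imitates the documented call order (load, run the hook, return), and WithDuration adds a derived attribute. Neither class is part of torch_brain, and a SimpleNamespace stands in for the Data object.

```python
from types import SimpleNamespace


class ToyBase:
    """Hypothetical stand-in that calls the hook just before returning."""

    def __init__(self, recordings):
        # recordings: dict mapping recording ID -> data object
        self._recordings = recordings

    def get_recording(self, recording_id):
        data = self._recordings[recording_id]
        self.get_recording_hook(data)  # return value is ignored
        return data

    def get_recording_hook(self, data):
        # Default hook: do nothing.
        pass


class WithDuration(ToyBase):
    """Adds a derived `duration` attribute to every loaded recording."""

    def get_recording_hook(self, data):
        # Modify in-place; the hook itself returns None.
        data.duration = data.end - data.start


ds = WithDuration({"rec_a": SimpleNamespace(start=10.0, end=40.0)})
rec = ds.get_recording("rec_a")
print(rec.duration)  # 30.0
```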

class DatasetIndex(recording_id, start, end, _namespace='')

Bases: object

Index for accessing a specific time interval of a recording within a Dataset.

Parameters:

recording_id (str) – The unique identifier for the recording to access.
start (float) – Start time of the interval (in seconds or appropriate time units).
end (float) – End time of the interval (in seconds or appropriate time units).
_namespace (str) – Optional namespace prefix for attribute namespacing. Used internally by torch_brain.dataset.NestedDataset to handle nested namespaced attributes.

recording_id: str

start: float

end: float