Dataset
- class Dataset(dataset_dir, recording_ids=None, transform=None, keep_files_open=True, namespace_attributes=None)
Bases: Dataset
PyTorch Dataset for loading time slices of neural data recordings from HDF5 files.
The dataset is indexed by a DatasetIndex object, which contains a recording ID and a start and end time.
This deviates from the standard PyTorch Dataset definition. Here, the Dataset by itself does not provide samples; it provides the means to flexibly access complete recordings. Within this framework, it is the sampler's job to provide the indices used to slice the dataset into samples (see Samplers).
- Lazy loading applies along two axes:
time: only the requested time interval is loaded, without loading the entire recording into memory, and
attributes: attributes are not loaded until they are requested, which is useful when only a small subset of the attributes is needed.
- Parameters:
dataset_dir (str | Path) – Path to the directory containing HDF5 recording files.
recording_ids (Optional[list[str]]) – Optional list of recording IDs to include. These correspond to the filenames of the HDF5 files in the dataset directory. If None, all *.h5 files in the dataset directory will be used.
transform (Optional[Callable]) – Optional transform to apply to each data sample.
keep_files_open (bool) – If True, keeps HDF5 files open in memory for faster access. If False, files are opened on demand. Default is True.
namespace_attributes (Optional[list[str]]) – List of nested attribute paths (e.g., “session.id”) that should be namespaced when loading recordings in a NestedDataset situation. See Namespacing. No namespacing is performed if set to None.
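The relationship between recording_ids and the files in dataset_dir can be sketched with the standard library alone. This is an illustration, not the library's implementation: resolve_recording_ids is a hypothetical helper, and it assumes a recording ID is the filename without the .h5 extension.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_recording_ids(dataset_dir, recording_ids: Optional[list] = None) -> list:
    """Hypothetical helper: map a directory of HDF5 files to recording IDs."""
    dataset_dir = Path(dataset_dir)
    if recording_ids is None:
        # Fall back to every *.h5 file; the ID is assumed to be the filename stem.
        return sorted(p.stem for p in dataset_dir.glob("*.h5"))
    # Otherwise keep the caller's IDs, checking that each file exists.
    missing = [r for r in recording_ids if not (dataset_dir / f"{r}.h5").exists()]
    if missing:
        raise FileNotFoundError(f"No HDF5 file for recording IDs: {missing}")
    return list(recording_ids)


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sess_b.h5", "sess_a.h5", "notes.txt"):
        (Path(tmp) / name).touch()
    all_ids = resolve_recording_ids(tmp)                # every *.h5 file
    subset_ids = resolve_recording_ids(tmp, ["sess_b"])  # explicit subset
```

In this sketch the non-HDF5 file is ignored, and an ID with no matching file raises an error; how the real class handles a missing file is not stated on this page.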
- Subclassing:
Users are encouraged to subclass Dataset and optionally override:
get_recording_hook() to run light-weight custom post-processing on recordings just before get_recording() returns.
get_sampling_intervals() to customize how time-domain intervals are computed.
apply_namespace() to change how namespacing is applied to attributes. See Namespacing.
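The hook-based subclassing pattern can be illustrated with a small stand-in. BaseSketch below is hypothetical and is not the real Dataset: recordings are plain dicts, and the hook is assumed to modify the recording in place, which this page does not confirm.

```python
class BaseSketch:
    """Hypothetical stand-in for the hook pattern, not the real Dataset."""

    def __init__(self, recordings):
        self._recordings = recordings

    def get_recording(self, recording_id):
        # Copy so the hook never mutates the stored source recording.
        data = dict(self._recordings[recording_id])
        # The hook runs just before the recording is returned.
        self.get_recording_hook(data)
        return data

    def get_recording_hook(self, data):
        """Default hook: do nothing."""


class WithDuration(BaseSketch):
    """Subclass that adds a derived field through the hook."""

    def get_recording_hook(self, data):
        data["duration"] = data["end"] - data["start"]


ds = WithDuration({"rec_a": {"start": 1.0, "end": 3.5}})
recording = ds.get_recording("rec_a")
```

The subclass only overrides the hook; the loading logic in get_recording stays in the base class, which is the point of keeping the post-processing light-weight.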
- Namespacing:
When operating under a NestedDataset, “namespacing” automatically prefixes attribute values (e.g., session.id, subject.id) with the dataset name to avoid naming collisions when combining multiple datasets. The list of attributes to be namespaced can be set with namespace_attributes.
Example: With namespace_attributes=["session.id", "subject.id"], say you create a nested dataset with two datasets named ds1 and ds2. Now, when you load a recording from ds1, the recording’s session.id and subject.id attributes will be prefixed with ds1/.
Subclasses can override apply_namespace() to customize how a namespace is applied.
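The prefixing described above can be sketched without the library. In this illustration, apply_namespace_sketch is a hypothetical function, recordings are stood in for by SimpleNamespace objects rather than temporaldata.Data, and only plain string attributes are handled (string arrays are left out).

```python
from types import SimpleNamespace


def apply_namespace_sketch(data, namespace, namespace_attributes):
    """Prefix the string attributes named by dotted paths, in place."""
    for path in namespace_attributes:
        *parents, leaf = path.split(".")
        obj = data
        for name in parents:
            obj = getattr(obj, name)  # walk down to the parent object
        setattr(obj, leaf, namespace + getattr(obj, leaf))
    return data


def make_recording():
    # Two datasets that happen to reuse the same session and subject IDs.
    return SimpleNamespace(
        session=SimpleNamespace(id="day1"),
        subject=SimpleNamespace(id="monkey_a"),
    )


attrs = ["session.id", "subject.id"]
rec1 = apply_namespace_sketch(make_recording(), "ds1/", attrs)
rec2 = apply_namespace_sketch(make_recording(), "ds2/", attrs)
print(rec1.session.id, rec2.session.id)  # prints "ds1/day1 ds2/day1"
```

Without the prefix, both recordings would report the same session.id, which is the collision that namespacing avoids.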
- get_recording(recording_id, _namespace='')
Get a lazy-loaded temporaldata.Data object for a recording.
- Parameters:
recording_id (str) – The ID of the recording to load (same as from recording_ids()).
_namespace (str) – Optional namespace prefix to apply to attributes.
- Return type:
Data
- Returns:
Lazy temporaldata.Data object containing the full recording.
- get_sampling_intervals(*args, **kwargs)
Returns a dictionary of sampling intervals for each recording. This represents the intervals that can be sampled from each session.
This dictionary will be used by torch_brain’s Samplers to know where to sample from.
The default method returns intervals containing the entire domain of each recording. This behavior can be overridden by subclasses to give out custom sampling intervals.
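How a sampler might consume such a dictionary can be sketched as follows. The representation is an assumption made for illustration: intervals are plain (start, end) tuples keyed by recording ID, and fixed_windows is a hypothetical stand-in for a sampler, not one of torch_brain's Samplers.

```python
def fixed_windows(sampling_intervals, window_length):
    """Cut each interval into back-to-back windows of window_length."""
    indices = []
    for recording_id, intervals in sampling_intervals.items():
        for start, end in intervals:
            # Number of whole windows that fit; any remainder is dropped.
            n = int((end - start) // window_length)
            for k in range(n):
                t0 = start + k * window_length
                indices.append((recording_id, t0, t0 + window_length))
    return indices


# One recording sampled over its full domain, one restricted to two sub-intervals.
sampling_intervals = {
    "rec_a": [(0.0, 2.5)],
    "rec_b": [(1.0, 2.0), (5.0, 7.0)],
}
windows = fixed_windows(sampling_intervals, window_length=1.0)
```

Each resulting (recording_id, start, end) triple carries the same information as a DatasetIndex, so restricting the intervals restricts where samples can come from.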
- apply_namespace(data, namespace)
Apply a namespace prefix to specified nested attributes in the data.
This method modifies the data object in-place by prepending the namespace to string attributes or string arrays specified in namespace_attributes.
Can be overridden by subclasses to apply the namespace in a custom way.
- Parameters:
data (Data) – The Data object to modify.
namespace (str) – The namespace prefix to prepend (e.g., “experiment1/”).
- Return type:
Data
- Returns:
The modified temporaldata.Data object (same instance, modified in-place).
- get_recording_hook(data)
Hook method called after loading a recording in get_recording().
Subclasses can override this method to perform custom processing on recordings after they are loaded but before they are returned.
- Parameters:
data (Data) – The Data object that was just loaded.
- Return type:
- class DatasetIndex(recording_id, start, end, _namespace='')
Bases: object
Index for accessing a specific time interval of a recording within a Dataset.
- Parameters:
recording_id (str) – The unique identifier for the recording to access.
start (float) – Start time of the interval (in seconds or appropriate time units).
end (float) – End time of the interval (in seconds or appropriate time units).
_namespace (str) – Optional namespace prefix for attribute namespacing. Used internally by torch_brain.dataset.NestedDataset to handle nested namespaced attributes.
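Indexing a dataset by recording and time interval can be sketched with stand-ins. IndexSketch mirrors the fields listed above, while ToyDataset is hypothetical: its recordings are lists of (timestamp, value) pairs, and the half-open window [start, end) is an assumption of this sketch, not a documented property of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSketch:
    """Stand-in with the same fields as the DatasetIndex documented above."""
    recording_id: str
    start: float
    end: float
    _namespace: str = ""


class ToyDataset:
    """Hypothetical dataset: each recording is a list of (timestamp, value) pairs."""

    def __init__(self, recordings):
        self.recordings = recordings

    def __getitem__(self, index):
        events = self.recordings[index.recording_id]
        # Keep only events inside the requested window (half-open by assumption).
        return [(t, v) for t, v in events if index.start <= t < index.end]


toy = ToyDataset({"rec_a": [(0.1, "x"), (0.9, "y"), (1.4, "z")]})
sample = toy[IndexSketch("rec_a", start=0.5, end=1.5)]
print(sample)  # prints [(0.9, 'y'), (1.4, 'z')]
```

The index, not an integer position, decides what a sample is, which is why the same recording can yield many different samples.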