Dataset
- class torch_brain.dataset.Dataset(dataset_dir, recording_ids=None, transform=None, keep_files_open=True, namespace_attributes=None)
Bases:
torch.utils.data.dataset.Dataset
PyTorch Dataset for loading time-slices of neural data recordings from HDF5 files.
The dataset can be indexed by a DatasetIndex object, which contains a recording ID and a start and end time.
This definition deviates from the standard PyTorch Dataset definition. Here, the Dataset by itself does not provide you with samples, but rather the means to flexibly access and work with complete recordings. Within this framework, it is the job of the sampler to provide the indices that are used to slice the dataset into samples (see Samplers).
- Loading is lazy in two respects:
time: only the requested time interval is loaded, without having to load the entire recording into memory, and
attributes: attributes are not loaded until they are requested, which is useful when only a small subset of the attributes is actually needed.
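As a rough illustration of index-driven slicing, the sketch below uses plain Python stand-ins. `ToyIndex`, `ToyDataset`, and `fixed_window_indices` are hypothetical names, not torch_brain classes; the half-open window convention is an assumption, and the toy keeps everything in memory rather than reading lazily from HDF5.

```python
from dataclasses import dataclass


# Hypothetical stand-in for an index: a recording ID plus a time window.
@dataclass(frozen=True)
class ToyIndex:
    recording_id: str
    start: float
    end: float


class ToyDataset:
    """Holds whole recordings; returns a slice only when given an index."""

    def __init__(self, recordings):
        # recordings: dict mapping recording ID -> list of (timestamp, value)
        self.recordings = recordings

    def __getitem__(self, index):
        events = self.recordings[index.recording_id]
        # Assumption: keep events inside the half-open window [start, end).
        return [(t, v) for (t, v) in events if index.start <= t < index.end]


def fixed_window_indices(recording_id, start, end, window):
    """A toy 'sampler': tiles [start, end) with non-overlapping windows."""
    indices = []
    t = start
    while t + window <= end:
        indices.append(ToyIndex(recording_id, t, t + window))
        t += window
    return indices


recordings = {"rec_a": [(0.1, "x"), (0.6, "y"), (1.2, "z"), (1.9, "w")]}
dataset = ToyDataset(recordings)
indices = fixed_window_indices("rec_a", 0.0, 2.0, window=1.0)
samples = [dataset[i] for i in indices]
```

The point of the split is that the dataset never decides what a sample is; changing the windowing strategy only means changing the function that produces the indices.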
- Parameters:
dataset_dir (str | Path) – Path to the directory containing HDF5 recording files.
recording_ids (Optional[list[str]]) – Optional list of recording IDs to include. These correspond to the filenames of the HDF5 files in the dataset directory. If None, all *.h5 files in the dataset directory will be used.
transform (Optional[Callable]) – Optional transform to apply to each data sample.
keep_files_open (bool) – If True, keeps HDF5 files open in memory for faster access. If False, files are opened on-demand. Default is True.
namespace_attributes (Optional[list[str]]) – List of nested attribute paths (e.g., “session.id”) that should be namespaced when loading recordings in a NestedDataset situation. See Namespacing. No namespacing is performed if set to None.
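The fallback described for recording_ids can be pictured with a small standalone helper. `resolve_recording_ids` is a hypothetical function written for this illustration, not part of torch_brain, and it assumes that a recording ID is the filename without its .h5 suffix.

```python
import tempfile
from pathlib import Path


def resolve_recording_ids(dataset_dir, recording_ids=None):
    """Hypothetical helper mimicking the documented fallback to all *.h5 files."""
    dataset_dir = Path(dataset_dir)
    if recording_ids is None:
        # Assumption: a recording ID is the filename without the .h5 suffix.
        return sorted(p.stem for p in dataset_dir.glob("*.h5"))
    missing = [r for r in recording_ids if not (dataset_dir / f"{r}.h5").exists()]
    if missing:
        raise FileNotFoundError(f"No HDF5 file for recording IDs: {missing}")
    return list(recording_ids)


with tempfile.TemporaryDirectory() as tmp:
    # Two recordings plus an unrelated file that should be ignored.
    for name in ("session_b.h5", "session_a.h5", "notes.txt"):
        (Path(tmp) / name).touch()
    all_ids = resolve_recording_ids(tmp)
    subset = resolve_recording_ids(tmp, ["session_b"])
```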
- Subclassing:
Users are encouraged to subclass Dataset and optionally override:
get_recording_hook() to run lightweight custom post-processing on recordings just before get_recording() returns.
get_sampling_intervals() to customize how time-domain intervals are computed.
apply_namespace() to change how namespacing is applied to attributes. See Namespacing.
- Namespacing:
When operating under a NestedDataset, “namespacing” automatically prefixes attribute values (e.g., session.id, subject.id) with the dataset name to avoid naming collisions when combining multiple datasets. The list of attributes to be namespaced can be set with namespace_attributes.
Example: With namespace_attributes=["session.id", "subject.id"], say you create a nested dataset with two datasets named ds1 and ds2. Now, when you load a recording from ds1, the recording’s session.id and subject.id attributes will be prefixed with ds1/.
Subclasses can override apply_namespace() to customize how a namespace is applied.
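The prefixing described above can be sketched on plain Python objects. `prefix_attributes` is a hypothetical helper, not the library's apply_namespace(); it handles only plain string attributes (no string arrays) and uses `types.SimpleNamespace` in place of a real recording object.

```python
from types import SimpleNamespace


def prefix_attributes(data, namespace, attribute_paths):
    """Hypothetical sketch: prepend 'namespace/' to dotted string attributes, in place."""
    if not namespace:
        return data  # nothing to do without a namespace
    for path in attribute_paths:
        *parents, leaf = path.split(".")
        target = data
        for name in parents:
            # Walk down the nested attribute path, e.g. data.session
            target = getattr(target, name)
        value = getattr(target, leaf)
        setattr(target, leaf, f"{namespace}/{value}")
    return data


recording = SimpleNamespace(
    session=SimpleNamespace(id="2024-01-01"),
    subject=SimpleNamespace(id="monkey_c"),
)
result = prefix_attributes(recording, "ds1", ["session.id", "subject.id"])
```

Because the object is modified in place, `result` and `recording` refer to the same instance.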
- get_recording(recording_id, _namespace='')
Get a lazy-loaded temporaldata.Data object for a recording.
- Parameters:
recording_id (str) – The ID of the recording to load (same as from recording_ids()).
_namespace (str) – Optional namespace prefix to apply to attributes.
- Return type:
temporaldata.Data
- Returns:
Lazy temporaldata.Data object containing the full recording.
- __getitem__(index)
Get a time-sliced sample from the dataset.
If a transform was provided during construction, it will be applied to the sliced sample before returning.
- Parameters:
index (DatasetIndex) – Container for the recording ID and time interval.
- Return type:
temporaldata.Data
- Returns:
temporaldata.Data object containing the sliced time interval, optionally transformed.
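The slice-then-transform order can be shown with a minimal standalone function. `get_sample` and `rereference` are hypothetical names used only for this sketch, and the half-open window is an assumption rather than documented behavior.

```python
def get_sample(events, start, end, transform=None):
    """Hypothetical sketch: slice first, then apply the optional transform."""
    # Assumption: events in the half-open window [start, end) are kept.
    sample = [(t, v) for (t, v) in events if start <= t < end]
    if transform is not None:
        sample = transform(sample)
    return sample


def rereference(sample, origin=1.0):
    """Example transform: express timestamps relative to the window start."""
    return [(t - origin, v) for (t, v) in sample]


events = [(0.5, 1), (1.25, 2), (2.0, 3)]
raw = get_sample(events, 1.0, 2.0)
shifted = get_sample(events, 1.0, 2.0, transform=rereference)
```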
- get_sampling_intervals(*args, **kwargs)
Returns a dictionary of sampling intervals for each recording. This represents the intervals that can be sampled from each session.
This dictionary is used by torch_brain’s Samplers to know where to sample from.
The default method returns intervals containing the entire domain of each recording. Subclasses can override this behavior to provide custom sampling intervals.
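A simplified picture of the default behavior and of an override is given below. `ToyIntervalDataset` and `TrimmedDataset` are hypothetical classes, and intervals are represented as lists of (start, end) tuples for brevity, which is an assumption and may not match the interval type the library actually returns.

```python
class ToyIntervalDataset:
    """Hypothetical stand-in; maps each recording ID to its (start, end) domain."""

    def __init__(self, domains):
        self.domains = domains

    def get_sampling_intervals(self):
        # Default: one interval per recording covering its whole domain.
        return {rid: [(start, end)] for rid, (start, end) in self.domains.items()}


class TrimmedDataset(ToyIntervalDataset):
    """Override that skips the first `skip` seconds of every recording."""

    skip = 10.0

    def get_sampling_intervals(self):
        intervals = super().get_sampling_intervals()
        return {
            rid: [(min(start + self.skip, end), end) for (start, end) in spans]
            for rid, spans in intervals.items()
        }


domains = {"rec_a": (0.0, 60.0), "rec_b": (5.0, 30.0)}
default_intervals = ToyIntervalDataset(domains).get_sampling_intervals()
trimmed_intervals = TrimmedDataset(domains).get_sampling_intervals()
```

Building the override on top of `super()` keeps the default computation in one place, so a subclass only has to describe how it narrows the intervals.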
- apply_namespace(data, namespace)
Apply a namespace prefix to specified nested attributes in the data.
This method modifies the data object in-place by prepending the namespace to string attributes or string arrays specified in namespace_attributes.
Can be overridden by subclasses to apply the namespace in a custom way.
- Parameters:
- Return type:
temporaldata.Data
- Returns:
The modified temporaldata.Data object (same instance, modified in-place).
- get_recording_hook(data)
Hook method called after loading a recording in get_recording().
Subclasses can override this method to perform custom processing on recordings after they are loaded but before they are returned.
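The load-then-hook order can be sketched with a stand-in base class. `ToyBase` and `WithDuration` are hypothetical, recordings are plain dicts here, and the sketch assumes the hook mutates the recording in place with its return value ignored, which the text above does not specify.

```python
class ToyBase:
    """Hypothetical stand-in mirroring the documented load-then-hook order."""

    def __init__(self, store):
        self.store = store

    def get_recording(self, recording_id):
        data = dict(self.store[recording_id])  # copy so the store stays untouched
        self.get_recording_hook(data)  # assumption: hook mutates `data` in place
        return data

    def get_recording_hook(self, data):
        # Default hook does nothing.
        pass


class WithDuration(ToyBase):
    """Subclass that adds a derived field just before the recording is returned."""

    def get_recording_hook(self, data):
        data["duration"] = data["end"] - data["start"]


store = {"rec_a": {"start": 2.0, "end": 12.5}}
plain = ToyBase(store).get_recording("rec_a")
enriched = WithDuration(store).get_recording("rec_a")
```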