Dataset¶
- class Dataset(root, *, config=None, recording_id=None, split=None, transform=None, unit_id_prefix_fn=<function <lambda>>, session_id_prefix_fn=<function <lambda>>, subject_id_prefix_fn=<function <lambda>>)[source]¶
Bases: Dataset
This class abstracts a collection of lazily-loaded Data objects. Each Data object corresponds to a full recording. A recording is never fully loaded into memory, but rather lazy-loaded on the fly from disk.
The dataset can be indexed by a recording id and start and end times using the get method. This deviates from the standard PyTorch Dataset definition, which generally presents the dataset directly as samples. Here, the Dataset by itself does not provide you with samples, but rather the means to flexibly access and work with complete recordings.
Within this framework, it is the job of the sampler to provide a list of DatasetIndex objects that are used to slice the dataset into samples (see Samplers).
The lazy loading happens along two dimensions:
- time: only the requested time interval is loaded, without having to load the entire recording into memory, and
- attributes: an attribute is not loaded until it is requested, which is useful when only a small subset of the attributes is actually needed.
References to the underlying HDF5 files are opened, and are only closed when the Dataset object is destroyed.
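The "lazy in attributes" behaviour can be sketched with a self-contained toy class (LazyData below is illustrative only, not the library's actual implementation; the real objects read from HDF5 rather than calling in-memory loaders):

```python
class LazyData:
    """Toy object that defers loading attributes until first access."""

    def __init__(self, loaders):
        self._loaders = loaders  # attribute name -> zero-arg loader function
        self._cache = {}         # materialised attributes
        self.load_count = 0      # how many loads actually happened

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # i.e. for attributes that have not been materialised.
        if name in self._loaders:
            if name not in self._cache:
                self.load_count += 1
                self._cache[name] = self._loaders[name]()
            return self._cache[name]
        raise AttributeError(name)


data = LazyData({
    "spikes": lambda: [0.1, 0.5, 1.2],  # pretend these read from disk
    "lfp": lambda: [[0.0] * 1000],
})

print(data.load_count)  # 0 -- nothing loaded yet
print(data.spikes)      # triggers exactly one load
print(data.load_count)  # 1 -- 'lfp' was never touched
```

Accessing an attribute twice only loads it once, and attributes that are never requested are never read from disk.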
- Parameters:
  - root (str) – The root directory of the dataset.
  - config (Optional[str]) – The configuration file specifying the sessions to include.
  - brainset – The brainset to include. This is used to specify a single brainset, and can only be used if config is not provided.
  - session – The session to include. This is used to specify a single session, and can only be used if config is not provided.
  - split (Optional[str]) – The split of the dataset. This is used to determine the sampling intervals for each session. The split is optional, and is used to load a subset of the data in a session based on a predefined split.
  - transform (Optional[Callable[[Data], Any]]) – A transform to apply to the data. This transform should be a callable that takes a Data object and returns a Data object.
  - unit_id_prefix_fn (Callable[[Data], str]) – A function to generate prefix strings for unit IDs to ensure uniqueness across the dataset. It takes a Data object as input and returns a string that is prefixed to all unit ids in that Data object. Defaults to lambda data: f"{data.brainset.id}/{data.session.id}/".
  - session_id_prefix_fn (Callable[[Data], str]) – Same as unit_id_prefix_fn but for session ids. Defaults to lambda data: f"{data.brainset.id}/".
  - subject_id_prefix_fn (Callable[[Data], str]) – Same as unit_id_prefix_fn but for subject ids. Defaults to lambda data: f"{data.brainset.id}/".
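The effect of the prefix functions can be shown with plain Python. The defaults below are taken from the parameter documentation above; the SimpleNamespace stand-ins are illustrative, since real Data objects carry much more than ids:

```python
from types import SimpleNamespace

# Default prefix functions, as documented above.
unit_id_prefix_fn = lambda data: f"{data.brainset.id}/{data.session.id}/"
session_id_prefix_fn = lambda data: f"{data.brainset.id}/"

# Minimal stand-in for a Data object carrying brainset/session metadata
# (names are hypothetical, for illustration only).
data = SimpleNamespace(
    brainset=SimpleNamespace(id="my_brainset"),
    session=SimpleNamespace(id="session_01"),
    units=["unit_0", "unit_1"],
)

# Prefixing makes unit ids globally unique across the whole dataset,
# even if two sessions both contain a "unit_0".
prefixed_units = [unit_id_prefix_fn(data) + u for u in data.units]
print(prefixed_units)
# ['my_brainset/session_01/unit_0', 'my_brainset/session_01/unit_1']
```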
- get(recording_id, start, end)[source]¶
This is the main method to extract a slice from a recording. It returns a Data object that contains all data for recording recording_id between times start and end.
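The sampler-plus-get access pattern can be sketched with a toy stand-in (ToyDataset and the tuple-based index below are illustrative, not the library's actual classes; a real recording holds far more than a list of spike times):

```python
from bisect import bisect_left
from typing import NamedTuple


class DatasetIndexSketch(NamedTuple):
    """Illustrative stand-in for DatasetIndex: one sample = one time window."""
    recording_id: str
    start: float
    end: float


class ToyDataset:
    """Toy stand-in for Dataset: each recording is a sorted list of spike times."""

    def __init__(self, recordings):
        self.recordings = recordings  # recording_id -> sorted spike times

    def get(self, recording_id, start, end):
        """Return only the events in [start, end) -- the 'lazy in time' idea."""
        times = self.recordings[recording_id]
        lo, hi = bisect_left(times, start), bisect_left(times, end)
        return times[lo:hi]


dataset = ToyDataset({"my_brainset/session_01": [0.1, 0.5, 1.2, 2.7, 3.9]})

# A sampler would emit a list of indices like this; each maps to one sample.
index = DatasetIndexSketch("my_brainset/session_01", 1.0, 3.0)
sample = dataset.get(index.recording_id, index.start, index.end)
print(sample)  # [1.2, 2.7]
```

Note how only the requested window is touched; the rest of the recording is never materialised.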
- get_recording_data(recording_id)[source]¶
Returns the Data object corresponding to the recording recording_id. If the split is not None, the data object is sliced to the allowed sampling intervals for the split, to avoid any data leakage. RegularTimeSeries objects are converted to IrregularTimeSeries objects, since they are most likely no longer contiguous.
Warning
This method might load the full data object into memory; avoid multiple calls to this method if possible.
- get_sampling_intervals()[source]¶
Returns a dictionary of sampling intervals for each session. This represents the intervals that can be sampled from each session.
Note that these intervals will change depending on the split. If no split is provided, the full domain of the data is used.
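A sampler typically consumes this dictionary to draw windows that stay inside the allowed intervals. A minimal sketch, assuming intervals are (start, end) pairs in seconds (the dictionary shape and session id below are illustrative):

```python
def windows_within(intervals, window_length):
    """Enumerate non-overlapping fixed-length windows inside allowed intervals."""
    out = []
    for start, end in intervals:
        t = start
        while t + window_length <= end:
            out.append((t, t + window_length))
            t += window_length
    return out


# e.g. the kind of dictionary get_sampling_intervals() might return
# for a 'train' split (session id and intervals are made up).
sampling_intervals = {
    "my_brainset/session_01": [(0.0, 10.0), (15.0, 18.0)],
}

for session_id, intervals in sampling_intervals.items():
    windows = windows_within(intervals, window_length=2.0)
    print(session_id, windows)
# my_brainset/session_01 [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0),
#                         (6.0, 8.0), (8.0, 10.0), (15.0, 17.0)]
```

Because every emitted window lies inside an allowed interval, samples never cross into data held out for another split.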