Dataset

class Dataset(root, *, config=None, recording_id=None, split=None, transform=None, unit_id_prefix_fn=<function <lambda>>, session_id_prefix_fn=<function <lambda>>, subject_id_prefix_fn=<function <lambda>>)

Bases: Dataset

This class abstracts a collection of lazily-loaded Data objects. Each Data object corresponds to a full recording. A recording is never fully loaded into memory; it is lazy-loaded from disk on the fly.

The dataset can be indexed by a recording id and a start and end time using the get method. This is a deviation from the standard PyTorch Dataset definition, which generally presents the dataset directly as samples. Here, the Dataset by itself does not provide you with samples, but rather with the means to flexibly work with and access complete recordings.

Within this framework, it is the job of the sampler to provide a list of DatasetIndex objects that are used to slice the dataset into samples (see Samplers).
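As a rough sketch of this division of labor, the snippet below tiles per-recording intervals with fixed-length windows and emits one index per window. The DatasetIndex here is a local stand-in that only mirrors the three documented fields; make_windows, the interval layout, and the recording id are hypothetical and are not part of the library.

```python
from dataclasses import dataclass


@dataclass
class DatasetIndex:
    """Local stand-in with the three documented fields."""
    recording_id: str
    start: float
    end: float


def make_windows(intervals, window_length):
    """Hypothetical sampler: tile each interval with non-overlapping windows.

    `intervals` maps a recording id to a list of (start, end) pairs.
    A trailing remainder shorter than `window_length` is dropped.
    """
    indices = []
    for recording_id, spans in intervals.items():
        for span_start, span_end in spans:
            n_windows = int((span_end - span_start) // window_length)
            for k in range(n_windows):
                start = span_start + k * window_length
                indices.append(
                    DatasetIndex(recording_id, start, start + window_length)
                )
    return indices


# Made-up recording id and intervals, for illustration only.
intervals = {"brainset_a/session_1": [(0.0, 2.5), (10.0, 11.0)]}
indices = make_windows(intervals, window_length=1.0)
```

Each resulting index carries exactly the three arguments that get expects: a recording id, a start time, and an end time.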

The lazy loading is done both in:
  • time: only the requested time interval is loaded, without having to load the entire recording into memory, and

  • attributes: attributes are not loaded until they are requested, which is useful when only a small subset of the attributes is actually needed.

References to the underlying HDF5 files are opened, and are closed only when the Dataset object is destroyed.
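To make the attribute side of this concrete, here is a minimal, library-independent sketch of load-on-first-access. LazyRecord and its loader functions are hypothetical and only illustrate the idea; they say nothing about how the library implements it.

```python
class LazyRecord:
    """Toy container whose attributes are loaded on first access."""

    def __init__(self, loaders):
        # `loaders` maps an attribute name to a zero-argument function
        # that reads that attribute from storage.
        self._loaders = dict(loaders)
        self._cache = {}
        self.load_count = 0

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        loaders = self.__dict__.get("_loaders", {})
        if name not in loaders:
            raise AttributeError(name)
        if name not in self._cache:
            self._cache[name] = loaders[name]()
            self.load_count += 1
        return self._cache[name]


# In-memory lambdas stand in for reads from disk.
record = LazyRecord({
    "spikes": lambda: [0.1, 0.4, 0.9],
    "cursor": lambda: [(0.0, 1.0), (0.5, 1.2)],
})
first = record.spikes   # triggers the "spikes" loader
again = record.spikes   # served from the cache, no second load
```

The "cursor" loader is never called in this example, which is the point: attributes that are not requested cost nothing.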

Parameters:
  • root (str) – The root directory of the dataset.

  • config (Optional[str]) – The configuration file specifying the sessions to include.

  • brainset – The brainset to include. This is used to specify a single brainset, and can only be used if config is not provided.

  • session – The session to include. This is used to specify a single session, and can only be used if config is not provided.

  • split (Optional[str]) – The split of the dataset, used to determine the sampling intervals for each session. It is optional; when given, only the subset of each session's data that belongs to the predefined split is loaded.

  • transform (Optional[Callable[[Data], Any]]) – A transform to apply to the data. This transform should be a callable that takes a Data object and returns a Data object.

  • unit_id_prefix_fn (Callable[[Data], str]) – A function to generate prefix strings for unit IDs to ensure uniqueness across the dataset. It takes a Data object as input and returns a string that is prefixed to all unit ids in that Data object. Default corresponds to the function lambda data: f"{data.brainset.id}/{data.session.id}/"

  • session_id_prefix_fn (Callable[[Data], str]) – Same as unit_id_prefix_fn but for session ids. Default corresponds to the function lambda data: f"{data.brainset.id}/"

  • subject_id_prefix_fn (Callable[[Data], str]) – Same as unit_id_prefix_fn but for subject ids. Default corresponds to the function lambda data: f"{data.brainset.id}/"
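The prefix functions are ordinary callables, so their effect can be checked in isolation. The sketch below applies the documented defaults to a stand-in object; the SimpleNamespace object, the ids, and flat_unit_prefix are hypothetical and exist only for illustration.

```python
from types import SimpleNamespace

# Stand-in for a Data object, exposing only the fields the prefix functions read.
data = SimpleNamespace(
    brainset=SimpleNamespace(id="brainset_a"),
    session=SimpleNamespace(id="session_1"),
)

# Defaults as given in the parameter descriptions.
default_unit_prefix = lambda data: f"{data.brainset.id}/{data.session.id}/"
default_session_prefix = lambda data: f"{data.brainset.id}/"


def flat_unit_prefix(data):
    """Hypothetical custom prefix that avoids slashes in unit ids."""
    return f"{data.brainset.id}__{data.session.id}__"


unit_id = default_unit_prefix(data) + "unit_7"
session_id = default_session_prefix(data) + data.session.id
custom_unit_id = flat_unit_prefix(data) + "unit_7"
```

Any callable with the same input and return type could presumably be passed as unit_id_prefix_fn, provided the resulting ids remain unique across the dataset.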

get(recording_id, start, end)

This is the main method to extract a slice from a recording. It returns a Data object that contains all data for recording recording_id between times start and end.

Parameters:
  • recording_id (str) – The recording id of the slice. This is usually <brainset_id>/<session_id>

  • start (float) – The start time of the slice.

  • end (float) – The end time of the slice.
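The kind of time slicing that get performs can be illustrated on a plain sorted list of timestamps. This is a sketch only: slice_timestamps is a hypothetical helper, and the half-open convention [start, end) is an assumption, since the boundary behavior of get is not stated here.

```python
from bisect import bisect_left


def slice_timestamps(timestamps, start, end):
    """Return the sorted timestamps t with start <= t < end.

    Binary search finds both boundaries, so only the requested
    window is touched rather than the whole sequence.
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_left(timestamps, end)
    return timestamps[lo:hi]


# Made-up spike times in seconds.
spike_times = [0.2, 0.9, 1.0, 1.7, 2.0, 2.4]
window = slice_timestamps(spike_times, 1.0, 2.0)
```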

get_recording_data(recording_id)

Returns the data object corresponding to the recording recording_id. If the split is not None, the data object is sliced to the allowed sampling intervals for the split, to avoid any data leakage. RegularTimeSeries objects are converted to IrregularTimeSeries objects, since they are most likely no longer contiguous.

Warning

This method might load the full data object into memory. Avoid calling it multiple times if possible.
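The reason a regularly sampled series stops being regular after this slicing can be seen in a small, library-independent sketch. mask_regular_series and the intervals are hypothetical, and the half-open interval test is an assumption.

```python
def mask_regular_series(sampling_rate, n_samples, intervals):
    """Keep the samples of a regularly sampled signal that fall in any interval.

    Returns explicit timestamps, because the kept samples are generally
    no longer evenly spaced across interval boundaries.
    """
    timestamps = [k / sampling_rate for k in range(n_samples)]
    return [
        t for t in timestamps
        if any(start <= t < end for start, end in intervals)
    ]


# 20 samples at 10 Hz, restricted to two disjoint made-up intervals.
kept = mask_regular_series(
    sampling_rate=10.0, n_samples=20, intervals=[(0.0, 0.3), (1.0, 1.2)]
)
# Spacing between consecutive kept samples, rounded to hide float noise.
gaps = [round(b - a, 6) for a, b in zip(kept, kept[1:])]
```

The spacing is uniform inside each interval but jumps across the gap between them, so a single sampling rate can no longer describe the timestamps.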

get_sampling_intervals()

Returns a dictionary of sampling intervals for each session. This represents the intervals that can be sampled from each session.

Note that these intervals will change depending on the split. If no split is provided, the full domain of the data is used.

get_recording_config_dict()

Returns configs for each session in the dataset as a dictionary.

get_unit_ids()

Returns all unit ids in the dataset.

get_session_ids()

Returns the session ids of the dataset.

get_subject_ids()

Returns all subject ids in the dataset.

get_brainset_ids()

Returns all brainset ids in the dataset.

disable_data_leakage_check()

Disables the data leakage check.

Warning

Only do this if you are absolutely sure that there is no leakage between the current split and other splits (e.g. the test split).

class DatasetIndex(recording_id, start, end)

Bases: object

The dataset can be indexed by specifying a recording id and a start and end time.

recording_id: str
start: float
end: float