Embedding Layers

Embedding

A simple embedding layer.

InfiniteVocabEmbedding

An extendable embedding layer + tokenizer.

RotaryTimeEmbedding

Rotary embedding layer.

SinusoidalTimeEmbedding

Sinusoidal embedding layer.

class Embedding(num_embeddings, embedding_dim, init_scale=0.02, **kwargs)[source]

A simple extension of torch.nn.Embedding that allows more control over the weight initializer. The learnable weights of the module, of shape (num_embeddings, embedding_dim), are initialized from a normal distribution with mean 0 and standard deviation init_scale, written \(\mathcal{N}(0, \text{init_scale})\).

Parameters:
  • num_embeddings (int) – size of the dictionary of embeddings

  • embedding_dim (int) – the size of each embedding vector

  • init_scale (float, optional) – standard deviation of the normal distribution used for the initialization. Defaults to 0.02, a value commonly used in transformer models

  • **kwargs – Additional arguments. Refer to the documentation of torch.nn.Embedding for details
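To make the role of init_scale concrete, the following plain-Python sketch draws a weight table from a zero-mean normal distribution and checks that its empirical standard deviation is close to init_scale. It uses only the standard library and is not the library's code; the helper name init_weights is hypothetical.

```python
import random
import statistics

def init_weights(num_embeddings, embedding_dim, init_scale=0.02, seed=0):
    """Hypothetical helper: a (num_embeddings, embedding_dim) table of N(0, init_scale) samples."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, init_scale) for _ in range(embedding_dim)]
        for _ in range(num_embeddings)
    ]

weights = init_weights(num_embeddings=100, embedding_dim=64)

# Flatten the table and measure the spread of the sampled values.
flat = [w for row in weights for w in row]
empirical_std = statistics.pstdev(flat)

print(len(weights), len(weights[0]))  # 100 64
print(empirical_std)                  # close to init_scale
```

Note that init_scale is the standard deviation, not the variance, of the distribution.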

reset_parameters()[source]

Resets all learnable parameters of the module.

class InfiniteVocabEmbedding(embedding_dim, init_scale=0.02)[source]

Embedding layer with a vocabulary that can be extended. Vocabulary is saved along with the model, and is reloaded when the state_dict is loaded. This is useful when the vocabulary is dynamically generated, e.g. from a dataset. For this reason this class also plays the role of the tokenizer.

This layer is initially lazy, i.e. it does not have a weight matrix. The weight matrix is initialized when either:

  • the vocabulary is initialized via initialize_vocab(), or

  • the model is loaded from a checkpoint that contains the vocabulary.

If the vocabulary is initialized before load_state_dict() is called, an error will be raised if the vocabulary in the checkpoint does not match the vocabulary in the model. The order of the words in the vocabulary does not matter, as long as the words are the same.
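The order-insensitive comparison described above can be pictured with a small plain-Python sketch. This is only an illustration of the stated rule, not the library's implementation: the function name check_vocab_match and the choice of ValueError are assumptions.

```python
def check_vocab_match(model_vocab, checkpoint_vocab):
    """Hypothetical check: raise if the two vocabularies contain different words."""
    model_words, ckpt_words = set(model_vocab), set(checkpoint_vocab)
    if model_words != ckpt_words:
        missing = sorted(ckpt_words - model_words)
        extra = sorted(model_words - ckpt_words)
        raise ValueError(f"Vocabulary mismatch: missing={missing}, extra={extra}")

model_vocab = {"NA": 0, "apple": 1, "banana": 2, "cherry": 3}
reordered = {"NA": 0, "cherry": 1, "apple": 2, "banana": 3}
different = {"NA": 0, "apple": 1, "banana": 2, "date": 3}

# Same words in a different order: accepted.
check_vocab_match(model_vocab, reordered)

# A different set of words: rejected.
try:
    check_vocab_match(model_vocab, different)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(mismatch_detected)  # True
```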

If you would like to create a new variant of an existing InfiniteVocabEmbedding (that you loaded from a checkpoint), you can use:

  • extend_vocab() to add new words to the vocabulary. The embeddings for the new words will be initialized randomly.

  • subset_vocab() to select a subset of the vocabulary. The embeddings for the selected words will be copied from the original embeddings. The ids of the selected words will change, and tokenizer() will be updated accordingly.
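The bookkeeping behind extending and subsetting can be sketched in plain Python, with the vocabulary as an ordered mapping and the weight matrix as a list of rows. This is a toy model under stated assumptions, not the library's code: the functions extend and subset are hypothetical, and the "NA" row is set to zeros here purely for illustration.

```python
import random
from collections import OrderedDict

def extend(vocab, weight, new_words, embedding_dim, init_scale=0.02, rng=random):
    """Append new words with fresh ids and randomly initialized rows."""
    for word in new_words:
        if word in vocab:
            raise ValueError(f"{word!r} already in vocabulary")
        vocab[word] = len(vocab)
        weight.append([rng.gauss(0.0, init_scale) for _ in range(embedding_dim)])

def subset(vocab, weight, words):
    """Keep 'NA' plus the selected words; copy their rows and renumber ids from 0."""
    kept = ["NA"] + [w for w in words if w != "NA"]
    new_weight = [list(weight[vocab[w]]) for w in kept]  # KeyError if a word is unknown
    new_vocab = OrderedDict((w, i) for i, w in enumerate(kept))
    return new_vocab, new_weight

dim = 4
vocab = OrderedDict([("NA", 0)])
weight = [[0.0] * dim]
extend(vocab, weight, ["apple", "banana", "cherry"], dim)

old_banana = list(weight[vocab["banana"]])
sub_vocab, sub_weight = subset(vocab, weight, ["banana", "cherry"])

print(dict(sub_vocab))              # {'NA': 0, 'banana': 1, 'cherry': 2}
print(sub_weight[1] == old_banana)  # True
```

The id of "banana" changes from 2 to 1, but its embedding row is carried over unchanged.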

The tokenizer is accessible via tokenizer(), which is a Callable.

Warning

If you are only interested in loading a subset of words from a checkpoint, do not call initialize_vocab(). Instead, first load the checkpoint, then use subset_vocab().

Parameters:
  • embedding_dim (int) – Embedding dimension.

  • init_scale (float) – The standard deviation of the normal distribution used to initialize the embedding matrix. Default is 0.02.

initialize_vocab(vocab)[source]

Initialize the vocabulary with a list of words. This method should be called only once, and before the model is trained. If you would like to add new words to the vocabulary, use extend_vocab() instead.

Note

The special word “NA” is always in the vocabulary and is assigned index 0, which is used for padding.
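Because index 0 is reserved for padding, variable-length token sequences can be right-padded with 0 before batching. The sketch below shows the idea in plain Python; the helper pad_token_ids is hypothetical and the dictionary stands in for the layer's vocabulary.

```python
def pad_token_ids(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length with pad_id."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Stand-in for the vocabulary built by initialize_vocab(["apple", "banana", "cherry"]).
vocab = {"NA": 0, "apple": 1, "banana": 2, "cherry": 3}

batch = [["apple", "cherry", "apple"], ["banana"]]
token_ids = [[vocab[w] for w in words] for words in batch]
padded = pad_token_ids(token_ids)

print(padded)  # [[1, 3, 1], [2, 0, 0]]
```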

Parameters:

vocab (List[str]) – A list of words to initialize the vocabulary.

Example

>>> from torch_brain.nn import InfiniteVocabEmbedding

>>> embedding = InfiniteVocabEmbedding(64)

>>> vocab = ["apple", "banana", "cherry"]
>>> embedding.initialize_vocab(vocab)

>>> embedding.vocab
OrderedDict([('NA', 0), ('apple', 1), ('banana', 2), ('cherry', 3)])

>>> embedding.weight.shape
torch.Size([4, 64])

extend_vocab(vocab, exist_ok=False)[source]

Extend the vocabulary with a list of words. If a word already exists in the vocabulary, an error will be raised unless exist_ok is True. The embeddings for the new words will be initialized randomly, and new ids will be assigned to the new words.

Parameters:
  • vocab (List[str]) – A list of words to add to the vocabulary.

  • exist_ok (bool) – If True, the method will not raise an error if the new words already exist in the vocabulary and will skip them. Default is False.

Example

>>> from torch_brain.nn import InfiniteVocabEmbedding

>>> embedding = InfiniteVocabEmbedding(64)

>>> vocab = ["apple", "banana", "cherry"]
>>> embedding.initialize_vocab(vocab)
>>> embedding
InfiniteVocabEmbedding(embedding_dim=64, num_embeddings=4)

>>> new_words = ["date", "elderberry", "fig"]
>>> embedding.extend_vocab(new_words)
InfiniteVocabEmbedding(embedding_dim=64, num_embeddings=7)

>>> embedding.vocab
OrderedDict([('NA', 0), ('apple', 1), ('banana', 2), ('cherry', 3), ('date', 4), ('elderberry', 5), ('fig', 6)])

>>> embedding.weight.shape
torch.Size([7, 64])

subset_vocab(vocab, inplace=True)[source]

Select a subset of the vocabulary. The embeddings for the selected words will be copied from the original embeddings, and the ids for the selected words will be updated accordingly.

An error will be raised if one of the words does not exist in the vocabulary.

Parameters:
  • vocab (List[str]) – A list of words to select from the vocabulary.

  • inplace (bool) – If True, the method will modify the vocabulary and the weight matrix in place. If False, a new InfiniteVocabEmbedding will be returned with the selected words. Default is True.

Example

>>> from torch_brain.nn import InfiniteVocabEmbedding

>>> embedding = InfiniteVocabEmbedding(64)

>>> vocab = ["apple", "banana", "cherry"]
>>> embedding.initialize_vocab(vocab)
>>> embedding
InfiniteVocabEmbedding(embedding_dim=64, num_embeddings=4)

>>> selected_words = ["banana", "cherry"]
>>> embedding.subset_vocab(selected_words)
InfiniteVocabEmbedding(embedding_dim=64, num_embeddings=3)

>>> embedding.vocab
OrderedDict([('NA', 0), ('banana', 1), ('cherry', 2)])

>>> embedding.weight.shape
torch.Size([3, 64])

tokenizer(words)[source]

Convert a word or a list of words to their token indices.

Parameters:

words (Union[str, List[str]]) – A word or a list of words.

Returns:

A token index or a list of token indices.

Return type:

Union[int, List[int]]

Example

>>> from torch_brain.nn import InfiniteVocabEmbedding

>>> embedding = InfiniteVocabEmbedding(64)

>>> vocab = ["apple", "banana", "cherry"]
>>> embedding.initialize_vocab(vocab)

>>> embedding.tokenizer("banana")
2

>>> embedding.tokenizer(["apple", "cherry", "apple"])
[1, 3, 1]

detokenizer(index)[source]

Convert a token index to a word.

Parameters:

index (int) – A token index.

Returns:

A word.

Return type:

str

Example

>>> from torch_brain.nn import InfiniteVocabEmbedding

>>> embedding = InfiniteVocabEmbedding(64)

>>> vocab = ["apple", "banana", "cherry"]
>>> embedding.initialize_vocab(vocab)

>>> embedding.detokenizer(2)
'banana'

is_lazy()[source]

Returns True if the module is still lazy, i.e. its vocabulary and weight matrix have not been initialized yet.

Example

>>> from torch_brain.nn import InfiniteVocabEmbedding

>>> embedding = InfiniteVocabEmbedding(64)

>>> embedding.is_lazy()
True

>>> vocab = ["apple", "banana", "cherry"]
>>> embedding.initialize_vocab(vocab)

>>> embedding.is_lazy()
False

reset_parameters()[source]

Resets all learnable parameters of the module, but will not reset the vocabulary.

class RotaryTimeEmbedding(head_dim, rotate_dim, t_min, t_max)[source]

Rotary time/positional embedding layer. This module is designed to be used with torch_brain.nn.RotarySelfAttention and torch_brain.nn.RotaryCrossAttention to modulate the attention weights in accordance with relative timing/positions of the tokens. Original paper: RoFormer: Enhanced Transformer with Rotary Position Embedding
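The key property of rotary embeddings is that the attention score between a rotated query and a rotated key depends only on the difference of their timestamps. The following plain-Python sketch checks this for a single 2-D pair. It is an illustration of the general technique, not the library's code: the helper rotate_pair and the angle convention 2π·t/period are assumptions.

```python
import math

def rotate_pair(x, t, period):
    """Rotate the 2-D vector x by the angle 2*pi*t/period."""
    angle = 2.0 * math.pi * t / period
    c, s = math.cos(angle), math.sin(angle)
    return (x[0] * c - x[1] * s, x[0] * s + x[1] * c)

def dot(a, b):
    """Dot product of two 2-D vectors (the unnormalized attention score)."""
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.5)
period = 4.0

# Same time difference (0.7) at two different absolute times.
score_a = dot(rotate_pair(q, 1.0, period), rotate_pair(k, 1.7, period))
score_b = dot(rotate_pair(q, 10.0, period), rotate_pair(k, 10.7, period))

print(math.isclose(score_a, score_b, abs_tol=1e-9))  # True
```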

The time periods are computed using generate_logspace_timeperiods().

Parameters:
  • head_dim (int) – Dimension of the attention head.

  • rotate_dim (int) – Number of dimensions to rotate. You can choose to rotate only a portion of the head dimension using this parameter. E.g. PerceiverIO found rotating only half of the dimensions to be effective.

  • t_min (float) – Minimum period of the sinusoids. Set this to the smallest timescale the attention layer should care about.

  • t_max (float) – Maximum period of the sinusoids. Set this to the largest timescale the attention layer should care about.
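The effect of rotate_dim can be pictured with a plain-Python sketch that rotates only part of a head vector and leaves the remaining entries untouched. This is a toy model under assumptions that the text above does not confirm: that the first rotate_dim entries are the ones rotated, that adjacent entries form a pair, and that each pair uses one period. The helper rotate_partial is hypothetical.

```python
import math

def rotate_partial(x, t, periods):
    """Rotate the first 2*len(periods) entries of x pairwise; leave the rest unchanged."""
    out = list(x)
    for i, period in enumerate(periods):
        angle = 2.0 * math.pi * t / period
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

head_dim, rotate_dim = 8, 4
periods = [0.1, 1.0]  # rotate_dim // 2 periods, one per rotated pair (assumed)
x = [1.0] * head_dim
y = rotate_partial(x, t=0.25, periods=periods)

print(y[rotate_dim:])  # [1.0, 1.0, 1.0, 1.0]
```

Since each pair is only rotated, the norm of the vector is preserved.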

omega: Tensor

forward(timestamps)[source]

Computes the rotary embeddings for given timestamps, which can then be used by RotaryTimeEmbedding.rotate().

Parameters:

timestamps (torch.Tensor) – timestamps tensor.

Return type:

Tensor

static rotate(x, rotary_emb, unsqueeze_dim=2)[source]

Apply the rotary positional embedding to the input data.

Parameters:
  • x (torch.Tensor) – Input data.

  • rotary_emb (torch.Tensor) – The rotary embedding produced by a forward call of RotaryTimeEmbedding.

  • unsqueeze_dim (int, optional) – Dimension where heads are located in the input tensor. E.g. For input shape (batch, heads, seq_len, dim) use 1. For input shape (batch, seq_len, heads, dim) use 2. Defaults to 2.

Return type:

Tensor
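The two unsqueeze_dim settings described above can be checked with a shape-only sketch in plain Python. It assumes that rotary_emb has shape (batch, seq_len, dim) and that a size-1 axis is inserted at unsqueeze_dim so the embedding broadcasts over the heads axis; both are assumptions, and the helpers are hypothetical.

```python
def unsqueeze_shape(shape, dim):
    """Shape after inserting a size-1 axis at position dim (like torch.unsqueeze)."""
    return shape[:dim] + (1,) + shape[dim:]

def broadcastable(a, b):
    """True if two equal-rank shapes are broadcast-compatible."""
    return len(a) == len(b) and all(m == n or 1 in (m, n) for m, n in zip(a, b))

batch, heads, seq_len, dim = 2, 8, 16, 64
rotary_shape = (batch, seq_len, dim)  # assumed shape of rotary_emb

print(unsqueeze_shape(rotary_shape, 1))  # (2, 1, 16, 64)
print(unsqueeze_shape(rotary_shape, 2))  # (2, 16, 1, 64)
```

With unsqueeze_dim=1 the result lines up with an input of shape (batch, heads, seq_len, dim), and with unsqueeze_dim=2 it lines up with (batch, seq_len, heads, dim).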

static invert(rotary_emb)[source]

Invert/Negate rotary embedding. If the input embeddings correspond to a time \(t\), then the output embeddings correspond to time \(-t\).

Parameters:

rotary_emb (torch.Tensor) – Embeddings produced by a forward call of RotaryTimeEmbedding.

Return type:

Tensor
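Inverting a rotation amounts to negating its angle: the cosine stays the same and the sine flips sign. The plain-Python sketch below shows that applying a rotation for time t followed by its inverse restores the input. The (cos, sin) representation and the helper names are assumptions made for illustration; the internal layout of rotary_emb is not specified here.

```python
import math

def rotation(angle):
    """(cos, sin) pair describing a 2-D rotation by angle."""
    return (math.cos(angle), math.sin(angle))

def invert(rot):
    """Rotation by the opposite angle: cosine is even, sine is odd."""
    c, s = rot
    return (c, -s)

def apply(rot, x):
    """Apply the rotation to the 2-D vector x."""
    c, s = rot
    return (x[0] * c - x[1] * s, x[0] * s + x[1] * c)

x = (1.0, 2.0)
rot_t = rotation(0.9)

# Rotating by t and then by -t is the identity.
restored = apply(invert(rot_t), apply(rot_t, x))

print(all(math.isclose(a, b) for a, b in zip(restored, x)))  # True
```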

class SinusoidalTimeEmbedding(dim, t_min, t_max)[source]

Sinusoidal time/position embedding layer. These embeddings are generally added or concatenated to tokens to give them a sense of time/position. The time periods are logarithmically spaced between t_min and t_max (both inclusive).

Parameters:
  • dim (int) – The dimension of the embedding needed (must be a multiple of 2)

  • t_min (float) – Minimum period of the sinusoids. Set this to the smallest timescale you care about.

  • t_max (float) – Maximum period of the sinusoids. Set this to the largest timescale you care about.
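As a rough illustration of the description above, the following plain-Python sketch embeds a single timestamp using dim // 2 periods spaced logarithmically from t_min to t_max inclusive. It is not the library's implementation: the function name, the relation ω = 2π/period, and the layout (all sines followed by all cosines) are assumptions.

```python
import math

def sinusoidal_embedding(t, dim, t_min, t_max):
    """Embed scalar time t as dim values: sines then cosines over dim // 2 log-spaced periods."""
    assert dim % 2 == 0, "dim must be a multiple of 2"
    n = dim // 2
    log_min, log_max = math.log(t_min), math.log(t_max)
    # Periods evenly spaced in log-space, including both endpoints.
    periods = [
        math.exp(log_min + i * (log_max - log_min) / max(n - 1, 1)) for i in range(n)
    ]
    omegas = [2.0 * math.pi / p for p in periods]
    return [math.sin(w * t) for w in omegas] + [math.cos(w * t) for w in omegas]

emb = sinusoidal_embedding(t=0.0, dim=6, t_min=0.01, t_max=1.0)
print(emb)  # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

Short periods resolve fine time differences, while long periods keep distant timestamps distinguishable, which is why t_min and t_max should bracket the timescales of interest.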

omega: Tensor

forward(timestamps)[source]

Converts raw timestamps to sinusoidal embeddings.

Parameters:

timestamps (torch.Tensor) – timestamps tensor

Return type:

Tensor