A Word Level Transformer layer based on PyTorch and 🤗 Transformers.
Install the library from PyPI:
pip install transformers-embedder
or from Conda:
conda install -c riccorl transformers-embedder
It offers a PyTorch layer and a tokenizer that support almost every pretrained model from the Hugging Face 🤗 Transformers library. Here is a quick example:
import transformers_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
model = tre.TransformersEmbedder(
"bert-base-cased", subword_pooling_strategy="sparse", layer_pooling_strategy="mean"
)
example = "This is a sample sentence"
inputs = tokenizer(example, return_tensors=True)
{
'input_ids': tensor([[ 101, 1188, 1110, 170, 6876, 5650, 102]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
'scatter_offsets': tensor([[0, 1, 2, 3, 4, 5, 6]]),
'sparse_offsets': {
'sparse_indices': tensor(
[
[0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 3, 4, 5, 6]
]
),
'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]),
'sparse_size': torch.Size([1, 7, 7])
},
'sentence_length': 7 # with special tokens included
}
outputs = model(**inputs)
# outputs.word_embeddings[:, 1:-1].shape  # remove [CLS] and [SEP]
torch.Size([1, 5, 768])
# len(example.split())
5
One of the annoyances of using transformer-based models is that it is not trivial to compute word embeddings from the sub-token embeddings they output. With this API, getting word-level embeddings is as easy as using 🤗 Transformers, for theoretically every transformer model it supports.
The TransformersEmbedder class offers 3 ways to get the embeddings:
- subword_pooling_strategy="sparse": computes the mean of the embeddings of the sub-tokens of each word (i.e. the embeddings of the sub-tokens are pooled together) using a sparse matrix multiplication. This strategy is the default one.
- subword_pooling_strategy="scatter": computes the mean of the embeddings of the sub-tokens of each word using a scatter-gather operation. It is not deterministic, but it works with ONNX export.
- subword_pooling_strategy="none": returns the raw output of the transformer model without sub-token pooling.
Here is a small feature table:
| | Pooling | Deterministic | ONNX |
|---|---|---|---|
| Sparse | ✅ | ✅ | ❌ |
| Scatter | ✅ | ❌ | ✅ |
| None | ❌ | ✅ | ✅ |
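To make the sub-token pooling idea concrete, here is a minimal pure-Python sketch of mean pooling driven by word offsets. The helper name `pool_subwords` and the toy vectors are assumptions for illustration; the library itself works on tensors and uses a sparse matrix multiplication or a scatter operation instead of Python loops.

```python
from collections import defaultdict


def pool_subwords(subword_vectors, offsets):
    """Average the sub-token vectors that share the same word index.

    `offsets[i]` is the word index of sub-token `i`; -1 marks padding.
    Hypothetical helper, not part of transformers-embedder.
    """
    sums = {}
    counts = defaultdict(int)
    for vector, word in zip(subword_vectors, offsets):
        if word < 0:  # skip padded positions
            continue
        if word not in sums:
            sums[word] = [0.0] * len(vector)
        sums[word] = [s + v for s, v in zip(sums[word], vector)]
        counts[word] += 1
    # One averaged vector per word, in word order.
    return [[s / counts[w] for s in sums[w]] for w in sorted(sums)]


# Four sub-tokens; the two in the middle belong to the same word.
subword_vectors = [[1.0, 1.0], [2.0, 0.0], [4.0, 2.0], [5.0, 5.0]]
offsets = [0, 1, 1, 2]
word_vectors = pool_subwords(subword_vectors, offsets)
print(word_vectors)  # [[1.0, 1.0], [3.0, 1.0], [5.0, 5.0]]
```

With "none" as the strategy, no such merging happens and the four sub-token vectors would be returned unchanged.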
There are also multiple types of output you can get using the layer_pooling_strategy parameter:
- layer_pooling_strategy="last": returns the last hidden state of the transformer model
- layer_pooling_strategy="concat": returns the concatenation of the selected output_layers of the transformer model
- layer_pooling_strategy="sum": returns the sum of the selected output_layers of the transformer model
- layer_pooling_strategy="mean": returns the average of the selected output_layers of the transformer model
- layer_pooling_strategy="scalar_mix": returns the output of a parameterised scalar mixture layer of the selected output_layers of the transformer model
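The difference between the non-parametric layer pooling strategies can be illustrated with a small pure-Python sketch for a single token. The function `pool_layers` is a hypothetical helper written for this example, and "scalar_mix" is left out because it depends on learned weights.

```python
def pool_layers(hidden_states, strategy="last", output_layers=(-4, -3, -2, -1)):
    """Combine the per-layer vectors of one token.

    `hidden_states` is a list of layers, each a list of floats.
    Hypothetical helper, not the library's implementation.
    """
    if strategy == "last":
        return list(hidden_states[-1])
    selected = [hidden_states[i] for i in output_layers]
    if strategy == "concat":
        return [x for layer in selected for x in layer]
    if strategy == "sum":
        return [sum(xs) for xs in zip(*selected)]
    if strategy == "mean":
        return [sum(xs) / len(selected) for xs in zip(*selected)]
    raise ValueError(f"unsupported strategy: {strategy}")


# Five toy layers with a hidden size of 2.
hidden_states = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(pool_layers(hidden_states, "last"))    # [7.0, 8.0]
print(pool_layers(hidden_states, "sum"))     # [16.0, 20.0]
print(pool_layers(hidden_states, "mean"))    # [4.0, 5.0]
print(pool_layers(hidden_states, "concat"))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```

Note that "concat" multiplies the hidden size by the number of selected layers, while the other strategies keep it unchanged.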
If you also want all the outputs from the HuggingFace model, you can set return_all=True to get them.
class TransformersEmbedder(torch.nn.Module):
def __init__(
self,
model: Union[str, tr.PreTrainedModel],
subword_pooling_strategy: str = "sparse",
layer_pooling_strategy: str = "last",
output_layers: Tuple[int] = (-4, -3, -2, -1),
fine_tune: bool = True,
return_all: bool = True,
)
The Tokenizer class provides the tokenize method to preprocess the input for the TransformersEmbedder
layer. You can pass raw sentences, pre-tokenized sentences, and batches of sentences. It preprocesses them
and returns a dictionary with the inputs for the model. With return_tensors=True, it returns the
inputs as torch.Tensor.
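As a rough illustration of what the preprocessing has to track, the sketch below maps each sub-token back to the index of the word it came from. Everything here is an assumption made for the example: `toy_wordpieces` is a hypothetical splitter that cuts words into chunks of four characters, whereas the real tokenizer relies on the model's own sub-word vocabulary.

```python
def toy_wordpieces(word, max_len=4):
    """Hypothetical splitter: cut a word into chunks of at most `max_len` characters."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def word_offsets(words):
    """Return the sub-tokens and, for each sub-token, the index of its word.

    Special tokens get their own index, like regular words.
    """
    pieces, offsets = ["[CLS]"], [0]
    for index, word in enumerate(words, start=1):
        for piece in toy_wordpieces(word):
            pieces.append(piece)
            offsets.append(index)
    pieces.append("[SEP]")
    offsets.append(len(words) + 1)
    return pieces, offsets


pieces, offsets = word_offsets(["This", "is", "a", "sample", "sentence"])
print(pieces)   # ['[CLS]', 'This', 'is', 'a', 'samp', 'le', 'sent', 'ence', '[SEP]']
print(offsets)  # [0, 1, 2, 3, 4, 4, 5, 5, 6]
```

Repeated indices in `offsets` mark sub-tokens that belong to the same word and would be pooled together.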
By default, if you pass text (or batch) as strings, it uses the HuggingFace tokenizer to tokenize them.
text = "This is a sample sentence"
tokenizer(text)
text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text)
You can pass a pre-tokenized sentence (or batch of sentences) by setting is_split_into_words=True:
text = ["This", "is", "a", "sample", "sentence"]
tokenizer(text, is_split_into_words=True)
text = [
["This", "is", "a", "sample", "sentence", "1"],
["This", "is", "sample", "sentence", "2"],
]
tokenizer(text, is_split_into_words=True)
First, initialize the tokenizer:
import transformers_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
- You can pass a single sentence as a string:
text = "This is a sample sentence"
tokenizer(text)
{
'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1]],
'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6]],
'sparse_offsets': {
'sparse_indices': tensor(
[
[0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 3, 4, 5, 6]
]
),
'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]),
'sparse_size': torch.Size([1, 7, 7])
},
'sentence_lengths': [7],
}
- A sentence pair
text = "This is a sample sentence A"
text_pair = "This is a sample sentence B"
tokenizer(text, text_pair)
{
'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 138, 102, 1188, 1110, 170, 6876, 5650, 139, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]],
'sparse_offsets': {
'sparse_indices': tensor(
[
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
]
),
'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'sparse_size': torch.Size([1, 15, 15])
},
'sentence_lengths': [15],
}
- A batch of sentences or sentence pairs. Using padding=True and return_tensors=True, the tokenizer returns the text ready for the model:
batch = [
["This", "is", "a", "sample", "sentence", "1"],
["This", "is", "sample", "sentence", "2"],
["This", "is", "a", "sample", "sentence", "3"],
# ...
["This", "is", "a", "sample", "sentence", "n", "for", "batch"],
]
tokenizer(batch, padding=True, return_tensors=True)
batch_pair = [
["This", "is", "a", "sample", "sentence", "pair", "1"],
["This", "is", "sample", "sentence", "pair", "2"],
["This", "is", "a", "sample", "sentence", "pair", "3"],
# ...
["This", "is", "a", "sample", "sentence", "pair", "n", "for", "batch"],
]
tokenizer(batch, batch_pair, padding=True, return_tensors=True)
It is possible to add custom fields to the model input and tell the tokenizer how to pad them using
add_padding_ops. Start by initializing the tokenizer with the model name:
import transformers_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
Then add the custom fields to it:
custom_fields = {
"custom_filed_1": [
[0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
]
}
Now we can add the padding logic for our custom field custom_filed_1. The add_padding_ops method takes as input:
- key: name of the field in the tokenizer input
- value: value to use for padding
- length: length to pad to. It can be an int or one of two string values: subword, where the element is padded to match the length of the subwords, or word, where the element is padded relative to the length of the batch after the subwords are merged.
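The effect of the length argument can be sketched in plain Python. The helper `pad_field` and the two batch lengths passed to it are assumptions for illustration only; in the library, the tokenizer determines the word and subword lengths of the batch on its own.

```python
def pad_field(rows, value, length, word_length, subword_length):
    """Pad every row of a custom field to a target length.

    `length` is an int, "word", or "subword". `word_length` and
    `subword_length` stand in for the batch lengths the tokenizer
    would know. Hypothetical helper, not the library's code.
    """
    if length == "word":
        target = word_length
    elif length == "subword":
        target = subword_length
    else:
        target = int(length)
    # Rows already at or beyond the target are left unchanged.
    return [row + [value] * (target - len(row)) for row in rows]


rows = [[0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]]
padded_word = pad_field(rows, 0, "word", word_length=11, subword_length=13)
padded_subword = pad_field(rows, 0, "subword", word_length=11, subword_length=13)
print([len(row) for row in padded_word])     # [11, 11]
print([len(row) for row in padded_subword])  # [13, 13]
```

Choosing "word" keeps a word-level field aligned with the pooled word embeddings, while "subword" aligns it with the raw sub-token sequence.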
tokenizer.add_padding_ops("custom_filed_1", 0, "word")
Finally, we can tokenize the input with the custom field:
text = [
"This is a sample sentence",
"This is another example sentence just make it longer, with a comma too!"
]
inputs = tokenizer(text, padding=True, return_tensors=True, additional_inputs=custom_fields)
The inputs are ready for the model, including the custom field.
>>> inputs
{
'input_ids': tensor(
[
[ 101, 1188, 1110, 170, 6876, 5650, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 101, 1188, 1110, 1330, 1859, 5650, 1198, 1294, 1122, 2039, 117, 1114, 170, 3254, 1918, 1315, 106, 102]
]
),
'token_type_ids': tensor(
[
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
),
'attention_mask': tensor(
[
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
]
),
'scatter_offsets': tensor(
[
[ 0, 1, 2, 3, 4, 5, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16]
]
),
'sparse_offsets': {
'sparse_indices': tensor(
[
[ 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[ 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16],
[ 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
]
),
'sparse_values': tensor(
[1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
1.0000, 1.0000, 0.5000, 0.5000, 1.0000, 1.0000, 1.0000]
),
'sparse_size': torch.Size([2, 17, 18])
},
'sentence_lengths': [7, 17],
}
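The relation between scatter_offsets and sparse_offsets in the output above can be reproduced with a short pure-Python sketch. This is a reconstruction from the printed values, not the library's code, and `sparse_from_scatter` is a hypothetical name: each non-padded sub-token contributes one (batch, word, sub-token) entry whose value is one over the number of sub-tokens in its word.

```python
from collections import Counter


def sparse_from_scatter(scatter_offsets):
    """Build COO-style indices, values, and size from padded word offsets.

    Hypothetical helper; -1 marks padded positions.
    """
    batch_idx, word_idx, subword_idx, values = [], [], [], []
    for b, row in enumerate(scatter_offsets):
        counts = Counter(w for w in row if w >= 0)
        for s, w in enumerate(row):
            if w < 0:  # padding contributes nothing
                continue
            batch_idx.append(b)
            word_idx.append(w)
            subword_idx.append(s)
            values.append(1.0 / counts[w])  # averaging weight
    size = (
        len(scatter_offsets),
        max(word_idx) + 1,
        max(len(row) for row in scatter_offsets),
    )
    return [batch_idx, word_idx, subword_idx], values, size


scatter_offsets = [
    [0, 1, 2, 3, 4, 5, 6] + [-1] * 11,
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16],
]
indices, values, size = sparse_from_scatter(scatter_offsets)
print(size)           # (2, 17, 18)
print(values[18:23])  # [1.0, 1.0, 0.5, 0.5, 1.0]
```

Multiplying a matrix with these entries by the sub-token embeddings yields the per-word mean, which is why the word split into two sub-tokens carries two weights of 0.5.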
Some code in the TransformersEmbedder class is taken from the PyTorch Scatter
library. The pretrained models and the core of the tokenizer are from 🤗 Transformers.