Fix transformers vocab not being saved to files #4023
MaksymDel wants to merge 3 commits into allenai:main from MaksymDel:vocab_hotfix
Conversation
-    namespace : `str`, optional (default=`tags`)
+    namespace : `str`, optional (default=`from_transformers`)
         We will add the tokens in the pytorch_transformer vocabulary to this vocabulary namespace.
         We use a somewhat confusing default value of `tags` so that we do not add padding or UNK
Suggested change:
-    We use a somewhat confusing default value of `tags` so that we do not add padding or UNK
+    We use `from_transformers` as the default namespace, which gets special treatment in our
+    `Vocabulary` class to not get padding or UNK tokens.
Actually, how are UNK and padding handled here? Is it possible to set them using the transformer? We probably do want UNK handling and padding tokens, we just need them set in the right way, both here and during loading. This might need some additional changes to how we save and load vocabularies.
And, while we're at it, we probably want the padding and unk tokens to be namespace-dependent.
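For what it's worth, the transformer tokenizer does expose its own special tokens, so in principle they could be read off the transformer directly (a quick sketch against the Hugging Face API; the model name is only an example):

```python
from transformers import AutoTokenizer

# The pretrained model name is just an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Hugging Face tokenizers expose their special tokens and ids directly,
# so padding/UNK could be taken from the transformer itself rather than
# from our own defaults.
print(tokenizer.pad_token, tokenizer.pad_token_id)  # "[PAD]", 0 for this model
print(tokenizer.unk_token, tokenizer.unk_token_id)  # "[UNK]", 100 for this model
```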
    self,
    encoding_dictionary: Dict[str, int],
    namespace: str = "from_transformers",
    resave_to_files: bool = False,
This is getting to be a bigger change, but it's probably necessary to really fix this the right way. But I think this function should take arguments for setting padding and oov tokens, and those should set them in a namespace-specific way. And the padding and oov tokens then would need to be serialized with the vocabulary and read from the serialized files. Then we wouldn't need the from_transformers namespace added to the non-padded namespaces.
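Roughly what I have in mind (a standalone sketch only; the class, method name, and per-namespace attributes below are illustrative assumptions, not the actual `Vocabulary` internals):

```python
from collections import defaultdict
from typing import Dict, Optional


class VocabularySketch:
    """Illustrative stand-in for `Vocabulary`; names and internals are assumptions."""

    def __init__(self) -> None:
        self._token_to_index: Dict[str, Dict[str, int]] = defaultdict(dict)
        self._index_to_token: Dict[str, Dict[int, str]] = defaultdict(dict)
        # Per-namespace special tokens, so they can be serialized alongside the
        # vocabulary files and restored on load, instead of relying on a
        # `from_transformers` entry in the non-padded namespaces.
        self._padding_token_by_namespace: Dict[str, str] = {}
        self._oov_token_by_namespace: Dict[str, str] = {}

    def add_transformer_vocab(
        self,
        encoding_dictionary: Dict[str, int],
        namespace: str = "from_transformers",
        padding_token: Optional[str] = None,
        oov_token: Optional[str] = None,
    ) -> None:
        # Record the special tokens for this namespace, if given.
        if padding_token is not None:
            self._padding_token_by_namespace[namespace] = padding_token
        if oov_token is not None:
            self._oov_token_by_namespace[namespace] = oov_token
        # Copy the transformer's token -> id mapping into the namespace.
        for token, index in encoding_dictionary.items():
            self._token_to_index[namespace][token] = index
            self._index_to_token[namespace][index] = token
```

The padding/OOV tokens recorded this way would then need to be written out with the serialized vocabulary and read back on load, which is the part that makes this a bigger change.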
I would wait to do any work on this, though, until @dirkgr has had a chance to glance at it and see if he agrees. I think this is the right approach, but I could be missing something.
It's a minimal fix, but I am very unhappy about the amount of magic going on. The way we do vocabularies already involves an untenable amount of magic, and this adds more. Saving state like this, especially the name of a file that gets overwritten, is just asking for trouble the next time this is used in a scenario we didn't think of. I sketched an alternative here: #4034. I'll add some discussion over there. It is not as minimal as this is, but it has some other advantages.
Fixes #3097. Here we add a method to the vocab that populates a namespace from the transformers encoding (which is fetched in the indexer), with an option to re-save the vocab to files (the main problem in #3097).
I reserved another `non_padded_namespace` family, `*from_transformers`, to use by default for transformers instead of `tags`. Not saving the transformers vocab to disk prevents `from_archive` model loading (because the vocab folder can be empty), with all the implications. I know this fix is not ideal, but I think it is an improvement over what we have currently. We can also open a separate issue dedicated to figuring out the transformers <-> allennlp relationship.
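Roughly, the intended usage looks like this (the method name `add_transformer_vocab` is a placeholder for the new method; `encoding_dictionary`, `namespace`, and `resave_to_files` match the arguments in this PR, and the model name is just an example):

```python
from allennlp.data import Vocabulary
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

vocab = Vocabulary()
# Populate a dedicated namespace with the transformer's token -> id mapping,
# and re-save the vocabulary files so `from_archive` does not find an empty
# vocabulary directory.
vocab.add_transformer_vocab(  # placeholder name for the method added in this PR
    encoding_dictionary=tokenizer.get_vocab(),  # token -> id mapping from the transformer
    namespace="from_transformers",
    resave_to_files=True,
)
```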