Description
As mentioned on Slack, I propose adding a generalized numericalizer interface that would let users easily plug in more advanced numericalization methods such as word2vec embeddings, TF-IDF, and so on. The existing Vocab class already fits this interface, so no big changes would be required. The interface would look like this:
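from abc import ABC, abstractmethod

class SmartNumericalizer(ABC):
    @abstractmethod
    def update(self, tokens):
        pass

    @abstractmethod
    def finalize(self):
        pass

    @abstractmethod
    def numericalize(self, tokens):
        pass

The name is just a placeholder for now.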
One implementation could be a Word2Vec numericalizer that remembers which tokens appeared in the dataset (through the update method, similar to what Vocab does now) and loads the vectors for those tokens when finalize is called. I assume TF-IDF could be implemented in a similar fashion.
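To make the idea concrete, here is a rough sketch of how such a numericalizer might look. It is only an illustration: `EmbeddingNumericalizer`, its constructor arguments, and the zero-vector fallback for unknown tokens are assumptions made for this example, and a plain dict stands in for a real word2vec model so the snippet runs without extra dependencies.

```python
from abc import ABC, abstractmethod


class SmartNumericalizer(ABC):
    """The proposed interface (placeholder name)."""

    @abstractmethod
    def update(self, tokens):
        pass

    @abstractmethod
    def finalize(self):
        pass

    @abstractmethod
    def numericalize(self, tokens):
        pass


class EmbeddingNumericalizer(SmartNumericalizer):
    """Hypothetical word-vector numericalizer; not part of the library."""

    def __init__(self, pretrained, dim):
        # pretrained: dict mapping token -> list of floats.
        # Assumption: stands in for a real word2vec model.
        self._pretrained = pretrained
        self._dim = dim
        self._seen = set()
        self._vectors = None  # populated by finalize()

    def update(self, tokens):
        # Only remember which tokens occur; nothing is loaded yet.
        self._seen.update(tokens)

    def finalize(self):
        # Load vectors only for tokens that appeared in the dataset.
        self._vectors = {
            token: self._pretrained[token]
            for token in self._seen
            if token in self._pretrained
        }

    def numericalize(self, tokens):
        if self._vectors is None:
            raise RuntimeError("finalize() must be called before numericalize()")
        # Assumption: unknown tokens map to a zero vector.
        unknown = [0.0] * self._dim
        return [self._vectors.get(token, unknown) for token in tokens]


# Usage: the same three-step lifecycle Vocab follows today.
pretrained = {"hello": [0.1, 0.2], "world": [0.3, 0.4], "unused": [0.5, 0.6]}
numericalizer = EmbeddingNumericalizer(pretrained, dim=2)
numericalizer.update(["hello", "world"])
numericalizer.finalize()
vectors = numericalizer.numericalize(["hello", "there"])
```

Because only tokens seen through `update` are kept, the vector for `"unused"` is never loaded, and the out-of-dataset token `"there"` falls back to the zero vector.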
The main reason for implementing this would be to make more advanced numericalization straightforward and avoid user intervention after batching (as is required now).