Proposal: Generalized Numericalizer interface #222

@ivansmokovic

As mentioned on Slack, I propose adding a generalized numericalizer interface that would let users trivially plug in more advanced numericalization methods such as word2vec embeddings, TF-IDF and so on. The existing Vocab class fits this interface well, so no big changes would be required. The interface would look like this:

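from abc import ABC

class SmartNumericalizer(ABC):
    def update(self, tokens):
        pass  # record tokens seen in the dataset

    def finalize(self):
        pass  # build or load whatever numericalize needs

    def numericalize(self, tokens):
        pass  # convert tokens to their numeric form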

The name is just a placeholder for now.
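
To illustrate how a Vocab-style class could fit this interface, here is a minimal sketch. It is not the existing Vocab implementation: the names `Numericalizer` and `SimpleVocab`, the `<unk>` token and the frequency-based ordering are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Numericalizer(ABC):
    """Hypothetical base class mirroring the proposed interface."""

    @abstractmethod
    def update(self, tokens):
        """Record tokens seen while iterating over the dataset."""

    @abstractmethod
    def finalize(self):
        """Build whatever state numericalize() needs."""

    @abstractmethod
    def numericalize(self, tokens):
        """Convert a list of tokens to numeric values."""


class SimpleVocab(Numericalizer):
    """Toy vocabulary (an assumption, not the library's Vocab class)."""

    def __init__(self, unk_token="<unk>"):
        self.unk_token = unk_token
        self._counts = Counter()
        self.stoi = None  # token -> index, built in finalize()

    def update(self, tokens):
        if self.stoi is not None:
            raise RuntimeError("cannot update after finalize()")
        self._counts.update(tokens)

    def finalize(self):
        # Most frequent tokens first; ties broken alphabetically so the
        # resulting indices are deterministic.
        ordered = sorted(self._counts, key=lambda t: (-self._counts[t], t))
        self.stoi = {self.unk_token: 0}
        for token in ordered:
            self.stoi.setdefault(token, len(self.stoi))

    def numericalize(self, tokens):
        if self.stoi is None:
            raise RuntimeError("call finalize() before numericalize()")
        unk_index = self.stoi[self.unk_token]
        return [self.stoi.get(token, unk_index) for token in tokens]


vocab = SimpleVocab()
for sentence in (["the", "cat", "sat"], ["the", "dog"]):
    vocab.update(sentence)
vocab.finalize()
ids = vocab.numericalize(["the", "bird", "cat"])  # "bird" maps to <unk>
```

Raising an error when `update` is called after `finalize` is one possible design choice; it keeps the mapping stable once numericalization has started.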

One implementation could be a Word2Vec numericalizer that remembers which tokens appeared in the dataset (through the update method, similar to what Vocab does now) and loads the vectors for only those tokens when finalize is called. I assume TF-IDF could be implemented in a similar fashion.
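
A rough sketch of the embedding idea, under stated assumptions: the `PRETRAINED` dict is a hypothetical in-memory stand-in for a real word2vec file, the class names are placeholders, and mapping unknown tokens to a zero vector is just one possible policy.

```python
from abc import ABC, abstractmethod


class Numericalizer(ABC):
    """Hypothetical base class mirroring the proposed interface."""

    @abstractmethod
    def update(self, tokens):
        """Record tokens seen while iterating over the dataset."""

    @abstractmethod
    def finalize(self):
        """Build whatever state numericalize() needs."""

    @abstractmethod
    def numericalize(self, tokens):
        """Convert a list of tokens to numeric values."""


# Hypothetical stand-in for a large pretrained embedding file.
PRETRAINED = {
    "cat": [0.1, 0.2],
    "dog": [0.3, 0.4],
    "fish": [0.5, 0.6],
}


class EmbeddingNumericalizer(Numericalizer):
    """Keeps only the vectors for tokens that occurred in the dataset."""

    def __init__(self, source, dim):
        self._source = source  # token -> vector lookup (assumed interface)
        self._dim = dim
        self._seen = set()
        self.vectors = None  # filled in by finalize()

    def update(self, tokens):
        self._seen.update(tokens)

    def finalize(self):
        # Load vectors only for tokens that were actually seen, so unused
        # parts of the pretrained table never need to be kept in memory.
        self.vectors = {
            token: list(self._source[token])
            for token in self._seen
            if token in self._source
        }

    def numericalize(self, tokens):
        if self.vectors is None:
            raise RuntimeError("call finalize() before numericalize()")
        zero = [0.0] * self._dim  # assumed policy for unknown tokens
        return [list(self.vectors.get(token, zero)) for token in tokens]


numericalizer = EmbeddingNumericalizer(PRETRAINED, dim=2)
numericalizer.update(["cat", "bird"])
numericalizer.finalize()
# "fish" has a pretrained vector but never appeared in the dataset,
# so it was not loaded and falls back to the zero vector.
embedded = numericalizer.numericalize(["cat", "bird", "fish"])
```

With a real embedding file, `finalize` would be the place to read the file once and filter it by the seen tokens, instead of indexing into a dict.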

The main reason for implementing this would be to make more advanced numericalization straightforward and avoid user intervention after batching (as is required now).

Labels: feature (New feature or request)
