Add the possibility to discard samples either:
- With a simple predicate: for instance, if a dataset has a column
foo which contains a list of elements. We should be able to write something like predicate: {{ foo.len > 0 }} (or some kind of similar syntax)
- With a LLM that classifies as either "skip" or "keep" depending on some prompt