refactor: datasets mlflow and more #118

marcellodebernardi · 2025-05-18T07:56:18Z

This PR is a depressingly confusing mishmash of several bugfixes, improvements, and cleanups. The main changes are:

The DatasetGenerator has been simplified, cleaned up, made properly async, and can now handle column-wise data augmentation (previously only row-wise).
The traces of the smolagents agents are now logged to MLflow for easy tracking.
MLflow tracking has been improved, chain-of-thought output is more informative, etc.
The EDA report is now included in the model bundle and in MLflow for convenience.
Miscellaneous bugfixes.

Yes, this is a terrible PR 🚀

Copilot

Pull Request Overview

This PR refactors dataset generation, enhances MLflow integration, and cleans up related tooling and examples.

Simplified DatasetGenerator to use SimpleLLMDataGenerator with proper async and column‐wise augmentation.
Overhauled MLFlowCallback to auto-log smolagents, include EDA markdown artifacts, and updated its unit tests.
Updated project config, prompts, and utilities for EDA report handling and cleaned up obsolete code paths.

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/unit/internal/models/callbacks/test_mlflow.py | Adjusted MLflow tests to mock experiment creation and set_experiment |
| tests/unit/internal/datasets/core/generation/utils/test_oversampling.py | Removed obsolete SMOTE oversampling tests |
| pyproject.toml | Bumped package version; updated mlflow, torch, smolagents, transformers, tokenizers |
| plexe/templates/prompts/agent/schema_resolver_prompt.jinja | Refined prompt language and schema instructions |
| plexe/templates/prompts/agent/agent_manager_prompt.jinja | Removed personal name from manager prompt |
| plexe/models.py | Added formatting and storage of EDA markdown reports |
| plexe/internal/models/tools/datasets.py | Extended `split_datasets` to return `dataset_sizes`; strip suffixes in EDA lookup |
| plexe/internal/models/callbacks/mlflow.py | Configured experiment creation, smolagents autolog, and EDA artifact logging |
| plexe/internal/datasets/generator.py | Switched to `SimpleLLMDataGenerator` and cleaned imports |
| plexe/internal/datasets/core/generation/utils/oversampling.py | Deleted obsolete oversampling utility |
| plexe/internal/datasets/core/generation/simple_llm.py | Implemented async batch/column data generation |
| plexe/internal/datasets/core/generation/combined.py | Removed `CombinedDataGenerator` |
| plexe/internal/datasets/core/generation/base.py | Expanded interface to support column-only generation |
| plexe/internal/datasets/config.py | Locked generator to "simple" and cleaned up instructions |
| plexe/internal/common/utils/response.py | Added JSON array extraction and DataFrame conversion helpers |
| plexe/fileio.py | Included EDA markdown in model bundle on save/load |
| plexe/datasets.py | Enhanced `DatasetGenerator` for column augmentation and schema validation |
| plexe/agents/dataset_analyser.py | Expanded authorized imports for dataset analyser agent |
| examples/dataset_generation.py | New example for synthetic data generation |
| examples/dataset_augmentation.py | New example for dataset augmentation |

Comments suppressed due to low confidence (4)

plexe/datasets.py:96

Name 'DataGenerator' is not imported in this module. Add the correct import (e.g. from plexe.internal.datasets.generator import SimpleLLMDataGenerator or alias your generator) to avoid NameError.

self.data_generator = DataGenerator(self.provider, self.description, self.schema)

tests/unit/internal/models/callbacks/test_mlflow.py:183

The new EDA artifact logging logic in on_build_end isn't covered by this test. Consider adding a test where info.node.metadata['eda_markdown_reports'] is set and assert that mlflow.log_artifact is called.

# Call on_build_end
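
A test along the lines suggested above might look like the following sketch. It does not use the real callback: `FakeMLFlowCallback` is a hypothetical stand-in that models only the EDA logging path, the `mlflow` module is replaced by a `MagicMock`, and the shape of `eda_markdown_reports` (a dict mapping dataset name to markdown text) is an assumption.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace
from unittest.mock import MagicMock


class FakeMLFlowCallback:
    """Hypothetical stand-in for the real callback; only EDA logging is modelled."""

    def __init__(self, mlflow_module, workdir):
        self.mlflow = mlflow_module
        self.workdir = Path(workdir)

    def on_build_end(self, info):
        # Assumption: reports are stored as {dataset_name: markdown_text}.
        reports = info.node.metadata.get("eda_markdown_reports", {})
        for dataset_name, markdown in reports.items():
            report_path = self.workdir / f"eda_report_{dataset_name}.md"
            report_path.write_text(markdown)
            self.mlflow.log_artifact(str(report_path))


def test_on_build_end_logs_eda_reports():
    mlflow_mock = MagicMock()
    with tempfile.TemporaryDirectory() as tmp:
        callback = FakeMLFlowCallback(mlflow_mock, tmp)
        info = SimpleNamespace(
            node=SimpleNamespace(metadata={"eda_markdown_reports": {"train": "# EDA"}})
        )
        callback.on_build_end(info)
        expected = str(Path(tmp) / "eda_report_train.md")
    # One report in the metadata should produce exactly one artifact upload.
    mlflow_mock.log_artifact.assert_called_once_with(expected)
```

In the real test module the mock would presumably be applied by patching `mlflow` where the callback imports it, rather than by passing it in.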

plexe/internal/models/callbacks/mlflow.py:104

Missing import for Path. Add 'from pathlib import Path' at the top of this module to avoid NameError at runtime.

report_path = Path(f"eda_report_{dataset_name}.md")

plexe/internal/common/utils/response.py:183

Logger is used but not defined. Add 'logger = logging.getLogger(__name__)' at the top of this file so the module-level logger is available.

logger.warning("JSON is a single object, converting to list")
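
For reference, a minimal sketch of a module-level logger used in this kind of helper. The function name `parse_json_records` and its exact behaviour are assumptions for illustration, not the code in `response.py`.

```python
import json
import logging

# Module-level logger, defined once at the top of the file.
logger = logging.getLogger(__name__)


def parse_json_records(text):
    """Parse JSON text into a list of records, wrapping a lone object in a list."""
    data = json.loads(text)
    if isinstance(data, dict):
        logger.warning("JSON is a single object, converting to list")
        return [data]
    if isinstance(data, list):
        return data
    raise ValueError(f"Expected a JSON object or array, got {type(data).__name__}")
```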

Copilot · 2025-05-19T08:43:38Z

plexe/internal/models/callbacks/mlflow.py

+        # Create experiment
+        self.experiment_id = mlflow.create_experiment(self.experiment_name)
+        mlflow.set_experiment(experiment_name=self.experiment_name)
+        logger.debug(f"✅  MLFlow configured with experiment '{self.experiment_name}' (ID: {self.experiment_id})")


Creating a new experiment unconditionally on initialization can lead to duplicates or errors if the experiment already exists. Use get_experiment_by_name first and only call create_experiment when it returns None.

Suggested change

-        # Create experiment
-        self.experiment_id = mlflow.create_experiment(self.experiment_name)
-        mlflow.set_experiment(experiment_name=self.experiment_name)
-        logger.debug(f"✅  MLFlow configured with experiment '{self.experiment_name}' (ID: {self.experiment_id})")
+        # Create or retrieve experiment
+        experiment = mlflow.get_experiment_by_name(self.experiment_name)
+        if experiment is None:
+            self.experiment_id = mlflow.create_experiment(self.experiment_name)
+            logger.debug(f"✅ Created new MLFlow experiment '{self.experiment_name}' (ID: {self.experiment_id})")
+        else:
+            self.experiment_id = experiment.experiment_id
+            logger.debug(f"✅ Retrieved existing MLFlow experiment '{self.experiment_name}' (ID: {self.experiment_id})")
+        mlflow.set_experiment(experiment_name=self.experiment_name)
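
The get-or-create pattern from this suggestion can also be isolated into a small helper, sketched below. `get_or_create_experiment` and `InMemoryTracking` are hypothetical names introduced for illustration. The helper relies only on `get_experiment_by_name` and `create_experiment`, so the `mlflow` module itself could be passed as `tracking`. The stub raises on duplicate names to mimic the failure mode described in the comment.

```python
from types import SimpleNamespace


def get_or_create_experiment(tracking, name):
    """Return the ID of the experiment called `name`, creating it only if missing."""
    experiment = tracking.get_experiment_by_name(name)
    if experiment is None:
        return tracking.create_experiment(name)
    return experiment.experiment_id


class InMemoryTracking:
    """Hypothetical stub exposing the two calls used above; not part of MLflow."""

    def __init__(self):
        self._ids = {}

    def get_experiment_by_name(self, name):
        exp_id = self._ids.get(name)
        return None if exp_id is None else SimpleNamespace(experiment_id=exp_id)

    def create_experiment(self, name):
        if name in self._ids:
            raise ValueError(f"Experiment '{name}' already exists")
        self._ids[name] = str(len(self._ids) + 1)
        return self._ids[name]


tracking = InMemoryTracking()
first = get_or_create_experiment(tracking, "demo")
second = get_or_create_experiment(tracking, "demo")  # reuses the ID, does not raise
```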

pyproject.toml

marcellodebernardi added 30 commits May 13, 2025 22:08

fix: trainer_source.py not cleaned up

5ea3b73

fix: handle dataset splitting for chronological data

4b2a668

fix: switch to codeagent for schema resolver

819aa39

feat: add data analyser agent

0e8979a

feat: add data analyser agent

1113e69

fix: put eda report as dict in metadata

0d8df15

feat: update multi-agent-system.md

df517a3

chore: bump to 0.20.0

05b8684

fix: misc improvements to dataset analyser

684cda3

fix: eda agent using wrong prompt template

f2023ba

chore: remove unused prompt template

b20dbbe

chore: remove unused plan generation template

cc22569

fix: emitter agent colors defined incorrectly

b6eef55

feat: make chain of thought summaries follow t/a/o structure

b94c721

feat: remove combined data generator in favour of simple

35ba360

Merge branch 'refs/heads/main' into fix/dataset-generator-cleanup

fa2fa90

fix: strip split suffix from eda report name

22a1a7d

fix: give dataset analyser all required imports

ad068c5

feat: enable mlflow tracing

42b1dda

chore: update vulnerable dependencies

6c5ff99

fix: allow scipy.* import for dataset analyser

d385c2a

fix: split_datasets to return dataset sizes

0f936b5

chore: remove smote oversampling

eb80f54

chore: clean up dataset generator config

4889b85

refactor: clean up datasets.py

cf8ecca

refactor: clean up data generation async logic

5b786ff

refactor: clean up data generation async logic

c011aa1

feat: add dataset generation example

3f0af16

chore: fix up base data generator interface

8559bce

fix: column addition not working plus noisy logging

b07786e

marcellodebernardi added 8 commits May 17, 2025 23:13

feat: add dataset augmentation example

831989c

chore: bump to 0.21.0

2c75417

feat: add eda report to model bundle

908e20a

fix: re-enable i/o schema logging

f42dcad

fix: add pandas.* to dataset analyser imports

a020be7

fix: give better instructions to schema resolver

09b376a

fix: remove silly naming from system prompts

8bfb1a1

refactor: make schema resolver prompt more concise

045fefc

marcellodebernardi marked this pull request as ready for review May 19, 2025 08:39

marcellodebernardi requested review from Copilot and vaibs-d May 19, 2025 08:40

Copilot AI reviewed May 19, 2025

plexe-ai deleted a comment from jazzberry-ai bot May 19, 2025

marcellodebernardi closed this May 19, 2025

marcellodebernardi deleted the refactor/datasets-mlflow-and-more branch May 19, 2025 08:56
