Commit 70696bf

feature: data preview tool (#114)
* fix: simplify santander example
* refactor: remove 'llm_to_use' from tool signature
* chore: yell at claude about imports
* feat: initial implementation of more agentic predictor production
* feat: extract input sample as dict
* feat: simplify inference generation tools
* chore: bump to 0.18.3
* fix: remove unused agent inputs
* fix: include prompt templates in dumpcode.py
* feat: move predictor generation from tools to agent
* fix: register schemas
* fix: remove unused inference prompts
* fix: allow plexe imports for mlops engineer
* feat: extract artifacts in inference context
* feat: add house prices example
* fix: artifact list extraction defined incorrectly
* fix: incorrect sampling in examples
* fix: add io and plexe to allowed imports
* fix: setting llm for extraction incorrectly
* fix: get schemas from registry at inference validation
* fix: artifact extraction can fail silently
* fix: extra space in house prices example
* fix: only one integration test per module
* feat: add data preview tool for agents
* chore: bump to 0.18.4
1 parent 5cc4665 commit 70696bf

6 files changed, +80 -6 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -193,3 +193,5 @@ plexe-full-codebase.txt
 # Example datasets
 examples/datasets/
 examples/datasets/*
+
+**/.claude/settings.local.json

plexe/internal/agents.py

Lines changed: 4 additions & 2 deletions
@@ -23,7 +23,7 @@
 )
 from plexe.internal.models.tools.evaluation import get_review_finalised_model
 from plexe.internal.models.tools.metrics import get_select_target_metric
-from plexe.internal.models.tools.datasets import split_datasets, create_input_sample
+from plexe.internal.models.tools.datasets import split_datasets, create_input_sample, get_dataset_preview
 from plexe.internal.models.tools.execution import get_executor_tool
 from plexe.internal.models.tools.response_formatting import (
     format_final_orchestrator_agent_response,
@@ -107,7 +107,7 @@ def __init__(
                 "- the name and comparison method of the metric to optimise"
             ),
             model=LiteLLMModel(model_id=self.ml_researcher_model_id),
-            tools=[],
+            tools=[get_dataset_preview],
             add_base_tools=False,
             verbosity_level=self.specialist_verbosity,
             prompt_templates=get_prompt_templates("toolcalling_agent.yaml", "mls_prompt_templates.yaml"),
@@ -134,6 +134,7 @@ def __init__(
                 validate_training_code,
                 get_fix_training_code(self.tool_model_id),
                 get_executor_tool(distributed),
+                get_dataset_preview,
                 format_final_mle_agent_response,
             ],
             add_base_tools=False,
@@ -175,6 +176,7 @@ def __init__(
                 get_review_finalised_model(self.tool_model_id),
                 split_datasets,
                 create_input_sample,
+                get_dataset_preview,
                 format_final_orchestrator_agent_response,
             ],
             managed_agents=[self.ml_research_agent, self.mle_agent, self.mlops_engineer],

plexe/internal/models/tools/datasets.py

Lines changed: 67 additions & 2 deletions
@@ -3,11 +3,13 @@

 These tools help with dataset operations within the model generation pipeline, including
 splitting datasets into training, validation, and test sets, registering datasets with
-the dataset registry, and creating sample data for validation.
+the dataset registry, creating sample data for validation, and previewing dataset content.
 """

 import logging
-from typing import Dict, List
+from typing import Dict, List, Any
+
+import numpy as np
 import pandas as pd
 from smolagents import tool

@@ -123,3 +125,66 @@ def create_input_sample(train_dataset_names: List[str], input_schema_fields: Lis
     except Exception as e:
         logger.warning(f"⚠️ Error creating input sample for validation: {str(e)}")
         return False
+
+
+@tool
+def get_dataset_preview(dataset_name: str) -> Dict[str, Any]:
+    """
+    Generate a concise preview of a dataset with statistical information to help agents understand the data.
+
+    Args:
+        dataset_name: Name of the dataset to preview
+
+    Returns:
+        Dictionary containing dataset information:
+        - shape: dimensions of the dataset
+        - dtypes: data types of columns
+        - summary_stats: basic statistics (mean, median, min/max)
+        - missing_values: count of missing values per column
+        - sample_rows: sample of the data (5 rows)
+    """
+    object_registry = ObjectRegistry()
+
+    try:
+        # Get dataset from registry
+        dataset = object_registry.get(TabularConvertible, dataset_name)
+        df = dataset.to_pandas()
+
+        # Basic shape and data types
+        result = {
+            "dataset_name": dataset_name,
+            "shape": {"rows": df.shape[0], "columns": df.shape[1]},
+            "columns": list(df.columns),
+            "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
+            "sample_rows": df.head(5).to_dict(orient="records"),
+        }
+
+        # Basic statistics
+        numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
+        if numeric_cols:
+            stats = df[numeric_cols].describe().to_dict()
+            result["summary_stats"] = {
+                col: {
+                    "mean": stats[col].get("mean"),
+                    "std": stats[col].get("std"),
+                    "min": stats[col].get("min"),
+                    "25%": stats[col].get("25%"),
+                    "median": stats[col].get("50%"),
+                    "75%": stats[col].get("75%"),
+                    "max": stats[col].get("max"),
+                }
+                for col in numeric_cols
+            }
+
+        # Missing values
+        missing_counts = df.isnull().sum().to_dict()
+        result["missing_values"] = {col: count for col, count in missing_counts.items() if count > 0}
+
+        return result
+
+    except Exception as e:
+        logger.warning(f"⚠️ Error creating dataset preview: {str(e)}")
+        return {
+            "error": f"Failed to generate preview for dataset '{dataset_name}': {str(e)}",
+            "dataset_name": dataset_name,
+        }
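
For readers who want to see the shape of the dictionary without a plexe installation, the following is a rough, dependency-free sketch of the same idea. It is not the code from this commit: `preview_records` is a hypothetical helper that works on a plain list of row dictionaries instead of a registered `TabularConvertible`, it uses the standard library rather than pandas, and it omits `dtypes`, `std` and the quartiles. It does keep the two behaviours visible in the diff: only columns with missing values are reported, and failures come back as an `error` entry rather than an exception.

```python
import math
import statistics
from typing import Any, Dict, List


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (a stand-in for pandas' isnull)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def preview_records(name: str, rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: build a preview dict shaped like the tool's output."""
    try:
        columns = list(rows[0].keys())  # raises IndexError on an empty dataset
        result: Dict[str, Any] = {
            "dataset_name": name,
            "shape": {"rows": len(rows), "columns": len(columns)},
            "columns": columns,
            "sample_rows": rows[:5],
        }

        summary: Dict[str, Dict[str, float]] = {}
        missing: Dict[str, int] = {}
        for col in columns:
            values = [row.get(col) for row in rows]
            present = [v for v in values if not is_missing(v)]
            n_missing = len(values) - len(present)
            if n_missing > 0:
                missing[col] = n_missing  # only columns with gaps are reported
            numeric = all(
                isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
            )
            if present and numeric:
                summary[col] = {
                    "mean": statistics.fmean(present),
                    "median": statistics.median(present),
                    "min": min(present),
                    "max": max(present),
                }

        if summary:
            result["summary_stats"] = summary
        result["missing_values"] = missing
        return result
    except Exception as e:
        # Same contract as the tool: report the failure as data instead of raising.
        return {
            "error": f"Failed to generate preview for dataset '{name}': {e}",
            "dataset_name": name,
        }


rows = [
    {"age": 30, "city": "Rome"},
    {"age": 40, "city": None},
    {"age": 50, "city": "Oslo"},
]
preview = preview_records("people", rows)
print(preview["shape"])           # {'rows': 3, 'columns': 2}
print(preview["missing_values"])  # {'city': 1}
print(preview_records("empty", []).keys())
```

Returning the error as a dictionary presumably lets the calling agent read the message and decide what to do next, instead of having the whole tool call abort.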

plexe/templates/prompts/agent/mle_prompt_templates.yaml

Lines changed: 2 additions & 1 deletion
@@ -20,7 +20,8 @@ managed_agent:
 - The identifier of the LLM to use for code generation.

 If the information above was not provided, you should reject the task and request your manager to provide the
-required information.
+required information. You can also use the 'get_dataset_preview' tool to get a better understanding of the data
+in case it helps.

 ## Instructions for You
 If you have the required information: generate Python machine learning training code to train a model that solves

plexe/templates/prompts/agent/mls_prompt_templates.yaml

Lines changed: 4 additions & 0 deletions
@@ -14,6 +14,10 @@ managed_agent:
 or the LLM to use for plan generation, you should reject the task and ask your manager to provide the required
 information.

+You can use the get_dataset_preview tool to examine the available datasets before formulating your solution plans.
+This will help you understand the data characteristics (data types, missing values, basic statistics)
+and propose more targeted approaches. Use the tool by providing a dataset name.
+
 The solution concepts should be explained in 3-5 sentences each. Do not include implementations of the
 solutions, though you can include small code snippets if absolutely required to explain a plan.
 Do not suggest doing EDA, ensembling, or hyperparameter tuning. The solutions should be feasible using only
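
To illustrate how the data characteristics named in the prompt (data types, missing values) might be read off a preview, here is a small sketch. `describe_preview` is a hypothetical helper and not part of plexe; it assumes only the dictionary keys visible in the `get_dataset_preview` diff (`dataset_name`, `shape`, `dtypes`, `missing_values`, `error`), and the example values are made up.

```python
from typing import Any, Dict, List


def describe_preview(preview: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: turn a preview dict into short planning notes."""
    if "error" in preview:
        return [f"preview unavailable: {preview['error']}"]

    n_rows = preview["shape"]["rows"]
    n_cols = preview["shape"]["columns"]
    notes = [f"{preview['dataset_name']}: {n_rows} rows, {n_cols} columns"]

    missing = preview.get("missing_values", {})
    for col, dtype in preview.get("dtypes", {}).items():
        note = f"{col} ({dtype})"
        n_missing = missing.get(col, 0)  # absent key means no missing values
        if n_missing:
            note += f", {n_missing / n_rows:.0%} missing"
        notes.append(note)
    return notes


# Made-up preview for illustration only.
example = {
    "dataset_name": "houses",
    "shape": {"rows": 200, "columns": 2},
    "dtypes": {"price": "float64", "garage": "object"},
    "missing_values": {"garage": 50},
}
for line in describe_preview(example):
    print(line)
# houses: 200 rows, 2 columns
# price (float64)
# garage (object), 25% missing
```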

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "plexe"
-version = "0.18.3"
+version = "0.18.4"
 description = "An agentic framework for building ML models from natural language"
 authors = [
     "marcellodebernardi <[email protected]>",
