diff --git a/docs/architecture/multi-agent-system.md b/docs/architecture/multi-agent-system.md index 995b1778..9ff12e9f 100644 --- a/docs/architecture/multi-agent-system.md +++ b/docs/architecture/multi-agent-system.md @@ -39,6 +39,8 @@ graph TD User([User]) --> |"Intent & Datasets"| Model["Model Class"] subgraph "Multi-Agent System" + Model --> |"Data Registration"| EDA["EDA Agent"] + EDA --> |"Analysis & Reports"| SchemaResolver Model --> |"Schema Resolution"| SchemaResolver["Schema Resolver"] SchemaResolver --> |"Schemas"| Orchestrator Model --> |build| Orchestrator["Manager Agent"] @@ -53,6 +55,7 @@ graph TD subgraph Registry["Object Registry"] Datasets[(Datasets)] + EdaReports[(EDA Reports)] Artifacts[(Model Artifacts)] Code[(Code Snippets)] Schemas[(I/O Schemas)] @@ -70,10 +73,15 @@ graph TD Orchestrator <--> Registry Orchestrator <--> Tools MLS <--> Tools + MLS <--> EdaReports MLE <--> Tools + MLE <--> EdaReports MLOPS <--> Tools SchemaResolver <--> Registry SchemaResolver <--> Tools + SchemaResolver <--> EdaReports + EDA <--> Registry + EDA <--> Tools Orchestrator --> Result([Trained Model]) Result --> Model @@ -82,6 +90,28 @@ graph TD ## Key Components +### EDA Agent + +**Class**: `EdaAgent` +**Type**: `CodeAgent` + +The EDA Agent performs exploratory data analysis on datasets early in the workflow: + +```python +eda_agent = EdaAgent( + model_id=provider_config.orchestrator_provider, + verbose=verbose, + chain_of_thought_callable=cot_callable, +) +``` + +**Responsibilities**: +- Analyzing datasets to understand structure, distributions, and relationships +- Identifying data quality issues, outliers, and missing values +- Generating key insights about the data +- Providing recommendations for preprocessing and modeling +- Registering EDA reports in the Object Registry for use by downstream agents (see the retrieval sketch below)
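+
+Downstream agents read these reports back with the `get_eda_report` tool added in
+this change. A minimal sketch of the round trip, assuming the tool is invoked
+directly as a callable (smolagents tools support this) and a dataset registered
+under the hypothetical name `"sales"`:
+
+```python
+from plexe.internal.models.tools.datasets import get_eda_report
+
+# Fetches the dict registered under "eda_report_sales" in the Object Registry
+report = get_eda_report("sales")
+print(report["insights"])         # key findings produced by the EDA Agent
+print(report["recommendations"])  # suggested preprocessing and modeling steps
+```
+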
### Schema Resolver Agent **Class**: `SchemaResolverAgent` @@ -339,11 +369,17 @@ The multi-agent workflow follows these key steps: - User creates a `Model` instance with intent and datasets - User calls `model.build()` to start the process -2. **Schema Resolution**: +2. **Exploratory Data Analysis**: + - EdaAgent analyzes datasets to understand structure and characteristics + - Generates insights about data patterns, quality issues, and modeling considerations + - EDA reports are registered in the Object Registry for use by other agents + +3. **Schema Resolution**: - If schemas aren't provided, SchemaResolverAgent infers them + - The agent can leverage EDA findings to determine appropriate schemas - Schemas are registered in the Object Registry -3. **Orchestration**: +4. **Orchestration**: - Manager Agent selects metrics and splits datasets - Manager Agent initializes the solution planning phase @@ -372,7 +408,7 @@ The multi-agent workflow follows these key steps: The system uses a hierarchical communication pattern: ``` -User → Model → Schema Resolver → Manager Agent → Specialist Agents → Manager Agent → Model → User +User → Model → EDA Agent → Schema Resolver → Manager Agent → Specialist Agents → Manager Agent → Model → User ``` Each agent communicates through structured task descriptions and responses: @@ -515,7 +551,9 @@ class CustomModelValidator(Validator): - [PlexeAgent Class Definition](/plexe/internal/agents.py) - [Model Class Definition](/plexe/models.py) +- [EdaAgent Definition](/plexe/agents/dataset_analyser.py) - [SchemaResolverAgent Definition](/plexe/agents/schema_resolver.py) - [Tool Definitions](/plexe/internal/models/tools/) +- [Dataset Tools](/plexe/internal/models/tools/datasets.py) - [Executor Implementation](/plexe/internal/models/execution/) - [Object Registry](/plexe/internal/common/registries/objects.py) \ No newline at end of file diff --git a/plexe/agents/dataset_analyser.py b/plexe/agents/dataset_analyser.py new file mode 100644 index 00000000..83af48fa --- /dev/null +++ b/plexe/agents/dataset_analyser.py @@ -0,0 +1,98 @@ +""" +Exploratory Data Analysis (EDA) Agent for generating dataset insights used in ML model building. + +This module defines an EdaAgent that analyzes datasets to generate comprehensive +exploratory data analysis reports before model building begins. +""" + +import logging +from typing import List, Callable + +from smolagents import LiteLLMModel, CodeAgent + +from plexe.config import prompt_templates +from plexe.internal.common.utils.agents import get_prompt_templates +from plexe.internal.models.tools.datasets import register_eda_report +from plexe.internal.models.tools.schemas import get_raw_dataset_schema + +logger = logging.getLogger(__name__) + + +class EdaAgent: + """ + Agent for performing exploratory data analysis on datasets. + + This agent analyzes the available datasets to produce a comprehensive EDA report + containing data overview, feature analysis, relationships, data quality issues, + key insights, and recommendations for modeling. + """ + + def __init__( + self, + model_id: str = "openai/gpt-4o", + verbose: bool = False, + chain_of_thought_callable: Callable = None, + ): + """ + Initialize the EDA agent. + + Args: + model_id: Model ID for the LLM to use for data analysis + verbose: Whether to display detailed agent logs + chain_of_thought_callable: Optional callable for chain of thought logging + """ + self.model_id = model_id + self.verbose = verbose + + # Set verbosity level + self.verbosity = 1 if verbose else 0 + + # Create the EDA agent with the necessary tools + self.agent = CodeAgent( + name="DatasetAnalyser", + description=( + "Expert data analyst that performs exploratory data analysis on datasets " + "to generate insights and recommendations for ML modeling."
+ ), + model=LiteLLMModel(model_id=self.model_id), + tools=[register_eda_report, get_raw_dataset_schema], + add_base_tools=False, + verbosity_level=self.verbosity, + # planning_interval=3, + max_steps=30, + step_callbacks=[chain_of_thought_callable], + additional_authorized_imports=["pandas", "numpy", "plexe"], + prompt_templates=get_prompt_templates("code_agent.yaml", "eda_prompt_templates.yaml"), + ) + + def run( + self, + intent: str, + dataset_names: List[str], + ) -> bool: + """ + Run the EDA agent to analyze datasets and create EDA reports. + + Args: + intent: Natural language description of the model's purpose + dataset_names: List of dataset registry names available for analysis + + Returns: + True once the agent has finished its analysis; the EDA reports themselves are + registered in the Object Registry under the key "eda_report_<dataset_name>" + """ + # Join the dataset names for insertion into the prompt template + datasets_str = ", ".join(dataset_names) + + # Generate the task prompt using the template system + task_description = prompt_templates.eda_agent_prompt( + intent=intent, + datasets=datasets_str, + ) + + # Run the agent to perform the analysis and register the reports + self.agent.run(task_description) + + return True diff --git a/plexe/agents/schema_resolver.py b/plexe/agents/schema_resolver.py index 5e6dca1d..03463241 100644 --- a/plexe/agents/schema_resolver.py +++ b/plexe/agents/schema_resolver.py @@ -10,12 +10,12 @@ import logging from typing import Dict, List, Any, Callable -from smolagents import ToolCallingAgent, LiteLLMModel +from smolagents import LiteLLMModel, CodeAgent from plexe.config import prompt_templates from plexe.internal.common.registries.objects import ObjectRegistry -from plexe.internal.models.tools.datasets import get_dataset_preview -from plexe.internal.models.tools.schemas import get_raw_dataset_schema, register_final_model_schemas +from plexe.internal.models.tools.datasets import get_dataset_preview, get_eda_report +from plexe.internal.models.tools.schemas import register_final_model_schemas logger = logging.getLogger(__name__) @@ -49,14 +49,14 @@ def __init__( self.verbosity = 1 if verbose else 0 # Create the schema resolver agent with the necessary tools - self.agent = ToolCallingAgent( + self.agent = CodeAgent( name="SchemaResolver", description=( "Expert schema resolver that determines the appropriate input and output " "schemas for ML models based on intent and available datasets."
), model=LiteLLMModel(model_id=self.model_id), - tools=[get_dataset_preview, get_raw_dataset_schema, register_final_model_schemas], + tools=[get_dataset_preview, get_eda_report, register_final_model_schemas], add_base_tools=False, verbosity_level=self.verbosity, step_callbacks=[chain_of_thought_callable], diff --git a/plexe/config.py b/plexe/config.py index 21a6045f..57d0d8ff 100644 --- a/plexe/config.py +++ b/plexe/config.py @@ -192,15 +192,6 @@ def planning_system(self) -> str: def planning_select_metric(self, problem_statement) -> str: return self._render("planning/select_metric.jinja", problem_statement=problem_statement) - def planning_generate(self, problem_statement, metric_to_optimise) -> str: - return self._render( - "planning/generate.jinja", - problem_statement=problem_statement, - metric_to_optimise=metric_to_optimise, - allowed_packages=config.code_generation.allowed_packages, - deep_learning_available=config.code_generation.deep_learning_available, - ) - def schema_base(self) -> str: return self._render("schemas/base.jinja") @@ -225,6 +216,13 @@ def schema_resolver_prompt( has_output_schema=has_output_schema, ) + def eda_agent_prompt(self, intent, datasets) -> str: + return self._render( + "agent/agent_data_analyser_prompt.jinja", + intent=intent, + datasets=datasets, + ) + def training_system(self) -> str: return self._render("training/system_prompt.jinja") diff --git a/plexe/internal/agents.py b/plexe/internal/agents.py index be2aae50..934202ea 100644 --- a/plexe/internal/agents.py +++ b/plexe/internal/agents.py @@ -23,7 +23,12 @@ ) from plexe.internal.models.tools.evaluation import get_review_finalised_model from plexe.internal.models.tools.metrics import get_select_target_metric -from plexe.internal.models.tools.datasets import split_datasets, create_input_sample, get_dataset_preview +from plexe.internal.models.tools.datasets import ( + split_datasets, + create_input_sample, + get_dataset_preview, + get_eda_report, +) from plexe.internal.models.tools.schemas import get_raw_dataset_schema from plexe.internal.models.tools.execution import get_executor_tool from plexe.internal.models.tools.response_formatting import ( @@ -106,9 +111,10 @@ def __init__( "- input schema for the model" "- output schema for the model" "- the name and comparison method of the metric to optimise" + "- the name of the dataset to use for training" ), model=LiteLLMModel(model_id=self.ml_researcher_model_id), - tools=[get_dataset_preview], + tools=[get_dataset_preview, get_eda_report], add_base_tools=False, verbosity_level=self.specialist_verbosity, prompt_templates=get_prompt_templates("toolcalling_agent.yaml", "mls_prompt_templates.yaml"), diff --git a/plexe/internal/common/datasets/tabular.py b/plexe/internal/common/datasets/tabular.py index 3545c1f6..ebedc9fc 100644 --- a/plexe/internal/common/datasets/tabular.py +++ b/plexe/internal/common/datasets/tabular.py @@ -38,6 +38,8 @@ def split( test_ratio: float = 0.15, stratify_column: Optional[str] = None, random_state: Optional[int] = None, + is_time_series: bool = False, + time_index_column: Optional[str] = None, ) -> Tuple["TabularDataset", "TabularDataset", "TabularDataset"]: """ Split dataset into train, validation and test sets. 
@@ -45,16 +47,49 @@ def split( :param train_ratio: Proportion of data to use for training :param val_ratio: Proportion of data to use for validation :param test_ratio: Proportion of data to use for testing - :param stratify_column: Column to use for stratified splitting - :param random_state: Random seed for reproducibility + :param stratify_column: Column to use for stratified splitting (not used for time series) + :param random_state: Random seed for reproducibility (not used for time series) + :param is_time_series: Whether the data is chronological time series data + :param time_index_column: Column name that represents the time index, required if is_time_series=True :returns: A tuple of (train_dataset, val_dataset, test_dataset) - :raises ValueError: If ratios don't sum to approximately 1.0 + :raises ValueError: If ratios don't sum to approximately 1.0 or if time_index_column is missing for time series """ - from sklearn.model_selection import train_test_split - if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-10: raise ValueError("Split ratios must sum to 1.0") + # Handle time series data + if is_time_series: + if not time_index_column: + raise ValueError("time_index_column must be provided when is_time_series=True") + + if time_index_column not in self._data.columns: + raise ValueError(f"time_index_column '{time_index_column}' not found in dataset columns") + + # Sort by time index + sorted_data = self._data.sort_values(by=time_index_column).reset_index(drop=True) + + # Calculate split indices + n_samples = len(sorted_data) + train_end = int(n_samples * train_ratio) + val_end = train_end + int(n_samples * val_ratio) + + # Split the data sequentially + train_data = sorted_data.iloc[:train_end] + val_data = sorted_data.iloc[train_end:val_end] + test_data = sorted_data.iloc[val_end:] + + # Handle edge cases for empty splits + empty_df = pd.DataFrame(columns=self._data.columns) + if val_ratio < 1e-10: + val_data = empty_df + if test_ratio < 1e-10: + test_data = empty_df + + return TabularDataset(train_data), TabularDataset(val_data), TabularDataset(test_data) + + # Regular random splitting for non-time series data + from sklearn.model_selection import train_test_split + # Handle all-data-to-train edge case if val_ratio < 1e-10 and test_ratio < 1e-10: return ( @@ -101,7 +136,7 @@ def split( stratify=temp_data[stratify_column] if stratify_column else None, random_state=random_state, ) - return (TabularDataset(train_data), TabularDataset(val_data), TabularDataset(test_data)) + return TabularDataset(train_data), TabularDataset(val_data), TabularDataset(test_data) def sample( self, n: int = None, frac: float = None, replace: bool = False, random_state: int = None diff --git a/plexe/internal/common/utils/chain_of_thought/emitters.py b/plexe/internal/common/utils/chain_of_thought/emitters.py index 9067a6bd..569b29f9 100644 --- a/plexe/internal/common/utils/chain_of_thought/emitters.py +++ b/plexe/internal/common/utils/chain_of_thought/emitters.py @@ -136,10 +136,12 @@ def _get_agent_color(agent_name: str) -> str: """Get the color for an agent based on its role.""" agent_colors = { "System": "bright_blue", - "ML Research Scientist": "green", - "ML Engineer": "yellow", - "ML Ops Engineer": "magenta", + "MLResearchScientist": "green", + "MLEngineer": "yellow", + "MLOperationsEngineer": "magenta", "Orchestrator": "cyan", + "DatasetAnalyser": "red", + "SchemaResolver": "orange", # Default color "default": "blue", } diff --git a/plexe/internal/models/callbacks/mlflow.py 
b/plexe/internal/models/callbacks/mlflow.py index 9b84f5ea..17f7b1d7 100644 --- a/plexe/internal/models/callbacks/mlflow.py +++ b/plexe/internal/models/callbacks/mlflow.py @@ -143,8 +143,15 @@ def on_iteration_end(self, info: BuildStateInfo) -> None: with open(code_path, "w") as f: f.write(info.node.training_code) mlflow.log_artifact(str(code_path)) + # Clean up the temporary file after logging + code_path.unlink(missing_ok=True) except Exception as e: logger.warning(f"Could not log trainer source: {e}") + # Attempt to clean up the file even if logging failed + try: + Path("trainer_source.py").unlink(missing_ok=True) + except Exception: + pass # Log node performance if available if info.node.performance: diff --git a/plexe/internal/models/generation/planning.py b/plexe/internal/models/generation/planning.py index 5df4d66a..c68d49b1 100644 --- a/plexe/internal/models/generation/planning.py +++ b/plexe/internal/models/generation/planning.py @@ -27,22 +27,6 @@ def __init__(self, provider: Provider): """ self.provider: Provider = provider - def generate_solution_plan(self, problem_statement: str, metric_to_optimise: str) -> str: - """ - Generates a solution plan for the given problem statement. - - :param problem_statement: definition of the problem - :param metric_to_optimise: the metric to optimise - :return: the generated solution plan - """ - return self.provider.query( - system_message=prompt_templates.planning_system(), - user_message=prompt_templates.planning_generate( - problem_statement=problem_statement, - metric_to_optimise=metric_to_optimise, - ), - ) - def select_target_metric(self, problem_statement: str) -> Metric: """ Selects the metric to optimise for the given problem statement and dataset. diff --git a/plexe/internal/models/tools/datasets.py b/plexe/internal/models/tools/datasets.py index 709a9419..8d72ca80 100644 --- a/plexe/internal/models/tools/datasets.py +++ b/plexe/internal/models/tools/datasets.py @@ -3,10 +3,12 @@ These tools help with dataset operations within the model generation pipeline, including splitting datasets into training, validation, and test sets, registering datasets with -the dataset registry, creating sample data for validation, and previewing dataset content. +the dataset registry, creating sample data for validation, previewing dataset content, +and registering exploratory data analysis (EDA) reports. 
""" import logging +from datetime import datetime from typing import Dict, List, Any import numpy as np @@ -24,6 +26,8 @@ def split_datasets( train_ratio: float = 0.9, val_ratio: float = 0.1, test_ratio: float = 0.0, + is_time_series: bool = False, + time_index_column: str = None, ) -> Dict[str, List[str]]: """ Split datasets into train, validation, and test sets and register the new split datasets with @@ -35,6 +39,8 @@ def split_datasets( train_ratio: Ratio of data to use for training (default: 0.9) val_ratio: Ratio of data to use for validation (default: 0.1) test_ratio: Ratio of data to use for testing (default: 0.0) + is_time_series: Whether the data is chronological time series data (default: False) + time_index_column: Column name that represents the time index, required if is_time_series=True Returns: Dictionary containing lists of registered dataset names: @@ -55,7 +61,13 @@ def split_datasets( logger.debug("🔪 Splitting datasets into train, validation, and test sets") for name in datasets: dataset = object_registry.get(TabularConvertible, name) - train_ds, val_ds, test_ds = dataset.split(train_ratio=train_ratio, val_ratio=val_ratio, test_ratio=test_ratio) + train_ds, val_ds, test_ds = dataset.split( + train_ratio=train_ratio, + val_ratio=val_ratio, + test_ratio=test_ratio, + is_time_series=is_time_series, + time_index_column=time_index_column, + ) # Register split datasets in the registry train_name = f"{name}_train" @@ -194,3 +206,99 @@ def get_dataset_preview(dataset_name: str) -> Dict[str, Any]: "error": f"Failed to generate preview for dataset '{dataset_name}': {str(e)}", "dataset_name": dataset_name, } + + +@tool +def register_eda_report( + dataset_name: str, + overview: Dict[str, Any], + feature_analysis: Dict[str, Any], + relationships: Dict[str, Any], + data_quality: Dict[str, Any], + insights: List[str], + recommendations: List[str], +) -> bool: + """ + Register an exploratory data analysis (EDA) report for a dataset in the Object Registry. + + This tool creates a structured report with findings from exploratory data analysis and + registers it in the Object Registry for use by other agents. + + Args: + dataset_name: Name of the dataset that was analyzed + overview: General dataset statistics including shape, data types, memory usage + feature_analysis: Analysis of individual features with distributions and statistics + relationships: Correlation analysis and feature relationships + data_quality: Information about missing values, outliers, and data issues + insights: Key insights derived from the analysis + recommendations: Recommendations for preprocessing and modeling + + Returns: + True if the report was successfully registered, False otherwise + """ + object_registry = ObjectRegistry() + + try: + # Create structured EDA report + eda_report = { + "dataset_name": dataset_name, + "timestamp": datetime.now().isoformat(), + "overview": overview, + "feature_analysis": feature_analysis, + "relationships": relationships, + "data_quality": data_quality, + "insights": insights, + "recommendations": recommendations, + } + + # Register in registry + object_registry.register(dict, f"eda_report_{dataset_name}", eda_report) + logger.debug(f"✅ Registered EDA report for dataset '{dataset_name}'") + return True + + except Exception as e: + logger.warning(f"⚠️ Error registering EDA report: {str(e)}") + return False + + +@tool +def get_eda_report(dataset_name: str) -> Dict[str, Any]: + """ + Retrieve an exploratory data analysis (EDA) report for a dataset generated by a data analyst. 
+ + This tool fetches the EDA report previously created by the data analysis agent, containing + comprehensive findings about the dataset's structure, features, relationships, and quality. + + Args: + dataset_name: Name of the dataset to retrieve the EDA report for + + Returns: + Dictionary containing the complete EDA report + """ + object_registry = ObjectRegistry() + + try: + # Check if EDA report exists + report_key = f"eda_report_{dataset_name}" + + # Get the report from registry + eda_report = object_registry.get(dict, report_key) + logger.debug(f"✅ Retrieved EDA report for dataset '{dataset_name}'") + return eda_report + + except KeyError: + # Report not found + logger.warning(f"⚠️ No EDA report found for dataset '{dataset_name}'") + return { + "error": f"No EDA report found for dataset '{dataset_name}'", + "dataset_name": dataset_name, + "available": False, + } + + except Exception as e: + logger.warning(f"⚠️ Error retrieving EDA report: {str(e)}") + return { + "error": f"Failed to retrieve EDA report for dataset '{dataset_name}': {str(e)}", + "dataset_name": dataset_name, + "available": False, + } diff --git a/plexe/models.py b/plexe/models.py index cdee1ccb..affb3aea 100644 --- a/plexe/models.py +++ b/plexe/models.py @@ -51,6 +51,7 @@ from plexe.callbacks import Callback, BuildStateInfo, ChainOfThoughtModelCallback from plexe.internal.common.utils.chain_of_thought.emitters import ConsoleEmitter from plexe.agents.schema_resolver import SchemaResolverAgent +from plexe.agents.dataset_analyser import EdaAgent from plexe.internal.agents import PlexeAgent from plexe.internal.common.datasets.interface import Dataset, TabularConvertible from plexe.internal.common.datasets.adapter import DatasetAdapter @@ -135,7 +136,7 @@ def __init__( self.predictor_source: str | None = None self.artifacts: List[Artifact] = [] self.metric: Metric | None = None - self.metadata: Dict[str, str] = dict() # todo: initialise metadata, etc + self.metadata: Dict[str, Any] = dict() # todo: initialise metadata, etc # Registries used to make datasets, artifacts and other objects available across the system self.object_registry = ObjectRegistry() @@ -214,7 +215,18 @@ def build( } self.object_registry.register_multiple(TabularConvertible, self.training_data) - # Step 2: define model schemas using the SchemaResolverAgent (only if schemas are not provided) + # Step 2: run the EDA agent to analyze datasets + eda_agent = EdaAgent( + model_id=provider_config.orchestrator_provider, + verbose=verbose, + chain_of_thought_callable=cot_callable, + ) + eda_agent.run( + intent=self.intent, + dataset_names=list(self.training_data.keys()), + ) + + # Step 3: define model schemas using the SchemaResolverAgent (only if schemas are not provided) if self.input_schema is not None: self.object_registry.register(dict, "input_schema", format_schema(self.input_schema)) if self.output_schema is not None: @@ -222,7 +234,7 @@ def build( # Create and run the schema resolver agent schema_resolver_agent = SchemaResolverAgent( - model_id=provider_config.tool_provider, + model_id=provider_config.orchestrator_provider, verbose=verbose, chain_of_thought_callable=cot_callable, ) @@ -269,11 +281,24 @@ def build( # Log a shorter message at warning level logger.warning(f"Error in callback {callback.__class__.__name__}.on_build_start: {str(e)[:50]}") - # Step 3: generate model + # Step 4: generate model # Start the model generation run # Get schema reasoning if available schema_reasoning = self.object_registry.get(str, "schema_reasoning") + # Get EDA 
report names to provide context to the agents + eda_report_names = [] + try: + # Look for EDA reports in the object registry + eda_report_names = [ + name.split("://")[1] + for name in self.object_registry.list() + if str(dict) in name and "eda_report_" in name + ] + logger.debug(f"Found EDA reports: {eda_report_names}") + except (IndexError, KeyError) as e: + logger.warning(f"Unable to extract EDA report names: {str(e)}") + agent_prompt = prompt_templates.agent_builder_prompt( intent=self.intent, input_schema=json.dumps(format_schema(self.input_schema), indent=4), @@ -357,6 +382,10 @@ def build( self.metadata["ops_provider"] = str(provider_config.ops_provider) self.metadata["tool_provider"] = str(provider_config.tool_provider) + # Store each dataset's EDA report in metadata under a per-dataset key, so reports for earlier datasets are not overwritten + for name in self.training_data: + self.metadata[f"eda_report_{name}"] = self.object_registry.get(dict, f"eda_report_{name}") + self.state = ModelState.READY except Exception as e: diff --git a/plexe/templates/prompts/agent/agent_data_analyser_prompt.jinja b/plexe/templates/prompts/agent/agent_data_analyser_prompt.jinja new file mode 100644 index 00000000..5105558a --- /dev/null +++ b/plexe/templates/prompts/agent/agent_data_analyser_prompt.jinja @@ -0,0 +1,25 @@ +# Task: Perform Exploratory Data Analysis + +Analyze the datasets listed below to support this machine learning task: + +**Task description**: "{{intent}}" + +**Datasets to analyze**: {{datasets}} + +Analyze each dataset thoroughly to identify patterns, relationships, and quality issues that will inform model +development. Focus on generating actionable insights that will help in building a better machine learning model. + +For each dataset: +1. Access the dataset directly from the registry +2. Analyze its structure, features, and quality +3. Produce your findings as a report using the register_eda_report tool + +Your analysis should be comprehensive, focusing on aspects that are most relevant to the machine +learning task described above. A team of machine learning scientists and engineers will use your findings to +carry out feature engineering and develop models, so make sure to focus your report on information that is actionable +from that perspective. + +You MUST ACCESS the datasets directly from the registry, as there is no other way for you to +analyze them. Use good data science judgement, and let the findings of your analysis guide your next steps. + +Do NOT attempt to plot the data, as you do not have access to a display. \ No newline at end of file diff --git a/plexe/templates/prompts/agent/chat-system-prompt.yaml b/plexe/templates/prompts/agent/chat-system-prompt.yaml deleted file mode 100644 index 2d36239d..00000000 --- a/plexe/templates/prompts/agent/chat-system-prompt.yaml +++ /dev/null @@ -1,162 +0,0 @@ -system_prompt: |- - You are an expert assistant who can solve any task using tool calls. You will be given a task to solve as best you can. - To do so, you have been given access to some tools. - - The tool call you write is an action: after the tool is executed, you will get the result of the tool call as an "observation". - This Action/Observation can repeat N times, you should take several steps when needed. - - You can use the result of the previous action as input for the next action. - The observation will always be a string: it can represent a file, like "image_1.jpg". - Then you can use it as input for the next action.
You can do it for instance as follows: - - Observation: "image_1.jpg" - - Action: - { - "name": "image_transformer", - "arguments": {"image": "image_1.jpg"} - } - - To provide the final answer to the task, use an action blob with "name": "final_answer" tool. It is the only way to complete the task, else you will be stuck on a loop. So your final output should look like this: - Action: - { - "name": "final_answer", - "arguments": {"answer": "insert your final answer here"} - } - - - Here are a few examples using notional tools: - --- - Task: "Generate an image of the oldest person in this document." - - Action: - { - "name": "document_qa", - "arguments": {"document": "document.pdf", "question": "Who is the oldest person mentioned?"} - } - Observation: "The oldest person in the document is John Doe, a 55 year old lumberjack living in Newfoundland." - - Action: - { - "name": "image_generator", - "arguments": {"prompt": "A portrait of John Doe, a 55-year-old man living in Canada."} - } - Observation: "image.png" - - Action: - { - "name": "final_answer", - "arguments": "image.png" - } - - --- - Task: "What is the result of the following operation: 5 + 3 + 1294.678?" - - Action: - { - "name": "python_interpreter", - "arguments": {"code": "5 + 3 + 1294.678"} - } - Observation: 1302.678 - - Action: - { - "name": "final_answer", - "arguments": "1302.678" - } - - --- - Task: "Which city has the highest population , Guangzhou or Shanghai?" - - Action: - { - "name": "search", - "arguments": "Population Guangzhou" - } - Observation: ['Guangzhou has a population of 15 million inhabitants as of 2021.'] - - - Action: - { - "name": "search", - "arguments": "Population Shanghai" - } - Observation: '26 million (2019)' - - Action: - { - "name": "final_answer", - "arguments": "Shanghai" - } - - Above example were using notional tools that might not exist for you. You only have access to these tools: - {%- for tool in tools.values() %} - - {{ tool.name }}: {{ tool.description }} - Takes inputs: {{tool.inputs}} - Returns an output of type: {{tool.output_type}} - {%- endfor %} - - {%- if managed_agents and managed_agents.values() | list %} - You can also give tasks to team members. - Calling a team member works the same as for calling a tool: simply, the only argument you can give in the call is 'task', a long string explaining your task. - Given that this team member is a real human, you should be very verbose in your task. - Here is a list of the team members that you can call: - {%- for agent in managed_agents.values() %} - - {{ agent.name }}: {{ agent.description }} - {%- endfor %} - {%- endif %} - - Here are the rules you should always follow to solve your task: - 1. ALWAYS provide a tool call, else you will fail. - 2. Always use the right arguments for the tools. Never use variable names as the action arguments, use the value instead. - 3. Call a tool only when needed: do not call the search agent if you do not need information, try to solve the task yourself. - If no tool call is needed, use final_answer tool to return your answer. - 4. Never re-do a tool call that you previously did with the exact same parameters. - - Now Begin! If you solve the task correctly, you will receive a reward of $1,000,000. - - --- - You are an expert ML engineer who helps users build machine learning models through conversation. - - Your approach should adapt based on the user's ML expertise level (beginner, intermediate, or advanced). 
Infer the - user's expertise from their responses and adjust your communication style accordingly: - - For BEGINNERS: - - Focus on gathering essential information first, in a friendly, educational manner - - Explain ML concepts in simple terms without overwhelming with technical details - - Introduce options gradually, focusing on the core requirements first - - For INTERMEDIATE users: - - Be more direct in gathering information while still providing helpful context - - Use more technical language and ML terminology where appropriate - - Introduce advanced options after covering the basics - - For ADVANCED users: - - Use technical language and ML terminology freely - - Present all options early in the conversation - - Accept technical shorthand and make reasonable inferences - - To build a model, you need to collect: - 1. INTENT: A clear description of what the model should do - 2. DATASETS: CSV or Parquet file paths containing training data - 3. INPUT SCHEMA: The fields to use as inputs and their types (string, int, float, boolean) - 4. OUTPUT SCHEMA: What the model should predict and its type - - PROGRESSIVE DISCLOSURE OF OPTIONS: - - Start by collecting the core information (intent, datasets, schemas) - - Only after the basics are covered, introduce additional options based on the user's expertise: - - PROVIDER: The LLM to use (default: "openai/gpt-4o-mini") - - SPECIALIZED PROVIDERS: For specific agent roles (orchestrator, research, engineer, ops) - - MAX_ITERATIONS: Number of iterations for model building - - TIMEOUT: Maximum time in seconds for model building - - Guidelines: - Guide the conversation naturally to collect all required information - - Ask for clarification when the user's input is ambiguous - - For schemas, extract field names and types from the user's descriptions. Ensure schemas are in the expected format. - - Be patient and help users who may not be familiar with ML concepts - - After building, explain the model's performance and capabilities - - Example schema format: - - Input schema: {"age": "int", "income": "float", "has_degree": "boolean"} - - Output schema: {"risk_score": "float"} diff --git a/plexe/templates/prompts/agent/eda_prompt_templates.yaml b/plexe/templates/prompts/agent/eda_prompt_templates.yaml new file mode 100644 index 00000000..264e95a3 --- /dev/null +++ b/plexe/templates/prompts/agent/eda_prompt_templates.yaml @@ -0,0 +1,206 @@ +system_prompt: |- + You are an expert assistant who can solve any task using code blobs. You will be given a task to solve as best you can. + To do so, you have been given access to a list of tools: these tools are basically Python functions which you can call with code. + To solve the task, you must plan forward to proceed in a series of steps, in a cycle of 'Thought:', 'Code:', and 'Observation:' sequences. + + At each step, in the 'Thought:' sequence, you should first explain your reasoning towards solving the task and the tools that you want to use. + Then in the 'Code:' sequence, you should write the code in simple Python. The code sequence must end with '<end_code>' sequence. + During each intermediate step, you can use 'print()' to save whatever important information you will then need. + These print outputs will then appear in the 'Observation:' field, which will be available as input for the next step. + In the end you have to return a final answer using two tools one after the other: + - First, use the `format_final_agent_response` tool to structure all the required final answer information.
+ - Then, submit the output of `format_final_agent_response` to the `final_answer` tool. + + Here are a few examples using notional tools: + --- + Task: "Generate an image of the oldest person in this document." + + Thought: I will proceed step by step and use the following tools: `document_qa` to find the oldest person in the document, then `image_generator` to generate an image according to the answer. + Code: + ```py + answer = document_qa(document=document, question="Who is the oldest person mentioned?") + print(answer) + ``` + Observation: "The oldest person in the document is John Doe, a 55 year old lumberjack living in Newfoundland." + + Thought: I will now generate an image showcasing the oldest person. + Code: + ```py + image = image_generator("A portrait of John Doe, a 55-year-old man living in Canada.") + final_answer(image) + ``` + + --- + Task: "What is the result of the following operation: 5 + 3 + 1294.678?" + + Thought: I will use python code to compute the result of the operation and then return the final answer using the `final_answer` tool + Code: + ```py + result = 5 + 3 + 1294.678 + final_answer(result) + ``` + + --- + Task: + "Answer the question in the variable `question` about the image stored in the variable `image`. The question is in French. + You have been provided with these additional arguments, that you can access using the keys as variables in your python code: + {'question': 'Quel est l'animal sur l'image?', 'image': 'path/to/image.jpg'}" + + Thought: I will use the following tools: `translator` to translate the question into English and then `image_qa` to answer the question on the input image. + Code: + ```py + translated_question = translator(question=question, src_lang="French", tgt_lang="English") + print(f"The translated question is {translated_question}.") + answer = image_qa(image=image, question=translated_question) + final_answer(f"The answer is {answer}") + ``` + + --- + Task: + In a 1979 interview, Stanislaus Ulam discusses with Martin Sherwin about other great physicists of his time, including Oppenheimer. + What does he say was the consequence of Einstein learning too much math on his creativity, in one word? + + Thought: I need to find and read the 1979 interview of Stanislaus Ulam with Martin Sherwin. + Code: + ```py + pages = search(query="1979 interview Stanislaus Ulam Martin Sherwin physicists Einstein") + print(pages) + ``` + Observation: + No result found for query "1979 interview Stanislaus Ulam Martin Sherwin physicists Einstein". + + Thought: The query was maybe too restrictive and did not find any results. Let's try again with a broader query. + Code: + ```py + pages = search(query="1979 interview Stanislaus Ulam") + print(pages) + ``` + Observation: + Found 6 pages: + [Stanislaus Ulam 1979 interview](https://ahf.nuclearmuseum.org/voices/oral-histories/stanislaus-ulams-interview-1979/) + + [Ulam discusses Manhattan Project](https://ahf.nuclearmuseum.org/manhattan-project/ulam-manhattan-project/) + + (truncated) + + Thought: I will read the first 2 pages to know more. + Code: + ```py + for url in ["https://ahf.nuclearmuseum.org/voices/oral-histories/stanislaus-ulams-interview-1979/", "https://ahf.nuclearmuseum.org/manhattan-project/ulam-manhattan-project/"]: + whole_page = visit_webpage(url) + print(whole_page) + print("\n" + "="*80 + "\n") # Print separator between pages + ``` + Observation: + Manhattan Project Locations: + Los Alamos, NM + Stanislaus Ulam was a Polish-American mathematician. 
He worked on the Manhattan Project at Los Alamos and later helped design the hydrogen bomb. In this interview, he discusses his work at + (truncated) + + Thought: I now have the final answer: from the webpages visited, Stanislaus Ulam says of Einstein: "He learned too much mathematics and sort of diminished, it seems to me personally, it seems to me his purely physics creativity." Let's answer in one word. + Code: + ```py + final_answer("diminished") + ``` + + --- + Task: "Which city has the highest population: Guangzhou or Shanghai?" + + Thought: I need to get the populations for both cities and compare them: I will use the tool `search` to get the population of both cities. + Code: + ```py + for city in ["Guangzhou", "Shanghai"]: + print(f"Population {city}:", search(f"{city} population")) + ``` + Observation: + Population Guangzhou: ['Guangzhou has a population of 15 million inhabitants as of 2021.'] + Population Shanghai: '26 million (2019)' + + Thought: Now I know that Shanghai has the highest population. + Code: + ```py + final_answer("Shanghai") + ``` + + --- + Task: "What is the current age of the pope, raised to the power 0.36?" + + Thought: I will use the tool `wiki` to get the age of the pope, and confirm that with a web search. + Code: + ```py + pope_age_wiki = wiki(query="current pope age") + print("Pope age as per wikipedia:", pope_age_wiki) + pope_age_search = web_search(query="current pope age") + print("Pope age as per google search:", pope_age_search) + ``` + Observation: + Pope age: "The pope Francis is currently 88 years old." + + Thought: I know that the pope is 88 years old. Let's compute the result using python code. + Code: + ```py + pope_current_age = 88 ** 0.36 + final_answer(pope_current_age) + ``` + + The above examples use notional tools that might not exist for you. On top of performing computations in the Python code snippets that you create, you only have access to these tools: + {%- for tool in tools.values() %} + - {{ tool.name }}: {{ tool.description }} + Takes inputs: {{tool.inputs}} + Returns an output of type: {{tool.output_type}} + {%- endfor %} + + {%- if managed_agents and managed_agents.values() | list %} + You can also give tasks to team members. + Calling a team member works the same as for calling a tool: simply, the only argument you can give in the call is 'task', a long string explaining your task. + Given that this team member is a real human, you should be very verbose in your task. + Here is a list of the team members that you can call: + {%- for agent in managed_agents.values() %} + - {{ agent.name }}: {{ agent.description }} + {%- endfor %} + {%- endif %} + + Here are the rules you should always follow to solve your task: + 1. Always provide a 'Thought:' sequence, and a 'Code:\n```py' sequence ending with '```<end_code>' sequence, else you will fail. + 2. Use only variables that you have defined! + 3. Always use the right arguments for the tools. DO NOT pass the arguments as a dict as in 'answer = wiki({'query': "What is the place where James Bond lives?"})', but use the arguments directly as in 'answer = wiki(query="What is the place where James Bond lives?")'. + 4. Take care to not chain too many sequential tool calls in the same code block, especially when the output format is unpredictable. For instance, a call to search has an unpredictable return format, so do not have another tool call that depends on its output in the same block: rather output results with print() to use them in the next block. + 5.
Call a tool only when needed, and never re-do a tool call that you previously did with the exact same parameters. + 6. Don't name any new variable with the same name as a tool: for instance don't name a variable 'final_answer'. + 7. Never create any notional variables in your code, as having these in your logs will derail you from the true variables. + 8. You can use imports in your code, but only from the following list of modules: {{authorized_imports}} + 9. The state persists between code executions: so if in one step you've created variables or imported modules, these will all persist. + 10. Don't give up! You're in charge of solving the task, not providing directions to solve it. + + Now Begin! If you solve the task correctly, you will receive a reward of $1,000,000. + + --- + You are an expert data scientist specializing in exploratory data analysis (EDA). + + Your task is to analyze datasets and generate comprehensive reports containing key insights that will help in + building machine learning models. You should examine each dataset's structure, identify patterns, and uncover + potential issues that might affect model development. Use good data science judgement, and let the + findings of your analysis guide your next steps. + + To access datasets, YOU MUST USE the following pattern: + + --- + from plexe.internal.common.registries.objects import ObjectRegistry + from plexe.internal.common.datasets.interface import TabularConvertible + + # Get dataset from registry + object_registry = ObjectRegistry() + dataset = object_registry.get(TabularConvertible, dataset_name) + df = dataset.to_pandas() # Convert to pandas DataFrame for analysis + + # Now you can analyze the dataframe using pandas methods + --- + + When your analysis is complete, register your findings using the `register_eda_report` tool with this structure: + - dataset_name: Name of the dataset analyzed + - overview: General dataset statistics (shape, types, etc.) + - feature_analysis: Per-feature analysis of distributions and statistics + - relationships: Correlation analysis and feature interrelationships + - data_quality: Missing values, outliers, and other data quality issues + - insights: Key findings that impact model development (3-5 points) + - recommendations: Suggested preprocessing steps and modeling approaches \ No newline at end of file diff --git a/plexe/templates/prompts/agent/mls_prompt_templates.yaml b/plexe/templates/prompts/agent/mls_prompt_templates.yaml index 95104374..451328c6 100644 --- a/plexe/templates/prompts/agent/mls_prompt_templates.yaml +++ b/plexe/templates/prompts/agent/mls_prompt_templates.yaml @@ -14,14 +14,17 @@ managed_agent: or the LLM to use for plan generation, you should reject the task and ask your manager to provide the required information. - You can use the get_dataset_preview tool to examine the available datasets before formulating your solution plans. - This will help you understand the data characteristics (data types, missing values, basic statistics) - and propose more targeted approaches. Use the tool by providing a dataset name. + You can use the get_dataset_preview tool to get a simple preview of the data, and the get_eda_report tool to + get a thorough EDA report of the dataset. You should use both of these tools as they will help you understand + the data characteristics (data types, missing values, basic statistics) and propose more targeted approaches. + Use each tool by providing a dataset name. The solution concepts should be explained in 3-5 sentences each.
Do not include implementations of the solutions, though you can include small code snippets if absolutely required to explain a plan. - Do not suggest doing EDA, ensembling, or hyperparameter tuning. The solutions should be feasible using only - {{allowed_packages}}, and no other non-standard libraries. + Do not suggest doing further EDA, ensembling, or hyperparameter tuning. The solutions should be feasible using only + {{allowed_packages}}, and no other non-standard libraries. Your solutions should always favour simplicity; only + use complex model types if absolutely necessary. It's okay to try multiple similar solutions if it's clear that + a particular type of solution is best for the task. For EACH individual solution, your final_answer WILL HAVE to contain these parts: ### 1. Solution Plan 'Headline' (short version): diff --git a/plexe/templates/prompts/agent/schema_resolver_prompt.jinja b/plexe/templates/prompts/agent/schema_resolver_prompt.jinja index 2ba48106..29220ae6 100644 --- a/plexe/templates/prompts/agent/schema_resolver_prompt.jinja +++ b/plexe/templates/prompts/agent/schema_resolver_prompt.jinja @@ -5,7 +5,9 @@ For ML models, schemas define the expected data types and structure for: - Output schema: What data the model will return after processing Given this ML task: "{{intent}}" -And these available datasets: {{datasets}} + +And these available datasets: +{{datasets}} {% if has_input_schema and has_output_schema %} Review both user-provided schemas against the dataset structure: @@ -22,10 +24,9 @@ Infer both input and output schemas based on the model intent and datasets. {% endif %} Workflow: -1. Examine datasets using get_dataset_preview to understand structure -2. For relevant datasets, use get_raw_dataset_schema for detailed type info -3. Based on your analysis, determine appropriate input and output schemas -4. Call register_final_model_schemas with your determined schemas and reasoning +1. Examine datasets using the get_eda_report and get_dataset_preview tools to understand schema, structure, and properties +2. Based on your analysis, determine appropriate input and output schemas +3. Call register_final_model_schemas with your determined schemas and reasoning Key requirements: 1. Model schemas may DIFFER from raw data if transformations are needed diff --git a/plexe/templates/prompts/planning/generate.jinja b/plexe/templates/prompts/planning/generate.jinja deleted file mode 100644 index b1b6a65d..00000000 --- a/plexe/templates/prompts/planning/generate.jinja +++ /dev/null @@ -1,15 +0,0 @@ -Write a solution plan for the machine learning problem outlined below. The solution must produce -a model that achieves the best possible performance on {{metric_to_optimise}}. - -{% if deep_learning_available %} -If appropriate, consider using pre-trained models under 20MB that can be fine-tuned with the provided data. -{% endif %} - -# TASK: -{{problem_statement}} - -# INSTRUCTIONS FOR YOU -The solution concept should be explained in 3-5 sentences. Do not include an implementation of the -solution, though you can include small code snippets if relevant to explain the plan. -Do not suggest doing EDA, ensembling, or hyperparameter tuning. -The solution should be feasible using only {{allowed_packages}}, and no other non-standard libraries.
\ No newline at end of file diff --git a/plexe/templates/prompts/utils/cot_summarize.jinja b/plexe/templates/prompts/utils/cot_summarize.jinja index 937652d3..0d474e87 100644 --- a/plexe/templates/prompts/utils/cot_summarize.jinja +++ b/plexe/templates/prompts/utils/cot_summarize.jinja @@ -1,25 +1,34 @@ Your task is to examine details about a reasoning step taken by an engineer and generate: -1. A clear, professional title (3-8 words) that captures the essence of what happened -2. A concise summary (exactly 1-3 sentences) that explains the step in technical terms +1. A clear, technical title (3-8 words) that captures the essence of what happened +2. A summary (exactly 3 sentences) that explains the step in "thought-action-observation" format -Example: -Step Type: Thinking +## Example + +The following snippet: +--- Thought: I need to analyze the dataset to understand the relationships between features. Let me look at the correlation matrix to identify patterns. -Tool: pandas_describe({"dataframe": df}) -Observation: The dataset has 5000 rows and 15 columns. There's a strong correlation between age and income. +Code: +```py +answer = pandas_df.corr() +print(answer) +``` +Observation: "The dataset has 5000 rows and 15 columns. There's a strong correlation between age and income." +--- Would generate: -Title: "Analyzing Dataset Relationships" -Summary: "I analyzed the dataset's correlation matrix to identify feature relationships. I observed a strong correlation between age and income, which is relevant for further analysis." +--- +Title: Analyzing Dataset Relationships +Summary: I needed to analyze the dataset to understand the relationships between features. +I generated a correlation matrix using pandas to identify patterns. +The dataset has 5000 rows and 15 columns, and there is a strong correlation between age and income. +--- -Context: +## Context to summarize: {{ context }} -Instructions: -- Focus on capturing the main technical action/purpose of the step -- In the summary, state both the thought/action and the result in precise, technical language +## Instructions: +- Focus on the purpose, action and outcome of the step +- In the summary, use precise, technical language - Title should be 3-8 words -- Summary should be 1-3 sentences +- Summary should be 3 sentences, formatted as in the example above (on three lines) - Include specific technical details (e.g., feature names, patterns found, error cause) to clearly convey the outcome - Use first-person and past tense, e.g., "I analyzed..." or "I observed..." - Maintain a friendly but concise tone; you're technical and precise, but not overly formal diff --git a/pyproject.toml b/pyproject.toml index 2349aee1..edbb5b3d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [tool.poetry] name = "plexe" -version = "0.19.0" +version = "0.20.0" description = "An agentic framework for building ML models from natural language" authors = [ "marcellodebernardi ",