feat: Proposed SIMBAUQ Sampling Strategy #785
radum2275 wants to merge 4 commits into generative-computing:main from
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
The PR description has been updated. Please fill out the template for your PR to be reviewed.
planetf1 left a comment
I also noticed that we don't export SOFAISamplingStrategy in __all__. That isn't an issue with this PR, just an observation.
@pytest.mark.ollama
@pytest.mark.llm
Suggested change:
- @pytest.mark.llm
+ @pytest.mark.e2e
llm was recently removed/deprecated as a marker. Our 'integration' tests are those that exercise multiple mellea components together but don't require external dependencies (like ollama), hence e2e as the classification here.
@@ -0,0 +1,216 @@
# pytest: openai, llm, qualitative
Suggested change:
- # pytest: openai, llm, qualitative
+ # pytest: openai, e2e, qualitative, skip
Since this has dependencies we don't set up automatically, it can't run automatically in most environments or CI, so I think we need skip. (I also updated llm -> e2e.)
The openai marker isn't applicable here because the example uses rits. (We'd need to clarify what the marker means in the framework automation, since the example does of course use the OpenAI API.)
uv run python docs/examples/simbauq/simbauq_example.py

Requires:
RITS_API_KEY environment variable or hardcoded key below.
This won't work for users outside IBM. Should the example be based on an external or local service?
+1. Everything on public GitHub should reference only external or local services.
docs/docs/api/
docs/docs/api-reference.mdx
.venv-docs-autogen/
CLAUDE.md
Suggested change:
- CLAUDE.md
We have a CLAUDE.md checked in as part of the project so it should not be ignored.
# SIMBA-UQ Sampling Strategy

Confidence-aware sample selection using the SIMBA-UQ framework
(Bhattacharjya et al., 2024). Generates multiple samples across a range of
Suggested change:
- (Bhattacharjya et al., 2024). Generates multiple samples across a range of
+ (Bhattacharjya et al., 2025). Generates multiple samples across a range of
The paper is from 2025?
return scores[self.rouge_type].fmeasure

if self.similarity_metric == "sbert":
    from sklearn.metrics.pairwise import cosine_similarity
This isn't (currently) a mandatory dependency, nor is it in any dependency group (unless there's a transitive chain). So the import needs a guard, plus a clear message and/or fallback behaviour or a raised error.
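For illustration, a guard along these lines could turn a bare ImportError into an actionable message. This is only a sketch: the helper name `require_optional`, its parameters, and the wording of the error are hypothetical and not part of mellea.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or raise an ImportError that says how to fix it.

    Hypothetical helper for illustration only; not part of mellea.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with context so the user knows which feature needs which package.
        raise ImportError(
            f"{feature} requires the optional package '{install_hint}'. "
            f"Install it with `pip install {install_hint}`."
        ) from exc
```

Inside the sbert branch this might be used as `pairwise = require_optional("sklearn.metrics.pairwise", "sbert similarity", "scikit-learn")`, followed by `pairwise.cosine_similarity(...)`. Whether the project prefers a helper like this or an inline try/except is a design choice for the maintainers.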
return float(np.exp(np.mean(log_sims)))

if self.aggregation == "harmonic_mean":
    from scipy import stats as scipy_stats
See the comment on the import above (guard).
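For the harmonic mean specifically, a fallback is also possible because the standard library provides one. The sketch below assumes the branch only needs a plain harmonic mean of positive similarity scores (the diff context does not show the actual call); the function name `harmonic_mean` is hypothetical.

```python
import statistics
from collections.abc import Sequence


def harmonic_mean(values: Sequence[float]) -> float:
    """Harmonic mean that prefers SciPy but falls back to the standard library.

    Hypothetical helper; assumes all values are positive.
    """
    try:
        from scipy import stats as scipy_stats  # optional dependency
    except ImportError:
        # Fallback: statistics.harmonic_mean ships with Python itself.
        return float(statistics.harmonic_mean(values))
    return float(scipy_stats.hmean(values))
```

With a fallback like this, the aggregation would keep working when SciPy is absent, at the cost of two code paths that both need test coverage. Raising a clear error instead is the simpler alternative.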
Returns:
    Trained ``RandomForestClassifier``.
"""
from sklearn.ensemble import RandomForestClassifier
See the comment on the import above (guard).
@@ -0,0 +1,365 @@
"""Tests for SIMBAUQSamplingStrategy."""

import numpy as np
numpy is a transitive dependency of rouge_score, which is currently mandatory, but it's probably a good idea to add it as an explicit dependency.
Note that it also appears in the vllm group, but another PR is removing that.
probs = self._classifier.predict_proba(x_test)  # type: ignore[union-attr]
return probs[:, 1]

def _compute_confidences(self, samples: list[str]) -> np.ndarray:
Nit: _compute_confidences() looks like it duplicates the aggregation loop already inlined in sample() (lines ~233-240) but isn't actually called from there. Would it make sense to have sample() call _compute_confidences() instead of inlining the logic? That way the tests exercise the same code path that runs in production.
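The shape of that refactor could look roughly like the sketch below. Everything in it is a hypothetical stand-in: the class `ConfidenceSelector`, the `select` method (standing in for `sample()`), and the token-overlap similarity are illustrative only and do not reproduce the PR's metrics or signatures. The point is just that the public entry point delegates to `_compute_confidences()` rather than repeating its loop.

```python
class ConfidenceSelector:
    """Hypothetical stand-in for the strategy; names are illustrative only."""

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        # Token-level Jaccard overlap as a stand-in similarity metric.
        tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
        union = tokens_a | tokens_b
        return len(tokens_a & tokens_b) / len(union) if union else 1.0

    def _compute_confidences(self, samples: list[str]) -> list[float]:
        # Confidence of a sample = mean similarity to every other sample.
        confidences = []
        for i, sample in enumerate(samples):
            others = [
                self._similarity(sample, other)
                for j, other in enumerate(samples)
                if j != i
            ]
            confidences.append(sum(others) / len(others) if others else 1.0)
        return confidences

    def select(self, samples: list[str]) -> tuple[str, float]:
        # The entry point delegates to the helper instead of inlining the loop,
        # so unit tests of the helper cover the code that actually runs.
        if not samples:
            raise ValueError("need at least one sample")
        confidences = self._compute_confidences(samples)
        best = max(range(len(samples)), key=confidences.__getitem__)
        return samples[best], confidences[best]
```

With this structure, a test that calls `_compute_confidences()` directly and a test that goes through the entry point both exercise the same aggregation code.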
Sampling Strategy PR
Use this template when adding or modifying sampling strategies in mellea/stdlib/sampling/.
Description
Implementation Checklist
Base Class
BaseSamplingStrategy if your changes are mostly modifying the repair and/or select_from_failure functions
SamplingStrategy if your changes involve a new sample method
Return Value
SamplingResult. Specifically, this means:
ModelOutputThunks in sample_generations are properly typed from the Component and the parsed_repr is the expected type.
Integration
mellea/stdlib/sampling/__init__.py
Testing
tests/sampling/