[WIP] MAEB task selection #3867
Conversation
Implements a new task selection approach using correlation analysis and clustering for MAEB evaluation.
- Add domain, category, and language checks to is_candidate_valid_removal to preserve at least one task from each unique domain, category, and language
- Add top 5 longest tasks display for CLAP model reference timing
- Add diagnostic cell for tasks with many negative correlations
- Expand correlation thresholds to include 0.8 and 0.9
- Add Languages, Domains, Categories columns to summary table
- Comment out license filtering to include all tasks
- Handle empty model coverage gracefully with fallback logic
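A minimal sketch of what such a removal guard could look like (the function arguments and the `task_attrs` structure are assumptions for illustration, not the notebook's actual code; each attribute is assumed to be stored as a list of strings, with the category as a one-element list):

```python
def is_candidate_valid_removal(candidate, kept_tasks, task_attrs):
    """Reject a removal that would leave any domain, category, or language
    with no remaining task in the kept set."""
    remaining = [t for t in kept_tasks if t != candidate]
    for key in ("domains", "category", "languages"):
        candidate_values = set(task_attrs[candidate][key])
        covered = set()
        for task in remaining:
            covered.update(task_attrs[task][key])
        if not candidate_values <= covered:
            return False  # some domain/category/language would lose its last task
    return True
```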
…ased tasks_to_keep
- Move UMAP+HDBSCAN clustering right after initial correlation matrix
- Define tasks_to_keep from outlier cluster (label -1) instead of empty list
- Split function definitions to break circular dependency
- Add domain counts cell after results DataFrame
- Add model coverage distribution analysis (models at each task count)
- Use models with >= 50 tasks for runtime estimation
- Show task coverage in runtime output (N/M tasks with eval times)
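A rough sketch of that clustering step, assuming `corr` is the task-by-task correlation DataFrame; the UMAP and HDBSCAN parameters here are illustrative, not the notebook's actual settings:

```python
import hdbscan
import umap

# Embed each task's correlation profile in 2D, then cluster the embeddings.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(corr.fillna(0).values)
labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(embedding)

# HDBSCAN labels outliers as -1: tasks that do not fall into any dense cluster of
# mutually correlated tasks, so they are protected from removal.
tasks_to_keep = [task for task, label in zip(corr.index, labels) if label == -1]
```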
- Add get_pairs_above_threshold helper to get all correlated pairs
- Track skipped_pairs where neither task can be removed
- Continue to next pair when current pair is protected
- Clear skipped_pairs when task set changes after removal
- Only stop when all pairs above threshold have been tried
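A hedged sketch of the removal loop those bullets describe; the helper names mirror the commit message, but the bodies are illustrative. `can_remove(task, kept)` stands in for the protection rules (e.g. is_candidate_valid_removal):

```python
def get_pairs_above_threshold(corr, threshold):
    """All (task_a, task_b) pairs whose correlation exceeds the threshold."""
    pairs = []
    for i, a in enumerate(corr.index):
        for b in corr.index[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b))
    return pairs


def reduce_tasks(corr, threshold, kept, can_remove):
    """Greedily drop one task from each highly correlated pair until every
    remaining pair above the threshold has been tried. `kept` is a set."""
    skipped_pairs = set()
    while True:
        sub = corr.loc[list(kept), list(kept)]
        pairs = [p for p in get_pairs_above_threshold(sub, threshold) if p not in skipped_pairs]
        if not pairs:
            break  # all remaining pairs are protected or below the threshold
        a, b = pairs[0]
        if can_remove(b, kept):
            kept.remove(b)
            skipped_pairs.clear()  # task set changed, so retry previously skipped pairs
        elif can_remove(a, kept):
            kept.remove(a)
            skipped_pairs.clear()
        else:
            skipped_pairs.add((a, b))  # neither side can go; continue with the next pair
    return kept
```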
Visualizes results_df with:
- Blue gradient colormap (light to dark)
- White background for NaN values
- Adaptive text color (white for high scores, black for low)
- Dynamic figure sizing based on data dimensions
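A matplotlib sketch of that styling (illustrative only; `results_df` is assumed to be a numeric models-by-tasks score DataFrame):

```python
import matplotlib.pyplot as plt
import numpy as np

n_rows, n_cols = results_df.shape
# Figure size scales with the number of rows and columns.
fig, ax = plt.subplots(figsize=(0.6 * n_cols + 2, 0.4 * n_rows + 2))

cmap = plt.get_cmap("Blues").copy()
cmap.set_bad(color="white")  # NaN cells render as white

im = ax.imshow(np.ma.masked_invalid(results_df.values), cmap=cmap, aspect="auto")

for i in range(n_rows):
    for j in range(n_cols):
        value = results_df.iat[i, j]
        if np.isnan(value):
            continue
        # Dark (high-score) cells get white text, light cells get black text.
        color = "white" if value > np.nanmean(results_df.values) else "black"
        ax.text(j, i, f"{value:.2f}", ha="center", va="center", color=color, fontsize=6)

ax.set_xticks(range(n_cols), results_df.columns, rotation=90)
ax.set_yticks(range(n_rows), results_df.index)
fig.colorbar(im, ax=ax)
fig.tight_layout()
```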
- Add MAEB(audio-text) benchmark with 17 cross-modal retrieval tasks (8 audio-to-text, 9 text-to-audio) selected via correlation threshold 0.95
- Inline task lists directly in MAEB benchmark objects
- Add threshold 0.95 to task selection notebook
- Convert comparison plot from 1x5 to 2x3 layout for 6 thresholds
- Fix tasks_to_select_from to use modality-filtered tasks
- Use models with complete eval times for runtime estimation
- Expand MAEB(audio-text) benchmark from 17 to 29 tasks (14 A2T + 15 T2A)
- Fix msclap model revision from "N/A" to "no_revision" to match results cache
- Update benchmark contacts
Script generates top 10 model rankings for the MAEB(audio) and MAEB(audio-text) benchmarks using Borda count, with per-category averages.
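A sketch of a Borda-count ranking over per-task scores, assuming `scores` is a models-by-tasks DataFrame of main scores (variable names are illustrative, not the script's actual code):

```python
import pandas as pd


def borda_ranking(scores: pd.DataFrame) -> pd.Series:
    """On each task a model earns (n_models - rank) points; points are summed
    across tasks and models are sorted by their totals."""
    n_models = len(scores.index)
    # Rank 1 = best score on the task; missing scores earn no points.
    ranks = scores.rank(axis=0, ascending=False, method="average")
    return (n_models - ranks).sum(axis=1).sort_values(ascending=False)


# top_10 = borda_ranking(scores).head(10)
```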
scripts/task_selection/task_selection_maeb_corr_and_cluster_mieb_method.py
Created an overview table for tasks and where they're used. There is also a version for Google Sheets: https://docs.google.com/spreadsheets/d/1wyTvW0q6TIat7RMmfimlNKXri9O7cs_S0uebGTNya0c/edit?usp=sharing

Script:

```python
import mteb
import pandas as pd

tasks = mteb.get_tasks(modalities=["audio"])
audio_tasks_names = [t.metadata.name for t in mteb.get_benchmark("MAEB(audio)")]
audio_text_tasks_names = [t.metadata.name for t in mteb.get_benchmark("MAEB(audio-text)")]

row = []
for task in tasks:
    print(task.metadata.name)
    in_audio = task.metadata.name in audio_tasks_names
    in_audio_text = task.metadata.name in audio_text_tasks_names
    row.append(
        {
            "Task Name": task.metadata.name,
            "Task description": task.metadata.description,
            "Task type": task.metadata.type,
            "Task language(s)": ", ".join(task.metadata.eval_langs)
            if isinstance(task.metadata.eval_langs, list)
            else ", ".join(v[0] for v in task.metadata.eval_langs.values()),
            "In MAEB(audio)": "Yes" if in_audio else "No",
            "In MAEB(audio-text)": "Yes" if in_audio_text else "No",
        }
    )

df = pd.DataFrame(row)
df = df.sort_values(by=["Task Name", "Task type"]).reset_index(drop=True)
df.to_csv("audio_tasks_table.csv", index=False)
df.to_markdown("audio_tasks_table.md")
```
We could probably create an English-only version, but I'm not sure it is relevant, since most of the tasks are English-only.
Where are all the multilingual tasks?
I think we can create
But this might be complicated for users to understand.
Why would it be complicated? Seems clear to me.
Hmm, I would maybe do:
However, I would probably argue we could just make two columns that are
PS: We have to fix the language annotations; birdset, for example, is not English.
How should we name it? Just
For the leaderboard, I agree, but for users I'm not sure, because this can create problems at inference.
Ah, I get it now: only maintain MAEB. Do we bother filtering out similar tasks, or use the entire collection?
MAEB is the full Massive Audio Embedding Benchmark (v1), containing all tasks with audio modality across 7 task types: classification (35), clustering (10), pair classification (5), reranking (6), zero-shot classification (5), audio-to-text retrieval (18), and text-to-audio retrieval (17).
I'm a bit afraid that if we use only one benchmark, users who want to evaluate on only part of it (e.g. audio only) would need to filter tasks themselves.
What if we have an English list, an audio list, and a "the rest of the collection" list, and MAEB is English + audio + "the rest"? We could still have MAEB(eng)v1, MAEB(audio)v1, and MAEBv1?
Rename UrbanSound8kZeroshotClassification to UrbanSound8kClassification in the audio_classification module to avoid a collision with the identically named class in the audio_zeroshot_classification module. Both classes had the same Python name but different task names:
- audio_classification: task name "UrbanSound8k"
- audio_zeroshot_classification: task name "UrbanSound8kZeroshot"

The * imports caused the zeroshot version to overwrite the classification version, leaving only "UrbanSound8kZeroshot" registered in the task registry and breaking MAEB benchmarks that reference "UrbanSound8k".
The dill/datasets library had a pickle incompatibility with Python 3.14. Datasets v4+ resolves this issue.
The v0.02 task class was defined but not exported in __init__.py, causing a KeyError when referenced in benchmarks.
Renamed classes to match their metadata names so they can be found in the task registry:
- JamAltArtist → JamAltArtistA2ARetrieval
- JamAltLyricsT2A → JamAltLyricT2ARetrieval
- JamAltLyricsA2T → JamAltLyricA2TRetrieval

Also added explicit imports and exports for proper registration.
Force-pushed from 2631fc8 to 411a4ce.
This reverts commit b244226.
Added
New utility script that calculates total evaluation times for specified benchmarks and models. Features:
- Takes --benchmarks and --models as required arguments
- Optional --results-dir for custom cache location
- Outputs a formatted table with task coverage and times per benchmark
- Shows totals per model

Usage:

```bash
python scripts/calculate_eval_times.py \
  -b "MAEB(audio-text, lite)" "MAEB(audio-text, extended)" \
  -m "OpenMuQ/MuQ-MuLan-large" "laion/clap-htsat-unfused" \
  -r /path/to/results
```
Computes Spearman and Pearson correlations between MAEB lite and extended benchmark variants to validate that the lite benchmarks preserve model rankings. Outputs correlation values and scatter plots (PNG and PDF).
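A minimal sketch of that check, assuming `lite_scores` and `extended_scores` are Series of per-model mean scores indexed by model name (the variable names are assumptions):

```python
from scipy.stats import pearsonr, spearmanr

common = lite_scores.index.intersection(extended_scores.index)
spearman, _ = spearmanr(lite_scores[common], extended_scores[common])
pearson, _ = pearsonr(lite_scores[common], extended_scores[common])
print(f"Spearman: {spearman:.3f}, Pearson: {pearson:.3f}")
```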
Resolve merge conflicts in audio task imports:
- Update JamAlt and AudioCaps imports in any_2_any_retrieval
- Remove moved files from eng classification imports
@AdnanElAssadi56 @Samoed @KennethEnevoldsen I've updated both this branch AND the paper draft based on the following: MAEB Benchmark Summary
Notes:
The __init__.py was importing UrbanSound8kZeroshotClassification, but the class is actually named UrbanSound8kClassification in the source file.
Great work! Maybe we can create English-only versions?
Thanks! I feel our recurring theme overall has been maintainability, and that drives us to keep the number of benchmarks low. As such, I feel a modality split is the only key factor that warrants separate benchmarks. This way, we can also make a claim that since it's inherently multilingual, we incentivize/nudge the community to develop better multilingual audio embedding models. For English subsets, perhaps we only show a doc example of how to filter tasks?
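Such a doc snippet could look roughly like this (a sketch mirroring the eval_langs handling in the table script above, not merged documentation; the "eng-Latn" code follows the convention used elsewhere in mteb):

```python
import mteb


def task_languages(task):
    # eval_langs is either a flat list of language codes or, for multilingual
    # tasks, a dict mapping subsets to lists of codes.
    langs = task.metadata.eval_langs
    if isinstance(langs, dict):
        return {code for codes in langs.values() for code in codes}
    return set(langs)


english_only = [
    task for task in mteb.get_benchmark("MAEB(audio)")
    if task_languages(task) == {"eng-Latn"}
]
```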
I was thinking this would be the ideal behaviour. We could easily add code for that (we could even add the benchmarks), but without adding multiple views in the leaderboard. I really agree with:
Which is why I would rather have a single benchmark with filters. I think this aligns fairly well with what we have now, though. This is how I would phrase it in the paper: we construct a broad range of tasks and call this collection MAEB+. This is the unreduced, extended set. The actual benchmark is a condensed version of this (MAEB+ never becomes a released benchmark; it is just a collection of tasks used to construct MAEB). What do you guys think? I am unsure if we want to keep audio and audio-text separated, though. Here I am leaning towards combining, but it is only a small preference (I will look more at the paper to figure out what is best).
I agree that default multilingual (and potentially default multimodal, audio-text?) is a good incentive to provide. People will be interested in the English column, but we can provide that. Questions
(will look more in the paper as well)
Overall, I think we can include an English-only (or multilingual) benchmark without
What is the problem with maintainability?
Modality split seems the most practical still: There are just a lot more audio-only embedding models, and fewer audio-text-capable models.
💯 I think an English column and a
A high number of benchmarks lowers maintainability.
Resolve merge conflicts by combining GoogleSVQ from maeb with the renamed JamAlt classes from maeb-task-selection. Also exclude *.tex files from the typos checker.
…ripts
- Updated MAEB Full from 95 to 97 tasks (added GoogleSVQ A2T/T2A retrieval)
- Updated MAEB(audio-text, extended) from 36 to 38 tasks (added GoogleSVQ)
- Fixed task categorization (moved JamAltArtistA2ARetrieval to the correct section)
- Updated benchmark descriptions with accurate counts (6 task types)
- Added scripts for generating language distribution plots and overview tables
- Fixed table generation to properly group multilingual retrieval tasks
| "yamnet", | ||
| "ast-finetuned-audioset-10-10-0.4593", | ||
| "clap-htsat-fused", | ||
| "wav2vec2-xls-r-1b", | ||
| "larger_clap_general", | ||
| "MuQ-MuLan-large", | ||
| "whisper-medium", | ||
| "whisper-large-v3", | ||
| "Qwen2-Audio-7B", | ||
| "wavlm-base-plus-svmsclap-2023", | ||
| "wav2clip", |
Hmm are we sure we want to keep all the references here?
How does it look on a smaller window?
For this list, one would need to spin up the LB locally (pointed to maeb-results) and see which ones we want to label. I randomly picked a few models at the "Pareto front", so to speak. Can change to whatever we want.
Hmm, yeah. Probably good to pick a set of well-known references (so I would probably do it more based on downloads on the Hub). It is not our main concern now, so feel free to spin this up as an issue.
Force-pushed from 72e3b27 to 37c919d.
Complete renaming of the MAEB benchmark to MAEB+ for clearer identification:
- Rename benchmark variable from MAEB to MAEB_PLUS
- Update benchmark name from "MAEB" to "MAEB+"
- Update display name from "MAEB, Full" to "MAEB+"
- Fix imports and exports in __init__.py
- Update benchmark selector UI reference
- Update all script string lookups from "MAEB" to "MAEB+"
- Ensure consistency across all benchmark references

Files modified:
- mteb/benchmarks/benchmarks/benchmarks.py
- mteb/benchmarks/benchmarks/__init__.py
- mteb/leaderboard/benchmark_selector.py
- scripts/generate_maeb_overview_tables.py
- scripts/plot_maeb_language_counts.py
…dio, lite) -> MAEB(audio-only)
- MAEB_AUDIO_TEXT_LITE -> MAEB
- MAEB_AUDIO_LITE -> MAEB_AUDIO
- Update all imports, exports, and script references
- Update benchmark selector and overview table scripts
- MAEB now includes all 35 tasks (18 audio-only + 17 cross-modal)
- Updated benchmark descriptions to remove model result counts
- Table script now outputs a single table with the top 30 models
- Added an Audio-only rank column for cross-benchmark comparison
…ultilingual/zxx submodules. Also add *.bib to the typos exclude list to prevent false positives on bibliography files.
- Apply correlation threshold 0.93 for redundancy removal
- Add retrieval direction preference (T2A over A2T)
- Update MAEB: 27 tasks, MAEB(audio-only): 16 tasks
- Update table generation scripts with new counts
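A sketch of how that retrieval-direction preference could be expressed, assuming the task-name suffixes used in this PR (illustrative, not the selection script's actual code):

```python
def preferred_removal(task_a: str, task_b: str):
    """For an A2T/T2A mirror pair, prefer to drop the A2T task; otherwise no preference."""
    if task_a.endswith("A2TRetrieval") and task_b.endswith("T2ARetrieval"):
        return task_a
    if task_b.endswith("A2TRetrieval") and task_a.endswith("T2ARetrieval"):
        return task_b
    return None
```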
Adds a post-processing step to remove same-family, same-type task duplicates, keeping the task with the lowest average correlation to the other retained tasks.

Changes:
- Add SAME_SOURCE_FAMILIES config and deduplicate_same_source_families()
- Update MAEB: 27 → 25 tasks (remove FSD2019Kaggle, CommonLanguageGenderDetection)
- Update MAEB(audio-only): 16 → 14 tasks (same removals for consistency)
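A hedged sketch of that dedup step; the family mapping below is a placeholder, and `corr` / `kept` stand in for the correlation DataFrame and retained task set from the selection script:

```python
SAME_SOURCE_FAMILIES = {
    # family name -> tasks derived from the same underlying source (placeholder entries)
    "ExampleFamily": ["ExampleTaskA", "ExampleTaskB"],
}


def deduplicate_same_source_families(kept, corr, families=SAME_SOURCE_FAMILIES):
    """Within each family, keep only the task with the lowest average
    correlation to the tasks retained outside the family."""
    kept = set(kept)
    for family_tasks in families.values():
        present = [t for t in family_tasks if t in kept]
        if len(present) <= 1:
            continue
        others = sorted(kept - set(present))
        winner = min(present, key=lambda t: corr.loc[t, others].mean())
        kept -= set(present) - {winner}
    return kept
```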
Merged statistics of the tasks. For a few of them the statistics are still missing, because the datasets are big (#3498).
See the draft benchmarks. (For audio-text I actually use the full collection, no filtering.) You'll also find the filtering notebook and the script to generate "Table 1".
@KennethEnevoldsen @AdnanElAssadi56 maybe another one for environmental or something?