[WIP] MAEB task selection #3867
Conversation
Implements a new task selection approach using correlation analysis and clustering for MAEB evaluation.
- Add domain, category, and language checks to is_candidate_valid_removal to preserve at least one task from each unique domain, category, and language
- Add top 5 longest tasks display for CLAP model reference timing
- Add diagnostic cell for tasks with many negative correlations
- Expand correlation thresholds to include 0.8 and 0.9
- Add Languages, Domains, Categories columns to summary table
- Comment out license filtering to include all tasks
- Handle empty model coverage gracefully with fallback logic
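A minimal sketch of what such a removal guard could look like (the function arguments and the `task_attrs` structure are assumptions for illustration, not the notebook's actual code; each attribute is assumed to be stored as a list of strings, with the category as a one-element list):

```python
def is_candidate_valid_removal(candidate, kept_tasks, task_attrs):
    """Reject a removal that would leave any domain, category, or language
    with no remaining task in the kept set."""
    remaining = [t for t in kept_tasks if t != candidate]
    for key in ("domains", "category", "languages"):
        candidate_values = set(task_attrs[candidate][key])
        covered = set()
        for task in remaining:
            covered.update(task_attrs[task][key])
        if not candidate_values <= covered:
            return False  # some domain/category/language would lose its last task
    return True
```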
…ased tasks_to_keep
- Move UMAP+HDBSCAN clustering right after initial correlation matrix
- Define tasks_to_keep from outlier cluster (label -1) instead of empty list
- Split function definitions to break circular dependency
- Add domain counts cell after results DataFrame
- Add model coverage distribution analysis (models at each task count)
- Use models with >= 50 tasks for runtime estimation
- Show task coverage in runtime output (N/M tasks with eval times)
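A rough sketch of that clustering step, assuming `corr` is the task-by-task correlation DataFrame; the UMAP and HDBSCAN parameters here are illustrative, not the notebook's actual settings:

```python
import hdbscan
import umap

# Embed each task's correlation profile in 2D, then cluster the embeddings.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(corr.fillna(0).values)
labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(embedding)

# HDBSCAN labels outliers as -1: tasks that do not fall into any dense cluster of
# mutually correlated tasks, so they are protected from removal.
tasks_to_keep = [task for task, label in zip(corr.index, labels) if label == -1]
```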
- Add get_pairs_above_threshold helper to get all correlated pairs
- Track skipped_pairs where neither task can be removed
- Continue to next pair when current pair is protected
- Clear skipped_pairs when task set changes after removal
- Only stop when all pairs above threshold have been tried
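A hedged sketch of the removal loop those bullets describe; the helper names mirror the commit message, but the bodies are illustrative. `can_remove(task, kept)` stands in for the protection rules (e.g. is_candidate_valid_removal):

```python
def get_pairs_above_threshold(corr, threshold):
    """All (task_a, task_b) pairs whose correlation exceeds the threshold."""
    pairs = []
    for i, a in enumerate(corr.index):
        for b in corr.index[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b))
    return pairs


def reduce_tasks(corr, threshold, kept, can_remove):
    """Greedily drop one task from each highly correlated pair until every
    remaining pair above the threshold has been tried. `kept` is a set."""
    skipped_pairs = set()
    while True:
        sub = corr.loc[list(kept), list(kept)]
        pairs = [p for p in get_pairs_above_threshold(sub, threshold) if p not in skipped_pairs]
        if not pairs:
            break  # all remaining pairs are protected or below the threshold
        a, b = pairs[0]
        if can_remove(b, kept):
            kept.remove(b)
            skipped_pairs.clear()  # task set changed, so retry previously skipped pairs
        elif can_remove(a, kept):
            kept.remove(a)
            skipped_pairs.clear()
        else:
            skipped_pairs.add((a, b))  # neither side can go; continue with the next pair
    return kept
```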
Visualizes results_df with:
- Blue gradient colormap (light to dark)
- White background for NaN values
- Adaptive text color (white for high scores, black for low)
- Dynamic figure sizing based on data dimensions
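A matplotlib sketch of that styling (illustrative only; `results_df` is assumed to be a numeric models-by-tasks score DataFrame):

```python
import matplotlib.pyplot as plt
import numpy as np

n_rows, n_cols = results_df.shape
# Figure size scales with the number of rows and columns.
fig, ax = plt.subplots(figsize=(0.6 * n_cols + 2, 0.4 * n_rows + 2))

cmap = plt.get_cmap("Blues").copy()
cmap.set_bad(color="white")  # NaN cells render as white

im = ax.imshow(np.ma.masked_invalid(results_df.values), cmap=cmap, aspect="auto")

for i in range(n_rows):
    for j in range(n_cols):
        value = results_df.iat[i, j]
        if np.isnan(value):
            continue
        # Dark (high-score) cells get white text, light cells get black text.
        color = "white" if value > np.nanmean(results_df.values) else "black"
        ax.text(j, i, f"{value:.2f}", ha="center", va="center", color=color, fontsize=6)

ax.set_xticks(range(n_cols), results_df.columns, rotation=90)
ax.set_yticks(range(n_rows), results_df.index)
fig.colorbar(im, ax=ax)
fig.tight_layout()
```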
- Add MAEB(audio-text) benchmark with 17 cross-modal retrieval tasks (8 audio-to-text, 9 text-to-audio) selected via correlation threshold 0.95
- Inline task lists directly in MAEB benchmark objects
- Add threshold 0.95 to task selection notebook
- Convert comparison plot from 1x5 to 2x3 layout for 6 thresholds
- Fix tasks_to_select_from to use modality-filtered tasks
- Use models with complete eval times for runtime estimation
- Expand MAEB(audio-text) benchmark from 17 to 29 tasks (14 A2T + 15 T2A)
- Fix msclap model revision from "N/A" to "no_revision" to match results cache
- Update benchmark contacts
Script generates top 10 model rankings for the MAEB(audio) and MAEB(audio-text) benchmarks using Borda count, with per-category averages.
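A sketch of a Borda-count ranking over per-task scores, assuming `scores` is a models-by-tasks DataFrame of main scores (variable names are illustrative, not the script's actual code):

```python
import pandas as pd


def borda_ranking(scores: pd.DataFrame) -> pd.Series:
    """On each task a model earns (n_models - rank) points; points are summed
    across tasks and models are sorted by their totals."""
    n_models = len(scores.index)
    # Rank 1 = best score on the task; missing scores earn no points.
    ranks = scores.rank(axis=0, ascending=False, method="average")
    return (n_models - ranks).sum(axis=1).sort_values(ascending=False)


# top_10 = borda_ranking(scores).head(10)
```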
scripts/task_selection/task_selection_maeb_corr_and_cluster_mieb_method.py
Created an overview table for tasks and where they're used. There is also a version for Google Sheets: https://docs.google.com/spreadsheets/d/1wyTvW0q6TIat7RMmfimlNKXri9O7cs_S0uebGTNya0c/edit?usp=sharing

Script:

```python
import mteb
import pandas as pd

tasks = mteb.get_tasks(modalities=["audio"])
audio_tasks_names = [t.metadata.name for t in mteb.get_benchmark("MAEB(audio)")]
audio_text_tasks_names = [t.metadata.name for t in mteb.get_benchmark("MAEB(audio-text)")]

row = []
for task in tasks:
    print(task.metadata.name)
    in_audio = task.metadata.name in audio_tasks_names
    in_audio_text = task.metadata.name in audio_text_tasks_names
    row.append(
        {
            "Task Name": task.metadata.name,
            "Task description": task.metadata.description,
            "Task type": task.metadata.type,
            "Task language(s)": ", ".join(task.metadata.eval_langs)
            if isinstance(task.metadata.eval_langs, list)
            else ", ".join(v[0] for v in task.metadata.eval_langs.values()),
            "In MAEB(audio)": "Yes" if in_audio else "No",
            "In MAEB(audio-text)": "Yes" if in_audio_text else "No",
        }
    )

df = pd.DataFrame(row)
df = df.sort_values(by=["Task Name", "Task type"]).reset_index(drop=True)
df.to_csv("audio_tasks_table.csv", index=False)
df.to_markdown("audio_tasks_table.md")
```
We could probably create an English-only version, but I'm not sure it is relevant, since most of the tasks are English-only.
Where are all the multilingual tasks?
I think we can create
But this might be complicated for users to understand.
Why would it be complicated? Seems clear to me.
Hmm, I would maybe do:
However, I would probably argue we could just make two columns that are
PS: We have to fix the language annotations; birdset, for example, is not English.
How should we name it? Just
For the leaderboard, I agree, but for users I'm not sure, because this can create problems at inference.
Ah, I get it now: only maintain MAEB. Do we bother filtering out similar tasks, or use the entire collection?
MAEB is the full Massive Audio Embedding Benchmark (v1), containing all tasks with audio modality across 7 task types: classification (35), clustering (10), pair classification (5), reranking (6), zero-shot classification (5), audio-to-text retrieval (18), and text-to-audio retrieval (17).
I'm a bit afraid that if we use only one benchmark, users who want to evaluate on only part of it (e.g. audio only) would need to filter tasks themselves.
What if we have an English list, an audio list, and a "the rest of the collection" list, and MAEB is English + audio + "the rest"? We could still have MAEB(eng)v1, MAEB(audio)v1, and MAEBv1?
Rename UrbanSound8kZeroshotClassification to UrbanSound8kClassification in the audio_classification module to avoid a collision with the identically named class in the audio_zeroshot_classification module. Both classes had the same Python name but different task names:
- audio_classification: task name "UrbanSound8k"
- audio_zeroshot_classification: task name "UrbanSound8kZeroshot"

The * imports caused the zeroshot version to overwrite the classification version, leaving only "UrbanSound8kZeroshot" registered in the task registry and breaking MAEB benchmarks that reference "UrbanSound8k".
The dill/datasets library had a pickle incompatibility with Python 3.14. Datasets v4+ resolves this issue.
The v0.02 task class was defined but not exported in __init__.py, causing a KeyError when referenced in benchmarks.
Renamed classes to match their metadata names so they can be found in the task registry:
- JamAltArtist → JamAltArtistA2ARetrieval
- JamAltLyricsT2A → JamAltLyricT2ARetrieval
- JamAltLyricsA2T → JamAltLyricA2TRetrieval

Also added explicit imports and exports for proper registration.
Force-pushed from 2631fc8 to 411a4ce.
This reverts commit b244226.
Added
New utility script that calculates total evaluation times for specified benchmarks and models. Features:
- Takes --benchmarks and --models as required arguments
- Optional --results-dir for custom cache location
- Outputs a formatted table with task coverage and times per benchmark
- Shows totals per model

Usage:

```bash
python scripts/calculate_eval_times.py \
  -b "MAEB(audio-text, lite)" "MAEB(audio-text, extended)" \
  -m "OpenMuQ/MuQ-MuLan-large" "laion/clap-htsat-unfused" \
  -r /path/to/results
```
Computes Spearman and Pearson correlations between MAEB lite and extended benchmark variants to validate that the lite benchmarks preserve model rankings. Outputs correlation values and scatter plots (PNG and PDF).
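A minimal sketch of that check, assuming `lite_scores` and `extended_scores` are Series of per-model mean scores indexed by model name (the variable names are assumptions):

```python
from scipy.stats import pearsonr, spearmanr

common = lite_scores.index.intersection(extended_scores.index)
spearman, _ = spearmanr(lite_scores[common], extended_scores[common])
pearson, _ = pearsonr(lite_scores[common], extended_scores[common])
print(f"Spearman: {spearman:.3f}, Pearson: {pearson:.3f}")
```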
Resolve merge conflicts in audio task imports:
- Update JamAlt and AudioCaps imports in any_2_any_retrieval
- Remove moved files from eng classification imports
@AdnanElAssadi56 @Samoed @KennethEnevoldsen I've updated both this branch AND the paper draft based on the following: MAEB Benchmark Summary
Notes:
The __init__.py was importing UrbanSound8kZeroshotClassification, but the class is actually named UrbanSound8kClassification in the source file.
Great work! Maybe we can create English-only versions?
Thanks! I feel our recurring theme overall has been maintainability, and that drives us to keep the number of benchmarks low. As such, I feel a modality split is the only key factor that warrants separate benchmarks. This way, we can also make a claim that since it's inherently multilingual, we incentivize/nudge the community to develop better multilingual audio embedding models. For English subsets, perhaps we only show a doc example of how to filter tasks?
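Such a doc snippet could look roughly like this (a sketch mirroring the eval_langs handling in the table script above, not merged documentation; the "eng-Latn" code follows the convention used elsewhere in mteb):

```python
import mteb


def task_languages(task):
    # eval_langs is either a flat list of language codes or, for multilingual
    # tasks, a dict mapping subsets to lists of codes.
    langs = task.metadata.eval_langs
    if isinstance(langs, dict):
        return {code for codes in langs.values() for code in codes}
    return set(langs)


english_only = [
    task for task in mteb.get_benchmark("MAEB(audio)")
    if task_languages(task) == {"eng-Latn"}
]
```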
I was thinking this would be the ideal behaviour. We could easily add code for that (we could even add the benchmarks), but without adding multiple views in the leaderboard. I really agree with:
Which is why I would rather have a single benchmark with filters. I think this aligns fairly well with what we have now, though. This is how I would phrase it in the paper: we construct a broad range of tasks and call this collection MAEB+. This is the unreduced, extended set. The actual benchmark is a condensed version of this (MAEB+ never becomes a released benchmark; it is just a collection of tasks used to construct MAEB). What do you guys think? I am unsure if we want to keep audio and audio-text separated, though. Here I am leaning towards combining, but it is only a small preference (I will look more at the paper to figure out what is best).
I agree that default multilingual (and potentially default multimodal, audio-text?) is a good incentive to provide. People will be interested in the English column, but we can provide that. Questions
(will look more in the paper as well)
Overall, I think we can include an English-only (or multilingual) benchmark without
What is the problem with maintainability?
Modality split seems the most practical still: There are just a lot more audio-only embedding models, and fewer audio-text-capable models.
💯 I think an English column and a
A high number of benchmarks lowers maintainability.
Resolve merge conflicts by combining GoogleSVQ from maeb with the renamed JamAlt classes from maeb-task-selection. Also exclude *.tex files from the typos checker.
…ripts
- Updated MAEB Full from 95 to 97 tasks (added GoogleSVQ A2T/T2A retrieval)
- Updated MAEB(audio-text, extended) from 36 to 38 tasks (added GoogleSVQ)
- Fixed task categorization (moved JamAltArtistA2ARetrieval to the correct section)
- Updated benchmark descriptions with accurate counts (6 task types)
- Added scripts for generating language distribution plots and overview tables
- Fixed table generation to properly group multilingual retrieval tasks
| "yamnet", | ||
| "ast-finetuned-audioset-10-10-0.4593", | ||
| "clap-htsat-fused", | ||
| "wav2vec2-xls-r-1b", | ||
| "larger_clap_general", | ||
| "MuQ-MuLan-large", | ||
| "whisper-medium", | ||
| "whisper-large-v3", | ||
| "Qwen2-Audio-7B", | ||
| "wavlm-base-plus-svmsclap-2023", | ||
| "wav2clip", |
Hmm are we sure we want to keep all the references here?
How does it look on a smaller window?
For this list, one would need to spin up the LB locally (pointed to maeb-results) and see which ones we want to label. I randomly picked a few models at the "Pareto front", so to speak. Can change to whatever we want.
Hmm, yeah. Probably good to pick a set of well-known references (so I would probably do it more based on downloads on the Hub). It is not our main concern now, so feel free to spin this up as an issue.
Force-pushed from 72e3b27 to 37c919d.
Complete renaming of the MAEB benchmark to MAEB+ for clearer identification:
- Rename benchmark variable from MAEB to MAEB_PLUS
- Update benchmark name from "MAEB" to "MAEB+"
- Update display name from "MAEB, Full" to "MAEB+"
- Fix imports and exports in __init__.py
- Update benchmark selector UI reference
- Update all script string lookups from "MAEB" to "MAEB+"
- Ensure consistency across all benchmark references

Files modified:
- mteb/benchmarks/benchmarks/benchmarks.py
- mteb/benchmarks/benchmarks/__init__.py
- mteb/leaderboard/benchmark_selector.py
- scripts/generate_maeb_overview_tables.py
- scripts/plot_maeb_language_counts.py
…dio, lite) -> MAEB(audio-only)
- MAEB_AUDIO_TEXT_LITE -> MAEB
- MAEB_AUDIO_LITE -> MAEB_AUDIO
- Update all imports, exports, and script references
- Update benchmark selector and overview table scripts
- MAEB now includes all 35 tasks (18 audio-only + 17 cross-modal)
- Updated benchmark descriptions to remove model result counts
- Table script now outputs a single table with the top 30 models
- Added an Audio-only rank column for cross-benchmark comparison
…ultilingual/zxx submodules. Also add *.bib to the typos exclude list to prevent false positives on bibliography files.
- Apply correlation threshold 0.93 for redundancy removal
- Add retrieval direction preference (T2A over A2T)
- Update MAEB: 27 tasks, MAEB(audio-only): 16 tasks
- Update table generation scripts with new counts
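A sketch of how that retrieval-direction preference could be expressed, assuming the task-name suffixes used in this PR (illustrative, not the selection script's actual code):

```python
def preferred_removal(task_a: str, task_b: str):
    """For an A2T/T2A mirror pair, prefer to drop the A2T task; otherwise no preference."""
    if task_a.endswith("A2TRetrieval") and task_b.endswith("T2ARetrieval"):
        return task_a
    if task_b.endswith("A2TRetrieval") and task_a.endswith("T2ARetrieval"):
        return task_b
    return None
```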
Adds a post-processing step to remove same-family, same-type task duplicates, keeping the task with the lowest average correlation to the other retained tasks.

Changes:
- Add SAME_SOURCE_FAMILIES config and deduplicate_same_source_families()
- Update MAEB: 27 → 25 tasks (remove FSD2019Kaggle, CommonLanguageGenderDetection)
- Update MAEB(audio-only): 16 → 14 tasks (same removals for consistency)
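A hedged sketch of that dedup step; the family mapping below is a placeholder, and `corr` / `kept` stand in for the correlation DataFrame and retained task set from the selection script:

```python
SAME_SOURCE_FAMILIES = {
    # family name -> tasks derived from the same underlying source (placeholder entries)
    "ExampleFamily": ["ExampleTaskA", "ExampleTaskB"],
}


def deduplicate_same_source_families(kept, corr, families=SAME_SOURCE_FAMILIES):
    """Within each family, keep only the task with the lowest average
    correlation to the tasks retained outside the family."""
    kept = set(kept)
    for family_tasks in families.values():
        present = [t for t in family_tasks if t in kept]
        if len(present) <= 1:
            continue
        others = sorted(kept - set(present))
        winner = min(present, key=lambda t: corr.loc[t, others].mean())
        kept -= set(present) - {winner}
    return kept
```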
Merged statistics of the tasks. For a few of them the statistics are still missing, because the datasets are big (#3498).
See the draft benchmarks. (For audio-text I actually use the full collection, no filtering.) You'll also find the filtering notebook and the script to generate "Table 1".
@KennethEnevoldsen @AdnanElAssadi56 maybe another one for environmental or something?