MoonshotAI/K2-Vendor-Verifier

K2 Vendor Verifier

What's K2VV

Since the release of the Kimi K2 model, we have received a great deal of feedback on the accuracy of Kimi K2's tool calls. Given that K2 focuses on the agentic loop, reliable tool calling is of utmost importance.

We have observed significant differences in tool-call performance across open-source deployments and vendors. When selecting a provider, users often prioritize lower latency and cost, but may inadvertently overlook subtler yet critical differences in model accuracy.

These inconsistencies not only affect user experience but also impact K2's results on various benchmarks. To mitigate these problems, we launched K2 Vendor Verifier (K2VV) to monitor and improve the quality of all K2 APIs.

We hope K2VV helps ensure that everyone can access a consistent and high-performing Kimi K2 model.

K2-thinking Evaluation Results

Test Time: 2025-11-15

  • temperature=1.0
  • max_tokens=64000

| Model Name | Provider | API Source | ToolCall-Trigger Similarity (tool_call_f1) | count_finish_reason_tool_calls | count_successful_tool_call | schema_accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| kimi-k2-thinking | MoonshotAI | https://platform.moonshot.ai | - | 1958 | 1958 | 100.00% |
| | Moonshot AI Turbo | https://platform.moonshot.ai | >=73% | 1984 | 1984 | 100.00% |
| | Fireworks | https://fireworks.ai | >=73% | 1703 | 1703 | 100.00% |
| | InfiniAI | https://cloud.infini-ai.com | >=73% | 1827 | 1825 | 99.89% |
| | SiliconFlow | https://siliconflow.cn | >=73% | 2119 | 2097 | 98.96% |
| | GMICloud | https://openrouter.ai | >=73% | 1850 | 1775 | 95.95% |
| | AtlasCloud | https://openrouter.ai | >=73% | 1878 | 1798 | 95.74% |
| | SGLang | https://github.com/sgl-project/sglang | >=73% | 1874 | 1790 | 95.52% |
| | vLLM | https://github.com/vllm-project/vllm | >=73% | 2128 | 1856 | 87.22% |
| | Parasail | https://openrouter.ai | >=73% | 2108 | 1837 | 87.14% |
| | DeepInfra | https://openrouter.ai | >=73% | 2071 | 1800 | 86.91% |
| | GoogleVertex | https://openrouter.ai | >=73% | 1945 | 1668 | 85.76% |
| | Together | https://openrouter.ai | >=73% | 1893 | 1602 | 84.63% |
| | NovitaAI | https://openrouter.ai | 72.22% | 1778 | 1715 | 96.46% |
| | Chutes | https://openrouter.ai | 68.10% | 3657 | 3037 | 83.05% |
| | Fireworks | https://openrouter.ai | 67.38% | 1494 | 1494 | 100.00% |

We ran the official API multiple times to measure the fluctuation of tool_call_f1. The lowest score was 75.81%, and the average was 76%. Given the inherent randomness of the model, we consider a tool_call_f1 score above 73% acceptable and usable as a reference threshold.

K2 0905 Evaluation Results

Test Time: 2025-11-15

  • temperature=0.6

| Model Name | Provider | API Source | ToolCall-Trigger Similarity (tool_call_f1) | count_finish_reason_tool_calls | count_successful_tool_call | schema_accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| kimi-k2-0905-preview | MoonshotAI | https://platform.moonshot.ai | - | 1274 | 1274 | 100.00% |
| | Moonshot AI Turbo | https://platform.moonshot.ai | >=80% | 1398 | 1398 | 100.00% |
| | DeepInfra | https://openrouter.ai | >=80% | 1365 | 1365 | 100.00% |
| | Fireworks | https://openrouter.ai | >=80% | 1453 | 1453 | 100.00% |
| | Infinigence | https://cloud.infini-ai.com | >=80% | 1257 | 1257 | 100.00% |
| | NovitaAI | https://openrouter.ai | >=80% | 1299 | 1299 | 100.00% |
| | SiliconFlow | https://siliconflow.cn | >=80% | 1305 | 1302 | 99.77% |
| | Chutes | https://openrouter.ai | >=80% | 1271 | 1229 | 96.70% |
| | vLLM | https://github.com/vllm-project/vllm | >=80% | 1325 | 1007 | 76.00% |
| | SGLang | https://github.com/sgl-project/sglang | >=80% | 1269 | 928 | 73.13% |
| | Volc | https://www.volcengine.com | >=80% | 1330 | 969 | 72.86% |
| | Baseten | https://openrouter.ai | >=80% | 1243 | 901 | 72.49% |
| | AtlasCloud | https://openrouter.ai | >=80% | 1277 | 925 | 72.44% |
| | Together | https://openrouter.ai | >=80% | 1266 | 911 | 71.96% |
| | Groq | https://groq.com | 69.52% | 1042 | 1042 | 100.00% |
| | Nebius | https://nebius.ai | 50.60% | 644 | 544 | 84.47% |

We ran the official API multiple times to measure the fluctuation of tool_call_f1. The lowest score was 82.71%, and the average was 84%. Given the inherent randomness of the model, we consider a tool_call_f1 score above 80% acceptable and usable as a reference threshold.

Evaluation Metrics

ToolCall-Trigger Similarity

We use tool_call_f1 to determine whether a deployment triggers tool calls consistently with the official API.

| Label / Metric | Formula | Meaning |
| --- | --- | --- |
| TP (True Positive) | | Both model & official have finish_reason == "tool_calls". |
| FP (False Positive) | | Model finish_reason == "tool_calls" while official is "stop" or "others". |
| FN (False Negative) | | Model finish_reason == "stop" or "others" while official is "tool_calls". |
| TN (True Negative) | | Both model & official have finish_reason == "stop" or "others". |
| tool_call_precision | TP / (TP + FP) | Proportion of triggered tool calls that should have been triggered. |
| tool_call_recall | TP / (TP + FN) | Proportion of tool calls that should have been triggered and were. |
| tool_call_f1 | 2 * tool_call_precision * tool_call_recall / (tool_call_precision + tool_call_recall) | Harmonic mean of precision and recall (primary metric for the deployment check). |
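As a concrete illustration of the definitions above, here is a minimal sketch that tallies (model, official) finish_reason pairs and computes the three metrics. It assumes OpenAI-style finish_reason strings; it is not the repository's actual evaluation code.

```python
def tool_call_metrics(pairs):
    """Compute tool_call_precision, tool_call_recall, and tool_call_f1
    from (model_finish_reason, official_finish_reason) pairs."""
    tp = sum(1 for m, o in pairs if m == "tool_calls" and o == "tool_calls")
    fp = sum(1 for m, o in pairs if m == "tool_calls" and o != "tool_calls")
    fn = sum(1 for m, o in pairs if m != "tool_calls" and o == "tool_calls")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy run: one TP, one FP, one FN, one TN.
pairs = [("tool_calls", "tool_calls"), ("tool_calls", "stop"),
         ("stop", "tool_calls"), ("stop", "stop")]
precision, recall, f1 = tool_call_metrics(pairs)
```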

ToolCall-Schema Accuracy

We use schema_accuracy to measure engineering robustness: whether the tool calls a deployment emits are valid against the declared schema.

| Label / Metric | Formula / Condition | Description |
| --- | --- | --- |
| count_finish_reason_tool_calls | | Number of responses with finish_reason == "tool_calls". |
| count_successful_tool_call | | Number of tool_calls responses that passed schema validation. |
| schema_accuracy | count_successful_tool_call / count_finish_reason_tool_calls | Proportion of triggered tool calls whose JSON payload satisfies the schema. |
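A simplified sketch of the schema check: the harness itself may enforce full JSON Schema validation (types, nesting, etc.); this illustration only verifies that the arguments parse as a JSON object and contain every required property. The example schema and tool calls are hypothetical.

```python
import json

def passes_schema(tool_call, schema):
    """Return True if the tool call's arguments are valid JSON and
    contain every property the schema marks as required.
    (A real validator would also check types and nested structure.)"""
    try:
        args = json.loads(tool_call["function"]["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    if not isinstance(args, dict):
        return False
    return all(key in args for key in schema.get("required", []))

def schema_accuracy(tool_calls, schema):
    """count_successful_tool_call / count_finish_reason_tool_calls."""
    if not tool_calls:
        return 0.0
    return sum(passes_schema(tc, schema) for tc in tool_calls) / len(tool_calls)

schema = {"type": "object", "required": ["query"]}
calls = [
    {"function": {"name": "search", "arguments": '{"query": "k2"}'}},  # valid
    {"function": {"name": "search", "arguments": '{"q": "k2"}'}},      # missing field
]
acc = schema_accuracy(calls, schema)
```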

How we do the test

We test tool call responses over a set of 4,000 requests. Each provider's responses are collected and compared against those of the official Moonshot AI API.

K2 vendors are periodically evaluated. If you are not on the list and would like to be included, feel free to contact us.

Sample Data: Detailed samples and MoonshotAI results are available in tool-calls-dataset (50% of the test set).

Suggestions to Vendors

  1. Use the Correct Versions
    Some vendors may not meet the requirements because they run incorrect versions. We recommend using the following versions or newer:
  2. Rename Tool Call IDs
    The Kimi-K2 model expects all tool call IDs in historical messages to follow the format functions.func_name:idx. However, previous test cases may contain malformed tool IDs like search:0, which can mislead Kimi-K2 into generating incorrect tool call IDs, resulting in parsing failures.
    In this version, we manually add the functions. prefix to all previous tool calls to make Kimi-K2 happy :). We recommend that users and vendors adopt this fix in practice as well.
    This type of tool ID was generated by our official API. Before invoking the K2 model, the official API automatically renames all tool call IDs to the format functions.func_name:idx, so this is not an issue for us.

  3. Add Guided Encoding
    Large language models generate text token by token according to probability; they have no built-in mechanism to enforce a hard JSON schema. Even with careful prompting, the model may omit fields, add extra ones, or nest them incorrectly. Please add guided encoding (constrained decoding) to guarantee the correct schema.
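The ID-renaming fix in suggestion 2 can be sketched as follows. This is a minimal illustration assuming OpenAI-style chat messages; `rename_tool_call_ids` is a hypothetical helper, not part of the evaluation tool, and the exact index scheme the official API uses may differ.

```python
def rename_tool_call_ids(messages):
    """Rewrite tool call IDs in a message history to the
    functions.{func_name}:{idx} format Kimi-K2 expects, and keep the
    matching tool-result messages consistent."""
    id_map = {}
    idx = 0
    for msg in messages:
        # Rewrite IDs on assistant messages that carry tool calls.
        for tc in msg.get("tool_calls") or []:
            new_id = f"functions.{tc['function']['name']}:{idx}"
            id_map[tc["id"]] = new_id
            tc["id"] = new_id
            idx += 1
        # Keep tool-result messages pointing at the renamed IDs.
        if msg.get("role") == "tool" and msg.get("tool_call_id") in id_map:
            msg["tool_call_id"] = id_map[msg["tool_call_id"]]
    return messages

history = [
    {"role": "assistant", "tool_calls": [
        {"id": "search:0", "type": "function",
         "function": {"name": "search", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "search:0", "content": "..."},
]
rename_tool_call_ids(history)
```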

Verify by yourself

To run the evaluation tool with sample data, use the following command:

python tool_calls_eval.py samples.jsonl \
    --model kimi-k2-0905-preview \
    --base-url https://api.moonshot.cn/v1 \
    --api-key YOUR_API_KEY \
    --concurrency 5 \
    --output results.jsonl \
    --summary summary.json
  • samples.jsonl: Path to the test set file in JSONL format
  • --model: Model name (e.g., kimi-k2-0905-preview)
  • --base-url: API endpoint URL
  • --api-key: API key for authentication (or set OPENAI_API_KEY environment variable)
  • --concurrency: Maximum number of concurrent requests (default: 5)
  • --output: Path to save detailed results (default: results.jsonl)
  • --summary: Path to save aggregated summary (default: summary.json)
  • --timeout: Per-request timeout in seconds (default: 600)
  • --retries: Number of retries on failure (default: 3)
  • --extra-body: Extra JSON body as string to merge into each request payload (e.g., '{"temperature":0.6}')
  • --incremental: Incremental mode to only rerun failed requests

For testing other providers via OpenRouter:

python tool_calls_eval.py samples.jsonl \
    --model moonshotai/kimi-k2-0905 \
    --base-url https://openrouter.ai/api/v1 \
    --api-key YOUR_OPENROUTER_API_KEY \
    --concurrency 5 \
    --extra-body '{"provider": {"only": ["YOUR_DESIGNATED_PROVIDER"]}}'

Contact Us

We're preparing the next benchmark round and need your input.

If there's any metric or test case you care about, please drop a note in an issue.

You're also welcome to suggest any vendor you'd like to see covered, likewise via an issue.


If you have any questions or concerns, please reach out to us at [email protected].

About

Verify the precision of all Kimi K2 API vendors.
