Hey everyone, not sure where to put this, but I created a working Cerebras Code GLM-4.7 transformer for the Claude Code CLI that handles rate limiting, tool translation, context limits, etc. Hope it is helpful for others.
Cerebras GLM-4.7 Transformer for Claude Code Router
A custom transformer that enables Claude Code to work with Cerebras's GLM-4.7 model through Claude Code Router (CCR).
Recommended Setup
It is highly recommended to use Continuous Claude Code v3 alongside this transformer. Continuous Claude Code automatically handles context compaction, so you don't need to worry about manually running /compact when you hit context limits.
Features
1. Proactive Rate Limiting
The transformer implements a sliding window rate limiter that tracks usage before hitting Cerebras's limits, preventing 429 errors entirely.
Tracked Limits (Code Max Plan - $200/mo):
- 120 RPM (Requests Per Minute)
- 1.5M TPM (Tokens Per Minute)
- 120M tokens/day (Daily limit)
How it works:
- Estimates tokens for each request before sending
- Calculates the exact delay needed if any limit would be exceeded (see the sketch after this list)
- Automatically waits the minimum required time
- Updates estimates with actual token counts from responses
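As a rough illustration of the sliding-window math (the real logic lives in calculateDelay() in the transformer source at the end of this post), here is a simplified, standalone sketch of the tokens-per-minute case; the function name and history shape are hypothetical, not part of the actual transformer:
// Simplified sketch of the TPM delay calculation; assumes `history` entries
// ({ timestamp, tokens }) are sorted oldest-first, as in the full transformer below.
function delayForTPM(history, nowMs, estimatedTokens, maxTPM = 1_500_000, windowMs = 60_000) {
  const recent = history.filter(r => r.timestamp > nowMs - windowMs); // requests still in the 60s window
  const used = recent.reduce((sum, r) => sum + r.tokens, 0);
  if (used + estimatedTokens <= maxTPM) return 0;                     // under the limit: send immediately
  let freed = 0;
  for (const r of recent) {                                           // walk oldest entries first
    freed += r.tokens;
    if (used + estimatedTokens - freed <= maxTPM) {
      return Math.max(0, r.timestamp + windowMs - nowMs);             // wait until this entry ages out
    }
  }
  return windowMs;                                                    // worst case: wait one full window
}
The same pattern applies to the RPM limit (counting requests instead of tokens) and to the daily cap, which waits until the midnight reset.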
2. Context Limit Handling
Cerebras GLM-4.7 has a 131,072 token context limit. The transformer:
- Estimates total context size before each request
- Reserves ~20K tokens for output (safe input limit: ~111K tokens)
- When limit is exceeded, returns an Anthropic-format error:
{ "type": "error", "error": { "type": "invalid_request_error", "message": "prompt is too long: X tokens > 131072 maximum" } } - This triggers Claude Code's standard "Context limit reached" message, prompting you to run
/compact
3. Message Format Conversion
Converts Claude Code's message format to OpenAI-compatible format that Cerebras expects:
- Converts content arrays (with text/image objects) to plain strings
- Handles the top-level system field by converting it to a system message (see the before/after example below)
- Removes unsupported fields: reasoning, thinking, anthropic_version, metadata, cache_control, stream_options
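For a concrete (hypothetical) example of what this conversion does, a Claude-style request with a content array and a top-level system field becomes a flat OpenAI-style message list:
// Hypothetical request in Claude Code's format...
const claudeStyle = {
  system: 'You are a helpful assistant.',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'List the files in this repo.' },
        { type: 'image', source: {} } // image blocks are replaced with a placeholder string
      ]
    }
  ]
};
// ...and the shape Cerebras receives after transformation:
const cerebrasStyle = {
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'List the files in this repo.[Image content]' }
  ],
  max_tokens: 16384 // default applied when the request does not set one
};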
4. Agent Name Casing Fixes
GLM models sometimes output lowercase agent names, but Claude Code is case-sensitive. The transformer automatically corrects:
| GLM Output | Corrected |
|---|---|
| explore | Explore |
| plan | Plan |
| bash | Bash |
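For instance, when GLM emits a Task tool call with a lowercase subagent_type, the arguments string is rewritten before Claude Code sees it (the arguments here are hypothetical):
// Arguments string as emitted by GLM (hypothetical)...
const before = '{"subagent_type": "explore", "prompt": "Survey the codebase"}';
// ...and after the transformer's casing fix:
const after = '{"subagent_type": "Explore", "prompt": "Survey the codebase"}';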
5. Tool Call ID Deduplication
Cerebras is strict about duplicate tool_call.id values in the conversation history. The transformer:
- Tracks all seen tool call IDs
- Removes duplicate tool calls before sending to Cerebras (illustrated below)
- Prevents 422 errors from duplicate IDs
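As an illustration (the IDs are hypothetical), a history fragment like the one below would have the second call_abc123 tool call removed before the request is forwarded to Cerebras:
// The second assistant message repeats an already-seen tool_call.id, so that tool call is filtered out.
const messages = [
  { role: 'assistant', tool_calls: [{ id: 'call_abc123', type: 'function', function: { name: 'Bash', arguments: '{"command":"ls"}' } }] },
  { role: 'tool', tool_call_id: 'call_abc123', content: 'README.md' },
  { role: 'assistant', tool_calls: [{ id: 'call_abc123', type: 'function', function: { name: 'Bash', arguments: '{"command":"ls"}' } }] }
];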
6. Streaming Response Support
Handles both streaming and non-streaming responses:
- Intercepts streaming responses to fix agent name casing in real-time
- Extracts token usage from SSE data for accurate rate limit tracking
Configuration
config.json
{
"LOG": true,
"LOG_LEVEL": "debug",
"API_TIMEOUT_MS": 600000,
"transformers": [
{
"path": "/path/to/cerebras-transformer.js"
}
],
"Providers": [
{
"name": "cerebras",
"api_base_url": "https://api.cerebras.ai/v1/chat/completions",
"api_key": "YOUR_CEREBRAS_API_KEY",
"models": ["zai-glm-4.7"],
"transformer": {
"use": [
"cerebras",
"enhancetool",
["maxtoken", { "max_tokens": 16384 }]
]
}
}
],
"Router": {
"default": "cerebras,zai-glm-4.7",
"background": "cerebras,zai-glm-4.7",
"think": "cerebras,zai-glm-4.7",
"longContext": "cerebras,zai-glm-4.7",
"longContextThreshold": 60000,
"claude-3-5-haiku-20241022": "cerebras,zai-glm-4.7",
"claude-3-5-haiku": "cerebras,zai-glm-4.7",
"haiku": "cerebras,zai-glm-4.7"
}
}
Important: WebFetch Support
Claude Code's WebFetch tool makes a secondary API call to claude-3-5-haiku for summarizing fetched web content. You must add routes for Haiku models in your Router config (as shown above), otherwise WebFetch will fail.
Console Output
The transformer logs helpful information to stderr:
[Cerebras] Rate limit: waiting 2.5s (RPM: 118/120, TPM: 1450K/1500K)
[Cerebras] Token tracking: estimated 45000 -> actual 42350 (daily: 15M/120M)
[Cerebras] Context 125000 tokens exceeds safe limit 111072. Returning compact message.
[Cerebras] Removing duplicate tool_call.id: call_abc123
Limitations
- No extended thinking support: Cerebras doesn't support reasoning/thinking parameters
- 131K context limit: Smaller than Anthropic's 200K limit - use /compact or Continuous Claude Code v3
- Token estimation: Uses a ~4 chars/token approximation; actual counts may vary slightly
Troubleshooting
"Context limit reached" appearing frequently
- Use Continuous Claude Code v3 for automatic context management
- Or manually run /compact when prompted
WebFetch not working
- Ensure you have the Haiku routes in your Router config
- WebFetch uses Haiku for content summarization
422 errors about duplicate tool_call.id
- The transformer handles this automatically
- If still occurring, check for very long conversations with repeated tool calls
Rate limit delays
- Normal behavior - the transformer is preventing 429 errors
- Consider upgrading your Cerebras plan for higher limits
Credits
Built for use with Claude Code Router by musistudio.
Cerebras GLM-4.7 provides extremely fast inference speeds, making it an excellent choice for Claude Code workflows.
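Transformer Source
cerebras-transformer.js (the file referenced by the transformers path in config.json above)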
// Custom transformer for Cerebras GLM-4.7 with proactive rate limiting
// Tracks tokens and requests to avoid 429 errors before they happen
/**
* Context limit configuration
* GLM-4.7 has 131072 token limit - reserve space for output
*/
const CONTEXT_LIMIT = 131072;
const MAX_INPUT_TOKENS = CONTEXT_LIMIT - 20000; // Reserve ~20K tokens for output (safe input: ~111K)
/**
* Converts content from Claude Code format (array of objects) to plain string
*/
function convertContentToString(content) {
if (typeof content === 'string') {
return content;
}
if (Array.isArray(content)) {
return content
.map((item) => {
if (typeof item === 'string') {
return item;
}
if (item.type === 'text' && item.text) {
return item.text;
}
if (item.type === 'image' || item.type === 'image_url') {
return '[Image content]';
}
return '';
})
.join('');
}
return '';
}
/**
* Estimate token count from text (rough approximation: ~4 chars per token)
*/
function estimateTokens(text) {
if (!text) return 0;
return Math.ceil(text.length / 4);
}
/**
* Known agent name corrections (lowercase -> correct case)
* GLM models often output lowercase but Claude Code is case-sensitive
*/
const AGENT_NAME_CORRECTIONS = {
'explore': 'Explore',
'plan': 'Plan',
'bash': 'Bash',
'general-purpose': 'general-purpose',
'statusline-setup': 'statusline-setup',
'claude-code-guide': 'claude-code-guide',
};
/**
* Fix agent name casing in tool call arguments
*/
function fixAgentNameCasing(text) {
if (!text) return text;
let fixed = text;
for (const [wrong, correct] of Object.entries(AGENT_NAME_CORRECTIONS)) {
const patterns = [
new RegExp(`"subagent_type"\\s*:\\s*"${wrong}"`, 'gi'),
new RegExp(`'subagent_type'\\s*:\\s*'${wrong}'`, 'gi'),
new RegExp(`subagent_type.*?["']${wrong}["']`, 'gi'),
];
for (const pattern of patterns) {
fixed = fixed.replace(pattern, (match) => {
return match.replace(new RegExp(wrong, 'gi'), correct);
});
}
}
return fixed;
}
/**
* Transformer class for Cerebras GLM-4.7 with proactive rate limiting
*/
class CerebrasTransformer {
constructor() {
this.name = 'cerebras';
// Rate limit configuration for Code Max plan ($200/mo)
this.config = {
maxRequestsPerMinute: 120,
maxTokensPerMinute: 1500000,
maxTokensPerDay: 120000000,
windowSizeMs: 60000,
};
this.dailyTokens = 0;
this.dailyResetTime = this.getNextMidnight();
this.pendingRequestId = 0;
this.requestHistory = [];
this.lastRequestTime = 0;
}
getNextMidnight() {
const now = new Date();
const midnight = new Date(now);
midnight.setHours(24, 0, 0, 0);
return midnight.getTime();
}
checkDailyReset() {
const now = Date.now();
if (now >= this.dailyResetTime) {
console.error(`[Cerebras] Daily token counter reset. Previous day: ${Math.round(this.dailyTokens/1000000)}M tokens used.`);
this.dailyTokens = 0;
this.dailyResetTime = this.getNextMidnight();
}
}
cleanupHistory() {
const cutoff = Date.now() - this.config.windowSizeMs;
this.requestHistory = this.requestHistory.filter(r => r.timestamp > cutoff);
}
getCurrentUsage() {
this.cleanupHistory();
const requestCount = this.requestHistory.length;
const tokenCount = this.requestHistory.reduce((sum, r) => sum + r.tokens, 0);
return { requestCount, tokenCount };
}
calculateDelay(estimatedRequestTokens) {
this.checkDailyReset();
this.cleanupHistory();
const now = Date.now();
const usage = this.getCurrentUsage();
const withinRPM = usage.requestCount < this.config.maxRequestsPerMinute;
const withinTPM = usage.tokenCount + estimatedRequestTokens < this.config.maxTokensPerMinute;
const withinDaily = this.dailyTokens + estimatedRequestTokens < this.config.maxTokensPerDay;
if (withinRPM && withinTPM && withinDaily) {
return 0;
}
let requiredDelay = 0;
if (!withinRPM && this.requestHistory.length > 0) {
const oldestRequest = this.requestHistory[0];
const expiresAt = oldestRequest.timestamp + this.config.windowSizeMs;
const rpmDelay = Math.max(0, expiresAt - now + 50);
requiredDelay = Math.max(requiredDelay, rpmDelay);
}
if (!withinTPM && this.requestHistory.length > 0) {
const tokensNeeded = (usage.tokenCount + estimatedRequestTokens) - this.config.maxTokensPerMinute;
let tokensFreed = 0;
for (const req of this.requestHistory) {
tokensFreed += req.tokens;
if (tokensFreed >= tokensNeeded) {
const expiresAt = req.timestamp + this.config.windowSizeMs;
const tpmDelay = Math.max(0, expiresAt - now + 50);
requiredDelay = Math.max(requiredDelay, tpmDelay);
break;
}
}
}
if (!withinDaily) {
const dailyDelay = Math.max(0, this.dailyResetTime - now);
requiredDelay = Math.max(requiredDelay, dailyDelay);
}
return requiredDelay;
}
estimateRequestTokens(request) {
let tokens = 0;
if (request.messages && Array.isArray(request.messages)) {
for (const msg of request.messages) {
const content = typeof msg.content === 'string'
? msg.content
: convertContentToString(msg.content);
tokens += estimateTokens(content);
}
}
if (request.system) {
tokens += estimateTokens(convertContentToString(request.system));
}
tokens += request.max_tokens || 16384;
return tokens;
}
recordRequest(tokens) {
this.pendingRequestId++;
const requestId = this.pendingRequestId;
this.requestHistory.push({
id: requestId,
timestamp: Date.now(),
tokens: tokens,
estimated: true
});
this.lastRequestTime = Date.now();
this.dailyTokens += tokens;
return requestId;
}
updateRequestTokens(requestId, actualTokens) {
const request = this.requestHistory.find(r => r.id === requestId);
if (request && request.estimated) {
const oldTokens = request.tokens;
const tokenDiff = actualTokens - oldTokens;
request.tokens = actualTokens;
request.estimated = false;
this.dailyTokens += tokenDiff;
console.error(`[Cerebras] Token tracking: estimated ${oldTokens} -> actual ${actualTokens} (daily: ${Math.round(this.dailyTokens/1000000)}M/${Math.round(this.config.maxTokensPerDay/1000000)}M)`);
}
}
/**
* Transform the request from Claude Code format to Cerebras format
*/
async transformRequestIn(request, provider, context) {
const estimatedTokens = this.estimateRequestTokens(request);
// Check context limit BEFORE sending to Cerebras
const maxSafeInput = MAX_INPUT_TOKENS; // Reserve ~20K of the 131K context for output
if (estimatedTokens > maxSafeInput) {
console.error(`[Cerebras] Context ${estimatedTokens} tokens exceeds safe limit ${maxSafeInput}. Returning compact message.`);
// Flag to return error response
this._contextLimitHit = {
estimatedTokens,
limit: CONTEXT_LIMIT
};
// Return minimal request that will succeed
return {
model: request.model,
messages: [
{ role: 'system', content: 'Say OK' },
{ role: 'user', content: 'OK' }
],
max_tokens: 10,
stream: false
};
}
// Calculate and apply delay if needed
const delay = this.calculateDelay(estimatedTokens);
if (delay > 0) {
const usage = this.getCurrentUsage();
const delayStr = delay >= 60000
? `${Math.round(delay/60000)}m ${Math.round((delay%60000)/1000)}s`
: `${(delay/1000).toFixed(1)}s`;
console.error(`[Cerebras] Rate limit: waiting ${delayStr} (RPM: ${usage.requestCount}/${this.config.maxRequestsPerMinute}, TPM: ${Math.round(usage.tokenCount/1000)}K/${Math.round(this.config.maxTokensPerMinute/1000)}K)`);
await new Promise(resolve => setTimeout(resolve, delay));
}
this.currentRequestId = this.recordRequest(estimatedTokens);
// Deep clone and transform
const transformedRequest = JSON.parse(JSON.stringify(request));
// Deduplicate tool_call.ids (Cerebras is strict about this)
const seenToolCallIds = new Set();
if (transformedRequest.messages && Array.isArray(transformedRequest.messages)) {
for (const msg of transformedRequest.messages) {
if (msg.tool_calls && Array.isArray(msg.tool_calls)) {
msg.tool_calls = msg.tool_calls.filter(tc => {
if (tc.id && seenToolCallIds.has(tc.id)) {
console.error(`[Cerebras] Removing duplicate tool_call.id: ${tc.id}`);
return false;
}
if (tc.id) seenToolCallIds.add(tc.id);
return true;
});
}
if (msg.role === 'tool' && msg.tool_call_id) {
if (seenToolCallIds.has(msg.tool_call_id + '_response')) {
msg._duplicate = true;
} else {
seenToolCallIds.add(msg.tool_call_id + '_response');
}
}
}
transformedRequest.messages = transformedRequest.messages.filter(msg => !msg._duplicate);
}
// Remove unsupported fields
delete transformedRequest.reasoning;
delete transformedRequest.reasoning_content;
delete transformedRequest.thinking;
delete transformedRequest.anthropic_version;
delete transformedRequest.metadata;
// Transform messages
if (transformedRequest.messages && Array.isArray(transformedRequest.messages)) {
transformedRequest.messages = transformedRequest.messages.map((message) => {
const transformedMessage = { ...message };
if (message.content !== undefined) {
transformedMessage.content = convertContentToString(message.content);
}
if (message.role === 'system' && message.content !== undefined) {
transformedMessage.content = convertContentToString(message.content);
}
delete transformedMessage.cache_control;
return transformedMessage;
});
}
// Handle top-level system field
if (transformedRequest.system !== undefined) {
const systemContent = convertContentToString(transformedRequest.system);
const hasSystemMessage = transformedRequest.messages &&
transformedRequest.messages.length > 0 &&
transformedRequest.messages[0].role === 'system';
if (!hasSystemMessage && transformedRequest.messages) {
transformedRequest.messages.unshift({
role: 'system',
content: systemContent
});
}
delete transformedRequest.system;
}
if (!transformedRequest.max_tokens) {
transformedRequest.max_tokens = 16384;
}
delete transformedRequest.stream_options;
return transformedRequest;
}
/**
* Transform the response
*/
async transformResponseOut(response, context) {
const requestId = this.currentRequestId;
// Check if we hit context limit - return Anthropic error format
if (this._contextLimitHit) {
const info = this._contextLimitHit;
this._contextLimitHit = null;
console.error(`[Cerebras] Returning Anthropic-format context limit error`);
// Return exact Anthropic error format that Claude Code recognizes
const errorBody = {
type: 'error',
error: {
type: 'invalid_request_error',
message: `prompt is too long: ${info.estimatedTokens} tokens > ${info.limit} maximum`
}
};
return new Response(JSON.stringify(errorBody), {
status: 400,
headers: { 'Content-Type': 'application/json' }
});
}
// Handle non-streaming JSON response
try {
if (response && typeof response === 'object' && !response.body) {
// Fix agent name casing in tool calls
if (response.choices) {
for (const choice of response.choices) {
if (choice.message?.tool_calls) {
for (const toolCall of choice.message.tool_calls) {
if (toolCall.function?.arguments) {
toolCall.function.arguments = fixAgentNameCasing(toolCall.function.arguments);
}
}
}
if (typeof choice.message?.content === 'string') {
choice.message.content = fixAgentNameCasing(choice.message.content);
}
}
}
if (response.usage) {
const actualTokens = response.usage.total_tokens ||
(response.usage.prompt_tokens || 0) + (response.usage.completion_tokens || 0);
if (actualTokens > 0) {
this.updateRequestTokens(requestId, actualTokens);
}
}
return response;
}
// For streaming responses, intercept and fix agent names
if (response && response.body && typeof response.body.getReader === 'function') {
const originalBody = response.body;
const transformer = this;
const encoder = new TextEncoder();
const decoder = new TextDecoder();
const { readable, writable } = new TransformStream({
transform(chunk, controller) {
try {
let text = decoder.decode(chunk, { stream: true });
const fixedText = fixAgentNameCasing(text);
controller.enqueue(encoder.encode(fixedText));
// Parse SSE for usage info
const lines = fixedText.split('\n');
for (const line of lines) {
if (line.startsWith('data: ') && !line.includes('[DONE]')) {
try {
const data = JSON.parse(line.slice(6));
if (data.usage) {
const actualTokens = data.usage.total_tokens ||
(data.usage.prompt_tokens || 0) + (data.usage.completion_tokens || 0);
if (actualTokens > 0) {
transformer.updateRequestTokens(requestId, actualTokens);
}
}
} catch (e) {
// Not valid JSON, skip
}
}
}
} catch (e) {
controller.enqueue(chunk);
}
}
});
originalBody.pipeTo(writable).catch(() => {});
return new Response(readable, {
status: response.status,
statusText: response.statusText,
headers: response.headers
});
}
} catch (e) {
console.error('[Cerebras] Error processing response:', e.message);
}
return response;
}
}
module.exports = CerebrasTransformer;
module.exports.default = CerebrasTransformer;