Hey everyone, not sure where to put this, but I created a working Cerebras Code GLM-4.7 transformer for the Claude Code CLI that handles rate limiting, tool translation, context limits, etc. Hope it is helpful for others.
Cerebras GLM-4.7 Transformer for Claude Code Router
A custom transformer that enables Claude Code to work with Cerebras's GLM-4.7 model through Claude Code Router (CCR).
Recommended Setup
It is highly recommended to use Continuous Claude Code v3 alongside this transformer. Continuous Claude Code automatically handles context compaction, so you don't need to worry about manually running /compact when you hit context limits.
Features
1. Proactive Rate Limiting
The transformer implements a sliding window rate limiter that tracks usage before hitting Cerebras's limits, preventing 429 errors entirely.
Tracked Limits (Code Max Plan - $200/mo):
- 120 RPM (Requests Per Minute)
- 1.5M TPM (Tokens Per Minute)
- 120M tokens/day (Daily limit)
How it works:
- Estimates tokens for each request before sending
- Calculates the exact delay needed if any limit would be exceeded (see the sketch after this list)
- Automatically waits the minimum required time
- Updates estimates with actual token counts from responses
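As a rough illustration of the sliding-window math (the real logic lives in calculateDelay() in the transformer source at the end of this post), here is a simplified, standalone sketch of the tokens-per-minute case; the function name and history shape are hypothetical, not part of the actual transformer:
// Simplified sketch of the TPM delay calculation; assumes `history` entries
// ({ timestamp, tokens }) are sorted oldest-first, as in the full transformer below.
function delayForTPM(history, nowMs, estimatedTokens, maxTPM = 1_500_000, windowMs = 60_000) {
  const recent = history.filter(r => r.timestamp > nowMs - windowMs); // requests still in the 60s window
  const used = recent.reduce((sum, r) => sum + r.tokens, 0);
  if (used + estimatedTokens <= maxTPM) return 0;                     // under the limit: send immediately
  let freed = 0;
  for (const r of recent) {                                           // walk oldest entries first
    freed += r.tokens;
    if (used + estimatedTokens - freed <= maxTPM) {
      return Math.max(0, r.timestamp + windowMs - nowMs);             // wait until this entry ages out
    }
  }
  return windowMs;                                                    // worst case: wait one full window
}
The same pattern applies to the RPM limit (counting requests instead of tokens) and to the daily cap, which waits until the midnight reset.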
2. Context Limit Handling
Cerebras GLM-4.7 has a 131,072 token context limit. The transformer:
- Estimates total context size before each request
- Reserves ~20K tokens for output (safe input limit: ~111K tokens)
- When limit is exceeded, returns an Anthropic-format error:
{ "type": "error", "error": { "type": "invalid_request_error", "message": "prompt is too long: X tokens > 131072 maximum" } } - This triggers Claude Code's standard "Context limit reached" message, prompting you to run
/compact
3. Message Format Conversion
Converts Claude Code's message format to OpenAI-compatible format that Cerebras expects:
- Converts content arrays (with text/image objects) to plain strings
- Handles the top-level system field by converting it to a system message (see the before/after example below)
- Removes unsupported fields: reasoning, thinking, anthropic_version, metadata, cache_control, stream_options
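For a concrete (hypothetical) example of what this conversion does, a Claude-style request with a content array and a top-level system field becomes a flat OpenAI-style message list:
// Hypothetical request in Claude Code's format...
const claudeStyle = {
  system: 'You are a helpful assistant.',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'List the files in this repo.' },
        { type: 'image', source: {} } // image blocks are replaced with a placeholder string
      ]
    }
  ]
};
// ...and the shape Cerebras receives after transformation:
const cerebrasStyle = {
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'List the files in this repo.[Image content]' }
  ],
  max_tokens: 16384 // default applied when the request does not set one
};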
4. Agent Name Casing Fixes
GLM models sometimes output lowercase agent names, but Claude Code is case-sensitive. The transformer automatically corrects:
| GLM Output | Corrected |
|---|---|
| explore | Explore |
| plan | Plan |
| bash | Bash |
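For instance, when GLM emits a Task tool call with a lowercase subagent_type, the arguments string is rewritten before Claude Code sees it (the arguments here are hypothetical):
// Arguments string as emitted by GLM (hypothetical)...
const before = '{"subagent_type": "explore", "prompt": "Survey the codebase"}';
// ...and after the transformer's casing fix:
const after = '{"subagent_type": "Explore", "prompt": "Survey the codebase"}';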
5. Tool Call ID Deduplication
Cerebras is strict about duplicate tool_call.id values in the conversation history. The transformer:
- Tracks all seen tool call IDs
- Removes duplicate tool calls before sending to Cerebras (illustrated below)
- Prevents 422 errors from duplicate IDs
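As an illustration (the IDs are hypothetical), a history fragment like the one below would have the second call_abc123 tool call removed before the request is forwarded to Cerebras:
// The second assistant message repeats an already-seen tool_call.id, so that tool call is filtered out.
const messages = [
  { role: 'assistant', tool_calls: [{ id: 'call_abc123', type: 'function', function: { name: 'Bash', arguments: '{"command":"ls"}' } }] },
  { role: 'tool', tool_call_id: 'call_abc123', content: 'README.md' },
  { role: 'assistant', tool_calls: [{ id: 'call_abc123', type: 'function', function: { name: 'Bash', arguments: '{"command":"ls"}' } }] }
];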
6. Streaming Response Support
Handles both streaming and non-streaming responses:
- Intercepts streaming responses to fix agent name casing in real-time
- Extracts token usage from SSE data for accurate rate limit tracking
Configuration
config.json
{
"LOG": true,
"LOG_LEVEL": "debug",
"API_TIMEOUT_MS": 600000,
"transformers": [
{
"path": "/path/to/cerebras-transformer.js"
}
],
"Providers": [
{
"name": "cerebras",
"api_base_url": "https://api.cerebras.ai/v1/chat/completions",
"api_key": "YOUR_CEREBRAS_API_KEY",
"models": ["zai-glm-4.7"],
"transformer": {
"use": [
"cerebras",
"enhancetool",
["maxtoken", { "max_tokens": 16384 }]
]
}
}
],
"Router": {
"default": "cerebras,zai-glm-4.7",
"background": "cerebras,zai-glm-4.7",
"think": "cerebras,zai-glm-4.7",
"longContext": "cerebras,zai-glm-4.7",
"longContextThreshold": 60000,
"claude-3-5-haiku-20241022": "cerebras,zai-glm-4.7",
"claude-3-5-haiku": "cerebras,zai-glm-4.7",
"haiku": "cerebras,zai-glm-4.7"
}
}
Important: WebFetch Support
Claude Code's WebFetch tool makes a secondary API call to claude-3-5-haiku for summarizing fetched web content. You must add routes for Haiku models in your Router config (as shown above), otherwise WebFetch will fail.
Console Output
The transformer logs helpful information to stderr:
[Cerebras] Rate limit: waiting 2.5s (RPM: 118/120, TPM: 1450K/1500K)
[Cerebras] Token tracking: estimated 45000 -> actual 42350 (daily: 15M/120M)
[Cerebras] Context 125000 tokens exceeds safe limit 111072. Returning compact message.
[Cerebras] Removing duplicate tool_call.id: call_abc123
Limitations
- No extended thinking support: Cerebras doesn't support reasoning/thinking parameters
- 131K context limit: Smaller than Anthropic's 200K limit - use /compact or Continuous Claude Code v3
- Token estimation: Uses a ~4 chars/token approximation; actual counts may vary slightly
Troubleshooting
"Context limit reached" appearing frequently
- Use Continuous Claude Code v3 for automatic context management
- Or manually run /compact when prompted
WebFetch not working
- Ensure you have the Haiku routes in your Router config
- WebFetch uses Haiku for content summarization
422 errors about duplicate tool_call.id
- The transformer handles this automatically
- If still occurring, check for very long conversations with repeated tool calls
Rate limit delays
- Normal behavior - the transformer is preventing 429 errors
- Consider upgrading your Cerebras plan for higher limits
Credits
Built for use with Claude Code Router by musistudio.
Cerebras GLM-4.7 provides extremely fast inference speeds, making it an excellent choice for Claude Code workflows.
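Transformer Source
cerebras-transformer.js (the file referenced by the transformers path in config.json above)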
// Custom transformer for Cerebras GLM-4.7 with proactive rate limiting
// Tracks tokens and requests to avoid 429 errors before they happen
/**
* Context limit configuration
* GLM-4.7 has 131072 token limit - reserve space for output
*/
const CONTEXT_LIMIT = 131072;
const MAX_INPUT_TOKENS = CONTEXT_LIMIT - 20000; // Reserve ~20K tokens for output (safe input: ~111K)
/**
* Converts content from Claude Code format (array of objects) to plain string
*/
function convertContentToString(content) {
if (typeof content === 'string') {
return content;
}
if (Array.isArray(content)) {
return content
.map((item) => {
if (typeof item === 'string') {
return item;
}
if (item.type === 'text' && item.text) {
return item.text;
}
if (item.type === 'image' || item.type === 'image_url') {
return '[Image content]';
}
return '';
})
.join('');
}
return '';
}
/**
* Estimate token count from text (rough approximation: ~4 chars per token)
*/
function estimateTokens(text) {
if (!text) return 0;
return Math.ceil(text.length / 4);
}
/**
* Known agent name corrections (lowercase -> correct case)
* GLM models often output lowercase but Claude Code is case-sensitive
*/
const AGENT_NAME_CORRECTIONS = {
'explore': 'Explore',
'plan': 'Plan',
'bash': 'Bash',
'general-purpose': 'general-purpose',
'statusline-setup': 'statusline-setup',
'claude-code-guide': 'claude-code-guide',
};
/**
* Fix agent name casing in tool call arguments
*/
function fixAgentNameCasing(text) {
if (!text) return text;
let fixed = text;
for (const [wrong, correct] of Object.entries(AGENT_NAME_CORRECTIONS)) {
const patterns = [
new RegExp(`"subagent_type"\\s*:\\s*"${wrong}"`, 'gi'),
new RegExp(`'subagent_type'\\s*:\\s*'${wrong}'`, 'gi'),
new RegExp(`subagent_type.*?["']${wrong}["']`, 'gi'),
];
for (const pattern of patterns) {
fixed = fixed.replace(pattern, (match) => {
return match.replace(new RegExp(wrong, 'gi'), correct);
});
}
}
return fixed;
}
/**
* Transformer class for Cerebras GLM-4.7 with proactive rate limiting
*/
class CerebrasTransformer {
constructor() {
this.name = 'cerebras';
// Rate limit configuration for Code Max plan ($200/mo)
this.config = {
maxRequestsPerMinute: 120,
maxTokensPerMinute: 1500000,
maxTokensPerDay: 120000000,
windowSizeMs: 60000,
};
this.dailyTokens = 0;
this.dailyResetTime = this.getNextMidnight();
this.pendingRequestId = 0;
this.requestHistory = [];
this.lastRequestTime = 0;
}
getNextMidnight() {
const now = new Date();
const midnight = new Date(now);
midnight.setHours(24, 0, 0, 0);
return midnight.getTime();
}
checkDailyReset() {
const now = Date.now();
if (now >= this.dailyResetTime) {
console.error(`[Cerebras] Daily token counter reset. Previous day: ${Math.round(this.dailyTokens/1000000)}M tokens used.`);
this.dailyTokens = 0;
this.dailyResetTime = this.getNextMidnight();
}
}
cleanupHistory() {
const cutoff = Date.now() - this.config.windowSizeMs;
this.requestHistory = this.requestHistory.filter(r => r.timestamp > cutoff);
}
getCurrentUsage() {
this.cleanupHistory();
const requestCount = this.requestHistory.length;
const tokenCount = this.requestHistory.reduce((sum, r) => sum + r.tokens, 0);
return { requestCount, tokenCount };
}
calculateDelay(estimatedRequestTokens) {
this.checkDailyReset();
this.cleanupHistory();
const now = Date.now();
const usage = this.getCurrentUsage();
const withinRPM = usage.requestCount < this.config.maxRequestsPerMinute;
const withinTPM = usage.tokenCount + estimatedRequestTokens < this.config.maxTokensPerMinute;
const withinDaily = this.dailyTokens + estimatedRequestTokens < this.config.maxTokensPerDay;
if (withinRPM && withinTPM && withinDaily) {
return 0;
}
let requiredDelay = 0;
if (!withinRPM && this.requestHistory.length > 0) {
const oldestRequest = this.requestHistory[0];
const expiresAt = oldestRequest.timestamp + this.config.windowSizeMs;
const rpmDelay = Math.max(0, expiresAt - now + 50);
requiredDelay = Math.max(requiredDelay, rpmDelay);
}
if (!withinTPM && this.requestHistory.length > 0) {
const tokensNeeded = (usage.tokenCount + estimatedRequestTokens) - this.config.maxTokensPerMinute;
let tokensFreed = 0;
for (const req of this.requestHistory) {
tokensFreed += req.tokens;
if (tokensFreed >= tokensNeeded) {
const expiresAt = req.timestamp + this.config.windowSizeMs;
const tpmDelay = Math.max(0, expiresAt - now + 50);
requiredDelay = Math.max(requiredDelay, tpmDelay);
break;
}
}
}
if (!withinDaily) {
const dailyDelay = Math.max(0, this.dailyResetTime - now);
requiredDelay = Math.max(requiredDelay, dailyDelay);
}
return requiredDelay;
}
estimateRequestTokens(request) {
let tokens = 0;
if (request.messages && Array.isArray(request.messages)) {
for (const msg of request.messages) {
const content = typeof msg.content === 'string'
? msg.content
: convertContentToString(msg.content);
tokens += estimateTokens(content);
}
}
if (request.system) {
tokens += estimateTokens(convertContentToString(request.system));
}
tokens += request.max_tokens || 16384;
return tokens;
}
recordRequest(tokens) {
this.pendingRequestId++;
const requestId = this.pendingRequestId;
this.requestHistory.push({
id: requestId,
timestamp: Date.now(),
tokens: tokens,
estimated: true
});
this.lastRequestTime = Date.now();
this.dailyTokens += tokens;
return requestId;
}
updateRequestTokens(requestId, actualTokens) {
const request = this.requestHistory.find(r => r.id === requestId);
if (request && request.estimated) {
const oldTokens = request.tokens;
const tokenDiff = actualTokens - oldTokens;
request.tokens = actualTokens;
request.estimated = false;
this.dailyTokens += tokenDiff;
console.error(`[Cerebras] Token tracking: estimated ${oldTokens} -> actual ${actualTokens} (daily: ${Math.round(this.dailyTokens/1000000)}M/${Math.round(this.config.maxTokensPerDay/1000000)}M)`);
}
}
/**
* Transform the request from Claude Code format to Cerebras format
*/
async transformRequestIn(request, provider, context) {
const estimatedTokens = this.estimateRequestTokens(request);
// Check context limit BEFORE sending to Cerebras
const maxSafeInput = MAX_INPUT_TOKENS; // Reserve ~20K of the 131K context for output
if (estimatedTokens > maxSafeInput) {
console.error(`[Cerebras] Context ${estimatedTokens} tokens exceeds safe limit ${maxSafeInput}. Returning compact message.`);
// Flag to return error response
this._contextLimitHit = {
estimatedTokens,
limit: CONTEXT_LIMIT
};
// Return minimal request that will succeed
return {
model: request.model,
messages: [
{ role: 'system', content: 'Say OK' },
{ role: 'user', content: 'OK' }
],
max_tokens: 10,
stream: false
};
}
// Calculate and apply delay if needed
const delay = this.calculateDelay(estimatedTokens);
if (delay > 0) {
const usage = this.getCurrentUsage();
const delayStr = delay >= 60000
? `${Math.round(delay/60000)}m ${Math.round((delay%60000)/1000)}s`
: `${(delay/1000).toFixed(1)}s`;
console.error(`[Cerebras] Rate limit: waiting ${delayStr} (RPM: ${usage.requestCount}/${this.config.maxRequestsPerMinute}, TPM: ${Math.round(usage.tokenCount/1000)}K/${Math.round(this.config.maxTokensPerMinute/1000)}K)`);
await new Promise(resolve => setTimeout(resolve, delay));
}
this.currentRequestId = this.recordRequest(estimatedTokens);
// Deep clone and transform
const transformedRequest = JSON.parse(JSON.stringify(request));
// Deduplicate tool_call.ids (Cerebras is strict about this)
const seenToolCallIds = new Set();
if (transformedRequest.messages && Array.isArray(transformedRequest.messages)) {
for (const msg of transformedRequest.messages) {
if (msg.tool_calls && Array.isArray(msg.tool_calls)) {
msg.tool_calls = msg.tool_calls.filter(tc => {
if (tc.id && seenToolCallIds.has(tc.id)) {
console.error(`[Cerebras] Removing duplicate tool_call.id: ${tc.id}`);
return false;
}
if (tc.id) seenToolCallIds.add(tc.id);
return true;
});
}
if (msg.role === 'tool' && msg.tool_call_id) {
if (seenToolCallIds.has(msg.tool_call_id + '_response')) {
msg._duplicate = true;
} else {
seenToolCallIds.add(msg.tool_call_id + '_response');
}
}
}
transformedRequest.messages = transformedRequest.messages.filter(msg => !msg._duplicate);
}
// Remove unsupported fields
delete transformedRequest.reasoning;
delete transformedRequest.reasoning_content;
delete transformedRequest.thinking;
delete transformedRequest.anthropic_version;
delete transformedRequest.metadata;
// Transform messages
if (transformedRequest.messages && Array.isArray(transformedRequest.messages)) {
transformedRequest.messages = transformedRequest.messages.map((message) => {
const transformedMessage = { ...message };
if (message.content !== undefined) {
transformedMessage.content = convertContentToString(message.content);
}
if (message.role === 'system' && message.content !== undefined) {
transformedMessage.content = convertContentToString(message.content);
}
delete transformedMessage.cache_control;
return transformedMessage;
});
}
// Handle top-level system field
if (transformedRequest.system !== undefined) {
const systemContent = convertContentToString(transformedRequest.system);
const hasSystemMessage = transformedRequest.messages &&
transformedRequest.messages.length > 0 &&
transformedRequest.messages[0].role === 'system';
if (!hasSystemMessage && transformedRequest.messages) {
transformedRequest.messages.unshift({
role: 'system',
content: systemContent
});
}
delete transformedRequest.system;
}
if (!transformedRequest.max_tokens) {
transformedRequest.max_tokens = 16384;
}
delete transformedRequest.stream_options;
return transformedRequest;
}
/**
* Transform the response
*/
async transformResponseOut(response, context) {
const requestId = this.currentRequestId;
// Check if we hit context limit - return Anthropic error format
if (this._contextLimitHit) {
const info = this._contextLimitHit;
this._contextLimitHit = null;
console.error(`[Cerebras] Returning Anthropic-format context limit error`);
// Return exact Anthropic error format that Claude Code recognizes
const errorBody = {
type: 'error',
error: {
type: 'invalid_request_error',
message: `prompt is too long: ${info.estimatedTokens} tokens > ${info.limit} maximum`
}
};
return new Response(JSON.stringify(errorBody), {
status: 400,
headers: { 'Content-Type': 'application/json' }
});
}
// Handle non-streaming JSON response
try {
if (response && typeof response === 'object' && !response.body) {
// Fix agent name casing in tool calls
if (response.choices) {
for (const choice of response.choices) {
if (choice.message?.tool_calls) {
for (const toolCall of choice.message.tool_calls) {
if (toolCall.function?.arguments) {
toolCall.function.arguments = fixAgentNameCasing(toolCall.function.arguments);
}
}
}
if (typeof choice.message?.content === 'string') {
choice.message.content = fixAgentNameCasing(choice.message.content);
}
}
}
if (response.usage) {
const actualTokens = response.usage.total_tokens ||
(response.usage.prompt_tokens || 0) + (response.usage.completion_tokens || 0);
if (actualTokens > 0) {
this.updateRequestTokens(requestId, actualTokens);
}
}
return response;
}
// For streaming responses, intercept and fix agent names
if (response && response.body && typeof response.body.getReader === 'function') {
const originalBody = response.body;
const transformer = this;
const encoder = new TextEncoder();
const decoder = new TextDecoder();
const { readable, writable } = new TransformStream({
transform(chunk, controller) {
try {
let text = decoder.decode(chunk, { stream: true });
const fixedText = fixAgentNameCasing(text);
controller.enqueue(encoder.encode(fixedText));
// Parse SSE for usage info
const lines = fixedText.split('\n');
for (const line of lines) {
if (line.startsWith('data: ') && !line.includes('[DONE]')) {
try {
const data = JSON.parse(line.slice(6));
if (data.usage) {
const actualTokens = data.usage.total_tokens ||
(data.usage.prompt_tokens || 0) + (data.usage.completion_tokens || 0);
if (actualTokens > 0) {
transformer.updateRequestTokens(requestId, actualTokens);
}
}
} catch (e) {
// Not valid JSON, skip
}
}
}
} catch (e) {
controller.enqueue(chunk);
}
}
});
originalBody.pipeTo(writable).catch(() => {});
return new Response(readable, {
status: response.status,
statusText: response.statusText,
headers: response.headers
});
}
} catch (e) {
console.error('[Cerebras] Error processing response:', e.message);
}
return response;
}
}
module.exports = CerebrasTransformer;
module.exports.default = CerebrasTransformer;