Skip to content

Conversation

@ThomasK33
Copy link
Member

@ThomasK33 ThomasK33 commented Jan 23, 2026

Summary

Fixes a bug where the TCP transport removed clients without invoking on_disconnect, which could leave higher-level state out of sync. As a follow-up, removes the redundant server.state.clients mirror and relies on the transport-level registry exposed via server.get_status().

Key changes:

  • Centralize disconnect behavior via tcp._disconnect_client so on_disconnect fires for EOF, read errors, protocol errors, WebSocket CLOSE frames, and keepalive timeouts.
  • Make disconnect handling idempotent to avoid double-callbacks on multi-path teardown.
  • Remove the mirrored client registry from server/init.lua; use tcp.lua as the single source of truth for connected clients.
  • Update tests to assert against server.get_status().client_count instead of internal state.

Fixes #170.

Testing

  • nix develop .#ci -c make check
  • nix develop .#ci -c make test

📋 Implementation Plan

Plan: Investigate + fix Issue #170 (client tables out of sync)

Context / Why

GitHub Issue #170 reports that the WebSocket server keeps two Lua tables of connected clients and they drift out of sync, leaving “phantom” clients in server/init.lua and causing unbounded growth over time (memory leak / misleading debug state).

This should be treated as a bug: the two tables are intended to represent the same set of live connections, but the TCP layer doesn’t reliably notify the higher-level server module on abrupt disconnect paths (EOF/errors/timeouts).

Evidence (what I checked)

  • Issue report: [BUG] Two tables with connected clients #170 (two client tables; out-of-sync after Ctrl-C termination).
  • lua/claudecode/server/init.lua
    • Maintains M.state.clients.
    • Populates it only via callbacks:
      • on_connect: M.state.clients[client.id] = client (line ~56)
      • on_disconnect: M.state.clients[client.id] = nil (line ~74)
  • lua/claudecode/server/tcp.lua
    • Owns the transport-level server.clients map.
    • In client_tcp:read_start(...), EOF (not data) and read errors call only M._remove_client(server, client) and do not call server.on_disconnect(...) (lines ~125–136).
    • In the WebSocket close-frame handler it does call server.on_disconnect(...) and then _remove_client(...) (lines ~139–147).
    • Ping timeout path also calls _remove_client(...) without on_disconnect (lines ~283–297).
Why this causes “phantom” clients

server/init.lua is only updated through on_connect/on_disconnect. When the CLI is killed (Ctrl-C), the server usually sees a TCP EOF, not a WebSocket CLOSE frame; the TCP module removes the client from server.clients but never fires on_disconnect, so M.state.clients retains a stale entry forever.

Plan (fix)

1) Make disconnect notification consistent (core fix)

Goal: whenever a client is removed from server.clients, server.on_disconnect(client, code, reason) should be invoked exactly once.

  • In lua/claudecode/server/tcp.lua, introduce a single helper responsible for cleanup + notification, e.g.:
    • M._disconnect_client(server, client, code, reason)
    • or extend M._remove_client(server, client, code, reason) to optionally notify.
  • Ensure the helper is idempotent by guarding on server.clients[client.id] (only notify/remove if still present). This prevents double-calling if a CLOSE frame is processed and later a TCP EOF arrives.
  • Route all removal paths through this helper:
    • TCP read_start error branch
    • TCP EOF branch
    • WebSocket CLOSE frame callback (replace the current “call on_disconnect then _remove_client” with the helper)
    • WebSocket protocol/error callback (on_error branch) since it currently removes without disconnect notification
    • Keepalive timeout branch in start_ping_timer.

Close code/reason conventions (keep simple):

  • EOF / read error / protocol error / keepalive timeout: use WebSocket “abnormal closure” 1006 with a short reason string (e.g., "EOF", "Client read error: ...", "Connection timeout").
  • WebSocket CLOSE frame: preserve provided code and reason.

Defensive checks (per repo preference):

  • Add assert(type(server) == "table"), assert(type(server.on_disconnect) == "function"), assert(type(client) == "table" and type(client.id) == "string") in the helper.

2) Add regression tests (so it can’t come back)

Because the unit test suite runs with a mocked vim.loop, we need to simulate the EOF/error callback.

  • Update tests/mocks/vim.lua TCP stub so read_start(cb) stores cb on the handle (e.g., self._read_cb = cb).
  • Add a new unit test file (e.g., tests/unit/tcp_server_spec.lua) that:
    1. Creates a TCP server via require("claudecode.server.tcp").create_server(...) with spy callbacks.
    2. Manually calls tcp._handle_new_connection(server).
    3. Captures the connected client via on_connect.
    4. Simulates abrupt termination by invoking client.tcp_handle._read_cb(nil, nil).
    5. Asserts on_disconnect was called once and server.clients[client.id] is nil.

Optional (nice-to-have) extra tests:

  • Simulate read_start("some error", nil) to ensure on_disconnect fires on read errors.
  • Stub claudecode.server.client.process_data to force the tcp.lua “protocol error” callback and ensure disconnect notification occurs.

3) Clarify design intent (small docs/comments)

The existence of two tables isn’t inherently wrong: tcp.lua needs an internal map for transport operations, while server/init.lua keeps “server module state”. The bug is that they weren’t kept in sync.

  • Add a short comment near the type annotations in both files clarifying:
    • tcp.lua: server.clients is the canonical transport client registry.
    • server/init.lua: M.state.clients is a mirrored view updated via connect/disconnect callbacks (kept for API/debugging/back-compat).

4) Verify

  • Run make test and make check.
  • Manual sanity check in real Neovim:
    • Start a Claude Code session, then terminate it via Ctrl-C.
    • Confirm both registries report no remaining clients (and no growth over repeated connect/kill cycles).
📋 Implementation Plan (remove client mirror)

Plan: Remove server/init.lua client mirror (M.state.clients)

Context / Why

After fixing Issue #170’s disconnect notification gap, we should remove the redundant mirrored client registry in lua/claudecode/server/init.lua (M.state.clients).

Keeping two client registries (one canonical in server/tcp.lua, one mirrored in server/init.lua) increases maintenance burden and creates a recurring risk of drift (the original root cause of “phantom clients”). The server module already exposes client state via get_status(), which queries the canonical TCP server registry.

Evidence (what I checked)

  • Runtime mirror exists only in lua/claudecode/server/init.lua:
    • M.state.clients = {} (initial state)
    • Updated only in on_connect / on_disconnect
    • Cleared in M.stop()
  • Canonical registry is transport-owned:
    • lua/claudecode/server/tcp.lua maintains server.clients (string-id keyed) and is the source of truth for ping/keepalive and transport operations.
  • Tests currently assert on the mirror:
    • tests/unit/server_spec.lua
    • tests/server_test.lua
  • No docs mention state.clients (README/ARCHITECTURE/DEVELOPMENT).
  • lua/claudecode/server/mock.lua uses M.state.clients internally, but current test suite does not appear to import it; still, keeping mock aligned prevents future confusion.

Plan

1) Define the new invariant (single source of truth)

Invariant: the only authoritative client registry is tcp_server.clients (inside lua/claudecode/server/tcp.lua). server/init.lua must not maintain a second registry.

  • Client count/info should be obtained via:
    • require("claudecode.server.init").get_status() (preferred public API)
    • or tcp_server.get_client_count(server.state.server) / tcp_server.get_clients_info(server.state.server) where appropriate.
Compatibility note (intentional breaking surface)

Removing server.state.clients is a behavioral change for any external code that was reaching into internal state. This plan assumes state.clients is not a supported public API (no docs mention it).

If we decide we need a transition period, add a deprecated accessor like server.get_clients() or expose state.clients as a computed view (no longer updated by callbacks). That would be a separate follow-up decision; for “remove the mirror entirely” we remove the field.

2) Update lua/claudecode/server/init.lua to stop mirroring

Make ServerState no longer include clients, and remove all mutations.

Concrete edits:

  • Remove the EmmyLua field annotation for clients from ServerState.
  • Remove clients = {}, from M.state.
  • In the TCP callbacks passed to tcp_server.create_server(...):
    • on_connect: delete M.state.clients[client.id] = client, keep logging + process_mention_queue(true).
    • on_disconnect: delete M.state.clients[client.id] = nil, keep logging.
  • In M.stop() (or wherever shutdown happens): remove M.state.clients = {} reset.

Defensive programming:

  • Ensure any remaining code paths don’t assume M.state.clients exists (quick crash if they do).

3) Update tests to use the public status API instead of internal state

Replace assertions that depend on server.state.clients.

  • tests/unit/server_spec.lua:
    • Replace type(server.state.clients) == "table" with server.get_status().client_count expectations.
    • Replace #server.state.clients assertions with server.get_status().client_count.
  • tests/server_test.lua:
    • Replace “empty after stop” checks based on server.state.clients with server.get_status().client_count == 0 (or equivalent).

Notes:

  • Avoid using # on string-keyed tables (it’s undefined behavior); prefer explicit count fields.

4) (Optional but recommended) Align lua/claudecode/server/mock.lua with the new state shape

Even if currently unused by tests, keep the mock structurally similar to the real server.

  • Remove M.state.clients from the mock.
  • Use M.state.server.clients as the mock registry (this is closer to the real server, where state.server holds the TCPServer object).
  • Update mock helpers (add_client, remove_client, send, broadcast, etc.) to iterate M.state.server.clients.

5) Verify

  • Run make test and make check.
  • Manual sanity check in Neovim:
    • Start a session, confirm :lua print(require('claudecode.server.init').get_status().client_count) matches expected.
    • Connect/kill the CLI repeatedly and confirm client_count returns to 0 and does not grow.

Generated with mux • Model: openai:gpt-5.2 • Thinking: xhigh

Change-Id: I53259ea2f51407e4bde33258ba0639cf00b2c29a
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Change-Id: Ieee0d2842aa3ca84645dae4cd2552c7ee48dde6d
Signed-off-by: Thomas Kosiewski <tk@coder.com>
@ThomasK33
Copy link
Member Author

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. Keep it up!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Change-Id: If645c9f567afe7dbbf64c1dcd6f1c252ea18d266
Signed-off-by: Thomas Kosiewski <tk@coder.com>
@ThomasK33 ThomasK33 merged commit aa9a5ce into main Jan 26, 2026
3 checks passed
@ThomasK33 ThomasK33 deleted the fix/170-disconnect-sync branch January 26, 2026 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Two tables with connected clients

1 participant