fix(server): sync disconnect callbacks #176
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a bug where the TCP transport removed clients without invoking
on_disconnect, which could leave higher-level state out of sync. As a follow-up, removes the redundantserver.state.clientsmirror and relies on the transport-level registry exposed viaserver.get_status().Key changes:
tcp._disconnect_clientsoon_disconnectfires for EOF, read errors, protocol errors, WebSocket CLOSE frames, and keepalive timeouts.server/init.lua; usetcp.luaas the single source of truth for connected clients.server.get_status().client_countinstead of internal state.Fixes #170.
Testing
nix develop .#ci -c make checknix develop .#ci -c make test📋 Implementation Plan
Plan: Investigate + fix Issue #170 (client tables out of sync)
Context / Why
GitHub Issue #170 reports that the WebSocket server keeps two Lua tables of connected clients and they drift out of sync, leaving “phantom” clients in
server/init.luaand causing unbounded growth over time (memory leak / misleading debug state).This should be treated as a bug: the two tables are intended to represent the same set of live connections, but the TCP layer doesn’t reliably notify the higher-level server module on abrupt disconnect paths (EOF/errors/timeouts).
Evidence (what I checked)
lua/claudecode/server/init.luaM.state.clients.on_connect:M.state.clients[client.id] = client(line ~56)on_disconnect:M.state.clients[client.id] = nil(line ~74)lua/claudecode/server/tcp.luaserver.clientsmap.client_tcp:read_start(...), EOF (not data) and read errors call onlyM._remove_client(server, client)and do not callserver.on_disconnect(...)(lines ~125–136).server.on_disconnect(...)and then_remove_client(...)(lines ~139–147)._remove_client(...)withouton_disconnect(lines ~283–297).Why this causes “phantom” clients
server/init.luais only updated throughon_connect/on_disconnect. When the CLI is killed (Ctrl-C), the server usually sees a TCP EOF, not a WebSocket CLOSE frame; the TCP module removes the client fromserver.clientsbut never fireson_disconnect, soM.state.clientsretains a stale entry forever.Plan (fix)
1) Make disconnect notification consistent (core fix)
Goal: whenever a client is removed from
server.clients,server.on_disconnect(client, code, reason)should be invoked exactly once.lua/claudecode/server/tcp.lua, introduce a single helper responsible for cleanup + notification, e.g.:M._disconnect_client(server, client, code, reason)M._remove_client(server, client, code, reason)to optionally notify.server.clients[client.id](only notify/remove if still present). This prevents double-calling if a CLOSE frame is processed and later a TCP EOF arrives.read_starterror branchon_errorbranch) since it currently removes without disconnect notificationstart_ping_timer.Close code/reason conventions (keep simple):
1006with a short reason string (e.g.,"EOF","Client read error: ...","Connection timeout").codeandreason.Defensive checks (per repo preference):
assert(type(server) == "table"),assert(type(server.on_disconnect) == "function"),assert(type(client) == "table" and type(client.id) == "string")in the helper.2) Add regression tests (so it can’t come back)
Because the unit test suite runs with a mocked
vim.loop, we need to simulate the EOF/error callback.tests/mocks/vim.luaTCP stub soread_start(cb)storescbon the handle (e.g.,self._read_cb = cb).tests/unit/tcp_server_spec.lua) that:require("claudecode.server.tcp").create_server(...)with spy callbacks.tcp._handle_new_connection(server).on_connect.client.tcp_handle._read_cb(nil, nil).on_disconnectwas called once andserver.clients[client.id]is nil.Optional (nice-to-have) extra tests:
read_start("some error", nil)to ensure on_disconnect fires on read errors.claudecode.server.client.process_datato force the tcp.lua “protocol error” callback and ensure disconnect notification occurs.3) Clarify design intent (small docs/comments)
The existence of two tables isn’t inherently wrong:
tcp.luaneeds an internal map for transport operations, whileserver/init.luakeeps “server module state”. The bug is that they weren’t kept in sync.tcp.lua: server.clientsis the canonical transport client registry.server/init.lua: M.state.clientsis a mirrored view updated via connect/disconnect callbacks (kept for API/debugging/back-compat).4) Verify
make testandmake check.📋 Implementation Plan (remove client mirror)
Plan: Remove
server/init.luaclient mirror (M.state.clients)Context / Why
After fixing Issue #170’s disconnect notification gap, we should remove the redundant mirrored client registry in
lua/claudecode/server/init.lua(M.state.clients).Keeping two client registries (one canonical in
server/tcp.lua, one mirrored inserver/init.lua) increases maintenance burden and creates a recurring risk of drift (the original root cause of “phantom clients”). The server module already exposes client state viaget_status(), which queries the canonical TCP server registry.Evidence (what I checked)
lua/claudecode/server/init.lua:M.state.clients = {}(initial state)on_connect/on_disconnectM.stop()lua/claudecode/server/tcp.luamaintainsserver.clients(string-id keyed) and is the source of truth for ping/keepalive and transport operations.tests/unit/server_spec.luatests/server_test.luastate.clients(README/ARCHITECTURE/DEVELOPMENT).lua/claudecode/server/mock.luausesM.state.clientsinternally, but current test suite does not appear to import it; still, keeping mock aligned prevents future confusion.Plan
1) Define the new invariant (single source of truth)
Invariant: the only authoritative client registry is
tcp_server.clients(insidelua/claudecode/server/tcp.lua).server/init.luamust not maintain a second registry.require("claudecode.server.init").get_status()(preferred public API)tcp_server.get_client_count(server.state.server)/tcp_server.get_clients_info(server.state.server)where appropriate.Compatibility note (intentional breaking surface)
Removing
server.state.clientsis a behavioral change for any external code that was reaching into internal state. This plan assumesstate.clientsis not a supported public API (no docs mention it).If we decide we need a transition period, add a deprecated accessor like
server.get_clients()or exposestate.clientsas a computed view (no longer updated by callbacks). That would be a separate follow-up decision; for “remove the mirror entirely” we remove the field.2) Update
lua/claudecode/server/init.luato stop mirroringMake
ServerStateno longer includeclients, and remove all mutations.Concrete edits:
clientsfromServerState.clients = {},fromM.state.tcp_server.create_server(...):on_connect: deleteM.state.clients[client.id] = client, keep logging +process_mention_queue(true).on_disconnect: deleteM.state.clients[client.id] = nil, keep logging.M.stop()(or wherever shutdown happens): removeM.state.clients = {}reset.Defensive programming:
M.state.clientsexists (quick crash if they do).3) Update tests to use the public status API instead of internal state
Replace assertions that depend on
server.state.clients.tests/unit/server_spec.lua:type(server.state.clients) == "table"withserver.get_status().client_countexpectations.#server.state.clientsassertions withserver.get_status().client_count.tests/server_test.lua:server.state.clientswithserver.get_status().client_count == 0(or equivalent).Notes:
#on string-keyed tables (it’s undefined behavior); prefer explicit count fields.4) (Optional but recommended) Align
lua/claudecode/server/mock.luawith the new state shapeEven if currently unused by tests, keep the mock structurally similar to the real server.
M.state.clientsfrom the mock.M.state.server.clientsas the mock registry (this is closer to the real server, wherestate.serverholds the TCPServer object).add_client,remove_client,send,broadcast, etc.) to iterateM.state.server.clients.5) Verify
make testandmake check.:lua print(require('claudecode.server.init').get_status().client_count)matches expected.Generated with
mux• Model:openai:gpt-5.2• Thinking:xhigh