Member

@Evanev7 Evanev7 commented Dec 19, 2025

Motivation

Information gathering is tightly coupled to MacMon - we should start generalizing our information sources so we can add more in future.

Changes

Added a new system to gather any information. Currently, it is attached to the Worker - though this is mostly to keep the data processing logic simple. It could be made independent quite easily.

I also refactored topology to include different kinds of connections, since we can gather RDMA connections without a pre-existing socket connection, and made the relevant placement updates. We should no longer need the network locations script in the app.

Other sources of information now include:

  • static node information like "model" and "chip" (macOS, "Unknown" fallback)
  • device friendly name (macOS, falls back to device hostname)
  • network interfaces + IPs (cross platform)
  • Thunderbolt interfaces (macOS)
  • Thunderbolt connections (macOS)
  • RAM usage (cross platform)
  • per-device configuration written to EXO_HOME/config.toml

Limitations

The current events added by the InfoGatherer are much too broad and don't follow proper Pydantic validation.

Model and Chip are not cross platform concepts.

We do not differentiate between unified and non-unified memory systems. This should be added to static information ASAP.

A lot of this data collection is based on simple timers. Watching the SC store on macOS is the correct way to gather some of this information, but requires a detour into Rust for macOS.

Why It Works

The InfoGatherer is a generic subsystem which returns a union of metric datatypes. It writes them to an event, which is applied to state. It is currently re-spawned with the worker so each cluster receives the correct information.

As for topology, macOS identifies Thunderbolt (TB) ports with a UUID in SPThunderboltDataType, and also stores remote UUIDs if it can find them. These changes read that data with system_profiler, hopefully not so often as to noticeably affect performance (though this should be tuned) but frequently enough for moderate responsiveness.
As we can identify TB connections between devices without needing IPs attached to each interface, we can remove the network setup script (almost) completely.
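
A minimal sketch of reading that data follows. `system_profiler SPThunderboltDataType -json` is the real macOS command, but the JSON key names used by the parser (`domain_uuid_key`, `_items`) are assumptions made for illustration and should be checked against real output. The function names are hypothetical.

```python
import json
import subprocess

# ASSUMPTION: these key names are illustrative and unverified.
UUID_KEY = "domain_uuid_key"
REMOTE_ITEMS_KEY = "_items"


def read_thunderbolt_report() -> dict:
    """Run system_profiler (macOS only) and return its parsed JSON report."""
    result = subprocess.run(
        ["system_profiler", "SPThunderboltDataType", "-json"],
        capture_output=True,
        check=True,
        text=True,
    )
    return json.loads(result.stdout)


def thunderbolt_links(report: dict) -> list[tuple[str, str]]:
    """Return (local_uuid, remote_uuid) pairs for ports with a visible peer."""
    links: list[tuple[str, str]] = []
    for port in report.get("SPThunderboltDataType", []):
        local_uuid = port.get(UUID_KEY)
        if local_uuid is None:
            continue
        for remote in port.get(REMOTE_ITEMS_KEY, []):
            remote_uuid = remote.get(UUID_KEY)
            if remote_uuid is not None:
                links.append((local_uuid, remote_uuid))
    return links
```

Because the pairs are keyed by UUID rather than by address, a link can be recorded even when neither interface has an IP assigned.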

Test Plan

Manual Testing

Spawn RDMA instances without enabling DHCP on the RDMA interfaces.

Automated Testing

Updated the current master and shared tests to cover the topology refactor and new events.

@Evanev7 Evanev7 requested a review from rltakashige December 19, 2025 18:36
@JakeHillion JakeHillion self-requested a review December 20, 2025 13:36
@exo-explore exo-explore deleted a comment from bigrich05 Dec 21, 2025
@exo-explore exo-explore deleted a comment from bigrich05 Dec 21, 2025
@Evanev7 Evanev7 force-pushed the gather-linux-info branch 6 times, most recently from c5fd38b to b48fc01 Compare December 22, 2025 02:29
Collaborator

@rltakashige rltakashige left a comment

I want to re-review this not at 3am, but it looks good to me on first pass. I've added some minor comments.

class SocketConnection(FrozenModel):
    sink_multiaddr: Multiaddr

    def __hash__(self):
Collaborator

Why does SocketConnection need a hash where RDMAConnection does not?

Member Author

The Multiaddr type is unhashable as-is, where the pair of strings is hashable. I could fix this instead by making Multiaddr inherit from FrozenModel.

Collaborator

Would rather that.

Member Author

oh! the port should not be part of the hash - two socketconnections with the same port are equivalent data. That does mean I should just store the ipaddress here, but we don't have a dedicated ipaddress type on this branch and we just mock multiaddrs everywhere instead.
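
One way to read this is that connections differing only in port should compare and hash as equal. A minimal sketch under that assumption, using a stdlib frozen dataclass as a stand-in for FrozenModel and a plain string as a hypothetical placeholder for a dedicated IP address type:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SocketConnection:
    # Hypothetical fields; the branch stores a Multiaddr instead.
    ip_address: str
    # compare=False drops the port from both __eq__ and __hash__.
    port: int = field(compare=False)
```

With this shape, `SocketConnection("10.0.0.1", 5000)` and `SocketConnection("10.0.0.1", 6000)` collapse to a single entry in a set, and no hand-written `__hash__` is needed.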

Member Author

gm btw haha

        profile.chip_id = info.chip
    # TODO: makes me slightly sad
    case Sequence():
        if info != []:
Collaborator

if event == []: break, just to reduce even one indent here.

Member Author

can't break out of a match. I just realised that there are some legit meanings of [] though, so I can't merge this until I've split up these events properly. no network interfaces is possible and is distinct from no TBConnections, and both should be applied to state.

Collaborator

Nice catch.

@Evanev7 Evanev7 force-pushed the gather-linux-info branch 2 times, most recently from e56de1a to b5319d6 Compare December 27, 2025 12:00
@Evanev7 Evanev7 force-pushed the gather-linux-info branch 6 times, most recently from ec02c75 to dfd866a Compare January 9, 2026 16:33
Evanev7 and others added 22 commits January 19, 2026 14:51
Dashboard fixes (TypeScript errors from `npm run check`):
- TopologyGraph.svelte: remove reference to deleted sendBackMultiaddr
  property, fix type inference for debug edge labels
- ModelCard.svelte: add missing topoWidth/topoHeight to early return
- +page.svelte: fix nested property access for deviceRank

Backend fix:
- info_gatherer.py: send initial MiscData on startup so friendly name
  appears immediately instead of showing "Unknown" until it changes

Co-Authored-By: Claude Opus 4.5 <[email protected]>

The API changed topology.connections from an array to a nested mapping:
{ source: { sink: [SocketConnection | RDMAConnection] } }

- Update type definitions for RawSocketConnection and RawRDMAConnection
- Update transformTopology to iterate over nested mapping structure
- Handle snake_case ip_address from Multiaddr computed fields

Co-Authored-By: Claude Opus 4.5 <[email protected]>
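
For illustration, iterating the nested `{ source: { sink: [connection, ...] } }` mapping described in the commit above can be sketched as follows. This is in Python rather than the dashboard's TypeScript, and `iter_edges` is a hypothetical helper, not a function from the codebase:

```python
from collections.abc import Iterator
from typing import Any


def iter_edges(
    connections: dict[str, dict[str, list[Any]]],
) -> Iterator[tuple[str, str, Any]]:
    """Flatten the nested mapping into (source, sink, connection) triples."""
    for source, sinks in connections.items():
        for sink, edges in sinks.items():
            for edge in edges:
                yield source, sink, edge
```

A pair of nodes can carry several connections of different kinds, which is why the innermost value is a list rather than a single entry.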
Member

@JakeHillion JakeHillion left a comment

gogogo

Let's get this into main and start iterating! (please remember to squash merge lol)

@Evanev7 Evanev7 enabled auto-merge (squash) January 19, 2026 16:54
@Evanev7 Evanev7 merged commit 2202685 into main Jan 19, 2026
8 checks passed
@Evanev7 Evanev7 deleted the gather-linux-info branch January 19, 2026 16:58
7 participants