214 changes: 214 additions & 0 deletions README.md
@@ -515,6 +515,212 @@ Binary messages should be rejected if there is no active stream.

The timestamp indicates when the first audio sample in this chunk should be output. Clients must translate this server timestamp to their local clock using the offset computed from clock synchronization. Clients should compensate for any known processing delays (e.g., DAC latency, audio buffer delays, amplifier delays) by accounting for these delays when submitting audio to the hardware.

## Sources

Sendspin can also represent **audio inputs** (e.g., AUX/line-in, turntable preamp, Bluetooth receiver, microphone/voice satellite) as first-class, selectable **sources**.

A **source** is implemented as a Sendspin client role that streams audio **to** the server. The server remains the single place that performs heavy work such as resampling, transcoding, equalization, mixing, buffering, visualization and distribution to output players.

Sources are intended to be simple:
- capture/encode audio
- optionally provide basic signal presence information (level / line sensing)
- stream audio frames with timestamps

A device may implement both `source` and `player` roles (e.g., a speaker with a local AUX input that can be forwarded into Sendspin).

The server may also expose **built-in inputs** (e.g., a line-in on the server host, or an HDMI capture device connected to the server) as a **virtual source client**. Virtual sources participate in the same source selection and state model as regular source clients and appear in the controller `sources` list.

## Source messages

This section describes messages specific to clients with the `source` role, which capture audio from a local input and stream it to the server.

A source client uses the same clock synchronization mechanism as all clients. Binary source audio messages are timestamped in the **server time domain** using the clock offset learned from `client/time`/`server/time`.
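As a rough illustration of the timestamping rule above, the sketch below translates a local capture time into the server time domain. The helper name `to_server_time_us` and the sign convention (`offset_us = server_time - client_time`) are assumptions made for this example; the spec only states that the offset is learned from `client/time`/`server/time`.

```python
def to_server_time_us(local_capture_us: int, offset_us: int) -> int:
    """Translate a local capture timestamp (microseconds) to server time.

    Assumption: offset_us is defined as server_time - client_time, as
    estimated by the clock synchronization exchange.
    """
    return local_capture_us + offset_us


# Hypothetical values: sample captured at local t = 1 s, server clock 1.25 ms ahead.
server_ts = to_server_time_us(1_000_000, 1_250)
```

If an implementation defines the offset with the opposite sign, the addition becomes a subtraction; what matters is that every chunk timestamp ends up in the server time domain.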

### Client → Server: `client/hello` source@v1 support object

The `source@v1_support` object in [`client/hello`](#client--server-clienthello) has this structure:

- `source@v1_support`: object
  - `supported_formats`: object[] - list of supported capture/encode formats in priority order (first is preferred)
    - `codec`: 'opus' | 'flac' | 'pcm' - codec identifier
    - `channels`: integer - number of channels (e.g., 1 = mono, 2 = stereo)
    - `sample_rate`: integer - sample rate in Hz (e.g., 44100, 48000)
    - `bit_depth`: integer - bit depth (e.g., 16, 24)
  - `controls?`: string[] - optional source control commands supported by this client (subset of: 'play' | 'pause' | 'next' | 'previous' | 'activate' | 'deactivate')
  - `features?`: object - optional feature hints
    - `level?`: boolean - true if source reports `level`
    - `line_sense?`: boolean - true if source reports `signal`

**Note:** Servers must support all audio codecs: 'opus', 'flac', and 'pcm'.
**Note:** Servers should offer only the `supported_formats` options and avoid requesting unsupported formats.

Example `client/hello` excerpt:
```json
{
  "type": "client/hello",
  "payload": {
    "client_id": "kitchen-linein",
    "name": "Kitchen Line-In",
    "version": 1,
    "supported_roles": ["source@v1"],
    "source@v1_support": {
      "supported_formats": [
        {
          "codec": "opus",
          "channels": 2,
          "sample_rate": 48000,
          "bit_depth": 16
        },
        {
          "codec": "pcm",
          "channels": 2,
          "sample_rate": 48000,
          "bit_depth": 16
        }
      ],
      "controls": ["play", "pause", "next", "previous", "activate", "deactivate"],
      "features": {
        "line_sense": true,
        "level": true
      }
    }
  }
}
```
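On the server side, honoring the priority order of `supported_formats` could look roughly like the sketch below. The function `choose_format` and its `server_codecs` parameter are hypothetical names for this example, not part of the spec.

```python
def choose_format(supported_formats, server_codecs=("opus", "flac", "pcm")):
    """Return the first client-advertised format the server is willing to ingest.

    The list is walked in order because the client sorts it by preference.
    Returns None if nothing matches.
    """
    for fmt in supported_formats:
        if fmt["codec"] in server_codecs:
            return fmt
    return None


# Formats taken from the client/hello excerpt above.
advertised = [
    {"codec": "opus", "channels": 2, "sample_rate": 48000, "bit_depth": 16},
    {"codec": "pcm", "channels": 2, "sample_rate": 48000, "bit_depth": 16},
]
preferred = choose_format(advertised)
```

Because servers must support all three codecs, the default `server_codecs` accepts everything; the parameter only exists to show how a server with a narrower policy would still stay within the advertised options.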

### Client → Server: `client/state` source object

The `source` object in [`client/state`](#client--server-clientstate) has this structure:

- `source`: object
  - `state`: 'idle' | 'streaming' | 'error'
  - `level?`: number - optional normalized RMS/peak level (0.0-1.0), only if 'level' is supported
  - `signal?`: 'unknown' | 'present' | 'absent' - optional line sensing/signal presence, only if 'line_sense' is supported
Comment on lines +597 to +598
Member

What is the use case of unknown?
Maybe I'm missing something, but couldn't the client just set line_sense to false?

Author

It isn’t strictly required. We could simplify by only using present/absent, and treat signal as “unknown” when it’s omitted (or when line_sense=false).

The only reason to keep unknown is semantic clarity for clients that do support line sensing but can’t determine it yet (startup, device not ready, no samples). If we want to keep the spec minimal, dropping unknown is perfectly fine.

Member

I think we can drop unknown here to keep it simple.
For the server it doesn't really matter if the state is either unknown or absent since I imagine most implementations would treat it the same.


Example `client/state` excerpt:
```json
{
  "type": "client/state",
  "payload": {
    "source": {
      "state": "streaming",
      "signal": "present",
      "level": 0.42
    }
  }
}
```
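One way a client might derive `level` and `signal` is sketched below. It assumes 16-bit little-endian signed PCM capture buffers and a presence threshold of 0.01; both are illustrative choices for this example, and the names `rms_level` and `build_source_state` are hypothetical.

```python
import math
import struct


def rms_level(pcm_bytes: bytes) -> float:
    """Normalized RMS (0.0-1.0) of 16-bit little-endian signed PCM samples."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    mean_square = sum(s * s for s in samples) / count
    return min(1.0, math.sqrt(mean_square) / 32768.0)


def build_source_state(pcm_bytes: bytes, state: str = "streaming",
                       threshold: float = 0.01) -> dict:
    """Build a client/state message; the threshold is an assumed value."""
    level = rms_level(pcm_bytes)
    return {
        "type": "client/state",
        "payload": {
            "source": {
                "state": state,
                "signal": "present" if level >= threshold else "absent",
                "level": round(level, 4),
            }
        },
    }
```

A real client would likely smooth the level over several buffers and apply a hold time before flipping `signal`, so that short gaps between tracks do not register as `absent`.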

### Client → Server: `client/command` source object

Source clients may send commands to inform the server about user-initiated capture actions (implementation-defined).

- `source`: object
  - `command`: 'started' | 'stopped'

### Server → Client: `server/command` source object

The `source` object in [`server/command`](#server--client-servercommand) has this structure:

- `source`: object
  - `command?`: 'start' | 'stop'
  - `control?`: 'play' | 'pause' | 'next' | 'previous' | 'activate' | 'deactivate' - optional source control command; ignored if unsupported by the client
  - `vad?`: object - optional VAD settings hint
    - `threshold_db?`: number - signal threshold in dB
    - `hold_ms?`: integer - hold time in milliseconds

All fields are optional. The server may send any subset (`command`, `control`, and/or `vad`) in one message.

#### Source command semantics

- `command` controls Sendspin ingest lifecycle for this source:
  - `start`: server requests ingest to become active. The client should transition to `state: "streaming"`, send `input_stream/start`, and then send source audio chunks.
  - `stop`: server requests ingest to become inactive. The client should send `input_stream/end`, stop sending source audio chunks, and transition to `state: "idle"`.
- `control` is optional upstream-device control intent and only applies when advertised in `source@v1_support.controls`.
  - `play` | `pause` | `next` | `previous`: control content playback behavior on the upstream source device (if supported).
Contributor

If we include this, then we should also include the state of the source? (Basically the same info we send about Sendspin being played)

I wonder if this is scope creep and we shouldn't include this for now.

What is the use case?

  - `activate` | `deactivate`: prepare or power-manage the upstream source path (for example power on/off, wake/sleep, input enable/disable).
Contributor

How would the server know when to call this? Why can't the source do this automatically on play using a hook?


`start`/`stop` and `play`/`pause` are independent:

- `start`/`stop` govern whether Sendspin ingest is active.
- `play`/`pause` govern upstream content playback behavior.

#### Default ingest behavior

- Effective default after handshake is `stop` (ingest inactive).
- Server ingest interest is represented by `command: "start"` / `command: "stop"`.
- Server implementations should ignore/drop source binary chunks while ingest is not active.
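The default-stop rule and the drop-while-inactive rule can be pictured as a small server-side gate. The class below is only a sketch: `SourceIngest` and its method names are hypothetical, and a real server would attach this state to each connected source client.

```python
class SourceIngest:
    """Tracks server ingest interest for one source (illustrative only)."""

    def __init__(self):
        self.active = False        # effective default after handshake: stop
        self.stream_format = None  # set by input_stream/start
        self.dropped = 0           # chunks ignored while not ingesting

    def command(self, cmd: str) -> None:
        """Record a server-issued source command ('start' or 'stop')."""
        if cmd == "start":
            self.active = True
        elif cmd == "stop":
            self.active = False
            self.stream_format = None
        else:
            raise ValueError(f"unknown source command: {cmd!r}")

    def on_stream_start(self, fmt: dict) -> None:
        """Handle input_stream/start; ignored while ingest is inactive."""
        if self.active:
            self.stream_format = fmt

    def on_stream_end(self) -> None:
        """Handle input_stream/end."""
        self.stream_format = None

    def on_chunk(self, chunk: bytes) -> bool:
        """Return True if the chunk should be processed, False if dropped."""
        if not self.active or self.stream_format is None:
            self.dropped += 1
            return False
        return True
```

Dropping silently (rather than closing the connection) is one possible policy; it keeps a misbehaving or eager source from disturbing the rest of the session.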

#### `vad` semantics

`vad` is an optional server hint for source-side line-sense behavior (`threshold_db`, `hold_ms`). It allows centralized tuning and consistent behavior across sources/groups. Clients may ignore unsupported hints.
Contributor

This feels out of scope and more like a fleet management feature. This should just be locally configured?

Member

I'm also in favor of moving the VAD configuration outside the Sendspin protocol.
What we could do is recommend specific threshold_db and hold_ms values, but using those shouldn't be mandatory.


Example `server/command` to start capture:
```json
{
  "type": "server/command",
  "payload": {
    "source": {
      "command": "start"
    }
  }
}
```

### Client → Server: `input_stream/start`

The `input_stream/start` message announces the active input stream format and provides any required codec header data.

- `source`: object
  - `codec`: 'opus' | 'flac' | 'pcm'
  - `channels`: integer
  - `sample_rate`: integer
  - `bit_depth`: integer
  - `codec_header?`: string - Base64 encoded codec header (required for Opus/FLAC)

Example `input_stream/start`:
```json
{
  "type": "input_stream/start",
  "payload": {
    "source": {
      "codec": "flac",
      "channels": 2,
      "sample_rate": 48000,
      "bit_depth": 16,
      "codec_header": "BASE64..."
    }
  }
}
```
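A client could assemble this message along the lines of the sketch below. The function name `build_input_stream_start` is hypothetical, and the header bytes in the usage line are placeholder data rather than a valid FLAC header.

```python
import base64
import json


def build_input_stream_start(codec, channels, sample_rate, bit_depth,
                             codec_header=None):
    """Serialize an input_stream/start message as a JSON string.

    codec_header is raw bytes from the encoder; it is Base64-encoded here.
    """
    source = {
        "codec": codec,
        "channels": channels,
        "sample_rate": sample_rate,
        "bit_depth": bit_depth,
    }
    if codec_header is not None:
        source["codec_header"] = base64.b64encode(codec_header).decode("ascii")
    return json.dumps({"type": "input_stream/start", "payload": {"source": source}})


# Placeholder header bytes, for illustration only.
message = build_input_stream_start("flac", 2, 48000, 16, codec_header=b"fLaC")
```

For PCM the header is simply omitted, which is why `codec_header` defaults to `None`.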

### Server → Client: `input_stream/request-format`
Contributor

One of the things I miss in the spec is what if a client starts streaming data to the server, but the server doesn't care?

We would want a way to specify that. Maybe a request-format message to ask for a `none` codec, and that is the default at start?

Contributor

The use cases I want to make sure are covered by this role:

  1. A user starts a turntable connected to a Sendspin source device. The server starts playing the music in the same room as the turntable the moment the turntable starts playing, without interaction from the user on the Sendspin source device or Sendspin server.
  2. A user has the output of their computer available as a source in Sendspin, and from the Sendspin server can say: start streaming this source to speakers.

Author

Great point. This is already the intended behavior, and it is how the implementation works.

The model is:

- `source.command` (`start` / `stop`) defines server ingest interest.
- Default is effectively not interested (`stop`) until the server sends `start`.
- A source should only send media after `start` and `input_stream/start`.
- If a source sends chunks while not started, the server drops/ignores them.

So we do not need a `none` codec to represent “server doesn’t care”; `stop` already covers that.

For the two use cases:

Turntable auto-start

  • Source reports signal/state (signal: present, optional started event).
  • Server policy decides to ingest and sends command: "start".
  • Audio is routed immediately to the target room.

Computer output selectable from server

  • User selects source in server UI/control plane.
  • Server sends command: "start" to that source.
  • Source sends input_stream/start + chunks.
  • Server sends command: "stop" when done.

I agree we should make this explicit in the spec text (default stop + ignore/drop when not started), but no new mechanism is required.

Member

I have another idea for the turntable use case: can't the client just directly start the input_stream?
If we require the server to immediately switch to the Turntable once it's detected, we don't need the negotiation with checking for client/state.source.signal = 'present'.

I think it's a good idea to make this explicit, can you update the text @rudyberends ?


The server can request a different input stream format. Clients should respond by reconfiguring capture (if supported) and sending a new `input_stream/start` with the updated format and header.

- `source`: object
  - `codec?`: 'opus' | 'flac' | 'pcm'
  - `channels?`: integer
  - `sample_rate?`: integer
  - `bit_depth?`: integer
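Since every field in the request is optional, a client has to merge the request into its current format before reconfiguring. The sketch below shows one possible approach; `apply_format_request` is a hypothetical helper, and keeping the current format when the result was never advertised is an assumption, not something the spec mandates.

```python
FORMAT_KEYS = ("codec", "channels", "sample_rate", "bit_depth")


def apply_format_request(current: dict, request: dict, supported_formats: list) -> dict:
    """Merge a partial request-format payload into the current stream format.

    Returns the merged format if the client advertised it in supported_formats,
    otherwise returns the current format unchanged (assumed fallback policy).
    """
    candidate = dict(current)
    for key in FORMAT_KEYS:
        if request.get(key) is not None:
            candidate[key] = request[key]
    for fmt in supported_formats:
        if all(fmt.get(key) == candidate.get(key) for key in FORMAT_KEYS):
            return candidate
    return current


supported = [
    {"codec": "opus", "channels": 2, "sample_rate": 48000, "bit_depth": 16},
    {"codec": "pcm", "channels": 2, "sample_rate": 48000, "bit_depth": 16},
]
new_format = apply_format_request(supported[0], {"codec": "pcm"}, supported)
```

Either way, the client would follow up with a new `input_stream/start` describing whatever format it actually ends up using.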

### Client → Server: `input_stream/end`

The client ends the current input stream. After this message, no more source audio chunks should be sent until a new `input_stream/start`.

### Client → Server: Source Audio Chunks (Binary)

Binary messages should be rejected by the server if the source is not in `state: 'streaming'`.
Clients must send `input_stream/start` before the first audio chunk.

- Byte 0: message type `12` (uint8)
- Bytes 1-8: timestamp (big-endian int64) - server clock time in microseconds when the first sample was captured
- Rest of bytes: encoded audio frame

The timestamp indicates when the first audio sample in this chunk was captured (in server time domain). The server may resample/transcode and then distribute the audio to players with its normal buffering and synchronization strategy.

**Note:** Source timestamps are derived from the client's clock offset and may show small discontinuities or drift (e.g., ADC clock variance). Server implementations should not assume perfectly continuous timestamps; the audio sample stream itself should remain continuous.
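The binary layout above maps directly onto a fixed 9-byte header followed by the encoded frame. The sketch below packs and parses such a message with Python's `struct` module; the function names are hypothetical, and raising `ValueError` on malformed input is an illustrative choice.

```python
import struct

SOURCE_AUDIO_CHUNK = 12
# Big-endian: uint8 message type, int64 timestamp in microseconds (9 bytes total).
_HEADER = struct.Struct(">Bq")


def pack_source_chunk(timestamp_us: int, frame: bytes) -> bytes:
    """Build a binary source audio chunk message."""
    return _HEADER.pack(SOURCE_AUDIO_CHUNK, timestamp_us) + frame


def unpack_source_chunk(message: bytes):
    """Parse a binary source audio chunk; returns (timestamp_us, frame)."""
    if len(message) < _HEADER.size:
        raise ValueError("message shorter than 9-byte header")
    msg_type, timestamp_us = _HEADER.unpack_from(message)
    if msg_type != SOURCE_AUDIO_CHUNK:
        raise ValueError(f"unexpected message type: {msg_type}")
    return timestamp_us, message[_HEADER.size:]


# Hypothetical timestamp and frame bytes.
wire = pack_source_chunk(1_700_000_000_000_000, b"\x01\x02\x03")
```

The timestamp passed to `pack_source_chunk` is expected to be in the server time domain already, so the clock offset is applied before packing.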

## Controller messages
This section describes messages specific to clients with the `controller` role, which enables the client to control the Sendspin group this client is part of, and switch between groups.

@@ -584,6 +790,14 @@ The `controller` object in [`server/state`](#server--client-serverstate) has thi
- `supported_commands`: string[] - subset of: 'play' | 'pause' | 'stop' | 'next' | 'previous' | 'volume' | 'mute' | 'repeat_off' | 'repeat_one' | 'repeat_all' | 'shuffle' | 'unshuffle' | 'switch'
- `volume`: integer - volume of the whole group, range 0-100
- `muted`: boolean - mute state of the whole group
- `sources?`: object[] - list of available/known sources on the server
Member

Let's remove the select_source command from this PR.
If we include this, it should rather be part of a future role, since it adds quite a lot of data for basic controller use cases.

Just an idea: Maybe that future role will also allow you to see your library and select an album or playlist for playback? But that's something for later.

Author

Agreed — removed select_source from this PR and left it for a future “media/inputs” role. The reference implementation has been updated accordingly (no controller command, no select/clear CLI; only source listing remains).

Member

Can you drop sources here as well?
I think this also belongs to that future "media/inputs" role since on its own it doesn't really bring helpful information to the clients.

  - `id`: string - stable identifier of the source (typically the source `client_id`)
  - `name`: string - friendly name
  - `state`: 'idle' | 'streaming' | 'error'
  - `signal?`: 'unknown' | 'present' | 'absent' - optional line sensing/signal presence
  - `selected?`: boolean - whether this source is currently selected for this group
  - `last_event?`: 'started' | 'stopped' - last source event (optional)
  - `last_event_ts_us?`: integer - server time in microseconds for last event (optional)

**Reading group volume:** Group volume is calculated as the average of all player volumes in the group.
