
Introduce telemetry for observability #117

Open
Anyitechs wants to merge 3 commits into lightningdevkit:main from Anyitechs:introduce-telemetry

Conversation

@Anyitechs (Contributor) commented Jan 27, 2026

This introduces the foundational telemetry infrastructure to improve
the observability of LDK Server.

It adds a new /metrics endpoint exposed on the REST service address,
which serves Prometheus-compatible metrics. This endpoint is public and
does not require HMAC authentication, allowing for easy integration with
monitoring systems.

  • Added a Metrics utility struct to hold all the metrics we need to
    expose.

This is the first step in a larger effort to provide comprehensive telemetry.
Future updates will expand this to include other detailed metrics for channels,
balances, payments, etc.

Related issue #38

@ldk-reviews-bot commented Jan 27, 2026

👋 Thanks for assigning @benthecarman as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@Anyitechs force-pushed the introduce-telemetry branch 2 times, most recently from 8692c8c to 56b5e4a on January 27, 2026 01:51
@Anyitechs marked this pull request as ready for review on January 27, 2026 01:59
chrono = { version = "0.4", default-features = false, features = ["clock"] }
log = "0.4.28"
base64 = { version = "0.21", default-features = false, features = ["std"] }
lazy_static = "1.5.0"
Collaborator

Please avoid taking this dependency and rather use one of the std::sync primitives (Once/OnceLock/LazyLock), if we need this at all.
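The suggested swap to `std::sync::OnceLock` could look like this minimal sketch (the `Metrics` struct and its field names here are hypothetical, not the PR's actual types):

```rust
use std::sync::atomic::AtomicI64;
use std::sync::OnceLock;

// Hypothetical metrics container; field names are illustrative only.
pub struct Metrics {
    pub peer_count: AtomicI64,
}

// OnceLock from std replaces lazy_static: the value is initialized on
// first access, with no extra dependency required.
static METRICS: OnceLock<Metrics> = OnceLock::new();

pub fn metrics() -> &'static Metrics {
    METRICS.get_or_init(|| Metrics { peer_count: AtomicI64::new(0) })
}
```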

log = "0.4.28"
base64 = { version = "0.21", default-features = false, features = ["std"] }
lazy_static = "1.5.0"
prometheus = "0.14.0"
Collaborator

Do we really need this dependency, or can we just reimplement it easily locally?

If we need it, this should be at the very least made optional behind the metrics feature, and disable any default features.


Yea, it's a plain text file with some numbers; I don't think we need to take a dep :)

Contributor Author

Do we really need this dependency, or can we just reimplement it easily locally?

We can reimplement it locally, but I think the dependency already offers some benefits that will come in handy when we want to provide metrics for things like balances in different states, payments, fees, etc.


It also pulls in several solo-maintainer dependencies (okay, with relatively well-known Rust community folks like burntsushi and dtolnay, but also the apparently kinda-unmaintained fnv crate), which we very strongly try to avoid given the security risk. Unless something would take many hundreds to a few thousand lines of code, or very complicated and hard-to-test code, to replicate, we should absolutely avoid taking a dependency for it.


Personally, I've implemented Prometheus endpoints in bash and Python many times, with probably less effort than it would have taken to figure out how to add a dependency and use it, so I'm not at all convinced that it's worth it here.

Contributor Author

Right, will drop the dependency and reimplement locally.
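A local reimplementation of the Prometheus text exposition format is indeed small. A minimal sketch (the function and metric names here are hypothetical, not the PR's actual code):

```rust
use std::fmt::Write;

// Emit one metric in the Prometheus text exposition format:
// a "# HELP" line, a "# TYPE" line, then the sample itself.
fn write_metric(buf: &mut String, name: &str, help: &str, kind: &str, value: i64) {
    let _ = writeln!(buf, "# HELP {} {}", name, help);
    let _ = writeln!(buf, "# TYPE {} {}", name, kind);
    let _ = writeln!(buf, "{} {}", name, value);
}
```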

self.service_health_score.set(score);
}

/// The health score computation is pretty basic for now and simply
Collaborator

Hmm, is it customary to have such a score? I would find it very hard to interpret, tbh.

Contributor Author

Hmm, is it customary to have such a score?

I think it is, as some users might want to rely on that to know how their node is performing over time at a glance.

I would find it very hard to interpret, tbh.

The current computation is pretty basic and relies on the NodeStatus information from ldk-node. Though, we might want to refine that to include more information and decide the best weight to assign to each event.
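A computation like the one described could be sketched roughly as follows (the weights here are made up for illustration, not the PR's actual values):

```rust
// Hypothetical weighting: running state dominates, while having peers and
// a synced wallet each contribute a smaller share. The PR's real weights
// and inputs may differ.
fn compute_health_score(is_running: bool, has_peers: bool, is_wallet_synced: bool) -> i64 {
    let mut score = 0;
    if is_running {
        score += 50;
    }
    if has_peers {
        score += 25;
    }
    if is_wallet_synced {
        score += 25;
    }
    score
}
```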

Collaborator

I guess it's okay as a first proof-of-concept, but IMO it would be much more useful to expose the actual underlying metrics such as peer/channel count, time since last successful chain sync/fee rate update, etc.

@ldk-reviews-bot

🔔 1st Reminder

Hey @benthecarman! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

@Anyitechs force-pushed the introduce-telemetry branch 2 times, most recently from 6bcc9f8 to 1563d7c on February 13, 2026 14:04
@Anyitechs (Contributor Author)

This is ready for another look. Dropped the deps and reimplemented locally.

@Anyitechs requested a review from tnull on February 13, 2026 14:25
@ldk-reviews-bot

🔔 8th Reminder

Hey @tnull @benthecarman! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

pub const BUILD_METRICS_INTERVAL: Duration = Duration::from_secs(60);

/// This represents a [`Metrics`] type that can go up and down in value.
pub struct IntGauge {
Collaborator

Why do we need this extra newtype? Couldn't we just use a plain AtomicI64?

Contributor Author

A plain AtomicI64 could work, but I introduced the type for better organization and to represent the Gauge metric type (this will help differentiate it from a Counter type when introduced later, which could also use an AtomicI64 but is only meant to increase, never decrease, in the metrics world).

Happy to drop if it's too much boilerplate.
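Such a newtype over AtomicI64 might look like this (a sketch; the PR's actual IntGauge may differ):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// A gauge can go up and down, unlike a counter, which only increases.
pub struct IntGauge(AtomicI64);

impl IntGauge {
    pub fn new() -> Self {
        IntGauge(AtomicI64::new(0))
    }
    pub fn set(&self, v: i64) {
        self.0.store(v, Ordering::Relaxed);
    }
    pub fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    pub fn dec(&self) {
        self.0.fetch_sub(1, Ordering::Relaxed);
    }
    pub fn get(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }
}
```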

@benthecarman (Collaborator) left a comment

We should add the metrics endpoint to the ldk-server-client as well. We don't really need a CLI command for it, but it would be worth at least putting it in the client so we have 100% coverage.

buffer
}

fn compute_health_score(is_running: bool, has_peers: bool, is_wallet_synced: bool) -> i64 {
Collaborator

This is such a weird health score. I think it'd be better to give stats about the node (balance, num peers, num channels, etc) and let the user make their own guidelines

Collaborator

Lnd has a prometheus integration. Would be good to look at and see what they are exposing in theirs

Contributor Author

This is such a weird health score.

The idea is to give users an indication of how their node is performing over time without having to look at other individual metrics. I've seen similar metrics for other services, but will need to refine this further.

I think it'd be better to give stats about the node (balance, num peers, num channels, etc) and let the user make their own guidelines

Right, I had earlier intended to do this in a follow-up. While this PR sets the structure/foundation, a follow-up will focus more on the metrics we need to expose.

Collaborator

The idea is to give users an indication of how their node is performing over time without having to look at other individual metrics. I've seen similar metrics for other services, but will need to refine this further.

I'm not entirely opposed to having such an aggregated metric (and we can still discuss what factors to include with what weight, etc.), but I agree it only makes sense in addition to exposing the values it's based on.

Comment on lines +266 to +272
runtime.spawn(async move {
loop {
interval.tick().await;
metrics_bg.update_service_health_score(&metrics_node);
}
});

Collaborator

Instead of updating this every minute, we should be able to do it in real time and just update it when we get a relevant event from ldk-node, i.e. we get a channel closed event, so we update the metrics immediately to reflect that.

Contributor Author

I intend to maintain a hybrid approach for this: real-time updates for the Node events, while still polling for metrics like channel/peer/payment count, balances, etc.
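The event-driven half of that hybrid could be sketched like this (the event enum and names here are illustrative stand-ins, not ldk-node's actual API):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Illustrative stand-in for ldk-node events; not the real event type.
enum NodeEvent {
    ChannelReady,
    ChannelClosed,
}

// Push-based update: adjust the gauge immediately when an event arrives,
// while slower-moving totals (balances, payment counts) stay on a poll loop.
fn on_event(open_channels: &AtomicI64, event: &NodeEvent) {
    match event {
        NodeEvent::ChannelReady => {
            open_channels.fetch_add(1, Ordering::Relaxed);
        },
        NodeEvent::ChannelClosed => {
            open_channels.fetch_sub(1, Ordering::Relaxed);
        },
    }
}
```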

@Anyitechs force-pushed the introduce-telemetry branch 2 times, most recently from 262569f to 226b86c on February 25, 2026 00:41
/// Retrieve the node metrics in Prometheus format.
pub async fn get_metrics(&self) -> Result<String, LdkServerError> {
let url = format!("https://{}/{GET_METRICS_PATH}", self.base_url);
let response = self.client.get(&url).send().await.map_err(|e| {
Collaborator

Can we refactor post_request to be able to do GET requests and use that? It has a lot of this logic already and would be better for future use.

Contributor Author

Done

Comment on lines +161 to +168
if req.uri().path().len() > 1 && &req.uri().path()[1..] == GET_METRICS_PATH {
let metrics = Arc::clone(&self.metrics);
return Box::pin(async move {
Ok(Response::builder()
.header("Content-Type", "text/plain")
.body(Full::new(Bytes::from(metrics.gather_metrics())))
.unwrap())
});
@benthecarman (Collaborator) commented Feb 25, 2026

We aren't validating auth here; we are short-circuiting before it's done. I'm not sure if that'll break the typical Prometheus flow though?

Contributor Author

We aren't validating auth here; we are short-circuiting before it's done.

Yes, this is intentional because Prometheus does not support the HMAC auth scheme we use. It supports only basic auth and TLS.

@Anyitechs force-pushed the introduce-telemetry branch from 226b86c to 1c27292 on March 6, 2026 15:12
@Anyitechs force-pushed the introduce-telemetry branch from 1c27292 to 432780c on March 6, 2026 15:21
@benthecarman (Collaborator) left a comment

Since this endpoint isn't authenticated and can't be because of Prometheus's limitations, we should make it configurable and default off.

&mut buffer,
"ldk_server_total_channels_count",
"Total number of channels",
"counter",
Collaborator

This should be a gauge because it can go down, not just up.

&mut buffer,
"ldk_server_total_public_channels_count",
"Total number of public channels",
"counter",
Collaborator

same here

&mut buffer,
"ldk_server_total_private_channels_count",
"Total number of private channels",
"counter",
Collaborator

same here

Arc::clone(&event_publisher),
Arc::clone(&paginated_store)).await;

event_metrics.update_total_successful_payments_count(&event_node);
Collaborator

This function does a full recount; we should be able to just increase by one.
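A monotonic counter bumped from the event handler would avoid the recount. A sketch (names hypothetical, not the PR's actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counter: only ever increases, so a successful-payment event can just
/// bump it by one instead of re-listing every payment from the store.
pub struct IntCounter(AtomicU64);

impl IntCounter {
    pub fn new() -> Self {
        IntCounter(AtomicU64::new(0))
    }
    pub fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}
```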

Arc::clone(&event_publisher),
Arc::clone(&paginated_store)).await;

event_metrics.update_total_failed_payments_count(&event_node);
Collaborator

same here

Comment on lines +116 to +122
self.update_peer_count(node);
self.update_total_payments_count(node);
self.update_total_successful_payments_count(node);
self.update_total_failed_payments_count(node);
self.update_total_channels_count(node);
self.update_total_public_channels_count(node);
self.update_total_private_channels_count(node);
Collaborator

For these we are making multiple calls to list all channels and payments. It would be better to just do that once and then count/filter as needed for each metric.
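One way to fold those calls into a single listing (the struct and field here are illustrative stand-ins; ldk-node's channel details type differs):

```rust
// Stand-in for ldk-node's channel details; only the field we need here.
struct ChannelInfo {
    is_announced: bool,
}

// List channels once and derive every count from the same slice,
// instead of one list_channels() call per metric.
fn channel_counts(channels: &[ChannelInfo]) -> (usize, usize, usize) {
    let total = channels.len();
    let public = channels.iter().filter(|c| c.is_announced).count();
    (total, public, total - public)
}
```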


fn call(&self, req: Request<Incoming>) -> Self::Future {
// Handle metrics endpoint separately to bypass auth and return plain text
if req.uri().path().len() > 1 && &req.uri().path()[1..] == GET_METRICS_PATH {
Collaborator

should also require the request is a GET request
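The suggested check could combine method and path (a sketch over plain strings; the real handler matches on hyper's request types):

```rust
// Only treat the request as a metrics scrape if it is a GET for /metrics;
// anything else falls through to the authenticated handler.
fn is_metrics_request(method: &str, path: &str) -> bool {
    method == "GET" && path == "/metrics"
}
```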

}

/// Retrieve the node metrics in Prometheus format.
pub async fn get_metrics(&self) -> Result<String, LdkServerError> {
Collaborator

This is just a raw string; it really should be decoded into the Response type.

Contributor Author

really should be decoded into the Response type

The Response type is protobuf, but Prometheus scrapers need the endpoint to return plain text.

RequestType::Post => self.client.post(url),
};

let body_for_auth = body.as_deref().unwrap_or(&[]);
Collaborator

nit: can move this into the if authenticated arm


pub const BUILD_METRICS_INTERVAL: Duration = Duration::from_secs(60);

pub struct Metrics {
Collaborator

Can we add some docs about how this is used for Prometheus? There's a lot of logic here that is specific to it without any context

@Anyitechs force-pushed the introduce-telemetry branch from 3402743 to 8555566 on March 9, 2026 19:38
@benthecarman (Collaborator) left a comment

Getting very close. We've merged e2e tests since you opened this. Can you add a test in there for the metrics endpoint?

Also, it's good we can now configure whether the metrics are enabled, but we should have better docs about it. Currently, when you do ldk-server --help it just says "The option to enable the metrics endpoint." We should have a warning in here saying this endpoint is unauthenticated. There might be other places we can do this too.

env = "LDK_SERVER_METRICS_ENABLED",
help = "The option to enable the metrics endpoint."
)]
metrics_enabled: Option<bool>,
@benthecarman (Collaborator) commented Mar 9, 2026

For clap, can we just make this metrics_enabled: bool? Otherwise you need to do ldk-server --metrics-enabled true instead of just ldk-server --metrics-enabled.


let metrics_node = Arc::clone(&node);
let mut interval = tokio::time::interval(BUILD_METRICS_INTERVAL);
let metrics = Arc::new(Metrics::new());
Collaborator

We're still building and tracking these metrics when they aren't enabled, this adds some extra overhead that we don't need. Can we only do it when metrics are enabled?

@Anyitechs (Contributor Author)

Thanks for the review. Addressed all comments here => 4628691.

@Anyitechs requested a review from benthecarman on March 9, 2026 22:33