
Introduce telemetry for observability #117

Open
Anyitechs wants to merge 3 commits into lightningdevkit:main from Anyitechs:introduce-telemetry

Conversation

@Anyitechs (Contributor) commented Jan 27, 2026

This introduces the foundational telemetry infrastructure to improve
the observability of LDK Server.

It adds a new /metrics endpoint exposed on the REST service address,
which serves Prometheus-compatible metrics. This endpoint is public and
does not require HMAC authentication, allowing for easy integration with
monitoring systems.

  • Added a Metrics utility struct to hold all the metrics we need to
    expose.

This is the first step in a larger effort to provide comprehensive telemetry.
Future updates will expand this to include other detailed metrics for channels,
balances, payments, etc.

Related issue #38

@ldk-reviews-bot commented Jan 27, 2026

👋 Thanks for assigning @benthecarman as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@Anyitechs force-pushed the introduce-telemetry branch 2 times, most recently from 8692c8c to 56b5e4a on January 27, 2026 01:51
@Anyitechs marked this pull request as ready for review on January 27, 2026 01:59
chrono = { version = "0.4", default-features = false, features = ["clock"] }
log = "0.4.28"
base64 = { version = "0.21", default-features = false, features = ["std"] }
lazy_static = "1.5.0"
Collaborator

Please avoid taking this dependency and rather use one of the std::sync primitives (Once/OnceLock/LazyLock), if we need this at all.
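The suggested swap to `std::sync::OnceLock` could look like this minimal sketch (the `Metrics` struct and its field names here are hypothetical, not the PR's actual types):

```rust
use std::sync::atomic::AtomicI64;
use std::sync::OnceLock;

// Hypothetical metrics container; field names are illustrative only.
pub struct Metrics {
    pub peer_count: AtomicI64,
}

// OnceLock from std replaces lazy_static: the value is initialized on
// first access, with no extra dependency required.
static METRICS: OnceLock<Metrics> = OnceLock::new();

pub fn metrics() -> &'static Metrics {
    METRICS.get_or_init(|| Metrics { peer_count: AtomicI64::new(0) })
}
```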

log = "0.4.28"
base64 = { version = "0.21", default-features = false, features = ["std"] }
lazy_static = "1.5.0"
prometheus = "0.14.0"
Collaborator

Do we really need this dependency, or can we just reimplement it easily locally?

If we need it, this should be at the very least made optional behind the metrics feature, and disable any default features.


Yea, it's a plain text file with some numbers; I don't think we need to take a dep :)

Contributor Author

Do we really need this dependency, or can we just reimplement it easily locally?

We can reimplement it locally, but I think the dependency already offers some benefits that will come in handy when we want to provide metrics for things like balances in different states, payments, fees, etc.


It also pulls in several solo-maintainer dependencies (okay, with relatively well-known Rust community folks like burntsushi and dtolnay, but also the apparently kinda-unmaintained fnv crate), which we very strongly try to avoid given the security risk. Unless something would take many hundreds to a few thousand lines of code, or very complicated and hard-to-test code, to replicate, we should absolutely avoid taking a dependency for it.


Personally, I've implemented Prometheus endpoints in bash and Python many times, with probably less effort than it would have taken to figure out how to add a dependency and use it, so I'm not at all convinced that it's worth it here.

Contributor Author

Right, will drop the dependency and reimplement locally.
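A local reimplementation of the Prometheus text exposition format is indeed small. A minimal sketch (the function and metric names here are hypothetical, not the PR's actual code):

```rust
use std::fmt::Write;

// Emit one metric in the Prometheus text exposition format:
// a "# HELP" line, a "# TYPE" line, then the sample itself.
fn write_metric(buf: &mut String, name: &str, help: &str, kind: &str, value: i64) {
    let _ = writeln!(buf, "# HELP {} {}", name, help);
    let _ = writeln!(buf, "# TYPE {} {}", name, kind);
    let _ = writeln!(buf, "{} {}", name, value);
}
```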

self.service_health_score.set(score);
}

/// The health score computation is pretty basic for now and simply
Collaborator

Hmm, is it customary to have such a score? I would find it very hard to interpret, tbh.

Contributor Author

Hmm, is it customary to have such a score?

I think it is, as some users might want to rely on that to know how their node is performing over time at a glance.

I would find it very hard to interpret, tbh.

The current computation is pretty basic and relies on the NodeStatus information from ldk-node. Though, we might want to refine that to include more information and decide the best weight to assign to each event.
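A computation like the one described could be sketched roughly as follows (the weights here are made up for illustration, not the PR's actual values):

```rust
// Hypothetical weighting: running state dominates, while having peers and
// a synced wallet each contribute a smaller share. The PR's real weights
// and inputs may differ.
fn compute_health_score(is_running: bool, has_peers: bool, is_wallet_synced: bool) -> i64 {
    let mut score = 0;
    if is_running {
        score += 50;
    }
    if has_peers {
        score += 25;
    }
    if is_wallet_synced {
        score += 25;
    }
    score
}
```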

Collaborator

I guess it's okay as a first proof-of-concept, but IMO it would be much more useful to expose the actual underlying metrics such as peer/channel count, time since last successful chain sync/fee rate update, etc.

@ldk-reviews-bot

🔔 1st Reminder

Hey @benthecarman! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

@Anyitechs force-pushed the introduce-telemetry branch 2 times, most recently from 6bcc9f8 to 1563d7c on February 13, 2026 14:04
@Anyitechs (Contributor Author)

This is ready for another look. Dropped the deps and reimplemented locally.

@Anyitechs requested a review from tnull on February 13, 2026 14:25
@ldk-reviews-bot

🔔 8th Reminder

Hey @tnull @benthecarman! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

pub const BUILD_METRICS_INTERVAL: Duration = Duration::from_secs(60);

/// This represents a [`Metrics`] type that can go up and down in value.
pub struct IntGauge {
Collaborator

Why do we need this extra newtype? Couldn't we just use a plain AtomicI64?

Contributor Author

A plain AtomicI64 could work, but I introduced the type for better organization and to represent the Gauge metric type (this will help differentiate it from a Counter type when introduced later, which could also use an AtomicI64 but is only meant to increase, never decrease, in the metrics world).

Happy to drop if it's too much boilerplate.
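Such a newtype over AtomicI64 might look like this (a sketch; the PR's actual IntGauge may differ):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// A gauge can go up and down, unlike a counter, which only increases.
pub struct IntGauge(AtomicI64);

impl IntGauge {
    pub fn new() -> Self {
        IntGauge(AtomicI64::new(0))
    }
    pub fn set(&self, v: i64) {
        self.0.store(v, Ordering::Relaxed);
    }
    pub fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    pub fn dec(&self) {
        self.0.fetch_sub(1, Ordering::Relaxed);
    }
    pub fn get(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }
}
```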

@benthecarman (Collaborator) left a comment

We should add the metrics endpoint to the ldk-server-client as well. We don't really need a CLI command for it, but it would be worth at least putting it in the client so we have 100% coverage.

buffer
}

fn compute_health_score(is_running: bool, has_peers: bool, is_wallet_synced: bool) -> i64 {
Collaborator

This is such a weird health score. I think it'd be better to give stats about the node (balance, num peers, num channels, etc) and let the user make their own guidelines

Collaborator

Lnd has a prometheus integration. Would be good to look at and see what they are exposing in theirs

Contributor Author

This is such a weird health score.

The idea is to give users an indication of how their node is performing over time without having to look at other individual metrics. I've seen similar metrics for other services, but will need to refine this further.

I think it'd be better to give stats about the node (balance, num peers, num channels, etc) and let the user make their own guidelines

Right, I had earlier intended to do this in a follow-up. While this PR sets the structure/foundation, a follow-up will focus more on the metrics we need to expose.

Collaborator

The idea is to give users an indication of how their node is performing over time without having to look at other individual metrics. I've seen similar metrics for other services, but will need to refine this further.

I'm not entirely opposed to having such an aggregated metric (and we can still discuss what factors to include with what weight, etc.), but I agree it only makes sense in addition to exposing the values it's based on.

Comment on lines +266 to +272
runtime.spawn(async move {
loop {
interval.tick().await;
metrics_bg.update_service_health_score(&metrics_node);
}
});

Collaborator

Instead of updating this every minute, we should be able to do it in real time and just update it when we get a relevant event from ldk-node, i.e. we get a channel closed event, so we update the metrics immediately to reflect that.

Contributor Author

I intend to maintain a hybrid approach for this: real-time updates for the Node events, while still polling for metrics like channel/peer/payment count, balances, etc.
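The event-driven half of that hybrid could be sketched like this (the event enum and names here are illustrative stand-ins, not ldk-node's actual API):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Illustrative stand-in for ldk-node events; not the real event type.
enum NodeEvent {
    ChannelReady,
    ChannelClosed,
}

// Push-based update: adjust the gauge immediately when an event arrives,
// while slower-moving totals (balances, payment counts) stay on a poll loop.
fn on_event(open_channels: &AtomicI64, event: &NodeEvent) {
    match event {
        NodeEvent::ChannelReady => {
            open_channels.fetch_add(1, Ordering::Relaxed);
        },
        NodeEvent::ChannelClosed => {
            open_channels.fetch_sub(1, Ordering::Relaxed);
        },
    }
}
```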

@Anyitechs force-pushed the introduce-telemetry branch 2 times, most recently from 262569f to 226b86c on February 25, 2026 00:41
/// Retrieve the node metrics in Prometheus format.
pub async fn get_metrics(&self) -> Result<String, LdkServerError> {
let url = format!("https://{}/{GET_METRICS_PATH}", self.base_url);
let response = self.client.get(&url).send().await.map_err(|e| {
Collaborator

Can we refactor post_request to be able to do GET requests and use that? It has a lot of this logic already and would be better for future use.

Contributor Author

Done

Comment on lines +161 to +168
if req.uri().path().len() > 1 && &req.uri().path()[1..] == GET_METRICS_PATH {
let metrics = Arc::clone(&self.metrics);
return Box::pin(async move {
Ok(Response::builder()
.header("Content-Type", "text/plain")
.body(Full::new(Bytes::from(metrics.gather_metrics())))
.unwrap())
});
@benthecarman (Collaborator) commented Feb 25, 2026

We aren't validating auth here; we are short-circuiting before it's done. I'm not sure if that'll break the typical Prometheus flow though?

Contributor Author

We aren't validating auth here; we are short-circuiting before it's done.

Yes, this is intentional because Prometheus does not support the HMAC auth scheme we use. It supports only basic auth and TLS.

@Anyitechs force-pushed the introduce-telemetry branch from 226b86c to 1c27292 on March 6, 2026 15:12
@Anyitechs force-pushed the introduce-telemetry branch from 1c27292 to 432780c on March 6, 2026 15:21
@benthecarman (Collaborator) left a comment

Since this endpoint isn't authenticated and can't be because of Prometheus's limitations, we should make it configurable and default off.

&mut buffer,
"ldk_server_total_channels_count",
"Total number of channels",
"counter",
Collaborator

This should be a gauge because it can go down, not just up.

&mut buffer,
"ldk_server_total_public_channels_count",
"Total number of public channels",
"counter",
Collaborator

same here

&mut buffer,
"ldk_server_total_private_channels_count",
"Total number of private channels",
"counter",
Collaborator

same here

Arc::clone(&event_publisher),
Arc::clone(&paginated_store)).await;

event_metrics.update_total_successful_payments_count(&event_node);
Collaborator

This function does a full recount; we should be able to just increase by one.
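A monotonic counter bumped from the event handler would avoid the recount. A sketch (names hypothetical, not the PR's actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counter: only ever increases, so a successful-payment event can just
/// bump it by one instead of re-listing every payment from the store.
pub struct IntCounter(AtomicU64);

impl IntCounter {
    pub fn new() -> Self {
        IntCounter(AtomicU64::new(0))
    }
    pub fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}
```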

Arc::clone(&event_publisher),
Arc::clone(&paginated_store)).await;

event_metrics.update_total_failed_payments_count(&event_node);
Collaborator

same here

Comment on lines +116 to +122
self.update_peer_count(node);
self.update_total_payments_count(node);
self.update_total_successful_payments_count(node);
self.update_total_failed_payments_count(node);
self.update_total_channels_count(node);
self.update_total_public_channels_count(node);
self.update_total_private_channels_count(node);
Collaborator

For these we are making multiple calls to list all channels and payments. It would be better to just do that once and then count/filter as needed for each metric.
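One way to fold those calls into a single listing (the struct and field here are illustrative stand-ins; ldk-node's channel details type differs):

```rust
// Stand-in for ldk-node's channel details; only the field we need here.
struct ChannelInfo {
    is_announced: bool,
}

// List channels once and derive every count from the same slice,
// instead of one list_channels() call per metric.
fn channel_counts(channels: &[ChannelInfo]) -> (usize, usize, usize) {
    let total = channels.len();
    let public = channels.iter().filter(|c| c.is_announced).count();
    (total, public, total - public)
}
```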


fn call(&self, req: Request<Incoming>) -> Self::Future {
// Handle metrics endpoint separately to bypass auth and return plain text
if req.uri().path().len() > 1 && &req.uri().path()[1..] == GET_METRICS_PATH {
Collaborator

should also require the request is a GET request
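The suggested check could combine method and path (a sketch over plain strings; the real handler matches on hyper's request types):

```rust
// Only treat the request as a metrics scrape if it is a GET for /metrics;
// anything else falls through to the authenticated handler.
fn is_metrics_request(method: &str, path: &str) -> bool {
    method == "GET" && path == "/metrics"
}
```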

}

/// Retrieve the node metrics in Prometheus format.
pub async fn get_metrics(&self) -> Result<String, LdkServerError> {
Collaborator

This is just a raw string; it really should be decoded into the Response type.

Contributor Author

really should be decoded into the Response type

The Response type is protobuf, but Prometheus scrapers need the endpoint to return plain text.

RequestType::Post => self.client.post(url),
};

let body_for_auth = body.as_deref().unwrap_or(&[]);
Collaborator

nit: can move this into the if authenticated arm


pub const BUILD_METRICS_INTERVAL: Duration = Duration::from_secs(60);

pub struct Metrics {
Collaborator

Can we add some docs about how this is used for Prometheus? There's a lot of logic here that is specific to it without any context

@Anyitechs force-pushed the introduce-telemetry branch from 3402743 to 8555566 on March 9, 2026 19:38
@benthecarman (Collaborator) left a comment

Getting very close. We've merged e2e tests since you opened this. Can you add a test in there for the metrics endpoint?

Also, it's good we can now configure whether the metrics are enabled, but we should have better docs about it. Currently, when you do ldk-server --help it just says "The option to enable the metrics endpoint." We should have a warning in here saying this endpoint is unauthenticated. There might be other places we can do this too.

env = "LDK_SERVER_METRICS_ENABLED",
help = "The option to enable the metrics endpoint."
)]
metrics_enabled: Option<bool>,
@benthecarman (Collaborator) commented Mar 9, 2026

For clap, can we just make this metrics_enabled: bool? Otherwise you need to do ldk-server --metrics-enabled true instead of just ldk-server --metrics-enabled.


let metrics_node = Arc::clone(&node);
let mut interval = tokio::time::interval(BUILD_METRICS_INTERVAL);
let metrics = Arc::new(Metrics::new());
Collaborator

We're still building and tracking these metrics when they aren't enabled, this adds some extra overhead that we don't need. Can we only do it when metrics are enabled?

@Anyitechs (Contributor Author)

Thanks for the review. Addressed all comments here => 4628691.

@Anyitechs requested a review from benthecarman on March 9, 2026 22:33