Skip to content

Conversation

@ginaxu1
Copy link
Collaborator

@ginaxu1 ginaxu1 commented Jan 3, 2026

Summary

This PR adds observability (metrics collection and monitoring) to OE and PDP. Additionally, OE, PDP, CE all support optional observability via environment variables:

  • ENABLE_OBSERVABILITY=false - Disables observability
  • OTEL_METRICS_ENABLED=false - Disables OpenTelemetry metrics

When observability is disabled:

  • Services continue to function normally
  • /metrics endpoint returns "# Metrics disabled" or "# Metrics not initialized"
  • HTTP middleware passes through requests without recording metrics
  • Business event recording is skipped (no-op)
  • External call metrics are skipped (no-op)

Endpoints:

  1. Prometheus Targets: http://localhost:9091/targets
  2. Prometheus Graph: http://localhost:9091/graph
  3. Grafana Dashboard: http://localhost:3002/d/go-services-dashboard/go-services-metrics (user: admin, pw: admin)
image
  1. OE Metrics: http://localhost:4000/metrics
  2. PDP Metrics: http://localhost:8082/metrics

Metrics Captured

  • HTTP Request Metrics: http_requests_total (counter), http_request_duration_seconds (histogram)
  • Business Events (OE): business_events_total - GraphQL outcomes
  • External Calls (OE): external_call_duration_seconds (histogram), external_calls_total (counter)
  • Go Runtime: GC, memory, goroutines
  • Process: CPU, memory, network I/O

Labels: method, route, status, service_name, provider_key, call_type, event_type, outcome

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Other (please describe):

Testing with Observability Enabled (Default)

Start Services

cd exchange
export DB_PASSWORD=${DB_PASSWORD:-password}
docker compose up -d pdp-db ce-db
sleep 5  # Wait for databases
docker compose up -d orchestration-engine policy-decision-point consent-engine

Note: If consent-engine fails, rebuild: docker compose build consent-engine && docker compose up -d consent-engine

Verify Metrics

# Should show http_requests_total
curl http://localhost:4000/metrics | grep http_requests_total | head -2
curl http://localhost:8081/metrics | grep http_requests_total | head -2
curl http://localhost:8082/metrics | grep http_requests_total | head -2

Verify Health

curl http://localhost:4000/health  # Should return 200 OK
curl http://localhost:8081/health  # Should return 200 OK
curl http://localhost:8082/health  # Should return 200 OK

Testing with Observability Disabled

Start Services with Observability Disabled

cd exchange
export ENABLE_OBSERVABILITY=false
export DB_PASSWORD=${DB_PASSWORD:-password}
docker compose down  # Important: recreate containers to pick up env vars
docker compose up -d pdp-db ce-db
sleep 5
docker compose up -d orchestration-engine policy-decision-point consent-engine

Verify Metrics Disabled

# Should NOT show http_requests_total (only Go runtime metrics)
curl http://localhost:4000/metrics | grep -E "(http_requests_total|Metrics disabled)" | head -2
curl http://localhost:8082/metrics | grep -E "(http_requests_total|Metrics disabled)" | head -2

Verify Services Still Work

curl http://localhost:4000/health  # Should still work
curl http://localhost:8082/health   # Should still work

Verify Logs

export DB_PASSWORD=password
docker compose logs orchestration-engine | grep -i observability
docker compose logs policy-decision-point | grep -i observability
# Should show: "Observability disabled via environment variable"

Troubleshooting

Service Fails to Start

If a service fails to start when observability is disabled, check:

  1. Service logs for initialization errors
  2. Ensure ENABLE_OBSERVABILITY=false is set correctly
  3. Verify monitoring initialization is wrapped in if monitoring.IsObservabilityEnabled()

Metrics Endpoint Returns 503

This indicates metrics were not initialized. Check:

  1. Service logs for initialization errors
  2. Ensure observability is enabled or explicitly disabled
  3. Verify monitoring.Initialize() was called (if observability enabled)

Services Work But No Metrics

Check:

  1. Prometheus is scraping the correct endpoints
  2. Services are exposing /metrics endpoint
  3. HTTPMetricsMiddleware is applied to routes
  4. Observability is enabled (ENABLE_OBSERVABILITY=true)

Deployment

Environment Variables

OE: SERVICE_NAME, OTEL_METRICS_EXPORTER, CONFIG_PATH
PDP: SERVICE_NAME, OTEL_METRICS_EXPORTER, CHOREO_OPENDIF_DATABASE_* (with defaults)
Database: POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB (with defaults)

Verification

# Use verification script
cd observability && ./verification_steps.sh

Rollback Plan

To disable observability:

  1. Remove monitoring.Initialize() calls from main.go files
  2. Remove /metrics endpoint registrations
  3. Remove HTTPMetricsMiddleware wrappers
  4. Remove RecordBusinessEvent() and RecordExternalCall() calls

Note: Services continue to function normally without metrics collection.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @ginaxu1, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the observability of the orchestration-engine and policy-decision-point Go services by integrating a dedicated monitoring package. The changes introduce comprehensive metrics for HTTP request handling, external service interactions, and critical business events, providing a clearer picture of application health and performance. Complementary updates to Grafana dashboards and Prometheus configuration ensure these new metrics are effectively collected and visualized, while also streamlining the local development environment.

Highlights

  • Enhanced Observability: Integrated a shared monitoring package into the orchestration-engine and policy-decision-point Go services, enabling the collection of various metrics.
  • Detailed HTTP Metrics: Implemented middleware to automatically capture HTTP request duration, method, and route, providing granular insights into API performance.
  • External Call Tracking: Added instrumentation to the orchestration-engine to record the duration and outcome of external HTTP calls made to other services.
  • Business Event Monitoring: Introduced specific metrics for key business events, such as the success or failure of GraphQL requests, to track application logic outcomes.
  • Monitoring Stack Updates: Updated Grafana dashboards and Prometheus configuration to correctly ingest, visualize, and scrape the newly exposed metrics, facilitating local development.
  • Go Module Path Refactoring: Standardized Go module paths for orchestration-engine and policy-decision-point and managed new dependencies for the monitoring package.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive monitoring capabilities to the orchestration-engine and policy-decision-point services by integrating a new shared pkg/monitoring module. This involves updating go.mod and go.sum files to include new dependencies, modifying main.go in both services to initialize monitoring, and adding HTTP metrics middleware and /metrics endpoints. Specifically, the orchestration-engine now records external call durations and GraphQL request outcomes, while the policy-decision-point sets a SERVICE_NAME environment variable for metric labeling. Import paths across several files in the orchestration-engine were also updated to reflect a module name change. In the Grafana dashboard, changes were made to Prometheus query labels for HTTP requests and P95 latency, and a 'Service Health' panel was removed. Prometheus configuration was updated to use host.docker.internal for scraping several services and to rename the portal-backend job to api-server-go. Review comments highlight concerns about the loss of per-route P95 latency data in the Grafana dashboard, the removal of the 'Service Health' panel, an inconsistency in the Prometheus scrape configuration for api-server-go (where the target hostname doesn't match the job name), and suggest removing a redundant logger initialization check in orchestration-engine/server/server.go.

@ginaxu1 ginaxu1 force-pushed the 317-part3-observability-only branch from 233c951 to 208988a Compare January 5, 2026 05:07
@ginaxu1
Copy link
Collaborator Author

ginaxu1 commented Jan 5, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive observability features, specifically metrics, to the orchestration-engine and policy-decision-point services. It integrates a new shared monitoring package by updating go.mod and go.sum files in both services, adding numerous OpenTelemetry-related dependencies. Both services now initialize monitoring at startup, apply HTTP metrics middleware to their routers, and expose a /metrics endpoint. The orchestration-engine further records durations for external provider calls and tracks the outcome (success/failure) of GraphQL requests as business events. Review comments highlighted an incorrect label (http_method instead of http_request_method) in the Grafana dashboard's PromQL queries for HTTP metrics, an inconsistent log.Printf call in orchestration-engine/main.go that should use logger.Log.Warn, and a minor formatting issue with extra leading spaces in a log statement within orchestration-engine/server/server.go. Additionally, Prometheus configuration was updated to use host.docker.internal for service targets and renamed the portal-backend job to api-server-go, while Grafana provisioning files were adjusted for dashboard editability and datasource configuration.

@ginaxu1 ginaxu1 force-pushed the 317-part3-observability-only branch from 3539a0c to cd9e5e1 Compare January 5, 2026 05:22
@ginaxu1
Copy link
Collaborator Author

ginaxu1 commented Jan 5, 2026

/gemini review

@ginaxu1 ginaxu1 changed the title 317 part3 observability only 317 Integration Observability with PDP and OE Jan 5, 2026
@ginaxu1 ginaxu1 changed the title 317 Integration Observability with PDP and OE 317 Add Observability to PDP and OE Jan 5, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces observability features to the Orchestration Engine and Policy Decision Point services by integrating OpenTelemetry for metrics collection. The changes are well-structured, including middleware for HTTP metrics, custom metrics for business events and external calls, and updated configurations for Prometheus and Grafana. My review focuses on improving security by removing hardcoded credentials and enhancing code clarity and maintainability.

@ginaxu1 ginaxu1 force-pushed the 317-part3-observability-only branch from 7e73a36 to 7a1d12c Compare January 5, 2026 06:31
@ginaxu1
Copy link
Collaborator Author

ginaxu1 commented Jan 5, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive observability features to the Orchestration Engine and Policy Decision Point services, which is a great step towards improving monitoring and reliability. The use of OpenTelemetry for vendor-agnostic metrics, non-blocking initialization, and the addition of a verification script are all excellent practices.

However, I've identified a few critical issues that should be addressed:

  • Security: The docker-compose.yml file contains hardcoded database credentials. These should be externalized using environment variables to avoid committing secrets to the repository.
  • Correctness: There's a bug in the provider.go file where the defer statement for metric recording doesn't work as intended due to variable shadowing and incorrect return patterns. This leads to inaccurate metrics for external calls, which undermines the goal of this PR.
  • Robustness: The new verification_steps.sh script uses a fixed sleep duration, which can be unreliable. I've suggested a more robust polling mechanism.

I've provided detailed comments and code suggestions for these points. Once they are addressed, this will be a solid contribution.

@ginaxu1
Copy link
Collaborator Author

ginaxu1 commented Jan 5, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request integrates observability features into the orchestration-engine and policy-decision-point services, and enhances the local development environment for observability. Key changes include adding a pdp-db PostgreSQL service to docker-compose.yml and configuring both services to connect to it, along with updating go.mod and go.sum files to include a new shared/monitoring dependency. Both services now initialize the monitoring library, expose a /metrics endpoint, and apply an HTTP metrics middleware; the orchestration-engine also records external provider call durations and GraphQL request outcomes. Observability configurations for Grafana and Prometheus were updated, including changes to metric labels and Prometheus scrape targets to use host.docker.internal. A new verification_steps.sh script was added to automate the setup and verification of the observability stack and services. Review comments highlighted security concerns regarding weak default database passwords ('password' and empty string) in docker-compose.yml, suggesting their removal or more secure defaults, and identified a redundant sleep 5 command in the new verification script due to existing depends_on and health check mechanisms.

@ginaxu1 ginaxu1 marked this pull request as ready for review January 5, 2026 07:20
@ginaxu1 ginaxu1 force-pushed the 317-part3-observability-only branch 2 times, most recently from aad2ff8 to af8a49d Compare January 5, 2026 07:31
Copy link
Collaborator

@sthanikan2000 sthanikan2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ginaxu1 When I go through the implementation, I can see that the way you add the observability to the PDP and OE is not same. Is it intentional?

@ginaxu1
Copy link
Collaborator Author

ginaxu1 commented Jan 6, 2026

updated: OE, PDP, CE all follow the same pattern:

if monitoring.IsObservabilityEnabled() {
    monitoringConfig := monitoring.DefaultConfig("service-name")
    if err := monitoring.Initialize(monitoringConfig); err != nil {
        slog.Warn("Failed to initialize monitoring (service will continue)", "error", err)
    }
} else {
    slog.Info("Observability disabled via environment variable")
}

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR introduces observability to the OE and PDP services using OpenTelemetry, which is a great addition. The implementation of the shared monitoring module is solid, with good practices like support for multiple exporters and graceful degradation. The new verification_steps.sh script is also a fantastic improvement for developer experience.

I've left a few comments with suggestions for improvement:

  • A high-severity issue in prometheus.yml regarding the use of host.docker.internal, which will likely break the setup for container-based development.
  • A few medium-severity suggestions in docker-compose.yml to improve the robustness and clarity of the local development setup for databases.
  • A minor suggestion to use slog.Warn instead of slog.Error for non-fatal monitoring initialization failures.

Overall, this is a well-executed feature. Addressing these points will make the observability setup more robust and easier to use for all developers.

@ginaxu1 ginaxu1 force-pushed the 317-part3-observability-only branch from af8a49d to ce39049 Compare January 6, 2026 07:25
@ginaxu1 ginaxu1 requested a review from sthanikan2000 January 6, 2026 07:28
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive and well-implemented observability solution for the PDP, OE, and CE services. The changes are robust, making observability optional via environment variables, and the inclusion of a verification script is a fantastic addition for improving the developer experience. The instrumentation for HTTP metrics, business events, and external calls is thorough. I've identified a couple of medium-severity issues: one concerning an inconsistency in the database setup for the orchestration-engine in Docker Compose, and another related to Grafana dashboard provisioning which could lead to duplicates. Addressing these will enhance the consistency and maintainability of the local development environment.

@ginaxu1 ginaxu1 force-pushed the 317-part3-observability-only branch from ce39049 to ef523df Compare January 6, 2026 07:31
@ginaxu1 ginaxu1 force-pushed the 317-part3-observability-only branch 2 times, most recently from 2736899 to c331f5b Compare January 13, 2026 05:06
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive observability solution for the PDP and OE services, leveraging OpenTelemetry for metrics collection. The changes are well-structured, using a shared monitoring package and allowing observability to be toggled via environment variables. The addition of Prometheus/Grafana configurations and a verification script for local development is a great touch. My review focuses on a minor logical issue in the status reporting of the monitoring package, but overall the implementation is solid.

@OpenDIF OpenDIF deleted a comment from gemini-code-assist bot Jan 14, 2026
@ginaxu1 ginaxu1 force-pushed the 317-part3-observability-only branch 2 times, most recently from eb9a9d9 to 71b3bb7 Compare January 18, 2026 10:56
@ginaxu1 ginaxu1 force-pushed the 317-part3-observability-only branch from 71b3bb7 to 8f3117f Compare January 23, 2026 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants