-
Notifications
You must be signed in to change notification settings - Fork 2
317 Add Observability to PDP and OE #408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @ginaxu1, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the observability of the orchestration-engine and policy-decision-point Go services by integrating a dedicated monitoring package. The changes introduce comprehensive metrics for HTTP request handling, external service interactions, and critical business events, providing a clearer picture of application health and performance. Complementary updates to Grafana dashboards and Prometheus configuration ensure these new metrics are effectively collected and visualized, while also streamlining the local development environment. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces comprehensive monitoring capabilities to the orchestration-engine and policy-decision-point services by integrating a new shared pkg/monitoring module. This involves updating go.mod and go.sum files to include new dependencies, modifying main.go in both services to initialize monitoring, and adding HTTP metrics middleware and /metrics endpoints. Specifically, the orchestration-engine now records external call durations and GraphQL request outcomes, while the policy-decision-point sets a SERVICE_NAME environment variable for metric labeling. Import paths across several files in the orchestration-engine were also updated to reflect a module name change. In the Grafana dashboard, changes were made to Prometheus query labels for HTTP requests and P95 latency, and a 'Service Health' panel was removed. Prometheus configuration was updated to use host.docker.internal for scraping several services and to rename the portal-backend job to api-server-go. Review comments highlight concerns about the loss of per-route P95 latency data in the Grafana dashboard, the removal of the 'Service Health' panel, an inconsistency in the Prometheus scrape configuration for api-server-go (where the target hostname doesn't match the job name), and suggest removing a redundant logger initialization check in orchestration-engine/server/server.go.
233c951 to
208988a
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces comprehensive observability features, specifically metrics, to the orchestration-engine and policy-decision-point services. It integrates a new shared monitoring package by updating go.mod and go.sum files in both services, adding numerous OpenTelemetry-related dependencies. Both services now initialize monitoring at startup, apply HTTP metrics middleware to their routers, and expose a /metrics endpoint. The orchestration-engine further records durations for external provider calls and tracks the outcome (success/failure) of GraphQL requests as business events. Review comments highlighted an incorrect label (http_method instead of http_request_method) in the Grafana dashboard's PromQL queries for HTTP metrics, an inconsistent log.Printf call in orchestration-engine/main.go that should use logger.Log.Warn, and a minor formatting issue with extra leading spaces in a log statement within orchestration-engine/server/server.go. Additionally, Prometheus configuration was updated to use host.docker.internal for service targets and renamed the portal-backend job to api-server-go, while Grafana provisioning files were adjusted for dashboard editability and datasource configuration.
3539a0c to
cd9e5e1
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces observability features to the Orchestration Engine and Policy Decision Point services by integrating OpenTelemetry for metrics collection. The changes are well-structured, including middleware for HTTP metrics, custom metrics for business events and external calls, and updated configurations for Prometheus and Grafana. My review focuses on improving security by removing hardcoded credentials and enhancing code clarity and maintainability.
7e73a36 to
7a1d12c
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces comprehensive observability features to the Orchestration Engine and Policy Decision Point services, which is a great step towards improving monitoring and reliability. The use of OpenTelemetry for vendor-agnostic metrics, non-blocking initialization, and the addition of a verification script are all excellent practices.
However, I've identified a few critical issues that should be addressed:
- Security: The
docker-compose.ymlfile contains hardcoded database credentials. These should be externalized using environment variables to avoid committing secrets to the repository. - Correctness: There's a bug in the
provider.gofile where thedeferstatement for metric recording doesn't work as intended due to variable shadowing and incorrect return patterns. This leads to inaccurate metrics for external calls, which undermines the goal of this PR. - Robustness: The new
verification_steps.shscript uses a fixed sleep duration, which can be unreliable. I've suggested a more robust polling mechanism.
I've provided detailed comments and code suggestions for these points. Once they are addressed, this will be a solid contribution.
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request integrates observability features into the orchestration-engine and policy-decision-point services, and enhances the local development environment for observability. Key changes include adding a pdp-db PostgreSQL service to docker-compose.yml and configuring both services to connect to it, along with updating go.mod and go.sum files to include a new shared/monitoring dependency. Both services now initialize the monitoring library, expose a /metrics endpoint, and apply an HTTP metrics middleware; the orchestration-engine also records external provider call durations and GraphQL request outcomes. Observability configurations for Grafana and Prometheus were updated, including changes to metric labels and Prometheus scrape targets to use host.docker.internal. A new verification_steps.sh script was added to automate the setup and verification of the observability stack and services. Review comments highlighted security concerns regarding weak default database passwords ('password' and empty string) in docker-compose.yml, suggesting their removal or more secure defaults, and identified a redundant sleep 5 command in the new verification script due to existing depends_on and health check mechanisms.
aad2ff8 to
af8a49d
Compare
sthanikan2000
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ginaxu1 When I go through the implementation, I can see that the way you add the observability to the PDP and OE is not same. Is it intentional?
|
updated: OE, PDP, CE all follow the same pattern: if monitoring.IsObservabilityEnabled() {
monitoringConfig := monitoring.DefaultConfig("service-name")
if err := monitoring.Initialize(monitoringConfig); err != nil {
slog.Warn("Failed to initialize monitoring (service will continue)", "error", err)
}
} else {
slog.Info("Observability disabled via environment variable")
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This PR introduces observability to the OE and PDP services using OpenTelemetry, which is a great addition. The implementation of the shared monitoring module is solid, with good practices like support for multiple exporters and graceful degradation. The new verification_steps.sh script is also a fantastic improvement for developer experience.
I've left a few comments with suggestions for improvement:
- A high-severity issue in
prometheus.ymlregarding the use ofhost.docker.internal, which will likely break the setup for container-based development. - A few medium-severity suggestions in
docker-compose.ymlto improve the robustness and clarity of the local development setup for databases. - A minor suggestion to use
slog.Warninstead ofslog.Errorfor non-fatal monitoring initialization failures.
Overall, this is a well-executed feature. Addressing these points will make the observability setup more robust and easier to use for all developers.
af8a49d to
ce39049
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a comprehensive and well-implemented observability solution for the PDP, OE, and CE services. The changes are robust, making observability optional via environment variables, and the inclusion of a verification script is a fantastic addition for improving the developer experience. The instrumentation for HTTP metrics, business events, and external calls is thorough. I've identified a couple of medium-severity issues: one concerning an inconsistency in the database setup for the orchestration-engine in Docker Compose, and another related to Grafana dashboard provisioning which could lead to duplicates. Addressing these will enhance the consistency and maintainability of the local development environment.
ce39049 to
ef523df
Compare
2736899 to
c331f5b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a comprehensive observability solution for the PDP and OE services, leveraging OpenTelemetry for metrics collection. The changes are well-structured, using a shared monitoring package and allowing observability to be toggled via environment variables. The addition of Prometheus/Grafana configurations and a verification script for local development is a great touch. My review focuses on a minor logical issue in the status reporting of the monitoring package, but overall the implementation is solid.
eb9a9d9 to
71b3bb7
Compare
71b3bb7 to
8f3117f
Compare
Summary
This PR adds observability (metrics collection and monitoring) to OE and PDP. Additionally, OE, PDP, CE all support optional observability via environment variables:
ENABLE_OBSERVABILITY=false- Disables observabilityOTEL_METRICS_ENABLED=false- Disables OpenTelemetry metricsWhen observability is disabled:
/metricsendpoint returns "# Metrics disabled" or "# Metrics not initialized"Endpoints:
Metrics Captured
http_requests_total(counter),http_request_duration_seconds(histogram)business_events_total- GraphQL outcomesexternal_call_duration_seconds(histogram),external_calls_total(counter)Labels:
method,route,status,service_name,provider_key,call_type,event_type,outcomeType of Change
Testing with Observability Enabled (Default)
Start Services
Note: If consent-engine fails, rebuild:
docker compose build consent-engine && docker compose up -d consent-engineVerify Metrics
Verify Health
Testing with Observability Disabled
Start Services with Observability Disabled
Verify Metrics Disabled
Verify Services Still Work
Verify Logs
Troubleshooting
Service Fails to Start
If a service fails to start when observability is disabled, check:
ENABLE_OBSERVABILITY=falseis set correctlyif monitoring.IsObservabilityEnabled()Metrics Endpoint Returns 503
This indicates metrics were not initialized. Check:
monitoring.Initialize()was called (if observability enabled)Services Work But No Metrics
Check:
/metricsendpointHTTPMetricsMiddlewareis applied to routesENABLE_OBSERVABILITY=true)Deployment
Environment Variables
OE:
SERVICE_NAME,OTEL_METRICS_EXPORTER,CONFIG_PATHPDP:
SERVICE_NAME,OTEL_METRICS_EXPORTER,CHOREO_OPENDIF_DATABASE_*(with defaults)Database:
POSTGRES_USER,POSTGRES_PASSWORD,POSTGRES_DB(with defaults)Verification
Rollback Plan
To disable observability:
monitoring.Initialize()calls frommain.gofiles/metricsendpoint registrationsHTTPMetricsMiddlewarewrappersRecordBusinessEvent()andRecordExternalCall()callsNote: Services continue to function normally without metrics collection.