-
Notifications
You must be signed in to change notification settings - Fork 68
Fix Flight denormalized field locking issues #1099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Remove real-time updates to Flight.total_views and Flight.total_clicks to eliminate database lock contention on high-traffic flights. Changes: - Remove F() expression updates in Advertisement.incr() that were causing row-level locks on every impression - Add Flight.refresh_denormalized_totals() method to calculate these fields from AdImpression on-demand - Add refresh_flight_denormalized_totals() Celery task for periodic background updates - Add management command to manually refresh flight totals - Update Offer.refund() to skip Flight field updates - Update test to manually refresh denormalized fields The denormalized fields are now calculated from the source of truth (AdImpression) either on-demand or via scheduled tasks, avoiding the lock contention that was blocking concurrent ad impressions.
|
This is a good starting point. I'd say we should get rid of the the management task (we'll never run it) and rely on the task. I think we should run the task every 5-10m instead of hourly. |
|
The test failures are pretty straight-forward issues because we're assuming this field is updated. |
- Remove refresh_flight_totals management command (won't be used) - Update task documentation to run every 5-10 minutes - Fix all test failures by adding refresh_denormalized_totals() calls after incr() operations that check total_views/total_clicks - Update tests that manually set total_clicks/total_views to use save(update_fields=[...]) to avoid triggering other updates
- Celery task sets cache timestamp on completion - Health endpoint checks cache instead of querying DB - Returns 503 if task hasn't run in 20+ minutes
- Refunding flights triggers recalc of totals - Handle case where a flight is slightly overfilled - Run denormalized calcs every 5m - Don't bother with "error counts" in denormalized calcs. It will either succeed or error and alert us.
davidfischer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine to roll out as long as we also add monitoring to the health check. I like this pattern for health checks and think we should have more of them.
|
Yea, I think it's a solid pattern. |
Problem
This PR fixes database lock contention on the
adserver_flighttable that was causing queries like this to block for extended periods:On high-traffic flights, multiple concurrent ad impressions were trying to update the same
total_viewsandtotal_clicksfields, causing row-level locks that queued up and blocked ad serving.Solution
Remove real-time updates to the denormalized
Flight.total_viewsandFlight.total_clicksfields. These fields are now:AdImpressionwhen neededSince
AdImpressionis already the source of truth for this data, the denormalized fields onFlightdon't need to be updated in real-time.Changes
Advertisement.incr()that were causing row-level locksFlight.refresh_denormalized_totals()method to recalculate fields fromAdImpressionrefresh_flight_denormalized_totals()Celery task for periodic background updatesOffer.refund()to skip Flight field updatesDeployment Notes
Schedule the
refresh_flight_denormalized_totalstask to run periodically (every 5-10m)Testing