Skip to content

Conversation

@magajh
Copy link
Contributor

@magajh magajh commented Jan 14, 2025

Description

This PR introduces a new Data Collector API. The purpose of this API is to enable Open edX instances to collect and send content-related data to a central service (e.g., the Shipyard API). This lays the foundation for the Hosting division to make data-driven decisions and improve the overall service offering.

The new API is part of the data-collector module, ensuring clear separation from existing APIs in the data directory for better modularity and future extensibility.


Rationale

  • Flexibility and Modularity: Encapsulating the new functionality in a separate module (data-collector) allows for easier maintenance and potential migration if required in the future.
  • Centralized Decision-Making: By collecting metrics like active users, course data, and usage patterns from Open edX instances, the Hosting division can prioritize resources, optimize support, and provide actionable insights to clients.
  • Adaptability: The implementation is flexible, allowing the destination endpoint (e.g., Shipyard API) and the query file to be passed dynamically, enabling integration with various systems or future requirements.

Key Features

  1. Authentication:

    • Supports authentication via:
      • A token specific to the GitHub Action workflow.
  2. API Design:

    • Endpoint: /eox-core/data-api/v1/collect-data/
    • Accepts the following payload:
      • query_file_content: Content of a YAML file containing SQL queries.
      • query_file_url: A URL pointing to a publicly accessible YAML file with queries.
      • destination_url: The URL to which the results will be sent.
    • Ensures only one of query_file_content or query_file_url is required.
  3. Async Task Execution:

    • A Celery async task (generate_report) processes the provided queries:
      • Executes the SQL queries against the edxapp database.
      • Posts the processed results to the provided destination_url.
  4. Validation:

    • Includes a serializer to validate the incoming payload, ensuring the required data is present.
  5. Logging and Error Handling:

    • Logs key events during the execution of the task for better monitoring and debugging.
    • Implements retries for the async task in case of transient failures.

How to Test

  1. Set up an Open edX instance with the eox-core plugin installed.
  2. Authenticate using a valid token.
  3. Use the following payload to test the API:
    {
        "query_file_content": "queries:\n  - name: 'Example Query'\n    query: 'SELECT COUNT(*) FROM auth_user;'",
        "destination_url": "https://example.com/api/v1/receive-data/"
    }
  4. Verify the following:
    • The async task executes the queries and posts the results to the destination URL.
    • Any errors or missing fields result in appropriate error messages.
    • Ensure the task logs provide sufficient information.

Notes

  • The new API does not persist data locally; it dynamically processes and forwards the results.
  • Future enhancements could include:
    • Better query caching or optimization.

@magajh magajh changed the title feat: add API to generate reports feat: add task to generate reports Jan 14, 2025
@magajh magajh force-pushed the maga/implement-task-to-generate-reports branch from 833d135 to fe44aaa Compare January 22, 2025 05:40
@magajh magajh force-pushed the maga/implement-task-to-generate-reports branch from d7ac416 to 12ed527 Compare January 22, 2025 10:44
@magajh magajh changed the title feat: add task to generate reports feat: add Data Collector API Jan 22, 2025
@magajh magajh marked this pull request as ready for review January 22, 2025 11:04
@magajh magajh requested a review from a team as a code owner January 22, 2025 11:04
Copy link
Member

@felipemontoya felipemontoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me there are serious concerns about the security of this feature. We have it on a test environment and I will do all my testing there before I requests changes. However I need to voice the concern that "as is" this must not be merged.

@luisfelipec95
Copy link
Contributor

luisfelipec95 commented Feb 24, 2025

Refactor:

  • Replaced External Query File with predefined queries.
  • Removed query_file_content and query_file_url from request handling.
  • New setting: AGGREGATED_DATA_COLLECTOR_API_ENABLED (default: False) to Enable/Disable the API Endpoint.
  • Added a Filter to Reject Non-SELECT Queries

@felipemontoya
Copy link
Member

@luisfelipec95 podríá pedirte que el setting sea: AGGREGATED_DATA_COLLECTOR_API_ENABLED ? Quiero que si alguien ve ese setting no le entre la duda de si los datos son PII

@luisfelipec95
Copy link
Contributor

@luisfelipec95 podríá pedirte que el setting sea: AGGREGATED_DATA_COLLECTOR_API_ENABLED ? Quiero que si alguien ve ese setting no le entre la duda de si los datos son PII

Done

@luisfelipec95 luisfelipec95 force-pushed the maga/implement-task-to-generate-reports branch from 52e4111 to b9e0f25 Compare February 24, 2025 23:18
Copy link
Contributor Author

@magajh magajh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luisfelipec95 Thanks for all the work!

We need to add tests for this implementation. I also noticed that the functions don’t use type hints—we should add them.

@magajh
Copy link
Contributor Author

magajh commented Feb 25, 2025

Also @luisfelipec95, could you please specify how this has been tested?

Copy link
Contributor Author

@magajh magajh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@magajh magajh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! We can approve and merge this once the missing docstring is added and the unit tests are fixed

@luisfelipec95 @felipemontoya

@magajh magajh requested a review from felipemontoya March 3, 2025 16:10
@felipemontoya felipemontoya changed the title feat: add Data Collector API feat: add aggregated data collector API Mar 3, 2025
Copy link
Member

@felipemontoya felipemontoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Must be explicitly enabled, the queries are fixed and completely anonymous. The API call is securely authenticated.
Nobody that does not want this enabled will get it by accident.

@luisfelipec95 luisfelipec95 merged commit fd092b3 into master Mar 3, 2025
4 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants