
Conversation

@collinwr (Contributor) commented on Dec 30, 2025

Description:

This creates a new management command that runs Spark SQL queries. The queries can be provided in one of two ways (a rough sketch of how the command might resolve these inputs follows the examples below):

  • Direct SQL string - A string containing one or more Spark SQL queries provided as a command argument
    ex: execute_spark_sql --sql "SELECT * FROM rpt.transaction_search limit 10;"
    ex with Makefile: make docker-compose-spark-submit django_command="execute_spark_sql --sql \"SELECT * FROM rpt.transaction_search limit 10;\""
  • A file - Provide a file path (local, http, or S3) as an argument. Can contain multiple SQL queries separated by semicolons:
    ex: execute_spark_sql --file /project/test.sql
    ex with Makefile: make docker-compose-spark-submit django_command="execute_spark_sql --file /project/test.sql"
Both file examples above assume that Spark commands are running in Docker and that /project/test.sql is the path inside the container. I was able to get this working by putting test.sql directly in the usaspending-api repo's root directory (which is mounted to /project in Docker).
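
To make the two input modes concrete, here is a minimal sketch of how the command might turn --sql or --file into a list of statements. The helper name, error handling, and the use of sqlparse are illustrative assumptions, not a description of the merged implementation.

```python
# Illustrative sketch only; resolve_sql_statements is a hypothetical helper,
# not necessarily the function used by the merged command.
from pathlib import Path
from typing import List, Optional

import sqlparse  # already available as a Django dependency (see discussion below)


def resolve_sql_statements(sql: Optional[str], file: Optional[str]) -> List[str]:
    """Return individual SQL statements from either a --sql string or a --file path."""
    if bool(sql) == bool(file):
        raise ValueError("Provide exactly one of --sql or --file")
    if sql:
        raw_sql = sql
    else:
        if file.startswith(("s3://", "http://", "https://")):
            # Remote (S3/HTTP) sources would need to be fetched first; the
            # fetching mechanism is intentionally omitted from this sketch.
            raise NotImplementedError("Remote file fetching not shown here")
        raw_sql = Path(file).read_text()
    # Split on statement boundaries instead of naively splitting on semicolons
    return [stmt for stmt in sqlparse.split(raw_sql) if stmt.strip()]
```

For example, resolve_sql_statements("SELECT 1; SELECT 2;", None) yields two statements, matching the multi-query behavior described above.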

Technical Details:

For running this in Jenkins, I'd suggest the --sql approach for now. My plan is to work with the OPS team to create a new Jenkins job that copies the contents of an input box into a file in S3 and then passes that S3 path via --file, which makes it easy to run multiple queries at once.

I've also included a few different arguments to control behavior like --create-temp-views, --result-limit, and --dry-run.
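
The flag names above come from this PR, but their exact types and defaults aren't spelled out here, so the wiring below is only a guess at how such a command could be structured (it reuses the hypothetical resolve_sql_statements helper from the earlier sketch):

```python
# Hypothetical argument wiring; defaults and help text are assumptions, and
# resolve_sql_statements is the illustrative helper defined in the earlier sketch.
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Run one or more Spark SQL queries from a string or a file."

    def add_arguments(self, parser):
        source = parser.add_mutually_exclusive_group(required=True)
        source.add_argument("--sql", help="SQL string containing one or more statements")
        source.add_argument("--file", help="Local, HTTP, or S3 path to a .sql file")
        parser.add_argument("--create-temp-views", action="store_true",
                            help="Create temporary views before running the queries")
        parser.add_argument("--result-limit", type=int, default=100,
                            help="Maximum number of result rows to display per query")
        parser.add_argument("--dry-run", action="store_true",
                            help="Log the parsed statements without executing them")

    def handle(self, *args, **options):
        statements = resolve_sql_statements(options["sql"], options["file"])
        for statement in statements:
            if options["dry_run"]:
                self.stdout.write(f"[dry-run] {statement}")
                continue
            # Execution (e.g., spark.sql(statement).show(options["result_limit"]))
            # is omitted here; Spark session setup is out of scope for this sketch.
```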

Docker Compose Updates

I also updated docker-compose.yml to account for the case where Docker-based Spark commands are run from a Dev Container. I was running into an issue where my /project volume was empty. The source for this volume was previously set to `.`; that doesn't work from within a Dev Container because the path is resolved on the host OS, not against the directory inside the Dev Container where the Docker Compose command is executed. I added an override environment variable, PROJECT_DIRECTORY, with a default of the existing value `.` so as not to break anyone's existing workflow.
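
For illustration, the volume source override could look something like the excerpt below. This is a sketch, not the exact contents of the repo's docker-compose.yml; the service name is a placeholder.

```yaml
# Illustrative excerpt only; the service name is a placeholder.
services:
  spark-submit:
    volumes:
      # Defaults to "." so existing host-based workflows are unchanged.
      # From a Dev Container, export PROJECT_DIRECTORY as the absolute host
      # path to the repository so Docker resolves the mount correctly.
      - ${PROJECT_DIRECTORY:-.}:/project
```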

Requirements for PR Merge:

  1. Unit & integration tests updated
  2. N/A - API documentation updated (examples listed below)
    1. API Contracts
    2. API UI
    3. Comments
  3. N/A - Data validation completed (examples listed below)
    1. Does this work well with the current frontend? Or is the frontend aware of a needed change?
    2. Is performance impacted in the changes (e.g., API, pipeline, downloads, etc.)?
    3. Is the expected data returned with the expected format?
  4. Appropriate Operations ticket(s) created
  5. Jira Ticket(s)
    1. DEV-14237

Explain N/A in above checklist:

@collinwr added the "do not merge" and "in progress" labels on Dec 30, 2025
@collinwr removed the "do not merge" and "in progress" labels on Jan 2, 2026
Review thread on this docker-compose.yml snippet:

dockerfile: Dockerfile.testing
container_name: usaspending-test
volumes:
  - .:/usaspending-api
Contributor:
I'm not sure this is necessary to include in the docker-compose.yml since the default assumption is that the docker-compose.yml file is located in the project root directory.

Contributor Author:
The need for this arises when starting tests or Spark commands from a dev container. The problem is that the volume source path is resolved against the base OS filesystem (in our case, Windows), not the dev container's. In my case, I need to override the `.` with the full Windows path to our project directory.

Contributor:

I was curious whether there might be a library to help us with some of this SQL parsing instead of rolling our own. This one seems pretty popular: https://github.com/tobymao/sqlglot. Of course, there are pros and cons to adding another dependency to keep track of. There's also sqlparse's split helper: https://sqlparse.readthedocs.io/en/latest/api.html#sqlparse.split

sqlparse.split(sql: str, encoding: str | None = None, strip_semicolon: bool = False) → List[str]

    Split sql into single statements.

    Parameters:
        sql – A string containing one or more SQL statements.
        encoding – The encoding of the statement (optional).
        strip_semicolon – If True, remove trailing semicolons (default: False).

    Returns: A list of strings.
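
For reference, a quick example of what that helper does (sqlparse is importable here since it's already a Django dependency, as noted below):

```python
import sqlparse

raw = """
SELECT * FROM rpt.transaction_search LIMIT 10;
SELECT count(*) FROM rpt.transaction_search;
"""

# Each element is one complete statement (trailing semicolons are kept by
# default; pass strip_semicolon=True to drop them).
for statement in sqlparse.split(raw):
    print(statement)
```

Because it splits on real statement boundaries, semicolons inside string literals or comments don't break the split the way a naive str.split(";") would.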

Contributor Author:
sqlparse meets this use case perfectly, and it's already installed in our application because it's a dependency of Django. I even swapped it in for my helper in the test file to confirm it passed my test cases (although I ended up removing that test because I removed the helper function).

@collinwr force-pushed the ftr/dev-14237-execute-spark-sql branch from 0266edf to 14fd170 on January 6, 2026 at 17:52
@collinwr merged commit 1071861 into qat on Jan 6, 2026 (32 of 38 checks passed)
