
Conversation

@collinwr (Contributor) commented on Dec 30, 2025

Description:

This creates a new management command that runs Spark SQL queries. The queries can be provided in one of two ways (a rough sketch of how the command might resolve these inputs follows the examples below):

  • Direct SQL string - A string containing one or more Spark SQL queries provided as a command argument
    ex: execute_spark_sql --sql "SELECT * FROM rpt.transaction_search limit 10;"
    ex with Makefile: make docker-compose-spark-submit django_command="execute_spark_sql --sql \"SELECT * FROM rpt.transaction_search limit 10;\""
  • A file - Provide a file path (local, http, or S3) as an argument. Can contain multiple SQL queries separated by semicolons:
    ex: execute_spark_sql --file /project/test.sql
    ex with Makefile: make docker-compose-spark-submit django_command="execute_spark_sql --file /project/test.sql"
Both file examples above assume that Spark commands are running in Docker and that /project/test.sql is the path inside the container. I was able to get this working by putting test.sql directly in the usaspending-api repo's root directory (which is mounted to /project in Docker).
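
To make the two input modes concrete, here is a minimal sketch of how the command might turn --sql or --file into a list of statements. The helper name, error handling, and the use of sqlparse are illustrative assumptions, not a description of the merged implementation.

```python
# Illustrative sketch only; resolve_sql_statements is a hypothetical helper,
# not necessarily the function used by the merged command.
from pathlib import Path
from typing import List, Optional

import sqlparse  # already available as a Django dependency (see discussion below)


def resolve_sql_statements(sql: Optional[str], file: Optional[str]) -> List[str]:
    """Return individual SQL statements from either a --sql string or a --file path."""
    if bool(sql) == bool(file):
        raise ValueError("Provide exactly one of --sql or --file")
    if sql:
        raw_sql = sql
    else:
        if file.startswith(("s3://", "http://", "https://")):
            # Remote (S3/HTTP) sources would need to be fetched first; the
            # fetching mechanism is intentionally omitted from this sketch.
            raise NotImplementedError("Remote file fetching not shown here")
        raw_sql = Path(file).read_text()
    # Split on statement boundaries instead of naively splitting on semicolons
    return [stmt for stmt in sqlparse.split(raw_sql) if stmt.strip()]
```

For example, resolve_sql_statements("SELECT 1; SELECT 2;", None) yields two statements, matching the multi-query behavior described above.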

Technical Details:

For running this in Jenkins, I'd suggest the --sql approach for now. My plan is to work with the OPS team to create a new Jenkins job that copies the contents of an input box into a file in S3 and then passes that S3 path via --file, which makes it easy to run multiple queries at once.

I've also included a few different arguments to control behavior like --create-temp-views, --result-limit, and --dry-run.
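
The flag names above come from this PR, but their exact types and defaults aren't spelled out here, so the wiring below is only a guess at how such a command could be structured (it reuses the hypothetical resolve_sql_statements helper from the earlier sketch):

```python
# Hypothetical argument wiring; defaults and help text are assumptions, and
# resolve_sql_statements is the illustrative helper defined in the earlier sketch.
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Run one or more Spark SQL queries from a string or a file."

    def add_arguments(self, parser):
        source = parser.add_mutually_exclusive_group(required=True)
        source.add_argument("--sql", help="SQL string containing one or more statements")
        source.add_argument("--file", help="Local, HTTP, or S3 path to a .sql file")
        parser.add_argument("--create-temp-views", action="store_true",
                            help="Create temporary views before running the queries")
        parser.add_argument("--result-limit", type=int, default=100,
                            help="Maximum number of result rows to display per query")
        parser.add_argument("--dry-run", action="store_true",
                            help="Log the parsed statements without executing them")

    def handle(self, *args, **options):
        statements = resolve_sql_statements(options["sql"], options["file"])
        for statement in statements:
            if options["dry_run"]:
                self.stdout.write(f"[dry-run] {statement}")
                continue
            # Execution (e.g., spark.sql(statement).show(options["result_limit"]))
            # is omitted here; Spark session setup is out of scope for this sketch.
```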

Docker Compose Updates

I also updated docker-compose.yml to account for the case where Docker-based Spark commands are run from a Dev Container. I was running into an issue where my /project volume was empty. The source for this volume was previously set to `.`; that doesn't work from within a Dev Container because the path is resolved on the host OS, not against the directory inside the Dev Container where the Docker Compose command is executed. I added an override environment variable, PROJECT_DIRECTORY, with a default of the existing value `.` so as not to break anyone's existing workflow.
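
For illustration, the volume source override could look something like the excerpt below. This is a sketch, not the exact contents of the repo's docker-compose.yml; the service name is a placeholder.

```yaml
# Illustrative excerpt only; the service name is a placeholder.
services:
  spark-submit:
    volumes:
      # Defaults to "." so existing host-based workflows are unchanged.
      # From a Dev Container, export PROJECT_DIRECTORY as the absolute host
      # path to the repository so Docker resolves the mount correctly.
      - ${PROJECT_DIRECTORY:-.}:/project
```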

Requirements for PR Merge:

  1. Unit & integration tests updated
  2. N/A - API documentation updated (examples listed below)
    1. API Contracts
    2. API UI
    3. Comments
  3. N/A - Data validation completed (examples listed below)
    1. Does this work well with the current frontend? Or is the frontend aware of a needed change?
    2. Is performance impacted in the changes (e.g., API, pipeline, downloads, etc.)?
    3. Is the expected data returned with the expected format?
  4. Appropriate Operations ticket(s) created
  5. Jira Ticket(s)
    1. DEV-14237

Explain N/A in above checklist:

@collinwr added the "do not merge" and "in progress" labels on Dec 30, 2025
@collinwr removed the "do not merge" and "in progress" labels on Jan 2, 2026
Review thread on this docker-compose.yml snippet:

dockerfile: Dockerfile.testing
container_name: usaspending-test
volumes:
  - .:/usaspending-api
Contributor:
I'm not sure this is necessary to include in the docker-compose.yml since the default assumption is that the docker-compose.yml file is located in the project root directory.

Contributor Author:
The need for this arises when starting tests or Spark commands from a dev container. The problem is that the volume source path is resolved against the base OS filesystem (in our case, Windows), not the dev container's. In my case, I need to override the `.` with the full Windows path to our project directory.

Contributor:

I was curious whether there might be a library to help us with some of this SQL parsing instead of rolling our own. This one seems pretty popular: https://github.com/tobymao/sqlglot. Of course, there are pros and cons to adding another dependency to keep track of. There's also sqlparse's split helper: https://sqlparse.readthedocs.io/en/latest/api.html#sqlparse.split

sqlparse.split(sql: str, encoding: str | None = None, strip_semicolon: bool = False) → List[str]

    Split sql into single statements.

    Parameters:
        sql – A string containing one or more SQL statements.
        encoding – The encoding of the statement (optional).
        strip_semicolon – If True, remove trailing semicolons (default: False).

    Returns: A list of strings.
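
For reference, a quick example of what that helper does (sqlparse is importable here since it's already a Django dependency, as noted below):

```python
import sqlparse

raw = """
SELECT * FROM rpt.transaction_search LIMIT 10;
SELECT count(*) FROM rpt.transaction_search;
"""

# Each element is one complete statement (trailing semicolons are kept by
# default; pass strip_semicolon=True to drop them).
for statement in sqlparse.split(raw):
    print(statement)
```

Because it splits on real statement boundaries, semicolons inside string literals or comments don't break the split the way a naive str.split(";") would.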

Contributor Author:
sqlparse meets this use case perfectly, and it's already installed in our application because it's a dependency of Django. I even swapped it in for my helper in the test file to confirm it passed my test cases (although I ended up removing that test because I removed the helper function).

@collinwr force-pushed the ftr/dev-14237-execute-spark-sql branch from 0266edf to 14fd170 on January 6, 2026 at 17:52
@collinwr merged commit 1071861 into qat on Jan 6, 2026 (32 of 38 checks passed)
