[DEV-14237] Create execute_spark_sql Command #4569
Conversation
…r compose to allow override of project dir for volume in spark-submit
… out sql statement splitting functionality into helper function
From docker-compose.yml:

```yaml
dockerfile: Dockerfile.testing
container_name: usaspending-test
volumes:
  - .:/usaspending-api
```
I'm not sure this is necessary to include in the docker-compose.yml since the default assumption is that the docker-compose.yml file is located in the project root directory.
The need for this arises when starting tests or Spark commands from a dev container. The problem is that the volume's source path is resolved against the host OS filesystem (in our case, Windows), not the dev container's filesystem. In my case, I need to override the `.` with the full Windows path to our project directory.
I was curious whether there might be a library that could help us with some of these SQL parsing functions instead of rolling our own. This one seems pretty popular: https://github.com/tobymao/sqlglot. Of course, there are pros and cons to adding another dependency to keep track of. sqlparse's split helper is another option: https://sqlparse.readthedocs.io/en/latest/api.html#sqlparse.split
sqlparse.split(sql: str, encoding: str | None = None, strip_semicolon: bool = False) → List[str]
Split sql into single statements.
Parameters:
sql – A string containing one or more SQL statements.
encoding – The encoding of the statement (optional).
strip_semicolon – If True, remove trailing semicolons (default: False).
Returns:
A list of strings.
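For illustration, a minimal sketch of what splitting a multi-statement string with `sqlparse.split` looks like (the statements below are made up for the example):

```python
import sqlparse

raw_sql = """
CREATE OR REPLACE TEMPORARY VIEW recent_transactions AS
SELECT * FROM rpt.transaction_search WHERE action_date >= '2024-01-01';

SELECT COUNT(*) FROM recent_transactions;
"""

# split() returns one string per statement; whitespace-only entries
# are filtered out here before doing anything with them.
statements = [s.strip() for s in sqlparse.split(raw_sql) if s.strip()]

for statement in statements:
    print(statement)
```

Per the docs quoted above, passing `strip_semicolon=True` would return each statement without its trailing semicolon.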
sqlparse meets this use case perfectly and is already installed in our application because it's a dependency of Django. I even swapped it out in my test file to see if it passed my test cases (although I ended up removing the test because I removed the helper function).
Force-pushed from 0266edf to 14fd170.
Description:

This creates a new command that runs Spark SQL queries. The queries can be provided either as a SQL string (via `--sql`) or as a path to a SQL file (via `--file`).

ex:

```
execute_spark_sql --sql "SELECT * FROM rpt.transaction_search limit 10;"
```

ex with Makefile:

```
make docker-compose-spark-submit django_command="execute_spark_sql --sql \"SELECT * FROM rpt.transaction_search limit 10;\""
```

ex:

```
execute_spark_sql --file /project/test.sql
```

ex with Makefile:

```
make docker-compose-spark-submit django_command="execute_spark_sql --file /project/test.sql"
```

Both file examples above assume that Spark commands are running in Docker and that `/project/test.sql` is the path within Docker. I was able to get this working by putting `test.sql` directly in the usaspending-api repo's root directory (which is mounted into `/project` in Docker).
Technical Details:

For running this in Jenkins, for now I'd say use the `--sql` approach, but my plan is to work with the OPS team to create a new Jenkins job that will copy the contents of an input box into a file in S3 and then provide a `--file` argument pointing to that S3 file, in order to easily run multiple queries.

I've also included a few other arguments to control behavior, like `--create-temp-views`, `--result-limit`, and `--dry-run`.
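For orientation, here is a rough, hypothetical sketch of how a management command with these flags could be wired together. It is not the implementation in this PR: the `--create-temp-views` handling is omitted, and `SparkSession.builder.getOrCreate()` stands in for however the project actually obtains its Spark session.

```python
import sqlparse
from django.core.management.base import BaseCommand
from pyspark.sql import SparkSession


class Command(BaseCommand):
    help = "Run one or more Spark SQL statements supplied as a string or a file."

    def add_arguments(self, parser):
        source = parser.add_mutually_exclusive_group(required=True)
        source.add_argument("--sql", help="SQL statement(s), separated by semicolons")
        source.add_argument("--file", help="Path to a file containing SQL statement(s)")
        parser.add_argument("--result-limit", type=int, default=10, help="Max rows to print per statement")
        parser.add_argument("--dry-run", action="store_true", help="Log parsed statements without running them")

    def handle(self, *args, **options):
        if options["sql"]:
            raw_sql = options["sql"]
        else:
            with open(options["file"]) as f:
                raw_sql = f.read()

        # Split the blob into individual statements (see sqlparse.split above).
        statements = [s for s in sqlparse.split(raw_sql) if s.strip()]

        # A real command would obtain the session through the project's own
        # Spark helpers; getOrCreate() is just a stand-in here.
        spark = SparkSession.builder.getOrCreate()

        for statement in statements:
            if options["dry_run"]:
                self.stdout.write(f"[dry-run] {statement}")
                continue
            spark.sql(statement).show(options["result_limit"])
```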
Docker Compose Updates:

I also updated docker-compose.yml to account for the case where Docker-based Spark commands are being run from a Dev Container. I was running into an issue where my `/project` volume was empty. The source for this volume was previously set to `.`; this doesn't work from within a Dev Container because the path is resolved on the host OS, not against the directory inside the Dev Container where the Docker Compose command is being executed. I created an override environment variable, `PROJECT_DIRECTORY`, with a default of the existing value of `.` so as not to break anyone's existing workflows.

Requirements for PR Merge:
Explain N/A in above checklist: