I use this Dataproc job server to run all jobs, and gs://dingoproc/pyspark for all code and binaries. You can create your own job server and adjust the parameters to match your environment.
dingo.py: the actual PySpark job we want to run (a placeholder sketch is shown after this list).
debug.py: used to debug in case the job runs into a File Not Found error.
run.py: the workaround that extracts the provided conda env and runs dingo.py with it.
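dingo.py itself can be any PySpark job; the snippet below is only a hypothetical placeholder (not the real dingo.py) that exercises the pandas and scikit-learn packages the conda env provides.

# Hypothetical placeholder for dingo.py; the real job can be anything.
# This one just proves the packed conda env (pandas, scikit-learn) is usable.
import pandas as pd
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dingo").getOrCreate()

# Fit a trivial model on the driver to exercise the conda-provided libraries.
pdf = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})
model = LinearRegression().fit(pdf[["x"]], pdf["y"])
print("coefficient:", model.coef_[0])

# Touch Spark itself so the job also exercises the cluster.
spark.createDataFrame(pdf).show()
spark.stop()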
create a conda env
conda create -n dingo-spark-env python=3.12 pandas scikit-learn
activate the environment
conda activate dingo-spark-env
(optional) install conda-pack in your new environment if you don't have it already
pip install conda-pack
pack the environment
conda pack -n dingo-spark-env -o dingo_env.tar.gz
upload the packed environment and the Python files to GCS
gcloud storage cp dingo_env.tar.gz *.py gs://dingoproc/pyspark/
update the *.sh shell scripts to match your environment (job server ID, region, etc.)
Run ./dingo.sh to submit the job. If it works, congratulations, you can stop here.
If you instead see the following error, the conda env was probably not extracted properly:
Exception in thread "main" java.io.IOException: Cannot run program "./dingo/bin/python": error=2, No such file or directory
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:98)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1156)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:218)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:92)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1248)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1257)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
... 15 more
Now run ./debug.sh to verify what actually landed in the job's working directory:
--- Walking current directory ---
In folder: .
File: debug.py
File: dingo_env.tar.gz
File: dingo.py
File: .debug.py.crc
File: .dingo_env.tar.gz.crc
File: .dingo.py.crc
--- End of walk ---
Job [afceed94288a4c7d86e05e90b445538a] finished successfully.
done: true
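debug.py itself is not shown in this section; a minimal sketch that would produce the walk output above (assuming a plain os.walk over the working directory) is:

# Sketch of debug.py, reconstructed from the output above (assumed, not the exact source):
# walk the job's current working directory and print every folder and file,
# so you can see whether dingo_env.tar.gz arrived and whether it was extracted.
import os

print("--- Walking current directory ---")
for folder, _dirs, files in os.walk("."):
    print(f"In folder: {folder}")
    for name in files:
        print(f"File: {name}")
print("--- End of walk ---")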
If dingo_env.tar.gz is present but NOT extracted, run ./run.sh: it uses run.py to extract the archive manually and then run dingo.py as a subprocess.
--> Step 1: Extracting 'dingo_env.tar.gz' to 'dingo'...
Extraction successful.
--> Step 2: Running 'dingo.py' with interpreter 'dingo/bin/python'...
Executing command: dingo/bin/python dingo.py
------------------------------
------------------------------
Script executed successfully.
Process finished.
Job [56e7bc7f60684169a316840aabe4d889] finished successfully.
done: true
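run.py is likewise not shown here; a minimal sketch of the workaround it performs, reconstructed from the output above (extract dingo_env.tar.gz, then run dingo.py under the extracted interpreter as a subprocess), could look like:

# Sketch of run.py (assumed from the output above, not the exact source):
# manually extract the packed conda env, then run dingo.py with its Python.
import subprocess
import sys
import tarfile

ARCHIVE = "dingo_env.tar.gz"
ENV_DIR = "dingo"
INTERPRETER = f"{ENV_DIR}/bin/python"

print(f"--> Step 1: Extracting '{ARCHIVE}' to '{ENV_DIR}'...")
with tarfile.open(ARCHIVE) as tar:
    tar.extractall(path=ENV_DIR)
print("Extraction successful.")

print(f"--> Step 2: Running 'dingo.py' with interpreter '{INTERPRETER}'...")
print(f"Executing command: {INTERPRETER} dingo.py")
print("-" * 30)
result = subprocess.run([INTERPRETER, "dingo.py"])
print("-" * 30)
if result.returncode == 0:
    print("Script executed successfully.")
else:
    print(f"Script failed with exit code {result.returncode}.")
    sys.exit(result.returncode)
print("Process finished.")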