
PySpark on Dataproc with a custom Python/conda environment

I use this Dataproc job server to run all jobs, and gs://dingoproc/pyspark for all code and binaries. You can create your own job server and adjust the parameters to match your environment.

dingo.py: the actual PySpark job we want to run (see the sketch after this file list).

debug.py: used to debug when the job fails with a File Not Found error.

run.py: a workaround that extracts the provided conda env and then runs dingo.py with it.
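For orientation, here is a minimal sketch of what a job like dingo.py could look like. The pandas round-trip is only there to prove the packed env (Python 3.12 with pandas and scikit-learn) is actually in use; the real dingo.py in this repo may do something different.

from pyspark.sql import SparkSession
import pandas as pd  # resolved from the packed conda env, not the cluster default

spark = SparkSession.builder.appName("dingo").getOrCreate()

# Build a tiny DataFrame and round-trip it through pandas to confirm
# that the custom environment is the one executing Python code.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
print(df.toPandas())

spark.stop()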

prepare the conda runtime environment

create a conda env

conda create -n dingo-spark-env python=3.12 pandas scikit-learn

activate the environment

conda activate dingo-spark-env

(optional) install conda-pack in your new environment if you don't have it already

pip install conda-pack

pack the environment

conda pack -n dingo-spark-env -o dingo_env.tar.gz

upload to Google Cloud Storage (GCS)

gcloud storage cp dingo_env.tar.gz *.py gs://dingoproc/pyspark/

run the job

update the *.sh shell scripts to match your environment (job server id, region, etc.)

Run ./dingo.sh. If it works, congratulations, you can stop here.
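For reference, dingo.sh is presumably a thin wrapper around a gcloud dataproc jobs submit pyspark call along these lines. The cluster name and region below are placeholders; the #dingo archive alias and the spark.pyspark.python property are inferred from the ./dingo/bin/python path in the error shown next.

gcloud dataproc jobs submit pyspark gs://dingoproc/pyspark/dingo.py \
    --cluster=your-cluster \
    --region=your-region \
    --archives=gs://dingoproc/pyspark/dingo_env.tar.gz#dingo \
    --properties=spark.pyspark.python=./dingo/bin/python,spark.pyspark.driver.python=./dingo/bin/python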

If instead you see the following error, the conda env was probably not extracted properly:

Exception in thread "main" java.io.IOException: Cannot run program "./dingo/bin/python": error=2, No such file or directory
        at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
        at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
        at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:98)
        at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1156)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:195)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:218)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:92)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1248)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1257)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
        at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
        at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
        at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
        at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
        ... 15 more

Now run ./debug.sh to verify:

--- Walking current directory ---
In folder: .
  File: debug.py
  File: dingo_env.tar.gz
  File: dingo.py
  File: .debug.py.crc
  File: .dingo_env.tar.gz.crc
  File: .dingo.py.crc
--- End of walk ---
Job [afceed94288a4c7d86e05e90b445538a] finished successfully.
done: true
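The directory walk above can be produced by something as simple as the following; this is a sketch of what debug.py likely does, with the exact strings matching the output.

import os

# Walk the job's working directory and list everything Spark shipped with it,
# so we can see whether dingo_env.tar.gz arrived and whether it was extracted.
print("--- Walking current directory ---")
for folder, _dirs, files in os.walk("."):
    print(f"In folder: {folder}")
    for name in files:
        print(f"  File: {name}")
print("--- End of walk ---")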

If you see that dingo_env.tar.gz is there but NOT extracted, run ./run.sh, which submits run.py; run.py extracts the archive manually and then runs dingo.py as a subprocess.

--> Step 1: Extracting 'dingo_env.tar.gz' to 'dingo'...
Extraction successful.

--> Step 2: Running 'dingo.py' with interpreter 'dingo/bin/python'...
Executing command: dingo/bin/python dingo.py
------------------------------

------------------------------
Script executed successfully.

Process finished.
Job [56e7bc7f60684169a316840aabe4d889] finished successfully.
done: true
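A minimal sketch of what run.py does to produce the output above, assuming the tarball reaches the working directory unextracted (e.g. shipped as a plain file rather than via --archives):

import subprocess
import tarfile

ARCHIVE = "dingo_env.tar.gz"
ENV_DIR = "dingo"
JOB = "dingo.py"

# Step 1: extract the packed conda env into ./dingo
print(f"--> Step 1: Extracting '{ARCHIVE}' to '{ENV_DIR}'...")
with tarfile.open(ARCHIVE, "r:gz") as tar:
    tar.extractall(path=ENV_DIR)
print("Extraction successful.")

# Step 2: run the real job with the interpreter from the extracted env
interpreter = f"{ENV_DIR}/bin/python"
print(f"\n--> Step 2: Running '{JOB}' with interpreter '{interpreter}'...")
print(f"Executing command: {interpreter} {JOB}")
print("-" * 30)
result = subprocess.run([interpreter, JOB])
print("-" * 30)
if result.returncode == 0:
    print("Script executed successfully.")
print("\nProcess finished.")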
