- Enable the required GCP APIs (Dataflow API, Cloud Storage JSON API, Logging API, BigQuery API, Storage API, Datastore API) from the GCP console UI; the same can be done with `gcloud`, as sketched below.
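  As a rough sketch, the same APIs can be enabled from the command line; the service IDs below are the standard ones, so verify them (e.g. with `gcloud services list --available`) before running:

  ```bash
  # Enable the GCP services used by the Dataflow pipelines.
  gcloud services enable \
    dataflow.googleapis.com \
    storage-api.googleapis.com \
    logging.googleapis.com \
    bigquery.googleapis.com \
    storage-component.googleapis.com \
    datastore.googleapis.com
  ```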
- Establish an environment for Beam:
  - create a conda environment: `conda create -n beam-sandbox`
  - activate it with `conda activate beam-sandbox`
  - install Beam with the GCP & test extras: `pip install "apache-beam[gcp,test]"`
- Test the environment with `python -m apache_beam.examples.wordcount --output beam/text`, then `cat beam/t*` to see the words and their counts.
- Create a bucket for Dataflow on GCP Storage, right after creating a GCP project! (See the `gsutil` sketch below.)
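  A minimal sketch with `gsutil`, assuming the `beam-pipelines-123` bucket name used in the next step and a `us-central1` region (bucket names are global, so pick your own unique name):

  ```bash
  # Create the bucket that will hold staging/temp files for the Dataflow jobs.
  gsutil mb -l us-central1 gs://beam-pipelines-123
  ```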
- Edit the `./run-count-dataflow.sh` file and change it with your `${PROJECT_ID}` (a rough sketch of such a script is shown below):
  - Create a bucket named `beam-pipelines-123`
  - Under this bucket, create a folder for every Beam file, such as `line-count`
  - then, staging and temp folders such as `line-count/staging` and `line-count/temp`
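  The actual script lives in this repo; as a rough sketch (not the repo's exact file), a Dataflow launch script for a Beam Python pipeline typically looks like this, assuming `line-count.py` forwards its command-line arguments to Beam's `PipelineOptions`:

  ```bash
  #!/usr/bin/env bash
  # Sketch of a typical Dataflow launch script; project, region, and bucket are placeholders.
  PROJECT_ID=my-gcp-project
  BUCKET=gs://beam-pipelines-123

  python line-count.py \
    --runner DataflowRunner \
    --project "${PROJECT_ID}" \
    --region us-central1 \
    --staging_location "${BUCKET}/line-count/staging" \
    --temp_location "${BUCKET}/line-count/temp" \
    --job_name line-count-job
  ```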
- Create a dataset bucket `gs://spark-dataset-1` on GCP Storage, and upload the `dataset` folder into it. Bucket-level public access is much better (see the sketch below).
- `export GOOGLE_APPLICATION_CREDENTIALS=PATH_OF_SERVICE_ACCOUNT.json` to point the client libraries at your service-account key.
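  A minimal sketch of both steps with `gsutil`, assuming the `dataset` folder sits next to the pipeline files; the `allUsers:objectViewer` IAM binding is what makes the bucket publicly readable at the bucket level:

  ```bash
  # Create the dataset bucket and upload the local dataset folder into it.
  gsutil mb gs://spark-dataset-1
  gsutil cp -r dataset gs://spark-dataset-1/

  # Optional: make the whole bucket publicly readable.
  gsutil iam ch allUsers:objectViewer gs://spark-dataset-1

  # Point the Google client libraries at your service-account key file.
  export GOOGLE_APPLICATION_CREDENTIALS=PATH_OF_SERVICE_ACCOUNT.json
  ```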
- To run:
  - `python line-count.py` on your local machine (uses DirectRunner), or
  - run `./line-count-dataflow.sh` on your local machine or a GCP shell/instance (uses DataflowRunner)
- Look at the Dataflow UI on the GCP console to see the Dataflow jobs running.
- Check the logs (a `gcloud` sketch is below).
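  As a sketch, the same information is also available from the command line; the log filter below assumes the standard `dataflow_step` resource type and may need adjusting for your project and region:

  ```bash
  # List Dataflow jobs and their current states.
  gcloud dataflow jobs list --region us-central1

  # Read recent Dataflow step logs (adjust the filter and limit as needed).
  gcloud logging read 'resource.type="dataflow_step"' --limit 20
  ```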