A library for parsing and querying shapefile data with Apache Spark, for Spark SQL and DataFrames.
This library requires Spark 2.0+. Before building, make sure to clone and install https://github.com/Esri/geometry-api-java.git
Start the Spark shell with the package:

$SPARK_HOME/bin/spark-shell --packages com.esri:spark-shp:0.30

This package allows reading shapefiles on a local or distributed filesystem as Spark DataFrames. When reading files, the API accepts several options:

- path: The location of the shapefile(s). As with Spark, standard Hadoop globbing expressions are accepted.
- shape: An optional name for the shape column. Default value is shape.
- columns: An optional comma-separated list of attribute column names. Default value is blank, indicating all attribute fields.
- format: An optional parameter that defines the output format of the shape field. Default value is SHP. Possible values are:
  - SHP: Esri binary shape format.
  - WKT: Well-Known Text.
  - WKB: Well-Known Binary.
  - GEOJSON: GeoJSON.
- repair: An optional parameter to repair the read geometry. Possible values are:
  - None: No repair.
  - Esri: Apply Esri repair operator.
  - OGC: Apply OGC repair operator.
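The reader options can be assembled and checked before they are handed to Spark. The sketch below is a minimal illustration only: `build_shp_options`, `VALID_FORMATS`, and `VALID_REPAIRS` are hypothetical names that are not part of this library, and the strict, case-sensitive matching of values is an assumption based on the spellings in the option list.

```python
# Hypothetical helper (not part of spark-shp): collects reader options into a
# dict and rejects values that are not in the documented lists.
VALID_FORMATS = {"SHP", "WKT", "WKB", "GEOJSON"}
VALID_REPAIRS = {"None", "Esri", "OGC"}


def build_shp_options(path, shape="shape", columns=None, format="SHP", repair=None):
    """Return a dict of reader options for a shapefile at `path`."""
    if format not in VALID_FORMATS:
        raise ValueError("unknown format: %r" % (format,))
    if repair is not None and repair not in VALID_REPAIRS:
        raise ValueError("unknown repair: %r" % (repair,))

    options = {"path": path, "shape": shape, "format": format}
    if columns:
        # The columns option is a single comma-separated string.
        options["columns"] = ",".join(columns)
    if repair is not None:
        # Leave repair unset unless requested; its default is not documented here.
        options["repair"] = repair
    return options


opts = build_shp_options(
    "data/gps.shp", columns=["atext", "adate"], format="WKT", repair="OGC"
)
print(opts)
```

With a running Spark session, the resulting dict could then be passed along as `spark.read.format("shp").options(**opts).load()`.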
CREATE TABLE gps
USING com.esri.spark.shp
OPTIONS
(
path "data/gps.shp"
)

df = spark.read \
.format("shp") \
.options(path="data/gps.shp", columns="atext,adate", format="GEOJSON") \
.load() \
.cache()
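With format="GEOJSON", the shape column is expected to hold GeoJSON text that can be decoded with a standard JSON parser. The sketch below is illustrative only: it assumes each value is a bare GeoJSON Point geometry (not a Feature wrapper), and both `sample_shape` and `point_xy` are hypothetical names with made-up sample data, not output from this library.

```python
import json

# Assumed sample value of the shape column; not read from a real shapefile.
sample_shape = '{"type": "Point", "coordinates": [-77.03, 38.89]}'


def point_xy(geojson_text):
    """Decode a GeoJSON Point geometry string into an (x, y) tuple."""
    geometry = json.loads(geojson_text)
    if geometry.get("type") != "Point":
        raise ValueError("expected a GeoJSON Point geometry")
    # Ignore any optional third coordinate (for example, elevation).
    x, y = geometry["coordinates"][:2]
    return float(x), float(y)


print(point_xy(sample_shape))
```

On a DataFrame, a function like this could be applied to rows collected to the driver, or wrapped in a Spark UDF if the parsing needs to run on the executors.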
This library is built using Apache Maven. To build the jar, execute the following command:
mvn clean install

- Download the shapefile of Metro Stations in DC
export ENV=spark-shp
conda remove --yes --all --name $ENV
conda create --yes --name $ENV python=3.6
source activate $ENV
conda install --yes --quiet -c conda-forge \
    jupyterlab \
    tqdm \
    future \
    matplotlib=3.1 \
    gdal=2.4 \
    pyproj=2.2 \
    shapely=1.6 \
    pyshp=2.1