Shapefile Data Source for Apache Spark

A library for parsing and querying shapefile data with Apache Spark, for Spark SQL and DataFrames.

Requirements

This library requires Spark 2.0+. Also make sure to clone and install https://github.com/Esri/geometry-api-java.git

Using with Spark shell

$SPARK_HOME/bin/spark-shell --packages com.esri:spark-shp:0.30

Features

This package allows reading shapefiles from a local or distributed filesystem as Spark DataFrames. When reading files, the API accepts several options:

path: The location of the shapefile(s). As with Spark itself, standard Hadoop globbing expressions are accepted.
shape: An optional name for the shape column. The default value is shape.
columns: An optional comma-separated list of attribute column names. The default value is blank, which selects all attribute fields.
format: An optional parameter that defines the output format of the shape field. The default value is SHP. Possible values are:
- SHP: Esri binary shape format.
- WKT: Well Known Text.
- WKB: Well Known Binary.
- GEOJSON: GeoJSON.
repair: An optional parameter to repair the geometry as it is read. Possible values are:
- None: No repair.
- Esri: Apply the Esri repair operator.
- OGC: Apply the OGC repair operator.
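
The options above can be gathered and checked in one place before they are handed to the reader. The following is a minimal sketch and not part of the library: `shp_options` is a hypothetical helper, and the accepted `format` and `repair` values are copied from the list above. Whether the library matches these values case-sensitively is not stated here, so the helper passes them through unchanged and omits anything unset so that the library's own defaults apply.

```python
# Hypothetical helper (not part of spark-shp): builds the reader options
# described above and rejects values that the option list does not mention.
FORMATS = {"SHP", "WKT", "WKB", "GEOJSON"}
REPAIRS = {"None", "Esri", "OGC"}


def shp_options(path, shape=None, columns=None, fmt=None, repair=None):
    """Return a dict of reader options; unset options are left out."""
    if fmt is not None and fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    if repair is not None and repair not in REPAIRS:
        raise ValueError(f"unknown repair: {repair!r}")
    opts = {"path": path}
    if shape is not None:
        opts["shape"] = shape
    if columns:
        # The option expects a single comma-separated string of names.
        opts["columns"] = ",".join(columns)
    if fmt is not None:
        opts["format"] = fmt
    if repair is not None:
        opts["repair"] = repair
    return opts
```

With a running session, the result could be passed along as `spark.read.format("shp").options(**shp_options("data/gps.shp", columns=["atext", "adate"], fmt="WKT")).load()`.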

SQL API

CREATE TABLE gps
    USING com.esri.spark.shp
    OPTIONS
(
    path "data/gps.shp"
)

Python API

df = spark.read \
    .format("shp") \
    .options(path="data/gps.shp", columns="atext,adate", format="GEOJSON") \
    .load() \
    .cache()
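
With format="GEOJSON", the shape column carries each geometry as GeoJSON. As a sketch of how one collected value might be consumed on the driver, the snippet below parses such a value with the standard library. It assumes the column holds a string containing a GeoJSON geometry object of type Point (as a GPS point shapefile would likely produce); the sample string and the helper name `point_xy` are illustrative and not taken from the library.

```python
import json


def point_xy(geojson_text):
    """Return (x, y) from a GeoJSON Point geometry string."""
    geom = json.loads(geojson_text)
    if geom.get("type") != "Point":
        raise ValueError(f"expected a Point, got {geom.get('type')!r}")
    x, y = geom["coordinates"][:2]  # ignore an optional third (z) value
    return x, y


# Illustrative value; a real one would come from a row, e.g. row["shape"].
sample = '{"type": "Point", "coordinates": [-77.03, 38.9]}'
print(point_xy(sample))
```

If the library emits a different GeoJSON structure (for example a Feature wrapper), the lookup of `type` and `coordinates` would need to be adjusted accordingly.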

Building From Source

This library is built using Apache Maven. To build the jar, execute the following command:

mvn clean install

Data

Download the shapefile of Metro Stations in DC

Create Conda Env

export ENV=spark-shp
conda remove --yes --all --name $ENV
conda create --yes --name $ENV python=3.6
source activate $ENV
conda install --yes --quiet -c conda-forge\
    jupyterlab\
    tqdm\
    future\
    matplotlib=3.1\
    gdal=2.4\
    pyproj=2.2\
    shapely=1.6\
    pyshp=2.1
