A MapReduce framework for spatial data
Here we provide several real datasets that can be used to test your operations in implemented in SpatialHadoop. All these datasets are extracted from public sources and can be freely used and redistributed. We highly recommend that you link to this page instead of copying the datasets because we will be updating these datasets and adding more.
If you find these datasets useful, consider citing our SpatialHadoop paper which made it possible to extract and process these big datasets.
@inproceedings{DBLP:conf/icde/EldawyM15, author = {Ahmed Eldawy and Mohamed F. Mokbel}, title = {{SpatialHadoop: {A} MapReduce Framework for Spatial Data}}, booktitle = {31st {IEEE} International Conference on Data Engineering, {ICDE} 2015, Seoul, South Korea, April 13-17, 2015}, pages = {1352--1363}, year = {2015}, crossref = {DBLP:conf/icde/2015}, url = {http://dx.doi.org/10.1109/ICDE.2015.7113382}, doi = {10.1109/ICDE.2015.7113382}, timestamp = {Tue, 09 Jun 2015 07:58:12 +0200}, biburl = {http://dblp.uni-trier.de/rec/bib/conf/icde/EldawyM15}, bibsource = {dblp computer science bibliography, http://dblp.org} }
TIGER datasets are all extracted from US Census Bureau TIGER files. The datasets published here are all extracted from the files of 2015. They are all converted from Shapefiles to CSV to be easily processed in SpatialHadoop. We are planning to provide built-in support of Shapefiles in SpatialHadoop which will allow processing shapefiles directly from their sources. Each line in the CSV file contains a shape represented in Well-Known Text (WKT) format followed by other meta information for this record. To process these records in SpatialHadoop, use the built-in data type TigerShape. You can do this by specifying the command line parameter 'shape:tiger' when calling one of SpatialHadoop functions.
For convenience, all files are provided in compressed format (.bz2). SpatialHadoop can process the compressed files directly. It will automatically decompress the files on the fly while being processed. You can also choose to decompress the files first and then work on the decompressed files directly. All file sizes showed below are for the uncompressed files.
Dataset | Description | Size | Records | Schema | Overview | Download |
---|---|---|---|---|---|---|
AREALM | Area Landmark | 140MB | 129K Polygons | schema | Click to enlarge |
Download [42MB download size] |
AREAWATER | Area Hydrography | 2.1GB | 2.3M Polygons | schema | Click to enlarge |
Download [613MB download size] |
COUNTY | Counties | 149MB | 3,233 Polygons | schema | Click to enlarge |
Download [47MB download size] |
EDGES | All Edges Combined | 23 GB | 70 million polygons | schema | Click to enlarge |
Download [5.7GB download size] |
LINEARWATER | Linear Hydrography | 6GB | 5.8M Linestrings | schema | Click to enlarge |
Download [4.1GB download size] |
PRIMARYROADS | Primary Roads | 52 MB | 12,396 Linestrings | schema | Click to enlarge |
Download [15MB download size] |
RAILS | Rail roads | 79 MB | 181 million linestrings | schema | Click to enlarge |
Download [23MB download size] |
ROADS | Roads | 7.7 GB | 20 million Linestrings | schema | Click to enlarge |
Download [2GB download size] |
STATE | States | 17 MB | 56 polygons | schema | Click to enlarge |
Download [6MB download size] |
ZCTA5 | 5-Digit ZIP Code Tabulation Area | 1 GB | 33,144 Polygons | schema | Click to enlarge |
Download [310MB download size] |
Similar to the previous dataset but for the year of 2012. The geometry of shapes is represented in base-64-encoded Well-Known Binary (WKB).
Dataset | Description | Size | Records | Schema | Overview | Download |
---|---|---|---|---|---|---|
arealm | Area Landmark | 406MB | 122K Polygons | schema | Click to enlarge |
Download [99MB download size] |
areawater | Area Hydrography | 6.5GB | 2.3M Polygons | schema | Click to enlarge |
Download [1.4GB download size] |
edges | All Edges Combined | 62GB | 72.7M Polygons | schema | Click to enlarge |
Download [14GB download size] |
linearwater | Linear Hydrography | 18.3GB | 5.9M Linestrings | schema | Click to enlarge |
Download [4.1GB download size] |
primaryroads | Primary Roads | 77MB | 13,373 Linestrings | schema | Click to enlarge |
Download [23MB download size] |
zcta510 | 5-Digit ZIP Code Tabulation Area | 1.6GB | 33,144 Polygons | schema | Click to enlarge |
Download [458MB download size] |
This dataset is extracted from OpenStreetMap and represents map features for the whole world. The data is extracted from one planet.osm file and all the data is converted to a text format for simplicity. All files are formatted as either comma-separated-values (CSV) or tab-separated-values (TSV) which make them ready to be used with Pigeon scripts. Similar to TIGER datasets, all files are compressed for convenience and can be processed without decompression by SpatialHadoop and Pigeon scripts.
Dataset | Description | Size | Records | Schema | Overview | Download |
---|---|---|---|---|---|---|
All Nodes | All points on the planet | 96GB | 2.7 Billion records | schema | Click to enlarge |
Download [24.9GB download size] |
All Objects | All extracted map objects | 92GB | 263M records | schema | Click to enlarge |
Download [26GB download size] |
Buildings | Boundaries of all buildings | 26GB | 115M records | schema | Click to enlarge |
Download [6GB download size] |
Cemetery | Boundaries of cemetery areas | 56MB | 193M records | schema | Click to enlarge |
Download [17.4MB download size] |
Lakes | Boundaries of water areas | 9GB | 8.4M records | schema | Click to enlarge |
Download [2.7GB download size] |
Parks | Boundaries of parks or green areas | 9.3GB | 10M records | schema | Click to enlarge |
Download [2.9GB download size] |
Postal codes | Boundaries of postal code areas (mostly cities) | 1.4GB | 171K records | schema | Click to enlarge |
Download [477MB download size] |
Road network | Road network represented as individual road segments | 137GB | 717M records | schema | Click to enlarge |
Download [15.4GB download size] |
Roads | Roads and streets around the world each represented as a polyline | 24GB | 72M records | schema | Click to enlarge |
Download [6.7GB download size] |
Sports | Boundaries of sporting areas | 590MB | 1.8M records | schema | Click to enlarge |
Download [172MB download size] |
This dataset is extracted from OpenStreetMap and represents map features for the whole world. The data is extracted from one planet.osm file and all the data is converted to a text format for simplicity. All files are formatted as Tab-Separated-Values which makes them ready to be used with Pigeon scripts. Similar to TIGER datasets, all files are compressed for convenience and can be processed without decompression by SpatialHadoop and Pigeon scripts.
In the nodes dataset, each node is identified by a node ID as stored in OpenStreetMap. The location of a point is represented by a pair of coordinates (longitude, latitude). The tags of a node are also stored as they appear in the planet file.
For the ways datasets, each way is identified by a unique ID as appears in OpenStreetMap planet file. The spatial information is stored as an OGC shape represented in Well-known text (WKT) format. Similar to nodes, the tags are stored as appearing in the planet file.
Dataset | Description | Size | Records | Schema | Overview | Download |
---|---|---|---|---|---|---|
all_nodes | All points on the planet | 62.3GB | 1.7 Billion points | schema | Click to enlarge |
Download [17GB download size] |
all_ways | All ways on the planet | 59.6GB | 164M Polygons | schema | Click to enlarge |
Download [17GB download size] |
cities | Ways that represent cities on the planet | 844MB | 542K Polygons | schema | Click to enlarge |
Download [278MB download size] |
lakes | Ways that represent lakes on the planet | 2.6GB | 4.3M Polygons | schema | Click to enlarge |
Download [798MB download size] |
parks | Ways that represent parks on the planet | 102MB | 234K Polygons | schema | Click to enlarge |
Download [34MB download size] |
rivers | Ways that represent rivers on the planet | 945MB | 555K Polygons | schema | Click to enlarge |
Download [303MB download size] |
roads | Ways that represent roads on the planet | 20.6GB | 59M Polygons | schema | Click to enlarge |
Download [5.6GB download size] |