Who's On First — Data

tl;dr

If you already understand all the details (discussed below), or just feel like throwing caution to the wind, the Who's On First data is available from the following places:

Any individual Who's On First ID

Amazon S3

GitHub

But... what does it mean?

First principles

  • The canonical URL for any given Who's On First ID is relative. This might seem counter-intuitive to the point of even being contradictory.
  • Given any Who's On First ID it should be possible to generate a (relative) URL for that record using a simple and well-defined formula.

The current model is to split a Who's On First ID into 3-digit chunks representing nested subdirectories (the last chunk may be shorter), followed by a filename consisting of the ID followed by .geojson.

For example the ID for Montréal is 101736545 which becomes 101/736/545/101736545.geojson.
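
The formula described above can be sketched in a few lines of Python. This is a minimal sketch, not the project's own tooling: the function name id_to_rel_path is made up for illustration, and it assumes the ID is a non-negative integer.

```python
def id_to_rel_path(wof_id: int) -> str:
    """Return the relative path for a Who's On First ID.

    Hypothetical helper for illustration; assumes a non-negative integer ID.
    """
    if wof_id < 0:
        raise ValueError("this sketch only handles non-negative IDs")
    digits = str(wof_id)
    # Split the ID into chunks of up to three digits; the last chunk may be shorter.
    chunks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    # Nested subdirectories, then a filename made of the full ID plus .geojson.
    return "/".join(chunks) + "/" + digits + ".geojson"


print(id_to_rel_path(101736545))  # 101/736/545/101736545.geojson
```

Because the result is relative, it can be appended to whatever root a given host uses for its copy of the data.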

As of this writing it is clear that this approach (lots of tiny files parented by lots of nested directories) can be problematic. We may be forced to choose another approach, like fewer subdirectories, but nothing has been decided and anything we do will be backwards compatible. By definition this means anything we do (to rewrite or redirect any existing URLs that people are using) should be possible for anyone else hosting their own copy of the data.

And that's the important bit: It should be possible for multiple groups to host their own copy of the Who's On First data while still maintaining stable references to any place in the dataset.

For example, consider two organizations each with their own domains:

The URLs http://foo.com/data/101/736/545/101736545.geojson and http://bar.com/wof-data/101/736/545/101736545.geojson may be completely different but both refer to the city of Montréal, or Who's On First ID 101736545.

Even though each organization hosts its own copy of the Who's On First dataset (and the reasons for doing so are entirely their own business), they still have a simple and unintrusive way to preserve parity when referring to places.

Mapzen's canonical URL for Who's On First data is https://whosonfirst.mapzen.com/data/, and we expect that these URLs will become canonical for other people by virtue of our efforts around the project, but we've tried to design things in such a way that this doesn't have to be the case for everyone.
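
To make the parity point concrete, here is a rough sketch that recovers the Who's On First ID from a URL regardless of which host or root path serves it. The function name wof_id_from_url is hypothetical, and the sketch assumes the filename is always the ID followed by .geojson, as described above.

```python
import posixpath
from urllib.parse import urlparse


def wof_id_from_url(url: str) -> int:
    """Extract the Who's On First ID from the filename of a record URL.

    Hypothetical helper; assumes the filename is "<ID>.geojson".
    """
    filename = posixpath.basename(urlparse(url).path)
    stem, ext = posixpath.splitext(filename)
    if ext != ".geojson" or not stem.isdigit():
        raise ValueError("not a Who's On First record URL: " + url)
    return int(stem)


# Two hosts, two different roots, one and the same place.
a = wof_id_from_url("http://foo.com/data/101/736/545/101736545.geojson")
b = wof_id_from_url("http://bar.com/wof-data/101/736/545/101736545.geojson")
print(a == b)  # True
```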

The gory details

And by "gory details" we mean that we're assuming a basic familiarity and comfort level with the specific technologies described here.

Amazon S3

Our copy of the Who's On First data is stored in a big-honking Amazon Web Services (AWS) S3 bucket called whosonfirst.mapzen.com, in a sub-directory (or prefix, in AWS-speak) called data.

The details of fetching the entire contents of an S3 bucket are outside the scope of this document but the fact that you can is one of the reasons we chose S3.

We will eventually release our own tool for fetching the entirety of the S3 bucket but it's not ready yet.

GitHub

We haven’t quite figured out the best way of both distributing the Who’s On First data and accepting corrections or suggestions from the community. Even though the nice people at GitHub continue to do excellent work at making Git easier for a broader population to use, the reality remains that Git is a significant barrier to participation for many people.

Absent a more formal decision about an alternative, GitHub at least allows us to point in the general direction of:

  • An open and readily distributed dataset that people can download and work with.
  • A way for people to contribute corrections (and general nuance) about a place.
  • A way for us to be able to do everything above while still assuring us a measure of authority around the assertions we make about the data.
  • Also a way for us to think about how and where we store an audit trail (of sorts) for updates to a place.

There are still some very real problems working with Who's On First data in Git repositories, so it's possible that we will stop using them entirely.

whosonfirst-data versus whosonfirst-data-SOMETHING repositories

There is a lot of data in Who's On First, more than can practically fit in a single GitHub repository. Someday it may be possible but today it is not. To account for this fact Who's On First data has been separated into a number of different GitHub repositories organized by placetype and region.

The naming convention for repositories, at its most granular, is as follows:

whosonfirst-data + "-" + WHOSONFIRST_PLACETYPE + "-" + WHOSONFIRST_COUNTRY + "-" + WHOSONFIRST_SUBDIVISION
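
As a rough sketch, the convention can be written as a function that joins whichever parts are present. The function name repo_name is hypothetical, and the sample placetype, country and subdivision values below are illustrative assumptions rather than a list of repositories that actually exist.

```python
from typing import Optional


def repo_name(placetype: Optional[str] = None,
              country: Optional[str] = None,
              subdivision: Optional[str] = None) -> str:
    """Build a repository name from its optional parts, most general first.

    Hypothetical helper; a later part may not be given without the earlier ones.
    """
    parts = ["whosonfirst-data"]
    gap = False
    for part in (placetype, country, subdivision):
        if part is None:
            gap = True
        elif gap:
            raise ValueError("cannot skip a part of the repository name")
        else:
            parts.append(part)
    return "-".join(parts)


print(repo_name())                      # whosonfirst-data
print(repo_name("postalcode"))          # whosonfirst-data-postalcode
print(repo_name("venue", "us", "ca"))   # illustrative values only
```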

For example:

The first thing to note is that not all repositories are as granular as the rules described above. Wherever feasible we try to bundle records with as little granularity as possible. For example postalcodes are grouped by country, as are venues, unless there are so many of them (like in the USA) that it is not practical to keep them in a single parent repository.

If a repository accumulates so much data that it is no longer practical to keep everything in one place then it may be subdivided into a number of child repositories. Venues are a good example of this.

We try to maintain a separate parent repository for things that have been broken out into multiple child repositories. For example there is a whosonfirst-data-postalcode repository that contains no data but instead pointers to all the repositories that do have postalcode data. We also do the same for venues in the USA. This practice is still in its early stages so we apologize in advance if it's bumpy or incomplete.

The whosonfirst-data repository is the obvious exception (or perfect example, depending on how you look at it) to the scenario described above. This repository contains all administrative placetypes (all the places between and inclusive of continents to microhoods) for the entire world. It is possible to imagine that the sum total of all the neighbourhoods in the world will require putting them in a separate repository, but we are going to hold off doing that for as long as we can.

Who's On First records should always have a wof:repo property indicating the repository to which they belong. If they don't that's a bug.
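
A quick way to check for that property, sketched here under the assumption that a record is a GeoJSON Feature with a properties dictionary (as in the New Zealand example in this document). The function name record_repo and the tiny sample record are made up for illustration.

```python
import json


def record_repo(feature: dict) -> str:
    """Return the wof:repo value of a record, or raise if it is missing.

    Hypothetical helper; assumes a GeoJSON Feature with a "properties" dict.
    """
    repo = feature.get("properties", {}).get("wof:repo")
    if not repo:
        # Per the text above, a missing wof:repo is a bug in the data.
        raise ValueError("record has no wof:repo property")
    return repo


# A made-up, heavily abbreviated record used only to exercise the function.
sample = json.loads('{"type": "Feature", "properties": {"wof:repo": "whosonfirst-data"}}')
print(record_repo(sample))  # whosonfirst-data
```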

Git and large files

We have started using git-lfs for managing large files in the whosonfirst-data repository. For example, the record for New Zealand which contains a very very very very very detailed coastline geometry exceeds the 100MB filesize limit for any individual file on GitHub.

A full discussion of how to use git-lfs is outside the scope of this document but you can see the current list of files being managed by invoking the git lfs ls-files command, like this:

$> cd /usr/local/mapzen/whosonfirst-data
$> git lfs ls-files
65ccc4825e * data/856/333/45/85633345.geojson
   

When you clone the whosonfirst-data repo the files (managed by git-lfs) only contain metadata, like this:

$> cat data/856/333/45/85633345.geojson
version https://git-lfs.github.com/spec/v1
oid sha256:65ccc4825e65c30f00fcebf1f3d57f4385f18a47e3c5e524114a67050186ae48
size 71879893
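
If you are processing a checkout programmatically it can help to tell these metadata stubs apart from real records. The following is a minimal sketch based only on the three-line layout shown above; the function name parse_lfs_pointer is hypothetical and the sketch does not attempt to cover the full git-lfs pointer specification.

```python
LFS_VERSION_LINE = "version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(text: str):
    """Return {"oid": ..., "size": ...} if text looks like a git-lfs pointer, else None.

    Hypothetical helper; only understands the version/oid/size layout shown above.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != LFS_VERSION_LINE:
        return None
    fields = {}
    for line in lines[1:]:
        # Each remaining line is "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    if "oid" not in fields or not fields.get("size", "").isdigit():
        return None
    return {"oid": fields["oid"], "size": int(fields["size"])}


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:65ccc4825e65c30f00fcebf1f3d57f4385f18a47e3c5e524114a67050186ae48
size 71879893
"""
print(parse_lfs_pointer(pointer)["size"])  # 71879893
print(parse_lfs_pointer('{"id": 85633345}'))  # None
```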

In order to retrieve the contents of the file itself you will need to run git lfs pull, like this:

$> git lfs pull
Fetching master
(1 of 1 files) 68.54 MB / 68.55 MB

$> cat data/856/333/45/85633345.geojson
{
  "id": 85633345,
  "type": "Feature",
  "properties": {
    "edtf:cessation":"u",
    "edtf:inception":"u",
    "geom:area":29.187792061074827,
    "geom:bbox":"166.426148,-47.289992,178.577244,-33

Depending on when you read this we may have already pre-emptively moved all the records for countries into git-lfs. If the goal is to have detailed ground truth geometries for every place in the Who's On First gazetteer it stands to reason that most if not all countries will bump up against GitHub's existing file size limits.

Known knowns

TLS for whosonfirst.mapzen.com

Yes! As in, it is enabled across the board and everything on a non-secure port is redirected to a secure one.

Visiting https://s3.amazonaws.com/whosonfirst.mapzen.com/data/ in a browser

So, yeah. We probably should never have published this URL as-is. It's tricky because we need to share the URL with people who are going to try to download data using the AWS S3 API but the reality is that when you visit that URL in a web browser... you get unhelpful silence. Or, more specifically, your web browser tries to download an empty file. This is one of those unfortunate mismatches between the technology and people's expectations.