John Sample

Bits and Bytes

Saturday, April 19, 2008


The toughest set of data to obtain for geocoding is a zip code to city translation.

Unfortunately, the USPS treats zip data as a trade secret and demands licensing fees for distribution, which makes creating a free geocoder a little harder.

You used to be able to obtain a decent city to zip mapping from the FIPS55 data set. However, the FIPS folks were forced to remove this information from all future releases. I still have the old FIPS data, which I can use if worst comes to worst, but I'm trying to find a way around it by generating the data myself if at all possible. That way the data doesn't keep getting more out of date.

While going through the new census data format I thought I stumbled upon a way to extract city data.

First a little background on why this data is important, then I'll show you some pictures of why it's so difficult.

For each street in the database we have the associated zip code. Actually we have two zip codes, one for each side of the street, but they are generally both the same. The zip code of the street is the main way in which we narrow down the search space for an address.
For example, if you tried to geocode “123 Main St Anytown, NY”, the first step is to figure out the possible zip codes for Anytown, NY, then see if we have any matching street names in those zip codes. Note that if you geocode with just the zip code this is a non-issue: “123 Main St 12345” could be geocoded without the city lookup at all. However, when we display information about the address it would be nice to show where that place is, which means using the zip to look up the city.
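Here is a rough sketch in Python of the lookup described above. The rows in CITY_ZIPS and STREETS are made-up sample data, and the function names are hypothetical. This is not the actual geocoder, just an illustration of how the city to zip table narrows the street search and how the reverse table gives a city name for display.

```python
from collections import defaultdict

# Made-up sample rows: (city, state, zip). A real table would come from
# FIPS55-style data or whatever replaces it.
CITY_ZIPS = [
    ("Anytown", "NY", "12345"),
    ("Anytown", "NY", "12346"),
    ("Otherville", "NY", "12400"),
]

# Made-up street rows: (street name, left-side zip, right-side zip).
STREETS = [
    ("Main St", "12345", "12345"),
    ("Main St", "12400", "12400"),
    ("Oak Ave", "12346", "12347"),
]


def build_city_index(rows):
    """Map (city, state) to the set of zip codes known for that city."""
    index = defaultdict(set)
    for city, state, zip_code in rows:
        index[(city.lower(), state.upper())].add(zip_code)
    return index


def build_zip_index(rows):
    """Map a zip code back to a (city, state) pair for display."""
    return {zip_code: (city, state) for city, state, zip_code in rows}


def candidate_streets(street_name, zips, streets):
    """Keep streets with a matching name whose left or right zip is in zips."""
    wanted = street_name.lower()
    return [
        row for row in streets
        if row[0].lower() == wanted and (row[1] in zips or row[2] in zips)
    ]


city_index = build_city_index(CITY_ZIPS)
zip_index = build_zip_index(CITY_ZIPS)

# "123 Main St Anytown, NY": narrow the search to Anytown's zip codes first.
anytown_zips = city_index[("anytown", "NY")]
matches = candidate_streets("Main St", anytown_zips, STREETS)
```

With only a zip code in the query the first lookup is skipped entirely, and zip_index is only needed to print a city name next to the result.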

The census data does contain names and geometry for most “places” (cities, towns, etc.) so I investigated extracting the shape of all the cities in the country.
It also contains the shapes of what the census calls ZCTAs or “Zip Code Tabulation Areas.” I was hoping to overlay these two sets of data, then go through each place to extract every zip code it touches.
Unfortunately, the “place” geometry doesn't give very good coverage, as it uses very strict definitions for the boundaries of cities.
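To make the overlay idea concrete, here is a small Python sketch. It uses plain bounding boxes as a stand-in for the real place and ZCTA shapes, so it is only a coarse first pass (real shapes would need a proper polygon intersection test). The Box type, the function names, and the coordinates are all made up for illustration.

```python
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box, used here in place of a real polygon."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def boxes_touch(a: Box, b: Box) -> bool:
    """True if the two boxes overlap or share an edge."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


def zips_touching_places(places: Dict[str, Box],
                         zctas: Dict[str, Box]) -> Dict[str, List[str]]:
    """For each place, list every ZCTA whose box touches the place's box."""
    return {
        name: sorted(z for z, zbox in zctas.items() if boxes_touch(pbox, zbox))
        for name, pbox in places.items()
    }


def uncovered_zips(zctas: Dict[str, Box],
                   place_to_zips: Dict[str, List[str]]) -> List[str]:
    """ZCTAs that no place touches, i.e. zips left without a city name."""
    covered = {z for zs in place_to_zips.values() for z in zs}
    return sorted(set(zctas) - covered)


# Made-up coordinates: one tightly drawn place and three ZCTAs.
places = {"Anytown": Box(0, 0, 2, 2)}
zctas = {
    "12345": Box(1, 1, 3, 3),
    "12346": Box(2.5, 0, 4, 1),
    "12400": Box(5, 5, 6, 6),
}

place_to_zips = zips_touching_places(places, zctas)
missing = uncovered_zips(zctas, place_to_zips)
```

The second function is the interesting one here: with strictly drawn place boundaries, a lot of zips can end up in the missing list, which is the coverage problem in a nutshell.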

Here is a projection of the roads and a few landmarks in densely populated Fairfax County:

All of these roads are in a city of some sort as you or I would know them, but when you overlay the census city shape data (green) it looks like this:

As you can see the place data doesn't even come close to covering all the places where people live.

There still may be a way to get the data out of here, but it's going to be tougher than I had hoped.
In the meantime the database creation can continue; it's just going to have a few placeholders where the zip translation can be plugged in later.

posted @ 9:35 AM