John Sample

Bits and Bytes
posts - 103, comments - 354, trackbacks - 16

Revamp

So what have I been working on the last few weeks? Well, I think I've come up with a way to store just about ALL the geometry and places in the census information.

What kind of stuff?
With the new method you could query for a golf course, apartment building, airport, etc. and get the outline not just a single long/lat point.
You could query for a school and get the outline of the entire grounds plus the point on which the building lies.
You could search for a lake, river, or park and do the same.

This includes all the other record types I've been getting emails about also (2, 4, 5, and more).

Why is it taking so long?

There is just no way to fit the architecture I need to use into a cross platform API, at least part of it. 

First a little background:

This project started out as just an academic exercise to see if I could reverse geocode a few locations, something I couldn't find with any free software out there. A side effect of that was building a geocoder that (although incomplete) was pretty functional.
One thing I have constantly struggled with is the speed of the reverse geocoding. The forward lookups are fast, but I have exhausted every trick I know of and can't get the speed of the reverse lookups to a level I consider acceptable. Right now I use B-Tree index intersections but its still not cutting it.
The new architecture will allow forward and reverse lookups to be fast, efficient, and complete, but with one major drawback:
I can't do it with SQL Server.... at least the reverse part of it.
This doesn't mean I'm going to stop supporting SQL Server for the forward geocoding, its just not going to be as powerful as the MySQL version.
Why?
SQL Server doesn't support R-Tree indexes or geometric columns. R-Trees are the only way to make the reverse lookup fast and store all the points of line or polygon. SQL Server just doesn't support them, which is sad because generally given the choice of MSSQL or Oracle on projects I usually go with MSSQL.

Here is an example of why I need to switch the way the geometry is stored. With the way the points are loaded now, the API is only aware of nodes, not the actual streets. Currently when the blue point is reverse geocoded in the following picture, the result will be Street B (which is obviously incorrect) since we can only determine the closest point.

The new storage method will change this and the API will become aware of the actual lines in between nodes (as well as all the internal Type 2 nodes) and return the correct answer: Street A.

Where does this leave the MSSQL and Sqlite versions?
They will benefit from the merging of all the types so you will be able to query schools, malls, parks, plus all the extra streets but:
1. Only point locations will be stored, the polygon outline of the grounds and the Type 2 points will not.
2. Reverse geocoding will still return incorrect results in the above example.

Anyway, thats where I'm at. I'm putting together the load and merge scripts for the new architecture, but it will take a while longer to release them since I have to change the storage structure.


 


 

posted on Friday, October 28, 2005 9:23 AM

Feedback

# re: Revamp

Have you spoken with the SQL Server folks? They may have advice, or you may be able to present the case for developing it.
10/28/2005 2:44 PM | Keith J. Farmer

# re: Revamp

Thanks. The work you're doing is amazingly great. Does anyone know if the functionality that you need and that's missing in SQLServer is going to be present in the upcoming release of SQLServer 2005 yet? I suppose it's not or you would have said so. I too would love a purely SQLServer solution, though I suppose one could just install the mysql version you speak of and run it as a web service for oneself and just deal with the extra overhead of having one more thing to think about in one's architecture from an administrative standpoint.

Congratulations on your upcoming new addition. I'm sure things are about to get really busy for you soon, as if they aren't already.

10/28/2005 2:46 PM | Rob Wuhrman

# re: Revamp

Unfortunately they have stated there won't be any support for R-Trees or spatial types in 2005. This is where I found it:
http://msdn.microsoft.com/SQL/2005/2005Webcasts/TSQLQandA.aspx

You are right about setting up a separate db. Just because the geo stuff is in MySql doesn't mean your whole app needs to be. After all its pretty much a read only type of lookup, and its free.

One of the things that is killing me is the amount of time it takes to test it. Spatial indexes take FOREVER to create. The last time I ran the index (before I added the Type 2 points and polygons) took 4 days to run. I may just find a way to distribute the db in prebuilt form when its done.
10/28/2005 2:57 PM | John Sample

# re: Revamp

Is it possible to have the installer only download my County instead of every county in the US, I only need to geocode one county for now, but this is huge when I get everything of course. Sorry I did not see if there was a way in earlier posts.
11/4/2005 5:41 PM | Tim G

# re: Revamp

Create a folder (TempFolder).
Create another folder inside that one with the two character code of your state. (ex. Virginia = VA)
Download the zip file of your county into that folder manually.

ex: TempFolder/VA/tgr11111.zip

Launch the installer and when it asks for the download folder point it at tempfolder, then just skip the download part.
11/4/2005 5:45 PM | John Sample

# re: Revamp

I am anxious to look at this but when I run the installer it hangs during the "Unzip and Parse" of Type 1 records.

Has anyone else had this problem?

I have experience the problem with Alabama and Florida. After removing those folders the unzip and parse will progress to IL (tgr17087.zip) before it hangs, so I still haven't gotten all the way through.

Disk space is not an issue as I have over 100 gig available on the drive in question.
11/5/2005 10:34 AM | John R.

Post Comment

Title  
Name  
Url
Enter the code you see:
Comment