geohashes and proximity: part II

geohashes and proximity title image

This is the second part of an exploration to find a solution to the following question:

Given a single (geo)coordinate x, and a large set of other (geo)coordinates A, how might one very efficiently query the set to return all coordinates within A that fall within a certain distance R of x?

In the first part of this exploration, we explored initial broader questions and the process of geohashing. In this second part, we’ll be going through the process of actually using geohashes to effectively find nearby coordinates.

Finding Proximal Points Using Geohashes

During my research into the matter I encountered a method of spatial indexing that allows us to quite efficiently find proximal points near a specific coordinate in space. It’s a method that produces approximate results really fast, and has good enough precision for a majority of applications. Most of the computation is through bitwise operations, which are extremely efficient. We do this by avoiding any multiplication or trigonometric functions.

This method uses the inherent properties of geohashes and sorted sets which support range queries to quickly find all proximal points at a given distance from a given coordinate.

01 Store Points as Accurate Geohashes

The first step in all of this, is to have a decent set of points to work with. The crux of this method is centred on storing geohashed coordinates in an ordered list that we can query-by-range. A good example of this kind of set is redis’ sorted set.

When we do store our points in the set, we’ll want to ensure our points are stored in the best possible geohash resolution as possible, and as standard integer geohashes (vs. the more common base32 variety). For javascript, this is 52-bits, and for most other languages that support true uint64 integers, it would be 64-bits.

02 Find a Geohash Resolution

As we’ve learned, the more bit depth we give a geohash, the smaller and more accurately it describes a subspace. Geohash values, of course, map to physical space and will represent a box of a certain physical area. We can use this inherent property of geohashes to find a geohash bit-depth that simulates a search radius.

But, that means using this particular method of spatial indexing we are literally searching within non-square rectangles, instead of circles with radiuses. So we won’t be getting the exact same results as we would with an actual radius proximity search. Still, people tend to think of proximity in terms of radii, so we’d like to use radius values to approximate associated boxes at comparable geohash resolutions.

We accept the difference between box search vs. radius search for the sake of efficiency since that is the main strength of this method. It’s a compromise that some of us can live with, though it’s important to note that if you want to replicate a true radius search exactly, this probably isn’t the method for you.

Anyway, knowing that our spatially indexed boxes actually represent physical space at various bit depths, we can use a table that has approximate matches in physical radius to geohash bit depth:

radius (m)	bit-depth	radius (m)	bit-depth
0.6	`52 bits`	3,866.9	`26 bits`
1	`50 bits`	8,749.7	`24 bits`
2.19	`48 bits`	15,664	`22 bits`
4.57	`46 bits`	33,163.5	`20 bits`
9.34	`44 bits`	72,226.3	`18 bits`
14.4	`42 bits`	150,350	`16 bits`
33.18	`40 bits`	306,600	`14 bits`
62.1	`38 bits`	474,640	`12 bits`
128.55	`36 bits`	1,099,600	`10 bits`
252.9	`34 bits`	2,349,600	`8 bits`
510.02	`32 bits`	4,849,600	`6 bits`
1,015.8	`30 bits`	10,018,863	`4 bits`
2,236.5	`28 bits`

Even these vary at different latitudes but, if we really want, we can do additional computation to create more accurate correlations if needed.

03 Geohash the Query Point

Once we’ve determined the bit-depth we need to use to match a search radius (using the above table or another equally valid method), we’ll need to geohash our coordinate to this desired bit-depth. This tells us which box our point is in at the desired bit-depth, and gives us a good starting point for finding other coordinates that are located nearby!

Geohash immediate box

You can see from the image above that we are actually searching within 9 geohash boxes at this bit-depth. The reason why we expand beyond just a single larger box is because our query point may be at the edge of the box, so we need to actually normalize the distance-probability for a more accurate resulting set of nearby points. We do this by finding points within neighbouring boxes as well.

04 Find Neighbouring Boxes

By knowing our neighbouring boxes’ addresses, we get a much more normalized box of values within a certain distance to our point. Because neighbouring geohash boxes don’t necessarily progress in linear order, we have to calculate the neighbouring boxes’ geohashes at our determined bit-depth from Step 2.

Geohash immediate box

Including the box that contains our point, we now have a set of 9 boxes (geohash values) at a bit-depth that gives us an area that approximately matches a certain radius from a given point. Excellent! Now we just have to find all points that lie within these 9 boxes.

05 Find Ranges of Points in All 9 Boxes

Now, remember we have stored all of our points at a higher geohash resolution from the bit-depth we’ve been working in! We need to deduce from what we know about our 9 boxes to find the set of high bit-depth geohashes that fall within these boxes.

As an example, let’s say our 9 search boxes are at 6-bit resolution, but our coordinates are stored as 52-bit geohash integers. We need to take the 9 different 6-bit boxes and find out what range of 52-bit geohashes are contained within them!

Geohash boxes big

You’ll notice from the above diagram that, in isolation, the progression and values of our boxes, which were determined by the location of our original query point, are fairly unpredictable. However, here’s something quite useful: A geohash box at a particular resolution contains all boxes within it at a higher resolution AND these smaller box value ranges are predictable and easily calculable. That is, if we know the geohash value of a larger box and if we want to find all the box numbers within that box that exist at a higher resolution, we can find the lowest box number and the highest box number within that larger box without a whole lot of computation.

All we need to do is effectively bit-shift the box values of our query box geohases to our stored geohash bit-depth. In our example, we’d be going from 6-bits to 52-bits, so that is a difference of 46-bits (52-6 = 46). To find a range, we just increment the 6-bit decimal number by 1 to find the upper range at of the box (still at 6-bits). Then we bit-shift the lower and upper range to our max bit-depth… in this case we’d be bit-shifting right by 46-bits. When we do this for each box, we build out a set of decimal integer ranges at our desired 52-bits resolution! If this is confusing to understand, perhaps the diagrams below will help explain what’s happening to the bits and values of the boxes.

This produces a set of ranges. From 6-bit boxes we’ve produced a set of ranges of 52-bit box values! The table below shows the aggregated ranges and their correlated 6-bit boxes.

box number (6-bit)	lower limit (52-bit)	upper limit (52-bit)
`15`	`4222124650659840`	`4503599627370496`
`26`	`7318349394477056`	`7599824371187712`
`27`	`7599824371187712`	`7881299347898368`
`49`	`13792273858822144`	`14073748835532800`
`51`	`14355223812243456`	`14636698788954112`
`50`	`14073748835532800`	`14355223812243456`
`39`	`10977524091715584`	`11258999068426240`
`37`	`10414574138294272`	`10696049115004928`
`48`	`13510798882111488`	`13792273858822144`

06 Query All Ranges

An important part of this method was storing our points as high-bit-depth geohash values in a sorted set that can be queried by range because we now have a set of ranges to query against!

Expressed as an array, our ranges from the previous step would look something like:

var ranges =  [ [4222124650659840,  4503599627370496],
                [7318349394477056,  7599824371187712],
                [7599824371187712,  7881299347898368],
                [13792273858822144, 14073748835532800],
                [14355223812243456, 14636698788954112],
                [14073748835532800, 14355223812243456],
                [10977524091715584, 11258999068426240],
                [10414574138294272, 10696049115004928],
                [13510798882111488, 13792273858822144] ];

Now, it’s just a matter of retrieving all entries that fall within the ranges of geohash values we’ve been able to calculate. Above we have 9 unique ranges (18 values in total). This set of ranges represents all the possible geohashed coordinates that would possibly fall within proximity to our original query point.

Further Optimizations

Range Overlap Aggregation

It may be noticeable that there are some overlaps in the ranges produced. This would mean that we would be querying more numeric ranges than needed if there is a possibility for aggregation.

Consider our example, and the ranges: [13510798882111488, 13792273858822144] and [13792273858822144, 14073748835532800]. These could actually be combined into one range [13510798882111488, 14073748835532800]. By doing this for all range overlaps, we can usually get the number of ranges to search within down from 9 to between 4 or 5.

var ranges =  [ [4222124650659840,  4503599627370496],
                [7318349394477056,  7881299347898368],
                [13510798882111488, 14636698788954112],
                [10977524091715584, 11258999068426240],
                [10414574138294272, 10696049115004928] ];

Above is an optimized set of ranges from Step 6 after merging all overlaps into combined ranges. This step can actually be done at our base search resolution (eg. 6-bits), which would let us skip the bit-shift operations for overlapping values.

Caching Ranges

As you can see, the bulk of the processing for this method of proximity search is in getting these treasured geohash range values. Using something like redis to do range queries is extremely fast so if the exact same search is being done (ie. same radius and same original coordinate) these ranges can be cached for even better performance!

Conclusion

What a roundabout way of finding nearby points! I was a little afraid that this wasn’t much faster than a regular latitude/longitude range query. But testing it out and comparing it to some mysql memory table coordinate range queries seems to show promise. I can tell you that the best part about this, for me personally, is that it avoids the need for me to use any additional database systems in my stack.

There are a few things we’ve learned through this exploration.

This method will produce different results than a true-to-radius proximity search.
With the right database/storage structure in place, this method can be extremely efficient and scalable.
The bulk of computation is in finding ranges of high bit-depth geohashes to search within.
We can cache and combine overlapping search ranges to further optimize the query if needed.

Resources

This has all been packaged into a node module that you can use to find nearby points really easily.