Geolocation and Cassandra at Physi

Geolocation with Cassandra Austin Cassandra Users – Jan 21, 2016


Page 1: Geolocation and Cassandra at Physi

Geolocation with Cassandra

Austin Cassandra Users – Jan 21, 2016

Page 2: Geolocation and Cassandra at Physi

Matt Vorst

• Cassandra User
– Since 2011

• Architect / Java developer

• Corporate Life
– EntekIRD & Rockwell Automation

• Serial Entrepreneur
– EventsInCincinnati.com, Co-founder
– Dotloop, Inc., Co-founder and CTO
– Physi, Inc., Co-founder and C*O

Page 3: Geolocation and Cassandra at Physi

Physi [fiz-ee] (noun)
1. a mobile app that pairs nearby people to play sports
2. a movement to make a smaller, happier, healthier world through play

Page 4: Geolocation and Cassandra at Physi

Why Cassandra

• Operations is Hard
– Most relational DBs don't scale easily or well
– Murphy's Law always strikes at the worst time
– Recovery shouldn't come at a high cost

• Distributed Design
– Cassandra is a distributed technology
– Applications are designed to be distributed

Page 5: Geolocation and Cassandra at Physi

Necessary Location Services

• Proximity Search
– Postal code range search
– Distance between postal codes

• Location Conversion
– Postal code to latitude/longitude
– Latitude/longitude to postal code

• Search
– City name lookup

Page 6: Geolocation and Cassandra at Physi

Setup
• Create the Keyspace

cqlsh> CREATE KEYSPACE physi WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

cqlsh> USE physi;

Page 7: Geolocation and Cassandra at Physi

Postal Code to Latitude/Longitude
• Use Case
– Place markers on a map

• Solution
– Buy a database
– PK: Country/postal code

Page 8: Geolocation and Cassandra at Physi

Postal Code to Latitude/Longitude
• Create Column Family

cqlsh> CREATE TABLE zip_code_master (
    location_country text,
    zip_code text,
    location_uuid uuid,
    location_type text,
    city text,
    county text,
    state text,
    latitude_e6 bigint,
    longitude_e6 bigint,
    PRIMARY KEY (location_country, zip_code)
);

Page 9: Geolocation and Cassandra at Physi

Postal Code to Latitude/Longitude
• Add data

cqlsh> INSERT INTO zip_code_master (location_country, zip_code, location_uuid, location_type, city, county, state, latitude_e6, longitude_e6)
VALUES ('US', '45219', 7b0e6b7f-0d9a-3a66-9f9a-0df17ed5dc39, 'REGIONAL', 'Cincinnati', 'Hamilton', 'OH', 39127564, -84514489);

Page 10: Geolocation and Cassandra at Physi

Postal Code to Latitude/Longitude
• Search

cqlsh> SELECT * FROM zip_code_master WHERE location_country = 'US' AND zip_code = '45219';

• Results

 location_country | zip_code | city       | county   | latitude_e6 | location_type | location_uuid                        | longitude_e6 | state
------------------+----------+------------+----------+-------------+---------------+--------------------------------------+--------------+------
               US |    45219 | Cincinnati | Hamilton |    39127564 |      REGIONAL | 7b0e6b7f-0d9a-3a66-9f9a-0df17ed5dc39 |    -84514489 |    OH
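The latitude_e6 and longitude_e6 columns hold coordinates as integers. A minimal sketch of the conversion, assuming the "_e6" suffix means decimal degrees multiplied by 10^6 (the slides do not spell this out); the helper names to_e6 and from_e6 are hypothetical:

```python
E6 = 1_000_000  # assumed scale factor implied by the "_e6" column suffix


def to_e6(degrees: float) -> int:
    """Encode decimal degrees as an integer number of microdegrees."""
    return round(degrees * E6)


def from_e6(value_e6: int) -> float:
    """Decode an integer microdegree value back to decimal degrees."""
    return value_e6 / E6


# Values taken from the 45219 row shown on the slide.
lat = from_e6(39127564)
lon = from_e6(-84514489)
print(lat, lon)
```

Storing integers avoids floating-point comparison surprises in keys and keeps the map-marker conversion to a single division on the client.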

Page 11: Geolocation and Cassandra at Physi

Postal Code to Latitude/Longitude
• Things to Know
– Row width: ~10
– Postal codes cover areas of different sizes
– A single postal code can span different cities, counties, and even states
– The largest postal code covers 10,000 mi²

Page 12: Geolocation and Cassandra at Physi

Latitude/Longitude to Postal Code
• Use Case
– Determine, server side, which postal code a user is currently in
– Use this to return suggestions

Page 13: Geolocation and Cassandra at Physi

Latitude/Longitude to Postal Code
• The Relational Way
– Draw a box, loop, and calculate
– Query:

SELECT * FROM location_table
WHERE (min lat) < latitude AND latitude < (max lat)
AND (min long) < longitude AND longitude < (max long)
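The "draw a box" step means turning a point and a search radius into the (min lat), (max lat), (min long), (max long) bounds used in the query. A rough sketch, assuming about 69 miles per degree of latitude and ignoring the poles and the antimeridian; the function name bounding_box is hypothetical:

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # approximate; close enough for a coarse box


def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing a circle of radius_miles."""
    dlat = radius_miles / MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks by cos(latitude) away from the equator,
    # so the box must span more degrees east-west than north-south.
    dlon = radius_miles / (MILES_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


min_lat, max_lat, min_lon, max_lon = bounding_box(39.127564, -84.514489, 6.9)
print(min_lat, max_lat, min_lon, max_lon)
```

Rows returned by the box query still include corners that lie outside the circle, which is why the slide follows the box with "loop, and calculate".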

Page 14: Geolocation and Cassandra at Physi

Latitude/Longitude to Postal Code
• Cassandra Solution
– Prebuild a lookup table
  • Slice the US up into 7 mi by <=7 mi squares
  • ~69 miles between lines of latitude (one degree apart)
  • Lines of longitude are not equally spaced
– PK: latE1|longE1
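A sketch of how a coordinate might map to the latE1|longE1 key, assuming "E1" means degrees multiplied by 10 and truncated toward zero. Truncation is an assumption chosen because it reproduces the (391, -845) key used on the slides for 45219; the slides do not state the rounding rule. The names cell_key and cell_size_miles are hypothetical:

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # the "~69 miles between lines of latitude" figure


def cell_key(lat, lon):
    """Return (latitude_e1, longitude_e1) for the tenth-of-a-degree cell holding the point."""
    # Assumption: truncate toward zero rather than floor or round.
    return math.trunc(lat * 10), math.trunc(lon * 10)


def cell_size_miles(lat):
    """Return the approximate (width, height) in miles of a 0.1 degree cell at this latitude."""
    height = MILES_PER_DEGREE_LAT / 10
    # East-west extent shrinks with cos(latitude), hence "7 mi by <=7 mi".
    width = height * math.cos(math.radians(lat))
    return width, height


print(cell_key(39.127564, -84.514489))
print(cell_size_miles(39.127564))
```

Because the key is computed from the coordinate alone, a lookup is a single-partition read with no range scan.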

Page 15: Geolocation and Cassandra at Physi

Latitude/Longitude to Postal Code
• Cassandra Solution (cont.)
– Build: Add bordering postal codes
– Read: Loop and calculate distance
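The read-side "loop and calculate distance" step could look like the sketch below. It assumes each candidate row carries a representative point for its postal code (for example a centroid inside the JSON location column, which the slides do not show) and uses the haversine great-circle formula, which may differ from what Physi actually used. The second candidate is a made-up placeholder:

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_postal_code(lat, lon, candidates):
    """Pick the candidate (zip_code, lat, lon) closest to the given point; return its zip_code."""
    return min(candidates, key=lambda c: haversine_miles(lat, lon, c[1], c[2]))[0]


candidates = [
    ("45219", 39.127564, -84.514489),  # coordinates from the slides
    ("ZIP_B", 39.160000, -84.540000),  # hypothetical placeholder, not real data
]
print(nearest_postal_code(39.13, -84.51, candidates))
```

Since bordering postal codes are added at build time, the loop only has to examine the rows of one partition.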

Page 16: Geolocation and Cassandra at Physi

Latitude/Longitude to Postal Code
• Create Column Family

cqlsh> CREATE TABLE latitude_longitude_zip_code (
    latitude_e1 int,
    longitude_e1 int,
    location_country text,
    zip_code text,
    location text,
    PRIMARY KEY ((latitude_e1, longitude_e1), location_country, zip_code)
);

Page 17: Geolocation and Cassandra at Physi

Latitude/Longitude to Postal Code
• Add data

cqlsh> INSERT INTO latitude_longitude_zip_code (latitude_e1, longitude_e1, location_country, zip_code, location)
VALUES (391, -845, 'US', '45219', '{json data}');

cqlsh> INSERT INTO latitude_longitude_zip_code (latitude_e1, longitude_e1, location_country, zip_code, location)
VALUES (391, -845, 'US', '45220', '{json data}');

Page 18: Geolocation and Cassandra at Physi

Latitude/Longitude to Postal Code
• Search

cqlsh> SELECT * FROM latitude_longitude_zip_code
WHERE latitude_e1 = 391 AND longitude_e1 = -845;

• Results

 latitude_e1 | longitude_e1 | location_country | zip_code | location
-------------+--------------+------------------+----------+-------------
         391 |         -845 |               US |    45206 | {json data}
         391 |         -845 |               US |    45219 | {json data}
         391 |         -845 |               US |    45220 | {json data}

Page 19: Geolocation and Cassandra at Physi

Latitude/Longitude to Postal Code
• Things to Know
– Row width: 1 to ~50
– This was a short-lived solution
– Primarily using client location services
– Still used as a fallback for web
– Creation of the lookup table took 3 hours on localhost with RAID 0 SSDs

Page 20: Geolocation and Cassandra at Physi

City Name Lookup
• Use Case
– Auto-complete city name

• Solution
– Create a lookup
– RK: searchTerm
– CN: (0 padded count)|country|city

Page 21: Geolocation and Cassandra at Physi

City Name Lookup
• Create Column Family

cqlsh> CREATE TABLE name_search (
    search_term text,
    occurrence_count int,
    location_country text,
    city text,
    state text,
    location text,
    PRIMARY KEY ((search_term), occurrence_count, location_country, city, state)
);

Page 22: Geolocation and Cassandra at Physi

City Name Lookup
• Add data

cqlsh> INSERT INTO name_search (search_term, occurrence_count, location_country, city, state, location)
VALUES ('aus', 31, 'US', 'austin', 'TX', '{json data}');

cqlsh> INSERT INTO name_search (search_term, occurrence_count, location_country, city, state, location)
VALUES ('aus', 10, 'US', 'austell', 'GA', '{json data}');

Page 23: Geolocation and Cassandra at Physi

City Name Lookup
• Search

cqlsh> SELECT * FROM name_search WHERE search_term = 'aus' ORDER BY occurrence_count DESC;

• Results

 search_term | occurrence_count | location_country | city        | state | location
-------------+------------------+------------------+-------------+-------+-------------
         aus |               31 |               US |      austin |    TX | {json data}
         aus |               10 |               US |     austell |    GA | {json data}
         aus |               10 |               US | ausablefork |    NY | {json data}

Page 24: Geolocation and Cassandra at Physi

City Name Lookup
• Things to Know
– Row width: 10 to 60K
– Remove whitespace and special characters, and convert search terms to lowercase
– Only search when 2 or more characters have been entered
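The normalization rules above can be sketched as follows. The sketch assumes one name_search row is written per prefix of the normalized city name, which is inferred from 'aus' matching several cities rather than stated on the slides, and that "special characters" means anything outside ASCII letters and digits. The function names are hypothetical:

```python
import re

MIN_TERM_LENGTH = 2  # "only search when 2 or more characters have been entered"


def normalize(text):
    """Lowercase the text and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def search_terms(city_name):
    """Return every prefix of the normalized name at least MIN_TERM_LENGTH long."""
    name = normalize(city_name)
    return [name[:n] for n in range(MIN_TERM_LENGTH, len(name) + 1)]


print(normalize("Au Sable Forks"))
print(search_terms("Austin"))
```

The same normalize function would be applied to user input before the SELECT, so that what was typed and what was stored agree on case and spacing.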

Page 25: Geolocation and Cassandra at Physi

Postal Code Range Search
• Use Case
– Find nearby neighborhoods

• Solution
– Create a lookup table
– RK: country|postal code

Page 26: Geolocation and Cassandra at Physi

Postal Code Range Search
• Create Column Family

cqlsh> CREATE TABLE zip_code_distance (
    location_country text,
    zip_code text,
    distance_e2 int,
    location text,
    PRIMARY KEY ((location_country, zip_code), distance_e2)
);

Page 27: Geolocation and Cassandra at Physi

Postal Code Range Search
• Add Data

cqlsh> INSERT INTO zip_code_distance (location_country, zip_code, distance_e2, location)
VALUES ('US', '78741', 0, '{json data for 78741}');

cqlsh> INSERT INTO zip_code_distance (location_country, zip_code, distance_e2, location)
VALUES ('US', '78741', 180, '{json data for 78702}');

cqlsh> INSERT INTO zip_code_distance (location_country, zip_code, distance_e2, location)
VALUES ('US', '78741', 220, '{json data for 78721}');

Page 28: Geolocation and Cassandra at Physi

Postal Code Range Search
• Search

cqlsh> SELECT * FROM zip_code_distance WHERE location_country = 'US' AND zip_code = '78741' AND distance_e2 < 200 ORDER BY distance_e2;

• Results

 location_country | zip_code | distance_e2 | location
------------------+----------+-------------+-----------------------
               US |    78741 |           0 | {json data for 78741}
               US |    78741 |         180 | {json data for 78702}
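Because distance_e2 is the clustering column, rows within a partition are kept sorted by distance, so "distance_e2 < 200" selects a contiguous run from the start of the partition. The sketch below is only an in-memory analogy of that behaviour, not how Cassandra is implemented; it reuses the three rows inserted on the slides, and the function name within is hypothetical:

```python
from bisect import bisect_left

# (distance_e2, location) pairs for the ('US', '78741') partition,
# kept sorted by distance_e2 the way a clustering column orders rows.
partition = [
    (0, "{json data for 78741}"),
    (180, "{json data for 78702}"),
    (220, "{json data for 78721}"),
]


def within(rows, max_distance_e2):
    """Return the rows whose distance_e2 is strictly below max_distance_e2."""
    keys = [distance for distance, _ in rows]
    # bisect_left finds the first position whose key is >= the limit,
    # so everything before it satisfies the strict "<" comparison.
    return rows[:bisect_left(keys, max_distance_e2)]


for distance_e2, location in within(partition, 200):
    print(distance_e2, location)
```

The design choice is to pay for sorting once at build time so that each read is a single ordered slice.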

Page 29: Geolocation and Cassandra at Physi

Postal Code Range Search
• Things to Know
– Row width: 1 to ~45K

Page 30: Geolocation and Cassandra at Physi

Distance Between Postal Codes
• Use Case
– Estimate the distance between postal codes

• Solution
– Create a lookup table
– RK: country|postal code
– CN: country|postal code
– Value: distanceE2

Page 31: Geolocation and Cassandra at Physi

Distance Between Postal Codes
• Create Column Family

cqlsh> CREATE TABLE zip_code_distance_between (
    location_country_1 text,
    zip_code_1 text,
    location_country_2 text,
    zip_code_2 text,
    distance_e2 int,
    PRIMARY KEY ((location_country_1, zip_code_1), location_country_2, zip_code_2)
);

Page 32: Geolocation and Cassandra at Physi

Distance Between Postal Codes
• Add Data

cqlsh> INSERT INTO zip_code_distance_between (location_country_1, zip_code_1, location_country_2, zip_code_2, distance_e2)
VALUES ('US', '78741', 'US', '78741', 0);

cqlsh> INSERT INTO zip_code_distance_between (location_country_1, zip_code_1, location_country_2, zip_code_2, distance_e2)
VALUES ('US', '78741', 'US', '78702', 180);

Page 33: Geolocation and Cassandra at Physi

Distance Between Postal Codes
• Select

cqlsh> SELECT * FROM zip_code_distance_between WHERE location_country_1 = 'US' AND zip_code_1 = '78741' AND location_country_2 = 'US' AND zip_code_2 = '78702';

• Results

 location_country_1 | zip_code_1 | location_country_2 | zip_code_2 | distance_e2
--------------------+------------+--------------------+------------+-------------
                 US |      78741 |                 US |      78702 |         180

Page 34: Geolocation and Cassandra at Physi

Distance Between Postal Codes
• Things to Know
– Row width: ~45K

Page 35: Geolocation and Cassandra at Physi

Final Thoughts
• Why just Cassandra?
– Fewer technologies to support
  • Operations
  • Development
– But be reasonable

• Prebuild reference data
– Consider prebuilding data to reduce read time

Page 36: Geolocation and Cassandra at Physi

Questions & Contact Info

Matt Vorst
CTO, Physi, Inc.

[email protected]