Deep Dive content by Hortonworks, Inc. is licensed under a
Creative Commons Attribution-ShareAlike 3.0 Unported License.
Spatial Analytics with Hive
Hive Meetup – July 24, 2013
@cshanklin
Why Spatial Analytics?
• Amount of spatial data has exploded due to mobile device
ubiquity and more reliance on sensors.
• Proliferation of consumer-oriented mapping products brings
spatial analytics to the mainstream.
An Interesting Dataset
• GPS data collected from Uber trips.
• Anonymized, maintains days/times but not dates.
• Obtained from InfoChimps
Data Sample
ID Date Time Latitude Longitude
1 1/7/07 10:54:50 37.782551 -122.445368
1 1/7/07 10:54:54 37.782745 -122.444586
1 1/7/07 10:54:58 37.782842 -122.443688
1 1/7/07 10:55:02 37.782919 -122.442815
1 1/7/07 10:55:06 37.782992 -122.442112
1 1/7/07 10:55:10 37.7831 -122.441461
1 1/7/07 10:55:14 37.783206 -122.440829
1 1/7/07 10:55:18 37.783273 -122.440324
Overall
1.1M distinct readings
25,000 distinct trips.
Meanwhile, At Uber Headquarters…
Questions Uber Might Ask:
• What do trips tend to look like?
• How can we reduce wait time and make more trips?
• Are there new products we should introduce?
Answering The Questions
• Why Use SQL?
–Well understood by analysts.
–Huge ecosystem, access Hive from any of 20+ BI tools.
• Why Hive?
–Supports advanced SQL analytics like windowing functions.
–Java based, makes it easy for 3rd parties to add extensions.
• Last Reason
–This is the Hive meetup. Were you expecting ABAP?
Getting a feel for the trips.
Duration
• To get the duration all we need to do is:
–Subtract the first timestamp from the last timestamp.
–Do it per trip ID (1-25000).
• OK, how do we do it with SQL?
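As a reference point for the SQL that follows, the two steps above can be sketched in plain Python. This is only an illustration: the `readings` list, its `(trip ID, timestamp)` layout and the `trip_durations` helper are assumptions, and the second trip is made up.

```python
from datetime import datetime

# Hypothetical in-memory readings: (trip ID, timestamp string).
# The date/time layout mirrors the sample slide; trip 2 is invented.
readings = [
    (1, "1/7/07 10:54:50"),
    (1, "1/7/07 10:54:54"),
    (1, "1/7/07 10:55:18"),
    (2, "1/7/07 11:00:00"),
    (2, "1/7/07 11:02:28"),
]


def trip_durations(rows):
    """Return {trip_id: seconds between the first and last reading}."""
    first, last = {}, {}
    for trip_id, stamp in rows:
        ts = datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")
        # Track the earliest and latest timestamp seen for each trip.
        first[trip_id] = min(first.get(trip_id, ts), ts)
        last[trip_id] = max(last.get(trip_id, ts), ts)
    # Duration = last timestamp minus first timestamp, per trip ID.
    return {t: int((last[t] - first[t]).total_seconds()) for t in first}


durations = trip_durations(readings)
print(durations)
```

The per-trip bookkeeping in `first` and `last` plays the role that PARTITION BY plays in a windowed query.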
Getting First Or Last Values In A Partition
-- Get the last observation from each trip ID.
-- Standard approach on any SQL system that supports windowing.
SELECT
  *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY uber.id ORDER BY uber.dt DESC) AS rn
  FROM
    uber
) sub1
WHERE
  rn = 1;
And Hive Supports Windowing Now (0.11+)
Name         Purpose
CUME_DIST    Cumulative distribution: the fraction of rows in the partition with
             values less than or equal to (or greater than or equal to, if
             ORDER BY DESC) the current row's value.
DENSE_RANK   The dense rank of the row within the partition. If any rows “tie” or
             have the same value, they receive the same rank. DENSE_RANK
             does not have gaps in the ranks, in contrast to RANK.
FIRST_VALUE  The value in the first row within the partition.
LAST_VALUE   Surprisingly, not the opposite of FIRST_VALUE (if you want that,
             just change your sort order). With ORDER BY and the default window
             frame, which ends at the current row, LAST_VALUE returns the value
             from the current row (or its last peer), not from the last row of
             the partition.
LAG          Value from a prior row in the partition.
LEAD         Value from a subsequent row in the partition.
NTILE        Divides rows in a partition into N groups.
ROW_NUMBER   The row number of the row within the partition.
RANK         The rank of the row within the partition. This differs from
             ROW_NUMBER in that ties receive the same value.
Compute Trip Durations
-- Subtract the first timestamp from the last timestamp.
-- Use FIRST_VALUE and ROW_NUMBER to help compare first and last timestamps.
SELECT
  id,
  (unix_timestamp(dt) - unix_timestamp(fv)) AS trip_duration
FROM (
  SELECT
    id, dt, fv
  FROM (
    SELECT
      id, dt,
      FIRST_VALUE(dt) OVER (PARTITION BY id ORDER BY dt) AS fv,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt DESC) AS lastrk
    FROM
      uber
  ) sub1
  WHERE
    lastrk = 1
) sub2;
Trip Duration SQL Output
id trip_duration
1 128
2 148
3 150
4 336
5 400
6 168
7 142
8 558
9 312
10 208
...
(25,000 total trips)
Duration Was Easy, What About Distance?
• All we have is GPS readings.
• If we draw a line from GPS readings, it estimates trip distance.
• GPS readings are 4 seconds apart, so the estimate should be close.
[Figure: the actual route, the GPS signal points along it, and the estimated route drawn through those points]
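The "connect the GPS readings" estimate can be sketched in Python by summing the distance between consecutive readings. Note the swap: this sketch uses the spherical haversine formula with a mean Earth radius, which is an assumption made for brevity and not the WGS84 ellipsoid calculation. The helper names are hypothetical; the coordinates are the eight sample readings shown earlier.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; spherical approximation (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def path_length_m(points):
    """Sum the distances between consecutive (lat, lon) readings."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


# The eight sample readings for trip 1, as (latitude, longitude).
sample = [
    (37.782551, -122.445368),
    (37.782745, -122.444586),
    (37.782842, -122.443688),
    (37.782919, -122.442815),
    (37.782992, -122.442112),
    (37.7831, -122.441461),
    (37.783206, -122.440829),
    (37.783273, -122.440324),
]

length = path_length_m(sample)
print(f"{length:.1f} m over {len(sample)} readings")
```

Because each segment is a straight line between readings, the estimate can only undershoot the road distance between two points; closely spaced readings keep that gap small.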
Enter GIS Tools for Hadoop
esri.github.io/gis-tools-for-hadoop
Works with Hive and Map-Reduce
Syntax similar to other spatial systems like PostGIS
Open Source
Spatial Framework for Hadoop Functions
Name                    Purpose
ST_LineString           Create a line from coordinates supplied in a string.
ST_Polygon              Create a polygon.
ST_SetSRID              Set Spatial Reference ID. SRID 4326 corresponds to WGS84.
ST_GeodesicLengthWGS84  Compute length of a line in meters assuming points use the
                        World Geodetic System 1984. GPS uses the WGS84
                        coordinate system.
ST_Length               Compute Cartesian length.
ST_Contains             Determine if one spatial object contains another spatial
                        object.
ST_Intersects           Determine if two spatial objects intersect.
ST_AsText               Return a text representation of a spatial object, suitable for
                        storing in a Hive string column. Objects can also be saved in
                        binary columns with no conversion.
82 total spatial functions provided by Spatial Framework for Hadoop.
ST_LineString: Make a line.
• 2 Constructors
–ST_LineString(1, 1, 2, 2, 3, 3);
– Simple constructor.
–ST_LineString('linestring(1 1, 2 2, 3 3)');
– WKT or Well-Known-Text constructor.
• Neither approach very convenient for this dataset.
• Since SF4H is open-source I added a new constructor:
–ST_LineString([Array of ST_Point Objects]);
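To see why the WKT constructor is awkward for this dataset, consider what a query would have to assemble for every trip: one long string of space-separated coordinate pairs. A minimal Python sketch of that string building follows; `to_wkt_linestring` is a hypothetical helper, not part of the Spatial Framework for Hadoop.

```python
def to_wkt_linestring(points):
    """Build a WKT LINESTRING from (x, y) pairs, e.g. (longitude, latitude)."""
    if len(points) < 2:
        raise ValueError("a line needs at least two points")
    # WKT separates x and y with a space and points with a comma.
    return "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in points) + ")"


wkt = to_wkt_linestring([(1, 1), (2, 2), (3, 3)])
print(wkt)  # prints: LINESTRING (1 1, 2 2, 3 3)
```

Passing an array of point objects avoids this round trip through text.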
collect_array: Custom UDAF turns columns to arrays
ID Date Time Latitude Longitude
1 1/7/07 10:54:50 37.782551 -122.445368
1 1/7/07 10:54:54 37.782745 -122.444586
1 1/7/07 10:54:58 37.782842 -122.443688
1 1/7/07 10:55:02 37.782919 -122.442815
1 1/7/07 10:55:06 37.782992 -122.442112
1 1/7/07 10:55:10 37.7831 -122.441461
1 1/7/07 10:55:14 37.783206 -122.440829
1 1/7/07 10:55:18 37.783273 -122.440324
> SELECT id, collect_array(latitude) FROM uber GROUP BY id;
(1, [ 37.782551, 37.782745, 37.782842, 37.782919, 37.782992 ... ])
...
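What the aggregate does can be sketched in a few lines of Python. This is an illustration of the grouping idea only, not the UDAF's actual implementation; the `rows` data is a hypothetical subset and the Python function merely borrows the name.

```python
from collections import defaultdict

# Hypothetical (trip ID, latitude) rows; trip 2 is invented.
rows = [
    (1, 37.782551),
    (1, 37.782745),
    (1, 37.782842),
    (2, 37.790000),
]


def collect_array(rows):
    """Group values by key, keeping each group's values in input order."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)


arrays = collect_array(rows)
print(arrays[1])
```

A line is only meaningful if its points arrive in time order. This sketch simply preserves input order, which is an assumption about how the rows are fed in.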
Computing Trip Lengths Now Trivial
-- Compute the trip lengths.
-- Our coordinates conform to WGS84, use that to compute distances.
-- ST_SetSRID(_, 4326) marks the object as conforming to WGS84.
-- Group by trip ID.
-- Inner query: generate an ST_Point for each row.
-- Outer query: group the points, turn them into arrays and make a line out of it.
SELECT
  id,
  ST_GeodesicLengthWGS84(
    ST_SetSRID(
      ST_LineString(collect_array(point)), 4326)) AS length
FROM (
  SELECT
    id,
    ST_Point(longitude, latitude) AS point
  FROM
    uber
) sub
GROUP BY
  id;
Demo
Computing Trip Distances in Hortonworks Sandbox
Visualizing Trip Times and Durations
Time For a New Product?
• How Likely is Demand for an SFO Rideshare?
• How many trips even go to SFO?
ST_Intersects
• Determines if two shapes intersect.
[Figure: two pairs of shapes, one pair that intersects ("Yes") and one that does not ("Not So Much")]
What Trips Go To SFO?
• Approach:
–Draw a polygon around SFO drop-off area.
–Using the ST_LineStrings, see which trips intersect with this polygon.
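The idea behind the approach can be sketched in planar Python: a trip line intersects the polygon if any of its points lies inside the polygon or any of its segments crosses a polygon edge. This is a simplified illustration under stated assumptions (flat x/y coordinates, a simple non-self-intersecting ring, hypothetical helper names and test shapes), not how ST_Intersects is implemented.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _in_box(a, b, c):
    """True if c lies in the bounding box of segment ab (used when collinear)."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def segments_intersect(p1, p2, q1, q2):
    """True if segment p1p2 and segment q1q2 share at least one point."""
    o1, o2 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    o3, o4 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint of one segment lies on the other.
    return ((o1 == 0 and _in_box(p1, p2, q1)) or
            (o2 == 0 and _in_box(p1, p2, q2)) or
            (o3 == 0 and _in_box(q1, q2, p1)) or
            (o4 == 0 and _in_box(q1, q2, p2)))


def point_in_polygon(pt, ring):
    """Ray casting: count crossings of a ray going in the +x direction."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def line_intersects_polygon(line, ring):
    """True if any point of the line is inside the ring or any segment crosses it."""
    if any(point_in_polygon(p, ring) for p in line):
        return True
    edges = list(zip(ring, ring[1:] + ring[:1]))
    return any(segments_intersect(a, b, c, d)
               for a, b in zip(line, line[1:]) for c, d in edges)


# Hypothetical shapes: a 4x4 square and three candidate "trips".
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
crossing = line_intersects_polygon([(-1, 2), (5, 2)], square)
ending_inside = line_intersects_polygon([(6, 6), (2, 2)], square)
elsewhere = line_intersects_polygon([(5, 5), (6, 6)], square)
print(crossing, ending_inside, elsewhere)
```

Note that a trip which merely drives through the polygon counts as a hit, so an intersection test answers "passed through or ended at" and not strictly "dropped off at".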
SFO Drop-Off Area
• Inserted into table locations (name string, location string) for
easy joining against other shapes.
• Data estimated using Google Maps.
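Name  Location
SFO   ST_Polygon(
        -122.392291, 37.616543,
        -122.389115, 37.616458,
        -122.389051, 37.613552,
        -122.392119, 37.613297)
      (longitude, latitude pairs, matching ST_Point(longitude, latitude), listed in order around the boundary)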
Computing the Intersection
-- Count the trips whose line intersects the SFO drop-off polygon.
-- The polygon comes from the location column of the locations table.
SELECT
  count(id)
FROM (
  SELECT
    id,
    ST_LineString(collect_array(point)) AS trip
  FROM (
    SELECT
      id,
      ST_Point(longitude, latitude) AS point
    FROM
      uber
  ) points
  GROUP BY
    id
) trips JOIN (
  SELECT ST_Polygon(location) AS sfo_coordinates
  FROM locations
  WHERE locations.name = "SFO"
) sfosub
WHERE
  ST_Intersects(sfosub.sfo_coordinates, trips.trip);
Demo
Counting Number of Trips to SFO in Sandbox
Counting It Up
• 80 / 25000 Uber trips went to SFO (0.32%)
• SFO Rideshare Product, maybe not a great idea.
Conclusion
• Spatial Framework for Hadoop makes geo analytics simple
with Hadoop and Hive.
• Hive 0.11 makes it simple to slice and dice datasets with
powerful analytics like windowing.
• Open source, extend and change to fit your needs.
Try It For Yourself
• Spatial Framework for Hadoop
–esri.github.io/gis-tools-for-hadoop
• UDFs, extra data and Hive queries
–github.com/cartershanklin/hive-spatial-uber
– (For the collect_array UDAF, queries and extra data)
–github.com/cartershanklin/spatial-framework-for-hadoop
– (For the extra ST_LineString constructor)
• Main Dataset
–infochimps.com/datasets/uber-anonymized-gps-logs
• Hortonworks Sandbox
–The easiest way to learn Hadoop.
–hortonworks.com/sandbox
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

How To Analyze Geolocation Data with Hive and Hadoop

  • 1. Spatial Analytics with Hive
      Hive Meetup – July 24, 2013
      @cshanklin
      Deep Dive content by Hortonworks, Inc. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
  • 2. Why Spatial Analytics?
      • Amount of spatial data has exploded due to mobile device ubiquity and more reliance on sensors.
      • Proliferation of consumer-oriented mapping products brings spatial analytics to the mainstream.
  • 3. An Interesting Dataset
      • GPS data collected from Uber trips.
      • Anonymized, maintains days/times but not dates.
      • Obtained from InfoChimps
  • 4. Data Sample
      ID  Date    Time      Latitude   Longitude
      1   1/7/07  10:54:50  37.782551  -122.445368
      1   1/7/07  10:54:54  37.782745  -122.444586
      1   1/7/07  10:54:58  37.782842  -122.443688
      1   1/7/07  10:55:02  37.782919  -122.442815
      1   1/7/07  10:55:06  37.782992  -122.442112
      1   1/7/07  10:55:10  37.7831    -122.441461
      1   1/7/07  10:55:14  37.783206  -122.440829
      1   1/7/07  10:55:18  37.783273  -122.440324
      Overall: 1.1M distinct readings, 25,000 distinct trips.
  • 5. Meanwhile, At Uber Headquarters…
  • 6. Questions Uber Might Ask:
      • What do trips tend to look like?
      • How can we reduce wait time and make more trips?
      • Are there new products we should introduce?
  • 7. Answering The Questions
      • Why Use SQL?
          – Well understood by analysts.
          – Huge ecosystem, access Hive from any of 20+ BI tools.
      • Why Hive?
          – Supports advanced SQL analytics like windowing functions.
          – Java based, makes it easy for 3rd parties to add extensions.
      • Last Reason
          – This is the Hive meetup. Were you expecting ABAP?
  • 8. Getting a feel for the trips.
  • 9. Duration
      • To get the duration all we need to do is:
          – Subtract the first timestamp from the last timestamp.
          – Do it per trip ID (1-25000).
      • OK, how do we do it with SQL?
  • 10. Getting First Or Last Values In A Partition
      -- Get the last observation from each trip ID.
      -- Standard approach on any SQL system that supports windowing.
      SELECT *
      FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY uber.id ORDER BY uber.dt DESC) as rn
        FROM uber
      ) sub1
      WHERE rn = 1;
  • 11. And Hive Supports Windowing Now (0.11+)
      Name         Purpose
      CUME_DIST    Fraction of rows in the partition with values lower than or equal to (or greater than or equal to, if ORDER BY DESC) the current row.
      DENSE_RANK   The dense rank of the row within the partition. If any rows “tie” or have the same value, they receive the same rank. DENSE_RANK does not have gaps in the ranks, in contrast to RANK.
      FIRST_VALUE  The value in the first row within the partition.
      LAST_VALUE   Surprisingly, not the opposite of FIRST_VALUE (if you want that, just change your sort order). LAST_VALUE is tricky, look it up.
      LAG          Value from a prior row in the partition.
      LEAD         Value from a subsequent row in the partition.
      NTILE        Divides the rows in a partition into N groups.
      ROW_NUMBER   The row number of the row within the partition.
      RANK         The rank of the row within the partition. This differs from ROW_NUMBER in that ties receive the same value.
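The LAST_VALUE caveat on this slide comes down to the window frame. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption, and it needs SQLite 3.25 or later for window functions), with a toy uber table holding three readings for one trip. With only ORDER BY, the standard SQL default frame ends at the current row, so LAST_VALUE simply echoes each row's own value; an explicit frame running to UNBOUNDED FOLLOWING returns the last value of the whole partition.

```python
import sqlite3

# In-memory SQLite database, used purely as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uber (id INTEGER, dt TEXT)")
conn.executemany(
    "INSERT INTO uber VALUES (?, ?)",
    [(1, "10:54:50"), (1, "10:54:54"), (1, "10:54:58")],
)

query = """
SELECT dt,
       -- Default frame: partition start through the current row.
       LAST_VALUE(dt) OVER (PARTITION BY id ORDER BY dt) AS lv_default,
       -- Explicit frame: the whole partition.
       LAST_VALUE(dt) OVER (
           PARTITION BY id ORDER BY dt
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS lv_full
FROM uber
ORDER BY dt
"""
rows = conn.execute(query).fetchall()
for dt, lv_default, lv_full in rows:
    print(dt, lv_default, lv_full)
```

In this toy run lv_default matches dt on every row, while lv_full is the final timestamp throughout. Whether a given Hive release uses the same default frame is worth checking against its own documentation.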
  • 12. Compute Trip Durations
      -- Subtract the first timestamp from the last timestamp.
      -- Use FIRST_VALUE and ROW_NUMBER to help compare first and last timestamps.
      SELECT id, (unix_timestamp(dt) - unix_timestamp(fv)) as trip_duration
      FROM (
        SELECT id, dt, fv
        FROM (
          SELECT id, dt,
                 FIRST_VALUE(dt) OVER (PARTITION BY id ORDER BY dt) as fv,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt DESC) as lastrk
          FROM uber
        ) sub1
        WHERE lastrk = 1
      ) sub2;
  • 13. Trip Duration SQL Output
      id  trip_duration
      1   128
      2   148
      3   150
      4   336
      5   400
      6   168
      7   142
      8   558
      9   312
      10  208
      ... (25,000 total trips)
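For readers who want to check the duration logic outside Hive, here is a minimal Python sketch of the same idea: per trip ID, take the earliest and latest timestamp and subtract. It uses only the eight sample readings from the Data Sample slide, and it assumes the date column is month/day/year (the slides do not say). The function name trip_durations is hypothetical.

```python
from datetime import datetime

# The eight sample readings from the Data Sample slide: (id, date, time).
readings = [
    (1, "1/7/07", "10:54:50"),
    (1, "1/7/07", "10:54:54"),
    (1, "1/7/07", "10:54:58"),
    (1, "1/7/07", "10:55:02"),
    (1, "1/7/07", "10:55:06"),
    (1, "1/7/07", "10:55:10"),
    (1, "1/7/07", "10:55:14"),
    (1, "1/7/07", "10:55:18"),
]


def trip_durations(rows):
    """Return {trip_id: seconds between the first and last reading}."""
    first, last = {}, {}
    for trip_id, date, time in rows:
        # Assumed format: month/day/two-digit year, 24-hour clock.
        ts = datetime.strptime(f"{date} {time}", "%m/%d/%y %H:%M:%S")
        first[trip_id] = min(first.get(trip_id, ts), ts)
        last[trip_id] = max(last.get(trip_id, ts), ts)
    return {tid: int((last[tid] - first[tid]).total_seconds()) for tid in first}


durations = trip_durations(readings)
print(durations)  # {1: 28}
```

The result is 28 seconds because only the first eight readings of trip 1 are included here; the full trip in the SQL output runs 128 seconds.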
  • 14. Duration Was Easy, What About Distance?
      • All we have is GPS readings.
      • If we draw a line from GPS readings, it estimates trip distance.
      • GPS readings are 4s apart, estimates should be close.
      [Figure: actual route, GPS signal, estimated route]
  • 15. Enter GIS Tools for Hadoop
      • esri.github.io/gis-tools-for-hadoop
      • Works with Hive and Map-Reduce
      • Syntax similar to other spatial systems like PostGIS
      • Open Source
  • 16. Spatial Framework for Hadoop Functions
      Name                    Purpose
      ST_LineString           Create a line from coordinates supplied in a string.
      ST_Polygon              Create a polygon.
      ST_SetSRID              Set Spatial Reference ID. SRID 4326 corresponds to WGS84.
      ST_GeodesicLengthWGS84  Compute length of a line in meters assuming points use the World Geodetic System 1984. GPS uses the WGS84 coordinate system.
      ST_Length               Compute Cartesian length.
      ST_Contains             Determine if one spatial object contains another spatial object.
      ST_Intersects           Determine if two spatial objects intersect.
      ST_AsText               Return a text representation of a spatial object, suitable for storing in a Hive string column. Objects can also be saved in binary columns with no conversion.
      82 total spatial functions provided by Spatial Framework for Hadoop.
  • 17. ST_LineString: Make a line.
      • 2 Constructors
          – ST_LineString(1, 1, 2, 2, 3, 3); (simple constructor)
          – ST_LineString('linestring(1 1, 2 2, 3 3)'); (WKT or Well-Known-Text constructor)
      • Neither approach very convenient for this dataset.
      • Since SF4H is open-source I added a new constructor:
          – ST_LineString([Array of ST_Point Objects]);
  • 18. collect_array: Custom UDAF turns columns to arrays
      ID  Date    Time      Latitude   Longitude
      1   1/7/07  10:54:50  37.782551  -122.445368
      1   1/7/07  10:54:54  37.782745  -122.444586
      1   1/7/07  10:54:58  37.782842  -122.443688
      1   1/7/07  10:55:02  37.782919  -122.442815
      1   1/7/07  10:55:06  37.782992  -122.442112
      1   1/7/07  10:55:10  37.7831    -122.441461
      1   1/7/07  10:55:14  37.783206  -122.440829
      1   1/7/07  10:55:18  37.783273  -122.440324
      > SELECT id, collect_array(latitude) FROM table GROUP BY id;
      (1, [ 37.782551, 37.782745, 37.782842, 37.782919, 37.782992 ... ])
      ...
  • 19. Computing Trip Lengths Now Trivial
      -- Compute the trip lengths.
      -- Our coordinates conform to WGS84, use that to compute distances.
      -- ST_SetSRID(_, 4326) marks the object as conforming to WGS84.
      -- Group by trip ID.
      SELECT id,
             ST_GeodesicLengthWGS84(
               ST_SetSRID(
                 ST_LineString(collect_array(point)), 4326)) as length
      FROM (
        SELECT id, ST_Point(longitude, latitude) as point
        FROM uber
      ) sub
      GROUP BY id;
      Callouts: the inner query generates an ST_Point for each row. The outer query groups the points, turns them into arrays and makes a line out of them.
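The grouping-then-measuring pattern in this query can be sketched in plain Python. This is an approximation for illustration, not what the Spatial Framework does internally: it sums haversine (great-circle) distances on a sphere with an assumed mean Earth radius, whereas ST_GeodesicLengthWGS84 is described as working on the WGS84 system. The names haversine_m and trip_lengths are hypothetical, and the rows are the sample readings from the Data Sample slide.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius (spherical approximation)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trip_lengths(rows):
    """rows: (id, latitude, longitude) in time order -> {id: length in meters}."""
    points = defaultdict(list)  # plays the role of collect_array + GROUP BY id
    for trip_id, lat, lon in rows:
        points[trip_id].append((lon, lat))  # same (longitude, latitude) order as ST_Point
    return {
        tid: sum(haversine_m(*a, *b) for a, b in zip(pts, pts[1:]))
        for tid, pts in points.items()
    }


# Sample readings from the Data Sample slide: (id, latitude, longitude).
sample = [
    (1, 37.782551, -122.445368),
    (1, 37.782745, -122.444586),
    (1, 37.782842, -122.443688),
    (1, 37.782919, -122.442815),
    (1, 37.782992, -122.442112),
    (1, 37.7831, -122.441461),
    (1, 37.783206, -122.440829),
    (1, 37.783273, -122.440324),
]

lengths = trip_lengths(sample)
print(lengths)
```

The rows must already be in time order for the summed segments to follow the route; in the Hive version that ordering is up to the aggregation, so it is worth verifying before trusting the lengths.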
  • 20. Demo: Computing Trip Distances in Hortonworks Sandbox
  • 21. Visualizing Trip Times and Durations
  • 22. Time For a New Product?
      • How Likely is Demand for an SFO Rideshare?
      • How many trips even go to SFO?
  • 23. ST_Intersects
      • Determines if two shapes intersect.
      [Figure: two examples, labeled "Yes" and "Not So Much"]
  • 24. What Trips Go To SFO?
      • Approach:
          – Draw a polygon around SFO drop-off area.
          – Using the ST_LineStrings, see which trips intersect with this polygon.
  • 25. SFO Drop-Off Area
      • Inserted into table locations (name string, location string) for easy joining against other shapes.
      • Data estimated using Google Maps.
      Name  Location
      SFO   ST_Polygon( 37.616543, -122.392291, 37.613297, -122.392119, 37.616458, -122.389115, 37.613552, -122.389051)
  • 26. Computing the Intersection
      SELECT count(id)
      FROM (
        SELECT id, ST_LineString(collect_array(point)) as trip
        FROM (
          SELECT id, ST_Point(longitude, latitude) AS point
          FROM uber
        ) points
        GROUP BY id
      ) trips
      JOIN (
        SELECT ST_Polygon(definition) as sfo_coordinates
        FROM locations
        WHERE locations.name = "SFO"
      ) sfosub
      WHERE ST_Intersects(sfosub.sfo_coordinates, trips.trip);
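To make the intersection test concrete, here is a small pure-Python sketch of a line-versus-polygon check. It is not the Spatial Framework's implementation. It treats a trip as intersecting the polygon if any trip vertex lies inside the ring or any trip segment crosses a ring edge. Two assumptions are worth flagging: the four SFO corners from the slide are written as (longitude, latitude) and reordered so the ring does not cross itself, and trip 2 is invented for illustration.

```python
def _orient(a, b, c):
    """Signed area test: >0 if a->b->c turns left, <0 if right, 0 if collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _on_segment(a, b, p):
    """For p collinear with a-b: is p within the segment's bounding box?"""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def segments_intersect(p1, p2, q1, q2):
    d1, d2 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    d3, d4 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    if ((d1 > 0 > d2) or (d1 < 0 < d2)) and ((d3 > 0 > d4) or (d3 < 0 < d4)):
        return True  # proper crossing
    # Touching or collinear-overlap cases.
    return ((d1 == 0 and _on_segment(q1, q2, p1)) or (d2 == 0 and _on_segment(q1, q2, p2))
            or (d3 == 0 and _on_segment(p1, p2, q1)) or (d4 == 0 and _on_segment(p1, p2, q2)))


def point_in_polygon(p, ring):
    """Ray casting: count ring edges crossed by a ray going right from p."""
    inside = False
    for i in range(len(ring)):
        a, b = ring[i], ring[(i + 1) % len(ring)]
        if (a[1] > p[1]) != (b[1] > p[1]):
            x_cross = a[0] + (p[1] - a[1]) * (b[0] - a[0]) / (b[1] - a[1])
            if p[0] < x_cross:
                inside = not inside
    return inside


def line_intersects_polygon(line, ring):
    if any(point_in_polygon(p, ring) for p in line):
        return True
    edges = [(ring[i], ring[(i + 1) % len(ring)]) for i in range(len(ring))]
    return any(segments_intersect(a, b, e1, e2)
               for a, b in zip(line, line[1:]) for e1, e2 in edges)


# SFO corners from the slide as (lon, lat), reordered into a simple ring (assumption).
sfo = [(-122.392291, 37.616543), (-122.392119, 37.613297),
       (-122.389051, 37.613552), (-122.389115, 37.616458)]

trips = {
    1: [(-122.445368, 37.782551), (-122.440324, 37.783273)],  # sample readings, far from SFO
    2: [(-122.395, 37.615), (-122.385, 37.615)],              # hypothetical trip crossing the area
}
sfo_trips = [tid for tid, line in trips.items() if line_intersects_polygon(line, sfo)]
print(sfo_trips)  # [2]
```

Treating longitude and latitude as planar x and y is reasonable at the scale of an airport drop-off area, but it is a simplification.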
  • 27. Demo: Counting Number of Trips to SFO in Sandbox
  • 28. Counting It Up
      • 80 / 25000 Uber trips went to SFO (0.32%)
      • SFO Rideshare Product, maybe not a great idea.
  • 29. Conclusion
      • Spatial Framework for Hadoop makes geo analytics simple with Hadoop and Hive.
      • Hive 0.11 makes it simple to slice and dice datasets with powerful analytics like windowing.
      • Open source, extend and change to fit your needs.
  • 30. Try It For Yourself
      • Spatial Framework for Hadoop
          – esri.github.io/gis-tools-for-hadoop
      • UDFs, extra data and Hive queries
          – github.com/cartershanklin/hive-spatial-uber (for the collect_array UDAF, queries and extra data)
          – github.com/cartershanklin/spatial-framework-for-hadoop (for the extra ST_LineString constructor)
      • Main Dataset
          – infochimps.com/datasets/uber-anonymized-gps-logs
      • Hortonworks Sandbox
          – The easiest way to learn Hadoop.
          – hortonworks.com/sandbox

Editor's Notes

  1. If you spotted the error in this slide… we’re hiring.