5. Have been for at least 30 years
Command
Control devices
Magic! Change our life!
Falling hardware and deployment costs
Cloud services
Ubiquitous communications
Big Data Analytics
7. M2M/IoT Sector Map: Beecham Research
http://www.beechamresearch.com/article.aspx?id=4
IoT Convergence
Technology
Business and ecosystem
People, applications, things, data,
devices, etc.
8. Goal
Help us understand urban phenomena
Improve urban environment, city functions and human life quality
Predict and even pre‐solve the future of cities
An interdisciplinary field fusing data science with
transportation, civil engineering, environment, economy, ecology,
and sociology
[Figure: win-win-win among government (政府), city residents (城市居民), and data science]
9. Sensing city dynamics unobtrusively, automatically, and
constantly
A variety of IoT sensors:
Mobile phones, vehicles, cameras, stations,…
User-generated content (check‐ins, photos, tweets)
Heterogeneous data sources
Geospatial, temporal, social, text, images, economic, environmental
SMI serves both people and cities
Sensing → Mining → Improving
[Figure: urban data sources: location data, traffic flows, human footprints, weather, road network, mobile signals, transportation system, social network]
10. Sensing city dynamics unobtrusively, automatically, and
constantly
A variety of IoT sensors:
Mobile phones, vehicles, cameras, stations,…
User-generated content (check‐ins, photos, tweets)
Heterogeneous data sources
Geospatial, temporal, social, text, images, economic, environmental
SMI serves both people and cities
Sensing (data) → Mining (Knowledge)→ Improving (architectures,
services and environment)
[Figure: urban data sources: location data, traffic flows, human footprints, weather, road network, mobile signals, transportation system, social network]
25. Connectivity is just an enabler; the real value of IoT
is in the data
Big Data is important for finding value, and IoT can play an
important role in data collection, negotiation, and
combination
Big Data is nothing without real business insight
e.g. develop AI‐based applications based on big data
Cloud offers “Everywhere as a Service” for IoT and big data.
37. Temporal closeness
Period
Trend
Traffic flow example, figures from [Igor Grabec et al. 2014]
McDonald’s stock price index example,
figures from [William et al. 2015]
38. Why POI
Indicate the land usage, function, and environment of
a region
Challenges
massive POI data in a city
the information could vary in time
Two approaches to crawl(Google):
existing Yellow Page data
collect POI information physically, e.g., carrying a GPS
logger
some location‐based social networking
services(e.g. Foursquare) have allowed end
users to create a new POI in the system.
Region: residential areas, suburban areas,
and forest
from google map
39. Why road networks
Have a strong correlation with traffic flows
A good complement to mobility modeling
Format:
Represented by a graph that is composed of
a set of edges (denoting road segments) and
a collection of nodes (standing for road intersections)
Each node has unique geospatial coordinates
Other properties, such as the length, speed
constraint, type of road, and number of lanes,
can be associated with an edge.
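The graph format above can be sketched in a few lines of Python. This is a minimal illustration, not a standard road-network library: the class name `RoadNetwork`, the property keys (`length_m`, `speed_limit_kmh`, `road_type`, `lanes`), and the three example intersections are all assumptions made for the example. Segments are treated as two-way for simplicity.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class RoadNetwork:
    # node id -> (latitude, longitude) of a road intersection
    nodes: dict = field(default_factory=dict)
    # node id -> list of (neighbor id, properties of the connecting segment)
    adj: dict = field(default_factory=dict)

    def add_node(self, node_id, lat, lon):
        self.nodes[node_id] = (lat, lon)
        self.adj.setdefault(node_id, [])

    def add_segment(self, u, v, length_m, speed_limit_kmh, road_type, lanes):
        props = {"length_m": length_m, "speed_limit_kmh": speed_limit_kmh,
                 "road_type": road_type, "lanes": lanes}
        # Assumption: every segment is two-way.
        self.adj[u].append((v, props))
        self.adj[v].append((u, props))

    def shortest_length(self, src, dst):
        """Dijkstra's algorithm over segment lengths; None if unreachable."""
        best = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            if d > best.get(u, float("inf")):
                continue  # stale heap entry
            for v, props in self.adj[u]:
                nd = d + props["length_m"]
                if nd < best.get(v, float("inf")):
                    best[v] = nd
                    heapq.heappush(heap, (nd, v))
        return None


# Hypothetical three-intersection network.
net = RoadNetwork()
net.add_node("A", 25.0330, 121.5654)
net.add_node("B", 25.0375, 121.5637)
net.add_node("C", 25.0400, 121.5600)
net.add_segment("A", "B", 500.0, 50, "arterial", 4)
net.add_segment("B", "C", 300.0, 40, "local", 2)
net.add_segment("A", "C", 1000.0, 60, "highway", 6)
```

Keeping the properties on the edge makes it easy to switch the routing cost, for example from length to estimated travel time (length divided by speed limit).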
From Mathieu
Leplatre
From https://people.hofstra.edu/geotrans/eng/methods/nettopology.html
40. Traditional sensors:
Loop sensors are quite limited
Surveillance cameras
widely deployed in urban areas
Need much human effort
Floating car‐based traffic monitoring
methods: GPS
higher flexibility and a lower deployment cost
depends on the distribution of the probing
vehicles
data sparsity problem exists
from google image
from google image
48. Unobtrusively and continually collect data on a large scale
Example: Continually probing the city traffic is challenging
as we do not have sensors on every road segment.
Deploying new sensing devices could help but
would aggravate the burden on cities.
It costs much energy, space, and human resources
How to exploit what we already have in urban spaces
intelligently?
Humans as sensors is a new concept that may help tackle this
challenge.
49. Motivations
It is not cost‐effective to deploy sensors everywhere
Energy consumption
Challenges
Privacy issue
Loosely controlled and non‐uniformly distributed sensors
Unstructured, implicit, and noisy data
texts and images
data‐missing problem
data contain much noise
http://desktop.arcgis.com/
54. Motivations
Lots of trajectories → lots of data
Missing data problem
Noise complicates analysis and inference
Methods
Data reduction and filtering techniques
Indexing methods
Methods for filling missing values
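One simple way to fill missing values in a time series of sensor readings is linear interpolation between the nearest known neighbors. The sketch below is only one of many possible filling methods and is not taken from the slides; the function name `fill_missing` and the (time, value) tuple layout are assumptions.

```python
def fill_missing(readings):
    """Fill None values by linear interpolation over time.

    `readings` is a list of (t, value_or_None) sorted by t. Gaps at either
    end are filled with the nearest known value.
    """
    known = [i for i, (_, v) in enumerate(readings) if v is not None]
    filled = list(readings)
    for i, (t, v) in enumerate(readings):
        if v is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None and nxt is None:
            continue  # nothing known at all, leave the gap
        if prev is None:
            filled[i] = (t, readings[nxt][1])
        elif nxt is None:
            filled[i] = (t, readings[prev][1])
        else:
            t0, v0 = readings[prev]
            t1, v1 = readings[nxt]
            filled[i] = (t, v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return filled


# Hypothetical hourly readings with two gaps.
series = [(0, 10.0), (1, None), (2, 20.0), (3, None)]
repaired = fill_missing(series)
```

Interpolation only uses the temporal neighbors of one sensor; spatial neighbors could be used in the same spirit.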
63. • Output: The Air Quality Label of a certain 1 km grid on 2016/11/9 4am
(n =1~1000)
• Predictors (6 explanatory variables)
• temperature(x1, ‐30~40)
• wind speed (x2, 0~30, units not given)
• Winter?(x3=1 if yes, 0 if not)
• Number of factories (x4)
• humidity(x5, 0~100)
• Average number of population group (x6 = 1,...,6: <10000 (1), 10000‐15000 (2), 15000‐20000 (3),
20000‐25000 (4), 25000‐30000 (5), >30000 (6))
AQI Values | Levels of Health Concern | Colors
0-50 | Good (G) | Green
51-100 | Moderate (M) | Yellow
101-150 | Unhealthy for sensitive groups (U-S) | Orange
151-200 | Unhealthy (U) | Red
201-300 | Very unhealthy (VU) | Purple
301+ | Hazardous (H) | Maroon
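The table above maps an AQI value to a class label, which is what turns the task into classification. A minimal sketch of that lookup follows; the names `AQI_LEVELS` and `aqi_label` are assumptions made for the example, while the breakpoints, labels, and colors are copied from the table.

```python
# (upper bound of the band, label, color), in ascending order.
AQI_LEVELS = [
    (50, "G", "Green"),
    (100, "M", "Yellow"),
    (150, "U-S", "Orange"),
    (200, "U", "Red"),
    (300, "VU", "Purple"),
]


def aqi_label(aqi):
    """Return (label, color) for a non-negative AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, label, color in AQI_LEVELS:
        if aqi <= upper:
            return label, color
    return "H", "Maroon"  # 301 and above
```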
64. Classification and prediction can be used to determine important data
classes or to predict future data trends.
Effective methods for spatio-temporal (ST) data
decision trees, Bayesian belief networks, artificial neural networks, support vector machines
(SVM), nearest neighbor classifiers, and random forests.
Linear and nonlinear methods can be used for prediction.
Watch for overfitting problems
Efficiency is a concern for applications
Feature engineering is always important for urban prediction tasks.
Ensemble methods such as bagging and boosting can be used to increase
overall accuracy by combining a series of individual models.
Jiawei Han: data mining concepts and techniques
http://www.holehouse.org/mlclass/07_Regularization.html
67. • Response: The Air Quality Index of a certain 1 km grid on 2016/11/9 4am
(n =1~1000)
• Predictors (p=6 explanatory variables)
• temperature(x1, ‐30~40)
• wind speed (x2, 0~30, units not given)
• Winter?(x3=1 if yes, 0 if not)
• Number of factories (x4)
• humidity(x5, 0~100)
• Average number of population group (x6 = 1,...,6: <10000 (1), 10000‐15000 (2), 15000‐20000 (3),
20000‐25000 (4), 25000‐30000 (5), >30000 (6))
AQI Values | Levels of Health Concern | Colors
0-50 | Good (G) | Green
51-100 | Moderate (M) | Yellow
101-150 | Unhealthy for sensitive groups (U-S) | Orange
151-200 | Unhealthy (U) | Red
201-300 | Very unhealthy (VU) | Purple
301+ | Hazardous (H) | Maroon
68. [Figure: four scatter plots of Air Quality Index versus wind speed]
69. Hypothesize deterministic component
Estimate unknown Parameters
Specify probability distribution of random error Term
Estimate Error
Evaluate the fitted model
Use model for prediction & estimation
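[Figure: scatter plot of Y against X with the fitted line and an unsampled observation]
Observed value: $Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\varepsilon}_i$
Fitted line: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
$\hat{\varepsilon}_i$ = random error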
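The "estimate unknown parameters" and "use model for prediction" steps can be sketched for the one-predictor case with the closed-form least-squares estimates. The wind-speed and AQI numbers below are made up purely for illustration and are not data from the slides; `fit_simple_ols` is a hypothetical helper name.

```python
def fit_simple_ols(xs, ys):
    """Least-squares estimates (b0, b1) for Y = b0 + b1 * X, plus residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx               # slope estimate
    b0 = mean_y - b1 * mean_x    # intercept estimate
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return b0, b1, residuals


# Made-up values: AQI falls as wind speed rises.
wind_speed = [0.0, 1.0, 2.0, 3.0]
aqi = [160.0, 140.0, 120.0, 100.0]
b0, b1, residuals = fit_simple_ols(wind_speed, aqi)
predicted = b0 + b1 * 4.0  # prediction for an unsampled wind speed
```

The residuals are the estimated random errors; inspecting them is part of the "evaluate the fitted model" step.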
78. Clustering groups objects based on their similarity and has wide
applications.
A measure of similarity can be computed for various features of
the data.
In urban applications, partitioning methods, hierarchical
methods, density‐based methods, grid‐based methods, and
model‐based methods are often used.
Outlier detection can be performed by clustering.
Clustering is sometimes performed before other tasks.
81. Content‐based Filtering
Try to recommend items that are similar to those that a user liked in
the past
Recommends items based on a comparison between the content of
the items and a user profile
Build a model for each user that rates each item
Collaborative Filtering
Rely on the past user behaviors.
Recommend the favored items of people who are ‘similar’ to you
Top‐rated items or Top‐sellers
non‐personalized
Advanced recommendation considering
social/temporal/spatial factors
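A minimal sketch of user-based collaborative filtering: items unseen by the target user are scored by the similarity-weighted ratings of other users. Cosine similarity is used here as one common choice, not necessarily the one intended by the slides; the users, venues, and ratings are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue  # ignore users with no overlap
        for item, r in their.items():
            if item in ratings[user]:
                continue
            scores[item] = scores.get(item, 0.0) + sim * r
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]


# Hypothetical venue ratings.
ratings = {
    "alice": {"park": 5, "museum": 4},
    "bob": {"park": 5, "museum": 4, "cafe": 5, "mall": 1},
    "carol": {"mall": 5, "cafe": 1},
}
suggestions = recommend(ratings, "alice")
```

Carol shares no rated venue with Alice, so only Bob's ratings contribute; social, temporal, or spatial factors could be folded into the similarity function.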
82. CF‐based methods usually perform reasonably well on
urban tasks
Defining a customized user similarity is very important
Hybrid methods usually work.
Single item recommendation is usually not enough for
real‐world applications.
Personalized methods usually perform better
than non‐personalized ones.
Users have spatial/temporal preferences.
88. GPS Devices
In‐car GPS
Personal GPS logger
Location‐based services + mobile phone
Check‐in actions and records
E.g. Facebook, Foursquare, Twitter
Digital Camera
Geo‐tagged photos
E.g. Flickr, Instagram, Panoramio
Such user mobility records reveal how people travel around an area!
89. Geographical Footprints
A sequence/set of location data
points with
Latitude‐longitude records
Time stamps
Represent the spatial‐temporal
human activities
ID | Timestamp | Location
“Peter” | 2010‐04‐02 13:12 | 37.5, ‐122.5
“Peter” | 2010‐04‐02 15:22 | 37.2, ‐123.5
… | … | …
[Figure panels: human movement, animal movement, taxi movement]
figures from Zheng et al., 2016
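A footprint record like the ones in the table can be held in a small data class, and the distance between two consecutive records follows from the haversine formula. The class and function names are assumptions for this sketch, and the Earth is approximated as a sphere of radius 6371 km.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Footprint:
    user_id: str
    timestamp: datetime
    lat: float
    lon: float


def haversine_km(a, b):
    """Great-circle distance in km between two footprints (spherical Earth)."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def average_speed_kmh(a, b):
    """Average speed between two footprints, assuming b is later than a."""
    hours = (b.timestamp - a.timestamp).total_seconds() / 3600.0
    return haversine_km(a, b) / hours


# The two "Peter" records from the table above.
p1 = Footprint("Peter", datetime(2010, 4, 2, 13, 12), 37.5, -122.5)
p2 = Footprint("Peter", datetime(2010, 4, 2, 15, 22), 37.2, -123.5)
distance = haversine_km(p1, p2)
speed = average_speed_kmh(p1, p2)
```

Distances and speeds between consecutive points are the raw material for most trajectory features, such as stay points or transportation-mode cues.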
90. Micro‐level
Friend / Location / Event / Item recommendation
Targeted marketing & computational advertising
Macro‐level
Urban Computing
e.g. functional regions, diagnosing transportation problems
Disaster Management
e.g. resource distribution, rescue planning
Environmental Informatics
e.g. detect air and noise pollution
99. Who is most likely to interact with a given
individual in the future?
Problem Statement
Given G[t0,t0’], a graph on
edges up to t0’
Output a ranked list L of links (not in
G[t0,t0’]) that are predicted to appear
in future G[t0,t1’], t1’>t0’>t0
Friend suggestion in Facebook
Should Facebook suggest
Alice as a friend for Bob?
Common Neighbors
Preferential attachment
Jaccard’s coefficient
Adamic/Adar (AA)
Katz score
Hitting time
Rooted PageRank
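The first four neighborhood-based scores in the list can be computed directly from adjacency sets. The sketch below uses their usual textbook definitions; the tiny friendship graph around Bob and Alice is invented for the example, and the path-based scores (Katz, hitting time, rooted PageRank) are left out.

```python
import math


def common_neighbors(g, u, v):
    return len(g[u] & g[v])


def jaccard(g, u, v):
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0


def adamic_adar(g, u, v):
    # Rare (low-degree) shared neighbors count more than hubs.
    return sum(1.0 / math.log(len(g[z])) for z in g[u] & g[v] if len(g[z]) > 1)


def preferential_attachment(g, u, v):
    return len(g[u]) * len(g[v])


# Hypothetical undirected friendship graph as adjacency sets.
graph = {
    "Bob": {"Carol", "Dave"},
    "Alice": {"Carol", "Dave", "Eve"},
    "Carol": {"Bob", "Alice"},
    "Dave": {"Bob", "Alice"},
    "Eve": {"Alice"},
}
scores = {
    "common_neighbors": common_neighbors(graph, "Bob", "Alice"),
    "jaccard": jaccard(graph, "Bob", "Alice"),
    "adamic_adar": adamic_adar(graph, "Bob", "Alice"),
    "preferential_attachment": preferential_attachment(graph, "Bob", "Alice"),
}
```

Ranking every non-adjacent pair by one of these scores yields the list L from the problem statement.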
105. Mobile phones data
Call Detail Records (CDR): (caller, callee, timestamp, location of the tower that routed the call)
Depict the daily routines of users (human mobility)
Mobility Features
Capture the degree of closeness or similarity of mobility
patterns between two users
Two users sharing a high degree of overlap in their trajectories are
expected to have a better likelihood of forming new links
[Wang et al. KDD’11]
108. Geo‐Social Features @ Check‐in Data
The number of check‐ins of common places
Place Entropy
Human Mobility Features @ Call Detail Records
Distance
Spatial Co‐Location Rate
Spatial Cosine Similarity
Jointly using social graph features, geo‐social features,
and human mobility features can improve the
performance of link prediction
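Two of the human mobility features can be sketched from the lists of cell towers each user was seen at. The formulations below are common ones and are assumptions here: the co-location rate is taken as the probability that two independently drawn visits land on the same tower, and the cosine similarity is computed on visit-count vectors. The exact definitions in [Wang et al. KDD’11] may differ, and the tower ids are invented.

```python
import math
from collections import Counter


def visit_distribution(tower_ids):
    """Empirical probability of seeing the user at each tower."""
    counts = Counter(tower_ids)
    total = sum(counts.values())
    return {loc: c / total for loc, c in counts.items()}


def spatial_colocation_rate(a, b):
    """Probability that a random visit of each user falls on the same tower."""
    pa, pb = visit_distribution(a), visit_distribution(b)
    return sum(p * pb[loc] for loc, p in pa.items() if loc in pb)


def spatial_cosine_similarity(a, b):
    """Cosine of the angle between the two users' visit-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[loc] * cb[loc] for loc in ca if loc in cb)
    norm = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(sum(c * c for c in cb.values()))
    return dot / norm if norm else 0.0


# Hypothetical tower sequences for two users.
user_x = ["t1", "t1", "t2", "t3"]
user_y = ["t1", "t2", "t2", "t4"]
scol = spatial_colocation_rate(user_x, user_y)
scos = spatial_cosine_similarity(user_x, user_y)
```

Such scores can be fed to a link predictor alongside the social graph features and geo-social features.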
114. Social IM
A social graph
Budget k
Propagation model
Independent Cascade
Linear Threshold
Influence Probability
Given / Learn
Geo‐Social IM
+ User’s location(s)
Fixed locations
Set of locations
+ Spatial Target
Region
Location
Global
Event
the probability of being in the region
119. Given a LBSN G=(V,E), a query Q=(R,k)
Find a k‐node seed set $S^* \subseteq V$ such that for any other k‐node
set $S \subseteq V$, $\sigma_R(S^*) \ge \sigma_R(S)$
$\sigma_R(S)$: influence spread of S
Query Region Top‐5 seed set S ={14,3,16,10,8}
Not in the region
cannot simply use vertices located in
the query region to identify top‐K seeds
[Li et al. SIGMOD’14]
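The point that seeds need not lie inside the query region can be illustrated with a plain greedy search under the Independent Cascade model, counting only influenced users located in the region. This is a generic Monte Carlo sketch and not the indexing method of [Li et al. SIGMOD’14]; the graph, the probabilities, and all function names are assumptions.

```python
import random


def simulate_ic(graph, seeds, rng):
    """One Independent Cascade run; graph maps u -> [(v, activation_prob)]."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # Each newly active user gets one chance to activate each neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def regional_spread(graph, seeds, in_region, runs=200, seed=0):
    """Average number of activated users located in the query region."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        total += sum(1 for u in simulate_ic(graph, seeds, rng) if u in in_region)
    return total / runs


def greedy_seeds(graph, users, in_region, k, runs=200):
    """Greedily add the user with the largest estimated regional spread."""
    chosen = []
    for _ in range(k):
        candidates = [u for u in users if u not in chosen]
        best = max(candidates,
                   key=lambda u: regional_spread(graph, chosen + [u], in_region, runs))
        chosen.append(best)
    return chosen


# Hypothetical network: "a" and "d" are outside the region but reach users inside it.
graph = {"a": [("b", 1.0), ("c", 1.0)], "d": [("e", 1.0)]}
users = ["a", "b", "c", "d", "e"]
in_region = {"b", "c", "e"}
top2 = greedy_seeds(graph, users, in_region, k=2)
```

Probabilities of 1.0 make the toy example deterministic; with realistic probabilities the estimate depends on the number of simulation runs.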
120. Real‐world scenario
Each POI would like to attract users to visit
Via the check‐in records of users
More users (friends of check‐in users) will be influenced and
then visit this POI
Given one target location and the number of seeds,
can we find a set of seed nodes to maximize the
number of influenced users?
[Zhu et al. KDD’15]
123. Foursquare
Users write tips for each venue
Users are attracted by some venues via viewing tips
Users add tips of interest to to‐do lists, and mark them done once they
have visited the venues
Location‐based advertising
Enlarge the visibility and adoption of the locations via the
promotion of influential users in LBSN
Questions
How attractive is v’s tip to a user u who views it?
Who are potentially influential users in LBSN?
[Wu et al. PAKDD’13]
124. Attractiveness Model: compute influence prob.
The likelihood that user ui is attracted by user uj’s tips
P(ui→uj) is higher if
(a) ui and uj have more mutually visited venues
(b) uj’s tips are more popular
One‐Wave Diffusion Model, similar to IC model but
Measure the direct impact of the initially selected nodes on their
first degree neighbors
Influence Maximization
Extending the greedy algorithm of conventional IM
[Wu et al. PAKDD’13]
125. Regional Influence of user u inside region R
Expected sum of localities of users influenced by u
Locality = prob. of u checking in at some location inside R
Propagation Model: IC‐based MIAwoT
Maximum Influence Arborescence (MIA) without Threshold
User ux influences uy only via maximum influence path π*xy
Problem: Given a region R, find a set S of k regional
users such that every user in S has a regional influence at least
as high as that of every regional user outside S
C(u): check‐in location of u
Φ(u): the set of users influenced by u
[Bouros et al. CIKM’14]
126. Different events exhibit various geographical and
social correlation among their participants
Real Application: Promoting New Products
iPhone KFC golf
Socially connected &
Geographically close
Users are faraway both
socially & geographically
Socially close but
geographically faraway
When an iPhone user, u2, finds a new Apple
product in her vicinity, her posted info
spreads easily and drives nearby
Apple fans, u1 & u3, to the same store.
When a KFC or golf user creates a related post, the others
may be less influenced, either because the info
cannot reach them or the location is too far.
[Zhang et al. CIKM’12]
129. Online Social Networks
Represent the human interactions in the online virtual world
E.g. Facebook, Twitter, LinkedIn
Location‐based Social Networks
E.g. Foursquare, Gowalla, Brightkite
Offline Social Networks
Represent the human interactions in the physical/real world
Hard to collect
Event‐based Social Networks
Combining offline and online social networks
130. Linking the online and offline social worlds
Online virtual world: share thoughts and experiences
Offline physical world: face‐to‐face interactions
When and where, who and who did what together
Informal get‐togethers. e.g. movie night, dining out
Formal activities. e.g. conference, business meeting
Compared to SN & LBSN
EBSN have stronger social ties & intents than SN
Attending a physical activity together > being friends online
Participating in a hiking event > talking about hiking online
EBSN have explicit social interactions in the real world
LBSN record only offline check‐ins, and suffer from sparsity
132. RSVP “Répondez s'il vous plaît”: a request for a
response from the invited person or people
@LinkedIn
• More than 90% of events have
more than 10 RSVPs
• Only 15% of events have more
than 50 RSVPs
More than 70% of people
reposting attendance to an event
Heavy‐tailed distributions
[Gomez‐Rodriguez, CIKM’12]
134. @Meetup
[Figure panels: co‐joined groups, co‐comments, and online messages, each plotted against co‐attended events]
More online interactions may not result in more offline interactions!!
Time Effect?
Offline interaction grows
in a log‐linear style!
Online interaction: co‐join
group drops exponentially,
co‐comment drops
linearly, online message
remains stable
[Yin et al., SDM’14]
137. Wow! I found a
good restaurant
with buy-2-get-2
free for lunch.
Activity planning
Attendees tend to be familiar with
each other for good atmosphere
Attendees should be close to the target
location to minimize waiting time
• For advertising: find a group of friends for a venue to push a coupon to
[Figure labels: group size, familiarity constraint, activity location, attendees’ current locations, selected group]
[Yang et al., KDD’12]
149. Location Recommendation
Recommend NEW locations (never visited before)
Location Prediction
Predict the next location among existing ones (visited before)
Commonly considered factors
Current location info
Current time
User history/preference
Social interaction
Route Planning can be viewed as
the successive applications of
location recommendation.
153. Query | Preferable Routes
1) A set of locations; 2) Time span of route | A route passing through these locations within the time span
1) A source loc.; 2) A destination loc.; 3) A route length | A route starting from the source and arriving at the destination, with the length satisfied
1) A city or an area; 2) A set of labels of interests | A route in such area, which contains locations possessing such labels
[Figure: illustration of each query type, e.g. a query with source S and destination t]
154. GPS Trajectory
How to find meaningful and/or popular places?
How to tackle efficiently million‐scale geo‐data points for
query processing?
Uncertain Trajectory
Do not detail the sequences of movement
Raise uncertainty between consecutive points
Check‐in records Geo‐tagged Photos
158. Social Query
Popularity of locations
User Preference:
whether or not to consider user’s past
visiting history
Group or Social factor:
group trips or the locations that
friends had ever visited
Activity Labels:
specifying the labels or types of
locations in the route
[Figure: social query example showing group members A to D with their lists of desired locations, visits, and locations labeled park, restaurant, and theater]
Query = {theater, restaurant, park}
160. Location Query Context Query Social Query
QL VO DI VT TT TD CO TK PO UP GS AL
GPS Trajectory Data
[Tang’13] ∎ ∎ ∎ ∎ ∎
[Chen’11] ∎ ∎ ∎
[Zheng’11] ∎ ∎ ∎ ∎ ∎
[Tang’11] ∎ ∎ ∎
[Jeung’08] ∎ ∎
[Xue’13] ∎ ∎
Uncertain
Trajectory Data
[Wei’12] ∎ ∎ ∎ ∎
[Hsieh’14] ∎ ∎ ∎ ∎ ∎ ∎
[Zheng’12] ∎ ∎ ∎ ∎ ∎
[Cao’12] ∎ ∎ ∎ ∎ ∎ ∎
[Lu’10] ∎ ∎ ∎ ∎ ∎ ∎
161. Graph Construction G
Design an objective function f(r) based on the query,
e.g. visiting/transition popularity, label cover
With some constraints, e.g. travel time, financial cost
Find a route/path r in G such that f(r) is optimized
Graph element | Trajectory Data | Road Net
Nodes | Locations | Road Segments
Edges | Traversal | Intersection
Node Weights | Popularity / Satisfaction / Traffic
Edge Weight | Transition Probability / Frequency
163. • Each trajectory = a sequence of geo‐points / locations
Pattern Mining
Mining the frequent subsequences
constrained by the query requirements
Subsequence Pruning: keep closed ones (to reduce complexity)
Subsequence Merge: from local route to global route
Pattern Matching
Find individuals with similar behaviors of movements
Nearest‐neighbor query processing (given some locations)
164. Discover the group of objects that move together (with
similar patterns of movements)
E.g. migration path, driving direction, travel paths
Recommend routes from your companions
Cluster objects, then apply sequential pattern mining
[Tang et al., TIST’13]
Size threshold = 4
Duration threshold
= 4 snapshots
{o1, o2, o3, o4} is
the traveling
companion
165. Given an existing sub‐route, successively
predict/recommend the next locations
Till the user requirement is satisfied
e.g. route length k, travel time, arriving at the destination
Select the next locations
Unsupervised method
Location info. E.g. popularity, density, incoming flow
Estimate the probability P(candidateLoc | curSubRoute)
Supervised method
Choose a set of candidate locations
Extract route/Location‐aware features
Apply supervised learning methods e.g. SVM
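The unsupervised estimate P(candidateLoc | curSubRoute) can be sketched with a first-order Markov simplification, in which only the last location of the sub-route matters. That simplification, the function names, and the three toy trajectories are assumptions for illustration; the place names are borrowed from the Beijing example in this deck.

```python
from collections import Counter, defaultdict


def fit_transitions(trajectories):
    """Count how often each location is followed directly by another."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for cur, nxt in zip(traj, traj[1:]):
            counts[cur][nxt] += 1
    return counts


def next_location_probs(counts, current):
    """P(next | current) estimated from transition frequencies."""
    followers = counts.get(current, {})
    total = sum(followers.values())
    return {loc: c / total for loc, c in followers.items()} if total else {}


def extend_route(counts, route, k):
    """Append the most frequent unvisited next location until the route has k stops."""
    route = list(route)
    while len(route) < k:
        followers = counts.get(route[-1], {})
        candidates = {loc: c for loc, c in followers.items() if loc not in route}
        if not candidates:
            break  # no known continuation
        route.append(max(candidates, key=candidates.get))
    return route


# Hypothetical historical trajectories.
history = [
    ["Forbidden City", "Tian An Men Square", "Qian Men"],
    ["Forbidden City", "Tian An Men Square"],
    ["Forbidden City", "Qian Men"],
]
counts = fit_transitions(history)
probs = next_location_probs(counts, "Forbidden City")
route = extend_route(counts, ["Forbidden City"], k=3)
```

A supervised variant would replace the frequency lookup with a classifier scoring each candidate from route- and location-aware features.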
169. Real Scenarios
“Want to have a one‐day trip in an unfamiliar city, Beijing. Any
route suggestion to visit famous places?”
“I am going to visit the Forbidden City in Beijing, with 3 hours.
What’s the route within the palace?”
Expected results
One‐day trip in Beijing: 3 hours in Forbidden City → 2 hours in
Tian An Men Square → 2 hours in Qian Men.
Using Geo‐tagged Photos
[Lu et al., MM’10]
Merge Paths
175. Features
Category | Feature | Significance
Basic | Dist | Distance of a segment
Basic | MaxVi | The ith maximal velocity of a segment
Basic | MaxAi | The ith maximal acceleration of a segment
Basic | AV | Average velocity of a segment
Basic | EV | Expectation of velocity of GPS points in a segment
Basic | DV | Variance of velocity of GPS points in a segment
Advanced | HCR | Heading Change Rate
Advanced | SR | Stop Rate
Advanced | VCR | Velocity Change Rate
[Zheng, et al. UbiComp’ 08]
182. Philosophy of the model
States of air quality
Temporal dependency in a location
Geo‐correlation between locations
Generation of air pollutants
Emission from a location
Propagation among locations
Two sets of features
Spatially‐related
Temporally‐related
Time
Geospace
Spatial Classifier
Temporal Classifier
Co‐Training
[Zheng & Hsieh et al. KDD’13]
184. Function Features PM10 PM2.5
linear
Temperature ∎ ∎
Humidity ∎ ∎
Wind speed ∎ ∎
Distance ∎ ∎
Road segment length ∎ ∎
Number of intersections ∎
Number of vehicle services ∎
Number of parks ∎ ∎
Number of hotels and real estates ∎
Number of factories ∎ ∎
quartic Pressure ∎ ∎
logarithmic Time ∎ ∎
• The other features investigated are irrelevant to air quality, including highway length, POIC1, POIC4, POIC5, POIC6,
POIC7, POIC9, POIC10, POIC11.
[Hsieh et al. KDD’15]
185. Cannot directly measure the improvement on
inference
Minimize the uncertainty of a relatively accurate model
Basic idea: the AQ distribution of a location should be
skewed (i.e., low entropy value)
The search space grows as k increases.
A greedy‐based method to find k locations that can maximize
their effect.
From uncertainty to unpredictability
Be independent of many other nodes
$H(D_U) = -\sum_{i=1}^{u_n} \sum_{AQI=1}^{\max} D_i \log D_i$
[Figure: two probability distributions over AQI value]
[Hsieh et al. KDD’15]
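The basic idea that a skewed AQ distribution means low uncertainty can be checked with Shannon entropy. The sketch below only computes the entropy of a single location's distribution over the six AQI levels; the two example distributions are invented, and this is not the full selection procedure of [Hsieh et al. KDD’15].

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution; 0 * log 0 is taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical inferred distributions over the six AQI levels at two locations.
skewed = [0.90, 0.06, 0.02, 0.01, 0.005, 0.005]  # confident inference
flat = [1.0 / 6.0] * 6                           # maximally uncertain inference

h_skewed = entropy(skewed)
h_flat = entropy(flat)
```

A location with a flat distribution is one where the inference model is unsure, which makes it a natural candidate when deciding where extra measurement is most informative.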
187. Data sources
The air quality of current time and the past few hours
The meteorological data of current time and past few hours
Humidity, temperature,..
Sunny, foggy, overcast, cloudy…
Light rain, moderate rain, heavy rain, rainstorm
Wind direction, wind speed
Weather forecast
[Zheng et al. KDD’15]
192. Aspect | Traditional Approach | Data Science Approach
Cost Down of Monitoring | Expensive purchase and deployment of physical sensors | Cheap Big Data‐driven machine learning techniques
Sensor Distribution | Severe sensor sparsity problem | Inference of target value anywhere and anytime
Forecasting Accuracy | Low for regions w/o sensors | High for regions w/o sensors
Usages of Sensors | Distinct pollutants need different types of sensors | The framework is general‐purpose (not only for a certain pollution), e.g. water quality, air quality and noise
Sensor Deployment | Determined by human’s knowledge | Deployed by optimizing the objective functions of environmental monitoring
201. Loans data
Approximation of the labeled data
Housing survey data
Business, school and hospital location data
Municipality data
Natural disaster incidences, healthcare coverage, number
of vehicles and passenger buses, etc.
[Ackermann et al. KDD’16]
204. Features of User Review
Features of Taxi
Trajectories
Features of Smart Card
Transactions
Features of Checkins
Estate Investment Value
Learn an estate ranking predictor:
[Fu et al. TKDD’16]
206. Taxi Arriving Volume
Taxi Leaving Volume
Taxi Transition Volume
Taxi Driving Velocity
Taxi Commute Distance
Bus Arriving Volume
Bus Leaving Volume
Bus Transition Volume
Bus Stop Density
Popularity of Checkin
Topic Profile of Checkin
Propagating word‐of‐mouth from poi
to neighborhood
Textual profiling from words to topics
[Fu et al. TKDD’16]
207. • Business reviews and checkins perform better than taxi and bus traces
• Checkins and reviews represent the attending phase
• Taxi and bus traces represent the moving phase
• Taxi features perform better than bus features in a falling market
• Taxi mobility represents white‐collar and business people
• Bus mobility represents the middle class
[Fu et al. TKDD’16]
210. Identify significant factors
Feature analysis by directly drawing the graph
Correlation analysis
Prediction(by regression or classification) using different feature
sets
Some tips for mining ST Data
Feature engineering is always important
Handle users’ queries or preferences
Model spatial and temporal dependency
Urban ST Data is large‐scale and highly dynamic
Need effective and efficient model