[Event Series] Spatio-Temporal Big Data Applications in Smart Cities

Hsun‐Ping Hsieh(解巽評)
NCKU EE

 A city is a concentration of people
 More than 70% of the population lives in urban areas, using more than 70% of energy
Source: IBM Smarter Cities

Smart Cities in Development
 Smart cities around the world each have their own highlights and priorities

 Have been for at least 30 years
 Command
 Control devices
 Magic! Change our life!
 Falling hardware and deployment costs
 Cloud services
 Ubiquitous communications
 Big Data Analytics

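 Data, connectivity, and devices
 Devices (sensors)
 Perceive the environment and sense values
 Connectivity
 Transmit, fuse, and link what is sensed
 Data: images, sound, text, numbers, and symbols
 The raw material that makes up information and knowledge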

M2M/IoT Sector Map : Beecham Research
http://www.beechamresearch.com/article.aspx?id=4
 IoT Convergence
 Technology
 Business and ecosystem
 People, applications, things, data,
devices, etc.

 Goal
 Help us understand urban phenomena
 Improve the urban environment, city functions, and quality of life
 Predict and even pre‐solve the future of cities
 An interdisciplinary field fusing data science with transportation, civil engineering, environment, economy, ecology, and sociology
[Figure: win-win-win among government, city residents, and data science]

 Sensing city dynamics unobtrusively, automatically, and
constantly
 A variety of IoT sensors:
 Mobile phones, vehicles, cameras, stations,…
 User generated contents (check‐in, photos, tweets)
 Heterogeneous data sources
 Geospatial, temporal, social, text, images, economic, environmental
 SMI (Sensing, Mining, Improving) serves both people and cities
 Sensing (data) → Mining (knowledge) → Improving (architectures, services, and environment)
[Figure: urban data sources: location data, traffic flows, human footprints, weather, road network, mobile signals, transportation system, social network]

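Smart City Big Data Processing Architecture
Heterogeneous data sensing and collection
Data processing and management
Artificial intelligence (e.g., big data analytics, machine learning, data mining)
Service provision and applications (e.g., social services, business models, traffic management, green energy conservation, urban planning)
Stages: Sensing, Mining, Improving
[Figure: urban data sources: location data, traffic flows, human footprints, weather, road network, mobile signals, transportation system, social network]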

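Areas of expertise
 Big data analytics and mining
 Smart city and IoT applications
 Social network analysis
 Urban science and computing
Education and experience
 Assistant Professor, Department of Electrical Engineering, National Cheng Kung University
 Ph.D., Graduate Institute of Networking and Multimedia, National Taiwan University
 Visiting Scholar, KAIST, South Korea
 "Star of Tomorrow" intern, Microsoft Research Asia
Awards
 Champion, ACM KDD Cup world data mining competition
 Interdisciplinary research has received outstanding paper awards from 8 academic or industry organizations in fields including information computing, artificial intelligence, geographic information, social networks, management decision-making, environmental engineering, and green energy
 National Outstanding Youth Award for college students, Taiwan
 Best Intern, Microsoft Research Asia
Key credentials
 Microsoft Research Asia: integrated air quality inference and new monitoring station location recommendation system
 Microsoft CityNext project (real-time air quality prediction system http://urbanair.msra.cn/, adopted by the environmental agencies of major cities in China)
 Intel M2M (Machine to Machine) sensing and connected computing network project
 ACM SIGKDD 2015~2017 Program Committee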


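International Application Development
Countries shown on the map: Spain, United States, Mainland China, Germany, South Korea, France, Japan, Singapore
Smart health: self-monitoring medical system for diabetes patients in Mississippi
Smart safety: New York City Crime Information Warehouse (CIW) and Real Time Crime Center
Smart transportation: V2V becomes standard equipment in 2016
Smart transportation: Barcelona deploys interactive bus shelters, vacant parking space detection, and a convenient bicycle system
Smart entertainment: Hong Kong AR tour guide app
Smart logistics: Xiamen port authority LTE system for "port data statistics, analysis, and application"
Smart logistics: Hong Kong International Airport cargo system
Smart logistics: Frankfurt Airport CargoCity integrates European customs clearance systems, a model for pre-declared clearance
Singapore
Smart payment: SingTel launches the mobile payment service Dash
Smart transportation: real-time traffic feedback system and big data applications
Smart health: SK Telecom launches the personal health service Health-On
Smart transportation: Seoul smart bus system, and an intelligent transportation system in Songdo smart city
Smart safety: SFR networked home security monitoring service
Smart transportation: detection of the number of queued taxis and a taxi circulation service
Smart logistics: Hubstart Paris airport city provides smart warehousing and cross-border customs clearance for continental Europe
Smart entertainment: penguin AR navigation at Tokyo's Sunshine Aquarium and other sites
Smart transportation: Tokyo collects traffic volume and vehicle speed data to allocate road network space
Note 1: V2V (Vehicle to Vehicle): vehicle-to-vehicle communication; LBS (Location Based Service): mobile positioning service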

Key IoT Technologies
 Data sensing and transmission, platform construction, cloud technology, and big data analytics

International IoT Promotion Status
Source: ITRI IEK (2015/03)
 IoT programs being actively promoted

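Application Scope of the Smart IoT
 Broad application scope: from individuals, households, and businesses to government agencies, and from the private to the public sector, covering every aspect of daily life (food, clothing, housing, transportation, education, and entertainment)
Source: MIC (2015/10)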

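Personal Views
 What characterizes a smart city is how it strengthens the application of "intelligence"
 Intelligence comes from reacting to environmental changes in real time and from analyzing and judging many kinds of data
 Cross-domain applications let information and communication technology (ICT) reach its full potential
 Effective platform construction delivers system integration and public convenience
 Data storage, cloud computing, and tools for statistics and analysis of all kinds of data
 All applications and services still depend on human-centered thinking
 Hardware and software system integration maximizes the benefit of application systems
 Infrastructure is the vehicle that makes the smart city vision achievable
 Optimized combination of heterogeneous networks: 4G + WiFi + iBeacon …
 Data extraction is a necessary step for interacting with the environment; accurate data acquisition is the first step to success
 Sensing and fusion of all kinds of data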

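Data Sensing and Fusion
 Different sensor types play different and complementary roles
 Image sensors, inertial sensors, gas sensors, chemical sensors …
 Real application scenarios fuse many kinds of signals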

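Raising the Level of Smart Applications
 Real-time response to environmental changes
 Dynamic sensing: e.g., recommendations may differ between sunny and rainy days or at different times
 Customized recommendations based on individual behavior
 Predictions of the future that are useful as a reference
 Use existing information to predict future conditions
 Effective resource allocation
 Effective use of heterogeneous network bandwidth
 Extensions built on infrastructure: 4G + WiFi + iBeacon
 Optimal deployment sites + temporal and spatial prediction
 Anomaly detection
 Traffic accidents, dengue fever prevention, etc.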

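Current IoT Architecture: Smart Transportation Example
Sensing layer: WSN, various sensors, RFID, images
Data layer: bus database, road traffic database, parking system database, electronic toll collection database, vehicle control and safety services database, tourism database, ...
Service layer: ……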

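Future IoT Architecture: Smart Transportation Example
City layer: WSN, various sensors, RFID, ……; data from social media, transportation systems, images, climate, people and locations, and road network structure
Cloud layer: city cloud data center
Computing layer: big data analytics and prediction modules for city data
Service layer: customized services that take the user's context into account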

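Smart Public Transit
 Current practice
 Plan transit headways from historical peak and off-peak data
 Mobile query systems for categorized transit information, such as real-time bus information systems
 Bus shelters show the arrival time of the next bus
 Improvements
 Demand-responsive transit: collect riders' dynamic demand and dispatch vehicles automatically after computer optimization (United States, Japan)
 Use predicted traffic and demand as support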

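Smart Roads
 Current practice
 Plan signal timing for road segments from historical time-of-day data
 Existing vehicle detectors and CCTV cameras provide traffic flow information; after processing by a central computer, traffic information is sent to roadside signs so that users can see real-time conditions, including road conditions, detours, parking, and available spaces.
 Every user receives the same navigation result
 Usually only the fastest travel time is considered
 Improvements
 Vehicle-to-roadside communication: automatic electronic toll collection, automatic retrieval of traffic conditions ahead, parking information, and location-based audio/video upload and download (United States, Japan)
 Vehicle-to-vehicle communication: onboard equipment and GPS transmit collision-avoidance information, telling vehicles each other's position, heading, speed, and braking actions to prevent accidents (United States, Japan)
 Zero-collision autonomous driving as the ultimate goal (Europe, United States, Japan)
 Diverse smart navigation to spread traffic, such as the safest, most beautiful, most sights, most cultural, or most eco-friendly route (Italy: most scenic route navigation)
 Predict traffic conditions 2 to 4 hours in advance (Netherlands, United States)
 Coordinate nearby traffic signals according to current traffic flow to control congestion (Netherlands, Belgium)
 Connect with the traffic control center, other application users, and the weather forecast service so that every user receives personalized navigation; users who start with the same route may still get different results. The system navigates automatically according to congestion, collects other users' potential routes, and then adjusts (Netherlands)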

 Connectivity is just an enabler; the real value of IoT is in the data
 Big data is important for finding value, and IoT can play an important role in data collection, negotiation, and combination
 Big data is nothing without insight into real business value
 e.g., develop AI‐based applications based on big data
 Cloud offers “Everywhere as a Service” for IoT and big data.

 Road connection, safety mining from twitter,
transportation mode, destination toward, user
query
Kim et al. 2014. Figure labels: Safe one, Not safe one

 Similar to traffic prediction
 Cluster needs
 Must consider the riding duration and station distance
Li et al. 2015

 Traffic anomaly detection by aggregation of GPS
data
 Diagnosis using Twitter data
Hong et al. 2015

 Possible influential factors: weather, road network, traffic flow, location information, etc.
 Deployment factors: budget, inference accuracy, effect of meteorological variation, space size, etc.
[Figure: location info, weather, road network, traffic]

 Possible influential factors: human mobility, geo‐categorical information, positive sentiments, social network, etc.
 Deployment factors: profit, competitiveness, transition cost, space size, number of seats, etc.
[Figure: geo‐categorical info, social network, human mobility, sentiment]

 Possible influential factors: weather, human mobility, geo‐categorical information, brightness, loudness, etc.
 Deployment factors: historical crime rate, time, negative sentiments, transportation convenience, etc.
[Figure: geo‐categorical info, human mobility, brightness, weather]

[Figure: cycle linking Urban Issues, IoT Data, Models, Cloud, and the Urban Scientist, with arrows labeled Handle, Exploit, Propose, and Solve & Improve]
IoT, Big data and Cloud are the future of the world
Lots of opportunities in Smart City


Types of spatio‐temporal data (figure modified from [Zheng et al. 2016])
 Spatial static, temporal static data
 Point‐based: POI (Point‐of‐Interest) distributions (Google Maps, review systems)
 Network‐based: road/transportation networks (Bing, Google, Baidu Maps)
 Spatial static, temporal dynamic data
 Point‐based: weather/AQI station data (deployed sensors, monitoring stations)
 Network‐based: road traffic data (Google Traffic, bureaus)
 Spatial‐temporal dynamic data
 Point‐based: spatial‐temporal crowdsourcing data or events (Foursquare, geo‐tweets, Facebook check‐ins, car accidents, news)
 Network‐based: trajectory data (GPS, taxi, Uber, mobile, telecom, disaster)
 Spatial‐temporal (ST) Properties
 Frequency of varying information
 Events or routine reports

 Distance (physical closeness)
 Heterogeneous categories
 Different spatial granularities
 Region or neighborhood characteristics
 Hierarchy
 Hierarchical location types
 City structures
[Figures: location hierarchy with levels l1 to l3 and clusters c10 to c36, from [Zheng et al. 2016]; category hierarchy from Foursquare categories]

 Temporal closeness
 Period
 Trend
Traffic flow example, figures from [Igor Grabec et al. 2014]
McDonald’s stock price index example,
figures from [William et al. 2015]

 Why POI
 Indicate the land usage, function, and environment of
a region
 Challenges
 massive POI data in a city
 the information could vary in time
 Two approaches to collecting POI data (e.g., Google):
 Existing Yellow Pages data
 Collecting POI information in the field, e.g., by carrying a GPS logger
 Some location‐based social networking services (e.g., Foursquare) allow end users to create a new POI in the system.
 Region types: residential areas, suburban areas, and forest
From Google Maps

 Why road networks
 They correlate strongly with traffic flows
 They complement mobility modeling well
 Format:
 Represented by a graph that is composed of
 a set of edges (denoting road segments) and
 a collection of nodes (standing for road intersections)
Each node has unique geospatial coordinates
 Other properties, such as the length, speed limit, type of road, and number of lanes, can be associated with an edge.
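The graph format above can be sketched in a few lines of Python. This is only an illustrative sketch: the node names, coordinates, and edge attributes (`length_m`, `speed_kmh`, `lanes`) are made-up assumptions, and Dijkstra's algorithm is used simply as one familiar query that such a graph supports.

```python
import heapq

# Hypothetical toy network: nodes are intersections with (lat, lon);
# edges are road segments carrying length, speed limit, and lane count.
nodes = {
    "A": (22.9970, 120.2120),
    "B": (22.9990, 120.2150),
    "C": (23.0010, 120.2130),
    "D": (23.0030, 120.2170),
}
edges = {
    ("A", "B"): {"length_m": 400, "speed_kmh": 50, "lanes": 2},
    ("B", "D"): {"length_m": 600, "speed_kmh": 50, "lanes": 2},
    ("A", "C"): {"length_m": 500, "speed_kmh": 40, "lanes": 1},
    ("C", "D"): {"length_m": 450, "speed_kmh": 40, "lanes": 1},
}

def build_adjacency(edges):
    """Turn the edge table into an undirected adjacency list."""
    adj = {}
    for (u, v), attrs in edges.items():
        adj.setdefault(u, []).append((v, attrs))
        adj.setdefault(v, []).append((u, attrs))
    return adj

def shortest_path_length(adj, source, target, weight="length_m"):
    """Dijkstra's algorithm over the chosen edge attribute."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, attrs in adj.get(u, []):
            nd = d + attrs[weight]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

adjacency = build_adjacency(edges)
print(shortest_path_length(adjacency, "A", "D"))
```

Keeping the attributes on the edges means the same graph can be queried by distance, or by travel time if a time attribute is derived from length and speed limit.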
From Mathieu
Leplatre
From https://people.hofstra.edu/geotrans/eng/methods/nettopology.html

 Traditional sensors:
 Loop sensors: coverage is quite limited
 Surveillance cameras
 Widely deployed in urban areas
 Need much human effort
 Floating car‐based traffic monitoring methods: GPS
 Higher flexibility and a lower deployment cost
 Depend on the distribution of the probe vehicles
 A data sparsity problem exists
From Google Images

 A call detail record (CDR) is a data record produced by a telephone exchange; it contains the phone numbers of both the calling and receiving parties, the start time, and the duration of the call.
 A mobile phone’s location can be roughly calculated from the base stations.
 CDRs can be used to study the behavior of an individual or to build a network between different users.
From MIT reality mining

 Another kind of data representing citywide human mobility
 Card swiping data
 Available in a city’s public transportation systems, where people swipe an RFID card when entering a subway station or getting on a bus
 They swipe their cards again when leaving a station or getting off a bus
 Street‐side parking
 Indicates the vehicle traffic around a place
 Helps not only to improve a city’s parking infrastructure but also to analyze people’s travel patterns
 Supports geo‐ads and location selection for a business
From Google Images

 Meteorological data
 Air quality data
 monitoring stations
 portable sensors
 Noise data
 Water pollutant
from google image

 Social structure, represented by a graph
 User‐generated social media
 texts, photos, and videos
 model people’s mobility in urban areas, which helps us
detect and understand urban events
Zheng et al. 2014

 Examples: transaction records of credit cards, stock
prices, housing prices, and people’s incomes
 Applications using machine‐learning
From House 123
Amir et al. 2010

 Gas consumption of vehicles on road surfaces and in
gas stations reflects a city’s energy consumption
 directly from gas sensors
 inferred from other data sources(e.g. GPS)
 The electricity consumption of a building can be used to optimize residential energy usage and shift peak loads
 Energy data reflect a city’s deployment and energy infrastructures
 The distribution of gas stations
 The pollution emission from vehicles on road surfaces

 Health care and disease data usually are generated by
hospitals and clinics
 Wearable computing devices enable people to monitor
their own health conditions, such as heart rate, pulse,
and sleep time
 Diagnosing a disease and doing a remote medical examination.
 Causal relationship with other sensors
 how is air pollution related to the asthma situation? How does
urban noise impact people’s mental health?
From Tainan City Government

 Collect data unobtrusively and continually at a large scale
 Example: continually probing city traffic is challenging
 because we do not have sensors on every road segment.
 Deploying new sensing devices could help, but it would
 add to the burden on cities.
 cost much energy, space, and human resources
 How can we intelligently exploit what we already have in urban spaces?
 Humans as sensors is a new concept that may help tackle this challenge.

 Motivations
 It is not cost‐effective to deploy sensors everywhere
 Energy consumption
 Challenges
 Privacy issues
 Loosely controlled and non‐uniformly distributed sensors
 Unstructured, implicit, and noisy data
 Texts and images
 Missing data
 Data contain much noise
http://desktop.arcgis.com/

 Learn knowledge from heterogeneous data
 Learning must be both effective and efficient
 Another important issue: visualization

 Need to communicate with many devices and users
simultaneously
 Need to send and receive data of different formats at
different frequencies.
 Real‐time online applications
 E.g.
 Online: Social media
 Offline: Traffic
 Service is usually online
http://gov20class.blogspot.tw/2012/12/implication‐of‐studying‐online‐offline.html

 Large‐scale, highly dynamic, and complicated
Cloud computing
Spark/MapReduce
Data structures for handling ST data


 Motivations
 Lots of trajectories → lots of data
 Missing data problem
 Noise complicates analysis and inference
 Methods
 Data reduction and filtering techniques
 Indexing method
 Filling missing value methods

 Goal: reduce trajectory size w/o compromising
much precision
 Performance Metrics
 Processing time
 Error Measure
 Online vs. batch cases
From Lee & Krumm et al.
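As one concrete batch-mode example of trajectory reduction, the sketch below uses the Douglas-Peucker line simplification algorithm. The slide does not name a specific method, so this choice is an assumption, as are the function names, the planar (x, y) coordinates, and the sample track; `epsilon` is the error bound that trades size against precision.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b in planar coordinates."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def douglas_peucker(points, epsilon):
    """Keep only points that deviate more than epsilon from the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= epsilon:
        return [first, last]
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # drop the duplicated split point

# Made-up track: nearly straight, then a sharp turn at (3, 0).
track = [(0, 0), (1, 0.05), (2, -0.04), (3, 0), (4, 2), (5, 4)]
simplified = douglas_peucker(track, epsilon=0.1)
print(simplified)
```

The two performance metrics listed above map directly onto this sketch: processing time grows with the number of points, and the error measure is the maximum perpendicular deviation bounded by `epsilon`.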

 Goal: filter noise
 Mean and median filters
 Kalman filter
 Particle filter
From Lee & Krumm et al.
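The simplest of these filters, the median filter, can be sketched as follows. The window size, the planar (x, y) point format, and the sample data are assumptions for illustration; Kalman and particle filters need a motion model and are not shown.

```python
from statistics import median

def median_filter(points, window=3):
    """Replace each point by the coordinate-wise median of its window.

    Near the ends of the trajectory the window is simply truncated.
    """
    half = window // 2
    smoothed = []
    for i in range(len(points)):
        lo, hi = max(0, i - half), min(len(points), i + half + 1)
        xs = [p[0] for p in points[lo:hi]]
        ys = [p[1] for p in points[lo:hi]]
        smoothed.append((median(xs), median(ys)))
    return smoothed

# Made-up trajectory with one outlier in y at the third point.
noisy = [(0.0, 0.0), (1.0, 0.1), (2.0, 9.0), (3.0, 0.2), (4.0, 0.1)]
clean = median_filter(noisy, window=3)
print(clean[2])
```

A median is less sensitive to a single outlier than a mean, which is why the spike at the third point disappears instead of being spread over its neighbors.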

 Goal: respond efficiently to spatial queries
 Space partition‐based indexing structures
 Grid‐based
 Quad‐tree
 k‐D tree
 Data‐driven indexing structures
 R‐tree
 Query types: nearest neighbour queries, region (range) queries
[Figures: grid‐based index, k‐D tree, R‐tree, and quad‐tree. All figures are from Zheng et al.]
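A grid-based index, the first structure listed above, can be sketched as a dictionary from cell coordinates to the objects inside that cell. The class name, method names, and cell size below are hypothetical choices for illustration.

```python
from collections import defaultdict

class GridIndex:
    """Uniform grid over the plane; each cell stores the points that fall in it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, obj_id, x, y):
        self.cells[self._cell(x, y)].append((obj_id, x, y))

    def range_query(self, x_min, y_min, x_max, y_max):
        """Return ids of points inside the axis-aligned query rectangle."""
        cx_min, cy_min = self._cell(x_min, y_min)
        cx_max, cy_max = self._cell(x_max, y_max)
        hits = []
        # Only cells overlapping the rectangle are scanned.
        for cx in range(cx_min, cx_max + 1):
            for cy in range(cy_min, cy_max + 1):
                for obj_id, x, y in self.cells.get((cx, cy), []):
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        hits.append(obj_id)
        return hits

index = GridIndex(cell_size=1.0)
index.insert("p1", 0.5, 0.5)
index.insert("p2", 2.5, 2.5)
index.insert("p3", 1.2, 0.8)
print(index.range_query(0.0, 0.0, 1.5, 1.0))
```

The cell size is the main tuning knob: small cells mean fewer false candidates per cell but more cells to visit for a large query region.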

 Classification & Prediction(Supervised Learning)
 Discovery (Unsupervised Learning)
 Pattern Mining method
 Clustering
 Regression
 Recommendation

 Predicts categorical class labels
 Typical applications
 {credit history, salary} → credit approval (Yes/No)
 {Temp, Humidity} → Rain (Yes/No)
Mathematically:
$$x \in X = \{0,1\}^n, \quad y \in Y = \{0,1\}, \quad h: X \to Y, \quad y = h(x)$$

 Infer environmental labels based on city dynamics
 Infer transportation modes(class) from GPS data
 Predict whether the person will visit this attraction or
not
 Guess whether two people are familiar with each other
 Forecast whether the customer has high risk or low
risk based on his transaction records

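[Figure: linear SVM with a separating hyperplane and its margin]
Decision boundary: $\vec{w} \cdot \vec{x} + b = 0$
Margin boundaries: $\vec{w} \cdot \vec{x} + b = 1$ and $\vec{w} \cdot \vec{x} + b = -1$
$$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases}$$
$$\text{Margin} = \frac{2}{\|\vec{w}\|}$$
Jiawei Han: data mining concepts and techniques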

 What if decision boundary is not linear?
Jiawei Han: data mining concepts and techniques

• Output: the air quality label of a certain 1 km grid at 4 am on 2016/11/9 (n = 1~1000)
• Predictors (6 explanatory variables)
• Temperature (x1, ‐30~40)
• Wind speed (x2, 0~30, units not given)
• Winter? (x3 = 1 if yes, 0 if not)
• Number of factories (x4)
• Humidity (x5, 0~100)
• Population group (x6 = 1,...,6): <10000 (1), 10000‐15000 (2), 15000‐20000 (3), 20000‐25000 (4), 25000‐30000 (5), >30000 (6)
AQI Values Levels of Health Concern Colors
0-50 Good (G) Green
51-100 Moderate (M) Yellow
101-150 Unhealthy for sensitive groups (U-S) Orange
151-200 Unhealthy (U) Red
201-300 Very unhealthy (VU) Purple
301+ Hazardous (H) Maroon
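The table above is what turns a numeric AQI value into the class label used as the classification output. A minimal sketch of that mapping follows; the function and constant names are hypothetical, while the thresholds, level names, and colors are taken from the table.

```python
# Upper bound of each band, with the level and color from the table above.
AQI_LEVELS = [
    (50, "Good", "Green"),
    (100, "Moderate", "Yellow"),
    (150, "Unhealthy for sensitive groups", "Orange"),
    (200, "Unhealthy", "Red"),
    (300, "Very unhealthy", "Purple"),
]

def aqi_level(aqi):
    """Map a non-negative AQI value to its (level, color) pair."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, level, color in AQI_LEVELS:
        if aqi <= upper:
            return level, color
    return "Hazardous", "Maroon"

print(aqi_level(42))
print(aqi_level(151))
```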

 Classification and Prediction can be used to determine important data
classes or to predict future data trends.
 Effective methods for ST data
 decision trees, Bayesian belief network, Artificial neural network, Support Vector Machine
(SVM), nearest neighbor classifiers, and random forest.
 Linear and nonlinear methods can be used for prediction.
 Note overfitting problems
 Efficiency is a concern for applications
 Feature engineering is always important for urban prediction tasks.
 Hybrid methods such as Bagging and Boosting can be used to increase
overall accuracy by combining a series of individual models.
Jiawei Han: data mining concepts and techniques
http://www.holehouse.org/mlclass/07_Regularization.html

 Relationship between one dependent variable and
explanatory variable(s)
 Use equation to set up relationship
 Numerical Dependent (Response) Variable
 1 or More Numerical or Categorical Independent (Explanatory) Variables
 Used mainly for value prediction & estimation

 Infer environmental informatics based on city
dynamics
 Estimate potential needs for locations
 Predict human mobility or traffic flows in urban area
 Other applications, e.g. crime rate or methodologies.

• Response: the Air Quality Index of a certain 1 km grid at 4 am on 2016/11/9 (n = 1~1000)
• Predictors (p = 6 explanatory variables)
• Temperature (x1, ‐30~40)
• Wind speed (x2, 0~30, units not given)
• Winter? (x3 = 1 if yes, 0 if not)
• Number of factories (x4)
• Humidity (x5, 0~100)
• Population group (x6 = 1,...,6): <10000 (1), 10000‐15000 (2), 15000‐20000 (3), 20000‐25000 (4), 25000‐30000 (5), >30000 (6)
AQI Values Levels of Health Concern Colors
0-50 Good (G) Green
51-100 Moderate (M) Yellow
101-150 Unhealthy for sensitive groups (U-S) Orange
151-200 Unhealthy (U) Red
201-300 Very unhealthy (VU) Purple
301+ Hazardous (H) Maroon

[Figure: four scatter plots of Air Quality Index against wind speed]

 Hypothesize deterministic component
 Estimate unknown Parameters
 Specify probability distribution of random error Term
 Estimate Error
 Evaluate the fitted model
 Use model for prediction & estimation
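For a single explanatory variable, the steps above can be sketched with ordinary least squares. The data below (wind speed against AQI) are made up and exactly linear so the result is easy to check by hand; the function name is a hypothetical choice.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

# Made-up observations: AQI falls by 10 for each unit of wind speed.
wind_speed = [1.0, 2.0, 3.0, 4.0, 5.0]
aqi = [110.0, 100.0, 90.0, 80.0, 70.0]

beta0, beta1 = fit_simple_linear_regression(wind_speed, aqi)
# Residuals are the estimated random errors of the fitted model.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(wind_speed, aqi)]
prediction = beta0 + beta1 * 6.0  # estimate for an unsampled wind speed
print(beta0, beta1, prediction)
```

With real data the residuals would not be zero, and inspecting them is the "evaluate the fitted model" step before the model is used for prediction.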
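[Figure: scatter plot of Y against X with the fitted regression line, an observed value, and an unsampled observation]
Observed value: $Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\varepsilon}_i$
Fitted line: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
$\hat{\varepsilon}_i$ = random error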
Regression models
 Simple (1 explanatory variable)
 Linear
 Non‐linear
 Multiple (2+ explanatory variables)
 Linear
 Non‐linear
 The remarks are similar to those for classification methods
 Effective methods for ST data
 Linear regression
 Polynomial regression
 Logistic regression
 SVR (support vector regression, the regression version of SVM)

 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and
diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
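The "purchased together" question can be sketched as counting how often each pair of items co-occurs and keeping the pairs whose support reaches a threshold. This is only the pair-counting core, not the full Apriori algorithm, and the transactions and function name are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs appearing in enough transactions."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) gives each pair one canonical order per transaction
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Made-up shopping baskets.
baskets = [
    ["beer", "diapers", "milk"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "diapers", "bread"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

Support here is the fraction of transactions that contain both items; association rule mining then adds a confidence measure on top of these counts.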
Modified from Jiawei Han: data mining concepts and techniques

 Mining trajectories, events or other spatial /temporal
correlations among objects
 Mining frequent route/diffusion patterns
 Driving directions
 Visiting attractions
 Disease/Rumor/Influence/pollution diffusion/propagation
 Mining single/group behavioral patterns
 Spatial patterns: O1 and O2 [in same group] 4 times
 Temporal patterns: O1 [right,t1→ right,t1 → right,t1] 2 times
 Spatial‐Temporal patterns: O1 and O2 [in same group from 2 time slots] 2 times
NhatHai Phan 2013

 Large‐scale spatial‐temporal data
 Mining sequential and graphical
structure patterns
 Discover a variety of rules and
interesting patterns from past
experiences
park
restaurant
theater
[Chen et al., ICDE’11]

 Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
 Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes

 Clustering locations (to become a living community)
by their features such as
 distance
 location types
 the visiting sets of users
 Group people who have similar
 preferences
 travel behaviors

 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
 Typical methods: k‐means, k‐medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
 Density‐based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
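As an example of the partitioning approach, the sketch below implements plain k-means (Lloyd's algorithm) on 2-D points. The initial centroids are passed in explicitly to keep the run deterministic; the sample points, function name, and iteration cap are assumptions for illustration.

```python
import math

def kmeans(points, initial_centroids, iterations=10):
    """Alternate assignment and centroid update until centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment

# Made-up locations forming two well-separated groups.
locations = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(locations, initial_centroids=[(0, 0), (10, 10)])
print(labels)
```

Each iteration never increases the sum of squared distances to the assigned centroids, which is the partitioning criterion mentioned above.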

 Clustering groups objects based on their similarity and has wide applications.
 The measure of similarity can be computed for various features of the data.
 In urban applications, partitioning methods, hierarchical methods, density‐based methods, grid‐based methods, and model‐based methods are often used.
 Outlier detection can be performed by clustering.
 Clustering is sometimes performed before other tasks.

 A recommender system suggests potentially favored items
to users

 POI recommendation
 Trip route Planning

 Content‐based Filtering
 Try to recommend items that are similar to those that a user liked in
the past
 Recommends items based on a comparison between the content of
the items and a user profile
 Build a model for each user that rates each item
 Collaborative Filtering
 Rely on the past user behaviors.
 Recommend the favored items of people who are ‘similar’ to you
 Top‐rated items or Top‐sellers
 non‐personalized
 Advanced recommendation considering
social/temporal/spatial factors
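The collaborative filtering idea above ("recommend the favored items of people who are similar to you") can be sketched as user-based CF with cosine similarity. The ratings, user names, and venue names are made up, and the scoring rule (similarity-weighted sum of ratings) is one simple choice among many.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(ratings, user, top_n=1):
    """Score unseen items by the similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Made-up venue ratings on a 1 to 5 scale.
ratings = {
    "alice": {"museum": 5, "night_market": 4},
    "bob": {"museum": 5, "night_market": 4, "temple": 5},
    "carol": {"stadium": 5, "night_market": 1},
}
print(recommend(ratings, "alice"))
```

Spatial or temporal factors could be folded in by changing the similarity function, for example by comparing only visits made in the same district or time slot.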

 CF‐based methods usually perform reasonably well on urban tasks
 Defining a customized user similarity is very important
 Hybrid methods usually work.
 Recommending a single item is usually not enough for real‐world applications.
 Personalized methods usually perform better than non‐personalized ones.
 Users have spatial/temporal preferences.


 Topic 1: Properties & Measures on Geo‐Social Data
 Topic 2: Geo‐Social Link Prediction
 Topic 3: Location and Route Recommendation
 Topic 4: Geo‐Social Influence Maximization
 Topic 5: Connecting Online & Offline Social Worlds

 GPS‐equipped mobile devices
 Location‐acquisition services
 Social Networking Sites

[Figure: location‐based social network keywords: Connect, Share, Deals, Venues, Anywhere, Anytime, Action, Interact, Check‐in]

 GPS Devices
 In‐car GPS
 Personal GPS logger
 Location‐based services + mobile phone
 Check‐in actions and records
 E.g. Facebook, Foursquare, Twitter
 Digital Camera
 Geo‐tagged photos
 E.g. Flickr, Instagram, Panoramio
Such user mobility records reveal how people travel around an area!

 Geographical Footprints
 A sequence/set of location data
points with
 Latitude‐longitude records
 Time stamps
 Represent the spatial‐temporal
human activities
ID | Timestamp | Location
“Peter” | 2010‐04‐02 13:12 | 37.5, ‐122.5
“Peter” | 2010‐04‐02 15:22 | 37.2, ‐123.5
… | … | …
[Figures: human movement, animal movement, taxi movement; from Zheng et al., 2016]

 Micro‐level
 Friend / Location / Event / Item recommendation
 Targeted marketing & computational advertising
 Macro‐level
 Urban Computing
 e.g. functional regions, diagnosing transportation problems
 Disaster Management
 e.g. resource distribution, rescue planning
 Environmental Informatics
 e.g. detect air and noise pollution

 Can user‐generated geographical footprints enable
new social network analysis tasks?
 Part 1: Properties & Measures on Geo‐Social Networks
 Part 2: Geo‐Social Link Prediction
 Part 4: Geo‐Social Influence Maximization
 How does social information benefit location‐based applications?
 Part 3: Location & Route Recommendation
 Part 5: Geo‐social media connect online & offline social worlds


 Friendship vs. Distance
 Friends are close to each other in geography
 People tend to visit nearby locations
 Power Law of check‐ins
 Temporal Periodic Patterns
 Spatio‐Temporal Multi‐Center Distribution

Distance between friends: friends tend to be much closer than random users. About 50% of social links span less than 100 km, while about 50% of users are more than 4,000 km apart.
Probability of friendship vs. distance: the probability is almost flat in the range 1‐10 km, and all curves then decrease as distance grows.
Users who live close to each other have a higher probability of creating friendship links.
[Scellato et al. ICWSM’11]

A few places have many
check‐ins while most of
places have few check‐ins
A user goes to a few places
many times, and to many
places a few times
[Gao et al. ICWSM’12]
@Foursquare

People commute to and from work at roughly the same time during the work week, as opposed to the weekend, when people's travel and schedules are less predictable.
[Figures: number of check‐ins over time, @Twitter]
[Cho et al. KDD’11] [Cheng et al. ICWSM’11]

Spatial: check‐ins center on certain location areas; a user rarely checks in at locations far away from the center (@Gowalla)
Temporal: the probability of visiting a regular location centers on certain time periods and decreases during other time periods (@Gowalla)
[Cho et al. KDD’11]


 Who is most likely to interact with a given individual in the future?
 Problem statement
 Given G[t0,t0’], a graph on edges up to t0’
 Output a ranked list L of links (not in G[t0,t0’]) that are predicted to appear in the future graph G[t0,t1’], t1’>t0’>t0
[Figure: friend suggestion in Facebook. Should Facebook suggest Alice as a friend for Bob?]
 Common Neighbors
 Preferential attachment
 Jaccard’s coefficient
 Adamic/Adar (AA)
 Katz score
 Hitting time
 Rooted PageRank
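The neighborhood-based measures in this list can be computed directly from adjacency sets. The sketch below covers common neighbors, Jaccard's coefficient, Adamic/Adar, and preferential attachment using their standard definitions; the toy graph and user names are made up, and the path-based measures (Katz, hitting time, rooted PageRank) are not shown.

```python
import math

def common_neighbors(g, x, y):
    return len(g[x] & g[y])

def jaccard(g, x, y):
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    # Rare shared neighbors count more: weight 1 / log(degree).
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y] if len(g[z]) > 1)

def preferential_attachment(g, x, y):
    return len(g[x]) * len(g[y])

# Made-up undirected friendship graph as adjacency sets.
graph = {
    "alice": {"carol", "dave"},
    "bob": {"carol", "dave", "erin"},
    "carol": {"alice", "bob"},
    "dave": {"alice", "bob"},
    "erin": {"bob"},
}

print(common_neighbors(graph, "alice", "bob"))
print(jaccard(graph, "alice", "bob"))
print(adamic_adar(graph, "alice", "bob"))
print(preferential_attachment(graph, "alice", "bob"))
```

To produce the ranked list L from the problem statement, each unlinked pair would be scored with one of these functions and the pairs sorted by score.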

David Liben‐Nowell et al. 2004

 Can we further improve the performance of link
prediction using the info from the geographical
activities of users?
 Check‐in history, venue information, user mobility
[Figure: a social network layer linked to users' geographical activities at locations, with unknown links marked "?"]
[Zhu, KDD15]

 The more popular a place visited by two users, the lower their average probability of being friends
 E.g. public places: tourist sites, airports, stations
 Popular = more check‐ins by people
 Not much difference when a place has less than 100
 Unpopular places (with fewer check‐ins) are likely to be of significant importance to the users who visit them
 E.g. private houses
[Scellato et al. KDD’11]

 Two factors should be considered simultaneously
 The number of check‐ins of a place
 The total number of check‐ins of a place for the user
 Place Entropy
Places with higher entropy result in fewer social links among their visitors than venues with lower values
Φ: the set of users who checked in at location k
[Scellato et al. KDD’11]
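The entropy formula itself did not survive on this slide. The sketch below assumes the usual Shannon form over each visitor's share of a place's check-ins, which combines the two factors listed above (the total check-ins of the place and each user's check-ins there). The function name and sample data are hypothetical.

```python
import math

def place_entropy(checkins_by_user):
    """Shannon entropy of the distribution of a place's check-ins over its visitors.

    Assumed form: -sum(q * log(q)), where q is one user's share of all
    check-ins at the place.
    """
    total = sum(checkins_by_user.values())
    entropy = 0.0
    for count in checkins_by_user.values():
        if count > 0:
            q = count / total
            entropy -= q * math.log(q)
    return entropy

# Made-up data: an airport visited once by many users, a home visited by few.
airport = {"u1": 1, "u2": 1, "u3": 1, "u4": 1}
home = {"u1": 9, "u2": 1}

print(place_entropy(airport))
print(place_entropy(home))
```

Under this assumption a public place with many occasional visitors gets high entropy, while a private place dominated by one or two users gets low entropy, matching the observation that low-entropy places are stronger evidence of friendship.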

Features of common places between two users:
 The number and the fraction of common places between two users
 Entropy of common places
 Check‐in number of common places
The feature definitions use: Φ, the set of users who checked in at location k; the vector of check‐ins of user i at all locations; and the total number of check‐ins at location k.
[Scellato et al. KDD’11]

 Mobile phone data
 Call Detail Records (CDR): each record holds the caller, the callee, the timestamp, and the location of the tower that routed the call
 They depict the daily routines of users (human mobility)
 Mobility features
 Capture the degree of closeness or similarity of mobility patterns between two users
 Two users whose trajectories overlap heavily are expected to have a higher likelihood of forming new links
[Wang et al. KDD’11]

 Distance: the distance of most likely locations
 Spatial Co‐Location Rate (SoLR): the probability that
users x and y visit at the same location
 Spatial Cosine Similarity (SCos): the cosine similarity of
user x and y’s trajectories
[Formula legend: the probability that user x visits location l; the most likely location of user x]
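The formulas for these features were lost with the slide image, so the sketch below is a reconstruction under stated assumptions: each user is summarized by visit probabilities per location (visit count divided by total visits), the co-location rate is taken as the sum over locations of the product of both users' visit probabilities, and the cosine similarity is computed over the same probability vectors. Function names and tower ids are hypothetical.

```python
import math

def visit_probabilities(visits):
    """Convert visit counts per location into probabilities."""
    total = sum(visits.values())
    return {loc: count / total for loc, count in visits.items()}

def most_likely_location(probs):
    return max(probs, key=probs.get)

def co_location_rate(px, py):
    """Assumed form: chance that x and y are at the same location, treating visits as independent."""
    return sum(p * py.get(loc, 0.0) for loc, p in px.items())

def spatial_cosine(px, py):
    norm_x = math.sqrt(sum(p * p for p in px.values()))
    norm_y = math.sqrt(sum(p * p for p in py.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return co_location_rate(px, py) / (norm_x * norm_y)

# Made-up visit counts per cell tower.
px = visit_probabilities({"tower1": 3, "tower2": 1})
py = visit_probabilities({"tower1": 1, "tower2": 1})

print(most_likely_location(px))
print(co_location_rate(px, py))
print(spatial_cosine(px, py))
```

The distance feature would then be the geographic distance between `most_likely_location(px)` and `most_likely_location(py)`, given tower coordinates.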

Correlations between mobility features and social features: Positive!
Mobility features achieve
comparable predictive
powers to conventional
social graph features.
Combining mobility and
social features yields a
sensible improvement.

 Geo‐Social Features @ Check‐in Data
 The number of check‐ins of common places
 Place Entropy
 Human Mobility Features @ Call Detail Records
 Distance
 Spatial Co‐Location Rate
 Spatial Cosine Similarity
 Jointly using social graph features, geo‐social features,
and human mobility features can improve the
performance of link prediction


 Users perform actions
 Post messages, pictures, videos
 Like!, comment, share, retweet, rate, buy, ...
 Users are connected with other users
 They interact and influence each other
 Actions propagate
[Figure: a post at 14:30 draws comments from a friend ("That’s cool!!", "Cannot agree more!") by 15:00, illustrating influence]

 Given a limited budget for initial advertising
 Identify a small set of influential customers (as seeds)
 such that convincing them to adopt the product eventually triggers a large cascade of influence
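A minimal sketch of this seed selection problem follows, assuming the Independent Cascade model with given edge probabilities and the standard greedy heuristic that repeatedly adds the node with the largest estimated spread. The graph, probabilities, and function names are made up; the demo uses probabilities of 1.0 so the outcome is deterministic, whereas real use would rely on Monte Carlo averaging over many runs.

```python
import random

def simulate_ic(graph, seeds, rng):
    """One Independent Cascade run: each newly active node gets one try per neighbor."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, {}).items():
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)

def expected_spread(graph, seeds, runs=200, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    return sum(simulate_ic(graph, seeds, rng) for _ in range(runs)) / runs

def greedy_seeds(graph, k, runs=200):
    """Pick k seeds, each time adding the node that gives the largest estimated spread."""
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    chosen = []
    for _ in range(k):
        candidates = sorted(nodes - set(chosen))
        best = max(candidates, key=lambda n: expected_spread(graph, chosen + [n], runs))
        chosen.append(best)
    return chosen

# Made-up directed graph: graph[u][v] is the probability that u activates v.
graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"d": 1.0},
    "c": {},
    "d": {},
    "e": {"f": 1.0},
    "f": {},
}
print(greedy_seeds(graph, k=2))
```

The budget corresponds to `k`; the greedy step is expensive because every candidate needs its own batch of simulations, which is what motivates faster heuristics.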

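[Figure: threshold‐based propagation example with inactive and active nodes (including v and w), node thresholds, and weighted edges from active neighbors (weights 0.1 to 0.6); the process stops when no more nodes can be activated]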

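[Figure: cascade propagation example with inactive, active, and newly active nodes (including v and w); edges show successful and unsuccessful activation attempts with probabilities 0.1 to 0.6; the process stops when no new node is activated]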

Social IM
 A social graph
 Budget k
 Propagation model
 Independent Cascade
 Linear Threshold
 Influence probability: given / learned
Geo‐Social IM adds:
 User’s location(s)
 Fixed locations
 Set of locations
 Spatial target
 Region
 Location
 Global
 Event
the probability of being in the region

 User’s location(s)
A fixed location
e.g. home
A set of locations
e.g. all her check‐ins
[Zhu, KDD15]

 Spatial target: the users of interest
 Global: all nodes
 Event: labels of interest (e.g., tennis, biking, skiing)
 Region: a geographical area
 Location: a specific place
[Zhu, KDD15]

Method | User’s Location | Spatial Target | Propagation Probability | Propagation Model
Loc‐IM [Li, SIGMOD14] | A fixed loc | Region | Given | IC
Loc Promotion [Zhu, KDD15] | A set of loc | Location | Learn | IC/LT
Attractive IM [Wu, PAKDD13] | A set of loc | Global | Learn | One‐wave
Reg IM [Bouros, CIKM14] | A set of loc | Region | Given | MIAwoT
Geo‐Soc Inf [Zhang, CIKM12] | A fixed loc | Event | Given | Random Walk

 Location‐aware word‐of‐mouth marketing
 Twitter aims to support campaigns (e.g., for restaurants) by locating potential customers in a spatial region (e.g., NYC, LA) to promote their business
 Given a few initial users
 Influence friends, friends of friends (FOF), and so on in the region
 Location‐based social network (LBSN): G=(V,E)
 Twitter, Foursquare
 Each user has a geographical location
 Query (R,k): a geographical region R
 VR: the set of users in R
[Li et al. SIGMOD’14]

 Given an LBSN G=(V,E) and a query Q=(R,k)
 Find a k‐node seed set S* such that, for any other k‐node set S, the influence spread of S* is at least the influence spread of S
[Figure: query region with the top‐5 seed set S = {14, 3, 16, 10, 8}; some seeds are not in the region]
 The top‐k seeds cannot be identified simply by using the vertices located in the query region
[Li et al. SIGMOD’14]
 Real‐world scenario
 Each POI would like to attract users to visit
 Via the check‐in records of users
 More users (friends of check‐in users) will be influenced and
then visit this POI
 Given one target location and the number of seeds,
can we find a set of seed nodes to maximize the
number of influenced users?
[Zhu et al. KDD’15]

 Via Conventional IM
 Given a graph with propagation probabilities on edges, a target
location, a budget k, find a k‐node seed set that maximizes the
influence spread
 Propagation probabilities vs. the target location
 Users should follow their friends to check‐in at the target location
specified
 Fundamental Assumption of Location Promotion
 Users will NOT visit the target location whose near‐by area is never
visited by them
 User Mobility determines propagation probabilities

The propagation probability is dynamic: it changes when the target location in an LBSN changes
Goal: learn the propagation probability based on user check‐ins

 Foursquare
 Users write tips for each venue
 Users are attracted by some venues via viewing tips
 Users add tips of interest to to‐do lists, and mark them done if they did visit the venues
 Location‐based advertising
 Enlarge the visibility and adoption of the locations via the
promotion of influential users in LBSN
 Questions
 What is the attractiveness for u by viewing v’s tip?
 Who are potentially influential users in LBSN?
[Wu et al. PAKDD’13]

 Attractiveness Model: compute influence prob.
 The likelihood that user ui is attracted by user uj’s tips
 Higher P(ui→uj), if
 (a) ui and uj have more mutually visited venues
 (b) uj’s tips are more popular
 One‐Wave Diffusion Model, similar to IC model but
 Measure the direct impact of the initially selected nodes on their
first degree neighbors
 Influence Maximization
 Extending the greedy algorithm of conventional IM
124
[Wu et al. PAKDD’13]
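
A one-wave model only counts the first-degree neighbors of the seeds, so the expected spread has a closed form and the greedy step needs no simulation. The sketch below assumes `prob[u][v]` holds the probability that user v is attracted by u's tips; that dictionary layout, the function names, and the numbers are hypothetical and not taken from the cited paper.

```python
def one_wave_spread(prob, seeds):
    """Expected number of first-degree neighbors attracted by at least
    one seed. prob[u][v] = probability that v is attracted by u's tips."""
    seeds = set(seeds)
    targets = {v for u in seeds for v in prob.get(u, {})} - seeds
    spread = 0.0
    for v in targets:
        miss = 1.0  # probability that no seed attracts v
        for u in seeds:
            miss *= 1.0 - prob.get(u, {}).get(v, 0.0)
        spread += 1.0 - miss
    return spread

def greedy_one_wave(prob, k):
    """Greedy seed selection: repeatedly add the user whose inclusion
    gives the largest one-wave spread."""
    candidates = sorted(prob)
    seeds = []
    for _ in range(min(k, len(candidates))):
        best = max(
            (c for c in candidates if c not in seeds),
            key=lambda c: one_wave_spread(prob, seeds + [c]),
        )
        seeds.append(best)
    return seeds

prob = {
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.5},
    "c": {"z": 0.9},
}
print(greedy_one_wave(prob, 2))
```

Because "b" overlaps with "a" on the same neighbor, the second pick goes to "c", which reaches a new user.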

 Regional Influence of user u inside region R
 Expected sum of localities of users influenced by u
 Locality = prob. of u checking in at some location inside R
 Propagation Model: IC‐based MIAwoT
 Maximum Influence Arborescence (MIA) without Threshold
 User ux influences uy only via maximum influence path π*xy
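 Problem: Given a region R, find the set S of k regional users with the highest regional influence
 $S \subseteq U_R : \forall u \in S$ and $\forall u' \in U_R \setminus S$, $I_R(u) \ge I_R(u')$
125
C(u): check‐in location of u
Φ(u): the set of users influenced by u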
[Bouros et al. CIKM’14]

 Different events exhibit various geographical and
social correlation among their participants
 Real Application: Promoting New Products
126
iPhone KFC golf
Socially connected &
Geographically close
Users are faraway both
socially & geographically
Socially close but
geographically faraway
When an iPhone user, u2, finds a new Apple product in her vicinity,
her post spreads easily and drives nearby Apple fans, u1 & u3,
to the same store.
When a KFC or golf user creates a related post, the others
may be less influenced, either because the info
cannot reach them or because the location is too far away.
[Zhang et al. CIKM’12]

Method | User’s Location | Spatial Target | Propagation Probability | Propagation Model | Influence Spread
Loc‐IM [Li, SIGMOD14] | A fixed loc | Region | Given | IC | In Region
Loc Promotion [Zhu, KDD15] | A set of loc | Location | Learn | IC/LT | Soft Region
Attractive IM [Wu, PAKDD13] | A set of loc | Global | Learn | One‐wave | Global
Reg IM [Bouros, CIKM14] | A set of loc | Region | Given | MIAwoT | Soft Region
Geo‐Soc Inf [Zhang, CIKM12] | A fixed loc | Event | Given | Random Walk | N/A
127


 Online Social Networks
 Represent the human interactions in the online virtual world
 E.g. Facebook, Twitter, LinkedIn
 Location‐based Social Networks
 E.g. Foursquare, Gowalla, Brightkite
 Offline Social Networks
 Represent the human interactions in the physical/real world
 Hard to collect
 Event‐based Social Networks
 Combining offline and online social networks
129

 Linking the online and offline social worlds
 Online virtual world: share thoughts and experiences
 Offline physical world: face‐to‐face interactions
 When and where, who and who did what together
 Informal get‐togethers. e.g. movie night, dining out
 Formal activities. e.g. conference, business meeting
 Comparing to SN & LBSN
 EBSN have stronger social ties & intents than SN
 Attend a physical activity together > being friend online
 Participating in a hiking event > talking about hiking online
 EBSN have explicit social interactions in the real world
 LBSN record only offline check‐ins, and suffer from sparsity
130

1. Properties of Event‐based Social World
2. Event Organization
3. Event Participant Recommendation
131

 RSVP “Répondez s'il vous plaît”: a request for a
response from the invited person or people
132
@LinkedIn
• More than 90% of events have
more than 10 RSVPs
• Only 15% of events have more
than 50 RSVPs
More than 70% of people
reposting attendance to an event
Heavy‐tailed distributions
[Gomez‐Rodriguez, CIKM’12]

 Strong locality of social events and interactions
 81.93% of events participated in by a user are within 10
miles of his/her home location
 84.61% of offline friends live within 10 miles to each other
133
[Liu et al. KDD’12]

134
@Meetup
Co‐Join Group Co‐Comment Online Message
Co‐Attended Events Co‐Attended Events Co‐Attended Events
More online interactions may not result in more offline interactions!!
Time Effect?
Offline interaction grows
in a log‐linear style!
Online interaction: co‐join
group drops exponentially,
co‐comment drops
linearly, online message
remains stable
[Yin et al., SDM’14]

135
Small events lead people to
make more connections
than large events!
Network degree increment in different
size and types of events!
Party
[Xu et al., CHI’13]
@Douban

 Assume you plan to hold a cocktail party, which
individuals should you invite to make the participants
enjoy the party as much as possible?
 Enjoy the party?
 Distance to party location
 Participants are familiar with each other
 The theme of the party
 User preference
 Number of participants
 Time available
136

Wow! I found a
good restaurant
with buy-2-get-2
free for lunch.
 Activity planning
 Attendees tend to be familiar with
each other for good atmosphere
 Attendees should be close to the target location
to minimize waiting time
• For advertising: find a group of friends for a venue to push coupon
137
Group size
Familiarity
Constraint
Activity
Location
Attendee’s
current locations
Selected Group
[Yang et al., KDD’12]

 Real Scenarios
 A company plans to hold promotion campaign in a region, and
aims to identify potential customers who are interested in the
features (keywords) of the product
 Interest‐based group gathering, e.g. “movie”, “NBA”
 Interests: categories and keywords of venues
138
[Li et al., DKE’14]

 Given a SIG query q = (T, k), where T is a set of keywords,
k is the size of group
 Find a k‐user group G maximizing the user‐interest score
and minimizing the inter‐user geo‐distance
139
[Li et al., DKE’14]
Du: the set of spatial objects checked in by user u
Dt: the set of spatial objects associated with tag t

 Imagine you have a ticket to a Bruno Mars concert, but
may give up attending if you cannot find a suitable
partner
 Could someone recommend potential partners for you
to invite and enjoy the show together?
140
[Tu et al. PAKDD’15]

 Given an activity item a, a target user ut, social network,
user preference, and past activity attendance
 Recommend/Predict a partner uc who would participate in
activity a together with ut
141
Social Closeness: common friends
Similar Interest
Also Like
Item User
Matrix Factorization or
Collaborative Filtering
[Tu et al. PAKDD’15]

 Connecting online and offline social worlds
1. Properties of Event‐based Social World
 E.g. Participating in closer events
 E.g. Smaller events create more social ties
2. Event Organization
 Single / multiple event organization
3. Event Participant Recommendation
 Diffusion / Learning approaches
142


 Application 1: Route Planning or Location recommendation
 Application 2: Urban Environment
 Application 3: Urban Business
 Application 4: Other Urban Issues
144


 This is my first time visiting Florence. What are the
best places to go?
 It’s dinner time now. Which one should I choose
among thousands of restaurants?
146

 Location, Point‐of‐Interest (POI), Venue
 A geographical point with specific function that users may find
useful or interesting
 E.g. restaurant, store, landmark
 Given
 A set of historical user‐generated location data
 Query/Requirement: depict the user needs
 Information about the desired places
 Recommend
 A set of locations/venues/POI
 A sequence of locations/venues/POI
 Satisfy the query requirements as much as possible
147
Location Rec.
Route Rec.

148
2015
03.14
2015
04.22
2015
04.30
2015
05.15
Time
?
2015
05.18
2015
05.19
? ? ?
Location
Rec.
Route
Rec.

 Location Recommendation
 Recommend NEW locations (never visited before)
 Location Prediction
 Predict the next location among EXISTING ones (visited before)
 General considered factors
 Current location info
 Current time
 User history/preference
 Social interaction
149
Route Planning can be viewed as
the successive applications of
location recommendation.

 People seek to discover new locations
 80%‐90% of visited places are new
 60%‐80% of check‐ins occur at new places
150
[Noulas et al., SocialCom’12]

 Popularity: rank locations using # of check‐ins
 Content Filtering: using venue type preference
 Social Filtering: rank locations using # of check‐ins
by friends
 Home Distance: geo‐distance from home
 K‐NN User Similarity (CF)
 Place Network (Item Similarity)
 Matrix Factorization
151

152
Users Places
1. RWR finds a balance between
collective check‐in behaviors (graph
structure) and personalized bias.
2. RWR can be applied to users with no
check‐ins. (cold start)
[Noulas et al., SocialCom’12]
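
Random walk with restart (RWR) on a graph of users and places can be sketched in a few lines. The version below is an assumption-laden illustration rather than the cited authors' implementation: it links users to the places they checked in at and to their friends, uses unit edge weights, and a restart probability of 0.15; all of these choices and all function names are hypothetical.

```python
def build_graph(checkins, friendships):
    """Undirected weighted graph over ("user", id) and ("place", id) nodes."""
    edges = {}

    def link(a, b, w):
        edges.setdefault(a, {})
        edges.setdefault(b, {})
        edges[a][b] = edges[a].get(b, 0.0) + w
        edges[b][a] = edges[b].get(a, 0.0) + w

    for user, place in checkins:
        link(("user", user), ("place", place), 1.0)
    for a, b in friendships:
        link(("user", a), ("user", b), 1.0)
    return edges

def random_walk_with_restart(edges, start, restart=0.15, iters=100):
    """Power iteration: at each step the walker jumps back to `start`
    with probability `restart`, otherwise follows a weighted edge."""
    score = {n: 1.0 if n == start else 0.0 for n in edges}
    for _ in range(iters):
        nxt = {n: (restart if n == start else 0.0) for n in edges}
        for u, nbrs in edges.items():
            out = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += (1.0 - restart) * score[u] * w / out
        score = nxt
    return score

def recommend(checkins, friendships, user, top_n=3):
    """Rank places the user has not visited by their RWR score."""
    edges = build_graph(checkins, friendships)
    start = ("user", user)
    if start not in edges:
        return []
    scores = random_walk_with_restart(edges, start)
    visited = {p for u, p in checkins if u == user}
    ranked = sorted(
        ((s, node[1]) for node, s in scores.items()
         if node[0] == "place" and node[1] not in visited),
        reverse=True,
    )
    return [place for _, place in ranked[:top_n]]

checkins = [("bob", "cafe"), ("bob", "park"),
            ("carol", "park"), ("dave", "museum")]
friendships = [("alice", "bob")]
print(recommend(checkins, friendships, "alice", top_n=2))
```

Here alice has no check-ins at all, yet she still receives a ranking through her friend, which is the cold-start property noted above.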

Query | Preferable Routes
1) A set of locations; 2) Time span of the route | A route passing through these locations within the time span
1) A source loc.; 2) A destination loc.; 3) A route length | A route starting from the source and arriving at the destination, with the length satisfied
1) A city or an area; 2) A set of labels of interests | A route in that area containing locations that possess those labels
153

 GPS Trajectory
 How to find meaningful and/or popular places?
 How to tackle efficiently million‐scale geo‐data points for
query processing?
 Uncertain Trajectory
 Do not detail the sequences of movement
 Raise uncertainty between consecutive points
154
Check‐in records Geo‐tagged Photos

 Location Query
 Required Locations: locations that must be passed through
 Visiting Order: order of required locations
 Geo‐Distance: geographical range or the tolerable
distance between locations
156

 Context Query
 Visiting/Stay Time: whether a location is visited at a proper time,
and how long to stay at a location
 Transit Time: the time for transiting between locations
 Travel Duration: the total traveling duration in the route
 Financial Cost: the budget of a route
 Top‐K retrieval: whether or not to return top‐k preferable routes
157
Example: visiting time tvis = 4pm, stay time tsta = 2hr, transit time ttran = 1hr, travel duration tdur = 10hr
Costs: c1 = 3 USD, c2 = 10 USD, c3 = 1 USD, c4 = 0 USD; ctotal = 14 USD < cbudget = 15 USD

 Social Query
 Popularity of locations
 User Preference:
whether or not to consider user’s past
visiting history
 Group or Social factor:
group trips or the locations that
friends had ever visited
 Activity Labels:
specifying the labels or types of
locations in the route
158
[Figure: example location popularity counts (5000, 20, 20000, 4000); group members A to D with their lists of desired locations; a route visiting a park, a restaurant, and a theater for Query = {theater, restaurant, park}]

159
Approaches
 Graph Search: [Chen’11], [Zheng’11], [Wei’12]
 Pattern Mining: [Tang’11], [Zheng’12], [Tang’13]
 Prediction/Inference: [Jeung’08], [Xue’13], [Hsieh’14]
Advanced Tasks
 Tackling Uncertainty: [Zheng’12], [Wei’12]
 Internal Routes: [Lu’11], [Tseng’12]
 Route with auxiliary info: [Cheng’11], [Cao’12]

Location Query Context Query Social Query
QL VO DI VT TT TD CO TK PO UP GS AL
GPS Trajectory Data
[Tang’13] ∎ ∎ ∎ ∎ ∎
[Chen’11] ∎ ∎ ∎
[Zheng’11] ∎ ∎ ∎ ∎ ∎
[Tang’11] ∎ ∎ ∎
[Jeung’08] ∎ ∎
[Xue’13] ∎ ∎
Uncertain
Trajectory Data
[Wei’12] ∎ ∎ ∎ ∎
[Hsieh’14] ∎ ∎ ∎ ∎ ∎ ∎
[Zheng’12] ∎ ∎ ∎ ∎ ∎
[Cao’12] ∎ ∎ ∎ ∎ ∎ ∎
[Lu’10] ∎ ∎ ∎ ∎ ∎ ∎
160

 Graph Construction G
 Design an objective function f(r) based on the query
 E.g. visiting/transition popularity, label cover
 With some constraints, e.g. travel time, financial cost
 Find a route/path r in G such that f(r) is optimized
161
Graph element | Trajectory Data | Road Net
Nodes | Locations | Road Segments
Edges | Traversal | Intersection
Node Weights | Popularity / Satisfaction / Traffic
Edge Weights | Transition Probability / Frequency

 Given a source and a destination location,
find the most popular route in between
 Construct a transfer network
 Node: traj intersection, Edge: contiguous traj
 Turning Probs on edges
 Extend Dijkstra’s algorithm to
find the route with the highest probability
162
There could be no trajectory
connecting two locations at all
Count the number of trajectories on
different paths connecting two locations
[Chen et al., ICDE’11]
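
Finding the route with the highest product of turning probabilities reduces to a shortest-path search once each probability p is turned into the cost -log p. The sketch below shows that reduction with a standard Dijkstra search; it is a simplification and not the full method of the cited paper, and the `transfer` dictionary, node names, and probabilities are made up for illustration.

```python
import heapq
import math

def most_popular_route(transfer, source, target):
    """transfer[u][v] = turning probability of moving from u to v.
    Returns (route_probability, path); (0.0, []) if target is unreachable."""
    best = {source: 0.0}
    heap = [(0.0, source, [source])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            # cost is the sum of -log p, so exp(-cost) is the product of p.
            return math.exp(-cost), path
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, p in transfer.get(node, {}).items():
            if p <= 0.0:
                continue
            new_cost = cost - math.log(p)
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return 0.0, []

transfer = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"D": 0.5},
    "C": {"D": 1.0},
}
prob, path = most_popular_route(transfer, "A", "D")
print(path, round(prob, 2))
```

Since every probability is at most 1, all costs are non-negative and Dijkstra's correctness condition holds.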

• Each trajectory = a sequence of geo‐points / locations
 Pattern Mining
 Mining the frequent subsequences
constrained by the query requirements
 Subsequence Pruning: keep closed ones (to save complexity)
 Subsequence Merge: from local route to global route
 Pattern Matching
 Find individuals with similar behaviors of movements
 Nearest‐neighbor query processing (given some locations)
163

 Discover the group of objects that move together (with
similar patterns of movements)
 E.g. migration path, driving direction, travel paths
 Recommend routes from your companion
 Clustering objects and apply sequential pattern mining
164
[Tang et al., TIST’13]
Size threshold = 4
Duration threshold
= 4 snapshots
{o1, o2, o3, o4} is
the traveling
companion

 Given an existing sub‐route, successively
predict/recommend the next locations
 Till the user requirement is satisfied
 E.g. route length k, travel time, arrival at the destination
 Select the next locations
 Unsupervised method
 Location info. E.g. popularity, density, incoming flow
 Estimate the probability P(candidateLoc | curSubRoute)
 Supervised method
 Choose a set of candidate locations
 Extract route/Location‐aware features
 Apply supervised learning methods e.g. SVM
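
The unsupervised variant can be sketched by estimating P(candidateLoc | curSubRoute) from transition counts in historical routes. As a simplifying assumption, the example conditions only on the last location of the sub-route (a first-order model); the function names and the toy routes are hypothetical.

```python
from collections import Counter, defaultdict

def fit_transitions(routes):
    """Count how often each location is followed directly by another."""
    counts = defaultdict(Counter)
    for route in routes:
        for cur, nxt in zip(route, route[1:]):
            counts[cur][nxt] += 1
    return counts

def next_location_probs(counts, current):
    """Estimate P(next | current) from the transition counts."""
    following = counts.get(current, {})
    total = sum(following.values())
    return {loc: c / total for loc, c in following.items()} if total else {}

def extend_route(counts, sub_route, k):
    """Successively append the most probable unvisited next location
    until the route has k locations or no candidate remains."""
    route = list(sub_route)
    while len(route) < k:
        probs = next_location_probs(counts, route[-1])
        candidates = {loc: p for loc, p in probs.items() if loc not in route}
        if not candidates:
            break
        route.append(max(sorted(candidates), key=candidates.get))
    return route

routes = [
    ["hotel", "museum", "park"],
    ["hotel", "museum", "cafe"],
    ["hotel", "park", "cafe"],
    ["museum", "park", "cafe"],
]
counts = fit_transitions(routes)
print(extend_route(counts, ["hotel"], k=4))
```

A supervised version would replace `next_location_probs` with a classifier score over extracted features, leaving the successive-extension loop unchanged.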
165

 Given a source and/or destination location, and the
current time, can we recommend a route, in which each
location can be visited with a pleasant experience
 Pleasant visiting of places should consider visiting time,
e.g.
 People usually visit the Empire State Building from about 12:00 to
midnight (the night view is popular)
 People tend to visit Madison Square Garden in the early
evening for a basketball game
 The proper time to visit Central Park is during the daytime
 Times Square is preferred from afternoon to midnight.
[Hsieh’14]
166

 Given a start location and time, recommend a route such
that each location can be visited with the best
experience
 Successively predicting the next location
 Based on current location and time
 Features, e.g.
 Geo‐distance
 Time difference
 Transition time
 Popularity
 Transfer probability
 Location category
167
[Hsieh et al., TIST’14]

 Sequences of check‐in data are uncertain, discrete,
and considered as low‐sampling rate routes
 Given the check‐in sequences, can we
estimate the original routes from the samples?
168
Simple Reference trajectory
[Zheng et al., ICDE’12]
Spliced Reference trajectory

 Real Scenarios
 “Want to have a one‐day trip in an unfamiliar city, Beijing. Any
route suggestion to visit famous places?”
 “I am going to visit the Forbidden City in Beijing for 3 hours.
What’s the best route within the palace?”
 Expected results
 One‐day trip in Beijing: 3 hours in Forbidden City → 2 hours in
Tian An Men Square → 2 hours in Qian Men.
169
Using Geo‐tagged Photos
[Lu et al., MM’10]
Merge Paths

 Consider a user who wants to take a one‐day trip in an
unfamiliar city. She might pose a query:
 Find the most popular route from my hotel such that it passes by
“shopping mall”, “restaurant”, and “museum”, and time spent on
the road is within 4hr
 Keyword‐aware Route Query
(a) Start and end locations (hotel)
(b) A set of keywords (shopping mall, restaurant, museum)
(c) Budget limit (within 4hr)
(d) A function f calculating the score of a route (popularity)
 Goals: Satisfying (a)(b)(c) and optimizing f
170
[Cao et al.,VLDB’12]

 Graph
 Route
 Objective score
 Budget score
 Feasible route
Example: each location is associated with a set of keywords
Objective score of an edge: o(v0, v1) = 4
Budget score of an edge: b(v0, v1) = 1
Route R: OS(R) = 2+1+1 = 4, BS(R) = 2+2+3 = 7
Query = <v0, v7, {t1, t2, t3}, 8>
Route R2: OS(R2) = 9, BS(R2) = 5
171
[Cao et al.,VLDB’12]
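
A brute-force sketch makes the definitions concrete: enumerate simple routes from the start to the end, keep those that cover all query keywords within the budget (the feasible routes), and return the one with the best objective score. The example assumes a lower objective score is better, as with a cost; that direction, the function name, and the toy graph are assumptions. Exhaustive search is only practical for tiny graphs, which is why approximation algorithms are needed in practice.

```python
def keyword_route(graph, keywords, start, end, required, budget):
    """graph[u][v] = (objective_score, budget_score) of edge u -> v.
    keywords[v] = set of keywords attached to location v.
    Returns (objective_score, path) of the best feasible simple route,
    or (inf, None) if no feasible route exists."""
    best = (float("inf"), None)

    def dfs(node, path, os_total, bs_total, covered):
        nonlocal best
        if bs_total > budget:
            return  # budget limit violated
        if node == end:
            if required <= covered and os_total < best[0]:
                best = (os_total, list(path))
            return
        for nxt, (o, b) in graph.get(node, {}).items():
            if nxt in path:
                continue  # simple routes only
            path.append(nxt)
            dfs(nxt, path, os_total + o, bs_total + b,
                covered | keywords.get(nxt, set()))
            path.pop()

    dfs(start, [start], 0, 0, set(keywords.get(start, set())))
    return best

graph = {
    "v0": {"v1": (4, 1), "v2": (1, 3)},
    "v1": {"v3": (1, 2)},
    "v2": {"v3": (1, 6)},
}
keywords = {"v1": {"t1"}, "v2": {"t1"}, "v3": {"t2"}}
print(keyword_route(graph, keywords, "v0", "v3", {"t1", "t2"}, budget=8))
```

With budget 8 the cheaper-looking route through v2 is infeasible (budget score 9), so the route through v1 is returned.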

 Goal & Results: Inferring transportation modes from raw GPS data
 Differentiate driving, riding a bike, taking a bus and walking
 Achieve a 0.75 inference accuracy (independent of other sensor data)
GPS
log
Infer model
172
[Zheng, et al. UbiComp’ 08]

 Commonsense knowledge from the real world
 Typically, people need to walk before transferring transportation modes
 Typically, people need to stop and then go when transferring modes
173

 Change point‐based Segmentation Algorithm
 Step 1: distinguish all possible Walk Points from non‐Walk Points.
 Step 2: merge short segments composed of consecutive Walk Points or non‐Walk
Points.
 Step 3: merge consecutive Uncertain Segments into a non‐Walk Segment.
 Step 4: the end points of each Walk Segment are potential change points.
174

 Features
Category | Feature | Significance
Basic | Dist | Distance of a segment
Basic | MaxVi | The ith maximal velocity of a segment
Basic | MaxAi | The ith maximal acceleration of a segment
Basic | AV | Average velocity of a segment
Basic | EV | Expectation of velocity of GPS points in a segment
Basic | DV | Variance of velocity of GPS points in a segment
Advanced | HCR | Heading Change Rate
Advanced | SR | Stop Rate
Advanced | VCR | Velocity Change Rate
175
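
The three advanced features can be sketched as counts of "interesting" GPS points per unit distance. The definitions below are a common reading of these features (points whose heading change, low speed, or relative speed change crosses a threshold, divided by segment distance); the threshold values, argument layout, and function names are assumptions chosen for illustration, not values from the slides.

```python
def heading_change(h1, h2):
    """Smallest absolute difference between two headings in degrees."""
    d = abs(h2 - h1) % 360.0
    return min(d, 360.0 - d)

def advanced_features(velocities, headings, distance,
                      heading_thresh=20.0, stop_speed=1.0, rate_thresh=0.5):
    """velocities (m/s) and headings (degrees) are per GPS point of one
    segment; distance is the segment length in meters."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    # HCR: points where the direction changes sharply.
    turns = sum(
        1 for a, b in zip(headings, headings[1:])
        if heading_change(a, b) > heading_thresh
    )
    # SR: points where the object is (almost) stopped.
    stops = sum(1 for v in velocities if v < stop_speed)
    # VCR: points where the relative velocity change is large.
    changes = sum(
        1 for a, b in zip(velocities, velocities[1:])
        if a > 0 and abs(b - a) / a > rate_thresh
    )
    return {
        "HCR": turns / distance,
        "SR": stops / distance,
        "VCR": changes / distance,
    }

features = advanced_features(
    velocities=[1.5, 0.5, 2.0, 2.1],
    headings=[0.0, 350.0, 90.0, 95.0],
    distance=100.0,
)
print(features)
```

Normalizing by distance is what lets these features separate, for example, walking (many turns and stops per meter) from driving.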

 Our features are more discriminative than velocity
 Heading Change Rate (HCR)
 Stop Rate (SR)
 Velocity change rate (VCR)
 >65 accuracy
Velocity
Velocity
Velocity
Distance
Distance
Distance
a) Driving
b) Bus
c) Walking
Vs
Vs
Vs
176

 Transition probability between different transportation modes
 P(Bike|Walk) and P(Bike|Driving)
Segment[i].P(Bike) = Segment[i].P(Bike) * P(Bike|Car)
Segment[i].P(Walk) = Segment[i].P(Walk) * P(Walk|Car)
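
The update rule above can be sketched as a small post-processing pass: each segment's mode probabilities are multiplied by the transition probability from the most likely mode of the previous segment. The renormalization step, the function name, and all numbers are assumptions added for the example.

```python
def smooth_modes(segments, transition):
    """segments: list of {mode: probability} dicts, one per segment.
    transition[prev][cur] = P(cur | prev) between consecutive segments."""
    result = [dict(segments[0])]
    for seg in segments[1:]:
        prev_mode = max(result[-1], key=result[-1].get)
        # Segment[i].P(m) = Segment[i].P(m) * P(m | prev_mode)
        adjusted = {
            m: p * transition[prev_mode].get(m, 0.0) for m, p in seg.items()
        }
        total = sum(adjusted.values())
        if total > 0:
            # Renormalize so the probabilities sum to 1 (an added step).
            result.append({m: p / total for m, p in adjusted.items()})
        else:
            result.append(dict(seg))
    return result

transition = {
    "Car": {"Car": 0.50, "Walk": 0.45, "Bike": 0.05},
    "Walk": {"Car": 0.30, "Walk": 0.40, "Bike": 0.30},
    "Bike": {"Car": 0.05, "Walk": 0.45, "Bike": 0.50},
}
segments = [
    {"Car": 0.80, "Walk": 0.10, "Bike": 0.10},
    {"Car": 0.20, "Walk": 0.35, "Bike": 0.45},
]
smoothed = smooth_modes(segments, transition)
print(max(smoothed[1], key=smoothed[1].get))
```

In this toy case the classifier alone would label the second segment as Bike, but a direct Car to Bike transfer is unlikely, so the adjusted label becomes Walk.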
177

Inferring Real‐Time and Fine‐Grained air quality
throughout a city using Big Data
Meteorology, Traffic, POIs, Road networks, Human Mobility
Historical air quality data Real‐time air quality reports
179
[Zheng & Hsieh et al. KDD’13]

180

 Partition a city into disjoint grids
 Extract features for each grid from its impacting region
 Meteorological features
 Traffic features
 Human mobility features
 POI features
 Road network features
 Co‐training‐based semi‐supervised learning model for each
pollutant
 Predict the AQI labels
 Data sparsity
 Two classifiers
181

 Philosophy of the model
 States of air quality
 Temporal dependency in a location
 Geo‐correlation between locations
 Generation of air pollutants
 Emission from a location
 Propagation among locations
 Two sets of features
 Spatially‐related
 Temporally‐related
Time
Geospace
Spatial Classifier
Temporal Classifier
Co‐Training
182

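[Figure: a spatio‐temporal graph over locations 1 to 5 in three time layers, T = t-2, T = t-1, and T = t, built from location data, weather, and the road network.
Temporal edges link the same location across layers, e.g. $w^{t,t-1}_{11}$, $w^{t,t-2}_{11}$, $w^{t-1,t-2}_{11}$ and $w^{t,t-1}_{33}$, $w^{t,t-2}_{33}$, $w^{t-1,t-2}_{33}$.
Spatial edges link different locations within a layer: $w^{t,t}_{12}$, $w^{t,t}_{13}$, $w^{t,t}_{14}$, $w^{t,t}_{23}$, $w^{t,t}_{25}$, $w^{t,t}_{34}$, $w^{t,t}_{35}$, $w^{t,t}_{45}$.
Weights are determined by the features (time & distance).]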
183
[Hsieh et al. KDD’15]
Inferring urban AQI (air quality index) of
arbitrary locations

Function | Feature | PM10 / PM2.5
linear | Temperature | ∎ ∎
linear | Humidity | ∎ ∎
linear | Wind speed | ∎ ∎
linear | Distance | ∎ ∎
linear | Road segment length | ∎ ∎
linear | Number of intersections | ∎
linear | Number of vehicle services | ∎
linear | Number of parks | ∎ ∎
linear | Number of hotels and real estates | ∎
linear | Number of factories | ∎ ∎
quartic | Pressure | ∎ ∎
logarithmic | Time | ∎ ∎
• Other investigated features turned out to be irrelevant to air quality, including highway length, POIC1, POIC4, POIC5, POIC6,
POIC7, POIC9, POIC10, POIC11.
184

 Cannot directly measure the improvement on
inference
 Minimize the uncertainty of a relatively accurate model
 Basic idea: the AQ distribution of a location should be
skewed (i.e., low entropy value)
 Search space is getting large when k increases.
 A greedy‐based method to find k locations that can maximize
their effect.
 From uncertainty to unpredictability
 Be independent with many other nodes
185
 
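[Equation: an entropy‐based objective $H(D_U)$ over the AQI distributions $D_i$ of the unlabeled locations $i = 1, \dots, n_u$]
[Figure: two probability distributions over AQI values]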
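
The link between a skewed distribution and low entropy can be made concrete with a short sketch. It computes the Shannon entropy of each location's inferred AQI distribution and, as a simple stand-in for the greedy method mentioned above, ranks the locations from most to least uncertain. The ranking heuristic, function names, and probabilities are assumptions for illustration only.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a {label: probability} distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def pick_uncertain_locations(inferred, k):
    """Return the k locations whose inferred AQI distribution is the
    least skewed, i.e. has the highest entropy."""
    ranked = sorted(inferred, key=lambda loc: entropy(inferred[loc]),
                    reverse=True)
    return ranked[:k]

inferred = {
    "grid_1": {"good": 0.9, "moderate": 0.1},   # skewed, low entropy
    "grid_2": {"good": 0.5, "moderate": 0.5},   # flat, high entropy
    "grid_3": {"good": 1.0},                    # certain, zero entropy
}
print(pick_uncertain_locations(inferred, k=2))
```

A fuller greedy procedure would re-run the inference after each pick, since adding a station at one location also changes the uncertainty of its neighbors.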


 Spatial granularity: for each air quality monitoring station
 Temporal granularity:
 For each hour in the first 6 coming hours
 A max‐min range for hours 7‐12, 13‐24, and 25‐48
186
[Zheng et al. KDD’15]

 Data sources
 The air quality of current time and the past few hours
 The meteorological data of current time and past few hours
 Humidity, temperature,..
 Sunny, foggy, overcast, cloudy…
 Light rain, moderate rain, heavy rain, rainstorm
 Wind direction, wind speed
 Weather forecast
187

188

 Urban noise pollution damages the mental
health (e.g. work efficiency and sleep quality)
2014.01.19
Urban Noise Pollution: Insidious Health Threat or Just 'City Livin'?
2014.12.01
2013.01.13 Quiet, please! The noise crisis isn’t just anecdotal.
• Mapping New York City
noise complaint
http://www.trulia.com/blog/trends/noise‐complaint‐maps/
• Visualizing noise data
cannot solve the problem.
Need to know what
makes the noises for any
place in the city!
189
[Hsieh et al. MM’15]

Noises can be reflected by multimodal Geo‐social
data that collectively sense urban human activities
190
[Hsieh et al. MM’15]

Aspect | Traditional Approach | Data Science Approach
Cost of Monitoring | Expensive purchase and deployment of physical sensors | Cheap Big Data‐driven machine learning techniques
Sensor Distribution | Severe sensor sparsity problem | Inference of target value anywhere and anytime
Forecasting Accuracy | Low for regions w/o sensors | High for regions w/o sensors
Usages of Sensors | Distinct pollutants need different types of sensors | The framework is general‐purpose (not only for a certain pollution), e.g. water quality, air quality and noise
Sensor Deployment | Determined by human knowledge | Deployed by optimizing the objective functions of environmental monitoring
192

 Can ML/DM techniques help locate the potential
customers for a certain retail chain in an urban area?
Where are the customers? When do they come?
#Customers
Week1 2 3 4 5 6 ……
#Customers
Month1 2 3 4 5 6 ……
193
[Hsieh et al. PKDD’15]

[Figure: potential customers over time (month) for locations A and B and their surrounding locations; two locations marked "?" are to be inferred]
194

[Figure: framework. LBSN data for locations in NYC (training and testing) → heterogeneous features (social, mobility, geographical) → feature‐aware location correlation graph construction → potential customer inference of the unknown (?) #pc values]
195

Related Work | Factors: Geography, Mobility, Social, Text | Problem Setting: Analysis, Prediction, Spatial, Spatial‐Temporal
Karamshuk’13    
Li’13    
Kisilevich’10   
Tiwari’14    
Fu’14    
Chen’14   
Liu’14   
Hsieh’15      
196

 Home abandonment
 Distribution of home abandonment in Mexico
197
[Ackermann et al. KDD’16]

 A peculiar law states that you do not need to
pay taxes if the building is unfinished
 Neighborhoods with a high abandonment rate also
have high crime rates.
198

 Faced with economic constraints, workers tend
to buy cheaper houses which are located at a
distance from city centers.
 At the time of purchase, workers are sometimes poorly informed
about the realities that will concern them.
Long distances to
their workplaces
and schools
Limited access
to services and
amenities
Limited
infrastructure
199

200
 Formulate the abandonment risk prediction problem
as a binary classification problem where the outcome
variable is whether a person abandons their house.
 Predict the risk of home abandonment, and provide purchase
advice.
 Provide policy recommendations to the government.
Predict whether the
person abandons this
house

 Loans data
 Approximation of the labeled data
 Housing survey data
 Business, school and hospital location data
 Municipality data
 Natural disaster incidences, Healthcare coverage , Number
of vehicles and passenger buses, etc.
201

 Model objectives: What is the risk of abandonment
for an existing loan in the next year?
 The algorithms tested were Support Vector Machines, Random
Forests, AdaBoost, and Logistic Regression.
202
 Prototype
 User selects a colonia
 User inputs personal
characteristics
 Municipality and location features
are retrieved from the database
 Application displays the prediction
to the user

 Market value
 The price an estate would trade in
the marketplace
203
 Investment value
 The growth potential of resale value
 Motivation to enter estate market
Predict future price!
Predict investment potential!
[Fu et al. TKDD’16]

204
Features of User Review
Features of Taxi
Trajectories
Features of Smart Card
Transactions
Features of Checkins
Estate Investment Value
Learn an estate ranking predictor:

205
 Overall Satisfaction
 Service Quality
 Environment Class
 Consumption Cost
 Functionality Planning

206
 Taxi Arriving Volume
 Taxi Leaving Volume
 Taxi Transition Volume
 Taxi Driving Velocity
 Taxi Commute Distance
 Bus Arriving Volume
 Bus Leaving Volume
 Bus Transition Volume
 Bus Stop Density
 Popularity of Checkin
 Topic Profile of Checkin
 Propagating word‐of‐mouth from poi
to neighborhood
 Textual profiling from words to topics

207
• Business reviews and check‐ins perform better than taxi and bus traces
• Check‐ins and reviews represent the attending phase
• Taxi and bus traces represent the moving phase
• Taxi features perform better than bus features in a falling market
• Taxi mobility represents white‐collar and business people
• Bus mobility represents the middle classes

 The crime data collected in Chicago has detailed
information about the time and location (i.e., latitude and
longitude) of crime and the types of crime.
 Crime rate is the crime count normalized by the population
in a region.
 Consider two types of features X for inference:
 Nodal feature
 Demographic information
 POI distribution
 Edge feature
 Geographical distance
 Hyperlink by taxi flow
208

209
Linear regression formulation of the problem:
To avoid negative predicted values,
a Poisson regression model is used to represent the problem:
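
In generic textbook notation, the two formulations could be written as follows. The symbols are assumptions and not necessarily those of the original slide: $y_i$ is the crime rate (or count) of region $i$, $\mathbf{x}_i$ collects its nodal and edge features, and $\mathbf{w}$ is the weight vector to be learned.

```latex
% Linear regression: the prediction can become negative.
y_i = \mathbf{w}^{\top} \mathbf{x}_i + \epsilon_i

% Poisson regression: the log link keeps the expected value positive.
y_i \sim \mathrm{Poisson}(\lambda_i), \qquad
\log \lambda_i = \mathbf{w}^{\top} \mathbf{x}_i
\;\Longleftrightarrow\;
\lambda_i = \exp\!\left(\mathbf{w}^{\top} \mathbf{x}_i\right) > 0
```

The exponential in the second form is what rules out negative predictions.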

 Identify significant factors
 Feature analysis by directly drawing the graph
 Correlation analysis
 Prediction (by regression or classification) using different feature
sets
 Some tips for mining ST Data
 Feature engineering is always important
 Handle users’ queries or preferences
 Model spatial and temporal dependency
 Urban ST Data is large‐scale and highly dynamic
 Need effective and efficient model
210


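 Introduction to Smart City, IoT and ST Big Data
 Topic 1: Current state of smart city and IoT development
 Topic 2: Characteristics of spatio‐temporal big data and common mining techniques
 Topic 3: Urban social media and spatio‐temporal trajectory applications
 Topic 4: When spatio‐temporal big data meets artificial intelligence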
212

 Micro‐level
 Recommending Users / Friends / Locations / Marketing Seeds /
Events / Participants for someone
 Macro‐level
 Urban Planning
 e.g. functional regions, diagnosing transportation problems
 Disaster Management
 e.g. resource distribution, rescue planning
 Disease & Public Health
 e.g. immunization, epidemiology
 Environmental Informatics
 e.g. detect air and noise pollution
 Resource Arrangement
 e.g. sensor or station deployment
213

We live in a complex world.
Spatial‐Temporal Data not only makes our lives
more convenient, but also provides a medium for
reasoning about problems spanning society,
technology, information, health, nature, and
humanity.
214

Thank You!

http://myweb.ncku.edu.tw/~hphsieh/
hphsieh@mail.ncku.edu.tw
215
