Large-Scale Geographically Weighted Regression on Spark

Geographically Weighted Regression (GWR) is a local form of spatial regression that captures spatial dependency in regression analysis. GWR has many practical applications as a visualization and prediction tool for spatial exploration (e.g., in climate, economics, and medicine). However, this local regression model becomes slow as the volume of computation and the size of the spatial data grow. Improving the performance of GWR is a critical issue, but distributed implementations have not been studied. Recently, with the advent of Spark and the MapReduce framework, developing machine learning applications and parallel programs has become easier. In this article, we propose several large-scale implementations of distributed GWR that leverage the Spark framework. We implemented and evaluated these approaches on large datasets. To the best of our knowledge, this is the first work addressing GWR at large scale.

Hung Tien Tran, Hiep Tuan Nguyen, Viet-Trung Tran
Hanoi University of Science and Technology

Introduction
- What is Geographically Weighted Regression?
- What is our work?
Source: http://desktop.arcgis.com
GWR + Spark =
- Large-scale spatial data
- Improved performance
- Distributed

Outline
- Background
- Problem
- Scalable GWR on Spark
- Experiments
- Discussion
- Conclusion

Background
- First Law of Geography (Waldo Tobler):
  “Everything is related to everything else, but near things are more related than distant things.”
- GWR model:
  $y_i(u) = \beta_{0i}(u) + \beta_{1i}(u)\,x_{1i} + \beta_{2i}(u)\,x_{2i} + \dots + \beta_{mi}(u)\,x_{mi}$
- The estimator (weighted least squares) takes the form:
  $\hat{\beta}(u) = (X^{T} W(u) X)^{-1} X^{T} W(u) Y$
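
The estimator above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names `solve` and `local_beta` and the toy data are hypothetical, and the weights `w` stand for the diagonal of W(u) at a single regression point u.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def local_beta(X, y, w):
    """Return (X^T W X)^{-1} X^T W y, where W = diag(w)."""
    p = len(X[0])
    xtwx = [[sum(wi * row[a] * row[b] for row, wi in zip(X, w)) for b in range(p)]
            for a in range(p)]
    xtwy = [sum(wi * row[a] * yi for row, yi, wi in zip(X, y, w)) for a in range(p)]
    return solve(xtwx, xtwy)


# Toy data (hypothetical): y = 1 + 2x exactly; the first column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = [1.0, 0.6, 0.2, 0.05]  # assumed weights W(u) for one regression point u
beta = local_beta(X, y, w)
print(beta)  # approximately [1.0, 2.0], because the toy data is exactly linear
```

Because the toy data lies exactly on a line, any positive weights recover the same coefficients; with real data the weights change the fit from one regression point to the next, which is what makes the model local.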

Background
- Kernel function
  - Gaussian function
- Bandwidth
  - Fixed bandwidth
  - Adaptive bandwidth
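
The slide does not reproduce the kernel formula, so the sketch below assumes a commonly used Gaussian form, w = exp(-0.5 (d / h)^2), where d is the distance to the regression point and h is the bandwidth. The function names are hypothetical. A fixed bandwidth uses the same h everywhere; the adaptive variant shown here takes h as the distance to the k-th nearest data point, which is one common definition and may differ from the authors' choice.

```python
import math


def gaussian_weight(d, h):
    """Gaussian kernel (assumed form): weight decays smoothly with distance d."""
    return math.exp(-0.5 * (d / h) ** 2)


def fixed_weights(u, points, h):
    """Same bandwidth h for every regression point u."""
    return [gaussian_weight(math.dist(u, p), h) for p in points]


def adaptive_weights(u, points, k):
    """Bandwidth is the distance from u to its k-th nearest data point.

    Assumes that distance is non-zero (choose k so that it is).
    """
    dists = [math.dist(u, p) for p in points]
    h = sorted(dists)[k - 1]
    return [gaussian_weight(d, h) for d in dists]


# Hypothetical data points on a line and one regression point.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
u = (0.0, 0.0)
fixed = fixed_weights(u, points, h=1.0)
adaptive = adaptive_weights(u, points, k=3)
print(fixed)
print(adaptive)
```

In sparse regions the adaptive bandwidth grows, so each local model still sees roughly the same number of effective neighbours.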

Problem
- Estimating a local model
  - Choose a kernel function
  - $\hat{\beta}(u) = (X^{T} W(u) X)^{-1} X^{T} W(u) Y$
  - Complexity: O(n^3)
- Bandwidth selection
  - Which bandwidth is good?
- Evaluating the model
Source: http://rose.bris.ac.uk
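
The slides do not say which criterion is used to decide which bandwidth is good. As an illustration only, the sketch below uses leave-one-out cross-validation, one common choice for GWR: for each candidate bandwidth, every point is predicted from a local model fitted without it, and the bandwidth with the smallest squared error wins. The Gaussian kernel form, the function names, and the toy data are all assumptions.

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(X, y, w):
    """Weighted least squares: (X^T W X)^{-1} X^T W y with W = diag(w)."""
    p = len(X[0])
    xtwx = [[sum(wi * r[a] * r[b] for r, wi in zip(X, w)) for b in range(p)]
            for a in range(p)]
    xtwy = [sum(wi * r[a] * yi for r, yi, wi in zip(X, y, w)) for a in range(p)]
    return solve(xtwx, xtwy)


def cv_score(coords, X, y, h):
    """Leave-one-out squared prediction error of the local fits for bandwidth h."""
    total = 0.0
    for i, u in enumerate(coords):
        # Gaussian weights around point i; the point itself is left out (weight 0).
        w = [0.0 if j == i else math.exp(-0.5 * (math.dist(u, c) / h) ** 2)
             for j, c in enumerate(coords)]
        beta = fit(X, y, w)
        pred = sum(bk * xk for bk, xk in zip(beta, X[i]))
        total += (y[i] - pred) ** 2
    return total


def select_bandwidth(coords, X, y, candidates):
    """Return the candidate bandwidth with the lowest cross-validation score."""
    return min(candidates, key=lambda h: cv_score(coords, X, y, h))


# Hypothetical toy data: 8 points on a line, one feature plus intercept.
coords = [(float(i), 0.0) for i in range(8)]
X = [[1.0, float(i)] for i in range(8)]
y = [0.0, 0.8, 1.1, 0.9, 2.0, 3.2, 3.9, 5.1]
candidates = [1.0, 2.0, 4.0]
best = select_bandwidth(coords, X, y, candidates)
print(best)
```

Each score requires one local fit per data point, so bandwidth selection multiplies the cost of estimation by the number of candidates, which is part of why the problem becomes expensive at scale.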

Problem
- How to apply the model to large-scale data?
  - Data points
  - Features
  - Regression points

Large-Scale GWR on Spark
- Why Spark?
  - In-memory cluster-computing platform
  - Supports parallel programming
  - Applications are developed with high-level APIs
  - Provides resilient distributed datasets and parallel operations
  - Integrates with other components on Spark

Large-Scale GWR on Spark
- We propose three approaches to scaling GWR
  - Scaling Weighted Linear Regression
  - Parallel Multiple WLR models
  - Parallel Geographically Weighted Regression (combines the first two approaches)

Scalable GWR on Spark
- Naïve approach: scaling Weighted Linear Regression (WLR)
  - For each regression point:
    - Compute weights (in parallel)
    - Fit a weighted linear regression (WLR model computed in parallel)
  - Summarize the models
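
One plausible reading of the naïve approach is sketched below: the loop over regression points stays sequential, while each single fit is split into a map step over partitions of the training data and a reduce step that sums the partial results. This is a local stand-in in plain Python, not Spark code and not the authors' implementation. To keep it short it handles one feature plus an intercept, so the normal equations reduce to five weighted sums; the Gaussian kernel form and all names are assumptions.

```python
import math
from functools import reduce


def stats(partition, u, h):
    """Map step: weighted sufficient statistics of one partition for regression point u."""
    s = [0.0] * 5  # sums of w, w*x, w*y, w*x*x, w*x*y
    for coord, x, y in partition:
        w = math.exp(-0.5 * (math.dist(u, coord) / h) ** 2)  # assumed Gaussian kernel
        for k, v in enumerate((w, w * x, w * y, w * x * x, w * x * y)):
            s[k] += v
    return s


def merge(a, b):
    """Reduce step: element-wise sum of two partial results."""
    return [p + q for p, q in zip(a, b)]


def solve_wlr(s):
    """Closed-form weighted simple regression; returns (intercept, slope)."""
    sw, swx, swy, swxx, swxy = s
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    return (swy - slope * swx) / sw, slope


def naive_gwr(reg_points, partitions, h):
    """Sequential loop over regression points; each fit is a map/reduce over partitions."""
    return [solve_wlr(reduce(merge, (stats(p, u, h) for p in partitions)))
            for u in reg_points]


# Hypothetical training rows (coordinate, feature x, label y) with y = 1 + 2x, in two partitions.
partitions = [
    [((0.0, 0.0), 0.0, 1.0), ((1.0, 0.0), 1.0, 3.0)],
    [((2.0, 0.0), 2.0, 5.0), ((3.0, 0.0), 3.0, 7.0)],
]
models = naive_gwr([(0.5, 0.0), (2.5, 0.0)], partitions, h=2.0)
print(models)  # each model is approximately (1.0, 2.0) for this exactly linear toy data
```

The design limitation is visible in `naive_gwr`: every regression point triggers a full pass over the training data, so the number of distributed jobs grows with the number of regression points.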

Scalable GWR on Spark
- Parallel Multiple WLR models
[Diagram: regression dataset and training dataset; compute weights; compute multiple WLR models in parallel; summary]
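
The second approach can be read as parallelizing over regression points instead: every worker sees the whole training set and fits complete local models independently. The sketch below uses a thread pool as a local stand-in for Spark executors; it is an assumption-laden illustration, not the authors' code. As before, it is restricted to one feature plus an intercept, and the kernel form and names are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def fit_one(u, training, h):
    """Fit one weighted simple regression y = b0 + b1 * x around regression point u."""
    sw = swx = swy = swxx = swxy = 0.0
    for coord, x, y in training:
        w = math.exp(-0.5 * (math.dist(u, coord) / h) ** 2)  # assumed Gaussian kernel
        sw += w
        swx += w * x
        swy += w * y
        swxx += w * x * x
        swxy += w * x * y
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    return (swy - slope * swx) / sw, slope


def parallel_models(reg_points, training, h, workers=4):
    """Fit all local models concurrently; results keep the order of reg_points."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fit_one(u, training, h), reg_points))


# Hypothetical training rows (coordinate, feature x, label y) with y = 1 + 2x.
training = [((float(i), 0.0), float(i), 1.0 + 2.0 * i) for i in range(4)]
reg_points = [(0.5, 0.0), (1.5, 0.0), (2.5, 0.0)]
models = parallel_models(reg_points, training, h=2.0)
print(models)  # each model is approximately (1.0, 2.0)
```

This pattern only works while the training set fits in the memory of each worker, which is the trade-off against the naïve approach.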

Scalable GWR on Spark
- Parallel Geographically Weighted Regression
[Diagram: the regression dataset (R) and the training dataset (T) are combined into (R, T) pairs for distributed GWR computation]
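
The combined dataset of (R, T) pairs suggests a Cartesian product of regression points and training points, followed by an aggregation per regression point. The sketch below emulates that pattern locally with `itertools.product` and a dictionary keyed by regression point. How the authors actually combine and aggregate the datasets on Spark is not stated in the slides, so this is an assumption, as are the kernel form, the names, and the restriction to one feature plus an intercept.

```python
import math
from collections import defaultdict
from itertools import product


def pair_stats(u, row, h):
    """Contribution of one (regression point, training row) pair to the weighted sums."""
    coord, x, y = row
    w = math.exp(-0.5 * (math.dist(u, coord) / h) ** 2)  # assumed Gaussian kernel
    return (w, w * x, w * y, w * x * x, w * x * y)


def solve_wlr(s):
    """Closed-form weighted simple regression; returns (intercept, slope)."""
    sw, swx, swy, swxx, swxy = s
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    return (swy - slope * swx) / sw, slope


def cartesian_gwr(reg_points, training, h):
    """Combine every R with every T, sum the pair statistics per R, then solve each model."""
    acc = defaultdict(lambda: [0.0] * 5)
    for (rid, u), row in product(enumerate(reg_points), training):
        for k, v in enumerate(pair_stats(u, row, h)):
            acc[rid][k] += v  # aggregate by regression-point key
    return {rid: solve_wlr(s) for rid, s in acc.items()}


# Hypothetical training rows (coordinate, feature x, label y) with y = 1 + 2x.
training = [((float(i), 0.0), float(i), 1.0 + 2.0 * i) for i in range(4)]
reg_points = [(0.5, 0.0), (2.5, 0.0)]
models = cartesian_gwr(reg_points, training, h=2.0)
print(models)  # {0: approximately (1.0, 2.0), 1: approximately (1.0, 2.0)}
```

Because each pair contributes independently, both the regression points and the training points can be partitioned, which is how this pattern combines the first two approaches.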

Experiments
- Environment
  - Cluster: 8 nodes on Amazon Web Services
  - 4 cores, Intel Xeon E5-2670 v2, 2.5 GHz
  - 16 GB RAM, 2 x 40 GB SSD
  - Hadoop 2.7.2 and Spark 1.6.1
- Dataset
   |-- x: double (nullable = false)
   |-- y: double (nullable = false)
   |-- label: double (nullable = false)
   |-- features: vector (nullable = false)

Experiments
- Testing a large training dataset
[Figure: running time (sec) versus number of training points (10,000; 100,000; 1,000,000; 2,000,000; 5,000,000) for Algorithms 1 to 4]

Experiments
- Testing a large regression dataset
[Figure: running time (sec) versus number of regression points (1,000; 5,000; 10,000; 20,000; 50,000) for Algorithms 1 to 4]

Experiments
- Testing a large dataset with an increasing number of features
[Figure: running time (sec) for Algorithms 1 to 4 at 10, 20, 50, 100, and 200 on the horizontal axis; the slide labels that axis "Number of regression points", although the slide title indicates it is the number of features]

Experiments
- Cluster
[Figure: running time (sec) versus number of nodes (2, 4, 8) for Algorithms 1 to 4]

Discussion
- Related work
  - Many GWR libraries run on a single machine
  - spgwr (multiR on GRID)
  - GPU-based implementations
- Our work
  - First study of distributed GWR on Spark
  - Easy to deploy and inherits the advantages of Spark
  - Scalable and works well on a cluster

Conclusion
- We have
  - Proposed three approaches
  - Implemented four algorithms based on Spark
  - Evaluated our implementation
- Future work
  - Improve performance by using Pipeline and Partitions
  - Release as an open-source library

Editor's notes

Scalability, performance, user-friendly APIs
