Geographically Weighted Regression (GWR) is a local form of spatial regression that captures spatial dependency in regression analysis. GWR has many practical applications as a visualization and prediction tool for spatial exploration (e.g., in climate, economics, and medicine). However, this local regression model becomes slow as the volume of computation and the size of the spatial data grow. Improving the performance of GWR is a critical issue, but distributed implementations of it have not been studied. Recently, with the advent of Spark and the MapReduce framework, developing machine learning applications and parallel programs has become easier. In this article, we propose several large-scale implementations of distributed GWR that leverage the Spark framework. We implemented and evaluated these approaches on large datasets. To the best of our knowledge, this is the first work addressing GWR at large scale.
4. Background
First Law of Geography (Waldo Tobler):
“Everything is related to everything else, but near things are more related than distant things.”
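GWR model
The model at location u takes the form
$$y_i(u) = \beta_{0i}(u) + \beta_{1i}(u)\,x_{1i} + \beta_{2i}(u)\,x_{2i} + \dots + \beta_{mi}(u)\,x_{mi}$$
and the local coefficients are estimated by weighted least squares:
$$\hat{\beta}(u) = \left(X^{T} W(u) X\right)^{-1} X^{T} W(u) Y$$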
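A minimal sketch of the local estimator at a single location u, written with NumPy rather than Spark. The Gaussian kernel, the helper names (`gaussian_weights`, `fit_local`), and the synthetic data are assumptions made for illustration; the slides do not state which kernel or data layout was used.

```python
import numpy as np

def gaussian_weights(coords, u, bandwidth):
    """Gaussian kernel weights of every training point relative to location u (assumed kernel)."""
    d = np.linalg.norm(coords - u, axis=1)
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def fit_local(X, y, coords, u, bandwidth):
    """Solve (X^T W(u) X) beta = X^T W(u) y for the local coefficients at u."""
    w = gaussian_weights(coords, u, bandwidth)
    Xd = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    XtW = Xd.T * w                               # equivalent to Xd.T @ diag(w)
    return np.linalg.solve(XtW @ Xd, XtW @ y)

# Synthetic, noise-free data: intercept 1 and slope 2 at every location.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(200, 2))
X = rng.normal(size=(200, 1))
y = 1.0 + 2.0 * X[:, 0]

beta = fit_local(X, y, coords, np.array([0.5, 0.5]), bandwidth=0.2)
print(beta)
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse that appears in the formula, which is usually the more stable choice.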
6. Problem
Estimating a local model
Bandwidth selection
Model evaluation
Choosing a kernel function
$$\hat{\beta}(u) = \left(X^{T} W(u) X\right)^{-1} X^{T} W(u) Y$$
Source: http://rose.bris.ac.uk
$O(n^3)$
Which bandwidth is good?
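To make the kernel and bandwidth questions concrete, here is a small sketch of two kernels commonly used in GWR, Gaussian and bisquare. Which kernel the authors chose is not stated in the slides, so both functions and the sample bandwidths are illustrative assumptions.

```python
import math

def gaussian_kernel(d, h):
    """Weight decays smoothly with distance d and never reaches zero."""
    return math.exp(-0.5 * (d / h) ** 2)

def bisquare_kernel(d, h):
    """Weight falls to exactly zero at distance h, so far points are ignored."""
    if d >= h:
        return 0.0
    return (1.0 - (d / h) ** 2) ** 2

# The same point at distance 1.0 gets very different weights as the bandwidth h changes.
for h in (0.5, 1.0, 2.0):
    print(h, gaussian_kernel(1.0, h), bisquare_kernel(1.0, h))
```

A small bandwidth makes each local model depend on only a few neighbors (low bias, high variance), while a large one approaches a single global regression, which is why the bandwidth is usually chosen by a criterion such as cross-validation.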
7. Problem
How to apply the model for large-scale data?
Data points
Features
Regression points
8. Large-Scale GWR on Spark
Why Spark?
In-memory cluster-computing platform
Supports parallel programming
Applications are developed with high-level APIs
Provides resilient distributed datasets and parallel operations
Integrates with other Spark components
9. Large-Scale GWR on Spark
We propose three approaches to scaling GWR
Scaling Weighted Linear Regression
Parallel Multiple WLR models
Parallel Geographically Weighted Regression (combines the first two approaches)
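As a rough illustration of the second approach (many WLR models fitted in parallel, one per regression point), the sketch below uses a local thread pool as a stand-in for Spark workers. The function `fit_wlr`, the Gaussian kernel, the bandwidth, and the synthetic data are all assumptions; the authors' Spark code is not shown in the slides.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fit_wlr(u, X, y, coords, h):
    """Fit one weighted linear regression centered on regression point u."""
    d = np.linalg.norm(coords - u, axis=1)
    w = np.exp(-0.5 * (d / h) ** 2)   # assumed Gaussian kernel
    XtW = X.T * w                     # equivalent to X.T @ diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

rng = np.random.default_rng(1)
n = 500
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
y = X @ np.array([0.5, -1.5])                          # noise-free response
reg_points = rng.uniform(0, 1, size=(8, 2))

# Each regression point is an independent task; the training data is shared by all tasks.
with ThreadPoolExecutor(max_workers=4) as pool:
    betas = np.array(list(pool.map(lambda u: fit_wlr(u, X, y, coords, 0.3), reg_points)))
print(betas.shape)
```

This layout only works while the full training set fits on each worker, which is the limitation a combined approach would need to address.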
10. Scalable GWR on Spark
Naïve approach – Scaling Weighted Linear Regression
For each regPoint:
Compute weights (in parallel)
Fit Weighted Linear Regression (WLR model computed in parallel)
Summarize model
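The naïve approach can be sketched as a sequential loop over regression points in which only the inner work is spread across partitions of the training data. The sketch relies on the identity X^T W X = Σ_i w_i x_i x_i^T, so each partition can contribute a partial sum. The chunks produced by `np.array_split` stand in for Spark partitions, and `partial_sums` is a hypothetical helper, not a function from the authors' code.

```python
import numpy as np

def partial_sums(X_part, y_part, coords_part, u, h):
    """One partition's contribution to X^T W(u) X and X^T W(u) y."""
    d = np.linalg.norm(coords_part - u, axis=1)
    w = np.exp(-0.5 * (d / h) ** 2)   # assumed Gaussian kernel
    XtW = X_part.T * w
    return XtW @ X_part, XtW @ y_part

rng = np.random.default_rng(2)
n = 1000
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([3.0, 0.7])          # noise-free response
reg_points = np.array([[0.2, 0.2], [0.8, 0.5]])

# Four chunks stand in for the partitions of a distributed training set.
parts = list(zip(np.array_split(X, 4), np.array_split(y, 4), np.array_split(coords, 4)))

models = []
for u in reg_points:                  # sequential loop over regression points
    # This list comprehension is the step a cluster would run in parallel.
    pieces = [partial_sums(Xp, yp, cp, u, 0.25) for Xp, yp, cp in parts]
    A = sum(p[0] for p in pieces)     # reduce: add the partial X^T W X
    b = sum(p[1] for p in pieces)     # reduce: add the partial X^T W y
    models.append(np.linalg.solve(A, b))
models = np.array(models)
print(models)
```

Because the outer loop is sequential, the cost grows linearly with the number of regression points, which is presumably why the slides label this approach naïve.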
14. Scalable GWR on Spark
Parallel Geographically Weighted Regression
[Diagram: regression dataset (R), training dataset (T), combined dataset (R, T pairs), distributed GWR computation]
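One plausible reading of the combine step is that every regression point is paired with every training partition, each pair emits partial sums keyed by the regression point, and the sums are then reduced by key. The sketch below follows that reading in plain Python and NumPy; it is an assumption about the design, not the authors' implementation, and `map_pair` is a hypothetical name.

```python
from collections import defaultdict
from itertools import product
import numpy as np

rng = np.random.default_rng(3)
n, p, h = 600, 2, 0.3
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([-1.0, 2.5])         # noise-free response
reg_points = rng.uniform(0, 1, size=(5, 2))

# Three chunks stand in for the partitions of the training dataset (T).
train_parts = list(zip(np.array_split(X, 3), np.array_split(y, 3), np.array_split(coords, 3)))

def map_pair(r_id, u, part):
    """Map step: partial normal-equation sums for one (regression point, partition) pair."""
    Xp, yp, cp = part
    w = np.exp(-0.5 * (np.linalg.norm(cp - u, axis=1) / h) ** 2)  # assumed Gaussian kernel
    XtW = Xp.T * w
    return r_id, (XtW @ Xp, XtW @ yp)

# Combine step: every regression point (R) meets every training partition (T).
emitted = [map_pair(r_id, u, part)
           for (r_id, u), part in product(enumerate(reg_points), train_parts)]

# Reduce by key: add the partial sums that belong to the same regression point.
acc = defaultdict(lambda: [np.zeros((p, p)), np.zeros(p)])
for r_id, (A, b) in emitted:
    acc[r_id][0] += A
    acc[r_id][1] += b

betas = np.array([np.linalg.solve(*acc[r]) for r in sorted(acc)])
print(betas)
```

In this layout both the regression points and the training data are distributed, so neither set has to fit on a single worker; only the small p-by-p matrices travel through the reduce step.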
18. Experiments
Testing large training dataset
[Figure: running time in seconds (axis 0 to 1200) versus number of training points (10,000; 100,000; 1,000,000; 2,000,000; 5,000,000) for Algorithms 1 to 4]
19. Experiments
Testing large regression dataset
[Figure: running time in seconds (axis 0 to 1200) versus number of regression points (1,000; 5,000; 10,000; 20,000; 50,000) for Algorithms 1 to 4]
20. Experiments
Testing large dataset with increasing number of features
[Figure: running time in seconds (axis 0 to 1800) for Algorithms 1 to 4; x-axis ticks 10, 20, 50, 100, 200, axis labeled "Number of regression points"]
22. Discussion
Related work
Many GWR libraries run on a single machine
Spgwr (multiR on GRID)
GPU-based implementations
Our work
First study of distributed GWR on Spark
Easy deployment, with the advantages of Spark
Scalable and works well on a cluster
23. Conclusion
We have
Proposed three approaches
Implemented four algorithms based on Spark
Evaluated our implementation
Future work
Improve performance by using Pipeline and Partitions
Release as an open-source library