This talk was first presented at the 2015 Strata+Hadoop World NYC (http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43484)
Deep learning algorithms have been widely used in many real-world applications, including computer vision, machine translation, and fraud detection. However, deep learning works best when the model is big and trained on large-scale datasets. Meanwhile, distributed computing platforms like Spark are designed to handle big data and are already used extensively. By making deep learning available on Spark, businesses can take full advantage of deep learning on their datasets using their existing Spark infrastructure.
In this talk, we present a scalable implementation of predictive deep learning algorithms on Spark, including feedforward neural networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). To the best of our knowledge, this is the first successful implementation of CNNs and RNNs on Spark. To support big model training, we use Tachyon as a common storage layer shared by the Spark workers. With its in-memory distributed execution model, Tachyon provides a scalable approach even when the model is too big to be handled on a single machine. Our solution also exploits graphics processing units (GPUs) for matrix computation whenever they are available on worker nodes, further reducing execution time.
Attendees will learn about deep learning models, the architecture of the system, and how to train and run deep learning models on Spark with Tachyon.
The Journey
1. What We Do At Adatao
2. Challenges We Ran Into
3. How We Addressed Them
4. Lessons to Share

Along the Way, You'll Hear
1. How some interesting things came about
2. Where some interesting things are going
3. How some good engineering/architectural decisions are made
MapReduce vs. Pregel at Google: An Analogy

"If you squint at a problem just a certain way, it becomes a MapReduce problem."
— Sanjay Ghemawat, Google
Moral: "And No Religion, Too."
Architectural choices are locally optimal.
1. What's best for someone else isn't necessarily best for you.
2. And vice versa.
Challenge 2: Who-Does-What Architecture
DistBelief (Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012) splits training into two steps:
1. Compute Gradients
2. Update Params (Descent)
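To make the who-does-what split concrete, here is a minimal sketch of the two roles using a toy linear model with squared loss. All names and types here are illustrative, not the DistBelief or Adatao implementation:

object ParamServerSketch {
  type Params = Array[Double]
  type Grad   = Array[Double]

  // Step 1 (workers): compute a gradient over one data shard.
  // Each shard row is (features, label); the model is a toy linear model.
  def computeGradient(shard: Seq[(Array[Double], Double)], w: Params): Grad = {
    val g = new Array[Double](w.length)
    for ((x, y) <- shard) {
      val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y // prediction error
      for (i <- g.indices) g(i) += err * x(i)
    }
    g
  }

  // Step 2 (parameter server): apply one descent update to the model.
  def descend(w: Params, g: Grad, lr: Double): Params =
    w.zip(g).map { case (wi, gi) => wi - lr * gi }
}

The key design point is that the two steps have different scaling needs: gradient computation is data-parallel and embarrassingly so, while the descent step serializes updates to shared model state.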
Spark & Tachyon Architectural Options
Model as Broadcast
Variable
Spark-Only
Model Stored
as
Tachyon File
Tachyon-
Storage
Model Hosted
in
HTTP Server
Param-
Server
Model Stored
and Updated
by Tachyon
Tachyon-
CoProc
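As a sketch of the Spark-Only option: the driver broadcasts the model, each partition computes a gradient, and treeReduce aggregates them back on the driver, which applies the descent step. The model and gradient math below are toy placeholders, not Adatao's implementation:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastModelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-model"))

    var weights = Array.fill(100)(0.0)                              // toy model
    val data = sc.parallelize(Seq.fill(1000)(Array.fill(100)(1.0))) // toy inputs

    for (epoch <- 1 to 10) {
      val bcModel = sc.broadcast(weights)           // 1. ship the model once per epoch
      val gradient = data.mapPartitions { rows =>   // 2. per-partition gradients
        val g = new Array[Double](bcModel.value.length)
        rows.foreach(x => for (i <- g.indices) g(i) += x(i)) // placeholder gradient
        Iterator.single(g)
      }.treeReduce { (a, b) =>                      // 3. aggregate on the driver
        for (i <- a.indices) a(i) += b(i); a
      }
      weights = weights.zip(gradient).map { case (w, g) => w - 0.01 * g } // 4. descent
      bcModel.unpersist()
    }
    sc.stop()
  }
}

The drawback this option exposes is the one the other options address: the full model must fit on the driver and be re-shipped every iteration.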
Tachyon-CoProcessors
- Spark workers do Gradients
  - Handle data-parallel partitions
  - Only compute gradients; freed up quickly
  - New workers can continue gradient compute where previous workers left off (mini-batch behavior)
- Use Tachyon for Descent (see the sketch after this list)
  - Model Host
  - Parameter Server
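A hypothetical sketch of this division of labor: the ParameterStore trait below stands in for the Tachyon coprocessor interface (it is not a real Tachyon API), and the gradient math is a placeholder:

trait ParameterStore {
  def fetch(): Array[Double]                // workers read the current model
  def pushGradient(g: Array[Double]): Unit  // descent is applied server-side
}

// Runs inside a Spark task: pull params, compute this partition's gradient,
// push it, and return quickly so the executor is freed for the next wave.
def gradientTask(rows: Iterator[Array[Double]], store: ParameterStore): Unit = {
  val w = store.fetch()
  val g = new Array[Double](w.length)
  rows.foreach(x => for (i <- g.indices) g(i) += x(i)) // placeholder gradient
  store.pushGradient(g)                                // Tachyon side does descent
}

Because each task pushes its gradient against whatever model version it fetched, successive waves of workers naturally behave like mini-batches.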
Lessons Learned: Tachyon CoProcessors
- Spark (Gradient) and Tachyon (Descent) can be scaled independently
- The combination gives natural mini-batch behavior
- Up to 60% speed gain; scales almost linearly and converges faster
Lessons Learned: Data Partitioning
- Tunable number of data partitions (see the sketch after this list)
  - Big partitions: slower convergence, shorter time per epoch
  - Small partitions: faster convergence, longer time per epoch (due to network communication)
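In Spark this knob is just the partition count of the training RDD; the counts below are illustrative only:

import org.apache.spark.rdd.RDD

// Fewer, bigger partitions mean less communication and shorter epochs but
// slower convergence; more, smaller partitions converge faster per epoch
// but pay extra network round-trips.
def withPartitionCount[T](trainingData: RDD[T], numPartitions: Int): RDD[T] =
  trainingData.repartition(numPartitions) // e.g. 32 (big) vs. 512 (small)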
Lessons Learned: Memory Tuning
- Typically each machine needs:
  - (model_size + batch_size * unit_count) * 3 * 4 * 1.5 * executors
- batch_size matters
- If RAM capacity is low, reduce the number of executors
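As a worked example, the sketch below evaluates the formula; reading its constants as 4 bytes per float, 3 resident copies (e.g. parameters, gradients, updates), and a 1.5x overhead margin is our interpretation, not stated in the slide:

// Per-machine memory estimate from the formula above.
def memoryBytesPerMachine(modelSize: Long, batchSize: Long,
                          unitCount: Long, executorsPerMachine: Int): Long =
  ((modelSize + batchSize * unitCount) * 3 * 4 * 1.5 * executorsPerMachine).toLong

// e.g. a 10M-parameter model, batch_size 128, 4096 units, 4 executors/machine
// needs ~0.76 GB; doubling batch_size to 256 adds ~38 MB, so batch_size matters.
val needed = memoryBytesPerMachine(10000000L, 128L, 4096L, 4)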
Lessons Learned: GPU vs. CPU
- GPU is 10x faster locally, 2-4x faster on Spark
- GPU memory is limited; AWS GPU instances commonly have 4-6 GB
- Better to have multiple GPUs per worker
- On the JVM with multi-process access, GPUs might fail randomly
Summary
- Tachyon is much more than a memory-based filesystem
  - Tachyon can become filesystem-backed shared memory
- The combination of Spark & Tachyon CoProcessing yields superior deep learning performance in multiple dimensions
- Adatao is open-sourcing both:
  - The Tachyon CoProcessor design & code
  - The Spark & Tachyon-CoProcessor deep learning implementation