3. Preferred Networks (PFN)
A startup that applies deep learning to industrial IoT
l Founded: March 2014
l Headquarters: Tokyo, Japan
l U.S. Subsidiary: San Mateo, California
l Company size: 35 engineers & researchers
l Investors: Toyota, FANUC, NTT
[Figure: deep learning × industrial IoT; target industries: manufacturing, automotive, healthcare]
4. Partnering with world-leading companies using Chainer
l R&D collaboration on industrial problems with real-world data
̶ Specific requirements, modified algorithms, many rounds of trial and error, etc.
̶ Different from building a general-purpose recognition system
[Partner logos: Toyota, FANUC, Panasonic, NTT, Cisco, NVIDIA]
5. Two types of background behind DL frameworks
1. Scalability-oriented
l Use-cases in mind
̶ Image/speech recognition systems
̶ Fast DL as a service in the cloud
l Problem type
̶ A few general applications
̶ 10+ million training samples
̶ 10+ node cluster with a fast network
l Possible bottleneck
̶ Tuning of well-known algorithms
̶ Distributed computation for model/data-parallel training
2. Flexibility-oriented
l Use-cases in mind
̶ Algorithm research
̶ R&D projects for new products
l Problem type
̶ Various specific applications
̶ 10+ k training samples
̶ 1 node with multiple GPUs
l Possible bottleneck
̶ Trial and error in prototyping
̶ Debugging, profiling & refactoring
̶ (wait time during compilation)
6. Designed for efficient research & development
l Flexible: new kinds of complex models for various applications
l Intuitive: rapid prototyping and efficient trial and error
l Powerful: comparable performance on a single node with multiple GPUs
[Figure: Chainer positioned between scalability-oriented and flexibility-oriented frameworks]
8. Neural network and computation
[Figure: feed-forward network with input units x1…xN, hidden units h1…hH and k1…kM, and output units y1…yM; forward computation maps inputs (text, image, sensor data) to outputs (object: tulip, anomaly score: 0.35, category: sports); backward computation (backpropagation) propagates gradients in the reverse direction]
9. Chainer focuses on network representation/training
l Design choices for deep learning frameworks
̶ How to build neural networks?
̶ How to train neural networks?
̶ Which text format/language for modeling?
̶ Which language for computing?
̶ Run with GPU?
̶ Run on multiple GPUs?
̶ Run on multiple compute nodes?
10. Building and training neural networks:
Computational graph construction is the key
1. Construct a computational graph
̶ Based on the network definition given by the user
̶ Chains of functions and operations on input variables
2. Compute loss and gradients
̶ Forward computation calculates the loss for a minibatch
̶ Backpropagation gives the gradients of all parameters
3. Optimize the model
̶ Update each parameter with its gradient
̶ Repeat until convergence
Step 1 is the most important, and there are many approaches to it (a minimal sketch of the full loop follows below).
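The following is a minimal sketch of these three steps in Chainer's Python API; it assumes the Classifier/MLP2 chains defined in the later code samples and NumPy arrays x_train/y_train holding the training data (both placeholders).

# Sketch of the three steps, reusing the Classifier/MLP2 chains shown later.
import numpy as np
from chainer import Variable, optimizers

model = Classifier(MLP2())                # step 1: the graph is defined by the forward code
optimizer = optimizers.SGD()
optimizer.setup(model)

x = Variable(np.asarray(x_train[:100], dtype=np.float32))  # minibatch of inputs
t = Variable(np.asarray(y_train[:100], dtype=np.int32))    # and labels
model.zerograds()
loss, acc = model(x, t)                   # step 2a: forward pass builds the graph and the loss
loss.backward()                           # step 2b: backpropagation fills .grad of every parameter
optimizer.update()                        # step 3: update each parameter with its gradient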
11. Building blocks
l These functionalities are very similar between frameworks
l But the structure, abstraction level, and interface are different
l It comes down to the design of a domain-specific language for NNs
[Figure: building blocks: array data structure (vector/matrix/tensor) → operations & functions → network (computational graph) → optimizer (SGD/AdaGrad/Adam)]
12. Types of domain-specific language for neural networks
l Text DSL
̶ Ex. Caffe (prototxt)
̶ Ex. CNTK (NDL)
l Symbolic program
̶ Operations on symbols
̶ Ex. Theano
̶ Ex. TensorFlow
l Imperative program
̶ Direct computations on raw data arrays
̶ Ex. Torch.nn
̶ Ex. Chainer
# Symbolic definition
A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
# Compile
f = compile(D)
d = f(A=np.ones(10),
      B=np.ones(10) * 2)

# Imperative declaration
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1

%% Definition in text
f: {
  "A": "Variable",
  "B": "Variable",
  "C": ["B", "*", "A"],
  "ret": ["C", "+", 1]
}
# Compile
f = compile("f.txt")
d = f(A=np.ones(10),
      B=np.ones(10) * 2)
Ex. MXNet (supports both the symbolic and the imperative style)
13. Comparison of DSL type
DSL type: Text DSL
  Pros: human-readable definition; non-programmers can easily edit the network
  Cons: users must study the format; the format might have to be extended for new algorithms
DSL type: Internal DSL (symbolic)
  Pros: static analysis at compile time; optimization before training; easy to parallelize
  Cons: users must study special syntax; may need more effort to implement new algorithms
DSL type: Internal DSL (imperative)
  Pros: less effort to learn the syntax; easy debugging and profiling; suitable for new algorithms with complex logic
  Cons: hard to optimize in advance; less efficient in memory allocation and parallelization
Chainer is at the extreme end of the imperative style, for high flexibility
15. Chainer as an open-source project
l https://github.com/pfnet/chainer
l 50 contributors
l 1,277 stars & 255 forks
l 3,708 commits
l Active development & releases for the last 10 months
̶ v1.0.0 (June 2015) to v1.7.2 (March 2016)
Original developer: Seiya Tokui
16. Chainer software stack
[Figure: Chainer runs on NumPy (CPU, via BLAS) and on CuPy (NVIDIA GPU, via CUDA and cuDNN)]
l Chainer is built on top of NumPy and CUDA
l CuPy is also introduced as a NumPy-equivalent library for the GPU
17. Graph build scheme (1/2) - Define-and-Run:
Most frameworks use this scheme (Chainer does not)
l Define: build a computational graph based on the network definition
l Run: update the model (parameters) using the training dataset
[Figure: Define phase: network definition → computational graph + gradient function (by auto differentiation) + parameters; Run phase: training data → loss & gradient → parameter update]
18. Graph build scheme (2/2) - Define-by-Run:
Computational graph construction on the fly
l No graph is constructed before training
l Instead, the graph is built during each forward computation
l The computational graph can be modified dynamically
for each iteration/sample, or depending on some conditions
[Figure: model definition + training data → computational graph + gradient function + parameters; the graph can change dynamically at each update depending on conditions]
19. Define-by-Run example: MLP for MNIST
l Only the transformations between units are set up before training
l The connections are given by the forward computation
l1 = Linear(784, n_units)
l2 = Linear(n_units, 10)

def forward(x):
    h1 = ReLU(l1(x))
    return l2(h1)

[Figure: x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y]
20. Define-by-Run:
An interpreted language for neural networks
l Idea
̶ The forward computation actually goes through the computational graph
̶ By remembering this history, the actual graph can be obtained
l Advantages
̶ Flexibility for new algorithms with complex components
u Ex. recurrent, recursive, attention, memory, adversarial, etc.
̶ Intuitive coding with a highly imperative network definition
u Ex. a stochastic network whose graph changes on every iteration (see the sketch below)
l Current drawbacks
̶ The graph is re-generated every time, even for fixed networks
̶ No optimization, even for the static parts of the graph
u JIT-like analysis and subgraph caching might be useful
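As a hedged illustration of such a dynamically changing graph (the chain, layer sizes, and the coin-flip condition below are illustrative, not from the slides):

# Sketch: a Define-by-Run forward pass whose graph differs on every call.
import numpy as np
from chainer import Chain
import chainer.functions as F
import chainer.links as L

class SometimesDeeper(Chain):
    def __init__(self):
        super(SometimesDeeper, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 100),
            l3=L.Linear(100, 10),
        )
    def __call__(self, x):
        h = F.relu(self.l1(x))
        if np.random.rand() < 0.5:   # the graph changes on each forward pass
            h = F.relu(self.l2(h))   # this node exists only sometimes
        return self.l3(h)            # backprop follows whatever graph was built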
21. Basic components (1/2): Variable and Function
l Variable
̶ Wraps arrays (.data)
̶ Remembers its parent function (.creator)
̶ Will be assigned a gradient (.grad)
̶ Keeps track of not only the data but also the computations
l Function
̶ A transformation between Variables
̶ Stateless
̶ e.g. sigmoid, tanh, ReLU, max pooling, dropout
[Figure: a Function maps input Variable x to output Variable y]
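A minimal sketch of these two components in code (the input values are placeholders):

# Sketch: Variable wraps an array and records the computation history.
import numpy as np
from chainer import Variable
import chainer.functions as F

x = Variable(np.array([[0.5, -1.0]], dtype=np.float32))
y = F.sigmoid(x)                 # a stateless Function applied to a Variable
print(y.data)                    # the wrapped array (.data)
print(y.creator)                 # the Function that produced y (.creator)
y.grad = np.ones_like(y.data)    # seed the gradient of a non-scalar output
y.backward()
print(x.grad)                    # the gradient assigned to x (.grad)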
22. Basic components (2/2): Link and Chain
l Link = function with state
̶ Parameters are also Variables, and gradients will be assigned to them
̶ e.g. Linear (fully-connected), LSTM, Convolution2D, word embedding
l Chain = network
̶ A Chain has a set of child Links
̶ The forward computation is defined in __call__()
̶ e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq
[Figure: Link (Linear): y = f(W*x + b) with parameters W, b; Chain (MLP2): x → Linear l1 → ReLU → h1 → Linear l2 → y]
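A small sketch of using a Link directly (the shapes are illustrative):

# Sketch: a Link holds parameters (as Variables) and is called like a function.
import numpy as np
import chainer.links as L

l = L.Linear(3, 2)                       # fully-connected Link with parameters W and b
x = np.ones((1, 3), dtype=np.float32)
y = l(x)                                 # forward pass through the Link
print(l.W.data.shape, l.b.data.shape)    # parameters are Variables with .data / .grad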
23. Backpropagation through computational graph
l Consider an objective built with Link.Linear: L = f(x * W + b)
l Forward computation computes the value of L and
simultaneously builds the following computational graph
l The gradient of L with respect to any variable
can then be computed by backpropagation
l The optimizer then updates the values of the parameters
[Figure: computational graph x, W → * → + (with b) → f → L; circles are Variables, boxes are Functions]
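A minimal sketch of this graph in code (the shapes and the choice of f, here a sum of squares, are illustrative):

# Sketch: build L = f(x*W + b) by running forward code, then backpropagate.
import numpy as np
from chainer import Variable
import chainer.functions as F

x = Variable(np.random.randn(1, 3).astype(np.float32))
W = Variable(np.random.randn(3, 2).astype(np.float32))
b = Variable(np.zeros((1, 2), dtype=np.float32))

h = F.matmul(x, W) + b        # forward pass builds the graph
L_out = F.sum(h * h)          # scalar objective L = f(x*W + b)
L_out.backward()              # backprop through the recorded graph
print(W.grad, b.grad)         # gradients w.r.t. any Variable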
24. Code sample (1/4): Multi-layer perceptron
from chainer import Chain, Variable, optimizers
import chainer.functions as F
import chainer.links as L

class MLP2(Chain):
    def __init__(self):
        super(MLP2, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 10),
        )
    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        y = self.l2(h1)
        return y

class Classifier(Chain):
    def __init__(self, predictor):
        super(Classifier, self).__init__(predictor=predictor)
    def __call__(self, x, t):
        y = self.predictor(x)
        self.accuracy = F.accuracy(y, t)
        self.loss = F.softmax_cross_entropy(y, t)
        return self.loss, self.accuracy

# Model and optimizer setup
model = Classifier(MLP2())
optimizer = optimizers.SGD()
optimizer.setup(model)

# Training loop with minibatches
for i in range(0, datasize, batchsize):
    x = Variable(x_tr[i:i+batchsize])
    t = Variable(y_tr[i:i+batchsize])
    model.zerograds()
    loss, acc = model(x, t)
    loss.backward()
    optimizer.update()
[Figure: Chain (MLP2): x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y]
25. Code sample (2/4): Convolutional neural network
class AlexNet(Chain):
    def __init__(self):
        super(AlexNet, self).__init__(
            conv1=L.Convolution2D(3, 96, 11, stride=4),
            conv2=L.Convolution2D(96, 256, 5, pad=2),
            conv3=L.Convolution2D(256, 384, 3, pad=1),
            conv4=L.Convolution2D(384, 384, 3, pad=1),
            conv5=L.Convolution2D(384, 256, 3, pad=1),
            fc6=L.Linear(9216, 4096),
            fc7=L.Linear(4096, 4096),
            fc8=L.Linear(4096, 1000),
        )
        self.train = True  # switches dropout on/off
    def __call__(self, x, t):
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv1(x))), 3, stride=2)
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv2(h))), 3, stride=2)
        h = F.relu(self.conv3(h))
        h = F.relu(self.conv4(h))
        h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
        h = F.dropout(F.relu(self.fc6(h)), train=self.train)
        h = F.dropout(F.relu(self.fc7(h)), train=self.train)
        y = self.fc8(h)
        return y
* ImageNet Classification with Deep Convolutional Neural Networks
http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
[Figure: AlexNet: five conv2d layers followed by three linear layers]
26. Code sample (3/4): Recurrent neural network
class SimpleRNN(Chain):
    def __init__(self, n_vocab, n_units):
        super(SimpleRNN, self).__init__(
            embed=L.EmbedID(n_vocab, n_units),
            x2h=L.Linear(n_units, n_units),
            h2h=L.Linear(n_units, n_units),
            h2y=L.Linear(n_units, n_vocab),)
        self.h = None
    def __call__(self, x):
        y, h_new = self.fwd_one_step(x, self.h)
        self.h = h_new
        return y
    def fwd_one_step(self, x, h):
        x = F.tanh(self.embed(x))
        if h is None:
            h = F.tanh(self.x2h(x))
        else:
            h = F.tanh(self.x2h(x) + self.h2h(h))
        y = F.softmax(self.h2y(h))
        return y, h
[Figure: unrolled RNN: input words x_1…x_4 → recurrent state h → outputs y_1…y_4; truncated BPTT length = 3]
# Truncated BPTT (length=3)
for i in range(0, datasize, batchsize):
    ...
    accum_loss += model(x, t)
    if i % bptt_length == 0:
        model.zerograds()
        accum_loss.backward()
        accum_loss.unchain_backward()  # cut the graph to limit the backprop length
        optimizer.update()
27. Code sample (4/4): Deep Networks with Stochastic Depth
A paper published on arXiv, March 30, 2016
l A variant of Residual Net that skips connections stochastically
̶ Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
̶ Stochastic skip: H_l = ReLU(b_l * f_l(H_{l-1}) + H_{l-1}), where b_l is drawn with survival probability p_l
Taken from http://arxiv.org/abs/1603.09382v2, G. Huang et al.
# Mock code in Chainer
class StochasticResNet(Chain):
    def __init__(self, prob, size, …):
        super(StochasticResNet, self).__init__(
            # Define f[i] the same way as for the Residual Net
            …)
        self.size = size
        self.p = prob  # Survival probabilities
    def __call__(self, h):
        for i in range(self.size):
            b = numpy.random.binomial(1, self.p[i])
            c = self.f[i](h) + h if b == 1 else h
            h = F.relu(c)
        return h
28. Miscellaneous
l Other features
̶ Install with pip in one line: $ pip install chainer
̶ Multi-GPU support by explicitly selecting the device ID to use
̶ Pre-trained Caffe model import from the Model Zoo
̶ Model serialization, save & load: HDF5 or NumPy npz
l Future directions (not only for Chainer)
̶ JIT-like optimization during Define-by-Run
̶ Reducing memory consumption (GPU memory is still small)
̶ Handling variable-length inputs without minibatching
̶ Maximizing performance in multi-node & multi-GPU environments
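A hedged sketch of the GPU-selection and serialization features listed above ('model' stands for any Chain, such as the earlier MLP2/Classifier; the file names are placeholders):

# Sketch: explicit GPU selection and model serialization in Chainer v1.
from chainer import cuda, serializers

cuda.get_device(0).use()                   # select which GPU to use by ID
model.to_gpu()                             # move the model's parameters to that GPU
serializers.save_hdf5('model.h5', model)   # save parameters to HDF5
serializers.load_hdf5('model.h5', model)   # ...and load them back
serializers.save_npz('model.npz', model)   # or use NumPy's npz format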
30. CuPy: (partially-)NumPy-compatible GPU library
l Motivation: NumPy + CUDA = CuPy
̶ NumPy is the standard library for numerical computation in Python
̶ CUDA is the standard API for high-performance computing on GPUs
̶ Unfortunately, NumPy does NOT work with CUDA
l CuPy supports:
̶ Fast computation using NVIDIA's cuBLAS and cuDNN
̶ Array indexing, slicing, transpose, and reshape
̶ Most of the operations/functions in NumPy
u Chainer v1.7.2 already supports more than 170 functions
̶ User-defined functions and kernels
̶ All dtypes, broadcasting, memory pool, etc.
31. How to use CuPy
l Usage of CuPy: just replace NumPy with CuPy
l Conversion between numpy.ndarray and cupy.ndarray
l Ex. a CPU/GPU-agnostic logsumexp function
def logsumexp(x, axis=None):
    xp = cuda.get_array_module(x)  # get CuPy or NumPy depending on the input
    x_max = x.max(axis)
    exp_sum = xp.exp(x - x_max).sum(axis)
    return x_max + xp.log(exp_sum)
import numpy, cupy
enable_cupy = True
xp = cupy if enable_cupy else numpy
w_c = cupy.asarray(numpy.ones(10)) # cupy.ndarray
w_n = cupy.asnumpy(cupy.ones(10)) # numpy.ndarray
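For instance, a small usage sketch of the agnostic function above (assuming the logsumexp definition and "from chainer import cuda"):

x_cpu = numpy.random.randn(1000).astype(numpy.float32)
x_gpu = cupy.asarray(x_cpu)          # copy the data to the GPU
print(logsumexp(x_cpu))              # computed with NumPy on the CPU
print(logsumexp(x_gpu))              # computed with CuPy on the GPU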
32. CuPy implementation:
Optimized for performance & NumPy-compatibility
l Uses Cython for cupy.core & cupy.cuda
l Dynamic code generation & compilation
̶ CUDA code is generated for the specific tensor dimensions & data types
̶ On-the-fly compilation by nvcc and a binary cache (faster after first use)
[Figure: software stack: cupy (tensor operations & functions) → cupy.core (ndarray; ufunc, elementwise, reduction) → cupy.cuda (CUDA Python wrapper) → CUDA libraries (cuBLAS, cuRAND, cuDNN)]
33. CuPy performance on linear algebra:
5 to 25 times faster than NumPy
import datetime
import numpy, cupy

def test(xp):
    a = xp.arange(1000000).reshape(1000, -1)
    return a.T * 2

test(numpy)   # warm-up
t1 = datetime.datetime.now()
for i in range(1000):
    test(numpy)
t2 = datetime.datetime.now()
print(t2 - t1)

test(cupy)    # warm-up (includes kernel compilation)
t1 = datetime.datetime.now()
for i in range(1000):
    test(cupy)
t2 = datetime.datetime.now()
print(t2 - t1)
                     msec    speed-up
NumPy                2,929   1.0
CuPy                   585   5.0
CuPy + memory pool     123   23.8
Measured on Intel Core i7-4790 @ 3.60GHz, 32GB RAM, GeForce GTX 970
34. Use CuPy for GPU-based computation
l Three patterns are supported as wrappers
̶ ElementwiseKernel: for element-wise computation
̶ ReductionKernel: for reduce operations along an axis
̶ ufunc: universal functions as in NumPy
l Ex. definition of an element-wise function (see below)
l Usage (automatic broadcasting and type checking are supported)
squared_diff = cupy.ElementwiseKernel(
    'float32 x, float32 y',   # Input
    'float32 z',              # Output
    'z = (x - y) * (x - y)',  # Operation
    'squared_diff')           # Name

squared_diff(cupy.arange(10), 10)
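For the ReductionKernel pattern, a hedged sketch following the standard CuPy l2norm example (its exact availability in the CuPy version bundled with Chainer v1.7.2 is assumed):

# Sketch: a user-defined reduction kernel computing an L2 norm.
l2norm = cupy.ReductionKernel(
    'float32 x',     # input params
    'float32 y',     # output params
    'x * x',         # map: applied to each element
    'a + b',         # reduce: how mapped values are combined
    'y = sqrt(a)',   # post-map: applied to the reduced value
    '0',             # identity value
    'l2norm')        # kernel name

l2norm(cupy.arange(10, dtype=cupy.float32))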
36. Public benchmark results (CNN):
Chainer shows comparable performance
l Forward computation time is almost the same as TensorFlow's
l Training with backward computation is slower, but this can be
offset by having no compilation time while debugging/tuning
[Charts: forward and backward computation time (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat, comparing Torch, TensorFlow, Chainer, and Caffe (native)]
Taken from https://github.com/soumith/convnet-benchmarks, using cuDNN except for Caffe
37. Chainer can benefit from the latest CUDA libraries:
Ex. Winograd algorithm in cuDNN v5
l 3x3 convolutions are common in CNNs and are now computed with the Winograd algorithm
l State-of-the-art CNN models (e.g., GoogLeNet, VGG-A)
can be accelerated by up to 2.0x at test time (forward only)
[Charts: forward and backward computation time (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat with cuDNN v4 vs. cuDNN v5]
Independently measured with a modified version of soumith/convnet-benchmarks
cuDNN v5 can be used from Chainer v1.8.0
38. Algorithm implementation in Chainer:
A Neural Algorithm of Artistic Style (Gatys et al., 2015)
l https://github.com/mattya/chainer-gogh
[Figure: content image (cat) + style image = new artistic image; main code is 45 lines]