3. Preferred Networks (PFN)
A startup that applies deep learning to industrial IoT
l Founded: March 2014
l Headquarters: Tokyo, Japan
l U.S. Subsidiary: San Mateo, California
l Company size: 35 engineers & researchers
l Investors: Toyota, FANUC, NTT
[Figure: deep learning × industrial IoT; target industries: manufacturing, automotive, healthcare]
4. Partnering with world-leading companies using Chainer
l R&D collaboration on industrial problems with real-world data
̶ Specific requirements, modified algorithms, many rounds of trial and error, etc.
̶ Different from building a general-purpose recognition system
[Partner logos: Toyota, FANUC, Panasonic, NTT, Cisco, NVIDIA]
5. Two types of background behind DL frameworks
1. Scalability-oriented
l Use-cases in mind
̶ Image/speech recognition systems
̶ Fast DL as a service in the cloud
l Problem type
̶ A few general applications
̶ 10+ million training samples
̶ 10+ node cluster with a fast network
l Possible bottleneck
̶ Tuning of well-known algorithms
̶ Distributed computation for model/data-parallel training
2. Flexibility-oriented
l Use-cases in mind
̶ Algorithm research
̶ R&D projects for new products
l Problem type
̶ Various specific applications
̶ 10+ k training samples
̶ 1 node with multiple GPUs
l Possible bottleneck
̶ Trial and error in prototyping
̶ Debugging, profiling & refactoring
̶ (wait time during compilation)
6. Designed for efficient research & development
l Flexible: new kinds of complex models for various applications
l Intuitive: rapid prototyping and efficient trial and error
l Powerful: comparable performance on a single node with multiple GPUs
[Figure: Chainer positioned between scalability-oriented and flexibility-oriented frameworks]
8. Neural network and computation
[Figure: feed-forward network with input units x1…xN, hidden units h1…hH and k1…kM, and output units y1…yM; forward computation maps inputs (text, image, sensor data) to outputs (object: tulip, anomaly score: 0.35, category: sports); backward computation (backpropagation) propagates gradients in the reverse direction]
9. Chainer focuses on network representation/training
l Design choices for deep learning frameworks
̶ How to build neural networks?
̶ How to train neural networks?
̶ Which text format/language for modeling?
̶ Which language for computing?
̶ Run with GPU?
̶ Run on multiple GPUs?
̶ Run on multiple compute nodes?
10. Building and training neural networks:
Computational graph construction is the key
1. Construct a computational graph
̶ Based on the network definition given by the user
̶ Chains of functions and operations on input variables
2. Compute loss and gradients
̶ Forward computation calculates the loss for a minibatch
̶ Backpropagation gives the gradients of all parameters
3. Optimize the model
̶ Update each parameter with its gradient
̶ Repeat until convergence
Step 1 is the most important, and there are many approaches to it (a minimal sketch of the full loop follows below).
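The following is a minimal sketch of these three steps in Chainer's Python API; it assumes the Classifier/MLP2 chains defined in the later code samples and NumPy arrays x_train/y_train holding the training data (both placeholders).

# Sketch of the three steps, reusing the Classifier/MLP2 chains shown later.
import numpy as np
from chainer import Variable, optimizers

model = Classifier(MLP2())                # step 1: the graph is defined by the forward code
optimizer = optimizers.SGD()
optimizer.setup(model)

x = Variable(np.asarray(x_train[:100], dtype=np.float32))  # minibatch of inputs
t = Variable(np.asarray(y_train[:100], dtype=np.int32))    # and labels
model.zerograds()
loss, acc = model(x, t)                   # step 2a: forward pass builds the graph and the loss
loss.backward()                           # step 2b: backpropagation fills .grad of every parameter
optimizer.update()                        # step 3: update each parameter with its gradient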
11. Building blocks
l These functionalities are very similar between frameworks
l But the structure, abstraction level, and interface are different
l It comes down to the design of a domain-specific language for NNs
[Figure: building blocks: array data structure (vector/matrix/tensor) → operations & functions → network (computational graph) → optimizer (SGD/AdaGrad/Adam)]
12. Types of domain-specific language for neural networks
l Text DSL
̶ Ex. Caffe (prototxt)
̶ Ex. CNTK (NDL)
l Symbolic program
̶ Operations on symbols
̶ Ex. Theano
̶ Ex. TensorFlow
l Imperative program
̶ Direct computations on raw data arrays
̶ Ex. Torch.nn
̶ Ex. Chainer
# Symbolic definition
A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
# Compile
f = compile(D)
d = f(A=np.ones(10),
      B=np.ones(10) * 2)

# Imperative declaration
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1

%% Definition in text
f: {
  "A": "Variable",
  "B": "Variable",
  "C": ["B", "*", "A"],
  "ret": ["C", "+", 1]
}
# Compile
f = compile("f.txt")
d = f(A=np.ones(10),
      B=np.ones(10) * 2)
Ex. MXNet (supports both the symbolic and the imperative style)
13. Comparison of DSL type
DSL type: Text DSL
  Pros: human-readable definition; non-programmers can easily edit the network
  Cons: users must study the format; the format might have to be extended for new algorithms
DSL type: Internal DSL (symbolic)
  Pros: static analysis at compile time; optimization before training; easy to parallelize
  Cons: users must study special syntax; may need more effort to implement new algorithms
DSL type: Internal DSL (imperative)
  Pros: less effort to learn the syntax; easy debugging and profiling; suitable for new algorithms with complex logic
  Cons: hard to optimize in advance; less efficient in memory allocation and parallelization
Chainer is at the extreme end of the imperative style, for high flexibility
15. Chainer as an open-source project
l https://github.com/pfnet/chainer
l 50 contributors
l 1,277 stars & 255 forks
l 3,708 commits
l Active development & releases for the last 10 months
̶ v1.0.0 (June 2015) to v1.7.2 (March 2016)
Original developer: Seiya Tokui
16. Chainer software stack
[Figure: Chainer runs on NumPy (CPU, via BLAS) and on CuPy (NVIDIA GPU, via CUDA and cuDNN)]
l Chainer is built on top of NumPy and CUDA
l CuPy is also introduced as a NumPy-equivalent library for the GPU
17. Graph build scheme (1/2) - Define-and-Run:
Most frameworks use this scheme (Chainer does not)
l Define: build a computational graph based on the network definition
l Run: update the model (parameters) using the training dataset
[Figure: Define phase: network definition → computational graph + gradient function (by auto differentiation) + parameters; Run phase: training data → loss & gradient → parameter update]
18. Graph build scheme (2/2) - Define-by-Run:
Computational graph construction on the fly
l No graph is constructed before training
l Instead, the graph is built during each forward computation
l The computational graph can be modified dynamically
for each iteration/sample, or depending on some conditions
[Figure: model definition + training data → computational graph + gradient function + parameters; the graph can change dynamically at each update depending on conditions]
19. Define-by-Run example: MLP for MNIST
l Only the transformations between units are set up before training
l The connections are given by the forward computation
l1 = Linear(784, n_units)
l2 = Linear(n_units, 10)

def forward(x):
    h1 = ReLU(l1(x))
    return l2(h1)

[Figure: x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y]
20. Define-by-Run:
An interpreted language for neural networks
l Idea
̶ The forward computation actually goes through the computational graph
̶ By remembering this history, the actual graph can be obtained
l Advantages
̶ Flexibility for new algorithms with complex components
u Ex. recurrent, recursive, attention, memory, adversarial, etc.
̶ Intuitive coding with a highly imperative network definition
u Ex. a stochastic network whose graph changes on every iteration (see the sketch below)
l Current drawbacks
̶ The graph is re-generated every time, even for fixed networks
̶ No optimization, even for the static parts of the graph
u JIT-like analysis and subgraph caching might be useful
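As a hedged illustration of such a dynamically changing graph (the chain, layer sizes, and the coin-flip condition below are illustrative, not from the slides):

# Sketch: a Define-by-Run forward pass whose graph differs on every call.
import numpy as np
from chainer import Chain
import chainer.functions as F
import chainer.links as L

class SometimesDeeper(Chain):
    def __init__(self):
        super(SometimesDeeper, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 100),
            l3=L.Linear(100, 10),
        )
    def __call__(self, x):
        h = F.relu(self.l1(x))
        if np.random.rand() < 0.5:   # the graph changes on each forward pass
            h = F.relu(self.l2(h))   # this node exists only sometimes
        return self.l3(h)            # backprop follows whatever graph was built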
21. Basic components (1/2): Variable and Function
l Variable
̶ Wraps arrays (.data)
̶ Remembers its parent function (.creator)
̶ Will be assigned a gradient (.grad)
̶ Keeps track of not only the data but also the computations
l Function
̶ A transformation between Variables
̶ Stateless
̶ e.g. sigmoid, tanh, ReLU, max pooling, dropout
[Figure: a Function maps input Variable x to output Variable y]
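A minimal sketch of these two components in code (the input values are placeholders):

# Sketch: Variable wraps an array and records the computation history.
import numpy as np
from chainer import Variable
import chainer.functions as F

x = Variable(np.array([[0.5, -1.0]], dtype=np.float32))
y = F.sigmoid(x)                 # a stateless Function applied to a Variable
print(y.data)                    # the wrapped array (.data)
print(y.creator)                 # the Function that produced y (.creator)
y.grad = np.ones_like(y.data)    # seed the gradient of a non-scalar output
y.backward()
print(x.grad)                    # the gradient assigned to x (.grad)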
22. Basic components (2/2): Link and Chain
l Link = function with state
̶ Parameters are also Variables, and gradients will be assigned to them
̶ e.g. Linear (fully-connected), LSTM, Convolution2D, word embedding
l Chain = network
̶ A Chain has a set of child Links
̶ The forward computation is defined in __call__()
̶ e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq
[Figure: Link (Linear): y = f(W*x + b) with parameters W, b; Chain (MLP2): x → Linear l1 → ReLU → h1 → Linear l2 → y]
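A small sketch of using a Link directly (the shapes are illustrative):

# Sketch: a Link holds parameters (as Variables) and is called like a function.
import numpy as np
import chainer.links as L

l = L.Linear(3, 2)                       # fully-connected Link with parameters W and b
x = np.ones((1, 3), dtype=np.float32)
y = l(x)                                 # forward pass through the Link
print(l.W.data.shape, l.b.data.shape)    # parameters are Variables with .data / .grad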
23. Backpropagation through computational graph
l Consider an objective built with Link.Linear: L = f(x * W + b)
l Forward computation computes the value of L and
simultaneously builds the following computational graph
l The gradient of L with respect to any variable
can then be computed by backpropagation
l The optimizer then updates the values of the parameters
[Figure: computational graph x, W → * → + (with b) → f → L; circles are Variables, boxes are Functions]
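A minimal sketch of this graph in code (the shapes and the choice of f, here a sum of squares, are illustrative):

# Sketch: build L = f(x*W + b) by running forward code, then backpropagate.
import numpy as np
from chainer import Variable
import chainer.functions as F

x = Variable(np.random.randn(1, 3).astype(np.float32))
W = Variable(np.random.randn(3, 2).astype(np.float32))
b = Variable(np.zeros((1, 2), dtype=np.float32))

h = F.matmul(x, W) + b        # forward pass builds the graph
L_out = F.sum(h * h)          # scalar objective L = f(x*W + b)
L_out.backward()              # backprop through the recorded graph
print(W.grad, b.grad)         # gradients w.r.t. any Variable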
24. Code sample (1/4): Multi-layer perceptron
from chainer import Chain, Variable, optimizers
import chainer.functions as F
import chainer.links as L

class MLP2(Chain):
    def __init__(self):
        super(MLP2, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 10),
        )
    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        y = self.l2(h1)
        return y

class Classifier(Chain):
    def __init__(self, predictor):
        super(Classifier, self).__init__(predictor=predictor)
    def __call__(self, x, t):
        y = self.predictor(x)
        self.accuracy = F.accuracy(y, t)
        self.loss = F.softmax_cross_entropy(y, t)
        return self.loss, self.accuracy

# Model and optimizer setup
model = Classifier(MLP2())
optimizer = optimizers.SGD()
optimizer.setup(model)

# Training loop with minibatches
for i in range(0, datasize, batchsize):
    x = Variable(x_tr[i:i+batchsize])
    t = Variable(y_tr[i:i+batchsize])
    model.zerograds()
    loss, acc = model(x, t)
    loss.backward()
    optimizer.update()
[Figure: Chain (MLP2): x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y]
25. Code sample (2/4): Convolutional neural network
class AlexNet(Chain):
    def __init__(self):
        super(AlexNet, self).__init__(
            conv1=L.Convolution2D(3, 96, 11, stride=4),
            conv2=L.Convolution2D(96, 256, 5, pad=2),
            conv3=L.Convolution2D(256, 384, 3, pad=1),
            conv4=L.Convolution2D(384, 384, 3, pad=1),
            conv5=L.Convolution2D(384, 256, 3, pad=1),
            fc6=L.Linear(9216, 4096),
            fc7=L.Linear(4096, 4096),
            fc8=L.Linear(4096, 1000),
        )
        self.train = True  # switches dropout on/off
    def __call__(self, x, t):
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv1(x))), 3, stride=2)
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv2(h))), 3, stride=2)
        h = F.relu(self.conv3(h))
        h = F.relu(self.conv4(h))
        h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
        h = F.dropout(F.relu(self.fc6(h)), train=self.train)
        h = F.dropout(F.relu(self.fc7(h)), train=self.train)
        y = self.fc8(h)
        return y
* ImageNet Classification with Deep Convolutional Neural Networks
http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
[Figure: AlexNet: five conv2d layers followed by three linear layers]
26. Code sample (3/4): Recurrent neural network
class SimpleRNN(Chain):
    def __init__(self, n_vocab, n_units):
        super(SimpleRNN, self).__init__(
            embed=L.EmbedID(n_vocab, n_units),
            x2h=L.Linear(n_units, n_units),
            h2h=L.Linear(n_units, n_units),
            h2y=L.Linear(n_units, n_vocab),)
        self.h = None
    def __call__(self, x):
        y, h_new = self.fwd_one_step(x, self.h)
        self.h = h_new
        return y
    def fwd_one_step(self, x, h):
        x = F.tanh(self.embed(x))
        if h is None:
            h = F.tanh(self.x2h(x))
        else:
            h = F.tanh(self.x2h(x) + self.h2h(h))
        y = F.softmax(self.h2y(h))
        return y, h
[Figure: unrolled RNN: input words x_1…x_4 → recurrent state h → outputs y_1…y_4; truncated BPTT length = 3]
# Truncated BPTT (length=3)
for i in range(0, datasize, batchsize):
    ...
    accum_loss += model(x, t)
    if i % bptt_length == 0:
        model.zerograds()
        accum_loss.backward()
        accum_loss.unchain_backward()  # cut the graph to limit the backprop length
        optimizer.update()
27. Code sample (4/4): Deep Networks with Stochastic Depth
A paper published on arXiv, March 30, 2016
l A variant of Residual Net that skips connections stochastically
̶ Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
̶ Stochastic skip: H_l = ReLU(b_l * f_l(H_{l-1}) + H_{l-1}), where b_l is drawn with survival probability p_l
Taken from http://arxiv.org/abs/1603.09382v2, G. Huang et al.
# Mock code in Chainer
class StochasticResNet(Chain):
    def __init__(self, prob, size, …):
        super(StochasticResNet, self).__init__(
            # Define f[i] the same way as for the Residual Net
            …)
        self.size = size
        self.p = prob  # Survival probabilities
    def __call__(self, h):
        for i in range(self.size):
            b = numpy.random.binomial(1, self.p[i])
            c = self.f[i](h) + h if b == 1 else h
            h = F.relu(c)
        return h
28. Miscellaneous
l Other features
̶ Install with pip in one line: $ pip install chainer
̶ Multi-GPU support by explicitly selecting the device ID to use
̶ Pre-trained Caffe model import from the Model Zoo
̶ Model serialization, save & load: HDF5 or NumPy npz
l Future directions (not only for Chainer)
̶ JIT-like optimization during Define-by-Run
̶ Reducing memory consumption (GPU memory is still small)
̶ Handling variable-length inputs without minibatching
̶ Maximizing performance in multi-node & multi-GPU environments
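A hedged sketch of the GPU-selection and serialization features listed above ('model' stands for any Chain, such as the earlier MLP2/Classifier; the file names are placeholders):

# Sketch: explicit GPU selection and model serialization in Chainer v1.
from chainer import cuda, serializers

cuda.get_device(0).use()                   # select which GPU to use by ID
model.to_gpu()                             # move the model's parameters to that GPU
serializers.save_hdf5('model.h5', model)   # save parameters to HDF5
serializers.load_hdf5('model.h5', model)   # ...and load them back
serializers.save_npz('model.npz', model)   # or use NumPy's npz format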
30. CuPy: (partially-)NumPy-compatible GPU library
l Motivation: NumPy + CUDA = CuPy
̶ NumPy is the standard library for numerical computation in Python
̶ CUDA is the standard API for high-performance computing on GPUs
̶ Unfortunately, NumPy does NOT work with CUDA
l CuPy supports:
̶ Fast computation using NVIDIA's cuBLAS and cuDNN
̶ Array indexing, slicing, transpose, and reshape
̶ Most of the operations/functions in NumPy
u Chainer v1.7.2 already supports more than 170 functions
̶ User-defined functions and kernels
̶ All dtypes, broadcasting, memory pool, etc.
31. How to use CuPy
l Usage of CuPy: just replace NumPy with CuPy
l Conversion between numpy.ndarray and cupy.ndarray
l Ex. a CPU/GPU-agnostic logsumexp function
def logsumexp(x, axis=None):
    xp = cuda.get_array_module(x)  # get CuPy or NumPy depending on the input
    x_max = x.max(axis)
    exp_sum = xp.exp(x - x_max).sum(axis)
    return x_max + xp.log(exp_sum)
import numpy, cupy
enable_cupy = True
xp = cupy if enable_cupy else numpy
w_c = cupy.asarray(numpy.ones(10)) # cupy.ndarray
w_n = cupy.asnumpy(cupy.ones(10)) # numpy.ndarray
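For instance, a small usage sketch of the agnostic function above (assuming the logsumexp definition and "from chainer import cuda"):

x_cpu = numpy.random.randn(1000).astype(numpy.float32)
x_gpu = cupy.asarray(x_cpu)          # copy the data to the GPU
print(logsumexp(x_cpu))              # computed with NumPy on the CPU
print(logsumexp(x_gpu))              # computed with CuPy on the GPU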
32. CuPy implementation:
Optimized for performance & NumPy-compatibility
l Uses Cython for cupy.core & cupy.cuda
l Dynamic code generation & compilation
̶ CUDA code is generated for the specific tensor dimensions & data types
̶ On-the-fly compilation by nvcc and a binary cache (faster after first use)
[Figure: software stack: cupy (tensor operations & functions) → cupy.core (ndarray; ufunc, elementwise, reduction) → cupy.cuda (CUDA Python wrapper) → CUDA libraries (cuBLAS, cuRAND, cuDNN)]
33. CuPy performance on linear algebra:
5 to 25 times faster than NumPy
import datetime
import numpy, cupy

def test(xp):
    a = xp.arange(1000000).reshape(1000, -1)
    return a.T * 2

test(numpy)   # warm-up
t1 = datetime.datetime.now()
for i in range(1000):
    test(numpy)
t2 = datetime.datetime.now()
print(t2 - t1)

test(cupy)    # warm-up (includes kernel compilation)
t1 = datetime.datetime.now()
for i in range(1000):
    test(cupy)
t2 = datetime.datetime.now()
print(t2 - t1)
                     msec    speed-up
NumPy                2,929   1.0
CuPy                   585   5.0
CuPy + memory pool     123   23.8
Measured on Intel Core i7-4790 @ 3.60GHz, 32GB RAM, GeForce GTX 970
34. Use CuPy for GPU-based computation
l Three patterns are supported as wrappers
̶ ElementwiseKernel: for element-wise computation
̶ ReductionKernel: for reduce operations along an axis
̶ ufunc: universal functions as in NumPy
l Ex. definition of an element-wise function (see below)
l Usage (automatic broadcasting and type checking are supported)
squared_diff = cupy.ElementwiseKernel(
    'float32 x, float32 y',   # Input
    'float32 z',              # Output
    'z = (x - y) * (x - y)',  # Operation
    'squared_diff')           # Name

squared_diff(cupy.arange(10), 10)
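For the ReductionKernel pattern, a hedged sketch following the standard CuPy l2norm example (its exact availability in the CuPy version bundled with Chainer v1.7.2 is assumed):

# Sketch: a user-defined reduction kernel computing an L2 norm.
l2norm = cupy.ReductionKernel(
    'float32 x',     # input params
    'float32 y',     # output params
    'x * x',         # map: applied to each element
    'a + b',         # reduce: how mapped values are combined
    'y = sqrt(a)',   # post-map: applied to the reduced value
    '0',             # identity value
    'l2norm')        # kernel name

l2norm(cupy.arange(10, dtype=cupy.float32))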
36. Public benchmark results (CNN):
Chainer shows comparable performance
l Forward computation time is almost the same as TensorFlow's
l Training with backward computation is slower, but this can be
offset by having no compilation time while debugging/tuning
[Charts: forward and backward computation time (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat, comparing Torch, TensorFlow, Chainer, and Caffe (native)]
Taken from https://github.com/soumith/convnet-benchmarks, using cuDNN except for Caffe
37. Chainer can benefit from the latest CUDA libraries:
Ex. Winograd algorithm in cuDNN v5
l 3x3 convolutions are common in CNNs and are now computed with the Winograd algorithm
l State-of-the-art CNN models (e.g., GoogLeNet, VGG-A)
can be accelerated by up to 2.0x at test time (forward only)
[Charts: forward and backward computation time (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat with cuDNN v4 vs. cuDNN v5]
Independently measured with a modified version of soumith/convnet-benchmarks
cuDNN v5 can be used from Chainer v1.8.0
38. Algorithm implementation in Chainer:
A Neural Algorithm of Artistic Style (Gatys et al., 2015)
l https://github.com/mattya/chainer-gogh
[Figure: content image (cat) + style image = new artistic image; main code is 45 lines]