CuPy: A NumPy-compatible Library for GPU
Shohei Hido
VP of Research
Preferred Networks
Preferred Networks: An AI Startup in Japan
● Founded: March 2014 (120 engineers and researchers)
● Major news
● $100+M investment from Toyota for autonomous driving
● 2nd place at Amazon Robotics Challenge 2016
● Fastest ImageNet training on GPU cluster (15 minutes using 1,024 GPUs)
Deep learning research; industrial applications in manufacturing, automotive, and healthcare
Key takeaways
● CuPy is an open-source NumPy for NVIDIA GPU
● Python users can easily write CPU/GPU-agnostic code
● Existing NumPy code can be accelerated thanks to GPU and CUDA libraries
● What is CuPy
● Example: CPU/GPU agnostic implementation of k-means
● Introduction to CuPy
● Recent updates & conclusion
CuPy: A NumPy-Compatible Library for NVIDIA GPU
● NumPy is extensively used in Python but GPU is not supported
● GPU is getting faster and more important for scientific computing
# CPU (NumPy)
import numpy as np
x_cpu = np.random.rand(10)
W_cpu = np.random.rand(10, 5)
y_cpu = np.dot(x_cpu, W_cpu)

# NVIDIA GPU (CuPy)
import cupy as cp
x_gpu = cp.random.rand(10)
W_gpu = cp.random.rand(10, 5)
y_gpu = cp.dot(x_gpu, W_gpu)

# Data transfer between CPU and GPU
y_gpu = cp.asarray(y_cpu)  # CPU to GPU
y_cpu = cp.asnumpy(y_gpu)  # GPU to CPU

# CPU/GPU-agnostic
for xp in [np, cp]:
    x = xp.random.rand(10)
    W = xp.random.rand(10, 5)
    y = xp.dot(x, W)
CuPy is actively developed (1,600+ GitHub stars, 11,000+ commits)
Ryosuke Okuta, CTO, Preferred Networks
Python libraries powered by CuPy
● Deep learning framework: https://chainer.org/
● Probabilistic and graphical modeling: https://github.com/jmschrei/pomegranate
● Natural language processing: https://spacy.io/
Reputation (1/2): Travis Oliphant, creator of NumPy and SciPy
Reputation (2/2): Stephen Merity of Salesforce Research (MetaMind)
Our mission: make CuPy the default tool for GPU computation in Python
https://anaconda.org/anaconda/cupy/
● CuPy is now available on Anaconda in collaboration with the Anaconda team
● You can install CuPy with “$ conda install cupy” on Linux 64-bit
● We are working on a Windows version
Don’t have GPU for CuPy? Google Colaboratory gives you one (for free!)
Implementation of CPU/GPU agnostic k-means fit(): 37 lines
https://github.com/cupy/cupy/blob/master/examples/kmeans/kmeans.py
K-means (1/3): Call function and initialization
● fit() follows the training API of scikit-learn
● xp represents either numpy or cupy (the caller specifies NumPy or CuPy)
● Cluster centers are initialized to the positions of randomly chosen samples
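The initialization step could be sketched roughly as follows. This is not the code from the linked example: the function name init_centers, its parameters, and the fixed seed are illustrative assumptions, and the array module is passed in explicitly as xp. NumPy is used here so that the sketch runs without a GPU.

```python
import numpy as np


def init_centers(X, n_clusters, xp=np, seed=0):
    """Pick n_clusters distinct rows of X as the initial cluster centers.

    xp is the array module: numpy here, cupy would be passed for GPU arrays.
    The name and signature are hypothetical, chosen for illustration.
    """
    assert X.ndim == 2
    # Draw the sample indices on the host, then convert them to the
    # array type that matches X before indexing.
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), n_clusters, replace=False)
    return X[xp.asarray(idx)]


X = np.random.RandomState(1).rand(100, 2)
centers = init_centers(X, n_clusters=3, xp=np)
print(centers.shape)
```

Passing xp as an argument keeps the function free of any direct reference to numpy or cupy in its body, which is the property the slide describes.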
K-means (2/3): Calculate distance to all of the cluster centers
● xp.linalg.norm computes the distances and is supported in both numpy and cupy
● _fit_calc_distances() uses a custom kernel on cupy
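The module-agnostic path of the distance step might look like the sketch below. The helper name assign_labels and the tiny data set are assumptions made for illustration; only xp.linalg.norm and xp.argmin come from the NumPy/CuPy API, and the custom-kernel path is not shown.

```python
import numpy as np


def assign_labels(X, centers, xp=np):
    """Return the index of the nearest center for every row of X."""
    # Broadcasting builds an (n_samples, n_clusters, n_features) array of
    # differences; the norm over the last axis gives Euclidean distances.
    distances = xp.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return xp.argmin(distances, axis=1)


X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 4.9]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_labels(X, centers)
print(labels)
```

The broadcasting form allocates an intermediate array of size n_samples x n_clusters x n_features, which is one plausible reason to replace it with a custom kernel on the GPU.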
Customized kernel with C++ snippet in cupy.ElementwiseKernel
● A kernel is generated from an element-wise operation defined in a C++ snippet
K-means (3/3): Update positions of cluster centers
● xp.stack builds the updated cluster centers and is supported in both numpy and cupy
● _fit_calc_center() is also based on a custom kernel
Another element-wise kernel
● It just adds up all of the points inside each cluster and counts them
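Putting the three steps together, a complete module-agnostic loop could be sketched as below. It approximates the structure described on the slides rather than reproducing the linked example: the names update_centers and fit, the seed parameter, and the tiny data set are assumptions, and the sketch assumes no cluster ever becomes empty.

```python
import numpy as np


def update_centers(X, labels, n_clusters, xp=np):
    """Move each center to the mean of the samples assigned to it."""
    # Assumes every cluster holds at least one sample; an empty cluster
    # would yield NaN here.
    return xp.stack([X[labels == i].mean(axis=0) for i in range(n_clusters)])


def fit(X, n_clusters, max_iter=10, xp=np, seed=0):
    """Plain k-means iteration written against the array module xp."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), n_clusters, replace=False)
    centers = X[xp.asarray(idx)]
    labels = xp.full(len(X), -1)  # no sample is assigned yet
    for _ in range(max_iter):
        distances = xp.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = xp.argmin(distances, axis=1)
        if bool(xp.all(new_labels == labels)):
            break  # assignments are stable, so the centers are too
        labels = new_labels
        centers = update_centers(X, labels, n_clusters, xp)
    return centers, labels


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, labels = fit(X, n_clusters=2, xp=np)
print(centers)
print(labels)
```

To run the same loop on a GPU, one would pass a CuPy array for X together with xp=cupy; nothing else in the sketch refers to a specific module.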
Performance comparison with NumPy
● CuPy is faster than NumPy on large arrays, even for simple matrix manipulation; for sizes below about 10^6, NumPy is faster in this benchmark
Benchmark result
Size    CuPy [ms]   NumPy [ms]
10^4      0.58         0.03
10^5      0.97         0.20
10^6      1.84         2.00
10^7     12.48        55.55
10^8     84.73       517.17   (CuPy is about 6x faster)
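A timing harness for this kind of comparison could be sketched as follows. The benchmarked operation on the slide is not recoverable from the text, so the workload below is a hypothetical stand-in, and time_op with its parameters is an assumed helper. For CuPy, a synchronization callable such as cupy.cuda.Stream.null.synchronize would need to be passed as sync, because GPU kernels are launched asynchronously.

```python
import time

import numpy as np


def time_op(xp, size, repeat=5, sync=None):
    """Return the best wall-clock time in milliseconds over several runs.

    sync is an optional callable that blocks until queued device work has
    finished; it can stay None for NumPy.
    """
    x = xp.arange(size, dtype=xp.float32)
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        _ = xp.sqrt(x * 2.0 + 1.0)  # hypothetical workload, not the slide's
        if sync is not None:
            sync()
        best = min(best, (time.perf_counter() - start) * 1e3)
    return best


timings = {e: time_op(np, 10 ** e) for e in (4, 5, 6)}
for exponent, ms in timings.items():
    print(f"10^{exponent}: {ms:.3f} ms (NumPy)")
```

Without the synchronization step, a GPU measurement would mostly capture kernel launch overhead rather than execution time.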
Compatibility with NumPy
● Data types (dtypes)
○ bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, complex64, and complex128
● All basic indexing
○ indexing by ints, slices, newaxes, and Ellipsis
● Most of advanced indexing
○ except indexing patterns with boolean masks
● Most of the array creation routines
○ empty, ones_like, diag, etc...
● Most of the array manipulation routines
○ reshape, rollaxis, concatenate, etc...
● All operators with broadcasting
● All universal functions for element-wise operations
○ except those for complex numbers
● Linear algebra functions accelerated by cuBLAS
○ including product: dot, matmul, etc...
○ including decomposition: cholesky, svd, etc...
● Reduction along axes
○ sum, max, argmax, etc...
● Sort operations implemented by Thrust
○ sort, argsort, and lexsort
● Sparse matrix accelerated by cuSPARSE
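A few of the features listed above can be exercised through a single module argument, as in the sketch below. The function demo is a hypothetical helper; it is run with NumPy here, and under the compatibility claims above the same body would be expected to work when cupy is passed instead.

```python
import numpy as np


def demo(xp):
    a = xp.arange(12, dtype=xp.float32).reshape(3, 4)
    row = a[1]                     # basic indexing by an int
    col = a[:, None, 0]            # slice, newaxis, and int combined
    centered = a - a.mean(axis=0)  # operator with broadcasting
    total = a.sum(axis=1)          # reduction along an axis
    order = xp.argsort(-a[0])      # sort operation
    return row, col, centered, total, order


row, col, centered, total, order = demo(np)
print(row, col.shape, total, order)
```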
Comparison with other Python libraries for/on CUDA
● CuPy is the only library designed for high compatibility with NumPy while still allowing users to write customized CUDA kernels for better performance
                             CuPy   PyCUDA   MinPy*
NVIDIA CUDA support          ✔      ✔        ✔
CPU/GPU agnostic coding      ✔      -        ✔
Automatic gradient support   **     -        ✔
NumPy compatible interface   ✔      -        ✔
User-defined CUDA kernel     ✔      ✔        -
* https://github.com/dmlc/minpy
** Autograd is supported by Chainer
Inside CuPy
● CuPy extensively relies on NVIDIA libraries for better performance
[Architecture diagram: CuPy on top of the components below, which run on CUDA and the NVIDIA GPU]
○ cuDNN: DNN
○ cuBLAS, cuSOLVER: linear algebra
○ cuRAND: random numbers
○ cuSPARSE: sparse matrix
○ NCCL: multi-GPU data transfer
○ Thrust: sort
○ User-defined CUDA kernel
○ Utility
Looks very easy?
● CUDA and its libraries are not designed for Python nor NumPy
━ CuPy is not just a wrapper of CUDA libraries for Python
━ CuPy is a fast numerical computation library on GPU with NumPy-compatible API
● NumPy specification is not documented
━ We have carefully investigated some unexpected behaviors of NumPy
━ CuPy tries to replicate NumPy’s behavior as much as possible
● NumPy’s behaviors vary between different versions
━ e.g., NumPy v1.14 changed the output format of __str__
• `[ 0. 1.]` -> `[0. 1.]` (no leading space)
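The formatting change can be observed from a single installation, assuming NumPy 1.15 or later, because the print options accept a legacy mode that mimics the pre-1.14 output. The sketch below only illustrates the NumPy side of the difference that a compatible library has to track.

```python
import numpy as np

a = np.array([0.0, 1.0])

# Current formatting (NumPy 1.14 and later).
modern = str(a)

# legacy="1.13" asks NumPy to reproduce the older formatting, which pads
# each element with a leading space reserved for the sign.
with np.printoptions(legacy="1.13"):
    legacy = str(a)

print(repr(modern))
print(repr(legacy))
```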
Advanced features of CuPy (1/2)
Memory pool
● Avoiding cudaMalloc is a common practice in CUDA programming
● CuPy supports memory pooling using the Best-Fit with Coalescing (BFC) algorithm
● It reduces memory usage to 25% on a seq2seq model
GPU memory profiler
Function name    Used bytes   Acquired bytes   Occurrence
LinearFunction   5.16GB       0.18GB           3900
ReLU             0.99GB       0.46GB           1300
SoftMaxEntropy   7.71MB       5.08MB           1300
Accuracy         0.62MB       0.35MB           700
● This enables function-wise memory profiling on Chainer
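To make the idea behind best-fit with coalescing concrete, here is a toy allocator over a single arena. It is an illustration only and not CuPy's memory pool: the class ToyPool and its methods are hypothetical, integer offsets stand in for device pointers, and real pools add details such as size rounding and per-stream bookkeeping.

```python
class ToyPool:
    """Toy best-fit allocator with coalescing over one arena of `size` bytes."""

    def __init__(self, size):
        self.free = [(0, size)]  # sorted list of (offset, length) free chunks
        self.used = {}           # offset -> length of allocated chunks

    def malloc(self, n):
        # Best fit: choose the smallest free chunk that is large enough.
        candidates = [c for c in self.free if c[1] >= n]
        if not candidates:
            raise MemoryError("out of pool memory")
        off, length = min(candidates, key=lambda c: c[1])
        self.free.remove((off, length))
        if length > n:
            # Split the chunk and keep the unused tail in the free list.
            self.free.append((off + n, length - n))
            self.free.sort()
        self.used[off] = n
        return off

    def free_chunk(self, off):
        n = self.used.pop(off)
        self.free.append((off, n))
        self.free.sort()
        # Coalesce: merge free chunks that are adjacent in address order.
        merged = [self.free[0]]
        for o, length in self.free[1:]:
            prev_off, prev_len = merged[-1]
            if prev_off + prev_len == o:
                merged[-1] = (prev_off, prev_len + length)
            else:
                merged.append((o, length))
        self.free = merged


pool = ToyPool(100)
a = pool.malloc(30)
b = pool.malloc(30)
c = pool.malloc(30)
pool.free_chunk(a)
d = pool.malloc(10)   # best fit picks the 10-byte tail, not the 30-byte hole
pool.free_chunk(b)    # the holes left by a and b merge into one 60-byte chunk
e = pool.malloc(50)   # only possible because of the merge
print(a, b, c, d, e)
```

The pool hands out chunks from memory it already holds, so repeated allocations avoid a call to the underlying allocator, which is the motivation given on the slide for avoiding cudaMalloc.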
Advanced features of CuPy (2/2)
Kernel fusion (experimental)
@cp.fuse()
def fused_func(x, y, z):
    return (x * y) + z
● With the @cp.fuse() decorator, CuPy records the series of operations in the function
● It then compiles a single kernel that executes all of them
What’s new in CuPy v4?
• Pre-built wheel packages of CuPy are now provided
– cupy-cuda80, cupy-cuda90, and cupy-cuda91
– $ pip install cupy-cuda80
• Memory pool is now the default allocator
– Added line memory profiler using memory hook and traceback
• CUDA stream is fully supported
# Variant 1: use the stream as a context manager
stream = cupy.cuda.stream.Stream()
with stream:
    y = cupy.linalg.norm(x)
stream.synchronize()
# Variant 2: make the stream current with use()
stream = cupy.cuda.stream.Stream()
stream.use()
y = cupy.linalg.norm(x)
stream.synchronize()
Newly added functions in v4
cupy.argpartition
cupy.unravel_index
cupy.percentile
cupy.moveaxis
cupy.blackman
cupy.hamming
cupy.hanning
cupy.isclose
cupy.iscomplex
cupy.iscomplexobj
cupy.isfortran
cupy.isreal
cupy.isrealobj
cupy.linalg.tensorinv
cupy.random.shuffle
cupy.random.set_random_state
cupy.random.RandomState.tomaxint
cupy.sparse.random
cupy.sparse.csr_matrix.eliminate_zeros
cupy.sparse.coo_matrix.eliminate_zeros
cupy.sparse.csc_matrix.eliminate_zeros
cupyx.scatter_add
cupy.fft
Standard FFTs:
fft, ifft, fft2, ifft2, fftn, ifftn
Real FFTs:
rfft, irfft, rfft2, irfft2, rfftn, irfftn
Hermitian FFTs:
hfft, ihfft
Helper routines:
fftfreq, rfftfreq, fftshift, ifftshift
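Since the FFT routines carry the same names as those in numpy.fft, they fit the same module-agnostic pattern. The sketch below is an assumed example rather than something from the slides: dominant_frequency is a hypothetical helper, and it is run with NumPy so that no GPU is needed.

```python
import numpy as np


def dominant_frequency(signal, sample_rate, xp=np):
    """Return the frequency in Hz with the largest magnitude in a real signal."""
    spectrum = xp.abs(xp.fft.rfft(signal))
    freqs = xp.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(freqs[int(xp.argmax(spectrum))])


# One second of a 5 Hz sine wave sampled at 100 Hz.
t = np.arange(100) / 100.0
signal = np.sin(2 * np.pi * 5.0 * t)
peak = dominant_frequency(signal, sample_rate=100.0)
print(peak)
```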
CuPy v5 - planned features
• Windows support
• AMD GPU support via HIP
• More useful fusion function
• Add more functions (NumPy, SciPy)
• Add more probability distributions
• Provide simple CUDA kernel
• Support DLPack and Tensor Comprehensions
– toDLPack() and fromDLPack()
@cupy.fuse()
def sample2(x, y):
    return cupy.sum(x + y, axis=0) * 2
Summary: CuPy is a drop-in replacement of NumPy for GPU
1. Highly-compatible with NumPy
━ data types, indexing, broadcasting, operations
━ Users can write CPU/GPU-agnostic code
2. High performance on NVIDIA GPUs
━ cuBLAS, cuDNN, cuRAND, cuSPARSE, and NCCL
3. Easy to install
━ $ pip install cupy
━ $ conda install cupy
4. Easy to write custom kernel
━ ElementwiseKernel, ReductionKernel
# CPU (NumPy)
import numpy as np
x = np.random.rand(10)
W = np.random.rand(10, 5)
y = np.dot(x, W)

# GPU (CuPy): to GPU, switch numpy to cupy; to CPU, switch back
import cupy as cp
x = cp.random.rand(10)
W = cp.random.rand(10, 5)
y = cp.dot(x, W)
Your contribution will be highly appreciated & We are hiring!

Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

CuPy: A NumPy-compatible Library for GPU

  • 1. A NumPy-compatible Library for GPU
       Shohei Hido, VP of Research, Preferred Networks
  • 2. Preferred Networks: An AI Startup in Japan
       ● Founded: March 2014 (120 engineers and researchers)
       ● Major news
         ● $100+M investment from Toyota for autonomous driving
         ● 2nd place at Amazon Robotics Challenge 2016
         ● Fastest ImageNet training on GPU cluster (15 minutes using 1,024 GPUs)
       ● Deep learning research; industrial applications: manufacturing, automotive, healthcare
  • 3. Key takeaways
       ● CuPy is an open-source NumPy for NVIDIA GPU
       ● Python users can easily write CPU/GPU-agnostic code
       ● Existing NumPy code can be accelerated thanks to GPU and CUDA libraries
  • 4. ● What is CuPy ● Example: CPU/GPU agnostic implementation of k-means ● Introduction to CuPy ● Recent updates & conclusion
  • 5. CuPy: A NumPy-Compatible Library for NVIDIA GPU
       ● NumPy is extensively used in Python but GPU is not supported
       ● GPU is getting faster and more important for scientific computing
       CPU (NumPy):
           import numpy as np
           x_cpu = np.random.rand(10)
           W_cpu = np.random.rand(10, 5)
           y_cpu = np.dot(x_cpu, W_cpu)
       NVIDIA GPU (CuPy):
           import cupy as cp
           x_gpu = cp.random.rand(10)
           W_gpu = cp.random.rand(10, 5)
           y_gpu = cp.dot(x_gpu, W_gpu)
       Transfer between CPU and GPU:
           y_gpu = cp.asarray(y_cpu)   # to GPU
           y_cpu = cp.asnumpy(y_gpu)   # to CPU
       CPU/GPU-agnostic:
           for xp in [np, cp]:
               x = xp.random.rand(10)
               W = xp.random.rand(10, 5)
               y = xp.dot(x, W)
  • 6. CuPy is actively developed (1,600+ GitHub stars, 11,000+ commits)
       Ryosuke Okuta, CTO, Preferred Networks
  • 7. Python libraries powered by CuPy
       ● Deep learning framework: https://chainer.org/
       ● Probabilistic and graphical modeling: https://github.com/jmschrei/pomegranate
       ● Natural language processing: https://spacy.io/
  • 8. Reputation (1/2): Travis Oliphant, creator of NumPy and SciPy
  • 9. Reputation (2/2): Stephen Merity of Salesforce Research (MetaMind)
  • 10. Our mission: make CuPy the default tool for GPU computation in Python
        https://anaconda.org/anaconda/cupy/
        ● CuPy is now available on Anaconda in collaboration with the Anaconda team
        ● You can install cupy with "$ conda install cupy" on Linux 64-bit
        ● We are working on a Windows version
  • 11. Don’t have GPU for CuPy? Google Colaboratory gives you one (for free!) …
  • 12. ● What is CuPy ● Example: CPU/GPU agnostic implementation of k-means ● Introduction to CuPy ● Recent updates & conclusion
  • 13. Implementation of CPU/GPU agnostic k-means fit(): 37 lines https://github.com/cupy/cupy/blob/master/examples/kmeans/kmeans.py
  • 14. K-means (1/3): Call function and initialization
        ● fit() follows the training API of scikit-learn
        ● xp represents either numpy or cupy (annotation on the code: specify NumPy or CuPy)
        ● Cluster centers are initialized with the positions of random samples
  • 15. K-means (2/3): Calculate distance to all of the cluster centers
        ● xp.linalg.norm computes the distances and is supported in both numpy and cupy
        ● _fit_calc_distances() uses a custom kernel on cupy
  • 16. Customized kernel with C++ snippet in cupy.ElementwiseKernel
        ● A kernel is generated from an element-wise operation defined in a C++ snippet
  • 17. K-means (3/3): Update positions of cluster centers
        ● xp.stack builds the updated cluster centers and is supported in both numpy and cupy
        ● _fit_calc_center() is also based on a custom kernel
  • 18. Another element-wise kernel
        ● It just adds up all of the points inside each cluster and counts them
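The kernel on slide 16 was shown as an image and does not survive in the transcript. As a minimal sketch of the same mechanism, the snippet below defines a different and simpler kernel, the type-generic squared difference, with `cupy.ElementwiseKernel`. The wrapper name `squared_diff` and the NumPy fallback are assumptions added here so the code also runs on a machine without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    # The C++ snippet 'z = (x - y) * (x - y)' is applied to every element.
    # 'T' is a generic type placeholder resolved from the input arrays.
    _squared_diff_gpu = cp.ElementwiseKernel(
        'T x, T y',               # input parameters
        'T z',                    # output parameter
        'z = (x - y) * (x - y)',  # element-wise operation (C++ snippet)
        'squared_diff')           # kernel name
except Exception:  # cupy is missing or CUDA is unavailable
    cp = None
    _squared_diff_gpu = None


def squared_diff(x, y):
    """Element-wise (x - y)^2 on the GPU for cupy arrays, on the CPU otherwise."""
    if cp is not None and isinstance(x, cp.ndarray):
        return _squared_diff_gpu(x, y)
    return (x - y) * (x - y)
```

Passing `cupy` arrays routes the call through the compiled kernel, while `numpy` arrays take the plain NumPy expression, so callers do not need to know which device holds the data.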
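Slides 14 to 17 walk through the three steps of `fit()`, but the code itself was shown as images. The sketch below puts those steps together under a few assumptions: it is not the linked `kmeans.py`, it leaves out the custom-kernel path, the helper `get_xp` and the `seed` parameter are hypothetical, and it does not handle clusters that become empty.

```python
import numpy as np

try:
    import cupy as cp  # optional: only needed when X lives on the GPU
except Exception:
    cp = None


def get_xp(X):
    """Return the array module (numpy or cupy) that matches X."""
    return cp.get_array_module(X) if cp is not None else np


def fit(X, n_clusters, max_iter=100, seed=0):
    xp = get_xp(X)
    n_samples = len(X)
    # Step 1: initialize the centers with positions of random samples.
    init = np.random.RandomState(seed).choice(n_samples, n_clusters, replace=False)
    centers = X[xp.asarray(init)]
    pred = xp.full(n_samples, -1)
    for _ in range(max_iter):
        # Step 2: (n_samples, n_clusters) distances to every center.
        distances = xp.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_pred = xp.argmin(distances, axis=1)
        if bool(xp.all(new_pred == pred)):
            break  # assignments are stable, so the centers are too
        pred = new_pred
        # Step 3: each center becomes the mean of its assigned points.
        centers = xp.stack([X[pred == k].mean(axis=0) for k in range(n_clusters)])
    return centers, pred
```

The same function accepts a `numpy.ndarray` or a `cupy.ndarray`; only the initial index draw is done on the CPU so that the result is reproducible for a given `seed`.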
  • 19. ● What is CuPy ● Example: CPU/GPU agnostic implementation of k-means ● Introduction to CuPy ● Recent updates & conclusion
  • 20. Performance comparison with NumPy
        ● CuPy is faster than NumPy even in simple manipulation of a large matrix
        Benchmark result (the benchmark code was shown on the slide):
            Size    CuPy [ms]   NumPy [ms]
            10^4         0.58         0.03
            10^5         0.97         0.20
            10^6         1.84         2.00
            10^7        12.48        55.55
            10^8        84.73       517.17
        6x faster (at size 10^8)
  • 21. Compatibility with NumPy
        ● Data types (dtypes)
          ○ bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, complex64, and complex128
        ● All basic indexing
          ○ indexing by ints, slices, newaxes, and Ellipsis
        ● Most of advanced indexing
          ○ except indexing patterns with boolean masks
        ● Most of the array creation routines
          ○ empty, ones_like, diag, etc.
        ● Most of the array manipulation routines
          ○ reshape, rollaxis, concatenate, etc.
        ● All operators with broadcasting
        ● All universal functions for element-wise operations
          ○ except those for complex numbers
        ● Linear algebra functions accelerated by cuBLAS
          ○ including product: dot, matmul, etc.
          ○ including decomposition: cholesky, svd, etc.
        ● Reduction along axes
          ○ sum, max, argmax, etc.
        ● Sort operations implemented by Thrust
          ○ sort, argsort, and lexsort
        ● Sparse matrix accelerated by cuSPARSE
  • 22. Comparison with other Python libraries for/on CUDA
        ● CuPy is the only library that is designed for high compatibility with NumPy while still allowing users to write customized CUDA kernels for better performance
                                         CuPy   PyCUDA   MinPy*
            NVIDIA CUDA support           ✔       ✔        ✔
            CPU/GPU agnostic coding       ✔       -        ✔
            Automatic gradient support    **      -        ✔
            NumPy compatible interface    ✔       -        ✔
            User-defined CUDA kernel      ✔       ✔        -
        * https://github.com/dmlc/minpy
        ** Autograd is supported by Chainer
  • 23. Inside CuPy
        ● CuPy extensively relies on NVIDIA libraries for better performance
        Diagram: CuPy features and the libraries underneath them, all running on CUDA and the NVIDIA GPU
            Linear algebra: cuBLAS, cuSOLVER
            Sparse matrix: cuSPARSE
            DNN: cuDNN
            Random numbers: cuRAND
            Multi-GPU data transfer: NCCL
            Sort: Thrust
            Directly on CUDA: utility, user-defined CUDA kernel
  • 24. Looks very easy?
        ● CUDA and its libraries are not designed for Python or NumPy
          ━ CuPy is not just a wrapper of CUDA libraries for Python
          ━ CuPy is a fast numerical computation library on GPU with NumPy-compatible API
        ● NumPy specification is not documented
          ━ We have carefully investigated some unexpected behaviors of NumPy
          ━ CuPy tries to replicate NumPy’s behavior as much as possible
        ● NumPy’s behaviors vary between different versions
          ━ e.g., NumPy v1.14 changed the output format of __str__
            • `[ 0. 1.]` -> `[0. 1.]` (no space)
  • 25. Advanced features of CuPy (1/2)
        Memory pool
        ● Avoiding cudaMalloc is a common practice in CUDA programming
        ● CuPy supports memory pooling using the Best-Fit with Coalescing (BFC) algorithm
        ● It reduces memory usage to 25% on a seq2seq model
        GPU memory profiler
        ● This enables function-wise memory profiling on Chainer
            Function name    Used Bytes   Acquired Bytes   Occurrence
            LinearFunction       5.16GB           0.18GB         3900
            ReLU                 0.99GB           0.46GB         1300
            SoftMaxEntropy       7.71MB           5.08MB         1300
            Accuracy             0.62MB           0.35MB          700
  • 26. Advanced features of CuPy (2/2)
        Kernel fusion (experimental)
            @cp.fuse()
            def fused_func(x, y, z):
                return (x * y) + z
        ● By adding the decorator @cp.fuse(), CuPy stores a series of operations
        ● Then it compiles a single kernel to execute the operations
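The benchmark code on slide 20 is not part of the transcript, so the operation it timed is unknown. The following is only a sketch of how such a comparison could be measured, assuming an L2 norm over random data as the workload; `time_norm` and `_sync` are hypothetical helper names. GPU kernels are launched asynchronously, so the sketch synchronizes the device before reading the clock.

```python
import time

import numpy as np

try:
    import cupy as cp
except Exception:  # cupy is missing or CUDA is unavailable
    cp = None


def _sync(xp):
    # Wait for pending GPU work; a no-op for NumPy.
    if cp is not None and xp is cp:
        cp.cuda.Stream.null.synchronize()


def time_norm(xp, size, repeat=10):
    """Average time in milliseconds to take the L2 norm of `size` floats."""
    x = xp.random.rand(size)
    xp.linalg.norm(x)  # warm-up run (on GPU this includes kernel compilation)
    _sync(xp)
    start = time.perf_counter()
    for _ in range(repeat):
        xp.linalg.norm(x)
    _sync(xp)
    return (time.perf_counter() - start) / repeat * 1e3
```

Calling `time_norm(np, 10**6)` and, on a machine with a GPU, `time_norm(cp, 10**6)` gives a pair of numbers comparable to one row of the table, though the absolute values depend on the hardware and the chosen workload.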
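To make the memory pool on slide 25 more concrete, here is a small sketch that inspects CuPy's default pool. It uses `get_default_memory_pool()`, `used_bytes()`, `total_bytes()`, and `free_all_blocks()` as found in current CuPy releases, which may differ from the version discussed in the talk; `pool_usage` is a hypothetical helper, and the sketch returns `None` when no usable GPU is found.

```python
try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
except Exception:
    cp = None


def pool_usage():
    """Return (used_bytes, total_bytes) of the default pool, or None without a GPU."""
    if cp is None:
        return None
    pool = cp.get_default_memory_pool()
    return pool.used_bytes(), pool.total_bytes()


if cp is not None:
    x = cp.zeros((1024, 1024), dtype=cp.float32)  # 4 MiB served by the pool
    print(pool_usage())
    del x  # the block returns to the pool instead of being freed on the device
    print(pool_usage())
    cp.get_default_memory_pool().free_all_blocks()  # hand cached blocks back
    print(pool_usage())
```

Because freed arrays go back to the pool, later allocations of a similar size can typically be served without another `cudaMalloc` call, which is the practice the slide refers to.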
  • 27. ● What is CuPy ● Example: CPU/GPU agnostic implementation of k-means ● Introduction to CuPy ● Recent updates & conclusion
  • 28. What’s new in CuPy v4?
        • Start providing pre-built wheel packages of CuPy
          – cupy-cuda80, cupy-cuda90, and cupy-cuda91
          – $ pip install cupy-cuda80
        • Memory pool is now the default allocator
          – Added line memory profiler using memory hook and traceback
        • CUDA stream is fully supported
          With a with block:
              stream = cupy.cuda.stream.Stream()
              with stream:
                  y = cupy.linalg.norm(x)
              stream.synchronize()
          With stream.use():
              stream = cupy.cuda.stream.Stream()
              stream.use()
              y = cupy.linalg.norm(x)
              stream.synchronize()
  • 30. CuPy v5 - planned features
        • Windows support
        • AMD GPU support via HIP
        • More useful fusion function
              @cupy.fuse()
              def sample2(x, y):
                  return cupy.sum(x + y, axis=0) * 2
        • Add more functions (NumPy, SciPy)
        • Add more probability distributions
        • Provide simple CUDA kernel
        • Support DLPack and TensorComprehension
          – toDLPack() and fromDLPack()
  • 31. Summary: CuPy is a drop-in replacement of NumPy for GPU
        1. Highly-compatible with NumPy
           ━ data types, indexing, broadcasting, operations
           ━ Users can write CPU/GPU-agnostic code
        2. High performance on NVIDIA GPUs
           ━ cuBLAS, cuDNN, cuRAND, cuSPARSE, and NCCL
        3. Easy to install
           ━ $ pip install cupy
           ━ $ conda install cupy
        4. Easy to write custom kernel
           ━ ElementwiseKernel, ReductionKernel
        NumPy (CPU):
            import numpy as np
            x = np.random.rand(10)
            W = np.random.rand(10, 5)
            y = np.dot(x, W)
        CuPy (GPU):
            import cupy as cp
            x = cp.random.rand(10)
            W = cp.random.rand(10, 5)
            y = cp.dot(x, W)
        (arrows between the two snippets on the slide: to GPU, to CPU)
        Your contribution will be highly appreciated & We are hiring!
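The summary names `ReductionKernel` next to `ElementwiseKernel`, but the deck shows no example of it. As a hedged illustration, the sketch below defines an L2-norm reduction in the style of the CuPy documentation; the wrapper `l2norm` and its NumPy fallback are assumptions added so the snippet also runs without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    _l2norm_gpu = cp.ReductionKernel(
        'T x',          # input parameter
        'T y',          # output parameter
        'x * x',        # map: applied to each element before reducing
        'a + b',        # reduce: combines two mapped values
        'y = sqrt(a)',  # post-reduction map: applied to the reduced value
        '0',            # identity value of the reduction
        'l2norm')       # kernel name
except Exception:  # cupy is missing or CUDA is unavailable
    cp = None
    _l2norm_gpu = None


def l2norm(x, axis=None):
    """L2 norm along `axis`, on the GPU for cupy arrays and on the CPU otherwise."""
    if cp is not None and isinstance(x, cp.ndarray):
        return _l2norm_gpu(x, axis=axis)
    return np.sqrt((x * x).sum(axis=axis))
```

The four C++ snippets correspond to the map, reduce, and post-map stages plus the identity value, which is how a reduction differs from the purely element-wise kernel type.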