2. What we do today
NLP introduction (<5 min)
Deep learning introduction (10 min)
What do we want? (5 min)
How do we get there? (15 min)
Demo (5 min)
What’s next (5 min)
Demo (5 min)
Questions (10 min)
3. A NOT-SO-SHORT INTRODUCTION TO DEEP LEARNING NLP - FRANCESCO GADALETA
The Goals of NLP
Analysis of (free) text
Extract knowledge/abstract concepts from textual data
(text understanding)
Generative models (chatbots, AI assistants, ...)
Word/Paragraph similarity/classification
Sentiment analysis
5. Traditional NLP word representation
One-hot encoding of words: binary vectors of <vocabulary_size> dimensions

“book”    0 0 0 0 1 0 0 0 0 0
“chapter” 0 0 0 0 0 0 0 0 1 0
“paper”   0 1 0 0 0 0 0 0 0 0

“book” AND “chapter” AND “paper” = 0: any two distinct one-hot vectors are orthogonal, so this representation captures no notion of similarity.
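The orthogonality above can be checked in a few lines (a minimal sketch with a toy vocabulary):

```python
import numpy as np

VOCAB = ["the", "a", "book", "chapter", "paper"]

def one_hot(word, vocab=VOCAB):
    """Binary vector of <vocabulary_size> dimensions with a single 1."""
    v = np.zeros(len(vocab), dtype=int)
    v[vocab.index(word)] = 1
    return v

# The dot product plays the role of the bitwise AND above: it is 0 for any
# two distinct words, so one-hot vectors carry no similarity information.
print(one_hot("book") @ one_hot("chapter"))
print(one_hot("book") @ one_hot("book"))
```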
6. Traditional soft-clustering word representation
Soft-clustering models learn, for each cluster/topic, a distribution over words: how likely each word is in each cluster.
• Latent Semantic Analysis (LSA/LSI), Random Projections
• Latent Dirichlet Allocation (LDA), HMM clustering
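As a sketch of such a soft-clustering model, scikit-learn's LDA can be fit on a toy corpus (the documents and topic count here are illustrative):

```python
# Each row of components_ is one topic: an (unnormalized) distribution over
# the vocabulary, i.e. how likely each word is under that topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell as markets closed", "investors sold shares today"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# shape: (n_topics, vocabulary_size)
print(lda.components_.shape)
```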
7. LSA - Latent Semantic Analysis
Words that are close in meaning will occur in similar pieces of text.
Good for not-so-large text data.
SVD factorizes the huge, sparse, noisy matrix M of word counts per paragraph (rows = words, columns = paragraphs) into a low-rank product M = U Σ Vᵀ, reducing the word dimensions while preserving similarity among paragraphs.
Similarity = cosine(vec(w1), vec(w2))
Limitations:
• No polysemy
• Poor synonymy
• Bag-of-words (word order is lost)
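The pipeline above can be sketched with NumPy (the word-by-paragraph count matrix is a toy assumption):

```python
import numpy as np

# rows = words, columns = paragraphs (word counts per paragraph)
M = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2]], dtype=float)

# Low-rank factorization M = U S V^T, keep the top k dimensions
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]          # reduced word representations

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# words 0 and 1 co-occur in the same paragraphs; words 0 and 2 never do
print(cosine(word_vecs[0], word_vecs[1]))
print(cosine(word_vecs[0], word_vecs[2]))
```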
9. The past and the present
Human-designed representations: raw data (text, audio, ...) is reduced to handcrafted features (e.g. handcrafted sound features).
An ML model (Regression, Clustering, Random Forest, SVM, KNN, ...) then runs weight optimization on those features to produce predictions.
10. The future
Representation Learning: automatically learn good features or representations.
Deep Learning: learn multiple levels of representation with increasing complexity and abstraction.
12. Brief history of AI
1958 Rosenblatt’s perceptron
1974 Backpropagation
1995 Kernel methods (SVM)
1998 ConvNets
2006 Restricted Boltzmann Machine
2012 Google Brain Project
AI winter → AI spring → AI summer
13. Why is this happening?
• BIG Data
• GPU Power
• ALGO Progress
14. Geoffrey Hinton
Cognitive psychologist and professor at the University of Toronto, and one of the first to demonstrate the use of generalized backpropagation to train multi-layer networks.
Known for backpropagation and the Boltzmann machine; great-great-grandson of logician George Boole.
15. Yann LeCun
Postdoc at Hinton’s lab.
Developed the DjVu format.
Father of Convolutional Neural Networks and of modern Optical Character Recognition (OCR).
Proposed bio-inspired ML methods like “Optimal Brain Damage”, a regularization (pruning) method.
LeNet-5 became a landmark architecture in artificial vision.
16. Yoshua Bengio
Professor at the University of Montreal.
Many contributions to Deep Learning.
Known for gradient-based learning, word representations and representation learning for NLP.
17. Some reasons to apply Deep Learning (non-exhaustive list)
18. No. 1: Automatic Representation Learning
1. Who wants to manually prepare features?
2. Handcrafted features are often over-specified or incomplete (or both)
3. Done? Cool! Now do it again and again...

The traditional, time-consuming pipeline: Input Data → Feature Engineering → ML Algorithm → Validation, repeated from scratch for Domain #1, Domain #2, Domain #3, ...
19. No. 2: Learning from unlabeled data
Traditional NLP requires labeled training data.
Guess what? Almost all data is unlabeled.
Learning how data is generated is essential to ‘understand’ the data.
[Demo]
20. No. 3: Metric Learning
Similarity, dissimilarity, distance matrix, kernel: define them, please!
Deep models can learn the similarity metric instead of requiring one to be hand-defined.
21. No. 4: Human language is recursive
“People that don't know me think I'm shy.
People that do know me wish I were.”
Recursion: the same operator applied to different components (RNN).
23. LeNet (proposed in 1998 by Yann LeCun)
● Convolutional Neural Network for reading bank checks
● All units of a feature map share the same set of weights:
  detects the same feature at all possible locations of the input; robust to shifts and distortions
24. GoogLeNet (proposed in 2014 by Szegedy et al.)
Specs:
● 22 layers
● 12x fewer parameters than the winning network of the ILSVRC 2012 challenge
● Introduced the Inception module (filters similar to the primate visual cortex) to find out how a local sparse structure can be approximated by readily available dense components
● Too deep => gradient propagation problems => auxiliary classifiers added in the middle of the network :)

Applications: object recognition, captioning, classification, scene description (*)
(*) with semantically valid phrases.
25. A not-so-classic example
“Kid eating ice cream”
27. Sentiment analysis
Task: Socher et al. [1] use recursive neural networks for sentiment prediction.
Demo: http://nlp.stanford.edu/sentiment
28. Neural Generative Model
Character-based RNN

Text: Alice in Wonderland
Corpus length: 167,546
Unique chars: 85
# sequences: 55,842
Context chars: 20
Epochs: 280
Hardware: Intel i7 CPU, NVIDIA 560M GPU, 16 GB RAM

A 20-character window slides over the text (“neural networks are fun”, shifted one character at a time): each INPUT is a <20x85> one-hot matrix and each OUTPUT is the <1x85> one-hot vector of the next character.
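The windowing above can be sketched as follows (a repeated toy phrase stands in for the Alice in Wonderland corpus):

```python
import numpy as np

text = "neural networks are fun " * 40   # toy stand-in for the real corpus
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}
context = 20

X, y = [], []
for i in range(len(text) - context):
    window, nxt = text[i:i + context], text[i + context]
    xo = np.zeros((context, len(chars)))          # INPUT: <context x alphabet>
    xo[np.arange(context), [idx[c] for c in window]] = 1
    yo = np.zeros(len(chars))                     # OUTPUT: <1 x alphabet>
    yo[idx[nxt]] = 1
    X.append(xo)
    y.append(yo)

X, y = np.array(X), np.array(y)
print(X.shape, y.shape)
```

On the real corpus the same loop yields the numbers on the slide: 55,842 sequences over an 85-character alphabet.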
30. Neural Network Architectures
image → class | image → caption | sentence → class | sentence → sentence | sequence → sequence
32. Just one (*)
Layers
● Output: predict the supervised target
● Hidden: learn abstract representations
● Input: raw sensory inputs
(*) Provided you don’t fall for exotic stuff
33. NN architecture: Single Neuron
A single neuron with n (here 3) inputs and 1 output; parameters W, b.
Inputs x1, x2, x3 and a bias unit b = +1 feed hW,b(x) through a logistic activation function.
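A minimal sketch of this neuron (the weight values are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, W, b):
    """h_{W,b}(x) = sigmoid(W . x + b)"""
    return sigmoid(W @ x + b)

x = np.array([1.0, 2.0, 3.0])        # inputs x1, x2, x3
W = np.array([0.5, -0.25, 0.1])      # one weight per input
b = 1.0                              # bias: the "+1" unit times its weight
print(neuron(x, W, b))               # a value in (0, 1)
```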
34. Many Single Neurons make a Network
Input layer (x1, x2, x3, bias b = +1) → Layer 1 → Layer 2 → Layer 3 → output layer.

Learning:
● Many logistic regressions at the same time
● Hidden neurons have no meaning for humans
● The output to be predicted stays the same
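A sketch of the forward pass, treating every unit as a logistic regression (layer sizes and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]                 # input, two hidden layers, output
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    for W, b in params:
        x = sigmoid(W @ x + b)       # every unit is a logistic regression
    return x

out = forward(np.array([1.0, 0.5, -0.2]))
print(out.shape)
```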
35. Neural Networks in a (not-so-small) nutshell
*** DISCLAIMER ***
After this section the charming and
fascinating halo surrounding Neural
Networks and Deep Learning will be gone.
36. The core of a Neural Network
Inputs x1, x2, x3 with a bias unit b = +1.
37. The core of a Neural Network
Inputs x1, x2, x3 (bias b = +1) pass through two stacked logistic-regression layers with parameters W1, b1 and W2, b2.
38. The core of a Neural Network
Each layer is a logistic regression. Training: SGD (Stochastic Gradient Descent) with backpropagation at each layer.
39. Non-linear Activation Functions
Rectified Linear Unit (ReLU)
➔ fast
➔ more expressive than the logistic function
➔ mitigates vanishing gradients
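A quick sketch contrasting the two activations (the probe values are arbitrary): for positive inputs the ReLU gradient is exactly 1, while the logistic gradient shrinks toward 0.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                                     # negatives clipped to 0

# Gradient comparison at a large activation: sigmoid saturates, ReLU does not.
big = 10.0
sigmoid_grad = sigmoid(big) * (1 - sigmoid(big))   # vanishing
relu_grad = 1.0                                    # constant for z > 0
print(sigmoid_grad, relu_grad)
```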
40. Optimization Functions
Stochastic Gradient Descent
➔ fast
➔ adaptive variants (AdaGrad, RMSProp)
➔ handles many dimensions
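A bare-bones SGD sketch on a 1-D least-squares toy problem (the data and learning rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)   # true slope = 3

w, lr = 0.0, 0.05
for epoch in range(20):
    for i in rng.permutation(len(X)):           # one sample at a time
        grad = 2 * (w * X[i] - y[i]) * X[i]     # d/dw (w*x - y)^2
        w -= lr * grad                          # gradient step
print(w)   # converges close to the true slope
```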
41. Fixed-size-input Neural Networks
Assumption: we are happy with 5-gram input (really?)
42. Recurrent Neural Networks
Fact: n-gram input has a lot of limitations
43. Neural Networks and Text
Context words (“the”, “cat”, “sat”, ...) plus a bias b = +1 pass through an embedding matrix Emb <vocsize, embsize> and two weight layers W1, b1 <hidden, hidden> and W2, b2 <hidden, class>.

vocabulary size = 1000
embedding size = 50
context = 20
classes = 2, 10, 100 (depends on the problem: next word, sentiment, PoS tagging)
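A shape-checking sketch of this pipeline; the exact wiring of the figure is assumed, here the context embeddings are simply concatenated before the hidden layer:

```python
import numpy as np

vocsize, embsize, context, hidden, classes = 1000, 50, 20, 100, 10
rng = np.random.default_rng(0)

Emb = rng.normal(size=(vocsize, embsize))                # <vocsize, embsize>
W1, b1 = rng.normal(size=(context * embsize, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, classes)), np.zeros(classes)

word_ids = rng.integers(0, vocsize, size=context)        # "the cat sat ..."
h = np.tanh(Emb[word_ids].reshape(-1) @ W1 + b1)         # concat embeddings
scores = h @ W2 + b2                                     # one score per class
print(scores.shape)
```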
44. Neural Networks and Text
Emb <vocsize, embsize>

Words are represented as numeric vectors (they can be subtracted, added, grouped, clustered, ...), and the similarity kernel is learned. This is “knowledge” that can be transferred:
● +1.4% F1 on Dependency Parsing, 15.2% error reduction (Koo & Collins 2008, Brown clustering)
● +3.4% F1 on Named Entity Recognition, 23.7% error reduction (Stanford NER, exchange clustering)
45. Word Embedding: plotting
Courtesy of Christopher Olah
46. Word Embedding: algebraic operations
Courtesy of Christopher Olah
MAN + ‘something good’ == WOMAN
WOMAN - ‘something bad’ == MAN
MAN + ‘something’ == WOMAN
KING + ‘something’ == QUEEN
Identification of text regularities in [3]: 80-1600 dimensions, trained on 320M words of Broadcast News, 82k unique words.
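The arithmetic above can be sketched with tiny made-up vectors (these are not trained embeddings; the words and values are purely illustrative):

```python
import numpy as np

# Toy vectors laid out so that the second axis encodes 'gender' and the
# third encodes 'royalty' -- a hand-built stand-in for learned embeddings.
vecs = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 0.9]),
    "queen": np.array([1.0, 1.0, 0.9]),
    "apple": np.array([0.0, 0.2, 0.0]),
}

def nearest(v, exclude):
    """Word whose vector has the highest cosine similarity to v."""
    return max((w for w in vecs if w not in exclude),
               key=lambda w: v @ vecs[w]
               / (np.linalg.norm(v) * np.linalg.norm(vecs[w])))

# KING + 'something' (the woman - man offset) lands nearest QUEEN
target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))
```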
47. Demo: word embeddings
Training set: 9 GB free text
Vocabulary size: 50,000
Embedding dimensions: 256
Context window: 10
Skip top common words: 100
Layers: [10, 100, 512, 1]
Embeddings: <50000, 256>
48. Feeding the network
Training examples (1 = genuine text, 0 = one word corrupted; the original word is shown in parentheses):

“Neural nets are fun and we are happy” → 1
“Ted, Sarandos who runs Netflix’s Hollywood banana (operation) and” → 0
“makes the company’s deals, with networks and he” → 1
“studios was up first to beer (rehearse) his lines” → 0

Each word is looked up in the embedding matrix Emb <50000x256>.
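A sketch of how such labeled pairs can be generated, assuming a corruption scheme in which one word of a genuine window is swapped for a random vocabulary word (the vocabulary and sentence here are toy data):

```python
import random

random.seed(0)
vocab = ["banana", "beer", "table", "green", "quickly"]

def make_pairs(sentence):
    """Return (genuine, 1) and (corrupted, 0) training examples."""
    words = sentence.split()
    corrupted = words[:]
    corrupted[random.randrange(len(words))] = random.choice(vocab)
    return [(" ".join(words), 1), (" ".join(corrupted), 0)]

for text, label in make_pairs("neural nets are fun and we are happy"):
    print(label, text)
```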
49. Demo: word embeddings pre-processing
● Remove HTML tags
● Replace unicode
● UTF-8 encode
● Tokenize
Run on a 4-node Spark cluster
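A single-machine sketch of these steps (the demo ran them on a 4-node Spark cluster; the regex and the ASCII fallback are simple stand-ins for the real pipeline):

```python
import re
import unicodedata

def preprocess(raw):
    text = re.sub(r"<[^>]+>", " ", raw)                    # remove HTML tags
    text = unicodedata.normalize("NFKD", text)             # replace unicode
    text = text.encode("ascii", "ignore").decode("ascii")  # re-encode
    return text.lower().split()                            # tokenize

print(preprocess("<p>Caf\u00e9 costs \u20ac3</p>"))
```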
51. What’s Next: from word to document embeddings
Distributed Representations of Sentences and
Documents
Quoc Le, Tomas Mikolov, Google Inc
Skip-Thought Vectors
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel,
Antonio Torralba, Raquel Urtasun, Sanja Fidler
52. Who is ‘deep learning’?
Twitter, Pinterest: news delivery, broadcast
Google, Alphabet: Self-Driving Car, Smart Reply, Ads
Facebook, Inc.: automatic tagging, text understanding
53. Conclusion
Deep learning has simplified feature engineering in many cases (it certainly hasn’t removed it).
Less feature engineering is leading to more complex machine learning architectures.
Most of the time, these model architectures are as specific to a given task as feature engineering used to be.
The job of the data scientist will stay sexy for a while (keep your fingers crossed on this one).
54. References
[1] Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts
Stanford University, Stanford, CA 94305, USA
[2] Document Embedding with Paragraph Vectors
Andrew M. Dai, Christopher Olah, Quoc V. Le Google
[3] Linguistic Regularities in Continuous Space Word Representations
Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig, Microsoft Research
[4] Distributed Representations of Sentences and Documents
Quoc Le, Tomas Mikolov, Google Inc
[5] Skip-Thought Vectors
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler
[6] Text Understanding from Scratch
Xiang Zhang, Yann LeCun, Computer Science Department, Courant Institute of Mathematical Sciences, New York University
[7] World of Piggy - Data Science at Home Podcast - History and applications of Deep Learning
http://worldofpiggy.com/history-and-applications-of-deep-learning-a-new-podcast-episode/
55. Thank you
github.com/worldofpiggy @worldofpiggy worldofpiggy@gmail.com worldofpiggy.com