SlideShare a Scribd company logo
1 of 36
Download to read offline
The Author-Topic Model
Presented by:
Darshan Ramakant Bhat
Freeze Francis
Mohammed Haroon D
by M Rosen-Zvi, et al. (UAI 2004)
Overview
● Motivation
● Model Formulation
○ Topic Model ( i.e LDA)
○ Author Model
○ Author Topic Model
● Parameter Estimation
○ Gibbs Sampling Algorithms
● Results and Evaluation
● Applications
Motivation
● joint author-topic modeling had received little attention.
● author modeling has tended to focus on problem of predicting an author given
the document.
● modelling the interests of authors is a fundamental problem raised by a large
corpus (eg: NIPS dataset).
● this paper introduced a generative model that simultaneously models the
content of documents and the interests of authors.
Notations
● Vocabulary set size : V
● Total unique authors : A
● Document d is identified as (wd
, ad
)
● Nd
: number of words in Document d.
● wd
: Vector of Nd
words (subset of vocabulary)
wd
= [w1d
w2d
…….. wNd d
]
● wid
: ith
word in document d
● Ad
: number of authors of Document d
● ad
: Vector of Ad
authors (subset of authors)
ad
= [a1d
a2d
…….. aAd d
]
● Corpus : Collection D documents
= { (w1
, a1
) , (w2
, a2
) ……. (wD
, aD
) }
Plate Notation for Graphical models
● Nodes are random variables
● Edges denote possible dependence
● Observed variables are shaded
● Plates denote replicated structure
Dirichlet distribution
● generalization of beta distribution into multiple dimensions
● Domain : X = (x1
, x2
….. xk
)
○ xi
∈ (0 , 1)
○ k
xi
= 1
○ k-dimensional multinomial distributions
● "distribution over multinomial distributions"
● Parameter : = ( 1
, 2
…… k
)
○ i
> 0 , ∀i
○ determines how the probability mass is distributed
○ Symmetric : 1
= 2
=…… k
● Each sample from a dirichlet is a multinomial distribution.
○ ↑ : denser samples
○ ↓ : sparse samples
Topic 2
(0,1,0)
Topic 1
(1,0,0)
Topic 3
(0,0,1)
.
(⅓,⅓,⅓ )
Fig : Domain of Dirichlet (k=3)
Fig : Symmetric Dirichlet distributions for k=3
pdf
samples
Model Formulation
Topic Model (LDA)
Model topics as distribution over words
Model documents as distribution over topics
Author Model
Model author as distribution over words
Author-Topic Model
Probabilistic model for both author and topics
Model topics as distribution over words
Model authors as distribution over topics
Topic Model : LDA
● Generative model
● Bag of word
● Mixed membership
● Each word in a document has a topic
assignment.
● Each document is associated with a
distribution over topics.
● Each topic is associated with a
distribution over words.
● α : Dirichlet prior parameter on the
per-document topic distributions.
● β : Dirichlet prior parameter on the
per-topic word distribution.
● w : observed word in document
● z : is the topic assigned for w
Fig : LDA model
Author Model
● Simple generative model
● Each author is associated with a
distribution over words.
● Each word has a author assignment.
● β : Dirichlet prior parameter on the
per-author word distributions.
● w : Observed word
● x : is the author chosen for word w
Fig : Author Model
Model Formulation
Topic Model (LDA)
Model topics as distribution over words
Model documents as distribution over topics
Author Model
Model author as distribution over words
Author-Topic Model
Probabilistic model for both author and topics
Model topics as distribution over words
Model authors as distribution over topics
Author-Topic
Model
● Each topic is associated with a
distribution over words.
● Each author is associated with a
distribution over topics.
● Each word in a document has an
author and a topic assignment.
● Topic distribution of document is a
mixture of its authors topic distribution
● α : Dirichlet prior parameter on the
per-author topic distributions.
● x : author chosen for word w
● z : is the topic assigned for w from
author x topic distribution
● w : observed word in document
Fig : Author Topic Model plate notation
Author-Topic
Model
Fig : Author Topic model
Special Cases of ATM
Topic Model (LDA):
Each document is written by a
unique author
● the document determines
the author.
● author assignment becomes
trivial.
● author’s topic distribution
can be viewed as
document’s topic
distribution.
Special Cases of ATM
Author Model:
each author writes about a unique topic
● the author determines the topic.
● topic assignment becomes trivial.
● topic’s word distribution can be
viewed as author’s word
distribution.
Variables to be inferred
Variables to be inferred
Variables to be inferred
Collapsed Gibbs Sampling - LDA
● The assignment variables contain information about and .
● Use this information to directly update the Topic assignment variable with the
next sample. This is Collapsed Gibbs Sampling.
Collapsed Gibbs Sampling - Author Model
● Need to infer Author - word distribution A
● WIth Collapsing GIbbs sampling, directly infer author word assignments
Author Topic Model - Collapsed Gibbs Sampling
● Need to infer T
and ϴA
● Instead through
collapsed Gibbs
Sampling directly infer
Topic and Author
assignments
● By finding the author -
topic joint distribution
conditioned on all other
variables
Example
AUTHOR 1 2 2 1 1
TOPIC 3 2 1 3 1
WORD epilepsy dynamic bayesian EEG model
TOPIC-1 TOPIC-2 TOPIC-3
epilepsy 1 0 35
dynamic 18 20 1
bayesian 42 15 0
EEG 0 5 20
model 10 8 1
...
TOPIC-1 TOPIC-2 TOPIC-3
AUTHOR-1 20 14 2
AUTHOR-2 20 21 5
AUTHOR-3 11 8 15
AUTHOR-4 5 15 14
...
Example
AUTHOR 1 2 ? 1 1
TOPIC 3 2 ? 3 1
WORD epilepsy dynamic bayesian EEG model
TOPIC-1 TOPIC-2 TOPIC-3
epilepsy 1 0 35
dynamic 18 20 1
bayesian 42(41) 15 0
EEG 0 5 20
model 10 8 1
...
TOPIC-1 TOPIC-2 TOPIC-3
AUTHOR-1 20 14 2
AUTHOR-2 20(19) 21 5
AUTHOR-3 11 8 15
AUTHOR-4 5 15 14
...
Author-Topic How much topic ‘j’ likes the word ‘bayesian’ How much author ‘k’ likes the topic
A1T1 41/97 20/36
A2T1 41/97 19/45
A1T2 15/105 14/36
A2T2 ... ...
A2T3 ... ...
A2T3 ... ...
● To draw new Author-Topic assignment (equivalently)
○ Roll K X ad
sided die with these probabilities
○ Assign the Author-Topic tuple to the word
AUTHOR 1 2 1 1 1
TOPIC 3 2 3 3 1
WORD epilepsy dynamic bayesian EEG model
Example (cont..)
● Update the probability ( T
and A
) tables with new values
● This implies increasing the popularity of topic in author and word in topic
Sample Output (CORA dataset)Topicprobabilities
Topics shown with 10 most probable words
Author :
Experimental Results
● Experiment is done on two types of dataset, papers are chosen from NIPS and
CiteSeer database
● Extremely common words are removed from the corpus, leading V=13,649
unique words in NIPS and V=30,799 in CiteSeer, 2037 authors in NIPS and
85,465 in CiteSeer
● NIPS contains many documents which are closely related to learning theory
and machine learning
● CiteSeer contains variety of topics from User Interface to Solar astrophysics
Examples of topic author distribution
Importance ofTopicprobabilityofarandomauthor
Evaluating the predictive power
● Predictive power of the model is measured quantitatively using perplexity.
● Perplexity is ability to predict the words on a new unseen documents
● Perplexity of set of d words in a document (wd
,ad
) is
● When p(wd
|ad
) is high perplexity ≈ 1, otherwise it will be a high positive
quantity.
Continued….
● Approximate equation to calculate the joint probability
Continued….
Applications
● It can be used in automatic reviewer recommendation for a paper review.
● Given an abstract of a paper and a list of the authors plus their known past
collaborators, generate a list of other highly likely authors for this abstract who
might serve as good reviewers.
References
● Rosen-Zvi, Michal, et al. "The author-topic model for authors and documents."
Proceedings of the 20th conference on Uncertainty in artificial intelligence.
AUAI Press, 2004.
● D. M. Blei, A. Y. Ng, and M. I. Jordan (2003). “Latent Dirichlet Allocation”.
Journal of Machine Learning Research, 3 : 993–1022.
● T. L. Griffiths, and M. Steyvers (2004). “Finding scientific topics”. Proceedings
of the National Academy of Sciences, 101 (suppl. 1), 5228–5235.
● https://github.com/arongdari/python-topic-model
● https://www.coursera.org/learn/ml-clustering-and-retrieval/home/welcome
Questions?

More Related Content

What's hot

ЖДҮ-г хөгжүүлэх, Байгаль орчныг хамгаалах төслийн зээл
ЖДҮ-г хөгжүүлэх, Байгаль орчныг хамгаалах төслийн зээлЖДҮ-г хөгжүүлэх, Байгаль орчныг хамгаалах төслийн зээл
ЖДҮ-г хөгжүүлэх, Байгаль орчныг хамгаалах төслийн зээлBayarlkham Byambaa
 
Micro l1.2019 2020
Micro l1.2019 2020Micro l1.2019 2020
Micro l1.2019 2020hicheel2020
 
Customizing LLMs
Customizing LLMsCustomizing LLMs
Customizing LLMsJim Steele
 
Бизнес төлөвлөгөө боловсруулах зах зээлийн судалгаа
Бизнес төлөвлөгөө боловсруулах зах зээлийн судалгааБизнес төлөвлөгөө боловсруулах зах зээлийн судалгаа
Бизнес төлөвлөгөө боловсруулах зах зээлийн судалгааГончигжавын Болдбаатар
 
Dad_7
Dad_7Dad_7
Dad_7oz
 
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...Madhav Mishra
 
Managment 09
Managment 09Managment 09
Managment 09ub_old
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
Хэрэгцээт материалын төлөвлөлт-Үйл ажиллагааны менежмент /Хураангуй/
Хэрэгцээт материалын төлөвлөлт-Үйл ажиллагааны менежмент /Хураангуй/Хэрэгцээт материалын төлөвлөлт-Үйл ажиллагааны менежмент /Хураангуй/
Хэрэгцээт материалын төлөвлөлт-Үйл ажиллагааны менежмент /Хураангуй/Adilbishiin Gelegjamts
 
Lecture.12
Lecture.12Lecture.12
Lecture.12Tj Crew
 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfPo-Chuan Chen
 

What's hot (20)

Ood lesson4
Ood lesson4Ood lesson4
Ood lesson4
 
Ood lesson2
Ood lesson2Ood lesson2
Ood lesson2
 
ЖДҮ-г хөгжүүлэх, Байгаль орчныг хамгаалах төслийн зээл
ЖДҮ-г хөгжүүлэх, Байгаль орчныг хамгаалах төслийн зээлЖДҮ-г хөгжүүлэх, Байгаль орчныг хамгаалах төслийн зээл
ЖДҮ-г хөгжүүлэх, Байгаль орчныг хамгаалах төслийн зээл
 
Node.js гэж юу вэ?
Node.js гэж юу вэ?Node.js гэж юу вэ?
Node.js гэж юу вэ?
 
олон улсын маркетингийн тухай ойлголт
олон улсын маркетингийн тухай ойлголтолон улсын маркетингийн тухай ойлголт
олон улсын маркетингийн тухай ойлголт
 
Micro l1.2019 2020
Micro l1.2019 2020Micro l1.2019 2020
Micro l1.2019 2020
 
Customizing LLMs
Customizing LLMsCustomizing LLMs
Customizing LLMs
 
Бизнес төлөвлөгөө боловсруулах зах зээлийн судалгаа
Бизнес төлөвлөгөө боловсруулах зах зээлийн судалгааБизнес төлөвлөгөө боловсруулах зах зээлийн судалгаа
Бизнес төлөвлөгөө боловсруулах зах зээлийн судалгаа
 
C++ vndsen oilgolt хичээл 1
C++ vndsen oilgolt хичээл 1C++ vndsen oilgolt хичээл 1
C++ vndsen oilgolt хичээл 1
 
Dad_7
Dad_7Dad_7
Dad_7
 
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...
 
Managment 09
Managment 09Managment 09
Managment 09
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
Ood lesson11 sequence
Ood lesson11 sequenceOod lesson11 sequence
Ood lesson11 sequence
 
Хэрэгцээт материалын төлөвлөлт-Үйл ажиллагааны менежмент /Хураангуй/
Хэрэгцээт материалын төлөвлөлт-Үйл ажиллагааны менежмент /Хураангуй/Хэрэгцээт материалын төлөвлөлт-Үйл ажиллагааны менежмент /Хураангуй/
Хэрэгцээт материалын төлөвлөлт-Үйл ажиллагааны менежмент /Хураангуй/
 
Lecture.12
Lecture.12Lecture.12
Lecture.12
 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
 
Lec 10 12
Lec 10 12Lec 10 12
Lec 10 12
 
олон улсын маркетинг 1
олон улсын маркетинг 1олон улсын маркетинг 1
олон улсын маркетинг 1
 
6 р анги мзүй нэгж
6 р анги мзүй нэгж6 р анги мзүй нэгж
6 р анги мзүй нэгж
 

Similar to Author Topic Model

TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
graduate_thesis (1)
graduate_thesis (1)graduate_thesis (1)
graduate_thesis (1)Sihan Chen
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentationSoojung Hong
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesBryan Gummibearehausen
 
Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008Roman Stanchak
 
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...Aaron Li
 
Author paper identification problem final presentation
Author  paper identification problem final presentationAuthor  paper identification problem final presentation
Author paper identification problem final presentationPooja Mishra
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015rusbase
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingSujit Pal
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsClaudia Wagner
 
A Neural Probabilistic Language Model_v2
A Neural Probabilistic Language Model_v2A Neural Probabilistic Language Model_v2
A Neural Probabilistic Language Model_v2Jisoo Jang
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes DeumicMachine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes DeumicInstitute of Contemporary Sciences
 

Similar to Author Topic Model (20)

Topic modelling
Topic modellingTopic modelling
Topic modelling
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
graduate_thesis (1)
graduate_thesis (1)graduate_thesis (1)
graduate_thesis (1)
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News Stories
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008
 
话题模型2
话题模型2话题模型2
话题模型2
 
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
 
Author paper identification problem final presentation
Author  paper identification problem final presentationAuthor  paper identification problem final presentation
Author paper identification problem final presentation
 
Lec1
Lec1Lec1
Lec1
 
Spatial LDA
Spatial LDASpatial LDA
Spatial LDA
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
 
Probabilistic content models,
Probabilistic content models,Probabilistic content models,
Probabilistic content models,
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic Models
 
A Neural Probabilistic Language Model_v2
A Neural Probabilistic Language Model_v2A Neural Probabilistic Language Model_v2
A Neural Probabilistic Language Model_v2
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes DeumicMachine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
 

Recently uploaded

Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Author Topic Model

  • 1. The Author-Topic Model Presented by: Darshan Ramakant Bhat Freeze Francis Mohammed Haroon D by M Rosen-Zvi, et al. (UAI 2004)
  • 2. Overview ● Motivation ● Model Formulation ○ Topic Model ( i.e LDA) ○ Author Model ○ Author Topic Model ● Parameter Estimation ○ Gibbs Sampling Algorithms ● Results and Evaluation ● Applications
  • 3. Motivation ● joint author-topic modeling had received little attention. ● author modeling has tended to focus on problem of predicting an author given the document. ● modelling the interests of authors is a fundamental problem raised by a large corpus (eg: NIPS dataset). ● this paper introduced a generative model that simultaneously models the content of documents and the interests of authors.
  • 4. Notations ● Vocabulary set size : V ● Total unique authors : A ● Document d is identified as (wd , ad ) ● Nd : number of words in Document d. ● wd : Vector of Nd words (subset of vocabulary) wd = [w1d w2d …….. wNd d ] ● wid : ith word in document d ● Ad : number of authors of Document d ● ad : Vector of Ad authors (subset of authors) ad = [a1d a2d …….. aAd d ] ● Corpus : Collection D documents = { (w1 , a1 ) , (w2 , a2 ) ……. (wD , aD ) }
  • 5. Plate Notation for Graphical models ● Nodes are random variables ● Edges denote possible dependence ● Observed variables are shaded ● Plates denote replicated structure
  • 6. Dirichlet distribution ● generalization of beta distribution into multiple dimensions ● Domain : X = (x1 , x2 ….. xk ) ○ xi ∈ (0 , 1) ○ k xi = 1 ○ k-dimensional multinomial distributions ● "distribution over multinomial distributions" ● Parameter : = ( 1 , 2 …… k ) ○ i > 0 , ∀i ○ determines how the probability mass is distributed ○ Symmetric : 1 = 2 =…… k ● Each sample from a dirichlet is a multinomial distribution. ○ ↑ : denser samples ○ ↓ : sparse samples Topic 2 (0,1,0) Topic 1 (1,0,0) Topic 3 (0,0,1) . (⅓,⅓,⅓ ) Fig : Domain of Dirichlet (k=3)
  • 7. Fig : Symmetric Dirichlet distributions for k=3 pdf samples
  • 8. Model Formulation Topic Model (LDA) Model topics as distribution over words Model documents as distribution over topics Author Model Model author as distribution over words Author-Topic Model Probabilistic model for both author and topics Model topics as distribution over words Model authors as distribution over topics
  • 9. Topic Model : LDA ● Generative model ● Bag of word ● Mixed membership ● Each word in a document has a topic assignment. ● Each document is associated with a distribution over topics. ● Each topic is associated with a distribution over words. ● α : Dirichlet prior parameter on the per-document topic distributions. ● β : Dirichlet prior parameter on the per-topic word distribution. ● w : observed word in document ● z : is the topic assigned for w Fig : LDA model
  • 10. Author Model ● Simple generative model ● Each author is associated with a distribution over words. ● Each word has a author assignment. ● β : Dirichlet prior parameter on the per-author word distributions. ● w : Observed word ● x : is the author chosen for word w Fig : Author Model
  • 11. Model Formulation Topic Model (LDA) Model topics as distribution over words Model documents as distribution over topics Author Model Model author as distribution over words Author-Topic Model Probabilistic model for both author and topics Model topics as distribution over words Model authors as distribution over topics
  • 12. Author-Topic Model ● Each topic is associated with a distribution over words. ● Each author is associated with a distribution over topics. ● Each word in a document has an author and a topic assignment. ● Topic distribution of document is a mixture of its authors topic distribution ● α : Dirichlet prior parameter on the per-author topic distributions. ● x : author chosen for word w ● z : is the topic assigned for w from author x topic distribution ● w : observed word in document Fig : Author Topic Model plate notation
  • 14. Special Cases of ATM Topic Model (LDA): Each document is written by a unique author ● the document determines the author. ● author assignment becomes trivial. ● author’s topic distribution can be viewed as document’s topic distribution.
  • 15. Special Cases of ATM Author Model: each author writes about a unique topic ● the author determines the topic. ● topic assignment becomes trivial. ● topic’s word distribution can be viewed as author’s word distribution.
  • 16. Variables to be inferred
  • 17. Variables to be inferred
  • 18. Variables to be inferred
  • 19. Collapsed Gibbs Sampling - LDA ● The assignment variables contain information about and . ● Use this information to directly update the Topic assignment variable with the next sample. This is Collapsed Gibbs Sampling.
  • 20. Collapsed Gibbs Sampling - Author Model ● Need to infer Author - word distribution A ● WIth Collapsing GIbbs sampling, directly infer author word assignments
  • 21. Author Topic Model - Collapsed Gibbs Sampling ● Need to infer T and ϴA ● Instead through collapsed Gibbs Sampling directly infer Topic and Author assignments ● By finding the author - topic joint distribution conditioned on all other variables
  • 22. Example AUTHOR 1 2 2 1 1 TOPIC 3 2 1 3 1 WORD epilepsy dynamic bayesian EEG model TOPIC-1 TOPIC-2 TOPIC-3 epilepsy 1 0 35 dynamic 18 20 1 bayesian 42 15 0 EEG 0 5 20 model 10 8 1 ... TOPIC-1 TOPIC-2 TOPIC-3 AUTHOR-1 20 14 2 AUTHOR-2 20 21 5 AUTHOR-3 11 8 15 AUTHOR-4 5 15 14 ...
  • 23. Example AUTHOR 1 2 ? 1 1 TOPIC 3 2 ? 3 1 WORD epilepsy dynamic bayesian EEG model TOPIC-1 TOPIC-2 TOPIC-3 epilepsy 1 0 35 dynamic 18 20 1 bayesian 42(41) 15 0 EEG 0 5 20 model 10 8 1 ... TOPIC-1 TOPIC-2 TOPIC-3 AUTHOR-1 20 14 2 AUTHOR-2 20(19) 21 5 AUTHOR-3 11 8 15 AUTHOR-4 5 15 14 ...
  • 24. Author-Topic How much topic ‘j’ likes the word ‘bayesian’ How much author ‘k’ likes the topic A1T1 41/97 20/36 A2T1 41/97 19/45 A1T2 15/105 14/36 A2T2 ... ... A2T3 ... ... A2T3 ... ...
  • 25. ● To draw new Author-Topic assignment (equivalently) ○ Roll K X ad sided die with these probabilities ○ Assign the Author-Topic tuple to the word AUTHOR 1 2 1 1 1 TOPIC 3 2 3 3 1 WORD epilepsy dynamic bayesian EEG model
  • 26. Example (cont..) ● Update the probability ( T and A ) tables with new values ● This implies increasing the popularity of topic in author and word in topic
  • 27. Sample Output (CORA dataset)Topicprobabilities Topics shown with 10 most probable words Author :
  • 28. Experimental Results ● Experiment is done on two types of dataset, papers are chosen from NIPS and CiteSeer database ● Extremely common words are removed from the corpus, leading V=13,649 unique words in NIPS and V=30,799 in CiteSeer, 2037 authors in NIPS and 85,465 in CiteSeer ● NIPS contains many documents which are closely related to learning theory and machine learning ● CiteSeer contains variety of topics from User Interface to Solar astrophysics
  • 29. Examples of topic author distribution
  • 31. Evaluating the predictive power ● Predictive power of the model is measured quantitatively using perplexity. ● Perplexity is ability to predict the words on a new unseen documents ● Perplexity of set of d words in a document (wd ,ad ) is ● When p(wd |ad ) is high perplexity ≈ 1, otherwise it will be a high positive quantity.
  • 32. Continued…. ● Approximate equation to calculate the joint probability
  • 34. Applications ● It can be used in automatic reviewer recommendation for a paper review. ● Given an abstract of a paper and a list of the authors plus their known past collaborators, generate a list of other highly likely authors for this abstract who might serve as good reviewers.
  • 35. References ● Rosen-Zvi, Michal, et al. "The author-topic model for authors and documents." Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press, 2004. ● D. M. Blei, A. Y. Ng, and M. I. Jordan (2003). “Latent Dirichlet Allocation”. Journal of Machine Learning Research, 3 : 993–1022. ● T. L. Griffiths, and M. Steyvers (2004). “Finding scientific topics”. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228–5235. ● https://github.com/arongdari/python-topic-model ● https://www.coursera.org/learn/ml-clustering-and-retrieval/home/welcome