SlideShare ist ein Scribd-Unternehmen logo
1 von 11
Downloaden Sie, um offline zu lesen
International Journal of Smart Computing and Information Technology
2020, Vol. 1, No. 1, pp. 7–17
Copyright © 2020 BOHR Publishers
www.bohrpub.com
Detecting Paraphrases in Marathi Language
Shruti Srivastava and Sharvari Govilkar
Department of Computer Engineering, PCE, University of Mumbai, India
E-mail: shrutics2015@gmail.com; sgovilkar@mes.ac.in
Abstract. Paraphrasing refers to the sentences that either differs in their textual content or dissimilar in rearrange-
ment of words but convey the same meaning. Identifying a paraphrase is exceptionally important in various real
life applications such as Information Retrieval, Plagiarism Detection, Text Summarization and Question Answering.
A large amount of work in Paraphrase Detection has been done in English and many Indian Languages. However,
there is no existing system to identify paraphrases in Marathi. This is the first such endeavor in Marathi Language.
A paraphrase has different structured sentences and Marathi being semantically strong language hence this
system is designed for checking both statistical and semantic similarity of Marathi sentences. Statistical similarity
measure does not need any prior knowledge as it is only based on the factual data of sentences. The factual data is
calculated on the basis of the degree of closeness between the word-set, word-order, word-vector and word-distance.
Universal Networking Language (UNL) speaks about the semantic significance in the sentence without any syntac-
tic point of interest. Hence, the semantic similarity calculated on the basis of generated UNL graphs for two Marathi
sentences renders semantic equality of two Marathi sentences. The total paraphrase score was calculated after joining
statistical and semantic similarity scores which gives the judgement of being paraphrase or non-paraphrase about
the Marathi sentences.
Keywords: Paraphrase, Marathi Language Statistical, Semantic, Sumo metric, Universal Networking Language
(UNL).
1 Introduction
Paraphrase is the translation of a sentence or a para-
graph into same language. Paraphrasing occurs when texts
are lexically or syntactically modified to appear different
texts, but retaining the same intention. Paraphrase can be
generated, extracted and identified. Paraphrase extraction
involves collection of different words or phrases which
express same or almost same meaning. Vocabulary plays
an important role in paraphrase extraction. Paraphrase
extraction helps in paraphrase generation in which para-
phrases are created. Paraphrase generation involves not
only dictionary exercise but also changing the information
sequence and grammatical structure.
Paraphrase identification is method of detecting the
variety of expressions which conveying same meaning. It
presents a major challenge for numerous NLP applications.
In automatic summarization, identifying paraphrases is
necessary to find repetitive information in the document.
In information extraction, paraphrase identification pro-
vides most significant information whereas in information
retrieval query paraphrases are generated to retrieve better
quality of relevant data. In question and answering system,
in absence of question from database the answers returned
for the question paraphrase is always helpful. The base of
paraphrasing is semantic equivalence which gives alter-
native translation in the same language. For paraphrase
detection it is necessary to study the possibilities of para-
phrasing at each level. Mainly there are 3 types of surface
paraphrases.
Lexical level: Lexical paraphrases occur when synonyms
appear in almost identical sentences. At lexical level, in
addition to synonym lexical paraphrasing is described by
hyperonymy. In hyperonymy one word has many alterna-
tives but only one of the words is more general or specific
than the other one.
Example – synonyms (solve and resolve), (पु – सुता, )
Hyperonymy (reply, say), (landlady, hostess) {घर} has
{सदन, शाला, आलय, धाम} hypernym
S1: मोदी सरकारन दोन पया
य समोर ठ
ेवल
े आह
े त. S1: The Modi
government has put forward two options.
S2: मोदी सरकारन
े दोन मा
ंडल
े आह
े त. S2: The Modi
government has given two options.
7
8 Shruti Srivastava and Sharvari Govilkar
Phrase level: Phrasal paraphrase states that different
phrases shared same semantic content. The phrasal para-
phrases include syntactic phrases as well as pattern forma-
tion with connected factors.
Example S1: ֆ֚֡֔֠ֈ֚֞֞
ե֊֠ ֒֞֐֑֞օ ֛֧֟֔֟֔. (Tulsidas wrote
Ramayana.)
S2: ֒֞֐֑֞օ ֆ֚֡֔֠ֈ֚֞֞
ե֊֠ ֆ֚֡֔֠ֈ֚֞֞
ե֊֠. (Ramayana written by
Tulsidas.)
S3: ֆ֚֡֔֠ֈ֚֞֞
ե֊֠ ֔ոշ
֧ ֡
ֆ֚֔֠ֈ֚֞ ֧
ը֛ֆ. (Ramayana’s author is
Tulsidas.)
S4: ֡
ֆ֚֔֠ֈ֚֞֞ե֊֠ ֒֞֐֑֞օ֞ռ֠ ֒ռ֊֞ շ֔֠
֧ . (Tulsidas composed the
Ramayana.)
Sentential paraphrases: In this type of paraphrase one sen-
tence is totally replaced by other sentence but still both
sentences keeping the same meaning.
Example –
S1: ֧ ֐֒֞ւ֠ ֟֊֒֞֕֠ ֧
ը֛. (Each region of
Maharashtra has a different Marathi language)
S2: ֐֒֞ւ֠ ֏֞֙֞ ֚֗
Sentential paraphrases: In this type of paraphrase one sentence is totally replaced by
keeping the same meaning.
Example –
S1: मिाराष्ट्रातील प्रत्येक प्राांताची मराठी भाषा लनराळी आिे. (Each region of Maharashtra has a d
S2: मराठी भाषा सर्व मिाराष्ट्राची जरी एक असली तरी तीच दर बारा कोसार्र बदलते.
(Marathi language is the only one in the whole of Maharashtra, the same rate varies fro
II. Literature Survey
This section presents various techniques for paraphrase detection in Indian languages. Par
applied on various languages to detect whether two sentences were paraphrase of each oth
Table 2.1: Summary of Literature Review
SNo Paper
Language: English
1
Title Paraphrase detection using Semantic relatedness based on
Synset Shortest path in WordNet
Semant
approac
Author Jun Choi Lee & Yu-N Cheah, August 2016
2
Title Learning Paraphrase Identification with structural
alignment Alignm
Approa
Author Chen Liang, Praveen Paritosh, Vinodh Rajendran, Kenneth
D. Forbus, July 2016
3
Title Paraphrase Detection based on Identical Phrase and
Similar Word matching
SimMat
combin
metrics
Author Hoang-Quoc, Nguyen-Son,Yusuke Miyao, Isao
Echizen,2015
4 Title Paraphrase detection using string similarity with synonyms Use of s
similari
Author Jun Choi Lee & Yu-N Cheah,2015
5
Title Re-examining Machine Translation Metrics for Paraphrase
Identification
Combin
metrics.
Author Nitin Madnani, Joel Tetreault, Martin Chodorow, 2012
6
Title Paraphrase Identification Using Weighted Dependencies
and word semantics
Calcula
dissimil
using de
Author Mihai Lintean and Vaile Rus, 2009
7 Title A Semantic Similarity Approach to paraphrase Detection JCN w
with ma
Author Fernando and Stevenson, August 2008
8 Title A Metric for Paraphrase Detection Paraphr
using S
Author Cordeiro, Dias, Brazdil , 2007(IEEE)
9
Title Paraphrase Identification on the Basis of Supervised
Machine Learning Techniques
Combin
and sem
Author Kozareva and Montoyo,2006
10
Title CFILT-CORE: Semantic Textual Similarity using
Universal Networking Language
Describ
similari
extracti
Univers
finding
pair of s
Author Avishek Dan & Pushpak Bhattacharyya, IIT Bombay
Mumbai
Language : Hindi
11
Title A Novel approach to Paraphrase Hindi Sentences using
NLP
Creating
and anto
Author N.Sethi,P. Agrawal, V. Madaan, S.K. Singh, 2016
12 Title Paraphrase Detection in Hindi Language using
Syntactic Features of Phrase
Random
classific
method
for ide
languag
Author
Rupal Bhargava, Anushka Baoni, Harshit Jain, 2016
13 Title Hindi Paraphrase Detection using Natural Language
Processing Techniques & Semantic Similarity
Computations
Discove
semanti
done by
.
(Marathi language is the only one in the whole of Maha-
rashtra, the same rate varies from 12 to 12 degrees.)
2 Literature survey
This section presents various techniques for paraphrase
detection in Indian languages. Paraphrase detection tech-
niques have been applied on various languages to detect
whether two sentences were paraphrase of each other
or not.
Table 1. Summary of literature review.
SNo Paper Algorithm Accuracy F-score
Language: English
1 Title Paraphrase detection using Semantic relatedness
based on Synset Shortest path in WordNet
Semantic relatedness
approach
71.1% 81.1%
Author Jun Choi Lee & Yu-N Cheah, August 2016
2 Title Learning Paraphrase Identification with structural
alignment
Alignment based
Approach
78.6% 85.3%
Author Chen Liang, Praveen Paritosh, Vinodh Rajendran,
Kenneth D. Forbus, July 2016
3 Title Paraphrase Detection based on Identical Phrase
and Similar Word matching
SimMat metric
combined with 8 MT
metrics
77.6% 83.9%
Author Hoang-Quoc, Nguyen-Son,Yusuke Miyao, Isao
Echizen, 2015
4 Title Paraphrase detection using string similarity with
synonyms
Use of synonyms in text
similarity metrics
70.6% 80.4%
Author Jun Choi Lee & Yu-N Cheah, 2015
5 Title Re-examining Machine Translation Metrics for
Paraphrase Identification
Combination of 8 MT
metrics
77.4% 84.1%
Author Nitin Madnani, Joel Tetreault, Martin Chodorow,
2012
6 Title Paraphrase Identification Using Weighted
Dependencies and word semantics
Calculate similarity and
dissimilarity scores
using dependencies
72.41% 81.32%
Author Mihai Lintean and Vaile Rus, 2009
7 Title A Semantic Similarity Approach to paraphrase
Detection
JCN word similarity
with matrix
74.1% 82.4%
Author Fernando and Stevenson, August 2008
8 Title A Metric for Paraphrase Detection Paraphrase detection
using SumoMetric
78.19% 80.92%
Author Cordeiro, Dias, Brazdil, 2007 (IEEE)
9 Title Paraphrase Identification on the Basis of
Supervised Machine Learning Techniques
Combination of lexical
and semantic features
76.6% 79.6%
Author Kozareva and Montoyo, 2006
10 Title CFILT-CORE: Semantic Textual Similarity using
Universal Networking Language
Described a syntactic and word level similarity method
combining with a semantic extraction method which was based
on Universal Networking Language (UNL) for finding
semantic similarity score between a pair of sentences
Author Avishek Dan & Pushpak Bhattacharyya, IIT
Bombay Mumbai
(Continued)
Detecting Paraphrases in Marathi Language 9
Table 1. Continued
SNo Paper Algorithm Accuracy F-score
Language: Hindi
11 Title A Novel approach to Paraphrase Hindi Sentences using NLP Creating paraphrase by applying synonyms and
antonyms replacement
Author N. Sethi, P. Agrawal, V. Madaan, S.K. Singh, 2016
12 Title Paraphrase Detection in Hindi Language using Syntactic
Features of Phrase
Random forest classifier used for classification and
Leven-shtein Distance method used for similarity
score calculation for identifying paraphrases in the
Hindi language
Author Rupal Bhargava, Anushka Baoni, Harshit Jain, 2016
13 Title Hindi Paraphrase Detection using Natural Language
Processing Techniques & Semantic Similarity Computations
Discovered synonym WordNet using a semantic
similarity metric and classification done by
paraphrase detection decision tree classifier
Author Vani K, Deepa Gupta, 2016
14 Title A Novel Paraphrase Detection Method in Hindi Language
using Machine Learning
Random forest classifier and different machine
learning algorithms were used for removing
overlapping words and calculated normalized IDF
score for paraphrase detection
Author Anuj Saini, 2016
Language: Malayalam
15 Title Malayalam Paraphrase Detection Procured CUSAT Malayalam Wordnet for calculating
sentence similarity using matching tokens, lemmas
and synonyms
Author Sindhu L. and Sumam Mary Idicula, 2016
16 Title Detecting Paraphrase in Indian Languages – Malayalam Calculated cosine similarity score using model named
Bag of Word and a threshold for calculating
paraphrase score
Author Manju K and Sumam Mary Idicula, 2016
17 Title Paraphrase Identification using Malayalam sentences Detecting paraphrase by Statistical measure and
semantic analysis
Author Ditty Mathew, Dec. 2013 (IEEE)
Language : Tamil
18 Title Detection of Paraphrases on Indian Languages Tamil Shallow parser used for extracting sixteen
diverse morphological features.
Author R. Thangarajan, S. V. Kogilavani, A. Karthic, S. Jawahar, 2016
Languages (Tamil, Malayalam, Hindi, Punjabi)
19 Title Detecting Paraphrases in Indian Languages based on
Gradient Tree Boosting
Employed a Cosine, Dice Distance, Jaccard Coefficient
and METEOR features and Gradient Boosting Tree
supervised classification method
Author Leilei Kong, Zhenyuan Hao, Kaisheng Chen, Zhongyuan
Han, Liuyang Tian, Haoliang Qi, 2016
20 Title Language Independent Paraphrases Detection Probabilistic Neural Network (PNN) has been used to
train Jaccard, Cosine Similarity and length normalized
Edit Distance language independent feature-set to
detect paraphrases in Indian languages
Author Sandip Sarkar, Partha Pakray, Saurav Saha, Dipankar Das,
Jereemi Bentham, Alexander Gelbukh, 2016
21 Title Detecting Paraphrases in Indian Languages Using
Multinomial Logistic Regression Model
The paper described about lexical and semantic level
similarities between two sentences using a trained
multinomial logistic regression model for paraphrase
identification and attain maximum f-measure
Author Kamal Sarkar, 2016
22 Title Paraphrase Detection in Indian Languages – A Machine
Learning Approach
The team worked on Hindi, Tamil, Punjabi and
Malayalam sentences. The paraphrase decision is
taken into consideration by various similarity
measures, machine translation evaluation metrics and
machine learning framework
Author Tanik Saikh, Sudip Kumar, Naskar Sivaji, Bandyopadhyay,
2016
10 Shruti Srivastava and Sharvari Govilkar
3 Proposed system
The suggested system is designed to take the input as set
of two Marathi sentences. The pre-processing can be done
on sentences for collecting actual root words. Then the pre-
processed data is given as input to the different statistical
and semantic similarity to find paraphrases. The final simi-
larity score is generated by combining maximum statistical
similarity score and semantic similarity score. The final
paraphrase judgement is decided based on a set threshold
value which gives the decision of Paraphrase (P) or Non-
Paraphrase (NP).
In this approach system takes one original Marathi sen-
tences and one suspicious paraphrase sentence as an input.
The sentences then pre-process through various phases
including pre-processing by tokenization and input valida-
tion by stop word removing, stemming and morphological
analysis. After which the pre-processed data is fetched to
the matrix calculation for paraphrase detection.
3.1 Pre-processing
The first step is to present two Marathi sentences into text
format.
Sentence 1:
होळीचा सण साजरा करता
ंना काळजी . (The environment
should be taking care of while celebrating Holi festival.)
Sentence 2:
पयावणाच भान ठ
ेवून होळीचा सण साजरा करावा. (Celebrate the Holi fes-
tival after keeping environment in mind.)
Figure 1. Proposed system architecture.
The pre-processing of the documents involves following
methods:
Tokenization
Tokenization is the method of identification and separation
of tokens from two Marathi input sentences. Lexicon is the
individual word of the sentence. The word boundaries in
the Marathi language are fixed as it is a segmented lan-
guage. The spaces between the words are used to separate
one word from another. The process of separation of tokens
involves removal of delimiters like white spaces and punc-
tuation marks.
Validation of Devanagari script
A Marathi language is written in Devanagari script. Input
sentences validation is an important phase due to the
language dependent information and type of query pro-
vided to the system. The characters in Devanagari script
are recognized with Unicode values from UTF-8. Com-
paring each and every character with UTF-8 list, invalid
Devanagari characters were removed from the sentences.
The valid characters in Devanagari script were retained for
next phase.
Stop word elimination
Stop words are the utmost irrelevant words which delay
the sentence processing. These stop words are most often
occurring words in any document. The list of stop words
includes articles, prepositions and other function words.
Hence to enhance the processing speed of the system, the
stop words needs to identified and removed properly.
Stemming
Stemming is a very important step in the system. In this
phase, suffix list is used to remove suffixes from words for
creating exact stem word as the stem may not be the lin-
guistic root of the word.
Morphological analysis
The morphological analyzer identifies the inner structure
of the word. The root words from the given are expected to
produce from a morphological analyzer. After stemming,
the words are evaluated for any inflection. The perfect root
words can be produced by generating and matching rules.
Addition or replacement of characters to the inflected stem
word results into the precise core word.
3.2 Statistical similarity
The tokens created in tokenization process are the root
words but it is necessary to distinguish the tokens. Sta-
tistical similarity is calculated by considering tokens of
Detecting Paraphrases in Marathi Language 11
both sentences and compares them on the basis of word-
set, word-order, word-vector and sumo-metric (word dis-
tance). The system uses 4 statistical methods for statistical
comparisons which are explained as follows:
A. Jaccard similarity
It is based on the similarity and differences between two
word sets from the pair of sentences [10]. Jaccard similarity
coefficient is calculated by taking ratio of the total number
of matched unigrams to the total number of unmatched
words in both the sentences. The Jaccard similarity is
defined using following equation;
Jaccard Similarity (S1, S2) =
S1 ∩ S2
S1 ∪ S2
Example: Jaccard Similarity
Sentences Score
S1 होळी सण साजरा करता
ं ना S1 ∪ S2 = 4
काळजी {होळी, सण, साजरा, }
S2 पयावण भान ठ
ेवून होळी सण साजरा S1 ∩ S2 = 9
Jaccard Similarity
(S1, S2) = 4/9 = 0.4
B. Cosine similarity
Cosine similarity [10] is the most common statistical fea-
ture to measure the similarity between word vectors of two
sentences. For each sentence, a word vector of root words
of those sentences is formed which interprets the frequency
of words in the sentences. Cosine similarity is the ratio of
the dot product of those two word vectors and the product
of their lengths.
Cosine Similarity (S1, S2) =
S1 · S2
|S1| |S2|
Hence, Cosine similarity (S1, S2) = 0.91
C. Word order similarity
The vectors of the two sentences were considered for cal-
culating Word order similarity between them. The vectors
of two sentences constructed for the calculating using fol-
lowing equations:
L(Sa) = {(wa1, wa2), (wa1, wa3), . . . , (wa(i−1), wai)}
L(Sb) = {(wb1, wb2), (wb1, wb3), . . . , (wb(i−1), wbi)}
Where vector (wa1, wa2, . . . , wai) and vector (wb1,
wb2, . . . , wbi) constructed from sentences Sa and sen-
tence Sb tokens respectively. wx is before wy in (wx, wy) ∈
L (Sa) ∪ L(Sb). The similarity calculation between Sa and
Sb can be done [10] on the basis of following equation:
WordOrder(Sa, Sb) =
|L(Sa) ∩ L(Sb)|
|L(Sa) ∪ L(Sb)|
L(Sa) ∩ L(Sb) = {होळी सण, होळी साजरा, सण साजरा, होळी
,साजरा ,सण } = 6
WordOrder (Sa, Sb) =
|L(Sa) ∩ L(Sb)|
|L(Sa) ∪ L(Sb)|
=
6
30
= 0.17
D. Sumo metric
The Sumo-Metric [12] is on the word distance. This met-
ric finds the special lexical links between root words of
a pair of Marathi sentences. This metric not only identify
paraphrases but also identifies the exact and quasi-exact
matches. It can be considered as 1-gram exclusive overlap.
S1 and S2 is a pair of sentences given as an input to the
system. Let x and y are the numbers of words in those two
sentences. The number of words in the intersection of the
two sentences, excluding repetitions is represented by λ.
To calculate the Sumo-Metric S(.,.), first evaluate the
function S(x, y, λ)
S(x, y, λ) = α log 2
x
λ

+ β log 2
y
λ

Where α, β ∈ [0, 1] and α + β = 1. Now the Sumo-Metric
S (.,.) can be calculated by the following equation:
S(S1, S2) =
(
S(x, y, λ) if S(x, y, λ)  1.0
e−k∗S(x,y,λ)
otherwise
α and β are two core components which involved in the
calculation. The exact and quasi exact match pairs are grad-
ually penalize by log2 (.) function as for equal pairs the
sumo-metric result is exactly zero.
Sentence 1: होळी सण साजरा करता
ं ना काळजी
Sentence 2: भान ठ
े वून होळी सण साजरा
λ = number of words in the intersection of the two sen-
tences = {होळी, सण, साजरा, } = 4
Hence, if α and β = 0.5
Then,
S(x, y, λ) = 0.5 ∗ log2

7
4

+ 0.5 ∗ log2

6
4

= 0.7
3.3 Semantic similarity
Universal Networking language (UNL) generates a graph
of semantic representation of sentences. In this method,
UNL relation signifies semantic information in a by draw-
ing a hyper-graph. In this hyper-graph nodes represents
concepts and arcs signifies relations. The hyper-graph con-
tains a set of directed binary relations between two con-
cepts of a sentence.
UNL relation involves mainly 3 important terms which
are as follows:
12 Shruti Srivastava and Sharvari Govilkar
1. Universal Words (UWs): These are the root word
which signifies word meaning.
2. Relation Labels: These labels tag the various interac-
tions between Universal Words.
3. Attribute Labels: It gives extra information for the
Universal Words present in a sentence.
Two UNL graphs are compared for the similarity indi-
rectly compares the semantic content of two sentences. This
comparison is useful for finding semantic similarity score.
UNL Expression
The directed graph generated in this method is known as
a “UNL Expression” or “UNL Graph” which gives a long
list of the binary relations.
The general format of UNL Expression is the following:
relation:[scope-ID](from-node, to-node )
A UW is described in each of from-node and to-
node.
An example for the UNL expression of the sentence
“Bharat is writing a novel.”
agt(write(icldo)@entry.@present.@progress,
Bharat (iofperson))
obj(write(icldo)@entry.@present.@progress,
novel(iclbook))
UNL En-conversion process
UNL en-cnoversion is a process of translating natural lan-
guage text into UNL for syntactic and semantic evaluation.
The En-converter scanned the sentence from left to right
then the word dictionary is used for matching of mor-
phemes and they will be the candidate morphemes for
further processing. By using defined rules a syntactic tree
is built and a UNL format semantic network is designed
for the input sentence until all the words are inputted.
UNL graph based Similarity
The generated UNL graphs of two sentences are compared
for computing semantic similarity using the formula given
below. The average of relation match scores and universal
word match scores is the actual relation score. The match-
ing attributes of the corresponding universal word decides
the universal word match score.
Figure 2. UNL graph example.
Semsim(S1, S2)
=
S1 ∩ S2
S1 ∪ S2
+
X
a∈S1,b∈S2
0.75 ∗
[[a.w1 = b.w1a.w2 = b.w2]]
|S1 ∪ S2|
+ 0.75 ∗
[[a.r = b.ra.Ew1 = b.Ew1a.Ew2 = b.Ew2]]
|S1 ∪ S2|
+ 0.6 ∗
[[a.Ew1 = b.Ew1a.Ew2 = b.Ew2]]
|S1 ∪ S2|
!
Example: Input sentence1: होळी सण साजरा करता
ंना
काळजी
——————UNL expression for sentence1——————
{unl} mod (काळजी (iofaction):01, )iofnature):
00@entry@pred)
seq (काळजी (iofaction):01, साजरा)icldo):02@present)
agt (सण(iofevent):02@def, साजरा)icldo):02@present)
aoj (होळी (iclevent):03@def, सण)iofevent):02@def) {/unl}
——————UNL expression for sentence1——————
Input sentence2: भान ठ
े वून होळी सण साजरा
——————UNL expression for sentence2——————
{unl} mod (भान (iofaction):01, )iofnature):
00@entry@pred)
seq (भान (iofaction):01, साजरा)icldo):02@present)
agt (सण (iofevent):02@def, साजरा)icldo):02@present)
aoj (होळी (iclevent):03@def, सण)iofevent):02@def) {/unl}
——————UNL expression for sentence2——————
Figure 3. UNL1 graph for sentence1.
Figure 4. UNL2 graph for sentence2.
Detecting Paraphrases in Marathi Language 13
Table 2. Similarity measure score.
Types of
similarity
Measure Value Similarity measure
score to consider
Statistical
similarity
measures
Jaccard Similarity
(Word set)
0.4 Maximum statistical
similarity Statsim(S1,
S2) = 0.91
Cosine
Similarity (Word
Vector)
0.91
Word order
Similarity
0.17
Sumo Metric 0.7
Semantic
similarity
measure
UNL graph
based similarity
0.59 Semantic similarity
Semsim(S1, S2) =
0.59
Overall
similarity
If p = 0.5, q = 0.5 then, Sim (S1, S2) = 0.5 * 0.91
+ 0.5 * 0.59 = 0.75
UNL1 and UNL2 denotes the UNL graph for sentence1 and
for sentence2 respectively. Hence the semantic Similarity
score can be calculated by using following formula,
S1 ∩ S2 = {होळी, सण, साजरा, पयावण} = 4
S1 ∪ S2 = {होळी, सण, साजरा, पयावण, काळजी, भान} = 6
Hence, Semsim (S1, S2) = 0.59
3.4 Overall similarity
The scores of maximum statistical similarity and seman-
tic similarity are combined using following formula for
overall similarity calculation of two sentences, Sim (S1, S2)
=p*Statsim(S1, S2)+q*Semsim(S1, S2).
The values of p and q are considered in such a way that,
p + q = 1; p, q ∈ [0, 1] as p and q are the constants which
represents involvement of each part to the overall similar-
ity calculation,
3.5 Paraphrase judgement
As the proposed system produced overall similarity score
of 0.75, it can easily conclude that the Marathi sentence
pair judged for paraphrasing identifies as a paraphrase
pair. Taking into consideration all 5 statistical and 1 seman-
tic measures, it can be detected that there are variation in
every similarity measure score. However, the overall sim-
ilarity score providesan enhanced outcome for paraphrase
detection process.
4 Results and discussion
4.1 Working of the system
The input to the Paraphrase Identification system will be
two Marathi text documents in “.txt” format. One docu-
ment will be original sentence and other will be suspicious
paraphrase sentence.
Figure 5. GUI of the system.
Figure 6. Overall similarity  Paraphrase Judgement.
Since the paraphrase detection system is designed exclu-
sively for Marathi sentences, all characters which do not
belong to Devanagari script must be eliminated to save
processing time and memory. The pre-processing of sen-
tences undergoes further processing steps like tokeniza-
tion, stop word removal, stemming and morphological
analysis and will result into validated output.
The overall similarity calculation is done by adding
values of maximum statistical similarity and semantic sim-
ilarity. The values of both similarities are multiplied by
a smoothing factor before addition for considering the
contribution of each similarity in overall similarity. The
sentence pair is paraphrase or not is decided on basis of
score of overall similarity. If the score is greater than 0.8
then the sentences are paraphrase of each other. On click-
ing of overall similarity button, system generates overall
similarity score and on basis of that it generates paraphrase
judgement.
4.2 Experimental analysis
All sentence pairs from dataset of five domains were tested
domain wise and similarity wise. The graphs were created
for comparative study of each similarity measure and to
find the best for the system. The best method taking it into
consideration while calculating overall similarity score for
the system.
14 Shruti Srivastava and Sharvari Govilkar
0
0.5
1
1.5
SL1 SL2 SL3 SL4 SL5 SL6 SL7 SL8
PARAPHRASE
SIMILARITY
SCORE
4 STASTICAL AND SEMANTIC (UNL) SIMILARITY
MEASURES
LITERATURE DOMAIN
Jaccard Similarity Cosine Similarity
Word Order Similarity Sumo Matrix
UNL
Figure 7. Comparison of Statistical and semantic similarity for
literature domain.
4.2.1 Domain wise statistical and semantic similarity
The sentence pairs from the political, sports, film, litera-
ture and miscellaneous domains were tested separately for
both types of similarities. The statistical similarity deals
with statistical structure between words of the sentences
whereas semantic similarity deals with the meaning. For
literature domain and miscellaneous domain, statistical
similarity score lies between 0 and 1. The semantic simi-
larity score lies between 0 to 2 for literature domain and
ranges from 0 to 3 for miscellaneous domain.
4.2.2 Overall similarity
Out of 4 statistical similarities, maximum statistical simi-
larity is calculated for each sentence pair. Combination of
the semantic similarity and maximum statistical similarity
considered as overall similarity. The overall similarity score
is used for setting the threshold value and give the judg-
ment for identifying paraphrase (P) and non-paraphrase
pair (NP).
0
0.5
1
1.5
2
2.5
PARAPHRASE
SIMILARITY
SCORE
SEMANTIC (UNL), MAXIMUM STASTICAL ,
OVERALL SIMILARITY MEASURES
SPORTS DOMAIN
UNL Max. Stastical OVERALL
Figure 8. Overall similarity measures for sports domain.
4.3 Threshold value generation
From obtained different similarity scores, it is necessary
to highlight a classical problem in classification – thresh-
olds. Usually, thresholds are some parameter upon which
system takes some decision. After computation and com-
parison of all similarity measures, it is essential to decide
upon some threshold value which decides whether a sen-
tence pair is a paraphrase or not.
Thresholds are the parameters which unease the pro-
cess of evaluation. In the evaluation process, no threshold
value is predefined. Instead, from overall comparison of
each similarity on each domain, best threshold is computed
upon which each domain is agreed.Threshold computa-
tion may be a classical issue of function maximization or
optimization.
The threshold values plays important role in paraphrase
judgement as the sentence pair can divide into 3 groups
as per their overall similarity scores viz, Paraphrase (P),
Semi-paraphrase (SP) and Non-Paraphrase (NP). The fol-
lowing Table 3 gives an example of overall similarity score
Table 3. Threshold selection.
Sentence pair Overall similarity score Threshold value Paraphrase judgement
Sentence 1: हा माणसाचा े आह
(Book is the best friend of the man)
Sentence 2: जवळचा आह
The book is a close friend of man.
1.22 (SM1 from
miscellaneous domain)
(score = 0.5) Paraphrase (P)
Sentence 1: मराठी भाषा िनराळी आह
Each region of Maharashtra has a different Marathi language
Sentence 2: मराठी भाषा सव जरी एक असली तरी तीच दर बारा कोसावर
बदलत
Marathi language is the only one in the whole of Maharashtra,
the same rate varies from 12 to 12 degrees
0.48 (SL5 from literature
domain)
(score  0.5

score  0.4)
Semi-paraphrase (SP)
Sentence 1: वेळी झोळी भरावी
Beans are to be filled with virtues at the time of Holi.
Sentence 2: होळीसाठी वाईट व मोळी बांधावी
Build bad rallies and rituals for Holi.
0.0 (SM4 from
miscellaneous domain)
(score = 0.0) Non-paraphrase (NP)
Detecting Paraphrases in Marathi Language 15
Table 4. Confusion matrix for performance evaluation.
Actual results Predicted results Total dataset
Similar Dissimilar
Similar 46 TP 12 FN 58
Dissimilar 1 FP 15 TN 16
of sentence pair and role of threshold for deciding its
section of Paraphrase (P), Semi-paraphrase (SP) and Non-
Paraphrase (NP).
4.4 Performance evaluation
The performance of the system is evaluated on 4 basic eval-
uation measures. These measures are accuracy, precision,
recall and F-measure. For calculation of these four perfor-
mance measures, the four parameters TP, TN, FP, and FN
must be calculated first.
The performance of the system was evaluated by creat-
ing a confusion matrix and it is given in Table 4.
The 46 sentence pairs classified as similar and 12 as dis-
similar out of 58 similar sentence pairs by the proposed
system. Out of actual 16 dissimilar sentence pairs, the 15
sentence pairs classified as dissimilar and 1 as similar in
the used data-set. In Table 4, True positive (TP) and True
Negative (TN) are the observations which are correctly pre-
dicted hence shown in green color. False positive (FP) and
False Negative (FN) must be minimized in order to get a
good system so they are shown in red color.
True positive (TP) and True Negative (TN) values occur
when actual results obtained from the system agreed with
the predicted results.
False positives (FP) and False negatives (FN), these val-
ues occur when the actual results contradicts with the
predicted results.
Accuracy
Accuracy is a ratio of correctly predicted observation to the
total observations.
Precision
Precision is obtained by dividing the number of cor-
rectly predicted paraphrases by the number of predicted
paraphrases. High precision relates to the low false posi-
tive rate.
Recall
Recall calculation is done taking the ratio of correctly pre-
dicted paraphrases to the all referenced paraphrases in
actual class.
F-measure
F-measure considers false positives and false negatives
into account and it is the weighted average of Precision and
Table 5. Performance analysis of overall similarity.
Measure Value
ACCURACY 0.82
PRECISION 0.98
RECALL 0.79
F-MEASURE 0.89
Recall. Accuracy gives best value when the values of false
positives and false negatives are similar. When the cost of
these two is different then consider values of both Precision
and Recall which in turn the value of F-measure.
Although there is deviation in the similarity score
of each method, the overall similarity score provedan
improved result. This system considered all the statistical
and semantic concerns.
The data obtained from the Table 5 prove that the
proposed system performed sound work of Paraphrase
Identification of Marathi sentences.
5 Dataset and testing
5.1 Dataset
The dataset used in this work consist of sentence pairs from
political, film, sports, literautre, miselleneous domains.
There are 74 marathi sentence pairs in the dataset and
each sentence pair has a binary decision accompanying
with it which gives the parahrase judgement for being
paraphrase pair or not.
5.2 Testing
To test the accuracy and F-measure, dataset was divided
into a ratio of 75% and 25% for training and testing respec-
tively. The performance measures (Accuracy, Precision,
Recall and F-Measure) were evaluated for the different
similarity measures.
A quartile is a form of quantile. 25% is set as the first
quartile (Q1) as it is the mid of the smallest number and
the median of the data set. The median of the data-set is
the second quartile (Q2). The mid value between median
and highest value of the data-set is considered as the third
quartile (Q3).
The decision of paraphrase of the sentence pair is
depend on the score of first quartile. If the similarity score
is greater than Q1 then the sentence pair is labelled as
paraphrase otherwise non-paraphrase. On the basis of this
decision accuracy, precision, recall and F-measure was cal-
culated and compared for each similarity measure.
In paraphrase identification of Marathi sentences, 5 sim-
ilarity measures were performed to check similarity score
of the sentences. The similarity measures were performed
on 58 sentence pairs to identify the highest score among
16 Shruti Srivastava and Sharvari Govilkar
Table 6. Performance of the similarity measures.
Similarity measure Min Quartile 25% Quartile 50% Median Quartile 75% Max
Jaccard 0 0.18 0.29 0.4 0.86
Cosine 0.74 0.885 0.91 0.94 1
Word order 0 0.065 0.12 0.24 4
Sumo metric 0 0.01 0.04 0.72 1
Semantic (UNL) 0 0.235 0.38 0.79 2.58
Overall 0.42 0.58 0.64 0.87 0.87
Table 7. Performance measures analysis.
MEASURE JACCARD COSINE WORD-ORDER SUMO METRIC UNL OVERALL
ACCURACY 0.77 0.87 0.8 0.83 0.77 0.82
F-MEASURE 0.85 0.92 0.87 0.89 0.85 0.89
RECALL 0.77 0.95 0.75 0.86 0.77 0.79
PRECISION 0.94 0.9 1 0.92 0.94 0.98
4 statistical similarities and the performance of semantic
similarity.
5.2.1 Analysis of performance measures
As first quartile value is decided threshold for each simi-
larity, the confusion matrix is created and values of perfor-
mance measures are calculated according to it.
From Table 6 it is clear that cosine similarity gives the
best score among all similarity methods. Hence for find-
ing paraphrases in Marathi Sentences, the Cosine similarity
method can be proved effectively accurate.
From testing of all similarity measures, it can be clearly
convey that cosine similarity is best method among all sta-
tistical similarity techniques for finding similarity. When
considering all sentences in given dataset, other statistical
techniques also performed well for different sentence pair.
Word-order similarity is the best to find dissimilar pairs.
For taking dissimilarity into consideration, overall simi-
larity score is resulted to 0 if any of the statistical similarity
or semantic similarity resulted into 0. Otherwise it is cal-
0.77
0.87
0.8 0.83
0.77
0
0.2
0.4
0.6
0.8
1
ACCURACY
0
0
0
0
ACCURACY
SCALE
JACCORD COSINE WORD-ORDER
SUMO METRIC UNL
Figure 9. Accuracy comparisons of all similarity techniques.
culated by combining the maximum statistical similarity
score and semantic similarity score.
6 Conclusion and future scope
Mostly the research has been done for English and Indian
regional languages like Hindi, Malayalam, Punjabi and
Tamil. However no paraphrase detection work has been
yet done for Marathi language. This is the first footstep
towards detecting Paraphrases in Marathi sentences. The
development of social media in Marathi brings the avail-
ability of massive data in Marathi languages. Analysis of
social data like daily tweets in Twitter is the interest grow-
ing field for different purposes. Identifying paraphrases
should be widely explored to help many more arenas of
Natural language processing field.
The proposed system provides paraphrase identifica-
tion tool for Marathi sentences using incorporation of both
statistical and semantic features. The detailed analysis and
performance comparison of all statistical methods con-
cludes that cosine similarity measure outperforms among
all statistical measures. Semantic similarity carried out
through a UNL method plays an important role in achiev-
ing 82% accuracy and 89% F-measure. This score proves
that the designed tool is extremely supportive for Para-
phrase Identification of Marathi Sentences.
The system can be enhanced for capturing the similar-
ity between Marathi paragraphs and Marathi documents.
There is possibility of improving statistical methods to
detect more structural similarity types. Misspelled words
in the sentences will yield wrong result even if the sen-
tences are similar. Pre-processing steps can be enriched for
capturing the spelling mistakes and can deal accordingly.
Semantic similarity measure score can be improved by con-
sidering synonyms of the universal words. The current
model is trained on fewer data sets which can be extended
Detecting Paraphrases in Marathi Language 17
to incorporate more data. The model constructed can also
be changed from static to online Paraphrase Identification
tool for Marathi documents.
Bibliography
[1] Lee, Jun Choi and Cheah, Yu-N (2016) Paraphrase Detection
using Semantic Relatedness based on Synset Shortest Path in
WordNet. In: International Conference on Advanced Infor-
matics: Concepts, Theory and Applications, 16-17 August
2016, Parkroyal Penang Resort.
[2] Chen Liang, Praveen Paritosh, Vinodh Rajendran, Kenneth
D. Forbus, Learning Paraphrase Identification with Struc-
tural Alignment Conference: IJCAI 2016, At New York.
[3] Hoang-Quoc Nguyen-Son, Yusuke Miyao, Isao Echizen,
Paraphrase Detection Based on Identical Phrase and Similar
Word Matching, 29th Pacific Asia Conference on Language,
2015.
[4] J.C. Lee, and Y. Cheah. “Paraphrase Detection using String
Similarity with Synonyms.” The Fourth Asian Conference
on Information Systems, ACIS 2015.
[5] Nitin Madnani, Joel Tetreault, Martin Chodorow, Re-
examining Machine Translation Metrics for Paraphrase
Identification, Conference of the North American Chapter of
the ACL: 2012 Association for Computational Linguistics.
[6] Mihai Lintean and Vaile Rus, Paraphrase Identification
Using Weighted Dependencies and word semantics, Pro-
ceedings of the Twenty-Second International FLAIRS Con-
ference (2009).
[7] Fernando and Stevenson, 2008. A Semantic Similarity
Approach to paraphrase Detection, In Proceedings of the
11th Annual Research Colloquium of the UK Special Interest
Group for Computational Linguistics, pages 45–52. Citeseer.
[8] Cordeiro, Dias, Brazdil A Metric for Paraphrase Detection
Proceedings of the International Multi-Conference on Com-
puting in the Global Information Technology (ICCGI’07),
IEEE.
[9] Kozareva and Montoyo, Paraphrase identification on the
basis of supervised machine learning techniques, Advances
in Natural Language Processing: 5th International Confer-
ence on NLP (FinTAL 2006), Turku, Finland, 524–533.
[10] Avishek Dan  Pushpak Bhattacharyya, “CFILT-CORE:
Semantic Textual Similarity using Universal Networking
Language” 2013 Association for Computational.
[11] Nandini Sethi, Prateek Agrawal, Vishu Madaan and Sanjay
Kumar Singh, A Novel Approach to Paraphrase Hindi Sen-
tences using Natural Language Processing, Indian Journal of
Science and Technology, Vol 9(28), July 2016.
[12] Rupal Bhargava, Anushka Baoni, Harshit Jain
“BITS_PILANI@DPIL-FIRE 2016: Paraphrase Detection
in Hindi Language using Syntactic Features of Phrase”,
DPIL 2016.
[13] Vani K, Deepa Gupta. “ASE@DPIL-FIRE2016: Hindi Para-
phrase Detection using Natural Language Processing Tech-
niques  Semantic Similarity Computations” DPIL 2016.
[14] Anuj Saini “Anuj@DPIL-FIRE2016: A Novel Paraphrase
Detection Method in Hindi Language using Machine Learn-
ing”, DPIL 2016.
[15] Sindhu L. and Sumam Mary Idicula “CUSAT_NLP@DPIL-
FIRE2016:Malayalam Paraphrase Detection”, DPIL 2016.
[16] Manju K. and Sumam Mary Idicula,” CUSAT TEAM@DPIL-
FIRE2016: Detecting Paraphrase in Indian Languages-
Malayalam”, DPIL 2016.
[17] Ditty Mathew and Dr. Suman Mary Idicula,” Paraphrase
identification of Malayalam sentences – an experience”,
IEEE 2013.
[18] R. Thangarajan, S. V. Kogilavani, A. Karthic, S. Jawahar,
“KEC@DPIL-FIRE2016: Detection of Paraphrases on Indian
Languages”, DPIL 2016.
[19] Leilei Kong, Zhenyuan Hao, Kaisheng Chen, Zhongyuan
Han, Liuyang Tian, Haoliang Qi “HIT2016@DPIL-FIRE2016:
Detecting Paraphrases in Indian Languages based on “Gra-
dient Tree Boosting”, DPIL 2016.
[20] Sandip Sarkar, Partha Pakray, Saurav Saha, Dipankar Das,
Jereemi Bentham, Alexander Gelbukh “NLP-NITMZ@DPIL-
FIRE 2016: Language Independent Paraphrases Detection”,
DPIL 2016.
[21] Kamal Sarkar “KS_JU@DPIL-FIRE2016: Detecting Para-
phrases in Indian Languages using Multinomial Logistic
Regression Model”, DPIL 2016.
[22] Tanik Saikh, Sudip Kumar, Naskar Sivaji, Bandyopadhyay
“JU_NLP@DPIL-FIRE2016: Paraphrase Detection in Indian
Languages - A Machine Learning Approach”, DPIL 2016.

Weitere ähnliche Inhalte

Was ist angesagt?

Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Edureka!
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Enriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationSeonghyun Kim
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and TransformerArvind Devaraj
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptxNameetDaga1
 
A Neural Grammatical Error Correction built on Better Pre-training and Sequen...
A Neural Grammatical Error Correction built on Better Pre-training and Sequen...A Neural Grammatical Error Correction built on Better Pre-training and Sequen...
A Neural Grammatical Error Correction built on Better Pre-training and Sequen...NAVER Engineering
 
Giáo Trình PHP & MySql căn bản
Giáo Trình PHP & MySql căn bảnGiáo Trình PHP & MySql căn bản
Giáo Trình PHP & MySql căn bảnTiên Lý Rau Rút
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingMinh Pham
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMDivya Gera
 
Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxDeep Learning Italia
 
Pregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớnPregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớnViet-Trung TRAN
 
Lecture 6
Lecture 6Lecture 6
Lecture 6hunglq
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTSuman Debnath
 
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph KernelsDDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph Kernelsivaderivader
 
Multifamily Demo Day - Effective Vendor Demos & Asking the Right Questions.pdf
Multifamily Demo Day - Effective Vendor Demos & Asking the Right Questions.pdfMultifamily Demo Day - Effective Vendor Demos & Asking the Right Questions.pdf
Multifamily Demo Day - Effective Vendor Demos & Asking the Right Questions.pdfMultifamily Insiders
 
Paraphrase Detection in NLP
Paraphrase Detection in NLPParaphrase Detection in NLP
Paraphrase Detection in NLPYuriy Guts
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlpeSAT Journals
 

Was ist angesagt? (20)

Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Enriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword Information
Enriching Word Vectors with Subword Information
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
 
Distributed DBMS - Unit - 4 - Data Distribution Alternatives
Distributed DBMS - Unit - 4 - Data Distribution Alternatives Distributed DBMS - Unit - 4 - Data Distribution Alternatives
Distributed DBMS - Unit - 4 - Data Distribution Alternatives
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
A Neural Grammatical Error Correction built on Better Pre-training and Sequen...
A Neural Grammatical Error Correction built on Better Pre-training and Sequen...A Neural Grammatical Error Correction built on Better Pre-training and Sequen...
A Neural Grammatical Error Correction built on Better Pre-training and Sequen...
 
Giáo Trình PHP & MySql căn bản
Giáo Trình PHP & MySql căn bảnGiáo Trình PHP & MySql căn bản
Giáo Trình PHP & MySql căn bản
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTM
 
Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptx
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
Web201 slide 3
Web201   slide 3Web201   slide 3
Web201 slide 3
 
Pregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớnPregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớn
 
Lecture 6
Lecture 6Lecture 6
Lecture 6
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph KernelsDDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
 
Multifamily Demo Day - Effective Vendor Demos & Asking the Right Questions.pdf
Multifamily Demo Day - Effective Vendor Demos & Asking the Right Questions.pdfMultifamily Demo Day - Effective Vendor Demos & Asking the Right Questions.pdf
Multifamily Demo Day - Effective Vendor Demos & Asking the Right Questions.pdf
 
Paraphrase Detection in NLP
Paraphrase Detection in NLPParaphrase Detection in NLP
Paraphrase Detection in NLP
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlp
 

Ähnlich wie Detecting Paraphrases in Marathi Language

Word sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy wordsWord sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy wordsijnlc
 
Ontology Matching Based on hypernym, hyponym, holonym, and meronym Sets in Wo...
Ontology Matching Based on hypernym, hyponym, holonym, and meronym Sets in Wo...Ontology Matching Based on hypernym, hyponym, holonym, and meronym Sets in Wo...
Ontology Matching Based on hypernym, hyponym, holonym, and meronym Sets in Wo...dannyijwest
 
A survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationA survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationijnlc
 
Rule-based Prosody Calculation for Marathi Text-to-Speech Synthesis
Rule-based Prosody Calculation for Marathi Text-to-Speech SynthesisRule-based Prosody Calculation for Marathi Text-to-Speech Synthesis
Rule-based Prosody Calculation for Marathi Text-to-Speech SynthesisIJERA Editor
 
Cross lingual similarity discrimination with translation characteristics
Cross lingual similarity discrimination with translation characteristicsCross lingual similarity discrimination with translation characteristics
Cross lingual similarity discrimination with translation characteristicsijaia
 
Anaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer methodAnaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer methodijcsa
 
Implementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar RulesImplementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar RulesIJERA Editor
 
Comparative performance analysis of two anaphora resolution systems
Comparative performance analysis of two anaphora resolution systemsComparative performance analysis of two anaphora resolution systems
Comparative performance analysis of two anaphora resolution systemsijfcstjournal
 
Automatic multiple choice question generation system for
Automatic multiple choice question generation system forAutomatic multiple choice question generation system for
Automatic multiple choice question generation system forAlexander Decker
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGEScsandit
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESLinda Garcia
 
Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)Dhabal Sethi
 
Paper id 25201466
Paper id 25201466Paper id 25201466
Paper id 25201466IJRAT
 
Natural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine LearningNatural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine Learningcsandit
 
Difficulties in processing malayalam verbs
Difficulties in processing malayalam verbsDifficulties in processing malayalam verbs
Difficulties in processing malayalam verbsijaia
 
Correction of Erroneous Characters
Correction of Erroneous CharactersCorrection of Erroneous Characters
Correction of Erroneous CharactersAndi Wu
 
Towards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesTowards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesAlgoscale Technologies Inc.
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 

Ähnlich wie Detecting Paraphrases in Marathi Language (20)

Word sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy wordsWord sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy words
 
Ontology Matching Based on hypernym, hyponym, holonym, and meronym Sets in Wo...
Ontology Matching Based on hypernym, hyponym, holonym, and meronym Sets in Wo...Ontology Matching Based on hypernym, hyponym, holonym, and meronym Sets in Wo...
Ontology Matching Based on hypernym, hyponym, holonym, and meronym Sets in Wo...
 
A survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationA survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classification
 
Ny3424442448
Ny3424442448Ny3424442448
Ny3424442448
 
Rule-based Prosody Calculation for Marathi Text-to-Speech Synthesis
Rule-based Prosody Calculation for Marathi Text-to-Speech SynthesisRule-based Prosody Calculation for Marathi Text-to-Speech Synthesis
Rule-based Prosody Calculation for Marathi Text-to-Speech Synthesis
 
Cross lingual similarity discrimination with translation characteristics
Cross lingual similarity discrimination with translation characteristicsCross lingual similarity discrimination with translation characteristics
Cross lingual similarity discrimination with translation characteristics
 
Anaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer methodAnaphora resolution in hindi language using gazetteer method
Anaphora resolution in hindi language using gazetteer method
 
Implementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar RulesImplementation Of Syntax Parser For English Language Using Grammar Rules
Implementation Of Syntax Parser For English Language Using Grammar Rules
 
Comparative performance analysis of two anaphora resolution systems
Comparative performance analysis of two anaphora resolution systemsComparative performance analysis of two anaphora resolution systems
Comparative performance analysis of two anaphora resolution systems
 
Automatic multiple choice question generation system for
Automatic multiple choice question generation system forAutomatic multiple choice question generation system for
Automatic multiple choice question generation system for
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
 
Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)Ijarcet vol-3-issue-3-623-625 (1)
Ijarcet vol-3-issue-3-623-625 (1)
 
Paper id 25201466
Paper id 25201466Paper id 25201466
Paper id 25201466
 
Natural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine LearningNatural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine Learning
 
Difficulties in processing malayalam verbs
Difficulties in processing malayalam verbsDifficulties in processing malayalam verbs
Difficulties in processing malayalam verbs
 
Correction of Erroneous Characters
Correction of Erroneous CharactersCorrection of Erroneous Characters
Correction of Erroneous Characters
 
Towards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian LanguagesTowards Building Semantic Role Labeler for Indian Languages
Towards Building Semantic Role Labeler for Indian Languages
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 

Mehr von BOHR International Journal of Smart Computing and Information Technology

Mehr von BOHR International Journal of Smart Computing and Information Technology (10)

A Practical Fault Tolerance Approach in Cloud Computing Using Support Vector ...
A Practical Fault Tolerance Approach in Cloud Computing Using Support Vector ...A Practical Fault Tolerance Approach in Cloud Computing Using Support Vector ...
A Practical Fault Tolerance Approach in Cloud Computing Using Support Vector ...
 
The Usage of E-Learning Challenges in the Namibia Educational Institutions: A...
The Usage of E-Learning Challenges in the Namibia Educational Institutions: A...The Usage of E-Learning Challenges in the Namibia Educational Institutions: A...
The Usage of E-Learning Challenges in the Namibia Educational Institutions: A...
 
An Investigation Into the Impacts of ICT in the Compacting of COVID-19: A Nam...
An Investigation Into the Impacts of ICT in the Compacting of COVID-19: A Nam...An Investigation Into the Impacts of ICT in the Compacting of COVID-19: A Nam...
An Investigation Into the Impacts of ICT in the Compacting of COVID-19: A Nam...
 
The Importance of ICTs on Service Delivery at MTC, Namibia
The Importance of ICTs on Service Delivery at MTC, NamibiaThe Importance of ICTs on Service Delivery at MTC, Namibia
The Importance of ICTs on Service Delivery at MTC, Namibia
 
National COVID-19 Health Contact Tracing and Monitoring System: A Sustainable...
National COVID-19 Health Contact Tracing and Monitoring System: A Sustainable...National COVID-19 Health Contact Tracing and Monitoring System: A Sustainable...
National COVID-19 Health Contact Tracing and Monitoring System: A Sustainable...
 
Enabling Semantic Interoperability of Regional Trends of Disease Surveillance...
Enabling Semantic Interoperability of Regional Trends of Disease Surveillance...Enabling Semantic Interoperability of Regional Trends of Disease Surveillance...
Enabling Semantic Interoperability of Regional Trends of Disease Surveillance...
 
Mobility of ICT Services in Namibian Institutions: A Literature Review
Mobility of ICT Services in Namibian Institutions: A Literature ReviewMobility of ICT Services in Namibian Institutions: A Literature Review
Mobility of ICT Services in Namibian Institutions: A Literature Review
 
IOT Air Pollution Monitoring System (Live Location)
IOT Air Pollution Monitoring System (Live Location)IOT Air Pollution Monitoring System (Live Location)
IOT Air Pollution Monitoring System (Live Location)
 
For Investment Geometric Problems
For Investment Geometric ProblemsFor Investment Geometric Problems
For Investment Geometric Problems
 
Different Ways of Solving a Geometric Task
Different Ways of Solving a Geometric TaskDifferent Ways of Solving a Geometric Task
Different Ways of Solving a Geometric Task
 

Kürzlich hochgeladen

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Kürzlich hochgeladen (20)

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Detecting Paraphrases in Marathi Language

  • 1. International Journal of Smart Computing and Information Technology 2020, Vol. 1, No. 1, pp. 7–17 Copyright © 2020 BOHR Publishers www.bohrpub.com Detecting Paraphrases in Marathi Language Shruti Srivastava and Sharvari Govilkar Department of Computer Engineering, PCE, University of Mumbai, India E-mail: shrutics2015@gmail.com; sgovilkar@mes.ac.in Abstract. Paraphrasing refers to the sentences that either differs in their textual content or dissimilar in rearrange- ment of words but convey the same meaning. Identifying a paraphrase is exceptionally important in various real life applications such as Information Retrieval, Plagiarism Detection, Text Summarization and Question Answering. A large amount of work in Paraphrase Detection has been done in English and many Indian Languages. However, there is no existing system to identify paraphrases in Marathi. This is the first such endeavor in Marathi Language. A paraphrase has different structured sentences and Marathi being semantically strong language hence this system is designed for checking both statistical and semantic similarity of Marathi sentences. Statistical similarity measure does not need any prior knowledge as it is only based on the factual data of sentences. The factual data is calculated on the basis of the degree of closeness between the word-set, word-order, word-vector and word-distance. Universal Networking Language (UNL) speaks about the semantic significance in the sentence without any syntac- tic point of interest. Hence, the semantic similarity calculated on the basis of generated UNL graphs for two Marathi sentences renders semantic equality of two Marathi sentences. The total paraphrase score was calculated after joining statistical and semantic similarity scores which gives the judgement of being paraphrase or non-paraphrase about the Marathi sentences. Keywords: Paraphrase, Marathi Language Statistical, Semantic, Sumo metric, Universal Networking Language (UNL). 1 Introduction Paraphrase is the translation of a sentence or a para- graph into same language. Paraphrasing occurs when texts are lexically or syntactically modified to appear different texts, but retaining the same intention. Paraphrase can be generated, extracted and identified. Paraphrase extraction involves collection of different words or phrases which express same or almost same meaning. Vocabulary plays an important role in paraphrase extraction. Paraphrase extraction helps in paraphrase generation in which para- phrases are created. Paraphrase generation involves not only dictionary exercise but also changing the information sequence and grammatical structure. Paraphrase identification is method of detecting the variety of expressions which conveying same meaning. It presents a major challenge for numerous NLP applications. In automatic summarization, identifying paraphrases is necessary to find repetitive information in the document. In information extraction, paraphrase identification pro- vides most significant information whereas in information retrieval query paraphrases are generated to retrieve better quality of relevant data. In question and answering system, in absence of question from database the answers returned for the question paraphrase is always helpful. The base of paraphrasing is semantic equivalence which gives alter- native translation in the same language. For paraphrase detection it is necessary to study the possibilities of para- phrasing at each level. Mainly there are 3 types of surface paraphrases. Lexical level: Lexical paraphrases occur when synonyms appear in almost identical sentences. At lexical level, in addition to synonym lexical paraphrasing is described by hyperonymy. In hyperonymy one word has many alterna- tives but only one of the words is more general or specific than the other one. Example – synonyms (solve and resolve), (पु – सुता, ) Hyperonymy (reply, say), (landlady, hostess) {घर} has {सदन, शाला, आलय, धाम} hypernym S1: मोदी सरकारन दोन पया य समोर ठ ेवल े आह े त. S1: The Modi government has put forward two options. S2: मोदी सरकारन े दोन मा ंडल े आह े त. S2: The Modi government has given two options. 7
  • 2. 8 Shruti Srivastava and Sharvari Govilkar Phrase level: Phrasal paraphrase states that different phrases shared same semantic content. The phrasal para- phrases include syntactic phrases as well as pattern forma- tion with connected factors. Example S1: ֆ֚֡֔֠ֈ֚֞֞ ե֊֠ ֒֞֐֑֞օ ֛֧֟֔֟֔. (Tulsidas wrote Ramayana.) S2: ֒֞֐֑֞օ ֆ֚֡֔֠ֈ֚֞֞ ե֊֠ ֆ֚֡֔֠ֈ֚֞֞ ե֊֠. (Ramayana written by Tulsidas.) S3: ֆ֚֡֔֠ֈ֚֞֞ ե֊֠ ֔ոշ ֧ ֡ ֆ֚֔֠ֈ֚֞ ֧ ը֛ֆ. (Ramayana’s author is Tulsidas.) S4: ֡ ֆ֚֔֠ֈ֚֞֞ե֊֠ ֒֞֐֑֞օ֞ռ֠ ֒ռ֊֞ շ֔֠ ֧ . (Tulsidas composed the Ramayana.) Sentential paraphrases: In this type of paraphrase one sen- tence is totally replaced by other sentence but still both sentences keeping the same meaning. Example – S1: ֧ ֐֒֞ւ֠ ֟֊֒֞֕֠ ֧ ը֛. (Each region of Maharashtra has a different Marathi language) S2: ֐֒֞ւ֠ ֏֞֙֞ ֚֗ Sentential paraphrases: In this type of paraphrase one sentence is totally replaced by keeping the same meaning. Example – S1: मिाराष्ट्रातील प्रत्येक प्राांताची मराठी भाषा लनराळी आिे. (Each region of Maharashtra has a d S2: मराठी भाषा सर्व मिाराष्ट्राची जरी एक असली तरी तीच दर बारा कोसार्र बदलते. (Marathi language is the only one in the whole of Maharashtra, the same rate varies fro II. Literature Survey This section presents various techniques for paraphrase detection in Indian languages. Par applied on various languages to detect whether two sentences were paraphrase of each oth Table 2.1: Summary of Literature Review SNo Paper Language: English 1 Title Paraphrase detection using Semantic relatedness based on Synset Shortest path in WordNet Semant approac Author Jun Choi Lee & Yu-N Cheah, August 2016 2 Title Learning Paraphrase Identification with structural alignment Alignm Approa Author Chen Liang, Praveen Paritosh, Vinodh Rajendran, Kenneth D. Forbus, July 2016 3 Title Paraphrase Detection based on Identical Phrase and Similar Word matching SimMat combin metrics Author Hoang-Quoc, Nguyen-Son,Yusuke Miyao, Isao Echizen,2015 4 Title Paraphrase detection using string similarity with synonyms Use of s similari Author Jun Choi Lee & Yu-N Cheah,2015 5 Title Re-examining Machine Translation Metrics for Paraphrase Identification Combin metrics. Author Nitin Madnani, Joel Tetreault, Martin Chodorow, 2012 6 Title Paraphrase Identification Using Weighted Dependencies and word semantics Calcula dissimil using de Author Mihai Lintean and Vaile Rus, 2009 7 Title A Semantic Similarity Approach to paraphrase Detection JCN w with ma Author Fernando and Stevenson, August 2008 8 Title A Metric for Paraphrase Detection Paraphr using S Author Cordeiro, Dias, Brazdil , 2007(IEEE) 9 Title Paraphrase Identification on the Basis of Supervised Machine Learning Techniques Combin and sem Author Kozareva and Montoyo,2006 10 Title CFILT-CORE: Semantic Textual Similarity using Universal Networking Language Describ similari extracti Univers finding pair of s Author Avishek Dan & Pushpak Bhattacharyya, IIT Bombay Mumbai Language : Hindi 11 Title A Novel approach to Paraphrase Hindi Sentences using NLP Creating and anto Author N.Sethi,P. Agrawal, V. Madaan, S.K. Singh, 2016 12 Title Paraphrase Detection in Hindi Language using Syntactic Features of Phrase Random classific method for ide languag Author Rupal Bhargava, Anushka Baoni, Harshit Jain, 2016 13 Title Hindi Paraphrase Detection using Natural Language Processing Techniques & Semantic Similarity Computations Discove semanti done by . (Marathi language is the only one in the whole of Maha- rashtra, the same rate varies from 12 to 12 degrees.) 2 Literature survey This section presents various techniques for paraphrase detection in Indian languages. Paraphrase detection tech- niques have been applied on various languages to detect whether two sentences were paraphrase of each other or not. Table 1. Summary of literature review. SNo Paper Algorithm Accuracy F-score Language: English 1 Title Paraphrase detection using Semantic relatedness based on Synset Shortest path in WordNet Semantic relatedness approach 71.1% 81.1% Author Jun Choi Lee & Yu-N Cheah, August 2016 2 Title Learning Paraphrase Identification with structural alignment Alignment based Approach 78.6% 85.3% Author Chen Liang, Praveen Paritosh, Vinodh Rajendran, Kenneth D. Forbus, July 2016 3 Title Paraphrase Detection based on Identical Phrase and Similar Word matching SimMat metric combined with 8 MT metrics 77.6% 83.9% Author Hoang-Quoc, Nguyen-Son,Yusuke Miyao, Isao Echizen, 2015 4 Title Paraphrase detection using string similarity with synonyms Use of synonyms in text similarity metrics 70.6% 80.4% Author Jun Choi Lee & Yu-N Cheah, 2015 5 Title Re-examining Machine Translation Metrics for Paraphrase Identification Combination of 8 MT metrics 77.4% 84.1% Author Nitin Madnani, Joel Tetreault, Martin Chodorow, 2012 6 Title Paraphrase Identification Using Weighted Dependencies and word semantics Calculate similarity and dissimilarity scores using dependencies 72.41% 81.32% Author Mihai Lintean and Vaile Rus, 2009 7 Title A Semantic Similarity Approach to paraphrase Detection JCN word similarity with matrix 74.1% 82.4% Author Fernando and Stevenson, August 2008 8 Title A Metric for Paraphrase Detection Paraphrase detection using SumoMetric 78.19% 80.92% Author Cordeiro, Dias, Brazdil, 2007 (IEEE) 9 Title Paraphrase Identification on the Basis of Supervised Machine Learning Techniques Combination of lexical and semantic features 76.6% 79.6% Author Kozareva and Montoyo, 2006 10 Title CFILT-CORE: Semantic Textual Similarity using Universal Networking Language Described a syntactic and word level similarity method combining with a semantic extraction method which was based on Universal Networking Language (UNL) for finding semantic similarity score between a pair of sentences Author Avishek Dan & Pushpak Bhattacharyya, IIT Bombay Mumbai (Continued)
  • 3. Detecting Paraphrases in Marathi Language 9 Table 1. Continued SNo Paper Algorithm Accuracy F-score Language: Hindi 11 Title A Novel approach to Paraphrase Hindi Sentences using NLP Creating paraphrase by applying synonyms and antonyms replacement Author N. Sethi, P. Agrawal, V. Madaan, S.K. Singh, 2016 12 Title Paraphrase Detection in Hindi Language using Syntactic Features of Phrase Random forest classifier used for classification and Leven-shtein Distance method used for similarity score calculation for identifying paraphrases in the Hindi language Author Rupal Bhargava, Anushka Baoni, Harshit Jain, 2016 13 Title Hindi Paraphrase Detection using Natural Language Processing Techniques & Semantic Similarity Computations Discovered synonym WordNet using a semantic similarity metric and classification done by paraphrase detection decision tree classifier Author Vani K, Deepa Gupta, 2016 14 Title A Novel Paraphrase Detection Method in Hindi Language using Machine Learning Random forest classifier and different machine learning algorithms were used for removing overlapping words and calculated normalized IDF score for paraphrase detection Author Anuj Saini, 2016 Language: Malayalam 15 Title Malayalam Paraphrase Detection Procured CUSAT Malayalam Wordnet for calculating sentence similarity using matching tokens, lemmas and synonyms Author Sindhu L. and Sumam Mary Idicula, 2016 16 Title Detecting Paraphrase in Indian Languages – Malayalam Calculated cosine similarity score using model named Bag of Word and a threshold for calculating paraphrase score Author Manju K and Sumam Mary Idicula, 2016 17 Title Paraphrase Identification using Malayalam sentences Detecting paraphrase by Statistical measure and semantic analysis Author Ditty Mathew, Dec. 2013 (IEEE) Language : Tamil 18 Title Detection of Paraphrases on Indian Languages Tamil Shallow parser used for extracting sixteen diverse morphological features. Author R. Thangarajan, S. V. Kogilavani, A. Karthic, S. Jawahar, 2016 Languages (Tamil, Malayalam, Hindi, Punjabi) 19 Title Detecting Paraphrases in Indian Languages based on Gradient Tree Boosting Employed a Cosine, Dice Distance, Jaccard Coefficient and METEOR features and Gradient Boosting Tree supervised classification method Author Leilei Kong, Zhenyuan Hao, Kaisheng Chen, Zhongyuan Han, Liuyang Tian, Haoliang Qi, 2016 20 Title Language Independent Paraphrases Detection Probabilistic Neural Network (PNN) has been used to train Jaccard, Cosine Similarity and length normalized Edit Distance language independent feature-set to detect paraphrases in Indian languages Author Sandip Sarkar, Partha Pakray, Saurav Saha, Dipankar Das, Jereemi Bentham, Alexander Gelbukh, 2016 21 Title Detecting Paraphrases in Indian Languages Using Multinomial Logistic Regression Model The paper described about lexical and semantic level similarities between two sentences using a trained multinomial logistic regression model for paraphrase identification and attain maximum f-measure Author Kamal Sarkar, 2016 22 Title Paraphrase Detection in Indian Languages – A Machine Learning Approach The team worked on Hindi, Tamil, Punjabi and Malayalam sentences. The paraphrase decision is taken into consideration by various similarity measures, machine translation evaluation metrics and machine learning framework Author Tanik Saikh, Sudip Kumar, Naskar Sivaji, Bandyopadhyay, 2016
  • 4. 10 Shruti Srivastava and Sharvari Govilkar 3 Proposed system The suggested system is designed to take the input as set of two Marathi sentences. The pre-processing can be done on sentences for collecting actual root words. Then the pre- processed data is given as input to the different statistical and semantic similarity to find paraphrases. The final simi- larity score is generated by combining maximum statistical similarity score and semantic similarity score. The final paraphrase judgement is decided based on a set threshold value which gives the decision of Paraphrase (P) or Non- Paraphrase (NP). In this approach system takes one original Marathi sen- tences and one suspicious paraphrase sentence as an input. The sentences then pre-process through various phases including pre-processing by tokenization and input valida- tion by stop word removing, stemming and morphological analysis. After which the pre-processed data is fetched to the matrix calculation for paraphrase detection. 3.1 Pre-processing The first step is to present two Marathi sentences into text format. Sentence 1: होळीचा सण साजरा करता ंना काळजी . (The environment should be taking care of while celebrating Holi festival.) Sentence 2: पयावणाच भान ठ ेवून होळीचा सण साजरा करावा. (Celebrate the Holi fes- tival after keeping environment in mind.) Figure 1. Proposed system architecture. The pre-processing of the documents involves following methods: Tokenization Tokenization is the method of identification and separation of tokens from two Marathi input sentences. Lexicon is the individual word of the sentence. The word boundaries in the Marathi language are fixed as it is a segmented lan- guage. The spaces between the words are used to separate one word from another. The process of separation of tokens involves removal of delimiters like white spaces and punc- tuation marks. Validation of Devanagari script A Marathi language is written in Devanagari script. Input sentences validation is an important phase due to the language dependent information and type of query pro- vided to the system. The characters in Devanagari script are recognized with Unicode values from UTF-8. Com- paring each and every character with UTF-8 list, invalid Devanagari characters were removed from the sentences. The valid characters in Devanagari script were retained for next phase. Stop word elimination Stop words are the utmost irrelevant words which delay the sentence processing. These stop words are most often occurring words in any document. The list of stop words includes articles, prepositions and other function words. Hence to enhance the processing speed of the system, the stop words needs to identified and removed properly. Stemming Stemming is a very important step in the system. In this phase, suffix list is used to remove suffixes from words for creating exact stem word as the stem may not be the lin- guistic root of the word. Morphological analysis The morphological analyzer identifies the inner structure of the word. The root words from the given are expected to produce from a morphological analyzer. After stemming, the words are evaluated for any inflection. The perfect root words can be produced by generating and matching rules. Addition or replacement of characters to the inflected stem word results into the precise core word. 3.2 Statistical similarity The tokens created in tokenization process are the root words but it is necessary to distinguish the tokens. Sta- tistical similarity is calculated by considering tokens of
  • 5. Detecting Paraphrases in Marathi Language 11 both sentences and compares them on the basis of word- set, word-order, word-vector and sumo-metric (word dis- tance). The system uses 4 statistical methods for statistical comparisons which are explained as follows: A. Jaccard similarity It is based on the similarity and differences between two word sets from the pair of sentences [10]. Jaccard similarity coefficient is calculated by taking ratio of the total number of matched unigrams to the total number of unmatched words in both the sentences. The Jaccard similarity is defined using following equation; Jaccard Similarity (S1, S2) = S1 ∩ S2 S1 ∪ S2 Example: Jaccard Similarity Sentences Score S1 होळी सण साजरा करता ं ना S1 ∪ S2 = 4 काळजी {होळी, सण, साजरा, } S2 पयावण भान ठ ेवून होळी सण साजरा S1 ∩ S2 = 9 Jaccard Similarity (S1, S2) = 4/9 = 0.4 B. Cosine similarity Cosine similarity [10] is the most common statistical fea- ture to measure the similarity between word vectors of two sentences. For each sentence, a word vector of root words of those sentences is formed which interprets the frequency of words in the sentences. Cosine similarity is the ratio of the dot product of those two word vectors and the product of their lengths. Cosine Similarity (S1, S2) = S1 · S2 |S1| |S2| Hence, Cosine similarity (S1, S2) = 0.91 C. Word order similarity The vectors of the two sentences were considered for cal- culating Word order similarity between them. The vectors of two sentences constructed for the calculating using fol- lowing equations: L(Sa) = {(wa1, wa2), (wa1, wa3), . . . , (wa(i−1), wai)} L(Sb) = {(wb1, wb2), (wb1, wb3), . . . , (wb(i−1), wbi)} Where vector (wa1, wa2, . . . , wai) and vector (wb1, wb2, . . . , wbi) constructed from sentences Sa and sen- tence Sb tokens respectively. wx is before wy in (wx, wy) ∈ L (Sa) ∪ L(Sb). The similarity calculation between Sa and Sb can be done [10] on the basis of following equation: WordOrder(Sa, Sb) = |L(Sa) ∩ L(Sb)| |L(Sa) ∪ L(Sb)| L(Sa) ∩ L(Sb) = {होळी सण, होळी साजरा, सण साजरा, होळी ,साजरा ,सण } = 6 WordOrder (Sa, Sb) = |L(Sa) ∩ L(Sb)| |L(Sa) ∪ L(Sb)| = 6 30 = 0.17 D. Sumo metric The Sumo-Metric [12] is on the word distance. This met- ric finds the special lexical links between root words of a pair of Marathi sentences. This metric not only identify paraphrases but also identifies the exact and quasi-exact matches. It can be considered as 1-gram exclusive overlap. S1 and S2 is a pair of sentences given as an input to the system. Let x and y are the numbers of words in those two sentences. The number of words in the intersection of the two sentences, excluding repetitions is represented by λ. To calculate the Sumo-Metric S(.,.), first evaluate the function S(x, y, λ) S(x, y, λ) = α log 2 x λ + β log 2 y λ Where α, β ∈ [0, 1] and α + β = 1. Now the Sumo-Metric S (.,.) can be calculated by the following equation: S(S1, S2) = ( S(x, y, λ) if S(x, y, λ) 1.0 e−k∗S(x,y,λ) otherwise α and β are two core components which involved in the calculation. The exact and quasi exact match pairs are grad- ually penalize by log2 (.) function as for equal pairs the sumo-metric result is exactly zero. Sentence 1: होळी सण साजरा करता ं ना काळजी Sentence 2: भान ठ े वून होळी सण साजरा λ = number of words in the intersection of the two sen- tences = {होळी, सण, साजरा, } = 4 Hence, if α and β = 0.5 Then, S(x, y, λ) = 0.5 ∗ log2 7 4 + 0.5 ∗ log2 6 4 = 0.7 3.3 Semantic similarity Universal Networking language (UNL) generates a graph of semantic representation of sentences. In this method, UNL relation signifies semantic information in a by draw- ing a hyper-graph. In this hyper-graph nodes represents concepts and arcs signifies relations. The hyper-graph con- tains a set of directed binary relations between two con- cepts of a sentence. UNL relation involves mainly 3 important terms which are as follows:
  • 6. 12 Shruti Srivastava and Sharvari Govilkar 1. Universal Words (UWs): These are the root word which signifies word meaning. 2. Relation Labels: These labels tag the various interac- tions between Universal Words. 3. Attribute Labels: It gives extra information for the Universal Words present in a sentence. Two UNL graphs are compared for the similarity indi- rectly compares the semantic content of two sentences. This comparison is useful for finding semantic similarity score. UNL Expression The directed graph generated in this method is known as a “UNL Expression” or “UNL Graph” which gives a long list of the binary relations. The general format of UNL Expression is the following: relation:[scope-ID](from-node, to-node ) A UW is described in each of from-node and to- node. An example for the UNL expression of the sentence “Bharat is writing a novel.” agt(write(icldo)@entry.@present.@progress, Bharat (iofperson)) obj(write(icldo)@entry.@present.@progress, novel(iclbook)) UNL En-conversion process UNL en-cnoversion is a process of translating natural lan- guage text into UNL for syntactic and semantic evaluation. The En-converter scanned the sentence from left to right then the word dictionary is used for matching of mor- phemes and they will be the candidate morphemes for further processing. By using defined rules a syntactic tree is built and a UNL format semantic network is designed for the input sentence until all the words are inputted. UNL graph based Similarity The generated UNL graphs of two sentences are compared for computing semantic similarity using the formula given below. The average of relation match scores and universal word match scores is the actual relation score. The match- ing attributes of the corresponding universal word decides the universal word match score. Figure 2. UNL graph example. Semsim(S1, S2) = S1 ∩ S2 S1 ∪ S2 + X a∈S1,b∈S2 0.75 ∗ [[a.w1 = b.w1a.w2 = b.w2]] |S1 ∪ S2| + 0.75 ∗ [[a.r = b.ra.Ew1 = b.Ew1a.Ew2 = b.Ew2]] |S1 ∪ S2| + 0.6 ∗ [[a.Ew1 = b.Ew1a.Ew2 = b.Ew2]] |S1 ∪ S2| ! Example: Input sentence1: होळी सण साजरा करता ंना काळजी ——————UNL expression for sentence1—————— {unl} mod (काळजी (iofaction):01, )iofnature): 00@entry@pred) seq (काळजी (iofaction):01, साजरा)icldo):02@present) agt (सण(iofevent):02@def, साजरा)icldo):02@present) aoj (होळी (iclevent):03@def, सण)iofevent):02@def) {/unl} ——————UNL expression for sentence1—————— Input sentence2: भान ठ े वून होळी सण साजरा ——————UNL expression for sentence2—————— {unl} mod (भान (iofaction):01, )iofnature): 00@entry@pred) seq (भान (iofaction):01, साजरा)icldo):02@present) agt (सण (iofevent):02@def, साजरा)icldo):02@present) aoj (होळी (iclevent):03@def, सण)iofevent):02@def) {/unl} ——————UNL expression for sentence2—————— Figure 3. UNL1 graph for sentence1. Figure 4. UNL2 graph for sentence2.
  • 7. Detecting Paraphrases in Marathi Language 13 Table 2. Similarity measure score. Types of similarity Measure Value Similarity measure score to consider Statistical similarity measures Jaccard Similarity (Word set) 0.4 Maximum statistical similarity Statsim(S1, S2) = 0.91 Cosine Similarity (Word Vector) 0.91 Word order Similarity 0.17 Sumo Metric 0.7 Semantic similarity measure UNL graph based similarity 0.59 Semantic similarity Semsim(S1, S2) = 0.59 Overall similarity If p = 0.5, q = 0.5 then, Sim (S1, S2) = 0.5 * 0.91 + 0.5 * 0.59 = 0.75 UNL1 and UNL2 denotes the UNL graph for sentence1 and for sentence2 respectively. Hence the semantic Similarity score can be calculated by using following formula, S1 ∩ S2 = {होळी, सण, साजरा, पयावण} = 4 S1 ∪ S2 = {होळी, सण, साजरा, पयावण, काळजी, भान} = 6 Hence, Semsim (S1, S2) = 0.59 3.4 Overall similarity The scores of maximum statistical similarity and seman- tic similarity are combined using following formula for overall similarity calculation of two sentences, Sim (S1, S2) =p*Statsim(S1, S2)+q*Semsim(S1, S2). The values of p and q are considered in such a way that, p + q = 1; p, q ∈ [0, 1] as p and q are the constants which represents involvement of each part to the overall similar- ity calculation, 3.5 Paraphrase judgement As the proposed system produced overall similarity score of 0.75, it can easily conclude that the Marathi sentence pair judged for paraphrasing identifies as a paraphrase pair. Taking into consideration all 5 statistical and 1 seman- tic measures, it can be detected that there are variation in every similarity measure score. However, the overall sim- ilarity score providesan enhanced outcome for paraphrase detection process. 4 Results and discussion 4.1 Working of the system The input to the Paraphrase Identification system will be two Marathi text documents in “.txt” format. One docu- ment will be original sentence and other will be suspicious paraphrase sentence. Figure 5. GUI of the system. Figure 6. Overall similarity Paraphrase Judgement. Since the paraphrase detection system is designed exclu- sively for Marathi sentences, all characters which do not belong to Devanagari script must be eliminated to save processing time and memory. The pre-processing of sen- tences undergoes further processing steps like tokeniza- tion, stop word removal, stemming and morphological analysis and will result into validated output. The overall similarity calculation is done by adding values of maximum statistical similarity and semantic sim- ilarity. The values of both similarities are multiplied by a smoothing factor before addition for considering the contribution of each similarity in overall similarity. The sentence pair is paraphrase or not is decided on basis of score of overall similarity. If the score is greater than 0.8 then the sentences are paraphrase of each other. On click- ing of overall similarity button, system generates overall similarity score and on basis of that it generates paraphrase judgement. 4.2 Experimental analysis All sentence pairs from dataset of five domains were tested domain wise and similarity wise. The graphs were created for comparative study of each similarity measure and to find the best for the system. The best method taking it into consideration while calculating overall similarity score for the system.
  • 8. 14 Shruti Srivastava and Sharvari Govilkar 0 0.5 1 1.5 SL1 SL2 SL3 SL4 SL5 SL6 SL7 SL8 PARAPHRASE SIMILARITY SCORE 4 STASTICAL AND SEMANTIC (UNL) SIMILARITY MEASURES LITERATURE DOMAIN Jaccard Similarity Cosine Similarity Word Order Similarity Sumo Matrix UNL Figure 7. Comparison of Statistical and semantic similarity for literature domain. 4.2.1 Domain wise statistical and semantic similarity The sentence pairs from the political, sports, film, litera- ture and miscellaneous domains were tested separately for both types of similarities. The statistical similarity deals with statistical structure between words of the sentences whereas semantic similarity deals with the meaning. For literature domain and miscellaneous domain, statistical similarity score lies between 0 and 1. The semantic simi- larity score lies between 0 to 2 for literature domain and ranges from 0 to 3 for miscellaneous domain. 4.2.2 Overall similarity Out of 4 statistical similarities, maximum statistical simi- larity is calculated for each sentence pair. Combination of the semantic similarity and maximum statistical similarity considered as overall similarity. The overall similarity score is used for setting the threshold value and give the judg- ment for identifying paraphrase (P) and non-paraphrase pair (NP). 0 0.5 1 1.5 2 2.5 PARAPHRASE SIMILARITY SCORE SEMANTIC (UNL), MAXIMUM STASTICAL , OVERALL SIMILARITY MEASURES SPORTS DOMAIN UNL Max. Stastical OVERALL Figure 8. Overall similarity measures for sports domain. 4.3 Threshold value generation From obtained different similarity scores, it is necessary to highlight a classical problem in classification – thresh- olds. Usually, thresholds are some parameter upon which system takes some decision. After computation and com- parison of all similarity measures, it is essential to decide upon some threshold value which decides whether a sen- tence pair is a paraphrase or not. Thresholds are the parameters which unease the pro- cess of evaluation. In the evaluation process, no threshold value is predefined. Instead, from overall comparison of each similarity on each domain, best threshold is computed upon which each domain is agreed.Threshold computa- tion may be a classical issue of function maximization or optimization. The threshold values plays important role in paraphrase judgement as the sentence pair can divide into 3 groups as per their overall similarity scores viz, Paraphrase (P), Semi-paraphrase (SP) and Non-Paraphrase (NP). The fol- lowing Table 3 gives an example of overall similarity score Table 3. Threshold selection. Sentence pair Overall similarity score Threshold value Paraphrase judgement Sentence 1: हा माणसाचा े आह (Book is the best friend of the man) Sentence 2: जवळचा आह The book is a close friend of man. 1.22 (SM1 from miscellaneous domain) (score = 0.5) Paraphrase (P) Sentence 1: मराठी भाषा िनराळी आह Each region of Maharashtra has a different Marathi language Sentence 2: मराठी भाषा सव जरी एक असली तरी तीच दर बारा कोसावर बदलत Marathi language is the only one in the whole of Maharashtra, the same rate varies from 12 to 12 degrees 0.48 (SL5 from literature domain) (score 0.5 score 0.4) Semi-paraphrase (SP) Sentence 1: वेळी झोळी भरावी Beans are to be filled with virtues at the time of Holi. Sentence 2: होळीसाठी वाईट व मोळी बांधावी Build bad rallies and rituals for Holi. 0.0 (SM4 from miscellaneous domain) (score = 0.0) Non-paraphrase (NP)
  • 9. Detecting Paraphrases in Marathi Language 15 Table 4. Confusion matrix for performance evaluation. Actual results Predicted results Total dataset Similar Dissimilar Similar 46 TP 12 FN 58 Dissimilar 1 FP 15 TN 16 of sentence pair and role of threshold for deciding its section of Paraphrase (P), Semi-paraphrase (SP) and Non- Paraphrase (NP). 4.4 Performance evaluation The performance of the system is evaluated on 4 basic eval- uation measures. These measures are accuracy, precision, recall and F-measure. For calculation of these four perfor- mance measures, the four parameters TP, TN, FP, and FN must be calculated first. The performance of the system was evaluated by creat- ing a confusion matrix and it is given in Table 4. The 46 sentence pairs classified as similar and 12 as dis- similar out of 58 similar sentence pairs by the proposed system. Out of actual 16 dissimilar sentence pairs, the 15 sentence pairs classified as dissimilar and 1 as similar in the used data-set. In Table 4, True positive (TP) and True Negative (TN) are the observations which are correctly pre- dicted hence shown in green color. False positive (FP) and False Negative (FN) must be minimized in order to get a good system so they are shown in red color. True positive (TP) and True Negative (TN) values occur when actual results obtained from the system agreed with the predicted results. False positives (FP) and False negatives (FN), these val- ues occur when the actual results contradicts with the predicted results. Accuracy Accuracy is a ratio of correctly predicted observation to the total observations. Precision Precision is obtained by dividing the number of cor- rectly predicted paraphrases by the number of predicted paraphrases. High precision relates to the low false posi- tive rate. Recall Recall calculation is done taking the ratio of correctly pre- dicted paraphrases to the all referenced paraphrases in actual class. F-measure F-measure considers false positives and false negatives into account and it is the weighted average of Precision and Table 5. Performance analysis of overall similarity. Measure Value ACCURACY 0.82 PRECISION 0.98 RECALL 0.79 F-MEASURE 0.89 Recall. Accuracy gives best value when the values of false positives and false negatives are similar. When the cost of these two is different then consider values of both Precision and Recall which in turn the value of F-measure. Although there is deviation in the similarity score of each method, the overall similarity score provedan improved result. This system considered all the statistical and semantic concerns. The data obtained from the Table 5 prove that the proposed system performed sound work of Paraphrase Identification of Marathi sentences. 5 Dataset and testing 5.1 Dataset The dataset used in this work consist of sentence pairs from political, film, sports, literautre, miselleneous domains. There are 74 marathi sentence pairs in the dataset and each sentence pair has a binary decision accompanying with it which gives the parahrase judgement for being paraphrase pair or not. 5.2 Testing To test the accuracy and F-measure, dataset was divided into a ratio of 75% and 25% for training and testing respec- tively. The performance measures (Accuracy, Precision, Recall and F-Measure) were evaluated for the different similarity measures. A quartile is a form of quantile. 25% is set as the first quartile (Q1) as it is the mid of the smallest number and the median of the data set. The median of the data-set is the second quartile (Q2). The mid value between median and highest value of the data-set is considered as the third quartile (Q3). The decision of paraphrase of the sentence pair is depend on the score of first quartile. If the similarity score is greater than Q1 then the sentence pair is labelled as paraphrase otherwise non-paraphrase. On the basis of this decision accuracy, precision, recall and F-measure was cal- culated and compared for each similarity measure. In paraphrase identification of Marathi sentences, 5 sim- ilarity measures were performed to check similarity score of the sentences. The similarity measures were performed on 58 sentence pairs to identify the highest score among
  • 10. 16 Shruti Srivastava and Sharvari Govilkar Table 6. Performance of the similarity measures. Similarity measure Min Quartile 25% Quartile 50% Median Quartile 75% Max Jaccard 0 0.18 0.29 0.4 0.86 Cosine 0.74 0.885 0.91 0.94 1 Word order 0 0.065 0.12 0.24 4 Sumo metric 0 0.01 0.04 0.72 1 Semantic (UNL) 0 0.235 0.38 0.79 2.58 Overall 0.42 0.58 0.64 0.87 0.87 Table 7. Performance measures analysis. MEASURE JACCARD COSINE WORD-ORDER SUMO METRIC UNL OVERALL ACCURACY 0.77 0.87 0.8 0.83 0.77 0.82 F-MEASURE 0.85 0.92 0.87 0.89 0.85 0.89 RECALL 0.77 0.95 0.75 0.86 0.77 0.79 PRECISION 0.94 0.9 1 0.92 0.94 0.98 4 statistical similarities and the performance of semantic similarity. 5.2.1 Analysis of performance measures As first quartile value is decided threshold for each simi- larity, the confusion matrix is created and values of perfor- mance measures are calculated according to it. From Table 6 it is clear that cosine similarity gives the best score among all similarity methods. Hence for find- ing paraphrases in Marathi Sentences, the Cosine similarity method can be proved effectively accurate. From testing of all similarity measures, it can be clearly convey that cosine similarity is best method among all sta- tistical similarity techniques for finding similarity. When considering all sentences in given dataset, other statistical techniques also performed well for different sentence pair. Word-order similarity is the best to find dissimilar pairs. For taking dissimilarity into consideration, overall simi- larity score is resulted to 0 if any of the statistical similarity or semantic similarity resulted into 0. Otherwise it is cal- 0.77 0.87 0.8 0.83 0.77 0 0.2 0.4 0.6 0.8 1 ACCURACY 0 0 0 0 ACCURACY SCALE JACCORD COSINE WORD-ORDER SUMO METRIC UNL Figure 9. Accuracy comparisons of all similarity techniques. culated by combining the maximum statistical similarity score and semantic similarity score. 6 Conclusion and future scope Mostly the research has been done for English and Indian regional languages like Hindi, Malayalam, Punjabi and Tamil. However no paraphrase detection work has been yet done for Marathi language. This is the first footstep towards detecting Paraphrases in Marathi sentences. The development of social media in Marathi brings the avail- ability of massive data in Marathi languages. Analysis of social data like daily tweets in Twitter is the interest grow- ing field for different purposes. Identifying paraphrases should be widely explored to help many more arenas of Natural language processing field. The proposed system provides paraphrase identifica- tion tool for Marathi sentences using incorporation of both statistical and semantic features. The detailed analysis and performance comparison of all statistical methods con- cludes that cosine similarity measure outperforms among all statistical measures. Semantic similarity carried out through a UNL method plays an important role in achiev- ing 82% accuracy and 89% F-measure. This score proves that the designed tool is extremely supportive for Para- phrase Identification of Marathi Sentences. The system can be enhanced for capturing the similar- ity between Marathi paragraphs and Marathi documents. There is possibility of improving statistical methods to detect more structural similarity types. Misspelled words in the sentences will yield wrong result even if the sen- tences are similar. Pre-processing steps can be enriched for capturing the spelling mistakes and can deal accordingly. Semantic similarity measure score can be improved by con- sidering synonyms of the universal words. The current model is trained on fewer data sets which can be extended
  • 11. Detecting Paraphrases in Marathi Language 17 to incorporate more data. The model constructed can also be changed from static to online Paraphrase Identification tool for Marathi documents. Bibliography [1] Lee, Jun Choi and Cheah, Yu-N (2016) Paraphrase Detection using Semantic Relatedness based on Synset Shortest Path in WordNet. In: International Conference on Advanced Infor- matics: Concepts, Theory and Applications, 16-17 August 2016, Parkroyal Penang Resort. [2] Chen Liang, Praveen Paritosh, Vinodh Rajendran, Kenneth D. Forbus, Learning Paraphrase Identification with Struc- tural Alignment Conference: IJCAI 2016, At New York. [3] Hoang-Quoc Nguyen-Son, Yusuke Miyao, Isao Echizen, Paraphrase Detection Based on Identical Phrase and Similar Word Matching, 29th Pacific Asia Conference on Language, 2015. [4] J.C. Lee, and Y. Cheah. “Paraphrase Detection using String Similarity with Synonyms.” The Fourth Asian Conference on Information Systems, ACIS 2015. [5] Nitin Madnani, Joel Tetreault, Martin Chodorow, Re- examining Machine Translation Metrics for Paraphrase Identification, Conference of the North American Chapter of the ACL: 2012 Association for Computational Linguistics. [6] Mihai Lintean and Vaile Rus, Paraphrase Identification Using Weighted Dependencies and word semantics, Pro- ceedings of the Twenty-Second International FLAIRS Con- ference (2009). [7] Fernando and Stevenson, 2008. A Semantic Similarity Approach to paraphrase Detection, In Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, pages 45–52. Citeseer. [8] Cordeiro, Dias, Brazdil A Metric for Paraphrase Detection Proceedings of the International Multi-Conference on Com- puting in the Global Information Technology (ICCGI’07), IEEE. [9] Kozareva and Montoyo, Paraphrase identification on the basis of supervised machine learning techniques, Advances in Natural Language Processing: 5th International Confer- ence on NLP (FinTAL 2006), Turku, Finland, 524–533. [10] Avishek Dan Pushpak Bhattacharyya, “CFILT-CORE: Semantic Textual Similarity using Universal Networking Language” 2013 Association for Computational. [11] Nandini Sethi, Prateek Agrawal, Vishu Madaan and Sanjay Kumar Singh, A Novel Approach to Paraphrase Hindi Sen- tences using Natural Language Processing, Indian Journal of Science and Technology, Vol 9(28), July 2016. [12] Rupal Bhargava, Anushka Baoni, Harshit Jain “BITS_PILANI@DPIL-FIRE 2016: Paraphrase Detection in Hindi Language using Syntactic Features of Phrase”, DPIL 2016. [13] Vani K, Deepa Gupta. “ASE@DPIL-FIRE2016: Hindi Para- phrase Detection using Natural Language Processing Tech- niques Semantic Similarity Computations” DPIL 2016. [14] Anuj Saini “Anuj@DPIL-FIRE2016: A Novel Paraphrase Detection Method in Hindi Language using Machine Learn- ing”, DPIL 2016. [15] Sindhu L. and Sumam Mary Idicula “CUSAT_NLP@DPIL- FIRE2016:Malayalam Paraphrase Detection”, DPIL 2016. [16] Manju K. and Sumam Mary Idicula,” CUSAT TEAM@DPIL- FIRE2016: Detecting Paraphrase in Indian Languages- Malayalam”, DPIL 2016. [17] Ditty Mathew and Dr. Suman Mary Idicula,” Paraphrase identification of Malayalam sentences – an experience”, IEEE 2013. [18] R. Thangarajan, S. V. Kogilavani, A. Karthic, S. Jawahar, “KEC@DPIL-FIRE2016: Detection of Paraphrases on Indian Languages”, DPIL 2016. [19] Leilei Kong, Zhenyuan Hao, Kaisheng Chen, Zhongyuan Han, Liuyang Tian, Haoliang Qi “HIT2016@DPIL-FIRE2016: Detecting Paraphrases in Indian Languages based on “Gra- dient Tree Boosting”, DPIL 2016. [20] Sandip Sarkar, Partha Pakray, Saurav Saha, Dipankar Das, Jereemi Bentham, Alexander Gelbukh “NLP-NITMZ@DPIL- FIRE 2016: Language Independent Paraphrases Detection”, DPIL 2016. [21] Kamal Sarkar “KS_JU@DPIL-FIRE2016: Detecting Para- phrases in Indian Languages using Multinomial Logistic Regression Model”, DPIL 2016. [22] Tanik Saikh, Sudip Kumar, Naskar Sivaji, Bandyopadhyay “JU_NLP@DPIL-FIRE2016: Paraphrase Detection in Indian Languages - A Machine Learning Approach”, DPIL 2016.