Automatic English Text Correction
@tati_alchueyr
Automatic English Text Correction
Tatiana Al-Chueyr Martins
@tati_alchueyr
Bratislava, 12 March 2016
PyCon SK 2016
tati.__doc__
● Brazilian
● Lives in London (United Kingdom)
● Pythonista and Open Source activist
● Computer Engineer from Unicamp (Brazil)
● Has been developing software since 2002
● Works at EF (Education First)
○ Backend & DevOps leader of CTX Team
help(EF)
● EF: Education First
● International education company
○ Language training
○ Educational travel
○ Academic degree programmes
● Founded in 1965 in Sweden by Bertil Hult
● ~ 40,000 staff
● ~ 500 offices and schools in more than 50 countries (including Slovakia ;))
● Privately held by the Hult family
help(EF.CTX)
● Classroom Technology Experience
● Teaching and learning applications (Web & Mobile)
● Authoring platform
CTX.__team__
● CTX Team
● Team travel
● Malta, November 2015
CTX.backend
● Rafael Cunha de Almeida
● and I
● trying to master Italian cuisine
● London, February 2016
● Although I’m presenting this project alone, Rafa has contributed to it as much as I have :)
objective
● Present a challenge
● Introduce a useful dataset
● Introduce a bunch of Python scripts
● Collect ideas
● Collaboratively build good-quality open source tools that can help deal with this challenge
the challenge
The challenge
Assessing (evaluating) students’ activities and exercises can be:
● Laborious
● Repetitive
● Slow
● In other words... painful!
https://classteaching.files.wordpress.com/2013/10/marking-pile.gif
English Text Correction
hi,
my name is crystal.im nine years old.im form china,im live in jiang xi xing yu.
there are tow people in my family:my mother,my father.
my mother is thirty-six years old,my father is thirty-seven years old
EFCamDat - C219811
Kinds of mistakes highlighted: capitalization, spelling, verb tense
There are “only” 89 writings left to assess this week...
The challenge
Implement algorithms and tools that can help teachers assess essays written in English
A similar feature is available in several applications (including LibreOffice, Google Apps, MS Word):
● Highlight (potential) mistakes while the user types in a text area
The challenge
● Input:
○ English text
● Output:
○ List of items containing:
■ Position in text
■ Kind of potential mistake (e.g. preposition, punctuation, article, spelling, etc.)
■ Proposed correction
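The output described above could be modelled as a list of small records. This is only a minimal sketch: the names `Mistake` and `find_lowercase_i` are hypothetical and not taken from the project's code, and the toy detector exists only to show the shape of the output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mistake:
    """One potential mistake found in a text (hypothetical structure)."""
    start: int                 # index of the first character of the mistake
    end: int                   # index one past the last character
    kind: str                  # e.g. "spelling", "punctuation", "article"
    suggestion: Optional[str]  # proposed correction, if any


def find_lowercase_i(text: str) -> List[Mistake]:
    """Toy detector: flag the pronoun "i" written in lower case."""
    mistakes = []
    position = 0
    for word in text.split(" "):
        if word == "i":
            mistakes.append(Mistake(position, position + 1, "capitalization", "I"))
        position += len(word) + 1  # +1 for the space consumed by split
    return mistakes
```

Calling `find_lowercase_i("yesterday i went home")` returns one `Mistake` whose `start` points at the lower-case "i" and whose `suggestion` is "I".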
the dataset
The dataset
EFCamDAT
● 551,036 written essays
○ 2,897,788 sentences
○ 32,980,407 word tokens
● by 85,864 learners
● 16 levels of proficiency
● 172 nationalities
● annotated with corrections by English teachers
The dataset
Examples of essay topics
● Introducing yourself by email
● Writing an online profile
● Describing your favourite day
● Telling someone what you’re doing
● Replying to a new penpal
● Writing about what you do
● Writing a resume
● Giving instructions to play a game
● Reviewing a song for a website
● Writing an apology email
● Writing a movie review
● Turning down an invitation
● Giving advice about budgeting
● Covering a news story
● Researching a legendary creature
The dataset
Examples of learners’ nationalities
● 36.9% Brazilians
● 18.7% Chinese
● 8.5% Russians
● 7.9% Mexicans
● 5.6% Germans
● 4.3% French
● ...
The dataset
EFCamDAT
● EF-Cambridge Open Language Database
● Partnership between:
○ University of Cambridge (Department of Theoretical and Applied Linguistics)
■ EF-Research Unit
○ EF Education First
● Data collected from Englishtown
○ EF learning environment (online English school)
The dataset
Types of mistakes annotated
● x >> y: change from x to y
● AG: agreement
● AR: article
● CO: combine sentence
● C: capitalization
● D: delete
● EX: expression or idiom
● HL: highlight
● I(x): insert x
● MW: missing word
● NS: new sentence
● NSW: no such word
● PH: phraseology
● PL: plural
● PO: possessive
● PR: preposition
● PS: part of speech
● PU: punctuation
● SI: singular
● SP: spelling
● VT: verb tense
● WC: word choice
● WO: word order
The dataset
● 10 most common mistakes (chart)
The dataset
How to get it?
● https://corpus.mml.cam.ac.uk/efcamdat1/access.php
Licence:
● Non-commercial research use
● Commercial use subject to prior agreement
● https://corpus.mml.cam.ac.uk/efcamdat1/EFCamDAT-USERAGREEMENT.pdf
The dataset
Once you’ve registered
● You can filter the dataset
● and export it to an XML file
a bunch of Python scripts
A bunch of (Python) scripts
Disclaimer
Code developed using the Extreme Go Horse methodology during hack days
They are a POC and lack:
- Proper automated tests
- Proper code design & API
- Documentation
https://gist.github.com/banaslee/4147370
A bunch of (Python) scripts
What do they do?
1. Fix the XML files
2. Convert the XML files into good looking JSON files
3. Implement heuristics to identify some common English mistakes
○ For now: spelling, capitalization and articles
4. Analyse how well the heuristics performed
A bunch of (Python) scripts
How to download them?
● https://github.com/ef-ctx/righter
Licence
● Apache version 2.0
Hands on
Mistake identification
We wrote functions that apply heuristics and rules to detect mistakes related to:
1. Spelling
2. Capitalization
3. Articles
Efficiency
To check how well the heuristics work:
● We wrote a few unit tests
● Before committing any change, we evaluated how close we got to the teachers’ annotations, using:
○ Precision
○ Recall
○ F-score
● We print a side-by-side comparison of what the teacher annotated and what the algorithm identified
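The three metrics above can be computed from two sets: what the teacher annotated and what the algorithm flagged. The sketch below is illustrative and not the project's actual evaluation code. The function name `precision_recall_f1`, the representation of a mistake as a `(position, kind)` tuple, and the choice to return 0.0 when a set is empty are all assumptions. The F-score used here is the balanced F1, the harmonic mean of precision and recall.

```python
def precision_recall_f1(expected: set, predicted: set):
    """Compare predicted mistakes with the teacher's annotations.

    Both arguments are sets of hashable items, for example
    (position, kind) tuples.
    """
    true_positives = len(expected & predicted)
    # Precision: share of flagged items the teacher also annotated
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the teacher's annotations that were found
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


teacher = {(3, "SP"), (10, "C"), (25, "AR")}
algorithm = {(3, "SP"), (10, "C"), (40, "SP")}
precision, recall, f1 = precision_recall_f1(teacher, algorithm)
```

In this made-up example two of the three flagged items match the teacher's annotations, and two of the teacher's three annotations are found, so precision, recall and F1 are all 2/3.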
Efficiency
https://en.wikipedia.org/wiki/Precision_and_recall
Efficiency
F-Score
Spelling
Spelling: heuristics
1. Remove Unicode symbols (e.g. —)
2. Transform diacritics (e.g. é -> e)
○ This is particularly important for names
3. Remove punctuation (e.g. !, ?, .)
4. Check if the word:
○ Is in the dictionary (case-insensitive)
○ Has digits
○ Is in the names file (created with domain-specific names, e.g. Englishtown)
5. If none of these is true, the word is probably misspelled
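The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: `DICTIONARY` and `NAMES` are tiny stand-ins for the real word list and names file, the function names are hypothetical, and punctuation is only stripped from the ends of a word so that hyphenated words stay intact.

```python
import string
import unicodedata

# Tiny stand-ins for the real dictionary and names file (assumptions).
DICTIONARY = {"my", "name", "is", "cafe", "i", "am", "from", "years", "old"}
NAMES = {"englishtown"}


def normalize(word: str) -> str:
    """Fold diacritics, drop non-ASCII symbols, strip surrounding punctuation."""
    # NFKD splits "é" into "e" plus a combining accent
    decomposed = unicodedata.normalize("NFKD", word)
    # Encoding to ASCII drops the accents and any other non-ASCII symbol
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.strip(string.punctuation)


def is_probably_misspelled(word: str) -> bool:
    cleaned = normalize(word)
    if not cleaned:
        return False  # nothing left, e.g. the token was only a symbol
    if any(char.isdigit() for char in cleaned):
        return False  # words with digits are not checked
    lowered = cleaned.lower()
    return lowered not in DICTIONARY and lowered not in NAMES
```

With these stand-in lists, `is_probably_misspelled("café!")` and `is_probably_misspelled("Englishtown")` are both false, while `is_probably_misspelled("nmae")` is true.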
Spelling: results
Summary:
● total essays: 85,629
● mean precision: 0.7128 (std: 0.3580)
● mean recall: 0.6535 (std: 0.4212)
Spelling: precision and recall per learner level
Spelling: F-score per nationality
Capitalization
Capitalization: heuristics
1. Check if word starts a sentence
○ Split on punctuation (!, ., ?, etc.)
2. Check if word is a known capitalized word
○ First person (I)
○ Day of the week
○ Month
○ Language (e.g. English, Spanish, French, etc.)
○ Country
○ Names (selected from corpus to match context-specific names)
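The two checks above might look like the following sketch. It is an assumption-laden illustration, not the project's code: `ALWAYS_CAPITALIZED` is a small stand-in for the real lists of days, months, languages, countries and names, and `find_capitalization_mistakes` is a hypothetical name.

```python
import re

# Small stand-ins for the real word lists (assumptions for illustration).
ALWAYS_CAPITALIZED = {
    "i", "monday", "sunday", "march", "english", "spanish", "china", "slovakia",
}


def find_capitalization_mistakes(text: str) -> list:
    """Return (position, word) pairs for words that should start with a capital."""
    mistakes = []
    sentence_start = True
    # Tokens are either words or sentence-ending punctuation marks
    for match in re.finditer(r"[A-Za-z][A-Za-z']*|[.!?]", text):
        token = match.group()
        if token in ".!?":
            sentence_start = True  # the next word opens a new sentence
            continue
        needs_capital = sentence_start or token.lower() in ALWAYS_CAPITALIZED
        if needs_capital and not token[0].isupper():
            mistakes.append((match.start(), token))
        sentence_start = False
    return mistakes
```

For the text `"hi. my name is crystal and i am from china."` this flags "hi", "my", "i" and "china". The name "crystal" is missed because it is not in the stand-in list, which is the kind of gap a context-specific names file is meant to close.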
Capitalization: results
Summary:
● total essays: 76,980
● mean precision: 0.5714 (std: 0.4005)
● mean recall: 0.5550 (std: 0.4472)
Capitalization: precision and recall per learner level
Capitalization: F-score per nationality
Articles
Articles: heuristics
1. Flag “a” used before a word that starts with a vowel
2. Flag “an” used before a word that starts with a consonant
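A sketch of these two rules, using a regular expression to pair each "a" or "an" with the word that follows it. The function name `find_article_mistakes` is hypothetical and this is not the project's code. The check is purely spelling-based, so it wrongly flags sound-based cases such as "an hour" or "a university", and it says nothing about missing or superfluous articles.

```python
import re

VOWELS = "aeiou"


def find_article_mistakes(text: str) -> list:
    """Return (position, article, next_word) for suspicious "a"/"an" uses."""
    mistakes = []
    # Match "a" or "an" as a whole word, followed by the next word
    for match in re.finditer(r"\b(an?)\s+([A-Za-z]+)", text, flags=re.IGNORECASE):
        article, next_word = match.group(1), match.group(2)
        starts_with_vowel = next_word[0].lower() in VOWELS
        uses_an = article.lower() == "an"
        # "an" should go with a vowel, "a" with a consonant (by spelling only)
        if uses_an != starts_with_vowel:
            mistakes.append((match.start(1), article, next_word))
    return mistakes
```

For `"I ate a apple and an banana with an egg."` the function flags "a apple" and "an banana" and leaves "an egg" alone.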
Articles: results
Summary:
● total items: 47,054
● average precision: 0.9724 (std: 0.1602)
● average recall: 0.0718 (std: 0.2463)
Articles: precision and recall per learner level
Articles: F-score per nationality
Overview
Mistake identification per learner level
● Efficiency of the current heuristics
ideas
Next steps
● Clean up code
● Spelling
○ Use probabilistic models
■ http://norvig.com/spell-correct.html
● Capitalization
○ POS-tagging to identify names of people, organizations, places
● Articles
○ POS-tagging
○ Deal with plurals
○ Define heuristics for dealing with definite articles (the)
Next steps
● Add to the user interface of EF Class
● Collect feedback from end users (teachers)
● Algorithm for proposing the correct forms
● Deal with the other kinds of mistakes
● Implement a classifier using NLP (natural language processing), so that end users can tell us whether the suggestions are good and we can learn from their feedback
PyCon SK is not over...
Join the Coding Dojo tomorrow! (13/03)
● Sunday (13/03)
● 9:00 - 12:00
● Organizer:
○ Rodolfo Carvalho
http://codingdojo.org/cgi-bin/index.pl?WhatIsCodingDojo
https://www.youtube.com/watch?v=vqnwQ3oVM1M
Questions?
Thanks :)
@tati_alchueyr
ctx.tech.backend@ef.com

So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 

Kürzlich hochgeladen (20)

Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 

Automatic English text correction

  • 1. Automatic English Text Correction Tatiana Al-Chueyr Martins @tati_alchueyr Bratislava, 12 March 2016 PyCon SK 2016
  • 2. tati.__doc__ ● Brazilian ● Lives in London (United Kingdom) ● Pythonista and Open Source activist ● Computer Engineer from Unicamp (Brazil) ● Has been developing software since 2002 ● Works at EF (Education First) ○ Backend & DevOps leader of the CTX Team
  • 3. help(EF) ● EF: Education First ● International education company ○ Language training ○ Educational travel ○ Academic degree programmes ● Founded in 1965 in Sweden by Bertil Hult ● ~ 40,000 staff ● ~ 500 offices and schools in more than 50 countries (including Slovakia ;)) ● Privately held by the Hult family
  • 4. help(EF.CTX) ● Classroom Technology Experience ● Teaching and learning applications (Web & Mobile) ● Authoring platform
  • 5. CTX.__team__ ● CTX Team ● Team travel ● Malta, November 2015
  • 6. CTX.backend ● Rafael Cunha de Almeida ● and I ● trying to master Italian cooking ● London, February 2016 ● Although I’m presenting this project alone, Rafa has contributed to it as much as I have :)
  • 7. objective ● Present a challenge ● Introduce a useful dataset ● Introduce a bunch of Python scripts ● Collect ideas ● Collaboratively build good-quality open source tools that can help deal with this challenge
  • 8. the challenge
  • 9. The challenge Assessing (evaluating) students’ activities and exercises can be: ● Laborious ● Repetitive ● Slow ● In other words... painful! https://classteaching.files.wordpress.com/2013/10/marking-pile.gif
  • 10. English Text Correction hi, my name is crystal. im nine years old. im form china, im live in jiang xi xing yu. there are tow people in my family: my mother, my father. my mother is thirty-six years old, my father is thirty-seven years old EFCamDat - C219811
  • 11. English Text Correction hi, my name is crystal. im nine years old. im form china, im live in jiang xi xing yu. there are tow people in my family: my mother, my father. my mother is thirty-six years old, my father is thirty-seven years old EFCamDat - C219811 capitalization
  • 12. English Text Correction hi, my name is crystal. im nine years old. im form china, im live in jiang xi xing yu. there are tow people in my family: my mother, my father. my mother is thirty-six years old, my father is thirty-seven years old EFCamDat - C219811 capitalization, spelling
  • 13. English Text Correction hi, my name is crystal. im nine years old. im form china, im live in jiang xi xing yu. there are tow people in my family: my mother, my father. my mother is thirty-six years old, my father is thirty-seven years old EFCamDat - C219811 capitalization, spelling, verb tense
  • 14. English Text Correction hi, my name is crystal. im nine years old. im form china, im live in jiang xi xing yu. there are tow people in my family: my mother, my father. my mother is thirty-six years old, my father is thirty-seven years old EFCamDat - C219811 capitalization, spelling, verb tense There are “only” 89 writings left to assess this week...
  • 15. The challenge Implement algorithms and tools that can help (teachers) assess essays written in English. Example, already available in several applications (including LibreOffice, Google Apps, MS Word): ● Highlight (potential) mistakes while the user types in a text area
  • 16. The challenge ● Input: ○ English text ● Output: ○ List of items containing: ■ Position in text ■ Kind of potential mistake (e.g. preposition, punctuation, article, spelling, etc.) ■ Proposed correction
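A minimal sketch of what one output item could look like in Python. The `PotentialMistake` class, its field names, and the toy `find_mistakes` detector are illustrative assumptions and are not taken from the project's scripts:

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PotentialMistake:
    """One item of the output list (hypothetical structure)."""
    start: int                        # character offset where the flagged span begins
    end: int                          # character offset just past the flagged span
    kind: str                         # e.g. "spelling", "capitalization", "article"
    suggestion: Optional[str] = None  # proposed correction, if any


def find_mistakes(text: str) -> List[PotentialMistake]:
    """Toy detector: flags a standalone lowercase "i" as a capitalization issue."""
    return [
        PotentialMistake(m.start(), m.end(), "capitalization", "I")
        for m in re.finditer(r"\bi\b", text)
    ]


if __name__ == "__main__":
    for item in find_mistakes("hi, i am nine"):
        print(item)
```

Keeping positions as character offsets would let a user interface highlight the span without re-tokenizing the text.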
  • 17. the dataset
  • 18. The dataset EFCamDAT ● 551,036 written essays ○ 2,897,788 sentences ○ 32,980,407 word tokens ● by 85,864 learners ● 16 levels of proficiency ● 172 nationalities ● annotated with corrections by English teachers
  • 19. The dataset Examples of essay topics ● Introducing yourself by email ● Writing an online profile ● Describing your favourite day ● Telling someone what you’re doing ● Replying to a new penpal ● Writing about what you do ● Writing a resume ● Giving instructions to play a game ● Reviewing a song for a website ● Writing an apology email ● Writing a movie review ● Turning down an invitation ● Giving advice about budgeting ● Covering a news story ● Researching a legendary creature
  • 20. The dataset Examples of learners’ nationalities ● 36.9% Brazilians ● 18.7% Chinese ● 8.5% Russians ● 7.9% Mexicans ● 5.6% Germans ● 4.3% French ● ...
  • 21. The dataset EFCamDAT ● EF-Cambridge Open Language Database ● Partnership between: ○ University of Cambridge (Department of Theoretical and Applied Linguistics) ■ EF-Research Unit ○ EF Education First ● Data collected from Englishtown ○ EF learning environment (online English school)
  • 22. The dataset Types of mistakes annotated ● x >> y: change from x to y ● AG: agreement ● AR: article ● CO: combine sentences ● C: capitalization ● D: delete ● EX: expression of idiom ● HL: highlight ● I(x): insert x ● MW: missing word ● NS: new sentence ● NSW: no such word ● PH: phraseology ● PL: plural ● PO: possessive ● PR: preposition ● PS: part of speech ● PU: punctuation ● SI: singular ● SP: spelling ● VT: verb tense ● WC: word choice ● WO: word order
  • 23. The dataset ● 10 most common mistakes
  • 24. The dataset How to get it? ● https://corpus.mml.cam.ac.uk/efcamdat1/access.php Licence: ● Non-commercial research use ● Commercial use subject to agreement ● https://corpus.mml.cam.ac.uk/efcamdat1/EFCamDAT-USERAGREEMENT.pdf
  • 25. The dataset
  • 26. The dataset
  • 27. The dataset Once you’ve registered, it is possible to: ● Filter the dataset ● Export the dataset to an XML file
  • 28. The dataset
  • 29. a bunch of Python scripts
  • 30. A bunch of (Python) scripts Disclaimer: the code was developed using the Extreme Go Horse Methodology during hack days. The scripts are a proof of concept (POC) and lack: - Proper automated tests - Proper code design & API - Documentation https://gist.github.com/banaslee/4147370
  • 31. A bunch of (Python) scripts What do they do? 1. Fix the XML files 2. Convert the XML files into good-looking JSON files 3. Implement heuristics to identify some common English mistakes ○ For now: spelling, capitalization and articles 4. Analyse how efficient the algorithms are
  • 32. A bunch of (Python) scripts How to download them? ● https://github.com/ef-ctx/righter Licence ● Apache version 2.0
  • 33. Hands on
  • 34. Mistake identification We wrote functions that apply heuristics and rules to detect mistakes related to: 1. Spelling 2. Capitalization 3. Articles
  • 35. Efficiency To check how well the heuristics perform: ● We wrote a few unit tests ● Before committing any change, we evaluate ○ How close we get to the teachers’ annotations, using: ■ Precision ■ Recall ■ F-score ● We print a side-by-side comparison of what the teacher annotated and what the algorithm identified
  • 36. Efficiency https://en.wikipedia.org/wiki/Precision_and_recall
  • 37. Efficiency F-Score
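The three metrics can be sketched as below for a single essay, treating the teacher's annotations and the algorithm's findings as sets of items. The `(position, kind)` tuple representation and the choice to return 0.0 when a denominator is empty are assumptions made for illustration, and the F-score shown is the balanced F1 variant:

```python
def precision_recall_f1(expected, found):
    """Compare algorithm findings against teacher annotations for one essay.

    expected: iterable of items the teacher annotated
    found:    iterable of items the algorithm flagged
    """
    expected, found = set(expected), set(found)
    true_positives = len(expected & found)

    # Precision: share of flagged items that the teacher also annotated
    precision = true_positives / len(found) if found else 0.0
    # Recall: share of teacher annotations that the algorithm flagged
    recall = true_positives / len(expected) if expected else 0.0
    # F1: harmonic mean of precision and recall
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    teacher = {(0, "SP"), (10, "C"), (20, "AR")}
    algorithm = {(0, "SP"), (10, "C"), (30, "SP"), (40, "SP")}
    print(precision_recall_f1(teacher, algorithm))
```

In this example two of the four flagged items match the teacher, so precision is 0.5, while two of the three teacher annotations are found, so recall is 2/3.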
  • 38. Spelling
  • 39. Spelling: heuristics 1. Remove unicode symbols (e.g. —) 2. Transform diacritics (e.g. é -> e) ○ This is particularly important for names 3. Remove punctuation (e.g. !, ?, .) 4. Check if the word: ○ Is in the dictionary (case insensitive) ○ Has digits ○ Is in the names file (created with domain-specific names; e.g. Englishtown) 5. If none of these is true, the word is probably misspelled
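The heuristics above can be sketched roughly as follows. This is not the code from the repository: the function names, and passing the dictionary and names file in as plain Python sets, are assumptions. Diacritics are handled before other non-ASCII symbols are dropped, so that "é" becomes "e" instead of disappearing:

```python
import string
import unicodedata


def normalize(token):
    """Apply steps 1-3: fold diacritics, drop other symbols, strip punctuation."""
    # Decompose accented characters and drop the combining marks (é -> e)
    decomposed = unicodedata.normalize("NFKD", token)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Drop any remaining non-ASCII symbols (dashes, curly quotes, ...)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    # Strip punctuation attached to either end of the word
    return ascii_only.strip(string.punctuation)


def is_probably_misspelled(token, dictionary, names):
    """Apply steps 4-5. `dictionary` holds lowercase words, `names` known names."""
    word = normalize(token)
    if not word:
        return False              # nothing left to check
    if any(c.isdigit() for c in word):
        return False              # tokens with digits are not judged
    if word.lower() in dictionary:
        return False              # case-insensitive dictionary lookup
    if word in names:
        return False              # domain-specific names such as Englishtown
    return True


if __name__ == "__main__":
    dictionary = {"there", "are", "two", "people"}
    names = {"Englishtown"}
    for token in "there are tow people".split():
        print(token, is_probably_misspelled(token, dictionary, names))
```

A membership test like this only detects unknown words. It cannot catch a real word used in the wrong place, such as "form" written for "from" in the sample essay.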
  • 40. Spelling: results Summary: ● total essays: 85,629 ● mean precision: 0.7128 (std: 0.3580) ● mean recall: 0.6535 (std: 0.4212)
  • 41. Spelling: precision and recall per learner level
  • 42. Spelling: F-score per nationality
  • 43. Capitalization
  • 44. Capitalization: heuristics 1. Check if the word starts a sentence ○ Split on punctuation (!, ., ?, etc.) 2. Check if the word is a known capitalized word ○ First person (I) ○ Day of the week ○ Month ○ Language (e.g. English, Spanish, French, etc.) ○ Country ○ Names (selected from the corpus to match context-specific names)
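A rough sketch of the two capitalization checks. The tiny `ALWAYS_CAPITALIZED` set and the function name are hypothetical stand-ins for the real word lists (days, months, languages, countries, names), and the tokenizer is a simple regular expression chosen for illustration:

```python
import re

# Hypothetical sample; a real list would cover days, months, languages,
# countries and context-specific names.
ALWAYS_CAPITALIZED = {"i", "monday", "march", "english", "china"}


def find_capitalization_issues(text):
    """Return (offset, word) pairs for words that should start with a capital."""
    issues = []
    sentence_start = True  # the first word of the text starts a sentence
    for match in re.finditer(r"[A-Za-z][A-Za-z']*|[.!?]", text):
        token = match.group()
        if token in ".!?":
            sentence_start = True  # the next word starts a new sentence
            continue
        needs_capital = sentence_start or token.lower() in ALWAYS_CAPITALIZED
        if needs_capital and not token[0].isupper():
            issues.append((match.start(), token))
        sentence_start = False
    return issues


if __name__ == "__main__":
    print(find_capitalization_issues("hi, my name is crystal. i am from china."))
```

Note that "crystal" is not flagged here because it is missing from the sample set, which shows why the names list matters for recall.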
  • 45. Capitalization: results Summary: ● total essays: 76,980 ● mean precision: 0.5714 (std: 0.4005) ● mean recall: 0.5550 (std: 0.4472)
  • 46. Capitalization: precision and recall per learner level
  • 47. Capitalization: F-score per nationality
  • 48. Articles
  • 49. Articles: heuristics 1. Flag “a” used before a word that starts with a vowel 2. Flag “an” used before a word that starts with a consonant
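The two article rules can be sketched with a single regular expression. The function name and the `(offset, suggestion)` return format are assumptions made for illustration. The check looks at the first letter only, so words where spelling and sound disagree ("an hour", "a university") are judged incorrectly:

```python
import re

VOWELS = "aeiou"


def find_article_issues(text):
    """Return (offset, suggested_article) pairs for a/an mismatches."""
    issues = []
    pattern = r"\b(a|an)\s+([A-Za-z]+)"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        article = match.group(1).lower()
        next_word = match.group(2)
        starts_with_vowel = next_word[0].lower() in VOWELS
        if article == "a" and starts_with_vowel:
            issues.append((match.start(1), "an"))   # rule 1: "a" before a vowel
        elif article == "an" and not starts_with_vowel:
            issues.append((match.start(1), "a"))    # rule 2: "an" before a consonant
    return issues


if __name__ == "__main__":
    print(find_article_issues("I ate a apple and an banana in a hour."))
```

Rules like these only cover a/an confusion. Missing or superfluous articles and uses of "the" are out of their reach.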
  • 50. Articles: results Summary: ● total items: 47,054 ● average precision: 0.9724 (std: 0.1602) ● average recall: 0.0718 (std: 0.2463)
  • 51. Articles: results Summary: ● total essays: 76,980 ● mean precision: 0.5714 (std: 0.4005) ● mean recall: 0.5550 (std: 0.4472)
  • 52. Articles: precision and recall per learner level
  • 53. Articles: F-score per nationality
  • 54. Overview
  • 55. Mistake identification per learner level ● Efficiency of current heuristics
  • 56. ideas
  • 57. Next steps ● Clean up code ● Spelling ○ Use probabilistic models ■ http://norvig.com/spell-correct.html ● Capitalization ○ POS-tagging to identify names of people, organizations, places ● Articles ○ POS-tagging ○ Deal with plurals ○ Define heuristics for dealing with definite articles (the)
  • 58. Next steps ● Add to the user interface of EF Class ● Collect feedback from end users (teachers) ● Write an algorithm for proposing the correct forms ● Deal with the other kinds of mistakes ● Implement a classifier using NLP (natural language processing) that takes end users’ feedback on whether the suggestions are good and learns from it
  • 59. Ideas
  • 60. PyCon SK is not over...
  • 61. Join the Coding Dojo tomorrow! (13/03) ● Sunday (13/03) ● 9:00 - 12:00 ● Organizer: ○ Rodolfo Carvalho http://codingdojo.org/cgi-bin/index.pl?WhatIsCodingDojo https://www.youtube.com/watch?v=vqnwQ3oVM1M
  • 62. Questions? Thanks :) @tati_alchueyr ctx.tech.backend@ef.com