A presentation about common mistakes English learners make, and how some of them (spelling, capitalization and articles) can be identified automatically. It was given at PyCon SK on 12 March 2016. Many of the results stem from the partnership between the University of Cambridge and Education First.
1. Automatic English Text Correction
@tati_alchueyr
Automatic English Text Correction
Tatiana Al-Chueyr Martins
@tati_alchueyr
Bratislava, 12 March 2016
PyCon SK 2016
2. Automatic English Text Correction
tati.__doc__
● Brazilian
● Lives in London (United Kingdom)
● Pythonista and Open Source activist
● Computer Engineer from Unicamp (Brazil)
● Has been developing software since 2002
● Works at EF (Education First)
○ Backend & DevOps leader of CTX Team
3. Automatic English Text Correction
help(EF)
● EF: Education First
● International education company
○ Language training
○ Educational travel
○ Academic degree programs
● Founded in 1965 in Sweden by Bertil Hult
● ~ 40,000 staff
● ~ 500 offices and schools in more than 50 countries (including Slovakia ;))
● Privately held by the Hult family
4. Automatic English Text Correction
help(EF.CTX)
● Classroom Technology Experience
● Teaching and learning applications (Web & Mobile)
● Authoring platform
5. Automatic English Text Correction
CTX.__team__
● CTX Team
● Team travel
● Malta, November 2015
6. Automatic English Text Correction
CTX.backend
● Rafael Cunha de Almeida
● and I
● trying to master Italian cuisine
● London, February 2016
● Although I’m presenting this project alone, Rafa has contributed to it as much as I have :)
7. Automatic English Text Correction
objective
● Present a challenge
● Introduce a useful dataset
● Introduce a bunch of Python scripts
● Collect ideas
● Collaboratively build good-quality open source tools that can help deal with this challenge
9. Automatic English Text Correction
The challenge
Assessing (evaluating) students’ activities & exercises can be:
● Laborious
● Repetitive
● Slow
● In other words... painful!
https://classteaching.files.wordpress.com/2013/10/marking-pile.gif
10. Automatic English Text Correction
English Text Correction
hi,
my name is crystal.im nine years old.im form china,im live in jiang xi xing yu.
there are tow people in my family:my mother,my father.
my mother is thirty-six years old,my father is thirty-seven years old
EFCamDat - C219811
14. Automatic English Text Correction
English Text Correction
hi,
my name is crystal.im nine years old.im form china,im live in jiang xi xing yu.
there are tow people in my family:my mother,my father.
my mother is thirty-six years old,my father is thirty-seven years old
EFCamDat - C219811
capitalization
spelling
verb tense
There are “only” 89 writings left to assess this week...
15. Automatic English Text Correction
The challenge
Implement algorithms and tools that can help (teachers) assess essays written in English
Example of a feature already available in several applications (including LibreOffice, Google Apps, MS Word):
● Highlight (potential) mistakes while the user types in a text area
16. Automatic English Text Correction
The challenge
● Input:
○ English text
● Output:
○ List of items containing:
■ Position in text
■ Kind of potential mistake (e.g. preposition, punctuation, article, spelling, etc.)
■ Proposal of correction
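The output described above could be modelled roughly as follows. This is only a sketch: the `Mistake` class, its field names and the `apply_suggestions` helper are hypothetical names chosen for illustration, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mistake:
    """One potential mistake found in a text (hypothetical structure)."""
    start: int                        # offset of the first character
    end: int                          # offset just past the last character
    kind: str                         # e.g. "SP" (spelling), "C" (capitalization)
    suggestion: Optional[str] = None  # proposed correction, if any


def apply_suggestions(text, mistakes):
    """Return the text with every suggested correction applied."""
    # Work from right to left so that earlier offsets stay valid.
    for m in sorted(mistakes, key=lambda m: m.start, reverse=True):
        if m.suggestion is not None:
            text = text[:m.start] + m.suggestion + text[m.end:]
    return text


text = "there are tow people in my family"
mistakes = [Mistake(0, 5, "C", "There"), Mistake(10, 13, "SP", "two")]
corrected = apply_suggestions(text, mistakes)
```

Storing character offsets rather than word indexes keeps the structure independent of how the text is tokenized.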
18. Automatic English Text Correction
The dataset
EFCamDAT
● 551,036 written essays
○ 2,897,788 sentences
○ 32,980,407 word tokens
● by 85,864 learners
● 16 levels of proficiency
● 172 nationalities
● annotated with corrections by English teachers
19. Automatic English Text Correction
The dataset
Examples of essay topics
● Introducing yourself by email
● Writing an online profile
● Describing your favourite day
● Telling someone what you’re doing
● Replying to a new penpal
● Writing about what you do
● Writing a resume
● Giving instructions to play a game
● Reviewing a song for a website
● Writing an apology email
● Writing a movie review
● Turning down an invitation
● Giving advice about budgeting
● Covering a news story
● Researching a legendary creature
20. Automatic English Text Correction
The dataset
Examples of learners’ nationalities
● 36.9% Brazilians
● 18.7% Chinese
● 8.5% Russians
● 7.9% Mexicans
● 5.6% Germans
● 4.3% French
● ...
21. Automatic English Text Correction
The dataset
EFCamDAT
● EF-Cambridge Open Language Database
● Partnership between:
○ University of Cambridge (Department of Theoretical and Applied Linguistics)
■ EF-Research Unit
○ EF Education First
● Data collected from Englishtown
○ EF learning environment (online English school)
22. Automatic English Text Correction
The dataset
Types of mistakes annotated
● x >> y: change from x to y
● AG: agreement
● AR: article
● CO: combine sentences
● C: capitalization
● D: delete
● EX: expression or idiom
● HL: highlight
● I(x): insert x
● MW: missing word
● NS: new sentence
● NSW: no such word
● PH: phraseology
● PL: plural
● PO: possessive
● PR: preposition
● PS: part of speech
● PU: punctuation
● SI: singular
● SP: spelling
● VT: verb tense
● WC: word choice
● WO: word order
24. Automatic English Text Correction
The dataset
How to get it?
● https://corpus.mml.cam.ac.uk/efcamdat1/access.php
Licence:
● Non-commercial research use
● Commercial use only upon agreement
● https://corpus.mml.cam.ac.uk/efcamdat1/EFCamDAT-USERAGREEMENT.pdf
27. Automatic English Text Correction
The dataset
Once you’ve registered
● It is possible to filter the dataset
● It is possible to export the dataset to an XML file
30. Automatic English Text Correction
A bunch of (Python) scripts
Disclaimer
Code developed using the Extreme Go Horse Methodology during Hackday moments
The scripts are a proof of concept (POC) and lack:
- Proper automated tests
- Proper code design & API
- Documentation
https://gist.github.com/banaslee/4147370
31. Automatic English Text Correction
A bunch of (Python) scripts
What do they do?
1. Fix the XML files
2. Convert the XML files into good looking JSON files
3. Implement heuristics to identify some common English mistakes
○ For now: spelling, capitalization and articles
4. Analyse how well the algorithm performed
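As a rough idea of what step 2 might look like, here is a sketch that turns an annotated XML essay into a JSON-friendly dictionary. The element and attribute names in `SAMPLE_XML` (`writing`, `learner`, `change`, `selection`, `symbol`, `correct`) are assumptions made for illustration and may not match the real export; `writing_to_dict` and `convert` are hypothetical helpers, not the scripts' actual API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample: the schema below is an assumption, not the
# documented format of the exported dataset.
SAMPLE_XML = """
<writings>
  <writing id="1" level="1">
    <learner nationality="cn"/>
    <topic>Introducing yourself by email</topic>
    <text>im <change><selection>form</selection><symbol>SP</symbol><correct>from</correct></change> china</text>
  </writing>
</writings>
"""


def writing_to_dict(writing):
    """Flatten one <writing> into plain text plus a list of annotations."""
    text_el = writing.find("text")
    parts = [text_el.text or ""]
    mistakes = []
    for change in text_el.findall("change"):
        selection = change.findtext("selection", default="")
        start = len("".join(parts))  # offset in the learner's original text
        mistakes.append({
            "start": start,
            "end": start + len(selection),
            "kind": change.findtext("symbol", default=""),
            "correction": change.findtext("correct", default=""),
        })
        parts.append(selection)
        parts.append(change.tail or "")  # text following the annotation
    return {
        "id": writing.get("id"),
        "level": writing.get("level"),
        "nationality": writing.find("learner").get("nationality"),
        "topic": writing.findtext("topic"),
        "text": "".join(parts),
        "mistakes": mistakes,
    }


def convert(xml_string):
    root = ET.fromstring(xml_string)
    return [writing_to_dict(w) for w in root.iter("writing")]


essays = convert(SAMPLE_XML)
as_json = json.dumps(essays, indent=2)
```

Keeping the learner's original wording in `text` and the teacher's corrections in `mistakes` makes it straightforward to compare them with what a heuristic finds.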
32. Automatic English Text Correction
A bunch of (Python) scripts
How to download them?
● https://github.com/ef-ctx/righter
Licence
● Apache version 2.0
34. Automatic English Text Correction
Mistakes identification
We wrote functions that apply heuristics and rules to detect mistakes related to:
1. Spelling
2. Capitalization
3. Article
35. Automatic English Text Correction
Efficiency
To check their efficiency, we:
● Wrote a few unit tests
● Evaluated, before committing any change:
○ How close we got to the teacher’s annotations, using:
■ Precision
■ Recall
■ F-score
● Printed a side-by-side comparison of what the teacher annotated and what the algorithm identified
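The three metrics can be computed per essay along these lines. This is a minimal sketch: it assumes each annotation is represented as a hashable `(position, kind)` pair, and the function name and the convention of returning 0.0 for empty sets are choices made here for illustration.

```python
def precision_recall_f1(expected, predicted):
    """Compare teacher annotations (expected) with detected mistakes (predicted)."""
    expected, predicted = set(expected), set(predicted)
    true_positives = len(expected & predicted)
    # Precision: share of detected mistakes that the teacher also marked.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the teacher's marks that were detected.
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score (F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


teacher = {(0, "C"), (10, "SP"), (20, "VT")}
algorithm = {(0, "C"), (10, "SP"), (30, "SP"), (40, "AR")}
precision, recall, f1 = precision_recall_f1(teacher, algorithm)
```

In this example two of the four detections match the teacher (precision 0.5) and two of the teacher's three marks are found (recall of about 0.67).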
36. Automatic English Text Correction
Efficiency
https://en.wikipedia.org/wiki/Precision_and_recall
39. Automatic English Text Correction
Spelling: heuristics
1. Remove Unicode symbols (e.g. —)
2. Transform diacritics (e.g. é -> e)
○ This is particularly important for names
3. Remove punctuation (e.g. !, ?, .)
4. Check if the word:
○ Is in the dictionary (case-insensitive)
○ Has digits
○ Is in the names file (created with domain-specific names, e.g. Englishtown)
5. If none of these is true, then the word is probably misspelled
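The steps above can be sketched as follows. This is an illustration under assumptions, not the actual script: `DICTIONARY` and `NAMES` are tiny stand-ins for the real dictionary and names file, and the function names are hypothetical.

```python
import string
import unicodedata

# Toy stand-ins for the real dictionary and the domain-specific names file.
DICTIONARY = {"there", "are", "two", "people", "in", "my", "family", "cafe"}
NAMES = {"englishtown"}


def normalize(word):
    # Step 2: decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", word)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Steps 1 and 3: drop remaining non-ASCII symbols and ASCII punctuation.
    return "".join(
        c for c in no_marks if c.isascii() and c not in string.punctuation
    )


def is_probably_misspelled(word):
    cleaned = normalize(word)
    if not cleaned:
        return False  # nothing left to check (the token was only symbols)
    if any(c.isdigit() for c in cleaned):
        return False  # step 4: tokens with digits are not flagged
    lowered = cleaned.lower()
    # Steps 4 and 5: flag only if found in neither word list.
    return lowered not in DICTIONARY and lowered not in NAMES


def find_spelling_mistakes(text):
    return [w for w in text.split() if is_probably_misspelled(w)]


flagged = find_spelling_mistakes("there are tow people in my family")
```

With this toy dictionary "tow" is flagged, but with a full English dictionary it would not be, since "tow" is a real word; a pure dictionary lookup cannot catch that kind of mistake.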
40. Automatic English Text Correction
Spelling: results
Summary:
● total essays: 85,629
● mean precision: 0.7128 (std: 0.3580)
● mean recall: 0.6535 (std: 0.4212)
41. Automatic English Text Correction
Spelling: precision and recall per learner level
44. Automatic English Text Correction
Capitalization: heuristics
1. Check if word starts a sentence
○ Split on punctuation (!, ., ?, etc.)
2. Check if word is a known capital word
○ First person (I)
○ Day of the week
○ Month
○ Language (e.g. English, Spanish, French, etc.)
○ Country
○ Names (selected from corpus to match context-specific names)
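A minimal sketch of these two checks is shown below. The `KNOWN_CAPITALIZED` set is a small hypothetical sample (the real lists of countries, languages and names would be much longer), and the function name is an assumption.

```python
import re

# Small illustrative sample of words that should always be capitalized.
KNOWN_CAPITALIZED = {
    "i",
    "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
    "january", "february", "december",
    "english", "spanish", "french",
    "china", "brazil", "slovakia",
}

# A token is either a word or a sentence-ending punctuation mark.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z']*|[.!?]")


def find_capitalization_mistakes(text):
    """Return (offset, word) pairs for words that should start with a capital."""
    mistakes = []
    sentence_start = True
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        if token in ".!?":
            sentence_start = True  # the next word starts a new sentence
            continue
        needs_capital = sentence_start or token.lower() in KNOWN_CAPITALIZED
        if needs_capital and not token[0].isupper():
            mistakes.append((match.start(), token))
        sentence_start = False
    return mistakes


found = find_capitalization_mistakes("hi, my name is crystal. i am from china.")
```

Note that "crystal" is not flagged here because it is missing from the sample set, which shows how much this approach depends on the quality of the word lists.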
45. Automatic English Text Correction
Capitalization: results
Summary:
● total essays: 76,980
● mean precision: 0.5714 (std: 0.4005)
● mean recall: 0.5550 (std: 0.4472)
46. Automatic English Text Correction
Capitalization: precision and recall per learner level
47. Automatic English Text Correction
Capitalization: F-score per nationality
49. Automatic English Text Correction
Articles: heuristics
1. Check for “a” used before a word starting with a vowel
2. Check for “an” used before a word starting with a consonant
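These two rules can be sketched with a regular expression. The function name and the shape of the returned tuples are assumptions for illustration; the rule looks only at the first letter of the following word, as described above.

```python
import re

VOWELS = "aeiou"

# An indefinite article ("a" or "an") followed by whitespace and a word.
ARTICLE_RE = re.compile(r"\b(an?)\s+([A-Za-z]+)", flags=re.IGNORECASE)


def find_article_mistakes(text):
    """Return (offset, phrase, suggested_article) for suspicious a/an usage."""
    mistakes = []
    for match in ARTICLE_RE.finditer(text):
        article, word = match.group(1), match.group(2)
        starts_with_vowel = word[0].lower() in VOWELS
        phrase = f"{article} {word}"
        if article.lower() == "a" and starts_with_vowel:
            mistakes.append((match.start(), phrase, "an"))
        elif article.lower() == "an" and not starts_with_vowel:
            mistakes.append((match.start(), phrase, "a"))
    return mistakes


found = find_article_mistakes("I ate a apple and an banana in a hour.")
```

Because the check is based on letters rather than sounds, it misses "a hour" and would wrongly flag "a university". It also says nothing about missing articles or "the", which would be consistent with a heuristic that has high precision but low recall.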
50. Automatic English Text Correction
Articles: results
Summary:
● total items: 47,054
● average precision: 0.9724 (std: 0.1602)
● average recall: 0.0718 (std: 0.2463)
51. Automatic English Text Correction
Articles: results
Summary:
● total essays: 76,980
● mean precision: 0.5714 (std: 0.4005)
● mean recall: 0.5550 (std: 0.4472)
52. Automatic English Text Correction
Articles: precision and recall per learner level
57. Automatic English Text Correction
Next steps
● Clean up code
● Spelling
○ Use probabilistic models
■ http://norvig.com/spell-correct.html
● Capitalization
○ POS-tagging to identify names of people, organizations, places
● Articles
○ POS-tagging
○ Deal with plurals
○ Define heuristics for dealing with definite articles (the)
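For the spelling item, the probabilistic approach in the linked essay could look roughly like this condensed adaptation. `WORD_COUNTS` is a toy, made-up frequency table (a real model would be estimated from a large corpus), and only candidates one edit away are considered.

```python
from collections import Counter

# Toy frequency table, invented for illustration only.
WORD_COUNTS = Counter({"two": 50, "to": 80, "from": 60, "form": 20, "people": 40})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings that are one edit away from the given word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    return {w for w in words if w in WORD_COUNTS}


def correction(word):
    """Most frequent known candidate, preferring the word itself if known."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])


suggestion = correction("peopel")
```

Unlike the dictionary lookup, this also proposes a correction, but it still cannot fix "form" used instead of "from", because "form" is itself a known word.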
58. Automatic English Text Correction
Next steps
● Add to the user interface of EF Class
● Collect feedback from end users (teachers)
● Write an algorithm for proposing the correct forms
● Deal with the other kinds of mistakes
● Implement a classifier using NLP (natural language processing) so that end users can tell us whether the suggestions are good or not, and we can learn from them