

In real-world NLP problems we often encounter texts with a lot of typos. As a result, we cannot reach the best score. As painful as it may be, the data should be cleaned before fitting. We need an automatic spelling corrector which can fix words with typos and, at the same time, not break correct spellings. Let's start with Norvig's spelling corrector and iteratively extend its capabilities.

Peter Norvig (director of research at Google) described the following approach to spelling correction. Let's take a word and brute-force all possible edits, such as delete, insert, transpose, replace and split. Every resulting string is added to a candidate list; for the word abc the possible candidates are: ab ac bc bac cba acb a_bc ab_c aabc abbc acbc adbc aebc etc. Each candidate is estimated with a unigram language model. We repeat the edit procedure a second time for every candidate to get candidates with a bigger edit distance (for cases with two errors). For each vocabulary word, frequencies are pre-calculated from some big text collections, and the candidate word with the highest frequency is taken as the answer.
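To make the procedure concrete, here is a minimal Python sketch of such a corrector. It is a sketch under a few illustrative assumptions rather than the article's exact code: the corpus path `big.txt`, the `freq`/`known` helpers, and scoring a split candidate by the rarer of its two halves are all choices made here for illustration.

```python
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

# Unigram frequencies pre-calculated over a big text collection;
# "big.txt" is a placeholder corpus path.
WORD_FREQ = Counter(words(open("big.txt", encoding="utf-8").read()))

def edits1(word):
    """All candidates one edit away: deletes, transposes, replaces, inserts, splits."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in pairs if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in pairs if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in pairs if r for c in letters]
    inserts = [l + c + r for l, r in pairs for c in letters]
    word_splits = [l + " " + r for l, r in pairs if l and r]  # "abc" -> "a bc", "ab c"
    return set(deletes + transposes + replaces + inserts + word_splits)

def edits2(word):
    """Apply edits1 a second time to cover typos two edits away."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def freq(candidate):
    # A split candidate ("ab c") is scored by the rarer of its halves.
    return min((WORD_FREQ[part] for part in candidate.split()), default=0)

def known(candidates):
    return {c for c in candidates if freq(c) > 0}

def correct(word):
    """Return the known candidate with the highest unigram frequency."""
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or {word}
    return max(candidates, key=freq)
```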
Adding some context

The first improvement is adding an n-gram language model (3-grams). Let's pre-calculate frequencies not only for single words, but for a word together with a small context (the 3 nearest words). Let's estimate the probability of a fragment as the product of the probabilities of all its n-grams.
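The original formula is not reproduced in this fragment; written out for 3-grams it would be the standard Markov approximation, with the first two positions padded by start-of-fragment tokens and the conditional probabilities estimated from the pre-calculated counts:

$$
P(w_1 \dots w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1}),
\qquad
P(w_i \mid w_{i-2}, w_{i-1}) \approx \frac{\operatorname{count}(w_{i-2}\, w_{i-1}\, w_i)}{\operatorname{count}(w_{i-2}\, w_{i-1})}
$$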

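A rough sketch of how candidates could then be re-ranked with such a 3-gram model follows. The table names `TRIGRAM_FREQ` / `BIGRAM_FREQ`, the `<s>` padding and the smoothing constant are assumptions made for illustration, not the article's code.

```python
from collections import Counter

# Pre-calculated n-gram tables (assumed names); in practice they would be
# built from the same big text collections as the unigram frequencies.
TRIGRAM_FREQ = Counter()   # counts of (w1, w2, w3)
BIGRAM_FREQ = Counter()    # counts of (w1, w2)

def trigram_prob(w1, w2, w3, alpha=1e-6):
    # Tiny additive smoothing so unseen trigrams don't zero out the whole product.
    return (TRIGRAM_FREQ[(w1, w2, w3)] + alpha) / (BIGRAM_FREQ[(w1, w2)] + alpha)

def fragment_prob(words):
    """Probability of a fragment as the product of all its 3-gram probabilities."""
    padded = ["<s>", "<s>"] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= trigram_prob(padded[i - 2], padded[i - 1], padded[i])
    return prob

def correct_with_context(left, word, right, candidates):
    """Pick the candidate whose window of 3 nearest words is the most probable."""
    def score(cand):
        return fragment_prob(list(left[-3:]) + [cand] + list(right[:3]))
    return max(candidates, key=score)
```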