Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

24
Estimating Dyslexia in the Web Ricardo Baeza-Yates Yahoo! Research & Web Research Group, Pompeu Fabra University, Barcelona, Spain Web Research and NLP Groups Pompeu Fabra University, Barcelona, Spain Luz Rello W4A 2011, Hyderabad

Transcript of Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

Page 1: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

Estimating Dyslexia in the Web

Ricardo Baeza-Yates

Yahoo! Research & Web Research Group, Pompeu Fabra University, Barcelona, Spain

Web Research and NLP GroupsPompeu Fabra University, Barcelona, Spain

Luz Rello

W4A 2011, Hyderabad

Page 2: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

Outline

— What

— Why

— How

— Results

to distinguish dyslexic errorsto build a sampleto measure dyslexia

Outline

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 3: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineWhat

DyslexiaDyslexia is a neurologically-based disorder which interferes with the acquisition and processing of language. It manifests itself with difficulties in receptive and expressive language, including phonological processing, in reading, writing, spelling and handwriting and sometimes in arithmetic.

The largest of the three subtypes of dyslexia that the author presents. Dysphonetic dyslexia is viewed as a disability in associating symbols with sounds. The misspellings typical of this disorder are due to phonetic inaccuracy.

Dysphonetic dyslexia

(The Boder’s Test of Reading-Spelling Patterns)

(Boder & Jarrico, 1982)

(Committee of Members Orton Dyslexia Society. Definition of Dyslexia, 1994.)

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 4: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineWhy

There is a universal neuro-cognitive basis for dyslexia.

It manifestations are culture-specific due to different orthographies.

English is a language with deep orthography, the mapping between letters, speech sounds, and whole-word sounds is often highly ambiguous and therefore dyslexics examples are more widespread than in other languages with transparent or shallow orthography.

All languages

(Paulesu et al. 2001)

(Alegria, 2006)

(Paulesu et al. 2001)

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 5: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineWhy

Frequent

Researchers estimate that 10-17% of the population in the U.S.A. has dyslexia and only 30% of dyslexics have trouble with reversing letters and numbers. On the other hand, the level of dyslexia in other regions such as Europe or China is lower.

There are around 38 million of dyslexics in Europe.

(H. Meng et al., 2005)

(Ruiz del Árbol, 2008)

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 6: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineWhy

Detecting the presence of dyslexic texts in the Web helps us to know the real impact of dyslexia in the Web as well as to value dyslexic-accessible practices.

There is a common agreement in these studies that the application of dyslexic-accessible practices benefits also the readability for non-dyslexic users as well as other users with disabilities such as low vision.

Spelling error rates has proven to be a useful index for website content quality.

Useful

(Gelman & Barletta, 2008)

(McCarthy & Swierenga, 2010)(Evett & Brown, 2005)

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 7: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineWhy

Novel

Estimating dyslexia in a group of web pages depending on their domain.

This is the first attempt to estimate the amount of texts containing English dyslexic errors in the Web.

(Ringlstetter et al. 2006)

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 8: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineHow

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

z

Two examples of dyslexic texts

There seams to be some confusetion. Althrow he rembers the situartion, he is not clear on detailes. With regard to deleteing parts, could you advice me of the excat nature of the promblem and I will investgate it imeaditly.

I halve a spelling chequer It cam with my pea see Eye now I’ve gut the spilling rite Its plane fore al too sea ... Its latter prefect awl the weigh My chequer tolled mi sew. (Pedler, 2007)

Page 9: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineHow

How many kinds of errors can be produced by a dyslexic?

dyslexic errors

Simple errors Multi errors Word boundary errors

Real-word errors Non-word errors

First letter errors

53% 39% 8% 100%

17% 83% 100%

5%

——

——

(Pedler, 2007)

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 10: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineHow

How many kinds of errors in the Web?

1. Dyslexic errors: Among the different kinds of errors commonly made made by dyslexics (i.e. unfinishedwords or letters, omitted words, inconsistent spaces between words and letters (Vellutino, 1979). *reiecve instead of receive

2. Regular spelling errors produced by non-impaired native English individuals, such as the transposition error, i.e. *recieve.

3. Regular typos caused by the adjacency of letters in the keyboard, i.e. *teceive.

4. OCR errors, due to letters of similar shape, such as *ieceive.

5. Errors made by non-native speakers who use English as a foreign language. For example, *receibe is a typical error made by Spanish learners of English, since the graphemes ‘b’ and ‘v’ are pronounced as /b/, and the phoneme /v/ does not exist in the standard Spanish phonemic system.

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 11: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

Outline

Ricardo Baeza-Yates and Luz Rello Estimating Dyslexia in the Web

How

Selection criteria

To avoid the overlap of dyslexic errors and other errors:

To avoid the overlap of dyslexic errors and real words:

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

— We consider only words written by dyslexics containing multi-errors, that is, the dyslexic word differs from the intended correct word by more than one letter. For example, the dyslexic word *konwlegde from knowledge.

— Errors which coincide with other existing words in English are omitted, i.e. *trust being the intended word truth.

— Errors which give as a result a proper name are also filtered, for instance the typo *wirries from worries is also a proper name.

Page 12: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineHow

Selection criteria

— All the dyslexic spelling errors are extracted from samples of text written by adults with diagnosed dyslexia (extracted from a corpus compiled for this purpose) and from literature (Pedler, 2007).

— Among the dyslexic errors, we take in account the ones which include the letters that produce more confusion among dyslexic individuals, such as ‘b’, ‘d’, ‘p’, ‘m’, ‘n’, ‘u’ and ‘w’ together with other similar looking letters. For instance, it is specially frequent to find reversals of similar letters, such as ‘b’ and ‘d’ (Deloche et al. 1982).i.e. *impossidle being the intended word impossible.

— Errors due to homophone confusion, that is words which have a similar pronunciation (Pedler, 2007), are not selected even though 15% of the dyslexic errors presented homophone confusion in a corpus of dyslexic texts (witch and which).

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 13: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineHow

1. Dyslexic error:

2. Spelling errors:

3. Typos:

4. OCR errors: 5. Non-native speakerserrors:

Sample D, an example for the word comparison

*comaprsion.

*comparision, *conparison and *coparison.

*vomparison, *xomparison, *cimparison, *cpmparison, *conparison, *co,parison, *comoarison, *com[arison, *comprison, *compsrison, *compaeison, *compatison, *comparuson, *comparoson, *compariaon,*comparidon, *comparisin, *comparispn, *comparisob and *comparisom.

*compaiison and *comparisom.

*comparition and *comparizon.

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 14: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineHow

Sample D, dyslexic errors

comparison

understanding

knowledge

impossible

tomorrow

worries

explain

interesting

situation

confusion

*comaprsion

*understangind

*knwolegde

*inpossbile

*torromow

*worires

*exaplin

*intersenting

*situartion

*confusetion

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the

Page 15: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineHow

Estimating Dyslexia in the Web

— Let us define:

f : fraction of Web pages with lexical errors. d : fraction of dyslexic errors among all lexical errors.

— Then, the fraction of Web pages with dyslexia is f × d.

— We find a lower bound for f and d, to obtain a lower bound for the fraction of dyslexic pages in the Web.

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 16: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

Outline

— We use the main search engines (Bing, Google and Yahoo!) to estimate the document frequency of a word.

— Each of the words in our list is searched only in English web pages to avoid cases of wrong words that may have a meaning in other language.

How

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Estimating Dyslexia in the Web

Page 17: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

Outline

Estimating Dyslexia in the Web

— We bound the relative fraction of documents with lexical error, f, by using a sample of frequent words that appear in most documents, usually called stopwords in information retrieval (becuase, trhough, etc.).

— We use the largest relative fraction of misspells for all these words to estimate f, as we cannot assume that all of them appear in different pages.

— To bound d we do the same frequency search with a sample of non-frequent words (Sample D) where we can distinguish the different types of errors without ambiguity.

How

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 18: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineResults

We use the real document frequencies of the terms from one of the search engines to validate the results obtained, finding very similar results.

Range of percentages and average for the different error classes.

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Estimating Dyslexia in the Web

Page 19: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineResults

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

— From the sample D, the percentage of dyslexic errors among all lexical errors is very low with an average of 0.67%

— From Pedler (2007), only 39% of dyslexics errors are multi-errors

— This implies that the lower bound is at least d/0.39, but we can safely use a factor of 3 to correct this fact.

— We have that f is at least 0.27% from the word becuase.

— Then, we can estimate d as 2.01%.

— Lower bound for dyslexia in the Web is 0.005%.

Estimating Dyslexia in the Web

Page 20: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineConclusions

• The amount of dyslexic texts in the Web is not as large as it could be. This suggests the idea that the widespread use of spell checkers ameliorates dyslexia in the Web.

• Particular words can be used to detect dyslexic texts, and hence dyslexic users. This can be used to improve Web accessibility as well as future spell checkers or other tools targeted to dyslexic users.

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 21: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineConclusions

• Since this is the first attempt to estimate text written by dyslexics individuals in the Web, a comparison with previous work is not possible.

• Previous research on dyslexia reveals that error frequency is related with word length (Pedler, 2007). Short words such as there, where, form, etc. are misspelled much more frequently in dyslexic texts than long words like the ones used in our experiments. Hence, we can do a better estimation by using a larger sample of stopwords as well as long dyslexic words.

• As a byproduct we have found that other types of errors are much more frequent in the Web and this can be used to assess the quality of Web text.

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 22: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

Outline

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

On-going Work

New methodology. Sample enlarged to 50 words. Real data extracted from a leading search engine. Up-down/Left-right typos. New lower bound: 0.8 % (16 times better).

Range of percentages and average for the different error classes.

Page 23: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

OutlineFuture Work

1 — Identification of dyslexic errors. Dyslexia diagnosis.

2 — NLP techniques for making text more accessible for dyslexic users.

3 — Web quality estimation (Gelman & Barletta, 2008), across countries, domiens and social media.

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web

Page 24: Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

Outline

Zank u beri mach

Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web