Data Analytics(1) M.Vlachos IBM Research – Zurich, Switzerland How Difficult is a Foreign-Language...
Data Analytics (1)
M. Vlachos, IBM Research – Zurich, Switzerland
How Difficult is a Foreign-Language Document?
Our Goal
• Provide:
– a semantic ‘sorting’ operator
– for foreign-language documents (with respect to the reader's native language)
– based on their perceived comprehensibility
Documents/Books on a topic: Easy < > Difficult
Why is it useful? (1/2)
E-Bookstores:
Recommendations based on the user’s language level
Easy < > Difficult
Why is it useful? (2/2)
Web search/personalization: There is a lot of content overlap on the internet. Provide only a subset to the user, based on both:
– Relevance
– Document difficulty/comprehensibility
“Which documents should I read that better correspond to my understanding of the German language?”
Background - Readability
• Manuals / Army Documents
Background - Readability
• Zipf’s Law
“Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.”
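The quoted law can be checked empirically on any text. The following is a minimal sketch, not taken from the slides: the function name `rank_frequency` and the toy sentence are illustrative assumptions. It builds the rank/frequency table the quote refers to; on a large corpus, the product rank × count should stay roughly constant if Zipf's law holds.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first.

    Ties are broken alphabetically so the output is deterministic.
    """
    # Runs of Unicode letters, so umlauts in German text are kept.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


# Toy input only; a meaningful check needs a very large corpus.
sample = "the cat and the dog and the bird saw the fish"
for rank, word, count in rank_frequency(sample):
    print(rank, word, count, rank * count)
```

A toy sentence like this one is far too small to exhibit the law; it only shows how the table is built.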
Background - Readability
• Flesch Reading Ease (scale: 100 to 0)
– 90-100: 11 year old
– 60-70: 13-15 year old
– 0-30: University student
• Available in Microsoft Word
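For reference, the standard Flesch Reading Ease formula for English is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The sketch below computes it under a loud assumption: syllables are estimated by counting vowel groups, which is only a rough heuristic, and the function names are illustrative rather than taken from the slides or from any particular tool.

```python
import re


def count_syllables(word):
    """Rough estimate: number of vowel groups, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease for English text (higher means easier)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


easy = "The cat sat. The dog ran."
hard = "Comprehensibility estimation necessitates sophisticated methodologies."
print(flesch_reading_ease(easy), flesch_reading_ease(hard))
```

Very short or very dense texts can score above 100 or below 0; the bands on the slide describe typical prose.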
[Figure: two example texts, with readability scores of 65 and 52]
What makes the new problem challenging/interesting?
Cognates
• Many words in different languages exhibit visual and semantic affinity
– Derived words
– ‘Loan’ words
“Ein Experte kam um die Maschine zu reparieren”
“An expert came to repair the machine.”
Compound Words
• German, Dutch, Swedish, etc. are compound languages.
• Complex words can be built from simpler ones
• Intuition: Even if a word cannot be found in a dictionary (or has low frequency), if it consists of easy building blocks then it is also easy to understand
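The intuition above can be sketched as a dictionary-based splitter. This is an illustrative assumption, not the authors' decompounding method: the function `decompound`, the `min_len` parameter and the tiny vocabulary are hypothetical, and real German compounds often contain linking elements (such as an extra "s") that this sketch ignores.

```python
def decompound(word, vocabulary, min_len=3):
    """Split `word` into known building blocks.

    Returns a list of parts, or None if no full split exists.
    Longer prefixes are tried first; results are memoised per position.
    """
    word = word.lower()
    memo = {}

    def split(start):
        if start == len(word):
            return []
        if start in memo:
            return memo[start]
        result = None
        # Try the longest candidate part first, down to min_len characters.
        for end in range(len(word), start + min_len - 1, -1):
            part = word[start:end]
            if part in vocabulary:
                rest = split(end)
                if rest is not None:
                    result = [part] + rest
                    break
        memo[start] = result
        return result

    return split(0)


# Hypothetical vocabulary of 'easy' building blocks.
vocab = {"finanz", "minister", "krise"}
print(decompound("Finanzminister", vocab))
print(decompound("Finanzkrise", vocab))
```

If every returned part is itself an easy word, the compound can be treated as easy even when it is rare or missing from the dictionary.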
How to find word frequency?
Popularity of a word:
• Very large text corpora (e.g. Project Gutenberg)
• Better: Use web search engines!
Putting it all together
• An easy text contains:
– Simple syntactical structure (e.g. no deeply connected sentences)
– Easy words:
  • frequently encountered (e.g. web frequency)
  • similar to my native language: cognates (Finanzkrise = finance crisis)
• Combine these measures to deduce overall difficulty
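The slides do not say how the measures are combined, so the following is only one possible sketch under stated assumptions: each word carries a frequency score and a cognate score in [0, 1], a word counts as easy if either score is high, and word easiness is blended with a syntax score through a hypothetical weight `word_weight`. None of these names or values come from the source.

```python
def word_easiness(frequency_score, cognate_score):
    """A word is easy if it is frequent OR resembles a native-language word."""
    return max(frequency_score, cognate_score)


def text_difficulty(words, syntax_easiness, word_weight=0.7):
    """Combine per-word scores and a syntax score into one difficulty value.

    words: list of (frequency_score, cognate_score) pairs, each in [0, 1].
    syntax_easiness: value in [0, 1]; 1 means very simple sentence structure.
    word_weight: hypothetical blending weight, not taken from the slides.
    Returns a difficulty in [0, 1]; higher means harder.
    """
    if not words:
        raise ValueError("words must not be empty")
    avg_word = sum(word_easiness(f, c) for f, c in words) / len(words)
    easiness = word_weight * avg_word + (1.0 - word_weight) * syntax_easiness
    return 1.0 - easiness


easy_text = text_difficulty([(0.9, 0.1), (0.2, 0.8)], syntax_easiness=0.9)
hard_text = text_difficulty([(0.1, 0.2), (0.3, 0.1)], syntax_easiness=0.4)
print(easy_text, hard_text)
```

Taking the maximum of the two word scores reflects the idea that a rare word is still easy when it is a cognate; a weighted sum would be an equally plausible choice.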
Estimating Cognativity
Compute how easy it is to transform one word into another…
Common Letter Transformations:
j -> y (ja -> yes)
k -> c (Architekt -> architect)
z -> c (sozial -> social)
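One way to make "how easy it is to transform one word into another" concrete is an edit distance in which the common letter transformations are cheap. The sketch below is an assumption about how this could be done, not the authors' scoring function: the cost value `cheap_cost`, the normalisation in `cognate_score` and both function names are hypothetical. Only the three substitutions listed on the slide are included.

```python
# Substitutions listed on the slide (German letter -> English letter).
CHEAP_SUBSTITUTIONS = {("j", "y"), ("k", "c"), ("z", "c")}


def transformation_cost(source, target, cheap_cost=0.2):
    """Weighted edit distance: common letter swaps cost less than 1."""
    source, target = source.lower(), target.lower()
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [float(i)]
        for j, t in enumerate(target, start=1):
            if s == t:
                sub = 0.0
            elif (s, t) in CHEAP_SUBSTITUTIONS:
                sub = cheap_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,        # delete s
                            curr[j - 1] + 1.0,    # insert t
                            prev[j - 1] + sub))   # substitute s -> t
        prev = curr
    return prev[-1]


def cognate_score(source, target):
    """Map the cost to [0, 1]; 1 means the words are identical."""
    longest = max(len(source), len(target), 1)
    return 1.0 - transformation_cost(source, target) / longest


for de, en in [("ja", "yes"), ("Architekt", "architect"), ("sozial", "social")]:
    print(de, en, transformation_cost(de, en), cognate_score(de, en))
```

With this weighting, "sozial" is much closer to "social" than to an arbitrary word at the same plain edit distance, which is the behaviour the slide's examples suggest.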
Assembling everything
some experiments
Results – User Study
Easy:
Ich habe mit dreissig Jahren angefangen, Deutsch zu lernen. Das war ziemlich spät; ich glaube, wenn man jünger ist, ist es viel leichter, eine Fremdsprache zu lernen. Aber ich wollte es trotzdem versuchen. Mich interessierte die Deutsche Kultur, und einige Mitarbeiter der Firma hatten die Aussicht, einmal in Deutschland zu arbeiten. Also lernte ich Deutsch.

Medium:
Über mangelnde Beschäftigung während der Weihnachtsfeiertage konnte sich die städtische Berufsfeuerwehr dieses Jahr wahrhaftig nicht beklagen. Mehr als dreihundert Einsätze im gesamten Münchner Stadtgebiet hielten Oberbranddirektor Wanninger und seine Mitarbeiter rund um die Uhr in Atem. In den meisten Fällen konnten sie das Feuer schnell unter Kontrolle bringen. Zwei Einfamilienhäuser und mehrere Etagenwohnungen brannten jedoch vollständig aus.

Difficult:
Das sogenannte Vorgesicht ist ein bis zum Schauen oder mindestens deutlichem Hören gesteigertes Ahnungsvermögen und hier in Westfalen so gewöhnlich, dass man überall doch tatsächlich damit Behaftete trifft und im Grunde fast kein Eingeborener sich gänzlich davon freimachen dürfte. Seine Gabe überkommt ihn zu jeder Tageszeit, am häufigsten jedoch in Mondnächten, wo er plötzlich erwacht und von fieberhafter Unruhe ins Freie oder ans Fenster getrieben wird. Er hört das Geschrei der Verunglückten und an Tür oder Fensterläden das Anklopfen desjenigen, der ihn oder seinen Nachfolger zur Hilfe rufen wird.
Comparing Readability vs Our Method
Comprehensibility consistently outperforms readability measures
• 300 essays from CourseInfo.com, at three levels:
– GCSE (high-school)
– A-level (pre-college preparation)
– University level
LingoRANK
• A web tool for keyword-based news retrieval in the German language
• Semantic ranking of documents based on comprehensibility
In summary
• Dynamic Corpus for Term Frequency
– Use search engines
• Difficulty Depends on the User’s Native Language
– Cognate Identification
• Word Decompounding
– Building blocks simple to understand? -> Compound word is simple
– Finanzminister (= Finance Minister)
• We can mesh relevance and comprehensibility using a skyline ordering approach
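A skyline (Pareto front) over two criteria keeps every document that no other document beats on both relevance and comprehensibility. The sketch below shows the general technique only; how the cited paper orders results is not described in these slides, so the layer-by-layer ordering in `skyline_layers`, the tuple layout and the sample scores are all assumptions.

```python
def dominates(a, b):
    """True if document a is at least as good as b on both criteria
    and strictly better on at least one. Higher scores are better."""
    return (a[1] >= b[1] and a[2] >= b[2]
            and (a[1] > b[1] or a[2] > b[2]))


def skyline(documents):
    """Documents not dominated by any other document."""
    return [d for d in documents
            if not any(dominates(other, d) for other in documents)]


def skyline_layers(documents):
    """Order documents by repeatedly peeling off the current skyline."""
    remaining = list(documents)
    layers = []
    while remaining:
        front = skyline(remaining)
        layers.append(front)
        remaining = [d for d in remaining if d not in front]
    return layers


# Hypothetical (name, relevance, comprehensibility) triples.
docs = [("a", 0.9, 0.2), ("b", 0.5, 0.8), ("c", 0.4, 0.7), ("d", 0.8, 0.1)]
for layer in skyline_layers(docs):
    print([name for name, _, _ in layer])
```

This quadratic version is fine for a page of search results; larger collections would call for a dedicated skyline algorithm.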
“Customizing Search Results for Non-Native Speakers” (2012)
T. Lappas, M. Vlachos: International Conference on Information and Knowledge Management (CIKM)