![Page 1: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/1.jpg)
SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE
– AN ADVANCED LEXICOGRAPHIC TOOL FOR
LIBRARY TERMINOLOGY RESEARCH
Ivan Kanič
University of Ljubljana, Faculty of Economics
International scientific conference «Corpus linguistics»
Saint-Petersburg State University, June 25 – 27, 2013
![Page 2: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/2.jpg)
![Page 3: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/3.jpg)
SLOVENIA
• Population: 1,992,690
• Ljubljana (capital): 260,000
• Independence: 25 June 1991 (from Yugoslavia)
• Surface: 20,273 sq km
• Border countries: Austria, Croatia, Hungary, Italy
• Adriatic coastline: 46.6 km
• Highest point: Triglav, 2,864 m
![Page 4: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/4.jpg)
SLOVENIA (2)
• Language: Slovene (var.: Slovenian)
• Ethnic composition: Slovene 83.1%, Serb 2%, Croat 1.8%, Bosniak 1.1%, other or unspecified 12%
• Religions: Catholic 57.8%, Muslim 2.4%, Orthodox 2.3%, other or unspecified 28%, none 10.1% (2002 census)
• GDP per capita: $28,700 (2012)
• Currency: euro (introduced in 2007)
![Page 5: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/5.jpg)
SLOVENE LANGUAGE
• Slovenski jezik, slovenščina
• Western South Slavic language
• ca. 2.4 million speakers (1.85 million as a first language)
• 50 regional dialects (limited mutual understanding: "most diverse Slavic language")
• Latin alphabet with Č, Š, Ž
• Highly inflected language
• Particularity: the dual (grammatical number)
![Page 6: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/6.jpg)
SLOVENE TEXT CORPORA
• More than 20 corpora available online
• Representative (general) corpora
• Synchronous corpora
• Nova Beseda
  – 240 million words, 2004 (ca. 10 years' coverage)
• GigaFida
  – 1.2 billion words, 1990-2011
• Specialised corpora
  – DSI, Jos, Evrokorpus, VAYNA . . .
  – EduKorp, Bibliotekarstvo
![Page 7: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/7.jpg)
Slovene LIS Terminology
• Long professional tradition
• Linguistic shortage in the subject field
  – Lack of written technical texts
  – German language tradition
  – Later English influences
  – NO dictionaries in LIS terminology
  – Terminology Project 1987
  – Important tangible results
![Page 8: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/8.jpg)
Usable results
• International Project – Multilingual Dictionaries of Library Terminology
• English-Slovene Dictionary of Library Terminology
• (Slovene) Dictionary of Library Terminology
  – Printed edition
  – Electronic edition (web, public access)
• Text Corpus – Korpus bibliotekarstva
![Page 9: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/9.jpg)
Korpus bibliotekarstva
• Specialized corpus: library and information science and practice
• Synchronous
• Open public access
• Dedicated in-house software
  – PC data processing
  – Web-based usage
  – Rich experience (e.g. dictionaries of the Slovene Academy of Sciences and Arts)
![Page 10: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/10.jpg)
Texts
• Defined selection criteria
• Subject & level
• Written texts
• Electronically published texts only
  – Born digital
  – Digitized & published
  – NO scanning for the corpus
• Technical limitations and barriers
![Page 11: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/11.jpg)
![Page 12: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/12.jpg)
Selected texts & Functions
![Page 13: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/13.jpg)
Basic functions
• Simple/basic search
  – Single words & phrases
  – N-grams (N = 1 – 5)
  – Concordances
  – Global corpus – selected document segment(s)
  – Exact matching
  – Truncation (*)
  – Upper / lower case (Knjižnica - knjižnica)
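The basic functions listed above (n-grams, concordances, truncation with `*`, and case handling) can be illustrated with a minimal Python sketch. It is not the corpus's in-house software: the function names, the tokenizer, and the sample sentence are assumptions made up for illustration, and wildcard matching is borrowed from Python's standard `fnmatch` module.

```python
import re
import unicodedata
from collections import Counter
from fnmatch import fnmatchcase


def tokenize(text):
    # Normalize to NFC so that Č, Š, Ž are single code points, then split on
    # non-word characters. (Hypothetical tokenizer, not the corpus's own.)
    text = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+", text)


def ngrams(tokens, n):
    # All runs of n consecutive tokens, in text order.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n):
    # Frequency of each n-gram.
    return Counter(ngrams(tokens, n))


def matches(token, pattern, ignore_case=True):
    # '*' in the pattern stands for any run of characters (truncation).
    if ignore_case:
        token, pattern = token.lower(), pattern.lower()
    return fnmatchcase(token, pattern)


def concordance(tokens, pattern, width=3, ignore_case=True):
    # Keyword-in-context lines: (left context, hit, right context).
    lines = []
    for i, tok in enumerate(tokens):
        if matches(tok, pattern, ignore_case):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, for illustration only.
sample = "Knjižnica ima knjižnični katalog. Vsaka knjižnica ureja katalog."
tokens = tokenize(sample)
for left, hit, right in concordance(tokens, "knjiž*", width=2):
    print(f"{left:>20} [{hit}] {right}")
```

With `ignore_case=False`, the pattern `knjižnica` would match only the lower-case form, which mirrors the "Knjižnica - knjižnica" distinction on the slide.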
![Page 14: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/14.jpg)
Basic functions (2)
• Advanced search
  – Frequency search (=, <, >): Fr>1000; Fr>200 in be:kata*
  – Word length (=, <, >): Do=15
• Word masking (*), adjective + substantive: * katalog; knjižnični *
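The advanced search options above combine frequency and word-length comparisons with wildcard patterns, and word masking leaves one slot of a two-word phrase open. The sketch below shows one way such filters could work on a plain frequency list. It does not parse the query syntax shown on the slide; the function names, the `(operator, value)` argument shape, and the sample tokens are all assumptions.

```python
import operator
from collections import Counter
from fnmatch import fnmatchcase

# Comparison operators offered on the slide: = , < , >
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def filter_words(freq, pattern="*", fr=None, length=None):
    # freq: mapping word -> count.
    # fr and length are optional (operator, value) pairs, e.g. (">", 200).
    selected = {}
    for word, count in freq.items():
        if not fnmatchcase(word, pattern):
            continue
        if fr is not None and not OPS[fr[0]](count, fr[1]):
            continue
        if length is not None and not OPS[length[0]](len(word), length[1]):
            continue
        selected[word] = count
    return selected


def masked_bigrams(tokens, first="*", second="*"):
    # Count two-word sequences; '*' leaves a slot open (word masking).
    pairs = zip(tokens, tokens[1:])
    return Counter(
        (a, b) for a, b in pairs
        if fnmatchcase(a, first) and fnmatchcase(b, second)
    )


# Invented sample tokens, for illustration only.
tokens = "knjižnični katalog in abecedni katalog ter knjižnični katalog".split()
freq = Counter(tokens)

print(filter_words(freq, pattern="kata*", fr=(">", 2)))
print(masked_bigrams(tokens, second="katalog").most_common())
```

Masking the first slot (`second="katalog"`) lists every word that precedes "katalog", which is the kind of adjective + substantive query the slide describes.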
![Page 15: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/15.jpg)
Hyperlinked list of texts & authors
![Page 16: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/16.jpg)
Concordance list
![Page 17: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/17.jpg)
Citation
![Page 18: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/18.jpg)
Full-text access
![Page 19: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/19.jpg)
Single word
![Page 20: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/20.jpg)
Bigrams
![Page 21: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/21.jpg)
Bigrams (2)
![Page 22: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/22.jpg)
4-grams
![Page 23: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/23.jpg)
Insight
• 625 texts
• 353 authors (single or co-authors)
• 3.66 million words
• Lemmatisation
• Part of speech tagging
• 28,808 distinct words
• Highest frequency: 172,031 (aux. v. "to be")
• Hapax legomena: 7,310
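Figures of this kind (total words, distinct words, the highest frequency, and the number of hapax legomena) can be derived from a token list with a frequency counter. The following is a minimal sketch on an invented token list; `corpus_stats` is a hypothetical helper, and it counts surface forms, whereas the corpus itself is lemmatised.

```python
from collections import Counter


def corpus_stats(tokens):
    # Summary statistics over a list of tokens (surface forms in this sketch).
    freq = Counter(tokens)
    top_word, top_count = freq.most_common(1)[0]
    hapaxes = sorted(word for word, count in freq.items() if count == 1)
    return {
        "tokens": len(tokens),       # running words
        "types": len(freq),          # distinct words
        "top": (top_word, top_count),
        "hapax": hapaxes,            # words occurring exactly once
    }


# Invented sample tokens, for illustration only.
stats = corpus_stats("je knjižnica je katalog je gradivo knjižnica".split())
print(stats)
```

Taking the slide's figures at face value, hapax legomena make up roughly a quarter of the distinct words (7,310 of 28,808).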
![Page 24: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/24.jpg)
Frequency distribution
• First 50
![Page 25: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/25.jpg)
Zipf‘s Law vs. experience
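Zipf's law predicts that the frequency of the word at rank r is roughly proportional to 1/r. The sketch below sets an observed frequency list against that prediction. The function names and the sample frequencies are assumptions; the corpus's actual ranked frequency data are not reproduced in this transcript.

```python
def zipf_expected(top_frequency, rank, exponent=1.0):
    # Frequency predicted at a given rank, scaled by the top frequency.
    return top_frequency / rank ** exponent


def compare_with_zipf(frequencies):
    # Returns (rank, observed, expected) for each frequency, highest first.
    ranked = sorted(frequencies, reverse=True)
    top = ranked[0]
    return [
        (rank, observed, zipf_expected(top, rank))
        for rank, observed in enumerate(ranked, start=1)
    ]


# Invented frequencies, for illustration only.
for rank, observed, expected in compare_with_zipf([100, 50, 30, 26]):
    print(rank, observed, round(expected, 1))
```

For example, with a top frequency of 172,031, the idealized law would put the second-ranked word near 86,000 occurrences; how closely the corpus follows this is what the slide's chart compares.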
![Page 26: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/26.jpg)
Parts of speech; Verbs
[Bar chart by part of speech: noun, adjective, verb, adverb, the rest; vertical axis from 0 to 14,000]
![Page 27: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/27.jpg)
Nouns Adjectives
![Page 28: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/28.jpg)
Accessibility
• Open Access
• CC License
• Blog: Bibliotekarska terminologija, http://terminologija.blogspot.com
![Page 29: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/29.jpg)
Problems & Challenges
• Choice & acquisition of texts
• "Analogue" texts
• Copyright issues
• Technical barriers
  – PDF protected data
  – Special characters
  – Special text formatting
  – Typing errors
  – Genuine OCR errors
![Page 30: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/30.jpg)
Problems & Challenges (2)
• Linguistic
  – Highly inflected language
    • Data processing
    • Search
    • Analysis
    • Part of speech tagging
  – Foreign language "contamination"
• General
  – Resources
    • Human
    • Financial
![Page 31: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/31.jpg)
Plans
• Harvesting new texts
  – Recent / current born-digital publications
  – Recently digitized (e.g. "Knjižnica")
  – "Backlog"
    • 120 graduate theses
    • 28 master theses
    • 25 monographs & proceedings
  – Scientific analysis
  – Dictionary updating and supplementing
![Page 32: SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.](https://reader035.fdocuments.net/reader035/viewer/2022062500/56649e7c5503460f94b7e4ce/html5/thumbnails/32.jpg)
THANK YOU FOR YOUR ATTENTION!
Check: http://terminologija.blogspot.com Contact: [email protected]
http://www2.arnes.si/~ljnuk4/kanic.html