Dialogue 2014, Bekasovo Anastasia Bonch-Osmolovskaya NRU HSE.

Dialogue 2014, Bekasovo

Anastasia Bonch-Osmolovskaya

NRU HSE

Semantic tagging of Leo Tolstoy

Association of Digital Humanities Organizations (Europe, America, Australasia, Japan)

Digital humanities project

What is Digital Humanities

Scholarly dissemination

Big Data for Humanities

Distant reading and complex network analysis and visualisation

Linking cultural data: building standardized resources and interoperability

Digital humanities: state of the art field

Republic of letters

a stack of books 2.7 m high

includes all published works, variants, unpublished drafts, diaries, letters, fragments

13 volumes of diaries

31 volumes, or 8,500 letters

about 14.5 million tokens

commentaries, indexes

Leo Tolstoy's 90-volume complete edition

Open cultural heritage

All Tolstoy in one click

A project to digitise the entire works of Leo Tolstoy – named All of Tolstoy in One Click – making them available for tablets and smartphones, turned out to be lighter work than expected for the Tolstoy Museum in Moscow, when thousands of readers from all over the world responded to a call for volunteers. (The Guardian)

Now, thanks largely to the efforts of these volunteers, nearly all of the great Russian writer’s massive body of work, including novels, diaries, letters, religious tracts, philosophical treatises, travelogues, and childhood memories, will soon be available online, in a form that can be easily downloaded, free of charge. (The New Yorker)

A Crowdsourcing Wonder

The idea of contemporary standards of cultural heritage web publishing

Tagging relevant structural elements of the text and textual data

Linking elements inside and outside the text

Project participants

Tolstoy Museum (Fekla Tolstaya)

Higher School of Economics, philology department (Boris Orekhov, Anastasia Bonch-Osmolovskaya)

Tartu University (Roman Leibov)

ABBYY Compreno (Anatoly Starostin)

students of the philology department, HSE

Semantic Tolstoy

What should be tagged?

What tags should be used?

Should we do it manually or automatically?

Do we represent the book or the text? (Do we tag non-Tolstoy texts?)

Semantic Tolstoy: how to start

What should be tagged? Everything that can be tagged with TEI

What tags should be used? TEI scheme

Should we do it manually or automatically? It depends

Do we represent volumes or texts? Text

Semantic Tolstoy: how to start

XML standard scheme for encoding books: http://www.tei-c.org

wiki, manuals, tutorials, events, discussions, groups of interest

ROMA (http://www.tei-c.org/Roma/): customization generator for the TEI scheme

Text Encoding Initiative

TEI scheme modules

critical apparatus

readings, variants

names, dates, places

tables, formulae, graphics, notated music

language corpora

dictionaries

linking, segmentation, alignment

linguistic annotation

POS tagging

certainty, precision, responsibility

Types of texts

documentary texts

literary texts

prose

verse

performance texts

spoken texts

transcriptions of speech

manuscripts

ancient texts on papyri, stone

medieval texts

illuminated manuscripts

modern texts

variorum

handwritten

typewritten

Corrections

<l>Я просыпаюсь. Я <choice><orig>об'ят</orig><reg>объят</reg></choice></l>
<l>Открывшимся. Я на <choice><orig>учете</orig><reg>учете.</reg></choice></l>

Normalization
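
The `<choice>` encoding keeps both the original and the regularized spelling, so either reading can be produced from the same file. The following is a minimal sketch of that idea, assuming a line without the TEI namespace (real TEI files normally declare `http://www.tei-c.org/ns/1.0`, in which case the tag names would need to be qualified). The function name `render` and its `prefer` parameter are illustrative, not part of the project.

```python
import xml.etree.ElementTree as ET

# Sample line modelled on the slide; the <l> element is closed here.
LINE = "<l>Я просыпаюсь. Я <choice><orig>об'ят</orig><reg>объят</reg></choice></l>"


def render(element, prefer="reg"):
    """Flatten an element to plain text, keeping one branch of each <choice>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "choice":
            # Keep only the preferred reading (<reg> or <orig>) of the choice.
            chosen = child.find(prefer)
            if chosen is not None:
                parts.append(render(chosen, prefer))
        else:
            parts.append(render(child, prefer))
        # Text that follows the child element belongs to the parent line.
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(LINE)
normalized = render(line, prefer="reg")
diplomatic = render(line, prefer="orig")
print(normalized)
print(diplomatic)
```

Calling `render` with `prefer="orig"` gives the text as printed in the source edition, while `prefer="reg"` gives the normalized reading suitable for search or linguistic processing.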

create volume/text-type matrix

select TEI schemes for different text types

use modified XML from ABBYY FineReader for structural elements

parse indexes and link them to text

define intertextual links

make Semantic Tolstoy cookbook

Preliminary work

Улыбка <forename>Аграфены Петровны</forename> означала, что письмо было от <roleName>княжны</roleName> <surname>Корчагиной</surname>, на которой, по мнению <forename>Аграфены Петровны</forename>, <surname>Нехлюдов</surname> собирался жениться. И это предположение, выражаемое улыбкой <forename>Аграфены Петровны</forename>, было неприятно <surname>Нехлюдову</surname>.

TEI for Tolstoy (cookbook)
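
Once names are tagged this way, they can be pulled out of the text mechanically, for example to compare them with a printed index. The sketch below assumes a shortened fragment wrapped in a hypothetical `<p>` root and no TEI namespace. Tag names are compared case-insensitively because TEI P5 spells the role element `roleName`. The helper `collect_names` is illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened fragment modelled on the slide, wrapped in a hypothetical <p> root.
FRAGMENT = (
    "<p>Улыбка <forename>Аграфены Петровны</forename> означала, что письмо было от "
    "<roleName>княжны</roleName> <surname>Корчагиной</surname>.</p>"
)

# Lower-cased names of the elements treated as name components.
NAME_TAGS = {"forename", "surname", "rolename"}


def collect_names(xml_text):
    """Return (tag, text) pairs for every name component, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (el.tag, (el.text or "").strip())
        for el in root.iter()
        if el.tag.lower() in NAME_TAGS
    ]


names = collect_names(FRAGMENT)
counts = Counter(tag for tag, _ in names)
for tag, text in names:
    print(tag, text)
```

The extracted strings are still inflected word forms (`Корчагиной`, not `Корчагина`), so linking them to index entries would additionally require lemmatization, which this sketch does not attempt.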

Automatic date extraction (M. Kolbasov, HSE student)

Expression type | Source text | Tagged output

Direct, full | 17 марта 1847 года | <date when="1847-03-17"> 17 марта 1847 года </date>

Direct, partial | Числа 22 | <date when="1847-03-22"> Числа 22 </date>

Ray, backward | Вот уже шестой день | Вот уже <date from="1847-04-24" to="1848-01-01"> шестой день </date>

Interval, present | Эту неделю я сижу дома | Эту <date from="1847-04-19" to="1847-04-25"> неделю </date> я сижу дома

Point, past | Я совершенно доволен собою за вчерашний день | Я совершенно доволен собою за <date when="1847-04-23"> вчерашний день </date>
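
The simplest case in the examples above, an explicit day-month-year date such as "17 марта 1847 года", can be handled with a regular expression. The following is a minimal sketch under that assumption only; it is not the student's implementation, and the names `MONTHS`, `FULL_DATE` and `tag_full_dates` are hypothetical. Partial and relative expressions ("Числа 22", "вчерашний день") would additionally need the date of the surrounding diary entry as an anchor, which is not modelled here. The `when` value is written as YYYY-MM-DD, the ISO form expected by TEI.

```python
import re

# Genitive month names as they appear in full dates such as "17 марта 1847 года".
MONTHS = {
    "января": 1, "февраля": 2, "марта": 3, "апреля": 4, "мая": 5, "июня": 6,
    "июля": 7, "августа": 8, "сентября": 9, "октября": 10, "ноября": 11, "декабря": 12,
}

# Day, month name, four-digit year, and an optional trailing "года" / "г.".
FULL_DATE = re.compile(
    r"(?P<day>\d{1,2}) (?P<month>" + "|".join(MONTHS) + r") (?P<year>\d{4})(?: года| г\.)?"
)


def tag_full_dates(text):
    """Wrap explicit day-month-year dates in <date when="YYYY-MM-DD">."""
    def repl(match):
        iso = "{:04d}-{:02d}-{:02d}".format(
            int(match["year"]), MONTHS[match["month"]], int(match["day"])
        )
        return '<date when="{}">{}</date>'.format(iso, match.group(0))
    return FULL_DATE.sub(repl, text)


tagged = tag_full_dates("17 марта 1847 года")
print(tagged)
```

Text without an explicit full date passes through unchanged, so the function can be applied to whole diary entries before the harder relative cases are resolved by other means.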

Old2New orthography transliterator (M. Kartysheva, E. Sidorova, D. Kolomeytsev, students of HSE)
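
To give a sense of what such a transliterator has to do, here is a minimal sketch covering only the purely mechanical part of the 1918 spelling reform: the letters ѣ, і, ѳ and ѵ, and the word-final hard sign. It is not the students' tool. The names `LETTER_MAP`, `FINAL_HARD_SIGN` and `old_to_new` are hypothetical, and rules that depend on morphology (for example adjective endings in -аго/-яго) are deliberately left out.

```python
import re

# One-to-one letter replacements introduced by the 1918 spelling reform.
LETTER_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И",
    "ѳ": "ф", "Ѳ": "Ф",
    "ѵ": "и", "Ѵ": "И",
})

# A hard sign at the very end of a word was dropped; inside a word it stays.
FINAL_HARD_SIGN = re.compile(r"(?<=\w)[ъЪ]\b")


def old_to_new(text):
    """Apply the letter substitutions, then strip word-final hard signs."""
    return FINAL_HARD_SIGN.sub("", text.translate(LETTER_MAP))


modern = old_to_new("Левъ Толстой въ Ясной Полянѣ")
print(modern)
```

Because the hard sign is only removed at a word boundary, a form such as "объятъ" keeps its internal ъ and loses only the final one.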

Student projects

Old2New orthography transliterator

Tolstoy corpus for ruscorpora

Universal index parser

Together with Compreno

Named entity extraction

Evaluation of NE merging (indexes as a Gold Standard)

Fact extraction

Accompanying projects

“The coolest thing to do with your data will be thought of by someone else.”

Rufus Pollock, Co-Founder and Director, Open Knowledge Foundation