E-science at Språkbanken - Göteborgs universitet

Post on 09-Sep-2020

LT-based E-science at Språkbanken

Språkbanken kick-off

January 2015

Definition of e-science

• E-Science (or eScience) is computationally intensive science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid computing; the term sometimes includes technologies that enable distributed collaboration, such as the Access Grid.

• Most of the research activities into e-Science have focused on the development of new computational tools and infrastructures to support scientific discovery.

[Diagram, built up over four slides. Recoverable contents:]

• Digital humanities and social science

• Political

• Medical

• Historical …

• Korp/Karp front end

• Korp

• Karp

Methods

Historical resources

Modern resources

Layer labels: Application areas, Front ends, Infrastructures, Language Technology

Annotations on individual builds: "One example, Lärka"; "Swe-Clarin"

The SB definition of e-science

• IT-based research methodology

• With or without large amounts of data

• Corpus linguistics is a prominent example

• In the domain of digital humanities and social sciences

What has been done at SB? What are we working on?

Words (multiwords)

Relations

Coordinations

Topics

Readability

Twitter

MWE detection to improve parsing quality

1. Use lists as a basis for e.g., idioms, terminology and entities

2. Add regular expressions and pattern matching to find more MWEs

3. Perform Parsing
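The three steps above can be sketched as a small pre-processing pass that runs before the parser. This is only an illustration under stated assumptions: the list entries, the regular expression, and the function names `mark_mwes` and `tokenize_for_parser` are hypothetical, and joining MWE words with underscores is one common convention, not necessarily the one used at SB.

```python
import re

# Step 1 (hypothetical data): a list of known idioms, terms and entities.
MWE_LIST = ["kick the bucket", "New York", "pain in the neck"]

# Step 2 (hypothetical pattern): light-verb constructions such as "take a walk".
MWE_PATTERNS = [re.compile(r"\b(?:take|have|make) an? \w+\b", re.IGNORECASE)]


def mark_mwes(sentence):
    """Return the sentence with every recognised MWE joined by underscores."""
    spans = []
    for mwe in MWE_LIST:
        pattern = re.compile(r"\b" + re.escape(mwe) + r"\b", re.IGNORECASE)
        spans.extend(m.span() for m in pattern.finditer(sentence))
    for pattern in MWE_PATTERNS:
        spans.extend(m.span() for m in pattern.finditer(sentence))

    # Replacing a space with an underscore keeps the length unchanged,
    # so the character offsets of all spans stay valid.
    chars = list(sentence)
    for start, end in spans:
        for i in range(start, end):
            if chars[i] == " ":
                chars[i] = "_"
    return "".join(chars)


def tokenize_for_parser(sentence):
    """Step 3 input: each MWE is now a single token for the parser."""
    return mark_mwes(sentence).split()


print(tokenize_for_parser("He will take a walk in New York"))
```

The parser then sees each recognised expression as one unit instead of several unrelated words.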

Confirmed intuition and previous experiments that pre-recognizing MWEs improves parsing (by 16%). Figure from Boleda G. & Evert S.:

“Multiword Expressions: A pain in the neck of lexical semantics”

Semantics in Storytelling in Swedish Fiction

Relation extraction from Swedish Prose Fiction (SPF)

• List of relations

• NEE to extract names and aliases; document-centred approach to link aliases to names

• Extract sentences with min. 2 names.

• Detect relation

Automatic detection would significantly improve coverage of relations
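A minimal sketch of the pipeline on this slide, assuming the names and aliases have already been extracted. The alias table, the relation word list, and all function names are hypothetical illustrations; the slide does not say how relations are actually detected, so a simple list lookup stands in for that step.

```python
import re
from itertools import combinations

# Hypothetical output of the name extraction step: alias -> canonical name.
ALIASES = {"Anna": "Anna Berg", "Anna Berg": "Anna Berg", "Erik": "Erik Lund"}

# Hypothetical list of relation words -> relation type.
RELATIONS = {"sister": "sibling", "brother": "sibling",
             "married": "spouse", "wife": "spouse"}


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def names_in(sentence):
    """Canonical names mentioned in the sentence, via the alias table."""
    found = set()
    for alias, canonical in ALIASES.items():
        if re.search(r"\b" + re.escape(alias) + r"\b", sentence):
            found.add(canonical)
    return sorted(found)


def extract_relations(text):
    """Return (name, relation, name) triples from sentences with >= 2 names."""
    triples = []
    for sentence in split_sentences(text):
        names = names_in(sentence)
        if len(names) < 2:
            continue
        for word in re.findall(r"\w+", sentence.lower()):
            if word in RELATIONS:
                for a, b in combinations(names, 2):
                    triples.append((a, RELATIONS[word], b))
    return triples


text = "Anna Berg walked home. Erik was married to Anna. Erik slept."
print(extract_relations(text))
```

Only the second sentence mentions two distinct people, so it is the only one inspected for a relation word.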

Relations between 2 males = red, between 2 females = green, otherwise blue.


Swedish Pseudo Coordination (SPC) detection and change

• Verb pairs where the first is light

• ”åka och handla” (go and shop), ”gå och gifta sig” (go and get married), ”ringa och berätta” (call and tell)

• Typical properties apply:

• E.g., ”både” (both) is not possible: ”jag både satt och läste” (I both sat and read)

• No paraphrasing: ”Mona satt och hon läste” (Mona sat and she read).

• Try to distinguish SPCs from non-SPCs using these features

• False positives

• We think non-SPC, algorithm guesses SPC.

• Relaxing drop in P/R

• fara, resa, trilla, varda, stog, vända, testa, mejla, maila, kommentera, blogga, googla

Precision and recall for Blogmixen
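For reference, precision and recall for a binary SPC classifier can be computed as below. The labels in the example are made up for illustration and are not results from Blogmixen; a false positive is the case described above, where the annotators say non-SPC and the algorithm guesses SPC.

```python
def precision_recall(gold, predicted):
    """Precision and recall for binary labels (True = SPC)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # correctly found SPCs
    fp = sum(1 for g, p in pairs if not g and p)    # non-SPC guessed as SPC
    fn = sum(1 for g, p in pairs if g and not p)    # SPCs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical gold annotations and classifier guesses.
gold = [True, True, False, False]
predicted = [True, True, True, False]
print(precision_recall(gold, predicted))
```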

Topic Modeling

• SPF used as data set

• Topic Modeling applied (Mallet)

which parts of a document belong to which topic

which part of any document belongs to topic i

Link original resources to help validate topics
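The two questions above can be answered from the per-part topic proportions that a topic model produces. The sketch below assumes such proportions are already available as plain lists; the data, the key format `(document, part index)`, and the function names are hypothetical and do not reflect Mallet's own output format.

```python
def dominant_topics(doc_topics):
    """Map each (document, part) key to the index of its highest-weight topic."""
    return {
        key: max(range(len(weights)), key=weights.__getitem__)
        for key, weights in doc_topics.items()
    }


def parts_for_topic(doc_topics, topic):
    """All (document, part) keys whose dominant topic is the given topic."""
    assigned = dominant_topics(doc_topics)
    return sorted(key for key, t in assigned.items() if t == topic)


# Hypothetical topic proportions for paragraphs of two novels, three topics.
doc_topics = {
    ("novel_a", 0): [0.7, 0.2, 0.1],
    ("novel_a", 1): [0.1, 0.8, 0.1],
    ("novel_b", 0): [0.2, 0.2, 0.6],
    ("novel_b", 1): [0.6, 0.3, 0.1],
}

print(dominant_topics(doc_topics))     # which part belongs to which topic
print(parts_for_topic(doc_topics, 0))  # which parts belong to topic 0
```

Keeping the `(document, part)` key makes it easy to link each hit back to the original text when validating a topic.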

Readability of text

• All paragraphs assigned to topic i that are easy to read.

• Investigate different readability measures for text.

• Measures for English applied to Swedish

Readability measures are not very reliable when applied directly to Swedish texts.
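As one concrete example of what a readability measure looks like, here is a sketch of LIX, a surface-based index defined as the average sentence length plus the percentage of words longer than six letters. LIX is not named on the slide; it is used here only as an illustration, and the sentence splitting is a simplification.

```python
import re


def lix(text):
    """LIX = words per sentence + 100 * (words longer than 6 letters) / words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, including å, ä, ö
    if not sentences or not words:
        return 0.0
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


# Two short sentences: 5 words, 1 long word ("springer").
print(lix("Katten sover. Hunden springer snabbt."))
```

A score like this could then be used to filter the paragraphs assigned to a topic down to the ones that are easy to read.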

Twitter Analysis around political debates

• Start with some hashtags, e.g., #pldebatt

• Find all tweets = core

• Train classifier to find related tweets

• Divide into known topics (from debate)
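The first three steps might look roughly like this. The tweets are invented, the function names are hypothetical, and the word-overlap score is only a stand-in for the trained classifier mentioned on the slide, whose actual type and features are not given.

```python
import re
from collections import Counter


def hashtags(tweet):
    return {h.lower() for h in re.findall(r"#\w+", tweet)}


def words(tweet):
    return re.findall(r"\w+", tweet.lower())


def split_core(tweets, seeds):
    """Core = tweets containing at least one seed hashtag; rest = the others."""
    seeds = {s.lower() for s in seeds}
    core = [t for t in tweets if hashtags(t) & seeds]
    rest = [t for t in tweets if not (hashtags(t) & seeds)]
    return core, rest


def related(core, rest, threshold=0.5):
    """Stand-in for a classifier: keep tweets that share enough core vocabulary."""
    counts = Counter(w for t in core for w in words(t))
    frequent = {w for w, n in counts.items() if n >= 2}
    selected = []
    for tweet in rest:
        ws = words(tweet)
        if ws and sum(w in frequent for w in ws) / len(ws) >= threshold:
            selected.append(tweet)
    return selected


# Invented example tweets.
tweets = [
    "#pldebatt skolan och jobben",
    "jobben och skolan diskuteras #pldebatt",
    "skolan och jobben igen",
    "fin fredag idag",
]
core, rest = split_core(tweets, ["#pldebatt"])
print(core)
print(related(core, rest))
```

The core and the related tweets together would then be divided into the known debate topics.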

[Figure, built up over four slides: CORE tweets divided into Topic 1 to Topic 6, with October and May marked.]

E.g., Topic 10: jan björklund, allians, frisyr, åkesson, slips, siffra, sverige, prata, romson, nöjd

E.g., Topic 1: Digram attackera, fusklapp läcka, vinna, ord, dålig, analys, tydlig, jobba, önska, missa

Conclusions

• There are many, many interesting things to do in the field of E-science

Come and join us!

Future work

• Workshop on SB-related activities for DHSS on April 17th

• Want to present your work?