BL Labs Presentation at Språkbanken, University of Gothenburg
E-science at Språkbanken - Göteborgs universitet
LT-based E-science at Språkbanken
Språkbanken kick-off
January 2015
Definition of e-science
• E-Science (or eScience) is computationally intensive science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid computing; the term sometimes includes technologies that enable distributed collaboration, such as the Access Grid.
• Most of the research activities into e-Science have focused on the development of new computational tools and infrastructures to support scientific discovery.
[Diagram, built up over four slides]
Labels: Application areas, Front ends, Methods, Historical resources, Modern resources, Infrastructures, Language Technology
• Digital humanities and social science
• Political
• Medical
• Historical …
• Korp/Karp front end
• Korp
• Karp
Callouts on the intermediate builds: "One example, Lärka" and "Swe-Clarin"
The SB definition of e-science
• IT-based research methodology
• With or without large amounts of data
• Corpus linguistics is a prominent example
• In the domain of digital humanities and social sciences
What has been done at SB? What are we working on?
Words (multiwords)
Relations
Coordinations
Topics
Readability
MWE detection to improve parsing quality
1. Use lists as a basis for e.g., idioms, terminology and entities
2. Add regular expressions and pattern matching to find more MWEs
3. Perform parsing
Confirmed intuition and previous experiments: pre-recognizing MWEs improves parsing (by 16%). Figure from Boleda G. & Evert S.:
“Multiword Expressions: A pain in the neck of lexical semantics”
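The three steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used at SB: the list entries, the regular expression, and the function name `mark_mwes` are all hypothetical, and the parser itself (step 3) is left out. Recognized MWEs are joined with underscores so that a downstream parser would see each one as a single token.

```python
import re

# Step 1: a list of known MWEs (hypothetical entries for illustration).
MWE_LIST = ["pain in the neck", "kick the bucket"]

# Step 2: patterns that catch MWEs missing from the list
# (hypothetical pattern: a capitalized word followed by "University" or "Institute").
MWE_PATTERNS = [re.compile(r"\b[A-Z]\w+ (?:University|Institute)\b")]


def mark_mwes(sentence, mwe_list=MWE_LIST, patterns=MWE_PATTERNS):
    """Join every recognized MWE into one underscore-connected token."""
    spans = []
    for mwe in mwe_list:
        list_pattern = re.compile(r"\b" + re.escape(mwe) + r"\b", re.IGNORECASE)
        spans.extend(m.span() for m in list_pattern.finditer(sentence))
    for pattern in patterns:
        spans.extend(m.span() for m in pattern.finditer(sentence))
    # Replacing spaces with underscores keeps the string length unchanged,
    # so the collected spans stay valid while we rewrite the sentence.
    for start, end in spans:
        sentence = sentence[:start] + sentence[start:end].replace(" ", "_") + sentence[end:]
    return sentence


# Step 3 would hand these tokens to a parser.
tokens = mark_mwes("Parsing idioms is a pain in the neck at Gothenburg University").split()
```

Joining with underscores is only one possible way to present an MWE to a parser as a single unit; the slides do not say how this was done.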
Semantics in Storytelling in Swedish Fiction
Relation extraction from Swedish Prose Fiction (SPF)
• List of relations
• NEE to extract names and aliases; document-centered approach to link aliases to names
• Extract sentences with at least two names.
• Detect relation
Automatic detection would significantly improve coverage of relations
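A minimal sketch of the sentence-filtering and list-based detection steps, assuming the names have already been produced by the NEE step. The relation word list, the function name `extract_relations`, and the English toy text are hypothetical stand-ins; the actual SPF relation list is not given in the slides.

```python
import re
from itertools import combinations

# Hypothetical relation list: trigger word -> relation label.
RELATION_WORDS = {
    "brother": "sibling",
    "sister": "sibling",
    "wife": "spouse",
    "husband": "spouse",
    "friend": "friend",
}


def extract_relations(text, names, relation_words=RELATION_WORDS):
    """Return (name, relation, name) triples from sentences with >= 2 names."""
    triples = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = [n for n in names
                   if re.search(r"\b" + re.escape(n) + r"\b", sentence)]
        if len(present) < 2:
            continue  # keep only sentences that mention at least two names
        for word in re.findall(r"\w+", sentence.lower()):
            if word in relation_words:
                for a, b in combinations(present, 2):
                    triples.append((a, relation_words[word], b))
    return triples


text = "Anna walked home. Karin is the sister of Erik. Erik smiled."
relations = extract_relations(text, ["Anna", "Karin", "Erik"])
```

A list lookup like this only finds relations whose trigger words are on the list, which is the coverage limitation the slide points to.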
Relations between 2 males = red, between 2 females = green, otherwise blue.
Swedish Pseudo-Coordination (SPC) detection and change
• Verb pairs where the first verb is light
• ”åka och handla” (go and shop), ”gå och gifta sig” (go and get married), ”ringa och berätta” (call and tell)
• Typical properties apply:
• E.g., ”både” (both) is not possible: ”jag både satt och läste” (I both sat and read)
• No paraphrasing: ”Mona satt och hon läste” (Mona sat and she read).
• Try to distinguish SPCs from non-SPCs using these features
• False positives
• We think non-SPC, algorithm guesses SPC.
• Relaxing: drop in P/R (precision/recall)
• fara, resa, trilla, varda, stog, vända, testa, mejla, maila, kommentera, blogga, googla
Precision and recall for Blogmixen
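For reference, precision and recall for an SPC classifier can be computed as in the sketch below. The labels are made-up toy values, not the Blogmixen results, and the function name is hypothetical. A false positive here is exactly the case named above: the annotators say non-SPC, the algorithm guesses SPC.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall for boolean labels (True = SPC)."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_pos = sum(1 for g, p in pairs if not g and p)  # we say non-SPC, algorithm says SPC
    false_neg = sum(1 for g, p in pairs if g and not p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy annotations for five verb pairs (hypothetical values).
gold = [True, True, False, False, True]
predicted = [True, False, True, False, True]
precision, recall = precision_recall(gold, predicted)
```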
Topic Modeling
• SPF used as data set
• Topic Modeling applied (Mallet)
which parts of a document belong to which topic
which part of any document belongs to topic i
Link original resources to help validate topics
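The two questions above can be answered from per-paragraph topic proportions, as in this small sketch. It assumes such proportions are already available from a topic model; the numbers, the document identifiers, and the function names are hypothetical, and no Mallet output format is implied.

```python
def assign_topics(paragraph_topics):
    """Map each (doc_id, paragraph_index) to its most probable topic."""
    return {
        key: max(range(len(dist)), key=dist.__getitem__)
        for key, dist in paragraph_topics.items()
    }


def paragraphs_for_topic(assignments, topic):
    """List every paragraph, in any document, assigned to the given topic."""
    return sorted(key for key, t in assignments.items() if t == topic)


# Hypothetical topic proportions for paragraphs of two novels, three topics.
paragraph_topics = {
    ("novel_a", 0): [0.7, 0.2, 0.1],
    ("novel_a", 1): [0.1, 0.8, 0.1],
    ("novel_b", 0): [0.2, 0.1, 0.7],
    ("novel_b", 1): [0.6, 0.3, 0.1],
}
assignments = assign_topics(paragraph_topics)
topic_0_paragraphs = paragraphs_for_topic(assignments, 0)
```

Keeping the document identifier and paragraph index as the key makes it straightforward to link back to the original resource when validating a topic.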
Readability of text
• All paragraphs assigned to topic i that are easy to read.
• Investigate different readability measures for text.
• Measures for English → Swedish
Readability measures are not very reliable when applied directly to Swedish texts.
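The slides do not say which measures were investigated. As one illustration of what a readability measure computes, the sketch below implements LIX, a measure commonly used for Swedish: the average sentence length plus the percentage of words longer than six letters. The tokenization and sentence-splitting rules are simplifying assumptions.

```python
import re


def lix(text):
    """LIX = words / sentences + 100 * long_words / words (long = more than 6 letters)."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, including å, ä, ö
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


# 5 words, 2 sentences, 3 long words: 5/2 + 100*3/5 = 62.5
score = lix("Jag läser. Universitetet publicerar forskningsresultat.")
```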
Twitter Analysis around political debates
• Start with some hashtags, e.g., #pldebatt
• Find all tweets = core
• Train classifier to find related tweets
• Divide into known topics (from debate)
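The first three steps might look like the sketch below. It is an illustration under stated assumptions, not the classifier used at SB: the tweets are made-up English toy examples, a small multinomial Naive Bayes model stands in for whatever classifier was actually trained, and a separate sample of unrelated background tweets is assumed to be available as negative training data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[#\w]+", text.lower())


def split_core(tweets, seed_hashtags):
    """Core = every tweet containing at least one seed hashtag."""
    seeds = {h.lower() for h in seed_hashtags}
    core = [t for t in tweets if seeds & set(tokenize(t))]
    rest = [t for t in tweets if not seeds & set(tokenize(t))]
    return core, rest


def train(core, background, seed_hashtags):
    """Count words per class; the seed hashtags themselves are ignored."""
    seeds = {h.lower() for h in seed_hashtags}
    counts = {True: Counter(), False: Counter()}
    for label, docs in ((True, core), (False, background)):
        for doc in docs:
            counts[label].update(w for w in tokenize(doc) if w not in seeds)
    vocab = set(counts[True]) | set(counts[False])
    priors = {True: len(core), False: len(background)}
    return counts, vocab, priors


def is_related(tweet, model):
    """Naive Bayes decision with add-one smoothing."""
    counts, vocab, priors = model
    scores = {}
    for label in (True, False):
        total = sum(counts[label].values())
        score = math.log(priors[label] / (priors[True] + priors[False]))
        for word in tokenize(tweet):
            if word in vocab:
                score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]


# Made-up toy data (hypothetical tweets).
tweets = [
    "#pldebatt minister talks about jobs and schools",
    "opposition on taxes and jobs #pldebatt",
    "schools and taxes dominate tonight #pldebatt",
    "great goal in the football match tonight",
    "my cat sleeps all day",
    "the party leaders argue about taxes and schools",
]
background = ["football match was great", "cat pictures all day", "great weather today"]

core, rest = split_core(tweets, ["#pldebatt"])
model = train(core, background, ["#pldebatt"])
related = [t for t in rest if is_related(t, model)]
```

The last step, dividing the core and related tweets into the known debate topics, could reuse the same kind of classifier with one class per topic.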
[Diagram: CORE and Topic 1 to Topic 6; labels: October, May]
E.g., Topic 10: jan björklund, allians, frisyr, åkesson, slips, siffra, sverige, prata, romson, nöjd
E.g., Topic 1: Digram attackera, fusklapp läcka, vinna, ord, dålig, analys, tydlig, jobba, önska, missa
Conclusions
• There are many, many interesting things to do in the field of E-science
Come and join us!
Future work
• Workshop on SB-related activities for DHSS on April 17th
• Want to present your work?