Textdatenbanken12.ppt...
Transcript of Textdatenbanken12.ppt...
Textdatenbanken
Sommersemester 2020
12. Vorlesung
- Text-Genres und Korpusstatistik, Teil 3 -
Uwe Quasthoff
Universität Leipzig
Institut für Informatik
U. Quasthoff Textdatenbanken 2
Präsentation der Messergebnisse
• Optisch: Graphische Darstellung (z.B. Gerade)
• Minimal: Beschreibung der Grafik durch möglichst wenige Parameter (z.B.
Anstieg)
• Exemplarisch: Angabe von Beispielen in einer Tabelle
U. Quasthoff Textdatenbanken 3
Präsentation der Messergebnisse:
Regelmäßigkeit
• Optisch: Graphische Darstellung (z.B. Gerade)
• Erkennen von Gesetzmäßigkeiten / Zusammenhängen
• Minimal: Beschreibung der Grafik durch möglichst wenige Parameter (z.B.
Anstieg)
• Verwendung in Feature-Vektoren
• Grundlage für Vergleich / Clustering
• Exemplarisch: Angabe von Beispielen in einer Tabelle
– Typische Beispiele
U. Quasthoff Textdatenbanken 4
Präsentation der Messergebnisse:
Unregelmäßigkeiten
• Optisch: Graphische Darstellung (z.B. Gerade)
• Abweichung von erwarteten Gesetzmäßigkeiten / Zusammenhängen
• Exemplarisch: Angabe von Beispielen in einer Tabelle
– Extreme Beispiele
• Aufspüren von interessanten Sonderfällen
• Hinweis auf Fehler in der Vorverarbeitung
U. Quasthoff Textdatenbanken 5
Beispiel: Satzlänge in Wörtern (Deu)
• Graph
• Mittelwert
• Beispiele für extrem kurze und
lange Sätze.
Aber auch:
• Unterscheidung nach Satztyp:
Aussage / Frage / Ausruf
• Unterscheidung der Quellen nach
mittlerer Satzlänge
• Varianz der Satzlänge pro Quelle
U. Quasthoff Textdatenbanken 6
Satzlänge in Ido-Wikipedia
Großer Teil automatisch
erzeugt. Beispielsätze:
• Tama, Iowa: Segun la
kontado di 2000 esas 2,731
homi, 1,065 hemanari, e
723 familii qui rezidas en la
urbo. La lojanto-denseso
esas 349.2/km² (905.1/mi²).
• Harper, Iowa: Segun la
kontado di 2000 esas 134
homi, 55 hemanari, e 40
familii qui rezidas en la
urbo. La lojanto-denseso
esas 574.9/km²
(1,523.7/mi²).
U. Quasthoff Textdatenbanken 7
Words by Length without
multiplicity
Here we ignore the fact that
words have different
frequencies. So for the
average word length,
each word is considered
equally. For a fixed
word length, we count
the number of different
words having this
length.
With a logarithmic scale of
the y-axis, we get a
nearly linear part
between length 15 and
40.
U. Quasthoff Textdatenbanken 8
Words by Length with multiplicity
The fact that stopwords are
very high frequent and
short will give a shorter
average word length than
in the previous picture.
U. Quasthoff Textdatenbanken 9
Average word length for different
frequency ranges
The table shows the average word
length (counted without
multiplicity) for the most
frequent 10n (n=1,2,…) words.
U. Quasthoff Textdatenbanken 10
Number of letter-N-grams at word
beginnings
How many different
letter-N-grams do we
find at the beginning
of a word? Of course
we will find many
unexpected N-grams,
but the will have low
frequency. This is the
reason to count these
numbers for different
ranges and use the
top K=10n words
(n=2, 3, 4, 5, 6).
U. Quasthoff Textdatenbanken 11
Text coverage by top words
Text coverage measures
the number of words
necessary to cover a
certain amount of
text of a corpus. The
table shows the text
coverage for the first
N=10n words,
n=1,…,5.
A diagram with these
values and
logarithmic x-axis
shows a nearly
straight line.
U. Quasthoff Textdatenbanken 12
Text Coverage
text coverage by the most frequent 10 words: 21.129%
text coverage by the most frequent 100 words: 40.212%
text coverage by the most frequent 1 000 words: 60.632%
text coverage by the most frequent 10 000 words: 80.703%
text coverage by the most frequent 100 000 words: 93.498%
U. Quasthoff Textdatenbanken 13
Sentences containing the most frequent
wordsFor the most frequent
words we present the
percentage of sentences
containing this word.
U. Quasthoff Textdatenbanken 14
Length of sentences in characters and
words
U. Quasthoff Textdatenbanken 15
Most frequent sentence beginnings
and endings of different length
U. Quasthoff Textdatenbanken 16
Buchstabenverteilung von Z in
Sätzen
U. Quasthoff Textdatenbanken 17
Buchstabenverteilung von X in
Sätzen
U. Quasthoff Textdatenbanken 18
Buchstabenverteilung von
Doppelpunkten in Sätzen
U. Quasthoff Textdatenbanken 19
Sentences consisting of short
words only
In this subsection we look for sentences containing only short words. The sentences
have minimum length of 40 characters and are ordered by the length of the
longest word.
U. Quasthoff Textdatenbanken 20
30 sentences with least maximum word number
U. Quasthoff Textdatenbanken 21
Sentences with highest average word
number
U. Quasthoff Textdatenbanken 22
Sentences with highest average word
length
U. Quasthoff Textdatenbanken 23
30 sentences with shortest word of
maximal length
U. Quasthoff Textdatenbanken 24
Types of Sentences by Punctuation
Mark
U. Quasthoff Textdatenbanken 25
Sentences consisting of long words
only
The table shows the sentences with maximal average word length. Because some
languages allow very long words, such sentences may also contain short
stopwords. Hence, we may find (at least some) well-formed sentences.
U. Quasthoff Textdatenbanken 26
Language Fingerprint
NN co-occurrences within the 10 most frequent words
U. Quasthoff Textdatenbanken 27
Number of NN-co-occurrences
depending on frequency classesIn many cases, two co-
occurring words have nearly the same frequency. In many other cases (like DET NN), the frequencies differ very much. The following plot shows the frequency classes of co-occurring words. Frequency classes are defined as the logarithm (with base 2) of the frequency rank. The size of the dots corresponds to the number of co-occurrences with the corresponding pair of frequency classes.
U. Quasthoff Textdatenbanken 28
Number of sentence co-
occurrences vs. FrequencyThe diagram below displays for any word its frequency and number of sentence co-
occurrences.
U. Quasthoff Textdatenbanken 29
Siz
e o
f So
urc
es
U. Quasthoff Textdatenbanken 30
Sentence length for different
sources: Min and Max
U. Quasthoff Textdatenbanken 31
Average word length for different
sources: Min and Max
U. Quasthoff Textdatenbanken 32
Sources consisting of many / few
words with frequency 1
U. Quasthoff Textdatenbanken 33
Sources with low / high average word
length of rare words