nouashpchekh

  • 8/22/2019 nouashpchekh

    The list accessible from this page includes about 32000 words with a frequency greater than 1

    ipm (one instance per million words). A shorter selection of the 5000 most frequent words is also

    available. Lists use Windows-1251 encoding for Cyrillic and are compressed as zip archives.

    lemma.al.zip - lemmas sorted in alphabetical order

    lemma.num.zip - lemmas sorted by their frequency

    words.num.zip - word forms sorted by their frequency

    Lists of 5000 most frequent words

    5000lemma.al.zip - lemmas sorted in the alphabetical order

    5000lemma.num.zip - lemmas sorted by their frequency
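
Reading one of these archives could look roughly like the sketch below. Only the Windows-1251 encoding and the zip compression come from the description above. The function name `read_frequency_list`, the assumption that each archive holds a single text file, and the demo file `demo.txt` are illustrative guesses, not facts about the actual downloads.

```python
import io
import zipfile
from typing import List, Optional


def read_frequency_list(zip_bytes: bytes, member: Optional[str] = None) -> List[str]:
    """Return the decoded lines of one file stored in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Assumption: the archive holds one text file, so the first
        # member is used when no name is given.
        name = member or archive.namelist()[0]
        raw = archive.read(name)
    # The lists are stated to use Windows-1251 (Python codec "cp1251").
    return raw.decode("cp1251").splitlines()


def make_demo_archive() -> bytes:
    """Build a tiny stand-in archive in memory (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo.txt", "и\nв\nне\n".encode("cp1251"))
    return buffer.getvalue()


demo_lines = read_frequency_list(make_demo_archive())
```

For a real download, the bytes of the zip file would be passed in place of `make_demo_archive()`. The column layout of each line is not described on this page, so the sketch stops at decoded lines.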

    Some data about uses of words in modern Russian

    The average word length is 5.28 characters.

    The average sentence length is 10.38 words.

    1000 most frequent lemmas cover 64.0708% of word forms in texts.

    2000 most frequent lemmas cover 71.9521% of word forms in texts.

    3000 most frequent lemmas cover 76.6824% of word forms in texts.

    5000 most frequent lemmas cover 82.0604% of word forms in texts.

    The exact information on the mapping of frequency to coverage is available from here.
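
Coverage figures of this kind can be derived from a frequency list by summing the counts of the top-ranked lemmas and dividing by the total number of word forms. The sketch below shows the arithmetic on invented toy counts. The function name `coverage` and the optional `total` parameter are assumptions made for illustration, not the procedure actually used for the figures above.

```python
from typing import Iterable, Optional


def coverage(frequencies: Iterable[float], top_n: int,
             total: Optional[float] = None) -> float:
    """Percentage of all word forms covered by the top_n most frequent lemmas."""
    ranked = sorted(frequencies, reverse=True)
    if total is None:
        # Assumption: the list is complete, so its sum is the corpus size.
        # For a truncated list, pass the real corpus size as `total`.
        total = sum(ranked)
    return 100.0 * sum(ranked[:top_n]) / total


# Invented toy counts for six lemmas (100 tokens in total).
toy_counts = [50, 20, 10, 10, 5, 5]
top2 = coverage(toy_counts, 2)   # the two most frequent lemmas: 70 of 100 tokens
top4 = coverage(toy_counts, 4)   # the four most frequent lemmas: 90 of 100 tokens
```

The toy data shows the same diminishing returns as the real figures: each further block of lemmas adds less coverage than the one before it.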

    The list is compiled on the basis of a corpus of modern Russian. It contains a selection of

    modern fiction, political texts, newspapers, and popular science (about 40 million words,

    MW; fiction accounts for about half of the corpus). All texts were originally written in

    Russian between 1970 and 2002, the majority of them between 1980 and 1995; the

    newspaper corpus is from 1997-1999.

    It is widely known that large texts present a problem for frequency lists, since a large text that

    contains many instances of a rare word can boost its frequency. If the corpus is based on

    fiction, large texts are quite frequent. As an example, the corpus contains a huge sequel to

    Tolkien's "The Lord of the Rings" written by a Russian author (Nick Perumov). Although

    the sequel is only about 250 kW (thousand words) long, less than one percent of the whole corpus,

    the frequency of the word hobbit in that book puts the word among the first thousand

    most frequent Russian words if no precautions against large texts are taken. For this

    reason, the frequency list is calculated under the condition that no single text from the corpus

    contributes more than 10 kW and no author contributes more than 100 kW to the count. Thus,

    the subset of the whole corpus used for the frequency count is about 16 MW.
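
The two limits described above can be sketched as follows. The caps of 10 kW per text and 100 kW per author come from the description. How each text is cut down is not stated, so taking the first words of each text is an assumption, as are the names `capped_counts` and `to_ipm` and the toy data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

TEXT_CAP = 10_000     # at most 10 kW counted from any single text
AUTHOR_CAP = 100_000  # at most 100 kW counted from any single author


def capped_counts(texts: Iterable[Tuple[str, List[str]]],
                  text_cap: int = TEXT_CAP,
                  author_cap: int = AUTHOR_CAP) -> Counter:
    """Count lemmas from (author, lemmas) pairs under both caps."""
    counts: Counter = Counter()
    used_by_author: Dict[str, int] = defaultdict(int)
    for author, lemmas in texts:
        # Remaining allowance: the per-text cap or what is left for the author.
        budget = min(text_cap, author_cap - used_by_author[author])
        if budget <= 0:
            continue
        # Assumption: the first `budget` words of the text are used.
        sample = lemmas[:budget]
        counts.update(sample)
        used_by_author[author] += len(sample)
    return counts


def to_ipm(counts: Counter) -> Dict[str, float]:
    """Convert raw counts to instances per million words."""
    total = sum(counts.values())
    return {lemma: 1_000_000 * n / total for lemma, n in counts.items()}


# Toy corpus with tiny caps so the effect is visible.
demo_texts = [
    ("A", ["hobbit"] * 8 + ["и"] * 2),  # long text, cut to 4 words
    ("A", ["hobbit"] * 5),              # same author, only 2 words left
    ("B", ["и"] * 4),
]
demo_counts = capped_counts(demo_texts, text_cap=4, author_cap=6)
demo_ipm = to_ipm(demo_counts)
```

Without the caps, hobbit would account for 13 of the 19 toy tokens. With them it accounts for 6 of 10, and the author's second text contributes only what is left of the per-author allowance.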

    Words are not uniformly distributed in texts. Some of them (like prepositions) occur in many

    texts at a predictable rate, some (like pronouns or mental verbs) are significantly more

    frequent for certain writers or genres, while some are "contagious": if a word (e.g. a proper

    name, a title of nobility or a technical term) occurs once in a text, it tends to be repeated, thus

    boosting its frequency in a document. The variation can be measured in a variety of ways

    (Church, K. and Gale, W., 1995). Data on this variation is available from here. The structure:

    lemma, mean frequency (ipm), number of texts in which the lemma occurs, standard deviation

    of frequency counted for all texts, coefficient of variation, variance.
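
The fields listed above can be computed for one lemma from its frequency in each text, as in the sketch below. The page does not say whether the population or the sample variance is used, or how the per-text frequencies are averaged. The population variance, the plain mean over all texts, and the name `variation_stats` are therefore assumptions.

```python
import statistics
from typing import Dict, Sequence


def variation_stats(per_text_ipm: Sequence[float]) -> Dict[str, float]:
    """Summarise how one lemma's frequency (ipm) varies across texts.

    per_text_ipm holds one value per text in the corpus, with zeros
    for texts where the lemma does not occur.
    """
    mean = statistics.fmean(per_text_ipm)
    # Assumption: population variance, since all texts are counted.
    variance = statistics.pvariance(per_text_ipm)
    stdev = variance ** 0.5
    return {
        "mean_ipm": mean,
        "texts": sum(1 for f in per_text_ipm if f > 0),
        "stdev": stdev,
        # Coefficient of variation: standard deviation relative to the mean.
        "cv": stdev / mean if mean else float("nan"),
        "variance": variance,
    }


# Toy lemma that is absent from two texts and frequent in two others.
demo_stats = variation_stats([0.0, 0.0, 10.0, 10.0])
```

A "contagious" word of the kind described above would show a high coefficient of variation, while an evenly spread word such as a preposition would show a value close to zero.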

    The corpus, tools for working with it, as well as an aligned parallel English-Russian corpus

    are discussed in the following publication:
