Analyzing concerns of people using Weblog articles and real world temporal data

Tomohiro Fukuhara
Research Institute of Science and Technology for Society
2-5-1 Atago, Mori Tower 18F, Minato-ku, Tokyo, JAPAN
[email protected]

Toshihiro Murayama
Research Institute of Science and Technology for Society
2-5-1 Atago, Mori Tower 18F, Minato-ku, Tokyo, JAPAN
[email protected]

Toyoaki Nishida
Graduate School of Informatics, Kyoto University
Yoshida-Honmachi, Sakyo-ku, Kyoto, JAPAN
[email protected]

ABSTRACT

We describe a system for collecting and analyzing Weblog articles in order to understand people’s concerns from collective and personal viewpoints. The system collects and analyzes Japanese and Chinese blog articles. From analyses using the system, we found (1) patterns of social concerns, (2) concerns of a person, and (3) relations between blogs and real-world temporal data such as temperature and news articles.

Categories and Subject Descriptors

H.3.5 [Information Systems]: Online Information Services; J.4.3 [Social and Behavioral Sciences]: Sociology

General Terms

Management

Keywords

Weblog analysis, social concern, personal concern, relation between blog and real world temporal data.

1. INTRODUCTION

Understanding people’s concerns is important for solving social problems. Today the world faces many problems, such as SARS (Severe Acute Respiratory Syndrome), BSE (Bovine Spongiform Encephalopathy), earthquakes, tsunamis, and terrorism. Because the focus on these problems differs among people, communities, and countries, it is important to understand concerns about them from personal and collective viewpoints, and from domestic and international viewpoints.

The aim of this research is twofold: (1) to understand people’s concerns from Weblog (blog) articles, and (2) to understand how various real-world factors affect those concerns.

First, we aim to understand social and personal concerns from blog articles. Because blogs have become a popular medium for publishing information on the Internet, we can easily collect a large number of articles posted by a wide range of people, including celebrities. Beyond sheer quantity, blogs are suitable for comparing concerns across languages because (1) people in many countries write blog articles, and (2) articles in different languages can be handled uniformly by using Unicode [1]. By collecting and analyzing blog articles on a large scale, we can find people’s concerns from personal and collective viewpoints, and from domestic and international viewpoints.

Second, we aim to understand the effects of various real-world factors that may affect our concerns. Figure 1 illustrates the factors that affect bloggers. Factors that may affect bloggers’ concerns include (1) natural phenomena such as temperature and weather, (2) media information such as TV shows and news stories, (3) human relations, (4) social situations, and (5) cultural situations.

The rest of this paper is organized as follows. Section 2 gives an overview of the system for collecting and analyzing blog articles. Section 3 describes analysis results for social concerns. Section 4 describes analysis results for personal concerns. Section 5 describes relations between social concerns and real-world temporal data. Section 6 discusses several issues related to this paper. Section 7 concludes.

2. KANSHIN: A SYSTEM FOR ANALYZING BLOG ARTICLES

In this section, we describe (1) an overview of KANSHIN, a system for understanding people’s concerns, (2) its architecture, and (3) its functions.

Copyright is held by the author/owner(s). WWW 2005, May 10--14, 2005, Chiba, Japan.

[Figure 1: diagram in which various factors (natural phenomena, media information, human relations, social situation, culture) act on bloggers, whose personal and social concerns appear in Weblog articles. Example trend graphs, plotted from 2004-03-18 to 2005-01-20, show “Spring”, “Summer”, “autumn”, “winter”; “Typhoon”, “Earthquake”; and “Iraq”, “pension”, “Olympic”.]

Figure 1. Understanding factors affecting bloggers’ concerns.

2.1 Overview

Figure 2 shows an overview of the system. The system consists of (1) a database, (2) a Web server, and (3) several Perl scripts for collecting and analyzing Japanese and Chinese blog articles. It runs on Red Hat Linux. The system has two major tasks: collecting RSS and Atom syndication feeds from blog sites, and analyzing them.

First, the system collects RSS and Atom feeds from Japanese and Chinese Weblog sites and stores them in the database (see the left side of Figure 2). Feeds are collected from (1) personal blog sites listed on ping servers such as Weblogs.com (footnote 1), (2) news sites that provide feed files, such as asahi.com (footnote 2), and (3) a governmental Web site (footnote 3) whose contents, news releases from Japanese government agencies, are converted into an RSS feed by an HTML-to-RSS conversion script (footnote 4). The system collects feeds every 20 minutes. We collect about 20,000 Japanese articles and 1,000 Chinese articles per day. The total number of articles is 9,690,885 for Japanese and 90,802 for Chinese (footnote 5).

Second, the system analyzes registered articles (1) at a user’s request and (2) on a schedule run by a cron daemon. In the former case, the system retrieves articles that match the query given by the user (see the right side of Figure 2). In the latter case, the system automatically analyzes articles to find the daily and monthly topics described in Section 2.3.2. Figure 3 shows a screen image of the system. The system accepts a query from a user and returns a daily trend graph of articles together with the articles containing the keywords.

1 http://www.weblogs.com/ (accessed 2005-04-06)

2 http://www.asahi.com/information/service/rss.html (in Japanese; accessed 2005-04-06)

3 http://www.gov-onlines.go.jp/ (in Japanese; accessed 2005-04-06)

4 This is an experimental function. Articles acquired from government agencies are stored for future analysis.

5 On April 7, 2005, 11:00:00.
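The collection step of Section 2.1 can be illustrated with a minimal sketch. The original system uses Perl scripts; the Python code below is only an illustration, and the function names, the sample feed, and the use of the link as a duplicate key are assumptions rather than details taken from the system. It handles RSS 2.0 items only (not Atom).

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <pubDate>Thu, 07 Apr 2005 10:40:00 +0900</pubDate>
      <description>Cherry blossoms are in full bloom.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <pubDate>Thu, 07 Apr 2005 11:00:00 +0900</pubDate>
      <description>Notes on the pension issue.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> element of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "date": item.findtext("pubDate", default=""),
            "text": item.findtext("description", default=""),
        })
    return items


def collect_new_articles(xml_text, seen_links):
    """Keep only articles whose link has not been stored yet (assumed key)."""
    fresh = [a for a in parse_rss_items(xml_text) if a["link"] not in seen_links]
    seen_links.update(a["link"] for a in fresh)
    return fresh
```

Because feeds are fetched every 20 minutes, the same item is likely to be seen many times; the `seen_links` set stands in for whatever duplicate check the database performs.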

2.2 System architecture

Internally, we use a MySQL (footnote 6) database with several relational tables to manage keywords and blog articles. The key tables are (1) “key_index”, an index table of keywords used for retrieving entries; (2) “term_kid”, which records pairs of a date and the number of articles containing the keyword whose id is kid; and (3) “rss_date”, which stores the articles of the day specified by date. Definitions and samples of these tables are given in Section 10.1 of the Appendix.
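To make the role of the three tables concrete, the following sketch creates tables of the same kinds in an in-memory SQLite database. The actual definitions are in the Appendix and are not reproduced here, so every column name below, and the naming of per-keyword and per-day tables as `term_<kid>` and `rss_<YYYYMMDD>`, is a hypothetical choice made for illustration; SQLite stands in for MySQL.

```python
import sqlite3


def create_tables(conn, kid, day):
    """Create the three kinds of tables for keyword id `kid` and day `day`."""
    kid = int(kid)  # table names cannot be bound as SQL parameters, so validate
    if not (len(day) == 8 and day.isdigit()):
        raise ValueError("day must look like YYYYMMDD")
    # Index of keywords (column names are assumptions).
    conn.execute("CREATE TABLE IF NOT EXISTS key_index "
                 "(kid INTEGER PRIMARY KEY, keyword TEXT UNIQUE, date TEXT)")
    # One table per keyword: (date, number of articles containing it).
    conn.execute(f"CREATE TABLE IF NOT EXISTS term_{kid} "
                 "(date TEXT PRIMARY KEY, articles INTEGER)")
    # One table per day holding that day's articles.
    conn.execute(f"CREATE TABLE IF NOT EXISTS rss_{day} "
                 "(id INTEGER PRIMARY KEY, title TEXT, link TEXT, body TEXT)")


def daily_trend(conn, kid):
    """Return (date, article count) pairs for one keyword, oldest first."""
    rows = conn.execute(
        f"SELECT date, articles FROM term_{int(kid)} ORDER BY date")
    return rows.fetchall()
```

Under these assumptions, the daily trend graph of a keyword reduces to a single scan of its `term_<kid>` table, which may be why counts are kept per keyword and per date.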

2.3 Functions

The system provides three functions: (1) article retrieval, (2) finding daily and monthly topics, and (3) collecting and analyzing Chinese Weblog articles.

2.3.1 Article retrieval function

The first is the article retrieval function. One can retrieve blog articles by specifying a set of keywords combined with the logical operators OR and AND. Figure 3 shows a screen image of search results. The figure contains (1) a daily trend graph of articles, (2) pictures that appeared in the articles, (3) a list of articles, (4) keywords extracted from the articles, and (5) relevant news articles.

6 http://www.mysql.com/ (accessed 2005-03-03)

Table 1. Algorithm for finding monthly topics.

- Let $M$ be the set of months. If we want to know the topics during $Q$ months, $M = \{m_1, m_2, \ldots, m_Q\}$.
- Let $W$ be the set of words that appeared during the $Q$ months. If we find $P$ words during the $Q$ months, $W = \{w_1, w_2, \ldots, w_P\}$.
- For each $w_i$ ($1 \le i \le P$) in $W$, repeat the following:
  1. For each $m_j$ ($1 \le j \le Q$) in $M$, let $a_{ij}$ be the number of articles containing $w_i$ in $m_j$.
  2. Calculate $\mathrm{sum}(a_i) = \sum_{j=1}^{Q} a_{ij}$, the maximum value $\max(a_i)$, and the SD/average ratio $\mathrm{sd}(a_i)/\mathrm{avg}(a_i)$.
  3. Print $w_i$ as a topic word of month $m_j$ if $\mathrm{sum}(a_i) \ge \tau_1$, $\max(a_i) \ge \tau_2$, and $\mathrm{sd}(a_i)/\mathrm{avg}(a_i) \ge \tau_3$.

[Figure 2: system diagram. On the left, RSS files from personal Weblog sites (found through ping servers), news sites, and a governmental Web site (through the HTML-to-RSS conversion script, h2r) are collected from the WWW every 20 minutes, and words and articles are stored in the database, which holds an article table (for example, article 3194 “Mass media” and article 4054 “Business English”) and an index table (for example, the terms “Media” and “Business” pointing to articles 3194 and 4054). On the right, a user sends keywords through CGI on the Web server and receives retrieved articles and a graph. An analysis step produces daily and monthly topics; the sample shown in the figure is reproduced below.]

| Month | Term (Japanese) | % |
|---|---|---|
| April | Iraq (イラク) | 1.6 |
| April | Cherry blossom (桜) | 1.5 |
| April | Hostage (人質) | 1.1 |
| April | Release (解放) | 0.7 |
| April | Cherry blossom viewing (花見) | 0.6 |
| May | GW (GW) | 1.5 |
| May | Pension funds (年金) | 0.7 |
| May | Golden Week (ゴールデンウィーク) | 0.5 |
| May | Unpaid (未納) | 0.4 |
| May | Volleyball (バレーボール) | 0.4 |
| June | The rainy season (梅雨) | 0.9 |
| June | Kintetsu (近鉄) | 0.4 |
| June | England (イングランド) | 0.3 |
| June | Sultry (蒸し暑い) | 0.3 |
| June | Portugal (ポルトガル) | 0.3 |

Figure 2. Overview of the KANSHIN.

[Figure 3: screenshot with labeled regions: daily trend of articles, pictures that appeared in articles, and a list of Weblog articles with date, title, keywords, and relevant news.]

Figure 3. Screen image of the KANSHIN system.

2.3.2 Finding daily and monthly topics

The system finds topic-indicating words, called daily topics and monthly topics, by calculating feature values of words. Table 1 outlines the algorithm for finding monthly topics. The algorithm counts the number of articles containing each word and decides whether the word is a monthly topic by applying several thresholds. We use (30, 25, 0.6) for the thresholds (τ1, τ2, τ3). Table 12 and Table 13 in the Appendix show examples of monthly topics in 2004: the top five monthly topics for each month from April through December 2004. ‘%’ indicates the percentage of articles containing the word in each month. The topics found characterize each month, such as “Iraq” in April (see Section 3.1.4), “Olympic games” in August, and “Christmas” in December.
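The thresholding in Table 1 can be sketched as follows. This is a minimal illustration, not the system’s Perl implementation. Two points are assumptions: the standard deviation is taken as the population SD, and a word is reported for the month in which its count peaks, since Table 1 does not state which month a topic word is assigned to. The function name and the sample counts are hypothetical.

```python
from statistics import mean, pstdev


def monthly_topics(counts, tau1=30, tau2=25, tau3=0.6):
    """counts maps each word to a list of article counts, one per month.

    Returns (word, index of peak month) for every word that passes the
    three thresholds of Table 1.
    """
    topics = []
    for word, a in counts.items():
        avg = mean(a)
        if avg == 0:
            continue  # word never appeared; the SD/average ratio is undefined
        if sum(a) >= tau1 and max(a) >= tau2 and pstdev(a) / avg >= tau3:
            topics.append((word, a.index(max(a))))  # assumed: peak month
    return topics


# Hypothetical counts for three months.
example = {
    "Iraq": [120, 40, 10],     # bursty: passes all three thresholds
    "today": [100, 100, 100],  # frequent but flat: SD/average is 0
    "rare": [5, 0, 0],         # bursty but too infrequent
}
```

The third threshold is what separates topic words from merely frequent ones: a word used evenly every month has a small SD/average ratio no matter how often it appears.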

2.3.3 Collecting and analyzing Chinese blog articles

To enable cross-lingual concern analysis, we are implementing a function for collecting and analyzing Chinese blog articles. At present, we collect articles written in both simplified and traditional Chinese. To extract nouns, we use the ICTCLAS POS tagger (footnote 7) [2]. Because ICTCLAS works on simplified Chinese, articles written in traditional Chinese are converted into simplified Chinese before parsing. After extracting nouns, we store them in the database, encoded as simplified Chinese in UTF-8. Users can search with either character set because the system converts characters internally. Figure 4 shows an example graph for “Lunar new year (danian)”.
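The internal character conversion can be pictured with a small sketch. It assumes a one-to-one character mapping applied to both stored nouns and incoming queries; the five-character table below is hand-written for illustration only, whereas a real system would need a complete conversion table. The function names and the index layout are hypothetical, and noun extraction with ICTCLAS is not shown.

```python
# Tiny illustrative traditional-to-simplified mapping (not a complete table).
TRAD_TO_SIMP = str.maketrans({
    "網": "网",
    "誌": "志",
    "國": "国",
    "學": "学",
    "語": "语",
})


def to_simplified(text):
    """Convert the traditional characters covered by the table; keep the rest."""
    return text.translate(TRAD_TO_SIMP)


def search(index, query):
    """index maps a simplified-Chinese noun to a list of article ids."""
    return index.get(to_simplified(query), [])
```

With the index keyed by simplified forms, a query typed in either character set reaches the same entry, for example `search({"网志": [1, 2]}, "網誌")` and `search({"网志": [1, 2]}, "网志")` return the same list.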

3. UNDERSTANDING SOCIAL CONCERNS IN THE BLOGOSPHERE

In this section, we describe (1) patterns of social concerns, and (2) living and obsolete words in the blogosphere.

7 http://mtgroup.ict.ac.cn/~zhp/ICTCLAS.htm (in Chinese; accessed 2005-03-04)

3.1 Patterns of social concerns

We classified the concerns found by the system into several patterns. Figure 5 gives an overview. There are five patterns: (1) periodic, (2) gradual increase, (3) sensitive, (4) trailing, and (5) others. We describe each pattern in the following subsections.

3.1.1 Periodic pattern

The periodic pattern appears when an event that a group of people follows with keen interest occurs periodically. Figure 6 shows an example: a histogram of “Winter Sonata (Fuyuno sonata)”, a popular Korean TV drama broadcast in Japan. The x axis indicates the date and the y axis the number of articles. The drama attracted wide public attention in Japan in 2004, and clear periodic peaks appear in the graph. Because it was broadcast late every Saturday, the peaks fall on Sundays. From this graph, we can infer that most bloggers wrote their articles after watching the drama, that is, on Sunday. Other keywords showing this pattern include “Holiday (Kyujitsu)”, “BBQ”, “Payday (Kyuryobi)”, “Trial examination (mogi shiken)”, “Barber (Tokoya)”, and “The end of week / month / year (Syumatsu / Getsumatsu / Nenmatsu)”.
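One simple way to check for a weekly rhythm of this kind is to total the daily article counts by day of the week. The sketch below is an illustration under that assumption; it is not the procedure used to classify patterns in this paper, and the function names and the synthetic counts are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_profile(daily_counts):
    """daily_counts maps a date to the number of articles on that day."""
    totals = Counter()
    for day, n in daily_counts.items():
        totals[WEEKDAYS[day.weekday()]] += n
    return totals


def peak_weekday(daily_counts):
    """Return the weekday name with the largest total count."""
    return weekday_profile(daily_counts).most_common(1)[0][0]


# Synthetic data: four weeks from Monday 2004-04-12, with Sunday peaks.
start = date(2004, 4, 12)
synthetic = {}
for offset in range(28):
    day = start + timedelta(days=offset)
    synthetic[day] = 40 if day.weekday() == 6 else 5
```

For the synthetic series, `peak_weekday(synthetic)` returns `"Sunday"`, the day after the assumed Saturday broadcast.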

Figure 4. Histogram of “Lunar new year (Danian)” by using Chinese articles.

[Figure 5: five schematic curves labeled I. Periodic, II. Gradual increase, III. Sensitive, IV. Trailing, and V. Other.]

Figure 5. Patterns of social concerns.

[Figure 6: histogram in which Saturday broadcasts are marked “On air (Sat)” and the peaks on the following Sundays are marked, including Apr 18, Apr 25, and May 2.]

Figure 6. Periodic pattern (“Winter Sonata (Fuyuno sonata)”).

3.1.2 Gradual increase pattern

In this pattern, a peak builds up gradually as an event approaches. The pattern is also called the BoingBoing effect [3] or a sleeper hit [4]. It appears when people know about an event beforehand and have great interest in or concern about it. Figure 7 shows a histogram of “GW (Golden Week)”, a series of Japanese holidays. As the graph shows, bloggers show strong interest in these holidays well before they begin. After the holidays, mentions of the word gradually decrease. Other examples of this pattern include “Election”, “Typhoon”, “Summer”, and “Olympic games” (see Figure 8). Seasonal festivals such as “Xmas” and “new year” can also belong to this pattern.

3.1.3 Sensitive pattern

The sensitive pattern, also called the Slashdot effect [5], appears when a serious matter that may have a heavy impact on society occurs. Figure 9 shows an example: a histogram of “Winny”, a controversial peer-to-peer (P2P) program that some users used to exchange copyrighted movie and music files illegally (footnote 8). The sharp peak on May 10, 2004, the day the programmer of Winny was arrested, shows that bloggers were greatly concerned about this report. Another example is “Tsunami” (see Figure 10). Several destructive tsunamis caused severe damage in many Indian Ocean countries on December 26, 2004, and the figure shows that people were greatly concerned about this disaster. The figure covers November 12, 2004 through March 3, 2005. It shows two peaks: the first, on December 26, reflects concern about the damage caused by the disaster, and the second, on January 5, reflects concern about support for the disaster areas, represented by Michael Schumacher’s donation.

3.1.4 Trailing pattern

In the trailing pattern, people’s concerns persist after one or several events occur. Figure 11 shows an example: a histogram of “Iraq”. The graph covers April 1 to May 26, 2004, a period in which three Japanese were kidnapped in Iraq. This incident was a very hot topic in Japan, and TV and newspapers reported it repeatedly.

8 See http://en.wikipedia.org/wiki/Winny (accessed 2005-03-03)

Figure 7. Gradual increase pattern (“GW”).

[Figure 8: four histograms. Hits: “Election” 8,496; “Typhoon (Tropical cyclone)” 26,073; “Summer” 87,970; “Olympic games” 31,288.]

Figure 8. Examples of gradual increase pattern.

[Figure 9: histogram of “Winny” (2,476 hits), annotated “The programmer of Winny was arrested (May 10)”.]

Figure 9. Sensitive pattern (“Winny”).

[Figure 10: histogram of “Tsunami” (4,467 hits) and “Donation” (1,376 hits), annotated “Damage of the Indian Ocean tsunami was reported (Dec 26-)” and “Michael Schumacher’s donation was reported (Jan 5)”.]

Figure 10. Histogram of “Tsunami” and “Donation”.

As in the mass media, various opinions on this issue appeared on Weblog sites. Figure 11 indicates that bloggers were greatly concerned about the incident. We can trace the change of focus on “Iraq” by finding words that co-occur with “Iraq”. Figure 12 shows this change. As the graph shows, the focus on Iraq gradually shifted from “Hostage (Hitojichi)” to “Release (Kaiho)” and then to “Abuse (Gyakutai)”.
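Finding co-occurring words can be sketched as counting, for each other word, the number of articles in which it appears together with the keyword. The code below is a minimal illustration under that assumption; the function name, the tokenized sample articles, and the choice to count each word once per article are hypothetical and not taken from the system.

```python
from collections import Counter


def cooccurring_words(articles, keyword, top_n=3):
    """articles is a list of word lists, one list per article.

    Returns the top_n words ranked by the number of articles in which
    they appear together with `keyword`.
    """
    counts = Counter()
    for words in articles:
        unique = set(words)  # count each word at most once per article
        if keyword in unique:
            counts.update(unique - {keyword})
    return counts.most_common(top_n)


# Hypothetical tokenized articles.
sample = [
    ["Iraq", "hostage"],
    ["Iraq", "hostage", "release"],
    ["Iraq", "abuse"],
    ["cherry", "blossom"],
]
```

Running the same count separately on the articles of successive periods (for example, week by week) would show which companion word dominates in each period, which is the kind of shift Figure 12 displays.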

3.1.5 Others

The others pattern covers the cases that do not fit the preceding patterns. Figure 13 shows examples: histograms of “Accident (Jiko)” and “Water famine (Mizubusoku)”. This pattern appears when (1) a word is abstract, such as “Accident” or “Security”, or (2) a word is not a hot topic at the moment. Note that words of the latter type show the sensitive pattern once they receive attention from mass media or influential Web sites. For example, “Nuclear power plant (Genpatsu)” (see Figure 14) belonged to the others pattern before August 9, 2004, the day a severe accident occurred at the Mihama nuclear power plant. Once TV shows and newspapers reported the accident, people reacted to it sensitively, and the pattern changed to the sensitive pattern. After several weeks, few bloggers mentioned the accident, and the pattern returned to the original others pattern.

3.2 Living words and obsolete words in the blogosphere

As Figure 14 suggests, bloggers seem to forget a topic soon after the event happens. How long do bloggers remember a topic? To answer this question, we counted, for each day, the number of words whose last registration in the database fell on that day. Figure 15 shows the result (footnote 9). The total number of words that appeared in this period is 65,944. About half of the words (53.7% = 100% - 46.3%) were last registered on the final day. These can be called living words because bloggers use them in recent articles. Living words consist of (1) general words such as “today” or “I”, and (2) words that indicate current topics. On the other hand, words last registered before April 6 can be called obsolete words because bloggers are beginning to stop using them.
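The counting described above can be sketched as follows, assuming that “registered finally” means the most recent date on which a word was registered. The function names, the `cutoff` parameter, and the sample dates are hypothetical; only the rule itself (living if last registered on the final day, obsolete if last registered before a cutoff date) follows the text.

```python
from collections import Counter
from datetime import date


def last_registration_histogram(last_seen):
    """last_seen maps a word to the date of its most recent registration.

    Returns the number of words per last-registration date.
    """
    return Counter(last_seen.values())


def classify_words(last_seen, final_day, cutoff):
    """Split words into living (last seen on final_day) and obsolete
    (last seen before cutoff); words in between are left unclassified."""
    living = {w for w, d in last_seen.items() if d == final_day}
    obsolete = {w for w, d in last_seen.items() if d < cutoff}
    return living, obsolete


# Hypothetical last-registration dates.
sample_last_seen = {
    "today": date(2005, 4, 7),
    "I": date(2005, 4, 7),
    "cherry blossom": date(2005, 4, 6),
    "Berlusconi": date(2004, 11, 15),
}
```

With `final_day=date(2005, 4, 7)` and `cutoff=date(2005, 4, 6)`, the sample yields two living words and one obsolete word, and the share of living words is the histogram value for the final day divided by the total number of words.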

9 We extracted words whose total frequency is greater than 100.

[Figure 11: histogram of “Iraq” (10,127 hits), annotated “Three Japanese were kidnapped in Iraq (Apr 8)”, “Japanese hostages were released (Apr 15)”, and “Report on Iraqi prisoner abuse (Apr 30)”.]

Figure 11. Trailing pattern (“Iraq”).

[Figure 12: histograms in which co-occurring keywords are added step by step. Hits: “Iraq” 9,382; “Hostage” 5,516; “Release” 3,627; “Abuse” 1,243.]

Figure 12. Change of focuses on “Iraq”.

[Figure 13: histograms of “Accident” (6,926 hits) and “Water famine” (18 hits).]

Figure 13. Other pattern (“Accident” and “Water famine”).

[Figure 14: histogram of “Nuclear power plant” (869 hits), annotated “An accident in the Mihama nuclear power plant is reported (on August 9, 2004)”, with the regions before and after the accident labeled as the other pattern and the peak labeled as the sensitive pattern.]

Figure 14. Histogram of “Nuclear power plant (Genpatsu)”.

[Figure 15: chart with dates from 18-Mar to 7-Apr on the x axis, the number of words (0 to 40,000) on the left y axis, and the percentage (0 to 100%) on the right y axis. Series: number of words registered on the day, and percentage.]

Figure 15. Number of words registered finally to the database (counted on April 7, 2005, 23:00:00).

Which words are obsolete? We can find them by using (1) the timestamp of a word registered in the database within the last N days (from date1 through date2), and (2) the mean m and standard deviation s of the word’s appearances over those N days. For the former criterion, we execute the following SQL query on the “key_index” table (see Table 6 in the Appendix).

SELECT * FROM key_index WHERE date >= 'date1' AND date <= 'date2';

Then we calculate m and s for each word found by the above query. With N = 30, we found the following obsolete words (footnote 10). For example, “Häkkinen” (m = 0.43, s = 1.05) and “Berlusconi” (m = 0.07, s = 0.25) have not been mentioned since November 15, 2004, and “Armitage” (m = 0.73, s = 2.22) and “El Nino” (m = 0.07, s = 0.25) have not been mentioned since November 22, 2004. Table 14 in the Appendix shows examples of obsolete words.
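As a worked illustration of the second criterion, the sketch below computes m and s for a word from its daily article counts. It assumes that m and s are the mean and the population standard deviation of the daily counts over the N days; the paper’s exact definition may differ. The function name and the sample series are hypothetical.

```python
from statistics import mean, pstdev


def daily_stats(daily_counts):
    """Return (m, s): mean and population SD of a word's daily counts."""
    return mean(daily_counts), pstdev(daily_counts)


# Hypothetical word that appears once on each of 2 days out of N = 30.
counts = [0] * 30
counts[3] = 1
counts[17] = 1
m, s = daily_stats(counts)
```

Under this assumption the example gives m ≈ 0.07 and s ≈ 0.25, which is consistent with the values reported for “Berlusconi” and “El Nino”: such words were mentioned on only a couple of days within the window.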

4. UNDERSTANDING PERSONAL CONCERNS

In this section, we analyze the concerns of two persons and compare them.

4.1 Overview

Understanding the concerns of an individual is also important for understanding social problems because it shows how key persons think about the problems. In this section, we analyze two blog sites: one by Jun’ichiro Koizumi, the prime minister of Japan, and one by Ryuichi Sakamoto, a famous Japanese composer and musician.

4.2 Dataset

We collected 184 articles by Mr. Koizumi (footnote 11) from May 29, 2001 through April 7, 2005, and 54 articles by Mr. Sakamoto (footnote 12) from August 31, 2003 through September 26, 2004. Both posted articles roughly once a week. Table 2 summarizes the dataset. The mean number of words per article is 249.5 with a standard deviation (SD) of 252.8 for Mr. Koizumi, and 64.4 with an SD of 51.2 for Mr. Sakamoto.

10 Note that this result was obtained on December 8, 2004.

11 http://www.kantei.go.jp/jp/m-magazine/backnumber/ (in Japanese; accessed 2005-03-03)

12 From Senken Nikki - Insight Diaries (http://diary.nttdata.co.jp/ (in Japanese; accessed 2005-04-07)). (c) Ryuichi Sakamoto, NTT DATA Corporation.

4.3 Approach to find personal concerns

We assume that a person who has a lasting concern uses words indicating that concern (concerned words) in many articles. We therefore use the following formula to calculate the feature value of a person’s concerned words.

$$c_i = \alpha_i \times \beta_i$$

In this formula, $c_i$ is the feature value of the person’s concern about the i-th word $w_i$ that appears in his/her articles. $\alpha_i$ is the ratio of articles containing $w_i$ to the total number of articles. $\beta_i$ is the average ratio of $w_i$ to the total number of words in each article. Definitions of $\alpha_i$ and $\beta_i$ are given in Appendix 10.2.
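The feature value can be computed as in the following sketch. The formal definitions are in Appendix 10.2 and are not reproduced here, so one detail is an assumption: the average in β is taken over all articles, including those that do not contain the word. The function name and the sample articles are hypothetical.

```python
def concern_scores(articles):
    """articles is a list of word lists, one list per article.

    Returns a dict mapping each word w to c = alpha * beta, where
    alpha = (articles containing w) / (all articles) and
    beta  = average over all articles of (count of w) / (words in article).
    """
    n = len(articles)
    vocabulary = {w for words in articles for w in words}
    scores = {}
    for w in vocabulary:
        alpha = sum(1 for words in articles if w in words) / n
        # Assumed: articles without w contribute a ratio of 0 to the average.
        beta = sum(words.count(w) / len(words) for words in articles if words) / n
        scores[w] = alpha * beta
    return scores


# Hypothetical tokenized articles.
sample_articles = [
    ["reform", "reform", "japan", "economy"],
    ["reform", "iraq"],
    ["japan", "world"],
]
scores = concern_scores(sample_articles)
```

For the sample, “reform” appears in 2 of 3 articles (α = 2/3) with per-article ratios 2/4, 1/2, and 0 (β = 1/3), giving c = 2/9, the highest score. Multiplying the two factors rewards words that are both spread across many articles and prominent within them.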

4.4 Comparing concerns between persons

In this section, we first analyze the concerns of Mr. Koizumi and Mr. Sakamoto separately, and then compare them.

4.4.1 Concerns of Mr. Koizumi

Table 3 shows the top 20 concerned words of Mr. Koizumi, sorted by c value. Among the extracted words, “Reform” characterizes him very well because he constantly refers to the structural reform of Japan. Figure 17 shows a histogram of “Reform”; he mentions the word throughout the period. Table 3 also contains several words characteristic of him and his cabinet. For example, “Iraq” indicates a recent concern because reconstruction assistance for Iraq is one of the important issues for the Koizumi cabinet. “Cooperation (Kyoryoku)” and “Assistance (Shien)” also indicate his concerns because several severe disasters in 2004 required intensive assistance from the Japanese government to the disaster areas, such as the Niigata-Chuetsu earthquake in October 2004 and the Indian Ocean earthquake in December 2004. Figure 18 shows how Mr. Koizumi’s concerns about social problems such as SARS and BSE changed over time. The focus of his concerns shifted as new problems emerged.

4.4.2 Concerns of Mr. Sakamoto

Table 4 shows the top 20 concerned words of Mr. Sakamoto, sorted by c value. Among the extracted words, “Music (Ongaku)” and “Sound (Oto)” characterize him well. Figure 16 shows a histogram of “Music” and “Sound”. Unlike Mr. Koizumi with “Reform” (see Figure 17), he does not use the word “music” constantly, but he writes at length when the topic is related to music; for example, he wrote a long article on the restriction of imported music CDs on May 9, 2004. “Japan”, “America”, and “Bush” also indicate his concerns because he sometimes mentioned Iraq issues such as the dispatch of Japanese troops to Iraq in February 2004 and the Japanese hostage crisis in April 2004.

4.4.3 Comparing concerns between Mr. Koizumi and Mr. Sakamoto

Mr. Koizumi and Mr. Sakamoto shared a concern about Iraq: both wrote articles on Iraq when problems occurred there, such as the Japanese hostage crisis in April 2004. Both also mentioned relations between Japan and other countries. For example, “World”, “Relation”, and “Head of a state” are mentioned by Mr. Koizumi (see Table 3), and “America” and “Bush” are mentioned by Mr. Sakamoto (see Table 4). Although this is a preliminary result, we can see common and differing concerns between the two by tracking the words that appear in their personal blog articles.

Table 2. Summary of the dataset.

| Person | # of articles | Days | Mean (words per article) | SD |
|---|---|---|---|---|
| Mr. Koizumi | 184 | 1,409 | 249.5 | 252.8 |
| Mr. Sakamoto | 54 | 392 | 64.4 | 51.2 |

[Figure 16: chart with dates from 31-Aug-03 to 26-Sep-04 on the x axis and word frequency (0 to 25) on the y axis. Series: Music, Sound.]

Figure 16. Histogram of words “Music (Ongaku)” and “Sound (Oto)” mentioned by Mr. Sakamoto.

Table 3. Concerned words of Mr. Koizumi.

| No. | Term (Japanese) | c |
|---|---|---|
| 1 | Japan (Nihon) | 0.01112 |
| 2 | Reform (Kaikaku) | 0.00870 |
| 3 | Koizumi | 0.00725 |
| 4 | I (Watashi) | 0.00538 |
| 5 | Jun’ichiro | 0.00505 |
| 6 | People (Hito) | 0.00464 |
| 7 | Issue (Mondai) | 0.00312 |
| 8 | Everyone (Minasan) | 0.00235 |
| 9 | Nation (Kuni) | 0.00205 |
| 10 | Economy (Keizai) | 0.00198 |
| 11 | World (Sekai) | 0.00185 |
| 12 | Citizen (Kokumin) | 0.00137 |
| 13 | Iraq | 0.00137 |
| 14 | Cooperation (Kyoryoku) | 0.00134 |
| 15 | Relation (Kankei) | 0.00129 |
| 16 | Everyone (Katagata) | 0.00129 |
| 17 | Head of a state (Shuno) | 0.00116 |
| 18 | Assistance (Shien) | 0.00106 |
| 19 | Society (Shakai) | 0.00102 |
| 20 | Efforts (Doryoku) | 0.00100 |

Table 4. Concerned words of Mr. Sakamoto.

No. Term (Japanese) c

1 I (Boku) 0.00386

2 People (Hito) 0.00271

3 Human (Ningen) 0.00228

4 Music (Ongaku) 0.00193

5 Sound (Oto) 0.00135

6 Japan (Nihon) 0.00134

7 Myself (Jibun) 0.00059

8 Now (Ima) 0.00035

9 America 0.00032

10 Time (Jikan) 0.00030

11 He (Kare) 0.00030

12 Wonder (Fushigi) 0.00030

13 Bush 0.00028

14 Children (Kodomo) 0.00028

15 Cinema (Eiga) 0.00026

16 Brain (No) 0.00024

17 Memory (Kioku) 0.00023

18 Words (Kotoba) 0.00022

19 Relation (Kankei) 0.00020

20 Administration (Seiken) 0.00020

Figure 18. Change of concerns of Mr. Koizumi on social problems (SARS, BSE, Bird flu (Influenza), Earthquake, Typhoon, Tsunami). Annotated events: SARS (April - May, 2003); BSE and bird flu (December, 2003 - April, 2004); typhoon disasters (June - September, 2004); the Niigata-Chuetsu earthquake (October, 2004); the 2004 Indian Ocean earthquake and tsunami (December, 2004).

Figure 17. Histogram of a word “Reform (Kaikaku)” mentioned by Mr. Koizumi. (x axis: date, 29-May-01 to 28-Feb-05; y axis: word frequency, 0 to 25.)


5. UNDERSTANDING RELATIONS BETWEEN SOCIAL CONCERNS AND REAL WORLD TEMPORAL DATA

In this section, we describe relations between words in the blogosphere and the real world. We compare words in blog articles with (1) temperature data and (2) newspaper articles.

5.1 Relations between blog and temperature

Glance et al. reported a correlation between words in blog articles and Amazon’s sales rank [4], so we can guess that the words appearing in blogs are affected by various real world events or phenomena. In this section, we examine relations between natural phenomena and words. Among natural phenomena such as temperature, weather, earthquakes, and environmental indices, we chose the average temperature in Otemachi, Tokyo (footnote 13). The data cover the period from 2004/03/18 to 2005/01/23.

13 http://www.data.kishou.go.jp/index.htm (in Japanese; 2005-03-03)

We calculated the correlation coefficient between temperature and the frequency of each word, and found 66 words with a positive correlation (correlation coefficient >= 0.6) and 25 words with a negative correlation (correlation coefficient <= -0.6). Figure 19 shows typical words correlated with temperature. The x axis is the date, and the y axis is the normalized percentage of articles. In this figure, “Sweat (Ase)” and “Bug (Mushi)” are positively correlated words, while “Warm (Atatakai)” and “Cold (Samui)” are negatively correlated words. Table 5 shows examples of words correlated with temperature. Some of these words were within the scope of our prediction (e.g., “Suntan”, “Beer”, and “Watermelon” were easy to guess), but some were not (e.g., “Bug”, “Mosquito”, and “Cockroach”). It is reasonable that insects emerge or disappear according to the temperature, but we could not anticipate words related to insects beforehand. This result indicates that the words used in blog articles are affected by temperature.
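The selection step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the system: it assumes that the correlation coefficient is Pearson's r (the text does not name the coefficient), and the function names and the toy series are hypothetical. Only the threshold of 0.6 is taken from the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation information
    return sxy / sqrt(sxx * syy)


def words_correlated_with(temperature, word_series, threshold=0.6):
    """Split words into positively and negatively correlated groups.

    temperature: one average temperature per day.
    word_series: dict mapping a word to its per-day share of articles.
    """
    positive, negative = {}, {}
    for word, series in word_series.items():
        r = pearson(temperature, series)
        if r >= threshold:
            positive[word] = r
        elif r <= -threshold:
            negative[word] = r
    return positive, negative


if __name__ == "__main__":
    # Made-up numbers for illustration only; they are not the paper's data.
    temperature = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
    word_series = {
        "sweat": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "cold": [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
        "lunch": [0.3, 0.1, 0.3, 0.1, 0.3, 0.1],
    }
    positive, negative = words_correlated_with(temperature, word_series)
    print("positive:", sorted(positive))
    print("negative:", sorted(negative))
```

Words whose coefficient lies strictly between -0.6 and 0.6 fall into neither group, which mirrors the way the 66 positive and 25 negative words are separated from the rest of the vocabulary.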

5.2 Relations between blog and mass media

We then compared the number of blog articles with the amount of media reporting. We compared (1) the number of characters in news articles reporting on the Winny case (see 3.1.3), and (2) the number of blog articles on the same event. We used articles of the Asahi Shimbun newspaper (footnote 14) containing the word “Winny”. Figure 20 shows the relation between blog and news from May 1 to June 26, 2004. The total number of characters in news articles is 24,450, and the total number of blog articles is 2,419. The correlation coefficient between them is 0.75, which is a positive correlation. Although we cannot conclude from this result that there is a strong relation between blog articles and media reporting, we can guess that the amount of media reporting affects the number of blog articles. Analyzing relations between blog articles and media reporting is our future work.

6. DISCUSSION

In this section, we discuss (1) the bias of concerns of bloggers, and (2) the limitation of current syndication format.

14 http://dnab.asahi.com/ (in Japanese; accessed 2005-04-07).

Table 5. Example of words correlated with temperature.

Positive correlation                              Negative correlation
Word (Japanese)                    Coefficient    Word (Japanese)                     Coefficient
Bug (“Mushi”)                      0.82           Warm (“Atatakai”)                   -0.77
Mosquito (“Ka”)                    0.79           Slope (“Gerende”)                   -0.65
Summer (“Natsu”)                   0.76           Heater                              -0.64
Suntan (“Hiyake”)                  0.71           Comfortably warm (“NukuNuku”)       -0.64
Cockroach (“Gokiburi”)             0.68           Cold protection (“Bokan”)           -0.63
Air-conditioner (“cooler”)         0.66           Snowboard (“Sunobo”)                -0.62
Summer weariness (“Natsu bate”)    0.65           Glove (“Tebukuro”)                  -0.62
Stifling (“Atsu kurushii”)         0.65           Cold (“Kaze”)                       -0.61
Watermelon (“Suika”)               0.64           Snowcapped mountain (“Yukiyama”)    -0.61
Beer                               0.61           Sweater                             -0.60

Figure 19. Relation between words and temperature. (x axis: date, 18-Mar to 12-Jan; left y axis: percentage of articles, normalized, 0 to 1; right y axis: average temperature in centigrade, 0 to 35; series: Warm, Cold, Sweat, Bug, Average Temperature.)

Figure 20. Relation between blog and news articles (on the Winny case). (x axis: date, 04/5/1 to 04/6/24; left y axis: percentage of blog articles (%), 0.0 to 3.0; right y axis: number of characters in news articles, 0 to 7000.)


6.1 Bias of concerns of bloggers

It is often said that the concerns of bloggers might be biased. According to surveys on WWW users (footnote 15), WWW users do not properly reflect the real world with respect to age, gender, occupation, and so on. Furthermore, according to Miura et al. [6], 52.71% of bloggers are 20-30 years old, 29.68% are 30-40 years old, and 68.56% of bloggers are male. Thus, blog articles may not be adequate resources for conducting a rigorous social investigation. On the other hand, understanding social concerns from blog articles has another merit: speed. We aim to create a system with which a researcher who is concerned with social problems can find current hot topics instantly. This is an important feature of the proposed system.

6.2 Limitation of current syndication feeds

Because several blog sites provide RSS feeds that contain only part of an article, analysis results based on such feeds might not reflect concerns precisely. Compared with BlogPulse [4] and blogWatcher [7], which collect entire articles in HTML format, our system might not capture the concerns of bloggers correctly. Although our system relies on such RSS feeds for analyzing social concerns and daily/monthly topics (footnote 16), they are sufficient for understanding the current concerns of people roughly. When we started to create the system, we considered a rough understanding of concerns to be sufficient, because an overview of concerns can be obtained by collecting plenty of articles. As a result, we consider that we were able to find rough concerns from RSS feeds. We are now collecting both feeds and HTML files to compare the precision of the extracted concerns. Evaluating concerns objectively is our future work. Meanwhile, from a macroscopic viewpoint, blog tools need to be designed through an exchange of opinions between developers and researchers who want to analyze contents effectively. Feedback from researchers to developers might be important.

7. CONCLUSION AND FUTURE WORK

In this paper, we described a system for collecting and analyzing Weblog articles to understand the concerns of people. By using the system, users can find social and personal concerns, and concerns in several languages. As a first step towards cross-lingual concern analysis, we implemented Japanese and Chinese blog analysis functions in the system. From the analysis results, we found (1) several patterns of social concerns, (2) concerns of a person, and (3) relations between blog and real world temporal data such as temperature and news articles. Our future work includes (1) investigating methods to extract patterns of social concerns statistically, (2) developing a cross-language concern analysis system, and (3) studying the effects of real world temporal data on the concerns of bloggers.

8. ACKNOWLEDGEMENT

The authors thank Ryuichi Sakamoto and NTT DATA for their cooperation in analyzing Mr. Sakamoto’s diary.

15 http://www.gvu.gatech.edu/gvu/user_surveys/ (accessed 2005-04-07)

16 Note that we analyzed personal concerns by using entire articles extracted by hand.

9. REFERENCES

[1] The Unicode Consortium. The Unicode Standard, Version 4.0.0, defined by: The Unicode Standard, Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1). (available from http://www.unicode.org/versions/Unicode4.0.0/, accessed 2005-04-07).

[2] Zhang, H.P., Yu, H.K., Xiong, D.Y., and Liu, Q. HHMM-based Chinese lexical analyzer ICTCLAS, In Proceedings of Second SIGHAN Workshop on Chinese Language Processing, pages 184–187, 2003. (available from http://acl.ldc.upenn.edu/W/W03/W03-1730.pdf, accessed 2005-03-04).

[3] Adar, E., and Zhang, L. Implicit structure and the dynamics of blogspace. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004. (available from http://www.blogpulse.com/papers/Adar_blogworkshop2_ppt.pdf, accessed 2005-03-03).

[4] Glance, N., Hurst, M., and Tomokiyo, T. BlogPulse: Automated trend discovery for Weblogs. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004. (available from http://www.blogpulse.com/www2004-workshop.html, accessed 2005-03-03).

[5] Adler, S. The Slashdot effect. (online), 1999. (available from http://ssadler.phy.bnl.gov/adler/SDE/SlashDotEffect.html, accessed 2004-12-10).

[6] Miura, A., and Yamashita, K. Why do people publish Weblogs?: An online survey of Weblog authors in Japan. In Human Perspectives in the Internet Society: Culture, Psychology and Gender, pages 43–50. WIT Press, 2004.

[7] Nanno, T., Suzuki, Y., Fujiki, T., and Okumura, M. Automatic collection and monitoring of Japanese Weblogs. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004. (available from http://www.blogpulse.com/www2004-workshop.html, accessed 2005-03-03).

10. APPENDIX

In this section, we describe (1) definitions and examples of tables, (2) definitions of parameters for calculating the feature value, and (3) reference materials mentioned in the body of this paper.


10.1 Definition and example of tables

We describe the definitions and samples of the tables used in the system. The tables listed in this section are dumps from MySQL version 3.23.53-Max.

10.1.1 Table “key_index”

The definition and a sample of the “key_index” table are shown in Table 6 and Table 7. The table has the following columns: (1) kid, (2) term, (3) freq, (4) date, and (5) time. kid is the ID number of a word in the database; this column is associated with the “term_kid” table (see Table 8). term is the label of the word. freq is the total count of the word. date and time are the date and time when the word was last registered.

10.1.2 Table “term_kid”

Table 8 and Table 9 show the definition and a sample of “term_kid”. The table has the following columns: (1) _id, (2) num, (3) date, and (4) time. _id is the ID number of an article that contains the term kid. We can find this article by using the “rss_date” table. num is the frequency count stored for that article. date and time are the date and time when the word appeared.

10.1.3 Table “rss_date”

The definition and a sample of the “rss_date” table are shown in Table 10 and Table 11. The table has the following columns: (1) _id, (2) url, (3) title, (4) description, and (5) time. _id is the local ID number of an article in this table. Articles are identified by _id and date in our system. url is the URL of the article. title is the title of the article. description is the body of the article. time is the time when the article was registered.
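The way the three tables refer to each other can be sketched as a lookup from a term to the articles of one day. This is a hedged illustration rather than the system's code: it uses SQLite in memory instead of MySQL, simplifies the column types, and assumes from the sample names “term_3” and “rss_20040629” that there is one “term_kid” table per word ID and one “rss_date” table per day. The helper names are hypothetical. The url and title values are placeholders because the sample in Table 11 is truncated, and only the “Night” row of Table 7 is loaded.

```python
import sqlite3
from datetime import datetime


def build_sample_db():
    """Create an in-memory database with rows taken from Tables 7, 9, and 11."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE key_index (kid INTEGER PRIMARY KEY, term TEXT,
                                freq INTEGER, date TEXT, time TEXT);
        CREATE TABLE term_3 (_id INTEGER, num INTEGER, date TEXT, time TEXT);
        CREATE TABLE rss_20040629 (_id INTEGER PRIMARY KEY, url TEXT,
                                   title TEXT, description TEXT, time TEXT);
        """
    )
    conn.execute(
        "INSERT INTO key_index VALUES (?, ?, ?, ?, ?)",
        (3, "Night", 787, "2004-11-27", "00:39:21"),
    )
    conn.executemany(
        "INSERT INTO term_3 VALUES (?, ?, ?, ?)",
        [
            (16, 1, "2004-06-29", "14:06:49"),
            (32, 2, "2004-06-29", "16:06:08"),
            (68, 1, "2004-06-29", "23:24:03"),
            (70, 2, "2004-06-29", "11:32:10"),
            (312, 1, "2004-07-01", "18:56:37"),
        ],
    )
    # Placeholder url/title values: the real ones are truncated in Table 11.
    conn.executemany(
        "INSERT INTO rss_20040629 VALUES (?, ?, ?, ?, ?)",
        [
            (16, "url-16", "title-16", "", "14:06:49"),
            (32, "url-32", "title-32", "", "16:06:08"),
            (68, "url-68", "title-68", "", "23:24:03"),
        ],
    )
    return conn


def articles_for_term(conn, term, date):
    """Return (_id, title, num) for articles of `date` that contain `term`."""
    day = datetime.strptime(date, "%Y-%m-%d")  # rejects malformed dates
    row = conn.execute(
        "SELECT kid FROM key_index WHERE term = ?", (term,)
    ).fetchone()
    if row is None:
        return []
    # Table names cannot be bound as parameters, so they are built from an
    # integer ID and a validated date only.
    term_table = "term_%d" % row[0]
    rss_table = "rss_" + day.strftime("%Y%m%d")
    query = (
        "SELECT r._id, r.title, t.num FROM {t} AS t "
        "JOIN {r} AS r ON r._id = t._id WHERE t.date = ? ORDER BY r._id"
    ).format(t=term_table, r=rss_table)
    return conn.execute(query, (date,)).fetchall()


if __name__ == "__main__":
    connection = build_sample_db()
    for article in articles_for_term(connection, "Night", "2004-06-29"):
        print(article)
```

In this sketch the pair (_id, date) identifies an article, which follows the statement that articles are identified by _id and date. Article 70 appears in the “term_3” sample but not in the truncated “rss_20040629” sample, so the join does not return it.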

10.2 Feature value calculation

The definitions of $\alpha_i$ and $\beta_i$ are as follows.

\[
\alpha_i = \frac{df_i}{N}, \qquad \beta_i = \frac{1}{N} \sum_{j=1}^{N} \frac{tf_{ij}}{W_j}
\]

In the above formulas, $N$ is the total number of articles, $df_i$ is the number of articles containing the i-th word $w_i$ (i.e., the document frequency of $w_i$), $W_j$ is the number of words appearing in the j-th article $a_j$, and $tf_{ij}$ is the frequency of $w_i$ in $a_j$.
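As a worked illustration of these two parameters, the following sketch computes them for a toy corpus. It assumes that alpha_i = df_i / N and that beta_i is the average over all N articles of tf_ij / W_j, which is our reading of the definitions in this section. The function name and the sample articles are hypothetical, and tokenization is taken as given.

```python
from collections import Counter


def feature_parameters(articles):
    """Compute alpha_i and beta_i for every word in a list of tokenized articles.

    articles: list of articles, each a list of words.
    Returns two dicts keyed by word: alpha (df_i / N) and
    beta ((1 / N) * sum over articles of tf_ij / W_j).
    """
    n = len(articles)
    if n == 0:
        raise ValueError("at least one article is required")
    document_frequency = Counter()
    relative_frequency_sum = Counter()
    for words in articles:
        if not words:
            continue  # an empty article contributes nothing to either sum
        article_length = len(words)  # W_j
        for word, tf in Counter(words).items():
            document_frequency[word] += 1
            relative_frequency_sum[word] += tf / article_length
    alpha = {w: document_frequency[w] / n for w in document_frequency}
    beta = {w: relative_frequency_sum[w] / n for w in document_frequency}
    return alpha, beta


if __name__ == "__main__":
    # Made-up articles for illustration only.
    sample = [
        ["music", "sound", "music"],
        ["music", "japan"],
        ["japan", "iraq", "japan", "iraq"],
    ]
    alpha, beta = feature_parameters(sample)
    for word in sorted(alpha):
        print(word, round(alpha[word], 4), round(beta[word], 4))
```

For the word "music" in this sample, df = 2 and N = 3, so alpha = 2/3, and beta = (2/3 + 1/2 + 0) / 3 = 7/18.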

Table 6. Definition of “key_index”.

    CREATE TABLE key_index (
      kid bigint(20) unsigned NOT NULL auto_increment,
      term varchar(80) NOT NULL,
      freq bigint(20) unsigned NOT NULL default '1',
      date date NOT NULL,
      time time NOT NULL,
      PRIMARY KEY (kid),
      KEY term_index (term),
      KEY date_index (date)
    ) TYPE=MyISAM;

Table 7. Example of “key_index”.

kid   term      freq   date         time
1     Sunset    5984   2004-12-10   18:20:00
2     Hill      4790   2004-12-10   19:53:19
3     Night     787    2004-11-27   00:39:21
4     Year      8210   2004-12-10   20:38:58
5     Parents   1518   2004-12-10   19:46:11

Table 8. Definition of “term_kid”.

    CREATE TABLE term_kid (
      _id int(10) unsigned NOT NULL,
      num int(11) NOT NULL,
      date date NOT NULL,
      time time NOT NULL,
      KEY date_index (date),
      KEY rid_index (_id)
    ) TYPE=MyISAM;

Table 9. Example of “term_kid”: “term_3”.

date         _id   num   time
2004-06-29   16    1     14:06:49
2004-06-29   32    2     16:06:08
2004-06-29   68    1     23:24:03
2004-06-29   70    2     11:32:10
2004-07-01   312   1     18:56:37

Table 10. Definition of “rss_date”.

    CREATE TABLE rss_date (
      _id int(10) unsigned NOT NULL auto_increment,
      url varchar(255) NOT NULL,
      title text,
      description text,
      time time NOT NULL,
      PRIMARY KEY (_id),
      KEY title_index (title(80)),
      KEY url_index (url)
    ) TYPE=MyISAM;

Table 11. Example of “rss_date”: “rss_20040629”.

_id url title description time
16 blog.l Hakodate Shinkansen Haya 14:06:49
32 www.do Lunch After lunch, I … 16:06:08
68 www.co anne short I was looking … 23:24:03


10.3 Reference materials

10.3.1 Monthly topics

Table 12. Monthly topics in 2004 (from April through August)

Month     Term (Japanese)                   %
April     Iraq                              1.6
April     Cherry blossom (Sakura)           1.5
April     Hostage (Hitojichi)               1.1
April     Release (Kaiho)                   0.7
April     Cherry blossom viewing (Hanami)   0.6
May       GW (GW)                           1.5
May       Pension funds (Nenkin)            0.7
May       Golden Week                       0.5
May       Unpaid (Mino)                     0.4
May       Volleyball                        0.4
June      The rainy season (Tsuyu)          0.9
June      Kintetsu                          0.4
June      England                           0.3
June      Sultry (Mushi-atsui)              0.3
June      Portuguese                        0.3
July      Hot (Atsui)                       4.7
July      Summer (Natsu)                    3.7
July      Election (Senkyo)                 0.8
July      Publishing (Shuppan)              0.7
July      The Star Festival (Tanabata)      0.6
August    Olympic games (Orinpikku)         2.4
August    Summer vacation (Natsuyasumi)     1.9
August    Player (Senshu)                   1.8
August    Competition (Taikai)              1.3
August    Athens                            1.3

Table 13. Monthly topics in 2004 (from September through December)

Month       Term (Japanese)               %
September   Baseball team (Kyudan)        0.4
September   Entry (San-nyu)               0.3
September   School term (Gakki)           0.2
September   Strike (Sutoraiki)            0.3
September   Avoid (Kaihi)                 0.2
October     Typhoon (Taifu)               4.0
October     Autumn (Aki)                  2.0
October     Earthquake (Jishin)           1.8
October     Cold (Samui)                  1.7
October     Niigata                       1.2
November    Rakuten                       0.6
November    Autumn leaves (Koyo)          0.5
November    Howl                          0.4
November    President (Daitoryo)          0.3
November    Bush                          0.3
December    Christmas                     4.0
December    Next year (Rainen)            1.9
December    End of the year (Nenmatsu)    1.6
December    Year-end party (Bonenkai)     1.3
December    New year's card (Nengajo)     0.8


10.3.2 List of obsolete words

The words in Table 14 fall into several categories: (1) seasonal topics such as “April fool’s day” and “Suffer from the summer heat (Natsubate)”, (2) holiday names such as “Respect for the aged day (Keiro no hi)”, (3) social events such as “Tenjin Matsuri Festival”, and (4) former news topics such as “The Republic of North Ossetia-Alania” and “Sasser”.

Table 14. List of obsolete words (extracted on December 8, 2004)

Term (in Japanese)                          Mean   SD     Last Registered Date
Tenjin Matsuri Festival (Tenjin Matsuri)    0.03   0.18   8 Nov
Najaf                                       0.03   0.18   9 Nov
Silk Famous (Sirukufeimasu in Katakana)     0.17   0.58   10 Nov
Typhoon “Tokage” (Tokage)                   0.07   0.18   11 Nov
The Republic of North Ossetia-Alania        0.03   0.18   12 Nov
Marlon Brando (Maron)                       0.17   0.17   13 Nov
Bonfire (Okuribi)                           0.10   0.3    14 Nov
U-23                                        0.13   0.43   18 Nov
Sasser (Sasser in Katakana)                 0.10   0.40   19 Nov
G.W                                         0.03   0.18   20 Nov
Radcliff (Radokurifu in Katakana)           0.40   0.97   21 Nov
Yokozawa                                    0.10   0.30   22 Nov
April fool's day                            0.23   0.58   23 Nov
Nakahata                                    0.43   0.76   24 Nov
Rock odyssey                                0.20   0.60   25 Nov
Cherry blossom (Yaezakura)                  0.17   0.37   26 Nov
Golden week (Ogon Shukan)                   0.07   0.24   27 Nov
Day of the sea (Umi no hi)                  0.13   0.34   28 Nov
Beginning of the rainy season (Tsuyuiri)    0.17   0.37   29 Nov
Sasser (Sasser in English)                  0.07   0.25   30 Nov
Suffer from the summer heat (Natsubate)     0.40   0.6    1 Dec
Burnt by the sun (Entenka)                  0.60   0.92   2 Dec
Respect for the aged day (Keiro no hi)      0.27   0.51   3 Dec
Health sports day (Taiiku no hi)            0.37   0.66   4 Dec
Bronze medal                                0.97   1.20   5 Dec
The end of the rainy season (Tsuyuake)      0.23   0.56   6 Dec
Lingering summer heat (Zansho)              0.90   1.01   7 Dec