A Method of Extracting Malicious Expressions in Bulletin Board Systems
A method of extracting malicious expressions in bulletin board systems
by using context analysis
Hiroshi Hanafusa, Kazuhiro Morita, Masao Fuketa, Jun-ichi Aoe
Department of Information Science and Intelligent Systems, University of Tokushima, Tokushima 770-8506, Japan
Article info
Article history:
Received 29 September 2009
Received in revised form 17 August 2010
Accepted 17 August 2010
Keywords:
Malicious expressions
Bulletin board systems
Filtering systems
Context analysis
Multi-attribute rules
Separate co-occurrence expressions
Abstract
Bulletin board systems are well-known basic services on the Internet for frequent information exchange. Their convenience lets us communicate with other people and read the posted content at any time. However, malicious postings about crimes are a serious problem for service providers and users. Traditional extraction schemes depend on words or sequences of words without considering the context of articles; therefore, flagging malicious articles requires a great deal of human effort. To reduce this effort, this paper presents a new filtering algorithm that uses context analysis to reduce false positives on non-malicious articles. The presented scheme builds detection knowledge by introducing multi-attribute rules. Experimental results on 11,019 test data show that the sensitivity and specificity of the presented method are 38.7 and 24.1 percentage points higher, respectively, than those of the traditional method.
© 2010 Elsevier Ltd. All rights reserved.
1. Introduction
Bulletin board systems (BBS) have been used as basic services on the Internet for frequent information exchange. Representative examples of BBS are 2 channel ⟨2 channel⟩ in Japan and Yahoo! Bulletin board ⟨Yahoo! BBS⟩ worldwide. Social networking services (SNS) and blog services can be considered applications of BBS because they provide a place for communication on the Internet (Claypool, Brown, LE, & Waseda, 2001). Mixi ⟨Mixi⟩ and Yokoku-In ⟨Yokoku.in⟩ are well-known services in Japan, and myspace ⟨myspace⟩ and Livejournal ⟨Livejournal⟩ are representative services worldwide. Therefore, the BBS discussed in this paper include the above SNS applications.
Bulletin boards are convenient because the anonymity (Security) of the services lets people communicate casually with others and read the posted content at any time. However, malicious postings on bulletin boards, such as postings containing mental abuse and warnings of crimes, are serious problems for users.
Each country takes action against these problems. In America, the Children's Internet Protection Act (CIPA) ⟨Children Internet Protection Act⟩ was established in 2000; it requires public schools and libraries to use filtering software. In Germany, the Youth Protection Act ⟨Youth Protection Act⟩ was established in 2002; it requires providers to filter harmful content. In France, the Digital Economic Act ⟨Digital Economic Act⟩ was established in 2004; it requires that filtering be explained to the public when online communication services are accessed. In Japan, the Provider Liability Act ⟨Provider Liability Act⟩ was established in May 2002, the action plan for the dissemination of and enlightenment about filtering was started in March 2006, and the Internet Environmental Improvement Act was established in June 2008. Moreover, a cabinet decision on comprehensive measures against suicide was established in June 2007, and a revised version including the prohibition of harm to third persons was appended in December 2008. Considering these laws and measures, legislative preparation in Japan is more advanced than in other countries, and there are many court cases and case examples.
0306-4573/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2010.08.003
Corresponding author. E-mail address: [email protected] (J.-i. Aoe).
Information Processing and Management 47 (2011) 323–335
To solve these serious problems for service providers and users, typical filtering schemes use a URL filtering system ⟨⟨Anichiva⟩⟩, which is a knowledge base of malicious sites. This scheme can completely filter sites that include malicious expressions; however, the URL knowledge must be built by hand for rapidly growing Web information. Moreover, it cannot filter only the malicious articles within a site.
The useful schemes (Goldberg, Nichols, Oki, & Terry, 1992; Goldberg, Roeder, Guptra, & Perkins, 2001; Good et al., 1996; Heckerman, Chickering, Meek, Rounthwite, & Kadie, 2000; Herlocker, Konstan, & Riedl, 2002; Kim, Min, Jeon, Man Ro, & Han, 2009; Lee, Lee, Chung, & An, 2007; Pennock, Horvitz, Lawrence, & Giled, 2000; Reddy, Kitsuregawa, Sreekanth, & Rao, 2002; Wang, Arjen, & Marcel, 2006) are automatic detection filtering systems. In this research, content analysis (Kim et al., 2009; Lee et al., 2007) was introduced for malicious domains such as pornography, drugs, violence, crime and so on, but the basic scheme uses two steps of classification for harmful word filtering based on SVM and does not use context analysis. Among these methodologies, there are two types of detection knowledge bases: rule-based schemes (Francis, Frantz, & Mathieu 2000; Landau, Sillion, & Vichot, 1993) and statistic-based schemes (Gharieb, 2000; Yoohwan, Wing, Mooi, & Chao, 2006). Rule-based techniques can detect the precise locations of expected expressions, and it is easy to improve part of the rules, but they require practical matching engines and experts to build the knowledge. The typical statistic-based scheme is the Support Vector Machine (SVM) (Larry & Malik, 2001), whose strong learning technique makes it easy to build detection knowledge and to control decision engines; however, it is weak at locating the expected expressions precisely, and, because of its automatic learning, it is hard to improve the detection knowledge partially.
In current applications, these automatic filtering techniques serve only as supporting systems. Although the current techniques can detect sequences of words, their biggest limitation is that no practical technique considers the context of articles. Therefore, in the current automatic filtering systems, the rate of false positives (Shiraki et al., 2004; Xu, Chong, Lu, & Zhou, 2004) for non-malicious articles is high. Consequently, humans must check a lot of articles in the application systems. For example, DeNA BBS services (3.14 million articles per day) need 300 persons and 2000 million Yen for checking; even so, a critical problem remains because not all articles can be checked, and many inappropriate Web articles persist even with automatic filtering techniques.
To solve these problems, this paper presents a new context filtering algorithm that reduces human effort and improves the false positive rate without degrading the false negative rate. First, the presented method defines separate co-occurrence (SC) expressions, which cannot be detected by the word sequences of the traditional methods. Moreover, context analysis for SC expressions is proposed by introducing multi-attribute rules, which are suitable for extracting expressions on the changing Web, especially inappropriate Web contents, together with a common hierarchy method. The presented method is evaluated on 11,019 test data including malicious and non-malicious articles. It is verified that the presented method improves the false positive rate of the traditional method without degrading the false negative rate.
Section 2 describes the outline of the context analysis and introduces the outline of the presented system by classifying malicious expressions into inadequate and crime expressions. Section 3 proposes multi-attribute rules for these expressions. In Section 4, a context analysis algorithm is presented. Section 5 evaluates the accuracy and time performance of the presented method on experimental data. Section 6 presents the conclusion and possible future work.
2. The presented system
2.1. The outline of context analysis
Fig. 1 shows examples of malicious and non-malicious expressions, where the abbreviation RPG in (b1) means a role-playing game.
Texts (a1), (a2) and (a3) in Fig. 1a are malicious because the underlined expressions are crime and inadequate expressions. However, texts (b1), (b2) and (b3) in Fig. 1b are not malicious because the bold expressions are negative expressions that deny the malicious expressions. In (b1), the malicious expressions "I have a strong sword" and "I kill them" are denied by a game field identified by the SC expression "RPG". In (b2), the malicious expression is denied by a computer field identified by the SC expressions "machine" and "processes". In (b3), the SC expression "Do not write malicious articles" is a caution that denies the malicious expressions "You are ugly" and "this woman is BBW". The difficulty of the traditional method is that it has no scheme for detecting separate co-occurrence (SC) expressions for the combinations ("I have a strong sword", "RPG") and ("RPG", "I kill them") in (b1), ("machine" and "processes", "I will try to kill them") in (b2), and ("Do not write malicious articles", "this woman is BBW") in (b3).
(a) Malicious expression text
(a1) I get a strong sword. Bring your company to Tokyo station tomorrow. I will kill them.
(a2) This machine is very heavy because there are many muzzles. I will try to kill them soon.
(a3) You are ugly in any jacket, always lying in the meeting at work and you speak about BBW.
(b) Non-malicious expression text
(b1) I have a strong sword. Bring your company in the next scene of RPG tomorrow. I kill them.
(b2) This machine is very heavy because there are many processes. I will try to kill them soon.
(b3) Do not write malicious articles in BBS. For example, You are ugly in any jacket or this woman is BBW.
Fig. 1. Examples of malicious and non-malicious expressions.
To address this difficulty, this paper presents a new filtering algorithm that detects SC expressions by introducing context analysis and multi-attribute rules. Note that the expressions to be detected by the presented method combine sequences of expressions, as in the traditional method, with SC expressions obtained through context analysis.
2.2. Inadequate and crime expressions
In this paper, expressions to be detected are classified into two categories of inadequate and crime expressions.
2.2.1. Inadequate expressions
Inadequate expressions, which make people feel irritated, fall into four main categories:
(a) ⟨⟨ABUSE⟩⟩ expressions involve violent or insulting comments towards someone, or cause the psychological state of being annoyed by someone, as follows:
(1) You are ugly in any jacket.
(2) This Woman is BBW.
(3) You are always lying in the meeting at work.
(4) Everybody in the company says you are stupid.
(5) Are you crazy?
(b) ⟨⟨DISCRIMINATION⟩⟩ means treating people differently through prejudice: unfair treatment of one person or group, usually because of prejudice about race and ethnicity, as follows:
(1) I think she is deaf because she can't understand what I say all the time.
(2) He is a bad man as all his talk about BBW.
(3) Yellow monkeys can't use this room because this is for white people.
(c) ⟨⟨DATING SERVICE WEBSITE⟩⟩ is a dating system which allows individuals, couples and groups to make contact and to communicate with each other over the Internet, as follows:
(1) I'm a 16-year-old girl. I can go out with guys at 3.¹
(d) ⟨⟨OBSCENITY⟩⟩ means the trait of behaving in an obscene manner, as follows:
(1) I want to see you naked.
(2) I want to buy kid porn.
(3) He will visit that building to buy some kid porn.
Although overlap expressions (successive postings with the same contents, regarded as trolling) and ungrammatical expressions also exist, neither is discussed in this paper because they have no SC expressions.
2.3. Crime expressions
Bulletin boards include expressions that warn of crimes. These expressions are very important to detect because such highly malicious postings can seriously affect people and organizations. Because crimes have actually followed postings that warned of them, such postings should not be permitted even if they are fake. There are four categories of expressions with warnings of crimes:
(a) ⟨⟨MURDER&VIOLENCE⟩⟩: murder, as defined in common law countries, is the unlawful killing of another human being with intent, as follows:
(1) Your friends are immediately killed,
(2) killed some people at Tokyo station last week.
(b) ⟨⟨EXPLOSION&ARSON⟩⟩ means the crime of deliberately and maliciously destroying or setting fire to structures or wildland, as follows:
(1) A strange boy set fire to his grandparents' house last night.
(2) A female terrorist destroyed a big shopping mall with dynamite in Wakayama last Monday.
(c) ⟨⟨CRIME MATERIAL⟩⟩ means the tools used in committing a crime, as follows:
(1) I get a strong sword, I will kill them by it.
(2) This machine is very heavy because there are many muzzles.
(d) ⟨⟨DRUG⟩⟩ means a chemical substance that affects the processes of the mind or body, as follows:
(1) S crystal, high quality, 0.0002 g.
(2) White and clear SS, high quality, ice ice ice.
¹ Where 3 means the amount of money.
2.4. The construction of the presented system
Fig. 2 shows the construction of the presented system.
In Fig. 2, the posting identification information (article number, posting title, acquisition) is extracted from the bulletin board (text data), and then the input sentence is segmented by morpheme analysis and named entity processing. In the next step, the two context phases for inadequate and crime expressions are carried out in parallel, each with its own extraction rules. Finally, risk judgment is conducted according to the above results.
3. Rule-based extracting knowledge
3.1. Definition of multi-attribute rules
For extraction of the expected expressions in natural language processing, it is important to introduce an efficient algorithm that can match multi-attribute rules over several kinds of information (morphological, syntactic and semantic). To build efficient detection rules, or knowledge, the fundamental concept was proposed by Ando, Mizobuchi, Shishibori, and Aoe (1998), and it has been utilized for the target-based approach to sentence classification (Kadoya et al., 2005). Moreover, this approach has been applied to the classification of medical reports (Kiyoi, Atlam, Fuketa, Yoshinari, & Aoe, 2008) and to emotion analysis (Yoshinari, Atlam, Morita, Kiyoi, & Aoe, 2008). Generally, these attributes include strings (words), parts of speech (categories) and concepts (semantics, or meanings). Let A_NAME represent an attribute name and let A_VALUE represent an attribute value. A finite set R of pairs (A_NAME, A_VALUE) is called a rule structure. The attributes are as follows:
STR: string, or word spelling.
CAT: category by general concepts, or part of speech.
SEM: semantic information to be defined in this paper.
The formal definition follows the description by Kadoya et al. (2005), but all rules here correspond to inadequate and crime expressions. For example, by using these attributes, the input structures of the sentence "He kills someone" are defined as follows:
N1 = {(STR, "He"), (CAT, HUMAN)}.
N2 = {(STR, "kills"), (CAT, VERB), (SEM, MURDER&VIOLENCE)}.
N3 = {(STR, "someone"), (CAT, HUMAN)}.
[Figure: flow of the presented system. Input: Bulletin Board (Text Data) → Acquisition of Posting Identification Information (Article number/Posting title/Acquisition) → Morpheme and Named Entity Analysis (using the Morpheme Dictionary) → Inadequate Expressions Test (using extraction rules for inadequate expressions) and Crime Expression Test (using extraction rules for crime expressions) → Risk judgment.]
Fig. 2. The construction of the presented systems.
In the above examples, (SEM, MURDER&VIOLENCE) denotes semantics for crime expressions, (CAT, HUMAN) denotes a category by general concepts, and (CAT, VERB) denotes a part of speech. A huge number of crime expressions can be represented by multi-attribute rules that use these semantics.
The p-th multi-attribute matching rule, Rule (p), is defined as follows:
Rule (p) = Rp1 Rp2 ... Rpm, m = np, 0 < np.
A rule that matches the above input expression "He kills someone" can be defined as Rule (1):
Rule (1) = R11 R12 R13,
R11 = {(CAT, HUMAN)},
R12 = {(SEM, MURDER&VIOLENCE)},
R13 = {(CAT, HUMAN)}.
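As a minimal illustration of these definitions, the Python sketch below represents each structure as a frozenset of (A_NAME, A_VALUE) pairs and checks Rule (1) against the input structures of "He kills someone". The helper names `structure` and `rule_matches` are hypothetical, and the position-by-position inclusion test assumes the inclusion relationship described in Section 4.1; it is not the authors' implementation.

```python
# Sketch only: frozensets of (A_NAME, A_VALUE) pairs stand in for the
# paper's rule structures R and input structures N.

def structure(*pairs):
    """Build a structure from (A_NAME, A_VALUE) pairs."""
    return frozenset(pairs)

# Input structures for "He kills someone" (Section 3.1).
N1 = structure(("STR", "He"), ("CAT", "HUMAN"))
N2 = structure(("STR", "kills"), ("CAT", "VERB"), ("SEM", "MURDER&VIOLENCE"))
N3 = structure(("STR", "someone"), ("CAT", "HUMAN"))

# Rule (1) of Section 3.1: the sequence R11 R12 R13.
RULE_1 = (
    structure(("CAT", "HUMAN")),
    structure(("SEM", "MURDER&VIOLENCE")),
    structure(("CAT", "HUMAN")),
)

def rule_matches(rule, inputs):
    """True if every rule structure is a subset of the input structure
    at the same position (assumed inclusion-based matching)."""
    return len(rule) == len(inputs) and all(r <= n for r, n in zip(rule, inputs))

print(rule_matches(RULE_1, [N1, N2, N3]))  # True
print(rule_matches(RULE_1, [N2, N1, N3]))  # False: N2 does not contain (CAT, HUMAN)
```

Because a rule structure lists only the attributes it cares about, the rule ignores the STR values entirely and would match any HUMAN, MURDER&VIOLENCE, HUMAN sequence.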
3.2. Multi-attribute descriptions
3.2.1. Semantic information
Semantic information (SEM) depends on the classes in Section 2, so only the typical semantics are explained here.
Inadequate and crime expressions are described by combining a variety of words, phrases, categories and semantics as follows:
(1) Abuse expressions: (SEM, ABUSE) is used for violent or insulting comments towards someone. For example, abuse expressions are "ugly", "liar", "stupid" and "crazy".
(2) Discrimination expressions: (SEM, DISCRIMINATION) is used for treating people differently through prejudice. For example, a discrimination expression is "deaf".
(3) Obscenity expressions: (SEM, OBSCENITY) is used for the trait of behaving in an obscene manner. For example, obscenity expressions are "naked" and "kid porn".
(4) Murder & Violence expressions: (SEM, MURDER&VIOLENCE) is used for the unlawful killing of another human being with intent. For example, murder and violence expressions are "kill" and "shoot".
(5) Explosion & Arson expressions: (SEM, EXPLOSION&ARSON) is used for destroying or setting fire to structures or wildland areas. For example, explosion and arson expressions are "terrorists", "destroy" and "set fire".
(6) Crime material expressions: (SEM, CRIME MATERIAL) is used for the tools used in committing a crime. For example, crime material expressions are "sword" and "muzzles".
3.2.2. Multi-attribute rules
Context analysis of multi-attribute expressions (MULTI) is carried out in two stages. The first stage determines candidates of inadequate expressions in the text and produces results (CON, x), where CON and x represent features for the context analysis of the second stage. The second stage determines the final result, or risk judgement, by using (CON, x), and produces (FIX, y) if the result is fixed, otherwise (NON, y). The detailed method will be discussed in the next section.
Tables 1 and 2 show rule-based knowledge for the first and the second stages, respectively.
Table 1 uses general concepts such as CLOTHES, JOB and DOCUMENT. (CON, NEGATION) negates inadequate and crime decisions. Negation can also be performed by the concepts denoting special fields, (CON, GAME) and (CON, COMPUTER). In Table 1, Rule (8) matches the input crime expression "He kills someone" and produces (CON, MURDER&VIOLENCE) for context analysis in the second stage.
In Table 2, Rule (18) = {{(CON, CRIME MATERIAL)} {(CON, PLACE)} {(CON, TIME)} {(CON, MURDER&VIOLENCE)}} is the decision rule for ⟨⟨CRIME MATERIAL⟩⟩ and ⟨⟨MURDER&VIOLENCE⟩⟩, where ⟨⟨ ⟩⟩ corresponds to the types of inadequate and crime classes in Section 2. This rule takes the features produced by the first stage of Table 1 and produces the final judgment (FIX, ⟨⟨CRIME MATERIAL⟩⟩, ⟨⟨MURDER&VIOLENCE⟩⟩), where this notation means (FIX, ⟨⟨CRIME MATERIAL⟩⟩) and (FIX, ⟨⟨MURDER&VIOLENCE⟩⟩). If the final judgment is not FIX, then the output becomes (NON, ⟨⟨CRIME MATERIAL⟩⟩), (NON, ⟨⟨MURDER&VIOLENCE⟩⟩) or (NON, ⟨⟨ABUSE⟩⟩), as in Table 2.
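The two-stage flow can be sketched in Python as follows. This is an illustrative simplification under stated assumptions, not the authors' matching machine: the token structures are tagged by hand rather than by morpheme analysis, only a few rules from Tables 1 and 2 are included, rules are required to match contiguous structures, and the names `S`, `scan` and `classify` are hypothetical.

```python
def S(*pairs):
    """A structure: a frozenset of (A_NAME, A_VALUE) pairs."""
    return frozenset(pairs)

# First-stage rules (subset of Table 1): (rule number, rule structures, output).
STAGE1 = [
    (8, [S(("CAT", "HUMAN")), S(("SEM", "MURDER&VIOLENCE")), S(("CAT", "HUMAN"))],
     ("CON", "MURDER&VIOLENCE")),
    (11, [S(("CAT", "HUMAN")), S(("CAT", "VERB")), S(("SEM", "CRIME MATERIAL"))],
     ("CON", "CRIME MATERIAL")),
    (13, [S(("CAT", "VERB")), S(("CAT", "GAME"))], ("CON", "GAME")),
    (16, [S(("CAT", "NAME")), S(("CAT", "STATION"))], ("CON", "PLACE")),
    (17, [S(("CAT", "TIME"))], ("CON", "TIME")),
]

# Second-stage rules (subset of Table 2), written over (CON, x) features.
STAGE2 = [
    (18, [S(("CON", "CRIME MATERIAL")), S(("CON", "PLACE")), S(("CON", "TIME")),
          S(("CON", "MURDER&VIOLENCE"))], ("FIX", "CRIME MATERIAL", "MURDER&VIOLENCE")),
    (19, [S(("CON", "CRIME MATERIAL")), S(("CON", "GAME"))], ("NON", "CRIME MATERIAL")),
    (20, [S(("CON", "CRIME MATERIAL")), S(("CON", "MURDER&VIOLENCE"))],
     ("FIX", "CRIME MATERIAL", "MURDER&VIOLENCE")),
]

def scan(inputs, rules):
    """Left-to-right scan; at each position apply the longest matching rule."""
    results, i = [], 0
    while i < len(inputs):
        best = None
        for number, pattern, out in rules:
            window = inputs[i:i + len(pattern)]
            fits = len(window) == len(pattern) and all(
                r <= n for r, n in zip(pattern, window))
            if fits and (best is None or len(pattern) > len(best[1])):
                best = (number, pattern, out)
        if best:
            results.append((best[0], best[2]))
            i += len(best[1])
        else:
            i += 1  # no rule starts here; skip this structure
    return results

def classify(tokens):
    """Stage 1 yields (CON, x) features; stage 2 turns them into FIX/NON decisions."""
    features = scan(tokens, STAGE1)
    return scan([S(out) for _, out in features], STAGE2)

# Hand-tagged structures for Fig. 1 (a1) and the (b1) variant used in Table 2.
HEAD = [S(("STR", "I"), ("CAT", "HUMAN")), S(("STR", "get"), ("CAT", "VERB")),
        S(("STR", "a strong sword"), ("CAT", "NOUN"), ("SEM", "CRIME MATERIAL")),
        S(("STR", "Bring"), ("CAT", "VERB")), S(("STR", "your company"), ("CAT", "NOUN"))]
TAIL = [S(("STR", "tomorrow"), ("CAT", "TIME")), S(("STR", "I"), ("CAT", "HUMAN")),
        S(("STR", "will kill"), ("CAT", "VERB"), ("SEM", "MURDER&VIOLENCE")),
        S(("STR", "them"), ("CAT", "HUMAN"))]
A1 = HEAD + [S(("STR", "Tokyo"), ("CAT", "NAME")), S(("STR", "station"), ("CAT", "STATION"))] + TAIL
B1 = HEAD + [S(("STR", "play"), ("CAT", "VERB")), S(("STR", "RPG"), ("CAT", "GAME"))] + TAIL

print(classify(A1))  # [(18, ('FIX', 'CRIME MATERIAL', 'MURDER&VIOLENCE'))]
print(classify(B1))  # [(19, ('NON', 'CRIME MATERIAL'))]
```

In this sketch the same `scan` function serves both stages, which mirrors the idea that the second stage is simply another rule match over the (CON, x) features produced by the first.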
4. Multi-attribute matching
4.1. Construction of machines
For multi-attribute matching, Ando et al. (1998) proposed a set matching algorithm, and the implementation was developed in the C programming language. Kadoya et al. (2005) used this approach for sentence classification, Kiyoi et al. (2008) for medical reports, and Yoshinari et al. (2008) for emotion analysis.
Suppose that a is a sequence of input structures. The multi-attribute pattern-matching (MAPM) machine in this method takes a as the input and produces, as the output, the matching results corresponding to the rules. Formally, the machine MAPM consists of a set of states, each represented by a number. The matching operation of the machine MAPM is similar to the multi-keyword string pattern-matching method of Aho-Corasick (Aho & Corasick, 1975; Ando et al., 1998), but it has the following distinctive features.
4.1.1. goto and output functions
Let T be a set of states and let L be a set of rule structures R. The behaviour of the machine MAPM is defined by the following two functions:
(a) goto function goto: T × L → T ∪ {fail}, which maps a pair consisting of a state and a rule structure into a state or the message fail. A transition label of the goto function is extended to set notation. Therefore, in the machine MAPM, a transition is confirmed by the inclusion relationship, that is, by whether the input structure N includes the rule structure R or not.
(b) output function output: T → A, where A is a set of pairs (p, (x, y)) for rule number p and matching result (x, y). For Rule (1) in Table 1, the matching result is (CON, ABUSE), and {(1, (CON, ABUSE))} is the proper representation.
The input structures to be matched by the matching rules are defined by the same set representation. N is used as the notation for input structures to distinguish them from R. To allow abstraction in the rule structure, matching of the rule structure R and the input structure N is decided by the inclusion relationship such that N includes R, denoted by N ⊇ R. Therefore, the machine MAPM is also called a set matching machine.
Let R_SET be a set of Rule (p) for inadequate and crime expressions. Consider the following Rule (7) and Rule (8) in Table 1.
R71 = {(SEM, MURDER&VIOLENCE)}, R72 = {(CAT, HUMAN)}.
R81 = {(CAT, HUMAN)}, R82 = {(SEM, MURDER&VIOLENCE)}, R83 = {(CAT, HUMAN)}.
Table 2
Examples of decision knowledge of the second stage (context analysis).
Output | Multi-attribute rules | Examples
(FIX, ⟨⟨CRIME MATERIAL⟩⟩, ⟨⟨MURDER&VIOLENCE⟩⟩) | Rule (18) = {{(CON, CRIME MATERIAL)} {(CON, PLACE)} {(CON, TIME)} {(CON, MURDER&VIOLENCE)}} | Fig. 1 (a1) I get a strong sword. Bring your company to Tokyo station tomorrow. I will kill them.
(NON, ⟨⟨CRIME MATERIAL⟩⟩) | Rule (19) = {{(CON, CRIME MATERIAL)} {(CON, GAME)}} | Fig. 1 (b1) I get a strong sword. Bring your company to play RPG tomorrow. I will kill them.
(FIX, ⟨⟨CRIME MATERIAL⟩⟩, ⟨⟨MURDER&VIOLENCE⟩⟩) | Rule (20) = {{(CON, CRIME MATERIAL)} {(CON, MURDER&VIOLENCE)}} | Fig. 1 (a2) This machine has many muzzles. I will try to kill them soon.
(NON, ⟨⟨MURDER&VIOLENCE⟩⟩) | Rule (21) = {{(CON, COMPUTER)} {(CON, MURDER&VIOLENCE)}} | Fig. 1 (b2) This machine has many processes. I will try to kill them soon.
(FIX, ⟨⟨ABUSE⟩⟩) | Rule (22) = {{(CON, ABUSE)} {(CON, ABUSE)}} | Fig. 1 (a3) You are ugly in any jacket, always lying in the meeting at work and you speak about BBW.
(NON, ⟨⟨ABUSE⟩⟩) | Rule (23) = {{(CON, NEGATION)} {(CON, ABUSE)}} | Fig. 1 (b3) Do not write malicious articles in BBS. For example, You are ugly in any jacket or this woman is BBW.
Table 1
Examples of extracting knowledge of the first stage.
Output | Multi-attribute rules | Examples
(CON, ABUSE) | Rule (1) = {{(SEM, ABUSE)} {(CAT, CLOTHES)}} | (1) ugly in any jacket
(CON, ABUSE) | Rule (2) = {{(SEM, ABUSE)} {(CAT, JOB)}} | (2) liar at work
(CON, ABUSE) | Rule (3) = {{(SEM, ABUSE)} {(CAT, DOCUMENT)}} | (3) malicious articles
(CON, DISCRIMINATION) | Rule (4) = {{(CAT, HUMAN)} {(CAT, VERB)} {(SEM, DISCRIMINATION)}} | (4) she is deaf
(CON, OBSCENITY) | Rule (5) = {{(CAT, VERB)} {(CAT, HUMAN)} {(SEM, OBSCENITY)}} | (5) see you naked
(CON, OBSCENITY) | Rule (6) = {{(CAT, HUMAN)} {(CAT, VERB)} {(SEM, OBSCENITY)}} | (6) I buy kid porn
(CON, MURDER&VIOLENCE) | Rule (7) = {{(SEM, MURDER&VIOLENCE)} {(CAT, HUMAN)}} | (7) kill people
(CON, MURDER&VIOLENCE) | Rule (8) = {{(CAT, HUMAN)} {(SEM, MURDER&VIOLENCE)} {(CAT, HUMAN)}} | (8) He kills someone
(CON, EXPLOSION&ARSON) | Rule (9) = {{(CAT, HUMAN)} {(SEM, EXPLOSION&ARSON)}} | (9) strange boy set fire
(CON, EXPLOSION&ARSON) | Rule (10) = {{(SEM, EXPLOSION&ARSON)} {(CAT, ORGANIZATION)}} | (10) destroy a high school
(CON, CRIME MATERIAL) | Rule (11) = {{(CAT, HUMAN)} {(CAT, VERB)} {(SEM, CRIME MATERIAL)}} | (11) I get a strong sword
(CON, CRIME MATERIAL) | Rule (12) = {{(CAT, MACHINE)} {(SEM, CRIME MATERIAL)}} | (12) machine has many muzzles
(CON, GAME) | Rule (13) = {{(CAT, VERB)} {(CAT, GAME)}} | (13) play RPG
(CON, COMPUTER) | Rule (14) = {{(CAT, MACHINE)} {(CAT, VERB)} {(CAT, PROCESS)}} | (14) machine has many processes
(CON, NEGATION) | Rule (15) = {{(SEM, NEGATIVE)}} | (15) Do not write malicious articles
(CON, PLACE) | Rule (16) = {{(CAT, NAME)} {(CAT, STATION)}} | (16) Tokyo station
(CON, TIME) | Rule (17) = {{(CAT, TIME)}} | (17) tomorrow
Suppose that the input "he kills someone" has the following structures.
N1 = {(STR, "he"), (CAT, HUMAN)}.
N2 = {(STR, "kills"), (CAT, VERB), (SEM, MURDER&VIOLENCE)}.
N3 = {(STR, "someone"), (CAT, NOUN), (CAT, HUMAN)}.
Each input structure includes the corresponding rule structure as follows:
N2 ⊇ R71, N3 ⊇ R72.
N1 ⊇ R81, N2 ⊇ R82, N3 ⊇ R83.
The machine MAPM becomes non-deterministic if two or more rules can match the input structure. The ambiguity can be resolved by giving higher priority to the longest applicable rule.
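A small Python sketch of such a set matching machine is given below, under several assumptions: the class name `MAPM`, its methods, and the state numbering (starting from 0) are hypothetical; `goto` returns every state whose transition label is included in the input structure, so non-determinism is handled by tracking a set of candidate states; and the longest matching rule at each position wins. It illustrates inclusion-labeled transitions and is not the authors' C implementation.

```python
def S(*pairs):
    """A structure: a frozenset of (A_NAME, A_VALUE) pairs."""
    return frozenset(pairs)

class MAPM:
    """Toy set matching machine: a trie whose edge labels are rule structures."""

    def __init__(self):
        self.edges = {0: []}   # state -> list of (rule structure, next state)
        self.outputs = {}      # state -> list of (rule number, matching result)

    def add_rule(self, number, pattern, result):
        """Insert Rule (number) = pattern into the trie, sharing common prefixes."""
        state = 0
        for label in pattern:
            nxt = next((t for lab, t in self.edges[state] if lab == label), None)
            if nxt is None:
                nxt = len(self.edges)
                self.edges[nxt] = []
                self.edges[state].append((label, nxt))
            state = nxt
        self.outputs.setdefault(state, []).append((number, result))

    def goto(self, state, n):
        """States reachable from `state` on input structure n, i.e. labels R with n ⊇ R.
        An empty list plays the role of the message fail."""
        return [t for lab, t in self.edges[state] if lab <= n]

    def longest_match(self, inputs, start):
        """Return (length, outputs) for the longest rule matching at `start`, or None."""
        best, frontier = None, [0]
        for length, n in enumerate(inputs[start:], 1):
            frontier = [t for s in frontier for t in self.goto(s, n)]
            if not frontier:
                break
            hits = [o for s in frontier for o in self.outputs.get(s, [])]
            if hits:
                best = (length, hits)
        return best

    def run(self, inputs):
        """Scan the whole input sequence, preferring the longest rule at each position."""
        results, i = [], 0
        while i < len(inputs):
            match = self.longest_match(inputs, i)
            if match:
                results.extend(match[1])
                i += match[0]
            else:
                i += 1
        return results

machine = MAPM()
machine.add_rule(7, [S(("SEM", "MURDER&VIOLENCE")), S(("CAT", "HUMAN"))],
                 ("CON", "MURDER&VIOLENCE"))
machine.add_rule(8, [S(("CAT", "HUMAN")), S(("SEM", "MURDER&VIOLENCE")), S(("CAT", "HUMAN"))],
                 ("CON", "MURDER&VIOLENCE"))

N1 = S(("STR", "he"), ("CAT", "HUMAN"))
N2 = S(("STR", "kills"), ("CAT", "VERB"), ("SEM", "MURDER&VIOLENCE"))
N3 = S(("STR", "someone"), ("CAT", "NOUN"), ("CAT", "HUMAN"))

print(machine.run([N1, N2, N3]))  # [(8, ('CON', 'MURDER&VIOLENCE'))]
print(machine.run([N2, N3]))      # [(7, ('CON', 'MURDER&VIOLENCE'))]
```

On the full sentence the three-structure Rule (8) is applied from the first position, so the shorter Rule (7) is never reported for "kills someone"; it only fires when the subject is absent.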
Figs. 3 and 4 show the goto and output functions for Tables 1 and 2, respectively. Rules 10, 15, 16, 17 and 19 are omitted from these figures because only some sample rules are shown.
4.2. Multi-stage matching
The context analysis of MULTI is carried out in two stages, where the rules of Table 1 are used for the first-stage matching and those of Table 2 are used for the second stage.
The following procedure MAPM(a, M) summarizes the behaviour of the machine MAPM; the context analysis of the proposed method MULTI is carried out by calling this procedure twice (Kiyoi et al., 2008).
4.2.1. Procedure MAPM(a, M)
A sequence a of input structures is N1, N2, ..., Nn, where each Ni (0 < i < n + 1) is an input structure. M is a machine MAPM defined by goto and output functions. Note that the input of the first stage is the result of named entity processing, and the input of the second stage is the sequence of outputs with the notation (CON, x) from the first-stage matching. The function NEXT(a) returns the first structure N1 and modifies a to N2 ... Nn, that is, the first structure N1 is removed.
[Figure: state-transition diagram (states 1 to 22) whose transitions are labeled by rule structures such as {(CAT, HUMAN)}, {(CAT, VERB)}, {(CAT, MACHINE)}, {(SEM, ABUSE)} and {(SEM, MURDER&VIOLENCE)}. The output functions shown are:
output (4) = {(8, (CON, MURDER&VIOLENCE))}
output (6) = {(4, (CON, DISCRIMINATION))}
output (7) = {(6, (CON, OBSCENITY))}
output (8) = {(11, (CON, CRIME MATERIAL))}
output (10) = {(9, (CON, EXPLOSION&ARSON))}
output (13) = {(13, (CON, GAME))}
output (14) = {(5, (CON, OBSCENITY))}
output (16) = {(1, 2, 3, (CON, ABUSE))}
output (18) = {(12, (CON, CRIME MATERIAL))}
output (20) = {(14, (CON, COMPUTER))}
output (22) = {(7, (CON, MURDER&VIOLENCE))}]
* (CAT, CLOTHES), (CAT, JOB) and (CAT, DOCUMENT) are merged into the same transition, and output (16) merges rules 1, 2 and 3.
Fig. 3. The goto and output functions for some sample rules in Table 1.
begin
  STATE := 0;
  while a ≠ NULL do
  begin
    N := NEXT(a);
    while STATE ≠ fail do STATE := goto(STATE, R) such that N ⊇ R;
    MAPM(a, M);
    Output := output(p) for matched Rule(p);
    N := NEXT(a);
  end
end
Consider the input sentence "I get a strong sword" in (a1) of Fig. 1 with the following sequence of structures.
N1 = {(STR, "I"), (CAT, HUMAN)}.
N2 = {(STR, "get"), (CAT, VERB)}.
N3 = {(STR, "a strong sword"), (CAT, NOUN), (SEM, CRIME MATERIAL)}.
Table 3 shows the matching flow of the first stage for the above input structures. State transitions are 1, 2, 5, 8, and then state 8 produces output (8) = {(11, (CON, CRIME MATERIAL))}; the feature (CON, CRIME MATERIAL) becomes the input of the second-stage matching.
Table 4 shows the matching flow of the second stage for (a1) in Fig. 1. Suppose that the following results are obtained from the first stage.
"I get a strong sword." is N1 = {(CON, CRIME MATERIAL)}.
"Bring your company to Tokyo station tomorrow." is N2 = {(CON, PLACE)} and N3 = {(CON, TIME)}.
"I will kill them" is N4 = {(CON, MURDER&VIOLENCE)}.
State transitions are 1, 4, 5, 6, 7, and then output (7) gives the final decision that (a1) is ⟨⟨CRIME MATERIAL⟩⟩ and ⟨⟨MURDER&VIOLENCE⟩⟩.
In the same manner, Table 5 shows the matching flow of the second stage for the part of (b1) in Fig. 1. Suppose that the
following results are obtained from the first stage.
[Figure: state-transition diagram (states 1 to 12) whose transitions are labeled by {(CON, ABUSE)}, {(CON, NEGATION)}, {(CON, CRIME MATERIAL)}, {(CON, PLACE)}, {(CON, TIME)}, {(CON, COMPUTER)} and {(CON, MURDER&VIOLENCE)}. The output functions shown are:
output (3) = {(22, (FIX, ⟨⟨ABUSE⟩⟩))}
output (7) = {(18, (FIX, ⟨⟨CRIME MATERIAL⟩⟩, ⟨⟨MURDER&VIOLENCE⟩⟩))}
output (8) = {(20, (FIX, ⟨⟨CRIME MATERIAL⟩⟩, ⟨⟨MURDER&VIOLENCE⟩⟩))}
output (10) = {(23, (NON, ⟨⟨ABUSE⟩⟩))}
output (12) = {(21, (NON, ⟨⟨MURDER&VIOLENCE⟩⟩))}]
Fig. 4. The goto and output functions for rules in Table 2.
Table 3
Examples of matching process in the first stage.
STATE | N | R | goto/output
1 | N1 | {(CAT, HUMAN)} | 2
2 | N2 | {(CAT, VERB)} | 5
5 | N3 | {(SEM, CRIME MATERIAL)} | output (8) = {(11, (CON, CRIME MATERIAL))}
"Bring your company in the next scene of RPG tomorrow" is N1 = {(CON, GAME)}.
"I kill them" is N2 = {(CON, MURDER&VIOLENCE)}.
State transitions are 1, 11 and 12. Then, output (12) gives the final decision that the text is not ⟨⟨MURDER&VIOLENCE⟩⟩.
5. Experimental results
5.1. Basic detection knowledge and experimental data
Basic detection knowledge has been built to detect expressions of abuse, obscenity, drugs, and crime. Table 6 shows the
contents of the detection knowledge for each expression type, where the following abbreviations are used.
NUM-WORD: The number of word expressions.
NUM-RULE: The number of multi-attribute rules.
NUM-PAT: The number of surface patterns.
Experimental data were collected from 22 bulletin boards, yielding 8450 articles that possibly contain inappropriate
expressions. The main sites are 2 channel ⟨2 channel⟩, Yahoo! ⟨Yahoo! BBS⟩, Gakkou-Ura ⟨Gakkou-Ura⟩ and Yokoku-In
⟨Yokoku-In⟩, and the fields of these articles include criticism, requests, hospitals, educational problems, cartoons,
betrayal, dirt, comics, rumors, arrest, notices and crimes. From the 8450 malicious articles, 1525 inadequate (I) and
388 crime (C) expressions to be evaluated were obtained. For non-inadequate (NI) and non-crime (NC) test data, 2569
non-malicious expressions (1277 non-inadequate (NI) and 1382 non-crime (NC) expressions) were prepared from Web pages
like Fig. 1 (b) that include basic single words such as "kill", "sword", "sex", "adults" and so on. That is to say,
test data I and C are malicious, whereas NI and NC are non-malicious.
5.2. Experimental results
The presented method, based on multi-attribute expressions in context, is called MULTI, and the traditional method,
based on sequences of morphemes or words, is called SINGLE in contrast to MULTI. To evaluate MULTI and SINGLE, specificity
and sensitivity (Altman & Bland, 1994) are used as follows:
True positive (TP): Malicious expressions correctly determined as malicious.
False positive (FP): Non-malicious expressions incorrectly determined as malicious.
True negative (TN): Non-malicious expressions correctly determined as non-malicious.
False negative (FN): Malicious expressions incorrectly determined as non-malicious.
Let NUM_TP, NUM_FP, NUM_TN and NUM_FN be the numbers of TP, FP, TN and FN, respectively.
SPECIFICITY: The rate (%) of specificity by calculating NUM_TN/(NUM_TN + NUM_FP).
SENSITIVITY: The rate (%) of sensitivity by calculating NUM_TP/(NUM_TP + NUM_FN).
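As a worked check of these two definitions, the short Python sketch below applies them to the NI and I counts reported in Tables 7 and 8. The function names are illustrative choices, not part of the original system.

```python
def specificity(num_tn, num_fp):
    """SPECIFICITY (%) = NUM_TN / (NUM_TN + NUM_FP), as a percentage."""
    return 100.0 * num_tn / (num_tn + num_fp)


def sensitivity(num_tp, num_fn):
    """SENSITIVITY (%) = NUM_TP / (NUM_TP + NUM_FN), as a percentage."""
    return 100.0 * num_tp / (num_tp + num_fn)


# NUM_TN and NUM_FP of the NI rows in Table 8.
single_spec = specificity(526, 751)
multi_spec = specificity(1078, 199)

# NUM_TP and NUM_FN of the I rows in Table 7.
single_sens = sensitivity(891, 634)
multi_sens = sensitivity(1247, 252)

print(f"SINGLE(I + NI): {single_spec:.1f} {single_sens:.1f}")
print(f"MULTI(I + NI):  {multi_spec:.1f} {multi_sens:.1f}")
```

Rounded to one decimal place, these values agree with the SINGLE(I + NI) and MULTI(I + NI) rows of Table 9.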
Table 4
Examples of matching process in second stage.
STATE  N   R                         goto/output
1      N1  {(CON, CRIME MATERIAL)}   4
4      N2  {(CON, PLACE)}            5
5      N3  {(CON, TIME)}             6
6      N4  {(CON, MURDER&VIOLENCE)}  output (7) = {(18, (FIX, ⟨⟨CRIME MATERIAL⟩⟩, ⟨⟨MURDER&VIOLENCE⟩⟩))}
Table 5
Examples of matching process in second stage.
STATE  N   R                         goto/output
1      N1  {(CON, GAME)}             11
11     N2  {(CON, MURDER&VIOLENCE)}  output (12) = {(21, (NON, ⟨⟨MURDER&VIOLENCE⟩⟩))}
Table 6
The contents of basic detection knowledge.
            NUM-WORD  NUM-RULE     NUM-PAT
Inadequate    16,239      1281  12,875,138
Crime         12,681      1378   9,486,523
Total         28,920      2659  22,361,661
A specificity of 100% means that all non-malicious expressions are detected as non-malicious. A sensitivity of 100%
means that all malicious expressions are detected as malicious. We can say that the presented method has a low error
rate and can reduce human effort for a large number of malicious candidates.
Other quantities for the malicious test data (I and C) are defined as follows:
ALL_DATA: The number of all data to be evaluated.
ALL_CORR: The number of all correct expressions to be extracted.
NUM_EXTR: The number of extracted expressions.
Table 7 shows experimental results for malicious expressions, where SINGLE(I) and SINGLE(C) denote SINGLE applied to I
and C, respectively. MULTI(I) and MULTI(C) are defined in the same way.
Table 8 shows experimental results for non-malicious test data (NI and NC), where SINGLE(NI) and SINGLE(NC) denote
SINGLE applied to NI and NC, respectively. MULTI(NI) and MULTI(NC) are defined in the same way.
Table 9 shows evaluation results by SPECIFICITY and SENSITIVITY obtained from TP, FN, TN and FP in Tables 7 and 8.
From the simulation results in Table 9, it turns out that SPECIFICITY and SENSITIVITY of MULTI are 38.7 and
24.1 percentage points higher than those of SINGLE, respectively. The main reason is that the false positive (FP) rate
of MULTI is much lower than that of SINGLE in Table 8. The high TP and TN of MULTI also contribute to the improvement.
In general, the number of malicious expressions is far larger than that of non-malicious expressions. Therefore, the
presented method contributes to reducing human effort.
Moreover, the presented rule-based method has two practical advantages:
(a) Unknown words: The presented rule-based method is suited to extracting expressions from the changing Web,
especially from inappropriate Web content. The reason lies in the set matching condition N ⊇ R. Suppose that N
contains (CAT, UNKNOWN) when the input has unknown expressions. Then R is replaced by {(CAT, UNKNOWN)}, and N ⊇ R
holds.
Table 7
Simulation results for inadequate (I) and crime (C) expressions.
                     ALL_CORR  NUM_EXTR  NUM_TP  NUM_FN  TP (%)  FN (%)
SINGLE(I)                1525       920     891     634    96.8    58.4
MULTI(I)                 1525      1453    1247     252    85.8    81.8
MULTI(I)-SINGLE(I)        NON       533     356     382    11.0    23.3
SINGLE(C)                 388       229     228     160    99.6    58.8
MULTI(C)                  388       324     312      76    96.3    80.4
MULTI(C)-SINGLE(C)        NON        95      84      84     3.3    21.6
SINGLE                   1913      1149    1119     794    97.4    58.5
MULTI                    1913      1777    1559     328    87.7    81.5
MULTI-SINGLE              NON       628     440     466     9.7    23.0
NON represents empty data.
Table 8
Simulation results for non-inadequate (NI) and non-crime (NC) expressions.
                       ALL_NON  NUM_TN  NUM_FP  TN (%)  FP (%)
SINGLE(NI)                1277     526     751    41.1    58.8
MULTI(NI)                 1277    1078     199    84.4    15.6
MULTI(NI)-SINGLE(NI)       NON     552     552    43.3    43.2
SINGLE(NC)                1382     228     160    56.7    43.3
MULTI(NC)                 1382     312      76    91.2     8.8
MULTI(NC)-SINGLE(NC)       NON      84      84    34.5    34.5
SINGLE                    2659     754     911    49.2    50.8
MULTI                     2659    1390     275    87.9    12.1
MULTI-SINGLE               NON     636     636    37.8    38.7
NON represents empty data.
Table 9
Evaluation results by SPECIFICITY and SENSITIVITY.
                 SPECIFICITY (%)  SENSITIVITY (%)
SINGLE(I + NI)              41.2             58.4
MULTI(I + NI)               84.4             83.2
SINGLE(C + NC)              56.7             58.8
MULTI(C + NC)               91.2             80.4
SINGLE                      49.2             58.5
MULTI                       88.0             82.6
Consider the following input and rule structures.
N1 = {(STR, Killed), (SEM, MURDER&VIOLENCE)}.
N2 = {(STR, ABC), (CAT, UNKNOWN)}.
N3 = {(STR, Tokyo Station), (CAT, PLACE)}.
Rule (7) = R71, R72, R73.
R71 = {(SEM, MURDER&VIOLENCE)}.
R72 = {(CAT, HUMAN)}.
R73 = {(CAT, PLACE)}.
In this case, N1 ⊇ R71, N2 ⊇ R72 and N3 ⊇ R73 are satisfied because R72 = {(CAT, HUMAN)} is replaced by
R72 = {(CAT, UNKNOWN)}. Therefore, the transitions always succeed under this error recovery.
Although this error recovery produces many accessible transitions, it is a practical and robust scheme because
it is easy to restrict the upper bound of possible transitions in a practical system.
It is a difficult task to register new words and expressions in dictionaries together with their categories and
semantics. With the above method, the rules do not need to be extended every time such a registration is made. The
important point for robustness is how to extract possible candidates from malicious expressions containing many syntax
errors and argotic (slang) words.
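The unknown-word recovery for Rule (7) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: match_rule and UNKNOWN are hypothetical names, matching is taken to be set inclusion of each rule structure in the corresponding input structure, and the max_recoveries parameter is one assumed way to bound the number of recovered transitions; the source does not specify how the upper bound is enforced.

```python
# Sketch of error recovery for unknown words; identifiers are assumptions.
UNKNOWN = ("CAT", "UNKNOWN")


def match_rule(inputs, rule, max_recoveries=1):
    """Match input structures against rule structures position by position.

    A position matches when the rule structure R is included in the input
    structure N.  If N carries (CAT, UNKNOWN), R is treated as replaced by
    {(CAT, UNKNOWN)}, so the inclusion holds; max_recoveries (an assumed
    knob) bounds how many times this replacement may be used.
    """
    if len(inputs) != len(rule):
        return False
    recoveries = 0
    for n, r in zip(inputs, rule):
        if r <= n:
            continue  # ordinary match, no recovery needed
        if UNKNOWN in n and recoveries < max_recoveries:
            recoveries += 1  # recovered transition
            continue
        return False
    return True


inputs = [
    {("STR", "Killed"), ("SEM", "MURDER&VIOLENCE")},  # N1
    {("STR", "ABC"), ("CAT", "UNKNOWN")},             # N2
    {("STR", "Tokyo Station"), ("CAT", "PLACE")},     # N3
]
rule7 = [
    {("SEM", "MURDER&VIOLENCE")},  # R71
    {("CAT", "HUMAN")},            # R72
    {("CAT", "PLACE")},            # R73
]
print(match_rule(inputs, rule7))                    # recovery allowed
print(match_rule(inputs, rule7, max_recoveries=0))  # recovery disabled
```

With recovery disabled the unknown word ABC blocks Rule (7), which illustrates why the replacement makes the rule base tolerant of unregistered words.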
(b) Hierarchical concept matching.
Consider "delete people at Tokyo station" with the following structures.
N1 = {(STR, delete), (SEM, DELETE)}.
N2 = {(STR, people), (CAT, NOUN), (CAT, HUMAN)}.
N3 = {(STR, Tokyo Station), (CAT, PLACE)}.
Suppose that the semantic meaning of the verb "delete" is DELETE, the super-category of the category MURDER&VIOLENCE,
and suppose that R71 of the above Rule (7) is {(SEM, DELETE\MURDER&VIOLENCE)}, where \ is a hierarchical notation.
Hierarchical concept matching succeeds if DELETE of N1 is equal to the super-category DELETE of
DELETE\MURDER&VIOLENCE in R71. This matching is weak because it is not exact, but the extended matching is practical
as an error recovery that allows similar expressions to be extracted. That is to say, it enables us to support
rule-based knowledge using concepts.
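A small Python sketch of this hierarchical comparison is given below. It rests on assumptions that are not in the source: the hierarchical notation is represented as a backslash-separated path from super-category to category, concept_match and SEP are hypothetical names, and the labels "exact" and "weak" are introduced only to distinguish the two kinds of success.

```python
# Sketch of hierarchical concept matching; identifiers are assumptions.
SEP = "\\"  # assumed separator between a super-category and its category


def concept_match(input_concept, rule_concept):
    """Compare an input concept with a hierarchical rule concept.

    Returns "exact" when the input equals the most specific category,
    "weak" when it equals one of its super-categories, and None otherwise.
    """
    path = rule_concept.split(SEP)
    if input_concept == path[-1]:
        return "exact"
    if input_concept in path[:-1]:
        return "weak"
    return None


r71 = "DELETE" + SEP + "MURDER&VIOLENCE"
print(concept_match("DELETE", r71))           # super-category of the rule concept
print(concept_match("MURDER&VIOLENCE", r71))  # the category itself
print(concept_match("GAME", r71))             # unrelated concept
```

Returning a label instead of a Boolean lets a caller treat weak matches as lower-priority candidates, in line with the remark that the extended matching is not perfect.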
The rule base of the presented method MULTI is being built step by step for frequent expressions, but there are
difficult problems, as shown in the following example:
"RQJmcf2O kill Aaaaqqqbbb", where RQJmcf2O and Aaaaqqqbbb are user IDs.
Context analysis for a sequence of articles including past information is needed, but the current system has no
ability to describe applicable rules. This technique depends on discourse analysis and remains for future research.
Moreover, there are some ⟨Ungrammatical sentence⟩ cases as follows:
D
R
U
G
S
To solve this problem, a special frozen analysis must be introduced case by case; this remains for future research.
The support vector machine (SVM) is a well-known approach. SVMs depend on words or a sequence of words without
considering the context of articles, and they have many disadvantages (Burgess, 1998) as follows:
(1) The biggest limitation of the support vector approach lies in the choice of the kernel.
(2) The second limitation is speed and size, both in training and testing.
(3) Discrete data presents another problem.
(4) The most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements in
large-scale tasks (Horváth, 2003).
However, the results detected by the presented method can be used by SVM schemes as learning features, because SVMs
require a large amount of correct training data. That is to say, SVMs and the presented method can work in a
coordinated manner.
5.3. Time evaluation and error analysis
The detecting ability of the presented method is excellent as described above, but it is also important to evaluate
the time performance, together with an essential error analysis, for the whole system, which consists of the
following modules.
The first module, FOCUS, determines the essential text by removing redundant text (advertising parts) from Web pages.
This module is carried out by HTML tag processing. Frequently occurring pages with the same tag format, such as
product recommendations and news, are removed in this module to reduce the false positive rate. In the second module,
the morphological analysis MOPH determines parts of speech and fundamental concepts. For error analysis, unknown
expressions are detected in this module. The module KW determines keywords consisting of sequential expressions from
the results of morphological analysis. The module FIELD determines document fields (Atlam, Elmarhomy, Fuketa, Morita,
& Aoe, 2006; Atlam, Fuketa, Morita, & Aoe, 2003; Fuketa, Lee, Tsuji, Okada, & Aoe, 2000; Fuketa et al., 2005). This
module is carried out by matching field association words against the results of morphological analysis. Its results
can be used to reduce the false positive rate. Examples are the game and computer fields for (b1) and (b2) in Fig. 1,
and for Rule (13) and Rule (14) in Table 2, respectively. The next module, NE, determines named entities such as
names, organizations, places and so on. This module is applied to the results of keyword analysis (Asahara &
Matsumoto, 2003; Wright & Budin, 1997). For example, "ABC Station" is a station name and "Nagoya company" is a company
name. In the error analysis of this module, the unknown word "ABC" and the ambiguous name "Nagoya" left by the module
MOPH can be resolved. The module ATTR determines SC expressions by using the presented multi-attribute method.
SINGLE uses FOCUS, MOPH, KW and NE. MULTI uses FOCUS, MOPH, NE and ATTR, where NE is included in ATTR.
The presented system has been developed on a Windows 2003 server with two Intel Xeon E5440 CPUs (2.83 GHz) and 2 GB
of main memory. Fig. 5 shows the time cost of the above modules, where the analysis time is measured for 100 articles
in HTML and in plain text (TEXT). The sizes of HTML and TEXT are 1 MB and 60 KB, respectively.
For the HTML documents in Fig. 5, it turns out that the time of the presented method is practical, although MULTI is
about 1.28 times slower than SINGLE. In fact, MOPH and FOCUS can be performed by preprocessing servers, so the analysis
time of the main module ATTR of MULTI becomes 20 ms for a text article.
6. Conclusion
The extracting scheme of traditional methods depends on words or a sequence of words without considering the context
of articles. Therefore, many irrelevant candidates of possible malicious expressions are extracted. Although the
current filtering scheme can precisely alert malicious articles, many non-malicious articles are not detected well. In
order to solve these problems, this paper has presented a new filtering algorithm that detects SC expressions by
introducing multi-attribute rules (MULTI). For 11,019 articles, it has been verified that the presented method improves
the false positive rate of the traditional method without degrading the false negative rate. Therefore, we can say that
the presented method MULTI is a very useful approach for services that filter inadequate expressions.
In future work, rule-based knowledge needs to be built for many types of malicious postings, together with the error
recovery.
Fig. 5. Time evaluation of the presented method and the traditional method.
References
2 channel. <http://www.2ch.net/>.
Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333–340.
Altman, D. G., & Bland, J. M. (1994). Diagnostic tests: Sensitivity and specificity. BMJ, 308, 1552.
Ando, K., Mizobuchi, S., Shishibori, M., & Aoe, J. (1998). Efficient multi-attribute pattern matching. An International Journal of Computer Mathematics, 66(1+2), 21–38.
Anichiva. <http://cn.anchiva.com/download/Commtouch%20URL%20Filtering%20White%20Paper_Anichiva_En.pdf>.
Asahara, M., & Matsumoto, Y. (2003). Japanese named entity extraction with redundant morphological analysis. In Proc. of HLT–NAACL 03 (pp. 8–15).
Atlam, E.-S., Elmarhomy, G., Fuketa, M., Morita, K., & Aoe, J. (2006). Automatic building of new field association word candidates using search engine. Information Processing & Management Journal, 42(4), 951–962.
Atlam, E.-S., Fuketa, M., Morita, K., & Aoe, J. (2003). Documents similarity measurement using field association terms. An International Journal of Information Processing and Management, 39(6), 809–824.
Burgess 1998. <http://www.svms.org/disadvantages.html>.
Children Internet Protection Act. <http://en.wikipedia.org/wiki/Children's_Internet_Protection_Act>.
Claypool, M., Brown, D., LE, P., & Waseda, M. (2001). Inferring user interest. IEEE Internet Computing, 5, 32–39.
Digital Economic Act. <http://www.legifrance.gouv.fr/affichTexte.do?cidTexte=JORFTEXT000000801164&dateTexte>.
Francis, W., Frantz, V., & Mathieu, S. (2000). Using learning-based filters to detect rule-based filtering obsolescence. Article in proceeding of research information assister of ordinate, RIAO 2000, Paris.
Fuketa, M., Kadoya, Y., Atlam, E.-S., Kunikata, T., Morita, K., Kashiji, S., et al. (2005). A method of extracting and evaluating good and bad reputations for natural language expressions. Information Technology & Decision Making, 4(2), 77–196.
Fuketa, M., Lee, S., Tsuji, T., Okada, M., & Aoe, J. (2000). A document classification method by using field association words. An International Journal of Information Sciences, 126(1), 57–70.
Gakkou-Ura. <http://schecker.jp/> (in Japanese).
Gharieb, R. R. (2000). Higher order statistics based IIR notch filtering scheme for enhancing sinusoids in colored noise. IEE Proceedings Vision Image and Signal Processing, 147(2), 115–121.
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35, 61–70.
Goldberg, K., Roeder, T., Guptra, D., & Perkins, C. (2001). Eigentaste: A constant-time collaborative filtering algorithm. Information Retrieval, 4, 133–151.
Good, N., Schafer, J. B., Konstan, J. A., Borchers, A., Sarwar, B. M., & Harter, S. P. (1996). Variations in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science, 47, 37–49.
Heckerman, D., Chickering, D. M., Meek, C., Rounthwite, R., & Kadie, C. (2000). Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1, 49–75.
Herlocker, J. L., Konstan, J. A., & Riedl, J. (2002). An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Information Retrieval, 5, 287–310.
Horváth (2003). In Suykens et al., p. 392.
Kadoya, Y., Morita, K., Fuketa, M., Ohono, M., Atlam, E.-S., Sumitomo, T., et al. (2005). A sentence classification technique by using intention association expressions. Computer Mathematics, 82(7), 777–792.
Kim, S., Min, H., Jeon, J., Man Ro, Y., & Han, S. (2009). Malicious content filtering based on semantic features. In Proceedings of the ACM international conference on interaction sciences: Information technology, culture and human (Vol. 403, pp. 802–806), Seoul, Korea.
Kiyoi, K., Atlam, E.-S., Fuketa, M., Yoshinari, T., & Aoe, J. (2008). A method for extracting knowledge from medical texts including numerical representation. International Journal of Computer Applications in Technology, 33(2/3), 226–236.
Landau, M. C., Sillion, F., & Vichot, F. (1993). Exoseme: A thematic document filtering system. In Intelligence Artificial, Avignon, France.
Larry, M., & Malik, Y. (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 139, 154.
Lee, W., Lee, S., Chung, S., & An, D. (2007). Harmful contents classification using the harmful word filtering and SVM. In Proceedings of the 7th international conference on computational science, Part III: ICCS 2007 (pp. 18–25), May 27–30, 2007, Beijing, China.
Livejournal. <http://www.livejournal.com/>.
Mixi. <http://mixi.jp/> (in Japanese).
myspace. <http://us.myspace.com/>.
Pennock, D. M., Horvitz, E., Lawrence, S., & Giled, C. L. (2000). Collaborative filtering by personality diagnosis: A hybrid memory and model-based approach. In Proceedings of the sixteenth annual conference on uncertainty in artificial intelligence (UAI-2000) (pp. 473–480), Morgan Kaufmann, San Francisco.
Provider Liability Act. <http://law.e-gov.go.jp/htmldata/H13/H13HO137.html>.
Reddy, P. K., Kitsuregawa, P., Sreekanth, P., & Rao, S. S. (2002). A graph based approach to extract a neighborhood customer community for collaborative filtering. In Lecture notes in computer science, databases in networked information systems, second international workshop (pp. 188–200), Springer.
Shiraki, N., Hara, M., Ogino, H., Shibamoto, Y., Iida, A., Tamaki, T., et al. (2004). False-positive and true-negative hilar and mediastinal lymph nodes on FDG-PET: Radiological-pathological correlation. Annals of Nuclear Medicine, 18(1), 23–28.
Wang, J., Arjen, P., & Marcel, J. T. (2006). Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proceeding of SIGIR 2006, August 6–11, 2006, Seattle, Washington, USA.
Wright, S. E., & Budin, G. (1997). Handbook of terminology management. Basic aspects of terminology management (Vol. 1). Amsterdam, Philadelphia: John Benjamins.
Xu, J., Chong, Z., Lu, H., & Zhou, A. (2004). False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In Proceeding of the 30th VLBD (pp. 204–215), Toronto, Canada.
Yahoo! BBS. <http://messages.yahoo.co.jp/index.html>.
Yokoku.in. <http://yokoku.in/> (in Japanese).
Yoohwan, K., Wing, C., Mooi, C., & Chao, H. Jonathan (2006). PacketScore: A statistics-based packet filtering scheme against distributed denial-of-service attacks. IEEE Transactions on Dependable and Secure Computing, 3(2), 141–155.
Yoshinari, T., Atlam, E.-S., Morita, K., Kiyoi, K., & Aoe, J. (2008). Automatic acquisition for sensibility knowledge using co-occurrence relation. International Journal of Computer Applications in Technology, 33(2/3), 218–225.
Youth Protection Act. <http://www.wien.gv.at/recht/landesrecht-wien/landesgesetzblatt/jahrgang/2002/html/lg2002017.htm>.