
Filtering Offensive Language in Online Communities using Grammatical Relations

BY

SAMUEL AYOKUNLE ADEKANMBI

MATRIC NO: 133466

Project submitted in partial fulfillment of the requirements for the award of the degree of Master of Science

(Computer Science)

Department of Computer Science,

University of Ibadan.

February, 2014.


Certification

I certify that this research work was carried out by Samuel Ayokunle ADEKANMBI (133466)

under my supervision.

____________________                         _______________________

Date                                          Dr. O. B. Longe


DEDICATION

This entire work is dedicated to everyone who believes in the PromoUpdate dream.


ACKNOWLEDGEMENT

My profound gratitude goes to my parents and my siblings for their moral and financial support, which has contributed immensely to the success of this project. To my dad: you are the best; I love you so much even though I don’t show it.

I am indeed grateful to my supervisor, Dr. Olumide B. Longe for his moral support, patience and

understanding during the course of this project. Thank you very much Sir.

I also want to appreciate my very good and crazy friends: Tini, Phina, Kunchasho, TY, Alamu,

Oluwashola Amiola Philip, Emmanuel, Muideen, Lola Mojekodunmi, Jane, Gbenro, N.O Jimoh,

Tifa; You guys are my brothers from another mother.

I cannot overemphasize the efforts of all my lecturers in the department; I pray the blessings of the Lord shall not depart from your homes.

My M.Sc. programme would have been incomplete without a set of wonderful people: Tini, Phina, Helen, Rotimi, Modupe, Tolu, Big Fish, Last Don, Giel, and the whole crew at Chief Madu’s Palace. Thanks for being there for me.

To all my classmates, Dimple, Becky, Elohor, Ben, Fake AYs, John, Uzomma, Deola, Banky,

Shukurat, Toyosi, Shola, Adesi, GP, Toyosi, Tosinsss, etc; you have been a blessing to me and the

success of my programme. I say a big thanks to you for your support throughout the programme.

I appreciate your love. Thanks for believing in the PromoUpdate dream. You guys are the best.

Finally, to anyone who has contributed to the success of this project and to my success in life whose name is not mentioned here, please know that you are not unknown to me and you are appreciated more than you know. God bless you all. See you at the top.


TABLE OF CONTENTS

Title page
Certification
Dedication
Acknowledgement
Table of contents
Abstract

CHAPTER ONE: INTRODUCTION
1.1 Background of the Study
1.2 Problem Statement
1.3 Aims and Objectives
1.4 Research Methodology
1.5 Scope and Limitation
1.6 Organization of the Study
1.7 Expected Contribution to Knowledge
1.7.1 Glossary of Terms

CHAPTER TWO: REVIEW OF THE LITERATURE
2.1 Offensive Language in Online Communities
2.2 Rate of Cyberbullying among Youth
2.3 Traditional Bullying and Cyberbullying
2.4 Types of Bullying Online
2.5 Challenges in the Fight to Stop Cyberbullying
2.6 Preventing Cyberbullying
2.7 Responding to Cyberbullying
2.8 Grammatical Relations
2.9 Using Text Mining Techniques to Detect Online Offensive Content
2.10 Heads and Dependents
2.11 Statistical Parsing
2.12 Dependency Parsing

CHAPTER THREE: SYSTEM ANALYSIS AND DESIGN
3.1 System Analysis
3.2 Analysis of the Existing System
3.3 Problems of the Existing Approaches
3.4 Proposed Filtering Philosophy
3.5 Identify Removable Content by Grammatical Relations

CHAPTER FOUR: IMPLEMENTATION
4.1 Justification of Programming Language Used
4.2 System Specification
4.3 System Implementation

CHAPTER FIVE: SUMMARY, CONCLUSION AND FUTURE WORKS
5.1 Summary
5.2 Conclusion
5.3 Future Works

References


ABSTRACT

Offensive language has risen to be a big issue to the health of both online communities and their

users. To the online community, the spread of offensive language undermines its reputation, drives

users away, and even directly affects its growth. To users, viewing offensive language brings

negative influence to their mental health, especially for children and youth.

A semantic filtering model is proposed and implemented using grammatical analysis and part-of-speech tagging. Statistical/probabilistic analysis of recurring offensive tokens is carried out using a Bayesian method. The designed semantic filtering system was tested as an online web application, with a client application, by engaging users to validate the efficiency of the designed system.

When offensive language is detected in a user message, a problem arises about how the offensive

language should be removed, i.e. the offensive language filtering problem.

Our semantic filtering technique is based on the grammatical relations of words in a sentence so

that the rest of the filtered sentence is readable and the existence of offensive words in the original

sentence is hard to notice. We tested the effectiveness of our approach with a large dataset and the

results show that our techniques are effective and accurate with little processing overhead.

Moreover, as the most time-consuming part of semantic filtering is the sentence parsing process, future work will examine lightweight NLP techniques to speed up sentence parsing. We also plan to extend our filtering approach to support other languages such as Chinese and French.


CHAPTER ONE

INTRODUCTION

1.1 Background of the Study

Online social networking (OSN) websites have enjoyed great success in recent years and have become the new frontier in today’s social relationships, providing great places for self-expression and the exchange of ideas.

Social networking has provided opportunities for new relationships as well as strengthening

existing relationships. Benefits of social networking platforms vary based on platform type,

features and the company itself. OSN allows organizations to improve communication and

productivity by disseminating information among different groups of employees in a more

efficient manner, resulting in increased productivity.

In the past, social networks were viewed as a distraction and offered no educational benefit.

Blocking these social networks was a form of protection for students against wasting time,

bullying, and invasions of privacy. In an educational setting, OSNs are seen by many instructors

and educators as a frivolous, time-wasting distraction from schoolwork, and it is not uncommon for them to be banned in school computer labs. Cyberbullying has also become an issue of concern with

social networks. According to the Children Go Online survey of 9-24 year olds, a third have received bullying comments online (http://internetsafety101.org). To avoid this problem, many school districts/boards have blocked access to online social networks within the school environment.

Social networking services often include a lot of personal information posted publicly, and many

believe that sharing personal information is a window into privacy theft. Schools have taken action

to protect students from this. It is believed that this outpouring of identifiable information and the easy communication vehicle that social networking services provide open the door to sexual predators, cyberbullying, and cyber-stalking (http://en.wikipedia.org/wiki/Social_networking_service). In contrast, however, 70% of social-media-using teens and 85% of adults believe that people are mostly kind to one another on social network sites (http://en.wikipedia.org/wiki/Social_networking_service). Research has suggested that there has

been a shift in blocking the use of social networking services. In many cases, the opposite is

occurring as the potential of online networking services is being realized. It has been suggested

that if schools block them [Online Social Networks], they’re preventing students from learning the

skills they need. Banning social networking is not only inappropriate but also borderline

irresponsible when it comes to providing the best educational experiences for students. Schools

have the option of teaching safe media usage as well as incorporating digital media into the

classroom experience, thus preparing students for the literacy they will encounter in the future.

Cyberbullying is a fast growing trend that experts believe is more harmful than typical schoolyard

bullying. Nearly all of us can be contacted 24/7 via the internet or our mobile phones. Victims can be reached at any time and in any place. For many children, home is no longer a refuge from the

bullies. “Children can escape threats and abuse in the classroom, only to find text messages and

emails from the same tormentors when they arrive home.”

“There’s no safe place anymore and one can be bullied 24/7; even in the privacy of his/her own

bedroom.” (Cyberbullying, Able Publishing Newsletter - Term 3, 2008).

Online social networking sites have become increasingly popular with children, especially young

teens, as a place where they can meet other people, communicate, and exchange information. No

type of bullying is harmless. In some cases, it can constitute criminal behaviour. In extreme

incidents, cyberbullying has led teenagers to suicide. Most victims, however, suffer shame,


embarrassment, anger, depression and withdrawal (Cyberbullying, Able Publishing Newsletter, Term 3, 2008). Cyberbullying is often seen as anonymous, and the nature of the internet allows it

to spread quickly to hundreds and thousands of people.

Cyberbullying has the same insidious effects as any kind of bullying, turning children away from

school, friendships, and in tragic instances, life itself. Parents often tell their children to turn off

the mobile phones or stay off the computer. Many parents don’t understand that the internet and

mobile phone act as a social lifeline for teenagers to their peer group. Victims often don't tell their

parents because they think their parents will only make the problem worse, or that they might even

confiscate their mobile phone or take away their internet access, removing that social lifeline.

While bullying is something that is often ‘under the radar’ of adults, cyberbullying is even more

so. Teenagers are increasingly communicating in ways that are often unknown by adults and away

from their supervision. They organize their social lives through these mediums. Their friendships

are made and broken over these mediums.

So the question remains: “How can we avoid offensive language in OSNs?” This research work aims at removing offensive language from user messages. When offensive language is detected in a user message, a problem arises about how it should be removed, i.e. the offensive language filtering problem. To solve this problem, the manual filtering approach is known to produce the best filtering result. However, manual filtering is costly in time and labor and thus cannot be widely applied (http://en.wikipedia.org/wiki/Anti-spam_techniques). Here, we will

analyze the offensive language in text messages posted in online communities, and propose a new

automatic sentence-level filtering approach that is able to semantically remove the offensive

language by utilizing the grammatical relations among words. Compared with existing automatic filtering approaches, the proposed approach provides filtering results much closer to those of manual filtering.

1.2 Problem Statement

The online community has encouraged the use of offensive language, which has spread to about 80% of all OSNs and has been very harmful to the mental health of both children and youth (Zhi Xu and Sencun Zhu, 2010). To the online community, the deluge of offensive language undermines the community’s reputation, drives users away, and even directly affects its growth.

People have realized the problems brought by offensive language in online communities and many

efforts have been made on detecting the existence of offensive language within user messages.

However, detection alone is not enough to eliminate the hazard caused by offensive language.

When offensive content is detected within a user message, a question arises naturally about how the detected offensive content should be removed from the message before it is transmitted. Also, how do we remove or filter offensive language and words from a message thoroughly while keeping the inoffensive content untouched as much as possible? And can the readability of the filtered content be guaranteed so as to make the filtering transparent to readers?

1.3 Aims and Objectives

This project work intends to develop and implement a sentence-level semantic filtering system, which will:


1. Utilize grammatical relations among words to stop cyberbullying by semantically removing offensive content in a sentence.

2. Produce minimal error when filtering offensive language and words from a message, while keeping inoffensive content untouched as much as possible.

3. Guarantee the readability of filtered content so as to make the filtering transparent to readers.

4. Implement the designed model as a sophisticated NLP application, not an AI application, since learning is not going to be involved.

5. Help reduce the chances of victimization on online social networking sites.

1.4 Research Methodology

The methodology adopted in carrying out this project includes the use of interviews to gather primary data from a number of leading filtering vendors in Nigeria. Both telephone and face-to-face interviews will be carried out with the relevant technology experts within selected organizations. Also, an existing database of offensive words and language will be collected and used to simulate an offensive-word database engine. A semantic filtering model will be proposed and implemented using grammatical analysis and part-of-speech tagging. Statistical/probabilistic analysis of recurring offensive tokens will be done using a Bayesian method. The designed semantic filtering system will be tested as an online web application with a client application, by engaging users to validate the efficiency of the designed system.
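As a rough illustration of the Bayesian analysis of recurring offensive tokens described above, the sketch below trains a simple multinomial Naive Bayes model over word counts; the tiny messages, labels and library choice (scikit-learn) are illustrative assumptions, not the project's actual implementation.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labelled training messages (1 = offensive, 0 = inoffensive).
messages = [
    "you are a worthless idiot",
    "have a great day everyone",
    "shut up you stupid fool",
    "thanks for sharing this post",
]
labels = [1, 0, 1, 0]

# Word counts feed a multinomial Naive Bayes model, i.e. a Bayesian estimate of
# P(offensive | tokens) driven by how often each token recurs in each class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict_proba(["you are such a fool"]))  # [P(inoffensive), P(offensive)]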


1.6 Organization of the Study

The thesis work is arranged in five chapters with the breakdown as follows:

Chapter One is the introduction; it covers online social networking systems, the research aim and objectives, the research methodology and the organization of the dissertation.

Chapter Two deals with the literature review on grammatical relations, cyberbullying and the concept of semantic filtering systems.

Chapter Three presents the Methodology and analysis of the input and output specification of the

proposed system and the design of the system.

Chapter Four describes the system implementation and the evaluation of the system design. It consists of a brief description of each program module and its functions, justifies the choice of package, describes the software required to implement the system, and shows the measures taken during implementation.

Chapter Five summarizes the project work. It covers the conclusion and recommendations for the project.


CHAPTER TWO

LITERATURE REVIEW

2.1 Offensive Language in Online Communities

People, most especially kids, have been bullying each other for generations. The latest generation, however, has been able to utilize technology to expand its reach and the extent of its harm (http://cyberbullying.us). This phenomenon is being called cyberbullying, defined as:

“willful and repeated harm inflicted through the use of computers, cell phones, and other

electronic devices.” Basically, we are referring to incidents where adolescents use technology,

usually computers or cell phones, to harass, threaten, humiliate, or otherwise hassle their peers.

For example, youth can send hurtful text messages to others or spread rumors using cell phones

or computers. Teens have also created web pages, videos, and profiles on social networking sites making fun of others. With cell phones, adolescents have taken pictures in a bedroom, a bathroom, or another location where privacy is expected, and posted or distributed them online. More recently, some have recorded unauthorized videos of other kids and uploaded them for the world to see, rate, tag, and discuss (http://cyberbullying.us).

However, there are many detrimental outcomes associated with cyberbullying and the use of offensive language that reach into the real world. First, many targets of cyberbullying report

feeling depressed, sad, angry, and frustrated. As one teenager stated: “It makes me hurt both

physically and mentally. It scares me and takes away all my confidence. It makes me feel sick

and worthless.” Victims who experience cyberbullying also reveal that they were afraid or embarrassed to go to school or even to come out and talk in public (http://cyberbullying.us). In

addition, there is a link between cyberbullying and low self-esteem, family problems, academic


problems, school violence, and delinquent behavior. Finally, cyberbullied youth also report

having suicidal thoughts, and there have been a number of examples around the world where

youth who were victimized ended up taking their own lives.(http://cyberbullying.us)

Cyberbullying occurs across a variety of venues and mediums in cyberspace, and it shouldn’t

come as a surprise that it occurs most often where teenagers congregate. Initially, many kids

hung out in chat rooms, and as a result that is where most harassment took place. In recent years,

most youth have been drawn to social networking websites (such as Facebook, Twitter, LinkedIn, etc.) and video-sharing websites (such as YouTube). This trend has led to increased

reports of cyberbullying occurring in those environments. (Burgess-Proctor, Patchin, & Hinduja,

2009; Hinduja & Patchin, 2008b; R. M. Kowalski & Limber, 2007; Lenhart, 2007; Li, 2007a;

Patchin & Hinduja, 2006). Instant messaging on the Internet or text messaging via a cell phone

also appear to be common ways in which youth are harassing one another.

2.2 Rate of Cyberbullying among Youth

Estimates of the number of youth who experience cyberbullying vary widely (ranging from 10-

40% or more), depending on the age of the group studied and how cyberbullying is formally

defined. In this research, we inform secondary school students (of International School, Ibadan;

Abadina College, U.I; and Igbobi College Yaba, Lagos) that cyberbullying is when someone

“repeatedly picks on another person by making use of offensive languages through OSN when

chatting or when someone posts something offensive online about another person that they don’t

like.” Using this definition, about 62% of the over 800 randomly-selected 11-18 year-old

students indicated they had been a victim at some point in their life. About this same number


admitted to cyberbullying others during their lifetime. Finally, about 40% of youths in this recent

study said they had both been a victim and an offender.

Fig 2.1

2.3 Traditional Bullying and Cyberbullying

While often similar in terms of form and technique, bullying and cyberbullying have many

differences that can make the latter even more devastating. First, victims often do not know who

the bully is, or why they are being targeted. The cyberbully can cloak his or her identity behind a

computer using anonymous email addresses or pseudonymous screen names.

Second, the hurtful actions of a cyberbully are viral; that is, a large number of people (at school,

in the neighborhood, in the city, in the world!) can be involved in a cyber-attack on a victim, or

at least find out about the incident with a few keystrokes or clicks of the mouse. The perception,

then, is that absolutely everyone knows about it.


Third, it is often easier to be cruel using technology because cyberbullying can be done from a

physically distant location, and the bully doesn’t have to see the immediate response by the

target. In fact, some teens simply might not recognize the serious harm they are causing because

they are sheltered from the victim’s response.

Finally, while parents and teachers are doing a better job supervising youth at school and at

home, many adults don’t have the technological know-how to keep track of what teens are up to

online. As a result, a victim’s experience may be missed and a bully’s actions may be left

unchecked. Even if bullies are identified, many adults find themselves unprepared to adequately respond.

All these and more make cyberbullying a growing problem, because increasing numbers of kids are using and have completely embraced interactions via computers and cell phones. Two-

thirds of youth go online every day for school work, to keep in touch with their friends, to play

games, to learn about celebrities, to share their digital creations, or for many other reasons.

Because the online communication tools have become an important part of their lives, it is not

surprising that some youths have decided to use the technology to be malicious or menacing

towards others. The fact that teens are connected to technology 24/7 means they are susceptible

to victimization (and able to act on mean intentions toward others) around the clock. Apart

from a measure of anonymity, it is also easier to be hateful using typed words rather than spoken

words face-to-face and because some adults have been slow to respond to cyberbullying, many

cyberbullies feel that there are little to no consequences for their actions.

Cyberbullying crosses all geographical boundaries. The Internet has really opened up the whole

world to users who access it on a broad array of devices, and for the most part, this has been a

good thing. Nevertheless, some kids feel free to post or send whatever they want while online


without considering how that content can inflict pain – and sometimes cause severe

psychological and emotional wounds.

2.4 Types of Bullying Online

According to the Internet Safety 101 curriculum, there are many types of cyberbullying, which include:

Gossip: Posting or sending cruel gossip to damage a person’s reputation and

relationships with friends, family, and acquaintances.

Exclusion: Deliberately excluding someone from an online group.

Impersonation: Breaking into someone’s e-mail or other online account and sending

messages that will cause embarrassment or damage to the person’s reputation and

affect his or her relationship with others.

Harassment: Repeatedly posting or sending offensive, rude, and insulting messages.

Cyber-stalking: Posting or sending unwanted or intimidating messages, which may

include threats.

Flaming: Online fights where scornful and offensive messages are posted on websites,

forums, or blogs.

Outing and Trickery: Tricking someone into revealing secrets or embarrassing

information, which is then shared online.

Cyber-threats: Remarks on the Internet threatening or implying violent behavior,

displaying suicidal tendencies.


2.5 Challenges in the fight to stop cyberbullying

There are two major challenges that make it difficult to prevent cyberbullying. First, many

people don’t see the harm associated with it. Some attempt to dismiss or disregard cyberbullying

because there are “more serious forms of aggression to worry about.” While it is true that there

are many issues facing adolescents, parents, teachers, and law enforcement today, we first need

to accept that cyberbullying is one such problem that will only get more serious if ignored.

The other challenge relates to who is willing to step up and take responsibility for responding to

inappropriate use of technology. Parents often say that they don’t have the technical skills to

keep up with their kids’ online behavior; teachers are afraid to intervene in behaviors that often

occur away from school; and law enforcement is hesitant to get involved unless there is clear

evidence of a crime or a significant threat to someone’s physical safety. As a result,

cyberbullying incidents often slip through the cracks. Indeed, the behavior often continues and escalates because it is not quickly addressed. Based on these challenges, there is a need to

collectively create an environment where kids feel comfortable talking with adults about this

problem and feel confident that meaningful steps will be taken to resolve the situation. We also

need to get everyone involved - youth, parents, educators, counselors, law enforcement, social

media companies, and the community at large. It will take a concerted and comprehensive effort

from all stakeholders to really make a difference in reducing cyberbullying.


2.6 Preventing Cyberbullying

The most important preventive step that schools can take is to educate the school community

about responsible internet use. Students need to know that all forms of bullying are wrong and

that those who engage in harassing or threatening behaviors will be subject to discipline. It is

therefore important to discuss issues related to the appropriate use of online communications

technology in various areas of the general curriculum. To be sure, these messages should be

reinforced in classes that regularly utilize technology. Signage also should be posted in the

computer lab or at each computer workstation to remind students of the rules of acceptable use.

In general, it is crucial to establish and maintain a school climate of respect and integrity where

violations result in informal or formal sanction.

Furthermore, school district personnel should review their harassment and bullying policies to

see if they allow for the discipline of students who engage in cyberbullying. If their policy covers

it, cyberbullying incidents that occur at school - or that originate off campus but ultimately result

in a substantial disruption of the learning environment - are well within a school’s legal authority

to intervene. The school then needs to make it clear to students, parents, and all staff that these

behaviors are unacceptable and will be subject to discipline. In some cases, simply discussing the

incident with the offender’s parents will result in the behavior stopping.

2.7 Responding to Cyberbullying

Students should already know that cyberbullying is unacceptable and that the behavior will result

in discipline. Utilize school liaison officers or other members of law enforcement to thoroughly

investigate incidents, as needed, if the behaviors cross a certain threshold of severity. Once the


offending party has been identified, develop a response that is commensurate with the harm done

and the disruption that occurred.

School administrators should also work with parents to convey to the student that cyberbullying

behaviors are taken seriously and are not trivialized. Moreover, schools should come up with

creative response strategies, particularly for relatively minor forms of harassment that do not

result in significant harm. For example, students may be required to create anti-cyberbullying

posters to be displayed throughout the school. Older students might be required to give a brief

presentation to younger students about the importance of using technology in ethically-sound

ways. The point here, again, is to condemn the behavior while sending a message to the rest of

the school community that bullying in any form is wrong and will not be tolerated.

Even though the vast majority of these incidents can be handled informally (calling parents,

counseling the bully and target, expressing condemnation of the behavior), there may be

occasions where formal response from the school is warranted. This is particularly the case in

incidents involving serious threats toward another student, if the target no longer feels

comfortable coming to school, or if cyberbullying behaviors continue after informal attempts to

stop it have failed. In these cases, detention, suspension, changes of placement, or even

expulsion may be necessary. If these extreme measures are required, it is important that

educators are able to clearly demonstrate the link to school and present evidence that supports

their action.

Also, youth should develop a relationship with an adult they trust (a parent, teacher, or someone

else) so they can talk about any experiences they have online (or off) that make them upset or

uncomfortable. If possible, teens should ignore minor teasing or name calling, and not respond to

the bully as that might simply make the problem continue. It’s also useful to keep all evidence of


cyberbullying to show an adult who can help with the situation. If targets of cyberbullying are

able to keep a log or a journal of the dates and times and instances of the online harassment, that

can also help prove what was going on and who started it.

Overall, youth should go online with their parents – show them what web sites they use, and

why. At the same time, they need to be responsible when interacting with others on the Internet.

For instance, they shouldn’t say anything to anyone online that they wouldn’t say to them in

person with their parents in the room. Finally, youth ought to take advantage of the privacy

settings within Facebook and other websites, and the social software (instant messaging, email,

and chat programs) that they use – they are there to help reduce the chances of victimization.

Users can adjust the settings to restrict and monitor who can contact them and who can read their

online content.

Law enforcement officers also have a role in preventing and responding to cyberbullying. To

begin, they need to be aware of ever-evolving state and local laws concerning online behaviors,

and equip themselves with the skills and knowledge to intervene as necessary. In a recent survey

of school resource officers, we found that almost one-quarter did not know if their state had a

cyberbullying law. This is surprising since their most visible responsibility involves responding

to actions which are in violation of law (e.g., harassment, threats, stalking). Even if the behavior

doesn’t immediately appear to rise to the level of a crime, officers should use their discretion to

handle the situation in a way that is appropriate for the circumstances. For example, a simple

discussion of the legal issues involved in cyberbullying may be enough to deter some youth from

future misbehavior. Officers might also talk to parents about their child’s conduct and express to

them the seriousness of online harassment.


Relatedly, officers can play an essential role in preventing cyberbullying from occurring or

getting out of hand in the first place. They can speak to students in classrooms about

cyberbullying and online safety issues more broadly in an attempt to discourage them from

engaging in risky or unacceptable actions and interactions. They might also speak to parents

about local and state laws, so that they are informed and can properly respond if their child is

involved in an incident.

2.8 Grammatical Relations

Grammatical relations refer to functional relationships between constituents in a clause. The

standard examples of grammatical functions from traditional grammar are subject, direct object,

and indirect object. Beyond these concepts from traditional grammar, more modern theories of

grammar are likely to acknowledge many further types of grammatical relations (e.g.

complement, specifier, predicative, etc.). The role of grammatical relations in theories of

grammar is the greatest in many dependency grammars, which tend to posit dozens of distinct

grammatical relations. Every head-dependent dependency bears a grammatical function.

Grammatical relations are exemplified in traditional grammar by the notions of subject, direct

object, and indirect object. For example:

Adekanmbi gave Samuel the book.

The subject Adekanmbi performs or is the source of the action. The direct object the book is

acted upon by the subject, and the indirect object Samuel receives the direct object or otherwise

benefits from the action. Traditional grammars often begin with these rather vague notions of the

grammatical functions. When one begins to examine the distinctions more closely, it quickly


becomes clear that these basic definitions do not provide much more than a loose orientation

point. What is indisputable about the grammatical relations is that they are relational. That is,

subject and object can exist as such only by virtue of the context in which they appear. A noun

such as Adekanmbi or a noun phrase such as the book cannot qualify as subject and direct object,

respectively, unless they appear in an environment, e.g. a clause, where they are related to each

other and/or to an action or state. In this regard, the main verb in a clause is responsible for

assigning grammatical relations to the clause "participants".
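As a small illustration (not the thesis implementation), the grammatical relations in the example sentence can be written as head-dependent pairs labelled with their functions, in the style of the typed dependencies used later in this chapter:

# A minimal sketch: the grammatical relations of "Adekanmbi gave Samuel the book"
# expressed as (relation, head, dependent) triples.
relations = [
    ("nsubj", "gave", "Adekanmbi"),  # subject performs the action
    ("iobj",  "gave", "Samuel"),     # indirect object receives the direct object
    ("dobj",  "gave", "book"),       # direct object is acted upon
    ("det",   "book", "the"),        # determiner of the direct object
]

# Every relation ties a dependent to its head; the main verb assigns the
# grammatical relations to the clause "participants".
for rel, head, dep in relations:
    print(f"{rel}({head}, {dep})")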

2.9 Using Text Mining Techniques to Detect Online Offensive Contents

Offensive language identification in social media is a difficult task because the textual content in such environments is often unstructured, informal, and even misspelled. While defensive

methods adopted by current social media are not sufficient, researchers have studied intelligent

ways to identify offensive contents using text mining approach. Implementing text mining

techniques to analyze online data requires the following phases:

1) Data acquisition and preprocessing,

2) Feature extraction

3) Classification

The major challenges of using text mining to detect offensive content lie in the feature selection phase, which will be elaborated in the following sections.

a) Message-level Feature Extraction

Most offensive content detection research extracts two kinds of features: lexical and syntactic

features.


Lexical features treat each word and phrase as an entity. Word patterns such as appearance of

certain keywords and their frequencies are often used to represent the language model. Early

research used Bag-of-Words (BoW) in offensiveness detection. The BoW approach treats a text

as an unordered collection of words and disregards the syntactic and semantic information.

However, using the BoW approach alone not only yields low accuracy in subtle offensive language detection, but also brings in a high false-positive rate, especially during heated arguments, defensive reactions to others’ offensive posts, and even conversations between close friends. The N-gram approach is considered an improvement in that it brings words’ nearby context

information into consideration to detect offensive contents. N-grams represent subsequences of

N continuous words in texts. Bi-gram and Tri-gram are the most popular N-grams used in text

mining. However, N-gram suffers from difficulty in exploring related words separated by long

distances in texts. Simply increasing N can alleviate the problem but will slow down system

processing speed and bring in more false positives.
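For concreteness, the sketch below extracts word-level N-grams from a message; the whitespace tokenizer and the sample message are illustrative only.

def word_ngrams(text, n):
    """Return the sequence of word-level N-grams in a message (illustrative tokenizer)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

message = "what the heck is wrong with you"
print(word_ngrams(message, 1))  # unigrams: the Bag-of-Words view (order ignored)
print(word_ngrams(message, 2))  # bi-grams: each word with its nearby context
print(word_ngrams(message, 3))  # tri-grams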

Syntactic features: Although lexical features perform well in detecting offensive entities, without considering the syntactic structure of the whole sentence they fail to distinguish the offensiveness of sentences which contain the same words but in different orders. Therefore, to consider syntactic features, natural language parsers are introduced to parse sentences into grammatical structures before feature selection. Equipping the system with a parser can help avoid selecting unrelated word sets as features in offensiveness detection.

b) User-level Offensiveness Detection

Most contemporary research on detecting online offensive language focuses only on sentence-level and message-level constructs. Since no detection technique is 100% accurate, if users keep connecting with the sources of offensive content (e.g., online users or websites), they are at high risk of continuous exposure to offensive content. However, user-level detection is a more

challenging task and studies associated with the user level of analysis are largely missing. There

are some limited efforts at the user level. For example, Kontostathis et al. propose a rule-based

communication model to track and categorize online predators. Pendar uses lexical features with

machine learning classifiers to differentiate victims from predators in online chatting

environment. Pazienza and Tudorache propose utilizing user profiling features to detect

aggressive discussions. They use users’ online behavior histories (e.g., presence and

conversations) to predict whether or not users’ future posts will be offensive. Although their

work points out an interesting direction to incorporate user information in detecting offensive

content, more advanced user information such as users’ writing styles, posting trends or reputations has not been included to improve the detection rate.

Fig 2.2


2.10 Heads and dependents

The importance of the syntactic functions reaches its greatest extent in dependency grammar

(DG) theories of syntax. Every head-dependent dependency bears a syntactic function. The result

is that an inventory consisting of dozens of distinct syntactic functions is needed for each

language. For example, a determiner-noun dependency might be assumed to bear the DET

(determiner) function, and an adjective-noun dependency is assumed to bear the ATTR

(attribute) function. These functions are often produced as labels on the dependencies themselves

in the syntactic tree, e.g.

Fig 2.3

The tree contains the following syntactic functions: ATTR (attribute), CCOMP (clause

complement), DET (determiner), MOD (modifier), OBJ (object), SUBJ (subject), and VCOMP


(verb complement). The actual inventories of syntactic functions will differ from the one

suggested here in the number and types of functions that are assumed. In this regard, this tree is

merely intended to be illustrative of the importance that the syntactic functions can take on in

some theories of syntax and grammar.

2.11 Statistical parsing

CFGs can be used to parse, but some ambiguous sentences cannot be disambiguated, and we would like to know the most likely parse. A corpus can be used to do that.

2.11.1 Basic idea

1. Start with a Treebank (we can say bank of trees, e.g. Penn Treebank) which is a

collection of sentences with syntactic annotation, i.e., already-parsed sentences.

2. Examine which parse trees occur frequently

3. Extract grammar rules corresponding to those parse trees, estimating the probability of

the grammar rule based on its frequency.

That is, we’ll have a CFG augmented with probabilities (PCFG).

2.11.2 Probabilistic Context-Free Grammars (PCFGs)

Definition of a PCFG:

- Set of non-terminals (N)

- Set of terminals (T)


- Set of rules/productions (P), of the form A → β

- Designated start symbol (S)

- A function D that assigns a probability to each rule in P: D = P(A → β)

2.11.3 Estimating Probabilities using a Treebank

- Given a corpus of sentences annotated with syntactic annotation

(e.g., the Penn Treebank)

- Consider all parse trees

- (1) Each time a rule of the form A → β is applied in a parse tree, increment a counter for that rule

- (2) Also count the number of times A is on the left-hand side of a rule

- Divide (1) by (2): D = P(A → β | A) = Count(A → β) / Count(A)
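A minimal sketch of this counting procedure over a toy "treebank" (the rules and trees are invented for illustration; a real estimate would use the Penn Treebank):

from collections import Counter

# Each tree is listed as the sequence of rules used in it, written "LHS -> RHS".
treebank_rules = [
    ["S -> NP VP", "NP -> Det N", "VP -> V NP", "NP -> Det N"],
    ["S -> VP", "VP -> V NP", "NP -> Det N"],
]

rule_counts = Counter()
lhs_counts = Counter()
for tree in treebank_rules:
    for rule in tree:
        lhs = rule.split("->")[0].strip()
        rule_counts[rule] += 1   # count (1): occurrences of A -> beta
        lhs_counts[lhs] += 1     # count (2): occurrences of A on the left-hand side

# D = P(A -> beta | A) = Count(A -> beta) / Count(A)
probabilities = {rule: rule_counts[rule] / lhs_counts[rule.split("->")[0].strip()]
                 for rule in rule_counts}
print(probabilities)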

2.11.4 Using Probabilities to Parse

• P (T) = probability of a particular parse tree

= the product of the probabilities of all the rules r used to expand each node n in the parse

tree


Fig 2.4

We have the following rules and probabilities

- S → VP .05

- VP → V NP .40

- NP → Det N .20

- V → book .30

- Det → that .05

- N → flight .25

P(T) = P(S → VP) × P(VP → V NP) × … × P(N → flight)

     = .05 × .40 × .20 × .30 × .05 × .25 = .000015

So, the probability for that parse is 0.000015. Probabilities are useful for comparing with other

probabilities. Whereas we couldn’t decide between two parses using a regular CFG, we now can.
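The same computation in code, multiplying the rule probabilities listed above:

import math

# The rule probabilities used in the parse above.
rule_probs = {
    "S -> VP": 0.05,
    "VP -> V NP": 0.40,
    "NP -> Det N": 0.20,
    "V -> book": 0.30,
    "Det -> that": 0.05,
    "N -> flight": 0.25,
}

# P(T) is the product of the probabilities of every rule used in the tree.
p_tree = math.prod(rule_probs.values())
print(p_tree)  # ≈ 1.5e-05, i.e. .000015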


2.11.5 Obtaining the best parse

The best parse T(S), where S is our sentence, is the tree which has the highest probability.

We can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse:

- CYK is a form of dynamic programming

- CYK is a chart parser, like the Earley parser

2.11.6 Problems with PCFGs

It’s still only a CFG, so dependencies on non-CFG information are not captured.

- e.g., pronouns are more likely to be subjects than objects:

  P[(NP → Pronoun) | NP = subject] >> P[(NP → Pronoun) | NP = object]

Ignores lexical dependency information (statistics), which is usually crucial for disambiguation

of “PP attachment ambiguity” and “Coordination ambiguity”.

- (T1) America sent [ [250,000 soldiers] [into Iraq] ]

- (T2) America sent [250,000 soldiers] [into Iraq]

The verb “sent” with an “into”-PP almost always attaches high, so parse (T2) should receive the higher probability.

An example of Coordination ambiguity is two parses of the phrase “dogs in houses and cats”

- (T1) [ [NP dogs] in [ NP houses and cats ] ]

- (T2) [ [NP dogs in houses] and [NP cats ] ]

Here T1 is semantically wrong and T2 is correct, but both trees receive the same score. So a PCFG alone is not enough to disambiguate parse trees; lexical dependency information is also needed. To handle lexical information, we’ll turn to lexicalized PCFGs.


2.11.7 Lexicalized PCFGs

Lexicalized Parse Trees

- Add “headwords” to each phrasal node. Each PCFG rule in a tree is augmented to

identify one RHS constituent to be the head daughter

- The headword for a node is set to the head word of its head daughter

- Headship not in (most) treebanks

- Usually use head rules, e.g.:

- NP:

• Take leftmost NP

• Take rightmost N*

• Take rightmost JJ

• Take right child

- VP:

• Take leftmost VB*

• Take leftmost VP

• Take left child


Fig 2.5

2.11.8 Incorporating head probabilities

Previously, we conditioned on the mother node (A):

- P ( A → β | A )

Now, we can condition on the mother node and the headword of A (h(A)):

- P( A → β | A , h (A) )

We’re no longer conditioning on simply the mother category A, but on the mother category when

h(A) is the head.

- e.g., P ( VP → VBD NP PP | VP , dumped)


2.11.9 Calculating rule probabilities

We calculate this by comparing how many times the rule occurs with h(n) as the

headword versus how many times the mother/headword combination appear in total:

P(VP → VBD NP PP | VP, dumped)

     = C(VP(dumped) → VBD NP PP) / Σβ C(VP(dumped) → β)

2.11.10 Adding info about word-word dependencies

We want to take into account one other factor: the probability of being a head word (in a

given context)

- P(h(n)=word | …)

We condition this probability on two things: 1. the category of the node (n), and 2. the

headword of the mother (h(m(n)))

- P(h(n)=word | n, h(m(n))), shortened as: P(h(n) | n, h(m(n)))

- P(sacks | NP, dumped)

What we’re really doing is factoring in how words relate to each other. We will call this a dependency relation later: sacks is dependent on dumped, in this case.


Fig 2.6: Lexicalized parsing can be seen as producing dependency trees

2.12 Dependency Parsing

Modern dependency grammar was created by the French linguist Lucien Tesnière (1959), although its roots may be traced back to Panini’s grammar of Sanskrit (a predecessor of Bangla) many centuries earlier. In NLP, the dependency parse tree is thought of as a ‘bridge’ between syntactic and semantic analysis, since it gives some semantic information as well as syntactic. Some people also argue that it is another version of chunk parsing, because careful observation of a dependency tree reveals that every subpart of a sentence (subject, object or complements) appears in a different subtree or under a different relation, where each node is dependent on another node. These subtrees, or semantically dependent nodes, can be thought of as separate chunks.


2.12.1 Basic Concepts

In a dependency representation every node in the structure is a surface word (there are no

abstract nodes such as NP or VP), but each word may have additional attributes such as its part-

of-speech (POS) tag. The parent word is known as the head, and its children are its modifiers.

The observation which derives DG is: In a sentence, all but one word depend on other words.

The one word that doesn’t depend on any other is called the root of the sentence. A typical DG

analysis of the sentence “A man sleeps” is demonstrated below:

A depends on man

Man depends on sleeps

Sleeps depends on nothing (it is the root of the sentence)

Or, put differently:

A modifies man

Man is the subject of sleeps

Sleeps is the main verb of the sentence

This is Dependency Grammar. A formulated dependency grammar is given below:

Capturing relations between words is moving in the direction of dependency grammar

(DG)

In DG, there is no such thing as constituency

The structure of a sentence is purely the binary relations between words, A → B means

that B depends on A


Dependencies are motivated by grammatical function, both syntactically and semantically. A

word depends on another either if it is a complement or a modifier of the latter. The edge

between a parent and a child node specifies the grammatical relationship between the two words

(e.g. subj, obj, and adj).

In most formulations of DG for example, functional heads or governors (e.g. verbs)

subcategorize for their complements. Hence, a transitive verb like ‘like’ requires two

complements (dependents), one noun with the grammatical function subject and one with the

function object.

In this research thesis, we are using Stanford-Parser version-jdk1.5 for all of the output.

Ex sentence: John likes Italian food.

Tagged output: John/NNP likes/VBZ Italian/NN food/NN

Constituent structure output:

(ROOT

(S

(NP (NNP John))

(VP (VBZ likes)

(NP (NN Italian) (NN food)))))

Dependency structure output:

nsubj(likes-2, John-1)

nn(food-4, italian-3)

dobj(likes-2, food-4)
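The thesis obtains these outputs from the Stanford parser; purely as an illustration of how typed dependencies can be produced programmatically, the sketch below uses the spaCy library instead (the model name is an assumption, and spaCy’s label set differs slightly from the Stanford scheme):

import spacy

# Requires: python -m spacy download en_core_web_sm (model name is an assumption).
nlp = spacy.load("en_core_web_sm")
doc = nlp("John likes Italian food.")

# Print each dependency in the relation(head-index, dependent-index) style used above.
for token in doc:
    if token.dep_ != "ROOT":
        print(f"{token.dep_}({token.head.text}-{token.head.i + 1}, {token.text}-{token.i + 1})")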


2.12.2 Dependency functions

2.12.2.1 Main functions

main

main element

The main element of a clause is usually a verb, but in a verb-less clause other elements may

serve as a head as well.

Ex: a sentence with a verb

He doesn't know whether to send a gift.

nsubj(know-4, He-1)

aux(know-4, does-2)

advmod(know-4, n't-3)

aux(send-7, to-6)

whether(know-4, send-7)

det(gift-9, a-8)

dobj(send-7, gift-9)

Ex: a sentence without a verb

A comprehensive grammar of the English language

det(grammar-3, A-1)

amod(grammar-3, comprehensive-2)

det(language-7, the-5)


amod(language-7, english-6)

of(grammar-3, language-7)

2.12.2.2 Verb complementation

nsubj

nominal subject

The dependency syntax collapses the classes of formal subject and ordinary subject into one. The subject may also be a non-finite clause, that-clause, WH-clause, etc.

dobj

direct object

The notion of object is wider than that in Quirk, comprising essentially all types of

second arguments, except subject complements. The motivation is that the subtypes of

second arguments are complementary, i.e. they occupy the same valency slot. There are

both simple nominal objects and more complex objects such as a non-finite clause, that-

clause, WH-clause or quote structure.

Ex: John explained that topic

nsubj(explained-2, John-1)

det(topic-4, that-3)

dobj(explained-2, topic-4)

ccomp

clausal complement

A clausal complement is a dependent clause with its own subject that functions as an argument of a verb, as in the example below.


Ex: Mary said John didn't go there

nsubj(said-2, Mary-1)

nsubj(go-6, John-3)

aux(go-6, did-4)

advmod(go-6, n't-5)

ccomp(said-2, go-6)

advmod(go-6, there-7)

iobj

indirect object

Indirect object corresponds to a third argument. The prepositional dative is described

accordingly. Again, the syntactic motivation is that the prepositional phrase occupies the

same valency slot as the indirect object and is semantically equivalent to it.

Ex: I gave him my address.

nsubj(gave-2, I-1)

iobj(gave-2, him-3)

dep(address-5, my-4)

dobj(gave-2, address-5)

What did Pauline give Tom?

Pauline gave it to Tom.


2.12.2.3 Determinative functions

det

determiner

Central determiners (articles) or a determining pronoun. Successive determiners are

linked to each other.

Ex: This is an apple

nsubj(is-2, This-1)

det(apple-4, an-3)

dobj(is-2, apple-4)

2.12.3 Robinson’s axiom

Robinson (1970) formulated four axioms to govern the well-formedness of dependency

structures, depicted below:

1. One and only one element is independent.

2. All others depend directly on some element.

3. No element depends directly on more than one other.

4. If A depends directly on B and some element C intervenes between them (in the linear

order of string), then C depends directly on A or B or some other intervening element.

The first three axioms ensure that dependency structures are trees. Axioms 1 and 2 state that in each sentence only one element is independent and all others depend on some other element. Axiom 3 states that if element A depends on B, it must not depend on another element C; this requirement is referred to as single-headedness. Axiom 4 is called the requirement of projectivity and

disallows crossing edges in dependency trees.
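A minimal sketch (not the thesis implementation) that checks these axioms on a dependency structure given as (head, dependent) pairs over word positions:

# Dependency structure over word positions 1..n, given as (head, dependent) pairs.
def is_well_formed(n, edges):
    heads = {}
    for head, dep in edges:
        if dep in heads:                      # axiom 3: single-headedness
            return False
        heads[dep] = head
    roots = [w for w in range(1, n + 1) if w not in heads]
    if len(roots) != 1:                       # axioms 1 and 2: exactly one independent element
        return False
    for dep in heads:                         # following heads must never revisit a word (acyclicity)
        seen, w = set(), dep
        while w in heads:
            if w in seen:
                return False
            seen.add(w)
            w = heads[w]
    for head, dep in edges:                   # axiom 4: projectivity - every word between head and
        lo, hi = sorted((head, dep))          # dependent must attach inside that span
        for w in range(lo + 1, hi):
            a = heads.get(w)
            if a is not None and not (lo <= a <= hi):
                return False
    return True

# "A man sleeps": sleeps(3) is the root, man(2) heads A(1), sleeps heads man.
print(is_well_formed(3, [(2, 1), (3, 2)]))  # True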

2.12.4 Dependency relation

Let W be a set of nodes and R a relation on W. A mapping M maps W to the actual words of a sentence. Now, for w1, w2 ∈ W, <w1, w2> ∈ R asserts that w1 is dependent on w2. The properties of R impose treeness constraints on dependency graphs, corresponding to Robinson’s axioms.

Ex: Mary loves another Mary
     w1    w2     w3     w4

here, M maps w1 … w4 ∈ W to the words of the sentence, and:

1. R ⊂ W × W

2. ∀ w1, w2, …, wk-1, wk ∈ W: <w1, w2> ∈ R, …, <wk-1, wk> ∈ R ⇒ w1 ≠ wk (acyclicity)

3. ∃! w1 ∈ W : ∀ w2 ∈ W: <w1, w2> ∉ R (rootedness)

4. ∀ w1, w2, w3 ∈ W : <w1, w2> ∈ R ∧ <w1, w3> ∈ R ⇒ w2 = w3 (single-headedness)

2.12.5 Stanford dependency parser by Dan Klein

This parser uses the features of Collins’ parser. Michael Collins, in his ‘Head-Driven Statistical Parser’, showed a mapping from his statistical parser to dependency relation sets. Dan Klein’s Stanford parser deals with tagged words: pairs <w, t>. First the head <wh, th> of a constituent is generated using the ‘Collins head finder’ method, then successive right dependents <wd, td> until a ‘stop’ token is generated, then successive left dependents until a ‘stop’ token is generated. It supports three formats for output:

1. dependencies

2. typedDependencies

3. typedDependenciesCollapsed

For example: Factory payrolls fell in September.

Tagged output: Factory/NN payrolls/NNS fell/VBD in/IN September/NNP

Dependency structure:

nn(payrolls-2, Factory-1)

nsubj(fell-3, payrolls-2)

in(fell-3, September-5)

Fig 2.7


First, fell-VBD is chosen as the head of the sentence; then in-IN is generated to its right, which in turn generates September-NNP to its right, which generates a ‘stop’ token on both sides. Then we return to in-IN, generate ‘stop’ to the right, and so on. The above output is the ‘typedDependenciesCollapsed’ format of the Stanford dependency parse tree. The collapsed format does not make separate nodes for words whose role in a dependency relation is obvious (such as prepositions); instead it turns them into a relation between the two prominent words. In the above example the preposition ‘in’ is used as a relation, or dependency function, between the words ‘fell’ and ‘September’.

For example, the uncollapsed ‘typedDependencies’ format of the above sentence will be:

nn(payrolls-2, Factory-1)

nsubj(fell-3, payrolls-2)

dep(fell-3, in-4)

dep(in-4, September-5)

Fig 2.8


The example shows that the ‘typedDependencies’ format makes a separate node for ‘in’ between ‘fell’ and ‘September’, whereas collapsing it into a relation makes the tree shorter in depth. This thesis uses the ‘typedDependenciesCollapsed’ format because we do not need to look at every word to extract the necessary information.
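To make the difference between the two formats concrete, the toy sketch below collapses a preposition node into a relation name; the function and data layout are illustrative assumptions, not the Stanford parser’s own code.

# Collapsing a preposition node into a relation, turning
# dep(fell, in) + dep(in, September) into in(fell, September).
def collapse_prepositions(deps, prepositions):
    """deps: list of (relation, head, dependent); prepositions: words to collapse."""
    collapsed = []
    for rel, head, dep in deps:
        if dep in prepositions:
            # find what the preposition governs and promote it, using the
            # preposition itself as the relation name
            for rel2, head2, dep2 in deps:
                if head2 == dep:
                    collapsed.append((dep, head, dep2))
        elif head not in prepositions:
            collapsed.append((rel, head, dep))
    return collapsed

uncollapsed = [
    ("nn", "payrolls", "Factory"),
    ("nsubj", "fell", "payrolls"),
    ("dep", "fell", "in"),
    ("dep", "in", "September"),
]
print(collapse_prepositions(uncollapsed, {"in"}))
# [('nn', 'payrolls', 'Factory'), ('nsubj', 'fell', 'payrolls'), ('in', 'fell', 'September')]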


CHAPTER THREE

SYSTEM ANALYSIS AND DESIGN

In the following section of this chapter, existing sentence-level semantic filtering approaches and

methodologies for online social networking communities will be thoroughly examined, and issues

related to these approaches will be highlighted.

The proposed sentence-level semantic filtering approach will also be examined, and its operation procedures, benefits, and feasibility will be presented. Methodologies employed in acquiring the requirements towards the successful implementation of the proposed filtering system will also be discussed.

The design of the filtering system from both perspectives will be discussed, along with its program components.

3.1 System Analysis

System analysis can be defined as the process of analyzing a system with the essential goal of

improving or modifying it. It can also be defined as the methodical study of a system, its current

and future required objectives, and procedures in order to form a basis for the system design.

It is the first of the three major phases in developing an information system. All the system analysis

efforts are directed towards these three basic objectives:

1. Identify system owner and system users.

2. Define what the system will do.


3. Determine the technical, economic and operational feasibility of the proposed system.

The purpose of analysis is to produce a clear requirement specification of the newly designed or

upgraded system efficiently and effectively. It requires the ability to analyze the essential features

of a system.

This knowledge of a system is achieved through the investigation of the system and its

environment.

3.2 Analysis of the existing system

Online social networking sites have become increasingly popular with children, especially young

teens, as a place where they can meet other people, communicate, and exchange information.

However, this medium has encouraged the wide usage of offensive language and has also brought about a fast-growing trend, called cyberbullying, that experts believe is very harmful and that has led teenagers to suicide in extreme cases. People have realized the problems brought by offensive language in online communities, and many efforts have been made at detecting and eliminating offensive language within user messages. The approaches used are discussed below.


3.2.1 Keyword Censoring Approach

Keyword censoring approaches match words appearing in user messages against offensive words stored in a blacklist. Once found, these offensive words are removed, partially replaced (e.g., “b***h”), completely replaced (e.g., “******”), or substituted with family-friendly words (e.g., “naughty”). Because of its simplicity, the keyword-based censoring approach has been widely applied in OSN websites such as YouTube and World of Warcraft. However, the filtering result is not as desired; brutally removing words from a user’s message breaks the readability of the message. Replacing offensive words with symbols usually makes it easy to guess the original offensive words. The idea of substitution seems tempting, but accurate substitution is usually impractical, and inaccurate substitution introduces additional issues. For example, in 2001, Yahoo! deployed an email filter which could automatically replace certain words in emails with family-friendly words. This filter was criticized as a “foolish filter” by BBC News because of its inaccurate substitution.

To demonstrate the shortcoming of keyword censoring approaches, we present an example

below.

Filtering results with Keyword Censoring

Original comment: “What the fuck is wrong with you?”

Keyword Censoring: “What the f**k is wrong with you?”

According to the presented filtering result, readers can still easily understand what the offender wants to say and can even infer the removed word. This indicates that the filtering has failed, because the offensive opinion has still been delivered to the victim. Also, removing words from a sentence without considering their context breaks the readability of the rest of the sentence.
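For illustration, a minimal sketch of this kind of blacklist-based masking is given below. The class name, word list, and masking policy (keeping only the first letter) are hypothetical and merely stand in for whatever lexicon and policy a real deployment would use.

import java.util.*;

// Illustrative sketch of keyword censoring: every blacklisted word is masked with
// asterisks, keeping only its first letter. The blacklist here is a toy example.
public class KeywordCensor {

    private final Set<String> blacklist;

    public KeywordCensor(Set<String> blacklist) {
        this.blacklist = blacklist;
    }

    public String censor(String comment) {
        StringBuilder out = new StringBuilder();
        for (String token : comment.split("\\s+")) {
            String core = token.replaceAll("[^a-zA-Z']", "");      // ignore punctuation in the lookup
            if (!core.isEmpty() && blacklist.contains(core.toLowerCase())) {
                StringBuilder mask = new StringBuilder().append(core.charAt(0));
                for (int i = 1; i < core.length(); i++) mask.append('*');
                token = token.replace(core, mask.toString());
            }
            out.append(token).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        KeywordCensor censor = new KeywordCensor(new HashSet<>(Arrays.asList("fuck", "bitch")));
        // Prints: What the f*** is wrong with you?
        System.out.println(censor.censor("What the fuck is wrong with you?"));
    }
}

As the example above already showed, the masked output still lets the reader reconstruct the offensive word, which is exactly the weakness being discussed.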

Compared with keyword censoring approaches, our proposed semantic filtering approach is much more sophisticated and achieves thorough filtering by utilizing the grammatical relations among the words in a sentence. Given a sentence containing both offensive and inoffensive words, not only the offensive words but also the inoffensive words that help express the offensive opinion are removed during our filtering. In this way, we essentially stop the delivery of the offensive opinion, and there is no way to infer the offensive content of the original message after filtering.

3.2.2 Content Control Approach

Content control approaches are usually deployed on the user side or the ISP side to prevent users from seeing inappropriate content on the Internet. The filtering is usually based on certain criteria, such as the URL address, the occurrence of offensive words, or topic classification. Here our focus is on text-based criteria.

Consider, for example, a sentence-based content control approach whose threshold is set on the number of offensive words in a sentence: if at least one offensive word is detected within a sentence, the filter removes that sentence from the user’s message.

To demonstrate the shortcoming of content control approaches, we present an example below.

Filtering results with Content Control Censoring

Original comment: “What the fuck is wrong with you?”

Content Control Censoring: “ ”
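A minimal sketch of such a sentence-level content control filter is given below; the naive sentence chunking and the threshold of a single offensive word are simplifications, and the class and variable names are hypothetical.

import java.util.*;

// Illustrative sketch of sentence-level content control: any sentence containing at
// least one blacklisted word is dropped entirely, and clean sentences are kept.
public class ContentControlFilter {

    private final Set<String> blacklist;

    public ContentControlFilter(Set<String> blacklist) {
        this.blacklist = blacklist;
    }

    public String filter(String comment) {
        StringBuilder kept = new StringBuilder();
        for (String sentence : comment.split("(?<=[.!?])\\s+")) {   // naive sentence chunking
            boolean offensive = false;
            for (String token : sentence.toLowerCase().split("\\W+")) {
                if (blacklist.contains(token)) { offensive = true; break; }
            }
            if (!offensive) kept.append(sentence).append(' ');      // keep only clean sentences
        }
        return kept.toString().trim();
    }

    public static void main(String[] args) {
        ContentControlFilter filter = new ContentControlFilter(new HashSet<>(Arrays.asList("fuck")));
        // The only sentence contains an offensive word, so the filter prints an empty string.
        System.out.println("\"" + filter.filter("What the fuck is wrong with you?") + "\"");
    }
}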


However, content control approaches are too coarse-grained to be applied in online communities. First of all, an offender can easily bypass the filtering once the estimation criteria are known. More importantly, a sentence in a user comment may contain both offensive and inoffensive content, and the inoffensive part may be removed wrongly because of the offensive part. Preventing users from posting inoffensive content would easily drive users away and thus affect the growth of the community.

Compared with content control approaches, we provide fine-grained filtering by removing only the smallest syntactic part of the sentence that contains offensive language. The inoffensive content of the original message remains; thereby, the user still has the freedom to post inoffensive content. We believe such careful filtering will be more acceptable to online communities.

3.2.3 Manual Filtering Approach

Manual filtering is believed to produce the best filtering results. Basically, user messages are reviewed by a community administrator before being posted on the website.

Filtering results with Manual Filtering

Original comment: “What the fuck is wrong with you?”

Manual Filtering: “What is wrong with you?”

As shown above, the administrator is able to easily understand what the author wants to express and to precisely remove only the offensive content within the message.

However, manual filtering is very time- and labor-consuming, which makes it impossible to apply widely. For example, in the Linda-Ikeji blog community (http://lindaikeji.blogspot.com), the blog administrator manually reviews and filters user comments on some celebrities’ public blogs. Obviously, users should expect a delay between posting a comment and that comment appearing on the blog’s webpage. Further, the filtering relies entirely on the judgment of the community administrator. Our proposed semantic filtering approach mimics the procedure of manual filtering by trying to understand the relations among words in order to remove the offensive content semantically. The proposed semantic filtering approach is fully automatic, requiring no intervention from any administrator.

3.3 Problems of the existing approaches

From the study of the existing approaches and based on the information provided above, the following problems have been identified:

1. With the keyword censoring approach, readers can still easily understand what the offender wants to say and can even infer the removed words. This indicates that the filtering has failed, because the offensive opinion has still been delivered to the victims. Also, removing words from a sentence without considering their context breaks the readability of the rest of the sentence.

2. The content control approaches are too coarse-grained to be applied in online communities. An offender can easily bypass the filtering once the estimation criteria are known, and, more importantly, a sentence in a user comment may contain both offensive and inoffensive content, so the inoffensive part may be removed wrongly because of the offensive part. Preventing users from posting inoffensive content would easily drive users away and thus affect the growth of the community.

3. The manual filtering approach is very time- and labor-consuming. The administrator has to manually review and filter all of the users’ comments and messages, which makes it impossible to apply widely. Also, the filtering relies entirely on the judgment of the community administrator.

3.4 Proposed Filtering Philosophy

The goal of our semantic filtering is to achieve filtering results close to those of manual filtering. To reach this goal, we must first answer the question of how the filtering should be performed in order to obtain the desired results. In this section, we present our answer in three steps. First, we analyze the characteristics of offensive text content in user messages. Then, we introduce our filtering philosophy according to the summarized characteristics. Finally, we show how this philosophy is transformed into heuristic rules applicable in the filtering process.

3.4.1 Offensive Language Text Content

Based on observation of user comments collected from the YouTube website, a sentence in a user message may contain both offensive and inoffensive text content. Offensive text content is posted intentionally with the purpose of bringing negative influence to victims (e.g., the readers of the message). The victim receives the negative influence by reading the offensive part of the sentence and understanding the offensive information it carries.

Hence, the information carried by the original sentence can be represented as

I = Ioff + Iinoff

The offender reaches his goal when the offensive information Ioff is delivered to readers. Therefore, to achieve thorough filtering, all words used to deliver Ioff should be removed. Meanwhile, with respect to free speech, the part carrying Iinoff should be preserved.

3.4.2 Filtering Philosophy

According to this analysis, we propose the philosophy that should be followed in sentence-level offensive language filtering:

1. Precisely identify all offensive content and remove it semantically, so that viewers will not notice the existence of offensive language in the original sentence;

2. Keep the readability and the inoffensive content of the sentence, so that the author is still allowed to express his opinion freely as long as it is not offensive.

This is called the philosophy of “filtering instead of blocking”. To the filter, the philosophy states that if removing one word will make another word meaningless or confusing to readers, we should consider removing both words to keep the readability of the filtered sentence; meanwhile, we only remove words that are affected by offensive words.

For example, in the sentence “Samuel said it and what the fuck is wrong with what he said?”, suppose “fuck” is the only offensive word. The sentence can be separated into two parts: the first part, “Samuel said it”, is inoffensive, but the second part, “what the fuck is wrong with what he said?”, is offensive. Therefore, we should remove the offensive word in the second part while keeping the first part and still leaving the sentence meaningful and readable. That is, we should not produce:

Samuel said it and what the is wrong with what he said? (Wrong)

but rather

Samuel said it and what is wrong with what he said? (Correct)

The words “the” and “fuck” must both be removed in order to keep the filtering transparent as well as to preserve the readability of the filtered text.

3.4.3 Filtering Rules

Specifically, the proposed philosophy is transformed into two heuristic rules to estimate the

impact of removing words in a sentence.

Rule 1 (Modification Relation). In a modification relation, if the modifier is determined to be offensive, removing the modifier alone is enough; if the head is determined to be offensive, both the head and the modifier should be removed.

The modification relation is a binary semantic relationship between two syntactic elements, such as words or phrases. One element is named the head and the other the modifier. The modifier is used to describe the head (i.e., the modified component). Semantically, modifiers describe and provide a more precise definitional meaning for the head. Because the modifier acts as a complement, removing it typically does not affect the grammaticality of the construction. For example, in the sentence “she likes red apples.”, the adjective “red” modifies the noun “apples”. Removing “red” keeps the rest of the sentence readable. We admit that removing modifiers loses some of the information they carry. However, if the modifier is determined to be removable but the head is not, removing the modifier removes only the offensive information.


Rule 2 (Pattern Integrity). If removing the offensive word breaks the integrity of the sentence’s basic pattern, the whole sentence should be removed in order to keep the readability.

English sentences and clauses are organized in basic patterns, such as “Subject-Verb”, “Subject-Verb-Object”, “Subject-Verb-Adjective”, “Subject-Verb-Adverb”, and “Subject-Verb-Noun”. Every sentence or clause can be categorized into one pattern. The integrity of the basic pattern is essential to the readability of the content. For example, the sentence “she sleeps on the sofa.” follows the “Subject-Verb” pattern. If we only remove “sleeps”, the rest of the sentence, “she on the sofa.”, becomes meaningless.

We apply these two rules during the filtering of sentences; a rough sketch of how they could be encoded is shown below.
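The sketch below is one possible encoding of the two rules as removal decisions on a single grammatical relation. The assumption that the governor is the head of a modification relation, and the choice of which relations fall under which rule, are illustrative only; the full mapping used by the filter is developed in the following sections.

// Illustrative sketch of the two filtering rules. Only the decision logic is shown;
// the grouping of relations into "modification" or "pattern" cases is an assumption.
public class RuleSketch {

    // Rule 1 (Modification Relation): an offensive modifier is dropped on its own,
    // but an offensive head drags its modifier along with it.
    static boolean modificationRelationRemovable(boolean headOffensive, boolean modifierOffensive) {
        return headOffensive;                       // remove both sides only when the head is offensive
    }

    // Rule 2 (Pattern Integrity): for core pattern relations such as Subject-Verb or
    // Verb-Object, removing either side breaks the pattern, so both sides must go.
    static boolean patternRelationRemovable(boolean governorOffensive, boolean dependentOffensive) {
        return governorOffensive || dependentOffensive;
    }

    public static void main(String[] args) {
        // amod("apples", "red") with an offensive modifier: only the modifier is removed.
        System.out.println(modificationRelationRemovable(false, true));   // false: keep head and relation
        // dobj("win", "match") with an offensive object: the whole verb-object pair is removed.
        System.out.println(patternRelationRemovable(false, true));        // true: remove both sides
    }
}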

3.5 Identify Removable Content by Grammatical Relations

A text or user message can be decomposed into a sequence of sentences, and each sentence is treated as a unit in filtering. Given a sentence containing both offensive words and inoffensive words, the goal of filtering is to identify the inoffensive words that should be removed together with the offensive words. We define the words that should be removed by the filtering as “removable” words.

We noticed that manual filtering can easily achieve this goal because a human can easily understand the context of the words in a sentence and precisely identify which words should be removed along with the known offensive words. So we mimic manual filtering: we extract the grammatical relations among the words of a sentence and use the proposed filtering rules to estimate, based on the extracted grammatical relations, the impact that removing offensive words has on the other, inoffensive words.


Specifically, the proposed approach consists of two steps. In the first step, we scan the sentence to see whether offensive words exist. If they do, we retrieve grammatical information (i.e., Part-of-Speech tags and typed dependency relations) about the words in the sentence. Using the retrieved grammatical information, we create a tree data structure, named RelTree, for the second-step estimation. In the second step, we propose a set of estimation functions that follow the filtering rules introduced above. Using the RelTree structure and the proposed rules, we then estimate whether there are inoffensive words that should be removed together with the identified offensive words.

The overall idea of our semantic filtering approach is shown in Algorithm 1 below. Within the algorithm, the functions POStagging and TDgenerator generate Part-of-Speech tags and typed dependency relations, respectively. We use existing NLP (Natural Language Processing) tools to implement these two functions. We focus on the design of the two other functions, CreateRelTree and EstimateRelTree.

In this methodology, we assume that the filtering is based on a comprehensive offensive lexicon containing all offensive words. Words that do not appear in the lexicon are considered inoffensive.

input : a text comment T, a blacklist of offensive words Blacklist
output: a filtered text comment T′

T′ ← “”;
senList ← chunk T into a list of sentences;
foreach sentence s ∈ senList do
    scan s for offensive words using Blacklist;
    if no offensive word is found then
        T′ ← T′ + s;
    else
        PTree ← POStagging(s);                              /* get parse tree */
        TDset ← TDgenerator(s);                             /* get typed dependency relations */
        RelTree ← CreateRelTree(PTree, TDset);              /* create RelTree */
        LabelRelTree ← EstimateRelTree(RelTree, Blacklist); /* estimate using RelTree */
        s′ ← remove from s all words labeled “removable” in LabelRelTree;
        T′ ← T′ + s′;
    end
end
return T′;

Algorithm 1: Procedure of Semantic Filtering
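A compact Java sketch of this driver loop is shown below. The four helper functions are left as stubs, since in the actual system they would wrap the NLP tools and the RelTree routines described in the following sections; all names here are illustrative.

import java.util.*;

// Sketch of the Algorithm 1 control flow. The helpers are stubs; only the driver logic is shown.
public class SemanticFilterDriver {

    private final Set<String> blacklist = new HashSet<>(Arrays.asList("fuck", "bitch"));

    public String filter(String comment) {
        StringBuilder filtered = new StringBuilder();
        for (String s : comment.split("(?<=[.!?])\\s+")) {              // chunk into sentences
            if (!containsOffensiveWord(s)) {
                filtered.append(s).append(' ');                         // clean sentence: keep as-is
            } else {
                Object pTree   = posTagging(s);                         // parse tree (stub)
                Object tdSet   = tdGenerator(s);                        // typed dependencies (stub)
                Object relTree = createRelTree(pTree, tdSet);           // build the RelTree (stub)
                Object labeled = estimateRelTree(relTree, blacklist);   // label removable nodes (stub)
                filtered.append(removeRemovableWords(s, labeled)).append(' ');
            }
        }
        return filtered.toString().trim();
    }

    private boolean containsOffensiveWord(String sentence) {
        for (String token : sentence.toLowerCase().split("\\W+"))
            if (blacklist.contains(token)) return true;
        return false;
    }

    // Stubs standing in for the NLP tools and the RelTree algorithms described later.
    private Object posTagging(String s)                              { return null; }
    private Object tdGenerator(String s)                             { return null; }
    private Object createRelTree(Object pTree, Object tdSet)         { return null; }
    private Object estimateRelTree(Object relTree, Set<String> bl)   { return null; }
    private String removeRemovableWords(String s, Object labeled)    { return s; }
}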

3.5.1 First Step: Grammatical Analysis

In the first step, we extract two types of grammatical information from a given sentence. One is the Part-of-Speech information associated with every word; the other is the dependency relations among the words. The Part-of-Speech information helps us understand the organization of a sentence, which is essential for keeping it readable when we try to remove words from it. The dependency relations are used directly to estimate the impact of removing one word on other, semantically related words, making the filtering more “meaningful”. Combining these two types of information, we create a new data structure, called RelTree, for the next-step estimation.

3.5.1.1 Part-of-Speech Tagging

Part-of-Speech tagging has been widely used in Natural Language Processing applications to identify the syntactic properties of lexical items in a sentence, such as words and phrases. Through Part-of-Speech tagging, the sentence can be represented as a tree structure based on Part-of-Speech tags. We adopt the Penn Treebank tag set for our Part-of-Speech tagging.

An example of a Penn Treebank-style parse tree is shown in Figure 1 below.

Figure 1: A parse tree of a sentence based on Part-of-Speech tags


Here, the leaf nodes are words appearing in the sentence. The non-leaf nodes represent syntactic

elements, such as phrases or clauses. Each element consists of the words within its subtree. For

example, the words “said” and “it” constitute a Verb Phrase (i.e. VP) node.

3.5.1.2 Typed Dependency Relations

Typed Dependency is a kind of general relations describing the grammatical dependencies within

a sentence, proposed by Stanford Natural Language Processing Group. Each typed dependency

includes a dependency type and a (governor, dependent) word pair. For example, in the sentence

“what the fuck is wrong with what he said?”, the typed dependency amod(wrong, fuck) means

that “fuck” is an adjectival modifier of an noun phrase containing “wrong”. A typed dependency

may represent the dependent relations between two syntactic elements, not limited to words only.

Fig 2: An example of typed dependency graph


The typed dependencies in a sentence can be represented as a graph. For example, Figure 2 shows the typed dependency relations for the same sentence shown in Figure 1. We explain the relations appearing in Figure 2 from left to right: the nominal subject relation, nsubj(it, Samuel), means that “Samuel” is the syntactic subject of the clause (likewise nsubj(wrong, he)); the copula relation, cop(it, said), means that “it” is the complement of the verb “said” (likewise cop(wrong, is)); the determiner relation, det(fuck, the), means that “the” is a determiner of “fuck”; the adjectival modifier relation, amod(fuck, wrong), means that “fuck” serves as an adjectival modifier of “wrong”; and the conjunct relation, conj_and(it, wrong), means that the coordinating conjunction “and” connects two elements headed by “it” and “wrong”, respectively.
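The sketch below shows one way such a parse tree and its collapsed typed dependencies could be obtained with the Stanford Parser, following the parser’s published demo usage. The model path and the exact class and method names depend on the parser version, so this should be read as an assumed usage rather than the exact code of this project.

import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.trees.*;

// Sketch of extracting a Penn Treebank parse tree and collapsed typed dependencies
// with the Stanford Parser (assumed API, adapted from the parser's demo code).
public class TypedDependencyDemo {
    public static void main(String[] args) {
        LexicalizedParser parser =
            LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
        String sentence = "Samuel said it and what the fuck is wrong with what he said?";

        List<CoreLabel> tokens = PTBTokenizer.factory(new CoreLabelTokenFactory(), "")
                                             .getTokenizer(new StringReader(sentence))
                                             .tokenize();
        Tree parseTree = parser.apply(tokens);                 // Penn Treebank style parse tree
        parseTree.pennPrint();

        GrammaticalStructureFactory gsf =
            parser.treebankLanguagePack().grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
        for (TypedDependency td : gs.typedDependenciesCollapsed()) {
            System.out.println(td);                            // printed as type(governor, dependent)
        }
    }
}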

3.5.1.3 Relation Tree (RelTree)

Both the Part-of-Speech information and the typed dependency relations are utilized in the second-step estimation. The parse tree shows the syntactic organization of the sentence, while the typed dependency relations provide semantic information about the words. To combine both kinds of information, we propose a new data structure called RelTree.

In a RelTree, the leaf nodes are the words of the sentence, and each non-leaf node represents either a phrase or a clause inside the sentence. With each non-leaf node we associate the set of typed dependency relations over the words within its subtree; each node contains only the typed dependency relations that have not already appeared in the nodes of its subtree.
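A RelTree node could be represented by a structure along the following lines; the field names are illustrative and not taken from the project’s code.

import java.util.*;

// Illustrative sketch of a RelTree node. Leaf nodes carry a word; non-leaf nodes carry
// the words of their subtree plus the typed dependency relations first covered at that node.
class RelTreeNode {
    String tag;                                        // POS or phrase tag, e.g. NN, NP, VP, S
    String word;                                       // the word itself (leaf nodes only)
    List<RelTreeNode> children = new ArrayList<>();
    Set<String> wordSet = new HashSet<>();             // words covered by this subtree
    List<String> relations = new ArrayList<>();        // e.g. "det(fuck, the)"
    boolean removable = false;                         // label assigned during estimation

    boolean isLeaf() { return children.isEmpty(); }
}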


Figure 3: A RelTree combining the parse tree and typed dependency relations

input : a parse tree PTree, a set of typed dependency relations TDset
output: a RelTree RelTree

RelTree ← PTree;
remove all word nodes in RelTree;
traverse RelTree in postorder; foreach visited node n do
    if n is a leaf node then
        n.wordset ← {n};                                  /* create word nodes */
    else
        n.wordset ← ∅;
        foreach direct child node ci do
            n.wordset ← n.wordset ∪ ci.wordset;
        end
        n.rel ← ∅;
        foreach relation Ti(Gi, Di) in TDset do
            if Gi ∈ n.wordset and Di ∈ n.wordset then
                n.rel ← n.rel ∪ {Ti(Gi, Di)};
                TDset ← TDset − {Ti(Gi, Di)};
            end
        end
    end
end
return RelTree;

Algorithm 2: Create a RelTree using the parse tree and typed dependency relations

The RelTree data structure is proposed only for the convenience of the offensiveness estimation in the next step. Algorithm 2 shows how the RelTree is constructed. Given the parse tree PTree, the computational complexity of CreateRelTree depends on the post-order traversal and the search in TDset. As the number of relations never exceeds N(N − 1)/2, where N is the number of words in the sentence, the computational complexity is O(N³). This computational complexity is acceptable, and there are many ways to improve the efficiency of the implementation of this algorithm.
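Under the node structure sketched earlier, the construction in Algorithm 2 could look roughly as follows. For simplicity, words are treated as unique strings, whereas the actual implementation would key relations by word position.

import java.util.*;

// Sketch of CreateRelTree (Algorithm 2) over the RelTreeNode sketch above. Relations are
// triples {type, governor, dependent}; each relation is attached to the lowest node whose
// subtree covers both of its words.
class RelTreeBuilder {

    static RelTreeNode createRelTree(RelTreeNode root, List<String[]> tdSet) {
        buildWordSets(root);
        attachRelations(root, new ArrayList<>(tdSet));
        return root;
    }

    static void buildWordSets(RelTreeNode n) {
        if (n.isLeaf()) { n.wordSet.add(n.word); return; }
        for (RelTreeNode c : n.children) {
            buildWordSets(c);
            n.wordSet.addAll(c.wordSet);                     // union of the children's word sets
        }
    }

    static void attachRelations(RelTreeNode n, List<String[]> tdSet) {
        for (RelTreeNode c : n.children) attachRelations(c, tdSet);   // post-order: children first
        Iterator<String[]> it = tdSet.iterator();
        while (it.hasNext()) {
            String[] r = it.next();                                   // {type, governor, dependent}
            if (n.wordSet.contains(r[1]) && n.wordSet.contains(r[2])) {
                n.relations.add(r[0] + "(" + r[1] + ", " + r[2] + ")");
                it.remove();                                          // consumed at this node
            }
        }
    }
}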


3.5.2 Step Two: Bottom-Up Estimation

In the second step, we first use the offensive lexicon to identify the offensive words in the sentence. Each leaf node containing an offensive word is labeled “removable”. Starting from the leaf nodes, we perform a bottom-up estimation through a postorder traversal of the RelTree.

For each non-leaf node in the RelTree, we estimate whether it should be removed based on (1) its associated typed dependency relations and (2) the labels of the child nodes within its subtree. If a non-leaf node is estimated to be “removable”, all of its descendants, including the words within its subtree, are also labeled “removable”. For a non-leaf node, “removable” means that all words, phrases, or even clauses within its subtree have been determined to be removed at the end of filtering. The estimation process itself has two parts: we first estimate based on the typed dependency relations, and then apply a set of heuristic rules as a complement.

3.5.2.1 Estimation with Typed Dependency Relations

Consider a non-leaf node n in a RelTree with a set n.rel of typed dependency relations. Each relation describes a semantic connection between a governor word and a dependent word, and both words are leaf nodes in the subtree rooted at n. The set n.rel can be empty when n has only one child node. For each typed dependency relation in n.rel, we study its semantic information and map it to an estimation function.

These estimation functions and this mapping are created following the Modification Relation and Pattern Integrity rules. Take the Direct Object (dobj) relation as an example. The relation dobj(G, D) is defined as follows: the direct object of the verb phrase containing the governor word G is the noun phrase containing the dependent word D. For example, in the relation dobj(win, match), “win” is the governor word and “match” is the dependent word. According to the Pattern Integrity rule, we know that “Subject-Verb-Object” is a basic pattern. Therefore, if either the phrase with G or the phrase with D is to be removed because of offensiveness, both phrases should be removed together.

To formalize this, we define an estimation function H(T) = H(P(G)) OR H(P(D)) and map the relation dobj(G, D) to it. We use the symbols C(G) and P(G) to denote the clause and the phrase containing the word G as head, respectively. In this estimation function, H(T) is the label to be assigned to the relation T, and H(P(G)) is the label of the phrase node containing G in the RelTree.

Using the estimation functions, we generate a label for every relation associated with node n and then for the node itself. If a relation T(G, D) of node n is estimated and labeled as “removable”, the two child nodes of n containing the words G and D are labeled “removable”. If all relations in n.rel are labeled “removable”, the node n, as well as all of its descendants, is labeled “removable”.
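A fragment of such a mapping might look like the following. Only a few relation types are shown, and their grouping into the OR case, the Rule 1 case, and the AND case is an illustration of the rules above rather than the complete table used by the filter.

// Sketch of mapping typed dependency relations to estimation functions: core pattern
// relations use H(T) = H(P(G)) OR H(P(D)); modification relations follow Rule 1; the
// generic dep relation uses H(T) = H(G) AND H(D). The grouping here is illustrative.
class EstimationFunctions {

    static boolean relationRemovable(String type, boolean governorRemovable, boolean dependentRemovable) {
        switch (type) {
            case "dobj":                      // Verb-Object: removing either side breaks the pattern
            case "nsubj":                     // Subject-Verb: likewise
                return governorRemovable || dependentRemovable;
            case "amod":                      // modification relations (Rule 1): the relation is
            case "det":                       // removable only when the head (governor) is removable;
            case "nn":                        // an offensive modifier is dropped on its own
                return governorRemovable;
            case "dep":                       // uncertain relation: the sides do not affect each other
            default:
                return governorRemovable && dependentRemovable;
        }
    }
}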

3.5.2.2 Estimation with Heuristic Rules

Heuristic rules are also applied as a complement after the typed dependency relation estimation. Applying heuristic rules is necessary for two main reasons. First, a typed dependency relation carries only limited syntactic information. For example, the possessive ending (POS) tag, which is a very common Part-of-Speech tag, is ignored during typed dependency tagging.

Second, not every relation between syntactic elements in a sentence can be classified as one of the typed dependency relations. For such uncertain relations, a generic grammatical relation named dep is defined. To avoid confusing the filter, we map dep to the rule H(T) = H(G) AND H(D), which means that labeling either G or D as removable does not affect the other, nor the label of T. Because the dep relation stands for an uncertain relation, we have to rely on the Part-of-Speech tags in the RelTree for our filtering.

Take the conj tag node rule as an example. The conjunct relation (conj) is a relation between two syntactic elements connected by a coordinating conjunction, such as “and”. The parameters of conj do not include the coordinating conjunction itself; however, the coordinating conjunction explicitly sits between the two parameters of conj. If one side is determined to be removable, the coordinating conjunction should be removed as well. For example, in the sentence “I like A and B”, if either A or B is removed, the coordinating conjunction “and” should also be removed.

Figure 4: Estimate a RelTree in a bottom-up manner


3.5.2.3 Estimation Algorithm

To estimate and assign labels for all nodes in a RelTree, we perform the estimation in a bottom-up manner. Figure 4 shows an example estimation process. The numbers in the circles represent the order in which the nodes of the RelTree are estimated, and the dashed nodes are those estimated as “removable”. For example, the clause node with nsubj(you, fuck) is estimated as “removable”; therefore its two child nodes, containing “you” and “fuck” respectively, are both labeled “removable”. Moreover, the word “and” is removable according to the heuristic rule (the conj tag node rule), in order to keep the filtering transparent to readers. Finally, the inoffensive words “what”, “the”, “is”, “wrong”, “with”, “he”, and “said” are removed together with the offensive word “fuck” in the filtering.

According to Algorithm 2, each typed dependency relation appears exactly once in the RelTree, so no relation is checked repeatedly in the estimation. The cleaned sentence after filtering in this example is “Samuel said it.”. As we can see, the result satisfies the requirements of our proposed filtering philosophy: only the offensive part, “what the fuck is wrong with what he said”, is removed, and the reader can still get the inoffensive information. The detailed estimation algorithm is presented below.

input : a RelTree RelTree, a blacklist of offensive words Blacklist
output: a labeled RelTree LabelRelTree

LabelRelTree ← RelTree;
label all leaf nodes containing offensive words as “removable” in LabelRelTree;
traverse LabelRelTree in postorder; foreach visited node n do
    if n is a leaf node then
        ignore;                                           /* already labeled */
    else if n has only one child node then
        n.label ← n.child.label;
    else
        estimate the label of n from its associated relations and the labels of its
        children, using the proposed estimation functions and heuristic rules;
    end
end
return LabelRelTree;

Algorithm 3: Estimate nodes in a RelTree
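Combining the node structure and the estimation function sketched earlier, the bottom-up pass of Algorithm 3 could be sketched as follows. The relation-string parsing and the label propagation are simplified, and the heuristic rules (such as the conj rule) are omitted here.

import java.util.*;

// Sketch of EstimateRelTree (Algorithm 3) over the earlier RelTreeNode and
// EstimationFunctions sketches: leaves carrying blacklisted words become removable,
// and non-leaf labels are estimated bottom-up from child labels and attached relations.
class RelTreeEstimator {

    static void estimate(RelTreeNode n, Set<String> blacklist) {
        if (n.isLeaf()) {
            n.removable = blacklist.contains(n.word.toLowerCase());
            return;
        }
        for (RelTreeNode c : n.children) estimate(c, blacklist);        // post-order traversal
        if (n.children.size() == 1) {                                   // single child: copy its label
            n.removable = n.children.get(0).removable;
            return;
        }
        boolean allRemovable = !n.relations.isEmpty();
        for (String rel : n.relations) {                                // e.g. "dobj(win, match)"
            String type = rel.substring(0, rel.indexOf('('));
            String gov  = rel.substring(rel.indexOf('(') + 1, rel.indexOf(',')).trim();
            String dep  = rel.substring(rel.indexOf(',') + 1, rel.indexOf(')')).trim();
            boolean removable = EstimationFunctions.relationRemovable(
                    type, childLabel(n, gov), childLabel(n, dep));
            if (removable) markChildrenContaining(n, gov, dep);         // both sides of the relation go
            else allRemovable = false;
        }
        if (allRemovable) markAll(n);                                   // every relation removable: remove node
    }

    static boolean childLabel(RelTreeNode n, String word) {
        for (RelTreeNode c : n.children)
            if (c.wordSet.contains(word)) return c.removable;
        return false;
    }

    static void markChildrenContaining(RelTreeNode n, String gov, String dep) {
        for (RelTreeNode c : n.children)
            if (c.wordSet.contains(gov) || c.wordSet.contains(dep)) markAll(c);
    }

    static void markAll(RelTreeNode n) {
        n.removable = true;
        for (RelTreeNode c : n.children) markAll(c);
    }
}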


CHAPTER FOUR

IMPLEMENTATION

4.1 JUSTIFICATION OF THE PROGRAMMING LANGUAGES USED

The offensive language filtering system is an online application implemented using HTML, JavaServer Pages (JSP), JavaScript, and the MySQL relational database.

4.1.1 HTML

HTML, which stands for Hypertext Markup Language, is the predominant markup language for web pages. It provides a means to create structured documents by denoting structural semantics for text, such as headings, paragraphs and lists, as well as for links, quotes and other items. It allows images and objects to be embedded and can be used to create interactive forms. It is written in the form of HTML elements consisting of “tags” surrounded by angle brackets within the webpage content. It can include, or can load, scripts in languages such as JavaScript, which affect the behaviour of HTML processors like web browsers, and Cascading Style Sheets (CSS) to define the appearance and layout of the text and other material.


4.1.2 JAVASCRIPT

JavaScript has been around for several years now, in many different flavors. The main benefit of

JavaScript is to add additional interaction between the web site and its visitors at the cost of a

little extra work by the web developer. JavaScript allows industrious web masters to get more out

of their website than HTML and CSS can provide.

By definition, JavaScript is a client-side scripting language, which means that the web surfer's browser runs the script. The opposite of client-side is server-side, as with a language like PHP, whose scripts are run by the web hosting server.

There are many uses (and abuses!) for the powerful JavaScript language. Here, it is being used

for:

Alert Messages

Popup Windows

HTML Form Data Validation

4.1.3 JAVASERVER PAGES (JSP)

“JSP is an HTML-embedded scripting language. JSP's goal is to allow developers to write dynamically generated pages quickly.” It is a server-side technology specifically designed for creating dynamic web pages. JSP allows you to:

Reduce the time needed to create large websites.

Create a customized user experience for visitors based on information that you have gathered from them.

Open up thousands of possibilities for online tools.

Unlike some other server-side technologies, JSP is an open source product.

When someone visits your JSP webpage, your web server processes the Java code. It then determines which parts it needs to show to visitors (content and pictures) and hides the other parts (file operations, calculations, etc.), then translates your JSP into HTML. After the translation into HTML, it sends the webpage to your visitor's web browser.

4.1.4 MYSQL

MySQL is the most popular open source database server in existence because of its consistently fast performance, high reliability and ease of use. It is used in more than 6 million installations, ranging from large corporations to specialized embedded applications, on every continent in the world. It is very commonly used in conjunction with PHP scripts to create dynamic and powerful server applications. MySQL has been criticized in the past because it does not have all the features of other Database Management Systems. However, MySQL continues to improve significantly with each major upgrade, and it owes much of its popularity to these improvements.


4.1.5 CSS

Cascading Style Sheets (CSS) are a way to control the look and feel of HTML documents in an organized and efficient manner. With CSS you will be able to:

Add new looks to your old HTML

Completely restyle a web site with only a few changes to your CSS code

Use the “style” you create on any webpage you wish

4.2 System Specification

The system specification is divided into two parts:

1. Hardware Specification

2. Software Specification

4.2.1 HARDWARE SPECIFICATION FOR THE APPLICATION

Any computer tagged by the manufacturer as a workstation can be used to access this application through an internet browser, but the following minimum specification is required to host the application:

1. A computer tagged by the manufacturer as a server

2. A Core 2 Duo processor or better

3. 2 GB of memory

4. A keyboard and a mouse

5. A hard disk of 120 GB or more

4.2.2 SOFTWARE SPECIFICATION FOR THE APPLICATION

Windows Server 2005 and above

Microsoft .NET framework version 3.0 and above must be installed

Microsoft SQL Server 2005 and above should be installed

Microsoft Internet Information Server (IIS) should be enabled

Server FTP capability must be enabled

4.3 System Implementation

This section briefly describes the screens of the online application.

4.3.1 Application Login Screen

The system provides a secure login panel that requires a combination of email address and password; the email address is used because it is unique to each user.


Fig 4.1 – Web Application Login Screen

4.3.2 Application Registration Page

FIG. 4.2 – Web Application Registration Page


Here the user fills in his/her details, and the system verifies that all the details provided are correct. The page also includes a CAPTCHA image, which acts as a spam guard to ensure that the data was entered by a human and not a robot.

4.3.3 Post and Comment Page

FIG. 4.3 – Filtered Post Page Using Keyword Censoring Approach


FIG. 4.4 – Filtered Post Page Using Content Control Censoring Approach

FIG. 4.5 – Filtered Post Page Using FOLOC Censoring Approach


Looking at the three post-and-comment pages above, we can see that our proposed semantic filtering approach mimics the procedure of manual filtering by trying to understand the relations among words, and that it has removed the offensive content semantically. The proposed semantic filtering approach is fully automated, requires no intervention from any administrator, and at the same time eliminates the offensive words in the sentence.

“What the fuck is wrong with you?” has been changed to “What is wrong with you?” by the proposed semantic filtering approach, instead of “what the f*** is wrong with you?”, which would still deliver the offensive word to the victim.

Our semantic filtering result is thus very close to that of manual filtering; the desired result has been produced simply by applying the heuristic rules in the filtering process.

FIG. 4.6 – Filtered Post Page Using Keyword Censoring Approach

FIG. 4.7 – Filtered Post Page Using Content Control Censoring Approach

FIG. 4.8 – Filtered Post Page Using FOLOC Censoring Approach


Looking at the three post-and-comment pages above in Fig 4.6, 4.7 and 4.8, we can see that our proposed semantic filtering approach again mimics the procedure of manual filtering by trying to understand the relations among words, and that it has once more removed the offensive content semantically. The proposed semantic filtering approach is fully automated, requires no intervention from any administrator, and at the same time eliminates the offensive words in the sentence.

“I have told all these bitches to stop calling my husband’s phone” has been changed to “I have told all to stop calling my husband’s phone” by the proposed semantic filtering approach, instead of “I have told all these b****** to stop calling my husband’s phone”, which would still deliver the offensive word to the victims.

Our semantic filtering result is again very close to that of manual filtering; the desired result has been produced simply by applying the heuristic rules in the filtering process.


CHAPTER FIVE

SUMMARY, CONCLUSION AND RECOMMENDATIONS

5.1 Summary and Conclusion

Online social networking sites have become increasingly popular with children, especially young teens, as places where they can meet other people, communicate, and exchange information. This has also brought cyberbullying, a fast-growing trend that experts believe is more harmful than typical schoolyard bullying. Nearly all of us can be contacted 24/7 via online social networking communities, so victims can be reached at any time and in any place. For many children, home is no longer a refuge from the bullies: children may escape threats and abuse in the classroom, only to find offensive comments and posts from the same tormentors when they arrive home. There is no safe place any more, and one can be bullied 24/7, even in the privacy of one’s own bedroom.

However, we are not only trying to filter out offensive content; we are also making sure that the filtered sentences still make sense. Statistical analysis has revealed that more than 60% of insulting messages are posted as direct insults, and direct insulting messages always contain insulting words or phrases. From a psychological point of view, if these messages are identified and users are restricted from sending them, the human intention to post or exchange abusive messages can be significantly reduced.

Offensive language is a serious problem facing the online community. Our semantic filtering technique is based on the grammatical relations of the words in a sentence, so that the rest of the filtered sentence remains readable and the existence of offensive words in the original sentence is hard to notice. We tested the effectiveness of our approach on a large dataset, and the results show that our techniques are effective and accurate, with little processing overhead.

5.2 Recommendation

Our future work includes looking at the issues described in the discussion section. Moreover, as the most time-consuming part of semantic filtering is the sentence parsing process, we will examine other lightweight NLP techniques to speed up sentence parsing. Last but not least, we also plan to extend our filtering approach to support other languages, such as Chinese and French.