Creating a Computational Lexicon of
Alcohol Consumption from Twitter
Developed by Evghenii Varghin
Supervised by Sophia Ananiadou
School of Computer Science
University of Manchester
Computer Science with Business and Management
April 2016
Contents
Introduction
1.1 Project overview
1.2 Report structure
1.3 Motivation
1.4 Goals and objectives
Background
2.1 Social media
2.1.1 Twitter API
2.2 Text mining and normalization
2.2.1 Computational lexicology
2.3 Lexicon evaluation
2.3.1 Related work
Design
3.1 Requirements
3.2 Database structure
3.3 Term finder
3.4 Twitter search
3.5 Cleaning and normalizing data
3.6 Information extraction
Implementation
5.1 Collecting and storing tweets
5.2 Data normalization
5.2.1 Tokenization
5.2.2 Slang and acronym normalization
5.3 Part-of-speech tagging
5.4 Named-entity recognition
Results and evaluation
6.1 Results
6.2 Evaluation of the extracted corpus
6.2.1 Search keywords comparison
6.2.2 Alternative computational approach to corpus evaluation for future work
Conclusions
7.1 Work overview
7.2 Future work
7.3 Summary
References
Appendix 1
Acknowledgements
I would like to take this opportunity to thank my supervisor Sophia Ananiadou
for her continuous support and valuable feedback throughout the stages of
project development. This was a completely new field for me, and I would not
have been able to achieve such progress without her guidance.
I would also like to thank my parents from the bottom of my heart for
providing me with this amazing opportunity to study abroad. I cannot express
how much this means to me and what opportunities have opened to me as a
result of their decisions and support.
Part 1
Introduction
1.1 Project overview
Information is the key building block in today's world of Big Data. Extracting information
and making sense of it has become a valuable process, not only for large corporations but
also in scientific, social and other research. There are multiple sources to extract
information from: for businesses it can be internal or market data, and for researchers it can
be interviews or questionnaires. However, another source is quickly becoming one of the
largest of all: social media. Social media has seen incredible growth in the past couple of
years, with Twitter alone having almost 320 million monthly active users [1] who post,
discuss, comment and share information via the platform. Through these activities users
create an extraordinary pool of information that can be used for many different purposes.
This project focuses on how to extract such information and build a lexicon on a specific
topic from the complete pool of Twitter data.
1.2 Report structure
This report is broken down into six parts. The remainder of this part discusses the personal
motivation for choosing this project and its main goals and objectives.
The second part provides an overview and background of the technical concepts that were
used during the creation of this program, to which this project is related, or which will be
applied in the future.
The third and fourth parts focus on the design and the implementation of the technical tools
in this program.
Finally, the "Results and evaluation" part outlines what was achieved, and the last part
provides a reflective summary of the completed tasks, as well as the future of this project.
1.3 Motivation
I believe that information extraction and analysis already drive today's fast-paced world. As
sources of information grow and their structure becomes more and more complicated,
developers are coming up with increasingly sophisticated data mining tools. Understanding
the importance of such tools and their potential impact on the world, I wanted to create a
project that would use some of these tools and techniques, in order to learn how they can
be applied to solve text mining tasks.
Another part of the motivation behind choosing this project was to become comfortable
working in an area with which I was totally unfamiliar before the start of the project. I
wanted to develop a skillset that would allow me to face any new tools and programs with
confidence, having already proven to myself that I can overcome complicated learning
curves to complete a task.
Finally, I wanted to contribute to a social project. Coming into this degree and throughout
my first years at the university, I was always looking for ways in which the computer science
skills I had developed could be applied to real-world problems. I was fascinated to see how I
could apply my knowledge of programming to develop a project that could potentially help
to tackle some of the health and social problems in our society.
1.4 Goals and objectives
This project's main goal was to develop a computational lexicon of alcohol consumption
from Twitter. The collected and normalized data would allow doctors and psychologists to
analyse and study the corpus related to alcohol consumption in a structured and efficient
manner. Such studies could uncover potential relationships between certain types of
Twitter posts and alcoholism, giving doctors more information about the different causes of
alcoholism and allowing them to take proactive action in helping people to prevent this
disease.
Given the overall size of the task, this project was designed to provide a starting ground for
such research. The following list of objectives was developed for this project:
i. Develop a program that would analyse data from a given section of a health forum
in order to find valid keywords for the Twitter search.
ii. Collect data from Twitter and store it for further processing.
iii. Clean and normalize the collected data.
iv. Extract meaningful information from the normalized corpus.
v. Develop a flexible program that would allow researchers to collect and analyse
data from other domains.
vi. Outline evaluation methods that are required to create the computational
lexicon.
Part 2
Background
2.1 Social media
Social media has become one of the biggest internet sensations of the past decade. Not
only has it had a great impact on people's lives, allowing them to stay better connected [2],
but it has also become one of the largest sources of information worldwide. This report will
outline the two social media sources that were used in the project: Twitter and health forums.
Twitter is an online social media platform that allows its users to send and read short (up to
140 characters) messages. This service is referred to as microblogging. Twitter's metadata
primarily consists of user-mentions ("@") and hashtags ("#"). Hashtags are often used to add
sentiment to users' posts (e.g. #happy, #fun) [3], while user-mentions allow people to
address their tweets to specific users or entities. Since 2013, Twitter has included an
additional set of metadata, covering geo-location, language, time and other information for
each tweet [4] [5].
Health forums are another type of social media platform that allows people to share health
practices with others and seek advice, anonymously or under their own identity, regarding
any issues they might have. Such forums are typically monitored by qualified professionals
to make sure any harmful content is promptly removed [6]. As health has become the most
widely searched topic on the Internet [7], health forums contain large volumes of
information that can be used for research purposes.
2.1.1 Twitter API
Twitter provides two different application programming interfaces (APIs) for developers to
access its data: Search API and Streaming API.
The Search API is part of Twitter's REST API. It allows developers to access and collect
historical Twitter data published in the past 7 days. The Search API provides a set of tools
that allow developers to test, modify and validate search queries in order to get the most
reliable results. It has a rate limit of 450 requests per 15 minutes when using
application-only authentication [8].
The Streaming API is used for collecting tweets in real time as they are published, provided
they match the search criteria [9]. The Streaming API can collect a much higher flow of
tweets (180k tweets per hour versus 72k tweets per hour for the Search API); however, it
does not offer the tools to create more advanced search queries, making the overall range
of collected data narrower [10].
2.2 Text mining and normalization
Text mining is a sub-category of data mining and is primarily concerned with deriving high-
quality information from unstructured text. This is usually achieved through the application
of various computer tools and techniques and can be applied to tasks of varied complexity.
Beyond a simple count of the number of occurrences of a specified word in a document,
text mining tools can identify and semantically classify named entities, as well as the
relationships between them [11].
One of the main objectives of text mining tools is text normalization before further
processing. Text normalization ensures consistency of the generated output, dealing with
instances of ambiguity in text. There are multiple natural language processing techniques
for lexical normalization, some of which are listed below:
Tokenization – the tokenization task is often considered a required prerequisite
before applying any other NLP tools and methods to the extracted corpus. It refers
to splitting the input sequence of characters into words, punctuation symbols or
other meaningful basic linguistic units [12]. Splitting a string is typically performed
based on the occurrences of whitespace characters; however, many algorithms use
more sophisticated methods (e.g. "shouldn't" -> {"should", "n't"}).
Word stemming – word stemming is performed to achieve more consistency across
the lexicon. The process can be summarized as "reducing inflected (or sometimes
derived) words to their word stem, base or root form" [13]. To solve this task for a
complete corpus, multiple stemming algorithms have been developed [14].
Part-of-speech tagging – POS tagging refers to the process of assigning an
appropriate tag to each token in an input character sequence, based on its meaning
and the context in which it appears [15]. Because a single word can act as different
parts of speech, such tagging algorithms often use a probabilistic approach to assign
the correct labels.
2.2.1 Computational lexicology
Computational lexicology is a branch of computational linguistics concerned with the use of
computers in the study of the lexicon. It has contributed to a wide range of tasks related to
how computers process, interpret and produce human language [16]. In doing so, studies in
computational lexicology have also identified the current limitations of print dictionaries for
computational purposes [17].
2.3 Lexicon evaluation
Evaluation of the results is one of the most important aspects of creating a computational
lexicon. There are multiple techniques that can be used for lexicon evaluation to ensure the
autonomy and reliability of any text mining work [18]. This report will explore the following
lexicon evaluation methods:
Evaluation of the extracted corpus – The typical evaluation of text mining techniques is
performed against "gold standards" [19]. This part of the evaluation is concerned with
the relevance of the data that has been retrieved. This project was aimed at creating a
computational lexicon of alcohol consumption from Twitter; therefore, the extracted
corpus was evaluated on its relevance to this topic, as opposed to generic
alcohol-related discussion.
Evaluation against existing resources – A computational lexicon can be evaluated
against an existing resource of the same or a similar domain. Such an evaluation can
demonstrate how the newly created lexicon complements or adds to an existing one,
improving its coverage of the domain.
Domain-specific evaluation – A computational lexicon can be evaluated by domain
experts. This usually requires manual annotation of the gathered corpus, preferably by
two or more domain experts. The annotations produced by the text mining tools are
then compared with those provided by the domain experts, thus evaluating the
effectiveness and reliability of such tools.
2.3.1 Related work
Related work on lexicon evaluation can be found in almost any paper concerned with the
creation of a computational lexicon, as it is an integral part of such research. This project
has consulted and studied the tools and evaluation methods used by Thompson et al. in the
creation of the BioLexicon [20]. Some of the evaluation methods used by Thompson et al.
included evaluation of normalization rules according to ambiguity and variability metrics,
and evaluation of the created lexicon against existing resources – WordNet [21] and the
SPECIALIST lexicon [22]. All performed evaluations demonstrated positive results,
highlighting the effectiveness and efficiency of the text mining techniques used to create
the BioLexicon.
Part 3
Design
3.1 Requirements
Due to the large scale of the project, the final requirements were narrowed down and
simplified to accommodate the available time and knowledge constraints. One of the main
goals and requirements for this project was becoming familiar with the tasks of text mining
and lexical normalization, thus an initial high-priority requirement involved collecting and
studying a number of related works in this domain.
Other requirements were gathered at different stages of project development. The final list
of program requirements and their priorities is outlined in the table below.
Term finder
  1. Collect data from health forum (Priority: Low)
  2. Analyse data to identify correct search keywords (Priority: Low)
Corpus extraction
  3. Extract twitter corpus based on identified search keywords (Priority: High)
  4. Design a database structure for the extracted corpus (Priority: Medium)
Text normalization
  5. Clean extracted corpus from noisy data (non-ASCII characters, links, twitter symbols) (Priority: Medium)
  6. Normalize slang and acronyms (Priority: High)
  7. Identify named entities in text (names, locations, organizations) (Priority: High)
Information extraction
  8. Extract verbs, nouns and adjectives to find most occurring words (Priority: High)
  9. Build visual graphs based on extracted information (Priority: Low)
3.2 Database structure
The first stage of the design process was to develop an appropriate data structure to store
the collected Twitter corpus. This included deciding what type of database to use and how
the data would be stored inside it.
A comma-separated values (CSV) file was chosen as the final database type. As the required
output of the program was designed to be a computational lexicon, a CSV-style database
added flexibility and an easy-to-read structure to the data. A CSV file can be easily opened
with Microsoft Excel or similar software, providing a clear structure and the ability for the
user to manually view and modify the content.
Database columns design:
SourceTweet – the unmodified version of the extracted tweet. Stored for validation
and evaluation purposes.
TweetID – the unique identifier of the extracted tweet. Allowed the program to
remove any duplicate tweets that could have been collected using different search
keywords.
Location – if the user has enabled the location tracking option on their device, this
field provides the geo-location of the tweet in the format
GeoLocation{latitude=39.6503, longitude=-75.7923}. Stored for information
extraction purposes.
Time – timestamp of the tweet. Displays the time in the time zone of the location
from which the tweet was sent, not where it was collected. Stored for information
extraction purposes.
hasLink – boolean variable indicating whether the tweet contained a link. Stored for
statistics-gathering purposes.
hasNonASCII – boolean variable indicating whether the tweet contained a non-ASCII
character. Stored for statistics-gathering purposes.
hasSlang – boolean variable indicating whether the tweet contained a slang or
acronym word. Stored for statistics-gathering purposes.
hasOOVWord – boolean variable indicating whether the tweet contained a word
that could not be found in the English or slang dictionaries. Stored for
statistics-gathering purposes.
hasNE – boolean variable indicating whether the tweet contained a Named Entity
(name, location, organization). Stored for statistics-gathering purposes.
3.3 Term finder
The term finder tool was designed during the Twitter corpus collection stage. Its main
purpose was to confirm the validity of the previously selected search keywords in an
automated way.
The approach was to develop a separate program that collected data and text from a
resource appropriate to the alcohol consumption domain. For this purpose, a section of a
health forum related to alcoholism was identified as a relevant resource [23]. The program
was designed to loop through the last 30 pages of the forum section (600 topics), collecting
all the text from topic discussions and replies. The most used words and terms from this
corpus were then analysed to identify reliable search keywords for Twitter extraction.
3.4 Twitter search
As the primary objective of this project was creating a computational lexicon, the main
focus of the program was to collect, normalize and evaluate historical Twitter data. To
satisfy this requirement, the Twitter Search API was selected over the Streaming API as the
primary method of searching for and collecting tweets.
The Search API was found to be useful and reliable, as there was enough available data to
collect. Given this, the API's focus on search relevance helped to maximize the number of
domain-specific tweets that were extracted.
Twitter extraction was designed to search Twitter's database for tweets published in the
past 7 days and store the results, in order to provide a good base Twitter corpus for further
normalization and text analysis.
3.5 Cleaning and normalizing data
In order to create a computational lexicon, all collected data must be cleaned of noise and
normalized to achieve maximum consistency across the text.
This part of the program was designed to complete the following tasks:
Cleaning the Twitter noise – Twitter data can contain a large amount of noise
because a wide range of characters is allowed in a tweet. These include emoticons,
symbols and characters from other languages. As only tweets written in English
were considered relevant for this project, such characters (all referred to as
non-ASCII characters) were to be excluded from the corpus. All links and pictures
found in the tweet text were also to be excluded as part of this task.
Normalizing the Twitter text – performing text normalization was an important part
of achieving the final result. This included applying text mining methods to achieve
consistency in letter capitalization, punctuation and other aspects of the text. This
would produce a normalized corpus that could reliably be processed by natural
language processing tools.
Normalizing slang and acronyms – this is an extension of the Twitter text
normalization task that specifically focuses on normalizing any detected occurrences
of slang and acronyms. The use of slang and acronyms is popular among Twitter
users due to the character limit of a single tweet, so this project was designed to
address this task as effectively as possible in order to create a reliable and consistent
Twitter corpus.
3.6 Information extraction
For the purpose of this project, a simple information extraction module was designed. As
per the requirements, the project aimed to perform part-of-speech tagging on the extracted
Twitter corpus in order to collect all adjectives, nouns and verbs from the text. The list of
unique words for each part of speech would then be stored in separate files together with
the count of occurrences of each word.
This task was designed to provide a better understanding of the created lexicon. Analysing
the most used adjectives or verbs in the corpus could provide insight into the overall
sentiment of the alcohol consumption lexicon. Furthermore, with additional functionality
this information extraction can be considered a starting point for further lexicon analysis,
such as analysing the contextual use of selected parts of speech by retrieving all tweets in
which the analysed word has been used.
Part 5
Implementation
This chapter will discuss what implementation methods and tools were used to develop this
project.
5.1 Collecting and storing tweets
As previously mentioned in the Design chapter, collecting and storing tweets was an integral
part of the project. First, search keywords had to be identified by creating and running the
term finder program.
This program was created using the Ruby programming language [24]. Ruby was selected
for this task because of its efficient libraries (gems) for web-page handling: Nokogiri and
Watir-Webdriver. Using these gems, the Ruby program was able to loop through the given
number of topics, collecting all relevant text and storing it in a file. The next part of this
sub-task was to analyse the extracted text and identify correct search keywords based on a
more sophisticated method than a simple word count. For this purpose the TerMine tool,
developed by the National Centre for Text Mining, was used. TerMine is a system for
terminological management featuring term extraction and acronym recognition. The term
extraction employs the C-value method, which incorporates linguistic filters and statistics
for recognising terms, making it ideal for this task [25].
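For reference, the C-value measure reported in [25] can be written as follows (a sketch of the published formula in LaTeX notation, not a description of TerMine's internal implementation):

    C\text{-value}(a) = \log_2 |a| \cdot f(a), \quad \text{if } a \text{ is not nested}
    C\text{-value}(a) = \log_2 |a| \cdot \Big( f(a) - \frac{1}{P(T_a)} \sum_{b \in T_a} f(b) \Big), \quad \text{otherwise}

where |a| is the length of the candidate term a in words, f(a) is its frequency of occurrence in the corpus, T_a is the set of longer candidate terms that contain a, and P(T_a) is the number of those terms.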
The next step was to use the extracted terms to search for and collect data from Twitter.
There are a number of libraries that applications can use to access Twitter's data. The
Twitter4j Java library [26] was identified as the best fit for this purpose because of its good
documentation and readability. A Java program using this library was developed to search
for and collect Twitter data in a secure and effective manner.
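A minimal sketch of how such a Twitter4j search might look is shown below; the credentials, query phrase and request parameters are illustrative assumptions rather than the exact values used by the program.

    // Minimal Twitter4j search sketch (credentials and query values are placeholders).
    import java.util.List;
    import twitter4j.Query;
    import twitter4j.QueryResult;
    import twitter4j.Status;
    import twitter4j.Twitter;
    import twitter4j.TwitterException;
    import twitter4j.TwitterFactory;
    import twitter4j.conf.ConfigurationBuilder;

    public class TweetCollector {
        public static void main(String[] args) throws TwitterException {
            ConfigurationBuilder cb = new ConfigurationBuilder()
                    .setOAuthConsumerKey("CONSUMER_KEY")            // placeholder credentials
                    .setOAuthConsumerSecret("CONSUMER_SECRET")
                    .setOAuthAccessToken("ACCESS_TOKEN")
                    .setOAuthAccessTokenSecret("ACCESS_TOKEN_SECRET");
            Twitter twitter = new TwitterFactory(cb.build()).getInstance();

            // One of the search phrases identified by the term finder.
            Query query = new Query("\"drinking beer\"");
            query.setLang("en");   // only English tweets are relevant to the lexicon
            query.setCount(100);   // maximum number of tweets per request

            QueryResult result = twitter.search(query);
            List<Status> tweets = result.getTweets();
            for (Status tweet : tweets) {
                System.out.println(tweet.getId() + " " + tweet.getText());
            }
        }
    }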
Finally, the gathered tweets were stored locally for further processing. As defined in the
design stage, a CSV file was chosen as the tweet storage method, using the pipe character
("|") to separate the columns of data. As any occurrence of this character in a tweet would
split the data into extra columns, pre-processing was required to remove all occurrences of
the pipe character from tweets, in order to ensure a consistent and reliable database
structure.
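The following sketch illustrates writing one pipe-separated row per tweet in the column order described in the Design chapter; the file path and helper method are hypothetical, and the boolean flag columns are omitted for brevity.

    // Sketch of appending one pipe-separated row per tweet (file name and method are illustrative).
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import twitter4j.Status;

    public class TweetCsvWriter {
        public static void appendTweet(Status tweet, String path) throws IOException {
            // Remove the separator character from the tweet text so it cannot break the column layout.
            String text = tweet.getText().replace("|", " ").replace("\n", " ");
            String location = tweet.getGeoLocation() == null ? "" : tweet.getGeoLocation().toString();
            String row = String.join("|",
                    text,                                   // SourceTweet
                    String.valueOf(tweet.getId()),          // TweetID
                    location,                               // Location
                    String.valueOf(tweet.getCreatedAt()));  // Time
            try (PrintWriter out = new PrintWriter(new FileWriter(path, true))) {
                out.println(row);
            }
        }
    }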
5.2 Data normalization
The first part of the data normalization task required cleaning all tweets of noisy data. This
included removing all non-ASCII characters, links and Twitter-specific characters from the
extracted corpus. The task was performed using regular expressions (RegEx), which
identified all occurrences of the selected characters and removed them from the text with a
rule-based approach. Some examples of this implementation are shown below.
Removing all non-ASCII characters ("[^\\x00-\\x7F]"):
Input: “The good Lord has changed water into wine, so how can drinking beer be a
sin? –Belgium”
Output: “the good lord has changed water into wine so how can drinking beer be a
sin ? belgium”
Removing all links (“https\\S+|http\\S+”):
Input: “Got to drink the local beer while in Roswell. - Drinking a Roswell Alien Amber
Ale @ Billy Rays Bar - https://t.co/qY6OAmy703 #photo”
Output: “got to drink the local beer while in roswell drinking a roswell alien amber
ale at billy rays bar photo”
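A compact sketch combining the two rules quoted above with the lower-casing visible in the example outputs is given below; the actual program applies further symbol and punctuation rules that are not reproduced here.

    // Rule-based cleaning sketch using the regular expressions quoted above.
    public class TweetCleaner {
        public static String clean(String tweet) {
            return tweet
                    .replaceAll("https\\S+|http\\S+", " ")  // remove links
                    .replaceAll("[^\\x00-\\x7F]", " ")      // remove non-ASCII characters (emoticons, other alphabets)
                    .replace("@", " at ")                   // "@" appears as "at" in the example outputs (assumption)
                    .replace("#", " ")                      // the hashtag symbol is dropped, the word itself is kept
                    .replaceAll("\\s+", " ")                // collapse repeated whitespace
                    .trim()
                    .toLowerCase();                         // lower-casing, as seen in the example outputs
        }
    }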
The next part of text normalization required tokenization and slang handling.
5.2.1 Tokenization
Tokenization is a necessary step before more sophisticated NLP methods can be applied to
the text. For this part of the implementation, the Apache OpenNLP libraries [27] were
selected over the other options (e.g. Stanford CoreNLP) because of their good
documentation.
The Apache OpenNLP Tokenizer segments the input character sequence into tokens. As in
most other systems, the process is performed in two stages: first, sentence boundaries are
identified, then tokens within each sentence are identified.
Input: "someone is drinking beer with a straw. stop him"
Output: {"someone", "is", "drinking", "beer", "with", "a", "straw", "."} {"stop", "him"}
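A sketch of this two-stage OpenNLP set-up is shown below, assuming the standard pre-trained en-sent.bin and en-token.bin models are available on disk.

    // Two-stage tokenization sketch with Apache OpenNLP (model file paths are assumptions).
    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TweetTokenizer {
        public static void main(String[] args) throws Exception {
            try (InputStream sentIn = new FileInputStream("en-sent.bin");
                 InputStream tokIn = new FileInputStream("en-token.bin")) {
                SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(sentIn));
                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));

                String tweet = "someone is drinking beer with a straw. stop him";
                for (String sentence : sentenceDetector.sentDetect(tweet)) {  // stage 1: sentence boundaries
                    String[] tokens = tokenizer.tokenize(sentence);           // stage 2: tokens within a sentence
                    System.out.println(String.join(" | ", tokens));
                }
            }
        }
    }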
5.2.2 Slang and acronym normalization
Slang and acronym normalization was given the most focus in this project, as slang and
acronyms occur frequently in social media and their handling is crucial for creating a
consistent and reliable lexicon. It was important to ensure that all terms and acronyms with
the same meaning are stored identically, so that NLP tools achieve the best efficiency in
future analysis.
For this purpose, the most complete available lists of English words and slang were used for
look-up operations. A local database of slang words and their meanings was created from
an online resource (noslang.com) [28], which contained more than 5,000 English slang terms
and acronyms. A database of more than 300,000 English words was also used in the look-up
operations. If a word was not found in the English dictionary, a look-up against the slang
dictionary was performed, replacing the acronym with its fully spelled-out meaning
(e.g. "b4" -> "before").
Input: “Bernie Sanders was drinking beer on national tv last night , for that doesn't
prove that he should be president then idk what will .”
Output: “bernie sanders was drinking beer on national tv last night for that does
not prove that he should be president then I don't know what will”
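A simplified sketch of this look-up logic is shown below; the dictionary file names and their formats are assumptions, not the exact resources used.

    // Dictionary-based slang/acronym normalization sketch (file names and formats are illustrative).
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class SlangNormalizer {
        private final Set<String> englishWords = new HashSet<>();   // ~300,000 English words, one per line
        private final Map<String, String> slang = new HashMap<>();  // slang/acronym <tab> expansion

        public SlangNormalizer(String wordListPath, String slangListPath) throws IOException {
            Files.readAllLines(Paths.get(wordListPath)).forEach(w -> englishWords.add(w.trim().toLowerCase()));
            for (String line : Files.readAllLines(Paths.get(slangListPath))) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    slang.put(parts[0].trim().toLowerCase(), parts[1].trim().toLowerCase());
                }
            }
        }

        // Replace any token not found in the English dictionary with its slang expansion, if one exists.
        public String normalize(String[] tokens) {
            StringBuilder sb = new StringBuilder();
            for (String token : tokens) {
                String word = token.toLowerCase();
                if (!englishWords.contains(word) && slang.containsKey(word)) {
                    word = slang.get(word);   // e.g. "b4" -> "before"
                }
                sb.append(word).append(' ');
            }
            return sb.toString().trim();
        }
    }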
The slang normalization process pipeline is shown in the figure below.
5.3 Part-of-speech tagging
Following the data normalization and tokenization steps, the program implements a POS
tagger module. This module takes the tokenized tweet as its input and marks all tokens with
their corresponding word types, based on the context and the token itself. It was
implemented with another Apache OpenNLP tool, which produces an array of tags, each
corresponding to a token in the input sequence, as its output.
The part-of-speech tagger uses a probabilistic model to predict the correct tag from the tag
set. It can be further improved with a tag dictionary, which would increase both tagging and
runtime performance [27].
Finally, all adjectives, nouns and verbs found with the use of the tagger were extracted and
stored separately in the respective files for future analysis.
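A sketch of the tagging and filtering step is given below, assuming the standard pre-trained en-pos-maxent.bin model and the Penn Treebank tags (JJ*, NN*, VB*) produced by OpenNLP.

    // POS tagging sketch with Apache OpenNLP; adjectives, nouns and verbs are filtered by tag prefix.
    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;

    public class PosExtractor {
        public static void main(String[] args) throws Exception {
            try (InputStream in = new FileInputStream("en-pos-maxent.bin")) {  // pre-trained model (assumption)
                POSTaggerME tagger = new POSTaggerME(new POSModel(in));
                String[] tokens = {"someone", "is", "drinking", "cold", "beer"};
                String[] tags = tagger.tag(tokens);   // one tag per input token

                for (int i = 0; i < tokens.length; i++) {
                    if (tags[i].startsWith("JJ")) {        // adjectives
                        System.out.println("ADJ  " + tokens[i]);
                    } else if (tags[i].startsWith("NN")) { // nouns
                        System.out.println("NOUN " + tokens[i]);
                    } else if (tags[i].startsWith("VB")) { // verbs
                        System.out.println("VERB " + tokens[i]);
                    }
                }
            }
        }
    }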
5.4 Named-entity recognition
The final part of the program implementation required the development of a Named-entity
(NE) recognition module. The module was designed to analyse each word in the Twitter
corpus to determine whether it refers to a person name, a location or an organization. This
was achieved by loading three separate Apache OpenNLP pre-trained NE models (the
location, person and organization name finder models) and analysing each word from the
Twitter data against these models. Any successful detection of such an instance would mark
the given tweet as containing a Named Entity.
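A sketch of this detection step is shown below; the model file names follow the standard OpenNLP distribution, and loading the models on every call is a simplification.

    // Named-entity detection sketch using the three pre-trained OpenNLP name finder models.
    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class NamedEntityChecker {
        public static boolean hasNamedEntity(String[] tokens) throws Exception {
            String[] modelFiles = {"en-ner-person.bin", "en-ner-location.bin", "en-ner-organization.bin"};
            for (String modelFile : modelFiles) {
                try (InputStream in = new FileInputStream(modelFile)) {
                    NameFinderME finder = new NameFinderME(new TokenNameFinderModel(in));
                    Span[] spans = finder.find(tokens);  // spans of detected names in the token sequence
                    finder.clearAdaptiveData();          // reset document-level adaptive data
                    if (spans.length > 0) {
                        return true;                     // the tweet would be marked with hasNE = true
                    }
                }
            }
            return false;
        }
    }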
The process pipeline of this module is shown in the figure below.
Part 6
Results and evaluation
This chapter will outline what results were achieved as part of this project and what
evaluation methods were used to validate them.
6.1 Results
The program extracted a total of 84,632 tweets in relation to alcohol consumption. Due to
the high computational load and time constraints, a total of 9,174 tweets were processed
and normalized using the methods described in the Implementation chapter.
Four files were generated as the program's output: a main corpus file containing the
processed and normalized tweets, and three files containing the extracted adjectives, nouns
and verbs respectively.
The distribution of adjectives, nouns and verbs is shown in the figure below. A summary of
the most used adjectives, nouns and verbs can be found in Appendix 1.
[Figure: distribution of extracted parts of speech – adjectives: 8,785; nouns: 40,008; verbs: 15,403]
6.2 Evaluation of the extracted corpus
For the purpose of this project, it was important to evaluate the extracted corpus on its
relevance to the selected domain: alcohol consumption. For this task, a subset of 1,000
tweets was manually analysed and annotated on its relationship to this topic. The results of
the evaluation are shown in the figure below.
929 (93%) tweets of the selected subset were identified as alcohol consumption related.
Example of an alcohol consumption related tweet:
"One of the best ones I have ever tried :). - Drinking a Cusque Trigo Wheat Beer - *link*
#photo"
Example of a generic alcohol-related tweet:
"No. I do not associate drinking beer with eating."
While the second example contains the search keywords, it was not considered alcohol
consumption related and was marked as a False Positive (FP).
Such an overall positive evaluation of the extracted corpus can be attributed to the
techniques used by the Twitter Search API. An important feature of the Search API is its
focus on relevance rather than completeness [8], which means the extracted tweets do not
necessarily fully match the search criteria; however, they are more likely to be relevant to
the topic. For example, the tweets "Day 1: listening to David Bowie and Mac DeMarco on
vinyl in my living room whilst drinking beer. Stay tuned for day 2." and "Moderately high
heat, great dark chocolate notes. - Drinking a Habanero Supernova @ Dogberry Brewing -
*link*" were both extracted using the search phrase "drinking beer".
[Figure: relevance of the evaluated subset – alcohol consumption related: 93%; generic alcohol related: 7%]
6.2.1 Search keywords comparison
An important observation was made in relation to the effectiveness of different search
keywords. A further corpus evaluation was conducted on two subsets of tweets extracted
with the keywords "ginger ale" and "party drinking", in order to compare the results of
different search phrases. The figure below shows the comparison of results of this evaluation.
The graph shows that the best results were achieved when extracting tweets with the
search keywords "drinking beer", with 93% True Positive instances. Applying the search
keywords "ginger ale" led to a significant decline in performance, with 32% False Positive
results (4.5 times higher than the "drinking beer" subset), i.e. extracted tweets that were
not related to alcohol consumption.
This evaluation further highlighted the importance of careful selection of search keywords,
specifically when working with the Twitter Search API, as it may have a big impact on the
relevance of the extracted corpus to the selected domain.
6.2.2 Alternative computational approach to corpus evaluation for future
work
An alternative corpus evaluation method will be developed as part of future work on this
project. Its main focus will be the implementation of a Naïve Bayes classifier, which will use
the already extracted and partially labelled data set as its training data.
Such an approach would create an automated method of corpus evaluation, making it faster
and more efficient. It could also help in the evaluation of search keywords, identifying the
most reliable and effective ones.
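A minimal sketch of such a classifier is given below; it is not part of the current program, and the labels, whitespace tokenization and Laplace smoothing are assumptions about how the future work might be implemented.

    // Multinomial Naive Bayes sketch for labelling tweets (labels and tokenization are assumptions).
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class NaiveBayesSketch {
        private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // per-label word counts
        private final Map<String, Integer> docCounts = new HashMap<>();               // per-label tweet counts
        private final Map<String, Integer> totalWords = new HashMap<>();              // per-label word totals
        private final Set<String> vocabulary = new HashSet<>();
        private int totalDocs = 0;

        // Train on one manually labelled, already normalized tweet (e.g. "consumption" or "generic").
        public void train(String label, String normalizedTweet) {
            docCounts.merge(label, 1, Integer::sum);
            totalDocs++;
            Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
            for (String token : normalizedTweet.split("\\s+")) {
                counts.merge(token, 1, Integer::sum);
                totalWords.merge(label, 1, Integer::sum);
                vocabulary.add(token);
            }
        }

        // Return the label with the highest log-probability, using Laplace smoothing.
        public String classify(String normalizedTweet) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String label : docCounts.keySet()) {
                double score = Math.log(docCounts.get(label) / (double) totalDocs);
                for (String token : normalizedTweet.split("\\s+")) {
                    int count = wordCounts.get(label).getOrDefault(token, 0);
                    score += Math.log((count + 1.0) / (totalWords.get(label) + vocabulary.size()));
                }
                if (score > bestScore) { bestScore = score; best = label; }
            }
            return best;
        }
    }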
[Figure: comparison of True Positive (TP) and False Positive (FP) counts for the tweet subsets extracted with the keywords "drinking beer", "ginger ale" and "party drinking"]
Part 7
Conclusions
This chapter will provide a reflective overview of the accomplished results, as well as plans
for future project development.
7.1 Work overview
The program developed for this project managed to deliver a sensibly normalized Twitter
corpus of alcohol consumption. The results of this normalization can be seen by comparing
the initially extracted corpus, which contained a lot of noisy and meaningless data, with the
normalized one. The normalized corpus provides a good platform for the future tasks that
are required to create a computational lexicon.
Additionally, some initial data analysis and corpus evaluation tasks were completed as part
of this project. The extracted and analysed parts of speech serve as a starting reference
point for anyone who wants a basic understanding of the content and structure of the
alcohol consumption lexicon.
Finally, the term finder tool that was developed as part of this project provides researchers
with extra functionality and flexibility. With further development and the integration of all
parts into one system, this tool can be used to quickly analyse and compare lexicons of
different domains and to find relationships between them (e.g. the alcoholism vs. alcohol
consumption lexicons).
7.2 Future work
The process of creating a computational lexicon has suggested more potential program
features than there was time to develop. There are exciting text mining and natural
language processing tools and techniques that are yet to be implemented in order to create
a fully functional computational lexicon. The next stages of development will focus on
adding the following features and functionality:
Stemming and lemmatisation. This will be an important part of future development,
as the implementation of such tools will greatly improve the consistency of the
generated lexicon. At this point in time the program incorrectly processes words like
"listening" and "listened" separately, due to the lack of this functionality. Stemming
and lemmatisation will also increase the performance of the part-of-speech tagger
module, giving more consistent and reliable results.
Lexicon evaluation tools. As previously mentioned in the report, corpus evaluation is
one of the most important aspects of creating a computational lexicon. Due to the
time constraints and the size of this task, it was difficult to implement any sensible
evaluation tools at this stage of the project. One of the evaluation tools to be
developed as part of future work is a GUI that researchers can use to label and
annotate words, terms and their relationships within the extracted and normalized
corpus. This would be an important step towards the validation of the
computational lexicon.
Contextual analysis of the extracted words. It would be a great addition to the
project if researchers could analyse the context of the extracted adjectives, verbs
and nouns. This can be achieved by linking every tagged word to a list of the source
tweets that contained it. Adding Named Entities to this list of extracted words would
also add more information and analysis points for the created computational
lexicon.
7.3 Summary
Overall, this project has definitely been a great learning experience for me. To be honest, at
the start of the project I was not even able to describe what its end result would be. It took
many long hours of reading through related work in this domain to fully understand the
final deliverables, but my clear goal of learning these tools and the desire to better
understand different Big Data analytics methods kept me going and proved this project to
be a good choice. Studying the related work done by the National Centre for Text Mining in
the medical text mining domain, and their creation of the BioLexicon, taught me that there
is much more to creating a fully functional computational lexicon than I first imagined.
While the targeted computational lexicon was not fully completed, this project went beyond
my initial expectations. Now that I have a more advanced understanding of text mining and
natural language processing tools, together with a working program, I believe that through
future development work I have a chance to contribute to the health and science world by
developing a computational lexicon of alcohol consumption.
References
[1] “Twitter Company – About” 2016. Available at: https://about.twitter.com/company
[2] Morrow, M. (2014). Social Media: Staying Connected. Nursing Science Quarterly, Vol.
27(4) 340
[3] D. Davidov, O. Tsur, and A. Rappoport, “Enhanced sentiment learning using twitter
hashtags and smileys” in Proceedings of the 23rd International Conference on
computational Linguistics: Posters, pp. 241–249, Association for Computational Linguistics,
2010.
[4] Dwoskin, E. (2014). "In a Single Tweet, as Many Pieces of Metadata as There Are
Characters". The Wall Street Journal. Available at:
http://blogs.wsj.com/digits/2014/06/06/in-a-single-tweet-as-many-pieces-of-metadata-as-
there-are-characters/
[5] "Introducing the new metadata for Tweets" (2013). Available at:
https://blog.twitter.com/2013/introducing-new-metadata-for-tweets
[6] “Healthchannels forum”. Available at:
http://www.healthcommunities.com/health/forums.shtml
[7] “Health Forums List” (2012). Available at: http://forumlist.info/health-forums-list/
[8] "Twitter Search API". Available at: https://dev.twitter.com/rest/public/search
[9] "Twitter Streaming APIs". Available at: https://dev.twitter.com/streaming/overview
[10] “Aggregating tweets: Search API vs. Streaming API”. Twitter API Consulting. Available
at: http://140dev.com/twitter-api-programming-tutorials/aggregating-tweets-search-api-vs-
streaming-api/
[11] Thompson, P., Theresa Batista-Navarro, R., Kontonatsios, G., Carter, J., Toon, E.,
McNaught, J., Timmermann, C., Worboys, M., Ananiadou, S. (2016). “Text Mining the History
of Medicine”. Available at:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144717#authcontrib
[12] "Tokenization". Available at:
http://searchsecurity.techtarget.com/definition/tokenization
[13] "Stemming". Available at: https://en.wikipedia.org/wiki/Stemming
[14] D. A. Hull, "Stemming algorithms: A case study for detailed evaluation," JASIS, vol. 47,
no. 1, pp. 70–84, 1996.
[15] R. Feldman and J. Sanger, The text mining handbook: advanced approaches in analysing
unstructured data. Cambridge University Press, 2007.
[16] "Computational Lexicology". Available at:
https://en.wikipedia.org/wiki/Computational_lexicology
[17] Schubert, Lenhart, "Computational Linguistics", The Stanford Encyclopedia of
Philosophy (Spring 2015 Edition), Edward N. Zalta (ed.), Available at:
http://plato.stanford.edu/archives/spr2015/entries/computational-linguistics
[18] Cohen, A. and Hersh, W. (2004). “A Survey of Current Work in Biomedical Text Mining”.
[19] Tekiroglu, S., Ozbal, G. and Strapparava, C. (2014). “A Computational Approach to
Generate a Sensorial Lexicon”.
[20] Paul Thompson, John McNaught, Simonetta Montemagni, Nicoletta Calzolari, Riccardo
del Gratta, Vivian Lee, Simone Marchi, Monica Monachini, Piotr Pezik, Valeria Quochi, CJ
Rupp, Yutaka Sasaki, Giulia Venturi, Dietrich Rebholz-Schuhmann and Sophia Ananiadou
(2011). “The BioLexicon: a large-scale terminological resource for biomedical text mining”.
BMC Bioinformatics, 12:397
[21] Fellbaum C, WordNet: An electronic lexical database. 1998, MIT press Cambridge, MA
[22] Browne AC, Divita G, Aronson AR, McCray UMLS language and vocabulary tools. AMIA
Annu Symp Proc. 2003, 798-
[23] http://www.dailystrength.org/
[24] “Ruby Programming Language” Available at: https://www.ruby-lang.org/en/
[25] Frantzi, K., Ananiadou, S. and Mima, H. (2000) Automatic recognition of multi-word
terms. International Journal of Digital Libraries 3(2), pp.117-132.
[26] http://twitter4j.org/en/index.html
[27] Apache OpenNLP. Available at: https://opennlp.apache.org/
[28] http://www.noslang.com/
Appendix 1
Verbs:
Word Count
drinking 8302
want 523
does 481
is 615
'm 169
be 162
was 126
have 107
stop 101
do 97
are 88
aged 88
let 82
manning 78
get 68
sends 66
had 61
shop 59
're 55
've 52
am 50
watching 50
love 50
been 48
eating 46
enjoy 44
go 43
brewing 38
laughing 38
got 37
Adjectives:
Word Count
good 366
pale 309
nice 212
black 201
red 183
imperial 136
great 133
old 129
double 128
big 125
white 117
brown 100
little 89
new 83
sweet 82
cold 81
last 78
hoppy 74
golden 74
delicious 74
dark 72
sour 70
bitter 70
best 66
bad 65
blue 59
irish 58
super 50
dry 48
first 48
Nouns:
Word Count
photo 2727
beer 2567
ipa 1226
ale 1052
company 733
craft 599
peyton 522
stout 497
bud 438
brewing 313
porter 227
i 214
brewery 208
bar 207
hop 201
hopslam 179
lager 174
house 173
chocolate 142
coffee 135
wine 106
day 105
samuel 105
barrel 105
pub 104
co 102
time 100
night 94
love 92
winter 91