SIDEFFECTIVE - SYSTEM TO MINE PATIENT REVIEWS: SIDE EFFECT EXTRACTION
by Sangeetha Rajagopalan
A thesis submitted to the
Graduate School—New Brunswick
Rutgers, The State University of New Jersey
in partial fulfillment of the requirements
for the degree of
Master of Science
Graduate Program in Computer Science
Written under the direction of
Prof. Tomasz Imielinski
and approved by
New Brunswick, New Jersey
May, 2011
Abstract of the Thesis
Sideffective - System to mine patient reviews: Side Effect Extraction
by Sangeetha Rajagopalan
Thesis Director: Prof. Tomasz Imielinski
Sideffective is a system to crawl, rank and analyze patient testimonials about
side effects from common medications. Since the wealth of any mining model is
its data corpus, the data collection phase involved extensive crawling of large
medical websites comprising user forums. Subsequently, the raw files were
subjected to site-specific parsing routines, yielding outputs conforming to a
well-defined data model. Currently, the system holds close to 400,000 user
testimonials pertaining to more than 2,500 drugs/medicines. Sideffective aims
to gather and aggregate this wealth of information, build useful associations,
and present interesting observations and numeric validations, all in a
user-friendly interface. The important issues that we have tried to tackle are:
extracting side effects without relying on pre-built lists, aggregating the
distribution of different side effects for a given drug, site-specific search,
and ranking and determining the negativity of reviews.
The main focus of this thesis is the extraction and discovery of side effects
from users' reviews about a drug. Apache Lucene's ShingleAnalyzer, which
extracts terms and their frequencies, was used to generate more than 7 million
phrases, out of which the top 25,000 terms with frequencies above 100 were
chosen for discovering side effects. After eliminating syntactically incorrect
phrases, our method calculates the frequency of occurrence of each term in a
medical websites domain versus a purely non-medical websites domain, which
proves to be highly effective in extracting side effects. Using this technique,
more than 600 unique side effects reported by users have been discovered
without using any fixed lists. The extracted list is also used to mine and
summarize patients' reviews. The aggregation and distribution tables we built
effectively determine the top reactions exhibited by various drugs, and the
reverse mapping of the same demonstrates the symptom-to-drug associations. Our
system also eliminates synonymous side effects as well as cures falsely
appearing as possible side effects.
Acknowledgements
I would like to thank Prof. Tomasz Imielinski for all the valuable support and
constant encouragement throughout the period of my graduate study. He has
been exceptionally motivational at every step and provided the right guidance in
achieving results. I have learnt a great deal from him in my time here and I am
very grateful to him for having given me this opportunity to work on this thesis
project.
I would like to thank Prof. Apostolos Gerasoulis for his invaluable inputs and
suggestions. His guidance during the early stages of the project helped me to steer
the work in the right direction. I would also like to thank Prof. Alex Borgida for
his sharp and insightful ideas during our discussion. I feel greatly privileged to
have had an opportunity to interact with him.
Finally, I would like to thank my project partner Deepak Yalamanchi for all the
support and encouragement. I am also greatly thankful to my parents, sister and
my friends at Rutgers University for being my pillars of strength.
Dedication
This work is dedicated to my parents and sister for their constant encouragement
and invaluable support.
Contents
Abstract ii
Acknowledgements iv
Dedication v
List of Figures viii
List of Tables ix
1 Introduction 1
1.1 Problem Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Data Collection 6
2.1 Choice of Websites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Crawling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 What is Web Crawling? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 HTTrack Web-Crawler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Gathering Data for Sideffective . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Parsing HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Why parse raw files? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 HTML Parser JAVA Library . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Data Model and Associations 18
3.1 Extraction of useful data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Creating Association and dependency models . . . . . . . . . . . . . . . . . . . . 23
3.2.1 Defining Database Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Populating database tables . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 Data Corpus Harvesting 27
4.1 Side effect discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Building n-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.2 Preliminary Filtering of phrases . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.3 Term Extraction Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Determining Top side effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 Section I: Top Side-effects . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.2 Section II: Graphical Representation . . . . . . . . . . . . . . . . . . . . . 42
4.2.3 Section III: User Testimonials . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Determining Top Drugs for each Symptom . . . . . . . . . . . . . . . . . . . . . . 43
5 Discussion 45
5.1 Challenges & Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.1 Eliminating Cures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.2 Synonymous Side-effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Profoundness Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6 Conclusions and Future Work 52
7 Bibliography 54
List of Figures
1 Basic web-crawler architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 A sample screenshot demonstrating the first step in the HTTrack Crawler, where the
website to be crawled and the action to be performed are specified. . . . . . . 11
3 A sample screenshot demonstrating the options screen of HTTrack . . . . . . . . 13
4 Sample screenshot demonstrating the final step of HTTrack web crawler . . . . . 14
5 A portion of the webpage on the site: www.medications.com from which user
reviews have been extracted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6 A portion of the webpage on the site: www.askapatient.com from which user
reviews have been extracted, with a different page structure. . . . . . . . . . . . . 15
7 Code Snippet demonstrating the Parser class to eliminate HTML tags and meta-
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
8 Process-flow overview of Side-effect Extraction . . . . . . . . . . . . . . . . . . . 28
9 Hierarchy of Lucene classes used in creating index list . . . . . . . . . . . . . 32
10 Sample screenshot showing the various features represented for drug Xanax . . . 41
11 Sample Pie chart representing the distribution of side-effects for Xanax . . . . . 43
12 Pie Chart distribution for Drugs reporting Dizziness as a side-effect . . . . . . . 44
List of Tables
1 Sample websites chosen to create the medical and non-medical data domains . . . . 37
2 Sample subset of phrases with corresponding Google counts from the medical and non-
medical web domains. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Top 20 side effects reported by patients for Xanax and their corresponding frequency
percentages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Sample reverse mapping from Side-effect -> Drug for Dizziness . . . . . . . . . . 44
5 Sample drugs and their top side-effects, which are indicators of the first issue faced 46
6 Top 10 side-effects of Zoloft and their corresponding Pf-scores . . . . . . . . . 50
7 Top common side-effects for drugs in the Anti-Depressants category . . . . . . . 51
8 Representation of other major side-effects of the drugs in the Anti-Depressants category 51
1 Introduction
1.1 Problem Scope
Today's generation is often termed the Digital era. Before the advent of the Internet,
a traditional doctor-patient relationship was considered the most reliable source of
information to a patient. However, recent studies show a shift in the role of the
patient from passive recipient to active consumer of health information, with the
Internet acting as the catalyst of this shift. Patients no longer solely depend on
certified doctors for treatment advice.
The use of the Internet as a source of medical information has become increasingly
popular as more patients go online. According to a recent United States survey,
52 million adults have used the World Wide Web to obtain health or medical
information. Back in 2005, an estimated 88.5 million adults were using the Internet
to research health information and/or health-related products and to communicate
with providers. This number has increased by leaps and bounds since. Access to large
amounts of medical information is available through an estimated 20,000 to 100,000
health-related Web sites.
In another such study, of the 1289 patients participating, 65% reported access
to the internet; age, sex, race, education, and income were each significantly
associated with internet access. A total of 74% of those with access had used
the Internet to find health information for themselves or family members. This
is clearly a huge shift from conventional ways: patient-provider relationships
will probably change, and medical providers will face new challenges as
patients obtain health information from the Internet, share only some of it
with their physicians, or turn to the Internet instead of consulting a health
care provider. The World Wide Web, in all its vastness, has contributed to the
extensive rise of online forums and discussion groups that have changed the
perspective of these 'e-patients' and are thereby responsible for this paradigm
shift.
1.2 Problem Statement
As patients increasingly turn to the Internet for advice, the number of websites
providing such information is also growing exponentially. More often than not,
this glut of information is unstructured and overwhelming to an average internet
user. This Master's thesis project therefore aims to gather and aggregate this
wealth of information available on the web, build useful associations, and
present interesting observations and numeric validations, all in a user-friendly
interface.
Our main focus was to gather as much data as possible about the various drugs
and medications available in the market and to design an algorithm that
automatically extracts the side-effects reported for each of them, solely based
on patient testimonials. The goal is to provide an unbiased, non-opinionated
aggregation of information. In the process, we have also focused on developing
a friendly user interface to present our observations and interesting results
in a manner which is most appealing to a general audience. The following are
some of the major contributions of this thesis:
• A side-effect extraction algorithm focusing on a "most frequent" metric rather
than the usual "most serious" consideration.
• Building associations between various drugs and their side-effects in a manner
most relevant for presentation.
• Creating distribution models based on the associations to show top side-effects
for each drug and the reverse mapping of top drugs reporting a particular
side-effect.
• Defining a metric called the Profoundness Score to perform comparative analysis
between various drugs and categories of drugs.
Apart from the above, we also discovered a set of drawbacks while developing
this kind of system. While we provide solutions to a couple of the challenges,
we believe that some of the setbacks act as foundations for future research on
this topic.
1.3 Related Work
Terminology mining, term extraction, term recognition, or glossary extraction,
is a subtask of information extraction. The goal of terminology extraction is to
automatically extract relevant terms from a given corpus. There are mainly two
categories of approaches in Term Extraction:
• Linguistic approach
• Statistical and Machine Learning approach
Early approaches to automatic term extraction focused on information-theoretic
methods based on mutual information for detecting collocations [Manning and
Schütze, 1999]. Collocations are expressions composed of two or more words
whose meaning is not easy to guess from the meanings of the component words.
There are nuances in the detection of collocations that require linguistic
criteria to resolve [Justeson and Katz, 1995].
It is a common practice to extract candidate terms using a part-of-speech (POS)
tagger and an automaton (a program extracting word sequences corresponding to
predefined POS patterns). Usually, those patterns are manually handcrafted and
target noun phrases, since most of the terms of interest are noun phrases [Justeson
and Katz, 1995]. Typical examples of such patterns can be found in [Jacquemin,
2001]. As pointed out in [Justeson and Katz, 1995], relying on a POS tagger and
legitimate pattern recognition is error prone, since taggers are not perfect. This
might be especially true for very domain specific texts where a tagger is likely to be
more erratic. Another key problem is that of nesting, where subsets of
consecutive words of multi-word terms would satisfy the statistical criteria
for "termhood" but would not themselves be called terms.
Secondly, purely statistical systems [Church & Hanks, 1990; Dunning, 1993; Smadja,
1993; Shimohata, 1997] extract discriminating multiword terms from text corpora
by means of association measure regularities. Although highly effective, these
methods depend on training data sets with manually identified terms or patterns.
Apart from these two, hybrid methodologies [Enguehard, 1993; Justeson, 1993;
Daille, 1995; Heid, 1999] define co-occurrences of interest in terms of syntactical
patterns and statistical regularities. However, by reducing the searching space to
groups of words that correspond to pre-defined syntactical patterns, such systems
do not deal with a great proportion of terms and introduce noise in the retrieval
process.
Since we deal with the very specific domain of a patient-testimonial-driven
medical data corpus, our aim was to extract side-effects from the reviews
without relying on a standard list of symptoms collected from various sources.
Purely linguistic techniques would not be effective either, since we do not
restrict our term extraction to specific parts of speech (nouns, adjectives or
verbs). Also, most of the existing methods are restricted to unigrams or at
most bigrams in terminology identification. We believe that going beyond that
restriction yields a set of interesting and rare side-effects reported by
patients that can be valuable.
The rest of the thesis is organized as follows: Chapter 2 explains in detail
our data collection process and the methodology and tools used for it. In
Chapter 3, we discuss the data model created from the collected data and the
associations built from it; this forms the foundation for all subsequent phases
of the project. The most important section is Chapter 4, where the actual
side-effect extraction algorithm is discussed, followed by our distribution
model. Chapter 5 deals with some of the challenges faced and the solutions
implemented to tackle them. Finally, we provide concluding remarks and future
work in Chapter 6.
2 Data Collection
The wealth of any mining model is the corpus of data. Since the early 90s, the
World Wide Web has grown to become one of the largest repositories of human
knowledge. It is an inter-connected document network of content conforming to
different formats, topics and types. Although web data is highly heterogeneous,
unstructured and often redundant, it is one of the most commonly used reference
repositories, mainly because of its broad variety of information and easy accessibility.
Since our work is based on user reviews of various medications, we mainly target
online discussion forums as our primary source of data.
An online discussion forum is a web community that allows people to discuss com-
mon topics, exchange ideas, and share information in a certain domain, such as
sports, movies, medicines, politics, travel, cars and so on. In most of these forums,
users either start new threads to begin discussion or reply to existing threads.
Large repositories of archived threads and reply records in online discussion fo-
rums contain a great deal of human knowledge on many topics. Although forums
contain rich first-hand information from the reviewers, it is noteworthy that they
are highly unstructured and do not conform to any fixed grammatical or
syntactical standards of language. We aim to collect such user-oriented data
and extract useful information from it in a straightforward, yet highly
effective manner.
The data collection phase involves three major tasks, namely:
• Choice of websites
• Web Crawling
• Parsing of crawled files
All three stages are of equal importance as they form the basis of our data model.
We examine each of these in greater detail in the following sections.
2.1 Choice of Websites
Choosing the right websites to collect information from is of utmost importance,
as it determines the quality of data used in the experiments and also has a huge impact
on the results observed. As internet websites are the source of our data, there were
a few main considerations in choosing the websites to crawl, namely:
1. Contain User testimonials
2. Large volume of the sites
3. High traffic to the sites
Based on the above factors, the following are some of the sites which were crawled:
• www.drugs.com
• www.medications.com
• www.askapatient.com
• www.dailystrength.org
• www.rxlist.com
• www.drugratingz.com
The main consideration is to find sites which carry patient testimonials. This
is in accordance with our goal of determining the side-effects experienced by
most patients rather than those reported by the drug manufacturing companies. Therefore,
we surveyed the internet extensively to find sites conforming to this requirement.
The second consideration is to pick sites which are larger in volume and have
more pages, which is indirectly an indication of how much data the site
collects and of how active the site is in general. To determine the volume of
a site, we perform
a simple Google site-search and get a rough estimate of the number of pages in
that site. Although not all the pages carry patient testimonials, it is still a good
metric to determine the volume of the site. Some examples are:
• www.medications.com: 187,00 pages
• www.drugs.com: 554,000 pages
• www.askapatient.com: 258,000 pages
• www.drugratingz.com: 100,000 pages
• www.rxlist.com: 320,000 pages
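The page-count estimates above were obtained through Google's site-search operator. For instance, typing a query of the following form into the Google search box returns only pages indexed under that site, and the result count shown on the search page serves as the rough volume estimate:

```
site:www.drugs.com
```

The count reported this way is only an approximation of the number of indexed pages, but it is sufficient for comparing sites against one another.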
Finally, as a last factor, each site's monthly traffic estimate from Quantcast
was used to determine the sites with high popularity among internet users.
Sites with higher traffic also imply that the content is more recent and fresh.
Quantcast is a media measurement service that lets advertisers view audience
reports including traffic, demographics, geography, site affinity and
categories of interest on millions of websites and services. Some examples for
the sites of our interest include:
• www.medications.com: 65,000 hits/month
• www.drugs.com: 7.2M hits/month
• www.askapatient.com: 352,000 hits/month
• www.drugratingz.com: 16,500 hits/month
• www.rxlist.com: 3.5M hits/month
Therefore, considering all the above factors, a set of 12 websites was chosen
and crawled for patient testimonials. The upcoming sections describe the next
steps in detail.
2.2 Crawling
2.2.1 What is Web Crawling?
A Web crawler is a computer program that browses the World Wide Web in a
methodical, automated manner and bulk-downloads web pages. Other terms for Web
crawlers are ants, automatic indexers, bots, Web spiders
or Web robots. This process is called Web crawling or spidering. Many sites, in
particular search engines, use spidering as a means of providing up-to-date data.
Web crawlers are mainly used to create a copy of all the visited pages for later
processing by a search engine that will index the downloaded pages to provide fast
searches. Crawlers can also be used for automating maintenance tasks on a Web
site, such as checking links or validating HTML code. Also, crawlers can be used
to gather specific types of information from Web pages, such as harvesting e-mail
addresses (usually for spam).
Predominantly, crawlers today are used for the following purposes:
• They are one of the main components of web search engines, systems that
assemble a corpus of web pages, index them, and allow users to issue queries
against the index and find the web pages that match the queries.
• Web archiving where large sets of web pages are periodically collected and
archived for posterity.
• Web data mining, where web pages are analyzed for statistical properties, or
where data analytics is performed on them.
Our work falls into the third category, where crawlers collect the data
necessary for extracting useful information from the dataset.
In general, a crawler starts with a list of URLs to visit, called the seeds. As the
crawler visits these URLs, it identifies all the hyperlinks in the page and adds
them to the list of URLs to visit, called the crawl frontier. URLs from the frontier
are recursively visited according to a set of policies. The policies specific
to this project are discussed in the following paragraphs.
Figure 1 depicts a typical Web crawler architecture, which schedules, queues
and downloads web pages from the internet.
Figure 1: Basic web-crawler architecture
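The seed-and-frontier loop described above can be sketched in a few lines of Java. The link graph here is an in-memory stand-in, so every URL in it is illustrative; a real crawler would fetch each URL over HTTP and extract hyperlinks from the returned HTML.

```java
import java.util.*;

/** Minimal sketch of the seed/frontier crawl loop described above.
 *  The link graph is simulated in memory; a real crawler would
 *  download each page and parse its hyperlinks. */
public class FrontierDemo {
    public static List<String> crawl(Map<String, List<String>> linkGraph,
                                     List<String> seeds, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(seeds); // URLs waiting to be visited
        Set<String> visited = new LinkedHashSet<>();      // preserves visit order
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;              // skip already-visited pages
            for (String out : linkGraph.getOrDefault(url, List.of()))
                if (!visited.contains(out)) frontier.add(out); // grow the frontier
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "a.com", List.of("a.com/p1", "a.com/p2"),
            "a.com/p1", List.of("a.com/p2", "a.com/p3"));
        System.out.println(crawl(graph, List.of("a.com"), 10));
        // prints: [a.com, a.com/p1, a.com/p2, a.com/p3]
    }
}
```

Production crawlers such as HTTrack layer politeness delays, robots.txt handling and revisit policies on top of this basic loop.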
2.2.2 HTTrack Web-Crawler
The crawler used in the data collection phase of this project is HTTrack, a
free and open-source Web crawler and offline browser developed by Xavier Roche
and licensed under the GNU General Public License. It allows one to download
World Wide Web sites from the Internet to a local computer. By default, HTTrack
arranges the downloaded site by the original site’s relative link-structure. The
downloaded (or ”mirrored”) website can be browsed by opening a page of the
site in a browser. HTTrack can also update an existing mirrored site and resume
interrupted downloads. HTTrack is fully configurable by options and by filters
(include/exclude), and has an integrated help system. HTTrack uses a Web crawler
to download a website. Some parts of the website may not be downloaded by
default due to the robots exclusion protocol unless disabled during the program.
HTTrack is very flexible and provides a wide range of options which can be set
to suit the requirements. Since sites are usually massive, these options come in
quite handy to allow crawling of selected pages which carry the data needed. If
you ask it to, and have enough disk space, it will try to make a copy of the
whole Internet on the local computer. Hence it is important to understand and
use the options provided by the Crawler in the right manner. The following
section briefly describes some of the features of HTTrack which have proved
useful in our work.
Figure 2: A sample screenshot demonstrating the first step in the HTTrack Crawler, where the website to be crawled and the action to be performed are specified.
2.2.3 Gathering Data for Sideffective
The first step is to identify the websites which need to be crawled. The
internet hosts a plethora of medical websites, both government-approved and
non-governmental third-party ones. But since we were interested only in
user-generated forums, the web had to be scanned extensively to hand-pick a
list of sites which hosted discussion forums on different medications.
The next step is to observe the URLs of each site and its corresponding forums,
and to identify patterns in these URLs to be given as input to HTTrack. The
software also provides the flexibility of choosing among different action
options, including:
• Download websites
• Continue interrupted download
• Update existing download
The last option proved particularly crucial, since websites are dynamic in
nature and get updated constantly. Periodically running the crawler with this
action option gives the advantage of updating the content and maintaining
content freshness. The engine goes through the complete structure, checking
each downloaded file for any updates on the web site. Figure 2 shows a sample
screenshot of this actions page.
The next step involves setting a range of options according to the project require-
ments. Some of the important options are discussed below:
a Proxy Options: The engine can use the default HTTP proxy for all FTP
transfers. Most proxies allow this, and if you are behind a firewall, this
option will allow you to easily catch all FTP links. Besides, FTP transfers
managed by the proxy are more reliable than the engine's default FTP client.
This option is checked by default.
b Scan Rules: Filters (scan rules) are the most important and powerful option
available: one can exclude or accept subdirectories, skip certain types of
files, and so on.
c Limits Options: A very important section, which can set the maximum mirroring
depth, maximum external depth, maximum transfer rate, maximum connections per
second and a whole set of other such options.
d Flow Control Options: Used to set the number of connections, timeouts,
retries, the minimum number of connections, etc.
e Spider Options: Can be set to accept cookies, follow or ignore robots.txt
rules, check document types, etc.
The above are only some of the most important options, while numerous other
settings can be modified as well, such as the Log, Index and Cache options,
MIME options, Browser ID options, and Build and Link options.
Figure 3: A sample screenshot demonstrating the options screen of HTTrack
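As an illustration of scan rules, a hypothetical filter set for one of the crawled sites might look like the following. The path pattern /drug/* is invented for this sketch; the real pattern must be derived from each site's actual link structure:

```
+www.medications.com/drug/*
-*.gif
-*.jpg
-*.png
-*.css
-*.js
```

Patterns prefixed with + accept matching URLs and those prefixed with - reject them, so here only pages under the chosen path are mirrored, while images, style sheets and scripts are skipped to save bandwidth and disk space.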
Finally, once all the required options are set, the crawler begins to download
the pages to the specified folder on the local machine. The download can be
aborted at any time by hitting the cancel button. It can also be resumed by
starting the project again and picking "Continue interrupted download" from
the menu on the Mirroring Mode page.
2.3 Parsing HTML
2.3.1 Why parse raw files?
The output from the crawler is a massive set of raw HTML files. These files
contain HTML tags and a lot of other metadata. To make them usable, the files
have to be processed and cleansed to extract the required information, and also
structured to conform to a data model.
Firstly, it is useful to understand the structure of the HTML pages which have
been crawled. Different sites use different layout architecture on their websites.
Figure 4: Sample screenshot demonstrating the final step of HTTrack web crawler
As an example, the webpage snippets in Fig. 5 and Fig. 6 show portions of the
webpages of two different sites which we have crawled.
The snapshot shown in Fig. 5 is merely a section of the entire webpage. The
page contains a lot of other site-specific data which is not useful for our
purposes and hence must be eliminated. Another such example is provided in
Fig. 6, which
is from a different site (www.askapatient.com) and has an entirely different layout
structure to organize the patient reviews.
The challenge, therefore, lies in taking these raw HTML files and converting
them into a usable format, irrespective of the site-specific layout structures.
For this purpose, we have used Java's HTML Parser library, which provides a
wide range of options to parse these files. The next section briefly describes
the Java library and its functionalities.
Figure 5: A portion of the webpage on the site: www.medications.com from which user reviews have been extracted.
Figure 6: A portion of the webpage on the site: www.askapatient.com from which user reviews have been extracted, with a different page structure.
2.3.2 HTML Parser JAVA Library
HTML Parser is a Java library used to parse HTML in either a linear or nested
fashion. Primarily used for transformation or extraction, it features filters, visitors,
custom tags and easy to use JavaBeans. It is a fast, robust and well tested pack-
age. The two fundamental features provided by this library are: Extraction and
Transformation. Extraction encompasses text extraction, link extraction, screen
scraping, resource extraction, link checking and site monitoring, while
Transformation covers URL rewriting, site capture, censorship, HTML cleanup,
ad removal and conversion to XML. Transformation preserves the output file
format as HTML, while Extraction does not.
The library provides an HTML Lexer and an HTML Parser API. The lexer provides
low level access to generic string, remark and tag nodes on the page in a linear,
flat, sequential manner, whereas the parser provides access to a page as a sequence
of nested differentiated tags containing string, remark and other tag nodes. The
parser attempts to balance opening tags with ending tags to preserve the structure
of the page, while the lexer simply spits out nodes. As our application requires
knowledge of the nested structure of the page, we use the HTML parser to generate
clean text files.
The code snippet in Fig. 7 shows the usage of the Parser class:
Lines 1-4 iterate over all the HTML files from the crawler and perform a
preliminary check to verify that each is indeed a file and not a directory.
Lines 5-10 read each file and append every line to a string. Lines 11-13 are
where the HTML Parser library is used to create a new Parser object from the
HTML string. An HtmlPage object is created using the parser object of line 11.
Finally, the parser executes the visitAllNodesWith() method, which performs a
depth-first traversal of each node in the page to eliminate the HTML tags.
Figure 7: Code Snippet demonstrating the Parser class to eliminate HTML tags and metadata
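As a rough illustration of what the tag-elimination step produces, the following self-contained Java sketch strips everything between < and >. It is a deliberately simplified stand-in for the library's node traversal, which additionally balances nested tags, handles comments and scripts, and exposes remark and tag nodes.

```java
/** Simplified stand-in for the tag-stripping performed via
 *  visitAllNodesWith(): walks the character stream and keeps
 *  only the text that appears outside of <...> tags. */
public class TagStripper {
    public static String strip(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') inTag = true;        // entering a tag
            else if (c == '>') inTag = false;  // leaving it
            else if (!inTag) out.append(c);    // keep visible text only
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>No side <b>effects</b> so far.</p>"));
        // prints: No side effects so far.
    }
}
```

The real parser is needed for pages where this naive approach fails, e.g. when markup characters appear inside attribute values or embedded scripts.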
Therefore, by the end of this phase, the following goals were met:
• Identified websites with large repositories of user generated reviews for drugs.
• Used HTTrack Crawler to gather all web pages in these identified sites which
carry the user testimonials.
• Implemented Java's HtmlParser API to parse raw HTML files, thereby generating
a massive repository of clean text files.
The following sections describe at length the subsequent steps involved in
post-processing this data further to create a data model.
3 Data Model and Associations
In the previous section we discussed at length the process of data Crawling and
Parsing. In this chapter we examine the next logical steps involving post-processing
raw text files and creating database tables using them. This data model forms the
building block for the User Interface and its features which are explained in the
next chapter.
There are two main categories of discussion in creating the data model viz.,
• Extraction of information for Database model
• Creating Associations and Dependencies
We discuss each of these in the following sections.
3.1 Extraction of useful data
Data for this work has been collected from more than 10 medical review websites
on the internet. As illustrated in fig.5 and fig.6, each website has a different
representation of the data it carries. This makes it very difficult to run a common
routine to extract the necessary information from the webpage. The extraction
routine has to be customized specifically for each website after careful examination
of the data organization structure that it holds.
We first examine a few examples of raw text files from 3 different websites. The
methodology for creating database tables from them is described based on these
examples. The following are some sample snippets:
• www.askapatient.com
• www.drugratingz.com
• www.medications.com
In each of these snippets, the relevant portion is only the part where patients
describe their experience with the drug. In the first example, askapatient.com
structures each entry as a set of fields: Ratings, Reason, Side effect, Comments,
Sex, Age, Time Taken and Date Added. Of all these fields,
we identify and extract only those corresponding to side effects and comments.
Observing the pattern in the second snippet taken from drugratingz.com, the site
represents 4 different ratings followed by the patient testimonial, 2 links and Date-
added fields. In the final example from medications.com, the website presents
Date, Year, Time, Review and Link. Additionally, it is important to note that the
examples presented above are only snippets from the actual raw text file. Each file
therefore, carries a lot of metadata about the site before and after these snippets.
It is essential to clean this metadata information as well. Finally, it is essential to
correlate each raw text file to the drug name associated with it.
The aim of this phase is to create a relevant database model based on the infor-
mation extracted from these crawled websites. The first table we create, the
Master table, has the following specification:
drugname varchar(50): As the name suggests, this holds the name of the drug
to which the review corresponds.
user varchar(150) : Most websites which were crawled for the data were forums
wherein each user had a unique username. This field does not carry any special
functionality and has been included simply for ease of implementation.
link varchar(500) : This field carries the link to the website from which the
particular review was extracted. A small detail to note here is that the links
for some reviews may no longer resolve, since users and administrators are
constantly updating the websites, posing the problem of outdated links.
review varchar(2000) : This field is the most vital attribute, as it holds the actual
testimonial written by the user for the specific drug. A noteworthy feature here is
that the text is highly unstructured, with no specific format. Therefore,
sufficiently cleansing the data before using it for any analysis is very important.
Considering all the above factors, the following steps were performed to translate
the text files to the database table described above.
Step I: Identify a fixed recurring pattern in text files corresponding to each website.
Step II: Extract only the recurring patterns into an array, say A. Based on the
website, copy the fields in the array corresponding to the user testimonials, website
links and user name (if applicable).
Step III: Identify the location of the Drugname specified in each
file, which is usually in the HTML header of the webpage.
Step IV: Once all the fields are identified, insert them into the database Master
table.
After our experiments, the master table contained nearly 400,000 unique
patient reviews collected from over 10 websites, with information about nearly 2500
drugs. This table forms the foundation for the next steps of creating associations
and dependencies between the drugs and their side effects.
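As a sketch of Steps I-IV, the Java fragment below extracts rows for the Master table from one page. The field markers ("SIDE EFFECTS:", "COMMENTS:") and regular expressions are purely illustrative assumptions, not the actual site formats, and the database insert is omitted; the real routines were written separately for each website.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of Steps I-IV for one hypothetical site layout. The field pattern
// below is an illustrative assumption, not the actual askapatient.com format.
public class MasterTableLoader {

    // Step I/II: a fixed recurring pattern, here "SIDE EFFECTS: ... COMMENTS: ..."
    private static final Pattern REVIEW =
        Pattern.compile("SIDE EFFECTS:\\s*(.*?)\\s*COMMENTS:\\s*(.*?)\\s*$",
                        Pattern.MULTILINE);
    // Step III: drug name taken from the page title.
    private static final Pattern DRUG =
        Pattern.compile("<title>\\s*(\\w+)");

    // Step IV: emit one (drugname, user, link, review) row per match.
    public static List<String[]> extract(String page, String link) {
        List<String[]> rows = new ArrayList<>();
        Matcher d = DRUG.matcher(page);
        String drug = d.find() ? d.group(1) : "unknown";
        Matcher m = REVIEW.matcher(page);
        while (m.find()) {
            rows.add(new String[]{drug, "anonymous", link,
                                  m.group(1) + " " + m.group(2)});
        }
        return rows;
    }
}
```

In the real system each row is then inserted into the Master table; here the rows are simply returned.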
3.2 Creating Association and dependency models
The stages so far have described how websites were crawled to gather webpages,
how the HTML was parsed into text files, and how the raw text files were subjected
to site-specific routines to create the Master table of the drug database. In this
section, we discuss how this Master table is used to define more useful,
drug-centered associations. At the end of this phase, we aim to have a
well-structured database model that fuels our graphical user interface
efficiently and in a user-friendly way. First, we describe the structure
of the desired database tables. Next, we describe the routine for populating
them.
3.2.1 Defining Database Tables
Any web-based application is supported on the backend by a database schema
holding all the data. It is of utmost importance to design these tables in a way
that is most efficient for querying and retrieval. Some of our design considerations
were:
• Ease of understanding the schema
• Ease of querying the tables
• Simplicity of the schema
• Striking the right balance between static and dynamic calculations
Keeping in mind the above considerations, the schema consists of the following
tables:
UnigramSE This table has the various drugs and correspondingly, the different
unigram side effects found in their user reviews. This table also has the frequency
of each unigram for each drug. The structure of the table is as follows:
DrugName: varchar(200) : This is the field with the name of the drug.
Effect: varchar(200) : This field holds the unigram side effect itself.
Frequency: int(50) : Frequency stands for the number of times a particular unigram
has been mentioned as a side effect in the context of a particular drug.
NGramSE This table is very similar to the UnigramSE described above. Just
like unigrams, this table holds n-gram side effects.
DrugName: varchar(200) : This is the field with the name of the drug.
Effect: varchar(200) : This field holds the n-gram side effect itself.
Frequency: int(50) : Frequency stands for the number of times a particular n-gram
has been mentioned as a side effect in the context of a particular drug.
Categories This table is a simple collection of records with just 2 fields.
DrugName: varchar(200): This represents the name of the medicine
Category: varchar(200): This represents the class to which the drug belongs. Ex:
Anti-depressants, Analgesics, Beta blockers etc.
3.2.2 Populating database tables
Having defined the tables, now we examine the way these tables are populated with
relevant data. The UnigramSE and NGramSE tables are populated by iterating
over the Master Table and identifying the side-effects mentioned for each drug in
every testimonial. This process requires a list of all side-effects which can be used
to cross-reference across the reviews. The way this fixed list is built is of utmost
importance to this work and has been discussed at length in Chapter 5. For
the purpose of discussion in this section, we assume that we are provided with a
nearly comprehensive list of all possible side effects, say List L.
The module fetches all the rows of (drugname, review) from the Master table and
iterates over every row processing the review to extract medical terms/side effects.
The first step involves removing words that do not add any semantic value to
the sentence. For this purpose, a stop list of words (a Perl hash, %stop_list) is
used to eliminate common English words which are definitely deemed non-
medical.
There is another list of words, List L, a collection of medical side effects
and symptoms gathered using the extraction algorithm described in later sections.
This list is used as a reference against each review to produce the drug-to-side-
effect mapping for that drug. A drug-to-side-effect mapping in a particular review
is counted only once, even if the review contains multiple occurrences of the same
mapping.
Only n-gram side effects with n ≤ 4 are considered in all experiments.
• Mappings with unigrams in them are stored in the table UnigramSE.
• Mappings with n-grams (2 ≤ n ≤ 4) in them are stored in the table NGramSE.
Now to look at the methodology:
For each review, the first step is to remove all punctuation marks, convert all
letters to lower case and remove leading spaces. This is done using the regular-
expression features of the Perl programming language. The next loop considers
every word of this review and checks whether it is a stop-list word. If it is, it is
ignored. If not, the first sub-step checks whether the word exists in the unigram
list. If it does, the next check is whether the word already exists in the table. If
it does, it is ignored; otherwise, the word is inserted into the unigram table with
a frequency of 1. The second sub-step is the exact same procedure for bigrams,
inserting into the bigram table; the only difference is that every two consecutive
words are considered instead of single words. The above procedure takes place
every time a new drug row comes into consideration.
Other than the first occurrence of a drug, all other rows are handled a little
differently. The initial clean-up and pre-processing remain the same. The
difference is that every word is checked to see whether it has already been
inserted from a previous review. If so, a second check determines whether the
word has already appeared in the same review; if it has, it is ignored, and if
not, its frequency is incremented by 1. The same procedure takes place with
bigrams, again considering two consecutive words at a time.
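The two passes described above can be condensed into one sketch. This is a minimal Java rendering of the logic (the thesis implements it in Perl against MySQL tables); the stop list here is a tiny sample, and the table update is modeled as an in-memory map keyed by "drug|effect".

```java
import java.util.*;

// Sketch of populating UnigramSE from reviews: lowercase, strip punctuation,
// drop stop words, match unigrams against list L, and count each (drug,
// effect) pair at most once per review.
public class UnigramCounter {

    static final Set<String> STOP = new HashSet<>(Arrays.asList(
        "i", "the", "a", "and", "had", "have", "of", "my"));

    // frequency table keyed by "drug|effect"
    public static Map<String, Integer> count(
            List<String[]> masterRows,      // each row: {drugname, review}
            Set<String> listL) {            // unigram side-effect list L
        Map<String, Integer> freq = new HashMap<>();
        for (String[] row : masterRows) {
            String drug = row[0];
            String clean = row[1].toLowerCase().replaceAll("[^a-z ]", " ");
            Set<String> seenInReview = new HashSet<>(); // once per review
            for (String w : clean.trim().split("\\s+")) {
                if (STOP.contains(w) || !listL.contains(w)) continue;
                if (seenInReview.add(w)) {
                    freq.merge(drug + "|" + w, 1, Integer::sum);
                }
            }
        }
        return freq;
    }
}
```

The bigram case is identical except that the loop considers every pair of consecutive words instead of single words.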
Finally, the Categories table is populated by crawling a couple of sites providing
information about Drug Categories. This table is quite simple in its structure and
has no hidden complexities involved in building it. It was built to cater to a specific
feature of the User interface involving Drug Categories and comparisons, which is
discussed in Section 6.
4 Data Corpus Harvesting
In this section, we discuss in detail the main contributions of this thesis work:
side-effect extraction without using fixed lists, and mining domain-specific
knowledge about each drug and each side effect reported by patients in
their testimonials.
4.1 Side effect discovery
Terminology extraction is the task of automatically extracting relevant terms from
a corpus. The extracted terms are used to build domain-specific ontologies and
associations, and also act as the data dictionary for the later steps. In general,
techniques for term extraction involve either constructing linguistic rules based
on the corpus or statistical metrics evaluating the probability of a phrase being
a valid term in context. The other approach is to use machine-learning algorithms,
where a small set of terms is manually validated and used as a training set to
classify other terms.
In this project, we aim to extract a unique list of Side-effects, purely derived
from patient testimonials without using a standard list of FDA/NIH specified
side effects. Our approach uses a simple but effective search engine validation
technique to separate medical phrases from non-medical phrases. We also target
n-gram phrases of size up to 4, rather than just unigrams and bigrams in an effort
to build a more comprehensive set of side effects.
The following figure demonstrates the overview of the process flow:
The following are the phases in building the side-effect library from the medical
data corpus:
• Index all patient testimonials to create Master List of n-grams
• Apply preliminary filters to preprocess the master list of terms
• Apply algorithm of Google Search Validation to extract medical terms from the corpus
Figure 8: Process-flow overview of Side-effect Extraction
We discuss each of these steps in detail in the subsequent sections.
4.1.1 Building n-grams
The data corpus has close to 400,000 patient testimonials. The first important
step is therefore to build an index spanning the entire set. Before we delve into
the details of building the n-gram index, it is useful to understand a few key
concepts, described below:
N-Gram: An n-gram is a subsequence of n items from a given sequence of items.
The items in question can be anything (letters, numbers or words), though most
commonly n-grams are made up of character or word/token sequences. For
example, the sub-sequence "experience severe" from the sequence "I experience
severe muscle cramps" is an n-gram.
Different kinds of n-grams have received their own notations:
• Unigram: an n-gram of size 1. That is, there is only one item in the
sub-sequence.
• Bigram: an n-gram of size 2. Following the previous pattern, there are two
items in the sub-sequence.
• Trigram: an n-gram of size 3. As before, there are three items in the
sub-sequence.
• N-gram: generically, an n-gram of size 4 or above. That is, there are 4 or
more items in the sub-sequence.
Shingles: A shingle is just a word-based n-gram, as opposed to a character-based
n-gram. They are widely used to create pseudo-phrases during the indexing process
since the shingle ends up being a single token, which is then subject to the normal
TF-IDF scoring. In many cases, searching for phrases yields relevance improve-
ments, but finding phrases at query-time can be more expensive than normal term
queries, so in such cases it is a common practice to use shingles. Non-trivial word
n-grams (aka shingles) extracted from a document can also be useful for document
clustering.
Apache Lucene: An open-source search and indexing framework that has become
quite popular within the last couple of years. Initially, only a Java
implementation existed; there are now several ports to other programming
languages, including PHP.
Apache Lucene is not a single search application which can be installed and
executed. It is a complete search-engine library containing all the necessary
functions to both index and search a document collection. With this library, one
can create a search engine that complies with user-defined requirements and other
special needs. The library is very simple to use and, despite its simplicity, is very
fast and efficient. It provides out-of-the-box default values so that one
can quickly index and search. However, it is quite simple to extend the
library to enhance indexing performance, the search domain, document analyzers, etc.
A few quick definitions in the context of Lucene are as follows:
Tokenizer Tokenization is the process of breaking a stream of text up into mean-
ingful elements. A class/function that parses an input stream into tokens is called
a Tokenizer.
TokenFilter A TokenFilter is a TokenStream whose input is another token stream
and it applies a set of rules to the stream to get a resultant desired output con-
forming to a specific format.
Analyzer An Analyzer builds TokenStreams, which analyze text. It thus rep-
resents a policy for extracting index terms from text. Typical implementations
first build a Tokenizer, which breaks the stream of characters from the Reader into
raw Tokens. One or more TokenFilters may then be applied to the output of the
Tokenizer.
Lucene's n-gram support is provided through the NGramTokenizer class, which
tokenizes an input String into character n-grams and can be useful when building
character n-gram models from text. NGramTokenFilter, EdgeNGramTokenizer
and EdgeNGramTokenFilter are the other supporting classes providing similar
functionality. Word n-gram statistics and models, on the other hand, are built
using the ShingleFilter or ShingleMatrixFilter classes. ShingleAnalyzerWrapper
wraps the ShingleFilter around another Analyzer.
Based on the above concepts, the patient testimonials are subjected to certain
routines to yield a comprehensive index of all the phrases in the data corpus. The
program which builds the index has the following specifications:
a Input: Nearly 400,000 patient testimonials read one after the other in a loop.
b Processing: This step involves three main tasks, namely
• Create an IndexWriter Object. An IndexWriter is the one which creates
and maintains the index. The constructor of the class determines whether
to create a new index or update an existing one.
• Iterate over each testimonial and create a Document object from it.
Documents are the unit of indexing and search. A Document is a set of fields;
each field has a name and a textual value. A field may be stored with the
document, thereby uniquely identifying each document.
• The last step is to actually create the index using various Analyzers and
filters configured according to the requirement.
This final step is architected using a set of Lucene Analyzers and filters
described as follows:
ShingleAnalyzerWrapper wraps a ShingleFilter around another analyzer. In
our case, the analyzer we chose is a Standard Analyzer. The Shingle Wrapper
is used to specify the Maximum Shingle size, which for the purpose of our
experiments is size = 4.
StandardAnalyzer filters StandardTokenizer with StandardFilter, LowerCaseFilter
and StopFilter, using a list of English stop words.
LowerCaseFilter normalizes token text to lower case.
StopFilter removes stop words from a token stream.
StandardFilter normalizes tokens extracted with StandardTokenizer.
StandardTokenizer: A grammar-based tokenizer constructed with Jflex. This
should be a good tokenizer for most European-language documents:
• Splits words at punctuation characters, removing punctuation. However, a
dot that’s not followed by whitespace is considered part of a token.
• Splits words at hyphens, unless there’s a number in the token, in which case
the whole token is interpreted as a product number and is not split.
• Recognizes email addresses and internet hostnames as one token.
As an example, the sentence "The quick brown fox jumps over the lazy dog"
would yield the following tokens:
Unigrams: quick, brown, fox, jumps, over, lazy, dog
Bigrams: quick brown, brown fox, fox jumps, jumps over, over lazy, lazy dog
Trigrams: quick brown fox, brown fox jumps, fox jumps over, jumps over lazy, over lazy dog
4-grams: quick brown fox jumps, brown fox jumps over, fox jumps over lazy, jumps over lazy dog
Figure 9, shown below, depicts the overall class hierarchy of the Lucene
modules and their associations.
Figure 9: Hierarchy of Lucene classes used in creating index list
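The shingle output above can be reproduced with a few lines of plain Java. This stand-alone sketch only illustrates the sliding-window idea; the pipeline itself relies on Lucene's ShingleFilter rather than hand-rolled code.

```java
import java.util.*;

// Stand-alone illustration of word-based shingling: slide a window of the
// given size over the token stream and join the words with spaces.
public class Shingler {

    public static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }
}
```

Running this for sizes 1 through 4 over the stop-filtered tokens of the example sentence yields exactly the unigram through 4-gram lists shown above.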
c Output: The output from this stage is two lists:
• Term list without Stop-Words consisting of nearly 7 million terms
• Term list with Stop-Words consisting of nearly 15 million terms
4.1.2 Preliminary Filtering of phrases
The output from the first phase generates a Master list of terms belonging to
all categories, i.e., not restricted to medical terminology alone. Due to the
massiveness of the master lists, we subject them to a set of preliminary filters to
reduce the index set. This facilitates a better implementation of our extraction
algorithm and also helps in a more effective evaluation of the final list of medical
terms. The two preprocessors used in our experiments are discussed below:
a Top Frequency Filter: The first filter we use reduces the size of the
experimental data set. Lucene's Analyzer not only indexes the testimonials, but
also calculates the frequency of occurrence of each term in the document corpus.
This frequency metric is used to filter the Master lists. The rule for applying the
filter was to select only those terms which had a minimum support count of
100. This step is an effort towards fulfilling one of our primary aims
in this work, which is:
Sideffective, unlike other medical data miners, focuses on bringing forth the
Most-Frequent side-effects of a drug, rather than the most serious/non-serious
side-effect.
In accordance with this, a side effect like Headache, which occurs 75,000 times
across all the testimonials, is given a higher ranking than a very rare side effect
like Death, which is reported by only one patient. We believe that users of this
system would be more interested in knowing the most frequent, and thereby
the most common, effects of taking a drug, rather than an effect experienced
by a single patient.
The top 25,000 phrases of each of the lists (with and without stop words)
were chosen as the output of the first filter. It is a noticeable fact here that
the list without stop words was a good source of Unigram and Bigram side
effects while the list with stop words was the source for n-gram side effects.
This is a fairly obvious conclusion from the grammatical rules of the English
language, which dictate that when trigrams or longer n-grams are formed from
nouns or verbs, they are connected by prepositions or conjunctions and preceded
by adjectives. These grammatical rules are also the building blocks for the next
filter, discussed in the following paragraphs.
b Semantic Filter: The main aim of this extraction process is to identify side
effects automatically from the data corpus. The index built by Apache's Lucene
with the corresponding term frequencies no doubt contains a lot of noise.
Even the list which has been filtered with stop-word analyzers does not
necessarily provide a 100% clean set of phrases.
For example, phrases like "in the morning", "I am now" and "doctor
prescribed this" are semantically correct but do not contribute towards the
extraction of side effects. Therefore, it is necessary to eliminate such
contextually meaningless terms. The rule of the filter is:
A phrase is considered contextually meaningless if and only if it begins
or ends with a stop word, and such phrases are eliminated from the index list.
The reason the filter is built around this rule is that side effects longer than
bigrams fall under one of two categories:
i Adjective followed by one or more nouns
ii Nouns and verbs joined by a preposition or conjunction
The following are some examples of category (i):
Lowered Blood pressure
Severe joint pains
Decreased sex drive
Extreme mood swings
While examples of category (ii) are like the following:
Ringing in ears
Stiffness in neck
A third set of phrases, which do not conform to either of the above two
categories, are the ones which begin or end with conjunctions, prepositions
or, in general, any stop words. This filter does the task of eliminating
such phrases. Some examples of such phrases are:
in the morning
am so
time and i am now
The output from this filter is a set of clean phrases, both medical and non-
medical. The non-stop-word list consists of nearly 8,000 phrases, while the
one with stop words is left with around 15,000 phrases. The next section
describes the term-extraction algorithm formulated to pick out unique
side effects from the index list.
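The filter rule reduces to a one-line check per phrase. A minimal sketch, assuming a small sample stop list (the real filter uses the standard English stop-word list):

```java
import java.util.*;

// Sketch of the semantic filter: a phrase is dropped when its first or last
// word is a stop word. The stop list here is a tiny illustrative sample.
public class SemanticFilter {

    static final Set<String> STOP = new HashSet<>(Arrays.asList(
        "in", "the", "am", "so", "and", "i", "now", "a", "of"));

    public static boolean keep(String phrase) {
        String[] w = phrase.toLowerCase().split("\\s+");
        return !STOP.contains(w[0]) && !STOP.contains(w[w.length - 1]);
    }
}
```

Note that "ringing in ears" survives (the stop word is interior), while "in the morning" and "time and i am now" are eliminated.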
4.1.3 Term Extraction Algorithm
This is the final step of the side-effect identification algorithm. The process
begins with a Master index, a massive list of phrases, followed by a couple of
filters which effectively clean and reduce the size of the experimental data set.
The final step is to separate the medical phrases from the non-medical phrases,
thereby yielding a subset of unique side effects. The methodology employed for
this separation is a simple technique based on Google search-engine results.
Here we examine a couple of Google concepts that are useful in this context.
Number of results:
When a Google search is performed on a search query, the results are often
displayed with the information: Results 1 - 10 of about XXXX. This XXXX
number is the estimated total number of results that exist for that query. The
estimated number may be either higher or lower than the actual number of results.
Google's calculation of this total is only an approximate ballpark figure, provided
to keep search fast rather than to compute the exact count. But we believe this
number is quite valuable when used in conjunction with Google's site-search
feature, as it gives a fair idea of how frequently a term occurs within any chosen
domain of websites.
Site specific search:
Google allows one to specify that the search results must come from a given
website. For example, the query [nausea site:www.drugs.com] will return
pages about nausea, but only from drugs.com. Simpler queries like [nausea
from Drugs.com] will usually be just as good, though they might return results
from other sites that mention Drugs.com in association with nausea. One
can also specify a whole class of sites; for example, [nausea site:.gov] will return
results only from .gov domains. It is also possible to search multiple
domains simultaneously using the keyword OR. Therefore a query such as
[nausea site:drugs.com OR site:medications.com] returns results from either of
the two sites for the search key nausea.
Based on these concepts, we define our side-effect extraction routine as follows:
Separate medical from non-medical terms by counting the frequency of occurrence
of each term in a domain of medical websites (X) vs. purely non-medical
websites (Y), and retain only those terms where X exceeds Y by at least a
factor n.
As discussed before, the frequency of occurrence of a term in any domain is
estimated using Google's search-result approximation number. We choose the
domain of medical sites and non-medical sites considering the following three
factors:
Medical Websites Domain: www.medications.com, www.drugs.com, www.rxlist.com, www.askapatient.com, www.dailystrength.org
Non-medical Websites Domain: www.finalgear.com, www.travelblog.org, www.virtualtourist.com, www.nj.com
Table 1: Sample websites chosen to create medical and non-medical data domain
• Contains User testimonials
• Large volume of the sites
• High traffic to the sites
Based on this, the following sites were chosen in each of the domains:
The medical sites are the same as the ones chosen for crawling testimonials. The
non-medical sites were selected such that they contained massive numbers of user
testimonials, to maintain uniformity in the pattern of data and in the way users
form sentences. A sample search query in the medical domain would be:
Muscle cramps site:drugs.com OR site:medications.com OR site:rxlist.com OR
site:askapatient.com OR site:dailystrength.org
Similarly, a query in a non-medical site domain would be:
Some people site:finalgear.com OR site:nj.com OR site:travelblog.org OR site:virtualtourist.com
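Such domain-restricted queries can be assembled mechanically from a phrase and a site list. The sketch below (class and method names are our own, not part of the system) shows the OR-joined site: form used above.

```java
import java.util.*;

// Sketch of building the site-restricted Google queries shown above by
// joining site: operators with OR.
public class DomainQuery {

    public static String build(String phrase, List<String> sites) {
        StringBuilder q = new StringBuilder(phrase);
        for (int i = 0; i < sites.size(); i++) {
            q.append(i == 0 ? " site:" : " OR site:").append(sites.get(i));
        }
        return q.toString();
    }
}
```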
We tabulate these results for all the phrases collected and, by trial and error,
determine the factor n by which medical phrases are separated from
non-medical ones.
A sample table with about 15 phrases is shown below:
Determining factor n:
According to our defined rule, we extract only those phrases from the list
whose search count in the medical websites is at least a factor n greater than
their count in the non-medical web domain. To determine this n, we use a trial and
#  | Phrase                   | Medical | Non-Medical
1  | anxiety attacks          | 9,370   | 250
2  | severe depression        | 37,100  | 381
3  | acid reflux              | 4,670   | 878
4  | lethargic                | 2,000   | 450
5  | lyrica                   | 8,390   | 1,030
6  | Depakote                 | 6,680   | 120
7  | permanent                | 9,410   | 39,500
8  | got worse                | 19,200  | 37,100
9  | would recommend          | 26,700  | 55,200
10 | doesnt work              | 7,660   | 4,080
11 | taste in my mouth        | 4,260   | 3,280
12 | extremely painful cramps | 4,770   | 1,310
13 | horrible headaches       | 4,200   | 913
14 | on the pill              | 181,000 | 3,760
15 | is getting worse         | 17,300  | 11,300
Table 2: Sample subset of phrases with corresponding Google count from medical and non-medical web domain.
error approach starting with n = 1. We then perform an editorial audit of
the resultant final list to certify the quality of the extracted terms based on
the false-positive rate. After several trials, we settle on n = 4 as the best value
of this factor. The following paragraphs examine the various n values tried on
the experimental set and the problems associated with each result set.
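The decision rule itself is a single comparison. A minimal sketch, with the counts standing in for Google's estimated result numbers:

```java
// Sketch of the factor-n rule: a phrase is kept as medical when its
// estimated result count in the medical domain is at least n times its
// count in the non-medical domain. n = 4 was chosen by trial and error.
public class RatioFilter {

    public static boolean isMedical(long medicalHits, long nonMedicalHits, int n) {
        return medicalHits >= (long) n * nonMedicalHits;
    }
}
```

With the Table 2 values and n = 4, "anxiety attacks" (9,370 vs. 250) is kept, while "got worse" (19,200 vs. 37,100) and "doesnt work" (7,660 vs. 4,080) are dropped.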
Factor n < 4: For smaller values of n, fewer phrases get filtered out, which
makes the resultant final list noisier. From Table 2, it can be seen that using
an n-value of 1, 2 or 3 identifies medical phrases like rows 1-6, but at the
same time it also incorrectly flags rows such as 10 and 11 as medical. The
false-positive rate with smaller n-values is very high.
Factor n > 4: Using a larger n-value eliminates many noisy phrases. This is
an advantage, although a new problem appears in the resultant list: phrases
which are more common in nature, i.e., more likely to occur in both the
medical and non-medical domains, get eliminated. These could be valuable
side effects which we do not want discarded. Classic examples are rows 12
and 13 of Table 2, whose frequencies in the two data sets are very close. It
is not uncommon to find phrases like "horrible headaches" or "terrible
cramps" on a travel or tourism website where users write testimonials about
their experiences.
Considering all the above issues, an n-value of 4 proved to have the lowest
false-positive rate while generating the maximum number of unique side effects.
Despite these advantages, we observe that the final list from this scenario still
has a few noisy phrases which do not qualify as purely medical. The final
list, after all the phases and filters, contains nearly 2,000 terms which fall under
the following 3 categories:
• Side effects
• Drug Names
• Noisy terms
Finally, the following 2 steps are performed to obtain the list of unique side
effects:
Step I: Eliminate drug names by cross-referencing with the drug names from
the database.
Step II: Run a manual audit of the list from previous step to obtain unique
side effects.
4.2 Determining Top side effects
We examined in great detail the side effect extraction process in the previous
section which is our primary contribution in this project. Once the side effects
are extracted, it is important to represent all the valuable data collected in a
manner that is most useful as well as appealing to the target audience, who are
patients using various drugs and medications in day-to-day life. Hence, our second
important contribution is aggregation and data distribution representation which
is discussed in the following two sections.
Data aggregation is any process in which information is gathered and expressed
in a summary form, for purposes such as statistical analysis. We use aggrega-
tion and its strength to represent distributional data in a user friendly interface
designed and developed specifically for this undertaking. The details of the sys-
tem implementation are discussed at length in Section 7. As discussed in Section
4, we built database tables for unigram and n-gram side effects with their
frequencies, indexed by drug names, in tables UnigramSE and NGramSE respectively.
These tables were built using the side effects identified by our extraction routine.
Additionally, there exists the Master table called DrugTable which holds all the
reviews/testimonials collected from various websites, also indexed by Drug names.
In this section, we try to answer the question lurking in most patients' minds:
What are the top side-effects reported for a medicine?
For the rest of this discussion, we explain the various features of the distribution
model using the example of drug Xanax and its side effects. There are 3 main
features represented for each Drug:
• Top Side effects with percentage distributions
• Graphical representation of the above distribution
• Patient testimonials in the corresponding category
The screenshot in Fig 10 shows all of the above features for the drug Xanax.
We examine each of them in the upcoming paragraphs.
Figure 10: Sample screenshot showing the various features represented for drug Xanax
4.2.1 Section I: Top Side-effects
The UnigramSE and NGramSE tables contain all the side effects for a specific
drug. Their frequencies are extracted and the percentages calculated as follows:
If, f = frequency of occurrence of side-effect as reported in the DB table
T = Total number of reviews reported for the particular drug
Then, Percentage frequency PerD(X) where X = any side-effect for drug D is:
PerD(X) = (f/T ) ∗ 100 (1)
Based on this calculation, the top 20 most frequently reported side effects are
displayed in Column I. Table 3 shows the percentage frequency calculated for
Xanax and its 20 most reported side-effects.
Side Effect           Percentage Frequency
drowsiness            1.03%
depression            0.97%
Memory loss           0.74%
insomnia              0.65%
fear                  0.59%
tiredness             0.27%
dizziness             0.27%
anger                 0.22%
Dry mouth             0.22%
Weight gain           0.20%
seizures              0.20%
sensitive             0.16%
nausea                0.16%
Mood swings           0.13%
Vivid dreams          0.07%
Increased appetite    0.07%
Muscle weakness       0.05%
Muscle spasms         0.05%
Heart palpitations    0.05%
Chest pain            0.05%

Table 3: Top 20 side effects reported by patients for Xanax and their corresponding frequency percentages
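The percentage calculation of Equation (1), combined with the top-k selection, can be sketched as follows. The counts below are invented for illustration and are not the actual Xanax frequencies:

```python
from collections import Counter

def top_side_effects(se_frequencies, total_reviews, k=20):
    """Rank side effects by frequency and report PerD(X) = (f / T) * 100."""
    ranked = Counter(se_frequencies).most_common(k)
    return [(se, round(f / total_reviews * 100, 2)) for se, f in ranked]

# Invented counts, assuming 10,000 reviews in total for the drug
freqs = {"drowsiness": 103, "depression": 97, "memory loss": 74}
print(top_side_effects(freqs, 10000, k=3))
# [('drowsiness', 1.03), ('depression', 0.97), ('memory loss', 0.74)]
```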
4.2.2 Section II: Graphical Representation
Visual representations are usually more appealing and easily perceptible to patients
reviewing large volumes of data. This motivated us to represent the top side effects
for a drug in the form of a pie chart on our interface. Pie charts make a good
representation of data when the categories illustrate a proportion of the total.
Following the same example as before, Fig 11 shows the pie-chart distribution for
Xanax and its side effects.
4.2.3 Section III: User Testimonials
A survey conducted on a small group of patients showed that patients often seek
out the experiences of others on the same medication in order to feel reassured
about their own condition. Therefore, the final section of the webpage displays
user testimonials for every drug selected. These reviews are unfiltered,
first-hand information from patients who have used
Figure 11: Sample Pie chart representing the distribution of side-effects for Xanax
and experienced side-effects from various drugs. Although not verified by any
certified medical resource, these reviews are quite valuable to the layperson who
regards the Internet as the greatest source of a second medical opinion. Addition-
ally, selecting any of the side effects in the right column displays the testimonials
specific to that particular symptom.
4.3 Determining Top Drugs for each Symptom
Patients who take multiple medications are generally interested in knowing which
drugs cause a specific side-effect. This section of our work provides the necessary
reverse information. We examine an example of this utility by considering the
side-effect Dizziness. Table 4 lists the top 20 drugs for which patients have
reported Dizziness as one of the side-effects, along with the percentage of each
drug's reviews reporting that symptom. Fig 12 shows a pie-chart distribution of
the same.
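This reverse lookup can be sketched as below, with an in-memory stand-in for the UnigramSE/NGramSE tables; the toy counts are chosen so the percentages match the first two rows of Table 4:

```python
def drugs_reporting(symptom, se_table, review_counts, k=20):
    """For a symptom, rank drugs by the percentage of each drug's
    reviews mentioning it.

    se_table maps (drug, side_effect) -> frequency, mirroring the
    UnigramSE/NGramSE tables; review_counts maps drug -> total reviews."""
    rows = [(drug, round(f / review_counts[drug] * 100, 2))
            for (drug, se), f in se_table.items() if se == symptom]
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:k]

se_table = {("Yasmin", "dizziness"): 169,
            ("Effexor XR", "dizziness"): 145,
            ("Yasmin", "nausea"): 80}
counts = {"Yasmin": 5000, "Effexor XR": 5000}
print(drugs_reporting("dizziness", se_table, counts))
# [('Yasmin', 3.38), ('Effexor XR', 2.9)]
```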
Drug Name         % of patients reporting Dizziness
Yasmin            3.38%
Effexor XR        2.90%
Levaquin          2.70%
Mirena            2.69%
Lisinopril        2.33%
Cymbalta          2.14%
Effexor           1.95%
Lamictal          1.89%
Lyrica            1.69%
Toprol-XL         1.59%
Flagyl            1.50%
Wellbutrin        1.42%
Lexapro           1.36%
Paxil             1.27%
Coumadin          1.24%
Lipitor           1.23%
Buspar            1.23%
Zoloft            1.16%
WELLBUTRIN XL     1.14%
Topamax           1.13%

Table 4: Sample reverse mapping from side-effect → drug for Dizziness
Figure 12: Pie Chart distribution for Drugs reporting Dizziness as a side-effect
5 Discussion
5.1 Challenges & Solutions
In the previous sections we examined in detail all the phases involved in side-
effect extraction, as well as the representation of the extracted information in
the user interface. In this chapter, we discuss a few important issues that were
encountered and the solutions implemented to alleviate them. Although there
were several technical implementation difficulties, we restrict our discussion to
the two major challenges that pertain to the actual domain in context, namely
medical data. The topics of discussion include:
• Elimination of Cures appearing as top side-effect
• Synonym side-effects
The next two sections deal with the above two issues at length.
5.1.1 Eliminating Cures
In Section 4.2, we discussed the methodology behind determining the top side-
effects exhibited by a particular drug. The top side-effects are determined by
the frequency of occurrence of the term across all the reviews associated with
the drug. This is an essential consideration, since patients and internet users
show greater interest in knowing the most common side-effect than the most
serious one when on any medication. This frequency-based calculation, however,
posed one of the main challenges in this work. We first examine a few examples
in Table 5 which are indicators of the issue at hand. These were among the first
samples we noticed during experimentation that led to uncovering the issue, and
they therefore provide good insight into the problem.
Depakote, top side-effects:
Seizures – 15.8%, Weight gain – 7.47%, Hair loss – 2.74%, Depression – 1.93%, Mood Swings – 1.58%

Singulair, top side-effects:
Difficulty Breathing – 15.1%, Depression – 8.82%, Allergy – 8.80%, Anxiety – 8.59%, Mood Swings – 4.96%

Cymbalta, top side-effects:
Anxiety – 11.8%, Nausea – 3.20%, Weight gain – 2.35%, Insomnia – 2.14%, Sweating – 2.12%

Zoloft, top side-effects:
Depression – 15.2%, Weight gain – 2.62%, Insomnia – 1.41%, Nausea – 1.31%, Dry mouth – 1.22%

Table 5: Sample drugs and their top side-effects, which are indicators of issue 1 faced
In Table 5, the condition Seizures is reported by 15.8% of patients using Depakote,
when in fact the drug is used to treat/prevent seizures. Similarly, Zoloft treats
depression, Cymbalta treats anxiety and Singulair is used to ease breathing
difficulties.
It is also easy to spot that the top side-effect reported for each of these drugs
is significantly higher in frequency than the ones following it. This immediately
raised a red flag in our experiments, and we went on to investigate further. After
observing the trend across several drugs, the interesting finding was:
The side-effect appearing as the top reported effect was, in most cases, the
condition that the specific drug is expected to cure.
Considering that our calculation of top reported side-effects is frequency-based,
this is not a surprising result: patients writing testimonials about any drug are
bound to use the words describing the conditions that the drug treats.
To solve this issue, we implemented a 3-step process:
1. Crawl the official NIH (National Institutes of Health) website and extract a
list of drug-to-cures mappings.
2. For each drug, examine the list of side-effects generated and detect outliers
which have a significantly high frequency percentage compared to others in
the list.
3. Finally, cross-reference each of the outliers against the list obtained in Step 1
to detect possible cures. These symptoms are then flagged in the database
as a 'cure' and are restricted from appearing as a side-effect.
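Steps 2 and 3 together might look like the sketch below. The thesis does not spell out the exact outlier rule, so the z-score threshold of 1.5 standard deviations used here is an assumed stand-in:

```python
from statistics import mean, stdev

def flag_cures(drug, se_percentages, cures_map, z_cutoff=1.5):
    """Flag side effects that are both frequency outliers (Step 2) and
    listed as a cure for the drug in the NIH mapping (Step 3).

    The z_cutoff outlier rule is an assumption for this sketch."""
    values = list(se_percentages.values())
    mu, sigma = mean(values), stdev(values)
    flagged = []
    for se, pct in se_percentages.items():
        is_outlier = sigma > 0 and (pct - mu) / sigma > z_cutoff
        if is_outlier and se in cures_map.get(drug, set()):
            flagged.append(se)
    return flagged

# Depakote's top side effects from Table 5; 'seizures' is its cure
depakote = {"seizures": 15.8, "weight gain": 7.47, "hair loss": 2.74,
            "depression": 1.93, "mood swings": 1.58}
print(flag_cures("Depakote", depakote, {"Depakote": {"seizures"}}))
# ['seizures'] -- flagged as a cure and suppressed from the display
```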
The reason for cross-referencing against a list of cures is to ensure that a
genuinely highly reported side-effect for a drug does not get eliminated merely
for having a significantly large frequency and thereby becoming an outlier. The
only drawback of this approach is that there exists a set of drugs known to cause
the same side-effect as the condition the drug is expected to cure. To illustrate:
if Drug X is administered to treat a condition Y, but instead ends up causing a
higher degree of condition Y, condition Y gets eliminated and does not show up
as a potential side-effect despite being very highly reported by patients.
5.1.2 Synonymous Side-effects
The second major issue was the presence of synonymous side-effects among the
top 20 most frequent effects returned for each drug. Although this does not cause
any technical errors, it is still a noisy result, since the same effect appears in
multiple forms. For example, Headache, Headaches and Ache in Head are all
considered the same from the perspective of a user/patient, but the system treats
each of them as a unique side-effect and displays all three in the generated list.
Our solution is to incorporate a synonym-generator API called Big Huge Labs,
which provides multiple versions of a particular word. Based on the results
returned in the top side-effects, only one of the versions is chosen and displayed.
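The selection of a single representative per synonym group can be sketched as follows. The thesaurus lookup is stubbed with a small dictionary rather than an actual call to the Big Huge Labs API:

```python
def collapse_synonyms(ranked_effects, synonyms_of):
    """Keep one representative per synonym group: the first
    (highest-frequency) variant encountered wins.

    `ranked_effects` is ordered most-frequent first; `synonyms_of` is a
    stand-in for the thesaurus lookup."""
    seen = set()
    kept = []
    for se in ranked_effects:
        group = {se} | set(synonyms_of(se))
        if seen.isdisjoint(group):
            kept.append(se)
        seen |= group
    return kept

# Stubbed thesaurus: 'ache in head' is listed as a synonym of 'headache'
stub = {"headache": ["ache in head"], "ache in head": ["headache"]}
ranked = ["headache", "nausea", "ache in head"]
print(collapse_synonyms(ranked, lambda w: stub.get(w, [])))
# ['headache', 'nausea'] -- the lower-ranked variant is dropped
```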
Although this solution handles most cases, a few exceptions remain. The API only
generates valid synonyms, not singular and plural forms. This leads to cases where
Headache is recognized as a synonym of Ache in Head but not of Headaches. This
is an area of possible future work, where the problem can be tackled by examining
root words rather than the entire word in context.
5.2 Profoundness Score
During the literature-review phase, one of the main goals was to identify what
patients look for in medical websites. From extensive surveying of patient reviews
on the internet, a common requirement emerged: patients want comparative
analyses based on drugs or side-effects. Motivated by this, our system defines a
new parameter called the 'Profoundness Score' to project the importance of a
side-effect in the context of a particular drug.
Profoundness Score for a side-effect is defined as:
”The Z-score or Standard score calculated over a population of data corresponding
to a specific category of drugs or entire corpus of drugs.”
Before examining this term further, we review the concept of the z-score in a
general statistical context. A common way of standardizing data onto one scale
so that comparison can take place is the z-score; it acts as a common yardstick
for all types of data. Each z-score corresponds to a point in a normal distribution
and is sometimes called a normal deviate, since it describes how far a point
deviates from the mean. It is a dimensionless quantity derived by subtracting
the population mean from an individual raw score and then dividing the
difference by the population standard deviation. This conversion process is
called standardizing or normalizing.
Therefore, the Profoundness Score Pf for a side-effect SE in the context of a drug
D is calculated as follows:
If, fD(SE) = frequency of occurrence of SE in the reviews of drug D
µ(SE), σ(SE) = mean and standard deviation of the frequency of SE over the population
Then:
PfD(SE) = (fD(SE) − µ(SE))/σ(SE) (2)
The term 'population' in this context refers either to the entire set of drugs in
our corpus or to drugs belonging to a specific category.
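A minimal sketch of the calculation, using invented frequencies of one side effect across a small hypothetical population of seven drugs:

```python
from statistics import mean, pstdev

def profoundness_score(freq_for_drug, population_freqs):
    """Pf-score: z-score of one drug's frequency for a side effect,
    relative to that side effect's frequencies across the population
    (the whole corpus, or one drug category)."""
    mu = mean(population_freqs)
    sigma = pstdev(population_freqs)  # population standard deviation
    return round((freq_for_drug - mu) / sigma, 3)

# Invented frequencies of one side effect across 7 hypothetical drugs
population = [261, 40, 55, 30, 70, 25, 45]
print(profoundness_score(261, population))
# roughly 2.4 standard deviations above the population mean
```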
We next examine two examples where Zoloft is the drug under consideration, with
the Profoundness Score calculated over two different populations.
Population 1: Complete Drug corpus and Pf-score of top side-effects of
Zoloft
Table 6 shows the calculated Profoundness Scores of the top side-effects exhibited
by Zoloft. The calculations are based on the formula explained above. The
frequency of occurrence of each side-effect is recorded in the UnigramSE and
NGramSE tables and is therefore used as the datum in the individual calculations.
To interpret these scores, consider weight gain for Zoloft. The frequency of
occurrence of weight gain in Zoloft-related testimonials by patients is 261, giving
a Pf-score of 2.871. This means that the reported frequency of weight gain for
Zoloft lies almost 3 standard deviations above the mean frequency of weight gain
across the entire corpus; weight gain is thus heavily over-represented (more
profound) for Zoloft compared to other drugs. Similarly,
Side-Effect      Frequency   Profoundness Score
weight gain      261         2.871
insomnia         141         0.815
nausea           131         2.938
dry mouth        122         1.992
dizziness        75          1.755
mood swings      68          0.643
diarrhea         66          1.463
headache         59          1.517
yawning          56          0.435
sweating         45          0.803
vivid dreams     42          0.28
weight loss      42          0.388

Table 6: Top side-effects of Zoloft and their corresponding Pf-scores
insomnia for Zoloft, with a Pf-score of only 0.815, is far less profound compared
to effects such as weight gain and nausea.
Population 2: Comparative Analysis based on Drug Category
The corpus-wide analysis above gives a good sense of the importance of a side-
effect for a drug. Although statistically informative, it provides limited insight
for patients performing a comparative study of drugs within the same category.
For this reason, the second set of experiments restricted the population to drugs
of the same category. We examine an example where the category is
Anti-Depressants and the drugs in context are Cymbalta, Prozac and Zoloft.
The system determines the top side-effects for each drug and then calculates the
intersection set containing the side-effects reported for all three drugs, along
with their corresponding Profoundness Scores. Table 7 consolidates this example.
This analysis provides a way to compare side-effects caused by various drugs in
the same category, giving users an opportunity to explore their options before
making choices. The system also provides a feature to display the other major
side-effects for each drug, as shown in Table 8.
Side-effect      Pf-Score (Cymbalta)   Pf-Score (Prozac)   Pf-Score (Zoloft)
nausea           2.395                 0.239               0.663
weight gain      1.08                  0.062               1.618
Insomnia         1.805                 0.386               1.133
Sweating         2.79                  0.273               0.143
Dizziness        1.884                 0.471               0.6
Constipation     3.536                 0.627               0.119
Dry mouth        1.594                 0.164               1.495
Headache         1.717                 0.318               0.761
Weight loss      1.176                 0.027               0.279
Vivid dreams     1.532                 0.555               0.419

Table 7: Top common side-effects for drugs under the Anti-Depressants category
Other side effects of CYMBALTA   Other side effects of PROZAC   Other side effects of ZOLOFT
fever                            anxiety                        bleeding
bleeding                         bulimia                        spotting
ear pain                         hiccups                        dehydration
muscle twitching                 dry throat                     bulimia
light headedness                 muscle weakness                light headedness
stupor                           pelvic pain                    sores
infection                        stomach irritation             dread
muscle weakness                  reduced sleep                  bruxism
flushing                         rigid muscles                  coma
delusions                        abdominal discomfort           stomach irritation

Table 8: Representation of other major side-effects of the drugs in the Anti-depressants category
6 Conclusions and Future Work
In this project, we set out to solve the almost impossible task of interpreting
human reviews to extract the relevant terms in our domain. A set of sequential
tasks was accomplished in order:
• Starting with data collection, we gathered nearly 400,000 user reviews covering
nearly 2,500 drugs from the internet, which formed our medical data corpus.
The challenging part of this step was identifying the right sites in order to
build a quality dataset.
• Next came the parsing of this massive collection of data. The biggest
road-block in this phase was not only identifying the pattern of organization
of data on every website, but also bringing these varied structures together
into a common data model. This involved treating each site with a unique
parsing routine specific to its layout architecture.
• In the next phase, we iteratively processed the reviews to extract certain
patterns and build our dependency models, which essentially constitute the
database schema.
• Finally, we defined and implemented the side-effect extraction algorithm,
which gave us a unique list of nearly 900 side-effects extracted purely from
the patient reviews, without relying on fixed lists. This process helped us
uncover a few rare side-effects which are not generally reported by the
pharmaceutical companies or drug manufacturers.
At this point we have accomplished all the goals that we started out with, and
even stumbled upon a few interesting road-blocks that were not anticipated at the
beginning. We proposed methodologies for the following:
• Tackling the problem of synonymous side-effects
• Solving the interesting case of cures appearing as the top side-effect
Above all, one of the biggest goals we achieved in this undertaking is presenting
the results of the aggregations, distributions and other useful data in a web
interface. Our target audience is patients who take drugs and medications, under
the assumption that the patient using the interface has only basic knowledge of
using the internet to find the information they seek. We have therefore invested
considerable time and effort into making the front-end as friendly as possible.
The technical details and algorithms are confined to the scope of this thesis and
are not exposed in the interface.
This work is one of the first few undertakings in the direction of automatic
side-effect extraction in the medical domain. We believe that it is an excellent
foundation for future work focusing on some of the drawbacks of the system,
namely:
• Integrating a Stemmer to find root words and determining side-effects based
on that factor.
• Using various Clustering and similarity metrics to determine the categories of
reviews and also side-effects. This kind of grouping can give a possible insight
into the ’Seriousness’ of the side-effect as well.
• To deal with the noise associated with user data, we believe that implementing
a spell check on the patient reviews could greatly reduce the number of ’non-
words’ and thereby generate an even bigger list of unique side-effects.
Our system would act as a solid foundation for other such user review-oriented
fields including mining for Top Rated movies, determining Top Teams in various
sports or even finding ’Top Friends’ on social networking sites!