Data Mining in the Pharmaceutical Industry
Introduction
• Data mining is the process of extracting information from large data sets through the use of algorithms and techniques drawn from the fields of statistics, machine learning and database management systems.
• “Mining” means to find something that already exists.
• Therefore, data mining can be defined as the process of identifying hidden patterns, relationships and trends within data.
• Traditional methods often involve:
1) manual work
2) interpretation of data.
• These are slow, expensive and highly subjective.
• Data mining, popularly known as knowledge discovery in large data, enables organizations to make calculated decisions by
• assembling,
• accumulating,
• analyzing and
• accessing corporate data.
[Figure: the knowledge discovery process. Labels: Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Understanding, Knowledge]
• The scope of pharmaceutical applications is large; it may involve drug manufacturing processes as well as data processing.
• Data processing and analysis is a key area in the pharmaceutical industry.
• Data mining supports a vision of the pharmaceutical industry in which companies not only deliver drugs but also develop test kits (including genetic tests) and computer programs to deliver the best drug to the patient.
Pharmaceutical companies can also apply data mining methods to huge masses of genomic data to predict how a patient's genetic makeup determines his or her response to a drug therapy.
Genomic data: data describing the genome, that is, the complete set of chromosomal and extrachromosomal genes of an organism, a cell, an organelle or a virus; the complete DNA component of an organism.
A variety of tools is used, such as:
• Query and reporting tools.
• Analytical processing tools: used to analyze database information from multiple database systems at one time.
• Decision support system (DSS) tools: interactive computer-based systems intended to help decision makers use data and models to identify problems, solve problems and make decisions.
DATA MINING TECHNIQUES
• Many organizations generate mountains of data about their newly discovered drugs, their performance reports, etc.
• This data is a strategic resource. Making the most of this resource will lead to improved quality in the pharma industry.
• The data mining process can be described in six important steps:
1. Problem definition.
2. Knowledge acquisition.
3. Data selection.
4. Data preprocessing.
5. Analysis and interpretation.
6. Reporting and use.
Alternatively, the data mining process can be identified as:
1. Definition of the objectives of the analysis.
2. Selection and pretreatment of the data.
3. Exploratory analysis.
4. Specification of the statistical methods.
5. Analysis of the data.
6. Evaluation and comparison of methods.
7. Interpretation of the chosen model.
1. Definition of the objectives of the analysis.
Understanding the project objectives and
requirements from a business perspective and then
converting this knowledge into a data mining
problem definition with a preliminary plan
designed to achieve the objectives.
Relevant data sources for the pharma industry are:
•clinical data (patient data, pharmaceutical data,
medical treatments, length of stay);
•administrative data (staff skills, overtime, nursing
care hours, staff sick leave);
• financial data (treatment costs, drug costs, staff
salaries, accounting, cost-effectiveness studies); and
• organizational data (room occupation, facilities,
equipment).
Data mining is used to support:
•The clinicians at the point of care delivery;
•The controlling of clinical treatment pathways;
•The administrative and management tasks; and
•Efficient management of organizational and
financial data.
Associations, Mining Frequent Patterns
• These methods identify rules of affinity among collections of items.
• Rules of affinity: relationships among data items whose patterns occur frequently in the data being mined.
• Applications of association rules include:
• market basket analysis
• attached mailing in direct marketing
• fraud detection
• department store floor/shelf planning, etc.
• In the pharma industry, examples include:
• association of diseases with drugs
• association and analysis of staff movements
• tracking physicians' adoption of drugs through customers' prescriptions
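The idea of frequent patterns and rules of affinity can be made concrete with the two usual measures, support and confidence. The following is a minimal sketch, not a production algorithm; the prescription "baskets" and the drug names are made up for illustration.

```python
from itertools import combinations

# Hypothetical prescription "baskets": each set lists drugs dispensed together.
baskets = [
    {"drug_a", "drug_b"},
    {"drug_a", "drug_b", "drug_c"},
    {"drug_a", "drug_c"},
    {"drug_b", "drug_c"},
    {"drug_a", "drug_b"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def frequent_pairs(baskets, min_support):
    """All item pairs whose support is at least min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for pair in combinations(items, 2):
        s = support(set(pair), baskets)
        if s >= min_support:
            result[pair] = s
    return result

if __name__ == "__main__":
    print(frequent_pairs(baskets, min_support=0.5))
    print(confidence({"drug_a"}, {"drug_b"}, baskets))
```

A rule such as "drug_a implies drug_b" is usually kept only if both its support and its confidence exceed thresholds chosen by the analyst.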
Classification and Prediction
• Classification and prediction are two data analysis techniques used to describe data classes and to predict future data classes.
• E.g. a credit card company whose customers' credit history is known can classify its customer records as Good, Medium or Poor.
• Predicting consumer behavior
• Predicting the likelihood of success in a drug adoption process
• Predicting the percentage accuracy in the performance of a drug
• Classifying historical health records
• Predicting which types of drugs are most likely to be retained, most likely to be dropped, and most likely to change their composition
• Predicting pharma product behavior and attitude
• Predicting demand projections by seasonal variations
• Predicting the performance progress of segments throughout the performance period
• Identifying the best profile for different drugs
• Classifying trends of movement through the organization for successful/unsuccessful patient historical records
• Categorizing drugs, diseases and patients
• Classification schemes based on decision trees and neural networks are very useful in the pharma industry.
• Decision trees: a decision tree is a common knowledge representation used for classification.
• In classification, one is given data about a specific instance, and the decision tree predicts, based on that data, which of two or more classes the instance belongs to.
• Each instance contains data from multiple attributes.
• The training instances are collections of previously acquired data that have been sorted into class labels.
• The tree is built by determining which tests best divide the instances into separate classes.
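To illustrate how a tree-building procedure can decide "which tests best divide the instances", the sketch below scores every candidate threshold test by weighted Gini impurity and returns the best one. Gini impurity is one common choice and is an assumption here, since the slides do not name a criterion; the patient rows and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (attribute index, threshold, weighted Gini) of the best binary test."""
    best = None
    n = len(rows)
    for attr in range(len(rows[0])):
        for threshold in sorted({row[attr] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[attr] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[attr] > threshold]
            if not left or not right:
                continue  # a test that separates nothing is useless
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (attr, threshold, score)
    return best

# Hypothetical instances: (age in years, weight in kg) with a response label.
rows = [(25, 60), (32, 82), (47, 70), (51, 90), (62, 65), (70, 77)]
labels = ["yes", "yes", "yes", "no", "no", "no"]

if __name__ == "__main__":
    print(best_split(rows, labels))
```

A full decision tree would apply `best_split` recursively to the left and right groups until each group is pure or too small to split further.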
• Neural networks
– Learn through training
– Resemble biological networks in structure
– Can produce very good predictions
– Are not easy to use or to understand
– Cannot deal with missing data
Bayesian Neural Network
• Uses a Bayesian neural network.
• The prior probability is the probability that any report contains a reference to the adverse event.
• The posterior probability is the probability that a report has a link between the drug and the adverse event.
• Determines the "strength" of the link between adverse event and drug (called the Information Component, or IC).
• More complicated than it appears: a patient may consume multiple drugs, so which one caused the adverse event?
[Figure: Bayesian neural network connecting Drug and Adverse Event; the output is the strength of the link between adverse event and drug]
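The Information Component is commonly written as the base-2 logarithm of the ratio between the posterior and the prior probability. The sketch below computes a plain point estimate from report counts; it leaves out the Bayesian smoothing a real system would apply, and the function name and counts are hypothetical.

```python
import math

def information_component(n_drug_event, n_drug, n_event, n_total):
    """Point estimate of IC = log2(posterior / prior).

    prior     = share of all reports that mention the adverse event
    posterior = share of reports on this drug that mention the adverse event
    """
    prior = n_event / n_total
    posterior = n_drug_event / n_drug
    return math.log2(posterior / prior)

if __name__ == "__main__":
    # Hypothetical counts: 10,000 reports in total, 200 mention the drug,
    # 500 mention the event, and 40 mention both.
    print(information_component(n_drug_event=40, n_drug=200, n_event=500, n_total=10_000))
```

A positive IC means the event is reported with the drug more often than the background rate would suggest; a value near zero means no apparent link. It does not resolve the multiple-drug problem noted above.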
• Classification works on discrete, unordered data, while prediction works on continuous data.
• E.g. discrete data: the following data set shows a group of discrete data.

Music format    Number sold
CD albums       140
CD singles      70
Downloads       55
Vinyl           5
Total sales     270

• This is called discrete data because the units of measurement (for example, CDs) cannot be split up; there is nothing between 1 CD and 2 CDs.
• E.g. continuous data:

Distance in miles: 0.1, 0.2, 0.6, 1.1, 1.2, 1.8, 2.0, 2.7, 3.4, 4.6, 6.2, 8.0, 12.1, 14.2

• This data is called continuous because the scale of measurement (distance) has meaning at all points between the numbers given, e.g. we can travel a distance of 1.2, 1.85 or even 1.632 miles.
• Regression is often used, as it is a statistical method for numeric prediction.
• Primary emphasis should be placed on selection, measurement accuracy and predictive efficiency in any new drug discovery.
• Simple or multiple regression is the basic prediction model that enables a decision maker to forecast each criterion status based on predictor information.
• Neural network technology is also useful in different areas of business.
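As a small illustration of numeric prediction, the sketch below fits a simple (one-predictor) least-squares regression line and uses it to forecast a new value. The dose and response numbers are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Forecast the criterion value for predictor value x."""
    return slope * x + intercept

# Hypothetical data: dose in mg (predictor) and measured response (criterion).
doses = [10, 20, 30, 40, 50]
responses = [12, 22, 32, 42, 52]

if __name__ == "__main__":
    slope, intercept = fit_line(doses, responses)
    print(slope, intercept, predict(60, slope, intercept))
```

Multiple regression follows the same principle with several predictors, and is normally done with a statistics library rather than by hand.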
CLUSTERING
• Clustering is a method by which similar records are grouped together.
• Clustering is usually used to mean segmentation.
• An organization can build a hierarchy of classes that group similar events.
• Using clustering, patients can be grouped based on age, name, diseases, etc.
• In business, clustering helps identify groups with similarities, e.g. to characterize customer groups based on purchasing patterns.
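Grouping patients by age can be sketched with a one-dimensional k-means loop. k-means is only one of many clustering methods and is an assumption here; the ages and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small 1-D k-means: returns (final centers, clusters of values)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical patient ages and two starting guesses for the group centers.
ages = [22, 25, 27, 61, 64, 68]

if __name__ == "__main__":
    centers, clusters = kmeans_1d(ages, centers=[20, 70])
    print(centers, clusters)
```

With real patient records the same idea is applied to several attributes at once, usually after scaling them to comparable ranges.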
DATA MINING AND STATISTICS
• The ability to build a successful predictive model depends on past data.
• Data mining is designed to learn from past successes and failures in order to predict what will happen next (future prediction).
• The data mining tool checks the statistical significance of the predicted patterns and reports it.
The difference between data mining and statistics
• Data mining automates statistical processes that would otherwise require several tools.
• Statistical inference is assumption driven, in the sense that a hypothesis is formed and tested against data.
• Data mining, in contrast, is discovery driven; that is, the hypothesis is automatically extracted from the given data.
Data mining can answer analytical questions such as:
• Which new molecules have been discovered, and what issues surround them?
• What factors or combinations of factors directly affect the drugs?
• Which are the best-performing drugs?
• Which drugs are likely to be retained?
• How can resources be allocated optimally to ensure effectiveness and efficiency?
• An intelligent text mining system could provide a platform for extracting and managing specific information at the entity level.
• E.g. information pertaining to
• genes
• proteins
• diseases
• organisms
• chemical substances, etc. can be analytically extracted for patterns.
It would also provide insights into interrelationships such as
• protein-protein,
• gene-gene,
• protein-chemical,
• gene-disease and
• drug-drug interactions.
• Text mining can be applied to biomedical literature, clinical documents and other medical literary sources for data curation and database population in a semi-automated manner.
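One simple way to surface candidate entity relationships from text is to count how often two known entities appear in the same sentence. The sketch below uses a tiny hand-made lexicon; the entity names and sentences are placeholders, and real systems use far richer entity recognition.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical lexicon mapping entity names to entity types.
LEXICON = {"geneA": "gene", "geneB": "gene", "drugX": "drug", "diseaseY": "disease"}

def cooccurrences(sentences, lexicon):
    """Count sentence-level co-occurrences of lexicon entities."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(re.findall(r"\w+", sentence))
        found = sorted(token for token in tokens if token in lexicon)
        for a, b in combinations(found, 2):
            counts[(a, b)] += 1
    return counts

# Placeholder sentences standing in for biomedical literature.
sentences = [
    "geneA expression is altered in diseaseY.",
    "drugX inhibits geneA in diseaseY models.",
    "geneB was not associated with diseaseY.",
]

if __name__ == "__main__":
    for pair, n in cooccurrences(sentences, LEXICON).most_common():
        print(pair, n)
```

Co-occurrence only suggests that two entities are discussed together; note that the third sentence is a negation, which is why curation remains semi-automated.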
Applications of Data Mining in the Pharmaceutical Industry
• A lot of information is hidden in legacy systems.
• This information can be extracted with data mining.
• Most of the time this cannot be done directly from the legacy systems, because they were not built to answer unpredictable questions.
• A user-interface may be designed to accept all kinds
of information from the user (e.g. weight, sex, age,
foods consumed, reactions reported, dosage, length of
usage).
• Then, based upon the information in the databases and the relevant data entered by the user, a list of warnings or known reactions (accompanied by probabilities) should be reported.
• Note that user profiles can contain large amounts of
information, and efficient and effective data mining
tools need to be developed to probe the databases for
relevant information.
• Secondly, the patient's (anonymous) profile should be recorded along with any adverse reactions reported by the patient, so that future correlations can be reported.
• Over time, the databases will become much larger, and interaction data for existing medicines will become more complete.
• The amount of existing pharmaceutical information (pharmacological properties, dosages, contraindications, warnings, etc.) is enormous; however, this fact reflects the number of medicines on the market, rather than an abundance of detailed information about each product.
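The warning list with probabilities described above could, in its simplest form, be an empirical frequency table over recorded profiles that match the user's input. The sketch below matches on drug and sex only; the record layout, field names and data are all hypothetical.

```python
from collections import Counter

def reaction_probabilities(records, drug, sex):
    """Share of matching profiles that reported each reaction."""
    matching = [r for r in records if r["drug"] == drug and r["sex"] == sex]
    if not matching:
        return {}  # nothing recorded yet for this kind of profile
    counts = Counter(r["reaction"] for r in matching if r["reaction"])
    return {reaction: n / len(matching) for reaction, n in counts.items()}

# Hypothetical anonymous profiles; reaction is None when none was reported.
records = [
    {"drug": "drug_a", "sex": "F", "reaction": "drowsiness"},
    {"drug": "drug_a", "sex": "F", "reaction": None},
    {"drug": "drug_a", "sex": "F", "reaction": "drowsiness"},
    {"drug": "drug_a", "sex": "F", "reaction": "nausea"},
    {"drug": "drug_a", "sex": "M", "reaction": "nausea"},
    {"drug": "drug_b", "sex": "F", "reaction": "rash"},
]

if __name__ == "__main__":
    print(reaction_probabilities(records, drug="drug_a", sex="F"))
```

As more profiles are recorded, these estimates become more reliable, which matches the point that interaction data becomes more complete over time.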
One of the major problems with pharmaceutical
data is a lack of information.
• A food and drug administration department estimated that only about 1% of serious events are reported to it.
• Fear of litigation may be a contributing factor; however, most health care providers simply don't have the time to fill out reports of possible adverse drug reactions.
• Furthermore, it is expensive and time consuming for pharmaceutical companies to perform a thorough job of data collection, especially when most of the information is not required by law.
• Finally, one should note that the food and drug administration department does not require manufacturers to test new medicines for potential interactions.
Four stages of drug development
• Research finds new drugs.
• Development tests and predicts drug behavior.
• Clinical trials test the drug in humans.
• Commercialization takes the drug and sells it to likely consumers (doctors and patients).
APPLICATIONS OF DATA MINING IN THE PHARMACEUTICAL INDUSTRY
1) Clinical data analysis – evaluates and streamlines large amounts of information. Data mining helps to see trends, irregularities and risks during product development and launch.
2) Marketing and sales analysis – identifies the most profitable products and guides the allocation of marketing funds. Data mining here helps to examine consumer behavior in terms of prescription renewals and product purchases.
3) Customer analysis – using data mining, one can develop more targeted customer profiles that focus not only on products but also on the ability to pay for them, by analyzing historical health trends in combination with demographics.
4) Physician targeting – physicians who have high prescription rates for a certain drug or treatment can be targeted with information on new drugs that treat complementary symptoms or conditions.
DEVELOPMENT OF NEW DRUGS
• New drug development can be supported by clustering molecules into groups according to their chemical properties via cluster analysis.
• Every time a new molecule is discovered, it can be grouped with other chemically similar molecules.
• Mining can help us measure the chemical activity of a molecule against a specific disease, say tuberculosis, and find out which part of the molecule causes the action.
• This way we can combine a vast number of molecules into a "super molecule" that keeps only the specific part responsible for the action and inhibits the other parts.
• This would greatly reduce the adverse effects associated with drug actions.
• Researchers use high-speed screening to test tens, hundreds or thousands of drugs very quickly.
• The general goal is to find activity on relevant genes or to find drug compounds that have desirable characteristics.
• The data mining techniques used in developing new drugs are clustering, classification and neural networks.
• The basic objective is to determine compounds with similar activity, because compounds with similar activity behave similarly.
• This is possible only when we have a known compound and are looking for something better.
• When we have no known compound but have a desired activity and want to find a compound that exhibits it, data mining comes to the rescue.
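Grouping a new molecule with chemically similar ones requires some similarity measure. A widely used choice, assumed here since the slides do not name one, is the Tanimoto (Jaccard) coefficient over sets of structural features. The compound names and feature sets below are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(query, library):
    """Name of the library compound most similar to the query feature set."""
    return max(library, key=lambda name: tanimoto(query, library[name]))

# Hypothetical library: compound name -> set of structural features.
library = {
    "cmpd_1": {"ring", "amine", "hydroxyl"},
    "cmpd_2": {"ring", "carboxyl"},
    "cmpd_3": {"ring", "amine", "hydroxyl", "halogen"},
}

if __name__ == "__main__":
    new_molecule = {"ring", "amine", "hydroxyl", "methyl"}
    print(most_similar(new_molecule, library))
```

A clustering method can then use such pairwise similarities to form groups of compounds that are expected to show similar activity.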
DEVELOPMENT TESTS AND PREDICTS DRUG BEHAVIOR
• Several issues affect the success of a drug and can impact its future development:
1) Adverse reactions to drugs are reported spontaneously and not in any organized manner.
2) We can only compare adverse reactions among the drugs of our own company, not with drugs from competing firms.
3) We only have information on the patient taking the drug, not on the adverse reaction that the patient is suffering from.
Solution
• All of this can be solved by creating a data warehouse for drug reactions and running business intelligence (BI) tools on it.
• BI tools: business intelligence tools are software designed to retrieve, analyze and report data.
• This broad definition includes everything from spreadsheets, visual analytics and querying software to data mining, warehousing and decision engineering.
• The drug undergoes testing in animals and human tissue to observe its effect and to determine how much of the drug must be consumed for the desired effect and how dangerous the drug is.
• The data mining techniques that can be used here are classification and neural networks.
• The goal here is to predict whether a treatment will aid patients; if a drug will not aid patients, it serves no purpose.
• Predicting drug behavior is feasible when we have data supporting use of the drug and training data that shows the effects of the drug (positive or negative).
• The test should be able to predict which patients will benefit, for example which treatments help sickle cell anemia patients.
How it works
• Information such as gender, body weight, disease state, etc. plays a crucial role.
• This data is fed into a neural network, which predicts whether the patient will benefit from the drug.
• Each training example carries one of only two classifications, yes or no.
• The network is trained on the yes classifications and a snapshot of the neural network is taken.
• Then the network is trained on the no classifications and another snapshot is taken.
• The output is yes or no, depending on whether the inputs are more similar to the yes or the no training data.
• E.g. ARTMAP.
[Figure: neural network with inputs Weight, Height, Gender and Blood Pressure and output "Patient Benefits?"]
• Imagine an array of weights, one for each "template".
• The template closest to the input is chosen.
• The path of "least resistance" is chosen for the output.
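The "closest template" idea can be sketched with a nearest-centroid classifier: one template is the average of the yes training rows, the other the average of the no rows, and a new patient gets the label of the nearer template. This is a deliberately simplified stand-in, not ARTMAP itself, and the patient values are hypothetical.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of equal-length numeric rows."""
    return [sum(column) / len(column) for column in zip(*rows)]

def classify(patient, templates):
    """Label of the template with the smallest Euclidean distance to patient."""
    return min(templates, key=lambda label: math.dist(patient, templates[label]))

# Hypothetical training rows: (weight in kg, height in cm, systolic pressure in mmHg).
benefited = [(60, 165, 118), (65, 170, 122), (62, 168, 120)]
not_benefited = [(95, 175, 150), (100, 180, 155), (98, 178, 152)]

# One "template" per class, playing the role of the two snapshots.
templates = {"yes": centroid(benefited), "no": centroid(not_benefited)}

if __name__ == "__main__":
    print(classify((63, 167, 121), templates))
    print(classify((97, 177, 151), templates))
```

In practice the inputs would be scaled to comparable ranges first, and categorical inputs such as gender would need a numeric encoding.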
CLINICAL TRIALS TEST THE DRUG IN HUMANS
• The company tests drugs in actual patients on a larger scale.
• The company has to keep track of data about patient progress.
• Because the government wants to protect the health of citizens, many rules govern clinical trials.
• In developed countries, a food and drug administration oversees trials.
• The data mining technique used here can be neural networks.
• Here data is collected by the pharmaceutical company and undergoes statistical analysis to determine the success of the trial.
• Data is generally reported to the food and drug administration department and inspected closely.
• Too many negative reactions might indicate that the drug is too dangerous.
• An adverse event might be, for example, a medicine causing drowsiness.
• The goal is to detect when too many adverse events occur, or to detect a link between a drug and an adverse event.
• Too many adverse events linked to a drug might indicate that the drug is too dangerous or that the health of patients is at risk.
• Adverse events are reported to the food and drug administration when a link is suspected.
• One can feed information on adverse events pertaining to drugs into a neural network and let the network lead us to what is meant by "too many".
Benefits
• Research stage – instead of trial and error, data mining can help find drugs that have desirable activity.
• Development stage – data mining can help predict who will benefit from the drug.
• Clinical trials stage – data mining protects patients and helps regulate drug testing.
• Commercialization stage – data mining can optimize the use of sales resources such as manpower and advertising.
CONCLUSION
• Due to increased computerization and consumer/patient awareness, reporting (via the internet) by health care workers can easily be facilitated.
• Data collection in hospitals and extended care facilities is not
difficult, and this information is of high quality since such
institutions typically have tailored diets for their patients, and
maintain accurate records of treatments, lab tests, and
administration of prescriptions.
• Furthermore, given the popularity of the internet, it is
relatively easy for consumers to voluntarily fill in and submit
detailed profiles of themselves.
• It is mostly observed that data mining techniques are seldom used in a pharmaceutical environment.
• Data mining can help find drugs that have desirable activity and predict who will benefit from a drug.
• Data mining protects patients, helps regulate drug testing and optimizes the use of sales resources such as manpower and advertising.