Knowledge Discovery from Biological and Clinical Data: BASIC BACKGROUND.
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
What is Data Mining?
Whose block is this?
Jonathan’s blocks
Jessica’s blocks
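The block example can be read as a tiny rule-based classifier: a block that is blue or a circle belongs to Jonathan, and everything else belongs to Jessica. A minimal sketch follows; the `Block` class, its fields, and the sample blocks are illustrative assumptions, not part of the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    # Hypothetical representation of a toy block.
    color: str
    shape: str


def owner(block: Block) -> str:
    """Apply the two rules from the slide to decide whose block this is."""
    # Jonathan's rule: blue or circle.
    if block.color == "blue" or block.shape == "circle":
        return "Jonathan"
    # Jessica's rule: all the rest.
    return "Jessica"


# Made-up blocks, for illustration only.
blocks = [
    Block("blue", "square"),
    Block("red", "circle"),
    Block("red", "square"),
    Block("green", "triangle"),
]

for b in blocks:
    print(f"{b.color} {b.shape} -> {owner(b)}")
```

Data mining runs this in the opposite direction: given only the labelled blocks, recover rules like these.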
• Complete genomes are now available
• Knowing the genes is not enough to understand how biology functions
• Proteins, not genes, are responsible for many cellular activities
• Proteins function by interacting with other proteins and biomolecules
[Figure: GENOME, PROTEOME, INTERACTOME]
Driving Forces: Genes, Proteins, Interactions, Diagnosis, & Cures
If we figure out how these work, we get these benefits:
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science
To figure these out, we bet on...
“solution” = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
Predict Epitopes, Find Vaccine Targets
• Vaccines are often the only solution for viral diseases
• Finding & developing effective vaccine targets (epitopes) is a slow and expensive process
• Develop systems to recognize protein peptides that bind MHC molecules
• Develop systems to recognize hot spots in viral antigens
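One simple way to recognize candidate binding peptides is a position-specific scoring matrix (PSSM) built from known binders and slid along an antigen sequence. The slides do not say which method was used, so the sketch below is only an illustration of the general idea. The peptides and the protein sequence are invented, and the uniform background and pseudocount are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Invented 9-mer "binders", for illustration only (not real MHC ligands).
HYPOTHETICAL_BINDERS = ["KLAEDGHTV", "ALSPRWYFV", "GLMNQCTIL", "YLDEKHSAV"]


def build_pssm(binders, pseudocount=1.0):
    """Log-odds score per position, against a uniform background (assumption)."""
    length = len(binders[0])
    total = len(binders) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(peptide[pos] for peptide in binders)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Background frequency is 1/20, so the ratio is freq * 20.
            column[aa] = math.log2(freq * len(AMINO_ACIDS))
        pssm.append(column)
    return pssm


def score(pssm, peptide):
    """Sum the per-position log-odds scores of one peptide."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))


def scan(pssm, protein):
    """Score every window of the protein; return (score, start, window), best first."""
    k = len(pssm)
    hits = [
        (score(pssm, protein[i:i + k]), i, protein[i:i + k])
        for i in range(len(protein) - k + 1)
    ]
    return sorted(hits, reverse=True)


pssm = build_pssm(HYPOTHETICAL_BINDERS)
hypothetical_antigen = "MKTAYLAEDGHTVQRS"  # invented sequence
for s, start, window in scan(pssm, hypothetical_antigen)[:3]:
    print(f"{window} at {start}: {s:.2f}")
```

Windows that rank highly are candidates for follow-up, not confirmed epitopes.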
Recognize Functional Sites, Help Scientists
• Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments
• Data mining of biological sequences to find rules for recognizing & understanding functional sites
Dragon’s 10x reduction of TSS (transcription start site) recognition false positives
Diagnose Leukaemia, Benefit Children
• Childhood leukaemia is a heterogeneous disease
• Treatment is based on subtype
• 3 different tests and 4 different experts are needed for diagnosis
Curable in USA, fatal in Indonesia
• A single platform diagnosis based on gene expression
• Data mining to discover rules that are easy for doctors to understand
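To make "rules that are easy for doctors to understand" concrete, here is a minimal sketch that searches for the single-gene threshold rule that best separates one subtype from the rest. It is not the method behind the slides. The gene names, expression values, and subtype labels are all made up.

```python
def best_single_gene_rule(expression, labels, target):
    """Return (accuracy, gene, threshold, direction) for the best one-gene rule.

    A rule reads: "gene <direction> threshold  =>  sample is `target`".
    """
    n = len(labels)
    best = None
    for gene, values in expression.items():
        levels = sorted(set(values))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(levels, levels[1:]):
            threshold = (low + high) / 2
            for direction in (">", "<="):
                if direction == ">":
                    predicted = [v > threshold for v in values]
                else:
                    predicted = [v <= threshold for v in values]
                correct = sum(
                    p == (label == target) for p, label in zip(predicted, labels)
                )
                candidate = (correct / n, gene, threshold, direction)
                # Keep the first rule that reaches the highest accuracy.
                if best is None or candidate[0] > best[0]:
                    best = candidate
    return best


# Made-up expression levels for six patients and two hypothetical genes.
expression = {
    "GENE_A": [0.9, 1.1, 1.0, 0.95, 1.05, 1.0],
    "GENE_B": [5.0, 6.0, 5.5, 1.0, 1.5, 0.5],
}
labels = ["subtype_1"] * 3 + ["subtype_2"] * 3

accuracy, gene, threshold, direction = best_single_gene_rule(
    expression, labels, "subtype_1"
)
print(f"IF {gene} {direction} {threshold} THEN subtype_1 (accuracy {accuracy:.0%})")
```

A rule of this form can be checked by eye against a patient's expression values. Accuracy here is measured on the training samples only, so a real system would need held-out validation.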
Understand Proteins, Fight Diseases
• Understanding function and role of protein needs organised info on interaction pathways
• Such information is often reported in scientific papers but is seldom found in structured databases
• Knowledge extraction system to process free text
– extract protein names
– extract interactions
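The two extraction steps can be sketched with a name lexicon and a verb pattern. This is a deliberately naive illustration, not the system described in the slides. The lexicon, the verb list, and the sample sentences are assumptions chosen for the example.

```python
import re

# Small illustrative lexicon; a real system needs a far larger dictionary
# or a trained name recognizer.
PROTEIN_LEXICON = {"p53", "MDM2", "BRCA1", "RAD51"}
INTERACTION_VERBS = ("binds", "interacts with", "phosphorylates")


def extract_protein_names(text):
    """Return lexicon names that appear as whole words in the text."""
    return sorted(
        name for name in PROTEIN_LEXICON
        if re.search(rf"\b{re.escape(name)}\b", text)
    )


def extract_interactions(text):
    """Return (protein, verb, protein) triples matching 'A <verb> B'."""
    names = "|".join(re.escape(n) for n in PROTEIN_LEXICON)
    verbs = "|".join(re.escape(v) for v in INTERACTION_VERBS)
    pattern = re.compile(rf"\b({names})\b\s+({verbs})\s+\b({names})\b")
    return [match.groups() for match in pattern.finditer(text)]


# Illustrative sentences written for this example.
text = "MDM2 binds p53 in the nucleus. BRCA1 interacts with RAD51 during repair."
print(extract_protein_names(text))
print(extract_interactions(text))
```

A pattern like this misses passive voice, negation, and interactions stated across sentences, which is why free-text extraction is treated as a research problem.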
• Objectives
– Translate inspiration from biological systems into advancement of life and computing sciences
– Advance data mining technologies in decision systems for complex problems
Direction & Plan
• To work on practical systems for
– data mining
– data cleansing
– knowledge extraction
• Applied to
– gene regulation
– protein interaction
– clinical data analysis
– ligand-receptor interaction
[Figure: configurations a and b]
It seems that configuration a is less likely than b. Can we exploit this?
E.g., How to Get More Out of the Same Experiments?
• How to recognize false positives from two-hybrid and other types of high-throughput protein interaction experiments?
• Some initial thoughts:
E.g., How to Improve Classifier Algorithms?
• SVM, ANN, etc.
– Good accuracy,
– but not easy to understand
• C4.5, CART, etc.
– Clear rules,
– but lower accuracy
• Why can’t we have a classifier algorithm that
– handles high dimension
– achieves high accuracy
– provides understandable rules