Complex Networks and Data Mining on genetic databases

Knowledge Discovery in Databases through Complex Networks: application to phylodynamics. Luiz Max F. de Carvalho, Scientific Computing Programme (PROCC), Fiocruz; Pan American Center for Foot-and-Mouth Disease (PAHO/WHO). WaFiS 2012, September 28, 2012.

description

This is my presentation at WaFiS last year, on the use of complex networks for data mining in genetic databases.

Transcript of Complex Networks and Data Mining on genetic databases

Page 1: Complex Networks and Data Mining on genetic databases

Who’s this guy?

Knowledge Discovery in Databases through Complex Networks: application to phylodynamics

Luiz Max F. de Carvalho

Scientific Computing Programme (PROCC), Fiocruz

Pan American Center for Foot-and-Mouth Disease (PAHO/WHO)

WaFiS 2012

September 28, 2012

Page 2: Complex Networks and Data Mining on genetic databases

Outline

1 Knowledge Discovery in Databases (KDD)

2 Complex Networks

3 Example 1: Chitin pathway phylogeny

4 Example 2: Foot-and-mouth disease virus in South America

Page 3: Complex Networks and Data Mining on genetic databases

Knowledge Discovery in Databases (KDD)

Lots of data

The human brain has very limited processing capacity

Information → Knowledge

Increasing amount of molecular data (sequences, 3D structures, antigenicity, ...)

Is it possible to explore these databases to discover useful stuff?

Page 4: Complex Networks and Data Mining on genetic databases

Well... Let’s see

Page 5: Complex Networks and Data Mining on genetic databases

[We may use] Complex Networks

Graphs → G = (V, E)

Page 6: Complex Networks and Data Mining on genetic databases

Yeah, but how?

We can explore the “dynamic signature” of these Complex Networks, i.e., study and compare their structural properties. Some useful formulas:

Clustering coefficient: $\langle c \rangle = \frac{3 \times \#\text{triangles}}{\#\text{triples}}$

Degree distribution: $P_K = \sum_{K'=K}^{\infty} p_{K'}$

Diameter: $\max(d(i, j))$
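The three indexes above can be computed directly from an adjacency structure. The sketch below is only an illustration, not the code used for the talk: it assumes an undirected, connected, non-empty graph stored as a dict of neighbor sets, and the function names and the toy graph `toy` are made up for this example. It reads P_K as the fraction of nodes with degree at least K, which is what the sum over K′ ≥ K gives.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adj):
    """Global clustering coefficient: 3 * #triangles / #connected triples."""
    closed = 0   # linked neighbor pairs; each triangle is seen once per vertex, so this is 3 * #triangles
    triples = 0  # connected triples, counted at their central vertex
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


def cumulative_degree_distribution(adj):
    """P_K = sum of p_K' for K' >= K, i.e. the fraction of nodes with degree >= K."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    n = len(degrees)
    return {k: sum(d >= k for d in degrees) / n for k in range(max(degrees) + 1)}


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter(adj):
    """Largest shortest-path length over all reachable pairs."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(clustering_coefficient(toy))
print(cumulative_degree_distribution(toy))
print(diameter(toy))
```

For a disconnected graph this `diameter` only considers reachable pairs, which is one possible convention among several.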

Page 7: Complex Networks and Data Mining on genetic databases

OK, let's work then

1 Grab n sequences;

2 Create an n × n matrix using some kind of (normalized) distance (say, S);

3 For each σ ∈ [0, 1] build M(σ) such that:

$$m_{ij}(\sigma) = \begin{cases} 1 & \text{if } S_{ij} > \sigma, \\ 0 & \text{if } S_{ij} < \sigma. \end{cases}$$

In a sense, we are transforming a single network into a family of networks.
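The thresholding step can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: S is a plain list of lists with entries already normalized to [0, 1], the diagonal is ignored (no self-loops), and the tie case S_ij = σ, which the rule above leaves open, is mapped to 0. The names `threshold_network`, `network_family`, `edge_count` and the matrix `S_toy` are hypothetical.

```python
def threshold_network(S, sigma):
    """Binary matrix M(sigma): m_ij = 1 if S_ij > sigma, else 0 (ties and diagonal give 0, by assumption)."""
    n = len(S)
    return [[1 if i != j and S[i][j] > sigma else 0 for j in range(n)]
            for i in range(n)]


def network_family(S, steps=10):
    """List of (sigma, M(sigma)) pairs for sigma on an evenly spaced grid over [0, 1]."""
    return [(k / steps, threshold_network(S, k / steps)) for k in range(steps + 1)]


def edge_count(M):
    """Number of undirected edges, counted on the upper triangle."""
    n = len(M)
    return sum(M[i][j] for i in range(n) for j in range(i + 1, n))


# Hypothetical normalized 3 x 3 matrix, for illustration only.
S_toy = [[0.0, 0.9, 0.2],
         [0.9, 0.0, 0.5],
         [0.2, 0.5, 0.0]]

family = network_family(S_toy, steps=10)
for sigma, M in family:
    print(sigma, edge_count(M))
```

Because the rule keeps entries above σ, raising σ can only remove edges, so the edge count along the family never increases.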

Page 8: Complex Networks and Data Mining on genetic databases

Analysis

We shall explore the relationships between these networks: first, define a higher-order neighborhood indicator function, such that you binarize the adjacency matrix with regard to the path length $\ell$, obtaining a matrix $M = \sum_{\ell=1}^{D} \ell M(\ell)$. Then

$$\delta(\alpha, \beta) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( \frac{m_{ij}(\alpha)}{D(\alpha)} - \frac{m_{ij}(\beta)}{D(\beta)} \right) \tag{1}$$

Evaluating $\delta(\sigma, \sigma + \Delta\sigma)$ can give some interesting insights.
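A rough sketch of equation (1) follows, under several assumptions that the slide does not state. It reads the sum over ℓ of ℓ·M(ℓ) as the matrix whose entry (i, j) is the shortest-path length between i and j (0 for i = j or unreachable pairs), takes D to be the largest entry of that matrix, and treats a network with no edges as contributing zeros. It follows the formula as transcribed, with signed differences and no square or absolute value, so positive and negative terms can cancel. The function names and toy matrices are hypothetical.

```python
from collections import deque


def neighborhood_matrix(M):
    """Entry (i, j) is the shortest-path length from i to j; 0 if i == j or j is unreachable."""
    n = len(M)
    hood = [[0] * n for _ in range(n)]
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in range(n):
                if M[u][w] and w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for t, d in dist.items():
            hood[s][t] = d
    return hood


def delta(M_alpha, M_beta):
    """Equation (1) as transcribed: mean signed difference of the normalized neighborhood matrices."""
    n = len(M_alpha)
    hood_a = neighborhood_matrix(M_alpha)
    hood_b = neighborhood_matrix(M_beta)
    D_a = max(max(row) for row in hood_a)  # assumed meaning of D(alpha)
    D_b = max(max(row) for row in hood_b)  # assumed meaning of D(beta)
    total = 0.0
    for i in range(n):
        for j in range(n):
            a = hood_a[i][j] / D_a if D_a else 0.0
            b = hood_b[i][j] / D_b if D_b else 0.0
            total += a - b
    return total / n ** 2


# Hypothetical networks on three nodes: a path 0-1-2, and the single edge 0-1.
M_path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
M_edge = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(delta(M_path, M_edge))
print(delta(M_path, M_path))
```

Applied to consecutive members of the family M(σ), M(σ + Δσ), large values of δ would flag thresholds at which the network structure changes abruptly.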

Page 9: Complex Networks and Data Mining on genetic databases

Example 1: Chitin pathway phylogeny

Proteins related to the chitin metabolic pathway from 1605 complete genomes;

BLAST distances (which are asymmetric);

Search for phylogenetic relationships

Page 10: Complex Networks and Data Mining on genetic databases

Example 1: Some results

Page 11: Complex Networks and Data Mining on genetic databases

Example 1: Some more results

Page 12: Complex Networks and Data Mining on genetic databases

Example 1: The expected Network(s)

Page 13: Complex Networks and Data Mining on genetic databases

Example 2: Foot-and-mouth disease virus in South America

S was built with phylogenetic (TN93) distances for NT and JTT distances for AA;

Try to make sense of a somewhat big data set (167 seqs);

Extract some nice patterns;

Page 14: Complex Networks and Data Mining on genetic databases

Indexes × σ

[Figure: panels (a) and (b)]

Page 15: Complex Networks and Data Mining on genetic databases

A nice network

Page 16: Complex Networks and Data Mining on genetic databases

Some more developments

Page 17: Complex Networks and Data Mining on genetic databases

Related Work

Identify transmission clusters (HIV, HCV) (Lewis et al., 2008, PLoS Medicine)

Explore scale-free behavior in phylodynamics (Shiino, 2012, Frontiers in Microbiology)

Page 18: Complex Networks and Data Mining on genetic databases

Future Directions

Explore the spatial aspect in the construction of S

Maybe S = µ + S(G)α

Power law analysis

Implement assortativity

Suggestions...

Page 19: Complex Networks and Data Mining on genetic databases

Thank You!