Computational Discovery in Evolving Complex Networks
Transcript of Computational Discovery in Evolving Complex Networks
![Page 1: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/1.jpg)
Yongqin Gao, Dissertation Defense, December 2006
Computational Discovery in Evolving Complex Networks
Yongqin Gao
Advisor: Greg Madey
![Page 2: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/2.jpg)
Outline
• Background
• Methodology for Computational Discovery
• Problem Domain – OSS Research
• Process I: Data Mining
• Process II: Network Analysis
• Process III: Computer Simulation
• Process IV: Research Collaboratory
• Contributions
• Conclusion and Future Work
![Page 3: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/3.jpg)
Background
• Network research is gaining more attention
  – Internet
  – Communication networks
  – Social networks
  – Software developer networks
  – Biological networks
• Understanding evolving complex networks
  – Goal I: Search
  – Goal II: Prediction
• Computational scientific discovery
![Page 4: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/4.jpg)
Computational Discovery: Our Methodology
[Figure: methodology diagram. Processes: Data Mining, Network Analysis, Computer Simulation, Research Collaboratory. Actors: Researcher, Community Members. Flow labels: Initialization, Discovery, Assessment, Revision, Feedback, Contribution, Reference.]
![Page 5: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/5.jpg)
Problem Domain
• Open Source Software (OSS) Movement
  – What is OSS
    • Free to use, modify and distribute, with source code available and modifiable
    • Potential advantages over commercial software: potentially high quality; fast development; low cost
  – Why study OSS (Goal)
    • Software engineering: new development and coordination methods
    • Open content: model for other forms of open, shared collaboration
    • Complexity: successful example of self-organization/emergence
![Page 6: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/6.jpg)
Glory of OSS
[Figure: Number of Active Apache Hosts]
![Page 7: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/7.jpg)
Problem Domain
• SourceForge.net community
  – One of the biggest OSS development communities
  – 134,751 registered projects
  – 1,439,773 registered users
![Page 8: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/8.jpg)
Problem Domain
• Our Data Set
  – 25 monthly dumps since January 2003
  – 460 GB in total, growing at 25 GB per month
  – Every dump has about 100 tables
  – The largest table has up to 30 million records
• Experiment Environment
  – Dual Xeon 3.06 GHz, 4 GB memory, 2 TB storage
  – Linux 2.4.21-40.ELsmp with PostgreSQL 8.1
![Page 9: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/9.jpg)
Related Research
• OSS research
  – W. Scacchi, “Free/open source software development practices in the computer game community”, IEEE Software, 2004.
  – K. Crowston, H. Annabi and J. Howison, “Defining open source software project success”, 24th International Conference on Information Systems, Seattle, 2003.
• Complex networks
  – L.A. Adamic and B.A. Huberman, “Scaling behavior of the world wide web”, Science, 2000.
  – M.E.J. Newman, “Clustering and preferential attachment in growing networks”, Physical Review E, 2001.
![Page 10: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/10.jpg)
Process I: Data Mining
• Related Research:
  – S. Chawla, B. Arunasalam and J. Davis, “Mining open source software (OSS) data using association rules network”, PAKDD, 2003.
  – D. Kempe, J. Kleinberg and E. Tardos, “Maximizing the spread of influence through a social network”, SIGKDD, 2003.
  – C. Jensen and W. Scacchi, “Data mining for software process discovery in open source software development communities”, Workshop on Mining Software Repositories, 2004.
![Page 11: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/11.jpg)
Process I: Data Mining
[Figure: data mining process. Stages: Data Preparation, Data Purging, Feature Selection, Algorithm Application. Data labels: Raw data, Relevant data, Database.]
![Page 12: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/12.jpg)
Process I: Data Mining
• Data Preparation
  – Data discovery
    • Locating the information
  – Data characterization
    • Activity features: user categorization
    • Network features
  – Data assembly
• Data Purging
  – Treatment of data inconsistency
    • Unifying the date presentation by loading into a single repository
  – Treatment of data pollution
    • Removing “inactive” projects
• Feature Selection
  – Used to remove dependent or insignificant features
  – NMF (Non-negative Matrix Factorization)
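The slides name NMF as the feature selection technique but do not show how it was applied. The following is a minimal sketch, assuming the standard Lee-Seung multiplicative updates and a hypothetical scoring rule (`feature_weights`) that ranks each feature by its total loading in the factor matrix H. The function names, the rank, and the toy matrix are illustrative assumptions, not the configuration used in the dissertation.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)
    using Lee-Seung multiplicative updates for the Frobenius objective."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def feature_weights(h):
    """Hypothetical scoring rule: total loading of each feature (column of H)."""
    return [sum(col) for col in zip(*h)]

# Toy matrix: rows are projects, columns are features; the third feature is always zero.
V = [[5, 3, 0], [4, 2, 0], [1, 1, 0], [10, 6, 0]]
W, H = nmf(V, rank=2)
weights = feature_weights(H)
```

In this toy run the all-zero third feature receives zero weight, which is the kind of insignificant feature a selection step would drop.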
![Page 13: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/13.jpg)
Process I: Data Mining
• Result I
  – Significant features
    • Feature selection identifies the significant feature set describing the projects.
    • Activity features: “file_releases”, “followup_msg”, “support_assigned”, “feature_assigned” and task-related features
    • Network features: “degrees”, “betweenness” and “closeness”
![Page 14: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/14.jpg)
Process I: Data Mining
• Distribution-based clustering (Christley, 2005)
  – Clustering according to the distribution of features instead of the values of individual features
  – We assume every entity (project) has an underlying distribution over the feature set (activity features)
  – Using a statistical hypothesis test
    • Non-parametric test
    • Fisher’s contingency-table test is used
  – Joachim Krauth, “Distribution-free statistics: an application-oriented approach”, Elsevier Science Publisher, 1988.
![Page 15: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/15.jpg)
Process I: Data Mining
• Procedure:
    While (still unclustered entities)
        Put all unclustered entities into one cluster
        While (some entities not yet pairwise compared)
            A = Pick entity from cluster
            For each other entity, B, in cluster not yet compared to A
                Run statistical test on A and B
                If significant result
                    Remove B from cluster
• Worst case complexity: O(n^2)
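The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions: the slides use Fisher's contingency-table test, whereas this sketch swaps in a Pearson chi-square statistic on the 2 x k table of two count vectors, compared against a caller-supplied critical value. The function names and the toy data are hypothetical.

```python
def chi_square(a, b):
    """Pearson chi-square statistic for the 2 x k table formed by count vectors a and b.
    Stand-in for the Fisher contingency-table test named in the slides."""
    row_a, row_b = sum(a), sum(b)
    total = row_a + row_b
    if row_a == 0 or row_b == 0:
        return 0.0
    stat = 0.0
    for x, y in zip(a, b):
        col = x + y
        if col == 0:
            continue  # empty category contributes nothing
        for obs, row in ((x, row_a), (y, row_b)):
            expected = row * col / total
            stat += (obs - expected) ** 2 / expected
    return stat

def distribution_clustering(entities, critical):
    """Cluster entities (count vectors) whose distributions do not differ significantly."""
    unclustered = list(range(len(entities)))
    clusters = []
    while unclustered:
        # Put all unclustered entities into one candidate cluster.
        cluster, unclustered = unclustered, []
        i = 0
        while i < len(cluster):
            a = cluster[i]
            kept = cluster[:i + 1]
            # Compare A with every member not yet compared to it.
            for b in cluster[i + 1:]:
                if chi_square(entities[a], entities[b]) > critical:
                    unclustered.append(b)  # significant difference: remove B
                else:
                    kept.append(b)
            cluster = kept
            i += 1
        clusters.append(cluster)
    return clusters

# Toy activity counts over three categories; 5.991 is the 5% critical value
# of the chi-square distribution with 2 degrees of freedom.
projects = [[10, 10, 10], [12, 9, 11], [100, 0, 0], [90, 2, 1]]
clusters = distribution_clustering(projects, critical=5.991)
```

Each pass of the outer loop finalizes at least one cluster, so the loop terminates; entities removed from a candidate cluster are reconsidered in the next pass.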
![Page 16: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/16.jpg)
Process I: Data Mining
• Result II
• Unsupervised learning
  – Distribution-based method used to cluster the project history using the activity distribution
  – Clusters are named by ID; the results are shown in the table
  – High support and confidence in evaluation

| Cluster ID | Size |
| --- | --- |
| 1 | 89709 |
| 2 | 9191 |
| 3 | 2060 |
| Total | 100960 |
![Page 17: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/17.jpg)
Process I: Data Mining
• Two sample distributions from different categories
• Unbalanced feature distribution → could be “unpopular”
• Balanced feature distribution → could be “popular”
[Figure: two bar charts of activity counts per Activity Category (1 to 15), one for Cluster 1 and one for Cluster 3]
![Page 18: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/18.jpg)
Process I: Data Mining
• Discoveries in Process I
  – Significant feature set selection
    • Network features are important
    • Further inspection in the next process
  – Distribution-based predictor
    • Based on the activity feature distribution
    • Prediction of the “popularity” based on the balance of the activity feature distribution
• Benefit of these discoveries
  – For collaboration-based communities, these discoveries can help optimize resource allocation.
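The slides do not say how the "balance" of an activity distribution is quantified. One possible measure, offered purely as an assumption, is the normalized Shannon entropy of the activity counts: it is 1 for a perfectly even distribution and 0 when all activity falls into one category. The function name `balance` and the sample counts are hypothetical.

```python
import math

def balance(counts):
    """Normalized Shannon entropy of a count vector, in [0, 1].
    Assumed stand-in for the 'balance' of an activity feature distribution."""
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(k)

even = balance([5, 5, 5, 5])      # evenly spread activity
skewed = balance([97, 1, 1, 1])   # activity concentrated in one category
```

Under the reading on the slides, a project with a higher balance score would be a candidate for the "popular" category, and a threshold would have to be calibrated against labeled data.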
![Page 19: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/19.jpg)
Process II: Network Analysis
• Why network analysis
  – Assess the importance of the network measures to the whole network and to individual entities in the network
  – Inspect the developing patterns of these network measures
• Network analysis
  – Structure analysis
  – Centrality analysis
  – Path analysis
![Page 20: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/20.jpg)
Process II: Network Analysis
• Related research:
  – P. Erdős and A. Rényi, “On random graphs”, Publicationes Mathematicae, 1959.
  – D.J. Watts and S.H. Strogatz, “Collective dynamics of small-world networks”, Nature, 1998.
  – A.L. Barabási and R. Albert, “Emergence of scaling in random networks”, Science, 1999.
  – Y. Gao, “Topology and evolution of the open source software community”, Master Thesis, 2003.
![Page 21: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/21.jpg)
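Process II: Network Analysis
• Structure Analysis
  – Understanding the influence of the network structure on individual entities in the network
  – Inspected measures
    • Approximate diameter, where $N$ is the network size and $z_1$, $z_2$ are the average numbers of first and second neighbors:
      $$D = \frac{\log(N/z_1)}{\log(z_2/z_1)} + 1$$
    • Approximate clustering coefficient, where $\mu_n$ and $\nu_n$ are the $n$-th moments of the two degree distributions of the bipartite network:
      $$\frac{1}{C} - 1 = \frac{(\mu_2 - \mu_1)(\nu_2 - \nu_1)^2}{\mu_1 \nu_1 (2\nu_1 - 3\nu_2 + \nu_3)}$$
    • Component distribution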
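As a worked illustration of the approximate diameter, the sketch below evaluates D = log(N/z1)/log(z2/z1) + 1, where z1 and z2 are the average numbers of first and second neighbors. The helper `neighbor_averages` and the adjacency-dictionary input format are assumptions made for this example; the slides do not describe how the averages were computed on the SourceForge.net data.

```python
import math

def neighbor_averages(adj):
    """Average numbers of first (z1) and second (z2) neighbors.
    adj maps each node to a list of its neighbors (undirected graph)."""
    n = len(adj)
    z1 = sum(len(nb) for nb in adj.values()) / n
    second_total = 0
    for v, nb in adj.items():
        second = set()
        for u in nb:
            second.update(adj[u])
        second -= set(nb)   # exclude first neighbors
        second.discard(v)   # exclude the node itself
        second_total += len(second)
    return z1, second_total / n

def approx_diameter(n, z1, z2):
    """D = log(N / z1) / log(z2 / z1) + 1; requires z2 > z1 > 0."""
    return math.log(n / z1) / math.log(z2 / z1) + 1

# Star graph: one hub connected to four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
z1, z2 = neighbor_averages(star)
d_example = approx_diameter(1000, 10, 100)
```

With N = 1000, z1 = 10 and z2 = 100 the estimate is log(100)/log(10) + 1 = 3, which shows how slowly the estimate grows with network size.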
![Page 22: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/22.jpg)
Process II: Network Analysis
• Conversion among C-NET, P-NET and D-NET
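The conversion figure is not preserved in this transcript. A minimal sketch of one plausible reading follows, assuming that C-NET is the bipartite developer-project network and that D-NET and P-NET are its one-mode projections (developers linked by a shared project, projects linked by a shared developer). The function name and the sample memberships are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def project_networks(memberships):
    """memberships: iterable of (developer, project) pairs, i.e. assumed C-NET edges.
    Returns (d_net, p_net) as sets of undirected edges (frozensets of two nodes)."""
    devs_of = defaultdict(set)
    projs_of = defaultdict(set)
    for dev, proj in memberships:
        devs_of[proj].add(dev)
        projs_of[dev].add(proj)
    # Developers are linked if they share at least one project.
    d_net = {frozenset(pair)
             for devs in devs_of.values()
             for pair in combinations(sorted(devs), 2)}
    # Projects are linked if they share at least one developer.
    p_net = {frozenset(pair)
             for projs in projs_of.values()
             for pair in combinations(sorted(projs), 2)}
    return d_net, p_net

c_net = [("alice", "p1"), ("bob", "p1"), ("bob", "p2"), ("carol", "p2")]
d_net, p_net = project_networks(c_net)
```

Here bob belongs to both projects, so p1 and p2 become neighbors in the project network while alice and carol are connected only indirectly through bob.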
![Page 23: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/23.jpg)
Process II: Network Analysis
• Result I
  – Approximate Diameters
    • D-NET: between 5 and 7 while the network size ranged from 151,803 to 195,744
    • P-NET: between 6 and 8 while the network size ranged from 123,192 to 161,798
  – Approximate Clustering Coefficient
    • D-NET: between 0.85 and 0.95
    • P-NET: between 0.65 and 0.75
![Page 24: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/24.jpg)
Process II: Network Analysis
• Result I
![Page 25: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/25.jpg)
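Process II: Network Analysis
• Centrality Analysis
  – Understanding the importance of individual entities to the global network structure
  – Inspected measures:
    • Average Degrees
    • Degree Distributions
    • Betweenness, where $\sigma_{st}$ is the number of shortest paths from $s$ to $t$ and $\sigma_{st}(v)$ is the number of those paths passing through $v$:
      $$B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
    • Closeness, where $d_G(v,t)$ is the shortest-path distance between $v$ and $t$ in $G$:
      $$C(v) = \frac{1}{\sum_{t \in V} d_G(v,t)}$$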
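The betweenness and closeness measures listed above can be illustrated with a small brute-force sketch for unweighted, undirected graphs. It counts shortest paths with breadth-first search and sums over unordered pairs (s, t); this is an assumption-laden teaching example with cubic cost, not the method used on the SourceForge.net networks, and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def shortest_paths(adj, s):
    """BFS from s: distance and number of shortest paths to every reachable node."""
    dist, sigma = {s: 0}, {s: 1}
    frontier = deque([s])
    while frontier:
        u = frontier.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], sigma[w] = dist[u] + 1, 0
                frontier.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """B(v): sum over unordered pairs (s, t) of the fraction of shortest
    s-t paths that pass through v."""
    info = {s: shortest_paths(adj, s) for s in adj}
    b = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue  # s and t are not connected
        for v in adj:
            if v in (s, t) or v not in dist_s:
                continue
            dist_v, sigma_v = info[v]
            # v lies on a shortest s-t path iff the distances add up.
            if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                b[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return b

def closeness(adj, v):
    """C(v): reciprocal of the summed distances from v to all reachable nodes."""
    dist, _ = shortest_paths(adj, v)
    total = sum(dist.values())
    return 1.0 / total if total else 0.0

path = {0: [1], 1: [0, 2], 2: [1]}
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
```

For large networks such as those in this study, an algorithm like Brandes' would be the usual choice for betweenness; the brute-force form is kept here because it mirrors the definition directly.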
![Page 26: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/26.jpg)
Process II: Network Analysis
• Result II
  – Average Degrees
    • Developer degree in C-NET: 1.4525
    • Project degree in C-NET: 1.7572
    • Developer degree in D-NET: 12.3100
    • Project degree in P-NET: 3.8059
![Page 27: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/27.jpg)
Process II: Network Analysis
• Result II (Degree distributions in C-NET)
![Page 28: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/28.jpg)
Process II: Network Analysis
• Result II (Degree distributions in D-NET and P-NET)
![Page 29: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/29.jpg)
Process II: Network Analysis
• Result II
  – Average Betweenness
    • P-NET: 0.2669e-003
  – Average Closeness
    • P-NET: 0.4143e-005
  – These two measures normally yield very small values in large networks (N > 10,000).
![Page 30: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/30.jpg)
Process II: Network Analysis
• Path Analysis
  – Understanding the developing patterns of the network structure and individual entities in the network
  – Inspected measures:
    • Active Developer Percentage
    • Average Degrees
    • Diameters
    • Clustering coefficients
    • Betweenness
    • Closeness
![Page 31: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/31.jpg)
Process II: Network Analysis
• Result III (Active entities)
![Page 32: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/32.jpg)
Process II: Network Analysis
• Result III (Average degrees in C-NET)
![Page 33: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/33.jpg)
Process II: Network Analysis
• Result III (Average degrees in D-NET and P-NET)
![Page 34: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/34.jpg)
Process II: Network Analysis
• Result III (Diameters in D-NET and P-NET)
![Page 35: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/35.jpg)
Process II: Network Analysis
• Result III (Clustering coefficients for D-NET and P-NET)
![Page 36: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/36.jpg)
Process II: Network Analysis
• Result III (Average betweenness and closeness for P-NET)
![Page 37: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/37.jpg)
Process II: Network Analysis

| Measures | D-NET | P-NET | C-NET |
| --- | --- | --- | --- |
| Average Degree | Yes | Yes | Yes |
| Diameter | Yes | Yes | N/A |
| Clustering Coefficient | Yes | Yes | N/A |
| Degree Distribution | Yes | Yes | Yes |
| Component Distribution | N/A | Yes | N/A |
| Major Component | N/A | Yes | N/A |
| Average Betweenness | Yes | Yes | N/A |
| Average Closeness | Yes | Yes | N/A |
| Active Entity Size Development | Yes | Yes | Yes |
| Average Degree Development | Yes | Yes | Yes |
| Diameter Development | Yes | Yes | N/A |
| Clustering Coefficient Development | Yes | Yes | N/A |
| Average Betweenness Development | Yes | Yes | N/A |
| Average Closeness Development | Yes | Yes | N/A |
![Page 38: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/38.jpg)
Process II: Network Analysis
• Discoveries in Process II:
  – Measures of structure analysis and centrality analysis all indicate very high connectivity of the network.
  – Measures of path analysis reveal the developing patterns of these measures (life cycle behavior).
• Benefits of these discoveries
  – High connectivity in a network is an important feature for information propagation and failure tolerance. Understanding this discovery can help us improve our practices in collaboration networks and communication networks.
  – Understanding the developing patterns of these network measures gives us a method to monitor network development and to improve the network if necessary.
![Page 39: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/39.jpg)
Process III: Computer Simulation
• Related Research:
  – P.J. Kiviat, “Simulation, technology, and the decision process”, ACM Transactions on Modeling and Computer Simulation, 1991.
  – A.L. Barabási and R. Albert, “Emergence of scaling in random networks”, Science, 1999.
  – J. Epstein, R. Axtell, R. Axelrod and M. Cohen, “Aligning simulation models: A case study and results”, Computational and Mathematical Organization Theory, 1996.
  – Y. Gao, “Topology and evolution of the open source software community”, Master Thesis, 2003.
![Page 40: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/40.jpg)
Process III: Computer Simulation
• Iterative simulation method
  – Empirical dataset
  – Model
  – Simulation
• Verification and validation
  – More measures
  – More methods
[Figure: iterative cycle with labels Empirical Data Collection, Model, Simulation, Verification, Validation]
![Page 41: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/41.jpg)
Process III: Computer Simulation
• Previous iterated models (master thesis):
  – Adapted ER Model
  – BA Model
  – BA Model with fitness
  – BA Model with dynamic fitness
• Iterated models in this study
  – Improved Model Four (Model I)
  – Constant user energy (Model II)
  – Dynamic user energy (Model III)
![Page 42: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/42.jpg)
Process III: Computer Simulation
• Model I
  – Realistic stochastic procedures
    • New developers at every time step based on a Poisson distribution
    • Initial fitness based on a log-normal distribution
  – Updated procedure for the weighted project pool (for preferential selection of projects)
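The ingredients of Model I can be put together in a toy simulation. The sketch below is an assumption-heavy illustration, not the dissertation's model: the arrival rate, the probability of starting a new project, the log-normal parameters, and the weighting rule (fitness times current team size) are all hypothetical choices made to show Poisson arrivals, log-normal fitness and preferential project selection in one place.

```python
import math
import random

def poisson(rng, lam):
    """Draw from a Poisson distribution with mean lam (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate(steps, arrival_rate=2.0, p_new_project=0.3, seed=0):
    """Toy growth model: each arriving developer either starts a project or
    joins an existing one chosen with weight fitness * team size."""
    rng = random.Random(seed)
    fitness = []   # one fitness value per project
    members = []   # one set of developer ids per project
    n_dev = 0
    for _ in range(steps):
        for _ in range(poisson(rng, arrival_rate)):
            dev = n_dev
            n_dev += 1
            if not members or rng.random() < p_new_project:
                # New project with log-normally distributed initial fitness.
                fitness.append(rng.lognormvariate(0.0, 1.0))
                members.append({dev})
            else:
                # Weighted project pool: preferential selection.
                weights = [f * len(m) for f, m in zip(fitness, members)]
                chosen = rng.choices(range(len(members)), weights=weights)[0]
                members[chosen].add(dev)
    return n_dev, fitness, members

n_dev, fitness, members = simulate(steps=50, seed=1)
team_sizes = sorted((len(m) for m in members), reverse=True)
```

Inspecting `team_sizes` over longer runs is one way to check whether the resulting project-size distribution develops the heavy tail that the validation slides compare against the empirical data.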
![Page 43: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/43.jpg)
Process III: Computer Simulation
• Average degrees
![Page 44: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/44.jpg)
Process III: Computer Simulation
• Diameter and CC
![Page 45: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/45.jpg)
Process III: Computer Simulation
• Betweenness and Closeness
![Page 46: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/46.jpg)
Process III: Computer Simulation
• Degree Distributions
![Page 47: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/47.jpg)
Process III: Computer Simulation
• Deficit in the measures
![Page 48: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/48.jpg)
Process III: Computer Simulation
• Model II
  – New addition: user energy
  – User energy
    • The “fitness” parameter for the user
    • Every time a new user is created, an energy level is randomly generated for the user
    • The energy level is used to decide whether a user will take an action during each time step
![Page 49: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/49.jpg)
Process III: Computer Simulation
• Degree distributions for Model II
![Page 50: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/50.jpg)
Process III: Computer Simulation
• Deficit in the measures
![Page 51: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/51.jpg)
Process III: Computer Simulation
• Model III
  – New addition: dynamic user energy
  – Dynamic user energy
    • Decaying with respect to time
    • Self-adjustable according to the roles the user is taking in various projects
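A small sketch can make the energy mechanism concrete. It assumes that energy is a probability in [0, 1], that decay is geometric, and that taking on roles adds a fixed boost; none of these details are given on the slides, and the class name `User`, the parameter `role_boost` and the numeric values are hypothetical. Setting `decay=1.0` with no boost reproduces a constant energy in the spirit of Model II.

```python
import random

class User:
    """Simulated user whose energy is the per-step probability of acting."""

    def __init__(self, energy, decay=0.95):
        self.energy = energy  # assumed to lie in [0, 1]
        self.decay = decay    # assumed geometric decay factor per time step

    def step(self, rng, role_boost=0.0):
        """Decide whether the user acts this step, then update the energy."""
        acted = rng.random() < self.energy
        # Energy decays over time and is adjusted by the roles taken on.
        updated = self.energy * self.decay + role_boost
        self.energy = min(1.0, max(0.0, updated))
        return acted

rng = random.Random(0)
fading = User(energy=0.8, decay=0.5)
for _ in range(3):
    fading.step(rng)

constant = User(energy=1.0, decay=1.0)  # constant-energy special case
always_acts = all(constant.step(rng) for _ in range(20))
```

After three steps the fading user's energy has dropped from 0.8 to 0.1, so it acts far less often, while the constant-energy user keeps acting at every step.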
![Page 52: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/52.jpg)
Process III: Computer Simulation
• Degree distributions (Model III)
![Page 53: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/53.jpg)
Process III: Computer Simulation

| Models | Measures | Patterns in Data | Simulated Patterns |
| --- | --- | --- | --- |
| Model I (more realistic distributions) | Developer Distribution | Power Law (large tail) | Power Law (small tail) |
| | Project Distribution | Power Law (small tail) | Power Law (large tail) |
| | Average Degrees | Increasing | Increasing |
| | Clustering Coefficient | Decreasing | Decreasing |
| | Diameter | Decreasing | Decreasing |
| | Average Betweenness | Decreasing | Decreasing |
| | Average Closeness | Decreasing | Decreasing |
| Model II (constant user energy) | Developer Distribution | Power Law (large tail) | Power Law (large tail) |
| | Project Distribution | Power Law (small tail) | Power Law (reasonable tail) |
| | Average Degrees | Increasing | Increasing |
| | Clustering Coefficient | Decreasing | Decreasing |
| | Diameter | Decreasing | Decreasing |
| | Average Betweenness | Decreasing | Decreasing |
| | Average Closeness | Decreasing | Decreasing |
| Model III (dynamic user energy) | Developer Distribution | Power Law (large tail) | Power Law (large tail) |
| | Project Distribution | Power Law (small tail) | Power Law (small tail) |
| | Average Degrees | Increasing | Increasing |
| | Clustering Coefficient | Decreasing | Decreasing |
| | Diameter | Decreasing | Decreasing |
| | Average Betweenness | Decreasing | Decreasing |
| | Average Closeness | Decreasing | Decreasing |
![Page 54: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/54.jpg)
Process III: Computer Simulation
• Discoveries in Process III
  – Expanding the network models for modeling evolving complex networks (more parameters)
  – Providing a validated model to simulate the community network at SourceForge.net
• Benefits of these discoveries
  – Expanded network models can benefit other researchers in complex networks.
  – Validated model for SourceForge.net can be used to study other OSS communities or similar collaboration networks.
![Page 55: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/55.jpg)
Process IV: Research Collaboratory
• Related Research:
  – G. Chin Jr. and C. Lansing, “The biological sciences collaboratory”, Mathematics and Engineering Techniques in Medicine and Biological Sciences, 2004.
  – L. Koukianakis, “A system for hybrid learning and hybrid psychology”, Cybernetics and Information Technologies, Systems and Applications, 2003.
  – NCBI, FlyBase, Ensembl, VectorBase
![Page 56: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/56.jpg)
Process IV: Research Collaboratory
• What is a Collaboratory?
  – An elaborate collection of data, information, analytical toolkits and communication technologies
  – A new networked organizational form that also includes social processes, collaboration techniques and agreements on norms, principles, values, and rules
![Page 57: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/57.jpg)
Process IV: Research Collaboratory
[Figure: three-tier architecture with labels Researchers, Wiki Interface, Query, RPC, Browse, Data Repository]
• Presentation Tier
  – This top tier is the user interface. The main function of the interface is to translate tasks and results into something the user can understand.
• Logic Tier
  – This tier coordinates the web interface and the data storage, and moves and processes data between the two surrounding tiers.
• Data Tier
  – Here information is stored in and retrieved from a database. The information is then passed back to the user through the logic tier.
![Page 58: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/58.jpg)
Process IV: Research Collaboratory
• Data tier - schema design
  – Every schema is a database dump from SourceForge.net
[Figure: schemas SF0103, SF0205, SF0305, SF0405, SF0505, SF0605, SF0705 and SF0805 arranged along a timeline]
![Page 59: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/59.jpg)
Process IV: Research Collaboratory
• Data tier - connection pool
[Figure: connection pool diagram with labels Timeline, Connection Pool, Connection Assigner, Logic Tier, Connection Request and three Persistent Links]
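The connection pool idea can be sketched generically: a fixed number of persistent links are opened once and handed out to requests from the logic tier. The class below is a minimal illustration built on Python's standard `queue` module; the class name, the `connect` factory and the pool size are assumptions, and the collaboratory's actual implementation is not shown in the slides.

```python
import queue
from contextlib import contextmanager

class ConnectionPool:
    """Fixed-size pool of persistent connections handed out on request."""

    def __init__(self, connect, size):
        # connect is any zero-argument factory returning a connection object.
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        """Borrow a connection; it is returned to the pool when the block exits.
        Raises queue.Empty if no connection becomes free within timeout seconds."""
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)

# Usage with a stand-in factory; a real deployment would open database links here.
opened = []

def fake_connect():
    opened.append(object())
    return opened[-1]

pool = ConnectionPool(fake_connect, size=2)
with pool.connection() as first:
    with pool.connection() as second:
        distinct = first is not second
```

Because the links are created once and reused, repeated queries avoid the cost of reconnecting, and the pool size caps the load placed on the database.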
![Page 60: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/60.jpg)
Process IV: Research Collaboratory
• Presentation Tier
  – Various access methods
  – Documentation and references
  – Community support
  – Wiki interface
![Page 61: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/61.jpg)
Process IV: Research Collaboratory
• Logic Tier
  – Interactive web query system
    • Authorized users can submit queries to the back-end repository through the web query system
    • Results are provided as files in various formats
  – Dynamic web schema browser
    • Authorized users can access the dynamic schema of the repository through the schema browser
![Page 62: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/62.jpg)
Process IV: Research Collaboratory
• Utilization reports
  – Monthly statistics (June 2006)
    • Total queries submitted: 16,947
    • Total data files retrieved: 13,343
    • Total bytes of query data downloaded: 26,684,556,278
• Programmable access method
  – A programmable access method should be provided for complicated access
  – Web services planned
![Page 63: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/63.jpg)
Process IV: Research Collaboratory
• Results in Process IV
  – Designing, implementing and maintaining a research collaboratory for OSS-related research
• Benefits of these results
  – OSS researchers can access one of the most complete data sets on the development of an OSS community.
  – By providing a community service to OSS researchers, the collaboratory can help in sparking, improving and promoting research ideas about OSS.
![Page 64: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/64.jpg)
Contributions
• Designed and demonstrated a computational discovery methodology to study evolving complex networks using research on OSS as a representative problem domain
• Understanding the OSS movement by applying the methods
  – Process I: data mining
    • Identifying significant features to describe a project
    • Using distribution-based clustering to generate a distribution-based predictor to predict the “popularity” of a project
  – Process II: network analysis
    • Introducing a more complete analysis to inspect a more complete data set from SourceForge.net
    • Discovering high connectivity and possible life cycle behaviors in both the network structure and individuals in the network
  – Process III: computer simulation
    • Introducing more parameters in modeling evolving complex networks
    • Generating a “fit” model to replicate the evolution of the SourceForge.net community
  – Process IV: research collaboratory
    • Designing, implementing and maintaining a research collaboratory to host the SourceForge.net data set and provide community support for OSS-related research
![Page 65: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/65.jpg)
Publications to-date
• Y. Gao, G. Madey and V. Freeh, “Modeling and simulation of the open source software community”, ADSC, San Diego, 2005.
• Y. Gao and G. Madey, “Project development analysis of the oss community using st mining”, NAACSOS, Notre Dame, 2005.
• S. Christley, Y. Gao, J. Xu and G. Madey, “Public goods theory of the open source software development community”, Agent, Chicago, 2004.
• Y. Gao, Y. Huang and G. Madey, “Data Mining Project History in Open Source Software Communities”, NAACSOS, Pittsburgh, 2004.
• J. Xu, Y. Gao, J. Goett and G. Madey, “A Multi-model Docking Experiment of Dynamic Social Network Simulations”, Agent, Chicago, 2003.
• Y. Gao, V. Freeh and G. Madey, “Analysis and Modeling of the Open Source Software Community”, NAACSOS, Pittsburgh, 2003.
• Y. Gao, V. Freeh and G. Madey, “Conceptual Framework for Agent-based Modeling and Simulation”, NAACSOS, Pittsburgh, 2003.
• G. Madey, V. Freeh, R. Tynan and Y. Gao, “Agent-based modeling and simulation of collaborative social networks”, AMCIS, Tampa, 2003.
• Y. Gao, V. Freeh and G. Madey, “Topology and evolution of the open source software community”, SwarmFest, Notre Dame, 2003.
![Page 66: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/66.jpg)
Publication Plan
• Chapter III (data mining)
– Journal of Machine Learning Research
– Journal of Systems and Software
• Chapter IV (network analysis)
– Journal of Network and Systems Management
– Journal of Social Structure
• Chapter V (computer simulation)
– Spring Simulation Conference 2007 (under review)
– IEEE Computing in Science and Engineering
• Chapter VI (research collaboratory)
– CITSA 2007
– Journal of Computer Science and Applications
![Page 67: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/67.jpg)
Conclusion and Future Work
• Cyclic computational discovery method for studying evolving complex networks
• Study of Open Source Software by applying this method
• Future work:
– Maintaining and expanding the collaboratory
– Verifying the SourceForge.net discoveries against further accumulated database dumps from SourceForge.net
– Applying our simulation model to other software development communities
– Extending our methodology to other evolving complex networks such as the Internet, communication networks and various social networks
![Page 68: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/68.jpg)
Acknowledgement
• My advisor: Dr. Madey
• My committee members:
– Dr. Flynn
– Dr. Striegel
– Dr. Wood
• My colleagues:
– Scott Christley, Yingping Huang, Tim Schoenharl, Matt Van Antwerp, Ryan Kennedy, Alec Pawling and Jin Xu
• SourceForge.net managers:
– Jeff Bates, VP of OSTG Inc.
– Jay Seirmarco, GM of SourceForge.net.
• US NSF CISE/IIS-Digital Society & Technology, under Grant No. 0222829.
![Page 69: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/69.jpg)
Questions
![Page 70: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/70.jpg)
Case Study II
Project Developer Developer
15850 dev[46] dev[83]
15850 dev[46] dev[48]
15850 dev[46] dev[56]
15850 dev[46] dev[58]
6882 dev[58] dev[47]
6882 dev[47] dev[79]
6882 dev[47] dev[52]
6882 dev[47] dev[55]
7028 dev[46] dev[99]
7028 dev[46] dev[51]
7028 dev[46] dev[57]
7597 dev[46] dev[45]
7597 dev[46] dev[72]
7597 dev[46] dev[55]
7597 dev[46] dev[58]
7597 dev[46] dev[61]
7597 dev[46] dev[64]
7597 dev[46] dev[67]
7597 dev[46] dev[70]
9859 dev[46] dev[49]
9859 dev[46] dev[53]
9859 dev[46] dev[54]
9859 dev[46] dev[59]
[Figure: OSS Developer Network (Part). Developers are nodes / Projects are links. Projects shown: 6882, 7028, 7597, 9859, 15850.]
24 Developers, 5 Projects
2 hub Developers, 1 Cluster
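The "developers are nodes, projects are links" idea can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the dissertation: the function names are hypothetical, and only a subset of the rows listed on the slide is used, so the counts it produces describe that subset rather than the full figure.

```python
from collections import defaultdict

# (project, developer, developer) rows copied from the slide's edge list.
# Only the rows for projects 15850, 6882 and 7028 are used to keep this short.
EDGES = [
    (15850, 46, 83), (15850, 46, 48), (15850, 46, 56), (15850, 46, 58),
    (6882, 58, 47), (6882, 47, 79), (6882, 47, 52), (6882, 47, 55),
    (7028, 46, 99), (7028, 46, 51), (7028, 46, 57),
]


def build_network(edges):
    """Return {developer: set of developers sharing at least one project}."""
    neighbours = defaultdict(set)
    for _project, a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return dict(neighbours)


def degrees(network):
    """Number of distinct neighbours per developer."""
    return {dev: len(adj) for dev, adj in network.items()}


def connected_components(network):
    """Group developers into clusters using a depth-first search."""
    seen, components = set(), []
    for start in network:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(network[node] - component)
        seen |= component
        components.append(component)
    return components


network = build_network(EDGES)
deg = degrees(network)
# The two best-connected developers in this subset (candidate "hubs").
hubs = sorted(deg, key=deg.get, reverse=True)[:2]
clusters = connected_components(network)
```

In this subset, dev[46] and dev[47] have the most neighbours, and dev[58] joins their two neighbourhoods into one cluster.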
![Page 71: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/71.jpg)
Process I: Data Mining
• Characteristics of data set
– Massive
– Incomplete, noisy, redundant
– Complex structures, unstructured
• Classic analysis tools are often inadequate and inefficient for analyzing these data, especially in exploratory research
• What is DM (data mining)?
– Nontrivial extraction of implicit, previously unknown and potentially useful information from data.
![Page 72: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/72.jpg)
Process I: Data Mining
• Feature Selection
– Given a non-negative n × m matrix V, find factors W (n × r) and H (r × m) such that V ≈ W * H
– This is called the non-negative matrix factorization (NMF) of the matrix V
– NMF can be used on multivariate data to reduce the dimension of the data set
– By using NMF, we can reduce the dimension from m features to r features
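A minimal sketch of the factorization V ≈ W * H follows. The slides do not say which NMF algorithm was used, so this assumes the common multiplicative-update rule. The helper names and the small matrix `V` are made up for illustration. Each row of `W` is then the r-feature representation of the corresponding row of `V`.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W*H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, r, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix v into w (n x r) and h (r x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(r)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(n)]
    return w, h


# Hypothetical data: 5 projects described by m = 4 non-negative features.
V = [
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 4.0],
    [0.0, 1.0, 5.0, 4.0],
]
W, H = nmf(V, r=2)          # W holds the reduced r = 2 features per project
error = frobenius_error(V, W, H)
```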
![Page 73: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/73.jpg)
Why NMF?
• Feature extraction methods
– Linear methods are simpler and more completely understood.
– Nonlinear methods are more general and more difficult to analyze.
• Linear methods:
– ICA: Independent Component Analysis
– Matrix decomposition: PCA, SVD, NMF
• In practice, NMF is the most popular and is simple to apply.
• Dimensionality reduction is effective if the loss of information due to mapping to a lower-dimensional space is less than the gain due to simplifying the problem.
![Page 74: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/74.jpg)
Process I: Data Mining
• Feature-based Clustering
– Grouping data into K clusters based on features.
– The distance metric used is the Euclidean distance:
$$ED = \sqrt{\sum_{i=0}^{n} (x_i - y_i)^2}$$
– Hierarchical K-Means is used.
• The result is a binary tree.
• The root is the whole data set and the leaf clusters are the fine-grained clusters, which are the resulting K clusters.
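One way to obtain such a binary tree is bisecting K-Means: start from the whole data set and repeatedly split one leaf into two with K = 2 until K leaves remain. The slides do not specify the splitting rule, so the sketch below assumes the leaf with the largest within-cluster squared error is split next. All function names and the sample points are hypothetical.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def sse(points):
    """Within-cluster sum of squared distances to the cluster mean."""
    center = mean(points)
    return sum(euclidean(p, center) ** 2 for p in points)


def two_means(points, rng, iterations=20):
    """Split points into two groups with plain K-Means (K = 2)."""
    centers = rng.sample(points, 2)
    groups = None
    for _ in range(iterations):
        new_groups = ([], [])
        for p in points:
            nearer = 0 if euclidean(p, centers[0]) <= euclidean(p, centers[1]) else 1
            new_groups[nearer].append(p)
        if not new_groups[0] or not new_groups[1]:
            break  # keep the last valid split, if any
        groups = new_groups
        centers = [mean(groups[0]), mean(groups[1])]
    return groups


def hierarchical_kmeans(points, k, seed=0):
    """Return up to k leaf clusters of a binary tree rooted at the full data set."""
    rng = random.Random(seed)
    leaves = [list(points)]
    while len(leaves) < k:
        # Assumed rule: split the leaf with the largest squared error.
        idx = max(range(len(leaves)), key=lambda i: sse(leaves[i]))
        if sse(leaves[idx]) == 0:
            break  # nothing left that can be split
        split = two_means(leaves[idx], rng)
        if split is None:
            break
        leaves[idx:idx + 1] = [split[0], split[1]]
    return leaves


# Hypothetical 2-feature data with three visible groups.
POINTS = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
leaf_clusters = hierarchical_kmeans(POINTS, k=3)
```

Each split replaces one leaf with its two children, so the leaves always partition the original data set.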
![Page 75: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/75.jpg)
Process I: Data Mining
• Case Study Result II
• Unsupervised learning
– The K-Means method is used to cluster the project histories using the features we selected
– We named the clusters by ID; the results are shown in the table
– Evaluation shows that the result is not acceptable
Cluster ID Size
1 6201
2 98
3 64824
4 2
5 4
6 29724
7 4
8 10
9 9
10 84
Total 100960
![Page 76: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/76.jpg)
Process I: Data Mining
[Flowchart: Yes/No decision nodes Admin_flags?, Grantcvs?, Assigned? and Activities? leading to the outcomes Administrator, Core developer, Co-developer, Active user and Lurker. Data sources: the User_group table, the User_project_act table, and the UNION of the artifact, Forum, People_job, Project_task, Doc_data and other tables.]
![Page 77: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/77.jpg)
Process I: Data Mining
![Page 78: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/78.jpg)
Clustering Result Evaluation
• Evaluation test set generation
– Popular/unpopular projects
– Stratified sampling to make 500 projects
• Feature sets used
– Popular feature set
– Activity feature set (Page 34, Table 3.2)
– Network feature set (Page 35, Table 3.3)
• Generating rules for the test sets
• Calculating the support and confidence values
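Support and confidence for a rule such as "cluster → popular" can be computed as below. This is a sketch using the standard association-rule definitions. The label strings and the four-project test set are invented for illustration and are not results from the study.

```python
def support(rows, items):
    """Fraction of rows that contain every item in `items`."""
    items = set(items)
    return sum(items <= row for row in rows) / len(rows)


def confidence(rows, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(rows, antecedent)
    if base == 0:
        return 0.0
    return support(rows, set(antecedent) | set(consequent)) / base


# Hypothetical test set: each project is the set of labels that apply to it.
projects = [
    {"cluster=3", "popular"},
    {"cluster=3", "popular"},
    {"cluster=3", "unpopular"},
    {"cluster=6", "unpopular"},
]

# Rule "cluster=3 -> popular"
rule_support = support(projects, {"cluster=3", "popular"})
rule_confidence = confidence(projects, {"cluster=3"}, {"popular"})
```

Here two of the four projects satisfy the whole rule, and two of the three projects in cluster 3 are popular.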
![Page 79: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/79.jpg)
Popularity Definition
Feature Description
Developers Number of core developers
Downloads Number of downloads
Site_views Number of views of the website
Subdomain_views Number of views of the subdomain
Page_views Number of views of the pages
![Page 80: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/80.jpg)
Why K-Means?
• The algorithm has remained extremely popular because it converges extremely quickly in practice. In fact, many have observed that the number of iterations is typically much less than the number of points.
• K-Means is more successful on large data sets (size > 1000, dimension > 2) than GA and Evolution
• CLIQUE is sensitive to noise
• CURE is not scalable: O(n² log n)
• CLARANS & BIRCH are not good for high-dimensional data
• D. Arthur, S. Vassilvitskii (2006): "How Slow is the k-means Method?," Proceedings of the 2006 Symposium on Computational Geometry (SoCG).
![Page 81: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/81.jpg)
K-Means
• It maximizes inter-cluster (or minimizes intra-cluster) variance, but does not ensure that the result is a global minimum of variance. Multiple runs are needed.
• Elbow criterion
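Both points can be illustrated together. Because a single K-Means run may stop in a local minimum, the sketch below repeats the run from several random starts and keeps the lowest intra-cluster sum of squares. Tabulating that value against K gives the curve on which the elbow is read. The function names, the number of restarts and the sample points are assumptions made for the example.

```python
import random


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_once(points, k, rng, iterations=50):
    """One K-Means run from a random start; returns (inertia, centers)."""
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return inertia, centers


def best_of_runs(points, k, runs=10, seed=0):
    """Repeat K-Means and keep the run with the smallest inertia."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(runs)),
               key=lambda result: result[0])


# Hypothetical 2-feature data with three visible groups.
POINTS = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

# Inertia for each K; the elbow is where the curve stops dropping sharply.
elbow_curve = {k: best_of_runs(POINTS, k)[0] for k in range(1, 6)}
```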
![Page 82: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/82.jpg)
Distribution Categories
Category Feature
1 File release
2 New message
3 Followup message
4 Artifact request
5 Todo request
6 Support request
7 Feature request
8 Patch request
9 Bug reports
10 Bug assigned
11 Patch assigned
12 Feature assigned
13 Support assigned
14 Todo assigned
15 Artifact assigned
![Page 83: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/83.jpg)
Process III: Computer Simulation
[Flowchart: simulation model procedure. Elements: Start; New Users; User List; Project List; Weighted Project Pool; User Action (Join, Create, Idle, Drop); User_Project Links; Project Pool Update; End of Simu? (Yes/No); Stop.]
![Page 84: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/84.jpg)
Process III: Computer Simulation
• Poisson Process:
– It expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate, and are independent of the time since the last event.
– PDF:
$$F(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
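The formula can be evaluated and sampled directly. The sketch below computes F(k; λ) and draws Poisson-distributed counts with Knuth's multiplication method. Using the draw as "events per simulation step" with λ = 3 is only an assumed example, not a parameter taken from the model.

```python
import math
import random


def poisson_pmf(k, lam):
    """F(k; lam) = lam^k * e^(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1


rng = random.Random(0)
# Assumed example: number of events in each of 5000 time steps, rate 3 per step.
counts = [sample_poisson(3.0, rng) for _ in range(5000)]
average = sum(counts) / len(counts)
```

The sample average should be close to the rate λ, since the mean of a Poisson distribution is λ.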
![Page 85: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/85.jpg)
Process III: Computer Simulation
• Log-normal distribution:
![Page 86: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/86.jpg)
Process III: Computer Simulation
• Kolmogorov-Smirnov test
– Used to determine whether two underlying one-dimensional distributions differ.
– Two one-sided K-S test statistics are given by
$$D_n^{+} = \max_x \left(F_n(x) - F(x)\right)$$
$$D_n^{-} = \max_x \left(F(x) - F_n(x)\right)$$
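For a finite sample, the two maxima are attained at the sample points, so both statistics can be computed from the sorted data. The sketch below assumes F_n is the empirical distribution of the sample and F is a reference distribution supplied as a function. The uniform reference and the three sample values are made up for illustration.

```python
def ks_one_sided(sample, cdf):
    """Return (D_plus, D_minus) for a sample against a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    # F_n jumps to (i + 1) / n at the i-th sorted point (0-based i).
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    # Just before that point, F_n is still i / n.
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return d_plus, d_minus


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))


d_plus, d_minus = ks_one_sided([0.4, 0.1, 0.7], uniform_cdf)
d = max(d_plus, d_minus)  # two-sided statistic
```

A small value of `d` means the empirical distribution stays close to the reference everywhere.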
![Page 87: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/87.jpg)
Process III: Computer Simulation
![Page 88: Computational Discovery in Evolving Complex Networks](https://reader036.fdocuments.net/reader036/viewer/2022070411/56814706550346895db44237/html5/thumbnails/88.jpg)
Similar Publications
• Chapter III (data mining)
– JMLR: G. Hamerly, E. Perelman..Using machine learning to guide simulation (Feb. 2006)
– JSS: S. Kim, J. Yoon..Shape-based retrieval in time-series database (Feb. 2006)
• Chapter IV (network analysis)
– JNSM: Special Issue Self-Managing Systems and Networks
– JoSS: The Journal of Social Structure (JoSS) is an electronic journal of the International Network for Social Network Analysis (INSNA)
• Chapter V (computer simulation)
– SSC 2007: simulation co
– IEEE/CSE: E. Luijten..Fluid simulation with monte carlo algorithm (2006 Vol. 8, Issue 2)
• Chapter VI (research collaboratory)
– CITSA 2007: L. Koukianakis..A system for hybrid learning and hybrid psychology (2005)
– JCSA: S. Chen, K. Wen..An Integrated System for Cancer-Related Genes Mining from Biomedical Literatures (2006)