CAS May 24, 2010
Creating a science base to support new directions in
computer science
John HopcroftCornell UniversityIthaca, New York
CAS May 24, 2010
Time of change
The information age is a fundamental revolution that is changing all aspects of our lives.
Those individuals and nations who recognize this change and position themselves for the future will benefit enormously.
CAS May 24, 2010
Drivers of change
Merging of computing and communications
Data available in digital form Networked devices and sensors Computers becoming ubiquitous
CAS May 24, 2010
Internet search engines are changing
When was Einstein born?
Einstein was born at Ulm, in Wurttemberg, Germany, on March 14, 1879.
List of relevant web pages
CAS May 24, 2010
Internet queries will be different
Which car should I buy? What are the key papers in Theoretical
Computer Science? Construct an annotated bibliography on
graph theory. Where should I go to college? How did the field of CS develop?
CAS May 24, 2010
Which car should I buy?
Search engine response: Which criteria below are important to you? Fuel economy Crash safety Reliability Performance Etc.
CAS May 24, 2010
Make Cost Reliability Fuel economy
Crash safety
Links to photos/ articles
Toyota Prius 23,780 Excellent 44 mpg Fair
Honda Accord 28,695 Better 26 mpg Excellent
Toyota Camry 29,839 Average 24 mpg Good
Lexus 350 38,615 Excellent 23 mpg Good
Infiniti M35 47,650 Excellent 19 mpg Good
CAS May 24, 2010
2010 Toyota Camry - Auto ShowsToyota sneaks the new Camry into the Detroit Auto Show.
Usually, redesigns and facelifts of cars as significant as the hot-selling Toyota Camry are accompanied by a commensurate amount of fanfare. So we were surprised when, right about the time that we were walking by the Toyota booth, a chirp of our Blackberries brought us the press release announcing that the facelifted 2010 Toyota Camry and Camry Hybrid mid-sized sedans were appearing at the 2009 NAIAS in Detroit.
We’d have hardly noticed if they hadn’t told us—the headlamps are slightly larger, the grilles on the gas and hybrid models go their own way, and taillamps become primarily LED. Wheels are also new, but overall, the resemblance to the Corolla is downright uncanny. Let’s hear it for brand consistency!
Four-cylinder Camrys get Toyota’s new 2.5-liter four-cylinder with a boost in horsepower to 169 for LE and XLE grades, 179 for the Camry SE, all of which are available with six-speed manual or automatic transmissions. Camry V-6 and Hybrid models are relatively unchanged under the skin.
Inside, changes are likewise minimal: the options list has been shaken up a bit, but the only visible change on any Camry model is the Hybrid’s new gauge cluster and softer seat fabrics. Pricing will be announced closer to the time it goes on sale this March.
Toyota Camry
› Overview
› Specifications › Price with Options › Get a Free Quote
News & Reviews
2010 Toyota Camry - Auto Shows
Top Competitors
Chevrolet Malibu
Ford Fusion Honda Accord sedan
CAS May 24, 2010
Which are the key papers in Theoretical Computer Science? Hartmanis and Stearns, “On the computational complexity of algorithms” Blum, “A machine-independent theory of the complexity of recursive functions” Cook, “The complexity of theorem proving procedures” Karp, “Reducibility among combinatorial problems” Garey and Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness” Yao, “Theory and Applications of Trapdoor Functions” Shafi Goldwasser, Silvio Micali, Charles Rackoff , “The Knowledge Complexity of Interactive
Proof Systems” Sanjeev Arora, Carsten Lund, Rajeev Motwani, Madhu Sudan, and Mario Szegedy, “Proof
Verification and the Hardness of Approximation Problems”
CAS May 24, 2010
Temporal Cluster Histograms: NIPS Results
12: chip, circuit, analog, voltage, vlsi11: kernel, margin, svm, vc, xi10: bayesian, mixture, posterior, likelihood,
em9: spike, spikes, firing, neuron, neurons8: neurons, neuron, synaptic, memory,
firing7: david, michael, john, richard, chair6: policy, reinforcement, action, state,
agent5: visual, eye, cells, motion, orientation4: units, node, training, nodes, tree3: code, codes, decoding, message, hints2: image, images, object, face, video1: recurrent, hidden, training, units, error0: speech, word, hmm, recognition, mlp
NIPS k-means clusters (k=13)
0
20
40
60
80
100
120
140
160
180
1 2 3 4 5 6 7 8 9 10 11 12 13 14Year
Num
ber o
f Pap
ers
Shaparenko, Caruana, Gehrke, and Thorsten
CAS May 24, 2010
NEXRAD Radar
Binghamton, Base Reflectivity 0.50 Degree Elevation Range 124 NMI — Map of All US Radar Sites Animate MapStorm TracksTotal PrecipitationShow SevereRegional RadarZoom Map Click:Zoom InZoom OutPan Map
(Full Zoom Out)
»
A
dvanced
R
adar
Types
C
LIC
K
»
BGMN0R042.40547-76.51950Ithaca, NY000.12574445412029999999900
CAS May 24, 2010
Science base to support activities
Track flow of ideas in scientific literature Track evolution of communities in social networks Extract information from unstructured data sources.
CAS May 24, 2010
Tracking the flow of ideas in the scientific literatureYookyung Jo
Page rank
Web
Link
GraphRetrieval
Query
Search
Text
Web
Page
Search
Rank
Web
Chord
Usage
Index
Probabilistic
TextFile
Retrieve
Text
Index
Discourse
Word
Centering
Anaphora
CAS May 24, 2010
Topic evolution thread
• Seed topic :– 648 : making c program
type-safe by separating pointer types by their usage to prevent memory errors
• 3 subthreads :– Type :
– Garbage collection :
– Pointer analysis :
Yookyung Jo, 2010
CAS May 24, 2010
“Statistical Properties of Community Structure in Large Social and Information Networks”, Jure Leskovec; Kevin Lang; Anirban Dasgupta; Michael Mahoney
Studied over 70 large sparse real-world networks.
Best communities approximately size 100 to 150.
CAS May 24, 2010
Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small scales, and at larger size scales, the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like."
CAS May 24, 2010
Our view of a community
TCS
Me
Colleagues at Cornell
Classmates
Family and friendsMore connections outside than inside
CAS May 24, 2010
Should we remove whiskers?
Does there exist a core in social networks? Yes
Experimentally it appears at p=1/n in G(n,p) model
Is the core unique? Yes and No In G(n,p) model should we require that a
whisker have only a finite number of edges connecting it to the core?
Laura Wang
CAS May 24, 2010
Algorithms How do you find the core? Are there communities in the core of social
networks?
CAS May 24, 2010
How do we find whiskers?
NP-complete if graph has a whisker There exists graphs with whiskers for
which neither the union of two whiskers nor the intersection of two whiskers is a whisker
CAS May 24, 2010
Communities
Conductance Can we compress graph?
Rosvall and Bergstrom, “An informatio-theoretic framework for resolving community structure in complex networks”
Hypothesis testing Yookyung Jo
CAS May 24, 2010
Description of graph with community structure
Specify which vertices are in which communities
Specify the number of edges between each pair of communities
CAS May 24, 2010
Information necessary to specify graph given community structure m=number of communities ni=number of vertices in ith community
lij number of edges between ith and jth communities
1
1 / 2log
mi ji i
i i j ijii
n nn nH Z
ll
CAS May 24, 2010
Description of graph consists of description of community structure plus specification of graph given structure. Specify community for each edge and the
number of edges between each community
Can this method be used to specify more complex community structure where communities overlap?
community structure
12 1 lognlogm m m l H Z
CAS May 24, 2010
Hypothesis testing
Null hypothesis: All edges generated with some probability p0
Hypothesis: Edges in communities generated with probability p1, other edges with probability p0.
CAS May 24, 2010
Massively overlapping communities
Are there a small number of massively overlapping communities that share a common core?
Are there massively overlapping communities in which one can move from one community to a totally disjoint community?
CAS May 24, 2010
Clustering Social networksMishra, Schreiber, Stanton, and Tarjan Each member of community is connected
to a beta fraction of community No member outside the community is
connected to more than an alpha fraction of the community
Some connectivity constraint
CAS May 24, 2010
In sparse graphs
How do you find alpha – beta communities?
What if each person in the community is connected to more members outside the community then inside?
CAS May 24, 2010
Theory to supportnew directions Large graphs Spectral analysis High dimensions and dimension reduction Clustering Collaborative filtering Extracting signal from noise
CAS May 24, 2010
Theory of Large Graphs
Large graphs with billions of vertices Exact edges present not critical Invariant to small changes in definition Must be able to prove basic theorems
CAS May 24, 2010
Erdös-Renyin vertices
each of n2 potential edges is present with independent probability
Nn
pn (1-p)N-n
vertex degreebinomial degree distribution
numberof
vertices
CAS May 24, 2010
Generative models for graphs
Vertices and edges added at each unit of time
Rule to determine where to place edges
Uniform probability Preferential attachment - gives rise to
power law degree distributions
CAS May 24, 2010Vertex degree
Number
of
vertices
Preferential attachment gives rise to the power law degree distribution common in many graphs
CAS May 24, 2010
Protein interactions
2730 proteins in data base
3602 interactions between proteins SIZE OF COMPONENT
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 … 1000
NUMBER OF COMPONENTS
48 179 50 25 14 6 4 6 1 1 1 0 0 0 0 1 0
Science 1999 July 30; 285:751-753
Only 899 proteins in components, where are 1851 missing proteins?
CAS May 24, 2010
Protein interactions
2730 proteins in data base
3602 interactions between proteins
SIZE OF COMPONENT
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 … 1851
NUMBER OF COMPONENTS
48 179 50 25 14 6 4 6 1 1 1 0 0 0 0 1 1
Science 1999 July 30; 285:751-753
CAS May 24, 2010
Giant Component
1.Create n isolated vertices
2.Add Edges randomly one by one
3.Compute number of connected components
CAS May 24, 2010
Giant Component
1 2 3 4 5 6 7 8
367 70 24 12 9 3 2 2
9 10 12 13 14 20 55 101
2 2 1 2 2 1 1 1
CAS May 24, 2010
Giant Component
1 2 3 4 5 6 7 8 9 10
252 39 13 6 3 6 2 1 1 0
11 12 13 14 15 16 17 18 ••• 514
1 0 0 0 0 0 0 0 0 1
CAS May 24, 2010
High dimensional data is inherently unstable Given n random points in d dimensional
space essentially all n2 distances are equal.
22
1
d
i ii
x yx y
CAS May 24, 2010
High Dimensions
Intuition from two and three dimensions not valid for high dimension
Volume of cube is one in all dimensions
Volume of sphere goes to zero
CAS May 24, 2010
Distance between two random points from same Gaussian
Points on thin annulus of radius
Approximate by sphere of radius
Average distance between two points is
(Place one pt at N. Pole other at random. Almost surely second point near the equator.)
d
d
2d
CAS May 24, 2010
Can separate points from two Gaussians if
2
14
2
12 2
2
2 2
2 1 2
1
2 2
2 2
d
d d
d d
d
d
CAS May 24, 2010
Dimension reduction
Project points onto subspace containing centers of Gaussians
Reduce dimension from d to k, the number of Gaussians
CAS May 24, 2010
Centers retain separation Average distance between points reduced
by dk
1 2 1 2, , , , , , ,0, ,0d k
i i
x x x x x x
d x k x
CAS May 24, 2010
Can separate Gaussians provided
2 2 2k k
> some constant involving k and γ independent of the dimension
Finding centers of Gaussians1
2
n
a
a
a
1a2a
na
1v
First singular vector v1 minimizes the perpendicular distance to points
CAS May 24, 2010
The first singular vector goes through center of Gaussian and minimizes distance to points
Gaussian
CAS May 24, 2010
Best k-dimensional space for Gaussian is any space containing the line through the center of the Gaussian
CAS May 24, 2010
Given k Gaussians, the top k singular vectors define a k dimensional space that contains the k lines through the centers of the k Gaussian and hence contain the centers of the Gaussians
CAS May 24, 2010
CAS May 24, 2010
We have just seen what a science base for high dimensional data might look like.
What other areas do we need to develop a science base for?
CAS May 24, 2010
Ranking is important Restaurants, movies, books, web pages Multi billion dollar industry
Collaborative filtering When a customer buys a product what else is
he likely to buy Dimension reduction Extracting information from large data
sources Social networks
Top Related