
Efficient Type-Ahead Search on Relational Data: a TASTIER Approach

Guoliang Li, Shengyue Ji, Chen Li, Jianhua Feng

Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China
Department of Computer Science, University of California, Irvine, CA 92697-3435, USA

{liguoliang,fengjh}@tsinghua.edu.cn; {shengyuj,chenli}@ics.uci.edu

    ABSTRACT

Existing keyword-search systems in relational databases require users to submit a complete query to compute answers. Users often feel left in the dark when they have limited knowledge about the data, and have to use a try-and-see method to modify queries and find answers. In this paper we propose a novel approach to keyword search in the relational world, called Tastier. A Tastier system can bring instant gratification to users by supporting type-ahead search, which finds answers on the fly as the user types in query keywords. A main challenge is how to achieve a high interactive speed for large amounts of data in multiple tables, so that a query can be answered efficiently within milliseconds. We propose efficient index structures and algorithms for finding relevant answers on the fly by joining tuples in the database. We devise a partition-based method to improve query performance by grouping relevant tuples and pruning irrelevant tuples efficiently. We also develop a technique to answer a query efficiently by predicting highly relevant complete queries for the user. We have conducted a thorough experimental evaluation of the proposed techniques on real data sets to demonstrate the efficiency and practicality of this new search paradigm.

    Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.2.8 [Database Applications]: Miscellaneous

    General Terms

    Algorithms, Experimentation, Performance

    Keywords

    Type-Ahead Search, Keyword Search, Query Prediction

1. INTRODUCTION

Keyword search is a widely accepted mechanism for querying document systems and relational databases [1, 6, 15].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD '09, June 29-July 2, 2009, Providence, Rhode Island, USA.
Copyright 2009 ACM 978-1-60558-551-2/09/06 ...$5.00.

One important advantage of keyword search is that it enables users to search information without knowing a complex query language such as SQL, or having prior knowledge about the structure of the underlying data. In traditional keyword search in databases, a user composes a complete query and submits it to the system to find relevant answers. This search paradigm requires the user to formulate a good, complete query to find the interesting answers. If the user has limited knowledge about the targeted entities, the user often feels left in the dark when issuing queries, and has to use a try-and-see method to modify the query and find information. As an example, suppose a user wants to find a movie starring an actor, but remembers the movie's title and the actor's name only vaguely. In this case, it is not convenient for the user to find the movie without knowing how to formulate a query properly.

Many systems are introducing various features to solve this problem. One commonly used method is autocomplete, which predicts a word or phrase that the user may type based on the partial query the user has entered. Search engines nowadays automatically suggest possible keyword queries as a user types in a partial query.

In this paper, we study how to bring this type of search to the relational world. We propose an approach, called Tastier, which stands for type-ahead search techniques in large data sets. A Tastier system supports type-ahead search on the underlying relational data on the fly as the user types in query keywords. It allows users to explore data as they type, and thus brings instant gratification to the users in the search process. Figure 1 shows a screenshot of a Tastier system that supports type-ahead search on a publication data set from DBLP (http://dblp.uni-trier.de). In the figure, a user has typed in a query "yu keyword search sig", and the system finds the relevant answers, possibly with multiple tuples from different tables joined by foreign keys (e.g., the fourth answer). Notice that in Tastier, the keywords in a query can appear at different places in the data, while most autocomplete systems treat a query with multiple keywords as a single string, and require the whole string to appear in the answers.

A main requirement of type-ahead search on large amounts of relational data is the need for a high interactive speed. Each keystroke from a user could invoke a query to the system, which needs to compute the answers within milliseconds. This high-speed requirement is technically challenging especially for the following reasons. First, for a query with multiple keywords, finding tuples matching these keywords, even if the tuples are from a single table, requires


    Figure 1: A screenshot of a Tastier system using a publication database (DBLP).

intersection or join operations, which need to be done efficiently. Second, a keyword in the query can be treated as a prefix, which requires us to consider multiple completions of the prefix. The corresponding union operation increases the time complexity of finding answers. Third, the computation can be even more costly if the tuples in the answers come from different tables in a relational database, leading to more join operations.

In this paper, we develop novel techniques to address these

challenges, and make the following contributions. We first formulate the problem of type-ahead search in which the relational data is modeled as a database graph (Section 2). This graph representation has the advantage of finding answers to a query that are from different tables or records but semantically related. For instance, for the keyword query in Figure 1, the system finds related papers even though none of them includes all the keywords. In the fourth answer, two related papers about keyword search in databases are joined through citation relationships to form an answer. Type-ahead search on the graph representation of a database is technically interesting and challenging. In Section 3 we propose efficient index structures and algorithms for computing relevant answers on the fly by joining tuples on the graph. We use a structure called the δ-step forward index to search efficiently on candidate vertices. We study how to answer queries incrementally by using the cached results of earlier queries.

In Section 4 we study how to improve query performance by graph partitioning. The main idea is to partition the database graph into subgraphs, and build inverted lists on the subgraphs instead of on the vertices of the entire graph. In this way, a query is answered by first finding candidate subgraphs using inverted lists, and then searching within each candidate subgraph. We study how to partition the graph in order to construct high-quality index structures to support efficient search, and present a quantitative analysis of the effect of the number of subgraphs on query performance.

The approach presented so far computes answers for all possible queries and suggests possible complete queries as a by-product. In Section 5 we develop a new approach for type-ahead search, which first predicts the most likely queries and then computes answers to those predicted queries. We present index structures and techniques using Bayesian networks to predict complete queries for a partial query efficiently and accurately. We have conducted a thorough experimental evaluation of the proposed techniques on real data sets (Section 6). The results show that our techniques can support efficient type-ahead search on large amounts of relational data, which proves the practicality of this search paradigm.

1.1 Related Work

Keyword Search in DBMS: There have been many studies on keyword search in relational databases [15, 14, 2, 25, 27, 6, 17, 10, 13, 22, 23]. Most of them employed Steiner-tree-based methods [6, 17, 10, 13, 22]. Other methods [15, 14] generated answers composed of relevant tuples by generating and extending a candidate network following the primary-foreign-key relationships. Many techniques [12, 33, 8, 24, 31, 26, 34] have been proposed to study the problem of keyword search over XML documents.

Prediction and Autocomplete: There have been many studies on predicting queries and user actions [9, 28, 20, 11, 32, 29]. Using these techniques, a system predicts a word or a phrase that the user may type in next based on the input the user has already typed. The differences between these techniques and our work are the following. (1) Most existing techniques predict or complete a query, while Tastier computes answers for the user on the fly. (2) Most existing methods do prediction using sequential data to construct their model, and therefore can only predict a word or phrase that matches sequentially with the underlying data (e.g., phrases or sentences). Our techniques treat the query as a set of keywords, and support prediction of any combination of keywords that may not be close to each other in the data.

CompleteSearch: A closely related study is the work by Bast et al. [4, 5, 3]. They proposed techniques to support CompleteSearch, in which a user types in keywords letter by letter, and the system finds records that include these keywords. CompleteSearch focuses on searching a collection of documents or a single table. Their work in ESTER [3] combined full-text and ontology search. Our work differs from theirs in the following aspects. (1) We study type-ahead search in relational databases and identify Steiner trees as answers, while ESTER focuses on text and entity search, and finds documents and entities as answers. The data graph used in our approach has more flexibility to represent information and find semantically related answers for users, which also requires more join operations. This graph representation and these join operations are not the main focus of ESTER. (2) Our graph-partition-based technique presented in Section 4 is novel. (3) Our query-prediction technique to suggest complete queries, presented in Section 5, is also novel.


Table 1: A publication database.

Authors:
  AID  Name
  a1   Vagelis Hristidis
  a2   S. Sudarshan
  a3   Yannis Papakonstantinou
  a4   Andrey Balmin
  a5   Shashank Pandit
  a6   Jeffrey Xu Yu
  a7   Xuemin Lin
  a8   Philip S. Yu
  a9   Ian Felipe

Citations:
  PID  CitedPID
  p1   p3
  p2   p3
  p3   p4
  p4   p5
  p5   p6
  p6   p7
  p7   p9
  p8   p9

Author-Paper:
  AID  PID
  a1   p1
  a1   p3
  a2   p2
  a3   p1
  a3   p3
  a4   p4
  a5   p5
  a6   p6
  a7   p7
  a8   p8
  a9   p9

Papers:
  PID  Title                                                          Conf    Year
  p1   DISCOVER: Keyword Search in Relational Databases               VLDB    2002
  p2   Keyword Searching and Browsing in Databases using BANKS        ICDE    2002
  p3   Efficient IR-Style Keyword Search over Relational Databases    VLDB    2003
  p4   ObjectRank: Authority-Based Keyword Search in Databases        VLDB    2004
  p5   Bidirectional Expansion For Keyword Search on Graph Databases  VLDB    2005
  p6   Finding Top-k Min-Cost Connected Trees in Databases            ICDE    2007
  p7   Spark: Top-k Keyword Query in Relational Databases             SIGMOD  2007
  p8   BLINKS: Ranked Keyword Searches on Graphs                      SIGMOD  2007
  p9   Keyword Search on Spatial Databases                            ICDE    2008

Notice that the work in [4] predicts complete queries for a prefix using frequent phrases in the data. Our probabilistic prediction approach is different since it treats the query as a set of keywords, and does not require the keywords to form a popular phrase. Thus our approach can predict likely queries even if they do not appear very often in the data as phrases. Other related studies include fuzzy type-ahead search on a set of records/documents [16], type-ahead search in XML data [21], and the work in [7].

2. PRELIMINARIES

Database Graph: A relational database can be modeled as a database graph G = (V, E). Each tuple in the database corresponds to a vertex in G, and the vertex is associated with the keywords in the tuple. A foreign-key relationship from one tuple to another corresponds to an edge between their vertices. The graph can be modeled as a directed or undirected graph, and it can have weights, in which case the weight of an edge represents the strength of the proximity relationship between the two tuples [6]. For simplicity, in this paper we consider undirected graphs in which all edges have the same weight. Our methods can be extended to directed graphs and other edge-weight functions. For example, consider a publication database with four tables. Table 1 shows part of the database, which includes 9 publications and 9 authors. Figure 2 is the corresponding database graph. For simplicity, we do not include the vertices for the tuples in the Author-Paper and Citations tables.
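The graph construction described above can be sketched in a few lines. This is our own illustration, not the paper's implementation; the helper name `build_database_graph` and the tuple encoding are assumptions, and the data shown is a fragment of Table 1:

```python
from collections import defaultdict

def build_database_graph(vertex_keywords, foreign_key_pairs):
    """Return (keywords, adjacency) for an undirected database graph G = (V, E).

    Each tuple id is a vertex carrying its keyword set; each foreign-key
    pair becomes an undirected edge with unit weight.
    """
    adj = defaultdict(set)
    for u, v in foreign_key_pairs:
        adj[u].add(v)
        adj[v].add(u)  # undirected, all edges have the same weight
    return vertex_keywords, adj

# A fragment of the running example: authors a6, a8 and papers p6, p7, p8.
keywords = {
    "a6": {"jeffrey", "xu", "yu"},
    "a8": {"philip", "yu"},
    "p6": {"finding", "top-k", "min-cost", "connected", "trees", "databases"},
    "p7": {"spark", "top-k", "keyword", "query", "relational", "databases"},
    "p8": {"blinks", "ranked", "keyword", "searches", "graphs"},
}
# Author-Paper and Citations rows from Table 1 become edges.
edges = [("a6", "p6"), ("a8", "p8"), ("p6", "p7"), ("p8", "p9")]

kw, adj = build_database_graph(keywords, edges)
```

Vertices such as p9 that appear only on the citation side still receive adjacency entries, which matches the paper's modeling of every tuple as a vertex.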

Steiner Trees as Answers to Keyword Queries: A query consists of a set of keywords and asks for subgraphs with vertices that contain these keywords. Consider a set of vertices V′ ⊆ V that include the query keywords. We want to find a subtree that contains all the vertices in V′ (and possibly other vertices in the graph as well). The following concept from [6] defines this intuition formally.

Definition 1. (Steiner Tree) Given a graph G = (V, E) and a set of vertices V′ ⊆ V, a subtree T in G is a Steiner tree of V′ if T covers all the vertices in V′.

Often we are interested in a smallest Steiner tree, assuming we are given a function to compute the size of a tree, such as the sum of the weights of its edges. A Steiner tree with the smallest size is called a minimum Steiner tree. For simplicity, in this paper we use the number of edges in a tree as its size. For example, consider the graph in Figure 2. For the vertex set {a1, a3}, its Steiner trees include the trees of vertices {a1, a3, p1}, {a1, a3, p3}, {a1, a3, p1, p3}, etc. The first two are minimum Steiner trees.

A query keyword can appear in multiple vertices in the graph, and we are interested in Steiner trees that include a vertex for every keyword, as defined below.

Definition 2. (Group Steiner Tree) Given a database graph G = (V, E) and subsets of vertices V1, V2, . . . , Vn of V, a subtree T of G is a group Steiner tree of these subsets if T is a Steiner tree that contains a vertex from each subset Vi.

    Consider a query Q with two keywords {Yu, sigmod}. Theset of vertices that contain the first keyword, denoted byVYu, is {a6, a8}. Similarly, Vsigmod = {p7, p8}. The subtreecomposed of vertices {a8, p8} and the subtree composed ofvertices {a6, p6, p7} are two group Steiner trees.

The problem of finding answers to a keyword query on a database can be translated into the problem of finding group Steiner trees in the database graph [6]. Often a user is not interested in a Steiner tree with a large diameter, which is the largest value of the shortest path between any two vertices in the tree. To address this issue, we introduce the concept of a δ-diameter Steiner tree, which is a Steiner tree whose diameter is at most δ. In this paper, we want to find δ-diameter Steiner trees as answers to keyword queries. For example, consider the graph in Figure 2. For the query Q above, the top-2 minimum group Steiner trees are the subtree of vertices {a8, p8} and the subtree of vertices {a6, p6, p7}.
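As an illustration of Definition 2 (and not the paper's algorithm, which appears in Section 3), the sketch below uses breadth-first search to find which vertices of the first keyword group can reach a vertex of every other group within δ hops, a simple proxy for the δ-diameter condition. All helper names are ours, and the adjacency is hand-built from Table 1:

```python
from collections import deque

def bfs_distances(adj, source, limit):
    """Shortest hop distances from source, explored up to `limit` steps."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == limit:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def group_steiner_roots(adj, groups, delta):
    """Vertices of groups[0] whose delta-neighborhood touches every group."""
    roots = []
    for r in groups[0]:
        dist = bfs_distances(adj, r, delta)
        if all(any(v in dist for v in g) for g in groups[1:]):
            roots.append(r)
    return roots

# Database graph around the running example (author-paper and citation edges).
adj = {
    "a6": {"p6"}, "a8": {"p8"},
    "p6": {"a6", "p5", "p7"}, "p7": {"p6", "p9", "a7"},
    "p8": {"a8", "p9"}, "p5": {"p6", "p4", "a5"},
    "p9": {"p7", "p8", "a9"}, "a7": {"p7"}, "a5": {"p5"},
    "p4": {"p5", "a4"}, "a4": {"p4"}, "a9": {"p9"},
}
v_yu, v_sigmod = {"a6", "a8"}, {"p7", "p8"}   # VYu and Vsigmod from the text
roots = group_steiner_roots(adj, [v_yu, v_sigmod], delta=2)
```

For the query {Yu, sigmod} both a6 and a8 qualify, matching the two group Steiner trees {a6, p6, p7} and {a8, p8} identified in the text.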

[Figure 3 shows the system architecture: a web-browser client (DHTML and JavaScript, with a client cache) captures the user's keystrokes and sends AJAX requests over the Internet to a web server running FastCGI; the server-side components include a prefix finder, a Steiner-tree finder, a ranker, an indexer, and a server cache, and the server returns the top-k minimum Steiner trees as responses.]

Figure 3: Tastier Architecture.

Type-Ahead Search: A Tastier system supports type-ahead search by computing answers to keyword queries as the user types in keywords. Figure 3 illustrates the architecture of such a system. Each keystroke that the user types could invoke a query, which includes the current string the user has typed in. The browser sends the query to the server, which computes and returns to the user the best answers ranked by their relevancy to the query. We treat


vertex v, for 0 ≤ i ≤ δ, the index has an i-step forward list that includes the keywords in the vertices whose shortest distance to the vertex v is i. The 0-step forward list consists of the keywords in v. Table 2 shows the δ-step forward index for some vertices in our running example. For the vertex a6, its 0-step forward list contains the keyword Yu. Its 1-step forward list contains no keyword. Its 2-step forward list contains the keywords graph, sigmod, search, and spark (in vertices p5 and p7). Given a vertex, its δ-step forward index can be constructed on the fly by a breadth-first traversal of the graph from the vertex.
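The breadth-first construction can be sketched as follows, using the keyword IDs of Table 2 (1 = graph, 2 = search, 3 = sigmod, 4 = spark, 5 = yu, as implied by the example). The helper `forward_index` is a hypothetical illustration, not the paper's code:

```python
from collections import deque

def forward_index(adj, keyword_ids, v, delta):
    """delta-step forward index of v: lists[i] holds the (sorted) keyword
    ids of vertices at shortest distance i from v; lists[0] is v's own."""
    dist = {v: 0}
    lists = [sorted(keyword_ids.get(v, []))] + [[] for _ in range(delta)]
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == delta:
            continue  # do not expand beyond delta steps
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                lists[dist[w]].extend(keyword_ids.get(w, []))
                queue.append(w)
    return [sorted(lst) for lst in lists]

# Neighborhood of a6 in the running example: a6 - p6 - {p5, p7}.
adj = {"a6": {"p6"}, "p6": {"a6", "p5", "p7"},
       "p5": {"p6"}, "p7": {"p6"}}
kw = {"a6": [5], "p5": [1, 2], "p7": [3, 4]}  # p6 has none of ids 1-5

idx = forward_index(adj, kw, "a6", delta=2)
```

For a6 this reproduces the Table 2 row: 0-step {5}, an empty 1-step list, and 2-step {1, 2, 3, 4}.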

Table 2: δ-step forward index.

  Vertex | 0-step | 1-step    | 2-step
  p5     | 1 2    | --        | 3 4 5
  p6     | --     | 1 2 3 4 5 | --
  p7     | 3 4    | 2         | 1 5
  p8     | 1 2 3  | 5         | 4
  a6     | 5      | --        | 1 2 3 4
  a8     | 5      | 1 2 3     | --

(Keyword IDs in the i-step forward lists; "--" denotes an empty list.)

Given a query with multiple keywords Q = {p1, p2, . . . , pn}, we first locate the trie nodes for the prefixes. We select the node of the keyword p ∈ Q with the minimum size (precomputed) of the union of its leaf-node inverted lists, and compute this union list ℓ. For each vertex v on ℓ, we construct its δ-step forward index. For each keyword pi ≠ p (pi ∈ Q), we test whether the δ-step forward index of v includes a complete keyword with prefix pi. This test can be done efficiently by binary search, based on the fact that the trie node of pi has a range [Li, Ui] for the IDs of the complete keywords of pi. Specifically, for each j between 0 and δ, we do a binary search to check if the (sorted) list of keywords of v's j-step neighbors has a keyword within the range. If so, then a complete keyword of pi appears in a neighbor of vertex v. If all the query keywords appear in neighbors of v within δ steps, then v can be expanded to a Steiner tree as an answer to the query. We construct a δ-diameter Steiner tree by visiting the neighbors of v. This step can be optimized if sufficient information was collected when we constructed the δ-step forward index of vertex v. This step is repeated for each vertex on the union list ℓ.
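The binary-search probe against a trie node's keyword range [Li, Ui] might look like the sketch below, applied to the a6 row of Table 2 with the ranges from the running example (s → [2, 4], sp → [4, 4]). The helper names are our own:

```python
import bisect

def contains_keyword_in_range(sorted_ids, lo, hi):
    """Binary-search the lower bound lo and check it stays within [lo, hi]."""
    k = bisect.bisect_left(sorted_ids, lo)
    return k < len(sorted_ids) and sorted_ids[k] <= hi

def vertex_matches(forward_lists, lo, hi):
    """True if any i-step forward list (0 <= i <= delta) hits the range."""
    return any(contains_keyword_in_range(lst, lo, hi)
               for lst in forward_lists)

# Vertex a6 from Table 2: 0-step [5], 1-step [], 2-step [1, 2, 3, 4].
a6_lists = [[5], [], [1, 2, 3, 4]]

hit_s = vertex_matches(a6_lists, 2, 4)    # prefix "s"  -> range [2, 4]
hit_sp = vertex_matches(a6_lists, 4, 4)   # prefix "sp" -> range [4, 4]
```

Both probes succeed for a6 (via search/sigmod/spark and then spark), while the same "sp" probe on a8's lists ([5], [1, 2, 3], []) fails, consistent with the example that follows.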

For example, assume the user types in a query Yu s letter by letter. Figure 5 shows how to compute its answers incrementally. When the user types in a single keyword Yu, we use the trie node of the keyword to find the vertices that include the keyword, i.e., vertices a6 and a8 (second column in the figure). When the user types in Yu s, we first locate the trie nodes 22 and 6 for the two keywords (see Figure 4). As node 22 has the minimum union-list size, we compute its union list {a6, a8}. As shown in the third column in the figure, we construct the δ-step forward index as shown in Table 2. The keyword IDs with p = s as a prefix are within the range [2, 4]. For vertex a6, we test whether its neighbors have a complete keyword in the range [2, 4] by binary searching for the lower bound 2 in the δ-step forward index. As a6 contains the prefix, we construct the Steiner tree as follows. We visit a6's only neighbor, p6, which does not include a complete word of p, and add it into the tree. Next, we visit p6's neighbor, p5, and add it to the tree, since it has a complete keyword of p (Search). Thus we

[Figure 5 walks through the query as it grows from Yu to Yu s to Yu sp, showing at each step the trie-node keyword ranges (Yu [5,5], s [2,4], sp [4,4]), the candidate vertices a6 and a8 with their forward lists, and the identified Steiner trees T1, T2, and T3.]

Figure 5: Type-ahead search for query Yu sp.

have constructed a Steiner tree T1 with vertices a6, p6, and p5 as an answer to the query. Similarly, we can construct another Steiner tree T2 for a6, which consists of vertices a6, p6, and p7. For vertex a8, we construct a Steiner tree T3 with vertices a8 and p8.

Incremental Computation: As the user types in more letters in the query, we can compute the answers incrementally by utilizing the results of earlier queries. Suppose a user has typed in a query with keywords p1, p2, . . . , pn in this order. We have computed and cached the vertices whose neighbors contain the keywords (as prefixes) and their corresponding Steiner trees. Now the user modifies the query in one of the following ways. (1) The user types in one more non-space letter after pn and submits a query {p1, p2, . . . , p′n}, where p′n is a keyword obtained by appending a letter to pn. In this case, for each cached Steiner tree, we check if it contains a complete keyword of the new prefix p′n, and if so, we add it as an answer to the new query. (2) The user types in one more keyword pn+1 and submits the query {p1, p2, . . . , pn, pn+1}. In this case, we locate the trie node of pn+1, test whether the cached vertices contain a complete keyword of pn+1 using the keyword range of the node and the δ-step forward indices of the cached vertices, and identify the answers to the new query. (3) The user modifies the query arbitrarily. Suppose the user submits a new query {p1, p2, . . . , pi, p′i+1, p′i+2, . . .}, which shares the first i keywords with the original query. In this case, we can use the cached results of the query {p1, p2, . . . , pi} and find those that include a complete keyword for each new keyword p′j (j > i).

In our running example, suppose the user types in another character p and submits the query Yu sp. As shown in the last column of Figure 5, we test whether the three subtrees computed for the previous query contain a complete keyword for the prefix sp. The test succeeds only for tree T2, because of the keyword spark in vertex p7, and we return T2 as the answer to the new query.
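Case (1) of the incremental computation, applied to this example, can be sketched as follows. The helper is hypothetical, and the keyword IDs and ranges are those of Table 2 (4 = spark; prefix sp has range [4, 4]):

```python
def refine_cached_trees(cached_trees, vertex_keyword_ids, lo, hi):
    """Keep only the cached Steiner trees that contain some vertex with a
    keyword id inside the new prefix's trie range [lo, hi]."""
    answers = []
    for tree in cached_trees:
        if any(lo <= k <= hi
               for v in tree for k in vertex_keyword_ids.get(v, [])):
            answers.append(tree)
    return answers

# Keyword ids per vertex (p6 carries none of the ids 1-5).
kw = {"p5": [1, 2], "p7": [3, 4], "p8": [1, 2, 3], "a6": [5], "a8": [5]}

# Trees cached for the query "Yu s": T1, T2, T3 from the text.
t1, t2, t3 = ["a6", "p6", "p5"], ["a6", "p6", "p7"], ["a8", "p8"]

# The user appends "p"; the new prefix "sp" has keyword range [4, 4].
answers = refine_cached_trees([t1, t2, t3], kw, 4, 4)
```

Only T2 survives the refinement, because of spark (id 4) in p7, matching the walkthrough above.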


[Figure 6 shows the database graph of Figure 2, with the paper vertices p1-p9 (DISCOVER, BANKS, IR-style keyword search, ObjectRank, bidirectional expansion, min-cost connected trees, Spark, BLINKS, and spatial keyword search) and the author vertices a1-a9, partitioned into three subgraphs s1, s2, and s3.]

Figure 6: A graph partitioning with three subgraphs.

4. IMPROVING QUERY PERFORMANCE BY GRAPH PARTITIONING

We study how to improve the performance of type-ahead search by graph partitioning. We first discuss how to do type-ahead search using a given partition of the database graph in Section 4.1. We then study how to partition the graph in Section 4.2.

4.1 Type-Ahead Search on a Partitioned Graph

Our main idea is to partition the database graph into (overlapping) subgraphs, and assign unique IDs to them. On each leaf node w of the trie of the database keywords, its inverted list stores the IDs of the subgraphs with vertices that include the keyword w, instead of storing vertex IDs. We maintain a forward list for each subgraph, which includes the keywords within the subgraph.

A keyword query is answered in two steps. In step 1, we find the IDs of subgraphs that contain the query keywords. We identify the keyword with the shortest union list ℓs of inverted lists. We use the keyword ranges of the other keywords to probe the forward list of each subgraph on ℓs (similar to the method described in Section 3.2). Each subgraph that passes all the probes is a candidate subgraph to be processed in the next step. In step 2, we find δ-diameter Steiner trees inside each candidate subgraph. We construct δ-step forward lists for the vertices in each candidate subgraph, using the method described in Section 3. Note that we can also do incremental search as discussed in Section 3.

For instance, consider the database graph in Figure 2. Suppose we partition it into three subgraphs as illustrated in Figure 6. We construct a trie similar to Figure 4, in which each leaf node has an inverted list of the subgraphs that include the keyword of the leaf node. Consider a query yu sig. We find answers from subgraphs s2 and s3, and identify two 2-diameter Steiner trees: the subtree of vertices a8 and p8, and the subtree of vertices a6, p6, and p7. Incremental computation to support type-ahead search can be done similarly.

An answer (a δ-diameter Steiner tree) to a query may include edges in different subgraphs. For example, when answering the query min-cost sigmod using the subgraphs in Figure 6, we cannot find any answer, although p6 and p7 are connected by an edge. To make sure our approach based on graph partitioning can always find all answers, we expand each subgraph to include, for each vertex in the subgraph, its neighbors within δ steps, and build index structures on these expanded subgraphs. For example, if δ = 1, we add vertex p4 into subgraph s1, add vertices p3 and p7 into subgraph s2, and add p6 into subgraph s3. In the rest of the section, we use subgraphs to refer to these expanded subgraphs.
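The δ-step expansion of a subgraph can be sketched as a bounded breadth-first search. The graph and subgraph below are toy data for illustration, not Figure 6, and the helper name is ours:

```python
from collections import deque

def expand_subgraph(adj, vertices, delta):
    """Return the subgraph's vertices plus every vertex within delta hops
    of some member, so that any delta-diameter Steiner tree touching the
    subgraph lies entirely inside the expanded subgraph."""
    expanded = set(vertices)
    seen = set(vertices)
    frontier = deque((v, 0) for v in vertices)
    while frontier:
        u, d = frontier.popleft()
        if d == delta:
            continue  # stop expanding past delta steps
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                expanded.add(w)
                frontier.append((w, d + 1))
    return expanded

# Toy graph: a path p3 - p4 - p5 - p6 - p7; subgraph s1 = {p3, p4}.
adj = {"p3": {"p4"}, "p4": {"p3", "p5"}, "p5": {"p4", "p6"},
       "p6": {"p5", "p7"}, "p7": {"p6"}}
s1 = expand_subgraph(adj, {"p3", "p4"}, delta=1)
```

With δ = 1 the boundary neighbor p5 is pulled in; with δ = 2, p6 as well. Expanded subgraphs overlap, which is the price paid for never missing a cross-boundary answer.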

4.2 High-Quality Graph Partitioning

A commonly used method to partition a graph is to produce subgraphs with similar sizes and few edges across the subgraphs. Formally, for a graph G = (V, E) and an integer m, we want to partition V into m disjoint vertex subsets V1, V2, . . . , Vm, such that the number of edges with endpoints in different subsets is minimized. In the general setting where both vertices and edges are weighted, the problem becomes dividing G into disjoint subgraphs with similar weights (where the weight of a subgraph is the sum of the weights of its vertices) such that the cost of the edge cuts is minimized, where the size of a cut is the sum of the weights of the edges in the cut. Graph partitioning is known to be an NP-complete problem [18].

The following example shows that a traditional method with the above optimization goal may not produce a good database-graph partition in our problem setting. Figure 7 shows a graph with 10 vertices v1, v2, . . . , v10. Each vertex vi has a list of its keywords in parentheses. For example, the list of keywords for vertex v3 is B, D, and E. Figure 7(a) shows one graph partition using a traditional method to minimize the number of edges across subgraphs. It has two subgraphs: {v1, v2, v3, v4, v5} and {v6, v7, v8, v9, v10}. (For simplicity, we represent a subgraph using its set of vertices.) After extending each subgraph by assuming δ = 1, the two subgraphs become S1 = {v1, v2, v3, v4, v5, v6, v9} and S2 = {v4, v5, v6, v7, v8, v9, v10}. Every keyword in {A, B, C, D, E, F} has the same inverted list: {S1, S2}. As a consequence, for a query with any of these keywords, we need to search within both subgraphs. Figure 7(b) shows a different graph partition that can produce more efficient


index structures. In this partition there are three edges between the two expanded subgraphs: S3 = {v1, v2, v4, v5, v6, v7, v8} and S4 = {v2, v3, v4, v5, v7, v8, v9, v10}. The inverted lists for keywords A and C are {S3}, those of B and D are {S4}, and those of E and F are {S3, S4}. As the inverted lists become shorter, fewer subgraphs need to be accessed in order to answer keyword queries.

[Figure 7 shows the 10-vertex graph with keyword lists v1 (A,C), v2 (E), v3 (B,D,E), v4 (E), v5 (F), v6 (A,C,F), v7 (F), v8 (E,F), v9 (B,D), and v10 (B,D,E,F), under the two partitions discussed above.]

Figure 7: Two graph partitions. Dashed edges are between subgraphs.

We develop a keyword-sensitive graph-partitioning method that produces index structures to support efficient type-ahead search. The algorithm considers the keywords in the vertices. Its goal is to minimize the number of keywords shared by different subgraphs, since shared keywords cause subgraph IDs to appear on multiple inverted lists on the trie. We use a hypergraph to represent the keywords shared by vertices. For a database graph G = (V, E), we construct a hypergraph Gh = (Vh, Eh) as follows. The set of vertices Vh is the same as V. We construct the hyperedges in Eh in two steps. In the first step, for each edge e in E, we add a corresponding hyperedge in Eh that includes the two vertices of e. The weight of this hyperedge is the total number of keywords on the vertices within δ steps of the two vertices of e. The intuition is that, if this hyperedge is deleted in a partition, the keywords of the neighboring vertices will be shared by the resulting subgraphs. In the second step, for each keyword, we add a hyperedge that includes all the vertices containing this keyword. The weight of this hyperedge is 1, since if this hyperedge is deleted in a partition, the corresponding keyword will be shared by the subgraphs. Figure 8 shows the hypergraph constructed from the graph in Figure 7. Hyperedges from keywords are drawn as rectangles with keyword labels. For example, the hyperedge for keyword E connects vertices v2, v3, v4, and v8. Hyperedges from the original edges are drawn as lines.
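The two-step hyperedge construction might be sketched as follows on a tiny three-vertex graph. `build_hypergraph` and its weighting are our reading of the text, not the paper's code:

```python
from collections import defaultdict, deque

def neighborhood_keywords(adj, keywords, v, delta):
    """All keywords on vertices within delta steps of v (v included)."""
    dist, found = {v: 0}, set(keywords.get(v, ()))
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == delta:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                found |= set(keywords.get(w, ()))
                queue.append(w)
    return found

def build_hypergraph(adj, edges, keywords, delta):
    hyperedges = []
    # Step 1: one hyperedge per graph edge, weighted by the number of
    # keywords within delta steps of its two endpoints.
    for u, v in edges:
        weight = len(neighborhood_keywords(adj, keywords, u, delta)
                     | neighborhood_keywords(adj, keywords, v, delta))
        hyperedges.append(({u, v}, weight))
    # Step 2: one weight-1 hyperedge per keyword, spanning its vertices.
    members = defaultdict(set)
    for vtx, kws in keywords.items():
        for k in kws:
            members[k].add(vtx)
    hyperedges += [(vs, 1) for vs in members.values()]
    return hyperedges

# Toy graph: a path v1 - v2 - v3 with keywords A, B, A.
adj = {"v1": {"v2"}, "v2": {"v1", "v3"}, "v3": {"v2"}}
kws = {"v1": ["A"], "v2": ["B"], "v3": ["A"]}
h = build_hypergraph(adj, [("v1", "v2"), ("v2", "v3")], kws, delta=1)
```

A partitioner that cuts hyperedges as cheaply as possible will then prefer cuts that keep each keyword's vertices together, which is exactly the keyword-sensitivity the text argues for.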

We partition the weighted hypergraph using an existing hypergraph-partitioning method (e.g., [19]), with the goal of minimizing the total weight of the deleted hyperedges. For instance, a multilevel approach coarsens the graph by merging vertices in pairs step by step, generating smaller hypergraphs, and partitions the smallest hypergraph using heuristic or random methods. The partition is then projected back to the original hypergraph level by level, with partition refinement applied during each uncoarsening step. After computing a partition of the hypergraph Gh, we derive a partition of the original database graph G by deleting the edges whose corresponding hyperedges have been deleted in the hypergraph partition. For instance, Figure 8 also shows a hypergraph partition with two sub-hypergraphs; dashed lines and dashed rectangles are the deleted hyperedges. This partition minimizes the total weight of the deleted hyperedges, and it produces the graph partition shown in Figure 7(b).

[Figure 8 shows the hypergraph built from the graph in Figure 7: keyword hyperedges A,C; B,D; E; and F drawn as rectangles over their vertices, the original graph edges drawn as lines, and the deleted hyperedges dashed.]

Figure 8: Hypergraph partitioning.

Deciding the Number of Subgraphs: The number m of subgraphs in a graph partition can greatly affect the query performance of the generated index structures. As we increase m, there are more subgraphs, causing the inverted lists on the trie to become longer; thus it takes more time to access these lists to find candidate subgraphs. On the other hand, the average size of the subgraphs becomes smaller, which can reduce the search time within each candidate subgraph. We have quantitatively analyzed the effect of m on the overall query performance. We omit the analysis due to limited space.

    5. IMPROVING PERFORMANCE BY QUERY PREDICTION

    In this section we study how to improve query performance by predicting complete keywords for a partial keyword, and running the most likely predicted complete queries on the database graph to compute answers, before the user finishes typing the complete keyword. For instance, suppose a user types in a query "similarity se", where the first keyword has been typed in completely, and "se" is a partial keyword. If we can predict several possible queries with high probabilities, such as "similarity search", we can suggest these queries to the user, and run them on the database to compute answers. Figure 9 shows a screenshot in which the system predicts top-8 complete queries and displays the top-4 answers to the predicted queries. In this section, we focus on the case where only the last keyword is treated as a prefix. The solution generalizes to the case where each keyword is treated as a prefix and we predict complete keywords for each of them, assuming the user can modify an arbitrary keyword if he/she does not see the intended complete keyword in the predicted queries.

    Notice that the method presented in Section 3 essentially considers all possible complete keywords for a prefix keyword


    Figure 9: Screenshot of predicted queries and answers in Tastier.

    and computes answers to these queries. In contrast, the approach presented in this section efficiently predicts several possible complete keywords, and runs only the corresponding queries. Thus this new approach can compute answers more efficiently. In case the predicted queries are not what the user wanted, then as the user types in more letters, we dynamically change the predicted queries. After the user finishes typing a keyword completely, there is no ambiguity about the user's intention, so we can still compute the answers to the complete query. In this way, our approach improves query performance while still guaranteeing to help users find the correct answers.

    5.1 Query-Prediction Model

    Given a query with multiple keywords, among the complete keywords of the last keyword, we want to predict the most likely complete keywords given the other keywords in the query. Notice that this prediction task is different from prediction in traditional autocomplete applications, in which a query with multiple keywords is often treated as a single string. These applications can only find records that match this entire string. For instance, consider the search box on Apple.com, which allows autocomplete search on Apple products. Although a keyword query "itunes" can find a record "itunes wi-fi music store", a query with keywords "itunes music" cannot find this record (as of November 2008), because these two keywords appear at different places in the record. Our approach allows the keywords to appear at different places in a tuple or Steiner tree.

    Given a single keyword p, among the words with p as a prefix, we predict the top-m words with the highest probabilities, for a constant m. The probability of a word w is computed as follows:

        Pr(w) = N_w / N,    (1)

    where N is the number of vertices in the database graph and N_w is the number of vertices that contain word w. For example, for the database graph in Figure 2, there are 18 vertices and 3 vertices containing "relation". Thus we have Pr(relation) = 3/18.
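Equation 1 amounts to a single counting pass over the vertices. A one-function sketch (names are ours):

```python
def word_probability(keywords, w):
    """Pr(w) = N_w / N over the database graph (Equation 1).

    keywords: {vertex: set(words)} for all N vertices of the graph.
    """
    n = len(keywords)                                   # N: all vertices
    n_w = sum(1 for ws in keywords.values() if w in ws) # N_w: vertices with w
    return n_w / n
```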

    Suppose a user has typed in a query with complete keywords {k1, k2, ..., kn}, after which the user types in another prefix keyword p_{n+1}. We use {k1, k2, ..., kn} to compute the probability of a complete keyword k_{n+1} of prefix p_{n+1}, i.e., Pr(k_{n+1} | k1, k2, ..., kn). We use a Bayesian network to compute this probability. In a Bayesian network, if two nodes x and y are d-separated by a third node z, then the corresponding variables of x and y are independent given the variable of z [30]. Suppose the query keywords form a Bayesian network, and the nodes of k1, k2, ..., kn are d-separated by the node of k_{n+1}. Then we have Pr(ki | kj, k_{n+1}) = Pr(ki | k_{n+1}) for i ≠ j. We can show

        Pr(k_{n+1} | k1, k2, ..., kn)
          = Pr(k1, k2, ..., kn, k_{n+1}) / Pr(k1, k2, ..., kn)
          = [Pr(k_{n+1}) · Pr(k1 | k_{n+1}) · Pr(k2 | k_{n+1}) · ... · Pr(kn | k_{n+1})]
            / [Pr(k1) · Pr(k2 | k1) · Pr(k3 | k1, k2) · ... · Pr(kn | k1, ..., k_{n-1})].    (2)
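One useful consequence of Equation 2: the denominator Pr(k1, ..., kn) is the same for every candidate k_{n+1}, so candidates can be ranked by the numerator alone. A minimal sketch under that observation, assuming the priors and pairwise conditionals have already been computed (function and parameter names are ours):

```python
def rank_completions(prior, cond, ks, candidates, m=3):
    """Rank candidate completions of the last keyword by Equation 2.

    prior: {word: Pr(word)}                 -- Equation 1
    cond:  {(wi, wj): Pr(wi | wj)}          -- precomputed offline
    ks:    [k1, ..., kn] already-typed complete keywords
    The denominator of Equation 2 does not depend on the candidate,
    so we score by the numerator only and return the top-m.
    """
    def score(k_next):
        s = prior[k_next]
        for k in ks:
            # d-separation assumption: Pr(ki | k_next) factors pairwise.
            s *= cond.get((k, k_next), 0.0)
        return s
    return sorted(candidates, key=score, reverse=True)[:m]
```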

    Notice that the probabilities Pr(k3 | k1, k2), Pr(k4 | k1, k2, k3), ..., Pr(kn | k1, ..., k_{n-1}) can be computed and cached in earlier predictions as the user types in these keywords. For Pr(k1 | k_{n+1}), Pr(k2 | k_{n+1}), ..., Pr(kn | k_{n+1}), we can compute and store them offline. That is, for every two words wi and wj in the database, we keep their conditional probabilities Pr(wi | wj) and Pr(wj | wi):

        Pr(wi | wj) = T_(wi, wj) / T_wj,    (3)

    where T_wi is the number of τ-diameter trees that contain word wi, and T_(wi, wj) is the number of τ-diameter trees that contain both wi and wj.2 For example, for the graph in Figure 2, if τ = 1, there are 18 1-diameter trees containing "keyword" and 16 1-diameter trees containing both "keyword" and "search". We have Pr(search | keyword) = 16/18.
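The offline statistics T_w and T_(wi,wj) of Equation 3 can be gathered with one BFS per vertex, following footnote 2 (one τ-diameter tree per vertex). A sketch with names of our own choosing; for large graphs one would not materialize every word pair per tree as done here:

```python
from collections import deque
from itertools import combinations

def tree_word_counts(adj, keywords, tau):
    """Count, per word and per word pair, the tau-diameter trees containing
    it. One tree is rooted at each vertex and holds the keywords of all
    vertices within tau steps of it."""
    t_w, t_pair = {}, {}
    for v in adj:
        # BFS out to tau steps, collecting the keywords of the tree at v.
        seen, frontier = {v}, deque([(v, 0)])
        words = set(keywords.get(v, ()))
        while frontier:
            u, d = frontier.popleft()
            if d == tau:
                continue
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    words |= set(keywords.get(w, ()))
                    frontier.append((w, d + 1))
        for w in words:
            t_w[w] = t_w.get(w, 0) + 1
        for a, b in combinations(sorted(words), 2):
            t_pair[(a, b)] = t_pair.get((a, b), 0) + 1
    return t_w, t_pair
```

Equation 3 is then t_pair[(wi, wj)] / t_w[wj] (with the pair key sorted).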

    5.2 Two-Tier Trie for Efficient Prediction

    To efficiently predict complete queries, we construct a two-tier trie. In the first tier, we construct a trie for the keywords in the database. Each leaf node stores the probability of its word in the database. For each internal node, we maintain the top-m complete keywords (i.e., descendant leaf nodes) with the highest probabilities. In the second tier, for each leaf node w1 in the first-tier trie, we construct a trie for the words in the τ-diameter trees of word w1, and add this trie to the leaf node w1. For each node in the second tier, we maintain the number of complete words under the node. A leaf node of keyword w2 in the second tier under the leaf node w1 also stores the conditional probability Pr(w2 | w1). Figure 10 illustrates an example, in which we use radix (compact) tries. Node "search" has a probability of 7/18. We keep a top-1 complete word for node "s" (i.e., "search"). The node "r" under the node "search" ("search"+"r" in short) has two words in its descendants. The probability of "search"+"relation", Pr(relation | search), is 9/16.

    2 For each vertex, we construct a τ-diameter tree by adding its neighbors within τ steps using a breadth-first traversal from the vertex.

    [Figure omitted: the two-tier radix tries, annotated with word probabilities such as search: 7/18 and relation|search: 9/16.]

    Figure 10: Two-tier trie.
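To make the lookups concrete, the two-tier structure can be mimicked with dictionaries: the first tier maps each complete keyword to its probability, and the second tier maps it to the conditional probabilities of its co-occurring words. This toy stand-in deliberately ignores the radix compression and the precomputed per-node top-m lists that make the real trie fast; all names are ours, and the probabilities in the test follow the running example.

```python
class TwoTierIndex:
    """Dictionary-based stand-in for the two-tier trie."""

    def __init__(self, prior, cond):
        self.prior = prior            # first tier: {w: Pr(w)}
        self.second = {}              # second tier: {w1: {w2: Pr(w2|w1)}}
        for (w2, w1), p in cond.items():
            self.second.setdefault(w1, {})[w2] = p

    def top_complete(self, prefix, m=1):
        """Top-m complete keywords for a first-tier prefix.

        A real trie reads this off the precomputed top-m list stored at
        the prefix node; here we scan and sort for clarity."""
        hits = [w for w in self.prior if w.startswith(prefix)]
        return sorted(hits, key=self.prior.get, reverse=True)[:m]

    def completions_under(self, w1, prefix):
        """Complete words with the given prefix in the second tier of w1."""
        return {w for w in self.second.get(w1, ()) if w.startswith(prefix)}
```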

    This index structure stores information about the Pr(w) probability in Equation 1 and the Pr(ki | k_{n+1}) probabilities in Equation 2. We use this index structure to predict complete keywords of the last keyword p_{n+1} given the other keywords {k1, k2, ..., kn}. A naive way is the following. We use the trie of the first tier to find all the complete keywords of p_{n+1}. For each such complete keyword k_{n+1}, we use Equation 2 to compute the conditional probability Pr(k_{n+1} | k1, k2, ..., kn), using the conditional probabilities cached earlier and those stored in the two-tier structure. We select the top-m complete keywords with the highest probabilities.

    We want to reduce the number of complete keywords whose conditional probabilities need to be computed. Notice that each complete keyword with a non-zero conditional probability has to appear in the second-tier trie under the leaf node of each keyword ki. Based on this observation, for each 1 ≤ i ≤ n, we locate the trie node "ki"+p_{n+1}, i.e., the node of p_{n+1} under the node of ki, and retrieve its set of complete keywords. We only need to consider the intersection of these sets of complete keywords for the keywords k1, ..., kn.
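The intersection step itself is a few lines; starting from the smallest set keeps the work proportional to the tightest candidate set (this helper and its name are ours):

```python
def candidate_completions(word_sets):
    """Intersect the sets of complete words, one set per typed keyword ki,
    taken from ki's second-tier trie. Only words in every set can have a
    non-zero conditional probability. Smallest set first keeps it cheap."""
    sets = sorted(word_sets, key=len)
    out = set(sets[0])
    for s in sets[1:]:
        out &= s
    return out
```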

    For instance, suppose a user types in a query "keyword search r" letter by letter. We predict queries as follows. (1) When the user finishes typing the partial query "key", we locate the trie node and return the complete query keyword based on the information on the trie node. (2) When the user finishes typing the query "keyword s", we predict the complete query "keyword search" based on the information on the trie node "keyword"+"s". (3) When the user types in the query "keyword search r", we locate the trie nodes "keyword"+"r" and "search"+"r", select the node "keyword"+"r" with the minimum number of words, and identify its complete words "rank" and "relation". We intersect the sets of words in the leaf nodes of the two trie nodes, and find two possible keywords "rank" and "relation". We compute their conditional probabilities using Equation 2.

    In general, among the top-m predicted keyword queries, we choose the predicted queries following their probabilities (highest first), and compute their answers. We terminate the process once we have found enough τ-diameter Steiner trees. If there are not enough answers, we can use the methods in Section 4 to compute answers.
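The overall loop can be sketched as follows; run_query is a hypothetical stand-in for executing one complete query against the partition-based indexes of Section 4:

```python
def answers_from_predictions(predictions, run_query, k_needed):
    """Run predicted complete queries in decreasing probability order,
    stopping once enough answers (Steiner trees) have been found.

    predictions: iterable of (complete_query, probability) pairs.
    run_query:   callable returning a list of answers for one query.
    """
    answers = []
    for query, _prob in sorted(predictions, key=lambda x: x[1], reverse=True):
        answers.extend(run_query(query))
        if len(answers) >= k_needed:
            break
    return answers[:k_needed]
```

If the loop exhausts the predicted queries without enough answers, the caller would fall back to the full Section 4 computation, as the text describes.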

    6. EXPERIMENTAL STUDY

    We have conducted an experimental evaluation of the developed techniques on two real data sets: DBLP3 and IMDB4. DBLP included about one million computer-science publication records as described in Table 1. The original data size of DBLP was about 470 MB. IMDB had three tables: "user", "movie", and "ratings". It contained approximately one million anonymous ratings of 3,900 movies made by 6,040 users. We set τ = 3 for all the experiments.

    We implemented the backend server using FastCGI in C++, compiled with a GNU compiler. All the experiments were run on a Linux machine with an Intel Core 2 Quad processor X5450 3.00 GHz and 4 GB memory.

    6.1 Search Efficiency without Partitioning

    We evaluated the search efficiency of Tastier without graph partitioning. We constructed the trie structure and built the forward index for each graph vertex and the inverted index for each keyword. Table 3 summarizes the index costs on the two datasets.

    Table 3: Data sets and index costs.

    Data set                 | DBLP       | IMDB
    Original data size       | 470 MB     | 50 MB
    # of distinct keywords   | 392K       | 14K
    Index-construction time  | 35 seconds | 6 seconds
    Trie size                | 14 MB      | 2 MB
    Inverted-list size       | 54 MB      | 5.5 MB
    Forward-list size        | 45 MB      | 4.8 MB

    We generated 100 queries by randomly selecting vertices from the database graph, and choosing keywords from each vertex. We evaluated the average time of each keystroke to answer a query. Figure 11 illustrates the experimental results. We observe that for queries with two keywords, Tastier spent much more time than for queries with a single keyword, because we need to find the vertex ids and identify the Steiner trees on the fly. When queries had more than two keywords, Tastier achieved high efficiency, as we can use the cached information to do incremental search.

    6.2 Effect of Graph Partitioning

    We evaluated the effect of graph partitioning. We used a tool called hMETIS5 for graph partitioning. We constructed the trie structure, and built the forward index for each vertex and the inverted index for each keyword on top of subgraphs and vertices. Figure 12 illustrates the index cost for

    3 http://dblp.uni-trier.de/xml/
    4 http://www.imdb.com/interfaces
    5 http://glaros.dtc.umn.edu/gkhome/metis/hmetis/overview


    [Figure omitted: average search time (ms) per keystroke vs. number of keywords, split into time to find vertex IDs and time to identify Steiner trees, on (a) DBLP and (b) IMDB.]

    Figure 11: Search efficiency without partitioning.

    graph partitioning. We observe that the index size is a bit larger than that without partitioning. When we increased the number of subgraphs, the index size increased slightly.

    [Figure omitted: index size (MB) vs. number of subgraphs (2^x), for the traditional partition and the keyword-sensitive partition.]

    Figure 12: Index size vs # of subgraphs on DBLP.

    We evaluated search efficiency. We randomly generated 100 queries and evaluated the average time of each keystroke to answer a query. Figure 13 gives the experimental results. We observe that, for a very small number of subgraphs, graph partitioning cannot improve search efficiency, as it is hard to prune large subgraphs. For a very large number of subgraphs, the inverted lists on top of subgraphs became longer, and thus they could not improve search efficiency either. We find that when the number of subgraphs was about 10K, the graph-partition-based methods achieved the best performance. Moreover, our keyword-sensitive method outperformed the traditional one, as it reduced the inverted-list size and could prune more irrelevant subgraphs.

    6.3 Effect of Query Prediction

    We evaluated the effect of query prediction. Besides the forward index and inverted index, we also constructed a two-tier trie structure, which was about 200 MB. Note that the two-tier trie size depends on the number of keywords in the database, not on the number of tuples.

    [Figure omitted: average search time (ms) vs. number of subgraphs (2^x), for the traditional partition and the keyword-sensitive partition.]

    Figure 13: Efficiency vs # of subgraphs on DBLP.

    Quality of Query Prediction: We evaluated the result quality of query prediction. We first used the reciprocal rank to evaluate the result quality of prediction. For a query, the reciprocal rank is the ratio between 1 and the rank at which the first correct answer is returned, or 0 if no correct answer is returned. For example, if the first relevant query is ranked at 2, then the reciprocal rank is 1/2. We randomly selected 100 queries with two to ten keywords on each dataset, and used them as the complete correct queries. We then composed prefix queries from the first three letters of each keyword in the complete queries. We evaluated the average reciprocal rank. The average reciprocal rank of the 100 queries on DBLP was 0.91, and that of IMDB was 0.95. Table 4 gives several sample queries.
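The metric as defined above fits in a few lines (names ours):

```python
def reciprocal_rank(predicted, correct):
    """1 / rank of the first correct prediction in the list, or 0 if the
    correct query is never returned."""
    for rank, q in enumerate(predicted, start=1):
        if q == correct:
            return 1 / rank
    return 0.0
```

Averaging this value over the 100 test queries gives the reported mean reciprocal rank.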

    Table 4: Reciprocal rank.

    (a) DBLP
    Complete Query            | Partial Query | Reciprocal rank
    keyword search database   | key sea dat   | 1
    approximate search sigmod | app sea sig   | 1/2
    similarity search icde    | sim sea icd   | 1
    christos faloutsos mining | chr fal min   | 1

    (b) IMDB
    Complete Query            | Partial Query | Reciprocal rank
    children corn             | chi cor       | 1
    star wars                 | sta war       | 1
    slumber party massacre    | slu par mas   | 1
    toxic venger comedy       | tox ven com   | 1

    We then used top-k precision to evaluate the result quality. We used the graph-partition-based method and the query-prediction-based method to identify the top-k results respectively. For the latter method, we predicted the top-10 complete queries for each partial query and identified answers for these 10 complete queries. Given a query, the top-k precision of the prediction-based method is the ratio between the number of answers shared by the two methods and k. Table 5 gives the experimental results on the two datasets. We observe that the prediction-based method achieved good quality, as we can accurately predict the most relevant complete queries. When we increased k, the top-k precision decreased, as there may not be enough top-k answers for the predicted queries.

    Efficiency of Prediction: We randomly generated 100 queries on the two datasets. We evaluated the average time of each keystroke to answer a query. We partitioned DBLP into 10,000 subgraphs and IMDB into 1,000 subgraphs. Fig-


    multiple tables. We proposed efficient index structures and algorithms for incrementally computing answers to queries in order to achieve an interactive speed on large data sets. We developed a graph-partition-based method and a query-prediction technique to improve search efficiency. We have conducted a thorough experimental study of the algorithms. The results demonstrated the efficiency and practicality of this new computing paradigm.

    Acknowledgements

    Special thanks go to Jiannan Wang, who helped with query prediction and the experiments. The work was partially supported by the National Science Foundation of the US under award No. IIS-0742960, the National Science Foundation of China under grant No. 60828004, the National Natural Science Foundation of China under grant No. 60873065, the National High Technology Development 863 Program of China under grant No. 2007AA01Z152, the National Grand Fundamental Research 973 Program of China under grant No. 2006CB303103, and the 2008 HP Labs Innovation Research Program.

    8. REFERENCES

    [1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In ICDE, pages 5-16, 2002.
    [2] A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-based keyword search in databases. In VLDB, pages 564-575, 2004.
    [3] H. Bast, A. Chitea, F. M. Suchanek, and I. Weber. ESTER: efficient search on text, entities, and relations. In SIGIR, pages 671-678, 2007.
    [4] H. Bast and I. Weber. Type less, find more: fast autocompletion search with a succinct index. In SIGIR, pages 364-371, 2006.
    [5] H. Bast and I. Weber. The CompleteSearch engine: Interactive, efficient, and towards IR & DB integration. In CIDR, pages 88-95, 2007.
    [6] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, pages 431-440, 2002.
    [7] S. Chaudhuri and R. Kaushik. Extending autocompletion to tolerate errors. In SIGMOD, 2009.
    [8] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSearch: A semantic search engine for XML. In VLDB, pages 45-56, 2003.
    [9] B. D. Davison and H. Hirsh. Predicting sequences of user actions. In AAAI/ICML Workshop on Predicting the Future, 1998.
    [10] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding top-k min-cost connected trees in databases. In ICDE, pages 836-845, 2007.
    [11] K. Grabski and T. Scheffer. Sentence completion. In SIGIR, pages 433-439, 2004.
    [12] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRank: Ranked keyword search over XML documents. In SIGMOD Conference, pages 16-27, 2003.
    [13] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: ranked keyword searches on graphs. In SIGMOD Conference, pages 305-316, 2007.
    [14] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style keyword search over relational databases. In VLDB, pages 850-861, 2003.
    [15] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB, pages 670-681, 2002.
    [16] S. Ji, G. Li, C. Li, and J. Feng. Interactive fuzzy keyword search. In WWW, 2009.
    [17] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516, 2005.
    [18] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359-392, 1998.
    [19] G. Karypis and V. Kumar. Multilevel k-way hypergraph partitioning. In DAC, pages 343-348, 1999.
    [20] K. Kukich. Techniques for automatically correcting words in text. ACM Comput. Surv., 24(4):377-439, 1992.
    [21] G. Li, J. Feng, and L. Zhou. Interactive search in XML data. In WWW, 2009.
    [22] G. Li, B. C. Ooi, J. Feng, J. Wang, and L. Zhou. EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD Conference, pages 903-914, 2008.
    [23] G. Li, X. Zhou, J. Feng, and J. Wang. Progressive keyword search in relational databases. In ICDE, 2009.
    [24] Y. Li, C. Yu, and H. V. Jagadish. Schema-free XQuery. In VLDB, pages 72-83, 2004.
    [25] F. Liu, C. T. Yu, W. Meng, and A. Chowdhury. Effective keyword search in relational databases. In SIGMOD Conference, pages 563-574, 2006.
    [26] Z. Liu and Y. Chen. Identifying meaningful return information for XML keyword search. In SIGMOD Conference, pages 329-340, 2007.
    [27] Y. Luo, X. Lin, W. Wang, and X. Zhou. SPARK: top-k keyword query in relational databases. In SIGMOD Conference, pages 115-126, 2007.
    [28] H. Motoda and K. Yoshida. Machine learning techniques to make computers easier to use. Artif. Intell., 103(1-2):295-321, 1998.
    [29] A. Nandi and H. V. Jagadish. Effective phrase prediction. In VLDB, pages 219-230, 2007.
    [30] J. Pearl. Evidential reasoning using stochastic simulation of causal models. Artif. Intell., 32(2):245-257, 1987.
    [31] C. Sun, C. Y. Chan, and A. K. Goenka. Multiway SLCA-based keyword search in XML data. In WWW, pages 1043-1052, 2007.
    [32] H. E. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst., 22(4):573-594, 2004.
    [33] Y. Xu and Y. Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In SIGMOD Conference, pages 537-538, 2005.
    [34] Y. Xu and Y. Papakonstantinou. Efficient LCA based keyword search in XML data. In EDBT, pages 535-546, 2008.