Rule-Based Table Analysis and...

CRL: A Rule Language for Table Analysis and Interpretation* in Unstructured Tabular Data Integration

Alexey Shigarov

Matrosov Institute for System Dynamics and Control Theory of SB RAS

17th International Conference on Data Analytics and Management in Data Intensive Domains

Obninsk, Russia, October 13-16, 2015

* This work was financially supported by the Russian Foundation for Basic Research (Grant No. 15-37-20042) and the Council for grants of the President of the Russian Federation (Scholarship No. SP-3387.2013.5)


Unstructured vs Structured

Unstructured Tabular Data (arbitrary tables in ASCII text, spreadsheets, PDF documents, web pages)

• For humans to understand
• No explicit semantics
• We can read, write, and edit

Structured Data (relational databases)

• For computers to understand
• Formal data model (semantics)
• We can query (SQL) and analyse (DM, OLAP)

Easy Way: from structured data to unstructured tabular data

Hard Way: from unstructured tabular data back to structured data


Hard Way Back to Structured Data World

Stages: Table Detection*, Table Recognition*, Table Analysis*, Table Interpretation*

Sources: ASCII text, untagged PDF documents, image documents (OCR), spreadsheets, web pages, Word documents

Targets: canonical forms, XML, ETL, databases

* Hurst M. Layout and language: Challenges for table understanding on the web // Proc. 1st Int. Workshop on Web Document Analysis. 2001. pp. 27-30


Our Purpose

• Globally: to automate unstructured tabular data integration (from arbitrary tables in spreadsheets to databases)

• Currently: to automate table analysis and interpretation (from arbitrary tables in spreadsheets to tables in canonical form)


OK, We Initially Have an Arbitrary Tagged Table

We know

• structure (rows, columns, cells)
• style settings (fonts, colors, alignments, etc.)
• textual content


All We Need Is To Recover Semantics

Relationships like entry-label, label-label, label-category*

* Our terminology is inspired by X. Wang's abstract table model [Wang X. Tabular Abstraction, Editing, and Formatting, PhD Thesis. 1996]


When We Know Semantics We Can Generate a Canonical Table

It can be loaded into a database by ETL tools


Challenges on the Hard Way Back

• Too many layouts to create a table
  • Anyone can invent a new one

• Messy data
  • There is no guarantee that your tabular data are clear and standardized

• Natural language
  • Table understanding requires knowledge


Our Idea

When

• A table creator (e.g. a company, a government agency, ad-hoc software) uses a set of rules for table generation

• Tables have similar structure, style, and content within a set of generating rules

Then

• We can define a set of rules for table analysis and interpretation

• We can use a rule engine to execute these rules


Table Analysis and Interpretation Rules

• Rules can be expressed in

  • Drools Rule Language* (DRL): a general-purpose language for expressing production rules in the Drools* rule engine

  • Cells Rule Language (CRL): our domain-specific language for expressing table analysis and interpretation rules

• Rules can be executed with the Drools* rule engine

* http://drools.org


CRL Rules

Rules map known facts about a table to the unknown facts to be recovered

rule
    when
        Left hand side: conditions on the available facts (cells, categories)
    then
        Right hand side: actions that recover unknown semantics
        (entries, labels, categories, entry-label, label-label, label-category)
end


CRL: Left Hand Side

factType $variable : Java boolean expressions

cell $cell : constraints

entry $entry : constraints

label $label : constraints

category $category : constraints


CRL: Right Hand Side

(Figure: merged cells and split cells)

Cell splitting

To split a cell spanning n tiles into n cells

split $cell

Cell merging

To merge two cells into one

merge $cell1 -> $cell2


CRL: Right Hand Side

Cell marking

set mark @mark -> $cell

where @mark is a word starting with the @ character

Using marks in conditions

cell $cell : mark == @mark, constraints

Short form

cell@mark $cell : constraints


CRL: Right Hand Side

Entry creating

Using a cell value

new entry $cell

Using a specified value

new entry value -> $cell

Label creating

Using a cell value

new label $cell

Using a specified value

new label value -> $cell


CRL: Right Hand Side

Label categorizing

To associate a label with a category

set category $category -> $label

Trying to find or create a category with a specified name

set category category_name -> $label


CRL: Right Hand Side

Label associating

set parent label $label1 -> $label2

• Labels can be organized in a tree

• We can build hierarchical categories

• We can build compound label values like label1|label2|…|labelN


CRL: Right Hand Side

Label grouping

group $label1 -> $label2

• A label group constitutes an anonymous category

• We can divide labels into categories before the categories themselves are known

• We can categorize an entire label group at once


CRL: Right Hand Side

Entry associating

To associate an entry with a label

add label $label -> $entry

To find or create a label with a specified value in a category

add label label_value from $category -> $entry

To find or create a category with a specified name as well

add label label_value from category_name -> $entry


Canonical Form Generation

Source table (category A labels the columns, category B labels the rows)

           a1            a2
        a11   a12     a21   a22
b1       1     2       3     4
b2       5     6       7     8

Recovered semantics

<entries>={1,2,3,4,5,6,7,8}

<labels>={a1,a11,a12,a2,a21,a22,b1,b2}

<categories>={A,B}

<entry-label pairs>={(1,a11),(1,b1),(2,a12),(2,b1),(3,a21),(3,b1),(4,a22),(4,b1),(5,a11),(5,b2),(6,a12),(6,b2),(7,a21),(7,b2),(8,a22),(8,b2)}

<label-label pairs>={(a11,a1),(a12,a1),(a21,a2),(a22,a2)}

<label-category pairs>={(a1,A),(a11,A),(a12,A),(a2,A),(a21,A),(a22,A),(b1,B),(b2,B)}

Canonical table

DATA    A           B
1       a1 | a11    b1
2       a1 | a12    b1
3       a2 | a21    b1
4       a2 | a22    b1
5       a1 | a11    b2
6       a1 | a12    b2
7       a2 | a21    b2
8       a2 | a22    b2
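The generation step on this slide can be sketched in Python. This is a minimal illustration and not the authors' implementation: the names `canonical_rows` and `compound` and the dictionary-based data layout are assumptions, while the data are the sets listed on the slide.

```python
# Recovered semantics from the slide, held in plain Python structures
# (the layout is a hypothetical choice for this sketch).
ENTRIES = [1, 2, 3, 4, 5, 6, 7, 8]
ENTRY_LABEL = [
    (1, "a11"), (1, "b1"), (2, "a12"), (2, "b1"),
    (3, "a21"), (3, "b1"), (4, "a22"), (4, "b1"),
    (5, "a11"), (5, "b2"), (6, "a12"), (6, "b2"),
    (7, "a21"), (7, "b2"), (8, "a22"), (8, "b2"),
]
LABEL_PARENT = {"a11": "a1", "a12": "a1", "a21": "a2", "a22": "a2"}
LABEL_CATEGORY = {
    "a1": "A", "a11": "A", "a12": "A", "a2": "A", "a21": "A", "a22": "A",
    "b1": "B", "b2": "B",
}


def compound(label, label_parent):
    """Join a label with its ancestors, root first, e.g. 'a1 | a11'."""
    path = [label]
    while path[-1] in label_parent:
        path.append(label_parent[path[-1]])
    return " | ".join(reversed(path))


def canonical_rows(entries, entry_label, label_parent, label_category):
    """Build one row per entry with one column per category."""
    rows = []
    for entry in entries:
        row = {"DATA": entry}
        for e, label in entry_label:
            if e == entry:
                row[label_category[label]] = compound(label, label_parent)
        rows.append(row)
    return rows


ROWS = canonical_rows(ENTRIES, ENTRY_LABEL, LABEL_PARENT, LABEL_CATEGORY)
```

Each row then holds the entry under `DATA` and one compound label value per category, which is the shape of the canonical table shown on the slide.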


Applying CRL: Critical Cells*

(Figure: example table with a blank critical cell in the top-left corner, labels a to l, and entries 1 to 7)

when
    cell $cc : cl == 1, rt == 1, blank
    cell $ec : cl > $cc.cr, rt > $cc.rb
then
    new entry $ec

-> <entries> = {1,2,3,4,5,6,7}

* Nagy G. Learning the Characteristics of Critical Cells from Web Tables // In Proc. of the 21st Int. Conf. on Pattern Recognition, Tsukuba, Japan, IEEE Comp. Soc., 2012, pp. 1554-1557
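The logic of the critical-cell rule can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the `Cell` class and `find_entries` function are hypothetical, and `cl`, `cr`, `rt`, `rb` are read here as the left column, right column, top row, and bottom row of a cell, which the slides do not spell out.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    cl: int  # left column (assumed meaning)
    cr: int  # right column (assumed meaning)
    rt: int  # top row (assumed meaning)
    rb: int  # bottom row (assumed meaning)
    text: str = ""


def find_entries(cells):
    """Return the text of every cell to the right of and below a blank top-left cell."""
    entries = []
    for cc in cells:
        # Left hand side, first condition: cl == 1, rt == 1, blank
        if cc.cl == 1 and cc.rt == 1 and not cc.text.strip():
            for ec in cells:
                # Left hand side, second condition: cl > $cc.cr, rt > $cc.rb
                if ec.cl > cc.cr and ec.rt > cc.rb:
                    entries.append(ec.text)  # right hand side: new entry $ec
    return entries


# A tiny 2x3 table: blank corner, column labels a and b, row label c.
CELLS = [
    Cell(1, 1, 1, 1, ""), Cell(2, 2, 1, 1, "a"), Cell(3, 3, 1, 1, "b"),
    Cell(1, 1, 2, 2, "c"), Cell(2, 2, 2, 2, "1"), Cell(3, 3, 2, 2, "2"),
]
```

A real rule engine such as Drools matches these conditions incrementally; the nested loops here only show which cells the conditions select.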


Applying CRL: Label Groups

(Figure: the example table from the critical cells example)

when
    cell $cc : cl == 1, rt == 1, blank
    cell $clc : cl > $cc.cr, rb <= $cc.rb
then
    set mark @ColLabel -> $clc
    new label $clc

when
    cell@ColLabel $c1
    cell@ColLabel $c2 : rt == $c1.rt
then
    group $c1.label -> $c2.label

-> <labels>={a,b,c,d,e,f,g,...}

-> <groups>={{a,b},{c,d,e},{f,g},...}


Applying CRL: Row Label Hierarchies

when
    cell $c1 : cl == 1, $l1 : label
    cell $c2 : cl == 1, rt > $c1.rt, indent == $c1.indent + 2, $l2 : label
    no cells : cl == 1, rt > $c1.rt, rt < $c2.rt, indent == $c1.indent
then
    set parent label $c1.label -> $c2.label

-> <label-label pairs> = {(c1,c),(c11,c1),(c12,c1),(c2,c),(c21,c2),(d1,d),(d11,d1)}
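The three conditions of this rule can be traced with a short Python sketch. It is an illustration only: the function `parent_pairs` and the `(label, indent)` row layout are assumptions, and the indentation values in `ROWS` are chosen so that they reproduce the label-label pairs listed on the slide.

```python
def parent_pairs(rows):
    """rows: (label, indent) of the leftmost-column cells, top to bottom.

    Returns (child, parent) pairs following the rule: the child lies below
    the parent, is indented by exactly 2 more, and no cell in between has
    the parent's indent.
    """
    pairs = []
    for i, (parent, p_indent) in enumerate(rows):
        for j in range(i + 1, len(rows)):
            child, c_indent = rows[j]
            if c_indent != p_indent + 2:
                continue
            # "no cells" condition: nothing between them at the parent's indent
            if any(indent == p_indent for _, indent in rows[i + 1:j]):
                continue
            pairs.append((child, parent))
    return pairs


# Hypothetical leftmost column matching the labels on the slide.
ROWS = [
    ("c", 0), ("c1", 2), ("c11", 4), ("c12", 4), ("c2", 2), ("c21", 4),
    ("d", 0), ("d1", 2), ("d11", 4),
]
PAIRS = parent_pairs(ROWS)
```

The "no cells" check is what stops `c21` from being attached to `c1`: the cell `c2` sits between them at the same indent as `c1`.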


Applying CRL: YAML* Specified Categories

Category YAML specification

# category YEAR
name: Year
description: years from 1982 to 2015
constraints:
  - "198[2-9]"
  - "200[1-9]"
  - "201[0-5]"

when
    category $c : name == "Year"
    label $l : $c.canHaveLabel(value)
then
    set category $c -> $l

Category YAML specification

# category COUNTRY_CODE
name: CountryCode
description: ISO 3166 2-letter country codes
labels:
  - AD
  - AE
  - ...
  - ZW

when
    category $c : name == "CountryCode"
    label $l : $c.hasLabel(value)
then
    set category $c -> $l

* http://yaml.org
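The two checks used in these rules, `canHaveLabel` and `hasLabel`, might behave roughly as in the following Python sketch. It is not the prototype's code: the categories are written as plain dictionaries rather than parsed from YAML, the function names are Python-style stand-ins, and treating each constraint as a regular expression that must match the whole label value is an assumption. The constraint patterns are copied from the slide as they appear, and the country code list is abridged as it is there.

```python
import re

# Plain-dict mirrors of the two category specifications (no YAML parser used).
YEAR = {"name": "Year", "constraints": ["198[2-9]", "200[1-9]", "201[0-5]"]}
COUNTRY_CODE = {"name": "CountryCode", "labels": ["AD", "AE", "ZW"]}  # abridged


def can_have_label(category, value):
    """True if the value fully matches one of the category's patterns."""
    return any(re.fullmatch(p, value) for p in category.get("constraints", []))


def has_label(category, value):
    """True if the value is one of the category's enumerated labels."""
    return value in category.get("labels", [])


def categorize(labels, categories):
    """Return (label, category name) pairs for every label a category accepts."""
    pairs = []
    for label in labels:
        for category in categories:
            if can_have_label(category, label) or has_label(category, label):
                pairs.append((label, category["name"]))
    return pairs
```

For example, `categorize(["1985", "AE", "2016", "total"], [YEAR, COUNTRY_CODE])` pairs `1985` with `Year` and `AE` with `CountryCode`, and leaves the other two labels uncategorized.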


Applying CRL: Category Names

when
    cell $cc : cl == 1, rt == 1
    cell $c : mark == "@ColLabel"
then
    set category token($cc, 0) -> $c.label

Example table (the top-left cell contains two tokens, A and B)

A
B     a1    a2    a3
b1    1     2     3
b2    4     5     6

-> <categories> = {A,...}

-> <labels> = {a1,a2,a3,...}

-> <label-category pairs> = {(a1,A),(a2,A),(a3,A),...}


Applying CRL: Multi-Valued Cells

Bilingual Tables

          α         β
          阿爾法     公測
γ         1         2
伽馬      一        二
δ         3         4
三角洲    三        四

when
    cell $c : cl == 1 || rt == 1, !blank
then
    new label token($c, 0) -> $c
    new label token($c, 1) -> $c

Key=Value Cells

C1       C2       C3
a = 1    b = 2    c = 3
d = 4    e = 5    f = 6
g = 7    h = 8    i = 9

when
    cell $c : rt > 1
then
    new label left($c, '=') -> $c
    new entry right($c, '=') -> $c
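The helper functions `token`, `left`, and `right` used in these two rules are not defined on the slides. A plausible reading, sketched in Python under the assumptions that tokens are separated by whitespace (including line breaks) and that `left` and `right` split on the first occurrence of the separator, is:

```python
def token(text, index):
    """Return the whitespace-separated token at the given position (assumed semantics)."""
    return text.split()[index]


def left(text, separator):
    """Return the trimmed text before the first separator (assumed semantics)."""
    return text.split(separator, 1)[0].strip()


def right(text, separator):
    """Return the trimmed text after the first separator (assumed semantics)."""
    return text.split(separator, 1)[1].strip()


# A bilingual header cell yields two labels; a key=value cell yields a label and an entry.
BILINGUAL_LABELS = (token("α\n阿爾法", 0), token("α\n阿爾法", 1))
KEY_VALUE = (left("a = 1", "="), right("a = 1", "="))
```

With these definitions the cell `a = 1` produces the label `a` and the entry `1`, matching the intent of the second rule.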


Applying CRL: Footnotes

when
    cell $footer : onLastRow, $notes : text
    entry $e : cell.text matches ".+\\*+", $ref : extract(cell.text, "\\*+")
then
    add label between($notes, $ref, '\n') from "footnotes" -> $e

Example table

           a             b
        c      d      c      d
e       1*     2**    3      4
f       5      6      7      8
g       9      10     11     12
* x
** y

-> <labels>={x,y,...}

-> <categories>={"footnotes",...}

-> <entry-label pairs>={(1,x),(2,y),...}

-> <label-category pairs>={(x,"footnotes"), (y,"footnotes"),...}
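The footnote rule can be approximated in Python as follows. This is a sketch rather than the CRL semantics: `footnote_text` is a hypothetical line-based stand-in for the slide's `between($notes, $ref, '\n')`, chosen so that the reference `*` is not confused with `**`, and `footnote_labels` is an assumed wrapper that plays the role of the whole rule.

```python
import re


def footnote_text(notes, ref):
    """Return the note whose leading marker equals ref exactly, or None."""
    for line in notes.split("\n"):
        marker, _, text = line.strip().partition(" ")
        if marker == ref:
            return text.strip()
    return None


def footnote_labels(entry_texts, notes):
    """Pair each entry that ends in asterisks with the text of its footnote."""
    pairs = []
    for text in entry_texts:
        # Same test as: cell.text matches ".+\\*+"
        if re.fullmatch(r".+\*+", text):
            ref = re.search(r"\*+$", text).group(0)  # trailing run of asterisks
            pairs.append((text.rstrip("*"), footnote_text(notes, ref)))
    return pairs


NOTES = "* x\n** y"
PAIRS = footnote_labels(["1*", "2**", "3", "4"], NOTES)
```

Entries without a trailing asterisk are skipped, so only `1` and `2` receive labels from the `footnotes` category in this example.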


Applying CRL: Colored Tables

when
    cell $lc : style.bgColor == "#4f81bd"
    cell $ec : style.bgColor == null, rt >= $lc.rt, cl > $lc.cr
    no cells : style.bgColor == "#4f81bd", cl > $lc.cr, cr < $ec.cl
then
    add label $lc.label -> $ec.entry

(Figure: example colored tables with label cells l1 to l8, entry cells e1 to e9, and categories c1 and c2)


Prototype of Spreadsheet Data Extraction and Transformation System


Experimental Evaluation

Our purpose is to evaluate the recovery of entries, labels, and entry-label and label-label relationships

Dataset

• We use the TANGO dataset (http://tango.byu.edu/data), which

  • is a part of the TANGO (Table ANalysis for Generating Ontologies) project (http://tango.byu.edu)

  • is intended for testing table interpretation methods

  • has 200 arbitrary tables collected from 10 statistical sites in spreadsheet format in 2009


Experimental Evaluation

Layouts of TANGO Tables

(Figure: shares of layouts by table region. Regions: category name cells, column label cells, row label cells, entry cells. Layouts: one-row plain, multi-row plain, multi-row hierarchical; one-column plain, multi-column plain, one-column hierarchical; one-column & one-row, multi-column & one-row, multi-column & multi-row. Shares shown: 47.5%, 47%, 5.5%, 100%, 94.5%, 5.5%, 65.5%, 26%, 8.5%)

We develop two sets of CRL rules to define two table types

• TANGO-200: all tables

• TANGO-SUB: without tables having hierarchical layout in the leftmost column


Experimental Evaluation

Measures

• Recall: a table is processed successfully when all entries, labels, entry-label pairs, and label-label pairs that are implicitly contained in its source form are explicitly included in its canonical form

• Precision: a table is processed successfully when all entries, labels, entry-label pairs, and label-label pairs that are explicitly included in its canonical form are implicitly contained in its source form

Process

• Two experts independently compare the source tables with their automatically generated canonical forms

• They judge whether each table is processed successfully in terms of recall and precision

• When they disagree on a table, the final decision is made by a third expert
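The measures and the review process can be stated compactly in Python. This is a sketch under assumptions: in the experiment the judgments are made by human experts, not computed, so the set comparisons below only restate the definitions; the function names are hypothetical, and reporting each measure as the share of successfully processed tables is an assumed reading of the slide.

```python
def recall_ok(source_items, canonical_items):
    """Success in terms of recall: everything in the source is in the canonical form."""
    return set(source_items) <= set(canonical_items)


def precision_ok(source_items, canonical_items):
    """Success in terms of precision: everything in the canonical form is in the source."""
    return set(canonical_items) <= set(source_items)


def final_decision(expert1, expert2, expert3):
    """Two experts decide; the third expert's verdict is used only when they disagree."""
    return expert1 if expert1 == expert2 else expert3


def share_of_successes(decisions):
    """Fraction of tables judged as processed successfully."""
    return sum(decisions) / len(decisions)


# Toy example: four tables, one of which the first two experts disagree on.
DECISIONS = [
    final_decision(True, True, False),
    final_decision(True, False, True),
    final_decision(False, False, True),
    final_decision(True, True, True),
]
```

Here `share_of_successes(DECISIONS)` gives 0.75, since three of the four toy tables end up judged as successful.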


Experimental Evaluation: Results

Rule Set / Table Type    TANGO-200    TANGO-SUB
Tables                   200          105
Cells                    22757        10893
Rules                    16           13
Recall                   87%          95%
Precision                89%          95%

For TANGO-200

• 33 tables are processed with errors

• 85% of errors originate in the leftmost column with one-column hierarchical layout

• Two main causes:
  1) ambiguity among style characteristics
  2) hierarchical relationships expressed by natural language only


Comparison with Others

Methods for Table Analysis and Interpretation

Fixed table types, domain-specific

• Douglas, 1995; Tijerino, 2005; Embley, 2005; WangJ, 2012
• Domain ontologies
• Taxonomies like ProBase, FreeBase

Fixed table types, domain-independent

• Gatterbauer, 2007; Pivk, 2005, 2006, 2007; Kim, 2008; Chen & Cafarella, 2013, 2014; Embley, 2014; Nagy, 2014
• Spatial, style, and textual data
• Several typical table types

Programmable table types

• We are here! (2014, 2015)
  • Rule language (CRL, DRL)
  • Relative cell addressing
  • Fixed target schema
  • Spatial, style, and textual data

• Hung, 2011
  • Spreadsheet-like formula mapping language (TranSheet)
  • Absolute cell addressing
  • Programmable target schema
  • Spatial and textual data


Conclusions

• Our methodology is aimed mainly at unstructured tabular data integration

• We expect it to be useful when data from a large number of tables belonging to a few table types are required for populating a database

• One set of rules can be suitable for processing a wide range of arbitrary tables with high accuracy

• The experiment demonstrates that narrowing a table type can simplify the rules and increase recall and precision in table canonicalization


Further Work

• Table Layouts

to develop techniques for widely used table features, e.g. for recovering a row label hierarchy in the leftmost column

• Messy Tabular Data

to incorporate data cleansing techniques into table understanding

• Natural Language

to add knowledge, global taxonomies (e.g. FreeBase, DBpedia) and domain ontologies


Supplementary Materials

CRL language specification

Examples of CRL rules

All details of our experiment

http://cells.icc.ru/pub/crl

Source code of our prototype licensed under Apache License 2.0

https://github.com/shigarov/cells-ssdc


Thanks!

This presentation is available on SlideShare.net

http://www.slideshare.net/shig

Alexey Shigarov

http://cells.icc.ru