
Page 1: Automatically Extracting Structure and Data from Business Reports Stephen W. Liddle School of Accountancy and Information Systems Marriott School of Management.

Automatically Extracting Structure and Data from Business Reports

Stephen W. Liddle
School of Accountancy and Information Systems
Marriott School of Management
Brigham Young University

Co-Authors: Douglas M. Campbell, Chad Crawford


11/3/99 CIKM’99, Copyright © 1999, Stephen W. Liddle 2 of 25

Overview

What are business reports and why do we care about them?
Extracting structure and data
  Field types
  Line types
  Page headers/footers
  Inferring recursive group structure
Experimental results


Business Reports

Business reports are used to disseminate information pertinent to business operations
  Financial, inventory, production
They are the result of periodic data processing
  Daily, weekly, monthly, etc.
  COBOL, 4GLs, report writers


Business Reports

RUN 05/21/99 12:34:56 00551 L A R G E C D R E P O R T ACCR: 04/26/99 POST: 05/21/99 PAGE 001

CUST NBR CD NBR N A M E BALANCE RATE MATURITY OFC

006 9994    10355   JASON MASON CONSTRUCTION INC     100,000.00    .06005   03/07/99
008 9992     9657   FANNY M RYEBERG                  300,000.56    .05990   04/22/99  MS
009 9991     9541   JOHN SMITH JR                  1,100,000.00    .05990   04/22/99  MS
011 9989    11225   BARNEY FIFE                      105,529.23    .06250   05/16/99

* * * TOTAL LARGE CD * * * 1,605,529.79

CLL3605 12 BANK 012 BRANCH 0402 MY TOWN BANK-CHARGE OFF BK 05/19/98 PAGE 1
EXCEPTION REPORT - NUMBER 12 PART.SOLD - DETAIL LIST

CUSTOMER NUMBER OFF ORG DATE INTEREST RATE LOAN BALANCE PART. INTEREST PARTICIPANT OUR OUR PART.
CUSTOMER NAME LDG MAT DATE ORG AMOUNT LOAN CODE RATE OWING PART BALANCE BALANCE LISTED WITH

9999900—MY TOWN BANK --------------------

1234567 9001 81196 09-25-92 14.50000 50,860.19 44,428.59
TUBBS, BILLY 63 DEMAND 60,825.23 OA 14.50000 62,814.48 14,954-

1234569 9002 09644 03-22-93 12.75000 29,817.50 20,386.38
JONES, MARIA 66 DEMAND 30,079.41 OA 12.75000 30,079.41 1,261-

PART-TOTALS NUMBER 64,814.97 16,216.20- COUNT 2 96,893.89

BRANCH TOTALS NUMBER .00 .00 COUNT .00


Business Reports

M Y T O W N S T A T E B A N K                    RUN/DATE: 05/20/98 TIME: 00:34
TRANS/DATE: 05/21/99    CASH CONTROL - TELLER CASH LISTING    CASH4    PAGE NO. 1

OFFICE NUMBER 001

......................................................................................................................
. TELLER .    AMOUNT TYPE BATCH &   .    AMOUNT TYPE BATCH &   .    AMOUNT TYPE BATCH &   .    AMOUNT TYPE BATCH &   .
. NUMBER .                SEQUENCE  .                SEQUENCE  .                SEQUENCE  .                SEQUENCE  .
......................................................................................................................
. 001    .     15.00 OUT  214-23767 .     22.00 OUT  214-23632 .     41.17 OUT  214-23651 .     45.00 OUT  214-23726 .
.        .    190.00 OUT  214-23752 .    200.00 OUT  215-18550 .    300.00 OUT  215-18579 .    300.00 OUT  214-23735 .
.        .    400.00 OUT  214-23754 .    400.00 OUT  214-23810 .    400.00 OUT  215-18548 .    500.00 OUT  215-18600 .
.        .  1,000.00 OUT  214-23764 .  4,138.85 OUT  215-05631 . 10,000.00 OUT  214-23780 .     20.00 IN   214-23686 .
.        .     60.00 IN   214-23670 .    100.00 IN   214-23711 .    110.25 IN   214-23720 .    140.00 IN   214-23763 .
.        .    160.00 IN   214-23679 .    200.00 IN   214-23643 .    200.00 IN   214-23647 .    200.00 IN   214-23696 .
.        .    211.00 IN   214-23751 .    280.00 IN   214-23655 .    300.00 IN   214-23739 .    300.00 IN   214-23770 .
.        .    340.48 IN   214-23777 .    400.00 IN   214-23740 .    400.00 IN   214-23732 .    400.00 IN   215-18575 .
.        .    420.00 IN   214-23813 .    700.00 IN   214-23734 .  1,000.00 IN   214-23779 .  1,003.86 IN   214-23888 .
.        .  1,240.85 IN   214-23718 .  1,506.00 IN   214-23742 .  1,806.00 IN   214-23692 .  6,000.00 IN   214-23688 .
.        .............................................................................................................
.        .  CASH OUT TOTAL 17,952.02 *    CASH IN TOTAL 17,498.44 *    NET CASH TOTAL 453.58-*
......................................................................................................................
. 002    .      6.00 OUT  214-27788 .     15.00 OUT  214-27821 .     25.00 OUT  214-28073 .     40.00 OUT  214-27836 .
.        .     50.00 OUT  214-27798 .    200.00 OUT  214-27790 .    200.00 OUT  215-18551 .    250.00 OUT  214-27819 .
.        .    400.00 OUT  215-18547 .  1,000.00 OUT  215-18545 .  1,000.00 OUT  214-27668 .  1,080.00 OUT  214-27658 .
.        .  4,000.00 OUT  214-27675 .  4,000.00 OUT  214-27662 .  3,545.42 OUT  215-05659 .     45.00 IN   214-27753 .
.        .     50.00 IN   214-27810 .     60.00 IN   214-27807 .     95.00 IN   214-27725 .    265.00 IN   214-27723 .
.        .    305.00 IN   214-27759 .    330.00 IN   214-27755 .    400.00 IN   214-27797 .    400.00 IN   214-27818 .
.        .    408.73 IN   214-27667 .    419.03 IN   214-27832 .    560.00 IN   214-27805 .    600.00 IN   214-27811 .
.        .    850.00 IN   214-27650 .  1,000.00 IN   214-27640 .  1,821.85 IN   214-27678 .  4,200.00 IN   214-27785 .
.        .  3,480.32 IN   214-27695 .                          .                          .                          .
.        .............................................................................................................
.        .  CASH OUT TOTAL 15,811.42 *    CASH IN TOTAL 15,289.93 *    NET CASH TOTAL 521.49-*
......................................................................................................................
. 003    .     10.00 OUT  214-18486 .     20.00 OUT  214-18462 .     20.00 OUT  215-18640 .     40.00 OUT  214-27483 .
.        .     50.00 OUT  214-18296 .     50.00 OUT  214-18301 .     55.00 OUT  214-27456 .    120.77 OUT  214-18465 .
.        .    137.00 OUT  214-27486 .    342.54 OUT  214-18489 .    700.00 OUT  214-27490 .  1,255.00 OUT  214-18449 .
.        .  1,705.59 OUT  215-18642 .  1,765.34 OUT  215-18649 .  1,884.92 OUT  215-18629 . 15,000.00 OUT  214-27882 .
.        .       .15 IN   214-18417 .     10.00 IN   214-18429 .     18.62 IN   214-27395 .     29.00 IN   214-25207 .
.        .     50.00 IN   214-27842 .     50.00 IN   214-18393 .     68.28 IN   214-27399 .    100.00 IN   214-27474 .
......................................................................................................................


Business Reports

BCRCL10 PACKAGE DETAIL 05/21/99 PAGE 1
TO-YOUR TOWN BANK FROM-MY TOWN STATE BANK JOB-001 SP-P2414 DVC-01 PKT 13

   AMOUNT SEQUENCE     AMOUNT SEQUENCE     AMOUNT SEQUENCE     AMOUNT SEQUENCE     AMOUNT SEQUENCE     AMOUNT SEQUENCE
   123.45 21409648      81.56 21410732      23.00 21411405        .58 21412327      25.32 21412947      50.00 21413527
   679.00 21409664      34.05 21410744     100.00 21411408      71.00 21412329     115.47 21412950     136.80 21413531
   170.00 21409667      27.68 21410750      40.00 21411409     150.00 21412337      25.00 21412951      21.40 21413548
    38.00 21409714     528.94 21410772     100.00 21411416     100.00 21412340      18.60 21412952      25.00 21413576
     5.00 21409742     274.00 21410780      75.00 21411420      68.00 21412344      15.00 21412957      50.00 21413580
    13.75 21409743     383.53 21410793      65.00 21411423      25.00 21412351     232.44 21412962      20.00 21413582
   849.77 21409778     511.46 21410816      40.00 21411430      56.60 21412377     200.00 21412991      26.40 21413583
   211.43 21409829     276.22 21410860      40.00 21411431      71.50 21412378     100.00 21412992      10.00 21413596
   291.58 21409914      62.00 21410883      40.00 21411438     432.24 21412423      28.00 21412995      10.00 21413601
    15.63 21409936      35.00 21410888     854.00 21411491      11.00 21412426       8.22 21413006      25.00 21413603
    50.00 21409985      35.00 21410889      86.34 21411515     258.00 21412446      18.00 21413007     103.00 21413639
 1,500.00 21410053      35.00 21410890   1,000.00 21411545     483.10 21412474      79.90 21413008      70.00 21413653
    20.39 21410062      35.00 21410892   1,000.00 21411546     260.00 21412475      75.00 21413013     799.63 21413709
   257.61 21410065      35.00 21410894     277.05 21411579     115.00 21412498       6.00 21413017     116.24 21413713
 7,467.35 21410082      33.00 21410904     351.00 21411614     100.00 21412538      29.86 21413020     103.09 21413725
   692.90 21410083      18.40 21410905      42.14 21411678     132.14 21412543      61.95 21413022   1,000.00 21413730
   927.98 21410084      19.25 21410906      61.61 21411682      30.00 21412557      28.84 21413024     246.00 21413752
    25.00 21410126      10.02 21410909     432.16 21411737     150.00 21412623      35.15 21413028      40.00 21413814
     7.50 21410141      63.00 21410919      69.00 21411765      15.80 21412643      25.00 21413041      35.36 21413815
     7.50 21410145   4,751.00 21410921      64.00 21411768      42.65 21412646      40.00 21413042      65.00 21413817
    60.00 21410152      45.69 21410922      64.00 21411769      22.75 21412653      50.00 21413043      21.00 21413818
    27.50 21410154     125.00 21410923     438.00 21411791      10.00 21412688     200.00 21413044      30.00 21413819
    20.00 21410163      75.00 21410948      44.15 21411801      62.00 21412698      39.31 21413050      20.00 21413820
     7.50 21410164      19.93 21410954      40.00 21411815      70.00 21412699      11.89 21413051      24.99 21413821
     7.00 21410337      60.00 21411252     258.00 21412121      40.00 21412899     414.00 21413192
    75.00 21410338      59.00 21411257     516.00 21412153      40.00 21412901      20.00 21413199
    15.00 21410341      35.00 21411268     450.00 21412164     200.00 21412903      87.35 21413200
    45.00 21410360     592.00 21411269      35.00 21412183      69.09 21412922      80.00 21413229
   258.00 21410390      64.06 21411327      65.00 21412193      42.53 21412923      16.00 21413231
   100.00 21410551      49.28 21411347      15.00 21412195      47.32 21412924      85.00 21413336
   333.33 21410552      68.93 21411362      60.00 21412196      17.08 21412925     215.00 21413361
   852.34 21410581      56.64 21411365      30.00 21412220     100.00 21412934     312.50 21413468  YOUR TOWN BANK
   110.00 21410642     163.32 21411380     148.66 21412269     200.00 21412935     500.00 21413469  - ( )
    50.00 21410657     160.50 21411383      29.95 21412300      10.00 21412936     100.00 21413490  FIRST 123.45
   449.60 21410673   5,000.00 21411390      29.95 21412301     200.00 21412937      20.00 21413503  LAST 24.99
   112.80 21410709     417.34 21411398      29.95 21412310      81.79 21412944      18.00 21413506  SEP# 00000
    61.24 21410715     129.18 21411401     164.25 21412326      25.29 21412946       4.83 21413520  47,452.45 209


Business-Report Structure

Pages
Rows/columns
Page headers/footers
Group headers/footers

Assumptions
  ASCII format (EBCDIC is easy to translate)
  Page boundaries are known
  Blank lines can essentially be ignored


Type I Reports

A type I report exhibits repeated structure only along the row (vertical) dimension

RUN 05/21/99 12:34:56 00551 L A R G E C D R E P O R T ACCR: 04/26/99 POST: 05/21/99 PAGE 001

CUST NBR CD NBR N A M E BALANCE RATE MATURITY OFC

006 9994    10355   JASON MASON CONSTRUCTION INC     100,000.00    .06005   03/07/99
008 9992     9657   FANNY M RYEBERG                  300,000.56    .05990   04/22/99  MS
009 9991     9541   JOHN SMITH JR                  1,100,000.00    .05990   04/22/99  MS
011 9989    11225   BARNEY FIFE                      105,529.23    .06250   05/16/99

* * * TOTAL LARGE CD * * * 1,605,529.79


Type II Reports

A type II report exhibits repeated structure within rows (along the horizontal dimension)

BCRCL10 PACKAGE DETAIL 05/21/99 PAGE 1

TO-YOUR TOWN BANK FROM-MY TOWN STATE BANK JOB-001 SP-P2414 DVC-01 PKT 13

   AMOUNT SEQUENCE     AMOUNT SEQUENCE     AMOUNT SEQUENCE     AMOUNT SEQUENCE     AMOUNT SEQUENCE     AMOUNT SEQUENCE
   123.45 21409648      81.56 21410732      23.00 21411405        .58 21412327      25.32 21412947      50.00 21413527
   679.00 21409664      34.05 21410744     100.00 21411408      71.00 21412329     115.47 21412950     136.80 21413531
. . .
     7.50 21410164      19.93 21410954      40.00 21411815      70.00 21412699      11.89 21413051      24.99 21413821
     7.00 21410337      60.00 21411252     258.00 21412121      40.00 21412899     414.00 21413192
   333.33 21410552      68.93 21411362      60.00 21412196      17.08 21412925     215.00 21413361
   852.34 21410581      56.64 21411365      30.00 21412220     100.00 21412934     312.50 21413468  YOUR TOWN BANK
   110.00 21410642     163.32 21411380     148.66 21412269     200.00 21412935     500.00 21413469  - ( )
    50.00 21410657     160.50 21411383      29.95 21412300      10.00 21412936     100.00 21413490  FIRST 123.45
   449.60 21410673   5,000.00 21411390      29.95 21412301     200.00 21412937      20.00 21413503  LAST 24.99
   112.80 21410709     417.34 21411398      29.95 21412310      81.79 21412944      18.00 21413506  SEP# 00000
    61.24 21410715     129.18 21411401     164.25 21412326      25.29 21412946       4.83 21413520  47,452.45 209

Repeated structure within a row (first two column pairs):

   AMOUNT SEQUENCE     AMOUNT SEQUENCE
   123.45 21409648      81.56 21410732
   679.00 21409664      34.05 21410744
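
The horizontal repetition in a type II row can be illustrated with a small sketch. It is not part of the presented method (the study covers only type I reports); the `PAIR` pattern and the `split_pairs` helper are assumptions made for this example, and the 8-digit sequence width is read off the sample rows only.

```python
import re

# Assumed shape of one (AMOUNT, SEQUENCE) pair: a currency amount followed by
# an 8-digit sequence number. This pattern is NOT taken from the slides.
PAIR = re.compile(r"(\d{0,3}(?:,\d{3})*\.\d\d)\s+(\d{8})\b")


def split_pairs(row):
    """Return every (amount, sequence) pair found in one report row."""
    return PAIR.findall(row)


# First data row of the PACKAGE DETAIL sample report.
row = ("123.45 21409648 81.56 21410732 23.00 21411405 "
       ".58 21412327 25.32 21412947 50.00 21413527")
pairs = split_pairs(row)
print(len(pairs), pairs[0], pairs[3])
```

Trailing summary items such as `FIRST 123.45` are not followed by a sequence number, so this pattern leaves them out of the pair list.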


Structure Extraction Process

[Diagram: Business Report + Field Description Lattice -> Extract Fields -> Infer Line Types -> Infer Page Headers/Footers -> Infer Recursive Group Structure -> Report Structure Decomposition]


Data Extraction Process

[Diagram: Business Reports + Report Structure Decomposition -> Extract Data -> Populated Database]


Delimitations

This study examines only type I reports (i.e., a line in a report pertains to one record)
We focus on report structure extraction


Algorithm 1: Extract Fields

Use the field extraction lattice to identify basic fields in each line of the report
Represent the lattice with a total ordering E of regular expressions

Lattice (general to specific):

Any
  String
    Time: HMS, HM
    Date: Julian, DMY, MDY, …
    Phone Number: w/AC, no AC, …
    ID Code
    Number: Percent, Negative, Currency, …
    Page Number
    …


Field Extraction

Extract fields to form the line type vector
  General Number  \b\d+(,\d\d\d)*\.?\d*(?=(\D|$))
  String          ([^ ]+([ ][^ ]+)*)
  Currency        \d*(,\d\d\d)*\.\d\d(?=(\D|$))

006 9994    10355   JASON MASON CONSTRUCTION INC     100,000.00    .06005   03/07/99

Fields, left to right: General Number (three fields), String, Currency, Fraction, DMY

Line Type Vector:
  General Number (1, 3, "006")
  General Number (5, 4, "9994")
  General Number (13, 5, "10355")
  String (21, 28, "JASON MASON CONSTRUCTION INC")
  Currency (54, 10, "100,000.00")
  Fraction (68, 6, ".06005")
  DayMonthYear (77, 8, "03/07/99")
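
The field-extraction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the General Number, String, and Currency patterns come from the slide, while the two date patterns, the Fraction pattern, the ordering of the list, and the names `FIELD_PATTERNS` and `extract_fields` are assumptions. The sample line is rebuilt from the column positions given in the line type vector.

```python
import re

# Ordered list standing in for the total ordering E (most specific first).
# ASSUMPTION: the date and Fraction patterns and the ordering are invented
# for this sketch; the other three patterns are copied from the slide.
FIELD_PATTERNS = [
    ("DayMonthYear", re.compile(r"(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/\d\d(?=(\D|$))")),
    ("MonthDayYear", re.compile(r"(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d\d(?=(\D|$))")),
    ("Currency", re.compile(r"\d*(,\d\d\d)*\.\d\d(?=(\D|$))")),
    ("Fraction", re.compile(r"\.\d+(?=(\D|$))")),
    ("General Number", re.compile(r"\b\d+(,\d\d\d)*\.?\d*(?=(\D|$))")),
    ("String", re.compile(r"([^ ]+([ ][^ ]+)*)")),
]


def extract_fields(line):
    """Return (type, start column, length, text) tuples; columns are 1-based."""
    fields = []
    pos = 0
    while pos < len(line):
        if line[pos] == " ":
            pos += 1
            continue
        for name, pattern in FIELD_PATTERNS:
            match = pattern.match(line, pos)
            if match and match.end() > pos:
                fields.append((name, pos + 1, match.end() - pos, match.group(0)))
                pos = match.end()
                break
        else:
            pos += 1  # defensive: String matches any non-space character
    return fields


# Sample line rebuilt from the (start, length) positions on the slide.
SAMPLE_LINE = ("006 9994" + " " * 4 + "10355" + " " * 3
               + "JASON MASON CONSTRUCTION INC" + " " * 5 + "100,000.00"
               + " " * 4 + ".06005" + " " * 3 + "03/07/99")

for field in extract_fields(SAMPLE_LINE):
    print(field)
```

Trying the day-month-year reading before month-day-year is one way to reproduce the slide's labels: 03/07/99 is read as DMY, while 04/22/99 can only be MDY because 22 is not a valid month.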


Algorithm 2: Infer Line Types

Cluster line types by similarity to form the set B of basic line types for the report R
Use line distances:
  First-order distance
    Based on character comparison
    Identical strings have distance 0
  Second-order distance
    Based on field types
    Uses the field-description lattice for distance
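
The slides do not give the distance formulas, so the following is only one plausible reading. The first-order distance here is the fraction of differing character positions; the second-order distance sums lattice path lengths between corresponding field types. The `PARENT` map is a partial copy of the lattice from the Algorithm 1 slide (placing Fraction under Number is an assumption), and `MISSING_FIELD_COST` is a made-up constant.

```python
# Partial parent map for the field-description lattice.
# ASSUMPTION: Fraction is placed under Number; the slide does not show it.
PARENT = {
    "String": "Any",
    "Time": "String", "Date": "String", "Phone Number": "String",
    "ID Code": "String", "Number": "String", "Page Number": "String",
    "HMS": "Time", "HM": "Time",
    "Julian": "Date", "DMY": "Date", "MDY": "Date",
    "Percent": "Number", "Negative": "Number", "Currency": "Number",
    "Fraction": "Number",
}

MISSING_FIELD_COST = 3  # hypothetical penalty for a field with no counterpart


def ancestors(field_type):
    """Chain from a field type up to the top of the lattice."""
    chain = [field_type]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def field_distance(a, b):
    """Number of lattice edges on the path between a and b via their nearest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("types share no ancestor")


def first_order_distance(line_a, line_b):
    """Fraction of character positions that differ (0.0 for identical strings)."""
    width = max(len(line_a), len(line_b))
    if width == 0:
        return 0.0
    a, b = line_a.ljust(width), line_b.ljust(width)
    return sum(x != y for x, y in zip(a, b)) / width


def second_order_distance(types_a, types_b):
    """Sum of lattice distances over paired fields plus a penalty per unmatched field."""
    shared = sum(field_distance(a, b) for a, b in zip(types_a, types_b))
    return shared + MISSING_FIELD_COST * abs(len(types_a) - len(types_b))
```

Under these assumptions DMY and MDY are two steps apart (both sit directly under Date), so two lines that differ only in date subtype stay close.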


Infer Basic Line Types

Generalize line types when distance is small

006 9994    10355   JASON MASON CONSTRUCTION INC     100,000.00    .06005   03/07/99
  Fields: General Number (three fields), String, Currency, Fraction, DMY

008 9992     9657   FANNY M RYEBERG                  300,000.56    .05990   04/22/99  MS
  Fields: General Number (three fields), String, Currency, Fraction, MDY, String

Observations:
  Right-aligned numbers
  Left-aligned strings
  Similar date fields
  Fully aligned
  Optional string field
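
One way to picture the generalization step is to replace each pair of corresponding field types with their nearest common ancestor in the lattice and to mark unmatched trailing fields as optional. This is a sketch under those assumptions only: the small `PARENT` map, the helper names, and the restriction to trailing optional fields are all invented for illustration.

```python
# Small excerpt of the field-description lattice (child -> parent).
# ASSUMPTION: Fraction under Number is a guess, as is the whole procedure.
PARENT = {
    "DMY": "Date", "MDY": "Date", "Date": "String",
    "Currency": "Number", "Fraction": "Number", "Number": "String",
    "String": "Any",
}


def least_general(a, b):
    """Nearest common ancestor of two field types in the lattice."""
    chain = [a]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    node = b
    while node not in chain:
        node = PARENT[node]  # KeyError if the types share no ancestor
    return node


def generalize(types_a, types_b):
    """Merge two line types into a list of (field type, is_optional) pairs."""
    shorter, longer = sorted((types_a, types_b), key=len)
    merged = [(least_general(a, b), False) for a, b in zip(shorter, longer)]
    merged += [(extra, True) for extra in longer[len(shorter):]]
    return merged


first = ["Number", "Number", "Number", "String", "Currency", "Fraction", "DMY"]
second = ["Number", "Number", "Number", "String", "Currency", "Fraction", "MDY", "String"]
print(generalize(first, second))
```

For the two sample lines this yields a Date field in place of DMY/MDY and a final optional String field, matching the observations listed on the slide.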


Algorithm 3: Infer Page Headers/Footers

Separate report detail from page headers and footers
A line is considered detail if
  It repeats in the report two or more times in immediate succession, or
  It repeats more than twice on one page
Find the maximal page prefix/suffix of non-detail lines
Remove page headers/footers and blank lines
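
The two detail rules and the prefix/suffix search can be sketched as below. Each page is assumed to be a list of line-type labels (one label per line, already clustered), with `BLANK` standing for a blank line. Treating the header and footer as the longest non-detail prefix and suffix shared by every page is an interpretation, not something the slides state, and all function names are hypothetical.

```python
from collections import Counter

BLANK = ""  # label assumed for blank lines


def detail_types(pages):
    """Line types that count as detail under the two rules on the slide."""
    detail = set()
    flat = [t for page in pages for t in page]
    # Rule 1: repeats two or more times in immediate succession.
    detail.update(a for a, b in zip(flat, flat[1:]) if a == b)
    # Rule 2: repeats more than twice on one page.
    for page in pages:
        detail.update(t for t, n in Counter(page).items() if n > 2)
    return detail


def common_affix(pages, detail, from_end=False):
    """Longest run of non-detail types shared by all pages at the start (or end)."""
    views = [page[::-1] if from_end else page for page in pages]
    affix = []
    for column in zip(*views):
        if len(set(column)) != 1 or column[0] in detail:
            break
        affix.append(column[0])
    return affix[::-1] if from_end else affix


def strip_headers_footers(pages):
    """Return (header, footer, pages with header, footer, and blanks removed)."""
    pages = [[t for t in page if t != BLANK] for page in pages]
    detail = detail_types(pages)
    header = common_affix(pages, detail)
    footer = common_affix(pages, detail, from_end=True)
    body = [page[len(header):len(page) - len(footer)] for page in pages]
    return header, footer, body


# H = RUN line, C = column headings, a = record line, T = total line.
pages = [["H", "", "C", "", "a", "a", "a", "a"],
         ["H", "", "C", "", "a", "a", "a", "a", "", "T"]]
print(strip_headers_footers(pages))
```

With this reading, a total line that appears only on the last page is left in the body for the group-structure step rather than being treated as a page footer.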


Page Headers/Footers

RUN 05/21/99 12:34:56 00551 L A R G E C D R E P O R T ACCR: 04/26/99 POST: 05/21/99 PAGE 001

CUST NBR CD NBR N A M E BALANCE RATE MATURITY OFC

006 9994    10355   JASON MASON CONSTRUCTION INC     100,000.00    .06005   03/07/99
008 9992     9657   FANNY M RYEBERG                  300,000.56    .05990   04/22/99  MS
009 9991     9541   JOHN SMITH JR                  1,100,000.00    .05990   04/22/99  MS
011 9989    11225   BARNEY FIFE                      105,529.23    .06250   05/16/99

RUN 05/21/99 12:34:56 00551 L A R G E C D R E P O R T ACCR: 04/26/99 POST: 05/21/99 PAGE 002

CUST NBR CD NBR N A M E BALANCE RATE MATURITY OFC

013 2349    12334   MOMS FINE COUNTRY COOKING        100,000.00    .06005   06/13/99  MS
015 1012    11221   BAKERY AT THE TOWN SQUARE        300,000.00    .05990   06/23/99
016 2344     2899   JILL JENKINS                      75,000.00    .05990   06/25/99  MS
016 4389     8983   JEAN LUC PICARD                  100,000.00    .06250   06/30/99

* * * TOTAL LARGE CD * * * 2,180,529.79

Page Header: the RUN line and the column-heading line at the top of each page
Page Detail: the record lines that follow


Algorithm 4: Infer Group Structure (uv^k w)

a  006 9994    10355   JASON MASON CONSTRUCTION INC     100,000.00    .06005   03/07/99
a  008 9992     9657   FANNY M RYEBERG                  300,000.56    .05990   04/22/99  MS
a  009 9991     9541   JOHN SMITH JR                  1,100,000.00    .05990   04/22/99  MS
a  011 9989    11225   BARNEY FIFE                      105,529.23    .06250   05/16/99
b  * * * TOTAL LARGE CD * * * 1,605,529.79

Line-type sequence = aaaab = a^4 b

u = (empty), v = a, w = b


Inferring Group Structure

Line-type sequence = abcdcdefef
  u = ab, v = cd, w = (empty)
  substitute g (= ab(cd)+) for uv^k w

Line-type sequence = gefef
  u = g, v = ef, w = (empty)
  substitute h (= g(ef)+) for uv^k w

Line-type sequence = h

regular expression: h = g(ef)+ = ab(cd)+(ef)+

a  9999900—MY TOWN BANK
b  --------------------
c  1234567 9001 81196 09-25-92 14.50000 50,860.19 44,428.59
d  TUBBS, BILLY 63 DEMAND 60,825.23 OA 14.50000 62,814.48 14,954-
c  1234569 9002 09644 03-22-93 12.75000 29,817.50 20,386.38
d  JONES, MARIA 66 DEMAND 30,079.41 OA 12.75000 30,079.41 1,261-
e  PART-TOTALS NUMBER 64,814.97 16,216.20-
f  COUNT 2 96,893.89
e  BRANCH TOTALS NUMBER .00 .00
f  COUNT .00
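
The repeated substitution of uv^k w groups can be approximated with a short sketch. It is a simplification, not the authors' Algorithm 4: it works on a sequence of single-character line-type labels, collapses the shortest adjacent repetition it finds into one `(v)+` token, and repeats until nothing more collapses. The function names are hypothetical.

```python
def collapse_once(tokens):
    """Collapse the shortest, leftmost adjacent repetition v^k (k >= 2) into one token."""
    for n in range(1, len(tokens) // 2 + 1):
        for i in range(len(tokens) - 2 * n + 1):
            v = tokens[i:i + n]
            if tokens[i + n:i + 2 * n] != v:
                continue
            end = i + 2 * n
            while tokens[end:end + n] == v:  # absorb further repetitions of v
                end += n
            body = "".join(v)
            group = body + "+" if len(body) == 1 else "(" + body + ")+"
            return tokens[:i] + [group] + tokens[end:]
    return None


def infer_group_structure(line_types):
    """Return a regular expression describing the sequence of line types."""
    tokens = list(line_types)
    while True:
        collapsed = collapse_once(tokens)
        if collapsed is None:
            return "".join(tokens)
        tokens = collapsed


print(infer_group_structure("aaaab"))
print(infer_group_structure("abcdcdefef"))
```

Because a collapsed group becomes a single token, later passes can find repetitions of whole groups, which gives the recursive (nested) structure; for example, `aabaab` collapses to `(a+b)+` under this sketch.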


Experiment

The initial field description lattice and empirical constants were constructed using several inputs:
  Our own experience
  Dozens to hundreds of actual business reports generated by different firms
We tuned our process to handle dozens of reports well


Experimental Results

After tuning, we ran against a real set of 76 reports not previously seen
  7 reports were not type I
  7 reports were too short to be meaningful
  Of 62 remaining reports, our process was successful with 40, unsuccessful with 22
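
The counts above can be cross-checked with a few lines of arithmetic. The success rate is derived here from the reported counts; the slides themselves do not state a percentage.

```python
total_reports = 76
not_type_one = 7
too_short = 7
successful = 40

evaluated = total_reports - not_type_one - too_short  # reports actually scored
unsuccessful = evaluated - successful
success_rate = successful / evaluated

print(f"{evaluated} evaluated, {unsuccessful} unsuccessful, "
      f"{success_rate:.1%} success rate")
```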


Interpretation of Results

Opportunities for improvement:
  Tuning field-description lattice
  Optional fields in a line type
  Optional lines in a uv^k w group
  Clustering algorithm


Future Work

Improved handling of complex type I reports
Type II reports
Long-term goal:
  Inverted index/compressed representation of business reports
  SQL front-end with data mining support


Data Extraction Group

For papers/tools produced by the BYU Data Extraction Group see:

www.deg.byu.edu