
Introduction to Information Extraction

Chia-Hui Chang

Dept. of Computer Science and Information Engineering, National Central University, Taiwan

chia@csie.ncu.edu.tw

Problem Definition

Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.

Input → extractor → structured output

The output template of the IE task
• Several fields (slots)
• Several instances of a field
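A minimal sketch of such an output template, assuming a simple in-memory representation. The class name `Template` and the slot names `person` and `position` are hypothetical, chosen only for illustration; the slide does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """An IE output template: named slots, each holding zero or more fillers."""
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, slot: str, value: str) -> None:
        # A slot may receive several instances, so values accumulate in a list.
        self.slots.setdefault(slot, []).append(value)


template = Template()
template.add("person", "John Rollwagen")
template.add("position", "chairman")
template.add("position", "CEO")  # second instance of the same field
print(template.slots)
```

Storing a list per slot keeps "several instances of a field" explicit, and an absent key naturally represents a field that was never filled.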

The difficulty of an IE task depends on:

• Text type: from plain text to semi-structured Web pages, e.g. Wall Street Journal articles, email messages, or HTML documents
• Domain: from financial news or tourist information to texts in various languages
• Scenario

Various IE Tasks

• Free-text IE: for MUC (Message Understanding Conference), e.g. terrorist activities, corporate joint ventures
• Semi-structured IE: e.g. meta-search engines, shopping agents, bio-integration systems

Types of IE from MUC

• Named Entity recognition (NE): finds and classifies names, places, etc.
• Coreference Resolution (CO): identifies identity relations between entities in texts.
• Template Element construction (TE): adds descriptive information to NE results.
• Scenario Template production (ST): fits TE results into specified event scenarios.

Named Entity Recognition

http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html

NE Recognition (Cont.)

• Spanish: 93%
• Japanese: 92%
• Chinese: 84.51%

Coreference Resolution

Coreference resolution (CO) involves identifying identity relations between entities in texts. For example, in

Alas, poor Yorick, I knew him well.

CO ties "Yorick" to "him". The Sheffield system scored 51% recall and 71% precision.

http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html

Template Element Production

• Adds descriptions to named entities
• The Sheffield system scored 71%

Scenario Template Extraction

• STs are the prototypical outputs of IE systems.
• They tie together TE entities into event and relation descriptions.
• Performance for Sheffield: 49%

http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html

Example

The users' interests center on operational domains such as drug enforcement, money laundering, organized crime, terrorism, ….

1. Input: texts dealing with drug enforcement, money laundering, organized crime, terrorism, and legislation;

2. NE: recognizes entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, . . . );

3. TE: associates certain types of descriptive information with these entities, e.g. the location of companies;

4. ST: identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.
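The four steps above can be sketched as a toy pipeline. This is an illustration under strong assumptions, not how any MUC system works: the gazetteer lookup, the trigger-word test, and every name in `GAZETTEER` and `LOCATIONS` are made up for the example.

```python
# Toy knowledge sources: every name and category here is invented for illustration.
GAZETTEER = {"Acme Corp": "company", "Jane Doe": "person"}
LOCATIONS = {"Acme Corp": "Springfield"}


def named_entities(text):
    """NE step: find known names and assign each a category."""
    return [(name, cat) for name, cat in GAZETTEER.items() if name in text]


def template_elements(entities):
    """TE step: attach descriptive information (here, a location) to entities."""
    return [
        {"name": name, "category": cat, "location": LOCATIONS.get(name)}
        for name, cat in entities
    ]


def scenario_templates(text, elements):
    """ST step: tie entities into an event when a trigger word is present."""
    if "laundering" not in text:
        return []
    participants = [element["name"] for element in elements]
    return [{"event": "money laundering", "participants": participants}]


# Step 1 (input): a text from one of the domains of interest.
text = "Jane Doe of Acme Corp was charged with money laundering."
elements = template_elements(named_entities(text))
events = scenario_templates(text, elements)
print(elements)
print(events)
```

Each stage consumes the previous stage's output, which mirrors how ST results depend on TE results, and TE results on NE results.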

Example Text

Output Example (NE, TE)

Output (STs)

Another IE Example: Corporate Management Changes

Purpose
• Which positions in which organizations are changing hands?
• Who is leaving a position, and where is the person going?
• Who is appointed to a position, and where is the person coming from?
• The locations and types of the organizations involved in the succession events
• The names and titles of the persons involved in the succession events

http://www.cs.umanitoba.ca/~lindek/ie-ex.htm

Input Text

President Clinton nominated John Rollwagen, the chairman and CEO of Cray Research Inc., as the No. 2 Commerce Department official. Mr. Rollwagen said he wants to push the Clinton administration to aggressively confront U.S. trading partners such as Japan to open their markets, particularly for high-tech industries. In a letter sent throughout the Eagan, Minn.-based company on Friday, Mr. Rollwagen warned: "Whether we like it or not, our country is in an economic war; and we are at a key turning point in that war." ......

Cray said it has appointed John F. Carlson, its president and chief operating officer, to succeed him. ......

Extraction Result

Corporate Management Database

Person            Organization         Position   Transition
John Rollwagen    Cray Research Inc.   chairman   out
John Rollwagen    Cray Research Inc.   CEO        out
John F. Carlson   Cray Research Inc.   chairman   in
John F. Carlson   Cray Research Inc.   CEO        in

Organization Database

Name                  Location       Alias   Type
Cray Research Inc.    Eagan, Minn.   Cray    COMPANY
Commerce Department                          GOVERNMENT
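As a rough illustration of how the Person, Organization, and Position columns could be filled from the first sentence of the input text, here is a hand-written pattern for the phrase shape "<person>, the <positions> of <organization> Inc.". The pattern and the function name `extract_positions` are assumptions made for this sketch; it does not decide the Transition column, which needs the succession context ("nominated", "to succeed him") and coreference resolution.

```python
import re

# Hypothetical pattern for "<Person>, the <positions> of <Organization> Inc."
APPOSITION = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z]\.)? [A-Z][a-z]+), the "
    r"(?P<positions>[\w ]+?) of "
    r"(?P<org>[A-Z][\w ]*? Inc\.)"
)


def extract_positions(text):
    """Return (person, organization, position) rows found in the text."""
    rows = []
    for match in APPOSITION.finditer(text):
        # "chairman and CEO" fills the position slot twice.
        for position in match.group("positions").split(" and "):
            rows.append((match.group("person"), match.group("org"), position))
    return rows


text = (
    "President Clinton nominated John Rollwagen, the chairman and CEO of "
    "Cray Research Inc., as the No. 2 Commerce Department official."
)
rows = extract_positions(text)
for row in rows:
    print(row)
```

A single pattern like this is brittle: the second sentence of the input ("its president and chief operating officer, to succeed him") already falls outside it, which is one reason free-text IE relies on fuller linguistic analysis or learned rules.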

MUC Data Sets

• MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
• MUC3&4: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
• MUC6&7: from LDC, http://www.ldc.upenn.edu/
• MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
• MUC-7: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html

Summary

Evaluation
• Precision = (# of correctly extracted fields) / (# of extracted fields)
• Recall = (# of correctly extracted fields) / (# of fields to be extracted)

Design Methodology for Text IE
• Natural Language Processing
• Machine Learning
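The precision and recall definitions above can be computed as follows. This is a minimal sketch that assumes each field is represented as a (slot, value) pair and that an extracted field counts as correct only on an exact match; real MUC scoring is more elaborate. The function name and the sample fields are hypothetical.

```python
def precision_recall(extracted, expected):
    """Compute field-level precision and recall from two collections of fields."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    # Guard against division by zero when nothing was extracted or expected.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


extracted = [
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("position", "president"),  # a wrong extraction
]
expected = [
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("position", "CEO"),
    ("organization", "Cray Research Inc."),
]
precision, recall = precision_recall(extracted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three extracted fields are correct, and two of the four expected fields were found.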

IE from Web pages

Output Template: k-tuple
• Multiple instances of a field
• Missing data

Web data extraction

Various Web pages
• Multiple-record page extraction
• One-record (singular) page extraction

Multiple-record page extraction

One-record (singular) page extraction

Applications

• Information integration
• Meta Search Engines
• Shopping agents
• Travel agents

Information Integration Systems

[Figure: layered architecture of an information integration service. Labels, from the bottom layer up: heterogeneous data sources (text, images/video, spreadsheets; hierarchical & network databases; relational databases; object & knowledge bases); translation and wrapping (SQL, ORB, wrappers); semantic integration and mediation (mediators); user services (query, monitor, update) with agent/module coordination; human & computer users. A side axis runs from unprocessed, unintegrated details to abstracted information.]

Web Wrappers

What is a wrapper?
• A program that extracts the desired information from Web pages.
• Web pages → wrapper → structured information

Web wrappers wrap...
• "Query-able" or "search-able" Web sites
• Web pages with large itemized lists
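A minimal hand-written wrapper for a page with an itemized list might look like the sketch below. It assumes a made-up page in which each `<li>` holds one record of the form "name price"; the class name `ListWrapper` and the sample HTML are hypothetical. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class ListWrapper(HTMLParser):
    """Collects the text of every <li> element as one record."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True
            self.records.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the current item.
        if self._in_item:
            self.records[-1] += data


page = "<ul><li><b>Widget</b> $9.99</li><li><b>Gadget</b> $19.50</li></ul>"
wrapper = ListWrapper()
wrapper.feed(page)
# Split each record into a (name, price) tuple on the last space.
records = [tuple(record.rsplit(" ", 1)) for record in wrapper.records]
print(records)
```

A wrapper like this is tied to one page layout, which is the motivation for systems that generate wrappers automatically.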

Summary

Evaluation
• Precision = (# of correctly extracted records) / (# of extracted records)
• Recall = (# of correctly extracted records) / (# of records to be extracted)

Methodology for Web IE
• Programming package
• Machine Learning
• Pattern Mining

Type III: News Group IE

Example: Computer-Related Jobs

Output Template

Between free-text IE and semi-structured IE [CaliffRapier 99]

Wrapper Induction Systems

Wrapper induction (WI) systems, also called information extraction (IE) systems, are software designed to generate wrappers.

Taxonomy of Web IE systems by

Task domain
• free text vs semi-structured pages

Automation degree
• supervised vs unsupervised

Techniques applied
• machine learning vs pattern mining

Task Domain

• Document type
• Extraction level: field-level, record-level, page-level
• Extraction target variation: missing attributes, multi-valued attributes, multi-order attribute permutations, nested data objects
• Template variation: various templates for an attribute, common templates for various attributes
• Untokenized attributes

Automation Degree

• Page-fetching support
• Annotation requirement
• Output support
• API support

Techniques Applied

• Scan passes
• Extraction rule types
• Learning algorithms
• Tokenization schemes
• Features used

Conclusion

• Define the IE problem
• Specify the input: training examples with annotation, or without annotation
• Depict the extraction rule
• Use necessary background knowledge

References

• H. Cunningham, Information Extraction – a User Guide, http://www.dcs.shef.ac.uk
• MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
• I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
• Califf, Relational Learning of Pattern-Matching Rule for Information Extraction, AAAI-99.
