Text mining in digital collections

CHASE: Going digital

Deirdre Lungley
dmlung@essex.ac.uk

February 6, 2013

Introduction

Bridgeman Digital Art Library
Bridgeman Categories
Sample Classification Data

Bridgeman Categories

ID      Category
2       Oriental Miniatures
7       Maps
9       Posters
12      Arms, Armour & Militaria
15      Botanical
18      Clocks, Watches, Barometers & Sundials
20      Costume & Fashion
21      Enamels
22      Ephemera
24      Furniture
25      Glass
27      Icons
29      Inventions
30      Jewellery (see also Semi-precious stones)
31      Juvenilia / Children's Toys & Games
33      Lighting
35      Medicine
38      Mythology Mythological Myth
40      Animals
41      Mosaics
44      Semi-precious Stones (see also Jewellery)
46      Science
47      Sculpture
51      Sports and Leisure
56      Trade Emblems, City Crests, Coats of Arms
1126    CHOIR BOOKS
5000    The Arts and Entertainment
5001    Ancient and World Cultures
5002    Architecture
5003    Business and Industry
5004    Places
5005    Science and Medicine
5006    History
5007    Religion and Belief
5010    Travel and Transport
5011    Plants and Animals
5013    Emotions and Ideas

Sample Classification Data

Query/Clicked URL Gold Standard Annotations Classifier Predictions

monster woman 5007 : Religion and Belief 5007 : Religion and Belief

Dulle Griet raiding Hell 5 : Allegory / Allegorical

38 : Mythology Mythological Myth

nuno 5007 : Religion and Belief 5007 : Religion and Belief

The Fishermen from the Polyptych of St. Vincent 42 : Personalities 5012 : Land and Sea

42 : Personalities

girl poor 5009 : People and Society 5009 : People and Society

A Peasant Girl Gathering Faggots in a Wood 5012 : Land and Sea

Text Mining

Python & NLTK
Web Services
Sample Code (1) – Wikify text

Tools of the trade

Python:
    High level language
    Many standard libraries, e.g., XML parser

Natural Language Toolkit (NLTK):
    A platform for building Python programs to work with human language data (nltk.org)

Why?
    Glue between applications
    Data preparation for tools such as Weka
    Allows programmatic access to web services

Example Web Service – WikipediaMiner

Sample Python XML parsing – Wikify RSS title
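
The code shown on this slide is not preserved in the transcript. As a rough sketch of what such a step might look like, the following uses only the Python standard library to pull item titles out of an RSS document and build a request URL for a wikify-style web service. The endpoint address, the `source` parameter name, the sample feed and the function names are all assumptions made for illustration, not the actual WikipediaMiner API.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint and parameter name, used only for illustration.
WIKIFY_ENDPOINT = "http://example.org/services/wikify"

# Made-up RSS 2.0 document standing in for a real feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>Flemish painting exhibition opens</title></item>
    <item><title>New map collection digitised</title></item>
  </channel>
</rss>"""


def item_titles(rss_text):
    """Return the text of every <item>/<title> element in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [t.text for t in root.findall("./channel/item/title") if t.text]


def wikify_url(text):
    """Build the GET URL that would send `text` to the assumed wikify service."""
    return WIKIFY_ENDPOINT + "?" + urlencode({"source": text})


titles = item_titles(SAMPLE_RSS)
urls = [wikify_url(title) for title in titles]
for url in urls:
    print(url)
```

Sending the request (for example with `urllib.request.urlopen`) and parsing the reply are left out, because the response format depends on the service.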

Sample Python XML parsing – Wikify RSS title (Output)

Classification

Supervised Learning - Basics
Sample Code (2) – Classify BBC RSS Feeds

Supervised Learning - Basics

Classifier (Model) built from:
    Positive/Negative examples (labelled data)
    Features - present/absent for a given label

Test data built using:
    Present/absent classifier features

Case Study - Support Vector Machine (SVM) Classifier:
    Locates the marginal points closest to the separating hyperplane - the support vectors
    Used extensively in research
    Here – treat as black box – default settings

SVMLight data format:
    <target> <feature>:<value> ... <feature>:<value>
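
To make the data format concrete, here is a small sketch that turns tokenised examples into SVMLight lines with binary (present/absent) features. The function names and the toy examples are illustrative assumptions. It relies on two properties of the format: feature ids are positive integers, and the pairs on a line must appear in increasing id order.

```python
def build_vocabulary(documents):
    """Map each distinct token to a positive integer feature id, starting at 1."""
    vocab = {}
    for tokens in documents:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_svmlight(label, tokens, vocab):
    """Format one example as '<target> <feature>:<value> ...' using value 1 for 'present'."""
    # Tokens missing from the vocabulary are skipped; ids are sorted as the format requires.
    ids = sorted({vocab[t] for t in tokens if t in vocab})
    return " ".join([f"{label:+d}"] + [f"{i}:1" for i in ids])


# Toy labelled data: +1 for positive examples, -1 for negative ones.
positive = [["art", "history"]]
negative = [["football", "score"]]
vocab = build_vocabulary(positive + negative)

pos_line = to_svmlight(+1, ["history", "art"], vocab)
neg_line = to_svmlight(-1, ["score", "football", "unseen"], vocab)
print(pos_line)
print(neg_line)
```

Absent features are simply omitted from the line, which keeps the file sparse.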

Training: Training Examples (Project Gutenberg Catalogue) → Feature Extractor → Pos/Neg labelled feature sets → Learning tool (SVM_Learn) → Classifier model

Testing: Test Examples (BBC RSS Feed) → Feature Extractor → Test feature sets → Classifier model (SVM_Classify) → Predictions

Training Data – Project Gutenberg

Case Study Task: Classify BBC RSS feeds

Retrieve & parse BBC RSS feed

Create Classification Features
    Casefolding
    Tokenisation
    Stemming
    Stopwords

Classify (test data → predictions)
    Output to file on disk
    Call command
    Read file
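
The last step (output to file, call command, read file) could be sketched as below. This assumes the SVMlight `svm_classify` binary is on the PATH and that a model file has already been produced by `svm_learn`. The function name, the file names and the injectable `run` parameter are illustrative choices, not taken from the slides.

```python
import subprocess
from pathlib import Path


def classify(test_lines, model_path, workdir, run=subprocess.run):
    """Write test examples to disk, call svm_classify, and read back one score per example."""
    workdir = Path(workdir)
    test_path = workdir / "test.dat"
    pred_path = workdir / "predictions.txt"

    # 1. Output to file on disk (one SVMLight-formatted example per line).
    test_path.write_text("\n".join(test_lines) + "\n")

    # 2. Call command: svm_classify example_file model_file output_file
    run(["svm_classify", str(test_path), str(model_path), str(pred_path)], check=True)

    # 3. Read file: one decision value per line; its sign gives the predicted class.
    return [float(value) for value in pred_path.read_text().split()]
```

Passing the runner as a parameter keeps the file handling testable on a machine where SVMlight is not installed.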

Retrieve & parse RSS feed
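
The slide's code is not preserved here. A minimal standard-library sketch of this step might look as follows. The feed address is an assumption (substitute whichever BBC feed is of interest), and the function names are illustrative.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed feed address, not taken from the slides.
FEED_URL = "http://feeds.bbci.co.uk/news/rss.xml"


def fetch(url=FEED_URL, timeout=10):
    """Download the raw feed bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_items(rss_bytes):
    """Return a list of (title, description) pairs, one per <item> element."""
    root = ET.fromstring(rss_bytes)
    return [
        (item.findtext("title", default=""), item.findtext("description", default=""))
        for item in root.iter("item")
    ]
```

A typical call would be `parse_items(fetch())`. Keeping download and parsing separate means the parser can be exercised on a saved copy of the feed.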

Retrieve & parse RSS feed (Output)

Text to Features
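
As an illustration of the four preparation steps (casefolding, tokenisation, stopword removal, stemming), here is a dependency-free sketch. The stopword list is a tiny made-up sample, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer such as NLTK's `PorterStemmer`. Neither is taken from the original slides.

```python
import re

# A tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def crude_stem(token):
    """Strip a few common English suffixes. A stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_to_features(text):
    """Casefold, tokenise, drop stopwords, then stem; return the set of features present."""
    tokens = re.findall(r"[a-z]+", text.casefold())
    return {crude_stem(t) for t in tokens if t not in STOPWORDS}


features = text_to_features("The Fishermen painted maps of the harbour")
print(sorted(features))
```

Stopwords are filtered before stemming here so that the list can be written in ordinary word forms. The resulting set is exactly the "present" features for one example.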

Text to Features (Output)

Classify: Test data → predictions (Output)

Training Data – Project Gutenberg

Create training data (Output)

References:

The Regex Coach

Thank You!

Deirdre Lungley