Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text...

67
Natural Language Processing (NLP), University of Essex 6 February, 2013 Tutorial on Text Mining for the Going Digital initiative

Transcript of Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text...

Page 1: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

Natural Language Processing (NLP), University of Essex

6 February, 2013

Tutorial on Text Mining

for

the Going Digital initiative

Page 2: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

2

Natural Language Processing (NLP), University of Essex

Topics of This Tutorial

o Information Extraction (IE)

o Examples of IE systems

o GATE (General Architecture for Text Engineering)

• Introduction to GATE developer

• Guided tour of the GATE GUI

• Language Resources, Datastores, Applications

and Processing Resources

• Annotations

• ANNIE Tool , GATE's default IE system

Page 3: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

3

Natural Language Processing (NLP), University of Essex

Information Retrieval vs. Information Extraction

• Information Retrieval

large text collections (Web) Documents

Page 4: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

4

Natural Language Processing (NLP), University of Essex

Information Retrieval vs. Information Extraction

• Getting Facts can be hard and slow.

Page 5: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

5

Natural Language Processing (NLP), University of Essex

Information Retrieval vs. Information Extraction

• Information Extraction

Facts

large text collections

Structured Information

Page 6: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

6

Natural Language Processing (NLP), University of Essex

Named Entity Recognition

o Cornerstone of IE

o Identification of proper names in texts,

o Classification them into a set of predefined categories

Person

Organisation (companies, government organisations,

committees, etc)

Location (cities, countries, rivers, etc)

Date and time expressions

o other types are frequently added, as appropriate to the

Application.

Emperors

Historical events.

Ships, etc.

Page 7: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

7

Natural Language Processing (NLP), University of Essex

Named Entity Recognition

Mark lives in London. He has worked at University of London since 1990.

Person Location

Organisation

Year

Page 8: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

8

Natural Language Processing (NLP), University of Essex

Co-reference

Mark lives in London. He has worked at University of London since 1990.

Person Location

Organisation

Year

Page 9: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

9

Natural Language Processing (NLP), University of Essex

Relations

Mark lives in London. He has worked at University of London since 1990.

Person Location

Organisation

Year

Live_in

Page 10: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

10

Natural Language Processing (NLP), University of Essex

Relations

Mark lives in London. He has worked at University of London since 1990.

Person Location

Organisation

Year

Employee_of

Page 11: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

11

Natural Language Processing (NLP), University of Essex

Relations

Mark lives in London. He has worked at University of London since 1990.

Person Location

Organisation

Year

Based_in

Page 12: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

12

Natural Language Processing (NLP), University of Essex

Examples of IE systems

Page 13: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

Natural Language Processing (NLP), University of Essex

Health and Safety Information Extraction (HaSIE)

13

Page 14: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

14

Natural Language Processing (NLP), University of Essex

Obstetrics records

Page 15: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

15

Natural Language Processing (NLP), University of Essex

Guided tour to GATE GUI

• How to navigate the GATE GUI

• How to set up the different options

• Introduction to resources and parameters

Page 16: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

16

Natural Language Processing (NLP), University of Essex

ANNIE tool

You can try ANNIE tool online at:

• http://services.gate.ac.uk/annie/

Page 17: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

17

Natural Language Processing (NLP), University of Essex

How we will work on tutorial

• Each section in this tutorial will be explained first.

After that enough time will be given to you to try

yourself.

• red texts indicates your time to try experimenting with

GATE.

Page 18: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

18

Natural Language Processing (NLP), University of Essex

Guided tour to GATE GUI

Resources Pane

• Language resources (LRs) : documents

• Corpus: a collection of documents

• Data stores: specialised files where

documents are kept for future use

• Processing resources (PRs) : annotation

tools that operate on text within the

documents

• Applications: groups of processes that

run on one or more documents

Page 19: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

19

Natural Language Processing (NLP), University of Essex

Page 20: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

20

Natural Language Processing (NLP), University of Essex

Parameters

• Applications, LRs, and PRs all have various parameters.

• Parameters enable different settings to be used, e.g.

case sensitivity

• Initialisation Parameters (set at load time): cannot be

changed without reloading.

• Run time Parameters: can be changed between each

application run

Page 21: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

21

Natural Language Processing (NLP), University of Essex

Display Pane

Page 22: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

22

Natural Language Processing (NLP), University of Essex

Display Pane

Example of LRs

Page 23: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

23

Natural Language Processing (NLP), University of Essex

Setting up GATE options

Change the look and

feel of GATE, such as

menu and text fonts

Enables you to adjust settings

such as saving your options,

and saving the session so that

when you reopen GATE, it will

remember and reload the

applications you had open at

the end of your previous

session

Page 24: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

24

Natural Language Processing (NLP), University of Essex

Try out GATE

• Open GATE

Start → All Programs → GATE developer 7.0

• Try setting different options in GATE

Click Options → Configuration → Appearance

• Download the presentation file.

• Download the “Hands-on-materials.zip” file from

https://sites.google.com/site/nlelab2013/gate

Save the zipped file on desktop and unzip it.

The folder “Hands-on-materials” contains all files

required for the next experiments.

Page 25: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

25

Natural Language Processing (NLP), University of Essex

Loading and Viewing Documents

• Loading a document and setting its parameters.

• Creating a corpus .

• Populating a corpus of documents in different

ways

• Removing documents.

Page 26: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

26

Natural Language Processing (NLP), University of Essex

Loading and Viewing Documents

• GATE can process documents in all kinds of formats:

plain text, HTML, XML, PDF, Word etc.

• When GATE loads a document, it converts it into a

special format for processing.

• Documents can be exported in various formats or

saved in a datastore for future processing within GATE.

Page 27: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

27

Natural Language Processing (NLP), University of Essex

Loading Document

Page 28: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

28

Natural Language Processing (NLP), University of Essex

Document Initialisation parameters

The sourceURL parameter enables you to specify the document to be

loaded.

You

can type the filename or URL,

or

click the file browser icon to navigate to the correct document.

Page 29: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

29

Natural Language Processing (NLP), University of Essex

Document Initialisation parameters

You can also just type a string of text into the box by selecting stringContent

rather than sourceUrl.

Page 30: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

30

Natural Language Processing (NLP), University of Essex

Document Initialisation parameters

Set to true to ensure GATE will process any existing annotations such as

HTML tags and present them as annotations rather than leaving them in

the text.

Page 31: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

31

Natural Language Processing (NLP), University of Essex

Opening and closing the document

Page 32: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

32

Natural Language Processing (NLP), University of Essex

Viewing the document

Page 33: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

33

Natural Language Processing (NLP), University of Essex

Experiment 2 – Opening the document in GATE

• load the document “Anna Comnena Alexiad.htm” from “hands-

on-materials” folder.

− right click on Language Resources and select “New → GATE

Document” or

− File menu → New Language Resource → GATE Document

• A dialogue box will appear.

• Leave the name input box empty.

• Click the file browser icon to navigate to the correct document.

• To view a document, double click on the document name in the

Resources pane

• To view the annotations, you first need click “Annotation Sets”,

and then select the relevant set and annotation(s) on the right

hand side of the GUI

• To see a list of annotations at the bottom, click on “Annotations

List”

Page 34: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

34

Natural Language Processing (NLP), University of Essex

Creating a corpus

• A corpus is a collection of documents.

• For most GATE applications, it is easier to work with

a corpus rather than an individual document, even if

that corpus only contains one document.

Page 35: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

35

Natural Language Processing (NLP), University of Essex

Corpus Initialisation Parameters

Click the edit button [add icon]

and add your document to the corpus

Page 36: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

36

Natural Language Processing (NLP), University of Essex

Another way to add documents to a corpus

You can also create an empty corpus and then add documents

to it, if these documents are already loaded in GATE

((1))

Double click on the

corpus

((2))

Press on add icon

Page 37: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

37

Natural Language Processing (NLP), University of Essex

Removing documents from a corpus

((1))

Double click on the

corpus

(3))

Press on delete icon

((2))

Select the document

you want to delete

Page 38: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

38

Natural Language Processing (NLP), University of Essex

Removing documents from a corpus

• To remove documents from a corpus, use the X button

in the corpus editor

Note that this does not remove the document from

GATE, just from the corpus

The document is available to be added to other

corpora. Indeed a document can belong to several

corpora

• If you do remove the document from GATE, it will also

remove it from the corpus

• But if you remove the corpus, it doesn't remove the

document!

Page 39: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

39

Natural Language Processing (NLP), University of Essex

Populate the corpus in one go

• Using the populate function means you don't have to

preload the documents in GATE first, and allows you to

load all the documents into the corpus in one go.

Page 40: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

40

Natural Language Processing (NLP), University of Essex

Populate the corpus in one go

• Extensions parameter: lets you select only documents of a certain

type e.g., “xml” or “htm” files.

• Encoding: lets you choose the right encoding for the documents. The

wrong encoding can cause characters to be incorrectly displayed e.g.,

“UTF-8”

• Recurse directories : load documents in any subdirectories

Page 41: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

41

Natural Language Processing (NLP), University of Essex

Experiment 3: creating a corpus and populate it

• Create a corpus

− Right click Language Resources → New → GATE Corpus.

• A dialogue box will appear.

• Leave a name input box empty and press OK.

• Right click on the created corpus in the Resources pane and

select Populate.

• A dialogue box will appear

• Use the file browser icon to select the name of the directory in

which your documents are stored

• In Extensions parameter ,type “htm” in the box (without the

quotes)

• In Encoding parameter , enter “UTF-8” .

• Press OK

• all the documents will be loaded in one go

• Double click the corpus to view it.

Page 42: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

42

Natural Language Processing (NLP), University of Essex

Preprocessing Resources

• Loading processing resources.

• Managing plugins.

• Loading and running ANNIE and pre-existing applications

• Creating a new application

Page 43: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

43

Natural Language Processing (NLP), University of Essex

Preprocessing Resources

• Processing resources (PRs) are the tools that creates or modifies

annotations on the text .They implement algorithms.

• An application consists of any number of PRs, run sequentially

over a corpus of documents

• A plugin is a collection of one or more PRs, bundled together. For

example, all the PRs needed for IE in Arabic are found in the

Lang_Arabic plugin.

• An application can contain PRs from one or more different plugins.

• In order to access new PRs, you need to load the relevant plugin

Page 44: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

44

Natural Language Processing (NLP), University of Essex

Plugins

To open Plugin Manager:

Page 45: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

45

Natural Language Processing (NLP), University of Essex

Plugins

Page 46: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

46

Natural Language Processing (NLP), University of Essex

Plugins

Page 47: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

47

Natural Language Processing (NLP), University of Essex

Applications

• Loading and running ANNIE and pre-existing

applications

• Creating a new application

Page 48: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

48

Natural Language Processing (NLP), University of Essex

ANNIE Tool

• ANNIE is a readymade collection of PRs that

performs IE on unstructured text.

• To use ANNIE tool:

Page 49: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

49

Natural Language Processing (NLP), University of Essex

Running ANNIE Tool

• Double click on the ANNIE application to view it.

PRs in order of

their execution

Corpus on which

the application is

executed

Runtime

parameters of the

selected PR

Execute the

application

Page 50: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

50

Natural Language Processing (NLP), University of Essex

ANNIE Tool

Page 51: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

51

Natural Language Processing (NLP), University of Essex

Viewing the results

((1))

Double click on the

document to view it ((2))

Select Annotation Lists

and Annotation Sets

((3))

click on any Annotation

types in the Default

(unnamed) set

Page 52: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

52

Natural Language Processing (NLP), University of Essex

Viewing the results

Page 53: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

53

Natural Language Processing (NLP), University of Essex

change the name of the annotation set

• Now we're going to change the name of the annotation set,

so that all ANNIE annotations appear in a new set called

ANNIEresult

• The annotation set where the results are stored is one of the

runtime parameters of the PRs

Page 54: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

54

Natural Language Processing (NLP), University of Essex

change the name of the annotation set

• For each PR listed, click on it

and check whether it has any

parameters labelled

“annotationSetName”,

“inputASName” or

“outputASName”

• Edit all of these by typing

“ANNIEresult” in the box.

Page 55: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

55

Natural Language Processing (NLP), University of Essex

change the name of the annotation set

Page 56: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

56

Natural Language Processing (NLP), University of Essex

Add a new Processing Resources

• Load a plugin called “Tools” using Plugins Manager.

• Right click on “Processing Resources” in the Resources Pane and

select “New” → “ANNIE VP Chunker”

• Double click on ANNIE.

• You'll see the VP chunker in the list of loaded PRs. This means it's

available in GATE, but isn't yet contained in the application.

• Add it to the application by selecting it and using the right arrow to

transfer it.

• Now use the up arrow to move it to the right place in the application. It

should go after (below) the POS tagger but before (above) the NE

transducer.

• Change the inputASName and outputASName parameters to

ANNIEresult.

• Run the application and view the results on the document.

• You should see a new annotation type “VG”.

Page 57: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

57

Natural Language Processing (NLP), University of Essex

Saving documents

• Using datastores

• Saving documents for use outside GATE

Page 58: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

58

Natural Language Processing (NLP), University of Essex

Saving documents

There are 2 types of datastore

• Serial datastores store data directly in a directory

• Lucene datastores provide a searchable repository

with Lucene-based indexing

Page 59: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

59

Natural Language Processing (NLP), University of Essex

Create a new serial datastore

Page 60: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

60

Natural Language Processing (NLP), University of Essex

Create a new serial datastore

• Right click “Datastores” from the Resources pane

and select “Create Datastore”

• Select “Serial Datastore”

• Create a new empty directory by clicking the “Create

New Folder” icon and give your new directory a name

• Select this directory and click “Open”

Now your datastore is ready to store your documents

Page 61: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

61

Natural Language Processing (NLP), University of Essex

Save documents to the datastore

• Right click on your corpus and select “Save to Datastore”

• Select the datastore that you just created

• Now close the corpus and document

• Double click on the name of the datastore in the Resources

pane

• You should see the corpus and document

• Double click on them to load them back into GATE and view

them

Page 62: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

62

Natural Language Processing (NLP), University of Essex

remove things from the datastore

((1))

Double click on the

name of the datastore

((2))

Double click on the

name of corpus or

documents and then

select delete

Page 63: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

63

Natural Language Processing (NLP), University of Essex

Saving documents outside GATE

• If you want to use your documents outside GATE,

you can save them in 2 ways:

as standoff markup, in a special GATE

representation

as inline annotations (preserving the original

format)

• Both formats are XML-based. However “save as xml”

refers to the first option, while “save preserving

format” refers to the second option.

Page 64: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

64

Natural Language Processing (NLP), University of Essex

Saving documents outside GATE

<Person> Ravenna </person>

Page 65: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

65

Natural Language Processing (NLP), University of Essex

Saving as XML

Page 66: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

66

Natural Language Processing (NLP), University of Essex

Save preserving format

Page 67: Tutorial on Text Mining for the Going Digital initiative · 2013. 2. 28. · Tutorial on Text Mining for the Going Digital initiative . 2 Natural Language Processing (NLP), ... •

67

Natural Language Processing (NLP), University of Essex

References

• http://gate.ac.uk/sale/tao/split.html

• Introduction to GATE Developer [PDF document].

Retrieved Web site: http://gate.ac.uk

You can download the program from:

• http://gate.ac.uk/download/