TextMarker user guide

TextMarker Guide and Reference Written and maintained by the Apache UIMA™ Development Community Version 2.4.1-SNAPSHOT



Copyright © 2006, 2012 The Apache Software Foundation

License and Disclaimer. The ASF licenses this documentation to you under the Apache License, Version 2.0 (the "License"); you may not use this documentation except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, this documentation and its contents are distributed under the License on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Trademarks. All terms mentioned in the text that are known to be trademarks or service marks have been appropriately capitalized. Use of such terms in this book should not be regarded as affecting the validity of the trademark or service mark.

Publication date ######, 2012

Table of Contents

1. TextMarker
   1.1. Introduction
   1.2. Core Concepts
   1.3. Examples
   1.4. Special Features
   1.5. Get started
      1.5.1. Up and running
      1.5.2. Learn by example
      1.5.3. Do it yourself
2. TextMarker Language
   2.1. Basic Annotations and tokens
   2.2. Syntax
   2.3. Syntax
   2.4. Declarations
      2.4.1. Type
      2.4.2. Variable
      2.4.3. Resources
      2.4.4. Scripts
      2.4.5. Components
   2.5. Quantifiers
      2.5.1. * Star Greedy
      2.5.2. *? Star Reluctant
      2.5.3. + Plus Greedy
      2.5.4. +? Plus Reluctant
      2.5.5. ? Question Greedy
      2.5.6. ?? Question Reluctant
      2.5.7. [x,y] Min Max Greedy
      2.5.8. [x,y]? Min Max Reluctant
   2.6. Conditions
      2.6.1. AFTER
      2.6.2. AND
      2.6.3. BEFORE
      2.6.4. CONTAINS
      2.6.5. CONTEXTCOUNT
      2.6.6. COUNT
      2.6.7. CURRENTCOUNT
      2.6.8. ENDSWITH
      2.6.9. FEATURE
      2.6.10. IF
      2.6.11. INLIST
      2.6.12. IS
      2.6.13. LAST
      2.6.14. MOFN
      2.6.15. NEAR
      2.6.16. NOT
      2.6.17. OR
      2.6.18. PARSE
      2.6.19. PARTOF
      2.6.20. PARTOFNEQ
      2.6.21. POSITION
      2.6.22. REGEXP
      2.6.23. SCORE
      2.6.24. SIZE
      2.6.25. STARTSWITH
      2.6.26. TOTALCOUNT
      2.6.27. VOTE
   2.7. Actions
      2.7.1. ADD
      2.7.2. ASSIGN
      2.7.3. CALL
      2.7.4. CLEAR
      2.7.5. COLOR
      2.7.6. CONFIGURE
      2.7.7. CREATE
      2.7.8. DEL
      2.7.9. DYNAMICANCHORING
      2.7.10. EXEC
      2.7.11. FILL
      2.7.12. FILTERTYPE
      2.7.13. GATHER
      2.7.14. GET
      2.7.15. GETFEATURE
      2.7.16. GETLIST
      2.7.17. LOG
      2.7.18. MARK
      2.7.19. MARKFAST
      2.7.20. MARKLAST
      2.7.21. MARKONCE
      2.7.22. MARKSCORE
      2.7.23. MARKTABLE
      2.7.24. MATCHEDTEXT
      2.7.25. MERGE
      2.7.26. REMOVE
      2.7.27. REMOVEDUPLICATE
      2.7.28. REPLACE
      2.7.29. RETAINTYPE
      2.7.30. SETFEATURE
      2.7.31. TRANSFER
      2.7.32. TRIE
      2.7.33. UNMARK
      2.7.34. UNMARKALL
   2.8. Expressions
      2.8.1. Type Expressions
      2.8.2. Number Expressions
      2.8.3. String Expressions
      2.8.4. Boolean Expressions
   2.9. Robust extraction using filtering
   2.10. Blocks
   2.11. Heuristic extraction using scoring rules
   2.12. Modification
3. TextMarker Workbench
   3.1. Installation
   3.2. TextMarker Projects
   3.3. Explanation
   3.4. Dictionaries
   3.5. Parameters
   3.6. Query
   3.7. Views
      3.7.1. Annotation Browser
      3.7.2. Annotation Editor
      3.7.3. Marker Palette
      3.7.4. Selection
      3.7.5. Basic Stream
      3.7.6. Applied Rules
      3.7.7. Selected Rules
      3.7.8. Rule List
      3.7.9. Matched Rules
      3.7.10. Failed Rules
      3.7.11. Rule Elements
      3.7.12. Statistics
      3.7.13. False Positive
      3.7.14. False Negative
      3.7.15. True Positive
   3.8. Testing
      3.8.1. Overview
      3.8.2. Usage
      3.8.3. Evaluators
   3.9. TextRuler
      3.9.1. Available Learners


Chapter 1. TextMarker

The TextMarker system is an open source tool for the development of rule-based information extraction applications. The development environment is based on the DLTK framework. It supports the knowledge engineer with a full-featured rule editor, components for the explanation of the rule inference, and a build process for generic UIMA Analysis Engines and Type Systems. TextMarker components can therefore be created easily and combined flexibly with other UIMA components in different information extraction pipelines.

TextMarker applies a specialized rule representation language for effective knowledge formalization: the rules of the TextMarker language are composed of a list of rule elements that themselves consist of four parts. The mandatory matching condition establishes a connection to the input document by referring to an already existing concept, respectively annotation. The optional quantifier defines the usage of the matching condition, similar to regular expressions. Additional conditions add constraints to the matched text fragment, and additional actions determine the consequences of the rule. TextMarker rules therefore match on a pattern of given annotations and, if the additional conditions evaluate to true, execute their actions, e.g., create a new annotation. If no initial annotations exist, for example, created by another component, a scanner is used to seed simple token annotations contained in a taxonomy.

The TextMarker system provides unique functionality that is usually not found in similar systems. The actions are able to modify the document, either by replacing or deleting text fragments or by filtering the view on the document. In the latter case, the rules ignore some annotations, e.g., HTML markup, or are executed only on the remaining text passages. The knowledge engineer is able to add heuristic knowledge by using scoring rules. Additionally, several language elements common to scripting languages, like conditioned statements, loops, procedures, recursion, variables and expressions, increase the expressiveness of the language. Rules are able to directly invoke external rule sets or arbitrary UIMA Analysis Engines, and foreign libraries can be integrated with the extension mechanism for new language elements.

1.1. Introduction

In manual information extraction, humans often apply a strategy according to a highlighter metaphor: first, relevant headlines are considered and classified according to their content by coloring them with different highlighters. The paragraphs of the annotated headlines are then considered further. Relevant text fragments or single words in the context of that headline can then be colored. In this way, a top-down analysis and extraction strategy is implemented. Necessary additional information can then be added that either refers to other text segments or contains valuable domain specific information. Finally, the colored text can easily be analyzed concerning the relevant information. The TextMarker system ("Textmarker" is a common German word for a highlighter) tries to imitate this manual extraction method by formalizing the appropriate actions using matching rules: the rules mark sequences of words, extract text segments or modify the input document depending on textual features. The default input for the TextMarker system is semi-structured text, but it can also process structured or free text. Technically, HTML is often the input format, since most word processing documents can be converted to HTML. Additionally, the TextMarker system offers the possibility to create a modified output document.

1.2. Core Concepts

As a first step in the extraction process, the TextMarker system uses a tokenizer (scanner) to tokenize the input document and to create a stream of basic symbols. The types and valid annotations of the possible tokens are predefined by a taxonomy of annotation types. Annotations simply refer to a section of the input document and assign a type or concept to the respective text fragment. The figure on the right shows an excerpt of a basic annotation taxonomy: CW describes


all tokens that consist of a single word starting with a capital letter, for example; MARKUP corresponds to HTML or XML tags; and PM refers to all kinds of punctuation marks. Take a look at [basic annotations|BasicAnnotationList] for a complete list of initial annotations.

By using (and extending) the taxonomy, the knowledge engineer is able to choose the most adequate types and concepts when defining new matching rules, i.e., TextMarker rules for matching a text fragment given by a set of symbols to an annotation. If the capitalization of a word, for example, is of no importance, then the annotation type W that describes words of any kind can be used. The initial scanner creates a set of basic annotations that may be used by the matching rules of the TextMarker language. However, most information extraction applications require domain specific concepts and annotations. Therefore, the knowledge engineer is able to extend the set of annotations and to define new annotation types tuned to the requirements of the given domain. These types can be flexibly integrated in the taxonomy of annotation types. One of the goals in developing a new information extraction language was to maintain an easily readable syntax while still providing a scalable expressiveness of the language. Basically, the TextMarker language contains expressions for the definition of new annotation types and for defining new matching rules. The rules are defined by a list of rule elements. Each rule element contains at least a basic matching condition referring to text fragments or already specified annotations. Additionally, a list of conditions and actions may be specified for a rule element. Whereas the conditions describe necessary attributes of the matched text fragment, the actions point to operations and assignments on the current fragments. These actions will then only be executed if all basic conditions matched on a text fragment or the annotation and the related conditions are fulfilled.
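As a hedged sketch of this extension mechanism (the type name Month and the list file months.txt are hypothetical, not part of the guide), a knowledge engineer might declare a new domain-specific type and use it directly in a matching rule:

DECLARE Month;                             // a new domain-specific annotation type
CW{INLIST('months.txt') -> MARK(Month)};   // annotate capitalized words listed in months.txt

Once declared, the new type can be used in further rules like any of the predefined annotation types.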

1.3. Examples

The usage of the language and its readability can be demonstrated by simple examples:

CW{INLIST('animals.txt') -> MARK(Animal)};
Animal "and" Animal{-> MARK(Animalpair, 1, 2, 3)};

The first rule looks at all capitalized words that are listed in an external document animals.txt and creates a new annotation of the type Animal using the boundaries of the matched word. The second rule searches for an annotation of the type Animal followed by the literal "and" and a second Animal annotation. Then it will create a new annotation Animalpair covering the text segment that matched the three rule elements (the digit parameters refer to the numbers of the matched rule elements).

Document{-> MARKFAST(Firstname, 'firstnames.txt')};
Firstname CW{-> MARK(Lastname)};
Paragraph{VOTE(Firstname, Lastname) -> LOG("Found more Firstnames than Lastnames")};

In this example, the first rule annotates all words that occur in the external document firstnames.txt with the type Firstname. The second rule creates a Lastname annotation for all capitalized words that follow a Firstname annotation. The last rule finally processes all Paragraph annotations. If the


VOTE condition counts more Firstname than Lastname annotations, then the rule writes a log entry with a predefined message.

ANY+{PARTOF(Paragraph), CONTAINS(Delete, 50, 100, true) -> MARK(Delete)};
Firstname{-> MARK(Delete, 1, 2)} Lastname;
Delete{-> DEL};

Here, the first rule looks for sequences of any kind of tokens except markup and creates one annotation of the type Delete for each sequence, if the tokens are part of a Paragraph annotation and together already contain more than 50% Delete annotations. The + sign indicates this greedy processing. The second rule annotates first names followed by last names with the type Delete, and the third rule simply deletes all text segments that are associated with that Delete annotation.

1.4. Special Features

The TextMarker language features some special characteristics that are usually not found in other rule-based information extraction systems, or that even shift it towards scripting languages. The possibility of creating new annotation types and integrating them into the taxonomy facilitates an even more modular development of information extraction systems. Read more about robust extraction using filtering, complex control structures and heuristic extraction using scoring rules.

1.5. Get started

This section gives you a short, technical introduction on how to get started with the TextMarker system and mostly just links to the information of the other wiki pages. Some knowledge about the usage of Eclipse and central concepts of UIMA is useful. TextMarker consists of the TextMarker rule language (and of course the rule inference) and the TextMarker Workbench. Additionally, the CEV plugin is used to edit and visualize annotated text. The TextRuler system with implementations of well-known rule learning methods and development extensions with support for test-driven development are already integrated.

1.5.1. Up and running

First of all, install the Workbench and read the introduction and its examples. In order to verify that the Workbench is correctly installed, take a look at Help → About Eclipse → Installation Details and compare the installed plugins with the plugins you copied into the plugins folder of your Eclipse application. Normally most of the plugins do not cause any trouble, but the CEV does because of its XPCom and XULRunner dependencies. You should at least get the XPCom plugin up and running. However, you cannot use the additional HTML functionality without the XULRunner plugin. If the plugins of the installation guide do not work properly and a Google search for a suitable plugin is not successful, then write a mail to the user list and we will try to solve the problem. If all plugins are correctly installed, then start the Eclipse application and switch to the TextMarker perspective (Window → Open Perspective → Other...).

1.5.2. Learn by example

Having a running Workbench, download the example project and import/copy this TextMarker project into your workspace. The project contains some simple rules for extracting the author, title and year of reference strings. Next, take a look at the project structure and the syntax, and compare it with the example project and its contents. Open the Main.tm TextMarker script in the folder


script/de.uniwue.example and press the Run button in the Eclipse toolbar. The documents in the input folder will then be processed by the Main.tm file, and the result of the information extraction task is placed in the output folder. As you can see, there are four files: an xmiCAS for each input file and an HTML file (the modified/colored result). Open one of the .xmi files with the CAS Editor plugin (popup menu → Open With) and select some checkboxes in the Annotation Browser view.

1.5.3. Do it yourself

Try to write some rules yourself. Read the description of the available language constructs, e.g., conditions and actions, and use the explanation component in order to take a closer look at the rule inference. Then, finally, read the rest of this document.


Chapter 2. TextMarker Language

2.1. Basic Annotations and tokens

The TextMarker system uses a JFlex lexer to initially create a seed of basic token annotations.
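As a rough illustration of this seeding (the exact set of basic types is introduced later; SW and NUM are assumed here to denote lowercase words and numbers, alongside the CW and PM types mentioned in the introduction), a short input might be partitioned as follows:

Input:  "Peter was born in 1982 ."
Seeds:    CW   SW  SW   SW NUM PM

Each basic annotation covers exactly one element of this partition, and rules later match against these seeds or against annotations built on top of them.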

2.2. Syntax

Structure

script             -> packageDeclaration globalStatements statements
packageDeclaration -> "PACKAGE" DottedIdentifier ";"
globalStatements   -> globalStatement*
globalStatement    -> ("TYPESYSTEM" | "SCRIPT" | "ENGINE") DottedIdentifier ";"
statements         -> statement*
statement          -> typeDeclaration | resourceDeclaration | variableDeclaration
                    | blockDeclaration | simpleStatement
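A minimal script conforming to this structure might look as follows (the imported type system name and the declared type are hypothetical; the package name is taken from the example project used later in this guide):

PACKAGE de.uniwue.example;

TYPESYSTEM my.package.SomeTypeSystem;   // a global statement (hypothetical import)

DECLARE Headline;                       // a statement: type declaration
CW{-> MARK(Headline)};                  // a statement: simple rule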

Declarations

typeDeclaration     -> "DECLARE" (AnnotationType)? Identifier ("," Identifier)*
                     | "DECLARE" AnnotationType Identifier ("(" featureDeclaration ")")?
featureDeclaration  -> ((AnnotationType | "STRING" | "INT" | "DOUBLE" | "BOOLEAN") Identifier)+
resourceDeclaration -> ("WORDLIST" Identifier "=" listExpression
                     | "WORDTABLE" Identifier "=" tableExpression) ";"
variableDeclaration -> ("TYPE" | "STRING" | "INT" | "DOUBLE" | "BOOLEAN") Identifier ";"

More information about Declarations.

Statements

blockDeclaration       -> "BLOCK" "(" Identifier ")" ruleElementWithType "{" statements "}"
simpleStatement        -> ruleElements ";"
ruleElements           -> (ruleElementWithLiteral | ruleElementWithType)+
ruleElementWithLiteral -> simpleStringExpression quantifierPart? conditionActionPart?
ruleElementWithType    -> typeExpression quantifierPart? conditionActionPart?
quantifierPart         -> "*" | "*?" | "+" | "+?" | "?" | "??"
                        | "[" numberExpression "," numberExpression "]"
                        | "[" numberExpression "," numberExpression "]?"
conditionActionPart    -> "{" (condition ("," condition)*)? ("->" (action ("," action)*))? "}"
condition              -> ConditionName ("(" argument ("," argument)* ")")?
action                 -> ActionName ("(" argument ("," argument)* ")")?
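For example, a block declaration conforming to this grammar could look as follows (the block name and the inner rule are illustrative, and TypeA stands for some previously declared type):

BLOCK(name) Document{} {
    CW{-> MARK(TypeA)};   // statements inside the block
}

Note the empty condition-action part Document{} on the rule element of the block, whose curly brackets are mandatory.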

More information about Quantifiers, Conditions, Actions and Blocks. The ruleElementWithType of a BLOCK declaration must have opening and closing curly brackets (e.g., BLOCK(name) Document{} {...}).

Expressions


argument                 -> typeExpression | numberExpression | stringExpression | booleanExpression
typeExpression           -> AnnotationType | TypeVariable
numberExpression         -> additiveExpression
additiveExpression       -> multiplicativeExpression
multiplicativeExpression -> simpleNumberExpression (("*" | "/" | "%") simpleNumberExpression)*
                          | ("EXP" | "LOGN" | "SIN" | "COS" | "TAN") numberExpressionInPar
numberExpressionInPar    -> "(" additiveExpression ")"
simpleNumberExpression   -> "-"? (DecimalLiteral | FloatingPointLiteral | NumberVariable)
                          | numberExpressionInPar
stringExpression         -> simpleStringExpression ("+" simpleSEOrNE)*
simpleStringExpression   -> StringLiteral | StringVariable
simpleSEOrNE             -> simpleStringExpression | numberExpressionInPar
booleanExpression        -> booleanNumberExpression | BooleanVariable | BooleanLiteral
booleanNumberExpression  -> "(" numberExpression ("<" | "<=" | ">" | ">=" | "==" | "!=") numberExpression ")"
listExpression           -> Identifier | ResourceLiteral
tableExpression          -> Identifier | ResourceLiteral
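As a hedged sketch of these expression kinds in use (the variable name counter and the rule itself are illustrative; IF and LOG are a condition and an action listed in this guide):

INT counter;                                          // number variable declaration
Document{IF((counter < 10)) -> LOG("low: " + (counter))};

Here (counter < 10) is a booleanNumberExpression passed as an argument to IF, and "low: " + (counter) is a stringExpression concatenating a string literal with a parenthesized number expression.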

More information about Expressions. A ResourceLiteral is something like 'folder/file.txt' (yes, with single quotes).

2.3. Syntax

The inference relies on a complete, disjunctive partition of the document. A basic (minimal) annotation for each element of the partition is assigned to a type of a hierarchy. These basic annotations are enriched, for performance reasons, with information about annotations that start at the same offset or overlap with the basic annotation. Normally, a scanner creates a basic annotation for each token, punctuation or whitespace, but it can also be replaced with a different annotation seeding strategy. Unlike in other rule-based information extraction languages, the rules are executed in an imperative way. Experience has shown that the dependencies between rules, e.g., the same annotation types in the action of one rule and in the condition of a different rule, often form tree-like and not graph-like structures. Therefore, the sequencing and imperative processing did not cause disadvantages, but instead obvious advantages, e.g., the improved understandability of large rule sets. The following algorithm summarizes the rule inference:

collect all basic annotations that fulfill the first matching condition
for all collected basic annotations do
    for all rule elements of current rule do
        if quantifier wants to match then
            match the conditions of the rule element on the current basic annotation
            determine the next basic annotation after the current match
            if quantifier wants to continue then
                if there is a next basic annotation then
                    continue with the current rule element and the next basic annotation
            else if rule element did not match then
                reset the next basic annotation to the current one
        set the current basic annotation to the next one
        if some rule elements did not match then
            stop and continue with the next collected basic annotation
        else if there is no current basic annotation and the quantifier wants to continue then
            set the current basic annotation to the previous one
    if all rule elements matched then
        execute the actions of all rule elements

The rule elements can of course match on all kinds of annotations. Therefore, the determination of the next basic annotation returns the first basic annotation after the last basic annotation of the complete, matched annotation.
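For illustration, consider a simple sequential rule (the Chunk type is hypothetical and declared only for this sketch; the index arguments of MARK are explained in Section 2.7.18):

DECLARE Chunk;
CW SW+{->MARK(Chunk, 1, 2)};

The inference first collects all basic annotations on which the first rule element CW matches. For each of these anchors, the SW+ element then continues matching on the basic annotations that follow the complete match of the first element; if both elements match, the action creates a Chunk annotation spanning both.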


2.4. Declarations

There are three different kinds of declarations in the TextMarker system: declarations of types with optional feature definitions, declarations of variables, and declarations for importing external resources, scripts, or UIMA components.

2.4.1. Type

Type declarations define new annotation types and, optionally, their features. Examples:

DECLARE SimpleType1, SimpleType2; // <- two new types with the parent type "Annotation"
DECLARE ParentType NewType (SomeType feature1, INT feature2); // <- defines a new type "NewType"
    // with parent type "ParentType" and two features

If the parent type is not defined in the same namespace, then the complete namespace has to be used, e.g., DECLARE my.other.package.Parent NewType;

2.4.2. Variable

Variable declarations define new variables. There are five kinds of variables:

* Type variable: A variable that represents an annotation type.
* Integer variable: A variable that represents an integer.
* Double variable: A variable that represents a floating-point number.
* String variable: A variable that represents a string.
* Boolean variable: A variable that represents a boolean.

Examples:

TYPE newTypeVariable;
INT newIntegerVariable;
DOUBLE newDoubleVariable;
STRING newStringVariable;
BOOLEAN newBooleanVariable;

2.4.3. Resources

There are two kinds of resource declarations that make external resources available in the TextMarker system:

* List: A list represents a normal text file with one entry per line, or a compiled tree of a word list.
* Table: A table represents a comma-separated file.

Examples:

LIST Name = 'someWordList.txt';
TABLE Name = 'someTable.csv';

2.4.4. Scripts

Additional scripts can be imported and reused with the CALL action. The types of the imported rules are then also available, so that it is not necessary to import the type system of the additional rule script. Examples:


SCRIPT my.package.AnotherScript; // <- "AnotherScript.tm" in the "my.package" package
Document{->CALL(AnotherScript)}; // <- rule executes "AnotherScript.tm"

2.4.5. Components

There are two kinds of UIMA components that can be imported in a TextMarker script:

* Type System: includes the types defined in an external type system.
* Analysis Engine: makes an external analysis engine available. The type system needed for the analysis engine has to be imported separately.

Please mind the filtering settings when calling an external analysis engine. Examples:

ENGINE my.package.ExternalEngine; // <- "ExternalEngine.xml" in the
    // "my.package" package (in the descriptor folder)
TYPESYSTEM my.package.ExternalTypeSystem; // <- "ExternalTypeSystem.xml"
    // in the "my.package" package (in the descriptor folder)
Document{->RETAINTYPE(SPACE,BREAK),CALL(ExternalEngine)};
    // calls ExternalEngine, but retains white spaces

2.5. Quantifiers

2.5.1. * Star Greedy

The Star Greedy quantifier matches on any number of annotations and always evaluates true. Please mind that a rule element with a Star Greedy quantifier needs to match on different annotations than the next rule element. Examples:

Input:   small Big Big Big small
Rule:    CW*
Matched: Big Big Big
Matched: Big Big
Matched: Big

2.5.2. *? Star Reluctant

The Star Reluctant quantifier matches on any number of annotations and always evaluates true, but stops matching on new annotations when the next rule element matches and evaluates true on that annotation. Examples:

Input:   123 456 small small Big
Rule:    W*? CW
Matched: small small Big
Matched: small Big
Matched: Big


2.5.3. + Plus Greedy

The Plus Greedy quantifier needs to match on at least one annotation. Please mind that a rule element after a rule element with a Plus Greedy quantifier matches and evaluates on different annotations. Examples:

Input:   123 456 small small Big
Rule:    SW+
Matched: small small
Matched: small

2.5.4. +? Plus Reluctant

The Plus Reluctant quantifier has to match on at least one annotation in order to evaluate true, but stops when the next rule element is able to match on this annotation. Examples:

Input:   123 456 small small Big
Rule:    W+? CW
Matched: small small Big

2.5.5. ? Question Greedy

The Question Greedy quantifier matches optionally on an annotation and therefore always evaluates true. Examples:

Input:   123 456 small Big small Big
Rule:    SW CW? SW
Matched: small Big small

2.5.6. ?? Question Reluctant

The Question Reluctant quantifier matches optionally on an annotation if the next rule element cannot match on the same annotation, and therefore always evaluates true. Examples:

Input:   123 456 small Big small Big
Rule:    SW CW?? SW
Matched: small Big small

2.5.7. [x,y] Min Max Greedy

The Min Max Greedy quantifier has to match at least x and at most y annotations of its rule element to evaluate true. Examples:

Input:   123 456 small Big small Big
Rule:    SW CW[1,2] SW
Matched: small Big small

2.5.8. [x,y]? Min Max Reluctant

The Min Max Reluctant quantifier has to match at least x and at most y annotations of its rule element to evaluate true, but stops matching on additional annotations if the next rule element is able to match on this annotation. Examples:

Input:   123 456 small Big Big Big small Big
Rule:    SW CW[2,100]? SW
Matched: small Big Big Big small

2.6. Conditions

2.6.1. AFTER

The AFTER condition evaluates true if the matched annotation starts after the beginning of an arbitrary annotation of the passed type. If a list of types is passed, this has to be true for at least one of them.

2.6.1.1. Definition:

AFTER(Type|TypeListExpression)

2.6.1.2. Example:

CW{AFTER(SW)};

Here, the rule matches on a capitalized word if there is any small written word previously.

2.6.2. AND

The AND condition is a composed condition and evaluates true if all contained conditions evaluate true.

2.6.2.1. Definition:

AND(Condition1,...,ConditionN)

2.6.2.2. Example:

Paragraph{AND(PARTOF(Headline),CONTAINS(Keyword)) ->MARK(ImportantHeadline)};

In this example a Paragraph is annotated with an ImportantHeadline annotation if it is part of a Headline and contains a Keyword annotation.


2.6.3. BEFORE

The BEFORE condition evaluates true if the matched annotation starts before the beginning of an arbitrary annotation of the passed type. If a list of types is passed, this has to be true for at least one of them.

2.6.3.1. Definition:

BEFORE(Type|TypeListExpression)

2.6.3.2. Example:

CW{BEFORE(SW)};

Here, the rule matches on a capitalized word if there is any small written word afterwards.

2.6.4. CONTAINS

The CONTAINS condition evaluates true on a matched annotation if the frequency of the passed type lies within an optionally passed interval. The limits of the passed interval are by default interpreted as absolute numeral values. By passing an additional boolean parameter set to true, the limits are interpreted as percentage values. If no interval parameters are passed at all, the condition checks whether the matched annotation contains at least one occurrence of the passed type.

2.6.4.1. Definition:

CONTAINS(Type(,NumberExpression,NumberExpression(,BooleanExpression)?)?)

2.6.4.2. Example:

Paragraph{CONTAINS(Keyword)->MARK(KeywordParagraph)};

A Paragraph is annotated with a KeywordParagraph annotation if it contains a Keyword annotation.

Paragraph{CONTAINS(Keyword,2,4)->MARK(KeywordParagraph)};

A Paragraph is annotated with a KeywordParagraph annotation if it contains between two and four Keyword annotations.

Paragraph{CONTAINS(Keyword,50,100,true)->MARK(KeywordParagraph)};

A Paragraph is annotated with a KeywordParagraph annotation if it contains between 50% and 100% Keyword annotations. This is calculated based on the tokens of the Paragraph. If the Paragraph contains six basic annotations (see Section 2.1, “Basic Annotations and tokens” [5]), two of them are part of one Keyword annotation and one basic annotation is also annotated with a Keyword annotation, then the percentage of the contained Keywords is 50%.

2.6.5. CONTEXTCOUNT

The CONTEXTCOUNT condition numbers all occurrences of the matched type within the context of a passed type's annotation consecutively, thus assigning an index to each occurrence.


Additionally, it stores the index of the matched annotation in a numerical variable if one is passed. The condition evaluates true if the index of the matched annotation is within a passed interval. If no interval is passed, the condition always evaluates true.

2.6.5.1. Definition:

CONTEXTCOUNT(Type(,NumberExpression,NumberExpression)?(,Variable)?)

2.6.5.2. Example:

Keyword{CONTEXTCOUNT(Paragraph,2,3,var) ->MARK(SecondOrThirdKeywordInParagraph)};

Here, the position of the matched Keyword annotation within a Paragraph annotation is calculated and stored in the variable 'var'. If the counted value lies within the interval [2,3], the matched Keyword is annotated with the SecondOrThirdKeywordInParagraph annotation.

2.6.6. COUNT

The COUNT condition can be used in two different ways. In the first case (see first definition), it counts the number of annotations of the passed type within the window of the matched annotation and stores the amount in a numerical variable if such a variable is passed. The condition evaluates true if the counted amount is within a specified interval. If no interval is passed, the condition always evaluates true. In the second case (see second definition), it counts the number of occurrences of the passed VariableExpression (second parameter) within the passed list (first parameter) and stores the amount in a numerical variable if such a variable is passed. Again, the condition evaluates true if the counted amount is within a specified interval. If no interval is passed, the condition always evaluates true.

2.6.6.1. Definition:

COUNT(Type(,NumberExpression,NumberExpression)?(,NumberVariable)?)

COUNT(ListExpression,VariableExpression (,NumberExpression,NumberExpression)?(,NumberVariable)?)

2.6.6.2. Example:

Paragraph{COUNT(Keyword,1,10,var)->MARK(KeywordParagraph)};

Here, the amount of Keyword annotations within a Paragraph is calculated and stored in the variable 'var'. If one to ten Keywords were counted, the Paragraph is marked with a KeywordParagraph annotation.

Paragraph{COUNT(list,"author",5,7,var)};

Here, the number of occurrences of the STRING "author" within the STRINGLIST 'list' is counted and stored in the variable 'var'. If "author" occurs five to seven times within 'list', the condition evaluates true.


2.6.7. CURRENTCOUNT

The CURRENTCOUNT condition numbers all occurrences of the matched type within the whole document consecutively, thus assigning an index to each occurrence. Additionally, it stores the index of the matched annotation in a numerical variable if one is passed. The condition evaluates true if the index of the matched annotation is within a specified interval. If no interval is passed, the condition always evaluates true.

2.6.7.1. Definition:

CURRENTCOUNT(Type(,NumberExpression,NumberExpression)?(,Variable)?)

2.6.7.2. Example:

Paragraph{CURRENTCOUNT(Keyword,3,3,var)->MARK(ParagraphWithThirdKeyword)};

Here, the Paragraph which contains the third Keyword of the whole document is annotated with the ParagraphWithThirdKeyword annotation. The index is stored in the variable 'var'.

2.6.8. ENDSWITH

The ENDSWITH condition evaluates true if an annotation of the given type ends exactly at the same position as the matched annotation. If a list of types is passed, this has to be true for at least one of them.

2.6.8.1. Definition:

ENDSWITH(Type|TypeListExpression)

2.6.8.2. Example:

Paragraph{ENDSWITH(SW)};

Here, the rule matches on a Paragraph annotation if it ends with a small written word.

2.6.9. FEATURE

The FEATURE condition compares a feature of the matched annotation with the second argument.

2.6.9.1. Definition:

FEATURE(StringExpression,Expression)

2.6.9.2. Example:

Document{FEATURE("language",targetLanguage)}

This rule matches if the feature named 'language' of the Document annotation equals the value of the variable 'targetLanguage'.


2.6.10. IF

The IF condition evaluates true if the contained boolean expression does.

2.6.10.1. Definition:

IF(BooleanExpression)

2.6.10.2. Example:

Paragraph{IF(keywordAmount > 5)->MARK(KeywordParagraph)};

A Paragraph annotation is annotated with a KeywordParagraph annotation if the value of the variable 'keywordAmount' is greater than five.

2.6.11. INLIST

The INLIST condition is fulfilled if the matched annotation is listed in a given word or string list. The (relative) edit distance is currently disabled.

2.6.11.1. Definition:

INLIST(WordList(,NumberExpression,(BooleanExpression)?)?)

INLIST(StringList(,NumberExpression,(BooleanExpression)?)?)

2.6.11.2. Example:

Keyword{INLIST(specialKeywords.txt)->MARK(SpecialKeyword)};

A Keyword is annotated with the type SpecialKeyword if the text of the Keyword annotation is listed in the word list 'specialKeywords.txt'.

2.6.12. IS

The IS condition evaluates true if there is an annotation of the given type with the same beginning and ending offsets as the matched annotation. If a list of types is given, the condition evaluates true if at least one of them fulfills the former condition.

2.6.12.1. Definition:

IS(Type|TypeListExpression)

2.6.12.2. Example:

Author{IS(Englishman)->MARK(EnglishAuthor)};

If an Author annotation is also annotated with an Englishman annotation, it is annotated with an EnglishAuthor annotation.


2.6.13. LAST

The LAST condition evaluates true if the last token within the window of the matched annotation is of the given type.

2.6.13.1. Definition:

LAST(TypeExpression)

2.6.13.2. Example:

Document{LAST(CW)};

This rule fires if the last token of the document is a capitalized word.

2.6.14. MOFN

The MOFN condition is a composed condition. It evaluates true if the number of contained conditions evaluating true is within a given interval.

2.6.14.1. Definition:

MOFN(NumberExpression,NumberExpression,Condition1,...,ConditionN)

2.6.14.2. Example:

Paragraph{MOFN(1,1,PARTOF(Headline),CONTAINS(Keyword)) ->MARK(HeadlineXORKeywords)};

A Paragraph is marked as HeadlineXORKeywords if the matched text is either part of a Headline annotation or contains Keyword annotations, but not both.

2.6.15. NEAR

The NEAR condition is fulfilled if the distance of the matched annotation to an annotation of the given type is within a given interval. The direction is defined by a boolean parameter whose default value is true, meaning the condition searches forward. By default this condition works on an unfiltered index. An optional fifth boolean parameter can be set to true to evaluate the condition on a filtered index.

2.6.15.1. Definition:

NEAR(TypeExpression,NumberExpression,NumberExpression (,BooleanExpression(,BooleanExpression)?)?)

2.6.15.2. Example:

Paragraph{NEAR(Headline,0,10,false)->MARK(NoHeadline)};


A Paragraph that starts at most ten tokens after a Headline annotation is annotated with the NoHeadline annotation.

2.6.16. NOT

The NOT condition negates the result of its contained condition.

2.6.16.1. Definition:

"-"Condition

2.6.16.2. Example:

Paragraph{-PARTOF(Headline)->MARK(Headline)};

A Paragraph that is not part of a Headline annotation so far is annotated with a Headline annotation.

2.6.17. OR

The OR condition is a composed condition and evaluates true if at least one contained condition evaluates true.

2.6.17.1. Definition:

OR(Condition1,...,ConditionN)

2.6.17.2. Example:

Paragraph{OR(PARTOF(Headline),CONTAINS(Keyword))->MARK(ImportantParagraph)};

In this example a Paragraph is annotated with the ImportantParagraph annotation if it is part of a Headline or contains Keyword annotations.

2.6.18. PARSE

The PARSE condition is fulfilled if the text covered by the matched annotation can be transformed into a value of the given variable's type. If this is possible, the parsed value is additionally assigned to the passed variable.

2.6.18.1. Definition:

PARSE(variable)

2.6.18.2. Example:

NUM{PARSE(var)};

If the variable 'var' is of an appropriate numeric type, the value of NUM is parsed and subsequently stored in 'var'.


2.6.19. PARTOF

The PARTOF condition is fulfilled if the matched annotation is part of an annotation of the given type. However, it is not necessary that the matched annotation is smaller than the annotation of the given type. Use the (much slower) PARTOFNEQ condition instead if this is needed. If a type list is given, the condition evaluates true if the condition described above is fulfilled for at least one of the types in the list.

2.6.19.1. Definition:

PARTOF(Type|TypeListExpression)

2.6.19.2. Example:

Paragraph{PARTOF(Headline) -> MARK(ImportantParagraph)};

A Paragraph is an ImportantParagraph if the matched text is part of a Headline annotation.

2.6.20. PARTOFNEQ

The PARTOFNEQ condition is fulfilled if the matched annotation is part of (smaller than and inside of) an annotation of the given type. If annotations of the same extent should also be acceptable, use the PARTOF condition. If a type list is given, the condition evaluates true if the condition described above is fulfilled for at least one of the types in the list.

2.6.20.1. Definition:

PARTOFNEQ(Type|TypeListExpression)

2.6.20.2. Example:

W{PARTOFNEQ(Headline) -> MARK(ImportantWord)};

A word is an ImportantWord if it is part of a headline.

2.6.21. POSITION

The POSITION condition is fulfilled if the matched annotation is the k-th occurrence of this type within the window of an annotation of the passed type, where k is defined by the value of the passed NumberExpression. If the additional boolean parameter is set to false, then k counts the occurrences of the minimal annotations.

2.6.21.1. Definition:

POSITION(Type,NumberExpression(,BooleanExpression)?)

2.6.21.2. Example:

Keyword{POSITION(Paragraph,2)->MARK(SecondKeyword)};


The second Keyword in a Paragraph is annotated with the type SecondKeyword.

Keyword{POSITION(Paragraph,2,false)->MARK(SecondKeyword)};

A Keyword in a Paragraph is annotated with the type SecondKeyword if it starts at the same offset as the second (visible) TextMarkerBasic annotation, which normally corresponds to the tokens.

2.6.22. REGEXP

The REGEXP condition is fulfilled if the given pattern matches on the matched annotation. However, if a string variable is given as the first argument, then the pattern is evaluated on the value of the variable. For more details on the syntax of regular expressions, have a look at the Java API1. By default the REGEXP condition is case-sensitive. To change this, add an optional boolean parameter set to true.

2.6.22.1. Definition:

REGEXP((StringVariable,)? StringExpression(,BooleanExpression)?)

2.6.22.2. Example:

Keyword{REGEXP("..")->MARK(SmallKeyword)};

A Keyword that consists of only two characters is annotated with a SmallKeyword annotation.

2.6.23. SCORE

The SCORE condition evaluates the heuristic score of the matched annotation. This score is set or changed by the MARK action. The condition is fulfilled if the score of the matched annotation is within a given interval. Optionally, the score can be stored in a variable.

2.6.23.1. Definition:

SCORE(NumberExpression,NumberExpression(,Variable)?)

2.6.23.2. Example:

MaybeHeadline{SCORE(40,100)->MARK(Headline)};

An annotation of the type MaybeHeadline is annotated with Headline if its score is between 40 and 100.

2.6.24. SIZE

The SIZE condition counts the number of elements in the given list. By default this condition always evaluates true. If an interval is passed, it evaluates true if the counted number of list elements is within the interval. The counted number can be stored in an optionally passed numeral variable.

1 http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html


2.6.24.1. Definition:

SIZE(ListExpression(,NumberExpression,NumberExpression)?(,Variable)?)

2.6.24.2. Example:

Document{SIZE(list,4,10,var)};

This rule fires if the given list contains between 4 and 10 elements. Additionally, the exact amount is stored in the variable 'var'.

2.6.25. STARTSWITH

The STARTSWITH condition evaluates true if an annotation of the given type starts exactly at the same position as the matched annotation. If a type list is given, the condition evaluates true if this is true for at least one of the given types in the list.

2.6.25.1. Definition:

STARTSWITH(Type|TypeListExpression)

2.6.25.2. Example:

Paragraph{STARTSWITH(SW)};

Here, the rule matches on a Paragraph annotation if it starts with a small written word.

2.6.26. TOTALCOUNT

The TOTALCOUNT condition counts the annotations of the passed type within the whole document and stores the amount in an optionally passed numerical variable. The condition evaluates true if the amount is within the passed interval. If no interval is passed, the condition always evaluates true.

2.6.26.1. Definition:

TOTALCOUNT(Type(,NumberExpression,NumberExpression(,Variable)?)?)

2.6.26.2. Example:

Paragraph{TOTALCOUNT(Keyword,1,10,var)->MARK(KeywordParagraph)};

Here, the amount of Keyword annotations within the whole document is calculated and stored in the variable 'var'. If one to ten Keywords were counted, the Paragraph is marked with a KeywordParagraph annotation.

2.6.27. VOTE

The VOTE condition counts the annotations of the given two types within the window of the matched annotation and evaluates true if it found more annotations of the first type.


2.6.27.1. Definition:

VOTE(TypeExpression,TypeExpression)

2.6.27.2. Example:

Paragraph{VOTE(FirstName,LastName)};

Here, this rule fires if a paragraph contains more FirstName annotations than LastName annotations.

2.7. Actions

2.7.1. ADD

The ADD action adds all the elements of the passed TextMarkerExpressions to a given list. For example, these expressions could be a string, an integer variable or a list itself. For a complete overview of TextMarker expressions see Section 2.8, “Expressions” [32].

2.7.1.1. Definition:

ADD(ListVariable,(TextMarkerExpression)+)

2.7.1.2. Example:

Document{->ADD(list, var)};

In this example, the variable 'var' is added to the list 'list'.

2.7.2. ASSIGN

The ASSIGN action assigns the value of the passed expression to a variable of the same type.

2.7.2.1. Definition:

ASSIGN(BooleanVariable,BooleanExpression)

ASSIGN(NumberVariable,NumberExpression)

ASSIGN(StringVariable,StringExpression)

ASSIGN(TypeVariable,TypeExpression)

2.7.2.2. Example:

Document{->ASSIGN(amount, (amount/2))};


In this example, the value of the variable 'amount' is halved.

2.7.3. CALL

The CALL action initiates the execution of a different script file or script block. Currently, only complete script files are supported.

2.7.3.1. Definition:

CALL(DifferentFile)

CALL(Block)

2.7.3.2. Example:

Document{->CALL(NamedEntities)};

Here, a script 'NamedEntities' for named entity recognition is executed.

2.7.4. CLEAR

The CLEAR action removes all elements of the given list.

2.7.4.1. Definition:

CLEAR(ListVariable)

2.7.4.2. Example:

Document{->CLEAR(SomeList)};

This rule clears the list 'SomeList'.

2.7.5. COLOR

The COLOR action sets the color of an annotation type in the modified view if the rule fires. The background color is passed as the second parameter. The font color can be changed by passing a further color as the third parameter. By default, annotations are not automatically selected when opening the modified view. This can be changed for the matched annotations by passing true as the fourth parameter. The supported colors are: black, silver, gray, white, maroon, red, purple, fuchsia, green, lime, olive, yellow, navy, blue, aqua, lightblue, lightgreen, orange, pink, salmon, cyan, violet, tan, brown, mediumpurple.

2.7.5.1. Definition:

COLOR(TypeExpression,StringExpression(, StringExpression (, BooleanExpression)?)?)


2.7.5.2. Example:

Document{->COLOR(Headline, "red", "green", true)};

This rule colors all Headline annotations in the modified view. Thereby, the background color is set to red, the font color is set to green, and all Headline annotations are selected when opening the modified view.

2.7.6. CONFIGURE

The CONFIGURE action can be used to configure the analysis engine of the given namespace (first parameter). The parameters that should be configured with corresponding values are passed as name-value pairs.

2.7.6.1. Definition:

CONFIGURE(StringExpression(,StringExpression = Expression)+)
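A usage sketch (the engine is referenced by the short name under which it was imported; the engine and its parameter 'minLength' are hypothetical and would have to be declared in the engine's descriptor):

ENGINE my.package.ExternalEngine;
Document{->CONFIGURE(ExternalEngine, "minLength" = 5),
         EXEC(ExternalEngine)};

Here, the parameter 'minLength' of the external analysis engine is set to 5 before the engine is executed.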

2.7.7. CREATE

The CREATE action is similar to the MARK action. It also annotates the matched text fragments with a type annotation, but additionally assigns values to a chosen subset of the type's features.

2.7.7.1. Definition:

CREATE(TypeExpression(,NumberExpression)*(,StringExpression = Expression)+)

2.7.7.2. Example:

Paragraph{COUNT(ANY,0,10000,cnt)->CREATE(Headline,"size" = cnt)};

This rule counts the number of tokens of type ANY in a Paragraph annotation and assigns the counted value to the int variable 'cnt'. If the counted number is between 0 and 10000, a Headline annotation is created for this Paragraph. Moreover, the feature named 'size' of the Headline is set to the value of 'cnt'.

2.7.8. DEL

The DEL action deletes the matched text fragments in the modified view.

2.7.8.1. Definition:

DEL

2.7.8.2. Example:

Name{->DEL};


This rule deletes all text fragments that are annotated with a Name annotation.

2.7.9. DYNAMICANCHORING

The DYNAMICANCHORING action turns dynamic anchoring on or off (first parameter) and assigns the anchoring parameters penalty (second parameter) and factor (third parameter).

2.7.9.1. Definition:

DYNAMICANCHORING(BooleanExpression(,NumberExpression(,NumberExpression)?)?)
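A usage sketch (the concrete penalty and factor values here are illustrative only, not recommendations):

Document{->DYNAMICANCHORING(true, 10, 5)};

This sketch activates dynamic anchoring with an anchoring penalty of 10 and a factor of 5.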

2.7.10. EXEC

The EXEC action initiates the execution of a different script file or analysis engine on the complete input document, independent of the matched text and the current filtering settings. If the argument refers to another script file, a new view on the document is created: the complete text of the original CAS with the default filtering settings of the TextMarker analysis engine.

2.7.10.1. Definition:

EXEC(DifferentFile)

2.7.10.2. Example:

ENGINE NamedEntities;
Document{->EXEC(NamedEntities)};

Here, an analysis engine for named entity recognition is executed once on the complete document.

2.7.11. FILL

The FILL action fills a chosen subset of the given type's features.

2.7.11.1. Definition:

FILL(TypeExpression(,StringExpression = Expression)+)

2.7.11.2. Example:

Headline{COUNT(ANY,0,10000,tokenCount) ->FILL(Headline,"size" = tokenCount)};

Here, the number of tokens within a Headline annotation is counted and stored in the variable 'tokenCount'. If the number of tokens is within the interval [0,10000], the FILL action fills the Headline's feature 'size' with the value of 'tokenCount'.

2.7.12. FILTERTYPE

This action filters the given types of annotations. They are subsequently ignored by rules. For more information on how rules work, see Section 2.3, “Syntax” [6]. Expressions are not yet supported. This action is complementary to RETAINTYPE (see Section 2.7.29, “RETAINTYPE” [30]).

2.7.12.1. Definition:

FILTERTYPE((TypeExpression(,TypeExpression)*))?

2.7.12.2. Example:

Document{->FILTERTYPE(SW)};

This rule filters all small written words in the input document. This means they are subsequently ignored by any rule.

2.7.13. GATHER

This action creates a complex structure: an annotation with features. The optionally passed indexes (NumberExpressions after the TypeExpression) can be used to create an annotation that spans the matched information of several rule elements. The features are collected using the indexes of the rule elements of the complete rule.

2.7.13.1. Definition:

GATHER(TypeExpression(,NumberExpression)* (,StringExpression = NumberExpression)+)

2.7.13.2. Example:

DECLARE Annotation A;
DECLARE Annotation B;
DECLARE Annotation C(Annotation a, Annotation b);
W{REGEXP("A")->MARK(A)};
W{REGEXP("B")->MARK(B)};
A B{-> GATHER(C, 1, 2, "a" = 1, "b" = 2)};

Two annotations A and B are declared and annotated. The last rule creates an annotation C spanning the elements A (index 1, since it is the first rule element) and B (index 2), with its feature 'a' set to annotation A (again index 1) and 'b' set to annotation B (again index 2).

2.7.14. GET

The GET action retrieves an element of the given list dependent on a given strategy.

Table 2.1. Currently supported strategies

Strategy Functionality

dominant finds the most frequently occurring element

2.7.14.1. Definition:

GET(ListExpression, Variable, StringExpression)


2.7.14.2. Example:

Document{->GET(list, var, "dominant")};

In this example, the element of the list 'list' that occurs most frequently is stored in the variable 'var'.

2.7.15. GETFEATURE

The GETFEATURE action stores the value of the matched annotation's feature (first parameter) in the given variable (second parameter).

2.7.15.1. Definition:

GETFEATURE(StringExpression, Variable)

2.7.15.2. Example:

Document{->GETFEATURE("language", stringVar)};

In this example, variable 'stringVar' will contain the value of the feature 'language'.

2.7.16. GETLIST

This action retrieves a list of types dependent on a given strategy.

Table 2.2. Currently supported strategies

Strategy Functionality

Types get all types within the matched annotation

Types:End get all types that end at the same offset as the matched annotation

Types:Begin get all types that start at the same offset as the matched annotation

2.7.16.1. Definition:

GETLIST(ListVariable, StringExpression)

2.7.16.2. Example:

Document{->GETLIST(list, "Types")};

Here, a list of all types within the document is created and assigned to list variable 'list'.

2.7.17. LOG

The LOG action simply writes a log message.


2.7.17.1. Definition:

LOG(StringExpression)

2.7.17.2. Example:

Document{->LOG("processed")};

This rule writes a log message with the string "processed".

2.7.18. MARK

The MARK action is the most important action in the TextMarker system. It creates a new annotation of the given type. The optionally passed indexes (NumberExpressions after the TypeExpression) can be used to create an annotation that spans the matched information of several rule elements.

2.7.18.1. Definition:

MARK(TypeExpression(,NumberExpression)*)

2.7.18.2. Example:

Freeline Paragraph{->MARK(ParagraphAfterFreeline,1,2)};

This rule matches on a free line followed by a Paragraph annotation and annotates both in a single ParagraphAfterFreeline annotation. The two numerical expressions at the end of the MARK action state that the matched texts of the first and the second rule elements are joined to create the boundaries of the new annotation.

2.7.19. MARKFAST

The MARKFAST action creates annotations of the given type (first parameter) if an element of the passed list (second parameter) occurs within the window of the matched annotation. The created annotation does not cover the whole matched annotation; instead, it only covers the text of the found occurrence. The third parameter is optional. It defines whether the MARKFAST action should ignore the case; its default value is false. The optional fourth parameter specifies a character-length threshold for ignoring the case. It is only relevant if the ignore-case value is set to true. The last parameter is set to true by default and specifies whether whitespaces in the entries of the dictionary should be ignored. For more information on lists, see Section 2.4.3, "Resources" [7]. In addition to external word lists, string list variables can be used.

2.7.19.1. Definition:

MARKFAST(TypeExpression,ListExpression(,BooleanExpression (,NumberExpression,(BooleanExpression)?)?)?)

MARKFAST(TypeExpression,StringListExpression(,BooleanExpression (,NumberExpression,(BooleanExpression)?)?)?)

2.7.19.2. Example:

WORDLIST FirstNameList = 'FirstNames.txt';
DECLARE FirstName;
Document{-> MARKFAST(FirstName, FirstNameList, true, 2)};

This rule annotates all first names listed in the list 'FirstNameList' within the document and ignores the case if the length of the word is greater than 2.

2.7.20. MARKLAST

The MARKLAST action annotates the last token of the matched annotation with the given type.

2.7.20.1. Definition:

MARKLAST(TypeExpression)

2.7.20.2. Example:

Document{->MARKLAST(Last)};

This rule annotates the last token of the document with the annotation Last.

2.7.21. MARKONCE

The MARKONCE action has the same functionality as the MARK action, but creates a new annotation only if it does not yet exist.

2.7.21.1. Definition:

MARKONCE(NumberExpression,TypeExpression(,NumberExpression)*)

2.7.21.2. Example:

Freeline Paragraph{->MARKONCE(ParagraphAfterFreeline,1,2)};

This rule matches on a free line followed by a Paragraph and annotates both in a single ParagraphAfterFreeline annotation, if they are not already annotated with a ParagraphAfterFreeline annotation. The two numerical expressions at the end of the MARKONCE action state that the matched texts of the first and the second rule elements are joined to create the boundaries of the new annotation.

2.7.22. MARKSCORE

The MARKSCORE action is similar to the MARK action. It also creates a new annotation of the given type, but only if it does not yet exist. The optionally passed indexes (parameters after the TypeExpression) can be used to create an annotation that spans the matched information of several rule elements. Additionally, a score value (first parameter) is added to the heuristic score value of the annotation. For more information on heuristic scores, see Section 2.11, "Heuristic extraction using scoring rules" [33].

2.7.22.1. Definition:

MARKSCORE(NumberExpression,TypeExpression(,NumberExpression)*)

2.7.22.2. Example:

Freeline Paragraph{->MARKSCORE(10,ParagraphAfterFreeline,1,2)};

This rule matches on a free line followed by a paragraph and annotates both in a single ParagraphAfterFreeline annotation. The two number expressions at the end of the MARKSCORE action indicate that the matched texts of the first and the second rule elements are joined to create the boundaries of the new annotation. Additionally, the score '10' is added to the heuristic score of this annotation.

2.7.23. MARKTABLE

The MARKTABLE action creates annotations of the given type (first parameter) if an element of the given column (second parameter) of a passed table (third parameter) occurs within the window of the matched annotation. The created annotation does not cover the whole matched annotation; instead, it only covers the text of the found occurrence. Optionally, the MARKTABLE action is able to assign entries of the given table to features of the created annotation. For more information on tables, see Section 2.4.3, "Resources" [7]. Additionally, several configuration parameters are possible (see example).

2.7.23.1. Definition:

MARKTABLE(TypeExpression, NumberExpression, TableExpression (,BooleanExpression, NumberExpression, StringExpression, NumberExpression)? (,StringExpression = NumberExpression)+)

2.7.23.2. Example:

WORDTABLE TestTable = 'TestTable.csv';
DECLARE Annotation Struct(STRING first);
Document{-> MARKTABLE(Struct, 1, TestTable, true, 4, ".,-", 2, "first" = 2)};

In this example, the whole document is searched for all occurrences of the entries of the first column of the given table 'TestTable'. For each occurrence, an annotation of the type Struct is created and its feature 'first' is filled with the entry of the second column. Moreover, the case of the word is ignored if the length of the word exceeds 4. Additionally, the characters '.', ',' and '-' are ignored, but at most two of them.
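The lookup semantics described above can be approximated in Python. This is an illustrative sketch, not the actual implementation: the helper name `marktable_candidates` and the word-by-word comparison are assumptions made for demonstration only.

```python
import csv
import io

# Hypothetical helper sketching the MARKTABLE lookup: entries of one column
# are searched in the text, ignoring case for entries above a length
# threshold and tolerating a limited number of special characters.
def marktable_candidates(text, table, column, ignore_case_over=4,
                         ignored=".,-", max_ignored=2):
    matches = []
    for row in table:
        entry = row[column]
        # Normalize case only for entries longer than the threshold.
        fold = len(entry) > ignore_case_over
        needle = entry.lower() if fold else entry
        haystack = text.lower() if fold else text
        for word in haystack.split():
            # Drop up to max_ignored of the ignored characters.
            stripped, removed = [], 0
            for ch in word:
                if ch in ignored and removed < max_ignored:
                    removed += 1
                else:
                    stripped.append(ch)
            if "".join(stripped) == needle:
                matches.append((entry, row))
    return matches

table = list(csv.reader(io.StringIO("Peter;P\nJochen;J\n"), delimiter=";"))
print(marktable_candidates("met PETER today", table, 0))
```

For example, 'Jo-chen' would still match the entry 'Jochen', because one '-' is among the ignored characters.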

2.7.24. MATCHEDTEXT

The MATCHEDTEXT action saves the text of the matched annotation in the passed String variable. The optionally passed indexes can be used to match the text of several rule elements.


2.7.24.1. Definition:

MATCHEDTEXT(StringVariable(,NumberExpression)*)

2.7.24.2. Example:

Headline Paragraph{->MATCHEDTEXT(stringVariable,1,2)};

The text covered by the Headline (rule element 1) and the Paragraph (rule element 2) annotations is saved in the variable 'stringVariable'.

2.7.25. MERGE

The MERGE action merges a number of given lists. The first parameter defines whether the merge is done as an intersection (false) or as a union (true). The second parameter is the list variable that will contain the result.

2.7.25.1. Definition:

MERGE(BooleanExpression, ListVariable, ListExpression, (ListExpression)+)

2.7.25.2. Example:

Document{->MERGE(false, listVar, list1, list2, list3)};

The elements that occur in all three lists will be placed in the list 'listVar'.
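The union/intersection semantics can be sketched in Python, with plain lists standing in for TextMarker list expressions (a conceptual illustration, not the actual implementation):

```python
# Sketch of MERGE: union=True corresponds to MERGE(true, ...),
# union=False to the intersection variant MERGE(false, ...).
def merge(union, *lists):
    if union:
        # Union: collect each element once, in order of first appearance.
        result = []
        for lst in lists:
            for item in lst:
                if item not in result:
                    result.append(item)
        return result
    # Intersection: keep only elements occurring in every list.
    return [item for item in lists[0]
            if all(item in lst for lst in lists[1:])]

print(merge(False, [1, 2, 3], [2, 3, 4], [3, 2]))  # -> [2, 3]
print(merge(True, [1, 2], [2, 3]))                 # -> [1, 2, 3]
```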

2.7.26. REMOVE

The REMOVE action removes lists or single values from a given list.

2.7.26.1. Definition:

REMOVE(ListVariable,(Argument)+)

2.7.26.2. Example:

Document{->REMOVE(list, var)};

In this example, the variable 'var' is removed from the list 'list'.

2.7.27. REMOVEDUPLICATE

The REMOVEDUPLICATE action removes all duplicates within a given list.

2.7.27.1. Definition:

REMOVEDUPLICATE(ListVariable)


2.7.27.2. Example:

Document{->REMOVEDUPLICATE(list)};

Here, all duplicates in list 'list' are removed.

2.7.28. REPLACE

The REPLACE action replaces the text of all matched annotations with the given StringExpression. It remembers the modification for the matched annotations and shows them in the modified view (see Section 3.7.1, "Annotation Browser" [39]).

2.7.28.1. Definition:

REPLACE(StringExpression)

2.7.28.2. Example:

FirstName{->REPLACE("first name")};

This rule replaces all first names with the string 'first name'.

2.7.29. RETAINTYPE

The RETAINTYPE action retains the given types. This means that they are no longer ignored by rules. This action is complementary to FILTERTYPE (see Section 2.7.12, "FILTERTYPE" [23]).

2.7.29.1. Definition:

RETAINTYPE((TypeExpression(,TypeExpression)*))?

2.7.29.2. Example:

Document{->RETAINTYPE(SPACE)};

All spaces are retained and can be matched by rules.

2.7.30. SETFEATURE

The SETFEATURE action sets the value of a feature of the matched complex structure.

2.7.30.1. Definition:

SETFEATURE(StringExpression,Expression)

2.7.30.2. Example:

Document{->SETFEATURE("language","en")};


Here, the feature 'language' of the input document is set to English.

2.7.31. TRANSFER

The TRANSFER action creates a new feature structure and adds all compatible features of the matched annotation.

2.7.31.1. Definition:

TRANSFER(TypeExpression)

2.7.31.2. Example:

Document{->TRANSFER(LanguageStorage)};

Here, a new feature structure LanguageStorage is created and the compatible features of the Document annotation are copied. E.g., if LanguageStorage defines a feature named 'language', then the feature value of the Document annotation is copied.
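The "copy all compatible features" behavior can be sketched in Python, modeling annotations as plain dicts and the target type by the set of feature names it declares (a simplified illustration; real compatibility in UIMA also depends on feature range types):

```python
# Hedged sketch of TRANSFER: copy only those features of the source
# annotation whose names the target type also declares.
def transfer(source_features, target_declared):
    return {name: value for name, value in source_features.items()
            if name in target_declared}

doc = {"language": "en", "size": 42}
# Only 'language' is declared by the hypothetical LanguageStorage type.
print(transfer(doc, {"language"}))  # -> {'language': 'en'}
```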

2.7.32. TRIE

The TRIE action uses an external multi tree word list to annotate the matched annotation and provides several configuration parameters.

2.7.32.1. Definition:

TRIE((String = Type)+,ListExpression,BooleanExpression,NumberExpression, BooleanExpression,NumberExpression,StringExpression)

2.7.32.2. Example:

Document{->TRIE("FirstNames.txt" = FirstName, "Companies.txt" = Company, 'Dictionary.mtwl', true, 4, false, 0, ".,-/")};

Here, the dictionary 'Dictionary.mtwl', which contains word lists for first names and companies, is used to annotate the document. The words previously contained in the file 'FirstNames.txt' are annotated with the type FirstName, and the words in the file 'Companies.txt' with the type Company. The case of a word is ignored if its length exceeds 4. The edit distance is deactivated. The cost of an edit operation can currently not be configured by an argument. The last argument additionally defines several characters that will be ignored.

2.7.33. UNMARK

The UNMARK action removes the annotation of the given type overlapping the matched annotation.

2.7.33.1. Definition:

UNMARK(TypeExpression)


2.7.33.2. Example:

Headline{->UNMARK(Headline)};

Here, the headline annotation is removed.

2.7.34. UNMARKALL

The UNMARKALL action removes all annotations of the given type and all of its descendants overlapping the matched annotation, unless the annotation is of at least one type in the passed list.

2.7.34.1. Definition:

UNMARKALL(TypeExpression, TypeListExpression)

2.7.34.2. Example:

Annotation{->UNMARKALL(Annotation, {Headline})};

Here, all annotations but headlines are removed.

2.8. Expressions

2.8.1. Type Expressions

2.8.2. Number Expressions

2.8.3. String Expressions

2.8.4. Boolean Expressions

2.9. Robust extraction using filtering

Rule-based or pattern-based information extraction systems often suffer from unimportant fill words, additional whitespace and unexpected markup. The TextMarker system enables the knowledge engineer to filter and to hide all possible combinations of predefined and new types of annotations. Additionally, it can differentiate between every kind of HTML markup and XML tags. The visibility of tokens and annotations is modified by the actions of rule elements and can be conditioned using the complete expressiveness of the language. Therefore, the TextMarker system supports a robust approach to information extraction and simplifies the creation of new rules, since the knowledge engineer can focus on important textual features. If no rule action changed the configuration of the filtering settings, then the default filtering configuration ignores whitespace and markup. Using the default setting, the following rule matches all four types of input in this example:


"Dr" PERIOD CW CW

Dr. Peter Steinmetz
Dr . Peter Steinmetz
Dr. <b><i>Peter</i> Steinmetz</b>
Dr.PeterSteinmetz

2.10. Blocks

Blocks combine some more complex control structures of the TextMarker language: conditioned statements, loops and procedures. The rule element in the definition of a block has to define a condition/action part, even if that part is empty (LCURLY and RCURLY). A block can use normal conditions to condition the execution of the rules it contains. Examples:

DECLARE Month;

BLOCK(EnglishDates) Document{FEATURE("language", "en")} {
  Document{->MARKFAST(Month, 'englishMonthNames.txt')};
  // ...
}

BLOCK(GermanDates) Document{FEATURE("language", "de")} {
  Document{->MARKFAST(Month, 'germanMonthNames.txt')};
  // ...
}

A block can also be used to execute the rules it contains on a sequence of similar text passages. Example:

BLOCK(Paragraphs) Paragraph{} {
  // <- limits the local view on the document: defines a local document.
  // The following rule will be executed for each Paragraph that can be
  // found in the current document.
  Document{CONTAINS(Keyword)->MARK(SpecialParagraph)};
  // Here, Document represents not the complete input document, but each
  // Paragraph defined by the block statement.
}

2.11. Heuristic extraction using scoring rules

Diagnostic scores are a well-known and successfully applied knowledge formalization pattern for diagnostic problems. Single known findings valuate a possible solution by adding or subtracting points on an account of that solution. If the sum exceeds a given threshold, then the solution is derived. One of the advantages of this pattern is its robustness against missing or false findings, since a high number of findings is used to derive a solution. The TextMarker system tries to transfer this diagnostic problem solution strategy to the information extraction problem. In addition to the normal creation of a new annotation, a MARK action can add positive or negative scoring points to the text fragments matched by the rule elements. If the amount of points exceeds the defined threshold for the respective type, then a new annotation will be created. Further, the current value of heuristic points of a possible annotation can be evaluated by the SCORE condition. In the following, heuristic extraction using scoring rules is demonstrated by a short example:


Paragraph{CONTAINS(W,1,5)->MARKSCORE(5,Headline)};
Paragraph{CONTAINS(W,6,10)->MARKSCORE(2,Headline)};
Paragraph{CONTAINS(Emph,80,100,true)->MARKSCORE(7,Headline)};
Paragraph{CONTAINS(Emph,30,80,true)->MARKSCORE(3,Headline)};
Paragraph{CONTAINS(CW,50,100,true)->MARKSCORE(7,Headline)};
Paragraph{CONTAINS(W,0,0)->MARKSCORE(-50,Headline)};
Headline{SCORE(10)->MARK(Realhl)};
Headline{SCORE(5,10)->LOG("Maybe a headline")};

In the first part of this rule set, annotations of the type Paragraph receive scoring points for a Headline annotation if they fulfill certain CONTAINS conditions. The first condition, for example, evaluates to true if the paragraph contains one up to five words, whereas the fourth condition is fulfilled if the paragraph contains thirty up to eighty percent of Emph annotations. The last two rules finally execute their actions if the score of a Headline annotation exceeds ten points, or lies in the interval of five to ten points, respectively.
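The scoring mechanism can be sketched in Python: MARKSCORE-style rules accumulate points per candidate, and a SCORE-style check compares the accumulated value against a threshold or interval. The function names are assumptions for illustration only.

```python
# Minimal sketch of heuristic scoring: points are accumulated per
# candidate annotation; acceptance depends on a threshold or interval.
scores = {}

def markscore(points, candidate):
    # Corresponds to MARKSCORE(points, Type): add points to the candidate.
    scores[candidate] = scores.get(candidate, 0) + points

def score_in(candidate, lower, upper=None):
    # Corresponds to SCORE(lower) or SCORE(lower, upper).
    value = scores.get(candidate, 0)
    return value >= lower if upper is None else lower <= value <= upper

# A paragraph with few words and many capitalized words:
markscore(5, "paragraph1")   # e.g. CONTAINS(W,1,5) fulfilled
markscore(7, "paragraph1")   # e.g. CONTAINS(CW,50,100,true) fulfilled
print(score_in("paragraph1", 10))     # True: 12 points, threshold reached
print(score_in("paragraph1", 5, 10))  # False: 12 lies outside [5, 10]
```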

2.12. Modification

There are different actions that can modify the input document, like DEL, COLOR and REPLACE. But the input document itself cannot be modified directly. A separate engine, the Modifier.xml, has to be called in order to create another CAS view with the name "modified". In that document, all modifications are executed.


Chapter 3. TextMarker Workbench

3.1. Installation

1. Download, install and start Eclipse 3.5 or Eclipse 3.6.
2. Add the Apache UIMA update site (http://www.apache.org/dist/uima/eclipse-update-site/) and the TextMarker update site (http://ki.informatik.uni-wuerzburg.de/~pkluegl/updatesite/) to the available software sites in your Eclipse installation. This can be achieved in the "Install New Software" dialog in the help menu of Eclipse.
3. Eclipse 3.6: TextMarker is currently based on DLTK 1.0. Therefore, adding the DLTK 1.0 update site (http://download.eclipse.org/technology/dltk/updates-dev/1.0/) is required, since the Eclipse 3.6 update site only supports DLTK 2.0.
4. Select "Install New Software" in the help menu of Eclipse, if not done yet.
5. Select the TextMarker update site at "Work with", deselect "Group items by category" and select "Contact all update sites during install to find required software".
6. Select the TextMarker feature and continue the dialog. The CEV feature is already contained in the TextMarker feature. Eclipse will automatically install the Apache UIMA (version 2.3) plugins and the DLTK Core Framework (version 1.x) plugins.
7. (OPTIONAL) If additional HTML visualizations are desired, then also install the CEV HTML feature. However, you need to install the XPCom and XULRunner features beforehand, for example by using an appropriate update site (http://ftp.mozilla.org/pub/mozilla.org/xulrunner/eclipse/). Please refer to the CEV installation instructions for details.
8. After the successful installation, switch to the TextMarker perspective.

You can also download the TextMarker plugins from SourceForge.net (https://sourceforge.net/projects/textmarker/) and install the plugins mentioned above manually.

3.2. TextMarker Projects

Similar to Java projects in Eclipse, the TextMarker workbench provides the possibility to create TextMarker projects. TextMarker projects require a certain folder structure that is created with the project. The most important folders are the script folder, which contains the TextMarker rule files in a package, and the descriptor folder, which contains the generated UIMA components. The input folder contains the text files or xmiCAS files that will be processed when starting a TextMarker script. The result will be placed in the output folder.

Project element | Used for
Project | the TextMarker project
- script | source folder with TextMarker scripts
-- my.package | the package, resulting in several folders
--- Script.tm | a TextMarker script
- descriptor | build folder for UIMA components
-- my/package | the folder structure for the components
--- ScriptEngine.xml | the analysis engine of the Script.tm script
--- ScriptTypeSystem.xml | the type system of the Script.tm script
-- BasicEngine.xml | the analysis engine template for all generated engines in this project
-- BasicTypeSystem.xml | the type system template for all generated type systems in this project
-- InternalTypeSystem.xml | a type system with TextMarker types
-- Modifier.xml | the analysis engine of the optional modifier that creates the "modified" view
- input | folder that contains the files that will be processed when launching a TextMarker script
-- test.html | an input file containing HTML
-- test.xmi | an input file containing text and annotations
- output | folder that contains the files that were processed by a TextMarker script
-- test.html.modified.html | the result of the modifier: replaced text and colored HTML
-- test.html.xmi | the result CAS with optional information
-- test.xmi.modified.html | the result of the modifier: replaced text and colored HTML
-- test.xmi.xmi | the result CAS with optional information


- resources | default folder for word lists and dictionaries
-- Dictionary.mtwl | a dictionary in the "multi tree word list" format
-- FirstNames.txt | a simple word list with first names: one first name per line
- test | test-driven development (still under construction)

3.3. Explanation

Handcrafting rules is laborious, especially if the newly written rules do not behave as expected. The TextMarker system is able to protocol the application of each single rule and block in order to provide an explanation of the rule inference and a minimal debug functionality. The explanation component is built upon the CEV plugin. The information about the application of the rules itself is stored in the result xmiCAS, if the parameters of the executed engine are configured correctly. The simplest way to generate this information is to open a TextMarker file and click on the common "Debug" button (looks like a green bug) in Eclipse. The current TextMarker file will then be executed on the text files in the input directory, and xmiCAS files are created in the output directory containing the additional UIMA feature structures that describe the rule inference. The resulting xmiCAS needs to be opened with the CEV plugin. However, only additional views are capable of displaying the debug information. In order to open the necessary views, you can either open the "Explain" perspective or open the views separately and arrange them as you like. There are currently seven views that display information about the execution of the rules: Applied Rules, Selected Rules, Rule List, Matched Rules, Failed Rules, Rule Elements and Basic Stream.

3.4. Dictionaries

The TextMarker system currently supports the usage of dictionaries in four different ways. The files are always encoded in UTF-8. The generated analysis engines provide a parameter "resourceLocation" that specifies the folder containing the external dictionary files. The parameter is initially set to the resource folder of the current TextMarker project. In order to use a different folder, change the value of the parameter and rebuild all TextMarker rule files in the project in order to update all analysis engines. The algorithm for the detection of the entries of a dictionary:

for all basic annotations of the matched annotation do
  set current candidate to current basic
  loop
    if the dictionary contains the current candidate then
      remember candidate
    if an entry of the dictionary starts with the current candidate then
      add next basic annotation to the current candidate
      continue loop
    else
      stop loop
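The algorithm above can be made concrete in Python, assuming whitespace-separated tokens play the role of TextMarker's basic annotations (a sketch of the idea, not the actual implementation):

```python
# Runnable sketch of the dictionary scan: starting at each token, the
# candidate is extended as long as some dictionary entry still starts
# with it; complete entries are remembered.
def find_entries(tokens, dictionary):
    found = []
    for start in range(len(tokens)):
        candidate = tokens[start]
        end = start
        while True:
            if candidate in dictionary:
                found.append(candidate)
            # Extend only if a longer entry starts with the candidate.
            if any(e.startswith(candidate) and e != candidate
                   for e in dictionary):
                end += 1
                if end >= len(tokens):
                    break
                candidate = candidate + " " + tokens[end]
            else:
                break
    return found

tokens = "Peter Steinmetz wrote to Martin".split()
print(find_entries(tokens, {"Peter Steinmetz", "Martin"}))
```

Note how the multi-token entry 'Peter Steinmetz' is found by extending the candidate 'Peter' with the next token, exactly as the prefix test in the pseudocode prescribes.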

Word List (.txt)

Word lists are simple text files that contain a term or string in each line. The strings may include white spaces and are separated by line breaks. Usage: Content of a file named FirstNames.txt (located in the resource folder of a TextMarker project):

Peter
Jochen
Joachim
Martin

Exemplary rules:


LIST FirstNameList = 'FirstNames.txt';
DECLARE FirstName;
Document{-> MARKFAST(FirstName, FirstNameList)};

In this example, all first names in the given text file are annotated in the input document with the type FirstName.

Tree Word List (.twl)

A tree word list is a compiled word list similar to a trie. A .twl file is an XML file that contains a tree-like structure with a node for each character. The nodes themselves refer to child nodes that represent all characters succeeding the character of the parent node. For single-word entries, this results in a complexity of O(m*log(n)) instead of a complexity of O(m*n) (simple .txt file), where m is the amount of basic annotations in the document and n is the amount of entries in the dictionary. Usage: .twl files are generated using the popup menu. Select one or more .txt files (or a folder containing .txt files), click the right mouse button and choose "Convert to TWL". Then, one or more .twl files are generated with the according file names. Exemplary rules:

LIST FirstNameList = 'FirstNames.twl';
DECLARE FirstName;
Document{-> MARKFAST(FirstName, FirstNameList)};

In this example, all first names in the given text file are again annotated in the input document with the type FirstName.

Multi Tree Word List (.mtwl)

A multi tree word list is generated from multiple .txt files and contains special nodes: its nodes provide additional information about the original file. The .mtwl files are useful if several different dictionaries are used in a TextMarker file. For five dictionaries, for example, also five MARKFAST rules are necessary. Therefore, the matched text is searched five times, and the complexity is 5 * O(m*log(n)). Using a .mtwl file reduces the complexity to about O(m*log(5*n)). Usage: A .mtwl file is generated using the popup menu. Select one or more .txt files (or a folder containing .txt files), click the right mouse button and choose "Convert to MTWL". A .mtwl file named "generated.mtwl" is then generated that contains the word lists of all selected .txt files. Renaming the .mtwl file is recommended. If, for example, word lists with the names "FirstNames.txt", "Companies.txt" and so on are given, and the generated .mtwl file is renamed to "Dictionary.mtwl", then the following rule annotates all companies and first names in the complete document. Exemplary rules:

LIST Dictionary = 'Dictionary.mtwl';
DECLARE FirstName, Company;
Document{-> TRIE("FirstNames.txt" = FirstName, "Companies.txt" = Company,
  Dictionary, false, 0, false, 0, "")};
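The idea behind a multi tree word list can be illustrated with a small character trie in Python. This is a conceptual sketch only, not the actual .mtwl data format: terminal nodes remember which source word list an entry came from, so one pass over the text can serve several dictionaries at once.

```python
# Character trie whose terminal nodes carry the name of the originating
# word list, mimicking the "special nodes" of a multi tree word list.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.source = None  # originating word list, if this node ends an entry

def insert(root, word, source):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.source = source

def lookup(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.source

root = TrieNode()
insert(root, "Peter", "FirstNames.txt")
insert(root, "Siemens", "Companies.txt")
print(lookup(root, "Peter"))  # -> FirstNames.txt
print(lookup(root, "Paul"))   # -> None
```

One lookup per candidate word answers both "is this an entry?" and "which dictionary does it belong to?", which is why a single .mtwl can replace several MARKFAST rules.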

Table (.csv)

The TextMarker system also supports .csv files, i.e., tables. Usage: Content of a file named TestTable.csv (located in the resource folder of a TextMarker project):

Peter;P;
Jochen;J;
Joba;J;

Exemplary rules:

PACKAGE de.uniwue.tm;
TABLE TestTable = 'TestTable.csv';
DECLARE Annotation Struct (STRING first);
Document{-> MARKTABLE(Struct, 1, TestTable, "first" = 2)};


In this example, the document is searched for all occurrences of the entries of the first column of the given table; an annotation of the type Struct is created, and its feature "first" is filled with the entry of the second column. For an input document with the content "Peter", the result is a single annotation of the type Struct with "P" assigned to its feature "first".

3.5. Parameters

• mainScript (String): This is the TextMarker script that will be loaded and executed by the generated engine. The string references the name of the file without file extension, but with its complete namespace, e.g., my.package.Main.

• scriptPaths (Multiple Strings): The given strings specify the folders that contain TextMarker script files, in particular the main script file and the additional script files. Currently, only one folder is supported in the TextMarker workbench (script).

• enginePaths (Multiple Strings): The given strings specify the folders that contain additional analysis engines that are called from within a script file. Currently, only one folder is supported in the TextMarker workbench (descriptor).

• resourcePaths (Multiple Strings): The given strings specify the folders that contain the word lists and dictionaries. Currently, only one folder is supported in the TextMarker workbench (resources).

• additionalScripts (Multiple Strings): This parameter contains a list of all known script files referenced with their complete namespace, e.g., my.package.AnotherOne.

• additionalEngines (Multiple Strings): This parameter contains a list of all known analysis engines.

• additionalEngineLoaders (Multiple Strings): This parameter contains the class names of the implementations that help to load more complex analysis engines.

• scriptEncoding (String): The encoding of the script files. Not yet supported; please use UTF-8.

• defaultFilteredTypes (Multiple Strings): The complete names of the types that are filtered by default.

• defaultFilteredMarkups (Multiple Strings): The names of the markups that are filtered by default.

• seeders (Multiple Strings):

• useBasics (String):

• removeBasics (Boolean):

• debug (Boolean):

• profile (Boolean):

• debugWithMatches (Boolean):

• statistics (Boolean):

• debugOnlyFor (Multiple Strings):


• style (Boolean):

• styleMapLocation (String):

3.6. Query

The query view can be used to write queries with the TextMarker language on several documents within a folder. A short example of how to use the Query view:

• In the first field, "Query Data", the folder in which the query is executed is added, for example with drag and drop from the script explorer. If the checkbox is activated, then all subfolders will be included in the query.

• The next field, "Type System", must contain a type system or a TextMarker script that specifies all types that are used in the query.

• The query, in form of one or more TextMarker rules, is specified in the text field in the middle of the view. In the example of the screenshot, all "Author" annotations are selected that contain a "FalsePositive" or "FalseNegative" annotation.

• If the start button near the tab of the view in the upper right corner is pressed, then the results are displayed.

3.7. Views

3.7.1. Annotation Browser

3.7.2. Annotation Editor

3.7.3. Marker Palette

3.7.4. Selection


3.7.5. Basic Stream

The basic stream contains a listing of the complete disjunct partition of the document by the TextMarkerBasic annotations that are used for the inference and the annotation seeding.

3.7.6. Applied Rules

The Applied Rules view displays how often a rule tried to apply and how often the rule succeeded. Additionally, some profiling information is added after a short verbalization of the rule. The information is structured: if BLOCK constructs were used in the executed TextMarker file, the rules contained in that block will be represented as child nodes in the tree of the view. Each TextMarker file is itself a BLOCK construct named after the file. Therefore, the root node of the view is always a BLOCK containing the rules of the executed TextMarker script. Additionally, if a rule calls a different TextMarker file, then the root block of that file is a child of that rule. The selection of a rule in this view will directly change the information visualized in the other views.

3.7.7. Selected Rules

This view is very similar to the Applied Rules view, but displays only rules and blocks under a given selection. If the user clicks on the document, then an Applied Rules view is generated containing only elements that affect that position in the document. The Rule Elements view then only contains match information of that position, but the result of the rule element match is still displayed.

3.7.8. Rule List

This view is very similar to the Applied Rules view and the Selected Rules view, but displays only rules and NO blocks under a given selection. If the user clicks on the document, then a list of rules is generated that matched or tried to match at that position in the document. The Rule Elements view then only contains match information of that position, but the result of the rule element match is still displayed. Additionally, this view provides a text field for filtering the rules. Only those rules remain that contain the entered text in their verbalization.

3.7.9. Matched Rules

If a rule is selected in the Applied Rules view, then this view displays the instances (text passages) where this rule matched.

3.7.10. Failed Rules

If a rule is selected in the Applied Rules view, then this view displays the instances (text passages) where this rule failed to match.

3.7.11. Rule Elements

If a successful or failed rule match is selected in the Matched Rules view or the Failed Rules view, then this view contains a listing of the rule elements and their conditions. Detailed information is available on what text each rule element matched and which conditions evaluated to true.

3.7.12. Statistics

This view displays the conditions and actions of the TextMarker language that were used. Three numbers are given for each element: the total time of execution, the number of executions, and the time per execution.

3.7.13. False Positive

3.7.14. False Negative

3.7.15. True Positive

These three views list the corresponding annotations (false positives, false negatives, and true positives) of an opened evaluation result file in a hierarchic tree structure (see Section 3.8.1.2, Result Views).

3.8. Testing

The TextMarker software comes bundled with its own testing environment that allows you to test and evaluate TextMarker scripts. It provides full back-end testing capabilities and allows you to examine test results in detail. As a product of the testing operation, a new document file is created, and detailed information on how well the script performed in the test is added to this document.

3.8.1. Overview

The testing procedure compares a previously annotated gold standard file with the result of the selected TextMarker script, using an evaluator. The evaluators compare the offsets of annotations in both documents and, depending on the evaluator, mark a result document with true positive, false positive, or false negative annotations. Afterwards, the F1 score is calculated for the whole set of tests, for each test file, and for each type in the test file. The testing environment contains the following parts:

• Main view

• Result views: the true positive, false positive, and false negative views

• Preference page

All control elements that are needed for interaction with the testing environment are located in the main view. This is also where test files can be selected and where information on how well the script performed is displayed. During the testing process, a result CAS file is produced that contains new annotation types like true positives (tp), false positives (fp), and false negatives (fn). While displaying the result .xmi file in the script editor, additional views allow easy navigation through the new annotations. Additional tree views, like the true positive view, display the corresponding annotations in a hierarchic structure. This allows easy tracing of the results inside the testing document. A preference page allows customization of the behavior of the testing plug-in.

3.8.1.1. Main View

The following picture shows a close-up of the testing environment's main view. The toolbar contains all buttons needed to operate the plug-in. The first line shows the name of the script that is going to be tested and a combo box where the view that should be tested is selected. To the right follow fields that show some basic information on the results of the test run. Below, on the left, the test list is located. This list contains the different test files. Right beside it, you will find a table with statistical information. It shows the total tp, fp, and fn counts, as well as precision, recall, and F1 score, for every test file and for every type in each file.

3.8.1.2. Result Views

These views add additional information to the CAS view once a result file is opened. Each view displays one of the following annotation types in a hierarchic tree structure: true positives, false positives, and false negatives. Adding a check mark to one of the annotations in a result view highlights the annotation in the CAS Editor.

3.8.1.3. Preference Page

The preference page offers a few options that modify the plug-in's general behavior. For example, the preloading of previously collected result data can be turned off, should it produce too long a loading time. An important option in the preference page is the selection of the evaluator. By default, the "exact evaluator" is selected, which compares the offsets of the annotations contained in the file produced by the selected script with the annotations in the test file. Other evaluators compare annotations in a different way.

3.8.1.4. The TextMarker Project Structure

The picture shows the TextMarker script explorer. Every TextMarker project contains a folder called "test". This folder is the default location for the test files. In this folder, each script file has its own sub-folder with a relative path equal to the script's package path in the "script" folder. This sub-folder contains the test files. In every script's test folder you will also find a result folder with the results of the tests. Should you use test files from another location in the file system, the results are saved in the "temp" sub-folder of the project's "test" folder. All files in the "temp" folder are deleted once Eclipse is closed.

3.8.2. Usage

This section demonstrates how to use the testing environment. It shows the basic actions needed to perform a test run.

Preparing Eclipse: The testing environment provides its own perspective called "TextMarker Testing". It displays the main view as well as the different result views on the right-hand side. Using this perspective is encouraged, especially when working with the testing environment for the first time.

Selecting a script for testing: TextMarker always tests the script that is currently open in the script editor. Should another editor be open, for example a Java editor with some Java class being displayed, you will see that the testing view is not available.

Creating a test file: A test file is a previously annotated .xmi file that can be used as a gold standard for the test. No additional tools are provided to create such a file; instead, the TextMarker system itself already provides the needed tools.

Selecting a test file: Test files can be added to the test list by simply dragging them from the Script Explorer into the test-file list. Depending on the settings in the preference page, test files from a script's "test" folder might already be loaded into the list. A different way to add test files is to use the "Add files from folder" button, which adds all .xmi files from a selected folder. The "del" key can be used to remove files from the test list.

Selecting a CAS view to test: TextMarker supports different views that allow you to operate on different levels of a document. The InitialView is selected by default; however, you can also switch the evaluation to another view by typing the view's name into the list or by selecting the view you wish to use from the list.

Selecting the evaluator: The testing environment supports different evaluators that allow a sophisticated analysis of the behavior of a TextMarker script. The evaluator can be chosen on the testing environment's preference page. The preference page can be opened either through the menu or by clicking the blue preference button in the testing view's toolbar. The default evaluator is the "Exact CAS Evaluator", which compares the offsets of the annotations between the test file and the file annotated by the tested script.

Excluding types: During a test run it might be convenient to disable testing for specific types like punctuation or tags. The "exclude types" button opens a dialog where all types can be selected that should not be considered in the test.

Running the test: A test run can be started by clicking the green start button in the toolbar.

Result overview: The testing main view displays some information on how well the script did after every test run. It displays the overall number of true positive, false positive, and false negative annotations of all result files, as well as an overall F1 score. Furthermore, a table is displayed that contains the overall statistics of the selected test file, as well as statistics for every single type in the test file. The displayed information comprises true positives, false positives, false negatives, precision, recall, and F1 measure.
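The statistics in this overview follow the standard definitions of precision, recall, and F1 derived from the tp, fp, and fn counts. As an illustration only (this code is not part of TextMarker), they can be computed like this:

```python
def evaluation_scores(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from annotation counts.

    Guards against empty denominators so that a run with no
    annotations at all yields zeros instead of a division error.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, 8 true positives with 2 false positives and 2 false negatives yield precision 0.8, recall 0.8, and F1 0.8.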

The testing environment also supports exporting the overall data in the form of a comma-separated table. Clicking "export evaluation data" opens a dialog window that contains this table. The text in this table can be copied and easily imported into OpenOffice.org or MS Excel.

Result files: When running a test, the evaluator creates a new result .xmi file and adds new true positive, false positive, and false negative annotations. By clicking on a file in the test-file list, you can open the corresponding result .xmi file in the TextMarker script editor. When opening a result file in the script explorer, additional views open that allow easy access and browsing of the additional debugging annotations.

3.8.3. Evaluators

When testing a CAS file, the system compares the offsets of the annotations of a previously annotated gold standard file with the offsets of the annotations of the result file the script produced. Evaluators are responsible for comparing the annotations in the two CAS files. These evaluators implement different methods and strategies for comparing the annotations. An extension point is also provided that allows easy implementation of new evaluators.

Exact Match Evaluator: The Exact Match Evaluator compares the offsets of the annotations in the result and the gold standard file. Any difference is marked with either a false positive or a false negative annotation.
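As a sketch of this offset comparison (an illustration, not the actual evaluator implementation), annotations of one type can be modeled as (begin, end) pairs and classified with simple set operations:

```python
def exact_match(gold, result):
    """Classify result spans against gold spans by identical offsets.

    gold, result: iterables of (begin, end) offset pairs of one type.
    Returns the sets of true positive, false positive, and false
    negative spans.
    """
    gold, result = set(gold), set(result)
    tp = gold & result    # offsets agree exactly
    fp = result - gold    # produced by the script but not in the gold file
    fn = gold - result    # in the gold file but missed by the script
    return tp, fp, fn
```

A partial-match variant would relax the equality test to tolerate deviating begin or end offsets instead of requiring identical pairs.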

Partial Match Evaluator: The Partial Match Evaluator compares the offsets of the annotations in the result and gold standard file, but allows differences at the beginning or the end of an annotation. For example, "corresponding" and "corresponding " are not marked as an error.

Core Match Evaluator: The Core Match Evaluator accepts annotations that share a core expression. In this context, a core expression is at least four characters long and starts with a capitalized letter. For example, the two annotations "L404-123-421" and "L404-321-412" would be considered a true positive match, because "L404" is a core expression contained in both annotations.
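The core-expression idea can be sketched as follows; the regular expression below is an assumption made for illustration, and the evaluator's exact definition may differ:

```python
import re

# Assumed definition of a core expression: a token of at least four
# characters that starts with a capital letter (e.g. "L404").
CORE = re.compile(r"[A-Z][A-Za-z0-9]{3,}")

def share_core(a: str, b: str) -> bool:
    """True if the two annotation texts share at least one core expression."""
    return bool(set(CORE.findall(a)) & set(CORE.findall(b)))
```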

Word Accuracy Evaluator: This evaluator compares the labels of all words/numbers in an annotation, where the label equals the type of the annotation. As a consequence, each word or number that is not part of the annotation is counted as a single false negative. For example, take the sentence "Christmas is on the 24.12 every year." The script labels "Christmas is on the 12" as a single sentence, while the test file labels the sentence correctly with a single sentence annotation. Whereas, for example, the Exact CAS Evaluator would only assign a single false negative annotation, the Word Accuracy Evaluator marks every word or number as a single false negative.
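The per-word counting can be sketched like this (an illustration only; the token index sets stand in for the words/numbers covered by an annotation type):

```python
def word_level_counts(gold_tokens, result_tokens):
    """Count word-level agreement for one annotation type.

    gold_tokens, result_tokens: sets of token indices covered by the
    type in the gold standard and in the script's result, respectively.
    Every token-level disagreement is counted individually.
    """
    tp = len(gold_tokens & result_tokens)   # tokens labeled in both
    fp = len(result_tokens - gold_tokens)   # tokens wrongly labeled
    fn = len(gold_tokens - result_tokens)   # tokens the script missed
    return tp, fp, fn
```

For a seven-token gold sentence of which the script covers only the first five tokens, this yields two false negatives, one per missed token.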

Template Only Evaluator: This evaluator compares the offsets of the annotations and the features that have been created by the script. For example, the text "Alan Mathison Turing" is marked with the author annotation, and "author" contains two features: "FirstName" and "LastName". If the script now creates an author annotation with only one feature, the annotation is marked as a false positive.

Template on Word Level Evaluator: The Template On Word Evaluator compares the offsets of the annotations. In addition, it also compares the features, the feature structures, and the values stored in the features. For example, the annotation "author" might have features like "FirstName" and "LastName". The author's name is "Alan Mathison Turing" and the script correctly assigns the author annotation. The features assigned by the script are "FirstName: Alan" and "LastName: Mathison", while the correct feature values would be "FirstName: Alan" and "LastName: Turing". In this case, the evaluator marks the annotation as a false positive, since the feature values differ.
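The difference between the two template evaluators can be sketched as follows, modeling an annotation as offsets plus a feature dictionary (an illustration, not the actual evaluator code):

```python
def template_only_equal(gold, result):
    """Offsets and the *set* of feature names must agree.

    gold, result: (begin, end, features) tuples, features as a dict.
    """
    return gold[:2] == result[:2] and gold[2].keys() == result[2].keys()

def template_values_equal(gold, result):
    """Offsets, feature names, and feature *values* must all agree."""
    return gold[:2] == result[:2] and gold[2] == result[2]
```

For the Turing example above, the first check accepts the script's annotation (both carry a "FirstName" and a "LastName" feature), while the second rejects it because the "LastName" values differ.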

3.9. TextRuler

Using the knowledge engineering approach, a knowledge engineer normally writes handcrafted rules to create a domain-dependent information extraction application, often supported by a gold standard. When starting the engineering process for the acquisition of extraction knowledge for possibly new slots or, more generally, for new concepts, machine learning methods are often able to offer support in an iterative engineering process. This section gives a conceptual overview of the process model for the semi-automatic development of rule-based information extraction applications.

First, a suitable set of documents that contain the text fragments with interesting patterns needs to be selected and annotated with the target concepts. Then, the knowledge engineer chooses and configures the methods for automatic rule acquisition to the best of his knowledge for the learning task: lambda expressions based on tokens and linguistic features, for example, differ in their application domain from wrappers that process generated HTML pages.

Furthermore, parameters like the window size defining relevant features need to be set to an appropriate level. Before the annotated training documents form the input of the learning task, they are enriched with features generated by the partial rule set of the developed application. The results of the methods, that is, the learned rules, are proposed to the knowledge engineer for the extraction of the target concept.

The knowledge engineer has different options to proceed: if the quality, amount, or generality of the presented rules is not sufficient, then additional training documents need to be annotated or additional rules have to be handcrafted to provide more features in general or more appropriate features. Rules or rule sets of high quality can be modified, combined, or generalized and transferred to the rule set of the application in order to support the extraction task of the target concept. In case the methods did not learn reasonable rules at all, the knowledge engineer proceeds with writing handcrafted rules.

Having gathered enough extraction knowledge for the current concept, the semi-automatic process is iterated, and the focus moves to the next concept until the development of the application is completed.

3.9.1. Available Learners

Overview:

Name       | Strategy                  | Document     | Slots            | Status
BWI (1)    | Boosting, Top Down        | Struct, Semi | Single, Boundary | Planning
LP2 (2)    | Bottom Up Cover           | All          | Single, Boundary | Prototype
RAPIER (3) | Top Down/Bottom Up Compr. | Semi         | Single           | Experimental
WHISK (4)  | Top Down Cover            | All          | Multi            | Prototype
WIEN (5)   | CSP                       | Struct       | Multi, Rows      | Prototype

• Strategy: The strategies used by the learning methods are commonly coverage algorithms.
• Document: The type of the document may be "free" like in newspapers, "semi", or "struct" like HTML pages.
• Slots: A slot refers to a single annotation that represents the goal of the learning task. Some rules are able to create several annotations at once in the same context (multi-slot). However, only single slots are supported by the current implementations.
• Status: The current status of the implementation in the TextRuler framework.

Publications

(1) Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. In AAAI/IAAI, pages 577–583, 2000.

(2) F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic Constraints. Technical Report CS-03-07, Department of Computer Science, University of Sheffield, Sheffield, 2003.

(3) Mary Elaine Califf and Raymond J. Mooney. Bottom-up Relational Learning of Pattern Matching Rules for Information Extraction. Journal of Machine Learning Research, 4:177–210, 2003.

(4) Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34, pages 233–272, 1999.

(5) N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 1997.

BWI: BWI (Boosted Wrapper Induction) uses boosting techniques to improve the performance of simple pattern-matching single-slot boundary wrappers (boundary detectors). Two sets of detectors are learned: the "fore" and the "aft" detectors. Weighted by their confidences and combined with a slot-length histogram derived from the training data, they can classify a given pair of boundaries within a document. BWI can be used for structured, semi-structured, and free text. The patterns are token-based with special wildcards for more general rules.

Implementations: No implementations are available yet.

Parameters: No parameters are available yet.

LP2: This method operates on all three kinds of documents. It learns separate rules for the beginning and the end of a single slot. So-called tagging rules insert boundary SGML tags, and additionally induced correction rules shift misplaced tags to their correct positions in order to improve precision. The learning strategy is a bottom-up covering algorithm. It starts by creating a specific seed instance with a window of w tokens to the left and right of the target boundary and searches for the best generalization. Other linguistic NLP features can be used in order to generalize over the flat word sequence.

Implementations: LP2 (naive); LP2 (optimized)

Parameters: Context Window Size (to the left and right); Best Rules List Size; Minimum Covered Positives per Rule; Maximum Error Threshold; Contextual Rules List Size

RAPIER: RAPIER induces single-slot extraction rules for semi-structured documents. The rules consist of three patterns: a pre-filler, a filler, and a post-filler pattern. Each can hold several constraints on tokens and their corresponding POS-tag and semantic information. The algorithm uses a bottom-up compression strategy, starting with a most specific seed rule for each training instance. This initial rule base is compressed by randomly selecting rule pairs and searching for the best generalization. Considering two rules, the least general generalization (LGG) of the slot fillers is created and specialized by adding rule items to the pre- and post-filler until the new rules operate well on the training set. The best of the k rules (k-beam search) is added to the rule base, and all empirically subsumed rules are removed.

Implementations: RAPIER

Parameters: Maximum Compression Fail Count; Internal Rules List Size; Rule Pairs for Generalizing; Maximum 'No improvement' Count; Maximum Noise Threshold; Minimum Covered Positives Per Rule; PosTag Root Type; Use All 3 GenSets at Specialization

WHISK: WHISK is a multi-slot method that operates on all three kinds of documents and learns single- or multi-slot rules that look similar to regular expressions. The top-down covering algorithm begins with the most general rule and specializes it by adding single rule terms until the rule makes no errors on the training set. Domain-specific classes or linguistic information obtained by a syntactic analyzer can be used as additional features. The exact definition of a rule term (e.g. a token) and of a problem instance (e.g. a whole document or a single sentence) depends on the operating domain and document type.
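The top-down covering strategy described above can be sketched in a highly simplified form. Modeling rules as sets of required terms is an illustrative assumption, not WHISK's actual rule representation:

```python
def covers(rule, example):
    """A 'rule' is a set of required terms; it covers an example
    (also a set of terms) when all its terms occur in the example."""
    return rule <= example

def learn_rule(positives, negatives, candidate_terms):
    """Specialize the most general rule (the empty term set) by
    greedily adding the term that keeps the most positive examples
    while discarding negatives, until no negative is covered."""
    rule = set()
    while any(covers(rule, n) for n in negatives):
        best = max(candidate_terms - rule,
                   key=lambda t: sum(covers(rule | {t}, p) for p in positives)
                               - sum(covers(rule | {t}, n) for n in negatives))
        rule.add(best)
    return rule
```

This mirrors the strategy only in outline: the real learner's specialization operators, stopping criteria, and error tolerance are more elaborate.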

Implementations: WHISK (token); WHISK (generic)

Parameters: Window Size; Maximum Error Threshold; PosTag Root Type

WIEN: WIEN is the only method listed here that operates on highly structured texts only. It induces so-called wrappers that anchor the slots by the structured context around them. The HLRT (head left right tail) wrapper class, for example, can determine and extract several multi-slot templates by first separating the important information block from unimportant head and tail portions and then extracting multiple data rows from table-like data structures in the remaining document. Inducing a wrapper is done by solving a CSP for all possible pattern combinations from the training data.

Implementations: WIEN

Parameters: No parameters are available.
