CIS 702 Communication/Information Technologies (CIT) Philip Robbins – March 7, 2013 Dr. Luz...

CIS 702 Communication/Information Technologies (CIT)

Philip Robbins – March 7, 2013
Dr. Luz Quiroga, Ph.D.

Chapter 6
Documents: Language & Properties

Communication & Information Sciences Ph.D. Program
University of Hawai'i at Mānoa

Teaching Session #9

1

Documents: Language & Properties

Chapter Contents
• Metadata
• Document Formats
• Markup Languages
• Text Properties
• Document Preprocessing
• Organizing Documents
• Text Compression

2

Introduction

Document
• Denotes a single unit of information
• Structure and a syntax
• Semantics, specified by the author
• Presentation style

3

Introduction

4

Introduction

Document Syntax
• Expresses structure, presentation style, semantics
• Can be implicit in the document's content
• Can be expressed in a simple declarative language
• Can be expressed in a programming language

Text
• Can be written in natural language (hard to process)

5

Introduction

Document Style
• How a document is visualized or printed
• Can be embedded in the document, e.g., RTF files
• Can be complemented by macros

6

Introduction

Queries
• Short pieces of text
• Differ from normal text
• Semantics often ambiguous due to polysemy
• User intent behind a query is not easy to infer

7

Metadata

Metadata
• Data about data
• Information on the organization of the data, various data domains, and their relationship
• Metadata is associated with most documents

8

Metadata

Descriptive Metadata
• External to the meaning of the document and pertains more to how it was created:
  – Author of the text
  – Date of publication
  – Source of the publication
  – Document length

9

Metadata

Semantic Metadata
• Characterizes the subject matter within the document contents
• Associated with a wide number of documents
• Availability is increasing

10

Metadata

Metadata Format
• Machine Readable Cataloging Record (MARC)
• Format used for most library records
• Includes fields for distinct attributes of a bibliographic entry such as title, author, publication venue

11

Metadata

Metadata in Web Documents
• Increase in web data has led to adding metadata information to web pages:
  – Cataloging and content rating
  – Intellectual property rights and digital signatures
  – Electronic commerce

12

Metadata

Resource Description Framework (RDF)
• New standard for Web metadata
• Allows describing Web resources to facilitate automated processing
• Does not assume any particular application or semantic domain
• Consists of a description of nodes and attached attribute/value pairs

13

Text

Text
• Computers represent characters in binary, which is done through coding schemes:
  – ASCII (7 bits; extended variants use 8 bits)
  – EBCDIC (8 bits)
  – Unicode (originally 16 bits)
• IR systems should be able to retrieve information from many text formats (doc, pdf, html, txt)
• IR systems have filters to handle most document formats (this might not be possible with proprietary formats)

14

Text

Text Formats
• For document exchange: Rich Text Format (RTF)
• For printing and displaying: Portable Document Format (PDF)
• For printing and displaying: PostScript (PS)

15

Text

Interchange Formats
• For encoding email: Multipurpose Internet Mail Extensions (MIME)
• For compressing text: ZIP

16

Multimedia

Multimedia
• For applications that handle different types of data:
  – Text
  – Sounds
  – Images
  – Video
• Different formats are necessary for storing each medium

17

Images

Image Formats
• Simplest image formats are direct representations of a bit-mapped display: XBM, BMP, PCX
• These formats have lots of redundancy and can be compressed efficiently: GIF

18

Images

Lossy Compression
• Used to improve compression ratios
• Uncompressing a compressed image does not yield exactly the original image
• Joint Photographic Experts Group (JPEG)
• Eliminates parts of the image that have less impact on the human eye
• Parametric format: the loss can be tuned

19

Images

Interchange Formats for Images
• Tagged Image File Format (TIFF)
• Provides for metadata, compression, and a varying number of colors
• De facto standard for images on the Web:
  – Portable Network Graphics (PNG)

20

Audio

Audio Formats
• Audio is digitized
• MIDI is the standard format to interchange music between electronic instruments and computers
• AU, WAVE

21

Movies

Movie Formats
• Works by coding changes in consecutive frames
• Takes advantage of temporal image redundancy
• Includes audio signal associated with the video
• Audio: MP3, Video: MP4
• AVI, FLI, QuickTime

22

Graphics

Format for 3-D Graphics
• Computer Graphics Metafile (CGM)
• Virtual Reality Modeling Language (VRML)
• VRML is the universal interchange format for 3-D graphics and multimedia

23

Markup

Markup Languages
• Defined as extra syntax used to describe formatting actions, structure information, text semantics, attributes
• XML: eXtensible Markup Language
• HTML: HyperText Markup Language
• SGML: Standard Generalized Markup Language

24

Markup

Standard Generalized Markup Language (SGML)
• ISO 8879
• Meta-language for tagging text
• Provides rules for defining a markup language based on tags
• Includes a description of the document structure: the “document type definition”
• An SGML document is defined by a document type definition together with the text itself, marked with tags describing the structure

25

Markup

SGML Document Type Definition
• Describes the pieces that a document is composed of
• Defines how those pieces relate to each other
• Part of the definition can be specified by an SGML Document Type Declaration (DTD)
• Other parts (e.g., semantics of elements and attributes) cannot be expressed formally in SGML

26

Markup

SGML Document Type Definition

27

Markup

SGML Document Type Definition

28

Markup

SGML
• Tags are denoted by angle brackets < >
• Used to identify the beginning and ending of an element
• Ending tags include a slash before the tag name
• Attributes are specified inside the beginning tag

29

Markup

SGML
• The document description does not specify how a document is printed
• Output specifications are added to SGML documents:
  – DSSSL: Document Style Semantics and Specification Language
  – FOSI: Formatting Output Specification Instance
• These standards define mechanisms for associating style information with SGML document instances
• They allow specifying that data identified by a tag should be typeset in a particular font

30

Markup

HyperText Markup Language (HTML)
• Instance of SGML
• Created in 1992
• Latest version is 4.0 (HTML5 under development)
• Includes support for style sheets, frames, tables, forms, etc.
• Backwards compatible
• Most documents on the Web are stored and transmitted in HTML
• HTML tags follow all SGML conventions and include formatting directives

31

Markup

HyperText Markup Language (HTML)
• Can have media embedded within, such as images or audio
• Has fields for metadata
• Adding programs (e.g., JavaScript) inside a web page makes it dynamic (hence dynamic HTML)

32

Markup

HyperText Markup Language (HTML)

33

Markup

HyperText Markup Language (HTML)

34

Markup

Cascading Style Sheets (CSS)
• Because HTML does not fix a presentation style, CSS was introduced
• 1997
• Way for authors to improve the aesthetics of HTML pages
• Information about presentation is separate from document content
• Support for CSS in current browsers is still modest

35

Markup

eXtensible Markup Language (XML)
• Is a simplified subset of SGML
• Not a markup language (like HTML) but a meta-language (like SGML)
• Allows human-readable semantic markup, which is also machine-readable
• Does not have the restrictions of HTML
• Allows any user to define new tags
• Imposes more rigid rules on the syntax:
  – Ending tags cannot be omitted
  – Distinguishes upper and lower case
  – Attribute values must be in quotes
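
The stricter syntax rules for XML can be checked with any XML parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the `<book>` vocabulary and the helper name `is_well_formed` are made up for illustration and do not come from the chapter.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# User-defined tags: nothing here is predefined by XML itself.
valid = '<book lang="en"><title>Modern IR</title></book>'

# Each variant below breaks one of the three syntax rules.
missing_end_tag = '<book lang="en"><title>Modern IR</book>'
case_mismatch = '<book lang="en"><title>Modern IR</Title></book>'
unquoted_attribute = '<book lang=en><title>Modern IR</title></book>'

for doc in (valid, missing_end_tag, case_mismatch, unquoted_attribute):
    print(is_well_formed(doc), doc)
```

Only the first document should be accepted; the other three are rejected because of an omitted ending tag, a case mismatch between tags, and an unquoted attribute value, respectively.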

36

Markup

eXtensible Style Sheet Language (XSL)
• The XML counterpart of Cascading Style Sheets (CSS)
• Syntax based on XML
• Designed to transform and style highly structured, data-rich documents written in XML
• e.g., with XML it would be possible to automatically extract a table of contents from a document

37

Markup

Hypermedia/Time-based Structuring Language (HyTime)
• SGML architecture that specifies the generic hypermedia structure of documents
• Includes complex locating of document objects
• Includes relationships (hyperlinks) between document objects
• Includes numeric, measured associations between document objects
• Does not specify graphical interfaces, user navigation, or user interaction

38

Theory

Information Theory
• It is difficult to formally capture how much information there is in a given text
• However, the distribution of symbols is related to it
• A text where one symbol appears almost all the time does not convey much information
• Information Theory defines a special concept, entropy, to capture information content
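
Entropy is commonly defined as $E = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability of symbol $i$. The sketch below estimates it from the character frequencies of a string; the function name `entropy` and the sample strings are illustrative assumptions, and the estimate treats symbols as independent.

```python
import math
from collections import Counter


def entropy(text: str) -> float:
    """Empirical entropy of `text` in bits per symbol."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # E = sum(p_i * log2(1 / p_i)), which equals -sum(p_i * log2(p_i))
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(entropy("aaaaaaaa"))  # a single repeated symbol
print(entropy("abababab"))  # two equally likely symbols
print(entropy("abcdabcd"))  # four equally likely symbols
```

A text made of one repeated symbol gets entropy 0, matching the observation that it conveys little information, while more evenly distributed symbols raise the value.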

39

Theory

Entropy

40

Theory

Entropy

41

Theory

Modeling Natural Language
• We can divide the symbols of a text into two disjoint subsets:
  – Symbols that separate words
  – Symbols that belong to words
• Symbols are not uniformly distributed in a text
• e.g., in English, vowels are usually more frequent than most consonants

42

Theory

Modeling Natural Language
• A simple model to generate text is the binomial model, in which each symbol is generated with a certain probability
• In natural language, however, the probability of a symbol depends on the previous symbols
• e.g., in English, the letter f cannot appear after the letter c
• A finite-context or Markovian model can be used to reflect this dependency
• Second issue: how the different words are distributed inside each document

43

Theory

Zipf’s Law

44

Theory

45

Theory

46

Modeling Natural Language
• Words arranged in decreasing order of their frequencies

Theory

47

Modeling Natural Language
• Words arranged in decreasing order of their frequencies
• Distribution of words is very skewed
• Words that are too frequent (“stopwords”) can be disregarded
• A stopword is a word that does not carry meaning on its own in natural language
• e.g., stopwords in English: a, the, by, and
• Therefore, roughly half of the word occurrences in a text do not need to be considered
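
The ranking of words by decreasing frequency, and the share of a text taken up by stopwords, can be sketched as follows. The sample sentence, the four-word stopword list, and the function names are illustrative assumptions; a real collection and a longer stopword list would be needed to observe the skew that Zipf's law describes.

```python
from collections import Counter

# Tiny illustrative stopword list; real systems use much longer lists.
STOPWORDS = {"a", "the", "by", "and"}


def rank_words(text: str) -> list:
    """Return (word, frequency) pairs in decreasing order of frequency."""
    words = text.lower().split()
    return Counter(words).most_common()


def stopword_share(text: str) -> float:
    """Fraction of word occurrences that are stopwords."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


sample = "the cat and the dog sat by the door and a window"
for rank, (word, freq) in enumerate(rank_words(sample), start=1):
    print(rank, word, freq)
print(stopword_share(sample))
```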

Theory

48

Modeling Natural Language
• Third issue: distribution of words in the documents of a collection
• Simple model: consider that each word appears the same number of times in every document (not true in practice)
• Better model: use a negative binomial distribution

Theory

49

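Heaps’ Law
• Fourth issue: number of distinct words in a document (document vocabulary)
• To predict the growth of vocabulary size in natural language text, Heaps’ law states that a text of $n$ words has a vocabulary of size $V = K n^{\beta} = O(n^{\beta})$, with $0 < \beta < 1$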

Theory

50

Modeling Natural Language
• Vocabulary size grows sub-linearly with text size
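
Sub-linear vocabulary growth can be observed by counting distinct words while scanning a text, and compared with the usual form of Heaps' law, $V = K n^{\beta}$. The sketch below is illustrative only: the function names are hypothetical, and the values of `k` and `beta` are placeholders, not parameters fitted to any collection.

```python
def vocabulary_growth(tokens):
    """Yield the number of distinct words seen after each token."""
    seen = set()
    for token in tokens:
        seen.add(token)
        yield len(seen)


def heaps_estimate(n: int, k: float, beta: float) -> float:
    """Heaps' law prediction V = K * n**beta for a text of n words."""
    return k * n ** beta


tokens = "to be or not to be that is the question".split()
print(list(vocabulary_growth(tokens)))

# k and beta are placeholder values chosen only for illustration.
print(heaps_estimate(1_000_000, k=50.0, beta=0.5))
```

In the toy sentence the vocabulary stops growing whenever a word repeats, which is the effect that makes the growth sub-linear on large texts.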

Theory

51

Modeling Natural Language
• The set of different words of a language is bounded by a constant
• However, the limit is so high that it is common to assume the size of the vocabulary is $O(n^{\beta})$ rather than $O(1)$
• Many argue that the number keeps growing anyway because of typing and spelling errors
• As the total text size grows, the predictions of the model become more accurate

Theory

52

Text Similarity
• Similarity is measured by a distance function
• Hamming distance: for strings of the same length, the distance between them is the number of positions with different characters (the distance is 0 if they are equal)
• A distance function should be symmetric and satisfy the triangle inequality: $d(a, c) \le d(a, b) + d(b, c)$
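
A minimal sketch of the Hamming distance as described above; the function name and the example strings are illustrative choices, not taken from the chapter.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("color", "colon"))
print(hamming_distance("karolin", "kathrin"))
print(hamming_distance("text", "text"))
```

The function is symmetric by construction, since swapping the arguments compares the same pairs of characters.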

Theory

53

Text Similarity
• Levenshtein “edit” distance: the minimal number of character insertions, deletions, and substitutions needed to make two strings equal
• Edit distance between color and colour is 1
• Edit distance between survey and surgery is 2
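
The edit distance is usually computed with dynamic programming. The following is a sketch of the standard row-by-row formulation, assuming unit cost for every insertion, deletion, and substitution; the function name is an illustrative choice.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, substitute."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(edit_distance("color", "colour"))
print(edit_distance("survey", "surgery"))
```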

Theory

54

Text Similarity
• Longest Common Subsequence (LCS):
• All non-common characters of two (or more) strings are removed
• The remaining sequence of characters is the LCS of both strings
• The LCS of survey and surgery is surey
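
An LCS of two strings can be found with a dynamic-programming table followed by a backward trace. This is a sketch of the textbook approach for two strings only; the function name is an illustrative choice, and when several subsequences share the maximal length it returns just one of them.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the characters.
    result = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))


print(lcs("survey", "surgery"))
```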

Theory

55

Text Similarity
• Similarity can be extended to documents
• Compute the longest common sequence of lines between two files
• ‘diff’ command in Unix
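
A line-level comparison in the spirit of `diff` can be sketched with Python's standard `difflib` module. Note the assumption: `difflib` uses its own matching heuristic, which is not guaranteed to be a true longest common subsequence, and the two short "files" below are made-up lists of lines.

```python
import difflib

old = ["Documents", "Metadata", "Markup"]
new = ["Documents", "Markup", "Compression"]

# Lines prefixed with "-" exist only in old, "+" only in new,
# and a leading space marks lines common to both.
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="old", tofile="new", lineterm="")
)
for line in diff_lines:
    print(line)
```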

Theory

56

Resemblance Measure

Theory

57

Resemblance Measure

Model

58

Document Preprocessing Operations
• Lexical analysis of the text
• Elimination of stopwords
• Stemming of the remaining words
• Selection of index terms or keywords
• Construction of term categorization structures (thesaurus)

Model

59

Logical View of a Document

Document Preprocessing

60

Lexical Analysis
• Process of converting a stream of characters into a stream of words
• Major objective: identify words in the text
• Word separators:
  - Space: most common separator
  - Numbers: inherently vague, context required
  - Hyphens: break up hyphenated words
  - Punctuation marks
  - Case of letters: A vs. a
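
The lexical analysis choices above can be sketched as a small tokenizer. This is a simplified illustration, not a production tokenizer: the function name, its flags, and the ASCII-only regular expressions are assumptions made for the example.

```python
import re


def tokenize(text, lowercase=True, split_hyphens=True, keep_numbers=False):
    """Convert a stream of characters into a list of word tokens."""
    if lowercase:
        text = text.lower()  # treat "A" and "a" as the same letter
    if split_hyphens:
        pattern = r"[A-Za-z0-9]+"  # hyphens act as separators
    else:
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"  # keep hyphenated words whole
    # Spaces and punctuation marks never match, so they separate tokens.
    tokens = re.findall(pattern, text)
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens


print(tokenize("State-of-the-art IR, since 1999!"))
print(tokenize("State-of-the-art IR, since 1999!",
               split_hyphens=False, keep_numbers=True))
```

Each flag corresponds to one of the decisions listed above, so the effect of a policy on the resulting index terms can be inspected directly.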


61

Elimination of Stopwords
• Words that appear too frequently
• Usually, not good discriminators
• Filtered out as potential index terms
• Reduces size of index by 40% or more
• At the expense of reducing recall: not able to retrieve documents that contain “to be or not to be”
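
The recall problem can be seen in a small sketch of stopword filtering. The stopword list below is a short, made-up sample chosen for the example; real lists are longer and vary by system.

```python
# Illustrative stopword list, not a standard one.
STOPWORDS = {"a", "and", "be", "by", "not", "of", "or", "the", "to"}


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


print(remove_stopwords("the retrieval of information".split()))
print(remove_stopwords("to be or not to be".split()))
```

The second phrase is reduced to an empty list, so an index built this way has no terms left to match it.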


62

Stemming
• Stem: portion of a word left after removal of prefixes/suffixes
• User specifies a query word but only a variant of it is present in a relevant document
• This is partially solved by the adoption of stems
• Stemming reduces size of the index
• Controversial
• Many search engines do not adopt any stemming
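
To make the idea of a stem concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any stemmer named in the chapter; the suffix list, the minimum stem length, and the function name are assumptions chosen for illustration.

```python
# Hypothetical suffix list, checked in this order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for word in ("connect", "connected", "connecting", "connects", "sing"):
    print(word, "->", naive_stem(word))
```

All four variants of "connect" map to the same stem, so a query on one form can match documents containing another, while "sing" is left alone because too little of the word would remain.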


63

Keyword Selection
• Full text representation: all words in the text are used as index terms (or keywords)
• Alternative to full text representation:
  – Not all words in the text are used as index terms
  – Use just nouns as index terms
  – Group nouns that appear nearby in the text into a single indexing component (a concept)


64

Thesaurus
• Used as reference to a treasury of words
• Precompiled list of important words in a knowledge domain
• For each word in this list, a set of related words derived from a synonymy relationship




66

Thesaurus
• Query formulation process (for IR):
  – User forms a query
  – Query terms might be erroneous or improper
  – Solution: reformulate the original query
  – Usually, this implies expanding the original query with related terms
  – Thus, it is natural to use a thesaurus for finding related terms
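
Query expansion with a thesaurus can be sketched as a dictionary lookup. The thesaurus entries and the function name below are invented for illustration; a real thesaurus would be compiled for a specific knowledge domain.

```python
# Toy thesaurus: each word maps to related words (hypothetical entries).
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}


def expand_query(terms, thesaurus=THESAURUS):
    """Return the original terms followed by their related terms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["car", "rental"]))
```

Terms without a thesaurus entry, such as "rental" here, pass through unchanged.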

Taxonomies

67

Folksonomies

68

Folksonomy
• Collaborative flat vocabulary
• Terms are selected by a population of users
• Each term is called a tag

Folksonomies

69

References

• Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition. Chapter 6, Documents: Languages & Properties, Retrieved from http://grupoweb.upf.es/WRG/mir2ed/pdf/slides_chap06.pdf

70


Questions?

[email protected]/~probbins

71
