Geographic Information Retrieval (GIR)


Behrooz Rasuli, Iranian Research Inst. for Information Science & [email protected]

GIR

Address information is essential to people's daily lives. People often need to look up the address of an unfamiliar location on the Web and then use a map service to mark the location and get directions. Although both address information and map services are available online, they are not well integrated.

Introduction . . .

General search engines are widely used to retrieve Web pages

Specialized search engines are dedicated to finding either particular types of resources or Web pages that meet specific criteria, e.g. language or geographic location

People use search engines to find Web pages of local services and events around them or in a particular area

Introduction . . .

Spatial data, geospatial data, or geographic information is the data pertaining to the location of geographical entities together with their spatial dimensions

Location could be defined as “a place on the Internet where an Internet resource, such as a Web page, is stored”

Source Geography
◦ physical location of hosts
◦ signal processing and network-based techniques

Target Geography
◦ uses elements contained in the page to deduce locations (place names, postal addresses, and phone numbers)
◦ Challenge: involves evidence extraction, semantic analysis, and interpretation, in order to link Web pages to geographic locations

Geographic aspects of the Web can be explored using two approaches

Geographic Information Retrieval (GIR) is an applied research field that involves indexing, searching, retrieving, and browsing georeferenced information sources, and designing systems to execute these tasks effectively and efficiently

Like IR, GIR includes indexing, storage and ranking

GIR

Pattern extraction from raw text has already been done. For example, M. Hearst (1990s) developed an approach for discovering lexico-syntactic patterns for hypernyms

GIR History

Pattern-Based Methods
◦ Named Entity Recognition (NER)
◦ Gazetteer approach (Web-a-Where)
◦ Pattern-based method

Ontology-Based Methods
◦ OnLocus

Machine Learning Methods

GIR Methods

Pattern-Based Method

Few commercial geographic search engines have been developed; Google Maps and Yahoo! Local are notable among them

Challenges: the ambiguous and dynamic nature of location names, varied addressing styles, lack of geographic information, and multiple locations related to a single Web resource

Extracts proper names from texts and documents

An algorithm that distinguishes six classes of location names: CITY, REGION, COUNTRY, ISLAND, RIVER, and MOUNTAIN

The method is time-consuming and is not useful for real-time search

Named Entity Recognition (NER)

Tagging individual place names (geotagger)
◦ finds and disambiguates geographic names (assigning a canonical taxonomy node to each phrase in the text)

1. Spotting
2. Disambiguation
3. Focus determination

Crawling the Web, storing the resulting pages, and indexing their contents

Gazetteer approach
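The three geotagging steps listed above (spotting, disambiguation, focus determination) can be sketched roughly as follows. This is a minimal illustration, not the Web-a-Where implementation: the toy gazetteer, the taxonomy paths, and the scoring rule (prefer the candidate that shares the most taxonomy nodes with unambiguous names on the same page) are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy gazetteer: lower-cased name -> candidate taxonomy paths.
GAZETTEER = {
    "paris": ["Paris/France/Europe", "Paris/Texas/United States/North America"],
    "lyon": ["Lyon/France/Europe"],
    "texas": ["Texas/United States/North America"],
}


def spot(text):
    """Spotting: return the gazetteer names mentioned in the text, in order."""
    words = [w.strip(".,;:!?()").lower() for w in text.split()]
    return [w for w in words if w in GAZETTEER]


def disambiguate(names):
    """Disambiguation: assign one taxonomy path to every spotted name."""
    # Unambiguous names provide the context used to score ambiguous ones.
    context = Counter()
    for name in names:
        candidates = GAZETTEER[name]
        if len(candidates) == 1:
            context.update(candidates[0].split("/"))
    resolved = []
    for name in names:
        # On a tie, max() keeps the first candidate listed in the gazetteer.
        best = max(GAZETTEER[name],
                   key=lambda path: sum(context[p] for p in path.split("/")))
        resolved.append(best)
    return resolved


def focus(resolved):
    """Focus determination (simplified): most common parent region."""
    parents = Counter(path.split("/")[1] for path in resolved)
    return parents.most_common(1)[0][0] if parents else None


names = spot("Flights from Lyon to Paris.")
resolved = disambiguate(names)
print(resolved, focus(resolved))
```

Here "Paris" resolves to the French city because "Lyon" is unambiguous and shares the France and Europe nodes with that candidate.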

Basically, a geographic search engine must be able to find related addresses and location names and assign them to Web pages

Current address extraction techniques require large gazetteers, which are expensive and unavailable for many countries

Different markup styles, e.g. HTML, XML and DOM

Natural language processing models are not able to extract all addresses and location names from Web page contents

Different ways of mentioning an address in a Web page

large scale gazetteers

A pattern-based model which uses HTML and visual segmentation to improve address extraction on Web pages

Divides an address into its semantic components

automatic

much human effort

new location names

The proposed address extraction system consists of five components: HTML Pre-processor, XML Parser, Knowledge Searcher, Decision Maker, and Knowledge Accumulator

Analyzes HTML tags and code

Converts HTML files to XML (by employing the VIPS Demo software)

Analyzes and traverses the XML in depth to obtain content information

Sorts the content in a linear sequence together with the node numbers

Builds a node index

1. HTML Pre-processor
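The traversal and indexing part of the pre-processor can be sketched with Python's standard `xml.etree.ElementTree`. The XML layout produced by VIPS is not shown in these slides, so the element names in the sample (`page`, `block`, `p`) are hypothetical, and numbering nodes in document order is an assumption.

```python
import xml.etree.ElementTree as ET


def linearize(xml_text):
    """Traverse the XML in document order and return
    (linear sequence of (node number, text), node index)."""
    root = ET.fromstring(xml_text)
    sequence = []
    for number, node in enumerate(root.iter()):
        # Only the text directly inside the element is used here;
        # tail text after child elements is ignored in this sketch.
        text = (node.text or "").strip()
        if text:
            sequence.append((number, text))
    index = dict(sequence)  # node number -> text content
    return sequence, index


sample = "<page><block><p>Contact us</p><p>No. 10, William Street</p></block></page>"
sequence, index = linearize(sample)
print(sequence)
```

Keeping the node number next to each text segment lets later components point back to the place in the page where a candidate address was found.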

Tries to find all candidate phrases (potential addresses) in a node

Divides a potential address into its components

Each segment obtained in this step is used as the default search unit of the Database Searcher

2. XML Parser

Itemizes the elements of a potential address

Finds all possible readings of a potential address and forms them into a list of possible patterns in three steps:
◦ Standardizing Word Formats (different spellings, abbreviations, synonyms)
◦ Knowledge-Base Place Name Matching (separates elements at a finer level)
◦ Ambiguity Elimination (tries to match place names)

3. Knowledge Searcher
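As a rough illustration of the first step, standardizing word formats, the sketch below maps abbreviations and spelling variants to one canonical form before any knowledge-base lookup. The `VARIANTS` table and the `standardize` function are hypothetical; the slides do not say how the real system stores its variants.

```python
# Hypothetical table of abbreviations and variants -> canonical form.
VARIANTS = {
    "st": "street",
    "st.": "street",
    "rd": "road",
    "ave": "avenue",
    "qld": "queensland",
}


def standardize(phrase):
    """Lower-case the phrase, drop commas, and expand known variants."""
    tokens = [t.strip(",").lower() for t in phrase.split()]
    return [VARIANTS.get(t, t) for t in tokens if t]


print(standardize("No. 10, William St, Toowong, QLD"))
```

After this step "William St" and "William Street" produce the same token sequence, so they can be matched against the same knowledge-base entry.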

Decides whether a candidate phrase is an address by matching it against address patterns already stored in a database:
◦ Eliminating ambiguities and conflicts of place names (syntactic and semantic: geo/non-geo and geo/geo)
◦ Itemizing each potential address into its elements
◦ Adding the missing parts to an address, based on a location tree, wherever possible

For example, the address “No. 10, William Street, Toowong, Queensland” is completed as “No. 10, William Street, Toowong, Brisbane, Queensland, Australia”

4. Decision Maker
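The address completion described above can be sketched as a walk up a location tree. The `PARENT` mapping below is a hypothetical child-to-parent table containing only the places from the slide's example; a real location tree would also have to handle places that share a name.

```python
# Hypothetical location tree stored as child -> parent.
PARENT = {
    "Toowong": "Brisbane",
    "Brisbane": "Queensland",
    "Queensland": "Australia",
}


def complete_address(parts):
    """Insert the missing ancestors after the first known place name."""
    for i, part in enumerate(parts):
        if part in PARENT:
            # Walk from the most specific place up to the root of the tree.
            chain = [part]
            while chain[-1] in PARENT:
                chain.append(PARENT[chain[-1]])
            # Only rewrite when the stated places agree with the tree.
            if all(p in chain for p in parts[i:]):
                return parts[:i] + chain
            return list(parts)
    return list(parts)


address = ["No. 10", "William Street", "Toowong", "Queensland"]
print(", ".join(complete_address(address)))
```

If a stated place contradicts the tree, the sketch leaves the address unchanged rather than guessing; how the real system treats that case is handled by its conflict elimination step.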

The last component of the system; it accumulates knowledge in two ways:
◦ Location Accumulation
◦ Address Pattern Accumulation

5. Knowledge Accumulator

There are 9 lemmas in the KB; 3 lemmas have multiple identities (Victoria, Churchill, Howard Avenue)

The following algorithm shows how place names are detected in phrases:
◦ PW - a candidate phrase
◦ Wi - the ith word in PW
◦ f - any syntactic format of Wi
◦ KB - Knowledge-Base
◦ Ci - Result Collection

Example

Inputs

PW(pre word, Wi) {
    if ((pre word + f) = a place name found in KB)
        add (pre word + f) to Ci;
    if ((pre word + f) = part of a name in KB) {
        pre word = pre word + f;
        PW(pre word, Wi+1); // try next word in PW
    }
}

SyntacticAE(Potential) {
    current word = first word in Potential;
    C = NULL; // initialize C
    while (current word != EOF) {
        C = SAE(C, current word); // add longest result in C
        current word = next new word in Potential;
    }
}

Syntactic Ambiguity Elimination
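A runnable approximation of the longest-match idea behind the pseudocode above is sketched below. It is not the authors' algorithm: the knowledge base is a hypothetical set of lower-cased names, the syntactic formats `f` are reduced to lower-casing, and the names `longest_match` and `syntactic_ae` are introduced only for this example.

```python
# Hypothetical knowledge base of place names (lower-cased).
KB = {"victoria", "churchill", "howard avenue", "victoria point"}


def longest_match(words, start):
    """Return the longest KB name that begins at words[start], or None."""
    best = None
    prefix = ""
    for word in words[start:]:
        prefix = f"{prefix} {word}".strip()
        if prefix in KB:
            best = prefix  # a complete place name; keep the longest so far
        if not any(name.startswith(prefix) for name in KB):
            break  # no KB name can continue this prefix
    return best


def syntactic_ae(phrase):
    """Scan the phrase left to right, keeping the longest match at each position."""
    words = phrase.lower().split()
    found, i = [], 0
    while i < len(words):
        match = longest_match(words, i)
        if match:
            found.append(match)
            i += len(match.split())  # skip the words consumed by the match
        else:
            i += 1
    return found


print(syntactic_ae("12 Howard Avenue Victoria Point"))
```

Preferring the longest match is what keeps "Victoria Point" from being read as the shorter name "Victoria" followed by an unrelated word.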

Inconsistencies between accumulated knowledge in the KB and information extracted from the Web:
◦ misspelling and synonymy
◦ incompleteness of the KB

• Keeping the Conflict
• Removing Meaningless Conflict Element
• Finding Synonymous Sub-tree
• Merging Synonymous Sub-Tree

Conflict Elimination

Ontology-Based Method

Direct references
◦ place names, complete postal addresses

Indirect references
◦ postal codes and telephone area codes, or expressions that indicate relationships to other places which are directly referenced (for instance, “The hotel is two blocks from Times Square”)

References to places in Web pages

A three-phase process is proposed for recognizing geographic evidence in Web pages:
◦ Extraction (selecting relevant Web content)
◦ Recognition (isolating references to places embedded in text, including dealing with ambiguity)
◦ Location (obtaining locations from the place descriptions previously recognized, using positioning data from gazetteers or from spatial databases)

OnLocus

an extraction ontology is able to identify objects and relationships;

ontology must describe rules for identifying elements within its domain that are present in Web pages

extraction

Recognition of terms and expressions as place names
◦ compared against a gazetteer: Alexandria and GeoNames

recognition

Tries to determine an actual location from a gazetteer or by performing a process known as geocoding

Location of direct references
◦ matching and locating

Location of indirect references
◦ Formal: establish a correspondence between a code and the area it serves (supported by spatial databases)
◦ Informal: natural language interpretation is required

Location
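The split between direct references and formal indirect references can be illustrated with two small lookup tables. Both tables are hypothetical stand-ins for a gazetteer and a spatial database, the coordinates are approximate, and the phone-number pattern covers only one simple format; none of this comes from OnLocus itself.

```python
import re

# Hypothetical stand-ins for a gazetteer and a spatial database.
GAZETTEER = {"times square": (40.7580, -73.9855)}  # approximate coordinates
AREA_CODES = {"212": "Manhattan, New York"}

# One simple phone format, e.g. "(212) 555-0123" or "212-555-0123".
PHONE = re.compile(r"\(?(\d{3})\)?[\s-]?\d{3}-?\d{4}")


def locate(reference):
    """Return (kind, location) for a recognized place reference."""
    text = reference.strip()
    if text.lower() in GAZETTEER:
        # Direct reference: match the name and read its position.
        return ("direct", GAZETTEER[text.lower()])
    match = PHONE.fullmatch(text)
    if match and match.group(1) in AREA_CODES:
        # Formal indirect reference: the code maps to the area it serves.
        return ("indirect-formal", AREA_CODES[match.group(1)])
    # Informal references need natural language interpretation.
    return ("unresolved", None)


print(locate("Times Square"))
print(locate("(212) 555-0123"))
print(locate("two blocks from Times Square"))
```

The last call stays unresolved on purpose: an informal expression such as "two blocks from" cannot be handled by table lookup alone.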

Machine Learning Methods

Applies text mining procedures to the Internet in order to classify places into different location types (e.g., Maebashi is a CITY, Honshu is an ISLAND) and to determine, for a given place name, where the place is (e.g., Maebashi is in Japan, Honshu is in the Pacific Ocean)

Acquires exhaustive, fine-grained gazetteers automatically and thus avoids hand-coding

Distinguishes 6 location types (CITY, REGION, COUNTRY, ISLAND, RIVER, MOUNTAIN)

Machine Learning Method

The dataset consists of 1260 location names

For each class, a set of patterns was constructed
◦ patterns have the form “KEYWORD+of+X” and “X+KEYWORD” (AltaVista counts)

Each class has from 3 (ISLAND) up to 10 (MOUNTAIN) different keywords

Keywords and patterns were selected manually

Algorithm

For example, the class CITY uses 4 keywords (“city”, “town”, “mayor”, “streets”) and 7 corresponding patterns (“city+of+X”, “X+city”, “town+of+X”, “mayor+of+X”, “X+mayor”, “streets+of+X”, and “X+streets”)

Example
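The CITY patterns above can be turned into search queries and a count-based feature vector as sketched below. The seven patterns are taken from the slide; `build_queries`, `feature_vector`, and the `count_hits` callback are hypothetical names, and the stand-in counter used in the demo returns placeholder numbers, not real search engine counts.

```python
# The seven CITY patterns listed on the slide; X marks the place name.
CITY_PATTERNS = [
    "city+of+X", "X+city", "town+of+X", "mayor+of+X",
    "X+mayor", "streets+of+X", "X+streets",
]


def build_queries(name, patterns):
    """Substitute the place name for X in every pattern."""
    return [pattern.replace("X", name) for pattern in patterns]


def feature_vector(name, patterns, count_hits):
    """One feature per pattern: the hit count reported for that query.

    count_hits is a caller-supplied function (query -> int); in the
    original work the counts came from a Web search engine.
    """
    return [count_hits(query) for query in build_queries(name, patterns)]


queries = build_queries("Maebashi", CITY_PATTERNS)
print(queries)

# Placeholder counter for demonstration only; these are not real counts.
print(feature_vector("Maebashi", CITY_PATTERNS, count_hits=len))
```

Building one such vector per class gives a learner something to compare across the six location types; the slides do not name the learning algorithm, so none is assumed here.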

Thank You!

Presented in the Information Retrieval course, under the supervision of Dr. Saeid Asadi
