Geographic Information Retrieval (GIR)
Geographical Information Retrieval
Behrooz Rasuli
Iranian Research Inst. for Information Science & [email protected]
GIR
Address information is essential to people's daily lives. People often need to look up the address of an unfamiliar location on the Web and then use a map service to mark the location and get directions. Although both address information and map services are available online, they are not well integrated.
Introduction . . .
General search engines are widely used to retrieve Web pages
Specialized search engines are dedicated to finding either particular types of resources or Web pages that meet specific criteria, e.g. language or geographic location
People use search engines to find Web pages of local services and events around them or in a particular area
Introduction . . .
Spatial data, geospatial data or geographic information is the data pertaining to the location of geographical entities together with their spatial dimensions
Location could be defined as “a place on the Internet where an Internet resource, such as a Web page, is stored”
Source Geography
◦ physical location of hosts
◦ signal processing and network-based techniques
Target Geography
◦ uses elements contained in the page to deduce locations (place names, postal addresses, and phone numbers)
◦ Challenge: involves evidence extraction, semantic analysis, and interpretation, in order to link Web pages to geographic locations
Geographic aspects of the Web can be explored using two approaches
Geographic Information Retrieval (GIR) is an applied research field that involves indexing, searching, retrieving, and browsing georeferenced information sources, and designing systems to execute these tasks effectively and efficiently
Like IR, GIR includes indexing, storage and ranking
GIR
Pattern extraction from raw text is not new. For example, in the 1990s M. Hearst developed an approach for discovering lexico-syntactic patterns that identify hypernym (is-a) relations
GIR History
Pattern-Based Methods
◦ Named Entity Recognition (NER)
◦ Gazetteer approach (Web-a-Where)
◦ Pattern-based method
Ontology-Based Methods
◦ OnLocus
Machine Learning Methods
GIR Methods
Pattern-Based Method
Few commercial geographic search engines have been developed; Google Maps and Yahoo Local are notable among them
Challenges: the ambiguous and dynamic nature of location names, varied addressing styles, lack of geographic information, and multiple locations related to a single Web resource
NER extracts proper names from texts and documents
One algorithm distinguishes six classes of location names: CITY, REGION, COUNTRY, ISLAND, RIVER, and MOUNTAIN
The method is time-consuming and is not useful for real-time search
Named Entity Recognition (NER)
Tagging individual place names (geotagger)
◦ finds and disambiguates geographic names (assigning a canonical taxonomy node to each phrase in the text)
1. Spotting
2. Disambiguation
3. Focus determination
Crawling the Web, storing the resulting pages and indexing their contents
Gazetteer approach
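The three steps above (spotting, disambiguation, focus determination) can be illustrated with a minimal sketch. The gazetteer contents, the function names, and the simple "most common country" and "shared prefix" heuristics are all assumptions made for illustration; they are far simpler than what a system such as Web-a-Where actually does.

```python
from collections import Counter

# Hypothetical toy gazetteer: place name -> candidate taxonomy nodes.
GAZETTEER = {
    "Victoria": ["Australia/Victoria", "Canada/British Columbia/Victoria"],
    "Melbourne": ["Australia/Victoria/Melbourne"],
    "Geelong": ["Australia/Victoria/Geelong"],
}

def spot(text):
    """Spotting: find every gazetteer name that occurs as a word in the text."""
    words = [w.strip(".,;:!?") for w in text.split()]
    return [w for w in words if w in GAZETTEER]

def disambiguate(names):
    """Disambiguation: for each name, prefer the candidate whose country is
    shared by the most unambiguous names (ties keep the first candidate)."""
    context = Counter(
        GAZETTEER[n][0].split("/")[0] for n in names if len(GAZETTEER[n]) == 1
    )
    return [
        max(GAZETTEER[n], key=lambda node: context[node.split("/")[0]])
        for n in names
    ]

def focus(nodes):
    """Focus determination: the deepest taxonomy prefix shared by all nodes."""
    common = []
    for level in zip(*(node.split("/") for node in nodes)):
        if len(set(level)) != 1:
            break
        common.append(level[0])
    return "/".join(common)

names = spot("Melbourne and Geelong are cities in Victoria.")
nodes = disambiguate(names)
page_focus = focus(nodes)  # "Australia/Victoria"
```

Here the ambiguous name "Victoria" is resolved to the Australian state because the two unambiguous names on the page are both in Australia.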
Basically, a geographic search engine must be able to find related addresses and location names and assign them to Web pages
Current address extraction techniques require large gazetteers, which are expensive and unavailable for many countries
different markup styles e.g. HTML, XML and DOM
natural language processing models are not able to extract all addresses and location names from Web page contents
Different ways of mentioning an address in a Web page
large scale gazetteers
pattern-based model which uses HTML and visual segmentations to improve address extraction on Web pages
divide an address into its semantic components
automatic
much human effort
new location names
The proposed address extraction system consists of five components: HTML Pre-Processor, Parser, Knowledge Searcher, Decision Maker, and Knowledge Accumulator
Analyzes HTML tags and codes
Converts HTML files to XML (by employing the VIPS Demo software)
Analyzes and traverses the XML in depth to obtain content information
Sorts the content in a linear sequence together with the node numbers
Builds a node index
1. HTML Pre-processor
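A rough sketch of the "linear sequence with node numbers" idea follows. It uses Python's standard `html.parser` rather than the VIPS visual segmentation mentioned above, so it only approximates the pre-processing step; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TextNodeCollector(HTMLParser):
    """Collects non-empty text nodes in document order and numbers them."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # linear sequence of (node_number, text)
        self._skip = 0    # depth inside <script>/<style>, whose text is not content

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip:
            self.nodes.append((len(self.nodes) + 1, text))

def build_node_index(html):
    """Return a node index: node number -> text content."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.nodes)

index = build_node_index(
    "<html><body><h1>Contact</h1><p>No. 10, William Street,<br>Toowong</p></body></html>"
)
```

Each numbered node can then be handed to the next component as a unit in which candidate addresses are searched.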
It tries to find all candidate phrases (potential addresses) in a node
It divides each potential address into its components
Each segment obtained in this step is used as the default search unit of the Database Searcher
2. XML Parser
Itemizes the elements of a potential address
Finds all possible readings of a potential address and forms them into a list of possible patterns in three steps:
◦ Standardizing Word Formats (different spellings, abbreviations, synonyms)
◦ Knowledge-Base Place Name Matching (separates elements at a finer level)
◦ Ambiguity Eliminating (tries to match place names)
3. Knowledge Searcher
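The first of these steps, standardizing word formats, might look like the following sketch. The lookup table and the function names are hypothetical; a real system would draw its spellings, abbreviations and synonyms from the knowledge base.

```python
# Hypothetical table of abbreviations and their standard forms.
STANDARD_FORMS = {
    "st": "Street",
    "st.": "Street",
    "rd": "Road",
    "ave": "Avenue",
    "qld": "Queensland",
}

def standardize(element):
    """Replace each known abbreviation in one address element."""
    return " ".join(
        STANDARD_FORMS.get(word.lower(), word) for word in element.split()
    )

def itemize(candidate):
    """Split a potential address on commas and standardize every element."""
    return [standardize(part) for part in candidate.split(",") if part.strip()]

elements = itemize("No. 10, William St, Toowong, QLD")
# ["No. 10", "William Street", "Toowong", "Queensland"]
```

Note that this naive table cannot tell "St" (Street) from "St" (Saint); that kind of case is what the ambiguity elimination step is for.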
Decides whether a candidate phrase is an address or not by matching it with address patterns already stored in a database:
◦ Eliminating ambiguities and conflicts of place names (syntactic and semantic: geo/non-geo and geo/geo)
◦ Itemizing each potential address into its elements
◦ Adding the missing parts to an address based on a location tree wherever possible
For example, the address “No. 10, William Street, Toowong, Queensland” will be modified to “No. 10, William Street, Toowong, Brisbane, Queensland, Australia”
4. Decision Maker
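Completing an address from a location tree, as in the Toowong example above, can be sketched as below. The tree is stored as hypothetical child-to-parent links, and the function name is an assumption; the sketch also assumes that any elements following the matched place agree with the tree, since they are replaced by the ancestor chain.

```python
# Hypothetical location tree stored as child -> parent links.
PARENT = {
    "Toowong": "Brisbane",
    "Brisbane": "Queensland",
    "Queensland": "Australia",
}

def complete_address(elements):
    """Fill in missing ancestors after the first element found in the tree."""
    for i, element in enumerate(elements):
        if element in PARENT:
            chain = [element]
            while chain[-1] in PARENT:          # walk up to the root
                chain.append(PARENT[chain[-1]])
            return elements[:i] + chain
    return elements  # nothing known about this address; leave it unchanged

completed = complete_address(["No. 10", "William Street", "Toowong", "Queensland"])
# ["No. 10", "William Street", "Toowong", "Brisbane", "Queensland", "Australia"]
```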
The last component of the system; it works in two aspects:
◦ Location Accumulation
◦ Address Pattern Accumulation
5. Knowledge Accumulator
There are 9 lemmas in the KB; 3 lemmas have multiple identities (Victoria, Churchill, Howard Avenue)
The following algorithm indicates how place names are detected in phrases
◦ PW - a candidate phrase
◦ Wi - the i-th word in PW
◦ f - any syntactic format of Wi
◦ KB - Knowledge-Base
◦ Ci - result collection
Example
Inputs
PW(pre word, Wi) {
    if ((pre word + f) = a place name found in KB)
        add (pre word + f) to Ci;
    if ((pre word + f) = part of a name in KB)
        pre word = pre word + f;
        PW(pre word, Wi+1); // try next word in PW
}

SyntacticAE(Potential) {
    current word = first word in Potential
    C = NULL; // initialize C
    While current word != EOF
    {
        C = SAE(C, current word); // add longest result in C
        current word = next new word in Potential;
    }
}
Syntactic Ambiguity Elimination
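The longest-match idea behind the pseudocode above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the knowledge base contents and function names are hypothetical, the only "syntactic format" handled is lower-casing, and commas are ignored so that names may be matched across them.

```python
# Hypothetical knowledge base of place names (lower-cased for matching).
KB = {"howard", "howard avenue", "victoria", "victoria park", "churchill"}

def is_name_prefix(phrase):
    """True if the phrase equals a KB name or starts one on a word boundary."""
    return any(name == phrase or name.startswith(phrase + " ") for name in KB)

def longest_place_names(candidate):
    """Scan left to right, keeping the longest KB name starting at each word."""
    words = candidate.lower().replace(",", " ").split()
    results, i = [], 0
    while i < len(words):
        longest, j = None, i
        phrase = words[i]
        while is_name_prefix(phrase):
            if phrase in KB:
                longest = (phrase, j)        # remember the longest full match
            j += 1
            if j == len(words):
                break
            phrase = phrase + " " + words[j]  # try extending with the next word
        if longest:
            results.append(longest[0])
            i = longest[1] + 1                # continue after the matched name
        else:
            i += 1
    return results

found = longest_place_names("12 Howard Avenue, Victoria")
# ["howard avenue", "victoria"]
```

Preferring "howard avenue" over the shorter "howard" is what removes the syntactic ambiguity between a lemma and a longer name that contains it.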
Inconsistencies between accumulated knowledge in the KB and information extracted from the Web:
◦ misspelling and synonymy
◦ incompleteness of the KB
• Keeping the Conflict
• Removing Meaningless Conflict Element
• Finding Synonymous Sub-tree
• Merging Synonymous Sub-Tree
Conflict Elimination
Ontology-Based Method
Direct references
◦ place names, complete postal addresses
Indirect references
◦ postal codes and telephone area codes, or expressions that indicate relationships to other places that are directly referenced (for instance, “The hotel is two blocks from Times Square”)
References to places in Web pages
Proposes a three-phase process for recognizing geographic evidence in Web pages:
◦ Extraction (selecting relevant Web content)
◦ Recognition (isolating references to places embedded in text, including dealing with ambiguity)
◦ Location (obtaining locations from the place descriptions previously recognized, using positioning data from gazetteers or from spatial databases)
OnLocus
An extraction ontology is able to identify objects and relationships
The ontology must describe rules for identifying elements within its domain that are present in Web pages
extraction
Recognition of terms and expressions as place names
◦ candidates are compared to a gazetteer, e.g. Alexandria or GeoNames
recognition
Tries to determine an actual location from a gazetteer or by performing a process known as geocoding
Location of direct references
◦ matching and locating
Location of indirect references
◦ Formal: establish a correspondence between a code and the area it serves (supported by spatial databases)
◦ Informal: natural language interpretation is required
Location
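For formal indirect references, the correspondence between a code and the area it serves amounts to a lookup. The sketch below uses a small hypothetical table in place of a spatial database and a simplified pattern for North American phone numbers; the names `AREA_CODES`, `PHONE` and `locate_by_area_code` are assumptions for illustration.

```python
import re

# Hypothetical table standing in for a spatial database of area codes.
AREA_CODES = {"212": "New York City", "617": "Boston"}

# Simplified pattern: "(NNN) NNN-NNNN"; the area code is captured.
PHONE = re.compile(r"\((\d{3})\)\s*\d{3}-\d{4}")

def locate_by_area_code(text):
    """Return the areas served by the phone area codes found in the text."""
    return [AREA_CODES[code] for code in PHONE.findall(text) if code in AREA_CODES]

places = locate_by_area_code("Call (212) 555-0100 or (999) 555-0199")
# ["New York City"]
```

Informal references such as "two blocks from Times Square" cannot be handled by a table like this, which is why they require natural language interpretation.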
Machine Learning Methods
apply Text Mining procedures to the Internet in order to classify places into different location types (e.g., Maebashi is a CITY, Honshu is an ISLAND) and to determine, for a given place name, where the place is (e.g. Maebashi is in Japan, Honshu is in the Pacific Ocean);
acquire exhaustive fine-grained gazetteers automatically and thus avoid hand-coding;
distinguish 6 location types (CITY, REGION, COUNTRY, ISLAND, RIVER, MOUNTAIN)
Machine Learning Method
The dataset consists of 1260 names of locations
For each class, a set of patterns was constructed
◦ patterns have the form “KEYWORD+of+X” and “X+KEYWORD” (Alta Vista counts)
Each class has between 3 (ISLAND) and 10 (MOUNTAIN) different keywords
Keywords and patterns were selected manually
Algorithm
For example, the class CITY uses 4 keywords (“city”, “town”, “mayor”, “streets”) and 7 corresponding patterns (“city+of+X”, “X+city”, “town+of+X”, “mayor+of+X”, “X+mayor”, “streets+of+X”, and “X+streets”)
Example
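The pattern counts can be turned into features for a learner roughly as follows. This is a sketch under stated assumptions: the CITY patterns are the seven listed above, the ISLAND patterns are hypothetical placeholders (the slides do not list them), and `hit_count` is any caller-supplied function that maps a query to a number of search hits; no real search API is called here.

```python
# CITY patterns come from the example above; the ISLAND patterns are
# hypothetical placeholders, since the slides do not list them.
PATTERNS = {
    "CITY": ["city of X", "X city", "town of X", "mayor of X",
             "X mayor", "streets of X", "X streets"],
    "ISLAND": ["island of X", "X island"],
}

def queries_for(name, location_class):
    """Instantiate each pattern of a class as an exact-phrase query."""
    return ['"' + p.replace("X", name) + '"' for p in PATTERNS[location_class]]

def feature_vector(name, hit_count):
    """One hit count per pattern, in a fixed order across all classes."""
    return [hit_count(q) for cls in PATTERNS for q in queries_for(name, cls)]

# Made-up counts for illustration only; real counts would come from a search engine.
fake_counts = {'"city of Maebashi"': 120, '"Maebashi city"': 300}
vector = feature_vector("Maebashi", lambda q: fake_counts.get(q, 0))
```

A classifier trained on such vectors for names with known types could then assign a location type to an unseen name.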
Thank You!
Presented in the Information Retrieval Course, under the supervision of Dr. Saeid Asadi