Relevance of clasification and indexing

Relevance of classification and indexing in the organization of internet resources

Transcript of Relevance of clasification and indexing

Page 1: Relevance of clasification and indexing

Relevance of classification and indexing

in the organization of internet resources

Page 2: Relevance of clasification and indexing

The general opinion is that the digital age is wiping out the centuries-old library system.

There is a feeling that libraries and librarians are obsolete in the present digital era.

Two questions generally faced by LIS professionals are:

- ‘What will be the future of libraries?’
- ‘Why organize information if you can find it on the internet?’

Page 3: Relevance of clasification and indexing

Will Sherman: 33 Reasons why libraries and librarians are still important (http://www.degreetutor.com/library)

- Not everything is available on the internet
- Digital libraries are not the internet
- The internet complements libraries but does not replace them
- The internet is not free
- Digitization does not mean destruction; in fact it means survival
- Libraries are not just books
- Like businesses, digital libraries still need human beings
- Eliminating libraries would cut short cultural evolution
- The internet is a mess, while libraries organize knowledge

Page 4: Relevance of clasification and indexing

Librarians employed three important tools for knowledge organization (K.O.). They are:

- Data element directory (cataloguing manual)
- Classification scheme for categorization of the documents
- Thesaurus (vocabulary control tool) for consistent indexing (assigning index terms)

The web has grown without any of these tools, and so it is unorganized (Devadason, F.J. Facet analysis and semantic Web: Musings of a student of Ranganathan. http://www.reocities.com/Athens/5041/FASEMWEB.html)

However, the issue is:

- The enormous quantity of information outside libraries
- How to collect and organize the world’s knowledge?

Page 5: Relevance of clasification and indexing

TRADITIONAL

- Classification: shelf arrangement
- Catalogue: identification and location of information
- Analysis and consolidation: indexing / abstracting for micro documents

Result:

- improved precision or recall
- provide context for search terms
- enable browsing
- access to related information with meaningful relationships
- serve as a mechanism for switching between languages

WEB BASED

- Search engines
- Subject gateways
- Directories

Result: The web is a sea of all kinds of data

- difficult to find, access and retrieve pertinent information
- extremely unorganized data
- too many false and missing links

E.g. Building and architecture; Travel and hotel

Difference: Use of subject descriptors

Page 6: Relevance of clasification and indexing

Directories

- Could not cope with the scale of Web growth
- Were often built by amateurs in classification and vocabulary management
- Were biased by the commercial use of the Web

Vocabularies

- Open Directory categories
- Wikipedia categories
- Metadata in the HTML <head>: spammed, not in sync with the content, and now ignored by most search engines

Bottom line: The Web is not and will never be an organized library (Bernard, V. Porting library vocabularies to the Semantic Web, and back: A win-win round trip. IFLA 2010, Gothenburg)

Page 7: Relevance of clasification and indexing

E.g. Works on M. K. Gandhi

Library

- The art of librarianship has been used for thousands of years to organise knowledge
- catalogue / librarian, then class no., then shelf

Search engines

- collections are built by robots; number count
- aim for exhaustive indexing
- offer automatically generated metadata

Subject gateways

- collections are built by humans
- aim to develop catalogues of high quality resources
- offer human generated metadata

Page 8: Relevance of clasification and indexing

- Can we apply classification principles?
- Can we apply metadata?
- Can we apply indexing techniques?

Page 9: Relevance of clasification and indexing

Two distinct ways of finding resources on the Internet emerged (Dodd 1996).

- the use of robot or spider based search engines and

- producing ‘hotlists’, which would encourage users to browse the Web.

This production of hierarchically arranged lists brought in the use of library classification schemes.

Subject directories like Yahoo! and other quality controlled subject gateways started using classification schemes to enhance searching the Net.

They maximize the retrievability / visibility of information: clustering, browsing. e.g. LIS education through distance mode

Page 10: Relevance of clasification and indexing

Electronic versions of classification schemes (WebDewey, UDC Online) made it possible to adopt them on the web.

The Web, as an information environment, differs from the controlled setting of a traditional information retrieval system.

The question is how, and to what extent, a classification is actually used to support subject access on the web.

Many Web sites, like Google and Yahoo, use hierarchical classification trees to organize text resources on the Web.

Subject gateways offer hierarchical browse structures based on subject classification schemes.
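A hierarchical browse structure of this kind can be sketched in a few lines of Python. This is only an illustration: the notations follow the Dewey pattern, but the captions are paraphrased, and the resource URLs and the `parent`, `breadcrumb` and `browse` helpers are hypothetical, not taken from any real gateway.

```python
# Hypothetical mini-scheme: notation -> caption (captions paraphrased for illustration).
SCHEME = {
    "000": "Computer science, information and general works",
    "020": "Library and information sciences",
    "025": "Operations of libraries and information centres",
    "025.4": "Subject analysis and control",
    "300": "Social sciences",
    "370": "Education",
}

# Hypothetical catalogue: resource URL -> class number assigned by a cataloguer.
RESOURCES = {
    "http://example.org/indexing-guide": "025.4",
    "http://example.org/library-management": "025",
    "http://example.org/distance-education": "370",
}


def parent(notation):
    """Return the broader class of a notation, or None for a main class."""
    if "." in notation:
        head, tail = notation.split(".")
        return head if len(tail) == 1 else f"{head}.{tail[:-1]}"
    if notation.endswith("00"):
        return None
    if notation.endswith("0"):
        return notation[0] + "00"
    return notation[:2] + "0"


def breadcrumb(notation):
    """Return the chain of classes from the main class down to the notation."""
    trail = []
    while notation is not None:
        trail.append(notation)
        notation = parent(notation)
    return list(reversed(trail))


def browse(notation):
    """List resources classed at or below the given notation."""
    return sorted(url for url, n in RESOURCES.items() if notation in breadcrumb(n))


if __name__ == "__main__":
    for step in breadcrumb("025.4"):
        print(step, SCHEME.get(step, "(caption not in this sketch)"))
    print(browse("020"))
```

Because the hierarchy is carried by the notation itself, browsing at "020" also collects everything classed at "025" and "025.4", which is the clustering effect the gateways rely on.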

Page 11: Relevance of clasification and indexing

The DDC was adapted earlier and more quickly to usage in digital systems via the Internet.

It is completely and easily available as "WebDewey" for all Web browsers and platforms.

Examples:

- Library and Archives Canada (LAC) has capitalized on the Dewey Decimal Classification (DDC) potential for organizing Web resources in two projects.
- ADAM, the Art, Design, Architecture & Media Information Gateway
- Biz/ed, a subject gateway for business education
- BUBL uses the Dewey Decimal Classification system as the primary organisation structure for its catalogue of Internet resources.
- National Library of Canada's Canadian Information by Subject service

Page 12: Relevance of clasification and indexing

Since 1993 UDC has been used in subject gateways (SGs), and it has become more prevalent in East European SGs, portals and hubs since 2000.

UDC in SGs appeared to be linked to the following types of applications:

- manual classification of manually collected links in small to medium-size directories (from a few hundred to a few thousand resources)
- manual classification of a large number of automatically harvested resources, using harvesting and metadata creation tools and more advanced technology (quality controlled SGs)
- automatic harvesting and classification (quality controlled SGs)

(Aida Slavic. UDC in subject gateways: experiment or opportunity? Knowledge Organization, 33, 2006)

Page 13: Relevance of clasification and indexing

Examples:

- WAIS (Wide Area Information Server)
- NISS (National Information Services and System)
- INTUTE
- FVL (Finnish Virtual Library)
- GERHARD (German Harvest Automated Retrieval and Directory)
- PORT (Maritime Information Gateway)
- OKO (Slovenian catalogue of Web resources) etc.

However, they do not display the UDC structure on the interface or UDC numbers in the metadata.

The UDC is probably more "modern" and has made faster progress towards a faceted structure.

Page 14: Relevance of clasification and indexing

The purpose of descriptive metadata is to facilitate the discovery of relevant information.

In addition to resource discovery, metadata can help organize electronic resources, facilitate interoperability and legacy resource integration, provide digital identification, and support archiving and preservation.

The process is automatic and cost effective.

In descriptive metadata, the medium of the resource becomes a non-issue. This enables Dublin Core (DC) metadata to be used by any organization for cataloguing specialized types of mixed-media collections.
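As a rough illustration of how DC (Dublin Core) description stays independent of the medium, the sketch below renders a small record as HTML meta tags using the common "DC.element" naming convention. The record values and the `dc_meta_tags` helper are assumptions made up for this example.

```python
from html import escape

# Hypothetical record; the keys are Dublin Core element names.
record = {
    "title": "Relevance of classification and indexing",
    "subject": "Classification; Indexing; Semantic Web",
    "type": "Text",
    "language": "en",
}


def dc_meta_tags(record):
    """Render a dict of Dublin Core elements as HTML <meta> tags."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        content = escape(value, quote=True)  # keep the attribute value well-formed
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


if __name__ == "__main__":
    print(dc_meta_tags(record))
```

The same dictionary could describe a map, a sound recording or a web page; only the value of the type element would change.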

Page 15: Relevance of clasification and indexing

- Pre- and post-coordinated; derived and assigned; context based
- Thesaurus and classaurus (a classaurus is a faceted scheme of terms indicating hierarchy, enriched with synonyms)
- Two concepts: semantics and syntax
- Purpose: achieve precision out of recalled information
- Humans can do it, since it is natural language
- Machines are ignorant and cannot make any sense of it

How to achieve precision out of recalled information from the Web?
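Precision and recall, as used above, are simple ratios, and a short worked example may help. The document identifiers below are invented; precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical search: 8 documents come back, 4 documents are truly relevant,
    # and 3 of the relevant ones are among those retrieved.
    retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
    relevant = ["d1", "d2", "d3", "d9"]
    print(precision_recall(retrieved, relevant))  # (0.375, 0.75)
```

A web search that returns thousands of hits typically has high recall and low precision; controlled indexing aims to raise precision without losing recall.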

Page 16: Relevance of clasification and indexing

Relationships are categorized as:

- Hierarchical (internal): whole-part composition
- Non-hierarchical (external): associative and equivalent

Application in different areas:

- Design of classification (thesaurus)
- Knowledge organization and information retrieval (search strategies)
- Lexical cohesion
- Epistemology etc.
- Design and development of databases
- Web design and development
- Artificial intelligence
- Text analysis and summarization
- Hypermedia
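The three kinds of relationship map directly onto the usual thesaurus codes: NT (narrower term) for hierarchy, RT (related term) for association, and UF (used for) for equivalence. The sketch below is a toy thesaurus with invented entries, showing how a search strategy might expand a query term; the `preferred`, `narrower` and `expand` helpers are hypothetical.

```python
# Toy thesaurus: preferred term -> relations (entries invented for illustration).
THESAURUS = {
    "knowledge organization": {"NT": ["classification", "indexing"]},
    "classification": {
        "NT": ["faceted classification"],
        "RT": ["indexing"],
        "UF": ["categorization"],
    },
    "indexing": {"RT": ["classification"], "UF": ["subject indexing"]},
    "faceted classification": {},
}


def preferred(term):
    """Map a term or one of its synonyms (UF) to the preferred term, else None."""
    term = term.lower()
    for entry, relations in THESAURUS.items():
        if term == entry or term in relations.get("UF", []):
            return entry
    return None


def narrower(term):
    """Collect all narrower terms (NT) below a preferred term, depth first."""
    found = []
    for child in THESAURUS.get(term, {}).get("NT", []):
        found.append(child)
        found.extend(narrower(child))
    return found


def expand(term, include_related=False):
    """Expand a query term with narrower terms and, optionally, related terms."""
    entry = preferred(term)
    if entry is None:
        return []
    terms = [entry] + narrower(entry)
    if include_related:
        terms += [t for t in THESAURUS[entry].get("RT", []) if t not in terms]
    return terms


if __name__ == "__main__":
    print(expand("categorization"))
    print(expand("categorization", include_related=True))
```

Expanding with narrower terms tends to raise recall, while mapping synonyms to one preferred term is what keeps indexing consistent.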

Page 17: Relevance of clasification and indexing

- Creating representation of Web pages
- Providing standard identifiers (URI) associated with an access protocol (HTTP)
- The WWW is based on HTML / XML hierarchies for coding a body of text and images (multimedia) and linking things together via the HTTP protocol, hypertext etc.
- Use of vocabularies as subject descriptors to organize Web content as in libraries

Page 18: Relevance of clasification and indexing

Taxonomies, subject headings, classifications

- That’s where library heritage is strong and the Web is weak

- Such vocabularies can structure the web of data as they do for libraries

- But it takes more than in a library: the process should be automated

Semantic enhancement of scholarly journal articles, by aiding publication of data and metadata and providing ‘lively’ interactive access, is necessary

Such semantic enhancements are already being undertaken by leading STM publishers

Application of structured vocabularies, of course using artificial intelligence, is the ‘semantic Web’

Page 19: Relevance of clasification and indexing

Tim Berners-Lee: computer scientist at MIT, USA; creator of the WWW; Director of the W3C (World Wide Web Consortium); developer of the Semantic Web

Intention: to enhance the usability and usefulness of the web and its connected resources.

“I have a dream for the Web [in which computers] become capable of analysing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize.”   

—Tim Berners Lee, 1999

These are technologies that enable machines to make more sense of the Web, making the Web more useful for humans.

This means radically improving ability to find, sort, and classify information: an activity that takes up a large part

Page 20: Relevance of clasification and indexing

The Semantic Web is a project that intends to create a universal medium for information exchange by putting documents with computer-processable meaning (semantics) on the World Wide Web.

“The Semantic Web is an extension of the current Web that will allow you to find, share, and combine information more easily. It relies on machine-readable information and metadata expressed in RDF.”

www.noisebetweenstations.com/personal/essays/metadata_glossary/metadata_glossary.html

Humans can easily connect the data when browsing the Web. For example, we disregard advertisements and we know which links are interesting for our purpose (job: resume; air ticket: flights), but machines can’t!

E.g. automatic airline reservation can be done (Ivan Herman, W3C) by combining local knowledge with remote services: airline preferences; dietary requirements; calendaring.

For example, a computer can find the nearest plastic surgeon and book an appointment that fits a personal schedule.

Page 21: Relevance of clasification and indexing
Page 22: Relevance of clasification and indexing

XML provides a surface syntax for structured documents, but imposes no semantic constraints on the meaning of these documents.

XML SCHEMA is a language for restricting the structure of XML documents.

RDF is a simple data model for referring to objects (“resources") and how they are related. An RDF-based model can be represented in XML syntax.

RDF Schema is a vocabulary for describing properties and classes of RDF resources, with semantics for generalization-hierarchies of such properties and classes.
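The "generalization-hierarchies" of RDF Schema can be pictured with a small sketch: if a resource is typed as a class, it is also an instance of every superclass. The `ex:` names and the two helper functions are invented for illustration; this is not a real RDFS reasoner, only the subclass rule.

```python
SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical statements using an invented "ex:" vocabulary.
triples = {
    ("ex:Thesis", SUBCLASS_OF, "ex:Document"),
    ("ex:Document", SUBCLASS_OF, "ex:Resource"),
    ("ex:thesis42", TYPE, "ex:Thesis"),
}


def superclasses(cls, triples):
    """Return every class reachable from cls by following rdfs:subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def inferred_types(resource, triples):
    """Return the stated types of a resource plus the types implied by the hierarchy."""
    direct = {o for s, p, o in triples if s == resource and p == TYPE}
    result = set(direct)
    for cls in direct:
        result |= superclasses(cls, triples)
    return result


if __name__ == "__main__":
    print(sorted(inferred_types("ex:thesis42", triples)))
```

This is the same reasoning a classification scheme supports for a human browser: a document classed under a narrow subject is also relevant to the broader one.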

Page 23: Relevance of clasification and indexing

OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes.

URI (Uniform Resource Identifier): used as a universal naming tool, including for properties

NAME SPACE is a context in which a group of one or more identifiers might exist. An identifier defined in a namespace is associated with that namespace. E.g. Employee ID 123. Many modern computer languages provide support for namespaces.
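A namespace lets the same local name, such as "id", mean different things in different contexts. The sketch below expands prefixed names into full URIs; the Dublin Core namespace is real, while the `hr` and `lib` prefixes and the `expand_name` helper are hypothetical.

```python
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "hr": "http://example.org/hr#",        # hypothetical: employee records
    "lib": "http://example.org/library#",  # hypothetical: library records
}


def expand_name(qname):
    """Expand a prefixed name such as 'dc:title' into a full URI."""
    prefix, _, local = qname.partition(":")
    if prefix not in NAMESPACES or not local:
        raise ValueError(f"unknown prefix or missing local name: {qname!r}")
    return NAMESPACES[prefix] + local


if __name__ == "__main__":
    print(expand_name("dc:title"))
    # The same local name "id" yields two distinct identifiers.
    print(expand_name("hr:id"), expand_name("lib:id"))
```

An employee ID of 123 and a library member ID of 123 therefore never collide, because each is qualified by its own namespace.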

Page 24: Relevance of clasification and indexing

All these are based on knowledge representation algorithms, that is, weak AI.

The primary facilitators of this technology are URIs, which identify resources, along with XML and namespaces.

These, with a bit of logic, form RDF, which can be used to say anything about anything.

FOAF: A popular application of the semantic web is Friend of a Friend (FOAF), which describes relationships among people and other agents in terms of RDF.

Page 25: Relevance of clasification and indexing

The web is changing and offering new possibilities for communication and interaction by combining the concepts on the web. This is made possible by XML.

XML provides an interoperable syntactical foundation that makes it possible to represent relationships and build meaning.

Page 26: Relevance of clasification and indexing

RDF is a standard for describing resources that exist on the web; it is commonly written in XML syntax.

RDF is a model for such relationships and their interchange. RDF is the standard interchange format on the semantic web. Once information is in RDF form, it becomes easy to process, since RDF is a generic format.

It is a model of (s p o) triples, with p naming the relationship between s and o.

RDF is a graph: i.e., a set of RDF statements is a directed, labeled graph

- the nodes represent the resources that are bound

- the labeled edges are the relationships with their names
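The graph view can be made concrete with a minimal sketch that stores (s p o) triples and follows labeled edges from node to node. The `ex:` resources and the `build_graph` and `objects` helpers are invented for illustration; a real application would use an RDF library.

```python
from collections import defaultdict

# Hypothetical statements: (subject, predicate, object).
triples = [
    ("ex:report", "dc:title", "Annual Report"),
    ("ex:report", "dc:creator", "ex:alice"),
    ("ex:alice", "foaf:name", "Alice"),
]


def build_graph(triples):
    """Index triples as a directed, labeled graph: node -> [(edge label, target)]."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


def objects(graph, subject, predicate):
    """Follow the edges labeled `predicate` that leave `subject`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]


if __name__ == "__main__":
    graph = build_graph(triples)
    # Walk two edges: report -> creator -> name.
    for creator in objects(graph, "ex:report", "dc:creator"):
        print(objects(graph, creator, "foaf:name"))
```

Each predicate is the name of an edge, so walking from the report to its creator and then to the creator's name is just a path through the graph.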

Page 27: Relevance of clasification and indexing

With an RDF application, it is easy to know which bits of data are the semantics of the application, and which bits are just syntactic fluff.

RDF statements describe a resource, the resource's properties and the values of those properties.

RDF statements are often referred to as “triples” that consist of a subject, predicate and object, which correspond to a resource (subject), a property (predicate) and a property value (object).

Page 28: Relevance of clasification and indexing

This piece of RDF basically says that this article has the title "The Semantic Web: An Introduction", and was written by someone whose name is "Sean B. Palmer". Here are the triples that this RDF produces:

<> <http://purl.org/dc/elements/1.1/creator> _:x0 .
this <http://purl.org/dc/elements/1.1/title> "The Semantic Web: An Introduction" .
_:x0 <http://xmlns.com/0.1/foaf/name> "Sean B. Palmer" .

<rdf:Description rdf:about="http://www.ivan-herman.net">
  <foaf:name>Ivan</foaf:name>
  <abc:myCalendar rdf:resource="http://…/myCalendar"/>
  <foaf:surname>Herman</foaf:surname>
</rdf:Description>

Page 29: Relevance of clasification and indexing

A URI is simply a web identifier, like the strings starting with “http:” or “ftp:”. Anyone can create a URI, and the ownership of URIs is clearly delegated, so they form an ideal base technology on which to build a global web.

Resources on the web are identified by URIs, which use a global naming convention.

The W3C maintains a list of URI schemes.

- URIs made the merge possible
- URIs ground RDF in the Web
- URIs make this the Semantic Web
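The parts of a URI, including the scheme such as "http" or "ftp", can be inspected with Python's standard library. The `describe_uri` helper is a hypothetical wrapper written for this example, and the FTP address is invented.

```python
from urllib.parse import urlparse


def describe_uri(uri):
    """Split a URI into its scheme, naming authority and path."""
    parts = urlparse(uri)
    return {"scheme": parts.scheme, "authority": parts.netloc, "path": parts.path}


if __name__ == "__main__":
    print(describe_uri("http://purl.org/dc/elements/1.1/title"))
    print(describe_uri("ftp://example.org/pub/file.txt"))
```

The authority part is where ownership is delegated: whoever controls the domain decides which names exist beneath it.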

Page 30: Relevance of clasification and indexing

- Ontological analysis clarifies the structure of knowledge
- An ontology is defined as the terms used to describe and represent an area of knowledge
- These are explicit specifications of a conceptualization
- Ontology is the study of the ‘categories of things that exist or may exist in some domain’
- A common ontology defines the vocabulary with which queries and assertions are exchanged among agents
- These are the rules that help integration and operate on a globally shared theory
- Often equated with taxonomic hierarchies of classes, but need not be limited to this form, as it adds knowledge about the world

Page 31: Relevance of clasification and indexing

The semantic Web is generally built on syntaxes that use URIs to represent data, usually in triple-based structures, i.e. many triples of URI data that can be held in databases or interchanged on the WWW using a particular syntax developed especially for the task. These syntaxes are called “Resource Description Framework” (RDF) syntaxes.

The application of the Semantic Web is to create relations among resources on the Web and to interchange those data, like (hyper)links on the traditional web, except that:

- there is no notion of a “current” document; i.e. the relationship is between any two resources

- a relationship must have a name: a link to my CV should be differentiated from a link to my calendar

- there is no attached user-interface action like for a hyperlink

Page 32: Relevance of clasification and indexing

- Map the various data onto an abstract data representation: make the data independent of its internal representation
- Merge the resulting representations
- Start making queries on the whole: queries that could not have been done on the individual data sets
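The map, merge and query steps can be sketched with two small sets of triples. The data sets, the `ex:` identifiers and the `match` and `creator_names` helpers are hypothetical; the point is only that the shared identifier `ex:ranganathan` is what lets the merged data answer a question neither source can answer alone.

```python
# Data set 1 (hypothetical library catalogue), already mapped to triples.
library = {
    ("ex:book1", "dc:title", "Prolegomena to Library Classification"),
    ("ex:book1", "dc:creator", "ex:ranganathan"),
}

# Data set 2 (hypothetical people directory), mapped to the same abstract form.
people = {
    ("ex:ranganathan", "foaf:name", "S. R. Ranganathan"),
}

# Merging is a plain set union, because both sides use the same identifiers.
merged = library | people


def match(triples, s=None, p=None, o=None):
    """Return the triples that match a pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for ts, tp, to in triples
        if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o)
    ]


def creator_names(triples, title):
    """Find the names of the creators of the work with the given title."""
    names = []
    for book, _, _ in match(triples, p="dc:title", o=title):
        for _, _, person in match(triples, s=book, p="dc:creator"):
            names += [name for _, _, name in match(triples, s=person, p="foaf:name")]
    return names


if __name__ == "__main__":
    title = "Prolegomena to Library Classification"
    print(creator_names(library, title))  # the catalogue alone has no names
    print(creator_names(merged, title))   # the merged data does
```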

Page 33: Relevance of clasification and indexing
Page 34: Relevance of clasification and indexing
Page 35: Relevance of clasification and indexing

The Web lacks the coordination and organization of a traditional library.

Practice has shown that traditional library tools and techniques can be a great help in taming the Net.

The IFLA Information Technology section, with the support of the Cataloguing section, the Classification and Indexing section, and the Knowledge Management section, proposes the creation of a Semantic Web Special Interest Group (SWSIG) within IFLA.

The SWSIG intends to be a platform where interested professionals could gather, and undertake whatever tasks are needed to develop, enhance and facilitate the adoption of semantic Web technologies in the library community.

Librarians should start research projects to develop better techniques for organizing the web. Modern classification research must find order, especially in the context of the complexities of the Internet.