Project Finaal Verslag (Project Final Report)

8/3/2019

    Ghent University

    Architectural report

    Semantic annotation service

    Group 7

    Authors:

    Boghaert Michiel

    Goossens Sander

    Hebben Stan

    Heyse Tom

    Taelman Stijn

    Van Otten Neri

    Vandermeiren Maarten

    Vandewalle Bram


Contents

I Introduction

II State of the Art

1 Science
  1.1 (Semi-)automatically annotating a web page
    1.1.1 Survey of Semantic Annotations Platform [45]
    1.1.2 SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation [26]
    1.1.3 Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis [40]
    1.1.4 Conclusion
  1.2 Extracting information using visual representation
    1.2.1 Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification [36]
    1.2.2 Extracting Content Structure for Web Pages based on Visual Representation [21]
    1.2.3 HTML page analysis based on visual cues [52]
    1.2.4 Conclusion
  1.3 Ubiquitous content delivery
    1.3.1 Adapting Content for Wireless Web Services [42]
    1.3.2 Adapting Web Content to Mobile User Agents [37]
    1.3.3 Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices [22]
    1.3.4 Annotation-Based Web Content Transcoding [33]
    1.3.5 A Two Layer Approach for Ubiquitous Web Application Development [25]
    1.3.6 Customization for Ubiquitous Web Applications - A Comparison of Approaches [28]
    1.3.7 Conclusion
  1.4 Tolerable Waiting Time
    1.4.1 A Study on Tolerable Waiting Time: How Long Are Web Users Willing to Wait? [41]
    1.4.2 Akamai and JupiterResearch: 4 seconds [8]
    1.4.3 Conclusion
  1.5 Annotations
    1.5.1 Microformats: The Next (Small) Thing on the Semantic Web? [34]
    1.5.2 hGRDDL: Bridging Microformats and RDFa [20]
    1.5.3 Conclusion
  1.6 European Projects
    1.6.1 INSEMTIVES
    1.6.2 QuASAR
    1.6.3 Conclusion
  1.7 Conclusion

2 IPR - Patents
  2.1 Analyzing and annotating web content
    2.1.1 Method for scanning, analyzing and rating digital information content [50]
    2.1.2 Web page annotation systems [38]
    2.1.3 Method for annotating web content in real time [30]
  2.2 Invocation of the engine
    2.2.1 Web page annotating and processing [48]
  2.3 Ubiquitous content delivery
    2.3.1 Web content adaptation process and system [24]
    2.3.2 Web server for adapted web content [23]
    2.3.3 Web content transcoding system and method for small display device [47]
  2.4 Conclusion

3 Standard bodies
  3.1 RDFa (Resource Description Framework in attributes)
  3.2 Microformats
  3.3 Microdata (HTML5)
  3.4 DAML (DARPA Agent Markup Language)
  3.5 Conclusion

4 Professional organizations
  4.1 Reuters OpenCalais
  4.2 Ontotext KIM Platform
  4.3 Ontoprise GmbH Semantic Contents Analytics
  4.4 iQser GIN Platform
  4.5 Annotea
  4.6 Conclusion

5 Market reports

6 Industry trends
  6.1 Google's Rich Snippets
  6.2 Del eTools - eLearning Annotation Web Service
  6.3 OSA Web Annotation Service
  6.4 Conclusion

7 Conclusion

III Vision

8 Vision
  8.1 Mission Statement
  8.2 Customers and benefits
  8.3 Key factors to judge quality
  8.4 Key features and technology
  8.5 Crucial factors as applicable

IV Scenarios

9 Use cases
  9.1 Use case diagram
    9.1.1 Actors
    9.1.2 Use cases
  9.2 Use case scenarios
    9.2.1 Perform semantic analysis
    9.2.2 Choose performance and accuracy
    9.2.3 Correct annotations
    9.2.4 Add annotated page for machine learning
    9.2.5 Management of rule-based methods
    9.2.6 Add functionality
    9.2.7 Install and run locally

10 Quality attribute scenarios
  10.1 Scalability
  10.2 Performance
  10.3 Modifiability
  10.4 Accuracy
  10.5 Availability
  10.6 Completeness
  10.7 Stability
  10.8 Extensibility

V Architectural design

11 Global overview
  11.1 System Overview

12 Attribute Driven Design - Cycle I
  12.1 Inputs for the system
  12.2 Architectural drivers
  12.3 Architectural pattern
    12.3.1 Chosen pattern
    12.3.2 Instantiation
  12.4 White Box Scenarios
  12.5 Deployment

13 Attribute Driven Design - Cycle II
  13.1 Access Layer
    13.1.1 Inputs for the system
    13.1.2 Architectural drivers
  13.2 Application Layer
    13.2.1 Inputs for the subsystem
    13.2.2 Architectural drivers
    13.2.3 Architectural pattern
  13.3 Persistence Layer
    13.3.1 Inputs for the system
    13.3.2 Architectural drivers
    13.3.3 Architectural pattern
  13.4 White Box Scenarios
  13.5 Deployment

14 Attribute Driven Design - Cycle III
  14.1 Engine
    14.1.1 Inputs for the system
    14.1.2 Architectural drivers
  14.2 Machine Learning
    14.2.1 Inputs for the system
    14.2.2 Architectural drivers
  14.3 White boxes

15 Global system

VI Conclusion


    Part I

    Introduction


The purpose of this project is to design an automatic semantic annotation engine: an engine that adds semantic annotations to a given web page in a completely automatic way.

Annotations are metadata embedded in web pages. This metadata describes the data in a web page: annotations give information about the structural elements and the content of the page. For example, search engines use annotations in web pages to better understand the content of the pages they index.
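As a concrete illustration (our own example, not taken from the report or any of the cited works): the snippet below holds the same HTML fragment twice, once plain and once with schema.org microdata annotations added. The event name and the choice of vocabulary are hypothetical.

```python
# Hypothetical example of semantic annotation: the same HTML fragment
# without and with schema.org microdata (one of several vocabularies;
# see also RDFa and microformats in the Standard bodies chapter).
plain = '<div><h1>Gent Jazz</h1><p>2010-07-08</p></div>'

annotated = (
    '<div itemscope itemtype="https://schema.org/Event">'
    '<h1 itemprop="name">Gent Jazz</h1>'
    '<p itemprop="startDate">2010-07-08</p>'
    '</div>'
)
```

A search engine parsing the annotated fragment can tell that "Gent Jazz" is the name of an event starting on 2010-07-08, rather than arbitrary text.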

Many applications can benefit from the project. For example, the engine can be part of a larger system (figure 1) that adapts a web application to the device used to visit it. First, the engine annotates the web pages to mark the different structural and content elements. Based on these annotations, those elements can be reordered according to their function and importance, and adapted so that the web application is displayed optimally on the device.

As said before, the focus of the project is the engine itself. It will take a web page as input and annotate it completely automatically. Important factors for the success of the project are the correctness, accuracy, and completeness of the semantic annotations: adding the correct annotations in the correct places without leaving too many out.

This document gives an overview of the different stages of the architectural design. We start with a State of the Art analysis, in which we investigate existing technologies and patents, the market, and existing companies and organisations. Next, we capture the requirements by defining the use cases and the quality attributes that are important for the engine. Based on these quality attributes, we design the actual architecture in several iterations of the Attribute Driven Design (ADD) method, verifying our decisions after each iteration with white box scenarios and deployment diagrams. We finish by looking back at what we have accomplished and what remains as future work.


    Figure 1: The engine as part of a larger system.


    Part II

    State of the Art


    Chapter 1

    Science

    1.1 (Semi-)automatically annotating a web page

There are many techniques for identifying the different content and structural parts of a web page in order to add annotations. The following articles each propose one or more such techniques.

    1.1.1 Survey of Semantic Annotations Platform [45]

This paper offers an overview of the semi-automatic annotation platforms that existed in 2005. Semi-automatic means that each of these platforms requires human interaction at some stage of the annotation process.

Chapter 2 is the most interesting part for the project. It offers an overview of three commonly used annotation approaches. The first is pattern-based: the annotation engine discovers patterns or tries to find predefined ones. The second is machine learning-based; platforms using this technique rely on two methods, probability and induction. The final approach, called multi-strategy, tries to combine the advantages of both pattern-based and machine learning-based systems.
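The pattern-based approach can be sketched roughly as follows. This is our own minimal illustration, not code from any surveyed platform; the regular expressions and the itemprop names are assumptions.

```python
import re

# Sketch of a pattern-based annotator: predefined regular-expression
# patterns map text fragments to concepts, and each match is wrapped in
# a microdata-style annotation. Patterns and property names are
# illustrative, not taken from the survey.
PATTERNS = {
    r"\b\d{4}-\d{2}-\d{2}\b": "date",
    r"\b[\w.+-]+@[\w-]+\.[\w-]+\b": "email",
}

def annotate(text: str) -> str:
    for pattern, prop in PATTERNS.items():
        text = re.sub(
            pattern,
            lambda m, p=prop: f'<span itemprop="{p}">{m.group(0)}</span>',
            text,
        )
    return text

print(annotate("Contact info@example.com before 2010-05-01."))
```

A real engine would combine many such rules with machine-learned extractors, which is exactly the multi-strategy idea described above.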

Chapter 3 compares different platforms: Armadillo [27], AeroDAML [35], KIM [44], MnM [51], MUSE [39], Ont-O-Mat [31] and SemTag [26]. Chapter 4 offers a performance evaluation of each platform, and chapter 5 explains the methods used by each platform. Chapter 6 contains the conclusion and restates the three categories of annotation approaches. The authors of this paper also wrote a chapter about Semantic Annotation Platforms in the book Web semantics and ontology [49]; it contains approximately the same information as this paper.

This paper briefly introduces some interesting methods and approaches for adding semantic annotations, and compares seven semi-automatic annotation platforms. The ideas behind these platforms may serve as a launchpad for the project.

1.1.2 SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation [26]

This document introduces a system that processes HTML data and annotates it with semantic data in a fully automatic way, provided the necessary metadata is supplied manually beforehand. The system allows queries to be executed that retrieve data using the assigned tags.

All data is pre-calculated and all documents are processed in one large batch operation. This batch processing is necessary to perform proper analysis on the data. The analysis uses the words


before and after the current word as context, and can recognize correlations to detect which parts of the context are significant for extracting the proper meaning of the word.

This system is quite different from ours, especially since all documents are processed at the same time, and beforehand. Additionally, only content-based annotations are added, so it has different applications than the project. Still, some of its algorithms could be interesting to study when we develop an implementation of our system.

1.1.3 Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis [40]

This article addresses the problem of automatically annotating an HTML document, namely how to construct a semantic partition tree. The proposed annotation engine works from two observations about semantically related items in HTML documents.

The first observation is that related items exhibit consistency in presentation style (font, hyperlinks, ...). Second, related items exhibit spatial locality: they are located close to each other in the document. This structural analysis can be exploited to form an idea of which items on a page are related and should remain together.
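The spatial-locality observation can be sketched as a simple grouping pass. This is our own illustration, not the paper's algorithm; the coordinates and the gap threshold are assumptions.

```python
# Sketch of spatial locality: items whose rendered vertical positions
# are close together are grouped into the same partition. The threshold
# of 20 pixels is an illustrative assumption.
def partition_by_locality(items, max_gap=20):
    """items: list of (top_y, label) pairs; returns lists of labels."""
    groups, current = [], []
    last_y = None
    for y, label in sorted(items):
        if last_y is not None and y - last_y > max_gap:
            groups.append(current)  # gap too large: start a new group
            current = []
        current.append(label)
        last_y = y
    if current:
        groups.append(current)
    return groups

print(partition_by_locality([(0, "title"), (15, "byline"),
                             (120, "para1"), (130, "para2")]))
# → [['title', 'byline'], ['para1', 'para2']]
```

A full implementation would combine this with the presentation-style consistency check from the first observation.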

This structural technique can be wrong, however, so the article proposes semantic analysis techniques as a complement. Lexical association is a linguistic processing tool that identifies small segments of related text (based on common words or synonyms); it can be implemented using WordNet [19]. Concept association uses domain knowledge encoded in an ontology to match partitions with certain concepts. The two association methods can be combined to obtain better results.

The methods described are very useful for the project: they give an idea of how to write an algorithm that partitions pages into semantically related pieces. On the other hand, the authors do not specify which annotation technologies or which vocabulary are used.

    1.1.4 Conclusion

For the project it will most likely be important to combine a wide range of techniques and algorithms in order to add useful semantic annotations. Survey of Semantic Annotations Platform [45] describes basic ideas and approaches for adding semantic annotations. It also describes how annotations could be added in a semi-automatic way, a very useful addition that we should consider implementing, as it gives developers the opportunity to choose how their web pages are annotated.

SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation [26] implements a system slightly different from ours, but it contains algorithms worth studying. The most useful article in this section is Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis [40], as it shows how to write algorithms that partition pages into components that semantically fit together.

    1.2 Extracting information using visual representation

An important technique for identifying the different sections of a web page is to first render the page; this visual representation can then be used to extract information from it. We found several articles presenting different techniques for extracting information from a visual representation.


    Figure 1.1: The vision-based page segmentation algorithm

1.2.1 Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification [36]

This article focuses on recognizing visually important areas for a better classification of web pages. It describes a possible representation of a web page in which objects are placed into a well-defined tree hierarchy according to where they belong in the HTML structure of the page. Each object carries information about its position in the browser window. This visual information enables the definition of heuristics for recognizing common areas such as the header, footer, left and right menus, and center of the page. The crucial difficulty the authors discovered was developing a sufficiently good rendering algorithm, i.e. one imitating the behavior of popular user agents such as Internet Explorer.

The article describes in detail how a hierarchical tree can be constructed from an HTML structure. This tree structure can then be used to find key features such as headers and footers, and this information can then be semantically annotated in the HTML code.
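Position-based heuristics of the kind described here could look roughly like this. This is a hedged sketch of our own; the thresholds are illustrative assumptions, not values from the paper.

```python
# Sketch of position-based heuristics for common areas: classify a
# rendered block by where it sits in the browser window. All threshold
# fractions (0.1, 0.2, 0.5, 0.8, 0.9) are illustrative assumptions.
def classify_area(x, y, w, h, page_w, page_h):
    """(x, y) is the block's top-left corner; (w, h) its rendered size."""
    if y < 0.1 * page_h and w > 0.8 * page_w:
        return "header"        # wide block at the very top
    if y + h > 0.9 * page_h and w > 0.8 * page_w:
        return "footer"        # wide block at the very bottom
    if x < 0.2 * page_w and h > 0.5 * page_h:
        return "left-menu"     # tall block hugging the left edge
    if x + w > 0.8 * page_w and h > 0.5 * page_h:
        return "right-menu"    # tall block hugging the right edge
    return "center"
```

The difficulty the paper reports is not in these heuristics but in obtaining reliable (x, y, w, h) values, i.e. in rendering the page the way popular user agents do.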

1.2.2 Extracting Content Structure for Web Pages based on Visual Representation [21]

This article focuses on a vision-based page segmentation algorithm, illustrated in figure 1.1.

The algorithm identifies the logical relationships within web content. Based on visual layout information, the extracted content structure can effectively represent the semantic structure of the web page. The authors present an automatic, top-down, tag-tree independent and scalable algorithm to detect this structure; it simulates how a user understands the layout of a web page based on its visual representation.

Compared with traditional DOM-based segmentation methods, the scheme utilizes useful visual cues to obtain a better partition of a page at the semantic level. It is also independent of physical realization and works well even when the physical structure differs greatly from the visual presentation. Test results show that human judges rated 86 of the 140 rendered test pages as perfect, 50 as satisfactory, and only 4 as failed. The algorithm can therefore be seen as having a 97% success rate (136 of 140 pages).

This paper provides a measure of accuracy for a good rendering algorithm, which could, along with other techniques, be implemented in the engine.


    1.2.3 HTML page analysis based on visual cues [52]

This paper presents a method to extract structures from HTML pages automatically, without any a priori knowledge. The analysis is based on visual similarity between different web pages and their organization. The method requires a few definitions concerning the comparison of both simple objects and container objects; using these, it tries to detect visual similarity patterns in the HTML document. A large part of the paper describes technical details which we will not repeat here. The test results were quite impressive: 92% of the documents were correctly analysed, 4% missed some apparent structural elements, and parsing failed for the remaining 4%. The test set was, however, rather small: only 50 web pages.

This paper offers a very interesting and promising method of detecting structural elements in a web page. However, a few remarks apply to the project. First of all, it is a rather old paper (dating from 2001), so the web pages in its test set differ greatly from current web pages: they were less dynamic, contained little or no CSS, no Flash or other plugins, and had a generally simpler structure. This means finding patterns may be harder for us, or even impossible. Secondly, this method only detects structural elements, while we may want to add more semantic annotations.

    1.2.4 Conclusion

The visual representation of a web page contains many cues. HTML page analysis based on visual cues [52] is an older paper that tries to recognize visually interesting areas using a few predefined definitions of visual similarity. Although this method failed for only 4% of the test pages, it is not extensible: as the World Wide Web evolves, it would continuously need updating. It could, however, be used in combination with other, more extensible techniques.

The two other articles in this section construct tree structures from the HTML code in order to detect visually interesting areas. Extracting Content Structure for Web Pages based on Visual Representation [21] goes a step further by implementing a vision-based segmentation algorithm that combines the visual information with the DOM tree to obtain a different hierarchical structure. This method has a 97% success rate, which is a good target for the engine to strive for, though probably not in the initial implementation.

    1.3 Ubiquitous content delivery

An important application of semantic annotations is ubiquitous content delivery, which means delivering content to all kinds of devices. To do so, it is necessary to adapt the content to the device used. We found quite a few articles presenting different ways to adapt content to different devices; one of them uses annotated web pages.

    1.3.1 Adapting Content for Wireless Web Services [42]

This article mainly covers the rendering of web pages for specific devices, taking the properties of each device into account. It notes that rendering based on added annotations produces better results than using only basic HTML tags. Such a dual-purpose page is hard to handle, however, because both the HTML tags and the annotation tags have to be taken into account.

When the developer adds these annotations manually, dual-purpose pages can be difficult to handle. Our engine, on the other hand, runs in real time, which allows the annotations to be added when the page is requested. The developer can thus create web pages without having to worry about the annotations.


    1.3.2 Adapting Web Content to Mobile User Agents [37]

This article (2005) focuses on content adaptation, specifically for mobile applications. The emphasis is on reliability and interactive functionality rather than on layout.

The adaptation depends on the device, the network, the user preferences, etc., and can be static or dynamic. There are three possibilities (which can be combined): server-side, intermediate, and client-side adaptation.

Server-side adaptation gives more control to web developers, yields better performance, and uses XSLT (less bandwidth for e.g. multimedia applications). With intermediate adaptation, the processing occurs in a proxy; the developer can optionally add metadata or hints to the content, or common adaptation heuristics (e.g. on the DOM tree) can be used. The advantage of client-side adaptation is that no extra client properties need to be sent; the disadvantage is a loss of performance (processing time, memory usage, network connection, ...).

The authors of the article made a (standalone) content adaptation proxy application with the following functionalities: adaptation of web documents and media files, user session-based caching, state management, navigation generation (a list with links to delivery units (DUs)), error messages, user management, and configuration.

Their extensible application is not tuned to a specific type of web page and makes efficient use of the device's services. It is scalable and can handle multiple users in real time. The content is decomposed into DUs with added metadata such as priority or labels. The decomposition happens in two phases: in the adaptation module, the content is divided into perceivable units through semantic interpretation; in the post-processing phase, compatibility checks for the device are performed and the data can be divided into extra DUs if necessary.

The adaptation module only accepts XHTML as input (conversion to XHTML can be done with HTML Tidy); the XHTML is then converted to XHTML MP (Mobile Profile) or WML (Wireless Markup Language).

1.3.3 Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices [22]

This article describes how to improve the presentation of web pages on a mobile device, including the analysis of existing pages in order to adapt them. This is very relevant to the annotation algorithm. The algorithm specified uses an iterative approach: each iteration divides blocks into smaller blocks whose contents semantically belong together.

The algorithm has three steps. First, high-level content block detection determines which content falls into which high-level block (for example: header, footer, left/right sidebar), based on the position and shape of the region each block occupies. Next, explicit separator detection finds regions separated by, for example, hr, table, td, and div elements, or horizontal and vertical lines. Finally, implicit separator detection tries to separate regions divided by white space.
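The explicit-separator step can be sketched as a simple split pass over a flattened tag sequence. This is our own simplification, not the paper's algorithm; real separator detection would also consider rendered lines and geometry.

```python
# Sketch of explicit separator detection: split a flattened sequence of
# tags at the separator elements named in the article. A real
# implementation would work on the rendered tree, not a flat list.
SEPARATOR_TAGS = {"hr", "table", "td", "div"}

def split_on_separators(tags):
    """tags: flat list of tag names; returns the blocks between separators."""
    blocks, current = [], []
    for tag in tags:
        if tag in SEPARATOR_TAGS:
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(tag)
    if current:
        blocks.append(current)
    return blocks

print(split_on_separators(["h1", "p", "hr", "p", "p"]))
# → [['h1', 'p'], ['p', 'p']]
```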

This article is the only one that effectively uses high-level content block detection (to detect the header, footer, sidebars, etc.). The sections on separator detection are also very useful for the project's semantic annotation algorithm. On the other hand, there is no information at all on which annotations are added, or in which format.


    1.3.4 Annotation-Based Web Content Transcoding [33]

This article proposes a system to transcode web content for the different devices accessing it. The most relevant part for the project is the first, where the article proposes a framework of external annotations: existing web pages are associated with content adaptation hints stored in separate annotation files.

The authors chose external files for the annotations because they claim it would be impractical to incorporate meta-information into existing HTML documents. They use RDF (Resource Description Framework) as the syntax of the annotation files, and XPath and XPointer to associate portions of an HTML page with annotating descriptions. An annotation file can refer to more than one HTML document, and vice versa. Finally, they describe a vocabulary for the annotation files, with three kinds of annotations. The first kind specifies a list of alternative representations for an annotated element (e.g. a grayscale image versus the original image). The second kind, splitting hints, provides hints for determining appropriate page break points, so a complex HTML page can be divided into multiple pages on clients with smaller displays. The last kind, selection criteria, contains information to help the transcoding system select, from several alternative representations, the one that best suits the client device: for example the importance of an element, the resource requirements of an alternative representation, etc.

This approach differs a bit from the partition algorithms in the other articles, but the article contains some solid ideas. It may be worth combining several approaches to reach better semantic annotation results.

1.3.5 A Two Layer Approach for Ubiquitous Web Application Development [25]

This article describes two steps to adapt web applications and customize them to the specifications of the device used, by means of a web-content adaptation engine. It is stated that semantic analysis of the structure of a web page is fundamental in adapting the presentation: you must be able to distinguish the main data from the menus, sidebars, and other side information. In order to adapt the presentation layer of a web application, they consider using Microformats, HTML5, RDF, or XML to annotate the web application, which gives the content adaptation engine the information it needs to adapt the HTML files to a version optimized for the device used. The article considers using ECMA scripts, which is a good alternative, except that it is not supported on all devices because it is client-side. Further implementation specifics are not given.

This article offers a good source of information about the use of HTML5, RDF(a), or XML for this project's purpose.

1.3.6 Customization for Ubiquitous Web Applications - A Comparison of Approaches [28]

This article proposes ten techniques for customization of ubiquitous web applications. The goal is to minimize data transfer while maximizing the semantic content preserved. One of these approaches is the Global Document Annotation (GDA) project. This section can be useful for the project because it handles annotations for video, image, and text compression, thereby simplifying the web application for devices with lesser capabilities, but it does not handle visual reorganization of the web pages.

This project proposes the use of semantic annotation in the form of a separate XML file. This section of the article can be useful for the project. Three kinds of annotations are distinguished. Linguistic annotation aims at making text machine readable. Commentary annotation serves for annotating non-textual content (images, sounds, ...). Multimedia annotation handles videos. Based on this context information, the generic adaptations considered are text transcoding, image transcoding, voice transcoding, and video transcoding. This all happens dynamically.

The information about the methods used in this article will certainly be useful for the project. One downside is that this technique does not include visual separation and reorganization, but this can be obtained from several other articles in this section.

    1.3.7 Conclusion

In order to have ubiquitous content delivery, the content needs to be adapted for different devices. From Adapting Content for Wireless Web Services [42] we learn that adding annotations to an HTML page produces better rendering results for ubiquitous devices. However, a disadvantage is that pages using this dual-tag (HTML and annotations) system are a lot harder to handle. Customization for Ubiquitous Web Applications - A Comparison of Approaches [28] describes ten techniques that could be used for customization for ubiquitous devices. These techniques could easily be included in the project.

Important ubiquitous devices are mobile devices, due to their widespread popularity. The article Adapting Web Content to Mobile User Agents [37] describes different ways in which adaptation can take place and discusses the advantages and disadvantages of each. Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices [22] focuses on the detection of page structure and how pages can be adapted to small devices, using high-level content block and separator detection. In Annotation-Based Web Content Transcoding [33] a different algorithm is described that could possibly be interesting to implement.

Although A Two Layer Approach for Ubiquitous Web Application Development [25] is less specific, it does contain descriptions of rather useful technologies relevant for this project.

    1.4 Tolerable Waiting Time

The project needs to perform its task within a certain time. These articles will help us in determining the amount of time an average web user is willing to wait.

1.4.1 A Study on Tolerable Waiting Time: How Long Are Web Users Willing to Wait? [41]

This paper tries to determine the threshold for tolerable waiting time for web users when retrieving information from the World Wide Web. The results indicate that the tolerable waiting time is not affected by the availability or unavailability of pictures on the web pages, and that the inclusion of a feedback bar significantly prolonged the time users were willing to wait. The conclusion of this paper is that most users are willing to wait only 2 seconds for information retrieval from the World Wide Web.

Two seconds seems to be the waiting limit for most web users when retrieving information. Since the paper of Selvidge [46] concludes that tolerable web delay is not affected by the type of task the user is executing, the project should stay as close to the 2-second limit as possible.

    1.4.2 Akamai and JupiterResearch: 4 seconds [8]

This report, commissioned by Akamai through JupiterResearch, examines the reaction of a consumer to a poor online shopping experience. Most of the conclusions are irrelevant to this project, but their findings indicate that 4 seconds is the maximum length of time an average shopper will wait for a webpage to load before potentially abandoning a retail site.

    This paper places the threshold at 4 seconds for the project.

    1.4.3 Conclusion

It is hard to come up with an actual time period that a user is willing to wait. Both articles report a rather low value: 2 and 4 seconds. If we want to annotate in real time, we have to bear these numbers in mind.

    1.5 Annotations

The last articles deal with annotation standards which might be interesting to include in the project.

1.5.1 Microformats: The Next (Small) Thing on the Semantic Web? [34]

This article focuses on the use of Microformats. More specifically, it examines detailed examples, the general principles by which they are constructed, and the growing community of users behind this alternative to the Semantic Web. The Microformats community is an open wiki, mailing list, and Internet Relay Chat (IRC) channel that has proven remarkably scalable and accommodating.

Microformats could be used as a way of adding annotations to a web page, but due to the nature of the standard described in this article, this would be less advisable for the larger-scale project.

    1.5.2 hGRDDL: Bridging Microformats and RDFa [20]

This article proposes hGRDDL, a name chosen because it is meant to be a GRDDL-like transformation (GRDDL is a W3C effort that aims to extract RDF triples from any XML document) with a focus on processing Microformats. The mechanism transforms HTML-embedded structured data such as Microformats into RDFa. This technique has many advantages: Microformats are already widely used, and converting them to RDFa allows this data to be preserved while allowing new deployments to focus on RDFa's great extensibility and consistency.

It would be unwise to neglect Microformats, as developers have used them to add extra information to existing web pages. This article describes how hGRDDL can be used to transform this data into the more useful RDFa standard.
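The flavour of rewrite hGRDDL performs can be sketched as a mapping from Microformat class names to RDFa property attributes. The real hGRDDL transformation is XSLT-based and far more complete; the class-to-property table and the "v:" prefix below are assumptions for demonstration only:

```python
import re

# Hypothetical mapping from Microformat class names to RDFa property
# names; the "v:" vocabulary prefix is illustrative, not hGRDDL's.
CLASS_TO_RDFA = {"fn": "v:name", "org": "v:org"}

def microformat_to_rdfa(html: str) -> str:
    """Rewrite class="..." attributes into property="..." where a
    mapping is known; leave everything else untouched."""
    def replace(match):
        prop = CLASS_TO_RDFA.get(match.group(1))
        return f'property="{prop}"' if prop else match.group(0)
    return re.sub(r'class="(\w+)"', replace, html)

print(microformat_to_rdfa('<span class="fn">Jane Doe</span>'))
# -> <span property="v:name">Jane Doe</span>
```

Unmapped classes pass through unchanged, which mirrors the idea that the transformation preserves existing Microformat deployments rather than replacing them.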

    1.5.3 Conclusion

The first article on Microformats shows that it would be an unwise choice to let the engine use Microformats to add annotations to web pages, due to the limiting nature of the format. The second article focuses on converting Microformats to RDFa using hGRDDL. This is very relevant to the project: if we use this technique, it will allow the engine to adapt web pages to the specifications of the developers who use Microformats as their annotation of choice.

    1.6 European Projects

    There are several European projects going on that deal with semantic annotations.


    1.6.1 INSEMTIVES

Incentives for Semantics (INSEMTIVES) is a five-million-euro European project funded in 2009 under the 7th FWP (Seventh Framework Programme). Several universities and organisations from over six different countries in Europe are participating. The aim of INSEMTIVES is to bridge the gap between human and computational intelligence in the current semantic content authoring landscape. They are developing methodologies for the creation of semantic metadata for different types of web resources.

They investigate applicable social and economic incentives, notably in the areas of ontology engineering and the semantic annotation of media and web services, to motivate user participation in these inherently human-driven tasks. [6]

    1.6.2 QuASAR

Quality Assurance of Semantic Annotations for Services (QuASAR) is a project of the School of Computer Science, University of Manchester. The QuASAR project aims to provide a toolkit to assist in the cost-effective creation and evolution of reliable semantic annotations for web services. In particular, they have developed tools to assist human annotators in verifying the annotations they develop before they are deployed into public repositories, and to gain maximum value from manually created annotations by using them as the starting point from which to infer new annotations.

Although QuASAR aims at optimizing manually created annotations, the tools they have developed could be very interesting for the project. We might be able to use them to check whether the annotations the engine produces are trustworthy.

    1.6.3 Conclusion

A lot of research on the development of the Semantic Web and semantic annotations is being done by larger and smaller projects across Europe. Mostly their research area is much broader and semantic annotation is just one piece of a larger project. But it does show that the Semantic Web is a hot research topic and that we are not the only ones working on semantic annotation. It would be good to keep an eye on some of these projects and what they accomplish, as we might be able to use some of their work.

    1.7 Conclusion

Web pages can be annotated using a large variety of techniques. This can be done manually by the developer, semi-automatically, or completely automatically. Each technique has a large number of options. These are described in detail in the different articles. We will most likely not pick one specific technique for the project but combine several approaches to strive for a better annotation result.

One very useful way of following the automatic approach while still keeping the end user in mind is to add semantic annotations based on the visual representation of a web page.

Several projects are under way across Europe and may deliver interesting results for this project.

Lastly, we examined Microformats as a possible standard. Due to the large community of active users, this technique should not be forgotten, as it can provide valuable semantic information.


    Chapter 2

    IPR - Patents

    2.1 Analyzing and annotating web content

We have found several patents concerning the analysis and annotation of web content.

2.1.1 Method for scanning, analyzing and rating digital information content [50]

This patent describes a method to analyse data and classify it. Data is matched against multiple regular expressions. Those results are processed by a neural network, which has been trained on a large dataset. Although it is described in the patent as a method to classify pornographic or offensive content, it could be used for other types of content classification as well.

The algorithm that matches text against a word database or regular expression database, the results of which are afterwards processed by a neural network, is patented. We have to make sure we do not use such an algorithm in the implementation.

    2.1.2 Web page annotation systems [38]

This patent describes an annotation technique which relies on a data processing system. This system acts as a proxy and retrieves data from the internet per user request. The system then analyses the web page content to select, by subject matter, at least one product class of a plurality of product classes represented in a product classification database. The annotations that will be available for display are each associated with a display condition dependent on one or more product items in the database. If these conditions are satisfied, the annotation data will be included in the document that is supplied to the end user.

This patent concerns a very specific way of annotating: a proxy server with a product database. Furthermore, it does not annotate any semantic data but only adds extra data related to certain products. Based on these observations, this patent won't pose a problem.

    2.1.3 Method for annotating web content in real time [30]

In this patent the inventor describes a method for annotating web content in real time via a client interface. Users can add annotations to (portions of) web pages after retrieving them, and share the annotations with other users. The intention is to inform the other users (e.g. about the reliability of e-commerce pages).

Since these annotations are not added automatically and the purpose is different from ours, this patent won't be a problem.


    2.2 Invocation of the engine

    We found a patent which might be interesting concerning the way we would invoke the engine.

    2.2.1 Web page annotating and processing [48]

This patent presents a system and method for associating annotations, modifications, and other information with a web page. More specifically, a redirector is patented. If a user tries to access a web document on the internet using a specific URL, the request is intercepted by the redirector. The redirector modifies the requested web document and returns the modified document to the user. These modifications may include annotations, comments on the document, etc. The redirector is implemented as a web service, so no modifications to available browsers or server software are needed.

If the engine is invoked by intercepting a user's request and redirecting that request to the engine, this patent could be violated. We are intending to implement the engine as a web service. On the other hand, the redirection itself is not part of the project. So, depending on how the patent is exactly stated, external applications using the engine could violate this patent, and it is important for the developers of those applications to be aware of it.

    2.3 Ubiquitous content delivery

As mentioned before (section 1.3), an important application of semantic annotations is ubiquitous content delivery: the annotations in a web page can be used to render the page specifically for the device used. There are several patents about the rendering and adaptation of such web pages which might be interesting for us.

    2.3.1 Web content adaptation process and system [24]

This patent covers content adaptation for devices with small displays (e.g. mobile devices). The adaptation process receives information about the display characteristics of the target device and a web page to be transformed. Next, the web page is adapted and stored in a database.

Every application which uses the annotations to adapt the annotated web page to the device used would violate this patent. Implementations will likely have to make an arrangement with the inventors.

    2.3.2 Web server for adapted web content [23]

This patent describes a web server architecture for delivering web content adapted to mobile devices. A database is used which stores multiple adapted versions of web pages. These adapted versions can be pre-calculated or calculated on the fly if necessary. A method is described to classify web page content and to adapt the page for a certain screen size.

The patent covers the server which stores multiple adapted versions of a web page in a database, detects display characteristics of the requesting device, and transmits the cached or dynamically adapted version of the application back to the mobile device. Although content analysis is not patented on its own, many applications using the system could violate the patent.

2.3.3 Web content transcoding system and method for small display device [47]

This patent describes two methods to generate a version of a large web page adapted to a small screen size. Basically, HTML is preprocessed, the HTML data is analysed to categorize the content blocks, a processing step is performed on this data (this step differs somewhat between the two versions), and voice mark-up and adapted HTML code are generated.

The patent covers only these methods as described in the patent. This patent won't cause us any problems. Again, content analysis is not patented; only a specific set of applications of it would infringe the patent.

    2.4 Conclusion

There are some existing patents concerning the analysis and annotation of web content. Although they are all really focused on a particular problem, they may cause problems for the users of the system.

Two of them deal with searching for specific content in a text [50, 38]. The described methods might be interesting for us and we should take these patents into account, but we are not only focusing on the content. The third patent [30] requires the annotations to be added manually, which isn't the purpose of the project.

The other patents are less important for us but more for our customers, as they will use the engine and the annotated web pages. Especially for rendering these web pages on a specific device, there are quite some patents that should be taken into account.


    Chapter 3

    Standard bodies

The following paragraphs describe a few annotation technologies which might be interesting for the project. We mention the benefits and drawbacks of each proposed technology.

    3.1 RDFa (Resource Description Framework - in - attributes)

RDFa [18, 11, 29], developed and proposed by the W3C, is a set of rules that can be used as a module for XHTML2. It reuses attributes from standard XHTML meta and link elements and applies them to all other XHTML elements, so one can annotate XHTML markup with semantic information. The ultimate goal of RDFa is to make any RDF structure representable in pure XHTML. This allows the author to use a predefined set of rules to mark up just about anything.

The benefits of RDFa:

Publisher independence: Publishers are independent, and each website is allowed to use its own standards.

Self containment: The RDF triples are separated from the (X)HTML content.

Modularity: The modularity of the schema makes attributes reusable.

No duplication: There is no need to create separate XML and HTML sections with the same content, so there is no duplicated data.

Modifiability: Additional fields can easily be added, and XML transforms can extract the semantics of the data from an HTML file.

Widely supported: It is developed by the W3C, and Google will use RDFa soon, so there will be no lack of support.

    But there are some drawbacks too:

Incompatible tools: Some XHTML cleaning tools (used to create well-formed content) can break the embedded RDF semantics.

Complexity: It is more complicated than Microdata and Microformats.

    Usage: Will it ever be used worldwide?
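To make the RDFa approach concrete, here is a small fragment using Dublin Core terms, plus a toy extractor that collects (property, text) pairs. A real RDFa consumer would resolve prefixes against the declared namespaces; this sketch only scans attributes:

```python
from html.parser import HTMLParser

# RDFa annotations live in ordinary markup attributes; dc: is the
# Dublin Core vocabulary declared on the enclosing div.
RDFA_SNIPPET = """<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <span property="dc:title">Semantic annotation service</span>
  <span property="dc:creator">Group 7</span>
</div>"""

class RdfaExtractor(HTMLParser):
    """Collect (property, text) pairs from property attributes."""
    def __init__(self):
        super().__init__()
        self._current = None
        self.triples = []

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("property")

    def handle_data(self, data):
        if self._current and data.strip():
            self.triples.append((self._current, data.strip()))
            self._current = None

extractor = RdfaExtractor()
extractor.feed(RDFA_SNIPPET)
print(extractor.triples)
```

The point of the example is that the semantic layer is added purely through attributes, so the visible content of the page does not change.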


    3.2 Microformats

Microformats (μF) [17, 10, 29] is a web-based approach to semantic markup which aims to reuse existing HTML/XHTML tags to convey metadata and other attributes in web pages and other contexts that support (X)HTML. This approach allows software to automatically process information intended for end users (such as contact information, geographic coordinates, calendar events, etc.). The use, adoption, and processing of Microformats enables data items to be indexed, searched for, saved, or cross-referenced, so that information can be reused or combined.

The benefits of Microformats:

Basic HTML: There is no need to learn another language if you already know HTML. Microformats use the class attribute of the different HTML tags.

Compact: Microformats use a very compact syntax.

Compatibility: Microformats are easy to add to an existing web page and work perfectly with CSS. They allow applications to use already existing technologies instead of converting data to RDF and back.

Existing technologies: They try to model and encapsulate real, existing technologies like vCard and iCal data.

Widely supported: There is very wide deployment and adoption by mainstream web designers and developers. The newest versions of browsers (like Firefox 3 and IE 8) will also provide support for Microformats (not with a plugin, as in current versions).

    Drawbacks of Microformats:

Scalability: There are scalability issues and only a limited number of Microformats (but the number is still growing).

Parsing problems: Separate parsing rules are required for each Microformat.

Inefficiency: It is quite inefficient from a parser's point of view (difficult for automated search).

Namespaces: There is no use of namespaces. You have to make sure the class you define isn't already used for another purpose (e.g. defined in CSS).
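An hCard Microformat illustrates the approach: contact data is carried by plain HTML class attributes, so existing pages need no new syntax. The extractor below is a toy; real Microformat parsers need per-format rules, which is exactly the parsing drawback noted above:

```python
from html.parser import HTMLParser

# A minimal hCard: the vcard/fn/org class names come from the
# Microformats hCard convention.
HCARD = """<div class="vcard">
  <span class="fn">Jane Doe</span>
  <span class="org">Ghent University</span>
</div>"""

class HCardParser(HTMLParser):
    """Extract the fn (formatted name) and org fields of one hCard."""
    def __init__(self):
        super().__init__()
        self._field = None
        self.card = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("fn", "org"):
            self._field = cls

    def handle_data(self, data):
        if self._field and data.strip():
            self.card[self._field] = data.strip()
            self._field = None

parser = HCardParser()
parser.feed(HCARD)
print(parser.card)  # {'fn': 'Jane Doe', 'org': 'Ghent University'}
```

Note the namespace drawback in action: nothing stops a stylesheet from already using fn or org as a class name for unrelated purposes.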

    3.3 Microdata (HTML5)

Microdata [16, 5] is a proposed feature of HTML5 intended to provide a simple way to embed semantic markup into HTML documents. Microdata can be viewed as an extension of the existing Microformat idea which attempts to address the deficiencies of Microformats without the complexity of systems such as RDFa.

    The benefits of Microdata:

Complexity: Microdata is less complex than RDFa.

Easy to use: It is easy to add markup to your pages using a few HTML attributes.

Adoption: It is already adopted by Google.

Standard: It is very likely to become an officially recommended web standard as part of HTML 5.0.

Drawbacks of Microdata:

Unfinished: HTML 5.0 is still under construction and is not a standard yet.
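A Microdata item marks its scope with itemscope/itemtype and its fields with itemprop, again using only HTML attributes. The schema.org type below is real; the one-item extraction is a simplification of the full HTML5 microdata model:

```python
import re

# One Microdata item describing a person; itemscope/itemtype/itemprop
# are the HTML5 microdata attributes.
MICRODATA = """<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="affiliation">Ghent University</span>
</div>"""

# Toy extraction: collect itemprop names with their element text.
props = dict(re.findall(r'itemprop="(\w+)">([^<]+)<', MICRODATA))
print(props)  # {'name': 'Jane Doe', 'affiliation': 'Ghent University'}
```

Compared with the Microformats example, the dedicated attributes remove the clash with CSS class names, which addresses the namespace drawback noted in section 3.2.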


    3.4 DAML (DARPA Agent Markup Language)

DAML [15, 3] is the name of a US funding program. The program focused on the creation of machine-readable representations for the web. The DAML language is being developed as an extension to XML and RDF (Resource Description Framework). It provides a rich set of constructs to create ontologies and to mark up information so that it is machine readable and understandable. Much of the effort put into DAML has now been incorporated into OWL (Web Ontology Language).

    The benefits of DAML:

Machine readable: The DAML language allows machines to make the same sort of simple inferences that human beings do.

    Validation: A validator for DAML exists.

    Extensive: DAML is more extensive than RDFa.

    Drawbacks of DAML:

    Complexity: Compared to the other standards, DAML is more complex to use.

    3.5 Conclusion

This table gives an overview of the most important characteristics of the different technologies and compares them with each other.

                      RDFa   Microformats   Microdata (HTML 5.0)   DAML
    Easy to use       No     Yes            No                     Yes
    Extensive         Yes    Yes            Yes                    Yes
    Widely supported  Yes    Yes            Yes                    No
    Embedded in HTML  No     Yes            Yes                    No
    Free to use       Yes    Yes            Yes                    Yes

Table 3.1: Overview of the different technologies and their features.

Technologies for the Semantic Web are still in development. There isn't a language that is better than the others in every aspect. If you are looking for a simple and quick way to add annotations, Microformats are probably the best option. But if you want to be thorough and have the time to do it, you should prefer RDFa or Microdata. DAML and Microformats should be your choice if you want a technology that is easy to extend and modify to your needs.

In the project we should consider using each technology, because they all have their benefits. Due to their incompleteness, Microformats and Microdata aren't really the best choice at the moment; in the future, on the other hand, they may be the most used technologies. So we have to consider everything before turning down a technology.


    Chapter 4

    Professional organizations

    4.1 Reuters OpenCalais

OpenCalais is an open web service by Thomson Reuters. This web service creates rich semantic metadata for the content you submit, in a fully automatic way. They also claim the annotation process is done in under a second. Their system relies on methods such as natural language processing (NLP), machine learning, and some other, unknown methods. OpenCalais is a free service for both non-commercial and commercial use.

Basically, what OpenCalais does is try to find certain words in the submitted content which it can link to other web pages, e.g. brands, places, specific terms... It does not add semantic data concerning the layout or structure of the web application. It does offer some extra services, though: it can deliver "Social Tags" and "Topics". Social Tags presents a list of tags related to the submitted content along with a relevance indicator. The topic(s) of the submitted content will also be delivered, along with a relevance percentage. OpenCalais uses RDF to write down its annotations.

    4.2 Ontotext KIM Platform

The KIM Platform, developed by Ontotext, is a semantic search engine. It can analyse text, either from your own documents or from the web, and provides hybrid queries to search the structured data. It is free for non-commercial use, but paid for commercial use. The KIM Platform is the result of the paper KIM - Semantic Annotation Platform [44] and uses RDF tags. It differs from this project because it doesn't annotate the structure of a web application; the annotating engine is only part of KIM.

    4.3 Ontoprise GmbH Semantic Contents Analytics

Ontoprise GmbH created Semantic Contents Analytics together with IBM. It analyses text from diverse sources (e.g. documents, e-mails, wikis, databases...) and classifies or tags documents according to its search results. This function is comparable to this project at a certain level, but Semantic Contents Analytics has quite a few other functions. It tries to extract facts from data, find links between different documents, and even tries to logically derive new information. The analysis can be helped by providing a knowledge model (ontology). It does not pay any attention to the structural elements of the texts it analyses. Semantic Contents Analytics is part of a bigger, commercial Semantic Middleware system.


    4.4 iQser GIN Platform

The iQser GIN (Global Information Network) Platform is semantic middleware with quite a lot of functions concerning data integration, process control, and information retrieval. The most interesting part is the semantic analysis. They claim their analysis process is fully automated and light-weight, and ensures high semantic validity. It does not require any preceding ontology and adapts to the current information situation. It does not provide semantic annotations concerning the structure and layout of web pages. Since 2009 the SDK of this middleware is available, but you have to pay a license fee to become a developer.

    4.5 Annotea

Annotea [2] is an open project of the W3C that enables you to annotate your web pages. This project is meant to support and demonstrate W3C standards, mainly RDF-based annotations (section 3) and XPointer. The annotations are made using RDF, and XPointer is used to define, for each annotation, to which part of the HTML page it is related. The special thing about this project is the fact that the annotations and their paths are saved separately from the web page itself. Therefore the developer is able to add annotations to web pages without actually having to edit the web page itself.

Annotea consists of two parts. First of all, the W3C offers an online editor, namely Amaya [1], where developers can add annotations to their web pages. This editor is an example implementation and can easily be replaced by any other editor with the same capabilities. The second part is the RDF-based metadata server. This server is used for storing and fetching the annotations. As mentioned before, the annotations are kept separately from the web page, so the server stores the files containing the annotations but not the web pages.

The editor and the server can communicate with each other. The editor can present a web page using the annotations fetched from the server. Conversely, the server saves the annotations made with the editor.

In the article Adapting Content for Wireless Web Services [42], they mention that it is difficult to edit dual-purpose web pages. That problem is solved here by separating the annotations from the HTML page. This method of external annotation, separated from the web page, is also presented in one of the articles [33] (section 1.3.4), which gives several (dis)advantages of this method.

This might be interesting for us because Amaya could be replaced or extended with the engine to add annotations automatically.

    4.6 Conclusion

There are a few companies which offer semantic analysis software. Most of them offer a middleware platform that analyses data beforehand and constructs a database containing all the information. None of the above-mentioned organizations offer real-time annotation or add any information concerning the structure of the web page, and none of them is doing any research in that area (at least we couldn't find any information related to this kind of research on their websites). This project seems to offer a pioneering concept, but we must be aware that these organizations may easily try to adapt their current platforms to incorporate some of our features.


    Chapter 5

    Market reports

The World Wide Web and its services are an ever-growing market with lots of opportunities. The internet landscape changes rapidly, but it can be stated that the future of the web is semantic. The Semantic Web has become a hot topic over the past decade and currently a lot of research is going on in this area, both at the business and the academic level, and new technologies are being developed. Major companies already offer Semantic Web tools or systems using the Semantic Web: Adobe, IBM, HP, Oracle, etc. [12]

Gartner, a leading IT research and advisory company, predicted in a 2007 report that during the next 10 years, web-based technologies will improve the ability to embed semantic structures. They expect that by 2017, the vision of the Semantic Web will coalesce and the majority of web pages will be decorated with some form of semantic hypertext. By 2012, 80% of public web sites will use some level of semantic hypertext to create SW documents. [12] They also state that the grand vision of the Semantic Web will occur in multiple evolutionary steps, and small-scale initiatives are often the best starting points. [14]

    Semantic annotations play a crucial role in the realisation of the semantic web, and they can be used in a wide variety of other application areas on the internet. With the ability to tag all content on the web, we can describe what each piece of information is about and give semantic meaning to each content item. Search engines will become more effective and users will be able to find the precise information they are hunting for. This will have a big impact on the internet economy, as a majority of searches are in the domain of consumer e-commerce, where a web user is looking for something to buy. Agent-enabled semantic search will have a dramatic impact on the precision of these searches. It will reduce, and possibly eliminate, the information asymmetry whereby a better-informed buyer gets the best value. [32]

    There is a need for machine-readable metadata on the web, yet nowadays it is often difficult and time-consuming to mark up data semantically. This is one of the reasons the semantic web hasn't yet been widely adopted, at least commercially. [13] An engine that does this automatically would solve this problem, and many other applications could use it to speed up and facilitate their services. Adding semantic annotations to web applications could pave the way for the semantic web, as it creates web pages that can be understood by computers, making a whole new range of applications and possibilities available, such as enhanced ubiquitous content delivery.


    Chapter 6

    Industry trends

    6.1 Google's Rich Snippets

    When you search for something on the internet using Google, Google presents you with a list of results containing a link to each web page and a short section of information about that page. With Google Rich Snippets [7], they want to give you the opportunity to help Google decide which information to display in that section.

    As a developer, you have to annotate your web pages manually. You are free to use RDFa or Microformats (section 3), because Google's Rich Snippets recognizes both. Once your web page is annotated, Google can parse it and decide which information is valuable to show in the user's search results. Based on the user's search term, Google uses your annotations to decide which part of your web page is most relevant to show.
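    To make the parsing step concrete, the following is a minimal sketch of how a consumer such as Rich Snippets might collect property/value pairs from simple RDFa markup. It uses Python's standard html.parser; the "v:" vocabulary terms and the sample snippet are our own illustrative assumptions, and real RDFa processing (prefix resolution, nesting, typed literals) is considerably more involved.

    ```python
    from html.parser import HTMLParser

    class RDFaPropertyScanner(HTMLParser):
        """Collects (property, text) pairs from simple RDFa markup."""

        def __init__(self):
            super().__init__()
            self._current = None   # RDFa property of the element being read
            self.found = {}        # property -> text content

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if "property" in attrs:
                self._current = attrs["property"]

        def handle_data(self, data):
            if self._current:
                self.found[self._current] = data.strip()
                self._current = None

    # A fragment annotated with hypothetical vocabulary terms:
    snippet = ('<div><span property="v:name">Blast pizzeria</span>'
               '<span property="v:rating">4.5</span></div>')
    scanner = RDFaPropertyScanner()
    scanner.feed(snippet)
    print(scanner.found)  # {'v:name': 'Blast pizzeria', 'v:rating': '4.5'}
    ```

    A search engine could then select, per query, which of the extracted pairs to display in the result snippet.
    
    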

    Annotating your web pages manually is a disadvantage for many developers. As noted in one of the articles [42] (section 1.3.1), it is difficult to edit such dual-purpose web pages. On the other hand, the user isn't restricted to a single annotation standard and can choose, out of the two, the standard they are most familiar with.

    6.2 Del eTools - eLearning Annotation Web Service

    Del eTools - eLearning Annotation Web Service [4] is a web service that offers the possibility to add semantic annotations directly into the content of web pages. You can not only add these annotations but also share them with other learners. The service offers a web-based user interface where developers can manually add, manage and share the annotation functionality of their web application.

    As with other tools, the user has to add all annotations manually. As noted in one of the articles [42] (section 1.3.1), it is difficult to edit such dual-purpose web pages.

    6.3 OSA Web Annotation Service

    The main objective of the OSA project [9] is to give people the possibility to annotate content. This content can be their own web page, but it can also be content from another or unknown author; adding these annotations can even be done without the author's knowledge. The basic structure is much the same as in Annotea (section 4.5): the annotations and the content are saved separately, which allows people to add annotations to whatever online content they prefer.


    An important part of OSA is the annotation service. This service consists of a URL interceptor that searches for annotations related to the retrieved document and combines these annotations with the web page into an annotated web page before it is delivered to the user. It also offers other features, for example the ability to decide from which sources you want to accept annotations. Compared to Annotea (section 4.5), where a web page is linked to one page of annotations, here a web page can have more than one page of annotations. At runtime the annotation service decides which annotations will be combined with the actual web page.

    No restrictions are mentioned on how the annotations of a web page are created, although they aren't created in real time. This might nevertheless be interesting for us, as the engine could create these annotations automatically when a web page is requested; afterwards, this existing annotation service could combine the annotations with the web page.

    6.4 Conclusion

    All projects, companies and services try to offer developers a platform to annotate web pages. All of them do this by using one (or more) annotation standards, but the platforms use different ways to add the annotations to the web pages. In some cases the developer has to edit the web page itself (e.g. Google Rich Snippets, Del eTools), while others save the annotations in separate files (e.g. OSA).

    More important is the fact that in all these projects, the developer, or more generally the user, has to add these annotations manually. The goal of our project is therefore to add these annotations automatically and deliver annotated web pages that are at least as accurate as web pages that were annotated manually.


    Chapter 7

    Conclusion

    There are many methods and algorithms to detect the content and structural parts of a web page. We can divide these methods into two major groups: one group of algorithms uses the DOM structure of a web page to analyse the document, while many other algorithms use a visual representation of the web page and identify its different parts based on the rendered page. Both kinds of algorithms are interesting for us, as the engine will likely have to combine several methods.

    Important to us is the fact that none of these algorithms seems to be patented; there is no patent that really limits what we are trying to do. Our customers, on the other hand, should take note of the various patents, because some applications of annotated web pages are patented in one way or another.

    Furthermore, there does not seem to be any company or organization that tries to do exactly what we want to do. Several companies and organizations offer an application or service to add annotations to web documents; however, all of these companies developed semi-automatic methods that require interaction from the user or developer. At the moment, nobody offers a completely automatic annotation service or engine. We would be the first to add annotations completely automatically, so today we wouldn't have any real competition.

    Market research shows us that the World Wide Web is evolving into a more semantic web in which, of course, semantic annotations play a major role. It would be very interesting for us to be part of this evolution, as adding annotations automatically would be a huge step forward towards the semantic web.

    To annotate web pages, we can choose between different annotation technologies. However, it is not easy to choose a single technology: all of them have their advantages and disadvantages, and all of them have numerous applications on the World Wide Web and have been adopted by one or more web companies. We therefore can't simply ignore a technology; it is important for us to use technologies that cover as large a part of the web as possible. An interesting possibility is to transform annotations from one technology to another. More specifically, there are methods to convert microformats into RDFa, which would be a benefit for us because we could cover more technologies.
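    The idea of converting microformats into RDFa can be illustrated with a small sketch. The mapping table below covers only three hCard class names and the "v:" prefix is an assumption of ours; a real converter would handle the full microformat vocabulary and proper RDFa prefix declarations.

    ```python
    import re

    # Illustrative mapping from hCard microformat class names to
    # hypothetical RDFa property names; deliberately incomplete.
    HCARD_TO_RDFA = {
        "fn": "v:fn",
        "org": "v:org",
        "locality": "v:locality",
    }

    def microformat_to_rdfa(html: str) -> str:
        """Rewrite known class="X" microformat hooks into RDFa attributes."""
        def replace(match):
            cls = match.group(1)
            prop = HCARD_TO_RDFA.get(cls)
            # Unknown class names are left untouched.
            return f'property="{prop}"' if prop else match.group(0)
        return re.sub(r'class="([\w-]+)"', replace, html)

    card = '<div class="vcard"><span class="fn">Jan Peeters</span></div>'
    print(microformat_to_rdfa(card))
    ```

    Running this on the sample card rewrites the `fn` hook into an RDFa property while leaving the container's `vcard` class alone, which is the kind of per-vocabulary rule such a converter is built from.
    
    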


    Part III

    Vision


    Chapter 8

    Vision

    8.1 Mission Statement

    We are going to create an engine that analyses the content and structure of web pages. Based on this analysis, the engine will automatically add annotations. Unlike existing engines, this engine will add annotations completely automatically, solving the lack of semantic information on the web today. This extra information will improve existing applications, but also create endless possibilities for new kinds of applications.

    8.2 Customers and benefits

    Primary: at first, our main customers will be web application developers. The engine will allow them to easily and automatically add semantic data to their pages without spending extra development time.

    Secondary: once semantic annotations become more common on the World Wide Web, we will have a whole new range of customers who want to make use of the annotations, e.g. search engines, browser developers, adaptation engines and application developers.

    Benefits: adding semantic annotations to web applications will create web pages that can be understood by computers. This will pave the way for the semantic web and create a whole new range of applications and possibilities.

    8.3 Key factors to judge quality

    Scalability: since we opted for a server-side approach, the engine has to be able to handle many requests within a short time period.

    Performance: people do not like to spend a few seconds waiting for their page to load. The system will annotate in real time, so it is crucial that the engine can do this very fast.

    Modifiability: we need to be able to swap out one algorithm for another at runtime. This means the entire system should be split up into smaller, less complex subsystems. Each of these subsystems can then be modified without affecting the others, and each subsystem can be exchanged for another subsystem with the same function.

    Accuracy: the engine needs to replace developers who annotate their pages by hand. To make sure the engine will be used, it must reach an acceptable level of accuracy.


    Availability: to offer a certain service level, the servers have to be up and running all the time (e.g. also during updates and modifications we have to try to keep the servers up and running).

    Completeness: the added annotations not only have to be accurate but also complete. This means that every place where a developer would expect an annotation should contain the correct annotations.

    Stability: the engine annotates in real time, so every error has to be handled correctly and quickly without stopping the engine. An error must not lead to a badly annotated application.

    Extensibility: as the World Wide Web evolves very quickly, the engine has to be extensible in order to handle new technologies used in web applications. We should be able to add new technologies (e.g. a new annotation algorithm) in a short time span, and these new technologies should be available immediately.
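    The modifiability and extensibility factors above amount to a strategy-style registry of interchangeable annotation algorithms. The sketch below shows one way such a registry could look; all class, method and algorithm names are our own invention, not part of any existing implementation.

    ```python
    # Runtime-swappable registry of annotation algorithms (a sketch).
    class AlgorithmRegistry:
        def __init__(self):
            self._algorithms = {}
            self.default = None

        def register(self, name, func, make_default=False):
            """Add or replace an algorithm while the engine keeps running."""
            self._algorithms[name] = func
            if make_default or self.default is None:
                self.default = name

        def annotate(self, html, algorithm=None):
            """Run the chosen (or default) algorithm on a page."""
            func = self._algorithms[algorithm or self.default]
            return func(html)

    registry = AlgorithmRegistry()
    # Placeholder algorithms standing in for real annotators:
    registry.register("fast", lambda html: html + "<!-- fast pass -->")
    registry.register("accurate", lambda html: html + "<!-- deep analysis -->")

    print(registry.annotate("<p>x</p>"))                        # default: "fast"
    print(registry.annotate("<p>x</p>", algorithm="accurate"))  # explicit choice
    ```

    Because each algorithm is looked up by name on every request, registering a new or replacement algorithm takes effect immediately, without restarting the engine.
    
    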

    8.4 Key features and technology

    The engine will add annotations to a given web application.

    The engine will do this completely automatically.

    The engine takes a web application as input and gives an annotated web application as output.

    The engine can be used as part of a bigger system.

    8.5 Crucial factors as applicable

    A certain level of correctness (accuracy) will be required for the engine to be usable.

    The engine needs to be consistent, which means we can't have two different annotated versions of the same web page. This is necessary for the engine to be usable.

    The solution has to be easily extendable for future web features.

    Documentation about the annotations used and the API will be necessary. This will stimulate third-party software to make use of the annotations and of the semantic annotation engine. The documentation will be publicly available.


    Part IV

    Scenarios


    Chapter 9

    Use cases

    9.1 Use case diagram

    Figure 9.1: Use case diagram for the annotation engine

    9.1.1 Actors

    We have three kinds of actors:

    1. User: a user can be anyone who wants to use the engine. This can be a physical person but also another engine or application. Every user is able to perform a semantic analysis on a selected web application. To do this, the user can change some common settings concerning the performance and accuracy of the engine.


    2. Administrator: the administrator is an extension of the user, which means he can do everything a normal user can do. In addition, he manages the engine on the server and is able to change all settings related to the engine. He is also able to add additional functionality to the engine, e.g. a new algorithm.

    3. Web developer: the web developer is also an extension of the user. Additionally, he can install the engine on his own computer. By doing so, he becomes the administrator of his local installation and can do everything the administrator of the engine on the server can do.

    9.1.2 Use cases

    Perform semantic analysis

    Performing a semantic analysis of a web page can be done in two ways. First, the engine is implemented as a web service, which means it can be invoked by any user from any place on the internet; such a user can even be another engine or application. Secondly, we provide a web application as a user interface to the engine, so any physical user can use the engine through the provided interface.

    In both cases, the user has to provide a web page as input. The output of the semantic analysis will be an annotated web page.

    Choose performance and accuracy

    The user will be able to trade off performance and accuracy. Concretely, this means the user can choose the algorithm used by the engine. Not every algorithm performs equally fast, and most likely a faster algorithm will be less accurate. When using the engine in real time, a user probably prefers a fast algorithm that may be a little less accurate. On the other hand, when a web developer runs the engine on his own computer, he can decide to choose a slower algorithm that gives a higher accuracy.

    Correct annotations

    As noted before (section 9.1.2), we offer a user interface. If this interface is used, the user can choose, before starting the semantic analysis, to view the annotations in an editor afterwards. After the analysis, an editor will open and show the added annotations. The user is then able to add further annotations or change the existing ones.

    Afterwards, the revised page can be submitted to the engine for machine learning. For safety reasons, only an administrator can actually insert annotated pages for machine learning (sections 9.1.2 and 9.2.4). If a user submits an annotated page to the engine, this page will be saved until an administrator confirms that it may be used for machine learning. This option will especially be used by advanced users when they notice the engine is no longer accurate enough.

    Add annotated pages for machine learning

    Another way of dealing with new technologies is the possibility to add already annotated web pages. Using these annotated web pages, the engine can apply machine learning to learn from the annotations and possibly change its behaviour.

    As already noted (section 9.1.2), any user can submit an annotated web page. The actual selection of which of these web pages will be used for machine learning is, for safety reasons, left to the administrator. The administrator can view the submitted pages and choose which ones will be used for the actual machine learning. If the administrator resubmits those pages, they will be analysed and used to update the algorithms.


    Management of rule-based methods

    As the World Wide Web evolves quickly, it is necessary that the engine can be adapted to deal with new technologies. To that end, the administrator can add, change or remove rules specifying which structural or content part has to be mapped to which annotation.

    Add functionality

    The administrator is able to add additional functionality to the engine, e.g. a new algorithm. This can easily be done at runtime using the user interface. Importantly, because this is done at runtime, the engine keeps running while the additional functionality is being added.
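    One way to realise such runtime extension is to load uploaded algorithm code as a fresh module while the engine keeps serving requests. The sketch below uses Python's standard types module for this; the function and plugin names are hypothetical, and a production system would additionally validate and sandbox untrusted code before executing it.

    ```python
    import types

    # Sketch: turn algorithm source text (e.g. uploaded by the
    # administrator through the UI) into a live module without a restart.
    def load_plugin(name: str, source: str) -> types.ModuleType:
        module = types.ModuleType(name)
        exec(source, module.__dict__)   # executes the plugin's definitions
        return module

    uploaded = "def annotate(html):\n    return '<!-- new algorithm -->' + html"
    plugin = load_plugin("new_algorithm", uploaded)
    print(plugin.annotate("<p>x</p>"))  # <!-- new algorithm --><p>x</p>
    ```

    After loading, the new module's `annotate` function can be registered alongside the existing algorithms, so it becomes available immediately, as the extensibility requirement demands.
    
    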

    Install and run locally

    A web developer has the possibility to install the engine on his own computer. By doing this, he has his own local engine and becomes the administrator of that engine (see also section 9.1.1), so he is able to perform semantic analysis on his web applications using his own engine.

    9.2 Use case scenarios

    9.2.1 Perform semantic analysis

    Name
    Perform semantic analysis

    Textual description
    When the engine is given a certain web page as input, the engine needs to return an annotated web page.

    Priority
    High - this is essential for the entire project.

    Complexity
    High - this is the most essential part.

    Actors
    Any user who wants to annotate a web application.

    Events (triggering, during execution, ...)
    The user starts the engine and provides the web page that has to be annotated.

    During execution the engine could encounter a fatal error, sending a failure notification to the user and a log message to the administrator. In that case the original web page, without any annotations, will be returned.

    Preconditions

    The engine is running.

    The given web page is available.

    The engine has the correct rights in order to visit the web page.

    Main success scenario (MSS)

    1. The user gives the location of the web page to be annotated.

    2. The engine tries to read the web page.


    3. The engine applies heuristics and algorithms to detect the web page's structure and content.

    4. The structural and content parts are annotated.

    5. The annotated version is returned to the user.
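    The five steps of the main success scenario can be sketched as a small pipeline. The detection heuristic and annotation rule below are placeholders of our own, standing in for the real algorithms; the error handling mirrors the requirement that a failure returns the original, unannotated page.

    ```python
    # Pipeline sketch for the "perform semantic analysis" use case.
    def detect_structure(html):
        """Stand-in heuristic: count top-level headings (step 3)."""
        return {"headings": html.count("<h1")}

    def add_annotations(html, structure):
        """Stand-in rule: mark headings with a hypothetical property (step 4)."""
        return html.replace("<h1>", '<h1 property="v:title">')

    def perform_semantic_analysis(html):
        try:
            structure = detect_structure(html)
            return add_annotations(html, structure)  # returned to user (step 5)
        except Exception:
            # On any error, fall back to the unannotated original page.
            return html

    print(perform_semantic_analysis("<h1>News</h1>"))
    ```

    In the real engine the two placeholder functions would be the selected structure-detection and annotation algorithms; the surrounding control flow stays the same.
    
    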

    Extensions on the MSS
    2. The web page cannot be read or contains errors.

    An error message is returned to the user and will be logged internally.

    Postconditions after success and failure

    If the engine is used via the user interface, the annotated web page will be shown in the editor.

    If the engine is used as a web service by another application, the annotated web page is returned to the user.

    In case of failure, a message will be returned to the end user and the original web application is returned or shown.

    Blackbox

    Figure 9.2: Blackbox for the use case to perform a semantic analysis.

    9.2.2 Choose performance and accuracy

    Name
    Choose performance and accuracy

    Description
    You can choose which algorithm to use for annotating the web page. The accuracy of the annotation will vary depending on the selected algorithm: by selecting a highly accurate algorithm, performance will likely be lower, and vice versa.


    Priority
    Low - this is optional; otherwise the standard algorithm is used.

    Complexity
    High - implementing different algorithms is a complex job. It will take a while to build several algorithms with high accuracy.

    Actors
    User

    Events
    This use case is triggered when the user decides to choose an algorithm other than the standard one.

    Preconditions

    We need to have different algorithms which the user can choose from.

    The algorithms work perfectly.

    Main Success Scenario (MSS)

    1. The user goes to the option pane where he can select the algorithm he prefers.

    2. The user selects the algorithm he wants to use.

    3. The user saves his selection and can perform the semantic analysis.

    Extensions on the MSS
    2. The user makes an empty selection.

    A message is returned to the user telling him that he has to select an algorithm.

    3. The user doesn't save his selection.

    The selection will be discarded and the standard algorithm (set by the administrator) will be used.

    Postconditions

    The user can now perform semantic analysis using the selected algorithm.


    Blackbox

    Figure 9.3: Blackbox for the use case to choose performance and accuracy.

    9.2.3 Correct annotations

    Name
    Correct annotations

    Description
    A user will be able to view the annotations added by the engine and edit them if they are incorrect. This will especially be done by advanced users, as the revised pages can be used by the engine for machine learning to make the annotation algorithms more accurate. After a user has corrected a page, he can submit it as a suggestion for machine learning. The actual selection of the page for machine learning is done by th