[Lecture Notes in Computer Science] Advances in Conceptual Modeling – Applications and Challenges...



Page 1: [Lecture Notes in Computer Science] Advances in Conceptual Modeling – Applications and Challenges Volume 6413 || Model-Driven Development of Multidimensional Models from Web Log

Model-Driven Development of Multidimensional Models from Web Log Files

Paul Hernández, Irene Garrigós, and Jose-Norberto Mazón

Lucentia Research Group
Dept. of Software and Computing Systems
University of Alicante, Spain
{phernandez,igarrigos,jnmazon}@dlsi.ua.es

Abstract. Analyzing Web log data is important in order to study the usage of a website. Even though some approaches propose data warehousing techniques for structuring the Web log data into a multidimensional model, they present two main drawbacks: (i) they are based on informal guidelines and must be manually applied; and (ii) they consider data tailored to a specific Web log format, thus being restricted to specific analysis tools. To overcome these limitations, we present a model-driven approach for obtaining a conceptual multidimensional model from Web log data in a comprehensive, integrated and automatic manner. This approach consists of the following steps: (i) obtaining a conceptual model of the Web log data based on a unified metamodel, (ii) deriving a multidimensional model from this Web log model by formally defining a set of QVT (Query/View/Transformation) transformation rules.

1 Introduction

Web log files can have millions of entries that contain a lot of information about the user interaction with the site. These files are useful for a detailed analysis of the usage of a website (also known as clickstream [2]) in order to support decision making regarding several tasks [1,2,10,12], e.g., reducing the editor’s effort, improving the browsing experience, managing Web traffic, marketing, e-commerce, advertising, the evaluation of the system against initial specifications and goals, or personalization support.

Web log data are usually stored in files by using different text-based formats, such as the NCSA Common log format [18] or the W3C Extended Common Log File format [19]. Moreover, every format can be customized for specific purposes depending on the data that should be monitored. Unfortunately, the features of these formats restrict the analysis of Web log files to specific analysis tools and their analysis functionality [5]. Therefore, advanced analysis techniques cannot be easily applied to Web log data.

To solve these problems, some approaches [10,12] consider structuring the Web log data into a data warehouse by means of multidimensional modeling [11]. Multidimensional modeling is at the core of data warehousing, since it allows advanced analysis tools, such as OLAP (On-Line Analytical Processing), data mining or “what-if” analysis, to access data in a way that comes more naturally to human analysts. The data (or facts, e.g., how many products are sold, how many patients are treated, how long something takes, etc.) are located in an n-dimensional space, with the dimensions representing the different ways the data can be viewed and sorted (e.g., according to time, store, customer, product, etc.). Therefore, designers of multidimensional schemas have to structure the available information into facts and dimensions. Current approaches define these structures manually from the Web log information, which is a tedious, error-prone and time-consuming task. Also, multidimensional modeling requires specialized design techniques that resemble traditional database design methods, in which the development process is guided by a first conceptual design phase whose output is an implementation-independent and expressive conceptual multidimensional schema for the data warehouse [17].

J. Trujillo et al. (Eds.): ER 2010 Workshops, LNCS 6413, pp. 170–179, 2010. © Springer-Verlag Berlin Heidelberg 2010

In order to overcome these drawbacks, a model-driven approach is proposed in this paper to automatically derive a conceptual multidimensional schema from Web log files. To tackle the different available formats mentioned above, a unified metamodel for Web log data has been developed. Fig. 1 shows an overview of our approach. The first step is to represent the raw data from Web log files in a model that conforms to our generic Web log metamodel. Once this model is obtained, a conceptual multidimensional model can be automatically derived by means of several model transformations defined in the QVT (Query/View/Transformation) language.

Fig. 1. Overview of our model-driven approach for obtaining multidimensional models from Web log data

The remainder of this paper is structured as follows. A brief overview of the related work is presented in Section 2. Section 3 describes our model-driven approach for multidimensional modeling of Web log data. An example is provided throughout this section to show the applicability of our approach. Finally, Section 4 points out our conclusions and future work.

2 Related Work

Commercial tools for Web log data analysis have significant limitations when performing advanced analytical tasks [12]. Furthermore, they have some drawbacks: (i) they are useless when trying to understand navigational patterns of users [2], and (ii) they lack the ability to integrate and correlate information from different sources. One of the best-known analysis tools is Google Analytics¹, which has emerged as a major solution for Web traffic analysis. However, it has several drawbacks, e.g., the drill-down capability is limited and there is no way of storing the data efficiently. Also, the data are owned by Google rather than by the user.

There are several approaches [3,4,10,11,12] that define a multidimensional schema in order to represent the Web log data. With these approaches, once the data is structured, it is possible to use OLAP or data mining techniques to analyze the content of the Web logs, tackling the aforementioned problems. However, there is a lack of agreement about a methodological approach for detecting the most appropriate facts and dimensions: some approaches let the analysts decide the required multidimensional elements, while others decide these elements by taking into consideration a specific Web log format. Furthermore, Web applications can be distributed over several servers depending on the performance needed, e.g., video and audio content could be hosted on a specialized multimedia server while sales transactions are handled by a high-security server. Therefore, the main problem is that the multidimensional elements are informally decided according to a specific format, so the resulting multidimensional model may be incomplete.

To overcome these problems, our approach is aligned with [6], where Web log files are considered at the conceptual level. Specifically, our approach defines (i) a Web log metamodel in order to unify different Web log formats in a conceptual Web log model, and (ii) a set of model transformations to automatically obtain multidimensional data structures from a Web log model.

3 Model-Driven Approach for Multidimensional Modeling of Web Logs

In this section, we describe our approach for obtaining a conceptual multidimensional model from Web log files. Fig. 1 shows an overview of our approach: from the Web log files, a Web log model is obtained. From this model, a conceptual multidimensional model is obtained through a set of QVT transformations. To summarize, the benefits of our approach are:

– A Web log metamodel is defined that is not tailored to a specific Web log format, which allows more flexibility
– The Web log model is automatically generated from the Web log file
– The multidimensional model is automatically derived from the conceptual model by means of QVT rules

It is worth pointing out that this conceptual model will drive the development of a data warehouse by using our approach presented in [15]. This data warehouse will be used to enhance the analysis of Web usage data.

1 http://www.google.com/analytics


3.1 Web Log Metamodel

The main goal of this metamodel is to define the elements and the semantics that allow building a conceptual model which represents, in a static way, the interaction between raw data elements (e.g., the client remote address) and usage concepts (e.g., session, user). We have divided our Web log metamodel into two packages, as shown in Fig. 2: the Entries package and the Usage package.

(a) The Entries Package

(b) The Usage Package

Fig. 2. Our Web log metamodel


The Entries package (see Fig. 2a) is intended to represent the entries of any kind of Web log format. The EntryField metaclass has subclasses representing any field present in an entry. These fields are optional because some Web log formats, such as the W3C Extended Log File Format, are customizable, which allows storing only the desired fields in the log. Most of the fields are gathered directly from the HTTP request of the client, such as RemoteIp, BytesSent, RemoteName, and HttpStatus. The AuthUser metaclass has a value if the current user is authenticated in the server. The WebObject metaclass represents any element that, when clicked, produces a request over a resource identified by a URI. The Request metaclass represents the request line from the client. The TimeTaken metaclass represents the length of time that the action took. The Cookie field includes the content of one or more cookies sent or received. The Referrer metaclass represents the site the user comes from, and the Agent metaclass contains information regarding the browser type that the client used. Finally, the Entry metaclass consists of a set of entry fields in order to structure the configuration of the current Web log.

The Usage package (see Fig. 2b) contains classes to represent how the user interacts during a session and produces entries in the Web log. The User is the person or program that makes use of the website. User identification is one of the challenging tasks in Web log analysis: this information can be taken directly from the authenticated user field if it is available; otherwise, there are some methods to accomplish this task². A single Session has a unique User, but a single User may have many sessions. The Session also has a Context, which could be the Device used by the client or the UserOrigin, which in turn could be determined by the RemoteIp. A Session contains a set of entries caused by a user interaction over a period of time³. A Page is a specialized WebObject and it can contain many WebObjects at the same time. A Session could have one or more Page elements associated. Cookies are very helpful in order to identify the user, determine the user location and delimit the session. A drawback of using cookies is that they are not always available, because they depend on the user's acceptance.
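To make the relationships between the two packages more concrete, the following is a minimal sketch in Python, not the EMF metamodel itself. It assumes a flattened Entry with one optional attribute per entry field and models only User, Session and a UserOrigin context; the attribute names and the `open_session` helper are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    # Every field is optional because customizable log formats may omit it.
    remote_ip: Optional[str] = None
    auth_user: Optional[str] = None
    request: Optional[str] = None
    http_status: Optional[int] = None
    bytes_sent: Optional[int] = None
    referrer: Optional[str] = None
    agent: Optional[str] = None


@dataclass
class User:
    name: str = "anonymous"
    # One User may have many sessions.
    sessions: List["Session"] = field(default_factory=list)


@dataclass
class Session:
    user: User  # a Session has exactly one User
    entries: List[Entry] = field(default_factory=list)
    user_origin: Optional[str] = None  # a Context, e.g. derived from RemoteIp


def open_session(user: User, user_origin: Optional[str] = None) -> Session:
    """Hypothetical helper: create a Session and register it with its User."""
    session = Session(user=user, user_origin=user_origin)
    user.sessions.append(session)
    return session


visitor = User()
first = open_session(visitor, user_origin="Spain")
second = open_session(visitor)
first.entries.append(
    Entry(remote_ip="172.16.242.69", http_status=200, bytes_sent=2916)
)
```

The one-to-many link between User and Session is kept in both directions here so that either end can be navigated, which is only one of several reasonable encodings.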

Our metamodel has been developed in the Eclipse Modeling Framework⁴ (EMF). Eclipse is an open source project conceived as a modular platform that can be extended by plugins in order to add features to the development environment. Within Eclipse, EMF is a Java framework and code generation facility for building tools and other applications based on a structured model. In order to support our modeling tasks, we have developed a plugin of the metamodel that allows the definition and editing of Web log models in a programmatic manner by using a reflective API for manipulating EMF objects.

2 This issue is out of the scope of this paper and we refer the reader to some methods explained in [11] for further information.

3 Determining a specific session is a challenging task; again this is out of the scope of this paper and we refer the reader to [11] for further explanations.

4 Site: http://www.eclipse.org/emf


With the defined metamodel it is possible to express the structure of the data contained in the log files in a model that is independent of the Web server technology. It is also possible to model the user's interaction with the website during a session.

To show how to create Web log models from our metamodel, we use a running example based on a log file from the server which hosts our research group website⁵ (an Apache server that uses the Combined Log Format). A typical entry is shown in Fig. 3.

172.16.242.69 - - [16/Mar/2010:09:28:00 +0100] "GET /labcss/Projects.php HTTP/1.1" 200 2916 "http://lucentia.dlsi.ua.es/labcss/Activities.php" "Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8 (.NET CLR 3.5.30729)"

Fig. 3. Code for a typical entry from http://www.lucentia.es

The process of obtaining a Web log model from the set of Web log files has been implemented by using the java.util.regex.Pattern class to represent the regular expressions that parse the data in the Web log files, together with java.util.Scanner. These data are then converted into a model that conforms to our Web log metamodel by using the EMF.Edit interface EditingDomain. The corresponding model for our example is sketched in Fig. 4.

Fig. 4. Sample Web log model
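As an illustration of the parsing step, the following minimal sketch applies a regular expression to the entry of Fig. 3. It is written in Python for brevity, whereas the implementation described above uses Java; the pattern, the group names and the `parse_entry` function are assumptions made for this example and are not taken from the original implementation.

```python
import re
from datetime import datetime

# Assumed pattern for the Apache Combined Log Format:
# host ident authuser [timestamp] "request" status bytes "referrer" "agent"
COMBINED = re.compile(
    r'(?P<remote_ip>\S+) (?P<ident>\S+) (?P<auth_user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes_sent>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_entry(line):
    """Return the entry fields as a dict, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no bytes were sent.
    raw_bytes = fields["bytes_sent"]
    fields["bytes_sent"] = 0 if raw_bytes == "-" else int(raw_bytes)
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return fields


line = (
    '172.16.242.69 - - [16/Mar/2010:09:28:00 +0100] '
    '"GET /labcss/Projects.php HTTP/1.1" 200 2916 '
    '"http://lucentia.dlsi.ua.es/labcss/Activities.php" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES; rv:1.9.1.8) '
    'Gecko/20100202 Firefox/3.5.8 (.NET CLR 3.5.30729)"'
)
entry = parse_entry(line)
```

Each named group corresponds roughly to one EntryField subclass of the metamodel (RemoteIp, AuthUser, Request, HttpStatus, BytesSent, Referrer, Agent), so a dictionary of this kind could then be turned into model elements.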

The Entry element contains the fields that represent the raw data taken directly from the Web log entry line. This information is associated with a Session started by an anonymous User within a localization Context (User Origin Spain). This model represents the interaction of the User in the website: how (click sequence), when (Time Stamp), where (User Origin) and what (Pages visited). It is worth recalling that the novelty of our approach is that this model is independent of the Web log technology. In our sample, the Session started in the Activities Page. Then the User went to the Projects Page. In this way, it is possible to represent complex User Sessions and larger sets of entries if needed.

5 http://www.lucentia.es

3.2 Multidimensional Conceptual Modeling

The major aim of a conceptual multidimensional model is to represent the main multidimensional elements without taking into account any specific technology detail. The UML profile proposed in [13] is used for specifying conceptual multidimensional models as UML class diagrams, where facts and dimensions are represented by Fact and Dimension classes, respectively. More precisely, Fact classes are defined as composite classes in shared aggregation relationships with several Dimension classes. If multiplicities are not specified for those relationships, a default of many-to-one is assumed, i.e., each fact is associated with one coordinate in every dimension, and each of the coordinates can be used for many facts. Measures for Fact classes are represented as attributes with the FactAttribute stereotype. With respect to dimensions, each level of a dimension hierarchy is specified by a Base class. Every Base class can contain several dimension attributes (DimensionAttribute stereotype) and must also contain a descriptor attribute (Descriptor stereotype).
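The structure imposed by the profile can be summarized with a small sketch. The Python classes below are a hypothetical, simplified rendering of the Fact, Dimension and Base stereotypes, not the UML profile of [13]; in particular, the `levels` list and the descriptor name "name" are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Base:
    name: str
    descriptor: str  # mandatory Descriptor attribute
    dimension_attributes: List[str] = field(default_factory=list)


@dataclass
class Dimension:
    name: str
    # One Base class per hierarchy level (assumed order: finest level first).
    levels: List[Base] = field(default_factory=list)


@dataclass
class Fact:
    name: str
    fact_attributes: List[str] = field(default_factory=list)  # measures
    # Shared aggregations to dimensions; many-to-one unless stated otherwise.
    dimensions: List[Dimension] = field(default_factory=list)


user_dim = Dimension("User", [Base("UserData", descriptor="name")])
session_fact = Fact(
    "Session",
    fact_attributes=["BytesTaken", "TimeTaken", "SessionDate"],
    dimensions=[user_dim],
)
```

Making `descriptor` a required constructor argument while leaving `dimension_attributes` optional mirrors the rule that a Base class must contain a descriptor but may contain any number of dimension attributes.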

3.3 Model Transformations from Web Log Model to Conceptual Multidimensional Model

Traditionally, conceptual multidimensional schemas have been derived from a detailed analysis of relational data sources in order to determine facts and dimensions from relational tables [7,8,9,16]. Along these lines, we have previously developed a set of QVT transformations to support the designer in discovering every kind of multidimensional element from relational data sources [14]: e.g., a table that contains a high number of numeric columns is transformed into a fact. However, these guidelines are focused on relational sources and they are not valid when the multidimensional model must be derived from the Web log model. Therefore, in this paper we have defined a new set of QVT transformations for detecting facts and dimensions in Web log models, thus deriving a conceptual multidimensional model. Due to space constraints, we focus on explaining a subset of these QVT transformations. Once a Fact has been derived from a Session class in the Web log model, the ObtainFactAttributes relation (see Fig. 5) enforces the derivation of FactAttribute properties in the conceptual multidimensional model: BytesTaken, TimeTaken and SessionDate. The value for each of these attributes is derived from the Web log model and it is calculated by means of OCL constraints in the where clause of the QVT transformation. For example, for each Session fact, the value of TimeTaken is the total length of processing time that the entries cause in the server within a session.
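The derivation of the fact attributes can be illustrated with a short sketch in Python rather than QVT/OCL. Only the definition of TimeTaken (the total processing time of the entries of a session) comes from the text; computing BytesTaken as the sum of the bytes sent and SessionDate as the date of the earliest entry are assumptions, as are the function name, the dictionary keys and the sample values.

```python
from datetime import datetime


def obtain_fact_attributes(entries):
    """Aggregate the entries of one session into its fact attributes."""
    return {
        # Assumption: total bytes sent over all entries of the session.
        "BytesTaken": sum(e["bytes_sent"] for e in entries),
        # From the text: total processing time caused by the entries.
        "TimeTaken": sum(e["time_taken_ms"] for e in entries),
        # Assumption: the session is dated by its earliest entry.
        "SessionDate": min(e["timestamp"] for e in entries).date(),
    }


# Hypothetical session with two entries. The processing times and the first
# entry are invented for illustration; the Combined Log Format of the running
# example does not record a time-taken field.
session_entries = [
    {"timestamp": datetime(2010, 3, 16, 9, 27, 30),
     "bytes_sent": 3100, "time_taken_ms": 120},
    {"timestamp": datetime(2010, 3, 16, 9, 28, 0),
     "bytes_sent": 2916, "time_taken_ms": 80},
]
session_fact = obtain_fact_attributes(session_entries)
```

Under these assumptions the sample session yields 6016 bytes, 200 ms of processing time and a session date of 16 March 2010.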

Regarding dimensions, the User2Dimension transformation checks the User class in the Web log model to create a User dimension (and its related UserData Base class) related to the previously created Session fact. Once this dimension is created, the corresponding DimensionAttributes must be enforced by means of the QVT transformations defined in the where clause. Also, from the UserData Base class a hierarchy should be enforced by means of the corresponding QVT transformation. In order to exemplify the defined QVT transformations, the resulting multidimensional model for our running example is shown in Fig. 7.

Fig. 5. Obtaining Sessions fact attributes

Fig. 6. Deriving User dimension
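The check-then-enforce behaviour of such a relation can be approximated in a few lines of Python. This is a simplified sketch and not the QVT relation of Fig. 6: the dictionary-based models, the function name `user2dimension` and the attribute list are hypothetical. It only shows the idea that the source pattern is checked and the target element is created when it does not exist yet.

```python
def user2dimension(weblog_classes, fact):
    """Check for a User class; enforce a User dimension on the given fact."""
    # Check: the source pattern must match, i.e. a User class must exist.
    if "User" not in weblog_classes:
        return None
    # Enforce: reuse the target element if it is already there.
    for dimension in fact["dimensions"]:
        if dimension["name"] == "User":
            return dimension
    # Otherwise create the dimension together with its UserData Base class.
    dimension = {
        "name": "User",
        "base": {
            "name": "UserData",
            "attributes": list(weblog_classes["User"]),
        },
    }
    fact["dimensions"].append(dimension)
    return dimension


weblog_classes = {"User": ["name"], "Session": [], "Entry": []}
session_fact = {"name": "Session", "dimensions": []}
first = user2dimension(weblog_classes, session_fact)
second = user2dimension(weblog_classes, session_fact)  # no duplicate is created
```

Applying the function twice leaves a single User dimension on the fact, which is the property that makes repeated execution of the transformation safe.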


Fig. 7. Sample conceptual multidimensional model

4 Conclusions and Future Work

In this paper we have presented a model-driven approach for obtaining a conceptual multidimensional model from Web log data. This model will drive the development of a data warehouse in order to enhance the analysis of Web usage data. To be able to tackle the different available Web log formats, a unified metamodel for Web log data has been developed. Our approach consists of the following steps: (i) obtaining a conceptual model of the data of the Web log files (based on the unified metamodel defined), and (ii) automatically deriving a multidimensional model from this Web log model by formally defining a set of QVT transformation rules. Our future work consists of aligning our Web log metamodel with Web engineering approaches in order to create the multidimensional model for the Web application together with the rest of the Web conceptual models (navigational, domain, etc.).

Acknowledgments. This work has been partially supported by the ESPIA project (TIN2007-67078) from the Spanish Ministry of Education and Science, and by the QUASIMODO project (PAC08-0157-0668) from the Castilla-La Mancha Ministry of Education and Science (Spain).

References

1. Alves, R., Belo, O.: Mining clickstream-based data cubes. In: 6th International Conference on Enterprise Information Systems, pp. 583–586 (2004)

2. Alves, R., Belo, O., Cavalcanti, F., Ferreira, P.: Clickstreams, the basis to establish user navigation patterns on web sites. In: Fifth International Conference on Data Mining, Text Mining and their Business Applications, pp. 87–96. WIT Press, Southampton (2004)


3. Aurelio, D.M., Jorge, A.M., Soares, C., Leal, J.P., Machado, P.: A data warehouse for web intelligence. In: Neves, J., Santos, M.F., Machado, J.M. (eds.) EPIA 2007. LNCS (LNAI), vol. 4874, pp. 487–499. Springer, Heidelberg (2007)

4. Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining world wide web browsing patterns. Knowl. Inf. Syst. 1, 5–32 (1999)

5. Eirinaki, M., Vazirgiannis, M.: Web mining for web personalization. ACM Trans. Internet Techn. 3, 1–27 (2003)

6. Fraternali, P., Lanzi, P.L., Matera, M., Maurino, A.: Model-driven web usage analysis for the evaluation of web application quality. J. Web Eng. 3, 124–152 (2004)

7. Golfarelli, M., Maio, D., Rizzi, S.: The Dimensional Fact Model: A conceptual model for data warehouses. Int. J. Cooperative Inf. Syst. 7, 215–247 (1998)

8. Husemann, B., Lechtenborger, J., Vossen, G.: Conceptual data warehouse modeling. In: 2nd Intl. Workshop on Design and Management of Data Warehouses, pp. 6-1–6-11 (2000)

9. Jensen, M.R., Holmgren, T., Pedersen, T.B.: Discovering multidimensional structure in relational data. In: Kambayashi, Y., Mohania, M., Woß, W. (eds.) DaWaK 2004. LNCS, vol. 3181, pp. 138–148. Springer, Heidelberg (2004)

10. Joshi, K.P., Joshi, A., Yesha, Y.: On using a warehouse to analyze web logs. Distributed and Parallel Databases 13, 161–180 (2003)

11. Kimball, R., Merz, R.: The data webhouse toolkit: building the web-enabled data warehouse. John Wiley & Sons, Inc., New York (2000)

12. Lopes, C.T., David, G.: Higher education web information system usage analysis with a data webhouse. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Lagana, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3983, pp. 78–87. Springer, Heidelberg (2006)

13. Lujan-Mora, S., Trujillo, J., Song, I.Y.: A UML profile for multidimensional modeling in data warehouses. Data Knowl. Eng. 59, 725–769 (2006)

14. Mazón, J.N., Trujillo, J.: A model driven modernization approach for automatically deriving multidimensional models in data warehouses. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 56–71. Springer, Heidelberg (2007)

15. Mazón, J.N., Trujillo, J.: A hybrid model driven development framework for the multidimensional modeling of data warehouses. SIGMOD Record 38, 12–17 (2009)

16. Phipps, C., Davis, K.C.: Automating data warehouse conceptual schema design and evaluation. In: 4th Intl. Workshop on Design and Management of Data Warehouses, pp. 23–32 (2002)

17. Rizzi, S., Abello, A., Lechtenborger, J., Trujillo, J.: Research in data warehouse modeling and design: dead or alive? In: 9th International Workshop on Data Warehousing and OLAP, pp. 3–10 (2006)

18. The Apache Software Foundation: Log files, http://eregie.premier-ministre.gouv.fr/manual/logs.html

19. W3C Consortium: Extended common log file format, http://www.w3.org/TR/WD-logfile.html