Risks of search engine dependency and its influence on data quality



A thesis submitted for the European Master in Business Studies (EMBS)

by Ronan CHARDONNEAU

Institut de Management de l'Université de Savoie d'Annecy (FR)

Università degli studi di Trento (IT)

Universität Kassel (GER)

Universidad de León (SP)

Date of submission: June 26th, 2009

Master Thesis


Disclaimer

Before starting this report, I would like to inform readers that this is the final version of my master thesis, released under the title:

“Risks of search engine dependency and its influence on data quality”.

It is the result of one year of study, from June 2008 to June 2009, undertaken to obtain my degree: the European Master in Business Studies, a two-year program specialized in marketing and management in which students move every semester from one university to another in Europe (Italy, France, Germany and Spain).

To find out what inspired me, please have a look at my blogs:

- http://moteurs-de-recherches-alternatifs.blogspot.com/ (in French)

- http://internationalemarketing.blogspot.com/ (in English)

I allow any reader to use my work for further research, in exchange for being cited.

Please feel free to contact me through my blog by posting comments.


Acknowledgements

Sincere and grateful acknowledgements are due to:

Mr Francesco VALOTTO, co-founder of Edexon (Italy), who gave me the idea of studying the world of search engine optimization, which eventually led to the present topic.

Mr Roland ZIMMERMANN, from the University of Kassel (Germany), for his help in structuring the thesis and for his proofreading.

Mr Eugenio POZZOLINI, from the Advancia Business School (France), for all his advice.

Mr Andrea MOLINARI, from the University of Trento (Italy), who agreed to be the tutor of the thesis.

Mr Charles KNIGHT, editor at AltSearchEngines (United States), for promoting the thesis on his website.

Mr Daniel Arias-Aranda, associate professor at the Universidad de Granada (Spain), for his proofreading and feedback.

Mr Charles NODOT, first-year student (France) in the European Master in Business Studies, for his proofreading, feedback and corrections.


Contents

Disclaimer
Acknowledgements
Contents
Table of figures
Foreword
Chapter 1: Introduction of the topic background
1.1 Relevance of the subject
1.2 Major terms
1.3 Focus, goals and structure of the report
1.4 Chapter 1: Key points
Chapter 2: Concept of data quality
2.1 Data quality definition
2.2 Data quality issues within businesses
2.3 Origins of data quality issues: Garbage In Garbage Out
2.3.1 Poor data quality content: the Wikipedia example
2.3.2 Poor data quality content: the commercial example
2.3.3 Metadata
2.3.4 Findability
2.4 Data quality solutions
2.4.1 Learning how to use search tools
2.4.2 Check out the information: the Triangle method
2.5 Chapter 2: key points
Chapter 3: Search engines dependency
3.1 Search engine categories
3.1.1 Commercial search engines
3.1.2 Enterprise search engine (ESE)
3.2 Search engine market
3.2.1 Commercial search engine market
3.2.2 Commercial search engine market: Consumer behavior
3.2.3 Enterprise Search Engine market
3.2.4 Enterprise Search Engine market: Consumer behavior
3.2.5 The commercial search market repartition
3.2.6 The commercial search engines in the world
3.2.7 Commercial search engine leaders presentation
3.2.8 Commercial search engine complexity
3.2.9 Search engine market shares configuration
3.2.10 Search engines competition
3.3 Search engine dependency aspect
3.3.1 Search engines dependency proves
3.3.2 Types of search engines dependency
3.3.3 Search engine loyalty
3.3.4 Search engines dependency issues
3.3.5 Privacy issues
3.3.6 Search engine awareness
3.4 Search engine dependency conclusion
3.5 Chapter 3: key points
Chapter 4: Risks of search engines dependency and its influence on data quality
4.1 Search engine dependency and its influence on data quality: Issues
4.1.1 Search Engine Optimization
4.1.2 Commercial advertisement and perception
4.1.3 Censorship
4.1.4 Technological partnerships
4.1.5 The Visible Web
4.1.6 Invisible Web
4.2 Search engine dependency and its influence on data quality: Solutions
4.2.1 A deeper knowledge in search engine abilities
4.2.2 Taking the best part of each search engine
4.2.3 Technological evolution
4.3 The future of Internet search
4.5 Chapter 4: Key points
Chapter 5: The Google example
5.1 Google presentation
5.1.1 Google
5.1.2 Google's success
5.1.3 Google image
5.1.4 Google dependency state
5.1.5 Google added functionalities
5.1.6 Google success is his weakness
5.2 Google's disappearance consequences
5.2.1 Google Search engine failure
5.2.2 Google Gmail failure
5.2.3 Google other services failure
5.2.4 Google collateral damages
5.3 Chapter 5: Key points
Conclusion and recommendations
Declaration
List of literature


Table of figures

Figure 1: Internet Domain Survey Host Count January 1994 - January 2009
Figure 2: Do you use a personal blog?
Figure 3: How frequently do Internet users participate in the most popular activities?
Figure 4: Most used information source when people need help
Figure 5: How much of the information on the World Wide Web overall is generally reliable?
Figure 6: Enterprise findability goal
Figure 7: 1st and 2nd results for "data quality" are Wikipedia websites
Figure 8: A query made on Google images with the keyword "P5170009"
Figure 9: How well is findability understood in your organization?
Figure 10: How critical is findability to your Organization's Business Goals and Success?
Figure 11: Triangle method application
Figure 12: Ask search engine home page
Figure 13: Yahoo home page
Figure 14: An example of vertical search with Yahoo Images
Figure 15: A semantic search engine: Wolfram Alpha
Figure 16: Top 10 Worldwide Search December 2007
Figure 17: How Much Of The Information On the Internet Do You Think is Reliable and Accurate?
Figure 18: Enterprise search satisfaction
Figure 19: Influence of the consumer web on enterprise search tools
Figure 20: Success rate of finding the information with enterprise search tools
Figure 21: Worldwide Search by Region
Figure 22: Search engine leaders (>50%) per country personal estimation
Figure 23: Search engine market in the USA, source: Hitwise, February 2009
Figure 24: Google logo
Figure 25: Yahoo Logo
Figure 26: Japanese search engine market, source: webcreate.ga-pro.com, May 2009
Figure 27: Chinese search engine market, source: China IntelliConsulting Corp., Sept. 2008
Figure 28: Baidu logo
Figure 29: Bing logo
Figure 30: Korean search engine market, source: Koreanclick, July 2007
Figure 31: Naver logo
Figure 32: Yandex logo
Figure 33: Search engine market in Russia, source: LiveInternet.ru, December 31, 2008
Figure 34: Seznam logo
Figure 35: Search engine market in Czech Rep, source: navrcholu.cz, June 2008
Figure 36: Search engine market in Iceland, source: statice.is, 2007
Figure 37: Leit.is logo
Figure 38: An example of a customized interface on iGoogle
Figure 39: Search engine market shares in 2007 for the Czech Republic
Figure 40: Google market shares in Europe in 2008, source: Comscore
Figure 41: Use of search engines in 2004 and 2005
Figure 42: Search engine dependency relevancy
Figure 43: Search users blame themselves not the technology
Figure 44: Search engine syntax examples
Figure 45: Use of advanced search functionalities in Canada
Figure 46: Do users know how to use Boolean operators?
Figure 47: Use of meta search engines
Figure 48: Use of specialized search engines
Figure 49: U.S. Advertising Market - Media Comparison – 2008 ($ Billions)
Figure 50: Internet Ad Revenues by Advertising Format - 2008 Annual Results
Figure 51: Search engine user behavior regarding results pages in the USA
Figure 52: An eye tracking study on several search engines
Figure 53: Differences between organic and sponsored results
Figure 54: Type of Search Result Selected
Figure 55: Results relevancy according to users by search engine in 2004
Figure 56: Attitudes towards search engines in India
Figure 57: Powered by Google logo
Figure 58: Powered by Yahoo logo
Figure 59: Estimation of the indexable web per search engine
Figure 60: Distribution of Public Web Sites By Country in 2002
Figure 61: Percentage of Web Sites Covered by Google in 2002
Figure 62: Google vertical search engines
Figure 63: Search engine search within website content comparison
Figure 64: Future of web 2.0
Figure 65: Search engines are not the Internet
Figure 66: Time and knowledge lag
Figure 67: Delicious bookmarks search
Figure 68: Home page of the Similicious website
Figure 69: Twitter real time information search engine
Figure 70: Kartoo search results presentation
Figure 71: 2008 Web trend map
Figure 72: 2007 Web trend map
Figure 73: Significant age-related differences in article discovery methods
Figure 74: Google domination in Europe
Figure 75: Google domination in Latin America
Figure 76: Google coverage representation of the visible web
Figure 77: Google search failure
Figure 78: Google bug analysis on January the 31st 2009
Figure 79: Google evolution traffic during the bug on January the 31st 2009
Figure 80: Google Gmail failure
Figure 81: Main use of Internet


Foreword

A general trend of the early 21st century has been the use of the Internet instead of TV as an information provider [1].

There are today 1,596,270,108 Internet users in the world [2], and most of them already have their habits: checking their e-mail, doing research, finding information about goods and services, chatting online, reading the news [3].

Most of the functions described above can be performed through a single information exchange provider: the search engine.

According to the main providers of Internet traffic measurements, search engines are by far the most visited websites [4].

The main search engine players nowadays provide all kinds of services, making Internet use very comfortable.

However, using a single search engine every day conditions people to process information in a certain way.

Such habits acquired at home may unfortunately carry over to work, or the other way around.

It is certainly comfortable to have a standard when dealing with computers. As an example, Microsoft Windows is the leading operating system on computers, with more than 90% of the whole market [5]. But is Microsoft the computer? The same question arises with search engines: are they the Internet?

[1] Cogar, P. (ed.) (2007). TV vs. the Internet: Internet wins. [online]. Available from: http://www.bit-tech.net/news/2007/08/23/tv_vs_the_internet_internet_wins/1 [Accessed 17 June 2009]

[2] Internet World Stats. (2009). World Internet Usage Statistics News and World Population Stats. [online]. Available from: http://www.internetworldstats.com/stats.htm [Accessed 17 June 2009]

[3] Malaysian Communications and Multimedia Commission. (2005). Household use of the Internet survey 2005. [online]. Available from: http://www.skmm.gov.my/facts_figures/stats/pdf/Household_use_internet_survey2005.pdf [Accessed 17 June 2009]

[4] Alexa Web. (n.d.). Alexa Top 500 Global Sites. [online]. Available from: http://www.alexa.com/topsites [Accessed 17 June 2009]; Netcraft. (n.d.). Most visited websites. [online]. Available from: http://toolbar.netcraft.com/stats/topsites [Accessed 17 June 2009]

[5] One Stat. (2007). OneStat Website Statistics and website metrics - Press Room. [online]. Available from: http://www.onestat.com/html/aboutus_pressbox54-windows-vista-global-usage-share.html [Accessed 20 June 2009]


« Risks of search engine dependency and its influence on data quality » has been written with the aim of understanding the potential risks that search engine addiction poses to businesses.

Search engines such as Google are used by all Internet users. According to studies, Internet users are confident in, satisfied with and trusting of search engines [6]. Unfortunately, these studies also show that users are unaware and naïve.

Search engines are set up to find information on the Internet. Since information is the basis of any good decision making, it is important and interesting for businesses to understand the consequences of their use.

Ronan CHARDONNEAU

[6] Fallows, D. (2005). Search Engine users. [online]. Available from: http://www.pewinternet.org/~/media//Files/Reports/2005/PIP_Searchengine_users.pdf.pdf [Accessed 17 June 2009]


Chapter 1: Introduction of the topic background


The Internet was created to share information and to communicate with one another.

It is hard to evaluate how big the Internet is: estimates vary widely among companies, from 15 to some 30 billion Web pages [7].

The number of websites is increasing every day and is estimated at more than 600,000,000 [8] for 2009, with constant growth since the creation of the World Wide Web.

Figure 1: Internet Domain Survey Host Count January 1994 - January 2009

Websites are now used in diverse ways: while they have become a standard for companies (expansion of their business activity, new opportunities for advertising), they are also a space for many individuals (the blog phenomenon).

[7] Cf. Koch, P. / Koch, S. (2009). How big is the Internet?. [online]. Available from: http://www.pandia.com/sew/383-websize.html [Accessed 19 January 2009]

[8] Internet Systems Consortium. (2009). The ISC Domain Survey Internet Systems Consortium. [online]. Available from: https://www.isc.org/solutions/survey [Accessed 17 June 2009]


Figure 2: Do you use a personal blog? [9]

A study conducted in 29 countries shows that almost 25% of Internet users under 34 years old use a blog; this trend has moreover been growing since 2003.

The popularization of the Internet and the fact that anyone can create their own website for free have drastically increased the amount of content. The explosion of social networks (Facebook, Hi5…), blogs (Wordpress, Blogger, Myspace…) and microblogging (Twitter) is changing the nature and fabric of the World Wide Web: from an Internet built by a few thousand individuals, we have moved to one made by millions. [10]

If we take into account that searching is, after e-mail, the most common activity on the Internet [11] (see Figure 3), we can understand that more sophisticated tools are needed to find the right information on the Web.

Figure 3: How frequently do Internet users participate in the most popular activities?

[9] USC Annenberg School. (2008). The impact of the Internet. [online]. Available from: http://advertising.microsoft.com/sverige/WWDocs/User/sv-se/NewsAndEvents/Events/jeff_cole.ppt [Accessed 21 June 2009] p.7

[10] Cf. UCL. (2008). Information behaviour of the researcher of the future. [online]. Available from: http://www.bl.uk/news/pdf/googlegen.pdf [Accessed 19 June 2009] p.16

So far, we access websites in three ways:

- Direct access (for example, entering the URL directly in the address bar or clicking on a bookmark);

- External links (access to a website through a link on another website; this is the case for most websites, catalogs and advertisements);

- Search engines.

By using only the first two options, one cannot browse the Internet normally. It has also been said that the first way is increasingly giving way to search engines [12].

A search engine is therefore indispensable in order to navigate the Web properly.

1.1 Relevance of the subject

The Internet is increasingly becoming our information provider. As studies show:

“More people turned to the internet than any other source of information and support, including experts, family members, government agencies, or libraries” [13].

The Web is the primary source of information for many people, and its recognition as such is increasing [14].

[11] USC. (2008). Annual Internet Survey by the Center for the Digital Future. [online]. Available from: http://www.digitalcenter.org/pdf/2008-Digital-Future-Report-Final-Release.pdf [Accessed 21 June 2009] p.4

[12] Cf. Ohayon, O. (2008). Google, moteur de recherche ou moteur de navigation?. [online]. Available from: http://fr.techcrunch.com/2008/10/30/fr-google-moteur-de-recherche-ou-moteur-de-navigation/ [Accessed 17 June 2009]

[13] Estabrook, L. / Witt, E. / Rainie, L. (2007). Information searches that solve problems. [online]. Available from: http://www.pewinternet.org/~/media//Files/Reports/2007/Pew_UI_LibrariesReport.pdf.pdf [Accessed 17 June 2009] p.5

[14] Cole, J. I. / Suman, M. / Schramm, P. / Lunn, R. / Aquino, J. S. (2003). Surveying the Digital Future. [online]. Available from: http://www.digitalcenter.org/pdf/InternetReportYearThree.pdf [Accessed 17 June 2009]


The number of Internet users is estimated at 1,463,632,361 (out of a world population of 6,676,120,288), with a growth rate of 305.5% over 2000-2008 [15].
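As a rough check on these figures, the penetration rate and the implied number of users in 2000 can be derived from the numbers quoted above. This is only a minimal sketch: the variable and function names are illustrative, and it assumes that the 305.5% figure expresses growth relative to the 2000 user base.

```python
# Figures quoted in the text (Internet World Stats, 2009).
internet_users_2008 = 1_463_632_361
world_population_2008 = 6_676_120_288
# Assumption: 305.5 % is growth relative to the number of users in 2000.
growth_2000_2008 = 3.055


def penetration_rate(users: int, population: int) -> float:
    """Share of the population that uses the Internet."""
    return users / population


def implied_base(users_now: int, growth: float) -> float:
    """Users at the start of the period implied by the growth rate."""
    return users_now / (1.0 + growth)


if __name__ == "__main__":
    rate = penetration_rate(internet_users_2008, world_population_2008)
    base = implied_base(internet_users_2008, growth_2000_2008)
    print(f"Penetration rate: {rate:.1%}")
    print(f"Implied users in 2000: {base:,.0f}")
```

Under that assumption, the quoted figures correspond to roughly 22% of the world population being online and to about 361 million users in 2000.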

The Internet is thus our main information provider, and its number of users is increasing every day.

This holds for businesses as well as for individuals. More and more information is digitized, and it is therefore easier for companies to get data from the Internet than to extract it the old way. As an example, it is simpler to access the Yellow Pages online and copy and paste some information than to open the hard-copy book and type in the data you want to work on.

The Internet is thus a place where the working environment intersects with the personal one.

This information sharing has some consequences (large volumes of information, accuracy issues, exposure of Internet users to many advertisements). This is all the more problematic because it is the first time that an information provider has brought these two environments together to such an extent. This was not the case with TV, radio or even newspapers.

As we will see later, some companies rely solely on information; finding quality websites is therefore critical for businesses.

1.2 Major terms

In this thesis the following expressions will be used: search engines, search

engine dependency, data quality, Web 2.0 and following versions.

A search engine is the most flexible technology that has been created to browse the Web. A search engine is no more than a web application that processes information. It does not create data; it just processes the information it has in its index.

“A search engine is simply a means to ask information on the Web, a system for

organizing the data held on the Internet. A search engine can be metaphorically

15 Internet World Stats. (2009). INTERNET USAGE STATISTICS The Internet Big Picture. [online] Available from: http://www.internetworldstats.com/stats.htm [Accessed 17 June 2009]


compared to several activities: a miner panning for gold, a clerk looking for a

document in a cabinet…”16

Search engine dependency is the fact that Internet users rely on a single search engine when looking for information on the Internet. This dependency can arise from different factors such as loyalty, patriotism or convenience.

Data quality is the quality of data. Data are of high quality "if they are fit for their intended uses in operations, decision making and planning".17 Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. These two views can often be in disagreement, even about the same set of data used for the same purpose.18

Web 2.0 and following versions are not the names of specific software or technologies. For example, Web 2.0 is an online movement that encourages users to participate in the fresh, interactive nature of the Internet by using widely available, less expensive, and more mature state-of-the-art technologies.19

1.3 Focus, goals and structure of the report

The focus of this work is to show that there is a critical lack of knowledge about how to use the Internet, both at home and within businesses, and that one influences the other.

This lack of knowledge arises from our overestimation of technologies, commercial search engine strategies, lack of awareness, strong addiction to search engines, and lack of training within businesses and educational institutions.

This has critical consequences for business decision-making as well as for day-to-day choices.

16 Friedman, B. G. (2004). Web search savvy. p.19

17 Juran, J. (1999). Juran’s quality handbook: Fifth edition. p.976

18 Kaplan, I. (2008). Bad Data Can Cost You Big Time. [online]. Available from: http://www.federationofcredit.com/base/document/Newsletter/IKaplanSept08.html [Accessed 17 June 2009]

19 Meyerson, M./Scarborough, M. E. (2007). Mastering Online Marketing. p.223


If those risks are real, it is very important to demonstrate them by showing concretely what they are, where they come from, and how large the information gap is between a search made by a search-engine-dependent user and the most rational way of looking for the information.

The structure of the report is as follows:

The first step is to introduce the concept of data quality. What do we mean by data quality? How can data quality be obtained on the Internet?

The next point deals with the world of search engines and the dependency that arises from them.

Analyzing the world of search engines is important to understand why the Internet is not as rational as one might think and who the actors of the dependency are (search engines are not necessarily the Internet, and search engines may differ from one country to another).

Once this analysis is made, we will look at the facts and figures regarding search engine users' attitudes. This should lead us to the conclusion that Internet users do not use a full set of search tools but only a couple of them: the dependency concept.

Once the dependency concept is introduced, we will measure the risks of such addiction on data quality.

Google being the most used search engine in Europe, it will be used as a concrete example in the last part.

In the last part, recommendations will be given for companies interested in improving their information research system and reducing data quality issues when looking for information on the Internet.


1.4 Chapter 1: Key points

The Internet is used to share information and to communicate;

The number of websites created increases every day;

Websites are used for diverse purposes (advertising, expressing personal opinions, running businesses…);

25% of Internet users under 34 years old have a personal website;

In a decade we moved from an Internet built by a thousand individuals to one made by millions;

Search is the second biggest activity on the Internet after e-mail;

Search engines are so far the only way to crawl the Internet properly;

The Internet is our main information provider;

On the Internet, flows of information from businesses are mixed up with those of individuals, which can lead to confusion;

Search engines are the origin of this confusion; it is therefore critical to analyze how these technologies work and what the consequences of their use are;


Chapter 2: Concept of data quality


A recent study in the United States showed20 that the Internet is the most used source of information when people need help:

Figure 4: Most used information source when people need help

This finding is all the more significant if we consider that the World Wide Web is now the largest resource of information.21

The Internet thus has several strengths: it is the most used information system and the biggest resource of information, and it is moreover the most global and accessible one (free and mobile).22

The issue is how to use it wisely to get quality information.

If we look at how Internet users perceive the quality of information on the Internet, we can see that a high percentage of users are aware of the data quality issue. Most of them nevertheless agree that in general the Internet is a reliable source of information23:

20 cf. Estabrook, L./Witt, E./Rainie, L. (2007). Information searches that solve problems. [online]. Available from: http://www.pewinternet.org/~/media//Files/Reports/2007/Pew_UI_LibrariesReport.pdf.pdf [Accessed 17 June 2009]

21 Muñoz, C./Moraga, A./Piattini, M. (2008). Handbook of Research on Web Information Systems Quality. p.286

22 Albarran, A.B./Chan-Olmsted, S.M./Wirth, M.O. (2006). Handbook of media management and economics. p.471

23 Pierce, J. (2008). The World Internet Project. [online]. Available from: http://www.digitalcenter.org/WIP2009/WorldInternetProject-FinalRelease.pdf [Accessed 20 June 2009]


Figure 5: How much of the information on the World Wide Web overall is generally reliable?

2.1 Data quality definition

“Data has quality if it satisfies the requirements of its intended use. It lacks quality to the extent that it does not satisfy the requirement. In other words, data quality depends as much on the intended use as it does on the data itself. To satisfy the intended use, the data must be accurate, timely, relevant, complete, understood, and trusted.”24

Data quality is generally defined along six dimensions.

Accuracy: the quality of being near to the true value25. Accuracy is the most important dimension.26

Timelessness: unaffected by time27.

Relevant: the degree to which search results meet the requirements or expectations implicit in the query28.

Complete: bring to a whole, with all the necessary parts or elements.

Understood: perceive (an idea or situation) mentally.

24 Olson, J. (2003). Data quality: The accuracy dimension. p.24

25 Wordnet.princeton.edu. (2009). Accuracy definition. [online]. Available from: wordnet.princeton.edu/perl/webwn [Accessed 17 June 2009]

26 Olson, J. (2003). Data quality: The accuracy dimension. p.3

27 Wordnet.princeton.edu. (2009). Timelessness definition. [online]. Available from: wordnet.princeton.edu/perl/webwn [Accessed 17 June 2009]

28 WhamTech. (n.d). Glossary of less-than-usual terms used in the Web site. [online]. Available from: www.whamtech.com/glossary.htm [Accessed 17 June 2009]


Trusted: inclined to believe or confide readily.

Each of these dimensions can be accepted at a certain level. As previously said, everything depends on the intended use of the information. For example, a database with 70% accuracy may have value for some company departments (e.g. marketing, for estimations) because those 70% of the data are exploitable.

On the other hand, it can be useless for others, e.g. an accounting department releasing a balance sheet with 70% accuracy.
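The idea that the same data set can be acceptable for one use and unacceptable for another can be sketched in a few lines of Python. This is only an illustration: the `IntendedUse` class, the `is_fit_for_use` function and the threshold values are assumptions made for this example and do not come from the cited literature.

```python
from dataclasses import dataclass


@dataclass
class IntendedUse:
    department: str
    min_accuracy: float  # hypothetical acceptance level, between 0 and 1


def is_fit_for_use(measured_accuracy: float, use: IntendedUse) -> bool:
    """Data has quality only relative to the use it is intended for."""
    return measured_accuracy >= use.min_accuracy


# Assumed thresholds, chosen only to mirror the example in the text.
marketing = IntendedUse("marketing estimations", min_accuracy=0.70)
accounting = IntendedUse("accounting balance sheet", min_accuracy=1.00)

database_accuracy = 0.70
for use in (marketing, accounting):
    verdict = "usable" if is_fit_for_use(database_accuracy, use) else "not usable"
    print(f"{use.department}: {verdict}")
```

The same measured accuracy therefore leads to two different verdicts, which is exactly what defining quality by intended use implies.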

Data quality is a complex topic and some additional dimensions can be

included for the use of the data such as:

Accessibility, Accuracy, Amount of data, Applicability, Attractiveness,

Availability, Believability, Completeness, Concise representation, Consistent

representation, Cost effectiveness, Customer support, Currency, Documentation,

Duplicates, Ease of operation, Expiration, Flexibility, Granularity, Interactive,

Internal consistency, Interpretability, Latency, Maintainable, Novelty, Objectivity,

Ontology, Organization, Price, Relevancy, Reliability, Reputation, Response time,

Security, Specialization, Source's information, Timeliness, Understandability,

Validity, Value-added.29

2.2 Data quality issues within businesses

As we saw previously, accuracy is the most important dimension of data quality. Data is at the heart of any good business or organization. Some companies, such as financial ones, live entirely on information.

The use of the Internet has increased the flow of information, and a company's data are now used by other companies to make decisions such as purchasing and selling.

29 Muñoz, C./Moraga, A./Piattini, M. (2008). Handbook of Research on Web Information Systems Quality. p.138


So if company A provides poor-quality data that are afterwards reused by company B, a vicious circle begins in which the flow of biased information never stops.

As Jack E. Olson mentions in his book “Data quality”:

“Data is generated by more people, is used in the execution of more tasks by more people, and is used in corporate decision making more than ever before.”30

Data quality is critical.

Even though databases are recognized as the most important asset, companies tolerate enormous inaccuracies in their databases.

According to the same author, this issue is present not only within businesses but also in governmental organizations and educational systems:

Businesses and organizations are aware of data issues;

They all underestimate their consequences;

They have no idea of the costs linked to those issues;

They have no idea of the potential value in fixing the problem;

In his book, Jack E. Olson also estimates the loss associated with poor data quality at 15 to 25% of the operating profit.

Those losses are of different kinds: transaction rework costs, costs incurred in

implementing new systems, delays in delivering data to decision makers, lost

customers through poor service, lost production through supply chain problems.
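To give an order of magnitude, the estimate of 15 to 25% can be applied to a given operating profit. The function name and the profit figure below are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import Tuple


def data_quality_loss_range(operating_profit: float) -> Tuple[float, float]:
    """Return the (low, high) loss implied by the 15-25 % estimate."""
    return operating_profit * 0.15, operating_profit * 0.25


# Hypothetical company with an operating profit of 2,000,000 (any currency).
low, high = data_quality_loss_range(2_000_000)
print(f"Estimated loss: between {low:,.0f} and {high:,.0f}")
```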

These issues do not normally come from the data management system (a DMS is designed to answer a specific request). The failure mainly comes from its users.

To avoid this, users need to be aware of three things:

- what the system's capabilities are;

- how to use it properly;

- how to interpret its results.

30 Olson, J. (2003). Data quality: The accuracy dimension. p.5


The main remedy for this issue is a long-term strategy in which teams within the organization are trained in the concept of data quality management.

The concept of data quality is very relevant when dealing with search engines. Most of the search engines we know as consumers are commercial search engines. But the main objective of a commercial company is to make a profit, and many issues arise from this.

According to a study entitled “Findability”31, most businesses (62%) agree that finding information is critical. On the other hand, most of them do not know how critical it is, owing to a general lack of awareness. The study also shows that a strategy is often not defined (49%) and that proper goals are not clearly expressed. It draws the same conclusions as some authors on this topic32.

As we saw, technology itself is not responsible for quality issues; rather, the way technology is used and the interpretation of the information retrieved are a source of quality problems. These can be reduced by implementing methodologies such as:

– Putting in place a better information research management strategy33 mainly based on employee training. This does not only mean training employees on how to use technologies but also on how to develop pro-efficient behavior when making

31 cf. The Association for Enterprise and Content Management. (2008). Findability: The Art and Science of Making Content Easy to Find. [online]. Available from: http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx [Accessed 17 June 2009] p.22

32 Olson, J. (2003). Data quality: The accuracy dimension. p.7-8

33 Kehoe, M. (2009). Overview of the Enterprise Search Market. [online]. Available from: http://www.ideaeng.com/tabId/98/itemId/181/Overview-of-the-Enterprise-Search-Market-2009.aspx [Accessed 17 June 2009]

Figure 6: Enterprise findability goal


research. It means reconsidering the information process and participating in the improvement of the whole information research system (cf. chapter 2.3.4). Computer users expect too much from technologies, waiting to be fed the most rational solution although it is not yet on the market;

– Implementing a more user-oriented search application. Studies show that too many libraries did not investigate this field enough, focusing on the size of their database rather than on how to retrieve the information34. This is one of the reasons why people move from libraries to the Internet as an information provider;

2.3 Origins of data quality issues: Garbage In Garbage Out

“On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.”

– Charles Babbage

As we previously saw, data quality issues with search engines do not come from the technology. They in fact come from:

– The person who wrote the content of the results: misspellings, no concrete sources to support the claims, no adoption of standards, advertisement;

– The person who types in the request (cf. chapter 3.3.6.1: Search engine use awareness);

The next parts develop the first point in detail.

2.3.1 Poor data quality content: the Wikipedia example

Wikipedia is an easy example to illustrate the data quality issue with Internet content, and it introduces the following chapters well.

34 Cf. UCL. (2008). Information behaviour of the researcher of the future. [online]. Available from: http://www.bl.uk/news/pdf/googlegen.pdf [Accessed 19 June 2009] p.31


Wikipedia is one of the greatest collaborative World Wide Web projects ever, but it also has a couple of drawbacks. These mainly arise from an absence of data quality standards. Here are some of them:

– Anybody can contribute and can do so anonymously, so in theory a 3-year-old child can write an article. According to Sara Baase: “Accuracy and quality are impossible. Truth does not come from populist free-for-alls. Some articles are biased and one sided”35;

– Some articles without reliable sources can be validated by an administrator; Internet users may then take the displayed information for granted;

– The success of Wikipedia: word of mouth;

– Wikipedia's popularity36 makes it rank first on Google for most requests. It has a PageRank of 9 out of 10, which is almost the maximum recognition Google can give to a website.

Figure 7: 1st and 2nd results for "data quality" are Wikipedia websites

2.3.2 Poor data quality content: the commercial example

In a study entitled “Of course it’s true I saw it on the Internet!”37, aimed at understanding how American students conduct searches, the following question was asked: “List three major innovations developed by Microsoft over the past 10 years”.

35 Baase, S. (2007). A gift of Fire. p.352

36 Baase, S. (2007). A gift of Fire. p.351

37 Graham, E. L./Metaxas, P. T. (2003). Of course it’s true I saw it on the Internet!: Critical thinking in the Internet. Available from: http://www.wellesley.edu/CS/pmetaxas/CriticalThinking.pdf [Accessed 17 June 2009]


The survey was submitted to 180 college students in the United States during the school year 2000-2001.

In response, 63% used only one source of information: Microsoft’s website. But is a commercial website a reliable, neutral and trustworthy source of information?

One thing is certain: a company has no interest in criticizing itself on its own website, so it is highly probable that it will tend to promote itself rather than keep a neutral point of view.

The commercial aspect of search engines will be taken up again and developed further in the next chapters.

2.3.3 Metadata

Metadata is the key to understanding how search engines currently work and how to perform a good search. Commonly speaking, metadata is defined as data about data.

For example, a librarian archives books by assigning a reference to each of them.

For example, the reference “AA1” corresponds to “Gone with the Wind”.

Each web page on the Internet has several pieces of metadata, such as the “title” of the page, the “keywords” associated with the page, the “description”, etc.
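To make these page-level metadata concrete, here is a minimal Python sketch that reads the title, keywords and description from the head of an HTML page using the standard `html.parser` module. The `MetadataParser` class and the sample page are written for this illustration only; real search engines process far more signals than these three fields.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and the named <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            # e.g. <meta name="keywords" content="...">
            self.metadata[name.lower()] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


# Hypothetical page, written for this example only.
SAMPLE_PAGE = """
<html><head>
<title>Data quality basics</title>
<meta name="keywords" content="data quality, accuracy">
<meta name="description" content="An introduction to data quality.">
</head><body><p>Page content.</p></body></html>
"""

parser = MetadataParser()
parser.feed(SAMPLE_PAGE)
for name, value in parser.metadata.items():
    print(f"{name}: {value}")
```

A page whose head lacks these fields would yield an empty dictionary, which gives an indexing tool very little to describe the page with.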

Metadata issues mainly arise because metadata do not represent all the data. The best example is image search. Today, when typing a request to look for pictures, we get as a result a strange cocktail of a bit of everything. The reasons in this case are a lack of metadata and an inappropriate use of them.

For example, most Internet users upload pictures without giving them a name, leaving just a number as an identifier. This represents an incredible amount of unusable data.
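A simple way to picture the problem is to test whether a file name carries any searchable description. The Python sketch below flags names such as "P5170009", the keyword used for the Google Images query in Figure 8. The list of patterns is an assumption made for this example; cameras and phones use many other naming schemes.

```python
import re

# Assumed patterns for device-generated file names; real devices use many others.
DEFAULT_NAME_PATTERNS = [
    re.compile(r"^P\d{7}$", re.IGNORECASE),              # e.g. P5170009
    re.compile(r"^(IMG|DSC)_?\d{4,5}$", re.IGNORECASE),  # e.g. IMG_0042, DSC01234
    re.compile(r"^\d+$"),                                # digits only
]


def has_descriptive_name(filename: str) -> bool:
    """Return False when the file name carries no searchable description."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension, if any
    return not any(pattern.match(stem) for pattern in DEFAULT_NAME_PATTERNS)


for name in ("P5170009.JPG", "12345.png", "eiffel-tower-at-night.jpg"):
    label = "descriptive" if has_descriptive_name(name) else "not descriptive"
    print(f"{name}: {label}")
```

A picture uploaded under the first two names can only be found by someone who already knows the number, whereas the third name gives a keyword-based search engine something to match.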


This introduces another issue: findability.

2.3.4 Findability

“Findability Precedes Usability

In the alphabet and on the Web

You can’t use what you can’t find”38

Findability is the art and science of making content findable. The science is library science; the art is language arts and user interface design.39

Findability is only partly understood by businesses and too often confused with search.

38 Morville, P. (2005). Ambient Findability. p.111

39 The Association for Enterprise and Content Management. (2008). Findability: The Art and Science of Making Content Easy to Find. [online]. Available from: http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx [Accessed 17 June 2009] p.9

Figure 8: A query made on Google images with the keyword "P5170009"


Figure 9: How well is findability understood in your organization?

Findability is not only about searching but also about how to make information findable.

Most businesses agree on this point: findability is critical to the organization’s business goals and success (62%).

Figure 10: How critical is findability to your Organization's Business Goals and Success?

However, as a study on findability shows40 and as we will see later in Chapter 3, findability is not well defined or implemented within companies, mainly because of a management failure.

As Peter Morville describes it in his book “Ambient Findability”41:

Findability is defying classification. It flows across the borders between design, engineering, and marketing. Everybody is responsible, and so we run the risk that nobody is accountable.

Findability is everyone's concern within a company. For example, designing the company website involves different actors: designers, engineers, information architects, brand architects, and the marketing department.

Another example is that of a secretary or an archivist storing documents. He or she has to think about how to make those materials easy to find for everyone (by choosing the right metadata and the right technology); this includes

40 Cf. The Association for Enterprise and Content Management. (2008). Findability: The Art and Science of Making Content Easy to Find. [online]. Available from: http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx [Accessed 17 June 2009]

41 Morville, P. (2005). Ambient Findability. p.111


collaboration with all departments within a company. Otherwise those contents are not findable and are, in a way, lost.

Peter Morville gives two solutions: cultivating cross-functional collaboration and, on an individual level, learning how to be pro-efficient and to go beyond one's job responsibilities.

2.4 Data quality solutions

"A problem well defined is a problem half-solved."

– John Dewey

Data quality issues come from:

- Garbage In Garbage Out;

- No check of information accuracy;

The solutions are then easy to identify:

- Learning how to use search tools;

- Checking the information;

2.4.1 Learning how to use search tools

The main issue with Internet users is that they stick to the “Principle of Least Effort” formulated by George Kingsley Zipf:

“Each individual will adopt a course of action that will involve the expenditure of the probably least average of his work (least effort).”42

And according to Calvin Mooers, “people will not seek information that makes their jobs harder (even if it may benefit the organization they work for)”.43

Studies in fact show that users sacrifice information quality

42 Case, D. O. (2007) Looking for information. p.151

43 Morville, P. (2005). Ambient Findability. p.54


for accessibility44.

So users do not care about quality; they are interested in easy-to-access information.

This is mainly why the Google Advanced Search option is rarely used: people associate “Advanced” with “complex”,45 whereas Advanced Search should be the right way to search.

2.4.2 Check out the information: the Triangle method

Commonly used in the educational system, the triangle method consists in locating three independent sources that point to the same answer in order to produce the most accurate information. This method does not distinguish between quality websites and poor-quality ones, but it helps in checking the information.

Applying this concept can be more powerful than we might imagine. As an example, take a recent news event such as the riots in Tibet in 2008. According to the news provided in the United Kingdom46 and Germany47, taken as symbols of Western media, Tibetans were suffering true chaos in March 2008.

On the other hand, according to CCTV (China Central Television)48, some information posted by Western media was totally biased and incoherent. And the evidence put forward by the Chinese media actually supports their claim49. The inaccuracies came from the fact that

44 Hirsh, S./Dinkelacker, J. (2004). Seeking Information in order to produce information: an empirical study at Hewlett Packards Labs. p.816

45 Olausson, A. M. (2007). Advanced Search: Is the name a problem?. [online]. Available from: http://digital-lifestyles.info/2007/09/21/advanced-search-is-the-name-a-problem/ [Accessed 17 June 2009]

46 BBC. (2008). Tibetans describe continuing unrest. [online]. Available from: http://news.bbc.co.uk/2/hi/asia-pacific/7300312.stm [Accessed 17 June 2009]

47 Berliner Morgenpost. (2008). China rüstet sich für « die entscheidende Schlacht ». [online]. Available from: http://www.morgenpost.de/printarchiv/politik/article169230/China_ruestet_sich_fuer_die_entscheidende_Schlacht.html [Accessed 17 June 2009]

48 XinHua. (2008). Commentary: Facts about Tibet should not be distorted. [online]. Available from: http://news.xinhuanet.com/english/2008-03/24/content_7847789.htm http://news.xinhuanet.com/english/2008-03/24/content_7847789_1.htm [Accessed 17 June 2009]

49 Beijing Review. (2008). Dialogue: Media Coverage on Tibet. [online]. Available from: http://www.bjreview.com.cn/special/txt/2008-03/22/content_107054.htm [Accessed 17 June 2009]


Western media did not know the Chinese and Tibetan cultures and languages well enough and attached inaccurate captions to images.

In this situation, looking at three independent sources is critical. Who would have thought, for example, that Western media could be wrong?

Figure 11: Triangle method application

Reliable sources are thus a necessary condition for data accuracy, but this condition is not sufficient: you also need to find three independent and reliable sources of information that point to the same answer.
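The triangle method can be expressed as a small check: an answer is retained only when three reliable sources with different owners agree on it. The Python sketch below is one possible reading of the method, not a standard algorithm. The `Finding` structure, the `owner` field used to decide independence, and the sample data are all assumptions made for this illustration, and judging whether a source is reliable remains the searcher's own task.

```python
from collections import defaultdict
from typing import List, NamedTuple, Optional


class Finding(NamedTuple):
    source: str     # where the information was found
    owner: str      # hypothetical field: the organization behind the source
    reliable: bool  # the searcher's own judgement of the source
    answer: str     # what the source says


def triangulate(findings: List[Finding], required: int = 3) -> Optional[str]:
    """Return an answer backed by `required` independent, reliable sources."""
    owners_by_answer = defaultdict(set)
    for finding in findings:
        if finding.reliable:
            # Two sources with the same owner are not independent,
            # so they count only once.
            owners_by_answer[finding.answer].add(finding.owner)
    for answer, owners in owners_by_answer.items():
        if len(owners) >= required:
            return answer
    return None


# Hypothetical findings for the question "What is the capital of Australia?"
findings = [
    Finding("encyclopedia.example", "Publisher A", True, "Canberra"),
    Finding("atlas.example", "Publisher B", True, "Canberra"),
    Finding("atlas-mirror.example", "Publisher B", True, "Canberra"),
    Finding("forum.example", "Anonymous", False, "Sydney"),
    Finding("stats-office.example", "Agency C", True, "Canberra"),
]

print(triangulate(findings))      # three independent owners agree
print(triangulate(findings[:3]))  # only two independent owners: no answer
```

In the second call the mirror site repeats the atlas, so the three pages count as only two independent sources and no answer is returned.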


2.5 Chapter 2: key points

The Internet is the most used, largest, most global and most accessible source of information;

The majority of Internet users consider the Internet a reliable source of information and are aware of quality issues;

Accuracy is the most important dimension of data quality, and in some cases a certain level of inaccuracy can be accepted;

Some companies live entirely on information;

A company's data are used by other companies to make decisions;

Data quality issues affect all kinds of organizations;

The loss associated with poor data quality is estimated at 15 to 25% of the operating profit;

In most cases the database management system is not the cause of data quality issues;

A majority of businesses do not have proper goals defined regarding the findability of their material within their search environment;

Cultivating cross-functional collaboration and pro-efficient behavior within companies is the key to setting up good information retrieval systems;

Making content findable is the responsibility of everyone within a company;

People will not seek information that makes their jobs harder (even if it may benefit the organization they work for);

Users sacrifice information quality for accessibility;

People associate “Advanced” with “complex”, whereas Advanced Search is the right way to search;

Accuracy issues can be reduced by checking the information against three independent and reliable sources;


Chapter 3: Search engines dependency


As previously seen, search is the second most popular activity on the Internet, and search engines are the most appropriate tool for it.

Before introducing the search engine dependency concept, it is useful to know how the search engine market is structured.

Even if Google is recognized as the leading brand in this field, its dominance may not be worldwide.

A strong effort has been made in this thesis to make it as global as possible. Most publications in this area have been written considering the American and European markets as a representative sample of the market.

The rise of India and China in the technological world and the increase of information on the Internet now allow us to get information about the Asian market. Even if most new technologies come from the United States, it is worthwhile to extend the study to Asia to get a more representative and comprehensive panel.

3.1 Search engine categories

Search engines can be divided into two categories:

Commercial search engines, available free of charge to the general public, mainly in exchange for displaying advertisements;

Enterprise search engines for businesses. They are generally paid services, free of advertisements and customized for a specific need.

3.1.1 Commercial search engines

Commercial search engines are divided into four categories:

Standard: the best-known search engines such as www.google.com, www.bing.com, http://www.ask.com/. They look for any kind of information across the Internet and are characterized by a very light interface (mostly text-based applications):


Figure 12: Ask search engine home page

Portals: portals are a mix between standard search engines and directories. Unlike search engines, directories use human beings instead of robots to index website addresses. In theory (leaving aside the commercial aspect), directories should provide quality information rather than quantity.50 Portals are characterized by a lot of information on their home page, including the search engine function. The best-known portal is Yahoo.

Figure 13: Yahoo home page

Specialized search engines: they belong to a subcategory of the first group and are also called vertical search engines. A vertical search engine searches the information sources of one industry or of one kind.51 Specialized search engines crawl only a restricted area and not the entire web. For example, they can search only a specific website or only a specific kind of document (books, images, .pdf documents, videos…).

Although specialized search engines are not a revolution in themselves (most of them are a filter on bigger search engines), they nevertheless find their

50 Friedman, B. G. (2004). Web Search Savvy. p.21

51 Wang, W. (2007). Integration and Innovation Orient to E-Society Volume 1. p.666


place when standard search engines are providing too many results for a

given request.

Figure 14: An example of vertical search with Yahoo Images

Semantic search engines: most search engines on the market are based on keywords and document popularity (for example Google PageRank) without taking into account the real content52. The idea behind semantic search is to understand the hidden meaning of the information. A recent example of such a search engine, “Wolfram Alpha”, has just come onto the market. It is described as a “knowledge engine”53 designed to answer your request directly rather than send you to a website that may contain the answer. Semantic search engines belong to the Web 3.0 generation, where machines interpret the meaning of the data.54

Figure 15: A semantic search engine: Wolfram Alpha

52 Priss, U./Corbett, D./Angelova, G. (2002). Conceptual structures. p.92

53 Valentiner, Z. (2009). New search tool on the block: Wolfram Alpha. [online]. Available from: http://www.mndaily.com/blogs/tech-corner/2009/05/20/new-search-tool-block-wolframalpha [Accessed 17 June 2009]

54 Cf. Sankar, K./Bouchard, S./Mancini, D. (2009). Enterprise Web 2.0 Fundamentals. p.161

Page 38: Risks of search engine dependency and its influence on data quality

38

3.1.2 Enterprise search engine (ESE)

Enterprise search engines are dedicated to searching within a company's environment: Internet, intranet, customer management system, databases, wikis, software applications.

Their usefulness is clear when employees are looking for information which is not public, or want to get pertinent information from within their own environment.

Enterprise search engines have more or less the same technology and function as commercial web search engines; they simply target a specific group rather than a mass public audience.[55]

3.2 Search engine market

3.2.1 Commercial search engine market

The commercial search engine market is segmented as follows:

Figure 16: Top 10 Worldwide Search December 2007

55 cf. The Association for Enterprise and Content Management. (2008). Findability: The Art and Science of Making Content Easy to Find. [online]. Available from: http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx [Accessed 17 June 2009]

- Google is the clear leader with more than 60%;
- Yahoo holds a comfortable second position with more than 10%;
- Three other major search engines share the 3rd, 4th and 5th places with market shares from 2.4 to 5%: Baidu, Microsoft and Naver;
- Some specialized search engines are present in the top 10.

As mentioned above, in 2007 the top 10 search websites showed an interesting market with the presence of:

- 2 specialized search engines: eBay and Alibaba.com;
- 4 Asian search engines: Baidu, NHN, Yandex and Alibaba.com.

This clearly shows the presence of Asian technologies. Moreover, Baidu, NHN and Yandex are nationally oriented, as we will see later in chapter 3.2.7.

3.2.2 Commercial search engine market: Consumer behavior

A study made in the United States shows that Internet searchers are confident, satisfied and trusting of search engines.[56] Some of those results are confirmed by a Taiwanese study:[57]

• 92% are confident about their searching skills;
• 87% have a successful search experience;
• 68% believe that search engines are a fair and unbiased source of information;
• 44% of searchers say they regularly use a single search engine, 48% use just two or three, and 7% use more than three;
• 62% are not aware of a distinction between commercial and non-commercial results.

56 Fallows, D. (2005). Search engine users. [online]. Available from: http://www.pewinternet.org/~/media//Files/Reports/2005/PIP_Searchengine_users.pdf.pdf [Accessed 17 June 2009] p.2

57 Insight Xplorer. (2006). 創市際市場研究顧問. [online]. Available from: http://www.insightxplorer.com/specialtopic/co_info_acquisition.html [Accessed 17 June 2009]

Moreover, according to a study entitled “Surveying the Digital Future”:[58]

Figure 17: How Much Of The Information On the Internet Do You Think is Reliable and

Accurate?

A large majority of respondents have regarded it as a reliable and accurate source of information over time.

According to another study, 22% of Internet users have a search engine such as Google or Yahoo as their home page. This share has doubled since 2005.[59]

Regarding search engine reliability and accuracy, 51% said in 2007 that most or all of the information produced by search engines is reliable and accurate, down from 62% in 2006.

Internet users find a high degree of reliability and accuracy on their favorite websites: 81% in 2005, 83% in 2006 and 83% in 2007.[60]

58 UCLA Center for Communication Policy. (2004). Surveying the Digital Future. [online]. Available from: http://www.digitalcenter.org/downloads/DigitalFutureReport-Year4-2004.pdf [Accessed 18 June 2009]. p.39

59 Center for the Digital Future (2008). Annual Internet Survey by the Center for the Digital Future. [online]. Available from: http://www.digitalcenter.org/pdf/2009_Digital_Future_Project_Release_Highlights.pdf [Accessed 19 June 2009] p.4

60 Center for the Digital Future (2008). Annual Internet Survey by the Center for the Digital Future. [online]. Available from: http://www.digitalcenter.org/pdf/2009_Digital_Future_Project_Release_Highlights.pdf [Accessed 19 June 2009] p.5

In 2007, 80% of Internet users considered that most or all of the information posted by well-known media such as the New York Times and CNN is reliable and accurate, against 77% in 2006.

It seems that users of commercial search engines have a positive search experience. Even if they recognize data quality issues, they do not seem to understand where those problems come from.

It would therefore be interesting to inform them better about the commercial aspect of free search engines.

3.2.3 Enterprise Search Engine market

The Enterprise Search Engine market is far more confused and crowded[61] than the commercial one. Little information is available on it, but we can say that the actors are different and that enterprise search engines are customized for a specific use.

In a book entitled “Practical aspects of Knowledge Management”, written in 2008 by Takahira Yamaguchi, a ranking of the main actors in this field is given:[62]

1st autonomy.com
2nd Fastsearch.com
3rd Endeca.com

As we can see, none of those three companies was listed in the commercial search engine ranking. However, some commercial search engine firms are present on this market, such as Google with Google Search Appliance and Microsoft with Microsoft Search Server.

61 Feldman, S. (2005). Desperately seeking search. [online]. Available from: http://www.kmworld.com/Articles/Editorial/Feature/Desperately-seeking-search-9665.aspx [Accessed 17 June 2009]

62 Yamaguchi, T. (2008). Practical aspects of Knowledge Management. p.41

3.2.4 Enterprise Search Engine market: Consumer behavior

The enterprise search engine market has a different configuration from the commercial one. However, the main protagonists such as Google and Microsoft are still present.[63] In contrast to commercial web search engines, enterprise search engine users are mostly disappointed by their search experience.

It is striking that almost half of them (49%) have a negative image of searching for information with their enterprise search tools.

The major reasons for this are:

– The lack of training and consulting on those search tools within organizations;[64]
– The expectation of results as pertinent as those of commercial web search engines.

63 Kehoe, M. (2009). 2009 Overview of the Enterprise Search Market. [online]. Available from: http://www.ideaeng.com/tabId/98/itemId/181/Overview-of-the-Enterprise-Search-Market-2009.aspx [Accessed 17 June 2009]

64 cf. The Association for Enterprise and Content Management. (2008). Findability: The Art and Science of Making Content Easy to Find. [online]. Available from: http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx [Accessed 17 June 2009] p.36

Figure 18: Enterprise search satisfaction

Figure 19: Influence of the consumer web on enterprise search tools


A vast majority (82%) agree that their consumer web experience of looking for information on the Internet influences their expectations regarding the implementation of such technology within companies.

As Ron Miller (cited in the following study) explained: « On the web, search engines like Google have the advantage of searching the entire web. Therefore, the likelihood of finding query matches is much greater than in the enterprise where the number of possible right answers is much smaller, and could in fact be found in just a single document. (Of course finding more results doesn’t necessarily mean finding right ones, but that’s another issue altogether.) »

It is then not surprising to see that most enterprise search engine users are not successful in finding what they are looking for:

According to Ron Miller, the problem can be stated as follows: "I don’t think the technology is failing us, I think it’s the way we are using the technologies," but he adds, "If I can’t find my content, it doesn’t exist."[65]

This part clearly shows that searchers within companies confuse commercial search engines with enterprise search engines, directly associating one with the other. It also shows the lack of training in those technologies and thus confirms the lack of technology literacy among Internet users. Moreover, it clearly defines what the market wants: simple and easy-to-use applications.

65 Miller, R. (2009). Unlock Power Enterprise Search. [online]. Available from: http://byronmiller.typepad.com/UnlockPowerEnterpriseSearch.pdf [Accessed 17 June 2009] p.5

Figure 20: Success rate of finding the information with enterprise search tools


3.2.5 The commercial search market repartition

Regarding the distribution of search use on the Internet, we can see that the block “Europe + North America” represents more than half of the market with 55%.

The Asia-Pacific area is well represented with one third of the market.

North American and Asian Internet users perform more or less the same volume of searches, whereas it is in Europe and Latin America that Internet users search the most per capita.

This part is developed further in chapter 5.1.4: Google dependency state.

3.2.6 The commercial search engines in the world

As mentioned in chapter 3.2.1, 6 out of 10 searches on the Internet are made on Google.

However, this does not mean that 60% of the population of each country in the world uses Google.

Figure 21: Worldwide Search by Region


Figure 22: Search engine leaders (>50%) per country, personal estimation[66]

The world is not entirely covered by Google. There are 7 other leaders: Yahoo, Yandex (Mail.ru), Baidu, Microsoft, Naver, Seznam and Leit.is.

Almost all of the American continent uses Google, as do Europe, Northern Africa, Southern Africa, Australia and India: in short, almost all countries which have strong links with Anglo-Saxon culture.

The strong presence of Yandex in Eastern Europe (ex-Soviet countries) and Russia could suggest a possible « boycott of American technologies » and support for Russian technologies. The recent partnership between Yandex (the main search engine in Russia) and the browser Firefox strengthens those suspicions.[67]

66 Alexa Web. (n.d). Alexa Top 500 Global Sites. [online]. Available from: http://www.alexa.com/topsites [Accessed 17 June 2009]

67 cf. Houste, F. (2009). Russie: Yandex sera le moteur de recherche par défaut de Firefox. [online]. Available from: http://www.search-engine-feng-shui.com/2009/01/russie-yandex-sera-le-moteur-de-recherche-par-defaut-de-firefox/ [Accessed 23 January 2009]

cf. Schwartz, B. (2009). Firefox Drops Google For Yandex In Russia, But Big Loser May Be Rambler. [online]. Available from: http://searchengineland.com/firefox-drops-google-for-yandex-in-russia-but-big-loser-may-be-rambler-16107 [Accessed 23 January 2009]

The same observation can be made in China. The recent advertisements broadcast by Baidu[68] (the search engine leader in China) point in that direction, clearly showing the will to get rid of foreign search engines.[69]

The Russian and Chinese cases contradict the concept put forward in the book “Winners, Losers and Microsoft”, which says that the best product always wins.[70] The search engine market is therefore not a rational one.

Information regarding the Caribbean and Central Africa is hard to find and is not very relevant, given that the Internet is not yet well established there.

On the other hand, the Pacific region is quite interesting because the countries it contains, including all the « Tigers » (Taiwan, Thailand...), are all in red: Yahoo.

In conclusion, the search engine world is divided into two parts:

The Google planet: composed of all the Anglo-Saxon countries as well as countries which have strong links with the United States or Great Britain. For the Czech Republic and Iceland, it seems to be only a matter of time.[71]

The Asian-Pacific regions: Asia is composed of many countries and therefore many cultures. Among them we can identify four players:

o Yandex (Mail.ru), which dominates all the ex-Soviet countries;
o Baidu, which has total control over China;
o Naver (NHN Corporation), a 100% South Korean product which is the best example that search engines work by culture;
o Yahoo, which is the leader in almost all the Asian “Tiger” countries.

68 Baidu. (2006). Baidu advertisement. [online]. Available from: http://www.youtube.com/watch?v=EPnmsFl__nU [Accessed 17 June 2009]

69 cf. Einhorn, B. (2007). Baidu Thinks It Can Play in Japan. [online]. Available from: http://www.businessweek.com/globalbiz/content/feb2007/gb20070215_649662.htm?chan=globalbiz_asia_technology [Accessed 23 January 2009]

cf. Grallet, G. (2009). Baidu, un autre Google s'éveille. [online]. Available from: http://www.lexpress.fr/actualite/high-tech/baidu-un-autre-google-s-eveille_734826.html [Accessed 23 January 2009]

cf. Shijun, Z./Peng, N./Weifeng, X. (2006). 时尚中国—网动中国英. p.45

70 Liebowitz, S. J./Margolis, S. (1999). Winners, Losers and Microsoft

71 cf. Rafat, A. (2008). Czech Portal Seznam Could Fetch $900 Million; Google, Apax, Warburg and Others in Fray. [online]. Available from: http://www.washingtonpost.com/wp-dyn/content/article/2008/08/15/AR2008081502517.html [Accessed 23 January 2009]

cf. Mar Hauksson, K. (2007). Global search report 2007. [online]. Available from: http://www.e3internet.com/downloads/global-search-report-2007.pdf [Accessed 23 January 2009] p.8

Since Yahoo is an American technology, how can we explain its domination in Asia? The reason is mainly cultural: Yahoo is a shiny portal, and Asian Internet culture judges the quality of a website by the number of animations on it.[72] Another explanation could be the leading presence of Yahoo in Japan, which can influence the Tiger countries. Moreover, Japan has one of the highest rates of Internet penetration per capita in the world.[73]

3.2.7 Commercial search engine leaders presentation

Knowing the search engine leaders and the services they provide is critical to understanding the concept of search engine dependency. Here is a list of the main commercial search engine actors:

Google:[74] Created in 1998 in the United States. Physically present in 34 countries around the world.

Services provided: News, Blogs, Images, Videos, Maps, Mail services, Social networks, e-commerce, Online advertising…

Languages supported: More than 65.

72 cf. Tobin, R./Hotchkiss, G./Lee, P. (2008). Chinese Search Engine Engagement. [online]. Available from: http://www.enquiroresearch.com/download-research-whitepapers.aspx [Accessed 17 June 2009] p.28.

73 Internet World Stats. (2009). Internet Usage in Asia. [online]. Available from: http://www.internetworldstats.com/stats3.htm [Accessed 17 June 2009]

74 Miller, M. (2006). Googlepedia. p.11

Figure 23: Search engine market in the USA, source: Hitwise, February 2009

Figure 24: Google logo

Yahoo (“Yet Another Hierarchical Officious Oracle”):[75] Created in 1994 in the United States, it started as a directory and later became an Internet portal. Physically present in 20 countries around the world.

Services provided: News, Business directory, Maps, Videos, Images, Online advertising, Mail services, Jobs, Questions/Answers…

Languages supported: More than 20.

Baidu:[76] Created in 2000 in China. Physically present in China and in Japan.

Services provided:[77] News, Business directory, Maps, Music, Videos, Images, Online advertising, Social networking…

Languages supported: 2 (Chinese and Japanese).

75 Yahoo Inc. (n.d.). Company Overview. [online]. Available from: http://yhoo.client.shareholder.com/press/overview.cfm [Accessed 17 June 2009]

Yahoo Inc. (n.d.). Yahoo dans le monde. [online]. Available from: http://world.yahoo.com/?c=fr [Accessed 17 June 2009]

76 Shijun, Z./Peng, N./Weifeng, X. (2006). 时尚中国—网动中国英. p.45

Baidu Japan Inc. (n.d.). Baidu(バイドゥ)会社情報 - 会社概要. [online]. Available from: http://www.baidu.jp/info/corp/data.html [Accessed 17 June 2009]

77 Baidu Inc. (n.d.). Baidu products. [online]. Available from: http://ir.baidu.com/phoenix.zhtml?c=188488&p=irol-products [Accessed 17 June 2009]

Figure 25: Yahoo logo

Figure 26: Japanese search engine market, source: webcreate.ga-pro.com, May 2009

Figure 27: Chinese search engine market, source: China IntelliConsulting Corp., Sept. 2008

Figure 28: Baidu logo

Microsoft: Microsoft's main search engine is named “Bing” (since June 2009). As search is not Microsoft's core activity, it is quite complex to describe. Internet users do not so much go to Bing directly as use it through Microsoft's other sites and services such as Hotmail. Microsoft is physically present all over the world.

Services provided: News, Social Networking, blogs, Mail services, toolbar…

Languages supported:[78] 41

Naver:[79] Created in 1999 in South Korea. Naver is an Internet portal. Physically present in South Korea, China, Japan and the United States.

Services provided: News, e-commerce, Social Networking, blogs, real time information, Books, Mail services, toolbar.

Language supported: Korean.

Yandex:[80] Created in 1997 in Russia. Yandex is physically present in Russia, Ukraine and the United States.

78 Microsoft. (n.d.). Préférences Bing. [online]. Available from: http://www.bing.com/settings.aspx?sh=2&FORM=WIWA [Accessed 17 June 2009]

79 NHN Corporation. (n.d.). NHN Corporation. [online]. Available from: http://www.nhncorp.com/ [Accessed 17 June 2009]

Figure 29: Bing logo

Figure 30: Korean search engine market, source: Koreanclick, July 2007

Figure 31: Naver logo

Services provided: News, e-commerce, Social Networking, blog search

engine, Maps, dictionary, Mail services, photos, website, videos, professional

network, online payment service, online advertising.

Languages supported: Russian, Ukrainian and English.

Seznam:[81] Created in 1996 in the Czech Republic. Seznam is an Internet portal. Physically present in the Czech Republic.

Services provided: Search, Business directory, Images, Mail services, Online advertising, e-commerce, News, Social Network, Jobs, Online Games.

Language supported: Czech.

Leit.is:[82] Leit.is is an Icelandic Internet portal created in 1999. It is physically present in Iceland.

Services provided: Images, Music…

80 Yandex inc. (2008). Russia’s largest internet search engine and a leading internet and technology company. [online]. Available from: http://download.yandex.ru/company/mini_book_v19.pdf [Accessed 17 June 2009]

81 Seznam inc. (n.d.). Vize firmy | O společnosti Seznam.cz. [online]. Available from: http://firma.seznam.cz/cz/vize-firmy.html [Accessed 17 June 2009]

82 Leit.is. (n.d.). Leit.is - Um leit.is :: Um leit.is. [online]. Available from: http://www.leit.is/umleit/ [Accessed 17 June 2009]

Figure 32: Yandex logo

Figure 33: Search engine market in Russia, source: LiveInternet.ru, December 31, 2008

Figure 34: Seznam logo

Figure 35: Search engine market in Czech Rep, source: navrcholu.cz, June 2008

Languages supported: Icelandic and English.

As we can see, none of those search engine leaders is a simple search engine anymore. All of them provide a range of services linked to their search activity. Moreover, they all have around ten years of experience or more in the search field.

The ones accumulating the most market share are those that operate internationally.

3.2.8 Commercial search engine complexity

As previously mentioned, commercial search engines no longer provide only a search experience. They are all moving toward a personalized interface with a set of associated services. In fact, they are turning into a personal desktop environment where, by creating a simple free account, you can access your emails, search engine, personal documents, and software suite solutions such as a spreadsheet, slides or word processor. iGoogle is a good example of this:

Figure 36: Search engine market in Iceland, source: statice.is 2007

Figure 37: Leit.is logo

Figure 38: An example of a customized interface on iGoogle

It is like an operating system (Google) within the operating system (Microsoft Windows, Linux, Mac OS).

In such a configuration, commercial search engines provide more attractive services, because their tools are more intuitive than the ones available within companies. The frustration of company employees can then be understood: for them, mass-market technology is moving faster than business technology.

3.2.9 Search engine market shares configuration

A study entitled « Global Search Report 2007 »,[83] carried out in 2007, gives a clear view of the market. It shows that the configuration of each market is always the same:

83 cf. Wilsdon, N. (2007). Global Search Report 2007. [online]. Available from: http://www.e3internet.com/downloads/global-search-report-2007.pdf [Accessed 23 January 2009]

Figure 39: Search engine market shares in 2007 for the Czech Republic

It is very rare to find a country where there is close competition among search engines. Even if things change from one day to the next in the high-technology sector, you often find the following configuration, where the first search engine leads its followers by more than 30 points.

When a search engine gets more than 50% of the market, it is adopted as a standard. This trend seems quite common in the software industry: people seem to look for a standard used by all. This is the case for the operating system industry, the browser industry and the e-learning industry. The explanation of such success within a population can be found in word of mouth: isn’t that how Google has been so successful? Who has never heard sentences such as « you just have to Google it »? Google is even in dictionaries nowadays as a verb.[84]
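The two thresholds described above (a lead of more than 30 points over the followers, and a share above 50% for a search engine to be adopted as a standard) can be made concrete with a small sketch. The Python function below is only an illustration: the function name, the field names and the sample shares are hypothetical and are not taken from the Global Search Report. Only the 30-point and 50% thresholds come from the text.

```python
def describe_market(shares, lead_gap=30.0, standard_share=50.0):
    """Summarise one national market from a dict {engine: share in %}.

    lead_gap and standard_share default to the thresholds quoted in the
    text (30 points and 50%); everything else here is illustrative.
    """
    if not shares:
        raise ValueError("shares must contain at least one engine")
    # Rank engines from the largest share to the smallest.
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    leader, leader_share = ranked[0]
    # Share of the closest follower (0 if the leader is alone).
    runner_up_share = ranked[1][1] if len(ranked) > 1 else 0.0
    gap = leader_share - runner_up_share
    return {
        "leader": leader,
        "gap_points": gap,
        "dominant_lead": gap > lead_gap,
        "de_facto_standard": leader_share > standard_share,
    }


# Hypothetical shares, invented for the example (not survey data).
example = {"Engine A": 62.0, "Engine B": 25.0, "Engine C": 13.0}
summary = describe_market(example)
print(summary)
```

Applied to the market shares of a given country, such a function would simply flag whether the leader fits the configuration described in this chapter. The thresholds are parameters so that other cut-off values can be tried.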

Markets are also defined by a lot of small local search engines which, if original enough, are bought by the biggest ones or otherwise disappear quickly (examples appear in the news every month). The only key to success in the short term seems to be advertising, but in the long run you need the technology behind it in order to compete.

3.2.10 Search engines competition

84 cf. Merriam Webster. (2001). Google - Definition from the Merriam-Webster Online Dictionary. [online]. Available from: http://www.merriam-webster.com/dictionary/google [Accessed 17 June 2009]

Google was created in 1998 and was not a pioneer in the field of search engines. In a short period of time Google succeeded in taking the lead, and among the pioneers in this field only Yahoo (created in 1994) is still in place.

Even if Google has a dominant position on the market, it will take it a lot of time to be number one in all countries (as we saw, this market is not rational, mainly for political and cultural reasons). This situation in fact gives hope and time to its competitors.

Microsoft is still in discussion with Yahoo in order to buy Yahoo's search technologies. One can understand how strategic such an acquisition could be: Yahoo has the search know-how, and Microsoft the funds as well as the software ownership.

Regarding Baidu, we cannot clearly see how it could compete against Google outside China.

What about newcomers? Starting from nothing, they could perhaps beat famous search engines in a short period of time. This could have been the story of services such as Cuil, launched in summer 2008, which received a lot of publicity in the news.[85] But the search engine market is a very unforgiving world where visitors give no more than one chance: the product works or it does not. In the case of Cuil it did not.

“An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have it.”[86]

Users want the information as soon as they can get it. When you move from Google to another search engine you are often intransigent: at the first result which does not fit your expectations you go back to Google. But is the search engine wrong, or is it just responding differently from what you were used to?

In conclusion, it is hard to say how Google could lose its dominant position. Until now only one company has succeeded in opening such a gap in the world of search engines, and it is Google itself, at a time when everything had yet to be created on the Internet.

A new technology is however coming up more and more often in this field: semantic search.

85 cf. Arrington, M. (2008). Cuil On BusinessWeek's Most Successful of 2008 List. Huh?. [online]. Available from: http://www.techcrunch.com/2008/12/29/cuil-on-businessweeks-most-successful-of-2008-list/ [Accessed 17 June 2009]

86 Mooers, C. N. (1959). A panel discussion at the Annual Meeting of the American Documentation Institute. 24 October.

3.3 Search engine dependency aspect

As mentioned in the introduction, search engine dependency is the fact that people use only one search engine, and therefore only one way of processing data, when looking for information on the Internet.

3.3.1 Search engines dependency proves

The sources used for this part come from Canadian,[87] French[88] and Belgian student panels.[89] Some other information regarding Germany, China (Hong Kong)[90] and the United States[91] has also been used.

Those studies were carried out on different panels (students, workers such as researchers, households) and reached the following conclusion: the search engine is the first tool used when looking for information on the Internet.

The surveys made on students also state that most of them did not receive enough training on how to look for information.

87 cf. Crepuq. (2003). Etude sur les connaissances en recherche documentaire des étudiants entrant au 1er cycle dans les universités québécoises. [online]. Available from: http://www.crepuq.qc.ca/documents/bibl/formation/etude.pdf [Accessed 18 June 2009]

88 cf. Université de Lyon. (2007). De la documentation au plagiat. [online]. Available from: http://www.compilatio.net/files/sixdegres-univ-lyon_enquete-plagiat_sept07.pdf [Accessed 18 June 2009]

89 cf. EduDoc. (2008). Enquête sur les compétences documentaires et informationnelles des étudiants qui accèdent à l'enseignement supérieur en Communauté française de Belgique. [online]. Available from: http://www.edudoc.be/synthese.pdf [Accessed 18 June 2009]

90 cf. Leung, H. W. 梁漢榮. (2004). A study of computer science students' conceptions of information literacy and their experiences in information search process and use. [online]. Available from: http://hub.hku.hk/handle/123456789/30758 [Accessed 18 June 2009]

91 cf. Enquiro. (2004). Search Engine Usage in North America. [online]. Available from: http://www.enquiroresearch.com/download-research-whitepapers.aspx [Accessed 18 June 2009]

The best study found on this topic was made in 2008 on all the registered PhD students (2,218, with a response rate of 23.4%) of a whole region of France.[92]

The study shows that 67.5% of the respondents never received any training on how to look for information during their whole stay at university, and that search engines are used in 96% of cases when performing research.

Internet users are dependent on search engines.

3.3.2 Types of search engines dependency

Different types of search engine dependency can be identified:

Search engine satisfaction: users perform web searches on a specific search engine which gives them entire satisfaction. They then have no reason and no intention to change;

Search engine patriotism: users perform searches on a specific search engine in order to support a specific cause, for example to support the national economy: Yandex and Baidu;

Search engine convenience and lock-in effect: users perform searches on a specific search engine for all the other services it can provide (convenience). They may also be locked in to all the services they subscribed to and not wish to change for this reason. For example, it is more convenient to gather all the services under the same provider than to go to each individual website to use the service (using one email box for different accounts, displaying news from different providers on the same page…).

As explained in chapter 3.2.7, most search engine leaders are moving in this direction.

92 cf. URFIST de Rennes. (2008). Enquête sur les besoins de formation des doctorants à la maîtrise de l’information scientifique dans les Ecoles doctorales de Bretagne. [online]. Available from: http://www.uhb.fr/urfist/enquete_besoins_formation_doctorants-maitrise_information [Accessed 18 June 2009]

Being search engine dependent means massively using one search engine for one of the reasons above and not using, or even thinking of, other solutions. Search engine dependency reaches very high rates in Europe:

Figure 40: Google market shares in Europe in 2008, source: Comscore

Most European countries have a strong addiction to Google, with a market share above 70%. It means that most European citizens are fed information that has been processed in the same way.

3.3.3 Search engine loyalty

Studies provide evidence of search engine loyalty. A study launched by the China Internet Network Information Center in August 2005[93] showed that:

Figure 41: Use of search engines in 2004 and 2005

93 China Internet Network Information Center. (2005). China Online Search Market Survey Report 2005. [online]. Available from: http://www.cnnic.cn/download/2005/2005083101.pdf [Accessed 18 June 2009]

Rather in China or in the United States94

Internet users are using one or

two search engines when looking for information.

We also have to take into account that both in China and in the USA the

search engine leader has around 60% of the market.

In countries where Google has more than 80% of the market, the use of a second search engine is likely to be marginal.

In any case, very few users search on more than two search engines.

3.3.4 Search engines dependency issues

At first sight, when using a search engine, we do not think about all the issues it raises. We run a search, get results, and try them one after the other until we find the one that best fits our expectations.

The first main problem is that, when we are addicted to a specific search engine that has usually given satisfaction, on the day the result is not the expected one we may think that:

- the information is not displayed because it does not exist;
- the request was not good enough and we should try other keywords. This assumption is confirmed by Canadian and American surveys: search users stick to their search engines;

Figure 42: Search engine dependency relevancy

94

iProspect. (2004). Search Engine User Attitude April May 2004. [online]. Available from:

http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf [Accessed 18 June 2009]


In fact, according to a British study [95], search users blame themselves more than the search engine:

Figure 43: Search users blame themselves not the technology

The main issue to highlight is that people are so confident in certain search engines that they will normally neither look for alternatives nor even consider that their favorite search engine can be wrong.

3.3.5 Privacy issues

Privacy issues stem from the way search engines collect information.

When analyzing search engines, we have to bear in mind that they are a free product for all of us (search engines are in fact paid by displaying advertisements on each web page).

Each time a search is made on the Internet, the search engine you are using records the IP address of your computer and the query you made.

Those data are supposed to be confidential, but some are used for statistics as well as for providing more targeted advertisements.

The more information you give, the more they collect. If you open an email account with a search engine, your name, address and some other information will be collected.

95

Harvest Digital. (2006). User attitudes to search. [online]. Available from:

http://www.harvestdigital.com/uploads/assets/pdfs/2cec1cc789493f04e8af724694f23e8c.pdf

[Accessed 18 June 2009]


So far there have been few cases with proof that information collected by search engines was given to third parties.

The most famous litigation case is that of Yahoo in China, which filtered some emails and gave the names of some Chinese journalists who were denouncing facts about the Chinese government [96].

So far no mass exploitation of data has been observed, and recent announcements by major search engines (Microsoft and Google) indicate that the trend is to eliminate those data as much as possible for fear of losing confidentiality [97].

In theory, the more information you give to a search engine, the better it can meet your expectations, so reducing the collection of data is in a certain way neither in people's interest nor in the search engines' interest.

3.3.6 Search engine awareness

Search engine awareness is one of the key issues among the risks of search engine dependency. It is composed of:

- poor awareness of how to use a specific search engine;
- poor awareness of the existence of other search engines.

Both parts are fundamental. The first one deals with what we call search tools: a combination of keys (operators) used to fit a specific request.

3.3.6.1 Search engine use awareness

Search engines offer different syntaxes to make requests more relevant:

96 cf. Kahn, J. (2005). Yahoo helped Chinese to prosecute journalist. [online]. New York: New York Times. Available from: http://www.nytimes.com/2005/09/07/business/worldbusiness/07iht-yahoo.html [Accessed 18 June 2009]

97 cf. Boucq, I. (2009). Yahoo et vos données persos... [online]. Available from: http://www.erenumerique.fr/yahoo_et_vos_donnees_persos_-news-15162.html [Accessed 18 June 2009]


Figure 44: Search engine syntax examples

The best known are the “Boolean operators”. For example, the request Search+engine returns only websites in which both keywords are present.
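The AND behaviour described above can be illustrated with a minimal Python sketch. It does not reflect how any real search engine is implemented: the `matches_all` helper, the handling of the `+` separator and the three sample pages are assumptions made purely for illustration.

```python
def matches_all(page_text: str, query: str) -> bool:
    """Return True when every '+'-separated keyword occurs in the page text."""
    keywords = [k.strip().lower() for k in query.split("+") if k.strip()]
    words = set(page_text.lower().split())
    return all(keyword in words for keyword in keywords)


# Hypothetical pages standing in for indexed websites.
pages = {
    "a.example": "a search engine indexes the web",
    "b.example": "an engine for a car",
    "c.example": "how to search a library catalogue",
}

# Only pages containing both "search" and "engine" are kept.
hits = [url for url, text in pages.items() if matches_all(text, "Search+engine")]
print(hits)
```

Only the first page is kept, because the two others contain just one of the two keywords.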

Each search engine offers dozens of such tools, and syntaxes sometimes differ from one search engine to another. Some search engines also provide syntaxes that are not present in others.

In that sense, some search engines are complementary.

The idea behind Boolean operators is “I seek a good site on this topic, but I don’t have a specific site in mind. More than three quarters of the surveyed users desire to access the best site regarding this topic.” [98]

According to a Canadian study [99], 54% of Canadian users use Boolean operators:

98 Broder, A.Z. (2002). A taxonomy of web search. SIGIR Forum 36(2) pp. 3-10

99 Skooiz. (2008). Comment les Québecois utilisent ils et cherchent ils sur Internet ?. [online]. Available from: http://documents.skooiz.com/comment-les-quebecois-cherchent-ils-sur-le-web-2008.pdf [Accessed 18 June 2009]


Figure 45: Use of advanced search functionalities in Canada

According to a Chinese study [100], 66.7% of Chinese Internet users understand Boolean operators.

On the other hand, according to a Canadian [101] and a Belgian [102] study, it seems that even the most basic Boolean operators are not used properly:

Figure 46: Do users know how to use Boolean operators?

Another study [103], entitled “How are we searching the World Wide Web? A comparison of nine search engine transaction logs” and made in the United States and Europe, shows that the use of Boolean operators remained stable from 1997 to 2002.

100 Insight Xplorer. (2006). 創市際市場研究顧問. [online]. Available from: http://www.insightxplorer.com/specialtopic/co_info_acquisition.html [Accessed 17 June 2009]

101 Crepuq. (2003). Information Literacy: Study of Incoming First-Year Undergraduates in Quebec. [online]. Available from: http://www.crepuq.qc.ca/documents/bibl/formation/studies_Ang.pdf [Accessed 18 June 2009]

102 EduDoc. (2008). Enquête sur les compétences documentaires et informationnelles des étudiants qui accèdent à l’enseignement supérieur en Communauté française de Belgique. [online]. Available from: http://www.edudoc.be/synthese.pdf [Accessed 18 June 2009]

103 Jansen, J.B./Spink, A. (2004). How are we searching the World Wide Web? A comparison of nine search engine transaction logs. [online]. Available from: http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/jansen_searching_the_web.pdf [Accessed 18 June 2009]


It also reports that the use of Boolean operators differs between Europe and the United States: it ranges from 11 to 20% in the USA, against 2 to 10% in Europe.

The authors also provide evidence of search engine dependency in the use of query operators with a particular search-engine system.

Another study [104], made in 2004, confirms those low figures, stating that only 2% of hundreds of millions of queries contained Boolean operators.

Users thus know that Boolean operators exist; however, they do not use them, and when they do, they do not use them properly.

3.3.6.2 Search engines existence awareness

Another issue is awareness of the existence of other search engines.

Most Internet users are unaware that other search engines exist, as shown in the following study [105]:

104

Beitzel, S.M./Jensen, E. C./Chowdhury, A./Grossman, D./Frieder, O. (2004). Hourly analysis of a

very large topically categorized Web query log. [online]. Available from:

http://portal.acm.org/citation.cfm?id=1008992.1009048 [Accessed 18 June 2009]

105 URFIST de Rennes. (2008). Enquête sur les besoins de formation des doctorants à la maîtrise de

l’information scientifique dans les Ecoles doctorales de Bretagne. [online]. Available from:

http://www.uhb.fr/urfist/enquete_besoins_formation_doctorants-maitrise_information [Accessed

18 June 2009]

Figure 47: Use of meta search engines


Figure 48: Use of specialized search engines

According to a recent study [106] made in the United States on three search engines, vertical (specialized) search is not used by 60% of users (including 25% who may have used it without knowing). Image search is the most used with 26%, followed by news search with 17% and video search with 10%.

It seems that, even on the biggest search engines, Internet users do not use complex search tools.

106

iProspect. (2008). Blended Search Results Study – April 2008. [online]. Available from:

http://www.iprospect.com/premiumPDFs/researchstudy_apr2008_blendedsearchresults.pdf [Accessed

18 June 2009]


3.4 Search engine dependency conclusion

Intensive use of commercial search engines is the starting point of a vicious cycle in the way we search for information.

Commercial search engines, as their name says, include commercial websites that are no more than pure advertisement (no knowledge content) and therefore represent a risk for their users. This moreover increases the amount of content that can potentially be displayed for any kind of request.

No matter how relevant the request is, the results provided by commercial search engines give Internet users a feeling of satisfaction. This makes the user think that he knows how to search on the Internet and does not lead him to reconsider his search process.

When users switch to enterprise search engines for which training and implementation have not been done properly, they may think that the search tool is not working.

Studies reinforce search engine users in this feeling [107]: « While people trained in library sciences may bemoan the fact that most users are not Boolean search experts (or « sophisticated » with search in general), the reality is that business people should not have to be search experts in order to find the information they need to do their jobs ».

So Internet users do not know how search engines work but think they do; they have no desire to learn and, as a result, expect more from enterprise search technology.

In this configuration, search engine dependency does exist and has strong consequences for businesses. It seems that the market is designed for simple and easy-to-use search applications.

107

Cf. The Association for Enterprise and Content Management. (2008). Findability: The Art and

Science of Making Content Easy to Find. [online]. Available from:

http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx [Accessed 17 June 2009] p36


3.5 Chapter 3: key points

- The search engine is the first tool used when looking for information on the Internet;
- There are two categories of search engines: commercial search engines (advertisement-oriented, free) and enterprise search engines (ESE) (paid, customizable service);
- Google is the leader in the commercial search engine market with more than 60%, followed at a distance by Yahoo. Other strategic leaders are present in China, Russia and South Korea;
- Internet searchers are confident and satisfied, and mostly trust search engines. They say they know search engines but they do not use them properly;
- They nevertheless place more trust in their favorite websites and in well-established media;
- The ESE market is confused and crowded, and not as transparent as the commercial one;
- ESE users are disappointed by their search experience. The main reasons are the lack of training in those tools and the expectation of results as relevant as those of commercial search engines;
- Commercial search engine leaders are no longer simple search engines. Each is a complex set of attractive services that makes Internet users dependent on it;
- Commercial search engines are well established and the market is quite rigid;
- Internet users can be addicted to search engines for many reasons (convenience, lock-in, loyalty);
- Use of search engines differs from one user to another, which emphasizes the importance of developing a culture of information research;
- Commercial search engines therefore have a strong impact on businesses.


Chapter 4: Risks of search engines dependency and its influence on data quality


The following part evaluates the risks of search engine dependency. How much does search engine dependency affect our day-to-day search experience, and how can the situation be overcome?

4.1 Search engine dependency and its influence on data quality: Issues

4.1.1 Search Engine Optimization

According to David Meerman Scott [108], Search Engine Optimization (SEO) is the art and science of ensuring that the words and phrases on your site, blog, and other online content are found by the search engines and that, once found, your site is given the highest ranking possible in the natural (non-commercial) search results.

Internet marketing is the third largest advertising medium in the United States [109].

Figure 49: U.S. Advertising Market - Media Comparison – 2008 ($ Billions)

It has been growing constantly since 2002, and within Internet advertising:

108 Scott, D. M. (2007). The new rules of marketing and PR. p.242

109 Interactive Advertising Bureau. (2009). IAB Internet Advertising Revenue Report. [online]. Available from: http://www.iab.net/insights_research/530422/adrevenuereport [Accessed 18 June 2009]


Figure 50: Internet Ad Revenues by Advertising Format - 2008 Annual Results

Search Engine Optimization is the biggest activity of Internet marketing, with 45%.

SEO is thus a growing industry and one of the biggest advertising channels.

Online marketing companies offer, as a service, to position company websites on the first page of search engine results.

And if we look at the statistics:

Figure 51: Search engine user behavior regarding results pages in the USA

We can see that very few users go beyond the second page, and year after year the vast majority seems to consider only the first page.

Statistics from other countries confirm that almost no Internet user considers results beyond the second page.


This shows how valuable a position on the first page of search engine results is for marketing agencies.

Eye-tracking studies show that the eyes of Internet users unconsciously give more importance to some results.

Eye tracking is a technology which, through a camera sensor, detects where a person is looking on a screen [110] and makes it possible to visualize this.

On search engines such as Google, Yahoo, Baidu, Naver and MSN, Internet users give more importance to the first results, where they read almost all of the information displayed [111]. We can observe some differences among Internet users (Chinese and Korean users seem to look at almost all results) and among search engines (Google users seem to stick to the first three results only).

Figure 52: An eye tracking study on several search engines112

However, in all cases Internet users focus more on the first five results than on the remaining five. The colored areas are those on which the eye focused; red marks the maximum intensity.

110 Black box network services. (n.d.). Digital Signage: Glossary — Black Box Network Services. [online]. Available from: http://www.blackbox.com/resources/tools/microsites/digital-signage/what/glossary.aspx [Accessed 20 June 2009]

111 cf. Enquiro. (2008). Eye Tracking Studies. [online]. Available from: http://www.enquiroresearch.com/download-research-whitepapers.aspx [Accessed 18 June 2009]

112 Enquiro. (2006). Eye Tracking Studies: Eye Tracking Whitepapers from Enquiro Research. [online]. http://www.enquiroresearch.com/eyetracking-report.aspx [Accessed 20 June 2009]
Pandey, S. (2008). Top most search properties in Asia Pacific. [online]. Available from: http://shalabh.wordpress.com/2008/10/02/topmost-search-properties-in-asia-pacific/ [Accessed 20 June 2009]
Hotchkiss, G. (2007). Chinese eye tracking study: Baidu versus Google. [online]. Available from: http://www.cnblogs.com/dixin/articles/955369.html [Accessed 20 June 2009]


The more popular a search engine is, the more effort marketing agencies make to be present in those positions. In the case of Google, where the first three results are the most viewed, competition is therefore high.

This is an issue because, according to an American study [113], 36% of Internet users agree that the companies listed in the first results are the best ones in their field. This is not actually the case: they are simply better at advertising themselves.

Such an attitude clearly leads Internet users to browse only a tiny part of the World Wide Web. Moreover, this tiny part of the Web is a marketing battlefield.

The risks are then to take commercial information for granted and/or to use the same sources as everyone else.

Moreover, this affects all sites: for example, companies may have an interest in promoting themselves indirectly on well-ranked websites, for instance by writing an article on Wikipedia [114].

Another example, given in the next part, concerns the risk of sticking to the first results page.

4.1.2 Commercial advertisement and perception

Figure 53: Differences between organic and sponsored results

113 iProspect. (2006). Search Engine User Behavior Study. [online]. Available from: http://www.iprospect.com/premiumPDFs/WhitePaper_2006_SearchEngineUserBehavior.pdf [Accessed 23 June 2009]

114 Cf. Zittrain, J.L. (2008). The future of the Internet and how to stop it. p.140


In general, users consider natural results more relevant than sponsored ones.

According to a study made on an American search engine user panel in 2004 [115]:

Figure 54: Type of Search Result Selected

Most search users click on natural search results. It appears, however, that in some cases search engine users find paid results more relevant than natural ones:

Figure 55: Results relevancy according to users by search engine in 2004

According to a study made on American search engine users in 2005 [116]:

- 68% of users say that search engines are a fair and unbiased source of

information;

- 38% of searchers are aware of a distinction between paid and unpaid results,

62% are not;

- 18% of searchers overall (47% of searchers who are aware of the distinction)

say they can always tell which results are paid or sponsored and which are not.
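The last two figures are consistent with each other, as a quick check suggests. The variable names below are illustrative only; the percentages are the ones quoted in the list above.

```python
aware_share = 0.38           # searchers aware of the paid/unpaid distinction
can_tell_among_aware = 0.47  # of those, share who say they can always tell

# Share of all searchers who say they can always tell paid from unpaid results.
can_tell_overall = aware_share * can_tell_among_aware
print(f"{can_tell_overall:.1%}")
```

The product comes out close to the 18% reported for searchers overall.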

This study can be compared with another one made in India in 2007 [117], in which Indian Internet users stated that:

115 iProspect. (2004). Search Engine Users Attitudes 2004. [online]. Available from: http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf [Accessed 18 June 2009]

116 Fallows, D. (2005). Search Engine Users 2005. [online]. Available from: http://www.pewinternet.org/~/media//Files/Reports/2005/PIP_Searchengine_users.pdf.pdf [Accessed 18 June 2009]


Figure 56: Attitudes towards search engines in India

Information coming from links is more trustworthy for 60% of Indian users.

So even if it seems clear to the majority of Internet users that paid results are commercial, there is still some confusion about it.

In the case of some search engines such as Baidu, sponsored links and natural ones can be confused: “Baidu merges its organic results with results from its paid listings service” [118].

That kind of indexation can lead to serious data quality issues [119].

The most recent example is the “milk scandal” [120] (Baidu agreed, in exchange for money, to rank highly unlicensed companies that were supplying fake milk).

This emphasizes that businesses and individuals have to be aware of how search engines process information.

As the example above shows, such confusion between paid and natural results can be dangerous.

4.1.3 Censorship

117 Internet and Mobile Association of India. (2007). Search Engine Marketing 2007. [online]. Available from: http://www.slideshare.net/targetseo/search-engine-marketing-india-sem-india-imrb-presentation [Accessed 18 June 2009]

118 David Viney. (2008). Get to the Top on Google. p.210

119 cf. China Tech News.com. (2007). CCTV: Baidu Search Engine Fraud Exposed?. [online]. Available from: http://www.chinatechnews.com/2007/05/31/5459-cctv-baidu-search-engine-fraud-exposed/ [Accessed 18 June 2009]

120 cf. China Daily. (2008). Baidu cuts revenue forecast on ad scandal. [online]. Available from: http://www.chinadaily.cn/china/2008-12/14/content_7302341.htm [Accessed 18 June 2009]


As information providers, search engines have obligations towards the countries in which they operate.

China is often used as an example to introduce this issue [121].

Censorship can mean that some search results have been removed or that access to the search engine has been denied.

All search engines and all countries are concerned by those obligations. For example, Google also has to adapt to French, German, Turkish and Argentinean regulations [122].

The risk here, as mentioned before, is to believe that search engines are a trustworthy and unbiased source of information. Some companies such as Google seem clear and transparent about the policy they adopt for each country: “Figuring out how to deal with China has been a difficult exercise for Google. The requirements of doing business in China include self-censorship – something that runs counter to Google’s most basic values and commitments as a company.” [123]

But once more, Internet users have to be aware that censorship exists and to know which kind of content may have been removed.

4.1.4 Technological partnerships

A large share of the alternative search engines on the market in fact use existing technologies from bigger search engines. The best known are AOL and Netscape Search, which are both powered by Google. All The Web and AltaVista are powered by Yahoo.

Both cases can be recognized by the following logos:

121 United Nations. (n.d.). Human Rights Translated: A Business Reference Guide. p.54

122 Turow, J. (2008). Media Today. p.559
Valle, F. S./Soghoian, C. (2008). Adios Diego: Argentine judges cleanse the Internet. [online]. Accessible from: http://opennet.net/blog/2008/11/adiós-diego-argentine-judges-cleanse-internet [Accessed 18 June 2009]

123 Wickre, K. (2006). Testimony the Internet in China. [online]. Available from: http://googleblog.blogspot.com/2006/02/testimony-internet-in-china.html [Accessed 18 June 2009]


Figure 57: Powered by Google logo

Figure 58: Powered by Yahoo logo

Powered by Google means [124] using Google technology and choosing a specific set of websites in which to look for information.

The risk here is to search twice with the same technology without knowing it. It is sometimes not explicitly indicated.

Search engine users have to understand what “Powered by” means and which search engines provide their own technology, and therefore their own innovation.

4.1.5 The Visible Web

The visible Web is represented by what search engines can potentially index.

In a study made in 2005 [125], Gulli and Signorini gave an estimate of the share of the indexable web covered by general search engines such as Google, Yahoo and Microsoft:

            Indexable coverage    Index content of other search engine
Google      76.20%                68.20%
Yahoo       69.30%                59.10%
MSN         61.90%                49.20%

Figure 59: Estimation of the indexable web per search engine

Google seems to index three quarters of the indexable web but misses more than 30% of the web pages indexed by others such as Yahoo and Microsoft.

Gulli and Signorini also gave an interesting estimate of the index that those three search engines have in common: less than 30%.

Here we clearly have, once more, evidence of the value of using several search engines.
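The benefit of combining engines can be sketched with simple set operations. The page identifiers and the two index sets below are hypothetical and are not taken from the Gulli and Signorini data; the `missed_share` helper is an illustrative name.

```python
def missed_share(own_index: set, other_index: set) -> float:
    """Share of the other engine's pages that are absent from our own index."""
    return len(other_index - own_index) / len(other_index)


# Hypothetical indexes of two search engines.
engine_a = {"p1", "p2", "p3", "p4", "p5", "p6"}
engine_b = {"p4", "p5", "p6", "p7", "p8"}

common = engine_a & engine_b                  # pages both engines know
combined = engine_a | engine_b                # pages reachable using both
extra_pages = len(combined) - len(engine_a)   # gain from adding engine B

print(missed_share(engine_a, engine_b), len(common), extra_pages)
```

In this toy case engine A misses 40% of what engine B holds, so a user relying on A alone never sees two of the eight pages available.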

124 Alacra. (n.d.). What does "Powered By Google" mean?. [online] Available from: http://www.alacra.com/compliancesearch/faq.asp [Accessed 23 January 2009]

125 Gulli, A./Signorini, A. (2005). The Indexable Web is More than 11.5 billion pages. [online]. Available from: http://www.cs.uiowa.edu/~asignori/web-size/size-indexable-web.pdf [Accessed 18 June 2009]


Moreover, we have to consider that this study was made on American technologies and therefore does not take language aspects into account. One study on this topic, made in 2005 and entitled “Search Engine Coverage Bias: Evidence and Possible Causes” [126], set out to discover whether general search engines such as Google cover American website content in the same way as foreign websites.

The results of this study show the predominance of American websites:

Figure 60: Distribution of Public Web Sites By Country in 2002 [127]

In 2002 a large majority of websites were American. Because most search engines base their algorithms on the number of links that point to a page, American websites were far better covered than foreign ones. Moreover, over time, old American websites keep their leading position in the distribution of websites.
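The effect of link-based coverage can be sketched as follows. Counting inbound links is a deliberate simplification and not the algorithm of any real search engine; the site names and the link graph are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: each site maps to the sites it links to.
links = {
    "us-portal.example": ["us-news.example", "us-shop.example"],
    "us-news.example": ["us-portal.example", "us-shop.example"],
    "us-shop.example": ["us-portal.example"],
    "sg-firm.example": ["us-portal.example"],
}

# Count how many links point to each site.
inbound = Counter(target for targets in links.values() for target in targets)

# Rank sites by number of inbound links, most linked first.
ranking = sorted(links, key=lambda site: inbound[site], reverse=True)
print(ranking)
```

The site that nobody links to ends up last, whatever its language or the quality of its content, which is the mechanism suggested above for poorly covered foreign websites.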

The study goes further by giving figures regarding the percentage of web sites

covered by Google according to the different countries.

          USA    China    Singapore    Taiwan
Google    87%    70%      56%          75%

Figure 61: Percentage of Web Sites Covered by Google in 2002

Language does not seem to be the problem here, because most websites in Singapore are in English and yet are not properly covered by Google. Websites from Singapore may simply not have enough links pointing to their pages and, as a result, are not covered as well as American ones.

Here we see the importance, for some countries, of using different (local) search engines which know a specific market better.

126 Vaughan, L./Thelwall, M. (2003). Search Engine Coverage Bias: Evidence and Possible Causes. [online]. Available from: http://www.scit.wlv.ac.uk/~cm1993/papers/search_engine_bias_preprint.pdf [Accessed 18 June 2009]

127 Online Computer Library Center. (2002). Trends in the evolution of the Public Web 1998-2002. [online]. Available from: http://www.dlib.org/dlib/april03/lavoie/04lavoie.html [Accessed 18 June 2009]

Using purely national services seems appropriate here for two reasons:

- they have more experience in indexing the websites of their country;
- they give less importance to American websites.

The risk for search engine users here is to assume that a single search engine can browse the whole web by itself.

“Believing you can find everything and anything online is unrealistic” [128].

Search engines use different technologies, and each of them has gathered experience in particular fields that others have not. It is therefore important to take this into consideration when searching.

4.1.6 Invisible Web

The invisible Web, in opposition to the visible Web, is what commercial search engines cannot find.

Also called the Deep Web, the invisible Web consists of “Text pages, files, or other high-quality authoritative information available via the Web that general-purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages” [129].

The invisible Web thus contains all content that is in some way protected and is therefore, for the most part, very important material [130]. Here we are mostly referring to databases or special content protected by a login and password.

According to some sources, the invisible Web is 400 to 550 times bigger [131] than the Internet as we know it, and it is the fastest growing category of new information on the Internet.

128 Friedman, B. G. (2004). Web Search Savvy. P.20

129 Sherman, C./Price, G. (2001). The invisible web. p.57

130 Cf. Hock, R/Notess, G. R. (2007). The extreme searcher’s Internet handbook. P.21


The risk of not considering the invisible Web, and therefore independent databases, is of course to miss valuable information. It is foolish to think that one can find everything with the visible Web.

Company content and reports that are public are made so on purpose [132] (obsolete, incomplete information…).

They are certainly less valuable than those protected by a password, which require registration fees.

The invisible Web has to be seriously considered when looking for reliable and valuable information.

4.2 Search engine dependency and its influence on data quality: Solutions

The issues mainly come down to two points:

- limited awareness of search engines;
- limited awareness of technologies in general.

Solutions can therefore be found by addressing those two problems.

4.2.1 A deeper knowledge in search engine abilities

As mentioned previously, there is today too much information on the Internet and too much direct and indirect advertisement. It is now very easy to manipulate information. There is therefore a strong interest in having a list of quality websites on which we can rely.

It is critical to know what search engines can and cannot do, and how to make the best of their technology.

131 Pedley, P. (March 2002). Why you can’t afford to ignore the Invisible Web. Business Information Review.

132 Cf. Shapiro, C./Varian, H.R. (1999). Information Rules: A strategic guide to the Network Economy.


Knowing Boolean operators and using them properly is necessary. Too many functions are underused or unknown although they are important: “links:”, “intitle:”, “related:”.

The second point concerns vertical search. Most of the biggest search engines provide more than one search engine, each with a specific use. It is no coincidence that Google provides so many of these tools: it knows that the general Google search cannot offer the best search experience.

The same search on the general Google home page, on Google Scholar and on Google Books will return three totally different sets of results.

Here is an example of some vertical search engines provided by Google, grouped by level of accessibility from the google.com home page. For example, the first level (Google Images, Maps, etc.) is accessible with one click, level 2 with two clicks, and so on.

But once more, according to the principle of least effort, Internet users stick to the ground level.

One important point about vertical search is that a specific kind of file included in a website may be well indexed by a vertical search engine even though the website as a whole is not. Finding this file may then lead you to the website.

Figure 62: Google vertical search engines


For example, an image within a website can be well indexed by an image search engine while the website itself is not.

Knowing better how a specific search engine works does not fix the dependency state, but it reduces the data quality issue.

4.2.2 Taking the best part of each search engine

While it can be a strategic mistake to use two search engines based on the same technology (cf. chapter 4.1.4), it can be appropriate to use the technology of the biggest search engines to compensate for the weaknesses of the smallest ones.

Here is an example of how to make the best use of each search engine's technology to improve data quality.

Searching for the term “Yahoo” on the website of the Universidad de León (www.unileon.es) gives the following results (9/06/2009):

Search engine     | Internal search engine of the León university | Google                    | Yahoo
Number of results | 2                                             | 14 (+92)                  | 17 (+2)
Request entered   | Yahoo                                         | Yahoo site:www.unileon.es | Yahoo site:www.unileon.es

Figure 63: Search engine search within website content comparison

The 2 results found by the internal search engine of the University of León were also found by both Google and Yahoo. Google and Yahoo had 8 links in common.
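The overlap between the two engines can be quantified with simple set arithmetic. The sketch below uses only the counts reported above (14 results for Google, 17 for Yahoo, 8 shared) and ignores the additional results shown in parentheses in Figure 63, which is an assumption; the function names are hypothetical.

```python
def distinct_results(count_a, count_b, shared):
    """Number of distinct links returned by two engines combined."""
    return count_a + count_b - shared


def overlap_ratio(count_a, count_b, shared):
    """Jaccard index: share of the combined result set found by both engines."""
    return shared / distinct_results(count_a, count_b, shared)


google, yahoo, shared = 14, 17, 8
total = distinct_results(google, yahoo, shared)

print(total)            # distinct links across both engines
print(total - google)   # links returned by Yahoo only
print(total - yahoo)    # links returned by Google only
print(round(overlap_ratio(google, yahoo, shared), 2))
```

Under these assumptions the two engines together return 23 distinct links: a user relying on Google alone would miss 9 of them, and a user relying on Yahoo alone would miss 6.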

As explained in chapter 4.1.5, search engines using different technologies do not provide the same results because they search differently. There is therefore a strong interest in trying different search engines.

Trying different search engines is a necessary condition for overcoming search engine dependency, but it is not a sufficient one.


4.2.3 Technological evolution

As we saw in chapter 3.2.2 (Consumer Behavior regarding commercial search engines), Internet users confuse search engines with the Internet. According to a study conducted on young people133, “the search engine, be that Yahoo or Google, becomes the primary brand that they associate with the internet”.

Search engines are only one part of the Internet, and what is on the Internet may not be present in search engines. There is a time lag during which search engines fall behind the new uses made of the Internet.

Figure 64: Future of web 2.0

The Internet changes according to the new technologies which are developed on it. We are currently at the end of the Web 2.0 generation.

As shown in Figure 64, from a PC era where flows of information were few and almost unidirectional, we moved to Web 1.0, where the Internet was considered as an alternative source of information in addition to TV and radio.

In Web 2.0, large numbers of individuals took control of the Internet and, with it, of the use of its applications.

133 UCL – University College London. (2008). Information behavior of the researcher of the future. [online]. Available from: http://www.bl.uk/news/pdf/googlegen.pdf [Accessed 19 June 2009] p.12

Web 2.0 represents what the web is right now: its configuration and the use which is made of it. We saw previously that search engines are often equated with the Internet, and this confusion creates a huge gap in terms of the time needed to adapt.

It is critical for businesses to understand that we are no longer evolving in a Web 1.0 configuration and that the way of getting information has drastically changed.

Sources of information are no longer located in only one place.

Web 2.0 is a new perception and conception of web applications and communities, in which continuous updates and social networking are the main focus.134

Not all search engines are evolving in a Web 2.0 configuration, simply because some uses of Web 2.0 go against their policy:

- Search engine gurus want homemade technologies, which means no partnerships with the enterprises offering such technologies (unless they buy them).

- They also want fast and easy-to-use applications, which means text-based ones. This runs contrary to the Internet technology innovations of the Web 2.0 period, since it excludes heavy applications such as visual representations.

- Web 2.0 is mainly based on information coming from individuals, which means information flows of low quality. This is not in accordance with the search engine policy of providing quality information.

Because search engines are not the Internet, this creates a time lag that companies and individuals have to catch up on:

134 Bieberstein, N./Jones, K. (2008). Executing Service Oriented Architecture. p.169

Figure 65: Search engines are not the Internet


Figure 66: Time and knowledge lag

People already have little knowledge of search engines. By being search engine dependent and by confusing search engines with the Internet, they do not explore the capacities of other Internet technologies. There is then a double gap, and even a triple one if we take into consideration their overestimation of technologies.

Businesses should not follow the evolution of search engines but the evolution of the Internet.

The only way to fill this gap is to learn how to use search engines properly and to be aware of new technologies.

Some examples of Web 2.0 search tools are given in the following parts. In a Web 2.0 configuration, information is fresh and comes from knowledge sharing.

4.2.3.1 Social bookmarking

Social bookmarking is a good example of a Web 2.0 search application. Social bookmarking allows users to store links to the Web pages they find useful, otherwise known as bookmarks. Those bookmarks are then stored on a web page representing the user's personal library. When combined with other personal libraries, they allow many social networking possibilities.135

This system makes it possible to set up directories that include websites not indexed by major search engines. In this case, search engines are not set up by companies but by individuals.

135 Sweeney, S. (2008). 101 Ways to Promote Your Tourism Business Web Site. p.288

Delicious is the most popular of them:

Figure 67: Delicious bookmarks search

With the information collected there, other applications can be created, such as Similicious, which looks for websites similar to the one you indicate. Such applications are built on the labels (tags) that Internet users associate with a website:

Figure 68: Home page of the Similicious website

Social bookmarking is therefore a way to explore the part of the Visible Web that some search engines do not cover, as well as some parts of the Invisible Web.
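The idea of finding similar websites through user-assigned labels can be sketched as follows. This is not the actual algorithm of Similicious, which is not documented in this thesis; it is a minimal illustration using the Jaccard similarity between tag sets, and the site names and tags are invented placeholders.

```python
def jaccard(tags_a, tags_b):
    """Similarity of two tag sets: size of intersection over size of union."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0


def most_similar(target, bookmarks, top_n=3):
    """Rank the other bookmarked sites by tag similarity to the target site."""
    scores = [
        (site, jaccard(bookmarks[target], tags))
        for site, tags in bookmarks.items()
        if site != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented placeholder data: site -> set of labels assigned by users.
bookmarks = {
    "site-a.example": {"search", "engine", "web"},
    "site-b.example": {"search", "engine", "images"},
    "site-c.example": {"recipes", "cooking"},
}

print(most_similar("site-a.example", bookmarks))
```

Because the ranking relies only on labels chosen by users, a website that no major search engine has indexed can still be surfaced, provided someone has bookmarked and tagged it.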

4.2.3.2 Real time information: the Twitter example

Even if search engines index information faster and faster, some applications do it more quickly. Twitter is an example, through the concept of microblogging.


Microblogging is the practice of posting brief messages (no more than 140 characters in the case of Twitter) and regular updates about your thoughts and ideas, which can be viewed by a group of your choosing via text messaging, email, instant messages or the Web136.

Twitter's search looks for real-time information, that is, what is happening right now:

Figure 69: Twitter real time information search engine

Twitter has recently become very popular in connection with the political situation in Iran137, being one of the rare technologies still available there.

Real-time information does not fix the data quality issue: the information comes from individuals, so the risk of hoaxes is high. On the other hand, fresh information is critical for some businesses.

4.2.3.3 Visual representation of the results

Web 2.0 includes not only the social aspect but also the popularization of Internet technologies (Flash, RSS, Atom, Ajax). These allow, for example, search results to be displayed differently. Kartoo is an example:

136 Maximum PC. (2008). Microblogging. p.10

137 Yang, G. (2008). Despite many counter-measures and filters, digital democracy continues to trouble authoritarian regimes. [online]. Available from: http://yaleglobal.yale.edu/display.article?id=12493 [Accessed 24 June 2009]


Figure 70: Kartoo search results presentation

Companies have to understand here that not only the use of the Internet has changed but its technology as well.

Major search engines do not follow these changes, for the reasons explained in chapter 4.2.3, but this does not mean that companies should not adopt those solutions.

4.2.4 A better knowledge of the World Wide Web

We have seen so far how to make the best of search engines and of the technological evolution of the Web. However, we have not yet taken into consideration a simple fact mentioned in chapter 4.1.5 regarding search engines' coverage of the visible web.

The Internet is exactly like our physical world: it has its most visited and popular places, and this holds within each country.

The issue is that most Internet users stick to a couple of them and do not develop their true search potential.

This can be fixed by considering the Internet as a world map.


Each year the website http://informationarchitects.jp/start/ provides a map of the most visited and famous websites in the world, divided by category.

Such a representation of the World Wide Web allows Internet users to know which websites are the most popular within a specific category. We saw in chapter 3.2.2 that Internet users trust their favorite websites and well-established media websites more than search engines.

Knowing these most famous websites through such maps is one solution to improve data quality and to address the search engine dependency state.

Figure 71: 2008 Web trend map

Figure 72: 2007 Web trend map


4.3 The future of Internet search

So far we have only described the current characteristics of search engine users, and we saw that the situation is quite critical. Can this situation change in the future?

An interesting study focusing on the “Google generation” has recently been published on this subject138.

The Google generation is defined as Internet users born after 1993.

Young students are now more comfortable with computers than with pen and paper. However, as the study shows, this does not mean that the Google generation is expert in finding information on the Web. Some of the behaviors it reports are relevant to what has been written in this thesis:

- The information literacy of young people has not improved with the widening access to technology; in fact, their apparent facility with computers disguises some worrying problems139;

- Young people have unsophisticated mental maps of what the

internet is, often failing to appreciate that it is a collection of networked

resources from different providers;

- Many young people do not find library-sponsored resources

intuitive;

- They spend little time in evaluating the information;

- They make very little use of advanced search facilities, assuming

that search engines ‘understand’ their queries;

- They are more competent with technology but use very simple

applications and facilities;

- They have very high expectations regarding ICT;

- It seems that most teachers are information literate; however, their skills and attitudes towards information literacy are not transferred to pupils140;

- They simply do not recognize that they have a problem: there is a big gap between their actual performance in information literacy tests and their self-estimates of information skill and library anxiety.

138 UCL – University College London. (2008). Information behavior of the researcher of the future. [online]. Available from: http://www.bl.uk/news/pdf/googlegen.pdf [Accessed 19 June 2009]

139 UCL – University College London. (2008). Information behavior of the researcher of the future. [online]. Available from: http://www.bl.uk/news/pdf/googlegen.pdf [Accessed 19 June 2009] p.12

The consumer behavior described in chapter 3.2.2 and that of young students are not really different. The situation is even worse, because users born before 1993 did not grow up in a search engine dependency configuration. Former search engine users have not been well trained and are not properly training the young generation.

The study goes even deeper by highlighting another critical point: searchers have different information needs at different times of their lives:

Figure 73: Significant age-related differences in article discovery methods

Young people are far more digitally addicted than any other users. Attitudes towards search have totally changed.

The situation is already critical for “old users”, but for them it is at least compensated by the use of other sources of information, which is not the case for young users.

140 Merchant, L./Hepworth, M. Information literacy of teachers and pupils in secondary schools. Journal of Librarianship and Information Science 34(2) 2002, p.81.


4.5 Chapter 4: Key points

- The more popular a commercial search engine is, the more it is the target of advertising;

- Online advertising and indirect advertising are not going to stop;

- Only a tiny part of the web is considered by searchers, and it is unfortunately the most commercial one;

- Searchers find it difficult to tell commercial and non-commercial websites apart, which in some cases has serious consequences;

- The potential of the visible web can only be maximized by mixing different search technologies;

- Independent and paid databases representing the Invisible Web should be seriously taken into consideration to improve data quality;

- The use of search engines has to be understood;

- Internet users have to take advantage of what each technology offers;

- Like the physical world, the Internet is a map, and Internet users should understand it as such;

- Search engines are not the Internet;

- Search users have to understand that they are evolving in a Web 2.0 configuration where technologies are different, as well as their use;

- Web 2.0 search technologies have to be understood as search engines in their own right;

- Internet users should have a map of the Internet in their head;

- Searching for information is a critical skill which is not taught properly, and this will have huge consequences in the future.


Chapter 5: The Google example


With more than 60% of the worldwide market share in 2009, Google is the best example of the search engine dependency phenomenon.

5.1 Google presentation

Before evaluating the consequences of Google dependency, it is important to see how Google creates this dependency state.

5.1.1 Google

In January 1996, a 24-year-old PhD student called Larry Page, studying at Stanford University, was looking for a theme for his PhD thesis.

Encouraged by his supervisor, he chose to study the topic “exploring the mathematical properties of the World Wide Web”, working in collaboration with another student called Sergey Brin. To put it simply, it is from this work and collaboration that “Google Inc” emerged (officially created on September the 7th, 1998)141.

Two months later, Google was already included in PC Magazine's Top 100 websites142 (PC Magazine being a reference in the United States for computers).

Even if Google was originally a web-based application in English, it is a worldwide service available on the Internet for all. As its creator Larry Page said, "Google's search engine has always had strong global appeal"143.

Google is in fact what we call a “Killer App”: a software application that surpasses all of its competitors.

5.1.2 Google's success

Google’s success may be linked in part to the following strategy: “Google provides for free a useful service that people actively seek out”144.

141 Cf. Scott, V. (2008). Google. p.5

142 Cf. Hitt, M.A./Miller, C. C./Colella, A. (2006). Organizational behavior a strategic approach. p.470

143 Page, L. (2000). Google Press Center: Press Release. [online]. Available from: http://www.google.com/press/pressrel/pressrelease22.html [Accessed 18 June 2009]


Most users are looking for a service that is free, easy to use, efficient and better than the others.

As mentioned in chapter 3.2.7, Google provides a wide range of services, most of them better than those of its competitors. One example is Gmail (e-mail service) in comparison with the Microsoft Hotmail or Yahoo mail services.145

Moreover, Google provides useful additional services that its competitors do not have, such as an image hosting service with huge storage capacity (Picasa), an encyclopedia (Knol), a free online office suite (Google slides, Google spreadsheets, Google word processor), websites (Google sites)…

In short, there is no comparison in terms of volume between what Google can offer and what its competitors offer.

5.1.3 Google image

In 2009 Google was recognized, for the third year in a row, as the most valuable brand in the world according to BrandZ146, and in 2008 it ranked number 2 in terms of reputation, just behind Toyota147.

Google has a better image than its main competitors on the privacy issue and on the commercial aspect.

In 2007, a survey of 1,101 people in the United Kingdom showed that 38% of the respondents trusted Google to keep their information private, against 26% for Yahoo and 23% for Microsoft148.

According to the same agency, when users were asked “Is Google becoming too commercial?”, 23% of women answered no and 35% of men said yes149.

144 Eternal Dreamer. (2008). Why Google is so Successful?. [online]. Available from: http://crumja.wordpress.com/2008/05/20/why-google-is-so-successful/ [Accessed 18 June 2009]

145 Miller, M. (2006). Googlepedia. p.352
Consumer Search. (2009). Webmail review. [online]. Available from: http://www.consumersearch.com/webmail-reviews [Accessed 18 June 2009]

146 Millward Brown Optimor. (2009). Top 100 most global brands 2009. [online]. Available from: http://www.brandz.com/output/ [Accessed 24 June 2009]

147 Reputation Institute. (2008). The World Most Reputable Companies. [online]. Available from: http://www.reputationinstitute.com/ [Accessed 24 June 2009]

148 Bigmouthmedia. (2007). Survey results: Uncertainty over Google’s privacy intentions. [online]. Available from: http://www.bigmouthmedia.com/live/articles/survey-results-uncertainty-over-googles-data-pri.asp [Accessed 24 June 2009]

It seems that Google has a very good image among Internet users, and it is important to highlight that it even has a better reputation than its main competitors.

It may therefore keep its dominant position for a very long period.

5.1.4 Google dependency state

Some continents are clearly under Google's domination. This is the case for Europe and Latin America:

Figure 74: Google domination in Europe

Figure 75: Google domination in Latin America150

As mentioned in chapter 3.2.5, these two continents are particularly interesting because this is where Internet users perform, on average, the most searches per capita.

The more searches are performed, the more information Google can collect on its searchers. This allows it to better meet user expectations and to gather a great deal of information about experienced search users.

149 Bigmouthmedia. (2007). Gender split in attitude towards Google. [online]. Available from: http://www.bizreport.com/2007/05/gender_split_in_attitude_toward_google.html [Accessed 24 June 2009]

150 Comscore. (2008) [online]. Available from: http://www.comscore.com/ [Accessed 24 June 2009]


This is information that its competitors do not have, and with more than 60% of the worldwide market share Google has everything it needs to keep developing highly demanded services and to increase the dependency state.

5.1.5 Google added functionalities

As mentioned previously and in chapter 3.2.7, Google is no longer a simple search engine but provides a whole set of services which are, for the most part, attractive, easy to use, intuitive, useful and exclusive (Blogger, Adsense, Picasa, Google Documents…).

But registering for one of them makes you create a Google account, which sooner or later leads you to try another Google service and to enter a vicious circle of dependency which never ends.

For its more advanced users, Google can be used as a far more complex tool (cf. chapter 3.2.8), acting like an operating system within the operating system.

We can take as an example the iGoogle service, an online customizable desktop playing the role of a portal to thousands of customizable applications151.

Google dependency is mainly created by its associated services.

5.1.6 Google's success is its weakness

The fate of Google is also linked to that of Search Engine Optimization, which is the ability to get a website well indexed by search engines.

The main issue is that, Google being the best-known search engine, a lot of people from the marketing field have tried, and are still trying, to understand how Google ranks pages in order to obtain valuable advertising positions.

151 Conner, N. (2008). Google Apps: The Missing Manual. p.411


The more experiments are made on it, the better the secret algorithm of Google is known.

As we can imagine, few marketing agencies are interested in having a website appear on the last pages of Google.

As we saw in chapter 4.1.1, online advertising is the third most popular advertising activity, and within it search engine optimization is the most popular. An incredible number of agencies have been built on Search Engine Optimization in recent years.

This is why Google results, as most Internet users use them, are not the most rational ones they could get.

We can take the following scheme as an example:

As we saw previously, Google does not index all of the visible web. Moreover, using foreign languages as well as Boolean operators is necessary to improve search skills.

The trend is to use a couple of keywords and to stick to the first results and pages of Google, which in the end makes users browse only a tiny part of the World Wide Web.

Mainly because of marketing agencies, a simple keyword search on Google makes us search the most commercial indexable part of the Visible Web.

Figure 76: Google coverage representation of the visible web


5.2 Google's disappearance consequences

Many Internet users, including individuals and businesses, base all their experience and activity on the Internet.

Can we imagine the consequences if Google disappeared, or simply failed to fulfil its obligations for a period of time?

It is hard to believe that a company such as Google could close its doors one day but, as we know, revolutions happen in the world of technology and zero risk does not exist.

Indeed, in 2009, for the first time in ten years, Google faced three critical failures within a short period of time.

Analyzing the consequences of those three failures will allow us to determine the consequences of the search engine dependency phenomenon.

5.2.1 Google Search engine failure

On Saturday, January the 31st, 2009, a system error at Google caused every link on Google's results pages to be displayed with the following warning: 'This site may harm your computer'152.

Figure 77: Google search failure

152 Cf. AT Internet Institute. (2009). Google breaks down on the 31st of January 2009. [online]. Available from: http://www.atinternet-institute.com/en-us/focus-on-current-events/google-breaks-down-on-the-31st-of-january-2009/index-1-2-1-158.html [Accessed 18 June 2009]


This warning remained displayed for 50 minutes and was visible worldwide. According to Google, the origin of the failure was nothing more than a human error: a single, mere typo153.

There was actually no danger at all for Internet users in clicking on the links, but the incident shows how a single individual can have an impact on millions of people.

If we look at the consequences of this Google bug on Internet traffic, we can see that on average 70% of Google users stopped using it. This clearly shows that users are quite uneducated about search engines. Search engines just display links to websites, and it is highly probable that you know and trust some of them:

It shows as well that people blindly follow Google's instructions: if Google says that a site may endanger the computer, they take it for granted. As a consequence, Google lost up to 90% of its traffic 20 minutes after the failure.

153 Mayer, M. (2009). "This site may harm your computer" on every search result?!?!. [online]. Available from: http://googleblog.blogspot.com/2009/01/this-site-may-harm-your-computer-on.html [Accessed 18 June 2009]

Figure 78: Google bug analysis on January the 31st 2009


It is also very interesting to observe that more than 20% of Internet users left the Internet altogether, with a peak of almost 30% at the beginning. This once more clearly emphasizes the Google dependency phenomenon, in which some people consider Google to be the Internet.

A rational behavior would have been to go to another search engine to see whether the same warning was displayed.

According to this graph made by the AT Internet Institute:

Figure 79: Google evolution traffic during the bug on January the 31st 2009

This bug did not even benefit other search engines, because only 13.9% of users switched from Google to another search engine and 16.2% accessed websites directly. This once more clearly highlights the poor search engine awareness of Internet users.

The remaining 70% simply abandoned their search on Google.
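The breakdown above can be checked with simple arithmetic. The sketch below takes the two shares quoted from the AT Internet Institute (13.9% and 16.2%) and derives the remainder; the variable names are illustrative.

```python
# Shares of Google users during the bug, as quoted above (in percent).
switched_engine = 13.9   # moved to another search engine
direct_access = 16.2     # accessed websites directly

# Users who did neither abandoned their search.
continued = switched_engine + direct_access
abandoned = 100.0 - continued

print(f"continued by other means: {continued:.1f}%")
print(f"abandoned the search:     {abandoned:.1f}%")
```

The remainder comes to 69.9%, which is consistent with the rounded figure of 70% given in the text.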

Almost one fourth of Internet users consider Google to be the Internet and do not know other search engines.

5.2.2 Google Gmail failure


As previously explained, Google dependency is linked not only to its search function but also to the convenience of all the associated services, such as email, blogs, finance, maps, etc.

On February the 24th, 2009, from 10:30 to 14:30, the Google mail service was not working. Google mail users could neither access nor send emails.154

This issue is far more worrying when we know that Google mail is used by individuals but also by more than one million companies.

Figure 80: Google Gmail failure

The Gmail failure was a problem of access to data. This is a critical failure when we consider that it was a worldwide bug and that using email is the first activity on the Internet, ahead of searching155:

154 Beaumont, C. (2009). Google’s Gmail service crashes across world. [online]. Available from: http://www.telegraph.co.uk/scienceandtechnology/technology/google/4797727/Googles-Gmail-service-crashes-across-world.html [Accessed 18 June 2009]

155 Malaysian Communications and Multimedia Commission. (2005). Household use of the Internet survey 2005. [online]. Available from: http://www.skmm.gov.my/facts_figures/stats/pdf/Household_use_internet_survey2005.pdf [Accessed 17 June 2009]
Internet and Mobile Association of India. (2007). Search Engine Marketing 2007. [online]. Available from: http://www.slideshare.net/targetseo/search-engine-marketing-india-sem-india-imrb-presentation [Accessed 18 June 2009]
Fallows, D. (2008). Search engine use. [online]. Available from: http://www.pewinternet.org/~/media//Files/Reports/2008/PIP_Search_Aug08.pdf.pdf [Accessed 25 June 2009]


Figure 81: Main use of Internet

The use of Gmail within companies is a critical matter and can be extremely dangerous. Gmail services are very convenient, but this convenience comes with high risks.

5.2.3 Google other services failure

Google Documents is a storage place where Internet users can store several kinds of documents in order to access them from anywhere.

On March the 9th, 2009, it was the Google Documents application which ran into trouble156.

A bug allowed some Internet users to access other users' documents without their agreement. As explained by Google: "The inadvertent sharing was limited to people with whom the document owner, or a collaborator with sharing rights, had previously shared a document. The issue affected so few users because it only could have occurred for a very small percentage of documents, and for those documents only when a specific sequence of user actions took place."157 In the end it affected 0.05% of the documents hosted.

Even if the number of documents concerned by this failure was not very high, it clearly shows that such issues can happen.

It is therefore important to inform individuals and companies of such a risk. Companies should not store confidential documents in those applications. Using the applications of commercial search engines for business seems risky.

156 Claburn, T. (2009). Google Informs Docs Users Of Security Lapse. [online]. Available from: http://www.informationweek.com/news/services/storage/showArticle.jhtml?articleID=215801317 [Accessed 18 June 2009]

157 Mazzon, J. (2009). On yesterday’s email. [online]. Available from: http://googledocs.blogspot.com/2009/03/on-yesterdays-email.html [Accessed 18 June 2009]


5.2.4 Google collateral damages

On May the 14th, 2009, several million people were cut off from the Google search engine, email and other Google services.

The reason given was a huge amount of traffic coming from Asia.

As a consequence, 14% of its users were affected for one hour.158

Moreover, according to Gomez, an American company specialized in website performance monitoring services, all websites using minor Google services such as Google Analytics (an audience measurement application for websites) took twice as long to load.159

It is interesting to see here that this Google failure came from its own success and not from a human error.

It also shows that if your company website uses some Google services, even indirectly, it may suffer the consequences.

158 Liedtke, M. (2009). Google glitch disrupts search engine, e-mail. [online]. The Associated Press. Available from: http://www.google.com/hostednews/ap/article/ALeqM5jJA_JCGApgxvxii3ryhhilxLuscgD9867IU01 [Accessed 18 June 2009]

Hoelzle, U. (2009). This is your pilot speaking. Now, about that holding pattern... [online]. Available from: http://googleblog.blogspot.com/2009/05/this-is-your-pilot-speaking-now-about.html [Accessed 18 June 2009]

159 Hof, R. (2009). Google's Outage Affected More than Google Users; Other Sites Hit Too. [online]. Available from: http://www.businessweek.com/the_thread/techbeat/archives/2009/05/googles_outage.html [Accessed 18 June 2009]

5.3 Chapter 5: Key points

- Google is a "Killer App" composed of several other "Killer Apps" such as Gmail, Google documents…

- Google provides for free a useful service that people actively seek out;

- Google has a strong and positive image among its customers;

- The number of useful free services associated with one single Google account makes people dependent on Google;

- Google has all the elements in hand to keep its dominant position on the market and to provide highly demanded services, which increases the state of dependency;

- As Google is the most popular search engine, it is the target of many marketing companies and individuals. Inexperienced Google searchers will therefore always explore only a tiny part of the Visible Web, the most commercial one;

- 25% of Google users consider Google to be the Internet. In the case of a Google search failure, 70% of Google users are dependent on it and do not know what their alternatives are if Google does not work properly;

- Having a Google mail account as the main mail account is very risky for businesses;

- Companies and individuals should know that there are risks in storing high-value documents on Google documents;

- If company websites use some Google applications, even minor ones, they can be affected by Google failures.

Conclusion and recommendations

In our information age, it is clear to everyone that the Internet has taken the place of radio and TV as our main information provider. The original purpose of the Internet, which was to provide a wide range of information, has quickly been taken over, as happened with TV and radio, by advertising and marketing agencies.

As currently used, the Internet is a "media fast food"160 source of information for the masses, where commercial and celebrity ("people") information are the main concerns.

The Internet is a data platform mixing information written by businesses and individuals alike. Web 2.0 has extended this mixing, allowing users (and not only professionals) to upload content to the Internet through blogs, wikis, social networks, etc. This crossing and sharing of information benefits our society, but it also increases confusion and data quality issues, especially regarding the authoritativeness of information sources.

The introduction of simple and easy-to-use web applications has not helped to clarify the matter: it has conditioned users to apply the "least effort" principle to information search, where they go for convenience and speed of retrieval, thus sacrificing data quality and opting for quantity rather than quality.

Search engine dependency is relevant and critical, firstly because commercial search engines are our main information providers on the Internet, and secondly because a large part of Internet users are uneducated about how Internet information sources work. These end users' behaviors unfortunately influence the workplace as well.

Like any other field, information search and retrieval is a skill, an art based on education, practice, experience, knowledge and proficiency, which has to be taught within the educational system as well as within companies.

Due to information overload, believing that a user is able to find anything and everything relevant online is mostly unrealistic. Companies should therefore clearly take these aspects and problems into consideration when relying on the web as one of their information sources.

160 Romaní, C.C./ Kuklinski, H.P. (2007). Planeta Web 2.0. Inteligencia colectiva o medios fast food

As mentioned in this report, there are critical issues regarding data and information management within companies, and these cannot be fixed without proper education, which could start by teaching a series of fundamental topics regarding the World Wide Web:

- the basic mechanisms of the World Wide Web

- how search engines are built

- what the different information sources available today are (web, newsgroups, wikis, blogs, newsletters, feeds/RSS, etc.) and what their confidence levels are

- the principles by which search engines index and retrieve information, and the role of advertising in search engine results

Today there is clearly an overvaluation of the information extracted from the World Wide Web, and more generally of the reliability of technologies as information sources: in my opinion, users expect too much from these information sources, or rather, they rely too much on these technological outputs for their work and personal use. Technology is in constant movement and has to be followed.

We can be sure that, if there were a perfect solution for Internet search, everybody would adopt it. The solutions cited so far suggest that the best way to obtain data quality is to use the best part of each search tool, coordinating the different results and always taking into account the changing environment of the Internet. Adapting those solutions to business would be very costly:

- firstly, because it would imply implementing a huge set of search tools;

- secondly, because the more search tools you implement, the more training you have to provide.

As we saw in chapter 2.4.1, Internet users stick to the principle of least effort and focus on accessibility rather than quality.

Integrating a large set of complex tools has already been considered in the past: the study titled "Information behavior of the researcher of the future", cited above, explained well how librarians did not properly adapt their services to demand, thus driving young students away.

We have found so far that the market demands easy-to-use, accessible, simple and intuitive search solutions. We also know that commercial search engine users blame themselves and not the technology when they are unable to find what they are looking for. We saw that enterprise search engine users blame the technology and not themselves when they cannot find the result they want. And, finally, we saw that commercial search engine usage influences enterprise search engine usage.

We all know how reluctant workers are to change, and that implementing a new information search system is critical. As we saw in chapter 2, some solutions are implemented within companies, but their goals are rather blurred and in most cases workers do not find the information they want.

In conclusion, the recommendation is the following:

There is no need to invest in a costly and complex information search system, simply because such systems are very costly in terms of management and training. If you set up an information search system including several search tools, employees will not use them, or will not use them properly.

Technology has to adapt to searchers, and not the other way around. Companies should therefore invest in one, and only one, search technology, and it should be one that everybody in the company is familiar with. In Google-covered countries, this would mean implementing the Google Search Appliance within companies.

This implementation should increase employee satisfaction, which will help in installing a more powerful search system afterward.

Employees may not find all the material they want with only one solution, but this does not matter, since they did not find the content with several solutions either.

It should be hard for them to blame the technology with which they are most familiar, and as a result they may reconsider their way of searching the Internet. Once this step has been taken, it will be easier to train people in search engine usage and afterward to implement deeper search tools.

The justification for this quite radical choice is customer satisfaction. If Internet users do not find the information through their favorite search tool, this means that there is something wrong with the use they are making of it. From this point, users will be better prepared to receive training on information search tools and techniques.

This solution is certainly not the most rational one, because it does not fix the issue of search engine dependency. It even makes it worse. However, it could help to fix the issue of user satisfaction, which drives efficiency. As mentioned in chapter 3.4, employees do not need to be information search experts; they simply have to concentrate on their core business activities.

Search engine user behavior within households (summary):

- Search policy: no search strategy implemented.

- Search user characteristics: always find what they are looking for; confident, trusting and naive; satisfied with their search experience; do not distinguish between commercial and non-commercial links; 98% (95%) of their Internet use is dedicated to search; use the Internet for many purposes, such as finding answers regarding health.

- Solutions: for them there is no problem, so they see no reason to look for solutions.

- Risks: they can pick up the wrong information (e.g. health information).

- Number of tools used: they are not experienced searchers and use on average no more than 2 search tools.

Search engine users within the educational system:

- Search policy: no or few programs installed or planned for the future.

- Search user characteristics: even when trained, students do not apply the tools; the comfort of Google; the Wikipedia phenomenon has been highlighted and the educational system has been alerted to this issue.

- Solutions: more training and education, special courses on the topic.

- Risks: the information flow is always the same (Wikipedia phenomenon); if the training is not applied properly, improvements will be seen neither in households nor within businesses.

- Number of tools used: library databases are available, but very few other search engines are known.

Search engine users within businesses:

- Search policy: poor implementation; where implemented, goals are in half of the cases not clearly defined.

- Search user characteristics: are used to the search engine experience at home and do not understand why enterprise search engines do not perform as well; very unsatisfied with their search experience; more or less aware of the problem but minimize the consequences.

- Solutions: more training, better management of the implementation.

- Risks: bad image of enterprise search engines; users do not feel proficient and fix the problem by themselves, looking for information on the Internet when they do not find it on the intranet.

- Number of tools used: several.

Declaration

I certify that this work has been done by myself and only myself. All the sources used in its preparation have been properly indicated.

Ronan CHARDONNEAU

European Master in Business Studies

Institut de Management de l'Université de Savoie d'Annecy (FR)

Università degli studi di Trento (IT)

Universität Kassel (GER)

Universidad de León (SP)

26th June, 2009

List of literature

Alacra. (n.d.). What does "Powered By Google" mean?. [online] Available from:

http://www.alacra.com/compliancesearch/faq.asp [Accessed 23 January 2009]

Albarran, A.B./Chan-Olmsted,S.M./Wirth,M.O. (2006). Handbook of media management and

economics. p471

Alexa Web. (n.d). Alexa Top 500 Global Sites. [online]. Available from: http://www.alexa.com/topsites

[Accessed 17 June 2009]

Arrington, M. (2008). Cuil On BusinessWeek's Most Successful of 2008 List. Huh?. [online]. Available

from: http://www.techcrunch.com/2008/12/29/cuil-on-businessweeks-most-successful-of-2008-list/

[Accessed 17 June 2009]

AT Internet Institute. (2009). Google breaks down on the 31st of January 2009. [online]. Available

from: http://www.atinternet-institute.com/en-us/focus-on-current-events/google-breaks-down-on-the-

31st-of-january-2009/index-1-2-1-158.html [Accessed 18 June 2009]

Baase, S. (2007). A gift of Fire. p351

Baase, S. (2007). A gift of Fire. p352

Baidu Inc. (n.d.). Baidu products. [online]. Available from :

http://ir.baidu.com/phoenix.zhtml?c=188488&p=irol-products [Accessed 17 June 2009]

Baidu Japan Inc. (n.d.). Baidu(バイドゥ)会社情報 - 会社概要 . [online]. Available from :

http://www.baidu.jp/info/corp/data.html [Accessed 17 June 2009]

Baidu. (2006). Baidu advertisement. [online]. Available from:

http://www.youtube.com/watch?v=EPnmsFl__nU [Accessed 17 June 2009]

BBC. (2008). Tibetans describe continuing unrest. [online]. Available from :

http://news.bbc.co.uk/2/hi/asia-pacific/7300312.stm [Accessed 17 June 2009]

Beaumont, C. (2009). Google’s Gmail service crashes across world. [online]. Available from

http://www.telegraph.co.uk/scienceandtechnology/technology/google/4797727/Googles-Gmail-

service-crashes-across-world.html [Accessed 18 June 2009]

Beijing Review. (2008). Dialogue: Media Coverage on Tibet. [online]. Available from:

http://www.bjreview.com.cn/special/txt/2008-03/22/content_107054.htm [Accessed 17 June 2009]

Beitzel, S.M./Jensen, E. C./Chowdhury, A./Grossman, D./Frieder, O. (2004). Hourly analysis of a very

large topically categorized Web query log. [online]. Available from:

http://portal.acm.org/citation.cfm?id=1008992.1009048 [Accessed 18 June 2009]

Berliner Morgenpost. (2008). China rüstet sich für « die entscheidende Schlacht ». [online]. Available

from :

http://www.morgenpost.de/printarchiv/politik/article169230/China_ruestet_sich_fuer_die_entscheiden

de_Schlacht.html [Accessed 17 June 2009]

Bieberstein, N./Jones, K. (2008). Executing Service Oriented Architecture. p.169

Bigmouthmedia. (2007). Survey results: Uncertainty over Google’s privacy intentions. [online].

Available from: http://www.bigmouthmedia.com/live/articles/survey-results-uncertainty-over-googles-

data-pri.asp [Accessed 24 June 2009]

Black box network services. (n.d.). Digital Signage: Glossary — Black Box Network Services.

[online]. Available from: http://www.blackbox.com/resources/tools/microsites/digital-

signage/what/glossary.aspx [Accessed 20 June 2009]

Boucq, I. (2009). Yahoo et vos données persos... [online]. Available from :

http://www.erenumerique.fr/yahoo_et_vos_donnees_persos_-news-15162.html [Accessed 18 June

2009]

Broder, A.Z. (2002). A taxonomy of web search. SIGIR Forum 36(2) pp. 3-10

Case, D. O. (2007) Looking for information. p.151

Center for the Digital Future (2008). Annual Internet Survey by the Center for the Digital Future.

[online]. Available from

http://www.digitalcenter.org/pdf/2009_Digital_Future_Project_Release_Highlights.pdf [Accessed 19

June 2009] p.4

China Daily. (2008). Baidu cuts revenue forecast on ad scandal. [online]. Available from:

http://www.chinadaily.cn/china/2008-12/14/content_7302341.htm [Accessed 18 June 2009]

China Internet Network Information Center. (2005). China Online Search Market Survey Report 2005.

[online]. Available from: http://www.cnnic.cn/download/2005/2005083101.pdf [Accessed 18 June

2009]

China Tech News.com. (2007). CCTV: Baidu Search Engine Fraud Exposed?. [online]. Available

from: http://www.chinatechnews.com/2007/05/31/5459-cctv-baidu-search-engine-fraud-exposed/

[Accessed 18 June 2009]

Claburn, T. (2009). Google Informs Docs Users Of Security Lapse. [Online]. Available from:

http://www.informationweek.com/news/services/storage/showArticle.jhtml?articleID=215801317

[Accessed 18 June 2009]

Cogar, P. (ed.) (2007). TV vs. the Internet: Internet wins. [online]. Available from : http://www.bit-

tech.net/news/2007/08/23/tv_vs_the_internet_internet_wins/1 [Accessed 17 June 2009]

Cole, J. I./Suman, M./Schramm, P./Lunn, R/Aquino, J.S. (2003). Surveying the Digital Future. [online]

Available from: http://www.digitalcenter.org/pdf/InternetReportYearThree.pdf [Accessed 17 June

2009]

Conner, N. (2008). Google Apps: The Missing Manual. p.411

Consumer Search. (2009). Webmail review.[online]. Available from :

http://www.consumersearch.com/webmail-reviews [Accessed 18 June 2009]

Crédoc. (2008). La diffusion des technologies de l'information et de la communication dans la société

française. [online]. Available from: http://www.arcep.fr/uploads/tx_gspublication/etude-credoc-2008-

101208.pdf [Accessed 17 June 2009] p.120.

Crepuq. (2003). Etude sur les connaissances en recherche documentaire des étudiants entrant au 1er

cycle dans les universités québécoises. [online]. Available from :

http://www.crepuq.qc.ca/documents/bibl/formation/etude.pdf [Accessed 18 June 2009]

Crepuq. (2003). Information Literacy: Study of Incoming First-Year Undergraduates in Quebec. [online]. Available from: http://www.crepuq.qc.ca/documents/bibl/formation/studies_Ang.pdf [Accessed 18 June 2009]

David Viney. (2008). Get to the Top on Google. p.210

EduDoc. (2008). Enquête sur les compétences documentaires et informationnelles des étudiants qui

accèdent à l'enseignement supérieur en Communauté française de Belgique. [online]. Available

from : http://www.edudoc.be/synthese.pdf [Accessed 18 June 2009]

Einhorn, B. (2007). Baidu Thinks It Can Play in Japan. [online]. Available from: http://www.businessweek.com/globalbiz/content/feb2007/gb20070215_649662.htm?chan=globalbiz_asia_technology [Accessed 23 January 2009]

Enquiro. (2004). Search Engine Usage in North America. [online]. Available from:

http://www.enquiroresearch.com/download-research-whitepapers.aspx [Accessed 18 June 2009]

Enquiro. (2008). Eye Tracking Studies. [online]. Available from :

http://www.enquiroresearch.com/download-research-whitepapers.aspx [Accessed 18 June 2009]

Estabrook, L. /Witt, E./ Rainie, L. (2007). Information searches that solve problems. [online]. Available from: http://www.pewinternet.org/~/media//Files/Reports/2007/Pew_UI_LibrariesReport.pdf.pdf [Accessed 17 June 2009] p5

Estabrook, L. /Witt, E./ Rainie, L. (2007). Information searches that solve problems. [online].

Available from:

http://www.pewinternet.org/~/media//Files/Reports/2007/Pew_UI_LibrariesReport.pdf.pdf [Accessed

17 June 2009]

Eternal Dreamer. (2008). Why Google is so Successful?. [online] Available from :

http://crumja.wordpress.com/2008/05/20/why-google-is-so-successful/ [Accessed 18 June 2009]

Fallows, D. (2005). Search Engine Users 2005. [online]. Available from:

http://www.pewinternet.org/~/media//Files/Reports/2005/PIP_Searchengine_users.pdf.pdf [Accessed

18 June 2009]

Fallows, D. (2005). Search Engine users. [online]. Available from:

http://www.pewinternet.org/~/media//Files/Reports/2005/PIP_Searchengine_users.pdf.pdf [Accessed

17 June 2009]

Fallows, D. (2005). Search engine users. [online]. Available from: http://www.pewinternet.org/~/media//Files/Reports/2005/PIP_Searchengine_users.pdf.pdf [Accessed 17 June 2009] p.2

Feldman, S. (2005). Desperately seeking search. [online]. Available from:

http://www.kmworld.com/Articles/Editorial/Feature/Desperately-seeking-search-9665.aspx [Accessed

17 June 2009]

Friedman, B. G. (2004). Web search savvy. p.19

Friedman, B. G. (2004). Web Search Savvy. p.21

Graham, E. L./ Metaxas, P. T. (2003). Of course it’s true I saw it on the Internet!: Critical thinking in

the Internet. Available from: http://www.wellesley.edu/CS/pmetaxas/CriticalThinking.pdf [Accessed

17 June 2009]

Grallet, G. (2009). Baidu, un autre Google s'éveille. [online]. Available from:

http://www.lexpress.fr/actualite/high-tech/baidu-un-autre-google-s-eveille_734826.html [Accessed

23 January 2009]

Gulli, A./Signorini, A. (2005). The Indexable Web is More than 11.5 billion pages. [online]. Available

from: http://www.cs.uiowa.edu/~asignori/web-size/size-indexable-web.pdf [Accessed 18 June 2009]

Harvest Digital. (2006). User attitudes to search. [online]. Available from: http://www.harvestdigital.com/uploads/assets/pdfs/2cec1cc789493f04e8af724694f23e8c.pdf [Accessed 18 June 2009]

Hirsh, S./Dinkelacker, J. (2004). Seeking Information in order to produce information: an empirical

study at Hewlett Packards Labs. p.816

Hock, R/Notess, G. R. (2007). The extreme searcher’s Internet handbook. P.21

Hoelzle, U. (2009). This is your pilot speaking. Now, about that holding pattern... [online]. Available

from : http://googleblog.blogspot.com/2009/05/this-is-your-pilot-speaking-now-about.html [Accessed

18 June 2009]

Hof, R. (2009). Google's Outage Affected More than Google Users; Other Sites Hit Too. [online].

Available from:

http://www.businessweek.com/the_thread/techbeat/archives/2009/05/googles_outage.html [Accessed

18 June 2009]

Hotchkiss, G. (2007). Chinese eye tracking study: Baidu versus Google. [online]. Available from:

http://www.cnblogs.com/dixin/articles/955369.html [Accessed 20 June 2009]

Houste, F. (2009). Russie: Yandex sera le moteur de recherche par défaut de Firefox. [online].

Available from: http://www.search-engine-feng-shui.com/2009/01/russie-yandex-sera-le-moteur-

de-recherche-par-defaut-de-firefox/ [Accessed 23 January 2009]

Insight Xplorer. (2006). 創 市 際 市 場 研 究 顧 問 . [online]. Available from:

http://www.insightxplorer.com/specialtopic/co_info_acquisition.html [Accessed 17 June 2009]

Interactive Advertising Bureau. (2009). IAB Internet Advertising Revenue Report. [online]. Available

from: http://www.iab.net/insights_research/530422/adrevenuereport [Accessed 18 June 2009]

Internet and Mobile Association of India. (2007). Search Engine Marketing 2007. [online]. Available

from: http://www.slideshare.net/targetseo/search-engine-marketing-india-sem-india-imrb-presentation

[Accessed 18 June 2009]

Internet Systems Consortium. (2009). The ISC Domain Survey Internet Systems Consortium. [online]

Available from https://www.isc.org/solutions/survey [Accessed 17 June 2009]

Internet World Stats. (2009). Internet Usage in Asia. [online]. Available from:

http://www.internetworldstats.com/stats3.htm [Accessed 17 June 2009]

Internet World Stats. (2009). INTERNET USAGE STATISTICS The Internet Big Picture.[online]

Available from: http://www.internetworldstats.com/stats.htm [Accessed 17 June 2009]

Internet World Stats. (2009). World Internet Usage Statistics News and World Population Stats.

[online]. Available from: http://www.internetworldstats.com/stats.htm [Accessed 17 June 2009]

iProspect. (2004). Search Engine User Attitude April May 2004. [online]. Available from: http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf [Accessed 18 June 2009]

iProspect. (2004). Search Engine Users Attitudes 2004. [online]. Available from:

http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf [Accessed 18 June 2009]

iProspect. (2006). Search Engine User Behavior Study. [online]. Available from: http://www.iprospect.com/premiumPDFs/WhitePaper_2006_SearchEngineUserBehavior.pdf [Accessed 23 June 2009]

iProspect. (2008). Blended Search Results Study – April 2008. [online]. Available from:

http://www.iprospect.com/premiumPDFs/researchstudy_apr2008_blendedsearchresults.pdf [Accessed

18 June 2009]

Jansen, J .B./Spink, A. (2004). How are we searching the World Wide Web? A comparison of nine

search engine transaction logs. [online]. Available from :

http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/jansen_searching_the_web.pdf [Accessed 18

June 2009]

Juran, J. (1999). Juran’s quality handbook: Fifth edition. p.976

Kahn, J. (2005). Yahoo helped Chinese to prosecute journalist. [online]. New York: New York Times.

Available from: http://www.nytimes.com/2005/09/07/business/worldbusiness/07iht-yahoo.html

[Accessed 18 June 2009]

Kaplan, I. (2008). Bad Data Can Cost You Big Time. [online]. Available from:

http://www.federationofcredit.com/base/document/Newsletter/IKaplanSept08.html [Accessed 17

June 2009]

Kehoe, M. (2009). 2009 Overview of the Enterprise Search Market. [online]. Available from:

http://www.ideaeng.com/tabId/98/itemId/181/Overview-of-the-Enterprise-Search-Market-

2009.aspx [Accessed 17 June 2009]

Kehoe, M. (2009). Overview of the Enterprise Search Market. [online]. Available from:

http://www.ideaeng.com/tabId/98/itemId/181/Overview-of-the-Enterprise-Search-Market-2009.aspx

[Accessed 17 June 2009]

Koch, P. / Koch, S. (2009). How big is the Internet?. [online] Available from

http://www.pandia.com/sew/383-websize.html. [Accessed 19 January 2009]

Leit.is. (n.d.). Leit.is - Um leit.is :: Um leit.is. [online]. Available from: http://www.leit.is/umleit/ [Accessed 17 June 2009]

Leung, H. W. 梁漢榮. (2004). A study of computer science students' conceptions of information

literacy and their experiences in information search process and use. [online]. Available from:

http://hub.hku.hk/handle/123456789/30758 [Accessed 18 June 2009]

Liebowitz, S. J./Margolis, S. (1999). Winners, Losers and Microsoft

Liedtke, M. (2009). Google glitch disrupts search engine, e-mail. [online]. The Associated Press.

Available from:

http://www.google.com/hostednews/ap/article/ALeqM5jJA_JCGApgxvxii3ryhhilxLuscgD9867IU01

[Accessed 18 June 2009]

Malaysian Communications and Multimedia Commission. (2005). Household use of the Internet

survey 2005. [online]. Available from:

http://www.skmm.gov.my/facts_figures/stats/pdf/Household_use_internet_survey2005.pdf

[Accessed 17 June 2009]

Mar Hauksson, K. (2007). Global search report 2007 [online]. Available from:

http://www.e3internet.com/downloads/global-search-report-2007.pdf [Accessed 23 January 2009] p.8

Maximum PC. (2008). Microblogging. P.10

Mayer, M. (2009). "This site may harm your computer" on every search result?!?!. [online]. Available

from :http://googleblog.blogspot.com/2009/01/this-site-may-harm-your-computer-on.html [Accessed

18 June 2009]

Mazzon, J. (2009). On yesterday’s email. [online]. Available from :

http://googledocs.blogspot.com/2009/03/on-yesterdays-email.html [Accessed 18 June 2009]

Meerman, D. S. (2007). The new rules of marketing and PR. p.242

Merriam Webster. (2001). Google - Definition from the Merriam-Webster Online Dictionary. [online].

Available from: http://www.merriam-webster.com/dictionary/google [Accessed 17 June 2009]

Meyerson, M./Scarborough, M. E. (2007). Mastering Online Marketing. P.223

Microsoft. (n.d.). Préférences Bing. [online]. Available from:

http://www.bing.com/settings.aspx?sh=2&FORM=WIWA [Accessed 17 June 2009]

Miller, M. (2006). Googlepedia. p.352

Miller, M. (2006). Googlepedia. p.11

Miller, R. (2009). Unlock Power Enterprise Search. [online]. Available from:

http://byronmiller.typepad.com/UnlockPowerEnterpriseSearch.pdf [Accessed 17 June 2009] p.5

Millward Brown Optimor. (2009). Top 100 most global brands 2009. [online]. Available from:

http://www.brandz.com/output/ [Accessed 24 June 2009]

Mooers, N. C. (1959). A panel discussion at the Annual Meeting of the American Documentation Institute. 24 October.

Morville, P. (2005). Ambient Findability. p.111

Morville, P. (2005). Ambient Findability. p.54

Muñoz, C./Moraga, A./Piattini, M. (2008). Handbook of Research on Web Information Systems

Quality. p.286

Muñoz, C./Moraga, A./Piattini, M. (2008). Handbook of Research on Web Information Systems Quality. p.138

Netcraft. (n.d.). Most visited websites. [online]. Available from:

http://toolbar.netcraft.com/stats/topsites [Accessed 17 June 2009]

NHN Corporation. (n.d.). NHN Corporation. [online]. Available from : http://www.nhncorp.com/

[Accessed 17 June 2009]

Ohayon, O. (2008). Google, moteur de recherche ou moteur de navigation?. [online]. Available from :

http://fr.techcrunch.com/2008/10/30/fr-google-moteur-de-recherche-ou-moteur-de-navigation/

[Accessed 17 June 2009]

Olausson , A. M. (2007). Advanced Search: Is the name a problem?. [online]. Available from :

http://digital-lifestyles.info/2007/09/21/advanced-search-is-the-name-a-problem/ [Accessed 17 June

2009]

Olsen, J. (2003). Data quality: The accuracy dimension. p.24

Olsen, J. (2003). Data quality: The accuracy dimension. p.3

Olsen, J. (2003). Data quality: The accuracy dimension. p.5

Olsen, J. (2003). Data quality: The accuracy dimension. p.7-8

One Stat. (2007). OneStat Website Statistics and website metrics - Press Room. [online]. Available

from : http://www.onestat.com/html/aboutus_pressbox54-windows-vista-global-usage-share.html

[Accessed 20 June 2009]

Online Computer Library Center. (2002). Trends in the evolution of the Public Web 1998-2002. [online]. Available from: http://www.dlib.org/dlib/april03/lavoie/04lavoie.html [Accessed 18 June 2009]

Pedley, P. (March 2002). Why you can’t afford to ignore the Invisible Web. Business Information Review.

Pierce, J. (2008). The World Internet Project. [online]. Available from:

http://www.digitalcenter.org/WIP2009/WorldInternetProject-FinalRelease.pdf [Accessed 20 June 2009]

Priss, U./Corbett, D./Angelova, G. (2002). Conceptual structures. p.92

Rafat, A. (2008). Czech Portal Seznam Could Fetch $900 Million; Google, Apax, Warburg and Others

in Fray. [online] Available from: http://www.washingtonpost.com/wp-

dyn/content/article/2008/08/15/AR2008081502517.html [Accessed 23 January 2009]

Reputation Institute. (2008). The World Most Reputable Companies. [online]. Available from:

http://www.reputationinstitute.com/ [Accessed 24 June 2009]

Romaní, C.C./ Kuklinski, H.P. (2007). Planeta Web 2.0. Inteligencia colectiva o medios fast food

Sankar, K./Bouchard, S./Mancini, D. (2009). Enterprise Web 2.0 Fundamentals. P.161

Schwartz, B. (2009). Firefox Drops Google For Yandex In Russia, But Big Loser May Be Rambler.

[online]. Available from:

http://searchengineland.com/firefox-drops-google-for-yandex-in-russia-but-big-loser-may-be-rambler-16107 [Accessed 18 June 2009]

Seznam inc. (n.d.). Vize firmy | O společnosti Seznam.cz.[online]. Available from :

http://firma.seznam.cz/cz/vize-firmy.html [Accessed 17 June 2009]

Shapiro, C./Varian, H.R. (1999). Information Rules: A strategic guide to the Network Economy.

Sherman, C./Price, G. (2001). The invisible web. p.57

Shijun, Z./Peng, N./Weifeng, X. (2006). 时尚中国—网动中国英. p.45

Skooiz. (2008). Comment les Québecois utilisent ils et cherchent ils sur Internet ?. [online]. Available

from : http://documents.skooiz.com/comment-les-quebecois-cherchent-ils-sur-le-web-2008.pdf

[Accessed 18 June 2009]

Sweeney, S. (2008). 101 Ways to Promote Your Tourism Business Web Site. p.288

The Association for Enterprise and Content Management. (2008). Findability: The Art and Science of

Making Content Easy to Find. [online]. Available from:

http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx [Accessed 17 June 2009] p.22

The Association for Enterprise and Content Management. (2008). Findability: The Art and Science of

Making Content Easy to Find. [online]. Available from:

http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx [Accessed 17 June 2009] p.9

The Association for Enterprise and Content Management. (2008). Findability: The Art and Science of

Making Content Easy to Find. [online]. Available from:

http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx [Accessed 17 June 2009]

The Association for Enterprise and Content Management. (2008). Findability: The Art and Science of

Making Content Easy to Find. [online]. Available from:

http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx [Accessed 17 June 2009] p.36

Tobin, R./Hotchkiss, G./Lee, P. (2008). Chinese Search Engine Engagement. [online]. Available from :

http://www.enquiroresearch.com/download-research-whitepapers.aspx [Accessed 17 June 2009]

p.28.

Turow, J. (2008). Media Today. p.559

UCL – University College London. (2008). Information behavior of the researcher of the future.

[online]. Available from: http://www.bl.uk/news/pdf/googlegen.pdf [Accessed 19 June 2009]

UCLA Center for Communication Policy. (2003). Surveying the Digital Future. [online]. Available

from: http://images.forbes.com/fdc/mediaresourcecenter/UCLA03.pdf [Accessed 18 June 2009] p.39

United Nations. (n.d.). Human Rights Translated: A Business Reference Guide. p.54

Université de Lyon. (2007). De la documentation au plagiat. [online]. Available from :

http://www.compilatio.net/files/sixdegres-univ-lyon_enquete-plagiat_sept07.pdf [Accessed 18

June 2009]

URFIST de Rennes. (2008). Enquête sur les besoins de formation des doctorants à la maîtrise de

l’information scientifique dans les Ecoles doctorales de Bretagne. [online]. Available from:

http://www.uhb.fr/urfist/enquete_besoins_formation_doctorants-maitrise_information [Accessed

18 June 2009]

Valentiner, Z. (2009). New search tool on the block: Wolfram Alpha. [online]. Available from :

http://www.mndaily.com/blogs/tech-corner/2009/05/20/new-search-tool-block-wolframalpha [Accessed 17 June 2009]

Valle, F. S./Soghoian, C. (2008). Adios Diego: Argentine judges cleanse the Internet. [online].

Available from: http://opennet.net/blog/2008/11/adiós-diego-argentine-judges-cleanse-internet

[Accessed 18 June 2009]

Vaughan, L./Thelwall, M. (2003). Search Engine Coverage Bias: Evidence and Possible Causes.

[online]. Available from: http://www.scit.wlv.ac.uk/~cm1993/papers/search_engine_bias_preprint.pdf

[Accessed 18 June 2009]

Wang, W. (2007). Integration and Innovation Orient to E-Society Volume 1. p.666

WhamTech. (n.d.). Glossary of less-than-usual terms used in the Web site. [online]. Available from:

www.whamtech.com/glossary.htm [Accessed 17 June 2009]

Wickre, K. (2006). Testimony: The Internet in China. [online]. Available from:

http://googleblog.blogspot.com/2006/02/testimony-internet-in-china.html [Accessed 18 June 2009]

Wilsdon, N. (2007). Global Search Report 2007. [online]. Available from:

http://www.e3internet.com/downloads/global-search-report-2007.pdf [Accessed 23 January 2009]

Wordnet.princeton.edu. (2009). Accuracy definition. [online]. Available from:

wordnet.princeton.edu/perl/webwn [Accessed 17 June 2009]

Wordnet.princeton.edu. (2009). Timelessness definition. [online]. Available from:

wordnet.princeton.edu/perl/webwn [Accessed 17 June 2009]

XinHua. (2008). Commentary : Facts about Tibet should not be distorted. [online]. Available from:

http://news.xinhuanet.com/english/2008-03/24/content_7847789.htm

http://news.xinhuanet.com/english/2008-03/24/content_7847789_1.htm [Accessed 17 June 2009]

Yahoo Inc. (n.d.). Company Overview. [online]. Available from:

http://yhoo.client.shareholder.com/press/overview.cfm [Accessed 17 June 2009]

Yahoo Inc. (n.d.). Yahoo dans le monde. [online]. Available from: http://world.yahoo.com/?c=fr

[Accessed 17 June 2009]

Yamaguchi, T. (2008). Practical aspects of Knowledge Management. p.41

Yandex inc. (2008). Russia’s largest internet search engine and a leading internet and technology

company. [online]. Available from: http://download.yandex.ru/company/mini_book_v19.pdf

[Accessed 17 June 2009]

Zittrain, J.L. (2008). The future of the Internet and how to stop it. p.140