Where does Enterprise search end and text analytics begin?


I was recently asked by a Chief Information Officer (CIO) of a large organization: where does search end and text analytics begin? It is an interesting question, and one worth exploring by developing some models, one of which is shown below (Figure 1).

Figure 1 – From ‘traditional search’ (green) to unlocking a wider range of questions we can ask


The green boxes (Figure 1) are the domain of traditional search. The blue boxes illustrate the direction of travel taken by some organizations, which blend increasing amounts of text analytics and Knowledge Organization Systems (KOS), such as taxonomies and thesauri, with traditional Information Retrieval (IR). Moving from left to right, this allows organizations to go beyond just finding documents that contain existing ‘knowledge’, towards pattern recognition (where the searcher is part of the process) that discovers latent ‘knowledge’ directly, sometimes using the whole corpus of available text.

Whether you call this space Knowledge Management, e-business, Artificial Intelligence (AI), smart machines, deep learning, cognitive computing or plain old search & discovery probably does not matter.

Ranking by frequency dominates

Text has been converted into numerical representations of some form since the first search engine was developed. Statistical approaches for search have generally been dominated by frequency. Whilst different metadata fields are typically tuned with different search ranking criteria (e.g. title and tags given higher weights than body text), all things being equal it is statistical frequency that dominates ranking: how many times search terms are mentioned in a document or corpus, how many times people have made certain search queries (for query suggestions as you type), or how many times a web page or document has been referenced or accessed. Even facets (refiners), typically shown on the left-hand side of a search user interface to help narrow a search, are almost always ranked by frequency or popularity, regardless of whether they are created by manual or automatic tagging methods. Ranking by frequency dominates.
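
As a rough illustration of frequency-driven ranking with field weights, here is a minimal Python sketch. The field names, the weights and the two sample documents are assumptions invented for this example; real search engines use more refined scoring, but the principle of counting term occurrences and weighting them by field is the same.

```python
from collections import Counter

# Hypothetical field weights: title and tags count more than body text.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase, split on whitespace and strip basic punctuation."""
    tokens = (t.strip(".,;:!?()\"'").lower() for t in text.split())
    return [t for t in tokens if t]


def score(document, query):
    """Sum the weighted counts of each query term across the fields."""
    terms = tokenize(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokenize(document.get(field, "")))
        total += weight * sum(counts[t] for t in terms)
    return total


def rank(documents, query):
    """Order documents by descending weighted term frequency."""
    return sorted(documents, key=lambda d: score(d, query), reverse=True)


# Invented sample documents.
docs = [
    {"id": "a", "title": "Drilling report", "tags": "well",
     "body": "The well was drilled. The well is dry."},
    {"id": "b", "title": "Well summary", "tags": "well summary",
     "body": "Overview of one well."},
]
ranked = rank(docs, "well")
```

Document "b" mentions the term fewer times in its body, yet ranks first because the term appears in its more heavily weighted title.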

I won’t delve into ‘relevance’ in this article; suffice to say that ranking by statistical frequency has been used as a part-surrogate for relevance.

Linguistics

Linguistics (including the use of authority lists, thesauri/taxonomies, ontologies for handling synonyms & relationships, and Natural Language Processing (NLP)) has been used for decades to improve searching. These techniques are typically used to ensure the above methods operate on ‘concepts’, to mitigate the ‘vocabulary problem’ we have as humans and to improve search recall and precision. Thesauri, taxonomies and ontologies can also help recommend search terms. They are an aid to the statistics, with many scholars proposing a ‘best of both worlds’ view that hybrid (linguistic & statistical) techniques work best to cater for a range of scenarios in search and auto-classification, although some still argue for just one or the other.
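
One simple way a thesaurus can make search operate on ‘concepts’ is query expansion: each query term is replaced by the set of terms that share its concept. The Python sketch below assumes a tiny hand-made thesaurus; the terms and function names are hypothetical and only illustrate the idea.

```python
# Hypothetical mini-thesaurus: preferred term -> synonyms.
THESAURUS = {
    "well": {"borehole", "wellbore"},
    "sandstone": {"sst", "arenite"},
}


def build_lookup(thesaurus):
    """Map every term (preferred or synonym) to its full concept set."""
    lookup = {}
    for preferred, synonyms in thesaurus.items():
        concept = {preferred} | synonyms
        for term in concept:
            lookup[term] = concept
    return lookup


def expand_query(query, lookup):
    """Return, for each query term, the sorted terms to OR together."""
    return [sorted(lookup.get(term, {term}))
            for term in query.lower().split()]


lookup = build_lookup(THESAURUS)
expanded = expand_query("borehole sandstone porosity", lookup)
```

A term that is not in the thesaurus (here "porosity") passes through unchanged, so the statistical ranking still applies to it as before.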

Social cues

Social analytics has been used in search for many decades. Examples include recommending or boosting an item in search results because it is viewed or cited very often, suggesting search terms as you type (the “Google type model”), or suggesting another information item that may be of interest (the “Amazon.com type model”, often called crowdsourcing). These approaches are transactional, based on social cues from people. Enterprise social tools, increasingly hosted in the cloud, perform the same type of analysis, using data from document or post views and likes, ‘people who attended online meetings that you attended’ and so on, to algorithmically push information to the user in an activity feed (the “Facebook type model”), complementing traditional enterprise search. Aspects of personalization based on browser cookies have of course been doing this for years, driven by clickthrough advertising revenue.
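
The ‘people who viewed this also viewed that’ style of recommendation can be sketched with simple co-occurrence counts over viewing sessions. The session data and function names below are assumptions made for illustration, not a description of how any particular product works.

```python
from collections import Counter
from itertools import combinations


def co_view_counts(sessions):
    """Count how often each pair of items is viewed in the same session."""
    counts = {}
    for viewed in sessions:
        for a, b in combinations(sorted(set(viewed)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts


def also_viewed(item, counts, top_n=3):
    """Return the items most often viewed together with `item`."""
    return [other for other, _ in counts.get(item, Counter()).most_common(top_n)]


# Invented viewing sessions, one list of document ids per user session.
sessions = [
    ["doc1", "doc2", "doc3"],
    ["doc1", "doc2"],
    ["doc2", "doc4"],
]
counts = co_view_counts(sessions)
suggestions = also_viewed("doc1", counts)
```

Again the ordering is purely by frequency: "doc2" is suggested ahead of "doc3" because it was viewed alongside "doc1" more often.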

It could be argued that search has never been separate from analytics.


Content Analytics

As information volumes have grown exponentially, so has the analysis of the content inside these ‘containers’ of information, such as web pages, documents and structured databases. Linguistics is still a very important part, but it supports the statistical methods that are used to seek out ‘interesting’ situational context.

Taxonomies and semantic networks are useful, if not essential, for computer systems to help us discover information. However, they may also blind us to new discoveries if we ignore what the data (text) is telling us and only superimpose a priori representations.

Organizations have turned to auto-classification to help with records retention, to reduce file storage costs and to clean up Redundant, Obsolete and Temporary (ROT) files, typically on shared file systems and email systems. Search is also used for reporting, highlighting information in the corpus that should not be there for legal, privacy or confidentiality reasons. This analysis may often be simply a series of phrase queries. Similar techniques have been (and continue to be) used to automatically move emails into spam folders when they contain certain trigger words or terms with disambiguated semantics.

Perhaps these approaches are akin to first-generation content analytics.
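
A report built from ‘a series of phrase queries’ might look something like the following Python sketch. The trigger phrases, file names and text are invented for illustration; a real list would come from legal, privacy or records management staff.

```python
import re

# Hypothetical trigger phrases.
TRIGGER_PHRASES = ["strictly confidential", "passport number", "date of birth"]


def compile_phrases(phrases):
    """One case-insensitive pattern per phrase, tolerant of extra spaces."""
    return {
        p: re.compile(
            r"\b" + r"\s+".join(map(re.escape, p.split())) + r"\b",
            re.IGNORECASE,
        )
        for p in phrases
    }


def flag_documents(documents, patterns):
    """Return {doc_id: [matched phrases]} for documents with any hit."""
    report = {}
    for doc_id, text in documents.items():
        hits = [p for p, rx in patterns.items() if rx.search(text)]
        if hits:
            report[doc_id] = hits
    return report


# Invented documents.
documents = {
    "memo.txt": "This memo is STRICTLY   confidential.",
    "notes.txt": "Meeting notes about the drilling schedule.",
    "hr.txt": "Please supply your date of birth and passport number.",
}
report = flag_documents(documents, compile_phrases(TRIGGER_PHRASES))
```

Only the documents that contain at least one trigger phrase appear in the report, which is essentially what a saved set of phrase queries would return.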

Second-generation analytics could be viewed as more advanced techniques that surface real-world patterns, targeting business value, wealth creation and risk reduction. In addition to frequency, both similarity and discriminatory techniques are increasingly used. In these techniques, the conversion of text to numerical form is taken to extremes. One example is the use of neural networks to create complex probability distributions that approximate one word’s relationship to every other word in the entire corpus, producing vectors that can be compared, added and subtracted.
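
To show what comparing, adding and subtracting word vectors means in practice, here is a minimal Python sketch. The three-dimensional vectors are toy values invented for this example; vectors learned by a neural network would have hundreds of dimensions and be derived from a real corpus.

```python
import math

# Toy vectors, invented for illustration only.
VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def subtract(u, v):
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(target, vectors, exclude=()):
    """Return the word whose vector is most similar to `target`."""
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)


# "king" minus "man" plus "woman", then find the closest remaining word.
target = add(subtract(VECTORS["king"], VECTORS["man"]), VECTORS["woman"])
answer = nearest(target, VECTORS, exclude={"king", "man", "woman"})
```

With these toy values the closest remaining word is "queen", which mirrors the kind of arithmetic on meaning that learned word vectors make possible.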

Instead of using this analytical information to influence search results of documents and web pages, the focus shifts to the associations between concepts and entities within and across those documents and web pages. The focus moves from documents to entities & concepts. It is still ‘search’, but with a different focus.

Using an analogy, if documents are “atoms”, content analytics smashes them apart to look at the “particles” inside (the concepts & entities) and their behaviour with respect to one another. You might find a new particle or a new relationship between particles that you did not know about before, but you have to look inside first, and it takes imagination and energy to produce the really exciting discoveries.

Analogues

Organizations are increasingly interested in how these content analytic techniques can be used to identify complex business analogues. As one scientist commented, “analogues are difficult to search on because you don’t know what they are, so you don’t know what search queries to use!” For example, in the oil and gas industry these techniques can be used to look for analogous geological environments (using the similarity of words that appear around different entities), and/or to surface an activity trend that one company is pursuing in a geological basin that other companies are not. These techniques can transform unstructured information into structured information to automatically return answers (not lists of documents), to stimulate ideas, visualize results on a map or store in a database.
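
The idea of comparing entities by ‘the similarity of words that appear around’ them can be sketched as follows. The basin names and sentences are hypothetical, and the window size is an arbitrary assumption; the point is only that two entities described in similar words end up with similar context profiles.

```python
import math
from collections import Counter


def context_profile(entity, sentences, window=3):
    """Count words within `window` tokens of each mention of `entity`."""
    profile = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == entity:
                lo, hi = max(0, i - window), i + window + 1
                profile.update(t for t in tokens[lo:hi] if t != entity)
    return profile


def cosine(p, q):
    """Cosine similarity between two word-count profiles."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = (math.sqrt(sum(c * c for c in p.values()))
            * math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0


# Invented sentences mentioning three hypothetical basins.
sentences = [
    "basin_a contains deltaic sandstone reservoirs",
    "basin_b contains deltaic sandstone reservoirs",
    "basin_c contains fractured carbonate reservoirs",
]
profile_a = context_profile("basin_a", sentences)
profile_b = context_profile("basin_b", sentences)
profile_c = context_profile("basin_c", sentences)
sim_ab = cosine(profile_a, profile_b)
sim_ac = cosine(profile_a, profile_c)
```

Here the first two basins score as closer analogues of each other than the first and third, without anyone having to know in advance which query words to use.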

Prediction

Coping with information overload and keeping on top of what is going on around you is getting increasingly difficult in many areas. Using historical information to calibrate systems may help predict certain types of events before they happen. For example, patterns in daily operations reports can be treated as clues, alerting engineers to potential impending issues. Such systems present another voice based on the text, akin to “Look, last time I saw these clues appearing in the operations reports... this happened”. These techniques have been used with quantitative data for many years to predict and prescribe action, but are now increasingly being applied to qualitative information (unstructured text) authored by people.
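
A deliberately crude Python sketch of calibrating on historical reports is shown below. The report texts, the labels and the scoring rule (the average historical event rate of the known words in a new report) are all assumptions made for illustration; a production system would use far more data and a properly validated model.

```python
from collections import Counter


def calibrate(history):
    """For each word, the fraction of past reports containing it that
    were followed by an event. `history` is a list of (text, event)."""
    seen, followed = Counter(), Counter()
    for text, event in history:
        for word in set(text.lower().split()):
            seen[word] += 1
            if event:
                followed[word] += 1
    return {w: followed[w] / seen[w] for w in seen}


def alert_score(text, rates):
    """Average historical event rate of the known words in a new report."""
    known = [rates[w] for w in set(text.lower().split()) if w in rates]
    return sum(known) / len(known) if known else 0.0


# Invented history: (report text, whether an incident followed).
history = [
    ("vibration increasing torque erratic", True),
    ("torque erratic mud losses", True),
    ("routine drilling ahead", False),
    ("routine maintenance completed", False),
]
rates = calibrate(history)
high = alert_score("torque erratic today", rates)
low = alert_score("routine drilling ahead", rates)
```

A new report containing clue words that previously preceded an incident scores high, which is the ‘last time I saw these clues’ voice expressed as a number.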

Relevance versus Interestingness

In some recent research on facilitating serendipity in the search user interface, a scientist in an oil and gas company mentioned that a search result refiner was ‘relevant but not interesting’. Clearly, what is interesting for one person may not be interesting for another. However, there may be a need to move beyond traditional definitions (and algorithms) for relevance (Figure 2).

Figure 2 – Moving search beyond a text box and ten blue links

Summary

Returning to the question posed at the beginning, ‘Where does search end and text analytics begin?’ Perhaps they are two sides of the same coin. Text analytics has always been essential to basic information retrieval, although its main use was to help people find the results (containers) they were looking for. Analytics of social cues has been used to good effect to help people locate what they were looking for and also discover information that they were not looking for. However, some argue this is discovery through the ‘rear view mirror’ and creates a filter bubble that does not encourage us to stray from the beaten path. Combining search and content-based text analytics presents us with opportunities to help us formulate our needs, test hypotheses, predict events, unlock new knowledge and increase the propensity of our user interfaces to stimulate fortuitous information discovery.

There are likely challenges for the CIO: how to address existing complaints and needs (to find that single web page or document) for all staff, whilst delivering the capability for advanced discovery to small communities that may wish to mine external and internal information (much more than a simple e-discovery solution), which may lead to leaps in business value that cannot be predicted in advance. Many organizations have already done this organically. It may be a mistake to think this can be achieved with a single technology or user interface.

That brings further challenges with regard to costs, multiple indexing streams and complexity. Business cases and creative architectures may nevertheless exist to meet both requirements. This will require leadership to set out a clear vision, with careful planning and architecting. These decisions need to be made against a backdrop in which technology vendor propaganda and useful information are intertwined, sometimes making it difficult to form objective and realistic views of what is possible and what the caveats are.

Exponentially growing information volumes, combined with proven techniques published in the public domain, present an opportunity to expand our horizons and ‘move the goalposts’ with respect to the questions we can ask computer systems and what those computer systems can suggest to us. After all, we may be asking the wrong questions.

More at: www.paulhcleverley.com

References Addison, V. (2014). Oil, Gas Industry Focuses on Predictive Analytics. Hart Energy October 6th 2014. Online Article

(Accessed February 2015).

Adkins, S (2003). Information Gathering in the Electronic Age: The Hidden Cost of the Hunt. Safari Techbooks, January 2003.

AIIM (2008). Market IQ Report: Findability: The art and science of making content easy to find. Association for Information

and Image Management (AIIM) 2008. Sponsored by OpenText.

Allan, J., Croft, B., Moffat, A., Sanderson, M. (2012). Frontiers, Challenges, and Opportunities for Information Retrieval.

Report from the Second Strategic Workshop on Information Retrieval in Lorne, February 2012, ACM SIGIR Forum,

46(1), 2-32

Alyahyaee, A. (2012). Oil & Gas Data Repository (OGDR), Energistics National Data Repository (NDR) ’11 Update. 21st-

24th October 2012. Kuala Lumpur, Malaysia.

Andersen, E. (2012). Making Enterprise Search Work: From Simple Search Box to Big Data Navigation. Center for

Information Systems Research (CISR) Massachusetts Institute of Technology (MIT) Sloan School Management,

12(11).

Ballard, T., Blaine, A. (2011). User search limiting behaviour in Online Catalogs. Comparing classic catalog use to search

behaviour in next generation catalogs. New Library World, 112(5/6), 261-273.

Bawden, D. (1986). Information-Systems and the Stimulation of Creativity. Journal of Information Science, 12(5), 203-216.

Behounek, S., Casey, K. (2007). EarthSearch=GoogleEarth Enterprise+PetroSearch. Society of Petroleum Engineers (SPE)

Digital Energy Conference and Exhibition, 11-12th April, Houston, Texas, USA. Report ID: SPE-108208-MS

Berger, P.L., Luckmann, T. (1966). The social construction of reality. A treatise in the sociology of knowledge. 1st ed. London:

Penguin.

Bizer, C., Heath, T., Berners-Lee, T. (2009). Linked Data – The Story So Far. Special Issue on Linked Data, International

Journal on Semantic Web and Information Systems (IJSWIS), 5(3), 1-22.

Blackman, S. (2012). Risky business: challenges of deepwater drilling in the North Sea. Offshore Technology, 21st June 2012.

Online Article (Accessed December 2014).

Blei, D, Ng, A., Jordan, M. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 2003, 3, 993-1022

Broussard, F., Dineen, P., Tushingham, K. (2011). Hart’s E&P Magazine. Digital Oil Field: G&G software accelerates user

productivity. Schlumberger, August 2011.

Brown, J.S., Duguid, P. (1991). Organizational Learning and Communities-of-Practice: Toward a Unified View of Working,

Learning and Innovation. Organizational Science, 2(1), 40-57.

Brown, N. (2014). Fostering Collaboration Using Analytics & Real-time Big Data Search: Insight into Technology Services.

AstraZeneca presentation Enterprise Search Europe, 29-30th May, London, UK.

Bushell, S. (1999). Wiring the Corporate Brain. Chief Information Officer (CIO). Online Article 6th October 1999 (Accessed

October 2014).

Caballero, R, Nuernberg, S. (2014). Building an Enterprise Taxonomy. 18th International Petroleum Data, Integration and Data

Management (PNEC), May 20-22nd 2014, Houston, USA.

Carpineto, C., Romano, G. (2012). A Survey of Automatic Query Expansion in Information Retrieval. ACM Computing

Surveys, 44(1), 1-50.

Chuang, J., Manning, C.D., Heer, J. (2012). “Without the Clutter of Unimportant Words”: Descriptive Keyphrases for Text

Visualization. ACM Transactions on Computer-Human Transactions, 19(3)

Chui, M., Manyika, J., Bughin, J., Dobbs, R., Roxburgh, C., Sarrazin, H., Sands, G., Westergren, M. (2012). The social

economy: Unlocking value and productivity through social technologies. McKinsey Global Institute Report. Online

Article (Accessed January 2015).

Chum, F., Everett, M., Hills, S., Soma, R., Cutler, R. (2011). Realizing the Semantic Web Promise in the Oil & Gas Industry:

Challenges and Experiences. SemTech 2011,, 9th June 2011, San Francisco, USA.

Chum, F. (2009). Semantic Technologies at the Ecosystem Level. Interview by (Morrison, A. and Parker, B.)

PriceWaterhouseCoopers Technology Forecast Spring 2009. Online Article.

Page 6: Where does Enterprise search end and text analytics begin?

Cleverley, P.H. (2012). Improving Enterprise Search in the Upstream Oil and Gas Industry by Automatic Query Expansion

using a Non-Probabilistic Knowledge Representation. International Journal of Applied Information Systems (IJAIS),

1(1), 25-32

Cleverley, P.H. (2014). Towards a causal model for search user satisfaction and sub-optimal task performance in the upstream

oil and gas industry. Doctoral PhD Thesis (work in progress – unpublished), Robert Gordon University, Aberdeen, UK.

Cleverley, P.H., Burnett, S. (2015b). Creating sparks: comparing search results using discriminatory search term word co-

occurrence to facilitate serendipity in the enterprise. Journal of Information and Knowledge Management (JIKM).

Cleverley, P.H., Burnett, S. (2015a). Retrieving haystacks: a data driven information needs model for faceted search. Journal

of Information Science, 41, 97-113

Colleran, J. (2014). Improving Exploration Success through Better Data Management: Maersk Oil Perspective. The Oil and

Gas Industry Conference, 12th June 2014, London, UK.

Coyne, I.T. (1997). Sampling in qualitative research. Purposeful and theoretical sampling; merging or clear boundaries. Journal

of Advanced Nursing, 26, 623-630.

Dale, E. (2013). The importance of constant measurement in search relevance. A longitudinal case study. Ernst & Young.

Enterprise Search Summit 2013, New York, USA.

DeLone, W.H., McLean, E.R. (2002). The DeLone and McLean Model of Information System Success: A Ten Year Update.

Journal of Management Information Systems, 19(4), 9-30.

Delphi (2002). Taxonomy & Content Classification. Market Milestone Report. Online Article (Accessed March 2013).

Demartini, G. (2007). Leveraging Semantic Technologies for Enterprise Search, PIKM November 2009, Lisboa, Portugal.

Dextre Clarke, S.G., Zeng, M.L. (2012). From ISO 2788 to ISO 25964: The Evolution of Thesaurus Standards towards

Interoperability and Data Modeling. Information Standards Quarterly, 24(1), 20-26.

Dillon, T. S., Talevski, A., Potdar, V., & Chang, E. (2009). Web of things as a framework for ubiquitous intelligence and

computing. In Ubiquitous Intelligence and Computing (2-13). Springer Berlin Heidelberg.

Doane, M. (2010). Cost benefit analysis: Integrating an enterprise taxonomy into a SharePoint environment. Journal of Digital

Asset Management, 6(5), 262-278

Duan, L., Xu, L.D. (2012). Business Intelligence for Enterprise Systems: A Survey. IEEE Transactions on industrial

informatics, 8(3), 679-687

Espinosa, J.A., Armour, F. (2010). Enterprise Architecting Process and Coordination. Executive Briefing Series, Center for

Information Technology and the Global Economy (CITGE), Kogod School of Business, 3(3)

Fagan, J.C. (2010). Usability studies of faceted browsing: A literature review. Information Technology and Libraries, 58-66.

Faith, A. (2011). Linguistically Training Automatic Indexing Software for Complex Taxonomies. Semantic Technology &

Business Conference June 2013.

Feldman, S., Sherman, C. (2001). The High cost of not finding information. White Paper International Data Corporation (IDC).

Feldman, S., Marobella, J.R., Duhl, J., Crawford, A. (2005). The Hidden Costs of Information Work. White Paper International

Data Corporation (IDC).

Feldman, S. (2009). IDC Executive Briefings: Information Advantage: Information Access in Tommorow’s Enterprise.

International Data Corporation (IDC).

Foster, A. & Ford, N. (2003) Serendipity and information seeking: an empirical study. Journal of Documentation. 59(3), 321-

340

Friedman, B. (2010). Serendipity is an Explorationists best friend. American Association of Petroleum Geologists (AAPG)

Online Article.

Furnas, G.W., Landauer, T.K., Gomez, L.M., Dumais, S.T. (1987). The vocabulary problem in human-system communication.

Communications of the ACM, 30(11), 964-971

Garbarini, M., Catron, R.E., Pugh, B. (2008). Improvements in the Management of Structured and Unstructured Data. Society

of Petroleum Engineers, Report IPTC12035.

Garbujo, C., Viarigi, P. (2013). ENI E&P Global GIS Project Infoshop. ESRI Petroleum Users Group (EPUG) 14th November

2013, London, UK.

Geggel, L. (2015). Forget Jeopardy: 5 Abilities That Make IBM’s Watson Amazing. Livescience Online Article April 15th,

(Accessed April 2015)

Ghiselin, D. (2010). Serendipity is alive and well at EagleFord. Hart’s E&P Online Article.

Gimmal (2013). Information Governance and Compliance in Oil and Natural Gas Company. Online Article (Accessed January

2015)

Goker, A., Davies, J. (2009). Information Retrieval: Searching in the 21st Century. UK: Wiley & Sons Ltd

Greenberg, J. (2011). Introduction: Knowledge Organization Innovation: Design and Frameworks. Bulletin of the American

Society for Information Science and Technology, April/May 2011, 37(4), 12-14.

Grefenstette, G. (1994). Explorations in Automatic Thesaurus Generation. MA, USA: Kluwer Academic Publishers Norwell

Grimes, S. (2014). Text Analytics Applied. 2nd LIDER Road mapping workshop, May 8th 2014, Madrid, Spain.

Gwizdka, J. (2009). What a difference a tag cloud makes: effects of tasks and cognitive abilities on search results interface

use. Information Research. 14(4)

Halvey, M., Keane, M.T. (2007). An assessment of tag presentation techniques. Proceedings of 16th International World Wide

Web Conference (WWW).

Hamski, J. (2010). Unstructured Geospatial Information for a Competitive Advantage in Resource Exploration. Elsevier,

Online Article, Accessed January 2015.

Page 7: Where does Enterprise search end and text analytics begin?

Hearst, M.A. and Stoica, E. (2009). NLP Support for Faceted Search Navigation in Scholarly Collections. Proceedings of the

2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, ACL-IJCNLP Suntec, Singapore 7th

August 2009, 62-70

Hedden, H. (2013). Taxonomies for Auto-Tagging Unstructured Content. Text Analytics World, October 1st 2013, Boston

USA.

Heye, D. (2003). Taxonomies and automatic classification at Shell – a case study. ‘Building a Knowledge Framework:

Practical Taxonomy Design and Application Conference, September 29-30th Amsterdam, The Netherlands.

Hills, S. (2014). Why we Want to Implement ISO Metadata: Energy Industry Profile of ISO 19115-1:2014 (“EIP”) V1.0.

Energistics FGDC ISO Metadata Implementation Forum 12th February 2014.

Hjorland, B. (2008). What is Knowledge Organization (KO)? International journal devoted to concept theory, classification,

indexing and knowledge representation, 35(2/3), 86-101

Hodge, G. (2000). Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. Washington,

USA, First Digital Library Federation and Council on Library and Information Resources.

Hubert, C. (2012). Seamless Collaboration. Enabling Employees to Work Together Across Boundaries. APQC Report K03906,

1-15.

Jacob, E.J. (2004). Classification and categorization: A Difference that Makes a Difference. Library Trends, 52(3), 515-540.

Jacobs, P.S., Rau, L.R. (1990). SCISOR: Extracting information from on-line news. CACM 33, 88-97

Jurka, T.P., Collingwood, L., Boydstun, A.E., Grossman, E., van Atteveldt, W. (2013). RTextTools: A Supervisory Learning

Package for Text Classification. The R Journal, 5(1), 6-12.

Kaizer, J., Hodge, A. (2005): "AquaBrowser Library: Search, Discover, Refine", Library Hi Tech News, 22(10), 9-12

Kastrin, A., Rindflesch, T.C., Hristovski, D. (2014). Large-Scale Structure of a Network of Co-Occuring MeSH Terms:

Statistical Analysis of Macroscopic Properties. PLoS One, 9(7).

Khoo, C.S.G., Luyt, B., Ee, C., Osman, J., Lim, H., Yong, S. (2007). How users organize electronic files on their workstations

in the office environment: a preliminary study of personal information organization behaviour. Information Research,

11(2).

Koenig, M.E.D. (2002). Time saved – a misleading justification for KM. KM World, 11(5)

Krestel, R., Demartini, G., Herder, E. (2011). Visual Interfaces for Stimulating Exploratory Search. JCDL 2011, June 13th-17th

Ottawa, Canada, 393-394.

Landauer, T.K., Dumais, S.T. (1997). A Solution to Platos’ Problem: The Latent Semantic Analysis Theory of Acquisition,

Induction, and Representation of Knowledge. Psychological Review, 104(2), 211-240.

Lennon, A., Alshubi, F., Cleverley, P.H. (2012). Improving Subsurface and Wells Document Management at Qatar Shell. 16th

Annual Petroleum Data Integration Conference. May 15th-17th Houston, USA.

Low, B. (2011). Usability and contemporary user experiences in digital libraries. CIGS Seminar, University of Edinburgh.

Slide 17

Lowe, A., McMahon, C., Culley, S. (2004). Characterising the requirements of engineering information systems. International

Journal of Information Management, 24, 401-422.

Luke, T., Schaer, P., Mayr, P. (2012). Improving Retrieval Results with discipline-specific Query Expansion. Proceedings of

Theory and Practice of Digital Libraries, 2012.

Lund, K., Burgess, C., Atchley, R.A. (1995). Semantic and Associative Priming in High-Dimensional Semantic Space.

Cognitive Science Proceedings, 603-608

Magnuson, D. (2014). Auto Classification and the Holy Grail for Records Managers. IBM Presentation as the Association or

Records Managers and Administrators (ARMA), Houston.

Majid, S., Anwar, M.A., Eisenshitz, T.S. (2000). Information Need and Information Seeking Behavior of Agricultural

Scientists in Malaysia. Library & Information Science Research, 22 (2), 145-163

Manning, C.D., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, United States of

America, Massachusetts Institute of Technology (MIT) Press.

Manning, C.D., Raghavan, P., Schutze, H. (2009). An Introduction to Information Retrieval. Cambridge, England. Cambridge

University Press.

Marchionini, G. (2006). Exploratory Search: From Finding to Understanding. Communications of the ACM. 49 (4), 41-46

Martela, F. (2015). Fallible Inquiry with Ethical End-in-View: A Pragmatist Philosophy of Science for Organizational

Research. Organizational Studies, 1-27.

Mason, J. (2006). Mixing methods in a qualitative way. Qualitative Research, 6(1), 9-25

Matarazzo, J.M., Pearlstein, T. (2014). Demonstrating the Value of Corporate Libraries. APLIC Meeting, April 29th 2014,

Boston, USA.

McCandless, D. (2012). Information in beautiful, 2nd ed., William Collins, London.

McCay-Peet, L. & Toms, E. (2011). Measuring the dimensions of serendipity in digital environments. Information Research,

16(3)

McDonald, S., Ramscar, M. (2001). Testing the distributional hypothesis: The influence of context on judgements of semantic

similarity. Proceedings of the 23rd Annual Conference of the Cognitive Science Society, 611-616.

McNaughton, N. (2015). Knowledge organization – the great debate! Oil Information Technology Journal, 20(2), 1-11

Microsoft and Accenture (2010). Upstream Oil & Gas Computing Trends Survey (2010). Conducted by PennEnergy Research

and the Oil & Gas Journal Research Centre.

Page 8: Where does Enterprise search end and text analytics begin?

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J. (2013). Distributed representations of words and phrases and

their compositionality. Advanced in Neural Information Processing Systems, 3111-3119

Miller, D. (2014). Just the facts: Auto-classification and Taxonomies. ConceptSearching Webinar, Online Article (Accessed

February 2015).

Miller, G.A. (1956). The magical number seven, plus or minus two. Some limits on how our capacity for processing

information. Psychological Review, 63, 81-97.

Miller, G.A., Beckwith, R., Fellbaum, C.D., Gross, D., Miller, K. (1990). WordNet: An online lexical database. International

Journal of Lexicography, 3(4), 235–244

Mindmeter (2011). Mind the Enterprise Search Gap. Report Sponsored by SmartLogic.

Mitchell, T.M., AbuZaki, W., Betteridge, J., Carlson, A., Hruschka, E.R., Kisiel, B., Settles, B., Wang, R. (2009). How Will

We Populate the Semantic Web on a Vast Scale? International Semantic Web Conference (ISWC) 2009.

Morgan, D.L. (1997). Focus Groups as Qualitative Research: Planning and Research Design for Focus Groups. In Sage

Research Methods, 32-46

Munkvold, B.E., Paivarinta, T., Hodne, A.K., Stangeland, E. (2006). Contemporary issues of enterprise content management:

the case of Statoil. Scandinavian Journal of Information Systems, 18(2), 69-100.

Navigli, R., Velardi, P. (2002). Automatic Adaptation of WordNet to Domains. Proceedings of the Third International

Conference on Language Resources and Evaluation (LREC ’02), Canary Islands, Spain.

Nimmagadda, S.L., Dreher, H., Rudra, A. (2014). Integration and Effective Management of Heterogeneous Petroleum Digital

Ecosystems Using Big Data Paradigm. PPDM Data Management Symposium, 6th August 2014, Perth, Australia.

Niu, X., Hemminger, B.M. (2010). Beyond Text Querying and Ranking List: How People are searching through Faceted

Catalogs in Two Library Environments. Proceedings of the 73rd Association for Information Science and Technology

(ASIS&T) Annual Meeting on Navigating Streams in an Information Ecosystem 2010, 47(29)

Noor, A.M., Yassin, C.Z.H. (2006). Issues, Challenges and Constraints in K-Era. Proceedings of the Knowledge Management

International Conference. Kuala Lumpur, Malaysia, 6-8th June 2006.

Norling, K., Boye, J. (2013). 2013 Findability Survey. Findability Day. Findwise, Stokholm May 2013

NSS (2014). National Statistics Service Australia Online Calculator (Accessed September 2014).

Oberle, D. (2014). How ontologies benefit enterprise applications. Semantic Web, 5(6), 473-491.

O’Donnell, M. (2011). Visualizing Patterns in Text: Keynote talk at AESLA (Spanish Association of Applied Linguistics),

University of Salamanca May 4th-6th. (Online Article, accessed September 2014).

Ohly, P.H. (2012). Actas del X Congreso ISKO Capitulo Espanol (Ferrol 2012), 541-551

Oil and Gas UK (2011). Oil and Gas UK. Exploration Economic Report 2011. Online Article (Accessed January 2015).

Olson, T.A. (2007). Utility of a faceted catalog for scholarly research. Library Hi Tech. 25(4), 550-561.

Oracle (2012). From overload to impact: An industry scorecard on big data business challenges. Online Article (Accessed

March 2013).

Outsell (2005). Survey of Knowledge Workers. Online Article (Accessed March 2013).

Painter, K., Dutton, S.J., Owens, E.O., Burgoon, L.D. (2014). Automatic Document Classification for Environmental Risk

Assessment. PeerJ PrePrints,

Palkowsky,B. (2005). A New Approach to Information Discovery – Geography Really Does Matter. Society of Petroleum

Engineers (SPE) Annual Technical Conference and Exhibition, Dallas, Texas, USA, 9-12th October 2015. Report ID:

SPE 96771

Palmer, C.R., Pesenti, J., Valdes-Perez, R.E., Christel, M.G., Hauptmann, A.G., Ng, D., Wactlar, H.D. (2001). Demonstration

of hierarchical document clustering of digital library retrieval results. Proceedings of the 1st ACM/IEEE-CS joint

conference on digital libraries, 451.

Peng, J., He, B., Ounis, I. (2009). Predicting the Usefulness of Collection Enrichment for Enterprise Search. ICTIR 2009, 366-

370.

Preece, A., Flett, A., Sleeman, D., Curry, D., Meany, N., Perry, P. (2001). Better Knowledge Management through Knowledge

Engineering. Knowledge Management IEEE Intelligent Systems, Jan/Feb 2001, 36-42

Prince, V., Roche, M. (2009). Information Retrieval in Biomedicine: Natural Language Processing for Knowledge Integration.

New York, USA, Medical Information Science Reference.

Quaadgras, A., Beath, C.M. (2011). Leveraging unstructured data to capture business value. Center for Information Systems

Research (CISR). MIT, Sloan School of Management, 11(4).

Raskin, R. (2011). National Aeronautical Space Administration (NASA) Semantic Web for Earth and Environmental

Terminology (SWEET) Ontology.

Rasmus, D.W. (2013). How IT Professionals can Embrace the Serendipity Economy. Harvard Business Review, August 19th 2013 Online Article (Accessed January 2013).

Robinson, M.A. (2010). An empirical analysis of engineers’ information behaviors. Journal of the American Society for Information Science and Technology, 61(4), 640-658.

Romero, L. (2013). Deloitte: Improving Findability in the Enterprise. APQC Knowledge Management Conference May 3rd 2013, Houston, Texas, USA.

Rose, D.G. (2010). Apache Corporation. The ECM Journey. AIIM Southwest Chapter, May 6th 2010.

Saleem, M., Kamdar, M.R., Iqbal, A., Sampath, S., Deus, H.F., Nyonga, A. (2013). Fostering Serendipity through Big Linked Data. Semantic Web Challenge (ISWC) 2013.

Salmador Sanchez, M.P., Angeles Palacios, A. (2008). Knowledge-based manufacturing enterprises: evidence from a case study. Journal of Manufacturing Technology Management, 19(4), 447-468.

Salthe, S.N. (2012). Hierarchical Structures. Axiomathes, 22, 355-383

Sarrafzadeh, B., Vechtomova, O., Jokic, V. (2014). Exploring Knowledge Graphs for Exploratory Search. IIiX August 26th-29th 2014, Regensburg, Germany.

Sasaki, Y. (2008). Automatic Text Classification. University of Manchester. Online Article (Accessed November 2014).

Schlumberger (2008). Schlumberger Oilfield Glossary. Online Resource (Accessed March 2014).

Shiri, A.A., Revie, C.W., Chowdhury, G. (2002). Thesaurus-assisted search term selection and query expansion: a review of user-centred studies. Knowledge Organization, 29(1), 1-19.

Skoglund, M., Runeson, P. (2009). Reference-based search strategies in systematic reviews. Proceedings of the 13th International Conference on Evaluation and Assessment in Software Engineering (EASE). Durham University, 20-21st April 2009, 31-40.

Smiraglia, R.P., van den Heuvel, C. (2011). Idea Collider: From a Theory of Knowledge Organization to a Theory of Knowledge Interaction. Bulletin of the American Society for Information Science and Technology, April/May 2011, 37(4), 43-47.

Smith, R. (2012). Implementing Enterprise Information Management at Marathon Oil. Gartner Portals, Content and Collaboration Summit. Track B: Content and Information Management Session B2, March 12th 2012.

Solskinnsbakk, G., Gulla, J.A. (2008). Ontological Profiles as Semantic Domain Representations. NLDB 2008, LNCS 5039, 67-78.

Spiteri, L.F. (2004). Word Association Testing and Thesaurus Construction. Library and Information Science Research Electronic Journal (LIBRES), 14(2).

Stamper, R. (1996). Signs, Information Norms and Systems. In Signs of Work, Semiosis and Information Processing in Organizations, Holmqvist et al. (Eds), Berlin: Walter de Gruyter.

Stenmark, D. (2008). Identifying clusters of user behaviour in Intranet Search Engine log files. Journal of the American Society for Information Science and Technology, 59(14), 2232-2243.

Steyvers, M., Tenenbaum, J.B. (2005). The Large-Scale Structure of Semantic Networks: Statistical Analyses and a Model of Semantic Growth. Cognitive Science, 29.

Stock, W.G. (2010). Concepts and Semantic Relations in Information Science. Journal of the American Society for Information Science and Technology, 61(10), 1951-1969.

Tonstad, K., Bjorge, E. (2003). Data Management Metrics in Statoil, Smi Data Management Presentation, London, UK.

Tudhope, D., Alani, H., Jones, C. (2001). Augmenting Thesaurus Relationships: Possibilities for Retrieval. Journal of Digital Information (JODI), 1(8).

Van Noorden, R. (2014). Scientists may be reaching a peak in reading habits. Nature International weekly journal of science, news 5th February 2014 Online Article (Accessed January 2015).

Velardi, P., Navigli, R., Martinez, S. (2012). A New Method for Evaluating Automatically Learned Terminological Taxonomies. Proceedings of the 8th Conference on International Language Resources and Evaluation (LREC 2012), May 21-27th, 2012.

Villena-Roman, J., Collada-Perez, S., Lana-Serrano, S., Gonzalez-Cristobal, J.C. (2011). Hybrid Approach Combining Machine Learning and a Rule-Based Expert System for Text Categorization. Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference, 323-328.

W3C (2009). W3C workshop on Semantic Web in Oil and Gas Industry – Report.

Walkup, G.W., Ligon, B.J. (2006). The Good, Bad and Ugly of Stage-Gate Project Management Process as Applied in the Oil and Gas Industry. Society of Petroleum Engineers (SPE) Annual Technical Conference and Exhibition, 24-27th September, San Antonio, Texas, USA. Report ID: SPE-102926-MS.

Wei, F., Liu, S., Song, Y., Pan, S., Zhou, M.X., Qian, W., Shi, L., Tan, L., Zhang, Q. (2010). TIARA: A Visual Exploratory Text Analytic System. Proceedings of ACM. Knowledge Discovery in Databases (KDD), July 25-28th Washington DC, USA.

Wessely, J. (2011). Text Analytics and Auto-Categorization in Semantic Web Applications. SemTech 2011. Online Presentation (Accessed December 2014).

White, M. (2012). Enterprise Search. 1st Edition. California: O’Reilly.

White, M. (2014). Search Strategy A-Z List of Topics. Intranet Focus, September 2014, Online Article.

Wilson, T.D. (2000). Human Information Behavior. Special Issue on Information Science Research, Informing Science, 3(2)

Yu, K., Zhang, J., Chen, M., Xu, X., Suzuki, A., Ilic, K., Tong, W. (2014). Mining hidden knowledge for drug safety assessment: topic modelling of LiverTox as a case study. BMC Bioinformatics, 15.

Zeeman, D., Jones, R., Dysart, J. (2011). Assessing Innovation in Corporate and Government Libraries. Computers in Libraries, 31(5).

Zeng, M.L. (2008). Knowledge Organization Systems (KOS). Knowledge Organization, 35(2/3).