IXE: Ideare Indexing Engine Ideare SpA .


Transcript of IXE: Ideare Indexing Engine Ideare SpA .

Page 1: IXE: Ideare Indexing Engine Ideare SpA .

IXE: Ideare Indexing Engine

Ideare SpA

www.ideare.com

Page 2: IXE: Ideare Indexing Engine Ideare SpA .

Keep it as simple as possible but not simpler.

Albert Einstein

Page 3: IXE: Ideare Indexing Engine Ideare SpA .

History

6/96 IOL-University cooperation
10/96 Arianna: first SE for Italian Web
1/98 EUROsearch project
10/98 WebNet: Categorization by Context
10/98 Automated Arianna catalogue
11/98 WWW8 paper on Categorization
1/99 Ideare spin-off
3/00 Tiscali purchases 60% of Ideare
6/01 First release of IXE
01-02 Large scale deployment

Page 4: IXE: Ideare Indexing Engine Ideare SpA .

Goals

Specialized tool (indexing and search)
C++ framework with high-level primitives
– Applications built with few lines of C++
– Specialization by inheritance
High performance
Scalability
Simple to maintain

Page 5: IXE: Ideare Indexing Engine Ideare SpA .

Approach

Quick and clean
Significant effort in designing best abstractions
Refined through extensive usage

Page 6: IXE: Ideare Indexing Engine Ideare SpA .

Fundamental Ideas

Rely on hardware caching and mmap
– Keep data as compact as possible
– Structure on disk same as used by algorithms
Rely on good data structures and algorithms
– STL
Specialize data structures
– For indexing
– For search
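
The idea of keeping the on-disk structure identical to the one used by the algorithms can be illustrated with a minimal POSIX sketch. The Posting record, the MappedPostings class and the writePostings helper are hypothetical names introduced only for this illustration; they are not part of IXE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical fixed-size record: the same layout is used on disk and in memory.
struct Posting {
    std::uint32_t doc;
    std::uint32_t position;
};

// Read-only view over a file of Posting records, backed by mmap.
// No parsing or copying: the algorithms index straight into the mapping.
class MappedPostings {
public:
    explicit MappedPostings(const char* path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        if (fstat(fd, &st) == 0 && st.st_size > 0) {
            std::size_t bytes = static_cast<std::size_t>(st.st_size);
            void* p = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const Posting*>(p);
                bytes_ = bytes;
            }
        }
        close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedPostings() {
        if (data_) munmap(const_cast<Posting*>(data_), bytes_);
    }
    MappedPostings(const MappedPostings&) = delete;
    MappedPostings& operator=(const MappedPostings&) = delete;

    std::size_t size() const { return bytes_ / sizeof(Posting); }
    const Posting& operator[](std::size_t i) const { return data_[i]; }

private:
    const Posting* data_ = nullptr;
    std::size_t bytes_ = 0;
};

// Helper for the example: writes records to disk in their in-memory layout.
inline bool writePostings(const char* path, const Posting* items, std::size_t n) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::size_t written = std::fwrite(items, sizeof(Posting), n, f);
    return std::fclose(f) == 0 && written == n;
}
```

Paging is left to the operating system, so frequently used parts of the file stay in memory without an application-level cache.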

Page 7: IXE: Ideare Indexing Engine Ideare SpA .

Indexing

Posting lists are created in memory
– Provide as much memory as possible to indexing machines
When size of lists reaches a threshold, dump partial index to disk
Perform final merging of partial indexes
Merging operation used also for:
– Incremental indexing
– Distributed indexing
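
The build, dump and merge cycle can be sketched as follows. This is an illustration under stated assumptions, not IXE code: the Indexer class, the PartialIndex type and mergeIndexes are hypothetical names, partial indexes are kept in a vector instead of being written to disk, and documents are assumed to be added in increasing ID order so every posting list stays sorted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

using DocId = std::uint32_t;
using PartialIndex = std::map<std::string, std::vector<DocId>>;

// Builds posting lists in memory and flushes them as a partial index
// once the number of postings reaches a threshold.
class Indexer {
public:
    explicit Indexer(std::size_t maxPostings) : maxPostings_(maxPostings) {}

    void add(const std::string& term, DocId doc) {
        std::vector<DocId>& list = current_[term];
        if (list.empty() || list.back() != doc) {
            list.push_back(doc);
            ++count_;
        }
        if (count_ >= maxPostings_) flush();
    }

    // Stand-in for "dump partial index to disk".
    void flush() {
        if (current_.empty()) return;
        partials_.push_back(std::move(current_));
        current_.clear();
        count_ = 0;
    }

    const std::vector<PartialIndex>& partials() const { return partials_; }

private:
    std::size_t maxPostings_;
    std::size_t count_ = 0;
    PartialIndex current_;
    std::vector<PartialIndex> partials_;
};

// Merges partial indexes term by term; each posting list must be sorted.
// The same routine would serve incremental and distributed indexing.
PartialIndex mergeIndexes(const std::vector<PartialIndex>& partials) {
    PartialIndex result;
    for (const PartialIndex& part : partials) {
        for (const auto& entry : part) {
            std::vector<DocId>& target = result[entry.first];
            std::vector<DocId> merged;
            merged.reserve(target.size() + entry.second.size());
            std::set_union(target.begin(), target.end(),
                           entry.second.begin(), entry.second.end(),
                           std::back_inserter(merged));
            target.swap(merged);
        }
    }
    return result;
}
```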

Page 8: IXE: Ideare Indexing Engine Ideare SpA .

Colors

Generalization of Google hits properties (anchor, size, capitalization)
Similar to Fulcrum zones
Used for ranking
– E.g. title words contribute more to rank of document
and selective queries
– text matches author = attardi

Page 9: IXE: Ideare Indexing Engine Ideare SpA .

Early 2001

IXE released
Ideare starts deployment
– June: Italian Web (50 Mil. documents) served by 3 PCs with IXE
– Fall: expanded to Germany, France, Switzerland
– Fall: Video, Image, Shopping search on IXE
October: evaluation and negotiations with major German search portal

Page 10: IXE: Ideare Indexing Engine Ideare SpA .

Overall

IXE runs on PCs (better than Solaris or Alpha)
Fully self-contained library
Its own multithreaded server
Distributed crawler
Distributed indexing and merge
Parallel search
Web Service architecture
.NET managed code interface

Page 11: IXE: Ideare Indexing Engine Ideare SpA .

Features

Full text + phrase + proximity
Boolean queries
Colors: HTML, XML tags
Multiple collections
Incremental indexing
Scalability:
– TeraByte collections
– Distributed multithreaded servers

Page 12: IXE: Ideare Indexing Engine Ideare SpA .

Features (2)

Pluggable Document Readers: Office, PDF
Compressed document cache
Document snippets with highlights
Programmable query syntax
Clustering of results (prototype)

Page 13: IXE: Ideare Indexing Engine Ideare SpA .

Technology

C++ OO architecture
Fast indexing
– Sort-based inversion
Fast search
– Efficient algorithms and data structures
– Query Compiler
• Small Adaptive Set Intersection
– Suffix array with supra index
– Memory mapped index files
Programmable API library
Template metaprogramming
Full Object Data Base
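
Sort-based inversion can be sketched in a few lines: collect (term, document, position) occurrences, sort them, then cut the sorted run into posting lists. The Occurrence and PostingList types and the invert function are hypothetical names for this illustration and do not reflect IXE's actual data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// One word occurrence as seen while scanning documents.
struct Occurrence {
    std::string term;
    std::uint32_t doc;
    std::uint32_t pos;
};

// Posting list for one term: sorted, duplicate-free document IDs.
struct PostingList {
    std::string term;
    std::vector<std::uint32_t> docs;
};

// Sorts occurrences by (term, doc, pos), then groups equal terms.
std::vector<PostingList> invert(std::vector<Occurrence> occurrences) {
    std::sort(occurrences.begin(), occurrences.end(),
              [](const Occurrence& a, const Occurrence& b) {
                  return std::tie(a.term, a.doc, a.pos) <
                         std::tie(b.term, b.doc, b.pos);
              });
    std::vector<PostingList> lists;
    for (const Occurrence& o : occurrences) {
        if (lists.empty() || lists.back().term != o.term)
            lists.push_back(PostingList{o.term, {}});
        std::vector<std::uint32_t>& docs = lists.back().docs;
        if (docs.empty() || docs.back() != o.doc) docs.push_back(o.doc);
    }
    return lists;
}
```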

Page 14: IXE: Ideare Indexing Engine Ideare SpA .

Architecture

[Diagram. Components: Gatherer, Table<DocInfo>, Indexer, Lexicon, Postings, Hit Lists, DocStore, Berkeley DB, mmap, local cache. DocInfo records carry the fields name, time, size; extended records also carry title, summary, type.]

Page 15: IXE: Ideare Indexing Engine Ideare SpA .

Architecture

[Diagram. Components: Gatherers (.html, .doc, .pdf, .ps, .txt), Indexers, Index, Posting, DocStore, Multithread Query.]

Page 16: IXE: Ideare Indexing Engine Ideare SpA .

Storing Objects in Relational Tables

SQL
create table video (
    name varchar(256),
    caption varchar(2048),
    format INT,
    PRIMARY KEY(name))

Page 17: IXE: Ideare Indexing Engine Ideare SpA .

Template Metaprogramming

class Video : public DocInfo {
    char* name;
    char* caption;
    int format;

    META(Video, (SUPERCLASS(DocInfo),
                 VARKEY(name, 256),
                 VARFIELD(caption, 2048),
                 FIELD(format)));
};

Page 18: IXE: Ideare Indexing Engine Ideare SpA .

Programming Applications (C++)

Collection<Video> videos("CNN");
videos.insert(video1);

Query q("caption MATCHES Jordan and format=wav");

Cursor<Video> cursor(videos, q);

while (cursor.hasNext())
    cout << cursor.get();

Page 19: IXE: Ideare Indexing Engine Ideare SpA .

Small Adaptive Set Intersection

Query compiler
– One cursor on posting lists for each node
– CursorWord, CursorAnd, CursorOr, CursorPhrase
QueryCursor.next(Result& min)
– Returns first result r >= min
Single operator for all kinds of queries: e.g. proximity

Page 20: IXE: Ideare Indexing Engine Ideare SpA .

SASI example

Query: world wide web
Posting lists shown on the slide:
3 9 12 20 40 47
1 8 10 40 41
2 4 6 21 40
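
The three posting lists of this example can be intersected with cursors that expose a single next(min) operation. The sketch below is a simplified round-robin ("leapfrog") intersection in the spirit of the cursor design described earlier; it is not claimed to be IXE's actual algorithm. Result is reduced to a plain document ID, next takes its argument by value, and kEnd and collect are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <memory>
#include <utility>
#include <vector>

using Result = std::uint32_t;  // simplification: a result is a document ID
const Result kEnd = std::numeric_limits<Result>::max();  // "no more results"

class QueryCursor {
public:
    virtual ~QueryCursor() = default;
    // Returns the first result r >= min, or kEnd when exhausted.
    virtual Result next(Result min) = 0;
};

// Cursor over the sorted posting list of a single word.
class CursorWord : public QueryCursor {
public:
    explicit CursorWord(std::vector<Result> postings)
        : postings_(std::move(postings)) {}
    Result next(Result min) override {
        auto start = postings_.begin() + static_cast<std::ptrdiff_t>(pos_);
        auto it = std::lower_bound(start, postings_.end(), min);
        pos_ = static_cast<std::size_t>(it - postings_.begin());
        return it == postings_.end() ? kEnd : *it;
    }
private:
    std::vector<Result> postings_;
    std::size_t pos_ = 0;  // never moves backwards
};

// Intersection: children are asked in turn for a result >= candidate
// until all of them agree on the same value.
class CursorAnd : public QueryCursor {
public:
    explicit CursorAnd(std::vector<std::unique_ptr<QueryCursor>> children)
        : children_(std::move(children)) {}
    Result next(Result min) override {
        if (children_.empty()) return kEnd;
        Result candidate = min;
        std::size_t agreed = 0;  // consecutive children that returned candidate
        std::size_t i = 0;
        while (agreed < children_.size()) {
            Result r = children_[i]->next(candidate);
            if (r == kEnd) return kEnd;
            if (r == candidate) {
                ++agreed;
            } else {
                candidate = r;  // skip ahead; this child already agrees
                agreed = 1;
            }
            i = (i + 1) % children_.size();
        }
        return candidate;
    }
private:
    std::vector<std::unique_ptr<QueryCursor>> children_;
};

// Drains a cursor by repeatedly asking for the next result after the last one.
std::vector<Result> collect(QueryCursor& cursor) {
    std::vector<Result> out;
    for (Result r = cursor.next(0); r != kEnd; r = cursor.next(r + 1))
        out.push_back(r);
    return out;
}
```

On the lists above the candidate jumps 3, 8, 21, 40, and 40 is the only value present in all three lists, so most postings are never visited.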

Page 21: IXE: Ideare Indexing Engine Ideare SpA .

Performance

Page 22: IXE: Ideare Indexing Engine Ideare SpA .

Comparison (single node)

Columns: Indexing Time; Search Speed ¹; Programmability; Excerpts; Proximity Rank; Ranking
(values listed in slide order; some cells did not survive extraction)

IXE: 2 GB/h; 30 q/s; C++ API; Link popularity
Fulcrum: 0.7 GB/h; 6 q/s; C API; no; no; no
Google: ?; 1 q/s; C, python; PageRank
Fast: 1-2 GB/h; 3 q/s; C; planned; FirstPage
SharePoint: 0.2 GB/h; 3 q/s; C++; no; ?; ?
Verity: 0.2 GB/h; 4 q/s; no; ?; ?

¹ 2 million documents

Page 23: IXE: Ideare Indexing Engine Ideare SpA .

Comparison (2)

Columns: Paragraph indexing; Color Search; Column Search; Max doc size; O.S.
(values listed in slide order; some cells did not survive extraction)

IXE: no limit; Linux, Windows, Alpha, Solaris
Fulcrum: no; limited; 64 K; Windows, Linux, Solaris
Google: no; ?; limited; 4 K; Linux
Fast: no; ?; ?; ?; NetBSD
SharePoint: no; no; no; ?; Windows

Page 24: IXE: Ideare Indexing Engine Ideare SpA .

An independent benchmark

[Chart: Indexing (Intel) and Retrieval (Intel), AltaVista vs. IXE; vertical axis from 0 to 250.]

Page 25: IXE: Ideare Indexing Engine Ideare SpA .

Independent evaluations

Major portal, Germany
Major portal, France
Major portal, Italy
– Stress test with 300 concurrent queries
– Verity crashed in several cases
Microsoft Redmond

Page 26: IXE: Ideare Indexing Engine Ideare SpA .

IXE in use

Janas
– 150 Million documents
– 50 Million documents per server:
• Pentium III, 1 GHz, 2 GB RAM, 2x75 GB IDE
– Italy: 3 PCs, 300 K queries/day
KataWeb
– largest Italian Web portal
– 4 GB documents
– 2nd largest Italian newspaper

Page 27: IXE: Ideare Indexing Engine Ideare SpA .

Other Features

Snippets
Document cache
Colors
Multiple collections
– Sorted by page rank
– Authoritativeness
– Popularity
Filter/Group by similarity
Conceptual Clustering

Page 28: IXE: Ideare Indexing Engine Ideare SpA .

Snippets

Adaptive algorithm:
– Compiled regular expression search for few words
– Karp-Rabin algorithm for several words
Customizable on length of snippets, proximity of hits, etc.
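
For the several-words case, a Karp-Rabin search can be sketched as one rolling-hash pass per distinct word length, with every hash match verified against the text. The function findWords, the kr namespace and the hash constants are assumptions made for this illustration; how IXE selects and trims snippets around the hits is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

namespace kr {
const std::uint64_t kBase = 257;
const std::uint64_t kMod = 1000000007ULL;

// Polynomial hash of s[start, start + len).
std::uint64_t hashOf(const std::string& s, std::size_t start, std::size_t len) {
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < len; ++i)
        h = (h * kBase + static_cast<unsigned char>(s[start + i])) % kMod;
    return h;
}
}  // namespace kr

// Returns (offset, word) for every occurrence of any word in text.
std::vector<std::pair<std::size_t, std::string>> findWords(
        const std::string& text, const std::vector<std::string>& words) {
    // Group word hashes by word length: one scan of the text per length.
    std::map<std::size_t, std::unordered_multimap<std::uint64_t, std::string>> byLength;
    for (const std::string& w : words)
        if (!w.empty()) byLength[w.size()].emplace(kr::hashOf(w, 0, w.size()), w);

    std::vector<std::pair<std::size_t, std::string>> hits;
    for (const auto& group : byLength) {
        const std::size_t len = group.first;
        if (len > text.size()) continue;
        std::uint64_t high = 1;  // kBase^(len-1) mod kMod
        for (std::size_t i = 1; i < len; ++i) high = (high * kr::kBase) % kr::kMod;
        std::uint64_t h = kr::hashOf(text, 0, len);
        for (std::size_t pos = 0;; ++pos) {
            auto range = group.second.equal_range(h);
            for (auto it = range.first; it != range.second; ++it)
                if (text.compare(pos, len, it->second) == 0)  // rule out collisions
                    hits.emplace_back(pos, it->second);
            if (pos + len >= text.size()) break;
            // Slide the window: drop text[pos], append text[pos + len].
            std::uint64_t out = (static_cast<unsigned char>(text[pos]) * high) % kr::kMod;
            h = (h + kr::kMod - out) % kr::kMod;
            h = (h * kr::kBase + static_cast<unsigned char>(text[pos + len])) % kr::kMod;
        }
    }
    std::sort(hits.begin(), hits.end());
    return hits;
}
```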

Page 29: IXE: Ideare Indexing Engine Ideare SpA .

Programmable Query Syntax

Typical Search Options
– By document type (e.g. HTML, PDF, DOC)
– By color (e.g. title, author)
– Within site or domain (through prefix search on URL)

Page 30: IXE: Ideare Indexing Engine Ideare SpA .

Result Ranking

Based on combination of measures
– Classical IR
– Authoritativeness
– Link popularity
– Prioritized collections
Clients can provide their own criteria
– Pay for placement
– Adult filter
– Freshness, etc.

Page 31: IXE: Ideare Indexing Engine Ideare SpA .

Ranking Measures

IR rank
– Based on frequencies (tf, idf)
– cosine, Okapi (Robertson), Amati
Best Trec10 score: 0,22% relevance
IXE uses simplified cosine with additional scoring factors:
– Colors (presence in title, heading, etc.)
– Proximity for multiple words
– Capitalization/font possible (Google)
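
How a tf-idf score might be combined with color and proximity factors is sketched below. The slides do not give IXE's formula, so everything here is an assumption chosen for illustration: the color weights, the square-root length normalization, the proximityBoost parameter and the names TermMatch, colorWeight and rankScore.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical colors; a real index could have one per HTML or XML tag.
enum class Color { Body, Heading, Title };

struct TermMatch {
    double tf;       // occurrences of the query term in the document
    double docFreq;  // number of documents containing the term
    Color color;     // color of the strongest occurrence
};

// Assumed weights: title words count more than heading or body words.
double colorWeight(Color c) {
    switch (c) {
        case Color::Title:   return 3.0;
        case Color::Heading: return 2.0;
        case Color::Body:    return 1.0;
    }
    return 1.0;
}

// Cosine-style score: sum of tf * idf * color weight, normalized by document
// length and scaled by a boost for query words that occur close together.
double rankScore(const std::vector<TermMatch>& matches, double numDocs,
                 double docLength, double proximityBoost = 1.0) {
    if (docLength <= 0.0) return 0.0;
    double sum = 0.0;
    for (const TermMatch& m : matches) {
        if (m.docFreq <= 0.0) continue;
        double idf = std::log(1.0 + numDocs / m.docFreq);
        sum += m.tf * idf * colorWeight(m.color);
    }
    return proximityBoost * sum / std::sqrt(docLength);
}
```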

Page 32: IXE: Ideare Indexing Engine Ideare SpA .

Authoritative score

Link popularity
– Based on incoming link count
Reference from authoritative site (e.g. Dmoz)
– Increase document rank
– Descriptions from Dmoz are added to document with special color
Citations (i.e. text surrounding link)
– Added to document with special color

Page 33: IXE: Ideare Indexing Engine Ideare SpA .

Priority rank

Documents are arranged in several collections
Collections are searched in order
Earlier collections contain higher rank documents
Tunable cutoff at 4000 documents
Statistical estimate of overall number of results
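
Searching prioritized collections with a cutoff can be sketched as below. The Collection and SearchOutcome types and prioritizedSearch are hypothetical, and the estimate shown (scaling the hit ratio of the searched collections to the whole corpus) is only one plausible reading of "statistical estimate"; the slides do not say how IXE computes it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// A collection is modelled by its size and a callable that runs the query on it.
struct Collection {
    std::size_t size;                                    // documents in the collection
    std::function<std::vector<std::uint32_t>()> search;  // matching document IDs
};

struct SearchOutcome {
    std::vector<std::uint32_t> results;  // at most `cutoff` results, best collections first
    std::size_t estimatedTotal;          // extrapolated number of matches overall
};

// Searches collections in priority order and stops once the cutoff is reached.
SearchOutcome prioritizedSearch(const std::vector<Collection>& collections,
                                std::size_t cutoff) {
    SearchOutcome out{{}, 0};
    std::size_t totalDocs = 0;
    for (const Collection& c : collections) totalDocs += c.size;

    std::size_t searchedDocs = 0;
    std::size_t found = 0;
    for (const Collection& c : collections) {
        if (out.results.size() >= cutoff) break;  // later collections are skipped
        std::vector<std::uint32_t> hits = c.search();
        found += hits.size();
        searchedDocs += c.size;
        for (std::uint32_t doc : hits) {
            if (out.results.size() >= cutoff) break;
            out.results.push_back(doc);
        }
    }
    // Assumed estimate: hit ratio so far, scaled to the whole corpus.
    out.estimatedTotal = searchedDocs == 0 ? 0 : found * totalDocs / searchedDocs;
    return out;
}
```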

Page 34: IXE: Ideare Indexing Engine Ideare SpA .

Custom rank

IR rank is computed from data in lexicon (word based)
Cosine, authoritativeness, custom rank are document related
Accessing document data during search is a drag on performance
Solution: associate direct access info (mmapped)

Page 35: IXE: Ideare Indexing Engine Ideare SpA .

Nested Objects

class WebInfo : public DocInfo {
    CompressedText<65535> text;
    RankWeight weights;

    META(WebInfo,
         (SUPERCLASS(DocInfo),
          FIELD(text),
          KEY(weights, mapped)));
};

Page 36: IXE: Ideare Indexing Engine Ideare SpA .

Custom Rank Nested Object

struct RankWeight
{
    int importance;
    int popularity;
    int freshness;
    int adult;
    …
};

Page 37: IXE: Ideare Indexing Engine Ideare SpA .

Scalability

Distributed Indexing
– Performed on spidering machines
– Merged indexes
Server farm of cheap PCs
– 1.2 GHz Athlon or Pentium
– 2 GB RAM
– 2 x 75 GB disks
12 h indexing cycle for 50 million documents on 8 PCs

Page 38: IXE: Ideare Indexing Engine Ideare SpA .

Query processing

Query broker
– Dispatches query
– Merge sort of results
– Maintains cache of results
IFL (Local Inverted File Partition)
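
The three broker duties (dispatch, merge, cache) can be sketched as follows. This is an illustration under assumptions: Hit, Partition and QueryBroker are hypothetical names, each partition is a callable returning hits already sorted by descending score, and dispatch is sequential here where a real broker would query the partitions in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Hit {
    std::uint32_t doc;
    double score;
};

// A partition answers a query with hits sorted by descending score.
using Partition = std::function<std::vector<Hit>(const std::string&)>;

class QueryBroker {
public:
    explicit QueryBroker(std::vector<Partition> partitions)
        : partitions_(std::move(partitions)) {}

    // Dispatches the query to every partition, merges the sorted partial
    // results, and caches the merged list per query string.
    const std::vector<Hit>& search(const std::string& query) {
        auto cached = cache_.find(query);
        if (cached != cache_.end()) return cached->second;

        auto byScoreDesc = [](const Hit& a, const Hit& b) { return a.score > b.score; };
        std::vector<Hit> merged;
        for (const Partition& partition : partitions_) {
            std::vector<Hit> part = partition(query);
            std::vector<Hit> next;
            next.reserve(merged.size() + part.size());
            std::merge(merged.begin(), merged.end(), part.begin(), part.end(),
                       std::back_inserter(next), byScoreDesc);
            merged.swap(next);
        }
        return cache_.emplace(query, std::move(merged)).first->second;
    }

private:
    std::vector<Partition> partitions_;
    std::map<std::string, std::vector<Hit>> cache_;
};
```

The cache in this sketch never evicts entries; a production broker would bound it and invalidate it after each indexing cycle.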

Page 39: IXE: Ideare Indexing Engine Ideare SpA .

Distributed Crawler

Page 40: IXE: Ideare Indexing Engine Ideare SpA .

Distributed Crawler

High performance
– ~120 pages/sec on single node
Scalable
Fault tolerant
Collects data for link popularity, citations
Handles several document formats

Page 41: IXE: Ideare Indexing Engine Ideare SpA .

Crawler Architecture

[Diagram. Components: Crawler, Scheduler, Retriever (several instances), Parser, Cache, CrawlInfo, select(), Table<UrlInfo>, Citations, Hosts, Robots, Host queues.]

Page 42: IXE: Ideare Indexing Engine Ideare SpA .

Web Service Support

Page 43: IXE: Ideare Indexing Engine Ideare SpA .

C# integration

Managed code indexer DLL
Managed objects for controlling indexing:
– CollectionInfo
– Gatherer
– Gathered
WebForm GUI

Page 44: IXE: Ideare Indexing Engine Ideare SpA .

GUI Architecture

[Diagram. Components: GUIControl, CollectionInfo, Gatherer, table<Gathered>, Collection Builder, *.coll, Serialize, copy, cache, Converter, CollectionEnumerator, table<WebInfo>, WebIndexer, UnManaged. Gathered records carry the fields URL, ID, time, size, MD5, lastSeen; WebInfo records carry name, time, size.]

Page 45: IXE: Ideare Indexing Engine Ideare SpA .

Web Search Service

High performance search engine library
C++ template library
Handles Terabyte of data
Available as Web Service

IndeXing Engine