www.semantec.de
'Google-ized' search in your business data
Search within your Oracle table data like searching the web with Google
Author: Krasen Paskalev, Certified Oracle 8i/9i DBA, Senior Oracle Consultant
Semantec GmbH, Benzstr. 32, D-71083 Herrenberg, Germany
Agenda
• Motivation
  • Applications contain valuable data
  • How difficult it is to search for it
  • How easy it is in Google
• What makes a good search engine
• Semantec's Direct Info – demo
• Direct Info concepts and architectural elements
Classical approach – substring search with LIKE
• Too complex to use
• Too slow – often results in a full table scan
• No advanced search expressions
• No text fragments
• Matches inside other words – CAT also finds APPLICATION and VACATION
• Not flexible – expensive to add or remove searchable fields
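The CAT example above can be reproduced with a few statements. This is a minimal sketch; the table words and its column word are hypothetical and not part of the original schema.

```sql
-- Hypothetical demo table (names are assumptions, not from the slides).
CREATE TABLE words (word VARCHAR2(40));

INSERT INTO words VALUES ('CAT');
INSERT INTO words VALUES ('APPLICATION');
INSERT INTO words VALUES ('VACATION');
INSERT INTO words VALUES ('DOG');

-- The leading wildcard matches 'CAT' anywhere inside a value, so the
-- query returns CAT, APPLICATION and VACATION. A pattern that starts
-- with % typically cannot use an ordinary B-tree index on word, which
-- is why such searches tend to end in a full table scan.
SELECT word
FROM   words
WHERE  word LIKE '%CAT%';
```

Every additional searchable column needs another LIKE predicate in the WHERE clause, which is the flexibility cost named in the last bullet.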
How easy it is in Google
• Results presented in pages
• Link to open the document
• Highlighted text fragments
• Full document location (document context)
How to search here?
[Figure: entity-relationship diagram of a business application schema. It shows several dozen interrelated tables (among them CUSTOMERS, PRJ, ACT, ORG, IND, USR, LOC, EXPT and GEOGRAPHY), each with its columns, primary, alternate and foreign keys, connected by 0..n relationships.]
What makes a good search engine?
• Fast search
• Order by relevance
• Options to narrow and judge the hits
• Advanced search expressions
• More information about the object hit
  • Text fragments with highlighted keywords
  • Keyword context – where the keyword is found
  • Object context – extended object information
• Search by object type
• Search within a specific object attribute
• Direct access to the object found
• Accessible to a wide user group
Direct Info
• Framework developed by Semantec
• Builds on the Oracle Text platform
• Built with pure PL/SQL
• All code is stored in Oracle
Data Model
customers
    id               NUMBER
    code             VARCHAR2(20)
    customer_type    VARCHAR2(40)
    first_name       VARCHAR2(40)
    last_name        VARCHAR2(40)
    other_names      VARCHAR(80)
    profession       VARCHAR(80)
    title            VARCHAR2(10)
    nationality      VARCHAR2(2)
    date_of_birth    DATE
    company_name     VARCHAR2(80)
    business_sector  VARCHAR2(40)
    business_phone   VARCHAR2(20)
    private_phone    VARCHAR2(20)
    mobile_phone     VARCHAR2(20)
    fax              VARCHAR2(20)
    email            VARCHAR2(80)
    web_site         VARCHAR2(80)
    remarks          VARCHAR2(1000)
addresses
    id               NUMBER
    customer_id      NUMBER
    country_code     VARCHAR2(2)
    postal_code      VARCHAR2(80)
    city             VARCHAR2(80)
    street           VARCHAR2(80)
    po_box           VARCHAR2(10)
countries
    code             VARCHAR2(2)
    name             VARCHAR2(80)
services
    id               NUMBER
    name             VARCHAR2(100)
    description      VARCHAR2(1000)
bank_accounts
    id               NUMBER
    customer_id      NUMBER
    country_code     VARCHAR2(2)
    bank_name        VARCHAR2(80)
    bank_code        VARCHAR2(20)
    account_no       VARCHAR2(20)
    remarks          VARCHAR2(1000)
contracts
    id               NUMBER
    customer_id      NUMBER
    begin_date       DATE
    end_date         DATE
    service_id       NUMBER
    remarks          VARCHAR2(1000)
Agenda
• Motivation
• What makes a good search engine
• Semantec's Direct Info – demo
• Direct Info concepts and architectural elements
  • What is Oracle Text
  • Indexing data
  • Search results presentation
What is Oracle Text?
Formerly known as ConText (8.0) and interMedia Text (8i)
Uses standard SQL to index, search and analyze text and documents stored in the Oracle database, in files and on the Web
Allows advanced searching including keyword search, pattern matching, boolean expressions, etc.
Supports multiple languages
Oracle Text Index Usage
Oracle Text index creation:
    -- USER_DATASTORE_01 is a datastore preference that must already exist
    CREATE INDEX DOC_INDEX_01 ON DOC_TABLE_01(location)
      INDEXTYPE IS CTXSYS.CONTEXT
      PARAMETERS ('DATASTORE USER_DATASTORE_01');
Oracle Text index search:
    -- The label 1 ties CONTAINS to score(1), the relevance of each hit
    SELECT doc_name FROM DOC_TABLE_01
    WHERE CONTAINS(location, 'mouse AND wireless', 1) > 0
    ORDER BY score(1) DESC;
Boolean expressions, proximity search
• AND (&) – mouse AND wireless
• OR (|) – mouse OR wireless
• NOT (~) – mouse NOT wireless
• ACCUMulate (,) – mouse, monitor, cd
• NEAR – NEAR((mouse,wireless),5)
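Each of these operators goes inside the query string of CONTAINS. The sketch below reuses DOC_TABLE_01, doc_name and the indexed column location from the index usage slide, and assumes the CONTEXT index on location already exists.

```sql
-- AND: both words must occur in the document
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, 'mouse & wireless') > 0;

-- OR: either word is enough
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, 'mouse | wireless') > 0;

-- NOT: "mouse" must occur, "wireless" must not
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, 'mouse ~ wireless') > 0;

-- ACCUM: any of the terms qualifies a document; the score accumulates
-- over the terms found, so ordering by score is the usual companion
SELECT doc_name, score(1) AS relevance
FROM   DOC_TABLE_01
WHERE  CONTAINS(location, 'mouse, monitor, cd', 1) > 0
ORDER  BY score(1) DESC;

-- NEAR: both words within a span of at most 5 words
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, 'NEAR((mouse, wireless), 5)') > 0;
```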
Expansion operators
Expansion operators widen the list of words that is searched for
• Wildcard (%, _) – only a portion of the word is given
    _ing -> sing king ping
    monito% -> monitor monitoring
• Soundex (!) – words that sound similar
    !sing -> sing sink
• Fuzzy – words that are spelled similarly
    fuzzy(sing,70,10,weight) -> sing king sink
• Stem ($) – words having the same linguistic root
    $sing -> sing sang sung
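The expansion operators are written directly into the CONTAINS query string. A short sketch, again assuming the DOC_TABLE_01 table and the CONTEXT index on location from the index usage slide; which words are actually matched depends on the indexed data and on the lexer and wordlist settings of the index.

```sql
-- Wildcard: every indexed word that starts with "monito"
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, 'monito%') > 0;

-- Soundex: words that sound like "sing"
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, '!sing') > 0;

-- Fuzzy: similarity score of at least 70, at most 10 expansions,
-- results weighted by similarity
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, 'fuzzy(sing, 70, 10, weight)') > 0;

-- Stem: words with the same linguistic root as "sing"
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, '$sing') > 0;
```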
Thesauri examples
• Theme search – ABOUT(economics)
• Broader term – BT(cat) -> animal
• Narrower term – NT(animal) -> cat dog
• Associative relation – RT(cat) -> kitten
• Translated term – TR(cat) -> cat gato
• Synonym – SYN(cat) -> cat tiger
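In a query these operators appear inside CONTAINS like any other expression. The sketch assumes the DOC_TABLE_01 table and index from the index usage slide, and a thesaurus that has been loaded and contains the relations shown above; without one, the thesaurus operators have nothing to expand. ABOUT additionally depends on theme indexing being available for the document language.

```sql
-- Theme search: documents about the concept, not just the word
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, 'ABOUT(economics)') > 0;

-- Broader term: with the example relations this also searches "animal"
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, 'BT(cat)') > 0;

-- Narrower terms: with the example relations this searches "cat" and "dog"
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, 'NT(animal)') > 0;

-- Synonyms of "cat" as defined in the thesaurus
SELECT doc_name FROM DOC_TABLE_01
WHERE  CONTAINS(location, 'SYN(cat)') > 0;
```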
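Datastore: Direct and Multi-column
Direct: the text of a single column of the table documents (doc_name, author, text) is indexed
Multi-column: several columns of documents (doc_name, author, text) are combined into one document, each value wrapped in a tag named after its column:
    <doc_name> ... <author> ... <text> ...
Allowed datatypes:
• CHAR
• VARCHAR
• VARCHAR2
• BLOB
• CLOB
• BFILE
• XMLType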
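A multi-column datastore is declared as a preference and then named when the index is created. This is a minimal sketch for the documents table of the slide; the preference name docs_multi_ds and the index name documents_multi_idx are made up for illustration, and depending on the Oracle release creating this kind of preference may require extra privileges.

```sql
BEGIN
  -- Preference that concatenates the listed columns into one document
  ctx_ddl.create_preference('docs_multi_ds', 'MULTI_COLUMN_DATASTORE');
  ctx_ddl.set_attribute('docs_multi_ds', 'COLUMNS', 'doc_name, author, text');
END;
/

-- The index is still created on one column; the datastore decides
-- what is actually fed to the indexing engine
CREATE INDEX documents_multi_idx ON documents(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE docs_multi_ds');
```

With the direct datastore no preference is needed: a CONTEXT index created on documents(text) without a DATASTORE parameter indexes that column alone.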
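Datastore: Detail and Nested
Detail: the master table documents (doc_name, author) holds one row per document; its text is spread over the rows of the detail table doc_details (doc_name, seq_no, text)
Nested: the table documents (doc_name, author, doc_nst) holds the text in the nested table column doc_nst (seq_no, text)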
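For the detail case, the datastore preference names the detail table and the columns that link and order its rows. A sketch under assumptions: the preference and index names are invented, doc_name is taken to be the primary key of documents, and the index is created on author only because the slide shows no other master column (with a detail datastore the indexed master column mainly serves as the trigger for re-indexing).

```sql
BEGIN
  ctx_ddl.create_preference('docs_detail_ds', 'DETAIL_DATASTORE');
  -- FALSE: treat detail rows as text lines rather than binary chunks
  ctx_ddl.set_attribute('docs_detail_ds', 'BINARY',        'FALSE');
  ctx_ddl.set_attribute('docs_detail_ds', 'DETAIL_TABLE',  'doc_details');
  -- Column of doc_details that points to the master row
  ctx_ddl.set_attribute('docs_detail_ds', 'DETAIL_KEY',    'doc_name');
  -- Column that orders the detail rows of one document
  ctx_ddl.set_attribute('docs_detail_ds', 'DETAIL_LINENO', 'seq_no');
  -- Column that holds the text itself
  ctx_ddl.set_attribute('docs_detail_ds', 'DETAIL_TEXT',   'text');
END;
/

CREATE INDEX documents_detail_idx ON documents(author)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE docs_detail_ds');
```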
Indexing data - Data Model
[Figure: the demo data model with the tables customers, addresses, countries, services, bank_accounts and contracts]
Indexing data: Oracle Text features
• User datastore – a PL/SQL procedure delivers the contents to be indexed
• AUTO_SECTION_GROUP – instructs Oracle to create a separate section for each XML tag and index only its value
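The two features are typically combined as sketched below. This is not Direct Info's actual code: the procedure customer_doc, the preference and section group names, and the choice of remarks as the indexed column are all assumptions made for illustration. The procedure covers only three columns of customers and does not escape characters such as < or &, which a real implementation would have to handle. Older Oracle releases also restrict which schema may own the datastore procedure.

```sql
-- Called by Oracle Text once per row to be indexed: receives the
-- rowid and appends the document for that row to the CLOB
CREATE OR REPLACE PROCEDURE customer_doc (
  rid  IN            ROWID,
  tlob IN OUT NOCOPY CLOB
) IS
  v_xml VARCHAR2(4000);
BEGIN
  SELECT '<customer>'
         || '<first_name>' || first_name || '</first_name>'
         || '<last_name>'  || last_name  || '</last_name>'
         || '<remarks>'    || remarks    || '</remarks>'
         || '</customer>'
  INTO   v_xml
  FROM   customers
  WHERE  rowid = rid;

  DBMS_LOB.WRITEAPPEND(tlob, LENGTH(v_xml), v_xml);
END;
/

BEGIN
  -- User datastore: index whatever the procedure delivers
  ctx_ddl.create_preference('customer_ds', 'USER_DATASTORE');
  ctx_ddl.set_attribute('customer_ds', 'PROCEDURE', 'customer_doc');
  -- One section per XML tag found in the delivered document
  ctx_ddl.create_section_group('customer_sg', 'AUTO_SECTION_GROUP');
END;
/

CREATE INDEX customers_ctx_idx ON customers(remarks)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE customer_ds SECTION GROUP customer_sg');
```

Because the searchable content is defined in the procedure, adding or removing a searchable field means changing the procedure and rebuilding the index, not touching every query.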
Indexing data: putting it all together
Tables (customers, addresses, countries, services, bank_accounts, contracts) -> data + metadata extraction -> XML document -> data indexing -> Oracle Text index
Extracted XML document:
<customer>
  <id>50</id>
  <code>635</code>
  <customer_type>Person</customer_type>
  <personal_data>
    <title />
    <first_name>Jurgen</first_name>
    <last_name>Claus</last_name>
    <other_names />
    <profession>Software Engineer</profession>
    <nationality>Germany</nationality>
    <date_of_birth>28.05.1935</date_of_birth>
  </personal_data>
  <addresses>
    <address>
      <country>Germany</country>
      <postal_code>80995</postal_code>
      <city>München</city>
      <street>Dachauer Str. 665</street>
      <po_box />
    </address>
    <address>
      <country>Germany</country>
      ...
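Once documents of this shape are indexed with an automatic section group, a search can be restricted to one tag with the WITHIN operator. A sketch, assuming a CONTEXT index of that kind exists on the customers table; the indexed column name remarks is a placeholder chosen for illustration. Section names follow the XML tags, including their case.

```sql
-- Customers whose last name is Claus, most relevant first
SELECT id, score(1) AS relevance
FROM   customers
WHERE  CONTAINS(remarks, 'Claus WITHIN last_name', 1) > 0
ORDER  BY score(1) DESC;

-- Combine conditions on different tags of the same document
SELECT id
FROM   customers
WHERE  CONTAINS(remarks,
                '(Claus WITHIN last_name) AND (80995 WITHIN postal_code)') > 0;
```

This corresponds to the "search within a specific object attribute" requirement from the list of search engine qualities.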
Search results presentation
• Results presented in pages
• Link to open the customer edit application
• Location of the keyword found
• Extended customer info in balloon window
• Most important info: address and contacts
• Highlighted text fragments
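Paging and highlighted fragments can both be produced in the database. The slides do not say how Direct Info implements them, so the following is only one possible approach. It assumes a CONTEXT index named customers_ctx_idx on a column remarks of customers (both names hypothetical) and uses CTX_DOC.SNIPPET, which is available in later Oracle Text releases.

```sql
-- Page 2 at 10 hits per page: rows 11 to 20 in order of relevance
SELECT id, relevance
FROM  (SELECT q.*, ROWNUM AS rn
       FROM  (SELECT id, score(1) AS relevance
              FROM   customers
              WHERE  CONTAINS(remarks, 'Claus', 1) > 0
              ORDER  BY score(1) DESC) q
       WHERE  ROWNUM <= 20)
WHERE  rn >= 11;

-- Text fragments with the keyword wrapped in <b> ... </b>
DECLARE
  v_fragment VARCHAR2(4000);
BEGIN
  -- Identify documents by rowid rather than by primary key
  ctx_doc.set_key_type('ROWID');
  FOR hit IN (SELECT ROWIDTOCHAR(rowid) AS rid
              FROM   customers
              WHERE  CONTAINS(remarks, 'Claus') > 0)
  LOOP
    v_fragment := ctx_doc.snippet('customers_ctx_idx', hit.rid,
                                  'Claus', '<b>', '</b>');
    DBMS_OUTPUT.PUT_LINE(v_fragment);
  END LOOP;
END;
/
```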
Summary
Direct Info uses Oracle Text as a solid platform for creating an advanced full-text search solution
• Powerful text search capabilities
• Advanced results presentation features
• Rich features to judge the results
• Pluggable into existing applications
Want to know more?
Company: Semantec GmbH
Name: Krasen Paskalev, Armin Singer
Address: Benzstr. 32, D-71083 Herrenberg, Germany
Telephone: +49(7032)9130-0
Telephone: +49(7032)9130-12
Fax: +49(7032)[...]
E-Mail: [...]@semantec.de
Internet: www.semantec.de
Meet us here -> booth C10 on the ground floor