SQLBits X SQL Server 2012 Rich Unstructured Data

Taking SQL Server Beyond Relational Into the Realm of Unstructured Data Management Michael Rys Principal Program Manager @SQLServerMike


SQLBits X Training Day Presentation on SQL Server 2012 FileStream, FileTable, FullText Search and Semantic Search

Copyright (c) Microsoft Corp.

Transcript of SQLBits X SQL Server 2012 Rich Unstructured Data

Page 1: SQLBits X SQL Server 2012 Rich Unstructured Data

Taking SQL Server Beyond Relational Into the Realm of Unstructured Data Management

Michael Rys

Principal Program Manager

@SQLServerMike

Page 2: SQLBits X SQL Server 2012 Rich Unstructured Data

Unstructured Data in SQL Server

80% of all data is not stored in databases! Most of it is “unstructured”

Make SQL Server the preferred choice for managing Unstructured Data and allow building Rich Application Experience on top

Page 3: SQLBits X SQL Server 2012 Rich Unstructured Data

Rich Unstructured Data in SQL Server 2012

Address important customer requests for Capabilities and rich services for Rich Unstructured Data (RUDS)

Scale up for storage and search to 100M to 500M documents

Easy use/access to unstructured data from all applications

Rich insight into unstructured data to make better decisions

Page 4: SQLBits X SQL Server 2012 Rich Unstructured Data

Rich Unstructured Data & Services Ecosystem

[Architecture diagram. Rich services: Fulltext Search and Semantic Similarity Search. Scale-up solutions: a database whose storage spans multiple containers (Disk1, Disk2, Disk3). Access paths: SQL apps and database applications use transactional access to Blobs, DB FileStreams, and FileTable; Windows apps use SMB share files/folders and the FileStream API for streaming Win32 access; customer applications use the SQL RBS API (Remote BLOB Storage) with provider libraries for Azure, Centera, and SQL FILESTREAM, backed by Azure, Centera, or a SQL DB. Across all of these: integrated backup/replication/AlwaysOn and integrated administration.]

Page 5: SQLBits X SQL Server 2012 Rich Unstructured Data

RBS Example Workflow

[Diagram components: Application, RBS Client Library, BLOB Store Provider Library, BLOB Store, and SQL Server, with a machine boundary between them.]

Example row in SQL Server:

ClaimID ClaimDate PhotoRef

4390 6/5/2007 <Binary(20)>

Workflow steps:

1. Write BLOB (Photo)

2. Return Blob ID

3. Write Blob ID to PhotoRef field

RBS Services:

• Create

• Fetch

• GC

• Delete
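The three workflow steps can be mimicked with an in-memory stand-in. This is a minimal sketch, not the RBS client library: `InMemoryBlobStore`, `store_claim_photo`, and the dictionary used as the claims table are hypothetical names made up for illustration. Only the flow (write the BLOB, get a blob ID back, save that ID in the PhotoRef column) comes from the slide.

```python
import hashlib


class InMemoryBlobStore:
    """Hypothetical stand-in for an external BLOB store (not a real RBS provider)."""

    def __init__(self):
        self._blobs = {}

    def create(self, data: bytes) -> bytes:
        # Steps 1 and 2: write the BLOB and hand back an opaque blob ID.
        # A 20-byte SHA-1 digest is an assumption, chosen only to mirror
        # the Binary(20) PhotoRef column shown on the slide.
        blob_id = hashlib.sha1(data).digest()
        self._blobs[blob_id] = data
        return blob_id

    def fetch(self, blob_id: bytes) -> bytes:
        # Fetch service: resolve a blob ID back to its bytes.
        return self._blobs[blob_id]


def store_claim_photo(store, claims, claim_id, claim_date, photo):
    """Store the photo externally and keep only its reference in the row."""
    blob_id = store.create(photo)
    # Step 3: the database row holds the blob ID, not the photo itself.
    claims[claim_id] = {"ClaimDate": claim_date, "PhotoRef": blob_id}
    return blob_id
```

The point of the design is that the relational row stays small and the database only needs link-level consistency with the external store, which is why RBS also lists GC and Delete among its services.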

Page 6: SQLBits X SQL Server 2012 Rich Unstructured Data

RBS – Create and Read Blob

// Store a new blob.

byte[] myBlobId;

SqlRemoteBlobContext blobContext = new SqlRemoteBlobContext(sqlConn);

using (SqlRemoteBlob newBlob = blobContext.CreateNewBlob()) {

    // Write to a System.IO.Stream object.

    newBlob.Write(…);

    newBlob.Close();

    myBlobId = newBlob.BlobId;

}

// Alternative way to write (inside the using block, in place of Write).

newBlob.WriteFromStream(inputStream);

Page 7: SQLBits X SQL Server 2012 Rich Unstructured Data

RBS – Create and Read Blob (Continued)

// Add a new row including the blob ID to the database

// table.

// Fetch the blob.

using (SqlRemoteBlob existingBlob = blobContext.OpenBlob(myBlobId)) {

    // Read from System.IO.Stream object.

    existingBlob.Read(...);

}

// Alternative way to read (inside the using block, in place of Read).

existingBlob.ReadToStream(outputStream);

Page 8: SQLBits X SQL Server 2012 Rich Unstructured Data

Filestream

Storage Attribute on VARBINARY(MAX)

Works with integrated FTS

Unstructured data stored directly in the file system (requires NTFS)

Dual Programming Model

TSQL (Same as SQL BLOB)

Win32 Streaming APIs with T-SQL transactional semantics

Data Consistency

Integrated Manageability

Back Up/Restore

Administration

Size limit is the file system volume size

SQL Server Security Stack

[Diagram: Store BLOBs in DB + File System, showing Application, BLOB, and DB.]

Page 9: SQLBits X SQL Server 2012 Rich Unstructured Data

TSQL FILESTREAM API

-- New TSQL function:
-- GET_FILESTREAM_TRANSACTION_CONTEXT()

SELECT GET_FILESTREAM_TRANSACTION_CONTEXT()

-- New TSQL function:
-- PathName()

SELECT ClaimImage.PathName()
FROM Insurancedb..Claims

Page 10: SQLBits X SQL Server 2012 Rich Unstructured Data

Managed SqlFileStream: READ

// New SqlFileStream class in VS08 SP1 (namespace System.Data.SqlTypes)
using (SqlFileStream sfs = new SqlFileStream(path, txnId, System.IO.FileAccess.Read))
// output file to read into
using (System.IO.FileStream fs = new System.IO.FileStream("c:\\output2.jpg", System.IO.FileMode.Create))
{
    byte[] buffer = new byte[512 * 1024];
    int cbBytesRead;
    // Read returns 0 only at end of stream; a short read does not mean EOF.
    while ((cbBytesRead = sfs.Read(buffer, 0, buffer.Length)) > 0)
    {
        fs.Write(buffer, 0, cbBytesRead);
    }
}

Page 11: SQLBits X SQL Server 2012 Rich Unstructured Data

Managed SqlFileStream: WRITE

using (SqlFileStream sfs = new SqlFileStream(path, txnId, System.IO.FileAccess.Write))
using (System.IO.Stream res = Pictures.GetResourceStream(HealthCare.MRI.JoeSmith))
{
    byte[] buffer = new byte[512 * 1024];
    int cbBytesRead;
    // Copy until the source reports end of stream (0 bytes read).
    while ((cbBytesRead = res.Read(buffer, 0, buffer.Length)) > 0)
    {
        sfs.Write(buffer, 0, cbBytesRead);
    }
}

// commit SQL transaction and close SQL connection.
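A buffered copy loop of this kind should end on a zero-byte read, because a stream may legally return fewer bytes than requested before it reaches the end. The sketch below shows the pattern in Python. `ShortReadStream` and `copy_stream` are hypothetical helpers written for this illustration and have nothing to do with the SqlFileStream API itself.

```python
import io


class ShortReadStream(io.RawIOBase):
    """Test stream that returns at most `max_chunk` bytes per read call."""

    def __init__(self, data: bytes, max_chunk: int):
        super().__init__()
        self._inner = io.BytesIO(data)
        self._max_chunk = max_chunk

    def readable(self):
        return True

    def readinto(self, b):
        # Deliberately deliver less than the caller asked for.
        chunk = self._inner.read(min(len(b), self._max_chunk))
        b[:len(chunk)] = chunk
        return len(chunk)


def copy_stream(src, dst, buffer_size=512 * 1024):
    """Copy src to dst in chunks and return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(buffer_size)
        if not chunk:  # an empty result is the only end-of-stream signal
            break
        dst.write(chunk)
        total += len(chunk)
    return total
```

A loop that stops as soon as a read comes back shorter than the buffer would truncate the copy after the first chunk when reading from `ShortReadStream`.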

Page 12: SQLBits X SQL Server 2012 Rich Unstructured Data

Integrated Management of documents in SQL Server 2012

demo

Page 13: SQLBits X SQL Server 2012 Rich Unstructured Data

FILETABLE Overview

FileTable: A Table of Files/Directories

User created Table with a fixed schema

contains FILESTREAM and File Attributes

Each row represents a File or a Directory

System defined constraints maintain the tree integrity

File/Directory hierarchy view through a Windows Share

Supports Win32 APIs for File/Directory Management

DB Storage is Transparent to Win32 applications

SMB level of application compatibility

Virtual network name (VNN) path support for transparent Win32 application failover

[Diagram: FileTable folder hierarchy as seen through the FILESTREAM share. The share MSSQLSERVER contains the database directories Private Docs (Database1) and Office Docs (Database2). Below the database directories sit the FileTable directories LogFiles, Documents, and Media, and each FileTable directory holds a user-defined directory structure. Example path: \\my_machine\MSSQLSERVER\Office Docs\Documents]

Page 14: SQLBits X SQL Server 2012 Rich Unstructured Data

Creating a FileTable

Pre-requisites

Enable FILESTREAM

Create FILESTREAM Share and Filegroup

Enable non-transactional access at the DB level

ALTER DATABASE Contoso SET FILESTREAM (NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'Contoso')

Create FileTable

CREATE TABLE Contoso..Documents AS FILETABLE WITH (FILETABLE_DIRECTORY = N'Document Library')

Access at \\<machine name>\<FILESTREAM share>\Contoso\Document Library\
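The access path is simply the four names joined in order: machine, FILESTREAM share, database directory, FileTable directory. As a small sketch of that composition, the helper below (a hypothetical `filetable_unc_path`, not a SQL Server API) builds the string and rejects components that contain path separators. On the server itself, the built-in FileTableRootPath() function returns this path.

```python
def filetable_unc_path(machine, share, database_directory, filetable_directory, *parts):
    """Compose \\\\machine\\share\\database_directory\\filetable_directory[\\parts...]."""
    components = [machine, share, database_directory, filetable_directory, *parts]
    for component in components:
        # Each level is a single name, so separators inside it are rejected.
        if not component or "\\" in component or "/" in component:
            raise ValueError(f"invalid path component: {component!r}")
    return "\\\\" + "\\".join(components)


if __name__ == "__main__":
    # Machine and share names are taken from the example path on the overview slide.
    print(filetable_unc_path("my_machine", "MSSQLSERVER", "Contoso", "Document Library"))
```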

Page 15: SQLBits X SQL Server 2012 Rich Unstructured Data

FileTable Schema

File Attribute Name | Type | Purpose

path_locator | hierarchyid | Represents position of this node in the hierarchical FileNamespace

parent_path_locator | hierarchyid | Represents the hierarchyid of the parent directory (a computed column)

stream_id | uniqueidentifier | Unique ID for Filestream data

file_stream | varbinary(max) filestream | Filestream data

file_type | nvarchar(255) | Type of the file. Can be used for fulltext index creation

cached_file_size | bigint | Size of the filestream (cached value)

name | nvarchar(255) | File/Folder name (e.g. foo.txt)

creation_time | datetime2 | Creation time

last_write_time | datetime2 | Last write time

last_access_time | datetime2 | Last access time

is_directory | bit | TRUE for directories

is_offline | bit | Offline attribute

is_hidden | bit | Hidden attribute

is_readonly | bit | Read-only attribute

is_archive | bit | Archive attribute

is_system | bit | System attribute

is_temporary | bit | Temporary attribute

Page 16: SQLBits X SQL Server 2012 Rich Unstructured Data

Modifying a FileTable

FileTable has a fixed schema

Columns and system-defined constraints cannot be altered/dropped

Allows user-defined indexes/constraints/triggers

Disabling/Enabling FileTable Namespace

ALTER TABLE Documents DISABLE FILETABLE_NAMESPACE

Disables all system-defined constraints and Win32 access to FileTable

Useful for bulk-loading/re-organization of data

FileTable can be dropped similar to any other table

Catalog views can be used for obtaining metadata

Page 17: SQLBits X SQL Server 2012 Rich Unstructured Data

Data Access – File system Access

FileTable hierarchy is visible through the Filestream share

\\machine\<FILESTREAMshare>\<Database_directory>\<FileTable_Directory>\...

Provides transparent Win32 API & File/Directory Management capabilities

e.g. MS Word can create/open/save files; xcopy for copying directory trees into the database

Win32 API operations are non-transactional

Operations cannot be part of any user transactions

Win32 operations are intercepted by SQL Server at the file system level

e.g. File/Directory creation/deletion => insert/delete into FileTable

Full locking/concurrency semantics with other accesses

Allows in-place update of file stream data/File attributes

Transactional FILESTREAM APIs can also be used.

Page 18: SQLBits X SQL Server 2012 Rich Unstructured Data

Data Access – T-SQL Access

Normal Insert/Update/Delete allowed for FileTable manipulation

FileTable Namespace integrity constraints enforced

Set-based operations on the file attributes are a value add

Built-in functions

GetFileNamespacePath() – UNC path for a file/directory

FileTableRootPath() – UNC path to the FileTable root

GetPathLocator() – path_locator value for a file/directory

DDL/DML Triggers are supported

DML triggers on a FileTable cannot update any FileTables
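To illustrate what set-based operations on file attributes buy you, the sketch below aggregates file sizes per file type with one query. It is only an analogy: an in-memory SQLite table stands in for a FileTable, the `Documents` table holds just four of the FileTable columns, and `size_by_file_type` is a hypothetical helper name. Against a real FileTable the same kind of GROUP BY would run in T-SQL.

```python
import sqlite3


def size_by_file_type(rows):
    """Return (file_type, file_count, total_bytes) per type, ignoring directories."""
    conn = sqlite3.connect(":memory:")
    # Stand-in table with a subset of the FileTable columns (SQLite types, not T-SQL).
    conn.execute(
        "CREATE TABLE Documents ("
        "name TEXT, file_type TEXT, cached_file_size INTEGER, is_directory INTEGER)"
    )
    conn.executemany("INSERT INTO Documents VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT file_type, COUNT(*), SUM(cached_file_size) "
        "FROM Documents WHERE is_directory = 0 "
        "GROUP BY file_type ORDER BY file_type"
    ).fetchall()
    conn.close()
    return result
```

Answering the same question through the file system would mean walking the directory tree and reading attributes one file at a time.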

Page 19: SQLBits X SQL Server 2012 Rich Unstructured Data

Programming Pattern

Windows applications work using normal Win32 APIs using the logical UNC paths

e.g. Search files by using the FindFirstFile, FindNextFile, FindClose pattern

Move a directory using MoveFile or MoveFileEx, etc.

New Hybrid Applications using DB and FileTable:

File I/O APIs start by obtaining a handle using the FileNamespace path

-- T-SQL: get FileNamespace path
DECLARE @path nvarchar(max)
SELECT @path = file_stream.GetFileNamespacePath() FROM DocumentStore WHERE name = 'MySpec.doc';

// Win32: open a file handle on the path returned by the query above
handle = CreateFile(path, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

Page 20: SQLBits X SQL Server 2012 Rich Unstructured Data

Managing FileTable

DB Backup/Restore operations include FileTable data

A point-in-time restore may contain more recent FILESTREAM data due to non-transactional updates during backup

FileTables are secured similar to any other user tables

Same security is enforced for Win32 access also

Data Loading

Windows tools like xcopy/robocopy or drag-and-drop operations through Windows Explorer can be used

BCP operations are supported for direct T-SQL data inserts

SSMS supports FileTable creation/exploration

Page 21: SQLBits X SQL Server 2012 Rich Unstructured Data

Managing FileTable – High Availability

SQL Server 2012 AlwaysOn is fully supported

Transparent data failover

FileTables can be configured with multiple secondary nodes

Both sync and async data replication are supported

File data and metadata are available on the secondary in case of failover

Transparent application failover

Virtual network name (VNN) path support for transparent Win32 application failover

Applications use the \\VNN\Share\db\... path

Applications are automatically redirected to the secondary in case of failover

Restrictions

FileTables cannot participate in “Read-only” replicas.

Page 22: SQLBits X SQL Server 2012 Rich Unstructured Data

Managing FileTable – Troubleshooting

DMV to show all open non-transactional file handles

sys.dm_filestream_non_transacted_handles

Stored procedure to terminate open file handles

sp_kill_filestream_non_transacted_handles

X-events/Perf counters for troubleshooting

Page 23: SQLBits X SQL Server 2012 Rich Unstructured Data

FileTable Restrictions

FileTables cannot be partitioned

Merge/Transactional replication is not supported

RCSI/Snapshot isolation mode

  Win32 applications cannot modify file stream data in FileTables

Win32 application compatibility

  Memory mapped files, directory notifications, and links are not supported

Page 24: SQLBits X SQL Server 2012 Rich Unstructured Data

Some FileStream/FileTable performance tips

Reading bigger buffers gives better performance

Volumes hosting FILESTREAM/FILETABLE data should have 8.3 name generation and LastAccessTime disabled

FILESTREAM/FILETABLE containers should reside on dedicated volumes

Have one volume per FILESTREAM/FILETABLE container

This enables space management at the volume level

“Magic” SMB buffer size = ~60KB

Another “good” value is 480KB

ROWGUID unique index for aligned partitioning for FILESTREAM

AntiVirus programs should be configured not to delete infected files but to quarantine them

If using compressed volumes, use cluster size 4 KB

Page 25: SQLBits X SQL Server 2012 Rich Unstructured Data

Unstructured Data Scale-up

Multiple Containers for FILESTREAM data

SQL 2008 R2

Only one storage container per FILESTREAM filegroup

Limits storage capacity scaling and I/O scaling

SQL Server 2012

Support for multiple storage containers per filegroup

DDL changes to Create/Alter Database statements

Ability to set max_size for the containers

DBCC Shrinkfile Emptyfile support

Scaling Flexibility

Storage scaling by adding additional storage drives

I/O scaling with multiple spindles

Page 26: SQLBits X SQL Server 2012 Rich Unstructured Data

Unstructured Data : Multiple containers

Use of multiple spindles for achieving better I/O Scalability

Page 27: SQLBits X SQL Server 2012 Rich Unstructured Data

RUDS Scale-up: FileStream Perf/Scale

Improved performance of T-SQL and File I/O access

Various enhancements to improve read/write throughput

5 fold increase in Read throughput

Linear scaling with large number of concurrent threads


Page 28: SQLBits X SQL Server 2012 Rich Unstructured Data

Unstructured Storage In SQL Server 2008 & 2012

[Comparison matrix. Most cell ratings were graphical and did not survive extraction; only the text cells are listed below.]

Columns: File Stores / External Blob Stores (CAS); SQL BLOBs; Remote Blob API; FILESTREAM; FILETABLE

Rows:

Streaming Performance ("Depends on external store" in two columns)

Win32 App Compat ("Depends on external store" in two columns)

Link Level Consistency

Data Level Consistency

Integrated Query & Management

Non-local Windows File Servers ("n/a" in one column)

External Blob Stores ("n/a" in one column)

Page 29: SQLBits X SQL Server 2012 Rich Unstructured Data

Feature Comparison

Features | FileServer+DB Solution | SQL 2008 FILESTREAM | SQL 2012 FileTable

Integrated Admin operations for Relational and File data (Backup/Restore, HA/Mirroring) | No | Yes | Yes

Integrated Services for Relational and File data (Text/Semantic Search, Reports, Query, etc.) | No | Yes | Yes

Integrated Security Model | No | Yes | Yes

In-place update of Filestream data (non-transacted) | Yes | No | Yes

Fully Transacted update of Filestream data | No | Yes | Yes

File/Directory hierarchy in DB | No | No | Yes

Win32 App compatibility | Yes | No | Yes

Relational access to File Attributes | No | No | Yes

Page 30: SQLBits X SQL Server 2012 Rich Unstructured Data

Summary: FileTable

Application Compatibility for Windows Applications

Windows applications run on top of files stored in FileTables with no modifications

Relational Value Proposition

Provide Integrated Administration and Services

Backup, Log Shipping, HA-DR, Full text and Semantic search, …

T-SQL orthogonality

File/Folder attributes surfaced through relational columns

Power of set-based operations, Policy Management, Reporting, etc.

FileNamespace Hierarchy management

Page 31: SQLBits X SQL Server 2012 Rich Unstructured Data

Full Text Search Improvements in SQL Server 2012

Improved Performance and Scale:

Scale-up to 350M documents

iFTS query perf 7-10 times faster than in SQL Server 2008

Worst-case iFTS query response times < 3 sec for corpus

On par with or better than the main database search competitors

New Functionality:

Property Search

Customizable NEAR

New Wordbreakers: update existing WB, add Czech and Greek

Innovation in Search: Semantic Similarity Search

Page 32: SQLBits X SQL Server 2012 Rich Unstructured Data

Full Text Search Performance & Scale Improvements

Architectural Improvements

Improved internal implementation

Queries no longer block Index updates

Improved Query Plans: Better Plans for common queries

Fulltext predicate folding

Parallel Plan execution

Index and Query tested on scale up to 350Million documents with < ~2 Sec Response

~3X better w/o DML and ~9X better with DML throughput

Scale easily with increasing number of connections

Page 33: SQLBits X SQL Server 2012 Rich Unstructured Data

Scale-up: Full-Text Search

Queries over a 350M-document database with random DML running in the background. SQL Server 2012 beats SQL Server 2005 by a scale factor of more than 2x and averages 60x better throughput.

[Charts. Series: 2012; 2005/8; 2005/8 vs 2012]

Page 34: SQLBits X SQL Server 2012 Rich Unstructured Data

Scale-up: Full-Text Search

Query avgExecTime (ms) under various numbers of connections (50 to 2000 users) for a customer playback benchmark

[Charts. Series: 2012; 2005/8; 2005/8 vs 2012]

Page 35: SQLBits X SQL Server 2012 Rich Unstructured Data

New FullText Search Capabilities in SQL Server 2012

demo

Page 36: SQLBits X SQL Server 2012 Rich Unstructured Data

FullText Property Scoped Search

• Setup once per database instance to load the office filters

exec sp_fulltext_service 'load_os_resources', 1
go
exec sp_fulltext_service 'restart_all_fdhosts'
go

• Create a property list

CREATE SEARCH PROPERTY LIST p1;

• Add properties to be extracted

ALTER SEARCH PROPERTY LIST [p1] ADD N'System.Author' WITH (PROPERTY_SET_GUID = 'f29f85e0-4ff9-1068-ab91-08002b27b3d9', PROPERTY_INT_ID = 4, PROPERTY_DESCRIPTION = N'System.Author');

• Create/Alter Fulltext index to specify property list to be extracted

ALTER FULLTEXT INDEX ON fttable... SET SEARCH PROPERTY LIST = [p1];

• Query for properties

SELECT * FROM fttable WHERE CONTAINS(PROPERTY(ftcol, 'System.Author'), 'fernlope');

New Search Filter for Document Properties

CONTAINS (PROPERTY ( { column_name }, 'property_name' ), 'contains_search_condition' )

Page 37: SQLBits X SQL Server 2012 Rich Unstructured Data

Full-Text Customizable Near

OLD NEAR SYNTAX

select * from fttable where contains(*, 'test near Space')

NEW NEAR USAGES

• SPECIFY DISTANCE

select * from fttable where contains(*, 'near((test, Space), 5,false)')

• REDUCE DISTANCE

select * from fttable where contains(*, 'near((test, Space), 2,false)')

• ORDER OF WORDS IS SPECIFIED AS IMPORTANT

select * from fttable where contains(*, 'near((test, Space), 5,true)')
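The effect of the two extra arguments (maximum distance and match order) can be mimicked over a plain word list. The sketch below is a simplified model, not the full-text engine: it assumes whitespace tokenization, case-insensitive exact word matches, and that distance counts the words between the two terms. `near_match` is a hypothetical name.

```python
def near_match(text, first, second, max_distance, match_order=False):
    """True if `first` and `second` occur with at most `max_distance` words between them.

    With match_order=True, `first` must also appear before `second`.
    """
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    for i in first_positions:
        for j in second_positions:
            if i == j:
                continue  # same token cannot serve as both terms
            if match_order and i > j:
                continue  # terms are in the wrong order
            if abs(i - j) - 1 <= max_distance:
                return True
    return False


if __name__ == "__main__":
    sample = "test the deep Space probe"  # two words sit between the terms
    print(near_match(sample, "test", "Space", 5))
    print(near_match(sample, "test", "Space", 1))
```

In this model, tightening the distance from 5 to 2 still matches the sample, 1 does not, and turning on match order rejects text where "Space" comes before "test".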

Page 38: SQLBits X SQL Server 2012 Rich Unstructured Data

Statistical Semantic Search

Semantic Insight into textual content

Uses language models to find the most important keywords in a document

No need to build brittle ontologies!

Statistically Prominent Keywords

Autogenerated tag clouds

Potentially Related Content based on extracted Keywords, such as

Similar Products (based on description)

Similar Jobs or Applicants

Similar Support Incidents (based on call logs)

Potential Solutions (based on similar incidents)

First class usage experience

Efficient linear algorithms

Integrated with FTS and SQL

New Rowset functions for all results using SQL query

Page 39: SQLBits X SQL Server 2012 Rich Unstructured Data

Semantic Extraction and Relationships

FullText Search in SQL Server 2012

demo

Page 40: SQLBits X SQL Server 2012 Rich Unstructured Data

Semantic Similarity

Input: Text such as varchar, Office, PDF, HTML, email…

Output: Rowset functions with standard SQL queries

Illustrating example:

Source Table

Key  Title               Document
D1   Annual Budget       …
D2   Corporate Earnings  …
D3   Marketing Reports   …
…    …                   …

Full-Text and Semantic Processing (+ Language Models) extracts terms such as: quarter, record, revenue…

Keyword Index (Full-Text)

ID  Keyword  Colid  …  CompDocid  CompOc              CompPid
K1  revenue  1      …  10,23,123  (1,4),(5,8),(1,34)  2,5,6,8,4,3
K2  growth   1      …  10,23,123  (1,5),(5,9),(1,34)  2,5,6,8,5,4
…   …        …      …  …          …

Keyphrases

ID  Keyword
T1  revenue
T2  growth
T3  Windows
T4  Azure
…   …

KeyphraseDocuments

ID            DocID
T1 (revenue)  D1 (Annual Budget)
T2 (growth)   D2 (Corporate Earnings)
T3 (Windows)  D3 (Marketing Reports)
…             …
T1 (revenue)  D7 (Finance Report)
…             …
T3 (Windows)  D11 (Azure Strategy)
T4 (Azure)    D11 (Azure Strategy)

DocumentSimilarity

DocID                   MatchedDocID
D1 (Annual Budget)      D2 (Corporate Earnings)
D1 (Annual Budget)      D7 (Finance Report)
D3 (Marketing Reports)  D11 (Azure Strategy)
…                       …
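One way to read the relationship between the KeyphraseDocuments and DocumentSimilarity tables is that documents sharing key phrases become similarity candidates. The sketch below is a toy stand-in for that idea and is not SQL Server's statistical model: it pairs documents that share at least one key phrase. `similar_documents` is a hypothetical name and the input rows are only the ones visible on the slide, so pairs that depend on the elided rows (such as D1 and D2) are not reproduced.

```python
from collections import defaultdict
from itertools import combinations


def similar_documents(keyphrase_documents):
    """Map each document pair to the set of key phrases the two documents share."""
    docs_by_phrase = defaultdict(set)
    for phrase, doc in keyphrase_documents:
        docs_by_phrase[phrase].add(doc)

    shared = defaultdict(set)
    for phrase, docs in docs_by_phrase.items():
        # Every pair of documents carrying the same phrase is a candidate match.
        for a, b in combinations(sorted(docs), 2):
            shared[(a, b)].add(phrase)
    return dict(shared)


if __name__ == "__main__":
    rows = [
        ("revenue", "D1"), ("growth", "D2"), ("Windows", "D3"),
        ("revenue", "D7"), ("Windows", "D11"), ("Azure", "D11"),
    ]
    for pair, phrases in similar_documents(rows).items():
        print(pair, sorted(phrases))
```

A real ranking would weight phrases by statistical prominence rather than treating every shared phrase equally.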

Page 41: SQLBits X SQL Server 2012 Rich Unstructured Data

Functional Surface: Initiate Semantics

Create / Alter Full-Text with Semantics

Makes internal design dependency on FTS explicit

CREATE FULLTEXT INDEX ON Production.Document (

Title LANGUAGE 1033,

Document

TYPE COLUMN FileExtension

LANGUAGE 1033

STATISTICAL_SEMANTICS

)

KEY INDEX PK_Document_DocumentID

ON documents_catalog

WITH CHANGE_TRACKING OFF, NO POPULATION;

ALTER FULLTEXT INDEX ON Production.Document

ALTER COLUMN Document

ADD STATISTICAL_SEMANTICS

WITH NO POPULATION;

ALTER FULLTEXT INDEX ON Production.Document

START FULL POPULATION;

Page 42: SQLBits X SQL Server 2012 Rich Unstructured Data

Semantic Extraction: End-2-End Experience

Downloadable Language Statistical Database with registration stored procedure

Setup along with Full-Text

Metadata / Catalog views

System level DMVs for progress state and usage

Manageability through SSMS and SMO

Page 43: SQLBits X SQL Server 2012 Rich Unstructured Data

Key Takeaways

SQL Server’s unstructured data support is:

targeting non-traditional database workloads that are growing rapidly in the enterprise. Example: Content and Collaboration apps

targeting key ISV asks in fast growing markets such as eDiscovery, Healthcare, Document management etc.

key strategy to enable you to build complex data applications that go beyond relational data!

Page 44: SQLBits X SQL Server 2012 Rich Unstructured Data

Related Content

SQL Server 2012 Whitepapers and information: http://www.sqlserverlaunch.com

Channel 9 DataBound Episode 2: http://channel9.msdn.com

MySemanticsSearch Demo: http://mysemanticsearch.codeplex.com

More demo data sets and demo scripts: http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx

Microsoft Virtual Academy Recording: http://www.microsoftvirtualacademy.com/tracks/breakthrough-insights-using-microsoft-sql-server-2012-scalable-data-warehouse

Find Me Later…

• On Twitter: @SQLServerMike

• Blog: http://sqlblog.com/blogs/michael_rys