DOXLON November 2016 - Data Democratization Using Splunk
Transcript of DOXLON November 2016 - Data Democratization Using Splunk
About Me
• Splunking Since 2008
• Largest Splunk Implementation:
• 3 TB/day
• 1.2 PB Searchable
• 900 Users
• Interests:
• Guitars
• And the occasional Uke
What is Splunk?
• Google Search for IT Data?
• Log aggregation Tool?
• Data Visualisation Tool?
• Data Platform with App Creation Capabilities
• Proprietary Search Language - SPL
• Correlation of Structured and Unstructured Data Sources
• Visualisation capabilities
• Out of the Box
• Modular
Getting Data In
• Data Sources - unstructured, and structured (JSON, CSV, XML)
• Collection - Forwarders and HEC send data to the Indexer
• Indexing Pipeline - line breaking, timestamp recognition, data segmentation
• Persist to Disk - the index is stored as buckets holding keywords and the raw data
Data Collection using Splunk Forwarder
• Splunk forwarder capabilities
• File based Inputs
• Database Inputs
• Scripted Inputs
• Forwarder Configurations deployed as modular add-ons
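A file-based input in such an add-on is typically declared in an inputs.conf. A minimal sketch, assuming an illustrative log path, index, and sourcetype (all hypothetical names):

```ini
# inputs.conf inside a deployable add-on (illustrative paths and names)
[monitor:///var/log/myapp/*.log]
index = my_product
sourcetype = myapp:log
disabled = false
```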
Typical Splunk Search
index=<my_product> sourcetype=web.access checkout | stats avg(response_time) as "Average Response Time" by request
Searching Data
Indexers - Map:
• Query index by keyword
• Load raw results into memory
• Apply data extractions, transformations and lookups (Knowledge Objects)
• Run streaming commands
Search Heads - Reduce:
• Receive results and "reduce"
• Run additional commands
• Visualise, report, alert
So what about Knowledge Objects?
• Most Knowledge Objects are configurable from the UI
• Common Types:
• Field Extractions - regex to extract fields
• Field Aliases - alias the name of a field
• Lookups - enrich events from flat files or the KV store
• Tags - provide an event-grouping abstraction
• Eventtypes - provide event categorisation
• Calculated Fields - data manipulations
Goal?
• Queries like:
index=<my_website> "/checkout/auth/confirmation" | rex "<some humungous regex that extracts customer id in addition to other things>" | eval response_time_seconds = resp_time_milliseconds/(1000) | where http_code == 200 | lookup db_locations customer_id OUTPUT location | stats avg(response_time_seconds) as avg_response_time by location
• Become:
eventtype=auth_successful tag=web | stats avg(response_time_seconds) as average_response_time by location
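The short query leans on persisted knowledge objects. A minimal sketch of how the eventtype and tag could be defined in configuration (stanza names and the underlying search are illustrative, matching the example above):

```ini
# eventtypes.conf - name a saved search condition (illustrative)
[auth_successful]
search = index=my_website "/checkout/auth/confirmation" http_code=200

# tags.conf - attach the "web" tag to that eventtype
[eventtype=auth_successful]
web = enabled
```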
Persisting Knowledge
Data Democratisation
• Sounds like the holy grail of data
• Idealistic?
Scenario
• Microservices Architecture
• Numerous Development Teams working under different service umbrellas
• Mix of legacy systems with modern services
• Dependence on vendor integrations
• Data can be sensitive
Typical Data Democratisation Issues
• Security - some data is sensitive yet valuable, but we'd like an open access model
• Knowledge Fragmentation - it's our data, let's make sure everyone knows what it means
• Adoption - people need to like it; it shouldn't get in the way
• Scalability
• Chargeback - it's not my data, why should I pay for it?
Security - Delegated Access Model
• Splunk Search Apps can serve as knowledge containers
• Knowledge Object ownership can be scoped local to the app or global to the entire system
• Splunk Indexes are data containers.
• Data Access granted by index
• Assign an app per product or service umbrella
• Assign Data Owner
Delegated Access Model
• Federated Group → Splunk Role
• App-level permissions
• Index-level permissions
Splunk Security Must Have!
• Splunk Authentication is Poor
• No Password Policy
• No Centralised management for multiple search nodes
• Single Sign On - Splunk supports:
• Ping Identity
• Okta
• ADFS
• Azure AD
• LDAP
• Custom Auth
• Use an entitlement framework on top of single sign-on groups
Combating Knowledge Fragmentation
• Semantic Logging:
• Logging for the sole purpose of analytics
• Rich datasets can be viewed in multiple dimensions
• Define Developer Guidelines:
• Ensure Correlation Identifiers are present in all events
• Precision Timestamps
• Incorporate Logging into SDLC
• Standardise Logging Formats
• Standardise Log content per service - e.g. BAM metrics
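The guidelines above can be sketched as a small logging helper: one JSON object per event, a precision timestamp, and a correlation identifier on every event. The function and field names below are illustrative, not from the talk:

```python
import json
import time
import uuid

def make_event(service, message, correlation_id=None, **fields):
    """Build a semantic log event as a single JSON line, with a
    microsecond-precision timestamp and a correlation identifier."""
    return json.dumps({
        "time": round(time.time(), 6),            # epoch seconds, microsecond precision
        "service": service,
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "message": message,
        **fields,                                 # e.g. standardised BAM metrics
    })

# Example: a checkout event carrying a metric Splunk can aggregate
line = make_event("checkout", "auth_successful",
                  correlation_id="abc-123", response_time_seconds=0.42)
print(line)
```

Logging a rich, self-describing event like this is what lets the same dataset be viewed along multiple dimensions at search time.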
Combating Knowledge Fragmentation
Reality - not all logs can be written semantically, or at least not without significant refactoring.
Splunk Solution - Data Models
Data Models
• Enable "schema on the fly"
• Hierarchically structured search-time mapping of semantic knowledge.
• Accessed via Datasets tab in Splunk 6.5
Example: Splunk CIM
• Splunk Common Information Model (CIM)
• Collection of Data Models based on subject area
• Shared Semantic model
• Support consistent and normalised treatment of data
• Enables third-party apps to be integrated with your data
• Reference Tables:
http://docs.splunk.com/Documentation/CIM/4.6.0/User/Howtousethesereferencetables
Pivot
• UI developed to enable the creation of analytics off structured data models
• Supports:
• Tables
• Charts - Line, Scatter, Column, Bar, Bubble, Pie
• Single Value Visualisations
Performance
• Data Models can be accelerated, which:
• Decreases search optimisation effort
• Decreases dashboard optimisation effort
• Increases storage requirements
• Speed-ups of up to 1000x
• Speed is dependent on the cardinality of the data
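An accelerated data model is typically queried with `tstats` rather than a raw index search. A sketch, assuming the CIM Web data model and its field names:

```
| tstats avg(Web.response_time) as avg_response_time from datamodel=Web by Web.site
```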
Notable Splunk Apps on CIM
• Splunk Enterprise Security
• Splunk PCI Compliance
• Insight Engines - Search Splunk using Natural Language
Adoption
• Most users complain about backlogs on onboarding data
• Automating the onboarding process isn’t as easy as it sounds. Data Validation is key to deriving value.
• Universal Forwarder:
• Standardise Log Locations
• Standardise Time Stamps
• HTTP Event Collector:
• Send data directly from your application to Splunk
• Utilise Indexer Acknowledgement
• Notable implementations:
• Docker - Splunk Logging Driver
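Sending an event to the HTTP Event Collector is a single authenticated POST. A minimal sketch: the `/services/collector/event` endpoint and the `Authorization: Splunk <token>` header are HEC conventions, while the host, token, index, and sourcetype below are illustrative placeholders:

```python
import json
import urllib.request

HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # illustrative host
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                    # illustrative token

def hec_payload(event, index="my_product", sourcetype="myapp:json"):
    """Wrap an event in the envelope the HTTP Event Collector expects."""
    return {
        "event": event,
        "index": index,
        "sourcetype": sourcetype,
    }

def send_to_hec(event):
    """POST one event to HEC. With indexer acknowledgement enabled,
    the response carries an ack id that can be polled for durability."""
    req = urllib.request.Request(
        HEC_URL,
        data=json.dumps(hec_payload(event)).encode("utf-8"),
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        method="POST",
    )
    return urllib.request.urlopen(req)  # network call; requires a reachable HEC endpoint
```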
Newish Splunk Features
• Machine Learning Toolkit
• Comes with built-in assistants for supported algorithms
• Extend the available algorithms - Python scikit-learn
• ITSI
• Modular Visualisations
• New Custom Search Command Creation Capability
• TSIDX Reduction - Decrease Storage Costs
Crystal Ball
Further integration into the Hadoop ecosystem