LOGITECH CONFIDENTIAL
Avinash Deshpande
Chief Software Architect
Powering Machine Learning and Analytics with Logical Data Lake
WE DESIGN PRODUCTS THAT CONNECT YOU
TO THE DIGITAL EXPERIENCES YOU LOVE
A UNIQUE COMPANY
Start-ups
Big Companies
$2.22 billion in sales in FY17, strong cash, no debt
7,000 employees
30 countries
SIX Swiss Exchange: LOGN
NASDAQ: LOGI
3 strong identities
5 big market opportunities
1 big market driver
5 core capabilities
3 key advantages:
A GREAT SMALL COMPANY
GLOBAL NETWORK
350+ customers
2,000+ locations
2-week delivery from factory to family in 100 countries
Decade-long relationships with key retailers and e-tailers
3 million products shipped per week to 100 countries
Select partners and customers
DIVERSE PORTFOLIO OF PRODUCTS AND BRANDS
VIDEO COLLABORATION
HOME
GAMING
CREATIVITY & PRODUCTIVITY
MUSIC
HOME SECURITY AND ENTERTAINMENT CONTROL
We design simple, yet powerful products that help people take
control of their ever-growing connected home.
VIDEO COLLABORATION
Logitech is transforming video conferencing by offering an easy and affordable way to collaborate with crystal-clear audio and razor-sharp video.
BRING MUSIC TO LIFE
Ultimate Ears empowers you to spontaneously transform your world with music.
POWER YOUR PASSION
Jaybird is an authentic sports brand leading with innovative design that elevates an active life.
ADVANCE PLAY
Logitech G applies our leadership to one pursuit: advancing the performance and passion for play.
PRO GAMING EQUIPMENT
ASTRO Gaming creates premium video gaming equipment and lifestyle products for professional gamers, leagues, and gaming prosumers.
LOGITECH CONFIDENTIAL: NOT FOR DISTRIBUTION
ANALYTICS AT SCALE
SUPPORTING OUR GROWING BUSINESS
LOGITECH DATA USE CASES
[Figure: data use cases arranged by data type (Structured, Semi-Structured, Unstructured) and data velocity (Batch to Real-Time)]
Use cases shown:
• Social Media Sentiment
• Predictive Analytics
• Demand Forecasting
• Price violations on Retail sites
• Data Warehousing
• Text Mining
• Security Video Analysis
• Retail Data scraping
• Machine Learning
• IoT
• Multi site ERP
• Marketing Funnel
• Sales Channel Mgmt
• Smart Home
• Natural Language Processing (NLP)
• VR Gaming
• Device Events
ANALYTICS AT SCALE - SUMMARY
• Create a decentralised self-service analytics environment for traditional business reporting and
analytics (Descriptive and Diagnostic Analytics). Becomes a purely EXPLICIT experience.
• Allow for a centralised, cross-functional shared advanced analytics service tasked with delivering Predictive and Prescriptive analytics to the organisation.
• A minimal investment, with leveraged return.
DATA VOLUME – COMPLEXITY (RETAIL USE CASE)
• Consumer Reviews of Logitech and Competitor Products
▪ Core capabilities to scrape retail websites for consumer reviews
▪ Raw and structured data available for advanced analysis
▪ Support and performance for petabyte data volumes
▪ Sentiment analysis for business decisions
▪ BGs data analysis for consumer complaints, issues and negative reviews
• Logitech Product Pricing
▪ Core capabilities to scrape retail websites for pricing of Logitech products
▪ Amazon buy button analysis
▪ Amazon.com and marketplace price analysis
▪ Price violation analysis
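The price violation analysis listed for the retail use case can be sketched as a comparison of scraped offers against a minimum price per product. This is a minimal illustration under stated assumptions, not the implementation used at Logitech: the `Offer` record, the `MIN_PRICE` table, the SKU names, the sellers and the prices are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical minimum prices per SKU (illustrative values only).
MIN_PRICE = {"MOUSE-01": 49.99, "KEYB-02": 89.99}


@dataclass(frozen=True)
class Offer:
    """One scraped listing: which seller offers which SKU at what price."""
    sku: str
    seller: str
    price: float


def find_violations(offers, min_price=MIN_PRICE):
    """Return offers priced below the minimum for their SKU.

    Offers for SKUs without a configured minimum are skipped, not flagged.
    """
    return [
        o for o in offers
        if o.sku in min_price and o.price < min_price[o.sku]
    ]


offers = [
    Offer("MOUSE-01", "seller-a", 49.99),  # at the minimum: not a violation
    Offer("MOUSE-01", "seller-b", 42.50),  # below the minimum: violation
    Offer("KEYB-02", "seller-c", 95.00),   # above the minimum
    Offer("CAM-03", "seller-d", 10.00),    # no minimum configured: skipped
]

violations = find_violations(offers)
for v in violations:
    print(f"{v.sku} from {v.seller}: {v.price:.2f} < {MIN_PRICE[v.sku]:.2f}")
```

Keeping the minimum prices in a separate table means the scraping step and the pricing policy can change independently.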
DATA VOLUME – COMPLEXITY (IOT USE CASE)
Logitech connected devices generate events that are streamed in real time or as micro-batches for analysis and insight.
▪ Need for tactical KPIs
▪ Engineering feedback on issues
▪ Marketing segmentation for cross sell/up sell
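One common way to turn a stream of device events into tactical KPIs is to group the events into fixed time windows (micro-batches) and count per window. The sketch below assumes hypothetical event fields (`device_id`, `ts` in epoch seconds, `kind`) and an assumed 60-second window; none of these come from the slides.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60  # assumed micro-batch size


def window_start(ts, size=WINDOW_SECONDS):
    """Map an epoch timestamp to the start of its tumbling window."""
    return ts - (ts % size)


def kpis_per_window(events, size=WINDOW_SECONDS):
    """Count events by kind within each tumbling window.

    `events` is an iterable of dicts with the hypothetical keys
    'device_id', 'ts' (epoch seconds) and 'kind'.
    """
    windows = defaultdict(Counter)
    for e in events:
        windows[window_start(e["ts"], size)][e["kind"]] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


events = [
    {"device_id": "d1", "ts": 5, "kind": "button_press"},
    {"device_id": "d2", "ts": 42, "kind": "error"},
    {"device_id": "d1", "ts": 61, "kind": "button_press"},
    {"device_id": "d1", "ts": 119, "kind": "button_press"},
]

result = kpis_per_window(events)
print(result)
```

Error counts per window could feed the engineering feedback loop, while per-device activity could feed marketing segmentation.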
NEW TECHNOLOGY
A VISION
PRICE ANALYSIS
TEXT ANALYTICS
NATURAL LANGUAGE PROCESSING
YTD Sales by Product
Furniture in the Corporate segment and Chairs sub-category
accumulated $24,304 in YTD sales (3.8% of Superstore's total).
This cell accounted for the greatest proportion of sales in the
Corporate segment (45.1%) and the second highest proportion of
sales in the Chairs sub-category (31.1%).
Q3 recorded the largest proportion of sales (30%), though the
largest sales week was in the second quarter (May 4, 2014;
$2,297). As a result of a moderate profit ratio (13%) and high
order profitability (91%) overall, total profit from Corporate Chair
orders was $3,263. The Central Region contributed the most to
profit ($1,703), while the East Region had the best profit ratio
(23%).
This year's Corporate Chair orders shipped on time or early 79%
of the time. This could be associated with solid performance in
Standard Class Shipping, where 96% of orders shipped on time
or early. Another probable reason was the Central Region, where
83% of orders shipped on time or early.
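Narratives like the one above are produced from aggregates. As a rough illustration of the idea only (not the tool shown in the slides), a sentence can be rendered from a template once the figures are computed. The sales and profit values are taken from the text above; the function name and the wording of the template are hypothetical.

```python
def profit_sentence(segment, sub_category, sales, profit):
    """Render one narrative sentence from two aggregates."""
    ratio = profit / sales  # profit ratio as a fraction of sales
    return (
        f"{segment} {sub_category} orders earned ${profit:,.0f} in profit "
        f"on ${sales:,.0f} in sales, a profit ratio of {ratio:.0%}."
    )


sentence = profit_sentence("Corporate", "Chair", sales=24_304, profit=3_263)
print(sentence)
```

The computed ratio rounds to the 13% quoted in the narrative, which is a quick consistency check on the two source figures.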
SAAS INTEGRATION PLATFORM
REAL-TIME ON DEMAND delivery to your PHONE and DESKTOP and DASHBOARD
Executive Summaries
Customer by Product
Product by Customer
Demand / Supply updates
Market Analytics / Market Share
Marketing Reports
Competitive Analysis
Sentiment
...
NLP is a scalable self-service environment,
meaning we can open it to business users
and let them drive business impact and
adoption. It is language agnostic, meaning
we can publish reports in the language in
which they are written.
HOW DID WE GET THERE?
REFERENCE ARCHITECTURE
[Figure: reference architecture, from data sources through data collection and data virtualization to data insights]
Metadata Management, Data Governance, Data Security; Cost and Usage Pattern
Data Sources: Sensor Data, Machine Data Logs, Social Data, Clickstream Data, Internet Data, Image and Video, Cloud Applications, Enterprise Applications
Data Collection: CDC, ETL, Real-Time Data Access (On-Demand / Streaming)
Storage and processing: EDW, ODS, Cloud DW, NoSQL, Data Warehouse, File Storage (S3), Batch, DW, Spark SQL, NoSQL, Search, Big Data, In-Memory, Analytical Appliances
Data Virtualization
Data Insights: Self-Service / Data Discovery, Reporting, Predictive Analytics, Statistical Analytics, Sentimental Analytics, Text Analytics, Data Mining, Real-Time Decision Support, Alerts, Scorecards/Dashboards
SOLUTION ARCHITECTURE
[Figure: solution architecture on Amazon Web Services]
Components: AWS Glacier, AWS S3, AWS Redshift, AWS RDS, EMR, Spark, Python / R, CloudWatch, SNS, IAM, CloudTrail, Pentaho DI, Pentaho Operations Mart, Snowflake, Text Analytics, Denodo Data Virtualization
Consumers: Tableau, Pentaho BA, Data Interfaces, Web Services, OBIEE, Cubes
Flow: JSON records → read raw JSON → write to HDFS as Parquet → cleaned and/or aggregated data → data virtualization / blended data → business reporting and analysis that needs high computation
AWS Glue: fully managed ETL service that can categorize your data, clean it, enrich it, and move it reliably between various data stores.
Snowflake: Cloud data warehouse
S3: Data Storage
Denodo: Data Virtualization provides business agility by integrating disparate data from any enterprise source, Big Data and Cloud.
Tableau: Data Visualization
Zeppelin: Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, PySpark and more
Apache Parquet: a columnar storage format available to any project in the Hadoop ecosystem
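The "read raw JSON" and "cleaned and/or aggregated data" steps of the pipeline can be illustrated with the Python standard library alone. In the architecture above the result is written as Parquet; that step is left out here because it needs a Parquet-capable library such as Spark. The record fields (`device_id`, `battery`) and the sample lines are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical raw input: one JSON record per line, some of it unusable.
RAW_LINES = [
    '{"device_id": "d1", "battery": 80}',
    '{"device_id": "d1", "battery": 60}',
    '{"device_id": "d2", "battery": null}',  # missing reading
    'not json at all',                        # corrupt line
    '{"device_id": "d2", "battery": 90}',
]


def clean(lines):
    """Yield parsed records, dropping corrupt lines and missing readings."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("battery") is not None:
            yield record


def average_battery(records):
    """Aggregate cleaned records to the mean battery level per device."""
    totals = defaultdict(lambda: [0, 0])  # device_id -> [sum, count]
    for r in records:
        totals[r["device_id"]][0] += r["battery"]
        totals[r["device_id"]][1] += 1
    return {dev: s / n for dev, (s, n) in totals.items()}


aggregated = average_battery(clean(RAW_LINES))
print(aggregated)
```

Because `clean` is a generator, records are processed one at a time, so the same shape works for files far larger than memory.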
DATA VIRTUALIZATION IN IOT ECOSYSTEM
[Figure: data virtualization in an IoT ecosystem]
Sources: Streaming data; Other RDBMS (apps, CRM, SAP); Other Sources (SaaS, SFDC, etc.)
Processing: Ingestion; Streaming analytics; Big Data Storage; Batch analytics, Machine learning; Traditional batch processing (ETL to EDW)
Virtualization layer: Semantic Model (Secure + Combine + Enrich); Meta Base
INTEGRATION WITH IOT - STREAMING
[Figure: streaming pipeline: Ingestion buffers → Streaming Analytics → Storage]
Virtualization layer: Read temp window buffers; SQL-based enrichment; Secure + Combine + Enrich
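SQL-based enrichment of a window buffer amounts to joining the buffered events with reference data from another source. The sketch below uses an in-memory SQLite database as a stand-in for both sides; the table names, columns and sample rows are hypothetical, and a real deployment would run the join in the virtualization layer instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE window_events (device_id TEXT, ts INTEGER, kind TEXT);
    CREATE TABLE devices (device_id TEXT PRIMARY KEY, product TEXT, region TEXT);
    """
)

# Events read from a temporary window buffer (hypothetical sample).
conn.executemany(
    "INSERT INTO window_events VALUES (?, ?, ?)",
    [("d1", 5, "error"), ("d2", 9, "error"), ("d1", 12, "click")],
)
# Reference data from another source, e.g. a CRM or ERP system.
conn.executemany(
    "INSERT INTO devices VALUES (?, ?, ?)",
    [("d1", "webcam", "EMEA"), ("d2", "speaker", "AMR")],
)

# Combine and enrich: attach product and region to each event, then aggregate.
rows = conn.execute(
    """
    SELECT d.product, d.region, COUNT(*) AS errors
    FROM window_events AS e
    JOIN devices AS d ON d.device_id = e.device_id
    WHERE e.kind = 'error'
    GROUP BY d.product, d.region
    ORDER BY d.product
    """
).fetchall()
print(rows)
```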
JOURNEY TO CLOUD
Cloud empowers IT organizations to redefine the way Data services are produced
and delivered
Scalable
• Infrastructure scaled up and down on the fly (elastic)
• Focus on simplicity, security, robustness, and scalability
Efficient
• Infrastructure costs are pay per use
Reliable
• AWS managed services
Managed & Governed
• Transparency on usage patterns
• Breadth of services offered, pricing, performance and flexibility
IMPACT OF DATA VIRTUALIZATION PLATFORMS
“By 2018, organizations with data virtualization
capabilities will spend 40% less on building and
managing data integration processes for
connecting distributed data assets.”
-Gartner
NEED FOR DATA VIRTUALIZATION
Abstract access to disparate data sources
A single semantic repository
Optimized data availability in real-time to consumers
Centralized, governed and secured data layer
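Abstracting access to disparate sources can be pictured as a view that reads each source at query time and joins the results without storing a copy. The following is a toy sketch of that idea, not a description of how any virtualization product works internally: an in-memory SQLite table stands in for an RDBMS, an in-memory CSV stands in for a file or SaaS export, and all names and values are hypothetical.

```python
import csv
import io
import sqlite3

# Source 1: a relational database (in-memory SQLite stands in for an RDBMS).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2: a flat file (an in-memory CSV stands in for a file or SaaS export).
ORDERS_CSV = "customer_id,amount\n1,100\n1,50\n2,75\n"


def customers():
    """Read customers from the database at query time."""
    for cid, name in db.execute("SELECT id, name FROM customers ORDER BY id"):
        yield {"customer_id": cid, "name": name}


def orders():
    """Read orders from the CSV source at query time."""
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        yield {"customer_id": int(row["customer_id"]), "amount": float(row["amount"])}


def customer_totals():
    """A 'virtual view': joins both sources on demand and stores nothing."""
    totals = {}
    for o in orders():
        totals[o["customer_id"]] = totals.get(o["customer_id"], 0.0) + o["amount"]
    return [
        {"name": c["name"], "total": totals.get(c["customer_id"], 0.0)}
        for c in customers()
    ]


view = customer_totals()
print(view)
```

Consumers call `customer_totals()` without knowing that one side is a database and the other a file, which is the abstraction the slide describes.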
MANAGING BIG DATA WITH DATA LAKES
➢ Organizations are exploring data lakes as consolidated repositories of massive volumes of
raw, detailed data of various types and formats to overcome Big Data challenges.
➢ But creating a physical data lake presents its own hurdles, one of which is the need to store
the data twice which can lead to governance challenges with regard to data access and
quality.
➢ Data Virtualization technologies can improve an organization’s ability to govern and
extract more value from its data lakes by extending them as logical data lakes.
- Ventana Research
BIG DATA FABRIC
Denodo Data Virtualization
•Logical model can be predefined for the data
•Eliminates load processes and the need to update the data
•Uses the security and governance system already in place
•Collects and maintains statistics and determines optimal query execution
•Offers caching and query pushdown for optimal performance
•Array of connection options from structured to unstructured data
•Business Layer, enabling data Consistency through single object, multiple consumers
•Rapid prototyping
•Data Audits
•Sandbox for data science
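Two of the points above, caching and pushdown, can be illustrated in a few lines. This is a simplified sketch of the general techniques and makes no claim about Denodo's optimizer: SQLite stands in for a remote source, `functools.lru_cache` stands in for the cache, and the table and values are hypothetical.

```python
import sqlite3
from functools import lru_cache

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 10.0), ("East", 5.0), ("Central", 7.0)],
)

source_queries = 0  # counts how often the source is actually queried


@lru_cache(maxsize=128)
def total_for_region(region):
    """Push the filter and the aggregation down to the source.

    Only one number is returned instead of every row; repeated calls
    with the same region are answered from the cache.
    """
    global source_queries
    source_queries += 1
    (total,) = source.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE region = ?", (region,)
    ).fetchone()
    return total


first = total_for_region("East")   # runs in the source
second = total_for_region("East")  # served from the cache
print(first, second, source_queries)
```

A cache trades freshness for speed: if the source changes, cached totals stay stale until the cache is cleared, for example with `total_for_region.cache_clear()`.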
VIRTUALIZATION BENEFITS
•Catalog exploration
o Graphical representation of data model
o Data lineage
o Integrated catalog search
•Data Discovery
o User friendly query wizards with drill down capabilities
o Export to CSV, Excel and Tableau Data Extracts
•Secure
o Leverages Denodo’s security model and access control
o Available via SSL/TLS
GOVERNANCE - DENODO INFORMATION SELF SERVICE
CLOUD AND DV BENEFITS
• Proactive – IT has embraced cloud as a model for achieving
innovation through increased efficiency, reliability and agility
• Reusability and template development
• Rapid innovation within governance structure, balanced
costs, risks and service levels
• Greater efficiency and reliability, enabling broader audience
to consume IT services via self-service
LESSONS LEARNT
Pros
•Reduced Spend
•Live migration
•Flexible and cost effective
•Better business continuity
•Speed to deliver
•Easier to manage
•More efficient IT operations
•Low hardware costs
•No or reduced software costs
Cons
•Possible learning curve
•Accountability
•Getting all vendors to gel well
BENEFITS
• Adds context to device data
▪ Enriches and augments it with other sources (internal or external)
▪ No replication: enables Virtual Data Lake
▪ Simplifies publication of useful results
▪ End-user oriented semantic model
▪ Reports and dashboards in SQL
▪ Data as a service (REST, OData)
▪ Improves governance
▪ Security (AD integration, data restrictions, masking, etc.)
▪ End-to-end lineage
▪ Abstraction of source technologies
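The data restrictions and masking mentioned under security can be sketched as a per-role policy applied to each row before it reaches the consumer. This is an illustrative assumption about how such a rule might look, not the product's security model; the roles, field names and masking rule are hypothetical.

```python
# Hypothetical policy: which fields each role may see in clear text.
CLEAR_FIELDS = {
    "analyst": {"device_id", "region"},
    "admin": {"device_id", "region", "email"},
}


def mask(value):
    """Hide all but the last two characters of a value."""
    text = str(value)
    return "*" * max(len(text) - 2, 0) + text[-2:]


def apply_policy(row, role):
    """Return a copy of `row` with fields masked unless the role is cleared.

    Unknown roles get every field masked.
    """
    allowed = CLEAR_FIELDS.get(role, set())
    return {k: (v if k in allowed else mask(v)) for k, v in row.items()}


row = {"device_id": "d1", "region": "EMEA", "email": "user@example.com"}
analyst_view = apply_policy(row, "analyst")
admin_view = apply_policy(row, "admin")
print(analyst_view)
```

Applying the policy in one shared layer means every report and dashboard sees the same restrictions, regardless of which source the row came from.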
DATA VIRTUALIZATION – NIRVANA