Big Data Testing: Automate the Testing of Hadoop, NoSQL & DWH without Writing Code
built by
QuerySurge ™
Automated Big Data Testing without Writing Code
Testing of Hadoop and Data Warehouses Visually
Bill Hayduk, CEO/President
RTTS
Jeff Bocarsly, PhD, Chief Architect
QuerySurge / RTTS
Presentation Topics
• Testing a Data Warehouse
• Testing Big Data
• Current Data Testing Strategies
• About QuerySurge
• Demo
About FACTS
Founded: 1996
Headquarters: New York
Customer profile:
• Fortune 1000
• 600+ customers
Strategic Partners: IBM, Microsoft, HP, Oracle, Teradata, HortonWorks, Cloudera, Amazon Web Services
Software: QuerySurge
RTTS is the leading provider of software & data quality for critical business systems
“70% of enterprises have either deployed or are planning to deploy big data projects and programs this year”
– analyst firm IDG
“46% of companies cite data quality as a barrier for adopting Business Intelligence products.”
- InformationWeek
“Poor data quality is a primary reason for 40% of all business initiatives failing to achieve their targeted benefits.”
- analyst firm Gartner
Data Quality Issues
Business Intelligence (BI) software
CxOs are using Business Intelligence & Analytics to make critical business decisions – with the assumption that the underlying data is fine.
“The average organization loses $14.2 million annually through poor Data Quality.”
- Gartner
The Executive Office & Critical Data
Data Architecture
[Diagram: Source Data, Flat Files, Big Data, ETL Process, Data Warehouse; potential problem areas are marked]
Data Warehouse Testing
Data Warehouse: the Marketplace
“The data warehousing market will see a compound annual growth rate of 11.5% …to reach a total of $13.2 billion in revenue.”
- consulting specialist The 451 Group
Data Warehouse software vendors
- Analyst firm Gartner’s Magic Quadrant for Data Warehouse Database Management Systems
Leaders
Challengers
Testing the Data Warehouse: the ETL process
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB), ETL Process (Extract, Transform, Load), Target Data Warehouse]
Testing the Data Warehouse: Test Entry Points
Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB), ETL Process, Target DW, ETL Process, Data Mart, Business Intelligence software; test entry points are marked along the flow]
Big Data Testing
Big Data Vendors
Big Data technology & services market will grow at a 26.4% CAGR to $41.5 billion through 2018, or about 6x the growth rate of the overall IT market.
- Analyst firm IDC
Basic Hadoop Architecture
MapReduce (Task Tracker)
HDFS (Data Node)
MapReduce – processing part that manages the programming jobs. (a.k.a. Task Tracker)
HDFS (Hadoop Distributed File System) – stores data on the machines. (a.k.a. Data Node)
Cluster: add more machines for scaling, from 1 to 100 to 1,000.
[Diagram: a cluster of machines, each labeled Task Tracker and Data Node]
Name Node: coordination for HDFS. Inserts and extractions are communicated through the Name Node.
Job Tracker: accepts jobs, assigns tasks, identifies failed machines.
[Diagram: HiveQL on top of MapReduce (Task Tracker) and HDFS (Data Node)]
Hive - a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files:
• create
• insert
• update
• delete
• select
Hive
2 Use Cases:
1) Hadoop, Data Warehouse
2) NoSQL, Hadoop, Data Warehouse
Use Case #1: Data Warehouse & Hadoop
Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points
[Diagram: Source Data, Source Hadoop, ETL Process, Target DWH, Business Intelligence software; test entry points are marked]
Use Case #2: MongoDB, Hadoop, Data Warehouse
[Diagram: Source Data, Ingestion, Relational DB & Data Warehousing, BI, Analytics & Reporting; test entry points are marked]
2 Prevalent Data Testing Strategies
1) Stare & Compare (also known as sampling)
2) Minus Queries
Strategy #1: Stare & Compare
• Review Mapping Document (business rules, data flow mapping, data movement requirements)
• Write Tests in SQL editor
• Execute 2 Tests: 1 at Source & 1 at Target
• Dump results to 2 Excel files
• Compare results by eye (‘Stare & Compare’ or ‘sampling’)
Issue with Stare & Compare: Impossible to visually compare billions of data sets.
Result: usually less than 1% of data is compared
Example: Current QuerySurge customer has:
• a single test with 100 million rows & 200 columns
• = 20 billion data sets
• the client has > 7,000 total tests
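The scale in this example can be checked with a few lines of arithmetic. The sketch below multiplies rows by columns to get the number of data values in the single test, then works out how many values a reviewer would have to inspect by eye to reach 1% coverage. The variable names are illustrative and not taken from the presentation.

```python
# Figures from the slide: one test with 100 million rows and 200 columns.
rows = 100_000_000
columns = 200

# Each row/column intersection is one data value (a "data set" on the slide).
data_values = rows * columns
print(f"{data_values:,} data values")  # 20,000,000,000 data values

# Covering even 1% by eye would mean inspecting this many values.
one_percent = data_values // 100
print(f"{one_percent:,} values to inspect for 1% coverage")  # 200,000,000 ...
```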
Data Testing Strategy #2: Minus Queries
MINUS QUERIES subtract one result set from another result set to show the difference.
• Write 2 MINUS queries in SQL editor
• Execute MINUS queries 2x
Minus Query #1: Table_1 MINUS Table_2 (Result Set #1)
Minus Query #2: Table_2 MINUS Table_1 (Result Set #2)
Comment: MINUS QUERIES need to be executed 2x (Source MINUS Target; Target MINUS Source)
ISSUES with MINUS QUERIES
• Result sets may not be accurate when dealing with duplicate rows of data
• No historical data from past testing (audit and regulatory issues)
• Processing of minus queries puts pressure on the servers
• Double execution means 2x testing time and resource utilization
• Potential for false positives (bad data could exist on both sides of an ETL leg)
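The two-direction MINUS approach, and its blind spot for duplicate rows, can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module, where the operator is spelled `EXCEPT` (Oracle spells it `MINUS`). The tables `table_1` and `table_2` and their contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER, name TEXT);
    CREATE TABLE table_2 (id INTEGER, name TEXT);
    -- table_1 holds row (2, 'Bob') twice; table_2 holds it once.
    INSERT INTO table_1 VALUES (1, 'Ann'), (2, 'Bob'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO table_2 VALUES (1, 'Ann'), (2, 'Bob'), (4, 'Dee');
""")

def minus(left, right):
    """Rows in `left` that are absent from `right` (EXCEPT is SQLite's MINUS)."""
    sql = f"SELECT id, name FROM {left} EXCEPT SELECT id, name FROM {right}"
    return sorted(conn.execute(sql).fetchall())

# Both directions are needed to see differences on either side.
source_minus_target = minus("table_1", "table_2")
target_minus_source = minus("table_2", "table_1")
print(source_minus_target)  # [(3, 'Cy')]
print(target_minus_source)  # [(4, 'Dee')]
```

Note that the extra copy of (2, 'Bob') in `table_1` shows up in neither result set, because the set operator works on distinct rows. That is the duplicate-row inaccuracy listed among the issues.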
Data Testing Strategies
a fundamental issue with both current strategies:
Assumption that all team members understand and can write SQL or HQL code
About QuerySurge ™
What is QuerySurge ™?
the collaborative Big Data Testing solution that finds bad data & provides a holistic view of your data’s health
the QuerySurge advantage
• Automate the entire testing cycle: automate kickoff, tests, comparison, auto-emailed results
• Create tests easily with no programming: ensures minimal time & effort to create tests / obtain results
• Test across different platforms: data warehouse, Hadoop, NoSQL, database, flat file, XML
• Collaborate with team: Data Health dashboard, shared tests & auto-emailed reports
• Verify more data & do it quickly: verifies up to 100% of all data up to 1,000 x faster
• Integrate for Continuous Delivery: integrates with most Build, ETL & QA management software
Collaboration
• Testers: functional testing, regression testing, result analysis
• Developers / DBAs: unit testing, result analysis
• Data Analysts: review and analyze data, verify mapping failures
• Operations teams: monitoring, result analysis
• Managers: oversight, result analysis
Share information on the health of your data
QuerySurge™ Architecture
Web-based…
Installs on...
Linux
Connects to…
…or any other JDBC compliant data source
How QuerySurge Works
[Diagram: QuerySurge Controller, QuerySurge Server, QuerySurge Agents; SQL and HQL queries; Flat Files]
• QS pulls data from data sources
• QS pulls data from target data store
• QS compares data quickly
• QS generates reports, audit trails
Reports, Data Health Dashboard, auto emails
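The pull, pull, compare, report flow can be sketched generically as follows. This is not QuerySurge's implementation or API: it is a minimal illustration using Python's built-in `sqlite3` module, with a hypothetical `customers` table and a hypothetical `run_query_pair` helper that compares the two result sets keyed on their first column.

```python
import sqlite3

def make_db(rows):
    """Create an in-memory database with a hypothetical `customers` table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
    )
    db.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    return db

source_db = make_db([(1, "Ann", 10.0), (2, "Bob", 20.0), (3, "Cy", 30.0)])
target_db = make_db([(1, "Ann", 10.0), (2, "Bob", 25.0)])

def run_query_pair(source, source_sql, target, target_sql):
    """Run one query per side and compare results keyed on the first column."""
    src = {row[0]: row for row in source.execute(source_sql)}
    tgt = {row[0]: row for row in target.execute(target_sql)}
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "missing_in_source": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(
            k for k in src.keys() & tgt.keys() if src[k] != tgt[k]
        ),
    }

query = "SELECT id, name, balance FROM customers"
report = run_query_pair(source_db, query, target_db, query)
print(report)
```

Here row 3 never reached the target and row 2 arrived with a different balance, so the report lists key 3 as missing in the target and key 2 as mismatched.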
Source Data / Target Data
Data Stores: • Databases • Data Warehouses • Data Marts
Flat Files: • Fixed Width • Delimited • Excel
Big Data stores: • Hadoop • NoSQL
Data Warehouses
XML
all QuerySurge™ Modules
Design Library
Scheduling
Deep-Dive Reporting
Run Dashboard
Query Wizards
Data Health Dashboard
QuerySurge™ Modules
Design Library
• Create Query Pairs (source & target SQLs)
• Great for team members skilled with SQL
Scheduling
• Build groups of Query Pairs
• Schedule Test Runs
Deep-Dive Reporting
• Examine and automatically email test results
Run Dashboard
• View real-time execution
• Analyze real-time results
QuerySurge Test Management Connectors
Drive QuerySurge execution from your Test Management Solution
Outcome results (Pass/Fail/etc.) are returned from QuerySurge to your Test Management Solution
Results are linked in your Test Management Solution so that you can click directly into detailed QuerySurge results
• HP ALM (Quality Center)
• Microsoft Team Foundation Server
• IBM Rational Quality Manager
Integration with leading Test Management Solutions
QuerySurge & DevOps: Continuous Delivery & Integration
Automated Launch, Automated Testing, Automated Reporting
[Diagram, three panels: Data Integration/ETL solutions, Test Management solutions, Automated Build solutions (and many others…); each panel shows QuerySurge ™ and an email report]
Introducing the new Query Wizards
We just made data testing REALLY EASY!
No programming needed
Testing Big Data Visually
Recent Survey: Data Experts
From a recent poll [1] of:
• Big Data Experts
• Data Warehouse Architects
• Solution Architects
• ETL Architects
Our Question: What % of columns in your projects have no transformations at all?
Consensus Answer: 80% of data columns have no transformation at all
[1] Poll conducted by RTTS on targeted LinkedIn groups
Why is this important?
Fast and Easy. No programming needed.
QuerySurge™ Modules
Compare by Table, Column & Row
• Perform 80% of all data tests
• Automatically generates SQL & HQL code
• Opens up testing to novice & non-technical team members
• Speeds up testing for skilled SQL coders
• Provides a huge Return-On-Investment
QuerySurge™ Modules
3 Types of Data Comparison Wizards:
• Column-Level Comparison: This is great for Big Data stores and Data Warehouses.
• Table-Level Comparison: This comparator is great for Data Migrations and Database Upgrades.
• Row Count Comparison: Great for all of Big Data stores, Data Warehouses, Data Migrations and Database Upgrades.
The Query Wizards also provide you with automated features for:
o filtering (‘Where’ clause) and
o sorting (‘Order By’ clause)
Column-Level Comparison
Uses: Tests the columns that have no transformations, which means it tests approximately 80% of your data store without you writing any SQL code
Tests: Big Data, Data Warehouses
Value added:
• novice or non-technical: no coding needed, productive immediately
• experienced user: saves time
• Pick Source & Target
• Pick Comparison Type (we picked Column-Level Comparison)
• Select Tables & Columns
• Filter (‘Where’ clause)
• Auto-generated SQL
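To make the idea of auto-generated SQL concrete, here is a minimal sketch of how a column-level comparison could turn the wizard's choices (table, columns, optional filter and sort) into a source and target query. It is not the product's actual generator: the function `build_column_query` and the table names `src.customers` and `dwh.dim_customer` are hypothetical.

```python
def build_column_query(table, columns, where=None, order_by=None):
    """Build a SELECT for the chosen columns with optional filter and sort.

    Identifiers are inserted as-is, so this sketch assumes trusted input.
    """
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    if order_by:
        sql += f" ORDER BY {', '.join(order_by)}"
    return sql

# Hypothetical source and target tables whose columns have no transformation.
source_sql = build_column_query(
    "src.customers", ["id", "name"], where="id > 100", order_by=["id"]
)
target_sql = build_column_query(
    "dwh.dim_customer", ["id", "name"], where="id > 100", order_by=["id"]
)
print(source_sql)
print(target_sql)
```

Because untransformed columns need the same projection on both sides, the two queries differ only in the table name, which is why this kind of test lends itself to generation rather than hand coding.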
Table-Level Comparison
Uses: Verifies data loads when no transformation occurs
Tests: data migrations, upgrades
Value added:
• novice or non-technical: no coding needed
• experienced user: saves time
Row Count Comparison
Use: Verify that the number of rows in the source matches the number of rows in the target
Tests: Big data, data warehouse, data migration, database upgrades, data interfaces
Value added:
• novice: no coding needed
• experienced user: saves time
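A row count comparison reduces to one `COUNT(*)` per side. The sketch below shows the check with Python's built-in `sqlite3` module; the tables `source_orders` and `target_orders` and the helper `row_count` are hypothetical, and both tables live in one in-memory database only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER);
    CREATE TABLE target_orders (id INTEGER);
    INSERT INTO source_orders VALUES (1), (2), (3);
    INSERT INTO target_orders VALUES (1), (2);
""")

def row_count(connection, table):
    """Return the number of rows in `table` (table name assumed trusted)."""
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

source_count = row_count(conn, "source_orders")
target_count = row_count(conn, "target_orders")

# The test passes only when both sides hold the same number of rows.
status = "PASS" if source_count == target_count else "FAIL"
print(f"source={source_count} target={target_count} -> {status}")
```

A matching count does not prove the rows are identical, so this check is best read as a fast first gate ahead of column-level or table-level comparison.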
05/01/2023
Training Courses
Data Warehouse Testing
• Data Warehouse & ETL Testing Fundamentals (1 day)
• Fundamentals of QuerySurge (1 day)
• Introduction to SQL for QuerySurge (1 day)
• Advanced SQL techniques for QuerySurge (1 day)
Big Data Testing
• Big Data and ETL Testing Fundamentals
• Introduction to Big Data Testing Using Hive and HQL
Consulting
RTTS, the software quality experts (and developer of QuerySurge), provides consulting solutions to the challenges of Big Data & Data Warehouse / ETL Testing
• Jumpstart 2-week program – combines training courses, mentoring, consulting
• Staff Augmentation – add additional RTTS resources to your team
• Outsourcing - RTTS can perform all testing, including planning, design, execution
QuerySurge™ Free Trials
(1) Trial in the Cloud of QuerySurge™, including self-learning tutorial that works with sample data for 3 days
(2) Downloaded Trial of QuerySurge™, including self-learning tutorial with sample data or your data for 15 days
(3) Proof of Concept of QuerySurge™ includes our team of experts assisting you for 30 days
For more information on (1), (2) and (3), go to http://www.querysurge.com/compare-trial-options
QuerySurge Demo