Welcome
Tableau Prep: Below Decks
#TC18
Doug Thomae
Staff Software Engineer
Tableau
Goal
That you understand enough to start investigating the operation of Tableau Prep on your own, if you want to.
Agenda
American Culture
Executing A Flow (Batch)
Measuring Culture
Interacting With A Flow
Operationalizing and Scaling Tableau Prep
American Culture
What is Culture
The social behavior and norms found in human societies
It tells us how we should behave and relate to other people and other cultures
A “programmed” lens that affects how we interpret events in our environment
What’s dangerous?
What’s beneficial?
How do we decide?
Culture has a large influence on mass behavior…but influences individuals in the culture to varying degrees
American “Nations”
The USA consists of 11 regional cultures
Based on history; outlined in dialect maps and genetic studies
Complex mapping from culture to politics, religion and issues
Similar ideas that came before:
Wilbur Zelinsky, “Doctrine of First Effective Settlement”
Kevin Phillips, The Emerging Republican Majority, 1969
Joel Garreau, The Nine Nations of North America, 1981
David Hackett Fischer, Albion’s Seed, 1989
David Hackett Fischer, Champlain’s Dream, 2008
Bill Bishop with Robert Cushing, The Big Sort, 2008
…and others…
“American Nations” vs. Genome Map
Han, Carbonetto, Curtis, Wang, Granka, Byrnes, Noto, Kermany, Myres, Barber, Rand, Song, Roman, Battat, Elyashiv, Guturu, Hong, Chahine, Ball, “Clustering of 770,000 genomes reveals post-colonial population structure of North America”, Nature Communications 8, Article number 14238, 07 February 2017
Executing Batch Flows
The Beginning of the Beginning: Filtered Map
Tableau Prep Desktop is Web Client/Server
Electron’s embedded Chrome
Electron
TP Front End (TypeScript, React, Redux)
TP Back End (Spring + collection of Java services)
Tableau Query Pipeline + Connector Platform
HTTPS
AQL
PostgreSQL Server (“Customer Database”)
Front End | Back End | C++ Stack
Tableau Prep Back End Services
Service Name – Purpose
Cache Analysis Service – Caching of data for interactive operation
Connection Service – Presentation models for connection dialogs. Enables sharing of presentation models/dialogs with rest of Tableau products
File Service – Storing and saving .tfl/.tflx documents. Probably should be named “document service”
Flow Executor Service – Entry point for compiling and initiating flow runs
Flow Operation Service – Manages binning and brushing during interactive operation
Function Def Service – Retrieves Tableau function definitions from C++ stack. Enables sharing of functions/formulas with rest of Tableau products
Tableau Prep Back End Services
Service Name – Purpose
Desktop Integration Service – Returns information about installed Tableau Desktop products
Licensing Service – License validation and activation
Versioning Service – Document versioning (in the “documents from different releases” sense of version)
LoomDoc Validator Service – Analyzes/validates LoomDoc objects
Node Validator Service – Validates single nodes by doing a front end compile and returning errors (or not)
MRU Flow Service – Persists/retrieves the most recently used documents list
Tableau Prep Back End Services
Service Name – Purpose
Status Service – Tracks/returns the status of flows that have been initiated by the Flow Executor Service.
Telemetry Service – Gathers/sends telemetry (if the user has chosen that option).
What is a Tableau Prep Flow?
Answer 1: It’s the graph displayed in the top pane
Answer 2: It’s a specification defined in loom-lang
“loom-lang” is the language that captures flow definitions
Its only current textual form is JSON
Answer 3: It’s a set of specifications for queries
Each node in a flow is a specification for a query (e.g. for a SQL database or in Hyper)
When federation is involved, it may be multiple queries
Same Flow Graph, One Level Down
Input Node Output Node
Container Node
Filter off HI
Flow Document/Loom-Lang
{
  "nodes": {
    <see next page>
  },
  "connections": {
    "53bcf9c0-59a8-4f42-bf28-daf4be6b144c": {
      "connectionType": ".v1.SqlConnection",
      "isPackaged": false,
      "name": "dthomae2.tsi.lan",
      "connectionAttributes": {
        "server": "dthomae2.tsi.lan",
        "dbname": "tc18",
        "port": "5432",
        "class": "postgres"
      }
    }
  }
}
Connection id, unique within a flow
Standard fields for all connections
connectionAttributes differ by connection class
Flow Document/Loom-Lang, continued
{
  "nodes": {
    "074e9fd5-e4a5-4217-80b2-2caa214f02bf": {
      "nodeType": ".v1.LoadSql",
      "name": "county_to_nation_map",
      "id": "074e9fd5-e4a5-4217-80b2-2caa214f02bf",
      "baseType": "input",
      "nextNodes": [
        {
          "namespace": "Default",
          "nextNodeId": "a77e4d8e-387d-4ccd-be23-75b487896686",
          "nextNamespace": "Default"
        }
      ],
      <node type specific fields>
    }
  },
  "connections": {…}
}
Node id, unique within a flow
Standard fields for all nodes
074e9fd5-e4a5-4217-80b2-2caa214f02bf
Flow Document/Loom-Lang, continued
{
  "nodes": {
    "074e9fd5-e4a5-4217-80b2-2caa214f02bf": {
      "nodeType": ".v1.LoadSql",
      "baseType": "input",
      "nextNodes": [{"nextNodeId": "a77…", "nextNamespace": "Default"}]
    },
    "a77e4d8e-387d-4ccd-be23-75b487896686": {
      "nodeType": ".v1.Container",
      "baseType": "container",
      "nextNodes": [{"nextNodeId": "132…", "nextNamespace": "Default"}],
      "loomContainer": {
        "nodes": {
          "120daf25-3ae2-4f11-b83d-b5c87651edfd": {
            "nodeType": ".v1.RangeFilter",
            "baseType": "transform",
            "nextNodes": []
          }
        }
      }
    }
  },
  …
Loom-Lang, continued
All nodes have:
A type, which has a version component and a type name
A name
An id which is unique within the flow
A base type, one of input, output, transform, container, and supernode (another type of container)
…followed by node type specific fields
Every node input and output exists in a namespace:
Namespaces are how Tableau Prep keeps duplicate column names straight
Single input/single output nodes use the “Default” namespace
Join nodes have an incoming “Left” and “Right” namespace
General multi-input nodes (e.g. Unions) generate guids as namespaces
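The node structure described above can be explored with a few lines of code. What follows is a minimal Python sketch, assuming a flow document has already been parsed into a dict shaped like the excerpts on the previous slides. The function name walk_nodes and the shortened node ids in the sample are illustrative only and are not part of Tableau Prep.

```python
from typing import Iterator, Tuple

def walk_nodes(nodes: dict, depth: int = 0) -> Iterator[Tuple[int, str, dict]]:
    """Yield (depth, node_id, node) for every node, descending into containers."""
    for node_id, node in nodes.items():
        yield depth, node_id, node
        # Container nodes carry a nested "nodes" map under "loomContainer".
        nested = node.get("loomContainer", {}).get("nodes", {})
        yield from walk_nodes(nested, depth + 1)

# Sample shaped like the slides; ids are shortened, hypothetical values.
flow = {
    "nodes": {
        "074e": {"nodeType": ".v1.LoadSql", "baseType": "input",
                 "nextNodes": [{"nextNodeId": "a77e", "nextNamespace": "Default"}]},
        "a77e": {"nodeType": ".v1.Container", "baseType": "container",
                 "nextNodes": [],
                 "loomContainer": {"nodes": {
                     "120d": {"nodeType": ".v1.RangeFilter",
                              "baseType": "transform", "nextNodes": []}}}},
    }
}

for depth, node_id, node in walk_nodes(flow["nodes"]):
    targets = [n["nextNodeId"] for n in node.get("nextNodes", [])]
    print("  " * depth + f"{node_id} {node['nodeType']} ({node['baseType']}) -> {targets}")
```

Nested nodes are reported one level deeper than their container, which mirrors how the container's recipe steps sit inside the container node in the document.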
Compilation and Queries
Flow Executor Service
Loom engine
Front end compiler
Pre-compilation
Build node and type info
Back end compiler
Build execution plan
Create node logical and physical models
AQLRunner
Query pipeline
Connector platform
PostgreSQL database
Database agnostic “logical” query
Database dependent SQL query
Error info
Tableau data platform
Logical Query for Our Flow
<logical-query>
  <selects>
    <field>[stcou]</field>
    ...other fields
  </selects>
  <projectOp class=\"logical-operator\">
    <expressions>
      <binding name=[stcou]>
        <identifierExp identifier=\"[stcou]\" class=\"logical-expression\"/>
      </binding>
      …other fields
    <projectOp>
    <selectOp class=\"logical-operator\">
      <predicate>
        <funcallExp function=\"!\" shape=\"scalar\" class=\"logical-expression\">
          <funcallExp function=\"&&\" shape=\"scalar\" class=\"logical-expression\">
            <funcallExp function=\"==\" shape=\"scalar\" class=\"logical-expression\">
              <identifierExp identifier=\"[state]\" class=\"logical-expression\"/>
              <literalExp value=\""HI"\" datatype=\"string\" class=\"logical-expression\"/>
            </funcallExp>
            <funcallExp function=\"!\" shape=\"scalar\" class=\"logical-expression\">
              <funcallExp function=\"ISNULL\" shape=\"scalar\" class=\"logical-expression\">
                <identifierExp identifier=\"[state]\" class=\"logical-expression\"/>
              </funcallExp>
            </funcallExp>
          </funcallExp>
        </funcallExp>
      </predicate>
      …table and field name information
</logical-query>
PostgreSQL Query
SELECT "e1b673e1-afa9-47aa-bbac-0b12dc"."stcou" AS "stcou",
  "e1b673e1-afa9-47aa-bbac-0b12dc"."county" AS "county",
  "e1b673e1-afa9-47aa-bbac-0b12dc"."state" AS "state",
  "e1b673e1-afa9-47aa-bbac-0b12dc"."nation" AS "nation"
FROM "public"."county_to_nation_map" "e1b673e1-afa9-47aa-bbac-0b12dc"
WHERE (NOT (("e1b673e1-afa9-47aa-bbac-0b12dc"."state" = 'HI')
  AND (NOT ("e1b673e1-afa9-47aa-bbac-0b12dc"."state" IS NULL))))
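The generated predicate can look redundant, but under SQL's three-valued logic it behaves differently from a plain inequality: rows where state is NULL pass the NOT (... AND NOT ... IS NULL) form, while state <> 'HI' would drop them. The sketch below shows this with Python's built-in sqlite3 module. SQLite stands in for PostgreSQL here, and the three sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE county_to_nation_map (state TEXT)")
conn.executemany("INSERT INTO county_to_nation_map VALUES (?)",
                 [("HI",), ("WA",), (None,)])

def states(where: str) -> list:
    """Return the state values that satisfy the given WHERE clause."""
    sql = f"SELECT state FROM county_to_nation_map WHERE {where} ORDER BY rowid"
    return [row[0] for row in conn.execute(sql)]

# Same shape as the predicate in the generated query above.
generated = states("NOT ((state = 'HI') AND (NOT (state IS NULL)))")
# A hand-written filter that looks equivalent but is not.
naive = states("state <> 'HI'")

print(generated)  # the NULL row survives
print(naive)      # the NULL row is dropped
```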
We Use Hyper Under the Covers a Lot
Local Files (e.g. .csv, .xls) are put into Hyper:
Connector creates a table in Hyper and transfers data into it
Queries are then generated for Hyper, just as if it was any other database
Hyper is used for federation
Federation brings together data in one place to do cross database joins
Hyper is the place where the data is brought together
Federation for Tableau Prep is exactly the same as it is for other Tableau products
Handling Local Files, continued
When the same data comes from a .csv instead of PostgreSQL, Hyper sees:
1) At ingestion time a table is created and data copied into it
CREATE TABLE "TableauTemp"."CountyToNationMapUSA#csv" ("STCOU" BIGINT, "County" TEXT COLLATE "en_US", "State" TEXT COLLATE "en_US", "Nation" TEXT COLLATE "en_US")
COPY "TableauTemp"."CountyToNationMapUSA#csv" ("STCOU", "County", "State", "Nation") FROM STDIN WITH (FORMAT HYPERBINARY, SANITIZE)
2) Later, when the query happens
SELECT "-1384900078"."STCOU" AS "STCOU",
  "-1384900078"."County" AS "County",
  "-1384900078"."State" AS "State",
  "-1384900078"."Nation" AS "Nation"
FROM "TableauTemp"."CountyToNationMapUSA#csv" "-1384900078"
WHERE (NOT (("-1384900078"."State" = 'HI') AND (NOT ("-1384900078"."State" IS NULL))))
Measuring Culture
Hofstede’s Cultural Dimensions
Geert Hofstede devised a set of dimensions that can be used to compare cultures:
Power Distance—degree of acceptance of unequal power
Individualism vs. collectivism—degree of integration into groups
Uncertainty avoidance—a society’s tolerance for things outside the status quo
Masculinity vs. femininity—degree of preference for achievement, heroism, assertiveness
Long-term orientation—degree to which a society is able/willing to adapt or change
Indulgence vs. restraint—degree to which behavior is controlled by social norms
Hofstede’s Dimensions Can Be Measured
High power distance:
Greater income inequality
Smaller middle class
Dictatorships or oligarchies
Violence in national politics
Political systems changed by revolution
Business executives older
Innovations only when supported by hierarchy
Low power distance:
Smaller income inequality
Larger middle class
Separation of powers
Peaceful political conflict resolution
Political systems changed by evolution
Business executives younger
Spontaneous innovations
Note that these are measures that allow cultures to be compared, not absolute indices
Gini Coefficient
The Gini coefficient measures income inequality:
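For readers who want the definition in concrete terms, here is a small Python sketch of one standard way to compute a Gini coefficient from a list of incomes (0 means perfect equality, values near 1 mean one person holds nearly everything). The function name and sample incomes are illustrative. The ACS data used later in this talk already contains computed Gini values, so this only illustrates the measure.

```python
def gini(incomes):
    """Gini coefficient of a list of non-negative incomes."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    # Rank-weighted form: G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n, ranks i from 1.
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

print(gini([10, 10, 10, 10]))  # 0.0  (everyone has the same income)
print(gini([0, 0, 0, 100]))    # 0.75 (one person has everything; max for n=4)
```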
Interacting With Flows
Prepping ACS Gini Data
Binning
Binning produces the vertical “bar chart” of values in the profile pane:
The Flow Operation service is called to generate the binned values
It uses the Flow Executor service to generate the actual queries
Binning uses various “bin strategies” to decide how to do the actual binning:
A bin strategy decides how to select the values/ranges that will be shown to the user
When the user clicks on the node a bin strategy is chosen based on the type of the column
The final binning operation is a count of something, although continuous ranges need to be partitioned first
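The idea of a bin strategy can be sketched in Python as follows. This is not Tableau Prep's implementation (which issues queries to Hyper); the function names, the equal-width partitioning, and the default of four bins are all assumptions chosen to illustrate "pick a strategy from the column type, partition if continuous, then count".

```python
from collections import Counter

def bin_discrete(values):
    """Count occurrences of each distinct value (None is kept as its own bin)."""
    return dict(Counter(values))

def bin_continuous(values, bin_count=4):
    """Partition the non-null range into equal-width bins, then count per bin."""
    present = [v for v in values if v is not None]
    if not present:
        return {}
    low, high = min(present), max(present)
    width = (high - low) / bin_count or 1  # avoid zero width when all values match
    counts = Counter()
    for v in present:
        index = min(int((v - low) / width), bin_count - 1)  # top edge joins last bin
        counts[(low + index * width, low + (index + 1) * width)] += 1
    return dict(counts)

def choose_bin_strategy(values):
    """Pick a strategy based on the type of the column's values."""
    numeric = all(isinstance(v, (int, float)) for v in values if v is not None)
    return bin_continuous if numeric else bin_discrete

states = ["WA", "HI", "WA", None]
print(choose_bin_strategy(states)(states))
estimates = [0, 1, 2, 3, 4, 8]
print(choose_bin_strategy(estimates)(estimates))
```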
Walking Through the Gini Flow (Binning)
The Hyper View
SELECT COUNT(1) AS Measure,
t0.Dimension AS Dimension
FROM (
SELECT hyper.GEO.id AS GEO.id,
hyper.HD02_VD01 AS HD02_VD01,
hyper.GEO.display-label AS GEO.display-label,
hyper.HD01_VD01 AS HD01_VD01,
hyper.File Paths AS File Paths,
hyper.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82,
hyper.GEO.id2 AS GEO.id2,
hyper.HD02_VD01 AS Dimension
FROM Extract.tmp-e20YjwL8onYfAWC4NZP4GxsweOTq18Hj4q0mczFO7bI=-Default hyper
LIMIT 1048576
) t0
WHERE ((NOT (t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 IS NULL))
AND (t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 > 0)
AND ((t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 IS NULL)
OR (t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 <= 8868)))
GROUP BY 2
ORDER BY Dimension ASC NULLS FIRST
Hyper uses PostgreSQL conventions; COUNT(1) is the same as COUNT(*)
TP issues one query per column; the column specified “AS Dimension” changes in each one
This has to do with paging data to the UI
Null is always sorted to the top
Interactive ops will always limit to default 1M rows
Brushing
Brushing is binning with a condition:
The user picks the condition by clicking on a value
Brushing and binning are both performed by “analyzers”. Binners are a special case analyzer.
There are null values for the Gini coefficient values. Where do they come from?
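"Binning with a condition" can be made concrete with a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in for Hyper, a made-up four-row table, and a hypothetical helper bin_query that produces a query shaped like the ones on these slides. Passing a condition is the only difference between the binning and brushing cases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gini (label TEXT, estimate REAL)")
conn.executemany("INSERT INTO gini VALUES (?, ?)", [
    ("Geography", None), ("County A", 0.41),
    ("County B", 0.45), ("County C", 0.41),
])

def bin_query(dimension: str, condition: str = "") -> str:
    """Build a binning query; a non-empty condition makes it a brushing query."""
    where = f"WHERE {condition} " if condition else ""
    # The alias is used in GROUP BY here; the slides' queries use GROUP BY 2.
    return (f"SELECT COUNT(1) AS Measure, {dimension} AS Dimension "
            f"FROM gini {where}GROUP BY Dimension ORDER BY Dimension ASC")

# Binning: counts per distinct estimate (SQLite sorts NULL first when ascending).
binned = conn.execute(bin_query("estimate")).fetchall()
# Brushing: the user clicked the null estimate, so other columns are re-binned
# using only the rows where the estimate is null.
brushed = conn.execute(bin_query("label", "estimate IS NULL")).fetchall()

print(binned)
print(brushed)
```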
Walking Through the Gini Flow (Brushing)
The Hyper View
The query generated when Gini Coefficient null value was selected
SELECT COUNT(1) AS Measure,
t0.Dimension AS Dimension
FROM (
SELECT hyper.GEO.id AS GEO.id,
hyper.HD02_VD01 AS HD02_VD01,
hyper.GEO.display-label AS GEO.display-label,
hyper.HD01_VD01 AS HD01_VD01,
hyper.File Paths AS File Paths,
hyper.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82,
hyper.GEO.id2 AS GEO.id2,
hyper.HD02_VD01 AS Dimension
FROM Extract.tmp-e20YjwL8onYfAWC4NZP4GxsweOTq18Hj4q0mczFO7bI=-Default hyper
LIMIT 1048576
) t0
WHERE (t0.HD01_VD01 IS NULL)
GROUP BY 2
ORDER BY Dimension ASC NULLS FIRST
It’s the binning query with a condition added
It’s still the old name – the column name is still the same at the db level and TP knows that
The Hyper View
After the exclusion of Geography is added to the recipe the binning query looks like:
SELECT t0.GEO.id AS GEO.id,
t0.File Paths AS File Paths,
t0.GEO.id2 AS GEO.id2,
t0.Gini Coefficient Error AS Gini Coefficient Error,
t0.Gini Coefficient AS Gini Coefficient,
t0.GEO.display-label AS GEO.display-label,
t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82
FROM (
SELECT hyper.GEO.id AS GEO.id,
hyper.HD02_VD01 AS HD02_VD01,
hyper.GEO.display-label AS GEO.display-label,
hyper.HD01_VD01 AS HD01_VD01,
hyper.File Paths AS File Paths,
hyper.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82,
hyper.GEO.id2 AS GEO.id2,
hyper.HD02_VD01 AS Gini Coefficient Error,
hyper.HD01_VD01 AS Gini Coefficient
FROM Extract.tmp-e20YjwL8onYfAWC4NZP4GxsweOTq18Hj4q0mczFO7bI=-Default hyper
LIMIT 1048576
) t0
WHERE (NOT ((t0.GEO.display-label = 'Geography') AND (NOT (t0.GEO.display-label IS NULL))))
LIMIT 1000
The query handles the column rename
Conditions from the recipe are included in future binning and brushing queries
Log File Locations
“My Tableau Prep Repository” / “Logs” directory
hyperd.log – the operations (including queries) as seen by Hyper
log.txt – the operations (including queries) that are sent by Tableau Prep
Download Tableau Log Viewer!
https://github.com/tableau/tableau-log-viewer
Exercise For the Interested:
What query is sent to get “what’s in/what’s out” data in a join node?
Hint: Search for “FULL” in hyperd.log in Tableau Log Viewer or text editor
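If you prefer a script to a text editor, the search in the hint can be done with a few lines of Python. This is only a sketch: the helper name find_matches is made up, and the path below is a guess at a typical repository location that you will likely need to change.

```python
from pathlib import Path

def find_matches(lines, keyword="FULL"):
    """Return (line_number, text) for every line that contains keyword."""
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if keyword in line]

# Hypothetical location; point this at your own repository's Logs directory.
log_file = Path.home() / "Documents" / "My Tableau Prep Repository" / "Logs" / "hyperd.log"
if log_file.exists():
    with open(log_file, encoding="utf-8", errors="replace") as log:
        for number, text in find_matches(log):
            print(number, text[:200])  # truncate long log lines for display
```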
Getting Started With the Exercise
Walking Through the Gini Flow (Join)
How Do Gini Coefficients Compare?
Operationalizing and Scaling Tableau Prep
Tableau Prep Conductor – V1
Add Tableau Prep Capabilities to Tableau Server:
Use the same Flow Executor Service used by Tableau Prep Desktop
Use the same Versioning Service used by Tableau Prep Desktop
Add:
Flow Orchestrator Service – to set up connections needed by Flow Executor Service
Flow Publishing Service – API to publish flows
Flow Service – API for UI to retrieve flow inputs/outputs, decrypt credentials and other functions
Extend/Use Existing Tableau Server Mechanisms:
Job type to schedule flows
Backgrounder support for running flows
Secure credential storage
Enforcement of permissions
Extension of content types (e.g. data sources, workbooks, flows)
Extension of administrative views
Tableau Prep Conductor – Post V1
Enable Web Authoring:
Port most remaining services
Port existing web UI
Server version of Hyper caching
Improve Scheduling and Resource Management:
Part of larger data platform efforts
Trigger flow runs when inputs are updated
Scaling to Larger Datasets
Output To Database:
Tableau Prep currently outputs to local files or data sources (hyper, csv)
For large datasets move computation to data:
- Generate a query using existing mechanisms
- Wrap it in an upsert, send to database. Tableau systems never handle data at all in batch runs.
Augmenting Data Warehouse/Lake With Local Data:
Don’t pull down the big dataset to federate with the local data
Push the smaller, local data to a temp table to work with larger dataset
Incremental Update and Query:
Large Data Warehouse/Lake datasets are built one hourly/daily/weekly/etc. update at a time
Parameterized Data Pulls
Incremental Upserts
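To make "wrap it in an upsert" concrete, here is a hedged Python sketch that wraps an already-generated SELECT in a PostgreSQL-style INSERT ... ON CONFLICT statement. None of this is Tableau Prep code: the function name, the target table, and the choice of PostgreSQL syntax are assumptions, and identifiers are assumed to be trusted (no escaping of embedded quotes is attempted).

```python
def wrap_in_upsert(select_sql, target, columns, key_columns):
    """Wrap a SELECT in a PostgreSQL-style INSERT ... ON CONFLICT statement."""
    updates = [c for c in columns if c not in key_columns]
    column_list = ", ".join(f'"{c}"' for c in columns)
    key_list = ", ".join(f'"{c}"' for c in key_columns)
    if updates:
        # On a key collision, overwrite non-key columns with the incoming row.
        assignments = ", ".join(f'"{c}" = EXCLUDED."{c}"' for c in updates)
        action = f"DO UPDATE SET {assignments}"
    else:
        action = "DO NOTHING"
    return (f"INSERT INTO {target} ({column_list})\n"
            f"{select_sql}\n"
            f"ON CONFLICT ({key_list}) {action}")

sql = wrap_in_upsert(
    'SELECT "stcou", "nation" FROM "public"."county_to_nation_map"',
    '"public"."county_out"',  # hypothetical output table
    ["stcou", "nation"],
    ["stcou"],
)
print(sql)
```

Because the whole statement runs inside the database, the rows never leave it, which is the point the slide makes about batch runs.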
Finishing Up
What Should I Remember?
Tableau Prep flows are specifications that get turned into queries
Tableau Prep Desktop is actually a client/server system
Tableau Prep is built on top of the Tableau data platform
Tableau Prep is architected to scale…although many of the mechanisms aren’t built out yet
Tableau Prep | Below Decks
SESSION REPEATS
Tue 10/23 | 2:15 – 3:15 | MCCNO – L2 - 297
Wed 10/24 | 12:00 – 1:00 | MCCNO – L2 - 263
RELATED SESSIONS
Preparing Your Data the Tableau Prep Way
Thu 10/25 | 12:30 – 1:30 | MCCNO – L3 - 388
How Aggregate Friends and Influence Pivots
Wed 10/24 | 3:30 – 4:30 | MCCNO – L2 – New Orleans Theater A
Please complete the session survey from the Session Details screen in your TC18 app
.tfl/.tflx Files
Both are always in zip format:
Some older output files from Tableau Alpha were JSON files
The x on .tflx files is a hint that they contain data files, but has no other significance
You can open them up with a standard zip utility.
They’re not encrypted and will never contain secrets like passwords
The content is segmented by zip stream:
maestroMetadata – other stream names, “document versions”
displaySettings – data or config that affects the way things are displayed (e.g. column order)
flow – the flow definition in loom-lang
data files (streams named using a guid to avoid name collision)
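Since these files are ordinary zip archives, Python's standard zipfile module can open them. The sketch below lists the streams and loads the flow definition. The helper names are made up, the flow stream is assumed to be UTF-8 JSON, and a stand-in archive is built in memory so the example runs without a real .tfl file.

```python
import io
import json
import zipfile

def list_streams(flow_file):
    """Return the names of the zip entries inside a .tfl/.tflx file."""
    with zipfile.ZipFile(flow_file) as archive:
        return archive.namelist()

def read_flow(flow_file):
    """Load the loom-lang JSON from the 'flow' entry (assumed UTF-8)."""
    with zipfile.ZipFile(flow_file) as archive:
        return json.loads(archive.read("flow").decode("utf-8"))

# Stand-in archive with the stream names from the slide; contents are made up.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("maestroMetadata", "{}")
    archive.writestr("displaySettings", "{}")
    archive.writestr("flow", json.dumps({"nodes": {}, "connections": {}}))

print(list_streams(sample))
print(sorted(read_flow(sample)))
```

With a real file, pass its path (for example a .tfl saved from Tableau Prep) instead of the in-memory sample.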