Understanding Big Data for policy professionals


BA

It is not only Hadoop…


• Big Data are new types of data freed from the limitations we had to impose decades ago because of the state of hardware and software at the time

• The main challenge is therefore unlearning those limitations, and learning to incorporate Big Data capabilities and agility into [policy] work

• Traditional reporting and BI work with “known knowns”. Big Data allows working with “known unknowns”, “unknown knowns” and “unknown unknowns”.

• Several distinct types of technology fall under the “Big Data” moniker, each with its own capabilities: Hadoop, NoSQL, Semantic, Graph

© Copyright Business Abstraction Pty Ltd 2014-2015


• Consists of tables tightly packed with data, specific type per row

• Tables identified and created in advance

• Tables populated from human input

• Tables used by filtering, grouping by rows, as well as performing a limited number of joins, for reports, OLAP etc

• Text data are supposed to be read by people
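
The filter, group and join usage listed above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, columns and values are invented and do not come from the slides.

```python
import sqlite3

# Hypothetical tables: the schema is identified and created in advance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE claim (
        id INTEGER PRIMARY KEY,
        region_id INTEGER NOT NULL REFERENCES region(id),
        year INTEGER NOT NULL,
        amount REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO region VALUES (?, ?)", [(1, "North"), (2, "South")])
conn.executemany(
    "INSERT INTO claim (region_id, year, amount) VALUES (?, ?, ?)",
    [(1, 2014, 100.0), (1, 2014, 50.0), (2, 2014, 70.0), (2, 2013, 30.0)],
)

# One report query: filter (WHERE), join (claim to region), group (GROUP BY).
report = conn.execute("""
    SELECT r.name, SUM(c.amount)
    FROM claim c JOIN region r ON r.id = c.region_id
    WHERE c.year = 2014
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
print(report)
```

Every question the report can answer is limited to the columns that were designed up front, which is the constraint the later slides contrast with.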


• Data coming from all over the Internet

• Data from Internet of Things.

• Human circumstances

• XML structures

• Data come from someplace, designed by someone else

• Machine learning

• Clustering

• Graph algorithms


• Traditional for IT

• Fully defined data

• Traditional Database

• Master Data Model

• Data Warehouse

• New generation of ideas and technologies

• Presumes only part of information is known

• Internet

• Information across multiple enterprises

• Information extracted from texts


New generations of tools, often coming from Internet companies, designed for “New Data”

• Hadoop File System

• NoSQL: Cassandra, MarkLogic, Couchbase, DynamoDB

• Column-store RDB

• Semantic DBs

• Graph DBs

• Map/Reduce of different flavours

• XQuery

• SPARQL

• Gremlin


• Write anything associated with a primary key (akin to a file path)

• Distributed over commodity servers

• Highly concurrent write and read

• Everything is cheap – hardware, “design” etc

• However, small records have to be packed into SequenceFiles or MapFiles

• Anything at scale – can store files in Petabytes

• Designed for Map/Reduce batch work, data lakes

• Anything interactive requires massive hardware


The term “NoSQL” means “not relational”, and as such covers many different models. Some of them suit the complexity of generic data storage. They are called “semi-structured” because, although individual data items are structures, the structures are not necessarily defined in advance.

NoSQL platforms combine Hadoop’s “store anything” capability with indexing and

• store and index XML or JSON documents (“trees”).

• A deep row store can be seen as a document database where the depth of trees is limited to 2.

• Tables with named fields per row
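
The difference between a document (“tree”) and a deep row can be sketched in plain Python. The sample document, the row layout and the depth function are illustrative assumptions, not the data model of any particular NoSQL product.

```python
import json

# A hypothetical JSON document: a tree of arbitrary depth, no schema declared up front.
document = json.loads("""
{
  "person": {
    "name": "Alex",
    "addresses": [
      {"type": "home", "city": "Sydney"},
      {"type": "work", "city": "Canberra"}
    ]
  }
}
""")


def depth(node):
    """Return the nesting depth of a JSON value (scalars have depth 0)."""
    if isinstance(node, dict):
        return 1 + max((depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((depth(v) for v in node), default=0)
    return 0


# A "deep row" in this sketch: row key -> named fields -> scalar values.
deep_row = {"person:1": {"name": "Alex", "home_city": "Sydney"}}

print(depth(document))  # nested document
print(depth(deep_row))  # row key plus named fields
```

In this sketch the document nests four levels deep, while the deep row stops at two: a key, then named fields.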


• “Interactive Hadoop”

• Low-granularity Hadoop

• Data Lake

• Operations DBs with complex data

• Data consolidation

• Dynamic Data Warehouse

• Operational Data Warehouse

• Data are presumed to be “forests” of “trees” – connected data are not handled as well

• A touch more expensive than Hadoop


Provide traditional RDB interface in the new world

• Different internal structure

• Less suitable for OLTP

• Suitable for sparse data – empty fields take no space and carry no read penalty

• Much faster for analytics, especially if only selected fields are used

• Analytics when schema is known

• Cannot do schema-on-read
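
Why column orientation helps analytics over selected fields can be sketched in a few lines of Python. The records and the to_columns helper are hypothetical; real column stores add compression and on-disk layouts that this sketch ignores.

```python
# Hypothetical records; None marks an empty (sparse) field.
rows = [
    {"id": 1, "age": 34, "income": 52000, "postcode": None},
    {"id": 2, "age": 51, "income": None, "postcode": "2600"},
    {"id": 3, "age": 29, "income": 48000, "postcode": None},
]


def to_columns(records):
    """Pivot row-oriented records into one {row_id: value} map per field,
    skipping empty values so sparse fields take no space."""
    columns = {}
    for record in records:
        for field, value in record.items():
            if field != "id" and value is not None:
                columns.setdefault(field, {})[record["id"]] = value
    return columns


columns = to_columns(rows)

# An analytic query over one field reads only that column, not whole rows.
incomes = columns["income"].values()
average_income = sum(incomes) / len(incomes)
print(average_income)
print(len(columns["postcode"]))  # only the non-empty postcode is stored
```

The same layout explains the OLTP weakness: writing one complete record touches every column rather than a single row.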


Semantic DBs support the Resource Description Framework (RDF), originally created for Semantic Web metadata. They store information in Subject-Predicate-Object “triples”, the most flexible representation possible. Use SPARQL for queries.

• Graph patterns

• Metadata for Hadoop/NoSQL. Lack of internal schema requires external metadata

• Do not scale as much

• Hype-contaminated: people who understand enterprise and understand Semantic Tech are rare
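
The triple representation and graph-pattern matching can be sketched without any RDF library. The code below is plain Python, not SPARQL; the “?variable” convention only loosely mirrors SPARQL, and all subjects, predicates and objects are invented for the example.

```python
# Hypothetical triples (subject, predicate, object); names are invented.
triples = {
    ("alice", "worksFor", "agencyA"),
    ("bob", "worksFor", "agencyA"),
    ("agencyA", "locatedIn", "Canberra"),
    ("carol", "worksFor", "agencyB"),
}


def match(pattern, store):
    """Yield variable bindings for one triple pattern.
    Terms starting with '?' are variables; other terms must match exactly."""
    for triple in store:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


def query(patterns, store):
    """Join several triple patterns into one graph pattern."""
    results = [{}]
    for pattern in patterns:
        next_results = []
        for bound in results:
            # Substitute variables already bound by earlier patterns.
            concrete = tuple(bound.get(t, t) for t in pattern)
            for binding in match(concrete, store):
                next_results.append({**bound, **binding})
        results = next_results
    return results


# Who works for an organisation located in Canberra?
answers = query(
    [("?who", "worksFor", "?org"), ("?org", "locatedIn", "Canberra")], triples
)
print(sorted(a["?who"] for a in answers))
```

Adding a new kind of fact means adding a triple with a new predicate; no table or schema has to change, which is the flexibility the slide refers to.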


Graph Databases see data as one huge graph. They are optimised for navigating the edges of the graph. Use Gremlin for traversals.

• Implementing Graph Analytics

• Bespoke graph logic

• Backend for general apps (if BASE jumping is too boring)

• Not as scalable as NoSQL

• Lack the declarative data type, pattern and rule definitions of Semantic DBs

• Depend on ability to build and maintain a graph
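
Edge navigation, the operation graph databases optimise, can be sketched as a breadth-first traversal. This is plain Python rather than Gremlin, and the graph, the edge labels and the reachable function are assumptions made for the example.

```python
from collections import deque

# Hypothetical graph as adjacency lists: vertex -> list of (edge label, vertex).
graph = {
    "alice": [("knows", "bob")],
    "bob": [("knows", "carol"), ("worksFor", "agencyA")],
    "carol": [("knows", "dave")],
    "dave": [],
    "agencyA": [],
}


def reachable(start, label, max_hops):
    """Follow edges with the given label for up to max_hops steps (breadth-first)."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = []
    while frontier:
        vertex, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for edge_label, neighbour in graph[vertex]:
            if edge_label == label and neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                frontier.append((neighbour, hops + 1))
    return found


print(reachable("alice", "knows", 2))
```

Each hop is a direct lookup of a vertex's edges, which is why this style of query depends on the graph being built and maintained in the first place.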


A platform for massively parallel computation that enables effective sharing of workload between commodity servers.

• MapReduce

• YARN

• Apache Spark

• Batch jobs over massive data

• On-demand queries where some lag is acceptable

• Implementations have powerful Analytics/Machine Learning libraries

• Latency
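
The map, shuffle and reduce steps behind this style of processing can be sketched in a single Python process. The input chunks and function names are illustrative; a real framework would run the map and reduce calls on different servers.

```python
from collections import defaultdict

# Hypothetical input, already split into chunks as if spread over servers.
chunks = [
    "big data is not only hadoop",
    "hadoop is one big data tool",
]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["hadoop"], counts["big"], counts["tool"])
```

Because each map call sees only its own chunk and each reduce call sees only one key, the work can be spread over many servers; the shuffle between them is one source of the latency noted above.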


• Ensure datasets are identifiable

• Capture metadata

• Ensure your data are not lost

• Profile data across field names, structures etc

• Locate data as needed

• As you learn more about data, build up your metadata

• Hadoop

• NoSQL
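
Profiling data across field names and capturing metadata can start very small. The sketch below assumes records arrive as JSON lines; the sample records, the dataset identifier and the metadata layout are all hypothetical.

```python
import json
from collections import Counter

# Hypothetical raw records landed with no schema declared up front.
raw_lines = [
    '{"name": "Alex", "postcode": "2600"}',
    '{"name": "Sam", "dob": "1980-01-01"}',
    '{"name": "Kim", "postcode": "2000", "dob": "1975-06-30"}',
]


def profile(lines):
    """Count how often each field name occurs across the records."""
    field_counts = Counter()
    for line in lines:
        field_counts.update(json.loads(line).keys())
    return field_counts


# Minimal metadata entry kept alongside the dataset so it stays identifiable.
metadata = {
    "dataset_id": "example-dataset-001",  # hypothetical identifier
    "record_count": len(raw_lines),
    "fields": dict(profile(raw_lines)),
}
print(metadata["fields"])
```

As more is learned about the data, further entries (meaning of fields, source, quality notes) can be added to the same metadata record.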


• A server in the $1,000-$10,000 range

• 0.5TB – 25TB per server

• A lot of them if needed

• For well-parallelised workloads, doubling the number of servers roughly halves the time to execute the task.
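
The scaling claim amounts to simple arithmetic under an ideal assumption: the work divides evenly and coordination costs nothing. The function below encodes only that assumption, and the 24-hour figure is an invented example.

```python
def ideal_runtime(single_server_hours, servers):
    """Runtime under ideal linear scaling: work divides evenly, no overhead."""
    return single_server_hours / servers


# A hypothetical job that takes 24 hours on one server.
for n in (1, 2, 4, 8):
    print(n, ideal_runtime(24.0, n))
```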


Unlearning is perhaps more complex than learning

• There are a lot of data you do not know about that are available and can be used

• For many types of objects, it is natural to have uncommon attributes

• Data storage is cheap. It doesn’t cost much to store everything remotely related

• No massive pre-work.

• Ask everything


• Traditional reporting, BI

• Predictive analytics

• Data consolidation, Semantic Integration, Object-based Intelligence

• Clustering


• Straightforward operation, no design upfront

• Can take immensely complex metadata, like UML & BPMN models

• Apply OWL for classification

• SWRL builds complex linkages

• Refer to Classes defined by lower-level Ontologies rather than data


• Words to be converted to tags (URLs)

• Some words have multiple meanings

• Ontology provides possible tags for nouns

• Software tries to resolve expected predicates

• The tag that can find necessary relations (predicates) wins

• Use ontology to restrict search

• Much more flexible than “foreign key”
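
The rule that the tag able to find the necessary relations wins can be sketched as a scoring function. The ontology, the example.org URLs, the predicates and the resolve function are invented for illustration and do not describe any specific product.

```python
# Hypothetical ontology: candidate tags (URLs) per word,
# and the predicates each tag is able to take part in.
candidate_tags = {
    "bank": [
        "http://example.org/FinancialInstitution",
        "http://example.org/RiverBank",
    ],
}
supported_predicates = {
    "http://example.org/FinancialInstitution": {"lendsTo", "holdsAccount"},
    "http://example.org/RiverBank": {"borders", "erodes"},
}


def resolve(word, expected_predicates):
    """Pick the candidate tag that satisfies the most expected predicates.
    Returns None when no candidate supports any of them."""
    best_tag, best_score = None, 0
    for tag in candidate_tags.get(word, []):
        score = len(supported_predicates[tag] & expected_predicates)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# "The bank lends to small businesses": the text implies a lending relation.
print(resolve("bank", {"lendsTo"}))
```

The ontology restricts the search to two candidates, and the expected predicate decides between them; a foreign key, by contrast, would have required the meaning to be fixed when the data were stored.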


• Information about data

• Traditional metadata was stored in the form of a data schema

• With schema-less storage, metadata should be stored separately

• Incremental discovery process requires Open World Assumption – you don’t know what other data are there.

• Reasoning to handle complexity

• Relationships as first-class citizens and the basis for classification


• Work in progress – not there yet

• Different data paradigms mandate different views

• SQL-like view of Big Data (Apache Pig etc.)

• Excel import

• Analytic visualisation frontends

• 30+ JavaScript libraries

• Presume development

• Mahout & other libraries

• Writing code in Scala, Java, Python, Groovy


• Better picture of the current state

• “What if” prediction

• Researching impact

• Increasing the number of categories, by several orders of magnitude if necessary

• Common, meaningful view of individual, organisation etc

• Prevention of undesirable effects on insights, complex events and prediction


• Description Logic, while using First-Order Predicate Logic terminology

• Reduced for practical purposes

• Is not necessary to be productive

• Can be applied to anything

• Class can be derived depending on values

• A State is a Class

• New triples can be derived from existing ones
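
Deriving a class from a value, and new triples from existing ones, can be sketched as a small fixed-point loop. This is plain Python, not OWL or SWRL; the facts, the class names and the age threshold of 65 are assumptions made for the example.

```python
# Hypothetical facts as (subject, predicate, object) triples.
facts = {
    ("person1", "age", 70),
    ("person2", "age", 30),
    ("Pensioner", "subClassOf", "Person"),
}


def infer(store):
    """Apply two illustrative rules until no new triples appear."""
    derived = set(store)
    while True:
        new = set()
        for s, p, o in derived:
            # Rule 1: class membership derived from a value (threshold is an assumption).
            if p == "age" and o >= 65:
                new.add((s, "type", "Pensioner"))
            # Rule 2: members of a subclass are also members of its superclass.
            if p == "type":
                for s2, p2, o2 in derived:
                    if p2 == "subClassOf" and s2 == o:
                        new.add((s, "type", o2))
        if new <= derived:
            return derived  # fixed point: nothing further can be derived
        derived |= new


closure = infer(facts)
print(("person1", "type", "Person") in closure)
```

If the age value changes, rerunning the rules changes the derived class, which is one way to read “a State is a Class”.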
