INSCOM … Vigilance Always!
U.S. Army Intelligence and Security Command
OVERALL CLASSIFICATION OF THIS BRIEFING IS UNCLASSIFIED
(U) Big Data Introduction
INSCOM ORSA Cell
November 2016
UNCLASSIFIED
Agenda
• Define Big Data
• History of Big Data
• Basics of Networking, HDFS, and MapReduce
• Uses and limitations of Hadoop
• Military Applications and Concerns
• Summary and Review of Definitions
How big is “Big Data?”
• Doesn’t fit in Excel (1,048,576 rows and 16,384 columns)?
• Doesn’t fit in memory (the constraining factor for in-memory tools such as R)?
• Doesn’t fit on a single machine (starts at ~1TB)?
• Requires Hadoop (starting around 5-10TB)?
Why bother with Big Data?
– Know (what happened?)
• Basic analytics + visualizations (descriptive statistics, histogram, time series, bar chart, box plot, etc.)
• Interactive drill down
• Implemented with MapReduce or Queries
• Examples: forensics, assessments, historical data/reports/trends
– Explain (why?)
• Data mining, classifications, building models, clustering
• Correlation
• Examples: find similar items, find hubs and authorities in a graph, find frequent item sets
• Possibly implemented with Apache Mahout
– Predict (what will happen?)
• Neural networks, decision models, unsupervised learning
• Examples: translation, weather forecast, user profile, traffic models, economic models
Why is Big Data Hard?
– Storage: At 1TB each, it takes 1000 computers to store 1 PB
– Movement: Assuming about 1Gb/s of effective throughput, it takes about 2 hours to copy 1TB, or 83 days to copy 1PB (a 10Gb link at full line rate would need roughly 13 minutes per TB)
– Searching: Assuming each record is 1KB and one machine can process 1000 records per sec, it needs about 278 CPU hours (11.6 CPU days) to process 1TB and about 31.7 CPU years to process 1PB
– Processing:
• How do we convert existing algorithms to work on large data?
• How do we create new algorithms?
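The storage, movement, and searching estimates on this slide can be re-derived with a few lines of arithmetic. The sketch below assumes decimal units (1TB = 10^12 bytes, 1KB = 1000 bytes) and treats the network link as the only bottleneck; the function names are illustrative and not from the briefing.

```python
# Back-of-envelope estimates, using decimal units (assumption).
TB = 10**12
PB = 10**15
HOUR = 3600
DAY = 24 * HOUR
YEAR = 365 * DAY

def machines_needed(total_bytes, bytes_per_machine):
    """Machines required to hold total_bytes (ceiling division)."""
    return -(-total_bytes // bytes_per_machine)

def copy_seconds(total_bytes, link_bits_per_sec):
    """Transfer time if the link is the only bottleneck (assumption)."""
    return total_bytes * 8 / link_bits_per_sec

def scan_seconds(total_bytes, record_bytes, records_per_sec):
    """Time for one machine to process every record once."""
    return total_bytes / record_bytes / records_per_sec

print("machines for 1PB at 1TB each:", machines_needed(PB, TB))
for gbps in (1, 10):
    rate = gbps * 10**9
    print(f"{gbps} Gb/s: {copy_seconds(TB, rate) / HOUR:.2f} h per TB, "
          f"{copy_seconds(PB, rate) / DAY:.1f} days per PB")
print(f"scan 1TB: {scan_seconds(TB, 1000, 1000) / DAY:.2f} CPU days")
print(f"scan 1PB: {scan_seconds(PB, 1000, 1000) / YEAR:.1f} CPU years")
```

The copy time is very sensitive to the throughput you assume: about 1 Gb/s of effective throughput gives roughly 2.2 hours per TB, while 10 Gb/s at full line rate gives about 13 minutes.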
Understanding Traditional Data Storage
• Often requires big/expensive hardware
• Requires an expensive database management system (Oracle, Teradata, etc.)
• Not necessarily fault tolerant
• Back-up can be difficult and expensive
• Doesn’t scale horizontally (high marginal cost)
• SQL is unsuited for some analytics
– Complex analysis (like ranking web pages)
– Unstructured data
Google’s Problem
• In 1999 Google wanted to index the web, which even then was hundreds of millions of pages
– Crawl all the pages
– Rank pages based on relevant metrics
– Build a search index of keywords to pages
– Do this in real-time!
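To make "a search index of keywords to pages" concrete, here is a minimal single-machine sketch of an inverted index. It is not Google's implementation; the page names and text are hypothetical, and tokenizing by whitespace is a simplifying assumption.

```python
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical three-page "web", for illustration only.
pages = {
    "a.html": "big data needs big storage",
    "b.html": "search needs an index",
    "c.html": "index the data",
}

index = build_index(pages)
print(index["data"])   # pages containing "data"
print(index["needs"])  # pages containing "needs"
```

The hard part at web scale is not this logic but running it over hundreds of millions of pages, which is what drives the storage and processing design on the following slides.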
Google’s Solution
• Google designed its own storage and processing infrastructure
– Google File System and MapReduce
• Goals:
– Cheap
– Scalable
– Reliable
Image from http://infolab.stanford.edu
Google Product
• It worked!
• Powered Google Search for many years
• General framework for large-scale batch computation tasks
• Still used internally at Google to this day
Google Shares Ideas
2003: Google publishes paper on Google File System (GFS)
2004: Google publishes paper on MapReduce
At this point, these are already mature technologies.
…but it took 2-3 years for people to “get it”!
The Elephant in the Room
• Doug Cutting and Mike Cafarella attempted to develop an open-source search platform called Nutch
• Ran into the same problem Google did
• Decided to “reverse engineer” GFS and MapReduce from the 2003 and 2004 papers
• 2006: spun their product out into Apache Hadoop
Hadoop Goes Mainstream
• Today Hadoop is used by every Fortune 500 company, the majority of internet and social media companies, and an increasing number of government agencies.
– Facebook has a 20PB, 4000-node cluster
• Many big tech companies are betting on Hadoop
• Experts predict that within 5-10 years, the vast majority of servers will contain Hadoop clusters
Networking Primer
**Graphic from Cloudera Training Material
Google Data Center in Council Bluffs, Iowa
Central cooling plant in Google’s data center in Douglas County, Georgia
Hadoop File Systems
• Same concepts as the file system on your personal computer
– Directory Tree
– Create, read, write, and delete files
• Filesystems store metadata and data
– Metadata: filename, size, permissions, location
– Data: contents of the file
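The metadata/data split is easy to see on an ordinary local filesystem. A small sketch using only the Python standard library; the file name and contents are made up for illustration, and this is the local filesystem, not HDFS.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", newline="\n") as f:
        f.write("hello hdfs\n")

    # Metadata: facts about the file, kept by the filesystem.
    info = os.stat(path)
    name = os.path.basename(path)
    size = info.st_size
    permissions = oct(info.st_mode & 0o777)

    # Data: the contents of the file itself.
    with open(path) as f:
        contents = f.read()

print("metadata:", name, size, "bytes", permissions)
print("data:", repr(contents))
```

HDFS keeps the same split, but stores the metadata on a dedicated Name Node and the data on the data nodes.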
Understanding HDFS
HDFS Design assumptions
• Failures are common
– Massive scale means more failures
– Disks, network, node
• Files are append-only
• Files are large (GBs to TBs)
– Works better with few large files than many small files
• Accesses are large and sequential
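HDFS Block Replication
[Figure: a very large data file is split into blocks 1-5, and each block is stored on three of five data nodes, which hold blocks (2, 4, 5), (1, 2, 5), (1, 3, 4), (2, 3, 5), and (1, 3, 4)]
Name Node: Metadata information about files and blocks
3-fold replication is baked into the process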
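A toy simulation shows why 3-fold replication tolerates failures. This is a sketch under simplifying assumptions: replicas are placed on random distinct nodes, whereas real HDFS placement also considers racks, and the node names and function names are hypothetical.

```python
import itertools
import random

def place_blocks(num_blocks, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes chosen at random."""
    rng = random.Random(seed)
    return {
        block: sorted(rng.sample(nodes, replication))
        for block in range(1, num_blocks + 1)
    }

def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(5, nodes)
for block, replicas in placement.items():
    print(f"block {block}: {replicas}")

# With three replicas per block, losing any two nodes never loses data.
survives_any_two = all(
    readable_after_failure(placement, set(pair))
    for pair in itertools.combinations(nodes, 2)
)
print("survives any two node failures:", survives_any_two)  # True
```

The `placement` dictionary plays the role of the Name Node's metadata: it records where each block lives, while the blocks themselves sit on the data nodes.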
Map Reduce
[Figure: a very large data file split into blocks 1-5, each stored on three of five data nodes]
Name Node: Metadata information about files and blocks
Map produces intermediate values
Reduce combines intermediate values into one or more final values
Word Count Example
Mapper Input:
The cat sat on the mat
The aardvark sat on the sofa
Mapping:
(The, 1) (cat, 1) (sat, 1) (on, 1) (the, 1) (mat, 1) (The, 1) (aardvark, 1) (sat, 1) (on, 1) (the, 1) (sofa, 1)
Shuffling (grouped input to Reducing):
aardvark, 1
cat, 1
mat, 1
on, [1,1]
sat, [1,1]
sofa, 1
the, [1,1,1,1]
Final Result:
aardvark, 1
cat, 1
mat, 1
on, 2
sat, 2
sofa, 1
the, 4
~100 lines of Java code to accomplish in Hadoop
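The three stages of the word count can be sketched on a single machine in a few lines of Python. This illustrates the data flow only and is not Hadoop code. The function names are illustrative, and lowercasing in the mapper is an assumption made so that "The" and "the" land on the same key, as they do in the shuffled output above.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group all intermediate values by key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(key, values):
    """Combine the intermediate values for one key into a final value."""
    return key, sum(values)

lines = ["The cat sat on the mat", "The aardvark sat on the sofa"]
pairs = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(pairs)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

for word, count in counts.items():
    print(f"{word}, {count}")
```

In Hadoop the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves the pairs across the network to the reducers.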
Reasons to avoid Hadoop
• Use cases that may not be best in Hadoop
– Analysis cannot be adapted to parallel processing environment
– Real-time analytics or fast access (e.g., 30 milliseconds to look up information in a database that has 300 million people)
– When your intermediate processes need to talk to each other
– When processing requires significant data to be shuffled over the network
Linux
Hadoop
You don’t need to download and compile it yourself!
Cloudera/Hortonworks
Hadoop Ecosystem (1 of 2)
• Hive: Relational database abstraction using a SQL-like dialect (but executed as MapReduce jobs). Developed by Facebook.
SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 5;
• Pig: High-level scripting language for executing one or more MapReduce jobs. Developed by Yahoo.
emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';
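For readers new to Pig, the same load, filter, and sort pipeline looks like this in plain single-machine Python. This is a sketch for comparison only: the sample rows stand in for people.txt and are made up, and tab-separated input is assumed because that is Pig's default field delimiter.

```python
import csv
import io

# Hypothetical rows standing in for people.txt (tab-separated: id, name, salary).
PEOPLE = "1\tAlice\t250000\n2\tBob\t90000\n3\tCarol\t310000\n"

def rich_people(handle, threshold=200000):
    """LOAD the rows, FILTER by salary, and ORDER by salary descending."""
    emps = [
        (row[0], row[1], int(row[2]))
        for row in csv.reader(handle, delimiter="\t")
    ]
    rich = [emp for emp in emps if emp[2] > threshold]
    return sorted(rich, key=lambda emp: emp[2], reverse=True)

sorted_rich = rich_people(io.StringIO(PEOPLE))
for emp in sorted_rich:
    print(*emp, sep="\t")
```

The difference is that Pig compiles each step into MapReduce jobs, so the same script keeps working when the input no longer fits on one machine.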
Hadoop Ecosystem (2 of 2)
• Sqoop: Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver
• Flume: A streaming data collection and aggregation system for massive volumes of data
• HBase: Hadoop's NoSQL database. Patterned after Google Bigtable, HBase is designed to provide fast, tabular access to the high-scale data stored on HDFS.
• Accumulo: Similar to HBase but developed by the National Security Agency with cell-based access control (added a new element to the key called Column Visibility) (https://accumulo.apache.org/)
Two Big Data Tools to Watch Closely
• Spark (fast general engine for large-scale data processing)
– Speed: advertised as running up to 100x faster than MapReduce in memory and 10x faster on disk
– Easy to use: write applications in Java, Scala, R or Python
– Generality: combine SQL, streaming, and complex analytics
• Apache Drill (schema-free SQL query engine for Hadoop, NoSQL, and cloud storage)
– Query any non-relational data store with SQL (can point to a directory of JSON files on your laptop or S3)
– With Drill’s ODBC drivers, you can connect to any existing BI tool (Excel, R, SAS, Tableau, etc.)
Hadoop Use Cases
• Hadoop is the platform of choice for:
– Clickstream data
– Sentiment data (Twitter and social media)
– Telematics, such as vehicle tracking data
– Sensor and Machine-generated data
– Geo tracking and location data
– Server and network logs
– Document and text repositories
– Digitized images, voice, video and other media.
Possible Military Application
Sampling of Army “Big Data” Projects
• Gabriel Nimbus: ARCYBER instance of the Big Data Platform used to aggregate and enrich cyber data as well as provide a platform to develop rapid analytics for defensive cyber operations (DCO)
• Tactical Cloud Reference Implementation (TCRI): TCRI intends to deliver a joint warfighting tactical/deployed data and analytic platform that enables all-source analysis, rapid decision making, and optimization of force employment
• Person-Event Data Environment (PDE): The PDE business intelligence platform is a cloud-based virtual data repository for housing digitized personnel information. Functionally, the PDE serves two central purposes: (1) acquire, integrate, and securely store data for Army-approved research projects, and (2) provide a secure, virtual workspace where approved researchers can access “sensitive” although unclassified Army military service, performance, manpower, and health data. PDE is hosted by the Army Analytics Group.
Military Application
Today:
• Intelligence
• Cyber
Tomorrow:
• Mobile Technology
• Sensors
Images from sustainablesecurity.org, www.enocean.com, www.techweekeurope.co.uk, www.geek.com
“Big” vs. “Bigger” Data
– “Bigger” Data
• Size: 1GB up to 1TB; fits on 1 machine
• Often doesn’t fit into memory
• Doesn’t require Hadoop (unless you require a production application with lightning-fast query/analytics)
– Some recommendations for “Bigger” Data:
• Tools that I’ve found help with “Bigger” Data:
– SQLite
– Elasticsearch, Logstash, Kibana (ELK) server
– Apache Drill
– Unix terminal (query many CSV files with grep)
• Batch process (store and aggregate interim statistics)
• “Poor man’s parallelization” (multiple instances running simultaneously)
• Understand how your code uses RAM (e.g., the difference between data frames and lists in R)
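The "batch process" recommendation can be sketched as follows: compute a small set of interim statistics per batch, store them, and combine them at the end, so the full data set never has to sit in memory. The batches here are tiny in-memory lists for illustration; in practice each batch would be one file or one chunk of a file. Function names are illustrative.

```python
import math

def chunk_stats(values):
    """Interim statistics for one batch: count, sum, and sum of squares."""
    return len(values), sum(values), sum(v * v for v in values)

def combine(stats):
    """Aggregate interim statistics into count, mean, and population std dev."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    variance = total_sq / n - mean * mean
    return n, mean, math.sqrt(max(variance, 0.0))

# Hypothetical batches, e.g. one per input file.
batches = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
interim = [chunk_stats(batch) for batch in batches]
n, mean, std = combine(interim)
print(n, mean, round(std, 3))
```

Because each batch is independent, the same pattern supports the "poor man's parallelization" above: run `chunk_stats` in several processes and combine the stored results afterwards. The sum-of-squares formula can lose precision when values are very large; a streaming method such as Welford's algorithm is a more robust alternative.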
Big Data Storage vs. Big Data Analytics
– Big Data Storage: distributed storage
Data is duplicated and stored across many different nodes (computers)
– Big Data Analytics: distributed analytics
Analytics is conducted across multiple nodes; a master node collects and aggregates interim solutions
What is Data Science?
Math and Statistics:
• Machine learning
• Statistical modeling
• Bayesian inference
• Optimization
• Simulation
• Network science
• Model Development
Programming & Database:
• Computer science fundamentals
• Scripting language (Python)
• Statistical computing package (e.g. R)
• Databases: SQL and NoSQL
• Parallel databases and parallel query processing
• MapReduce concepts
• Hadoop and Hive/Pig
• Experience with xaaS like AWS
• Basic Tools Development
• Understands Sources of Data
Ops/Intel Expertise & Leader Skills:
• Background in intel/ops/cyber
• Curious about data
• Influence with leaders
• Problem solver
• Creates narratives with data
• Visual design and communication
• Strategic, proactive, creative, innovative and collaborative
Data science = the ability to extract knowledge and insights from large and complex data sets
--DJ Patil, the U.S. Government’s first Chief Data Scientist
Operationalize Data for Decision Makers
The Gartner Hype Cycle
Graph from “Big Data for Defence and Security” by Neil Couch and Bill Robins
Final “Big Data” Considerations
Big Data Ethics
Big Data Security
Image from www.tbitsglobal.com
“A human must turn information into intelligence or knowledge. We've tended to forget that no computer will ever ask a new question.”
--Grace Hopper
http://en.wikiquote.org/wiki/Grace_Hopper
Questions
• For more information on “Big Data”, visit my research site at http://data-analytics.net
• For data science collaboration, visit https://dscoe.army.mil/
• For insomnia, visit https://dmbeskow.github.io/
Contact Info:
Major David Beskow
Office: 703.706.1255
NIPR: [email protected]
SIPR: [email protected]
JWICS: [email protected]
References
Executive Office of the President. “Big Data: Seizing Opportunities, Preserving Values.” May 2014.
Couch, Neil and Robins, Bill. “Big Data for Defence and Security.” Royal United Services Institute, September 2013.
Olson, Mike. “HADOOP, Scalable, Flexible Data Storage and Analysis.” IQT Quarterly, Vol. 1, No. 3, pp. 14-18.
Jacobs, Bill and Dinsmore, Thomas. “Delivering Value from Big Data with Revolution R Enterprise and Hadoop.” Revolution Analytics Executive White Paper, October 2014.
Jacobs, Bill. “Maximizing the Value of Big Data.” Revolution Analytics White Paper, April 2014.
Cloudera training materials were a primary resource in creating this presentation.
AWS Primer
• Elastic Compute Cloud (EC2):
– Virtual machines (computers) that reside in the cloud (just like a real computer, you choose RAM and storage size)
– Choose a Linux or Microsoft Windows image
• Simple Storage Service (S3)
– “Buckets” that can store files
– Think of this as an infinitely expandable Dropbox in which you only pay for the storage used
The Four V’s of Big Data
Volume
• Scale of Data
• 40 Zettabytes of data will be created by 2020
Velocity
• Analysis of Streaming Data
• The NY Stock Exchange captures 1TB of trade information during each trading session
Variety
• Different forms of data
• By the end of 2014, it’s anticipated there will be 420 million wearable wireless health monitors
Veracity
• Uncertainty of Data
• Poor data quality costs the US economy $3.1 Trillion a year
Data from IBM Graphic Visualization
HDFS Fault-tolerance
• Many different failure modes
– Disk corruption, node failure, switch failure
• Primary concern: Data is safe!!
• Secondary concerns
– Keep accepting reads and writes
– Do it transparently to clients/users
Hadoop Cost Considerations
• Traditional Storage:
Teradata is ~$20K per TB per year
• Hadoop Storage:
$1K-2K per TB per year
Note: Hadoop Storage costs assume you have the technical expertise in-house. Hiring/contracting Hadoop programmers increases costs significantly.
***Cloudera training, NYC, 2014
File Systems (cont)
• Disk does a seek for every I/O operation
• Seeks are expensive (~10ms)
• Throughput tradeoff with input/output operations per second (IOPS)
– 100 MB/s and 10 IOPS
– 10MB/s and 100 IOPS
• Big I/Os mean better throughput
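The two operating points on this slide fall out of a simple cost model: every I/O pays one seek plus the time to transfer its bytes. The sketch below uses the ~10ms seek from the slide; the 100 MB/s peak sequential transfer rate and the two I/O sizes are assumptions chosen for illustration.

```python
def disk_profile(io_bytes, seek_s=0.010, transfer_bytes_per_s=100e6):
    """Return (IOPS, MB/s) when every I/O pays one seek plus transfer time."""
    time_per_io = seek_s + io_bytes / transfer_bytes_per_s
    iops = 1 / time_per_io
    return iops, iops * io_bytes / 1e6

small = disk_profile(100e3)  # 100 KB per I/O: seek time dominates
large = disk_profile(10e6)   # 10 MB per I/O: transfer time dominates

print(f"100 KB I/Os: {small[0]:.1f} IOPS, {small[1]:.1f} MB/s")
print(f"10 MB I/Os:  {large[0]:.1f} IOPS, {large[1]:.1f} MB/s")
```

Under these assumptions small I/Os give roughly 90 IOPS at about 9 MB/s, and large I/Os give roughly 9 IOPS at about 90 MB/s, which is why HDFS favors large, sequential accesses.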
Summary
• GFS and MR co-design
– Cheap, simple, effective at scale
• Fault-tolerance baked in
– Replicate data 3x
– Incrementally re-execute computation
– Avoid single points of failure
Networking Primer
[Figure: a cluster of six hosts, with Namenodes on Host 1 and Host 2 and Data Nodes on Hosts 3, 4, 5, and 6]
MapReduce—Map
• Records from the data source (lines out of files, rows out of a database, etc.) are fed into the map function as key-value pairs, e.g., (filename, line)
• Map() produces one or more intermediate values along with an output key from the input
[Figure: text files feed Map tasks, which emit {key, values} pairs; the Shuffle phase groups them into {key, intermediate values} lists; Reduce tasks produce the final {key, values}]
Hierarchy of Data Scientists
Tool-maker: generates algorithms from scratch; fully understands when an algorithm will break.
High-end tool user: uses products that require deeper understanding of the question and tools. Ex: executing and debugging a few lines of code.
Tool user: uses products generated by other analysts to generate answers to well-known questions.
MapReduce—Reduce
• After the map phase is over, all the intermediate values for a given output key are combined into a list
• Reduce() combines those intermediate values into one or more final values for that same output key
[Figure: text files feed Map tasks, which emit {key, values} pairs; the Shuffle phase groups them into {key, intermediate values} lists; Reduce tasks produce the final {key, values}]
Where does data science fit in?
Data-Driven Decision Making (across the organization)
Automated DDD
Data Science
Data Engineering and Processing [including “Big Data” technologies]
Other positive effects of data processing
The Data Scientist
“The Hacker”
Computer Science
“The Nerd”
Statistics & Math
Modeling
“The Expert”
SME on field of interest
“The Data Scientist”
Organizations can achieve data science through a team approach
How Big Data Fits into a Data Science Minor at USMA
UC Berkeley:
1. Research Design and Application for Data and Analysis
2. Exploring and Analyzing Data
3. Storing and Retrieving Data
4. Applied Machine Learning
5. Visualizing and Communicating Data
Proposed WP Curriculum:
1. Engineering Statistics
2. Databases & Big Data
3. Network Analysis
4. Machine Learning and Data Mining
5. Visualizing and Communicating Data