Back to Basics Webinar 5: Introduction to the Aggregation Framework
-
Upload
mongodb -
Category
Data & Analytics
-
view
2.134 -
download
0
Transcript of Back to Basics Webinar 5: Introduction to the Aggregation Framework
MongoDB Europe 2016Old Billingsgate, London
15th November
mongodb.com/europe
Register with code JD20 for a 20% discount
Back to Basics 2016 : Webinar 5
Introduction to the Aggregation Framework
Joe Drumgoole
Director of Developer Advocacy, EMEA
@jdrumgoole
V1.1
3
Recap
• Webinar 1 – Introduction to NoSQL
– The different types of NoSQL databases
– MongoDB is a document database
• Webinar 2 – My First Application
– Creating databases and collections
– CRUD, Indexes and Explain
• Webinar 3 – Schema Design
– Dynamic schema
– Embedding approaches
• Webinar 4 – GeoSpatial and Text Indexes
4
The Aggregation Framework
• An analytics engine for MongoDB
• What is analytics?
• Think of the two types of database, OLTP, OLAP
• OLTP : Online Transaction Processing
– airline booking,
– ATMs,
– Taxi booking
• OLAP : Online Analytical Processing
– Which tickets make us most money?
– When do we need to refill our ATMs?
– How many cabs do we need to service the West End of London?
5
What Does This Look Like?
OLTP OLAP
6
OLAP - There Be (Hadoop) Dragons Here
• OLAP queries are often table scans
• The output of the queries is often stored for future analysis and comparison
• Many customers are looking at Spark and Hadoop for OLAP but:
– Complexity is astronomical
– Focused on algorithmic analysis of data (you gotta write a program)
– Requires some significant knowledge of parallel alogrithms and parallel
processing
• The aggregation framework is a kinder gentler tool
• May do what you want in less time with less anguish
7
The Aggregation Framework – A Processing Pipeline
Match Project Group SortLimit
• Think unix pipeline
• The output of one stage is passed to the input of the next stage
• Each stage performs one job
• Stages can be repeated
• The input is a single collection
8
Pipeline Operators
• $match
Filter documents
• $project
Reshape documents
• $group
Summarize documents
• $out
Create new collections
• $sort
Order documents
• $limit/$skip
Paginate documents
• $lookup
Join two collections togther
• $unwind
Expand an array
9
The MoT Data Set
10
Example Document
{ "_id" : ObjectId("5759ee6e8684975e1098af68"),
"TestID" : 400,
"VehicleID" : "278",
"TestDate" : ISODate("2013-04-23T00:00:00Z"),
"TestClassID" : "4",
"TestType" : "N",
"TestResult" : "P",
"TestMileage" : 99284,
"Postcode" : "E",
"Make" : "AUDI",
"Model" : "A3",
"Colour" : "BLACK",
"FuelType" : "P",
"CylinderCapacity" : 1598,
"FirstUseDate" : ISODate("2003-11-11T00:00:00Z“) }
11
Lets Use The Shell
MongoDB Enterprise > use vosa
switched to db vosa
MongoDB Enterprise > db.results2013.findOne()
{
"_id" : ObjectId("577294020cb23533dfbaac18"),
"TestID" : 17,
"VehicleID" : 28,
"TestDate" : ISODate("2013-05-02T00:00:00Z"),
"TestClassID" : "2",
"TestType" : "N",
"TestResult" : "P",
"TestMileage" : 46414,
"Postcode" : "BN",
"Make" : "SUZUKI",
"Model" : "UNCLASSIFIED",
"Colour" : "GREEN",
"FuelType" : "P",
"CylinderCapacity" : 398,
"FirstUseDate" : ISODate("1993-08-11T00:00:00Z")
}
12
$limit
MongoDB Enterprise > db.results2013.aggregate([ { "$limit" :2 } ] )
{
"_id" : ObjectId("577294020cb23533dfbaac18"),
"TestID" : 17,
"VehicleID" : 28,
"TestDate" : ISODate("2013-05-02T00:00:00Z"),
"TestClassID" : "2",
"TestType" : "N",
"TestResult" : "P",
"TestMileage" : 46414,
"Postcode" : "BN",
"Make" : "SUZUKI",
"Model" : "UNCLASSIFIED",
…
13
Let’s Make a Small Collection
> db.results2013.aggregate( [ { "$limit" : 10000 }, {"$out" : "results10k" } ] )
> db.results10k.count()
10000
> db.results10k.findOne()
{
"_id" : ObjectId("577294020cb23533dfbaac18"),
"TestID" : 17,
"VehicleID" : 28,
"TestDate" : ISODate("2013-05-02T00:00:00Z"),
"TestClassID" : "2",
"TestType" : "N",
"TestResult" : "P",
"TestMileage" : 46414,
"Postcode" : "BN",
"Make" : "SUZUKI",
"Model" : "UNCLASSIFIED",
"Colour" : "GREEN",
"FuelType" : "P",
"CylinderCapacity" : 398,
"FirstUseDate" : ISODate("1993-08-11T00:00:00Z")
}
14
$match
…aggregate([ { "$limit" :2000 },
{ "$match" : { "FirstUseDate" : { "$ne" : "NULL" }}} ])
{ "_id" : ObjectId("577294020cb23533dfbaac18"), "TestID" : 17, "VehicleID" : 28,
"TestDate" : ISODate("2013-05-02T00:00:00Z"), "TestClassID" : "2", "TestType" :
"N", "TestResult" : "P", "TestMileage" : 46414, "Postcode" : "BN", "Make" :
"SUZUKI", "Model" : "UNCLASSIFIED", "Colour" : "GREEN", "FuelType" : "P",
"CylinderCapacity" : 398, "FirstUseDate" : ISODate("1993-08-11T00:00:00Z") }
{ "_id" : ObjectId("577294020cb23533dfbaac19"), "TestID" : 22, "VehicleID" : 33,
"TestDate" : ISODate("2013-06-07T00:00:00Z"), "TestClassID" : "2", "TestType" :
"N", "TestResult" : "P", "TestMileage" : 15605, "Postcode" : "PE", "Make" :
"UNCLASSIFIED", "Model" : "UNCLASSIFIED", "Colour" : "BLACK", "FuelType" : "P",
"CylinderCapacity" : 150, "FirstUseDate" : ISODate("1962-01-01T00:00:00Z") }
{ "_id" : ObjectId("577294020cb23533dfbaac1a"), "TestID" : 44, "VehicleID" : 49,
"TestDate" : ISODate("2013-08-09T00:00:00Z"), "TestClassID" : "4", "TestType" :
"N", "TestResult" : "PRS", "TestMileage" : 72694, "Postcode" : "SO", "Make" :
"UNCLASSIFIED", "Model" : "UNCLASSIFIED", "Colour" : "BLACK", "FuelType" : "P",
"CylinderCapacity" : 998, "FirstUseDate" : ISODate("2001-05-16T00:00:00Z") }
...
15
$project (1 of 2)
ageinusecs = { "$subtract" : [ "$TestDate", "$FirstUseDate" ] }
ageinyears = { "$divide" :[ ageinusecs , (1000*3600*24*365) ] }
floorage = { "$floor" : ageinyears }
ispass = { "$cond" : [{"$eq": ["$TestResult","P"]},1,0]}
project = { "$project" : { "Make” :1,
"Model” :1,
"VehicleID" :1,
"TestResult” :1,
"Age” :floorage,
"pass” :ispass }}
16
$project (2 of 2)
MongoDB Enterprise > db.nonulldates.aggregate( [ project ] )
{ "_id" : ObjectId("577294020cb23533dfbaac18"), "VehicleID" : 28, "TestResult" : "P", "Make" : "SUZUKI", "Model" : "UNCLASSIFIED", "Age" : 19, "pass" : 1 }
{ "_id" : ObjectId("577294020cb23533dfbaac19"), "VehicleID" : 33, "TestResult" : "P", "Make" : "UNCLASSIFIED", "Model" : "UNCLASSIFIED", "Age" : 51, "pass" : 1 }
{ "_id" : ObjectId("577294020cb23533dfbaac1a"), "VehicleID" : 49, "TestResult" : "PRS", "Make" : "UNCLASSIFIED", "Model" : "UNCLASSIFIED", "Age" : 12, "pass" : 0 }
{ "_id" : ObjectId("577294020cb23533dfbaac1b"), "VehicleID" : 54, "TestResult" : "P", "Make" : "NISSAN", "Model" : "MICRA GX", "Age" : 13, "pass" : 1 }
{ "_id" : ObjectId("577294020cb23533dfbaac1c"), "VehicleID" : 54, "TestResult" : "F", "Make" : "UNCLASSIFIED", "Model" : "UNCLASSIFIED", "Age" : 13, "pass" : 0 }
{ "_id" : ObjectId("577294020cb23533dfbaac1d"), "VehicleID" : 63, "TestResult" : "P", "Make" : "UNCLASSIFIED", "Model" : "UNCLASSIFIED", "Age" : 12, "pass" : 1 }
{ "_id" : ObjectId("577294020cb23533dfbaac1e"), "VehicleID" : 63, "TestResult" : "F", "Make" : "UNCLASSIFIED", "Model" : "UNCLASSIFIED", "Age" : 12, "pass" : 0 }
{ "_id" : ObjectId("577294020cb23533dfbaac1f"), "VehicleID" : 93, "TestResult" : "P", "Make" : "BMW", "Model" : "318ti SE COMPACT", "Age" : 12, "pass" : 1 }
…
17
$group
countMakes = { "$group" : {{ "_id" : "$Make"} , "total" : {"$sum" : 1 }}}
db.nonulldates.aggregate( [ countMakes ])
{ "_id" : "IVECO", "total" : 1 }
{ "_id" : "ISUZU", "total" : 1 }
{ "_id" : "YAMAHA", "total" : 1 }
{ "_id" : "OLDSMOBILE", "total" : 1 }
{ "_id" : "KAWASAKI", "total" : 1 }
{ "_id" : "MASERATI", "total" : 1 }
{ "_id" : "BENELLI", "total" : 1 }
{ "_id" : "BENTLEY", "total" : 3 }
{ "_id" : "AUDI", "total" : 26 }
{ "_id" : "SMART", "total" : 2 }
{ "_id" : "HARLEY-DAVIDSON", "total" : 1 }
…
18
Summary
• A pipeline of operations
• Select, project, group, sort
• $out must appear last in an aggregation pipeline
• There are a range of accumulators (see the group by
documentation)
• Very powerful way to reshape and analyse data
• Shard aware to gain maxinum performance for large clusters
Q&A
21
• This is slide content
25
26
LOREM
IPSUM
LOREM
IPSUM
LOREM
IPSUM
LOREM
IPSUM
Sollicitudin VenenatisLOREM
IPSUM
LOREM
IPSUM
LOREM
IPSUM
LOREM
IPSUM
Graphic Element Examples
Porta Ultricies
Commodo Porta
Graph Examples
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
Category 1 Category 2 Category 3 Category 4
Series 1
Series 2
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
Category 1 Category 2 Category 3 Category 4
Series 1
Series 2
{
_id : ObjectId("4c4ba5e5e8aabf3"),
employee_name: "Dunham, Justin",
department : "Marketing",
title : "Product Manager, Web",
report_up: "Neray, Graham",
pay_band: “C",
benefits : [
{ type : "Health",
plan : "PPO Plus" },
{ type : "Dental",
plan : "Standard" }
]
}
Code/Highlight Example
Aggregation Framework Agility Backup Big Data Briefcase
Buildings Business Intelligence Camera Cash Register Catalog
Chat Checkmark Checkmark Cloud Commercial Contract
Computer Content Continuous Development Credit Card Customer Success
Data Center Data Variety Data Velocity Data Volume Data Warehouse Database
Dialogue Directory Documents Downloads Drivers Dynamic Schema
EDW Integration Faster Time to Market File Transfer Flexible Gear Hadoop
Health Check High Availability Horizontal Scaling Integrating into Infrastructure Internet of Things Iterative Development
Life Preserver Line Graph Lock Log Data Lower Cost Magnifying Glass
Man Mobile Phone Meter Monitoring Music New Apps
New Data Types Online Open Source Parachute Personalization Pin
Platform Certification Product Catalog Puzzle Pieces RDBMS Realtime Analytics Rich Querying
Life Preserver RSS Scalability Scale Secondary Indexing Steering Wheel
Stopwatch Text Search Tick Data Training Transmission Tower Trophy
Woman World