Big data at AWS Chicago User Group - 2014
Have an idea for a meetup? Talk to me: Margaret Walker, CohesiveFT. Tweet: @MargieWalker #AWSChicago
Sponsors & Hosts
#AWSChicago
Agenda
6:00 pm Introductions
6:05 pm Short Talks
• "AWS Storage Options" - Ben Blair, CTO at MarkITx @stochastic_code
• "APIs and Big Data in AWS" - Kin Lane, API Evangelist @kinlane
• "Democratizing Data Analysis with Amazon Redshift" - Bill Wanjohi @billwanjohi and Michelangelo D'Agostino @MichelangeloDA, Civis Analytics
6:45 pm Q & A
7:00 pm Networking, drinks and pizza
"AWS Storage Options" - Ben Blair, CTO at MarkITx
TL;DW
• Use IAM roles for access control
• Use DynamoDB for online storage & transactions
• Use Redshift for offline storage & analysis
• Use S3 to keep *everything*
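The IAM-roles and keep-everything-in-S3 bullets go together in practice. A minimal sketch of that pattern with boto3, assuming an IAM role is attached to the instance so no keys appear in code (the bucket and key names are made up):

```python
# With an IAM role on the instance, boto3 resolves credentials automatically.
import boto3

s3 = boto3.client("s3")

# Archive a raw event file; S3 is the "keep everything" layer.
s3.put_object(
    Bucket="example-archive-bucket",
    Key="raw/2014/07/events.gz",
    Body=b"...",
)
```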
Interactive Data goes in DynamoDB
If your users read or write it, and it’s not huge, it should probably go into DynamoDB
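As a rough illustration of that rule of thumb, a small read/write against DynamoDB with boto3 might look like the following (the table name and key schema are hypothetical):

```python
# Hypothetical table keyed on user_id (hash) and ts (range).
import boto3

table = boto3.resource("dynamodb").Table("user_events")

# Write one interactive record.
table.put_item(Item={"user_id": "42", "ts": 1404172800, "action": "click"})

# Read it back by primary key.
item = table.get_item(Key={"user_id": "42", "ts": 1404172800}).get("Item")
print(item)
```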
Why Not DynamoDB
• Vendor lock-in vs Cassandra
• Can’t add / change indexes (but that’s ok)
• Need to watch utilization
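On the last point, one way to keep an eye on consumed capacity is CloudWatch. A sketch assuming boto3 and the hypothetical user_events table from the previous example:

```python
# Pull an hour of consumed read capacity for the table from CloudWatch.
import datetime
import boto3

cw = boto3.client("cloudwatch")
end = datetime.datetime.utcnow()

stats = cw.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedReadCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "user_events"}],
    StartTime=end - datetime.timedelta(hours=1),
    EndTime=end,
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```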
Redshift vs RDS
• Start with RDS
• Redshift is actually very cheap
• RDS for simple reporting on small data sets
• Redshift for all other analysis
@stochastic_code
github.com/markitx
"APIs and Big Data in AWS" - Kin Lane, API Evangelist
@kinlane
Slides available on GitHub
Democratizing Data Analysis with Amazon Redshift
Michelangelo D'Agostino - Civis Analytics Senior Data Scientist
Bill Wanjohi - Civis Analytics Senior Engineer
What you'll learn
● advantages of Redshift
● some pitfalls
● workflows and recommendations on best practices
Why should you listen?
● 18 months of heavy Redshift use
● Two complementary perspectives: The Scientist and The Engineer
Life before Redshift
● collaborated on monolithic Vertica analytics database
● dozens of TB of data
● scaled from 4-20 server blades
● dozens of concurrent users across departments (hundreds total)
● arbitrary SQL allowed/encouraged
Our early requirements
● SQL language
● low starting cost
● easy to integrate with OSS, other DBs
● performant on large data sets
● minimal database administration
Choosing Redshift
● timing: first full release in Feb 2013
● drastically cheaper to start than other commercial offerings
● very similar to our previous choice, HP Vertica
● many fewer administration tasks
Basics
● RDBMS
● MPP/Columnar
  ○ Supports window functions
  ○ Few enforceable constraints
  ○ No concept of an index
● Redshift <= ParAccel <= PostgreSQL 8
  ○ Postgres drivers work
  ○ ORM requires mocking
● Most data I/O via S3 service
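Because stock Postgres drivers work, connecting and querying looks like ordinary Postgres. A minimal sketch with psycopg2 (the endpoint, database name, and credentials are placeholders):

```python
# Redshift speaks the Postgres wire protocol, so psycopg2 connects directly.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,  # Redshift's default port
    dbname="analytics",
    user="analyst",
    password="...",
)

with conn.cursor() as cur:
    cur.execute("SELECT current_database(), version();")
    print(cur.fetchone())
```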
How’s it worked out?
Pretty good!
● adequate performance
  ○ big step up from traditional RDBMS
  ○ comparable to other analytics DBs
● easy to stand up new clusters
● cheaper clusters now available
● most workflows can live entirely in-database
● S3 is a good broker for what can't
Data Science Workflow
Our custom plumbing syncs tables from dozens of source databases into Redshift at varying refresh frequencies.
We’ve found that SQL just invites so many more people to the analytics game.
Analysts and data scientists run exploratory SQL and build up complex tables for statistical modeling, utilizing crazy joins, aggregates and rollup features.
Redshift supports powerful window functions.
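For instance, a ranking-plus-rollup query of the sort described above might look like this (the table and columns are invented for illustration, and conn is the psycopg2 connection from the earlier sketch):

```python
# Hypothetical example: rank each donor's gifts and compute a lifetime total
# in one pass, using Redshift window functions.
QUERY = """
    SELECT donor_id,
           donated_at,
           amount,
           ROW_NUMBER() OVER (PARTITION BY donor_id
                              ORDER BY donated_at DESC) AS gift_rank,
           SUM(amount) OVER (PARTITION BY donor_id)     AS lifetime_amount
    FROM donations;
"""

with conn.cursor() as cur:
    cur.execute(QUERY)
    rows = cur.fetchall()
```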
Data Science Workflow
Predictive Modeling
For simple linear models, scoring is done directly in Redshift via SQL.
For more complicated models, data is unloaded from Redshift to S3 with an UNLOAD SQL command, processed in EMR, and loaded back into Redshift with a COPY command.
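A rough sketch of that S3 round trip (the bucket, table names, and credentials are placeholders; conn is the psycopg2 connection from earlier):

```python
# Redshift -> S3: export features for the EMR job.
EXPORT_FEATURES = """
    UNLOAD ('SELECT * FROM model_features')
    TO 's3://example-bucket/model-features/part_'
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    DELIMITER '|' GZIP;
"""

# S3 -> Redshift: load the scores the EMR job wrote back out.
LOAD_SCORES = """
    COPY model_scores
    FROM 's3://example-bucket/model-scores/'
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    DELIMITER '|' GZIP;
"""

with conn.cursor() as cur:
    cur.execute(EXPORT_FEATURES)
    # ... EMR processes the unloaded files and writes scores back to S3 ...
    cur.execute(LOAD_SCORES)
conn.commit()
```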
Hurdles we’ve faced along the way
● inconsistent runtimes
● catalog contention
● bugs (databases are hard)
● resizing
● too easy to end up with uncompressed data
● “missing” PostgreSQL functionality
● complex workload management
Setup Recommendations
● at least two nodes
● send 35-day snapshots to other regions
● at-rest encryption
● enforce SSL
● provision with boto or AWS CLI (see the sketch below)
● cluster isolation to hide objects
● buy 3-year reservations
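A rough provisioning sketch along those lines, using boto3 (the cluster identifier, node type, parameter group, and credentials are placeholders, not the speakers' actual setup):

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Two-node, encrypted cluster with 35-day automated snapshots.
# Assumes the "analytics-params" parameter group already exists.
redshift.create_cluster(
    ClusterIdentifier="analytics-cluster",
    ClusterType="multi-node",
    NumberOfNodes=2,
    NodeType="dc1.large",
    DBName="analytics",
    MasterUsername="admin",
    MasterUserPassword="ChangeMe123",
    Encrypted=True,
    AutomatedSnapshotRetentionPeriod=35,
    ClusterParameterGroupName="analytics-params",
)

# Copy automated snapshots to a second region.
redshift.enable_snapshot_copy(
    ClusterIdentifier="analytics-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=35,
)

# SSL enforcement lives in the cluster parameter group (require_ssl = true).
redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-params",
    Parameters=[{"ParameterName": "require_ssl", "ParameterValue": "true"}],
)
```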