Aws Atlanta meetup Amazon Athena

AWS Athena: Querying Data in S3 without the need for EMR

Transcript of Aws Atlanta meetup Amazon Athena

Page 1: Aws Atlanta meetup Amazon Athena

AWS Athena

Querying Data in S3 without the need for EMR

Page 2: Aws Atlanta meetup Amazon Athena

Sponsors

Page 3: Aws Atlanta meetup Amazon Athena

Find me on LinkedIn

AWS Certifications

Presented by Adam Book

Page 4: Aws Atlanta meetup Amazon Athena

What is Athena?

Big Data Offerings Announced at re:Invent 2016

• AWS Glue
• Amazon Athena
• AWS Greengrass

Page 5: Aws Atlanta meetup Amazon Athena

What is Athena?

Athena allows you to Run interactive SQL Queries on S3 data

WITHOUT the need to CREATE, MANAGE, or worry about SCALING infrastructure

Page 6: Aws Atlanta meetup Amazon Athena

About Athena?

Athena is currently only in 2 regions:
• US-EAST-1 (Northern Virginia)
• US-WEST-2 (Oregon)

Pricing: $5 per TB scanned

No charge for failed queries

No charge for Data Definition Language (DDL) statements
• CREATE / ALTER / DROP TABLE
• Managing partitions
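The pricing rules above can be turned into a rough estimate. The sketch below uses only the $5 per TB rate and the two "no charge" cases from the slide. The function name, the binary definition of a terabyte, and the omission of any per-query minimum or rounding are assumptions made for illustration, so treat the result as a ballpark figure rather than a bill.

```python
PRICE_PER_TB_USD = 5.0    # rate quoted on the slide
BYTES_PER_TB = 1024 ** 4  # assumption: 1 TB counted as 2**40 bytes


def estimate_query_cost(bytes_scanned: int, failed: bool = False,
                        is_ddl: bool = False) -> float:
    """Rough cost in USD for one query (hypothetical helper).

    Failed queries and DDL statements are free per the slide.
    Per-query minimums and rounding are deliberately ignored.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    if failed or is_ddl:
        return 0.0
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Scanning half a terabyte at the quoted rate
half_tb_cost = estimate_query_cost(512 * 1024 ** 3)
print(f"${half_tb_cost:.2f}")
```

Because the charge depends on bytes scanned, anything that shrinks the scanned data (compression, columnar formats, partitions) lowers the estimate directly.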

Page 7: Aws Atlanta meetup Amazon Athena

What is Athena?

Athena allows you to Run interactive SQL Queries on S3 data
• S3 data is never modified (data is loaded in read-only memory)
• Cross-region buckets are supported
• You can access Athena via JDBC

WITHOUT the need to CREATE, MANAGE, or worry about SCALING infrastructure

• You only pay for the queries you run
• Queries execute in parallel, so results are FAST, even with large datasets and complex queries

Page 8: Aws Atlanta meetup Amazon Athena

Athena Queries

Service based on Presto (which is available in Amazon EMR)

ANSI SQL operators and functions

Table creation uses APACHE HIVE DDL
• CREATE EXTERNAL TABLE only
• CREATE TABLE AS SELECT not supported

Unsupported Operations:
• User Defined Functions (UDFs or UDAFs)
• Stored Procedures
• Any Transaction found on Hive or Presto
• LZO is not supported (use Snappy instead)

Page 9: Aws Atlanta meetup Amazon Athena

Running queries on Athena

Run queries straight from the AWS Console
• Use the wizard for schema definition
• It can save queries
• It can run multiple queries in parallel

Page 10: Aws Atlanta meetup Amazon Athena

Running queries on Athena

Run queries from your favorite tool (SQL Workbench, Agility, Aqua Data Studio)

• Requires the JDBC driver

JDBC 4.1-compatible driver: https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar

Or via AWS CLI

aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.0.0.jar [loc

Full documentation link HERE

Page 11: Aws Atlanta meetup Amazon Athena

Available Athena File Formats

• Apache Web Logs
• CSV
• TSV
• JSON
• Parquet
• ORC
• TEXT File w/ Custom Delimiters
• AVRO (COMING SOON)

Page 12: Aws Atlanta meetup Amazon Athena

Tips on File Formats

Use Compressed Formats: Snappy, Zlib, GZIP (no LZO)
• Less I/O = Better performance & more cost savings

Use Structured / Columnar Formats
• Apache Parquet
• Apache ORC (Optimized Row Columnar)
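To see why compressed input means less I/O, the sketch below gzips a small, made-up CSV sample with Python's standard library and compares sizes. The sample rows and the helper name are hypothetical, and the ratio you get on real data depends entirely on its content.

```python
import gzip


def compressed_ratio(raw: bytes) -> float:
    """Return gzip-compressed size divided by raw size (smaller is better)."""
    return len(gzip.compress(raw)) / len(raw)


# Hypothetical log-like CSV rows, invented purely for illustration
rows = [
    f"2017-01-{d % 28 + 1:02d},GET,/index.html,200,{d * 17 % 5000}"
    for d in range(10_000)
]
raw = "\n".join(rows).encode("utf-8")

ratio = compressed_ratio(raw)
print(f"raw={len(raw)} bytes, gzip={len(gzip.compress(raw))} bytes, ratio={ratio:.2f}")
```

Fewer bytes stored in S3 means fewer bytes read per query, which is where both the performance and the cost savings on the slide come from.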

Page 13: Aws Atlanta meetup Amazon Athena

The Difference between Hive and Presto

[Diagram comparing Hive and Presto query execution]

HIVE: map and reduce stages with disk reads and writes between them
• Wait between stages
• Write data to disk

PRESTO: pipelined tasks
• Memory-to-memory data transfer
• No disk IO
• Data chunk must fit in memory
• All stages are pipelined
• No wait time
• No fault-tolerance

Page 14: Aws Atlanta meetup Amazon Athena

First Create A new Database

Page 15: Aws Atlanta meetup Amazon Athena

Choose the Data Format

Page 16: Aws Atlanta meetup Amazon Athena

Then lay out the columns

Page 17: Aws Atlanta meetup Amazon Athena

Optionally add Partitions

Page 18: Aws Atlanta meetup Amazon Athena

Bypassing the wizard

CREATE EXTERNAL TABLE IF NOT EXISTS default.elb_logs (
  `request_timestamp` string,
  `elb_name` string,
  `request_ip` string,
  `request_port` int,
  `backend_ip` string,
  `backend_port` int,
  `request_processing_time` double,
  `backend_processing_time` double,
  `client_response_time` double,
  `elb_response_code` string,
  `backend_response_code` string,
  `received_bytes` bigint,
  `sent_bytes` bigint,
  `request_verb` string,
  `url` string,
  `protocol` string,
  `user_agent` string,
  `ssl_cipher` string,
  `ssl_protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?'
)
LOCATION 's3://athena-examples/elb/plaintext/';

Page 19: Aws Atlanta meetup Amazon Athena

Athena Catalog Management

Amazon Athena uses an internal data catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3

For more information, read the documentation page for Catalog Management

Page 20: Aws Atlanta meetup Amazon Athena

Athena Features: saved queries

Page 21: Aws Atlanta meetup Amazon Athena

Athena Tips and Tricks

You will need to have some understanding of the structure of your data or have a DDL meta store before you can start querying with Athena

When you look through your S3 buckets and find data in one of the allowed formats (CSV, TSV, Parquet, etc.), you need to know which columns it contains so you can create the Athena table before you start querying

Avoid Surprises when working with Athena with the following tips and tricks

Page 22: Aws Atlanta meetup Amazon Athena

Athena Tips and Tricks

Table or column names that begin with an underscore

Use backticks if table or column names begin with an underscore. For example:

CREATE TABLE myUnderScoreTable ( `_id` string, `_index` string, ...
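If you generate DDL from a script, the backtick rule is easy to apply automatically. This is a minimal sketch with hypothetical helper names. It quotes only names that start with an underscore, which mirrors the tip on this slide and is not a complete identifier-quoting policy.

```python
def quote_identifier(name: str) -> str:
    """Wrap a name in backticks when it begins with an underscore."""
    if name.startswith("_"):
        return f"`{name}`"
    return name


def column_definitions(columns):
    """Render (name, type) pairs as a comma-separated DDL column list."""
    return ", ".join(f"{quote_identifier(n)} {t}" for n, t in columns)


ddl_columns = column_definitions(
    [("_id", "string"), ("_index", "string"), ("url", "string")]
)
print(ddl_columns)
```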


Page 23: Aws Atlanta meetup Amazon Athena

Athena Tips and Tricks

For the location clause, use a trailing slash

In the LOCATION clause, use a trailing slash for your folder, NOT filenames or glob characters


Don't use:
• s3://path_to_bucket
• s3://path_to_bucket/*
• s3://path_to_bucket/mySpecialFile.dat

Use:
• s3://path_to_bucket/
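The rule above can be checked before a DDL statement is submitted. The function below is a hypothetical pre-flight check that encodes only what the slide says: an s3:// folder path, a trailing slash, and no glob characters. The set of characters treated as globs is an assumption, and the check does not validate bucket names or confirm that the path exists.

```python
GLOB_CHARS = set("*?[]")  # assumption: characters treated as glob patterns


def is_valid_location(location: str) -> bool:
    """Return True if the path looks like a folder-style LOCATION value."""
    prefix = "s3://"
    if not location.startswith(prefix):
        return False
    rest = location[len(prefix):]
    if not rest or rest.startswith("/"):
        return False  # no bucket name given
    if any(ch in GLOB_CHARS for ch in location):
        return False  # glob characters are not allowed
    return location.endswith("/")  # folder paths end with a slash


for candidate in (
    "s3://path_to_bucket",
    "s3://path_to_bucket/*",
    "s3://path_to_bucket/mySpecialFile.dat",
    "s3://path_to_bucket/",
):
    print(candidate, is_valid_location(candidate))
```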

Page 24: Aws Atlanta meetup Amazon Athena

Athena Tips and Tricks cont.

Athena table names are case insensitive

If you are interacting with Apache Spark, then your table names and column names must be lowercase. Athena is case insensitive, but Spark requires lowercase names.

Athena table names only allow the underscore character

Athena table names cannot contain any special characters other than the underscore (_)
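Both naming tips on this slide can be combined into one small check. The sketch below lowercases a proposed name and then accepts it only if it consists of letters, digits, and underscores. The helper names and the exact pattern are one reading of the slide, not Athena's official naming rules.

```python
import re

# Assumption: "no special characters besides underscore" is read as
# lowercase ASCII letters, digits, and underscores only.
_TABLE_NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def normalize_table_name(name: str) -> str:
    """Lowercase the name so it is also safe to use from Spark."""
    return name.lower()


def is_valid_table_name(name: str) -> bool:
    """Check a name after normalization against the pattern above."""
    return _TABLE_NAME_PATTERN.fullmatch(normalize_table_name(name)) is not None


for candidate in ("elb_logs", "ELB_Logs", "elb-logs", "my table"):
    print(candidate, "->", normalize_table_name(candidate),
          is_valid_table_name(candidate))
```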


Page 25: Aws Atlanta meetup Amazon Athena

Questions?

Image by http://www.gratisography.com/

Page 26: Aws Atlanta meetup Amazon Athena

Interested in Sponsoring AWS Atlanta?

Image by http://www.gratisography.com/