Parquet and AVRO


Transcript of Parquet and AVRO

Page 1: Parquet and AVRO

PARQUET & AVRO

http://airisdata.com/

Page 2: Parquet and AVRO

Presenter Introduction
• Tim Spann, Senior Solutions Architect, airis.DATA
• ex-Pivotal Senior Field Engineer
• DZone MVB and Zone Leader
• ex-Startup Senior Engineer / Team Lead

• http://www.slideshare.net/bunkertor
• http://sparkdeveloper.com/
• http://www.twitter.com/PaasDev

Page 3: Parquet and AVRO

• Srinivas Daruna, Data Engineer, airis.DATA, Spark Certified Developer

Presenter Introduction

Page 4: Parquet and AVRO

airis.DATA is a next generation system integrator that specializes in rapidly deployable machine learning and graph solutions.

Our core competencies involve providing modular, scalable Big Data products that can be tailored to fit use cases across industry verticals.

We offer predictive modeling and machine learning solutions at petabyte scale utilizing the most advanced, best-in-class technologies and frameworks including Spark, H2O, Mahout, and Flink.

Our data pipelining solutions can be deployed in batch, real-time or near-real-time settings to fit your specific business use case.

airis.DATA

Page 5: Parquet and AVRO

• Avro and Parquet - When and why to use which format?
• Use cases for Schema Evolution & practical examples
• Data modeling - Avro and Parquet schema
• Workshop
  - Read Avro input from Kafka
  - Transform data in Spark
  - Write dataframe to Parquet
  - Read back from Parquet
• Our experiences with Avro and Parquet
• Some helpful insights for project design

Agenda

Page 6: Parquet and AVRO

AVRO - Introduction

Ø Doug Cutting created Avro, a data serialization and RPC library, to help improve data interchange, interoperability, and versioning in the Hadoop ecosystem.

Ø Serialization & RPC library, and also a storage format.

Ø What led to a new serialization mechanism? Thrift and Protocol Buffers are not splittable, code generation is required for both of them, and dynamic reading is not possible.

Ø Sequence files do not have schema evolution.

Ø Evolved as an in-house serialization and RPC library for Hadoop, with good overall performance that can match up to Protocol Buffers in some aspects.

Page 7: Parquet and AVRO

Ø Dynamic Access – No need of code generation for accessing the data.

Ø Untagged Data – Allows better compression.

Ø Platform independent – Has libraries in Java, Scala, Python, Ruby, C and C#.

Ø Compressible and Splittable – Complements parallel processing systems such as MapReduce and Spark.

Ø Schema Evolution – "Data models evolve over time," and it's important that your data formats support your need to modify your data models. Schema evolution allows you to add, modify, and in some cases delete attributes, while at the same time providing backward and forward compatibility for readers and writers.

Some Important features of AVRO
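The schema-evolution rule above — a field added with a default value keeps old data readable — can be sketched in plain Python. This is an illustrative toy, not the Avro library's actual resolution logic (the real thing is driven by the full writer and reader schemas); the `going` field and its default are hypothetical additions to the deck's `Meetup` record.

```python
import json

# Writer schema (old): records carried only "name".
writer_schema = json.loads(
    '{"type": "record", "name": "Meetup", "fields": ['
    '{"name": "name", "type": "string"}]}')

# Reader schema (new): adds "going" with a default, so old data stays readable.
reader_schema = json.loads(
    '{"type": "record", "name": "Meetup", "fields": ['
    '{"name": "name", "type": "string"},'
    '{"name": "going", "type": "int", "default": 0}]}')

def resolve(record, reader_schema):
    """Toy version of Avro schema resolution: a field missing from the
    written record takes the reader schema's default value."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        else:
            out[field["name"]] = field["default"]  # backward compatibility
    return out

old_record = {"name": "NJ Hadoop Meetup"}   # written with the old schema
print(resolve(old_record, reader_schema))
# {'name': 'NJ Hadoop Meetup', 'going': 0}
```

The same default mechanism is what gives forward compatibility in the other direction: an old reader simply ignores fields it does not know about.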

Page 8: Parquet and AVRO

• Row based
• Direct mapping from/to JSON
• Interoperability: can serialize into Avro/Binary or Avro/JSON
• Provides rich data structures
• Map keys can only be strings (could be seen as a limitation)
• Compact binary form
• Extensible schema language
• Untagged data
• Bindings for a wide variety of programming languages
• Dynamic typing
• Provides a remote procedure call
• Supports block compression
• Avro files are splittable
• Best compatibility for evolving data schemas

Summary of AVRO Properties
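Two of the properties above — "compact binary form" and "untagged data" — come down to how Avro encodes values: per the Avro specification, ints and longs are zigzag-encoded and then written as variable-length base-128 bytes, with no field tags at all (the schema supplies the structure). A minimal pure-Python sketch of that encoding:

```python
def encode_long(n: int) -> bytes:
    """Avro binary encoding of a long: zigzag, then variable-length base-128.
    Zigzag maps small-magnitude negatives to small positives so they stay
    short on the wire: 0->0, -1->1, 1->2, -2->3, ..."""
    n = (n << 1) ^ (n >> 63)           # zigzag (arithmetic shift keeps the sign)
    out = bytearray()
    while (n & ~0x7F) != 0:
        out.append((n & 0x7F) | 0x80)  # low 7 bits with continuation bit set
        n >>= 7
    out.append(n & 0x7F)
    return bytes(out)

print(encode_long(1).hex())    # 02
print(encode_long(-1).hex())   # 01
print(encode_long(64).hex())   # 8001  (two bytes, no type tag)
```

Because there are no per-field tags, the same bytes are meaningless without the schema — which is exactly why Avro files carry their schema in the header.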

Page 9: Parquet and AVRO

AVRO Schema Types

Primitive Types

null: no value
boolean: a binary value
int: 32-bit signed integer
long: 64-bit signed integer
float: single precision (32-bit) IEEE 754 floating-point number
double: double precision (64-bit) IEEE 754 floating-point number
bytes: sequence of 8-bit unsigned bytes
string: unicode character sequence

Primitive types have no specified attributes.Primitive type names are also defined type names. Thus, for example, the schema "string" is equivalent to: {"type":"string"}
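That equivalence can be checked mechanically. A tiny illustrative sketch using Python's json module — the `type_of` helper here is hypothetical, not part of any Avro library:

```python
import json

# Per the Avro spec, a bare primitive type name is itself a full schema:
shorthand = json.loads('"string"')
full      = json.loads('{"type": "string"}')

def type_of(schema):
    """Normalize both spellings of a primitive schema to the bare type name."""
    return schema["type"] if isinstance(schema, dict) else schema

print(type_of(shorthand) == type_of(full))  # True
```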

Complex Types

Records
Enums
Arrays
Maps
Unions
Fixed

https://avro.apache.org/docs/current/spec.html

Page 10: Parquet and AVRO

Avro Schema

Understanding the Avro schema is very important for working with Avro data.

Ø JSON format is used to define the schema.

Ø Simpler than the IDL (Interface Definition Language) of Protocol Buffers and Thrift.

Ø Very useful in RPC: schemas are exchanged to ensure data correctness.

Ø You can specify sort order (ascending or descending) for fields.

Sample Schema:

{
  "type": "record",
  "name": "Meetup",
  "fields": [
    {
      "name": "name",
      "type": "string",
      "order": "descending"
    },
    {
      "name": "value",
      "type": ["null", "string"]
    }
    ...
  ]
}

Union: the "value" field's type is a union of null and string, which makes the field optional.

Page 11: Parquet and AVRO

File Structure - Avro

Page 12: Parquet and AVRO

Workshop - Code Examples

Ø Java API to create an Avro file – API support

Ø Hive query to create an external table with Avro storage format – schema evolution

Ø Accessing an Avro file generated from Java in Python – language independence

Ø Spark – Avro data access

Page 13: Parquet and AVRO

Few interesting things…

• Avro CLI – the Avro Tools jar that provides some command line help
• Thrift and Protocol Buffers
• Kryo
• jackson-avro-databind Java API
• Project Kiji (schema management in HBase)

Please drop a mail for support if you have any issues or if you have suggestions on Avro.

Page 14: Parquet and AVRO

PARQUET - Introduction

• Columnar storage format that came out of a collaboration between Twitter and Cloudera, based on Dremel
• What is a storage format?
• Well-suited to OLAP workloads
• High level of integration with Hadoop and the ecosystem (Hive, Impala and Spark)
• Interoperable with Avro, Thrift and Protocol Buffers

Page 15: Parquet and AVRO

PARQUET cont..

• Allows compression. Currently supports Snappy and Gzip.
• Well supported over the Hadoop ecosystem.
• Very well integrated with Spark SQL and DataFrames.
• Predicate pushdown: projection and predicate pushdowns involve an execution engine pushing the projection and predicates down to the storage format, to optimize the operations at the lowest level possible.
• Keeps I/O to a minimum by reading from disk only the data required for the query.
• Schema evolution to some extent: allows adding new columns at the end.
• Language independent. Supports Scala, Java, C++, Python.
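The projection and predicate pushdown point can be illustrated with a toy column store in plain Python. This is not Parquet's actual implementation — just a sketch of why a columnar layout lets the engine read only the columns a query touches; the sample meetup data is made up:

```python
# Toy column store: each column is held separately, the way Parquet lays
# out data on disk (hypothetical meetup data, for illustration only).
columns = {
    "name":  ["spark-meetup", "hadoop-meetup", "avro-meetup"],
    "going": [120, 85, 200],
    "city":  ["NYC", "NJ", "NYC"],
}

def scan(columns, project, predicate_col, predicate):
    """Read only the predicate column plus the projected columns;
    every other column is never touched (projection + predicate pushdown)."""
    keep = [i for i, v in enumerate(columns[predicate_col]) if predicate(v)]
    return {c: [columns[c][i] for i in keep] for c in project}

# "SELECT name FROM meetups WHERE going > 100" touches 2 of the 3 columns.
print(scan(columns, ["name"], "going", lambda v: v > 100))
# {'name': ['spark-meetup', 'avro-meetup']}
```

In a row-based format like Avro, the same query would have to deserialize every full record; in the columnar layout, the "city" column is never read at all.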

Page 16: Parquet and AVRO

File Structures - Parquet

Page 17: Parquet and AVRO

message Meetup {
  required binary name (UTF8);
  required binary meetup_date (UTF8);
  required int32 going;
  required binary organizer (UTF8);
  required group topics (LIST) {
    repeated binary array (UTF8);
  }
}

Sample Schema

Ø Be careful with Parquet data types
Ø Does not have as good a standalone API as Avro; has converters for Avro, PB and Thrift instead
Ø Flattens all nested data types in order to save them as columnar structures
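The last point — flattening nested types into columnar structures — can be sketched as follows. This is a simplified stand-in for Parquet's Dremel-style record shredding (the repetition and definition levels that handle repeated and optional fields are omitted), and the `flatten` helper is illustrative only:

```python
def flatten(record, prefix=""):
    """Flatten a nested record into dotted column paths. A simplified
    stand-in for Parquet record shredding: each leaf value becomes its
    own column, so nested fields can still be stored column-by-column."""
    out = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))  # recurse into nested group
        else:
            out[path] = value
    return out

meetup = {"name": "avro-meetup", "organizer": {"name": "airis", "city": "NJ"}}
print(flatten(meetup))
# {'name': 'avro-meetup', 'organizer.name': 'airis', 'organizer.city': 'NJ'}
```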

Page 18: Parquet and AVRO

Nested Schema resolution

Page 19: Parquet and AVRO

Parquet few important notes..

• Parquet requires a lot of memory when writing files, because it buffers writes in memory to optimize the encoding and compressing of the data
• Using a heavily nested data structure with Parquet will likely limit some of the optimizations that Parquet makes for pushdowns. If possible, try to flatten your schema

Page 20: Parquet and AVRO

Code examples

• Java API
• Spark example
• Kafka example

Page 21: Parquet and AVRO

How to decide on storage format

• What kind of data do you have?
• What is the processing framework? Current and future
• Data processing and querying
• Do you have RPC/IPC?
• How much schema evolution do you have?

Page 22: Parquet and AVRO

Our experiences with Parquet and Avro

Page 23: Parquet and AVRO

Namespace re-definitions

<item>
  <chapters>
    <content>
      <name>content1</name>
      <pages>100</pages>
    </content>
  </chapters>
</item>
<otheritem>
  <chapters>
    <othercontent>
      <randomname>xyz</randomname>
      <someothername>abcd</someothername>
    </othercontent>
  </chapters>
</otheritem>

Page 24: Parquet and AVRO

Is the Agenda Accomplished?

✓ Avro and Parquet - When and why to use which format?
✓ Use cases for Schema Evolution & practical examples
✓ Data modeling - Avro and Parquet schema
✓ Workshop
  - Read Avro input from Kafka
  - Transform data in Spark
  - Write dataframe to Parquet
  - Read back from Parquet
✓ Our experiences with Avro and Parquet
✓ Some helpful insights for project design

Page 25: Parquet and AVRO

Questions?

Page 26: Parquet and AVRO

Notes• https://dzone.com/articles/where-should-i-store-hadoop-data

• https://developer.ibm.com/hadoop/blog/2016/01/14/5-reasons-to-choose-parquet-for-spark-sql/

• http://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015

• http://parquet.apache.org/

• https://github.com/cloudera/parquet-examples

• http://avro.apache.org/docs/current/spec.html#schema_primitive

• http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/

• https://cwiki.apache.org/confluence/display/AVRO/FAQ

• http://avro.apache.org/

• https://github.com/miguno/avro-cli-examples


• https://dzone.com/articles/getting-started-apache-avro

• https://github.com/databricks/spark-avro

Page 27: Parquet and AVRO

Notes• https://github.com/twitter/bijection

• https://github.com/mkuthan/example-spark

• http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/

• https://github.com/databricks/spark-avro/blob/master/README.md

• http://www.bigdatatidbits.cc/2015/01/how-to-load-some-avro-data-into-spark.html

• http://engineering.intenthq.com/2015/08/pucket/

• https://dzone.com/articles/understanding-how-parquet

• http://blog.cloudera.com/blog/2015/03/converting-apache-avro-data-to-parquet-format-in-apache-hadoop/

Page 28: Parquet and AVRO

Our Solutions Team

o Prasad Sripathi, CEO, Experienced Big Data Architect, Head of NJ Data Science and Hadoop Meetups

o Sergey Fogelson, PhD, Director, Data Science

o Eric Marshall, Senior Systems Architect, Hortonworks Hadoop Certified Administrator

o Kristina Rogale Plazonic, Spark Certified Data Engineer

o Ravi Kora, Spark Certified Senior Data Scientist

o Srinivasarao Daruna, Spark Certified Data Engineer

o Srujana Kuntumalla, Spark Certified Data Engineer

o Tim Spann, Senior Solutions Architect, ex-Pivotal

o Rajiv Singla, Data Engineer

o Suresh Kempula, Data Engineer

Page 29: Parquet and AVRO

Technology Stack