Parquet and AVRO


Transcript of Parquet and AVRO

Page 1: Parquet and AVRO

PARQUET & AVRO

http://airisdata.com/

Page 2: Parquet and AVRO

Presenter Introduction
• Tim Spann, Senior Solutions Architect, airis.DATA
• ex-Pivotal Senior Field Engineer
• DZone MVB and Zone Leader
• ex-Startup Senior Engineer / Team Lead

• http://www.slideshare.net/bunkertor
• http://sparkdeveloper.com/
• http://www.twitter.com/PaasDev

Page 3: Parquet and AVRO

• Srinivas Daruna, Data Engineer, airis.DATA, Spark Certified Developer

Presenter Introduction

Page 4: Parquet and AVRO

airis.DATA is a next generation system integrator that specializes in rapidly deployable machine learning and graph solutions.

Our core competencies involve providing modular, scalable Big Data products that can be tailored to fit use cases across industry verticals.

We offer predictive modeling and machine learning solutions at petabyte scale utilizing the most advanced, best-in-class technologies and frameworks including Spark, H2O, Mahout, and Flink.

Our data pipelining solutions can be deployed in batch, real-time or near-real-time settings to fit your specific business use case.

airis.DATA

Page 5: Parquet and AVRO

• Avro and Parquet - When and why to use which format?
• Use cases for Schema Evolution & practical examples
• Data modeling - Avro and Parquet schema
• Workshop
  - Read Avro input from Kafka
  - Transform data in Spark
  - Write dataframe to Parquet
  - Read back from Parquet
• Our experiences with Avro and Parquet
• Some helpful insights for project design

Agenda

Page 6: Parquet and AVRO

AVRO - Introduction

Ø Doug Cutting created Avro, a data serialization and RPC library, to help improve data interchange, interoperability, and versioning in the Hadoop ecosystem.

Ø Serialization & RPC library, and also a storage format.

Ø What led to a new serialization mechanism? Thrift and Protocol Buffers are not splittable, code generation is required for both of them, and dynamic reading is not possible.

Ø Sequence files do not have schema evolution.

Ø Evolved as an in-house serialization and RPC library for Hadoop, with good overall performance that can match up to Protocol Buffers in some aspects.

Page 7: Parquet and AVRO

Ø Dynamic Access – No need of code generation for accessing the data.

Ø Untagged Data – Allows better compression.

Ø Platform independent – Has libraries in Java, Scala, Python, Ruby, C and C#.

Ø Compressible and Splittable – Complements parallel processing systems such as MapReduce and Spark.

Ø Schema Evolution – "Data models evolve over time," and it's important that your data formats support your need to modify your data models. Schema evolution allows you to add, modify, and in some cases delete attributes, while at the same time providing backward and forward compatibility for readers and writers.

Some Important features of AVRO
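The schema-evolution rule above — a field added with a default value keeps old data readable — can be sketched in plain Python. This is an illustrative toy, not the Avro library's actual resolution logic (the real thing is driven by the full writer and reader schemas); the `going` field and its default are hypothetical additions to the deck's `Meetup` record.

```python
import json

# Writer schema (old): records carried only "name".
writer_schema = json.loads(
    '{"type": "record", "name": "Meetup", "fields": ['
    '{"name": "name", "type": "string"}]}')

# Reader schema (new): adds "going" with a default, so old data stays readable.
reader_schema = json.loads(
    '{"type": "record", "name": "Meetup", "fields": ['
    '{"name": "name", "type": "string"},'
    '{"name": "going", "type": "int", "default": 0}]}')

def resolve(record, reader_schema):
    """Toy version of Avro schema resolution: a field missing from the
    written record takes the reader schema's default value."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        else:
            out[field["name"]] = field["default"]  # backward compatibility
    return out

old_record = {"name": "NJ Hadoop Meetup"}   # written with the old schema
print(resolve(old_record, reader_schema))
# {'name': 'NJ Hadoop Meetup', 'going': 0}
```

The same default mechanism is what gives forward compatibility in the other direction: an old reader simply ignores fields it does not know about.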

Page 8: Parquet and AVRO

• Row based
• Direct mapping from/to JSON
• Interoperability: can serialize into Avro/Binary or Avro/JSON
• Provides rich data structures
• Map keys can only be strings (could be seen as a limitation)
• Compact binary form
• Extensible schema language
• Untagged data
• Bindings for a wide variety of programming languages
• Dynamic typing
• Provides a remote procedure call
• Supports block compression
• Avro files are splittable
• Best compatibility for evolving data schemas

Summary of AVRO Properties
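Two of the properties above — "compact binary form" and "untagged data" — come down to how Avro encodes values: per the Avro specification, ints and longs are zigzag-encoded and then written as variable-length base-128 bytes, with no field tags at all (the schema supplies the structure). A minimal pure-Python sketch of that encoding:

```python
def encode_long(n: int) -> bytes:
    """Avro binary encoding of a long: zigzag, then variable-length base-128.
    Zigzag maps small-magnitude negatives to small positives so they stay
    short on the wire: 0->0, -1->1, 1->2, -2->3, ..."""
    n = (n << 1) ^ (n >> 63)           # zigzag (arithmetic shift keeps the sign)
    out = bytearray()
    while (n & ~0x7F) != 0:
        out.append((n & 0x7F) | 0x80)  # low 7 bits with continuation bit set
        n >>= 7
    out.append(n & 0x7F)
    return bytes(out)

print(encode_long(1).hex())    # 02
print(encode_long(-1).hex())   # 01
print(encode_long(64).hex())   # 8001  (two bytes, no type tag)
```

Because there are no per-field tags, the same bytes are meaningless without the schema — which is exactly why Avro files carry their schema in the header.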

Page 9: Parquet and AVRO

AVRO Schema Types

Primitive Types

null: no value
boolean: a binary value
int: 32-bit signed integer
long: 64-bit signed integer
float: single precision (32-bit) IEEE 754 floating-point number
double: double precision (64-bit) IEEE 754 floating-point number
bytes: sequence of 8-bit unsigned bytes
string: unicode character sequence

Primitive types have no specified attributes.Primitive type names are also defined type names. Thus, for example, the schema "string" is equivalent to: {"type":"string"}
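That equivalence can be checked mechanically. A tiny illustrative sketch using Python's json module — the `type_of` helper here is hypothetical, not part of any Avro library:

```python
import json

# Per the Avro spec, a bare primitive type name is itself a full schema:
shorthand = json.loads('"string"')
full      = json.loads('{"type": "string"}')

def type_of(schema):
    """Normalize both spellings of a primitive schema to the bare type name."""
    return schema["type"] if isinstance(schema, dict) else schema

print(type_of(shorthand) == type_of(full))  # True
```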

Complex Types

Records
Enums
Arrays
Maps
Unions
Fixed

https://avro.apache.org/docs/current/spec.html

Page 10: Parquet and AVRO

Avro Schema

Understanding the Avro schema is very important for working with Avro data.

Ø JSON format is used to define the schema.

Ø Simpler than the IDL (Interface Definition Language) of Protocol Buffers and Thrift.

Ø Very useful in RPC: schemas are exchanged to ensure data correctness.

Ø You can specify sort order (ascending or descending) for fields.

Sample Schema:

{
  "type": "record",
  "name": "Meetup",
  "fields": [
    {
      "name": "name",
      "type": "string",
      "order": "descending"
    },
    {
      "name": "value",
      "type": ["null", "string"]
    }
    ...
  ]
}

Union: the "value" field's type is a union of null and string, which makes the field optional.

Page 11: Parquet and AVRO

File Structure - Avro

Page 12: Parquet and AVRO

Workshop - Code Examples

Ø Java API to create an Avro file – API support

Ø Hive query to create an external table with Avro storage format – schema evolution

Ø Accessing an Avro file generated from Java in Python – language independence

Ø Spark – Avro data access

Page 13: Parquet and AVRO

Few interesting things…

• Avro CLI – the Avro Tools jar that provides some command line help
• Thrift and Protocol Buffers
• Kryo
• jackson-avro-databind Java API
• Project Kiji (schema management in HBase)

Please drop a mail for support if you have any issues or if you have suggestions on Avro.

Page 14: Parquet and AVRO

PARQUET - Introduction

• Columnar storage format that came out of a collaboration between Twitter and Cloudera, based on Dremel
• What is a storage format?
• Well-suited to OLAP workloads
• High level of integration with Hadoop and the ecosystem (Hive, Impala and Spark)
• Interoperable with Avro, Thrift and Protocol Buffers

Page 15: Parquet and AVRO

PARQUET cont..

• Allows compression. Currently supports Snappy and Gzip.
• Well supported over the Hadoop ecosystem.
• Very well integrated with Spark SQL and DataFrames.
• Predicate pushdown: projection and predicate pushdowns involve an execution engine pushing the projection and predicates down to the storage format, to optimize the operations at the lowest level possible.
• Keeps I/O to a minimum by reading from disk only the data required for the query.
• Schema evolution to some extent: allows adding new columns at the end.
• Language independent. Supports Scala, Java, C++, Python.
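The projection and predicate pushdown point can be illustrated with a toy column store in plain Python. This is not Parquet's actual implementation — just a sketch of why a columnar layout lets the engine read only the columns a query touches; the sample meetup data is made up:

```python
# Toy column store: each column is held separately, the way Parquet lays
# out data on disk (hypothetical meetup data, for illustration only).
columns = {
    "name":  ["spark-meetup", "hadoop-meetup", "avro-meetup"],
    "going": [120, 85, 200],
    "city":  ["NYC", "NJ", "NYC"],
}

def scan(columns, project, predicate_col, predicate):
    """Read only the predicate column plus the projected columns;
    every other column is never touched (projection + predicate pushdown)."""
    keep = [i for i, v in enumerate(columns[predicate_col]) if predicate(v)]
    return {c: [columns[c][i] for i in keep] for c in project}

# "SELECT name FROM meetups WHERE going > 100" touches 2 of the 3 columns.
print(scan(columns, ["name"], "going", lambda v: v > 100))
# {'name': ['spark-meetup', 'avro-meetup']}
```

In a row-based format like Avro, the same query would have to deserialize every full record; in the columnar layout, the "city" column is never read at all.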

Page 16: Parquet and AVRO

File Structures - Parquet

Page 17: Parquet and AVRO

message Meetup {
  required binary name (UTF8);
  required binary meetup_date (UTF8);
  required int32 going;
  required binary organizer (UTF8);
  required group topics (LIST) {
    repeated binary array (UTF8);
  }
}

Sample Schema

Ø Be careful with Parquet data types
Ø Does not have as good a standalone API as Avro; has converters for Avro, PB and Thrift instead
Ø Flattens all nested data types in order to save them as columnar structures
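The last point — flattening nested types into columnar structures — can be sketched as follows. This is a simplified stand-in for Parquet's Dremel-style record shredding (the repetition and definition levels that handle repeated and optional fields are omitted), and the `flatten` helper is illustrative only:

```python
def flatten(record, prefix=""):
    """Flatten a nested record into dotted column paths. A simplified
    stand-in for Parquet record shredding: each leaf value becomes its
    own column, so nested fields can still be stored column-by-column."""
    out = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))  # recurse into nested group
        else:
            out[path] = value
    return out

meetup = {"name": "avro-meetup", "organizer": {"name": "airis", "city": "NJ"}}
print(flatten(meetup))
# {'name': 'avro-meetup', 'organizer.name': 'airis', 'organizer.city': 'NJ'}
```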

Page 18: Parquet and AVRO

Nested Schema resolution

Page 19: Parquet and AVRO

Parquet few important notes..

• Parquet requires a lot of memory when writing files, because it buffers writes in memory to optimize the encoding and compressing of the data
• Using a heavily nested data structure with Parquet will likely limit some of the optimizations that Parquet makes for pushdowns. If possible, try to flatten your schema

Page 20: Parquet and AVRO

Code examples

• Java API
• Spark example
• Kafka example

Page 21: Parquet and AVRO

How to decide on storage format

• What kind of data do you have?
• What is the processing framework? Current and future
• Data processing and querying
• Do you have RPC/IPC?
• How much schema evolution do you have?

Page 22: Parquet and AVRO

Our experiences with Parquet and Avro

Page 23: Parquet and AVRO

Namespace re-definitions

<item>
  <chapters>
    <content>
      <name>content1</name>
      <pages>100</pages>
    </content>
  </chapters>
</item>
<otheritem>
  <chapters>
    <othercontent>
      <randomname>xyz</randomname>
      <someothername>abcd</someothername>
    </othercontent>
  </chapters>
</otheritem>

Page 24: Parquet and AVRO

Is the Agenda Accomplished?

✓ Avro and Parquet - When and why to use which format?
✓ Use cases for Schema Evolution & practical examples
✓ Data modeling - Avro and Parquet schema
✓ Workshop
  - Read Avro input from Kafka
  - Transform data in Spark
  - Write dataframe to Parquet
  - Read back from Parquet
✓ Our experiences with Avro and Parquet
✓ Some helpful insights for project design

Page 25: Parquet and AVRO

Questions?

Page 26: Parquet and AVRO

Notes• https://dzone.com/articles/where-should-i-store-hadoop-data

• https://developer.ibm.com/hadoop/blog/2016/01/14/5-reasons-to-choose-parquet-for-spark-sql/

• http://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015

• http://parquet.apache.org/

• https://github.com/cloudera/parquet-examples

• http://avro.apache.org/docs/current/spec.html#schema_primitive

• http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/

• https://cwiki.apache.org/confluence/display/AVRO/FAQ

• http://avro.apache.org/

• https://github.com/miguno/avro-cli-examples


• https://dzone.com/articles/getting-started-apache-avro

• https://github.com/databricks/spark-avro

Page 27: Parquet and AVRO

Notes• https://github.com/twitter/bijection

• https://github.com/mkuthan/example-spark

• http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/

• https://github.com/databricks/spark-avro/blob/master/README.md

• http://www.bigdatatidbits.cc/2015/01/how-to-load-some-avro-data-into-spark.html

• http://engineering.intenthq.com/2015/08/pucket/

• https://dzone.com/articles/understanding-how-parquet

• http://blog.cloudera.com/blog/2015/03/converting-apache-avro-data-to-parquet-format-in-apache-hadoop/

Page 28: Parquet and AVRO

Our Solutions Team

o Prasad Sripathi, CEO, Experienced Big Data Architect, Head of NJ Data Science and Hadoop Meetups

o Sergey Fogelson, PhD, Director, Data Science

o Eric Marshall, Senior Systems Architect, Hortonworks Hadoop Certified Administrator

o Kristina Rogale Plazonic, Spark Certified Data Engineer

o Ravi Kora, Spark Certified Senior Data Scientist

o Srinivasarao Daruna, Spark Certified Data Engineer

o Srujana Kuntumalla, Spark Certified Data Engineer

o Tim Spann, Senior Solutions Architect, ex-Pivotal

o Rajiv Singla, Data Engineer

o Suresh Kempula, Data Engineer

Page 29: Parquet and AVRO

Technology Stack