PARQUET & AVRO
http://airisdata.com/
Presenter Introduction
• Tim Spann, Senior Solutions Architect, airis.DATA
• ex-Pivotal Senior Field Engineer
• DZone MVB and Zone Leader
• ex-Startup Senior Engineer / Team Lead
• http://www.slideshare.net/bunkertor
• http://sparkdeveloper.com/
• http://www.twitter.com/PaasDev

Presenter Introduction
• Srinivas Daruna, Data Engineer, airis.DATA, Spark Certified Developer
airis.DATA
airis.DATA is a next-generation system integrator that specializes in rapidly deployable machine learning and graph solutions.
Our core competencies involve providing modular, scalable Big Data products that can be tailored to fit use cases across industry verticals.
We offer predictive modeling and machine learning solutions at petabyte scale utilizing the most advanced, best-in-class technologies and frameworks, including Spark, H2O, Mahout, and Flink.
Our data pipelining solutions can be deployed in batch, real-time, or near-real-time settings to fit your specific business use case.
Agenda
• Avro and Parquet: when and why to use which format?
• Use cases for schema evolution & practical examples
• Data modeling: Avro and Parquet schema
• Workshop:
  - Read Avro input from Kafka
  - Transform data in Spark
  - Write DataFrame to Parquet
  - Read back from Parquet
• Our experiences with Avro and Parquet
• Some helpful insights for project design
AVRO - Introduction
• Doug Cutting created Avro, a data serialization and RPC library, to help improve data interchange, interoperability, and versioning in the Hadoop ecosystem.
• A serialization & RPC library, and also a storage format.
• What led to a new serialization mechanism? Thrift and Protocol Buffers are not splittable, both require code generation, and dynamic reading is not possible.
• Sequence files do not have schema evolution.
• Evolved as the in-house serialization and RPC library for Hadoop, with good overall performance that can match Protocol Buffers in some aspects.
• Dynamic access: no code generation is needed to access the data.
• Untagged data, which allows better compression.
• Platform independent: has libraries in Java, Scala, Python, Ruby, C, and C#.
• Compressible and splittable: complements parallel processing systems such as MapReduce and Spark.
• Schema evolution: "Data models evolve over time," and it's important that your data formats support your need to modify your data models. Schema evolution allows you to add, modify, and in some cases delete attributes, while at the same time providing backward and forward compatibility for readers and writers.
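The backward-compatibility idea can be sketched in plain Python. This is an illustration of Avro's schema-resolution rule for added fields (a new reader field with a "default" is filled in when reading old records), not the Avro library itself; the schemas and the resolve helper are hypothetical.

```python
import json

# Writer schema v1: the original data model.
writer_schema = json.loads("""
{"type": "record", "name": "Meetup",
 "fields": [{"name": "name", "type": "string"}]}
""")

# Reader schema v2: adds a "city" field WITH a default, so records
# written under v1 remain readable (backward compatibility).
reader_schema = json.loads("""
{"type": "record", "name": "Meetup",
 "fields": [{"name": "name", "type": "string"},
            {"name": "city", "type": "string", "default": "unknown"}]}
""")

def resolve(record, reader):
    """Fill in defaults for reader fields missing from an old record,
    mimicking Avro's schema-resolution rule for added fields."""
    out = {}
    for field in reader["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        else:
            out[field["name"]] = field["default"]  # must declare a default
    return out

old_record = {"name": "NJ Data Science"}
assert resolve(old_record, reader_schema) == {
    "name": "NJ Data Science", "city": "unknown"}
```

Dropping a field works symmetrically: the writer's extra field is simply ignored by the reader, which is why defaults are the key to evolving Avro schemas safely.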
Some Important features of AVRO
• Row based
• Direct mapping from/to JSON
• Interoperability: can serialize into Avro/Binary or Avro/JSON
• Provides rich data structures
• Map keys can only be strings (could be seen as a limitation)
• Compact binary form
• Extensible schema language
• Untagged data
• Bindings for a wide variety of programming languages
• Dynamic typing
• Provides a remote procedure call
• Supports block compression
• Avro files are splittable
• Best compatibility for evolving data schemas
Summary of AVRO Properties
AVRO Schema Types

Primitive Types
• null: no value
• boolean: a binary value
• int: 32-bit signed integer
• long: 64-bit signed integer
• float: single precision (32-bit) IEEE 754 floating-point number
• double: double precision (64-bit) IEEE 754 floating-point number
• bytes: sequence of 8-bit unsigned bytes
• string: unicode character sequence

Primitive types have no specified attributes. Primitive type names are also defined type names. Thus, for example, the schema "string" is equivalent to: {"type": "string"}
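That equivalence is easy to see with a small normalization helper (the normalize function here is a hypothetical illustration, not part of any Avro API):

```python
import json

def normalize(schema):
    """Expand Avro's shorthand type names into the full object form:
    "string" and {"type": "string"} denote the same schema."""
    if isinstance(schema, str):
        return {"type": schema}
    return schema

# Both spellings resolve to the same schema object.
assert normalize(json.loads('"string"')) == {"type": "string"}
assert normalize(json.loads('{"type": "string"}')) == {"type": "string"}
```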
Complex Types
• Records
• Enums
• Arrays
• Maps
• Unions
• Fixed
https://avro.apache.org/docs/current/spec.html
Avro Schema
Understanding the Avro schema is very important for working with Avro data.
• JSON format is used to define the schema.
• Simpler than the IDL (Interface Definition Language) of Protocol Buffers and Thrift.
• Very useful in RPC: schemas are exchanged to ensure data correctness.
• You can specify a sort order (ascending or descending) for fields.
Sample Schema:

{
  "type": "record",
  "name": "Meetup",
  "fields": [
    {
      "name": "name",
      "type": "string",
      "order": "descending"
    },
    {
      "name": "value",
      "type": ["null", "string"]
    }
    …
  ]
}
Union: the ["null", "string"] type above is a union; including "null" makes the "value" field nullable.
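A quick sketch of how that union reads from the schema JSON; the checks below are illustrative, using only the standard library:

```python
import json

# The sample schema from the slide: "value" is a union of null and string.
schema = json.loads("""
{"type": "record", "name": "Meetup",
 "fields": [
   {"name": "name",  "type": "string", "order": "descending"},
   {"name": "value", "type": ["null", "string"]}
 ]}
""")

# A union is written as a JSON array of its branch types; putting
# "null" first is the idiomatic way to model an optional field.
value_field = next(f for f in schema["fields"] if f["name"] == "value")
assert isinstance(value_field["type"], list)      # array = union
assert value_field["type"] == ["null", "string"]  # nullable string
```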
File Structure - Avro
Workshop - Code Examples
• Java API to create an Avro file: API support
• Hive query to create an external table with Avro storage format: schema evolution
• Accessing an Avro file generated from Java in Python: language independence
• Spark Avro data access
Few interesting things…
• Avro CLI: the Avro Tools jar that can provide some command-line help
• Thrift and Protocol Buffers
• Kryo
• jackson-avro-databind Java API
• Project Kiji (schema management in HBase)
Please drop a mail for support if you have any issues, or if you have suggestions on Avro.
PARQUET - Introduction
• Columnar storage format that came out of a collaboration between Twitter and Cloudera, based on Dremel
• What is a storage format?
• Well suited to OLAP workloads
• High level of integration with Hadoop and the ecosystem (Hive, Impala, and Spark)
• Interoperable with Avro, Thrift, and Protocol Buffers
PARQUET cont..
• Allows compression; currently supports Snappy and Gzip.
• Well supported across the Hadoop ecosystem.
• Very well integrated with Spark SQL and DataFrames.
• Predicate pushdown: projection and predicate pushdowns involve an execution engine pushing the projection and predicates down to the storage format, to optimize the operations at the lowest level possible.
• Keeps I/O to a minimum by reading from disk only the data required for the query.
• Schema evolution to some extent: allows adding new columns at the end.
• Language independent: supports Scala, Java, C++, and Python.
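Why columnar layout enables projection pushdown can be shown with a toy example. This is not Parquet itself, just an illustration of the row-versus-column trade-off; the sample data is made up:

```python
# Row layout: each record stored whole. Answering "total going"
# must read every full row from disk.
rows = [
    {"name": "Spark Meetup",  "going": 120, "organizer": "airis.DATA"},
    {"name": "Hadoop Meetup", "going": 80,  "organizer": "NJ DS"},
]

# Column layout: the same data pivoted so each field is stored
# contiguously. A query over one field touches only that column,
# which is what projection pushdown exploits to minimize I/O.
columns = {key: [row[key] for row in rows] for key in rows[0]}

total_going = sum(columns["going"])  # reads only the "going" column
assert total_going == 200
assert columns["organizer"] == ["airis.DATA", "NJ DS"]
```

Storing a column contiguously also lets Parquet encode and compress it as a unit (e.g. with Snappy or Gzip), since values of one type compress far better together than interleaved rows do.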
File Structures - Parquet
message Meetup {
  required binary name (UTF8);
  required binary meetup_date (UTF8);
  required int32 going;
  required binary organizer (UTF8);
  required group topics (LIST) {
    repeated binary array (UTF8);
  }
}

Sample Schema
• Be careful with Parquet data types.
• Does not have a good standalone API like Avro; it has converters for Avro, Protocol Buffers, and Thrift instead.
• Flattens all nested data types in order to save them as columnar structures.
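A much-simplified sketch of that flattening, turning nested records into dotted column paths. Real Parquet encodes nesting with Dremel-style repetition and definition levels rather than path strings, so treat this purely as intuition; the flatten helper and data are hypothetical:

```python
def flatten(record, prefix=""):
    """Flatten a nested record into dotted column paths, a simplified
    picture of how nested fields become separate Parquet columns."""
    out = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))  # recurse into nested group
        else:
            out[path] = value
    return out

nested = {"name": "Meetup", "topics": {"first": "Avro", "count": 2}}
assert flatten(nested) == {
    "name": "Meetup",
    "topics.first": "Avro",
    "topics.count": 2,
}
```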
Nested Schema resolution
Parquet - a few important notes
• Parquet requires a lot of memory when writing files, because it buffers writes in memory to optimize the encoding and compressing of the data.
• Using a heavily nested data structure with Parquet will likely limit some of the optimizations that Parquet makes for pushdowns. If possible, try to flatten your schema.
Code examples
• Java API
• Spark Example
• Kafka Example
How to decide on storage format
• What kind of data do you have?
• What is the processing framework, current and future?
• Data processing and querying needs
• Do you have RPC/IPC?
• How much schema evolution do you have?
Our experiences with Parquet and Avro
Namespace re-definitions

<item>
  <chapters>
    <content>
      <name>content1</name>
      <pages>100</pages>
    </content>
  </chapters>
</item>
<otheritem>
  <chapters>
    <othercontent>
      <randomname>xyz</randomname>
      <someothername>abcd</someothername>
    </othercontent>
  </chapters>
</otheritem>
Is the Agenda Accomplished?
✓ Avro and Parquet: when and why to use which format?
✓ Use cases for schema evolution & practical examples
✓ Data modeling: Avro and Parquet schema
✓ Workshop:
  - Read Avro input from Kafka
  - Transform data in Spark
  - Write DataFrame to Parquet
  - Read back from Parquet
✓ Our experiences with Avro and Parquet
✓ Some helpful insights for project design
Questions?
Notes• https://dzone.com/articles/where-should-i-store-hadoop-data
• https://developer.ibm.com/hadoop/blog/2016/01/14/5-reasons-to-choose-parquet-for-spark-sql/
• http://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015
• http://parquet.apache.org/
• https://github.com/cloudera/parquet-examples
• http://avro.apache.org/docs/current/spec.html#schema_primitive
• http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/
• https://cwiki.apache.org/confluence/display/AVRO/FAQ
• http://avro.apache.org/
• https://github.com/miguno/avro-cli-examples
• https://dzone.com/articles/getting-started-apache-avro
• https://github.com/databricks/spark-avro
Notes• https://github.com/twitter/bijection
• https://github.com/mkuthan/example-spark
• http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
• https://github.com/databricks/spark-avro/blob/master/README.md
• http://www.bigdatatidbits.cc/2015/01/how-to-load-some-avro-data-into-spark.html
• http://engineering.intenthq.com/2015/08/pucket/
• https://dzone.com/articles/understanding-how-parquet
• http://blog.cloudera.com/blog/2015/03/converting-apache-avro-data-to-parquet-format-in-apache-hadoop/
Our Solutions Team
o Prasad Sripathi, CEO, Experienced Big Data Architect, Head of NJ Data Science and Hadoop Meetups
o Sergey Fogelson, PhD, Director, Data Science
o Eric Marshall, Senior Systems Architect, Hortonworks Hadoop Certified Administrator
o Kristina Rogale Plazonic, Spark Certified Data Engineer
o Ravi Kora, Spark Certified Senior Data Scientist
o Srinivasarao Daruna, Spark Certified Data Engineer
o Srujana Kuntumalla, Spark Certified Data Engineer
o Tim Spann, Senior Solutions Architect, ex-Pivotal
o Rajiv Singla, Data Engineer
o Suresh Kempula, Data Engineer
Technology Stack