Data Management: Challenges and Answers
Andras Benczur
Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)
benczur @ sztaki.mta.hu
http://datamining.sztaki.hu
November 11, 2025
Jövő Internet Konferencia (Future Internet Conference)


Growth of Data vs. Growth of Data Analysts

http://www.delphianalytics.net/wp-content/uploads/2013/04/GrowthOfDataVsDataAnalysts.png
2012.06: http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/

2013.02: http://www.slideshare.net/mjft01/big-data-big-deal-a-big-data-101-presentation

SAP HANA Demo: Meryl Streep, Oscar 2012

Image: http://mirror.co.uk

What Was the Challenge in Analytics?
  One year, 1 billion tweet collection, 100 GB
  Ad hoc queries ("Meryl Streep") may have 100,000+ hits
  Fast response needed to support the analyst
Solutions
  In-memory databases (such as SAP HANA)
  Customized approximate data structures (Bloom filters, MinHash fingerprints)
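The two approximate data structures named above can be sketched in a few lines of Python. This is a minimal illustration, not the structures used in the demo: the helper `_hash`, the class `BloomFilter`, the functions `minhash_signature` and `estimate_jaccard`, and all parameter values are assumptions made for this example. A Bloom filter answers "have I seen this item?" with no false negatives and a tunable false positive rate. A MinHash fingerprint estimates the Jaccard similarity of two token sets from short signatures.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item`, salted by `seed` (illustrative helper)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        return [_hash(seed, item) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def minhash_signature(tokens, num_hashes: int = 64):
    """One minimum per salted hash function; `tokens` must be non-empty."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


seen = BloomFilter()
seen.add("meryl")
seen.add("streep")
print("meryl" in seen)

tweet_a = {"meryl", "streep", "wins", "oscar"}
tweet_b = {"meryl", "streep", "wins", "award"}
print(estimate_jaccard(minhash_signature(tweet_a), minhash_signature(tweet_b)))
```

Both structures trade exactness for a small, fixed memory footprint, which is what makes them attractive when an ad hoc query has to scan a very large tweet collection quickly.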

Information spread prediction in time

Collaboration w/ Ericsson on Spark Piggybacking
  Hard to predict resource usage; overestimation hurts Spark
  Peaks may cause out-of-memory errors; Spark will fail

[Figure: resources over time. Labels: resource allocated by job, current consumption, out-of-memory error, under-utilization (waste of resources)]
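The trade-off in the figure can be made concrete with a small calculation. The sketch below assumes a fixed allocation and a usage trace sampled at regular steps. The function name `summarize_allocation` and the trace values are hypothetical and are not taken from the Ericsson collaboration.

```python
def summarize_allocation(allocated, consumption):
    """Compare a fixed allocation with a usage trace sampled at regular steps."""
    # Capacity reserved but not used at each step (under-utilization).
    wasted = sum(max(allocated - used, 0) for used in consumption)
    # Steps where usage exceeds the allocation (would trigger out-of-memory).
    oom_steps = [t for t, used in enumerate(consumption) if used > allocated]
    # Share of the reserved capacity that was actually used.
    used_within = sum(min(used, allocated) for used in consumption)
    utilization = used_within / (allocated * len(consumption))
    return {"wasted": wasted, "oom_steps": oom_steps, "utilization": utilization}


trace = [2, 3, 9, 4, 2]  # hypothetical memory usage per step, in GB
print(summarize_allocation(8, trace))
```

With this trace, raising the allocation removes the out-of-memory step but increases the wasted capacity, which is the tension the slide describes.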

Data Stream Analytics: Mobile Session Drop
  Periodic radio channel physical parameter measurements
  Predict abnormal termination
  Best features selected by AdaBoost:

     Feature                            Drop      Not drop
  1. RLC NACK ratio Upl. Max            0.93447   0.12676
  2. RLC NACK ratio Upl. Mean           0.11787   -0.11571
  3. Harq NACK ratio Downl. Max         0.02061   0.00619
  4. Delta RLC NACK ratio Upl. Mean     0.19277   0.18110
  5. Signal-Interf+Noise Mean           1.92105   6.61538

[Figure: Dynamic Time Warping]
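Dynamic Time Warping compares two time series that may be stretched or shifted relative to each other, which suits periodic measurements taken over sessions of different length. The following is a minimal textbook sketch using the absolute difference as the local cost. The function name `dtw_distance` and the sample series are illustrative assumptions, not the implementation behind the results shown here.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched again (b stalls)
                cost[i][j - 1],      # b[j-1] is matched again (a stalls)
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[n][m]


# The second series repeats one sample; DTW absorbs the stretch.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

A nearest-neighbor classifier over such distances is one common way to use DTW for labeling sessions, though the slide does not say how the distance was used.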

AUC: 0.93
FPR: 0.03, TPR: 0.7
FPR: 0.2, TPR: 0.89
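For readers less familiar with these metrics, the sketch below shows how one (FPR, TPR) operating point and the AUC are computed from labels and classifier scores. The function names `roc_point` and `auc` and the toy data are assumptions for illustration; they do not reproduce the figures reported above.

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when sessions with score >= threshold are flagged as drops."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0, 0]            # 1 = drop, 0 = no drop (toy data)
scores = [0.9, 0.4, 0.5, 0.2, 0.1]  # hypothetical classifier outputs
print(roc_point(labels, scores, threshold=0.45))
print(auc(labels, scores))
```

Lowering the threshold moves along the ROC curve toward higher TPR at the price of higher FPR, which is the relationship between the two operating points listed above.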

[Figure legend: Drop, No Drop]

STREAMLINE Magic Triangle

New initiative on top of Apache Flink
  DFKI (DE)
  SICS (SE)
  Portugal Telecom (PT)
  Internet Memory (FR)
  Rovio (FI)
  NMusic (PT)
  SZTAKI (HU)

Batch and Streaming in Flink
  A "use-case complete" framework to unify batch and stream processing
[Diagram labels: event logs, historic data, ETL, relational, graph analysis, machine learning, streaming analysis]
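One way to picture a unified treatment of batch and stream input is an operator written once against a generic iterable, so that a finite historic data set and a live feed pass through the same code. The sketch below is plain Python and does not use the Flink API. The function `tumbling_counts`, the event layout, and the sample events are assumptions, and the operator expects events in timestamp order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, key); hypothetical layout


def tumbling_counts(events: Iterable[Event], window: int) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts per key) for time-ordered events.

    Windows that contain no events are not emitted.
    """
    current_start = None
    counts: Dict[str, int] = defaultdict(int)
    for ts, key in events:
        start = ts - ts % window
        if current_start is not None and start != current_start:
            yield current_start, dict(counts)  # window closed, emit result
            counts = defaultdict(int)
        current_start = start
        counts[key] += 1
    if current_start is not None:
        yield current_start, dict(counts)      # flush the last open window


historic = [(1, "drop"), (4, "ok"), (12, "ok"), (13, "ok")]  # batch input


def live_stream():
    """Stand-in for an unbounded source; yields events one at a time."""
    for event in historic:
        yield event


print(list(tumbling_counts(historic, window=10)))
print(list(tumbling_counts(live_stream(), window=10)))
```

Both calls produce the same windows because the operator never assumes the input is finite. A real engine adds what this sketch leaves out, such as out-of-order events, parallelism, and fault tolerance.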

[Diagram: Flink. Labels: historic data; Kafka, RabbitMQ, ...; HDFS, JDBC, ...; ETL, graphs, machine learning, relational; low latency, windowing, aggregations, ...; event logs; real-time data streams]

Batch and stream: same execution engine
  An engine that puts equal emphasis on streaming and batch

Bike Share Challenge: one more month to go!

Questions?
  New architecture for unified batch + stream needed
  Apache Flink has the potential
  New machine learning is needed
  We participate in turning research code into open source software