Detecting Data Errors: Where are we and what needs to be done?*
Presentation By: Sitong Che and Srikar Pyda
Written By: Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang
{abedjan,ddong,raulcf,stonebraker}@csail.mit.edu {x4chu,ilyas}@uwaterloo.ca {mouzzani,ntang}@qf.org.qa [email protected]
Introduction
• A multitude of data-cleaning tools exist to detect and potentially repair errors
• It’s better to think of data-cleaning solutions as being tailored to detecting particular categories of errors rather than detecting all potential errors
• Data cleaning is important for enterprises because data-centric approaches are becoming critical for innovation in business and science
• Different types of errors often exist in the same data-set
  • Requires cleaning with multiple tools in order to detect and repair the variety of nuances in the errors
Overview: Pragmatic Question
• Are these tools robust enough to capture most errors in real-world data sets?
• What is the best strategy to holistically run multiple tools to optimize the detection effort?
Data Cleaning Solution Categories
• Rule-based detection algorithms: Users can specify a collection of data-cleaning rules and the tool will find any violations within the data-set
  • NADEEF
  • “not null” constraint
  • Multi-attribute functional dependencies (FDs)
  • User-defined functions
• Pattern enforcement and transformation tools: Pattern enforcement tools discover both semantic and syntactic patterns in the data and use them to detect errors. Transformation tools can be used to change the data representation and expose additional patterns within the data-set.
  • Syntactic: OPENREFINE and TRIFACTA
  • Semantic: KATARA
• Quantitative error detection: These algorithms expose outliers and other statistical glitches within the data.
• Record linkage and de-duplication algorithms: De-duplication tools detect duplicate data records which refer to the same entity. Conflicting values can be found, further indicating errors.
  • TAMR
• Are these categories sufficient? Overlap?
  • The authors admit that their categorization does not perfectly partition errors
  • The authors are attempting to categorize detectable errors from real-world data-sets
Data-Cleaning Challenges
• Synthetic data and errors: Most cleaning algorithms are evaluated on data, either synthetic or real-world, with synthetically injected errors
  • There is both a lack of real data-sets with appropriate ground truth and a lack of a widely accepted benchmark of data-cleaning quality
• Combination of error types and tools: Real-world data often has multiple kinds of errors.
  • An error can often be found through a multitude of tools
  • Conflicting duplicate records and integrity constraints
  • Overlap??
• Human involvement: Enterprises require budgets to facilitate human power; having an ideal ordering for the application of data-cleaning tools is key to minimizing human intervention.
  • Verify detected errors
  • Specify cleaning rules
  • Provide feedback for machine learning
Overview: Methodology
• Collection of real-world data with either full or partial ground truth
  • Represents the kinds of dirty data found in practice
  • The authors can easily judge the performance because of the knowledge of ground truth
• Interested in automatic error discovery as opposed to automatic repair because auto-repair is not pragmatic in practice.
• Report results in terms of precision and recall with respect to the ground truth.
• Upper-bound recall: estimate for the maximum recall of a tool if it has been configured by an oracle
  • “Perfect configuration” of data-cleaning rules to enable optimal error detection
  • Use ground truth to estimate upper-bound recall: classify remaining errors that are not detected by type
  • Any error whose type can be cleaned by a tool should be counted towards its recall.
Evaluation Questions
• What is the precision and recall of each tool? How prevalent are the errors which the data-cleaning tool is designed to detect?
• How many errors in the data sets are detectable by applying all tools combined?
• Since human-in-the-loop is a well accepted paradigm, how many false positives are there? These drain the human effort budget and can cause a cleaning effort to fail.
• Is there a strategy to minimize human effort by leveraging the interaction among tools?
Main Findings
• Conclusion 1: There is no single dominant tool because the data-cleaning algorithms are generally tailored towards particular types of errors
  • A holistic “composite” strategy must be used because each data-cleaning tool is individually designed to detect selective genres of errors
• Conclusion 2: By assessing the overlap of errors detected by the data-cleaning tools, the tools can be ordered to minimize false positives (and thus user engagement)
  • Ordering strategy must be specific to the data-set because of variance in structural properties and patterns
• Conclusion 3: The percentage of errors that can be found by the combined ordered application of all tools is significantly less than 100%.
  • Will discuss additional errors later in experiments; researchers need to develop new ways of finding these “unknown categories” of data errors (ones which can be spotted by humans but not by the current cleaning tools)
Data Errors and Data Sets
• Data error: Given a data set, a data error is a cell value which is different from its given ground truth
• Outlier errors: Cell values which deviate from the distribution of values in a column of a table.
• Duplicate errors: Distinct database entries/records which refer to the same real-world entity.
  • If the two entries’ attribute values do not match, that could indicate an error.
• Rule violation errors: Cell values that violate any kind of integrity constraint
  • Not Null & Uniqueness constraints
• Pattern violations: Values that violate syntactic and semantic constraints
  • Alignment, formatting, misspelling, and semantic data types
Data Sets Overview
• The four error types are generally prevalent across all data-sets
  • The Animal data-set does not have outliers
  • MIT VPF and BlackOak are the only data-sets with duplicates
• The ratio of erroneous cells in each data-set ranges from 0.1% to 34%
• Structural properties:
  • Rayyan Bib has the highest percentage of errors while Animal has the lowest
  • Merck has the greatest number of attributes
  • BlackOak has the greatest number of entries
MIT-VPF
• MIT Office of the Vice President for Finance’s (VPF) procurement database, which contains information about vendors and individuals that supply MIT with products and supplies
• Structural details:
  • Execute purchase order: a new entry with details about the contracting party is added to the vendor master data-set whenever MIT buys a product
  • Identification information (name, address, phone number, business codes)
• The ongoing process of adding entries creates a unique problem of duplicates and other data errors (theory)
  • Inconsistent formatting (address, phone number, company names)
  • Contact information may change over time
• Ground truth: Employees of VPF manually curated a random sample of 13,603 records (half of the data-set) and marked erroneous fields (empirics)
  • Address and company names: missing street numbers, wrong capitalization, and attribute values in the wrong column
Merck
• The Merck data set describes IT services and software systems within the company that are partially managed with third parties; used for optimization of downsizing services
• Structural details: Each system is characterized by location, number of end users, and level of technical support
  • Greatest number of attributes (68) but is very sparse
• Ground truth: Merck provided the custom cleaning script they used to produce a cleaned version of the data-set
  • Applies various data transformations that normalize columns and allow for uniform value representation
• The authors utilized the script to formulate rules and transformations for cleaning tools
  • There are many hidden function calls that are implicitly called which change data values
Animal
• Animal data-set provided by scientists at UC Berkeley about the effects of firewood cutting on small terrestrial vertebrates
• Structural properties:
  • Each entry contains information about the time and location of capture of an animal, in addition to its properties: tag number, sex, weight, species, and age group
  • Each record was manually entered into spreadsheets after being initially transcribed on paper (data from 1993-2012)
• Ground truth: The scientists manually cleaned the data-set and identified several hundred erroneous cells
• Errors:
  • Shifted fields
  • Wrong numeric values
Rayyan-Bib
• Rayyan is a system built at QCRI to assist scientists in the production of systematic reviews
  • Literature reviews which identify and synthesize all research evidence related to a nuanced research question
• Structural properties:
  • Users consolidate search results into long lists of references to studies which they feel are relevant to answering the question
  • Searching multiple databases using multiple queries
  • Users can manually manipulate citations, so the data is prone to error
  • Entries have a lot of attributes: article_title, journal_title, journal_abbreviation, etc.
• Ground truth: The authors manually checked a sample of 1,000 references from Rayyan’s database
  • Many missing values and inconsistencies in data
  • journal_title and journal_abbreviation are often switched
  • Author names are sometimes found in journal_title
BlackOak
• BlackOak Analytics is a company which provides entity resolution solutions
• Structural properties: Provided an anonymized address data set and a dirty version which they use for evaluation
  • Ground truth is given because it’s a synthetic data-set
  • Errors are randomly distributed
• Errors:
  • Spelling of values
  • Formatting of values
  • Completeness
  • Field separation
• The authors included this data-set specifically to analyze the difference in error detection performance between real-world and synthetic data sets
Data Cleaning Tools
• Selected data-cleaning tools which covered all four error types
• Multiple tools sometimes focus on different subtypes of a given error type
• Iterative fine-tuning process for each tool
  • Compare detected errors with ground truth to adjust the tool configuration or rules and improve performance
• Detectable errors are counted towards the recall upper bound
Strategy 1: Outlier Detection
• Detect data values which do not follow the statistical distribution of the overall data
• Tool 1: DBOOST
  • Unique: Decomposes run-on data types (date) into their constituent pieces (m, y, d)
  • Attributes which are wrapped in more complex data can be individually analyzed for outliers
  • Histograms create a distribution of the data without any a priori assumption by counting the occurrences of unique data values
  • Gaussian and GMM assume that each value was drawn from a normal distribution with a given mean and standard deviation, or from a multivariate Gaussian distribution, respectively.
  • Optimal parameter configuration:
    • Number of bins & their widths for histograms
    • Mean & standard deviation for Gaussian and GMM
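
The Gaussian and histogram ideas on this slide can be illustrated with a minimal sketch. This is not DBOOST's implementation; the function names, the `k` multiplier, and the `min_share` cutoff are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, stdev

def gaussian_outliers(values, k=3.0):
    """Indices of values more than k standard deviations from the mean."""
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sd]

def histogram_outliers(values, min_share=0.05):
    """Indices of values whose relative frequency is below min_share.

    No distribution is assumed: each distinct value is its own bin.
    """
    counts = Counter(values)
    n = len(values)
    return [i for i, v in enumerate(values) if counts[v] / n < min_share]

# Hypothetical columns loosely modeled on the Animal data-set.
weights = [20, 21, 19, 22, 20, 21, 20, 19, 21, 20, 500]
sexes = ["M", "F"] * 10 + ["X"]
print(gaussian_outliers(weights, k=2.0))
print(histogram_outliers(sexes))
```

Both thresholds play the role of the parameters the slide says must be tuned; with a single extreme value in a short column, a large `k` can fail to flag it, which is why the example uses `k=2.0`.
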
Strategy 2: Rule-based Error Detection
• Rely on data-quality rules to detect errors, expressed using integrity constraints
  • Functional dependencies
  • Denial constraints
• Violation: Collection of cells that do not conform to a given integrity constraint
  • At least one cell involved in the violation must be changed to resolve a violation
• Tool 2: DC-Clean
  • Focuses on denial constraints
  • The authors design a collection of DCs to capture the semantics of the data
  • “If there are two captures of the same animal indicated by the same tag number, then the first capture must be marked as original”
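
To make the notion of a violation concrete, here is a small sketch that checks a functional dependency and returns every cell involved in a violation. It is an illustration only, not DC-Clean's algorithm; the `fd_violations` name and the zip/city example are assumptions.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the (row index, attribute) cells that violate the FD lhs -> rhs.

    Rows that agree on every lhs attribute but disagree on rhs form a
    violation; all of their lhs and rhs cells are reported, since at least
    one of them must change to resolve it.
    """
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)
    cells = set()
    for indices in groups.values():
        if len({rows[i][rhs] for i in indices}) > 1:
            for i in indices:
                for attr in list(lhs) + [rhs]:
                    cells.add((i, attr))
    return cells

# Hypothetical address rows: zip is assumed to determine city.
rows = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Boston"},
    {"zip": "10001", "city": "New York"},
]
print(sorted(fd_violations(rows, ["zip"], "city")))
```
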
Strategy 3: Pattern-based Detection
• Tool 3: OPENREFINE: Open-source data wrangling tool
• Tool 4: TRIFACTA: Community version of a commercial data wrangling tool
  • OPENREFINE and TRIFACTA focus on syntactic patterns: provide exploration techniques to discover data inconsistencies
• Tool 5: KATARA: Semantic pattern discovery and detection tool
  • Focuses on semantic patterns matched against a knowledge base
• ETL (Extract, Transform, Load) tools: pull data out of one database and place it in another
  • Tool 6: PENTAHO
  • Tool 7: KNIME
Tool 3: OPENREFINE
• OPENREFINE is an open-source wrangling tool that can digest data in multiple formats; facilitates data exploration
• Faceting operation: Lets users look at different kinds of aggregated data; resembles a grouping operation
  • The user specifies one column for faceting and OPENREFINE generates a widget that shows all distinct values & their number of occurrences
• Filtering operation:
  • The user can specify an expression on multiple columns and OPENREFINE generates the widget based on values of the expression
  • The user can then select one or more values in the widget and OPENREFINE filters out rows which do not contain the selected values
• Data cleaning uses an editing operation
  • Edits one cell at a time
  • If you edit a text facet, all cells consistent with that facet will update
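
The facet, filter, and edit operations resemble a group-and-count followed by a selection and a bulk update. The sketch below mimics that behavior in plain Python; it does not use OPENREFINE itself, and the function names and sample rows are assumptions.

```python
from collections import Counter

def text_facet(rows, column):
    """Distinct values of a column with their number of occurrences."""
    return Counter(row[column] for row in rows)

def filter_by_facet(rows, column, selected):
    """Keep only the rows whose value in the column is among the selected ones."""
    return [row for row in rows if row[column] in selected]

def edit_facet(rows, column, old, new):
    """Rewrite every cell that belongs to one facet value, in place."""
    for row in rows:
        if row[column] == old:
            row[column] = new

# Hypothetical vendor rows with an inconsistent state spelling.
rows = [{"state": "MA"}, {"state": "MA"}, {"state": "Mass."}, {"state": "NY"}]
print(text_facet(rows, "state"))               # rare values stand out
print(filter_by_facet(rows, "state", {"Mass."}))
edit_facet(rows, "state", "Mass.", "MA")
print(text_facet(rows, "state"))
```
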
Tool 4: TRIFACTA
• TRIFACTA is the commercial descendant of Data Wrangler: predicts and applies various syntactic data transformations for data preparation and cleaning.
  • Can apply business & standardization rules through available transformation scripts
• Applies a frequency analysis to each column to identify most and least frequent values
• Shows attribute values that deviate strongly from the value distribution in the specific attribute
• Maps each attribute to its most prominent data type and identifies values that do not match
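
The type-verification step can be sketched as: infer a type per value, take the most common type as the column's type, and flag everything else. This is a simplified illustration, not TRIFACTA's logic; the three type labels and the function names are assumptions.

```python
from collections import Counter

def infer_type(value):
    """Classify a string as 'integer', 'decimal', or 'string'."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "decimal"
    except ValueError:
        return "string"

def type_mismatches(column):
    """Return the dominant type and the indices of values that do not match it."""
    types = [infer_type(v) for v in column]
    dominant, _ = Counter(types).most_common(1)[0]
    return dominant, [i for i, t in enumerate(types) if t != dominant]

# Hypothetical "number of end users" column with one placeholder value.
print(type_mismatches(["12", "7", "N/A", "30"]))
```
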
Tool 5: KATARA
• KATARA relies on external knowledge bases, such as Yago, to detect & correct errors which violate a semantic pattern
• Identifies the type of a column and the relationship between two columns in the data-set using a knowledge base
  • The type of column A in a table might correspond to Country in the knowledge base Yago & the relationship between columns A and B might correspond to the predicate HasCapital in Yago
• Based on the discovered types and relationships, KATARA validates values using the knowledge base and human experts
  • Example: A value of “California” in column A will be marked as an error because it is not a country in Yago
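
A toy version of this validation is sketched below. The two dictionaries are a hand-written stand-in for a knowledge base (they are not Yago data or a Yago API), and the function name is an assumption; in the real system, candidates like these would still go to human experts.

```python
# Hypothetical miniature knowledge base: one type and one relation.
KB_TYPES = {"Country": {"Italy", "France", "Qatar"}}
KB_RELATIONS = {
    "HasCapital": {("Italy", "Rome"), ("France", "Paris"), ("Qatar", "Doha")}
}

def semantic_suspects(rows, col_a, col_b, col_type, relation):
    """Flag cells that contradict the column type or the column-pair relation."""
    suspects = []
    for i, row in enumerate(rows):
        if row[col_a] not in KB_TYPES[col_type]:
            suspects.append((i, col_a))      # value is not of the expected type
        elif (row[col_a], row[col_b]) not in KB_RELATIONS[relation]:
            suspects.append((i, col_b))      # pair is not in the relation
    return suspects

rows = [
    {"A": "Italy", "B": "Rome"},
    {"A": "California", "B": "Sacramento"},
    {"A": "France", "B": "Lyon"},
]
print(semantic_suspects(rows, "A", "B", "Country", "HasCapital"))
```
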
Tool 6: PENTAHO
• PENTAHO provides a graphical interface where data wrangling can be implemented as a directed graph of ETL (Extract, Transform, Load) operations
• Any data manipulation or rule validation can be added as a node in the ETL pipeline
• Executes multiple ETL workflows to clean/curate data, BUT rules/procedures must be specified by the user
• Provides routines for string transformation and single-column constraint validation
Tool 7: KNIME
• KNIME focuses on workflow authoring and encapsulating data processing tasks, such as curation and machine-learning-based functionality, in composable nodes
• Although KNIME executes multiple ETL workflows to clean/curate data, similar to PENTAHO, rules/procedures must be specified by the user
• Users must know exactly what kinds of rules and patterns need to be verified
• Unlike OPENREFINE & TRIFACTA, PENTAHO and KNIME do not provide ways to automatically display outliers and detect type mismatches
Strategy 4: Duplicate Detection
• If two records refer to the same real-world entity but have differing attribute values, there is a strong chance that one of the two values is an error
• Tool 8: TAMR (commercial descendant of the Data Tamer system)
  • TAMR is a tool with industrial-strength data integration algorithms for record linkage and schema mapping
  • Premised on machine learning models that learn duplicate features
    • Expert sourcing
    • Similarity metrics
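
As a rough illustration of how a similarity metric exposes conflicting duplicates, the sketch below pairs records whose normalized names are nearly identical and reports the attributes on which they disagree. It swaps in a simple string-similarity threshold from Python's `difflib` in place of TAMR's learned models; the function names, the 0.9 threshold, and the sample records are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def conflicting_duplicates(records, key, threshold=0.9):
    """Return (i, j, differing attributes) for likely duplicate record pairs."""
    conflicts = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        sim = SequenceMatcher(None, normalize(a[key]), normalize(b[key])).ratio()
        if sim >= threshold:
            differing = [f for f in a if f != key and a[f] != b.get(f)]
            if differing:
                conflicts.append((i, j, differing))
    return conflicts

# Hypothetical vendor records.
records = [
    {"name": "ACME Corp.", "phone": "617-555-0100"},
    {"name": "Acme Corp", "phone": "617-555-0199"},
    {"name": "Zenith LLC", "phone": "212-555-0111"},
]
print(conflicting_duplicates(records, "name"))
```

Comparing all pairs is quadratic in the number of records, so this is only suitable for small samples.
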
Combination of Multiple Tools
• Problem: How does a user properly combine multiple independent data-cleaning tools?
  • Option 1: Run all tools and apply a union or min-k strategy
  • Option 2: Have users manually check a sample of detected errors, which can be used to guide the prioritization of data-cleaning operations
Option 1: Union All and Min-k
• Union all
  • Takes the union of the errors emitted by all tools
• Min-k
  • Keeps those errors detected by at least k tools while excluding those detected by fewer than k tools
  • No need to keep cleaning the data-set with new techniques if the maximum performance for error detection has been reached
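
Both strategies reduce to counting how many tools flagged each cell. A minimal sketch, assuming each tool's output is available as a set of cell identifiers (the `detections` layout and function names are assumptions):

```python
from collections import Counter

def min_k(detections, k):
    """Cells flagged by at least k tools; detections maps tool -> set of cells."""
    votes = Counter(cell for cells in detections.values() for cell in cells)
    return {cell for cell, n in votes.items() if n >= k}

def union_all(detections):
    """Union of all tools' detections, i.e. min-k with k = 1."""
    return min_k(detections, 1)

# Hypothetical outputs of three tools over cells numbered 1..4.
detections = {"t1": {1, 2, 3}, "t2": {2, 3, 4}, "t3": {3}}
print(union_all(detections))
print(min_k(detections, 2))
print(min_k(detections, 3))
```
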
Ordering Based on Precision
• Problems with Option 1 (exhaustive union)
  • Expensive because it requires massive amounts of human effort to validate a large number of candidate errors
  • BlackOak data-set: A user would have to verify the 982,000 cells identified as possibly erroneous to discover 382,928 actual errors.
  • Results from tools with poor error-detection performance on this particular data-set should not be evaluated
• Alternative: Sampling-based method to select the order in which data-cleaning strategies will be applied to the data-set
Ordering Based on Precision
• Cost model: Although the performance of a tool can be measured by precision and recall in detecting errors, precision is a better proxy for a data-cleansing tool’s error detection performance
  • Recall can only be computed if all of the errors in the data are known (full ground truth); this is nearly impossible when we execute error detection strategies on new data-sets
  • Precision is easy to estimate
• Assume C is the cost of having a human check a detected error and V is the value of identifying a real error
  • Value must be higher than cost!
  • P*V > (P+N)*C, where P is the number of correctly detected errors and N is the number of erroneously detected errors (false positives)
  • P/(P+N) > C/V
  • Set threshold: σ = C/V
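
The inequality can be checked numerically. The sketch below plugs in the BlackOak union figures quoted earlier in the deck (982,000 flagged cells, 382,928 real errors); the cost and value numbers are hypothetical and only illustrate how the threshold σ = C/V moves.

```python
def precision(p, n):
    """Share of detected errors that are real: P / (P + N)."""
    return p / (p + n)

def worth_validating(p, n, cost, value):
    """True when the value of the real errors exceeds the total checking cost."""
    return p * value > (p + n) * cost

flagged, real = 982_000, 382_928       # figures quoted for the BlackOak union
false_pos = flagged - real

print(round(precision(real, false_pos), 2))                   # about 0.39
print(worth_validating(real, false_pos, cost=1, value=2))     # sigma = 0.5
print(worth_validating(real, false_pos, cost=1, value=5))     # sigma = 0.2
```

With σ = 0.5 the union is not worth checking, while with σ = 0.2 it is, matching the rule that a detection set is worth validating only when its precision exceeds C/V.
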
Ordering Based on Precision
• Any tool with a precision below σ should not be run because the cost of checking is greater than the value of accurately identifying a data error
• The ratio is domain dependent (unknown in most cases); it is natural to have large V values for highly valuable data
  • If V is very large, the threshold is correspondingly small and all data-cleansing tools will be considered, which boosts recall
  • If the value of C is high and dominates the ratio, only very precise tools are worth validating, which saves cost at the expense of recall
• The authors estimate the precision of a data-cleansing tool on a given data-set by checking a random sample of the detected errors.
• Why not run all the tools with a precision higher than the threshold and evaluate the union of all their detected error sets??
  • Tools are not independent, and sets of detected errors may and often do overlap
  • Some tools may have an extremely high precision, but all of the errors they detect may be covered by tools that have even higher precision values
Ordering Based on Precision
• Maximum entropy-based order selection: Following the Maximum Entropy principle, the authors design an algorithm which assesses the estimated precision of each data-cleansing tool
  • Estimates overlap between tool results
  • Picking the tool with the highest precision (the share of reported positives that are true errors) reduces entropy the most, because high entropy refers to uncertainty
Ordering Based on Precision: Algorithm
1. Run all data-cleaning tools on the entire data-set and return detected errors
2. Estimate precision for each tool by verifying a random sample of its detected errors with a human expert
3. Pick the tool with the highest estimated precision among all data-cleaning tools not yet considered in order to maximize entropy, and verify those of its detected errors on the complete data-set which have not been verified before
4. Since errors validated in Step 3 may have been detected by other tools, update the precision of the other tools and go to Step 3 to pick the next tool if a data-cleaning tool exists with an estimated precision > σ
Empirics: Regardless of each tool’s individual performance, the proposed order reduces the cost of manual verification with a marginal reduction of recall.
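
The four steps can be sketched as follows. This is a simplified reading of the slide, not the authors' code: `is_error` stands in for the human expert, the update rule in Step 4 (estimated true positives minus those already confirmed, divided by the cells still unverified) is an assumption, and every tool is assumed to report at least one cell.

```python
import random

def order_tools(detections, is_error, sigma, sample_frac=0.05, seed=0):
    """Return (tool order, confirmed errors); detections maps tool -> set of cells."""
    rng = random.Random(seed)
    verified = {}                      # cell -> True/False, as judged by the human

    def check(cell):
        if cell not in verified:
            verified[cell] = is_error(cell)
        return verified[cell]

    # Step 2: estimate each tool's precision from a random sample.
    est, est_tp = {}, {}
    for tool, cells in detections.items():
        k = max(1, round(len(cells) * sample_frac))
        sample = rng.sample(sorted(cells), k)
        est[tool] = sum(check(c) for c in sample) / k
        est_tp[tool] = est[tool] * len(cells)   # estimated true positives

    order, remaining = [], set(detections)
    while remaining:
        # Step 3: most precise remaining tool (ties broken by name).
        best = max(sorted(remaining), key=lambda t: est[t])
        if est[best] <= sigma:
            break
        remaining.remove(best)
        order.append(best)
        for cell in detections[best]:
            check(cell)
        # Step 4: re-estimate the other tools on their still-unverified cells.
        for tool in remaining:
            todo = [c for c in detections[tool] if c not in verified]
            done_tp = sum(verified[c] for c in detections[tool] if c in verified)
            left = max(0.0, est_tp[tool] - done_tp)
            est[tool] = min(1.0, left / len(todo)) if todo else 0.0

    return order, {c for c, bad in verified.items() if bad}

# Hypothetical tools: D only repeats what A finds, B reports only false positives.
truth = set(range(1, 11)) | {21, 22, 23, 24}
detections = {"A": set(range(1, 11)), "B": set(range(11, 21)),
              "C": {21, 22, 23, 24}, "D": {1, 2, 3}}
print(order_tools(detections, truth.__contains__, sigma=0.4, sample_frac=0.25))
```

In this example tool D is skipped even though it is perfectly precise, because everything it detects has already been verified through tool A.
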
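Evaluation Metrics
• D: data set
• G: purely cleaned data set
• E: the set of erroneous cells, E = diff(G, D)
• T(D): the set of cells marked as errors by tool T
• Precision: $P = |T(D) \cap E| / |T(D)|$
• Recall: $R = |T(D) \cap E| / |E|$
• Aggregated F: $F = 2(R \cdot P)/(R + P)$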
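
A small sketch of these metrics, taking precision and recall in the usual set-based sense (the share of flagged cells that are real errors, and the share of real errors that were flagged). The cell-keyed dictionaries and the `evaluate` name are assumptions made for the example.

```python
def evaluate(dirty, clean, flagged):
    """Return (precision, recall, F) for a set of flagged cells.

    dirty and clean map a cell id, e.g. (row, column), to its value;
    E is the set of cells where the two versions differ.
    """
    errors = {cell for cell in dirty if dirty[cell] != clean[cell]}
    true_pos = len(flagged & errors)
    p = true_pos / len(flagged) if flagged else 0.0
    r = true_pos / len(errors) if errors else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical two-row table with two erroneous cells.
dirty = {(0, "city"): "Bostn", (0, "zip"): "02139",
         (1, "city"): "Boston", (1, "zip"): "0213"}
clean = {(0, "city"): "Boston", (0, "zip"): "02139",
         (1, "city"): "Boston", (1, "zip"): "02134"}
print(evaluate(dirty, clean, {(0, "city"), (1, "city")}))
```
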
Usage of Tools
• DBOOST: applied three algorithms: Gaussian, histogram, GMM, with the parameters that make F highest.
• DC-Clean: existing rules + manually constructed FD rules based on obvious n-to-1 relationships
• OpenRefine: facet mechanism + formatting and single-column rules
• TRIFACTA: outlier detection and type verification + formatting and single-column rules
• KATARA: manually constructed & existing knowledge bases
• PENTAHO & KNIME: model each transformation and validation routine as a workflow node in the ETL process
• TAMR: iterate training until the precision and recall become stable
Data quality rules defined on each dataset
User Involvement
• Set rules
• Perform data exploration using OpenRefine and TRIFACTA
• Validate the detected errors
• Go through the remaining errors and try to categorize them
Individual Effectiveness
Individual Effectiveness
• DBOOST: useless for Animal; good for BlackOak
• DC-Clean: good for Animal and Merck; bad for MIT VPF
• OpenRefine: bad for Animal, top 2 for others
• TRIFACTA: bad for Animal recall, top 2 for others
• KATARA: good for BlackOak, bad for MIT VPF
• PENTAHO & KNIME: good in general
• TAMR: found all duplicates for MIT VPF, and most of the duplicates for BlackOak
Tool Combination Effectiveness
• Union All: High recall but low precision (lots of FP)
Min-K
• Require that at least k algorithms agree on an error
• (k=1) == union all
• As k increases, precision increases and recall decreases
• Main problem: how to pick k
Ordering based on Benefit and User Validation
• Randomly sample 5% of the detected errors for each tool and compare them with ground truth for precision estimation.
• Run tools in precision order (dynamically update the precision estimation and drop tools that did not pass the threshold)
• Baseline: simple union
• Threshold: σ (0.1-0.5) (for precision)
• As the threshold increases, precision increases and FP decreases significantly, while TP decreases a little, causing recall to decrease a little
Ordering Strategy results
Recall Upper-bound
• extra rules found by manually going through remaining errors
Domain Specific Tools
• For MIT VPF and BlackOak: ADDRESSCLEANER
• Applied on a sample of 1,000
• Found 2 & 13 new errors, respectively. Recall: 0.93-0.95; 0.999-0.999
Enrichment
• Manually add more attributes to the original dataset (only those that did not introduce additional duplicate rows)
• DC-Clean & TAMR
Conclusion
• There is no single dominant tool for the various data sets and diversified types of errors. Single tools achieved on average 47% precision and 36% recall, showing that a combination of tools is needed to cover all the errors.
• Picking the right order in applying the tools can improve the precision and help reduce the cost of validation by humans.
• Domain-specific tools can achieve high precision and recall compared to general-purpose tools, achieving on average 71% precision and 64% recall, but are limited to certain domains
• Rule-based systems and duplicate detection benefited from data enrichment. In our experiments, we achieved an improvement of up to 10% more precision and 7% more recall