Final Project · 2019. 2. 10. · Title: Microsoft Word - Final Project.docx Created Date:...
B565 Final Project: Predicting Income from Census Data
Karthik Vegi [email protected] [email protected]
1. Goal of the Project
The goal of the project is to create a classification model that identifies individuals with the potential to earn more than $50,000 USD. The response variable is binary: it indicates whether a person earns more than $50K. The predictor variables are census attributes such as age, marital status and education. We would also like to identify the significance of the variables used as predictors. Two classification techniques were used so that their accuracy could be compared. The focus of this project is more on interpretation and less on implementing a classification algorithm from scratch.
2. Statistics of the dataset used
The dataset used is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Adult
No of Observations: 48842
No of variables: 14
Attribute type: categorical / numeric
Predictor variables: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, weekly-hours, native-country
Response variable: income
3. Tools Used
Analysis: R
Visualization: Tableau, R
4. Feature Selection
We exclude the irrelevant variables that will not go into the construction of our classification model. The variables fnlwgt and education-num are assumed to be of little or no significance and are removed from the training and test datasets. Attributes like education, workclass and relationship are strong features that will affect the response variable.
5. Data Visualization
We visualized the data in Tableau to understand the significance of some important attributes.
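The feature selection described in section 4 can be sketched in code. The project's analysis was done in R; the sketch below uses plain Python purely for illustration and is not the authors' code. The function name `load_rows`, the comma-separated layout, and the sample record are assumptions made for the example, and the column names follow the list in section 2.

```python
import csv
import io

# Column names as listed in section 2 (the last column is the response).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "weekly-hours", "native-country", "income",
]
# Variables excluded in section 4.
DROPPED = {"fnlwgt", "education-num"}


def load_rows(text):
    """Parse comma-separated records and drop the excluded variables."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, fields))
        rows.append({k: v for k, v in record.items() if k not in DROPPED})
    return rows


# One illustrative record (assumed layout; the values are only an example).
SAMPLE = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
)

rows = load_rows(SAMPLE)
print(sorted(rows[0]))  # remaining column names, without fnlwgt / education-num
```

Each record keeps 12 predictors plus the response after the two variables are dropped.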
6. Choice of classification technique
§ We chose to implement two classification algorithms so that we can compare the performance
§ Naïve Bayes is simple and works well even with a small training set
§ Decision Tree is fast and scalable to compute, especially in this case where there are a lot of records
(i) Naïve Bayes classifier: Below are the test results for the naïve Bayes classifier
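As an illustration of how a naïve Bayes classifier scores a record, here is a minimal sketch in plain Python. It is not the R code used in the project: it handles categorical attributes only, applies add-one (Laplace) smoothing, and runs on a handful of made-up rows. The names `train_naive_bayes`, `predict` and `TOY_ROWS` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (feature, label) -> value -> count
    values_seen = defaultdict(set)       # feature -> distinct values
    for row in rows:
        label = row[target]
        for feature, value in row.items():
            if feature == target:
                continue
            value_counts[(feature, label)][value] += 1
            values_seen[feature].add(value)
    return class_counts, value_counts, values_seen


def predict(model, record):
    """Return the class with the highest smoothed log posterior."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for feature, value in record.items():
            seen = value_counts[(feature, label)][value]
            k = len(values_seen[feature])
            score += math.log((seen + 1) / (count + k))  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up rows for illustration only; they are not taken from the dataset.
TOY_ROWS = [
    {"education": "Bachelors", "relationship": "Husband", "income": ">50K"},
    {"education": "Masters", "relationship": "Husband", "income": ">50K"},
    {"education": "HS-grad", "relationship": "Own-child", "income": "<=50K"},
    {"education": "HS-grad", "relationship": "Not-in-family", "income": "<=50K"},
    {"education": "Bachelors", "relationship": "Not-in-family", "income": "<=50K"},
]

model = train_naive_bayes(TOY_ROWS, "income")
prediction = predict(model, {"education": "Masters", "relationship": "Husband"})
print(prediction)
```

The "naïve" part is the independence assumption: the per-attribute log probabilities are simply added, which is why the method stays cheap and usable with little training data.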
(ii) Decision Tree classifier:
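A decision tree grows by choosing, at each node, the split that most reduces class impurity. The sketch below shows that selection step with Gini impurity on a few made-up rows. It is an illustration in plain Python, not the project's R code, and it makes two simplifying assumptions: it splits on every distinct value of an attribute (many tree implementations use binary splits instead), and it evaluates only the root node. The names `gini`, `gini_gain`, `best_feature` and `TOY_ROWS` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(rows, feature, target):
    """Impurity decrease from splitting rows on each value of `feature`."""
    parent = gini([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return parent - weighted


def best_feature(rows, features, target):
    """Pick the attribute whose split gives the largest impurity decrease."""
    return max(features, key=lambda f: gini_gain(rows, f, target))


# Made-up rows for illustration only; they are not taken from the dataset.
TOY_ROWS = [
    {"education": "Bachelors", "relationship": "Husband", "income": ">50K"},
    {"education": "Masters", "relationship": "Husband", "income": ">50K"},
    {"education": "HS-grad", "relationship": "Own-child", "income": "<=50K"},
    {"education": "HS-grad", "relationship": "Not-in-family", "income": "<=50K"},
    {"education": "Bachelors", "relationship": "Not-in-family", "income": "<=50K"},
]

root = best_feature(TOY_ROWS, ["education", "relationship"], "income")
print(root)
```

On these toy rows, relationship separates the two classes perfectly, so it is chosen for the root split.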
Error Rate Plot:
Decision Tree:
Decision Tree Statistics:
7. Evaluating Performance
On this dataset, the decision tree performed better than naïve Bayes; the graphs are summarized below:
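A comparison of this kind usually rests on accuracy and the confusion matrix computed on the held-out test set. The sketch below shows both calculations in plain Python. The label lists are made up for illustration and are not the project's results, and the names `confusion_counts`, `accuracy`, `model_a` and `model_b` are hypothetical.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=">50K"):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def accuracy(actual, predicted):
    """Fraction of records whose predicted label matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels for illustration only; not the project's test results.
actual = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
model_a = [">50K", "<=50K", ">50K", "<=50K", "<=50K"]
model_b = [">50K", "<=50K", "<=50K", "<=50K", "<=50K"]

print(accuracy(actual, model_a), confusion_counts(actual, model_a))
print(accuracy(actual, model_b), confusion_counts(actual, model_b))
```

Because most records in a census income dataset fall in the lower income class, the confusion counts are worth reading alongside accuracy: a model can score well overall while missing many of the >50K cases.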
8. Key Insights
§ People with higher degrees tend to earn more than 50k
§ People in a stable relationship usually earned more than 50k
§ The workclass attribute is missing for 6% of the records, and the bulk of the population falls into the private sector
§ People who work close to 40 hours a week earned more than 50k
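Insights like these come from two simple summaries: the share of high earners within each group, and the share of records with a missing value. The sketch below computes both in plain Python on made-up rows. It is not the Tableau or R workflow used in the project, the function names are hypothetical, and the "?" missing-value marker is an assumption about how the data file encodes unknown values.

```python
from collections import defaultdict


def high_income_rate_by(rows, feature, target="income", positive=">50K"):
    """Share of records with the positive label, per value of `feature`."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[feature]] += 1
        if row[target] == positive:
            positives[row[feature]] += 1
    return {value: positives[value] / totals[value] for value in totals}


def missing_share(rows, feature, marker="?"):
    """Fraction of records whose `feature` equals the missing-value marker."""
    return sum(row[feature] == marker for row in rows) / len(rows)


# Made-up rows for illustration only; they are not taken from the dataset.
TOY_ROWS = [
    {"education": "Masters", "workclass": "Private", "income": ">50K"},
    {"education": "Masters", "workclass": "?", "income": ">50K"},
    {"education": "HS-grad", "workclass": "Private", "income": "<=50K"},
    {"education": "HS-grad", "workclass": "Private", "income": ">50K"},
]

rates = high_income_rate_by(TOY_ROWS, "education")
missing = missing_share(TOY_ROWS, "workclass")
print(rates, missing)
```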
9. Challenges and Opportunities
§ With the given time, implementing a classification algorithm from scratch was not feasible
§ With more time, a good next step would be to compare a wider range of classification algorithms and plot their accuracy
10. References
[1] https://www.tableau.com/learn/training
[2] https://archive.ics.uci.edu/ml/datasets/Adult
[3] https://www.r-bloggers.com/classification-trees-using-the-rpart-function/