
Introduction Background Classification problem Techniques Hands-on Q & A Conclusion References Files

Big Data: Data Analysis Boot Camp

Titanic Dataset

Chuck Cartledge, PhD

22 September 2017


Table of contents (1 of 1)

1 Introduction

2 Background

3 Classification problem

4 Techniques

5 Hands-on

6 Q & A

7 Conclusion

8 References

9 Files


What are we going to cover?

We're going to talk about:

R's RMS Titanic dataset.

Other Titanic datasets that contain different data.

Modeling the datasets to see who will live and who will die.


Well settled data

Basic information

Ordered: 17 Sep. 1908

Completed: 2 Apr. 1912

Maiden voyage: 10 Apr. 1912

Sank: 15 Apr. 1912 (after striking an iceberg late on 14 Apr.)


Where she was damaged

Red marks the watertight bulkheads

Green is where the iceberg hit

As the bow settled, water overflowed the bulkheads

Image from [11].


How many died and why?

Sailing capacity (passengers and crew): 3,372

Lifeboat capacity: 1,178

Number of people on board (accounts vary): 2,201

Number of people who survived: 706 - 712 (R thinks 711)

Passengers, crew, builders' men, and others. Image from [8].
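The figures on this slide imply that the lifeboats could seat only a little over half of the people on board, and that fewer than a third survived. A quick check of that arithmetic, using the slide's numbers (Python is used here for illustration only, and the variable names are ours):

```python
# Figures quoted on the slide (accounts vary).
sailing_capacity = 3372
lifeboat_capacity = 1178
people_on_board = 2201
survivors = 711  # the survivor count the slide attributes to R

# Lifeboat seats available per person actually on board.
lifeboat_share = lifeboat_capacity / people_on_board

# Share of the people on board who survived.
survival_rate = survivors / people_on_board

# How full the ship was relative to its sailing capacity.
occupancy = people_on_board / sailing_capacity

print(f"Lifeboat seats per person on board: {lifeboat_share:.1%}")
print(f"Survival rate: {survival_rate:.1%}")
print(f"Occupancy: {occupancy:.1%}")
```

With these inputs the lifeboat share comes to about 54% and the survival rate to about 32%.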


Data from diverse places

Expected first class passengers

Lots of lists of 1st class passengers, and even some of 2nd and 3rd class passengers [10].

Lists of non-passengers (ship's crew and builders' technicians) are more challenging [6].

R has a built-in Titanic dataset: Titanic

Image from [7].


A crew list

A reasonable collection of crew and builders' representatives is available.

Name, job, status (lost or not)

Age, place of birth

Image from [9].


titanic3 dataset from PASWR [14]

Part of the PASWR library

Thomas Cason of UVa has greatly updated and improved the titanic data frame using the Encyclopedia Titanica.

Focuses on and expands the passenger data.

Image from [2].


titanic3 attributes/variables

Name        Explanation
Pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival    Survival (0 = No; 1 = Yes)
name        Name
sex         Sex
age         Age
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare (British pound)
cabin       Cabin
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat        Lifeboat
body        Body Identification Number
home.dest   Home/Destination
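Several of these attributes are stored as codes. The sketch below shows one way to turn the coded fields into readable labels. It assumes each record is a dictionary keyed by the attribute names above, lower-cased; the example record is hypothetical and is not a real passenger.

```python
# Code-to-label mappings taken from the attribute table.
PCLASS = {1: "1st", 2: "2nd", 3: "3rd"}
SURVIVAL = {0: "No", 1: "Yes"}
EMBARKED = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}


def decode(row):
    """Return a copy of row with coded fields replaced by readable labels."""
    decoded = dict(row)
    decoded["pclass"] = PCLASS[row["pclass"]]
    decoded["survival"] = SURVIVAL[row["survival"]]
    # An unknown or missing port becomes None instead of raising an error.
    decoded["embarked"] = EMBARKED.get(row.get("embarked"))
    return decoded


# Hypothetical record for illustration only.
example = {"pclass": 3, "survival": 0, "sex": "male", "age": 30.0,
           "sibsp": 0, "parch": 0, "embarked": "Q"}
print(decode(example))
```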


Bringing the pieces together

Combining:

Passenger data from titanic3

Crew data from Southampton

Not all attributes are present in both datasets

Goal: get a reasonable estimate of who survived, or not, when the RMS Titanic went down.
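Because the passenger and crew sources do not share all attributes, combining them means mapping both onto one common record layout and leaving the gaps explicit. A minimal sketch of that idea follows; the field names, the "role" column, and both sample rows are assumptions made for illustration and do not come from the actual files.

```python
# Hypothetical rows standing in for the two sources.
passengers = [
    {"name": "Passenger A", "age": 29.0, "pclass": 1, "survived": 1},
]
crew = [
    {"name": "Crew B", "age": 32.0, "job": "Fireman", "status": "lost"},
]


def harmonize_passenger(row):
    """Map a passenger row onto the common layout; crew-only fields are None."""
    return {"name": row["name"], "age": row["age"], "role": "passenger",
            "pclass": row["pclass"], "job": None,
            "survived": row["survived"] == 1}


def harmonize_crew(row):
    """Map a crew row onto the common layout; passenger-only fields are None."""
    return {"name": row["name"], "age": row["age"], "role": "crew",
            "pclass": None, "job": row["job"],
            "survived": row["status"] != "lost"}


combined = ([harmonize_passenger(r) for r in passengers]
            + [harmonize_crew(r) for r in crew])
for record in combined:
    print(record)
```

Keeping the missing attributes as None, rather than dropping the columns, leaves the choice of how to handle them to the modeling step.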


What is it?

A definition

Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

Tan, et al. [12]


As a picture

1 A collection of correctly labeled data (training data) is available.

2 The labeled data is processed by a machine learning algorithm (there are many) to create a model (or classifier).

3 Unlabeled (test or new) data is processed by the model and predictions are made.

Image from [5].
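The three steps can be sketched with a deliberately simple learner: it records the most common label for each value of a single attribute and predicts that label for new rows. This is an illustrative stand-in rather than any technique named in the slides, and the training rows are made up, not real passenger records.

```python
from collections import Counter, defaultdict


def train(labeled_rows, feature):
    """Learn the most common label for each value of one feature."""
    counts = defaultdict(Counter)
    for row, label in labeled_rows:
        counts[row[feature]][label] += 1
    # Fall back to the overall most common label for unseen feature values.
    overall = Counter(label for _, label in labeled_rows)
    default = overall.most_common(1)[0][0]
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda row: table.get(row[feature], default)


# Step 1: labeled training data (hypothetical).
training = [
    ({"sex": "female"}, "survived"),
    ({"sex": "female"}, "survived"),
    ({"sex": "female"}, "died"),
    ({"sex": "male"}, "died"),
    ({"sex": "male"}, "died"),
    ({"sex": "male"}, "survived"),
    ({"sex": "male"}, "died"),
]

# Step 2: the learning algorithm turns the labeled data into a model.
model = train(training, "sex")

# Step 3: the model labels new, unlabeled rows.
new_rows = [{"sex": "female"}, {"sex": "male"}]
predictions = [model(row) for row in new_rows]
print(predictions)
```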


Supervised vs. Unsupervised learning

Supervised learning

A training dataset with correct answers (labels) is mined to create a model.

Unsupervised learning

Data are provided with no a priori knowledge of labels or patterns. The goal is to discover labels and patterns.

Semi-supervised learning

Knowledge from one dataset is applied to another dataset to help with mining, analysis, classification, and interpretation.


Supervised vs. Unsupervised learning techniques

With the Titanic dataset, we will be focusing on classification.


Training and testing

Working with data

Supervised learning requires:

Training data: usually about 70% of available data

Testing data: usually about 30% of available data

Training data can also be partitioned further to set aside validation data
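A 70/30 split is usually done by shuffling the rows and cutting the shuffled list. A minimal sketch, assuming the rows fit in memory; the function name, the default fraction, and the fixed seed are our choices for illustration.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into training and testing parts."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 data rows
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))

# The same function can carve a validation set out of the training part.
fit_rows, validation_rows = split_train_test(train_rows, train_fraction=0.8)
print(len(fit_rows), len(validation_rows))
```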


Lots of different things can be done with training data

Use as one monolithic entity

Randomly sample data (with and without replacement)

Divide original training data into training and validation subsets to create multiple models

With multiple models:

Choose the best one, or

Use all and vote on the outcome
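Sampling with replacement and voting across several models can be sketched as follows. The "model" here is deliberately trivial (it always predicts the most common label in its own sample) so that the sampling and voting mechanics stay visible; the function names and training rows are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_label(labels):
    """Return the most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_majority_model(sample):
    """Trivial model: always predict the sample's most common label."""
    label = majority_label([lab for _, lab in sample])
    return lambda row: label


def vote(models, row):
    """Ask every model for a prediction and return the majority answer."""
    return majority_label([m(row) for m in models])


# Hypothetical labeled rows: (attributes, label).
training = [({"id": i}, "died") for i in range(7)]
training += [({"id": i}, "survived") for i in range(7, 10)]

rng = random.Random(0)
samples = [bootstrap_sample(training, rng) for _ in range(5)]
models = [train_majority_model(s) for s in samples]
print(vote(models, {"id": 99}))
```

Sampling without replacement would use `rng.sample(rows, k)` with `k` smaller than the number of rows.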


Types of errors

Sample problem space

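1,000 data points with x and y between -1 and +1

Two classes of data points

\[
\text{color} =
\begin{cases}
\text{red}, & \text{if } 0.5 \le x^2 + y^2 \le 1.0 \\
\text{black}, & \text{otherwise}
\end{cases}
\]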

(See attached file.)
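A dataset like this can be generated in a few lines. The sketch below is not the attached file. It assumes the points are drawn uniformly from the square, and that a point is red when x² + y² lies between 0.5 and 1.0 inclusive and black otherwise.

```python
import random


def color(x, y):
    """Red inside the ring 0.5 <= x^2 + y^2 <= 1.0, black elsewhere."""
    r2 = x * x + y * y
    return "red" if 0.5 <= r2 <= 1.0 else "black"


def make_points(n=1000, seed=1):
    """Draw n points uniformly from [-1, 1] x [-1, 1] and label each one."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        points.append((x, y, color(x, y)))
    return points


points = make_points()
reds = sum(1 for _, _, c in points if c == "red")
print(len(points), reds)
```

Under the uniform assumption the ring covers π/8 of the square, so roughly 39% of the points should be red.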


A decision tree based on sample data

A decision tree to classify the circular data problem.

All nodes are labeled

Each node shows the percentage of the problem space it addresses

Attached file.


Errors in machine learning

Total sample divided into training (70%) and testing (30%) datasets

Training dataset was used to build decision trees (models) of different sizes

Training and testing datasets were classified using each model

Results were compared to the original data

Initially, models under-fitted until around 6 nodes

Finally, models over-fitted beyond 25 nodes

Training and testing errors (plot)
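The comparison step (classify both datasets with each model, then compare against the original labels) comes down to computing an error rate per model and per dataset. The sketch below shows only that measurement. It does not reproduce the tree-size experiment: the three hand-written rules are hypothetical stand-ins for fitted trees, and the data are regenerated under the assumption that red means 0.5 <= x² + y² <= 1.0.

```python
import random


def true_color(x, y):
    """Assumed labeling rule for the circular data problem."""
    return "red" if 0.5 <= x * x + y * y <= 1.0 else "black"


def error_rate(model, rows):
    """Fraction of rows whose predicted label differs from the recorded one."""
    wrong = sum(1 for x, y, label in rows if model(x, y) != label)
    return wrong / len(rows)


# Regenerate 1,000 labeled points and split them 70/30.
rng = random.Random(7)
rows = []
for _ in range(1000):
    x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    rows.append((x, y, true_color(x, y)))
train_rows, test_rows = rows[:700], rows[700:]

# Hand-written stand-ins for models of increasing detail (hypothetical).
models = {
    "always black": lambda x, y: "black",
    "outside inner circle": lambda x, y: "red" if x * x + y * y >= 0.5 else "black",
    "ring": true_color,
}

for name, model in models.items():
    train_err = error_rate(model, train_rows)
    test_err = error_rate(model, test_rows)
    print(f"{name}: train={train_err:.3f} test={test_err:.3f}")
```

Under-fitting shows up as high error on both datasets; over-fitting shows up as training error that keeps falling while testing error rises.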


A collection of decision tree techniques

rpart from the rpart library. Recursive pa