
Viewpoint Invariant Person Classification in RGB-D Data

    Alisha Rege ([email protected])

    Purpose

    Artificial intelligence can play a key role in healthcare; however, patient confidentiality rules (HIPAA) prevent this information from being processed without safeguards. Here the safeguard comes in the form of RGB-D data, which hides faces and other personally identifying characteristics in video. This project attempts to detect a person from any viewpoint in Stanford Health's RGB-D data. The goal is a detection system that can identify a person from any viewpoint, which would allow nurses and doctors to sense problems such as a patient suddenly falling or not moving for days. A 6-layer CNN classifier performs the classification.

    Dataset

    • Dataset from Stanford’s Lucile Packard Children’s Hospital

    • RGB-D data from 3 different viewpoints

    • Hand-labelled data

    • Previous project created bounding boxes for objects

    • Large variations in viewpoints, object appearance and pose, and object scale

    Convolutional Neural Network Implementation


    6-Layer CNN w/ Dropout [1][2]:

    • Input (56 × 56 × 1)

    • Convolutional layer (5×5 convolution, 32 filters) & ReLU

    • Pooling layer (2× downsampling)

    • Convolutional layer (5×5 convolution, 64 filters) & ReLU

    • Pooling layer (2× downsampling)

    • Fully-connected layer (14×14×64 inputs -> 1024 outputs)

    • Output (1024 inputs -> 2 classes)
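    The layer list above corresponds to a standard LeNet-style network. A minimal Keras sketch of that architecture follows; the 32/64 filter counts, 5×5 kernels, 2× pooling, 1024-unit fully-connected layer, and 2-class output come from the poster, while the dropout rate and placement, "same" padding, optimizer, and loss are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Sketch of the 6-layer CNN described above. Dropout rate/placement,
# padding, optimizer, and loss are assumptions, not from the poster.
model = models.Sequential([
    layers.Conv2D(32, (5, 5), padding="same", activation="relu",
                  input_shape=(56, 56, 1)),       # input: 56 x 56 x 1
    layers.MaxPooling2D((2, 2)),                  # 56 -> 28
    layers.Conv2D(64, (5, 5), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),                  # 28 -> 14
    layers.Flatten(),                             # 14 * 14 * 64 inputs
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),                          # assumed rate
    layers.Dense(2, activation="softmax"),        # person / not person
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```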

    Dataset Preprocessing

    • Translating video into frames

    • Cropping annotations to feed into the network

    • Resizing all images to 56×56×1

    • Batch size: 50

    • Cleaning data
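    A minimal sketch of the crop-and-resize step, assuming OpenCV, an (x, y, w, h) bounding-box annotation format, and [0, 1] scaling (all assumptions, as the poster does not specify them):

```python
import cv2

def preprocess(frame, box):
    """Crop one labelled bounding box from a video frame and resize it
    for the network. box = (x, y, w, h) is an assumed annotation format."""
    x, y, w, h = box
    crop = frame[y:y + h, x:x + w]
    crop = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)  # single channel
    crop = cv2.resize(crop, (56, 56))              # network input size
    return crop.reshape(56, 56, 1) / 255.0         # scaling is an assumption
```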

    Acknowledgement
    Thank you to the Stanford Health Center and the Stanford Artificial Intelligence Vision Lab for the dataset.

    Mean Average Precision (mAP)

    $$\mathrm{mAP} = \frac{1}{\text{num entries}} \sum_{c=1}^{\text{labels}} AP(c)$$

    $$AP(c) = \frac{\sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)}{\sum_{k=1}^{n} \mathrm{rel}(k)}$$

    $$\%\,\mathrm{accuracy} = \left(1 - \frac{Label_{\mathrm{correct}} - Label_{\mathrm{predicted}}}{Label_{\mathrm{correct}}}\right) \times 100$$
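    A small NumPy sketch of the AP/mAP formulas above, assuming each class contributes a ranked list of binary relevance labels (1 = correct detection):

```python
import numpy as np

def average_precision(rel):
    """AP for one class from a ranked list of binary relevance labels
    (1 = correct), per the formula: sum_k P(k)*rel(k) / sum_k rel(k)."""
    rel = np.asarray(rel, dtype=float)
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(rel_per_class):
    """mAP: mean of AP over the per-class relevance lists."""
    return float(np.mean([average_precision(r) for r in rel_per_class]))

print(average_precision([1, 0, 1, 1, 0]))  # ~0.806
```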

    Discussion

    References

    [1] Yann LeCun et al. LeNet-5, convolutional neural networks. http://yann.lecun.com/exdb/lenet/. Retrieved April 22, 2015.

    [2] Martín Abadi, Ashish Agarwal, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

    Baseline

    • SVM w/ HOG descriptors

    • HOG descriptors: slide a window over the image and compute a histogram of gradient orientations

    • Uses NMS (non-maximum suppression): keeps only one major object within a given pixel range

    • Skimage implementation (see the sketch below)
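    A minimal sketch of this baseline using skimage's hog and a linear SVM; the HOG cell/block parameters and the regularization constant are assumptions, and the random arrays stand in for the labelled 56×56 crops:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    # 9 orientations, 8x8-pixel cells, 2x2-cell blocks: assumed parameters
    return np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in images])

# Stand-in data; in the project these are the labelled 56x56 crops.
rng = np.random.default_rng(0)
train_images = rng.random((20, 56, 56))
train_labels = rng.integers(0, 2, 20)    # 1 = person, 0 = not

clf = LinearSVC(C=1.0)                   # C is an assumed hyperparameter
clf.fit(hog_features(train_images), train_labels)
preds = clf.predict(hog_features(train_images))
```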

    Convolutional Layer:

    $$x_{ij}^{\ell} = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} \omega_{ab}\; y_{(i+a)(j+b)}^{\ell-1}$$

    $$y_{ij}^{\ell} = \sigma\!\left(x_{ij}^{\ell}\right) \quad \text{(nonlinearity)}$$

    • $m \times m$ is the convolutional filter size

    • $x_{ij}^{\ell}$ is each selected pixel

    SVM training objective (regularized hinge loss):

    $$\min_{w}\; \frac{1}{2}\,\lVert w \rVert_2^2 \;+\; \lambda \sum_{i} \max\!\left(0,\; 1 - y_i\, w^{\top} x_i\right)$$


    Future Work

    • Try different camera types to distinguish doctor/nurse/etc.

    • Use the information to detect anomalies


    Results

    Classifier                       | Training mAP | Training Accuracy | Testing mAP | Testing Accuracy
    SVM w/ HOG descriptors           | .702         | .609              | .542        | .532
    CNN over cropped images          | .977         | .971              | .693        | .590
    CNN over entire image            | .723         | .624              | .560        | .540
    CNN with double representation   | .985         | .981              | .705        | .650
    CNN with unequal representation  | .985         | .904              | .693        | .608
    CNN with Histogram Equalization  | .988         | .982              | .715        | .667
    CNN on precise cropped images    | .999         | .999              | .912        | .890

    [Figure: "Epoch vs Mean Average Precision" — mAP (0.6–1.0) plotted against epoch (0–800), with training and testing accuracy curves.]

    The maximum value occurs at epoch 410.

    Histogram Equalization:
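    The poster illustrated this step with example images. As a minimal sketch, OpenCV's equalizeHist performs the operation used in the "CNN with Histogram Equalization" experiment; the uint8 scaling of the input frame is an assumption:

```python
import cv2
import numpy as np

# Stand-in 8-bit frame; in the project this is a depth/gray crop
# already scaled to uint8 (an assumption).
frame = (np.random.rand(56, 56) * 255).astype(np.uint8)
equalized = cv2.equalizeHist(frame)  # spreads intensities across 0-255
```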

    Effect of Training Size on Output:

    Viewpoint     | Number of Training Images (Y = person / N = no person) | Testing mAP
    Top-Down      | 544Y / 377N                                            | .71
    Mid-top Wall  | 767Y / 383N                                            | .81
    Hallway       | 374Y / 284N                                            | .64