Can Deep Learning and Egocentric Vision for Visual Lifelogging help us eat better?

Transcript of Can Deep Learning and Egocentric Vision for Visual Lifelogging help us eat better?

Can Deep Learning and Egocentric Vision for Visual Lifelogging help us eat better?

Petia Radeva

www.cvc.uab.es/~petia

Computer Vision at UB (CVUB), Universitat de Barcelona &

Medical Imaging Laboratory, Computer Vision Center

Index

Healthy habits

Deep learning

Automatic food analysis

Egocentric vision

Medical Imaging

What happens outside the body?

Project led by Dr. Maite Garolera of the Consorci Sanitari de Terrassa:

Goal: use episodic images to develop cognitive exercises and tools for memory reinforcement for people with MCI and Alzheimer's disease.

But episodic images serve for more than reinforcing memory:

they reveal the lifestyle of the individual!

Rememory: Life-logging for MCI treatment

Risk factors and chronic diseases

Chronic disease statistics

Obesity in Catalunya

51% of the Catalan population aged 18 to 74 is overweight and 15% is obese; this affects 62% of those without university studies vs. 36% of those with higher education.

51% of the Catalan population aged 18 to 74 suffers from significant excess weight and 15% are obese; this situation affects 62% of those with no studies beyond primary school, compared to 36% of families with university education.

The obesity pandemic

Risk factors for cancers, cardiovascular and metabolic disorders, and a leading cause of premature mortality worldwide.

4.2 million people die of chronic diseases in Europe (such as diabetes or cancer) linked to lack of physical activity and unhealthy diet.

Physical activity can increase lifespan by 1.5-3.7 years.

Which wearables do consumers plan to buy?

21M Fitbit sold in 2015!

It's expected to double by 2018, to 81.7 million users. The Consumer Technology Association (CTA), formerly the Consumer Electronics Association (CEA), surveyed 1,001 US internet users. Source: eMarketer.

Today, automatically measuring physical activity is not a problem.

But what about food and nutrition?

What are we missing in health applications?

But what about food and nutrition? State of the art: Nutritional health apps are based on manual food diaries.

Sparkpeople, LoseIt!, MyFitnessPal, Cronometer, Fatsecret

https://techcrunch.com/2016/09/29/lose-it-launches-snap-it-to-let-users-count-calories-in-food-photos/

How many food categories are there?

Today we are speaking about 200,000 basic food categories. What about automatic food recognition? Is it possible?

Image databases evolution: number of objects per database and number of images per database.

ImageNet & Deep learning

ImageNet

Food datasets

Food256: 25,600 images (100 images/class); 256 classes
Food101: 101,000 images (1,000 images/class); 101 classes
Food101+FoodCAT: 146,392 images (101,000 + 45,392); 231 classes
EgocentricFood: 5,038 images; 9 classes

Food DB: 150,000 images, 231 categories. ImageNet: 1,400,000 images, 1,000 categories. Future Food DB: ????? images, 200,000 categories.

One thing is for sure: if there is a solution, it will very likely need deep learning!

Index

Healthy habits and food analysis

Deep learning

Automatic food analysis

Egocentric vision

Deep learning everywhere

White House wants the nation to get ready for AI

October 2016. http://readwrite.com/2016/10/16/white-house-offers-artificial-intelligence-plan-cl1/

Deep learning: In recent years, some of the most impressive advancements in machine learning have been in the subfield of deep learning, also known as deep network learning. Deep learning uses structures loosely inspired by the human brain, consisting of a set of units (or neurons). Each unit combines a set of input values to produce an output value, which in turn is passed on to other neurons downstream.

The learning pipeline: input X → feature extraction → score function f(x, W) → predicted label y(f) → good enough?

The training process: input X + ground truth → feature extraction → score function f(x, W) → learn f as argmin_f Σ_i Error(y_i(f), y_i).

The learning process: given training data {(x_i, y_i), i = 1, 2, …, n}, solve argmin_f Σ_i Error(y_i(f), y_i), i.e. minimize the expectation over the data distribution of a measure of the quality of the prediction y_i(f) against the ground truth y_i (the error, or loss). The loss function is the negative conditional log-likelihood, with the interpretation that f_i(x) estimates P(Y = i | X): L(f(x), y) = -log f_y(x), where f_i(x) ≥ 0 and Σ_i f_i(x) = 1.
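
A minimal numpy sketch of this loss (an illustration, not code from the talk): raw class scores are normalized with a softmax so they can be read as P(Y = i | X), and the loss is the negative log-probability of the true class.

```python
import numpy as np

def softmax(scores):
    # Shift by the max for numerical stability, then normalize to a distribution.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def nll_loss(scores, true_class):
    # Negative conditional log-likelihood: -log f_y(x),
    # reading the normalized scores f_i(x) as estimates of P(Y = i | X).
    probs = softmax(scores)
    return -np.log(probs[true_class])

scores = np.array([2.0, 0.5, -1.0])      # raw class scores from f(x, W)
print(nll_loss(scores, true_class=0))    # small loss: the correct class already scores highest
```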

The problem of image classification

Each image of M rows by N columns by C channels (C = 3 for color images) can be considered as a vector/point in R^(M×N×C), and vice versa: the dual representation of images as points/vectors. A 32×32×3 image becomes a vector in R^(32×32×3) = R^3072.
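
A minimal numpy illustration of this dual representation (the random array stands in for a real image):

```python
import numpy as np

# A 32x32 RGB image is an array of shape (32, 32, 3);
# flattening it gives a single point in R^3072.
image = np.random.rand(32, 32, 3)   # stand-in for a real image
x = image.reshape(-1)               # shape (3072,)
print(x.shape)

# Reshaping recovers the image, illustrating the dual
# representation of images as points/vectors.
recovered = x.reshape(32, 32, 3)
assert np.array_equal(image, recovered)
```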

Linear classification: given two classes (as points in R^(32×32×3)), how do we learn a hyperplane to separate them?

To find the hyperplane that separates dogs from cats, we need to define: the score function, the loss function, and the optimization process.

Linear classification: how to project the data into the feature space:

f(x) = W x + b

If x is an image of size 32×32×3, then x ∈ R^3072.

The matrix W is 3×3072.

The bias vector b is 3-dimensional.

(Dimensions: x is 3072×1, W is 3×3072, b and f(x) are 3×1.)

Linear classification: how to project the data into the feature space:

f(x) = W x + b

If we have 3 classes, f(x) will give 3 scores.

(Dimensions: x is 3072×1, W is 3×3072, b and f(x) are 3×1.)
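
A small numpy sketch of this score function, assuming 3 classes and a flattened 32×32×3 image (random values stand in for learned weights and real pixels):

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(3072)               # flattened 32x32x3 image
W = np.random.randn(3, 3072) * 0.01    # one row of weights per class
b = np.zeros(3)                        # one bias per class

scores = W @ x + b                     # shape (3,): one score per class
predicted_class = int(np.argmax(scores))
print(scores, predicted_class)
```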

Image classification

Adapted from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Loss function and optimisation

Question: if you were to assign a single number to how unhappy you are with these scores, what would you do?

Question: given the score and the loss function, how do we find the parameters W?

Input x_i → score function f(x_i, W) → loss function L(f(x_i), y_i), compared against the ground truth y_i and used to update W.
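
One standard answer, sketched below with numpy and a finite-difference gradient (a toy illustration under assumed settings, not the optimizer used in the talk): repeatedly move W a small step against the gradient of the loss.

```python
import numpy as np

def loss(W, X, y):
    """Mean negative log-likelihood of a linear softmax classifier."""
    scores = X @ W.T                                  # (n, classes)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(y)), y]).mean()

def numerical_gradient(W, X, y, eps=1e-5):
    """Finite-difference estimate of dLoss/dW (slow but simple)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (loss(W_plus, X, y) - loss(W_minus, X, y)) / (2 * eps)
    return grad

# Toy data: 20 random 10-dimensional "images", 3 classes.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 10)), rng.integers(0, 3, size=20)
W = np.zeros((3, 10))
for step in range(100):                # plain gradient descent
    W -= 0.5 * numerical_gradient(W, X, y)
print("final loss:", loss(W, X, y))
```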

How does a CNN do deep learning?

y = W x

Image x.

First layer: y_1 = Σ_i W_1i x_i, …, y_10 = Σ_i W_10,i x_i.

Second layer: y = W(W x).

Output layer: y = W(W(W x)) …

Fully connected layers: each output unit y_1 = Σ_i W_1i x_i combines all inputs through its own weights W_11, W_12, W_13, …, W_1n.
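
A quick numpy check of a point implicit in these slides: stacked linear layers alone collapse into a single linear map, which is why the nonlinear activations on the next slides are needed.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
W1, W2 = rng.normal(size=(5, 4)), rng.normal(size=(3, 5))

# Two stacked linear layers, y = W2 (W1 x), are the same as one
# linear layer with matrix W2 @ W1; the nonlinear activation
# functions are what give extra layers extra expressive power.
deep = W2 @ (W1 @ x)
shallow = (W2 @ W1) @ x
print(np.allclose(deep, shallow))   # True
```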

Why is a CNN a neural network?

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson. Modern CNNs: ~10M neurons. Human neural networks: ~5B neurons.

Activation functions of NN

From: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Exponential linear units (ELU): all the benefits of ReLU, does not die, outputs closer to zero mean, but the computation requires exp().
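
A minimal numpy sketch of the two activations being compared (alpha = 1 is an assumed default):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # ELU keeps ReLU's behaviour for x > 0 but decays smoothly towards
    # -alpha for x < 0, so units do not "die" and outputs are closer to
    # zero mean, at the cost of an exp() for negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3, 3, 7)
print(relu(x))
print(elu(x))
```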

Why is it convolutional?

Adapted from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

What is new in the Convolutional Neural Network?

Convolutional and Max-pooling layer

Convolutional layer / Max-pool layer

Spatial info / No spatial info
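
A short PyTorch sketch (assuming torch is installed; the layer sizes are illustrative) showing what the two layers do to the shape of a 32×32 RGB input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image (batch of 1)

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

features = conv(x)              # (1, 16, 32, 32): 16 learned filters slide over the image
downsampled = pool(features)    # (1, 16, 16, 16): max-pooling halves the spatial resolution
print(features.shape, downsampled.shape)
```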

Example architecture

The trick is to train the weights such that when the network sees a picture of a truck, the last layer will say "truck". Slide credit: Fei-Fei Li.

Training a CNN

The process of training a CNN consists of learning all of its parameters: the convolutional filters and the weights of the fully connected layers.

Several million parameters!
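
A minimal PyTorch sketch of one training step on a toy CNN (the architecture and random tensors are illustrative stand-ins, not the network from the talk):

```python
import torch
import torch.nn as nn

# A small CNN: conv/pool feature extractor followed by a fully connected classifier.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),   # 10 class scores for 32x32 inputs
)
print(sum(p.numel() for p in model.parameters()), "trainable parameters")

# One optimisation step: compare the scores with the ground-truth labels,
# backpropagate the loss, and update every convolutional filter and
# fully connected weight at once.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)      # stand-in minibatch
labels = torch.randint(0, 10, (8,))     # stand-in ground truth
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```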

1001 benefits of CNNs

Transfer learning: fine-tuning for object recognition
Replace and retrain the classifier on top of the ConvNet
Fine-tune the weights of the pre-trained network by continuing the backpropagation
Feature extraction by CNN
Object detection
Object segmentation
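
A hedged torchvision sketch of the two transfer-learning options listed above, assuming a hypothetical 101-class food recognition task (the model choice and layer names are illustrative):

```python
import torch.nn as nn
from torchvision import models

# Start from a network pre-trained on ImageNet
# (newer torchvision API; older versions use pretrained=True).
model = models.resnet18(weights="IMAGENET1K_V1")

# Option 1: replace and retrain only the classifier on top of the ConvNet.
for param in model.parameters():
    param.requires_grad = False                       # freeze the pre-trained features
model.fc = nn.Linear(model.fc.in_features, 101)       # new head for 101 food classes

# Option 2: fine-tune by continuing backpropagation into (part of) the
# pre-trained network, here its last residual block.
for param in model.layer4.parameters():
    param.requires_grad = True
```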

Image similarity and matching by CNN

Convolutional Neural Networks (4096 Features)
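
For illustration, a torchvision sketch of using the 4,096-dimensional penultimate-layer activations as image descriptors for similarity and matching (VGG-16 and cosine similarity are assumptions here; the talk does not specify the network or the metric):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Pre-trained VGG-16, truncated before its last classification layer,
# so each image is mapped to a 4096-dimensional feature vector.
vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:-1],
)

with torch.no_grad():
    a = extractor(torch.randn(1, 3, 224, 224))   # stand-ins for two real images
    b = extractor(torch.randn(1, 3, 224, 224))

print(a.shape)                            # torch.Size([1, 4096])
print(F.cosine_similarity(a, b).item())   # higher value = more similar images
```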

Index

Healthy habits and food analysis

Deep learning

Automatic food analysis

Egocentric vision

Automatic food analysis

Can we automatically recognize food? To detect and classify every instance of a dish in all of its variants, shapes and positions, and in a large number of images.

The main problems that arise are: the complexity and variability of the data, and the huge amounts of data to analyse.

Automatic food analysis: food detection, food recognition, food environment recognition, eating pattern extraction.

Food localization

Examples of localization and recognition on UECFood256 (top) and EgocentricFood (bottom). Ground truth is shown in green and our method in blue.

Marc Bolaños, Petia Radeva: Simultaneous Food Localization and Recognition, ICPR 2016, Cancún, Mexico. arXiv:1604.07953, 2016.

Pipeline: image input → foodness map extraction → food detection CNN → food recognition CNN → food type recognition (e.g. apple, strawberry).

Food recognition results: top-1 74.7%, top-5 91.6%. State of the art (Bossard, 2014): top-1 56.4%.

Demo

Herruzo, P., Bolaños, M. and Radeva, P. (2016). Can a CNN Recognize Catalan Diet? In Proceedings of the 8th International Conference for Promoting the Application of Mathematics in Technical and Natural Sciences (AMiTaNS).

Food environment classification: bakery, banquet hall, bar, butcher shop, cafeteria, ice cream parlor, kitchen, kitchenette, market, pantry, picnic area, restaurant, restaurant kitchen, restaurant patio, supermarket, candy store, coffee shop, dinette, dining room, food court, galley.

Classification results: 0.92 for food-related vs. non-food-related; 0.68 for the 22 classes of food-related categories.

Towards automatic image description

Bolaños, M., Peris, Á., Casacuberta, F., & Radeva, P. VIBIKNet: Visual Bidirectional Kernelized Network for the VQA Challenge. VQA Challenge, CVPR 2016.

Two main questions: What do we eat?

Automatic food recognition vs. Food diaries

And how do we eat?

Automatic eating pattern extraction: when, where, how, how long, with whom, in which context?

Index

Healthy habits and food analysis

Deep learning

Automatic food analysis

Egocentric vision

Wearable cameras and the life-logging trend

Shipments of wearable computing devices worldwide by category from 2013 to 2015 (in millions)

Life-logging data: what we have:

Wealth of life-logging data

We propose an energy-based approach for motion-based event segmentation of life-logging sequences of low temporal resolution. The segmentation is reached by integrating different kinds of image features and classifiers into a graph-cut framework to ensure consistent treatment of the sequence.

A complete dataset of one day captured with SenseCam contains more than 4,100 images. The choice of device depends on: 1) where it is worn: a camera hung from the neck has the advantage of being considered more unobtrusive for the user; and 2) its temporal resolution: a camera with a low frame rate captures less motion information, but we need to process less data. We chose a SenseCam or Narrative: cameras hung on the neck or pinned to the clothes that capture 2-4 images per minute.

100,000 images per month; 1 TB in 3 years. Or: the hell of life-logging data.

Visual Life-logging data

Events to be extracted from life-logging images

Activities he/she has done, interactions he/she has participated in, events he/she has taken part in, duties he/she has performed, environments and places he/she has visited, etc.

Dimiccoli, M., Bolaños, M., Talavera, E., Aghaei, M., Nikolov, S., and Radeva, P. (2015). SR-Clustering: Semantic Regularized Clustering for Egocentric Photo Streams Segmentation. Computer Vision and Image Understanding (CVIU), in press. Preprint: http://arxiv.org/abs/1512.07143

Egocentric vision progress

Bolaños, M., Dimiccoli, M. & Radeva, P. (2015). Towards Storytelling from Visual Lifelogging: An Overview. Transactions on Human-Machine Systems (THMS), in press. Preprint: http://arxiv.org/abs/1507.06120

Towards healthy habits

Towards visualizing summarized lifestyle data to ease the management of the user's healthy habits (sedentary lifestyle, nutritional activity, etc.).

M. Aghaei, M. Dimiccoli, P. Radeva. Extended Bag-of-Tracklets for Multi-Face Tracking in Egocentric Photo Streams. Computer Vision and Image Understanding, Volume 149, 146-156, 2016. Special Issue on Assistive Computer Vision and Robotics, Elsevier, 2016. doi: 10.1016/j.cviu.2016.02.013

Conclusions

Healthy habits: one of the main health concerns for people, society, and governments.

Deep learning: a technology that is here to stay; a new technological trend that is directly affecting our environment.

Food analysis and recognition: a new challenge with huge potential for applications. We need food databases of millions of images and thousands of categories.

A wide set of problems for food analysis: recognition, segmentation, habit characterization, image and video description, etc. Egocentric vision and lifelogging: a recent trend in computer vision and a largely unexplored technology that holds great potential to help people monitor and describe their behaviour and thus improve their lifestyle.

THANK YOU!