
Vision Recognition
Final Report

Eric Johnson

Ben Baird

David Dzado


Table of Contents

Executive Summary
Review
Challenges
Source Code
Functional Specification Document
    Product Description
    Project Requirements
        Customer Needs
        Interpretation of Customer Needs
    Project Specifications
Concept Generation and Selection
    Introduction
    Body of Facts
    Alternatives
        Image descriptors SIFT or SURF comparison
    Concept Scoring
    Summary and Selection
    Result


Executive Summary

The overall goal for the semester was to build a vision recognition system using OpenCV that takes in live video and determines whether the camera has previously seen the current image. The program was implemented in C++ and divided into three major tasks: storing the vocabulary in a database, identifying features using words found in the database, and running the word infrequency algorithm. Together these three tasks form a program, based on the ‘Bag of Words’ concept, that fairly accurately determines whether the current image captured on video has been seen before.

Review

The database is formed from test images that were collected over the semester. This part of the program takes in an image and runs a feature detection algorithm on it. It was determined that the fastest way to accomplish this was to use SURF features and descriptors. The different features represent words. The words are then mapped to a k-means coordinate system and clustered together; words that fall closest together in the coordinate system are considered close enough to be deemed the same word. This establishes a dictionary of sorts that is used to describe the features in the sampled frames of the video.
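As a rough sketch of this vocabulary-building step (the function name, image paths, and cluster count below are illustrative, and the calls follow the OpenCV 2.x C++ API used in the Source Code section, where the full implementation lives):

#include "cv.h"
#include "highgui.h"
#include <string>
#include <vector>
using namespace cv;

// Build the visual vocabulary: detect SURF features in each training image,
// pool the descriptors, and cluster them with k-means. Each cluster center
// becomes one visual "word".
Mat buildVocabulary(const std::vector<std::string>& paths, int Nw)
{
    SurfFeatureDetector detector;
    SurfDescriptorExtractor extractor;
    Mat allDescriptors; // one row per 64-element SURF descriptor

    for (size_t i = 0; i < paths.size(); i++) {
        Mat image = imread(paths[i], 0); // load as grayscale
        std::vector<KeyPoint> keypoints;
        Mat descriptors;
        detector.detect(image, keypoints);
        extractor.compute(image, keypoints, descriptors);
        allDescriptors.push_back(descriptors); // stands in for the manual copy in BOWdatabase.cpp
    }

    // Cluster the pooled descriptors; descriptors that land near each other
    // are deemed the same word.
    Mat labels, centers;
    kmeans(allDescriptors, Nw, labels,
           TermCriteria(CV_TERMCRIT_ITER + CV_TERMCRIT_EPS, 100, 0.001),
           4, KMEANS_RANDOM_CENTERS, &centers);
    return centers; // Nw x 64 matrix: the "dictionary"
}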

At this point, video is taken in by the webcam on the laptop. A similar feature detection algorithm is performed on the incoming images. These images are not stored, and the features found in them are not added to the database. Instead, each feature is labeled according to how closely it resembles the features in the k-means map: we take the center of each cluster and compute the Mahalanobis distance from that center to the feature in question. Once each image is described in terms of the features listed in the database, this information is stored in vectors and handed to the word infrequency algorithm.
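A condensed sketch of that labeling step (names here are illustrative; the report's Label.cpp, listed in the Source Code section, wraps the same idea in timing and XML loading):

#include "cv.h"
#include <vector>
using namespace cv;

// Assign each descriptor from a new frame to its nearest visual word, where
// "nearest" is measured by the Mahalanobis distance to each cluster center.
// centers is Nw x 64; invCov[j] is the inverted covariance of cluster j,
// both precomputed by the database step.
std::vector<int> labelDescriptors(const Mat& descriptors, const Mat& centers,
                                  const std::vector<Mat>& invCov)
{
    std::vector<int> words(descriptors.rows);
    for (int i = 0; i < descriptors.rows; i++) {
        double best = 1e30;
        int bestWord = 0;
        for (int j = 0; j < centers.rows; j++) {
            // Mahalanobis(x, mu, S^-1) = sqrt((x - mu) S^-1 (x - mu)^T)
            double d = Mahalanobis(descriptors.row(i), centers.row(j), invCov[j]);
            if (d < best) { best = d; bestWord = j; }
        }
        words[i] = bestWord; // this feature is an instance of word bestWord
    }
    return words;
}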

Word infrequency is simply finding which features are most unique in an image. The features most unique to an image best describe that image and are weighted more heavily in the vision recognition process. All of the features found in every image are counted, along with the total number of images. A weighted logarithm function determines how unique a feature is within an image and whether it is unique across the current set of images. These weights are saved in a vector; each vector records, for every word in the database, how heavily that word is represented in the image in question. Because the vectors share this uniform structure, a dot product between two of them indicates whether the images are an exact match, similar, or not a match. The normalized dot product has a maximum of ‘1’ for an exact match and falls toward ‘0’ as the images share fewer weighted words.
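In the notation of the Video Google paper, entry i of an image's document vector is t_i = (n_id / n_d) * log(N / n_i), where n_id is the number of times word i appears in the image, n_d is the total number of words in the image, n_i is the number of images containing word i, and N is the number of images seen so far. A minimal sketch of this weighting and the comparison (names are illustrative; the full version appears in main.cpp in the Source Code section):

#include "cv.h"
#include <math.h>
using namespace cv;

// Weight a word-frequency vector by inverse document frequency. tf holds
// n_id/n_d for each word, ni holds n_i (how many images contain each word),
// and N is the total number of images seen so far.
Mat documentVector(const Mat& tf, const Mat& ni, int N)
{
    Mat v = Mat::zeros(tf.rows, 1, CV_32F);
    for (int i = 0; i < tf.rows; i++) {
        if (ni.at<float>(i, 0) > 0) {
            v.at<float>(i, 0) = tf.at<float>(i, 0) *
                                (float)log((double)N / ni.at<float>(i, 0));
        }
    }
    return v;
}

// Normalized dot product of two document vectors: 1.0 for an exact match,
// near 0 when the images share few weighted words.
double similarity(const Mat& a, const Mat& b)
{
    return a.dot(b) / (sqrt(a.dot(a)) * sqrt(b.dot(b)));
}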

The program is able to fairly accurately identify images it has seen before. Other images with similar features do produce high scores, giving somewhat of a false positive. Looking forward, this can be addressed by establishing thresholds and by tweaking the current program to discriminate against these false positives.


Challenges

One of our biggest challenges throughout the semester was the steep learning curve of OpenCV. Because of time constraints imposed by the structure of the senior project, we jumped straight into coding and learned OpenCV as we went. This introduced bugs into the program that might have been avoided if we had taken more time to learn the OpenCV library and its data structures.


Source Code

/*
 * BOWdatabase.h
 * BOW
 *
 * Created by Eric Johnson on 3/30/11.
 * Copyright 2011 All rights reserved.
 */

#include "highgui.h"
#include "cv.h"

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#include <iostream>
#include <vector>

using namespace std;
using namespace cv;

#define ARRAY_SIZE 1024
#define NUM_IMG 299 // This is the number of images that will be used to create the "dictionary"

void BOWdatabase();

/*
 * BOWdatabase.cpp
 * BOW
 *
 * Created by Eric Johnson on 3/30/11.
 * Copyright 2011 All rights reserved.
 */

////////////////////////////////////////////////////////////////////////////////
// All block comments refer directly to the "Video Google: A Text Retrieval
// Approach to Object Matching in Videos" paper.
////////////////////////////////////////////////////////////////////////////////

#include "BOWdatabase.h"

void BOWdatabase()
{
    char path[NUM_IMG][ARRAY_SIZE]; // initialize image data set
    vector<IplImage*> img(NUM_IMG);

    // Read image file names into the path array. This block of code could be
    // combined with the next set of logic.
    for (int i = 0; i < NUM_IMG; i++) {
        char buffer[33];
        sprintf(buffer, "DatabaseFixed/Image_%03d.JPG", i + 1);
        strcpy(path[i], buffer);
    }


    /*
    for (int i = 0; i < NUM_IMG; i++) {
        char buffer[33];
        sprintf(buffer, "rect-left%03d.pgm", i + 1);
        strcpy(path[i], buffer);
    }
    */

    // first data set
    /*
    strcpy(path[0], "Picture 36.jpg");
    strcpy(path[1], "Picture 37.jpg");
    strcpy(path[2], "Picture 38.jpg");
    strcpy(path[3], "Picture 39.jpg");
    strcpy(path[4], "Picture 40.jpg");
    strcpy(path[5], "Picture 41.jpg");
    */

    // new data set
    //strcpy(path[6], "Picture 42.jpg");
    //strcpy(path[7], "Picture 43.jpg");
    //strcpy(path[8], "Picture 44.jpg");
    //strcpy(path[9], "Picture 45.jpg");
    //strcpy(path[10], "Picture 46.jpg");
    //strcpy(path[11], "Picture 47.jpg");

    /////////////////////////////////////////////////////////////////////////
    //
    // 2. Viewpoint invariant description -- SURF descriptors used with SURF
    //    features. It is possible to replace much of this code with OpenCV's
    //    BOW library, which is completely based on the Video Google paper.
    //
    /////////////////////////////////////////////////////////////////////////

    // Load image data set (grayscale). This block could probably be combined
    // with the block above.
    for (int i = 0; i < NUM_IMG; i++) {
        img[i] = cvLoadImage(path[i], CV_LOAD_IMAGE_GRAYSCALE);
    }

    double t = cvGetTickCount();

    // Detect features. The feature type can be changed to MSER, SURF, SIFT,
    // etc. by changing the detector declaration.
    printf("Finding features\n");
    vector<vector<KeyPoint> > keypoints(NUM_IMG);
    for (int i = 0; i < NUM_IMG; i++) {
        printf("Image %d\n", i);
        SurfFeatureDetector detector;
        detector.detect(img[i], keypoints[i]);
    }

    // Extract SURF descriptors. SIFT descriptors can be used, but Label.cpp
    // will need to be adjusted to work with the 128-element vectors.
    printf("Extracting descriptors\n");
    vector<Mat> descriptors(NUM_IMG);
    for (int i = 0; i < NUM_IMG; i++) {
        SurfDescriptorExtractor extractor;
        extractor.compute(img[i], keypoints[i], descriptors[i]);
    }
    t = cvGetTickCount() - t;
    printf("Descriptors found in %g ms\n", t / ((double)cvGetTickFrequency() * 1000.));


    // Create memory matrix.
    int N = NUM_IMG; // number of images in the data set
    int Nw = 150;    // number of clusters in kmeans; this can be changed to adjust and tune the results
    cv::Mat NoF(N, 1, CV_32SC1);

    // Create matrix that describes the number of features in each document.
    int Nf = 0;
    for (int i = 0; i < N; i++) {
        NoF.row(i) = descriptors[i].rows;
        Nf += descriptors[i].rows;
    }

    printf("Total Number of Pictures %d\n", N);
    printf("Total Number of features %d\n", Nf);

    cv::Mat m_image1(Nf, descriptors[1].cols, CV_32F);
    cv::Mat clusters(Nf, 1, CV_32SC1);
    cv::Mat cluster_center(Nw, descriptors[1].cols, CV_32F);

    // Concatenate descriptor matrices into m_image. This can be improved
    // because of conversion issues between cv::Mat and CvMat.
    CvMat temp_m_image = m_image1;
    int count1 = 0;
    for (int i = 0; i < N; i++) {
        CvMat tempDescriptor = descriptors[i];
        for (int j = 0; j < descriptors[i].rows; j++) {
            for (int k = 0; k < descriptors[i].cols; k++) {
                *((float*)CV_MAT_ELEM_PTR(temp_m_image, count1, k)) =
                    CV_MAT_ELEM(tempDescriptor, float, j, k);
            }
            count1++;
        }
    }
    // Convert back to cv::Mat.
    cv::Mat m_image(&temp_m_image, true);

    /////////////////////////////////////////////////////////////////////////
    //
    // 3. Building a visual vocabulary -- Here we create a database for the
    //    features we find in the images using KMEANS. One matrix is output
    //    with the list of features, and another with the center point of
    //    each feature in KMEANS.
    //
    /////////////////////////////////////////////////////////////////////////

    // Perform kmeans. This is also a point of adjustment, using different
    // parameters for the clusters.
    printf("Performing kmeans\n");
    cv::kmeans(m_image, Nw, clusters, cv::TermCriteria(), 4,
               cv::KMEANS_RANDOM_CENTERS, &cluster_center);

    // Write out the matrices from kmeans. These come out in XML format.
    printf("Writing cluster and center_cluster matrices\n");
    CvMat desc = descriptors[0];
    desc = clusters;
    CvFileStorage* fs = cvOpenFileStorage("clusters.xml", 0, CV_STORAGE_WRITE);
    cvWrite(fs, "kmeans_clusters", &desc);
    cvReleaseFileStorage(&fs);
    desc = cluster_center;
    fs = cvOpenFileStorage("cluster_center.xml", 0, CV_STORAGE_WRITE);
    cvWrite(fs, "kmeans_cluster_centers", &desc);
    cvReleaseFileStorage(&fs);


    // Create separate matrices for each cluster. This is to help calculate
    // the covariance.
    vector<CvMat*> sampleMatrices(Nw);
    for (int i = 0; i < Nw; i++) {
        int count2 = 0;
        for (int j = 0; j < clusters.rows; j++) {
            if (clusters.at<int>(j, 0) == i) {
                count2++;
            }
        }
        sampleMatrices[i] = cvCreateMat(count2, cluster_center.cols, CV_32F);
    }

    for (int i = 0; i < Nw; i++) {
        int count2 = 0;
        for (int j = 0; j < clusters.rows; j++) {
            if (clusters.at<int>(j, 0) == i) {
                for (int k = 0; k < m_image.cols; k++) {
                    *((float*)CV_MAT_ELEM_PTR(*sampleMatrices[i], count2, k)) =
                        CV_MAT_ELEM(temp_m_image, float, j, k);
                }
                count2++;
            }
        }
    }

    /////////////////////////////////////////////////////////////////////////
    //
    // 4. Calculating the covariance matrix. This is for future use by
    //    Label.cpp to calculate the Mahalanobis distance. It is here to save
    //    on calculation time and computing resources.
    //
    /////////////////////////////////////////////////////////////////////////

    // Calculate the inverse covariance of each cluster group. Outputs a
    // vector of Nw matrices, each a square matrix the size of the descriptors.
    printf("Calculating covariance of each cluster\n");
    vector<cv::Mat> covarianceMatrices(Nw);
    for (int i = 0; i < Nw; i++) {
        covarianceMatrices[i] = cv::Mat::eye(m_image.cols, m_image.cols, CV_32F);
        if (sampleMatrices[i]->rows > 0) {
            cv::Mat samples(sampleMatrices[i], true);
            cv::Mat center(cluster_center.row(i));
            calcCovarMatrix(samples, covarianceMatrices[i], center,
                            CV_COVAR_USE_AVG + CV_COVAR_ROWS + CV_COVAR_NORMAL, CV_32F);
        }
    }

    printf("Writing covariance file\n");
    fs = cvOpenFileStorage("covariance_inverted.xml", 0, CV_STORAGE_WRITE);
    for (int i = 0; i < Nw; i++) {
        cv::Mat temp = covarianceMatrices[i].inv(CV_SVD);
        desc = temp;
        char buffer[33];
        sprintf(buffer, "Cluster_%d", i);
        cvWrite(fs, buffer, &desc);
        cvStartNextStream(fs);
    }
    cvReleaseFileStorage(&fs);

    for (int i = 0; i < N; i++) {
        cvReleaseImage(&img[i]);
    }
}


/*
 * Label.h
 * label
 *
 * Created by Ben Baird on 3/24/11.
 * Copyright 2011 BYU. All rights reserved.
 */

#include "highgui.h"
#include "cv.h"

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#include <iostream>
#include <vector>
#include <time.h>
#include <math.h>

using namespace std;
using namespace cv;

#define NUM_WORDS 150 // The number of words in the "dictionary"
#define SURF 1        // These were going to be used to dynamically choose feature detectors
#define SIFT 2
#define STAR 4

cv::Mat Label(IplImage* test_img);

/*
 * Label.cpp
 * label
 *
 * Created by Ben Baird on 3/24/11.
 * Copyright 2011 BYU. All rights reserved.
 */

////////////////////////////////////////////////////////////////////////////////
// This function takes in a new image and finds which features it has, based on
// the features we have seen in other images. A lot of cleanup can be done;
// most of the code is straightforward. One improvement would be to place the
// covariance_inverted.xml file into memory once, since it is read every time
// this function is called. If improvements can be made to the speed of
// calculating the actual distance, make them, since that is the weakest link
// of the function.
////////////////////////////////////////////////////////////////////////////////

#include "Label.h"

cv::Mat Label(IplImage* test_img)
{
    // Retrieve the covariance matrices from an XML file.
    clock_t start_program = clock();
    CvFileStorage* fs_cm =
        cvOpenFileStorage("covariance_inverted.xml", 0, CV_STORAGE_READ); // TAKES FOREVER, only run once
    cout << "Loading time: " << (clock() - start_program)/(double)CLOCKS_PER_SEC << endl;

    vector<CvMat*> cov_mat(NUM_WORDS);
    int count(0);
    clock_t Extract_time = clock();
    for (int i = 0; i < NUM_WORDS; i++) {
        count++;
        char buffer[33];
        sprintf(buffer, "Cluster_%d", i); // bug fix: was "Cluster_%d", 1, which loaded cluster 1's covariance for every word
        cov_mat[i] = (CvMat*)cvReadByName(fs_cm, NULL, buffer);
    }
    cout << "Time to extract matrices: " << (clock() - Extract_time)/(double)CLOCKS_PER_SEC << endl;

    // Retrieve the cluster centers from an XML file.
    clock_t Extract_centers = clock();
    CvFileStorage* fs_cc = cvOpenFileStorage("cluster_center.xml", 0, CV_STORAGE_READ);
    CvMat* cluster_centers = (CvMat*)cvReadByName(fs_cc, NULL, "kmeans_cluster_centers");
    cout << "Time to extract centers: " << (clock() - Extract_centers)/(double)CLOCKS_PER_SEC << endl;
    cvReleaseFileStorage(&fs_cm);
    cvReleaseFileStorage(&fs_cc);

    // Extract SURF features.
    vector<KeyPoint> keypoints;
    SurfFeatureDetector detector; // was MserFeatureDetector; SURF matches the detector used to build the dictionary
    detector.detect(test_img, keypoints);

    cv::Mat descriptors;
    SurfDescriptorExtractor extractor;
    extractor.compute(test_img, keypoints, descriptors);

    int Num_features = descriptors.rows;
    cv::Mat cluster_cent(cluster_centers, true);
    vector<double> distance(NUM_WORDS);
    //vector<int> feature_cluster(Num_features);
    cv::Mat feature_cluster(Num_features, 1, CV_32FC1);

    // cv::Mat t2(64, 1, CV_32FC1);
    // cv::Mat t3(1, 1, CV_32FC1);
    // cv::Mat t4(1, 1, CV_32FC1);

    clock_t Mahalanobis_time = clock();
    for (int i = 0; i < Num_features; i++) {
        for (int j = 0; j < NUM_WORDS; j++) {
            cv::Mat cov_matrix(cov_mat[j], true);

            // Calculate Mahalanobis by hand:
            // cv::Mat t1(1, 64, CV_32FC1);
            // subtract(cluster_cent.row(j), descriptors.row(i), t1);
            // t2 = t1 * cov_matrix;
            // t3 = t2 * t1.t();
            // sqrt(t3.row(0), t4);
            // distance[j] = t4.at<float>(0, 0);

            // Or use the Mahalanobis function.
            distance[j] = Mahalanobis(cluster_cent.row(j), descriptors.row(i), cov_matrix);
        }
        // Label this feature with the nearest visual word.
        int min_loc = min_element(distance.begin(), distance.end()) - distance.begin();
        feature_cluster.row(i) = min_loc;
        //cout << "Location: " << min_loc << " Value: " << distance[min_loc] << endl;
    }
    cout << "Time to calculate Mahalanobis distance: "
         << (clock() - Mahalanobis_time)/(double)CLOCKS_PER_SEC << endl;

    // for (int i = 0; i < Num_features; i++) {
    //     cout << feature_cluster.row(i) << endl;
    // }

    return feature_cluster;
}

/*
 * main.cpp
 * BOW
 *
 * Created by Eric Johnson on 3/30/11.
 * Copyright 2011 All rights reserved.
 */

////////////////////////////////////////////////////////////////////////////////
// All block comments refer directly to the "Video Google: A Text Retrieval
// Approach to Object Matching in Videos" paper.
////////////////////////////////////////////////////////////////////////////////

#include "BOWdatabase.h"
#include "Label.h"

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#include <iostream>
#include <vector>
#include <time.h>
#include <math.h>

#define TOT_DATABASE 6
#define NUM_TEST 6

////////////////////////////////////////////////////////////////////////////////
// inverse_doc_frequency calculates the log(N/ni) expression of the document
// vector described in the Video Google paper. The function outputs a new
// matrix every time it is called. It should be called every time a new image
// is retrieved for comparison.
////////////////////////////////////////////////////////////////////////////////
cv::Mat inverse_doc_frequency(cv::Mat inverse_doc_freq, cv::Mat Ni, int count)
{
    // It probably isn't necessary to read in the clusters XML file any more.
    // This was based on the assumption that log(N/ni) needed information from
    // the "dictionary".
    CvFileStorage* fs = cvOpenFileStorage("clusters.xml", 0, CV_STORAGE_READ);
    CvMat* clusters_temp = (CvMat*)cvReadByName(fs, 0, "kmeans_clusters");
    cvReleaseFileStorage(&fs);

    cv::Mat clusters(clusters_temp, true);

    // This can most likely be ignored; it was used for testing purposes.
    /*
    cv::Mat Ni = cv::Mat::zeros(NUM_WORDS, 1, CV_32F);
    for (int i = 0; i < NUM_WORDS; i++) {
        for (int j = 0; j < clusters.rows; j++) {
            if (clusters.at<int>(j, 0) == i) {
                Ni.row(i) = Ni.row(i) + 1;
            }
        }
    }
    */

    // log(N/ni): words that appear in fewer images get a larger weight.
    for (int i = 0; i < NUM_WORDS; i++) {
        if (Ni.at<float>(i, 0) > 0) {
            double cc = double(count)/double(Ni.at<float>(i, 0));
            inverse_doc_freq.row(i) = log(cc);
        }
    }

    return inverse_doc_freq;
}

////////////////////////////////////////////////////////////////////////////////
// The word_frequency function calculates the nid/nd expression of the document
// vector described in the Video Google paper. It takes in a new document
// vector and outputs the frequency of features in that image/document.
////////////////////////////////////////////////////////////////////////////////
cv::Mat word_frequency(cv::Mat vi)
{
    cv::Mat word_freq = cv::Mat::zeros(NUM_WORDS, 1, CV_32F);
    cv::Mat nid = cv::Mat::zeros(NUM_WORDS, 1, CV_32F);
    double nd = (double)vi.rows;

    // Count how many times each word appears in this image.
    for (int i = 0; i < NUM_WORDS; i++) {
        for (int j = 0; j < vi.rows; j++) {
            if (vi.at<float>(j, 0) == i) {
                nid.row(i) = nid.row(i) + 1;
            }
        }
    }

    // nid/nd: the fraction of this image's words that are word i.
    for (int i = 0; i < NUM_WORDS; i++) {
        if (nid.at<float>(i, 0) > 0) {
            word_freq.row(i) = double(nid.at<float>(i, 0))/nd;
        }
    }

    return word_freq;
}

////////////////////////////////////////////////////////////////////////////////
// This was to be a function that would do the rest of the math when comparing
// images. Similar code is located in the main function. Eventually, the code
// in the main function should be replaced by this function for better coding
// style and portability.
////////////////////////////////////////////////////////////////////////////////
void compare_images(IplImage* img, vector<cv::Mat> word_freq,
                    vector<cv::Mat> vdoc, cv::Mat inverse_doc_freq)
{
    cv::Mat vi = Label(img);
    word_freq.push_back(word_frequency(vi));
    vdoc.push_back(cv::Mat::zeros(NUM_WORDS, 1, CV_32F));

    // Document vectors: tf * idf for each word.
    for (int i = 0; i < NUM_WORDS; i++) {
        for (int j = 0; j < word_freq.size(); j++) {
            if (word_freq[j].at<float>(i, 0) > 0) {
                double ee = double(inverse_doc_freq.at<float>(i, 0)) *
                            double(word_freq[j].at<float>(i, 0));
                vdoc[j].row(i).col(0) = ee;
            }
        }
    }

    // Compare the newest document vector to all previous ones with the
    // normalized dot product.
    cv::Mat vq = vdoc.back();
    double abs_vq = sqrt(vq.dot(vq));
    cv::Mat rank_vec(word_freq.size(), 1, CV_32F);
    printf("Target Image %d\n", 2);
    for (int i = 0; i < vdoc.size(); i++) {
        cv::Mat vd = vdoc[i];
        double abs_vd = sqrt(vd.dot(vd));
        double tt = (vd.dot(vq))/(abs_vd * abs_vq);
        rank_vec.row(i) = tt;
        printf("Cosine angle between Target Image %d and Image %d=%f\n", 2, i + 1, tt);
    }
}

int main(int argc, char** argv)
{
    // This is a crude way of getting this program to do two things. It works,
    // but there is room for improvement. Currently, the BOW program can take
    // one argument (any argument), run the database portion of the code, and
    // then quit. Future plans are to give the program an explicit option
    // (like "-create_database" or "-run_search") to distinguish the running
    // modes.
    if (argc == 2) {
        BOWdatabase();
        exit(0);
    }

    // Initialize test images.
    /*
    IplImage* test_img1;
    test_img1 = cvLoadImage("Picture 39.jpg", CV_LOAD_IMAGE_GRAYSCALE);
    if (!test_img1) {
        printf("Could not load image file");
        exit(0);
    }

    IplImage* test_img2;
    test_img2 = cvLoadImage("Picture 40.jpg", CV_LOAD_IMAGE_GRAYSCALE);
    if (!test_img2) {
        printf("Could not load image file");
        exit(0);
    }
    */

    ////////////////////////////////////////////////////////////////////////////
    //
    // Initialization of the document vectors and their expressions. word_freq
    // refers to the nid/nd expression from the paper; inverse_doc_freq refers
    // to the log(N/ni) expression.
    //
    ////////////////////////////////////////////////////////////////////////////

    vector<cv::Mat> word_freq;


    vector<cv::Mat> vdoc;
    cv::Mat inverse_doc_freq = cv::Mat::zeros(NUM_WORDS, 1, CV_32F);
    cv::Mat Ni = cv::Mat::zeros(NUM_WORDS, 1, CV_32F); // bug fix: was CV_32SC1, but inverse_doc_frequency reads Ni as float
    //inverse_doc_freq = inverse_doc_frequency(inverse_doc_freq);

    // This commented code was used to test the algorithm on a fixed set of
    // images instead of taking images from video. It will stay in case someone
    // would like to use it in the future; if not, it can be removed.
    /*
    vector<IplImage*> img(NUM_TEST);
    for (int i = 0; i < NUM_TEST; i++) {
        char buffer[33];
        sprintf(buffer, "Picture %d.jpg", i + 42);
        img[i] = cvLoadImage(buffer, CV_LOAD_IMAGE_GRAYSCALE);
        cv::Mat vi = Label(img[i]);
        word_freq.push_back(word_frequency(vi));
        vdoc.push_back(cv::Mat::zeros(NUM_WORDS, 1, CV_32F));

        for (int j = 0; j < NUM_WORDS; j++) {
            for (int k = 0; k < vi.rows; k++) {
                if (vi.at<float>(k, 0) == j) {
                    Ni.row(j) = Ni.row(j) + 1;
                    break;
                }
            }
        }

        CvMat desc = vi;
        sprintf(buffer, "Picture_%d.xml", i);
        CvFileStorage* fs = cvOpenFileStorage(buffer, 0, CV_STORAGE_WRITE);
        cvWrite(fs, "imgdata", &desc);
        cvReleaseFileStorage(&fs);
    }

    inverse_doc_freq = inverse_doc_frequency(inverse_doc_freq, Ni);

    for (int i = 0; i < NUM_WORDS; i++) {
        for (int j = 0; j < word_freq.size(); j++) {
            if (word_freq[j].at<float>(i, 0) > 0) {
                double ee = double(inverse_doc_freq.at<float>(i, 0)) *
                            double(word_freq[j].at<float>(i, 0));
                vdoc[j].row(i).col(0) = ee;
            }
        }
    }

    cv::Mat vq = vdoc[3];
    double abs_vq = sqrt(vq.dot(vq));
    cv::Mat rank_vec(word_freq.size(), 1, CV_32F);
    //printf("Target Image %d\n", count2);
    for (int i = 0; i < vdoc.size(); i++) {
        cv::Mat vd = vdoc[i];
        double abs_vd = sqrt(vd.dot(vd));
        double tt = (vd.dot(vq))/(abs_vd * abs_vq);
        rank_vec.row(i) = tt;
        printf("Cosine angle between Target Image and Image %d=%f\n", i + 1, tt);
    }
    */

    ////////////////////////////////////////////////////////////////////////////
    // This is where most of the heavy lifting happens in this file, and where
    // most of the improvements to the current state of this code should be
    // made. Here we take in live video from a camera, find the document
    // vectors, and then compare those vectors to see if their dot products
    // match.
    // Ideas for improvements:
    // 1. Label.cpp reads the covariance_inverted.xml file every time it is
    //    called. Perhaps it can be read once into memory; an inverse
    //    covariance global variable would be needed.
    // 2. As mentioned above, the image-comparison section of this code should
    //    be placed into a function to help clean up the code.
    // 3. Some memory management might be needed down the road, since that
    //    didn't really fit into our goals. We worked on getting this to work,
    //    but we feel there might be some memory problems down the road.
    // 4. The exit strategy of the program is a bit clumsy and could probably
    //    be made more elegant.
    ////////////////////////////////////////////////////////////////////////////

cvNamedWindow( "Video", CV_WINDOW_AUTOSIZE );CvCapture* capture = cvCreateCameraCapture(0);//CvCapture* capture = cvCreateFileCapture( "BOWtest.avi" );IplImage* frame;int count2 = 0;while(1) {

int count = 0;while(1) {

frame = cvQueryFrame( capture );if( !frame ){

char c = cvWaitKey();if (c==27) {

exit(0);}

}; //cvSetImageROI(frame, cvRect(100,100,500,500));//4

// IplImage* img2 = cvCreateImage(cvGetSize(frame), frame->depth, frame->nChannels);// cvCopy(frame,img2,NULL);// cvCopy(img2,frame,NULL);

cvShowImage( "Video", frame ); char c = cvWaitKey(25);if( c == 27 ) return 0;count++;if (count == 90) break;

}count2++;cv::Mat vi = Label(frame);char buffer [33];sprintf(buffer,"Image_%d",count2);cvNamedWindow( buffer, CV_WINDOW_AUTOSIZE );cvShowImage(buffer, frame);

word_freq.push_back(word_frequency(vi));

for (int j = 0; j<NUM_WORDS; j++) {for (int k = 0; k<vi.rows; k++) {

if (vi.at<float>(k,0)==j) {Ni.row(j)=Ni.row(j)+1;break;

}}

}

inverse_doc_freq = inverse_doc_frequency(inverse_doc_freq,Ni,count2);

vdoc.push_back(cv::Mat::zeros(NUM_WORDS,1,CV_32F));

Page 17: Executive Summary - rwbclasses.groups.et.byu.netrwbclasses.groups.et.byu.net/...challenge:vision_recognit…  · Web viewWord infrequency is simply finding which features are most

for (int i = 0; i<NUM_WORDS; i++) {for (int j = 0; j<word_freq.size(); j++) {

if (word_freq[j].at<float>(i,0)>0) {double ee =

double(inverse_doc_freq.at<float>(i,0))*double(word_freq[j].at<float>(i,0));vdoc[j].row(i).col(0)=ee;

}}

}

cv::Mat vq=vdoc.back();double abs_vq=sqrt(vq.dot(vq));cv::Mat rank_vec(word_freq.size(),1,CV_32F);printf("Target Image %d\n",count2);for (int i=0;i<vdoc.size();i++){

cv::Mat vd=vdoc[i];double abs_vd=sqrt(vd.dot(vd));double tt=(vd.dot(vq))/(abs_vd*abs_vq);rank_vec.row(i)=tt;printf("Cosine angle between Target Image %d and Image %d=%f\

n",count2,i+1,tt);

}}

}


Functional Specification Document

Product Description

An unmanned flying vehicle needs sensors and other input to help it understand where it is, where it has been, and where it can go. Our project is to develop a “Bag of Words” method of place recognition, using images captured by the on-board camera. The specifics of this method are described in the following sections.

Project Requirements

Customer Needs

As part of the HexaKopter group, we have been tasked to produce a vision recognition unit that is portable enough to be mounted on the quad-rotor hardware being used in the aforementioned competitions. The overall task of the vision recognition module is to tell whether the HexaKopter has already been at its current location. This is known as loop closure, and it is a key element of self-navigation.

# Customer Statement

1 Be able to determine whether an object is on the list of words, and also be able to distinguish whether the object is not on the list

2 Be able to process an image in real time

3 The program should be compatible with the Bumblebee camera

4 The program should be implemented in OpenCV to be backwards compatible with the rest of the HexaKopter project.


Interpretation of Customer Needs

As our customer needs are few, interpretation is fairly sparse. We are lucky to have a very technical “customer” with very specific needs. The following is a breakdown of what we determined to be the full requirements.

# Interpretation of Customer Needs

1 Be able to accurately identify all of the words on the list and prevent false positives

2 The program should be able to process each image in a fraction of a second

3 The program should be compatible with the hardware and other software of the HexaKopter Project

4 Write a program in C++, implementing OpenCV that recognizes locations based on the “Bag of Words” method

Project Specifications

Our needs are very specific. That being said, our specifications are quite broad due to our inexperience in what it might take to write this much code. As we continue to develop, we will most likely add specifications as we learn the costs and benefits of working with OpenCV.

#   Metric                   Units      Marginal Value   Ideal Value
1   Data set of images       images     150              200
2   Data set of features     “words”    500              800
3   Image processing speed   ms/image   <15              <10

Note: Some of these goals were unrealistic. For example, image processing can be very processing-intensive, making it very difficult to process an image in 15 ms.


Concept Generation and Selection

Introduction

The vision recognition group was given the task of implementing algorithms on the quad-rotor helicopter that would allow the robot to “recognize” whether it has been to a location before. The task has very specific requirements that are explained below. It is to be completed as soon as possible, for it is a main feature of the quad rotor. This document shows how we selected the different elements to complete our task.

Body of Facts

The customer requires that we implement our algorithms using specific criteria. They are to be written in C++ using the OpenCV library, and the code should follow a specific outline discussed in a published whitepaper called “Video Google.” This paper covers the task of “loop closure” using vision recognition: comparing the most recently taken image to previous images to determine whether the robot has been in the room before. Using vision recognition for localization is necessary because the HexaKopter will have no access to external guidance systems, i.e., GPS. The customer has provided us with a desired camera, the Bumblebee. It comes with proprietary software that we will need to interface with in order to have a fully functional vision recognition system.

Alternatives

Concept generation and selection were needed to decide which type of descriptor will be used to compare images taken by the HexaKopter's camera. There are several types of descriptors with implementations in OpenCV, each having different advantages. We determined through research that the two most commonly used descriptors are SURF and SIFT. We find these types promising because they are scale and rotation invariant, making them ideal for vision recognition.

Image descriptors SIFT or SURF comparison

SURF:
- is faster
- has been implemented in smart-phone augmented-reality applications
- is considered more robust by its authors
- has increased repeatability

SIFT:
- has more widespread use
- is the same algorithm used in the Video Google paper that we are trying to replicate


Concept Scoring

Concept Scoring Matrix

Criterion                Weight   SURF Score   SURF Weighted   SIFT Score   SIFT Weighted
Speed                    40       8            320             6            240
Robustness               25       8            200             7            175
Ease of Implementation   10       5            50              8            80
Documentation            25       6            150             7            175
Total                    100                   720                          670

Summary and Selection

We are still in the process of testing the SURF and SIFT algorithms, and therefore some of these scores are based solely upon what we have read of others' work. The concept scoring chart may change as we continue testing each algorithm.

Speed — The speed category was given the highest weight because it is very important that this program run in real time. SURF was given a higher score for speed than SIFT because the papers that we have read suggest that SURF is a faster algorithm.

Robustness — Robustness was given its weight because the program must recognize images that have been transformed or rotated. The algorithm that is the most robust will increase the likelihood that loop closure occurs correctly. SURF is also more robust than SIFT because it withstands image transformations more readily.

Ease of Implementation — The easier the algorithm is to implement, the sooner we will be able to move on to other parts of the project. This was given a low weight because we consider the quality of our results to be much more important. The SIFT algorithm will probably be the easiest to implement, because it is more widely used.

Documentation — Documentation was given a higher weight than ease of implementation because good documentation will ensure that we can successfully implement the chosen algorithm whether or not it is difficult. The SIFT algorithm received a higher score on documentation than SURF simply because it has more documentation available online.

Result

After voting, we decided to use SURF descriptors in our program. While SURF is less common, its advantages in speed and robustness are critical to successful loop closure. The increased robustness of SURF descriptors will reduce the chance of a false positive occurring, which would be bad for the HexaKopter's localization and mapping. We will continue to test SIFT, as we are trying to follow the Video Google whitepaper closely for best results. Testing SIFT will also confirm whether SURF is indeed the better descriptor for our needs.