Beryl: Motion Capture Analysis
Beryl: Motion Capture Analysis. Department of CIS Senior Design 2015-2016
Cezar Babin [email protected] University of Pennsylvania
Philadelphia, PA
Fahim Abouelfadl [email protected] University of Pennsylvania
Philadelphia, PA
ABSTRACT
The goal of this project was to perform analysis on human motion capture data and provide useful insight for applications necessitating movement correction. We built a library of tools that can parse .bvh files and then visualize, filter, compress, and structure the data so that it can be analyzed with traditional data analysis techniques. Equipped with this custom set of tools, we sought to demonstrate the application of motion capture to the field of sports analytics. We built several classifiers that could determine the success of a free-throw shot in basketball using input parameters consisting of key features of the shooting motion. These tools allowed us to investigate the various attributes that lead to a shot’s success and to assess their importance for the individual performing the shot. The work conducted here serves as a strong base for developing programs that provide customized coaching feedback to individuals attempting to strengthen their free-throw shot, and it can be used as a framework for conducting motion capture analysis in other areas of sports.

1. INTRODUCTION
Motion capture is the process or technique of recording patterns of movement digitally. Traditionally, motion capture has been most closely associated with the entertainment sector, and progress in the field has largely been motivated by a desire to create more immersive experiences for gamers and increasingly realistic computer-generated characters for the film industry. In addition to its prominent role in entertainment, motion capture has a variety of other applications in sports analytics, medicine, and scientific research. The primary goal of this project is to explore the intricacies associated with the collection and analysis of motion capture data.
Many of the tools required for the compilation and basic analysis of motion capture data are consistent across applications. As a result, a large portion of this project was dedicated to building a toolbox of operations that play an essential role in reaching tangible results and identifying interesting patterns across a variety of motion capture use cases. To demonstrate the usefulness of such a toolkit, the remainder of the project was dedicated to applying these tools to the field of sports analytics.
A free throw in basketball is an unimpeded attempt at a basket awarded to a player following a foul or other infringement. When awarded a free throw, a player lines up at the foul line and must remain behind the line for the entirety of the shot. Improving one’s free-throw accuracy is important at every level of basketball, as free throws account for approximately 20% of all points scored in an NBA game. Attempts have been made in the past to quantify the features that contribute to the success of a free throw. There are a multitude of views as to which of these features actually play a role in determining a free throw’s success, and even for a specific feature there is a lack of consensus as to its optimal value. Given these conflicting opinions, our team wondered whether it would be possible to predict whether a shot was made based solely on the shooter’s motion. The aim of this portion of the project was to build a binary classifier that compares the motion associated with a given shot to a database and makes such a prediction. We further analyzed how such a classifier could be used to determine the importance of certain features based on their impact on the classifier’s accuracy.

2. APPROACH
The first step in our project was to parse the .bvh file format and convert the data into a format usable for our analyses. The typical .bvh file begins with a hierarchy that states which joints are connected and how they relate to one another through parent-child relationships. Following this hierarchy is a specification of the total number of frames in the recording and the length of each frame in seconds. The .bvh file then contains information specifying the movement of each joint relative to its parent joint.
To determine the absolute position of the joints, we had to compute the transformation matrices and apply them in a manner that reflected the hierarchy. The .bvh format allows variations in how joints are structured hierarchically; in our case, the data we collected from the online CMU database was structured differently from the data captured with our own hardware. Such differences had to be accounted for in our parsing tool so that others could use our software regardless of the structure of their .bvh files. Because a joint’s motion is recorded relative to its parent, and because some .bvh files carry only two decimal places of precision, values recorded for joints lower in the hierarchy were inherently less accurate: the rounding errors of each parent joint compound down the chain. Our next step was to record a shooter’s movement and the result of each shot in order to identify any patterns that might exist. To perform the motion capture, we used a suit consisting of nine inertial measurement unit (IMU) sensors that were connected to one another and transmitted data wirelessly to a computer. The nine sensors were placed on the left/right wrist, the left/right forearm, the left/right upper arm, the left/right shoulder, and the head.
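The hierarchy traversal described above can be sketched in a few lines. This is a minimal illustration rather than our actual parser: the joint table, the Z-X-Y Euler order, and the use of degrees are assumptions for the sketch, since real .bvh files declare their own channel orders.

```python
import numpy as np

def rot_zxy(z, x, y):
    """Rotation matrix for Euler angles in degrees, applied in Z-X-Y order
    (a common .bvh convention; the actual order is declared per channel)."""
    cz, sz = np.cos(np.radians(z)), np.sin(np.radians(z))
    cx, sx = np.cos(np.radians(x)), np.sin(np.radians(x))
    cy, sy = np.cos(np.radians(y)), np.sin(np.radians(y))
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    return Rz @ Rx @ Ry

def absolute_positions(joints, frame):
    """joints: {name: (parent_or_None, offset)} listed parents-first;
    frame: {name: (z, x, y) rotation in degrees} for one frame.
    Walks the hierarchy root-first, composing each child's offset and
    rotation with its parent's accumulated transform."""
    world_R, world_p = {}, {}
    for name, (parent, offset) in joints.items():
        R = rot_zxy(*frame.get(name, (0.0, 0.0, 0.0)))
        if parent is None:
            world_R[name], world_p[name] = R, np.asarray(offset, float)
        else:
            world_R[name] = world_R[parent] @ R
            world_p[name] = world_p[parent] + world_R[parent] @ np.asarray(offset, float)
    return world_p
```

For example, rotating a root joint 90 degrees about Z swings a child offset of (1, 0, 0) to (0, 1, 0), which is how a parent's rounding error propagates to every descendant.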
Figure 1. A single IMU sensor
Figure 2. The complete system of sensors used
Each of our data capture sessions involved a continuous recording of 10 free-throw shots. We limited each session to 10 shots in order to minimize the effects of positional drift, which can compound over time and produce erroneous data. On the other hand, each session required a fairly cumbersome calibration process, so having fewer shots per session would not have been optimal in terms of time. A set of predefined actions was performed at the start of each recording session to provide a simple way of resolving any inconsistencies and standardizing the data captured across sessions. In order to classify the shots, we needed to extract each individual shot from the session. Segmenting a motion capture session that consists of a repeated movement into individual movements is useful for any repeated action, not just free throws. We therefore wanted a tool that would help automate segmentation for any repeated motion, such as splitting a person’s walk into individual steps or a jump-rope recording into individual jumps. We thus decided to build a cycle detection tool. The inputs to the tool are the number of distinct repeating events (the total number of free throws), the approximate frame length of each event, and a stream of the recorded data containing the repeated motion. The tool outputs a set of tuples, each containing the frame numbers where a repeating event starts and ends. To estimate the length of an event, we manually examined a couple of shots and arrived at an approximation of ~150 frames. Sometimes the subject would dribble the ball before shooting or stretch his arms; in those cases the cycle detection tool would output erroneous results that had to be manually adjusted.
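The cycle detection tool just described can be sketched as follows. This is a simplified stand-in, not our actual implementation: it assumes the repeated event shows up as a peak in some 1-D signal (for example, wrist height), which will not hold for every recording, and like our real tool it can still need manual fix-ups.

```python
import numpy as np

def detect_cycles(signal, n_events, approx_len):
    """Split a 1-D recording into n_events repeating segments.
    Greedy sketch: pick the n_events largest local values that are at
    least approx_len/2 frames apart, then take a window of approx_len
    frames around each picked peak."""
    order = np.argsort(signal)[::-1]          # frame indices, highest value first
    peaks = []
    for i in order:
        if all(abs(int(i) - p) >= approx_len // 2 for p in peaks):
            peaks.append(int(i))
        if len(peaks) == n_events:
            break
    peaks.sort()
    half = approx_len // 2
    return [(max(0, p - half), min(len(signal), p + half)) for p in peaks]
```

Given the session parameters above (10 shots, ~150 frames each), the call would be `detect_cycles(wrist_height, 10, 150)`.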
Naturally, some shots took longer than others, so it was important to restructure the shots of varying lengths into a standardized format that made each shot comparable to all the others. We thus decided to build a keyframe reduction tool. The tool works by averaging out certain data points of a shot in order to give each shot an equal number of intervals. This method has the drawback of abstracting away certain useful features that could be used for classification, such as
the speed of release. We decided that this drawback was acceptable for this portion of our analysis, as we could later determine the impact of omitting these features on our classifier and experiment with different methods of keyframe reduction, eventually keeping the one that returns the best results. Afterwards, we noticed that the shots were not perfectly aligned: even though each shot had an equal number of intervals, the aspect of the shot stored in each interval varied across shots. We thus decided to build an alignment tool. Initially, we aligned the shots by locating the “explosive” movement and then adjusting each shot sequence so that the release of the ball took place at around frame number 70 for every shot (see Appendix A). This decision was based on the hypothesis that the activity in the joints in the time preceding the shot is far more influential on the outcome than the activity after the basketball leaves the subject’s hands. One of the most important aspects we wanted to analyze was how different features affect the outcome of the shot. To do so, we needed to write a method that takes a given feature as input and extracts it by performing a transformation on our joint positional data. The transformed positional data is then passed as a parameter to the classifier. As a result, we simplified and abstracted the feature computation so that we could test different evaluation features on the spot. One feature we decided to extract was the area of a triangle formed by representing a subset of the joints as vertices. This area was computed from the 3-dimensional positions of each relevant joint and extracted across time as a means of evaluating the impact of the distance between joints on the accuracy of a shot.
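A minimal sketch of the bin-averaging idea behind the keyframe reduction tool, assuming each shot is a (frames × channels) array with at least as many frames as requested intervals; our real tool differed in its details.

```python
import numpy as np

def reduce_keyframes(shot, n_intervals):
    """Resample a shot of shape (frames, channels) to a fixed number of
    intervals by averaging the frames that fall into each interval.
    As noted above, this smooths away speed-related detail."""
    frames = shot.shape[0]
    edges = np.linspace(0, frames, n_intervals + 1).astype(int)
    return np.array([shot[a:b].mean(axis=0) for a, b in zip(edges[:-1], edges[1:])])
```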
We were inspired by a paper that used triangular mapping for gait identification [6].
Figure 3. An example of triangular mapping using the elbow, knee, and ankle joints
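The triangular mapping feature reduces three joint trajectories to a single area-over-time signal. A sketch of that computation (half the norm of the cross product of two edge vectors); the function name is illustrative:

```python
import numpy as np

def triangle_area_series(a, b, c):
    """Area over time of the triangle spanned by three joints.
    a, b, c: arrays of shape (frames, 3) holding x,y,z positions.
    The area per frame is half the norm of the cross product of
    two edge vectors of the triangle."""
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
```

For example, joints forming a right triangle with legs 3 and 4 in a given frame yield an area of 6 for that frame.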
After preparing all the data for analysis and building the code so that we could change parameters, we manually mapped each shot to a binary outcome depending on whether the shot was made. For the training of the classifier we used the “take-one-out” approach, where we
would exclude the shot we were trying to classify and use all other shots as a reference. Due to the imprecision of our sensors, and the fact that we only had access to a total of nine of them, we knew that our resulting classification accuracy would be low. To account for these inadequacies, we used principal component analysis (PCA) as a benchmark for our accuracy. More specifically, we took the x,y,z position/velocity matrices across time and classified them after determining their principal components. Upon calculating and incorporating our feature computations, we tested them and compared them to the benchmark. Features that improved our accuracy were considered relevant, and those that decreased it were considered irrelevant. Our goal was to be able to give suggestions to a user as to how to improve their shot by computing the relevance of certain features from the classifier accuracy. A potential user of our software could write their own Python code that computes a certain parameter and incorporate it as an additional criterion for evaluating the free-throw shot.

3. RESULTS & MEASUREMENTS
Our first task in assessing the viability of classifying the success of a given free-throw attempt based on its associated motion was to determine the type of classifier to focus on. We experimented with a variety of classifiers, including a k-nearest neighbor (KNN) classifier, a support vector machine (SVM), a decision tree classifier, a random forest classifier, and others. For the KNN classifier, we used values of k ranging from 1 to 30. Figure 4 shows the accuracy of the various classifiers for three different feature extraction methods: the PCA reduction of the x,y,z positions across time, the root-mean-square (RMS) of the x,y,z positions, and the triangular mapping feature.
The results indicate that the most accurate classifications occurred with the various KNN classifiers and the SVM, so we decided to focus on these two types of classifiers for the remainder of our analysis. We initially chose PCA as the benchmark because we assumed it would give us the most accurate results. We also built our own classifier, denoted NN in the table, which is a modification of the KNN algorithm that performed relatively well. The triangular mapping feature extraction performs particularly well on KNN with k less than 4.
Figure 4. Accuracy of various classifiers when given a feature as input

Classifier   PCA of (x,y,z) across time   RMS           Triangular Mapping
KNN1         0.523529412                  0.470588235   0.617647059
KNN2         0.535294118                  0.529411765   0.652941176
KNN3         0.529411765                  0.488235294   0.629411765
KNN4         0.552941176                  0.564705882   0.635294118
KNN5         0.5                          0.529411765   0.594117647
KNN10        0.588235294                  0.623529412   0.6
KNN15        0.588235294                  0.576470588   0.617647059
KNN20        0.617647059                  0.576470588   0.605882353
KNN25        0.617647059                  0.6           0.570588235
KNN30        0.617647059                  0.617647059   0.594117647
SVM          0.617647059                  0.6           0.617647059
DTREE        0.535294118                  0.552941176   0.582352941
RFOREST      0.558823529                  0.588235294   0.547058824
ADABOOST     0.517647059                  0.511764706   0.547058824
GAUSS        0.394117647                  0.611764706   0.582352941
LINEARD      0.529411765                  0.5           0.564705882
QUADRATIC    0.517647059                  0.582352941   0.576470588
NN           0.564705882                  0.552941176   0.523529412
The triangular mapping algorithm, which computes the area of a triangle determined by right-arm joints in 3D, outperformed PCA and RMS as a feature extraction method for classification. This result demonstrated that it is possible to outperform purely numerical transformations (PCA) with a qualitative metric of what a good shot actually is. We nevertheless kept the PCA of (x,y,z) as the benchmark for any newly implemented feature, even though the other methods gave higher accuracies, because PCA is the industry standard for motion analysis and is therefore more relevant to researchers trying to understand the problem. In addition to performing approximately 5% better on the KNN classifiers and the SVM, triangular mapping
improved the average accuracy of all classifiers and decreased the standard deviation, as can be seen in Figure 6. This reiterated the relevance of our feature extraction method in this context.

Figure 6. Accuracy of various classifiers when using different feature extraction methods

Metric               Benchmark PCA   RMS           Triangular Mapping
Avg. Accuracy        0.550326797     0.559803922   0.592156863
Max. Accuracy        0.617647059     0.623529412   0.652941176
Standard Deviation   0.05552049      0.046108484   0.033941797
KNN + SVM Avg.       0.571657754     0.561497326   0.612299465
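The take-one-out evaluation used throughout these comparisons can be sketched with a plain 1-nearest-neighbor rule. Our actual experiments used several classifier types, so this is only illustrative of the evaluation scheme, not of any particular classifier above.

```python
import numpy as np

def leave_one_out_accuracy(features, labels):
    """Take-one-out evaluation of a 1-nearest-neighbor classifier:
    each shot is classified using every other shot as a reference.
    features: (n_shots, n_features); labels: (n_shots,) of 0/1."""
    X = np.asarray(features, float)
    y = np.asarray(labels)
    correct = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the held-out shot itself
        correct += y[int(np.argmin(d))] == y[i]
    return correct / len(X)
```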
Our next focus was to investigate the extent to which a relationship exists between the motion preceding the shot and its success rate. As a benchmark, we used a 35-frame buffer surrounding the shot: 30 frames for the motion preceding the shot and 5 frames for the motion following it. We compared the accuracy of the KNN classifiers and the SVM classifier as we changed the number of frames representing the motion preceding the shot, rerunning the classifiers while incrementing that number by 10. As demonstrated in Figure 7, the majority of classifiers achieved their highest accuracy when we dedicated 70 frames to the motion preceding the shot. We can therefore conclude that the motion preceding the shot does have an impact on its likelihood of success.
Figure 7. Accuracy of various classifiers when incrementing the number of frames allocated to the motion preceding the shot
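The frame sweep behind Figure 7 can be sketched as a window-growing loop. The function name, the (shots × frames × channels) layout, and the parameter defaults are assumptions for illustration; `evaluate` stands in for any accuracy function, such as the take-one-out scheme.

```python
import numpy as np  # inputs are assumed to be numpy arrays

def sweep_preceding_frames(shots, labels, evaluate, release=70, after=5, step=10):
    """Re-run a classifier while growing the window before the release
    frame, as in Figure 7. shots: (n_shots, frames, channels); each
    window is flattened into one feature vector per shot."""
    results = {}
    for before in range(step, release + 1, step):
        window = shots[:, release - before:release + after, :]
        features = window.reshape(len(shots), -1)
        results[before] = evaluate(features, labels)
    return results
```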
We furthered our investigation of the triangular mapping feature to ensure that the improvement in accuracy attributed to its use was well-founded and not based solely on it outperforming the other feature extractions on KNN and SVM. To verify our previous result, we compared the accuracy of the classifiers when using triangular mapping on the shooting side as a feature against their accuracy when using triangular mapping on the non-shooting side. As shown in Figure 8, the classifiers using the benchmark PCA and the triangular mapping of the non-shooting side yielded relatively similar accuracies, while the classifiers using the triangular mapping of the shooting side yielded noticeably higher accuracies. This leads us to conclude that the triangular mapping of the shooting side does correlate with the accuracy of the shot, while that of the non-shooting side does not.

Figure 8. Accuracy of classifiers when varying the side used for the triangular mapping feature
The final part of our analysis consisted of determining whether having a smaller number of sensors would greatly impact the results of our classifiers. Naturally, one would assume that the sensors placed on the shooting side of the subject would be the most relevant, so we investigated the difference in accuracy when omitting the data collected from sensors placed elsewhere on the body. We first omitted the data from the sensor on the subject’s head; then additionally from the sensor on the non-shooting shoulder; then additionally from the sensor on the non-shooting arm; and finally, in addition to all of those, from the sensor on the non-shooting hand. For all the classifiers in this part, we used the optimal value of 75 buffer frames surrounding the shot: 70 frames for the motion preceding the shot and 5 frames for the motion following it.
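The sensor-omission experiment amounts to dropping blocks of columns and re-running the classifiers. A sketch, assuming each sensor contributes one contiguous block of channels; the mapping from sensors to columns and the function name are illustrative.

```python
import numpy as np  # inputs are assumed to be numpy arrays

def drop_sensors(shots, channels_per_sensor, omit):
    """Remove the columns belonging to the sensors listed in `omit`.
    shots: (n_shots, frames, n_sensors * channels_per_sensor); sensor i
    is assumed to own the channel block [i*cps, (i+1)*cps)."""
    n_sensors = shots.shape[2] // channels_per_sensor
    keep = [c for s in range(n_sensors) if s not in set(omit)
            for c in range(s * channels_per_sensor, (s + 1) * channels_per_sensor)]
    return shots[:, :, keep]
```

The cumulative omissions in Figures 9 and 10 would then be successive calls with a growing `omit` set (head; head and non-shooting shoulder; and so on).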
From Figure 9, we can see that omitting the data collected from the head improved the accuracy of almost all the classifiers, which is intuitive considering that the motion of the head should not correlate with the success of the shot. Omitting the data collected from the sensors on the non-shooting side decreased accuracy across the board, and this decrease compounded as we omitted data from additional sensors on that side. It is worth noting that for the majority of classifiers the decrease in accuracy was rather marginal, as can be seen in Figure 10. This means that a future researcher attempting to conduct similar experiments with fewer sensors should still be able to extract meaningful data for analyzing and classifying the success of free-throw shots.
Figure 9. The impact of joint removal on accuracy
Figure 10. The impact of joint removal on accuracy

Metric               None          Head          Previous & L. Shoulder   Previous & L. Arm   Previous & L. Hand
Avg. Accuracy        0.574509804   0.569934641   0.568627451              0.564052288         0.570261438
Max. Accuracy        0.629411765   0.635294118   0.635294118              0.623529412         0.617647059
Standard Deviation   0.033701072   0.045693378   0.04644934               0.046706679         0.036899599
KNN + SVM Avg.       0.580748663   0.593048128   0.593048128              0.577005348         0.576470588
4. ETHICAL/PRIVACY CONSIDERATIONS
The research conducted and the tools used in this project do not pose ethical or privacy issues, nor do they solve any such existing issues. This would hold true even if the research were conducted on a larger scale involving a greater number of subjects. Any data collection conducted with the intention of compiling a large database for analyzing the biomechanics of a free throw would not contain the identities of the subjects, thus alleviating any privacy concerns that might arise. The project does not collect or deal with any private information, so there is no risk of misusing or misplacing such information.

5. DISCUSSION
One of the most important things we learned throughout this study is that the data capture stage is key to producing meaningful results in motion analysis. More specifically, in studying the correlation between body motion and the outcome of a free throw, each individual shot has to be preceded by some calibration motion and a carefully chosen protocol. Our work would have been greatly simplified if, instead of segmenting data based on the basketball leaving the hands, we had used the predetermined motions as indicators of the start and end of a shot. Additionally, the data collection process would benefit from a camera that records the ball entering the hoop as well as its trajectory. These additional steps would help differentiate between “clean shots” and shots that make it into the hoop only after bouncing off the rim or the backboard. Using the same shot pattern would limit the degrees of freedom of each shot and make the shots more comparable. The subject’s movement should be limited to the shot so that anything that is not a shot can be easily identified. In our scenario, the subject occasionally turned and attempted to fix the sensors, which resulted in ghost shots during segmentation, where the algorithm misinterpreted the hand movement as a shot.
The sensors give us the x,y,z position across time; however, we know that the position is computed by the firmware from the angular acceleration. That angular acceleration is transformed into x,y,z acceleration, which is then doubly integrated to obtain the position. This process obscures several features of the motion due to rounding errors. Velocity and acceleration could have a noticeable impact on the outcome of the shot, especially if the speed of the ball or the angle of release were to be calculated. In our case, we calculated those values through differentiation; however, since obtaining the x,y,z positions involved double integration, the positional coordinates already numerically encoded that information. The reason the x,y,z positions are provided rather than the initially captured angular accelerations is that the motion capture product we used is tailored to animation applications. Motion capture suits designed specifically for biomechanics are built by only a few companies and were prohibitively expensive for our purposes. The specification of the sensors’ components says that each value is a 32-bit floating point number calibrated to a small range of humanly achievable acceleration values. Such values would give an immense advantage over the doubly integrated, two-decimal-precision values we worked with. In our case, we did not track the legs of the subject, which significantly reduced the accuracy of our data,
particularly because leg movement is very important to a basketball free-throw shot [1]. Thus, having more sensors can only improve the accuracies associated with classification, although, as shown above, not all parts of the body are equally significant to the outcome. Some error correction could be performed using kinematic equations, which would detect whether the motion falls outside the range of plausible human motion. For example, during one of our sessions we recorded an anomalous position for the wrist due to the effects of drift; such mistakes could be prevented through the use of kinematic constraints. Having a set of limits on the subject’s degrees of motion would shrink the mapping space, which would improve classification, since the best-performing classifier was found to be the KNN algorithm. The kinematic equations would have to be specifically tailored to the subject’s body and would add an additional layer of complexity. Another limiting factor in our analysis was the relatively small number of samples we collected. With only 200 shots in total, the training set contained roughly 100 shots each for shots made and missed. Even spam filters, which have come a long way in their heuristics for tracking down malicious email, need training sets at least 10 times this size. Not only does the training set have to be large, but there needs to be a way to track down recordings that are anomalies relative to the general sample set. These could be tracked on a per-recording-session basis, and a session with a low average classification accuracy would need to be manually reviewed. We laid the groundwork for analyzing different parameters that influence the shot outcome by using them as features for classification, an approach that proved successful with triangular mapping.
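The effect of the two-decimal precision discussed above is easy to demonstrate: a constant rounding error in acceleration grows quadratically once it is doubly integrated into position. A small illustration with made-up numbers rather than our sensors' actual output:

```python
import numpy as np

def double_integrate(acc, dt):
    """Naively integrate acceleration samples twice (rectangle rule)
    to recover position, mimicking what the sensor firmware does."""
    vel = np.cumsum(acc) * dt
    return np.cumsum(vel) * dt

dt = 0.01                                  # 100 Hz sampling, assumed
t = np.arange(0.0, 10.0, dt)
acc = 0.123456 * np.ones_like(t)           # "true" constant acceleration
drift = double_integrate(np.round(acc, 2), dt) - double_integrate(acc, dt)
# The ~0.0035 rounding error in acceleration grows into a position error
# on the order of 0.5 * 0.0035 * 10^2 after 10 seconds.
```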
However, there is an interesting relationship between the number of classification features incorporated and the weight that would need to be allocated to each of them. This is where the qualitative assessment of a professional shooter and a larger dataset could contribute to improving the accuracy of the analysis. An interesting approach would be to have an artificial intelligence agent try different combinations of features over a vast number of samples in order to find an optimal balance. The valuable part of such an approach is that every player would receive a tailored assessment of their technique depending on their shot biomechanics. However, at the current stage of our analysis, there would be no way for us to instruct the subject on how to improve their shot, or to determine whether the shot needs improvement, even if we were to identify key drivers of the outcome. Our tool is merely capable of determining the key features that a shooter should pay attention to.

6. REFERENCES
1. Jackie L. Hudson. A Biomechanical Analysis by Skill Level of Free Throw Shooting in Basketball. In J. Terauds (Ed.), Biomechanics in Sports (pp. 95-102). Del Mar, CA: Academic Publishers, 1982.
2. Joerg M. Gablonsky and Andrew S. I. D. Lang. Modeling Basketball Free Throws. Society for Industrial and Applied Mathematics Review (Vol. 47, No. 4, pp. 775-798), 2005.
3. Gary Yates. The Development of Multiple Linear Regression Equations to Predict Accuracy in Basketball Jump Shooting. L.E. Holt School of Recreation, Physical and Health Education, Dalhousie University. Halifax, N.S.
4. Jernej Barbic, Alla Safonova, Jia-Yu Pan, Christos Faloutsos, Jessica K. Hodgins, and Nancy S. Pollard. Segmenting Motion Capture Data into Distinct Behaviors. Computer Science Department, Carnegie Mellon University.
5. Dirk Ormoneit, Trevor Hastie, and Michael Black. Functional Analysis of Human Motion Data. Dept. of Statistics, Stanford University. Xerox Palo Alto Research Center, Palo Alto, CA 94304.
6. Sajid Ali, Zhongke Wu, Mingquan Zhou, Abdul Razzaq, Hafeez Ahmad. Human Identification based on Gait Joints Area through Straight Walking View. Department of Electrical Engineering, Institute of Space Technology. Islamabad, Pakistan.
Appendix A. RMS vs. frame count (shots 5, 6, 7, and 9 made). Shots aligned so that the release takes place at frame number 30.