Classification Problem

Class   F1        F2        F3        F4
0        1.4371    0.4416    0.8416   -0.6591
1       -0.0276   -0.8036    0.2391    0.7431
1        0.9239    0.2876   -0.7893   -0.1294
1       -0.3213    0.4670   ...


The objects to be classified are flowers. The two classes are:
1) Was pollinated by a honey-bee
2) Was not pollinated by a honey-bee

The biologist measured four features:
F1: The longitude of the plant
F2: The latitude of the plant
F3: The height of the flower from the ground
F4: The diameter of the flower

[Figures: scatter plots of pairs of features, including F3 vs F4 and F1 vs F2. The remaining plot captions are truncated in the transcript.]

Which algorithms would work well on this dataset?

[Figure: decision tree. The root tests F2 < -1.1, and one outcome is the blue class (1). A second version of the tree adds the test F1 < 0.98. Branches are labeled Y and N.]

Problem: You are given a problem with four features, but you are not told which features (if any) are useful for classification. How can you figure out which are useful?

You could try plotting all pairs of features and inspecting them visually.
For four features that is only 6 combinations.
For forty features it is 780 combinations.

However, the best choice might not be a pair of features. It could be a subset of one, two, three, or more features.
For forty features there are 2^40 (over a trillion) combinations.

[Figure, repeated over several slides: the lattice of all subsets of features {1, 2, 3, 4}, from the single features 1, 2, 3, 4 through the pairs and triples up to 1,2,3,4. The accuracies shown on successive slides are: 50%; then 62%, 61%, 51%, 52%; then 60%, 58%, 97%; then 87%, 88%; then 72%. The last slide of the sequence is labeled "F1 and F2".]

[Figure: a one-nearest-neighbor cross-validation routine. Inputs: the data, and a binary vector indicating which features to use. Output: the cross-validation accuracy.]

How I would solve this problem:
Ignore the search part for now.
Just get the cross-validation working.

Extra Credit Possibility
https://archive.ics.uci.edu/ml/datasets.html
Also do the search on a UCI dataset, and explain why the result is intuitive.
Don't do this unless the rest of your project is perfect!
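The advice above is to get the cross-validation working before worrying about the search. A minimal sketch of that piece follows, assuming leave-one-out cross-validation, squared Euclidean distance, and plain Python lists. The function name leave_one_out_accuracy and the toy data are hypothetical illustrations, not part of the original slides or the flower dataset.

```python
import math


def leave_one_out_accuracy(data, labels, feature_mask):
    """Leave-one-out accuracy of a one-nearest-neighbor classifier.

    data: list of rows (lists of floats)
    labels: list of class labels, one per row
    feature_mask: list of 0/1 flags, one per feature (1 = use the feature)
    """
    used = [j for j, flag in enumerate(feature_mask) if flag]
    correct = 0
    for i, query in enumerate(data):
        best_dist, best_label = math.inf, None
        for k, other in enumerate(data):
            if k == i:
                continue  # hold out the query point itself
            # Squared Euclidean distance over the selected features only.
            dist = sum((query[j] - other[j]) ** 2 for j in used)
            if dist < best_dist:
                best_dist, best_label = dist, labels[k]
        if best_label == labels[i]:
            correct += 1
    return correct / len(data)


# Hypothetical toy data: feature 0 separates the classes, feature 1 misleads.
toy_data = [
    [0.0, 5.0], [0.1, -3.0], [0.2, 4.0],
    [1.0, -3.1], [1.1, 4.9], [1.2, 4.1],
]
toy_labels = [0, 0, 0, 1, 1, 1]

print(leave_one_out_accuracy(toy_data, toy_labels, [1, 0]))  # 1.0
print(leave_one_out_accuracy(toy_data, toy_labels, [0, 1]))  # 0.0
print(leave_one_out_accuracy(toy_data, toy_labels, [1, 1]))  # 0.5

# The counts quoted in the text: pairs of features, and all subsets.
print(math.comb(4, 2), math.comb(40, 2), 2 ** 40)  # 6 780 1099511627776
```

On the toy data, adding the misleading feature lowers the accuracy from 1.0 to 0.5, which is the reason for searching over feature subsets at all. A search procedure would call this function once per candidate binary vector.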
What is a natural grouping among these objects?

[Figure: a set of objects with candidate groupings labeled "School Employees" and "Simpson's Family".]

We can look at the dendrogram to determine the correct number of clusters. In this case, the two highly separated subtrees are highly suggestive of two clusters. (Things are rarely this clear cut, unfortunately.)

Outlier
One potential use of a dendrogram is to detect outliers. The single isolated branch is suggestive of a data point that is very different from all others.

[Figure: dendrogram of primates: Chimpanzee, Pygmy Chimp, Human, Gorilla, Orangutan, Sumatran Orangutan, Gibbon.]

[Figure: language trees with the branches Hellenic, Armenian, and Persian.]

Armenian borrowed so many words from Iranian languages that it was at first considered a branch of the Indo-Iranian languages, and was not recognized as an independent branch of the Indo-European languages for many decades.

Do Trees Make Sense for non-Biological Objects?
The answer is yes. There are increasing theoretical and empirical results to suggest that phylogenetic methods work for cultural artifacts.
Does horizontal transmission invalidate cultural phylogenies? Greenhill, Currie & Gray.
Branching, blending, and the evolution of cultural similarities and differences among human populations. Collard, Shennan, & Tehrani.
"...results show that trees constructed with Bayesian phylogenetic methods are robust to realistic levels of borrowing"

Why would we want to use trees for non-biological things?
Because trees are powerful in biology.
They make predictions: the Pacific Yew produces taxol, which treats some cancers, but it is expensive. Its nearest relative, the European Yew, was also found to produce taxol.
They tell us the order of events: which came first, classic geometric spider webs, or messy cobwebs?
They tell us about:
Homelands: where did it come from?
Dates: when did it happen?
Rates of change
Ancestral states
[Figure (Markus Pudenz): dendrogram of beers with subtrees labeled "Mostly Belgium beers", "Irish beers", and "Mostly Californian beers". Clustered based on crowd-sourced subjective user rankings.]

[Figure: dendrogram of variants of the name Peter.]
Pedro (Portuguese/Spanish), Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian alternative), Petr (Czech), Pyotr (Russian)
Where is Pet'Ka from? Where is Bedros from?

How do we know the dates?
If we can get dates, even upper/lower bounds, for some events, we can interpolate to the rest of the tree.
Irish/Welsh split: must be before 300 AD. Archaic Irish inscriptions date back to the 5th century AD, so divergence must have occurred well before this time.
Gray, R.D. and Atkinson, Q.D., Language tree divergence times support the Anatolian theory of Indo-European origin.

Family Tree of Languages Has Roots in Anatolia, Biologists Say
By Nicholas Wade. Published: August 23, 2012, New York Times.
"Biologists using tools developed for drawing evolutionary family trees say that they have solved a longstanding problem in archaeology: the origin of the Indo-European family of languages. The family includes English and most other European languages, as well as Persian, Hindi and many others. Despite the importance of the languages, specialists have long disagreed about their origin."

Partitional Clustering
Nonhierarchical: each instance is placed in exactly one of K nonoverlapping clusters.
Since only one set of clusters is output, the user normally has to input the desired number of clusters K.

Squared Error Objective Function

Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.

[Figures: K-means Clustering, Steps 1 to 5. Algorithm: k-means. Distance metric: Euclidean distance. Cluster centers: k1, k2, k3.]

Comments on the K-Means Method
Strength
Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t