G = ( (u 1,n) r e u 1,n M n,..., (u L,n) r e u L,n M n,..., (v,m 1 ) r e v,m 1 U v,..., v,m L...

G = ( (u1,n)r eu1,nMn,..., (uL,n)r euL,nMn, ..., (v,m1)r ev,m1Uv, ..., v,mLRe(v,mL)Uv )

d/dt( (Uu+tGu)(Mm+tGm) )0 = dsse(t)/dt= 2( (Uu+tGu)(Mm+tGm) - ru,m )1(u,m)r

d/dt( t2GuGm + t(GmUu+GuMm) + UuMm))=02(UuMm-rum+t(GmUu+GuMm)+t2GuGm)(u,m)r

g=GmGu

h=GmUu+GuMm

e=UuMm-rum

( 2tGuGm + GmUu+GuMm ) = 0(um)r ( UuMm-ru,m+t(GmUu+GuMm)+t2GuGm)

(e+th+t2g)(2tg+h) = (um)r (2teg+2t2hg+2t3g2+eh+th2+t2gh) = 0(um)r

g2 +t32(um)r t23(um)r eh = 0 hg + t(um)r (2eg+h2) + (um)r

a b c d

sse(t) = (u,m)r (pu,m(t) - ru,m)2 = (u,m)r ( (Uu+tGu)(Mm+tGm) - ru,m )2

2(UuMm+t2GuGm+tGmUu+tGuMm-ru,m)

F=(Uu,Mm) feature_vector, pu,m=UuMm prediction, ru,m rating, eu,m=ru,m-pu,m error, G Gradient(sse)

sse(F)=(u,m)r

(UuMm - ru,m)2 e.g., if two r=ratings, ru,m=2 and rv,n=4, laying things out as:

F(t)≡( Uu+tGu, Mm+tGm ) pu,m(t)≡( (Uu+tGu)o(Mm+tGm) )

Given a relationship matrix, r, between 2 entities, u, m. (e.g., the Netflix "Ratings" matrix, rum = rating user, u, gives to movie, m.The SVD Recommender uses the ratings in r to train 2 smaller matrixes, a user-feature matrix, U, and a feature-movie matrix, M. Once U and M are trained, SVD quickly predicts the rating u would give to m as the dot product, pum=UuoMm. Starting with a few features, Uu vector = extents to which a user "likes" feature; Mm = the level at which each feature characterizes m. SVD trains U and M using gradient descent minimization of sse (sum square error =

RATINGS pum-rum)2 ).

Solving at3+bt2+ct+d=0, t=

or t = ( q+[q2+(r-p2)3])1/3 + ( q-[q2+(r-p2)3])1/3 + p

q = p3+(bc-3ad)/(6a2)p = -b/(3a) r = c/(3a)

[(-b3/27a3+bc/6a2-d/2a) +{(-b3/27a3+bc/6a2-d/2a)2+(c/3a-b2/9a2)3}1/2]1/3

[(-b3/27a3+bc/6a2-d/2a) -{(-b3/27a3+bc/6a2-d/2a)2+(c/3a-b2/9a2)3}1/2]1/3 - b/3a

r m nu 2 1 Mm

v 4 1 Mn

1 1 Uu Uv F

p m nu 1 v 1

e m nu 1 v 3

r__ p__ e____ ee_ g___ h__ a 164 p 0.34142 1 1 -1 1 1 -1 -2 b -168 q -0.004 4 1 1 -3 9 9 -3 -6 c -16 r -0.0321 1 F -1-3 GR d 20 0.0365 t desc 40628

t root 164 -161 -21.9 1 -0.036 164 -168 -16 20 164 -6.0 -161 -16 -161 5.927 -21.9 20 -21.9 0.802 19.19

Something is wrong with the cubic root formula!

There are 3 zeros of the derivative function: -0.337, 0.361 and 1.000. Function values are 0.047, 18.47 and 4.0. See attachment and the chart. Blue chart is the derivative and the red chart is the function.

Given: 0 = ax³ + bx² + cx + dDivide through by a 0 = x³ + ex² + fx + g e = b/a f = c/a g = d/aStep 2: Do horizontal shift, x = z − e/3 which removes square term, leaving 0 = z³ + pz + q where z = x + e/3 p = f − e² / 3 q = 2e³/27 − ef/3 + gStep 3: Introduce u and t: u t = (p/3)³ , u − t = q so t = −½q ± ½√(q² + 4p³/27) u = ½q ± ½√(q² + 4p³/27) ⋅0 = z³ + pz + q has a root at: z = (t) − (u) 'u' and 't' both have a '±' sign. It doesn't matter which you pick to plug into ∛ ∛the above equation... as long as you pick the same sign for each.

a 164 p 0.3414 -0.007 -0.001b -168 q -0.004 -0.198 -0.106c -16 r -0.032d 20

x z=x+e/3 q^2+4p^3/27 1 1 -0.013e -1.02f -0.09 p -0.447 t ERRg 0.121 q 0.0090 u

1 r____ p__ e_________ ee_____ g_______ h____2 2 1 1 -1 1 1 -1 -23 4 1 1 -3 9 9 -3 -64 1 1 fv -1 -3 GR

r_______ p_______ e_________ ee______ g_____ h_______ a 164 2 1 1 1 1 1 1 2 1 b 168 4 1 1 3 9 9 3 6 1 c 60 1 1 fv 0 t 5 mset 5 mse 1 3 GR 1 1 fvt d 8

r_______ p_______ e_________ ee______ 2 1 1 1 1 4 1 1 3 9 1 1 fv 0.1 t 2.980 mset 5 mse

r_______ p_______ e_________ ee______ 2 1 1 1 1 4 1 1 3 9 1 1 fv 0.2 t 1.193 mset 5 mse

r_______ p_______ e_________ ee_____ 2 1 1 1 1 4 1 1 3 9 1 1 fv 0.3 t 0.124 mset 5 mse

r_______ p_______ e_________ ee_____ 2 1 1 1 1 4 1 1 3 9 1 1 fv 0.4 t 0.353 mset 5 mser_______ p________ e_________ ee_____ 2 1 1 1 1 4 1 1 3 9 1 1 fv 0.5 t 2.562 mset 5 mse

r_______ p________ e_________ ee_____ 2 1 1 1 1 4 1 1 3 9 1 1 fv 0.6 t 7.529 mset 5 mser_______ p________ e_________ ee_____ 2 1 1 1 1 4 1 1 3 9 1 1 fv 0.7 t 16.13 mset 5 mse

r_______ p_________ e_________ ee______ g_____ h_______ 2 1 1 1 1 1 1 2 1.3 4 1 1 3 9 9 3 6 2.0 1 1 fv 0.338 t 0.023 mset 5 mse 1 3 GR 1.3 2.0 fvt

r_______ p_________ e_________ ee________ g_____ h_______ 2 1.3 1.7 0.209 0.043 0. 0. 0.7 1.3 4 2.0 4.056 -0. 0.0 0. -0 -0. 2.01.3 2.0 fv 0.1 t 0.009 mset 0.023 mse 0. -0 GR 1.3 2.0 fvt

r_______ p_________ e_________ 2 1.3 1.7 0.209 4 2.0 4.056 -0.1.3 2.0 fv 0.2 t 0.002 mset

r_______ p_________ e________ 2 1.3 1.7 0.209 4 2.0 4.056 -0.1.3 2.0 fv 0.3 t 0.003 mse

r_______ p_________ e_________ ee________ g_____ h_______ 2 1.3 1.7 0.209 0.043 0. 0. 0.7 1.4 4 2.0 4.056 -0. 0.0 0. -0 -0. 1.91.3 2.0 fv 0.23 t 0.001 mset 0.023 mse 0. -0 GR 1.4 1.9 fvt

Since calculus isn't working (to find the min mse along f(t)=f+tG ), will this type of binary search be efficient enough? Maybe so! In all dimensions, the mse(t) equation is quartic (dimension=4) so The general shape is as below (where any subset of the local extremes can coelese).

r_____ p_________ e_____ ____ ee____ g_____ h______________1 3 1 1 1 0 2 0 4 6 4 2 5 4 1.24 5 1 1 1 3 4 9 ** **7 7 10 8 1.71 1 1 fv 0.1 t 3.9362 mset 7 mse 3 1 2 GR 1.3 1.1 1.2 fvt

r_____ p_________ e_____ ____1 3 1 1 1 0 24 5 1 1 1 3 41 1 1 fv 0.2 t 1.7848 mset

r_____ p_________ e_____ ____1 3 1 1 1 0 24 5 1 1 1 3 41 1 1 fv 0.3 t 2.2170 mset

r_____ p_________ e_____ ____1 3 1 1 1 0 24 5 1 1 1 3 41 1 1 fv 0.25 t 1.5761 mset

r_____ p_________ e_____ ____1 3 1 1 1 0 24 5 1 1 1 3 41 1 1 fv 0.225 t 1.5878 mset

r_____ p__________ e__________ ee____ g_____ h______________1 3 1 1 1 0 2 0 4 6 4 2 5 4 1.474 5 1 1 1 3 4 9 ** **7 7 10 8 2.661 1 1 fv 0.2375 t 1.5571 mset 7 mse 3 1 2 GR 1.71 1.23 1.47 fvt

rr______________ p________________ e_____ ____ 1 3 1.47 2.52 2.17 -1.525 4 5 2.66 4.55 3.2948 -0.559 1.701.71 1.23 1.47 fv 0.1 t 0.5542 mset

r______________ p________________ e_____ ____ 1 3 1.47 2.52 2.17 -1.525 4 5 2.66 4.55 3.2948 -0.559 1.701.71 1.23 1.47 fv 0.2 t 3.9363 mset

r______________ p________________ e_____ ____ 1 3 1.47 2.52 2.17 -1.525 4 5 2.66 4.55 3.2948 -0.559 1.701.71 1.23 1.47 fv 0.05 t 0.5582 mset

rr______________ p________________ e_____ ____ 1 3 1.47 2.52 2.17 -1.525 4 5 2.66 4.55 3.2948 -0.559 1.701.71 1.23 1.47 fv 0.075 t 0.4258 mset

Linear and pseudo-binary line search

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z AAAB AC AD1 \a=Z22 3 3 5 2 5 3 3 /rvnfv~fv~{goto}L~{edit}+.005~/XImse<omse-.00001~/xg\a~3 2 5 1 2 3 5 3 .001~{goto}se~/rvfv~{end}{down}{down}~4 3 3 3 5 5 2 /xg\a~5 5 3 4 3 6 2 1 2 1 7 4 1 1 4 38 4 3 2 5 39 1 4 5 3 2 LRATE omse10 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3 1 3 3 2 0.001 0.1952090 fv

A22: +A2-A$10*$U2 /* error for u=a, m=1 */A30: +A10+$L*(A$22*$U$2+A$24*$U$4+A$26*$U$6+A$29*$U$9) /* updates f(u=a) */U29: +U9+$L*(($A29*$A$30+$K29*$K$30+$N29*$N$30+$P29*$P$30)/4) /* updates f(m=8 */AB30: +U29 /* copies f(m=8) feature update in the new feature vector, nfv */W22: @COUNT(A22..T22) /* counts the number of actual ratings (users) for m=1 */X22: [W3] @SUM(W22..W29) /*adds ratings counts for all 8 movies = training count*/AD30: [W9] @SUM(SE)/X22 /* averages se's giving the mse */

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z AAAB AC AD21 working error and new feature vector (nfv)22 0 0 0 **0 ** 3 6 3523 0 0 ** 0 ** 0 3 624 0 0 0 ** 0 2 525 0 ** ** 3 326 0 0 **1 327 **** ** 0 3 428 ** 1 0 ** 3 429 ** ** 0 0 2 4 L mse30 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3. 1 3 3 2 0.001 0.1952063 nfv

A52: +A22^2 /*squares all the individual erros */

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z AAAB AC AD52 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 square errors53 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 054 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 056 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 057 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 058 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 059 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 SE60 ---------------------------------------------------------------61 0 1 0 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 1 1 3 3 3 3. 2 2 3 2 0.125 0.22507362 0 1 0 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 1 1 3 3 3 3. 1 2 3 2 0.141 0.20042463 0 1 0 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 1 1 3 3 3 3. 1 3 3 2 0.151 0.19756464 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3. 1 3 3 2 0.151 0.19616565 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3. 1 3 3 2 0.151 0.19522266 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3. 1 3 3 2 0.001 0.19523267 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3. 1 3 3 2 0.001 0.19522868 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3. 1 3 3 2 0.001 0.19522469 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3. 1 3 3 2 0.001 0.19522170 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3. 1 3 3 2 0.001 0.19521871 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3. 1 3 3 2 0.001 0.19521472 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 3 3 2 3. 1 3 3 2 0.001 0.195211

/rvnfv~fv copies fv to nfv after converting fv to values.

{goto}L~{edit}+.005~ increments L by .005

/XImse<omse-.00001~/xg\a~ IF mse still decreasing, recalc mse with new L

.001~ Reset L=.001 for next round

/xg\a~ Start over with next round

{goto}se~/rvfv~{end}{down}{down}~ "value copy" fv to output list

Notes: In 2 rounds mse is as low as Funk gets it in 2000 rounds. After 5 rounds mse is lower than ever before (and appears to be bottoming out).

I know I shouldn't hardcode parameters! Experiments should be done to optimize this line search (e.g., with some binary search for a low mse).

Since we have the resulting individual square_errors for each training pair, we could run this, then for mask the pairs with se(u,m) > Threshold.

Then do it again after masking out those that have already achieved a low se. But what do I do with the two resulting feature vectors? Do I treat it like a two feature SVD or do I use some linear combo of the resulting predictions of the two (or it could be more than two)? We need to test out which works best (or other modifications) on Netflix data.

Maybe on those test pairs for which the training row and column have some high errors, we apply the second feature vector instead of the first? Maybe we invoke CkNN for test pairs in this case (or use all 3 and a linear combo?)

This is powerful! We need to optimize the calculations using pTrees!!!

FAUST pTree PREDICTOR/CLASSIFIER (FAUST= Functional Analytic Unsupervised and Supervised machine Teaching):

FAUST pTree CLUSTER/ANOMALASER

pTrees in MapReduce MapReduce and Hadoop are key-value approaches to organizing and managing BigData.

pTree Text Mining:: capturie the reading sequence, not just the term-frequency matrix (lossless capture) of a text corpus.

Secure pTreeBases: This involves anonymizing the identities of the individual pTrees and randomly padding them to mask their initial bit positions.

pTree Algorithmic Tools: An expanded algorithmic tool set is being developed to include quadratic tools and even higher degree tools.

pTree Alternative Algorithm Implementation: Implementing pTree algorithms in hardware (e.g., FPGAs) should result in orders of magnitude performance increases?

pTree O/S Infrastructure: Computers and Operating Systems are designed to do logical operations (AND, OR...) rapidly. Exploit this for pTree processing speed.

pTree Recommender: This includes, Singular Value Decomposition (SVD) recommenders, pTree Near Neighbor Recommenders and pTree ARM Recommenders.

Research SummaryWe datamine big data (big data ≡ trillions of rows and, sometimes, thousands of columns (which can complicate data mining trillions of rows).How do we do it? I structure the data table as [compressed] vertical bit columns (called "predicate Trees" or "pTrees").I process those pTrees horizontally (because processing across thousands of column structures is orders of magnitude faster than processing down trillions of row structures. As a result, some tasks that might have taken forever can be done in a humanly acceptable amount of time.

What is data mining? Largely it is classification (assigning a class label to a row based on a training table of previously classified rows).

Clustering and Association Rule Mining (ARM) are important areas of data mining also, and they are related to classification.

The purpose of clustering is usually to create [or improve] a training table. It is also used for anomaly detection, a huge area in data mining.

ARM is used to data mine more complex data (relationship matrixes between two entities, not just single entity training tables). Recommenders recommend products to customers based on their previous purchases or rents (or based on their ratings of items)".

To make a decision, we typically search our memory for similar situations (near neighbor cases) and base our decision on the decisions we (or an expert) made in those similar cases. We do what worked before (for us or for others). I.e., we let near neighbor cases vote. But which neighbor vote? "The Magical Number Seven, Plus or Minus Two..." Information"[2] is one of the most highly cited papers in psychology cognitive psychologist George A. Miller of Princeton University's Department of Psychology in Psychological Review. It argues that the number of objects an average human can hold in working memory is 7 ± 2 (called Miller's Law). Classification provides a better 7.

Some current pTree Data Mining research projects

http://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus_or_Minus_Two

http://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus_or_Minus_Two#cite_note-2

DPP-KM 1. Check gaps in DPPp,d(y) (over grids of p and d?). 1.1 Check distances at any sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).

DPP-DA 2. Check gaps in DPPp,d(y) (grids of p and d?) against the density of subcluster. 2.1 Check distances at sparse extremes against subcluster density. 2.2 Apply other methods once Dot ceases to be effective.

DPP-SD) 3. Check gaps in DPPp,d(y) (over a p-grid and a d-grid) and SDp(y) (over a p-grid). 3.1 Check sparse ends distance with subcluster density. (DPPpd and SDp share construction steps!)

The Square Distance Functional (SD) Check gaps in SDp(y) ≡ (y-p)o(y-p) (parameterized over a pRn grid).

SD-DPP-SDPR) (DPPpq , SDp and SDPRpq share construction steps!SDp(y) ≡ (y-p)o(y-p) = yoy - 2 yop +pop DPPpq(y) ≡ (y-p)od=yod-pod= (1/|p-q|)yop - (1/|p-q|)yoq

Calc yoy, yop, yoq concurrently? Then constant multiplies 2*yop, (1/|p-q|)*yop concurrently. Then add | subtract.Calculate DPPpq(y)2. Then subtract it from SDp(y)

FAUST clustering (the unsupervised part of FAUST)

This class of partitioning or clustering methods relies on choosing a dot product projection so that if we find a gap in the F-values, we know that the 2 sets of points mapping to opposite sides of that gap are at least as far apart as the gap width.).

The Coordinate Projection Functionals (ej) Check gaps in ej(y) ≡ yoej = yj

The Dot Product Projection (DPP) Check for gaps in DPPd(y) or DPPpq(y)≡ (y-p)o(p-q)/|p-q| (parameterized over a grid of d=(p-q)/|p-q|Spheren. d

The Dot Product Radius (DPR) Check gaps in DPRpq(y) ≡ √ SDp(y) - DPPpq(y)2

The Square Dot Product Radius (SDPR) SDPRpq(y) ≡ SDp(y) - DPPpq(y)2 (easier pTree processing)

FAUST DPP CLUSTER on IRiS with DPP(y)=(y-p)o(q-p)/|q-p|, where p is the min (or n) corner and q is the max (x) corner of the circumscribing rectangle (mdpts or avg (a) is used also).

IRIS: 150 irises (rows), 4 columns (Pedal Length, Pedal Width, Sepal Length, Sepal Width). first 50 are Setosa (s), next 50 are Versicolor (e), next 50 are Virginica (i) irises.

gap>=4p=nnnnq=xxxxF Count 0 1 1 1 2 1 3 3 4 1 5 6 6 4 7 5 8 7 9 310 811 512 113 214 115 119 120 121 326 228 129 430 231 232 233 434 336 537 238 239 240 541 642 543 744 245 146 347 248 149 550 451 152 353 254 255 356 257 158 159 161 264 266 268 1

CL1 F<17 (50 Set)

DPP605960586058605959586058596263616160586057596456565757596058575861625861615860596157605756585959606025272229242625372531342930243527263123322331212528

SL SW PL PWset 51 35 14 2set 49 30 14 2set 47 32 13 2set 46 31 15 2 set 50 36 14 2set 54 39 17 4set 46 34 14 3set 50 34 15 2set 44 29 14 2set 49 31 15 1set 54 37 15 2set 48 34 16 2set 48 30 14 1set 43 30 11 1set 58 40 12 2set 57 44 15 4set 54 39 13 4set 51 35 14 3set 57 38 17 3set 51 38 15 3set 54 34 17 2set 51 37 15 4set 46 36 10 2set 51 33 17 5set 48 34 19 2set 50 30 16 2set 50 34 16 4set 52 35 15 2set 52 34 14 2set 47 32 16 2set 48 31 16 2set 54 34 15 4set 52 41 15 1set 55 42 14 2set 49 31 15 1set 50 32 12 2set 55 35 13 2set 49 31 15 1set 44 30 13 2set 51 34 15 2set 50 35 13 3set 45 23 13 3set 44 32 13 2set 50 35 16 6set 51 38 19 4set 48 30 14 3set 51 38 16 2set 46 32 14 2set 53 37 15 2set 50 33 14 2ver 70 32 47 14ver 64 32 45 15ver 69 31 49 15ver 55 23 40 13ver 65 28 46 15ver 57 28 45 13ver 63 33 47 16ver 49 24 33 10ver 66 29 46 13ver 52 27 39 14ver 50 20 35 10ver 59 30 42 15 ver 60 22 40 10ver 61 29 47 14ver 56 29 36 13ver 67 31 44 14ver 56 30 45 15ver 58 27 41 10ver 62 22 45 15ver 56 25 39 11ver 59 32 48 18ver 61 28 40 13ver 63 25 49 15ver 61 28 47 12ver 64 29 43 13

1234567891012345678920123456789301234567894012345678950 1234567891012345678920123425

ver 58 26 40 12ver 50 23 33 10 ver 56 27 42 13ver 57 30 42 12ver 57 29 42 13ver 62 29 43 13ver 51 25 30 11ver 57 28 41 13vir 63 33 60 25vir 58 27 51 19vir 71 30 59 21vir 63 29 56 18vir 65 30 58 22vir 76 30 66 21vir 49 25 45 17vir 73 29 63 18vir 67 25 58 18vir 72 36 61 25vir 65 32 51 20vir 64 27 53 19vir 68 30 55 21vir 57 25 50 20vir 58 28 51 24vir 64 32 53 23vir 65 30 55 18vir 77 38 67 22vir 77 26 69 23vir 60 22 50 15vir 69 32 57 23vir 56 28 49 20vir 77 28 67 20vir 63 27 49 18vir 67 33 57 21vir 72 32 60 18vir 62 28 48 18vir 61 30 49 18vir 64 28 56 21vir 72 30 58 16vir 74 28 61 19vir 79 38 64 20vir 64 28 56 22vir 63 28 51 15vir 61 26 56 14vir 77 30 61 23vir 63 34 56 24vir 64 31 55 18vir 60 30 18 18vir 69 31 54 21vir 67 31 56 24vir 69 31 51 23vir 58 27 51 19vir 68 32 59 23vir 67 33 57 25vir 67 30 52 23vir 63 25 50 19vir 65 30 52 20vir 62 34 54 23vir 59 30 51 18

345678950 1234567891012345678920123456789301234567894012345678950

30372930292840301019111512 524 8121019161519171716 6 0301019111512 524 81210 19161519171716 6 01613181911121719181620

DPP2723212636323332202727242531302726

SL SW PL PWver 66 30 44 14ver 68 28 48 14ver 67 30 50 17ver 60 29 45 15ver 57 26 35 10ver 55 24 38 11ver 55 24 37 10ver 58 27 39 12ver 60 27 51 16ver 54 30 45 15ver 60 34 45 16ver 67 31 47 15ver 63 23 44 13ver 56 30 41 13ver 55 25 40 13ver 55 26 44 12ver 61 30 46 14

26789301234567894012

Check distances in [12,28] s16,,i39,e49, e11, {e8,e44, i6,i10,i18,i19,i23,i32 outliers

F 12 13 13 14 15 19 20 21 21 21 26 26 28 s34 s6 s45 s19 s16 i39 e49 e8 e11 e44 e32 e30 e31s34 0 5 8 5 4 21 25 28 32 28 30 28 31s6 5 0 4 3 6 18 21 23 27 24 26 23 27s45 8 4 0 6 9 18 18 21 25 21 24 22 25s19 5 3 6 0 6 17 21 24 27 24 25 23 27s16 4 6 9 6 0 20 26 29 33 29 30 28 31i39 21 18 18 17 20 0 17 21 24 21 22 19 23e49 25 21 18 21 26 17 0 4 7 4 8 8 9e8 28 23 21 24 29 21 4 0 5 1 7 8 8e11 32 27 25 27 33 24 7 5 0 4 7 9 7e44 28 24 21 24 29 21 4 1 4 0 6 8 7e32 30 26 24 25 30 22 8 7 7 6 0 3 1e30 28 23 22 23 28 19 8 8 9 8 3 0 4e31 31 27 25 27 31 23 9 8 7 7 1 4 0

17<F<23 CL2 (e8,e11,e44,e49,i39)

23<F CL3 (46 vers,49 vir)

Checking [0,4] distances (s42 Setosa outlier)

F 0 1 2 3 3 3 4 s14 s42 s45 s23 s16 s43 s3s14 0 8 14 7 20 3 5s42 8 0 17 13 24 9 9s45 14 17 0 11 9 11 10s23 7 13 11 0 15 5 5s16 20 24 9 15 0 18 16s43 3 9 11 5 18 0 3s3 5 9 10 5 16 3 0

Checking [57.68] distances i10,i36,i19,i32,i18, {i6,i23} outliers

F 57 58 59 61 61 64 64 66 66 68 i26 i31 i8 i10 i36 i6 i23 i19 i32 i18i26 0 5 4 8 7 8 10 13 10 11i31 5 0 3 10 5 6 7 10 12 12i8 4 3 0 10 7 5 6 9 11 11i10 8 10 10 0 8 10 12 14 9 9i36 7 5 7 8 0 5 7 9 9 10i6 8 6 5 10 5 0 3 5 9 8i23 10 7 6 12 7 3 0 4 11 10i19 13 10 9 14 9 5 4 0 13 12i32 10 12 11 9 9 9 11 13 0 4i18 11 12 11 9 10 8 10 12 4 0

Thinning=[6,7 ] CL3.1 <6.5 44 ver 4 vir

CL3.2 >6.5 2 ver 39 vir

No sparse ends

CL3 w outliers removed p=aaax q=aaan F Cnt 0 4 1 2 2 5 3 13 4 8 5 12 6 4 7 2 8 11 9 510 411 512 213 714 315 2

Here we project onto lines through the corners and edge midpoints of the coordinate-oriented circumscribing rectangle. It would, of course, get better results if we choose p and q to maximize gaps.Next we consider maximizing the STD of the F-values to insure strong gaps (a heuristic method).

"Gap Hill Climbing": mathematical analysis1. To increase gap size, we hill climb the standard deviation of the functional, F (hoping that a "rotation" of d toward a higher StDev would increase the likelihood that gaps would be larger since more dispersion allows for more and/or larger gaps. This is very heuristic but it works.

2. We are more interested in growing the largest gap(s) of interest ( or largest thinning). To do this we could do:

F-slices are hyperplanes (assuming F=dotd) so it would makes sense to try to "re-orient" d so that the gap grows.Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces which is cut by the gap (or thinning), take as p and q to be the means of the F-slice (n-1)-dimensional hyperplanes defining the gap or thinning.This is easy since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact it is the sequence of F-values and the sequence of counts of points that give us those value that we use to find large gaps in the first place.).

0 1 2 3 4 5 6 7 8 9 a b c d e ff 1 0e 2 3d 4 5 6c 7 8b 9a98765 a j k l m n4 b c q r s3 d e f o p2 g h1 i0

d1

d 1-gap

=p

q=

d2

d2-gap

The d2-gap is much larger than the d1=gap. It is still not the optimal gap though. Would it be better to use a weighted mean (weighted by the distance from the gap - that is weighted by the d-barrel radius (from the center of the gap) on which each point lies?)

0 1 2 3 4 5 6 7 8 9 a b c d e ff 1e 2 3d 4 5 6c 7 8b 9a98765 a j k 4 b c q 3 d e f 2 1 0

d1

d 1-gap

p

q

d2

d2-gap

In this example it seems to make for a larger gap, but what weightings should be used? (e.g., 1/radius2) (zero weighting after the first gap is identical to the previous). Also we really want to identify the Support vector pair of the gap (the pair, one from one side and the other from the other side which are closest together) as p and q (in this case, 9 and a but we were just lucky to draw our vector through them.) We could check the d-barrel radius of just these gap slice pairs and select the closest pair as p and q???

V(d)= 2a11d1 +j1a1jdj

2a22d2+j2a2jdj

:

2anndn +jnanjdj

d0, one can hill-climb it to locally maximize the variance, V, as follows: d1≡(V(d0)); d2≡(V(d1)):... where

Maximizing theVarianceGiven any table, X(X1, ..., Xn), and any unit vector, d, in n-space, let

= jXj

2 dj2 +2

j<kXjXkdjdk - "

= j=1..n

(Xj2

- Xj2)dj

2 ++(2j=1..n<k=1..n

(XjXk - XjXk)djdk )

V(d)≡VarianceXod=(Xod)2 - (Xod)2

= i=1..N

(j=1..n xi,jdj)

2 - ( j=1..n

Xj dj )2

N 1

= i

jxi,j

2dj2 +

j<k xi,jxi,kdjdk

-

jXj

2dj2 +2

j<k XjXkdjdk N 1

N2

+ jkajkdjdkV(d)=jajjdj2

subject to i=1..ndi2=1

dT o A o d = V(d)

V

i XiXj-XiX,j

:

d1 ... dnd1:

dn

ijaijdidjV(d) =

x1

x2

:xN

x1odx2od

xNod=

Xod=Fd(X)=DPPd(X)d1

dn

- (jXj dj) (

kXk dk) =

i(

j xi,jdj) (

k xi,kdk) N 1

2a11 2a12 ... 2a1n

2a21 2a22 ... 2a2n

:'2an1 ... 2ann

d1

:di

:dn

V(d)≡Gradient(V)=2Ao

d or

We can separate out the diagonal or not:

Theorem1:

k{1,...,n}, d=ek will hill-climb V to its globally maximum.

Let d=ek s.t. akk is a maximal diagonal element of A,Theorem2 (working on it):

d=ek will hill-climb V to its globally maximum.

How do we use this theory?For Dot Product gap based Clustering, we can hill-climb akk below to a d that gives us the global maximum variance. Heuristically, higher variance means more prominent gaps.

For Dot Product Gap based Classification, we can start with X = the table of the C Training Set Class Means, where Mk≡MeanVectorOfClassk .

M1

M2

:MC

Then Xi = Mean(X)i and

and XiXj = Mean Mi1 Mj1

. : MiC MjC

These computations are O(C) (C=number of classes) and are instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot product projections of the class means.

FAUST Classifier MVDI (Maximized Variance Definite Indefinite:

Build a Decision tree. 1. Find the d that maximizes the variance of the dot product projections of the class means each round.2. Apply DI each round (see next slide).

Definite_i = ( Mx<i, Mn>i )

FAUST DI FAUST DI K-class training set, TK, and a given d (e.g., from D≡MeanTKMedTK):

Let mi≡meanCi s.t. dom1dom2 ...domK Mni≡Min{doCi} Mxi≡Max{doCi} Mn>i≡Minj>i{Mnj} Mx<i≡Maxj<i{Mxj}

Indefinite_i_i+1 = [Mn>i, Mx<i+1] Then recurse on each Indefinite.

For IRIS 15 records were extracted from each Class for Testing. The rest are the Training Set, TK. D=MEAN sMEANe

Definite_i_______ Indefinite_i_i+1______ class Mx<i MN>i class MN>i Mx<i+1 s-Mean 50.49 34.74 14.74 2.43 s(i=1) -1 25 e-Mean 63.50 30.00 44.00 13.50 e(i=2) 10 37 se 25 10 empty i-Mean 61.00 31.50 55.50 21.50 i(i=3) 48 128 ei 37 48

F < 18 setosa (35 seto) 1ST ROUND D=MeansMeane 18 < F < 37 versicolor (15 vers) 37 F 48 IndefiniteSet2 (20 vers, 10 virg) 48 < F virginica (25 virg)

F < 7 versicolor (17 vers. 0 virg) IndefSet2 ROUND D=MeaneMeani 7 F 10 IndefSet3 ( 3 vers, 5 virg) 10 < F virginica ( 0 vers, 5 virg)

F < 3 versicolor ( 2 vers. 0 virg) IndefSet3 ROUND D=MeaneMeani 3 F 7 IndefSet4 ( 2 vers, 1 virg) Here we will assign 0 F 7 versicolor 7 < F virginica ( 0 vers, 3 virg) 7 < F virginica

100% accuracy.

Test:F < 15 setosa (15 seto) 1ST ROUND D=MeansMeane 15 < F < 15 versicolor ( 0 vers, 0 virg) 15 F 41 IndefiniteSet2 (15 vers, 1 virg) 41 < F virginica ( 14 virg)

F < 20 versicolor (15 vers. 0 virg) IndefSet2 ROUND D=MeaneMeani 20 < F virginica ( 0 vers, 1 virg)

Option-1: The sequence of D's is: Mean(Classk)Mean(Classk+1) k=1... (and Mean could be replaced by VOM or?)Option-2: The sequence of D's is: Mean(Classk)Mean(h=k+1..nClassh) k=1... (and Mean could be replaced by VOM or?)Option-3: D seq: Mean(Classk)Mean(h not used yetClassh) where k is the Class with max count in subcluster (VoM instead?)Option-2: D seq.: Mean(Classk)Mean(h=k+1..nClassh) (VOM?) where k is Class with max count in subcluster.Option-4: D seq.: always pick the means pair which are furthest separated from each other.

Option-5: D Start with Median-to-Mean of IndefiniteSet, then means pair corresp to max separation of F(mean i), F(meanj)Option-6: D Always use Median-to-Mean of IndefiniteSet, IS. (initially,

IS=X)

FAUST MVDIFAUST MVDI on IRIS 15 records from each Class for Testing (Virg39 was removed as an outlier.)

Definite_____ Indefinite s-Mean 50.49 34.74 14.74 2.43 s -1 10 e-Mean 63.50 30.00 44.00 13.50 e 23 48 s_ei 23 10 empty i-Mean 61.00 31.50 55.50 21.50 i 38 70 se_i 38 48

(-1, 16.5=avg{23,10})s sCt=50 (16.5, 38)e eCt=24 (48.128)i iCt=39 d=(.33, -.1, .86, .38)indef[38, 48]se_i seCt=26 iCt=13

Definite Indefinite i-Mean 62.8 29.2 46.1 14.5 i -1 8 e-Mean 59 26.9 49.6 18.4 e 10 17 i_e 8 10 empty

d=(-.55, -.33, .51, .57) (-1,8)e Ct=21 (10,128)i Ct=9indef[8,10]e_i eCt=5 iCt=4

In this case, since the indefinite interval is so narrow, we absorb it into the two definite intervals; resulting in decision tree:

38

xod

0 4

8

d1=(-.55, -.33, .51, .57)

Versicolor

xod1 < 9

Virginica

xod1 9

Setosa

d0=(.33, -.1, .86,.38)

xod0 < 16.5

Versicolor

16.5 xod 0 < 38

Virginica

48 < xod0

FAUST MVDIFAUST MVDI SatLog 413train 4atr 6cls 127testGradient Hill Climb of Variance(d) d1 d2 d3 d4 Vd)0.00 0.00 1.00 0.00 2820.13 0.38 0.64 0.65 7000.20 0.51 0.62 0.57 7420.26 0.62 0.57 0.47 7810.30 0.70 0.53 0.38 8100.34 0.76 0.48 0.30 8300.36 0.79 0.44 0.23 8410.37 0.81 0.40 0.18 8470.38 0.83 0.38 0.15 8500.39 0.84 0.36 0.12 8520.39 0.84 0.35 0.10 853

Fomn Ct min max max+1mn2 49 40 115 119 106 108 91 155 156mn5 58 58 76 64 108 61 92 145 146mn7 69 77 81 64 131 154 104 160 161mn4 78 91 96 74 152 60 127 178 179mn1 67 103 114 94 167 27 118 189 190mn3 89 107 112 88 178 155 157 206 207

Gradient Hill Climb of Var(d)on t25 d1 d2 d3 d4 Vd) 0.00 0.00 0.00 1.00 1137 -0.11 -0.22 0.54 0.81 1747

MNod Ct ClMn ClMx ClMx+1mn2 45 33 115 124 150 54 102 177 178mn5 55 52 72 59 69 33 45 88 89Gradient Hill Climb of Var(d)on t257 0.00 0.00 1.00 0.00 496 -0.15 -0.29 0.56 0.76 1595Same using class means or training subset.

Gradient Hill Climb of Var(d)on t75 0.00 0.00 1.00 0.00 12 0.04 -0.09 0.83 0.55 20-0.01 -0.19 0.70 0.69 21Gradient Hill Climb of Var(d)on t13 0.00 0.00 1.00 0.00 29-0.83 0.17 0.42 0.34 166 0.00 0.00 1.00 0.00 25-0.66 0.14 0.65 0.36 81-0.81 0.17 0.45 0.33 88Gradient Hill Climb of Var(d)on t143 0.00 0.00 1.00 0.00 19-0.66 0.19 0.47 0.56 95 0.00 0.00 1.00 0.00 27-0.17 0.35 0.75 0.53 54-0.32 0.36 0.65 0.58 57-0.41 0.34 0.62 0.58 58

Using class means: FoMN Ct min max max+1mn4 83 101 104 82 113 8 110 121 122mn3 85 103 108 85 117 79 105 128 129mn1 69 106 115 94 133 12 123 148 149Using full data: (much better!)mn4 83 101 104 82 59 8 56 65 66mn3 85 103 108 85 62 79 52 74 75mn1 69 106 115 94 81 12 73 95 96

Gradient Hill Climb of Var t156161 0.00 0.00 1.00 0.00 5-0.23 -0.28 0.89 0.28 19-0.02 -0.06 0.12 0.99 157 0.02 -0.02 0.02 1.00 159 0.00 0.00 1.00 0.00 1-0.46 -0.53 0.57 0.43 2Inconclusive both ways so predict

purality=4(17) (3ct=3 tct=6

cl=4

Gradient Hill Climb of Var t146156 0.00 0.00 1.00 0.00 0 0.03 -0.08 0.81 -0.58 1 0.00 0.00 1.00 0.00 13 0.02 0.20 0.92 0.34 16 0.02 0.25 0.86 0.45 17Inconclusive both ways so predict

purality=4(17) (7ct=15 2ct=2

Cl=7

Gradient Hill Climb of Var t127 0.00 0.00 1.00 0.00 41-0.01 -0.01 0.70 0.71 90-0.04 -0.04 0.65 0.75 91 0.00 0.00 1.00 0.00 35-0.32 -0.14 0.59 0.73 105Inconclusive predict purality=7(62

4(15) 1(5) 2(8) 5(7)

cl=7

F[a,b) 0 92 104 118 127 146 156 157 161 179 190Class 2 2 2 2 2 2 5 5 5 5 7 7 7 7 7 7 1 1 1 1 1 1 1 4 4 4 4 4 3 3 3 3

d=(0.39 0.89 0.35 0.10 )

F[a,b) 89 102Class 5 2

d=(-.11 -.22 .54 .81)

F[a,b) 47 65 81 101Class 7 5 5 2 2

d=(-.15 -.29 .56 .76)

F[a,b) 57 61 69 87Class 5 7

d=(-.01, -.19, .7, .69)

F[a,b) 21 35 41 59Class 3 1

d=(-.81, .17, .45, .33)

F[a,b) 52 56 66 73 75Class 3 3 3 3 4 1 1

d=(-.66, .19, .47, .56)

On the 127 sample SatLog TestSet: 4 errors or 96.8% accuracy.speed? With horizontal data, DTI is applied one unclassified sample at a time (per execution thread).With this pTree Decision Tree, we take the entire TestSet (a PTreeSet), create the various dot product SPTS (one for each inode), create ut SPTS Masks. These masks mask the results for the entire TestSet.

For WINE: min max+18.40 10.33 27.00 9.63 28.65 9.9 53.47.56 11.19 32.61 10.38 34.32 7.7 111.88.57 12.84 30.55 11.65 32.72 8.7 108.48.91 13.64 34.93 11.97 37.16 13.1 92.2 Awful results!

FAUST MVDIFAUST MVDI Concrete

For Concrete min max+1 train335.3 657.1 0 l120.5 611.6 12 m321.1 633.5 0 hTest 0 l****** 1 m****** 0 h****** 0 321

3.0 57.0 0 l 3.0 361.0 11 m28.0 92.0 0 h 0 l***** 2 m***** 0 h 92***** 999

d0= -0.34 -0.16 0.81 -0.45

xod0<320

Class=m (test:1/1)

xod0>=634

Class=l (test:1/1)d1= .85 -.03 .52 -.02

xod2>=92

Class=m (test:2/2)d2= .85 -.00 .53 .05

xod2<28

Class= l or md3= .81 .04 .58 .01

d3547.9 860.9 4 l 617.1 957.3 0 m 762.5 867.7 0 h 0 l******* 0 m******* 0 h. 0******* 617

xod3<969

Cl=l *test 6/9)

xod3>=868

Cl=m (test:1/1)

d2544.2 651.5 0 l515.7 661.1 0 m591.0 847.4 40 h 1 l****** 0 m****** 11 h 662****** 999

xod2>=662

Cl=h (test:11/12)

xod3<544

Cl=m *test 0/0)

d4 = .79 .14 .60 .03

xod4<640

Cl=l *test 2/2)

xod4>=681

Cl=l (test:0/3)

7 test errors / 30 = 77%

Seeds.97 .17 -.02 .15 d013.3 19.3 0 0 l16.4 23.5 0 0 m12.2 15.2 25 5 h 0 13.2 19.3 23.5

xod<13.2

Class=h errs:0/5)

xod>=19.3

Class=m errs0/1)

.97 .19 .08 .16 d113.4 19.6 0 0 l16.9 19.9 4 3 m13.5 16.0 0 0 h0 13.45 18.6 99

xod>=18.6

Class=m errs0/4)

xod<13.2

Class=h errs:0/5)

0.97 0.19 0.06 0.1514.4 19.6 0 0 l16.8 18.8 0 0 m13.5 15.8 11 1 h0 14.366 17.816 99

Class=h errs:0/1)

Class=m errs0/0).00 .00 1.00 .00 1.0 8.0 6 4 l 4.0 5.0 0 0 m 2.0 9.0 0 0 h0 2 2 99

Class=l errs:0/4)

Class=m errs8/12)

8 test errors / 32 = 75%

E.g.,? Let D=vector connecting class means and d= D/|D|PX dot d>a = PdiXi>a

FAUST-Oblique: Create tbl, TBL(classi, classj, medoid_vectori, medoid_vectorj). Notes: If we just pick the one class which when paired with r, gives max gap, then we can use max gap or max_std_Int_pt instead of max_gap_midpt. Then need std j (or variancej) in TBL.

FAUST Oblique Classifier: formula:

P(X dot D)>a X any set of vectors. D=oblique vector (Note: if D=ei, PXi > a ).

AND 2 pTrees masks

To separate r from v: D = (mvmr), a = (mv+mr)/2 o d = midpoint of D projected onto d

P(m

b m

r )oX>(mr +m

) |/2od

P(m

vmr)oX>(m

r+mv )/2od

masks vectors that makes a

shadow on mr side of the midpt r r r v v r mr r v v v r r v mv v r b v v r b b v b mb b b b b b b

For classes r and b

"outermost = "furthest from means (their projs of D-line); best rankK points, best std points, etc. "medoid-to-mediod" close to optimal provided classes are convex.

Best cutpoint? mean, vector_of_medians, outmost, outmost_non-outlier?

D

r

g

b

grb grb grb grb grb grb grb grb

grb

In higher dims same (If "convex" clustered classes, FAUST{div,oblique_gap} finds them.

bgr bgr bgr bgr bgr bgr bgr bgr bgr bgr

r r r v v r mr r v v v r r v mv v r v v r v

P(m

rmv )/|m

rmv |oX<a

For classes r and v D = mrmv

a

r r vv r mR r v v v v r r v mV v r v v r v

FAUST ObliqueFAUST Oblique

PR = P(X dot d)<a

D≡ mRmV = oblique vector. d=D/|D|

Separate classR, classV using midpoints of means (mom)midpoints of means (mom) method: calc a

a

View mR, mV as vectors (mR≡vector from origin to pt_mR), a = (mR+(mV-mR)/2)od

= (mR+mV)/2 o d (Very same formula works when D=mVmR, i.e., points to left)

d

Training ≡ choosing "cut-hyper-plane" (CHP), which is always an (n-1)-dimensionl hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification)Improve accuracy? e.g., by considering the dispersion within classes when placing the CHP. Use1. the vector_of_median, vom, to represent each class, rather than mV, vomV ≡ ( median{v1|vV},2. project each class onto the d-line (e.g., the R-class below); then calculate the std (one horizontal formula per class; using Md's method); then use the std ratio to place CHP (No longer at the midpoint between mr [vomr] and mv [vomv] )

median{v2|vV}, ... )

vomV

v1

v2

vomR

std of these distances from origin

along the d-line

dim 2

dim 1

d-line

x y x\y 1 2 3 4 5 6 7 8 9 a b1 1 1 13 1 2 32 2 3 2 43 3 45 2 5 59 3 615 1 7 f14 2 815 3 9 6 d13 4 a b10 9 b c e1110 c9 11 d a1111 e 87 8 f 7 9

L1(x,y) Count Array

z1 1 2 1 1 1 1 2 1 1 1 1 1 1

z2 1 3 1 1 1 2 1 1 1 1 1 1

z3 1 3 1 1 1 1 1 2 1 1 1 1

z4 1 2 1 1 1 1 1 2 1 2 1 1

z5 1 3 2 1 1 1 2 1 1 1 1

z6 1 2 3 2 4 1 2

z7 1 2 1 1 1 1 2 4 1 1

z8 1 2 1 1 1 2 4 1 2

z9 1 2 1 1 3 2 1 3 1

z10 1 2 2 2 1 2 2 2 1

z11 1 1 2 1 1 1 2 1 2 2 1

z12 1 1 1 1 1 1 1 2 1 1 1 2 1

z13 1 1 2 1 1 1 1 3 3 1

z14 1 1 1 1 1 1 1 2 1 1 1 2 1

z15 1 1 1 1 2 1 1 1 2 3 1

L1(x,y) Value Array

z1 0 2 4 5 10 13 14 15 16 17 18 19 20

z2 0 2 3 8 11 12 13 14 15 16 17 18

z3 0 2 3 8 11 12 13 14 15 16 17 18

z4 0 2 3 4 6 9 11 12 13 14 15 16

z5 0 3 5 8 9 10 11 12 13 14 15

z6 0 5 6 7 8 9 10

z7 0 2 5 8 11 12 13 14 15 16

z8 0 2 3 6 9 11 12 13 14

z9 0 2 3 6 11 12 13 14 16

z10 0 3 5 8 9 10 11 13 15

z11 0 2 3 4 7 8 11 12 13 15 17

z12 0 1 2 3 6 8 9 11 13 14 15 17 19

z13 0 2 3 5 8 11 13 14 16 18

z14 0 1 2 3 7 9 10 12 14 15 16 18 20

z15 0 4 5 6 7 8 9 10 11 13 15

12/8/12

L1(x,y) Count Array

z1 1 2 1 1 1 1 2 1 1 1 1 1 1

z2 1 3 1 1 1 2 1 1 1 1 1 1

z3 1 3 1 1 1 1 1 2 1 1 1 1

z4 1 2 1 1 1 1 1 2 1 2 1 1

z5 1 3 2 1 1 1 2 1 1 1 1

z6 1 2 3 2 4 1 2

z7 1 2 1 1 1 1 2 4 1 1

z8 1 2 1 1 1 2 4 1 2

z9 1 2 1 1 3 2 1 3 1

z10 1 2 2 2 1 2 2 2 1

z11 1 1 2 1 1 1 2 1 2 2 1

z12 1 1 1 1 1 1 1 2 1 1 1 2 1

z13 1 1 2 1 1 1 1 3 3 1

z14 1 1 1 1 1 1 1 2 1 1 1 2 1

z15 1 1 1 1 2 1 1 1 2 3 1

L1(x,y) Value Array

z1 0 2 4 5 10 13 14 15 16 17 18 19 20

z2 0 2 3 8 11 12 13 14 15 16 17 18

z3 0 2 3 8 11 12 13 14 15 16 17 18

z4 0 2 3 4 6 9 11 12 13 14 15 16

z5 0 3 5 8 9 10 11 12 13 14 15

z6 0 5 6 7 8 9 10

z7 0 2 5 8 11 12 13 14 15 16

z8 0 2 3 6 9 11 12 13 14

z9 0 2 3 6 11 12 13 14 16

z10 0 3 5 8 9 10 11 13 15

z11 0 2 3 4 7 8 11 12 13 15 17

z12 0 1 2 3 6 8 9 11 13 14 15 17 19

z13 0 2 3 5 8 11 13 14 16 18

z14 0 1 2 3 7 9 10 12 14 15 16 18 20

z15 0 4 5 6 7 8 9 10 11 13 15

After having subclustered with linear gap analysis, it would make sense to run this round gap algoritm out only 2 steps to determine if there are any singleton, gap>2 subclusters (anomalies) which were not found by the previous linear analysis.

x y x\y 1 2 3 4 5 6 7 8 9 a b1 1 1 13 1 2 32 2 3 2 43 3 45 2 5 59 3 615 1 7 f14 2 815 3 9 6 M d13 4 a b10 9 b c e1110 c9 11 d a1111 e 87 8 f 7 9

This just confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis.

Confirms zf as an anomaly or outlier, since it was already declared so during the linear gap analysis.


yo(x-M)/|x-M| Count Arrays

z1 2 2 4 1 1 1 1 2 1

z2 2 2 4 1 1 1 1 2 1

z3 1 5 2 1 1 1 1 2 1

z4 2 4 2 2 1 1 2 1

z5 2 2 3 1 1 1 1 1 2 1

z6 2 1 1 1 1 3 3 3

z7 1 4 1 3 1 1 1 2 1

z8 1 2 3 1 3 1 1 2 1

z9 2 1 1 2 1 3 1 1 2 1

z10 2 1 1 1 1 1 4 1 1 2

z11 1 2 1 1 3 2 1 1 1 2

z12 1 1 1 2 2 1 1 1 1 1 1 2

z13 3 3 3 1 1 1 1 2

z14 1 1 2 1 3 2 1 1 2 1

z15 1 2 1 1 2 1 2 2 2 1

yo(x-M)/|x-M| Value Arrays

z1 0 1 2 5 6 10 11 12 14

z2 0 1 2 5 6 10 11 12 14

z3 0 1 2 5 6 10 11 12 14

z4 0 1 3 6 10 11 12 14

z5 0 1 2 3 5 6 10 11 12 14

z6 0 1 2 3 7 8 9 10

z7 0 1 2 3 4 6 9 11 12

z8 0 1 2 3 4 6 9 11 12

z9 0 1 2 3 4 6 7 10 12 13

z10 0 1 2 3 4 5 7 11 12 13

z11 0 1 2 3 4 6 8 10 11 12

z12 0 1 2 3 5 6 7 8 9 11 12 13

z13 0 1 2 3 7 8 9 10

z14 0 1 2 3 5 7 9 11 12 13

z15 0 1 3 5 6 7 8 9 10 11

x y Fz1 z1 14z1 z2 12z1 z3 12z1 z4 11z1 z5 10z1 z6 6z1 z7 1z1 z8 2z1 z9 0z1 z10 2z1 z11 2z1 z12 1z1 z13 2z1 z14 0z1 z15 5

9 5 Mean

0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

z1z2z3z4z5z6z7z8z9z10z11z12z13z14z15

gap: 10-6

gap: 5-2

Cluster by splitting at gaps > 2

z13111110000000000

z12000001000000001

z11000000111111110

cluster PTree Masks (by ORing)

gap: 6-9



z1 2 2 4 1 1 1 1 2 1

z2 2 2 4 1 1 1 1 2 1

z3 1 5 2 1 1 1 1 2 1

z4 2 4 2 2 1 1 2 1

z5 2 2 3 1 1 1 1 1 2 1

z6 2 1 1 1 1 3 3 3

z7 1 4 1 3 1 1 1 2 1

z8 1 2 3 1 3 1 1 2 1

z9 2 1 1 2 1 3 1 1 2 1

z10 2 1 1 1 1 1 4 1 1 2

z11 1 2 1 1 3 2 1 1 1 2

z12 1 1 1 2 2 1 1 1 1 1 1 2

z13 3 3 3 1 1 1 1 2

z14 1 1 2 1 3 2 1 1 2 1

z15 1 2 1 1 2 1 2 2 2 1


z1 0 1 2 5 6 10 11 12 14

z2 0 1 2 5 6 10 11 12 14

z3 0 1 2 5 6 10 11 12 14

z4 0 1 3 6 10 11 12 14

z5 0 1 2 3 5 6 10 11 12 14

z6 0 1 2 3 7 8 9 10

z7 0 1 2 3 4 6 9 11 12

z8 0 1 2 3 4 6 9 11 12

z9 0 1 2 3 4 6 7 10 12 13

z10 0 1 2 3 4 5 7 11 12 13

z11 0 1 2 3 4 6 8 10 11 12

z12 0 1 2 3 5 6 7 8 9 11 12 13

z13 0 1 2 3 7 8 9 10

z14 0 1 2 3 5 7 9 11 12 13

z15 0 1 3 5 6 7 8 9 10 11


9 5 Mean


z13111110000000000

z12000001000000001

z71111111000011111

z72000000111100000

z11000000111111110



z1 2 2 4 1 1 1 1 2 1

z2 2 2 4 1 1 1 1 2 1

z3 1 5 2 1 1 1 1 2 1

z4 2 4 2 2 1 1 2 1

z5 2 2 3 1 1 1 1 1 2 1

z6 2 1 1 1 1 3 3 3

z7 1 4 1 3 1 1 1 2 1

z8 1 2 3 1 3 1 1 2 1

z9 2 1 1 2 1 3 1 1 2 1

z10 2 1 1 1 1 1 4 1 1 2

z11 1 2 1 1 3 2 1 1 1 2

z12 1 1 1 2 2 1 1 1 1 1 1 2

z13 3 3 3 1 1 1 1 2

z14 1 1 2 1 3 2 1 1 2 1

z15 1 2 1 1 2 1 2 2 2 1


z1 0 1 2 5 6 10 11 12 14

z2 0 1 2 5 6 10 11 12 14

z3 0 1 2 5 6 10 11 12 14

z4 0 1 3 6 10 11 12 14

z5 0 1 2 3 5 6 10 11 12 14

z6 0 1 2 3 7 8 9 10

z7 0 1 2 3 4 6 9 11 12

z8 0 1 2 3 4 6 9 11 12

z9 0 1 2 3 4 6 7 10 12 13

z10 0 1 2 3 4 5 7 11 12 13

z11 0 1 2 3 4 6 8 10 11 12

z12 0 1 2 3 5 6 7 8 9 11 12 13

z13 0 1 2 3 7 8 9 10

z14 0 1 2 3 5 7 9 11 12 13

z15 0 1 3 5 6 7 8 9 10 11


9 5 Mean


z13111110000000000

z12000001000000001

z71111111000011111

z72000000111100000

gap: 3-7

zd1000000000011111

zd2111111111100000

z11000000111111110



z1 2 2 4 1 1 1 1 2 1

z2 2 2 4 1 1 1 1 2 1

z3 1 5 2 1 1 1 1 2 1

z4 2 4 2 2 1 1 2 1

z5 2 2 3 1 1 1 1 1 2 1

z6 2 1 1 1 1 3 3 3

z7 1 4 1 3 1 1 1 2 1

z8 1 2 3 1 3 1 1 2 1

z9 2 1 1 2 1 3 1 1 2 1

z10 2 1 1 1 1 1 4 1 1 2

z11 1 2 1 1 3 2 1 1 1 2

z12 1 1 1 2 2 1 1 1 1 1 1 2

z13 3 3 3 1 1 1 1 2

z14 1 1 2 1 3 2 1 1 2 1

z15 1 2 1 1 2 1 2 2 2 1


z1 0 1 2 5 6 10 11 12 14

z2 0 1 2 5 6 10 11 12 14

z3 0 1 2 5 6 10 11 12 14

z4 0 1 3 6 10 11 12 14

z5 0 1 2 3 5 6 10 11 12 14

z6 0 1 2 3 7 8 9 10

z7 0 1 2 3 4 6 9 11 12

z8 0 1 2 3 4 6 9 11 12

z9 0 1 2 3 4 6 7 10 12 13

z10 0 1 2 3 4 5 7 11 12 13

z11 0 1 2 3 4 6 8 10 11 12

z12 0 1 2 3 5 6 7 8 9 11 12 13

z13 0 1 2 3 7 8 9 10

z14 0 1 2 3 5 7 9 11 12 13

z15 0 1 3 5 6 7 8 9 10 11


9 5 Mean


z13111110000000000

z12000001000000001

z71111111000011111

z72000000111100000

zd1000000000011111

zd2111111111100000

z11000000111111110

AND each red with each blue with each green, to get the subcluster masks (12 ANDs)

For any FAUST clustering method, we proceed in one of 2 ways: gap analysis of the projections onto a unit vector, d, and/or gap analysis of the distances from a point, f (and another point, g, usually):

f, g, d, SpS(xod) require no processing (gap-finding is the only cost).

MCR(fg) adds the cost of SpS((x-f)o(x-f)) and SpS((x-g)o(x-g)).

f3

g3

x

y

z

f2

g2

f1g1

(nv1,nv2,Xv3)

(nv1,Xv2,Xv3)

(nv1,Xv2,nv3)

MinVect=nv=(nv1,nv2,nv3)

(Xv1,Xv2,Xv3)=Xv=MaxVect

(Xv1,nv2,Xv3)

(Xv1,Xv2,nv3)

(Xv1,nv2,nv3)

FAUST Clustering Methods: MCR (Using Midlines of circumscribing Coordinate Rectangle)

Define a sequence fk,gkdk

Given d, fMinPt(xod) and gMaxPt(xod). Given f and g, dk≡(f-g)/|f-g|

fk≡((nv1+Xv1)/2,...,nvk,...,(nvn+Xvn)/2)

MCR(dfg) on Iris150 Do SpS(xod) linear gap analysis (since it is processing free).

d1 noned2 none

d30 10 set23...1 19 set450 30 ver49...0 69 vir19

SubClus2

SubClus1

SubClus1

d41 6 set440 18 vir39Leaves exactly the 50 setosa.

SubClus2

d4 noneLeaves 50 ver and 49 vir

(look for outliers in subclus1, subclus2 Sequence thru{f, g} pairs: SpS((x-f)o(x-f)), SpS((x-g)o(x-g)) rnd gap.

On what's left:

SubClus1

f1 none

g1 none

f2 none

g2 none

f3 none

g3 none

f4 none

g4 none

f1 none

SubClus2

g1 none

f2 1 41 vir230 47 vir180 47 vir32

g2 none

f3 none

g3 none

f4 none

g4 none

f1

g1

0001

0011

0010

nv= 0000

0111

0101

0110

0100

0½½½=

=1½½½

1001

1011

1010

1000

1111 =Xv

1101

1110

1100

f2 = ½0½½

g2 =½1½½

f3

g3 =½½1½

=½½0½ f4

g4 =½½½1

=½½½0

f1= 0½½½

g1= 1½½½

gk≡((nv1+Xv1)/2,...,nXk,...,(nvn+Xvn)/2) dk=ek and SpS(xodk)=Xk

So we can do any subset

(d), (df), (dg), (dfg), (f), (fg), fgd), ...

d10 17 t1240 17 t140 17 tal1 17 t134

0 23 t130 23 t120 23 t11 23 t1230 38 set14...1 79 vir320 84 b120 84 b10 84 b131 84 b123

0 98 b1240 98 b1340 98 b140 98 ball

MCR(d) on Iris150+Outlier30, gap>4: Do SpS(xodk) linear gap analysis, k=1,2,3,4.

SubClus1

t124 t14 tal t134 0.0 25.0 35.0 43.025.0 0.0 43.0 35.035.0 43.0 0.0 25.043.0 35.0 25.0 0.0

t13 t12 t1 t123 0.0 43.0 35.0 25.043.0 0.0 25.0 35.035.0 25.0 0.0 43.025.0 35.0 43.0 0.0

b12 b1 b13 b123 0.0 30.0 52.4 43.030.0 0.0 43.0 52.452.4 43.0 0.0 30.043.0 52.4 30.0 0.0

b124 b134 b14 ball 0.0 52.4 30.0 43.052.4 0.0 43.0 30.030.0 43.0 0.0 52.443.0 30.0 52.4 0.0

d20 5 t20 5 t230 5 t241 5 t2340 20 ver1...1 44 set160 60 b240 60 b20 60 b2340 60 b23

t2 t23 t24 t234 0.0 35.0 12.0 37 35.0 0.0 37.0 12 12.0 37.0 0.0 35 37.0 12.0 35.0 0

b24 b2 b234 b23 0.0 28.0 43.0 51.328.0 0.0 51.3 43.043.0 51.3 0.0 28.051.3 43.0 28.0 0.0

d30 10 set23...1 19 set250 30 ver49...1 69 vir19Same split (expected)

SubClus1

SubClus1d4 1 6 set440 18 vir39Leaves exactly the 50 setosa as SubCluster1.

SubClus2d4 0 0 t41 0 t240 10 ver18...1 25 vir450 40 b40 40 b24Leaves the 49 virginica (vir39 declared an outlier) and the 50 versicolor as SubCluster2.

MCR(d) performs well on this dataset.

Accuracy: We can't expect a clustering method to separate versicolor from virginica because there is no gap between them. This method does separate off setosa perfectly and finds all 30 added outliers (subcluster of size 1 or 2). It finds virginica outlier, vir39, which is the most prominent intra-class outlier (distance 29.6 from the other virginica iris's, whereas no other iris is more than 9.1 from its classmates.)Speed: dk = ek so there is zero calculation cost for the d's.SpS(xodk) = SpS(xoek) = SpS(Xk) so there is zero calculation cost for it.The only cost is the loading of the dataset PTreeSet(X) (We use one column, SpS(Xk) at a time.) and that loading is required for any method. So MCR(d) is optimal with respect to speed!

Declare subclusters of size 1 or two to be outliers. Create the full pairwise distance table for any subcluster of size 10 and declare any point an outlier if its column (other than the zero diagonal value) values all exceed the threshold (which is 4).

start f1=MnVec RnGp>4 none

g1=MxVec RnGp>40 7 vir18...1 47 ver300 53 ver49..0 74 set14

SubClus1SubClus2

SubClus1 Lin>4 none

f2=0001 RnGp>4 none g2=1110 RnGp>4 none

Lin>4 none

f3=0010 RnGp>4 none

g3=1101 RnGp>4 none

Lin>4 none

f4=0011 RnGp>4 none

g4=1100 RnGp>4 none

Lin>4 none

f5=0100 RnGp>4 none

g5=1011 RnGp>4 none

Lin>4 none

f6=0101 RnGp>4 none

g6=1010 RnGp>4 none

Lin>4 none

f7=0110 RnGp>4 none

g7=1001 RnGp>4 none

Lin>4 none

f8=0111 RnGp>4 none

g8=1000 RnGp>4 none

Lin>4 none This ends SubClus1 = 95 ver and vir samples only

f8=0111 RnGp>4 none g8=1000 RnGp>4 none Lin>4 none

SubCluster2

f7=0110 RnGp>41 28 ver130 33 vir49

g7=1001 RnGp>4 none Lin>4 none

f6=0101 RnGp>41 19 set260 28 ver490 31 set420 31 ver80 32 set360 32 ver441 35 ver110 41 ver13

g6=1010 RnGp>4 none Lin>4 none

ver49 set42 ver8 set36 ver44 ver11 0.0 19.8 3.9 21.3 3.9 7.2 19.8 0.0 21.6 10.4 21.8 23.8 3.9 21.6 0.0 23.9 1.4 4.6 21.3 10.4 23.9 0.0 24.2 27.1 3.9 21.8 1.4 24.2 0.0 3.6 7.2 23.8 4.6 27.1 3.6 0.0

ver49

ver8

ver44ver11

Subc2.1






This ends SubClus2 = 47 setosa only

CCR(fgd) (Corners of Circumscribing Coordinate Rectangle) f1=minVecX≡(minXx1..minXxn) (0000)g1=MaxVecX≡(MaxXx1..MaxXxn) (1111), d=(g-f)/|g-f|

Sequence thru main diagonal pairs, {f, g} lexicographically. For each, create d.

Notes: No calculation required to find f and g (assuming MaxVecX and minVecX have been calculated and residualized when PTreeSetX was captured.)

If the dimension is high, since the main diagonal corners are liekly far from X and thus the large radii make the round gaps nearly linear.

CCR(f) Do SpS((x-f)o(x-f)) round gap analysis

CCR(g) Do SpS((x-g)o(x-g)) round gap analysis.

CCR(d) Do SpS((xod)) linear gap analysis.

SL SW PL PWset 51 35 14 2 0 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 1 0set 49 30 14 2 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0set 47 32 13 2 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0set 46 31 15 2 0 1 0 1 1 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0set 50 36 14 2 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0set 54 39 17 4 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0set 46 34 14 3 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1set 50 34 15 2 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 0set 44 29 14 2 0 1 0 1 1 0 0 0 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0set 49 31 15 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1set 54 37 15 2 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0set 48 34 16 2 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0set 48 30 14 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1set 43 30 11 1 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 1set 58 40 12 2 0 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0set 57 44 15 4 0 1 1 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0set 54 39 13 4 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0set 51 35 14 3 0 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 1 1set 57 38 17 3 0 1 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1set 51 38 15 3 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1set 54 34 17 2 0 1 1 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0set 51 37 15 4 0 1 1 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 1 0 0set 46 36 10 2 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0set 51 33 17 5 0 1 1 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1set 48 34 19 2 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0set 50 30 16 2 0 1 1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0set 50 34 16 4 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0set 52 35 15 2 0 1 1 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0set 52 34 14 2 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0set 47 32 16 2 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0set 48 31 16 2 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0set 54 34 15 4 0 1 1 0 1 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 0set 52 41 15 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 1set 55 42 14 2 0 1 1 0 1 1 1 1 0 1 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0set 49 31 15 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1set 50 32 12 2 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0set 55 35 13 2 0 1 1 0 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 0 1 0set 49 31 15 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1set 44 30 13 2 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 0 0 1 0set 51 34 15 2 0 1 1 0 0 1 1 1 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 0set 50 35 13 3 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 0 1 1set 45 23 13 3 0 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 1 1 0 1 0 0 0 0 1 1set 44 32 13 2 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0set 50 35 16 6 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 0set 51 38 19 4 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0set 48 30 14 3 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1set 51 38 16 2 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0set 46 32 14 2 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0set 53 37 15 2 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0set 50 33 14 2 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0ver 70 32 47 14 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 1 0ver 64 32 45 15 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 1 1 1 1ver 69 31 49 15 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 0 0 1 0 0 1 1 1 1ver 55 23 40 13 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1ver 65 28 46 15 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1ver 57 28 45 13 0 1 1 1 0 0 1 0 1 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 1ver 63 33 47 16 0 1 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 1 1 0 1 0 0 0 0ver 49 24 33 10 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0ver 66 29 46 13 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 1 1 1 0 0 0 1 1 0 1ver 52 27 39 14 0 1 1 0 1 0 0 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 1 1 1 0ver 50 20 35 10 0 1 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 0ver 59 30 42 15 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1ver 60 22 40 10 0 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1 0 1 0ver 61 29 47 14 0 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 1 0 0 1 1 1 0ver 56 29 36 13 0 1 1 1 0 0 0 0 1 1 1 0 1 0 1 0 0 1 0 0 0 0 1 1 0 1ver 67 31 44 14 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 0 0 1 1 1 0ver 56 30 45 15 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1ver 58 27 41 10 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 1 0 1 0ver 62 22 45 15 0 1 1 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1ver 56 25 39 11 0 1 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1ver 59 32 48 18 0 1 1 1 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0ver 61 28 40 13 0 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1ver 63 25 49 15 0 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 0 0 1 0 0 1 1 1 1ver 61 28 47 12 0 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 1 1 0 0ver 64 29 43 13 1 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1ver 66 30 44 14 1 0 0 0 0 1 0 0 1 1 1 1 0 0 1 0 1 1 0 0 0 0 1 1 1 0ver 68 28 48 14 1 0 0 0 1 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0ver 67 30 50 17 1 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 0 1 0 0 1 0 0 0 1ver 60 29 45 15 0 1 1 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 1 1 1ver 57 26 35 10 0 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0ver 55 24 38 11 0 1 1 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 1 0 1 1ver 55 24 37 10 0 1 1 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0ver 58 27 39 12 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0ver 60 27 51 16 0 1 1 1 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 0 0 0 0ver 54 30 45 15 0 1 1 0 1 1 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1ver 60 34 45 16 0 1 1 1 1 0 0 1 0 0 0 1 0 0 1 0 1 1 0 1 0 1 0 0 0 0ver 67 31 47 15 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 0 1 1 1 1ver 63 23 44 13 0 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1ver 56 30 41 13 0 1 1 1 0 0 0 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1ver 55 25 40 13 0 1 1 0 1 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 1ver 55 26 44 12 0 1 1 0 1 1 1 0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 0ver 61 30 46 14 0 1 1 1 1 0 1 0 1 1 1 1 0 0 1 0 1 1 1 0 0 0 1 1 1 0

SL SW PL PWver 58 26 40 12 0 1 1 1 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 1 1 0 0ver 50 23 33 10 0 1 1 0 0 1 0 0 1 0 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0ver 56 27 42 13 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 0 1 0 1 0 0 0 1 1 0 1ver 57 30 42 12 0 1 1 1 0 0 1 0 1 1 1 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0ver 57 29 42 13 0 1 1 1 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 0 1ver 62 29 43 13 0 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1ver 51 25 30 11 0 1 1 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 1ver 57 28 41 13 0 1 1 1 0 0 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1vir 63 33 60 25 0 1 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 0 0 1vir 58 27 51 19 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 0 0 1 1vir 71 30 59 21 1 0 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 1vir 63 29 56 18 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 0 0 0 1 0 0 1 0vir 65 30 58 22 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 1 0 1 0 0 1 0 1 1 0vir 76 30 66 21 1 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1vir 49 25 45 17 0 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1vir 73 29 63 18 1 0 0 1 0 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0vir 67 25 58 18 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1 0vir 72 36 61 25 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 0 0 1vir 65 32 51 20 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 0 0vir 64 27 53 19 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 1 0 1 0 0 1 1vir 68 30 55 21 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1vir 57 25 50 20 0 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0vir 58 28 51 24 0 1 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 0 1 1 0 1 1 0 0 0vir 64 32 53 23 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0 1 1 1vir 65 30 55 18 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 0 0 1 0vir 77 38 67 22 1 0 0 1 1 0 1 1 0 0 1 1 0 1 0 0 0 0 1 1 0 1 0 1 1 0vir 77 26 69 23 1 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 1 1 1vir 60 22 50 15 0 1 1 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1vir 69 32 57 23 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 1vir 56 28 49 20 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0vir 77 28 67 20 1 0 0 1 1 0 1 0 1 1 1 0 0 1 0 0 0 0 1 1 0 1 0 1 0 0vir 63 27 49 18 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 1 0vir 67 33 57 21 1 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 1 0 0 1 0 1 0 1 0 1vir 72 32 60 18 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 1 0vir 62 28 48 18 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0vir 61 30 49 18 0 1 1 1 1 0 1 0 1 1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0vir 64 28 56 21 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 1 0 1 0 1vir 72 30 58 16 1 0 0 1 0 0 0 0 1 1 1 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0vir 74 28 61 19 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 1 1 1 0 1 0 1 0 0 1 1vir 79 38 64 20 1 0 0 1 1 1 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0vir 64 28 56 22 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 1 0 1 1 0vir 63 28 51 15 0 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 1 0 0 1 1 1 1vir 61 26 56 14 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 1 0vir 77 30 61 23 1 0 0 1 1 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 1vir 63 34 56 24 0 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 0 0 1 1 0 0 0vir 64 31 55 18 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 0vir 60 30 18 18 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0vir 69 31 54 21 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 0 1vir 67 31 56 24 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0vir 69 31 51 23 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1vir 58 27 51 19 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 0 0 1 1vir 68 32 59 23 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 1 1 1vir 67 33 57 25 1 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 1 0 0 1 0 1 1 0 0 1vir 67 30 52 23 1 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1vir 63 25 50 19 0 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 1vir 65 30 52 20 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0vir 62 34 54 23 0 1 1 1 1 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1vir 59 30 51 18 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 0 0 1 1 0 1 0 0 1 0t1 20 30 37 12 0 0 1 0 1 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 0 1 0 1 1t2 58 5 37 12 0 1 1 1 0 1 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 1 1t3 58 30 2 12 0 1 1 1 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1t4 58 30 37 0 0 1 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0t12 20 5 37 12 0 0 1 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 1 1t13 20 30 2 12 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1t14 20 30 37 0 0 0 1 0 1 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0t23 58 5 2 12 0 1 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1t24 58 5 37 0 0 1 1 1 0 1 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0t34 58 30 2 0 0 1 1 1 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0t123 20 5 2 12 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1t124 20 5 37 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0t134 20 30 2 0 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0t234 58 5 2 0 0 1 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0tall 20 5 2 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0b1 90 30 37 12 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 0 1 0 1 1b2 58 60 37 12 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1b3 58 30 80 12 0 1 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0 0 0 1 0 1 1b4 58 30 37 40 0 1 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0b12 90 60 37 12 1 0 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1b13 90 30 80 12 1 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0 0 0 1 0 1 1b14 90 30 37 40 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0b23 58 60 80 12 0 1 1 1 0 1 0 1 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1b24 58 60 37 40 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 1 0 1 0 0 0b34 58 30 80 40 0 1 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0b123 90 60 80 12 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1b124 90 60 37 40 1 0 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 1 0 1 0 0 0b134 90 30 80 40 1 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0b234 58 60 80 40 0 1 1 1 0 1 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0ball 90 60 80 40 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0Before adding the new tuples:MINS 43 20 10 1MAXS 79 44 69 25MEAN 58 30 37 12 same after additions.

1234567891012345678920123456789301234567894012345678950 1234567891012345678920123456789301234567894012

345678950 1234567891012345678920123456789301234567894012345678950

f=M Gp>4 1 53 b130 58 t1230 59 b2340 59 tal0 60 b1341 61 b1230 67 ball

DISTANCES t123 b234 tal b134 b123 0.00 106.48 12.00 111.32 118.36106.48 0.00 110.24 43.86 42.52 12.00 110.24 0.00 114.93 118.97111.32 43.86 114.93 0.00 41.04118.36 42.52 118.97 41.04 0.00All outliers! f0=t123 RnGp>4

1 0 t1230 25 t131 28 t1340 34 set42...1 103 b230 108 b13

f0=b23 RnGp>4 1 0 b230 30 b3...1 84 t340 95 t230 96 t234

f0=b124 RnGp>4 1 0 b1240 28 b120 30 b141 32 b240 41 vir10...1 75 t241 81 t11 86 t141 93 t120 98 t124

b12 b14 b24 0.00 41.04 42.5241.04 0.00 43.8642.52 43.86 0.00 All outliers again!

f0=b34 RnGp>41 0 b340 26 vir1...1 66 vir390 72 set24...1 83 t30 88 t34

SubClust-1

SubClust-2

SubClust-1f0=b2 RnGp>4 1 0 b2 0 28 ver36

SubClust-1f0=b3 RnGp>4 1 0 b30 23 vir8...1 54 b10 62 vir39

SubClust-1f0=t24 RnGp>41 0 t241 12 t20 20 ver13

SubClust-1f0=b1 RnGp>41 0 b10 23 ver1

SubClust-1f0=ver19 RnGp>4none

SubClust-1f0=ver19 LinGp>4none

SubClust-2f0=t3 RnGp>4none

SubClust-2f0=t3 LinGap>41 0 t30 12 t34

SubClust-2f0=t34 LinGap>41 0 t340 13 set36

SubClust-2f0=set16 LnGp>4 none

SubClust-2f1=set42 RdGp>4 none

SubClust-2f1=set42 LnGp>4 noneSubClust-2 is 50 setosa! Likely f2, f3 and f4 analysis will not find none.

SubClust-1f1=ver49 RdGp>4 noneSubClust-1f1=ver49 LnGp>4 none

1. Choose f0 (high outlier potential? e.g., furthest from mean, M?)2. Do f0-rnd-gap analysis (+ subcluster anal?)3. f1 be s.t. no x further away from f0 (in some dir) (all d1 dot prods0)4. Do f1-rnd-gap analysis (+ subclust anal?).5. Do d1-linear-gap analysis, d1≡ f0-f1 / |f0-f1|.6. Let f2 s.t. no x is further away (in some direction) from d1-line than f2 7. Do f2-round-gap analysis.8. Do d2-linear-gap d2 ≡ f0-f2 - (f0-f2)od1 / len...

FM(fgd) (Furthest-from-the-Mediod) FMO (FM using a Gram-Schmidt Orthonormal basis) X Rn. Calculate M=MeanVector(X) directly, using only the residualized 1-counts of the basic pTrees of X. And BTW, use residualized STD calculations to guide in choosing good gap width thresholds (which define what an outlier is going to be and also determine when we divide into sub-clusters.))

Repick f1MnPt[SpS(xod1)]. Pick g1MxPt[SpS(xod1)]

If d110, Gram-Schmidt {d1 e1...ek-1 ek+1..en}

d2 ≡ (e2 - (e2od1)d1) / |e2 - (e2od1)d1|

d3 ≡ (e3 - (e3od1)d1 - (e3od2)d2) / |e3 - (e3od1)d1 - (e3od2)d2| ...

dh≡(eh-(ehod1)d1-(ehod2)d2-..-(ehodh-1)dh-1) / |eh-(ehod1)d1-(ehod2)d2-...-(ehodh-1)dh-1|

Thm: MxPt[SpS((M-x)od)]=MxPt[SpS(xod)] (shift by Mod, MxPts are same

d1≡(M-f1)/|M-f1|.f1MxPt(SpS[(M-x)o(M-x)]).

Pick fhMnPt[SpS(xodh)]. Pick ghMxPt[SpS(xodh)].

f1=ball g1=tall LnGp>41 -137 ball0 -126 b1230 -124 b1341 -122 b2340 -112 b13...1 -29 t131 -24 t1341 -18 t1231 -13 tal

b123 b134 b234 0.0 41.0 42.541.0 0.0 43.942.5 43.9 0.0

FMO(d)

f2=t2 g2=b2 LnGp>41 21 set160 26 b2

f1=b13 g1=b2 LnGp>4 none

f2=t2 g2=t234 Ln>40 5 t230 5 t2340 6 t120 6 t240 6 t1241 6 t20 21 ver11

t23 t234 t12 t24 t124 t2 0.0 12.0 51.7 37.0 53.0 35.0 12.0 0.0 53.0 35.0 51.7 37.0 51.7 53.0 0.0 39.8 12.0 38.0 37.0 35.0 39.8 0.0 38.0 12.0 53.0 51.7 12.0 38.0 0.0 39.8 35.0 37.0 38.0 12.0 39.8 0.0

f2=vir11 g2=b23 Ln>41 43 b120 50 b340 51 b1240 51 b230 52 t130 53 b13

b34 b124 b23 t13 b13 0.0 61.4 41.0 91.2 42.5 61.4 0.0 60.5 88.4 59.4 41.0 60.5 0.0 91.8 43.9 91.2 88.4 91.8 0.0 104.8 42.5 59.4 43.9 104.8 0.0

f2=vir11 g2=b12 Ln>41 45 set160 61 b240 61 b20 61 b12

b24 b2 b12 0.0 28.0 42.528.0 0.0 32.042.5 32.0 0.0

f2=vir11 g2=set16 Ln>4 none

f3=t34 g3=vir18 Ln>4 none

f4=t4 g4=b4 Ln>41 24 vir10 39 b40 39 b14

f4=t4 g4=vir1 Ln>4 noneThis ends the process. We found all (and only) added anomalies, but missed t34, t14, t4, t1, t3, b1, b3.

x x x x x x xx x x x x x x x x x x x x x x x x x x x x x x x x x x xx x x x x x x x x x x x x x x x xxx x x x x xx x x x x x x x x xxxx xx x x x x x x xx x x x x x x x x x x x x x x xx x x x x xx x x x xx x x

CRC f1=MinVector

CRC method g1=MaxVector ↓

f for FMG-GM

g for FMG-GM

x x

x x x x xx x

x x x x x

x x x x x x

x x x x x x

x x x x x x x x

xx x x x x x x x x x

x

x x x x x xxx x

x x

x xx x x x x x x x x

MCR f MCR g

f1=bal RnGp>41 0 ball0 28 b123...1 73 t40 78 vir39... 1 98 t340 103 t120 104 t230 107 t1241 108 t2340 113 t131 116 t1340 122 t1230 125 tal

t12 t23 t124 t234 0.0 51.7 12.0 53.051.7 0.0 53.0 12.012.0 53.0 0.0 51.753.0 12.0 51.7 0.0

FMO(fg) start

f1MxPt(SpS((M-x)o(M-x))), Round gaps first, then Linear gaps.

SubClus2

SubClus1

SubClus1 f1=b123 Rn>41 0 b1230 30 b130 30 vir320 30 vir181 32 b230 37 vir6

b13 vir32 vir18 b23 0.0 22.5 22.4 43.922.5 0.0 4.1 35.322.4 4.1 0.0 33.443.9 35.3 33.4 0.0

SubClus1 f1=b134 Rn>41 0 b1340 24 vir19

SubClus1 f1=b234 Rn>41 0 b2341 30 b340 37 vir10SubClus1 f1=b124 Rn>41 0 b1240 28 b120 30 b141 32 b240 41 b1...1 59 t40 68 b3

b124 b12 b14 0.0 28.0 30.028.0 0.0 41.030.0 41.0 0.0

SC1 f1=vir19 Rn>41 44 t40 52 b2

SC1 g1=b2 Rn>41 0 t40 28 ver36

SC1 f2=ver13 Rn>41 0 ver130 5 ver43

SC1 g2=vir10 Rn>41 0 vir100 6 vir44

SC1 f4=b1 Rn>41 0 b10 23 ver1

SC1 g4=b4 Rn>41 0 b40 21 vir15

SubC1us1 has 91, only versicolor and virginica.

SubClus2 f1=t14 Rn>40 0 t11 0 t140 30 ver8 ...1 47 set150 52 t30 52 t34

SubClus2 f1=set23 Rn>41 17 vir390 23 ver490 26 ver80 27 ver441 30 ver110 43 t240 43 t2

|ver49 ver8 ver44 ver11 0.0 3.9 3.9 7.1 3.9 0.0 1.4 4.7 3.9 1.4 0.0 3.7 7.1 4.7 3.7 0.0Almost outliers! Subcluster2.2Which type? Must classify.

SubClus2.2

SbCl_2.1 g1=ver39 Rn>41 0 vir390 7 set21 Note:what remains in SubClus2.1 is exactly the 50 setosa. But we wouldn't know that, so we continue to look for outliers and subclusters.SbCl_2.1 g1=set19 Rn>4 none

SbCl_2.1 LnG>4 none

SbCl_2.1 f2=set42 Rn>41 0 set420 6 set9SbCl_2.1 f2=set9 Rn>4 noneSbCl_2.1 g2=set16 Rn>4 noneSbCl_2.1 LnG>4 none

SbCl_2.1 f3=set16 Rn>4 noneSbCl_2.1 g3=set9 Rn>4 noneSbCl_2.1 LnG>4 none

SbCl_2.1 f4=set Rn>4 noneSbCl_2.1 g4=set Rn>4 noneSbCl_2.1 LnG>4 none

Finally we would classify within SubCluster1 using the means of another training set (with FAUST Classify). We would also classify SubCluster2.1 and SubCluster2.2, but would we know we would find SubCluster2.1 to be all Setosa and SubCluster2.2 to be all Versicolor (as we did before). In SubCluster1 we would separate Versicolor from Virginica perfectly (as we did before).

If this is typical (though concluding from one example is definitely "over-fitting"), then we have to conclude that Mark's round gap analysis is more productive than linear dot product proj gap analysis!

FFG (Furthest to Furthest), computes SpS((M-x)o(M-x)) for f1 (expensive? Grab any pt?, corner pt?) then compute SpS((x-f1)o(x-f1)) for f1-round-gap-analysis.

Then compute SpS(xod1) to get g1 to have projection furthest from that of f1 ( for d1 linear gap analysis) (Too expensive? since gk-round-gap-analysis and linear analysis contributed very little! But we need it to get f2, etc. Are there other cheaper ways to get a good f2? Need SpS((x-g1)o(x-g1)) for g1-round-gap-analysis (too expensive!)

We could FAUST Classify each outlier (if so desired) to find out which class they are outliers from. However, what about the rouge outliers I added? What would we expect? They are not represented in the training set, so what would happen to them? My thinking: they are real iris samples so we should not do the really do the outlier analysis and subsequent classification on the original 150.

We already know (assuming the "other training set" has the same means as these 150 do), that we can separate Setosa, Versicolor and Virginica prefectly using FAUST Classify.

For speed of text mining (and of other high dimension datamining), we might do additional dimension reduction (after stemming content word). A simple way is to use STD of the column of numbers generated by the functional (e.g., Xk, SpS((x-M)o(x-M)), SpS((x-f)o(x-f)), SpS(xod), etc.). The STDs of the columns, Xk, can be precomputed up front, once and for all. STDs of projection and square distance functionals must be done after they are generated (could be done upon capture too). Good functionals produce many large gaps. In Iris150 and Iris150+Out30, I find that the precomputed STD is a good indicator of that. A text mining scheme might be:1. Capture the text as a PTreeSET (after stemming the content words) and store mean, median, STD of every column (content word stem).2. Throw out low STD columns. 4'. Use a weighted sum of "importance" and STD? (If the STD is low, there can't be many large gaps.)A possible Attribute Selection algorithm: 1. Peel from X, outliers using CRM-lin, CRC-lin, possibly M-rnd, fM-rnd, fg-rnd.. (Xin = X - Xout)2. Calculate widths of each Xin-Circumscribing Rectangle edge, crewk 4. Look for wide gaps top down (or, very simply, order by STD).4'. Divide crewk into count{xk| xXin}. (but that doesn't account for dups) 4''. look for preponderance of wide thin-gaps top down.4'''. look for high projection interval count dispersion (STD). Notes: 1. Maybe an inlier sub-cluster needs occur from more than one functional projection to be declared an inlier sub-cluster? 2. STD of a functional projection appears to be a good indicator of the quality of its gap analysis.

For FAUST Cluster-d (pick d, then f=MnPt(xod) and g=MxPt(xod) ) a full grid of unit vectors (all directions, equally spaced) may be needed. Such a grid could be constructed using angles a1, ... , am, each equi-width partitioned on [0,180), with the formulas:

d = e1k=n...2cosk + e2sin2k=n...3cosk + e3sin3k=n...4cosk + ... + ensinn where i's start at 0 and increment by .

So, di1..in= j=1..n[ ej sin((ij-1)) * k=n. .j+1cos(k) ]; i0≡0, divides 180 (e.g., 90, 45, 22.5...)

CRMSTD(dfg) Eliminate all columns with STD < threshold.d30 10 set23...50set+vir391 19 set250 30 ver49...50ver_49vir0 69 vir19

SubClus2

SubClus1

ver49 ver8 ver44 ver11 0.0 3.9 3.9 7.2 3.9 0.0 1.4 4.6 3.9 1.4 0.0 3.6 7.2 4.6 3.6 0.0

(d1+d3)/sqr(2) clus1 none(d1+d3)/sqr(2) clus2:0 57.3 ver490 58.0 ver80 58.7 ver441 60.1 ver110 64.3 ver10 none

d5 (f5=vir23, g5=set14) none,f5 none, g5 none

d5 (f5=vir32, g5=set14) none, f5 none, g5 none

d5 (f5=vir6, g5=set14) none, f5 none, g5 none

(d3+d4)/sqr(2) clus1 none(d3+d4)/sqr(2) clus2 none

(d1+d3+d4)/sqr(3) clus11 44.5 set190 55.4 vir39(d1+d3+d4)/sqr(3) clus2 none

(d1+d2+d3+d4)/sqr(4) clus1(d1+d2+d3+d4)/sqr(4) clus2 none

d5 (f5=vir19, g5=set14) nonef51 0.0 vir19 clus20 4.1 vir23g5 none

d5 (f5=vir18, g5=set14) nonef51 0.0 vir18 clus21 4.1 vir320 8.2 vir6g5 none

Just about all the high STD columns find the subcluster split.

In addition, they find the four outliers as well

G = ( (u 1,n) r e u 1,n M n,..., (u L,n) r e u L,n M n,..., (v,m 1 ) r e v,m 1 U v,..., v,m L...

Documents

Transcript of G = ( (u 1,n) r e u 1,n M n,..., (u L,n) r e u L,n M n,..., (v,m 1 ) r e v,m 1 U v,..., v,m L...