Scientific and Large Data Visualization – 29 November 2017
High Dimensional Data – Part II
Massimiliano Corsini, Visual Computing Lab, ISTI-CNR, Italy
Overview
• Graphs Extensions
• Glyphs
  – Chernoff Faces
  – Multi-dimensional Icons
• Parallel Coordinates
• Star Plots
• Dimensionality Reduction
  – Principal Component Analysis (PCA)
  – Locally Linear Embedding (LLE)
  – IsoMap
  – Sammon Mapping
  – t-SNE
Dimensionality Reduction
• N-dimensional data are projected to 2 or 3 dimensions for better visualization/understanding.
• Widely used strategy.
• In general, it is a mapping, not a geometric transformation.
• Different mappings have different properties.
Principal Component Analysis (PCA)
• A classic multi-dimensional reduction technique is Principal Component Analysis (PCA).
• It is a linear, non-parametric technique.
• The core idea is to find a basis formed by the directions that maximize the variance of the data.
PCA as a Change of Basis
• The idea is to express the data in a new basis that best expresses our dataset.
• The new basis is a linear combination of the original basis.
PCA as a Change of Basis
Signal-to-Noise Ratio (SNR)
• Given a signal with noise:
  measurement = signal + noise
• It can be expressed as:
  SNR = σ²_signal / σ²_noise
Redundancy
Redundant variables convey no relevant information!
Figure from Jonathon Shlens, "A Tutorial on Principal Component Analysis", arXiv preprint arXiv:1404.1100, 2015.
Covariance Matrix
• Square symmetric matrix.
• The diagonal terms are the variance of a particular variable.
• The off-diagonal terms are the covariance between the different variables.
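The covariance matrix can be sketched in a few lines of NumPy (the data matrix and names below are illustrative, following the m × n convention of these slides):

```python
import numpy as np

# Toy data matrix X: m variables (rows) x n observations (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))

# Center each variable (row) by subtracting its mean.
Xc = X - X.mean(axis=1, keepdims=True)

# Covariance matrix: C = Xc Xc^T / (n - 1)  (square, symmetric, m x m).
n = X.shape[1]
C = (Xc @ Xc.T) / (n - 1)

print(np.allclose(C, C.T))          # symmetric
print(np.allclose(C, np.cov(X)))    # matches NumPy's built-in estimator
```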
Goals
• How to select the best P?
  – Minimize redundancy
  – Maximize the variance
• Goal: to diagonalize the covariance matrix of Y
  – High values of the diagonal terms mean that the dynamics of the single variables has been maximized.
  – Low values of the off-diagonal terms mean that the redundancy between variables is minimized.
Solving PCA
• Remember that the covariance matrix of the transformed data Y = PX can be written as C_Y = (1/n) Y Yᵀ = P C_X Pᵀ.
Solving PCA
• Theorem: a symmetric matrix A can be diagonalized by a matrix formed by its eigenvectors as A = EDEᵀ.
• The columns of E are the eigenvectors of A.
PCA Computation
• Organize the data as an m × n matrix.
• Subtract the corresponding mean from each row.
• Calculate the eigenvalues and eigenvectors of XXᵀ.
• Organize them to form the matrix P.
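A minimal NumPy sketch of the steps above (illustrative data; the 1/(n−1) scaling is one common convention):

```python
import numpy as np

# Data X is m x n (m dimensions, n observations).
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))

# 1. Subtract the mean of each row (each variable).
Xc = X - X.mean(axis=1, keepdims=True)

# 2. Eigen-decompose the (scaled) XX^T, i.e. the covariance matrix.
C = (Xc @ Xc.T) / (X.shape[1] - 1)
eigvals, eigvecs = np.linalg.eigh(C)   # eigh: for symmetric matrices

# 3. Sort by decreasing eigenvalue; the rows of P are the principal directions.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
P = eigvecs[:, order].T

# Y = P X has a diagonal covariance matrix: redundancy is removed.
Y = P @ Xc
CY = (Y @ Y.T) / (X.shape[1] - 1)
print(np.allclose(CY, np.diag(np.diag(CY))))   # off-diagonal terms ~ 0
```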
PCA for Dimensionality Reduction
• The idea is to keep the first k principal components (k < m).
• Project the data on these directions and use such data instead of the original ones.
• These data are the best approximation w.r.t. the sum of the squared differences.
PCA as the Projection that Minimizes the Reconstruction Error
• If we use only the first k < m components we obtain the best reconstruction in terms of squared error.
Data point projected on the first k components.
Data point projected on all the components.
PCA as the Projection that Minimizes the Reconstruction Error
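A numerical check of this property (illustrative data): the mean squared reconstruction error obtained by keeping the first k components equals the sum of the discarded eigenvalues of the covariance matrix.

```python
import numpy as np

# Anisotropic toy data: 5 dimensions with very different variances.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 500)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]
Xc = X - X.mean(axis=1, keepdims=True)
n = X.shape[1]

C = (Xc @ Xc.T) / n
eigvals, E = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]

k = 2
Ek = E[:, :k]                 # first k principal directions
X_hat = Ek @ (Ek.T @ Xc)      # project on k components, then reconstruct

mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=0))
print(np.isclose(mse, eigvals[k:].sum()))   # error = discarded variance
```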
Example
Figure From Jonathon Shlens, “A Tutorial on Principal Component Analysis”, arXiv preprint arXiv:1404.1100, 2015.
PCA – Example
Each measure has 6 dimensions (!), but the ball moves along the X-axis only.
Limits of PCA
• It is non-parametric → this is a strength, but it can also be a weakness.
• It fails for non-Gaussian distributed data.
• It can be extended to account for non-linear transformations → kernel PCA.
Limits of PCA
ICA guarantees statistical independence.
Classic MDS
• Find the linear mapping which minimizes:
  φ(Y) = Σ_{i,j} ( ||x_i − x_j|| − ||y_i − y_j|| )²
where ||y_i − y_j|| is the Euclidean distance in the low-dimensional space and ||x_i − x_j|| is the Euclidean distance in the high-dimensional space.
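Classical MDS can be sketched via the standard double-centering trick (illustrative, NumPy-only; data size and target dimensionality are assumptions):

```python
import numpy as np

# 10 points in 4-D; we only hand MDS their pairwise distances.
rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

# Double centering: B = -1/2 J D^2 J, with J = I - (1/n) 11^T.
n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J

# Top eigenpairs of B give the embedding Y = E_k sqrt(L_k).
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
k = 4
Y = eigvecs[:, order[:k]] * np.sqrt(np.maximum(eigvals[order[:k]], 0))

# With k equal to the true dimensionality, distances are reproduced exactly.
D2_low = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
print(np.allclose(D2, D2_low))
```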
PCA and MDS
• We want to minimize this cost; this corresponds to maximizing the variance of the low-dimensional points (the same goal as PCA).
PCA and MDS
• The size of the covariance matrix is proportional to the dimension of the data.
• MDS scales with the number of data points instead of the dimensions of the data.
• Both PCA and MDS better preserve large pairwise distances.
Locally Linear Embedding (LLE)
• LLE attempts to discover nonlinear structure in high dimension by exploiting local linear approximation.
(Figure: samples on the nonlinear manifold M and the discovered mapping.)
Locally Linear Embedding (LLE)
• INTUITION → assuming that there is sufficient data (well-sampled manifold), we expect that each data point and its neighbors can be approximated by a local linear patch.
• The patch is represented by a weighted sum of the local data points.
Compute Local Patch
• Choose a set of data points close to a given one (ball-radius or K-nearest neighbours).
• Solve for the weights that minimize the reconstruction error:
  E(W) = Σ_i || x_i − Σ_j W_ij x_j ||²,  subject to Σ_j W_ij = 1 (W_ij = 0 if x_j is not a neighbor of x_i).
LLE Mapping
• Find the low-dimensional vectors y_i which minimize the embedding cost function:
  Φ(Y) = Σ_i || y_i − Σ_j W_ij y_j ||²
Note that the weights W_ij are fixed in this case!
LLE Algorithm
1. Compute the neighbors of each data point x_i.
2. Compute the weights W_ij that best reconstruct each data point from its neighbors.
3. Compute the vectors y_i that minimize the cost function.
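The three steps above can be sketched as follows (illustrative, NumPy-only; the regularization constant added to the local Gram matrix is an assumption for numerical stability):

```python
import numpy as np

def lle(X, k=5, d=2):
    n = X.shape[0]
    # 1. k nearest neighbors of each point (by Euclidean distance).
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    knn = np.argsort(D, axis=1)[:, 1:k + 1]

    # 2. Reconstruction weights: solve C w = 1, then normalize sum(w) = 1.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[knn[i]] - X[i]                  # neighbors centered on x_i
        C = Z @ Z.T
        C += np.eye(k) * 1e-3 * np.trace(C)   # small regularizer
        w = np.linalg.solve(C, np.ones(k))
        W[i, knn[i]] = w / w.sum()

    # 3. Embedding: bottom eigenvectors of M = (I - W)^T (I - W),
    #    skipping the constant eigenvector with eigenvalue ~0.
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:d + 1]

# Usage: points on a 1-D curve embedded in 3-D, unrolled to 1-D.
rng = np.random.default_rng(4)
t = np.sort(rng.uniform(0, 3, size=60))
X = np.c_[np.cos(t), np.sin(t), t]
Y = lle(X, k=6, d=1)
print(Y.shape)   # (60, 1)
```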
LLE – Example
PCA fails to preserve the neighborhood structure of the nearby images.
LLE – Example
ISOMAP
• The core idea is to preserve the geodesic distance between data points.
• A geodesic is the shortest path between two points on a curved space.
ISOMAP
(Figures: Euclidean distance vs. geodesic distance; graph building and geodesic distance approximation; geodesic distance vs. approximated geodesic.)
ISOMAP
• Construct the neighborhood graph
  – Define a graph G over all data points by connecting points (i, j) if and only if point i is a K-nearest neighbor of point j.
• Compute the shortest paths
  – Using Floyd's algorithm.
• Construct the d-dimensional embedding.
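The pipeline above can be sketched as follows (illustrative; Floyd's algorithm approximates the geodesic distances, and classical MDS builds the final embedding):

```python
import numpy as np

def isomap(X, k=5, d=2):
    n = X.shape[0]
    D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))

    # Neighborhood graph: keep edges to the k nearest neighbors (symmetrized).
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    knn = np.argsort(D, axis=1)[:, 1:k + 1]
    for i in range(n):
        G[i, knn[i]] = D[i, knn[i]]
        G[knn[i], i] = D[knn[i], i]

    # Floyd's algorithm: shortest paths approximate geodesic distances.
    for m in range(n):
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])

    # Classical MDS on the geodesic distances.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

# Usage: unroll points lying on an arc into one dimension.
t = np.linspace(0, np.pi, 40)
X = np.c_[np.cos(t), np.sin(t)]
Y = isomap(X, k=4, d=1)
print(Y.shape)   # (40, 1)
```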
Autoencoders
• Machine learning is becoming ubiquitous in Computer Science.
• A special type of neural network is called an autoencoder.
• An autoencoder can be used to perform dimensionality reduction.
• First, let me say something about neural networks...
Autoencoder
(Figure: an autoencoder producing a low-dimensional representation.)
Multi-layer Autoencoder
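As a deliberately tiny sketch (not the multi-layer non-linear networks used in practice), here is a linear autoencoder with a 2-unit bottleneck trained by gradient descent in NumPy; all sizes and the learning rate are illustrative assumptions:

```python
import numpy as np

# 8-D data that actually lives on a 5-D linear subspace.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 8))

W_enc = rng.normal(scale=0.1, size=(8, 2))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(2, 8))   # decoder weights

def loss(X, W_enc, W_dec):
    H = X @ W_enc            # encode: low-dimensional representation
    R = H @ W_dec            # decode: reconstruction
    return ((R - X) ** 2).mean()

lr = 1e-3
first = loss(X, W_enc, W_dec)
for _ in range(500):
    H = X @ W_enc
    R = H @ W_dec
    G = 2 * (R - X) / X.size            # gradient of the loss w.r.t. R
    W_dec -= lr * (H.T @ G)             # backprop through the decoder
    W_enc -= lr * (X.T @ (G @ W_dec.T)) # backprop through the encoder
final = loss(X, W_enc, W_dec)
print(final < first)   # reconstruction error decreased
```

A linear autoencoder like this can at best learn a PCA-like projection; the deep, non-linear versions in the slides can capture curved manifolds.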
Sammon Mapping
• Adaptation of MDS obtained by weighting the contribution of each (i, j) pair:
  E = (1 / Σ_{i<j} d_ij) Σ_{i<j} ( d_ij − ||y_i − y_j|| )² / d_ij
• This allows retaining the local structure of the data better than classical scaling (the retention of large distances is not privileged).
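The weighted stress can be written directly in code; dividing each term by the high-dimensional distance is what emphasizes local structure (a sketch; function and variable names are assumptions):

```python
import numpy as np

def sammon_stress(X, Y):
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)   # each pair (i, j) once
    d_high = np.sqrt(np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))[iu]
    d_low = np.sqrt(np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1))[iu]
    # The 1/d_high weight makes errors on small distances count more.
    return np.sum((d_high - d_low) ** 2 / d_high) / d_high.sum()

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 10))
Y_bad = rng.normal(size=(30, 2))       # an unrelated 2-D layout
print(sammon_stress(X, X) == 0.0)      # a perfect layout has zero stress
print(sammon_stress(X, Y_bad) > 0.0)   # any mismatch gives positive stress
```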
t-SNE
• Most techniques for dimensionality reduction are not able to retain both the local and the global structure of the data in a single map.
• Simple tests on handwritten digits demonstrate this (Song et al. 2007).
L. Song, A. J. Smola, K. Borgwardt and A. Gretton, "Colored Maximum Variance Unfolding", in Advances in Neural Information Processing Systems, Vol. 21, 2007.
Stochastic Neighbor Embedding (SNE)
• Similarities between high- and low-dimensional data points are modeled with conditional probabilities.
• Conditional probability that the point x_i would pick x_j as its neighbor:
  p_{j|i} = exp(−||x_i − x_j||² / 2σ_i²) / Σ_{k≠i} exp(−||x_i − x_k||² / 2σ_i²)
Stochastic Neighbor Embedding (SNE)
• We are interested only in pairwise similarities, hence p_{i|i} is set to zero.
• For the low-dimensional points an analogous conditional probability is used:
  q_{j|i} = exp(−||y_i − y_j||²) / Σ_{k≠i} exp(−||y_i − y_k||²)
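The conditional probabilities can be sketched in NumPy (illustrative; a single fixed σ is used for every point here, whereas SNE tunes a per-point σ_i via a perplexity parameter):

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    # Gaussian similarity of every pair, normalized per row.
    D2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    P = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)            # p_{i|i} = 0
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 5))
P = conditional_p(X)
print(np.allclose(P.sum(axis=1), 1.0))  # each row is a distribution over neighbors
```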
Kullback-Leibler Divergence
• Coding theory: the expected number of extra bits required to code samples from the distribution P if the current code is optimized for the distribution Q.
• Bayesian view: a measure of the information gained when one revises one's beliefs from the prior distribution Q to the posterior distribution P.
• It is also called relative entropy.
Kullback-Leibler Divergence
• Definition for discrete distributions:
  KL(P || Q) = Σ_i P(i) log( P(i) / Q(i) )
• Definition for continuous distributions:
  KL(P || Q) = ∫ p(x) log( p(x) / q(x) ) dx
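The discrete definition translates directly into code (stdlib-only sketch; names are illustrative):

```python
import math

def kl_divergence(p, q):
    # Terms with p_i = 0 contribute nothing (0 * log 0 := 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, p))                         # 0.0 for identical distributions
print(kl_divergence(p, q) != kl_divergence(q, p))  # not symmetric
```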
Stochastic Neighbor Embedding (SNE)
• The goal is to minimize the mismatch between p_{j|i} and q_{j|i}.
• Using the Kullback-Leibler divergence, this goal can be achieved by minimizing the cost function:
  C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p_{j|i} log( p_{j|i} / q_{j|i} )
Note that KL(P||Q) is not symmetric!
Problems of SNE
• The cost function is difficult to optimize.
• SNE suffers, as other dimensionality reduction techniques do, from the crowding problem.
t-SNE
• SNE is made symmetric:
  p_ij = (p_{j|i} + p_{i|j}) / 2n
• It employs a Student-t distribution instead of a Gaussian distribution to evaluate the similarity between points in low dimension:
  q_ij = (1 + ||y_i − y_j||²)⁻¹ / Σ_{k≠l} (1 + ||y_k − y_l||²)⁻¹
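The heavy-tailed Student-t similarity (one degree of freedom) used in the low-dimensional space can be sketched as follows (illustrative layout and names):

```python
import numpy as np

def tsne_q(Y):
    # Student-t kernel with one degree of freedom on pairwise distances.
    D2 = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + D2)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()             # one joint distribution over all pairs

rng = np.random.default_rng(8)
Y = rng.normal(size=(15, 2))
Q = tsne_q(Y)
print(np.isclose(Q.sum(), 1.0))        # sums to 1 over all pairs
print(np.allclose(Q, Q.T))             # symmetric by construction
```

The heavy tails let moderately distant points sit far apart in the map, which is what alleviates the crowding problem.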
t-SNE Advantages
• The crowding problem is alleviated.
• Optimization is made simpler.
Experiments
• Comparison with LLE, Isomap and Sammon Mapping.
• Datasets:
  – MNIST dataset
  – Olivetti faces dataset
  – COIL-20 dataset
Comparison figures are from the paper L.J.P. van der Maaten and G.E. Hinton, "Visualizing High-Dimensional Data Using t-SNE", Journal of Machine Learning Research, Vol. 9, pp. 2579-2605, 2008.
MNIST Dataset
• 60,000 images of handwritten digits.
• Image resolution: 28 × 28 (784 dimensions).
• A randomly selected subset of 6,000 images has been used.
MNIST t-SNE
MNIST Sammon Mapping
MNIST LLE
MNIST Isomap
COIL-20 Dataset
• Images of 20 objects viewed from 72 different viewpoints (1440 images).
• Image size: 32 × 32 (1024 dimensions).
COIL-20Dataset
t-SNE
LLE
Isomap
Sammon Mapping
Objects Arrangement
Motivations
• Multidimensional reduction can be used to arrange objects in 2D or 3D preserving pairwise distances (but the final placement is arbitrary).
• Many applications require placing the objects in a set of pre-defined, discrete positions (e.g. on a grid).
Example – Images of Flowers
Random Order
Example – Images of Flowers
Isomap
Example – Images of Flowers
IsoMatch (computed on colors)
Problem Statement
• d(i, j): original pairwise distance.
• d_grid(π(i), π(j)): Euclidean distance in the grid.
• π: permutation assigning objects to grid positions.
The goal is to find the permutation π that minimizes the following energy:
  E(π) = Σ_{i,j} ( d(i, j) − d_grid(π(i), π(j)) )²
IsoMatch – Algorithm
• Step I: Dimensionality Reduction (using Isomap)
• Step II: Coarse Alignment (bounding box)
• Step III: Bipartite Matching
• Step IV (optional): Random Refinement (element swaps)
Algorithm – Step I: Dimensionality Reduction
Algorithm – Step II: Coarse Alignment
Bipartite Matching
• A complete bipartite graph is built (one side with the starting locations, one with the target locations).
• The arc (i, j) is weighted according to the corresponding pairwise distance.
• A minimal bipartite matching is calculated using the Hungarian algorithm.
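What the Hungarian algorithm computes can be illustrated by brute force on a tiny instance (illustrative costs; a real implementation finds the same optimum in O(n³) rather than O(n!)):

```python
import itertools

# cost[i][j]: distance from object i to grid slot j.
cost = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
n = len(cost)

# Minimal bipartite matching = permutation with the smallest total cost.
best = min(itertools.permutations(range(n)),
           key=lambda p: sum(cost[i][p[i]] for i in range(n)))
total = sum(cost[i][best[i]] for i in range(n))
print(best, total)   # (1, 0, 2) 5.0
```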
Algorithm – Step III: Bipartite Matching (graph built)
Algorithm – Step III: Bipartite Matching
Algorithm – Step III: Final Assignment
Average Colors
Word Similarity
PileBars
• A new type of thumbnail bar.
• Paradigm: focus + context.
• Objects are arranged in a small space (images are subdivided into clusters to save space).
• Supports any image-image distance.
• PileBars are dynamic!
PileBars – Layouts
(Figure: slot layouts; slots hold 1, 2, 3, 4 or 12 images.)
PileBars
• Thumbnails are dynamically rearranged, resized and re-clustered adaptively during browsing.
• This is done in a way that ensures smooth transitions.
PileBars – Application Example: Navigation of Registered Photographs
Take a look at http://vcg.isti.cnr.it/photocloud .
Questions?