Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data...

23
HICSS-35: 35th Hawaii International Conference on System Sciences Interactive Visualization and Analysis for Gene Expression Data Chun Tang, Li Zhang and Aidong Zhang Department of Computer Science and Engineering The State University of New York at Buffalo Buffalo, NY 14260 chuntang, lizhang, azhang @cse.buffalo.edu Abstract New technology such as DNA microarray can be used to produce the expression levels of thousands of genes simultaneously. The raw microarray data are images which can be transformed into gene expression matrices where usually the rows represent genes, the columns represent various samples, and the number in each cell characterizes the expression level of the particular gene in a particular sample. Now the cDNA and genomic sequence projects are processing at such a rapid rate, more and more data become available to researchers who are working in the field of bioinformatics. New methods are needed to efficiently and effectively analyze and visualize the gene data. A key step in the analysis of gene expression data is the detection of groups that manifest similar expression patterns and filter out the genes that are inferred to represent noise from the matrix according to samples distribution. In this paper, we present a visualization method which maps the samples’ - dimensional gene vectors into 2-dimensional points. This mapping is effective in keeping correlation coefficient similarity which is the most suitable similarity measure for analyzing the gene expression data. Our analysis method first removes noise genes from the gene expression matrix by sorting genes according to correlation coefficient measure and then adjusts the weight for each remaining gene. We have integrated our gene analysis algorithm into a visualization tool based on this mapping method. We can use this tool to monitor the analysis procedure, to adjust parameters dynamically, and to evaluate the result of each step. The experiments based on two groups of multiple sclerosis (MS) and treatment data demonstrate the effectiveness of this approach. 1

Transcript of Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data...

Page 1: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

HICSS-35: 35th Hawaii International Conference on System Sciences

Interactive Visualization and Analysis for Gene Expression Data

ChunTang,Li ZhangandAidongZhang

Departmentof ComputerScienceandEngineering

TheStateUniversityof New York at Buffalo

Buffalo,NY 14260�chuntang,lizhang,azhang� @cse.buffalo.edu

Abstract

New technologysuchasDNA microarraycanbeusedto producetheexpressionlevelsof thousands

of genessimultaneously. The raw microarraydataare imageswhich can be transformedinto gene

expressionmatriceswhereusuallytherowsrepresentgenes,thecolumnsrepresentvarioussamples,and

thenumberin eachcell characterizestheexpressionlevel of theparticulargenein a particularsample.

Now thecDNA andgenomicsequenceprojectsareprocessingat sucha rapidrate,moreandmoredata

becomeavailableto researcherswhoareworkingin thefield of bioinformatics.New methodsareneeded

to efficiently andeffectively analyzeandvisualizethegenedata.

A key stepin the analysisof geneexpressiondatais the detectionof groupsthat manifestsimilar

expressionpatternsandfilter out thegenesthatareinferredto representnoisefrom thematrixaccording

to samplesdistribution. In this paper, we presenta visualizationmethodwhich mapsthe samples’� -

dimensionalgenevectorsinto 2-dimensionalpoints. This mappingis effective in keepingcorrelation

coefficient similarity which is the mostsuitablesimilarity measurefor analyzingthe geneexpression

data.Our analysismethodfirst removesnoisegenesfrom thegeneexpressionmatrix by sortinggenes

accordingto correlationcoefficient measureandthenadjuststhe weight for eachremaininggene.We

have integratedour geneanalysisalgorithminto a visualizationtool basedon this mappingmethod.We

canusethis tool to monitortheanalysisprocedure,to adjustparametersdynamically, andto evaluatethe

resultof eachstep.Theexperimentsbasedon two groupsof multiple sclerosis (MS) andtreatmentdata

demonstratetheeffectivenessof thisapproach.

1

Page 2: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

1 Introduction

DNA microarraytechnologycanbe usedto measureexpressionlevels for thousandsof genesin a single

experiment,acrossdifferent conditionsand over time [3, 16, 14, 24, 23, 27, 36, 20, 21, 8, 30]. To use

the arrays,labeledcDNA is preparedfrom total messengerRNA (mRNA) of target cells or tissues,and

is hybridizedto the array. The amountof label boundis an approximatemeasureof the level of gene

expression. Thus genemicroarrayscan give a simultaneous,semi-quantitative readouton the levels of

expressionsof thousandsof genes.Just4-6 suchhigh-density“genechips” couldallow rapidscanningof

theentirehumanlibrary for geneswhichareinducedor repressedunderparticularconditions.By preparing

cDNA from cellsor tissuesatintervalsfollowing somestimulus,andexposingeachto replicatemicroarrays,

it is possibleto determinethe identity of genesrespondingto that stimulus,the time courseof induction,

andthedegreeof change.

Somemethodshave beendevelopedusingbothstandardclusteranalysisandnew innovative techniques

to extract,analyzeandvisualizegeneexpressiondatageneratedfrom DNA microarrays.Geneexpression

matrix canbe studiedin two dimensions[2]: comparingexpressionprofilesof genesby comparingrows

in the expressionmatrix [22, 6, 26, 23, 25, 9, 33, 3] and comparingexpressionprofiles of samplesby

comparingcolumnsin thematrix [?, 10, 32]. In addition,bothmethodscanbecombined(providedthatthe

datanormalization[18, 38] allows it).

A key stepin theanalysisof geneexpressiondatais thedetectionof groupsthat manifestsimilar ex-

pressionpatternsandfilter out the genesthat areinferredto representnoisefrom the matrix accordingto

samplesdistribution. Thecorrespondingalgorithmicproblemis to clustermulti-conditiongeneexpression

patterns.Dataclustering[6] wasusedto identify patternsof geneexpressionin humanmammaryepithelial

2

Page 3: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

cellsgrowing in cultureandin primaryhumanbreasttumors.DeRisiet al. [17] useda DNA arraycontain-

ing acompletesetof yeastgenesto studythediauxicshift timecourse.They selectedsmallgroupsof genes

with similar expressionprofilesandshowed that thesegenesarefunctionally relatedandcontainrelevant

transcriptionfactorbinding sitesupstreamof their openreadingframes(ORFs). In [3], a clusteringalgo-

rithm wasintroducedfor analysisof geneexpressiondatain which an appropriatestochasticerror model

on theinput hasbeendefined.Self-organizingmaps[26, 19], a typeof mathematicalclusteranalysisthatis

suitedfor recognizingandclassifyingfeaturesin complex, multidimensionaldata,wasappliedto organize

the genesinto biologically relevant clustersthat suggestnovel hypothesesabouthematopoieticdifferenti-

ation. In [12], theauthorspresenteda strategy for theanalysisof large-scalequantitative gene-expression

measurementdatafrom time-courseexperiments. The correlatedpatternsof geneexpressionfrom time

seriesdatasuggestsan orderthat conformsto a notionof sharedpathwaysandcontrol processesthat can

beexperimentallyverified. Brown et al. [23] applieda methodbasedon the theoryof supportvectorma-

chines(SVMs). The methodis consideredasa supervisedcomputerlearningmethodbecauseit exploits

prior knowledgeof genefunctionto identify unknown genesof similar functionfrom expressiondata.They

appliedthisalgorithmonsix functionalclassesof yeastgeneexpressionmatricesfrom 79samples[22]. Al-

ter et al. [25] usedsingularvaluedecompositionin transforminggenome-wideexpressiondatafrom genes

� arraysspaceto reduceddiagonalized“eigengenes”� “eigenarrays”spaceto extractsignificantgenesby

normalizingandsortingthedata.Hastieet al. [33] proposeda tree harvesting methodfor supervisedlearn-

ing from geneexpressiondata. This techniquestartswith a hierarchicalclusteringof genes,thenmodels

theoutcomevariableasa sumof theaverageexpressionprofilesof chosenclustersandtheir products.The

methodcandiscover genesthathave strongeffectson theirown, andgenesthatinteractwith othergenes.

On sampledimension,our task is to build a classifierwhich can predict the samplelabelsfrom the

3

Page 4: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

expressionprofile. Golubetal. [32] appliedneighborhoodanalysisto constructclasspredictorsfor samples,

especiallyfor leukemias.They werelooking for geneswhoseexpressiondataarebestcorrelatedwith two

known classesof leukemias,acutemyeloid leukemiaandacutelymphoblasticleukemia. They constructed

aweightedvoteclassifierbasedon50genes(outof 6817)using38samplesandappliedit to acollectionof

34new samples.Theclassifiercorrectlypredicted29of the34samples.In [10], theauthorspresentaneural

network modelknown asSimplified FuzzyARTMAP which canidentify normalanddiffuse large B-cell

lymphoma(DLBCL) patientsusingDNA microarraysdatageneratedby apreviousstudy. Many traditional

clusteringalgorithmssuchasthehierarchical[22, 15,5] andK-meansclusteringalgorithms[13, 28] haveall

beenusedfor clusteringexpressionprofiles.MathematicalandstatisticalmethodslikeFourierandBayesian

analysisalsohave beenusedto discover profilesof cell cycle-dependentgenes[31, 37, 1]. Our grouphas

developedamaximumentropy approachto classifyinggenearraydatasets[29]. Weusedpartof pre-known

classesof samplesastraining setandappliedthemaximumentropy modelto generateanoptimal pattern

modelwhichcanbeusedto new samples.

Sampleclusteringhasbeencombinedwith geneclusteringto identify which genesarethemostimpor-

tant for sampleclustering[34, 5]. Alon et al. [34] have applieda partitioning-basedclusteringalgorithm

to study6500genesof 40 tumorand22 normalcolontissuesfor clusteringbothgenesandsamples.Getz

et al. [11] presenta methodappliedon coloncancerandleukemiadata.By identifying relevantsubsetsof

thedata,they wereableto discover partitionsandcorrelationsthatweremasked andhiddenwhenthe full

datasetwasusedin theanalysis.Thismethodis calledtwo-way clustering.

Multiple sclerosis (MS) is achronic,relapsing,inflammatorydisease.Interferon- � ( ������� ) hasbeen

themostimportanttreatmentfor theMS diseasefor thelastdecade[35]. TheDNA microarraytechnology

makesit possibleto studytheexpressionlevelsof thousandsof genessimultaneously. Thegeneexpression

4

Page 5: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

levels aremeasuredby the intensity levels of the correspondingarrayspots. In this paper, we presenta

visualizationmethodwhich mapsthesamples’ -dimensiongenevectorsinto 2-dimensionalpoints. This

mappingis effective in keepingcorrelationcoefficient similarity which is themostsuitablesimilarity mea-

surefor analyzinggeneexpressiondata.Ourgeneexpressiondataanalysismethodfirst removesnoisegenes

from theexpressionmatrix by sortinggenesaccordingto correlationcoefficient measureandthenadjusts

theweightfor eachremaininggene.Wehave integratedourgeneanalysisalgorithminto avisualizationtool

basedonthismappingmethod.Wecanusethis tool to monitortheanalysisprocedure,to adjustparameters,

andto evaluatetheresultof eachstep.Theexperimentsonthehealthycontrol,MS andIFN-treatedsamples

basedon the datacollectedfrom the DNA microarrayexperimentsdemonstratethe effectivenessof this

approach.

This paperis organizedasfollows. Section2 introducesthevisualizationmethod.Section3 describes

thedetailsof our genedataanalyzingapproachwith thehelpof our visualizationtool. Section4 presents

theexperimentalresults.And finally, theconclusionis providedin Section5.

2 Visualization Tool

2.1 Mapping Method

A typical geneexpressionmatrix hasthousandsof rows (eachrow representsa gene)andseveral (usually

lessthan100)columnswhichrepresentsamplesrelatedto acertainkind of diseaseor othercondition.Each

sampleis marked with a label thatpointsout which classit belongsto, suchascontrol, patientanddrug-

treated.While we analyzethesesamples,we usuallywantto view thesampledistribution duringeachstep.

How to visualizearbitrarily largedimensionaldataeffectively andefficiently is still anopenproblem.The

parallelcoordinatesystemallows thevisualizationof multidimensionaldataby a simpletwo dimensional

5

Page 6: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

Figure1: Distributionof ourgeneexpressiondataof 44samples,whereeachsampleshas4132genes.(A) Visualiza-

tion usingtheparallelcoordinatesystem,wherethehorizontalaxisrepresentsgenedimensionandeachmultidimen-

sionalline representasample.Thesampledistribution is not clear. (B) Visualizationusingour tool, whereeachpoint

representsa sample.Thefigureshows thataftermappingoriginal datainto 2-dimensionalspace,sampledistribution

is clearlyrepresented.

representation.But asthedimensionsgohigher, thedisplayingis lesseffective. Figure1 shows anexample

of differentvisualizationmethods.

We usethe ideaof a linear mappingmethodthatmapsthe multidimensionaldatasetto 2-dimensional

space[7]. Let vector �������������������� �!��"#"#"$���&%('representadatapoint in -dimensionalspace,andtotalnumber

of pointsin thespaceis ) , denotedas� � �&� � ��"#"#"#�&��*

.

WeusetheFormula(1) to map �� �into a2-dimensionalpoint ��,+�

:

��,+� � %-. / � ��0.21 ��354 ' 1 ��� � . '�' �6

. �(1)

where0 .

is anadjustableweightfor eachcoordinate, is vectorlengthof theoriginal space,354 is a ratio

to centralizethepoints,and �6. ��78�:9!�<;=��"#"#"#� '

areunit vectorswhichdivide thecentercircle of thedisplay

screenequally(Figure2 (B)).

6

Page 7: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

Figure2: Mappingfrom � -dimensionalspaceto 2-dimensionalspace.(A) A datapoint > in theoriginal space.(B)>@? is thecorrespondingpoint usingmappingfunction(Formula(1)) in the2-dimensionaldisplayingspace.Thered

point markedas(0,0) is thecenterof thedisplayingscreen, AB(CEDGFIHKJMLONPL<QRQSQRL ��T areunit vectorswhich dividedtheunit

circleof thedisplayscreenequally.

Our initial settingis0 . �VUW"YX[Z]\M^`_5aba�7c�:9!�<;=��"#"#" , which meanseachcoordinateof theoriginal space

contributesequallyin the initial mapping.Underthis setting,we caneasilyfigureout that point (0,0,...0)

in theoriginal -dimensionalspacewill bemappedto (0,0)which is thecenterof 2-dimensionaldisplaying

spacebasedon mappingfunction (Formula (1)). In addition,a point which hasthe format of (a,a,.....a)

will alsobe mappedto thecenter(Figure3 (A)). Anotherpropertyunderthe initial settingis keepingthe

correlationcoefficient simiarlty of the original datavectors. Becausethe correlationcoefficient hasthe

advantageof dependingonly on shapebut not on theabsolutemagnitudeof thespatialvector, it is a better

similarity measurethan Euclideandistance[32, 4]. The formula of correlationcoefficient betweentwo

vectors �d and �e is:

f \g� �dh� �ei'j� 1 �lk %. / � � . 1cm . ' �lk %. / � � . ' 1 �lk %. / � m . 'n o

1 � k %. / � � �. ' � k %. / � � . ' �qp o 1 � k %. / � m �. ' � k %. / � mr. ' �qp � (2)

where

�ds����� � ��� � ��"#"#"#��� % ' �7

Page 8: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

Figure3: Somepropertiesof our mappingfunction. (A) Showsevery point whosecoordinatesin theoriginal space

canberepresentedas(a,a,.....a)will bemappedto thecenterof the2-dimensionaldisplayingspace.(B) Showspoints

thathave thesamepattern,which meansratiosof eachpairsof coordinatesin theoriginal spaceareall equal,will be

mappedontoa straightline acrossthecenterin the2-dimensionaldisplayingspace.

8

Page 9: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

�e���� m � � m � ��"#"#"#� m % ' �%- . / � �

.�1tmr. �vuxw ) \rZzy�{�|~}�^r\!�Pw f y�u�\rZ[}2_�7l^P|x��u f _=a�_5^��%- . / � �

. �vuxw ) \rZz��u f _=ab_�^��%- . / �

mr. �vuxw ) \rZ m u f _=ab_�^��%-. / � � �

. �vuxw ) \!Z�ux��w2_�^P|�����u f _5ab_5^M�%- . / �

m �. �vuxw ) \!Z�ux��w2_�^P|�� m u f _5ab_5^M"If �d and �e have thesamepattern,which meansratiosof eachpairsof coordinatesof �d and �e areall

equal(Equation(3)), their correlationcoefficient valueswill be9. Usingour mappingfunction, thesetwo

vectorswill bemappedontoa straightline acrossthecenterin the2-dimensionaldisplayingspace,andall

othervectorswhichhave thesamepatternas �d and �e will all bemappedontothatline (Figure3 (B)), even

if their Euclideandistancesin theoriginal spacearevery large.

�I�m � � ���m � � ���m � ��"#"#"=� ��%m % "(3)

2.2 Parameter Adjustment

Our visualizationtool allows theuserto adjusttheweightof eachcoordinatefrom 9to

9to changedata

distribution in thedisplayingspace,bothmanuallyandautomatically. Mappingfrom a higherdimensional

spaceto a lower dimensionalspacemay not preserve all the propertiesof the dataset.By adjustingthe

coordinateweightsof the dataset,data’s original staticstateis changedinto dynamicstatewhich may be

usedto compensatethe information lossfrom mapping. For example,two points��9�UrUW��9�UrUW��"#"#"#��9�UrU�'

and

�bUW�&UW��"#"#"#"�U�'arefar away in theoriginal space,but by theinitial setting,they arebothmappedinto thecenter

9

Page 10: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

of the 2-dimensionaldisplayingscreen. Whenever any weight0 .

in Formula (1) is changed,thesetwo

pointswill beseparated.That is,�bUW�&UW��"#"#"#"�U�'

will bestill at thecenterbut��9�UrUW��9�UrUW��"#"#"#��9�UrU�'

will no longer

bemappedto thecenter. Figure4 showsanotherexamplewheretheoriginaldatasethas268points(vectors)

whichcanbedividedinto two clusters,markedashollow redcirclesandfilled bluecircles.While mapping

basedon the initial setting,the clusterboundaryis not clearenough:somepartsof the two clustersare

overlapped.After changingtheweightsof somecoordinates,thedatadistribution is alsochanged,andthe

two clustersareseparatedfrom eachother.

Sincethe original dimensionsmay be very high, suchas several thousands,manuallychangingthe

weight for eachdimensionto find the bestcombinationis impractical. That is the reasonwhy our tool

supportsautomaticchangingof weights.Theuseronly needsto settheadjustmentdirection(ascendingor

descending)andchangingstepfor eachdimension.The tool will performan animationto show the data

distribution while theweightsarechangedautomatically. Theusercanobtainall theweightswhentheideal

distribution is reached.

3 Data Analyzing Approach

Basedon thegeneexpressionmatrixwhich hasthousandsof genesandseveraldozensof samples,our task

is to build aclassifierto predictthesamplelabels(classes)from theexpressionprofileusingtheinformation

alreadyknown from the experiment,suchasdiseased/normalattributesof the samples.A very common

methodis [32] first to reducethe numberof genesin thegenedimension,which meansto find important

genesthat aremorerelatedto suchkind of idealizedpatterns;thento assignsomeweightsfrom the first

stepto the remaininggenesto constructa “classpredictor”. How many genesaredeletedis usuallyfrom

experience.Researchersusuallygiveafixednumberto everymatrix,but thebestnumbermightbedifferent

10

Page 11: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

Figure4: Effect of weightsadjustment.Left circlesshow thedatadistribution in displayingspace,right sideslides

show theweightadjustmentenvironment.(A) Showsmappingresultof a datasetusingtheinitial setting.Thedataset

includetwo clusters.We markedashollow redcirclesandfilled bluecircles.(B) After changingtheweightsof some

coordinates,thedatadistribution is alsochanged,andtwo clustersareseparatedfrom eachother.

11

Page 12: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

Figure5: Differentkindsof genepatterns.Assumethatsampleshave two classes.If we wantto constructa “class

predictor” from thegenematrix, geneshave patternlike (a) give thecorrectinformationwhich arecalled“important

gene”.Pattern(b) meansuselessgenes.Pattern(c) is noisein thedataset.Soour taskis to select(a), remove(b) and

(c) from thesetof genes.

for eachdataset.How to choosethebestweightis anotheropenproblem.

3.1 Normalization

In thegeneexpressionmatrix,differentgeneshavedifferentrangesof intensityvalues.Theintensityvalues

alonemaynot have significantmeaning,but the relative changinglevels aremoreintrinsic. Sowe should

first normalizetheoriginal geneintensityvaluesinto relative changinglevels.Ourgeneralformulais

�[� � . �:��� � . �� �1 y&'�4W� � �

1 y&'�� 0�{�|�^P| � � � -.��r� � � . 4g� 6 �=�(4)

where��� � .

denoteschanginglevel for gene� of sample7,� � .

representsthe original intensityvaluefor

gene� of sample7,y

is aparameter, and � � is themeanof theintensityvaluesfor gene� for all samples6

.

12

Page 13: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

3.2 Selecting Important Genes

Notice that amongthousandsof genes,not all of them have the samecontribution in distinguishingthe

classes.Actually, somegeneshave little contribution or just representnoisefrom thematrix (Figure5). We

needto remove thosegenes.

Assumingthereare genesand ) samples,whereeachgenevector(afternormalization)is denotedas

dz����������������� �!��"#"#"#��� �<� ��� ���R���~�E� ��"#"#"#����� * '�� 0`{�|�^P| � ��9!�<;=��"#"#" Z]\�^�|�_ f { � | |"(5)

Without losinggenerality, we assumethefirst � samplesbelongto oneclasswhile the remainingsamples

belongto anotherclass.Theidealgenewhichis highlycorrelatedwith thesamplesdistributionshouldmatch

thepattern �������9!��9!��"#"#"�9!�&UW��"#"#"#�&U='(first � numberis “

9” followedby

� )��� ' “U”) or �� ���bUW�&UW��"#"#"�UW��9!��"#"#"#��9M'

(first � numberis “U” followedby

� ):�� ' “9”). Wecalculatecorrelationcoefficient (Formula(2)) between

eachgenevectorandthepre-definedstablepattern �� , thensortgenesusingthesecorrelationcoefficientsby

a descendingsequencewhich exactly matchestheascendingsequenceif sortingby correlationcoefficients

with anotherpattern �� . We know a certainnumberof genesfrom thetop andthebottomof this sequence

shouldbe chosenasthe “important genes”,but the numberis usuallydecidedfrom the experience.Also

thebestnumberfor differentdatasetsis different.Usingour tool, we canconductflexible judgment.First,

we mapthewholedatasetinto 2-dimensionaldisplayingspace.We thenremove “unimportant”genesfrom

themiddleof thesequenceoneby one. Whentheboundaryof theclustersof thesamplescouldbeclearly

separated,we stopthis procedure.The remaininggenesarechosenas“important” genes(Figure6). The

whole procedureis integratedinto our visualizationtool. Becauseour mappingfunctionhastheproperty

of preservingcorrelationcoefficient similarity, sampleswith similar patternswill bemappedcloseto each

other.

13

Page 14: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

Figure6: Procedureof removing “unimportant”genesbasedon a geneexpressiondataof 28 sampleswhich belong

to two clusters. (A) Shows original distribution of 28 samplesmappingfrom 4132 genesvectors. (B) Samples

distribution reducedto 120genes.(C) Samplesdistribution reducedto 70 genes.Thereis a clearboundarybetween

two clusters.

3.3 Weight Adjustment

After selectinga subsetof importantgenes,thenext stepis to build a classpredictorbasedon thesegenes.

Usuallythispredictorhastheformatof:

��0 � � � ��0 � � � ��"#"#"#��0 � � � ' � (6)

where�.is thevalueof theselectedgeneof thesampleto classifyand

0 .is weightof theselectedgene.

Now theproblemishow to decidetheweightof eachgene.Wecandirectlyusethecorrelationcoefficient

with thepattern �������9!��9!��"#"#"�9!�&UW��"#"#"#�&U�'or �� ���bUW�&UW��"#"#"�UW��9!��"#"#"#��9x'

astheweight,or getfrom theparameter

adjustmentfunction of our tool. By usingour tool, we canmanuallyadjustthe weightsof eachgeneas

illustratedin Figure4,or let thetool to performautomaticadjustment.For automaticadjustment,wepresent

ameasurethatevaluatesthequalityof thedistribution of two clusters,andusethismeasureto decidewhen

to stoptheadjustmentprocedure.

14

Page 15: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

First wedefinethecenter of eachsampleclusteras ��6usingFormula(7):

��6 �:� f u � � f u � ��"#"#"#� f u % ' � 0`{�|�^P| f u . � k¡  �!� �   . "

(7)

Thenfor eachsamplecluster, we definea value ¢ � 6 'to measurethegatherdegreeof all thepointsin

thisclusterusingFormula(8):

¢ � 6 'j� £¤¤¥ 9)¦ 9 -  �!� � �u ��,6 � � � £¤¤¥ 9

)¦ 9 -  �r�%- . / � �§u

. f u . ' � � (8)

where) ��� 6 �is thenumberof samplesin this cluster.

This measureis similar to thestandarddeviation in onedimensionalspace.If thedistribution of points

in samplecluster6

is sparse,¢ � 6 'will be large,otherwiseit is small. If we have two clustersof samples

denotedas6 �

and6 �

, wehopeeachclustergathertogetheraswell asthedistancebetweenthetwo clusters

is aslarge aspossible.Sowe presentanothermeasureto describethequality of theclusterdistribution as

Formula(9):

f � ¢ � 6 9x'~¨ ¢ � 6 ;P'� �,6 � �6 � � �(9)

where�6 �

is thecenterof6 �

,�6 �

is thecenterof6 �

, and� �6 � �,6 � �

is thedistancebetween�,6 �

and

�6 �.

We hopethedistancebetweenthetwo clusterscanbeaslargeaspossibleandthesparsedegreecanbe

aslow aspossible.Sothebestdistribution is reachedwhen f valueis thesmallest.We applythis measure

on the2-dimensionaldisplayingspaceto evaluatethesamplesdistribution while adjustingweights.Notice

thatto try thecombinationsof all theweightsis impracticalbecauseof exponentialtime complexity. What

wecando is whenthepointsdistribution reachesa local lowestvalueof f , westoptheadjustprocedureand

setcombinationof theweights.

15

Page 16: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

Figure7: Experimentresultfor MS IFN groupandCONTROL MS group.Redcirclesdenotesamplesbelongto MS

group,greencirclesrepresentsamplesin IFN groupwhile bluecirclesmeanCONTROL samples.Samplespointed

by arrowsarewronglyclassified.Straightlinesacrossthebig circledenote“classpredictor”whicharebuilt usingthe

cross-validationmethod.

4 Experimental Results

Theexperimentsarebasedontwodatasets:theMS IFN groupandtheCONTROL MS group.TheMS IFN

groupcontains28 samples(14MS samplesand14 IFN samples)while theCONTROL MS groupcontains

30 samples(15controlsamplesand15 MS samples).Eachsamplehas4132genes.

Duringthedataprocessingprocedure,by sortinggenesusingthecorrelationcoefficientsof Formula(2),

we select70 genes(Figure6 (C)) for MS IFN groupand58 genesfor CONTROL MS group. We then

adjustweightsfor thegenesin thesetwo groupsusingautomaticadjustmentfunctionin our tool.

We useK-meansclusteringmethodto evaluatethe “weightedclasspredictor” which we get from the

analyzingprocedure.Herewe choosethe cross-validation method[32] to evaluateeachgroup. In each

16

Page 17: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

group,choosea sample,usethe remainingsamplesof this groupto selectimportantgenes,andget class

predictor. Thenpredicttheclassof thewithheldsample.Theprocessis repeatedfor eachsample,andthe

cumulative error rateis calculated.For MS I group,samplesin theIFN groupwereall predictedcorrectly

but onesamplein theMS groupwasincorrectlyclassified.For theCONTROL MS group,samplesin the

MS groupwereall predictedcorrectly, but five samplesin theCONTROL groupwerewrongly classified.

Figure7 shows theevaluationresultof our “weightedclasspredictor”for thesetwo groupsusingthecross-

validationmethod.

5 Conclusion

In this paper, we presenteda visualizationmethodwhich mapsthe samples’ -dimensionalgenevectors

into 2-dimensionalpoints. This mappingis effective in keepingcorrelationcoefficient similarity which is

themostsuitablesimilarity measurefor analyzinggeneexpressiondata.Our analysismethodfirst removes

noisegenesfrom thegeneexpressionmatrix by sortinggenesaccordingto correlationcoefficient measure,

andthenadjuststheweightfor eachremaininggene.Wealsopresentedameasureto judgethequalityof the

clusterdistribution. We have integratedour geneanalysisalgorithminto a visualizationtool basedon the

mappingmethod.Wecanusethis tool to monitortheanalysisprocedure,to adjustparametersdynamically,

andto evaluatethe resultof eachstepof adjustment.Our approachtakes the advantageof dataanalysis

anddynamicvisualizationmethodsto revealcorrelatedpatternsof geneexpressiondata. In particular, we

usedthe above approachto distinguishthe healthycontrol, MS, IFN-treatedsamplesbasedon the data

collectedfrom DNA microarrayexperiments.Fromourexperiments,wedemonstratedthatthisapproachis

apromisingapproachto beusedfor analysisandvisualizationof genearraydatasets.

17

Page 18: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

References

[1] A. Ben-Dor, N. Friedman,andZ. Yakhini. Classdiscovery in geneexpressiondata. In Proc. Fifth

Annual Inter. Conf. on Computational Molecular Biology (RECOMB 2001), 2001.

[2] Alvis BrazmaandJaakVilo. Minireview: Geneexpressiondataanalysis. Federation of European

Biochemical societies, 480:17–24,June2000.

[3] Amir Ben-Dor, Ron Shamirand Zohar Yakhini. Clusteringgeneexpressionpatterns. Journal of

Computational Biology, 6(3/4):281–297,1999.

[4] Anna Jorgensen. Clusteringexcipient near infrared spectrausing different chemometricmethods.

Technicalreport,Dept.of Pharmacy, Universityof Helsinki,2000.

[5] Ash A. Alizadeh,Michael B. Eisen,R. Eric Davis, Chi Ma, Izidore S. Lossos,AdreasRosenWald,

JenniferC. Boldrick, HajeerSabet,Truc Tran,Xin Yu, JohnI. Powell, Liming Yang,GeraldE. Marti

et al. Distinct typesof diffuselargeb-cell lymphomaidentifiedby geneexpressionprofiling. Nature,

Vol.403:503–511,February2000.

[6] CharlesM. Perou,StefanieS.Jeffrey, Matt VanDe Rijn, ChristiaA. Rees,MichaelB. Eisen,Douglas

T. Ross,AlexanderPergamenschikov, Cheryl F. Williams, Shirley X. Zhu, Jeffrey C. F. Lee, Deval

Lashkari,Dari Shalon,Pat rick O. Brown, andDavid Bostein.Distinctive geneexpressionpatternsin

humanmammaryepithelialcells andbreastcancers.Proc. Natl. Acad. Sci. USA, Vol. 96(16):9212–

9217,August1999.

18

Page 19: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

[7] D. BhadraandA. Garg. An interactive visual framework for detectingclustersof a multidimensional

dataset.TechnicalReport2001-03,Dept.of ComputerScienceandEngineering,UniversityatBuffalo,

NY., 2001.

[8] D. Shalon,S.J.Smith,P.O. Brown. A DNA microarraysystemfor analyzingcomplex DNA samples

usingtwo-colorfluorescentprobehybridization.Genome Research, 6:639–645,1996.

[9] ElisabettaManduchi,Gregory R. Grant,StevenE. McKenzie,G. ChristianOverton,SaulSurrey and

ChristianJ.Stoeckert Jr. Generationof patternsform geneexpressiondataby assigningconfidenceto

differentiallyexpressedgenes.Bioinformatics, Vol. 16(8):685–698,2000.

[10] FranciscoAzuajeDepartment.Makinggenomeexpressiondatameaningful:Predictionanddiscovery

of classesof cancerthrougha connectionistlearningapproach,2000.

[11] GadGetz,Erel Levine andEytanDomany. Coupledtwo-way clusteringanalysisof genemicroarray

data.Proc. Natl. Acad. Sci. USA, Vol. 97(22):12079–12084, October2000.

[12] G.S.Michaels,D.B. Carr, M. Askenazi,S. Fuhrman,X. WenandR. Somogyi. ClusterAnalysisand

datavisualizationof large-scaleexpressiondata.In Pac Symposium of Biocomputing, volume3, pages

42–53,1998.

[13] HartiganJ.A. Clustering Algorithm. JohnWiley andSons,New York., 1975.

[14] J. DeRisi,L. Penland,P.O. Brown, M.L. Bittner, P.S.Meltzer, M. Ray, Y. Chen,Y.A. Su,J.M. Trent.

Useof a cDNA microarrayto analysegeneexpressionpatternsin humancancer. Nature Genetics,

14:457–460,1996.

19

Page 20: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

[15] Javier Herrero,Alfonso Valencia,andJoaquinDopazo. A hierarchicalunsupervisedgrowing neural

network for clusteringgeneexpressionpatterns.Bioinformatics, 17:126–136,2001.

[16] J.J.Chen,R.Wu, P.C.Yang,J.Y. Huang,Y.P. Sher, M.H. Han,W.C.Kao,P.J.Lee,T.F. Chiu,F. Chang,

Y.W. Chu,C.W. Wu,K. Peck.Profilingexpressionpatternsandisolatingdifferentiallyexpressedgenes

by cDNA microarraysystemwith colorimetrydetection.Genomics, 51:313–324,1998.

[17] J.L.DeRisi,V.R. Iyer andP.O.Brown. Exploringthemetabolicandgeneticcontrolof geneexpression

on agenomicscale.Science, pages680–686,1997.

[18] JohannesSchuchhardt,Dieter Beule,Arif Malik, Eryc Wolski, Holger Eickhoff, HansLehrachand

HanspeterHerzel. Normalizationstrategies for cDNA microarrays. Nucleic Acids Research, Vol.

28(10),2000.

[19] T. Kohonen.Self-Organization and Associative Memory. Spring-Verlag,Berlin, 1984.

[20] M. Schena,D. Shalon,R.W. Davis, P.O. Brown. Quantitative monitoringof geneexpressionpatterns

with acomplementaryDNA microarray.Science, 270:467–470,1995.

[21] Mark Schena,Dari Shalon,RenuHeller, Andrew Chai, Patrick O. Brown, and RonaldW. Davis.

Parallelhumangenomeanalysis:Microarray-basedexpressionmonitoringof 1000genes.Proc. Natl.

Acad. Sci. USA, Vol. 93(20):10614–10619, October1996.

[22] MichaelB. Eisen,PaulT. Spellman,PatrickO.Brown andDavid Botstein.Clusteranalysisanddisplay

of genome-wideexpressionpatterns.Proc. Natl. Acad. Sci. USA, Vol. 95:14863–14868,1998.

20

Page 21: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

[23] MichaelP. S.Brown,William NobleGrundy, David Lin, Nello Cristianini,CharlesSugnet,TerrenceS.

Furey, ManuelAresandJr.David Haussler.Knowledge-basedanalysisof microarraygeneexpression

datausingsupportvectormachines.Proc. Natl. Acad. Sci., 97(1):262–267,January2000.

[24] O. Ermolaeva, M. Rastogi,K.D. Pruitt, G.D. Schuler, M.L. Bittner, Y. Chen,R. Simon,P. Meltzer,

J.M.Trent,M.S.Boguski.Datamanagementandanalysisfor geneexpressionarrays.Nature Genetics,

20:19–23,1998.

[25] Orly Alter, Patrick O. Brown and David Bostein. Singularvalue decompositionfor genome-wide

expressiondataprocessingand modeling. Proc. Natl. Acad. Sci. USA, Vol. 97(18):10101–10106,

Auguest2000.

[26] Pablo Tamayo,DonnaSolni,mJill Mesirov, Qing Zhu, SutisakKitareewan, EthanDmitrovsky, Eric

S. LanderandTodd R. Golub. Interpretingpatternsof geneexpressionwith self-organizingmaps:

Methodsandapplicationto hematopoieticdifferentiation.Proc. Natl. Acad. Sci. USA, Vol. 96(6):2907–

2912,March1999.

[27] R.A. Heller, M. Schena,A. Chai, D. Shalon,T. Bedilion, J. Gilmore, D.E. Woolley, R.W. Davis.

Discovery andanalysisof inflammatorydisease-relatedgenesusingcDNA microarrays.Proc. Natl.

Acad. Sci. USA, 94:2150–2155,1997.

[28] S. Tavazoie,D. Hughes,M.J. Campbell,R.J. Cho andG.M. Church. Systematicdeterminationof

geneticnetwork architecture.Nature Genet, pages281–285,1999.

[29] ShumeiJiang,ChunTang,Li ZhangandAidong Zhang, Murali Ramanathan.A maximumentropy

approachto classifyinggenearraydatasets.In Proc. of Workshop on Data mining for genomics, First

SIAM International Conference on Data Mining, 2001.

21

Page 22: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

[30] S.M. Welford, J. Gregg, E. Chen,D. Garrison,P.H. Sorensen,C.T. Denny, S.F. Nelson. Detection

of differentially expressedgenesin primary tumor tissuesusingrepresentationaldifferencesanalysis

coupledto microarrayhybridization.Nucleic Acids Research, 26:3059–3065,1998.

[31] SpellmanP.T., SherlockG., ZhangM.Q., Iyer V.R.,AndersK., EisenM.B., Brown P.O.,BotsteinD.,

FutcherB. . Exploringthemetabolicandgeneticcontrolof geneexpressiononagenomicscale.Mol.

Biol. Cell, page3273,1998.

[32] T.R. Golub, D.K. Slonim, P. Tamayo,C. Huard,M. Gassenbeek,J.P. Mesirov, H. Coller, M.L. Loh,

J.R.Downing, M.A. Caligiuri, D.D. BloomfieldandE.S.Lander. Molecularclassificationof cancer:

Classdiscovery andclasspredictionby geneexpressionmonitoring. Science, Vol. 286(15):531–537,

October1999.

[33] Trevor Hastie,RobertTibshirani,David BoststeinandPatrick Brown. Supervisedharvestingof ex-

pressiontrees.Genome Biology, Vol. 2(1):0003.1–0003.12,January2001.

[34] U. Alon, N. Barkai,D.A. Notterman,K.Gish,S.Ybarra,D. Mack andA.J. Levine. Broadpatternsof

geneexpressionrevealedby clusteringanalysisof tumorandnormalcolontissuesprobedby oligonu-

cleotidearray. Proc. Natl. Acad. Sci. USA, Vol. 96(12):6745–6750, June1999.

[35] V. Yong,S. Chabot,Q. Stuve andG. Williams. Interferonbetain thetreatmentof multiple sclerosis:

mechanismsof action.Neurology, 51:682–689,1998.

[36] V.R. Iyer, M.B. Eisen,D.T. Ross,G. Schuler, T. Moore, J.C.F. Lee, J.M. Trent, L.M. Staudt,Jr. J.

Hudson,M.S.Boguski,D. Lashkari,D. Shalon,D. Botstein,P.O.Brown. Thetranscriptionalprogram

in theresponseof humanfibroblaststo serum.Science, 283:83–87,1999.

22

Page 23: Interactive Visualization and Analysis for Gene Expression ... · the details of our gene data analyzing approach with the help of our visualization tool. Section 4 presents the experimental

[37] Y BarashandN Friedman.Context-specificbayesianclusteringfor geneexpressiondata. Bioinfor-

matics, RECOM01, 2001.

[38] YangY.H.,DudoitS.,LuuP. andSpeedT. P. Normalizationfor cDNA MicroarrayData.In Proceedings

of SPIE BiOS 2001, SanJose,California,January2001.

23