Data-Driven Regionalization of Housing Markets

19
Data-Driven Regionalization of Housing Markets Marco Helbich, Wolfgang Brunauer, Julian Hagenauer, and Michael Leitner Institute of Geography, University of Heidelberg Credit Risk Methods Development, Strategic Risk Management & Control, Bank Austria—Member of UniCredit Group Department of Geography and Anthropology, Louisiana State University This article presents a data-driven framework for housing market segmentation. Local marginal house price surfaces are investigated by means of mixed geographically weighted regression and are reduced to a set of principal component maps, which in turn serve as input for spatial regionalization. The out-of-sample prediction error of a hedonic pricing model is applied to determine a “near-optimal” number of spatially coherent and homogeneous submarkets. The usefulness of this method is demonstrated with a detailed data set for the Austrian housing market. The results provide evidence that submarkets must always be considered, however they are defined, and that the proposed submarket taxonomy on a regional level significantly improves predictive quality compared to (1) a traditional pooled model, (2) a model that uses an ad hoc submarket definition based on administrative units, and (3) a model incorporating an alternative submarket definition on the basis of aspatial k-means clustering. Moreover, it is concluded that the Austrian housing market is characterized by regional determinants and that geography is the most important component determining the house prices. Key Words: Austria, hedonic modeling, mixed geographically weighted regression, prediction accuracy, real estate, spatial regionalization. , , , , , , , , , , , : (1) ; (2) ; (3) : , : , , , , , Este art´ ıculo presenta una estructura controlada por datos sobre la segmentaci´ on del mercado de vivienda. Las superficies del precio de la vivienda marginal a nivel local se investigan por medio de regresi´ on ponderada geogr´ aficamente mixta, superficies de las cuales se deriva un conjunto de mapas de componentes principales, que a la vez contribuyen como insumo a la regionalizaci´ on espacial. La predicci´ on de error por fuera de muestra de un modelo hed´ onico de precios se aplica para determinar el n´ umero “cercano a lo ´ optimo” de sub-mercados espacialmente coherentes y homog´ eneos. La utilidad de este m´ etodo se demuestra a trav´ es de un conjunto pormenorizado de datos del mercado de vivienda austriaco. Los resultados ponen en evidencia que los sub- mercados siempre deben ser tomados en cuenta, sin importar c´ omo se les defina, y que la taxonom´ ıa del sub-mercado propuesta a nivel regional mejora significativamente la cualidad predictiva en comparaci´ on con (1) un modelo mancomunado tradicional, (2) un modelo que utilice una definici´ on ad hoc del sub-mercado basada en unidades administrativas, y (3) un modelo que incorpore una definici´ on alternativa de sub-mercado con base en agrupamiento aespacial de k-medios. Adem´ as, se concluye que el mercado de vivienda austriaco est´ a caracterizado por determinadores regionales y que la geograf´ ıa es el componente m´ as importante para la determinaci´ on de los precios de las casas. Palabras clave: Austria, modelizaci´ on hed´ onica, regresi´ on ponderada geogr´ aficamente mixta, exactitud de la predicci´ on, finca ra´ ız, regionalizaci´ on espacial. I n real estate research, house prices are usually mod- eled by hedonic regression models, where struc- tural attributes of the house, its neighborhood, and its location serve as explanatory variables (Bourassa, Cantoni, and Hoesli 2007). Because housing is treated as a heterogeneous good, hedonic price theory is appli- cable, where an object is valued for its utility-bearing characteristics and its price is decomposed into its Annals of the Association of American Geographers, 10X(X) XXXX, pp. 1–19 C XXXX by Association of American Geographers Initial submission, October 2011; revised submission, March 2012; final acceptance, May 2012 Published by Taylor & Francis, LLC. Downloaded by [Vienna University Library] at 11:40 04 September 2012

Transcript of Data-Driven Regionalization of Housing Markets

Page 1: Data-Driven Regionalization of Housing Markets

Data-Driven Regionalization of Housing MarketsMarco Helbich,∗ Wolfgang Brunauer,† Julian Hagenauer,∗ and Michael Leitner‡

∗Institute of Geography, University of Heidelberg†Credit Risk Methods Development, Strategic Risk Management & Control, Bank Austria—Member of UniCredit Group

‡Department of Geography and Anthropology, Louisiana State University

This article presents a data-driven framework for housing market segmentation. Local marginal house pricesurfaces are investigated by means of mixed geographically weighted regression and are reduced to a set ofprincipal component maps, which in turn serve as input for spatial regionalization. The out-of-sample predictionerror of a hedonic pricing model is applied to determine a “near-optimal” number of spatially coherent andhomogeneous submarkets. The usefulness of this method is demonstrated with a detailed data set for the Austrianhousing market. The results provide evidence that submarkets must always be considered, however they aredefined, and that the proposed submarket taxonomy on a regional level significantly improves predictive qualitycompared to (1) a traditional pooled model, (2) a model that uses an ad hoc submarket definition basedon administrative units, and (3) a model incorporating an alternative submarket definition on the basis ofaspatial k-means clustering. Moreover, it is concluded that the Austrian housing market is characterized byregional determinants and that geography is the most important component determining the house prices. KeyWords: Austria, hedonic modeling, mixed geographically weighted regression, prediction accuracy, real estate, spatialregionalization.

�������������������������������, �����������������,

�������������, ���������������������������������, ��

������������� “�����”��,�����������������, ���������

��������,�����������,������������� ��,������������,

���������,������������: (1)�������; (2)�����������������

����; (3)������������������������������:�������������

�����,��������������������:���,������,��������,�����

�,���,������

Este artıculo presenta una estructura controlada por datos sobre la segmentacion del mercado de vivienda. Lassuperficies del precio de la vivienda marginal a nivel local se investigan por medio de regresion ponderadageograficamente mixta, superficies de las cuales se deriva un conjunto de mapas de componentes principales,que a la vez contribuyen como insumo a la regionalizacion espacial. La prediccion de error por fuera de muestrade un modelo hedonico de precios se aplica para determinar el numero “cercano a lo optimo” de sub-mercadosespacialmente coherentes y homogeneos. La utilidad de este metodo se demuestra a traves de un conjuntopormenorizado de datos del mercado de vivienda austriaco. Los resultados ponen en evidencia que los sub-mercados siempre deben ser tomados en cuenta, sin importar como se les defina, y que la taxonomıa delsub-mercado propuesta a nivel regional mejora significativamente la cualidad predictiva en comparacion con(1) un modelo mancomunado tradicional, (2) un modelo que utilice una definicion ad hoc del sub-mercadobasada en unidades administrativas, y (3) un modelo que incorpore una definicion alternativa de sub-mercadocon base en agrupamiento aespacial de k-medios. Ademas, se concluye que el mercado de vivienda austriacoesta caracterizado por determinadores regionales y que la geografıa es el componente mas importante parala determinacion de los precios de las casas. Palabras clave: Austria, modelizacion hedonica, regresion ponderadageograficamente mixta, exactitud de la prediccion, finca raız, regionalizacion espacial.

In real estate research, house prices are usually mod-eled by hedonic regression models, where struc-tural attributes of the house, its neighborhood, and

its location serve as explanatory variables (Bourassa,

Cantoni, and Hoesli 2007). Because housing is treatedas a heterogeneous good, hedonic price theory is appli-cable, where an object is valued for its utility-bearingcharacteristics and its price is decomposed into its

Annals of the Association of American Geographers, 10X(X) XXXX, pp. 1–19 C© XXXX by Association of American GeographersInitial submission, October 2011; revised submission, March 2012; final acceptance, May 2012

Published by Taylor & Francis, LLC.

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 2: Data-Driven Regionalization of Housing Markets

2 Helbich et al.

individual value-adding components (Rosen 1974).The hedonic price function results from spatial equi-librium conditions of supply and demand for housingcharacteristics, ensuring stationarity of the effects onprices over space within a market.

Extended research has established that housingmarkets are segmented in submarkets; that is, thathedonic price functions can vary across space (e.g.,Straszheim 1975; Schnare and Struyk 1976; Goodman1978; Palm 1978; Maclennan and Tu 1996; Goodmanand Thibodeau 1998, 2007; Watkins 2001; Bourassa,Hoesli, and Peng 2003; Hwang and Thill 2009). Thelack of clear guidance from economic theory hasresulted in a lack of a coherent terminology, however(Watkins 2001). According to Palm (1978, 218), “ahousing submarket may be defined as a collectivityof buyers and sellers with a distinct pattern of price-attribute valuations,” which results in geographicalareas with constant marginal prices (Goodman andThibodeau 2007). Reasons for functional disequilibriaare immobility and durability of real estate, informa-tion constraints, search costs, and spatially varyingdifferences in socioeconomic and demographic housingcharacteristics (Palm 1978; Goodman and Thibodeau1998). Therefore, submarkets can be defined (1) onthe supply side, where structural and neighborhoodhousing characteristics serve as discriminating factors;(2) on the demand side, based on household incomeor other sociodemographic characteristics; or (3) on acombination of both (Goodman and Thibodeau 1998,2007), as is proposed in this research.

In most applications, submarkets are operationalizedby disaggregating the entire market into discrete anddisjoint regions. Their consideration in hedonic mod-eling improves explanatory power, mitigates model vi-olations resulting from neighborhood effects, and leadsto more precise model predictions (e.g., Straszheim1975; Bourassa, Hoesli, and Peng 2003; Bourassa, Can-toni, and Hoesli 2007; Hwang and Thill 2009). Asubmarket definition, however, requires that the delim-itation coincides with the true data generating process(Fotheringham, Charlton, and Brunsdon 2002) as wellas the economic process, which are two assumptionsthat are difficult to fulfill. Not meeting these require-ments could result in an estimation bias and residualspatial correlation (LeSage and Pace 2009). Further-more, the results can be affected by the modifiable arealunit problem (Openshaw 1984), meaning that differentnumbers of submarkets and their spatial arrangementcould affect the regression output (Fotheringham andWong 1991). To avoid these problems, Fotheringham,

Charlton, and Brunsdon (2002) and Paez, Fei, and Far-ber (2008), among others, proposed spatially boundedsoft-market segmentations, where the hedonic pricefunction shows instability over space, meaning thatmarginal prices continuously change their influence onthe house price. The authors argue that continuous rep-resentations describe reality more closely and more ap-propriately than rigid, clear-cut submarket boundaries.

Nevertheless, these appealing properties come withserious methodological drawbacks, extensively dis-cussed in Wheeler and Tiefelsdorf (2005), Griffith(2008), Wheeler and Paez (2009), and Paez, Farber,and Wheeler (2011). Such limitations and the compre-hensive literature review in the next section underpinthe need for a generic data-driven framework, build-ing on the strengths of both continuous and discreteapproaches. Due to the fact that housing submarketsare not observable theoretical constructs, data-drivenapproaches, not relying on the intuition and expertknowledge of assessors, seem to be a rational option thatshould be considered. Therefore, this article combinesa semi-local regression technique and a regionalizationalgorithm to derive discrete submarket definitions. Thisexplicitly spatial approach has the advantage that littleprior knowledge is needed on the exact housing sub-market boundaries as the data-driven modeling processguides the specification of housing market segmenta-tions. Subsequently, these results can be used within thetraditional regression framework. This supports the rec-ommendations by Can (1992) as well as Goodman andThibodeau (1998), who demand more empirical justifi-cation and less arbitrariness in submarket segmentation.Introducing a detailed database for the Austrian hous-ing market, the advantages of this novel approach aredemonstrated with a model competition complement-ing this research. The following research questions willbe answered in this article:

� Is the proposed data-driven technique suitable to de-rive housing submarkets?

� How well do the derived submarkets perform as ad-ditional predictors?

� How well do the empirically derived submarkets per-form in a hedonic regression compared to a pooledmodel,1 a completely unpooled model, an ad hocpredefined submarket definition using federal states,and an alternative submarket definition using k-means–based submarkets?

The following section briefly discusses different ap-proaches and previous empirical findings on housing

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 3: Data-Driven Regionalization of Housing Markets

Data-Driven Regionalization of Housing Markets 3

submarkets. The subsequent section presents the mod-eling framework, followed by a definition of the studyarea and a brief introduction of the data. The results ofthe empirical analysis are discussed next. Finally, the ar-ticle concludes with major implications of the researchfindings and makes suggestions for directions of futurework.

Housing Market Segmentation: ALiterature Review

Even though the modeling properties of housing sub-markets are well known, their empirical delineationraises many methodological questions (Goodman andThibodeau 2007). Basically, there are three main ap-proaches to model housing submarkets. First, submar-kets are defined exogenously by means of predefinedspatial units, such as administrative units (Bourassaet al. 1999), school districts (Goodman and Thibodeau2003), stratifications of metropolitan areas (Adair,Berry, and McGreal 1996), or regions (Bischoff andMaennig 2011). Second, submarkets are considered tohave spatially continuous boundaries (Paez, Fei, andFarber 2008). Third, multivariate statistical methods,such as clustering algorithms (Hwang and Thill 2009)or neural networks (Kauko 2004), are used to definesubmarkets. Each of these three approaches is discussedin more detail next.

Ad Hoc Housing Market Segmentations

In most applications, the fixed effects model (e.g.,in the framework of an ordinary least squares [OLS]or autoregressive model) is used to model spatialheterogeneity, with submarkets being modeled usingdummy variables, which results in the intercept varyingover space. Slope heterogeneity can be controlled forthrough spatial interaction effects of the submarketdummy variables with explanatory covariates (e.g.,Kestens, Theriault, and Des Rosiers 2004).

Early examples for the application of ad hocpredetermined submarket definitions using fixed effectsmodels are Schnare and Struyk (1976), who analyzedfamily housing prices in suburban Boston (UnitedStates). They concluded that marginal prices vary overspace, even though they have small effects on theoverall house price. Comparing the estimated standarderrors of regressions between the market-wide and thestratified model, they found that both model types havethe same efficiency in terms of prediction power. This

supports the argument that submarkets are of marginalimportance for predictions. Additionally, due to dataloss through market segmentation, the stratified modelcould result in a lower reliability of model estimates.Goodman (1978) rejected these findings, comparinghedonic coefficients for New Haven, Connecticut(United States). He found significant price differencesbetween submarkets, which are not established by asingle equation model, and thus promoted the useof disaggregated submarkets. Likewise, Adair, Berry,and McGreal (1996) divided the metropolitan areaof Belfast, Ireland, into an inner, middle, and outersubmarket and emphasized their heterogeneous struc-ture, which is expressed in different combinations ofsignificant covariates. On a national scale, Bischoff andMaennig (2011) found heterogeneity in rental housingmarkets by distinguishing between East and WestGermany. Similar, Paez, Fei, and Farber (2008) stressedthe importance of market segmentation for Toronto,Canada, reporting superior accuracy of local regres-sions (i.e., consideration of submarkets) compared togeostatistical models. Recently, Bourassa, Cantoni, andHoesli (2007) compared several fixed effects modelslike aspatial OLS and autoregressive models in termsof their predictive performance for Auckland, NewZealand. They concluded that, because of its simplicity,OLS in combination with submarket dummies is moreappropriate than more sophisticated approaches (e.g.,conditional autoregressive or geostatistical models).If the number of submarkets or interactions becomeslarge, however, the fixed effects model approach canresult in the incidental parameter problem (Neymanand Scott 1948), meaning that the number of degreesof freedom is not sufficiently large for parameterestimation. Therefore, a trade-off between modelbias resulting from ignoring submarkets and parametervariability due to the smaller sample size has to be takeninto consideration (Bourassa, Hoesli, and Peng 2003).

Random or mixed effects models partially solve theincidental parameter problem in addition to modelheterogeneity as well as correlation structures withinhousing submarkets explicitly (Jones and Bullen 1994).The simplest case of a random effects model is theone-way intercept model, where intercepts vary acrosssubmarkets. Random effects can be approximated asa weighted average of the mean of the observationsin the submarkets (with a dummy specification) andthe overall mean of the entire study area (Gelman andHill 2007). The weights are determined by the amountof information within each submarket. For small sub-sample sizes (little information), estimates tend to be

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 4: Data-Driven Regionalization of Housing Markets

4 Helbich et al.

close to the global mean, whereas for large subsamples,estimates tend to be similar to the unpooled dummyestimate. One major advantage of this model strategy isthat the closer the estimates are to the pooled (global)model, the less effective degrees of freedom are used. Ifobservations belong to several (nested) levels of spatialunits, the model turns into a multilevel or hierarchicalregression problem (Goldstein 2011).

Empirical examples of multilevel or hierarchicalmodels can be found, among others, in Jones andBullen (1994), Orford (2000), and Brunauer et al.(2010) using a Bayesian framework. Also, Goodmanand Thibodeau (1998) proposed hierarchical modelsto study Dallas, Texas, housing submarkets on the basisof the quality of public education. In a more recentstudy, the same authors (Goodman and Thibodeau2003) compared two alternative submarket definitions:first, a combination of adjacent census tracts and,second, aggregated zone improvement plan codedistricts. They conclude that results strongly dependon how performance is measured. If random effectsare correlated with the predictors, the regression yieldsbiased and inconsistent results. In contrast, fixed effectsmodels produce unbiased and consistent results. In thecase of a rather small number of submarkets (similar toour study presented here), where the loss of degrees offreedom is not substantial, the application of the fixedeffects model seems to be preferred.

Spatially Continuous Housing MarketSegmentations

In contrast to discrete submarket boundaries appliedto fixed or random and mixed effects models, Fother-ingham, Charlton, and Brunsdon (2002) as well asPaez, Fei, and Farber (2008) proposed local regressionmethods, namely, geographically weighted regression(GWR), to model spatially continuous segmentations.GWR possesses the following limitations (e.g., Wheelerand Tiefelsdorf 2005; Griffith 2008; Wheeler and Paez2009), however: First, a number of data points arerepeatedly used in parameter estimations, resulting inmultiple comparisons. Second, GWR per se resultsin artificial parameter smoothness (Paez, Farber, andWheeler 2011). Third, local multicollinearity canfalsely induce parameter variability and inflates pa-rameter variance. Fourth, a strong correlation betweenthe GWR parameters might be present. Finally, theresulting standard errors are just approximations, andthe classical statistical test procedures are pseudo-counterparts of the traditional test procedures. As a

consequence, GWR results in highly volatile parameterestimations, possibly as a result of a globally optimizedbandwidth selection through cross-validation (Farberand Paez 2007). Recently, Paez, Farber, and Wheeler(2011) concluded that mitigating these negativeside effects requires sample sizes of more than 1,000observations. Nevertheless, GWR should be applicableto exploratory data analysis only (Wheeler and Paez2009), and house price predictions should continue tobe based on traditional econometric or geostatisticaltechniques (see, e.g., Bourassa, Cantoni, and Hoesli2007; Diggle and Ribeiro 2007; LeSage and Pace 2009).

Housing Market Segmentations Using MultivariateStatistics and Neural Computation

A different research line compared to ad hocand continuous segmentations relies on multivariatestatistics, particularly clustering algorithms, andneural networks to examine housing submarkets (e.g.,Abraham, Goetzmann, and Wachter 1994; Bourassaet al. 1999; Kauko, Hooimeijer, and Hakfoort 2002;Case et al. 2004; Hwang and Thill 2009). Earlyattempts at this alternative approach can be found inwork by Abraham, Goetzmann, and Wachter (1994),who applied k-means clustering in combination withbootstrapping to investigate cluster significance. For theyears 1977 to 1992 they discovered diversity and fluctu-ation within the U.S. housing market. Goetzmann andWachter (1995), focusing on rent and vacancy datain U.S. metropolitan areas, detected clusters of similarcities using the method of Abraham, Goetzmann, andWachter (1994). On a local scale, Bourassa et al. (1999)analyzed dwelling markets of the Australian cities ofSydney and Melbourne by expanding the researchdesign originally introduced by Maclennan and Tu(1996). Both studies combined principal componentanalysis (PCA) with k-means cluster analysis. Bourassaet al.’s (1999) results identified five distinct submarkets.Tests concerning prediction accuracy for the city ofMelbourne showed that submarkets derived on the ba-sis of individual data have a higher accuracy comparedto alternative submarket constructions. Single-marketmodels have the lowest accuracy. Pricing modelswith geographically concentrated submarkets yieldthe lowest prediction errors. To impose contiguityconstrains, Bourassa, Cantoni, and Hoesli (2010)included spatial coordinates as additional variablesin a hierarchical clustering, arguing that the Wardalgorithm is less sensitive to initial seed variationscompared to k-means. Applying the fuzzy c-means

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 5: Data-Driven Regionalization of Housing Markets

Data-Driven Regionalization of Housing Markets 5

clustering algorithm on the Buffalo–Niagara Fallsmetropolitan statistical area, Hwang and Thill (2009)stressed the advantage of noncrisp cluster boundaries,which contradicts the findings of Bourassa et al. (1999).Nevertheless, further analysis within the traditionalstatistical framework needs a defuzzification,2 whichagain results in crisp submarket boundaries.

Kauko, Hooimeijer, and Hakfoort (2002) and Kauko(2004) demonstrated the usefulness of unsupervisedneural networks (e.g., self-organizing maps [SOMs;Kohonen 2001]) to analyze nonlinear relationships inhousing markets. A comparison between the cities ofHelsinki, Finland, and Amsterdam, The Netherlands,showed that the latter has a more fragmented housingmarket. SOMs result in soft and data-driven market seg-mentations. To account for spatial dependencies Bacao,Lobo, and Painho (2005) stressed the need to adaptKohonen’s (2001) original SOM. Nevertheless, SOMsand their variants strongly depend on the subjectivechoice of several input parameters (e.g., number of neu-rons, map topology, learning rate), making them unsuit-able for this study.

With the exception of Bacao, Lobo, and Painho(2005), all of the previously mentioned clustering al-gorithms neglect spatial dependency, spatial proximity,and topological relationships between adjacent objectsimmanent in housing data (Dubin 1992). To overcomethese serious limitations, a “regionalization” approachhas proven to be promising (Miller 2010). This ap-proach is a form of clustering that groups data into spa-tially homogenous and contiguous submarkets. Severaltechniques (e.g., Openshaw and Rao 1995; Assuncaoet al. 2006; Guo 2008) are available to handle spa-tial effects during clustering, although none of themhas ever been applied in real estate studies. Recently,Assuncao et al. (2006) proposed the Spatial ‘K’lusterAnalysis by Tree Edge Removal (SKATER) algorithm.Compared to former techniques (e.g., Openshaw andRao 1995), it produces more homogenous regions andreduces computational burden. Guo (2008) criticizedSKATER, however, because, first, the contiguity con-straint imposed by the minimum spanning tree (MST)is static, notwithstanding that the relations of adjacentregions might change during the clustering. Second,the MST cannot guarantee that all objects within acluster are similar to each other. Thus, Guo (2008) pro-moted a family of algorithms called regionalization withdynamically constrained agglomerative clustering andpartitioning. Nevertheless, SKATER has a lower the-oretical computational complexity compared to Guo’s(2008) algorithms and is thus more suitable to cluster

medium to large data sets similar to the one presentedin this article.

To summarize, although the relevance of submarketsstated by Maclennan (1982) is widely accepted in prac-tice, there is no consensus about the methods to delimitor conceptualize hedonic submarkets empirically andwhich method is more appropriate. It seems, however,that each of these three main approaches has itsindividual advantages and disadvantages. Therefore, acombination of the strengths of each approach seems tobe suitable for delimiting submarkets more efficiently,while explicitly considering spatial effects during theregionalization process. For a model comparison, OLSlinked with submarket dummies is a suitable methodto evaluate housing segmentation.

Methodology

The study design is summarized in Figure 1. Theframework consists of four main steps:

1. A semi-local regression is applied to investigatenonstationary covariates of housing prices. To getarea-wide estimation surfaces of marginal prices,these local pointwise regression estimates areinterpolated.

2. A PCA reduces the information variability ofthese resulting parameter surface maps.

3. These principal component (PC) maps are clus-tered with the SKATER regionalization algo-rithm and the resulting submarkets are evaluatedby means of hedonic regressions.

4. A model competition tests the predictive accuracyof the data-driven submarkets against no segmen-tation and commonly used segmentations (i.e.,federal states and k-means–based submarkets). Fi-nally, covariate variability is investigated by anunpooled hedonic model.

The entire analysis is performed within the R environ-ment (R Development Core Team 2012). It must benoted that the proposed methodology (Figure 1) is notat all limited to the study area of this research but can,in principle, be generalized and transferred to any otherstudy area’s housing markets.

Geographically Weighted Regression

The GWR is a locally weighted regression analysis,introduced by Brunsdon, Fotheringham, and Charlton(1996), to model spatial autocorrelation as well as spa-tial heterogeneity. For each of the parameter estimation

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 6: Data-Driven Regionalization of Housing Markets

6 Helbich et al.

Figure 1. Simplified research design to model dwelling submarkets.SKATER = Spatial ‘K’luster Analysis by Tree Edge Removal; RMSE= root mean square error; PCA = principal component analysis.

steps, only a subset of data is taken into account, wheredata near the regression point have a higher influencethan data further away. The weighting itself is based ona kernel function (e.g., Gaussian), although the choiceof the kernel function has little impact on estimationresults. The crucial point in the GWR is the choice ofthe bandwidth. Commonly, the bandwidth is allowedto vary across space, depending on the density of thedata points. This improves the goodness of fit, if the datapoints are irregularly distributed across space. Thus, indensely populated areas, the kernel has a shorter band-width compared to regions with longer interpoint dis-tances (Fotheringham, Charlton, and Brunsdon 2002).

The basic GWR assumes that all predictors have anonstationary behavior and vary across space. If this isnot the case, a mixed GWR (MGWR) is more appropri-ate, reducing the degrees of freedom and thus resultingin a more parsimonious model. MGWR keeps coeffi-cients with nonsignificant variation constant, meaning

that, for example, their effect on housing prices is sta-tionary, whereas others are allowed to vary across space.Stationarity (spatially constant effects) can be investi-gated by a test statistic put forward by Leung, Mei,and Zhang (2000). A multistep algorithm proposed byFotheringham, Charlton, and Brunsdon (2002) is usedto estimate the following equation:

yi =k∑

j =1

a j xi j +m∑

l=1

bl (ui , vi ) xi l + ei

where yi is the logarithmically transformed sales price ofobservation i, aj are the global coefficients of covariatesxi j , and bl(ui, vi ) are the local coefficients of covariatesxi l at the coordinates (ui , vi ) of observation i. As the(M)GWR only provides point estimates, spatial inter-polation (e.g., ordinary kriging; Pebesma 2004) is em-ployed to get area-wide estimates of marginal housingprices required for further analysis.

As discussed in the literature review, significantspatial variation of the parameters points to the exis-tence of submarkets. Thus, MGWR can be regardedas a data-driven way to explore parameter variabilityof the hedonic model across space and soft-markethousing segmentation, respectively. The rather highvolatility of nonstationary parameters seems artificialand not robust, however, and is thus hardly applicablefor prediction purposes. This requires a way to reducethe information and to operationalize the results in amore parsimonious submarket definition.

Principal Component Analysis

Noise and multicollinearity of kriged coefficientsurfaces have serious effects (e.g., instability) on thediscriminating power of regionalization algorithms.The PCA provides a solution to this problem (e.g.,Maclennan and Tu 1996). Its objective is a linear di-mensional reduction of data resulting in a small set oforthogonal PCs, reflecting the most inherent variabil-ity of the multivariate attribute space. Technically, thePCA is an orthogonal transformation decomposing thecovariance or correlation matrix of the input data intoits eigenvectors and eigenvalues. The latter representthe amount of information gathered by each principalcomponent (Jolliffe 2002). Several criteria (e.g., Kaisercriterion, scree plot) exist to select an appropriate num-ber of components (Reimann et al. 2008). In addition,the final subset of PCs serves as input data for region-alization by the SKATER algorithm to derive spatially

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 7: Data-Driven Regionalization of Housing Markets

Data-Driven Regionalization of Housing Markets 7

compact and homogeneous regions, serving as submar-kets in a hedonic regression.

Spatial Regionalization

Although being aware of Guo’s (2008) criticism, theSKATER algorithm has the ability to derive spatiallycoherent and homogeneous submarkets. The SKATERalgorithm transforms the regionalization problem intoa graph partitioning problem (Assuncao et al. 2006).A spatially contiguous graph is built by connecting re-gions that are geographically adjacent. Additionally,each edge is assigned the similarity between the twocontiguous regions. By pruning the resulting graph intosubgraphs, homogenous and spatially contiguous re-gions are obtained. To reduce the complexity of thepartitioning of the graph, the SKATER algorithm startswith a MST. A MST is a graph with no circuits, mini-mum costs, and minimum number of edges connectingall vertices. The costs are defined as the sum of the dis-similarities over all edges. The dissimilarity measure isdependent on the attribute space and is measured by thesquared Euclidean distance. For partitioning, edges aresequentially removed from the MST until the desirednumber of regions is achieved. Because of the complex-ity of finding optimal edges for removal, the SKATERalgorithm uses a heuristic approach based on two ob-jective functions f1 and f2. f1 measures the change inhomogeneity of the clustering, when partitioning thetree by removing an edge, and the second function f2measures the change in homogeneity of the least ho-mogenous tree that results from cutting out an edge.Starting from the central vertex of the MST, incidentedges are evaluated. If the removal of one of these edgesresults in costs according to f1 superior to the best so-lution found so far, the edge is stored as a candidatesolution. Then, a new vertex from the set of evalu-ated edges is chosen based on the cost function f2,which penalizes edges that result in imbalanced treeswhen cutting them out. Then, the incident edges ofthe selected vertex are evaluated again. This procedureis iteratively repeated, until a stopping criterion (e.g.,the candidate solution has not changed for a certainnumber of iterations) is reached. Restrictions such aspopulation-balanced regions are not considered in thisregionalization procedure. A significant task in region-alization is the determination of the number of submar-kets. Because no a priori knowledge about the actualnumber of submarkets is present and generally appliedindexes (e.g., Davies and Bouldin 1979) do not considerspatial contiguity, a model-driven approach applying

the predictive performance of the hedonic regression isrequired.

To determine a near-optimal number of submarkets,two opposing strategies are pursued: First, the predictionerror of a pooled stepwise hedonic model with addi-tional submarket variables or submarket interaction ef-fects between the covariates (integrating intercept andslope heterogeneity), based on the SKATER-derivedsubmarket partition, is estimated. This is referred toas the regionalized model. This procedure is repeatedwith different numbers of SKATER clusters. As addi-tional decision guidance, Akaike’s information crite-rion (AIC; Akaike 1974) is used. The AIC describes atrade-off between bias and model variance, penalizingoverfitting (Burnham and Anderson 2002). Second, foreach partition, separate stepwise hedonic models areestimated for each submarket and their average predic-tion error is determined. This is referred to as the singleequation model and represents the notion of extremeheterogeneity, or distinct markets, where covariates fol-low completely different patterns over space.

To evaluate out-of-sample prediction accuracy ofthe competing models, the housing data set is split intoa training and a test set, which have not been used formodel building (Jain, Duin, and Mao 2000). To revealdifferences between the estimated values and the truevalues for each model the root mean square error(RMSE) is calculated. Concerning Goodman and Thi-bodeau’s (2007) result that model suitability dependson the performance measure, the mean absolute error(MAE) is also calculated for the model competition.Smaller RMSE and MAE values denote a smaller pre-diction error and thus a more accurate hedonic model.

Finally, a model competition is conducted. TheSKATER-based model competes against (1) a pooledmarket model without considering submarkets, whichserves as the reference model (the so-called globalmodel); and (2) two alternative models, one using fed-eral states3 (the so-called ad hoc model) and the otherone using k-means clusters as submarket delimitation(henceforth the k-means–based model).

Study Site and Data

The empirical study is based on 3,887 geocodedsingle-family homes located in Austria. The housingsegment of single-family homes is typical for suburbanand rural areas, compared to urban areas being primarilycharacterized by apartments. Attached to each home isthe transaction price for the years 1998 to 2009, nine

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 8: Data-Driven Regionalization of Housing Markets

8 Helbich et al.

Figure 2. Study site: Dwelling loca-tions (gray points) and Austrian fed-eral states (black lines).

house-specific structural variables (e.g., total floor area),and two temporal covariates (e.g., year of purchase).These data have been collected by UniCredit BankAustria AG and represent transactions that are associ-ated with Bank Austria AG. Figure 2 shows the studysite and the individual location of each house. Clearly,some east–west divide is evident, with more houses inthe eastern federal states.

The neighborhood of each house is described by fiveattributes (e.g., proportion of academics) for the year2001 at two different levels of aggregation: the Aus-trian administrative unit of municipality (equivalentto the U.S. census tract) and the enumeration districtlevel (equivalent to the U.S. census block group), bothpublished by Statistics Austria. Additionally, more up-to-date data sets for 2009 are used for two of the fiveattributes describing each house’s neighborhood. Theappendix lists all covariates, their anticipated effects onhouse price, and descriptive statistics.

Results

Exploring Nonstationarity

Initially, the exploratory data analysis indicates thathouse prices substantially vary across Austria (Moran’sI = 0.288, p < 0.001), with the highest prices paid inand around Vienna, in most provincial capitals (e.g.,Salzburg, Innsbruck), and in some smaller cities domi-nated by the tourism industry. This is a first indicationof possible spatial instability in the hedonic price func-tion, which is now further explored by the MGWR.

For this purpose the entire data set is partitioned intoa regionalization data set as well as a training and a test

data set for validation of the modeled submarkets. Dueto excessively time-consuming computations on a stan-dard desktop computer, a 50 percent random sampleof the entire data set for the actual regionalization wasselected for running the MGWR. Subsequently, the re-maining 50 percent of the entire data set was randomlydivided, using a common 85–15 percent split. The largerof the two samples was used for hedonic regression anal-ysis, and the smaller sample was applied to calculate anunbiased measure of the prediction accuracy.

The hedonic function was expressed with the nat-ural logarithm of the transaction price as responsevariable. Similarly, some of the continuous covariateswere also logarithmically transformed (see Table A1).4

A MGWR was estimated to explore global and lo-cal effects on house prices. The model performanceshows a pseudo-coefficient of determination (R2) rang-ing from 0.24 to 0.49 and regression assumptions likehomoscedasticity, residual independence, and normal-ity not being violated. The F(3)-statistic (Leung, Mei,and Zhang 2000) was consulted to discriminate betweenglobal and local effects. This resulted in six stationaryand ten nonstationary covariates, all having expectedsigns. The stationary covariates (e.g., condition of thehouse) are independent of location and show identicalbehavior across Austria; thus, they are not relevant forthe regionalization performed later. All nonstationaryeffects show significant spatial variation (at least p <

0.05) and include the following (each covariate’s effecton house price is shown in parentheses):

� Structural covariates: Log total floor area (+ eff.), logplot space (– eff.), low quality of the heating system(– eff.).

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 9: Data-Driven Regionalization of Housing Markets

Data-Driven Regionalization of Housing Markets 9

� Temporal effects: Age of building at time of sale(– eff.), year of purchase (+ eff.).

� Neighborhood covariates: Unemployment rate (–eff.), purchase power index (+, – eff.), proportionof academics (+ eff.), age index (– eff.), log popula-tion density (+ eff.).

As an example, Figure 3 shows the parameterestimates of four selected nonstationary covariates,interpolated with ordinary kriging. As expected, bothstructural covariates quantifying the size of thedwellings, namely, plot space and total floor area, arepositively related to house prices. Plot space has thehighest positive effect in the north and northeast ofVienna. For this region, a 10 percent increase of plotspace induces an increase in dwelling prices of up to 2percent. This relationship can also be interpreted as aplot space elasticity of 0.2. This effect is reduced, forexample, in the province of Salzburg, where the plotspace elasticity is only 0.1. The total floor area elas-ticity is lower in and around Vienna (0.3) comparedto the western part of Austria, where the elasticity isnearly twice as high (0.5), particularly in the provinceof Salzburg. The neighborhood variables show a sim-ilar spatially heterogeneous behavior. The proportionof academics has a positive influence on house prices.The higher the proportion of academics, the higher thedwelling prices are. The maximum marginal value isreached in the north of the city of Salzburg, where a1 percent increase of academics results in a more than3 percent increase in house prices, when holding allother effects constant. The age index measures the av-erage age of inhabitants. A high population age index,reflecting a rather old population, serves as a proxy forstructural weakness and is expected to have a negativeeffect on house prices (see Brunauer, Lang, and Umlauf2010). The effect of this covariate is almost negligiblein Vienna and the city of Linz. In other parts of Aus-tria, the effect is marginal but negative and ranges from–0.01 (for most parts of the province of Lower Austria)to –0.05 (province of Tyrol).

Reduction of Multicollinearity Using PCA

Possible methodological limitations of the (M)GWRfound by Wheeler and Tiefelsdorf (2005), like param-eter collinearity, are explored here as well. Spearman’sρ correlation coefficients confirm significant intercor-relations of the nonstationary parameter surfaces (e.g.,the coefficient surface age of the building is correlatedwith the unemployment rate; ρ = 0.495, p < 0.001).

Elimination of these multicollinearity effects and thusenhancement of the performance and quality of region-alization requires the calculation of a PCA on the scaledparameter surfaces. The Kaiser criterion suggests thata PC needs to possess eigenvalues greater than one,which in our study results in three PCs. This result isalso supported by the scree plot (Figure 4A) with thedeclining eigenvalues flattening out after the first threePCs. Moreover, these three PCs explain approximately84 percent of the overall variance (see Jolliffe 2002).PC1 explains 56 percent of the total variance, PC217 percent approximately one sixth, and PC3 12 per-cent. These results are similar to those of Bourassa et al.(1999), whose first three PCs explain 82 percent of thevariance.

The biplot (Figure 4B) shows the loadings (eigen-vectors) as well as the first and second scores of thedata points (Reimann et al. 2008). The loadings rep-resent the direction of each PC and express the rela-tionship with the original variables, whereas the scoresare the new data points projected on the new coordi-nates for each PC. Both grouping effects of variablesand similar loadings are visible. Most of the MGWRsurfaces are moderately correlated (indicated by the ar-row’s angles) and show a balanced variability of eachMGWR surface for PC1 and PC2 (shown by the lengthof arrows). Additionally, PC1 seems relatively balancedconcerning positive and negative loadings. For exam-ple, the logged plot space and logged population densityload positive on PC1, whereas logged total floor areahas a high negative loading. PC1 shows a tendencyto summarize structural characteristics of dwellings.In contrast, the second PC emphasizes the neighbor-hood characteristics slightly more. For instance, theproportion of academics has a relatively high positiveloading, whereas age of the building has a negative load-ing on PC2. PC3 is difficult to interpret, representingstructural, temporal, and neighborhood characteristics,dominated by the age index.

The use of census data in hedonic modeling is preva-lent to describe neighborhood characteristics locatedin specific submarkets. To ensure such a nested struc-ture as well as to facilitate operability and integrity withthese data sets, the three PCs are aggregated by aver-aging the cells with a resolution of 1,000 m over allmunicipalities, needed for subsequent regionalization.

Regionalization of the Housing Market

The regionalization algorithm is initiated with twoto fifteen submarkets and each time a stepwise hedonic

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 10: Data-Driven Regionalization of Housing Markets

Figu

re3.

Non

stat

iona

rypa

ram

eter

esti

mat

ions

bym

ixed

geog

raph

ical

lyw

eigh

ted

regr

essi

on(M

GW

R).

Pane

lsA

and

Bre

pres

ents

truc

tura

lcha

ract

eris

tics

and

pane

lsC

and

Dsh

owne

ighb

orho

odch

arac

teri

stic

s.Po

ints

show

the

capi

tals

ofth

efe

dera

lsta

tes.

The

blac

klin

esar

efe

dera

lsta

tebo

unda

ries

.(C

olor

figur

eav

aila

ble

onlin

e.)

10

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 11: Data-Driven Regionalization of Housing Markets

Figu

re4.

Prin

cipa

lco

mpo

nent

anal

ysis

resu

lts:

Pane

lA

show

sth

esc

ree

plot

and

the

cum

ulat

ive

perc

enta

geof

the

expl

aine

dva

rian

ce.P

anel

Bsh

ows

the

bipl

otof

the

first

two

prin

cipa

lcom

pone

nts.

PC=

prin

cipa

lcom

pone

nt.

11

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 12: Data-Driven Regionalization of Housing Markets

12 Helbich et al.

regression is estimated, using the covariates in TableA1. The left panel of Figure 5 shows the regression per-formance of the regionalized model. Dividing Austriainto eleven submarkets results in the lowest AIC scoreand, when compared to all other partitions, a relativelysmall RMSE of 0.324. A larger number of submarketsimproves the RMSE only slightly (e.g., –0.0008) but, atthe same time, increases the AIC.

In comparison, the average RMSEs of the best singleequation model, recommending only two submarkets,lead to a rather poor prediction error of more than0.339, with an increase in the number of submarketsresulting in a higher RMSE. The single equation modelwith eleven submarkets exceeds an RMSE of 0.388. Thesingle equation approach furthermore illustrates the in-cidental parameter problem. In other words, the numberof possible submarkets is constrained, as otherwise thenumber of observations is not sufficiently high for esti-mation purposes. These results mirror Bourassa, Hoesli,and Peng’s (2003, 15) conclusion that “too much ho-mogeneity may not be a good thing in practice.” Sim-ilarly, Adair, Berry, and McGreal (1996) providedevidence that just a few submarkets on a macrolevel arepreferable.

The final submarket partition of Austria is shownin the right panel of Figure 5. With the exceptionof SKATER submarket 5 (SK5), all submarkets showsignificant differences compared to the reference clus-ter SK2 (Table 1). Local expert knowledge confirmsthese results. For instance, Vienna and its proper sur-roundings (cluster 2) represent one submarket. Hereparticularly, southern municipalities in immediateproximity to Vienna achieve high housing prices. Hel-bich and Leitner (2009) and Helbich (2012) concludedthat these are effects of suburbanization and postsubur-banization processes boosting land and housing prices.The extent of this region reflects a commuting distanceof approximately as high as thirty-five minutes to thecity boundary of Vienna by motorized individual trans-port. For the year 2001, Statistics Austria (2007) re-ported for Lower Austria that more than 25 percent ofall daily commuters needed sixteen to thirty-five min-utes to get to work. Compared to all other clusters,the submarket represented by cluster 1 has relativelyhigh housing prices because the available land is scarcein alpine areas and mostly limited to the valley floors.Additionally, the tourism industry has pushed housingprices higher. In contrast, the submarket in the north ofVienna (cluster 6) reflects an economically weak regionsuffering from aging of the population and outmigration(Fassmann, Gorgl, and Helbich 2009).

Validation of Modeled Submarkets

Table 1 presents detailed results of the model compe-tition using out-of-sample RMSE (see earlier). On thebasis of paired Wilcoxon signed rank tests, the null hy-pothesis that the models including submarkets performstatistically equal to the global model without submar-kets can be rejected at p < 0.05. In addition, predictionerrors are noticeably lower for the submarket models.The ad hoc model reduces the RMSE by about 0.005 andthe regionalized model by about 0.008 compared to theglobal model. Even the adjusted R2 of the regionalizedmodel is lower than the ad hoc model (44 percent vs. 45percent explained variance). Additionally, as advisedin Jain, Duin, and Mao (2000), the regionalization iscross-checked against the widely used aspatial k-meansalgorithm in housing segmentation (e.g., Maclennanand Tu 1996). To define a suitable number of clusters,this study follows Venables and Ripley (2002), who it-eratively minimized the within-group sum of squares.As in Bourassa, Cantoni, and Hoesli (2010), the resultssuggest eight to ten clusters, which are then used in thehedonic regression. All k-means RMSEs are higher thanthe SKATER results, marginally lower or equal thanthe ad hoc results, and lower than the global model. Itshould be noted that simply using the MGWR modelresults in the highest RMSE of 0.338. What is more, theregionalized model results in a significantly superior out-of-sample prediction performance than the alternativecounterparts (Wilcoxon test p < 0.05). No significantdifference in the prediction performance is found be-tween the ad hoc model and the k-means–based model.In contrast to the findings of Goodman and Thibodeau(2007), who claimed that the model preference dependson the measured performance criterion, the results be-tween different performance measures are consistent inour research, as the MAE also supports the conclusionsfrom the RMSE.

Independent of the prediction accuracy of these com-peting models, all estimated coefficients for the struc-tural, temporal, and neighborhood effects of the modelshave expected signs and are highly significant. Simi-lar to Bourassa et al. (1999), the results suggest thatinternal homogeneity and external heterogeneity arebetter represented in the modeled submarkets than inexogenously defined alternative submarkets.

Finally, the results of the regionalized model witheleven submarkets are compared to the single equa-tion model, where a separate model, one for each ofthe eleven SKATER submarkets, is estimated. As ear-lier, for each regression a stepwise variable selection is

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 13: Data-Driven Regionalization of Housing Markets

Figu

re5.

Mod

el-d

rive

nde

term

inat

ion

ofth

enu

mbe

rof

subm

arke

ts(l

eft)

and

the

resu

ltin

gre

gion

aliz

atio

nac

ross

Aus

tria

(rig

ht).

Are

asin

gray

repr

esen

tth

eSK

AT

ERsu

bmar

kets

and

blac

klin

esar

eth

efe

dera

lsta

tebo

unda

ries

.RM

SE=

root

mea

nsq

uare

erro

r;A

IC=

Aka

ike

info

rmat

ion

crit

erio

n;SK

AT

ER=

Spat

ial‘

K’lu

ster

Ana

lysi

sby

Tre

eEd

geR

emov

al.

13

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 14: Data-Driven Regionalization of Housing Markets

Tab

le1.

Esti

mat

edor

dina

ryle

asts

quar

esm

odel

s

Reg

iona

lized

mod

el11

SKA

TER

subm

arke

tsG

loba

lmod

el:

Pool

edda

taA

dho

cm

odel

8su

bmar

kets

k-m

eans

–bas

edm

odel

8su

bmar

kets

Coe

ffici

ent

SEt-

valu

eC

oeffi

cien

tSE

t-va

lue

Coe

ffici

ent

SEt-

valu

eC

oeffi

cien

tSE

t-va

lue

Inte

rcep

t10

.011

0.21

346

.999

∗∗∗

10.3

460.

203

50.9

03∗∗

∗9.

203

0.22

141

.578

∗∗∗

10.1

130.

214

47.2

85∗∗

Stru

ctur

alco

vari

ates

lnar

eato

tal

0.42

50.

020

20.9

77∗∗

∗0.

452

0.02

022

.480

∗∗∗

0.42

40.

020

21.3

74∗∗

∗0.

438

0.02

021

.698

∗∗∗

lnar

eapl

ot0.

094

0.01

46.

482∗∗

∗0.

058

0.01

44.

085∗∗

∗0.

104

0.01

47.

277∗∗

∗0.

084

0.01

45.

879∗∗

cond

hous

e1–0

.027

0.01

6–1

.715†

–0.0

260.

016

–1.6

72†

–0.0

400.

015

–2.5

82∗∗

–0.0

260.

016

–1.6

84he

at1

–0.1

190.

026

–4.4

91∗∗

∗–0

.109

0.02

7–4

.064

∗∗∗

–0.1

110.

026

–4.2

56∗∗

∗–0

.127

0.02

6–4

.781

∗∗∗

bath

1–0

.069

0.02

5–2

.767

∗∗–0

.059

0.02

5–2

.347

∗–0

.060

0.02

4–2

.431

∗–0

.061

0.02

5–2

.446

atti

c1–0

.026

0.01

3–1

.987

∗–0

.038

0.01

3–2

.896

∗∗–0

.027

0.01

3–2

.083

∗–0

.023

0.01

3–1

.743

cella

r10.

124

0.01

58.

315∗∗

∗0.

126

0.01

58.

377∗∗

∗0.

135

0.01

59.

208∗∗

∗0.

122

0.01

58.

154∗∗

gara

ge1

–0.0

790.

013

–5.9

91∗∗

∗–0

.078

0.01

3–5

.811

∗∗∗

–0.0

850.

013

–6.4

68∗∗

∗–0

.080

0.01

3–6

.035

∗∗∗

terr

10.

068

0.01

35.

137∗∗

∗0.

065

0.01

34.

893∗∗

∗0.

066

0.01

35.

083∗∗

∗0.

068

0.01

35.

106∗∗

Tem

pora

lcov

aria

tes

age

–0.0

060.

000

–15.

189∗∗

∗–0

.006

0.00

0–1

4.52

8∗∗∗

–0.0

060.

000

–15.

978∗∗

∗–0

.006

0.00

0–1

4.77

8∗∗∗

tim

e0.

009

0.00

33.

254∗∗

∗0.

006

0.00

32.

301∗

0.01

10.

003

3.94

7∗∗∗

0.00

80.

003

2.79

8∗∗

Nei

ghbo

rhoo

dco

vari

ates

enum

dun

empl

–1.1

290.

499

–2.2

63∗

–1.7

990.

465

–3.8

70∗∗

∗–1

.463

0.49

6–2

.949

∗∗

mun

ipp

ind

0.00

40.

001

4.19

0∗∗∗

0.00

50.

001

6.43

7∗∗∗

0.00

70.

001

7.27

5∗∗∗

0.00

40.

001

3.76

7∗∗∗

mun

iac

ad0.

013

0.00

25.

639∗∗

∗0.

011

0.00

25.

444∗∗

∗0.

009

0.00

24.

426∗∗

∗0.

012

0.00

25.

422∗∗

mun

iag

ein

d–0

.031

0.00

4–8

.030

∗∗∗

–0.0

390.

004

–10.

983∗∗

∗–0

.020

0.00

4–5

.234

∗∗∗

–0.0

320.

004

–8.4

11∗∗

lnm

uni

popd

0.06

00.

006

9.52

9∗∗∗

0.06

30.

006

10.6

79∗∗

∗0.

020

0.00

72.

901∗∗

0.05

70.

006

9.01

3∗∗∗

Subm

arke

tsSK

1/V

bg+T

yr/k

10.

156

0.04

03.

919∗∗

∗0.

249

0.03

86.

627∗∗

∗–0

.182

0.03

4–5

.293

∗∗∗

SK3/

Car

/k2

–0.0

910.

035

–2.6

40∗∗

–0.0

900.

029

–3.0

58∗∗

–0.1

680.

022

–7.6

02∗∗

SK4/

Sty/

k3–0

.129

0.02

3–5

.612

∗∗∗

–0.0

500.

022

–2.2

79∗

0.13

60.

040

3.41

8∗∗∗

SK5/

Upp

/k4

0.03

80.

026

1.44

20.

040

0.02

01.

961∗

0.00

10.

024

0.02

4SK

6/Sa

l/k5

–0.0

830.

025

–3.4

01∗∗

∗0.

255

0.02

98.

788∗∗

∗–0

.068

0.02

2–3

.055

∗∗

SK7/

Vie

/k6

0.10

40.

042

2.46

3∗0.

304

0.03

39.

141∗∗

∗–0

.078

0.02

4–3

.241

∗∗

SK8/

Bgl

d/k7

–0.0

530.

023

–2.2

47∗

–0.1

300.

029

–4.4

39∗∗

∗–0

.088

0.02

5–3

.594

∗∗∗

SK9

–0.0

550.

025

–2.1

77∗

SK10

–0.2

370.

038

–6.2

77∗∗

SK11

–0.1

370.

029

–4.7

67∗∗

F-te

st(p

-val

ue)

0.00

10.

001

0.00

10.

001

Adj

.R2

(%)

4442

4543

RM

SE0.

324

0.33

20.

327

0.32

7

Not

e:SE

=st

anda

rder

ror;

SK=

SKA

TER

;Vbg

=V

orar

lber

g;T

yr=

Tyr

ol;C

ar=

Car

inth

ia;S

ty=

Styr

ia;U

pp=

Upp

erA

ustr

ia;S

al=

Salz

burg

;Vie

=V

ienn

a;B

gld

=B

urge

nlan

d;R

MSE

=ro

otm

ean

squa

reer

ror.

† p<

0.1.

∗ p<

0.05

.∗∗

p<

0.01

.∗∗

∗ p<

0.00

1.

14

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 15: Data-Driven Regionalization of Housing Markets

Data-Driven Regionalization of Housing Markets 15

Table 2. Single equation model for each SKATER submarket

SK1 SK2 SK3 SK4 SK5 SK6 SK7 SK8 SK9 SK10 SK11

Intercept ++++ ++++ ++++ ++++ ++++ ++++ ++++ ++++ ++++ ++++ ++++Structural covariates

lnarea total +++ ++++ ++++ ++++ ++++ ++++ +++ ++++ ++++ ++++ ++++lnarea plot + +++ +++ ++++ +++ ++cond house1 −− − + −− −heat1 −− −−−− − −−−bath1 −−attic1 −cellar1 + +++ ++++ ++ ++++ +++garage1 −−−− − −− − −−−− −−terr1 ++++ ++ +++ +++

Temporal covariatesage −−−− −−−− −−−− −−−− −−−− − −−−− −−−− −−−−time +++ ++++ + ++++

Neighborhood covariatesenumd unempl + −−− − − ++ −−muni pp ind +++ ++ ++ ++++muni acad + ++ ++ ++++ ++ ++ − ++++ +++ ++muni age ind −− −−−− −−−− −−− −−− −−ln muni popd + ++++ +++ +++

Adj. R2 (%) 24 46 26 35 49 57 39 42 41 41 28RMSE 0.511 0.345 0.362 0.357 0.377 0.335 0.511 0.348 0.369 0.399 0.348n 100 1,003 134 402 293 328 85 309 304 107 239

Note: For all models, F-test p < 0.001. SK = SKATER; + = positive effect; – = negative effect; RMSE = root mean square error.Significance levels: ++++ = 0.001; +++ = 0.01; ++ = 0.05; + = 0.10; −−−− = 0.001; −−− = 0.01; −− = 0.05; − = 0.10.

applied. The results are summarized in Table 2 and in-dicate local differences in the hedonic price function.Within each submarket, different covariates are signif-icant to explain house prices. Total floor area has ahighly significant positive effect and age has a negativeeffect across all submarkets. Furthermore, a tendencyfor submarkets with a smaller number of houses (n) tolead to a higher RMSE and thus less reliable estimatesis noticeable. These findings are consistent with Adair,Berry, and McGreal (1996), where the spatial hetero-geneity is also expressed in different combinations ofcovariates to explain house prices. Such single equationmodels, in particular the ones with a small sample size,can result in estimates having an unexpected sign. Thisis demonstrated in the model SK7 with only eighty-five houses. For example, the covariate poor conditionof the house shows a positive effect, whereas the pro-portion of academics shows a negative effect, both ofwhich are unexpected. Remaining submarkets havinga larger number of observations show effects on houseprices that are expected. To summarize, separate modelsfor each submarket result in serious shortcomings, in-cluding unstable estimates due to a reduced sample size,

being less prone to misspecifications, and being affectedby statistical artifacts.

Conclusions

This article promotes a data-driven spatial region-alization framework for housing market segmentation.Well-established approaches, such as the ad hoc sub-market definition, specify housing submarkets exoge-nously and have their administrative units’ mimicthe spatial extent (e.g., Adair, Berry, and McGreal1996). Such an approach suffers from arbitrarily chosenboundaries not corresponding with spatial economicprocesses. It might induce the modifiable areal unitproblem, which can further result in biased estimatesof the hedonic price function. Therefore, local spa-tial analysis techniques are proposed to estimate lo-cal varying submarkets (e.g., Paez, Fei, and Farber2008). Nevertheless, it is shown that due to seriousmethodological drawbacks this line of research is morecapable for exploratory analysis and less suitable forpredictions. To avoid the arbitrariness of ad hoc

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 16: Data-Driven Regionalization of Housing Markets

16 Helbich et al.

submarket definitions, clustering of housing as well associoeconomic attributes is applied, which results inhomogeneous submarkets (e.g., Bourassa et al. 1999).Even though it is well known in real estate that spacematters, submarket definitions by means of clusteringalgorithms have disregarded this important maxim sofar. This might have led to erroneous conclusions.

In contrast, this research addresses and resolves thesedrawbacks. For the delimitation of housing market seg-mentation, a data-driven framework is proposed. First,stationary and nonstationary structural, temporal, andneighborhood effects on house prices are explored usingthe MGWR. The former represent economically con-nected real estate markets (e.g., through similar federalpolicy, like governmental subsidies), and the latter re-sult in a continuous and local definition of a housing seg-mentation. These estimated MGWR coefficients tendto be highly correlated and show distinctive volatility.Hence, in a second step, the MGWR coefficients are re-duced to a small number of orthogonal PCs, serving asinput data for the regionalization. Finally, the SKATERalgorithm is used to investigate homogenous and spa-tially contiguous housing market segmentation, explic-itly accounting for spatial effects, which have beenneglected so far. To determine an appropriate numberof submarkets, a model-driven approach in the form ofa hedonic regression is utilized. The out-of-sample pre-diction performance is applied to determine the “near-optimal” number of submarkets. Compared to previousapproaches, this framework needs less theory and priorknowledge on the exact housing market segmentationand the underlying spatial economic process.

Finally, using a data set of 3,887 geocoded single-family homes throughout Austria from 1998 to 2009,the proposed methodological framework is empiricallytested. A trade-off between prediction accuracy andmodel parsimony shows that eleven submarkets aremost suitable for Austria. The modeled submarkets arefurther validated within a hedonic regression frame-work. It is demonstrated that a hedonic model consid-ering the modeled SKATER submarkets significantlyimproves prediction quality compared to a pooledmarket-wide model. These results are in line with Good-man and Thibodeau (2003), as well as Hwang and Thill(2009). More important, the ad hoc submarket defini-tion using federal states is significantly outperformedwith this novel approach as well as the commonly usedk-means–based segmentation, which results in compar-atively higher prediction errors. In contrast to Bourassa,Cantoni, and Hoesli (2010), the results show that adhoc submarkets are competitive to k-means–based sub-

markets. Due to a straightforward application of the adhoc approach, the empirical application recommendsconsidering administrative spatial indicators instead ofthe more sophisticated k-means clustering. Addition-ally, a different single hedonic model is estimated foreach of the eleven submarkets to analyze the instabil-ity of structural, temporal, and neighborhood covari-ates across each submarket. The results show that thetype of covariates and their significance levels differacross the submarkets and that the reliability of the es-timates strongly depends on the sample size within eachsubmarket. In accordance with the results in Abraham,Goetzmann, and Wachter (1994) for the U.S. housingmarket, it can be concluded that the Austrian hous-ing market is based on strong regional determinants.In other words, geography is the essential componentdetermining the housing market’s characteristic.

To conclude, this article proposes a powerfulmethodological framework for housing segmentation,being superior in terms of prediction accuracy comparedto former approaches. The empirical model competitionclearly demonstrates that submarkets, however defined,must always be considered in hedonic modeling. Never-theless, there are some limitations. First, the methodol-ogy has limited applicability for large data sets becauseeach modeling component is computationally demand-ing. Second, this data-driven submarket definition mustbe further analyzed by a more extensive clustering andregionalization algorithm comparisons (e.g., Hagenauerand Helbich 2012) and alternative methodologies toreach its full potential. Such alternative methodologiesinclude multilevel modeling or additive (mixed) mod-eling. For this reason, future additional investigationsinto the performance of submarkets will be necessary.

Acknowledgments

Marco Helbich was supported by the Alexander vonHumboldt Foundation. We thank the four anonymousreviewers for their constructive comments that greatlyimproved the article.

Notes1. A pooled model estimates a single regression for the whole

study area, not accounting for submarket specific effects.In contrast, a completely unpooled model estimates a sep-arate equation for each submarket. The SKATER, ad hoc,and k-means models lie in between these two extremesand can model spatial heterogeneity through intercept orslope variation.

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 17: Data-Driven Regionalization of Housing Markets

Data-Driven Regionalization of Housing Markets 17

2. Defuzzification translates the fuzzy clustering to a crisprepresentation by assigning an object to the cluster havingthe highest membership degree.

3. Due to a sparse sample size in Vorarlberg and Tyrol, bothregions are considered as one spatial unit.

4. If both sides of the equation appear as a log–log specifi-cation, the coefficient β can be interpreted as elasticity.A 1 percent change of x results in a change of y by β ×100 percent. In a log-linear specification, a change of oneunit of x results in a change of y by 100 × (exp(β) – 1)percent. For small values, however, β can be interpretedas semielasticity, meaning that a one-unit change in x re-sults in a β × 100 percent change in y. See Greene (2008)for a more detailed discussion.

ReferencesAbraham, J. M., W. N. Goetzmann, and S. M. Wachter.

1994. Homogeneous groupings in metropolitan housingmarket. Journal of Housing Economics 3 (3): 186–206.

Adair, A. S., J. N. Berry, and W. S. McGreal. 1996. Hedo-nic modelling, housing submarkets and residential valu-ation. Journal of Property Research 13 (2): 67–83.

Akaike, H. 1974. A new look at the statistical model identi-fication. IEEE Transactions on Automatic Control 19 (6):716–23.

Assuncao, R. M., M. C. Neves, G. Camara, and C. DaCosta Freitas. 2006. Efficient regionalization techniquesfor socio-economic geographical units using minimumspanning trees. International Journal of Geographical In-formation Science 20 (7): 797–811.

Bacao, F., V. Lobo, and M. Painho. 2005. The self-organizingmap, the Geo-SOM, and relevant variants for geo-sciences. Computers & Geosciences 31 (2): 155–63.

Bischoff, O., and W. Maennig. 2011. Rental housing mar-ket segmentation in Germany according to ownership.Journal of Property Research 28 (2): 133–49.

Bourassa, S. C., E. Cantoni, and M. Hoesli. 2007. Spatialdependence, housing submarkets, and house price pre-diction. The Journal of Real Estate Finance and Economics35 (2): 143–60.

———. 2010. Predicting house prices with spatial depen-dence: A comparison of alternative methods. Journal ofReal Estate Research 32 (2): 139–60.

Bourassa, S. C., F. Hamelink, M. Hoesli, and B. MacGregor.1999. Defining housing submarkets. Journal of HousingEconomics 8 (2): 160–83.

Bourassa, S. C., M. Hoesli, and V. S. Peng. 2003. Do housingsubmarkets really matter? Journal of Housing Economics12 (1): 12–28.

Brunauer, W., S. Lang, and N. Umlauf. 2010. Mod-eling house prices using multilevel structured addi-tive regression. Working paper, Faculty of Economicsand Statistics, University of Innsbruck, Austria. http://econpapers .repec.org/paper/ innwpaper/2010–19.htm(last accessed 11 May 2012).

Brunauer, W., S. Lang, P. Wechselberger, and S. Bienert.2010. Additive hedonic regression models with spatialscaling factors: An application for rents in Vienna. TheJournal of Real Estate Finance and Economics 41 (4):390–411.

Brunsdon, C., S. A. Fotheringham, and M. Charlton. 1996.Geographically weighted regression: A method for ex-ploring spatial nonstationarity. Geographical Analysis 28(4): 281–98.

Burnham, K., and D. Anderson. 2002. Model selection andmultimodel inference: A practical information-theoretic ap-proach. New York: Springer.

Can, A. 1992. Specification and estimation of hedonic houseprice models. Regional Sciences and Urban Economics 22(3): 453–74.

Case, B., J. Clapp, R. Dubin, and M. Rodriguez. 2004. Mod-eling spatial and temporal house price patterns: A com-parison of four models. Journal of Real Estate Finance andEconomics 29 (2): 167–91.

Davies, D. L., and D. W. Bouldin. 1979. A cluster separa-tion measure. IEEE Transactions on Pattern Analysis andMachine Intelligence 2:224–27.

Diggle, P., and J. Ribeiro. 2007. Model-based geostatistics. NewYork: Springer.

Dubin, R. A. 1992. Spatial autocorrelation and neighborhoodquality. Regional Science and Urban Economics 22 (3):433–52.

Farber, S., and A. Paez. 2007. A systematic investigation ofcross-validation in GWR model estimation: Empiricalanalysis and Monte Carlo simulations. Journal of Geo-graphical Systems 9 (4): 371–96.

Fassmann, H., P. Gorgl, and M. Helbich. 2009. Atlas derwachsenden stadtregion [Atlas of the growing urban re-gion]. Vienna: PGO.

Fotheringham, S. A., M. Charlton, and C. Brunsdon. 2002.Geographically weighted regression: The analysis of spatiallyvarying relationships. Chichester, UK: Wiley.

Fotheringham, S. A., and D. W. S. Wong. 1991. Themodifiable areal unit problem in multivariate statisti-cal analysis. Environment and Planning A 23 (7): 1025–44.

Gelman, A., and J. Hill. 2007. Data analysis using regressionand multilevel/hierarchical models. Cambridge, UK: Cam-bridge University Press.

Goetzmann, W. N., and S. M. Wachter. 1995. Clusteringmethods for real estate portfolios. Real Estate Economics23 (3): 271–310.

Goldstein, H. 2011. Multilevel statistical models. Chichester,UK: Wiley.

Goodman, A. C. 1978. Hedonic prices, price indices andhousing markets. Journal of Urban Economics 5 (4):471–84.

Goodman, A. C., and T. G. Thibodeau. 1998. Housing mar-ket segmentation. Journal of Housing Economics 7 (2):121–43.

———. 2003. Housing market segmentation and hedonicprediction accuracy. Journal of Housing Economics 12 (3):181–201.

———. 2007. The spatial proximity of metropolitan areahousing submarkets. Real Estate Economics 35 (2):209–32.

Greene, W. H. 2008. Econometric analysis. 6th ed. UpperSaddle River, NJ: Pearson.

Griffith, D. A. 2008. Spatial-filtering-based contributions toa critique of geographically weighted regression (GWR).Environment and Planning A 40 (11): 2751–69.

Guo, D. 2008. Regionalization with dynamically constrainedagglomerative clustering and partitioning (REDCAP).

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 18: Data-Driven Regionalization of Housing Markets

18 Helbich et al.

International Journal of Geographical Information Science22 (7): 801–23.

Hagenauer, J., and M. Helbich. 2012. Contextual neu-ral gas for spatial clustering and analysis. Interna-tional Journal of Geographical Information Science. doi:10.1080/13658816.2012.667106.

Helbich, M. 2012. Beyond postsuburbia? Multifunctional ser-vice agglomeration in Vienna’s urban fringe. Journal ofEconomic & Social Geography 103 (1): 39–52.

Helbich, M., and M. Leitner. 2009. Spatial analysis ofthe urban-to-rural migration determinants in the Vi-ennese metropolitan area: A transition from sub- topostsuburbia? Applied Spatial Analysis and Policy 2 (3):237–60.

Hwang, S., and J.-C. Thill. 2009. Delineating urban hous-ing submarkets with fuzzy clustering. Environment andPlanning B: Planning and Design 36 (5): 865–82.

Jain, A. K., R. P. W. Duin, and J. Mao. 2000. Statistical pat-tern recognition: A review. IEEE Transactions on PatternAnalysis and Machine Intelligence 22 (1): 4–37.

Jolliffe, I. T. 2002. Principal component analysis. New York:Springer.

Jones, K., and N. Bullen. 1994. Contextual models of ur-ban house prices: A comparison of fixed- and random-coefficient models developed by expansion. EconomicGeography 70 (3): 252–72.

Kauko, T. 2004. A comparative perspective on urbanspatial housing market structure: Some more evi-dence of local sub-markets based on a neural networkclassification of Amsterdam. Urban Studies 41 (13):2555–79.

Kauko, T., P. Hooimeijer, and J. Hakfoort. 2002. Capturinghousing market segmentation: An alternative approachbased on neural network modelling. Housing Studies 17(6): 875–94.

Kestens, Y., M. Theriault, and F. Des Rosiers. 2004. Theimpact of surrounding land use and vegetation on single-family house prices. Environment and Planning B: Planningand Design 31 (4): 539–67.

Kohonen, T. 2001. Self-organizing maps. New York: Springer.LeSage, J., and K. R. Pace. 2009. Introduction to spatial econo-

metrics. Boca Raton, FL: CRC Press.Leung, Y., C.-L. Mei, and W.-X. Zhang. 2000. Statistical tests

for spatial nonstationarity based on the geographicallyweighted regression model. Environment and Planning A32 (1): 9–32.

Maclennan, D. 1982. Housing economics: An applied approach.London: Longman.

Maclennan, D., and Y. Tu. 1996. Economic perspectives onthe structure of local housing systems. Housing Studies 11(3): 387–406.

Miller, H. J. 2010. The data avalanche is here. Shouldn’twe be digging? Journal of Regional Science 50 (1):181–201.

Neyman, J., and E. L. Scott. 1948. Consistent estimates basedon partially consistent observations. Econometrica 16 (1):1–32.

Openshaw, S. 1984. The modifiable areal unit problem. Nor-wich, UK: Geo Books.

Openshaw, S., and L. Rao. 1995. Algorithms for reengineer-ing 1991 census geography. Environment and Planning A27 (3): 425–46.

Orford, S. 2000. Modelling spatial structures in local hous-ing market dynamics: A multilevel perspective. UrbanStudies 37 (9): 1643–71.

Paez, A., S. Farber, and D. Wheeler. 2011. A simulation-based study of geographically weighted regression as amethod for investigating spatially varying relationships.Environment and Planning A 43 (12): 2992–3010.

Paez, A., L. Fei, and S. Farber. 2008. Moving window ap-proaches for hedonic price estimation: An empiricalcomparison of modelling techniques. Urban Studies 45(8): 1565–81.

Palm, R. 1978. Spatial segmentation of the urban housingmarket. Economic Geography 54 (3): 210–21.

Pebesma, E. 2004. Multivariable geostatistics in S: The gstatpackage. Computers & Geosciences 30 (7): 683–91.

R Development Core Team. 2012. R: A language and environ-ment for statistical computing. Vienna, Austria: R Founda-tion for Statistical Computing.

Reimann, C., P. Filzmoser, R. G. Garrett, and R. Dutter.2008. Statistical data analysis explained: Applied environ-mental statistics with R. Chichester, UK: Wiley.

Rosen, S. 1974. Hedonic prices and implicit markets: Productdifferentiation in pure competition. Journal of PoliticalEconomy 82 (1): 34–55.

Schnare, A., and R. Struyk. 1976. Segmentation in ur-ban housing markets. Journal of Urban Economics 3 (2):146–66.

Statistics Austria. 2007. Commuters (based on census data2001). http://www.statistik.at/web de/statistiken/bevoelkerung/volkszaehlungen registerzaehlungen/pendler/index.html (last accessed 2 February 2012).

Straszheim, M. 1975. An econometric analysis of the urbanhousing market. Cambridge, MA: National Bureau of Eco-nomic Research.

Venables, W. N., and B. D. Ripley. 2002. Modern appliedstatistics with S. New York: Springer.

Watkins, C. A. 2001. The definition and identification ofhousing submarkets. Environment and Planning A 33 (12):2235–53.

Wheeler, D., and A. Paez. 2009. Geographically weighted re-gression. In Handbook of spatial analysis, ed. M. M. Fischerand A. Getis, 461–86. Oxford, UK: Elsevier.

Wheeler, D., and M. Tiefelsdorf. 2005. Multicollinearity andcorrelation among local regression coefficients in geo-graphically weighted regression. Journal of GeographicalSystems 7 (2): 161–87.

Correspondence: Institute of Geography, University of Heidelberg, Berliner Straße 48 D-69120, Heidelberg, Germany, e-mail: [email protected] (Helbich); [email protected] (Hagenauer); Credit Risk Methods Development, Strategic Risk Management &Control, Bank Austria—Member of UniCredit Group A-1090, Vienna, Austria, e-mail: [email protected] (Brunauer);Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA 70803, e-mail: [email protected] (Leitner).

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12

Page 19: Data-Driven Regionalization of Housing Markets

Data-Driven Regionalization of Housing Markets 19

Appendix

Table A1. Description and descriptive statistics of the variables

Abbreviation Description Source Effect Minimum Mean Maximum

lnp Log purchase price of thedwelling

BA 10.309 11.917 13.218

Structural covariateslnarea total Log of total floor area (except

cellar)BA + 3.778 4.846 6.204

lnarea plot Log of plot space BA + 4.382 6.498 7.824cond house Condition of the house (0 =

good, 1 = poor)BA – 0.000 1.000

heat Quality of the heating system (0= high, 1 = poor)

BA – 0.000 1.000

bath Quality of the bathroom/toilet(0 = high, 1 = poor)

BA – 0.000 1.000

attic Attic (0 = no, 1 = yes) BA – 0.000 1.000cellar Cellar (0 = no, 1 = yes) BA + 0.000 1.000garage Quality of the garage (0 = high,

1 = poor)BA – 0.000 1.000

terr Terrace (0 = no, 1 = yes) BA + 0.000 1.000Temporal covariates

age Age of building at time of sale BA – 1.000 24.360 81.000time Year of purchase (1998-2009) BA + 0.000 7.224 11.000

Neighborhood covariatesenumd unempl Unemployment rate (2009)

enumeration districtMB – 0.000 0.034 0.148

muni pp ind Purchase power index (2009)municipality level

SA + 65.000 102.796 148.500

muni acad Proportion of academics (2001)municipality level

SA + 3.716 15.099 40.795

muni age ind Age index (2001) municipalitylevel

SA – 33.040 39.321 46.136

ln muni popd Log population density (2001)municipality level

SA + –3.519 0.727 4.857

Note: BA = Bank Austria AG; MB = Michael Bauer Research; SA = Statistics Austria.

Dow

nloa

ded

by [

Vie

nna

Uni

vers

ity L

ibra

ry]

at 1

1:40

04

Sept

embe

r 20

12