Start-up valuation in Switzerland:
analysis and methods
Master Thesis
Candidate: Silvia Lama
Supervisor: Prof. Dr. Didier Sornette
ETH Zürich
Department of Management, Technology and Economics (D-MTEC)
Chair of Entrepreneurial Risks
June 2019 – November 2019
ABSTRACT
The aim of this master thesis is to provide an overview of start-up valuations in Switzerland. The
first part focuses on the analysis of funding rounds closed in Switzerland between 2010 and
2019: the existence of patterns and trends is investigated, visualized, and commented on. The
second part selects the best model to estimate a range of pre-money valuations for a target
start-up, to serve as a fair benchmark. Investors and co-founders could use it as a starting point
in their investment negotiation process. Indeed, traditional valuation methods1 cannot be applied
to start-ups, owing to their uncertainty, their short history, and the absence of publicly available
data on financials, comparable companies, or transactions. As a consequence, new valuation
methods have emerged; in the concluding chapter, they are compared to our approach,
stressing their lack of objectivity, contextuality, accuracy, and precision. Finally, several avenues
for further research are recommended.
1 (e.g. the Discounted Cash Flow, the Valuation Multiples)
ACKNOWLEDGEMENTS
I would like to express all my gratitude to Professor Didier Sornette, for the opportunity to
conduct my Master Thesis at the Chair of Entrepreneurial Risks and for his guidance, and to Dr.
Spencer Wheatley for the useful advice.
Besides, I would like to sincerely thank Steffen Wagner and Michael Blank, for the trust
demonstrated by choosing me to pursue this delicate and extremely interesting research
project at Investiere.
My warmest thank goes to Dr. Matteo Farnè, for the valuable support and interest in my work,
and for being a reliable point of reference in my life.
I also wish to heartily thank my mentor and angel investor, Professor Silvio Marenco, who has
been always believing in me, for all the care and time he has been investing in my professional
growth. Above all, during these years he taught me the value of respect and of building trustful
relationships.
If I achieved this goal, it is also thanks to the precious advice of my mentor Andrea Girardello. I
feel truly grateful to all his attention and effort, allowing me to avoid many mistakes, and
inspiring my winding path as a student-entrepreneur. He taught me to never give up, and that
it is always possible to find a smarter way to face challenges, by thinking outside the box.
This journey would not have been so special and unforgettable without the fantastic company
of my dearest friends, and of my lovely flatmates, bringing sparkling colours to every day of my
life.
Finally, I am enormously grateful to my family, who makes me feel the luckiest person on the
Earth, by supporting all my passions and activities, and by loving me as I am.
TABLE OF CONTENTS
Abstract ......................................................................................................................................... 3
Acknowledgements ....................................................................................................................... 4
1 Introduction .......................................................................................................................... 9
1.1 Motivation and overview .............................................................................................. 9
1.2 Research questions ..................................................................................................... 10
2 Start-up valuation methods ................................................................................................ 11
2.1 Overview ..................................................................................................................... 11
2.2 Scorecard method ....................................................................................................... 11
2.3 Berkus model .............................................................................................................. 12
2.4 Venture Capital Method ............................................................................................. 13
3 Data Collection .................................................................................................................... 14
3.1 Sources of data ............................................................................................................ 14
3.2 Process ........................................................................................................................ 14
3.3 Description and pre-processing of the data set .......................................................... 15
3.4 Log Transformation ..................................................................................................... 19
4 Multivariate Data Analysis .................................................................................................. 22
4.1 Treatment of missing data .......................................................................................... 22
4.2 Correlation analysis between continuous variables ................................................... 22
4.2.1 Methodology ....................................................................................................... 22
4.2.2 lThrough_Investiere analysis ............................................................................... 24
4.2.3 Employees analysis .............................................................................................. 30
4.2.4 Analysis of the entire data set ............................................................................. 33
4.3 Correlation Analysis between categorical variables ................................................... 41
4.3.1 Pooling levels together ........................................................................................ 41
4.3.2 Methodology ....................................................................................................... 44
4.3.3 Results ................................................................................................................. 44
4.4 Correlation analysis between continuous and categorical variables .......................... 46
4.4.1 Methodology ....................................................................................................... 46
4.4.2 Results ................................................................................................................. 46
5 Predicting the future success of a Swiss start-up ................................................................ 62
5.1 Overview...................................................................................................................... 62
5.2 Methodology ............................................................................................................... 62
5.3 Results ......................................................................................................................... 63
6 Multiple Regression Analysis ............................................................................................... 64
6.1 Purpose ........................................................................................................................ 64
6.2 Methodology ............................................................................................................... 64
6.2.1 Overview .............................................................................................................. 64
6.2.2 Steps .................................................................................................................... 65
6.3 Data pre-processing .................................................................................................... 66
6.4 Second Manual Variables selection (from 20 to 8 independent variables): ............... 66
6.5 Best models comparison ............................................................................................. 68
6.6 Best selected model .................................................................................................... 74
6.6.1 Confidence and Prediction intervals ................................................................... 74
6.7 MLR BLUE Assumptions Check .................................................................................... 75
6.7.1 Outlier detection ................................................................................................. 76
6.7.2 Check MLR assumptions ..................................................................................... 77
7 Conclusions ......................................................................................................................... 81
8 References ........................................................................................................................... 85
9 Appendix: Model specification and selection ..................................................................... 88
9.1 Automated Models ..................................................................................................... 88
9.1.1 Stepwise regression ............................................................................................ 88
9.1.2 Best Subset Selection .......................................................................................... 88
9.2 Third manual variable selection (from 8 to 5 predictors) ........................................... 91
9.2.1 Basic.model ......................................................................................................... 91
9.2.2 Results ............................................................................................................... 104
9.3 Interaction Terms ...................................................................................................... 104
9.3.1 Stepwise regression .......................................................................................... 105
9.3.2 Best Subset Selection ........................................................................................ 105
9.4 Further methods used to select the best model ....................................................... 106
1 INTRODUCTION
1.1 MOTIVATION AND OVERVIEW
My personal passion for entrepreneurship is a long and most probably never-ending journey
that started in September 2012. On that day, instead of attending classes at high school, my
best friend (later my start-up partner) and I attended a cycle of lectures on student
entrepreneurship, organized by the University of Bologna. That day, I understood that my way
was to be an entrepreneur, and I set the goal of founding my own company at around 30 years old.
But the occasion presented itself much earlier: at 21 I grasped it and founded my first start-up,
Musa, in the education-technology industry.
When the need for a second investment round was approaching, I faced a big question mark:
what is the value of our company? It is not yet profitable, it has very low revenue, and its risk is
hard to quantify; as a result, all traditional financial methods for valuing companies are of
no help. By interacting with founders and with private and institutional investors, it turned out
that the lack of an objective method to value early-stage start-ups is a common, worldwide
problem.
Indeed, nowadays the pre-money valuation of a start-up is nothing but the result of a
negotiation process between the investor and the founders. It requires on average 7-9 months2
and enormous effort from both parties (Clarysse and Kiefer, 2011). Often, the result is mainly
determined by the negotiation power of each actor (e.g. the number of interested investors and
their experience, or the start-up's urgency for money), rather than by the value of the underlying
risky business. As in a poker match, each player withholds information and tries to convince the
opponents that his hand is better than it actually is. But, unlike poker, the participants in
investment negotiations should communicate complete information and work together toward
the shared goal of growing a successful business. The valuation, in fact, is only one part of the
investment process, and it often leads to controversies that get the founder-investor
relationship off on the wrong foot (Villalobos, 2007).
2Average time elapsed in Switzerland between the business plan submission to a Venture Capital firm and the
actual investment
When I approached the Swiss VC Investiere, we found common ground in investigating this
subject, which, under the supervision of Prof. Dr. Sornette (Chair of Entrepreneurial Risks),
became the topic of the present Master thesis.
This chapter proceeds with the presentation of the specific research questions that the project
wants to address, while in Chapter 2, we will review the literature about start-up valuation
methods.
Chapter 3 is dedicated to the description of the data set used in our research and its collection
process, while its multivariate analysis is visualized and commented in Chapter 4.
Further chapters, instead, have the ambitious aim of creating and selecting models that, given
the data of a specific start-up as input, predict its future success with the minimum error rate
(Chapter 5) and estimate a benchmark for its pre-money valuation with the highest possible
accuracy and precision (Chapter 6).
Keeping in mind that “all models are wrong, some models are useful”, we summarize our main
findings in Chapter 7, compare them to the literature, and finally suggest possible trajectories
for future research on the topic.
1.2 RESEARCH QUESTIONS
The goal of this research is to investigate the pre-money valuations achieved by Swiss start-ups
at their investment rounds, between 2010 and 2019. In particular, in the following chapters we
will address mainly, but not only, the following questions:
− Are there differences in start-up valuations between industries? (Chapter 4.4.2.2)
− Does the type of lead investor involved in the round have an influence on the valuation?
(Chapter 4.4.2.4)
− Is there a correlation between the size of the round and the total funding previously
raised? And with the pre-money valuation? (Chapter 4.2.3)
− Can we predict the future success of a start-up based on its current status? (Chapter 5)
− What is the best model to estimate the pre-money valuation of a start-up? (Chapter 6)
2 START-UP VALUATION METHODS
2.1 OVERVIEW
Traditional valuation methods for companies are usually based on the forecasted revenue and
profit that an organization is expected to make. When it comes to early-stage start-ups, these
classical financial formulae miserably fail. In fact, these firms are not profitable yet, and have
very low or zero revenue, so their valuation is inevitably determined by other factors.
In the following paragraphs, we briefly describe the main available methods to value early-stage
start-ups. These approaches are vague and leave room for any kind of interpretation.
Behrmann (2016) showed, by applying them to the same firm, how different the resulting
valuations can be, and demonstrated that the same valuation method, when used on different
firms, may understate as well as overstate their market values. Finally, he stresses that a
valuation obtained through these methods can only be as good as its assumptions: a change in
just one number can dramatically alter the results (Behrmann, 2016).
2.2 SCORECARD METHOD
The Scorecard method, also known as the Bill Payne valuation method, compares pre-money,
pre-revenue start-ups to average valuations, and then adjusts that average according to certain
metrics. Following Payne (2011), it is first necessary to survey the pre-money, pre-revenue
valuations assigned by venture capitalists or private angels to start-ups in the industry and
region of the target company. Next, the start-up is compared qualitatively to the comparable
start-ups in the valuation survey, in accordance with the following categories and weights (Table 2.1):
Strength of the Management Team 0-30%
Size of the Opportunity 0-25%
Product/Technology 0-15%
Competitive Environment 0-10%
Marketing/Sales Channels/Partnerships 0-10%
Need for Additional Investment 0-5%
Other 0-5%
Table 2.1: Scorecard method: valuation categories with corresponding weights.
When the actual assessment is performed, the start-up is compared to the average of the
surveyed start-ups. An average team, relative to the comparable companies, would be awarded
the full 30% (a factor of 0.30). When the valuation subject has a far better than average team,
e.g. 150% of the average, the resulting factor would be 0.45. Conversely, if in one category the
start-up underperforms the peer group, less than the full weight is assigned. In the end, the sum
of all factors is multiplied by the average valuation obtained from the valuation survey. If, for
example, we have a total factor of 1.2 (an above-average venture) and a mean valuation on the
market of €1.6 million, the target would be valued at €1.92 million pre-money. Table 2.2
showcases a complete scorecard valuation of a start-up.
Table 2.2: Exemplary assessment of a start-up using the Scorecard method (from Gunn, 2016)
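The arithmetic of the Scorecard method can be sketched as follows. The category ratings are hypothetical (a team at 150% of the average and an "Other" category at 200%, chosen so that the total factor matches the 1.2 of the example above); they are not taken from Gunn's assessment in Table 2.2.

```python
# Scorecard (Bill Payne) method: hypothetical sketch.
# Each category has a maximum weight; the target's rating is expressed
# relative to the average comparable start-up (1.0 = average).
weights = {
    "team": 0.30, "opportunity": 0.25, "product": 0.15,
    "competition": 0.10, "marketing": 0.10, "need_for_funding": 0.05,
    "other": 0.05,
}
ratings = {  # hypothetical ratings summing to a total factor of 1.2
    "team": 1.5, "opportunity": 1.0, "product": 1.0,
    "competition": 1.0, "marketing": 1.0, "need_for_funding": 1.0,
    "other": 2.0,
}

def scorecard_valuation(avg_valuation, weights, ratings):
    """Sum of weight * rating over all categories, times the survey average."""
    factor = sum(weights[c] * ratings[c] for c in weights)
    return factor * avg_valuation

print(scorecard_valuation(1_600_000, weights, ratings))  # ≈ 1_920_000 (EUR 1.92M pre-money)
```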
2.3 BERKUS MODEL
The Berkus model was developed and proposed by the angel investor Dave Berkus (2009) to
value very early-stage companies with zero or very low revenue, but with the potential of
reaching over $20 million in revenues within five years. According to him, "the universal truth
is that fewer than one in a thousand start-ups meet or exceed their projected revenues in the
periods planned" (Berkus, 2009). Therefore, his method for establishing an initial pre-money
valuation does not take financials into account.
Example of motivations behind the choice of the % of Norm:
− A few co-founders, a not yet established Advisory Board
− The market is there and it is growing
− The concept is nailed down, Minimum Viable Product is in development
− Competition definitely exists, however company has a business model supposed to be disruptive
− No Sales yet, Partnerships are in place for distribution
− In need of $$ to finish development, launch, test, etc
− Tested the market, have positive feedback
If Exists: Add to company value UP to:
1. Sound Idea (basic value, product risk) USD 0.5m
2. Prototype (reducing technology risk) USD 0.5m
3. Quality Management Team (reducing execution risk) USD 0.5m
4. Strategic relationships (reducing market risk and competitive risk) USD 0.5m
5. Product Rollout or Sales (reducing financial or production risk) USD 0.5m
Table 2.3: The Berkus Model: valuation dimensions
Berkus's proposition is to add up to half a million USD per dimension, depending on the degree
to which the start-up fulfils the respective criteria shown in Table 2.3. Once the company starts
to generate revenues, Berkus states, this method loses credibility, and most everyone will use
the actual revenues to project value over time.
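A minimal sketch of the Berkus arithmetic, assuming each of the five dimensions of Table 2.3 is scored on a 0-1 fulfilment scale (the scores below are hypothetical):

```python
# Berkus model: add up to USD 0.5m per dimension, scaled by the degree
# (0.0-1.0) to which the start-up fulfils each one.
BERKUS_CAP = 500_000  # maximum value added per dimension, in USD

def berkus_valuation(fulfilment):
    """fulfilment: dict mapping dimension name -> score in [0, 1]."""
    return sum(min(max(s, 0.0), 1.0) * BERKUS_CAP for s in fulfilment.values())

# Hypothetical assessment of the five dimensions of Table 2.3:
scores = {
    "sound_idea": 1.0, "prototype": 0.8, "management_team": 0.6,
    "strategic_relationships": 0.4, "rollout_or_sales": 0.0,
}
print(berkus_valuation(scores))  # ≈ 1_400_000 USD pre-money
```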
2.4 VENTURE CAPITAL METHOD
The third most common method for start-up valuation is the Venture Capital Method. It was
first introduced by Sahlman and Scherlis (1989), then revised in a 2009 Harvard Business School
case study, and it is also thoroughly described by Engel (2002). The procedure starts by estimating
the terminal value (TV) for the company in some years from now, when the exit is planned: for
that year, revenues are estimated and translated to a TV by multiplying them with P/E ratios or
sales multiples of similar companies in the industry. As an example, a venture has estimated
revenues of € 15 M in five years (t), with similar businesses having a sales multiple of two. This
leads to a TV of approximately € 30 M in five years. This value is then discounted to the present
day, with the discount rate (r) estimated by the VC, usually the required internal rate of return
(IRR) or generally a target rate of return (Damodaran, 2007). Let's say that, as this is a quite risky
business, r is 60%. This would translate to a present value of PV = 30M / (1 + 0.6)^5 ≈ 2.86M. In
Table 2.4 we show a summary of the steps, adapted from Engel (2002):
Step 1 Estimating terminal value:
TV = (P/E ratio) * Earnings
or
TV = (Sales multiple) * Sales
Step 2 Determining present value: PV = TV / (1 + r)^t
Step 3 Calculating demanded ownership fraction: F = (Round size) / PV
Table 2.4: The Venture Capital Method: summary of steps (Engel, 2002).
Engel (2002) also states that the pre-money valuation is calculated by subtracting the round
size from the post-money valuation.
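The three steps of Table 2.4, together with Engel's pre-money relation, can be sketched as follows. The round size of 1M is a hypothetical addition; the other numbers reproduce the worked example above.

```python
# Venture Capital Method: terminal value -> present value -> ownership fraction.
def vc_method(revenue_at_exit, sales_multiple, r, t, round_size):
    tv = sales_multiple * revenue_at_exit  # Step 1: terminal value at exit
    pv = tv / (1 + r) ** t                 # Step 2: discount to present (post-money) value
    fraction = round_size / pv             # Step 3: demanded ownership fraction
    pre_money = pv - round_size            # Engel (2002): pre-money = post-money - round size
    return tv, pv, fraction, pre_money

# Worked example: 15M revenues in 5 years, sales multiple 2, r = 60%,
# and a hypothetical round size of 1M.
tv, pv, f, pre = vc_method(15e6, 2, 0.60, 5, round_size=1e6)
print(round(pv))  # 2861023, i.e. ≈ 2.86M as in the text
```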
3 DATA COLLECTION
3.1 SOURCES OF DATA
This research project would not have been possible without the collaboration of the following
organizations, which allowed us direct or indirect access to their data sets concerning start-up
investment rounds:
• Investiere | Verve Capital Partners AG: A key role has been played by the Swiss Venture
Capital Investiere, by raising the need to investigate Swiss start-up valuations and to
provide the conditions to pursue the analysis.
• Dr. Hervé Lebret: He supported our research by sharing with us the data set behind his
study "The Analysis of 500+ start-ups", published on www.startup-book.com (Lebret,
2019). He is the Manager of Innogrants (EPFL) and a Senior Scientist in the field of
high-tech entrepreneurship; his research concentrates on academic spin-offs, including
those of Stanford University and Silicon Valley.
• Startupticker.ch: this organization shared with us all data from their annual Venture
Capital reports, from 2012 to 2018. Startupticker.ch is the main online news portal
about young Swiss companies.
• Commercial Registries of the Swiss Confederation: through the Registries of
Commerce (Zefix) online portal it is possible to gain access to the legal acts of Swiss
companies for some cantons of Switzerland. These acts include some details of the
funding rounds (e.g. post-money valuation, number of issued shares).
• Crunchbase: this platform provides company insights, and we extracted from it some
data about the analysed start-ups.
3.2 PROCESS
Data collection has been the research task that required the most time and effort overall. The
result is a unique collection of precious, extremely confidential, and sensitive data regarding the
details of start-ups' investment rounds (306 samples overall, concerning 190 companies).
Because this data is protected by non-disclosure agreements between investors and
co-founders, all contents and results of the research will be shared anonymously.
A first batch of samples was provided by Investiere | Verve Capital Partners AG, related to
the investment rounds in which it was directly involved as an investor. After that, data collection
proceeded in two simultaneous directions:
• Search of new sources of data, by directly contacting all the main active organizations
in the Swiss start-up ecosystem (e.g. incubators, accelerators, University technology
transfer offices, facilitators, investors’ clubs).
• Search of data for specific start-ups to replace missing values (e.g. Swiss commercial
registries, CBInsight, start-ups).
3.3 DESCRIPTION AND PRE-PROCESSING OF THE DATA SET
First of all, we pre-process the collected data to ensure its integrity and coherence. For the
purpose of this analysis, we decide to focus only on equity investment rounds pursued by Swiss
start-ups between 2010 and 2019. Therefore, we remove 12 samples concerning convertible
investment rounds and 8 samples related to non-Swiss companies. We can thus exclude all
variables regarding the details of the convertible rounds, plus the following variables: Round
Type (as we consider only equity rounds) and Company name; in fact, this research could only
be pursued anonymously, because of the signed NDAs protecting the data. We can also remove
the variable Data_source, because samples have been randomly collected from different
sources, and we can assume zero correlation between the values and their original source.
After this preliminary selection of variables and samples, we are left with a data set comprising
286 observations and 16 variables.
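Assuming the raw samples are stored as one record per funding round, the selection just described can be sketched as follows (field names such as Round_Type mirror the variable names of this chapter and are assumptions about the underlying storage, not the thesis's actual code):

```python
def preprocess(rows):
    """rows: list of dicts, one per funding round.
    Keep only Swiss equity rounds; drop non-informative fields."""
    drop = {"Round_Type", "Company_name", "Data_source"}
    kept = [r for r in rows
            if r["Round_Type"] == "Equity" and r["Location"] == "Switzerland"]
    # Round_Type is now constant, Company_name must stay anonymous,
    # and Data_source is assumed uncorrelated with the values.
    return [{k: v for k, v in r.items() if k not in drop} for r in kept]

# Illustrative records (not real data):
sample = [
    {"Round_Type": "Equity", "Location": "Switzerland",
     "Company_name": "A", "Data_source": "s1", "Pre_valuation": 1e6},
    {"Round_Type": "Convertible", "Location": "Switzerland",
     "Company_name": "B", "Data_source": "s2", "Pre_valuation": 2e6},
    {"Round_Type": "Equity", "Location": "Germany",
     "Company_name": "C", "Data_source": "s3", "Pre_valuation": 3e6},
]
print(len(preprocess(sample)))  # 1
```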
In Table 3.2 we show an overview of this starting data set (a legend is provided in Table 3.1),
while in Table 3.3 we provide an extended variables’ description.
Abbreviation Meaning
Cat Categorical
N Nominal
D Dichotomous
O Ordinal
I Interval
Num Numeric
C Continuous
Dis Discrete
Table 3.1: Legend of Table 3.2
Variable’s name Type Short description Nr. of groups % NA’s
Foundation_Year Num Dis Company’s foundation year / 0.00
Round_name Cat N Harmonized official round name 9 0.00
Industry Cat N Company’s industry 9 0.00
Stage Cat O Company’s development stage 7 61.54
Pre_valuation Num C Pre-money Valuation / 0.00
Prev_raised Num C Tot. funding previously raised / 0.00
Amount_raised Num C Size of the investment round / 0.00
Through_Investiere Num C Amount invested by Investiere / 54.89
Type_Lead_Investor Cat N Type of the main Investor 4 4.19
Profitable Cat D Is the company profitable? 2 62.24
Revenue Cat O, I Last 12 months revenues 6 59.79
Closing_Year Num Dis Year of round’s closure (10) 1.05
Still_operating Cat N Company’s present status 3 0.00
Employees Num Dis Nr. of employees / 82.87
Currency Cat N Funding’s currency 1 0.00
Location Cat N Company’s legal location 1 0.00
Table 3.2: Data set overview
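As an illustration, the "% NA's" column of Table 3.2 can be reproduced by a small helper of this kind, assuming each funding round is stored as a dictionary with missing values encoded as None (the variable names and rows below are illustrative, not the real data set):

```python
def na_percentages(rows):
    """Percentage of missing (None) values per variable, as in Table 3.2."""
    cols = rows[0].keys()
    n = len(rows)
    return {c: round(100 * sum(r[c] is None for r in rows) / n, 2) for c in cols}

# Four illustrative records: Stage is missing in 2 of 4, Industry in 1 of 4.
rows = [{"Stage": None, "Industry": "ICT"},
        {"Stage": "Growth", "Industry": "Biotech"},
        {"Stage": None, "Industry": "Medtech"},
        {"Stage": "Growth", "Industry": None}]
print(na_percentages(rows))  # {'Stage': 50.0, 'Industry': 25.0}
```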
Foundation_Year Year in which the company has been officially incorporated. Numerical, discrete variable.
Round name
Harmonized name of the round used in the official company documentation. Categorical, nominal, 9 groups:
− Pre-seed
− Seed round
− Series A Round
− Series B Round
− Series C Round
− Series D Round
− Series E Round
− Pre-Exit
− IPO
Industry
Area of Business according to the Swiss Venture Capital Report: Categorical, nominal, 9 groups:
− Biotech
− Cleantech
− Consumer Products
− Fintech
− Healthcare
− ICT
− Medtech
− Micro / Nano
− Other
Stage
Indicates the development stage of the Company, at the time of Closing. Categorical, ordinal, 7 groups
− Idea (0 samples)
− Prototyping
− Beta-Phase
− Clinical Trials
− First Clients
− Growth
− Internationalisation
Pre_valuation
Pre-Money Valuation: the value of a company just before that specific round of financing. When summed with Amount_raised, it gives the post-money valuation (Frei and Leleux, 2004). Unit of measure is CHF. Numerical, continuous
Prev_raised The sum of all funds the startup raised since incorporation until the moment just before closing that specific investment round. Unit of measure is CHF. Numerical, continuous
Through_investiere VC Investiere's tranche of the respective financing round. Unit of measure is CHF. Numerical, continuous
Type_Lead_Investor or (TLI)
Type classification of the lead Investor of the investment round. Categorical, nominal, 4 groups:
− Accelerator/incubator: accelerators and incubators are organizations helping start-ups attain success. Incubators usually offer dedicated office and development space to the start-ups for a set period of time, and a first grant or funding round to allow start-up’s incorporation and beginning of activities. Start-up accelerators tend to focus on providing mentorship, and resources to help the start-ups succeed, but usually tend not to offer dedicated office space. Accelerators and incubators usually get involved at early-stage. Some of them focus on a specific industry, market, technology, whereas others are generalists. Start-ups are usually admitted in batches, after a screening process (Isabelle, 2013).
− Private Angel: an angel investor (also known as a business angel, informal investor, angel funder, private investor, or seed investor) is an affluent individual who provides capital, advice and contacts to a start-up, usually in exchange for convertible debt or ownership equity. Unlike venture capitalists, they usually play an indirect role as advisors in the operations of the investee firm (Wong, Bhatia, and Freeman, 2009).
− Institutional Financial: an institutional investor is an organization that invests on behalf of its members. A financial investor invests in a business merely to maximize its financial returns, over a specified period of time. These investors often take board seats and add value by introducing co-founders to a larger network, or help in terms of strategy, hiring, financials and industry insights (Arping and Falconieri, 2009).
− Institutional Strategic: an institutional investor is an organization that invests on behalf of its members. Strategic investors are not only looking for a return on their capital, but for a ‘strategic’ scope: access to technology/assets, new market or target segment. They are more
patient to see returns on their investments than financial investors (Arping and Falconieri, 2009).
Profitable
It answers the following question: is the company profitable at the time of the investment round's closure (i.e. is it in the condition of yielding a financial profit or gain)? Categorical, dichotomous
− Yes: the company is profitable
− No: the company is not profitable
Revenue
The income generated in the last 12 months before the funding round, from sale of goods or services, or any other use of capital or assets, associated with the main operations of an organization before any costs or expenses are deducted. Also called sales, or (in the UK) turnover. Categorical, ordinal, interval, 6 groups:
− 0 - 50k
− 50k - 100k
− 100k - 500k
− 500k – 1M
− 1M – 5M
− >5M
Closing_Year
Year of closing of the investment round. We created two identical variables for this data: one is a numerical discrete variable, the second one is categorical ordinal, with 10 groups (we will later decide which variable is the most useful for our analysis):
− 2010
− 2011
− 2012
− 2013
− 2014
− 2015
− 2016
− 2017
− 2018
− 2019
Still_operating
Indicates the current3 status of the company. Categorical, nominal, 3 groups:
− Yes: the company is still operating (i.e. an active company)
− No: the company has been liquidated
− Exit: the company has been acquired by another company
Employees Number of employees of the start-up at the time of the funding round.
3 Last update: Oct 2019
Location
Country of the start-up’s registered office. Categorical, nominal, 1 group:
− Switzerland
Currency
Primary currency of the financing round. Categorical, nominal, 1 group:
− CHF
Table 3.3: Extended variables description
3.4 LOG TRANSFORMATION
By analysing the distributions of some continuous variables (Pre_valuation, Amount_raised,
Prev_raised, and Through_Investiere) we can state that they are all far from normality, implying
restrictions in the application of statistical methods that strictly assume normal distributions
(e.g. Pearson's correlation, ANOVA). By observing their distributions, the best transformation
we can apply is the natural logarithm. In the following graphs (Figures 3.1 and 3.2) we report
the significant improvement achieved thanks to this transformation4. Nevertheless, if we apply
it a second or third time (e.g. log(log(log(Pre_valuation)))), the additional improvement is
not significant.
If we now test the normality of these transformed variables, for example with the Shapiro-Wilk
test, we are still forced to reject the null hypothesis of normality. We can notice, in fact, from
the graphs in Figure 3.1, very long tails in the variables' distributions and some degree of
skewness. Actually, these tails could just be outlier cases: if we detect outliers with the R
function aq.plot, we obtain that 38.81% of the samples are outliers. Of course, such a large
proportion does not allow us to remove them now. Anyway, this is not an issue: linear regression
analysis does not assume normality for either predictors or outcome. The main role, instead, is
played by the distribution of the residuals. (The distribution of residuals, together with outlier
detection, will be examined for specific models during the regression analysis, paragraph 6.7.2.)
4 As the log(0) is undefined, we add 0.1 to all zero values in these continuous variables before applying the log transformation.
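The transformation and normality check described above can be sketched as follows. This is a minimal Python illustration on synthetic data (the thesis itself works in R and on confidential data): zeros are shifted to 0.1 before taking the natural log, and the Shapiro–Wilk test is then applied to both the raw and the transformed variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for Pre_valuation: roughly log-normal, plus a few zeros
pre_valuation = np.concatenate([rng.lognormal(mean=15, sigma=1.2, size=280),
                                np.zeros(6)])

# As log(0) is undefined, replace zeros with 0.1 before the log transformation
shifted = np.where(pre_valuation == 0, 0.1, pre_valuation)
l_pre_valuation = np.log(shifted)

# Shapiro-Wilk test: a small p-value leads us to reject the null of normality
w_raw, p_raw = stats.shapiro(pre_valuation)
w_log, p_log = stats.shapiro(l_pre_valuation)
print(f"raw: p = {p_raw:.2e}   log-transformed: p = {p_log:.2e}")
```

On heavily right-skewed data the raw variable is rejected decisively; after the log transformation the fit improves markedly, though (as in the thesis) the shifted zeros create a left tail that can still lead to rejection.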
Figure 3.1: On the left graphs we use log-log axis, base 10: (top) adaptive Kernel density estimation of pre-money valuation of a company at each round (Pre_valuation); (bottom) adaptive Kernel density estimation of funds raised before a round (Prev_raised). On the right, top and bottom: Kernel density estimation of the same variables, after their natural log transformation. Red lines indicate the median.
Figure 3.2: On the left graphs we use log-log axis, base 10: (top) adaptive Kernel density estimation of the amount raised at each round, in CHF (Amount_raised); (bottom) Kernel density estimation of the amount invested by the VC Investiere, in CHF (lThrough_investiere) at each round. On the right, top and bottom: Kernel density estimation of the same variables after their natural log transformation. Red lines indicate the median.
4 MULTIVARIATE DATA ANALYSIS
4.1 TREATMENT OF MISSING DATA
During this research project, we spent most of the time and effort on the data collection
process (paragraph 3.2). The aim of this phase was not only to collect as many samples as
possible, but also to replace missing data with the true values. After this long,
time-consuming process, we list the variables sorted by decreasing fraction of missing data:
Variable                    NA's
Employees                   0.83
Profitable                  0.62
Stage                       0.62
Revenue                     0.60
Through investiere (CHF)    0.55
Type_Lead_Investor          0.04
Closing_Year                0.01
Foundation Year             0.00
Industry                    0.00
Round_name                  0.00
Still_operating             0.00
Amount_raised               0.00
Pre_valuation               0.00
Prev_raised                 0.00
Country                     0.00
Currency                    0.00
We see that only 8 of the 20 variables (16 original + 4 log-transformed) still contain missing
values. For the purpose of our research, imputation of the missing data would be misleading and
unhelpful, due to the low ratio of available samples per variable. We therefore decide to keep
all NA's and all samples, in order to avoid loss or distortion of information.
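The missing-data summary above can be sketched as follows, using pandas on a hypothetical miniature data set (the real data are confidential, so the column values here are invented for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set mimicking the structure of the thesis data
df = pd.DataFrame({
    "Employees":     [12, np.nan, np.nan, np.nan, np.nan],
    "Revenue":       ["0-50k", np.nan, np.nan, "1M-5M", np.nan],
    "Pre_valuation": [2e6, 5e6, 1e7, 3e6, 8e6],
})

# Fraction of missing values per variable, sorted in decreasing order
na_fraction = df.isna().mean().sort_values(ascending=False).round(2)
print(na_fraction)
```

`df.isna().mean()` gives the per-column fraction of NA's directly, since the mean of a boolean mask equals the proportion of True values.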
4.2 CORRELATION ANALYSIS BETWEEN CONTINUOUS VARIABLES
4.2.1 Methodology
In order to analyse the correlations existing between the continuous variables in the data set,
we will adopt the following statistical tools:
A. Correlation matrix
B. Scatterplot
C. Boxplot
The correlation matrix will be calculated with Kendall's Tau method5. We generally
consider a correlation between two variables "low" if its absolute value is below 0.3,
"moderate" if it lies between 0.3 and 0.7, and "strong" if it is above 0.7.
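A small illustration of Kendall's Tau and of the strength thresholds above (a Python sketch on made-up ranks; `scipy.stats.kendalltau` is assumed here as the counterpart of the R routine used in the thesis):

```python
from scipy.stats import kendalltau

def strength(tau: float) -> str:
    """Classify |tau| using the thresholds adopted in the text."""
    t = abs(tau)
    if t < 0.3:
        return "low"
    if t <= 0.7:
        return "moderate"
    return "strong"

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 3, 2, 5, 4, 7, 6, 8]   # mostly concordant pairs, a few swaps
tau, p_value = kendalltau(x, y)
print(round(tau, 3), strength(tau))   # tau = (25 - 3) / 28 ≈ 0.786 -> "strong"
```

With 28 pairs in total, 25 concordant and 3 discordant, tau = (25 − 3)/28 ≈ 0.786, which the thresholds classify as strong.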
Scatterplots graphically show the linear fit of each pair of variables. The regression line and
correlation coefficients allow us to distinguish which pairs of variables show an interesting,
significant correlation and which do not. Scatterplot analysis is useful in preparation for the
Multiple Regression Analysis (Chapter 6): MLR requires the relationships between the
independent and dependent variables to be linear, and this linearity assumption is best tested
and visualized through scatterplots.
Boxplots make it easy to identify outliers and to schematically visualize the distribution of
each variable. For normally distributed data, the values within ±2.698 sigma stay within the min
and max whisker lines of each boxplot:
MIN = max[MIN, Q1 - 1.5*(Q3-Q1)]
MAX = min[MAX, Q3 + 1.5*(Q3-Q1)]
The remaining extreme values are identified as outliers and represented in the boxplot beyond
the whiskers (we are neither interested in, nor allowed to, identifying the companies to which
they correspond).
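The whisker definitions above can be sketched as follows (a Python illustration on invented numbers):

```python
import numpy as np

def boxplot_whiskers(values):
    """Whisker positions as defined in the text:
    MIN = max[min(x), Q1 - 1.5*IQR],  MAX = min[max(x), Q3 + 1.5*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo = max(x.min(), q1 - 1.5 * iqr)
    hi = min(x.max(), q3 + 1.5 * iqr)
    outliers = x[(x < lo) | (x > hi)]   # values drawn beyond the whiskers
    return lo, hi, outliers

data = [10, 12, 13, 14, 15, 15, 16, 18, 40]   # one extreme value
lo, hi, out = boxplot_whiskers(data)
print(lo, hi, out)   # the value 40 falls beyond the upper whisker
```

Here Q1 = 13, Q3 = 16, so the upper whisker sits at min(40, 16 + 4.5) = 20.5 and only the value 40 is flagged as an outlier.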
As we have just seen in the previous paragraph, there are two continuous variables with a very
high fraction of missing data, while the other four continuous variables have approximately
100% of their values available. Employees – the number of employees of the start-up at the time
of a specific round – has 82.9% NA's, while the continuous transformed variable
lThrough_investiere – the amount invested in that specific financing round by the VC Investiere
– has 55.1% missing values. Keeping these variables in our further analysis would therefore
make us neglect over 82% of our samples. Besides, studying the relationship between
lThrough_investiere and the other variables is not meaningful for all Swiss investment rounds,
but only for those in which Investiere was actually involved (96 samples out of 286, i.e.
33.6%). For these reasons, we now conduct a separate analysis for these two continuous
variables, and will then omit them from our data set in order to take all available rounds into
account.
5 As we noticed in paragraph 3.4, none of the continuous variables follows a normal distribution, as shown by the
Shapiro–Wilk test. We therefore calculate correlation values with Kendall's Tau method instead of Pearson's, which
assumes normality.
4.2.2 lThrough_Investiere analysis
We now focus our analysis on the rounds for which we know whether, and how much, the VC
Investiere contributed. In Figure 4.1, we represent the frequency distribution of the natural
log of the amount invested by the VC Investiere, in CHF, at each round (lThrough_investiere),
omitting all its NA values (55%). We clearly see two relative maxima in this distribution; the
lower one corresponds to rounds in which Investiere did not invest. In Figure 4.2, instead, we
consider only the rounds in which Investiere invested (96 out of 129).
Figure 4.1: On the x-axis: the natural log of the amount invested by the VC Investiere in CHF (lThrough_investiere) at each round. On the top of the figure, a boxplot representation of this variable. On the y-axis: the Kernel density estimation (higher density for higher probability of seeing a point at that location). N is the number of samples, and Bandwidth is the parameter controlling the smoothness of the curve (higher values make smoother curves), and it equals the standard deviation of the kernel used. The red line indicates the median.
When measuring the relationships among continuous variables with the Kendall’s Tau method,
the correlations involving lThrough_investiere are not significant or very low. This changes if we
calculate correlations by considering only the rounds in which Investiere actually invested (96
samples). The corresponding correlation matrix is in figure 4.3, where non-significant values
(significance level is 0.05) are hidden by black crosses.
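The masking of non-significant entries in such a matrix can be sketched as follows. This is a Python illustration on synthetic data: the variable names are mere placeholders for the thesis variables, and NaN plays the role of the black crosses in the figures.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_matrix(columns, alpha=0.05):
    """Pairwise Kendall's Tau; entries with p >= alpha are masked with NaN,
    mirroring the black crosses in the correlation matrix figures."""
    names = list(columns)
    k = len(names)
    mat = np.full((k, k), np.nan)
    for i in range(k):
        mat[i, i] = 1.0
        for j in range(i + 1, k):
            tau, p = kendalltau(columns[names[i]], columns[names[j]])
            if p < alpha:
                mat[i, j] = mat[j, i] = tau
    return names, mat

rng = np.random.default_rng(1)
a = rng.normal(size=60)
cols = {"lAmount_raised": a,
        "lPre_valuation": a + 0.5 * rng.normal(size=60),  # strongly related
        "Foundation_Year": rng.normal(size=60)}            # independent noise
names, mat = kendall_matrix(cols)
print(np.round(mat, 2))
```

The strongly related pair survives the significance filter, while correlations of the pure-noise column are typically masked out.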
Except for Foundation_year, all the other variables now have a significant, positive correlation
with lThrough_investiere. We now analyse them in more detail with the following scatterplots
(considering only the rounds in which we know that Investiere invested money).
In Figure 4.4, the correlation coefficient is r = 0.403 (moderate) and significant (p << 0.05).
The trend is visible but noisy. The explanation of this correlation lies in the moderate
correlation between lAmount_raised and lPre_valuation. In fact, lThrough_investiere is a value
in the interval [0; lAmount_raised], and its relationship with lAmount_raised is represented in
Figure 4.5. Unsurprisingly, the companies in which Investiere invested more also raised more
money. So, the real reason behind the previous trend and correlation (Figure 4.4) is that
lThrough_investiere is moderately correlated with lAmount_raised, which is in turn
moderately-to-strongly correlated with lPre_valuation, as shown in the matrix (R = 0.57,
Figure 4.3).
Figure 4.2: On the x-axis: the natural log of the amount invested by the VC Investiere in CHF (lThrough_investiere). Here we only consider rounds in which Investiere actually invested. On the top of the figure, a boxplot representation of this variable. On the y-axis: the Kernel density estimation (higher density for higher probability of seeing a point at that location). Bandwidth is the parameter controlling the smoothness of the curve (higher values make smoother curves), and it equals the standard deviation of the kernel used. The red line indicates the median.
Figure 4.3: Correlation matrix summarizing all correlations among continuous variables. Only rounds in which the VC Investiere participated are considered. Following the legend, numbers in a colour tending to red suggest negative correlations, while numbers in a colour tending to blue indicate positive correlations. Insignificant values are hidden by a black cross (significance level threshold is p-value= 0.05).
Figure 4.4: Scatterplot of the natural log of Pre-money valuation in CHF (lPre_valuation) and the natural log of the amount invested by the VC Investiere in CHF (lThrough_investiere), considering only rounds in which the VC Investiere participated. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
We now want to analyse the relationship between lThrough_investiere and lPrev_raised
(Figure 4.6). Here too, the trend is disturbed by the significant proportion of samples with
zero money previously raised. The correlation remains low even if we exclude this proportion of
samples (Figure 4.7).
Figure 4.5: Scatterplot of the log of the size of the round (lAmount_raised) and the log of the amount invested by Investiere (lThrough_investiere), considering only rounds in which the VC Investiere participated. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
Figure 4.6: Scatterplot of the log of the funds previously raised (lPrev_raised) and the log of the amount invested by Investiere (lThrough_investiere), considering only rounds in which the VC participated. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
It is interesting to notice that in 57.29% of the cases in which Investiere invested, this was
the first investment received by that specific start-up. This is in accordance with what we read
on Investiere's website, F.A.Q. page (https://www.investiere.ch/startup-vc-investment/):
"When do you invest?
We invest in early stage as well as growth stage rounds. A pitch deck or idea without validation
is not sufficient. The right timing for a funding round can vary depending on the industry or
other factors but generally being able to show market traction, proof of technology and a
complete and well-functioning core team are decisive factors."
So, there is no doubt that, for Investiere, zero money previously raised is not an obstacle to
its investment commitment. The graph also tells us that, if the start-up has already raised
money in the past, the amount then invested by Investiere tends to increase slightly, with a
significant correlation coefficient of 0.27.
We will not show the relationship between lThrough_investiere and Foundation Year, as it is low
and not significant. Nevertheless, we know that Investiere invested only in companies founded in
the last 15 years, except for one outlier. The histogram (Figure 4.8) represents the number of
investments made by Investiere, broken down by the Foundation Year of the start-up. No
particular trend can be observed; the distribution instead roughly resembles a normal one.
Figure 4.7: Scatterplot of the log of the funds previously raised (lPrev_raised) and the log of the amount invested by Investiere (lThrough_investiere), considering only rounds in which the VC participated and the start-up had already raised funds in the past. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
We now plot lThrough_investiere against the Closing Year of the round (Figure 4.9):
Figure 4.8: Histogram showing the number of samples sharing the same Foundation Year, by only considering rounds in which the VC Investiere invested.
Figure 4.9: Scatterplot of Closing Year and the amount invested by the VC (lThrough_investiere), considering only rounds in which Investiere participated. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
In this case, a trend is evident, with a moderate correlation of 0.44. We can therefore state
that, over the years, Investiere has on average been investing more in each deal. The histogram
(Figure 4.10) adds that the number of deals is also increasing over time (2019 underestimates
the real value, because the year is still ongoing).
Figure 4.10: Histogram showing the number of samples sharing the same Closing Year, by only considering rounds in which the VC Investiere invested.
4.2.3 Employees analysis
We now proceed with the variable Employees – indicating the number of people employed in the
company at the time of a specific round – in the same way we did for lThrough_investiere. This
time, as 81% of the Employees values are NA, we only take into account 48 complete samples of
our data set. Figure 4.11 shows its distribution (we removed four extreme outliers with more
than 100 employees). It has a very long right tail, making the mean much higher than the median
and the mode. In Figure 4.12 we show the correlation matrix, considering all 48 available
complete samples.
We plot all the moderate correlations between Employees and the other variables in Figures
4.13 - 4.16. The strongest correlation is between Employees and lPre_valuation. This suggests
that, with more data available, Employees would be a relevant and significant predictor in
determining the valuation of a start-up. Nevertheless, having more employees does not
necessarily mean that the start-up has previously raised more funds (moderate-low correlation).
We also observe a moderate correlation between Employees and lAmount_raised, and between
Employees and Closing_year. There is no correlation, instead, between Employees and either
Foundation Year or lThrough_investiere.
Figure 4.11: Employees distribution (boxplot and probability density function; N = 46, bandwidth = 5.778). The red line indicates the median.
Figure 4.12: Correlation matrix summarizing all correlations among continuous variables. Only complete samples are here considered. Following the legend, numbers in a colour tending to red suggest negative correlations, while numbers in a colour tending to blue indicate positive correlations. Insignificant values are hidden by a black cross (p-value above 0.05).
Figure 4.13: Scatterplot of the log of the amount raised in the round (lAmount_raised) and the number of Employees, considering only rounds in which the number of Employees is known. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
Figure 4.14: Scatterplot of the log of the funds previously raised (lPrev_raised) and the number of Employees, considering only rounds in which the number of Employees is known, and lPrev_raised is above zero. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
Figure 4.15: Scatterplot of the log of Pre-money valuation (lPre_valuation) and the number of Employees, considering only rounds in which the number of Employees is known. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
Figure 4.16: Scatterplot of Closing Year of the round and number of Employees, considering only rounds in which the number of Employees is known. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
4.2.4 Analysis of the entire data set
After separately analysing the impacts of lThrough_investiere and Employees on the other
variables, there is no need to keep them in our further analysis: we want to consider all rounds
in our dataset, regardless of whether we have information about Investiere's participation or
the number of employees. If we kept these variables, we would exclude 82.9% of the samples (the
percentage of Employees' missing values). Including now all samples of our data set, we
visualize the distribution of each continuous variable (Figures 4.17 - 4.21):
Figure 4.17: Distribution of the log of pre-money valuation (lPre_valuation), via boxplot and probability density function. The red line indicates the median.
Figure 4.18: Closing_Year distribution (boxplot and probability density function). The red line indicates the median.
(lPre_valuation: N = 286, bandwidth = 0.3031)
(Closing_Year: N = 283, bandwidth = 0.7059)
lPre_valuation and lAmount_raised show many outliers beyond their MAX values6, creating the
long tails in the distribution curves. The correlation matrix, including all complete samples
for the selected continuous variables, together with a summary overview of the relationships, is
in Figure 4.22. As the Shapiro–Wilk test makes us reject the null hypothesis of normality, we
continue using Kendall's Tau method.
Figure 4.19: Distribution of the log of funds previously raised (lPrev_raised), via boxplot and probability density function. The red line indicates the median.
Figure 4.20: Distribution of the log of the amount raised in the round (lAmount_raised), via boxplot and probability density function. The red line indicates the median.
6 MAX=min[MAX, Q3+1.5*(Q3-Q1)]
(lPrev_raised: N = 286, bandwidth = 2.057)
(lAmount_raised: N = 286, bandwidth = 0.295)
All variables tend toward a normal distribution, except for Closing_Year (a clear growing
trend; 2019 is still in progress) and lPrev_raised (which has two relative maxima, because of
the conspicuous number of zero values). In any case, normality of the variables is an assumption
of neither Kendall's Tau nor MLR.
Figure 4.21: Distribution of the Foundation Year of samples (boxplot and probability density function). The red line indicates the median.
Figure 4.22: Correlation matrix summarizing all correlations among continuous variables. Following the legend, numbers in a colour tending to red suggest negative correlations, while numbers in a colour tending to blue indicate positive correlations. Insignificant values are hidden by a black cross (p-value above 0.05).
(Foundation_Year: N = 286, bandwidth = 1.084)
The correlation between lAmount_raised and lPre_valuation is the strongest among our continuous
variables (0.59), and it is highly significant (Figure 4.23). This correlation is expected:
otherwise, raising large investments would cause start-ups enormous dilution, not sustainable
for further growth. Nevertheless, considered alone, this factor can be misleading for
companies. It could tempt a company to show a higher financial need in order to raise more
money, and thereby obtain a higher valuation. This strategy is not advisable, as it is likely to
lead the start-up to over-dilution and lower credibility, if not properly justified. So, every
company will have to carefully evaluate the combination of factors influencing its pre-money
valuation (which will be fully revealed in Chapter 6), and weigh carefully the trade-off between
the amount raised (and therefore the lPre_valuation obtained) and the consequent dilution.
It is followed by the correlation between Foundation Year and Closing Year (0.47): the youngest
companies have the most recent investments, and Foundation_Year <= Closing_Year always holds.
We find several outliers in this trend.
Between lPre_valuation and lPrev_raised (Figure 4.24) there is also a moderate, significant
correlation (0.43). We now look at their trend considering only samples with lPrev_raised > 0
(the correlation grows to 0.5). lPrev_raised is certainly a main factor in determining the
valuation of a company (we measure its impact in more detail in Chapter 6). The range of
lPre_valuation for the excluded rounds (those with lPrev_raised = 0), on the other hand, is wide
and contains many outliers.
Figure 4.23: Scatterplot of the log of Pre-money valuation (lPre_valuation) and the log of the amount raised (lAmount_raised), considering the entire data set. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
Figure 4.24: The Figure represents the scatterplot of lPre_valuation and lPrev_raised, considering all samples with lPrev_raised above zero. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
We now plot the relationship between lAmount_raised and lPrev_raised (Figure 4.25), as both
variables are strongly correlated with lPre_valuation.
Figure 4.25: Scatterplot of the natural log of the funds raised by the company before a specific round (x-axis= lPrev_raised) and the natural log of the amount raised at that round (y-axis= lAmount_raised), by considering only rounds with lPrev_raised above zero. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. On the top of the figure, the equation of the plotted regression line is shown, where y=y-axis variable, and x=x-axis variable. The 95% confidence interval is displayed by the grey area.
Indeed, they show a moderate, significant correlation: the more a company has previously
raised, the more it is likely to raise. This is perfectly normal: if this is not the first
financing round of the company, the company is most probably at a later development stage and
has been able to bring to the table more proof of product/service validation, reducing the risk
for investors. At the same time, a low lPrev_raised does not prevent a company from raising
large investments. In particular, for companies with lPrev_raised = 0, lAmount_raised has the
distribution shown in Figure 4.26. The volatility is extremely high and five outliers are visible.
Figure 4.26: Distribution of the log of the amount raised (lAmount_raised) for samples with no funds previously raised, via boxplot and probability density function. The red line indicates the median.
All in all, lPrev_raised is an important factor determining lAmount_raised, which in turn plays
an even more relevant role in influencing lPre_valuation (highest existing correlation). As
these three variables are mutually correlated, we add a three-variable bubble plot
(Figure 4.27) to offer a final overview of their relationships (lPre_valuation on the y-axis,
lAmount_raised on the x-axis, circle size representing lPrev_raised).
Finally, we find a highly significant but low positive correlation between Closing_year and
lPre_valuation (0.16), while between Foundation Year and lPrev_raised there is a low, negative
correlation: the older the company, the higher lPrev_raised. This is also obvious and expected.
All the other pairwise relationships among variables can be neglected (extremely low or
non-significant correlations).
(lAmount_raised: N = 174, bandwidth = 0.362)
Figure 4.27: Three-dimensional representation: the log of pre-money valuation (lPre_valuation) on the ordinate, the log of the amount raised (lAmount_raised) on the abscissa, and the log of funds previously raised (lPrev_raised) encoded by colour and point size.
4.3 CORRELATION ANALYSIS BETWEEN CATEGORICAL VARIABLES
In this section, we investigate the correlations existing between the categorical variables.
4.3.1 Pooling levels together
In order to dive deeper into this analysis and obtain significant results, we first pool
appropriate levels together, to make sure that, when comparing variables in pairs, each level
contains at least 5 samples. The final distribution of each variable is represented in the next
Figures 4.28 a) and b).
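The pooling step can be sketched as follows. This is a Python illustration on toy round names; the pooled label "Other" and the threshold handling are assumptions for the example (the thesis pools semantically related levels, e.g. Series D through IPO into one group, rather than always using a generic label):

```python
from collections import Counter

def pool_rare_levels(values, min_count=5, pooled_label="Other"):
    """Merge levels with fewer than `min_count` samples into one pooled level,
    so every level entering the contingency analysis has at least 5 samples."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else pooled_label for v in values]

rounds = ["Seed"] * 6 + ["Series A"] * 5 + ["Series D"] * 2 + ["IPO"] * 1
pooled = pool_rare_levels(rounds)
print(Counter(pooled))   # the two rare levels are merged into one group
```

After pooling, the two rare levels ("Series D" and "IPO", with 2 and 1 samples) are merged into a single group of 3, while the frequent levels are left untouched.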
Figure 4.28: a) Overview of the categorical variables Round Name (top) and Industry (bottom). On the top of each histogram, the title refers to the name of the displayed variable. On the abscissa are written the names of the groups belonging to that categorical variable, while the ordinate indicates the absolute number of samples in the data set belonging to that specific group. The total number of samples for each categorical variable changes between variables, because of missing values (see paragraph 4.1).
(Round name groups: Pre-seed, Seed Round, Series A, Series B, Series C, From Series D to IPO. Industry groups: Biotech, Fintech, Medtech/Healthcare, ICT, Cleantech, Micro/Nano, Consumer Products, Other.)
Figure 4.28: b) Overview of the categorical variables Revenue (top-left), Still_operating (top-right), TLI (bottom-left), and Stage (bottom-right). On the top of each histogram, the title refers to the name of the displayed variable. On the abscissa are written the names of the groups belonging to that categorical variable, while the ordinate indicates the absolute number of samples in the data set belonging to that specific group. The total number of samples for each categorical variable changes between variables, because of missing values (see paragraph 4.1).
(Revenue groups: 0-50k, 50k-100k, 100k-500k, 500k-1M, >1M. Still operating groups: exit, no, yes. Stage groups: Prototyping, Beta/Clinical Trials, First Clients, Growth/International. Type Lead Investor groups: Acc/Inc/PA, Inst. Financial, Inst. Strategic.)
4.3.2 Methodology
After the pre-processing phase, we apply the following methods:
• Contingency Analysis (or Chi-square independence test)
• Cramer’s V
• Contingency coefficient (or Pearson’s coefficient)
The Contingency Analysis tests the null hypothesis that the two considered variables are
mutually independent, i.e. that knowledge of one does not help predict the value of the other.
If, on the other hand, the p-value is below the significance level (0.05), we reject the null
hypothesis and conclude that there is a statistically significant relationship between the two
categorical variables, that is, they are not independent. The test makes use of contingency
tables, which is why it is known as 'Contingency Analysis'.
If we reject independence, Cramer's V and the Contingency coefficient provide measures of the
correlation existing between two categorical variables. As for continuous variables, we
consider a coefficient in the range [0, 0.3] as weak, in [0.3, 0.7] as moderate, and above 0.7
as strong.
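The three measures above can be computed from a two-way frequency table as follows (a Python sketch on an invented Stage x Revenue count table; the numbers are purely illustrative, not the thesis data):

```python
import numpy as np
from scipy.stats import chi2_contingency

def categorical_association(table):
    """Chi-square independence test plus Cramer's V and the
    contingency coefficient C for a two-way frequency table."""
    table = np.asarray(table, dtype=float)
    chi2, p, dof, expected = chi2_contingency(table)
    n = table.sum()
    r, c = table.shape
    v = np.sqrt(chi2 / (n * (min(r, c) - 1)))   # Cramer's V
    cc = np.sqrt(chi2 / (chi2 + n))             # contingency coefficient C
    return chi2, p, v, cc

# Hypothetical Stage x Revenue counts: the heavy diagonal encodes the fact
# that later stages tend to come with higher revenue groups
table = [[30,  5,  1],
         [10, 20,  5],
         [ 2,  8, 25]]
chi2, p, v, cc = categorical_association(table)
print(round(v, 3), round(cc, 3), p < 0.05)
```

With such a diagonal-heavy table the test rejects independence decisively, and both V and C land in the moderate range, echoing the Stage–Revenue result in Table 4.1.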
4.3.3 Results
Legend: V = Cramer's V; C = Contingency coefficient; "independent" = no significant correlation revealed.

                  Round_name        Industry          Stage             Type_Investor     Revenue           Still_operating
Round_name        1
Industry          independent       1
Stage             V=0.392 C=0.485   V=0.401 C=0.570   1
Type_Investor     V=0.266 C=0.352   V=0.262 C=0.348   independent       1
Revenue           V=0.323 C=0.416   V=0.358 C=0.625   V=0.526 C=0.674   independent       1
Still_operating   V=0.204 C=0.277   independent       V=0.359 C=0.453   V=0.242 C=0.324   V=0.289 C=0.378   1
Table 4.1: The Table shows the relationships existing between categorical variables. The cell "independent" means that no significant correlation has been revealed. In all other cases, V and C indicate the Cramer’s V coefficient and the Contingency coefficient, respectively.
Final results are summarized in Table 4.1: the correlations are moderate or low, and the
strongest is between Revenue and Stage. That was already evident from the data, and it makes
logical sense (e.g. there cannot be significant revenues at the Prototyping stage, while they
are necessary to be in the Growth/Internationalisation stage). We can also confirm that Industry
is independent of the Round name (all round names can apply to any industry). We underline that
belonging to a particular Industry does not influence the success of the start-up (independence
from Still_operating), but it does influence its revenue (see the distribution in Figure 4.29).
Figure 4.29: Distribution of Revenue given the Industry. On the ordinate, the Industry (a bar for each Industry group). On the abscissa, how many samples (in percentage) of that Industry group have a certain interval of Revenue. Following the legend, each colour section of the bars corresponds to a certain Revenue group.
We expected a correlation, to some extent, between the Type of Lead Investor and the variables
Stage and Revenue, but this is not confirmed by the numbers. So, we cannot say that a particular
type of investor invests mainly in start-ups at a particular stage or with a particular range of
revenues. Instead, all types of investors invest in a diversified portfolio of companies, as we
will see in more detail in the next paragraph 1.5.2.4.
A moderate correlation is identified between Stage and Industry, but this is just a chance
artefact of our data set. Of course, all industries are populated by start-ups at all stages.
Finally, we could think that high revenues would be an important factor in determining the
success of a start-up (still_operating), but from our data set we can only state the existence of a
low correlation. We will make a specific analysis to investigate the impact of different variables
in the future success of Swiss start-ups, Chapter 5.
4.4 CORRELATION ANALYSIS BETWEEN CONTINUOUS AND CATEGORICAL VARIABLES
4.4.1 Methodology
There are several methods to understand whether a continuous and a categorical variable are
significantly correlated:
• Point-biserial correlation: the categorical variable must be dichotomous, which is never
our case;
• Logistic regression: the dependent variable must be binomial, which is not our case
(lPre_valuation is continuous);
• Boxplot analysis: see results in the upcoming paragraph;
• ANOVA and ANCOVA: their assumptions of normality are not respected in our data set,
as we saw in paragraph 3.4;
• Kruskal-Wallis H-test: a non-parametric alternative to ANOVA. It does not assume the
data come from a particular distribution, and we decide to use the H-test precisely
because the assumptions for ANOVA (such as normality) are not met. It is
sometimes called the “one-way ANOVA on ranks”, as the ranks of the data values,
rather than the actual data points, are used in the test. The test determines whether the
means of two or more groups are significantly different. The test statistic used is
called “the H statistic”, and the hypotheses for the test are:
o H0: the population means are equal.
o H1: the population means are not equal.
We reject H0 if the adjusted p-value, calculated through the default “holm” method, is
below the threshold of 0.05. However, this test alone does not tell us which groups
differ. To know that, we run a post hoc pairwise Wilcoxon test and comment on its
results. We therefore adopt this method; results are shown in the upcoming
paragraph.
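The procedure described above can be sketched as follows. This is an illustrative Python stand-in for the R workflow used in the thesis; the function and data names are ours:

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def kw_with_posthoc(groups):
    """Kruskal-Wallis H-test across all groups, followed by pairwise
    rank-sum tests whose p-values are Holm-adjusted (step-down)."""
    h_stat, kw_p = kruskal(*groups.values())
    pairs = list(combinations(groups, 2))
    raw = [mannwhitneyu(groups[a], groups[b]).pvalue for a, b in pairs]
    # Holm adjustment: the i-th smallest raw p-value is multiplied by
    # (m - i), capped at 1, and made monotonically non-decreasing.
    m = len(raw)
    order = sorted(range(m), key=raw.__getitem__)
    adjusted = [0.0] * m
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, min(1.0, (m - rank) * raw[idx]))
        adjusted[idx] = running
    return kw_p, {pairs[i]: adjusted[i] for i in range(m)}
```

A pair of groups is declared significantly different when its adjusted p-value falls below 0.05, mirroring the threshold used throughout this chapter.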
4.4.2 Results
We now plot the dependent variable lPre_valuation against all the considered categorical variables.
In each graph, we also show the result of the Kruskal-Wallis test.
4.4.2.1 Round_name
The graph in Figure 4.30 shows a strong correlation, which is straightforward: the round name is usually assigned based on the pre-money valuation. So, unsurprisingly, later rounds have higher lPre_valuation. The boxplot of “From Series D to IPO” is the highest one but also the tallest, so it has the highest volatility in pre-money valuation. All groups have some outliers and tend to be left-skewed (the tail of the distribution is longer on the left-hand side than on the right-hand side, and the median is closer to the third quartile than to the first one). Finally, the population mean of each group differs significantly from that of every other group.
Figure 4.30: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Round name. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
4.4.2.2 Industry
In Figure 4.31 we see the boxplots of lPre_valuation for each Industry group. Some Industries are more volatile than others: ICT shows the widest range and several outliers, while the Cleantech and Consumer Products ranges are much more restricted. MedTech/Healthcare shows one particularly extreme case. Consumer Products and Others are strongly left-skewed, which means that the 3rd and 4th quartiles have a more restricted range than the first two. We do not obtain significant mean differences among groups, so we pool Industries with fewer than 20 samples into the group Others, to see if we obtain different results. Figure 4.32 shows the resulting boxplots. Also in this case, the group means are not significantly different from one another. So, we presume that adding this explanatory variable to our regression model (see Chapter 6) will bring no advantage.
Figure 4.31: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Industry. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.32: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the industry, after we pooled together the groups having less than 20 samples in the Industry group Other. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
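The pooling used for Figure 4.32 (collapsing Industries with fewer than 20 samples into one residual group) amounts to a simple relabelling of the categorical column. A hypothetical sketch in Python (the thesis does this in R; names are illustrative):

```python
from collections import Counter

def pool_small_groups(labels, min_count=20, pooled="Others"):
    """Relabel every category with fewer than `min_count` samples
    into a single pooled category."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else pooled for lab in labels]
```

The same relabelling is reused later for the Stage variable (Prototyping and Beta-Phase/Clinical Trials pooled into Early-stage) and for Revenue, before re-running the Kruskal-Wallis test on the pooled groups.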
4.4.2.3 Stage
The relationship with the variable Stage is represented in Figure 4.33. For this variable we have
61.7% missing values, so inserting it in our regression model would make us lose the majority
of samples in our data set. For this reason, it is even more important to analyse its relationship
with the dependent variable lPre_valuation separately. Prototyping is the most volatile group,
and it is right-skewed. Beta-phase/Clinical Trials is also right-skewed (the 50% of these samples
having the highest lPre_valuation lie in a wider range). Although the high volatility involves all
groups, we can clearly see a growing trend: the later the stage of a start-up, the higher its
lPre_valuation. Through the Wilcoxon test, we can state that the means of the following pairs
of groups are statistically significantly different:
• Prototyping and Growth/International
• Beta-Phase/Clinical Trials and First Clients
• Beta-Phase/Clinical Trials and Growth/International
• First Clients and Growth/International
Figure 4.33: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Stage. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
As Prototyping has only 13 samples and there is no significant difference between this group and the Beta-Phase/Clinical Trials stage, we now try to pool them together in a group called “Early-stage” and test whether there is a significant difference with First Clients and with Growth/International, which we now rename, for coherence, as “Later-stage”. We obtain Figure 4.34 and significant mean differences between all groups. All this makes us maintain this new structure of the variable and state that Stage is a potentially useful predictor of lPre_valuation.
Figure 4.34: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Stage, after we pooled together the groups Prototyping and Beta-Phase/Clinical Trials Stages in the new group Early-stage. For coherence, the group Growth/International is here renamed as Later-Stage. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
4.4.2.4 Type Lead Investor (TLI)
The graph in Figure 4.35 for TLI is very interesting: the trend is evident and highly significant.
The mean valuation obtained in rounds involving Accelerators/incubators is much lower than in all
the other ones. As expected, Private Angels sit in between accelerators and Institutional
investors. Is their valuation lower because they only invest in early-stage start-ups? The
answer is shown in Figure 4.36.
Except for Accelerators/incubators, all other types of investors show a differentiated portfolio of
rounds in terms of start-up Stage. Institutional Financial shows many more samples and
therefore a wider range of offered valuations compared to the other groups, while Institutional
Strategic has the highest mean lPre_valuation.
Nevertheless, for a fair comparison, we need to verify whether these significant mean differences
hold even when separating rounds by the company's development stage. In fact, it
could be that Private Angels offer lower pre-valuations on average only because they
mainly invest in early-stage start-ups, while Institutional Strategic investors only invest in later-stage
companies. In the following pie charts and histogram (Figures 4.37 – 4.38), we represent
the distribution of rounds' Stages across the different Types of investors. For example, we note
that Private Angel investments involve early-stage start-ups (Prototyping + Beta-Phase + Clinical
Trials) in a plurality of cases (43%). Anyway, the lower valuation offered by Private Angels
cannot be attributed merely to the fact that they mainly invest in early-stage start-ups.
Instead, a more comprehensive reason is that they are willing to take higher risk than the other
players, and therefore ask for higher returns on investment.
Figure 4.35: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type of Lead Investor. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.36: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type_Lead_Investor (TLI) and the Stage. Given a TLI, a different boxplot of lPre_valuation is created for each Stage (following the legend, a colour is associated to each Stage). Samples with unknown investor have been excluded from the graph. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.37: Distribution of the Type of Lead Investor (TLI) across Stages. On the abscissa we read the Stage (a bar for each Stage group). On the ordinate we read (in percentage) the proportion of samples in that Stage group belonging to a certain TLI group. Following the legend, each colour section of the bars corresponds to a certain TLI group.
Figures 4.39 – 4.41 confirm that all types of investors (except for Accelerators/incubators) show
a well-diversified portfolio of start-ups in terms of development stage. Nevertheless, the mean
differences among groups remain significant only when considering rounds of:
• Early-stage start-ups, between all types of investors;
• First Clients stage start-ups, just between Private Angels and Inst. Financials (one-sided
test).
In those cases, the type of investor involved in the round makes a significant impact on
lPre_valuation. As the majority of rounds involving Private Angels relate to early-stage
start-ups (but not only), the overall outcome is a significant mean difference among the three
types of investors.
Figure 4.38: These pies show the distribution of the variable Stage for each Type of Lead Investor (TLI). In a) we consider only samples in which the TLI is Institutional Financial, in b) only Private Angel, and in c) only Institutional Strategic. Following the legends, each colour section of the pies corresponds to a certain Stage group, and its proportion of samples (in percentage) is written inside the section.
How can we interpret this? Private angels are known to be willing to take more risk than
institutional investors, and to impose fewer constraints on companies. On the other side, they
require a higher return on their investment, by investing at a relatively low valuation. About the
overall highest valuations offered by strategic investors, we have to remember that they are
called “strategic” because they invest in virtue of a particular strategic interest they have in a specific
start-up, an interest that other investors (strategic or not) might not have. Examples
of strategic reasons behind a start-up investment are: exploitation of the developed technology,
IP rights, complementary products, control of competition, reaching new customer segments or
new markets, access to know-how or specific resources, etc. That is why their valuations of
companies are higher than those of Institutional Financials, who instead do not take any strategic
advantage out of the investments.
Thinking about our upcoming regression analysis, based on this information we expect
Type_Lead_investor to have a relevant impact on lPre_valuation when considering its
interaction with the variable Stage.
Figure 4.39: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type_Lead_Investor (TLI), considering only Early-stage start-up rounds. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.40: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type_Lead_Investor (TLI), considering only start-up rounds at the stage First Clients. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.41: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type of Lead Investor, considering only Later-stage start-up rounds. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
4.4.2.5 Revenue
The Revenue variable has 60% NA values. From the representation in Figure 4.42 we can
see a shy positive trend. Some of these groups' means differ significantly (one-sided test, as we
assume a growing trend):
• 0 – 50k and 500k – 1M
• 0 – 50k and >1M
• 50k – 100k and >1M
• 100k – 500k and >1M
To keep only significant differences, we pool groups together. The final Revenue structure
includes 3 levels: 0 – 50k, 50k – 1M, >1M. The resulting boxplots are shown in Figure 4.43.
Figure 4.42: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Revenue. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
The Kruskal-Wallis test now indicates higher significance in mean differences (the p-value is now 0.008 instead of 0.014). Still, we cannot reject the null hypothesis between 0 – 50k and 50k – 1M, but we can between >1M and the other two groups. Having fewer NAs would improve the precision of our results. For these reasons, Revenue does not seem to be an essential predictor in a regression model estimating lPre_valuation.
Figure 4.43: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Revenue, after we pooled together the groups 50k – 100k, 100k – 500k, and 500k – 1M, in the new group 50k – 1M. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
4.4.2.6 Still_operating
The relationship between Still_operating and lPre_valuation is extremely important (Figure
4.44), not to predict valuation by knowing whether the start-up is still operating, but the other way
around. In fact, if we want to estimate the pre-money valuation of a start-up, it means that the start-up
has not been acquired yet (so it does not belong to the exit group), nor has it been liquidated. So,
every time we aim to predict the pre-money valuation of a start-up, it means that the start-up is indeed
still operating.
What we find more interesting is to investigate the possibility of predicting the future of a start-up
(acquired, liquidated, or still operating) by knowing its lPre_valuation. We do that in Chapter 5.
Figure 4.44 takes into consideration all rounds, and it confirms that a relevant relationship
exists between these two variables. In fact, we count a significant difference in mean
lPre_valuation among all three groups. Companies that reached an exit have the highest
median lPre_valuation, liquidated companies have the lowest one, while start-ups that are still
operating have the highest volatility, but a median lying in between the other two groups.
We are aware that, one day, the majority of start-ups now belonging to the yes group will
belong to the no or exit groups. As we still do not know their destiny, let us now exclude them from
the analysis and focus only on the exit and no groups (Figure 4.45).
Figure 4.44: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating). Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.45 shows an enormous significant difference between the means of the two groups.
This is extremely precious information that makes us wonder the following important
question: was the future of the start-up already evident when those companies were just at an
early-stage round? Was their future already predictable at their first investment round (when
they had zero money previously raised)? In Figure 4.46 we analyse a selection of rounds with
Prev_raised = 0. Even if we only have a few samples in the exit group, the mean differences
between the exit and no groups, and between the yes and no groups, are significant! That is even more
evident by looking only at the exit and no groups (Figure 4.47).
Figure 4.45: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the samples in the groups “exit” and “no”. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.46: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the rounds with zero money previously raised. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.47: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the rounds with zero money previously raised and belonging to the groups “exit” or “no”. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
In Figure 4.48, we consider only early-stage start-ups and show their precise development stage
with different colours. This time we have a more conspicuous number of samples in the no and
yes groups, and their mean difference is again significant, below the 0.05 threshold. Nevertheless,
we have no sample belonging to the exit group, so by knowing the lPre_valuation of a company
at its early-stage round, we can only predict whether -in the future- it will face a liquidation or not.
Overall, our conclusion would be extremely useful to apply but, unfortunately, the very
limited availability of samples and the high volatility of lPre_valuation for companies belonging
to the yes group prevent our statements from being strongly reliable.
Anyway, we can add the following comment to the demonstrated relevant relationship between
the future success of the start-up and its lPre_valuation: reaching a high pre-money valuation
can be both a cause and an effect of a start-up's still_operating status. In fact, if the
company is still operating or achieved an exit, it was probably already showing a lower risk of
failure at the investment round, and for this reason it got a higher valuation. But the other way
around can also be true: as the company got a higher valuation in the round, it could invest
resources in a more efficient way, and it probably raised more money (we saw the strong
correlation existing between lPre_valuation and lAmount_raised), therefore maximizing its
probability of still being operating / sold.
Figure 4.48: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the early-stage rounds. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
5 PREDICTING THE FUTURE SUCCESS OF A SWISS START-UP
5.1 OVERVIEW
In paragraph 4.4.2.6, we proved the existence of a significant, relevant correlation between the
continuous variable lPre_valuation -indicating the pre-money valuation obtained by a company
in a funding round- and the categorical variable Still_operating -revealing the current status of
a start-up (yes = still operating / no = liquidated / exit = acquired)-. Based on that information,
in this chapter we build a model predicting the future status of a company (Still_operating) with
the minimum possible error rate.
5.2 METHODOLOGY
To reach our goal, we apply the following Discriminant Function Analysis approaches to several
combinations of explanatory variables (both continuous and categorical):
• Linear discriminant analysis (LDA)
• Quadratic discriminant analysis (QDA)
• Multiple discriminant analysis (MDA)
• Flexible discriminant analysis (FDA)
Discriminant Analysis, indeed, can be used to determine which variable(s) are the best
predictors of the outcome categorical variable. It assumes that the data (for the variables)
represent a sample from a multivariate normal distribution. However, violations of the
normality assumption are usually not “fatal”, meaning that the resulting significance
tests, etc., are still “trustworthy”, especially for FDA. So, theoretically, this should be
the best method to apply in our case, as we have a multivariate non-normal data set (see the
distribution analysis in paragraph 3.4). Anyway, we test and compare the performance of all these
classifiers (LDA, QDA, MDA, FDA) and, for each of them, we create several models involving
different combinations of continuous and categorical explanatory variables. We then train the
models on 80% of our data set and test them on the remaining 20% of samples. Finally, we
compare the accuracy of these models in terms of the percentage of observations
classified correctly, and we select the optimal one.
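The 80/20 train-test procedure can be illustrated with scikit-learn's LDA implementation on synthetic stand-in data (a Python sketch, not the R code used for the thesis; the generated data and all names are purely illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
# Three well-separated synthetic classes standing in for the
# Still_operating groups (yes / no / exit), with four numeric features
# standing in for Closing_Year, lPre_valuation, lPrev_raised,
# lAmount_raised.
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(60, 4))
               for m in (0.0, 3.0, 6.0)])
y = np.repeat(["no", "yes", "exit"], 60)

# 80% of samples to train, 20% held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
accuracy = lda.score(X_te, y_te)  # share of correctly classified samples
```

The accuracy on the held-out 20% is the model-selection criterion used in this chapter; on real data, repeating the split (or cross-validating) gives a less optimistic estimate than a single random split.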
5.3 RESULTS
By comparing the performance of our models, we find that the best combination of explanatory
variables to predict Still_operating is made of:
− Closing_Year
− lPre_valuation
− lPrev_raised
− lAmount_raised
The classifiers LDA and FDA both provide the most accurate predictions: they correctly predict the
future company status for 96.36% of observations in our data set. This value could be
improved with more observations and measured variables. In more detail, we report the
combination of predictors used by our final selected model:
Coefficients of linear discriminants:
                   LD1     LD2
Closing_Year     0.984   0.305
lPre_valuation   0.432  -0.986
lPrev_raised    -0.096  -0.573
lAmount_raised  -0.262   0.481

Proportion of trace:
  LD1    LD2
0.783  0.217
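Reading these coefficients: each discriminant score is a linear combination of the predictors. A sketch of how they would be applied to one observation (the input values are invented; in practice the predictors are centred and scaled as in the fitted R model):

```python
# Coefficients of the two linear discriminants, as reported above.
LD = {
    "Closing_Year":   (0.984,  0.305),
    "lPre_valuation": (0.432, -0.986),
    "lPrev_raised":   (-0.096, -0.573),
    "lAmount_raised": (-0.262,  0.481),
}

def discriminant_scores(x):
    """Project one observation (dict of predictor values) onto LD1/LD2."""
    ld1 = sum(LD[name][0] * value for name, value in x.items())
    ld2 = sum(LD[name][1] * value for name, value in x.items())
    return ld1, ld2
```

The proportion-of-trace row says that LD1 alone captures 78.3% of the between-group separation, which is why Closing_Year and lPre_valuation (the two largest LD1 loadings) dominate the prediction.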
The weakest point of this model is that we cannot estimate in which year this future status will
be realized exactly. We only know that all companies in our data set closed investment rounds
between 2010 and 2019, while the Still_operating variable is updated to October 2019. So,
overall, the predicted future status of companies is supposed to come true within up to nine years
of the moment the inputs of the model are measured.
The strength of the model is that it reveals that Closing_Year and lPre_valuation are the most
influential factors in determining the future status (and success) of a company.
6 MULTIPLE REGRESSION ANALYSIS
6.1 PURPOSE
The goal of this Multiple Regression Analysis is to estimate a fair benchmark range for the pre-
money valuation of a start-up at its upcoming investment round, given as input some
explanatory variables related to the company. This benchmark could be useful for the co-
founders, as well as for the investors evaluating the company before making an investment
proposal.
While conducting this analysis, we keep in mind that correlation does not imply causation. Even
if we build a model having significant predictors and a high adjusted R-squared, still nothing is
known about causal relationships.
6.2 METHODOLOGY
6.2.1 Overview
We want to predict the dependent variable (lPre_valuation) based on the values of a set of
predictors (mixing continuous and categorical variables). The choice of the type of
regression model depends on the type of distribution followed by its dependent variable:
− Linear regression for a continuous variable having linear relationships with the predictors
− Logistic regression for a dichotomous distribution
− Log-linear analysis for a Poisson or multinomial distribution
− Cox regression for time-to-event data in the presence of censored cases (survival-type)
− Non-linear regression for continuous dependent variables having non-linear
relationships with the predictors
In our case, the dependent variable lPre_valuation is continuous and moderately correlated
with the other continuous predictors (as we saw in paragraph 4.2). So, it is a good rule to start
by creating the simplest possible model via multiple linear regression, and to make it more
complicated only when truly needed. If we make a model more complex, we should get
confirmation, by testing its performance, that we are not heading toward overfitting. Besides, the
prediction intervals should become more precise (narrower). If we have several models with
comparable predictive abilities, the simplest one is likely to be the best model (Zellner,
Keuzenkamp and McAleer, 2001).
Because of the restricted number of samples in our data set (286) and the large proportion of
missing values for some variables, we decided to exploit all available data to train the models, and
then to test them on the same data set via k-fold cross validation, LOOCV, a validation set, and
the bootstrap.
All methods have been tested and visualized via the software R. To distinguish models, in the
following paragraphs we give them names within square brackets (e.g. [regsub.best]).
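The resampling estimates mentioned above (k-fold cross validation and LOOCV) can be sketched as follows. This Python/scikit-learn version is only a stand-in for the R workflow, with invented data of a known linear structure:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(seed=1)
# Invented predictors and response with a known linear relationship
# plus small noise, standing in for lPre_valuation and its predictors.
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

model = LinearRegression()
# 5-fold CV: average R^2 over five held-out folds.
cv5_r2 = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=1)).mean()
# LOOCV: one held-out sample per fold, scored by mean squared error
# (R^2 is undefined on a single observation).
loocv_mse = -cross_val_score(
    model, X, y, cv=LeaveOneOut(),
    scoring="neg_mean_squared_error").mean()
```

Because every observation serves as test data exactly once, both estimates are less optimistic than the in-sample fit, which is the point of reusing the same 286 samples this way.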
6.2.2 Steps
We follow and iterate these steps:
1. Pre-processing of the data set (paragraph 6.3);
2. Manual variable selection based on previous knowledge and the conducted analyses
(paragraph 6.4);
3. Model specification by applying a combination of methods (Appendix: Model
Specification). For each created model:
o Check for outliers, which are removed from the model if justified;
o Assumptions check;7
o Calculation of model fit statistics: internal measures, and test measures through a
validation set, cross validation, and the bootstrap;
4. Comparison of the best models, with appropriate considerations (paragraph 6.5);
5. Choice of the optimal model among the best models, and final comments (paragraph 6.6).
Regarding step 3, here we list the methods we apply to find the best model fit:
1. Automated models: stepwise regression (“both”: backward and forward);
2. Automated models: best subset regression. Train and test models via:
a. a validation set;
b. 5-fold cross validation;
7 For reasons of brevity, this procedure and its results are shown only once, for the final selected best model (paragraph 6.7).
3. Removal of insignificant terms (through the drop1 function);
4. Curve fitting using polynomial terms;
5. Fractional exponents;
6. Splines;
7. Log transformations of predictors;
8. Non-linear regression: GLM (Generalized Linear Models);
9. Loess regression;
10. Kernel regression.
6.3 DATA PRE-PROCESSING
In the following regression analysis, we start by maintaining the same group structure of the categorical variables as resulted from the analysis conducted in paragraph 4.4.
6.4 SECOND MANUAL VARIABLES SELECTION (FROM 20 TO 8 INDEPENDENT VARIABLES):
Based on previous knowledge and the results of conducted analyses, we exclude the following
variables:
− Foundation_Year: its correlation coefficient with the dependent variable is close to zero
(0.08). This can be confirmed by testing a simple regression model with
Foundation_Year as the only explanatory variable and lPre_valuation as the dependent
variable. The explanation of lPre_valuation done by this predictor is close to 0% and not
significant;
− Still_operating and Round_name: these variables are strongly correlated with
lPre_valuation, but they are the effect of a determined lPre_valuation, not the cause.
When using our model for the purpose explained in paragraph 6.1, the user does not
know neither about the future of the company, nor about the name of the round that
the company is considering. For these reasons, we exclude them from our model;
− Profitable: in our data set, only 1 sample is profitable, and 179 missing values are
present. So, we cannot extract useful information out of this variable;
− Pre_Valuation, Amount_raised, Prev_raised, Through_investiere: as we saw in
paragraph 3.4, the log transformations of these variables have better properties for
regression and correlation analysis than the original ones;
− lThrough_investiere and Employees: a separate analysis has been conducted for each
of these variables. Keeping them in our analysis would cost us 82% of the samples,
due to their large proportion of NA values. Besides, lThrough_investiere is not a
universal explanatory factor for all rounds closed in Switzerland;
− Country and Currency: as we consider only Swiss start-ups, all rounds are closed in
Switzerland and denominated in Swiss francs (CHF). These conditions must be respected
when using the model.
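The benefit of the log transformations noted above (paragraph 3.4) is easy to illustrate: funding amounts are strongly right-skewed, and taking logs brings the distribution close to symmetric. A sketch on synthetic log-normal data, used here only as a stand-in for the confidential Amount_raised values:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Funding amounts are typically right-skewed (a few very large rounds);
# a log-normal distribution is a plausible synthetic stand-in.
amount_raised = rng.lognormal(mean=13, sigma=1.5, size=500)

raw_skew = skew(amount_raised)
log_skew = skew(np.log(amount_raised))  # the "lAmount_raised" transform

# The log transform pulls in the long right tail, leaving a distribution
# far closer to symmetric (skewness near 0), which suits OLS regression.
print(round(raw_skew, 2), round(log_skew, 2))
```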
The resulting regression data set is composed of 286 samples (of which 108 are complete cases),
one continuous dependent variable (lPre_valuation), and eight explanatory variables8 (five
categorical and three continuous):
− Industry (Cat.)
− Closing_Year_factor (Cat.)
− Closing_Year (Cont.)
− Stage (Cat.)
− Revenue (Cat.)
− Type_Lead_Investor (Cat.)
− lPrev_raised (Cont.)
− lAmount_raised (Cont.)
Besides, we already know that we will have to decide whether to include Closing_Year or
Closing_Year_factor as a predictor. We will test their performance and then decide.
After this Second Manual Variables Selection, we are still interested in reducing the number of
explanatory variables in our model. In fact, since five of these variables are categorical with
more than two groups, including all of them would already give a model with 15
explanatory variables, before adding any interaction terms.
Green (1991) indicates that N > 50 + 8m samples (where m is the number of independent
variables) are needed for testing multiple correlation. Harris (1985) says that the number of
samples should exceed the number of predictors by at least 50. Van Voorhis and Morgan
(2007) recommend 30 samples per predictor. Finally, the “one in ten rule” is a rule of
thumb for how many predictor parameters can be estimated from data when doing
regression analysis (in particular proportional hazards models in survival analysis and logistic
regression) while keeping the risk of overfitting low. The rule states that one predictive variable
8 Detailed variable description and overview are provided in paragraph 3.3
can be studied for every ten events (Harrell, Lee, Califf, Pryor and Rosati, 1984). As a rule of
thumb, we decide to have a model with at least 15-20 samples per predictor which, in our case,
considering only complete samples, means having a maximum of 5-7 predictors.
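The sample-size rules of thumb above can be made concrete with a short sketch (the function names are ours, chosen only for illustration):

```python
# Minimum sample sizes implied by the rules of thumb cited above,
# for m predictors (here m = 8, as after the second manual selection).
def green_1991(m):          # N > 50 + 8m
    return 50 + 8 * m

def harris_1985(m):         # N should exceed m by at least 50
    return m + 50

def van_voorhis_morgan(m):  # ~30 samples per predictor
    return 30 * m

def one_in_ten(n_samples):  # max predictors: one per ten events
    return n_samples // 10

m = 8
print(green_1991(m), harris_1985(m), van_voorhis_morgan(m))  # 114 58 240
print(one_in_ten(108))  # 10

# The thesis' own 15-20 samples/predictor rule, with 108 complete cases,
# caps the model at 108 // 15 = 7 predictors:
print(108 // 15)  # 7
```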
In the Appendix: Model Specification, we first investigate which selection the automated models
would suggest, and then compare it with the tested application of the other methods listed in
paragraph 6.2.2, without excluding further pooling of groups if necessary.
6.5 BEST MODELS COMPARISON
We selected two best models: model [A] and model [B]. The complete, detailed procedure that
led us to these two models is reported in the Appendix: Model Specification. Model [A]
is:
lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI + Closing_Year:TLI +
Closing_Year:lPrev_raised
And [B]:
lPre_valuation ~ lPrev_raised + lAmount_raised + TLI + Closing_Year:TLI +
Closing_Year:lPrev_raised + Closing_Year:lAmount_raised +
Stage:lAmount_raised
Note that the first one does not include the variable Stage, which has a significantly higher
proportion of missing values compared to the other predictors. Moreover, neither of the two
models includes the interaction term TLI:Stage, which we expected to be a relevant predictor
(4.4.2.3). So in this paragraph, after comparing the performance of these two models, we select
the winner, and we test whether adding the interaction term TLI:Stage to it yields better results.
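The thesis fits these models in R. As a hedged sketch, model [A] can be reproduced with statsmodels' R-style formula interface; the data frame below is synthetic (the underlying data set is not public), and TLI abbreviates Type_Lead_Investor exactly as in the text:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-in for the (confidential) funding-round data set.
df = pd.DataFrame({
    "lPrev_raised": rng.normal(13, 1.5, n),
    "lAmount_raised": rng.normal(14, 1.0, n),
    "Closing_Year": rng.integers(2010, 2020, n).astype(float),
    "TLI": rng.choice(["Acc/Inc/PA", "Inst.Financial", "Inst.Strategic"], n),
})
df["lPre_valuation"] = (0.4 * df["lAmount_raised"]
                        + 0.1 * df["lPrev_raised"]
                        + rng.normal(0, 0.4, n))

# Model [A]: same formula syntax as R (":" denotes an interaction term).
model_a = smf.ols(
    "lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year"
    " + TLI + Closing_Year:TLI + Closing_Year:lPrev_raised",
    data=df,
).fit()
print(model_a.params.round(3))
print(round(model_a.rsquared_adj, 3))
```

With TLI having three levels, the design matrix has 9 columns: intercept, two TLI dummies, the three continuous main effects, and three interaction columns.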
Figure 6.1 gives a clear picture of how the magnitude of the effect differs across these predictors
(circle/square), together with its uncertainty (horizontal lines indicate the 95% confidence
interval). In Figures 6.2 and 6.3, instead, we read that the vast majority of performance metrics
agree that model [A] is the best one (we compare the two models with the model resulting from
best subset selection, and with the two previously considered variations of [basic.model] from
Model Specification: [TLI*Stage] and [TLI + Stage]9).
Figure 6.1: On the right, the names of predictors belonging to model [A] and/or model [B]. The figure compares the magnitude effect (blue circles for model [A] and orange squares for model [B]) of each predictor on Pre-money valuation, plus its uncertainty (the horizontal lines indicate the 95% confidence interval of these estimates).
So we select [A] as the best model, and we now try to add the interaction term TLI:Stage to it,
as a final confirmation.
9 The full explanation of these models is provided in the Appendix: Model Specification.
Model: A + TLI:Stage

Residuals:
     Min       1Q   Median       3Q      Max
-0.80396 -0.23757 -0.00219  0.26622  0.92276

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                              -2.826e+02  6.792e+01  -4.160 7.14e-05 ***
lPrev_raised                              9.600e-05  3.172e-05   3.027  0.00321 **
lAmount_raised                            3.792e-01  6.239e-02   6.078 2.73e-08 ***
Closing_Year                              1.451e-01  3.384e-02   4.289 4.43e-05 ***
TLI_Institutional Financial              -3.685e+00  1.022e+02  -0.036  0.97133
TLI_Institutional Strategic               4.082e+02  1.535e+02   2.659  0.00925 **
Closing_Year:TLI_Institutional Financial  1.690e-03  5.074e-02   0.033  0.97351
Closing_Year:TLI_Institutional Strategic -2.023e-01  7.619e-02  -2.655  0.00935 **
lPrev_raised:Closing_Year                -4.753e-08  1.572e-08  -3.023  0.00324 **
TLI_Acc/Inc/PA:StageFirst Clients        -3.436e-02  1.560e-01  -0.220  0.82610
TLI_Instit.Financial:StageFirst Clients   2.052e-01  1.575e-01   1.303  0.19597
TLI_Inst.Strategic:StageFirst Clients    -3.389e-01  3.033e-01  -1.118  0.26669
TLI_Acc/Inc/PA:StageLater-stage           1.099e-01  2.160e-01   0.509  0.61205
TLI_Inst.Financial:StageLater-stage       2.704e-01  1.548e-01   1.746  0.08413 .
TLI_Inst.Strategic:StageLater-stage      -8.369e-02  2.823e-01  -0.296  0.76753
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3956 on 92 degrees of freedom
Multiple R-squared: 0.8196, Adjusted R-squared: 0.7921
F-statistic: 29.85 on 14 and 92 DF, p-value: < 2.2e-16
Figure 6.2: Comparison of models’ performance. Reading the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate and have been calculated through the software R. Models with lower values for these metrics are preferred. The metrics are calculated as follows:
− AIC = −2(log-likelihood) + 2K, where K is the number of model parameters and the log-likelihood is a measure of model fit (the higher the value, the better the fit);
− BIC = −2(log-likelihood) + log(n)K, where K is the number of model parameters and n the number of observations;
− As defined by Allen (1974), PRESS is based on the leave-one-out technique: each of the n samples is removed in turn, and the model is refitted to the remaining (n−1) points. The predicted value is calculated at the excluded point, and the PRESS statistic is the sum of the squares of all the resulting prediction errors.
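The three metrics defined in the caption can be computed directly from an OLS fit. A minimal sketch, assuming a Gaussian log-likelihood and counting the error variance as a parameter (as R's AIC does); the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3                      # k regression coefficients incl. intercept

X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0, 1, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
rss = resid @ resid

# Gaussian log-likelihood at the MLE variance estimate rss/n:
loglik = -n / 2 * (np.log(2 * np.pi * rss / n) + 1)

aic = -2 * loglik + 2 * (k + 1)           # +1: the error variance is a parameter
bic = -2 * loglik + np.log(n) * (k + 1)

# PRESS via the hat matrix: deleting point i and predicting it equals
# dividing residual i by (1 - leverage_i), so no n refits are needed.
H = X @ np.linalg.inv(X.T @ X) @ X.T
press = np.sum((resid / (1 - np.diag(H))) ** 2)

print(round(aic, 2), round(bic, 2), round(press, 2))
```

Since log(n) > 2 for n > 7, BIC penalises parameters more heavily than AIC, and PRESS always exceeds the in-sample RSS.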
ANOVA test does not show a significant improvement:
Model 1: lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI +
         Closing_Year:TLI + Closing_Year:lPrev_raised
Model 2: lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI +
         Closing_Year:Type_Lead_Investor + lPrev_raised:Closing_Year +
         Type_Lead_Investor:Stage
  Res.Df    RSS Df Sum of Sq Pr(>Chi)
1     98 15.238
2     92 14.400  6   0.83771   0.4995
We then compare them in figures 6.4 and 6.5. Except for the validation-set measures and the
internal RMSE and MAE, all other metrics indicate that [A + TLI:Stage] performs worse than [A]
alone. Besides, many predictors would no longer be significant. Before concluding that [A] is
our best model, we finally test through the function drop1 whether any of its predictors should
be removed:
Single term deletions

Model:
lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI +
    Closing_Year:TLI + Closing_Year:lPrev_raised

                                Df Sum of Sq    RSS     AIC  Pr(>Chi)
&lt;none&gt;                                        16.484 -187.90
Closing_Year:Type_Lead_Investor  2    1.9901 18.474 -179.47 0.0020057 **
lPrev_raised:Closing_Year        1    1.7901 18.274 -178.66 0.0008018 ***
lAmount_raised                   1    7.3165 23.800 -149.86 2.491e-10 ***
All terms are significant and we can therefore conclude that [A] is the best selected model.
Figure 6.3: Comparison of models’ performance. Reading the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate, and models with lower values are preferred, except for the R2 measures, where the contrary is true. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “k-fold 5.5” means we split the data set into five folds and repeat the cross validation five times; the final model error is taken as the mean error over the repeats. MAE is the average, over the test sample, of the absolute differences between prediction and actual observation (all having equal weight). RMSE is the square root of the average of squared differences between prediction and actual observation. S is the standard error (an absolute measure of the typical distance of the data points from the regression line, in the units of the dependent variable). R-squared (R2) is the relative measure of the share of the dependent variable’s variance that the model explains (from 0 to 1).
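The error metrics defined in the caption are straightforward to implement; a sketch (the numeric example is invented for illustration):

```python
import numpy as np

def rmse(actual, predicted):
    """Square root of the mean squared prediction error."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mae(actual, predicted):
    """Mean absolute prediction error (all points weighted equally)."""
    return np.mean(np.abs(actual - predicted))

def r_squared(actual, predicted):
    """Share of the dependent variable's variance explained (0 to 1)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1 - ss_res / ss_tot

# Invented toy values on the lPre_valuation (log) scale:
actual = np.array([15.8, 20.6, 13.4, 13.3, 12.5])
predicted = np.array([15.1, 19.0, 14.4, 13.8, 13.0])

print(round(rmse(actual, predicted), 3))       # → 0.954
print(round(mae(actual, predicted), 3))        # → 0.86
print(round(r_squared(actual, predicted), 3))  # → 0.896
```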
Figure 6.4: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model.
Figure 6.5: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to Standard error, which in literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds, and we repeat cross validation five times. The final model error is taken as the mean error from the number of repeats.
6.6 BEST SELECTED MODEL
In the previous paragraph 6.5, we selected [A] as our best model, having the following formula:
lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + Type_Lead_Investor +
Closing_Year:Type_Lead_Investor + Closing_Year:lPrev_raised
Compared with our Third Manual Variable Selection (Appendix, paragraph 9.2), model [A]
includes a combination of only four explanatory variables. The importance of these predictors
has already been discussed in Chapter 4. All of them, in fact, are significant, except for the
group Institutional Financial of the variable Type_Lead_Investor (as we already noted in the
analysis of this variable, conducted in paragraph 4.4.2.4) and its interaction with Closing_Year.
The interaction term between Closing_Year and lPrev_raised corrects the impact that the total
funding of the start-up has on its valuation, depending on the year in which the round is
closed (and vice versa).
6.6.1 Confidence and Prediction intervals10
When using the suggested best model [A] to predict lPre_valuation for new data (funding
round), prediction intervals become very important for making real-world predictions with
realistic bounds of uncertainty.
For reasons of brevity, we report confidence and prediction intervals only for the first eight
samples of our data set. Of course, narrower prediction intervals indicate a better model, but
10 The confidence interval reflects the uncertainty around the mean predictions. It states, according to the model, which is on average the lPre_valuation range for a funding round with some specific inputs as independent variables. A prediction interval gives an interval within which we expect the dependent variable (lPre_valuation) to lie with a specified probability (95%, in our case). The prediction interval gives uncertainty around a single value, given values of the independent variables specified. If the prediction intervals are too wide, the predictions do not provide useful information. Narrow prediction intervals represent more precise predictions. Thus, a prediction interval will be generally much wider than a confidence interval for the same value.
we did not use them to select the best model, because they rely strongly on the assumption that
the residual errors are normally distributed with constant variance; linear regression itself,
instead, does not depend so strongly on that assumption (see the assumptions check in
paragraph 6.7.2).
  lPre_valuation      fit lwr.conf.int upr.conf.int lwr.pred.int upr.pred.int
1       15.84721 15.06559     14.90795     15.22324     13.89531     16.23588
2       20.62511 19.01204     18.66894     19.35514     17.80273     20.22135
3       13.38473 14.35444     14.19031     14.51856     13.18326     15.52561
4       13.30468 13.78248     13.58372     13.98125     12.60595     14.95901
5       12.50618 13.03486     12.74742     13.32231     11.84015     14.22958
6       12.76569 13.12320     12.80986     13.43654     11.92199     14.32441
7       13.28788 13.96765     13.76486     14.17044     12.79043     15.14486
8       12.89922 13.68907     13.44615     13.93200     12.50428     14.87386
In the graph (figure 6.6) we show the first 70 samples of the data set (points), their prediction
intervals (red lines), and confidence intervals (green lines).
Figure 6.6: On the abscissa is the log of the Amount raised at the round (lAmount_raised) and on the ordinate the log of Pre-money valuation (lPre_valuation). The Figure shows, for Model [A], the actual Pre-money valuations (black points), the confidence intervals (green lines), and the prediction intervals (red lines).
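The thesis computes these intervals in R; an equivalent hedged sketch with statsmodels on synthetic data shows why prediction intervals are always wider than confidence intervals for the same point:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"lAmount_raised": rng.normal(14, 1, 120)})
df["lPre_valuation"] = 5 + 0.6 * df["lAmount_raised"] + rng.normal(0, 0.5, 120)

fit = smf.ols("lPre_valuation ~ lAmount_raised", data=df).fit()

new = pd.DataFrame({"lAmount_raised": [13.0, 14.0, 15.0]})
frame = fit.get_prediction(new).summary_frame(alpha=0.05)  # 95% intervals

# mean_ci_*: confidence interval around the average prediction;
# obs_ci_*: prediction interval for a single new round (always wider,
# because it also carries the residual variance of one observation).
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]].round(3))
```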
6.7 MLR BLUE ASSUMPTIONS CHECK
We now show the procedure applied to each resulting model in order to test whether it
meets the Multiple Linear Regression BLUE assumptions. For reasons of brevity, we show the
process only once, for our best model [A]. Overall, all of the MLR BLUE assumptions are met, as
linear regression allows a certain margin of tolerance. Our best model [A] is thus valid from a
theoretical viewpoint.
6.7.1 Outlier detection
We remove all the samples with a Cook’s distance higher than 0.1 (Figures 6.7 – 6.8). Then, under
Bonferroni correction of p-values, the outlier test does not find any significant outlier for this
model (no observation has an adjusted p-value below 0.05).
Figure 6.8: For Model [A], on the abscissa is plotted the number of the sample (random order) and on the ordinate its Cook's distance, calculated via the software R.
Figure 6.7: For Model [A], on the abscissa is plotted the Leverage and on the ordinate the standardized residuals.
6.7.2 Check MLR assumptions
6.7.2.1 Assumption 1. Linearity
This assumption is overall respected: a linear trend between the dependent variable
lPre_valuation and the predictors is evident in figures 6.9 – 6.10.
Figure 6.10: For Model [A], on the abscissa are plotted the predicted values of Pre-money valuation and on the ordinate its observed values. The red line is the quadrant bisector used to evaluate if samples (points) are homogeneously distributed over and under the line.
Figure 6.9: For Model [A], on the abscissa are plotted the fitted values of the Pre-money valuation and on the ordinate the absolute residuals.
6.7.2.2 Assumption 2: Random sampling of observations
This assumption is true because of the data collection process that we adopted and explained
in detail in Chapter 3.
6.7.2.3 Assumption 3: Zero conditional mean of residuals
This assumption is fully respected, as shown below:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.96992 -0.26535  0.02779  0.00000  0.26272  1.00308
6.7.2.4 Assumption 4: No multicollinearity (or perfect collinearity)
There should be no linear relationship between the independent variables. The correlation
matrix in Chapter 4 (figure 4.22) shows the existence of moderate relationships among
predictors, but the result of the VIF test reassures us that these values are tolerable.
An important implication of this assumption is that there should be sufficient variation in the
predictors. The larger the variability in the explanatory variables, the better are the OLS
estimates in determining the impact of predictors on lPre_valuation. Below we check the
variability of predictors:
               freqRatio percentUnique zeroVar   nzv
Closing_Year    1.222222      9.345794   FALSE FALSE
lPre_valuation  1.333333     80.373832   FALSE FALSE
lPrev_raised   14.666667     51.401869   FALSE FALSE
lAmount_raised  1.428571     57.009346   FALSE FALSE
As desired, no predictor has zero variance (zeroVar) or near zero variance (nzv).
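The freqRatio / percentUnique / nzv diagnostics come from R's caret::nearZeroVar. A hedged re-implementation of its per-column logic (the default cut-offs 95/5 and 10% are assumed to match caret's defaults):

```python
import numpy as np

def near_zero_var(x, freq_cut=95 / 5, unique_cut=10.0):
    """Per-column diagnostics in the style of caret::nearZeroVar:
    freqRatio     = count of most common value / second most common;
    percentUnique = 100 * (# distinct values) / (# samples).
    A column is 'nzv' when freqRatio is high AND percentUnique is low."""
    x = np.asarray(x)
    _, counts = np.unique(x, return_counts=True)
    counts = np.sort(counts)[::-1]
    freq_ratio = np.inf if len(counts) == 1 else counts[0] / counts[1]
    pct_unique = 100.0 * len(counts) / len(x)
    zero_var = len(counts) == 1
    nzv = zero_var or (freq_ratio > freq_cut and pct_unique < unique_cut)
    return freq_ratio, pct_unique, zero_var, nzv

rng = np.random.default_rng(5)
healthy = rng.normal(size=107)             # e.g. a continuous predictor
degenerate = np.array([0] * 105 + [1, 2])  # almost constant column

print(near_zero_var(healthy))      # nzv = False
print(near_zero_var(degenerate))   # nzv = True
```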
6.7.2.5 Assumption 5. Spherical errors
This assumption requires homoscedasticity and no autocorrelation among residuals. In figure
6.9, residuals do not show a trend. In figures 6.11 – 6.12 we plot the residuals against
Closing_Year to make sure that no time trend exists among them. The Kruskal-Wallis test
confirms that there are no significant differences among residuals over time.
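The Kruskal-Wallis check can be sketched as follows (the residual groups are synthetic; under the null of identically distributed residuals per year, the test should come out insignificant):

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(6)

# Residuals grouped by Closing_Year; under homoscedasticity and no time
# trend, all groups share one distribution.
residuals_by_year = {year: rng.normal(0, 0.4, 12) for year in range(2014, 2020)}

stat, p_value = kruskal(*residuals_by_year.values())
print(round(stat, 3), round(p_value, 3))

# A p_value above 0.05 means no significant differences among residuals
# over time, mirroring the conclusion drawn for model [A].
```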
Figure 6.11: For Model [A], on the abscissa we find the Closing Year of rounds and on the ordinate the absolute residuals. The red horizontal line is set zero to evaluate if residuals (points) are homogeneously distributed over and under the line, independently from the Closing Year.
Figure 6.12: Conditional boxplot of standardized residuals given the Closing Year of the round.
6.7.2.6 Assumption 6 (optional): Error terms should be normally distributed.
This assumption is not strict. In the figure 6.13 below, we show the distribution of residuals, and
the result of normality tests is the following:
-----------------------------------------------
 Test                 Statistic    pvalue
-----------------------------------------------
 Shapiro-Wilk            0.9899    0.6070
 Kolmogorov-Smirnov      0.0674    0.7165
 Anderson-Darling        0.4357    0.2935
-----------------------------------------------
In none of these normality tests do we reject the null hypothesis of normality, as the p-value is
always above 0.05. Our model is thus valid from a theoretical viewpoint.
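The three normality tests in the table are all available in scipy; a sketch on synthetic standardized residuals (note that the Kolmogorov-Smirnov variant with parameters estimated from the data is only approximate; Lilliefors' correction would be stricter):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
residuals = rng.normal(0, 1, 107)   # stand-in for standardized residuals

sw_stat, sw_p = stats.shapiro(residuals)
ks_stat, ks_p = stats.kstest(residuals, "norm",
                             args=(residuals.mean(), residuals.std(ddof=1)))
ad = stats.anderson(residuals, dist="norm")

print(round(sw_p, 4))                        # Shapiro-Wilk p-value
print(round(ks_p, 4))                        # Kolmogorov-Smirnov p-value
print(ad.statistic < ad.critical_values[2])  # below the 5% critical value?

# A p-value above 0.05 (or an A-D statistic below its 5% critical value)
# means normality is NOT rejected, which is the conclusion for model [A].
```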
Figure 6.13: Probability density function and boxplot of the standardized residuals of Model [A].
7 CONCLUSIONS
Coming to the conclusions of this research, we bear in mind that “all models are wrong, but
some are useful” (Box, 1976). As stated in the Introduction, the aim of this thesis is to provide
an overview of start-up valuations in Switzerland, and a model estimating a fair range of pre-
money valuation for a target start-up, which could be a useful starting point in the negotiation
process between co-founders and investors. With the proposed model, therefore, we do not
aspire to predict the exact valuations of Swiss early-stage companies, but rather to indicate a
(prediction) interval within which their valuation is most likely to lie. We are also aware of the
existence of other factors influencing the valuation of start-ups, that we could not measure in
our analysis (as we explain further on in the limitations of our approach). Besides, we tend to
agree with the following statement of Matthew Schubring, managing director at Chartwell (Dahl,
2016):
“A good valuation is 75% art and 25% science, because it takes into account the story behind the
numbers of a business. Appraisals fall down when there isn’t enough support for the story behind
it. It’s based not on just what happened, but on why things happened.”
With these premises made, in Chapter 4 we presented the main findings of the conducted
Multivariate Analysis. For the VC Investiere, having zero money previously raised is not an
obstacle to its investment commitment. Nonetheless, if a start-up has already raised money in
the past, the amount then invested by Investiere tends to increase slightly, with a significant
correlation coefficient of 0.27. This trend is confirmed also for rounds in which Investiere was
not involved, as the size of the round (lAmount_raised) and the total funding previously raised
by the start-up (lPrev_raised) are linked by a significant correlation (r = 0.42).
The correlation between the Amount raised and the pre-money valuation is the strongest one
among our continuous variables (r=0.59, paragraph 4.2.4). This helps entrepreneurs to find a
balance between the size of the investment and the consequent dilution.
Among categorical variables, the strongest existing correlation is between the revenues
generated by a company and its development stage. We also found that belonging to a particular
industry does not affect the success of the start-up, but it influences its revenues (Figure 4.34).
In our data set, all types of investors have a diversified portfolio of companies, in terms of
generated revenues and development stage, as we saw more in detail in paragraph 4.4.2.4. On
the other hand, the type of investor involved in the round makes a significant impact on the
start-up valuation.
We also could not find evidence that the future status of a company (acquired/liquidated/still
operating) is determined by the generated revenues. Instead, the Discriminant Function Analysis
conducted in Chapter 5 showed that the main predictors of the success of a start-up are the pre-
money valuation (confirming the importance of the topic of this research) and the closing year
of the round. We also proposed a model able to predict correctly the future status of a start-up
(acquired, liquidated, or still operating) for the 96.36% of observations in our test data set. This
value could be improved in further research by increasing the number of observations and
measured variables.
In addition, we found significant differences in start-up valuations depending on their
development stage, but not on the industry they belong to (paragraph 4.4.2.2). The most volatile
sector is ICT, while Cleantech and Consumer Products ranges are relatively restricted.
In Chapter 2, we presented an overview of the main start-up valuation methods, which we are
now ready to discuss in comparison with our findings and with the best model resulting from
the Multiple Regression Analysis conducted in Chapter 6.
The Scorecard Method (Payne, 2011) is based on the average valuation of early-stage start-ups
in the industry and region of the target company. As explained in Chapter 3, that kind of sensitive
data collection procedure is extremely time-consuming, which makes it quite unrealistic
that co-founders and investors could take advantage of it.
The Berkus Model (Berkus, 2009), instead, considers very broad concepts as the factors
influencing a start-up’s valuation and, when it comes to numbers, leaves enormous space to
interpretation. It is therefore highly probable that the same start-up, when evaluated by
different actors via the Berkus model, receives very disparate valuations.
Finally, the Venture Capital Method (Sahlman and Scherlis, 1989, revised in 2009) is the oldest
and the closest to classical financial methods. Its weakest point is that it rests on the
assumption that the target company will generate a certain estimated amount of revenues in
five years. As Berkus (2009) states, this is a goal with less than a 0.1% probability of being met.
In contrast to all these methods, the approach that we propose stands out on several aspects.
First of all, it takes as input only data which are 100% objective and accessible to all players, at
zero cost: the round size (Amount_raised), the total funding previously raised by the company
(Prev_raised), the year in which the round is going to be closed (Closing_Year), and the nature
of the main investor involved (Type_Lead_Investor). An important consequence is that its results
are objective (independent of the user).
The second important difference between our model and the others is its immediacy: the
valuation estimate is obtained instantly and effortlessly.
Another main difference from the literature concerns the richness of inputs and outcomes.
Our model considers inputs that none of the previous models has taken into account before.
These factors allow our valuation estimate to be more precise, because it is contextualized in
time (Closing_Year), space (Country), and actors involved (type of investor). In Chapter 4, we
also found reason to believe that, with a larger availability of data, other factors not included in
previous approaches would turn out to be significant predictors of pre-money valuation: the
industry, the revenue generated by the start-up, and the number of its employees.
Regarding the richness of the outcome, our method does not provide only a single valuation
estimate: it offers a range (prediction interval) within which the market value is expected to
lie, with a certain probability chosen by the user (e.g. 95%). It also offers a prospectus of the
individual effect played by each factor on the pre-money valuation, and of its significance.
Last but not least, a relevant dimension on which our model outperforms all the others is the
validity of its performance. In paragraph 6.5 we provided a transparent and detailed analysis of
its internal measures and of its results when tested through a validation set, cross validation,
and bootstrap. All of these tests agree on an adjusted R-squared of about 0.8, a predicted
R-squared of 0.77, and a prediction error rate close to 2.7%.
To sum up, the advantages of our model compared to the available ones in the literature are:
− Accessibility: everyone has access to input data at zero cost;
− Immediacy: given the inputs, the method is instantaneous and effortless;
− Objectiveness: results are independent of the user;
− Outcome richness: not just a valuation benchmark but a probability interval, plus the
individual effect of each factor;
− Contextuality: location, timing, and investors involved are taken into account;
− Performance: the prediction error rate of the model is around 2.7%11, the adjusted
R-squared is 0.8, and the predicted R-squared about 0.77.
On the other hand, our approach presents several limitations:
− Limited geographic validity (Switzerland);
− Limited dimension of the training set (286 samples) and therefore absence of a separate
test set;
− Abundance of missing values on specific variables (revenue, stage, profitability,
employees);
− Other relevant qualitative and quantitative measures are not considered (background
and experience of the team, market size, validity of the business strategy, proof of
validation, tangible and intangible assets, market positioning (leadership, barriers to
entry, brand awareness), expected synergies, patents/technology, competition, …)
These limitations could be addressed in further research by expanding the data set for Swiss
rounds, or by focusing on another country or on a specific industry. Overall, the literature on
this topic is young and still poorly able to provide co-founders and investors with a starting
point in their “dance of concessions”. Nonetheless, this thesis shows that the room for
improvement and the pool of interested stakeholders are broad. This has also been underlined
by a recent survey conducted by the Canadian Golden Triangle Angel Network, involving 200
private angels. The result was that, for 50% of them, the most likely reason behind the decision
not to invest in a start-up is that “Companies overstated their valuations” (Douglas, 2016).
11 When tested through validation set, cross validation, and bootstrap techniques
8 REFERENCES
Allen, D. M. (1974). The relationship between variable selection and data augmentation and a
method for prediction. Technometrics, 16(1), 125-127.
Arping, S., & Falconieri, S. (2009). Strategic versus financial investors: The role of strategic
objectives in financial contracting. Oxford Economic Papers, 62(4), 691-714.
Behrmann, G. (2016). Internet Company Valuation - A Study of Valuation Methods and Their
Accuracy. EBS Universität für Wirtschaft und Recht, Oestrich-Winkel, Germany.
Berkus, D. (2009, Nov 4). The Berkus Method – Valuing the Early Stage Investment. Berkonomics.
https://berkonomics.com/?p=131.
Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356),
791-799.
Clarysse, B., & Kiefer, S. (2011). Introducing the venture roadmap and basic financials. In The
Smart Entrepreneur: How to Build for a Successful Business (p. 191). Elliott &
Thompson.
Dahl, D. (2015, Oct 30). Why Valuing Your Business Is More Art Than Science. Forbes.
Damodaran, A. (2007). Valuation approaches and metrics: a survey of the theory and evidence.
Foundations and Trends® in Finance, 1(8), 693-784.
Douglas, R. (2016, Sept 2). Early-stage startup valuations: More art than science. Communitech
News - https://news.communitech.ca/early-stage-startup-valuations-more-art-than-
science/
Engel, R. (2002). Teaching note: An introduction to the venture capital method.
Frei, P., & Leleux, B. (2004). Valuation—what you need to know. Bioentrepreneur, 1-3.
Gentle, J. E. (2009). Computational statistics (Vol. 308). New York: Springer.
Green, S. B. (1991). How many subjects does it take to do a regression analysis. Multivariate
behavioral research, 26(3), 499-510.
Gunn, M. A. (2016). When science meets entrepreneurship: Ensuring biobusiness graduate
students understand the business of biotechnology. Journal of Entrepreneurship
Education, 19(2), 53.
Harris, R. J. (1985). A primer of multivariate statistics (2nd ed.). New York: Academic Press.
Harrell, F. E. Jr., Lee, K. L., Califf, R. M., Pryor, D. B., & Rosati, R. A. (1984). Regression modelling
strategies for improved prognostic prediction. Statistics in Medicine, 3(2), 143-152.
Isabelle, D. (2013). Key factors affecting a technology entrepreneur's choice of incubator or
accelerator. Technology innovation management review, 16-22.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
(Vol. 112). New York: Springer.
Lebret H. (2019). The Analysis of 500+ startups. Retrieved from: http://www.startup-
book.com/2019/04/26/the-analysis-of-500-startups/.
Payne, B. (2011). Scorecard valuation methodology. Establishing the Valuation of Pre-revenue,
Startup Companies. Retrieved from: http://docplayer.net/14290190-Scorecard-
valuation-methodologyestablishing-the-valuation-of-pre-revenue-start-up-companies-
by-billpayne.html
Peduzzi, P., Concato, J., Kemper, E., Holford, T. R., Feinstein, A. R. (1996). A simulation study of
the number of events per variable in logistic regression analysis. Journal of Clinical
Epidemiology. 49 (12)
Rao, M. B., & Rao, C. R. (2014). Computational Statistics with R (Vol. 32). Elsevier.
8 References 87
Sahlman, W., & Scherlis, D. (1989). A Method For Valuing High-Risk, Long-Term Investments: The
"Venture Capital Method". Harvard Business School, 9-288. (Revised October 2009.)
Van Voorhis, C. R. W., Betsy, L. & Morgan R. (2007). Understanding Power and Rules of Thumb
for Determining Sample Sizes. Tutorials in Quantitative Methods for Psychology, 3(2),
43-50.
Villalobos, L. (2007). Investment Valuations of Seed- and Early-Stage Ventures. The
entrepreneur’s trusted guide to high growth (pp. 3-4). Ewing Marion Kauffman
Foundation.
Wong, A., Bhatia, M., & Freeman, Z. (2009). Angel finance: the other venture capital. Strategic
Change: Briefings in Entrepreneurial Finance, 18(7‐8), 221-230.
Zellner, A., Keuzenkamp, H. A., & McAleer, M. (Eds.). (2001). Simplicity, inference and modelling:
keeping it sophisticatedly simple. Cambridge University Press.
9 APPENDIX: MODEL SPECIFICATION AND SELECTION
9.1 AUTOMATED MODELS
9.1.1 Stepwise regression
We report the best model from stepwise regression, with direction = ”both”:
Residuals:
     Min       1Q   Median       3Q      Max
-1.35936 -0.34815  0.05134  0.31283  1.30798

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -1.810e+02  5.255e+01  -3.443 0.000835 ***
Closing_Year                               9.467e-02  2.626e-02   3.604 0.000486 ***
Type_Lead_InvestorInstitutional Financial -7.293e-02  1.227e-01  -0.594 0.553601
Type_Lead_InvestorInstitutional Strategic  2.749e-01  1.535e-01   1.791 0.076237 .
lPrev_raised                               1.124e-07  2.202e-08   5.104 1.55e-06 ***
lAmount_raised                             3.859e-01  7.354e-02   5.248 8.41e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4949 on 102 degrees of freedom
Multiple R-squared: 0.674, Adjusted R-squared: 0.658
F-statistic: 42.17 on 5 and 102 DF, p-value: < 2.2e-16
Industry, Stage, Closing_Year_factor, and Revenue are excluded from the model, which therefore counts five predictors (out of the original eight). The intercept is negative and the coefficient for Institutional Financial is not significant.
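For readers unfamiliar with the mechanics behind R's step(direction="both"), the following pure-Python sketch illustrates the greedy add/drop search: at each step, the single addition or removal that most lowers the AIC is accepted, until no move improves it. This is an illustration on toy data only (not the thesis data set, and a simplified Gaussian-likelihood AIC rather than R's exact value).

```python
import math

def ols_rss(xcols, y):
    """Fit y on an intercept plus the columns in xcols by ordinary least
    squares (normal equations, Gaussian elimination); return the RSS."""
    n, p = len(y), len(xcols) + 1
    X = [[1.0] + [c[i] for c in xcols] for i in range(n)]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
    v = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    for col in range(p):                       # forward elimination, partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    b = [0.0] * p                              # back substitution
    for r in range(p - 1, -1, -1):
        b[r] = (v[r] - sum(A[r][c] * b[c] for c in range(r + 1, p))) / A[r][r]
    return sum((y[i] - sum(b[j] * X[i][j] for j in range(p))) ** 2 for i in range(n))

def aic(rss, n, k):
    # Gaussian log-likelihood form of AIC, up to an additive constant
    return n * math.log(rss / n) + 2 * (k + 1)

def stepwise_both(columns, y):
    """Greedy add/drop search that stops when no move lowers the AIC."""
    chosen, n = [], len(y)
    best = aic(ols_rss([], y), n, 0)
    improved = True
    while improved:
        improved = False
        moves = [chosen + [v] for v in columns if v not in chosen]
        moves += [[v for v in chosen if v != d] for d in chosen]
        for m in moves:
            score = aic(ols_rss([columns[v] for v in m], y), n, len(m))
            if score < best - 1e-9:
                best, chosen, improved = score, list(m), True
    return chosen

# Toy data: y depends on x1 only; x2 is an unrelated permutation.
x1 = [0.0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
x2 = [5.0, 3, 8, 1, 9, 2, 7, 4, 6, 0]
y = [2 * x1[i] + 1 + 0.2 * (-1) ** i for i in range(10)]
selected = stepwise_both({"x1": x1, "x2": x2}, y)
print(selected)  # always contains "x1"; "x2" enters only if it lowers the AIC
```

R's step() additionally prints the AIC trace of every candidate move at each step; the greedy logic is the same.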
9.1.2 Best Subset Selection
Figure 9.1: Three plots for analysing the results of the Best subset selection method, applied via the software R. On the abscissa is represented the number of variables in the model, while on the ordinate is shown the resulting value of each metric: adjusted R squared, Cp, and BIC. The optimal value for each metric is marked in red.
We obtain different results in Figure 9.1; five predictors could be a good compromise. The first variable selected by this method is lAmount_raised, followed by lPrev_raised, as expected given their moderate-to-high correlation with the dependent variable lPre_valuation. Next come Closing_Year, Institutional Strategic, and Later-stage.
Model: regsub.best Residuals: Min 1Q Median 3Q Max -1.29992 -0.27916 0.00431 0.33149 1.27445 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.513e+02 4.743e+01 -3.190 0.00188 ** lAmount_raised 3.432e-01 7.235e-02 4.744 6.70e-06 *** lPrev_raised 1.109e-07 2.137e-08 5.191 1.04e-06 *** Closing_Year 8.021e-02 2.373e-02 3.379 0.00102 ** Type_Lead_InvestorInstitutional Strategic 3.299e-01 1.273e-01 2.592 0.01093 * StageLater-stage 2.156e-01 1.125e-01 1.917 0.05793 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4827 on 104 degrees of freedom (176 observations deleted due to missingness) Multiple R-squared: 0.7012, Adjusted R-squared: 0.6868 F-statistic: 48.81 on 5 and 104 DF, p-value: < 2.2e-16
These predictors are indeed all significant and the adj.R2 is higher than before. The intercept is
negative.
Note also that the adjusted R2, BIC and Cp are calculated on the training data used to fit the model. Model selection through these metrics is therefore potentially subject to overfitting, and the selected model may not perform as well on new data. As justified in the Methodology (paragraph 6.2), we adopt a validation set and cross validation to test the automated models.
9.1.2.1 Model selection using a validation set
We split the data into two halves (train and test), run Best Subset Selection on the training half, and evaluate the resulting models on the test half. Figure 9.2 shows that the optimal number of predictors (the one with the minimum Mean Squared Error, MSE) is five once again, so the best model is confirmed to be [regsub.best].
9.1.2.2 Model selection by k-fold cross validation
We test the models with the 5-fold cross validation method. Figure 9.3 shows that the model with the minimum cross validation error has four predictors (in accordance with the minimum BIC in Best subset selection, Figure 9.1), which gives the following result:
Model: regsub.cv.best

Residuals:
    Min      1Q  Median      3Q     Max
-1.5498 -0.4447 -0.0130  0.3711  1.9436

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -7.595e+01  3.335e+01  -2.277   0.0236 *
lAmount_raised                             6.717e-01  3.478e-02  19.313  < 2e-16 ***
lPrev_raised                               2.966e-08  6.517e-09   4.551 8.12e-06 ***
Closing_Year                               4.063e-02  1.660e-02   2.448   0.0150 *
Type_Lead_InvestorInstitutional Strategic  3.327e-01  1.489e-01   2.234   0.0263 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6324 on 266 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared: 0.7215, Adjusted R-squared: 0.7173
F-statistic: 172.3 on 4 and 266 DF, p-value: < 2.2e-16
Figure 9.2: Selection of the best model via the Validation set method. On the abscissa is represented the number of predictors in the model, and on the ordinate the corresponding MSE. The optimal value is coloured in red.
Figure 9.3: Selection of the best model via the k-fold cross validation method. On the abscissa is represented the number of predictors in the model, and on the ordinate the corresponding mean cross validation error. The optimal value is coloured in red.
By not including the fifth predictor, Stage, this model excludes only 15 samples due to NA's, instead of the 176 excluded in the 5-predictor model [regsub.best]. As a consequence, both the degrees of freedom and the adjusted R-squared are much higher than for [regsub.best].
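The 5-fold cross validation used above is, mechanically, just a partition of the observations into five near-equal folds: each fold serves once as the validation set while the model is fitted on the other four, and the CV error is the mean of the five per-fold MSEs. A pure-Python sketch of the fold construction (the thesis uses R; indices are usually shuffled first, which is omitted here for brevity):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, nearly equal folds; in each CV
    round one fold is held out for validation, the rest form the training set."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)   # early folds absorb the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 271 usable observations, as in [regsub.cv.best]
# (266 residual df + 4 predictors + intercept)
folds = kfold_indices(271, 5)
print([len(f) for f in folds])  # [55, 54, 54, 54, 54]
```

Every observation falls in exactly one fold, so each one is predicted exactly once by a model that never saw it during fitting.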
9.2 THIRD MANUAL VARIABLE SELECTION (FROM 8 TO 5 PREDICTORS)
In Automated models, we experimented with different methods, compared the resulting models, and selected [regsub.best] and [regsub.cv.best] as the best ones. We have no doubts about the importance of the four predictors selected in [regsub.cv.best], and we look for further confirmation of the fifth predictor (Stage), included in [regsub.best]. We also want to make sure that no relevant variable has been excluded.
To confirm the results of the automated models, we manually add one variable at a time to a [basic.model] that includes only the four most relevant predictors, which we definitely want in our models. We then compare the regression results to the correlation analysis conducted in paragraph 4.4, to see how the significance of the coefficients and the performance of the model change as new predictors are introduced.
9.2.1 Basic.model
In [basic.model] we include only the most relevant predictors, which we cannot omit as explanatory variables in our models.
9.2.1.1 Continuous variables
Based on our Correlation Analysis, conducted in paragraph 4.2, lAmount_raised and lPrev_raised show the tightest relationships with the dependent variable lPre_valuation (Kendall’s Tau coefficients of 0.59 and 0.43 respectively, both highly significant). As the scatterplots show (figures 4.23 - 4.25), these relationships follow a linear trend. For these reasons, our [basic.model] formula will be:
lPre_valuation ~ lAmount_raised + lPrev_raised
When applied to the full pre-processed data set, we obtain:
Model: basic.model

Residuals:
     Min       1Q   Median       3Q      Max
-1.49462 -0.42526  0.00596  0.36189  1.93327

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     5.510e+00  4.508e-01  12.222  < 2e-16 ***
lAmount_raised  7.059e-01  3.184e-02  22.173  < 2e-16 ***
lPrev_raised    2.743e-08  6.177e-09   4.441 1.28e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6491 on 283 degrees of freedom
Multiple R-squared: 0.7302, Adjusted R-squared: 0.7283
F-statistic: 382.9 on 2 and 283 DF, p-value: < 2.2e-16
All estimated coefficients are strongly significant, the intercept is positive, and the adjusted R squared is above 72%. With only two predictors and df = 283, overfitting is not a concern.
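The adjusted R squared reported by R can be recomputed directly from the multiple R squared, the sample size and the number of predictors. As a sanity check, the sketch below reproduces the value reported above for [basic.model] (n = 283 residual df + 2 predictors + intercept = 286):

```python
# Adjusted R-squared penalizes R-squared for the number of predictors:
# adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2, n, p):
    """r2: multiple R-squared, n: sample size, p: number of predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Values reported for [basic.model]: R2 = 0.7302, n = 286, p = 2.
adj = adjusted_r2(0.7302, n=286, p=2)
print(round(adj, 4))  # 0.7283, matching the reported adjusted R-squared
```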
The MLR BLUE assumptions are respected overall, except for the highest observed lPre_valuation values, where the residuals tend to increase. Six outliers are identified, and two of them are particularly extreme (observations 259 and 281), so we remove them before proceeding with model optimization (see figure 9.4, showing Cook's distances with an indicative red line set at 4*mean(cooksd)).
Figure 9.4: For basic.model, on the abscissa is plotted the number of the sample (random order) and on the ordinate its Cook's distance, calculated via the software R.
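The flagging rule of Figure 9.4 (Cook's distance above 4 times its mean) can be sketched as follows for a one-predictor regression; the data here are illustrative, not the thesis data set:

```python
def cooks_distances(x, y):
    """Cook's distances for the simple regression y ~ x (p = 2 parameters)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sxx
    intercept = ybar - slope * xbar
    resid = [y[i] - (intercept + slope * x[i]) for i in range(n)]
    s2 = sum(r * r for r in resid) / (n - 2)            # residual variance
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverages h_ii
    return [resid[i] ** 2 / (2 * s2) * lev[i] / (1 - lev[i]) ** 2
            for i in range(n)]

x = [float(i) for i in range(10)]
y = [float(i) for i in range(9)] + [25.0]   # last point is a gross outlier
d = cooks_distances(x, y)
flagged = [i for i, di in enumerate(d) if di > 4 * sum(d) / len(d)]
print(flagged)  # [9]: only the planted outlier exceeds the 4*mean threshold
```

Cook's distance combines the size of a residual with the leverage of its observation, which is why the extreme point at the edge of the x-range dominates.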
9.2.1.1.1 Interaction term
As lPrev_raised and lAmount_raised are moderately correlated with each other, we wonder whether adding an interaction term between these two variables would improve the model.
Model: interaction.basic.model

Residuals:
     Min       1Q   Median       3Q      Max
-1.68162 -0.41095  0.01992  0.34764  2.02972

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                  5.472e+00  4.499e-01  12.161   <2e-16 ***
lAmount_raised               7.067e-01  3.174e-02  22.267   <2e-16 ***
lPrev_raised                 1.677e-07  8.298e-08   2.021   0.0442 *
lAmount_raised:lPrev_raised -8.088e-09  4.771e-09  -1.695   0.0912 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.647 on 282 degrees of freedom
Multiple R-squared: 0.7329, Adjusted R-squared: 0.7301
F-statistic: 257.9 on 3 and 282 DF, p-value: < 2.2e-16
The estimated coefficient of the interaction term is not significant at the 5% level and is extremely small, and the other estimates change little. The ANOVA test and the model performance metrics confirm that no significant improvement is gained. This means that the effect of lAmount_raised on lPre_valuation is not influenced by lPrev_raised (and vice versa). So, we retain our [basic.model] formula as the starting point for building the best model.
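Mechanically, the interaction term tested above is nothing more than an extra design-matrix column equal to the elementwise product of the two predictors (R's formula notation x1 * x2 expands to x1 + x2 + x1:x2). A minimal Python sketch, for illustration only:

```python
# Design matrix for "y ~ x1 * x2": intercept, x1, x2, and the x1:x2 column.
def design_with_interaction(x1, x2):
    return [[1.0, a, b, a * b] for a, b in zip(x1, x2)]

X = design_with_interaction([2.0, 3.0], [10.0, 100.0])
print(X)  # [[1.0, 2.0, 10.0, 20.0], [1.0, 3.0, 100.0, 300.0]]
```

A significant coefficient on the product column would mean that the slope of one predictor depends on the level of the other; here it did not, which is why the term was dropped.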
9.2.1.1.2 Closing_Year
The third predictor we want to add to [basic.model] is Closing_Year, as strongly suggested by all
automated models.
Model: basic.model

Residuals:
     Min       1Q   Median       3Q      Max
-1.48129 -0.45169 -0.03373  0.38469  1.92212

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)     -8.759e+01  3.154e+01  -2.777  0.00586 **
lAmount_raised   6.239e-01  3.531e-02  17.670  < 2e-16 ***
lPrev_raised     5.037e-08  9.126e-09   5.520 7.88e-08 ***
Closing_Year     4.674e-02  1.570e-02   2.977  0.00317 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6189 on 274 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared: 0.705, Adjusted R-squared: 0.7017
F-statistic: 218.2 on 3 and 274 DF, p-value: < 2.2e-16
Its contribution is positive and highly significant, and the intercept is now negative. The adjusted R2 has slightly decreased, because three observations were deleted due to missingness. We cannot compare its performance to [basic.model] through ANOVA, because the sample sizes differ (three NA's in Closing_Year). We have no hesitation in accepting this predictor into our [basic.model].
9.2.1.2 Categorical variables
Looking at the correlations between lPre_valuation and the categorical variables (paragraph 4.4), Stage is a potentially meaningful predictor, and the interaction between Type_Lead_Investor and Stage also looks very promising. Finally, we do not expect Revenue to be a significant predictor, although this is partly due to its large proportion of missing values.12
In the following paragraphs we will manually test these intuitions concerning categorical
variables, by adding them to [basic.model], one-by-one.
9.2.1.2.1 Type_Lead_Investor (TLI)
Automated models suggest including the group Institutional Strategic (belonging to the variable Type_Lead_Investor) in the model. Let us fit and compare two models: one with the full variable, and one with pooled levels. The automated procedure suggests that there is no significant difference between the groups “Institutional Financial” and “Acc/Inc/PA” (Accelerators/Incubators/Private Angels), both belonging to the categorical variable Type_Lead_Investor (TLI); by “pooled levels” we therefore mean that these two groups are merged into a single group called “Non-strategic investments”, to distinguish it from the remaining TLI group, “Institutional Strategic”.
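Operationally, pooling factor levels is just a relabelling of the categorical variable before refitting the model; a minimal sketch of the TLI pooling described above:

```python
# The two TLI groups found statistically indistinguishable are merged into
# a single "Non-strategic investments" level before the model is refitted.
POOL = {
    "Acc/Inc/PA": "Non-strategic investments",
    "Institutional Financial": "Non-strategic investments",
    "Institutional Strategic": "Institutional Strategic",
}

def pool_tli(levels):
    return [POOL[lvl] for lvl in levels]

print(pool_tli(["Acc/Inc/PA", "Institutional Strategic", "Institutional Financial"]))
# ['Non-strategic investments', 'Institutional Strategic', 'Non-strategic investments']
```

With one level fewer, the refitted model spends one fewer degree of freedom on dummy coefficients, which is where the small gains in adjusted R2 come from.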
In the next paragraph we will also test the interaction term TLI:Stage (as suggested by our
previous Correlation analysis), and compare performances.
Model: basic.model + TLI

Residuals:
    Min      1Q  Median      3Q     Max
-1.4723 -0.4208 -0.0169  0.3713  1.9430

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -9.225e+01  3.285e+01  -2.808  0.00536 **
lAmount_raised                             5.865e-01  3.620e-02  16.201  < 2e-16 ***
lPrev_raised                               6.329e-08  1.100e-08   5.754 2.44e-08 ***
Closing_Year                               4.925e-02  1.635e-02   3.011  0.00286 **
Type_Lead_InvestorInstitutional Financial  9.420e-02  9.522e-02   0.989  0.32344
Type_Lead_InvestorInstitutional Strategic  3.361e-01  1.636e-01   2.055  0.04091 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5967 on 260 degrees of freedom
Multiple R-squared: 0.7075, Adjusted R-squared: 0.7018
F-statistic: 125.8 on 5 and 260 DF, p-value: < 2.2e-16
The next model is identical to the previous one, except that for the predictor Type_Lead_Investor we pooled the groups Acc/Inc/PA and Institutional Financial into the new group Non-strategic investors.
12 Suggestion for further research: by expanding the availability of Revenue data, we expect this predictor to become a significant one (see chapter 7).
Model: basic.model + TLI (pooled)

Residuals:
     Min       1Q   Median       3Q      Max
-1.53913 -0.43311 -0.01314  0.32959  1.89674

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -7.760e+01  3.205e+01  -2.421  0.01615 *
lAmount_raised                             5.910e-01  3.591e-02  16.458  < 2e-16 ***
lPrev_raised                               9.771e-08  1.489e-08   6.560 2.89e-10 ***
Closing_Year                               4.197e-02  1.595e-02   2.631  0.00903 **
Type_Lead_InvestorInstitutional Strategic  3.099e-01  1.406e-01   2.204  0.02839 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5947 on 260 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared: 0.7201, Adjusted R-squared: 0.7158
F-statistic: 167.2 on 4 and 260 DF, p-value: < 2.2e-16
Figure 9.5: For the model “basic.model + TLI (pooled)”, on the abscissa is plotted the number of the sample (random order) and on the ordinate its Cook's distance, calculated via the software R.
We remove the outliers 268 and 278 (the outlier analysis shows that their adjusted p-values, via Bonferroni correction, fall below the 0.05 threshold; see Figure 9.5).
The ANOVA test, internal measures, cross-validation and bootstrap tests do not find a significant difference between these two models (non-pooled and pooled). The adjusted R2 and Predicted R2 are slightly higher for the second model, because of its reduced number of predictors. In any case, we obtained confirmation that TLI is indeed a relevant predictor, as suggested by the Automated models, so we include it in our [basic.model], which now has the following formula:
lPre_valuation ~ lAmount_raised + lPrev_raised + Closing_Year + TLI
9.2.1.2.2 Stage
Our correlation analysis reports the importance both of this variable and of its interaction term with Type_Lead_Investor. The [regsub.best] model suggests pooling Stage, by merging the Early-stage and First Clients groups, while the [regsub.cv.best] model does not consider this predictor. We start with the individual variable and compare performances when Type_Lead_Investor and Stage are pooled or not.
Model: basic.model + Stage (non-pooled)

Residuals:
     Min       1Q   Median       3Q      Max
-1.26874 -0.27839  0.00519  0.31994  1.30212

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                 -1.557e+02  5.564e+01  -2.798  0.00615 **
lAmount_raised               3.475e-01  7.378e-02   4.710 7.83e-06 ***
lPrev_raised                 1.106e-07  2.170e-08   5.097 1.59e-06 ***
Closing_Year                 8.235e-02  2.778e-02   2.965  0.00377 **
TLI_Institutional Financial -4.243e-02  1.232e-01  -0.344  0.73129
TLI_Institutional Strategic  3.032e-01  1.510e-01   2.007  0.04737 *
StageFirst Clients           2.794e-02  1.209e-01   0.231  0.81767
StageLater-stage             2.281e-01  1.375e-01   1.659  0.10017
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4869 on 102 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7018, Adjusted R-squared: 0.6813
F-statistic: 34.3 on 7 and 102 DF, p-value: < 2.2e-16
By introducing the predictor Stage, we lose 158 degrees of freedom, because of its large proportion of NA's. TLI_Inst.Financial and Stage are not significant at the 5% level.
Model: basic.model + Stage (pooled)

Residuals:
     Min       1Q   Median       3Q      Max
-1.29992 -0.27916  0.00431  0.33149  1.27445

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -1.513e+02  4.743e+01  -3.190  0.00188 **
lAmount_raised                             3.432e-01  7.235e-02   4.744 6.70e-06 ***
lPrev_raised                               1.109e-07  2.137e-08   5.191 1.04e-06 ***
Closing_Year                               8.021e-02  2.373e-02   3.379  0.00102 **
Type_Lead_InvestorInstitutional Strategic  3.299e-01  1.273e-01   2.592  0.01093 *
StageLater-stage                           2.156e-01  1.125e-01   1.917  0.05793 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4827 on 104 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7012, Adjusted R-squared: 0.6868
F-statistic: 48.81 on 5 and 104 DF, p-value: < 2.2e-16
By pooling the variables’ levels, we gain 2 degrees of freedom, and all coefficients are now significant at the 5% level. The ANOVA test does not find any significant difference between the two models, so in figures 9.6 and 9.7 we compare their performances in more detail (TLI = Type_Lead_Investor).
All measures improved drastically when these two variables were added, compared to [basic.model without TLI]. Moreover, all metrics show that the pooled model performs slightly better, even if the difference is very small.
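Among the metrics used in these comparisons, PRESS is the least standard: it is the leave-one-out residual sum of squares, and for a linear model it can be computed from a single fit via the leverages, without refitting n times. A one-predictor sketch on illustrative data (not the thesis data set):

```python
def press_and_rss(x, y):
    """PRESS (leave-one-out RSS via the leverage shortcut) and in-sample RSS
    for the simple regression y ~ x; no refits are needed."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sxx
    intercept = ybar - slope * xbar
    press = rss = 0.0
    for i in range(n):
        e = y[i] - (intercept + slope * x[i])
        h = 1 / n + (x[i] - xbar) ** 2 / sxx   # leverage of observation i
        press += (e / (1 - h)) ** 2            # squared deleted residual
        rss += e * e
    return press, rss

press, rss = press_and_rss([1.0, 2, 3, 4, 5, 6], [1.1, 1.9, 3.2, 3.8, 5.1, 6.3])
print(press >= rss)  # True: PRESS can never be smaller than the in-sample RSS
```

Because each deleted residual inflates the ordinary residual by 1/(1-h), PRESS is always at least as large as the RSS, which is what makes it an out-of-sample-flavoured measure.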
Figure 9.6: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model. The models compared are [basic.model without TLI], [basic.model + TLI + Stage], and [basic.model + TLI (pooled) + Stage (pooled)].
9.2.1.2.3 Interaction effect Stage:TLI
The results obtained in the previous models stress the importance of the predictors Stage and Type_Lead_Investor (TLI). In the previous correlation analysis (paragraph 4.4) we also found clues about the impact of their interaction on lPre_valuation. Let us now test it.
Model: basic.model + TLI * Stage

Residuals:
    Min      1Q  Median      3Q     Max
-1.1698 -0.2697  0.0467  0.2376  1.4641

Coefficients:
                                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                             8.634e+00  9.569e-01   9.023 1.48e-14 ***
lAmount_raised                          4.449e-01  6.982e-02   6.372 5.92e-09 ***
lPrev_raised                            9.225e-08  2.158e-08   4.275 4.41e-05 ***
TLI_Inst.Financial                      1.983e-01  1.703e-01   1.164 0.247064
TLI_Inst.Strategic                      1.329e+00  2.410e-01   5.515 2.79e-07 ***
StageFirst Clients                      3.333e-01  1.592e-01   2.093 0.038884 *
StageLater-stage                        7.385e-01  1.971e-01   3.747 0.000301 ***
TLI_Inst.Financial:StageFirst Clients  -1.025e-02  2.367e-01  -0.043 0.965541
TLI_Inst.Strategic:StageFirst Clients  -1.265e+00  3.266e-01  -3.872 0.000194 ***
TLI_Inst.Financial:StageLater-stage    -3.297e-01  2.482e-01  -1.329 0.187035
TLI_Inst.Strategic:StageLater-stage    -1.360e+00  3.318e-01  -4.098 8.54e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4654 on 99 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7356, Adjusted R-squared: 0.7089
F-statistic: 27.54 on 10 and 99 DF, p-value: < 2.2e-16

Model: basic.model + TLI * Stage (pooled)

Residuals:
     Min       1Q   Median       3Q      Max
-1.27517 -0.32262  0.02451  0.32777  1.23082

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)
(Intercept)                          9.001e+00  9.678e-01   9.300 2.46e-15 ***
lAmount_raised                       4.349e-01  6.911e-02   6.293 7.52e-09 ***
lPrev_raised                         1.150e-07  2.212e-08   5.199 1.01e-06 ***
TLI_Inst.Strategic                   5.445e-01  1.670e-01   3.261   0.0015 **
StageLater-stage                     3.770e-01  1.270e-01   2.970   0.0037 **
TLI_Inst.Strategic:StageLater-stage -4.897e-01  2.696e-01  -1.816   0.0722 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5006 on 104 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.6786, Adjusted R-squared: 0.6631
F-statistic: 43.92 on 5 and 104 DF, p-value: < 2.2e-16
Figure 9.7: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model; the models compared are [TLI + Stage] and [TLI + Stage (pooled)]. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to the Standard error, which in the literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds and repeat the cross validation five times; the final model error is taken as the mean error over the repeats.
By comparing the non-pooled models with and without the interaction term, ANOVA finds a
significant difference:
Model 1: lPre_valuation ~ lAmount_raised + lPrev_raised + Closing_Year + TLI + Stage
Model 2: lPre_valuation ~ lAmount_raised + lPrev_raised + TLI*Stage
  Res.Df    RSS Df Sum of Sq  Pr(>Chi)
1    102 24.181
2     99 21.444  3    2.7374  0.005489 **
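The significance reported in this comparison can be cross-checked from the printed RSS and residual df alone. The sketch below computes the classical F form of the nested-model test (the R table reports a chi-square style p-value, but the inputs are the same):

```python
# F-statistic for comparing nested OLS models from their RSS and residual df:
# F = ((RSS_reduced - RSS_full) / (df_reduced - df_full)) / (RSS_full / df_full)
def anova_f(rss_reduced, df_reduced, rss_full, df_full):
    num = (rss_reduced - rss_full) / (df_reduced - df_full)
    den = rss_full / df_full
    return num / den

# Values from the table above: Model 1 has RSS = 24.181 on 102 df,
# Model 2 (with the TLI*Stage interaction) has RSS = 21.444 on 99 df.
f = anova_f(24.181, 102, 21.444, 99)
print(round(f, 2))  # 4.21
```

An F value this large on (3, 99) degrees of freedom is consistent with the p-value of about 0.005 printed by R, i.e. the interaction terms jointly matter.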
If we then compare the pooled and non-pooled models, both with the interaction terms, the
difference is again significant:
Res.Df RSS Df Sum of Sq Pr(>Chi) 1 99 21.444 2 104 26.064 -5 -4.62 0.0007017 ***
In figures 9.8 – 9.9 we compare performances in more detail.
From these metrics, the best model always turns out to be [basic.model + TLI*Stage] with non-pooled groups. The only exception is BIC, for which [basic.model + TLI + Stage] has the minimum value. Nonetheless, as all the other measures favour [basic.model + TLI*Stage, original], we can state that this is the best model so far. Note that the Prediction error rate for this model is only 2.1% when measured through the validation set, and around 3.1% when measured by the other methods.
Now that we have tested the effect of all predictors selected by the Automated models, we want to investigate the effect of including two further variables in our best model so far, [basic.model + TLI*Stage, non-pooled]: Industry and Revenue.
9.2.1.2.4 Industry
During the correlation analysis (paragraph 4.4.2.2), we showed that, even after pooling some minor groups of the categorical variable Industry together, the group means are not significantly different from one another. We therefore presumed that adding this explanatory variable to [basic.model + TLI*Stage] would not be useful. Indeed, the automated selection methods excluded Industry from their best regression models.
Figure 9.8: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model. The models compared are [basic.model + TLI + Stage], [TLI*Stage], and [TLI*Stage (pooled)].
Model: basic.model + TLI * Stage (non-pooled) + Industry

Residuals:
     Min       1Q   Median       3Q      Max
-1.16817 -0.27212  0.05026  0.23742  1.31437

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            8.896e+00  9.889e-01   8.996 2.31e-14 ***
lAmount_raised                         4.176e-01  7.154e-02   5.838 7.33e-08 ***
lPrev_raised                           9.445e-08  2.135e-08   4.424 2.58e-05 ***
TLI_Inst.Financial                     9.192e-02  1.742e-01   0.528 0.598942
TLI_Inst.Strategic                     1.448e+00  2.516e-01   5.755 1.05e-07 ***
StageFirst Clients                     4.259e-01  1.680e-01   2.535 0.012890 *
StageLater-stage                       8.008e-01  2.101e-01   3.812 0.000245 ***
IndustryFintech                        2.346e-01  2.600e-01   0.902 0.369302
IndustryBiotech                        7.638e-01  3.670e-01   2.081 0.040087 *
IndustryMedtech/Healthcare             2.503e-01  1.502e-01   1.666 0.098914 .
IndustryICT                            1.205e-02  1.210e-01   0.100 0.920888
TLI_Inst.Financial:StageFirst Clients  1.098e-01  2.416e-01   0.455 0.650420
TLI_Inst.Strategic:StageFirst Clients -1.382e+00  3.309e-01  -4.176 6.59e-05 ***
TLI_Inst.Financial:StageLater-stage   -1.787e-01  2.577e-01  -0.693 0.489766
TLI_Inst.Strategic:StageLater-stage   -1.408e+00  3.658e-01  -3.849 0.000215 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.459 on 95 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7532, Adjusted R-squared: 0.7169
F-statistic: 20.71 on 14 and 95 DF, p-value: < 2.2e-16
Figure 9.9: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model; the models compared are variations to [basic.model without TLI]: [TLI + Stage], [TLI*Stage original], and [TLI*Stage pooled]. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to the Standard error, which in the literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds and repeat the cross validation five times; the final model error is taken as the mean error over the repeats.
Figure 9.10: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model; the models compared are [basic.model + TLI*Stage] and [TLI*Stage + Industry]. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to the Standard error, which in the literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds and repeat the cross validation five times; the final model error is taken as the mean error over the repeats.
Figure 9.11: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model. The models compared are [basic.model + TLI*Stage] and [TLI*Stage + Industry].
Instead, by setting Others as the reference group, all coefficients are influential and almost significant. The least significant is ICT, which, on the contrary, revealed the clearest growth trend in the Correlation analysis. The adjusted R2 improves, but the ANOVA test indicates no significant difference between the models:
Model 1: lPre_valuation ~ lAmount_raised + lPrev_raised + TLI*Stage
Model 2: lPre_valuation ~ lAmount_raised + lPrev_raised + TLI*Stage + Industry
  Res.Df    RSS Df Sum of Sq Pr(>Chi)
1     99 21.444
2     95 20.011  4    1.4329   0.1467
Comparing the other metrics gives discordant results: adding Industry generally improves the internal measures but worsens the performance tests via CV and bootstrap (see the results in Figures 9.10 – 9.11). This suggests that, by adding the predictor Industry to our best model, we are moving toward overfitting. So, we decide to exclude it definitively.
9.2.1.2.5 Revenue
During our correlation analysis (Chapter 4), we showed that, even after pooling some minor groups of Revenue together, the group means are not significantly different from one another. We therefore presumed that adding this explanatory variable to [basic.model] would not be useful. Indeed, the automated selection methods excluded Revenue from their best regression models.
Model: TLI*Stage (non-pooled) + Revenue

Residuals:
     Min       1Q   Median       3Q      Max
-1.15306 -0.24997  0.04723  0.26149  1.47081

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            8.622e+00  9.970e-01   8.649 1.27e-13 ***
lAmount_raised                         4.453e-01  7.238e-02   6.152 1.81e-08 ***
lPrev_raised                           8.569e-08  2.280e-08   3.759 0.000295 ***
TLI_Institutional Financial            2.127e-01  1.738e-01   1.224 0.224009
TLI_Institutional Strategic            1.359e+00  2.470e-01   5.504 3.14e-07 ***
StageFirst Clients                     3.667e-01  1.958e-01   1.873 0.064099 .
StageLater-stage                       7.109e-01  2.495e-01   2.850 0.005365 **
Revenue50k - 1M                       -2.617e-02  1.397e-01  -0.187 0.851768
Revenue> 1M                            1.266e-01  2.266e-01   0.558 0.577859
TLI_Inst.Financial:StageFirst Clients -1.864e-02  2.457e-01  -0.076 0.939691
TLI_Inst.Strategic:StageFirst Clients -1.287e+00  3.338e-01  -3.855 0.000211 ***
TLI_Inst.Financial:StageLater-stage   -3.119e-01  2.554e-01  -1.221 0.224988
TLI_Inst.Strategic:StageLater-stage   -1.383e+00  3.506e-01  -3.944 0.000153 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4722 on 95 degrees of freedom
  (172 observations deleted due to missingness)
Multiple R-squared: 0.7235, Adjusted R-squared: 0.6886
F-statistic: 20.71 on 12 and 95 DF, p-value: < 2.2e-16
The Revenue coefficients are influential but not significant, and both the internal measures and the performance tests via CV and bootstrap give worse results than [basic.model + TLI*Stage, non-pooled]. We therefore definitively exclude the variable Revenue from the predictors of our best model.
9.2.2 Results
Based on the tests conducted above, the result of the third manual variable selection is to
exclude Industry, Revenue, and Closing_Year_factor from the predictors of our best model,
while including:
• lAmount_raised
• lPrev_raised
• Closing_Year
• Type_Lead_Investor
• Stage
and, potentially, the interaction term Type_Lead_Investor:Stage. This conclusion is in accordance with the Automated Models selection, but it does not additionally pool groups together, and it suggests an interaction term between Stage and TLI.
9.3 INTERACTION TERMS
After our third variable selection, we finally ask whether we have omitted some important interaction terms from our best model. To investigate this, we run the automated models on a new data frame including only the five predictors selected in the third manual variable selection, plus all their possible interactions. The results of Stepwise regression and Best Subset Selection follow.
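The set of candidate interaction terms described above can be generated mechanically. A minimal sketch (Python; the thesis builds the analogous data frame in R, and only pairwise interactions are enumerated here):

```python
from itertools import combinations

# The five predictors retained by the third manual variable selection.
predictors = ["lAmount_raised", "lPrev_raised", "Closing_Year",
              "Type_Lead_Investor", "Stage"]

# All pairwise interaction terms, written in R's a:b formula notation.
interactions = [f"{a}:{b}" for a, b in combinations(predictors, 2)]

print(len(interactions))  # 5 choose 2 = 10 pairwise interaction terms
for term in interactions:
    print(term)
```

Among these ten candidates are Type_Lead_Investor:Stage and Closing_Year:lPrev_raised, the terms that the selection procedures below end up highlighting.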
9.3.1 Stepwise regression
By applying Stepwise regression, with direction = "both", we obtain the following optimal model, which we call [A]:
Call:
lm(lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI +
   Closing_Year:TLI + Closing_Year:lPrev_raised)

Residuals:
     Min       1Q   Median       3Q      Max
-1.56505 -0.27575  0.04329  0.28700  1.04449

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                              -2.917e+02  5.857e+01  -4.980 2.63e-06 ***
Closing_Year                              1.495e-01  2.923e-02   5.117 1.49e-06 ***
TLI_Institutional Financial               6.188e+01  9.606e+01   0.644 0.520912
TLI_Institutional Strategic               5.421e+02  1.520e+02   3.567 0.000555 ***
lPrev_raised                              7.204e-05  3.024e-05   2.383 0.019062 *
lAmount_raised                            3.980e-01  6.418e-02   6.201 1.24e-08 ***
Closing_Year:TLI_Institutional Financial -3.077e-02  4.765e-02  -0.646 0.519913
Closing_Year:TLI_Institutional Strategic -2.688e-01  7.538e-02  -3.565 0.000557 ***
Closing_Year:lPrev_raised                -3.565e-08  1.499e-08  -2.379 0.019240 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4351 on 101 degrees of freedom
Multiple R-squared: 0.7642, Adjusted R-squared: 0.7455
F-statistic: 40.91 on 8 and 101 DF, p-value: < 2.2e-16
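The logic behind stepwise selection with direction = "both" — repeatedly trying single additions and deletions of predictors and accepting moves that lower the AIC — can be sketched as follows. This is a minimal Python illustration on hypothetical synthetic data; the thesis used R's stepwise machinery (presumably step()), and the greedy "first improving move" rule here is a simplification:

```python
import math
import random

def fit_rss(cols, y):
    """OLS with an intercept; returns the residual sum of squares.

    cols: list of predictor columns (each a list of floats).
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [A[r][k] - f * A[c][k] for k in range(p + 1)]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
               for i in range(n))

def aic(rss, n, k):
    # Gaussian AIC up to an additive constant: k slopes plus the intercept.
    return n * math.log(rss / n) + 2 * (k + 1)

def stepwise_both(data, y):
    """Bidirectional stepwise selection by AIC (direction = "both")."""
    selected, n = [], len(y)
    best = aic(fit_rss([], y), n, 0)
    improved = True
    while improved:
        improved = False
        moves = ([("add", v) for v in data if v not in selected]
                 + [("drop", v) for v in selected])
        for op, v in moves:
            trial = (selected + [v] if op == "add"
                     else [s for s in selected if s != v])
            score = aic(fit_rss([data[s] for s in trial], y), n, len(trial))
            if score < best - 1e-9:
                best, selected, improved = score, trial, True
                break  # restart the move search from the updated model
    return selected

# Hypothetical synthetic data: y depends on x1 and x2 but not on x3.
random.seed(42)
n = 60
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * a - 1.5 * b + random.gauss(0, 0.3) for a, b in zip(x1, x2)]
chosen = stepwise_both({"x1": x1, "x2": x2, "x3": x3}, y)
print(sorted(chosen))  # the true predictors x1 and x2 should be retained
```

The procedure stops when no single addition or deletion improves the AIC, which is why it can return a model mixing main effects and interaction terms, as in [A].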
9.3.2 Best Subset Selection
The results of Best Subset Selection (Figure 9.12) favour a model with between five and seven predictors: lPrev_raised, lAmount_raised, Closing_Year:TLI, Closing_Year:lAmount_raised, Closing_Year:lPrev_raised, Inst.Strategic, and Stage:lAmount_raised. We call this model [B].
Figure 9.12: The Figure shows three graphs analysing the results of the Best Subset Selection method, applied via the software R. The abscissa represents the number of variables in the model, while the ordinate shows the resulting value of the following metrics: adjusted R-squared, Cp, and BIC. The optimal value for each metric is marked in red colour.
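Best Subset Selection is conceptually simple: fit every subset of the candidate predictors and score each with a criterion such as BIC (the thesis also inspects adjusted R-squared and Cp; only BIC is scored in this minimal Python sketch on hypothetical synthetic data):

```python
import math
import random
from itertools import combinations

def fit_rss(cols, y):
    """OLS with an intercept; returns the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, solved by Gauss-Jordan elimination.
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [A[r][k] - f * A[c][k] for k in range(p + 1)]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
               for i in range(n))

def bic(rss, n, k):
    # Gaussian BIC up to an additive constant: n*log(RSS/n) + log(n)*(k + 1).
    return n * math.log(rss / n) + math.log(n) * (k + 1)

def best_subset(data, y):
    """Fit every subset of predictors; return the one with the lowest BIC."""
    n, names = len(y), sorted(data)
    best_score, best_sub = float("inf"), ()
    for k in range(len(names) + 1):
        for sub in combinations(names, k):
            score = bic(fit_rss([data[v] for v in sub], y), n, k)
            if score < best_score:
                best_score, best_sub = score, sub
    return best_sub

# Hypothetical synthetic data: only x1 and x2 carry signal.
random.seed(7)
n = 80
cols = {name: [random.gauss(0, 1) for _ in range(n)]
        for name in ["x1", "x2", "x3", "x4"]}
y = [1.2 * a + 0.8 * b + random.gauss(0, 0.4)
     for a, b in zip(cols["x1"], cols["x2"])]
winner = best_subset(cols, y)
print(winner)  # x1 and x2 should appear in the winning subset
```

Because the search is exhaustive (2^p fits), it is only feasible for a modest number of candidates; with the five predictors and their interactions this is still tractable, which is why the method can be run alongside stepwise regression here.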
A comparison of these two best models, [A] and [B], is presented in paragraph 6.5.
9.4 FURTHER METHODS USED TO SELECT THE BEST MODEL
We did not limit our best-model selection to the methods in this chapter. We also experimented with some non-linear regression and other approaches. As they did not directly add information in explaining our dependent variable lPre_valuation, for reasons of brevity we do not show their results in this report as we did for the other methods. Nevertheless, we want to mention that no significant improvement could be achieved by applying and experimenting with the following methods (Rao, 2014; James et al., 2013) via the software R:
− GLM (Generalized Linear Models)
− Curve Fitting using Polynomial Terms
− Fractional exponents (these provided a small but significant improvement; however, residual plots showed the model moving further from meeting the MLR assumptions)
− Spline
− Log function of predictors
− Loess regression
− Kernel regression
We compared the performance of the models created by applying these methods with the performance of [A], and we concluded that [A] remains the best model for the goal of our research. The application of GLM regression, in particular, selected precisely [A] as the best model, which is a positive robustness check.
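The cross-validation performance tests used throughout this comparison share one pattern: partition the data into k folds, fit each candidate on k − 1 folds, and measure prediction error on the held-out fold. A minimal Python sketch with hypothetical data, using simple one-predictor regression in place of the thesis's multiple-regression candidates:

```python
import math
import random

def cv_rmse(x, y, k=5):
    """k-fold cross-validated RMSE for simple linear regression y ~ x."""
    idx = list(range(len(y)))
    random.Random(0).shuffle(idx)   # fixed seed: reproducible folds
    sq_errs = []
    for f in range(k):
        test = set(idx[f::k])
        train = [i for i in idx if i not in test]
        xs, ys = [x[i] for i in train], [y[i] for i in train]
        m = len(xs)
        xb, yb = sum(xs) / m, sum(ys) / m
        sxx = sum((v - xb) ** 2 for v in xs)
        b1 = sum((a - xb) * (b - yb) for a, b in zip(xs, ys)) / sxx
        b0 = yb - b1 * xb
        sq_errs += [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test]
    return math.sqrt(sum(sq_errs) / len(sq_errs))

# Hypothetical data: one informative predictor, one pure-noise predictor.
random.seed(1)
n = 50
x_signal = [random.gauss(0, 1) for _ in range(n)]
x_noise = [random.gauss(0, 1) for _ in range(n)]
y = [1.5 * v + random.gauss(0, 0.5) for v in x_signal]
print(cv_rmse(x_signal, y) < cv_rmse(x_noise, y))  # the informative model wins
```

Ranking candidate specifications by cross-validated error, rather than by in-sample fit alone, is what guards the choice of [A] against rewarding overfitting.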