Start-up valuation in Switzerland:
analysis and methods
Master Thesis
Candidate: Silvia Lama
Supervisor: Prof. Dr. Didier Sornette
ETH Zürich
Department of Management, Technology and Economics (D-MTEC)
Chair of Entrepreneurial Risks
June 2019 – November 2019
ABSTRACT
The aim of this master thesis is to provide an overview of start-up valuations in Switzerland. The
first part focuses on the analysis of funding rounds closed in Switzerland between 2010 and
2019: the existence of patterns and trends is investigated, visualized, and commented on. The
second part selects the best model to estimate a range of pre-money valuations for a target
start-up, to serve as a fair benchmark. Investors and co-founders could use it as a starting point
in their investment negotiation process. Indeed, traditional valuation methods1 cannot be applied
to start-ups, owing to their uncertainty, their short history, and the absence of publicly available
data on financials, comparable companies, or transactions. As a consequence, new valuation
methods have emerged; in the concluding chapter, they are compared to our approach,
stressing their lack of objectivity, contextuality, accuracy, and precision. Finally, several avenues
for further research are recommended.
1 (e.g. the Discounted Cash Flow, the Valuation Multiples)
ACKNOWLEDGEMENTS
I would like to express all my gratitude to Professor Didier Sornette, for the opportunity to
conduct my Master Thesis at the Chair of Entrepreneurial Risks and for his guidance, and to Dr.
Spencer Wheatley for the useful advice.
Besides, I would like to sincerely thank Steffen Wagner and Michael Blank, for the trust
demonstrated by choosing me to pursue this delicate and extremely interesting research
project at Investiere.
My warmest thank goes to Dr. Matteo Farnè, for the valuable support and interest in my work,
and for being a reliable point of reference in my life.
I also wish to heartily thank my mentor and angel investor, Professor Silvio Marenco, who has
been always believing in me, for all the care and time he has been investing in my professional
growth. Above all, during these years he taught me the value of respect and of building trustful
relationships.
If I achieved this goal, it is also thanks to the precious advice of my mentor Andrea Girardello. I
feel truly grateful to all his attention and effort, allowing me to avoid many mistakes, and
inspiring my winding path as a student-entrepreneur. He taught me to never give up, and that
it is always possible to find a smarter way to face challenges, by thinking outside the box.
This journey would not have been so special and unforgettable without the fantastic company
of my dearest friends, and of my lovely flatmates, bringing sparkling colours to every day of my
life.
Finally, I am enormously grateful to my family, who makes me feel the luckiest person on the
Earth, by supporting all my passions and activities, and by loving me as I am.
TABLE OF CONTENTS
Abstract ......................................................................................................................................... 3
Acknowledgements ....................................................................................................................... 4
1 Introduction .......................................................................................................................... 9
1.1 Motivation and overview .............................................................................................. 9
1.2 Research questions ..................................................................................................... 10
2 Start-up valuation methods ................................................................................................ 11
2.1 Overview ..................................................................................................................... 11
2.2 Scorecard method ....................................................................................................... 11
2.3 Berkus model .............................................................................................................. 12
2.4 Venture Capital Method ............................................................................................. 13
3 Data Collection .................................................................................................................... 14
3.1 Sources of data ............................................................................................................ 14
3.2 Process ........................................................................................................................ 14
3.3 Description and pre-processing of the data set .......................................................... 15
3.4 Log Transformation ..................................................................................................... 19
4 Multivariate Data Analysis .................................................................................................. 22
4.1 Treatment of missing data .......................................................................................... 22
4.2 Correlation analysis between continuous variables ................................................... 22
4.2.1 Methodology ....................................................................................................... 22
4.2.2 lThrough_Investiere analysis ............................................................................... 24
4.2.3 Employees analysis .............................................................................................. 30
4.2.4 Analysis of the entire data set ............................................................................. 33
4.3 Correlation Analysis between categorical variables ................................................... 41
4.3.1 Pooling levels together ........................................................................................ 41
4.3.2 Methodology ....................................................................................................... 44
4.3.3 Results ................................................................................................................. 44
4.4 Correlation analysis between continuous and categorical variables .......................... 46
4.4.1 Methodology ....................................................................................................... 46
4.4.2 Results ................................................................................................................. 46
5 Predicting the future success of a Swiss start-up ................................................................ 62
5.1 Overview...................................................................................................................... 62
5.2 Methodology ............................................................................................................... 62
5.3 Results ......................................................................................................................... 63
6 Multiple Regression Analysis ............................................................................................... 64
6.1 Purpose ........................................................................................................................ 64
6.2 Methodology ............................................................................................................... 64
6.2.1 Overview .............................................................................................................. 64
6.2.2 Steps .................................................................................................................... 65
6.3 Data pre-processing .................................................................................................... 66
6.4 Second Manual Variables selection (from 20 to 8 independent variables): ............... 66
6.5 Best models comparison ............................................................................................. 68
6.6 Best selected model .................................................................................................... 74
6.6.1 Confidence and Prediction intervals ................................................................... 74
6.7 MLR BLUE Assumptions Check .................................................................................... 75
6.7.1 Outlier detection ................................................................................................. 76
6.7.2 Check MLR assumptions ..................................................................................... 77
7 Conclusions ......................................................................................................................... 81
8 References ........................................................................................................................... 85
9 Appendix: Model specification and selection ..................................................................... 88
9.1 Automated Models ..................................................................................................... 88
9.1.1 Stepwise regression ............................................................................................ 88
9.1.2 Best Subset Selection .......................................................................................... 88
9.2 Third manual variable selection (from 8 to 5 predictors) ........................................... 91
9.2.1 Basic.model ......................................................................................................... 91
9.2.2 Results ............................................................................................................... 104
9.3 Interaction Terms ...................................................................................................... 104
9.3.1 Stepwise regression .......................................................................................... 105
9.3.2 Best Subset Selection ........................................................................................ 105
9.4 Further methods used to select the best model ....................................................... 106
1 INTRODUCTION
1.1 MOTIVATION AND OVERVIEW
My personal passion for entrepreneurship is a long and most probably never-ending journey
that started in September 2012. On that day, instead of attending classes at high school, my
best friend (later my start-up partner) and I attended a cycle of lectures on student
entrepreneurship, organized by the University of Bologna. That day, I understood that my way
was to be an entrepreneur, and I set the goal of founding my own company at around 30 years old.
But the occasion presented itself much earlier: at 21 I grasped it and founded my first start-up,
Musa, in the education-technology industry.
When the need for a second investment round was approaching, I faced a big question mark:
what is the value of our company? It is not yet profitable, it has very low revenue, and its risk is
hard to quantify; as a result, all traditional financial methods for valuing companies are of
no help. By interacting with founders and with private and institutional investors, it turned out
that the lack of an objective method to value early-stage start-ups is a common, worldwide
problem.
Indeed, nowadays the pre-money valuation of a start-up is nothing but the result of a
negotiation process between the investor and the founders. It requires on average 7-9 months2
and enormous effort from both parties (Clarysse and Kiefer, 2011). Often, the result is mainly
determined by the negotiation power of each actor (e.g. the number of interested investors and
their experience, or the start-up's urgency for money), rather than by the value of the underlying
risky business. As in a poker match, each player withholds information and tries to convince the
opponents that his hand is better than it actually is. But, unlike poker, the participants in
investment negotiations should communicate complete information and work together toward
the shared goal of growing a successful business. The valuation, in fact, is only one part of the
investment process, and it often leads to controversies that get the founder-investor
relationship off on the wrong foot (Villalobos, 2007).
2Average time elapsed in Switzerland between the business plan submission to a Venture Capital firm and the
actual investment
When I approached the Swiss VC Investiere, we found common ground in investigating this
subject, which, under the supervision of Prof. Dr. Sornette (Chair of Entrepreneurial Risks),
became the topic of the present Master thesis.
This chapter proceeds with the presentation of the specific research questions that the project
wants to address, while in Chapter 2, we will review the literature about start-up valuation
methods.
Chapter 3 is dedicated to the description of the data set used in our research and its collection
process, while its multivariate analysis is visualized and commented in Chapter 4.
Further chapters, instead, have the ambitious aim of creating and selecting models that, given
the data of a specific start-up as input, predict its future success with the minimum error rate
(Chapter 5) and estimate a benchmark for its pre-money valuation with the highest possible
accuracy and precision (Chapter 6).
Keeping in mind that “all models are wrong, some models are useful”, we summarize our main
findings in Chapter 7, compare them to the literature, and finally suggest possible trajectories
for future research on the topic.
1.2 RESEARCH QUESTIONS
The goal of this research is to investigate the pre-money valuations achieved by Swiss start-ups
at their investment rounds, between 2010 and 2019. In particular, in the following chapters we
will address mainly, but not only, the following questions:
− Are there differences in start-up valuations between industries? (Chapter 4.4.2.2)
− Does the type of lead investor involved in the round have an influence on the valuation?
(Chapter 4.4.2.4)
− Is there a correlation between the size of the round and the total funding previously
raised? And with the pre-money valuation? (Chapter 4.2.3)
− Can we predict the future success of a start-up based on its current status? (Chapter 5)
− What is the best model to estimate the pre-money valuation of a start-up? (Chapter 6)
2 START-UP VALUATION METHODS
2.1 OVERVIEW
Traditional valuation methods for companies are usually based on the forecasted revenue and
profit that an organization is expected to make. When it comes to early-stage start-ups, these
classical financial formulae miserably fail. In fact, these firms are not profitable yet, and have
very low or zero revenue, so their valuation is inevitably determined by other factors.
In the following paragraphs, we briefly describe the main available methods to value early-stage
start-ups. These approaches are vague and leave room for any kind of interpretation.
Behrmann (2016) showed, by applying them to the same firm, how different the resulting
valuations can be, and demonstrated that the same valuation method, when used on different
firms, may understate as well as overstate their market values. Finally, he stresses that a
valuation obtained through these methods can only be as good as its assumptions: a change in
just one number can dramatically alter the results (Behrmann, 2016).
2.2 SCORECARD METHOD
The Scorecard method, also known as the Bill Payne valuation method, compares pre-money,
pre-revenue start-ups to average valuations, and then adjusts that average according to certain
metrics. Following Payne (2011), it is first necessary to survey the pre-money, pre-revenue
valuations assigned by venture capitalists or private angels to start-ups in the industry and
region of the target company. Next, the start-up is compared qualitatively to the comparable
start-ups in the valuation survey, in accordance with the following categories and weights (Table 2.1):
Strength of the Management Team 0-30%
Size of the Opportunity 0-25%
Product/Technology 0-15%
Competitive Environment 0-10%
Marketing/Sales Channels/Partnerships 0-10%
Need for Additional Investment 0-5%
Other 0-5%
Table 2.1: Scorecard method: valuation categories with corresponding weights.
When the actual assessment is performed, the start-up is compared to the average of the
surveyed start-ups. An average team, relative to the comparable companies, would be awarded
the full 30% (a factor of 0.30). When the valuation subject has a far better than average team,
e.g. 150% of the average, the resulting factor would be 0.45. Conversely, if in one category the
start-up underperforms the peer group, less than the full weight is assigned. In the end, the sum
of all factors is multiplied by the average valuation obtained from the valuation survey. If, for
example, we have a total factor of 1.2 (an above-average venture) and a mean valuation on the
market of €1.6 million, the target would be valued at €1.92 million pre-money. Table 2.2
showcases a complete scorecard valuation of a start-up.
Table 2.2: Exemplary assessment of a start-up using the Scorecard method (from Gunn, 2016)
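The arithmetic of the Scorecard method can be sketched as follows. The category ratings are hypothetical (a team at 150% of the average and an "Other" category at 200%, chosen so that the total factor matches the 1.2 of the example above); they are not taken from Gunn's assessment in Table 2.2.

```python
# Scorecard (Bill Payne) method: hypothetical sketch.
# Each category has a maximum weight; the target's rating is expressed
# relative to the average comparable start-up (1.0 = average).
weights = {
    "team": 0.30, "opportunity": 0.25, "product": 0.15,
    "competition": 0.10, "marketing": 0.10, "need_for_funding": 0.05,
    "other": 0.05,
}
ratings = {  # hypothetical ratings summing to a total factor of 1.2
    "team": 1.5, "opportunity": 1.0, "product": 1.0,
    "competition": 1.0, "marketing": 1.0, "need_for_funding": 1.0,
    "other": 2.0,
}

def scorecard_valuation(avg_valuation, weights, ratings):
    """Sum of weight * rating over all categories, times the survey average."""
    factor = sum(weights[c] * ratings[c] for c in weights)
    return factor * avg_valuation

print(scorecard_valuation(1_600_000, weights, ratings))  # ≈ 1_920_000 (EUR 1.92M pre-money)
```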
2.3 BERKUS MODEL
The Berkus model was developed and proposed by the angel investor Dave Berkus (2009) to
value very early-stage companies with zero or very low revenue, but with the potential of
reaching over $20 million in revenues within five years. According to him, "the universal truth
is that fewer than one in a thousand start-ups meet or exceed their projected revenues in the
periods planned" (Berkus, 2009). Therefore, his method for establishing an initial pre-money
valuation does not take financials into account.
Example of motivations behind the choice of the % of Norm:
− A few co-founders, a not yet established Advisory Board
− The market is there and it is growing
− The concept is nailed down, Minimum Viable Product is in development
− Competition definitely exists, however company has a business model supposed to be disruptive
− No Sales yet, Partnerships are in place for distribution
− In need of $$ to finish development, launch, test, etc
− Tested the market, have positive feedback
If Exists: Add to company value UP to:
1. Sound Idea (basic value, product risk) USD 0.5m
2. Prototype (reducing technology risk) USD 0.5m
3. Quality Management Team (reducing execution risk) USD 0.5m
4. Strategic relationships (reducing market risk and competitive risk) USD 0.5m
5. Product Rollout or Sales (reducing financial or production risk) USD 0.5m
Table 2.3: The Berkus Model: valuation dimensions
Berkus's proposition is to add up to half a million USD per dimension, depending on the degree
to which the start-up fulfils the respective criteria shown in Table 2.3. Once the company starts
to generate revenues, Berkus states, this method loses credibility, and most everyone will use
the actual revenues to project value over time.
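A minimal sketch of the Berkus arithmetic, assuming each of the five dimensions of Table 2.3 is scored on a 0-1 fulfilment scale (the scores below are hypothetical):

```python
# Berkus model: add up to USD 0.5m per dimension, scaled by the degree
# (0.0-1.0) to which the start-up fulfils each one.
BERKUS_CAP = 500_000  # maximum value added per dimension, in USD

def berkus_valuation(fulfilment):
    """fulfilment: dict mapping dimension name -> score in [0, 1]."""
    return sum(min(max(s, 0.0), 1.0) * BERKUS_CAP for s in fulfilment.values())

# Hypothetical assessment of the five dimensions of Table 2.3:
scores = {
    "sound_idea": 1.0, "prototype": 0.8, "management_team": 0.6,
    "strategic_relationships": 0.4, "rollout_or_sales": 0.0,
}
print(berkus_valuation(scores))  # ≈ 1_400_000 USD pre-money
```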
2.4 VENTURE CAPITAL METHOD
The third most common method for start-up valuation is the Venture Capital Method. It was
first introduced by Sahlman and Scherlis (1989), then revised in a 2009 Harvard Business School
case study, and it is also thoroughly described by Engel (2002). The procedure starts by estimating
the terminal value (TV) for the company in some years from now, when the exit is planned: for
that year, revenues are estimated and translated to a TV by multiplying them with P/E ratios or
sales multiples of similar companies in the industry. As an example, a venture has estimated
revenues of € 15 M in five years (t), with similar businesses having a sales multiple of two. This
leads to a TV of approximately € 30 M in five years. This value is then discounted to the present
day, with the discount rate (r) estimated by the VC, usually the required internal rate of return
(IRR) or generally a target rate of return (Damodaran, 2007). Let's say that, as this is a quite risky
business, r is 60%. This would translate to a present value of PV = 30M / (1 + 0.6)^5 ≈ 2.86M. In
Table 2.4 we show a summary of the steps, adapted from Engel (2002):
Step 1 Estimating terminal value:
TV = (P/E ratio) * Earnings
or
TV = (Sales multiple) * Sales
Step 2 Determining present value: PV = TV / (1 + r)^t
Step 3 Calculating demanded ownership fraction: F = (Round size) / PV
Table 2.4: The Venture Capital Method: summary of steps (Engel, 2002).
Engel (2002) also states that the pre-money valuation is calculated by subtracting the round
size from the post-money valuation.
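The three steps of Table 2.4, together with Engel's pre-money relation, can be sketched as follows. The round size of 1M is a hypothetical addition; the other numbers reproduce the worked example above.

```python
# Venture Capital Method: terminal value -> present value -> ownership fraction.
def vc_method(revenue_at_exit, sales_multiple, r, t, round_size):
    tv = sales_multiple * revenue_at_exit  # Step 1: terminal value at exit
    pv = tv / (1 + r) ** t                 # Step 2: discount to present (post-money) value
    fraction = round_size / pv             # Step 3: demanded ownership fraction
    pre_money = pv - round_size            # Engel (2002): pre-money = post-money - round size
    return tv, pv, fraction, pre_money

# Worked example: 15M revenues in 5 years, sales multiple 2, r = 60%,
# and a hypothetical round size of 1M.
tv, pv, f, pre = vc_method(15e6, 2, 0.60, 5, round_size=1e6)
print(round(pv))  # 2861023, i.e. ≈ 2.86M as in the text
```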
3 DATA COLLECTION
3.1 SOURCES OF DATA
This research project would not have been possible without the collaboration of the following
organizations, which allowed us direct or indirect access to their data sets concerning start-up
investment rounds:
• Investiere | Verve Capital Partners AG: A key role has been played by the Swiss Venture
Capital Investiere, by raising the need to investigate Swiss start-up valuations and to
provide the conditions to pursue the analysis.
• Dr. Hervé Lebret: He supported our research by sharing with us the data set behind his
study "The Analysis of 500+ start-ups", published on www.startup-book.com (Lebret,
2019). He is the Manager of Innogrants (EPFL) and a Senior Scientist in the field of
high-tech entrepreneurship; his research concentrates on academic spin-offs, including
those of Stanford University and Silicon Valley.
• Startupticker.ch: this organization shared with us all data from their annual Venture
Capital reports, from 2012 to 2018. Startupticker.ch is the main online news portal
about young Swiss companies.
• Commercial Registries of the Swiss Confederation: through the Registries of
Commerce (Zefix) online portal it is possible to gain access to the legal acts of Swiss
companies for some cantons of Switzerland. These acts include some details of the
funding rounds (e.g. post-money valuation, number of issued shares).
• Crunchbase: this platform provides company insights, and we extracted from it some
data about the analysed start-ups.
3.2 PROCESS
Data collection has been the research task that required the most time and effort overall. The
result is a unique collection of precious, extremely confidential, and sensitive data regarding the
details of start-ups' investment rounds (306 samples overall, concerning 190 companies).
Because this data is protected by non-disclosure agreements between investors and
co-founders, all contents and results of the research will be shared anonymously.
A first batch of samples was provided by Investiere | Verve Capital Partners AG, related to
the investment rounds in which it was directly involved as an investor. After that, data collection
proceeded in two simultaneous directions:
• Search of new sources of data, by directly contacting all the main active organizations
in the Swiss start-up ecosystem (e.g. incubators, accelerators, University technology
transfer offices, facilitators, investors’ clubs).
• Search of data for specific start-ups to replace missing values (e.g. Swiss commercial
registries, CBInsight, start-ups).
3.3 DESCRIPTION AND PRE-PROCESSING OF THE DATA SET
First of all, we pre-process the collected data to ensure its integrity and coherence. For the
purpose of this analysis, we decide to focus only on equity investment rounds pursued by Swiss
start-ups between 2010 and 2019. Therefore, we remove 12 samples concerning convertible
investment rounds and 8 samples related to non-Swiss companies. We can thus exclude all
variables regarding the details of the convertible rounds, plus the following variables: Round
Type (as we consider only equity rounds) and Company name; in fact, this research could only
be pursued anonymously, because of the signed NDAs protecting the data. We can also remove
the variable Data_source, because samples have been randomly collected from different
sources, and we can assume zero correlation between the values and their original source.
After this preliminary selection of variables and samples, we are left with a data set comprising
286 observations and 16 variables.
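Assuming the raw samples are stored as one record per funding round, the selection just described can be sketched as follows (field names such as Round_Type mirror the variable names of this chapter and are assumptions about the underlying storage, not the thesis's actual code):

```python
def preprocess(rows):
    """rows: list of dicts, one per funding round.
    Keep only Swiss equity rounds; drop non-informative fields."""
    drop = {"Round_Type", "Company_name", "Data_source"}
    kept = [r for r in rows
            if r["Round_Type"] == "Equity" and r["Location"] == "Switzerland"]
    # Round_Type is now constant, Company_name must stay anonymous,
    # and Data_source is assumed uncorrelated with the values.
    return [{k: v for k, v in r.items() if k not in drop} for r in kept]

# Illustrative records (not real data):
sample = [
    {"Round_Type": "Equity", "Location": "Switzerland",
     "Company_name": "A", "Data_source": "s1", "Pre_valuation": 1e6},
    {"Round_Type": "Convertible", "Location": "Switzerland",
     "Company_name": "B", "Data_source": "s2", "Pre_valuation": 2e6},
    {"Round_Type": "Equity", "Location": "Germany",
     "Company_name": "C", "Data_source": "s3", "Pre_valuation": 3e6},
]
print(len(preprocess(sample)))  # 1
```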
In Table 3.2 we show an overview of this starting data set (a legend is provided in Table 3.1),
while in Table 3.3 we provide an extended variables’ description.
Abbreviation Meaning
Cat Categorical
N Nominal
D Dichotomous
O Ordinal
I Interval
Num Numeric
C Continuous
Dis Discrete
Table 3.1: Legend of Table 3.2
Variable’s name Type Short description Nr. of groups % NA’s
Foundation_Year Num Dis Company’s foundation year / 0.00
Round_name Cat N Harmonized official round name 9 0.00
Industry Cat N Company’s industry 9 0.00
Stage Cat O Company’s development stage 7 61.54
Pre_valuation Num C Pre-money Valuation / 0.00
Prev_raised Num C Tot. funding previously raised / 0.00
Amount_raised Num C Size of the investment round / 0.00
Through_Investiere Num C Amount invested by Investiere / 54.89
Type_Lead_Investor Cat N Type of the main Investor 4 4.19
Profitable Cat D Is the company profitable? 2 62.24
Revenue Cat O, I Last 12 months revenues 6 59.79
Closing_Year Num Dis Year of round’s closure (10) 1.05
Still_operating Cat N Company’s present status 3 0.00
Employees Num Dis Nr. of employees / 82.87
Currency Cat N Funding’s currency 1 0.00
Location Cat N Company’s legal location 1 0.00
Table 3.2: Data set overview
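As an illustration, the "% NA's" column of Table 3.2 can be reproduced by a small helper of this kind, assuming each funding round is stored as a dictionary with missing values encoded as None (the variable names and rows below are illustrative, not the real data set):

```python
def na_percentages(rows):
    """Percentage of missing (None) values per variable, as in Table 3.2."""
    cols = rows[0].keys()
    n = len(rows)
    return {c: round(100 * sum(r[c] is None for r in rows) / n, 2) for c in cols}

# Four illustrative records: Stage is missing in 2 of 4, Industry in 1 of 4.
rows = [{"Stage": None, "Industry": "ICT"},
        {"Stage": "Growth", "Industry": "Biotech"},
        {"Stage": None, "Industry": "Medtech"},
        {"Stage": "Growth", "Industry": None}]
print(na_percentages(rows))  # {'Stage': 50.0, 'Industry': 25.0}
```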
Foundation_Year Year in which the company has been officially incorporated. Numerical, discrete variable.
Round name
Harmonized name of the round used in the official company documentation. Categorical, nominal, 9 groups:
− Pre-seed
− Seed round
− Series A Round
− Series B Round
− Series C Round
− Series D Round
− Series E Round
− Pre-Exit
− IPO
Industry
Area of Business according to the Swiss Venture Capital Report: Categorical, nominal, 9 groups:
− Biotech
− Cleantech
− Consumer Products
− Fintech
− Healthcare
− ICT
− Medtech
− Micro / Nano
− Other
Stage
Indicates the development stage of the Company, at the time of Closing. Categorical, ordinal, 7 groups
− Idea (0 samples)
− Prototyping
− Beta-Phase
− Clinical Trials
− First Clients
− Growth
− Internationalisation
Pre_valuation
Pre-Money Valuation: the value of a company just before that specific round of financing. When summed with Amount_raised, it gives the post-money valuation (Frei and Leleux, 2004). Unit of measure is CHF. Numerical, continuous
Prev_raised The sum of all funds the startup raised since incorporation until the moment just before closing that specific investment round. Unit of measure is CHF. Numerical, continuous
Through_investiere VC Investiere's tranche of the respective financing round. Unit of measure is CHF. Numerical, continuous
Type_Lead_Investor or (TLI)
Type classification of the lead Investor of the investment round. Categorical, nominal, 4 groups:
− Accelerator/incubator: accelerators and incubators are organizations helping start-ups attain success. Incubators usually offer dedicated office and development space to the start-ups for a set period of time, and a first grant or funding round to allow start-up’s incorporation and beginning of activities. Start-up accelerators tend to focus on providing mentorship, and resources to help the start-ups succeed, but usually tend not to offer dedicated office space. Accelerators and incubators usually get involved at early-stage. Some of them focus on a specific industry, market, technology, whereas others are generalists. Start-ups are usually admitted in batches, after a screening process (Isabelle, 2013).
− Private Angel: an angel investor (also known as a business angel, informal investor, angel funder, private investor, or seed investor) is an affluent individual who provides capital, advice and contacts to a start-up, usually in exchange for convertible debt or ownership equity. Unlike venture capitalists, they usually play an indirect role as advisors in the operations of the investee firm (Wong, Bhatia, and Freeman, 2009).
− Institutional Financial: an institutional investor is an organization that invests on behalf of its members. A financial investor invests in a business merely to maximize its financial returns, over a specified period of time. These investors often take board seats and add value by introducing co-founders to a larger network, or help in terms of strategy, hiring, financials and industry insights (Arping and Falconieri, 2009).
− Institutional Strategic: an institutional investor is an organization that invests on behalf of its members. Strategic investors are not only looking for a return on their capital, but for a ‘strategic’ scope: access to technology/assets, new market or target segment. They are more
patient to see returns on their investments than financial investors (Arping and Falconieri, 2009).
Profitable
It answers the following question: is the company profitable at the time of the investment round's closure (i.e. is it in the condition of yielding a financial profit or gain)? Categorical, dichotomous
− Yes: the company is profitable
− No: the company is not profitable
Revenue
The income generated in the last 12 months before the funding round, from sale of goods or services, or any other use of capital or assets, associated with the main operations of an organization before any costs or expenses are deducted. Also called sales, or (in the UK) turnover. Categorical, ordinal, interval, 6 groups:
− 0 - 50k
− 50k - 100k
− 100k - 500k
− 500k – 1M
− 1M – 5M
− >5M
Closing_Year
Year of closing of the investment round. We created two identical variables for this data: one is a numerical discrete variable, the second one is categorical ordinal, with 10 groups (we will later decide which variable is the most useful for our analysis):
− 2010
− 2011
− 2012
− 2013
− 2014
− 2015
− 2016
− 2017
− 2018
− 2019
Still_operating
Indicates the current3 status of the company. Categorical, nominal, 3 groups:
− Yes: the company is still operating (i.e. an active company)
− No: the company has been liquidated
− Exit: the company has been acquired by another company
Employees Number of employees of the start-up at the time of the funding round.
3 Last update: Oct 2019
Location
Country of the start-up’s registered office. Categorical, nominal, 1 group:
− Switzerland
Currency
Primary currency of the financing round. Categorical, nominal, 1 group:
− CHF
Table 3.3: Extended variables description
3.4 LOG TRANSFORMATION
By analysing the distributions of some continuous variables (Pre_valuation, Amount_raised,
Prev_raised, and Through_Investiere) we can state that they are all far from normality, implying
restrictions in the application of statistical methods that strictly assume normal distributions
(e.g. Pearson's correlation, ANOVA). By observing their distributions, the best transformation
we can apply is the natural logarithm. In the following graphs (Figures 3.1 and 3.2) we report
the significant improvement achieved thanks to this transformation4. Nevertheless, if we apply
it a second or third time (e.g. log(log(log(Pre_valuation)))), the additional improvement is
not significant.
If we now test the normality of these transformed variables, for example with the Shapiro-Wilk
test, we are still forced to reject the null hypothesis of normality. We can notice, in fact, from
the graphs in Figure 3.1, very long tails in the variables' distributions and some degree of
skewness. Actually, these tails could just be outlier cases: if we detect outliers with the R
function aq.plot, we obtain that 38.81% of the samples are outliers. Of course, such a large
proportion does not allow us to remove them now. Anyway, this is not an issue: linear regression
analysis does not assume normality for either predictors or outcome. The main role, instead, is
played by the distribution of the residuals. (The distribution of residuals, together with outlier
detection, will be examined for specific models during the regression analysis, paragraph 6.7.2.)
4 As the log(0) is undefined, we add 0.1 to all zero values in these continuous variables before applying the log transformation.
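The transformation and normality check described above can be sketched as follows. This is a minimal Python illustration on synthetic data (the thesis itself works in R and on confidential data): zeros are shifted to 0.1 before taking the natural log, and the Shapiro–Wilk test is then applied to both the raw and the transformed variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for Pre_valuation: roughly log-normal, plus a few zeros
pre_valuation = np.concatenate([rng.lognormal(mean=15, sigma=1.2, size=280),
                                np.zeros(6)])

# As log(0) is undefined, replace zeros with 0.1 before the log transformation
shifted = np.where(pre_valuation == 0, 0.1, pre_valuation)
l_pre_valuation = np.log(shifted)

# Shapiro-Wilk test: a small p-value leads us to reject the null of normality
w_raw, p_raw = stats.shapiro(pre_valuation)
w_log, p_log = stats.shapiro(l_pre_valuation)
print(f"raw: p = {p_raw:.2e}   log-transformed: p = {p_log:.2e}")
```

On heavily right-skewed data the raw variable is rejected decisively; after the log transformation the fit improves markedly, though (as in the thesis) the shifted zeros create a left tail that can still lead to rejection.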
Figure 3.1: On the left graphs we use log-log axis, base 10: (top) adaptive Kernel density estimation of pre-money valuation of a company at each round (Pre_valuation); (bottom) adaptive Kernel density estimation of funds raised before a round (Prev_raised). On the right, top and bottom: Kernel density estimation of the same variables, after their natural log transformation. Red lines indicate the median.
Figure 3.2: On the left graphs we use log-log axis, base 10: (top) adaptive Kernel density estimation of the amount raised at each round, in CHF (Amount_raised); (bottom) Kernel density estimation of the amount invested by the VC Investiere, in CHF (lThrough_investiere) at each round. On the right, top and bottom: Kernel density estimation of the same variables after their natural log transformation. Red lines indicate the median.
4 MULTIVARIATE DATA ANALYSIS
4.1 TREATMENT OF MISSING DATA
During this research project, we spent most of the time and effort on the data collection
process (paragraph 3.2). The aim of this phase was not only to collect as many samples as
possible, but also to replace missing data with the true values. After this long,
time-consuming process, we list the variables sorted by decreasing fraction of missing data:
Variable                    NA's
Employees                   0.83
Profitable                  0.62
Stage                       0.62
Revenue                     0.60
Through investiere (CHF)    0.55
Type_Lead_Investor          0.04
Closing_Year                0.01
Foundation Year             0.00
Industry                    0.00
Round_name                  0.00
Still_operating             0.00
Amount_raised               0.00
Pre_valuation               0.00
Prev_raised                 0.00
Country                     0.00
Currency                    0.00
We see that only 8 of the 20 variables (16 original + 4 log-transformed) still contain missing
values. For the purpose of our research, imputation of the missing data would be misleading and
unhelpful, due to the low ratio of available samples per variable. We therefore decide to keep
all NA's and all samples, in order to avoid loss or distortion of information.
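The missing-data summary above can be sketched as follows, using pandas on a hypothetical miniature data set (the real data are confidential, so the column values here are invented for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set mimicking the structure of the thesis data
df = pd.DataFrame({
    "Employees":     [12, np.nan, np.nan, np.nan, np.nan],
    "Revenue":       ["0-50k", np.nan, np.nan, "1M-5M", np.nan],
    "Pre_valuation": [2e6, 5e6, 1e7, 3e6, 8e6],
})

# Fraction of missing values per variable, sorted in decreasing order
na_fraction = df.isna().mean().sort_values(ascending=False).round(2)
print(na_fraction)
```

`df.isna().mean()` gives the per-column fraction of NA's directly, since the mean of a boolean mask equals the proportion of True values.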
4.2 CORRELATION ANALYSIS BETWEEN CONTINUOUS VARIABLES
4.2.1 Methodology
In order to analyse the correlations existing between the continuous variables in the data set,
we will adopt the following statistical tools:
A. Correlation matrix
B. Scatterplot
C. Boxplot
The correlation matrix will be calculated with Kendall's Tau method5. We generally
consider a correlation between two variables "low" if its absolute value is below 0.3,
"moderate" if it lies between 0.3 and 0.7, and "strong" if it is above 0.7.
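A small illustration of Kendall's Tau and of the strength thresholds above (a Python sketch on made-up ranks; `scipy.stats.kendalltau` is assumed here as the counterpart of the R routine used in the thesis):

```python
from scipy.stats import kendalltau

def strength(tau: float) -> str:
    """Classify |tau| using the thresholds adopted in the text."""
    t = abs(tau)
    if t < 0.3:
        return "low"
    if t <= 0.7:
        return "moderate"
    return "strong"

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 3, 2, 5, 4, 7, 6, 8]   # mostly concordant pairs, a few swaps
tau, p_value = kendalltau(x, y)
print(round(tau, 3), strength(tau))   # tau = (25 - 3) / 28 ≈ 0.786 -> "strong"
```

With 28 pairs in total, 25 concordant and 3 discordant, tau = (25 − 3)/28 ≈ 0.786, which the thresholds classify as strong.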
Scatterplots graphically show the linear fit of each pair of variables. The regression line and
correlation coefficients allow us to distinguish which pairs of variables show an interesting,
significant correlation and which do not. Scatterplot analysis is useful in preparation for the
Multiple Regression Analysis (Chapter 6): MLR requires the relationships between the
independent and dependent variables to be linear, and this linearity assumption is best tested
and visualized through scatterplots.
Boxplots make it easy to identify outliers and to schematically visualize the distribution of
each variable. For normally distributed data, the values within ±2.698 sigma stay within the min
and max whisker lines of each boxplot:
MIN = max[MIN, Q1 - 1.5*(Q3-Q1)]
MAX = min[MAX, Q3 + 1.5*(Q3-Q1)]
The remaining extreme values are identified as outliers and represented in the boxplot beyond
the whiskers (we are neither interested in, nor allowed to, identifying the companies to which
they correspond).
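The whisker definitions above can be sketched as follows (a Python illustration on invented numbers):

```python
import numpy as np

def boxplot_whiskers(values):
    """Whisker positions as defined in the text:
    MIN = max[min(x), Q1 - 1.5*IQR],  MAX = min[max(x), Q3 + 1.5*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo = max(x.min(), q1 - 1.5 * iqr)
    hi = min(x.max(), q3 + 1.5 * iqr)
    outliers = x[(x < lo) | (x > hi)]   # values drawn beyond the whiskers
    return lo, hi, outliers

data = [10, 12, 13, 14, 15, 15, 16, 18, 40]   # one extreme value
lo, hi, out = boxplot_whiskers(data)
print(lo, hi, out)   # the value 40 falls beyond the upper whisker
```

Here Q1 = 13, Q3 = 16, so the upper whisker sits at min(40, 16 + 4.5) = 20.5 and only the value 40 is flagged as an outlier.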
As we have just seen in the previous paragraph, there are two continuous variables with a very
high fraction of missing data, while the other four continuous variables have approximately
100% of their values available. Employees – the number of employees of the start-up at the time
of a specific round – has 82.9% NA's, while the continuous transformed variable
lThrough_investiere – the amount invested in that specific financing round by the VC Investiere
– has 55.1% missing values. Keeping these variables in our further analysis would therefore
make us neglect over 82% of our samples. Besides, studying the relationship between
lThrough_investiere and the other variables is not meaningful for all Swiss investment rounds,
but only for those in which Investiere was actually involved (96 samples out of 286, i.e.
33.6%). For these reasons, we now conduct a separate analysis for these two continuous
variables, and will then omit them from our data set in order to take all available rounds into
account.
5 As we noticed in paragraph 3.4, none of the continuous variables follows a normal distribution, as shown by the
Shapiro–Wilk test. We therefore calculate correlation values with Kendall's Tau method instead of Pearson's, which
assumes normality.
4.2.2 lThrough_Investiere analysis
We now focus our analysis on the rounds for which we know whether, and how much, the VC
Investiere contributed. In Figure 4.1, we represent the frequency distribution of the natural
log of the amount invested by the VC Investiere, in CHF, at each round (lThrough_investiere),
omitting all its NA values (55%). We clearly see two relative maxima in this distribution; the
lower one corresponds to rounds in which Investiere did not invest. In Figure 4.2, instead, we
consider only the rounds in which Investiere invested (96 out of 129).
Figure 4.1: On the x-axis: the natural log of the amount invested by the VC Investiere in CHF (lThrough_investiere) at each round. On the top of the figure, a boxplot representation of this variable. On the y-axis: the Kernel density estimation (higher density for higher probability of seeing a point at that location). N is the number of samples, and Bandwidth is the parameter controlling the smoothness of the curve (higher values make smoother curves), and it equals the standard deviation of the kernel used. The red line indicates the median.
When measuring the relationships among continuous variables with the Kendall’s Tau method,
the correlations involving lThrough_investiere are not significant or very low. This changes if we
calculate correlations by considering only the rounds in which Investiere actually invested (96
samples). The corresponding correlation matrix is in figure 4.3, where non-significant values
(significance level is 0.05) are hidden by black crosses.
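The masking of non-significant entries in such a matrix can be sketched as follows. This is a Python illustration on synthetic data: the variable names are mere placeholders for the thesis variables, and NaN plays the role of the black crosses in the figures.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_matrix(columns, alpha=0.05):
    """Pairwise Kendall's Tau; entries with p >= alpha are masked with NaN,
    mirroring the black crosses in the correlation matrix figures."""
    names = list(columns)
    k = len(names)
    mat = np.full((k, k), np.nan)
    for i in range(k):
        mat[i, i] = 1.0
        for j in range(i + 1, k):
            tau, p = kendalltau(columns[names[i]], columns[names[j]])
            if p < alpha:
                mat[i, j] = mat[j, i] = tau
    return names, mat

rng = np.random.default_rng(1)
a = rng.normal(size=60)
cols = {"lAmount_raised": a,
        "lPre_valuation": a + 0.5 * rng.normal(size=60),  # strongly related
        "Foundation_Year": rng.normal(size=60)}            # independent noise
names, mat = kendall_matrix(cols)
print(np.round(mat, 2))
```

The strongly related pair survives the significance filter, while correlations of the pure-noise column are typically masked out.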
Except for Foundation_year, all the other variables now have a significant, positive correlation
with lThrough_investiere. We now analyse them in more detail with the following scatterplots
(considering only the rounds in which we know that Investiere invested money).
In Figure 4.4, the correlation coefficient is r = 0.403 (moderate) and significant (p << 0.05).
The trend is visible but noisy. The explanation of this correlation lies in the moderate
correlation between lAmount_raised and lPre_valuation. In fact, lThrough_investiere is a value
in the interval [0; lAmount_raised], and its relationship with lAmount_raised is represented in
Figure 4.5. Unsurprisingly, the companies in which Investiere invested more also raised more
money. So, the real reason behind the previous trend and correlation (Figure 4.4) is that
lThrough_investiere is moderately correlated with lAmount_raised, which is in turn
moderately-to-strongly correlated with lPre_valuation, as shown in the matrix (R = 0.57,
Figure 4.3).
Figure 4.2: On the x-axis: the natural log of the amount invested by the VC Investiere in CHF (lThrough_investiere). Here we only consider rounds in which Investiere actually invested. On the top of the figure, a boxplot representation of this variable. On the y-axis: the Kernel density estimation (higher density for higher probability of seeing a point at that location). Bandwidth is the parameter controlling the smoothness of the curve (higher values make smoother curves), and it equals the standard deviation of the kernel used. The red line indicates the median.
Figure 4.3: Correlation matrix summarizing all correlations among continuous variables. Only rounds in which the VC Investiere participated are considered. Following the legend, numbers in a colour tending to red suggest negative correlations, while numbers in a colour tending to blue indicate positive correlations. Insignificant values are hidden by a black cross (significance level threshold is p-value= 0.05).
Figure 4.4: Scatterplot of the natural log of Pre-money valuation in CHF (lPre_valuation) and the natural log of the amount invested by the VC Investiere in CHF (lThrough_investiere), considering only rounds in which the VC Investiere participated. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
We now want to analyse the relationship between lThrough_investiere and lPrev_raised
(Figure 4.6). Here too, the trend is disturbed by the significant proportion of samples with
zero money previously raised. The correlation remains low even if we exclude this proportion of
samples (Figure 4.7).
Figure 4.5: Scatterplot of the log of the size of the round (lAmount_raised) and the log of the amount invested by Investiere (lThrough_investiere), considering only rounds in which the VC Investiere participated. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
Figure 4.6: Scatterplot of the log of the funds previously raised (lPrev_raised) and the log of the amount invested by Investiere (lThrough_investiere), considering only rounds in which the VC participated. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
It is interesting to notice that in 57.29% of the cases in which Investiere invested, this was
the first investment received by that specific start-up. This is in accordance with what we read
on Investiere's website, F.A.Q. page (https://www.investiere.ch/startup-vc-investment/):
"When do you invest?
We invest in early stage as well as growth stage rounds. A pitch deck or idea without validation
is not sufficient. The right timing for a funding round can vary depending on the industry or
other factors but generally being able to show market traction, proof of technology and a
complete and well-functioning core team are decisive factors."
So, there is no doubt that, for Investiere, zero money previously raised is not an obstacle to
its investment commitment. The graph also tells us that, if the start-up has already raised
money in the past, the amount then invested by Investiere tends to increase slightly, with a
significant correlation coefficient of 0.27.
We will not show the relationship between lThrough_investiere and Foundation Year, as it is low
and not significant. Nevertheless, we know that Investiere invested only in companies founded in
the last 15 years, except for one outlier. The histogram (Figure 4.8) represents the number of
investments made by Investiere, broken down by the Foundation Year of the start-up. No
particular trend can be observed; the distribution instead roughly resembles a normal one.
Figure 4.7: Scatterplot of the log of the funds previously raised (lPrev_raised) and the log of the amount invested by Investiere (lThrough_investiere), considering only rounds in which the VC participated and the start-up had already raised funds in the past. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
We now plot lThrough_investiere against the Closing Year of the round (Figure 4.9):
Figure 4.8: Histogram showing the number of samples sharing the same Foundation Year, by only considering rounds in which the VC Investiere invested.
Figure 4.9: Scatterplot of Closing Year and the amount invested by the VC (lThrough_investiere), considering only rounds in which Investiere participated. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
In this case, a trend is evident, with a moderate correlation of 0.44. We can therefore state
that, over the years, Investiere has on average been investing more in each deal. The histogram
(Figure 4.10) adds that the number of deals is also increasing over time (2019 underestimates
the real value, because the year is still ongoing).
Figure 4.10: Histogram showing the number of samples sharing the same Closing Year, by only considering rounds in which the VC Investiere invested.
4.2.3 Employees analysis
We now proceed with the variable Employees – indicating the number of people employed in the
company at the time of a specific round – in the same way we did for lThrough_investiere. This
time, as 81% of the Employees values are NA, we only take into account 48 complete samples of
our data set. Figure 4.11 shows its distribution (we removed four extreme outliers with more
than 100 employees). It has a very long right tail, making the mean much higher than the median
and the mode. In Figure 4.12 we show the correlation matrix, considering all 48 available
complete samples.
We plot all the moderate correlations between Employees and the other variables in Figures
4.13 - 4.16. The strongest correlation is between Employees and lPre_valuation. This suggests
that, with more data available, Employees would be a relevant and significant predictor in
determining the valuation of a start-up. Nevertheless, having more employees does not
necessarily mean that the start-up has previously raised more funds (moderate-low correlation).
We also observe a moderate correlation between Employees and lAmount_raised, and between
Employees and Closing_year. There is no correlation, instead, between Employees and either
Foundation Year or lThrough_investiere.
Figure 4.11: Employees distribution (boxplot and probability density function; N = 46, bandwidth = 5.778). The red line indicates the median.
Figure 4.12: Correlation matrix summarizing all correlations among continuous variables. Only complete samples are here considered. Following the legend, numbers in a colour tending to red suggest negative correlations, while numbers in a colour tending to blue indicate positive correlations. Insignificant values are hidden by a black cross (p-value above 0.05).
Figure 4.13: Scatterplot of the log of the amount raised in the round (lAmount_raised) and the number of Employees, considering only rounds in which the number of Employees is known. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
Figure 4.14: Scatterplot of the log of the funds previously raised (lPrev_raised) and the number of Employees, considering only rounds in which the number of Employees is known, and lPrev_raised is above zero. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
Figure 4.15: Scatterplot of the log of Pre-money valuation (lPre_valuation) and the number of Employees, considering only rounds in which the number of Employees is known. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
Figure 4.16: Scatterplot of Closing Year of the round and number of Employees, considering only rounds in which the number of Employees is known. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
4.2.4 Analysis of the entire data set
After separately analysing the impacts of lThrough_investiere and Employees on the other
variables, there is no need to keep them in our further analysis: we want to consider all rounds
in our dataset, regardless of whether we have information about Investiere's participation or
the number of employees. If we kept these variables, we would exclude 82.9% of the samples (the
percentage of Employees' missing values). Including now all samples of our data set, we
visualize the distribution of each continuous variable (Figures 4.17 - 4.21):
Figure 4.17: Distribution of the log of pre-money valuation (lPre_valuation), via boxplot and probability density function. The red line indicates the median.
Figure 4.18: Closing_Year distribution (boxplot and probability density function). The red line indicates the median.
(lPre_valuation: N = 286, bandwidth = 0.3031)
(Closing_Year: N = 283, bandwidth = 0.7059)
lPre_valuation and lAmount_raised show many outliers beyond their MAX values6, creating the
long tails in the distribution curves. The correlation matrix, including all complete samples
for the selected continuous variables, together with a summary overview of the relationships, is
in Figure 4.22. As the Shapiro–Wilk test makes us reject the null hypothesis of normality, we
continue using Kendall's Tau method.
Figure 4.19: Distribution of the log of funds previously raised (lPrev_raised), via boxplot and probability density function. The red line indicates the median.
Figure 4.20: Distribution of the log of the amount raised in the round (lAmount_raised), via boxplot and probability density function. The red line indicates the median.
6 MAX=min[MAX, Q3+1.5*(Q3-Q1)]
(lPrev_raised: N = 286, bandwidth = 2.057)
(lAmount_raised: N = 286, bandwidth = 0.295)
All variables tend toward a normal distribution, except for Closing_Year (a clear growing
trend; 2019 is still in progress) and lPrev_raised (which has two relative maxima, because of
the conspicuous number of zero values). In any case, normality of the variables is an assumption
of neither Kendall's Tau nor MLR.
Figure 4.21: Distribution of the Foundation Year of samples (boxplot and probability density function). The red line indicates the median.
Figure 4.22: Correlation matrix summarizing all correlations among continuous variables. Following the legend, numbers in a colour tending to red suggest negative correlations, while numbers in a colour tending to blue indicate positive correlations. Insignificant values are hidden by a black cross (p-value above 0.05).
(Foundation_Year: N = 286, bandwidth = 1.084)
The correlation between lAmount_raised and lPre_valuation is the strongest among our continuous
variables (0.59), and it is highly significant (Figure 4.23). This correlation is expected:
otherwise, raising large investments would cause start-ups enormous dilution, not sustainable
for further growth. Nevertheless, considered alone, this factor can be misleading for
companies. It could tempt a company to show a higher financial need in order to raise more
money, and thereby obtain a higher valuation. This strategy is not advisable, as it is likely to
lead the start-up to over-dilution and lower credibility, if not properly justified. So, every
company will have to carefully evaluate the combination of factors influencing its pre-money
valuation (which will be fully revealed in Chapter 6), and weigh carefully the trade-off between
the amount raised (and therefore the lPre_valuation obtained) and the consequent dilution.
It is followed by the correlation between Foundation Year and Closing Year (0.47): the youngest
companies have the most recent investments, and Foundation_Year <= Closing_Year always holds.
We find several outliers in this trend.
Between lPre_valuation and lPrev_raised (Figure 4.24) there is also a moderate, significant
correlation (0.43). We now look at their trend considering only samples with lPrev_raised > 0
(the correlation grows to 0.5). lPrev_raised is certainly a main factor in determining the
valuation of a company (we measure its impact in more detail in Chapter 6). The range of
lPre_valuation for the excluded rounds (those with lPrev_raised = 0), on the other hand, is wide
and contains many outliers.
Figure 4.23: Scatterplot of the log of Pre-money valuation (lPre_valuation) and the log of the amount raised (lAmount_raised), considering the entire data set. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
Figure 4.24: The Figure represents the scatterplot of lPre_valuation and lPrev_raised, considering all samples with lPrev_raised above zero. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. The 95% confidence interval is displayed by the grey area.
We now plot the relationship between lAmount_raised and lPrev_raised (Figure 4.25), as both
variables are strongly correlated with lPre_valuation.
Figure 4.25: Scatterplot of the natural log of the funds raised by the company before a specific round (x-axis= lPrev_raised) and the natural log of the amount raised at that round (y-axis= lAmount_raised), by considering only rounds with lPrev_raised above zero. R is the Kendall’s Tau correlation, while p is the p-value indicating the significance of their correlation. On the top of the figure, the equation of the plotted regression line is shown, where y=y-axis variable, and x=x-axis variable. The 95% confidence interval is displayed by the grey area.
Indeed, they show a moderate, significant correlation: the more a company has previously
raised, the more it is likely to raise. This is perfectly normal: if this is not the first
financing round of the company, the company is most probably at a later development stage and
has been able to bring to the table more proof of product/service validation, reducing the risk
for investors. At the same time, a low lPrev_raised does not prevent a company from raising
large investments. In particular, for companies with lPrev_raised = 0, lAmount_raised has the
distribution shown in Figure 4.26. The volatility is extremely high and five outliers are visible.
Figure 4.26: Distribution of the log of the amount raised (lAmount_raised) for samples with no funds previously raised, via boxplot and probability density function. The red line indicates the median.
All in all, lPrev_raised is an important factor determining lAmount_raised, which in turn plays
an even more relevant role in influencing lPre_valuation (highest existing correlation). As
these three variables are mutually correlated, we add a three-variable bubble plot
(Figure 4.27) to offer a final overview of their relationships (lPre_valuation on the y-axis,
lAmount_raised on the x-axis, circle size representing lPrev_raised).
Finally, we find a highly significant but low positive correlation between Closing_year and
lPre_valuation (0.16), while between Foundation Year and lPrev_raised there is a low, negative
correlation: the older the company, the higher lPrev_raised. This is also obvious and expected.
All the other pairwise relationships among variables can be neglected (extremely low or
non-significant correlations).
(lAmount_raised: N = 174, bandwidth = 0.362)
Figure 4.27: Three-dimensional representation: the log of pre-money valuation (lPre_valuation) on the ordinate, the log of the amount raised (lAmount_raised) on the abscissa, and the log of funds previously raised (lPrev_raised) encoded by colour and point size.
4.3 CORRELATION ANALYSIS BETWEEN CATEGORICAL VARIABLES
In this section, we investigate the correlations existing between the categorical variables.
4.3.1 Pooling levels together
In order to dive deeper into this analysis and obtain significant results, we first pool
appropriate levels together, to make sure that, when comparing variables in pairs, each level
contains at least 5 samples. The final distribution of each variable is represented in the next
Figures 4.28 a) and b).
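The pooling step can be sketched as follows. This is a Python illustration on toy round names; the pooled label "Other" and the threshold handling are assumptions for the example (the thesis pools semantically related levels, e.g. Series D through IPO into one group, rather than always using a generic label):

```python
from collections import Counter

def pool_rare_levels(values, min_count=5, pooled_label="Other"):
    """Merge levels with fewer than `min_count` samples into one pooled level,
    so every level entering the contingency analysis has at least 5 samples."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else pooled_label for v in values]

rounds = ["Seed"] * 6 + ["Series A"] * 5 + ["Series D"] * 2 + ["IPO"] * 1
pooled = pool_rare_levels(rounds)
print(Counter(pooled))   # the two rare levels are merged into one group
```

After pooling, the two rare levels ("Series D" and "IPO", with 2 and 1 samples) are merged into a single group of 3, while the frequent levels are left untouched.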
Figure 4.28: a) Overview of the categorical variables Round Name (top) and Industry (bottom). On the top of each histogram, the title refers to the name of the displayed variable. On the abscissa are written the names of the groups belonging to that categorical variable, while the ordinate indicates the absolute number of samples in the data set belonging to that specific group. The total number of samples for each categorical variable changes between variables, because of missing values (see paragraph 4.1).
(Round name groups: Pre-seed, Seed Round, Series A, Series B, Series C, From Series D to IPO. Industry groups: Biotech, Fintech, Medtech/Healthcare, ICT, Cleantech, Micro/Nano, Consumer Products, Other.)
Figure 4.28: b) Overview of the categorical variables Revenue (top-left), Still_operating (top-right), TLI (bottom-left), and Stage (bottom-right). On the top of each histogram, the title refers to the name of the displayed variable. On the abscissa are written the names of the groups belonging to that categorical variable, while the ordinate indicates the absolute number of samples in the data set belonging to that specific group. The total number of samples for each categorical variable changes between variables, because of missing values (see paragraph 4.1).
(Revenue groups: 0-50k, 50k-100k, 100k-500k, 500k-1M, >1M. Still operating groups: exit, no, yes. Stage groups: Prototyping, Beta/Clinical Trials, First Clients, Growth/International. Type Lead Investor groups: Acc/Inc/PA, Inst. Financial, Inst. Strategic.)
4.3.2 Methodology
After the pre-processing phase, we apply the following methods:
• Contingency Analysis (or Chi-square independence test)
• Cramer’s V
• Contingency coefficient (or Pearson’s coefficient)
The Contingency Analysis tests the null hypothesis that the two considered variables are
mutually independent, i.e. that knowledge of one does not help predict the value of the other.
If, on the other hand, the p-value is below the significance level (0.05), we reject the null
hypothesis and conclude that there is a statistically significant relationship between the two
categorical variables, that is, they are not independent. The test makes use of contingency
tables, which is why it is known as 'Contingency Analysis'.
If we reject independence, Cramer's V and the Contingency coefficient provide measures of the
correlation existing between two categorical variables. As for continuous variables, we
consider a coefficient in the range [0, 0.3] as weak, in [0.3, 0.7] as moderate, and above 0.7
as strong.
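The three measures above can be computed from a two-way frequency table as follows (a Python sketch on an invented Stage x Revenue count table; the numbers are purely illustrative, not the thesis data):

```python
import numpy as np
from scipy.stats import chi2_contingency

def categorical_association(table):
    """Chi-square independence test plus Cramer's V and the
    contingency coefficient C for a two-way frequency table."""
    table = np.asarray(table, dtype=float)
    chi2, p, dof, expected = chi2_contingency(table)
    n = table.sum()
    r, c = table.shape
    v = np.sqrt(chi2 / (n * (min(r, c) - 1)))   # Cramer's V
    cc = np.sqrt(chi2 / (chi2 + n))             # contingency coefficient C
    return chi2, p, v, cc

# Hypothetical Stage x Revenue counts: the heavy diagonal encodes the fact
# that later stages tend to come with higher revenue groups
table = [[30,  5,  1],
         [10, 20,  5],
         [ 2,  8, 25]]
chi2, p, v, cc = categorical_association(table)
print(round(v, 3), round(cc, 3), p < 0.05)
```

With such a diagonal-heavy table the test rejects independence decisively, and both V and C land in the moderate range, echoing the Stage–Revenue result in Table 4.1.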
4.3.3 Results
Legend: V = Cramer's V; C = Contingency coefficient; "independent" = no significant correlation revealed.

                  Round_name        Industry          Stage             Type_Investor     Revenue           Still_operating
Round_name        1
Industry          independent       1
Stage             V=0.392 C=0.485   V=0.401 C=0.570   1
Type_Investor     V=0.266 C=0.352   V=0.262 C=0.348   independent       1
Revenue           V=0.323 C=0.416   V=0.358 C=0.625   V=0.526 C=0.674   independent       1
Still_operating   V=0.204 C=0.277   independent       V=0.359 C=0.453   V=0.242 C=0.324   V=0.289 C=0.378   1
Table 4.1: The Table shows the relationships existing between categorical variables. The cell "independent" means that no significant correlation has been revealed. In all other cases, V and C indicate the Cramer’s V coefficient and the Contingency coefficient, respectively.
Final results are summarized in Table 4.1: the correlations are moderate or low, and the
strongest is between Revenue and Stage. That was already evident from the data, and it makes
logical sense (e.g. there cannot be significant revenues at the Prototyping stage, while they
are necessary to be in the Growth/Internationalisation stage). We can also confirm that Industry
is independent of the Round name (all round names can apply to any industry). We underline that
belonging to a particular Industry does not influence the success of the start-up (independence
from Still_operating), but it does influence its revenue (see the distribution in Figure 4.29).
Figure 4.29: Distribution of Revenue given the Industry. On the ordinate, the Industry (a bar for each Industry group). On the abscissa, how many samples (in percentage) of that Industry group have a certain interval of Revenue. Following the legend, each colour section of the bars corresponds to a certain Revenue group.
We expected a correlation, to some extent, between the Type of Lead Investor and the variables
Stage and Revenue, but this is not confirmed by the numbers. So, we cannot say that a particular
type of investor invests mainly in start-ups at a particular stage or with a particular range of
revenues. Instead, all types of investors invest in a diversified portfolio of companies, as we
will see in more detail in the next paragraph 1.5.2.4.
A moderate correlation is identified between Stage and Industry, but this is just a chance
artefact of our data set. Of course, all industries are populated by start-ups at all stages.
Finally, we could think that high revenues would be an important factor in determining the
success of a start-up (still_operating), but from our data set we can only state the existence of a
low correlation. We will make a specific analysis to investigate the impact of different variables
in the future success of Swiss start-ups, Chapter 5.
4.4 CORRELATION ANALYSIS BETWEEN CONTINUOUS AND CATEGORICAL VARIABLES
4.4.1 Methodology
There are several methods to understand whether a continuous and a categorical variable are
significantly correlated:
• Point-biserial correlation: the categorical variable must be dichotomous, which is never
our case;
• Logistic regression: the dependent variable must be binomial, which is not our case
(lPre_valuation is continuous);
• Boxplot analysis: see results in the upcoming paragraph;
• ANOVA and ANCOVA: their assumptions of normality are not respected in our data set,
as we saw in paragraph 3.4;
• Kruskal-Wallis H-test: a non-parametric alternative to ANOVA. It does not assume the
data come from a particular distribution, and we decide to use the H-test precisely
because the assumptions for ANOVA (such as normality) are not met. It is
sometimes called the “one-way ANOVA on ranks”, as the ranks of the data values,
rather than the actual data points, are used in the test. The test determines whether the
means of two or more groups are significantly different. The test statistic used is
called “the H statistic”, and the hypotheses for the test are:
o H0: the population means are equal.
o H1: the population means are not equal.
We reject H0 if the adjusted p-value, calculated through the default “holm” method, is
below the threshold of 0.05. However, this test alone does not tell us which groups
differ. To know that, we run a post hoc pairwise Wilcoxon test and comment on its
results. We therefore adopt this method; results are shown in the upcoming
paragraph.
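The procedure described above can be sketched as follows. This is an illustrative Python stand-in for the R workflow used in the thesis; the function and data names are ours:

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def kw_with_posthoc(groups):
    """Kruskal-Wallis H-test across all groups, followed by pairwise
    rank-sum tests whose p-values are Holm-adjusted (step-down)."""
    h_stat, kw_p = kruskal(*groups.values())
    pairs = list(combinations(groups, 2))
    raw = [mannwhitneyu(groups[a], groups[b]).pvalue for a, b in pairs]
    # Holm adjustment: the i-th smallest raw p-value is multiplied by
    # (m - i), capped at 1, and made monotonically non-decreasing.
    m = len(raw)
    order = sorted(range(m), key=raw.__getitem__)
    adjusted = [0.0] * m
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, min(1.0, (m - rank) * raw[idx]))
        adjusted[idx] = running
    return kw_p, {pairs[i]: adjusted[i] for i in range(m)}
```

A pair of groups is declared significantly different when its adjusted p-value falls below 0.05, mirroring the threshold used throughout this chapter.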
4.4.2 Results
We now plot the dependent variable lPre_valuation against all the considered categorical variables.
In each graph, we also show the result of the Kruskal-Wallis test.
4.4.2.1 Round_name
The graph in Figure 4.30 shows a strong correlation, which is straightforward: the round name is usually assigned based on the pre-money valuation. So, unsurprisingly, later rounds have higher lPre_valuation. The boxplot of “From Series D to IPO” is the highest one but also the tallest, so it has the highest volatility in pre-money valuation. All groups have some outliers and tend to be left-skewed (the tail of the distribution is longer on the left-hand side than on the right-hand side, and the median is closer to the third quartile than to the first one). Finally, the population mean of each group differs significantly from that of every other group.
Figure 4.30: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Round name. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
4.4.2.2 Industry
In Figure 4.31 we see the boxplots of lPre_valuation for each Industry group. Some Industries are more volatile than others: ICT shows the widest range and several outliers, while the Cleantech and Consumer Products ranges are much more restricted. MedTech/Healthcare shows one particularly extreme case. Consumer Products and Others are strongly left-skewed, which means that the 3rd and 4th quartiles have a more restricted range than the first two. We do not obtain significant mean differences among groups, so we pool Industries with fewer than 20 samples into the group Others, to see if we obtain different results. Figure 4.32 shows the resulting boxplots. Also in this case, the group means are not significantly different from one another. So, we presume that adding this explanatory variable to our regression model (see Chapter 6) will bring no advantage.
Figure 4.31: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Industry. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.32: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the industry, after we pooled together the groups having less than 20 samples in the Industry group Other. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
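The pooling used for Figure 4.32 (collapsing Industries with fewer than 20 samples into one residual group) amounts to a simple relabelling of the categorical column. A hypothetical sketch in Python (the thesis does this in R; names are illustrative):

```python
from collections import Counter

def pool_small_groups(labels, min_count=20, pooled="Others"):
    """Relabel every category with fewer than `min_count` samples
    into a single pooled category."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else pooled for lab in labels]
```

The same relabelling is reused later for the Stage variable (Prototyping and Beta-Phase/Clinical Trials pooled into Early-stage) and for Revenue, before re-running the Kruskal-Wallis test on the pooled groups.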
4.4.2.3 Stage
The relationship with the variable Stage is represented in Figure 4.33. For this variable we have
61.7% missing values, so inserting it in our regression model would make us lose the majority
of samples in our data set. For this reason, it is even more important to analyse its relationship
with the dependent variable lPre_valuation separately. Prototyping is the most volatile group,
and it is right-skewed. Beta-phase/Clinical Trials is also right-skewed (the 50% of these samples
having the highest lPre_valuation lie in a wider range). Although the high volatility involves all
groups, we can clearly see a growing trend: the later the stage of a start-up, the higher its
lPre_valuation. Through the Wilcoxon test, we can state that the means of the following pairs
of groups are statistically significantly different:
• Prototyping and Growth/International
• Beta-Phase/Clinical Trials and First Clients
• Beta-Phase/Clinical Trials and Growth/International
• First Clients and Growth/International
Figure 4.33: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Stage. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
As Prototyping has only 13 samples and there is no significant difference between this group and the Beta-Phase/Clinical Trials stage, we now try to pool them together in a group called “Early-stage” and test whether there is a significant difference with First Clients and with Growth/International, which we now rename, for coherence, as “Later-stage”. We obtain Figure 4.34 and significant mean differences between all groups. All this makes us maintain this new structure of the variable and state that Stage is a potentially useful predictor of lPre_valuation.
Figure 4.34: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Stage, after we pooled together the groups Prototyping and Beta-Phase/Clinical Trials Stages in the new group Early-stage. For coherence, the group Growth/International is here renamed as Later-Stage. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
4.4.2.4 Type Lead Investor (TLI)
The graph in Figure 4.35 for TLI is very interesting: the trend is evident and highly significant.
The mean valuation obtained in rounds involving Accelerators/incubators is much lower than in all
the other ones. As expected, Private Angels sit in between accelerators and Institutional
investors. Is their valuation lower because they only invest in early-stage start-ups? The
answer is shown in Figure 4.36.
Except for Accelerators/incubators, all other types of investors show a differentiated portfolio of
rounds in terms of start-up Stage. Institutional Financial shows many more samples and
therefore a wider range of offered valuations compared to the other groups, while Institutional
Strategic has the highest mean lPre_valuation.
Nevertheless, for a fair comparison, we need to verify whether these significant mean differences
hold even when separating rounds by the company's development stage. In fact, it
could be that Private Angels offer lower pre-valuations on average only because they
mainly invest in early-stage start-ups, while Institutional Strategic investors only invest in later-stage
companies. In the following pie charts and histogram (Figures 4.37 – 4.38), we represent
the distribution of rounds' Stages across the different Types of investors. For example, we note
that Private Angel investments involve early-stage start-ups (Prototyping + Beta-Phase + Clinical
Trials) in a plurality of cases (43%). Anyway, the lower valuation offered by Private Angels
cannot be attributed merely to the fact that they mainly invest in early-stage start-ups.
Instead, a more comprehensive reason is that they are willing to take higher risk than the other
players, and therefore ask for higher returns on investment.
Figure 4.35: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type of Lead Investor. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.36: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type_Lead_Investor (TLI) and the Stage. Given a TLI, a different boxplot of lPre_valuation is created for each Stage (following the legend, a colour is associated to each Stage). Samples with unknown investor have been excluded from the graph. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.37: Distribution of the Type of Lead Investor (TLI) across Stages. On the abscissa we read the Stage (a bar for each Stage group). On the ordinate we read (in percentage) the proportion of samples in that Stage group belonging to a certain TLI group. Following the legend, each colour section of the bars corresponds to a certain TLI group.
Figures 4.39 – 4.41 confirm that all types of investors (except for Accelerators/incubators) show
a well-diversified portfolio of start-ups in terms of development stage. Nevertheless, the mean
differences among groups remain significant only when considering rounds of:
• Early-stage start-ups, between all types of investors;
• First Clients stage start-ups, just between Private Angels and Inst. Financials (one-sided
test).
In those cases, the type of investor involved in the round makes a significant impact on
lPre_valuation. As the majority of rounds involving Private Angels relate to early-stage
start-ups (but not only), the overall outcome is a significant mean difference among the three
types of investors.
Figure 4.38: These pies show the distribution of the variable Stage for each Type of Lead Investor (TLI). In a) we consider only samples in which the TLI is Institutional Financial, in b) only Private Angel, and in c) only Institutional Strategic. Following the legends, each colour section of the pies corresponds to a certain Stage group, and its proportion of samples (in percentage) is written inside the section.
How can we interpret this? Private angels are known to be willing to take more risk than
institutional investors, and to impose fewer constraints on companies. On the other side, they
require a higher return on their investment, by investing at a relatively low valuation. About the
overall highest valuations offered by strategic investors, we have to remember that they are
called “strategic” because they invest in virtue of a particular strategic interest they have in a specific
start-up, an interest that other investors (strategic or not) might not have. Examples
of strategic reasons behind a start-up investment are: exploitation of the developed technology,
IP rights, complementary products, control of competition, reaching new customer segments or
new markets, access to know-how or specific resources, etc. That is why their valuations of
companies are higher than those of Institutional Financials, who instead do not take any strategic
advantage out of the investments.
Thinking about our upcoming regression analysis, based on this information we expect
Type_Lead_investor to have a relevant impact on lPre_valuation when considering its
interaction with the variable Stage.
Figure 4.39: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type_Lead_Investor (TLI), considering only Early-stage start-up rounds. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.40: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type_Lead_Investor (TLI), considering only start-up rounds at the stage First Clients. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.41: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Type of Lead Investor, considering only Later-stage start-up rounds. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
4.4.2.5 Revenue
The Revenue variable has 60% NA values. From the representation in Figure 4.42 we can
see a shy positive trend. Some of these groups' means differ significantly (one-sided test, as we
assume a growing trend):
• 0 – 50k and 500k – 1M
• 0 – 50k and >1M
• 50k – 100k and >1M
• 100k – 500k and >1M
To keep only significant differences, we pool groups together. The final Revenue structure
includes 3 levels: 0 – 50k, 50k – 1M, >1M. The resulting boxplots are shown in Figure 4.43.
Figure 4.42: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Revenue. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
The Kruskal-Wallis test now indicates higher significance in mean differences (the p-value is now 0.008 instead of 0.014). Still, we cannot reject the null hypothesis between 0 – 50k and 50k – 1M, but we can between >1M and the other two groups. Having fewer NAs would improve the precision of our results. For these reasons, Revenue does not seem to be an essential predictor in a regression model estimating lPre_valuation.
Figure 4.43: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given the Revenue, after we pooled together the groups 50k – 100k, 100k – 500k, and 500k – 1M, in the new group 50k – 1M. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
4.4.2.6 Still_operating
The relationship between Still_operating and lPre_valuation is extremely important (Figure
4.44), not to predict valuation by knowing whether the start-up is still operating, but the other way
around. In fact, if we want to estimate the pre-money valuation of a start-up, it means that the start-up
has not been acquired yet (so it does not belong to the exit group), nor has it been liquidated. So,
every time we aim to predict the pre-money valuation of a start-up, it means that the start-up is indeed
still operating.
What we find more interesting is to investigate the possibility of predicting the future of a start-up
(acquired, liquidated, or still operating) by knowing its lPre_valuation. We do that in Chapter 5.
Figure 4.44 takes into consideration all rounds, and it confirms that a relevant relationship
exists between these two variables. In fact, we count a significant difference in mean
lPre_valuation among all three groups. Companies that reached an exit have the highest
median lPre_valuation, liquidated companies have the lowest one, while start-ups that are still
operating have the highest volatility, but a median lying in between the other two groups.
We are aware that, one day, the majority of start-ups now belonging to the yes group will
belong to the no or exit groups. As we still do not know their destiny, let us now exclude them from
the analysis and focus only on the exit and no groups (Figure 4.45).
Figure 4.44: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating). Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.45 shows an enormous significant difference between the means of the two groups.
This is extremely precious information that makes us wonder the following important
question: was the future of the start-up already evident when those companies were just at an
early-stage round? Was their future already predictable at their first investment round (when
they had zero money previously raised)? In Figure 4.46 we analyse a selection of rounds with
Prev_raised = 0. Even if we only have a few samples in the exit group, the mean differences
between the exit and no groups, and between the yes and no groups, are significant! That is even more
evident by looking only at the exit and no groups (Figure 4.47).
Figure 4.45: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the samples in the groups “exit” and “no”. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.46: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the rounds with zero money previously raised. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
Figure 4.47: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the rounds with zero money previously raised and belonging to the groups “exit” or “no”. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
In Figure 4.48, we consider only early-stage start-ups and show their precise development stage
with different colours. This time we have a more conspicuous number of samples in the no and
yes groups, and their mean difference is again significant, below the 0.05 threshold. Nevertheless,
we have no sample belonging to the exit group, so by knowing the lPre_valuation of a company
at its early-stage round, we can only predict whether -in the future- it will face a liquidation or not.
Overall, our conclusion would be extremely useful to apply but, unfortunately, the very
limited availability of samples and the high volatility of lPre_valuation for companies belonging
to the yes group prevent our statements from being strongly reliable.
Anyway, we can add the following comment to the demonstrated relevant relationship between
the future success of the start-up and its lPre_valuation: reaching a high pre-money valuation
can be both a cause and an effect of a start-up's still_operating status. In fact, if the
company is still operating or achieved an exit, it was probably already showing a lower risk of
failure at the investment round, and for this reason it got a higher valuation. But the other way
around can also be true: as the company got a higher valuation in the round, it could invest
resources in a more efficient way, and it probably raised more money (we saw the strong
correlation existing between lPre_valuation and lAmount_raised), therefore maximizing its
probability of still being operating / sold.
Figure 4.48: Conditional boxplot of the log of Pre-money valuation (lPre_valuation) given if the start-up is still operating (Still_operating), considering only the early-stage rounds. Points in the graph represent all samples in the data set, allowing to see which groups are more populated than others, and the presence of outliers (points over and below whiskers). On the top of the graph is reported the resulting p-value of the Kruskal Wallis H-Test.
5 PREDICTING THE FUTURE SUCCESS OF A SWISS START-UP
5.1 OVERVIEW
In paragraph 4.4.2.6, we proved the existence of a significant, relevant correlation between the
continuous variable lPre_valuation -indicating the pre-money valuation obtained by a company
in a funding round- and the categorical variable Still_operating -revealing the current status of
a start-up (yes = still operating / no = liquidated / exit = acquired)-. Based on that information,
in this chapter we build a model predicting the future status of a company (Still_operating) with
the minimum possible error rate.
5.2 METHODOLOGY
To reach our goal, we apply the following Discriminant Function Analysis approaches to several
combinations of explanatory variables (both continuous and categorical):
• Linear discriminant analysis (LDA)
• Quadratic discriminant analysis (QDA)
• Multiple discriminant analysis (MDA)
• Flexible discriminant analysis (FDA)
Discriminant Analysis, indeed, can be used to determine which variable(s) are the best
predictors of the outcome categorical variable. It assumes that the data (for the variables)
represent a sample from a multivariate normal distribution. However, violations of the
normality assumption are usually not “fatal”, meaning that the resulting significance
tests, etc., are still “trustworthy”, especially for FDA. So, theoretically, this should be
the best method to apply in our case, as we have a multivariate non-normal data set (see the
distribution analysis in paragraph 3.4). Anyway, we test and compare the performance of all these
classifiers (LDA, QDA, MDA, FDA) and, for each of them, we create several models involving
different combinations of continuous and categorical explanatory variables. We then train the
models on 80% of our data set and test them on the remaining 20% of samples. Finally, we
compare the accuracy of these models in terms of the percentage of observations
classified correctly, and we select the optimal one.
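The 80/20 train-test procedure can be illustrated with scikit-learn's LDA implementation on synthetic stand-in data (a Python sketch, not the R code used for the thesis; the generated data and all names are purely illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
# Three well-separated synthetic classes standing in for the
# Still_operating groups (yes / no / exit), with four numeric features
# standing in for Closing_Year, lPre_valuation, lPrev_raised,
# lAmount_raised.
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(60, 4))
               for m in (0.0, 3.0, 6.0)])
y = np.repeat(["no", "yes", "exit"], 60)

# 80% of samples to train, 20% held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
accuracy = lda.score(X_te, y_te)  # share of correctly classified samples
```

The accuracy on the held-out 20% is the model-selection criterion used in this chapter; on real data, repeating the split (or cross-validating) gives a less optimistic estimate than a single random split.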
5.3 RESULTS
By comparing the performance of our models, we find that the best combination of explanatory
variables to predict Still_operating is made of:
− Closing_Year
− lPre_valuation
− lPrev_raised
− lAmount_raised
The classifiers LDA and FDA both provide the most accurate predictions: they correctly predict the
future company status for 96.36% of observations in our data set. This value could be
improved with more observations and measured variables. In more detail, we report the
combination of predictors used by our final selected model:
Coefficients of linear discriminants:
                   LD1     LD2
Closing_Year     0.984   0.305
lPre_valuation   0.432  -0.986
lPrev_raised    -0.096  -0.573
lAmount_raised  -0.262   0.481

Proportion of trace:
  LD1    LD2
0.783  0.217
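Reading these coefficients: each discriminant score is a linear combination of the predictors. A sketch of how they would be applied to one observation (the input values are invented; in practice the predictors are centred and scaled as in the fitted R model):

```python
# Coefficients of the two linear discriminants, as reported above.
LD = {
    "Closing_Year":   (0.984,  0.305),
    "lPre_valuation": (0.432, -0.986),
    "lPrev_raised":   (-0.096, -0.573),
    "lAmount_raised": (-0.262,  0.481),
}

def discriminant_scores(x):
    """Project one observation (dict of predictor values) onto LD1/LD2."""
    ld1 = sum(LD[name][0] * value for name, value in x.items())
    ld2 = sum(LD[name][1] * value for name, value in x.items())
    return ld1, ld2
```

The proportion-of-trace row says that LD1 alone captures 78.3% of the between-group separation, which is why Closing_Year and lPre_valuation (the two largest LD1 loadings) dominate the prediction.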
The weakest point of this model is that we cannot estimate in which year this future status will
be realized exactly. We only know that all companies in our data set closed investment rounds
between 2010 and 2019, while the Still_operating variable is updated to October 2019. So,
overall, the predicted future status of companies is supposed to come true within up to nine years
of the moment the inputs of the model are measured.
The strength of the model is that it reveals that Closing_Year and lPre_valuation are the most
influential factors in determining the future status (and success) of a company.
6 MULTIPLE REGRESSION ANALYSIS
6.1 PURPOSE
The goal of this Multiple Regression Analysis is to estimate a fair benchmark range for the pre-
money valuation of a start-up at its upcoming investment round, given as input some
explanatory variables related to the company. This benchmark could be useful for the co-
founders, as well as for the investors evaluating the company before making an investment
proposal.
While conducting this analysis, we keep in mind that correlation does not imply causation. Even
if we build a model having significant predictors and a high adjusted R-squared, still nothing is
known about causal relationships.
6.2 METHODOLOGY
6.2.1 Overview
We want to predict the dependent variable (lPre_valuation) based on the values of a set of
predictors (mixing continuous and categorical variables). The choice of the type of
regression model depends on the type of distribution followed by its dependent variable:
− Linear regression for a continuous variable having linear relationships with the predictors
− Logistic regression for a dichotomous distribution
− Log-linear analysis for a Poisson or multinomial distribution
− Cox regression for time-to-event data in the presence of censored cases (survival-type)
− Non-linear regression for continuous dependent variables having non-linear
relationships with the predictors
In our case, the dependent variable lPre_valuation is continuous and moderately correlated
with the other continuous predictors (as we saw in paragraph 4.2). So, it is a good rule to start
by creating the simplest possible model via multiple linear regression, and to make it more
complicated only when truly needed. If we make a model more complex, we should get
confirmation, by testing its performance, that we are not heading toward overfitting. Besides, the
prediction intervals should become more precise (narrower). If we have several models with
comparable predictive abilities, the simplest one is likely to be the best model (Zellner,
Keuzenkamp and McAleer, 2001).
Because of the restricted number of samples in our data set (286) and the large proportion of
missing values for some variables, we decided to exploit all available data to train the models, and
then to test them on the same data set via k-fold cross validation, LOOCV, a validation set, and
the bootstrap.
All methods have been tested and visualized via the software R. To distinguish models, in the
following paragraphs we give them names within square brackets (e.g. [regsub.best]).
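The resampling estimates mentioned above (k-fold cross validation and LOOCV) can be sketched as follows. This Python/scikit-learn version is only a stand-in for the R workflow, with invented data of a known linear structure:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(seed=1)
# Invented predictors and response with a known linear relationship
# plus small noise, standing in for lPre_valuation and its predictors.
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

model = LinearRegression()
# 5-fold CV: average R^2 over five held-out folds.
cv5_r2 = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=1)).mean()
# LOOCV: one held-out sample per fold, scored by mean squared error
# (R^2 is undefined on a single observation).
loocv_mse = -cross_val_score(
    model, X, y, cv=LeaveOneOut(),
    scoring="neg_mean_squared_error").mean()
```

Because every observation serves as test data exactly once, both estimates are less optimistic than the in-sample fit, which is the point of reusing the same 286 samples this way.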
6.2.2 Steps
We follow and iterate these steps:
1. Pre-processing of the data set (paragraph 6.3);
2. Manual variable selection based on previous knowledge and the conducted analyses
(paragraph 6.4);
3. Model specification by applying a combination of methods (Appendix: Model
Specification). For each created model:
o Check for outliers, which are removed from the model if justified;
o Assumptions check;7
o Calculation of model fit statistics: internal measures, and test measures through a
validation set, cross validation, and the bootstrap;
4. Comparison of the best models, with appropriate considerations (paragraph 6.5);
5. Choice of the optimal model among the best models, and final comments (paragraph 6.6).
Regarding step 3, here we list the methods we apply to find the best model fit:
1. Automated models: stepwise regression (“both”: backward and forward);
2. Automated models: best subset regression. Train and test models via:
a. a validation set;
b. 5-fold cross validation;
7 For reasons of brevity, this procedure and its results are shown only once, for the final selected best model (paragraph 6.7).
3. Removal of insignificant terms (through the drop1 function);
4. Curve fitting using polynomial terms;
5. Fractional exponents;
6. Splines;
7. Log transformations of predictors;
8. Non-linear regression: GLM (Generalized Linear Models);
9. Loess regression;
10. Kernel regression.
6.3 DATA PRE-PROCESSING
In the following regression analysis, we start by maintaining the same group structure of the categorical variables as resulted from the analysis conducted in paragraph 4.4.
6.4 SECOND MANUAL VARIABLES SELECTION (FROM 20 TO 8 INDEPENDENT VARIABLES):
Based on previous knowledge and the results of conducted analyses, we exclude the following
variables:
− Foundation_Year: its correlation coefficient with the dependent variable is close to zero
(0.08). This can be confirmed by testing a simple regression model with
Foundation_Year as the only explanatory variable and lPre_valuation as the dependent
variable. The explanation of lPre_valuation done by this predictor is close to 0% and not
significant;
− Still_operating and Round_name: these variables are strongly correlated with
lPre_valuation, but they are the effect of a determined lPre_valuation, not the cause.
When using our model for the purpose explained in paragraph 6.1, the user does not
know neither about the future of the company, nor about the name of the round that
the company is considering. For these reasons, we exclude them from our model;
− Profitable: in our data set, only 1 sample is profitable, and 179 missing values are
present. So, we cannot extract useful information out of this variable;
− Pre_Valuation, Amount_raised, Prev_raised, Through_investiere: as we saw in
paragraph 3.4, the log transformations of these variables have better properties for
regression and correlation analysis than the original ones;
− lThrough_investiere and Employees: a separate analysis has been conducted for each
of these variables. Keeping them in our analysis would cost us 82% of the samples,
due to their large proportion of NA values. Besides, lThrough_investiere is not a
universal explanatory factor for all rounds closed in Switzerland;
− Country and Currency: as we consider only Swiss start-ups, all rounds are closed in
Switzerland and denominated in Swiss francs (CHF). These conditions must be respected
when using the model.
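The benefit of the log transformations noted above (paragraph 3.4) is easy to illustrate: funding amounts are strongly right-skewed, and taking logs brings the distribution close to symmetric. A sketch on synthetic log-normal data, used here only as a stand-in for the confidential Amount_raised values:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Funding amounts are typically right-skewed (a few very large rounds);
# a log-normal distribution is a plausible synthetic stand-in.
amount_raised = rng.lognormal(mean=13, sigma=1.5, size=500)

raw_skew = skew(amount_raised)
log_skew = skew(np.log(amount_raised))  # the "lAmount_raised" transform

# The log transform pulls in the long right tail, leaving a distribution
# far closer to symmetric (skewness near 0), which suits OLS regression.
print(round(raw_skew, 2), round(log_skew, 2))
```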
The resulting regression data set is composed of 286 samples (of which 108 are complete cases),
one continuous dependent variable (lPre_valuation), and eight explanatory variables8 (five
categorical and three continuous):
− Industry (Cat.)
− Closing_Year_factor (Cat.)
− Closing_Year (Cont.)
− Stage (Cat.)
− Revenue (Cat.)
− Type_Lead_Investor (Cat.)
− lPrev_raised (Cont.)
− lAmount_raised (Cont.)
Besides, we already know that we will have to decide whether to include Closing_Year or
Closing_Year_factor as a predictor. We will test their performance and then decide.
After this Second Manual Variables Selection, we are still interested in reducing the number of
explanatory variables in our model. In fact, since five of these variables are categorical with
more than two groups, including all of them would already give a model with 15
explanatory variables, before adding any interaction terms.
Green (1991) indicates that N > 50 + 8m samples (where m is the number of independent
variables) are needed for testing multiple correlation. Harris (1985) says that the number of
samples should exceed the number of predictors by at least 50. Van Voorhis and Morgan
(2007) recommend 30 samples per predictor. Finally, the “one in ten rule” is a rule of
thumb for how many predictor parameters can be estimated from data when doing
regression analysis (in particular proportional hazards models in survival analysis and logistic
regression) while keeping the risk of overfitting low. The rule states that one predictive variable
8 Detailed variable description and overview are provided in paragraph 3.3
can be studied for every ten events (Harrell, Lee, Califf, Pryor and Rosati, 1984). As a rule of
thumb, we decide to have a model with at least 15-20 samples per predictor which, in our case,
considering only complete samples, means having a maximum of 5-7 predictors.
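The sample-size rules of thumb above can be made concrete with a short sketch (the function names are ours, chosen only for illustration):

```python
# Minimum sample sizes implied by the rules of thumb cited above,
# for m predictors (here m = 8, as after the second manual selection).
def green_1991(m):          # N > 50 + 8m
    return 50 + 8 * m

def harris_1985(m):         # N should exceed m by at least 50
    return m + 50

def van_voorhis_morgan(m):  # ~30 samples per predictor
    return 30 * m

def one_in_ten(n_samples):  # max predictors: one per ten events
    return n_samples // 10

m = 8
print(green_1991(m), harris_1985(m), van_voorhis_morgan(m))  # 114 58 240
print(one_in_ten(108))  # 10

# The thesis' own 15-20 samples/predictor rule, with 108 complete cases,
# caps the model at 108 // 15 = 7 predictors:
print(108 // 15)  # 7
```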
In the Appendix: Model Specification, we first investigate which selection the automated models
would suggest, and then compare it with the tested application of the other methods listed in
paragraph 6.2.2, without excluding further pooling of groups if necessary.
6.5 BEST MODELS COMPARISON
We selected two best models: model [A] and model [B]. The complete, detailed procedure that
led us to these two models is reported in the Appendix: Model Specification. Model [A]
is:
lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI + Closing_Year:TLI +
Closing_Year:lPrev_raised
And [B]:
lPre_valuation ~ lPrev_raised + lAmount_raised + TLI + Closing_Year:TLI +
Closing_Year:lPrev_raised + Closing_Year:lAmount_raised +
Stage:lAmount_raised
Note that the first one does not include the variable Stage, which has a significantly higher
proportion of missing values compared to the other predictors. Moreover, neither of the two
models includes the interaction term TLI:Stage, which we expected to be a relevant predictor
(4.4.2.3). So in this paragraph, after comparing the performance of these two models, we select
the winner, and we test whether adding the interaction term TLI:Stage to it yields better results.
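The thesis fits these models in R. As a hedged sketch, model [A] can be reproduced with statsmodels' R-style formula interface; the data frame below is synthetic (the underlying data set is not public), and TLI abbreviates Type_Lead_Investor exactly as in the text:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-in for the (confidential) funding-round data set.
df = pd.DataFrame({
    "lPrev_raised": rng.normal(13, 1.5, n),
    "lAmount_raised": rng.normal(14, 1.0, n),
    "Closing_Year": rng.integers(2010, 2020, n).astype(float),
    "TLI": rng.choice(["Acc/Inc/PA", "Inst.Financial", "Inst.Strategic"], n),
})
df["lPre_valuation"] = (0.4 * df["lAmount_raised"]
                        + 0.1 * df["lPrev_raised"]
                        + rng.normal(0, 0.4, n))

# Model [A]: same formula syntax as R (":" denotes an interaction term).
model_a = smf.ols(
    "lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year"
    " + TLI + Closing_Year:TLI + Closing_Year:lPrev_raised",
    data=df,
).fit()
print(model_a.params.round(3))
print(round(model_a.rsquared_adj, 3))
```

With TLI having three levels, the design matrix has 9 columns: intercept, two TLI dummies, the three continuous main effects, and three interaction columns.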
Figure 6.1 gives a clear picture of how the magnitude of the effect differs across these predictors
(circle/square), together with its uncertainty (horizontal lines indicate the 95% confidence
interval). In Figures 6.2 and 6.3, instead, we read that the vast majority of performance metrics
agree that model [A] is the best one (we compare the two models with the model resulting from
best subset selection, and with the two previously considered variations of [basic.model] from
Model Specification: [TLI*Stage] and [TLI + Stage]9).
Figure 6.1: On the right, the names of predictors belonging to model [A] and/or model [B]. The figure compares the magnitude effect (blue circles for model [A] and orange squares for model [B]) of each predictor on Pre-money valuation, plus its uncertainty (the horizontal lines indicate the 95% confidence interval of these estimates).
So we select [A] as the best model, and we now try to add the interaction term TLI:Stage to it,
as a final confirmation.
9 The full explanation of these models is provided in the Appendix: Model Specification.
Model: A + TLI:Stage

Residuals:
     Min       1Q   Median       3Q      Max
-0.80396 -0.23757 -0.00219  0.26622  0.92276

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                              -2.826e+02  6.792e+01  -4.160 7.14e-05 ***
lPrev_raised                              9.600e-05  3.172e-05   3.027  0.00321 **
lAmount_raised                            3.792e-01  6.239e-02   6.078 2.73e-08 ***
Closing_Year                              1.451e-01  3.384e-02   4.289 4.43e-05 ***
TLI_Institutional Financial              -3.685e+00  1.022e+02  -0.036  0.97133
TLI_Institutional Strategic               4.082e+02  1.535e+02   2.659  0.00925 **
Closing_Year:TLI_Institutional Financial  1.690e-03  5.074e-02   0.033  0.97351
Closing_Year:TLI_Institutional Strategic -2.023e-01  7.619e-02  -2.655  0.00935 **
lPrev_raised:Closing_Year                -4.753e-08  1.572e-08  -3.023  0.00324 **
TLI_Acc/Inc/PA:StageFirst Clients        -3.436e-02  1.560e-01  -0.220  0.82610
TLI_Instit.Financial:StageFirst Clients   2.052e-01  1.575e-01   1.303  0.19597
TLI_Inst.Strategic:StageFirst Clients    -3.389e-01  3.033e-01  -1.118  0.26669
TLI_Acc/Inc/PA:StageLater-stage           1.099e-01  2.160e-01   0.509  0.61205
TLI_Inst.Financial:StageLater-stage       2.704e-01  1.548e-01   1.746  0.08413 .
TLI_Inst.Strategic:StageLater-stage      -8.369e-02  2.823e-01  -0.296  0.76753
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3956 on 92 degrees of freedom
Multiple R-squared: 0.8196, Adjusted R-squared: 0.7921
F-statistic: 29.85 on 14 and 92 DF, p-value: < 2.2e-16
Figure 6.2: Comparison of models’ performance. Reading the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate and have been calculated through the software R. Models with lower values for these metrics are preferred. The metrics are calculated as follows:
− AIC = −2(log-likelihood) + 2K, where K is the number of model parameters and the log-likelihood is a measure of model fit (the higher the value, the better the fit);
− BIC = −2(log-likelihood) + log(n)K, where K is the number of model parameters and n the number of observations;
− As defined by Allen (1974), PRESS is based on the leave-one-out technique: each of the n samples is removed in turn, and the model is refitted to the remaining (n−1) points. The predicted value is calculated at the excluded point, and the PRESS statistic is the sum of the squares of all the resulting prediction errors.
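The three metrics defined in the caption can be computed directly from an OLS fit. A minimal sketch, assuming a Gaussian log-likelihood and counting the error variance as a parameter (as R's AIC does); the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3                      # k regression coefficients incl. intercept

X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0, 1, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
rss = resid @ resid

# Gaussian log-likelihood at the MLE variance estimate rss/n:
loglik = -n / 2 * (np.log(2 * np.pi * rss / n) + 1)

aic = -2 * loglik + 2 * (k + 1)           # +1: the error variance is a parameter
bic = -2 * loglik + np.log(n) * (k + 1)

# PRESS via the hat matrix: deleting point i and predicting it equals
# dividing residual i by (1 - leverage_i), so no n refits are needed.
H = X @ np.linalg.inv(X.T @ X) @ X.T
press = np.sum((resid / (1 - np.diag(H))) ** 2)

print(round(aic, 2), round(bic, 2), round(press, 2))
```

Since log(n) > 2 for n > 7, BIC penalises parameters more heavily than AIC, and PRESS always exceeds the in-sample RSS.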
ANOVA test does not show a significant improvement:
Model 1: lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI +
         Closing_Year:TLI + Closing_Year:lPrev_raised
Model 2: lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI +
         Closing_Year:Type_Lead_Investor + lPrev_raised:Closing_Year +
         Type_Lead_Investor:Stage
  Res.Df    RSS Df Sum of Sq Pr(>Chi)
1     98 15.238
2     92 14.400  6   0.83771   0.4995
We then compare them in figures 6.4 and 6.5. Except for the validation-set measures and the
internal RMSE and MAE, all other metrics indicate that [A + TLI:Stage] performs worse than [A]
alone. Besides, many predictors would no longer be significant. Before concluding that [A] is
our best model, we finally test through the function drop1 whether any of its predictors should
be removed:
Single term deletions

Model:
lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI +
    Closing_Year:TLI + Closing_Year:lPrev_raised

                                Df Sum of Sq    RSS     AIC  Pr(>Chi)
&lt;none&gt;                                        16.484 -187.90
Closing_Year:Type_Lead_Investor  2    1.9901 18.474 -179.47 0.0020057 **
lPrev_raised:Closing_Year        1    1.7901 18.274 -178.66 0.0008018 ***
lAmount_raised                   1    7.3165 23.800 -149.86 2.491e-10 ***
All terms are significant and we can therefore conclude that [A] is the best selected model.
Figure 6.3: Comparison of models’ performance. Reading the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate, and models with lower values are preferred, except for the R2 measures, where the contrary is true. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “k-fold 5.5” means we split the data set into five folds and repeat the cross validation five times; the final model error is taken as the mean error over the repeats. MAE is the average, over the test sample, of the absolute differences between prediction and actual observation (all having equal weight). RMSE is the square root of the average of squared differences between prediction and actual observation. S is the standard error (an absolute measure of the typical distance of the data points from the regression line, in the units of the dependent variable). R-squared (R2) is the relative measure of the share of the dependent variable’s variance that the model explains (from 0 to 1).
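The error metrics defined in the caption are straightforward to implement; a sketch (the numeric example is invented for illustration):

```python
import numpy as np

def rmse(actual, predicted):
    """Square root of the mean squared prediction error."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mae(actual, predicted):
    """Mean absolute prediction error (all points weighted equally)."""
    return np.mean(np.abs(actual - predicted))

def r_squared(actual, predicted):
    """Share of the dependent variable's variance explained (0 to 1)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1 - ss_res / ss_tot

# Invented toy values on the lPre_valuation (log) scale:
actual = np.array([15.8, 20.6, 13.4, 13.3, 12.5])
predicted = np.array([15.1, 19.0, 14.4, 13.8, 13.0])

print(round(rmse(actual, predicted), 3))       # → 0.954
print(round(mae(actual, predicted), 3))        # → 0.86
print(round(r_squared(actual, predicted), 3))  # → 0.896
```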
Figure 6.4: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model.
Figure 6.5: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to Standard error, which in literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds, and we repeat cross validation five times. The final model error is taken as the mean error from the number of repeats.
6.6 BEST SELECTED MODEL
In the previous paragraph 6.5, we selected [A] as our best model, having the following formula:
lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + Type_Lead_Investor +
Closing_Year:Type_Lead_Investor + Closing_Year:lPrev_raised
Compared with our Third Manual Variable Selection (Appendix, paragraph 9.2), model [A]
includes a combination of only four explanatory variables. The importance of these predictors
has already been discussed in Chapter 4. All of them, in fact, are significant, except for the
group Institutional Financial of the variable Type_Lead_Investor (as we already noted in the
analysis of this variable, conducted in paragraph 4.4.2.4) and its interaction with Closing_Year.
The interaction term between Closing_Year and lPrev_raised corrects the impact that the total
funding of the start-up has on its valuation, depending on the year in which the round is
closed (and vice versa).
6.6.1 Confidence and Prediction intervals10
When using the suggested best model [A] to predict lPre_valuation for new data (funding
round), prediction intervals become very important for making real-world predictions with
realistic bounds of uncertainty.
For reasons of brevity, we report confidence and prediction intervals only for the first eight
samples of our data set. Of course, narrower prediction intervals indicate a better model, but
10 The confidence interval reflects the uncertainty around the mean predictions. It states, according to the model, which is on average the lPre_valuation range for a funding round with some specific inputs as independent variables. A prediction interval gives an interval within which we expect the dependent variable (lPre_valuation) to lie with a specified probability (95%, in our case). The prediction interval gives uncertainty around a single value, given values of the independent variables specified. If the prediction intervals are too wide, the predictions do not provide useful information. Narrow prediction intervals represent more precise predictions. Thus, a prediction interval will be generally much wider than a confidence interval for the same value.
we did not use them to select the best model, because they rely strongly on the assumption that
the residual errors are normally distributed with constant variance; linear regression itself,
instead, does not depend so strongly on that assumption (see the assumptions check in
paragraph 6.7.2).
  lPre_valuation      fit lwr.conf.int upr.conf.int lwr.pred.int upr.pred.int
1       15.84721 15.06559     14.90795     15.22324     13.89531     16.23588
2       20.62511 19.01204     18.66894     19.35514     17.80273     20.22135
3       13.38473 14.35444     14.19031     14.51856     13.18326     15.52561
4       13.30468 13.78248     13.58372     13.98125     12.60595     14.95901
5       12.50618 13.03486     12.74742     13.32231     11.84015     14.22958
6       12.76569 13.12320     12.80986     13.43654     11.92199     14.32441
7       13.28788 13.96765     13.76486     14.17044     12.79043     15.14486
8       12.89922 13.68907     13.44615     13.93200     12.50428     14.87386
In the graph (figure 6.6) we show the first 70 samples of the data set (points), their prediction
intervals (red lines), and confidence intervals (green lines).
Figure 6.6: On the abscissa is the log of the Amount raised at the round (lAmount_raised) and on the ordinate the log of Pre-money valuation (lPre_valuation). The Figure shows, for Model [A], the actual Pre-money valuations (black points), the confidence intervals (green lines), and the prediction intervals (red lines).
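The thesis computes these intervals in R; an equivalent hedged sketch with statsmodels on synthetic data shows why prediction intervals are always wider than confidence intervals for the same point:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"lAmount_raised": rng.normal(14, 1, 120)})
df["lPre_valuation"] = 5 + 0.6 * df["lAmount_raised"] + rng.normal(0, 0.5, 120)

fit = smf.ols("lPre_valuation ~ lAmount_raised", data=df).fit()

new = pd.DataFrame({"lAmount_raised": [13.0, 14.0, 15.0]})
frame = fit.get_prediction(new).summary_frame(alpha=0.05)  # 95% intervals

# mean_ci_*: confidence interval around the average prediction;
# obs_ci_*: prediction interval for a single new round (always wider,
# because it also carries the residual variance of one observation).
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]].round(3))
```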
6.7 MLR BLUE ASSUMPTIONS CHECK
We now show the procedure applied to each resulting model in order to test whether it
meets the Multiple Linear Regression BLUE assumptions. For reasons of brevity, we show the
process only once, for our best model [A]. Overall, all of the MLR BLUE assumptions are met, as
linear regression allows a certain margin of tolerance. Our best model [A] is thus valid from a
theoretical viewpoint.
6.7.1 Outlier detection
We remove all the samples with a Cook’s distance higher than 0.1 (Figures 6.7 – 6.8). Then, under
Bonferroni correction of p-values, the outlier test does not find any significant outlier for this
model (no observation has an adjusted p-value below 0.05).
Figure 6.8: For Model [A], on the abscissa is plotted the number of the sample (random order) and on the ordinate its Cook's distance, calculated via the software R.
Figure 6.7: For Model [A], on the abscissa is plotted the Leverage and on the ordinate the standardized residuals.
6.7.2 Check MLR assumptions
6.7.2.1 Assumption 1. Linearity
This assumption is overall respected: a linear trend between the dependent variable
lPre_valuation and the predictors is evident in figures 6.9 – 6.10.
Figure 6.10: For Model [A], on the abscissa are plotted the predicted values of Pre-money valuation and on the ordinate its observed values. The red line is the quadrant bisector used to evaluate if samples (points) are homogeneously distributed over and under the line.
Figure 6.9: For Model [A], on the abscissa are plotted the fitted values of the Pre-money valuation and on the ordinate the absolute residuals.
6.7.2.2 Assumption 2: Random sampling of observations
This assumption is true because of the data collection process that we adopted and explained
in detail in Chapter 3.
6.7.2.3 Assumption 3: Zero conditional mean of residuals
This assumption is fully respected, as shown below:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.96992 -0.26535  0.02779  0.00000  0.26272  1.00308
6.7.2.4 Assumption 4: No multicollinearity (or perfect collinearity)
There should be no linear relationship between the independent variables. The correlation
matrix in Chapter 4 (figure 4.22) shows the existence of moderate relationships among
predictors, but the result of the VIF test reassures us that these values are tolerable.
An important implication of this assumption is that there should be sufficient variation in the
predictors. The larger the variability in the explanatory variables, the better are the OLS
estimates in determining the impact of predictors on lPre_valuation. Below we check the
variability of predictors:
               freqRatio percentUnique zeroVar   nzv
Closing_Year    1.222222      9.345794   FALSE FALSE
lPre_valuation  1.333333     80.373832   FALSE FALSE
lPrev_raised   14.666667     51.401869   FALSE FALSE
lAmount_raised  1.428571     57.009346   FALSE FALSE
As desired, no predictor has zero variance (zeroVar) or near zero variance (nzv).
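The freqRatio / percentUnique / nzv diagnostics come from R's caret::nearZeroVar. A hedged re-implementation of its per-column logic (the default cut-offs 95/5 and 10% are assumed to match caret's defaults):

```python
import numpy as np

def near_zero_var(x, freq_cut=95 / 5, unique_cut=10.0):
    """Per-column diagnostics in the style of caret::nearZeroVar:
    freqRatio     = count of most common value / second most common;
    percentUnique = 100 * (# distinct values) / (# samples).
    A column is 'nzv' when freqRatio is high AND percentUnique is low."""
    x = np.asarray(x)
    _, counts = np.unique(x, return_counts=True)
    counts = np.sort(counts)[::-1]
    freq_ratio = np.inf if len(counts) == 1 else counts[0] / counts[1]
    pct_unique = 100.0 * len(counts) / len(x)
    zero_var = len(counts) == 1
    nzv = zero_var or (freq_ratio > freq_cut and pct_unique < unique_cut)
    return freq_ratio, pct_unique, zero_var, nzv

rng = np.random.default_rng(5)
healthy = rng.normal(size=107)             # e.g. a continuous predictor
degenerate = np.array([0] * 105 + [1, 2])  # almost constant column

print(near_zero_var(healthy))      # nzv = False
print(near_zero_var(degenerate))   # nzv = True
```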
6.7.2.5 Assumption 5. Spherical errors
This assumption requires homoscedasticity and no autocorrelation among residuals. In figure
6.9, residuals do not show a trend. In figures 6.11 – 6.12 we plot the residuals against
Closing_Year to make sure that no time trend exists among them. The Kruskal-Wallis test
confirms that there are no significant differences among residuals over time.
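The Kruskal-Wallis check can be sketched as follows (the residual groups are synthetic; under the null of identically distributed residuals per year, the test should come out insignificant):

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(6)

# Residuals grouped by Closing_Year; under homoscedasticity and no time
# trend, all groups share one distribution.
residuals_by_year = {year: rng.normal(0, 0.4, 12) for year in range(2014, 2020)}

stat, p_value = kruskal(*residuals_by_year.values())
print(round(stat, 3), round(p_value, 3))

# A p_value above 0.05 means no significant differences among residuals
# over time, mirroring the conclusion drawn for model [A].
```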
Figure 6.11: For Model [A], on the abscissa we find the Closing Year of rounds and on the ordinate the absolute residuals. The red horizontal line is set zero to evaluate if residuals (points) are homogeneously distributed over and under the line, independently from the Closing Year.
Figure 6.12: Conditional boxplot of standardized residuals given the Closing Year of the round.
6.7.2.6 Assumption 6 (optional): Error terms should be normally distributed.
This assumption is not strict. In the figure 6.13 below, we show the distribution of residuals, and
the result of normality tests is the following:
-----------------------------------------------
 Test                 Statistic    pvalue
-----------------------------------------------
 Shapiro-Wilk            0.9899    0.6070
 Kolmogorov-Smirnov      0.0674    0.7165
 Anderson-Darling        0.4357    0.2935
-----------------------------------------------
In none of these normality tests do we reject the null hypothesis of normality, as the p-value is
always above 0.05. Our model is thus valid from a theoretical viewpoint.
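The three normality tests in the table are all available in scipy; a sketch on synthetic standardized residuals (note that the Kolmogorov-Smirnov variant with parameters estimated from the data is only approximate; Lilliefors' correction would be stricter):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
residuals = rng.normal(0, 1, 107)   # stand-in for standardized residuals

sw_stat, sw_p = stats.shapiro(residuals)
ks_stat, ks_p = stats.kstest(residuals, "norm",
                             args=(residuals.mean(), residuals.std(ddof=1)))
ad = stats.anderson(residuals, dist="norm")

print(round(sw_p, 4))                        # Shapiro-Wilk p-value
print(round(ks_p, 4))                        # Kolmogorov-Smirnov p-value
print(ad.statistic < ad.critical_values[2])  # below the 5% critical value?

# A p-value above 0.05 (or an A-D statistic below its 5% critical value)
# means normality is NOT rejected, which is the conclusion for model [A].
```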
Figure 6.13: Probability density function and boxplot of the standardized residuals of Model [A].
7 CONCLUSIONS
Coming to the conclusions of this research, we bear in mind that “all models are wrong, but
some are useful” (Box, 1976). As stated in the Introduction, the aim of this thesis is to provide
an overview of start-up valuations in Switzerland, and a model estimating a fair range of pre-
money valuation for a target start-up, which could be a useful starting point in the negotiation
process between co-founders and investors. With the proposed model, therefore, we do not
aspire to predict the exact valuations of Swiss early-stage companies, but rather to indicate a
(prediction) interval within which their valuation is most likely to lie. We are also aware of the
existence of other factors influencing the valuation of start-ups, that we could not measure in
our analysis (as we explain further on in the limitations of our approach). Besides, we tend to
agree with the following statement of Matthew Schubring, managing director at Chartwell (Dahl,
2016):
“A good valuation is 75% art and 25% science, because it takes into account the story behind the
numbers of a business. Appraisals fall down when there isn’t enough support for the story behind
it. It’s based not on just what happened, but on why things happened.”
With these premises made, in Chapter 4 we presented the main findings of the conducted
Multivariate Analysis. For the VC Investiere, having zero money previously raised is not an
obstacle to its investment commitment. Nonetheless, if a start-up has already raised money in
the past, the amount then invested by Investiere tends to increase slightly, with a significant
correlation coefficient of 0.27. This trend is confirmed also for rounds in which Investiere was
not involved, as the size of the round (lAmount_raised) and the total funding previously raised
by the start-up (lPrev_raised) are linked by a significant correlation (r = 0.42).
The correlation between the Amount raised and the pre-money valuation is the strongest one
among our continuous variables (r=0.59, paragraph 4.2.4). This helps entrepreneurs to find a
balance between the size of the investment and the consequent dilution.
Among categorical variables, the strongest existing correlation is between the revenues
generated by a company and its development stage. We also found that belonging to a particular
industry does not affect the success of the start-up, but it influences its revenues (Figure 4.34).
In our data set, all types of investors have a diversified portfolio of companies, in terms of
generated revenues and development stage, as we saw more in detail in paragraph 4.4.2.4. On
the other hand, the type of investor involved in the round makes a significant impact on the
start-up valuation.
We also could not find evidence that the future status of a company (acquired/liquidated/still
operating) is determined by the generated revenues. Instead, the Discriminant Function Analysis
conducted in Chapter 5 showed that the main predictors of the success of a start-up are the pre-
money valuation (confirming the importance of the topic of this research) and the closing year
of the round. We also proposed a model able to predict correctly the future status of a start-up
(acquired, liquidated, or still operating) for the 96.36% of observations in our test data set. This
value could be improved in further research by increasing the number of observations and
measured variables.
In addition, we found significant differences in start-up valuations depending on their
development stage, but not on the industry they belong to (paragraph 4.4.2.2). The most volatile
sector is ICT, while Cleantech and Consumer Products ranges are relatively restricted.
In Chapter 2, we presented an overview of the main start-up valuation methods, which we are
now ready to discuss in comparison with our findings and with the best model resulting from
the Multiple Regression Analysis conducted in Chapter 6.
The Scorecard Method (Payne, 2011) is based on the average valuation of early-stage start-ups
in the industry and region of the target company. As explained in Chapter 3, that kind of sensitive
data collection procedure is extremely time-consuming, which makes it quite unrealistic
that co-founders and investors could take advantage of it.
The Berkus Model (Berkus, 2009), instead, considers very broad concepts as the factors
influencing a start-up’s valuation and, when it comes to numbers, leaves enormous space to
interpretation. It is therefore highly probable that the same start-up, when evaluated by
different actors via the Berkus model, receives very disparate valuations.
Finally, the Venture Capital Method (Sahlman and Scherlis, 1989, revised in 2009) is the oldest
and the closest to classical financial methods. Its weakest point is that it rests on the
assumption that the target company will generate a certain estimated amount of revenues in
five years. As Berkus (2009) states, this is a goal with less than a 0.1% probability of being met.
In contrast to all these methods, the approach that we propose stands out on several aspects.
First of all, it takes as input only data which are 100% objective and accessible to all players, at
zero cost: the round size (Amount_raised), the total funding previously raised by the company
(Prev_raised), the year in which the round is going to be closed (Closing_Year), and the nature
of the main investor involved (Type_Lead_Investor). An important consequence is that its results
are objective (independent of the user).
The second important difference between our model and the others is its immediacy: the
valuation estimate is obtained instantly and effortlessly.
Another main difference from the literature concerns the richness of inputs and outcomes.
Our model considers inputs that none of the previous models has taken into account before.
These factors allow our valuation estimate to be more precise, because it is contextualized in
time (Closing_Year), space (Country), and actors involved (type of investor). In Chapter 4, we
also found reason to believe that, with a larger availability of data, other factors not included in
previous approaches would turn out to be significant predictors of pre-money valuation: the
industry, the revenue generated by the start-up, and the number of its employees.
Regarding the richness of the outcome, our method does not provide only a single valuation
estimate: it offers a range (prediction interval) within which the market value is expected to
lie, with a certain probability chosen by the user (e.g. 95%). It also offers a prospectus of the
individual effect played by each factor on the pre-money valuation, and of its significance.
Last but not least, a relevant dimension on which our model outperforms all the others is the
validity of its performance. In paragraph 6.5 we provided a transparent and detailed analysis of
its internal measures and of its results when tested through a validation set, cross validation,
and bootstrap. All of these tests agree on an adjusted R-squared of about 0.8, a predicted
R-squared of 0.77, and a prediction error rate close to 2.7%.
To sum up, the advantages of our model compared to the available ones in the literature are:
− Accessibility: everyone has access to input data at zero cost;
− Immediacy: given the inputs, the method is instantaneous and effortless;
− Objectiveness: results are independent of the user;
− Outcome richness: not just a valuation benchmark but a probability interval, plus the
individual effect of each factor;
− Contextuality: location, timing, and investors involved are taken into account;
− Performance: the prediction error rate of the model is around 2.7%11, the adjusted
R-squared is 0.8, and the predicted R-squared about 0.77.
On the other hand, our approach presents several limitations:
− Limited geographic validity (Switzerland);
− Limited dimension of the training set (286 samples) and therefore absence of a separate
test set;
− Abundance of missing values on specific variables (revenue, stage, profitability,
employees);
− Other relevant qualitative and quantitative measures are not considered (background
and experience of the team, market size, validity of the business strategy, proof of
validation, tangible and intangible assets, market positioning (leadership, barriers to
entry, brand awareness), expected synergies, patents/technology, competition, …)
These limitations could be addressed in further research by expanding the data set for Swiss
rounds, or by focusing on another country or on a specific industry. Overall, the literature on
this topic is young and still poorly able to provide co-founders and investors with a starting
point in their “dance of concessions”. Nonetheless, this thesis shows that the room for
improvement and the pool of interested stakeholders are broad. This has also been underlined
by a recent survey conducted by the Canadian Golden Triangle Angel Network, involving 200
private angels. The result was that, for 50% of them, the most likely reason behind the decision
not to invest in a start-up is that “Companies overstated their valuations” (Douglas, 2016).
11 When tested through validation set, cross validation, and bootstrap techniques
8 REFERENCES
Allen, D. M. (1974). The relationship between variable selection and data augmentation and a
method for prediction. Technometrics, 16(1), 125-127.
Arping, S., & Falconieri, S. (2009). Strategic versus financial investors: The role of strategic
objectives in financial contracting. Oxford Economic Papers, 62(4), 691-714.
Behrmann, G. (2016). Internet Company Valuation - A Study of Valuation Methods and Their
Accuracy. EBS Universität für Wirtschaft und Recht, Oestrich-Winkel, Germany.
Berkus, D. (2009, Nov 4). The Berkus Method – Valuing the Early Stage Investment. Berkonomics.
https://berkonomics.com/?p=131.
Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356),
791-799.
Clarysse, B., & Kiefer, S. (2011). Introducing the venture roadmap and basic financials. In The
Smart Entrepreneur: How to Build for a Successful Business (p. 191). Elliott &
Thompson.
Dahl, D. (2015, Oct 30). Why Valuing Your Business Is More Art Than Science. Forbes.
Damodaran, A. (2007). Valuation approaches and metrics: a survey of the theory and evidence.
Foundations and Trends® in Finance, 1(8), 693-784.
Douglas, R. (2016, Sept 2). Early-stage startup valuations: More art than science. Communitech
News - https://news.communitech.ca/early-stage-startup-valuations-more-art-than-
science/
Engel, R. (2002). Teaching note: An introduction to the venture capital method.
Frei, P., & Leleux, B. (2004). Valuation—what you need to know. Bioentrepreneur, 1-3.
Gentle, J. E. (2009). Computational statistics (Vol. 308). New York: Springer.
Green, S. B. (1991). How many subjects does it take to do a regression analysis. Multivariate
behavioral research, 26(3), 499-510.
Gunn, M. A. (2016). When science meets entrepreneurship: Ensuring biobusiness graduate
students understand the business of biotechnology. Journal of Entrepreneurship
Education, 19(2), 53.
Harris, R. J. (1985). A primer of multivariate statistics (2nd ed.). New York: Academic Press.
Harrell, F. E. Jr., Lee, K. L., Califf, R. M., Pryor, D. B., & Rosati, R. A. (1984). Regression modelling
strategies for improved prognostic prediction. Statistics in Medicine, 3(2), 143-152.
Isabelle, D. (2013). Key factors affecting a technology entrepreneur's choice of incubator or
accelerator. Technology innovation management review, 16-22.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
(Vol. 112). New York: Springer.
Lebret H. (2019). The Analysis of 500+ startups. Retrieved from: http://www.startup-
book.com/2019/04/26/the-analysis-of-500-startups/.
Payne, B. (2011). Scorecard valuation methodology. Establishing the Valuation of Pre-revenue,
Startup Companies. Retrieved from: http://docplayer.net/14290190-Scorecard-
valuation-methodologyestablishing-the-valuation-of-pre-revenue-start-up-companies-
by-billpayne.html
Peduzzi, P., Concato, J., Kemper, E., Holford, T. R., Feinstein, A. R. (1996). A simulation study of
the number of events per variable in logistic regression analysis. Journal of Clinical
Epidemiology. 49 (12)
Rao, M. B., & Rao, C. R. (2014). Computational Statistics with R (Vol. 32). Elsevier.
8 References 87
Sahlman, W., & Scherlis, D. (1989). A Method For Valuing High-Risk, Long-Term Investments: The
"Venture Capital Method". Harvard Business School, 9-288. (Revised October 2009.)
Van Voorhis, C. R. W., Betsy, L. & Morgan R. (2007). Understanding Power and Rules of Thumb
for Determining Sample Sizes. Tutorials in Quantitative Methods for Psychology, 3(2),
43-50.
Villalobos, L. (2007). Investment Valuations of Seed- and Early-Stage Ventures. The
entrepreneur’s trusted guide to high growth (pp. 3-4). Ewing Marion Kauffman
Foundation.
Wong, A., Bhatia, M., & Freeman, Z. (2009). Angel finance: the other venture capital. Strategic
Change: Briefings in Entrepreneurial Finance, 18(7‐8), 221-230.
Zellner, A., Keuzenkamp, H. A., & McAleer, M. (Eds.). (2001). Simplicity, inference and modelling:
keeping it sophisticatedly simple. Cambridge University Press.
9 APPENDIX: MODEL SPECIFICATION AND SELECTION
9.1 AUTOMATED MODELS
9.1.1 Stepwise regression
We report the best model from stepwise regression, with direction = ”both”:
Residuals:
     Min       1Q   Median       3Q      Max
-1.35936 -0.34815  0.05134  0.31283  1.30798

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -1.810e+02  5.255e+01  -3.443 0.000835 ***
Closing_Year                               9.467e-02  2.626e-02   3.604 0.000486 ***
Type_Lead_InvestorInstitutional Financial -7.293e-02  1.227e-01  -0.594 0.553601
Type_Lead_InvestorInstitutional Strategic  2.749e-01  1.535e-01   1.791 0.076237 .
lPrev_raised                               1.124e-07  2.202e-08   5.104 1.55e-06 ***
lAmount_raised                             3.859e-01  7.354e-02   5.248 8.41e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4949 on 102 degrees of freedom
Multiple R-squared: 0.674, Adjusted R-squared: 0.658
F-statistic: 42.17 on 5 and 102 DF, p-value: < 2.2e-16
Industry, Stage, Closing_Year_factor, and Revenue are excluded from the model, which therefore counts five predictors (out of the original eight). The intercept is negative and the coefficient for Institutional Financial is not significant.
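For readers unfamiliar with the mechanics behind R's step(direction="both"), the following pure-Python sketch illustrates the greedy add/drop search: at each step, the single addition or removal that most lowers the AIC is accepted, until no move improves it. This is an illustration on toy data only (not the thesis data set, and a simplified Gaussian-likelihood AIC rather than R's exact value).

```python
import math

def ols_rss(xcols, y):
    """Fit y on an intercept plus the columns in xcols by ordinary least
    squares (normal equations, Gaussian elimination); return the RSS."""
    n, p = len(y), len(xcols) + 1
    X = [[1.0] + [c[i] for c in xcols] for i in range(n)]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
    v = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    for col in range(p):                       # forward elimination, partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    b = [0.0] * p                              # back substitution
    for r in range(p - 1, -1, -1):
        b[r] = (v[r] - sum(A[r][c] * b[c] for c in range(r + 1, p))) / A[r][r]
    return sum((y[i] - sum(b[j] * X[i][j] for j in range(p))) ** 2 for i in range(n))

def aic(rss, n, k):
    # Gaussian log-likelihood form of AIC, up to an additive constant
    return n * math.log(rss / n) + 2 * (k + 1)

def stepwise_both(columns, y):
    """Greedy add/drop search that stops when no move lowers the AIC."""
    chosen, n = [], len(y)
    best = aic(ols_rss([], y), n, 0)
    improved = True
    while improved:
        improved = False
        moves = [chosen + [v] for v in columns if v not in chosen]
        moves += [[v for v in chosen if v != d] for d in chosen]
        for m in moves:
            score = aic(ols_rss([columns[v] for v in m], y), n, len(m))
            if score < best - 1e-9:
                best, chosen, improved = score, list(m), True
    return chosen

# Toy data: y depends on x1 only; x2 is an unrelated permutation.
x1 = [0.0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
x2 = [5.0, 3, 8, 1, 9, 2, 7, 4, 6, 0]
y = [2 * x1[i] + 1 + 0.2 * (-1) ** i for i in range(10)]
selected = stepwise_both({"x1": x1, "x2": x2}, y)
print(selected)  # always contains "x1"; "x2" enters only if it lowers the AIC
```

R's step() additionally prints the AIC trace of every candidate move at each step; the greedy logic is the same.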
9.1.2 Best Subset Selection
Figure 9.1: Three plots for analysing the results of the Best subset selection method, applied via the software R. On the abscissa is represented the number of variables in the model, while on the ordinate is shown the resulting value of each metric: adjusted R squared, Cp, and BIC. The optimal value for each metric is marked in red.
We obtain different results in Figure 9.1; five predictors could be a good compromise. The first variable selected by this method is lAmount_raised, followed by lPrev_raised, as expected given their moderate-to-high correlation with the dependent variable lPre_valuation. Next come Closing_Year, Institutional Strategic, and Later-stage.
Model: regsub.best Residuals: Min 1Q Median 3Q Max -1.29992 -0.27916 0.00431 0.33149 1.27445 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.513e+02 4.743e+01 -3.190 0.00188 ** lAmount_raised 3.432e-01 7.235e-02 4.744 6.70e-06 *** lPrev_raised 1.109e-07 2.137e-08 5.191 1.04e-06 *** Closing_Year 8.021e-02 2.373e-02 3.379 0.00102 ** Type_Lead_InvestorInstitutional Strategic 3.299e-01 1.273e-01 2.592 0.01093 * StageLater-stage 2.156e-01 1.125e-01 1.917 0.05793 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4827 on 104 degrees of freedom (176 observations deleted due to missingness) Multiple R-squared: 0.7012, Adjusted R-squared: 0.6868 F-statistic: 48.81 on 5 and 104 DF, p-value: < 2.2e-16
These predictors are indeed all significant and the adj.R2 is higher than before. The intercept is
negative.
Note also that the adjusted R2, BIC and Cp are calculated on the training data used to fit the model. Model selection through these metrics is therefore potentially subject to overfitting, and the selected model may not perform as well on new data. As justified in the Methodology (paragraph 6.2), we adopt a validation set and cross validation to test the automated models.
9.1.2.1 Model selection using a validation set
We split the data into two halves (train and test), run Best Subset Selection on the training half, and evaluate the resulting models on the test half. Figure 9.2 shows that the optimal number of predictors (the one with the minimum Mean Squared Error, MSE) is five once again, so the best model is confirmed to be [regsub.best].
9.1.2.2 Model selection by k-fold cross validation
We test the models with the 5-fold cross validation method. Figure 9.3 shows that the model with the minimum cross validation error has four predictors (in accordance with the minimum BIC in Best subset selection, Figure 9.1), which gives the following result:
Model: regsub.cv.best

Residuals:
    Min      1Q  Median      3Q     Max
-1.5498 -0.4447 -0.0130  0.3711  1.9436

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -7.595e+01  3.335e+01  -2.277   0.0236 *
lAmount_raised                             6.717e-01  3.478e-02  19.313  < 2e-16 ***
lPrev_raised                               2.966e-08  6.517e-09   4.551 8.12e-06 ***
Closing_Year                               4.063e-02  1.660e-02   2.448   0.0150 *
Type_Lead_InvestorInstitutional Strategic  3.327e-01  1.489e-01   2.234   0.0263 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6324 on 266 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared: 0.7215, Adjusted R-squared: 0.7173
F-statistic: 172.3 on 4 and 266 DF, p-value: < 2.2e-16
Figure 9.2: Selection of the best model via the Validation set method. On the abscissa is represented the number of predictors in the model, and on the ordinate the corresponding MSE. The optimal value is coloured in red.
Figure 9.3: Selection of the best model via the k-fold cross validation method. On the abscissa is represented the number of predictors in the model, and on the ordinate the corresponding mean cross validation error. The optimal value is coloured in red.
By not including the fifth predictor, Stage, this model excludes only 15 samples due to NA's, instead of the 176 excluded in the 5-predictor model [regsub.best]. As a consequence, both the degrees of freedom and the adjusted R-squared are much higher than for [regsub.best].
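The 5-fold cross validation used above is, mechanically, just a partition of the observations into five near-equal folds: each fold serves once as the validation set while the model is fitted on the other four, and the CV error is the mean of the five per-fold MSEs. A pure-Python sketch of the fold construction (the thesis uses R; indices are usually shuffled first, which is omitted here for brevity):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, nearly equal folds; in each CV
    round one fold is held out for validation, the rest form the training set."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)   # early folds absorb the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 271 usable observations, as in [regsub.cv.best]
# (266 residual df + 4 predictors + intercept)
folds = kfold_indices(271, 5)
print([len(f) for f in folds])  # [55, 54, 54, 54, 54]
```

Every observation falls in exactly one fold, so each one is predicted exactly once by a model that never saw it during fitting.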
9.2 THIRD MANUAL VARIABLE SELECTION (FROM 8 TO 5 PREDICTORS)
In Automated models, we experimented with different methods, compared the resulting models, and selected [regsub.best] and [regsub.cv.best] as the best ones. We have no doubts about the importance of the four predictors selected in [regsub.cv.best], and we look for further confirmation of the fifth predictor (Stage), included in [regsub.best]. We also want to make sure that no relevant variable has been excluded.
To confirm the results of the automated models, we manually add one variable at a time to a [basic.model] that includes only the four most relevant predictors, which we definitely want in our models. We then compare the regression results to the correlation analysis conducted in paragraph 4.4, to see how the significance of the coefficients and the performance of the model change as new predictors are introduced.
9.2.1 Basic.model
In [basic.model] we include only the most relevant predictors, which we cannot omit as explanatory variables in our models.
9.2.1.1 Continuous variables
Based on our Correlation Analysis, conducted in paragraph 4.2, lAmount_raised and lPrev_raised show the tightest relationships with the dependent variable lPre_valuation (Kendall’s Tau coefficients of 0.59 and 0.43 respectively, both highly significant). As the scatterplots show (figures 4.23 - 4.25), these relationships follow a linear trend. For these reasons, our [basic.model] formula will be:
lPre_valuation ~ lAmount_raised + lPrev_raised
When applied to the full pre-processed data set, we obtain:
Model: basic.model

Residuals:
     Min       1Q   Median       3Q      Max
-1.49462 -0.42526  0.00596  0.36189  1.93327

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     5.510e+00  4.508e-01  12.222  < 2e-16 ***
lAmount_raised  7.059e-01  3.184e-02  22.173  < 2e-16 ***
lPrev_raised    2.743e-08  6.177e-09   4.441 1.28e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6491 on 283 degrees of freedom
Multiple R-squared: 0.7302, Adjusted R-squared: 0.7283
F-statistic: 382.9 on 2 and 283 DF, p-value: < 2.2e-16
All estimated coefficients are strongly significant, the intercept is positive, and the adjusted R squared is above 72%. With only two predictors and df = 283, overfitting is not a concern.
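The adjusted R squared reported by R can be recomputed directly from the multiple R squared, the sample size and the number of predictors. As a sanity check, the sketch below reproduces the value reported above for [basic.model] (n = 283 residual df + 2 predictors + intercept = 286):

```python
# Adjusted R-squared penalizes R-squared for the number of predictors:
# adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2, n, p):
    """r2: multiple R-squared, n: sample size, p: number of predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Values reported for [basic.model]: R2 = 0.7302, n = 286, p = 2.
adj = adjusted_r2(0.7302, n=286, p=2)
print(round(adj, 4))  # 0.7283, matching the reported adjusted R-squared
```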
The MLR BLUE assumptions are respected overall, except for the highest observed lPre_valuation values, where the residuals tend to increase. Six outliers are identified, and two of them are particularly extreme (observations 259 and 281), so we remove them before proceeding with model optimization (see figure 9.4, showing Cook's distances with an indicative red line set at 4*mean(cooksd)).
Figure 9.4: For basic.model, on the abscissa is plotted the number of the sample (random order) and on the ordinate its Cook's distance, calculated via the software R.
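The flagging rule of Figure 9.4 (Cook's distance above 4 times its mean) can be sketched as follows for a one-predictor regression; the data here are illustrative, not the thesis data set:

```python
def cooks_distances(x, y):
    """Cook's distances for the simple regression y ~ x (p = 2 parameters)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sxx
    intercept = ybar - slope * xbar
    resid = [y[i] - (intercept + slope * x[i]) for i in range(n)]
    s2 = sum(r * r for r in resid) / (n - 2)            # residual variance
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverages h_ii
    return [resid[i] ** 2 / (2 * s2) * lev[i] / (1 - lev[i]) ** 2
            for i in range(n)]

x = [float(i) for i in range(10)]
y = [float(i) for i in range(9)] + [25.0]   # last point is a gross outlier
d = cooks_distances(x, y)
flagged = [i for i, di in enumerate(d) if di > 4 * sum(d) / len(d)]
print(flagged)  # [9]: only the planted outlier exceeds the 4*mean threshold
```

Cook's distance combines the size of a residual with the leverage of its observation, which is why the extreme point at the edge of the x-range dominates.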
9.2.1.1.1 Interaction term
As lPrev_raised and lAmount_raised are moderately correlated with each other, we wonder whether adding an interaction term between these two variables would improve the model.
Model: interaction.basic.model

Residuals:
     Min       1Q   Median       3Q      Max
-1.68162 -0.41095  0.01992  0.34764  2.02972

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                  5.472e+00  4.499e-01  12.161   <2e-16 ***
lAmount_raised               7.067e-01  3.174e-02  22.267   <2e-16 ***
lPrev_raised                 1.677e-07  8.298e-08   2.021   0.0442 *
lAmount_raised:lPrev_raised -8.088e-09  4.771e-09  -1.695   0.0912 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.647 on 282 degrees of freedom
Multiple R-squared: 0.7329, Adjusted R-squared: 0.7301
F-statistic: 257.9 on 3 and 282 DF, p-value: < 2.2e-16
The estimated coefficient of the interaction term is not significant at the 5% level and is extremely small, and the other estimates change little. The ANOVA test and the model performance metrics confirm that no significant improvement is gained. This means that the effect of lAmount_raised on lPre_valuation is not influenced by lPrev_raised (and vice versa). So, we retain our [basic.model] formula as the starting point for building the best model.
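Mechanically, the interaction term tested above is nothing more than an extra design-matrix column equal to the elementwise product of the two predictors (R's formula notation x1 * x2 expands to x1 + x2 + x1:x2). A minimal Python sketch, for illustration only:

```python
# Design matrix for "y ~ x1 * x2": intercept, x1, x2, and the x1:x2 column.
def design_with_interaction(x1, x2):
    return [[1.0, a, b, a * b] for a, b in zip(x1, x2)]

X = design_with_interaction([2.0, 3.0], [10.0, 100.0])
print(X)  # [[1.0, 2.0, 10.0, 20.0], [1.0, 3.0, 100.0, 300.0]]
```

A significant coefficient on the product column would mean that the slope of one predictor depends on the level of the other; here it did not, which is why the term was dropped.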
9.2.1.1.2 Closing_Year
The third predictor we want to add to [basic.model] is Closing_Year, as strongly suggested by all
automated models.
Model: basic.model

Residuals:
     Min       1Q   Median       3Q      Max
-1.48129 -0.45169 -0.03373  0.38469  1.92212

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)     -8.759e+01  3.154e+01  -2.777  0.00586 **
lAmount_raised   6.239e-01  3.531e-02  17.670  < 2e-16 ***
lPrev_raised     5.037e-08  9.126e-09   5.520 7.88e-08 ***
Closing_Year     4.674e-02  1.570e-02   2.977  0.00317 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6189 on 274 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared: 0.705, Adjusted R-squared: 0.7017
F-statistic: 218.2 on 3 and 274 DF, p-value: < 2.2e-16
Its contribution is positive and highly significant, and the intercept is now negative. The adjusted R2 has slightly decreased, because three observations were deleted due to missingness. We cannot compare its performance to [basic.model] through ANOVA, because the sample sizes differ (three NA's in Closing_Year). We have no hesitation in accepting this predictor into our [basic.model].
9.2.1.2 Categorical variables
Looking at the correlations between lPre_valuation and the categorical variables (paragraph 4.4), Stage is a potentially meaningful predictor, and the interaction between Type_Lead_Investor and Stage also looks very promising. Finally, we do not expect Revenue to be a significant predictor, although this is partly due to its large proportion of missing values.12
In the following paragraphs we will manually test these intuitions concerning categorical
variables, by adding them to [basic.model], one-by-one.
9.2.1.2.1 Type_Lead_Investor (TLI)
Automated models suggest including the group Institutional Strategic (belonging to the variable Type_Lead_Investor) in the model. Let us fit and compare two models: one with the full variable, and one with pooled levels. The automated procedure suggests that there is no significant difference between the groups “Institutional Financial” and “Acc/Inc/PA” (Accelerators/Incubators/Private Angels), both belonging to the categorical variable Type_Lead_Investor (TLI); by “pooled levels” we therefore mean that these two groups are merged into a single group called “Non-strategic investments”, to distinguish it from the remaining TLI group, “Institutional Strategic”.
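Operationally, pooling factor levels is just a relabelling of the categorical variable before refitting the model; a minimal sketch of the TLI pooling described above:

```python
# The two TLI groups found statistically indistinguishable are merged into
# a single "Non-strategic investments" level before the model is refitted.
POOL = {
    "Acc/Inc/PA": "Non-strategic investments",
    "Institutional Financial": "Non-strategic investments",
    "Institutional Strategic": "Institutional Strategic",
}

def pool_tli(levels):
    return [POOL[lvl] for lvl in levels]

print(pool_tli(["Acc/Inc/PA", "Institutional Strategic", "Institutional Financial"]))
# ['Non-strategic investments', 'Institutional Strategic', 'Non-strategic investments']
```

With one level fewer, the refitted model spends one fewer degree of freedom on dummy coefficients, which is where the small gains in adjusted R2 come from.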
In the next paragraph we will also test the interaction term TLI:Stage (as suggested by our
previous Correlation analysis), and compare performances.
Model: basic.model + TLI

Residuals:
    Min      1Q  Median      3Q     Max
-1.4723 -0.4208 -0.0169  0.3713  1.9430

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -9.225e+01  3.285e+01  -2.808  0.00536 **
lAmount_raised                             5.865e-01  3.620e-02  16.201  < 2e-16 ***
lPrev_raised                               6.329e-08  1.100e-08   5.754 2.44e-08 ***
Closing_Year                               4.925e-02  1.635e-02   3.011  0.00286 **
Type_Lead_InvestorInstitutional Financial  9.420e-02  9.522e-02   0.989  0.32344
Type_Lead_InvestorInstitutional Strategic  3.361e-01  1.636e-01   2.055  0.04091 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5967 on 260 degrees of freedom
Multiple R-squared: 0.7075, Adjusted R-squared: 0.7018
F-statistic: 125.8 on 5 and 260 DF, p-value: < 2.2e-16
The next model is identical to the previous one, except that for the predictor Type_Lead_Investor we pooled the groups Acc/Inc/PA and Institutional Financial into the new group Non-strategic investors.
12 Suggestion for further research: by expanding the availability of Revenue data, we expect this predictor to become a significant one (see chapter 7).
Model: basic.model + TLI (pooled)

Residuals:
     Min       1Q   Median       3Q      Max
-1.53913 -0.43311 -0.01314  0.32959  1.89674

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -7.760e+01  3.205e+01  -2.421  0.01615 *
lAmount_raised                             5.910e-01  3.591e-02  16.458  < 2e-16 ***
lPrev_raised                               9.771e-08  1.489e-08   6.560 2.89e-10 ***
Closing_Year                               4.197e-02  1.595e-02   2.631  0.00903 **
Type_Lead_InvestorInstitutional Strategic  3.099e-01  1.406e-01   2.204  0.02839 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5947 on 260 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared: 0.7201, Adjusted R-squared: 0.7158
F-statistic: 167.2 on 4 and 260 DF, p-value: < 2.2e-16
Figure 9.5: For the model “basic.model + TLI (pooled)”, on the abscissa is plotted the number of the sample (random order) and on the ordinate its Cook's distance, calculated via the software R.
We remove the outliers 268 and 278 (the outlier analysis shows that their adjusted p-values, via Bonferroni correction, fall below the 0.05 threshold; see Figure 9.5).
The ANOVA test, internal measures, cross-validation and bootstrap tests do not find a significant difference between these two models (non-pooled and pooled). The adjusted R2 and Predicted R2 are slightly higher for the second model, because of its reduced number of predictors. In any case, we obtained confirmation that TLI is indeed a relevant predictor, as suggested by the Automated models, so we include it in our [basic.model], which now has the following formula:
lPre_valuation ~ lAmount_raised + lPrev_raised + Closing_Year + TLI
9.2.1.2.2 Stage
Our correlation analysis reports the importance both of this variable and of its interaction term with Type_Lead_Investor. The [regsub.best] model suggests pooling Stage, by merging the Early-stage and First Clients groups, while the [regsub.cv.best] model does not consider this predictor. We start with the individual variable and compare performances when Type_Lead_Investor and Stage are pooled or not.
Model: basic.model + Stage (non-pooled)

Residuals:
     Min       1Q   Median       3Q      Max
-1.26874 -0.27839  0.00519  0.31994  1.30212

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                 -1.557e+02  5.564e+01  -2.798  0.00615 **
lAmount_raised               3.475e-01  7.378e-02   4.710 7.83e-06 ***
lPrev_raised                 1.106e-07  2.170e-08   5.097 1.59e-06 ***
Closing_Year                 8.235e-02  2.778e-02   2.965  0.00377 **
TLI_Institutional Financial -4.243e-02  1.232e-01  -0.344  0.73129
TLI_Institutional Strategic  3.032e-01  1.510e-01   2.007  0.04737 *
StageFirst Clients           2.794e-02  1.209e-01   0.231  0.81767
StageLater-stage             2.281e-01  1.375e-01   1.659  0.10017
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4869 on 102 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7018, Adjusted R-squared: 0.6813
F-statistic: 34.3 on 7 and 102 DF, p-value: < 2.2e-16
By introducing the predictor Stage, we lose 158 degrees of freedom, because of its large proportion of NA's. TLI_Inst.Financial and Stage are not significant at the 5% level.
Model: basic.model + Stage (pooled)

Residuals:
     Min       1Q   Median       3Q      Max
-1.29992 -0.27916  0.00431  0.33149  1.27445

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -1.513e+02  4.743e+01  -3.190  0.00188 **
lAmount_raised                             3.432e-01  7.235e-02   4.744 6.70e-06 ***
lPrev_raised                               1.109e-07  2.137e-08   5.191 1.04e-06 ***
Closing_Year                               8.021e-02  2.373e-02   3.379  0.00102 **
Type_Lead_InvestorInstitutional Strategic  3.299e-01  1.273e-01   2.592  0.01093 *
StageLater-stage                           2.156e-01  1.125e-01   1.917  0.05793 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4827 on 104 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7012, Adjusted R-squared: 0.6868
F-statistic: 48.81 on 5 and 104 DF, p-value: < 2.2e-16
By pooling the variables’ levels, we gain 2 degrees of freedom, and all coefficients are now significant at the 5% level. The ANOVA test does not find any significant difference between the two models, so in figures 9.6 and 9.7 we compare their performances in more detail (TLI = Type_Lead_Investor).
All measures improved drastically when these two variables were added, compared to [basic.model without TLI]. Moreover, all metrics show that the pooled model performs slightly better, even if the difference is very small.
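Among the metrics used in these comparisons, PRESS is the least standard: it is the leave-one-out residual sum of squares, and for a linear model it can be computed from a single fit via the leverages, without refitting n times. A one-predictor sketch on illustrative data (not the thesis data set):

```python
def press_and_rss(x, y):
    """PRESS (leave-one-out RSS via the leverage shortcut) and in-sample RSS
    for the simple regression y ~ x; no refits are needed."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sxx
    intercept = ybar - slope * xbar
    press = rss = 0.0
    for i in range(n):
        e = y[i] - (intercept + slope * x[i])
        h = 1 / n + (x[i] - xbar) ** 2 / sxx   # leverage of observation i
        press += (e / (1 - h)) ** 2            # squared deleted residual
        rss += e * e
    return press, rss

press, rss = press_and_rss([1.0, 2, 3, 4, 5, 6], [1.1, 1.9, 3.2, 3.8, 5.1, 6.3])
print(press >= rss)  # True: PRESS can never be smaller than the in-sample RSS
```

Because each deleted residual inflates the ordinary residual by 1/(1-h), PRESS is always at least as large as the RSS, which is what makes it an out-of-sample-flavoured measure.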
Figure 9.6: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model. The models compared are [basic.model without TLI], [basic.model + TLI + Stage], and [basic.model + TLI (pooled) + Stage (pooled)].
9.2.1.2.3 Interaction effect Stage:TLI
The results obtained in the previous models stress the importance of the predictors Stage and Type_Lead_Investor (TLI). In the previous correlation analysis (paragraph 4.4) we also found clues about the impact of their interaction on lPre_valuation. Let us now test it.
Model: basic.model + TLI * Stage

Residuals:
    Min      1Q  Median      3Q     Max
-1.1698 -0.2697  0.0467  0.2376  1.4641

Coefficients:
                                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                             8.634e+00  9.569e-01   9.023 1.48e-14 ***
lAmount_raised                          4.449e-01  6.982e-02   6.372 5.92e-09 ***
lPrev_raised                            9.225e-08  2.158e-08   4.275 4.41e-05 ***
TLI_Inst.Financial                      1.983e-01  1.703e-01   1.164 0.247064
TLI_Inst.Strategic                      1.329e+00  2.410e-01   5.515 2.79e-07 ***
StageFirst Clients                      3.333e-01  1.592e-01   2.093 0.038884 *
StageLater-stage                        7.385e-01  1.971e-01   3.747 0.000301 ***
TLI_Inst.Financial:StageFirst Clients  -1.025e-02  2.367e-01  -0.043 0.965541
TLI_Inst.Strategic:StageFirst Clients  -1.265e+00  3.266e-01  -3.872 0.000194 ***
TLI_Inst.Financial:StageLater-stage    -3.297e-01  2.482e-01  -1.329 0.187035
TLI_Inst.Strategic:StageLater-stage    -1.360e+00  3.318e-01  -4.098 8.54e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4654 on 99 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7356, Adjusted R-squared: 0.7089
F-statistic: 27.54 on 10 and 99 DF, p-value: < 2.2e-16

Model: basic.model + TLI * Stage (pooled)

Residuals:
     Min       1Q   Median       3Q      Max
-1.27517 -0.32262  0.02451  0.32777  1.23082

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)
(Intercept)                          9.001e+00  9.678e-01   9.300 2.46e-15 ***
lAmount_raised                       4.349e-01  6.911e-02   6.293 7.52e-09 ***
lPrev_raised                         1.150e-07  2.212e-08   5.199 1.01e-06 ***
TLI_Inst.Strategic                   5.445e-01  1.670e-01   3.261   0.0015 **
StageLater-stage                     3.770e-01  1.270e-01   2.970   0.0037 **
TLI_Inst.Strategic:StageLater-stage -4.897e-01  2.696e-01  -1.816   0.0722 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5006 on 104 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.6786, Adjusted R-squared: 0.6631
F-statistic: 43.92 on 5 and 104 DF, p-value: < 2.2e-16
Figure 9.7: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model; the models compared are [TLI + Stage] and [TLI + Stage (pooled)]. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to the Standard error, which in the literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds and repeat the cross validation five times; the final model error is taken as the mean error over the repeats.
By comparing the non-pooled models with and without the interaction term, ANOVA finds a
significant difference:
Model 1: lPre_valuation ~ lAmount_raised + lPrev_raised + Closing_Year + TLI + Stage
Model 2: lPre_valuation ~ lAmount_raised + lPrev_raised + TLI*Stage
  Res.Df    RSS Df Sum of Sq  Pr(>Chi)
1    102 24.181
2     99 21.444  3    2.7374  0.005489 **
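The significance reported in this comparison can be cross-checked from the printed RSS and residual df alone. The sketch below computes the classical F form of the nested-model test (the R table reports a chi-square style p-value, but the inputs are the same):

```python
# F-statistic for comparing nested OLS models from their RSS and residual df:
# F = ((RSS_reduced - RSS_full) / (df_reduced - df_full)) / (RSS_full / df_full)
def anova_f(rss_reduced, df_reduced, rss_full, df_full):
    num = (rss_reduced - rss_full) / (df_reduced - df_full)
    den = rss_full / df_full
    return num / den

# Values from the table above: Model 1 has RSS = 24.181 on 102 df,
# Model 2 (with the TLI*Stage interaction) has RSS = 21.444 on 99 df.
f = anova_f(24.181, 102, 21.444, 99)
print(round(f, 2))  # 4.21
```

An F value this large on (3, 99) degrees of freedom is consistent with the p-value of about 0.005 printed by R, i.e. the interaction terms jointly matter.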
If we then compare the pooled and non-pooled models, both with the interaction terms, the
difference is again significant:
Res.Df RSS Df Sum of Sq Pr(>Chi) 1 99 21.444 2 104 26.064 -5 -4.62 0.0007017 ***
In figures 9.8 – 9.9 we compare performances in more detail.
From these metrics, the best model always turns out to be [basic.model + TLI*Stage] with non-pooled groups. The only exception is BIC, for which [basic.model + TLI + Stage] has the minimum value. Nonetheless, as all the other measures favour [basic.model + TLI*Stage, original], we can state that this is the best model so far. Note that the Prediction error rate for this model is only 2.1% when measured through the validation set, and around 3.1% when measured by the other methods.
Now that we have tested the effect of all predictors selected by the Automated models, we want to investigate the effect of including two further variables in our best model so far, [basic.model + TLI*Stage, non-pooled]: Industry and Revenue.
9.2.1.2.4 Industry
During the correlation analysis (paragraph 4.4.2.2), we showed that, even after pooling some minor groups of the categorical variable Industry together, the group means are not significantly different from one another. We therefore presumed that adding this explanatory variable to [basic.model + TLI*Stage] would not be useful. Indeed, the automated selection methods excluded Industry from their best regression models.
Figure 9.8: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model. The models compared are [basic.model + TLI + Stage], [TLI*Stage], and [TLI*Stage (pooled)].
Model: basic.model + TLI * Stage (non-pooled) + Industry

Residuals:
     Min       1Q   Median       3Q      Max
-1.16817 -0.27212  0.05026  0.23742  1.31437

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            8.896e+00  9.889e-01   8.996 2.31e-14 ***
lAmount_raised                         4.176e-01  7.154e-02   5.838 7.33e-08 ***
lPrev_raised                           9.445e-08  2.135e-08   4.424 2.58e-05 ***
TLI_Inst.Financial                     9.192e-02  1.742e-01   0.528 0.598942
TLI_Inst.Strategic                     1.448e+00  2.516e-01   5.755 1.05e-07 ***
StageFirst Clients                     4.259e-01  1.680e-01   2.535 0.012890 *
StageLater-stage                       8.008e-01  2.101e-01   3.812 0.000245 ***
IndustryFintech                        2.346e-01  2.600e-01   0.902 0.369302
IndustryBiotech                        7.638e-01  3.670e-01   2.081 0.040087 *
IndustryMedtech/Healthcare             2.503e-01  1.502e-01   1.666 0.098914 .
IndustryICT                            1.205e-02  1.210e-01   0.100 0.920888
TLI_Inst.Financial:StageFirst Clients  1.098e-01  2.416e-01   0.455 0.650420
TLI_Inst.Strategic:StageFirst Clients -1.382e+00  3.309e-01  -4.176 6.59e-05 ***
TLI_Inst.Financial:StageLater-stage   -1.787e-01  2.577e-01  -0.693 0.489766
TLI_Inst.Strategic:StageLater-stage   -1.408e+00  3.658e-01  -3.849 0.000215 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.459 on 95 degrees of freedom
  (170 observations deleted due to missingness)
Multiple R-squared: 0.7532, Adjusted R-squared: 0.7169
F-statistic: 20.71 on 14 and 95 DF, p-value: < 2.2e-16
Figure 9.9: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model; the models compared are variations to [basic.model without TLI]: [TLI + Stage], [TLI*Stage original], and [TLI*Stage pooled]. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to the Standard error, which in the literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds and repeat the cross validation five times; the final model error is taken as the mean error over the repeats.
Figure 9.10: Comparison of models’ performance. Following the legend, each bar colour corresponds to a specific model; the models compared are [basic.model + TLI*Stage] and [TLI*Stage + Industry]. The metrics used are shown on the ordinate. They have been calculated through the software R, by applying the following statistical methods: k-fold CV, LOOCV, validation set, and bootstrap. “S” refers to the Standard error, which in the literature is also called sigma, or residual standard error. “k fold 5.5” means that we split the data set into five folds and repeat the cross validation five times; the final model error is taken as the mean error over the repeats.
Figure 9.11: Comparison of models’ performance, in terms of PRESS, BIC and AIC. Following the legend, each bar colour corresponds to a specific model. The models compared are [basic.model + TLI*Stage] and [TLI*Stage + Industry].
Instead, by setting Others as the reference group, all coefficients are influential and almost significant. The least significant is ICT, which, on the contrary, revealed the clearest growth trend in the Correlation analysis. The adjusted R2 improves, but the ANOVA test indicates no significant difference between the models:
Model 1: lPre_valuation ~ lAmount_raised + lPrev_raised + TLI*Stage
Model 2: lPre_valuation ~ lAmount_raised + lPrev_raised + TLI*Stage + Industry
  Res.Df    RSS Df Sum of Sq Pr(>Chi)
1     99 21.444
2     95 20.011  4    1.4329   0.1467
Comparing the other metrics gives discordant results: adding Industry generally improves the internal measures but worsens the performance tests via CV and bootstrap (see the results in Figures 9.10 – 9.11). This suggests that, by adding the predictor Industry to our best model, we are moving toward overfitting. So, we decide to exclude it definitively.
9.2.1.2.5 Revenue
During our correlation analysis (Chapter 4), we showed that, even after pooling some minor groups of Revenue together, the group means are not significantly different from one another. We therefore presumed that adding this explanatory variable to [basic.model] would not be useful. Indeed, the automated selection methods excluded Revenue from their best regression models.
Model: TLI*Stage (non-pooled) + Revenue

Residuals:
     Min       1Q   Median       3Q      Max
-1.15306 -0.24997  0.04723  0.26149  1.47081

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                            8.622e+00  9.970e-01   8.649 1.27e-13 ***
lAmount_raised                         4.453e-01  7.238e-02   6.152 1.81e-08 ***
lPrev_raised                           8.569e-08  2.280e-08   3.759 0.000295 ***
TLI_Institutional Financial            2.127e-01  1.738e-01   1.224 0.224009
TLI_Institutional Strategic            1.359e+00  2.470e-01   5.504 3.14e-07 ***
StageFirst Clients                     3.667e-01  1.958e-01   1.873 0.064099 .
StageLater-stage                       7.109e-01  2.495e-01   2.850 0.005365 **
Revenue50k - 1M                       -2.617e-02  1.397e-01  -0.187 0.851768
Revenue> 1M                            1.266e-01  2.266e-01   0.558 0.577859
TLI_Inst.Financial:StageFirst Clients -1.864e-02  2.457e-01  -0.076 0.939691
TLI_Inst.Strategic:StageFirst Clients -1.287e+00  3.338e-01  -3.855 0.000211 ***
TLI_Inst.Financial:StageLater-stage   -3.119e-01  2.554e-01  -1.221 0.224988
TLI_Inst.Strategic:StageLater-stage   -1.383e+00  3.506e-01  -3.944 0.000153 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4722 on 95 degrees of freedom
  (172 observations deleted due to missingness)
Multiple R-squared: 0.7235, Adjusted R-squared: 0.6886
F-statistic: 20.71 on 12 and 95 DF, p-value: < 2.2e-16
The Revenue coefficients are influential but not significant, and both the internal measures and the performance tests via CV and bootstrap give worse results than [basic.model + TLI*Stage, non-pooled]. We therefore definitively exclude the variable Revenue from the predictors of our best model.
9.2.2 Results
Based on the tests conducted above, the result of the third manual variable selection is to
exclude Industry, Revenue, and Closing_Year_factor from the predictors of our best model,
while including:
• lAmount_raised
• lPrev_raised
• Closing_Year
• Type_Lead_Investor
• Stage
and, potentially, the interaction term Type_Lead_Investor:Stage. This conclusion is in accordance with the Automated Models selection, but it does not additionally pool groups together, and it suggests an interaction term between Stage and TLI.
9.3 INTERACTION TERMS
After our third variable selection, we finally ask whether we have omitted some important interaction terms from our best model. To investigate this, we run the automated models on a new data frame including only the five predictors selected in the third manual variable selection, plus all their possible interactions. The results of Stepwise regression and Best Subset Selection follow.
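The set of candidate interaction terms described above can be generated mechanically. A minimal sketch (Python; the thesis builds the analogous data frame in R, and only pairwise interactions are enumerated here):

```python
from itertools import combinations

# The five predictors retained by the third manual variable selection.
predictors = ["lAmount_raised", "lPrev_raised", "Closing_Year",
              "Type_Lead_Investor", "Stage"]

# All pairwise interaction terms, written in R's a:b formula notation.
interactions = [f"{a}:{b}" for a, b in combinations(predictors, 2)]

print(len(interactions))  # 5 choose 2 = 10 pairwise interaction terms
for term in interactions:
    print(term)
```

Among these ten candidates are Type_Lead_Investor:Stage and Closing_Year:lPrev_raised, the terms that the selection procedures below end up highlighting.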
9.3.1 Stepwise regression
By applying Stepwise regression, with direction = "both", we obtain the following optimal model, which we call [A]:
Call:
lm(lPre_valuation ~ lPrev_raised + lAmount_raised + Closing_Year + TLI +
   Closing_Year:TLI + Closing_Year:lPrev_raised)

Residuals:
     Min       1Q   Median       3Q      Max
-1.56505 -0.27575  0.04329  0.28700  1.04449

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                              -2.917e+02  5.857e+01  -4.980 2.63e-06 ***
Closing_Year                              1.495e-01  2.923e-02   5.117 1.49e-06 ***
TLI_Institutional Financial               6.188e+01  9.606e+01   0.644 0.520912
TLI_Institutional Strategic               5.421e+02  1.520e+02   3.567 0.000555 ***
lPrev_raised                              7.204e-05  3.024e-05   2.383 0.019062 *
lAmount_raised                            3.980e-01  6.418e-02   6.201 1.24e-08 ***
Closing_Year:TLI_Institutional Financial -3.077e-02  4.765e-02  -0.646 0.519913
Closing_Year:TLI_Institutional Strategic -2.688e-01  7.538e-02  -3.565 0.000557 ***
Closing_Year:lPrev_raised                -3.565e-08  1.499e-08  -2.379 0.019240 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4351 on 101 degrees of freedom
Multiple R-squared: 0.7642, Adjusted R-squared: 0.7455
F-statistic: 40.91 on 8 and 101 DF, p-value: < 2.2e-16
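The logic behind stepwise selection with direction = "both" — repeatedly trying single additions and deletions of predictors and accepting moves that lower the AIC — can be sketched as follows. This is a minimal Python illustration on hypothetical synthetic data; the thesis used R's stepwise machinery (presumably step()), and the greedy "first improving move" rule here is a simplification:

```python
import math
import random

def fit_rss(cols, y):
    """OLS with an intercept; returns the residual sum of squares.

    cols: list of predictor columns (each a list of floats).
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [A[r][k] - f * A[c][k] for k in range(p + 1)]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
               for i in range(n))

def aic(rss, n, k):
    # Gaussian AIC up to an additive constant: k slopes plus the intercept.
    return n * math.log(rss / n) + 2 * (k + 1)

def stepwise_both(data, y):
    """Bidirectional stepwise selection by AIC (direction = "both")."""
    selected, n = [], len(y)
    best = aic(fit_rss([], y), n, 0)
    improved = True
    while improved:
        improved = False
        moves = ([("add", v) for v in data if v not in selected]
                 + [("drop", v) for v in selected])
        for op, v in moves:
            trial = (selected + [v] if op == "add"
                     else [s for s in selected if s != v])
            score = aic(fit_rss([data[s] for s in trial], y), n, len(trial))
            if score < best - 1e-9:
                best, selected, improved = score, trial, True
                break  # restart the move search from the updated model
    return selected

# Hypothetical synthetic data: y depends on x1 and x2 but not on x3.
random.seed(42)
n = 60
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * a - 1.5 * b + random.gauss(0, 0.3) for a, b in zip(x1, x2)]
chosen = stepwise_both({"x1": x1, "x2": x2, "x3": x3}, y)
print(sorted(chosen))  # the true predictors x1 and x2 should be retained
```

The procedure stops when no single addition or deletion improves the AIC, which is why it can return a model mixing main effects and interaction terms, as in [A].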
9.3.2 Best Subset Selection
The results of Best Subset Selection (Figure 9.12) favour a model with between five and seven predictors: lPrev_raised, lAmount_raised, Closing_Year:TLI, Closing_Year:lAmount_raised, Closing_Year:lPrev_raised, Inst.Strategic, and Stage:lAmount_raised. We call this model [B].
Figure 9.12: The Figure shows three graphs analysing the results of the Best Subset Selection method, applied via the software R. The abscissa represents the number of variables in the model, while the ordinate shows the resulting value of the following metrics: adjusted R-squared, Cp, and BIC. The optimal value for each metric is marked in red colour.
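Best Subset Selection is conceptually simple: fit every subset of the candidate predictors and score each with a criterion such as BIC (the thesis also inspects adjusted R-squared and Cp; only BIC is scored in this minimal Python sketch on hypothetical synthetic data):

```python
import math
import random
from itertools import combinations

def fit_rss(cols, y):
    """OLS with an intercept; returns the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, solved by Gauss-Jordan elimination.
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [A[r][k] - f * A[c][k] for k in range(p + 1)]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
               for i in range(n))

def bic(rss, n, k):
    # Gaussian BIC up to an additive constant: n*log(RSS/n) + log(n)*(k + 1).
    return n * math.log(rss / n) + math.log(n) * (k + 1)

def best_subset(data, y):
    """Fit every subset of predictors; return the one with the lowest BIC."""
    n, names = len(y), sorted(data)
    best_score, best_sub = float("inf"), ()
    for k in range(len(names) + 1):
        for sub in combinations(names, k):
            score = bic(fit_rss([data[v] for v in sub], y), n, k)
            if score < best_score:
                best_score, best_sub = score, sub
    return best_sub

# Hypothetical synthetic data: only x1 and x2 carry signal.
random.seed(7)
n = 80
cols = {name: [random.gauss(0, 1) for _ in range(n)]
        for name in ["x1", "x2", "x3", "x4"]}
y = [1.2 * a + 0.8 * b + random.gauss(0, 0.4)
     for a, b in zip(cols["x1"], cols["x2"])]
winner = best_subset(cols, y)
print(winner)  # x1 and x2 should appear in the winning subset
```

Because the search is exhaustive (2^p fits), it is only feasible for a modest number of candidates; with the five predictors and their interactions this is still tractable, which is why the method can be run alongside stepwise regression here.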
A comparison of these two best models, [A] and [B], is presented in paragraph 6.5.
9.4 FURTHER METHODS USED TO SELECT THE BEST MODEL
We did not limit our best-model selection to the methods in this chapter. We also experimented with some non-linear regression and other approaches. As they did not directly add information in explaining our dependent variable lPre_valuation, for reasons of brevity we do not show their results in this report as we did for the other methods. Nevertheless, we want to mention that no significant improvement could be achieved by applying and experimenting with the following methods (Rao, 2014; James et al., 2013) via the software R:
− GLM (Generalized Linear Models)
− Curve Fitting using Polynomial Terms
− Fractional exponents (these provided a small but significant improvement; however, residual plots showed the model moving further from meeting the MLR assumptions)
− Spline
− Log function of predictors
− Loess regression
− Kernel regression
We compared the performance of the models created by applying these methods with the performance of [A], and we concluded that [A] remains the best model for the goal of our research. The application of GLM regression, in particular, selected precisely [A] as the best model, which is a positive robustness check.
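The cross-validation performance tests used throughout this comparison share one pattern: partition the data into k folds, fit each candidate on k − 1 folds, and measure prediction error on the held-out fold. A minimal Python sketch with hypothetical data, using simple one-predictor regression in place of the thesis's multiple-regression candidates:

```python
import math
import random

def cv_rmse(x, y, k=5):
    """k-fold cross-validated RMSE for simple linear regression y ~ x."""
    idx = list(range(len(y)))
    random.Random(0).shuffle(idx)   # fixed seed: reproducible folds
    sq_errs = []
    for f in range(k):
        test = set(idx[f::k])
        train = [i for i in idx if i not in test]
        xs, ys = [x[i] for i in train], [y[i] for i in train]
        m = len(xs)
        xb, yb = sum(xs) / m, sum(ys) / m
        sxx = sum((v - xb) ** 2 for v in xs)
        b1 = sum((a - xb) * (b - yb) for a, b in zip(xs, ys)) / sxx
        b0 = yb - b1 * xb
        sq_errs += [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test]
    return math.sqrt(sum(sq_errs) / len(sq_errs))

# Hypothetical data: one informative predictor, one pure-noise predictor.
random.seed(1)
n = 50
x_signal = [random.gauss(0, 1) for _ in range(n)]
x_noise = [random.gauss(0, 1) for _ in range(n)]
y = [1.5 * v + random.gauss(0, 0.5) for v in x_signal]
print(cv_rmse(x_signal, y) < cv_rmse(x_noise, y))  # the informative model wins
```

Ranking candidate specifications by cross-validated error, rather than by in-sample fit alone, is what guards the choice of [A] against rewarding overfitting.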