Economic Forecasting

Economic forecasting involves choosing simple yet robust models to best approximate highly complex and evolving data-generating processes. This poses unique challenges for researchers in a host of practical forecasting situations, from forecasting budget deficits and assessing financial risk to predicting inflation and stock market returns. Economic Forecasting presents a comprehensive, unified approach to assessing the costs and benefits of different methods currently available to forecasters.
This text approaches forecasting problems from the perspective of decision theory and estimation, and demonstrates the profound implications of this approach for how we understand variable selection, estimation, and combination methods for forecasting models, and how we evaluate the resulting forecasts. Both Bayesian and non-Bayesian methods are covered in depth, as are a range of cutting-edge techniques for producing point, interval, and density forecasts. The book features detailed presentations and empirical examples of a range of forecasting methods and shows how to generate forecasts in the presence of large-dimensional sets of predictor variables. The authors pay special attention to how estimation error, model uncertainty, and model instability affect forecasting performance.

• Presents a comprehensive and integrated approach to assessing the strengths and weaknesses of different forecasting methods
• Approaches forecasting from a decision theoretic and estimation perspective
• Covers Bayesian modeling, including methods for generating density forecasts
• Discusses model selection methods as well as forecast combinations
• Covers a large range of nonlinear prediction models, including regime switching models, threshold autoregressions, and models with time-varying volatility
• Features numerous empirical examples
• Examines the latest advances in forecast evaluation
• Essential for practitioners and students alike
Categories: Economy\Econometrics
Year: 2016
Publisher: Princeton University Press
Language: English
Pages: 567
ISBN 13: 978-0-691-14013-1
File: PDF, 3.20 MB
ECONOMIC FORECASTING

GRAHAM ELLIOTT AND
ALLAN TIMMERMANN

PRINCETON UNIVERSITY PRESS
PRINCETON AND OXFORD

Copyright © 2016 by Princeton University Press
Published by Princeton University Press, 41 William Street,
Princeton, New Jersey 08540
In the United Kingdom: Princeton University Press, 6 Oxford Street,
Woodstock, Oxfordshire OX20 1TW
press.princeton.edu
All Rights Reserved
ISBN: 978-0-691-14013-1
Library of Congress Control Number: 2015959185
British Library Cataloging-in-Publication Data is available
This book has been composed in Minion Pro and Helvetica Neue
Printed on acid-free paper. ∞
Typeset by S R Nova Pvt Ltd, Bangalore, India
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10


FOR OUR FAMILIES

Jason and Shirley
Henry, Rafaella
and Solange


Contents

Preface xiii

I Foundations

1 Introduction 3
   1.1 Outline of the Book 3
   1.2 Technical Notes 12

2 Loss Functions 13
   2.1 Construction and Specification of the Loss Function 14
   2.2 Specific Loss Functions 20
   2.3 Multivariate Loss Functions 28
   2.4 Scoring Rules for Distribution Forecasts 29
   2.5 Examples of Applications of Forecasts in Macroeconomics and Finance 31
   2.6 Conclusion 37

3 The Parametric Forecasting Problem 39
   3.1 Optimal Point Forecasts 41
   3.2 Classical Approach 47
   3.3 Bayesian Approach 54
   3.4 Relating the Bayesian and Classical Methods 56
   3.5 Empirical Example: Asset Allocation with Parameter Uncertainty 59
   3.6 Conclusion 62

4 Classical Estimation of Forecasting Models 63
   4.1 Loss-Based Estimators 64
   4.2 Plug-In Estimators 68
   4.3 Parametric versus Nonparametric Estimation Approaches 73
   4.4 Conclusion 74

5 Bayesian Forecasting Methods 76
   5.1 Bayes Risk 77
   5.2 Ridge and Shrinkage Estimators 81
   5.3 Computational Methods 83
   5.4 Economic Applications of Bayesian Forecasting Methods 85
   5.5 Conclusion 88

6 Model Selection 89
   6.1 Trade-Offs in Model Selection 90
   6.2 Sequential Hypothesis Testing 93
   6.3 Information Criteria 96
   6.4 Cross Validation 99
   6.5 Lasso Model Selection 101
   6.6 Hard versus Soft Thresholds: Bagging 104
   6.7 Empirical Illustration: Forecasting Stock Returns 106
   6.8 Properties of Model Selection Procedures 115
   6.9 Risk for Model Selection Methods: Monte Carlo Simulations 121
   6.10 Conclusion 125
   6.11 Appendix: Derivation of Information Criteria 126

II Forecast Methods

7 Univariate Linear Prediction Models 133
   7.1 ARMA Models as Approximations 134
   7.2 Estimation and Lag Selection for ARMA Models 142
   7.3 Forecasting with ARMA Models 147
   7.4 Deterministic and Seasonal Components 155
   7.5 Exponential Smoothing and Unobserved Components 159
   7.6 Conclusion 164

8 Univariate Nonlinear Prediction Models 166
   8.1 Threshold Autoregressive Models 167
   8.2 Smooth Transition Autoregressive Models 169
   8.3 Regime Switching Models 172
   8.4 Testing for Nonlinearity 179
   8.5 Forecasting with Nonlinear Univariate Models 180
   8.6 Conclusion 185

9 Vector Autoregressions 186
   9.1 Specification of Vector Autoregressions 186
   9.2 Classical Estimation of VARs 189
   9.3 Bayesian VARs 194
   9.4 DSGE Models 206
   9.5 Conditional Forecasts 210
   9.6 Empirical Example 212
   9.7 Conclusion 217

10 Forecasting in a Data-Rich Environment 218
   10.1 Forecasting with Factor Models 220
   10.2 Estimation of Factors 223
   10.3 Determining the Number of Common Factors 229
   10.4 Practical Issues Arising with Factor Models 232
   10.5 Empirical Evidence 234
   10.6 Forecasting with Panel Data 241
   10.7 Conclusion 243

11 Nonparametric Forecasting Methods 244
   11.1 Kernel Estimation of Forecasting Models 245
   11.2 Estimation of Sieve Models 246
   11.3 Boosted Regression Trees 256
   11.4 Conclusion 259

12 Binary Forecasts 260
   12.1 Point and Probability Forecasts for Binary Outcomes 261
   12.2 Density Forecasts for Binary Outcomes 265
   12.3 Constructing Point Forecasts for Binary Outcomes 269
   12.4 Empirical Application: Forecasting the Direction of the Stock Market 272
   12.5 Conclusion 273

13 Volatility and Density Forecasting 275
   13.1 Role of the Loss Function 277
   13.2 Volatility Models 278
   13.3 Forecasts Using Realized Volatility Measures 288
   13.4 Approaches to Density Forecasting 291
   13.5 Interval and Quantile Forecasts 301
   13.6 Multivariate Volatility Models 304
   13.7 Copulas 306
   13.8 Conclusion 308

14 Forecast Combinations 310
   14.1 Optimal Forecast Combinations: Theory 312
   14.2 Estimation of Forecast Combination Weights 316
   14.3 Risk for Forecast Combinations 325
   14.4 Model Combination 329
   14.5 Density Combination 336
   14.6 Bayesian Model Averaging 339
   14.7 Empirical Evidence 341
   14.8 Conclusion 344

III Forecast Evaluation

15 Desirable Properties of Forecasts 347
   15.1 Informal Evaluation Methods 348
   15.2 Loss Decomposition Methods 352
   15.3 Efficiency Properties with Known Loss 355
   15.4 Optimality Tests under Unknown Loss 365
   15.5 Optimality Tests That Do Not Rely on Measuring the Outcome 368
   15.6 Interpreting Efficiency Tests 368
   15.7 Conclusion 371

16 Evaluation of Individual Forecasts 372
   16.1 The Sampling Distribution of Average Losses 373
   16.2 Simulating Out-of-Sample Forecasts 375
   16.3 Conducting Inference on the Out-of-Sample Average Loss 380
   16.4 Out-of-Sample Asymptotics for Rationality Tests 385
   16.5 Evaluation of Aggregate versus Disaggregate Forecasts 388
   16.6 Conclusion 390

17 Evaluation and Comparison of Multiple Forecasts 391
   17.1 Forecast Encompassing Tests 393
   17.2 Tests of Equivalent Expected Loss: The Diebold–Mariano Test 397
   17.3 Comparing Forecasting Methods: The Giacomini–White Approach 400
   17.4 Comparing Forecasting Performance across Nested Models 403
   17.5 Comparing Many Forecasts 409
   17.6 Addressing Data Mining 413
   17.7 Identifying Superior Models 415
   17.8 Choice of Sample Split 417
   17.9 Relating the Methods 418
   17.10 In-Sample versus Out-of-Sample Forecast Comparison 418
   17.11 Conclusion 420

18 Evaluating Density Forecasts 422
   18.1 Evaluation Based on Loss Functions 423
   18.2 Evaluating Features of Distributional Forecasts 428
   18.3 Tests Based on the Probability Integral Transform 433
   18.4 Evaluation of Multicategory Forecasts 438
   18.5 Evaluating Interval Forecasts 440
   18.6 Conclusion 441

IV Refinements and Extensions

19 Forecasting under Model Instability 445
   19.1 Breaks and Forecasting Performance 446
   19.2 Limitations of In-Sample Tests for Model Instability 448
   19.3 Models with a Single Break 451
   19.4 Models with Multiple Breaks 455
   19.5 Forecasts That Model the Break Process 456
   19.6 Ad Hoc Methods for Dealing with Breaks 460
   19.7 Model Instability and Forecast Evaluation 463
   19.8 Conclusion 465

20 Trending Variables and Forecasting 467
   20.1 Expected Loss with Trending Variables 468
   20.2 Univariate Forecasting Models 470
   20.3 Multivariate Forecasting Models 478
   20.4 Forecasting with Persistent Regressors 480
   20.5 Forecast Evaluation 486
   20.6 Conclusion 489

21 Forecasting Nonstandard Data 490
   21.1 Forecasting Count Data 491
   21.2 Forecasting Durations 493
   21.3 Real-Time Data 495
   21.4 Irregularly Observed and Unobserved Data 498
   21.5 Conclusion 504

Appendix 505
   A.1 Kalman Filter 505
   A.2 Kalman Filter Equations 507
   A.3 Orders of Probability 514
   A.4 Brownian Motion and Functional Central Limit Theory 515

Bibliography 517
Index 539

Preface



We started working on this book more than 10 years ago after teaching courses
on forecasting techniques at the University of Aarhus, Denmark, and in Bertinoro,
Italy, to groups of PhD students and assistant professors. Since then, we have
developed the material through courses offered to participants at many institutions,
including at CREATES (University of Aarhus), American University, Edhec, Bank of
Italy, SoFiE (Oxford University), and Universidad del Rosario.
Our idea was to provide a unified perspective that takes both the economics and
statistics of the forecasting problem seriously. The intention was to write a forecasting
book that could be used by master's and PhD students as well as professionals in
places such as central banks, financial institutions, and research institutes. The book
can be used as a textbook. Indeed, the first section of the book provides a unified
theoretical discussion of the basic approach to forecasting that is grounded in the
standard statistical practice of minimizing the “risk” (expected loss) of any method.
The remainder of the book can be used both as a text and as a reference to a
wide range of forecasting methods. We have tried as much as possible to provide
detailed descriptions of how to construct forecasts, how to evaluate such forecasts,
and how to compare them across different methods. This allows the book to serve as
a single source for many widely employed forecasting methods. Through empirical
applications and reviews of the empirical literature, we also shed light on which
methods work well in different circumstances. We use examples ranging from stock
returns to macroeconomic variables and surveys of forecasters.
Nearly all researchers who are interested in developing new forecasting methods
through theoretical analysis or improving their empirical performance through data
analysis work within a decision-theoretic framework. For example, the provision of
point forecasts is a special case of point estimation and the provision of distributional
forecasts is a special case of density estimation. We use this connection as a
foundation for understanding the statistical basis for forecasting analysis and gaining
a better understanding of how to think about the many forecasting methods in
practical use. Thus, the first premise of the book is that taking seriously the economics
underlying the forecasting problem means that the forecaster’s loss function should
be the starting point of the analysis.
The second premise of the book is that the joint density of the random variables
that generate the observed data used to build and evaluate a forecasting model
is far more complicated than we understand theoretically or empirically. As a
consequence, all forecasting models are misspecified in the sense that they are
approximations to the best possible forecasting model. In practice, this means
choosing forecasting methods based on their risk functions (the expected loss
given the data), but acknowledging that these risk functions are themselves very
complicated objects that depend on the underlying (unknown) data-generating
process. It is exactly the difficulties in understanding the risk functions that allow
so many different forecasting approaches to be used in empirical work.

In addition to the students in our forecasting courses who have provided valuable
feedback, throughout the years we have also benefitted from discussions on forecasting with many individuals. Without implying that they necessarily agree with
the points of view expressed in the book, we thank our colleagues at UCSD (past
and present) including Brendan Beare, Robert Engle, Clive Granger, Jim Hamilton,
Ivana Komunjer, Andres Santos, Yixiao Sun, Rossen Valkanov, and Hal White. More
widely in the profession we thank Frank Diebold, Peter Hansen, Andrew Patton,
Hashem Pesaran, Ulrich Müller, Jim Stock, Mark Watson, and Ken West for their
insights and support. We thank all of them for the inspiration they have offered over
the years. This book has also benefitted more directly from the input of many friends
and colleagues. In particular, we thank Peter Hansen, Kirstin Hubrich, Simone
Manganelli, Andrew Patton, Davide Pettenuzzo, Barbara Rossi, and four anonymous
reviewers for comments on the book. A number of PhD students provided exceptionally capable research assistance with the empirical analysis, notably Leland E.
Farmer, Antonio Gargano, Rafael Burjack, Hiroaki Kaido, and Christian Constandse.
Thanks also go to Naveen Basavanhally for help with formatting the manuscript,
to Alison Durham for doing an excellent job at copyediting the manuscript, and to
Ali Parrington and the team at Princeton University Press for ensuring a smooth
production process.
For collaboration on forecasting papers over the years we also wish to thank
several former PhD students and colleagues, including Marco Aiolfi, Ayelen Banegas,
Gray Calhoun, Carlos Capistran, Luis Catao, Tolga Cenesizoglu, Leland Farmer,
Antonio Gargano, Veronique Genre, Dahlia Ghanem, Ben Gillen, Clive Granger,
Niels Groenborg, Massimo Guidolin, Peter Reinhard Hansen, Geoff Kenny, Ivana
Komunjer, Robert Kosowski, Fabian Krueger, Robert Lieli, Asger Lunde, Aidan
Meyler, Andrew Patton, Bradley Paye, Thomas Pedersen, Gabriel Perez-Quiros,
Hashem Pesaran, Davide Pettenuzzo, Marius Rodrigues, Steve Satchell, Larry
Schmidt, Ryan Sullivan, Russ Wermers, Hal White, and Yinchu Zhu.
Last, but not least, we wish to thank our families for their understanding and
inspiration during the years it took to complete the book. The book would not have
been possible without their unwavering support.

I Foundations



1 Introduction

Our aim with this book is to present an overview of the theory and methods
underlying forecasting as currently practiced in economics and finance, but
more widely applicable to a great range of forecasting problems. We hope to provide
an overview that is useful to practitioners in places such as central banks and financial
institutions, academic researchers, and graduate students seeking a point of
entry into the field. The assumed econometric level of the reader is that of someone
who has taken a graduate or advanced undergraduate course in econometrics.
Whenever a forecast is being constructed or evaluated, an overriding concern
revolves around the practical problem that the best forecasting model is not only
unknown but also unlikely to be known well enough to even correctly specify
forecasting equations up to a set of unknown parameters. We view this as the only
reasonable description of the forecaster’s problem. Some methods do claim to find
the correct model (oracle methods) as the sample gets very large. However, in any
problem with a finite sample there is always a set of models—as opposed to a single
model—that are consistent with the data. Moreover, in many situations the data-generating process changes over time, further emphasizing the difficulty in obtaining
very large samples of observations on which to base a model. These foundations—
using misspecified models to forecast outcomes generated by a process that may be
evolving over time—generate many of the complications encountered in forecasting.
If the true models were fully known apart from the values of the parameters, Bayesian
methods could be used to construct density and point forecasts that, for a given loss
function, would be difficult or impossible to beat in practice.
Without knowing the true data-generating process, the problem of constructing a
good forecasting method becomes much more difficult. Oftentimes very simple (and
clearly misspecified) methods provide forecasts that outperform more complicated
methods that seek to exploit the data in ways we would expect to be important and
advantageous. As a case in point, simple averages of forecasts from many models,
even ones that on their own do not seem to be very good, are often found empirically
to outperform carefully chosen model averages or the best individual models.

1.1 OUTLINE OF THE BOOK
The approach of this book is for the most part based on forecasting as a decision-theoretic problem. By this we mean that the forecaster has a specific objective in
mind (i.e., wishes to make a decision) and wants to base this decision on some data.
Setting up this approach comprises most of the first part of the book. This part
details the basic elements of the decision problem, with chapters on the decision
maker’s loss function, forecasting as a decision-theoretic problem, and an overview
of general approaches to forecasting employing either classical or Bayesian methods.
This part of the book provides foundations for understanding how different methods
fit together. We also provide details of methods that are subsequently applied to many
of the issues examined in the next part of the book, e.g., model selection and forecast
combination.
The second part of the book reviews various approaches to constructing forecasting models. Methods employed differ for many reasons: lack of relevant data or the
existence of a great deal of potentially relevant data, as well as assumptions made on
functional forms for the models. In these chapters we attempt, as far as possible, to
present the methods in enough detail that they can be employed without reference to
other sources.
The third part of the book examines the evaluation of forecasts. Finally, the
fourth part presents various extensions and refinements to the forecasting methods
covered in the earlier parts of the book. These chapters deal with special
complications such as model instability (breaks), highly persistent (trending) data
and long-run forecasting, and data structures of special interest to forecasters,
including data that take a nonstandard form (count data and durations), real-time
(revised) data, and data measured at irregular intervals or collected at different
frequencies.
1.1.1 Part I
The first part of the book argues that point forecasting should be thought of
simply as an application of decision theory. Since much is known about decision
theory, much is also known about forecasting. This perspective makes point forecasting a special case of estimation, a field where excellent texts already exist. What
makes economic forecasting interesting as a separate topic is the particular details
of how decision theory is applied to the problem at hand. To apply this approach,
we require a clear statement about the costs of forecast errors e = y − f, where y is
the outcome being predicted and f is the forecast. The trade-off between different
forecasting mistakes is embodied in a loss function, L(f, y), which is discussed
in chapter 2, with additional material on the binary case available in chapter 12.
We regard loss functions as realistic expositions of the forecaster’s objectives, and
consider the specification of the loss function as an integral part of the forecaster’s
decision problem.1 Different forecasters approaching the same outcome may well
have different loss functions which could result in different choices of forecasting
models for the same outcome.
The specification of loss functions is often disregarded in economic forecasting,
and instead “standard” loss functions such as mean squared error loss tend to
be employed. This can prove costly in real forecasting situations as it overlooks
directions in which forecast errors are particularly costly. Nonetheless, much of the
academic literature is based on these standard loss functions and so we focus much
of our survey of methods throughout the second part of the book on these standard
loss functions.

1 An alternative literature considers features of loss functions and attempts to suggest a good loss
function for all forecasting problems. We do not consider this approach and view loss functions as
primitive to the forecaster’s problem.
Chapter 3 provides a general description of the forecaster’s problem as a decision
problem. It may strike some readers, more used to the “art” of forecasting, as unusual
to cast point forecasting as a decision-theoretic problem. However, even readers
who do not explicitly follow this approach are indeed operating within the decision-theoretic framework. For example, most forecasting methods are motivated in one
of two ways: either the methods are demonstrated to provide better performance
given a loss function (or set of loss functions) through Monte Carlo simulations
for reasonable data-generating processes, or alternatively, the forecasting methods
are shown to work well for some loss function for a particular set of empirical
data z = (y, x), where x represents the set of predictor variables used to forecast
the outcome y. Both ways of measuring performance place the forecasting problem
within the decision-theoretic approach.
To illustrate this point, consider Monte Carlo simulations of a data-generating
process (joint density for the data) regarded as a reasonable representation of
some data of interest. The simulation method suggests constructing $N$ independent
pseudo samples from this density, constructing $N$ forecasts, and evaluating
$N^{-1}\sum_{n=1}^{N} L(f^{(n)}, y^{(n)})$, where $y^{(n)}$ is the outcome we wish to forecast, $f^{(n)}$ is the
forecast generated by a prediction model, and $L(f, y)$ is the loss function which
measures the costs of forecast inaccuracies. Superscripts refer to the individual
simulations, $n = 1, \ldots, N$. The simulated average loss is usually thought of as a
measure of the performance of the forecasting method or model for this
data-generating process. This is reasonable since, as $N$ gets large, the sample
average is, by standard laws of large numbers, a consistent estimate of the risk at
the point of the parameter space for the data-generating process chosen for the
Monte Carlo, i.e., as long as $E[L(f, y)]$ exists,

$$N^{-1}\sum_{n=1}^{N} L(f^{(n)}, y^{(n)}) \to_p E[L(f, y)], \qquad (1.1)$$

where the Monte Carlo estimates a point on the risk function and $\to_p$ denotes
convergence in probability. Finding a forecast that minimizes the risk is precisely
the setup of a decision-theoretic problem.
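A minimal sketch of this simulation-based risk calculation, assuming an AR(1) data-generating process, an OLS plug-in forecast, and squared error loss (all of these are illustrative choices, not prescriptions from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar1(T, phi=0.5, sigma=1.0):
    """Draw one pseudo sample of length T+1 from an AR(1) process."""
    y = np.zeros(T + 1)
    for t in range(1, T + 1):
        y[t] = phi * y[t - 1] + sigma * rng.standard_normal()
    return y

N, T = 10_000, 100
losses = np.empty(N)
for n in range(N):
    y = simulate_ar1(T)
    # Estimate the AR(1) coefficient by OLS on the first T observations
    phi_hat = (y[1:T] @ y[:T - 1]) / (y[:T - 1] @ y[:T - 1])
    f = phi_hat * y[T - 1]          # one-step-ahead forecast of y[T]
    losses[n] = (y[T] - f) ** 2     # squared error loss L(f, y)

# By (1.1), the simulated average loss consistently estimates the risk
# E[L(f, y)] at this point of the parameter space as N grows.
print(losses.mean())
```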
The third part of the book discusses methods for evaluation of sequential out-of-sample
predictions. In each case, one obtains from the data set $T$ observations
of the “realized” loss from the data. One then evaluates the time-series average
$T^{-1}\sum_{t=1}^{T} L(f_t, y_t)$, where the $t$ subscript refers to time, as a measure of the expected
loss. In this case the assumptions that underlie results such as (1.1) are much
more stringent because the sequence of expected losses generated from data are not
independently and identically distributed (i.i.d.) as in the Monte Carlo simulations.
However, under suitable assumptions again this method estimates risk. When using
real (as opposed to simulated) data, we do not know the true parameter values of the
data-generating process. Analyzing a variety of economic variables, we get a sense of
how well different forecasting methods work for different types of data.

The general setup in chapter 3 is common to forecasters basing their estimation
strategies either on frequentist or on Bayesian approaches. Chapters 4 and 5 build
on this setup separately for these two approaches. Chapter 4 examines the typical
frequentist approaches, explaining general pitfalls that can occur as well as highlighting special cases arising later in the book. Chapter 5 does the same for the Bayesian
approach.
Viewing forecasting as a decision-theoretic problem sometimes means that the
best forecasting model, despite working well in practice, may actually be a model
that is very difficult to interpret economically. This becomes a problem when the
forecasting exercise is a step in a decision process, and the forecaster must “explain”
the forecast to decision makers or forecast users. In these cases an inferior point
forecast that tends to be further away from the outcome may be preferred because
it is easier to explain and may be seen to be more credible. Of course in situations
where we suspect a lot of overfitting or instability in the relationships between the
variables, we might prefer forecasting models that conform to economic theory since
they are expected to be more robust. Practically, economically motivated restrictions
on forecasting models can just be seen as following the decision-theoretic approach
for a restricted set of models.
The final chapter of the first part of the book, chapter 6, examines issues related to
model selection. By now the econometrics literature has a very good understanding
of the merits and limitations of model selection, which we discuss for general
models. From the perspective of forecasting, however, we regard model selection
as simply part of the model estimation process. Of interest to the forecaster is the
risk of the final forecasting model computed in a way that accounts for the full
estimation process. Given the complexity of the distributions of estimators obtained
from models whose selection is driven by the data, this issue is difficult to address
analytically although it is still of direct relevance to the forecaster.
1.1.2 Part II
Part II of the book provides an overview of the various approaches to forecasting
that have become standard in many areas, including the economic and finance
forecasting literature. Chapters are based around either the amount of information
available—from only the past history of the predicted variable through very large
panels of variables—or the general estimation approach, principally parametric or
nonparametric methods. To the extent possible, we provide details of how to go
about constructing forecasts from the various methods, or alternatively direct readers
to explanations available in the literature. We also discuss the trade-offs between
different methods. In this sense we endeavor to provide a “first stop” for practitioners
wishing to apply the methods covered in this section.
An important insight that arises from the decision-theoretic approach is that there
is no single best or dominant approach to constructing a forecast for all possible
forecasting situations. We discuss the types of forecasting situations where each
individual method is likely to be a reasonable approach and also highlight situations
where other approaches should be considered.
Throughout the book, we use a variety of empirical applications to illustrate
how different approaches work. In most applications we use so-called pseudo
out-of-sample forecasts which simulate the forecast as it could have been generated using data only up to the date of the prediction. This method restricts both
model selection and parameter estimation to rely on data available at the point of
the forecast. As time progresses and more data become available, the forecasting
method, including the parameter estimates, is updated recursively. Such methods
are commonly used to evaluate the usefulness of forecasts; a critical discussion of
such out-of-sample forecasting methods versus in-sample methods is provided in
part three of the book.
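As a stylized illustration of this recursive scheme, the loop below generates pseudo out-of-sample forecasts with an expanding window, using a simple AR(1) rule as a stand-in for whatever model is under study (the simulated series and all settings are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.zeros(200)
for t in range(1, 200):                 # placeholder AR(1) series
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()

R = 100                                 # initial estimation window
errors = []
for t in range(R, len(y) - 1):
    train = y[: t + 1]                  # only data available at the forecast date
    # Re-estimate the model each period as the window expands
    phi_hat = (train[1:] @ train[:-1]) / (train[:-1] @ train[:-1])
    errors.append(y[t + 1] - phi_hat * train[-1])

print(np.mean(np.square(errors)))       # out-of-sample average loss (MSE)
```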
When building a forecasting model for an economic variable, the simplest
specification of the conditioning information set is the variable’s own past history.
This leads to univariate autoregressive moving average, or ARMA, models. Since
Box and Jenkins (1970) these models have been extensively used and often provide
benchmarks that are difficult to beat using more complicated forecasting methods.
Linear ARMA models are also easy to estimate and a large literature has evolved on
how best to cover issues in implementation such as lag length selection, generation
of multiperiod forecasts, and parameter estimation. We discuss these issues in
chapter 7. The chapter also covers exponential smoothing, unobserved components
models, and other ways to account for trends when forecasting economic variables.
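For concreteness, an ARMA(1,1) model can be estimated and used for multiperiod forecasting with, for example, the statsmodels package; both the tool and the simulated series are our illustrative choices, not ones the book endorses:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
e = rng.standard_normal(500)
y = np.zeros(500)
for t in range(1, 500):                 # simulated ARMA(1,1) stand-in series
    y[t] = 0.6 * y[t - 1] + e[t] + 0.3 * e[t - 1]

res = ARIMA(y, order=(1, 0, 1)).fit()   # ARMA(1,1) is ARIMA with d = 0
print(res.params)                       # constant, AR, MA, and variance estimates
print(res.forecast(steps=4))            # iterated multiperiod forecasts
```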
Chapter 8 continues under the assumption that the information set is limited
to the predicted variable’s own past, but focuses on nonlinear parametric models.
Examples include threshold autoregressions, smooth threshold autoregressions, and
Markov switching models. These models have been used to capture evidence of
nonlinear dynamics in many macroeconomic and financial time series. Unlike
nonparametric models they do not, however, have the ability to provide a global
approximation to general data-generating processes of unknown form.
Chapter 9 expands the information set to include multivariate information by
considering a natural extension to univariate autoregressive models, namely vector
autoregressions, or VARs. VARs provide a framework for producing internally consistent multiperiod forecasts of all the included variables. As used in macroeconomic
forecasting, VARs typically include a relatively small set of variables, often fewer than
10, but they still require a large number of parameters to be estimated if the number
of included lags is high. To deal with the resulting negative effects of estimation
errors on forecasting performance, a large literature has developed Bayesian methods
for estimating and forecasting with VARs. Both classical and Bayesian estimation of
VARs are covered in the chapter, which also deals with forecasting when the future
paths of some variables are specified, a common practice in scenario analysis or
contingent forecasting.
The emergence of very large data sets has given rise to a wealth of information
becoming readily available to forecasters. This presents both a unique opportunity—
the potential for identifying new informative predictor variables—and some
real challenges given the limitations of most economic data. Suppose that N
potential predictor variables are available, and that N is a large number, i.e., in
the hundreds or thousands. Including all variables in the forecasting model—the
so-called kitchen sink approach—is generally not feasible or desirable even for linear
models since parameter estimation error becomes too large, unless the length of
the estimation sample, T , is very large relative to N. Standard forecasting methods
that conduct comprehensive model selection searches are also not feasible in this
situation. If the true model is sparse, i.e., includes only few variables, one possibility
is to use algorithms such as the Lasso, covered in chapter 6, to identify a few
key predictors. Another strategy is to develop a few key summary measures that
aggregate information from a large cross section of variables. This is the approach
used by common factor models. Chapter 10 describes how these methods can be
used in forecasting, including in factor-augmented VAR models that include both
univariate autoregressive terms along with information in the factors. Finally, we
discuss the possibility of using methods from panel data estimation to generate
forecasts.
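A schematic sketch of the two strategies just described, sparse selection via the Lasso versus aggregation of the predictors into a few common factors, using scikit-learn on simulated data (the sparse design, dimensions, and tuning values are all illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(3)
T, N = 200, 150                          # many predictors relative to the sample
X = rng.standard_normal((T, N))
y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(T)   # sparse truth (illustrative)

# Strategy 1: let the Lasso pick a few key predictors (cf. chapter 6)
lasso = LassoCV(cv=5).fit(X[:-1], y[1:])
print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))

# Strategy 2: compress X into a few common factors, then forecast (cf. chapter 10)
factors = PCA(n_components=3).fit_transform(X)
factor_reg = LinearRegression().fit(factors[:-1], y[1:])
print("factor-based forecast:", factor_reg.predict(factors[[-1]]))
```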
While chapters 7–10 focus on parametric estimation methods and so assume that
a certain amount of structure can be imposed on the forecasting model, chapter 11
considers nonparametric forecasting strategies. These include kernel regressions
and sieve estimators such as polynomials and spline expansions, artificial neural
networks, along with more recent techniques from the machine-learning literature
such as boosted regression trees. Although these methods have powerful abilities
to approximate many data-generating processes as the number of terms included
by the approach gets large, in practice any given estimated nonparametric model is
itself an approximation to this approximation. Notably, the number of terms that
can be successfully included in empirical applications will often be severely restricted
by the available data sample. These estimated models thus do not attain the full
approximation ability of their idealized counterparts and are themselves only approximations. Once
again, the algorithm used to fit these forecasting models—along with the loss function
used to guide the estimation—becomes key to their forecasting performance and to
avoiding issues related to overfitting.
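To give a flavor of the machine-learning end of this toolkit, the sketch below fits a boosted regression tree with scikit-learn and checks it on a held-out sample, in line with the concern about overfitting (the data-generating process and tuning values are invented):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.3 * rng.standard_normal(500)

# Shallow trees plus a small learning rate act as regularization
model = GradientBoostingRegressor(n_estimators=200, max_depth=2, learning_rate=0.05)
model.fit(X[:-100], y[:-100])           # hold out the last 100 observations
oos_mse = np.mean((y[-100:] - model.predict(X[-100:])) ** 2)
print(oos_mse)
```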
Forecasts of binary variables, i.e., variables that are restricted to take only two
possible values, play a special role in decisions such as households’ choice on
whether or not to buy a car, the decision on whether to pursue a particular
education, or banks’ decisions on whether to change interest rates for short-term
deposits. Restricting the outcome to only two possible values has the advantage
that it crystallizes the costs of making wrong forecasts, i.e., false positives or false
negatives. Chapter 12 takes advantage of these simplifications to cover point and
probability forecasts of binary outcomes and discusses both statistical and utility-based estimators for such data.
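A minimal sketch of a probability forecast for a binary outcome, here produced with a logistic regression (one illustrative statistical estimator; chapter 12 also treats utility-based alternatives, and the 0.5 threshold below presumes symmetric costs for false positives and false negatives):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
x = rng.standard_normal((1000, 2))
p_true = 1.0 / (1.0 + np.exp(-(0.8 * x[:, 0] - 0.5 * x[:, 1])))
y = rng.binomial(1, p_true)             # simulated binary outcomes

clf = LogisticRegression().fit(x, y)
prob = clf.predict_proba(x[-1:])[0, 1]  # probability forecast for one case
point = int(prob > 0.5)                 # point forecast under symmetric loss
print(prob, point)
```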
The decision-theoretic approach embodies a loss function that is appropriate for
the decision to be made and not, as is so often the case, chosen for convenience.
It results in a decision, i.e., a choice of an action to be made. This points toward
basing estimation on the objective of arriving at the best decision. Alternatively,
we might consider provision of a predictive distribution (density forecast) for an
outcome as the objective of the forecasting problem. In chapter 13 we see that
this perspective is useful for a wide range of decisions.2 Distribution forecasts also
serve the important role of quantifying the degree of uncertainty surrounding point
forecasts.
Distributional forecasting fills an important place in any forecaster’s toolbox but
it does not replace point forecasting. First, although density forecasts can be used to
construct point forecasts, typically it is the point forecast or decision that is required.
Second, distributional forecasts rely on the distribution being estimated from data.
This brings the loss function or scoring rule—the loss function used to estimate the
density—back into the problem. Often ad hoc loss functions are employed to estimate
the distributional forecast, leading to problems when the distributional forecast is
subsequently used to construct the point forecast.

2 Dawid (1984) introduced what he termed the “prequential” approach to statistics, where prequential
is a fusing of the words “probability” and “sequential.” This approach argued that rather than parameters
being the object of statistical inference, the proper approach was to provide a sequence of probability
forecasts for an outcome of interest. Hence, the provision of a density is important not just for forecasting,
but for statistics in general.
Given the plethora of different modeling approaches for construction of forecasts throughout chapters 7–13, it is not surprising that forecasters frequently
have access to multiple predictions of the same outcome. Instead of aiming to
identify a single best forecast, another strategy is to combine the information in
the individual forecasts. This is the topic of forecast combinations covered in
chapter 14. If the information used to generate the underlying forecasts is not
available, forecast combination reduces to a simple estimation problem that basically
treats the individual forecasts as predictors that could be part of a larger conditioning
information set. Special restrictions on the forecast combination weights are sometimes imposed if it can be assumed that the individual forecasts are unbiased. If
more information is available on the models underlying the individual forecasts,
model combination methods can be used. These weight the individual forecasts
based on their marginal likelihood or some such performance measure. Bayesian
model averaging is a key example of such methods and is also covered in this
chapter.
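A minimal sketch of the estimation view of combination, treating three simulated forecasts as predictors and comparing equal weights with regression-estimated weights (illustrative throughout; in practice the weights would be estimated on one sample and evaluated on another to avoid the in-sample advantage shown here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
T = 300
y = rng.standard_normal(T)
# Three imperfect forecasts of y with different noise levels
F = np.column_stack([y + s * rng.standard_normal(T) for s in (0.5, 1.0, 1.5)])

equal_weight = F.mean(axis=1)           # simple average combination
reg = LinearRegression().fit(F, y)      # estimated combination weights
combined = reg.predict(F)

print("MSE, equal weights:    ", np.mean((y - equal_weight) ** 2))
print("MSE, estimated weights:", np.mean((y - combined) ** 2))
```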
1.1.3 Part III
The third part of the book deals with forecast evaluation methods. Evaluation of
forecast methods is central to the forecasting problem and the difficulties involved
in this step explain both the plethora of methods suggested for forecasting any
particular outcome and the need for careful evaluation of forecasting methods.
To see the central issue, consider the simple problem of forecasting the next
outcome, $y_{T+1}$, in a sequence of independently and identically distributed data $y_t$,
$t = 1, \ldots, T$, with mean $\mu$, variance $\sigma^2$, and no explanatory variables. It is well
known that under mean squared error (MSE) loss the best forecast is an estimate
of the mean, $\mu$, such as the sample mean $\bar{y}_T = T^{-1}\sum_{t=1}^{T} y_t$. Since the outcome $y_{T+1}$
is a random variable whose distribution is centered on μ, the forecast is typically
different from the outcome even if we had a perfect estimate of μ, i.e., if we knew μ,
as long as σ 2 > 0. Observing a single outcome far away from the forecast is therefore
not necessarily indicative of a poor forecast. More generally, methods for forecast
evaluation have to deal with the fact that (in expectation) the average in-sample loss
and the average out-of-sample loss differ. To see this, suppose we use the sample
mean as our forecast. For any in-sample observation, t = 1, . . . , T , the MSE of the
forecast (or fitted value) is

$$
\begin{aligned}
E\big[(y_t - \bar{y}_T)^2\big] &= E\Big[(y_t - \mu) - T^{-1}\sum_{s=1}^{T}(y_s - \mu)\Big]^2 \\
&= \sigma^2\Big(1 + \frac{T}{T^2} - \frac{2}{T}\Big) \\
&= \sigma^2\big(1 - T^{-1}\big).
\end{aligned}
$$
Here the third term in the second line comes from the cross product when we
compute the squared terms in the first line.

In contrast, the MSE of out-of-sample forecasts of $y_{T+1}$ is

$$
\begin{aligned}
E\big[(y_{T+1} - \bar{y}_T)^2\big] &= E\Big[(y_{T+1} - \mu) - T^{-1}\sum_{t=1}^{T}(y_t - \mu)\Big]^2 \\
&= \sigma^2\Big(1 + \frac{T}{T^2}\Big) \\
&= \sigma^2\big(1 + T^{-1}\big).
\end{aligned}
$$
Here there is no cross-product term. Comparing these two expressions, we see
that estimation error reduces the in-sample MSE but increases the out-of-sample
MSE. In both cases the terms are of order $T^{-1}$ and so the difference disappears
asymptotically. However, in many forecasting problems this smaller-order term is
important both statistically and economically. When we consider many different
models of the outcome, differences in the MSE across models are of the same order
as the effects on estimation error. This makes it difficult to distinguish between
models and is one reason why model selection is so difficult. The insight that the
in-sample fit improves by using overparameterized models, whereas out-of-sample
predictive accuracy can be reduced by using such models, strongly motivates the use
of out-of-sample evaluation methods, although caveats apply as we discuss in part III
of the book.
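The two expressions are easy to verify by simulation; a quick sketch with Gaussian i.i.d. data and σ² = 1:

```python
import numpy as np

rng = np.random.default_rng(7)
T, n_sims = 20, 200_000
y = rng.standard_normal((n_sims, T + 1))    # i.i.d. data with sigma^2 = 1
ybar = y[:, :T].mean(axis=1)                # sample mean of the first T obs

mse_in = np.mean((y[:, 0] - ybar) ** 2)     # fitted value for an in-sample point
mse_out = np.mean((y[:, T] - ybar) ** 2)    # forecast of the out-of-sample point

print(mse_in, 1 - 1 / T)                    # approx. sigma^2 (1 - 1/T)
print(mse_out, 1 + 1 / T)                   # approx. sigma^2 (1 + 1/T)
```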
In the past 20 years many new forecast evaluation methods have been developed.
Prior to this development, most academic work on evaluation and ranking of
forecasting performance paid very little attention to the consideration that forecasts
were obtained from recursively estimated models. Thus, often studies used the
sample mean squared forecast error, computed for a particular empirical data set,
to give an estimate of a model’s performance without accompanying standard errors.
An obvious limitation of this approach is that such averages often are averages over
very complicated functions of the data. Through their dependence on estimated
parameters these averages are also typically correlated across time in ways that give
rise to quite complicated distributions for standard test statistics. For some of the
simpler ways that forecasts could have been generated recursively, recent papers
derive the resulting standard errors, although much more work remains to be done
to extend results to many of the popular forecasting methods used in practice.
Chapter 15 first establishes the properties that a good forecast should have in the
context of the underlying loss function and discusses how these properties can be
tested in practice. The chapter goes from the case where very little structure can be
imposed on the loss function to cases where the loss function is known up to a small
set of parameters. In the latter case it can be tested that the derivative of the loss
with respect to the forecast, the so-called generalized forecast error, is unpredictable
given current information. The chapter also shows how assumptions about the loss
function can be traded off against testable assumptions on the underlying data-generating process.
Chapter 16 gives an overview of basic issues in evaluating forecasts, along with a
description of informal methods. This chapter examines the evaluation of a sequence
of forecasts from a single model. Critical values for the tests of forecast efficiency
depend on how the forecast was constructed, specifically whether a fixed, rolling, or
expanding estimation window was used.

Chapter 17 extends the assessment of the predictive performance of a single model
to the situation with more than one forecast to examine and so addresses the issue of
which, if any, forecasting method is best. We review ways to compare the forecasting
methods and strategies for testing hypotheses useful for identifying methods that work
well in practice. Special attention is paid to the case with nested forecasting models,
i.e., cases where one model includes all the terms of another benchmark model
plus some additional information. We distinguish between tests of equal predictive
accuracy and tests of forecast encompassing, the latter case referring to situations
where one forecast dominates another. We also discuss how to test whether the
best among many (possibly thousands) of forecasts is genuinely better than some
benchmark.
Chapter 18 examines the evaluation of distributional forecasts. A complication
that arises is that we never observe the density of the outcome; only a single draw
from the distribution gets observed. Various approaches have been suggested to
deal with this issue, including logarithmic scores and probability integral transforms.
We discuss these as well as ways to evaluate whether the basic features of a density
forecast match the data.
1.1.4 Part IV
The fourth part of the book covers a variety of topics that are specific to forecasting.
Chapter 19 discusses predictions under model instability. This chapter builds on
the earlier observation that all forecasting models are simplified representations
of a much more complex and evolving data-generating process. A key source
of model misspecification is the constant-parameter assumption made by many
prediction models. Empirical evidence suggests that simple ARMA models are in
fact misspecified for many macroeconomic variables. The chapter first discusses
how model instability can be monitored before moving on to discuss prediction
approaches that specifically incorporate time-varying parameters, including random
walk or mean-reverting parameters and regime switching parameters.
The previous chapters deal with cases where the forecast horizon is relatively
short. Chapter 20 directly attacks the case where the forecast horizon can be long.
Oftentimes a policy maker or budget office is interested in 5- or 10-year forecasts of
revenue or expenditures. Interest may also lie in forecasts of the average growth rate
over some period. From an estimation perspective, whether the forecast horizon is
short or long is measured relative to the length of the data sample. We discuss these
issues in chapter 20.
Real-time forecasting methods emphasize the need to ensure that all information
and all methods used to construct a forecast would have been available in real time.
This consideration becomes particularly relevant in so-called pseudo out-of-sample
forecasts that simulate a sequence of historical forecasts. Many macroeconomic
time series are subject to revisions that become available only after the date of the
forecast. Since the selection of a forecasting model and estimation of its parameters
may depend on the conditioning information set, which vintage of data is used can
sometimes make a material difference. Similar issues related to data availability are
addressed by a relatively new field known as nowcasting which uses filtering and
updating algorithms to account for the jagged-edge nature of data, i.e., the fact that
data are released at different frequencies and on different dates. These issues are
covered in chapter 21.

This chapter also covers models for predicting data that take the form of either
counts, and so are restricted to integer values, or durations, i.e., the
length of the time intervals between certain events. The nature of the dependent
variable gives rise to specific forecasting models, such as Poisson models, that are
different from the models covered in the previous chapters of the book. Count
models have gained widespread popularity in the context of analysis of credit events
such as bankruptcies or credit card default, while duration analysis is used to predict
unemployment spells and times between trades in financial markets.

1.2 TECHNICAL NOTES
Throughout the book we follow standard statistical methods which view the data as
realizations of underlying random variables. Objective functions and other functions
of interest are then also functions of random variables. Further, we assume that
all functions are measurable, including functions that arise from maximizations of
functions over parameters. We are rarely explicit about these assumptions, though
this is seldom an issue for the functions examined in the book.
The decision-theoretic approach relies on the existence of risk or expected loss.
For loss functions that are bounded, this is usually not problematic, but many
popular loss functions are not bounded. For example, mean squared error loss and
mean absolute error loss are the most popular loss functions in practice, and neither
is bounded. It is fairly standard in the forecasting literature to simply assume that
the expected loss exists, and further assume that the asymptotic limit of expected
loss is the expected value of the limiting random variable that measures the loss.
Throughout the book we follow this practice without giving conditions. Forecasting
practice in some instances does seem to enforce “boundedness” of a sort on forecast
losses; for example, in evaluating nonlinear models with mean squared error loss,
often extreme forecasts that could lead to very large losses are removed and so the
loss is in effect bounded.
Throughout the book we tend not to present results as fully worked theorems
but instead give the main conditions under which the results hold. Original papers
with the full set of conditions are cited. The reasons for this approach are twofold.
First, often there are many overlapping sets of conditions that would result in lengthy
expositions on often very straightforward methods if we were to include all the details
of a result. Second, many of the conditions are highly technical in nature and often
difficult or impossible to verify.

2 Loss Functions

Short of the special and ultimately uninteresting case with perfect foresight, it is
not possible to find a method that always sets the forecast equal to the outcome.
A formal method for trading off potential forecast errors of different signs and
magnitudes is therefore required. The loss function, L (·), describes in relative terms
how costly it is to use an imperfect forecast, f, given the outcome, Y, and possibly
other observed data, Z. This chapter examines the construction and properties of loss
functions and introduces loss functions that are commonly used in forecasting.
A central point in the construction of loss functions is that the loss function
should reflect the actual trade-offs between different forecast errors. In this sense
the loss function is a primitive to the forecasting problem. From a decision-theoretic
perspective the forecast is the action that must be constructed given the loss function
and the predictive distribution, which we discuss in the next chapter. For example,
the Congressional Budget Office must provide forecasts of future budget deficits.
Their loss function in providing the forecasts should be based on the relative costs
of over- and underpredicting public deficits. Weather forecasters face very different
costs from underpredicting the strength of a storm compared to overpredicting it.
The choice of a loss function is important for every facet of the forecasting
exercise. This choice affects which forecasting models are preferred as well as how
their parameters are estimated and how the resulting forecasts are evaluated and
compared against forecasts from competing models. Despite its pivotal role, it is
common practice to simply choose off-the-shelf loss functions. In doing this it is
important to choose a loss function that at least approximately reflects the types
of trade-offs relevant for the forecast problem under study. For example, when
forecasting hotel room bookings, it is hard to imagine that over- and underpredicting
the number of hotel rooms booked on a particular day lead to identical losses because
hotel rooms are a perishable good. Hence, using a symmetric loss function for this
problem would make little sense. Asymmetric loss that reflects the larger loss from
over- rather than underpredicting bookings would be more reasonable.
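One standard way to encode such asymmetry is piecewise linear (“lin-lin”) loss, under which the optimal point forecast is the b/(a + b) quantile of the outcome distribution rather than its mean. A small sketch, where the Poisson demand and the 3-to-1 cost ratio are invented for illustration:

```python
import numpy as np

def linlin_loss(f, y, a=3.0, b=1.0):
    """Lin-lin loss: overprediction (f > y) costs a per unit,
    underprediction (f < y) costs b per unit."""
    e = y - f
    return np.where(e < 0, -a * e, b * e)

rng = np.random.default_rng(8)
bookings = rng.poisson(80, size=100_000)    # hypothetical demand distribution

# Expected loss across candidate forecasts is minimized near the
# b/(a+b) = 0.25 quantile, i.e., below the mean when overprediction is costlier
candidates = np.arange(60, 100)
exp_loss = [linlin_loss(f, bookings).mean() for f in candidates]
print(candidates[int(np.argmin(exp_loss))], np.quantile(bookings, 0.25))
```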
There are examples of carefully grounded loss functions in the economics literature. For example, sometimes a forecast can be viewed as a signal in a strategic
game that is influenced by the forecast provider’s incentives. Studies such as Ehrbeck
and Waldmann (1996), Hong and Kubik (2003), Laster, Bennett, and Geoum
(1999), Ottaviani and Sørensen (2006), Scharfstein and Stein (1990) and Trueman
(1994) suggest loss functions grounded in game-theoretic models. Forecasters are
assumed to differ in their ability to predict future outcomes. The chief objective of the
forecasters is to influence forecast users’ assessment of their ability. Such objectives
are common for business analysts or analysts employed by financial services firms
such as investment banks or brokerages whose fees are directly linked to clients’
assessment of their forecasting ability.
The chapter proceeds as follows. Section 2.1 examines general issues that arise in
construction of loss functions. We discuss the mathematical setup of a loss function
before relating it to the forecaster’s decisions and examining some general properties
that loss functions have. Section 2.2 reviews specific loss functions commonly used in
economic forecasting problems, assuming there is only a single outcome to predict,
before extending the analysis in section 2.3 to cover cases with multiple outcome
variables. Section 2.4 considers loss functions (scoring rules) for distributional
forecasts, while section 2.5 provides some concrete examples of loss functions and
economic decision problems from macroeconomic and financial analysis. Section 2.6
concludes the chapter.

2.1 CONSTRUCTION AND SPECIFICATION OF THE LOSS FUNCTION
Let $Y$ denote the random variable describing the outcome of interest and let $\mathcal{Y}$ denote
the set of all possible outcomes. For outcomes that are either continuous or can take
on a very large number of possible values, typically $\mathcal{Y}$ is the real line, $\mathbb{R}$. In some
forecasting problems the set of possible outcomes, $\mathcal{Y}$, can be much smaller, such as
for a binary random variable where $\mathcal{Y} = \{0, 1\}$. For multivariate outcomes typically
$\mathcal{Y} = \mathbb{R}^k$ for some integer $k$, where $k$ is the number of forecasts to be evaluated.
Point forecasts are denoted by $f$ and are defined on the set $\mathcal{F}$. Typically we assume
$\mathcal{F} = \mathcal{Y}$ since in most cases it does not make sense to have forecasts that cannot take
on the same values as $Y$ or, conversely, have forecasts that can take on values that the
outcome $Y$ cannot. There are exceptions to this rule, however. For example, a forecast
of the number of children per family could be a fraction such as 1.9, indicating close
to 2 children, even though $Y$ cannot take this value. We assume that the predictors
$Z$ (as well as the outcome $Y$ and hence the forecast $f$) are real valued. Formally, the
loss function, $L(f, Y, Z)$, is then defined as a mapping $L : \mathcal{Y} \times \mathcal{Y} \times \mathcal{Z} \to \mathcal{L}$, where
$\mathcal{L}$ is in $\mathbb{R}^1$, and $\mathcal{Z}$ contains the set of possible values the conditioning variables, $z$,
can take. Often $\mathcal{L} = \mathbb{R}^1_+$, the set of nonnegative real numbers. Alternatively, we could
constrain the forecasts to lie in the convex hull of the set of all possible outcomes, i.e.,
$\mathcal{F} = \operatorname{conv}(\mathcal{Y})$. We discuss this further below.
A common assumption for loss functions is that loss is minimized when the
forecast is equal to the outcome—$\min_f L(f, y, z) = L(y, y, z)$. The idea is that if we
are to find a forecast that minimizes loss, then nothing dominates a perfect forecast.
In cases where the loss function does not depend on $Z$, so $L(f, Y, Z) = L(f, Y)$, it
is natural to normalize the loss function so that it takes a minimum value at 0. This
can be done without loss of generality by subtracting the loss associated with the
perfect forecast $f = y$, i.e., $L(f, Y) = \tilde{L}(f, Y) - \tilde{L}(Y, Y)$ for any loss function $\tilde{L}$.
For $f = y$ to be a unique minimum we must have $L(f, y) > 0$ for all $f \neq y$.1 More
generally, when the loss function L ( f, Y, Z) varies with Z, it may not be possible to
rescale the loss function in this manner. For example, a policy maker’s loss function
over inflation forecasts might depend on the unemployment rate so that losses from
incorrect inflation forecasts depend on whether the unemployment rate is high or
low. For simplicity, in what follows we will mostly drop the explicit dependence of
the loss function on $Z$ and focus on the simpler loss functions $L(f, Y)$.

1 In binary forecasting this condition is often not imposed. This usually does not affect the analysis but
only the interpretation of the calculated loss figures.
2.1.1 Constructing a Loss Function
Construction of loss functions, much like construction of prior distributions in
Bayesian analysis, requires a careful study of the forecasting problem at hand
and should reflect the actual trade-offs between forecast errors of different signs
and magnitudes. Laying out the trade-off can be straightforward if the decision
environment is fully specified and naturally results in a measurable outcome that
depends on the forecast. For example, for a profit-maximizing investor with a specific
trading strategy that requires forecasts of future asset prices, the natural choice
of loss is the function relating payoffs to the forecast and realized returns. Other
problems may not lead so easily to a specific loss function. For example, when the
IMF forecasts individual countries’ budget deficits, both short-term considerations
related to debt financing costs and long-term reputational concerns could matter.2
In such cases one can again follow a Bayesian prior selection strategy of defining
a function that approximates a reasonable shape of losses associated with decisions
based on incorrect forecasts.
Loss functions, as used by forecasters to evaluate their performance, and utility
functions, as used by economists to assess the economic value of different outcomes,
are naturally related. Both are grounded in the same decision-theoretic setup which
regards the forecast as the decision and the outcome as the true state and maps
pairs of outcomes (states) and forecasts (Y, f ) to the real line. In both cases we are
interested in minimizing the expected loss or disutility that arises from the decision.3
The relationship between utility and loss is examined in Granger and Machina
(2006), who show that the loss function can be viewed as the negative of a utility
function, although a more general relation of the following form holds:
    U(f, Y) = k(Y) − L(f, Y),                                    (2.1)

where k(Y) plays no role in the derivation of the optimal forecast.4
Example 2.1.1 (Squared loss and utility). Granger and Machina (2006) show that a
utility function U(f, Y) generates squared error loss, L(f, Y) = a(Y − f)², for a > 0, if and only if it takes the form

    U(f, Y) = k(Y) − a(Y − f)².                                  (2.2)

It follows that utility functions associated with squared error loss are restricted to a very
narrow set.
2 Forecasts can even have feedback effects on outcomes, as in the case of credit ratings companies whose credit scores can trigger debt payments for private companies that affect future ratings (Manso, 2013).
3 The first section of chapter 3 examines this issue in more detail.
4 Granger and Machina (2006) allow decisions to depend on forecasts without requiring that the two necessarily be identical. Instead they require that the function mapping forecasts to decisions is monotonic.

Academic studies often do not derive loss functions from first principles by
referring to utility functions or fully specified decision-theoretic problems, though
there are some exceptions. Loss functions that take the form of profit functions have
been used to evaluate forecasts by Leitch and Tanner (1991) and Elliott and Ito
(1999). West et al. (1993) compare utility-based and statistical measures of predictive
accuracy for exchange rate models. Examples of loss functions derived from utility
are provided in the final section of this chapter.
2.1.2 Common Properties of Loss Functions
Reasonable loss functions are grounded in economic decision problems. Under the
utility-maximizing approach, loss functions inherit well-known properties from the
utility function. Rather than deriving loss functions from first principles, however,
it is common practice to instead use loss functions with a “reasonable shape.” For
the loss function to be “reasonable,” a set of minimal properties should hold. Other
properties such as symmetry or homogeneity may suggest broad families of loss
functions with certain desirable characteristics. We cover both types of properties
below.
Trade-offs between different forecast errors when f ≠ y are quantified by the loss function. To capture the notion that bigger errors imply bigger losses, it is often imposed that the loss is nondecreasing as the forecast moves further away from the outcome. Mathematically, this means that L(f₂, y) ≥ L(f₁, y) for either f₂ > f₁ > y or f₂ < f₁ < y, for all real y. Nearly all loss functions used in practice have this
feature.
For loss functions that depend only on the forecast error, e = y − f, and thus take the form L(f, y) = L(e), Granger (1999) summarized these requirements:

    L(0) = 0 (minimal loss of 0);                                (2.3a)
    L(e) ≥ 0 for all e;                                          (2.3b)
    L(e) is nonincreasing in e for e < 0 and nondecreasing in e for e > 0:
        L(e₁) ≤ L(e₂) if e₂ < e₁ < 0,   L(e₁) ≤ L(e₂) if e₂ > e₁ > 0.    (2.3c)

As in the case with more general loss, L(f, y), condition (2.3a) simply normalizes the loss associated with the perfect forecast (y = f) to be 0. The second condition states that imperfect forecasts (y ≠ f) generate larger loss than perfect ones. Most common loss functions depend only on e; see section 2.2 for examples.
Other properties of loss functions such as homogeneity, symmetry, differentiability, and boundedness can be used to define broad classes of loss functions. We next
review these.
Homogeneity can be used to define classes of loss functions that lead to the same
decisions. Homogeneous loss functions factor in such a way that
    L(af, ay) = h(a)L(f, y),                                     (2.4)

for some positive function h(a), where the degree of homogeneity does not matter.
For loss functions that depend only on the forecast error, homogeneity amounts to
L (ae) = h(a)L (e) for some positive function h(a). Homogeneity is a useful property
when solving for optimal forecasts since the optimal forecast will be invariant to
different values of h(a).

Symmetry of the loss function refers to symmetry of the forecast around y. It is
the property that, for all f ,
    L(y − f, y) = L(y + f, y).                                   (2.5)

For loss functions that depend only on the forecast error, symmetry reduces to L(−e) = L(e), so that over- and underpredictions of the same magnitude lead to identical loss.5

5 A related concept is the class of bowl-shaped loss functions. A loss function is bowl shaped if the level sets {e : L(e) ≤ c} are convex and symmetric about the origin.
Most empirical work in economic forecasting assumes symmetric loss. This
choice reflects the difficulties in putting numbers on the relative cost of over- and
underpredictions. Construction of a loss function requires a deeper understanding
of the forecaster’s objectives and this may be difficult to accomplish. Still, the implicit
choice of MSE loss by the majority of studies in the forecasting literature seems
difficult to justify on economic grounds. As noted by Granger and Newbold (1986,
page 125), “an assumption of symmetry about the conditional mean. . . is likely to be
an easy one to accept. . . an assumption of symmetry for the cost function is much less
acceptable.”
Differentiability of the loss function with respect to the forecast is again a regularity condition that is useful and simplifies the numerical search for optimal forecasts. However, this condition may not be desirable and is certainly not required for a loss function to be well defined. In general, a finite number of points
where the loss function fails to be differentiable will not cause undue problems at the
estimation stage. However, when the loss function is extremely irregular, different
methods are required for understanding the statistical properties of the loss function
(see the maximum utility estimator in chapter 12).
Finally, loss functions may be bounded or unbounded. As a practical matter, there
is often no obvious reason to let the weight the loss function places on very large
forecast errors increase without bound. For example, the squared error loss function
examined below assigns very different losses to forecasts of, say, US inflation that
result in errors of 100% versus 500% even though it is not obvious that the associated
losses should really be very different since both forecasts would lead to very similar
actions. Unbounded loss functions can create technical problems for the analysis
of forecasts as the expected loss may not exist, so most results in decision theory
are derived under the assumption of bounded loss. In practice, forecasts are usually
bounded and extremely large forecasts typically get trimmed as they are deemed
implausible.
2.1.3 Existence of Expected Loss
Restrictions must be imposed on the form of the loss function to make sense of the
idea of minimizing the expected loss. Most basically, it is required that the expected
loss exists. Suppose the forecast depends on data Z through a vector of parameters, β,
which depends on the parameters of the data generating process, θ , so f = f (z, β).
From the definition of expected loss, we have

    E_Y[L(f(z, β), Y)] = ∫ L(f(z, β), y) p_Y(y|z, θ) dy,         (2.6)

where pY (y|z, θ ) is the predictive density of y given z, θ . When the space of
outcomes Y is finite, this expression is guaranteed to be finite. However, for outcomes
that are continuously distributed, restrictions must sometimes be imposed on the loss
function to ensure finite expected loss. The existence of expected loss depends both on the loss function and on the distribution of the predicted variable given the data, p_Y(y|z, θ), where θ denotes the parameters of this conditional distribution. Existence
of expected loss thus hinges on how large losses can get in relation to the tail behavior
of the predicted variable, as captured by pY (y|z, θ ).
A direct way to ensure that the expected loss exists is to bound the loss function
from above.6 From a practical perspective this would seem to be a sensible practice
in constructing loss functions. Even so, many of the most popular loss functions are
not bounded from above. In part this practice stems from not considering the loss
related to the forecasting problem at hand, but instead borrowing “off-the-shelf” loss
functions from estimation methods that lead to simple closed-form expressions for
the optimal forecast.
It is useful to demonstrate the conditions needed to ensure that the expected loss
exists. Following Elliott and Timmermann (2004), suppose that L depends only on
the forecast error, e = y − f , and lends itself to a Taylor-series expansion around the
mean error, μ_e = E_Y[Y − f]:

    L(e) = L(μ_e) + L′_{μ_e}(e − μ_e) + ½L″_{μ_e}(e − μ_e)² + Σ_{k=3}^∞ (1/k!) L^k_{μ_e}(e − μ_e)^k,    (2.7)

where L^k_{μ_e} denotes the kth derivative of L evaluated at μ_e. Suppose there are only a finite number of points where L is not analytic and that these can be ignored because they occur with probability 0. Taking expectations in (2.7), we then get

    E[L(e)] = L(μ_e) + ½L″_{μ_e} E_Y[(e − μ_e)²] + Σ_{k=3}^∞ (1/k!) L^k_{μ_e} E_Y[(e − μ_e)^k]
            = L(μ_e) + ½L″_{μ_e} E_Y[(e − μ_e)²] + Σ_{k=3}^∞ (1/k!) L^k_{μ_e} Σ_{i=0}^k C(k, i) E_Y[e^{k−i}](−μ_e)^i
            = L(μ_e) + ½L″_{μ_e} E_Y[(e − μ_e)²] + Σ_{k=3}^∞ Σ_{i=0}^k [L^k_{μ_e}/(i!(k − i)!)] E_Y[e^{k−i}](−μ_e)^i.    (2.8)
This expression is finite provided that all moments of the error distribution exist for
which the corresponding derivative of the loss function with respect to the forecast
error is nonzero. This is a strong requirement and rules out some interesting combinations of loss functions and forecast error distributions. For example, exponential
loss (or the Linex loss function defined below) and a Student-t distribution with a finite number of degrees of freedom would lead to infinite expected loss since not all higher-order moments exist for this distribution. What is required to make


the higher-order terms in (2.8) vanish is that the tail decay of the predicted variable is sufficiently fast relative to the weight on these terms implied by the loss function.

6 This is sufficient since we have already bounded the loss function (typically at 0) from below.
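The role of tail decay is easy to illustrate by simulation. Below is a minimal sketch (ours, not from the text) comparing sample averages of Linex loss (defined in section 2.2.1.4) under normally distributed and Student-t distributed forecast errors; the former settle down as the sample grows, while the latter are dominated by the largest draws, reflecting the nonexistence of the expected loss:

    import numpy as np

    rng = np.random.default_rng(0)
    a1, a2 = 1.0, 1.0

    def linex(e):
        # Linex loss, equation (2.14) below
        return a1 * (np.exp(a2 * e) - a2 * e - 1.0)

    for n in (10**3, 10**5, 10**7):
        e_normal = rng.standard_normal(n)         # thin tails: expected loss exists
        e_student = rng.standard_t(df=3, size=n)  # heavy tails: expected loss infinite
        print(n, linex(e_normal).mean(), linex(e_student).mean())

The first column of averages converges; the second is erratic and tends to grow with n, as no law of large numbers applies.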

2.1.4 Loss Functions Not Based on Expected Loss
So far we have characterized the loss function L ( f, Y) for a univariate outcome,
and defined its properties with reference to a “one-shot” problem. This makes sense
when forecasting is placed in a decision-theoretic or utility-maximization context.
This approach to forecasting is internally consistent, from initially setting up the
problem to defining the expected loss and conducting model estimation and forecast
evaluation.
Some loss functions that have been used in practice are based directly on sample
statistics without relating the sample loss to a population loss function. In cases where
such a population loss function exists and satisfies reasonable properties, this does
not cause any problems. Basing the loss function directly on a sample of losses can,
however, sometimes yield a loss function that does not make sense in population or
for fully specified decision problems. Loss functions that do not map back to decision
problems often have poor and unintended properties. We consider one such example
below.
Example 2.1.2 (Kuipers score for binary outcome). Let f = {1, −1} be a forecast
of the binary variable y = {1, −1} and let n j,k , j, k ∈ {−1, 1} be the number of
observations for which the forecast equals j and the outcome equals k. The Kuipers
score is given by

    KuS = n_{1,1}/(n_{1,1} + n_{−1,1}) − n_{1,−1}/(n_{1,−1} + n_{−1,−1}).    (2.9)

This is the positive hit rate, i.e., the proportion of times where y = 1 is correctly predicted, less the "false positive rate," i.e., the proportion of times where the outcome y = −1 is wrongly predicted to be 1. This can equivalently be thought of as

    KuS = n_{1,1}/(n_{1,1} + n_{−1,1}) + n_{−1,−1}/(n_{1,−1} + n_{−1,−1}) − 1,    (2.10)

which is the hit rate for y = 1 plus the hit rate for y = −1 minus a centering constant
of 1. The Kuipers score is positive if the sum of the positive and negative hit rates
exceeds 1. For a sample with a single observation, this definition makes no sense, as
one of the denominators in (2.10) is 0: either n1,1 + n−1,1 = 0 or n1,−1 + n−1,−1 = 0.
For a single observation, this sample statistic does not follow from any obvious loss
function. The first term in (2.10) is the sample analog of P [ f = 1|Y = 1] and the
second is the sample analog of P [ f = −1|Y = −1]. However, they do not combine
to a loss function with this sample analog. This failure to embed the loss function into
the expected loss framework results in odd properties for the objective. For example,
the definition of KuS in (2.9) implies that the marginal value of an extra “hit,” i.e., a
correct call, depends on the sample proportion of hits. To see this, consider the improvement in KuS from adding a single successfully predicted observation y = 1, f = 1.

The resulting improvement in the hit rate is

    ΔKuS = (n_{1,1} + 1)/(n_{1,1} + 1 + n_{−1,1}) − n_{1,1}/(n_{1,1} + n_{−1,1})
         = n_{−1,1}/[(n_{1,1} + n_{−1,1})(n_{1,1} + 1 + n_{−1,1})].

Thus the marginal value of a correct call depends on the total number of observations
and the proportion of missed hits prior to the new observation. The Kuipers score’s poor
properties arise from the lack of justification of its setup for a population problem.
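The arithmetic in this example is easy to reproduce. A short sketch (function and variable names ours) computes the Kuipers score from the four counts and confirms that the gain from one extra correct call of y = 1 matches the closed-form expression above:

    def kuipers(n11, nm11, n1m1, nm1m1):
        # Kuipers score (2.9); the counts n_{f,y} pair forecast f with outcome y
        return n11 / (n11 + nm11) - n1m1 / (n1m1 + nm1m1)

    base = kuipers(10, 10, 5, 25)
    one_more_hit = kuipers(11, 10, 5, 25)     # one extra correct call of y = 1
    print(one_more_hit - base)                # about 0.0238
    print(10 / ((10 + 10) * (10 + 1 + 10)))   # closed form above: same number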

2.2 SPECIFIC LOSS FUNCTIONS
We next review various families of loss functions that have been suggested in the forecasting literature. The vast majority of empirical work on forecasting assumes that
the loss function depends only on the forecast error, e = Y − f , i.e., the difference
between the outcome and the forecast. In this case we can write L ( f, Y, Z) = L (e).
In general, loss functions can be more complicated functions of the outcome and
forecast and take the form L ( f, Y) or L ( f, Y, Z).
2.2.1 Loss That Depends Only on Forecast Errors
The most commonly used loss functions, including squared error loss and absolute
error loss, depend only on the forecast error. For such loss functions, L ( f, Y, Z) =
L (e), so the loss function takes a particularly simple form.
2.2.1.1 Squared Error Loss
By far the most popular loss function in empirical studies is squared error loss, also
known as quadratic or mean squared error (MSE) loss:
    L(e) = ae²,   a > 0.                                         (2.11)

This loss function clearly satisfies the three Granger properties listed in (2.3). When
viewed as a family of loss functions—corresponding to different values of the scalar
a—squared error loss forms a homogeneous class.7 It is symmetric, bowl shaped, and
differentiable everywhere and penalizes large forecast errors at an increasing rate due
to its convexity in |e|. The loss function is not bounded from above. Large forecast
errors or “outliers” are thus very costly under this loss function.
2.2.1.2 Absolute Error Loss
Rather than using squared error loss, which results in increasingly large losses for
large forecast errors, the absolute error is preferred in some cases. Under mean
absolute error (MAE) loss,
    L(e) = a|e|,   a > 0.                                        (2.12)

7 While the scaling factor, a, does not matter to the properties of the optimal forecast, it is common to set a = 0.5, which removes the "2" that arises from taking first derivatives.

[Figure 2.1: MSE loss versus lin-lin loss for different values of the lin-lin asymmetry parameter, α. Panels (top to bottom): α = 0.25, α = 0.5 (MAE loss), and α = 0.75.]

Like MSE loss, this loss function satisfies the three Granger properties listed in (2.3).
The loss function is symmetric, bowl shaped, and differentiable everywhere except
at 0. It is again unbounded. However, the penalty to large forecast errors increases
linearly rather than quadratically as for MSE loss.
2.2.1.3 Piecewise Linear Loss
Piecewise linear, or so-called lin-lin loss, takes the form
    L(e) = −a(1 − α)e   if e ≤ 0,
         =  aαe         if e > 0,        a > 0,                  (2.13)

for 0 < α < 1. Positive forecast errors are assigned a (relative) weight of α, while
negative errors get a weight of 1 − α. The greater is α, the bigger the loss from positive
forecast errors, and the smaller the loss from negative errors. Again, this loss function
forms a homogeneous class for all positive values of a. It is common to set a = 1, so
that the weights are normalized to sum to 1.
Lin-lin loss clearly satisfies the three Granger properties. Moreover, it is differentiable everywhere, except at 0. Compared to MSE loss, this loss function does
not penalize large errors as much. MAE loss arises as a special case of lin-lin loss
if α = 0.5, in which case (2.13) simplifies to (2.12).
Figure 2.1 plots lin-lin loss against squared error loss. The middle window shows
the symmetric case with α = 0.5, and so corresponds to MAE loss. Small forecast
errors (|e| < 1) are costlier under MAE loss than under MSE loss, while conversely

large errors are costlier under MSE loss. The top window assumes that α = 0.25, so
negative forecast errors are three times as costly as positive errors, reflected in the
steeper slope of the loss curve for e < 0. In the bottom window, α = 0.75 and so
positive forecast errors are three times costlier than negative errors.
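For concreteness, here is a minimal sketch (function names ours) of the loss functions in (2.12)–(2.13), checking numerically that lin-lin loss with α = 0.5 coincides with MAE loss (for a = 0.5) and that α = 0.25 makes negative errors three times as costly as positive ones:

    import numpy as np

    def mae_loss(e, a=1.0):
        # absolute error loss (2.12)
        return a * np.abs(e)

    def linlin_loss(e, alpha, a=1.0):
        # piecewise linear loss (2.13)
        return np.where(e > 0, a * alpha * e, -a * (1.0 - alpha) * e)

    e = np.linspace(-3.0, 3.0, 7)
    print(np.allclose(linlin_loss(e, alpha=0.5), mae_loss(e, a=0.5)))  # True
    print(linlin_loss(e, alpha=0.25))   # slope 0.75 for e < 0, 0.25 for e > 0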
2.2.1.4 Linex Loss
Linear-exponential, or Linex, loss takes the form
    L(e) = a₁(exp(a₂e) − a₂e − 1),   a₂ ≠ 0, a₁ > 0.             (2.14)

Linex loss is differentiable everywhere, but is not symmetric. Varian (1975) used this
loss function to analyze real estate assessments, while Zellner (1986a) used it in the
context of Bayesian prediction problems.
The parameter a2 controls both the degree and direction of asymmetry. When
a2 > 0, Linex loss is approximately linear for negative forecast errors and approximately exponential for positive forecast errors. In this case, large underpredictions ( f < y, so e = y − f > 0) are costlier than overpredictions of the same
magnitude, with the relative cost increasing as the magnitude of the forecast error
rises. Conversely, for a2 < 0, large overpredictions are costlier than equally large
underpredictions.
Although Linex loss is not defined for a₂ = 0, setting a₁ = 2/a₂² and taking the limit as a₂ → 0, by L'Hôpital's rule the Linex loss function approaches squared error loss:

    lim_{a₂→0} L(e) = lim_{a₂→0} 2(e·exp(a₂e) − e)/(2a₂) = lim_{a₂→0} e²·exp(a₂e) = e².

Figure 2.2 plots MSE loss against Linex loss for a2 = 1 (top) and a2 = −1
(bottom). Measured relative to the benchmark MSE loss, large positive (top) or
large negative (bottom) forecast errors are very costly in these respective cases. This
loss function has been used in many empirical studies on variables such as budget
forecasts (Artis and Marcellino, 2001) and survey forecasts of inflation (Capistrán
and Timmermann, 2009). Christoffersen and Diebold (1997) examine this loss
function in more detail.
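The limit is easy to verify numerically. A small sketch (ours), using the scaling a₁ = 2/a₂² from the argument above:

    import numpy as np

    def linex(e, a2):
        a1 = 2.0 / a2**2                  # scaling under which the limit is e**2
        return a1 * (np.exp(a2 * e) - a2 * e - 1.0)

    e = np.linspace(-3.0, 3.0, 7)
    for a2 in (1.0, 0.1, 0.001):
        gap = np.max(np.abs(linex(e, a2) - e**2))
        print(a2, gap)                    # the gap shrinks as a2 -> 0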
2.2.1.5 Piecewise Asymmetric Loss
A general class of asymmetric loss functions can be constructed by letting the loss
function shift at a discrete set of points, {ē 1 , . . . , ē n−1 }:
    L(e) = L₁(e)   if e ≤ ē₁,
           L₂(e)   if ē₁ < e ≤ ē₂,
           ...
           Lₙ(e)   if e > ē_{n−1}.                               (2.15)
[Figure 2.2: MSE loss versus Linex loss for different values of the Linex parameter, a₂. Panels: right-skewed Linex loss with a₂ = 1 (top) and left-skewed Linex loss with a₂ = −1 (bottom).]

Here ē_{i−1} < ē_i for i = 2, . . . , n − 1. It is common to set n = 2, choose ē₁ = 0, and assume that both pieces of the loss function satisfy the usual loss properties so that the loss is piecewise asymmetric around 0 and continuous (but not necessarily
differentiable) at 0. Lin-lin loss in (2.13) is a special case of (2.15), as is the asymmetric quadratic loss function

    L(e) = (1 − α)e²   if e ≤ 0,
         =  αe²        if e > 0,                                 (2.16)

considered by Artis and Marcellino (2001), Newey and Powell (1987) and Weiss
(1996).
A flexible class of loss functions proposed by Elliott, Komunjer, and Timmermann
(2005) sets n = 2 and ē 1 = 0 in (2.15), while L 1 (e) = (1 − α)|e| p and L 2 (e) = α|e| p ,
where p is a positive integer, and α ∈ (0, 1). This gives the EKT loss function,
    L(e) ≡ [α + (1 − 2α)·1(e < 0)]|e|^p,                         (2.17)

where 1(e < 0) is an indicator function that equals 1 if e < 0, otherwise equals
0. Letting α deviate from 0.5 produces asymmetric loss, with larger values of α
indicating greater aversion to positive forecast errors. Imposing p = 1 and α = 0.5,
MAE loss is obtained. More generally, setting p = 1, (2.17) reduces to lin-lin loss
since the loss is linear on both sides of 0, but with different slopes. Setting p = 2 and
α = 0.5 gives the MSE loss function which is therefore also nested as a special case, as
is the asymmetric quadratic loss function (2.16) for p = 2, α ∈ (0, 1). Hence, the EKT
family of loss functions nests the loss functions in (2.11), (2.12), (2.13), and (2.16) as
special cases and generalizes many of the commonly employed loss functions.
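The nesting claims can be checked directly. A short sketch (ours) of the EKT family in (2.17):

    import numpy as np

    def ekt_loss(e, alpha, p):
        # EKT loss (2.17); (e < 0) acts as the indicator 1(e < 0)
        return (alpha + (1.0 - 2.0 * alpha) * (e < 0)) * np.abs(e)**p

    e = np.linspace(-3.0, 3.0, 13)
    print(np.allclose(ekt_loss(e, 0.5, 2), 0.5 * e**2))        # MSE loss, a = 0.5
    print(np.allclose(ekt_loss(e, 0.5, 1), 0.5 * np.abs(e)))   # MAE loss, a = 0.5
    print(np.allclose(ekt_loss(e, 0.25, 1),                    # lin-lin, alpha = 0.25
                      np.where(e > 0, 0.25 * e, -0.75 * e)))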

[Figure 2.3: MSE loss versus EKT loss with p = 3 for different values of the asymmetry parameter, α. Panels: α = 0.25 (top) and α = 0.75 (bottom).]

Figure 2.3 plots the EKT loss function for p = 3, α = 0.25 (top) and α = 0.75
(bottom). Compared with MSE loss, substantial asymmetries can be generated by
this loss function.
Empirically, the EKT loss function has been used to analyze forecasts of government budget deficits produced by the IMF and OECD (Elliott, Komunjer, and
Timmermann, 2005), the Federal Reserve Board’s inflation forecasts (Capistrán,
2008), as well as output and inflation forecasts from the Survey of Professional
Forecasters (Elliott, Komunjer, and Timmermann, 2008).

2.2.1.6 Binary Loss
When the space of outcomes Y is discrete, the forecast errors typically take on only
a small number of possible values. Hence in constructing a loss function for such
problems, all that is required is to evaluate each of a small number of possibilities.
The simplest case arises when forecasting a binary outcome so that Y = {−1, 1} or
Y = {0, 1}. In this case there are only four possible pairings of the point forecast and
outcome: two where the forecast gives the correct outcome and two errors. If we
restrict the loss function to not depend on Z (this case is examined below) and also
restrict the problem so that a correct forecast has the same value regardless of the

value for Y, then the binary loss function can be written as8

    L(f, y) = 0         if f = y = 0,
            = (1 − c)   if f = 0, y = 1,
            = c         if f = 1, y = 0,
            = 0         if f = y = 1.                            (2.18)

Here we have set the loss from a correct prediction to 0 and normalized the losses
from an incorrect forecast to sum to 1 by dividing by their sum; see Schervish (1989),
Boyes, Hoffman, and Low (1989), Granger and Pesaran (2000), and Elliott and Lieli
(2013).
For (2.18) to be a valid loss function, we require that 0 < c < 1. This ensures that
the properties of the loss function listed in (2.3) hold. Notice that the binary loss
function can be written as L(e), since the loss is equal to

    L(f, y) = c·1(e < 0) + (1 − c)·1(e > 0).

8 See chapter 12 for a comprehensive treatment of forecast analysis under this loss function.
2.2.2 Level- and Forecast-Dependent Loss Functions
Economic loss is mostly assumed to depend on only the forecast error, e = Y − f .
This is too restrictive an assumption for situations in which the forecaster’s objective
function depends on state variables such as the level of the outcome variable Y.
More generally, we can consider loss functions of the form L(f, y) = L(e, y). The most common level-dependent loss function is the mean absolute percentage error (MAPE), given by

    L(e, y) = a|e/y|.                                            (2.19)

Since the forecast and forecast error have the same units as the outcome, the MAPE
is a unitless loss function. This is considered to be an advantage when constructing
the sample analog of this loss function and employing it to evaluate forecast methods
across outcomes measured in different units. If the loss function is well grounded in
terms of the actual costs arising from the forecasting problem, dependence on units
does not seem to be an important issue—comparisons across different forecasts with
different units should be related not through some arbitrary adjustment but instead
in a way that trades off the costs associated with the forecast errors for each of the
outcomes. This is achieved by the multivariate loss functions examined in the next
section.
Scaling the forecast error by the outcome in (2.19) has the effect of weighting
forecast errors more heavily when y is near 0 than when y is far from 0. This is
difficult to justify in many applications. Moreover, if the predictive density for Y has
nontrivial mass at 0, then the expected loss is unlikely to exist, hence invalidating
many of the results from decision theory for this case. Nonetheless, MAPE loss
remains popular in many practical forecast evaluation experiments.
More generally, level- and forecast-dependent loss functions can be written as
L ( f, y) but do not reduce to L (e) or L (e, y). Although loss functions in this class

are not particularly common, there are examples of their use. For example, Bregman (1967) suggested loss functions of the form

    L(f, y) = φ(y) − φ(f) − φ′(f)(y − f),                        (2.20)

where φ is a strictly convex function, so φ″ > 0. Squared error loss is nested as a special case of (2.20).
Differentiating (2.20) with respect to the forecast, f, we get

    ∂L(f, y)/∂f = −φ′(f) − φ″(f)(y − f) + φ′(f) = −φ″(f)(y − f),

which generally depends on both y and f. This, along with the assumption that φ″ > 0, ensures that the conditional mean is the optimal forecast. Bregman loss is further discussed in Patton (2015).
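A brief numerical sketch (ours; the two choices of φ are only illustrative members of the family) confirms that expected Bregman loss is minimized near the unconditional mean of a skewed outcome:

    import numpy as np

    def bregman(f, y, phi, dphi):
        # Bregman loss (2.20): phi(y) - phi(f) - phi'(f)(y - f)
        return phi(y) - phi(f) - dphi(f) * (y - f)

    rng = np.random.default_rng(1)
    y = rng.gamma(shape=2.0, scale=1.0, size=200_000)   # skewed outcome, E[Y] = 2
    grid = np.linspace(1.0, 3.0, 201)
    for phi, dphi in ((lambda x: x**2, lambda x: 2.0 * x),               # squared error
                      (lambda x: x * np.log(x), lambda x: np.log(x) + 1.0)):
        risks = [bregman(f, y, phi, dphi).mean() for f in grid]
        print(grid[int(np.argmin(risks))])              # both minimizers close to 2.0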
In an empirical application of level-dependent loss, Patton and Timmermann
(2007b) find that the Federal Reserve’s forecasts of output growth fail to be optimal
if their loss is restricted to depend only on the forecast error. Rationalizing the
Federal Reserve’s forecasts requires not only that overpredictions of output growth
are costlier than underpredictions, but also that overpredictions of output are
particularly costly during periods of low economic growth. This finding can be
justified if the cost of an overly tight monetary policy is particularly high during
periods with low economic growth when such a policy may cause or extend a
recession.9
2.2.3 Loss Functions That Depend on Other State Variables
Under some simplifying assumptions we saw earlier that the binary loss function
takes a particularly simple form. More generally, if the loss function depends on Z
and the loss associated with a perfect forecast depends on the outcome Y, then the
loss function for the binary problem becomes
    L(f, y, z) = −u_{1,1}(z)   if f = 1 and y = 1,
               = −u_{1,0}(z)   if f = 1 and y = 0,
               = −u_{0,1}(z)   if f = 0 and y = 1,
               = −u_{0,0}(z)   if f = 0 and y = 0,               (2.21)

where u_{i,j}(z) are the utilities gained when f = i, y = j, and Z = z.
In this general form, the loss function cannot be simplified to depend only on the
forecast error. Again restrictions need to be imposed on the losses in (2.21). First,
we require that u_{0,0}(z) > u_{1,0}(z) and u_{1,1}(z) > u_{0,1}(z) so that losses associated with correct forecasts are not higher than those associated with incorrect forecasts. We might also impose that min{u_{0,0}(z), u_{1,1}(z)} > max{u_{1,0}(z), u_{0,1}(z)} so that correct forecasts result in a lower loss (higher utility) than incorrect forecasts. Finally, it is

quite reasonable to assume that correct forecasts are associated with different losses, u_{0,0}(z) ≠ u_{1,1}(z), in which case normalizing the loss associated with a perfect forecast to 0 will not be possible for both outcomes. This is an example of level-dependent loss being built directly into the loss function.

9 Some central banks desire to keep inflation within a band of 0 to 2% per annum. Inflation within this band might be regarded as a successful outcome, whereas deflation or inflation above 2% is viewed as failure. Again this is indicative of a nonstandard loss function; see Kilian and Manganelli (2008).
2.2.4 Consistent Ranking of Forecasts with Measurement Errors in the Outcome
Hansen and Lunde (2006) and Patton (2011) consider the problem of comparing
and consistently ranking volatility forecasts from different models when the observed
outcome is measured with noise. This situation is common in volatility forecasting or
in macro forecasting where the outcome may subsequently be revised. The volatility
of asset returns is never actually observed although a proxy for it can be constructed.
Volatility forecast comparisons typically use realized volatility, squared returns, or
range-based proxies, σ̂ 2 , in place of the true variance, σ 2 .
Hansen and Lunde establish sufficient conditions under which noisy proxies can
be used in the forecast evaluation without giving rise to rankings that are inconsistent
with the (infeasible) ranking based on the true outcome.
Patton defines a loss function as being robust to measurement errors in the
outcome if it gives the same expected-loss ranking of two forecasts whether based
on the true (but unobserved) outcome or some unbiased proxy thereof. Specifically,
a loss function is robust to such measurement errors if, for two forecasts f 1 and f 2 ,
the ranking based on the true outcome, y,

    E[L(f₁, y)] ≶ E[L(f₂, y)],

is the same as the ranking based on the proxied outcome, ŷ:

    E[L(f₁, ŷ)] ≶ E[L(f₂, ŷ)],

for unbiased proxies ŷ satisfying E[ŷ|Z] = y, where Z is again the information set used to generate the forecasts.
Patton (2011, Proposition 1) establishes conditions under which robust loss
functions must belong to the following family:
    L(f, ŷ) = C̃(f) + B(ŷ) + C(f)(ŷ − f),                        (2.22)

where B and C are twice continuously differentiable functions, C is strictly decreasing, and C̃ is the antiderivative of C, i.e., C̃′ = C.10 In Patton's analysis f is a volatility forecast and ŷ is a proxy for the realized volatility. Examples of loss functions in the family (2.22) include MSE and QLIKE loss:

    MSE:   L(f, ŷ) = (ŷ − f)²,
    QLIKE: L(f, ŷ) = log(f) + ŷ/f.

10 If B = −C̃, this family of loss functions yields the Bregman family in equation (2.20).
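A simulation sketch (ours; the data-generating process is purely illustrative) of this robustness property: with a squared return as the unbiased proxy, both MSE and QLIKE rank a perfect variance forecast above a constant one.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500_000
    sigma2 = 0.5 + rng.gamma(2.0, 0.5, size=n)     # true conditional variance
    r = np.sqrt(sigma2) * rng.standard_normal(n)   # return with that variance
    proxy = r**2                                   # unbiased proxy for sigma2

    f_good = sigma2                                # (infeasible) perfect forecast
    f_bad = np.full(n, sigma2.mean())              # constant forecast

    def mse(f, y):
        return (y - f)**2

    def qlike(f, y):
        return np.log(f) + y / f

    for loss in (mse, qlike):
        print(loss(f_good, proxy).mean() < loss(f_bad, proxy).mean())  # True, True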

2.3 MULTIVARIATE LOSS FUNCTIONS
When a decision maker’s objectives depend on multiple variables, the loss function
needs to be extended from being defined over scalar outcomes to depend on a vector
of outcomes. This situation arises, for example, for a central bank concerned with
both inflation and employment prospects.
Conceptually it is easy to generalize univariate loss functions to the multivariate
case, although difficulties may arise in determining how costly different combinations of forecast errors are. How individual forecast errors or their cross products are
weighted becomes particularly important.
The most common multivariate loss function is multivariate quadratic error loss,
also known as multivariate MSE loss; see Clements and Hendry (1993). This loss
function maps a vector of forecast errors e = (e₁, . . . , eₙ)′ to the real number line and so is simply a weighted average of the individual squared forecast errors and their cross products:11

    MSE(A) = e′Ae.                                               (2.23)
Here the (n × n) matrix A is required to be nonnegative and positive definite. This is
the matrix equivalent of the univariate assumption for MSE loss that a > 0 in (2.11).
As noted in the discussion of MAPE loss, the loss function in (2.23) may be
difficult to interpret when the predicted variables are measured in different units.
This concern is related to obtaining a reasonable specification of the loss function
whose role it is to compare and trade off losses of different sizes across different
variables. Hence this is not really a limitation of the loss function itself but of
applications of the loss function.
The loss function in (2.23) is “bowl shaped” in the sense that the level sets are
convex and symmetric around 0. It is easily verified that (2.23) satisfies the basic
assumptions for a loss function in (2.3). If the entire vector of forecast errors is 0,
then the loss is 0. A positive-definite and nonnegative weighting matrix A ensures
that losses rise as forecast errors get larger, so assumption (2.3c) holds.12
A special case arises when A = Iₙ, the (n × n) identity matrix. In this case covariances can be ignored and the loss function simplifies to MSE(Iₙ) = E[e′e] = tr E[ee′], i.e., the sum of the individual mean squared errors. Thus, a loss function
based on the trace of the covariance matrix of forecast errors is simply a special case of
the general form in (2.23). In general, however, covariances between forecast errors
come into play, reflecting the cross products corresponding to the off-diagonal terms
in A.
As a second example of a multivariate loss function, Komunjer and Owyang (2012) provide an interesting generalization of the Elliott, Komunjer, and Timmermann (2005) loss function in (2.17) to the case where e = (e₁, . . . , eₙ)′.
11 While the vector of forecast errors could represent different variables, it could also comprise forecast errors for the same variable measured at different horizons, corresponding to short and long-horizon forecasts.
12 Positive-definiteness alone is not sufficient to guarantee that the multivariate equivalent to (2.3) holds. Suppose n = 2 and let A be a symmetric matrix with 2 on the diagonals and −1 in the off-diagonal cells. A is positive definite but the marginal effect of making a bigger error on the second forecast is 4e₂ − 2e₁, where e = (e₁, e₂)′. Hence if e₂ < e₁/2, increasing the error associated with the second forecast would reduce loss, thus violating (2.3).
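The example in the footnote is easy to verify numerically; in the sketch below (ours), the loss e′Ae falls when the second forecast error gets bigger:

    import numpy as np

    A = np.array([[2.0, -1.0],
                  [-1.0, 2.0]])          # symmetric and positive definite

    def loss(e):
        # multivariate MSE loss (2.23)
        return e @ A @ e

    print(loss(np.array([2.0, 0.1])))    # 7.62
    print(loss(np.array([2.0, 0.5])))    # 6.50: a larger e2 lowers the loss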

Let ‖e‖_p = (|e₁|^p + · · · + |eₙ|^p)^{1/p} be the l_p norm of e and assume that the n-vector of asymmetry parameters, α, satisfies ‖α‖_q < 1. Further, let 1 ≤ p ≤ ∞ and, for a given value of p, set q so that 1/p + 1/q = 1. The multivariate loss function proposed by Komunjer and Owyang takes the form

    L(e) = (‖e‖_p + α′e)‖e‖_p^{p−1}.                             (2.24)

As in the univariate case, the extent to which large forecast errors are penalized
relative to small ones is determined by the exponent, p. However, now the full
vector α = (α₁, . . . , αₙ)′ characterizes the asymmetry in the loss function, with α = 0
representing the symmetric case. Since α is a vector, this loss function offers great
flexibility in both the magnitude and direction of asymmetry for multivariate loss
functions.
Other multivariate loss functions have been used empirically. Laurent, Rombouts,
and Violante (2013) consider a multivariate version of the family of loss functions
introduced by Patton (2011), and apply it to volatility forecasting.

2.4 SCORING RULES FOR DISTRIBUTION FORECASTS
So far we have focused our discussion on point forecasts, but forecasts of the full
distribution of outcomes are increasingly reported. Just as point forecasting requires
a loss-based measure of the distance between the forecast f and the outcome Y,
distribution forecasts also require a loss function. These are known as scoring rules
and reward forecasters for making more accurate predictions, i.e., predictions that
are “closer” to the observed outcome get a higher score, where closeness depends on
the shape of the scoring rule. Gneiting and Raftery (2007) provide a survey of scoring
rules and discuss their properties.
Scoring rules, S( p, y), are mappings of predictive probability distributions, p, and
outcomes, y, to the real line. Suppose a forecaster uses the predictive probability
distribution, p, while the probability distribution used to evaluate the “goodness
of fit” of p is denoted p0 . Then the expected value of S( p, y) under p0 is denoted
S( p, p0 ). A scoring rule is called strictly proper if the forecaster’s best probability
distribution is p0 , i.e., S( p0 , p0 ) ≥ S( p, p0 ) with equality holding only if p = p0 .
In this situation there will be no incentive for the forecaster to use a probability
distribution p ≠ p₀ since this would reduce the score. The performance of a given candidate probability distribution, p, relative to the optimal rule, can be measured through the so-called divergence function

    d(p, p₀) = S(p₀, p₀) − S(p, p₀).                             (2.25)

Notice the similarity to the normalization in equation (2.3a) for loss functions based
on point forecasts in (2.3): the divergence function obtains its minimum value of
0 only if p = p0 , and otherwise takes a positive value. The forecaster’s objective of
maximizing the scoring rule thus translates into minimizing the divergence function.
Several scoring rules have been used in the literature. Many of these have been
considered for categorical data limited to discrete outcomes y = (y1 , . . . , ym ) with
associated probabilities { p1 , . . . , pm }. Denote by pi the predicted probability that

corresponds to the range that includes yᵢ. The logarithmic score,

    S(p, yᵢ) = log(pᵢ),                                          (2.26)

gives rise to the well-known Kullback–Leibler divergence measure,

    d(p, p₀) = Σ_{j=1}^m p₀ⱼ log(p₀ⱼ/pⱼ).                        (2.27)

Similarly, the quadratic or Brier score,

    S(p, yᵢ) = 2pᵢ − Σ_{j=1}^m pⱼ² − 1,                          (2.28)

generates the squared divergence

    d(p, p₀) = Σ_{j=1}^m (pⱼ − p₀ⱼ)².                            (2.29)

For density forecasts defined over continuous outcomes the logarithmic and quadratic scores take the form

    S(p, y) = log p(y),
    S(p, y) = 2p(y) − ∫ p(y)² μ(dy),
where μ(·) is the probability measure associated with the outcome, y. Both are proper
scoring rules. By contrast, the linear score, S( p, y) = p(y), can be shown not to be a
proper scoring rule; see Gneiting and Raftery (2007).
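A small sketch (ours) computes both scores for a three-category example and checks propriety and the two divergence formulas numerically:

    import numpy as np

    p0 = np.array([0.5, 0.3, 0.2])        # distribution used for evaluation
    p = np.array([0.6, 0.2, 0.2])         # candidate forecast distribution

    def log_score(q):
        # expected log score S(q, p0) under p0, from (2.26)
        return p0 @ np.log(q)

    def brier_score(q):
        # expected Brier score S(q, p0) under p0, from (2.28)
        return p0 @ (2.0 * q - np.sum(q**2) - 1.0)

    print(log_score(p0) >= log_score(p))          # True: propriety
    print(brier_score(p0) >= brier_score(p))      # True: propriety
    print(p0 @ np.log(p0 / p),                    # KL divergence (2.27) ...
          log_score(p0) - log_score(p))           # ... equals the score gap
    print(np.sum((p - p0)**2),                    # squared divergence (2.29) ...
          brier_score(p0) - brier_score(p))       # ... equals the score gap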
Which scoring rule to use in a given situation depends, of course, on the
underlying objectives for the problem at hand and the choice should most closely
resemble the costs involved in the decision problem. To illustrate this point, we next
provide an example from the semiconductor supply chain.
Example 2.4.1 (Loss function for semiconductors). Cohen et al. (2003) construct an
economically motivated loss or cost function for a semiconductor equipment supply
chain. Supply firms are assumed to hold soft orders from clients which may either be
canceled (with probability π ) or get finalized (with probability 1 − π) at some later
date, y N , when the final information arrives. Given such orders, firms attempt to
optimally determine the timing of the production start, yπ , where y N > yπ due to a
production lead-time delay. If an order is canceled, the supplier incurs a cancelation
cost, c, per unit of time. Let y denote the final delivery date in excess of the production
lead time. If this exceeds the production date, the supplier will incur holding (inventory)
costs, h, per unit of time. Conversely, if the production start date, yπ , exceeds y, the
company will not be able to meet the requested delivery date and so incurs a delay cost
of g per unit of time. Cohen et al. (2003) assume that suppliers choose the production

date, yπ , so as to minimize the expected total cost


    E[L(y_π, y, y_N)] = π·c ∫_{y_π}^∞ (y_N − y_π) dP_N(y_N)
                       + (1 − π)[h ∫_{y_π}^∞ (y − y_π) dP_y(y) + g ∫_{−∞}^{y_π} (y_π − y) dP_y(y)],

where P_y(y) and P_N(y_N) are the cumulative distribution functions of y and y_N, respectively. Provided that this expression is convex in y_π, the cost-minimizing production time, y_π*, can be shown to solve the first-order condition

    π·c·P_N(y_π*) + (1 − π)(g + h)P_y(y_π*) = π·c + (1 − π)h,    (2.30)

and so implicitly depends on the cancelation probability, cancelation costs, inventory
and delay costs, in addition to the predictive distributions for the finalization and final
delivery dates. Cohen et al. (2003) use an exponential distribution to model the arrival
time of the final order, P N , and a Weibull distribution to model the distribution of
the final delivery date, PY . To estimate the model parameters and predict the lead
time, the authors use data on soft orders, final orders, and order lead time. Empirical
estimates suggest that ĝ = 1.0, ĥ = 3.0, ĉ = 2.1, indicating that holding costs are
three times greater than delay costs, while cancelation costs are twice as high as the
delay costs. This in turn helps the manufacturer decide on the optimal start date for
production, yπ∗ .
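A sketch (ours) of how the first-order condition (2.30) can be solved numerically. The cost estimates follow the text, but the cancelation probability π and the parameters of the exponential and Weibull distributions are illustrative assumptions, not values from Cohen et al. (2003):

    from scipy.optimize import brentq
    from scipy.stats import expon, weibull_min

    pi_c, c, h, g = 0.3, 2.1, 3.0, 1.0        # pi_c assumed; c, h, g from the text
    P_N = expon(scale=10.0).cdf                # arrival of final order (assumed)
    P_y = weibull_min(1.5, scale=8.0).cdf      # final delivery date (assumed)

    def foc(y_pi):
        # first-order condition (2.30), written as lhs - rhs
        lhs = pi_c * c * P_N(y_pi) + (1.0 - pi_c) * (g + h) * P_y(y_pi)
        rhs = pi_c * c + (1.0 - pi_c) * h
        return lhs - rhs

    y_star = brentq(foc, 1e-8, 200.0)          # optimal production start date
    print(y_star)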

2.5 EXAMPLES OF APPLICATIONS OF FORECASTS IN
MACROECONOMICS AND FINANCE
Forecasts are of interest to economic agents only in so far as they can help improve
their decisions, so it is useful to illustrate the importance of forecasts in the context of
some simple economic decision problems. This section provides three such examples
from economics and finance.

2.5.1 Central Bank’s Decision Problem
Consider a central bank with an objective of targeting inflation by means of a single policy instrument, i_t, which could be an interest rate such as the repo rate, i.e., the rate charged on collateralized loans. Svensson (1997) sets out a simple model in which the central bank's loss function depends on the difference between the inflation rate (y_t) and a target inflation rate (y*). Svensson shows that, conditional on
having chosen a value for its instrument (the repo rate), the central bank’s decision
problem reduces to that of choosing a forecast that minimizes the deviation from
the target. Although the forecast does not enter directly into the central bank’s loss
function, it does so indirectly because the actual rate of inflation (which is what the
central bank really cares about) is affected by the bank’s choice of interest rate which
in turn reflects the inflation forecast.

Specifically, the central bank is assumed to choose a sequence of interest rates {i_τ}_{τ=t}^∞ to minimize a weighted sum of expected future losses,

    E_t Σ_{τ=t}^∞ λ^{τ−t} L(y_τ − y*),                           (2.31)

where λ ∈ (0, 1) is a discount rate and E_t[·] denotes the conditional expectation
given information available at time t. Both current and future deviations from target
inflation affect the central bank’s loss.
Following Svensson's analysis, suppose the central bank has quadratic loss

    L(y_τ − y*) = ½(y_τ − y*)².                                  (2.32)

Future inflation rates depend on the sequence of interest rates which are chosen to
minimize expected future loss and hence satisfy the condition
    {i_τ*}_{τ=t}^∞ = arg min_{{i_τ}_{τ=t}^∞} Σ_{τ=t}^∞ λ^{τ−t} E_t[(y_τ − y*)²].    (2.33)

Complicating matters, inflation is not exogenous but is affected by the central bank’s
actions. Solving (2.33) is therefore quite difficult since current and future interest
rates can be expected to affect future inflation rates. Because inflation forecasts matter
only in so far as they affect the central bank’s interest rate policy and hence future
inflation, a model for the data-generating process for inflation is needed. Svensson
proposes a tractable approach in which inflation and output are generated according
to the equations13
    y_{t+1} = y_t + α₁z_t + ε_{t+1},                             (2.34)
    z_{t+1} = β₁z_t − β₂(i_t − y_t) + η_{t+1},                   (2.35)

where z_t is current output relative to its potential level, and all parameters are positive, i.e., α₁, β₁, β₂ > 0. The quantities ε_{t+1} and η_{t+1} are unpredictable shocks to inflation and output, respectively. The first equation expresses the change in inflation as a function of the lagged output, while the second equation shows that the real interest ra