Economic Forecasting
Graham Elliott, Allan Timmermann
Economic forecasting involves choosing simple yet robust models to best approximate highly complex and evolving data-generating processes. This poses unique challenges for researchers in a host of practical forecasting situations, from forecasting budget deficits and assessing financial risk to predicting inflation and stock market returns. Economic Forecasting presents a comprehensive, unified approach to assessing the costs and benefits of different methods currently available to forecasters.
This text approaches forecasting problems from the perspective of decision theory and estimation, and demonstrates the profound implications of this approach for how we understand variable selection, estimation, and combination methods for forecasting models, and how we evaluate the resulting forecasts. Both Bayesian and non-Bayesian methods are covered in depth, as are a range of cutting-edge techniques for producing point, interval, and density forecasts. The book features detailed presentations and empirical examples of a range of forecasting methods and shows how to generate forecasts in the presence of large-dimensional sets of predictor variables. The authors pay special attention to how estimation error, model uncertainty, and model instability affect forecasting performance.
• Presents a comprehensive and integrated approach to assessing the strengths and weaknesses of different forecasting methods
• Approaches forecasting from a decision-theoretic and estimation perspective
• Covers Bayesian modeling, including methods for generating density forecasts
• Discusses model selection methods as well as forecast combinations
• Covers a large range of nonlinear prediction models, including regime switching models, threshold autoregressions, and models with time-varying volatility
• Features numerous empirical examples
• Examines the latest advances in forecast evaluation
• Essential for practitioners and students alike
Categories:
Economy\\Econometrics
Year:
2016
Publisher:
Princeton University Press
Language:
English
Pages:
567
ISBN 13:
9780691140131
ECONOMIC FORECASTING

GRAHAM ELLIOTT AND ALLAN TIMMERMANN

PRINCETON UNIVERSITY PRESS
PRINCETON AND OXFORD

Copyright © 2016 by Princeton University Press
Published by Princeton University Press, 41 William Street, Princeton, New Jersey 08540
In the United Kingdom: Princeton University Press, 6 Oxford Street, Woodstock, Oxfordshire OX20 1TW
press.princeton.edu
All Rights Reserved
ISBN: 9780691140131
Library of Congress Control Number: 2015959185
British Library Cataloging-in-Publication Data is available
This book has been composed in Minion Pro and Helvetica Neue
Printed on acid-free paper. ∞
Typeset by S R Nova Pvt Ltd, Bangalore, India
Printed in the United States of America

FOR OUR FAMILIES
Jason and Shirley
Henry, Rafaella and Solange

Contents

Preface

I Foundations

1 Introduction
  1.1 Outline of the Book
  1.2 Technical Notes
2 Loss Functions
  2.1 Construction and Specification of the Loss Function
  2.2 Specific Loss Functions
  2.3 Multivariate Loss Functions
  2.4 Scoring Rules for Distribution Forecasts
  2.5 Examples of Applications of Forecasts in Macroeconomics and Finance
  2.6 Conclusion
3 The Parametric Forecasting Problem
  3.1 Optimal Point Forecasts
  3.2 Classical Approach
  3.3 Bayesian Approach
  3.4 Relating the Bayesian and Classical Methods
  3.5 Empirical Example: Asset Allocation with Parameter Uncertainty
  3.6 Conclusion
4 Classical Estimation of Forecasting Models
  4.1 Loss-Based Estimators
  4.2 Plug-In Estimators
  4.3 Parametric versus Nonparametric Estimation Approaches
  4.4 Conclusion
5 Bayesian Forecasting Methods
  5.1 Bayes Risk
  5.2 Ridge and Shrinkage Estimators
  5.3 Computational Methods
  5.4 Economic Applications of Bayesian Forecasting Methods
  5.5 Conclusion
6 Model Selection
  6.1 Trade-Offs in Model Selection
  6.2 Sequential Hypothesis Testing
  6.3 Information Criteria
  6.4 Cross Validation
  6.5 Lasso Model Selection
  6.6 Hard versus Soft Thresholds: Bagging
  6.7 Empirical Illustration: Forecasting Stock Returns
  6.8 Properties of Model Selection Procedures
  6.9 Risk for Model Selection Methods: Monte Carlo Simulations
  6.10 Conclusion
  6.11 Appendix: Derivation of Information Criteria

II Forecast Methods

7 Univariate Linear Prediction Models
  7.1 ARMA Models as Approximations
  7.2 Estimation and Lag Selection for ARMA Models
  7.3 Forecasting with ARMA Models
  7.4 Deterministic and Seasonal Components
  7.5 Exponential Smoothing and Unobserved Components
  7.6 Conclusion
8 Univariate Nonlinear Prediction Models
  8.1 Threshold Autoregressive Models
  8.2 Smooth Transition Autoregressive Models
  8.3 Regime Switching Models
  8.4 Testing for Nonlinearity
  8.5 Forecasting with Nonlinear Univariate Models
  8.6 Conclusion
9 Vector Autoregressions
  9.1 Specification of Vector Autoregressions
  9.2 Classical Estimation of VARs
  9.3 Bayesian VARs
  9.4 DSGE Models
  9.5 Conditional Forecasts
  9.6 Empirical Example
  9.7 Conclusion
10 Forecasting in a Data-Rich Environment
  10.1 Forecasting with Factor Models
  10.2 Estimation of Factors
  10.3 Determining the Number of Common Factors
  10.4 Practical Issues Arising with Factor Models
  10.5 Empirical Evidence
  10.6 Forecasting with Panel Data
  10.7 Conclusion
11 Nonparametric Forecasting Methods
  11.1 Kernel Estimation of Forecasting Models
  11.2 Estimation of Sieve Models
  11.3 Boosted Regression Trees
  11.4 Conclusion
12 Binary Forecasts
  12.1 Point and Probability Forecasts for Binary Outcomes
  12.2 Density Forecasts for Binary Outcomes
  12.3 Constructing Point Forecasts for Binary Outcomes
  12.4 Empirical Application: Forecasting the Direction of the Stock Market
  12.5 Conclusion
13 Volatility and Density Forecasting
  13.1 Role of the Loss Function
  13.2 Volatility Models
  13.3 Forecasts Using Realized Volatility Measures
  13.4 Approaches to Density Forecasting
  13.5 Interval and Quantile Forecasts
  13.6 Multivariate Volatility Models
  13.7 Copulas
  13.8 Conclusion
14 Forecast Combinations
  14.1 Optimal Forecast Combinations: Theory
  14.2 Estimation of Forecast Combination Weights
  14.3 Risk for Forecast Combinations
  14.4 Model Combination
  14.5 Density Combination
  14.6 Bayesian Model Averaging
  14.7 Empirical Evidence
  14.8 Conclusion

III Forecast Evaluation

15 Desirable Properties of Forecasts
  15.1 Informal Evaluation Methods
  15.2 Loss Decomposition Methods
  15.3 Efficiency Properties with Known Loss
  15.4 Optimality Tests under Unknown Loss
  15.5 Optimality Tests That Do Not Rely on Measuring the Outcome
  15.6 Interpreting Efficiency Tests
  15.7 Conclusion
16 Evaluation of Individual Forecasts
  16.1 The Sampling Distribution of Average Losses
  16.2 Simulating Out-of-Sample Forecasts
  16.3 Conducting Inference on the Out-of-Sample Average Loss
  16.4 Out-of-Sample Asymptotics for Rationality Tests
  16.5 Evaluation of Aggregate versus Disaggregate Forecasts
  16.6 Conclusion
17 Evaluation and Comparison of Multiple Forecasts
  17.1 Forecast Encompassing Tests
  17.2 Tests of Equivalent Expected Loss: The Diebold–Mariano Test
  17.3 Comparing Forecasting Methods: The Giacomini–White Approach
  17.4 Comparing Forecasting Performance across Nested Models
  17.5 Comparing Many Forecasts
  17.6 Addressing Data Mining
  17.7 Identifying Superior Models
  17.8 Choice of Sample Split
  17.9 Relating the Methods
  17.10 In-Sample versus Out-of-Sample Forecast Comparison
  17.11 Conclusion
18 Evaluating Density Forecasts
  18.1 Evaluation Based on Loss Functions
  18.2 Evaluating Features of Distributional Forecasts
  18.3 Tests Based on the Probability Integral Transform
  18.4 Evaluation of Multicategory Forecasts
  18.5 Evaluating Interval Forecasts
  18.6 Conclusion

IV Refinements and Extensions

19 Forecasting under Model Instability
  19.1 Breaks and Forecasting Performance
  19.2 Limitations of In-Sample Tests for Model Instability
  19.3 Models with a Single Break
  19.4 Models with Multiple Breaks
  19.5 Forecasts That Model the Break Process
  19.6 Ad Hoc Methods for Dealing with Breaks
  19.7 Model Instability and Forecast Evaluation
  19.8 Conclusion
20 Trending Variables and Forecasting
  20.1 Expected Loss with Trending Variables
  20.2 Univariate Forecasting Models
  20.3 Multivariate Forecasting Models
  20.4 Forecasting with Persistent Regressors
  20.5 Forecast Evaluation
  20.6 Conclusion
21 Forecasting Nonstandard Data
  21.1 Forecasting Count Data
  21.2 Forecasting Durations
  21.3 Real-Time Data
  21.4 Irregularly Observed and Unobserved Data
  21.5 Conclusion

Appendix
  A.1 Kalman Filter
  A.2 Kalman Filter Equations
  A.3 Orders of Probability
  A.4 Brownian Motion and Functional Central Limit Theory
Bibliography
Index

Preface

We started working on this book more than 10 years ago after teaching courses on forecasting techniques at the University of Aarhus, Denmark, and in Bertinoro, Italy, to groups of PhD students and assistant professors. Since then, we have developed the material through courses offered to participants at many institutions, including at CREATES (University of Aarhus), American University, Edhec, Bank of Italy, SoFiE (Oxford University), and Universidad del Rosario.
Our idea was to provide a unified perspective that takes both the economics and statistics of the forecasting problem seriously. The intention was to write a forecasting book that could be used by master's and PhD students as well as professionals in places such as central banks, financial institutions, and research institutes.

The book can be used as a textbook. Indeed, the first section of the book provides a unified theoretical discussion of the basic approach to forecasting that is grounded in the standard statistical practice of minimizing the “risk” (expected loss) of any method. The remainder of the book can be used both as a text and as a reference to a wide range of forecasting methods. We have tried as much as possible to provide detailed descriptions of how to construct forecasts, how to evaluate such forecasts, and how to compare them across different methods. This allows the book to serve as a single source for many widely employed forecasting methods. Through empirical applications and reviews of the empirical literature, we also shed light on which methods work well in different circumstances. We use examples ranging from stock returns to macroeconomic variables and surveys of forecasters.

Nearly all researchers who are interested in developing new forecasting methods through theoretical analysis or improving their empirical performance through data analysis work within a decision-theoretic framework. For example, the provision of point forecasts is a special case of point estimation and the provision of distributional forecasts is a special case of density estimation. We use this connection as a foundation for understanding the statistical basis for forecasting analysis and gaining a better understanding of how to think about the many forecasting methods in practical use.
Thus, the first premise of the book is that taking seriously the economics underlying the forecasting problem means that the forecaster’s loss function should be the starting point of the analysis. The second premise of the book is that the joint density of the random variables that generate the observed data used to build and evaluate a forecasting model is far more complicated than we understand theoretically or empirically. As a consequence, all forecasting models are misspecified in the sense that they are approximations to the best possible forecasting model. In practice, this means choosing forecasting methods based on their risk functions (the expected loss given the data), but acknowledging that these risk functions are themselves very complicated objects that depend on the underlying (unknown) data-generating process. It is exactly the difficulties in understanding the risk functions that allow so many different forecasting approaches to be used in empirical work.

In addition to the students in our forecasting courses who have provided valuable feedback, throughout the years we have also benefitted from discussions on forecasting with many individuals. Without implying that they necessarily agree with the points of view expressed in the book, we thank our colleagues at UCSD (past and present) including Brendan Beare, Robert Engle, Clive Granger, Jim Hamilton, Ivana Komunjer, Andres Santos, Yixiao Sun, Rossen Valkanov, and Hal White. More widely in the profession we thank Frank Diebold, Peter Hansen, Andrew Patton, Hashem Pesaran, Ulrich Müller, Jim Stock, Mark Watson, and Ken West for their insights and support. We thank all of them for the inspiration they have offered over the years. This book has also benefitted more directly from the input of many friends and colleagues.
In particular, we thank Peter Hansen, Kirstin Hubrich, Simone Manganelli, Andrew Patton, Davide Pettenuzzo, Barbara Rossi, and four anonymous reviewers for comments on the book. A number of PhD students provided exceptionally capable research assistance with the empirical analysis, notably Leland E. Farmer, Antonio Gargano, Rafael Burjack, Hiroaki Kaido, and Christian Constandse. Thanks also go to Naveen Basavanhally for help with formatting the manuscript, to Alison Durham for doing an excellent job at copyediting the manuscript, and to Ali Parrington and the team at Princeton University Press for ensuring a smooth production process.

For collaboration on forecasting papers over the years we also wish to thank several former PhD students and colleagues, including Marco Aiolfi, Ayelen Banegas, Gray Calhoun, Carlos Capistran, Luis Catao, Tolga Cenesizoglu, Leland Farmer, Antonio Gargano, Veronique Genre, Dahlia Ghanem, Ben Gillen, Clive Granger, Niels Groenborg, Massimo Guidolin, Peter Reinhard Hansen, Geoff Kenny, Ivana Komunjer, Robert Kosowski, Fabian Krueger, Robert Lieli, Asger Lunde, Aidan Meyler, Andrew Patton, Bradley Paye, Thomas Pedersen, Gabriel Perez-Quiros, Hashem Pesaran, Davide Pettenuzzo, Marius Rodrigues, Steve Satchell, Larry Schmidt, Ryan Sullivan, Russ Wermers, Hal White, and Yinchu Zhu.

Last, but not least, we wish to thank our families for their understanding and inspiration during the years it took to complete the book. The book would not have been possible without their unwavering support.

I Foundations

1 Introduction

Our aim with this book is to present an overview of the theory and methods underlying forecasting as currently practiced in economics and finance, but more widely applicable to a great range of forecasting problems.
We hope to provide an overview that is useful to practitioners in places such as central banks and financial institutions, to academic researchers, and to graduate students seeking a point of entry into the field. The assumed econometric level of the reader is that of someone who has taken a graduate or advanced undergraduate course in econometrics.

Whenever a forecast is being constructed or evaluated, an overriding concern revolves around the practical problem that the best forecasting model is not only unknown but also unlikely to be known well enough to even correctly specify forecasting equations up to a set of unknown parameters. We view this as the only reasonable description of the forecaster’s problem. Some methods do claim to find the correct model (oracle methods) as the sample gets very large. However, in any problem with a finite sample there is always a set of models (as opposed to a single model) that are consistent with the data. Moreover, in many situations the data-generating process changes over time, further emphasizing the difficulty in obtaining very large samples of observations on which to base a model. These foundations (using misspecified models to forecast outcomes generated by a process that may be evolving over time) generate many of the complications encountered in forecasting.

If the true models were fully known apart from the values of the parameters, Bayesian methods could be used to construct density and point forecasts that, for a given loss function, would be difficult or impossible to beat in practice. Without knowing the true data-generating process, the problem of constructing a good forecasting method becomes much more difficult. Oftentimes very simple (and clearly misspecified) methods provide forecasts that outperform more complicated methods that seek to exploit the data in ways we would expect to be important and advantageous.
As a case in point, simple averages of forecasts from many models, even ones that on their own do not seem to be very good, are often found empirically to outperform carefully chosen model averages or the best individual models.

1.1 OUTLINE OF THE BOOK

The approach of this book is for the most part based on forecasting as a decision-theoretic problem. By this we mean that the forecaster has a specific objective in mind (i.e., wishes to make a decision) and wants to base this decision on some data. Setting up this approach comprises most of the first part of the book. This part details the basic elements of the decision problem, with chapters on the decision maker’s loss function, forecasting as a decision-theoretic problem, and an overview of general approaches to forecasting employing either classical or Bayesian methods. This part of the book provides foundations for understanding how different methods fit together. We also provide details of methods that are subsequently applied to many of the issues examined in the next part of the book, e.g., model selection and forecast combination.

The second part of the book reviews various approaches to constructing forecasting models. Methods employed differ for many reasons: lack of relevant data or the existence of a great deal of potentially relevant data, as well as assumptions made on functional forms for the models. In these chapters we attempt, as far as possible, to present the methods in enough detail that they can be employed without reference to other sources.

The third part of the book examines the evaluation of forecasts, while the fourth part covers forecasting models that deal with special complications such as model instability (breaks) and highly persistent (trending) data. This part also discusses data structures of special interest to forecasters, including real-time data (revised data) and data collected at different frequencies.
Finally, the fourth part of the book presents various extensions and refinements to the forecasting methods covered in the earlier parts of the book, including forecasting under model instability, long-run forecasting, and forecasting with data that either take a nonstandard form (count data and durations) or are measured at irregular intervals and are subject to revisions.

1.1.1 Part I

The first part of the book argues that point forecasting should be thought of simply as an application of decision theory. Since much is known about decision theory, much is also known about forecasting. This perspective makes point forecasting a special case of estimation, a field where excellent texts already exist. What makes economic forecasting interesting as a separate topic is the particular details of how decision theory is applied to the problem at hand.

To apply this approach, we require a clear statement about the costs of forecast errors $e = y - f$, where $y$ is the outcome being predicted and $f$ is the forecast. The trade-off between different forecasting mistakes is embodied in a loss function, $L(f, y)$, which is discussed in chapter 2, with additional material on the binary case available in chapter 12. We regard loss functions as realistic expositions of the forecaster’s objectives, and consider the specification of the loss function as an integral part of the forecaster’s decision problem.¹ Different forecasters approaching the same outcome may well have different loss functions, which could result in different choices of forecasting models for the same outcome.

¹ An alternative literature considers features of loss functions and attempts to suggest a good loss function for all forecasting problems. We do not consider this approach and view loss functions as primitive to the forecaster’s problem.

The specification of loss functions is often disregarded in economic forecasting, and instead “standard” loss functions such as mean squared error loss tend to be employed. This can prove costly in real forecasting situations as it overlooks directions in which forecast errors are particularly costly. Nonetheless, much of the academic literature is based on these standard loss functions and so we focus much of our survey of methods throughout the second part of the book on these standard loss functions.

Chapter 3 provides a general description of the forecaster’s problem as a decision problem. It may strike some readers, more used to the “art” of forecasting, as unusual to cast point forecasting as a decision-theoretic problem. However, even readers who do not explicitly follow this approach are indeed operating within the decision-theoretic framework. For example, most forecasting methods are motivated in one of two ways: either the methods are demonstrated to provide better performance given a loss function (or set of loss functions) through Monte Carlo simulations for reasonable data-generating processes, or alternatively, the forecasting methods are shown to work well for some loss function for a particular set of empirical data $z = (y, x)$, where $x$ represents the set of predictor variables used to forecast the outcome $y$. Both ways of measuring performance place the forecasting problem within the decision-theoretic approach.

To illustrate this point, consider Monte Carlo simulations of a data-generating process (joint density for the data) regarded as a reasonable representation of some data of interest. The simulation method suggests constructing $N$ independent pseudo samples from this density, constructing $N$ forecasts and evaluating $N^{-1} \sum_{n=1}^{N} L(f^{(n)}, y^{(n)})$, where $y^{(n)}$ is the outcome we wish to forecast, $f^{(n)}$ is the forecast generated by a prediction model, and $L(f, y)$ is the loss function which measures the costs of forecast inaccuracies. Superscripts refer to the individual simulations, $n = 1, \ldots, N$.
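The Monte Carlo procedure described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: the data-generating process is taken to be i.i.d. normal, the forecast is the sample mean of each pseudo sample, the loss is squared error, and the function names and parameter values are hypothetical. Under these assumptions the risk is known analytically, $\sigma^2 (1 + 1/T)$ for a pseudo sample of length $T$, so the simulated average loss can be checked against it.

```python
import numpy as np

def squared_error_loss(f, y):
    """L(f, y) = (y - f)^2, one of the 'standard' loss functions."""
    return (y - f) ** 2

def simulate_average_loss(n_sims, sample_size, mu=1.0, sigma=2.0, seed=0):
    """Monte Carlo estimate of the risk of the sample-mean forecast.

    Assumed (hypothetical) DGP: y_1, ..., y_T, y_{T+1} i.i.d. N(mu, sigma^2).
    For pseudo sample n, the forecast f^(n) is the mean of the first T draws
    and the outcome y^(n) is the final draw.
    """
    rng = np.random.default_rng(seed)
    draws = rng.normal(mu, sigma, size=(n_sims, sample_size + 1))
    forecasts = draws[:, :-1].mean(axis=1)   # f^(n), n = 1, ..., N
    outcomes = draws[:, -1]                  # y^(n), n = 1, ..., N
    return squared_error_loss(forecasts, outcomes).mean()

sample_size, sigma = 20, 2.0
true_risk = sigma**2 * (1 + 1 / sample_size)   # analytical risk under the assumed DGP
estimate = simulate_average_loss(200_000, sample_size, sigma=sigma)
print(f"simulated average loss: {estimate:.3f}, analytical risk: {true_risk:.3f}")
```

Changing the loss function or the forecasting rule in this sketch changes the point on the risk function that the simulation estimates, which is the sense in which such comparisons sit inside the decision-theoretic framework.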
The simulated average loss is usually thought of as a measure of the performance of the forecasting method or model for this data-generating process. This is reasonable since as $N$ gets large, the sample average is, by standard laws of large numbers, a consistent estimate of the risk at the point of the parameter space for the data-generating process chosen for the Monte Carlo, i.e., as long as $E[L(f, y)]$ exists,

$$N^{-1} \sum_{n=1}^{N} L(f^{(n)}, y^{(n)}) \to_p E[L(f, y)], \tag{1.1}$$

where the Monte Carlo estimates a point on the risk function and $\to_p$ means convergence in probability. Finding a forecast that minimizes the risk is precisely the setup of a decision-theoretic problem.

The third part of the book discusses methods for evaluation of sequential out-of-sample predictions. In each case, one obtains from the data set $T$ observations of the “realized” loss from the data. One then evaluates the time-series average $T^{-1} \sum_{t=1}^{T} L(f_t, y_t)$, where the $t$ subscript refers to time, as a measure of the expected loss. In this case the assumptions that underlie results such as (1.1) are much more stringent because the sequence of expected losses generated from data is not independently and identically distributed (i.i.d.) as in the Monte Carlo simulations. However, under suitable assumptions again this method estimates risk. When using real (as opposed to simulated) data, we do not know the true parameter values of the data-generating process. Analyzing a variety of economic variables, we get a sense of how well different forecasting methods work for different types of data.

The general setup in chapter 3 is common to forecasters basing their estimation strategies either on frequentist or on Bayesian approaches. Chapters 4 and 5 build on this setup separately for these two approaches. Chapter 4 examines the typical frequentist approaches, explaining general pitfalls that can occur as well as highlighting special cases arising later in the book.
Chapter 5 does the same for the Bayesian approach.

Viewing forecasting as a decision-theoretic problem sometimes means that the best forecasting model, despite working well in practice, may actually be a model that is very difficult to interpret economically. This becomes a problem when the forecasting exercise is a step in a decision process, and the forecaster must “explain” the forecast to decision makers or forecast users. In these cases an inferior point forecast that tends to be further away from the outcome may be preferred because it is easier to explain and may be seen to be more credible. Of course in situations where we suspect a lot of overfitting or instability in the relationships between the variables, we might prefer forecasting models that conform to economic theory since they are expected to be more robust. Practically, economically motivated restrictions on forecasting models can just be seen as following the decision-theoretic approach for a restricted set of models.

The final chapter of the first part of the book, chapter 6, examines issues related to model selection. By now the econometrics literature has a very good understanding of the merits and limitations of model selection, which we discuss for general models. From the perspective of forecasting, however, we regard model selection as simply part of the model estimation process. Of interest to the forecaster is the risk of the final forecasting model computed in a way that accounts for the full estimation process. Given the complexity of the distributions of estimators obtained from models whose selection is driven by the data, this issue is difficult to address analytically although it is still of direct relevance to the forecaster.

1.1.2 Part II

Part II of the book provides an overview of the various approaches to forecasting that have become standard in many areas, including the economic and finance forecasting literature.
Chapters are based around either the amount of information available (from only the past history of the predicted variable through very large panels of variables) or the general estimation approach, principally parametric or nonparametric methods. To the extent possible, we provide details of how to go about constructing forecasts from the various methods, or alternatively direct readers to explanations available in the literature. We also discuss the trade-offs between different methods. In this sense we endeavor to provide a “first stop” for practitioners wishing to apply the methods covered in this section.

An important insight that arises from the decision-theoretic approach is that there is no single best or dominant approach to constructing a forecast for all possible forecasting situations. We discuss the types of forecasting situations where each individual method is likely to be a reasonable approach and also highlight situations where other approaches should be considered.

Throughout the book, we use a variety of empirical applications to illustrate how different approaches work. In most applications we use so-called pseudo out-of-sample forecasts which simulate the forecast as it could have been generated using data only up to the date of the prediction. This method restricts both model selection and parameter estimation to rely on data available at the point of the forecast. As time progresses and more data become available, the forecasting method, including the parameter estimates, is updated recursively. Such methods are commonly used to evaluate the usefulness of forecasts; a critical discussion of such out-of-sample forecasting methods versus in-sample methods is provided in part three of the book.

When building a forecasting model for an economic variable, the simplest specification of the conditioning information set is the variable’s own past history.
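The recursive pseudo out-of-sample scheme described above can be sketched for a forecasting model that conditions only on the variable's own past. The example below is a minimal illustration, not the procedure used in the book's empirical applications: it assumes a first-order autoregression with an intercept, estimated by ordinary least squares, applied to a simulated series with hypothetical parameter values. At each date the coefficients are re-estimated using only the observations available at that date.

```python
import numpy as np

def recursive_ar1_forecasts(y, first_forecast):
    """One-step-ahead pseudo out-of-sample forecasts from an AR(1) with intercept.

    For each target date t >= first_forecast, the coefficients are re-estimated
    by OLS using only y[0], ..., y[t-1], and y[t] is then predicted from y[t-1].
    """
    forecasts = []
    for t in range(first_forecast, len(y)):
        past = y[:t]                                   # data available when forecasting y[t]
        X = np.column_stack([np.ones(t - 1), past[:-1]])
        beta, *_ = np.linalg.lstsq(X, past[1:], rcond=None)
        forecasts.append(beta[0] + beta[1] * past[-1])
    return np.array(forecasts)

# Simulated AR(1) series (hypothetical parameters, for illustration only).
rng = np.random.default_rng(1)
T, phi = 300, 0.6
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal()

first = 100
f = recursive_ar1_forecasts(y, first)
avg_loss = np.mean((y[first:] - f) ** 2)   # time-series average of squared-error loss
print(f"{len(f)} forecasts, average squared-error loss: {avg_loss:.3f}")
```

Because each forecast uses only earlier observations, altering the series after a given date leaves all forecasts made up to that date unchanged, which is the defining property of the pseudo out-of-sample exercise.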
Conditioning only on the variable’s own past leads to univariate autoregressive moving average, or ARMA, models. Since Box and Jenkins (1970) these models have been extensively used and often provide benchmarks that are difficult to beat using more complicated forecasting methods. Linear ARMA models are also easy to estimate and a large literature has evolved on how best to cover issues in implementation such as lag length selection, generation of multiperiod forecasts, and parameter estimation. We discuss these issues in chapter 7. The chapter also covers exponential smoothing, unobserved components models, and other ways to account for trends when forecasting economic variables.

Chapter 8 continues under the assumption that the information set is limited to the predicted variable’s own past, but focuses on nonlinear parametric models. Examples include threshold autoregressions, smooth threshold autoregressions, and Markov switching models. These models have been used to capture evidence of nonlinear dynamics in many macroeconomic and financial time series. Unlike nonparametric models they do not, however, have the ability to provide a global approximation to general data-generating processes of unknown form.

Chapter 9 expands the information set to include multivariate information by considering a natural extension to univariate autoregressive models, namely vector autoregressions, or VARs. VARs provide a framework for producing internally consistent multiperiod forecasts of all the included variables. As used in macroeconomic forecasting, VARs typically include a relatively small set of variables, often fewer than 10, but they still require a large number of parameters to be estimated if the number of included lags is high. To deal with the resulting negative effects of estimation errors on forecasting performance, a large literature has developed Bayesian methods for estimating and forecasting with VARs.
Both classical and Bayesian estimation of VARs are covered in the chapter, which also deals with forecasting when the future paths of some variables are specified, a common practice in scenario analysis or contingent forecasting.

The emergence of very large data sets has given rise to a wealth of information becoming readily available to forecasters. This poses both a unique opportunity (the potential for identifying new informative predictor variables) and some real challenges given the limitations of most economic data. Suppose that N potential predictor variables are available, and that N is a large number, i.e., in the hundreds or thousands. Including all variables in the forecasting model, the so-called kitchen sink approach, is generally not feasible or desirable even for linear models since parameter estimation error becomes too large, unless the length of the estimation sample, T, is very large relative to N. Standard forecasting methods that conduct comprehensive model selection searches are also not feasible in this situation. If the true model is sparse, i.e., includes only a few variables, one possibility is to use algorithms such as the Lasso, covered in chapter 6, to identify a few key predictors. Another strategy is to develop a few key summary measures that aggregate information from a large cross section of variables. This is the approach used by common factor models. Chapter 10 describes how these methods can be used in forecasting, including in factor-augmented VAR models that include both univariate autoregressive terms and information in the factors. Finally, we discuss the possibility of using methods from panel data estimation to generate forecasts.

While chapters 7–10 focus on parametric estimation methods and so assume that a certain amount of structure can be imposed on the forecasting model, chapter 11 considers nonparametric forecasting strategies.
These include kernel regressions and sieve estimators such as polynomials and spline expansions, artificial neural networks, along with more recent techniques from the machine-learning literature such as boosted regression trees. Although these methods have powerful abilities to approximate many data-generating processes as the number of terms included by the approach gets large, in practice any given estimated nonparametric model is itself an approximation to this approximation. Notably, the number of terms that can be successfully included in empirical applications will often be severely restricted by the available data sample. The estimated models thus do not have the same approximation ability as the idealized models with many terms and are themselves only approximations. Once again, the algorithm used to fit these forecasting models, along with the loss function used to guide the estimation, becomes key to their forecasting performance and to avoiding issues related to overfitting.

Forecasts of binary variables, i.e., variables that are restricted to take only two possible values, play a special role in decisions such as households’ choice on whether or not to buy a car, the decision on whether to pursue a particular education, or banks’ decisions on whether to change interest rates for short-term deposits. Restricting the outcome to only two possible values has the advantage that it crystallizes the costs of making wrong forecasts, i.e., false positives or false negatives. Chapter 12 takes advantage of these simplifications to cover point and probability forecasts of binary outcomes and discusses both statistical and utility-based estimators for such data.

The decision-theoretic approach embodies a loss function that is appropriate for the decision to be made and not, as is so often the case, chosen for convenience. It results in a decision, i.e., a choice of an action to be made. This points toward basing estimation on the objective of providing the best decision.
Alternatively, we might consider provision of a predictive distribution (density forecast) for an outcome as the objective of the forecasting problem. In chapter 13 we see that this perspective is useful for a wide range of decisions.2 Distribution forecasts also serve the important role of quantifying the degree of uncertainty surrounding point forecasts. Distributional forecasting fills an important place in any forecaster’s toolbox but it does not replace point forecasting. First, although density forecasts can be used to construct point forecasts, typically it is the point forecast or decision that is required. Second, distributional forecasts rely on the distribution being estimated from data. This brings the loss function or scoring rule (the loss function used to estimate the density) back into the problem. Often ad hoc loss functions are employed to estimate the distributional forecast, leading to problems when the distributional forecast is subsequently used to construct the point forecast.

2 Dawid (1984) introduced what he termed the “prequential” approach to statistics, where prequential is a fusing of the words “probability” and “sequential.” This approach argued that rather than parameters being the object of statistical inference, the proper approach was to provide a sequence of probability forecasts for an outcome of interest. Hence, the provision of a density is important not just for forecasting, but for statistics in general.

Given the plethora of different modeling approaches for construction of forecasts throughout chapters 7–13, it is not surprising that forecasters frequently have access to multiple predictions of the same outcome. Instead of aiming to identify a single best forecast, another strategy is to combine the information in the individual forecasts. This is the topic of forecast combinations covered in chapter 14.
If the information used to generate the underlying forecasts is not available, forecast combination reduces to a simple estimation problem that basically treats the individual forecasts as predictors that could be part of a larger conditioning information set. Special restrictions on the forecast combination weights are sometimes imposed if it can be assumed that the individual forecasts are unbiased. If more information is available on the models underlying the individual forecasts, model combination methods can be used. These weight the individual forecasts based on their marginal likelihood or some such performance measure. Bayesian model averaging is a key example of such methods and is also covered in this chapter.

1.1.3 Part III

The third part of the book deals with forecast evaluation methods. Evaluation of forecast methods is central to the forecasting problem and the difficulties involved in this step explain both the plethora of methods suggested for forecasting any particular outcome and the need for careful evaluation of forecasting methods. To see the central issue, consider the simple problem of forecasting the next outcome, $y_{T+1}$, in a sequence of independently and identically distributed data $y_t$, $t = 1, \ldots, T$, with mean $\mu$, variance $\sigma^2$, and no explanatory variables. It is well known that under mean squared error (MSE) loss the best forecast is an estimate of the mean, $\mu$, such as the sample mean $\bar{y}_T = T^{-1} \sum_{t=1}^{T} y_t$. Since the outcome $y_{T+1}$ is a random variable whose distribution is centered on $\mu$, the forecast is typically different from the outcome even if we had a perfect estimate of $\mu$, i.e., if we knew $\mu$, as long as $\sigma^2 > 0$. Observing a single outcome far away from the forecast is therefore not necessarily indicative of a poor forecast. More generally, methods for forecast evaluation have to deal with the fact that (in expectation) the average in-sample loss and the average out-of-sample loss differ.
To see this, suppose we use the sample mean as our forecast. For any in-sample observation, $t = 1, \ldots, T$, the MSE of the forecast (or fitted value) is

$$
\begin{aligned}
E\big[(y_t - \bar{y}_T)^2\big] &= E\Big[\Big((y_t - \mu) - T^{-1} \sum_{t=1}^{T} (y_t - \mu)\Big)^2\Big] \\
&= \sigma^2 \Big(1 + \frac{T}{T^2} - \frac{2}{T}\Big) \\
&= \sigma^2 (1 - T^{-1}).
\end{aligned}
$$

Here the third term in the second line comes from the cross product when we compute the squared terms in the first line. In contrast, the MSE of out-of-sample forecasts of $y_{T+1}$ is

$$
\begin{aligned}
E\big[(y_{T+1} - \bar{y}_T)^2\big] &= E\Big[\Big((y_{T+1} - \mu) - T^{-1} \sum_{t=1}^{T} (y_t - \mu)\Big)^2\Big] \\
&= \sigma^2 \Big(1 + \frac{T}{T^2}\Big) \\
&= \sigma^2 (1 + T^{-1}).
\end{aligned}
$$

Here there is no cross-product term. Comparing these two expressions, we see that estimation error reduces the in-sample MSE but increases the out-of-sample MSE. In both cases the terms are of order $T^{-1}$ and so the difference disappears asymptotically. However, in many forecasting problems this smaller-order term is important both statistically and economically. When we consider many different models of the outcome, differences in the MSE across models are of the same order as the effects of estimation error. This makes it difficult to distinguish between models and is one reason why model selection is so difficult. The insight that the in-sample fit improves by using overparameterized models, whereas out-of-sample predictive accuracy can be reduced by using such models, strongly motivates the use of out-of-sample evaluation methods, although caveats apply as we discuss in part III of the book.

In the past 20 years many new forecast evaluation methods have been developed. Prior to this development, most academic work on evaluation and ranking of forecasting performance paid very little attention to the consideration that forecasts were obtained from recursively estimated models. Thus, studies often used the sample mean squared forecast error, computed for a particular empirical data set, to give an estimate of a model’s performance without accompanying standard errors.
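The two MSE expressions derived above for the sample-mean forecast can be checked with a small Monte Carlo experiment. The sketch below is only an illustration under assumptions that are not in the text: Gaussian i.i.d. data, $T = 10$, $\sigma = 1$, and a hypothetical helper named `simulate_mse`.

```python
import random


def simulate_mse(T=10, sigma=1.0, mu=0.0, n_rep=20000, seed=0):
    """Monte Carlo estimates of the in-sample and out-of-sample MSE
    of the sample mean for i.i.d. Gaussian data (an assumed design)."""
    rng = random.Random(seed)
    in_sum = 0.0
    out_sum = 0.0
    for _ in range(n_rep):
        y = [rng.gauss(mu, sigma) for _ in range(T)]
        y_bar = sum(y) / T
        # In-sample: average squared fitted error over t = 1, ..., T.
        in_sum += sum((yt - y_bar) ** 2 for yt in y) / T
        # Out-of-sample: squared error for a fresh draw y_{T+1}.
        y_next = rng.gauss(mu, sigma)
        out_sum += (y_next - y_bar) ** 2
    return in_sum / n_rep, out_sum / n_rep


T, sigma = 10, 1.0
mse_in, mse_out = simulate_mse(T=T, sigma=sigma)
theory_in = sigma ** 2 * (1 - 1 / T)   # sigma^2 (1 - T^{-1})
theory_out = sigma ** 2 * (1 + 1 / T)  # sigma^2 (1 + T^{-1})
print(f"in-sample MSE:     simulated {mse_in:.3f}, theory {theory_in:.3f}")
print(f"out-of-sample MSE: simulated {mse_out:.3f}, theory {theory_out:.3f}")
```

With a short sample such as $T = 10$ the gap between the two quantities is roughly $2\sigma^2/T$, which is why the $T^{-1}$ term can matter in practice even though it vanishes asymptotically.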
An obvious limitation of this approach is that such averages often are averages over very complicated functions of the data. Through their dependence on estimated parameters these averages are also typically correlated across time in ways that give rise to quite complicated distributions for standard test statistics. For some of the simpler ways that forecasts could have been generated recursively, recent papers derive the resulting standard errors, although much more work remains to be done to extend results to many of the popular forecasting methods used in practice.

Chapter 15 first establishes the properties that a good forecast should have in the context of the underlying loss function and discusses how these properties can be tested in practice. The chapter goes from the case where very little structure can be imposed on the loss function to cases where the loss function is known up to a small set of parameters. In the latter case it can be tested that the derivative of the loss with respect to the forecast, the so-called generalized forecast error, is unpredictable given current information. The chapter also shows how assumptions about the loss function can be traded off against testable assumptions on the underlying data-generating process.

Chapter 16 gives an overview of basic issues in evaluating forecasts, along with a description of informal methods. This chapter examines the evaluation of a sequence of forecasts from a single model. Critical values for the tests of forecast efficiency depend on how the forecast was constructed, specifically whether a fixed, rolling, or expanding estimation window was used.

Chapter 17 extends the assessment of the predictive performance of a single model to the situation with more than one forecast to examine and so addresses the issue of which, if any, forecasting method is best.
We review ways to compare the forecasting methods and strategies for testing hypotheses useful for identifying methods that work well in practice. Special attention is paid to the case with nested forecasting models, i.e., cases where one model includes all the terms of another benchmark model plus some additional information. We distinguish between tests of equal predictive accuracy and tests of forecast encompassing, the latter case referring to situations where one forecast dominates another. We also discuss how to test whether the best among many (possibly thousands of) forecasts is genuinely better than some benchmark.

Chapter 18 examines the evaluation of distributional forecasts. A complication that arises is that we never observe the density of the outcome; only a single draw from the distribution gets observed. Various approaches have been suggested to deal with this issue, including logarithmic scores and probability integral transforms. We discuss these as well as ways to evaluate whether the basic features of a density forecast match the data.

1.1.4 Part IV

The fourth part of the book covers a variety of topics that are specific to forecasting. Chapter 19 discusses predictions under model instability. This chapter builds on the earlier observation that all forecasting models are simplified representations of a much more complex and evolving data-generating process. A key source of model misspecification is the constant-parameter assumption made by many prediction models. Empirical evidence suggests that simple ARMA models are in fact misspecified for many macroeconomic variables. The chapter first discusses how model instability can be monitored before moving on to discuss prediction approaches that specifically incorporate time-varying parameters, including random walk or mean-reverting parameters and regime switching parameters.

The previous chapters deal with cases where the forecast horizon is relatively short.
Chapter 20 directly attacks the case where the forecast horizon can be long. Oftentimes a policy maker or budget office is interested in 5- or 10-year forecasts of revenue or expenditures. Interest may also lie in forecasts of the average growth rate over some period. From an estimation perspective, whether the forecast horizon is short or long is measured relative to the length of the data sample. We discuss these issues in chapter 20.

Real-time forecasting methods emphasize the need to ensure that all information and all methods used to construct a forecast would have been available in real time. This consideration becomes particularly relevant in so-called pseudo out-of-sample forecasts that simulate a sequence of historical forecasts. Many macroeconomic time series are subject to revisions that become available only after the date of the forecast. Since the selection of a forecasting model and estimation of its parameters may depend on the conditioning information set, which vintage of data is used can sometimes make a material difference. Similar issues related to data availability are addressed by a relatively new field known as nowcasting, which uses filtering and updating algorithms to account for the jagged-edge nature of data, i.e., the fact that data are released at different frequencies and on different dates. These issues are covered in chapter 21.

This chapter also covers models for predicting data that take the format of either counts, and so are restricted to being an integer number, or durations, i.e., the length of the time intervals between certain events. The nature of the dependent variable gives rise to specific forecasting models, such as Poisson models, that are different from the models covered in the previous chapters of the book.
Count models have gained widespread popularity in the context of analysis of credit events such as bankruptcies or credit card default, while duration analysis is used to predict unemployment spells and times between trades in financial markets.

1.2 TECHNICAL NOTES

Throughout the book we follow standard statistical methods which view the data as realizations of underlying random variables. Objective functions and other functions of interest are then also functions of random variables. Further, we assume that all functions are measurable, including functions that arise from maximizations of functions over parameters. We are rarely explicit about these assumptions, though this is seldom an issue for the functions examined in the book.

The decision-theoretic approach relies on the existence of risk or expected loss. For loss functions that are bounded, this is usually not problematic, but many popular loss functions are not bounded. For example, mean squared error loss and mean absolute error loss are the most popular loss functions in practice, and neither is bounded. It is fairly standard in the forecasting literature to simply assume that the expected loss exists, and further assume that the asymptotic limit of expected loss is the expected value of the limiting random variable that measures the loss. Throughout the book we follow this practice without giving conditions. Forecasting practice in some instances does seem to enforce “boundedness” of a sort on forecast losses; for example, in evaluating nonlinear models with mean squared error loss, often extreme forecasts that could lead to very large losses are removed and so the loss is in effect bounded.

Throughout the book we tend not to present results as fully worked theorems but instead give the main conditions under which the results hold. Original papers with the full set of conditions are cited. The reasons for this approach are twofold.
First, often there are many overlapping sets of conditions that would result in lengthy expositions on often very straightforward methods if we were to include all the details of a result. Second, many of the conditions are highly technical in nature and often difficult or impossible to verify.

2 Loss Functions

Short of the special and ultimately uninteresting case with perfect foresight, it is not possible to find a method that always sets the forecast equal to the outcome. A formal method for trading off potential forecast errors of different signs and magnitudes is therefore required. The loss function, $L(\cdot)$, describes in relative terms how costly it is to use an imperfect forecast, $f$, given the outcome, $Y$, and possibly other observed data, $Z$. This chapter examines the construction and properties of loss functions and introduces loss functions that are commonly used in forecasting.

A central point in the construction of loss functions is that the loss function should reflect the actual tradeoffs between different forecast errors. In this sense the loss function is a primitive to the forecasting problem. From a decision-theoretic perspective the forecast is the action that must be constructed given the loss function and the predictive distribution, which we discuss in the next chapter. For example, the Congressional Budget Office must provide forecasts of future budget deficits. Their loss function in providing the forecasts should be based on the relative costs of over- and underpredicting public deficits. Weather forecasters face very different costs from underpredicting the strength of a storm compared to overpredicting it.

The choice of a loss function is important for every facet of the forecasting exercise. This choice affects which forecasting models are preferred as well as how their parameters are estimated and how the resulting forecasts are evaluated and compared against forecasts from competing models.
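A small numerical sketch may help show how the choice of loss function changes the preferred forecast. Everything specific here is an assumption made for illustration and does not come from the text: the ten equally likely outcomes, the piecewise-linear asymmetric loss, and its 3:1 cost ratio (underprediction being the costlier error, as in the storm example).

```python
def expected_loss(f, outcomes, loss):
    """Average loss of forecast f over equally likely outcomes."""
    return sum(loss(f, y) for y in outcomes) / len(outcomes)


def squared_loss(f, y):
    return (y - f) ** 2


def asymmetric_loss(f, y, under_cost=3.0, over_cost=1.0):
    # Hypothetical piecewise-linear loss: each unit of underprediction
    # (y > f) costs three times as much as a unit of overprediction.
    e = y - f
    return under_cost * e if e > 0 else -over_cost * e


outcomes = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # illustrative outcomes only
grid = [i / 10 for i in range(0, 91)]      # candidate forecasts 0.0, ..., 9.0

# Grid search for the forecast that minimizes expected loss under each choice.
best_sq = min(grid, key=lambda f: expected_loss(f, outcomes, squared_loss))
best_asym = min(grid, key=lambda f: expected_loss(f, outcomes, asymmetric_loss))
print("best forecast under squared loss:   ", best_sq)
print("best forecast under asymmetric loss:", best_asym)
```

Under these assumptions the squared loss selects the mean of the outcomes, while the asymmetric loss pushes the forecast upward to guard against the more expensive underpredictions.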
Despite its pivotal role, it is common practice to simply choose off-the-shelf loss functions. In doing this it is important to choose a loss function that at least approximately reflects the types of tradeoffs relevant for the forecast problem under study. For example, when forecasting hotel room bookings, it is hard to imagine that over- and underpredicting the number of hotel rooms booked on a particular day lead to identical losses because hotel rooms are a perishable good. Hence, using a symmetric loss function for this problem would make little sense. Asymmetric loss that reflects the larger loss from over- rather than underpredicting bookings would be more reasonable.

There are examples of carefully grounded loss functions in the economics literature. For example, sometimes a forecast can be viewed as a signal in a strategic game that is influenced by the forecast provider’s incentives. Studies such as Ehrbeck and Waldmann (1996), Hong and Kubik (2003), Laster, Bennett, and Geoum (1999), Ottaviani and Sørensen (2006), Scharfstein and Stein (1990), and Trueman (1994) suggest loss functions grounded on game-theoretical models. Forecasters are assumed to differ in their ability to predict future outcomes. The chief objective of the forecasters is to influence forecast users’ assessment of their ability. Such objectives are common for business analysts or analysts employed by financial services firms such as investment banks or brokerages whose fees are directly linked to clients’ assessment of their forecasting ability.

The chapter proceeds as follows. Section 2.1 examines general issues that arise in construction of loss functions. We discuss the mathematical setup of a loss function before relating it to the forecaster’s decisions and examining some general properties that loss functions have.
Section 2.2 reviews specific loss functions commonly used in economic forecasting problems, assuming there is only a single outcome to predict, before extending the analysis in section 2.3 to cover cases with multiple outcome variables. Section 2.4 considers loss functions (scoring rules) for distributional forecasts, while section 2.5 provides some concrete examples of loss functions and economic decision problems from macroeconomic and financial analysis. Section 2.6 concludes the chapter.

2.1 CONSTRUCTION AND SPECIFICATION OF THE LOSS FUNCTION

Let $Y$ denote the random variable describing the outcome of interest and let $\mathcal{Y}$ denote the set of all possible outcomes. For outcomes that are either continuous or can take on a very large number of possible values, typically $\mathcal{Y}$ is the real line, $\mathbb{R}$. In some forecasting problems the set of possible outcomes, $\mathcal{Y}$, can be much smaller, such as for a binary random variable where $\mathcal{Y} = \{0, 1\}$. For multivariate outcomes typically $\mathcal{Y} = \mathbb{R}^k$ for some integer $k$, where $k$ is the number of forecasts to be evaluated.

Point forecasts are denoted by $f$ and are defined on the set $\mathcal{F}$. Typically we assume $\mathcal{F} = \mathcal{Y}$ since in most cases it does not make sense to have forecasts that cannot take on the same values as $Y$ or, conversely, have forecasts that can take on values that the outcome $Y$ cannot. There are exceptions to this rule, however. For example, a forecast of the number of children per family could be a fraction such as 1.9, indicating close to 2 children, even though $Y$ cannot take this value. We assume that the predictors $Z$ (as well as the outcome $Y$ and hence the forecast $f$) are real valued. Formally, the loss function, $L(f, Y, Z)$, is then defined as a mapping $L : \mathcal{Y} \times \mathcal{Y} \times \mathcal{Z} \to \mathcal{L}$, where $\mathcal{L}$ is in $\mathbb{R}^1$, and $\mathcal{Z}$ contains the set of possible values the conditioning variables, $z$, can take. Often $\mathcal{L} = \mathbb{R}^1_+$, the set of nonnegative real numbers.
Alternatively, we could constrain the forecasts to lie in the convex hull of the set of all possible outcomes, i.e., $\mathcal{F} = \mathrm{conv}(\mathcal{Y})$. We discuss this further below.

A common assumption for loss functions is that loss is minimized when the forecast is equal to the outcome: $\min_f L(f, y, z) = L(y, y, z)$. The idea is that if we are to find a forecast that minimizes loss, then nothing dominates a perfect forecast. In cases where the loss function does not depend on $Z$, so $L(f, Y, Z) = L(f, Y)$, it is natural to normalize the loss function so that it takes a minimum value at 0. This can be done without loss of generality by subtracting the loss associated with the perfect forecast $f = y$, i.e., $L(f, Y) = \tilde{L}(f, Y) - \tilde{L}(Y, Y)$ for any loss function, $\tilde{L}$. For $f = y$ to be a unique minimum we must have $L(f, y) > 0$ for all $f \neq y$.1 More generally, when the loss function $L(f, Y, Z)$ varies with $Z$, it may not be possible to rescale the loss function in this manner. For example, a policy maker’s loss function over inflation forecasts might depend on the unemployment rate so that losses from incorrect inflation forecasts depend on whether the unemployment rate is high or low. For simplicity, in what follows we will mostly drop the explicit dependence of the loss function on $Z$ and focus on the simpler loss functions $L(f, Y)$.

1 In binary forecasting this condition is often not imposed. This usually does not affect the analysis but only the interpretation of the calculated loss figures.

2.1.1 Constructing a Loss Function

Construction of loss functions, much like construction of prior distributions in Bayesian analysis, requires a careful study of the forecasting problem at hand and should reflect the actual tradeoffs between forecast errors of different signs and magnitudes.
Laying out the tradeoff can be straightforward if the decision environment is fully specified and naturally results in a measurable outcome that depends on the forecast. For example, for a profit-maximizing investor with a specific trading strategy that requires forecasts of future asset prices, the natural choice of loss is the function relating payoffs to the forecast and realized returns. Other problems may not lead so easily to a specific loss function. For example, when the IMF forecasts individual countries’ budget deficits, both short-term considerations related to debt financing costs and long-term reputational concerns could matter.2 In such cases one can again follow a Bayesian prior selection strategy of defining a function that approximates a reasonable shape of losses associated with decisions based on incorrect forecasts.

Loss functions, as used by forecasters to evaluate their performance, and utility functions, as used by economists to assess the economic value of different outcomes, are naturally related. Both are grounded in the same decision-theoretic setup which regards the forecast as the decision and the outcome as the true state and maps pairs of outcomes (states) and forecasts $(Y, f)$ to the real line. In both cases we are interested in minimizing the expected loss or disutility that arises from the decision.3 The relationship between utility and loss is examined in Granger and Machina (2006), who show that the loss function can be viewed as the negative of a utility function, although a more general relation of the following form holds:

$$ U(f, Y) = k(Y) - L(f, Y), \tag{2.1} $$

where $k(Y)$ plays no role in the derivation of the optimal forecast.4

Example 2.1.1 (Squared loss and utility). Granger and Machina (2006) show that a utility function $U(f, Y)$ generates squared error loss, $L(f, Y) = a(Y - f)^2$, for $a > 0$, if and only if it takes the form

$$ U(f, Y) = k(Y) - a(Y - f)^2. \tag{2.2} $$

It follows that utility functions associated with squared error loss are restricted to a very narrow set.

2 Forecasts can even have feedback effects on outcomes as in the case of credit ratings companies whose credit scores can trigger debt payments for private companies that affect future ratings (Manso, 2013).

3 The first section of chapter 3 examines this issue in more detail.

4 Granger and Machina (2006) allow decisions to depend on forecasts without requiring that the two necessarily be identical. Instead they require that the function mapping forecasts to decisions is monotonic.

Academic studies often do not derive loss functions from first principles by referring to utility functions or fully specified decision-theoretic problems, though there are some exceptions. Loss functions that take the form of profit functions have been used to evaluate forecasts by Leitch and Tanner (1991) and Elliott and Ito (1999). West et al. (1993) compare utility-based and statistical measures of predictive accuracy for exchange rate models. Examples of loss functions derived from utility are provided in the final section of this chapter.

2.1.2 Common Properties of Loss Functions

Reasonable loss functions are grounded in economic decision problems. Under the utility-maximizing approach, loss functions inherit well-known properties from the utility function. Rather than deriving loss functions from first principles, however, it is common practice to instead use loss functions with a “reasonable shape.” For the loss function to be “reasonable,” a set of minimal properties should hold. Other properties such as symmetry or homogeneity may suggest broad families of loss functions with certain desirable characteristics. We cover both types of properties below.

Tradeoffs between different forecast errors when $f \neq y$ are quantified by the loss function.
To capture the notion that bigger errors imply bigger losses, often it is imposed that the loss is nondecreasing as the forecast moves further away from the outcome. Mathematically, this means that $L(f_2, y) \geq L(f_1, y)$ for either $f_2 > f_1 > y$ or $f_2 < f_1 < y$ for all real $y$. Nearly all loss functions used in practice have this feature.

For loss functions that depend only on the forecast error, $e = y - f$, and thus take the form $L(f, y) = L(e)$, Granger (1999) summarized these requirements:

$$ L(0) = 0 \ \text{(minimal loss of 0)}; \tag{2.3a} $$

$$ L(e) \geq 0 \ \text{for all } e; \tag{2.3b} $$

$$ L(e) \ \text{is nonincreasing in } e \text{ for } e < 0 \text{ and nondecreasing in } e \text{ for } e > 0: \quad L(e_1) \leq L(e_2) \ \text{if } e_2 < e_1 < 0, \quad L(e_1) \leq L(e_2) \ \text{if } e_2 > e_1 > 0. \tag{2.3c} $$

As in the case with more general loss, $L(f, y)$, condition (2.3a) simply normalizes the loss associated with the perfect forecast ($y = f$) to be 0. The second condition states that imperfect forecasts ($y \neq f$) generate loss no smaller than perfect ones. Most common loss functions depend only on $e$; see section 2.2 for examples.

Other properties of loss functions such as homogeneity, symmetry, differentiability, and boundedness can be used to define broad classes of loss functions. We next review these.

Homogeneity can be used to define classes of loss functions that lead to the same decisions. Homogeneous loss functions factor in such a way that

$$ L(af, ay) = h(a) L(f, y), \tag{2.4} $$

for some positive function $h(a)$, where the degree of homogeneity does not matter. For loss functions that depend only on the forecast error, homogeneity amounts to $L(ae) = h(a) L(e)$ for some positive function $h(a)$. Homogeneity is a useful property when solving for optimal forecasts since the optimal forecast will be invariant to different values of $h(a)$.

Symmetry of the loss function refers to symmetry of the forecast around $y$. It is the property that, for all $f$,

$$ L(y - f, y) = L(y + f, y). \tag{2.5} $$

For loss functions that depend only on the forecast error, symmetry reduces to $L(-e) = L(e)$, so that over- and underpredictions of the same magnitude lead to identical loss.5 Most empirical work in economic forecasting assumes symmetric loss. This choice reflects the difficulties in putting numbers on the relative cost of over- and underpredictions. Construction of a loss function requires a deeper understanding of the forecaster’s objectives and this may be difficult to accomplish. Still, the implicit choice of MSE loss by the majority of studies in the forecasting literature seems difficult to justify on economic grounds. As noted by Granger and Newbold (1986, page 125), “an assumption of symmetry about the conditional mean ... is likely to be an easy one to accept ... an assumption of symmetry for the cost function is much less acceptable.”

Differentiability of the loss function with respect to the forecast is again a regularity condition that is useful and helps simplify numerically the search for optimal forecasts. However, this condition may not be desirable and is certainly not required for a loss function to be well defined. In general, a finite number of points where the loss function fails to be differentiable will not cause undue problems at the estimation stage. However, when the loss function is extremely irregular, different methods are required for understanding the statistical properties of the loss function (see the maximum utility estimator in chapter 12).

Finally, loss functions may be bounded or unbounded. As a practical matter, there is often no obvious reason to let the weight the loss function places on very large forecast errors increase without bound.
For example, the squared error loss function examined below assigns very different losses to forecasts of, say, US inflation that result in errors of 100% versus 500%, even though it is not obvious that the associated losses should really be very different since both forecasts would lead to very similar actions. Unbounded loss functions can create technical problems for the analysis of forecasts as the expected loss may not exist, so most results in decision theory are derived under the assumption of bounded loss. In practice, forecasts are usually bounded and extremely large forecasts typically get trimmed as they are deemed implausible.

⁵ A related concept is the class of bowl-shaped loss functions. A loss function is bowl shaped if the level sets $\{e : L(e) \leq c\}$ are convex and symmetric about the origin.

2.1.3 Existence of Expected Loss

Restrictions must be imposed on the form of the loss function to make sense of the idea of minimizing the expected loss. Most basically, it is required that the expected loss exists. Suppose the forecast depends on data $Z$ through a vector of parameters, $\beta$, which depends on the parameters of the data generating process, $\theta$, so $f = f(z, \beta)$. From the definition of expected loss, we have

$$E_Y[L(f(z, \beta), Y)] = \int L(f(z, \beta), y)\, p_Y(y \mid z, \theta)\, dy, \tag{2.6}$$

where $p_Y(y \mid z, \theta)$ is the predictive density of $y$ given $z, \theta$. When the space of outcomes $\mathcal{Y}$ is finite, this expression is guaranteed to be finite. However, for outcomes that are continuously distributed, restrictions must sometimes be imposed on the loss function to ensure finite expected loss.

The existence of expected loss depends both on the loss function and on the distribution of the predicted variable, given the data, $p_Y(y \mid z, \theta)$, where $\theta$ denotes the parameters of this conditional distribution. Existence of expected loss thus hinges on how large losses can get in relation to the tail behavior of the predicted variable, as captured by $p_Y(y \mid z, \theta)$.
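As a rough numerical illustration of how the existence of the expected loss in (2.6) depends on the interplay between the loss function and the tails of the predictive density, the sketch below approximates the integral over a truncated range $[-T, T]$ and lets $T$ grow. The standard Cauchy predictive density (a Student-t with one degree of freedom), the forecast $f = 0$, the cap of 4 on the bounded loss, and all function names are assumptions made purely for illustration.

```python
import math


def cauchy_pdf(y):
    """Standard Cauchy density (Student-t with 1 degree of freedom); assumed for illustration."""
    return 1.0 / (math.pi * (1.0 + y * y))


def squared_loss(f, y):
    """Unbounded squared error loss with scale a = 1."""
    return (y - f) ** 2


def capped_squared_loss(f, y, cap=4.0):
    """Squared error loss bounded from above at a hypothetical cap."""
    return min((y - f) ** 2, cap)


def truncated_expected_loss(loss, f, T, n=200_000):
    """Midpoint-rule approximation of the integral of L(f, y) p(y) dy over [-T, T]."""
    h = 2.0 * T / n
    total = 0.0
    for i in range(n):
        y = -T + (i + 0.5) * h
        total += loss(f, y) * cauchy_pdf(y)
    return total * h


if __name__ == "__main__":
    for T in (10.0, 100.0, 1000.0):
        unbounded = truncated_expected_loss(squared_loss, 0.0, T)
        bounded = truncated_expected_loss(capped_squared_loss, 0.0, T)
        print(f"T = {T:6.0f}   squared: {unbounded:9.3f}   capped: {bounded:6.3f}")
```

Under these assumptions the truncated integral of the squared loss equals $(2/\pi)(T - \arctan T)$ and so grows without bound as $T$ increases, whereas the capped loss settles down to a finite value. This mirrors the point that bounding the loss is one way to guarantee a finite expected loss.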
A direct way to ensure that the expected loss exists is to bound the loss function from above.⁶ From a practical perspective this would seem to be a sensible practice in constructing loss functions. Even so, many of the most popular loss functions are not bounded from above. In part this practice stems from not considering the loss related to the forecasting problem at hand, but instead borrowing “off-the-shelf” loss functions from estimation methods that lead to simple closed-form expressions for the optimal forecast.

It is useful to demonstrate the conditions needed to ensure that the expected loss exists. Following Elliott and Timmermann (2004), suppose that $L$ depends only on the forecast error, $e = y - f$, and lends itself to a Taylor-series expansion around the mean error, $\mu_e = E_Y[Y - f]$:

$$L(e) = L(\mu_e) + L'_{\mu_e}(e - \mu_e) + \frac{1}{2} L''_{\mu_e}(e - \mu_e)^2 + \sum_{k=3}^{\infty} \frac{1}{k!} L^{k}_{\mu_e}(e - \mu_e)^k, \tag{2.7}$$

where $L^{k}_{\mu_e}$ denotes the $k$th derivative of $L$ evaluated at $\mu_e$. Suppose there are only a finite number of points where $L$ is not analytic and that these can be ignored because they occur with probability 0. Taking expectations in (2.7), and noting that the first-order term drops out because $E_Y[e - \mu_e] = 0$, we then get

$$\begin{aligned}
E[L(e)] &= L(\mu_e) + \frac{1}{2} L''_{\mu_e} E_Y[(e - \mu_e)^2] + \sum_{k=3}^{\infty} \frac{1}{k!} L^{k}_{\mu_e} E_Y[(e - \mu_e)^k] \\
&= L(\mu_e) + \frac{1}{2} L''_{\mu_e} E_Y[(e - \mu_e)^2] + \sum_{k=3}^{\infty} \frac{1}{k!} L^{k}_{\mu_e} \sum_{i=0}^{k} \binom{k}{i} E_Y[e^{k-i} (-\mu_e)^i] \\
&= L(\mu_e) + \frac{1}{2} L''_{\mu_e} E_Y[(e - \mu_e)^2] + \sum_{k=3}^{\infty} L^{k}_{\mu_e} \sum_{i=0}^{k} \frac{1}{i!(k-i)!} E_Y[e^{k-i} (-\mu_e)^i].
\end{aligned} \tag{2.8}$$

This expression is finite provided that all moments of the error distribution exist for which the corresponding derivative of the loss function with respect to the forecast error is nonzero. This is a strong requirement and rules out some interesting combinations of loss functions and forecast error distributions. For example, exponential loss (or the Linex loss function defined below) and a Student-t distribution with a finite number of degrees of freedom would lead to infinite expected loss, since moments of order equal to or above the degrees of freedom do not exist for this distribution.
What is required to make the higher-order terms in (2.8) vanish is that the tail decay of the predicted variable is sufficiently fast relative to the weight on these terms implied by the loss function.

⁶ This is sufficient since we have already bounded the loss function (typically at 0) from below.

2.1.4 Loss Functions Not Based on Expected Loss

So far we have characterized the loss function $L(f, Y)$ for a univariate outcome, and defined its properties with reference to a “one-shot” problem. This makes sense when forecasting is placed in a decision-theoretic or utility-maximization context. This approach to forecasting is internally consistent, from initially setting up the problem to defining the expected loss and conducting model estimation and forecast evaluation.

Some loss functions that have been used in practice are based directly on sample statistics without relating the sample loss to a population loss function. In cases where such a population loss function exists and satisfies reasonable properties, this does not cause any problems. Basing the loss function directly on a sample of losses can, however, sometimes yield a loss function that does not make sense in population or for fully specified decision problems. Loss functions that do not map back to decision problems often have poor and unintended properties. We consider one such example below.

Example 2.1.2 (Kuipers score for binary outcome). Let $f \in \{1, -1\}$ be a forecast of the binary variable $y \in \{1, -1\}$ and let $n_{j,k}$, $j, k \in \{-1, 1\}$, be the number of observations for which the forecast equals $j$ and the outcome equals $k$. The Kuipers score is given by

$$KuS = \frac{n_{1,1}}{n_{1,1} + n_{-1,1}} - \frac{n_{1,-1}}{n_{1,-1} + n_{-1,-1}}. \tag{2.9}$$

This is the positive hit rate, i.e., the proportion of times where $y = 1$ is correctly predicted, less the “false positive rate,” i.e., the proportion of times where $y = 1$ is wrongly predicted.
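A minimal sketch of how the Kuipers score in (2.9) might be computed from a sample of forecasts and outcomes is given below. The function name and the sample data are hypothetical, and the choice to raise an error when one of the denominators in (2.9) is zero is an assumption about how such cases could be handled, not something prescribed by the text.

```python
def kuipers_score(forecasts, outcomes):
    """Kuipers score: hit rate for y = 1 minus the false positive rate.

    Forecasts and outcomes must be coded as 1 or -1; any other value raises KeyError.
    """
    # n[(j, k)] counts observations with forecast j and outcome k.
    n = {(j, k): 0 for j in (1, -1) for k in (1, -1)}
    for f, y in zip(forecasts, outcomes):
        n[(f, y)] += 1

    positives = n[(1, 1)] + n[(-1, 1)]    # observations with y = 1
    negatives = n[(1, -1)] + n[(-1, -1)]  # observations with y = -1
    if positives == 0 or negatives == 0:
        # One of the denominators in (2.9) is zero, so the score is undefined.
        raise ValueError("both outcomes must occur in the sample")

    hit_rate = n[(1, 1)] / positives
    false_positive_rate = n[(1, -1)] / negatives
    return hit_rate - false_positive_rate


if __name__ == "__main__":
    # Hypothetical sample: two hits and one miss for y = 1, one false alarm for y = -1.
    forecasts = [1, 1, -1, 1, -1, -1]
    outcomes = [1, -1, -1, 1, 1, -1]
    print(kuipers_score(forecasts, outcomes))
```

In this hypothetical sample the hit rate is 2/3 and the false positive rate is 1/3, so the score is 1/3. A forecast that is always correct attains the maximum score of 1.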
This can equivalently be thought of as

$$KuS = \frac{n_{1,1}}{n_{1,1} + n_{-1,1}} + \frac{n_{-1,-1}}{n_{1,-1} + n_{-1,-1}} - 1, \tag{2.10}$$

which is the hit rate for $y = 1$ plus the hit rate for $y = -1$ minus a centering constant of 1. The Kuipers score is positive if the sum of the positive and negative hit rates exceeds 1.

For a sample with a single observation, this definition makes no sense, as one of the denominators in (2.10) is 0: either $n_{1,1} + n_{-1,1} = 0$ or $n_{1,-1} + n_{-1,-1} = 0$. For a single observation, this sample statistic does not follow from any obvious loss function. The first term in (2.10) is the sample analog of $P[f = 1 \mid Y = 1]$ and the second is the sample analog of $P[f = -1 \mid Y = -1]$. However, they do not combine to a loss function with this sample analog.

This failure to embed the loss function into the expected loss framework results in odd properties for the objective. For example, the definition of KuS in (2.9) implies that the marginal value of an extra “hit,” i.e., a correct call, depends on the sample proportion of hits. To see this, consider the improvement in KuS from adding a single successfully predicted observation $y = 1$, $f = 1$. The resulting improvement in the hit rate is

$$\Delta KuS = \frac{n_{1,1} + 1}{n_{1,1} + 1 + n_{-1,1}} - \frac{n_{1,1}}{n_{1,1} + n_{-1,1}} = \frac{n_{-1,1}}{(n_{1,1} + n_{-1,1})(n_{1,1} + 1 + n_{-1,1})}.$$

Thus the marginal value of a correct call depends on the total number of observations and the proportion of missed hits prior to the new observation. The Kuipers score's poor properties arise from the lack of justification of its setup for a population problem.

2.2 SPECIFIC LOSS FUNCTIONS

We next review various families of loss functions that have been suggested in the forecasting literature. The vast majority of empirical work on forecasting assumes that the loss function depends only on the forecast error, $e = Y - f$, i.e., the difference between the outcome and the forecast. In this case we can write $L(f, Y, Z) = L(e)$.
In general, loss functions can be more complicated functions of the outcome and forecast and take the form $L(f, Y)$ or $L(f, Y, Z)$.

2.2.1 Loss That Depends Only on Forecast Errors

The most commonly used loss functions, including squared error loss and absolute error loss, depend only on the forecast error. For such loss functions, $L(f, Y, Z) = L(e)$, so the loss function takes a particularly simple form.

2.2.1.1 Squared Error Loss

By far the most popular loss function in empirical studies is squared error loss, also known as quadratic or mean squared error (MSE) loss:

$$L(e) = a e^2, \quad a > 0. \tag{2.11}$$

This loss function clearly satisfies the three Granger properties listed in (2.3). When viewed as a family of loss functions, corresponding to different values of the scalar $a$, squared error loss forms a homogeneous class.⁷ It is symmetric, bowl shaped, and differentiable everywhere and penalizes large forecast errors at an increasing rate due to its convexity in $e$. The loss function is not bounded from above. Large forecast errors or “outliers” are thus very costly under this loss function.

⁷ While the scaling factor, $a$, does not matter to the properties of the optimal forecast, it is common to set $a = 0.5$, which removes the “2” that arises from taking first derivatives.

[Figure 2.1: MSE loss versus lin-lin loss for different values of the lin-lin asymmetry parameter, α. Three panels: α = 0.25 (top); α = 0.5, MAE loss (middle); α = 0.75 (bottom). Horizontal axis: e; vertical axis: L(e).]

2.2.1.2 Absolute Error Loss

Rather than using squared error loss, which results in increasingly large losses for large forecast errors, the absolute error is preferred in some cases. Under mean absolute error (MAE) loss,

$$L(e) = a |e|, \quad a > 0. \tag{2.12}$$

Like MSE loss, this loss function satisfies the three Granger properties listed in (2.3).
The loss function is symmetric, bowl shaped, and differentiable everywhere except at 0. It is again unbounded. However, the penalty to large forecast errors increases linearly rather than quadratically as for MSE loss.

2.2.1.3 Piecewise Linear Loss

Piecewise linear, or so-called lin-lin loss, takes the form

$$L(e) = \begin{cases} -a(1 - \alpha) e & \text{if } e \leq 0, \\ a \alpha e & \text{if } e > 0, \end{cases} \qquad a > 0, \tag{2.13}$$

for $0 < \alpha < 1$. Positive forecast errors are assigned a (relative) weight of $\alpha$, while negative errors get a weight of $1 - \alpha$. The greater is $\alpha$, the bigger the loss from positive forecast errors, and the smaller the loss from negative errors. Again, this loss function forms a homogeneous class for all positive values of $a$. It is common to set $a = 1$, so that the weights are normalized to sum to 1.

Lin-lin loss clearly satisfies the three Granger properties. Moreover, it is differentiable everywhere, except at 0. Compared to MSE loss, this loss function does not penalize large errors as much. MAE loss arises as a special case of lin-lin loss if $\alpha = 0.5$, in which case (2.13) simplifies to (2.12).

Figure 2.1 plots lin-lin loss against squared error loss. The middle window shows the symmetric case with $\alpha = 0.5$, and so corresponds to MAE loss. Small forecast errors ($|e| < 1$) are costlier under MAE loss than under MSE loss, while conversely large errors are costlier under MSE loss. The top window assumes that $\alpha = 0.25$, so negative forecast errors are three times as costly as positive errors, reflected in the steeper slope of the loss curve for $e < 0$. In the bottom window, $\alpha = 0.75$ and so positive forecast errors are three times costlier than negative errors.

2.2.1.4 Linex Loss

Linear-exponential, or Linex, loss takes the form

$$L(e) = a_1 (\exp(a_2 e) - a_2 e - 1), \quad a_2 \neq 0, \; a_1 > 0. \tag{2.14}$$

Linex loss is differentiable everywhere, but is not symmetric.
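The asymmetry of the Linex loss in (2.14) can be checked numerically. The sketch below is only an illustration: the function name, the default parameter values, and the error magnitude of 2 are assumptions rather than choices taken from the text.

```python
import math


def linex_loss(e, a1=1.0, a2=1.0):
    """Linex loss a1 * (exp(a2 * e) - a2 * e - 1), requiring a1 > 0 and a2 != 0."""
    if a1 <= 0.0 or a2 == 0.0:
        raise ValueError("Linex loss requires a1 > 0 and a2 != 0")
    return a1 * (math.exp(a2 * e) - a2 * e - 1.0)


if __name__ == "__main__":
    # With a2 = 1, a positive error of 2 costs more than a negative error of 2.
    print(linex_loss(2.0), linex_loss(-2.0))
    # Flipping the sign of a2 mirrors the loss function around e = 0.
    print(linex_loss(2.0, a2=-1.0), linex_loss(-2.0, a2=1.0))
```

With $a_1 = a_2 = 1$, an error of $+2$ gives a loss of roughly 4.39 while an error of $-2$ gives roughly 1.14, so the two sides of the loss function are treated very differently. Reversing the sign of $a_2$ swaps the roles of positive and negative errors.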
Varian (1975) used this loss function to analyze real estate assessments, while Zellner (1986a) used it in the context of Bayesian prediction problems. The parameter $a_2$ controls both the degree and direction of asymmetry. When $a_2 > 0$, Linex loss is approximately linear for negative forecast errors and approximately exponential for positive forecast errors. In this case, large underpredictions ($f < y$, so $e = y - f > 0$) are costlier than overpredictions of the same magnitude, with the relative cost increasing as the magnitude of the forecast error rises. Conversely, for $a_2 < 0$, large overpredictions are costlier than equally large underpredictions.

Although Linex loss is not defined for $a_2 = 0$, setting $a_1 = 1/a_2^2$ and taking the limit as $a_2 \to 0$, by L'Hôpital's rule the Linex loss function approaches squared error loss:

$$\lim_{a_2 \to 0} L(e) = \lim_{a_2 \to 0} \frac{e \exp(a_2 e) - e}{2 a_2} = \lim_{a_2 \to 0} \frac{e^2 \exp(a_2 e)}{2} = \frac{e^2}{2}.$$

Figure 2.2 plots MSE loss against Linex loss for $a_2 = 1$ (top) and $a_2 = -1$ (bottom). Measured relative to the benchmark MSE loss, large positive (top) or large negative (bottom) forecast errors are very costly in these respective cases.

This loss function has been used in many empirical studies on variables such as budget forecasts (Artis and Marcellino, 2001) and survey forecasts of inflation (Capistrán and Timmermann, 2009). Christoffersen and Diebold (1997) examine this loss function in more detail.

2.2.1.5 Piecewise Asymmetric Loss

A general class of asymmetric loss functions can be constructed by letting the loss function shift at a discrete set of points, $\{\bar{e}_1, \ldots, \bar{e}_{n-1}\}$:

$$L(e) = \begin{cases} L_1(e) & \text{if } e \leq \bar{e}_1, \\ L_2(e) & \text{if } \bar{e}_1 < e \leq \bar{e}_2, \\ \;\;\vdots & \;\;\vdots \\ L_n(e) & \text{if } e > \bar{e}_{n-1}. \end{cases} \tag{2.15}$$

Here $\bar{e}_{i-1} < \bar{e}_i$ for $i = 2, \ldots, n - 1$.
It is common to set $n = 2$, choose $\bar{e}_1 = 0$ and assume that both pieces of the loss function satisfy the usual loss properties so that the loss is piecewise asymmetric around 0 and continuous (but not necessarily differentiable) at 0. Lin-lin loss in (2.13) is a special case of (2.15) as is the asymmetric quadratic loss function

$$L(e) = \begin{cases} (1 - \alpha) e^2 & \text{if } e \leq 0, \\ \alpha e^2 & \text{if } e > 0, \end{cases} \tag{2.16}$$

considered by Artis and Marcellino (2001), Newey and Powell (1987) and Weiss (1996).

[Figure 2.2: MSE loss versus Linex loss for different values of the Linex parameter, $a_2$. Two panels: right-skewed Linex loss with $a_2 = 1$ (top); left-skewed Linex loss with $a_2 = -1$ (bottom). Horizontal axis: e; vertical axis: L(e).]

A flexible class of loss functions proposed by Elliott, Komunjer, and Timmermann (2005) sets $n = 2$ and $\bar{e}_1 = 0$ in (2.15), while $L_1(e) = (1 - \alpha)|e|^p$ and $L_2(e) = \alpha |e|^p$, where $p$ is a positive integer, and $\alpha \in (0, 1)$. This gives the EKT loss function,

$$L(e) \equiv [\alpha + (1 - 2\alpha) \mathbf{1}(e < 0)] |e|^p, \tag{2.17}$$

where $\mathbf{1}(e < 0)$ is an indicator function that equals 1 if $e < 0$, and otherwise equals 0. Letting $\alpha$ deviate from 0.5 produces asymmetric loss, with larger values of $\alpha$ indicating greater aversion to positive forecast errors. Imposing $p = 1$ and $\alpha = 0.5$, MAE loss is obtained. More generally, setting $p = 1$, (2.17) reduces to lin-lin loss since the loss is linear on both sides of 0, but with different slopes. Setting $p = 2$ and $\alpha = 0.5$ gives the MSE loss function, which is therefore also nested as a special case, as is the asymmetric quadratic loss function (2.16) for $p = 2$, $\alpha \in (0, 1)$. Hence, the EKT family of loss functions nests the loss functions in (2.11), (2.12), (2.13), and (2.16) as special cases and generalizes many of the commonly employed loss functions.
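The nesting relations described above can be verified with a short sketch. The function names are hypothetical, and the comparison losses use the normalizations $a = 1$ for lin-lin loss and $a = 0.5$ for MAE and MSE loss, which are assumptions chosen so that the special cases match (2.17) exactly rather than only up to scale.

```python
def ekt_loss(e, alpha, p):
    """EKT loss [alpha + (1 - 2 alpha) 1(e < 0)] |e|^p from (2.17)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if p < 1 or p != int(p):
        raise ValueError("p must be a positive integer")
    indicator = 1.0 if e < 0 else 0.0
    return (alpha + (1.0 - 2.0 * alpha) * indicator) * abs(e) ** p


def linlin_loss(e, alpha, a=1.0):
    """Lin-lin loss from (2.13)."""
    return -a * (1.0 - alpha) * e if e <= 0 else a * alpha * e


def asymmetric_quadratic_loss(e, alpha):
    """Asymmetric quadratic loss from (2.16)."""
    return (1.0 - alpha) * e * e if e <= 0 else alpha * e * e


if __name__ == "__main__":
    for e in (-2.5, -1.0, 0.0, 0.5, 3.0):
        print(
            e,
            ekt_loss(e, 0.5, 1),   # equals 0.5 * |e|   (MAE with a = 0.5)
            ekt_loss(e, 0.25, 1),  # equals lin-lin loss with alpha = 0.25
            ekt_loss(e, 0.5, 2),   # equals 0.5 * e^2   (MSE with a = 0.5)
            ekt_loss(e, 0.3, 2),   # equals asymmetric quadratic, alpha = 0.3
        )
```

Keeping the asymmetry weight and the exponent as separate arguments makes it easy to see that $\alpha$ governs the direction of the asymmetry while $p$ governs how quickly the penalty grows with the size of the error.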
[Figure 2.3: MSE loss versus EKT loss with p = 3 for different values of the asymmetry parameter, α. Two panels: α = 0.25 (top); α = 0.75 (bottom). Horizontal axis: e; vertical axis: L(e).]

Figure 2.3 plots the EKT loss function for $p = 3$, $\alpha = 0.25$ (top) and $\alpha = 0.75$ (bottom). Compared with MSE loss, substantial asymmetries can be generated by this loss function.

Empirically, the EKT loss function has been used to analyze forecasts of government budget deficits produced by the IMF and OECD (Elliott, Komunjer, and Timmermann, 2005), the Federal Reserve Board's inflation forecasts (Capistrán, 2008), as well as output and inflation forecasts from the Survey of Professional Forecasters (Elliott, Komunjer, and Timmermann, 2008).

2.2.1.6 Binary Loss

When the space of outcomes $\mathcal{Y}$ is discrete, the forecast errors typically take on only a small number of possible values. Hence in constructing a loss function for such problems, all that is required is to evaluate each of a small number of possibilities. The simplest case arises when forecasting a binary outcome so that $\mathcal{Y} = \{-1, 1\}$ or $\mathcal{Y} = \{0, 1\}$. In this case there are only four possible pairings of the point forecast and outcome: two where the forecast gives the correct outcome and two errors.

If we restrict the loss function to not depend on $Z$ (this case is examined below) and also restrict the problem so that a correct forecast has the same value regardless of the value for $Y$, then the binary loss function can be written as⁸

$$L(f, y) = \begin{cases} 0 & \text{if } f = y = 0, \\ (1 - c) & \text{if } f = 0, \; y = 1, \\ c & \text{if } f = 1, \; y = 0, \\ 0 & \text{if } f = y = 1. \end{cases} \tag{2.18}$$

Here we have set the loss from a correct prediction to 0 and normalized the losses from an incorrect forecast to sum to 1 by dividing by their sum; see Schervish (1989), Boyes, Hoffman, and Low (1989), Granger and Pesaran (2000), and Elliott and Lieli (2013).
For (2.18) to be a valid loss function, we require that $0 < c < 1$. This ensures that the properties of the loss function listed in (2.3) hold. Notice that the binary loss function can be written as $L(e)$, since the loss is equal to $L(f, y) = c \mathbf{1}(e < 0) + (1 - c) \mathbf{1}(e > 0)$.

2.2.2 Level- and Forecast-Dependent Loss Functions

Economic loss is mostly assumed to depend on only the forecast error, $e = Y - f$. This is too restrictive an assumption for situations in which the forecaster's objective function depends on state variables such as the level of the outcome variable $Y$. More generally, we can consider loss functions of the form $L(f, y) \neq L(e)$.

The most common level-dependent loss function is the mean absolute percentage error (MAPE), given by

$$L(e, y) = a \left| \frac{e}{y} \right|. \tag{2.19}$$

Since the forecast and forecast error have the same units as the outcome, the MAPE is a unitless loss function. This is considered to be an advantage when constructing the sample analog of this loss function and employing it to evaluate forecast methods across outcomes measured in different units. If the loss function is well grounded in terms of the actual costs arising from the forecasting problem, dependence on units does not seem to be an important issue: comparisons across different forecasts with different units should be related not through some arbitrary adjustment but instead in a way that trades off the costs associated with the forecast errors for each of the outcomes. This is achieved by the multivariate loss functions examined in the next section.

Scaling the forecast error by the outcome in (2.19) has the effect of weighting forecast errors more heavily when $y$ is near 0 than when $y$ is far from 0. This is difficult to justify in many applications. Moreover, if the predictive density for $Y$ has nontrivial mass at 0, then the expected loss is unlikely to exist, hence invalidating many of the results from decision theory for this case.
Nonetheless, MAPE loss remains popular in many practical forecast evaluation experiments.

⁸ See chapter 12 for a comprehensive treatment of forecast analysis under this loss function.

More generally, level- and forecast-dependent loss functions can be written as $L(f, y)$ but do not reduce to $L(e)$ or $L(e, y)$. Although loss functions in this class are not particularly common, there are examples of their use. For example, Bregman (1967) suggested loss functions of the form

$$L(f, y) = \phi(y) - \phi(f) - \phi'(f)(y - f), \tag{2.20}$$

where $\phi$ is a strictly convex function, so $\phi'' > 0$. Squared error loss is nested as a special case of (2.20). Differentiating (2.20) with respect to the forecast, $f$, we get

$$\frac{\partial L(f, y)}{\partial f} = -\phi'(f) - \phi''(f)(y - f) + \phi'(f) = -\phi''(f)(y - f),$$

which generally depends on both $y$ and $f$. This, along with the assumption that $\phi'' > 0$, ensures that the conditional mean is the optimal forecast. Bregman loss is further discussed in Patton (2015).

In an empirical application of level-dependent loss, Patton and Timmermann (2007b) find that the Federal Reserve's forecasts of output growth fail to be optimal if their loss is restricted to depend only on the forecast error. Rationalizing the Federal Reserve's forecasts requires not only that overpredictions of output growth are costlier than underpredictions, but also that overpredictions of output are particularly costly during periods of low economic growth. This finding can be justified if the cost of an overly tight monetary policy is particularly high during periods with low economic growth when such a policy may cause or extend a recession.⁹

2.2.3 Loss Functions That Depend on Other State Variables

Under some simplifying assumptions we saw earlier that the binary loss function takes a particularly simple form.
More generally, if the loss function depends on $Z$ and the loss associated with a perfect forecast depends on the outcome $Y$, then the loss function for the binary problem becomes

$$L(f, y, z) = \begin{cases} -u_{1,1}(z) & \text{if } f = 1 \text{ and } y = 1, \\ -u_{1,0}(z) & \text{if } f = 1 \text{ and } y = 0, \\ -u_{0,1}(z) & \text{if } f = 0 \text{ and } y = 1, \\ -u_{0,0}(z) & \text{if } f = 0 \text{ and } y = 0, \end{cases} \tag{2.21}$$

where $u_{i,j}(z)$ are the utilities gained when $f = i$, $y = j$, and $Z = z$. In this general form, the loss function cannot be simplified to depend only on the forecast error.

Again restrictions need to be imposed on the losses in (2.21). First, we require that $u_{0,0}(z) > u_{1,0}(z)$ and $u_{1,1}(z) > u_{0,1}(z)$ so that, for a given outcome, losses associated with correct forecasts are not higher than those associated with incorrect forecasts. We might also impose that $\min\{u_{0,0}(z), u_{1,1}(z)\} > \max\{u_{1,0}(z), u_{0,1}(z)\}$ so that correct forecasts result in a lower loss (higher utility) than incorrect forecasts, whatever the outcome. Finally, it is quite reasonable to assume that correct forecasts are associated with different losses, $u_{0,0}(z) \neq u_{1,1}(z)$, in which case normalizing the loss associated with a perfect forecast to 0 will not be possible for both outcomes. This is an example of level-dependent loss being built directly into the loss function.

⁹ Some central banks desire to keep inflation within a band of 0 to 2% per annum. Inflation within this band might be regarded as a successful outcome, whereas deflation or inflation above 2% is viewed as failure. Again this is indicative of a nonstandard loss function; see Kilian and Manganelli (2008).

2.2.4 Consistent Ranking of Forecasts with Measurement Errors in the Outcome

Hansen and Lunde (2006) and Patton (2011) consider the problem of comparing and consistently ranking volatility forecasts from different models when the observed outcome is measured with noise. This situation is common in volatility forecasting or in macro forecasting where the outcome may subsequently be revised.
The volatility of asset returns is never actually observed although a proxy for it can be constructed. Volatility forecast comparisons typically use realized volatility, squared returns, or range-based proxies, $\hat{\sigma}^2$, in place of the true variance, $\sigma^2$. Hansen and Lunde establish sufficient conditions under which noisy proxies can be used in the forecast evaluation without giving rise to rankings that are inconsistent with the (infeasible) ranking based on the true outcome.

Patton defines a loss function as being robust to measurement errors in the outcome if it gives the same expected-loss ranking of two forecasts whether based on the true (but unobserved) outcome or some unbiased proxy thereof. Specifically, a loss function is robust to such measurement errors if, for two forecasts $f_1$ and $f_2$, the ranking based on the true outcome, $y$,

$$E[L(f_1, y)] \lessgtr E[L(f_2, y)]$$

is the same as the ranking based on the proxied outcome, $\hat{y}$:

$$E[L(f_1, \hat{y})] \lessgtr E[L(f_2, \hat{y})],$$

for unbiased proxies $\hat{y}$ satisfying $E[\hat{y} \mid Z] = y$, where $Z$ is again the information set used to generate the forecasts. Patton (2011, Proposition 1) establishes conditions under which robust loss functions must belong to the following family:

$$L(f, \hat{y}) = \tilde{C}(f) + B(\hat{y}) + C(f)(\hat{y} - f), \tag{2.22}$$

where $B$ and $C$ are twice continuously differentiable functions, $C$ is strictly decreasing, and $\tilde{C}$ is the antiderivative of $C$, i.e., $\tilde{C}' = C$.¹⁰ In Patton's analysis $f$ is a volatility forecast and $\hat{y}$ is a proxy for the realized volatility. Examples of loss functions in the family (2.22) include MSE and QLIKE loss:

$$\text{MSE}: \quad L(f, \hat{y}) = (\hat{y} - f)^2,$$

$$\text{QLIKE}: \quad L(f, \hat{y}) = \log(f) + \frac{\hat{y}}{f}.$$

¹⁰ If $B = -\tilde{C}$, this family of loss functions yields the Bregman family in equation (2.20).

2.3 MULTIVARIATE LOSS FUNCTIONS

When a decision maker's objectives depend on multiple variables, the loss function needs to be extended from being defined over scalar outcomes to depend on a vector of outcomes.
This situation arises, for example, for a central bank concerned with both inflation and employment prospects. Conceptually it is easy to generalize univariate loss functions to the multivariate case, although difficulties may arise in determining how costly different combinations of forecast errors are. How individual forecast errors or their cross products are weighted becomes particularly important.

The most common multivariate loss function is multivariate quadratic error loss, also known as multivariate MSE loss; see Clements and Hendry (1993). This loss function maps a vector of forecast errors $e = (e_1, \ldots, e_n)'$ to the real number line and so is simply a weighted average of the individual squared forecast errors and their cross products:¹¹

$$\text{MSE}(A) = e' A e. \tag{2.23}$$

Here the $(n \times n)$ matrix $A$ is required to be nonnegative and positive definite. This is the matrix equivalent of the univariate assumption for MSE loss that $a > 0$ in (2.11).

As noted in the discussion of MAPE loss, the loss function in (2.23) may be difficult to interpret when the predicted variables are measured in different units. This concern is related to obtaining a reasonable specification of the loss function whose role it is to compare and trade off losses of different sizes across different variables. Hence this is not really a limitation of the loss function itself but of applications of the loss function.

The loss function in (2.23) is “bowl shaped” in the sense that the level sets are convex and symmetric around 0. It is easily verified that (2.23) satisfies the basic assumptions for a loss function in (2.3). If the entire vector of forecast errors is 0, then the loss is 0. A positive-definite and nonnegative weighting matrix $A$ ensures that losses rise as forecast errors get larger, so assumption (2.3c) holds.¹²

A special case arises when $A = I_n$, the $(n \times n)$ identity matrix.
In this case covariances can be ignored and the loss function simplifies to $\text{MSE}(I_n) = E[e'e] = \operatorname{tr} E[(ee')]$, i.e., the sum of the individual mean squared errors. Thus, a loss function based on the trace of the covariance matrix of forecast errors is simply a special case of the general form in (2.23). In general, however, covariances between forecast errors come into play, reflecting the cross products corresponding to the off-diagonal terms in $A$.

¹¹ While the vector of forecast errors could represent different variables, it could also comprise forecast errors for the same variable measured at different horizons, corresponding to short- and long-horizon forecasts.

¹² Positive-definiteness alone is not sufficient to guarantee that the multivariate equivalent to (2.3) holds. Suppose $n = 2$ and let $A$ be a symmetric matrix with 2 on the diagonals and $-1$ in the off-diagonal cells. $A$ is positive definite but the marginal effect of making a bigger error on the second forecast is $4e_2 - 2e_1$, where $e = (e_1, e_2)'$. Hence if $e_2 < e_1/2$, increasing the error associated with the second forecast would reduce loss, thus violating (2.3).

As a second example of a multivariate loss function, Komunjer and Owyang (2012) provide an interesting generalization of the Elliott, Komunjer, and Timmermann (2005) loss function in (2.17) to the case where $e = (e_1, \ldots, e_n)'$. Let $\|e\|_p = (|e_1|^p + \cdots + |e_n|^p)^{1/p}$ be the $l_p$ norm of $e$ and assume that the $n$-vector of asymmetry parameters, $\alpha$, satisfies $\|\alpha\|_q < 1$. Further, let $1 \leq p \leq \infty$ and, for a given value of $p$, set $q$ so that $1/p + 1/q = 1$. The multivariate loss function proposed by Komunjer and Owyang takes the form

$$L(e) = (\|e\|_p + \alpha' e) \|e\|_p^{p-1}. \tag{2.24}$$

As in the univariate case, the extent to which large forecast errors are penalized relative to small ones is determined by the exponent, $p$. However, now the full vector $\alpha = (\alpha_1, \ldots, \alpha_n)'$ characterizes the asymmetry in the loss function, with $\alpha = 0$ representing the symmetric case. Since $\alpha$ is a vector, this loss function offers great flexibility in both the magnitude and direction of asymmetry for multivariate loss functions.

Other multivariate loss functions have been used empirically. Laurent, Rombouts, and Violante (2013) consider a multivariate version of the family of loss functions introduced by Patton (2011), and apply it to volatility forecasting.

2.4 SCORING RULES FOR DISTRIBUTION FORECASTS

So far we have focused our discussion on point forecasts, but forecasts of the full distribution of outcomes are increasingly reported. Just as point forecasting requires a loss-based measure of the distance between the forecast $f$ and the outcome $Y$, distribution forecasts also require a loss function. These are known as scoring rules and reward forecasters for making more accurate predictions, i.e., predictions that are “closer” to the observed outcome get a higher score, where closeness depends on the shape of the scoring rule. Gneiting and Raftery (2007) provide a survey of scoring rules and discuss their properties.

Scoring rules, $S(p, y)$, are mappings of predictive probability distributions, $p$, and outcomes, $y$, to the real line. Suppose a forecaster uses the predictive probability distribution, $p$, while the probability distribution used to evaluate the “goodness of fit” of $p$ is denoted $p_0$. Then the expected value of $S(p, y)$ under $p_0$ is denoted $S(p, p_0)$. A scoring rule is called strictly proper if the forecaster's best probability distribution is $p_0$, i.e., $S(p_0, p_0) \geq S(p, p_0)$ with equality holding only if $p = p_0$. In this situation there will be no incentive for the forecaster to use a probability distribution $p \neq p_0$ since this would reduce the score.
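To make the notion of a strictly proper scoring rule concrete, the sketch below computes the expected score $S(p, p_0)$ for a categorical outcome under two rules: the logarithmic score, $S(p, y_i) = \log(p_i)$, and the linear score, $S(p, y_i) = p_i$. The evaluation distribution `p0`, the candidate forecasts, and all function names are hypothetical and chosen only for illustration.

```python
import math


def expected_score(score, p, p0):
    """Expected score S(p, p0): the sum over categories j of p0[j] * S(p, y_j)."""
    return sum(p0_j * score(p, j) for j, p0_j in enumerate(p0))


def log_score(p, i):
    """Logarithmic score: log of the probability assigned to the realized category."""
    return math.log(p[i])


def linear_score(p, i):
    """Linear score: the probability assigned to the realized category."""
    return p[i]


if __name__ == "__main__":
    p0 = [0.6, 0.3, 0.1]  # hypothetical evaluation distribution
    candidates = {
        "p0 itself": p0,
        "too sharp": [0.98, 0.01, 0.01],
        "too flat": [0.34, 0.33, 0.33],
    }
    for name, p in candidates.items():
        print(
            f"{name:10s}  log: {expected_score(log_score, p, p0):7.4f}"
            f"  linear: {expected_score(linear_score, p, p0):6.4f}"
        )
```

Under the logarithmic score, reporting `p0` gives the highest expected score among the candidates, as strict propriety requires. Under the linear score, the overly sharp forecast that piles almost all probability on the most likely category earns a higher expected score than `p0` itself, so a forecaster rewarded by that rule would have an incentive to misreport.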
The performance of a given candidate probability distribution, $p$, relative to the optimal rule, can be measured through the so-called divergence function

$$d(p, p_0) = S(p_0, p_0) - S(p, p_0). \tag{2.25}$$

Notice the similarity to the normalization in equation (2.3a) for loss functions based on point forecasts in (2.3): the divergence function obtains its minimum value of 0 only if $p = p_0$, and otherwise takes a positive value. The forecaster's objective of maximizing the scoring rule thus translates into minimizing the divergence function.

Several scoring rules have been used in the literature. Many of these have been considered for categorical data limited to discrete outcomes $y = (y_1, \ldots, y_m)$ with associated probabilities $\{p_1, \ldots, p_m\}$. Denote by $p_i$ the predicted probability that corresponds to the range that includes $y_i$. The logarithmic score,

$$S(p, y_i) = \log(p_i), \tag{2.26}$$

gives rise to the well-known Kullback–Leibler divergence measure,

$$d(p, p_0) = \sum_{j=1}^{m} p_{0j} \log(p_{0j} / p_j). \tag{2.27}$$

Similarly, the quadratic or Brier score,

$$S(p, y_i) = 2 p_i - \sum_{j=1}^{m} p_j^2 - 1, \tag{2.28}$$

generates the squared divergence

$$d(p, p_0) = \sum_{j=1}^{m} (p_j - p_{0j})^2. \tag{2.29}$$

For density forecasts defined over continuous outcomes the logarithmic and quadratic scores take the form

$$\log S(p, y) = \log p(y),$$

$$S(p, y) = 2 p(y) - \int p(y)^2 \mu(dy),$$

where $\mu(\cdot)$ is the probability measure associated with the outcome, $y$. Both are proper scoring rules. By contrast, the linear score, $S(p, y) = p(y)$, can be shown not to be a proper scoring rule; see Gneiting and Raftery (2007).

Which scoring rule to use in a given situation depends, of course, on the underlying objectives for the problem at hand, and the choice should most closely resemble the costs involved in the decision problem. To illustrate this point, we next provide an example from the semiconductor supply chain.

Example 2.4.1 (Loss function for semiconductors). Cohen et al.
(2003) construct an economically motivated loss or cost function for a semiconductor equipment supply chain. Supply firms are assumed to hold soft orders from clients which may either be canceled (with probability $\pi$) or get finalized (with probability $1 - \pi$) at some later date, $y_N$, when the final information arrives. Given such orders, firms attempt to optimally determine the timing of the production start, $y_\pi$, where $y_N > y_\pi$ due to a production lead-time delay. If an order is canceled, the supplier incurs a cancelation cost, $c$, per unit of time. Let $y$ denote the final delivery date in excess of the production lead time. If this exceeds the production date, the supplier will incur holding (inventory) costs, $h$, per unit of time. Conversely, if the production start date, $y_\pi$, exceeds $y$, the company will not be able to meet the requested delivery date and so incurs a delay cost of $g$ per unit of time. Cohen et al. (2003) assume that suppliers choose the production date, $y_\pi$, so as to minimize the expected total cost

$$E[L(y_\pi, y, y_N)] = \pi \times c \int_{y_\pi}^{\infty} (y_N - y_\pi)\, dP_N(y_N) + (1 - \pi) \left[ h \int_{y_\pi}^{\infty} (y - y_\pi)\, dP_y(y) + g \int_{-\infty}^{y_\pi} (y_\pi - y)\, dP_y(y) \right],$$

where $P_y(y)$ and $P_N(y_N)$ are the cumulative distribution functions of $y$ and $y_N$, respectively. Provided that this expression is convex in $y_\pi$, the cost-minimizing production time, $y_\pi^*$, can be shown to solve the first-order condition

$$\pi \times c \times P_N(y_\pi^*) + (1 - \pi)(g + h) P_y(y_\pi^*) = \pi \times c + (1 - \pi) h, \tag{2.30}$$

and so implicitly depends on the cancelation probability, cancelation costs, inventory and delay costs, in addition to the predictive distributions for the finalization and final delivery dates. Cohen et al. (2003) use an exponential distribution to model the arrival time of the final order, $P_N$, and a Weibull distribution to model the distribution of the final delivery date, $P_y$.
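The first-order condition (2.30) has no closed-form solution in general, but it can be solved numerically. The Python sketch below does so by bisection under an exponential distribution for $P_N$ and a Weibull distribution for $P_y$. All parameter values (`pi`, `c`, `h`, `g`, `rate`, `shape`, `scale`) and all function names are hypothetical choices made for illustration, not the estimates or the procedure of Cohen et al. (2003).

```python
import math

def exponential_cdf(x, rate):
    """CDF of an exponential distribution, used here for P_N."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def weibull_cdf(x, shape, scale):
    """CDF of a Weibull distribution, used here for P_y."""
    return 1.0 - math.exp(-((x / scale) ** shape)) if x > 0 else 0.0

def foc_gap(x, pi, c, h, g, rate, shape, scale):
    """Left-hand side minus right-hand side of the first-order condition (2.30)."""
    lhs = (pi * c * exponential_cdf(x, rate)
           + (1 - pi) * (g + h) * weibull_cdf(x, shape, scale))
    rhs = pi * c + (1 - pi) * h
    return lhs - rhs

def optimal_start(pi, c, h, g, rate, shape, scale, tol=1e-10):
    """Solve (2.30) by bisection; the gap is nondecreasing in x and negative at x = 0."""
    lo, hi = 0.0, 1.0
    while foc_gap(hi, pi, c, h, g, rate, shape, scale) < 0:
        hi *= 2.0  # expand the bracket until the gap changes sign
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if foc_gap(mid, pi, c, h, g, rate, shape, scale) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical parameter values, chosen only to make the example run.
params = dict(pi=0.2, c=2.0, h=3.0, g=1.0, rate=0.5, shape=2.0, scale=4.0)
y_star = optimal_start(**params)
print("optimal production start:", y_star)
print("first-order condition gap:", foc_gap(y_star, **params))
```

Raising the delay cost `g` while holding the other inputs fixed moves the solution to an earlier start date, in line with the interpretation of $g$ as the penalty for starting production too late.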
To estimate the model parameters and predict the lead time, the authors use data on soft orders, final orders, and order lead time. Empirical estimates suggest that $\hat{g} = 1.0$, $\hat{h} = 3.0$, $\hat{c} = 2.1$, indicating that holding costs are three times as large as delay costs, while cancelation costs are roughly twice as high as the delay costs. This in turn helps the manufacturer decide on the optimal start date for production, $y_\pi^*$.

2.5 EXAMPLES OF APPLICATIONS OF FORECASTS IN MACROECONOMICS AND FINANCE

Forecasts are of interest to economic agents only insofar as they can help improve their decisions, so it is useful to illustrate the importance of forecasts in the context of some simple economic decision problems. This section provides three such examples from economics and finance.

2.5.1 Central Bank's Decision Problem

Consider a central bank with an objective of targeting inflation by means of a single policy instrument, $i_t$, which could be an interest rate such as the repo rate, i.e., the rate charged on collateralized loans. Svensson (1997) sets out a simple model in which the central bank's loss function depends on the difference between the inflation rate ($y_t$) and a target inflation rate ($y^*$). Svensson shows that, conditional on having chosen a value for its instrument (the repo rate), the central bank's decision problem reduces to that of choosing a forecast that minimizes the deviation from the target. Although the forecast does not enter directly into the central bank's loss function, it does so indirectly because the actual rate of inflation (which is what the central bank really cares about) is affected by the bank's choice of interest rate, which in turn reflects the inflation forecast.
Specifically, the central bank is assumed to choose a sequence of interest rates $\{i_\tau\}_{\tau=t}^{\infty}$ to minimize a weighted sum of expected future losses,

$$E_t\left[\sum_{\tau=t}^{\infty} \lambda^{\tau-t} L(y_\tau - y^*)\right], \tag{2.31}$$

where $\lambda \in (0, 1)$ is a discount factor and $E_t[\cdot]$ denotes the conditional expectation given information available at time $t$. Both current and future deviations from target inflation affect the central bank's loss. Following Svensson's analysis, suppose the central bank has quadratic loss

$$L(y_\tau - y^*) = \tfrac{1}{2}(y_\tau - y^*)^2. \tag{2.32}$$

Future inflation rates depend on the sequence of interest rates which are chosen to minimize expected future loss and hence satisfy the condition

$$\{i_\tau^*\}_{\tau=t}^{\infty} = \arg\min_{\{i_\tau\}_{\tau=t}^{\infty}} E_t\left[\sum_{\tau=t}^{\infty} \lambda^{\tau-t} (y_\tau - y^*)^2\right]. \tag{2.33}$$

Complicating matters, inflation is not exogenous but is affected by the central bank's actions. Solving (2.33) is therefore quite difficult since current and future interest rates can be expected to affect future inflation rates. Because inflation forecasts matter only insofar as they affect the central bank's interest rate policy and hence future inflation, a model for the data-generating process for inflation is needed. Svensson proposes a tractable approach in which inflation and output are generated according to the equations¹³

$$y_{t+1} = y_t + \alpha_1 z_t + \varepsilon_{t+1}, \tag{2.34}$$
$$z_{t+1} = \beta_1 z_t - \beta_2 (i_t - y_t) + \eta_{t+1}, \tag{2.35}$$

where $z_t$ is current output relative to its potential level, and all parameters are positive, i.e., $\alpha_1, \beta_1, \beta_2 > 0$. The quantities $\varepsilon_{t+1}$ and $\eta_{t+1}$ are unpredictable shocks to inflation and output, respectively. The first equation expresses the change in inflation as a function of the lagged output, while the second equation shows that the real interest rate $(i_t - y_t)$ impacts output with a lag and also allows for autoregressive dynamics assuming $\beta_1 < 1$.
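The timing implied by (2.34) and (2.35) can be checked by iterating the two equations directly. The Python sketch below sets both shocks to zero and compares two interest rate paths that differ only in the current rate $i_t$. The parameter values, starting values, and function names are hypothetical and serve only to illustrate the mechanics.

```python
def step(y, z, i, alpha1, beta1, beta2, eps=0.0, eta=0.0):
    """One period of the model: returns (y_{t+1}, z_{t+1}) from (y_t, z_t, i_t)."""
    y_next = y + alpha1 * z + eps                    # equation (2.34)
    z_next = beta1 * z - beta2 * (i - y) + eta       # equation (2.35)
    return y_next, z_next

def inflation_path(y0, z0, rates, alpha1, beta1, beta2):
    """Inflation path implied by a sequence of interest rates, with shocks set to zero."""
    y, z = y0, z0
    path = [y]
    for i in rates:
        y, z = step(y, z, i, alpha1, beta1, beta2)
        path.append(y)
    return path

# Hypothetical parameter and starting values.
alpha1, beta1, beta2 = 0.5, 0.8, 0.4
y0, z0 = 3.0, 1.0

baseline = inflation_path(y0, z0, [4.0, 4.0], alpha1, beta1, beta2)
tighter = inflation_path(y0, z0, [5.0, 4.0], alpha1, beta1, beta2)  # i_t raised by one unit

print("change in y_{t+1}:", tighter[1] - baseline[1])
print("change in y_{t+2}:", tighter[2] - baseline[2])
```

In this sketch a one-unit increase in $i_t$ leaves $y_{t+1}$ unchanged and moves $y_{t+2}$ by $-\alpha_1 \beta_2$, because the interest rate first has to work through output before it reaches inflation.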
Using these equations to solve for inflation two periods ahead, we obtain the following equation:

$$y_{t+2} = (1 + \alpha_1 \beta_2) y_t + \alpha_1 (1 + \beta_1) z_t - \alpha_1 \beta_2 i_t + \varepsilon_{t+1} + \alpha_1 \eta_{t+1} + \varepsilon_{t+2}. \tag{2.36}$$

Notice that the policy instrument ($i$) impacts the target variable ($y$) with a two-period delay. Moreover, each interest rate affects one future inflation rate and so a solution to the infinite sum in (2.33) reduces to choosing $i_t$ to target $y_{t+2}$, choosing $i_{t+1}$ to target $y_{t+3}$, etc. Hence, the central bank's objective in setting the current interest rate, $i_t$, simplifies to

$$\min_{i_t} E_t\left[\lambda^2 (y_{t+2} - y^*)^2\right].$$

¹³ We have simplified Svensson's model by omitting an additional exogenous variable.

Using the quadratic loss function in (2.32), the first-order condition becomes

$$E_t\left[\frac{\partial L(y_{t+2} - y^*)}{\partial i_t}\right] = E_t\left[(y_{t+2} - y^*) \frac{\partial y_{t+2}}{\partial i_t}\right] = 0. \tag{2.37}$$

From (2.36) this means choosing $i_t$ so that $E_t[y_{t+2}] = y^*$, which can be accomplished by setting

$$i_t^* = y_t + \frac{(y_t - y^*) + \alpha_1 (1 + \beta_1) z_t}{\alpha_1 \beta_2}. \tag{2.38}$$

It follows that the optimal current interest rate, $i_t^*$, should be higher, the higher the current inflation rate as well as the higher the output relative to its potential, i.e., the lower the output gap. Under this choice of interest rate level, the argument in the loss function reduces to $y_{t+2} - y^* = (\varepsilon_{t+1} + \alpha_1 \eta_{t+1} + \varepsilon_{t+2})$.

This is just an example of certainty equivalence, which relies heavily on the chosen squared error loss function in (2.32). If the original loss function did not have a first-order condition (2.37) that is linear in inflation, then the solution would not be so simple and the expected loss would not be a straightforward functio