Pagina principale
Python for Probability, Statistics, and Machine Learning
Python for Probability, Statistics, and Machine Learning
José Unpingco
This textbook, fully updated to feature Python version 3.7, covers the key ideas that link probability, statistics, and machine learning illustrated using Python modules. The entire text, including all the figures and numerical results, is reproducible using the Python codes and their associated Jupyter/IPython notebooks, which are provided as supplementary downloads. The author develops key intuitions in machine learning by working meaningful examples using multiple analytical methods and Python codes, thereby connecting theoretical concepts to concrete implementations. The update features full coverage of Webbased scientific visualization with Bokeh Jupyter Hub; Fisher Exact, Cohen’s D and RankSum Tests; Local Regression, Spline, and Additive Methods; and Survival Analysis, Stochastic Gradient Trees, and Neural Networks and Deep Learning. Modern Python modules like Pandas, Sympy, and Scikitlearn are applied to simulate and visualize important machine learning concepts like the bias/variance tradeoff, crossvalidation, and regularization. Many abstract mathematical ideas, such as convergence in probability theory, are developed and illustrated with numerical examples. This book is suitable for classes in probability, statistics, or machine learning and requires only rudimentary knowledge of Python programming.
Categories:
Anno:
2019
Edizione:
2
Editore:
Springer
Lingua:
english
Pagine:
395
ISBN 10:
3030185443
ISBN 13:
9783030185442
File:
PDF, 11.11 MB
Download (pdf, 11.11 MB)
 Open in Browser
 Checking other formats...
 Please login to your account first

Need help? Please read our short guide how to send a book to Kindle.
The file will be sent to your email address. It may take up to 15 minutes before you receive it.
The file will be sent to your Kindle account. It may takes up to 15 minutes before you received it.
Please note you need to add our email km0@bookmail.org to approved email addresses. Read more.
Please note you need to add our email km0@bookmail.org to approved email addresses. Read more.
You may be interested in Powered by Rec2Me
Most frequently terms
probability^{411}
fig^{337}
python^{232}
import^{212}
numpy^{171}
compute^{159}
array^{157}
corresponding^{153}
regression^{151}
variables^{150}
samples^{145}
statistics^{140}
density^{139}
variable^{135}
matrix^{132}
estimate^{128}
likelihood^{127}
estimator^{126}
sample^{122}
variance^{116}
linear^{103}
sympy^{101}
output^{101}
interval^{92}
kernel^{90}
hypothesis^{90}
expectation^{84}
random variables^{84}
input^{82}
convergence^{81}
algorithm^{78}
parameters^{77}
maximum likelihood^{76}
programming^{76}
exp^{74}
vector^{73}
random variable^{71}
module^{65}
distributed^{65}
dimensional^{62}
columns^{62}
parameter^{61}
panel^{59}
nan^{58}
programming tip^{57}
estimated^{57}
scikit^{55}
bias^{55}
probability density^{53}
density function^{53}
projection^{52}
optimization^{51}
gradient descent^{50}
inequality^{50}
shown in fig^{49}
scipy^{48}
jupyter^{47}
estimation^{46}
arrays^{43}
tianyuzhu2011
VERY GOOD. Very interesting book. I like it
21 September 2019 (17:47)
You can write a book review and share your experiences. Other readers will always be interested in your opinion of the books you've read. Whether you've loved the book or not, if you give your honest and detailed thoughts then people will find new books that are right for them.
1

2

José Unpingco Python for Probability, Statistics, and Machine Learning Second Edition Python for Probability, Statistics, and Machine Learning José Unpingco Python for Probability, Statistics, and Machine Learning Second Edition 123 José Unpingco San Diego, CA, USA ISBN 9783030185442 ISBN 9783030185459 https://doi.org/10.1007/9783030185459 (eBook) 1st edition: © Springer International Publishing Switzerland 2016 2nd edition: © Springer Nature Switzerland AG 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, speciﬁcally the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microﬁlms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a speciﬁc statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional afﬁliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland To Irene, Nicholas, and Daniella, for all their patient support. Preface to the Second Edition This second edition is updated for Python version 3.6+. Fu; rthermore, many existing sections have been revised for clarity based on feedback from the ﬁrst version. The book is now over thirty percent larger than the original with new material about important probability distributions, including key derivations and illustrative code samples. Additional important statistical tests are included in the statistics chapter including the Fisher Exact test and the Mann–Whitney–Wilcoxon Test. A new section on survival analysis has been included. The most signiﬁcant addition is the section on deep learning for image processing with a detailed discussion of gradient descent methods that underpin all deep learning work. There is also substantial discussion regarding generalized linear models. As before, there are more Programming Tips that illustrate effective Python modules and methods for scientiﬁc programming and machine learning. There are 445 runable code blocks that have been tested for accuracy so you can try these out for yourself in your own codes. Over 158 graphical visualizations (almost all generated using Python) illustrate the concepts that are developed both in code and in mathematics. We also discuss and use key Python modules such as NumPy, Scikitlearn, SymPy, SciPy, lifelines, CVXPY, Theano, Matplotlib, Pandas, TensorFlow, StatsModels, and Keras. As with the ﬁrst edition, all of the key concepts are developed mathematically and are reproducible in Python, to provide the reader with multiple perspectives on the material. As before, this book is not designed to be exhaustive and reflects the author’s eclectic industrial background. The focus remains on concepts and fundamentals for daytoday work using Python in the most expressive way possible. vii viii Preface to the Second Edition Acknowledgements I would like to acknowledge the help of Brian Granger and Fernando Perez, two of the originators of the Jupyter Notebook, for all their great work, as well as the Python community as a whole, for all their contributions that made this book possible. Hans Petter Langtangen is the author of the Doconce [1] document preparation system that was used to write this text. Thanks to Geoffrey Poore [2] for his work with PythonTeX and LATEX, both key technologies used to produce this book. San Diego, CA, USA February 2019 José Unpingco References 1. H.P. Langtangen, DocOnce markup language, https://github.com/hplgit/doconce 2. G.M. Poore, Pythontex: reproducible documents with latex, python, and more. Comput. Sci. Discov. 8(1), 014010 (2015) Preface to the First Edition This book will teach you the fundamental concepts that underpin probability and statistics and illustrate how they relate to machine learning via the Python language and its powerful extensions. This is not a good ﬁrst book in any of these topics because we assume that you already had a decent undergraduatelevel introduction to probability and statistics. Furthermore, we also assume that you have a good grasp of the basic mechanics of the Python language itself. Having said that, this book is appropriate if you have this basic background and want to learn how to use the scientiﬁc Python toolchain to investigate these topics. On the other hand, if you are comfortable with Python, perhaps through working in another scientiﬁc ﬁeld, then this book will teach you the fundamentals of probability and statistics and how to use these ideas to interpret machine learning methods. Likewise, if you are a practicing engineer using a commercial package (e.g., MATLAB, IDL), then you will learn how to effectively use the scientiﬁc Python toolchain by reviewing concepts you are already familiar with. The most important feature of this book is that everything in it is reproducible using Python. Speciﬁcally, all of the code, all of the ﬁgures, and (most of) the text is available in the downloadable supplementary materials that correspond to this book as IPython Notebooks. IPython Notebooks are live interactive documents that allow you to change parameters, recompute plots, and generally tinker with all of the ideas and code in this book. I urge you to download these IPython Notebooks and follow along with the text to experiment with the topics covered. I guarantee doing this will boost your understanding because the IPython Notebooks allow for interactive widgets, animations, and other intuitionbuilding features that help make many of these abstract ideas concrete. As an opensource project, the entire scientiﬁc Python toolchain, including the IPython Notebook, is freely available. Having taught this material for many years, I am convinced that the only way to learn is to experiment as you go. The text provides instructions on how to get started installing and conﬁguring your scientiﬁc Python environment. ix x Preface to the First Edition This book is not designed to be exhaustive and reflects the author’s eclectic background in industry. The focus is on fundamentals and intuitions for daytoday work, especially when you must explain the results of your methods to a nontechnical audience. We have tried to use the Python language in the most expressive way possible while encouraging good Pythoncoding practices. Acknowledgements I would like to acknowledge the help of Brian Granger and Fernando Perez, two of the originators of the Jupyter/IPython Notebook, for all their great work, as well as the Python community as a whole, for all their contributions that made this book possible. Additionally, I would also like to thank Juan Carlos Chavez for his thoughtful review. Hans Petter Langtangen is the author of the Doconce [14] document preparation system that was used to write this text. Thanks to Geoffrey Poore [25] for his work with PythonTeX and LATEX. San Diego, CA, USA February 2016 José Unpingco Contents 1 Getting Started with Scientiﬁc Python . . . . . . . . . . . . . . . . 1.1 Installation and Setup . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Numpy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.1 Numpy Arrays and Memory . . . . . . . . . . . . . . 1.2.2 Numpy Matrices . . . . . . . . . . . . . . . . . . . . . . 1.2.3 Numpy Broadcasting . . . . . . . . . . . . . . . . . . . 1.2.4 Numpy Masked Arrays . . . . . . . . . . . . . . . . . . 1.2.5 FloatingPoint Numbers . . . . . . . . . . . . . . . . . 1.2.6 Numpy Optimizations and Prospectus . . . . . . . 1.3 Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.1 Alternatives to Matplotlib . . . . . . . . . . . . . . . . 1.3.2 Extensions to Matplotlib . . . . . . . . . . . . . . . . . 1.4 IPython . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5 Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6 Scipy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.1 Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.2 Dataframe . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.8 Sympy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.9 Interfacing with Compiled Libraries . . . . . . . . . . . . . . . 1.10 Integrated Development Environments . . . . . . . . . . . . . 1.11 Quick Guide to Performance and Parallel Programming 1.12 Other Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 4 6 9 10 13 13 17 17 19 20 20 22 24 25 25 27 30 32 33 34 37 38 2 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 Understanding Probability Density 2.1.2 Random Variables . . . . . . . . . . . . 2.1.3 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 39 40 41 46 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi xii Contents 2.1.4 Transformation of Variables Beyond Calculus . . . . . . 2.1.5 Independent Random Variables . . . . . . . . . . . . . . . . . 2.1.6 Classic Broken Rod Example . . . . . . . . . . . . . . . . . . 2.2 Projection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Weighted Distance . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Conditional Expectation as Projection . . . . . . . . . . . . . . . . . . 2.3.1 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Conditional Expectation and Mean Squared Error . . . . . . . . . . 2.5 Worked Examples of Conditional Expectation and Mean Square Error Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.6 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6 Useful Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.1 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.2 Multinomial Distribution . . . . . . . . . . . . . . . . . . . . . . 2.6.3 Chisquare Distribution . . . . . . . . . . . . . . . . . . . . . . . 2.6.4 Poisson and Exponential Distributions . . . . . . . . . . . . 2.6.5 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.6 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6.7 DirichletMultinomial Distribution . . . . . . . . . . . . . . . 2.7 Information Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.7.1 Information Theory Concepts . . . . . . . . . . . . . . . . . . 2.7.2 Properties of Information Entropy . . . . . . . . . . . . . . . 2.7.3 Kullback–Leibler Divergence . . . . . . . . . . . . . . . . . . 2.7.4 CrossEntropy as Maximum Likelihood . . . . . . . . . . . 2.8 Moment Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . 2.9 Monte Carlo Sampling Methods . . . . . . . . . . . . . . . . . . . . . . 2.9.1 Inverse CDF Method for Discrete Variables . . . . . . . . 2.9.2 Inverse CDF Method for Continuous Variables . . . . . 2.9.3 Rejection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.10 Sampling Importance Resampling . . . . . . . . . . . . . . . . . . . . . 2.11 Useful Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.11.1 Markov’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . 2.11.2 Chebyshev’s Inequality . . . . . . . . . . . . . . . . . . . . . . . 2.11.3 Hoeffding’s Inequality . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 51 53 55 57 58 64 65 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 69 72 75 78 79 82 83 83 84 86 89 90 91 93 95 96 98 99 100 101 104 105 107 108 113 115 115 116 118 120 Contents 3 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Python Modules for Statistics . . . . . . . . . . . . . . . 3.2.1 Scipy Statistics Module . . . . . . . . . . . . . 3.2.2 Sympy Statistics Module . . . . . . . . . . . . 3.2.3 Other Python Modules for Statistics . . . . 3.3 Types of Convergence . . . . . . . . . . . . . . . . . . . . 3.3.1 Almost Sure Convergence . . . . . . . . . . . 3.3.2 Convergence in Probability . . . . . . . . . . . 3.3.3 Convergence in Distribution . . . . . . . . . . 3.3.4 Limit Theorems . . . . . . . . . . . . . . . . . . . 3.4 Estimation Using Maximum Likelihood . . . . . . . . 3.4.1 Setting Up the CoinFlipping Experiment 3.4.2 Delta Method . . . . . . . . . . . . . . . . . . . . . 3.5 Hypothesis Testing and PValues . . . . . . . . . . . . . 3.5.1 Back to the CoinFlipping Example . . . . . 3.5.2 Receiver Operating Characteristic . . . . . . 3.5.3 PValues . . . . . . . . . . . . . . . . . . . . . . . . 3.5.4 Test Statistics . . . . . . . . . . . . . . . . . . . . . 3.5.5 Testing Multiple Hypotheses . . . . . . . . . . 3.5.6 Fisher Exact Test . . . . . . . . . . . . . . . . . . 3.6 Conﬁdence Intervals . . . . . . . . . . . . . . . . . . . . . . 3.7 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . 3.7.1 Extensions to Multiple Covariates . . . . . . 3.8 Maximum APosteriori . . . . . . . . . . . . . . . . . . . . 3.9 Robust Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 3.10 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.10.1 Parametric Bootstrap . . . . . . . . . . . . . . . 3.11 Gauss–Markov . . . . . . . . . . . . . . . . . . . . . . . . . . 3.12 Nonparametric Methods . . . . . . . . . . . . . . . . . . . 3.12.1 Kernel Density Estimation . . . . . . . . . . . 3.12.2 Kernel Smoothing . . . . . . . . . . . . . . . . . 3.12.3 Nonparametric Regression Estimators . . . 3.12.4 Nearest Neighbors Regression . . . . . . . . . 3.12.5 Kernel Regression . . . . . . . . . . . . . . . . . 3.12.6 Curse of Dimensionality . . . . . . . . . . . . . 3.12.7 Nonparametric Tests . . . . . . . . . . . . . . . . 3.13 Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . 3.13.1 Example . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 123 124 124 125 126 126 126 129 131 132 133 135 145 147 149 152 154 155 163 163 166 169 178 183 188 195 200 201 205 205 207 213 214 218 219 221 228 231 236 xiv 4 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Python Machine Learning Modules . . . . . . . . . . . . . . . . . . . 4.3 Theory of Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Introduction to Theory of Machine Learning . . . . . . 4.3.2 Theory of Generalization . . . . . . . . . . . . . . . . . . . . 4.3.3 Worked Example for Generalization/Approximation Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.4 CrossValidation . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.5 Bias and Variance . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.6 Learning Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.1 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.2 Boosting Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Boosting Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.1 Boosting Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.7 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . 4.8 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8.2 Lasso Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 4.9 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . 4.9.1 Kernel Tricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.10 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . 4.10.1 Independent Component Analysis . . . . . . . . . . . . . . 4.11 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.12 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.12.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.12.2 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.13 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.13.1 Introduction to Tensorﬂow . . . . . . . . . . . . . . . . . . . 4.13.2 Understanding Gradient Descent . . . . . . . . . . . . . . . 4.13.3 Image Processing Using Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Contents . . . . . . . . . . . . . . . . . . 237 237 237 241 244 249 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250 256 260 265 268 275 277 281 281 285 295 300 304 309 311 315 317 321 325 329 329 331 334 343 350 . . . 363 . . . 379 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383 Chapter 1 Getting Started with Scientific Python Python is fundamental to data science and machine learning, as well as an everexpanding list of areas including cybersecurity, and web programming. The fundamental reason for Python’s widespread use is that it provides the software glue that permits easy exchange of methods and data across core routines typically written in Fortran or C. Python is a language geared toward scientists and engineers who may not have formal software development training. It is used to prototype, design, simulate, and test without getting in the way because Python provides an inherently easy and incremental development cycle, interoperability with existing codes, access to a large base of reliable opensource codes, and a hierarchical compartmentalized design philosophy. Python is known for enhancing user productivity because it reduces the development time (i.e., time spent programming) and thereby increases program runtime. Python is an interpreted language. This means that Python codes run on a Python virtual machine that provides a layer of abstraction between the code and the platform it runs on, thus making codes portable across different platforms. For example, the same script that runs on a Windows laptop can also run on a Linuxbased supercomputer or on a mobile phone. This makes programming easier because the virtual machine handles the lowlevel details of implementing the business logic of the script on the underlying platform. Python is a dynamically typed language, which means that the interpreter itself figures out the representative types (e.g., floats, integers) interactively or at runtime. This is in contrast to a language like Fortran that has compilers that study the code from beginning to end, perform many compilerlevel optimizations, link intimately with the existing libraries on a specific platform, and then create an executable that is henceforth liberated from the compiler. As you may guess, the compiler’s access to the details of the underlying platform means that it can utilize optimizations that exploit chipspecific features and cache memory. Because the virtual machine abstracts away these details, it means that the Python language does not have programmable access to these kinds of optimizations. So, where is the balance between the ease © Springer Nature Switzerland AG 2019 J. Unpingco, Python for Probability, Statistics, and Machine Learning, https://doi.org/10.1007/9783030185459_1 1 2 1 Getting Started with Scientific Python of programming the virtual machine and these key numerical optimizations that are crucial for scientific work? The balance comes from Python’s native ability to bind to compiled Fortran and C libraries. This means that you can send intensive computations to compiled libraries directly from the interpreter. This approach has two primary advantages. First, it gives you the fun of programming in Python, with its expressive syntax and lack of visual clutter. This is a particular boon to scientists who typically want to use software as a tool as opposed to developing software as a product. The second advantage is that you can mixandmatch different compiled libraries from diverse research areas that were not otherwise designed to work together. This works because Python makes it easy to allocate and fill memory in the interpreter, pass it as input to compiled libraries, and then recover the output back in the interpreter. Moreover, Python provides a multiplatform solution for scientific codes. As an opensource project, Python itself is available anywhere you can build it, even though it typically comes standard nowadays, as part of many operating systems. This means that once you have written your code in Python, you can just transfer the script to another platform and run it, as long as the thirdparty compiled libraries are also available there. What if the compiled libraries are absent? Building and configuring compiled libraries across multiple systems used to be a painstaking job, but as scientific Python has matured, a wide range of libraries have now become available across all of the major platforms (i.e., Windows, MacOS, Linux, Unix) as prepackaged distributions. Finally, scientific Python facilitates maintainability of scientific codes because Python syntax is clean, free of semicolon litter and other visual distractions that makes code hard to read and easy to obfuscate. Python has many builtin testing, documentation, and development tools that ease maintenance. Scientific codes are usually written by scientists unschooled in software development, so having solid software development tools built into the language itself is a particular boon. 1.1 Installation and Setup The easiest way to get started is to download the freely available Anaconda distribution provided by Anaconda (anaconda.com), which is available for all of the major platforms. On Linux, even though most of the toolchain is available via the builtin Linux package manager, it is still better to install the Anaconda distribution because it provides its own powerful package manager (i.e., conda) that can keep track of changes in the software dependencies of the packages that it supports. Note that if you do not have administrator privileges, there is also a corresponding Miniconda distribution that does not require these privileges. Regardless of your platform, we recommend Python version 3.6 or better. You may have encountered other Python variants on the web, such as IronPython (Python implemented in C#) and Jython (Python implemented in Java). In this text, we focus on the C implementation of Python (i.e., known as CPython), which is, by 1.1 Installation and Setup 3 far, the most popular implementation. These other Python variants permit specialized, native interaction with libraries in C# or Java (respectively), which is still possible (but clunky) using CPython. Even more Python variants exist that implement the lowlevel machinery of Python differently for various reasons, beyond interacting with native libraries in other languages. Most notable of these is Pypy that implements a justintime compiler (JIT) and other powerful optimizations that can substantially speed up pure Python codes. The downside of Pypy is that its coverage of some popular scientific modules (e.g., Matplotlib, Scipy) is limited or nonexistent which means that you cannot use those modules in code meant for Pypy. If you want to install a Python module that is not available via the conda manager, the pip installer is available. This installer is the main one used outside of the scientific computing community. The key difference between the two installer is that conda implements a satisfiability solver that checks for conflicts in versions among and between installed packages. This can result in conda decreasing versions of certain packages to accommodate proposed package installation. The pip installer does not check for such conflicts checks only if the proposed package already has its dependencies installed and will install them if not or remove existing incompatible modules. The following command line uses pip to install the given Python module, Terminal> pip install package_name The pip installer will download the package you want and its dependencies and install them in the existing directory tree. This works beautifully in the case where the package in question is purePython, without any systemspecific dependencies. Otherwise, this can be a real nightmare, especially on Windows, which lacks freely available Fortran compilers. If the module in question is a C library, one way to cope is to install the freely available Visual Studio Community Edition, which usually has enough to compile many Ccodes. This platform dependency is the problem that conda was designed to solve by making the binary dependencies of the various platforms available instead of attempting to compile them. On a Windows system, if you installed Anaconda and registered it as the default Python installation (it asks during the install process), then you can use the highquality Python wheel files on Christoph Gohlke’s laboratory site at the University of California, Irvine where he kindly makes a long list of scientific modules available.1 Failing this, you can try the condaforge site, which is a communitypowered repository of modules that conda is capable of installing, but which are not formally supported by Anaconda. Note that condaforge allows you to share scientific Python configurations with your remote colleagues using authentication so that you can be sure that you are downloading and running code from users you trust. Again, if you are on Windows, and none of the above works, then you may want to consider installing a full virtual machine solution, as provided by VMWare’s Player or Oracle’s VirtualBox (both freely available under liberal terms), or with 1 Wheel files are a Python distribution format that you download and install using pip as in pip install file.whl. Christoph names files according to Python version (e.g., cp27 means Python 2.7) and chipset (e.g., amd32 vs. Intel win32). 4 1 Getting Started with Scientific Python the Windows subsystem for Linux (WSL) that is built into Windows 10. Using either of these, you can set up a Linux machine running on top of Windows, which should cure these problems entirely! The great part of this approach is that you can share directories between the virtual machine and the Windows system so that you don’t have to maintain duplicate data files. Anaconda Linux images are also available on the cloud by Platform as a Service (PaaS) providers like Amazon Web Services and Microsoft Azure. Note that for the vast majority of users, especially newcomers to Python, the Anaconda distribution should be more than enough on any platform. It is just worth highlighting the Windowsspecific issues and associated workarounds early on. Note that there are other wellmaintained scientific Python Windows installers like WinPython and PythonXY. These provide the spyder integrated development environment, which is very MATLABlike environment for transitioning MATLAB users. 1.2 Numpy As we touched upon earlier, to use a compiled scientific library, the memory allocated in the Python interpreter must somehow reach this library as input. Furthermore, the output from these libraries must likewise return to the Python interpreter. This twoway exchange of memory is essentially the core function of the Numpy (numerical arrays in Python) module. Numpy is the de facto standard for numerical arrays in Python. It arose as an effort by Travis Oliphant and others to unify the preexisting numerical arrays in Python. In this section, we provide an overview and some tips for using Numpy effectively, but for much more detail, Travis’ freely available book [1] is a great place to start. Numpy provides specification of bytesized arrays in Python. For example, below we create an array of three numbers, each of 4 bytes long (32bits at 8bits per byte) as shown by the itemsize property. The first line imports Numpy as np, which is the recommended convention. The next line creates an array of 32bit floatingpoint numbers. The itemize property shows the number of bytes per item. >>> import numpy as np # recommended convention >>> x = np.array([1,2,3],dtype=np.float32) >>> x array([1., 2., 3.], dtype=float32) >>> x.itemsize 4 In addition to providing uniform containers for numbers, Numpy provides a comprehensive set of universal functions (i.e., ufuncs) that process arrays elementwise without additional looping semantics. Below, we show how to compute the elementwise sine using Numpy, 1.2 Numpy 5 >>> np.sin(np.array([1,2,3],dtype=np.float32) ) array([0.84147096, 0.9092974 , 0.14112 ], dtype=float32) This computes the sine of the input array [1,2,3], using Numpy’s unary function, np.sin. There is another sine function in the builtin math module, but the Numpy version is faster because it does not require explicit looping (i.e., using a for loop) over each of the elements in the array. That looping happens in the compiled np.sin function itself. Otherwise, we would have to do looping explicitly as in the following: >>> from math import sin >>> [sin(i) for i in [1,2,3]] # list comprehension [0.8414709848078965, 0.9092974268256817, 0.1411200080598672] Numpy uses commonsense casting rules to resolve the output types. For example, if the inputs had been an integertype, the output would still have been a floatingpoint type. In this example, we provided a Numpy array as input to the sine function. We could have also used a plain Python list instead and Numpy would have built the intermediate Numpy array (e.g., np.sin([1,1,1])). The Numpy documentation provides a comprehensive (and very long) list of available ufuncs. Numpy arrays come in many dimensions. For example, the following shows a twodimensional 2x3 array constructed from two conforming Python lists. >>> x=np.array([ [1,2,3],[4,5,6] ]) >>> x.shape (2, 3) Note that Numpy is limited to 32 dimensions unless you build it for more.2 Numpy arrays follow the usual Python slicing rules in multiple dimensions as shown below where the : colon character selects all elements along a particular axis. >>> x=np.array([ >>> x[:,0] # 0th array([1, 4]) >>> x[:,1] # 1st array([2, 5]) >>> x[0,:] # 0th array([1, 2, 3]) >>> x[1,:] # 1st array([4, 5, 6]) [1,2,3],[4,5,6] ]) column column row row You can also select subsections of arrays by using slicing as shown below >>> x=np.array([ [1,2,3],[4,5,6] ]) >>> x array([[1, 2, 3], [4, 5, 6]]) 2 See arrayobject.h in the Numpy source code. 6 1 Getting Started with Scientific Python >>> x[:,1:] # all rows, 1st thru last column array([[2, 3], [5, 6]]) >>> x[:,::2] # all rows, every other column array([[1, 3], [4, 6]]) >>> x[:,::1] # reverse order of columns array([[3, 2, 1], [6, 5, 4]]) 1.2.1 Numpy Arrays and Memory Some interpreted languages implicitly allocate memory. For example, in MATLAB, you can extend a matrix by simply tacking on another dimension as in the following MATLAB session: >> x=ones(3,3) x = 1 1 1 1 1 1 1 1 1 >> x(:,4)=ones(3,1) % tack on extra dimension x = 1 1 1 1 1 1 1 1 1 1 1 1 >> size(x) ans = 3 4 This works because MATLAB arrays use passbyvalue semantics so that slice operations actually copy parts of the array as needed. By contrast, Numpy uses passbyreference semantics so that slice operations are views into the array without implicit copying. This is particularly helpful with large arrays that already strain available memory. In Numpy terminology, slicing creates views (no copying) and advanced indexing creates copies. Let’s start with advanced indexing. If the indexing object (i.e., the item between the brackets) is a nontuple sequence object, another Numpy array (of type integer or boolean), or a tuple with at least one sequence object or Numpy array, then indexing creates copies. For the above example, to accomplish the same array extension in Numpy, you have to do something like the following: >>> x = np.ones((3,3)) >>> x array([[1., 1., 1.], [1., 1., 1.], 1.2 Numpy 7 [1., 1., 1.]]) >>> x[:,[0,1,2,2]] # notice duplicated last dimension array([[1., 1., 1., 1.], [1., 1., 1., 1.], [1., 1., 1., 1.]]) >>> y=x[:,[0,1,2,2]] # same as above, but do assign it to y Because of advanced indexing, the variable y has its own memory because the relevant parts of x were copied. To prove it, we assign a new element to x and see that y is not updated. >>> x[0,0]=999 # change element in x >>> x # changed array([[999., 1., 1.], [ 1., 1., 1.], [ 1., 1., 1.]]) >>> y # not changed! array([[1., 1., 1., 1.], [1., 1., 1., 1.], [1., 1., 1., 1.]]) However, if we start over and construct y by slicing (which makes it a view) as shown below, then the change we made does affect y because a view is just a window into the same memory. >>> x = np.ones((3,3)) >>> y = x[:2,:2] # view of upper left piece >>> x[0,0] = 999 # change value >>> x array([[999., 1., 1.], [ 1., 1., 1.], [ 1., 1., 1.]]) >>> y array([[999., 1.], [ 1., 1.]]) Note that if you want to explicitly force a copy without any indexing tricks, you can do y=x.copy(). The code below works through another example of advanced indexing versus slicing. >>> x = np.arange(5) # create array >>> x array([0, 1, 2, 3, 4]) >>> y=x[[0,1,2]] # index by integer list to force copy >>> y array([0, 1, 2]) >>> z=x[:3] # slice creates view 8 1 Getting Started with Scientific Python >>> z # note y and z have same entries array([0, 1, 2]) >>> x[0]=999 # change element of x >>> x array([999, 1, 2, 3, 4]) >>> y # note y is unaffected, array([0, 1, 2]) >>> z # but z is (it's a view). array([999, 1, 2]) In this example, y is a copy, not a view, because it was created using advanced indexing whereas z was created using slicing. Thus, even though y and z have the same entries, only z is affected by changes to x. Note that the flags property of Numpy arrays can help sort this out until you get used to it. Manipulating memory using views is particularly powerful for signal and image processing algorithms that require overlapping fragments of memory. The following is an example of how to use advanced Numpy to create overlapping blocks that do not actually consume additional memory, >>> from numpy.lib.stride_tricks import as_strided >>> x = np.arange(16,dtype=np.int64) >>> y=as_strided(x,(7,4),(16,8)) # overlapped entries >>> y array([[ 0, 1, 2, 3], [ 2, 3, 4, 5], [ 4, 5, 6, 7], [ 6, 7, 8, 9], [ 8, 9, 10, 11], [10, 11, 12, 13], [12, 13, 14, 15]]) The above code creates a range of integers and then overlaps the entries to create a 7x4 Numpy array. The final argument in the as_strided function are the strides, which are the steps in bytes to move in the row and column dimensions, respectively. Thus, the resulting array steps eight bytes in the column dimension and sixteen bytes in the row dimension. Because the integer elements in the Numpy array are eight bytes, this is equivalent to moving by one element in the column dimension and by two elements in the row dimension. The second row in the Numpy array starts at sixteen bytes (two elements) from the first entry (i.e., 2) and then proceeds by eight bytes (by one element) in the column dimension (i.e., 2,3,4,5). The important part is that memory is reused in the resulting 7x4 Numpy array. The code below demonstrates this by reassigning elements in the original x array. The changes show up in the y array because they point at the same allocated memory. >>> x[::2]=99 # assign every other value >>> x array([99, 1, 99, 3, 99, 5, 99, 7, 99, 9, 99, 11, 99, 13, 99, 15]) 1.2 Numpy 9 >>> y # the changes appear because y is a view array([[99, 1, 99, 3], [99, 3, 99, 5], [99, 5, 99, 7], [99, 7, 99, 9], [99, 9, 99, 11], [99, 11, 99, 13], [99, 13, 99, 15]]) Bear in mind that as_strided does not check that you stay within memory block bounds. So, if the size of the target matrix is not filled by the available data, the remaining elements will come from whatever bytes are at that memory location. In other words, there is no default filling by zeros or other strategy that defends memory block bounds. One defense is to explicitly control the dimensions as in the following code: >>> n = 8 # number of elements >>> x = np.arange(n) # create array >>> k = 5 # desired number of rows >>> y = as_strided(x,(k,nk+1),(x.itemsize,)*2) >>> y array([[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]]) 1.2.2 Numpy Matrices Matrices in Numpy are similar to Numpy arrays but they can only have two dimensions. They implement row–column matrix multiplication as opposed to elementwise multiplication. If you have two matrices you want to multiply, you can either create them directly or convert them from Numpy arrays. For example, the following shows how to create two matrices and multiply them. >>> import numpy as np >>> A=np.matrix([[1,2,3],[4,5,6],[7,8,9]]) >>> x=np.matrix([[1],[0],[0]]) >>> A*x matrix([[1], [4], [7]]) This can also be done using arrays as shown below 10 1 Getting Started with Scientific Python >>> A=np.array([[1,2,3],[4,5,6],[7,8,9]]) >>> x=np.array([[1],[0],[0]]) >>> A.dot(x) array([[1], [4], [7]]) Numpy arrays support elementwise multiplication, not row–column multiplication. You must use Numpy matrices for this kind of multiplication unless use the inner product np.dot, which also works in multiple dimensions (see np.tensordot for more general dot products). Note that Python 3.x has a new @ notation for matrix multiplication so we can redo the last calculation as follows: >>> A @ x array([[1], [4], [7]]) It is unnecessary to cast all multiplicands to matrices for multiplication. In the next example, everything until last line is a Numpy array and thereafter we cast the array as a matrix with np.matrix which then uses row–column multiplication. Note that it is unnecessary to cast the x variable as a matrix because the lefttoright order of the evaluation takes care of that automatically. If we need to use A as a matrix elsewhere in the code then we should bind it to another variable instead of recasting it every time. If you find yourself casting back and forth for large arrays, passing the copy=False flag to matrix avoids the expense of making a copy. >>> A=np.ones((3,3)) >>> type(A) # array not matrix <class 'numpy.ndarray'> >>> x=np.ones((3,1)) # array not matrix >>> A*x array([[1., 1., 1.], [1., 1., 1.], [1., 1., 1.]]) >>> np.matrix(A)*x # rowcolumn multiplication matrix([[3.], [3.], [3.]]) 1.2.3 Numpy Broadcasting Numpy broadcasting is a powerful way to make implicit multidimensional grids for expressions. It is probably the single most powerful feature of Numpy and the most difficult to grasp. Proceeding by example, consider the vertices of a twodimensional unit square as shown below 1.2 Numpy 11 >>> X,Y=np.meshgrid(np.arange(2),np.arange(2)) >>> X array([[0, 1], [0, 1]]) >>> Y array([[0, 0], [1, 1]]) Numpy’s meshgrid creates twodimensional grids. The X and Y arrays have corresponding entries match the coordinates of the vertices of the unit square (e.g., (0, 0), (0, 1), (1, 0), (1, 1)). To add the x and ycoordinates, we could use X and Y as in X+Y shown below, The output is the sum of the vertex coordinates of the unit square. >>> X+Y array([[0, 1], [1, 2]]) Because the two arrays have compatible shapes, they can be added together elementwise. It turns out we can skip a step here and not bother with meshgrid to implicitly obtain the vertex coordinates by using broadcasting as shown below >>> x = np.array([0,1]) >>> y = np.array([0,1]) >>> x array([0, 1]) >>> y array([0, 1]) >>> x + y[:,None] # add broadcast dimension array([[0, 1], [1, 2]]) >>> X+Y array([[0, 1], [1, 2]]) On line 7 the None Python singleton tells Numpy to make copies of y along this dimension to create a conformable calculation. Note that np.newaxis can be used instead of None to be more explicit. The following lines show that we obtain the same output as when we used the X+Y Numpy arrays. Note that without broadcasting x+y=array([0, 2]) which is not what we are trying to compute. Let’s continue with a more complicated example where we have differing array shapes. >>> x = np.array([0,1]) >>> y = np.array([0,1,2]) >>> X,Y = np.meshgrid(x,y) >>> X array([[0, 1], 12 1 Getting Started with Scientific Python [0, 1], [0, 1]]) >>> Y array([[0, 0], [1, 1], [2, 2]]) >>> X+Y array([[0, 1], [1, 2], [2, 3]]) >>> x+y[:,None] # same as with meshgrid array([[0, 1], [1, 2], [2, 3]]) In this example, the array shapes are different, so the addition of x and y is not possible without Numpy broadcasting. The last line shows that broadcasting generates the same output as using the compatible array generated by meshgrid. This shows that broadcasting works with different array shapes. For the sake of comparison, on line 3, meshgrid creates two conformable arrays, X and Y. On the last line, x+y[:,None] produces the same output as X+Y without the meshgrid. We can also put the None dimension on the x array as x[:,None]+y which would give the transpose of the result. Broadcasting works in multiple dimensions also. The output shown has shape (4,3,2). On the last line, the x+y[:,None] produces a twodimensional array which is then broadcast against z[:,None,None], which duplicates itself along the two added dimensions to accommodate the twodimensional result on its left (i.e., x + y[:,None]). The caveat about broadcasting is that it can potentially create large, memoryconsuming, intermediate arrays. There are methods for controlling this by reusing previously allocated memory but that is beyond our scope here. Formulas in physics that evaluate functions on the vertices of high dimensional grids are great usecases for broadcasting. >>> x = np.array([0,1]) >>> y = np.array([0,1,2]) >>> z = np.array([0,1,2,3]) >>> x+y[:,None]+z[:,None,None] array([[[0, 1], [1, 2], [2, 3]], [[1, 2], [2, 3], [3, 4]], 1.2 Numpy 13 [[2, 3], [3, 4], [4, 5]], [[3, 4], [4, 5], [5, 6]]]) 1.2.4 Numpy Masked Arrays Numpy provides a powerful method to temporarily hide array elements without changing the shape of the array itself, >>> from numpy import ma # import masked arrays >>> x = np.arange(10) >>> y = ma.masked_array(x, x<5) >>> print (y) [     5 6 7 8 9] >>> print (y.shape) (10,) Note that the elements in the array for which the logical condition (x<5) is true are masked, but the size of the array remains the same. This is particularly useful in plotting categorical data, where you may only want those values that correspond to a given category for part of the plot. Another common use is for image processing, wherein parts of the image may need to be excluded from subsequent processing. Note that creating a masked array does not force an implicit copy operation unless copy=True argument is used. For example, changing an element in x does change the corresponding element in y, even though y is a masked array, >>> >>> [ 0 >>> [ x[1] = 99 # change this print(x) 1 2 3 4 5 6 7 8 99] print(y)# masked array changed!     5 6 7 8 99] 1.2.5 FloatingPoint Numbers There are precision limitations when representing floatingpoint numbers on a computer with finite memory. For example, the following shows these limitations when adding two simple numbers, >>> 0.1 + 0.2 0.30000000000000004 14 1 Getting Started with Scientific Python So, then, why is the output not 0.3? The issue is the floatingpoint representation of the two numbers and the algorithm that adds them. To represent an integer in binary, we just write it out in powers of 2. For example, 230 = (11100110)2 . Python can do this conversion using string formatting, >>> print('{0:b}'.format(230)) 11100110 To add integers, we just add up the corresponding bits and fit them into the allowable number of bits. Unless there is an overflow (the results cannot be represented with that number of bits), then there is no problem. Representing floating point is trickier because we have to represent these numbers as binary fractions. The IEEE 754 standard requires that floatingpoint numbers be represented as ±C × 2 E where C is the significand (mantissa) and E is the exponent. To represent a regular decimal fraction as binary fraction, we need to compute the expansion of the fraction in the following form a1 /2 + a2 /22 + a3 /23 ... In other words, we need to find the ai coefficients. We can do this using the same process we would use for a decimal fraction: just keep dividing by the fractional powers of 1/2 and keep track of the whole and fractional parts. Python’s divmod function can do most of the work for this. For example, to represent 0.125 as a binary fraction, >>> a = 0.125 >>> divmod(a*2,1) (0.0, 0.25) The first item in the tuple is the quotient and the other is the remainder. If the quotient was greater than 1, then the corresponding ai term is one and is zero otherwise. For this example, we have a1 = 0. To get the next term in the expansion, we just keep multiplying by 2 which moves us rightward along the expansion to ai+1 and so on. Then, >>> >>> >>> 0.0 >>> >>> 0.0 >>> >>> 1.0 a = 0.125 q,a = divmod(a*2,1) print (q,a) 0.25 q,a = divmod(a*2,1) print (q,a) 0.5 q,a = divmod(a*2,1) print (q,a) 0.0 The algorithm stops when the remainder term is zero. Thus, we have that 0.125 = (0.001)2 . The specification requires that the leading term in the expansion be one. Thus, we have 0.125 = (1.000) × 2−3 . This means the significand is 1 and the exponent is 3. Now, let’s get back to our main problem 0.1+0.2 by developing the representation 0.1 by coding up the individual steps above. 1.2 Numpy 15 >>> a = 0.1 >>> bits = [] >>> while a>0: ... q,a = divmod(a*2,1) ... bits.append(q) ... >>> print (''.join(['%d'%i for i in bits])) 0001100110011001100110011001100110011001100110011001101 Note that the representation has an infinitely repeating pattern. This means that we have (1.1001)2 × 2−4 . The IEEE standard does not have a way to represent infinitely repeating sequences. Nonetheless, we can compute this, ∞ n=1 1 1 3 + 4n = 24n−3 2 5 Thus, 0.1 ≈ 1.6 × 2−4 . Per the IEEE 754 standard, for float type, we have 24bits for the significand and 23bits for the fractional part. Because we cannot represent the infinitely repeating sequence, we have to round off at 23bits, 10011001100110011001101. Thus, whereas the significand’s representation used to be 1.6, with this rounding, it is Now >>> b = '10011001100110011001101' >>> 1+sum([int(i)/(2**n) for n,i in enumerate(b,1)]) 1.600000023841858 Thus, we now have 0.1 ≈ 1.600000023841858×2−4 = 0.10000000149011612. For the 0.2 expansion, we have the same repeating sequence with a different exponent, so that we have 0.2 ≈ 1.600000023841858 × 2−3 = 0.20000000298023224. To add 0.1+0.2 in binary, we must adjust the exponents until they match the higher of the two. Thus, 0.11001100110011001100110 +1.10011001100110011001101 10.01100110011001100110011 Now, the sum has to be scaled back to fit into the significand’s available bits so the result is 1.00110011001100110011010 with exponent 2. Computing this in the usual way as shown below gives the result. >>> k='00110011001100110011010' >>> print('%0.12f'%((1+sum([int(i)/(2**n) ... for n,i in enumerate(k,1)]))/2**2)) 0.300000011921 which matches what we get with numpy 16 1 Getting Started with Scientific Python >>> import numpy as np >>> print('%0.12f'%(np.float32(0.1) + np.float32(0.2))) 0.300000011921 The entire process proceeds the same for 64bit floats. Python has a fractions and decimal modules that allow more exact number representations. The decimal module is particularly important for certain financial computations. Roundoff Error. Let’s consider the example of adding 100,000,000 and 10 in 32bit floating point. >>> print('{0:b}'.format(100000000)) 101111101011110000100000000 This means that 100, 000, 000 = (1.01111101011110000100000000)2 × 226 . Likewise, 10 = (1.010)2 × 23 . To add these we have to make the exponents match as in the following, 1.01111101011110000100000000 +0.00000000000000000000001010 1.01111101011110000100001010 Now, we have to round off because we only have 23 bits to the right of the decimal point and obtain 1.0111110101111000010000, thus losing the trailing 10 bits. This effectively makes the decimal 10 = (1010)2 we started out with become 8 = (1000)2 . Thus, using Numpy again, >>> print(format(np.float32(100000000) + np.float32(10),'10.3f')) 100000008.000 The problem here is that the order of magnitude between the two numbers was so great that it resulted in loss in the significand’s bits as the smaller number was rightshifted. When summing numbers like these, the Kahan summation algorithm (see math.fsum()) can effectively manage these roundoff errors. >>> import math >>> math.fsum([np.float32(100000000),np.float32(10)]) 100000010.0 Cancelation Error. Cancelation error (loss of significance) results when two nearly equal floatingpoint numbers are subtracted. Let’s consider subtracting 0.1111112 and 0.1111111. As binary fractions, we have the following, 1.11000111000111001000101 E4 1.11000111000111000110111 E4 0.00000000000000000011100 1.2 Numpy 17 As a binary fraction, this is 1.11 with exponent 23 or (1.75)10 × 2−23 ≈ 0.00000010430812836. In Numpy, this loss of precision is shown in the following: >>> print(format(np.float32(0.1111112)np.float32(0.1111111),'1.17f')) 0.00000010430812836 To sum up, when using floating point, you must check for approximate equality using something like Numpy allclose instead of the usual Python equality (i.e., ==) sign. This enforces error bounds instead of strict equality. Whenever practicable, use fixed scaling to employ integer values instead of decimal fractions. Double precision 64bit floatingpoint numbers are much better than single precision and, while not eliminating these problems, effectively kicks the can down the road for all but the strictest precision requirements. The Kahan algorithm is effective for summing floating point numbers across very large data without accruing roundoff errors. To minimize cancelation errors, refactor the calculation to avoid subtracting two nearly equal numbers. 1.2.6 Numpy Optimizations and Prospectus The scientific Python community continues to push the frontier of scientific computing. Several important extensions to Numpy are under active development. First, Numba is a compiler that generates optimized machine code from purePython code using the LLVM compiler infrastructure. LLVM started as a research project at the University of Illinois to provide a targetindependent compilation strategy for arbitrary programming languages and is now a wellestablished technology. The combination of LLVM and Python via Numba means that accelerating a block of Python code can be as easy as putting a @numba.jit decorator above the function definition, but this doesn’t work for all situations. Numba can target general graphics processing units (GPGPUs) also. The Dask project contains dask.array extensions for manipulating very large datasets that are too big to fit in a single computer’s RAM (i.e., out of core) using Numpy semantics. Furthermore, dask includes extensions for Pandas dataframes (see Sect. 1.7). Roughly speaking, this means that dask understands how to unpack Python expressions and translate them for a variety of distributed backend data services upon which the computing takes place. This means that dask separates the expression of the computation from the particular implementation on a given backend. 1.3 Matplotlib Matplotlib is the primary visualization tool for scientific graphics in Python. Like all great opensource projects, it originated to satisfy a personal need. At the time of its inception, John Hunter primarily used MATLAB for scientific visualization, but as he began to integrate data from disparate sources using Python, he realized he needed 18 1 Getting Started with Scientific Python a Python solution for visualization, so he singlehandedly wrote Matplotlib. Since those early years, Matplotlib has displaced the other competing methods for twodimensional scientific visualization and today is a very actively maintained project, even without John Hunter, who sadly passed away in 2012. John had a few basic requirements for Matplotlib: • Plots should look publication quality with beautiful text. • Plots should output Postscript for inclusion within LATEX documents and publication quality printing. • Plots should be embeddable in a graphical user interface (GUI) for application development. • The code should be mostly Python to allow for users to become developers. • Plots should be easy to make with just a few lines of code for simple graphs. Each of these requirements has been completely satisfied and Matplotlib’s capabilities have grown far beyond these requirements. In the beginning, to ease the transition from MATLAB to Python, many of the Matplotlib functions were closely named after the corresponding MATLAB commands. The community has moved away from this style and, even though you may still find the old MATLABesque style used in the online Matplotlib documentation. The following shows the quickest way to draw a plot using Matplotlib and the plain Python interpreter. Later, we’ll see how to do this even faster using IPython. The first line imports the requisite module as plt which is the recommended convention. The next line plots a sequence of numbers generated using Python’s range object. Note the output list contains a Line2D object. This is an artist in Matplotlib parlance. Finally, the plt.show() function draws the plot in a GUI figure window. import matplotlib.pyplot as plt plt.plot(range(10)) plt.show() # unnecessary in IPython (discussed later) If you try this in your own plain Python interpreter (and you should!), you will see that you cannot type in anything further in the interpreter until the figure window (i.e., something like Fig. 1.1) is closed. This is because the plt.show() function preoccupies the interpreter with the controls in the GUI and blocks further interaction. As we discuss below, IPython provides ways to get around this blocking so you can simultaneously interact with the interpreter and the figure window.3 As shown in Fig. 1.1, the plot function returns a list containing the Line2D object. More complicated plots yield larger lists filled with artists. The terminology is that artists draw on the canvas contained in the Matplotlib figure. The final line is the plt.show function that provokes the embedded artists to render on the Matplotlib canvas. The reason this is a separate function is that plots may have dozens of complicated artists and rendering may be a timeconsuming task to only be undertaken at 3 You can also do this in the plain Python interpreter by doing import interactive(True). matplotlib;matplotlib. 1.3 Matplotlib 19 Fig. 1.1 The Matplotlib figure window. The icons on the bottom allow some limited plotediting tools the end, when all the artists have been mustered. Matplotlib supports plotting images, contours, and many others that we cover in detail in the following chapters. Even though this is the quickest way to draw a plot in Matplotlib, it is not recommended because there are no handles to the intermediate products of the plot such as the plot’s axis. While this is okay for a simple plot like this, later on we will see how to construct complicated plots using the recommended method. One of the best ways to get started with Matplotlib is to browse the extensive online gallery of plots on the main Matplotlib site. Each plot comes with corresponding source code that you can use as a starting point for your own plots. In Sect. 1.4, we discuss special magic commands that make this particularly easy. The annual John Hunter: Excellence in Plotting Contest provides fantastic, compelling examples of scientific visualizations that are possible using Matplotlib. 1.3.1 Alternatives to Matplotlib Even though Matplotlib is the most complete option for scriptbased plotting, there are some alternatives for specialized scientific graphics that may be of interest. 20 1 Getting Started with Scientific Python If you require realtime data display and tools for volumetric data rendering and complicated 3D meshes with isosurfaces, then PyQtGraph is an option. PyQtGraph is a purePython graphics and GUI library that depends on Python bindings for the Qt GUI library (i.e., PySide or PyQt4) and Numpy. This means that the PyQtGraph relies on these other libraries (especially Qt’s GraphicsView framework) for the heavyduty number crunching and rendering. This package is actively maintained, with solid documentation. You also need to grasp a few QtGUI development concepts to use this effectively. An alternative that comes from the R community is ggplot which is a Python port of the ggplot2 package that is fundamental to statistical graphics in R. From the Python standpoint, the main advantage of ggplot is the tight integration with the Pandas dataframe, which makes it easy to draw beautifully formatted statistical graphs. The downside of this package is that it applies unPythonic semantics based on the Grammar of Graphics [2], which is nonetheless a wellthoughtout method for articulating complicated graphs. Of course, because there are twoway bridges between Python and R via the R2Py module (among others), it is workable to send Numpy arrays to R for native ggplot2 rendering and then retrieve the socomputed graphic back into Python. This is a workflow that is lubricated by the Jupyter Notebook (see below) via the rmagic extension. Thus, it is quite possible to get the best of both worlds via the Jupyter Notebook and this kind of multilanguage workflow is quite common in data analysis communities. 1.3.2 Extensions to Matplotlib Initially, to encourage adoption of Matplotlib from MATLAB, many of the graphical sensibilities were adopted from MATLAB to preserve the look and feel for transitioning users. Modern sensibilities and prettier default plots are possible because Matplotlib provides the ability to drill down and tweak every element on the canvas. However, this can be tedious to do and several alternatives offer relief. For statistical plots, the first place to look is the seaborn module that includes a vast array of beautifully formatted plots including violin plots, kernel density plots, and bivariate histograms. The seaborn gallery includes samples of available plots and the corresponding code that generates them. Note that importing seaborn hijacks the default settings for all plots, so you have to coordinate this if you only want to use seaborn for some (not all) of your visualizations in a given session. Note that you can find the defaults for Matplotlib in the matplotlib.rcParams dictionary. 1.4 IPython IPython [3] originated as a way to enhance Python’s basic interpreter for smooth interactive scientific development. In the early days, the most important enhancement 1.4 IPython 21 was tab completion for dynamic introspection of workspace variables. For example, you can start IPython at the commandline by typing ipython and then you should see something like the following in your terminal: Python 2.7.11 Continuum Analytics, Inc. (default, Dec 7 2015, 14:00 Type "copyright", "credits" or "license" for more information. IPython 4.0.0  An enhanced Interactive Python. ? > Introduction and overview of IPython's features. %%quickref > Quick reference. help > Python's own help system. object? > Details about 'object', use 'object??' for extra details. In [1]: Next, creating a string as shown and hitting the TAB key after the dot character initiates the introspection, showing all the functions and attributes of the string object in x. In [1]: x = 'this is a string' In [2]: x.<TAB> x.capitalize x.format x.center x.index x.count x.isalnum x.decode x.isalpha x.encode x.isdigit x.endswith x.islower x.expandtabs x.isspace x.find x.istitle x.isupper x.join x.ljust x.lower x.lstrip x.partition x.replace x.rfind x.rindex x.rjust x.rpartition x.rsplit x.rstrip x.split x.splitlines x.startswith x.strip x.swapcase x.title x.translate x.upper x.zfill To get help about any of these, you simply add the ? character at the end as shown below In [2]: x.center? Type: builtin_function_or_method String Form:<builtin method center of str object at 0x03193390> Docstring: S.center(width[, fillchar]) > string Return S centered in a string of length width. Padding is done using the specified fill character (default is a space) and IPython provides the builtin help documentation. Note that you can also get this documentation with help(x.center) which works in the plain Python interpreter as well. The combination of dynamic tabbased introspection and quick interactive help accelerates development because you can keep your eyes and fingers in one place as you work. This was the original IPython experience, but IPython has since grown into a complete framework for delivering a rich scientific computing workflow that retains and enhances these fundamental features. 22 1 Getting Started with Scientific Python 1.5 Jupyter Notebook As you may have noticed investigating Python on the web, most Python users are web developers, not scientific programmers, meaning that the Python stack is very well developed for web technologies. The genius of the IPython development team was to leverage these technologies for scientific computing by embedding IPython in modern web browsers. In fact, this strategy has been so successful that IPython has moved into other languages beyond Python such as Julia and R as the Jupyter project. You can start the Jupyter Notebook with the following commandline: Terminal> jupyter notebook After starting the notebook, you should see something like the following in the terminal: [I [I [I [I 16:08:21.213 16:08:21.214 16:08:21.214 16:08:21.214 NotebookApp] NotebookApp] NotebookApp] NotebookApp] Serving notebooks from local directory: /home/user The Jupyter Notebook is running at: http://localhost:8888/?token=80281f0c324924d34a4e Use ControlC to stop this server and shut down The first line reveals where Jupyter looks for default settings. The next line shows where it looks for documents in the Jupyter Notebook format. The third line shows that the Jupyter Notebook started a web server on the local machine (i.e., 127.0.0.1) on port number 8888. This is the address your browser needs to connect to the Jupyter session although your default browser should have opened automatically to this address. The port number and other configuration options are available either on the commandline or in the profile file shown in the first line. If you are on a Windows platform and you do not get this far, then the Window’s firewall is probably blocking the port. For additional configuration help, see the main Jupyter site (www.jupyter.org). When Jupyter starts, it initiates several Python processes that use the blazingfast ZeroMQ message passing framework for interprocess communication, along with the websockets protocol for backandforth communication with the browser. To start Jupyter and get around your default browser, you can use the additonal nobrowser flag and then manually type in the local host address http://127.0.0.1:8888 into your favorite browser to get started. Once all that is settled, you should see something like the following Fig. 1.2, You can create a new document by clicking the New Notebook button shown in Fig. 1.2. Then, you should see something like Fig. 1.3. To start using the Jupyter Notebook, you just start typing code in the shaded textbox and then hit SHIFT+ENTER to execute the code in that Jupyter cell. Figure 1.4 shows the dynamic introspection in the pulldown menu when you type the TAB key after the x.. Contextbased help is also available as before by using the ? suffix which opens a help panel at the bottom of the browser window. There are many amazing features including the ability to share notebooks between different users and to run Jupyter Notebooks in the Amazon cloud, but these features go beyond our scope here. Check the jupyter.org website or peek at the mailing list for the latest work on these fronts. 1.5 Jupyter Notebook 23 Fig. 1.2 The Jupyter Notebook dashboard Fig. 1.3 A new Jupyter Notebook The Jupyter Notebook supports highquality mathematical typesetting using MathJaX, which is a JavaScript implementation of most of LATEX, as well as video and other rich content. The concept of consolidating mathematical algorithm descriptions and the code that implements those algorithms into a shareable document is more important than all of these amazing features. There is no understating the importance of this in practice because the algorithm documentation (if it exists) is usually in one format and completely separate from the code that implements it. This common practice leads to unsynchronized documentation and code that renders one or the other useless. The Jupyter Notebook solves this problem by putting everything into a living shareable document based upon open standards and freely 24 1 Getting Started with Scientific Python Fig. 1.4 Jupyter Notebook pulldown completion menu available software. Jupyter Notebooks can even be saved as static HTML documents for those without Python! Finally, Jupyter provides a large set of magic commands for creating macros, profiling, debugging, and viewing codes. A full list of these can be found by typing in %lsmagic in Jupyter. Help on any of these is available using the ? character suffix. Some frequently used commands include the %cd command that changes the current working directory, the %ls command that lists the files in the current directory, and the %hist command that shows the history of previous commands (including optional searching). The most important of these for new users is probably the %loadpy command that can load scripts from the local disk or from the web. Using this to explore the Matplotlib gallery is a great way to experiment with and reuse the plots there. 1.6 Scipy Scipy was the first consolidated module for a wide range of compiled libraries, all based on Numpy arrays. Scipy includes numerous special functions (e.g., Airy, Bessel, elliptical) as well as powerful numerical quadrature routines via the QUADPACK Fortran library (see scipy.integrate), where you will also find other quadrature methods. Note that some of the same functions appear in multiple places 1.6 Scipy 25 within Scipy itself as well as in Numpy. Additionally, Scipy provides access to the ODEPACK library for solving differential equations. Lots of statistical functions, including random number generators, and a wide variety of probability distributions are included in the scipy.stats module. Interfaces to the Fortran MINPACK optimization library are provided via scipy.optimize. These include methods for rootfinding, minimization and maximization problems, with and without higher order derivatives. Methods for interpolation are provided in the scipy.interpolate module via the FITPACK Fortran package. Note that some of the modules are so big that you do not get all of them with import scipy because that would take too long to load. You may have to load some of these packages individually as import scipy.interpolate, for example. As we discussed, the Scipy module is already packed with an extensive list of scientific codes. For that reason, the scikits modules were originally established as a way to stage candidates that could eventually make it into the already stuffed Scipy module, but it turns out that many of these modules became so successful on their own that they will never be integrated into Scipy proper. Some examples include sklearn for machine learning and scikitimage for image processing. 1.7 Pandas Pandas [4] is a powerful module that is optimized on top of Numpy and provides a set of data structures particularly suited to time series and spreadsheetstyle data analysis (think of pivot tables in Excel). If you are familiar with the R statistical package, then you can think of Pandas as providing a Numpypowered dataframe for Python. 1.7.1 Series There are two primary data structures in Pandas. The first is the Series object which combines an index and corresponding data values. >>> import pandas as pd # recommended convention >>> x=pd.Series(index = range(5),data=[1,3,9,11,12]) >>> x 0 1 1 3 2 9 3 11 4 12 dtype: int64 26 1 Getting Started with Scientific Python The main thing to keep in mind with Pandas is that these data structures were originally designed to work with timeseries data. In that case, the index in the data structures corresponds to a sequence of ordered time stamps. In the general case, the index must be a sortable arraylike entity. For example, >>> x=pd.Series(index = ['a','b','d','z','z'],data=[1,3,9,11,12]) >>> x a 1 b 3 d 9 z 11 z 12 dtype: int64 Note the duplicated z entries in the index. We can get at the entries in the Series in a number of ways. First, we can used the dot notation to select as in the following: >>> x.a 1 >>> x.z z 11 z 12 dtype: int64 We can also use the indexed position of the entries with iloc as in the following: >>> x.iloc[:3] a 1 b 3 d 9 dtype: int64 which uses the same slicing syntax as Numpy arrays. You can also slice across the index, even if it is not numeric with loc as in the following: >>> x.loc['a':'d'] a 1 b 3 d 9 dtype: int64 which you can get directly from the usual slicing notation: >>> x['a':'d'] a 1 b 3 d 9 dtype: int64 1.7 Pandas 27 Note that, unlike Python, slicing this way includes the endpoints. While that is very interesting, the main power of Pandas comes from its power to aggregate and group data. In the following, we build a more interesting Series object: >>> x = pd.Series(range(5),[1,2,11,9,10]) and then group it in the following: >>> grp=x.groupby(lambda i:i%2) # odd or even >>> grp.get_group(0) # even group 2 1 10 4 dtype: int64 >>> grp.get_group(1) # odd group 1 0 11 2 9 3 dtype: int64 The first line groups the elements of the Series object by whether or not the index is even or odd. The lambda function returns 0 or 1 depending on whether or not the corresponding index is even or odd, respectively. The next line shows the 0 (even) group and then the one after shows the 1 (odd) group. Now, that we have separate groups, we can perform a wide variety of summarizations on the group. You can think of these as reducing each group into a single value. For example, in the following, we get the maximum value of each group: >>> grp.max() # max in each group 0 4 1 3 dtype: int64 Note that the operation above returns another Series object with an index corresponding to the [0,1] elements. 1.7.2 Dataframe The Pandas DataFrame is an encapsulation of the Series that extends to two dimensions. One way to create a DataFrame is with dictionaries as in the following: >>> df = pd.DataFrame({'col1': [1,3,11,2], 'col2': [9,23,0,2]}) Note that the keys in the input dictionary are now the column headings (labels) of the DataFrame, with each corresponding column matching the list of corresponding values from the dictionary. Like the Series object, the DataFrame also has in index, which is the [0,1,2,3] column on the farleft. We can extract elements from each column using the iloc method as discussed earlier as shown below 28 1 Getting Started with Scientific Python >>> df.iloc[:2,:2] # get section col1 col2 0 1 9 1 3 23 or by directly slicing or by using the dot notation as shown below >>> df['col1'] # indexing 0 1 1 3 2 11 3 2 Name: col1, dtype: int64 >>> df.col1 # use dot notation 0 1 1 3 2 11 3 2 Name: col1, dtype: int64 Subsequent operations on the DataFrame preserve its columnwise structure as in the following: >>> df.sum() col1 17 col2 34 dtype: int64 where each column was totaled. Grouping and aggregating with the dataframe is even more powerful than with Series. Let’s construct the following dataframe: >>> df = pd.DataFrame({'col1': [1,1,0,0], 'col2': [1,2,3,4]}) In the above dataframe, note that the col1 column has only two entries. We can group the data using this column as in the following: >>> grp=df.groupby('col1') >>> grp.get_group(0) col1 col2 2 0 3 3 0 4 >>> grp.get_group(1) col1 col2 0 1 1 1 1 2 Note that each group corresponds to entries for which col1 was either of its two values. Now that we have grouped on col1, as with the Series object, we can also functionally summarize each of the groups as in the following: 1.7 Pandas 29 >>> grp.sum() col2 col1 0 7 1 3 where the sum is applied across each of the Dataframes present in each group. Note that the index of the output above is each of the values in the original col1. The Dataframe can compute new columns based on existing columns using the eval method as shown below >>> df['sum_col']=df.eval('col1+col2') >>> df col1 col2 sum_col 0 1 1 2 1 1 2 3 2 0 3 3 3 0 4 4 Note that you can assign the output to a new column to the Dataframe as shown.4 We can group by multiple columns as shown below >>> grp=df.groupby(['sum_col','col1']) Doing the sum operation on each group gives the following: >>> res=grp.sum() >>> res col2 sum_col col1 2 1 1 3 0 3 1 2 4 0 4 This output is much more complicated than anything we have seen so far, so let’s carefully walk through it. Below the headers, the first row 2 1 1 indicates that for sum_col=2 and for all values of col1 (namely, just the value 1), the value of col2 is 1. For the next row, the same pattern applies except that for sum_col=3, there are now two values for col1, namely 0 and 1, which each have their corresponding two values for the sum operation in col2. This layered display is one way to look at the result. Note that the layers above are not uniform. Alternatively, we can unstack this result to obtain the following tabular view of the previous result: 4 Note this kind of onthefly memory extension is not possible in regular Numpy. For example, x = np.array([1,2]); x[3]=3 generates an error. 30 1 Getting Started with Scientific Python >>> res.unstack() col2 col1 0 1 sum_col 2 NaN 1.0 3 3.0 2.0 4 4.0 NaN The NaN values indicate positions in the table where there is no entry. For example, for the pair (sum_col=2,col2=0), there is no corresponding value in the Dataframe, as you may verify by looking at the penultimate code block. There is also no entry corresponding to the (sum_col=4,col2=1) pair. Thus, this shows that the original presentation in the penultimate code block is the same as this one, just without the abovementioned missing entries indicated by NaN. We have barely scratched the surface of what Pandas is capable of and we have completely ignored its powerful features for managing dates and times. The text by Mckinney [4] is a very complete and happily readable introduction to Pandas. The online documentation and tutorials at the main Pandas site are also great for diving deeper into Pandas. 1.8 Sympy Sympy [5] is the main computer algebra module in Python. It is a purePython package with no platform dependencies. With the help of multiple Google Summer of Code sponsorships, it has grown into a powerful computer algebra system with many collateral projects that make it faster and integrate it tighter with Numpy and Jupyter. Sympy’s online tutorial is excellent and allows interacting with its embedded code samples in the browser by running the code on the Google App Engine behind the scenes. This provides an excellent way to interact and experiment with Sympy. If you find Sympy too slow or need algorithms that it does not implement, then SAGE is your next stop. The SAGE project is a consolidation of over 70 of the best opensource packages for computer algebra and related computation. Although Sympy and SAGE share code freely between them, SAGE is a specialized build of the Python kernel to facilitate deep integration with the underlying libraries. Thus, it is not a purePython solution for computer algebra (i.e., not as portable) and it is a proper superset of Python with its own extended syntax. The choice between SAGE and Sympy really depends on whether or not you intend primarily work in SAGE or just need occasional computer algebra support in your existing Python code. An important new development regarding SAGE is the freely available SAGE Cloud (https://cloud.sagemath.com/), sponsored by University of Washington that allows you to use SAGE entirely in the browser with no additional setup. Both SAGE and Sympy offer tight integration with the Jupyter Notebook for mathematical typesetting in the browser using MathJaX. 1.8 Sympy 31 To get started with Sympy, you must import the module as usual, >>> import sympy as S # might take awhile which may take a bit because it is a big package. The next step is to create a Sympy variable as in the following: >>> x = S.symbols('x') Now we can manipulate this using Sympy functions and Python logic as shown below >>> p=sum(x**i for i in range(3)) # 2nd order polynomial >>> p x**2 + x + 1 Now, we can find the roots of this polynomial using Sympy functions, >>> S.solve(p) # solves p == 0 [1/2  sqrt(3)*I/2, 1/2 + sqrt(3)*I/2] There is also a sympy.roots function that provides the same output but as a dictionary. >>> S.roots(p) {1/2  sqrt(3)*I/2: 1, 1/2 + sqrt(3)*I/2: 1} We can also have more than one symbolic element in any expression as in the following: >>> from sympy.abc import a,b,c # quick way to get common symbols >>> p = a* x**2 + b*x + c >>> S.solve(p,x) # specific solving for xvariable [(b + sqrt(4*a*c + b**2))/(2*a), (b + sqrt(4*a*c + b**2))/(2*a)] which is the usual quadratic formula for roots. Sympy also provides many mathematical functions designed to work with Sympy variables. For example, >>> S.exp(S.I*a) #using Sympy exponential exp(I*a) We can expand this using expand_complex to obtain the following: >>> S.expand_complex(S.exp(S.I*a)) I*exp(im(a))*sin(re(a)) + exp(im(a))*cos(re(a)) which gives us Euler’s formula for the complex exponential. Note that Sympy does not know whether or not a is itself a complex number. We can fix this by making that fact part of the construction of a as in the following: 32 1 Getting Started with Scientific Python >>> a = S.symbols('a',real=True) >>> S.expand_complex(S.exp(S.I*a)) I*sin(a) + cos(a) Note the much simpler output this time because we have forced the additional condition on a. A powerful way to use Sympy is to construct complicated expressions that you can later evaluate using Numpy via the lambdify method. For example, >>> y = S.tan(x) * x + x**2 >>> yf= S.lambdify(x,y,'numpy') >>> y.subs(x,.1) # evaluated using Sympy 0.0200334672085451 >>> yf(.1) # evaluated using Numpy 0.020033467208545055 After creating the Numpy function with lambdify, you can use Numpy arrays as input as shown >>> yf(np.arange(3)) # input is Numpy array array([ 0. , 2.55740772, 0.37007973]) >>> [ y.subs(x,i).evalf() for i in range(3) ] # need extra work for Sympy [0, 2.55740772465490, 0.370079726523038] We can get the same output using Sympy, but that requires the extra programming logic shown to do the vectorizing that Numpy performs natively. Once again, we have merely scratched the surface of what Sympy is capable of and the online interactive tutorial is the best place to learn more. Sympy also allows automatic mathematical typesetting within the Jupyter Notebook using LATEX so the soconstructed notebooks look almost publicationready (see sympy.latex) and can be made so with the jupyter nbconvert command. This makes it easier to jump the cognitive gap between the Python code and the symbology of traditional mathematics. 1.9 Interfacing with Compiled Libraries As we have discussed, Python for scientific computing really consists of gluing together different scientific libraries written in a compiled language like C or Fortran. Ultimately, you may want to use libraries not available with existing Python bindings. There are many, many options for doing this. The most direct way is to use the builtin ctypes module which provides tools for providing input/output pointers to the library’s functions just as if you were calling them from a compiled language. This means that you have to know the function signatures in the library exactly—how many bytes for each input and how many bytes for the output. You are responsible for building the inputs exactly the way the library expects and collecting the resulting 1.9 Interfacing with Compiled Libraries 33 outputs. Even though this seems tedious, Python bindings for vast libraries have been built this way. If you want an easier way, then SWIG is an automatic wrapper generating tool that can provide bindings to a long list of languages, not just Python; so if you need bindings for multiple languages, then this is your best and only option. Using SWIG consists of writing an interface file so that the compiled Python dynamically linked library (.pyd files) can be readily imported into the Python interpreter. Huge and complex libraries like Trilinos (Sandia National Labs) have been interfaced to Python using SWIG, so it is a welltested option. SWIG also supports Numpy arrays. However, the SWIG model assumes that you want to continue developing primarily in C/Fortran and you are hooking into Python for usability or other reasons. On the other hand, if you start developing algorithms in Python and then want to speed them up, then Cython is an excellent option because it provides a mixed language that allows you to have both C language and Python code intermixed. Like SWIG, you have to write additional files in this hybrid Python/C dialect to have Cython generate the Ccode that you will ultimately compile. The best part of Cython is the profiler that can generate an HTML report showing where the code is slow and could benefit from translation to Cython. The Jupyter Notebook integrates nicely with Cython via its %cython magic command. This means you can write Cython code in a cell in Jupyter Notebook and the notebook will handle all of the tedious details like setting up the intermediate files to actually compile the Cython extension. Cython also supports Numpy arrays. Cython and SWIG are just two of the ways to create Python bindings for your favorite compiled libraries. Other notable (but less popular) options include FWrap, f2py, CFFI, and weave. It is also possible to use Python’s own API directly, but this is a tedious undertaking that is hard to justify given the existence of so many welldeveloped alternatives. 1.10 Integrated Development Environments For those who prefer integrated development environments (IDEs), there is a lot to choose from. The most comprehensive is Enthought Canopy, which includes a rich, syntaxhighlighted editor, integrated help, debugger, and even integrated training. If you are already familiar with Eclipse from other projects, or do mixedlanguage programming, then there is a Python plugin called PyDev that contains all usual features from Eclipse with a Python debugger. Wingware provides an affordable professionallevel IDE with multiproject management support and unusually clairvoyant code completion that works even in debug mode. Another favorite is PyCharm, which also supports multiple languages and is particularly popular among Python web developers because it provides powerful templates for popular web frameworks like Django. Visual Studio Code has quickly developed a strong following among Python newcomers because of its beautiful interface and plugin ecosystem. If you are a VIM user, then the Jedi plugin provides excellent code completion that works well with 34 1 Getting Started with Scientific Python pylint, which provides static code analysis (i.e., identifies missing modules and typos). Naturally, emacs has many related plugins for developing in Python. Note that are many other options, but I have tried to emphasize those most suitable for Python beginners. 1.11 Quick Guide to Performance and Parallel Programming There are many options available to improve the performance of your Python codes. The first thing to determine is what is limiting your computation. It could be CPU speed (unlikely), memory limitations (outofcore computing), or it could be data transfer speed (waiting on data to arrive for processing). If your code is purePython, then you can try running it with Pypy, which is is an alternative Python implementation that employs a justintime compiler. If your code does not experience a massive speedup with Pypy, then there is probably something external to the code that is slowing it down (e.g., disk access or network access). If Pypy doesn’t make any sense because you are using many compiled modules that Pypy does not support, then there are many diagnostic tools available. Python has its own builtin profiler cProfile you can invoke from the command line as in the following: >>> python m cProfile o program.prof my_program.py The output of the profiler is saved to the program.prof file. This file can be visualized in runsnakerun to get a nice graphical picture of where the code is spending the most time. The task manager on your operating system can also provide clues as your program runs to see how it is consuming resources. The line_profiler by Robert Kern provides an excellent way to see how the code is spending its time by annotating each line of the code by its timings. In combination with runsnakerun, this narrows down problems to the line level from the function level. The most common situation is that your program is waiting on data from disk or from some busy network resource. This is a common situation in web programming and there are lots of wellestablished tools to deal with this. Python has a multiprocessing module that is part of the standard library. This makes it easy to spawn child worker processes that can break off and individually process small parts of a big job. However, it is still your responsibility as the programmer to figure out how to distribute the data for your algorithm. Using this module means that the individual processes are to be managed by the operating system, which will be in charge of balancing the load. The basic template for using multiprocessing is the following: # filename multiprocessing_demo.py import multiprocessing import time def worker(k): 1.11 Quick Guide to Performance and Parallel Programming 35 'worker function' print('am starting process %d' % (k)) time.sleep(10) # wait ten seconds print('am done waiting!') return if __name__ == '__main__': for i in range(10): p = multiprocessing.Process(target=worker, args=(i,)) p.start() Then, you run this program at the terminal as in the following: Terminal> python multiprocessing_demo.py It is crucially important that you run the program from the terminal this way. It is not possible to do this interactively from within Jupyter, say. If you look at the process manager on the operating system, you should see a number of new Python processes loitering for ten seconds. You should also see the output of the print statements above. Naturally, in a real application, you would be assigning some meaningful work for each of the workers and figuring out how to send partially finished pieces between individual workers. Doing this is complex and easy to get wrong, so Python 3 has the helpful concurrent.futures. # filename: concurrent_demo.py from concurrent import futures import time def worker(k): 'worker function' print ('am starting process %d' % (k)) time.sleep(10) # wait ten seconds print ('am done waiting!') return def main(): with futures.ProcessPoolExecutor(max_workers=3) as executor: list(executor.map(worker,range(10))) if __name__ == '__main__': main() Terminal> python concurrent_demo.py You should see something like the following in the terminal. Note that we explicitly restricted the number of processes to three. am starting process 0 am starting process 1 am starting process 2 am done waiting! am done waiting! ... The futures module is built on top of multiprocessing and makes it easier to use for this kind of simple task. Note that there are also versions of both that use threads instead of processes while maintaining the same usage pattern. The main 36 1 Getting Started with Scientific Python difference between threads and processes is that processes have their own compartmentalized resources. The C language Python (i.e., CPython) implementation uses a global interpreter lock (GIL) that prevents threads from locking up on internal data structures. This is a coursegrained locking mechanism where one thread may individually run faster because it does not have to keep track of all the bookkeeping involved in running multiple threads simultaneously. The downside is that you cannot run multiple threads simultaneously to speed up certain tasks. There is no corresponding locking problem with processes but these are somewhat slower to start up because each process has to create its own private workspace for data structures that may be transferred between them. However, each process can certainly run independently and simultaneously once all that is set up. Note that certain alternative implementations of Python like IronPython use a finergrain threading design rather than a GIL approach. As a final comment, on modern systems with multiple cores, it could be that multiple threads actually slow things down because the operating system may have to switch threads between different cores. This creates additional overheads in the thread switching mechanism that ultimately slow things down. Jupyter itself has a parallel programming framework built (ipyparallel) that is both powerful and easy to use. The first step is to fire up separate Jupyter engines at the terminal as in the following: Terminal> ipcluster start n=4 Then, in an Jupyter window, you can get the client, In [1]: from ipyparallel import Client ...: rc = Client() The client has a connection to each of the processes we started before using ipcluster. To use all of the engines, we assign the DirectView object from the client as in the following: In [2]: dview = rc[:] Now, we can apply functions for each of the engines. For example, we can get the process identifiers using the os.getpid function, In [3]: import os In [4]: dview.apply_sync(os.getpid) Out[4]: [6824, 4752, 8836, 3124] Once the engines are up and running, data can be distributed to them using scatter, In [5]: dview.scatter('a',range(10)) Out[5]: <AsyncResult: finished> In [6]: dview.execute('print(a)').display_outputs() [stdout:0] [0, 1, 2] [stdout:1] [3, 4, 5] [stdout:2] [6, 7] [stdout:3] [8, 9] 1.11 Quick Guide to Performance and Parallel Programming 37 Note that the execute method evaluates the given string in each engine. Now that the data have been sprinkled among the active engines, we can do further computing on them, In [7]: dview.execute('b=sum(a)') Out[7]: <AsyncResult: finished> In [8]: dview.execute('print(b)').display_outputs() [stdout:0] 3 [stdout:1] 12 [stdout:2] 13 [stdout:3] 17 In this example, we added up the individual a sublists available on each of the engines. We can gather up the individual results into a single list as in the following: In [9]: dview.gather('b').result Out[9]: [3, 12, 13, 17] This is one of the simplest mechanisms for distributing work to the individual engines and collecting the results. Unlike the other methods we discussed, you can do this iteratively, which makes it easy to experiment with how you want to distribute and compute with the data. The Jupyter documentation has many more examples of parallel programming styles that include running the engines on cloud resources, supercomputer clusters, and across disparate networked computing resources. Although there are many other specialized parallel programming packages, Jupyter provides the best tradeoff for generality against complexity across all of the major platforms. 1.12 Other Resources The Python community is filled with supersmart and amazingly helpful people. One of the best places to get help with scientific Python is the www.stackoverflow.com site which hosts a competitive Q&A forum that is particularly welcoming for Python newbies. Several of the key Python developers regularly participate there and the quality of the answers is very high. The mailing lists for any of the key tools (e.g., Numpy, Jupyter, Matplotlib) are also great for keeping up with the newest developments. Anything written by Hans Petter Langtangen [6] is excellent, especially if you have a physics background. The Scientific Python conference held annually in Austin is also a great place to see your favorite developers in person, ask questions, and participate in the many interesting subgroups organized around niche topics. The PyData workshop is a semiannual meeting focused on Python for largescale dataintensive processing. 38 1 Getting Started with Scientific Python References 1. T.E. Oliphant, A Guide to NumPy (Trelgol Publishing, 2006) 2. L. Wilkinson, D. Wills, D. Rope, A. Norton, R. Dubbs, The Grammar of Graphics. Statistics and Computing (Springer, Berlin, 2006) 3. F. Perez, B.E. Granger et al., IPython software package for interactive scientific computing. http://ipython.org/ 4. W. McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (O’Reilly, 2012) 5. O. Certik et al., SymPy: python library for symbolic mathematics. http://sympy.org/ 6. H.P. Langtangen, Python Scripting for Computational Science, vol. 3, 3rd edn. Texts in Computational Science and Engineering (Springer, Berlin, 2009) Chapter 2 Probability 2.1 Introduction This chapter takes a geometric view of probability theory and relates it to familiar concepts in linear algebra and geometry. This approach connects your natural geometric intuition to the key abstractions in probability that can help guide your reasoning. This is particularly important in probability because it is easy to be misled. We need a bit of rigor and some intuition to guide us. In grade school, you were introduced to the natural numbers (i.e., 1,2,3,..) and you learned how to manipulate them by operations like addition, subtraction, and multiplication. Later, you were introduced to positive and negative numbers and were again taught how to manipulate them. Ultimately, you were introduced to the calculus of the real line, and learned how to differentiate, take limits, and so on. This progression provided more abstractions, but also widened the field of problems you could successfully tackle. The same is true of probability. One way to think about probability is as a new number concept that allows you to tackle problems that have a special kind of uncertainty built into them. Thus, the key idea is that there is some number, say x, with a traveling companion, say, f (x), and this companion represents the uncertainties about the value of x as if looking at the number x through a frosted window. The degree of opacity of the window is represented by f (x). If we want to manipulate x, then we have to figure out what to do with f (x). For example if we want y = 2x, then we have to understand how f (x) generates f (y). Where is the random part? To conceptualize this, we need still another analogy: think about a beehive with the swarm around it representing f (x), and the hive itself, which you can barely see through the swarm, as x. The random piece is you don’t know which bee in particular is going to sting you! Once this happens the uncertainty evaporates. Up until that happens, all we have is a concept of a swarm (i.e., density of bees) which represents a potentiality of which bee will ultimately sting. In summary, one way to think about probability is as a way of carrying through mathematical reasoning (e.g., adding, subtracting, taking limits) with a notion of potentiality that is sotransformed by these operations. © Springer Nature Switzerland AG 2019 J. Unpingco, Python for Probability, Statistics, and Machine Learning, https://doi.org/10.1007/9783030185459_2 39 40 2 Probability 2.1.1 Understanding Probability Density In order to understand the heart of modern probability, which is built on the Lebesgue theory of integration, we need to extend the concept of integration from basic calculus. To begin, let us consider the following piecewise function ⎧ ⎪ ⎨1 if 0 < x ≤ 1 f (x) = 2 if 1 < x ≤ 2 ⎪ ⎩ 0 otherwise as shown in Fig. 2.1. In calculus, you learned Riemann integration, which you can apply here as 2 f (x)d x = 1 + 2 = 3 0 which has the usual interpretation as the area of the two rectangles that make up f (x). So far, so good. With Lesbesgue integration, the idea is very similar except that we focus on the yaxis instead of moving along the xaxis. The question is given f (x) = 1, what is the set of x values for which this is true? For our example, this is true whenever x ∈ (0, 1]. So now we have a correspondence between the values of the function (namely, 1 and 2) and the sets of x values for which this is true, namely, {(0, 1]} and {(1, 2]}, respectively. To compute the integral, we simply take the function values (i.e., 1,2) and some way of measuring the size of the corresponding interval (i.e., μ) as in the following: Fig. 2.1 Simple piecewiseconstant function 2.1 Introduction 41 2 f dμ = 1μ({(0, 1]}) + 2μ({(1, 2]}) 0 We have suppressed some of the notation above to emphasize generality. Note that we obtain the same value of the integral as in the Riemann case when μ((0, 1]) = μ((1, 2]) = 1. By introducing the μ function as a way of measuring the intervals above, we have introduced another degree of freedom in our integration. This accommodates many weird functions that are not tractable using the usual Riemann theory, but we refer you to a proper introduction to Lesbesgue integration for further study [1]. Nonetheless, the key step in the above discussion is the introduction of the μ function, which we will encounter again as the socalled probability density function. 2.1.2 Random Variables Most introductions to probability jump straight into random variables and then explain how to compute complicated integrals. The problem with this approach is that it skips over some of the important subtleties that we will now consider. Unfortunately, the term random variable is not very descriptive. A better term is mea