http://shop.oreilly.com/product/0636920028529.do {{amazon>1449358659}}
Doing Data Science \\
Table of Contents \\
====== Doing Data Science ======
====== Dedication ======
====== Preface ======
===== Motivation =====
===== Origins of the Class =====
===== Origins of the Book =====
===== What to Expect from This Book =====
===== How This Book Is Organized =====
===== How to Read This Book =====
===== How Code Is Used in This Book =====
===== Who This Book Is For =====
I was working on the Google+ data science team with an interdisciplinary team of PhDs. There was me (a statistician), a social scientist, an engineer, a physicist, and a computer scientist. We were part of a larger team that included talented data engineers who built the data pipelines, infrastructure, and dashboards, as well as built the experimental infrastructure (A/B testing). Our team had a flat structure. Together our skills were powerful, and we were able to do amazing things with massive datasets, including predictive modeling, prototyping algorithms, and unearthing patterns in the data that had huge impact on the product. -- Doing Data Science
Supplemental Readings \\
**__Math__**
  * Linear Algebra and Its Applications by Gilbert Strang (Cengage Learning)
  * Convex Optimization by Stephen Boyd and Lieven Vandenberghe (Cambridge University Press)
  * A First Course in Probability (Pearson) and Introduction to Probability Models (Academic Press) by Sheldon Ross
**__Coding__**
  * R in a Nutshell by Joseph Adler (O’Reilly)
  * Learning Python by Mark Lutz and David Ascher (O’Reilly)
  * R for Everyone: Advanced Analytics and Graphics by Jared Lander (Addison-Wesley)
  * The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff (No Starch Press)
  * Python for Data Analysis by Wes McKinney (O’Reilly)
**__Data Analysis and Statistical Inference__**
  * Statistical Inference by George Casella and Roger L. Berger (Cengage Learning)
  * Bayesian Data Analysis by Andrew Gelman, et al. (Chapman & Hall)
  * Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill (Cambridge University Press)
  * Advanced Data Analysis from an Elementary Point of View by Cosma Shalizi (under contract with Cambridge University Press)
  * The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer)
**__Artificial Intelligence and Machine Learning__**
  * Pattern Recognition and Machine Learning by Christopher Bishop (Springer)
  * Bayesian Reasoning and Machine Learning by David Barber (Cambridge University Press)
  * Programming Collective Intelligence by Toby Segaran (O’Reilly)
  * Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig (Prentice Hall)
  * Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar (MIT Press)
  * Introduction to Machine Learning (Adaptive Computation and Machine Learning) by Ethem Alpaydin (MIT Press)
**__Experimental Design__**
  * Field Experiments by Alan S. Gerber and Donald P. Green (Norton)
  * Statistics for Experimenters: Design, Innovation, and Discovery by George E. P. Box, et al. (Wiley-Interscience)
**__Visualization__**
  * The Elements of Graphing Data by William Cleveland (Hobart Press)
  * Visualize This: The FlowingData Guide to Design, Visualization, and Statistics by Nathan Yau (Wiley)
  * The Visual Display of Quantitative Information by Edward Tufte (Graphics Press)
===== Prerequisites =====
===== Supplemental Reading =====
===== About the Contributors =====
===== Conventions Used in This Book =====
===== Using Code Examples =====
===== Safari Books Online =====
===== How to Contact Us =====
===== Acknowledgments =====
====== 1. Introduction: What Is Data Science? ======
===== Big Data and Data Science Hype =====
===== Getting Past the Hype =====
===== Why Now? =====
===== Datafication =====
===== The Current Landscape (with a Little History) =====
===== Data Science Jobs =====
===== A Data Science Profile =====
===== Thought Experiment: Meta-Definition =====
===== OK, So What Is a Data Scientist, Really? =====
===== In Academia =====
===== In Industry =====
====== 2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process ======
===== Statistical Thinking in the Age of Big Data =====
===== Statistical Inference =====
===== Populations and Samples =====
===== Populations and Samples of Big Data =====
===== Big Data Can Mean Big Assumptions =====
===== Can N=ALL? =====
===== Data is not objective =====
===== Modeling =====
===== What is a model? =====
===== Statistical modeling =====
===== But how do you build a model? =====
===== Probability distributions =====
===== Fitting a model =====
===== Overfitting =====
===== Exploratory Data Analysis =====
===== Philosophy of Exploratory Data Analysis =====
===== Exercise: EDA =====
===== Sample code =====
===== The Data Science Process =====
===== A Data Scientist’s Role in This Process =====
===== Thought Experiment: How Would You Simulate Chaos? =====
===== Case Study: RealDirect =====
===== How Does RealDirect Make Money? =====
===== Exercise: RealDirect Data Strategy =====
===== Sample R code =====
====== 3. Algorithms ======
===== Machine Learning Algorithms =====
===== Three Basic Algorithms =====
===== Linear Regression =====
===== Start by writing something down =====
===== Fitting the model =====
===== Extending beyond least squares =====
===== Adding in modeling assumptions about the errors =====
===== Adding other predictors =====
===== Transformations =====
===== Review =====
===== Exercise =====
===== k-Nearest Neighbors (k-NN) =====
===== Example with credit scores =====
===== Similarity or distance metrics =====
===== Training and test sets =====
===== Pick an evaluation metric =====
===== Putting it all together =====
===== Choosing k =====
===== What are the modeling assumptions? =====
===== k-means =====
===== 2D version =====
===== Exercise: Basic Machine Learning Algorithms =====
===== Solutions =====
===== Sample R code: Linear regression on the housing dataset =====
===== Sample R code: K-NN on the housing dataset =====
===== Summing It All Up =====
===== Thought Experiment: Automated Statistician =====
====== 4. Spam Filters, Naive Bayes, and Wrangling ======
===== Thought Experiment: Learning by Example =====
===== Why Won’t Linear Regression Work for Filtering Spam? =====
===== How About k-nearest Neighbors? =====
===== Naive Bayes =====
===== Bayes Law =====
===== A Spam Filter for Individual Words =====
===== A Spam Filter That Combines Words: Naive Bayes =====
===== Fancy It Up: Laplace Smoothing =====
===== Comparing Naive Bayes to k-NN =====
===== Sample Code in bash =====
===== Scraping the Web: APIs and Other Tools =====
===== Jake’s Exercise: Naive Bayes for Article Classification =====
===== Sample R Code for Dealing with the NYT API =====
====== 5. Logistic Regression ======
===== Thought Experiments =====
===== Classifiers =====
===== Runtime =====
===== You =====
===== Interpretability =====
===== Scalability =====
===== M6D Logistic Regression Case Study =====
===== Click Models =====
===== The Underlying Math =====
===== Estimating α and β =====
===== Newton’s Method =====
===== Stochastic Gradient Descent =====
===== Implementation =====
===== Evaluation =====
===== Media 6 Degrees Exercise =====
===== Sample R Code =====
====== 6. Time Stamps and Financial Modeling ======
===== Kyle Teague and GetGlue =====
===== Timestamps =====
===== Exploratory Data Analysis (EDA) =====
===== Metrics and New Variables or Features =====
===== What’s Next? =====
===== Cathy O’Neil =====
===== Thought Experiment =====
===== Financial Modeling =====
===== In-Sample, Out-of-Sample, and Causality =====
===== Preparing Financial Data =====
===== Log Returns =====
===== Example: The S&P Index =====
===== Working out a Volatility Measurement =====
===== Exponential Downweighting =====
===== The Financial Modeling Feedback Loop =====
===== Why Regression? =====
===== Adding Priors =====
===== A Baby Model =====
===== Exercise: GetGlue and Timestamped Event Data =====
===== Exercise: Financial Data =====
====== 7. Extracting Meaning from Data ======
===== William Cukierski =====
===== Background: Data Science Competitions =====
===== Background: Crowdsourcing =====
===== The Kaggle Model =====
===== A Single Contestant =====
===== Their Customers =====
===== Thought Experiment: What Are the Ethical Implications of a Robo-Grader? =====
===== Feature Selection =====
===== Example: User Retention =====
===== Filters =====
===== Wrappers =====
===== Selecting an algorithm =====
===== Selection criterion =====
===== In practice =====
===== Embedded Methods: Decision Trees =====
===== Entropy =====
===== The Decision Tree Algorithm =====
===== Handling Continuous Variables in Decision Trees =====
===== Random Forests =====
===== User Retention: Interpretability Versus Predictive Power =====
===== David Huffaker: Google’s Hybrid Approach to Social Research =====
===== Moving from Descriptive to Predictive =====
===== Social at Google =====
===== Privacy =====
===== Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control? =====
====== 8. Recommendation Engines: Building a User-Facing Data Product at Scale ======
===== A Real-World Recommendation Engine =====
===== Nearest Neighbor Algorithm Review =====
===== Some Problems with Nearest Neighbors =====
===== Beyond Nearest Neighbor: Machine Learning Classification =====
===== The Dimensionality Problem =====
===== Singular Value Decomposition (SVD) =====
===== Important Properties of SVD =====
===== Principal Component Analysis (PCA) =====
===== Theorem: The resulting latent features will be uncorrelated =====
===== Alternating Least Squares =====
===== Theorem with no proof: The preceding algorithm will converge if your prior is large enough =====
===== Fix V and Update U =====
===== Last Thoughts on These Algorithms =====
===== Thought Experiment: Filter Bubbles =====
===== Exercise: Build Your Own Recommendation System =====
===== Sample Code in Python =====
====== 9. Data Visualization and Fraud Detection ======
===== Data Visualization History =====
===== Gabriel Tarde =====
===== Mark’s Thought Experiment =====
===== What Is Data Science, Redux? =====
===== Processing =====
===== Franco Moretti =====
===== A Sample of Data Visualization Projects =====
===== Mark’s Data Visualization Projects =====
===== New York Times Lobby: Moveable Type =====
===== Project Cascade: Lives on a Screen =====
===== Cronkite Plaza =====
===== eBay Transactions and Books =====
===== Public Theater Shakespeare Machine =====
===== Goals of These Exhibits =====
===== Data Science and Risk =====
===== About Square =====
===== The Risk Challenge =====
===== Detecting suspicious activity using machine learning =====
===== The Trouble with Performance Estimation =====
===== Defining the error metric =====
===== Defining the labels =====
===== Challenges in features and learning =====
===== Model Building Tips =====
===== Code readability and reusability =====
===== Get a pair! =====
===== Productionizing machine learning models =====
===== Data Visualization at Square =====
===== Ian’s Thought Experiment =====
===== Data Visualization for the Rest of Us =====
===== Data Visualization Exercise =====
====== 10. Social Networks and Data Journalism ======
===== Social Network Analysis at Morningside Analytics =====
===== Case-Attribute Data versus Social Network Data =====
===== Social Network Analysis =====
===== Terminology from Social Networks =====
===== Centrality Measures =====
===== The Industry of Centrality Measures =====
===== Thought Experiment =====
===== Morningside Analytics =====
===== How Visualizations Help Us Find Schools of Fish =====
===== More Background on Social Network Analysis from a Statistical Point of View =====
===== Representations of Networks and Eigenvalue Centrality =====
===== A First Example of Random Graphs: The Erdos-Renyi Model =====
===== A Second Example of Random Graphs: The Exponential Random Graph Model =====
===== Inference for ERGMs =====
===== Further examples of random graphs: latent space models, small-world networks =====
===== Data Journalism =====
===== A Bit of History on Data Journalism =====
===== Writing Technical Journalism: Advice from an Expert =====
====== 11. Causality ======
===== Correlation Doesn’t Imply Causation =====
===== Asking Causal Questions =====
===== Confounders: A Dating Example =====
===== OK Cupid’s Attempt =====
===== The Gold Standard: Randomized Clinical Trials =====
===== A/B Tests =====
===== Second Best: Observational Studies =====
===== Simpson’s Paradox =====
===== The Rubin Causal Model =====
===== Visualizing Causality =====
===== Definition: The Causal Effect =====
===== Three Pieces of Advice =====
====== 12. Epidemiology ======
===== Madigan’s Background =====
===== Thought Experiment =====
===== Modern Academic Statistics =====
===== Medical Literature and Observational Studies =====
===== Stratification Does Not Solve the Confounder Problem =====
===== What Do People Do About Confounding Things in Practice? =====
===== Is There a Better Way? =====
===== Research Experiment (Observational Medical Outcomes Partnership) =====
===== Closing Thought Experiment =====
====== 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation ======
===== Claudia’s Data Scientist Profile =====
===== The Life of a Chief Data Scientist =====
===== On Being a Female Data Scientist =====
===== Data Mining Competitions =====
===== How to Be a Good Modeler =====
===== Data Leakage =====
===== Market Predictions =====
===== Amazon Case Study: Big Spenders =====
===== A Jewelry Sampling Problem =====
===== IBM Customer Targeting =====
===== Breast Cancer Detection =====
===== Pneumonia Prediction =====
===== How to Avoid Leakage =====
===== Evaluating Models =====
===== Accuracy: Meh =====
===== Probabilities Matter, Not 0s and 1s =====
===== Choosing an Algorithm =====
===== A Final Example =====
===== Parting Thoughts =====
====== 14. Data Engineering: MapReduce, Pregel, and Hadoop ======
===== About David Crawshaw =====
===== Thought Experiment =====
===== MapReduce =====
===== Word Frequency Problem =====
===== Enter MapReduce =====
===== Other Examples of MapReduce =====
===== What Can’t MapReduce Do? =====
===== Pregel =====
===== About Josh Wills =====
===== Thought Experiment =====
===== On Being a Data Scientist =====
===== Data Abundance Versus Data Scarcity =====
===== Designing Models =====
===== Mind the gap =====
===== Economic Interlude: Hadoop =====
===== A Brief Introduction to Hadoop =====
===== Cloudera =====
===== Back to Josh: Workflow =====
===== So How to Get Started with Hadoop? =====
====== 15. The Students Speak ======
===== Process Thinking =====
===== Naive No Longer =====
===== Helping Hands =====
===== Your Mileage May Vary =====
===== Bridging Tunnels =====
===== Some of Our Work =====
====== 16. Next-Generation Data Scientists, Hubris, and Ethics ======
===== What Just Happened? =====
===== What Is Data Science (Again)? =====
===== What Are Next-Gen Data Scientists? =====
===== Being Problem Solvers =====
===== Cultivating Soft Skills =====
===== Being Question Askers =====
===== Being an Ethical Data Scientist =====
===== Career Advice =====
===== Index =====
===== Colophon =====
===== Copyright =====
====== Temp ======