Table of Contents
http://shop.oreilly.com/product/0636920028529.do
Doing Data Science
Table of Contents
Doing Data Science
Dedication
Preface
Motivation
Origins of the Class
Origins of the Book
What to Expect from This Book
How This Book Is Organized
How to Read This Book
How Code Is Used in This Book
Who This Book Is For
I was working on the Google+ data science team with an interdisciplinary team of PhDs. There was me (a statistician), a social scientist, an engineer, a physicist, and a computer scientist. We were part of a larger team that included talented data engineers who built the data pipelines, infrastructure, and dashboards, as well as built the experimental infrastructure (A/B testing). Our team had a flat structure. Together our skills were powerful, and we were able to do amazing things with massive datasets, including predictive modeling, prototyping algorithms, and unearthing patterns in the data that had huge impact on the product. – Doing Data Science
Supplemental Readings
Math
- Linear Algebra and Its Applications by Gilbert Strang (Cengage Learning)
- Convex Optimization by Stephen Boyd and Lieven Vandenberghe (Cambridge University Press)
- A First Course in Probability (Pearson) and Introduction to Probability Models (Academic Press) by Sheldon Ross
Coding
- R in a Nutshell by Joseph Adler (O’Reilly)
- Learning Python by Mark Lutz and David Ascher (O’Reilly)
- R for Everyone: Advanced Analytics and Graphics by Jared Lander (Addison-Wesley)
- The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff (No Starch Press)
- Python for Data Analysis by Wes McKinney (O’Reilly)
Data Analysis and Statistical Inference
- Statistical Inference by George Casella and Roger L. Berger (Cengage Learning)
- Bayesian Data Analysis by Andrew Gelman, et al. (Chapman & Hall)
- Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill (Cambridge University Press)
- Advanced Data Analysis from an Elementary Point of View by Cosma Shalizi (under contract with Cambridge University Press)
- The Elements of Statistical Learning: Data Mining, Inference and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer)
Artificial Intelligence and Machine Learning
- Pattern Recognition and Machine Learning by Christopher Bishop (Springer)
- Bayesian Reasoning and Machine Learning by David Barber (Cambridge University Press)
- Programming Collective Intelligence by Toby Segaran (O’Reilly)
- Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig (Prentice Hall)
- Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar (MIT Press)
- Introduction to Machine Learning (Adaptive Computation and Machine Learning) by Ethem Alpaydin (MIT Press)
Experimental Design
- Field Experiments by Alan S. Gerber and Donald P. Green (Norton)
- Statistics for Experimenters: Design, Innovation, and Discovery by George E. P. Box, et al. (Wiley-Interscience)
Visualization
- The Elements of Graphing Data by William Cleveland (Hobart Press)
- Visualize This: The FlowingData Guide to Design, Visualization, and Statistics by Nathan Yau (Wiley)
- The Visual Display of Quantitative Information by Edward Tufte (Graphics Press)
Prerequisites
Supplemental Reading
About the Contributors
Conventions Used in This Book
Using Code Examples
Safari Books Online
How to Contact Us
Acknowledgments
1. Introduction: What Is Data Science?
Big Data and Data Science Hype
Getting Past the Hype
Why Now?
Datafication
The Current Landscape (with a Little History)
Data Science Jobs
A Data Science Profile
Thought Experiment: Meta-Definition
OK, So What Is a Data Scientist, Really?
In Academia
In Industry
2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
Statistical Thinking in the Age of Big Data
Statistical Inference
Populations and Samples
Populations and Samples of Big Data
Big Data Can Mean Big Assumptions
Can N=ALL?
Data is not objective
Modeling
What is a model?
Statistical modeling
But how do you build a model?
Probability distributions
Fitting a model
Overfitting
Exploratory Data Analysis
Philosophy of Exploratory Data Analysis
Exercise: EDA
Sample code
The Data Science Process
A Data Scientist’s Role in This Process
Thought Experiment: How Would You Simulate Chaos?
Case Study: RealDirect
How Does RealDirect Make Money?
Exercise: RealDirect Data Strategy
Sample R code
3. Algorithms
Machine Learning Algorithms
Three Basic Algorithms
Linear Regression
Start by writing something down
Fitting the model
Extending beyond least squares
Adding in modeling assumptions about the errors
Adding other predictors
Transformations
Review
Exercise
k-Nearest Neighbors (k-NN)
Example with credit scores
Similarity or distance metrics
Training and test sets
Pick an evaluation metric
Putting it all together
Choosing k
What are the modeling assumptions?
k-means
2D version
Exercise: Basic Machine Learning Algorithms
Solutions
Sample R code: Linear regression on the housing dataset
Sample R code: K-NN on the housing dataset
Summing It All Up
Thought Experiment: Automated Statistician
4. Spam Filters, Naive Bayes, and Wrangling
Thought Experiment: Learning by Example
Why Won’t Linear Regression Work for Filtering Spam?
How About k-nearest Neighbors?
Naive Bayes
Bayes Law
A Spam Filter for Individual Words
A Spam Filter That Combines Words: Naive Bayes
Fancy It Up: Laplace Smoothing
Comparing Naive Bayes to k-NN
Sample Code in bash
Scraping the Web: APIs and Other Tools
Jake’s Exercise: Naive Bayes for Article Classification
Sample R Code for Dealing with the NYT API
5. Logistic Regression
Thought Experiments
Classifiers
Runtime
You
Interpretability
Scalability
M6D Logistic Regression Case Study
Click Models
The Underlying Math
Estimating α and β
Newton’s Method
Stochastic Gradient Descent
Implementation
Evaluation
Media 6 Degrees Exercise
Sample R Code
6. Time Stamps and Financial Modeling
Kyle Teague and GetGlue
Timestamps
Exploratory Data Analysis (EDA)
Metrics and New Variables or Features
What’s Next?
Cathy O’Neil
Thought Experiment
Financial Modeling
In-Sample, Out-of-Sample, and Causality
Preparing Financial Data
Log Returns
Example: The S&P Index
Working out a Volatility Measurement
Exponential Downweighting
The Financial Modeling Feedback Loop
Why Regression?
Adding Priors
A Baby Model
Exercise: GetGlue and Timestamped Event Data
Exercise: Financial Data
7. Extracting Meaning from Data
William Cukierski
Background: Data Science Competitions
Background: Crowdsourcing
The Kaggle Model
A Single Contestant
Their Customers
Thought Experiment: What Are the Ethical Implications of a Robo-Grader?
Feature Selection
Example: User Retention
Filters
Wrappers
Selecting an algorithm
Selection criterion
In practice
Embedded Methods: Decision Trees
Entropy
The Decision Tree Algorithm
Handling Continuous Variables in Decision Trees
Random Forests
User Retention: Interpretability Versus Predictive Power
David Huffaker: Google’s Hybrid Approach to Social Research
Moving from Descriptive to Predictive
Social at Google
Privacy
Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?
8. Recommendation Engines: Building a User-Facing Data Product at Scale
A Real-World Recommendation Engine
Nearest Neighbor Algorithm Review
Some Problems with Nearest Neighbors
Beyond Nearest Neighbor: Machine Learning Classification
The Dimensionality Problem
Singular Value Decomposition (SVD)
Important Properties of SVD
Principal Component Analysis (PCA)
Theorem: The resulting latent features will be uncorrelated
Alternating Least Squares
Theorem with no proof: The preceding algorithm will converge if your prior is large enough
Fix V and Update U
Last Thoughts on These Algorithms
Thought Experiment: Filter Bubbles
Exercise: Build Your Own Recommendation System
Sample Code in Python
9. Data Visualization and Fraud Detection
Data Visualization History
Gabriel Tarde
Mark’s Thought Experiment
What Is Data Science, Redux?
Processing
Franco Moretti
A Sample of Data Visualization Projects
Mark’s Data Visualization Projects
New York Times Lobby: Moveable Type
Project Cascade: Lives on a Screen
Cronkite Plaza
eBay Transactions and Books
Public Theater Shakespeare Machine
Goals of These Exhibits
Data Science and Risk
About Square
The Risk Challenge
Detecting suspicious activity using machine learning
The Trouble with Performance Estimation
Defining the error metric
Defining the labels
Challenges in features and learning
Model Building Tips
Code readability and reusability
Get a pair!
Productionizing machine learning models
Data Visualization at Square
Ian’s Thought Experiment
Data Visualization for the Rest of Us
Data Visualization Exercise
10. Social Networks and Data Journalism
Social Network Analysis at Morningside Analytics
Case-Attribute Data versus Social Network Data
Social Network Analysis
Terminology from Social Networks
Centrality Measures
The Industry of Centrality Measures
Thought Experiment
Morningside Analytics
How Visualizations Help Us Find Schools of Fish
More Background on Social Network Analysis from a Statistical Point of View
Representations of Networks and Eigenvalue Centrality
A First Example of Random Graphs: The Erdos-Renyi Model
A Second Example of Random Graphs: The Exponential Random Graph Model
Inference for ERGMs
Further examples of random graphs: latent space models, small-world networks
Data Journalism
A Bit of History on Data Journalism
Writing Technical Journalism: Advice from an Expert
11. Causality
Correlation Doesn’t Imply Causation
Asking Causal Questions
Confounders: A Dating Example
OK Cupid’s Attempt
The Gold Standard: Randomized Clinical Trials
A/B Tests
Second Best: Observational Studies
Simpson’s Paradox
The Rubin Causal Model
Visualizing Causality
Definition: The Causal Effect
Three Pieces of Advice
12. Epidemiology
Madigan’s Background
Thought Experiment
Modern Academic Statistics
Medical Literature and Observational Studies
Stratification Does Not Solve the Confounder Problem
What Do People Do About Confounding Things in Practice?
Is There a Better Way?
Research Experiment (Observational Medical Outcomes Partnership)
Closing Thought Experiment
13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
Claudia’s Data Scientist Profile
The Life of a Chief Data Scientist
On Being a Female Data Scientist
Data Mining Competitions
How to Be a Good Modeler
Data Leakage
Market Predictions
Amazon Case Study: Big Spenders
A Jewelry Sampling Problem
IBM Customer Targeting
Breast Cancer Detection
Pneumonia Prediction
How to Avoid Leakage
Evaluating Models
Accuracy: Meh
Probabilities Matter, Not 0s and 1s
Choosing an Algorithm
A Final Example
Parting Thoughts
14. Data Engineering: MapReduce, Pregel, and Hadoop
About David Crawshaw
Thought Experiment
MapReduce
Word Frequency Problem
Enter MapReduce
Other Examples of MapReduce
What Can’t MapReduce Do?
Pregel
About Josh Wills
Thought Experiment
On Being a Data Scientist
Data Abundance Versus Data Scarcity
Designing Models
Mind the gap
Economic Interlude: Hadoop
A Brief Introduction to Hadoop
Cloudera
Back to Josh: Workflow
So How to Get Started with Hadoop?
15. The Students Speak
Process Thinking
Naive No Longer
Helping Hands
Your Mileage May Vary
Bridging Tunnels
Some of Our Work
16. Next-Generation Data Scientists, Hubris, and Ethics
What Just Happened?
What Is Data Science (Again)?
What Are Next-Gen Data Scientists?
Being Problem Solvers
Cultivating Soft Skills
Being Question Askers
Being an Ethical Data Scientist
Career Advice
Index
Colophon
Copyright