Doing Data Science
Dedication
Preface
- Motivation
- Origins of the Class
- Origins of the Book
- What to Expect from This Book
- How This Book Is Organized
- How to Read This Book
- How Code Is Used in This Book
- Who This Book Is For
- Prerequisites
- Supplemental Reading
- About the Contributors
- Conventions Used in This Book
- Using Code Examples
- Safari Books Online
- How to Contact Us
- Acknowledgments
1. Introduction: What Is Data Science?
- Big Data and Data Science Hype
- Getting Past the Hype
- Why Now?
- Datafication
- The Current Landscape (with a Little History)
- Data Science Jobs
- A Data Science Profile
- Thought Experiment: Meta-Definition
- OK, So What Is a Data Scientist, Really?
- In Academia
- In Industry
2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
- Statistical Thinking in the Age of Big Data
- Statistical Inference
- Populations and Samples
- Populations and Samples of Big Data
- Big Data Can Mean Big Assumptions
- Can N=ALL?
- Data is not objective
- Modeling
- What is a model?
- Statistical modeling
- But how do you build a model?
- Probability distributions
- Fitting a model
- Overfitting
- Exploratory Data Analysis
- Philosophy of Exploratory Data Analysis
- Exercise: EDA
- Sample code
- The Data Science Process
- A Data Scientist’s Role in This Process
- Thought Experiment: How Would You Simulate Chaos?
- Case Study: RealDirect
- How Does RealDirect Make Money?
- Exercise: RealDirect Data Strategy
- Sample R code
3. Algorithms
- Machine Learning Algorithms
- Three Basic Algorithms
- Linear Regression
- Start by writing something down
- Fitting the model
- Extending beyond least squares
- Adding in modeling assumptions about the errors
- Adding other predictors
- Transformations
- Review
- Exercise
- k-Nearest Neighbors (k-NN)
- Example with credit scores
- Similarity or distance metrics
- Training and test sets
- Pick an evaluation metric
- Putting it all together
- Choosing k
- What are the modeling assumptions?
- k-means
- 2D version
- Exercise: Basic Machine Learning Algorithms
- Solutions
- Sample R code: Linear regression on the housing dataset
- Sample R code: K-NN on the housing dataset
- Summing It All Up
- Thought Experiment: Automated Statistician
4. Spam Filters, Naive Bayes, and Wrangling
- Thought Experiment: Learning by Example
- Why Won’t Linear Regression Work for Filtering Spam?
- How About k-nearest Neighbors?
- Naive Bayes
- Bayes Law
- A Spam Filter for Individual Words
- A Spam Filter That Combines Words: Naive Bayes
- Fancy It Up: Laplace Smoothing
- Comparing Naive Bayes to k-NN
- Sample Code in bash
- Scraping the Web: APIs and Other Tools
- Jake’s Exercise: Naive Bayes for Article Classification
- Sample R Code for Dealing with the NYT API
5. Logistic Regression
- Thought Experiments
- Classifiers
- Runtime
- You
- Interpretability
- Scalability
- M6D Logistic Regression Case Study
- Click Models
- The Underlying Math
- Estimating α and β
- Newton’s Method
- Stochastic Gradient Descent
- Implementation
- Evaluation
- Media 6 Degrees Exercise
- Sample R Code
6. Time Stamps and Financial Modeling
- Kyle Teague and GetGlue
- Timestamps
- Exploratory Data Analysis (EDA)
- Metrics and New Variables or Features
- What’s Next?
- Cathy O’Neil
- Thought Experiment
- Financial Modeling
- In-Sample, Out-of-Sample, and Causality
- Preparing Financial Data
- Log Returns
- Example: The S&P Index
- Working out a Volatility Measurement
- Exponential Downweighting
- The Financial Modeling Feedback Loop
- Why Regression?
- Adding Priors
- A Baby Model
- Exercise: GetGlue and Timestamped Event Data
- Exercise: Financial Data
7. Extracting Meaning from Data
- William Cukierski
- Background: Data Science Competitions
- Background: Crowdsourcing
- The Kaggle Model
- A Single Contestant
- Their Customers
- Thought Experiment: What Are the Ethical Implications of a Robo-Grader?
- Feature Selection
- Example: User Retention
- Filters
- Wrappers
- Selecting an algorithm
- Selection criterion
- In practice
- Embedded Methods: Decision Trees
- Entropy
- The Decision Tree Algorithm
- Handling Continuous Variables in Decision Trees
- Random Forests
- User Retention: Interpretability Versus Predictive Power
- David Huffaker: Google’s Hybrid Approach to Social Research
- Moving from Descriptive to Predictive
- Social at Google
- Privacy
- Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?
8. Recommendation Engines: Building a User-Facing Data Product at Scale
- A Real-World Recommendation Engine
- Nearest Neighbor Algorithm Review
- Some Problems with Nearest Neighbors
- Beyond Nearest Neighbor: Machine Learning Classification
- The Dimensionality Problem
- Singular Value Decomposition (SVD)
- Important Properties of SVD
- Principal Component Analysis (PCA)
- Theorem: The resulting latent features will be uncorrelated
- Alternating Least Squares
- Theorem with no proof: The preceding algorithm will converge if your prior is large enough
- Fix V and Update U
- Last Thoughts on These Algorithms
- Thought Experiment: Filter Bubbles
- Exercise: Build Your Own Recommendation System
- Sample Code in Python
9. Data Visualization and Fraud Detection
- Data Visualization History
- Gabriel Tarde
- Mark’s Thought Experiment
- What Is Data Science, Redux?
- Processing
- Franco Moretti
- A Sample of Data Visualization Projects
- Mark’s Data Visualization Projects
- New York Times Lobby: Moveable Type
- Project Cascade: Lives on a Screen
- Cronkite Plaza
- eBay Transactions and Books
- Public Theater Shakespeare Machine
- Goals of These Exhibits
- Data Science and Risk
- About Square
- The Risk Challenge
- Detecting suspicious activity using machine learning
- The Trouble with Performance Estimation
- Defining the error metric
- Defining the labels
- Challenges in features and learning
- Model Building Tips
- Code readability and reusability
- Get a pair!
- Productionizing machine learning models
- Data Visualization at Square
- Ian’s Thought Experiment
- Data Visualization for the Rest of Us
- Data Visualization Exercise
10. Social Networks and Data Journalism
- Social Network Analysis at Morningside Analytics
- Case-Attribute Data versus Social Network Data
- Social Network Analysis
- Terminology from Social Networks
- Centrality Measures
- The Industry of Centrality Measures
- Thought Experiment
- Morningside Analytics
- How Visualizations Help Us Find Schools of Fish
- More Background on Social Network Analysis from a Statistical Point of View
- Representations of Networks and Eigenvalue Centrality
- A First Example of Random Graphs: The Erdos-Renyi Model
- A Second Example of Random Graphs: The Exponential Random Graph Model
- Inference for ERGMs
- Further examples of random graphs: latent space models, small-world networks
- Data Journalism
- A Bit of History on Data Journalism
- Writing Technical Journalism: Advice from an Expert
11. Causality
- Correlation Doesn’t Imply Causation
- Asking Causal Questions
- Confounders: A Dating Example
- OK Cupid’s Attempt
- The Gold Standard: Randomized Clinical Trials
- A/B Tests
- Second Best: Observational Studies
- Simpson’s Paradox
- The Rubin Causal Model
- Visualizing Causality
- Definition: The Causal Effect
- Three Pieces of Advice
12. Epidemiology
- Madigan’s Background
- Thought Experiment
- Modern Academic Statistics
- Medical Literature and Observational Studies
- Stratification Does Not Solve the Confounder Problem
- What Do People Do About Confounding Things in Practice?
- Is There a Better Way?
- Research Experiment (Observational Medical Outcomes Partnership)
- Closing Thought Experiment
13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
- Claudia’s Data Scientist Profile
- The Life of a Chief Data Scientist
- On Being a Female Data Scientist
- Data Mining Competitions
- How to Be a Good Modeler
- Data Leakage
- Market Predictions
- Amazon Case Study: Big Spenders
- A Jewelry Sampling Problem
- IBM Customer Targeting
- Breast Cancer Detection
- Pneumonia Prediction
- How to Avoid Leakage
- Evaluating Models
- Accuracy: Meh
- Probabilities Matter, Not 0s and 1s
- Choosing an Algorithm
- A Final Example
- Parting Thoughts
14. Data Engineering: MapReduce, Pregel, and Hadoop
- About David Crawshaw
- Thought Experiment
- MapReduce
- Word Frequency Problem
- Enter MapReduce
- Other Examples of MapReduce
- What Can’t MapReduce Do?
- Pregel
- About Josh Wills
- Thought Experiment
- On Being a Data Scientist
- Data Abundance Versus Data Scarcity
- Designing Models
- Mind the gap
- Economic Interlude: Hadoop
- A Brief Introduction to Hadoop
- Cloudera
- Back to Josh: Workflow
- So How to Get Started with Hadoop?
15. The Students Speak
- Process Thinking
- Naive No Longer
- Helping Hands
- Your Mileage May Vary
- Bridging Tunnels
- Some of Our Work
16. Next-Generation Data Scientists, Hubris, and Ethics
- What Just Happened?
- What Is Data Science (Again)?
- What Are Next-Gen Data Scientists?
- Being Problem Solvers
- Cultivating Soft Skills
- Being Question Askers
- Being an Ethical Data Scientist
- Career Advice
- Index
- Colophon
- Copyright

http://shop.oreilly.com/product/0636920028529.do
ISBN: 1449358659

Doing Data Science

Table of Contents

Doing Data Science

Dedication

Preface

Motivation

Origins of the Class

Origins of the Book

What to Expect from This Book

How This Book Is Organized

How to Read This Book

How Code Is Used in This Book

Who This Book Is For

I was working on the Google+ data science team with an interdisciplinary team of PhDs. There was me (a statistician), a social scientist, an engineer, a physicist, and a computer scientist. We were part of a larger team that included talented data engineers who built the data pipelines, infrastructure, and dashboards, as well as built the experimental infrastructure (A/B testing). Our team had a flat structure. Together our skills were powerful, and we were able to do amazing things with massive datasets, including predictive modeling, prototyping algorithms, and unearthing patterns in the data that had huge impact on the product. – Doing Data Science

Supplemental Readings

Math

Linear Algebra and Its Applications by Gilbert Strang (Cengage Learning)
Convex Optimization by Stephen Boyd and Lieven Vandenberghe (Cambridge University Press)
A First Course in Probability (Pearson) and Introduction to Probability Models (Academic Press) by Sheldon Ross

Coding

R in a Nutshell by Joseph Adler (O’Reilly)
Learning Python by Mark Lutz and David Ascher (O’Reilly)
R for Everyone: Advanced Analytics and Graphics by Jared Lander (Addison-Wesley)
The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff (No Starch Press)
Python for Data Analysis by Wes McKinney (O’Reilly)

Data Analysis and Statistical Inference

Statistical Inference by George Casella and Roger L. Berger (Cengage Learning)
Bayesian Data Analysis by Andrew Gelman, et al. (Chapman & Hall)
Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill (Cambridge University Press)
Advanced Data Analysis from an Elementary Point of View by Cosma Shalizi (under contract with Cambridge University Press)
The Elements of Statistical Learning: Data Mining, Inference and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer)

Artificial Intelligence and Machine Learning

Pattern Recognition and Machine Learning by Christopher Bishop (Springer)
Bayesian Reasoning and Machine Learning by David Barber (Cambridge University Press)
Programming Collective Intelligence by Toby Segaran (O’Reilly)
Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig (Prentice Hall)
Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar (MIT Press)
Introduction to Machine Learning (Adaptive Computation and Machine Learning) by Ethem Alpaydin (MIT Press)

Experimental Design

Field Experiments by Alan S. Gerber and Donald P. Green (Norton)
Statistics for Experimenters: Design, Innovation, and Discovery by George E. P. Box, et al. (Wiley-Interscience)

Visualization

The Elements of Graphing Data by William Cleveland (Hobart Press)
Visualize This: The FlowingData Guide to Design, Visualization, and Statistics by Nathan Yau (Wiley)
The Visual Display of Quantitative Information by Edward Tufte (Graphics Press)

Prerequisites

Supplemental Reading

About the Contributors

Conventions Used in This Book

Using Code Examples

Safari Books Online

How to Contact Us

Acknowledgments

1. Introduction: What Is Data Science?

Big Data and Data Science Hype

Getting Past the Hype

Why Now?

Datafication

The Current Landscape (with a Little History)

Data Science Jobs

A Data Science Profile

Thought Experiment: Meta-Definition

OK, So What Is a Data Scientist, Really?

In Academia

In Industry

2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process

Statistical Thinking in the Age of Big Data

Statistical Inference

Populations and Samples

Populations and Samples of Big Data

Big Data Can Mean Big Assumptions

Can N=ALL?

Data is not objective

Modeling

What is a model?

Statistical modeling

But how do you build a model?

Probability distributions

Fitting a model

Overfitting

Exploratory Data Analysis

Philosophy of Exploratory Data Analysis

Exercise: EDA

Sample code

The Data Science Process

A Data Scientist’s Role in This Process

Thought Experiment: How Would You Simulate Chaos?

Case Study: RealDirect

How Does RealDirect Make Money?

Exercise: RealDirect Data Strategy

Sample R code

3. Algorithms

Machine Learning Algorithms

Three Basic Algorithms

Linear Regression

Start by writing something down

Fitting the model

Extending beyond least squares

Adding in modeling assumptions about the errors

Adding other predictors

Transformations

Review

Exercise

k-Nearest Neighbors (k-NN)

Example with credit scores

Similarity or distance metrics

Training and test sets

Pick an evaluation metric

Putting it all together

Choosing k

What are the modeling assumptions?

k-means

2D version

Exercise: Basic Machine Learning Algorithms

Solutions

Sample R code: Linear regression on the housing dataset

Sample R code: K-NN on the housing dataset

Summing It All Up

Thought Experiment: Automated Statistician

4. Spam Filters, Naive Bayes, and Wrangling

Thought Experiment: Learning by Example

Why Won’t Linear Regression Work for Filtering Spam?

How About k-nearest Neighbors?

Naive Bayes

Bayes Law

A Spam Filter for Individual Words

A Spam Filter That Combines Words: Naive Bayes

Fancy It Up: Laplace Smoothing

Comparing Naive Bayes to k-NN

Sample Code in bash

Scraping the Web: APIs and Other Tools

Jake’s Exercise: Naive Bayes for Article Classification

Sample R Code for Dealing with the NYT API

5. Logistic Regression

Thought Experiments

Classifiers

Runtime

You

Interpretability

Scalability

M6D Logistic Regression Case Study

Click Models

The Underlying Math

Estimating α and β

Newton’s Method

Stochastic Gradient Descent

Implementation

Evaluation

Media 6 Degrees Exercise

Sample R Code

6. Time Stamps and Financial Modeling

Kyle Teague and GetGlue

Timestamps

Exploratory Data Analysis (EDA)

Metrics and New Variables or Features

What’s Next?

Cathy O’Neil

Thought Experiment

Financial Modeling

In-Sample, Out-of-Sample, and Causality

Preparing Financial Data

Log Returns

Example: The S&P Index

Working out a Volatility Measurement

Exponential Downweighting

The Financial Modeling Feedback Loop

Why Regression?

Adding Priors

A Baby Model

Exercise: GetGlue and Timestamped Event Data

Exercise: Financial Data

7. Extracting Meaning from Data

William Cukierski

Background: Data Science Competitions

Background: Crowdsourcing

The Kaggle Model

A Single Contestant

Their Customers

Thought Experiment: What Are the Ethical Implications of a Robo-Grader?

Feature Selection

Example: User Retention

Filters

Wrappers

Selecting an algorithm

Selection criterion

In practice

Embedded Methods: Decision Trees

Entropy

The Decision Tree Algorithm

Handling Continuous Variables in Decision Trees

Random Forests

User Retention: Interpretability Versus Predictive Power

David Huffaker: Google’s Hybrid Approach to Social Research

Moving from Descriptive to Predictive

Social at Google

Privacy

Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?

8. Recommendation Engines: Building a User-Facing Data Product at Scale

A Real-World Recommendation Engine

Nearest Neighbor Algorithm Review

Some Problems with Nearest Neighbors

Beyond Nearest Neighbor: Machine Learning Classification

The Dimensionality Problem

Singular Value Decomposition (SVD)

Important Properties of SVD

Principal Component Analysis (PCA)

Theorem: The resulting latent features will be uncorrelated

Alternating Least Squares

Theorem with no proof: The preceding algorithm will converge if your prior is large enough

Fix V and Update U

Last Thoughts on These Algorithms

Thought Experiment: Filter Bubbles

Exercise: Build Your Own Recommendation System

Sample Code in Python

9. Data Visualization and Fraud Detection

Data Visualization History

Gabriel Tarde

Mark’s Thought Experiment

What Is Data Science, Redux?

Processing

Franco Moretti

A Sample of Data Visualization Projects

Mark’s Data Visualization Projects

New York Times Lobby: Moveable Type

Project Cascade: Lives on a Screen

Cronkite Plaza

eBay Transactions and Books

Public Theater Shakespeare Machine

Goals of These Exhibits

Data Science and Risk

About Square

The Risk Challenge

Detecting suspicious activity using machine learning

The Trouble with Performance Estimation

Defining the error metric

Defining the labels

Challenges in features and learning

Model Building Tips

Code readability and reusability

Get a pair!

Productionizing machine learning models

Data Visualization at Square

Ian’s Thought Experiment

Data Visualization for the Rest of Us

Data Visualization Exercise

10. Social Networks and Data Journalism

Social Network Analysis at Morningside Analytics

Case-Attribute Data versus Social Network Data

Social Network Analysis

Terminology from Social Networks

Centrality Measures

The Industry of Centrality Measures

Thought Experiment

Morningside Analytics

How Visualizations Help Us Find Schools of Fish

More Background on Social Network Analysis from a Statistical Point of View

Representations of Networks and Eigenvalue Centrality

A First Example of Random Graphs: The Erdos-Renyi Model

A Second Example of Random Graphs: The Exponential Random Graph Model

Inference for ERGMs

Further examples of random graphs: latent space models, small-world networks

Data Journalism

A Bit of History on Data Journalism

Writing Technical Journalism: Advice from an Expert

11. Causality

Correlation Doesn’t Imply Causation

Asking Causal Questions

Confounders: A Dating Example

OK Cupid’s Attempt

The Gold Standard: Randomized Clinical Trials

A/B Tests

Second Best: Observational Studies

Simpson’s Paradox

The Rubin Causal Model

Visualizing Causality

Definition: The Causal Effect

Three Pieces of Advice

12. Epidemiology

Madigan’s Background

Thought Experiment

Modern Academic Statistics

Medical Literature and Observational Studies

Stratification Does Not Solve the Confounder Problem

What Do People Do About Confounding Things in Practice?

Is There a Better Way?

Research Experiment (Observational Medical Outcomes Partnership)

Closing Thought Experiment

13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation

Claudia’s Data Scientist Profile

The Life of a Chief Data Scientist

On Being a Female Data Scientist

Data Mining Competitions

How to Be a Good Modeler

Data Leakage

Market Predictions

Amazon Case Study: Big Spenders

A Jewelry Sampling Problem

IBM Customer Targeting

Breast Cancer Detection

Pneumonia Prediction

How to Avoid Leakage

Evaluating Models

Accuracy: Meh

Probabilities Matter, Not 0s and 1s

Choosing an Algorithm

A Final Example

Parting Thoughts

14. Data Engineering: MapReduce, Pregel, and Hadoop

About David Crawshaw

Thought Experiment

MapReduce

Word Frequency Problem

Enter MapReduce

Other Examples of MapReduce

What Can’t MapReduce Do?

Pregel

About Josh Wills

Thought Experiment

On Being a Data Scientist

Data Abundance Versus Data Scarcity

Designing Models

Mind the gap

Economic Interlude: Hadoop

A Brief Introduction to Hadoop

Cloudera

Back to Josh: Workflow

So How to Get Started with Hadoop?

15. The Students Speak

Process Thinking

Naive No Longer

Helping Hands

Your Mileage May Vary

Bridging Tunnels

Some of Our Work

16. Next-Generation Data Scientists, Hubris, and Ethics

What Just Happened?

What Is Data Science (Again)?

What Are Next-Gen Data Scientists?

Being Problem Solvers

Cultivating Soft Skills

Being Question Askers

Being an Ethical Data Scientist

Career Advice

Index

Colophon

Copyright
