http://shop.oreilly.com/product/0636920028529.do

Doing Data Science

Table of Contents

Doing Data Science

Dedication

Preface

Motivation

Origins of the Class

Origins of the Book

What to Expect from This Book

How This Book Is Organized

How to Read This Book

How Code Is Used in This Book

Who This Book Is For

I was working on the Google+ data science team with an interdisciplinary team of PhDs. There was me (a statistician), a social scientist, an engineer, a physicist, and a computer scientist. We were part of a larger team that included talented data engineers who built the data pipelines, infrastructure, and dashboards, as well as built the experimental infrastructure (A/B testing). Our team had a flat structure. Together our skills were powerful, and we were able to do amazing things with massive datasets, including predictive modeling, prototyping algorithms, and unearthing patterns in the data that had huge impact on the product. – Doing Data Science

Supplemental Readings

Math

  • Linear Algebra and Its Applications by Gilbert Strang (Cengage Learning)
  • Convex Optimization by Stephen Boyd and Lieven Vandenberghe (Cambridge University Press)
  • A First Course in Probability (Pearson) and Introduction to Probability Models (Academic Press) by Sheldon Ross

Coding

  • R in a Nutshell by Joseph Adler (O’Reilly)
  • Learning Python by Mark Lutz and David Ascher (O’Reilly)
  • R for Everyone: Advanced Analytics and Graphics by Jared Lander (Addison-Wesley)
  • The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff (No Starch Press)
  • Python for Data Analysis by Wes McKinney (O’Reilly)

Data Analysis and Statistical Inference

  • Statistical Inference by George Casella and Roger L. Berger (Cengage Learning)
  • Bayesian Data Analysis by Andrew Gelman, et al. (Chapman & Hall)
  • Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill (Cambridge University Press)
  • Advanced Data Analysis from an Elementary Point of View by Cosma Shalizi (under contract with Cambridge University Press)
  • The Elements of Statistical Learning: Data Mining, Inference and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer)

Artificial Intelligence and Machine Learning

  • Pattern Recognition and Machine Learning by Christopher Bishop (Springer)
  • Bayesian Reasoning and Machine Learning by David Barber (Cambridge University Press)
  • Programming Collective Intelligence by Toby Segaran (O’Reilly)
  • Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig (Prentice Hall)
  • Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar (MIT Press)
  • Introduction to Machine Learning (Adaptive Computation and Machine Learning) by Ethem Alpaydin (MIT Press)

Experimental Design

  • Field Experiments by Alan S. Gerber and Donald P. Green (Norton)
  • Statistics for Experimenters: Design, Innovation, and Discovery by George E. P. Box, et al. (Wiley-Interscience)

Visualization

  • The Elements of Graphing Data by William Cleveland (Hobart Press)
  • Visualize This: The FlowingData Guide to Design, Visualization, and Statistics by Nathan Yau (Wiley)
  • The Visual Display of Quantitative Information by Edward Tufte (Graphics Press)

Prerequisites

Supplemental Reading

About the Contributors

Conventions Used in This Book

Using Code Examples

Safari Books Online

How to Contact Us

Acknowledgments

1. Introduction: What Is Data Science?

Big Data and Data Science Hype

Getting Past the Hype

Why Now?

Datafication

The Current Landscape (with a Little History)

Data Science Jobs

A Data Science Profile

Thought Experiment: Meta-Definition

OK, So What Is a Data Scientist, Really?

In Academia

In Industry

2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process

Statistical Thinking in the Age of Big Data

Statistical Inference

Populations and Samples

Populations and Samples of Big Data

Big Data Can Mean Big Assumptions

Can N=ALL?

Data is not objective

Modeling

What is a model?

Statistical modeling

But how do you build a model?

Probability distributions

Fitting a model

Overfitting

Exploratory Data Analysis

Philosophy of Exploratory Data Analysis

Exercise: EDA

Sample code

The Data Science Process

A Data Scientist’s Role in This Process

Thought Experiment: How Would You Simulate Chaos?

Case Study: RealDirect

How Does RealDirect Make Money?

Exercise: RealDirect Data Strategy

Sample R code

3. Algorithms

Machine Learning Algorithms

Three Basic Algorithms

Linear Regression

Start by writing something down

Fitting the model

Extending beyond least squares

Adding in modeling assumptions about the errors

Adding other predictors

Transformations

Review

Exercise

k-Nearest Neighbors (k-NN)

Example with credit scores

Similarity or distance metrics

Training and test sets

Pick an evaluation metric

Putting it all together

Choosing k

What are the modeling assumptions?

k-means

2D version

Exercise: Basic Machine Learning Algorithms

Solutions

Sample R code: Linear regression on the housing dataset

Sample R code: K-NN on the housing dataset

Summing It All Up

Thought Experiment: Automated Statistician

4. Spam Filters, Naive Bayes, and Wrangling

Thought Experiment: Learning by Example

Why Won’t Linear Regression Work for Filtering Spam?

How About k-nearest Neighbors?

Naive Bayes

Bayes Law

A Spam Filter for Individual Words

A Spam Filter That Combines Words: Naive Bayes

Fancy It Up: Laplace Smoothing

Comparing Naive Bayes to k-NN

Sample Code in bash

Scraping the Web: APIs and Other Tools

Jake’s Exercise: Naive Bayes for Article Classification

Sample R Code for Dealing with the NYT API

5. Logistic Regression

Thought Experiments

Classifiers

Runtime

You

Interpretability

Scalability

M6D Logistic Regression Case Study

Click Models

The Underlying Math

Estimating α and β

Newton’s Method

Stochastic Gradient Descent

Implementation

Evaluation

Media 6 Degrees Exercise

Sample R Code

6. Time Stamps and Financial Modeling

Kyle Teague and GetGlue

Timestamps

Exploratory Data Analysis (EDA)

Metrics and New Variables or Features

What’s Next?

Cathy O’Neil

Thought Experiment

Financial Modeling

In-Sample, Out-of-Sample, and Causality

Preparing Financial Data

Log Returns

Example: The S&P Index

Working out a Volatility Measurement

Exponential Downweighting

The Financial Modeling Feedback Loop

Why Regression?

Adding Priors

A Baby Model

Exercise: GetGlue and Timestamped Event Data

Exercise: Financial Data

7. Extracting Meaning from Data

William Cukierski

Background: Data Science Competitions

Background: Crowdsourcing

The Kaggle Model

A Single Contestant

Their Customers

Thought Experiment: What Are the Ethical Implications of a Robo-Grader?

Feature Selection

Example: User Retention

Filters

Wrappers

Selecting an algorithm

Selection criterion

In practice

Embedded Methods: Decision Trees

Entropy

The Decision Tree Algorithm

Handling Continuous Variables in Decision Trees

Random Forests

User Retention: Interpretability Versus Predictive Power

David Huffaker: Google’s Hybrid Approach to Social Research

Moving from Descriptive to Predictive

Social at Google

Privacy

Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?

8. Recommendation Engines: Building a User-Facing Data Product at Scale

A Real-World Recommendation Engine

Nearest Neighbor Algorithm Review

Some Problems with Nearest Neighbors

Beyond Nearest Neighbor: Machine Learning Classification

The Dimensionality Problem

Singular Value Decomposition (SVD)

Important Properties of SVD

Principal Component Analysis (PCA)

Theorem: The resulting latent features will be uncorrelated

Alternating Least Squares

Theorem with no proof: The preceding algorithm will converge if your prior is large enough

Fix V and Update U

Last Thoughts on These Algorithms

Thought Experiment: Filter Bubbles

Exercise: Build Your Own Recommendation System

Sample Code in Python

9. Data Visualization and Fraud Detection

Data Visualization History

Gabriel Tarde

Mark’s Thought Experiment

What Is Data Science, Redux?

Processing

Franco Moretti

A Sample of Data Visualization Projects

Mark’s Data Visualization Projects

New York Times Lobby: Moveable Type

Project Cascade: Lives on a Screen

Cronkite Plaza

eBay Transactions and Books

Public Theater Shakespeare Machine

Goals of These Exhibits

Data Science and Risk

About Square

The Risk Challenge

Detecting suspicious activity using machine learning

The Trouble with Performance Estimation

Defining the error metric

Defining the labels

Challenges in features and learning

Model Building Tips

Code readability and reusability

Get a pair!

Productionizing machine learning models

Data Visualization at Square

Ian’s Thought Experiment

Data Visualization for the Rest of Us

Data Visualization Exercise

10. Social Networks and Data Journalism

Social Network Analysis at Morningside Analytics

Case-Attribute Data versus Social Network Data

Social Network Analysis

Terminology from Social Networks

Centrality Measures

The Industry of Centrality Measures

Thought Experiment

Morningside Analytics

How Visualizations Help Us Find Schools of Fish

More Background on Social Network Analysis from a Statistical Point of View

Representations of Networks and Eigenvalue Centrality

A First Example of Random Graphs: The Erdos-Renyi Model

A Second Example of Random Graphs: The Exponential Random Graph Model

Inference for ERGMs

Further examples of random graphs: latent space models, small-world networks

Data Journalism

A Bit of History on Data Journalism

Writing Technical Journalism: Advice from an Expert

11. Causality

Correlation Doesn’t Imply Causation

Asking Causal Questions

Confounders: A Dating Example

OK Cupid’s Attempt

The Gold Standard: Randomized Clinical Trials

A/B Tests

Second Best: Observational Studies

Simpson’s Paradox

The Rubin Causal Model

Visualizing Causality

Definition: The Causal Effect

Three Pieces of Advice

12. Epidemiology

Madigan’s Background

Thought Experiment

Modern Academic Statistics

Medical Literature and Observational Studies

Stratification Does Not Solve the Confounder Problem

What Do People Do About Confounding Things in Practice?

Is There a Better Way?

Research Experiment (Observational Medical Outcomes Partnership)

Closing Thought Experiment

13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation

Claudia’s Data Scientist Profile

The Life of a Chief Data Scientist

On Being a Female Data Scientist

Data Mining Competitions

How to Be a Good Modeler

Data Leakage

Market Predictions

Amazon Case Study: Big Spenders

A Jewelry Sampling Problem

IBM Customer Targeting

Breast Cancer Detection

Pneumonia Prediction

How to Avoid Leakage

Evaluating Models

Accuracy: Meh

Probabilities Matter, Not 0s and 1s

Choosing an Algorithm

A Final Example

Parting Thoughts

14. Data Engineering: MapReduce, Pregel, and Hadoop

About David Crawshaw

Thought Experiment

MapReduce

Word Frequency Problem

Enter MapReduce

Other Examples of MapReduce

What Can’t MapReduce Do?

Pregel

About Josh Wills

Thought Experiment

On Being a Data Scientist

Data Abundance Versus Data Scarcity

Designing Models

Mind the gap

Economic Interlude: Hadoop

A Brief Introduction to Hadoop

Cloudera

Back to Josh: Workflow

So How to Get Started with Hadoop?

15. The Students Speak

Process Thinking

Naive No Longer

Helping Hands

Your Mileage May Vary

Bridging Tunnels

Some of Our Work

16. Next-Generation Data Scientists, Hubris, and Ethics

What Just Happened?

What Is Data Science (Again)?

What Are Next-Gen Data Scientists?

Being Problem Solvers

Cultivating Soft Skills

Being Question Askers

Being an Ethical Data Scientist

Career Advice

Index

Colophon

b/doing_a_data_science.txt · Last modified: 2018/02/02 02:30 by hkimscil