Chapter 1 Introduction
It is the mark of a truly intelligent person to be moved by statistics – George Bernard Shaw
1.1 Book overview
Data science is a vague term. While everyone can agree on what a vet or a plumber does, an exact description of a data scientist is still elusive. My background is in statistics, and I see data science as the logical extension of applying statistical methods to large-scale data sets. However, my colleague comes from a computer science background, and their take turns that idea on its head: large data sets first, analytics second. Both of us are right, and both of us are wrong. This explains why data scientists are in such demand. The problems they face are complex, challenging (and interesting), and require skills from multiple disciplines.
The goal of this short book is to provide an introduction to the wonderful world of data science from the statistical point of view. In a traditional statistics course[^1], we would look at small data sets to illustrate key concepts. In a computing science course, we would start with a database and move on from there. This book straddles both camps: we use data sets that are small enough to be informative, but that still have real-world interest.
1.1.1 The trouble with ~~tibbles~~ data
The trouble with interesting data is that there are so many potential problems. For example, most real-world data sets have missing data. However, there are different types of missing data:
- Missing completely at random (MCAR): the data are missing independently of both observed and unobserved data. This is the easiest case to deal with. For example, suppose I ask your age and you flip a coin: if heads, you tell me your age; if tails, you refuse.
- Missing at random (MAR): given the observed data, the data are missing independently of the unobserved data. For example, Scottish participants may be more likely to refuse a survey on soccer excellence, but this does not depend on the level of their soccer skills.
- Missing not at random (MNAR): the missing observations are related to the values of the unobserved data. As a concrete example, a good predictor of whether someone survived the Titanic disaster was whether or not they had their age recorded.
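To make the first and last of these mechanisms concrete, here is a minimal R simulation (all numbers are invented for illustration): under MCAR the chance that a value goes missing is constant, while under MNAR it depends on the value itself.

```r
set.seed(42)
age <- round(rnorm(100, mean = 40, sd = 12))  # a toy 'age' variable

# MCAR: the coin-flip example - each value is missing with
# probability 0.5, independently of everything else
age_mcar <- ifelse(runif(100) < 0.5, NA, age)

# MNAR: older respondents are more likely to withhold their age,
# so missingness depends on the unobserved value itself
p_refuse <- plogis((age - 40) / 10)
age_mnar <- ifelse(runif(100) < p_refuse, NA, age)

mean(is.na(age_mcar))  # roughly 0.5
mean(is.na(age_mnar))  # also roughly 0.5, but concentrated among older ages
```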
A key skill in data science is being able to spot where potential issues lie. Once we know the problems, we can formulate a strategy for overcoming them. Fortunately, there are a few standard tools available to help us.
When a data scientist confronts a new data set, they have a powerful arsenal of statistical and machine learning algorithms at their disposal. However, we should always explore a new data set and try to understand its issues before throwing the machine learning kitchen sink at it.
This book targets people who work with data but don't come from a mathematics or statistics background. They know the names of the standard methods, but are a bit hazy on the details. We aim to fill in some of the blanks.
1.2 Talking about data
The quantities measured in a study are called random variables and a particular outcome is called an observation. A collection of observations is the data (or sample). The collection of all possible outcomes is the population. If we were interested in how long a customer spends on our website, the time spent on the website is our random variable. A particular customer's time would be an observation, and if we recorded the time for every person during the day, those times would be our data, which would form a sample from the population of all customers. As a shorthand we often use a capital letter (say \(X\)) to represent our random variable, and the lower case, with subscripts (e.g. \(x_1, x_2, \ldots\)), to represent observations in a general sense.
In practice it is difficult to observe whole populations, unless we are interested in a very limited population, e.g. just the customers who came to our website today. In reality we usually observe a subset of the population, and this raises questions about who or what goes into our sample; we will briefly consider various commonly used sampling techniques soon.
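As a small illustration of this notation, the following R sketch draws a simple random sample of \(n = 100\) visit times from a hypothetical population (the distribution and numbers are invented):

```r
set.seed(1)
# A hypothetical population: page-visit times (seconds) for 5,000 customers
population <- rexp(5000, rate = 1 / 120)

# A simple random sample: the observations x_1, ..., x_100
x <- sample(population, size = 100)

mean(x)           # sample mean - an estimate of...
mean(population)  # ...the population mean (around 120 seconds)
```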
Once we have our data, it is important to understand what type it is, so we can figure out exactly what to do with it. The two main categories are quantitative and qualitative (in other words, numeric and non-numeric, respectively). Quantitative data can be further subdivided into discrete and continuous. Discrete data moves in 'steps', for example whole numbers or shoe sizes. Continuous data can usually take any value, although there may be restrictions, e.g. to the set of positive numbers (for example height or weight), and there may be issues of accuracy depending on the equipment used to record the data (so sometimes continuous data might 'look' discrete!). There are other words that are often used within these categories, such as 'ordinal' and 'categorical'.
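In R these categories map naturally onto different vector types; a minimal sketch (all values invented):

```r
shoe_sizes <- c(8, 8.5, 9, 10)                    # discrete: moves in steps
heights    <- c(1.72, 1.81, 1.65)                 # continuous: any positive value
eye_col    <- factor(c("blue", "brown", "brown")) # qualitative (categorical)
rating     <- factor(c("low", "high", "medium"),
                     levels = c("low", "medium", "high"),
                     ordered = TRUE)              # ordinal: categories with an order
```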
1.2.1 Sample size
When considering data collection, it is important to ensure that the sample contains a sufficient number of members of the population for adequate analysis to take place. Larger samples will generally give more precise information about the population. Unfortunately, in reality, issues of expense and time tend to limit the size of the sample it is possible to take. For example, national opinion polls often rely on samples in the region of just \(1,000\).
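A standard back-of-envelope calculation explains why \(1,000\) is often enough: for an estimated proportion \(p\), the 95% margin of error is roughly \(1.96\sqrt{p(1-p)/n}\), which in the worst case (\(p = 0.5\)) gives about \(\pm 3\) percentage points:

```r
n <- 1000
p <- 0.5                       # worst case: maximises p * (1 - p)
1.96 * sqrt(p * (1 - p) / n)   # ~0.031, i.e. about +/- 3 percentage points
```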
1.3 Designing the experiment
Designing a suitable experiment isn't as easy as you might first think. For example, suppose we want to compare two adverts displayed on Facebook. An initial idea would be to display advert 1 on one day, followed by advert 2 the next. However, a number of confounding factors now become important. For example:
- Is traffic on Facebook the same on each day, e.g. is Friday different from Saturday?
- Is the week before Christmas the same as the week after Christmas?
- Was your advert on the same page as a more popular advert?
The easiest way to circumvent these issues is with randomised sampling. Essentially, instead of displaying each advert on a fixed schedule, we display each advert with some probability. The probabilities could be equal or, if you were trying a new and untested approach, you might display the standard advert with probability \(0.95\) and the new advert with probability \(0.05\).
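A minimal sketch of this randomised display in R, using the 95/5 split above (the function and advert names are invented for illustration):

```r
# Pick which advert a visitor sees, independently of day, page, etc.
choose_advert <- function() {
  sample(c("standard", "new"), size = 1, prob = c(0.95, 0.05))
}

# Simulate the adverts shown to ten visitors
replicate(10, choose_advert())
```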
1.4 Data sets
Throughout this book, we'll use a few different data sets to illustrate the key points. The data sets can be found in the GitHub repository for this book.
1.4.1 The relationship between beauty and teaching
This data set is from a study where researchers were interested in whether a lecturer's attractiveness affected their course evaluations! This is a cleaned version of the data set and contains the following variables:
- evaluation: the questionnaire result
- tenured: does the lecturer have tenure; 1 == Yes. Note that in R this column is stored as a numeric value, rather than a factor.
- minority: does the lecturer come from an ethnic minority (in the USA)
- age: the lecturer's age
- gender: a factor: Female or Male
- students: number of students in the class
- beauty: each of the lecturers’ pictures was rated by six undergraduate students: three women and three men.
The raters were told to use a 10 (highest) to 1 rating scale, to concentrate on the physiognomy of the professor in the picture, to make their ratings independent of age, and to keep 5 in mind as an average. The scores were then normalised.
This data set has 463 rows and 7 columns.
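Once you have downloaded the data, a quick sanity check in R might look like this (the file name beauty.csv is an assumption; use whatever name the repository gives it):

```r
# Assumed file name - adjust to match the book's GitHub repository
beauty <- read.csv("beauty.csv")
dim(beauty)  # should return 463 7
str(beauty)  # check how each column is stored, e.g. tenured is numeric
```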
1.4.2 Bond, James Bond
This data set contains information about James Bond movies. It covers 24 movies and has the following columns:
- Film: the movie name.
- Actor: the actor who played James Bond (to be clear, Sean Connery is clearly the best Bond).
- Kills: the number of kills by Bond.
- Relationships: the number of romantic relationships by Bond.
- Alcohol_Units: the number of units of alcohol consumed by Bond.
- Alcohol: a discretised version of Alcohol_Units (see the sketch after this list).
- Nationality: the nationality of the Bond actor.
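As an illustration of how a column like Alcohol can be derived from Alcohol_Units, here is a sketch using R's cut(); the breakpoints and labels are invented and may differ from those used in the actual data set:

```r
# Hypothetical units consumed in four films
units <- c(2, 7, 12, 20)

# Discretise into bins; the real cut-points may well differ
cut(units, breaks = c(0, 5, 10, Inf),
    labels = c("low", "medium", "high"))
```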
Credit: Thanks to Jack Prenter for collecting this data.
1.4.3 OKCupid
This data set was created using a Python script to pull data from public profiles on OkCupid. There are a total of 59,946 profiles, covering people within a 25-mile radius of San Francisco who were online between June 2011 and July 2012 and had at least one profile picture. Each user has 32 variables. The overall data set is around 150MB; while not "big", it is prohibitively large for teaching, so we've dropped the 11 variables relating to user essays. This results in an uncompressed data set of around 12MB.
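If you ever want to reproduce that trimming step from the full data, a sketch along these lines would work, assuming the full data is loaded as profiles and the essay columns share an "essay" prefix (both assumptions):

```r
# Drop the essay columns to shrink the data set for teaching
# (assumes the columns are named essay0, essay1, ...)
keep <- !grepl("^essay", names(profiles))
profiles_small <- profiles[, keep]
```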
Credit: Data and code from "OkCupid Profile Data for Introductory Statistics and Data Science Courses", Journal of Statistics Education, July 2015, Volume 23, Number 2; see the accompanying GitHub page.
1.4.4 Starbucks
The Starbucks data set contains nutritional information on 113 Starbucks products. The data set has six columns:
- Product
- Calories
- Fat
- Carb (carbohydrates)
- Fiber
- Protein
[^1]: Clearly there is some poetic licence here.