Correlation¶
Imports¶
import pandas as pd
import numpy as np # Used for the correlation coefficient
from matplotlib import pyplot as plt
What is correlation? ¶
Correlation is defined as “a mutual relationship or connection between two or more things”. We will use correlation in the sense of trying to better understand our data and problems. We will continue working today with the Boston housing prices dataset, boston.csv. Suppose you were asked the question “what factors influence the price of a house?” To answer this we would need to know the correlation between house price and the other factors. The simplest way to understand correlation is through scatterplots, so let’s draw some to get started.
df = pd.read_csv('data/boston.csv')
df.head()
|   | Median home value | Crime rate | % industrial | Nitrous Oxide concentration | Avg num rooms | % built before 1940 | Distance to downtown | Pupil-teacher ratio | % below poverty line |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 24000.0 | 0.00632 | 2.31 | 0.538 | 6.575 | 65.2 | 4.0900 | 15.3 | 4.98 |
| 1 | 21600.0 | 0.02731 | 7.07 | 0.469 | 6.421 | 78.9 | 4.9671 | 17.8 | 9.14 |
| 2 | 34700.0 | 0.02729 | 7.07 | 0.469 | 7.185 | 61.1 | 4.9671 | 17.8 | 4.03 |
| 3 | 33400.0 | 0.03237 | 2.18 | 0.458 | 6.998 | 45.8 | 6.0622 | 18.7 | 2.94 |
| 4 | 36200.0 | 0.06905 | 2.18 | 0.458 | 7.147 | 54.2 | 6.0622 | 18.7 | 5.33 |
df.plot(x='Crime rate',
        y='Median home value',
        kind='scatter',
        figsize=(12, 8),
        title='Crime rate vs house price')

ax = df.plot(x='% industrial',
             y='Median home value',
             kind='scatter',
             figsize=(12, 8),
             title='% industrial vs house price')

It looks like we’re copying and pasting the same code over and over. This is a good sign that we should write a function!
def make_scatterplot(df, x, y, figsize=(12, 8)):
    df.plot(x=x,
            y=y,
            kind='scatter',
            figsize=figsize,
            title=f'{x} vs {y}')
make_scatterplot(df, x='Nitrous Oxide concentration', y='Median home value')

make_scatterplot(df, x='Avg num rooms', y='Median home value')

How is correlation measured? ¶
This is good, but it’s slow. Also, while we can see that the average number of rooms seems to correlate with the price, what about the less clear-cut cases? Both nitrous oxide concentration and % industrial seem to have a weak correlation. Is one better than the other? Scatterplots are great, but they’re just a start. To dig deeper we want to understand what fraction of the variation in the y-value (median home value in this case) is explained by knowing the x-value (nitrous oxide concentration, % industrial, etc.). This is precisely what \(R^2\), the square of the Pearson correlation coefficient, tells us. Pandas has a helper method .corr() which computes the correlation coefficients for us.
df.corr()
|   | Median home value | Crime rate | % industrial | Nitrous Oxide concentration | Avg num rooms | % built before 1940 | Distance to downtown | Pupil-teacher ratio | % below poverty line |
|---|---|---|---|---|---|---|---|---|---|
| Median home value | 1.000000 | -0.388305 | -0.483725 | -0.427321 | 0.695360 | -0.376955 | 0.249929 | -0.507787 | -0.737663 |
| Crime rate | -0.388305 | 1.000000 | 0.406583 | 0.420972 | -0.219247 | 0.352734 | -0.379670 | 0.289946 | 0.455621 |
| % industrial | -0.483725 | 0.406583 | 1.000000 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.383248 | 0.603800 |
| Nitrous Oxide concentration | -0.427321 | 0.420972 | 0.763651 | 1.000000 | -0.302188 | 0.731470 | -0.769230 | 0.188933 | 0.590879 |
| Avg num rooms | 0.695360 | -0.219247 | -0.391676 | -0.302188 | 1.000000 | -0.240265 | 0.205246 | -0.355501 | -0.613808 |
| % built before 1940 | -0.376955 | 0.352734 | 0.644779 | 0.731470 | -0.240265 | 1.000000 | -0.747881 | 0.261515 | 0.602339 |
| Distance to downtown | 0.249929 | -0.379670 | -0.708027 | -0.769230 | 0.205246 | -0.747881 | 1.000000 | -0.232471 | -0.496996 |
| Pupil-teacher ratio | -0.507787 | 0.289946 | 0.383248 | 0.188933 | -0.355501 | 0.261515 | -0.232471 | 1.000000 | 0.374044 |
| % below poverty line | -0.737663 | 0.455621 | 0.603800 | 0.590879 | -0.613808 | 0.602339 | -0.496996 | 0.374044 | 1.000000 |
Note that this actually returns \(R\), not \(R^2\). We typically look at \(R^2\) because it has a nice interpretation (next paragraph).
This returns a matrix where the number in each cell is the correlation coefficient for those two columns. For instance, looking at the top row, we see that the correlation coefficient between “Median home value” and “Avg num rooms” is \(R = 0.695\), so \(R^2 = 0.695^2 \approx 0.483\): approximately 48.3% of the variation in the median home value in an area is explained by the average number of rooms in that area.
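Since .corr() gives us \(R\), a quick way to rank the other columns by how much of the home value they explain is to take the target column of the matrix and square it ourselves. A small sketch:

# Square the correlations with the target column to get R^2 values,
# then sort to rank the features by how much variation they explain
r2 = df.corr()['Median home value'] ** 2
r2.drop('Median home value').sort_values(ascending=False)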
make_scatterplot(df, x='Crime rate', y='Median home value')

We can see that as the crime rate increases the median home value drops, so the relationship is negative. This makes sense, since increased crime will likely lead to lower home prices.
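The sign of \(R\) captures exactly this direction. If you just want the single number for one pair of columns, the Series .corr() method returns it directly:

# Negative R: home values fall as the crime rate rises
# (this matches the -0.388 entry in the matrix above)
df['Crime rate'].corr(df['Median home value'])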
What about correlation coefficients that are close to zero? We can see that there seems to be a weak relationship (i.e. a small correlation coefficient) between “Pupil-teacher ratio” and “Nitrous Oxide concentration” (basically pollution). Let’s see what that looks like.
make_scatterplot(df, x='Pupil-teacher ratio', y='Nitrous Oxide concentration')

We can see that there’s no obvious relationship, which is what the low \(R^2\) value is suggesting.
Besides using Pandas, you can also compute the correlation coefficient using numpy. The function for this is np.corrcoef(...). The arguments are simply the two vectors you want to compute the correlation coefficient between. So for example, let’s compute the correlation coefficient between the “Pupil-teacher ratio” column and the “Nitrous Oxide concentration” column.
df.head()
|   | Median home value | Crime rate | % industrial | Nitrous Oxide concentration | Avg num rooms | % built before 1940 | Distance to downtown | Pupil-teacher ratio | % below poverty line |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 24000.0 | 0.00632 | 2.31 | 0.538 | 6.575 | 65.2 | 4.0900 | 15.3 | 4.98 |
| 1 | 21600.0 | 0.02731 | 7.07 | 0.469 | 6.421 | 78.9 | 4.9671 | 17.8 | 9.14 |
| 2 | 34700.0 | 0.02729 | 7.07 | 0.469 | 7.185 | 61.1 | 4.9671 | 17.8 | 4.03 |
| 3 | 33400.0 | 0.03237 | 2.18 | 0.458 | 6.998 | 45.8 | 6.0622 | 18.7 | 2.94 |
| 4 | 36200.0 | 0.06905 | 2.18 | 0.458 | 7.147 | 54.2 | 6.0622 | 18.7 | 5.33 |
np.corrcoef(df['Pupil-teacher ratio'], df['Nitrous Oxide concentration'])
array([[1.        , 0.18893268],
       [0.18893268, 1.        ]])
You can see that this returns a matrix, just like Pandas did. What we want is either of the two values off the diagonal. We’ll access one using [0, 1], which means “zeroth row, first column”, just like you would normally index a matrix.
np.corrcoef(df['Pupil-teacher ratio'], df['Nitrous Oxide concentration'])[0, 1]
0.18893267711276704
Just like Pandas, this returns \(R\), not \(R^2\) (Pandas actually uses numpy for its calculations). So we’ll square it using the Python exponentiation operator **.
np.corrcoef(df['Pupil-teacher ratio'], df['Nitrous Oxide concentration'])[0, 1]**2
0.035695556480997086
What correlation does (and does not) tell you ¶
Now that we can measure and visualize correlation, it’s worth thinking about what we do and don’t learn from it. What we learn is fairly obvious: we can find a relationship between two variables. For example, the scatterplot above (and the associated \(R^2\) value) shows us that as the average number of rooms increases, so does the median home value.
Some people think of a high \(R^2\) value as meaning that the data is close to a line. This is false. While it is true that an \(R\) value of \(\pm 1\) (i.e. \(R^2 = 1\)) means that the data lies on a perfect line, any other value does not necessarily reflect a line. For instance, consider the famous dataset Anscombe’s quartet:
# Four different datasets
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
# Create a figure with four axes, one for each dataset
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(16, 16))
# Plot each dataset and set its title to the R^2 value
ax1.scatter(x1, y1, color='blue')
ax1.set_title(f'R^2 = {np.corrcoef(x1, y1)[0, 1]**2:.3}')
ax2.scatter(x2, y2, color='red')
ax2.set_title(f'R^2 = {np.corrcoef(x2, y2)[0, 1]**2:.3}')
ax3.scatter(x3, y3, color='green')
ax3.set_title(f'R^2 = {np.corrcoef(x3, y3)[0, 1]**2:.3}')
ax4.scatter(x4, y4, color='purple')
ax4.set_title(f'R^2 = {np.corrcoef(x4, y4)[0, 1]**2:.3}')
fig.suptitle("Anscombe's quartet");

You can see that all four datasets have essentially the same (fairly high) \(R^2\) of approximately 0.666 (i.e. \(R \approx 0.816\)). However, all but the first one have strange patterns. By looking at the graphs you can easily spot the patterns, but if you rely only on the correlation coefficient you miss them. So in two dimensions you should rely on your eyes, not just on a number.
Multiple correlation ¶
When working with just two variables, the correlation coefficient is only somewhat useful. After all, we can just look at the graph and form a better conclusion than any single number could give us. But what about when we want to understand the relationship between three or more variables? For three we can draw a 3D plot, as below:
from mpl_toolkits.mplot3d import Axes3D # Used for 3d plotting
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['Pupil-teacher ratio'], df['Avg num rooms'], df['Median home value'])
ax.set_xlabel('Pupil-teacher ratio')
ax.set_ylabel('Avg num rooms')
ax.set_zlabel('Median home value');

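Beyond three variables we run out of spatial dimensions. One common fallback is a scatter matrix, which draws every pairwise scatterplot at once; Pandas provides this out of the box:

# One scatterplot per pair of columns, with histograms on the diagonal
pd.plotting.scatter_matrix(df, figsize=(16, 16));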
Joining multiple DataFrames ¶
Oftentimes your data will be split across multiple files. Combining these into one DataFrame is referred to as “joining” your data. While we will be doing this in Pandas, the same idea is extremely common in databases; if you learn how to do it here you will have a great start on joining data in databases as well.
There are many ways to join data, but we will focus on two of them: concatenating and merging.
Concatenating data¶
Concatenating data refers to simply stacking two DataFrames on top of one another. For example, let’s create some fake data for two different math courses.
course1_df = pd.DataFrame({'students': ['Alice', 'Bob', 'Cathy', 'David'], 'final_exam_score': [90, 84, 79, 50]})
course2_df = pd.DataFrame({'students': ['Esther', 'Frank', 'George', 'Heather', 'Ingrid'], 'final_exam_score': [52, 65, 91, 100, 87]})
course1_df
|   | students | final_exam_score |
|---|---|---|
| 0 | Alice | 90 |
| 1 | Bob | 84 |
| 2 | Cathy | 79 |
| 3 | David | 50 |
course2_df
|   | students | final_exam_score |
|---|---|---|
| 0 | Esther | 52 |
| 1 | Frank | 65 |
| 2 | George | 91 |
| 3 | Heather | 100 |
| 4 | Ingrid | 87 |
Suppose we want to find the average across all courses. We could compute the average in each course separately, then compute the average of those two averages (being careful to weight them by the number of students). Rather than doing that, let’s just concatenate these two DataFrames into one new DataFrame and compute the average there. We can do that with pd.concat(). You simply give it a list of the DataFrames you want to concatenate; we’re doing it with two of them, but you could do it with as many as you like.
all_courses = pd.concat([course1_df, course2_df])
all_courses
|   | students | final_exam_score |
|---|---|---|
| 0 | Alice | 90 |
| 1 | Bob | 84 |
| 2 | Cathy | 79 |
| 3 | David | 50 |
| 0 | Esther | 52 |
| 1 | Frank | 65 |
| 2 | George | 91 |
| 3 | Heather | 100 |
| 4 | Ingrid | 87 |
all_courses['final_exam_score'].mean()
77.55555555555556
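One thing to notice above: each DataFrame kept its original row labels, so the index now contains duplicates (two 0s, two 1s, and so on). If you want a fresh 0 to n-1 index instead, pass ignore_index=True:

# Rebuild the index from scratch instead of keeping each course's labels
pd.concat([course1_df, course2_df], ignore_index=True)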
Merging¶
Merging data refers to taking two (or more) DataFrames which have some relationship and joining them together. Let’s continue with our above example of two different courses. Suppose the university tracked your grade across different courses, and wanted to make one new dataset showing every course you have taken. In the above (fake) data there are no students shared between course1 and course2. So let’s make a new course which has some of those students (perhaps course1 and course2 are just different sections of the same course, whereas course3 below is a whole new course).
course3_df = pd.DataFrame({'students': ['Alice', 'Esther', 'George', 'Bob', 'Jacob'], 'final_exam_score': [87, 45, 98, 55, 27]})
course3_df
|   | students | final_exam_score |
|---|---|---|
| 0 | Alice | 87 |
| 1 | Esther | 45 |
| 2 | George | 98 |
| 3 | Bob | 55 |
| 4 | Jacob | 27 |
To merge the data we use the Pandas .merge() function. There are actually multiple ways to use this function; we’ll show two of them below. Just pick the one you feel makes the most sense and stick with it.
# Method 1
course13_merge1_df = pd.merge(course1_df, course3_df, on='students')
# Method 2
course13_merge2_df = course1_df.merge(course3_df, on='students')
course13_merge1_df
|   | students | final_exam_score_x | final_exam_score_y |
|---|---|---|---|
| 0 | Alice | 90 | 87 |
| 1 | Bob | 84 | 55 |
course13_merge2_df
|   | students | final_exam_score_x | final_exam_score_y |
|---|---|---|---|
| 0 | Alice | 90 | 87 |
| 1 | Bob | 84 | 55 |
As you can see, these two methods produced the same results. Because of that I will just stick with the first method. Anything we learn for one method applies directly to the other method.
Looking at the results above, we can see that the merged DataFrame has the grades for both Alice and Bob in their two courses. However, there are a few things to discuss:

- Why does it only have Alice and Bob? What about the other students?
- Why are the columns called final_exam_score_x/y?
There are many types of merges (called “joins” when working with databases, but the ideas are identical); we will focus on four of them:

- Inner join: Only take those rows which exist in both DataFrames. This is the default behavior for .merge(). So we only took Alice and Bob because those are the only two students that exist in both course1_df and course3_df (look at the DataFrames to verify this).
- Left join: Take all rows from the “left” DataFrame, and only take the rows from the right DataFrame which also exist in the left. See the example below.
- Right join: Exactly the same as the “left join”, but with “left” and “right” swapped.
- Outer/full join: Take all rows from both DataFrames.
Here is a diagram showing these:
You specify each of these using the how parameter of the .merge() function. Let’s show the result of each.
course13_inner_df = pd.merge(course1_df, course3_df, how='inner', on='students')  # how='inner' is the default, so specifying it is optional
course13_left_df = pd.merge(course1_df, course3_df, how='left', on='students')
course13_right_df = pd.merge(course1_df, course3_df, how='right', on='students')
course13_full_df = pd.merge(course1_df, course3_df, how='outer', on='students')
An inner join is exactly what we did above; this time we just explicitly wrote how='inner', which doesn’t change anything.
course13_inner_df
|   | students | final_exam_score_x | final_exam_score_y |
|---|---|---|---|
| 0 | Alice | 90 | 87 |
| 1 | Bob | 84 | 55 |
Notice that we have all students from the “left” course course1_df (it’s the “left” course because it’s literally on the left when you write pd.merge(course1_df, course3_df, how='left', ...)).
course13_left_df
|   | students | final_exam_score_x | final_exam_score_y |
|---|---|---|---|
| 0 | Alice | 90 | 87.0 |
| 1 | Bob | 84 | 55.0 |
| 2 | Cathy | 79 | NaN |
| 3 | David | 50 | NaN |
The right join is just the opposite of the left join: it takes all students in the “right” course course3_df.
course13_right_df
|   | students | final_exam_score_x | final_exam_score_y |
|---|---|---|---|
| 0 | Alice | 90.0 | 87 |
| 1 | Bob | 84.0 | 55 |
| 2 | Esther | NaN | 45 |
| 3 | George | NaN | 98 |
| 4 | Jacob | NaN | 27 |
Finally, the outer join. This shows all students from each class.
course13_full_df
|   | students | final_exam_score_x | final_exam_score_y |
|---|---|---|---|
| 0 | Alice | 90.0 | 87.0 |
| 1 | Bob | 84.0 | 55.0 |
| 2 | Cathy | 79.0 | NaN |
| 3 | David | 50.0 | NaN |
| 4 | Esther | NaN | 45.0 |
| 5 | George | NaN | 98.0 |
| 6 | Jacob | NaN | 27.0 |
Note that for everything except the inner join, there are students who appear in only one of the two classes. In that case, there is no grade to put for them in the other course, because they weren’t in it!
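Rather than eyeballing the table, you can count these missing grades directly: .isna() marks the missing cells and .sum() tallies them per column.

# Number of missing values in each column of the outer join
course13_full_df.isna().sum()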
Let’s now turn to the next question, which is the column names. If Pandas sees two columns with the same name (such as final_exam_score, which is a column in both course1_df and course3_df) it has to rename them, since Pandas won’t allow a DataFrame to have two columns with the same name. Therefore it appends _x or _y to the end of the name. If you don’t like those names, just rename them.
course13_full_df.columns = ['students', 'course1_final_exam_score', 'course3_final_exam_score']
course13_full_df
|   | students | course1_final_exam_score | course3_final_exam_score |
|---|---|---|---|
| 0 | Alice | 90.0 | 87.0 |
| 1 | Bob | 84.0 | 55.0 |
| 2 | Cathy | 79.0 | NaN |
| 3 | David | 50.0 | NaN |
| 4 | Esther | NaN | 45.0 |
| 5 | George | NaN | 98.0 |
| 6 | Jacob | NaN | 27.0 |
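As an aside, rather than renaming after the fact, .merge() also accepts a suffixes parameter that labels the overlapping columns for you at merge time:

# Attach meaningful suffixes to the overlapping column names during the merge
pd.merge(course1_df, course3_df, how='outer', on='students',
         suffixes=('_course1', '_course3'))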
These examples give you a good idea of how you can join DataFrames, but they are also simplified. Consider the following more complicated example of students, a course, and a club.
students_df = pd.DataFrame({'student_id': [1, 2, 3, 4, 5, 6],
                            'name': ['Alice', 'Bob', 'Cathy', 'Bob', 'Derek', 'Esther']})
course_df = pd.DataFrame({'student_id_number': [3, 5, 8, 13, 59, 1, 4],
                          'grade': [65, 74, 24, 74, 86, 100, 52]})
club_df = pd.DataFrame({'club_name': ['Math', 'Chess', 'Waffle'],
                        'club_officer_student_id': [2, 6, 19]})
students_df
|   | student_id | name |
|---|---|---|
| 0 | 1 | Alice |
| 1 | 2 | Bob |
| 2 | 3 | Cathy |
| 3 | 4 | Bob |
| 4 | 5 | Derek |
| 5 | 6 | Esther |
course_df
|   | student_id_number | grade |
|---|---|---|
| 0 | 3 | 65 |
| 1 | 5 | 74 |
| 2 | 8 | 24 |
| 3 | 13 | 74 |
| 4 | 59 | 86 |
| 5 | 1 | 100 |
| 6 | 4 | 52 |
club_df
|   | club_name | club_officer_student_id |
|---|---|---|
| 0 | Math | 2 |
| 1 | Chess | 6 |
| 2 | Waffle | 19 |
What we want is to make a single DataFrame with all students, along with their grade in the course (or None/NaN if they were not in the course). Finally, if they’re a club officer we want to indicate which club it’s for (or None/NaN if they are not an officer).
Since we want all students, we could do something like pd.merge(students_df, some_other_df, how='left'). Since students_df is on the left and we are doing how='left' (a left join), this will keep all students. Let’s start by merging the students with the course. Both DataFrames have a column with the student id, but the columns are named differently (student_id vs student_id_number). While we could rename them to match, that messes with our data and is unnecessary. Instead, Pandas has left_on=... and right_on=... options which let you specify the join column for the left and right DataFrames separately (for exactly this situation, where the column names differ).
merged_df = pd.merge(students_df, course_df, how='left', left_on='student_id', right_on='student_id_number')
merged_df
|   | student_id | name | student_id_number | grade |
|---|---|---|---|---|
| 0 | 1 | Alice | 1.0 | 100.0 |
| 1 | 2 | Bob | NaN | NaN |
| 2 | 3 | Cathy | 3.0 | 65.0 |
| 3 | 4 | Bob | 4.0 | 52.0 |
| 4 | 5 | Derek | 5.0 | 74.0 |
| 5 | 6 | Esther | NaN | NaN |
We can see that all students are included (since we did a left join and students_df was on the left). We can also see that if the student was in the course then their grade shows up, and NaN shows up otherwise. Before we merge in the club as well, let’s clean up some things. We don’t need both student_id and student_id_number, so let’s drop the student_id_number column.
merged_df = merged_df.drop('student_id_number', axis='columns')
merged_df
|   | student_id | name | grade |
|---|---|---|---|
| 0 | 1 | Alice | 100.0 |
| 1 | 2 | Bob | NaN |
| 2 | 3 | Cathy | 65.0 |
| 3 | 4 | Bob | 52.0 |
| 4 | 5 | Derek | 74.0 |
| 5 | 6 | Esther | NaN |
Let’s now turn to merging the club. Again, we want to keep all students, but just merge in any club data we have for each one. So we’ll do a left join with merged_df on the left, so that we keep everything we already did. Both DataFrames again have a student id column with different names, so we’ll specify them using left_on and right_on.
merged_df = pd.merge(merged_df, club_df, how='left', left_on='student_id', right_on='club_officer_student_id')
merged_df
|   | student_id | name | grade | club_name | club_officer_student_id |
|---|---|---|---|---|---|
| 0 | 1 | Alice | 100.0 | NaN | NaN |
| 1 | 2 | Bob | NaN | Math | 2.0 |
| 2 | 3 | Cathy | 65.0 | NaN | NaN |
| 3 | 4 | Bob | 52.0 | NaN | NaN |
| 4 | 5 | Derek | 74.0 | NaN | NaN |
| 5 | 6 | Esther | NaN | Chess | 6.0 |
Again, we don’t need the duplicate student IDs, so let’s drop the one coming from club_df.
merged_df = merged_df.drop('club_officer_student_id', axis='columns')
merged_df
|   | student_id | name | grade | club_name |
|---|---|---|---|---|
| 0 | 1 | Alice | 100.0 | NaN |
| 1 | 2 | Bob | NaN | Math |
| 2 | 3 | Cathy | 65.0 | NaN |
| 3 | 4 | Bob | 52.0 | NaN |
| 4 | 5 | Derek | 74.0 | NaN |
| 5 | 6 | Esther | NaN | Chess |
And we’re done! We have all students along with their id number, grade in the course (if they took it), and club they are an officer of (if they are one). These examples should give you a great starting point for joining together data.
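As a closing sketch, the same operations can be written as one chained expression (identical result, just more compact):

# The full pipeline in one chained expression: two left joins,
# dropping the redundant id column after each one
final_df = (
    students_df
    .merge(course_df, how='left', left_on='student_id', right_on='student_id_number')
    .drop('student_id_number', axis='columns')
    .merge(club_df, how='left', left_on='student_id', right_on='club_officer_student_id')
    .drop('club_officer_student_id', axis='columns')
)
final_df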
Exercises¶
In the data folder is a file texas_house_2018.csv, which shows the results of the 2018 Texas House of Representatives elections. It shows the district, the number of votes each candidate received, the money they raised, and more. The Total_Votes column is the total number of votes cast for that position across all candidates; for instance, 60,279 votes were cast for District 4, of which 44,669 went to Gregory Keith Bell. The Total_$ column is the amount of money that candidate raised. Use this dataset to explore the following questions:
- How many rows have a missing value? Drop any rows with a missing value.
- What is the data type for each column once you load the data using Pandas? Fix any data types that are incorrectly loaded.
- Change all column names so that they are lower-cased (no upper-case letters).
- Check the columns and see if the min and max values make sense. Did anyone get negative votes? 120% of the votes?
- How well does the total amount of money raised correlate with how many votes a candidate got?
- How well does the total amount of money raised correlate with the percentage of votes they got?
- Which of these is the “better” comparison? Should we be comparing money raised against the number of votes or the percentage of votes? Or something else? Why?
- Make graphs trying to understand which factors are important. How important is being a Republican vs a Democrat vs third-party? What about incumbent vs challenger?
- Which districts seem to have the most money? That is, which districts have candidates raising the most money? The least?
- Can you find any strange correlations? What about the number of letters in a person’s name? Incumbents who did terribly?
In the data folder is another folder retail_sales. It contains three different datasets, all with a store column. Try merging them in various ways (left, right, outer, inner) along the store column and see what you get.