First steps with Pandas and Jupyter notebooks

Jupyter notebooks

What you’re looking at right now is what’s called a Jupyter notebook. When you took your programming course, you probably wrote your code in a text editor and then ran it from the command line. Modern data scientists typically start their work in Jupyter notebooks. A notebook is a collection of code cells (like the one at the top of this notebook) which run Python code, and Markdown cells (like this one!) which have text.

If you’ve ever used Mathematica or a similar math system, this is a very similar setup. Rather than writing your code in a text editor, switching to a console to run it, and then coming back to change it, everything happens in one place.

Jupyter notebook cheat-sheet

To edit a cell, click on it (if it’s a Code cell), or double-click on it (if it’s a Markdown cell). You can change a cell between Code and Markdown by selecting the cell, then using the dropdown at the top of this worksheet. You can also click to the left of the cell and hit Y on your keyboard for Code, or M for Markdown. To run a cell, select it, then hit Shift-Enter. The output will appear below it. Not all commands generate output, so the space below a cell may be blank.

Cells are just containers for tiny Python programs. So you can do something like 3+2 and run it with Shift-Enter and get an output of 5. You can also write if statements, define new functions, classes, etc. Anything that works in Python works here too.

Note

You’ll need to either download this notebook or run it in Binder in order to execute cells!

# This is a code cell. Click in it to edit it.
# Try running some small Python code here (like adding two numbers).
# Then, hit Shift-Enter to run it. You should see the output below this cell.

Markdown cheat-sheet

Markdown is a way to type pretty text using bold, italics, links, code blocks and more, without needing to use a word processor like Microsoft Word. There is a little bit of memorization, but it’s not too bad. Below are examples of the most common things you’ll use. For a more detailed list, check out this cheat sheet.

Double-click on this cell to see a breakdown of how I accomplished all these things.

This is italic

This is bold

This is a big header

This is a slightly smaller header

You can pretty much go as small as you want

  • You can also make a list by using dashes

  • You can make as many as you want

  1. If you want numbers instead

  2. Just use numbers

You can also make links as follows: The title for your link
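If you’re reading a static copy and can’t double-click this cell, here is (roughly) the Markdown source behind the examples above. The link URL is just a placeholder:

*This is italic*

**This is bold**

# This is a big header

## This is a slightly smaller header

###### You can pretty much go as small as you want

- You can also make a list by using dashes
- You can make as many as you want

1. If you want numbers instead
2. Just use numbers

You can also make links as follows: [The title for your link](https://www.example.com)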

If you want to type something that uses Python, and you want to remember that this is a Python command (as opposed to just some text), it’s good practice to surround it with backticks. Backticks are the thing in the top-left of your keyboard, on the same key as the tilde ~.

As an example, suppose you were writing something up showing someone how to use the print command in Python. You could write the following:

“In order to print something in Python, use the print command. For example, you could teach someone to print a sum by telling them to type print(3+2).”

See how surrounding it with backticks made it look different? It makes it clear that this is a Python command, and not just some random text.

Jupyter notebook best practices

  • Always title your notebook.

  • Break your code into sections. Start by doing your imports, then your next section might be loading the data, and so forth. Title each section using Markdown, and give each section a description of what you’re doing there.

  • Keep your code clean. It’s okay to create new cells to test things, but once your testing is done and those cells are no longer needed, delete them. Go back to old working cells and clean them up. Write the code nicer, put a Markdown cell describing what’s going on, etc.

  • Leave yourself notes about what you’ve done. Right now you remember what you’ve done, but you won’t remember it a month from now.

  • If you find yourself copying and pasting some code over and over, make it into a function and call that function instead.

Python

The following is a brief review of Python. Use it to refresh your memory and to help you with the homework.

Data types

In Python, the most common ways to represent data (called data types) are as one of the following:

  • int: An integer (whole number, positive or negative)

  • float: A decimal

  • str: A string (characters, such as “abc”)

In order to see what data type something is, you can just type type(my_thing).

type('abc')
str
type(123)
int
type(123.45)
float

Data structures

When you want to store data, there are several ways to do it. The simplest way is just to store it in a variable. You can do this by typing my_variable = some_value. In a Jupyter Notebook (the thing you’re working in right now), you don’t necessarily need to type print if you want to look at a variable. You can just type the variable name and run the cell.

x = 3
x
3

Since I put the x as the last thing in the cell, Jupyter printed it out for me to see what it is.

Often you will need to store several variables/pieces of data in an organized way. The most common ways to do this are the following data structures:

  • list: A list of values. Lists are created using brackets ([ and ]).

  • set: Similar to a list, but there is no order, and each object shows up at most once. Created using curly braces ({ and }).

  • dict: A dictionary. This stores things with a key and a value. Dictionaries are created using curly braces.

my_list = [1, 1, 2, 3, 'a', 'b', 'c', 'c']
# The set and list look the same at first, but once you print them you will see the difference.
my_set = {1, 1, 2, 3, 'a', 'b', 'c', 'c'} 
my_dict = {'last_name': 'Savala', 'age': 36, 'favorite_number': 3.14}
my_list
[1, 1, 2, 3, 'a', 'b', 'c', 'c']
my_set
{1, 2, 3, 'a', 'b', 'c'}
my_dict
{'last_name': 'Savala', 'age': 36, 'favorite_number': 3.14}
my_dict['age']
36

Control structures

Often you want to control how some events happen. This could mean things like if today is Saturday then don’t do homework, or something like for each number in this list, check if it’s prime. These are called control structures. The most common control structures in Python are if/elif/else statements, and for loops (there are more!).

if/elif/else structures work just like you think. Remember to put a colon : after each one, and indent everything that happens in each part.

x = 5
if x < 3:
    print('Somehow, 5 is less than 3')
elif x > 7:
    print('How is it that 5 is greater than 7?')
    print('That makes no sense!')
else:
    print('Whew, guess everything is okay')
Whew, guess everything is okay

for loops take some structure (often a list) and iterate through it. This is useful when you want to work with every element in a list.

my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for x in my_list:
    print(x**2)
1
4
9
16
25
36
49
64
81
100

Functions

Defining functions

Writing code as functions is incredibly important in any programming language. This should seem familiar to math people, as we deal with functions all the time! In Python, write a function using the def command. You then put the name of the function, and in parentheses all the inputs the function takes. Don’t forget to include a colon : at the end, and to indent everything inside the function. At the end you can (but don’t have to) use the return statement to return some value.

# A function with no return statement
def add_and_print(x, y):
    print(x + y)
add_and_print(3, 4)
7
# A function with a return statement
def add_and_return(x, y):
    return x + y
add_and_return(3, 4)
7

These two look the same, but there’s an important difference. If we return a value, then we can store that value and use it later. If we simply print the value, that’s not possible. Example:

z = add_and_return(3, 4)
print(z)
print(2 * z)
print('Now I can manipulate z however I want')
7
14
Now I can manipulate z however I want

Warning

An error is coming!

# We will get an error because add_and_print doesn't return anything, 
# so the program has no idea what value to save to z.
z = add_and_print(3, 4)
z = 2 * z
z
7
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-950ae8e0bd90> in <module>
      2 # so the program has no idea what value to save to z.
      3 z = add_and_print(3, 4)
----> 4 z = 2 * z
      5 z

TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

Calling functions

Functions have arguments, which are simply the inputs. When you define the function all the arguments must have names. That’s why we used x and y when we defined add_and_print above. Here’s another example:

def my_name(first_name, middle_name, last_name):
    print('Your first name is ' + first_name)
    print('Your middle name is ' + middle_name)
    print('Your last name is ' + last_name)

When you call this function, you can choose to use those argument names or not. That is, either of the following two are exactly the same:

my_name(first_name='Paul', middle_name='Awesome', last_name='Savala')
Your first name is Paul
Your middle name is Awesome
Your last name is Savala
my_name('Paul', 'Awesome', 'Savala')
Your first name is Paul
Your middle name is Awesome
Your last name is Savala

Note that in the first one we directly wrote first_name="Paul", ..., while in the second we simply typed the values in. If you do not use the argument names, you must type the arguments in the order the function definition lists them: Python assumes the first value you pass is the first argument (first_name), the second is the second argument (middle_name), and so forth. If you do supply the argument names, then the order doesn’t matter. Here are some examples:

# This is wrong!
my_name('Savala', 'Awesome', 'Paul')
Your first name is Savala
Your middle name is Awesome
Your last name is Paul
# This works though
my_name(last_name='Savala', middle_name='Awesome', first_name='Paul')
Your first name is Paul
Your middle name is Awesome
Your last name is Savala

Either way is just fine; it’s totally up to you. If your function has many arguments, then it’s good practice to include the argument names when you call it. That way the person reading your work isn’t trying to remember or guess what the seventh argument in a big list of arguments is. But if it’s just one or two, then omitting the argument names is probably just fine.

Default values

In a function you can supply default values. For instance, not everyone has a middle name, so in the function above we would like middle_name to be optional. We can do this by setting a default value of, say, None. To do this, simply put the default value there when you define the function:

Warning

Here comes an error!

def my_name(first_name, middle_name=None, last_name):
    print('Your first name is ' + first_name)
    if middle_name is not None:
        print('Your middle name is ' + middle_name)
    print('Your last name is ' + last_name)
  File "<ipython-input-29-87c0b4110724>", line 1
    def my_name(first_name, middle_name=None, last_name):
                ^
SyntaxError: non-default argument follows default argument

As you can see from the error, arguments without a default value (like first_name and last_name in our example) need to come first, then arguments with default values must come last. So let’s fix that.

def my_name(first_name, last_name, middle_name=None):
    print('Your first name is ' + first_name)
    if middle_name is not None:
        print('Your middle name is ' + middle_name)
    print('Your last name is ' + last_name)
my_name(first_name='Paul', last_name='Savala')
Your first name is Paul
Your last name is Savala

f-strings

If you want to print text, but also include a variable, there are three ways to do it.

# Method 1 (the old "format" way)
my_age = 36
my_name = 'Paul'

'My name is {name} and I am {age} years old'.format(name=my_name, age=my_age)
'My name is Paul and I am 36 years old'
# Method 2 (use +)
'My name is ' + my_name + ' and I am ' + str(my_age) + ' years old'
'My name is Paul and I am 36 years old'

Both of these methods work, but they are ugly and slow. In addition, for “method 2” to work, everything has to be a string. So if you removed the str(...) around my_age it would give an error.

A much cleaner way which was introduced with Python version 3.6 is called f-strings (the “f” stands for “format”). All you need to do is put an f in front of your string. Then, in the curly braces write the name of your variable. Here’s an example:

# Note the "f" in front of the string. That tells Python this is an f-string.
f'My name is {my_name} and I am {my_age} years old'
'My name is Paul and I am 36 years old'

f-strings are very useful for printing all kinds of variables, including numbers. However, floats with many digits after the decimal can print out annoyingly long. For instance, consider the following example:

f'If a pizza costs $22 and is divided between 7 people, then each person pays ${22/7}'
'If a pizza costs $22 and is divided between 7 people, then each person pays $3.142857142857143'

Do we really need to see how much each person pays down to the billionth of a penny? We would like just two decimal places. We can do this by including a format at the end. The one we will use the most often is .2f. The dot “.” refers to the decimal point. The “2” refers to two places after the decimal. The “f” tells Python to print the number in plain fixed-point notation (if you don’t put this it will sometimes print the number in scientific notation, which is probably not what you wanted). To tell Python you want to use a format, just put a colon “:” after your computation, and then put the format you want. For example:

# Notice the colon and format specifier after the calculation
f'If a pizza costs $22 and is divided between 7 people, then each person pays ${22/7:.2f}'
'If a pizza costs $22 and is divided between 7 people, then each person pays $3.14'

Range

Suppose I have a list of 5 elements, and I want to know which ones are odd. So for instance, if my_list = [4, 1, 1, 8, 9], I want to know that the second, third and fifth elements are odd. Remember that in Python indices start counting at zero, so indices_with_odd_elements = [1, 2, 4]. The easiest way to access the indices in a list is to use the range function. By default, a range is its own type of object, so in order to better see what is going on, it is useful to convert it to a list.

range(10)
range(0, 10)
list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
list(range(1, 10))
[1, 2, 3, 4, 5, 6, 7, 8, 9]
list(range(1, 10, 2))
[1, 3, 5, 7, 9]
my_list = [4, 1, 1, 8, 9]
for i in range(len(my_list)):
    print(f'The index is {i} and the value is {my_list[i]}')
The index is 0 and the value is 4
The index is 1 and the value is 1
The index is 2 and the value is 1
The index is 3 and the value is 8
The index is 4 and the value is 9

List comprehension

One of the most useful and important ideas in Python is list comprehension. List comprehension allows you to modify a list element-by-element, without needing to use a for-loop. Here is an example where we add one to each number in a list. We first show how we would do it with a for-loop, and then show the better way (using list comprehension). In both cases we create a new list add_one_list, which is the original list with one added to each element.

my_list = [1, 5, 6, 2, 7, 5, 8, 2]
add_one_list = []
# This setup is slow and ugly
for i in range(len(my_list)):
    add_one_list.append(my_list[i] + 1)

add_one_list
[2, 6, 7, 3, 8, 6, 9, 3]

A better way is to use list comprehension. To do list comprehension, you simply write [what_I_want_to_do for x in my_list]. For example:

my_list = [1, 5, 6, 2, 7, 5, 8, 2]

add_one_list = [x + 1 for x in my_list]

add_one_list
[2, 6, 7, 3, 8, 6, 9, 3]

It’s worth noting that Python has “comprehension” for things beyond lists. You can also do it for dictionaries and sets.

my_list = ['a', 'b', 'c', 1, 2, 3]

int_str_dict = {'str': [x for x in my_list if type(x) == str], 
                'int': [x for x in my_list if type(x) == int]}

int_str_dict
{'str': ['a', 'b', 'c'], 'int': [1, 2, 3]}
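Note that the example above still uses list comprehensions (the brackets inside the braces); the dictionary itself was written out by hand. Python also has true dictionary and set comprehensions, where the braces do the work. A quick sketch:

# A dict comprehension: map each number to its square
{x: x**2 for x in [1, 2, 3]}
{1: 1, 2: 4, 3: 9}
# A set comprehension: collect each word's first letter (duplicates collapse automatically)
# This prints something like {'a', 'b'} -- sets have no guaranteed order
{word[0] for word in ['apple', 'avocado', 'banana']}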

You should try to use list (or other types of) comprehension as much as possible. They are usually faster than the equivalent for-loop, and are generally easier to read. This will help greatly when you work with datasets with hundreds of thousands of rows of data.

Using packages

While Python comes with many functions built in, it doesn’t have everything you could ever want. For example, let’s say we wanted to compute the sine of an angle. We might try something like sine(1), or sin(1) (I’m using radians here). Let’s see what happens.

Warning

Here comes an error!

sine(1)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-45-1d306da0da99> in <module>
----> 1 sine(1)

NameError: name 'sine' is not defined
sin(1)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-46-025f39ddbe02> in <module>
----> 1 sin(1)

NameError: name 'sin' is not defined

Python doesn’t have a “sine” function. We could perhaps write one ourselves. For example, sine has a series definition given by

\( \sin(x) = \displaystyle\sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!}x^{2n+1} \)

However, this is a lot of work, and would also require writing a factorial function, and probably eventually a \(\pi\) constant as well. Luckily for us, other people have already done the hard work! They have done so by writing packages. A Python package is simply a collection of code that you can use. Rather than having to copy-paste code, you import it from these packages. For example, there is a math package we can use. We will import it as such:

import math

As you can see, nothing happened. However, we can access functions inside of this package by using dot notation. For example, to use the sine function we would type:

math.sin(1)
0.8414709848078965
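Just to appreciate what the package saves us, here is a minimal sketch of what writing sine by hand from the series above might look like (truncating the infinite sum after a few terms):

# A hand-rolled sine using the series definition, keeping only the first 10 terms
def my_sin(x, terms=10):
    total = 0.0
    for n in range(terms):
        # Build (2n+1)! by hand, since we're pretending math doesn't exist
        factorial = 1
        for k in range(1, 2 * n + 2):
            factorial *= k
        total += (-1) ** n / factorial * x ** (2 * n + 1)
    return total

my_sin(1)  # Agrees with math.sin(1) above to many decimal places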

To access \(\pi\) we could do:

math.pi
3.141592653589793

And to use a factorial we could write:

math.factorial(7)
5040

Packages are extremely powerful. When you are considering writing some complex code, you should first search to see if someone has already written a package which does it for you.

When importing from packages, you don’t necessarily need to import the entire package, which is what we did when we typed import math. Often you only need a small piece of a package. Say, for instance, that we only wanted to use the sine function. We could type the following to import just that one function:

from math import sin
sin(2)
0.9092974268256817

If you only plan to use one or two functions from a package, it is good practice to import only those functions. That keeps the code clean and makes it obvious exactly what you’re using.

Python: Homework

For each of the following questions, write a function which does what is asked.

  1. Print out the first 20 even numbers

  2. Test whether or not a number is prime

  3. Given a list of integers, return a list containing only the odd numbers

  4. Given a list of integers and a target number, return True if two numbers in that list add up to the target number, or False if nothing does. For instance, with list_of_integers = [1, 4, -4, 2, 8] and target=6 we would return True, because 4+2=6 and both 4 and 2 are in list_of_integers. But we would return False if target=7, because no two numbers in the list add up to 7.

  5. Given a string s with the first letter capitalized (such as “Math”), return the reversed string (backwards), but fix the capitalization (so it should return “Htam”).

  6. Use list comprehension (described above) to square every element in a list.

  7. Use list comprehension to replace each element in the list of positive integers with the word “even” if it is even or “odd” if it is odd. So if the input is [1, 5, 3, 6], it should return ['odd', 'odd', 'odd', 'even'].

  8. Given a list of positive integers, return a dictionary showing the even and odd numbers. For example, if the input is [1, 5, 3, 6], then the return should be {'odd': [1, 5, 3], 'even': [6]}.

Pandas

Imports

Pandas is the library which all data scientists use to manipulate data. We need to tell Python to use Pandas. The way to do that is to import it by typing import pandas. Then when we want to use it, we type pandas.thing_I_want_pandas_to_do, just like we did with the math package above. However, typing pandas over and over is a bit cumbersome. So it’s common to alias pandas by giving it a shorter name. The common name is pd. So we’ll import pandas as pd, which means that we can now work with Pandas by typing pd.thing_I_want_pandas_to_do, which saves us a bit of typing.

import pandas as pd

Loading the data

I have my folders set up so that I have a data folder where I keep all my data. So, when I load my data in the next step, I need to tell Pandas to look in the data folder. If you don’t have a data folder and the data is just sitting in the same folder as your current notebook, then you don’t need the data/ part. Dealing with directories can be a bit of a headache, so it’s good practice to always create a data folder and put your data in there.
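If you want to double-check from inside the notebook that the file is where you expect, something like this works (a sketch; adjust the path to match where you actually saved the file):

from pathlib import Path

# This prints True if the file is sitting in the data folder next to this notebook
Path('data/titanic.csv').exists()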

This data is taken from the Titanic Kaggle competition, and describes the survival (or not) of passengers aboard the Titanic. Here’s a brief description of the data:

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Pandas stores data as things called DataFrames, so it’s typical to call the variable you load the data into df. If you have a bunch of different datasets you’re loading, you can tweak the name to help you remember what the data is. So this could be called titanic_df, or train_df. For now, we’ll just use df.

df = pd.read_csv('data/titanic.csv')

Let’s break this command down:

  • pd says that we are using a function from Pandas, which we aliased as pd

  • read_csv is the function we are using from Pandas. It reads in a CSV (comma-separated values) file. If you’ve never heard of a CSV file, it’s basically a plain-text version of an Excel spreadsheet.

  • data/ is the folder where my data is stored

  • titanic.csv is the name of my file that I want to load.

Looking at the data

In order to look at the data, you can either just type df to have it print out the entire DataFrame, or you can use the .head() command. It’s generally preferable to use .head(), since that shows just the first 5 rows (by default). If you do want to see more rows, you can pass a number into that function telling it how many rows you want to see. I did that below to see 10 rows.

df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
df.head(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

Note that Pandas always includes an index column on the left, which uniquely numbers each row. We’ll use it later.

What about if you want to access a single column? The way to do this is to pass the column name as a string (meaning in quotes) to the DataFrame.

df['Age'].head()
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

See how printing a single column looks different from printing multiple columns? That’s because Pandas calls something with a single column a Series, and something with multiple columns a DataFrame. In reality, they’re not that different. But it’s helpful to know the terms in case you see people mentioning Series when you’re Googling.

If you want to access multiple columns, you need to pass those columns as a list. We’ll pass to the DataFrame a list of the columns we’re looking for. Remember that to access a column in Pandas we write df[column_I_want]. Since our columns are a list, they’ll have brackets around them. This means we end up with double-brackets, like df[[column1, column2, column3]].

Again, we could print everything we get back by typing df[['Survived', 'Age']], but we probably just want a quick peek at the data, not to see the entire list of values. So I’ll just look at the .head() of the data.

df[['Survived', 'Age']].head()
Survived Age
0 0 22.0
1 1 38.0
2 1 26.0
3 1 35.0
4 0 35.0

So far we’ve just accessed columns, so what about rows? Rows are accessed slightly differently. You use the .iloc command. Here “loc” refers to “location”, and “i” refers to “integer”. So we’ll access the row location by giving it an integer. As with everything in Python, we start counting at zero. So the first row we would access as df.iloc[0], and the tenth row would be df.iloc[9]. Note that Pandas makes this easier for you by printing the index of each row as the first column (as discussed above). Let’s try it below.

df.iloc[0]
PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

Sure enough, if we look at when we first loaded the data, this Mr. Owen Harris Braund is indeed the first person listed, and has an index of zero.

We previously saw how to access multiple columns, so how do we access multiple rows? You can do it similarly to how we accessed multiple columns, by passing a list of the row integers we want.

df.iloc[[0, 1, 5]]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q

However, suppose we want to access rows 0 through 10000. Clearly we don’t want to type [0, 1, 2, ..., 10000]. Instead, we can use slicing. Slicing is a shortcut way to say “give me everything between these two numbers”. We “slice” by using a colon.

df.iloc[0:5]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

For reasons we won’t go into now, Python decides that when you type 0:5, you don’t want [0, 1, 2, 3, 4, 5]; instead you must want [0, 1, 2, 3, 4] (so starting at 0, but ending just before 5). So the command above gave us the first five rows. Here’s another simple one.

df.iloc[10:20]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C

You can also tell it to skip rows (like skipping every other row, or taking just every fifth row) by including a step size at the end. So to get every fifth row between rows 0 and 20 you would type:

df.iloc[0:20:5]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S

Describing the data numerically

Okay, we’ve loaded our data, now let’s start examining it. We probably want to answer questions like:

  • How many people survived? How many died?

  • Who is the oldest and youngest person on board? The average age?

  • For the people who are first class (Pclass==1), what percentage survived? What about for second and third class?

We’ll address each of these separately below.

Value counts

“Value counts” refers to counting how many different values appear in a column. So this could be used to answer the question of “How many people survived/died?”, since I’m just counting how many 1’s (survived) and 0’s (died) appear in the Survived column. To do a value count, just select the column you want from your DataFrame (using what you learned above), and tell it to run the .value_counts() command.

df['Survived'].value_counts()
0    549
1    342
Name: Survived, dtype: int64

We can see that 549 people died, and 342 survived. Try it on your own for several other columns.

Basic statistics

Say you want to compute some basic statistics on a column, such as min/max/average/median/etc. In general, Pandas makes this very simple:

df['Age'].min()
0.42
df['Age'].max()
80.0
df['Age'].mean()
29.69911764705882
df['Age'].median()
28.0
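If you want a bunch of these statistics at once, Pandas also has a .describe() helper, which reports the count, mean, standard deviation, min, max and quartiles in one shot:

# count, mean, std, min, quartiles and max, all at once
df['Age'].describe()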

Analyzing subsets of data

The last question about analyzing individual passenger classes is a little different. Up to this point we’ve just been taking an entire column (like the Age column) and finding something about it (mean, median, value_counts, etc). But what about if we just want a subset of the column, such as the first class passengers? To do this, you need to tell Pandas the condition you want satisfied. If we want first class passengers, we want the column Pclass to be equal to 1. Remember that in Python, to test equality (i.e. to check if something is or is not equal to something else) you use double equals signs ==. So if I want to check for Pclass equal to 1, I would do Pclass==1. Let’s see what happens if we try that.

df['Pclass']==1
0      False
1       True
2      False
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Name: Pclass, Length: 891, dtype: bool

Hmm, it looks like it’s telling us True or False for whether each passenger was first class. That’s not quite what I wanted; I just wanted the first class passengers. But that’s okay, we’re mostly there. Instead, what we do is say “DataFrame, look inside yourself and find the rows where Pclass==1”.

df[df['Pclass']==1].head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S

Now we’ve just got the first class passengers! Let’s do a value count on just those passengers to see how many survived. We need to tell Pandas “Take this subset of the passengers which are first class, look at the Survived column, and do a value_counts().”

df[df['Pclass']==1]['Survived'].value_counts()
1    136
0     80
Name: Survived, dtype: int64

This is good, but I’m probably more interested in percentages than in raw numbers. To find out how to do this, Google “pandas value counts percentage”. See if you can find the answer on your own. After you’ve done that, try comparing survival rates for second and third class passengers, as well as for all passengers regardless of class.

Finding missing data (and dealing with it)

It’s very common that in a dataset, some of the data will be missing. What do we mean by “missing”? Sometimes it’s just a blank cell in the spreadsheet; sometimes it’s represented by NaN, which stands for “Not a number” and is Python speak for “I don’t have anything useful to put here”. There is also a None object in Python, which could potentially be in a cell. The cell could also have an empty string '', or a bunch of spaces '   ', or a zero where it doesn’t make sense (like in a name column), or \(\pm\infty\), or one of many, many other things that could go wrong! This gives you some idea why finding and dealing with missing data is not simple. The general rule of thumb is that cleaning and prepping your data is about 80% of your work, while analyzing and utilizing it is only 20%.
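Pandas won’t flag most of these other cases automatically, so you have to go looking for them yourself. As a sketch, something like the following would count values in a text column that are empty or contain only spaces:

# Count entries in the Name column that are empty or just spaces (here it should be zero)
(df['Name'].str.strip() == '').sum()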

Let’s start by finding Nones and NaNs. Pandas actually has a nice helper function for this called isna(). It looks for both NaN and None values. It does not look for spaces, zeros where you don’t want them, etc. Let’s try it with the Cabin column.

df['Cabin'].isna()
0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887    False
888     True
889    False
890     True
Name: Cabin, Length: 891, dtype: bool

Hmm, that’s not quite what we wanted. It’s saying whether each row is or is not NaN/None. That should look similar to what we had above with df['Pclass']==1. We need to think about what our goal is. Is it just to find these values, to drop rows with them, or to fix them? Let’s start with the first, just counting them.

The simplest way to count them is to see how many of the rows evaluated to True when we did .isna(). A fantastically simple way to do this is below.

df['Cabin'].isna().sum()
687

What’s going on here? The .isna() part is giving us the True/False values, and .sum() is adding them up. But how do we add True and False? These are what are called “boolean values”, which means they can be interpreted as either 1 or 0. By default, True is mapped to 1, and False is mapped to 0. So when you call .sum(), it’s adding up 1 (True) + 0 (False) + 1 (True) + … and so forth. So the sum is exactly how many NaN/None values there are in that column!
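You can see this boolean arithmetic directly with plain Python:

# True behaves like 1 and False like 0 when you do arithmetic with them
sum([True, False, True, True])
3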

How did I figure out this trick? Googling! I Googled “Pandas count number of nan none in column”. The first result gave me a solution, which is what I used. You should get used to Googling for how to do things in programming. No human knows every command and trick, but the internet does! So use it to your advantage.

With that, let’s count how many NaN/None values are in each column. I’ll use a simple for-loop to make this easier.

for col in df.columns:
    print(f'{col}: {df[col].isna().sum()}')
PassengerId: 0
Survived: 0
Pclass: 0
Name: 0
Sex: 0
Age: 177
SibSp: 0
Parch: 0
Ticket: 0
Fare: 0
Cabin: 687
Embarked: 2

So it looks like most columns are pretty good, but some do indeed have a lot of missing values. What percentage of the values are missing? How about we divide the number of missing values by the number of rows. To get the shape of a DataFrame, use the .shape attribute.

df.shape
(891, 12)

This says there are 891 rows (rows always come first, just like in a matrix) and 12 columns. Since we only care about rows, we’ll grab the first part of this tuple by using df.shape[0] (remember that everything in Python is zero-indexed, so the first element in a tuple is referred to as the zeroth-element). Let’s alter our for-loop above to print this.

for col in df.columns:
    print(f'{col}: {df[col].isna().sum()} ({100 * df[col].isna().sum() / df.shape[0]}%)')
PassengerId: 0 (0.0%)
Survived: 0 (0.0%)
Pclass: 0 (0.0%)
Name: 0 (0.0%)
Sex: 0 (0.0%)
Age: 177 (19.865319865319865%)
SibSp: 0 (0.0%)
Parch: 0 (0.0%)
Ticket: 0 (0.0%)
Fare: 0 (0.0%)
Cabin: 687 (77.10437710437711%)
Embarked: 2 (0.2244668911335578%)

How should we handle missing data? One way is just to ignore rows with missing data. We can do this by selecting rows where .isna() is False. Pandas has a helper function to do this, called .dropna(). By default it drops any row with at least one NaN/None value. But if you want more flexibility, you can set .dropna(thresh=3) (for instance) to keep only the rows that have at least three non-missing values. It’s good practice to always keep your original data intact. So once we drop the rows, we’ll assign the result to a new DataFrame called df_trimmed.

df_trimmed = df.dropna()
df_trimmed.shape
(183, 12)

How do we feel about this? We went from having at least some information on 891 passengers to having complete information on only 183 passengers. This seems problematic, as I’ve thrown away a bunch of my data. Moreover, the column that most often has missing data is the Cabin column. Looking at that column, it seems like it’s basically the room number for the person. I don’t feel like that’s so important, so maybe I don’t need to worry about missing values there. Let’s instead focus on the column with missing values that seems the most useful, which I feel is Age.
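As an aside, if we did want to drop only the rows where Age specifically is missing, .dropna() takes a subset argument listing the columns it should look at:

# Drop only the rows with a missing Age, ignoring NaNs in other columns
df.dropna(subset=['Age']).shape
(714, 12)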

Rather than dropping rows with a missing Age, how about we fill in an age? This may seem strange, as I’m basically just making up an age for that person. However, soon we’ll start building machine learning models, which use the data we have to make predictions about who will survive and die. For these models to work, they need to have data. If a row has a NaN/None, the model will throw an error and refuse to run. So it’s better to have something so that it at least runs, than to have nothing.

The way to fill missing data is to use the Pandas helper function .fillna() (see a pattern here? Pandas has tons of helper functions. So before you try to write something from scratch, try Googling first!). All that .fillna() needs is an argument telling it what to fill in to the missing values. We could do something simple like filling in zeros (.fillna(0)), or 999, or -1, but it’s generally best to not introduce extreme/unrealistic values (no one on board had an age of -1. Also, an age of 999 would skew the mean age, if that was something we decided to calculate). So good practice is to fill numerical columns with the median value.

# Use .copy() so that changing df_cleaned doesn't also change the original df
df_cleaned = df.copy()
df_cleaned['Age'] = df_cleaned['Age'].fillna(df_cleaned['Age'].median())

Let’s inspect our data and do another check for missing values to see where we’re at.

for col in df_cleaned.columns:
    print(f'{col}: {df_cleaned[col].isna().sum()} ({100 * df_cleaned[col].isna().sum() / df_cleaned.shape[0]}%)')
PassengerId: 0 (0.0%)
Survived: 0 (0.0%)
Pclass: 0 (0.0%)
Name: 0 (0.0%)
Sex: 0 (0.0%)
Age: 0 (0.0%)
SibSp: 0 (0.0%)
Parch: 0 (0.0%)
Ticket: 0 (0.0%)
Fare: 0 (0.0%)
Cabin: 687 (77.10437710437711%)
Embarked: 2 (0.2244668911335578%)
df_cleaned.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Looking good!

What about other cases, where the data is not what you would expect (numbers for Name, words for Fare, etc.)? One simple way is to look at the data type for each column. Pandas forces all values in a column to be the same type (int, float, object, etc.). So if a single value in the Fare column can’t be described by a float (for example, it was a string ‘$123.16’), then Pandas will force that whole column to be strings (which Pandas calls an object).
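You can watch this coercion happen with a tiny made-up example:

# One string mixed in forces the entire column to the generic 'object' dtype
pd.Series([7.25, 71.28, '$123.16']).dtype
dtype('O')

With that in mind, let’s check our real columns.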

df_cleaned.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

A quick glance down the list looks good for the most part, but there is one weird one. Age is a float (decimal). Most ages are whole numbers (the fractional ones, like the minimum of 0.42 we saw earlier, are infants), so we could consider changing that. The way to tell Pandas to change a column’s data type is the .astype() helper function. As an argument, pass the type you would like. We’ll use int as the type.

df_cleaned['Age'] = df_cleaned['Age'].astype('int')
df_cleaned.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age              int64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
df_cleaned.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

Working with Series

In Pandas, a Series is just a DataFrame with only one column. Unfortunately, this mostly presents a headache. There will be times when you do something which would work perfectly well with a DataFrame, but it fails because you didn’t realize that you only had one column, and thus a Series!

Up to this point we have gotten our data by loading it from a CSV file. However, it’s sometimes helpful to just make some quick, fake data to work with. The easiest way to do this is to use the Pandas command DataFrame() and give it a dictionary. The dictionary’s keys become the column names, and its values are lists of whatever values should be in each column. Here is an example.

df = pd.DataFrame({'math_course_number': [2413, 3320, 3439],
                  'math_course_name': ['Calculus I', 'Applied Statistics', 'Introduction to Data Science']})

df
math_course_number math_course_name
0 2413 Calculus I
1 3320 Applied Statistics
2 3439 Introduction to Data Science

This is a DataFrame. You can tell because it looks like one (you’ll see what a Series looks like in a second), or by using the type() Python function.

type(df)
pandas.core.frame.DataFrame

Now let’s select just a single column and see what we get.

math_course_number = df['math_course_number']

math_course_number
0    2413
1    3320
2    3439
Name: math_course_number, dtype: int64

This looks different! It doesn’t have the nice pretty table-style look to it. That’s because this isn’t a DataFrame, it’s a Series!

type(math_course_number)
pandas.core.series.Series

Some things work just as well with a Series as they do with a DataFrame. For example:

math_course_number.head()
0    2413
1    3320
2    3439
Name: math_course_number, dtype: int64
math_course_number + 1
0    2414
1    3321
2    3440
Name: math_course_number, dtype: int64

However, some things don’t!

# An error is coming...
math_course_number['math_course_number']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-93-a55044310104> in <module>
      1 # An error is coming...
----> 2 math_course_number['math_course_number']

~/.virtualenvs/jupyter/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
    880 
    881         elif key_is_scalar:
--> 882             return self._get_value(key)
    883 
    884         if (

~/.virtualenvs/jupyter/lib/python3.8/site-packages/pandas/core/series.py in _get_value(self, label, takeable)
    989 
    990         # Similar to Index.get_value, but we do not fall back to positional
--> 991         loc = self.index.get_loc(label)
    992         return self.index._get_values_for_loc(self, loc, label)
    993 

~/.virtualenvs/jupyter/lib/python3.8/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    352                 except ValueError as err:
    353                     raise KeyError(key) from err
--> 354             raise KeyError(key)
    355         return super().get_loc(key, method=method, tolerance=tolerance)
    356 

KeyError: 'math_course_number'

A Series doesn’t have column names, so you can’t access columns by name anymore. In general, it’s often easiest to convert a Series back to a DataFrame, so that you are always working with the same kind of object. There are (at least) two ways to do this:

  1. Use the Pandas .to_frame() function

  2. Use the Pandas .reset_index() function

These two do different things, so let’s briefly talk about each.

The .to_frame() method simply turns a Series into a DataFrame; that’s it. Here’s an example.

math_course_number_df = math_course_number.to_frame()

math_course_number_df
math_course_number
0 2413
1 3320
2 3439

Now it looks like (and is!) a DataFrame again!

type(math_course_number_df)
pandas.core.frame.DataFrame

The only option you have with .to_frame() is that you can specify the column name. If, instead of ‘math_course_number’ you wanted something else, you could do that.

math_course_number_df = math_course_number.to_frame(name='course_number')

math_course_number_df
course_number
0 2413
1 3320
2 3439

Now let’s talk about .reset_index(). What this function does is make the index (the numbers on the left of the Series/DataFrame) into a new column. Here’s an example:

math_course_number
0    2413
1    3320
2    3439
Name: math_course_number, dtype: int64
math_course_number.reset_index()
index math_course_number
0 0 2413
1 1 3320
2 2 3439

You can see that we have a new column called “index” which holds the numbers that were on the left. This can often be extremely useful. You will run into situations where the index holds numbers you want to use, but since they’re not in a column you can’t easily access them. In that case, use .reset_index() to make them into a column.
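One related option worth knowing about: if you want a fresh 0, 1, 2, ... index but don’t want the old index kept as a column, pass drop=True:

# drop=True resets the index without adding an 'index' column
math_course_number.reset_index(drop=True)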

Exercises

While you can use these to get you started, really the best thing you can do is to just mess around with the data and come up with your own questions. Then start exploring those questions. Undoubtedly you will run into things you can’t figure out. Just Google and see what you can come up with. Patience and perseverance are key!

  1. For each passenger class (Pclass), what are the cheapest, most expensive, mean and median prices?

  2. What is the average age for men and the average age for women?

  3. The SibSp column counts how many siblings and spouses are on board for that person (so if you and your sister are on the ship together, then this would be 1, because you have one sibling or spouse). Create a new DataFrame containing only the people with a sibling or spouse on board.

  4. The Cabin column is probably not useful in any way, especially with so many missing values. Drop that column from the data (Google how to do it!).

  5. What patterns do you see about survival rates among passengers in different classes? We looked at just first-class passengers, but what about second and third class?

  6. We saw above that about 63% of first-class passengers survived, which is far better than for second or third class passengers. But can you do better? What about first-class females? Or second-class children? Can you find some subset of the people with a very strong survival rate? What about a high death rate?

  7. So far we’ve been asking about the relationship between just two variables: Age and Survived, Embarked and something else, etc. How would you go about exploring the relationship between three or more variables? Perhaps Survived is dependent on Age, Fare, Sex, SibSp and Parch. How could you explore this?

  8. We described “missing” data with a None value. However, that’s not the only way for data to be missing. An empty string "" would also be “missing”. Finally, while data may not be missing per se, it can still be a value you don’t want, such as a Sex of “unknown”. Check each column to see if there are any values that you don’t want. Did you find any?

Resources

Where are some places you can turn when you have questions? Google is obviously one of them, but here are some other sites I’ve found especially useful:

  • Chris Albon - He has very clear explanations for lots of different things. He pretty much always starts by creating a DataFrame from scratch, and then works through simple examples. A great first resource if Google isn’t working for you.

  • Machine Learning Mastery - A site covering more advanced topics, but still in a very simple, readable format.

  • Kaggle kernels - “Kernels” is Kaggle’s word for Jupyter notebooks. They host notebooks on their website, and people will build them to accomplish various tasks, and then share them with the community. Many of them are like this notebook, where they’re tutorials meant to take you from start-to-finish on a task. Less useful for answering a quick “how do I…” question, but highly useful for finding an introduction to a new topic.