Intro to Pandas Workshop

Welcome!

Girl Develop It is here to provide affordable and accessible programs to learn software through mentorship and hands-on instruction.

Some "rules"

  • We are here for you!
  • Every question is important
  • Help each other
  • Have fun

Welcome!

Tell us about yourself.

  • Who are you?
  • What made you interested in this course?
  • What is your favorite or dream vacation destination?

What we will cover

  • What is pandas?
  • Why pandas?
  • Uploading and Saving Data
  • Summarizing and Organizing Data
  • Manipulating Data

What is pandas?

  • pandas is a Python package designed to make working with data both easy and intuitive.
  • It is an extremely useful tool for doing practical, real world data analysis in Python.

Why pandas?

  • Extremely fast, even with large datasets
  • Quick, intuitive commands
  • View your data without crashing your computer
  • Easy to use for developers, statisticians, and basically anyone working with a lot of data
  • Used by many companies

Who is using pandas?

  • Facebook
  • AppNexus
  • SpringServe
  • (many others)

Setup - pip

Open up your terminal and install all of the following packages that are not already on your computer:

pip for mac users - run the pip command below

pip for windows users - copy the text from here, save it to a python file, then run python (filename)


# for mac users only
sudo easy_install pip
        

Running sudo will require you to enter your computer password

Setup - ipython and pandas

ipython is a python console, similar to when you run python in the console and get the >>> prompt

ipython is better especially with pandas, because it's more interactive

To install ipython and pandas, run the commands below


pip install ipython
pip install pandas
        

If you receive a permission denied message, run with sudo pip instead

Our Sample Data

Our data is stored in the folder intermediate-python-pandas/example_data/

Once in that folder, open up the file purchases.csv

(This is the last time you're allowed to open a csv in this class)

Ipython and tab completion

In your terminal run ipython. Let's import pandas and try out tab completion


In [1]: import pandas

In [2]: pandas.read_
# hit tab on line two to see what tab completion gives you!
        

Importing our CSV

Now let's import our data with read_csv


In [3]: purchases = pandas.read_csv('~/intermediate-python-pandas/example_data/purchases.csv')
        

You now have a pandas dataframe called purchases

Summarizing Your Data

A major advantage of pandas is being able to see your data without ever opening a file

Here are some useful commands for summarizing your dataframe


In [4]: purchases.head()

In [5]: purchases.tail(2)

In [6]: purchases.shape

In [7]: purchases.columns
        

Let's Develop It!

Use the read_csv command to import our example data members.csv

Print out the column names of your members dataframe

Print out the top 10 rows of your dataframe

Find out how many rows are in your dataframe

Adding and Removing Columns

You can quickly delete columns and add new columns to your dataframe

Deleting columns is done with the command del

You can also create a new column by simply adding two existing columns together


In [8]: purchases['total'] = purchases['subtotal'] + purchases['tax']

In [9]: purchases.head()

In [10]: del purchases['tax']

In [11]: purchases.head()
        

Notice how referencing a dataframe column looks a lot like referencing a dictionary key

Fill in empty cells

If your data came with empty cells, pandas will fill it in with NaN.

This can cause an error if you are trying to sum two columns.

To solve this, we can use the fillna() command to replace NaN with whatever we choose.


In [12]: members.first_name.fillna(' ')
        

Notice that I refer to the first_name column as members.first_name. That is also accepted syntax in pandas.

Sorting Your Data

You can sort your data by any row or a list of rows

Sorting will default to ascending=True, so to sort descending, you have to set the parameter ascending=False

Let's take a quick look at the pandas documentation for more information


In [13]: by_date = purchases.sort_values(by='date')

In [14]: purchases = purchases.sort_values(by=['date', 'total'], ascending=False)
        

Saving Your Data

Now that we've made some changes to our dataframe,
let's save it to a new csv

To do this, we use the command to_csv

Check out tab complete after to_ to see how else you can save your data


In [15]: purchases.to_csv('~/intermediate-python-pandas/example_data/export_purchases.csv')
        

Let's Develop It!

Take your dataframe members and do the following:

  • Create a new column full_name, which will add first_name and last_name
  • Delete the column for phone_number
  • Fill in the NaN rows of the total_spent column with 0
  • Sort data by birthday
  • Save your dataframe as a csv called members_by_birthday.csv in the same folder

Changing a Row's Data Type

When importing data, you may see your column imported as a string, but you need it to be a float.

pandas' function astype() lets you change the data type of your column.

Let's try changing the the data type for the members column site_visits


In [16]: members.site_visits = members.site_visits.astype(str)

In [17]: members.site_visits = members.site_visits.astype(float)
        

Column to Python List

You may come across situations where you want to
save a column as a python list

pandas makes it easy with the tolist() function


In [18]: ids_list = purchases.member_id.tolist()

In [19]: len(ids_list)
        

Dataframe Manipulation: Filtering

pandas filtering can look a little funny, but it is extremely useful

Let's look at the following code and break down exactly what is happening


In [20]: large_purchases = purchases[purchases['total'] > 100.0] 
        

You want to look from inside out:


# We are calling the total price column from purchases
purchases['total']
        

# We are filtering the total price column for totals greater than 100 dollars.
purchases['total'] > 100.0
        

# We are filtering the dataframe for rows where total price is greater than 100 dollars.
purchases[purchases['total'] > 100.0]
        

Let's Develop It!

Let's create a new dataframe named small_purchases where subtotal is less than 50 dollars.

Create a python list of all the member ids in small_purchases

Concatinating DataFrames

To concatinate dataframes means to take two dataframes with the same columns and stacking the rows

Let's import the csv purchases-october.csv and print its columns

Since this csv has the columns as purchases, we can concatinate the two dataframes like so:


In [21]: concatinated = pandas.concat([purchases, purchases-october])
        

Merging DataFrames

How concat stacks rows, you can think of merge like stacking columns

Merging dataframes is a lot like SQL joins

You need to use one or more columns to merge two dataframes together.

Let's merge purchases and members on the column member_id


In [22]: merged = purchases.merge(members, how='left', on='member_id')
        

The how parameter decides how to join the two dataframes.

on decides the column or columns (as a list for multiple)
that you want to merge on

Grouping

groupby allows you to group by 1 or more columns to get a
more top-level view of your data

There are a couple of ways to do this, but here is one recommended way.


In [23]: prices_by_date = purchases.groupby('date', as_index=False)[['subtotal', 'total']].mean()

In [24]: totals_member_date = purchases.groupby(['date', 'member_id'],
                                        as_index=False)['total'].sum()
        

You must include as_index=False since you are combining columns and index logic gets funny otherwise.

Look at the pandas documentation for more ways to use groupby

Let's Develop It!

Import items_purchased.csv as a pandas dataframe called items

Merge items with our merged dataframe by purchase_id

Group this new dataframe by item type and date, sum the total

*Challenge*

Find out the member name who has spent the most on sale items

Survey

Thanks for attending Intro to Pandas!

Your input is invaluable for the improvement of this workshop.

Please take a minute and click here to fill out our survey.