Girl Develop It is here to provide affordable and accessible programs to learn software through mentorship and hands-on instruction.
Some "rules"
Tell us about yourself.
Open up your terminal and install all of the following packages that are not already on your computer:
pip for mac users - run the pip command below
pip for windows users - copy the text from here, save it to a
python file, then run python (filename)
# for mac users only
sudo easy_install pip
Running sudo will require you to enter your computer password
ipython is a python console, similar to when you
run python in the console and get the >>>
prompt
ipython is better especially with pandas, because it's more interactive
To install ipython and pandas, run the commands below
pip install ipython
pip install pandas
If you receive a permission denied message, run with sudo pip instead
Our data is stored in the folder intermediate-python-pandas/example_data/
Once in that folder, open up the file purchases.csv
(This is the last time you're allowed to open a csv in this class)
In your terminal run ipython. Let's import pandas and try out tab completion
In [1]: import pandas
In [2]: pandas.read_
# hit tab on line two to see what tab completion gives you!
Now let's import our data with read_csv
In [3]: purchases = pandas.read_csv('~/intermediate-python-pandas/example_data/purchases.csv')
You now have a pandas dataframe called purchases
A major advantage of pandas is being able to see your data without ever opening a file
Here are some useful commands for summarizing your dataframe
In [4]: purchases.head()
In [5]: purchases.tail(2)
In [6]: purchases.shape
In [7]: purchases.columns
Use the read_csv command to import our example data members.csv
Print out the column names of your members dataframe
Print out the top 10 rows of your dataframe
Find out how many rows are in your dataframe
You can quickly delete columns and add new columns to your dataframe
Deleting columns is done with the command
del
You can also create a new column by simply adding two existing columns together
In [8]: purchases['total'] = purchases['subtotal'] + purchases['tax']
In [9]: purchases.head()
In [10]: del purchases['tax']
In [11]: purchases.head()
Notice how referencing a dataframe column looks a lot like referencing a dictionary key
If your data came with empty cells, pandas will fill it in with NaN.
This can cause an error if you are trying to sum two columns.
To solve this, we can use the fillna() command to replace NaN with whatever we choose.
In [12]: members.first_name.fillna(' ')
Notice that I refer to the first_name column as members.first_name. That is also accepted syntax in pandas.
You can sort your data by any row or a list of rows
Sorting will default to ascending=True, so to sort descending, you have to set the parameter ascending=False
Let's take a quick look at the pandas documentation for more information
In [13]: by_date = purchases.sort_values(by='date')
In [14]: purchases = purchases.sort_values(by=['date', 'total'], ascending=False)
Now that we've made some changes to our
dataframe,
let's save it to a new csv
To do this, we use the command to_csv
Check out tab complete after to_
to
see how else you can save your data
In [15]: purchases.to_csv('~/intermediate-python-pandas/example_data/export_purchases.csv')
Take your dataframe members and do the following:
When importing data, you may see your column imported as a string, but you need it to be a float.
pandas' function astype() lets you change the data type of your column.
Let's try changing the the data type for the members column site_visits
In [16]: members.site_visits = members.site_visits.astype(str)
In [17]: members.site_visits = members.site_visits.astype(float)
You may come across situations where you want to
save a column as a python list
pandas makes it easy with the tolist() function
In [18]: ids_list = purchases.member_id.tolist()
In [19]: len(ids_list)
pandas filtering can look a little funny, but it is extremely useful
Let's look at the following code and break down exactly what is happening
In [20]: large_purchases = purchases[purchases['total'] > 100.0]
You want to look from inside out:
# We are calling the total price column from purchases
purchases['total']
# We are filtering the total price column for totals greater than 100 dollars.
purchases['total'] > 100.0
# We are filtering the dataframe for rows where total price is greater than 100 dollars.
purchases[purchases['total'] > 100.0]
Let's create a new dataframe named small_purchases where subtotal is less than 50 dollars.
Create a python list of all the member ids in small_purchases
To concatinate dataframes means to take two dataframes with the same columns and stacking the rows
Let's import the csv purchases-october.csv and print its columns
Since this csv has the columns as purchases, we can concatinate the two dataframes like so:
In [21]: concatinated = pandas.concat([purchases, purchases-october])
How concat stacks rows, you can think of merge like stacking columns
Merging dataframes is a lot like SQL joins
You need to use one or more columns to merge two dataframes together.
Let's merge purchases and members on the column member_id
In [22]: merged = purchases.merge(members, how='left', on='member_id')
The how
parameter decides
how to join the two dataframes.
on
decides the column or
columns (as a list for multiple)
that you want to merge on
groupby allows you to group by 1 or more columns
to get a
more top-level view of your data
There are a couple of ways to do this, but here is one recommended way.
In [23]: prices_by_date = purchases.groupby('date', as_index=False)[['subtotal', 'total']].mean()
In [24]: totals_member_date = purchases.groupby(['date', 'member_id'],
as_index=False)['total'].sum()
You must include as_index=False since you are combining columns and index logic gets funny otherwise.
Look at the pandas documentation for more ways to use groupby
Import items_purchased.csv as a pandas dataframe called items
Merge items with our merged dataframe by purchase_id
Group this new dataframe by item type and date, sum the total
*Challenge*
Find out the member name who has spent the most on sale items
Thanks for attending Intro to Pandas!
Your input is invaluable for the improvement of this workshop.
Please take a minute and click here to fill out our survey.