Applied Data Science with Python – Part 2

The Series Data Structure

We will start with a quick, non-comprehensive overview of the fundamental data structures in pandas. The fundamental behavior of data types, indexing, and axis labeling/alignment applies across all of the objects. To get started, load pandas into your namespace (NumPy is imported further down, when it is first needed).

import pandas as pd
pd.Series?  # IPython/Jupyter shortcut that shows the help text for pd.Series

We'll create a list of strings named animals and convert it into a pandas Series. The second list contains None to show how a missing string is stored.

animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
animals = ['Tiger', 'Bear', None]
pd.Series(animals)

Next, we'll create a pandas Series of integers. When the list contains None, pandas stores it as NaN and the dtype becomes float64.

numbers = [1, 2, 3]
pd.Series(numbers)
numbers = [1, 2, None]
pd.Series(numbers)

Next, import NumPy.
To construct a DataFrame with missing data, use np.nan for the values that are missing. Alternatively, you may pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing. Note that np.nan and None are different objects, so the comparison below returns False.

import numpy as np
np.nan == None

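NaN (not a number) is the standard missing data marker used in pandas. NaN never compares equal to anything, including itself, so the equality test below also returns False.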

np.nan == np.nan

To make detecting missing values easier (and across different array dtypes), pandas provides the isna() and notna() functions, which are also methods on Series and DataFrame objects. For a single float, NumPy's np.isnan() does the same job:

np.isnan(np.nan)
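The checks above can be pulled together in one small sketch. The variable names (numbers, missing_mask, present_count) are made up for illustration; the calls themselves are the standard isna() and notna() methods mentioned above.

```python
import numpy as np
import pandas as pd

# NaN never compares equal, not even to itself
nan_equals_nan = (np.nan == np.nan)

# A list of numbers with a None is stored as float64, with NaN in the gap
numbers = pd.Series([1, 2, None])

# isna() marks the missing entries, notna() marks the present ones
missing_mask = numbers.isna()
present_count = int(numbers.notna().sum())

print(nan_equals_nan)          # False
print(missing_mask.tolist())   # [False, False, True]
print(present_count)           # 2
```

Because equality tests cannot find NaN, isna() (or np.isnan() for plain floats) is the reliable way to locate missing values.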

 

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

 

s.index

 

s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
s

 

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s
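As a rough illustration of the last example: when a dictionary is passed together with an explicit index, only the listed keys are kept, and a label that is not in the dictionary (here 'Hockey') gets a missing value. The name missing_labels is only used for this sketch.

```python
import pandas as pd

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}

# Only 'Golf' and 'Sumo' exist in the dictionary; 'Hockey' has no value
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])

# Collect the labels whose value is missing
missing_labels = s[s.isna()].index.tolist()

print(list(s.index))      # ['Golf', 'Sumo', 'Hockey']
print(missing_labels)     # ['Hockey']
```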

Querying a Series

Indexing In Python

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
s.iloc[3]

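'South Korea'

s.loc['Golf']

'Scotland'

s[3]  # falls back to position because the index holds strings; newer pandas versions deprecate this, so prefer s.iloc[3]

'South Korea'

s['Golf']

'Scotland'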

sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)
s[0]  # This won't call s.iloc[0] as one might expect; the index holds integers, so 0 is looked up as a label and raises a KeyError
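A minimal sketch of the safe alternatives when the index itself holds integers: iloc always means position and loc always means label. The helper names (first_by_position, by_label, label_zero_found) are made up for this example.

```python
import pandas as pd

s = pd.Series({99: 'Bhutan',
               100: 'Scotland',
               101: 'Japan',
               102: 'South Korea'})

first_by_position = s.iloc[0]   # first row, regardless of its label
by_label = s.loc[99]            # row whose label is 99

# Plain s[0] is treated as a label lookup here, and there is no label 0
try:
    s[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position, by_label, label_zero_found)
```

Spelling out iloc or loc avoids having to remember which rule plain square brackets apply to a given index.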
s = pd.Series([100.00, 120.00, 101.00, 3.00])
s

0    100.0
1    120.0
2    101.0
3      3.0
dtype: float64

total = 0
for item in s:
    total+=item
print(total)

324.0

import numpy as np
total = np.sum(s)
print(total)

324.0

#this creates a big series of random numbers
s = pd.Series(np.random.randint(0,1000,10000))
s.head()

0    486
1    951
2    111
3    142
4    457
dtype: int64

len(s)

10000

%%timeit -n 100
summary = 0
for item in s:
    summary+=item

100 loops, best of 3: 1.31 ms per loop

%%timeit -n 100
summary = np.sum(s)

100 loops, best of 3: 74.3 µs per loop

s+=2 #adds two to each item in s using broadcasting
s.head()

0    488
1    953
2    113
3    144
4    459
dtype: int64

for label, value in s.items():
    s.at[label] = value + 2  # update one element at a time (much slower than s += 2)
s.head()

0    490
1    955
2    115
3    146
4    461
dtype: int64

%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.items():
    s.loc[label]= value+2

10 loops, best of 3: 966 ms per loop

%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2

10 loops, best of 3: 317 µs per loop

s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears'
s

0             1
1             2
2             3
Animal    Bears
dtype: object

original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'],
                                   index=['Cricket',
                                          'Cricket',
                                          'Cricket',
                                          'Cricket'])
all_countries = pd.concat([original_sports, cricket_loving_countries])  # returns a new Series; neither input is modified

 

original_sports

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

cricket_loving_countries

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object

all_countries

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object

all_countries.loc['Cricket']

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object
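The outputs above show two things worth checking in code: combining Series leaves the originals untouched, and a label that appears more than once returns a Series rather than a single value. The sketch below uses pd.concat (Series.append was removed in pandas 2.0) on a shortened, made-up version of the data.

```python
import pandas as pd

original = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
extra = pd.Series(['Australia', 'England'], index=['Cricket', 'Cricket'])

# concat builds a new Series; original and extra keep their lengths
combined = pd.concat([original, extra])

unique_hit = combined.loc['Golf']        # label occurs once: a single value
repeated_hit = combined.loc['Cricket']   # label occurs twice: a Series

print(len(original), len(combined))      # 2 4
print(type(unique_hit).__name__)         # str
print(repeated_hit.tolist())             # ['Australia', 'England']
```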

The DataFrame Data Structure

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. Below are some examples covering different scenarios.

import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.head()

         Cost  Item Purchased   Name
Store 1  22.5        Dog Food  Chris
Store 1   2.5    Kitty Litter  Kevyn
Store 2     5       Bird Seed  Vinod

df.loc['Store 2']

Cost                      5
Item Purchased    Bird Seed
Name                  Vinod
Name: Store 2, dtype: object

type(df.loc['Store 2'])

pandas.core.series.Series

df.loc['Store 1']

         Cost  Item Purchased   Name
Store 1  22.5        Dog Food  Chris
Store 1   2.5    Kitty Litter  Kevyn

df.loc['Store 1', 'Cost']

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

DataFrame.T is used to transpose the DataFrame:

df.T

                 Store 1       Store 1    Store 2
Cost                22.5           2.5          5
Item Purchased  Dog Food  Kitty Litter  Bird Seed
Name               Chris         Kevyn      Vinod

df.T.loc['Cost']

Store 1    22.5
Store 1     2.5
Store 2       5
Name: Cost, dtype: object

df['Cost']

Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64

df.loc['Store 1']['Cost']

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

df.loc[:,['Name', 'Cost']]

          Name  Cost
Store 1  Chris  22.5
Store 1  Kevyn   2.5
Store 2  Vinod     5

df.drop('Store 1')

         Cost  Item Purchased   Name
Store 2     5       Bird Seed  Vinod

df

         Cost  Item Purchased   Name
Store 1  22.5        Dog Food  Chris
Store 1   2.5    Kitty Litter  Kevyn
Store 2     5       Bird Seed  Vinod

copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df

         Cost  Item Purchased   Name
Store 2     5       Bird Seed  Vinod

copy_df.drop?
del copy_df['Name']  # remove the Name column in place
copy_df

         Cost  Item Purchased
Store 2     5       Bird Seed

df['Location'] = None
df

         Cost  Item Purchased   Name  Location
Store 1  22.5        Dog Food  Chris      None
Store 1   2.5    Kitty Litter  Kevyn      None
Store 2     5       Bird Seed  Vinod      None
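The drop examples above rely on the fact that drop returns a new DataFrame and leaves the original alone unless the result is assigned back. A minimal sketch on a cut-down version of the purchases table (only two columns, chosen for brevity):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Chris', 'Kevyn', 'Vinod'],
                   'Cost': [22.5, 2.5, 5.0]},
                  index=['Store 1', 'Store 1', 'Store 2'])

# Dropping a row label removes every row carrying that label
dropped_rows = df.drop('Store 1')

# Dropping a column uses the columns keyword
dropped_col = df.drop(columns='Name')

print(len(df), len(dropped_rows))      # 3 1
print(list(dropped_col.columns))       # ['Cost']
print(list(df.columns))                # ['Name', 'Cost'] (df itself is unchanged)
```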

Dataframe Indexing and Loading

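costs = df['Cost']
costs
costs+=2  # costs is tied to df here, so df changes too; with copy-on-write enabled in newer pandas, df would stay unchanged
costs
df

         Cost  Item Purchased   Name  Location
Store 1  24.5        Dog Food  Chris      None
Store 1   4.5    Kitty Litter  Kevyn      None
Store 2     7       Bird Seed  Vinod      None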

Download the data set (olympics.csv) and run the commands below to check the output.

df = pd.read_csv('olympics.csv')
df.head()

 

df = pd.read_csv('olympics.csv', index_col = 0, skiprows=1)
df.head()

 

df.columns

 

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold' + col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver' + col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#' + col[1:]}, inplace=True)
df.head()
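The renaming loop above can also be written as a single rename call with a function. This is only a sketch under assumptions: the column names in the small table below are invented stand-ins shaped like the prefixes the loop tests for ('01', '02', '03', '№'), and clean is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical one-row stand-in for the medal table
df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns=['№ Summer', '01 !', '02 !.1', '03 !', 'Total'])

prefixes = {'01': 'Gold', '02': 'Silver', '03': 'Bronze'}

def clean(col):
    """Apply the same renaming rules as the loop, for one column name."""
    if col[:2] in prefixes:
        return prefixes[col[:2]] + col[4:]
    if col.startswith('№'):
        return '#' + col[1:]
    return col

# rename accepts a function and applies it to every column label
df = df.rename(columns=clean)
print(list(df.columns))
# ['# Summer', 'Gold', 'Silver.1', 'Bronze', 'Total']
```

Passing a function keeps the rules in one place and avoids calling rename once per column.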

Querying a DataFrame

df['Gold'] > 0

 

only_gold = df.where(df['Gold'] > 0)
only_gold.head()

 

only_gold['Gold'].count()

 

df['Gold'].count()

 

only_gold = only_gold.dropna()
only_gold.head()

 

only_gold = df[df['Gold'] > 0]
only_gold.head()

 

len(df[(df['Gold'] > 0) | (df['Gold.1'] > 0)])

 

df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]
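The difference between where and boolean indexing in the queries above is easier to see on a tiny table. The three rows below are invented for illustration; the column names follow the ones used above.

```python
import pandas as pd

df = pd.DataFrame({'Gold': [0, 3, 1], 'Gold.1': [2, 0, 0]},
                  index=['A', 'B', 'C'])

has_gold = df['Gold'] > 0              # boolean mask, one entry per row

kept_shape = df.where(has_gold)        # same shape; non-matching rows become NaN
filtered = df[has_gold]                # non-matching rows are removed

# Masks combine with | (or) and & (and); each condition needs parentheses
either = df[(df['Gold'] > 0) | (df['Gold.1'] > 0)]
second_only = df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]

print(len(kept_shape), kept_shape['Gold'].count())   # 3 2
print(list(filtered.index))                          # ['B', 'C']
print(len(either), list(second_only.index))          # 3 ['A']
```

count() skips NaN, which is why the where result still has three rows but only two counted values.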

Indexing Dataframes

df.head()

 

df['country'] = df.index
df = df.set_index('Gold')
df.head()

 

df = df.reset_index()
df.head()
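A small sketch of what set_index and reset_index do to the columns, using a two-row table with invented values. set_index moves a column into the index (replacing the old index, which is why it is copied into a column first), and reset_index moves it back and restores a default integer index.

```python
import pandas as pd

df = pd.DataFrame({'Gold': [5, 0]}, index=['Alpha', 'Beta'])

df['country'] = df.index      # keep the old index as a regular column
df = df.set_index('Gold')     # 'Gold' becomes the index

restored = df.reset_index()   # 'Gold' becomes a column again

print(df.index.name, list(df.columns))   # Gold ['country']
print(list(restored.columns))            # ['Gold', 'country']
print(list(restored.index))              # [0, 1]
```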

Download the data (census.csv) and run the commands below.

df = pd.read_csv('census.csv')
df.head()

 

df['SUMLEV'].unique()

 

df=df[df['SUMLEV'] == 50]
df.head()

 

columns_to_keep = ['STNAME',
                   'CTYNAME',
                   'BIRTHS2010',
                   'BIRTHS2011',
                   'BIRTHS2012',
                   'BIRTHS2013',
                   'BIRTHS2014',
                   'BIRTHS2015',
                   'POPESTIMATE2010',
                   'POPESTIMATE2011',
                   'POPESTIMATE2012',
                   'POPESTIMATE2013',
                   'POPESTIMATE2014',
                   'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()

 

df = df.set_index(['STNAME', 'CTYNAME'])
df.head()

 

df.loc['Michigan', 'Washtenaw County']

 

df.loc[ [('Michigan', 'Washtenaw County'),
         ('Michigan', 'Wayne County')] ]
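The two-level lookups above can be tried without the census file. The sketch below builds a three-row table with the same index columns; the birth counts are invented placeholders, not census figures.

```python
import pandas as pd

df = pd.DataFrame({'STNAME': ['Michigan', 'Michigan', 'Ohio'],
                   'CTYNAME': ['Washtenaw County', 'Wayne County', 'Athens County'],
                   'BIRTHS2010': [10, 20, 30]})   # placeholder values
df = df.set_index(['STNAME', 'CTYNAME'])

# A full (state, county) tuple selects one row, returned as a Series
one = df.loc[('Michigan', 'Washtenaw County')]

# A list of tuples selects several rows, returned as a DataFrame
two = df.loc[[('Michigan', 'Washtenaw County'),
              ('Michigan', 'Wayne County')]]

# Only the outer level selects every county in that state
state = df.loc['Michigan']

print(one['BIRTHS2010'], len(two), len(state))   # 10 2 2
```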

Missing values

In this section, we will discuss missing (also referred to as NA) values in pandas.
Download the data (log.csv) and run the commands below.

df = pd.read_csv('log.csv')
df

 

df.fillna?

 

df = df.set_index('time')
df = df.sort_index()
df

 

df = df.reset_index()
df = df.set_index(['time', 'user'])
df

 

df = df.ffill()  # forward fill: each missing value takes the last valid value above it
df.head()
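Forward filling depends on row order, which is why the log is sorted by time first. A minimal sketch on an invented four-row log (the column name volume and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

log = pd.DataFrame({'time': [3, 1, 2, 4],
                    'volume': [np.nan, 10.0, np.nan, 5.0]})

# Sort by time so that "previous value" means "earlier in time"
log = log.set_index('time').sort_index()

# Each NaN takes the most recent non-missing value above it
filled = log.ffill()

print(log['volume'].tolist())      # [10.0, nan, nan, 5.0]
print(filled['volume'].tolist())   # [10.0, 10.0, 10.0, 5.0]
```

A missing value in the very first row would stay missing, since there is nothing earlier to copy from.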

 
