v String Munging In Dataframe - Python

String Munging In Dataframe

import modules

import pandas as pd
import numpy as np
import re as re

Create dataframe

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 
        'email': ['jas203@gmail.com', 'momomolly@gmail.com', np.NAN, 'battler@milner.com', 'Ames1234@yahoo.com'], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'email', 'preTestScore', 'postTestScore'])
df
first_name last_name email preTestScore postTestScore
0 Jason Miller jas203@gmail.com 4 25
1 Molly Jacobson momomolly@gmail.com 24 94
2 Tina Ali NaN 31 57
3 Jake Milner battler@milner.com 2 62
4 Amy Cooze Ames1234@yahoo.com 3 70

Which strings in the email column contains 'gmail'

df['email'].str.contains('gmail')
0     True
1     True
2      NaN
3    False
4    False
Name: email, dtype: object

Create a regular expression pattern that breaks apart emails

pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

Find everything in df.email that contains that pattern

df['email'].str.findall(pattern, flags=re.IGNORECASE)
0       [(jas203, gmail, com)]
1    [(momomolly, gmail, com)]
2                          NaN
3     [(battler, milner, com)]
4     [(Ames1234, yahoo, com)]
Name: email, dtype: object

Create a pandas series containing the email elements

matches = df['email'].str.match(pattern, flags=re.IGNORECASE)
matches
/Users/chrisralbon/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.
  if __name__ == '__main__':





0       (jas203, gmail, com)
1    (momomolly, gmail, com)
2                        NaN
3     (battler, milner, com)
4     (Ames1234, yahoo, com)
Name: email, dtype: object

Select the domains of the df.email

matches.str[1]
0     gmail
1     gmail
2       NaN
3    milner
4     yahoo
Name: email, dtype: object