String Munging In Dataframe
Import modules
import pandas as pd
import numpy as np
import re
Create dataframe
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            # Tina has no email, so her value is missing (np.nan)
            'email': ['jas203@gmail.com', 'momomolly@gmail.com', np.nan, 'battler@milner.com', 'Ames1234@yahoo.com'],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'email', 'preTestScore', 'postTestScore'])
df
| | first_name | last_name | email | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | jas203@gmail.com | 4 | 25 |
| 1 | Molly | Jacobson | momomolly@gmail.com | 24 | 94 |
| 2 | Tina | Ali | NaN | 31 | 57 |
| 3 | Jake | Milner | battler@milner.com | 2 | 62 |
| 4 | Amy | Cooze | Ames1234@yahoo.com | 3 | 70 |
Which strings in the email column contain 'gmail'?
df['email'].str.contains('gmail')
0 True
1 True
2 NaN
3 False
4 False
Name: email, dtype: object
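The NaN in row 2 means this result cannot be used directly as a boolean mask. A minimal sketch of one way around that, assuming a missing email should simply count as "not gmail": pass na=False to str.contains. The emails series below is a stand-in for df['email'], rebuilt so the snippet runs on its own; is_gmail and gmail_only are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in for df['email'], rebuilt here so the snippet is self-contained
emails = pd.Series(['jas203@gmail.com', 'momomolly@gmail.com', np.nan,
                    'battler@milner.com', 'Ames1234@yahoo.com'], name='email')

# na=False treats missing values as "no match", so the result is purely boolean
is_gmail = emails.str.contains('gmail', na=False)

# The mask can now be used to filter the series (or the rows of a DataFrame)
gmail_only = emails[is_gmail]
print(gmail_only.tolist())
```

Whether a missing value should count as a non-match depends on the analysis; na=True or dropping the rows first are equally valid choices.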
Create a regular expression pattern that breaks apart emails
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
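To see what the three capture groups pick out, it can help to try the pattern on a single string with Python's re module before applying it to the whole column. This is only a sketch on one sample address; the names m, user, domain and tld are illustrative.

```python
import re

# Same pattern as above, written as a raw string
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# re.IGNORECASE lets the uppercase character classes match lowercase text too
m = re.search(pattern, 'jas203@gmail.com', flags=re.IGNORECASE)

# Each pair of parentheses in the pattern becomes one captured group:
# the part before the @, the domain name, and the top-level domain
user, domain, tld = m.groups()
print(user, domain, tld)
```

Note that {2,4} limits the last group to two to four letters, so longer top-level domains would not be captured in full.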
Find everything in df.email that contains that pattern
df['email'].str.findall(pattern, flags=re.IGNORECASE)
0 [(jas203, gmail, com)]
1 [(momomolly, gmail, com)]
2 NaN
3 [(battler, milner, com)]
4 [(Ames1234, yahoo, com)]
Name: email, dtype: object
Create a pandas Series containing the email elements
# str.match now returns booleans, so take the first findall result to get each tuple
matches = df['email'].str.findall(pattern, flags=re.IGNORECASE).str[0]
matches
0 (jas203, gmail, com)
1 (momomolly, gmail, com)
2 NaN
3 (battler, milner, com)
4 (Ames1234, yahoo, com)
Name: email, dtype: object
Select the domain of each email in df.email
matches.str[1]
0 gmail
1 gmail
2 NaN
3 milner
4 yahoo
Name: email, dtype: object
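An alternative to indexing into tuples is str.extract, which returns one column per capture group. The sketch below assumes named groups are acceptable; the group names user, domain and tld are my own labels, not part of the original pattern, and emails is again a stand-in for df['email'].

```python
import re

import numpy as np
import pandas as pd

# Stand-in for df['email'], rebuilt here so the snippet is self-contained
emails = pd.Series(['jas203@gmail.com', 'momomolly@gmail.com', np.nan,
                    'battler@milner.com', 'Ames1234@yahoo.com'], name='email')

# Named groups (?P<name>...) become the column names of the extracted DataFrame
named_pattern = (r'(?P<user>[A-Z0-9._%+-]+)@'
                 r'(?P<domain>[A-Z0-9.-]+)\.'
                 r'(?P<tld>[A-Z]{2,4})')

# One row per email, one column per group; the missing email stays NaN
parts = emails.str.extract(named_pattern, flags=re.IGNORECASE)
print(parts['domain'].tolist())
```

Because the result is a DataFrame, the pieces can be joined back onto df or selected by name instead of by position.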