Notes On Using Data Science & Machine Learning To Fight For Things That Matter
I am the Director of Machine Learning at the Wikimedia Foundation. I have spent over a decade applying statistical learning, artificial intelligence, and software engineering to political, social, and humanitarian efforts.
Learning machine learning? Check out my Machine Learning Flashcards and my book, Machine Learning With Python Cookbook.
Notes - explanations, ideas, and lessons learned
Machine Learning
Engineering Management
Self
Code Tutorials - short annotated coding guides
Machine Learning
Basics
- Loading Features From Dictionaries
- Loading scikit-learn's Boston Housing Dataset
- Loading scikit-learn's Digits Dataset
- Loading scikit-learn's Iris Dataset
- Make Simulated Data For Classification
- Make Simulated Data For Clustering
- Make Simulated Data For Regression
- Perceptron In Scikit
- Saving Machine Learning Models
Vectors, Matrices, And Arrays
- Adding And Subtracting Matrices
- Apply Operations To Elements
- Calculate Dot Product Of Two Vectors
- Calculate The Average, Variance, And Standard Deviation
- Calculate The Determinant Of A Matrix
- Calculate The Trace Of A Matrix
- Converting A Dictionary Into A Matrix
- Create A Matrix
- Create A Sparse Matrix
- Create A Vector
- Describe An Array
- Find The Maximum And Minimum
- Find The Rank Of A Matrix
- Flatten A Matrix
- Getting The Diagonal Of A Matrix
- Invert A Matrix
- Reshape An Array
- Selecting Elements In An Array
- Transpose A Vector Or Matrix
Preprocessing Structured Data
- Convert Pandas Categorical Data For Scikit-Learn
- Delete Observations With Missing Values
- Deleting Missing Values
- Detecting Outliers
- Discretize Features
- Encoding Ordinal Categorical Features
- Handling Imbalanced Classes With Downsampling
- Handling Imbalanced Classes With Upsampling
- Handling Outliers
- Impute Missing Values With Means
- Imputing Missing Class Labels
- Imputing Missing Class Labels Using k-Nearest Neighbors
- Normalizing Observations
- One-Hot Encode Features With Multiple Labels
- One-Hot Encode Nominal Categorical Features
- Preprocessing Categorical Features
- Preprocessing Iris Data
- Rescale A Feature
- Standardize A Feature
Preprocessing Images
Preprocessing Dates And Times
- Break Up Dates And Times Into Multiple Features
- Calculate Difference Between Dates And Times
- Convert pandas Columns Time Zone
- Convert Strings To Dates
- Encode Days Of The Week
- Handling Missing Values In Time Series
- Handling Time Zones
- Lag A Time Feature
- Rolling Time Window
- Select Date And Time Ranges
Feature Engineering
- Dimensionality Reduction On Sparse Feature Matrix
- Dimensionality Reduction With Kernel PCA
- Dimensionality Reduction With PCA
- Feature Extraction With PCA
- Group Observations Using K-Means Clustering
- Selecting The Best Number Of Components For LDA
- Selecting The Best Number Of Components For TSVD
- Using Linear Discriminant Analysis For Dimensionality Reduction
Model Evaluation
- Accuracy
- Create Baseline Classification Model
- Create Baseline Regression Model
- Cross Validation Pipeline
- Cross Validation With Parameter Tuning Using Grid Search
- Cross-Validation
- Custom Performance Metric
- F1 Score
- Generate Text Reports On Performance
- Nested Cross Validation
- Plot The Learning Curve
- Plot The Receiver Operating Characteristic Curve
- Plot The Validation Curve
- Precision
- Recall
- Split Data Into Training And Test Sets
Trees And Forests
- Outlier Detection With Isolation Forests
- Adaboost Classifier
- Decision Tree Classifier
- Decision Tree Regression
- Feature Importance
- Feature Selection Using Random Forest
- Handle Imbalanced Classes In Random Forest
- Random Forest Classifier
- Random Forest Classifier Example
- Random Forest Regression
- Select Important Features In Random Forest
- Titanic Competition With Random Forest
- Visualize A Decision Tree
Deep Learning
Keras
- Adding Dropout
- Convolutional Neural Network
- Feedforward Neural Network For Binary Classification
- Feedforward Neural Network For Multiclass Classification
- Feedforward Neural Networks For Regression
- k-Fold Cross-Validating Neural Networks
- LSTM Recurrent Neural Network
- Neural Network Early Stopping
- Neural Network Weight Regularization
- Preprocessing Data For Neural Networks
- Save Model Training Progress
- Tuning Neural Network Hyperparameters
- Visualize Loss History
- Visualize Neural Network Architecture
- Visualize Performance History
Python
Basics
- Using Iterable As Function Arguments
- Handling Long Lines Of Code
- Tuples Vs. Named Tuples
- Append Using The Operator
- Function Example
- List All Files Of Certain Type In A Directory
- Add Padding Around String
- All Combinations For A List Of Objects
- any(), all(), max(), min(), sum()
- Apply Operations Over Items In A List
- Applying Functions To List Items
- Arithmetic Basics
- Assignment Operators
- Basic Operations With NumPy Array
- Breaking Up String Variables
- Brute Force D20 Roll Simulator
- Cartesian Product
- Chain Together Lists
- Cleaning Text
- Compare Two Dictionaries
- Concurrent Processing
- Continue And Break Loops
- Convert HTML Characters To Strings
- Converting Strings To Datetime
- Create A New File Then Write To It
- Create A Temporary File
- Data Structure Basics
- Date And Time Basics
- Dictionary Basics
- Display JSON
- Display Scientific Notation As Floats
- Exiting A Loop
- Find The Max Value In A Dictionary
- Flatten Lists Of Lists
- For Loop
- Formatting Numbers
- Function Annotation Examples
- Function Basics
- Functions Vs. Generators
- Generating Random Numbers With NumPy
- Generator Expressions
- Hard Wrapping Text
- How To Use Default Dicts
- if and if else
- If Else On Any Or All Elements
- Indexing And Slicing NumPy Arrays
- Iterate An Ifelse Over A List
- Iterate Over Multiple Lists Simultaneously
- Iterating Over Dictionary Keys
- Lambda Functions
- Logical Operations
- Looping Over Two Lists
- Mathematical Operations
- Mocking Functions
- Nested For Loops Using List Comprehension
- Nesting Lists
- Numpy Array Basics
- Parallel Processing
- Partial Function Applications
- Priority Queues
- Queues And Stacks
- Recursive Functions
- repr vs. str
- Scheduling Jobs In The Future
- Select Random Element From A List
- Selecting Items In A List With Filters
- Set The Color Of A Matplotlib Plot
- Sort A List Of Names By Last Name
- Sort A List Of Strings By Length
- Store API Credentials For Open Source Projects
- String Formatting
- String Indexing
- String Operations
- Swapping Variable Values
- Try, Except, and Finally
- Unpacking A Tuple
- Unpacking Function Arguments
- Use Command Line Arguments In A Function
- Using Named Tuples To Store Data
- while Statement
Data Wrangling
- Columns Shared By Two Data Frames
- Apply Functions By Group In Pandas
- Apply Operations To Groups In Pandas
- Applying Operations Over pandas Dataframes
- Assign A New Column To A Pandas DataFrame
- Break A List Into N-Sized Chunks
- Breaking Up A String Into Columns Using Regex In pandas
- Construct A Dictionary From Multiple Lists
- Convert A Categorical Variable Into Dummy Variables
- Convert A CSV Into Python Code To Recreate It
- Convert A String Categorical Variable To A Numeric Variable
- Convert A Variable To A Time Variable In pandas
- Count Values In Pandas Dataframe
- Create a Column Based on a Conditional in pandas
- Create A pandas Column With A For Loop
- Create A Pipeline In Pandas
- Create Counts Of Items
- Creating Lists From Dictionary Keys And Values
- Crosstabs In pandas
- Delete Duplicates In pandas
- Descriptive Statistics For pandas Dataframe
- Dropping Rows And Columns In pandas Dataframe
- Enumerate A List
- Expand Cells Containing Lists Into Their Own Variables In Pandas
- Filter pandas Dataframes
- Find Largest Value In A Dataframe Column
- Find Unique Values In Pandas Dataframes
- Geocoding And Reverse Geocoding
- Geolocate A City And Country
- Geolocate A City Or Country
- Group A Time Series With pandas
- Group Data By Time
- Group Pandas Data By Hour Of The Day
- Grouping Rows In pandas
- Hierarchical Data In pandas
- Join And Merge Pandas Dataframe
- List Unique Values In A pandas Column
- Load A JSON File Into Pandas
- Load An Excel File Into Pandas
- Load Excel Spreadsheet As pandas Dataframe
- Loading A CSV Into pandas
- Long To Wide Format
- Lower Case Column Names In Pandas Dataframe
- Make New Columns Using Functions
- Map External Values To Dataframe Values in pandas
- Missing Data In pandas Dataframes
- Moving Averages In pandas
- Normalize A Column In pandas
- pandas Data Structures
- pandas Time Series Basics
- Pivot Tables In pandas
- Quickly Change A Column Of Strings In Pandas
- Random Sampling Dataframe
- Ranking Rows Of Pandas Dataframes
- Regular Expression Basics
- Regular Expression By Example
- Reindexing pandas Series And Dataframes
- Rename Column Headers In pandas
- Rename Multiple pandas Dataframe Column Names
- Replacing Values In pandas
- Saving A pandas Dataframe As A CSV
- Search A pandas Column For A Value
- Select Rows When Columns Contain Certain Values
- Select Rows With A Certain Value
- Select Rows With Multiple Filters
- Selecting pandas DataFrame Rows Based On Conditions
- Simple Example Dataframes In pandas
- Sorting Rows In pandas Dataframes
- Split Lat/Long Coordinate Variables Into Separate Variables
- Streaming Data Pipeline
- String Munging In Dataframe
- Using List Comprehensions With pandas
- Using Seaborn To Visualize A pandas Dataframe
Data Visualization
- Back To Back Bar Plot In MatPlotLib
- Bar Plot In MatPlotLib
- Color Palettes in Seaborn
- Creating A Time Series Plot With Seaborn And pandas
- Creating Scatterplots With Seaborn
- Group Bar Plot In MatPlotLib
- Histograms In MatPlotLib
- Making A Matplotlib Scatterplot From A Pandas Dataframe
- Matplotlib, A Simple Example
- Pie Chart In MatPlotLib
- Scatterplot In MatPlotLib
- Stacked Percentage Bar Plot In MatPlotLib
Logging
Statistics
Basics
Scala
- Break A Sequence Into Groups
- Change Data Type
- Chunk Sequence In Equal Sized Groups
- Compare Two Floats
- Create A Range
- Extract Substrings Using Regex
- Filter A Sequence
- Find Largest Key Or Value In A Map
- Flatten Sequence Of Sequences
- For Loop A Map
- For Looping
- Format Numbers As Currency
- If Else
- Increment And Decrement Numbers
- Insert Variables Into Strings
- Iterate Over A Map
- Loop A Collection
- Make Numbers Pretty
- Mapping A Function To A Collection
- Matching Conditions
- Mutable Maps
- N Dimension Arrays
- Partial Functions
- Random Integer Between Two Values
- Replacing Parts Of Strings
- Search A Map
- Search Strings
- Search Strings Using Regex
- Set Operations On Sequences
- Sorting Sequences
- Split Strings
- Try, Catch, Finally
- Variables And Values
- Zip Together Two Lists
Regular Expressions
- Match A Symbol
- Match A Unicode Character
- Match A Word
- Match Any Character
- Match Any Of A List Of Characters
- Match Any Of A Series Of Options
- Match Any Of A Series Of Words
- Match Dates
- Match Email Addresses
- Match Exact Text
- Match Integers Of Any Length
- Match Text Between HTML Tags
- Match Times
- Match URLs
- Match US and UK Spellings
- Match US Phone Numbers
- Match Words With A Certain Ending
- Match ZIP Codes
Snowflake
Basics
Tables
PostgreSQL
Basics
- Create PostgreSQL Database With Python
- Apply Operation To Column
- Compare Values To Subquery
- Copy Rows From One Table To Another
- Count Rows
- Count Unique Values
- Create Column Index
- Create Subquery
- Create View
- Delete View
- Examine A Query
- Group Rows
- Group Rows With Conditions
- If Else
- List Index Columns
- List Tables In Database
- Rename Columns In Views
- Replace Missing Values
- Retrieve Only A Few Rows
- Retrieve Random Subset Of Rows
- Retrieve Row
- Retrieve Rows Based On Condition
- Retrieve Rows Based On Multiple Condition
- Retrieve Subset Of Columns
- Retrieving Missing Values
- Save Queries As Variables
- Select Highest Value In Each Group
- Select Values Between Two Values
- Sort Rows
- Sort Rows In Groups
- Test If Rows Exist In Subquery
- Use Column Aliases With Where Clause
- Value Matches Element Of A List
- View Unique Values
Add, Delete, Change Rows
- Add Column
- Change Values
- Create Column Aliases
- Create Column Conditional On Another Column
- Create Column Of Values
- Create Primary Key
- Delete All Rows
- Delete Duplicates
- Delete Primary Key
- Delete Rows
- Delete Rows That Don't Exist In Another Table
- Export To CSV
- Import CSV
- Insert Rows
- Update Rows Based On Another Table
AWS
Linux Command Line
Basics
- Close A Program
- End Standard Input Entry
- Copy Files And Directories
- Delete Files And Directories
- Delete Files And Directories In Current Directory
- Modify File Permissions
- Move Files And Directories
- Rename File
- See Disk Drive Space
- View Disk Information
- Archive And Unarchive Files
- Change Permissions
- Changing Directories
- Check Current Date And Time
- Create Command
- Create Directory
- Create File
- Create Sequential List Of Files And Directories
- Create Symbolic Links
- Exit Terminal Session
- Get Help With A Command
- Get Information On A File
- List Available Commands
- List The Contents Of A Directory
- Multiple Commands On One Line
- Ping Website
- See Free Memory
- See Who Is Logged Into A System
- Select Files Based On Filename
- Synchronize Files And Directories
- Track Route Of Network Traffic
- View A File's Type
- View A Text File's Contents
- View Current Working Directory
- View First And Last Parts Of Files
- View The Version Of A Package
- Zip And Unzip Directories
- Zip And Unzip Files
Shell Scripts
Git And GitHub
Machine Learning Engineering
Docker
Command Line
- Automatically Generate Human-Readable Container Names
- Automatically Restart Containers
- Connect Container's Filesystem To Computer's Filesystem
- Create A Container
- Create An Image
- Create Read-Only Filesystems In Containers
- Export All Files And Folders Out Of A Container
- Get Bash Shell In A Container
- Inspect A Container
- Inspect An Image
- List Containers
- Publish To Docker Hub
- Pull An Image From A Repository
- Remove An Image
- Rename A Container
- Restart A Container
- Run A Detached Container
- Save A Container's Bash History
- Saving An Image As A File
- Set Container To Run A Bash Command On Start
- Start A Container
- Stop A Container
- Use Environment Variables
- View All Changes To A Container
- View Container Logs
- View Image Size
- Watch Container Logs Live
- Work In A Container
Dockerfiles
- Add A File From A URL To Images
- Add A Volume
- Add Comments
- Add Environment Variables
- Add Files And Folders To Images
- Add Metadata
- Expose A Port
- Ignore Files While Building
- Run Command When Container Starts
- Run Command While Building Image
- Run Commands As A User
- Run Many Commands While Building Image
- Set A Default Working Directory
- Set Default Working Directory