One of the more common operations I come across when cleaning up data for analysis is a mapping transformation. It is useful when you want to clean up known dirty data or just transform the data so it is easier to read. Another reason, and probably the most compelling case for the mapping transformation, is the need to convert features to numeric values for scikit-learn models. That is the example I am going to show in this post.

I am going to be using the training data from the Titanic Kaggle Competition. I have created a feature titled ‘formOfAddress’ by scraping various forms of address from the names of the passengers in the training set. This example picks up after the feature creation.
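The feature creation itself is not covered in this post, so the following is only a rough sketch of one way such a column could be derived, not the exact code used here. It assumes the `Name` column follows the "Surname, Title. Given names" pattern of the Titanic training set. The regular expression and the lowercase labels it produces are illustrative, so they will not necessarily match the keys used in the mapping (which has 'misses', for instance).

```python
import pandas as pd

# Three names in the format used by the Titanic training set.
df = pd.DataFrame({'Name': [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
]})

# Capture the text between the comma and the first period, e.g. "Mr",
# then normalise it to lowercase. The pattern is an assumption for this sketch.
df['formOfAddress'] = (
    df['Name']
    .str.extract(r',\s*([^.]+)\.', expand=False)
    .str.strip()
    .str.lower()
)

print(df['formOfAddress'].tolist())
```

Names that do not fit the pattern come back as missing values, which is exactly the case the mapping has to deal with later.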

import pandas as pd

df = pd.read_csv('train.csv')

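# Integer code for each form of address
form_of_address = {
    'mr': 0,
    'mrs': 1,
    'misses': 2,
    'master': 3,
    None: -1,  # sentinel for passengers with no form of address
}

# The mapping: returns a new Series and leaves df unchanged
df['formOfAddress'].map(form_of_address)

# If you want to set your column to reflect the new data
df['formOfAddress'] = df['formOfAddress'].map(form_of_address)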

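Most of this should be pretty straightforward. One of the more interesting parts of this example is handling NAs with a None key in the dictionary. When the missing entries match that key, they are mapped to -1 directly, which saves you from having to call fillna() on the Series after the mapping transformation. It is still worth checking the result with isna().sum(), because any value that matches no key at all comes back as NaN.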
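If some values slip past the dictionary, the fillna() route is the fallback. Here is a minimal sketch of that alternative on a toy Series standing in for the 'formOfAddress' column. The sample values (including the unmapped 'dr') are made up for illustration, and the None key is deliberately left out of the dictionary to show what happens without it.

```python
import pandas as pd

# Toy stand-in for the 'formOfAddress' column (values are illustrative).
titles = pd.Series(['mr', 'mrs', 'dr', None, 'master'])

# Same codes as in the post, but without the None key.
form_of_address = {'mr': 0, 'mrs': 1, 'misses': 2, 'master': 3}

# Labels absent from the dict ('dr') and missing entries both become NaN.
mapped = titles.map(form_of_address)
print(mapped.isna().sum())

# Fallback: replace whatever is still NaN with the sentinel, then cast back to int.
encoded = mapped.fillna(-1).astype(int)
print(encoded.tolist())
```

The cast back to int is needed because the NaN values force the mapped Series to a float dtype.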

You can read more about the map method in the pandas Series.map documentation.