Data Cleansing – Data Types

Data Science Basics

Data Cleansing – Data Types

Data Cleansing is the process of correcting and adjusting your data (and data types) in the attempt to improve it’s useability. In other words, the way your data is collected, stored, and read directly affects its’ quality. As such, all Data Scientists perform some level of data cleansing after reading in the data in the attempt to increase its’ usability. Below we describe some of the most common areas to watch out for.

Numeric Data

Numeric Data are commonly represented as either Integers or Floating point numbers. Additionally, the Data Type influences the number’s precision. When working with numeric data, it is important to understand the nature of the attribute you are working with to determine which is the best fit. As an example, suppose you are counting the number of students in a classroom at any point in time. In this case, since the number of students are always whole number, representing this as an Integer makes a lot of sense. On the other hand, suppose you are measuring the temperature outside at any given point in time. On this occasion, using a floating point number gives more precision.

Boolean

Boolean data types by definition can take on one of two values, in other words True or False. Consequently, the representation of these two concepts can vary. As an illustration, in some occasions these values can be stored as a text “True” or “False”. In other times, such as during One Hot Encoding, it could also be stored as a 1 (for True) and 0 (for False). Moreover other notations could also arise, such as Yes vs No, Y vs N, etc. You get the point. For this reason it is important to know when your data is a boolean and how it is stored. As it turns out you may need to clean and convert the data into your desired representation.

Date & Time

Anybody who has worked with Dates will know globally there is no unified way to write dates and times. Some countries use MM/DD/YYYY whilst others prefer YYYY-MM-DD. Some places use a 12 hour clock, in contrast others may use a 24 hour clock. This variability all makes working with dates & times slightly tricky.

String

Data written in the form of text are commonly stored as “Strings”. These can be any information represented by alphanumeric characters and symbols. Some of the most common text strings include Names, Addresses, Places, etc. Right out of the box, text stored in Python are either an “Object” datatype, or a String datatype. Using the latter has several benefits such as being able to leverage functions for text manipulation. For example, you could make use of pre-built functions to remove leading & trailing blank spaces.

Data Cleansing – Data Types Example

In our example I have an address book with information about some friends and would like to further analyze the data. Upon reading in my address book the data looks like below

import pandas as pd
mydata = pd.read_csv("addressbook.csv")
mydata.head()
Data Cleansing - Data Types - Initial view of our data
mydata.dtypes
Name                    object
 Phone Number           object
 Email Address          object
 Number of Siblings    float64
 Height                float64
 Birthday               object
 Married?                int64
dtype: object

Notice in our address book, there are different information about each contact. Not only do we have the names of our contact, we also have their corresponding phone number, email addresses, etc.

First off we look at the “Number of Siblings” field and quickly realize the values are stored as a floating point number. Since we know siblings are always whole numbers, we can convert these from floating point numbers to integers.

# Convert FLOAT to INT
mydata['Number of Siblings']=mydata['Number of Siblings'].astype("int8")
mydata.head()
Data Cleansing - Data Types - Converting Float to Int

Secondly, we look at the Birthday and see it is currently stored as an “Object”. Converting the data type into a Date would allow us to further process dates such as time zone conversions, etc. Making use of the format attribute, we can give the proper formatting. In the example below we use the YYYY-MM-DD format.

mydata['Birthday']=pd.to_datetime(mydata['Birthday'], format="%Y-%m-%d")
mydata.dtypes
Name                          object
Phone Number                  object
Email Address                 object
Number of Siblings              int8
Height                       float64
Birthday              datetime64[ns]
Married?                       int64
dtype: object

Thirdly, we look at the “Married?” column and decide this is a boolean stored as 0s and 1s. Converting this into a boolean datatype can be done as below:

mydata['Married?']=mydata['Married?'].astype("boolean")
print(mydata.dtypes)
mydata.head()
Name                          object
Phone Number                  object
Email Address                 object
Number of Siblings              int8
Height                       float64
Birthday              datetime64[ns]
Married?                     boolean
dtype: object
Data Cleansing - Data Types - Cleaning up Boolean

Notice that our “Married?” column is now converted into True/False values as opposed to 0s and 1s.

Finally, we also want to clean up the Name, Phone Number, and Email Addresses. These are stored as text and hence we want to convert these into string data type.

mydata[['Name','Phone Number','Email Address']]=mydata[['Name','Phone Number','Email Address']].astype('string')
mydata.dtypes
Name                          string
Phone Number                  string
Email Address                 string
Number of Siblings              int8
Height                       float64
Birthday              datetime64[ns]
Married?                     boolean
dtype: object

Summary – Data Cleansing Data Type

In summary we have gone over four of the most common data types and cleansing you will encounter. As a recap, we looked at Floating Point numbers and Integers, Text/String data, Booleans, as well as Dates. Stay tuned for our next posting where we will cover more topics related to Data Cleansing.

Logo 100x100 About Alan Wong…
Alan is a part time Digital enthusiast and full time innovator who believes in freedom for all via Digital Transformation. 
兼職人工智能愛好者,全職企業家利用數碼科技釋放潛能與自由。

References

Python strftime format codes

Pandas to_date function documentation

Pandas astype documentation

Leave a Reply