Blog Details

Hey guys! In this blog post, we will take a step-by-step tour to learn how we can manipulate the timestamps using the Python data analysis library - Pandas. Pandas is a fast, flexible, and easy to use open-source data analysis and manipulation tool built on top of the Python programming language.

 

At Outline India (hereafter, OI), we extensively use the SurveyCTO data collection platform to code our paper-based tools in digital format to conduct various field surveys. After each completed survey, OI researchers use the raw dataset file (in .csv document format) with multiple variables to perform the analysis using Pandas. Every raw dataset generated using SurveyCTO contains 3 mandatory timestamp variables – Submission Date, start time, and end time.

 


In this blog post, we will enquire about the various steps involved, from installing the pandas to carrying out the need-based data-time manipulation using in-built timestamp functions in the Pandas library.

 

So let's get started. 

Step 1: Install the Pandas and Tabulate package

Before you go any further, make sure you have installed the Python and PIP package. Carry out the following command on the windows default command-line interpreter to install the Pandas and tabulate package,

pip install pandas
pip install tabulate

Step 2: Import the Pandas and Tabulate library

We will import the Pandas and tabulate library and use this to read the raw dataset file and perform manipulations,

import pandas as pd
from tabulate import tabulate
VIEW RAWIMPORT_PANDAS.PY hosted with ❤ by GITHUB

Step 3: Read the dataset file with timestamps

Now, it's time to read the raw dataset (.csv file) using Pandas to perform the various timestamp manipulations.

import pandas as pd
from tabulate import tabulate
# read the csv file
df = pd.read_csv("raw_data.csv")
# displaying the DataFrame
print(tabulate(df, headers = 'keys', tablefmt = 'psql'))
VIEW RAWREAD_CSV_FILE.PY hosted with ❤ by GITHUB

 

This gives us the output as follows,


As we can see in the output, the mentioned timestamp columns are in string format. To read the timestamp columns in DateTime type, we need to parse the mentioned columns using parse_dates parameter while reading the file.

# read the csv file and parse the dates
df = pd.read_csv("raw_data.csv", parse_dates=['SubmissionDate', 'starttime', 'endtime'])
VIEW RAWPARSE_DATES.PY hosted with ❤ by GITHUB

The code will give us the following output:




Step 4: Perform timestamp manipulations

We can now perform any manipulation on the given timestamp columns based on our need. 

a) Remove time from the timestamp 
We can remove the time from the timestamp columns using,
# Remove Time from SubmissionDate Column
df['Date'] = pd.to_datetime(df['SubmissionDate']).dt.date
print(tabulate(df[['Date', 'SubmissionDate']], headers = 'keys', tablefmt = 'psql'))
VIEW RAWREMOVE_TIME.PY hosted with ❤ by GITHUB

The code will remove the time from the submission date column.


b) Select rows with a specific date

Suppose we want to select the rows with greater or less than a particular date. We will use the inbuild loc function, which helps access a group of rows and columns by label(s) and logical operations.

# Select the dataframe greater than mentioned date
start_date = '2020-12-04'
df['Date'] = (df['Date']).astype('datetime64[D]')
#select only particular date data
df = df.loc[((df['Date'] >= start_date))]
# Prin the dataframe
print(tabulate(df[['SubmissionDate', 'user_id', 'gadget']], headers = 'keys', tablefmt = 'psql'))
VIEW RAWMENTIONED_DATE.PY hosted with ❤ by GITHUB

 

As we can see from the following output, rows with a date greater than 2020-12-04 in the Date column are selected.



If we want to select the rows within some particular start and end date, this can be done using the following code:

# Set start and end date
start_date = '2020-12-04'
end_date = '2020-12-07'
# Define datatype
df['Date'] = (df['Date']).astype('datetime64[D]')
#select only particular date data
df = df.loc[((df['Date'] >= start_date) & (df['Date'] <= end_date))]
# Print the dataframe
print(tabulate(df[['SubmissionDate', 'user_id', 'gadget']], headers = 'keys', tablefmt = 'psql'))
VIEW RAWMENTIONED_DATE1.PY hosted with ❤ by GITHUB

We get the given output,


c) Converting date to respective day name

The pandas give us the functionality to convert the date into the particular day name. We use the dt.day_name() function to convert the date into the respective day.

# Convert Date to respective day name
df['Date'] = (df['Date']).astype('datetime64[D]')
# convert Date to day name
df['Day'] = df['Date'].dt.day_name()
df['Date'] = pd.to_datetime(df['Date']).dt.date
# Print the dataframe
print(tabulate(df[['Date', 'Day', 'user_id', 'gadget']], headers = 'keys', tablefmt = 'psql'))
VIEW RAWDATE_TO_DAY.PY hosted with ❤ by GITHUB




d) Increase date by days, months, and years

The Pandas offset function provides the functionality to increase the date by days, months, and years.   

# Increase Date to days, months and years
df['original_date'] = df['Date']
# increase date by 1
df['increase_date'] = df['Date'] + pd.tseries.offsets.DateOffset(days=1)
# increase date by 3 month
df['increase_month'] = df['Date'] + pd.tseries.offsets.DateOffset(months=3)
# increase date by 2 year
df['increase_year'] = df['Date'] + pd.tseries.offsets.DateOffset(years=2)
# Remove time and keep date only
df['increase_date'] = pd.to_datetime(df['increase_date']).dt.date
df['increase_month'] = pd.to_datetime(df['increase_month']).dt.date
df['increase_year'] = pd.to_datetime(df['increase_year']).dt.date
# Print the dataframe
print(tabulate(df[['original_date', 'increase_date', 'increase_month', 'increase_year']], headers = 'keys', tablefmt = 'psql'))
VIEW RAWINCREASE_DATE.PY hosted with ❤ by GITHUB

This gives the output as,



We can also perform the date manipulation based on the day name. If we want the date to increase by 2, if the current day is Saturday and increase by one if it's Sunday.

 

# Convert Date to respective day name
df['Date'] = (df['Date']).astype('datetime64[D]')
# convert Date to day name
df['old_day'] = df['Date'].dt.day_name()
df['old_date'] = pd.to_datetime(df['Date']).dt.date
df['new_date'] = pd.to_datetime(df['old_date']).dt.date
# increase day by 1 if Sunday and by 2 if Saturday
df.loc[df['old_day'] == 'Sunday', 'new_date'] = df['old_date'] + pd.tseries.offsets.DateOffset(days=1)
df.loc[df['old_day'] == 'Saturday', 'new_date'] = df['old_date'] + pd.tseries.offsets.DateOffset(days=2)
df['new_date'] = pd.to_datetime(df['new_date']).dt.date
df['new_day'] = df['new_date'].astype('datetime64[D]').dt.day_name()
# Print the dataframe
print(tabulate(df[['old_date', 'old_day', 'new_date', 'new_day']], headers = 'keys', tablefmt = 'psql'))


This gives the following output:




e) Extract only time from the timestamp

Now let's talk about time manipulation. If we want to extract the time from the timestamp column, we can use the dt.time function just like date extraction performed in the last sections.

# read the csv file and parse the dates
df = pd.read_csv("raw_data.csv", parse_dates=['SubmissionDate', 'starttime', 'endtime'])
# Remove Date from SubmissionDate Column
df['Time'] = pd.to_datetime(df['SubmissionDate']).dt.time
print(tabulate(df[['Time', 'SubmissionDate']], headers = 'keys', tablefmt = 'psql'))
VIEW RAWTIME.PY hosted with ❤ by GITHUB


This gives us the following output:




f) Increase time by Hours, Minutes, and Seconds

Pandas .tseries functionality can be used to increase the Hours, Minutes, and Seconds in the timestamps.

# read the csv file and parse the dates
df = pd.read_csv("raw_data.csv", parse_dates=['SubmissionDate', 'starttime', 'endtime'])
df['Time'] = pd.to_datetime(df['SubmissionDate']).dt.time
#print(df['Time'].dtype)
# increase hours by 2
df['increased_hours'] = df['SubmissionDate'] + pd.tseries.offsets.DateOffset(hours=2)
# increase minutes by 20
df['increased_minutes'] = df['SubmissionDate'] + pd.tseries.offsets.DateOffset(minutes=20)
# increase seconds by 40
df['increased_seconds'] = df['SubmissionDate'] + pd.tseries.offsets.DateOffset(seconds=40)
# Convert respective columns in datetime format and extract time only
df['increased_hours'] = pd.to_datetime(df['increased_hours']).dt.time
df['increased_minutes'] = pd.to_datetime(df['increased_minutes']).dt.time
df['increased_seconds'] = pd.to_datetime(df['increased_seconds']).dt.time
df['original_time'] = pd.to_datetime(df['SubmissionDate']).dt.time
# Print the dataframe
print(tabulate(df[['original_time', 'increased_hours', 'increased_minutes', 'increased_seconds']], headers = 'keys', tablefmt = 'psql'))
VIEW RAWTIME_INCREASE.PY hosted with ❤ by GITHUB

This gives the output as,


ENDNOTES

So, folks, That's all for this blog. This article was written as the first of a series using Pandas to manipulate the timestamps. More are in line. Thank you if you kept reading until the end.

Complete code can be assessed at

 HTTPS://GITHUB.COM/GURJOTSINGHMAHI/SURVEYCTO_TIMESTAMP

Newsletters