Hey guys! In this blog post, we will take a step-by-step tour to learn how we can manipulate the timestamps using the Python data analysis library - Pandas. Pandas is a fast, flexible, and easy to use open-source data analysis and manipulation tool built on top of the Python programming language.
At Outline India (hereafter, OI), we extensively use the SurveyCTO data collection platform to code our paper-based tools in digital format to conduct various field surveys. After each completed survey, OI researchers use the raw dataset file (in .csv document format) with multiple variables to perform the analysis using Pandas. Every raw dataset generated using SurveyCTO contains 3 mandatory timestamp variables – Submission Date, start time, and end time.
In this blog post, we will enquire about the various steps involved, from installing the pandas to carrying out the need-based data-time manipulation using in-built timestamp functions in the Pandas library.
So let's get started.
Step 1: Install the Pandas and Tabulate package
Before you go any further, make sure you have installed the Python and PIP package. Carry out the following command on the windows default command-line interpreter to install the Pandas and tabulate package,
pip install pandas | |
pip install tabulate |
Step 2: Import the Pandas and Tabulate library
We will import the Pandas and tabulate library and use this to read the raw dataset file and perform manipulations,
import pandas as pd | |
from tabulate import tabulate |
Step 3: Read the dataset file with timestamps
Now, it's time to read the raw dataset (.csv file) using Pandas to perform the various timestamp manipulations.
import pandas as pd | |
from tabulate import tabulate | |
# read the csv file | |
df = pd.read_csv("raw_data.csv") | |
# displaying the DataFrame | |
print(tabulate(df, headers = 'keys', tablefmt = 'psql')) |
This gives us the output as follows,
As we can see in the output, the mentioned timestamp columns are in string format. To read the timestamp columns in DateTime type, we need to parse the mentioned columns using parse_dates parameter while reading the file.
# read the csv file and parse the dates | |
df = pd.read_csv("raw_data.csv", parse_dates=['SubmissionDate', 'starttime', 'endtime']) |
The code will give us the following output:
Step 4: Perform timestamp manipulations
# Remove Time from SubmissionDate Column | |
df['Date'] = pd.to_datetime(df['SubmissionDate']).dt.date | |
print(tabulate(df[['Date', 'SubmissionDate']], headers = 'keys', tablefmt = 'psql')) |
b) Select rows with a specific date
Suppose we want to select the rows with greater or less than a particular date. We will use the inbuild loc function, which helps access a group of rows and columns by label(s) and logical operations.
# Select the dataframe greater than mentioned date | |
start_date = '2020-12-04' | |
df['Date'] = (df['Date']).astype('datetime64[D]') | |
#select only particular date data | |
df = df.loc[((df['Date'] >= start_date))] | |
# Prin the dataframe | |
print(tabulate(df[['SubmissionDate', 'user_id', 'gadget']], headers = 'keys', tablefmt = 'psql')) |
As we can see from the following output, rows with a date greater than 2020-12-04 in the Date column are selected.
If we want to select the rows within some particular start and end date, this can be done using the following code:
# Set start and end date | |
start_date = '2020-12-04' | |
end_date = '2020-12-07' | |
# Define datatype | |
df['Date'] = (df['Date']).astype('datetime64[D]') | |
#select only particular date data | |
df = df.loc[((df['Date'] >= start_date) & (df['Date'] <= end_date))] | |
# Print the dataframe | |
print(tabulate(df[['SubmissionDate', 'user_id', 'gadget']], headers = 'keys', tablefmt = 'psql')) |
We get the given output,
c) Converting date to respective day name
The pandas give us the functionality to convert the date into the particular day name. We use the dt.day_name() function to convert the date into the respective day.
# Convert Date to respective day name | |
df['Date'] = (df['Date']).astype('datetime64[D]') | |
# convert Date to day name | |
df['Day'] = df['Date'].dt.day_name() | |
df['Date'] = pd.to_datetime(df['Date']).dt.date | |
# Print the dataframe | |
print(tabulate(df[['Date', 'Day', 'user_id', 'gadget']], headers = 'keys', tablefmt = 'psql')) |
d) Increase date by days, months, and years
The Pandas offset function provides the functionality to increase the date by days, months, and years.
# Increase Date to days, months and years | |
df['original_date'] = df['Date'] | |
# increase date by 1 | |
df['increase_date'] = df['Date'] + pd.tseries.offsets.DateOffset(days=1) | |
# increase date by 3 month | |
df['increase_month'] = df['Date'] + pd.tseries.offsets.DateOffset(months=3) | |
# increase date by 2 year | |
df['increase_year'] = df['Date'] + pd.tseries.offsets.DateOffset(years=2) | |
# Remove time and keep date only | |
df['increase_date'] = pd.to_datetime(df['increase_date']).dt.date | |
df['increase_month'] = pd.to_datetime(df['increase_month']).dt.date | |
df['increase_year'] = pd.to_datetime(df['increase_year']).dt.date | |
# Print the dataframe | |
print(tabulate(df[['original_date', 'increase_date', 'increase_month', 'increase_year']], headers = 'keys', tablefmt = 'psql')) |
This gives the output as,
We can also perform the date manipulation based on the day name. If we want the date to increase by 2, if the current day is Saturday and increase by one if it's Sunday.
# Convert Date to respective day name | |
df['Date'] = (df['Date']).astype('datetime64[D]') | |
# convert Date to day name | |
df['old_day'] = df['Date'].dt.day_name() | |
df['old_date'] = pd.to_datetime(df['Date']).dt.date | |
df['new_date'] = pd.to_datetime(df['old_date']).dt.date | |
# increase day by 1 if Sunday and by 2 if Saturday | |
df.loc[df['old_day'] == 'Sunday', 'new_date'] = df['old_date'] + pd.tseries.offsets.DateOffset(days=1) | |
df.loc[df['old_day'] == 'Saturday', 'new_date'] = df['old_date'] + pd.tseries.offsets.DateOffset(days=2) | |
df['new_date'] = pd.to_datetime(df['new_date']).dt.date | |
df['new_day'] = df['new_date'].astype('datetime64[D]').dt.day_name() | |
# Print the dataframe | |
print(tabulate(df[['old_date', 'old_day', 'new_date', 'new_day']], headers = 'keys', tablefmt = 'psql')) |
e) Extract only time from the timestamp
Now let's talk about time manipulation. If we want to extract the time from the timestamp column, we can use the dt.time function just like date extraction performed in the last sections.
# read the csv file and parse the dates | |
df = pd.read_csv("raw_data.csv", parse_dates=['SubmissionDate', 'starttime', 'endtime']) | |
# Remove Date from SubmissionDate Column | |
df['Time'] = pd.to_datetime(df['SubmissionDate']).dt.time | |
print(tabulate(df[['Time', 'SubmissionDate']], headers = 'keys', tablefmt = 'psql')) |
This gives us the following output:
f) Increase time by Hours, Minutes, and Seconds
Pandas .tseries functionality can be used to increase the Hours, Minutes, and Seconds in the timestamps.
# read the csv file and parse the dates | |
df = pd.read_csv("raw_data.csv", parse_dates=['SubmissionDate', 'starttime', 'endtime']) | |
df['Time'] = pd.to_datetime(df['SubmissionDate']).dt.time | |
#print(df['Time'].dtype) | |
# increase hours by 2 | |
df['increased_hours'] = df['SubmissionDate'] + pd.tseries.offsets.DateOffset(hours=2) | |
# increase minutes by 20 | |
df['increased_minutes'] = df['SubmissionDate'] + pd.tseries.offsets.DateOffset(minutes=20) | |
# increase seconds by 40 | |
df['increased_seconds'] = df['SubmissionDate'] + pd.tseries.offsets.DateOffset(seconds=40) | |
# Convert respective columns in datetime format and extract time only | |
df['increased_hours'] = pd.to_datetime(df['increased_hours']).dt.time | |
df['increased_minutes'] = pd.to_datetime(df['increased_minutes']).dt.time | |
df['increased_seconds'] = pd.to_datetime(df['increased_seconds']).dt.time | |
df['original_time'] = pd.to_datetime(df['SubmissionDate']).dt.time | |
# Print the dataframe | |
print(tabulate(df[['original_time', 'increased_hours', 'increased_minutes', 'increased_seconds']], headers = 'keys', tablefmt = 'psql')) |
This gives the output as,
ENDNOTES
So, folks, That's all for this blog. This article was written as the first of a series using Pandas to manipulate the timestamps. More are in line. Thank you if you kept reading until the end.
Complete code can be assessed at
David Angel Makel
IT ConsultantIt is a long established fact that a reader will be distracted by the readable content page looking at its layout point of using normal