
DATA CLEANING SERIES: A GUIDE TO DATA CLEANING

Data cleaning is about understanding both the data at hand and the scope of the study under consideration. Done with proper methods and planning, it is a straightforward exercise. It is vital to start from the basics and build your way up.

 

Things to Remember

 

The first and foremost thing to keep in mind when working with multiple datasets, or multiple copies of the same dataset, is file naming. It is easy to get swamped by a sea of complicated and huge master databases. The approach we follow is to note down not only the date of creation of each file but also the number of data points it contains. This is especially useful in case the data needs to be split up for any reason. For more clarity, save your files in dated folders to keep track of your daily work.

 

It is also imperative to keep a tab on the number of observations in the database. Hence, a rule of thumb when dealing with data is that the count is of utmost importance! (Also, always subtract 1 from the count of observations in a single column generated in Excel to account for the first row with variable names, unless you want to spend 20 minutes trying to find the "missing" data!)
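For Stata users, a quick way to keep tabs on the observation count is the count command; a minimal sketch, assuming a hypothetical identifier variable called household_id:

count
count if !missing(household_id)

The first line reports the total number of observations, and the second reports how many of them have a non-missing identifier.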

 

Every beginner in the world of data cleaning wonders what tool would be the best for data cleaning. From experience, we realised that Stata, R, and Excel are capable of performing the basic checks discussed in this article. Ultimately, the choice of the tool depends on how comfortable you are with it and how accessible it is. 

 

The aforementioned points should be kept in mind while dealing with any kind of data and can make the data cleaning exercise more efficient and straightforward. 

 

Things to look out for 

 

Almost all primary datasets have a unique identifier attached to each observation. This can be a unique ID, the name of the respondent, or another related variable. These are key variables for the examination and analysis of the data, since the information that we want to understand is contained at the unit level. However, duplication is a common issue when dealing with household-level data. A duplicate signifies either multiple visits to the same household or entry of the wrong ID for different households.

 

A two-step approach should be followed to make corrections for duplicates: 

 

Step 1: Identification 

We first need to identify the duplicate values in the database. The unique identifier is the key variable to be used for this purpose. Finding duplicates of numeric or alpha-numeric IDs can be done using simple commands in Stata (the duplicates command) or in Excel (the highlight duplicates conditional-formatting rule). It is possible that a household was revisited because the respondent was unavailable on the first visit (consent will be recorded as "no" for such a survey). In this case, the entry is not a duplicate and may be controlled for during the analysis.
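As a minimal sketch in Stata, assuming the unique identifier is stored in a variable named household_id, the duplicates command can both summarise and list the repeated IDs:

duplicates report household_id
duplicates list household_id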

 

Using the respondent name as an identifier comes with some caveats. An obvious issue is that two or more people can have the same name. In this case, the survey entries should be compared to ascertain whether duplicate values have been recorded or not. It is advisable to compare more than one variable to check for duplication. Key variables to compare are personal details furnished by the respondent, such as address, age, education level, marital status, and so on.
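One way to do this in Stata, using hypothetical variable names, is to tag repeated names and then browse the personal details of the tagged observations side by side:

duplicates tag respondent_name, generate(name_dup)
browse respondent_name address age edu marital_status if name_dup > 0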

 

Step 2: Rectification 

 

Having identified the duplicate values in the database, a decision needs to be taken on which of the multiple recordings to keep. Complete surveys containing information on the vital parameters of the study should always take precedence over alternative or incomplete entries.
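If the entries to be discarded have been marked during this review, say in a hypothetical indicator variable named drop_flag, the removal itself is a single line in Stata:

drop if drop_flag == 1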

 

After completing the aforementioned steps, the new dataset will contain unique observations, and any further cleaning of the database has to be carried out after removing the duplicate values.

 

An efficient way to study the dataset is to go through it column-wise. It is imperative to know which question in the survey tool each variable represents, along with any relevant validity constraints.

 

The next thing to look out for is typing errors in the dataset. These can occur in entry fields for names and addresses, or in numeric entries for multiple-choice questions. For example, a "don't know" response may be coded as "999", but the response entry may contain "99" or "9" instead. Skimming through the filtered values of the particular column in Excel is an easy way to spot typing errors. Another approach is the tabulate command in Stata, which generates a table listing all the recorded entries of a variable along with their frequencies. Typing errors can be spotted in this list.
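For instance, assuming a multiple-choice response stored in a variable named dk_response, the following command lists every recorded code and its frequency, so stray entries such as 99 or 9 stand out immediately:

tabulate dk_response, missing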

 

Another issue that can come up is erroneous negative numeric entries. They can be identified using the methodology delineated above for typing errors. For example, calculated fields such as total spending or total earnings can contain negative numbers that must be flagged. These fields are computed automatically from responses given in the survey. Say we ask the respondent the number of days they worked in the last month and their average daily earnings. The survey platform automatically calculates total earnings by multiplying the number of days worked by the average daily earnings for each respondent. However, sometimes a respondent may not remember, or may not want to answer, these questions. In such cases, if "Do not remember" has been coded as -777, the calculated field for total earnings will contain an erroneous value. This is illustrated below.

 

Respondent | Number of Days of Work | Average Daily Earning | Total Earnings | Status
A | 21 | 500 | 10500 |
B | 15 | -777 | -11655 | Flag
C | -777 | -777 | 603729 | Flag
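A minimal Stata sketch for flagging such cases, assuming the variables are named respondent_id, days_worked, daily_earning and total_earning, is shown below. Checking all three fields also catches the case where two negative codes multiply into a large positive number, as for respondent C:

generate flag = (days_worked < 0 | daily_earning < 0 | total_earning < 0)
list respondent_id days_worked daily_earning total_earning if flag == 1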

 

In a survey, there are cases where personal opinions are recorded. These can correspond to perceptions about an issue, or simply reasons for non-availability or refusal. Such opinions will, most of the time, be recorded in the local language of the respondent, or as approximate translations entered by the enumerator. The appropriate way to deal with such inconsistencies is to consider the target users of the dataset and translate accordingly. We recommend writing the translated answers in a new column next to the original entry to maintain the authenticity of the data collection exercise. To quote an example, the entry "pair mein dard" may be translated to "pain in legs" (in another column) for the question asking what diseases the respondent is currently suffering from.
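In Stata, assuming the response is stored as a string variable named disease, this can be done by copying the original variable into a new column and replacing only the entries that need translation:

generate disease_eng = disease
replace disease_eng = "pain in legs" if disease == "pair mein dard"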

 

There is a very thin line between data cleaning and data analysis. While one may perceive replacements to be a task performed by the data cleaner, the reality is that the data cleaner ensures that the data is consistent, of good quality, and in a ready-to-use state for the analysis team. Replacing missing data or outlier values is done in tandem with the analysis of the dataset, which ensures that the replacements suit the purpose of the study.

 

Recommendations for Stata users

 

Users of Stata know how easy it is to perform basic checks on a dataset. The commands tabulate, summarize, and duplicates, when combined with conditions, come in handy for any kind of database. To illustrate, suppose that out of 505 respondents some consented to the survey and some did not. To see the number of respondents who consented to the survey, split between males and females, the following tabulate command may be used. Here, a consent value of 1 corresponds to "yes".
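A sketch of this command, assuming the variables are named gender and consent, would be:

tabulate gender if consent == 1

This lists the number of male and female respondents among those who consented.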

 

The summarize command is helpful when you want to look at descriptive statistics (average, range, standard deviation, etc.) for a numeric variable such as rainfall, age, income, or cholesterol level. It also helps detect outlier entries in the variable.
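For example, assuming a numeric variable named income, adding the detail option reports percentiles and extreme values that make outliers easier to spot:

summarize income, detail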

 

The duplicates command can be used to list and tag duplicate entries for any variable. The tagging exercise generates a new variable that takes the value 1 if the observation has one duplicate, 2 if it has two duplicates, and so on, and takes the value 0 if the observation has no duplicates. This variable is useful for identifying, ordering, and studying the duplicate values in the dataset.

 

To list duplicates for a variable, use: duplicates list variable_name

To tag duplicates for a variable use: duplicates tag variable_name, generate(new_variable)

 

Use the generate command to create dummies wherever possible. Dummy variables are useful when you want to apply multiple conditions on one or more variables. For example, suppose we want to understand the newspaper reading habits of males who are over 25 years of age, have higher education, and live in state A. We start by generating a dummy variable to identify these respondents in the dataset, using the following set of commands. For gender, 1 corresponds to male, and for education (edu), 4 corresponds to higher education.

 

generate a = 1 if gender == 1 & state == "A" & edu == 4 & age > 25

tabulate newspaper_var if a == 1

 

The first command tags the observations for which all the conditions are satisfied. The second lists the responses to the newspaper variable for the identified group of individuals. When carrying out your analysis, we recommend using the two-step approach of identification and rectification described above for duplicate values, as it is vital to examine the nature of errors in the dataset before proceeding with the rectification exercise.

 

Automating the cleaning process by creating do-files, which can be written for a small section of the master database and then replicated, can make our lives a lot easier and the data cleaning exercise more fun. Remember that writing Stata commands is like writing sentences, but in Stata's language. It is advisable to keep your commands as simple, and your do-file as self-explanatory, as possible.
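A minimal do-file sketch along these lines, with purely illustrative file and variable names, might look like this:

* clean_section1.do - basic checks on one section of the master database
use "survey_section1.dta", clear
count
duplicates report household_id
tabulate consent, missing
summarize income, detail
save "survey_section1_clean.dta", replace

Once the commands behave as expected on this section, the same do-file can be run on the full master database.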

 

Conclusion

 

Notwithstanding how exciting one may find data cleaning to be, the best way to clean a dataset is to minimise the possibility of receiving incorrect, irrelevant or missing data in the first place. As an agency that collects data from the ground, we make our surveys as foolproof as possible and train the enumerators to collect quality data. Moreover, the data cleaning exercise complements data collection and monitoring. For instance, for a survey spanning a few months, the initial sets of data received from the field can shed light on where the data is subpar and reveal the kind and extent of errors the enumerators are making. Such monitoring allows for early detection and speedy action to correct further data collection.

 

With an ever-growing dependence on data for policy-making, there is an immediate requirement to standardise the protocols for cleaning and maintaining databases. This series is a small step in that direction.

 

 Ashmika Gouchwal is a Quantitative Researcher at Outline India. Himanshi Sharma is a Research Associate at Outline India. 


HOW TO CONDUCT FIELD TRAINING

Effective data collection necessitates extensive field training, which in turn is a highly detail-oriented process. The following points are communicated to the enumerators as part of their field training:


URBAN PLANNING AND THE SLUMS IN DELHI

The journey from a refugee colony to a global city

Lying on the banks of the ancient river Yamuna, New Delhi is one of the fastest-growing urban agglomerations in the world and the seat of power of the Indian government[1]. Delhi has witnessed a meteoric rise in population from the 1960s to the contemporary era. The city served as an important refugee settlement post-partition, which led to the creation of the first master plan, which sought to rehabilitate refugee families into different spaces. These became the earliest versions of the present-day authorized housing colonies of post-partition Delhi and of the contemporary city that we know today. Authorization of colonies and access to state services have played a crucial role in the lives of Delhi's residents. Gaining access to public services was a significant step for an inhabitant to lay claim as a citizen of the city. The laws around claiming residency and citizenship have changed over time and have played a vital role in shaping the spatial development of Delhi.

The first master plan of Delhi legitimized the refugee resident colonies, demarcated areas for official spaces, and paved the way for future residential projects. A significant move was to shift hazardous chemical-based factories to the fringes of the city; these were later pushed further to areas such as Bawana, Mongolpuri and Mayapuri, to name a few. As industrial areas burgeoned, a significant population of workers who resided in slums had to move closer to the factories. This played into a shrewdly developed move to clear slum areas out of the city space and onto the peripheries. The process brings the ambivalent nature of the city to the fore, as a section of the populace is deemed unfit to be a part of it.

Unauthorized colonies, slum areas and Jhuggi Jhopri clusters have been as much a part of Delhi as the regularized colonies. They serve as residences for migrant populations who come from different parts of the country in the hope of finding employment and, perhaps, a better quality of life. The people in these slums and JJ clusters are part of the humongous informal sector that performs vital ancillary tasks for the city, often employed as maids, housekeepers and helpers in various commercial and residential spaces.

Fast forward to 2019: Delhi has been built on two significant master plans focused on rapid infrastructural development, further pushed by the city's increased visibility on the global stage towards the last decade of the previous century. With the advent of the liberalization, privatization and globalization era, Delhi was vying for the spot of a global city. It is with that goal in focus that the current Master Plan 2021 aims to change the landscape of the city into a slum-free one. This decision jeopardizes the lives of around 1.4 million people still living in such settlements, and their fate remains in limbo, without a tangible solution in sight.

The Slum dweller in an ambivalent city environment and the right to the city

The slum in the context of Delhi has been defined with very specific parameters; in fact, a slum has more legitimacy than a jhuggi jhopri camp (squatter settlement). Often, within a slum area there will be land designated as a Jhuggi Jhopri camp, and it is this particular area that we focus on: regardless of the state and private bodies vying to gain control over the slum areas, the judiciary stepped in to give rights to slum dwellers; this, however, does not include the people who stay in Jhuggi Jhopri clusters.

Majboor Basti JJ camp is located in the Mandawali-Fazalpur colony in the eastern part of Delhi. The residents here highlight the problems of access and acceptance commonly witnessed in areas of Delhi where Jhuggi Jhopri camps are prominent. Interestingly, the electricity supply was made permanent thanks to the entry of a private sector power company which, in order to sustain its business, has provided nearly all areas with a permanent connection, regardless of the registry status of the houses. This may seem like a step up for the residents of the JJ camp; however, it does little to prevent the state from exercising its right to remove and relocate them at a moment's notice.

A significant chunk of Majboor Basti's residents are third-generation inhabitants who were born and raised in Delhi. However, in this city environment they are still treated as alien residents, not part of the city space; the gentrified population in the gated and authorized colonies often sees these jhuggis as hot-spots of crime and nuisance, unhygienic conditions, and cesspools of disease. Enforcement agencies patrol these locations, and complaints related to theft are often traced back to these areas. With the aim of becoming a truly global city, how do we address these issues and find a co-creative space?

Moving away from Aesthetics

City planners, in the grand blueprint of Master Plan 2021, have tried to give Delhi a makeover, at least aesthetically. The notion of a slum-free city also stems from the desire of both public and private sector real estate developers, who see these slums and JJ clusters as land that has been illegally encroached upon, without acknowledging the profits they hope to gain by removing these people from places where multiple generations have resided, albeit under a constant threat of eviction. Perhaps this is part of the reason why the jhuggi residents have never taken ownership of the space they stay in: they know it has become impossible for them to exercise any right or control over their residential status.

The way forward constitutes re-designing the master plan by moving away from a superficial aesthetic model[2]. Participation can be a key tool to facilitate development in the city, and participatory processes can be made inclusive by ensuring equal representation from different areas. This may not end the spatial problem, but it will lay the foundation for an inclusive legislative process by ensuring that the "illegitimate population" of the city has a greater say in decisions concerning their lives. The populations in the unauthorized areas and JJ clusters have played, and still play, a huge role in making Delhi a diverse place, and giving them the right to be part of the decision-making process may ensure a far greater sense of ownership and security, which would be the first true step in the transition to a global city.

 


[1] Major Agglomerations of the World - statistics and charts in maps, diagrams and tables. (n.d.). Retrieved from http://citypopulation.de/world/Agglomerations.html

[2] Ghertner, D. A. (2015). Rule by Aesthetics: World-Class City Making in Delhi. New York: Oxford University Press.


'KASHMIR' THROUGH THE LENS OF HALF-WIDOWS

“Gar firdaus bar-rue zamin ast, hami asto, hamin asto, hamin ast.” (If there is a heaven on earth, it's here, it's here, it’s here…)

The above phrase of Amir Khusrao often resonates with and reminds one of the picturesque beauty of Kashmir, alluring ardent travellers to visit this 'abode of God'. As a young solivagant researcher, one who revels in wandering alone through destinations she has not previously visited, I felt a sudden frisson of excitement and thrill when my field induction site was finalised as Kashmir.

Chilla-i-Kalan[1] embraced me as soon as I reached Srinagar. The valley had turned into an unbelievable dream, covered in a blanket of snow. I was welcomed, or given 'Khatir madarat' as Khala would say, by partially frozen pine trees and snow-clad peaks against the backdrop of Lal Chowk.

After I reached 'home' for the next two months, Khala (the mother of a dear friend with whom I stayed in Kashmir) quickly gave me a hot water bottle and a Kanger[2] to fight the cold. Mesmerised as I was by the snow-capped peaks of the Himalayas in the distance, I couldn't help but notice a group of people, largely women, sitting on the road with placards and pictures of 'men'. On enquiring, I learnt that these women take part in sit-ins every month in Srinagar to seek the whereabouts of their loved ones who have been subjected to enforced disappearance. Enforced disappearance is a phenomenon in which a person is secretly abducted or imprisoned by a state or political organization, or by a third party with the authorization, support, or acquiescence of a state or political organization, followed by a refusal to acknowledge the person's fate and whereabouts, with the intent of placing the victim outside the protection of the law.

The picturesque Kashmir Valley, often acclaimed as paradise on Earth, has witnessed the death of more than 90,000 people and the disappearance of 8,000 people in the last thirty years. Of those who disappeared, a little more than 1,500 were married. Hence, it is not wrong to assume that at least 1,500 'half widows' or 'half wives' have been residing in the valley since 1989.


INDIA DEVELOPMENT REVIEW: IMPORTANCE OF MULTI-STAKEHOLDER ENGAGEMENT IN DATA COLLECTION

Outline India (OI) undertook a nutrition study in association with a global funding agency and its local implementation partner. The study was conducted in Jehanabad district of Bihar. The respondents were SHG women associated with the implementation agency who have children between the ages of six months and two years. Data collection was initially scheduled for March but was eventually conducted in the first week of June 2019 due to the implementation of the Code of Conduct for the general elections in India. This had some unforeseen consequences for operationalizing the study, which were resolved thanks to the strong relationship built by the research team with the multiple ground-level stakeholders involved with the implementation agency.


FIELD STORIES FROM RURAL INDIA: THE VOICES THAT MATTER

Fieldwork in India can be a daunting as well as an exhilarating experience. My first field experience was a bit of both. Our team conducted a study on the lives of young women in rural India and the issues they face. Needless to say, given the cultural context in India, these issues are topics of absolute taboo. Facilitating a nuanced conversation with the women about the intricacies of their health and the precautions they take is a task in itself. But moving beyond this challenge was in many ways necessary, as it gave us a peek into the lives of those who do not have the power of a voice. This story is about the voiceless, for whom decisions are taken without their consent. This story is about the women of rural India.

