Cleaning in the Query

data science
analyses
data issues
Author

Tony Dunsworth, Ph.D.

Published

September 8, 2025

When I started actively working in data science, like everyone else, I heard the old maxim that a data scientist spends about 80% of their time cleaning data. Well, I can report that in many cases, it’s true. I’m facing a lot of that right now as I build a new dataset for some new reports and dashboards I’m designing. I’m starting the data cleaning process in the SQL statement itself, because I believe it’s a lot faster in the end. Once I have the query where I want it, many of the following steps are pretty easy.

For this dataset, I have three tables from which I will pull data. The first is the master incident table, which contains most of the data that I need for my reports. The second is an extension of the first table. I only want one column from it and, honestly, I don’t know how often I would use it; however, for a few types of analyses, it could be very handy. The third table is a log of activities in a call and the details associated with each activity. Most of the data is fairly straightforward. Some of the derived data, however, isn’t as straightforward as I would like. That data consists of the deltas between timestamps, which we use for operational assessment.
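To make the shape of that pull concrete, here is a rough skeleton of the three-table join. The table and column names are placeholders I’ve invented for illustration; the vendor’s actual schema will differ.

```sql
-- Rough skeleton of the three-table pull.
-- Table and column names are illustrative placeholders, not the vendor's schema.
SELECT
    mi.incident_number,
    mi.response_date,
    ext.extra_detail,              -- the single column pulled from the extension table
    al.activity_code,
    al.activity_timestamp
FROM master_incident AS mi
LEFT JOIN master_incident_ext AS ext
    ON ext.incident_id = mi.incident_id
LEFT JOIN activity_log AS al
    ON al.incident_id = mi.incident_id;
```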

One of the main reasons I want to do most of my cleaning in the SQL query itself is that I can control the output and standardize it more easily. I am more confident with SQL at this stage because I’ve been writing SQL in some form for nearly 20 years. Having written a lot of queries, and made them progressively more complex and detailed, it feels more natural to me. I admit, I believe that every data scientist should be fluent in writing SQL queries of all types. I also believe that cleaning in the query will save me a considerable amount of time when I’m building my reports and dashboards. That doesn’t mean there aren’t challenges in cleaning the data before it goes into my dataset. The rest of this post will discuss some of those issues.

When I started writing the first iteration of this query (I’m currently on the seventh), I built it, then and now, on two foundational columns. The first is the incident number, which I can designate as the ID field in my dataset. The second is the incident creation datetime. From that, I create a lot of additional columns that allow me to craft different analyses. I split the datetime apart to use the hour, the day of the week, and the week number. I also derive the shift and how deep into the shift the call comes. This will be used in correlative studies that investigate how call processing times change throughout the shift. Do we see a change in call processing times when we’re working around lunch breaks? Do the response times change as we get deeper into the shift? I’m also trapping the problem type, the discipline of the service call, the way it was received, the final disposition, and, internally, the call taker and dispatcher for each call, along with the priority number. All of these columns are chosen to enable different analyses to be run, so I can learn more about the week that happened and how operations functioned.
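A minimal sketch of that datetime splitting is below. The column names and the 07:00/19:00 shift boundaries are assumptions I’ve made for the example; the real schedule and schema aren’t spelled out here.

```sql
-- Sketch of deriving time-based columns from the incident creation datetime.
-- Column names and the 07:00/19:00 shift boundaries are illustrative assumptions.
SELECT
    mi.incident_number,
    DATEPART(HOUR, mi.response_date)       AS call_hour,
    DATENAME(WEEKDAY, mi.response_date)    AS call_day_of_week,
    DATEPART(ISO_WEEK, mi.response_date)   AS call_week_number,
    CASE
        WHEN DATEPART(HOUR, mi.response_date) BETWEEN 7 AND 18 THEN 'Day'
        ELSE 'Night'
    END                                    AS shift_name,
    -- minutes since the assumed shift start (07:00 or 19:00), wrapping past midnight
    (DATEDIFF(MINUTE, CAST(mi.response_date AS date), mi.response_date)
        + CASE WHEN DATEPART(HOUR, mi.response_date) BETWEEN 7 AND 18
               THEN -420 ELSE -1140 END
        + 1440) % 1440                     AS minutes_into_shift
FROM master_incident AS mi;
```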

Along with all of this information, we trap timestamps for various events in the life of a call. One of the reasons for this is to calculate how long it takes for each event to occur and how different calls compare to the median values. It also allows us to track different parts of the operations process and how things change week to week. This can lay the foundation for determining whether there are correlations between weekly volumes and call processing times. If I find relationships, then I can also build regression models to forecast processing times based on volume, the number of telecommunicators on shift, and other factors. However, that is certainly much further in the future.
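As one example of that kind of comparison, the sketch below computes a call-to-dispatch interval and a weekly median to measure it against. The table and column names are placeholders for illustration.

```sql
-- Compare each call's processing interval to the weekly median.
-- Table and column names are illustrative placeholders.
SELECT
    incident_number,
    DATEDIFF(SECOND, call_start_time, dispatch_time) AS seconds_to_dispatch,
    PERCENTILE_CONT(0.5)
        WITHIN GROUP (ORDER BY DATEDIFF(SECOND, call_start_time, dispatch_time))
        OVER (PARTITION BY DATEPART(ISO_WEEK, call_start_time)) AS weekly_median_seconds
FROM incident_summary;
```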

The problem is deciding which of the available timestamps to use and what to do with them. In the master incident table, there are four different timestamps that could mark the start of the call. The first is the response date, which is actually the date and time that the software records the call in the database. The second is a column listed as the time the clock starts. The third is the date and time that the phone pickup is registered by the software. The final timestamp is when the software records the first keystroke on the call-taking screen. All of these timestamps can be within a few seconds of each other; the problem is that no one of them is consistently the earliest. After looking at several different options, and because this is a Microsoft SQL Server database and needs to use T-SQL, I will have to use a series of CASE expressions to determine the earliest timestamp. If I were using a different database, I might use LEAST() instead. This is not the only place where I will have to deal with this. We also have three timestamps, split between two tables, that could determine the time at which the calltaker has collected enough information from the caller to release the call for dispatching. I will have to use the same approach to find the earliest value that identifies when the call was ready for dispatch.
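Here is a sketch of how that earliest-of-four logic can look with CASE. The column names are placeholders, NULLs would need ISNULL() guards in production, and on engines that support it (including newer versions of SQL Server, as I understand it) LEAST() collapses the whole thing into a single call.

```sql
-- Pick the earliest of four candidate "call start" timestamps with CASE,
-- since no single column is consistently the earliest.
-- Column names are placeholders; NULLs would need ISNULL() guards in production.
SELECT
    mi.incident_number,
    CASE
        WHEN mi.phone_pickup_time <= mi.response_date
         AND mi.phone_pickup_time <= mi.clock_start_time
         AND mi.phone_pickup_time <= mi.first_keystroke_time THEN mi.phone_pickup_time
        WHEN mi.clock_start_time <= mi.response_date
         AND mi.clock_start_time <= mi.first_keystroke_time  THEN mi.clock_start_time
        WHEN mi.first_keystroke_time <= mi.response_date     THEN mi.first_keystroke_time
        ELSE mi.response_date
    END AS call_start_time
FROM master_incident AS mi;
-- On engines that support it, LEAST(response_date, clock_start_time,
--     phone_pickup_time, first_keystroke_time) does the same job in one call.
```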

The other challenge is that the vendor has built some elapsed-time fields into the main table that could be used, but they are all strings. To make any use of them, we have to convert them into an integer value of seconds so we can compare them to the differences between the timestamps we’ve identified for each stage of the call. When I first wrote the conversion, I made a mistake, and it looked as if the two values were way off in a couple of areas. The mistake was in how I handled decimal seconds: my code treated the decimal fraction as whole seconds, the seconds as minutes, and so on. Once I found and corrected the error, everything fell into place, and in most instances the difference between the calculated elapsed time and the documented string is 3 seconds or less. Most of the calls with larger differences turn out to be calls that were reopened.
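A sketch of that conversion is below, assuming the vendor strings follow an hh:mm:ss.ff layout and stay under 24 hours; the real format and column names may differ.

```sql
-- Convert a vendor elapsed-time string such as '00:04:37.52' into whole seconds.
-- Assumes an hh:mm:ss.ff layout under 24 hours; the real vendor format may differ.
-- The decimal fraction is dropped deliberately -- the original bug shifted each
-- unit up one place (fractions read as seconds, seconds as minutes, and so on).
SELECT
    mi.elapsed_string,
    DATEPART(HOUR,   CAST(mi.elapsed_string AS time)) * 3600
  + DATEPART(MINUTE, CAST(mi.elapsed_string AS time)) * 60
  + DATEPART(SECOND, CAST(mi.elapsed_string AS time))      AS elapsed_seconds
FROM master_incident AS mi;
```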

Identifying reopened calls and gauging their impact on our statistics was the reason behind this iteration of this query. I wanted to not only identify which calls were closed once then reopened, but I also wanted to find out how to ameliorate their impact on our reported metrics. This updated query has given me the best of both worlds. I will be able to isolate the reopened calls and analyse them for patterns.

In several places, I still have NULL values. For my non-datetime categorical fields, I substitute descriptive strings for the NULLs. For integer or calculated fields, I use a value of -9999 to flag them; I didn’t want to set the values to 0 because that is a possible legitimate value in several places. Again, this cleaning is a lot simpler and easier to accomplish in the query. In most cases, I use the ISNULL() function to address possible NULL values. By cleaning here, I don’t have to write as much cleaning code later, and I get the advantage of moving directly into my analytical code and getting to the insights hidden in the data.
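In practice that looks something like the snippet below; the column names and the 'Not Recorded' label are placeholders for whatever defaults make sense in the real dataset.

```sql
-- Standardize NULLs at query time: descriptive strings for categorical fields,
-- a -9999 sentinel for numeric or calculated fields (0 is a legitimate value).
-- Column names and default labels are illustrative placeholders.
SELECT
    ISNULL(mi.problem_type, 'Not Recorded')        AS problem_type,
    ISNULL(mi.final_disposition, 'Not Recorded')   AS final_disposition,
    ISNULL(DATEDIFF(SECOND, mi.call_start_time, mi.dispatch_time), -9999)
                                                   AS seconds_to_dispatch
FROM master_incident AS mi;
```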

When I’m done, I have a dataset that has been through an initial cleaning and is ready for the manipulations and transformations that will allow me to find new insights and deliver value. Sure, the query that generates these datasets is kinda heavy. This new version clocks in at just under 600 lines when formatted in a way that is more human-readable, but it runs quickly, returning over 1,000 rows of data in under 3 seconds. I also have a second version that isn’t focused on a weekly output; it allows me to specify dates so I can create larger datasets.

Since I also have a synthetic data generator that I continue to update and support, I can use this query to generate datasets that will allow me to gather more accurate data, fine-tune the distributions, and make certain that my synthetic data is more useful. I will likely use the distfit library to make more accurate assessments of the data and find the best-fitting distributions.

Well, that’s it for this edition. Stay tuned for a different swing through the world of data workers coming soon!!