

Data Preparation for Data Analysis

Introduction

In most cases, data must be prepared before analysis or other processing methods can be applied. There are many possible reasons for this, for instance, missing values, sensor malfunctions, different time scales, different units, a specific format required by a given method or algorithm, and more. Data preparation is therefore as necessary as the analysis itself. While data preparation is usually very specific to a given problem, some general cases and preprocessing tasks prove useful across many of them. Data preprocessing also depends on the data's nature: preprocessing is usually very different for data where the time dimension is essential (time series) than for data where it is not, such as a log of discrete cases for classification with no internal causal dependencies among entries. It must be emphasised that whatever preprocessing is performed, it needs to be carefully documented, and the reasoning behind it must be explained so that others can understand the results acquired during the analysis.

"Static data"

Some of the methods explained here might also be applied to time series, but this must be done with full awareness of the possible implications. Usually, the data should be formatted as a table whose rows represent data entries or events and whose fields represent features of the entry. For instance, a row might represent a room climate data entry, with fields (factors) such as air temperature, humidity level, CO2 level and other vital measurements. For the sake of simplicity, this chapter assumes that data is formatted as a table, as in the sketch below.
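As a minimal illustration of such a table built with pandas (the column names and values below are invented for this example):

  import pandas as pd

  # Illustrative room-climate log: each row is one entry (event),
  # each column (field/factor) is a feature of that entry.
  data = pd.DataFrame({
      "temperature_c": [21.4, 21.9, 22.3],          # air temperature
      "humidity_pct":  [40.1, 41.5, 39.8],          # relative humidity
      "co2_ppm":       [415, 430, 452],             # CO2 concentration
      "mode":          ["auto", "auto", "manual"],  # operation mode (categorical)
  })

  print(data)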

Filling the missing data

One of the most common situations is missing sensor measurements, which might be caused by communication channel issues, IoT node malfunctions or other reasons. Since most data analysis methods require complete entries, it is necessary to ensure that all data fields are present before applying them. Several common approaches to dealing with missing values are listed below, followed by a short code sketch:

  • Random selection – as the name suggests, the method randomly selects one of the possible values of the data field. If the field is categorical, representing a limited set of possible values, for instance a set of colours or operation modes, one value from the list is selected at random. In the case of a continuous value, a random value from the observed interval is selected. Thanks to its simplicity, the method is a convenient way to fill gaps when the fraction of missing values is insignificant. If the fraction of missing values is significant, the method should not be applied because of its implications for the data analysis.
  • Informed selection – in essence, this does the same as random selection, except that additional information on the distribution of the field's (factor's) values is used. In other words, the most common value might be selected for discrete factors, while for continuous values an average might be selected in accordance with the distribution characteristics. More complex situations exist that cannot be described by a Gaussian distribution; in those cases, the data analyst needs to make an informed decision on the particular selection mechanism, reflecting the specifics of the distribution.
  • Value marking – this approach might be applied in cases where there is a chance that the missing data is a consequence of some critical process; for instance, whenever the engine's temperature reaches a critical value, the pressure sensor stops functioning due to overheating. Whether or not the analyst knows about the issue, it is essential to mark those situations so that possible causalities can be found in the data. If the factor is categorical, a dedicated new category, like “empty”, might be introduced. In the case of continuous values, a dedicated “impossible” value might be assigned, such as the maximum integer value, the minimum integer value, zero, or similar.
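The following is a minimal sketch of all three approaches using pandas and NumPy (the column names, the “empty” category and the -9999.0 sentinel are assumptions made for this illustration):

  import numpy as np
  import pandas as pd

  rng = np.random.default_rng(42)  # fixed seed so the fill-in is reproducible

  df = pd.DataFrame({
      "mode":   ["auto", None, "manual", "auto", None],  # categorical factor
      "temp_c": [21.4, np.nan, 22.3, np.nan, 21.9],      # continuous factor
  })

  # 1) Random selection: draw a replacement from the observed categories
  #    (discrete) or from the observed interval (continuous).
  observed_modes = df["mode"].dropna().unique()
  df["mode_random"] = df["mode"].map(
      lambda v: rng.choice(observed_modes) if pd.isna(v) else v
  )
  low, high = df["temp_c"].min(), df["temp_c"].max()
  df["temp_random"] = df["temp_c"].map(
      lambda v: rng.uniform(low, high) if pd.isna(v) else v
  )

  # 2) Informed selection: the most common category, or the mean of a
  #    roughly Gaussian continuous factor.
  df["mode_informed"] = df["mode"].fillna(df["mode"].mode()[0])
  df["temp_informed"] = df["temp_c"].fillna(df["temp_c"].mean())

  # 3) Value marking: flag the gap with a dedicated category or an
  #    "impossible" sentinel so later analysis can spot the pattern.
  df["mode_marked"] = df["mode"].fillna("empty")
  df["temp_marked"] = df["temp_c"].fillna(-9999.0)

  print(df)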

Scaling

Scaling is a frequently used method for continuous numerical factors. The main reason is that different factors are observed on different value intervals. This is essential for methods like clustering, where a multi-dimensional Euclidean distance is used: if the scales differ, one dimension might overwhelm the others simply because its numerical values are of a higher order of magnitude. Usually, scaling is performed by applying a linear transformation of the data with set minimum and maximum values, which mark the desired value interval. In most software packages, like Python Pandas [1], scaling is simple to perform. However, it might also be done manually if needed:

Figure 1: Scaling
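A minimal sketch of manual min-max scaling with pandas (the helper name min_max_scale and the example columns are invented for this illustration; packaged equivalents, such as scikit-learn's MinMaxScaler, provide the same transformation):

  import pandas as pd

  def min_max_scale(series: pd.Series,
                    new_min: float = 0.0,
                    new_max: float = 1.0) -> pd.Series:
      """Linearly map a numeric series onto [new_min, new_max]."""
      old_min, old_max = series.min(), series.max()
      return (series - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

  df = pd.DataFrame({
      "temp_c":  [18.0, 21.0, 24.0, 30.0],  # values of order 10^1
      "co2_ppm": [400, 650, 900, 1400],     # values of order 10^3
  })

  # Bring both factors onto a common [0, 1] interval so that neither
  # dominates a Euclidean distance purely by numerical magnitude.
  scaled = df.apply(min_max_scale)
  print(scaled)

After this transformation both columns lie in the same [0, 1] interval, so a distance-based method such as clustering treats them on an equal footing.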