===== Introduction =====
In most cases, data must be prepared before analysing or applying some processing methods. There might be different reasons for this, such as missing values, sensor malfunctioning, and similar data quality issues.
Data preprocessing also depends on the data's nature – preprocessing is usually very different for data where the time dimension is essential (time series) and for data where it is not, such as a log of discrete cases for classification.
It must be emphasised that whatever data preprocessing is done needs to be carefully noted and the reasoning behind it explained.
===== " | ===== " | ||
==== Filling the missing data ====
One of the most common situations is missing sensor measurements, which can be filled in using one of the approaches below (a short pandas sketch follows the list):
  * **Random selection** – the method, as suggested by the name, allows randomly selecting one of the possible values of the data field. If the field value list is categorical, a random category is picked; for continuous values, a random value from the observed interval is used.
  * **Informed selection** – the method, in essence, does the same as Random selection, except that additional information on the value distribution of the field (factor) is used. In other words, the most common value might be selected for discrete factor values, while in the case of continuous values, an average value might be chosen.
  * **Value marking** – this approach might be applied in cases where there is a chance that missing data is a consequence of some critical process, for instance, a fault in the engine being monitored. Instead of substituting a plausible value, the missing entries are marked with a dedicated value so that the fact of absence itself remains visible to the analysis.
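The following minimal pandas sketch illustrates the three approaches; the //state// and //temperature// columns, the data, and the marker value are assumptions made only for the example.
<code python>
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "state": ["on", "off", None, "on", None],
    "temperature": [21.5, np.nan, 22.1, np.nan, 21.9],
})

# Random selection: replace each missing entry with a randomly chosen observed value
observed = df["state"].dropna().to_numpy()
df["state_random"] = df["state"].apply(
    lambda v: np.random.choice(observed) if pd.isna(v) else v)

# Informed selection: most common value for categorical, mean for continuous factors
df["state_informed"] = df["state"].fillna(df["state"].mode()[0])
df["temperature_informed"] = df["temperature"].fillna(df["temperature"].mean())

# Value marking: a dedicated marker keeps the fact of absence visible to the analysis
MISSING_MARKER = -9999.0  # assumed marker value
df["temperature_marked"] = df["temperature"].fillna(MISSING_MARKER)

print(df)
</code>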
==== Scaling ====
Scaling is a frequently used method for continuous numerical factors. The main reason is that different factors are observed on different value intervals. It is essential for methods like clustering, where a multi-dimensional Euclidean distance is used and where, in the case of different scales, one of the dimensions might overwhelm the others just because of the higher order of its numerical values.
Usually, scaling is performed by applying a linear transformation of the data with set min and max values, which mark the desired value interval. In most software packages, like Python Pandas ((https://pandas.pydata.org/)), ready-made scaling functions are available; in general, the transformation can be written as follows:
Vnew = (Vold - mmin) * (Imax - Imin) / (mmax - mmin) + Imin
where:\\
Vold – the old measurement\\
Vnew – the new (scaled) measurement\\
mmin – minimum value of the measurement interval\\
mmax – maximum value of the measurement interval\\
Imin – minimum value of the desired interval\\
Imax – maximum value of the desired interval
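A minimal Python sketch of the linear transformation above might look as follows; the //temperature// column and the target interval [0, 10] are illustrative assumptions.
<code python>
import pandas as pd

def rescale(values: pd.Series, i_min: float, i_max: float) -> pd.Series:
    """Linearly map values from [m_min, m_max] onto the desired interval [i_min, i_max]."""
    m_min, m_max = values.min(), values.max()
    return i_min + (values - m_min) * (i_max - i_min) / (m_max - m_min)

df = pd.DataFrame({"temperature": [18.2, 21.5, 25.0, 19.7]})
df["temperature_scaled"] = rescale(df["temperature"], 0.0, 10.0)
print(df)
</code>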
==== Normalisation ====
Normalisation is effective when the data distribution is unknown or known to be non-Gaussian (not following the bell curve of the Gaussian distribution). It is beneficial for data with varying scales, especially when using algorithms that do not assume any specific data distribution.
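Assuming that normalisation here means rescaling each factor to the [0, 1] interval, a short sketch using scikit-learn's MinMaxScaler could look like this; the //humidity// column is an illustrative assumption.
<code python>
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"humidity": [35.0, 42.5, 58.0, 61.2]})
# MinMaxScaler rescales each column to [0, 1] without assuming any distribution
df["humidity_norm"] = MinMaxScaler().fit_transform(df[["humidity"]]).ravel()
print(df)
</code>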
==== Adding dimensions ====
Sometimes, it is necessary to emphasise a particular phenomenon in the data. For instance, it might be very helpful to amplify the changes in the factor value, i.e., values that are more distant from 0 should become even larger, while those closer to 0 should not be raised as much. In this case, applying an exponent function to the factor values, such as squaring them or raising them to a higher power, serves the purpose.
A variation of the technique might be summing up different factor values before or after applying the exponent. In this case, a group of similar values representing the same phenomenon emphasises it. Any other function can be used to represent the specifics of the problem.
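A short sketch of the idea, with hypothetical //vibration_x// and //vibration_y// factors: squaring emphasises values far from 0, and summing the squared factors adds one dimension that represents the common phenomenon.
<code python>
import pandas as pd

df = pd.DataFrame({"vibration_x": [0.1, -0.5, 2.0],
                   "vibration_y": [0.2, 1.5, -1.8]})

# Squaring emphasises values that are far from 0 and leaves small values small
df["vibration_x_sq"] = df["vibration_x"] ** 2

# Summing the squared factors adds one dimension representing the common phenomenon
df["vibration_energy"] = df["vibration_x"] ** 2 + df["vibration_y"] ** 2
print(df)
</code>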
===== Time series =====
Time series usually represent the dynamics of some process, and therefore, the order of the data entries has to be preserved. This means that in most cases, all of the mentioned methods might be used as long as the data order remains the same. A time series is simply a set of data – usually events – arranged by a time marker. Typically, time series are arranged in the order in which events occur or are recorded.
Several essential points must be kept in mind when working with time series (see the ordering sketch after this list):
  * The sequence of events must be followed for any data manipulation.
  * The arrangement of events in time reflects the order of data arrival.
  * The sequence of events reflects the causal relations of this process, which we try to discover through data analysis.
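A minimal ordering sketch, assuming the events carry a //timestamp// column: sorting by the time marker (and keeping it in the data) preserves the sequence for all later manipulations.
<code python>
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 12:05", "2024-01-01 12:00", "2024-01-01 12:10"]),
    "value": [3.2, 3.0, 3.5],
})

# Restore the order of occurrence before any further manipulation
events = events.sort_values("timestamp").reset_index(drop=True)
print(events)
</code>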
=== Time Series Analysis Questions ===
Therefore, there are several questions that data analysis typically tries to answer:
  * Is the process stationary, i.e., do its statistical properties remain unchanged over time?
  * If the process is dynamic, is there a direction of development?
  * Is the process chaotic or regular?
  * Is there periodicity in the dynamics of the process?
  * Are there any regularities between the individual changes of the parameters characterising the process – correlation?
  * Do the dynamics of the process depend on changes in parameters of the external environment that we can influence, i.e., is the process adaptive?
=== Some definitions ===
**Autocorrelation** - A process is autocorrelated if the similarity of the values of a given observation is a function of the time between observations. In other words, the difference between the values of the observations depends on the interval between the observations. This does not mean that the process values are identical but that their differences are similar. The process can equally well be decaying or growing in the mean value or amplitude of the measurements.
**Seasonality** - The process is seasonal if the deviation from the average value is repeated periodically. This does not mean the values must match perfectly, but there must be a general tendency to deviate from the average value regularly. A perfect example is a sinusoid.
**Stationarity** - A process is stationary if its statistical properties do not change over time. Generally, the mean and variance over a period serve as good measures. In practice, a certain tolerance interval is used to tell whether a process is stationary since ideal cases (no noise) do not tend to occur in practice. For example, temperature measurements over several years are stationary and seasonal. The series is not autocorrelated because temperatures are still relatively variable across days. Numerically, stationarity can be assessed by comparing the mean and variance of different segments of the series.
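These properties can also be inspected numerically. The sketch below uses pandas on a synthetic seasonal signal; the signal itself and the split into two segments are assumptions made for illustration.
<code python>
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic seasonal signal: a sinusoid with a small amount of noise
temp = pd.Series(10 + np.sin(np.linspace(0, 20 * np.pi, 1000)) + rng.normal(0, 0.2, 1000))

# Autocorrelation: similarity of values as a function of the lag between them
print(temp.autocorr(lag=1), temp.autocorr(lag=50))

# Rough stationarity check: compare mean and variance of two halves of the series
first, second = temp.iloc[:500], temp.iloc[500:]
print(first.mean(), second.mean(), first.var(), second.var())
</code>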
=== Moving average (sliding average) ===
The essence of the method is to obtain an average value within a particular time window that slides along the time series, thereby smoothing out short-term fluctuations. The smoothed value is computed as:
SMAt = (Xt-M+1 + Xt-M+2 + ... + Xt) / M
where:\\
SMAt - the new smoothed value at time instant t\\
Xi – the ith measurement at time instant i\\
M – time window
To illustrate, consider the effect of time window sizes of 10 and 100 measurements on a noisy incoming signal from a freezer sensor (a short pandas sketch follows the list of observations):
  * First, it must be emphasised that the moving average adds a slight lag to the incoming data, i.e., the rises and falls of the values are slightly behind the original values.
  * In the case of M = 10, the overall shape of the time series is preserved while noise is removed.
  * In the case of M = 100, the time series shape is transformed into a new function, which does not represent the main features of the original measurements. For instance, the rises are replaced by falls and vice versa, while the data spike melts with the coming rise and forms one more significant rise of the signal. So, the result annihilates the initial features of the signal.
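A minimal pandas sketch of the simple moving average with the two window sizes discussed above; the synthetic signal is only a stand-in for the freezer measurements.
<code python>
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic noisy signal standing in for the freezer measurements
signal = pd.Series(np.sin(np.linspace(0, 6 * np.pi, 1000)) + rng.normal(0, 0.3, 1000))

sma_10 = signal.rolling(window=10).mean()    # M = 10: shape preserved, noise removed
sma_100 = signal.rolling(window=100).mean()  # M = 100: over-smoothed, shape distorted
print(sma_10.tail(), sma_100.tail())
</code>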
- | |||
- | <figure Moving average> | ||
- | {{ {{ : | ||
- | < | ||
- | </ | ||
=== Exponential moving average ===
The exponential moving average is widely used in noise filtering, for example, in analysing changes in stock markets. Its main idea is that each measurement's weight decreases with its age, so the most recent measurements influence the smoothed value the most:
EMAt = Alpha * Xt + (1 - Alpha) * EMAt-1
where:\\
EMAt - the new smoothed value at time instant t\\
Xt – the measurement at time instant t\\
Alpha - smoothing factor between 0 and 1, which reflects the weight of the last (most recent) measurement.
As with the simple moving average, the smoothing effect depends on the chosen parameter: Alpha values close to 1 give most of the weight to the most recent measurements and follow the original signal closely, while values close to 0 smooth the signal heavily and introduce a larger lag.
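A minimal pandas sketch of the exponential moving average; pandas' ewm() with adjust=False follows the recursive formula above, and the two Alpha values are illustrative assumptions.
<code python>
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
signal = pd.Series(np.sin(np.linspace(0, 6 * np.pi, 1000)) + rng.normal(0, 0.3, 1000))

# adjust=False applies the recursive form EMAt = Alpha*Xt + (1 - Alpha)*EMAt-1
ema_fast = signal.ewm(alpha=0.3, adjust=False).mean()   # follows recent values closely
ema_slow = signal.ewm(alpha=0.01, adjust=False).mean()  # much stronger smoothing, more lag
print(ema_fast.tail(), ema_slow.tail())
</code>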
=== Decimation ===
Decimation is a technique of excluding some entries from the initial time series to reduce the overwhelming or redundant data. As the name suggests, to reduce the data by 10%, every tenth entry is excluded. It is a simple method that significantly benefits cases of over-measured processes with slow dynamics. With preserved time stamps, the data still allows the application of general time-series analysis techniques like forecasting.
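A short sketch of decimation with pandas; the timestamps and values are synthetic, and both the milder variant (drop every tenth entry) and the stronger one (keep only every tenth entry) are shown.
<code python>
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="s"),
    "value": range(100),
})

reduced_10pct = df[df.index % 10 != 9]  # drop every tenth entry: about 10 % fewer rows
every_tenth = df.iloc[::10]             # keep only every tenth entry: about 90 % fewer rows
print(len(df), len(reduced_10pct), len(every_tenth))
</code>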