====== Data Products Development ======
In the previous chapter, some essential properties of Big Data systems were discussed, as well as how IoT systems relate to them. Developing a data product requires a team covering several distinct roles, which are outlined below.
=== Business user ===
=== Project sponsor ===
He is the one who defines the business problem and triggers the birth of the project. He represents the business interests behind the project and provides the resources needed for it.
=== Project manager ===
=== Business information analyst ===
He possesses deep knowledge of the given business domain, supported by his skills and experience. Therefore, he is a valuable asset for the team in understanding the data's content, origin, and possible meaning. He defines the key performance indicators (KPIs) and metrics used to assess the project's success.
=== Database administrator ===
He is responsible for configuring the development environment and the database (one, many, or a complex distributed system). In most cases, the configuration must meet specific performance requirements.
=== Data engineer ===
Data engineers usually have deep technical knowledge of data manipulation methods and techniques. During the project, the data engineer tunes data manipulation procedures, SQL queries, and memory management, and develops specific stored or server-side procedures. He is responsible for extracting the data from its sources and making it available to the team.
=== Data scientist ===
====== Development Process ======
As might be noticed, the Data Scientist undoubtedly plays a vital role, but only in cooperation with the other roles. Depending on their competencies and capacities, roles might overlap, or several roles might be covered by a single team member.
Once the team is built, the development process can start. As with any other product development, it proceeds through a sequence of well-defined phases.
=== Discovery ===
The project team learns about the problem domain, the problem itself, its structure, and possible data sources, and defines the initial hypothesis.
The phase involves interviewing the stakeholders and other potentially related parties to reach as broad an insight as necessary. During this phase, the problem is framed – the analytical problem, the indicators of success for potential solutions, and the business goals and scope are defined. To understand business needs, the project sponsor is involved in the process from the very beginning. The identified data sources might include external systems or APIs, sensors of different types, static data sources, official statistics, and other vital sources.
One of the primary outcomes of the phase is the Initial Hypothesis (IH), which concisely represents the team's vision of the problem and its potential solution.
Whatever the IH is, it is a much better starting point than defining the hypothesis during the project implementation in later phases.
=== Data preparation ===
The phase focuses on creating a sandbox system by extracting data from its sources, transforming it, and loading it into the sandbox (ETL – Extract, Transform, Load). This is usually the most prolonged phase and can take up to 50% of the total time allocated to the project; unfortunately, it is also often underestimated. It consists of the following steps:
  - **Data analysis sandbox** - A separate analysis environment is prepared so the team can work with copies of the data without interfering with live production systems.
  - **Carrying out ETLs** - The data is retrieved, transformed and loaded back into the sandbox system. Sometimes, simple data filtering excludes outliers and cleans the data. Due to the volume of data, there may be a need for parallelisation of data transfers, which leads to the need for appropriate software and hardware infrastructure. In addition, various web services and interfaces are used to obtain context.
  - **Exploring the content of the data** - The main task is to get to know the content of the extracted data. A data catalogue or vocabulary is created (small projects can skip this step). Data research allows for identifying data gaps and technology flaws, as well as the team's own and extraneous data (for determining responsibilities and limitations).
  - **Data conditioning** - Slicing and combining are the most common actions in this step. The compatibility of data subsets after the performed manipulations is checked to exclude systematic errors – errors that occur as a result of incorrect manipulation (formatting of data, filling in voids, etc.). During this step, the team ensures that the time, metadata, and content match.
  - **Reporting and visualising** - This step uses general visualisation techniques, providing a high-level overview – value distributions, ranges, and other descriptive summaries of the data.
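The ETL and conditioning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the CSV content, sensor names, and the in-memory SQLite sandbox are all assumptions made for the example.

```python
import sqlite3
from io import StringIO

import pandas as pd

# Hypothetical raw sensor export; in practice this would come from the
# identified data sources (files, APIs, sensor gateways).
raw_csv = StringIO("""sensor_id,timestamp,temperature
s1,2024-01-01T00:00,21.4
s1,2024-01-01T01:00,21.9
s2,2024-01-01T00:00,-273.0
s2,2024-01-01T01:00,22.3
""")

# Extract: read the raw data into a DataFrame.
df = pd.read_csv(raw_csv, parse_dates=["timestamp"])

# Transform: drop physically impossible readings (a simple outlier filter).
df = df[df["temperature"] > -50.0]

# Load: store the conditioned data in a SQLite "sandbox" database
# (an in-memory database here; a file path would be used in practice).
con = sqlite3.connect(":memory:")
df.to_sql("temperature_readings", con, index=False, if_exists="replace")

# Reporting step: a quick high-level overview of the loaded data.
print(df["temperature"].describe())
```

Real projects would add parallel transfers and a proper data catalogue, but the extract-filter-load-summarise shape stays the same.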
=== Model planning ===
The main task of the phase is to select model candidates for data clustering, classification or other needs consistent with the Initial Hypothesis from Phase 1. It consists of the following steps:
  - **Exploring data and selecting variables** - The aim is to discover and understand the variables' relationships and to select the most essential variables as candidate model inputs.
  - **Selection of methods or models** - During this step, the team creates a list of methods that match the data and the problem. A typical approach is creating many trim model prototypes using ready-made tools and prototyping packages. Tools typical of the phase might include, but are not limited to, R or Python, SQL and OLAP, Matlab, SPSS, and Excel (for simpler models).
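A minimal sketch of the prototyping idea above, assuming Python with scikit-learn as the prototyping package. The dataset and the candidate list are illustrative assumptions; the point is only that several trim prototypes can be ranked quickly with cross-validation before committing to one.

```python
# Prototype several candidate models on a small dataset and compare them
# with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the project's data

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "k_nearest_neighbours": KNeighborsClassifier(),
}

# Mean accuracy over 5 folds gives a rough ranking of the candidates.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```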
=== Model development ===
During this phase, the initially selected trim models are implemented on a full scale concerning the gathered data. The main question is whether the data is enough to solve the problem. There are several steps to be performed:
  - **Data preparation** - Specific subsets of data are created, such as training, testing, and validation. The data is adjusted to the selected initial data formatting and structuring methods.
  - **Model development** - Usually, conceptually, this is the most straightforward step: the previously selected models are implemented in full and trained on the prepared datasets.
  - **Model testing** - The models shall be operated and tuned using the selected tools and training datasets to optimise the models and ensure their resilience to incoming data variations. All decisions must be documented! This is important because all other team roles require detailed reasoning behind the decisions made.
  - **Key questions to be answered during the phase:**
    * Is the model accurate enough?
    * Are the results obtained meaningful in relation to the objectives set?
    * Do the models make unacceptable mistakes?
    * Is the data enough?
In some areas, false positives are more dangerous than false negatives. For example, targeting systems may inadvertently target "their own".
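During model testing, this concern can be made concrete by inspecting the confusion matrix rather than a single accuracy number. A hedged sketch in Python with scikit-learn follows; the dataset and model are illustrative assumptions:

```python
# Hold out a test split, fit a model, and count false positives and
# false negatives separately, since their costs may differ.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # stand-in binary problem
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)

# For a binary problem, ravel() yields (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(f"false positives: {fp}, false negatives: {fn}")
```

Whether a given `fp`/`fn` balance is acceptable is exactly the kind of decision the text says must be documented for the rest of the team.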
=== Communication ===
During this phase, the results must be compared against the established quality criteria and presented to those involved in the project. It is important not to show any drafts outside the group of data scientists: the methods used are too complex for most of those involved, which leads to incorrect conclusions and unnecessary communication overhead for the team. Usually, the team is biased against accepting results that falsify its hypotheses, taking it too personally. However, it is the data that has led the team to the conclusions, so the results should be communicated as objectively as possible.
=== Operationalisation ===
The results presented are first integrated into a pilot project before full-scale implementation, which allows the solution and its risks to be assessed on a small scale first.
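One simple way to hand a trained model over to a pilot system is to serialise it. A minimal sketch, assuming Python with scikit-learn and the standard `pickle` module; the model and data are illustrative assumptions:

```python
# Train a model, serialise it as a hand-over artefact, and reload it
# inside the (hypothetical) pilot system to score new data.
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hand-over artefact for the pilot deployment (would be a file in practice).
blob = pickle.dumps(model)

# ...later, inside the pilot system:
deployed = pickle.loads(blob)
prediction = deployed.predict(X[:1])
print(prediction)
```

In a real pilot, the artefact would be versioned and wrapped in a service interface, but the train/serialise/reload/score cycle is the core of the hand-over.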
Expectations for each of the roles during this phase:
  * **Business user:** Identifiable benefits of the model for the business.
  * **Project sponsor:** Return on investment (ROI) and impact on the business as a whole – how to highlight it outside the organisation or to other businesses.
  * **Project manager:** Completing the project on time and within the planned budget.
  * **Business information analyst:** Fulfilment of the defined KPIs and metrics.
  * **Data scientist:** Convenient further use, maintenance, and expected accuracy of the developed models.