====== Data Products Development ======
In the previous chapter, some essential properties of Big Data systems were discussed, as well as how and why IoT systems relate to Big Data problems. In any IoT implementation, data processing is the heart of the system and ultimately takes the form of a data product. While it is still mainly a software subsystem, its development differs significantly from that of a regular software product. The difference is expressed through the roles involved and through the lifecycle itself. It is often wrongly assumed that the main contributor is the data scientist responsible for developing a particular data processing or forecasting algorithm. This is only partly true, because other roles are also vital to success. The team playing these roles might be as small as three or as large as 20 people, depending on the scale of the project. The leading roles are explained below.
  
=== Business user ===
=== Project sponsor ===
  
The project sponsor defines the business problem and triggers the project's birth. The sponsor establishes the project's scope and volume and secures the necessary provisions. While the sponsor defines the project priorities, they do not need deep knowledge of or skills in the technology, algorithms, or methods used.
  
=== Project manager ===
======  ======
  
As might be noticed, the Data Scientist undoubtedly plays a vital role, but only in cooperation with the other roles. Depending on their competencies and capacities, roles might overlap, or a single team member could cover several roles.
Once the team is built, the development process can start. As with any other product development, data product development follows a specific life cycle of phases. Depending on particular project needs, there might be variations, but in most cases data product development follows the well-known waterfall pattern. The phases are explained in the figure {{ref>Dataproductlifecycle}}:
  
<figure Dataproductlifecycle>
{{ :en:iot-reloaded:lifecycle.png?900 | Data Product Life Cycle}}
<caption>Data Product Life Cycle</caption>
</figure>
  
  
The phase focuses on creating a sandbox system by extracting, transforming, and loading data into it (ETL – Extract, Transform, Load). This is usually the most prolonged phase in terms of time and can take up to 50% of the total time allocated to the project. Unfortunately, most teams tend to underestimate this time consumption, which costs the project manager and analysts dearly and leads to losing trust in the project's success. Data scientists given a unique role and authority in the team tend to "skip" this phase and go directly to phase 3 or 4, which is costly because of incorrect or insufficient data to solve the problem.
  - **Data analysis sandbox** - The client's operational data, logs (windows), raw streams, etc., are copied. There is a possibility of a natural conflict where data scientists want everything, while the IT «service» provides a minimum. The needs must, therefore, be explained through thorough argument. The sandbox can be 5 – 10 times larger than the original dataset!
  - **Carrying out ETLs** - The data is retrieved, transformed, and loaded back into the sandbox system. Sometimes, simple data filtering excludes outliers and cleans the data. Due to the volume of data, parallelisation of data transfers may be needed, which leads to the need for appropriate software and hardware infrastructure. In addition, various web services and interfaces are used to obtain context (a minimal ETL sketch is given after this list).
  - **Exploring the content of the data** - The main task is to learn the content of the extracted data. A data catalogue or vocabulary is created (small projects can skip this step). Data research allows for identifying data gaps and technology flaws, as well as the team's own and extraneous data (for determining responsibilities and limitations).
  - **Data conditioning** - Slicing and combining are the most common actions in this step. The compatibility of data subsets with each other after the performed manipulations is checked to exclude systematic errors – errors that occur due to incorrect manipulation (formatting of data, filling in voids, etc.). During this step, the team ensures the time, metadata, and content match.
  - **Reporting and visualising** - This step uses general visualisation techniques, providing a high-level overview – value distributions, histograms, correlations, etc. – that explains the data content. It is necessary to check whether the data represent the problem sphere, how the value distributions "behave" throughout the dataset, and whether the details are sufficient to solve the problem.
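The sketch below illustrates the ETL idea on a very small scale: raw readings are extracted from a hypothetical CSV export, lightly filtered, and loaded into a local SQLite file playing the role of the sandbox. The file, table, and column names, as well as the filter range, are assumptions made for illustration only.

<code python>
# Minimal ETL sketch (illustrative names and thresholds).
import sqlite3

import pandas as pd

# Extract: read a raw export copied from the client's operational system.
raw = pd.read_csv("sensor_readings_raw.csv", parse_dates=["timestamp"])

# Transform: drop broken rows and clip physically impossible values
# (the -40..85 range is an assumed sensor rating, not a general rule).
clean = raw.dropna(subset=["timestamp", "temperature"])
clean = clean[clean["temperature"].between(-40, 85)]

# Load: write the conditioned data into the analysis sandbox.
with sqlite3.connect("sandbox.db") as conn:
    clean.to_sql("sensor_readings", conn, if_exists="replace", index=False)

print(f"Loaded {len(clean)} of {len(raw)} rows into the sandbox")
</code>

In a real project the same pattern would be parallelised and pointed at the production data stores rather than a flat file; the point here is only the extract–transform–load order and the separate sandbox target.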
  
=== Model planning ===
The main task of the phase is to select model candidates for data clustering, classification, or other needs consistent with the Initial Hypothesis from Phase 1.
  - **Exploring data and selecting variables** - The aim is to discover and understand variables' interrelationships through visualisations. The identified stakeholders are an excellent source of relevant insights about internal data relationships – even if they do not know the reasons! These steps allow the selection of key factors instead of checking all against all.
  - **Selection of methods or models** - During this step, the team creates a list of methods that match the data and the problem. A typical approach is making many trim model prototypes using ready-made tools and prototyping packages. Tools typical of the phase might include, but are not limited to, R or Python, SQL and OLAP, Matlab, SPSS, and Excel (for simpler models). A minimal prototyping sketch is given after this list.
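As an illustration of both steps, the hedged sketch below first inspects pairwise correlations to shortlist key factors and then screens two lightweight model prototypes with cross-validation. The dataset, column names, and the choice of candidate models are assumptions made for this example.

<code python>
# Sketch: explore variable relationships and screen "trim" model prototypes.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

data = pd.read_csv("sandbox_extract.csv")  # assumed export from the sandbox

# Step 1: pairwise correlations help select key factors
# instead of checking all against all.
print(data.corr(numeric_only=True)["energy_use"].sort_values())

# Step 2: quick prototypes of candidate methods on the selected factors.
X = data[["outdoor_temp", "occupancy", "hour_of_day"]]
y = data["energy_use"]
for name, model in [("linear regression", LinearRegression()),
                    ("decision tree", DecisionTreeRegressor(max_depth=4))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.2f}")
</code>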
  
=== Model development ===
During this phase, the initially selected trim models are implemented at full scale on the gathered data. The main question is whether the data is enough to solve the problem. There are several steps to be performed:
  - **Data preparation** - Specific subsets of data are created, such as training, testing, and validation. The data is adjusted to the selected initial data formatting and structuring methods.
  - **Model development** - Conceptually, it is usually very complex but relatively short in terms of time.
  - **Model testing** - The models are operated and tuned using the selected tools and training datasets to optimise them and ensure their resilience to incoming data variations. All decisions must be documented! This is important because all other team roles require detailed decision-making reasoning, especially during communication and operationalisation. A minimal split-and-test sketch is given after this list.
  - **Key points to be answered during the phase are:**
    * Is the model accurate enough?
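The hedged sketch below ties the data preparation and model testing steps together: the data is split into training, validation, and test subsets, a simple model is tuned against the validation subset, and its accuracy is finally checked on the untouched test subset. The dataset, features, model family, and the 0.8 R^2 threshold are assumptions made for illustration.

<code python>
# Sketch: train/validation/test split, simple tuning, and a final accuracy check.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("sandbox_extract.csv")  # assumed export from the sandbox
X = data[["outdoor_temp", "occupancy", "hour_of_day"]]
y = data["energy_use"]

# 60% training, 20% validation (tuning), 20% final test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Tune a single hyperparameter against the validation subset and document the choice.
best_model, best_score = None, float("-inf")
for n_estimators in (50, 200):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)
    score = r2_score(y_val, model.predict(X_val))
    if score > best_score:
        best_model, best_score = model, score

# Final check against the untouched test subset ("is the model accurate enough?").
test_r2 = r2_score(y_test, best_model.predict(X_test))
print(f"Test R^2 = {test_r2:.2f} -> accurate enough: {test_r2 >= 0.8}")
</code>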
  
=== Communication ===
During this phase, the results must be compared against the established quality criteria and presented to those involved in the project. It is important not to show any drafts outside the group of data scientists: the methods used are too complex for most of those involved, which leads to incorrect conclusions and unnecessary communication back to the team. Usually, the team is biased towards not accepting results that falsify the hypotheses, taking them too personally. However, the data led the team to the conclusions, not the team itself! In any case, it must be verified that the results are statistically reliable; if not, the results are not presented. It is also essential to present all the obtained side results, as they almost always provide additional value to the business. The general conclusions need to be complemented by sufficiently broad insights into the interpretation of the results, which is necessary for users of the results and decision-makers.
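One practical way to check that a reported accuracy figure is statistically reliable before presenting it is a bootstrap confidence interval over the held-out predictions. The sketch below only illustrates that idea; the metric, the number of resamples, and the synthetic example data are all assumptions.

<code python>
# Sketch: bootstrap confidence interval for a reported accuracy metric.
import numpy as np
from sklearn.metrics import r2_score

def bootstrap_r2_interval(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Return a (1 - alpha) bootstrap confidence interval for R^2."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        scores.append(r2_score(y_true[idx], y_pred[idx]))
    return np.quantile(scores, [alpha / 2, 1 - alpha / 2])

# Illustrative use with synthetic held-out values and predictions.
y_true = np.random.default_rng(1).normal(size=200)
y_pred = y_true + np.random.default_rng(2).normal(scale=0.3, size=200)
low, high = bootstrap_r2_interval(y_true, y_pred)
print(f"95% confidence interval for R^2: [{low:.2f}, {high:.2f}]")
</code>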
  
=== Operationalisation ===
Expectations for each of the roles during this phase:
  * **Business user:** Identifiable benefits of the model for the business.
  * **Project sponsor:** Return on investment (ROI) and impact on the business as a whole – how to highlight it outside the organisation / to other businesses.
  * **Project manager:** Completing the project within the expected deadlines with the intended resources.
  * **Business Information Analyst:** Add-ons to existing reports and dashboards.
  * **Data scientist:** Convenient maintenance of models after preparing detailed documentation of all developments and explaining the work performed to the team.
  