===== IoT Data Processing Models and Frameworks =====
  
Processing frameworks and processing engines are responsible for computing over data in a data system. While there is no authoritative definition setting apart "engines" from "frameworks", it is sometimes useful to define the former as the actual component responsible for operating on data and the latter as a set of elements designed to do the same. For instance, Apache Hadoop can be considered a processing framework with MapReduce as its default processing engine. Engines and frameworks can often be swapped out or used in tandem. For instance, Apache Spark, another framework, can hook into Hadoop to replace MapReduce. This interoperability between components is one reason that big data systems have great flexibility.
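To make the MapReduce processing model concrete, the following is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. Real engines such as Hadoop MapReduce distribute these phases across many nodes; the function names here are illustrative, not part of any framework's API.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values for each key into a final result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["iot data processing", "iot stream processing"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'iot': 2, 'data': 1, 'processing': 2, 'stream': 1}
```

Because each phase only consumes the output of the previous one, a distributed engine is free to run many map and reduce tasks in parallel on different partitions of the data.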
Stream processing systems compute over data as it enters the system. This requires a different processing model from the batch paradigm. Instead of defining operations to apply to an entire dataset, stream processors define operations that are applied to each individual data item as it passes through the system. The datasets in stream processing are considered "unbounded", which has a few important implications:
  
  * the total dataset is only defined as the amount of data that has entered the system so far;
  * the working dataset is perhaps more relevant and is limited to a single item at a time;
  * processing is event-based and does not "end" until explicitly stopped. Results are immediately available and will be continually updated as new data arrives.
  
Stream processing systems can handle a nearly unlimited amount of data, but they only process one (true stream processing) or very few (micro-batch processing) items at a time, with a minimal state being maintained in between records. While most systems provide methods of maintaining some state, stream processing is highly optimised for more functional processing with few side effects.
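The distinction between true stream processing and micro-batch processing can be sketched as follows. Both functions keep only minimal state between records (a buffer or a running sum); the batch size of 3 is an illustrative assumption, not a value prescribed by any framework.

```python
def true_stream(records):
    """True streaming: process each record the moment it arrives.

    The only state kept between records is a running count and sum,
    used here to emit a running average after every item.
    """
    count, total = 0, 0.0
    results = []
    for value in records:
        count += 1
        total += value
        results.append(total / count)
    return results

def micro_batch(records, batch_size=3):
    """Micro-batching: buffer a few records, then process them together."""
    results, buffer = [], []
    for value in records:
        buffer.append(value)
        if len(buffer) == batch_size:
            results.append(sum(buffer) / len(buffer))  # per-batch average
            buffer = []
    if buffer:  # flush the final partial batch
        results.append(sum(buffer) / len(buffer))
    return results

readings = [10, 20, 30, 40, 50, 60]
print(true_stream(readings))   # one result per record
print(micro_batch(readings))   # one result per batch: [20.0, 50.0]
```

True streaming yields a result after every record, which minimises latency; micro-batching trades a small delay for the efficiency of operating on several records at once.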
This type of processing lends itself to certain kinds of workloads. Processing with near real-time requirements is well served by the streaming model. Analytics, server or application error logging, and other time-based metrics are a natural fit because reacting to changes in these areas can be critical to business functions. Stream processing is a good fit for data where you must respond to changes or spikes and where you're interested in trends over time.
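As a hypothetical illustration of reacting to spikes in a time-based metric, the sketch below flags a streamed value as a spike when it exceeds a multiple of the average over a small sliding window. The window size and threshold factor are illustrative assumptions chosen for the example.

```python
from collections import deque

def detect_spikes(stream, window=3, factor=2.0):
    """Flag a value as a spike if it exceeds `factor` times the
    average of the previous `window` values."""
    recent = deque(maxlen=window)  # minimal state: last few values only
    spikes = []
    for value in stream:
        if len(recent) == window and value > factor * (sum(recent) / window):
            spikes.append(value)
        recent.append(value)
    return spikes

# e.g. application errors counted per minute
errors_per_minute = [4, 5, 4, 30, 5, 4, 6]
print(detect_spikes(errors_per_minute))  # [30]
```

Because the detector keeps only the last few values, it can run indefinitely over an unbounded stream with constant memory, which is exactly the property stream processing systems are optimised for.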
  
Prominent frameworks built around this model include:

  * Apache Storm;
  * Apache Samza.
  
===Hybrid Processing Systems===
Some processing frameworks can handle both batch and stream workloads. These frameworks simplify diverse processing requirements by allowing the same or related components and APIs to be used for both types of data. The way that this is achieved varies significantly between Spark and Flink, the two frameworks we will discuss. It is mainly a function of how the two processing paradigms are brought together and what assumptions are made about the relationship between fixed and unfixed datasets. While projects focused on one processing type may be a close fit for specific use-cases, the hybrid frameworks attempt to offer a general solution for data processing. They not only provide methods for processing over data, but they also offer their own integrations, libraries, and tools for doing things like graph analysis, machine learning, and interactive querying.
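The core idea behind a unified batch/stream API can be sketched in plain Python: the same processing logic runs unchanged over a bounded (batch) dataset and over a lazily consumed, stream-like source. Spark and Flink achieve this with far richer, distributed APIs; this is only a conceptual illustration and the names are illustrative.

```python
def pipeline(source):
    """One pipeline definition: keep even values and square them.

    The logic is written once and is agnostic about whether `source`
    is a bounded collection or an unbounded stream of records.
    """
    return (x * x for x in source if x % 2 == 0)

# Batch: a bounded, fully materialised dataset.
batch_result = list(pipeline([1, 2, 3, 4, 5]))

# "Stream": a generator consumed lazily, one item at a time.
def sensor_stream(limit):
    for i in range(limit):
        yield i

stream_result = list(pipeline(sensor_stream(6)))

print(batch_result)   # [4, 16]
print(stream_result)  # [0, 4, 16]
```

Hybrid frameworks generalise this pattern: in Flink a batch is treated as a special (bounded) case of a stream, while Spark historically approached streams as a sequence of small batches.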
  
Frameworks in this category include:

  * Apache Spark;
  * Apache Flink.
CC Attribution-Share Alike 4.0 International