===== IoT Data Processing Models and Frameworks =====
Processing frameworks and processing engines are responsible for computing over data in a data system. While there is no authoritative definition setting apart "engines" from "frameworks", it is sometimes useful to define the former as the component responsible for operating on data and the latter as a set of components designed to do the same.
While the systems which handle this stage of the data lifecycle can be complex, the goals on a broad level are very similar: operate over data to increase understanding, surface patterns, and gain insight into complex interactions.
===Batch Processing Systems===
Batch processing has a long history within the big data world. Batch processing involves operating over a large, static dataset and returning the result at a later time when the computation is complete. The datasets in batch processing are typically:
  * bounded: batch datasets represent a limited collection of data;
  * persistent: data is almost always backed by some permanent storage;
  * large: batch operations are often the only option for processing extensive sets of data.
Batch processing is well-suited for calculations where access to a complete set of records is required. For instance, when calculating totals and averages, datasets must be treated holistically instead of as a collection of individual records. These operations require that state is maintained for the duration of the calculations. Tasks that need vast volumes of data are often best handled by batch operations. Whether the datasets are processed directly from permanent storage or loaded into memory, batch systems are built with large quantities in mind and have the resources to handle them. Because batch processing excels at managing large volumes of persistent data, it is frequently used with historical data.
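To make this concrete, below is a minimal batch-style sketch in Python; the file name and column layout are illustrative assumptions, not part of any framework. It reads a complete, bounded dataset from permanent storage and can only produce its total and average once every record has been seen.

<code python>
# Batch processing sketch: compute a total and an average over a complete,
# bounded dataset. The file name and CSV layout are assumed for illustration.
import csv

def batch_average(path):
    total = 0.0
    count = 0
    with open(path, newline="") as f:          # dataset backed by permanent storage
        for row in csv.DictReader(f):          # the whole set is read before answering
            total += float(row["temperature"])
            count += 1
    return total, (total / count if count else 0.0)

if __name__ == "__main__":
    total, avg = batch_average("sensor_readings.csv")  # hypothetical input file
    print(f"total={total:.2f} average={avg:.2f}")
</code>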
The trade-off for handling large quantities of data is a longer computation time. Because of this, batch processing is not appropriate in situations where processing time is especially significant. This methodology heavily depends on permanent storage, reading and writing multiple times per task, so it tends to be somewhat slow. On the other hand, because disk space is typically one of the most abundant server resources, MapReduce can handle enormous datasets. MapReduce has incredible scalability potential and has been used in production on tens of thousands of nodes.
Apache Hadoop is a processing framework that exclusively provides batch processing. Hadoop was the first big data framework to gain significant traction in the open-source community. Based on several papers and presentations by Google about how they were dealing with tremendous amounts of data at the time, Hadoop reimplemented the algorithms and component stack to make large-scale batch processing more accessible. Apache Hadoop and its MapReduce processing engine offer a well-tested batch processing model that is best suited for handling extensive datasets where time is not a significant factor. The low cost of components necessary for a well-functioning Hadoop cluster makes this processing inexpensive and useful for many use cases. Compatibility and integration with other frameworks and engines mean that Hadoop can often serve as the foundation for multiple processing workloads using diverse technology.
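As an illustration of the MapReduce model, the sketch below targets Hadoop's Streaming utility, which lets any executable that reads standard input and writes standard output act as a mapper or reducer. The input layout, the counting task, and the script name are assumptions made for the example, not part of Hadoop itself.

<code python>
#!/usr/bin/env python3
# MapReduce sketch for Hadoop Streaming: count readings per sensor id.
# Hadoop sorts mapper output by key before it reaches the reducer, so the
# reducer only has to sum consecutive lines that share a key.
import sys

def mapper():
    for line in sys.stdin:                    # one input record per line
        sensor_id = line.split(",")[0]        # assumed layout: sensor_id,value,...
        print(f"{sensor_id}\t1")              # emit key<TAB>value pairs

def reducer():
    current, count = None, 0
    for line in sys.stdin:                    # lines arrive sorted by key
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = key, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Run as "script map" for the map phase or "script reduce" for the reduce
    # phase (the mode argument is an illustrative convention).
    mapper() if sys.argv[1:] == ["map"] else reducer()
</code>

A job like this is typically submitted with the hadoop-streaming JAR, pointing its -mapper and -reducer options at the script.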
===Stream Processing Systems===
Stream processing systems compute over data as it enters the system. This requires a different processing model than the batch paradigm. Instead of defining operations to apply to an entire dataset, stream processors define operations that will be applied to each individual data item as it passes through the system. The datasets in stream processing are considered "unbounded". This has a few important implications:
  * the total dataset is only defined as the amount of data that has entered the system so far;
  * the working dataset is perhaps more relevant and is limited to a single item at a time;
  * processing is event-based and does not "end" until explicitly stopped; results are immediately available and are continually updated as new data arrives.
Stream processing systems can handle a nearly unlimited amount of data, but they only process one (true stream processing) or very few (micro-batch processing) items at a time, with minimal state maintained between records. While most systems provide methods of maintaining some state, stream processing is highly optimised for more functional processing with few side effects.
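To contrast with the batch sketch above, here is a framework-neutral Python sketch of per-item processing: each record is handled as it arrives, only minimal running state (a count and a mean) is kept between records, and an up-to-date result is available after every item.

<code python>
# Stream processing sketch: operate on each item as it enters the system,
# maintaining only minimal state (count and running mean) between records.
def running_mean(stream):
    count, mean = 0, 0.0
    for value in stream:                 # the "total dataset" is unbounded;
        count += 1                       # the working set is one item at a time
        mean += (value - mean) / count   # incremental update, no full dataset held
        yield mean                       # a result is available after every item

if __name__ == "__main__":
    readings = iter([21.5, 22.0, 21.8, 23.1])   # stand-in for an endless sensor feed
    for m in running_mean(readings):
        print(f"running mean: {m:.2f}")
</code>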
This type of processing lends itself to certain kinds of workloads. Processing with near real-time requirements is well served by the streaming model. Analytics, server or application error logging, and other time-based metrics are a natural fit because reacting to changes in these areas can be critical to business functions. Stream processing is a good fit for data where you must respond to changes or spikes and where you're interested in trends over time. Popular stream-only processing frameworks include:
  * Apache Storm.
  * Apache Samza.
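The time-based metrics and spike detection mentioned above are typically computed over sliding windows. The following framework-neutral Python sketch keeps only the most recent readings and flags a value that strays far from the window average; the window size and threshold are arbitrary illustrative choices.

<code python>
# Sliding-window sketch: flag spikes relative to the recent trend.
from collections import deque

def detect_spikes(stream, window=10, threshold=5.0):
    recent = deque(maxlen=window)             # only the window is kept in memory
    for value in stream:
        if len(recent) == recent.maxlen:
            avg = sum(recent) / len(recent)
            if abs(value - avg) > threshold:  # deviation from the recent trend
                yield value, avg
        recent.append(value)

if __name__ == "__main__":
    feed = [20.0] * 12 + [31.0] + [20.5] * 5  # stand-in for a live metric feed
    for value, avg in detect_spikes(feed):
        print(f"spike: {value} vs window average {avg:.1f}")
</code>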
===Hybrid Processing Systems===
Some processing frameworks can handle both batch and stream workloads. These frameworks simplify diverse processing requirements by allowing the same or related components and APIs to be used for both types of data. The way that this is achieved varies significantly between Spark and Flink, the two frameworks we will discuss. It is mainly a function of how the two processing paradigms are brought together and what assumptions are made about the relationship between fixed and unfixed datasets. While projects focused on one processing type may be a close fit for specific use cases, the hybrid frameworks attempt to offer a general solution for data processing. They not only provide methods for processing over data, but they also have their own integrations, libraries, and tooling for tasks such as graph analysis, machine learning, and interactive querying. The two hybrid frameworks discussed here are:
  * Apache Spark.
  * Apache Flink.
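As a sketch of how a hybrid framework can serve both paradigms with closely related code, the PySpark fragment below applies the same aggregation once to a bounded dataset and once to an unbounded stream of arriving files. The directory names, schema, and column names are illustrative assumptions, and a local Spark installation is presumed.

<code python>
# Hybrid processing sketch with PySpark: the same DataFrame operations serve
# both a bounded (batch) source and an unbounded (streaming) source.
# Directory names and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("hybrid-sketch").getOrCreate()

def aggregate(df):
    # One definition of the computation, reused for batch and streaming.
    return df.groupBy("sensor_id").agg(avg("value"))

# Batch: a bounded dataset read from permanent storage.
batch_df = spark.read.csv("readings/", header=True, inferSchema=True)
aggregate(batch_df).show()

# Streaming: the same logic over an unbounded source of arriving files.
stream_df = (spark.readStream
             .schema(batch_df.schema)          # streaming reads require a schema
             .option("header", True)
             .csv("incoming/"))
query = (aggregate(stream_df)
         .writeStream
         .outputMode("complete")               # keep the full running aggregate
         .format("console")
         .start())
query.awaitTermination()
</code>

Flink approaches the same unification from the other direction, treating a bounded batch dataset as a special case of a stream.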