==== Data and Information Management in the Internet of Things ====

At the centre of the IoT ecosystem, which consists of billions of connected devices, lies the wealth of information that can be made available by fusing data produced in real time with data stored in permanent repositories.
This information makes innovative and unconventional applications and value-added services possible, and will act as an immense source for trend analysis and strategic business opportunities. Achieving this goal therefore requires a comprehensive framework for managing the data and information that the objects within the IoT generate and store.

Data management is a broad concept referring to the architectures, …

IoT data has distinctive characteristics that make traditional relational database management an obsolete solution. A massive volume of heterogeneous, …

Traditional data management systems handle the storage, retrieval, and update of elementary data items, records and files. In the context of the IoT, data management systems must summarise data online while providing storage, logging, and auditing facilities for offline analysis. This expands the concept of data management from offline storage, query processing, and transaction management operations into online-offline communication/…

=== IoT data lifecycle ===

Data processing is simply the conversion of raw data into meaningful information through a process. Data is manipulated to produce results that lead to the resolution of a problem or the improvement of an existing situation. Like a production process, it follows a cycle in which inputs (raw data) are fed to a process (computer systems, software, etc.) to produce output (information and insights). Organisations generally employ computer systems to carry out a series of operations on the data in order to present, interpret, or obtain information. The process includes activities such as data entry, summary, calculation, …

The lifecycle of data within an IoT system proceeds from data production to aggregation, …

Storage operations aim at making data available in the long term for constant access/…

  - Querying: data-intensive systems rely on querying as the core process for accessing and retrieving data. In the context of the IoT, a query can be issued either to request real-time data to be collected for temporal monitoring purposes or to retrieve a certain view of the data stored within the system. The first case is typical when a (mostly localised) real-time request for data is required. The second case represents more globalised views of data and in-depth analysis of trends and patterns.
  - Production: data production involves the sensing and transfer of data by the edge devices within the IoT framework and the reporting of this data to interested parties periodically (as in a subscribe/…
  - Collection: the sensors and smart objects within the IoT may store data for a certain time interval or report it to governing components. Data may be collected at concentration points or gateways within the network, where it is further filtered and processed, and possibly fused into compact forms for efficient transmission. Wireless communication technologies such as ZigBee, Wi-Fi and mobile networks are used by objects to send data to collection points. Collection is the first stage of the cycle and is crucial, since the quality of the collected data heavily affects the output. The collection process needs to ensure that the data gathered are both well defined and accurate, so that subsequent decisions based on the findings are valid. This stage provides both the baseline from which to measure and a target for what to improve. Some types of data collection include census (data collection about everything in a group or statistical population), …
  - Aggregation/…
  - Delivery: as data is filtered, aggregated, and possibly processed either at the concentration points or at the autonomous virtual units within the IoT, the results of these processes may need to be sent further up the system, either as final responses or for storage and in-depth analysis. Wired or wireless broadband communications may be used here to transfer data to permanent data stores.
  - Preprocessing: …
  - Storage/…
  - Processing/…
  - Output and interpretation: …

Depending on the architecture of an IoT system and the actual data management requirements in place, some of the steps described above can be omitted. Nevertheless, …
  * In relatively autonomous IoT systems, data proceeds from query to production to in-network processing and then delivery.
  * In more centralised systems, the data flow starts from production and proceeds to collection and filtering/…
  * In fully centralised systems, the data flow extends production to aggregation further and includes preprocessing, …

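The fully centralised flow described above can be illustrated with a minimal pipeline sketch. This is purely illustrative: the stage names, the (sensor id, value) reading format, and the validity threshold are assumptions for this example, not part of any standard.

```python
# Sketch of a centralised IoT data flow:
# production -> collection/filtering -> aggregation -> delivery/storage.
# All names, formats, and thresholds are illustrative assumptions.

def produce():
    # Edge devices report raw readings as (sensor id, value) pairs.
    return [("t1", 21.5), ("t2", 90.0), ("t1", 22.1), ("t3", -999.0)]

def collect_and_filter(readings):
    # Gateways drop obviously invalid readings before forwarding.
    return [(sid, v) for sid, v in readings if -50.0 <= v <= 60.0]

def aggregate(readings):
    # Fuse per-sensor readings into compact summaries (mean value).
    sums, counts = {}, {}
    for sid, v in readings:
        sums[sid] = sums.get(sid, 0.0) + v
        counts[sid] = counts.get(sid, 0) + 1
    return {sid: sums[sid] / counts[sid] for sid in sums}

def store(summaries, repository):
    # Deliver summaries to the permanent data store for offline analysis.
    repository.update(summaries)
    return repository

repository = {}
store(aggregate(collect_and_filter(produce())), repository)
print(repository)  # per-sensor mean values of the readings that passed filtering
```

In a real deployment each stage would run on different hardware (device, gateway, data centre); the point here is only the order and shape of the data as it moves up the system.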
**IoT data management versus traditional database management systems**

Based on the IoT data lifecycle discussed earlier, we divide an IoT data management system into i) an online (i.e. real-time) front end that interacts directly with the interconnected IoT objects and sensors, and ii) an offline back end that handles the mass storage and in-depth analysis of the IoT data. The data management front end is communication-intensive, …

This envisioned data management architecture differs considerably from existing database management systems (DBMSs), which are mainly storage-centric. In traditional databases, the bulk of data is collected from predefined and finite sources and stored in scalar form in relations, according to strict normalisation rules. Queries are used to retrieve specific “summary” views of the system or to update specific items in the database. New data is inserted into the database when needed, also via insertion queries. Query operations are usually local, with execution costs bound to processing and intermediate storage. Transaction management mechanisms guarantee the ACID properties in order to enforce overall data integrity. Even if the database is distributed over multiple sites, query processing and distributed transaction management are enforced. The execution of distributed queries is based on the transparency principle, which dictates that the database is still viewed logically as one centralised unit, and the ACID properties are guaranteed via the two-phase commit protocol.

In IoT systems, the picture is dramatically different, with a massive and ever-growing number of data sources that include sensors, RFID tags, embedded systems, and mobile devices. Contrary to the occasional updates and queries submitted to traditional DBMSs, data streams constantly from a multitude of edge devices into the IoT data stores, and queries are more frequent and have more versatile needs. Hierarchical data reporting and aggregation may be required for scalability guarantees as well as to enable more prompt processing functionality. The strict relational database schema and relational normalisation practice may be relaxed in favour of more unstructured and flexible forms that adapt to the diverse data types and sophisticated queries. Although distributed DBMSs optimise queries based on communication considerations, …

**IoT data sources**

As the IoT gets involved in more domains and types of operations, countless data sources generate immense volumes of data. These sources can be roughly divided into three groups:

//Passive sources//

These are sensors that do not communicate actively and send the required information to the centralised management system only on demand. For instance, sensors that take atmospheric measurements produce data only when their API is invoked. That does not mean that an application using them is also passive; on the contrary, the data from passive sensors requires proper management and processing, and that is precisely what the application is there for.

//Active sources//

The main difference between passive and active sensors is that the latter transmit data continuously, …

//Dynamic sources//

These sources are the most sophisticated, …

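The three source types can be contrasted in a short sketch. The class and method names below are hypothetical, chosen only to make the pull (passive), push (active), and negotiated-push (dynamic) interaction patterns visible:

```python
# Illustrative sketch of the three IoT source types; not a standard API.

class PassiveSensor:
    """Produces data only when explicitly polled (on demand)."""
    def __init__(self, value):
        self._value = value

    def read(self):
        return self._value

class ActiveSensor:
    """Pushes every new reading to a subscriber as it is produced."""
    def __init__(self, subscriber):
        self._subscriber = subscriber

    def measure(self, value):
        self._subscriber(value)  # continuous, unsolicited reporting

class DynamicSensor(ActiveSensor):
    """Active sensor whose reporting interval can be renegotiated."""
    def __init__(self, subscriber, interval_s=60):
        super().__init__(subscriber)
        self.interval_s = interval_s

    def renegotiate(self, interval_s):
        # Two-way communication: the platform adjusts the reporting rate.
        self.interval_s = interval_s

received = []
active = ActiveSensor(received.append)
active.measure(21.5)                 # pushed without being asked

passive = PassiveSensor(13.2)
print(passive.read(), received)      # polled value, pushed values
```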
**Main IoT domains generating data**

The emergence of new information sources inevitably affects the data centre market, which has experienced structural changes in recent years. Indeed, information technology tends to go beyond processing data in traditional data centres and opts for cloud-centric ones. In just a few years, only 8% of overall workloads will be handled by old-school data centres.

The IoT is predicted to generate 403 ZB of data a year by 2018, up from 113.4 ZB in 2013. But according to Cisco, not all generated data will be sent to data centres. In three years' time, colocation sites should be hosting 8.6 ZB, up from 3.1 ZB in 2013. IDC has predicted that by 2020 one tenth of the world's data will be produced by machines. The organisation forecast that in five years' time the number of connected devices communicating over the internet will reach 32 billion and generate 10% of the world's data. CBR compiles a list of the top 10 key areas set to foster data growth resulting from IoT-connected solutions.

  - Air travel: arming planes with smart sensors to prevent failures is already a reality. These sensors produce several terabytes of data per flight; for example, Cisco said that a Boeing 787 aircraft could generate 40 TB per hour of flight. These IoT solutions in the air industry have several applications beyond preventing failure. They can also reduce fuel consumption, …
  - Mining: for the mining industry, the main benefit of using the IoT is safety. By automating machines (M2M), humans are not required to stay close to the vehicles and risk their lives. Cisco predicts that mining operations can generate up to 2.4 TB of data every minute.
  - Cars: a smart, IoT-connected vehicle is a fountain of data. It is constantly transmitting data to manufacturers, …
  - Utilities: the worldwide revenue opportunity presented by the IoT for the utilities industry is estimated to reach $201 billion by 2018. Smart meters are just one example. According to the UK Department of Energy & Climate Change, by the end of 2014 there were a total of 20.8 million gas meters and 25.3 million electricity meters operated by the larger energy suppliers in British domestic properties. Smart meters collect data on how much energy is being used every 30 minutes, around the clock, all year round, sending several terabytes of information to the cloud every year.
  - Cities: smart cities will be made of everything out there: street lamps talking to the grid, urban parks connecting to services, and rivers sending out alerts on pollution levels. All this data is generated on a daily basis and stored in the cloud. Millions of sensors deployed in every city will constantly produce huge amounts of information.
  - Wearables: it is estimated that by 2019 more than 578 million wearables will be in use around the world. These solutions constantly collect data on health, fitness and wellness. The amount of data produced by wearables varies according to the device being worn and the type of sensors it includes.
  - Sports: as sports adopt more wearables and intelligent clothing to improve performance, …
  - Logistics: until now, the transportation of goods ended once the supply chain had shipped the products. With the IoT the service will be extended further, and smart goods will constantly produce more data. Some logistics companies are already collecting data from their suppliers, and also from their suppliers' suppliers. Much of this data will come from RFID, giving logistics companies the ability to analyse it in real time and tackle any problems that might arise in the chain.
  - Healthcare: smart healthcare is already being adopted in several countries. Huge virtual platforms store patient data that can be accessed by health services anywhere else. The health sector will see huge benefits from the IoT, with sensors being deployed across all areas of a medical unit. Medical companies are using connectivity to prevent power surges in medical devices, including critical instruments used in surgeries. All this information is stored for future analysis.
  - Smart homes: smart homes are already a reality, and by 2020 consumers expect this ecosystem to be widely available. It is predicted that one smart connected home today can produce as much as 1 GB of information a week.

**Infrastructure and architectures for IoT data processing: Cloud, Fog, and Edge computing**

The IoT generates a vast amount of Big Data, and this in turn puts a huge strain on internet infrastructure, forcing companies to find solutions that minimise the pressure and solve the problem of transferring large amounts of data. Cloud computing has entered the mainstream of information technology, providing scalability in the delivery of enterprise applications and Software as a Service (SaaS). Companies are now migrating their information operations to the cloud. Many cloud providers allow data to be transferred either via a traditional internet connection or via a dedicated direct link. The benefit of a direct link into the cloud is that the data is uncontended, the traffic does not cross the public internet, and the quality of service can be controlled. As the IoT proliferates, …

Cloud computing and the IoT both serve to increase efficiency in everyday tasks, and the two have a complementary relationship.

The IoT owes its explosive growth to the connection of physical things and operational technologies to analytics and machine learning applications, …

Moreover, having every device connected to the cloud and sending raw data over the internet can have privacy, security and legal implications, …

The Fog/Edge layer is the perfect junction where there are enough compute, storage and networking resources to mimic cloud capabilities at the edge and support the local ingestion of data and the quick turnaround of results. The main benefits of Fog/Edge computing are the following:

  * Increased network capacity: Fog computing uses much less bandwidth, which means it does not cause bottlenecks and similar congestion. Less data movement on the network frees up network capacity, which can then be used for other things.
  * Real-time operation: Fog computing responds much faster than any other cloud computing architecture we know today. Since all data analysis is done on the spot, it represents a true real-time concept, which makes it a perfect match for the needs of the IoT.
  * Data security: collected data is more secure when it does not travel. This also makes data storage much simpler, because the data stays in its country of origin; sending data abroad might violate certain laws.

The current trend shows that Fog computing will continue to grow in usage and importance as the IoT expands and conquers new ground. With inexpensive, …

Fog/Edge computing has the potential to revolutionise the IoT in the next several years. It seems obvious that while the Cloud is a perfect match for the IoT in many scenarios, other IoT technologies demand low-latency ingestion and immediate processing of data, and there Fog computing is the answer. Fog/Edge computing improves efficiency and reduces the amount of data that needs to be sent to the cloud for processing. But it is here to complement the cloud, not replace it. The cloud will continue to have a pertinent role in the IoT cycle. In fact, with Fog computing shouldering the burden of short-term analytics at the edge, cloud resources will be freed to take on the heavier tasks, especially where the analysis of historical data and large datasets is concerned. Insights obtained by the cloud can help update and tweak policies and functionality at the fog layer. To sum up, it is the combination of Fog and Cloud computing that will accelerate the adoption of the IoT, especially for the enterprise.

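The bandwidth benefit above comes from summarising data before it leaves the edge. A minimal sketch, with made-up numbers and field names: an edge node reduces a window of raw samples to one compact summary, so a single record crosses the uplink instead of many, while an out-of-range spike remains visible in the summary.

```python
# Sketch of fog/edge preprocessing: summarise raw readings locally and
# forward only the summary to the cloud. Values are illustrative.

def edge_summarise(raw_readings):
    """Reduce a window of raw samples to one (min, max, mean, count) record."""
    return {
        "min": min(raw_readings),
        "max": max(raw_readings),
        "mean": sum(raw_readings) / len(raw_readings),
        "count": len(raw_readings),
    }

window = [20.1, 20.3, 19.9, 35.0, 20.2]   # one window of raw samples
summary = edge_summarise(window)

# One summary message is sent upstream instead of five raw messages,
# yet the anomalous spike is still detectable via "max".
print(summary)
```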
**IoT data storage models and frameworks**

The increasing volume of heterogeneous, unstructured IoT data has also led to the emergence of several solutions for storing these overwhelming datasets and supporting timely data management:
  * NoSQL databases are often used for storing IoT Big Data. This is a newer type of database that is becoming more and more popular among web companies today. Proponents of NoSQL solutions state that they provide simpler scalability and improved performance relative to traditional relational databases. These products excel at storing “unstructured data”, and the category includes open-source products such as Cassandra, MongoDB, and Redis.
  * In-memory databases keep data in main memory to make access to it faster. Representative examples are Redis and Memcached, both NoSQL databases served entirely from memory.

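What makes the NoSQL model a good fit for IoT data is that each record is self-describing, so heterogeneous devices can share one store without a fixed relational schema. A real deployment would use a system such as MongoDB or Redis; the plain dictionary below is only a sketch of the key-value/document data model, with hypothetical keys and fields.

```python
# Sketch of a schema-flexible (NoSQL-style) data model using a plain dict.
# Keys and field names are illustrative assumptions.

store = {}

def put(key, document):
    store[key] = document

# Two devices with completely different fields coexist in one collection;
# no table schema has to anticipate either of them.
put("thermo-1:2018-05-12T11:03", {"type": "temperature", "celsius": 21.4})
put("cam-7:2018-05-12T11:03", {"type": "camera", "frames": 24, "motion": True})

# A simple "query": select all temperature documents.
readings = [d for d in store.values() if d["type"] == "temperature"]
print(readings)
```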
**IoT data processing models and frameworks**

Processing frameworks and processing engines are responsible for computing over data in a data system. While there is no authoritative definition setting apart "engines" and "frameworks", it is sometimes useful to think of the engine as the component that actually operates on the data, and of the framework as the set of components built around it for the same purpose.

While the systems which handle this stage of the data lifecycle can be complex, the goals on a broad level are very similar: operate over data in order to increase understanding, …

//Batch Processing Systems//

Batch processing has a long history within the big data world. It involves operating over a large, static dataset and returning the result at a later time, when the computation is complete. The datasets in batch processing are typically:

  * bounded: batch datasets represent a finite collection of data;
  * persistent: data is almost always backed by some type of permanent storage;
  * large: batch operations are often the only option for processing extremely large sets of data.

Batch processing is well suited for calculations where access to a complete set of records is required. For instance, when calculating totals and averages, datasets must be treated holistically instead of as collections of individual records. These operations require that state be maintained for the duration of the calculation. Tasks that require very large volumes of data are often best handled by batch operations. Whether the datasets are processed directly from permanent storage or loaded into memory, batch systems are built with large quantities in mind and have the resources to handle them. Because batch processing excels at handling large volumes of persistent data, it is frequently used with historical data.

The trade-off for handling large quantities of data is longer computation time, so batch processing is not appropriate in situations where processing time is especially significant. Because the MapReduce methodology depends heavily on permanent storage, reading and writing multiple times per task, it tends to be fairly slow. On the other hand, since disk space is typically one of the most abundant server resources, MapReduce can handle enormous datasets. MapReduce has incredible scalability potential and has been used in production on tens of thousands of nodes.

Apache Hadoop: Apache Hadoop is a processing framework that exclusively provides batch processing, and it was the first big data framework to gain significant traction in the open-source community. Based on several papers and presentations by Google about how it was dealing with tremendous amounts of data at the time, Hadoop reimplemented the algorithms and component stack to make large-scale batch processing more accessible. Apache Hadoop and its MapReduce processing engine offer a well-tested batch processing model that is best suited for handling very large data sets where time is not a significant factor. The low cost of the components necessary for a well-functioning Hadoop cluster makes this processing inexpensive and effective for many use cases. Compatibility and integration with other frameworks and engines mean that Hadoop can often serve as the foundation for multiple processing workloads using diverse technology.

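The MapReduce model behind Hadoop can be illustrated with the classic word-count example. This single-process Python sketch only shows the shape of the computation (map, shuffle, reduce over a bounded batch); in Hadoop the same three phases run distributed across a cluster.

```python
# Minimal MapReduce-style batch computation (word count) in plain Python.
# Single-process and illustrative only; Hadoop distributes these phases.
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

batch = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(batch))))
# -> {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

Note how the whole (bounded, finite) dataset is available before the reduce step runs; that is precisely what distinguishes this model from stream processing.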
//Stream Processing Systems//

Stream processing systems compute over data as it enters the system. This requires a different processing model from the batch paradigm. Instead of defining operations to apply to an entire dataset, stream processors define operations that will be applied to each individual data item as it passes through the system. The datasets in stream processing are considered "unbounded", which has a few important implications:

  * The total dataset is only defined as the amount of data that has entered the system so far.
  * The working dataset is perhaps more relevant, and is limited to a single item at a time.
  * Processing is event-based and does not "end" until explicitly stopped; results are immediately available and are continually updated as new data arrives.

Stream processing systems can handle a nearly unlimited amount of data, but they only process one item (true stream processing) or very few items (micro-batch processing) at a time, with minimal state being maintained between records. While most systems provide methods of maintaining some state, stream processing is highly optimised for more functional processing with few side effects.

Functional operations focus on discrete steps that have limited state or side effects. Performing the same operation on the same piece of data will produce the same output independently of other factors. This kind of processing fits well with streams, because maintaining state between items is usually some combination of difficult, limited, and sometimes undesirable. So while some type of state management is usually possible, these frameworks are much simpler and more efficient in its absence.

This type of processing lends itself to certain kinds of workloads. Processing with near-real-time requirements is well served by the streaming model. Analytics, server or application error logging, and other time-based metrics are a natural fit, because reacting to changes in these areas can be critical to business functions. Stream processing is a good fit for data where you must respond to changes or spikes and where you are interested in trends over time. Representative stream processing frameworks include:

  * Apache Storm
  * Apache Samza

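The per-item model with minimal retained state can be sketched in a few lines. The rolling-average operator below is a hypothetical example of a stream operator: each reading is processed as it arrives, and the only state kept between items is a small fixed-size window.

```python
# Sketch of the streaming model: process each item on arrival, keeping
# only a small, bounded piece of state (a rolling window) between items.
from collections import deque

class RollingAverage:
    def __init__(self, window=3):
        self._items = deque(maxlen=window)   # the only state retained

    def process(self, value):
        # Called once per arriving item; the result is available immediately
        # and is continually updated as new data arrives.
        self._items.append(value)
        return sum(self._items) / len(self._items)

operator = RollingAverage(window=3)
for reading in [10.0, 20.0, 60.0, 30.0]:
    print(operator.process(reading))
```

Unlike the batch word count, the "dataset" here is never complete: the loop could run forever, and the operator would keep emitting up-to-date results.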
**Hybrid Processing Systems**

Some processing frameworks can handle both batch and stream workloads. These frameworks simplify diverse processing requirements by allowing the same or related components and APIs to be used for both types of data. The way this is achieved varies significantly between Spark and Flink, the two leading hybrid frameworks, and is largely a function of how the two processing paradigms are brought together and what assumptions are made about the relationship between fixed and unfixed datasets. While projects focused on one processing type may be a close fit for specific use cases, the hybrid frameworks attempt to offer a general solution for data processing. They not only provide methods for processing over data, but also have their own integrations, libraries, and tooling.

  * Apache Spark
  * Apache Flink

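One way hybrid engines bridge the two paradigms is micro-batching, the approach Spark Streaming is known for: the stream is buffered into tiny fixed-size batches, and an ordinary batch operation is applied to each. The sketch below only illustrates that idea in plain Python; it is not Spark's actual API.

```python
# Sketch of the micro-batch idea: turn an unbounded stream into a sequence
# of small, bounded batches and reuse a batch operation on each of them.

def micro_batches(stream, batch_size):
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch          # hand a full mini-batch to the batch engine
            batch = []
    if batch:
        yield batch              # flush the final partial batch

def batch_sum(batch):
    # Any batch operation works here; summing keeps the example short.
    return sum(batch)

stream = iter(range(1, 8))       # stands in for an unbounded source
sums = [batch_sum(b) for b in micro_batches(stream, 3)]
print(sums)
```

The trade-off is latency: results appear once per micro-batch rather than per item, which is why true per-item engines such as Flink take a different route.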
**IoT data semantics**

With some 25 billion devices expected to be connected to the Internet by 2015 and 50 billion by 2020, providing interoperability among the things on the IoT is one of the most fundamental requirements to support object addressing, tracking, and discovery, as well as information representation, …

The lack of an explicit and formal representation of IoT knowledge could cause ambiguity in terminology, …

In these circumstances, …

**Sensor Web Enablement and Semantic Sensor Networks**

The Sensor Web Enablement (SWE) standards enable developers to make all types of sensors, transducers and sensor data repositories discoverable, accessible and usable via the Web.

Semantic Web technologies have been proposed as a means to enable interoperability for sensors and sensing systems in the context of SWE. Semantic Web technologies can be used in isolation or to augment the SWE standards, in the form of the Semantic Sensor Web (SSW). Semantic technologies can assist in managing, querying, and combining sensors and observation data, allowing users to operate at abstraction levels above the technical details of format and integration, …

One of the main outcomes of the SSW research is the Semantic Sensor Network (SSN) ontology (by the W3C Semantic Sensor Network Incubator Group). This IoT ontology provides all the necessary semantics for the specification of IoT devices as well as the specification of the IoT solution (input, output, control logic) that is deployed using these devices. These semantics include terminology related to sensors and observations, …

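To give a feel for the semantic model, an SSN-style observation can be written as subject-predicate-object triples. The sketch below uses plain Python tuples rather than a real RDF library, and the shortened URIs (the `ex:` names in particular) are hypothetical; consult the W3C SSN ontology for the exact terms.

```python
# Illustrative sketch of an SSN-style observation as RDF-like triples.
# URIs are shortened and partly hypothetical; not a real RDF toolchain.

graph = {
    ("ex:obs1", "rdf:type",             "ssn:Observation"),
    ("ex:obs1", "ssn:observedBy",       "ex:thermometer1"),
    ("ex:obs1", "ssn:observedProperty", "ex:airTemperature"),
    ("ex:obs1", "ex:hasValue",          "21.4"),
}

def query(triples, predicate):
    """Return (subject, object) pairs matching one predicate."""
    return {(s, o) for s, p, o in triples if p == predicate}

# Which sensor produced this observation?
print(query(graph, "ssn:observedBy"))
```

Because every fact is an explicit, machine-readable triple, different IoT platforms can merge and query each other's observation data without agreeing on a shared database schema first.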
**IoT data visualisation**

One of the challenges for the IoT industry is data analysis and interpretation. The Big Data generated by IoT devices is of little practical use if it cannot be translated into something that is easy to understand, process and present visually. For this reason, data visualisation is becoming an integral part of the IoT. Data visualisation provides a way to display the avalanche of collected data in meaningful ways that clearly present the insights hidden within this mass of information, and it can assist us in making fast, informed decisions with more certainty and accuracy than ever before. It is thus vital for business professionals, …

**Machine learning and data science**

TODO?
