==== Data and Information Management in the Internet of Things ====

At the center of the IoT ecosystem, which consists of billions of connected devices, is the wealth of information that can be made available through the fusion of data produced in real time and data stored in permanent repositories.
This information makes innovative and unconventional applications and value-added services possible, and will act as an immense source for trend analysis and strategic business opportunities. Achieving this goal requires a comprehensive framework for managing the data and information that the objects within the IoT generate and store.

Data management is a broad concept referring to the architectures, practices, and procedures for properly managing the data lifecycle requirements of an IT system. In the IoT, data management should act as a layer between the physical sensing objects and devices that generate the data on the one hand, and the applications that access the data for analysis and services on the other.

IoT data has distinctive characteristics that make traditional relational database management an obsolete solution. A massive volume of heterogeneous, streaming and geographically dispersed real-time data is created by millions of diverse devices that periodically send observations about monitored phenomena or report the occurrence of abnormal events of interest. Periodic observations are the most demanding in terms of communication overhead and storage because of their streaming and continuous nature, while events impose time constraints, with end-to-end response times depending on the urgency of the response required. Furthermore, in addition to the data generated by IoT entities, there is also metadata that describes these entities (i.e. "things"), such as object identification, location, processes and services provided. IoT data will reside statically in fixed- or flexible-schema databases and roam the network from dynamic and mobile objects to concentration storage points, until it reaches centralised data stores. Communication, storage and processing will thus be the defining factors in the design of data management solutions for the IoT.

Traditional data management systems handle the storage, retrieval, and update of elementary data items, records and files. In the context of the IoT, data management systems must summarise data online while providing storage, logging, and auditing facilities for offline analysis. This expands the concept of data management from offline storage, query processing, and transaction management into dual online/offline communication and storage operations. We first define the data lifecycle within the context of the IoT and then discuss some of its phases in order to better understand IoT data management.
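The online/offline dual operation described above can be sketched in a few lines of Python: a running summary is maintained online (here with Welford's algorithm for the mean and variance) while every raw observation is appended to a log that stands in for a durable offline store. Class and field names are illustrative, not taken from any particular framework.

```python
import json
import math

class OnlineSummary:
    """Maintains a running summary of a sensor stream (online path)
    while appending raw observations to a log for offline analysis.
    A sketch of the idea only; names are invented for illustration."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0    # sum of squared deviations (Welford's algorithm)
        self.log = []     # stands in for a durable offline store

    def ingest(self, value):
        self.log.append(json.dumps({"value": value}))   # offline path
        self.count += 1                                 # online path
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        return math.sqrt(self._m2 / self.count) if self.count else 0.0

summary = OnlineSummary()
for reading in [21.0, 21.5, 22.0, 35.0, 21.2]:
    summary.ingest(reading)
print(summary.count, round(summary.mean, 2))
```

The online path answers queries immediately from a few counters, while the log supports the storage, logging, and auditing facilities needed for later offline analysis.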

[[en:iot-open:data_lifecycle]]

[[en:iot-open:iotdatavsdb]]

[[en:iot-open:data_sources]]

=== Main IoT domains generating data ===

The emergence of new information sources inevitably affects the data centre market, which has undergone structural changes in recent years. Information technology tends to move beyond processing data in traditional data centres and opt for cloud-centric ones. In just a few years, only 8% of overall workloads are expected to be handled by old-school data centres.

The IoT was predicted to generate 403 ZB of data a year by 2018, up from 113.4 ZB in 2013, although according to Cisco not all generated data is sent to data centres: colocation sites were expected to host 8.6 ZB, up from 3.1 ZB in 2013. IDC has predicted that by 2020 one tenth of the world's data will be produced by machines, forecasting that the number of connected devices communicating over the internet will reach 32 billion and generate 10% of the world's data. CBR compiled a list of the top 10 key areas set to foster data growth resulting from IoT-connected solutions:

  - Air travel: Arming planes with smart sensors to prevent failures is already a reality. These sensors produce several terabytes of data per flight; Cisco, for example, has said that a Boeing 787 aircraft can generate 40 TB per hour of flight. IoT solutions in the air industry have several applications beyond preventing failures: they can also reduce fuel consumption, adjust speeds and reduce travel times.
  - Mining: For the mining industry, the main benefit of using the IoT is safety. By automating machines (M2M), humans are no longer required to stay close to the vehicles and risk their lives. Cisco predicts that mining operations can generate up to 2.4 TB of data every minute.
  - Cars: A smart, IoT-connected vehicle is a fountain of data, constantly transmitting to manufacturers, road operators, its driver, the authorities, etc. Data generated by smart cars could even crash mobile networks with data surges by 2024, when connected vehicles are expected to total 2.3 billion and to increase data traffic by up to 97% during rush hour at some cell points.
  - Utilities: The worldwide revenue opportunity presented by the IoT for the utilities industry is estimated to reach $201 billion by 2018. Smart meters are just one example. According to the UK Department of Energy & Climate Change, by the end of 2014 there were a total of 20.8 million gas meters and 25.3 million electricity meters operated by the larger energy suppliers in British domestic properties. Smart meters collect data on how much energy is being used every 30 minutes, around the clock, all year round, sending several terabytes of information to the cloud every year.
  - Cities: Smart cities will be made up of everything out there: street lamps talking to the grid, urban parks connecting to services, rivers sending out alerts on pollution levels. All this data is generated daily and stored in the cloud, with millions of sensors deployed across every city constantly producing huge amounts of information.
  - Wearables: It is estimated that by 2019 more than 578 million wearables will be in use around the world, constantly collecting data on health, fitness and wellness. The amount of data produced varies according to the device being worn and the type of sensors it includes.
  - Sports: As sports adopt more wearables and intelligent clothing to improve performance, clubs are also looking at new ways to read the field and polish tactics using predictive analysis. For example, the NBA engaged SAP to make its statistics accessible to fans, opening the clubs' data to the world. SAP deployed its analytical software, primarily used in business environments, to create a database that records every single move players execute, players' stats, and much more.
  - Logistics: Traditionally, the transport of goods ended once the supply chain had shipped the products, but with the IoT the service extends well beyond this point, and smart goods constantly produce more data. Some logistics companies are already collecting data from their suppliers, and even from their suppliers' suppliers. Much of this data will come from RFID tags, giving logistics companies the ability to analyse it in real time and tackle any problems that might arise in the chain.
  - Healthcare: Smart healthcare is already being adopted in several countries. Huge virtual platforms store patient data that can be accessed by health services anywhere. The health sector will see huge benefits from the IoT, with sensors deployed across all areas of a medical unit; medical companies are using connectivity to prevent power surges in medical devices, including critical instruments used in surgery. All this information is stored for future analysis.
  - Smart homes: Smart homes are already a reality, and by 2020 consumers expect this ecosystem to be widely available. It is estimated that one smart connected home today can produce as much as 1 GB of information a week.
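A back-of-envelope calculation makes figures like these concrete. Using the smart-meter numbers quoted above (a reading every 30 minutes from 25.3 million electricity meters), and assuming, purely for illustration, a 100-byte payload per reading:

```python
# Back-of-envelope estimate of annual smart-meter data volume.
# The 30-minute interval and the 25.3 million meters come from the
# figures quoted above; the 100-byte payload is an assumed value.
READINGS_PER_DAY = 24 * 2            # one reading every 30 minutes
PAYLOAD_BYTES = 100                  # assumed size of one reading
METERS = 25_300_000

readings_per_year = READINGS_PER_DAY * 365
bytes_per_meter = readings_per_year * PAYLOAD_BYTES
total_tb = METERS * bytes_per_meter / 1e12

print(f"{readings_per_year} readings/meter/year, ~{total_tb:.1f} TB/year in total")
```

Even with this modest payload, the fleet produces on the order of tens of terabytes per year before any metadata, retransmission or indexing overhead is counted.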

=== Infrastructure and architectures for IoT data processing: Cloud, Fog, and Edge computing ===

The IoT generates a vast amount of Big Data, which in turn puts a huge strain on internet infrastructure and forces companies to find solutions that minimise the pressure and solve the problem of transferring large amounts of data. Cloud computing has entered the mainstream of information technology, providing scalability in the delivery of enterprise applications and Software as a Service (SaaS), and companies are now migrating their information operations to the cloud. Many cloud providers allow data to be transferred either via a traditional internet connection or via a dedicated direct link; the benefit of a direct link into the cloud is that the data is uncontended, the traffic does not cross the public internet, and the quality of service can be controlled. As the IoT proliferates, businesses face a growing need to analyse data from sources at the edge of a network, whether mobile phones, gateways or IoT sensors. Cloud computing has a disadvantage here: it cannot process data quickly enough for modern business applications.

Cloud computing and the IoT both serve to increase efficiency in everyday tasks, and they have a complementary relationship: the IoT generates massive amounts of data, and cloud computing provides a pathway for that data to travel. Many cloud providers charge on a pay-per-use model, which means that you only pay for the computing resources you actually use. Economies of scale are another way in which cloud providers can benefit smaller IoT start-ups and reduce overall costs to IoT companies. Cloud computing also enables better collaboration, which is essential for developers today: by storing and accessing data remotely, developers can reach data immediately and work on projects without delay. Finally, storing data in the cloud enables IoT companies to change direction quickly and allocate resources in different areas. Big Data has emerged in the past couple of years, and with it the cloud has become the architecture of choice: most companies find it feasible to access the massive quantities of IoT Big Data via the cloud.

The IoT owes its explosive growth to the connection of physical things and operational technologies to analytics and machine learning applications, which can help glean insights from device-generated data and enable devices to make "smart" decisions without human intervention. Currently, such resources are mostly provided by cloud service providers, where the computation and storage capacity exists. However, despite its power, the cloud model is not applicable to environments where operations are time-critical or internet connectivity is poor. This is especially true in scenarios such as telemedicine and patient care, where milliseconds can have fatal consequences, and in vehicle-to-vehicle communications, where the prevention of collisions and accidents cannot afford the latency caused by a round trip to the cloud server.

Moreover, having every device connected to the cloud and sending raw data over the internet can have privacy, security and legal implications, especially when dealing with sensitive data that is subject to different regulations in different countries. IoT nodes are closer to the action, but for the moment they do not have the computing and storage resources to perform analytics and machine learning tasks. Cloud servers, on the other hand, have the horsepower, but are too far away to process data and respond in time.

The Fog/Edge layer is the perfect junction where there are enough compute, storage and networking resources to mimic cloud capabilities at the edge and support the local ingestion of data and the quick turnaround of results. The main benefits of Fog/Edge computing are the following:

  * Increased network capacity: Fog computing uses much less bandwidth, so it does not cause bottlenecks or similar congestion. Less data movement on the network frees up capacity that can then be used for other things.
  * Real-time operation: Fog computing offers much lower latency than any cloud computing architecture we know today. Since all data analysis is done on the spot, it represents a true real-time concept, making it a perfect match for the needs of the IoT.
  * Data security: Collected data is more secure when it does not travel. Keeping data in its country of origin also makes storage simpler, since sending data abroad might violate certain laws.
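The bandwidth and latency benefits come from a simple pattern: handle the common case locally and forward only the rare events upstream. A minimal sketch, in which the anomaly threshold and the `send_to_cloud` callback are illustrative assumptions rather than any real API:

```python
# Edge-side filtering sketch: readings are handled locally and only
# anomalous values are forwarded to the cloud, reducing both bandwidth
# use and round-trip latency. Threshold and uplink are assumptions.
def make_edge_filter(threshold, send_to_cloud):
    def handle(reading):
        if abs(reading["value"]) > threshold:
            send_to_cloud(reading)      # rare event: escalate to the cloud
            return "forwarded"
        return "handled locally"        # common case: stays at the edge
    return handle

uplink = []                             # stand-in for a cloud uplink
edge = make_edge_filter(threshold=50.0, send_to_cloud=uplink.append)

results = [edge({"sensor": "s1", "value": v}) for v in [12.0, 48.9, 75.3, 7.1]]
print(results, len(uplink))
```

Of four readings, only one crosses the network; the rest are absorbed at the edge, which is exactly the traffic reduction the bullet points above describe.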

The current trend shows that Fog computing will continue to grow in usage and importance as the IoT expands and conquers new ground. With inexpensive, low-power processing and storage becoming more available, we can expect computation to move even closer to the edge and become ingrained in the same devices that generate the data, creating even greater possibilities for inter-device intelligence and interaction. Sensors that only log data might one day become a thing of the past.

Fog/Edge computing has the potential to revolutionise the IoT in the next several years. While the Cloud is a perfect match for the IoT, other scenarios and IoT technologies demand low-latency ingestion and immediate processing of data, and there Fog computing is the answer. Fog/Edge computing improves efficiency and reduces the amount of data that needs to be sent to the cloud for processing, but it is here to complement the cloud, not replace it. The cloud will continue to play a pertinent role in the IoT cycle; in fact, with Fog computing shouldering the burden of short-term analytics at the edge, cloud resources will be freed to take on the heavier tasks, especially the analysis of historical data and large datasets. Insights obtained in the cloud can then help update and tweak policies and functionality at the fog layer. To sum up, it is the combination of Fog and Cloud computing that will accelerate the adoption of the IoT, especially for the enterprise.

=== IoT data storage models and frameworks ===

The increasing volume of heterogeneous, unstructured IoT data has led to the emergence of several solutions for storing these overwhelming datasets and supporting timely data management:
  * NoSQL databases are often used for storing IoT Big Data. This is a newer type of database that is becoming more and more popular among web companies. Proponents of NoSQL solutions state that they provide simpler scalability and improved performance relative to traditional relational databases. These products excel at storing "unstructured data", and the category includes open-source products such as Cassandra, MongoDB, and Redis.
  * In-memory databases keep data in main memory to make access to it faster. Representative examples are Redis and Memcached, both NoSQL databases served entirely from memory.
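To make the in-memory model concrete, here is a toy key-value store illustrating the idea behind systems such as Redis and Memcached: all data lives in memory, keys can carry a time-to-live, and lookups are constant-time dictionary operations. This is a sketch of the concept, not a client for any real database.

```python
import time

class MemoryKV:
    """Toy in-memory key-value store with per-key TTL, illustrating
    the storage model of Redis/Memcached. Not a real database client."""

    def __init__(self):
        self._data = {}   # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expiry = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expiry)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expiry = item
        if expiry is not None and time.monotonic() > expiry:
            del self._data[key]        # lazy expiration on access
            return None
        return value

kv = MemoryKV()
kv.set("sensor:42:last", {"temp": 21.5}, ttl=60)
print(kv.get("sensor:42:last"))
```

The TTL is what makes this model attractive for IoT: a "last reading" key simply disappears when it goes stale, so the hot working set stays bounded even as devices stream continuously.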


=== IoT data processing models and frameworks ===

Processing frameworks and processing engines are responsible for computing over data in a data system. While there is no authoritative definition setting apart "engines" from "frameworks", it is sometimes useful to define the former as the actual component responsible for operating on data and the latter as a set of components designed to do the same. For instance, Apache Hadoop can be considered a processing framework with MapReduce as its default processing engine. Engines and frameworks can often be swapped out or used in tandem: Apache Spark, another framework, can hook into Hadoop to replace MapReduce. This interoperability between components is one reason that big data systems have great flexibility.

While the systems which handle this stage of the data lifecycle can be complex, the goals on a broad level are very similar: operate over data in order to increase understanding, surface patterns, and gain insight into complex interactions. To simplify the discussion, we will group these processing frameworks by the state of the data they are designed to handle: some systems handle data in batches, others process data in a continuous stream as it flows into the system, and still others can handle data in either way.

//Batch Processing Systems//

Batch processing has a long history within the big data world. It involves operating over a large, static dataset and returning the result at a later time, when the computation is complete. The datasets in batch processing are typically:

  * bounded: batch datasets represent a finite collection of data;
  * persistent: data is almost always backed by some type of permanent storage;
  * large: batch operations are often the only option for processing extremely large sets of data.

Batch processing is well suited to calculations where access to a complete set of records is required. For instance, when calculating totals and averages, datasets must be treated holistically instead of as a collection of individual records. These operations require that state be maintained for the duration of the calculation. Tasks that require very large volumes of data are often best handled by batch operations. Whether the datasets are processed directly from permanent storage or loaded into memory, batch systems are built with large quantities in mind and have the resources to handle them. Because batch processing excels at handling large volumes of persistent data, it is frequently used with historical data.

The trade-off for handling large quantities of data is longer computation time, so batch processing is not appropriate in situations where processing time is especially significant. Because this methodology depends heavily on permanent storage, reading and writing multiple times per task, it tends to be fairly slow. On the other hand, since disk space is typically one of the most abundant server resources, MapReduce can handle enormous datasets; it has incredible scalability potential and has been used in production on tens of thousands of nodes.

Apache Hadoop: Apache Hadoop is a processing framework that exclusively provides batch processing. Hadoop was the first big data framework to gain significant traction in the open-source community. Based on several papers and presentations by Google about how it was dealing with tremendous amounts of data at the time, Hadoop reimplemented the algorithms and component stack to make large-scale batch processing more accessible. Apache Hadoop and its MapReduce processing engine offer a well-tested batch processing model best suited to handling very large datasets where time is not a significant factor. The low cost of components necessary for a well-functioning Hadoop cluster makes this processing inexpensive and effective for many use cases, and compatibility and integration with other frameworks and engines mean that Hadoop can often serve as the foundation for multiple processing workloads using diverse technology.
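The MapReduce model that Hadoop's engine implements at scale can be sketched in pure Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase folds each group into a result. Here the batch job averages sensor readings per sensor; the dataset and names are invented for illustration.

```python
from collections import defaultdict

# Pure-Python sketch of the MapReduce paradigm (map -> shuffle -> reduce),
# batch-averaging sensor readings by sensor id over a bounded dataset.
def map_phase(records):
    for sensor_id, value in records:
        yield sensor_id, value        # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)     # group all values by key
    return groups

def reduce_phase(groups):
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

batch = [("s1", 20.0), ("s2", 30.0), ("s1", 22.0), ("s2", 34.0)]
averages = reduce_phase(shuffle(map_phase(batch)))
print(averages)   # {'s1': 21.0, 's2': 32.0}
```

In a real cluster each phase runs in parallel across many nodes with the shuffle moving data over the network, but the data flow is exactly this pipeline.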

//Stream Processing Systems//

Stream processing systems compute over data as it enters the system. This requires a different processing model than the batch paradigm: instead of defining operations to apply to an entire dataset, stream processors define operations that are applied to each individual data item as it passes through the system. The datasets in stream processing are considered "unbounded". This has a few important implications:

  * The total dataset is only defined as the amount of data that has entered the system so far.
  * The working dataset is perhaps more relevant, and is limited to a single item at a time.
  * Processing is event-based and does not "end" until explicitly stopped. Results are immediately available and are continually updated as new data arrives.

Stream processing systems can handle a nearly unlimited amount of data, but they only process one item (true stream processing) or very few items (micro-batch processing) at a time, with minimal state maintained between records. While most systems provide methods of maintaining some state, stream processing is highly optimised for more functional processing with few side effects.

Functional operations focus on discrete steps that have limited state or side effects. Performing the same operation on the same piece of data will produce the same output independent of other factors. This kind of processing fits well with streams because maintaining state between items is usually some combination of difficult, limited, and sometimes undesirable. So while some type of state management is usually possible, these frameworks are much simpler and more efficient in its absence.

This type of processing lends itself to certain types of workloads. Processing with near-real-time requirements is well served by the streaming model. Analytics, server or application error logging, and other time-based metrics are a natural fit because reacting to changes in these areas can be critical to business functions. Stream processing is a good fit for data where you must respond to changes or spikes and where you are interested in trends over time.

  * Apache Storm
  * Apache Samza

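The item-at-a-time model with minimal carried state can be sketched as follows: each record is handled as it arrives, the only state between items is an exponentially weighted moving average, and spikes relative to that average are flagged immediately. The smoothing factor and spike threshold are illustrative choices, not parameters of any real framework.

```python
# Item-at-a-time stream processing sketch: one record in, one result
# out, with only a small piece of state (an EWMA) carried between items.
def stream_spike_detector(alpha=0.5, factor=2.0):
    ewma = None
    def process(value):
        nonlocal ewma
        if ewma is None:           # first item seeds the average
            ewma = value
            return False
        spike = value > factor * ewma   # flag immediately, per item
        ewma = alpha * value + (1 - alpha) * ewma
        return spike
    return process

detect = stream_spike_detector()
flags = [detect(v) for v in [10, 11, 9, 40, 10]]
print(flags)   # [False, False, False, True, False]
```

Results are available continuously as data flows in, which is what makes this model a fit for the near-real-time workloads described above.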
//Hybrid Processing Systems//

Some processing frameworks can handle both batch and stream workloads. These frameworks simplify diverse processing requirements by allowing the same or related components and APIs to be used for both types of data. How this is achieved varies significantly between Spark and Flink, the two frameworks we will discuss, and is largely a function of how the two processing paradigms are brought together and what assumptions are made about the relationship between fixed and unfixed datasets. While projects focused on one processing type may be a close fit for specific use cases, the hybrid frameworks attempt to offer a general solution for data processing. They not only provide methods for processing over data; they also have their own integrations, libraries, and tooling for things like graph analysis, machine learning, and interactive querying.

  * Apache Spark
  * Apache Flink
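The core hybrid idea, one transformation reused for both bounded and unbounded data, can be illustrated without either framework: the same function is applied once to a full batch and then to successive micro-batches of the same data. The micro-batch size and the threshold are arbitrary illustrative choices.

```python
# Sketch of the hybrid batch/stream idea: identical transformation
# logic applied to a bounded batch and to micro-batches of a stream.
def transform(records):
    """Shared logic: keep readings above a threshold, convert to ints."""
    return [int(r) for r in records if r > 15]

# Batch mode: one pass over the full, bounded dataset.
batch_result = transform([10.2, 17.9, 21.3, 14.8, 30.1])

# Stream mode: the same function applied to successive micro-batches.
def micro_batches(stream, size):
    for i in range(0, len(stream), size):
        yield stream[i:i + size]

stream_result = []
for chunk in micro_batches([10.2, 17.9, 21.3, 14.8, 30.1], size=2):
    stream_result.extend(transform(chunk))

print(batch_result == stream_result)   # the two modes agree
```

Spark approaches the problem from the batch side (treating streams as micro-batches) and Flink from the streaming side (treating batches as bounded streams), but both let the same user code serve both worlds, as the sketch suggests.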

=== IoT data semantics ===

With some 25 billion devices expected to be connected to the Internet by 2015 and 50 billion by 2020, providing interoperability among the things of the IoT is one of the most fundamental requirements to support object addressing, tracking, and discovery as well as information representation, storage, and exchange.

The lack of an explicit and formal representation of IoT knowledge can cause ambiguity in terminology and hinder interoperability, most notably the semantic interoperability of entities in the IoT world. Furthermore, the lack of shared and agreed semantics for this domain (as for any domain) may easily result in semantic heterogeneity - i.e. in the need to align and merge a vast number of different modelling efforts to semantically describe IoT entities, conducted by many different ontology engineers and IoT vendors (domain experts). Although there are tools nowadays to overcome such a problem, the process is not fully automated or precise, and it would be much easier if there were at least partial agreement between the related stakeholders - i.e. a commonly agreed IoT ontology.

In these circumstances, an ontology can be used as a semantic registry to facilitate the automated deployment of generic and legacy IoT solutions in environments where heterogeneous devices have also been deployed. Such a service can be delivered by IoT solution providers, remotely supporting the interoperability problems of their clients when buying third-party devices or applications. In practice, this requires a central point - e.g. a web service or portal - where both end users (buyers of the devices) and IoT solution providers (sellers of the applications) register their resources, i.e. both the devices and the IoT solutions, in an ontology-based registry.
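A toy version of such an ontology-backed registry is easy to sketch: devices register their capabilities, solutions register their requirements, and the registry matches one against the other. All class and property names here are invented for illustration; a real registry would express them in an ontology language such as OWL using a shared vocabulary.

```python
# Toy semantic registry: devices advertise capabilities, solutions
# declare requirements, and the registry checks deployability.
# All identifiers are hypothetical, for illustration only.
registry = {
    "devices": [
        {"id": "d1", "type": "sensor",   "observes": "temperature"},
        {"id": "d2", "type": "actuator", "acts_on":  "valve"},
    ],
    "solutions": [
        {"id": "sol1", "requires": [("sensor", "temperature"),
                                    ("actuator", "valve")]},
    ],
}

def deployable(solution, devices):
    """A solution is deployable if every required capability is offered."""
    offered = {(d["type"], d.get("observes") or d.get("acts_on"))
               for d in devices}
    return all(req in offered for req in solution["requires"])

for sol in registry["solutions"]:
    print(sol["id"], deployable(sol, registry["devices"]))
```

The hard part that this sketch omits is exactly what a common ontology provides: agreement that one vendor's "temperature" and another's "ambientTemp" denote the same concept, so that the matching step is meaningful across vendors.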

== Sensor Web Enablement and Semantic Sensor Networks ==

The Sensor Web Enablement (SWE) standards enable developers to make all types of sensors, transducers and sensor data repositories discoverable, accessible and usable via the Web. Sensor technology, computer technology and network technology are advancing together while demand grows for ways to connect information systems with the real world. Linking diverse technologies in this fertile market environment, integrators are offering new solutions for plant security, industrial controls, meteorology, geophysical survey, flood monitoring, risk assessment, tracking, environmental monitoring, defence, logistics and many other applications. The SWE effort develops the global framework of standards and best practices that makes linking diverse sensor-related technologies fast and practical. Standards make it possible to put the pieces together in an efficient way that protects earlier investments, prevents lock-in to specific products and approaches, and allows for future expansion. Standards also influence the design of new component products. Business needs drive the process, and technology providers and solution providers need to stay abreast of these evolving standards if they are to remain competitive.

Semantic Web technologies have been proposed as a means to enable interoperability for sensors and sensing systems in the context of SWE. They can be used in isolation or to augment SWE standards in the form of the Semantic Sensor Web (SSW). Semantic technologies can assist in managing, querying, and combining sensors and observation data, allowing users to operate at abstraction levels above the technical details of format and integration and to work instead with domain concepts and restrictions on quality. Machine-interpretable semantics allow autonomous or semi-autonomous agents to assist in collecting, processing, reasoning about, and acting on sensors and their observations. Linked Sensor Data may serve as a means to interlink sensor data with external sources on the Web.

One of the main outcomes of the SSW research is the Semantic Sensor Network (SSN) ontology, developed by the W3C Semantic Sensor Network Incubator Group. This IoT ontology provides all the semantics necessary for the specification of IoT devices as well as of the IoT solution (input, output, control logic) that is deployed using these devices. These semantics include terminology related to sensors and observations, reusing that already provided by the SSN ontology and extending it to also capture the semantics of devices beyond sensors - i.e. actuators, identity devices (tags) and embedded devices - and, of course, the semantics of the devices and things that are observed by sensors, whose status is changed by actuators, or that are attached to identity tags. Furthermore, the ontology includes semantics for the description of the registered IoT solutions - i.e. input, output, control logic - in terms of aligning and matching their requirements with the specifications and services of the registered devices.

=== IoT data visualisation ===

One of the challenges for the IoT industry is data analysis and interpretation. The Big Data generated by IoT devices is impractical if it cannot be translated into a form that is easy to understand, process and present visually. For this reason, data visualisation is becoming an integral part of the IoT. It provides a way to display this avalanche of collected data in meaningful ways that clearly reveal the insights hidden within this massive amount of information, helping us make fast, informed decisions with more certainty and accuracy than ever before. It is thus vital for business professionals, developers, designers, entrepreneurs and consumers alike to be aware of the role that visualisation can and will play in the near future, and to know how it can affect the experience and effectiveness of IoT products and services.
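Even a one-line visualisation can reveal a trend that a raw list of numbers hides. Real IoT dashboards use charting libraries, but the idea can be shown with a dependency-free terminal "sparkline"; the hourly temperature values are invented sample data.

```python
# Minimal terminal sparkline: maps each value onto one of eight
# block characters, so a day's trend is visible in a single line.
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0            # avoid division by zero
    return "".join(BARS[int((v - lo) / span * (len(BARS) - 1))]
                   for v in values)

hourly_temps = [18.2, 18.0, 17.8, 18.5, 20.1, 22.4, 24.0, 23.1]
print(sparkline(hourly_temps))
```

The overnight dip and afternoon peak are immediately visible in the rendered bar heights, which is precisely the kind of at-a-glance insight the paragraph above argues for.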


=== Machine learning and data science ===

TODO?



=== Sources ===

https://www.cbronline.com/internet-of-things/10-of-the-biggest-iot-data-generators-4586937/

https://www.sam-solutions.com/blog/how-much-data-will-iot-create-2017/

http://www.enterprisefeatures.com/6-important-stages-in-the-data-processing-cycle/

https://pinaclsolutions.com/blog/2017/cloud-computing-and-iot

http://internetofthingsagenda.techtarget.com/blog/IoT-Agenda/Its-time-for-fog-edge-computing-in-the-internet-of-things

https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared
  