==== Data and Information Management in the Internet of Things ====

At the centre of the IoT ecosystem, which consists of billions of connected devices, lies the wealth of information that can be made available by fusing data produced in real time with data stored in permanent repositories.
This information makes innovative and unconventional applications and value-added services possible, and will act as an immense source for trend analysis and strategic business opportunities. Achieving this goal therefore requires a comprehensive framework for managing the data and information that the objects within the IoT generate and store.

Data management is a broad concept referring to the architectures, …

IoT data has distinctive characteristics that make traditional relational database management an obsolete solution. A massive volume of heterogeneous, …

Traditional data management systems handle the storage, retrieval, and update of elementary data items, records and files. In the context of the IoT, data management systems must summarise data online while providing storage, logging, and auditing facilities for offline analysis. This expands the concept of data management from offline storage, query processing, and transaction management operations into online-offline communication/…

=== IoT data lifecycle ===

Data processing is simply the conversion of raw data into meaningful information through a process. Data is manipulated to produce results that lead to the resolution of a problem or the improvement of an existing situation. Like a production process, it follows a cycle in which inputs (raw data) are fed to a process (computer systems, software, etc.) to produce output (information and insights). Organisations generally employ computer systems to carry out a series of operations on the data in order to present, interpret, or obtain information. The process includes activities such as data entry, summary, calculation, …

The lifecycle of data within an IoT system proceeds from data production to aggregation, …

Storage operations aim at making data available in the long term for constant access/…

  - Querying: data-intensive systems rely on querying as the core process for accessing and retrieving data. In the context of the IoT, a query can be issued either to request real-time data to be collected for temporal monitoring purposes or to retrieve a certain view of the data stored within the system. The first case is typical when a (mostly localised) real-time request for data is required. The second case represents more globalised views of data and in-depth analysis of trends and patterns.
  - Production: data production involves the sensing and transfer of data by the edge devices within the IoT framework and the reporting of this data to interested parties periodically (as in a subscribe/…
  - Collection: the sensors and smart objects within the IoT may store data for a certain time interval or report it to governing components. Data may be collected at concentration points or gateways within the network, where it is further filtered and processed, and possibly fused into compact forms for efficient transmission. Wireless communication technologies such as ZigBee, Wi-Fi and mobile networks are used by objects to send data to collection points. Collection is the first stage of the cycle and is crucial, since the quality of the collected data heavily affects the output. The collection process needs to ensure that the data gathered are both well defined and accurate, so that subsequent decisions based on the findings are valid. This stage provides both the baseline from which to measure and a target for what to improve. Some types of data collection include census (data collection about everything in a group or statistical population), …
  - Aggregation/…
  - Delivery: as data is filtered, aggregated, and possibly processed either at the concentration points or at the autonomous virtual units within the IoT, the results of these processes may need to be sent further up the system, either as final responses or for storage and in-depth analysis. Wired or wireless broadband communications may be used here to transfer data to permanent data stores.
  - Preprocessing: …
  - Storage/…
  - Processing/…
  - Output and interpretation: …

Depending on the architecture of an IoT system and the actual data management requirements in place, some of the steps described above can be omitted. Nevertheless, …
  * In relatively autonomous IoT systems, data proceeds from query to production to in-network processing and then delivery.
  * In more centralised systems, the data flow starts from production and proceeds to collection and filtering/…
  * In fully centralised systems, the data flow extends production to aggregation further and includes preprocessing, …

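The fully centralised flow described above can be illustrated with a minimal pipeline sketch. This is purely illustrative: the stage names, the (sensor id, value) reading format, and the validity threshold are assumptions for this example, not part of any standard.

```python
# Sketch of a centralised IoT data flow:
# production -> collection/filtering -> aggregation -> delivery/storage.
# All names, formats, and thresholds are illustrative assumptions.

def produce():
    # Edge devices report raw readings as (sensor id, value) pairs.
    return [("t1", 21.5), ("t2", 90.0), ("t1", 22.1), ("t3", -999.0)]

def collect_and_filter(readings):
    # Gateways drop obviously invalid readings before forwarding.
    return [(sid, v) for sid, v in readings if -50.0 <= v <= 60.0]

def aggregate(readings):
    # Fuse per-sensor readings into compact summaries (mean value).
    sums, counts = {}, {}
    for sid, v in readings:
        sums[sid] = sums.get(sid, 0.0) + v
        counts[sid] = counts.get(sid, 0) + 1
    return {sid: sums[sid] / counts[sid] for sid in sums}

def store(summaries, repository):
    # Deliver summaries to the permanent data store for offline analysis.
    repository.update(summaries)
    return repository

repository = {}
store(aggregate(collect_and_filter(produce())), repository)
print(repository)  # per-sensor mean values of the readings that passed filtering
```

In a real deployment each stage would run on different hardware (device, gateway, data centre); the point here is only the order and shape of the data as it moves up the system.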
**IoT data management versus traditional database management systems**

Based on the IoT data lifecycle discussed earlier, we divide an IoT data management system into i) an online (i.e. real-time) front end that interacts directly with the interconnected IoT objects and sensors, and ii) an offline back end that handles the mass storage and in-depth analysis of the IoT data. The data management front end is communication-intensive, …

This envisioned data management architecture differs considerably from existing database management systems (DBMSs), which are mainly storage-centric. In traditional databases, the bulk of data is collected from predefined and finite sources and stored in scalar form in relations, according to strict normalisation rules. Queries are used to retrieve specific “summary” views of the system or to update specific items in the database. New data is inserted into the database when needed, also via insertion queries. Query operations are usually local, with execution costs bound to processing and intermediate storage. Transaction management mechanisms guarantee the ACID properties in order to enforce overall data integrity. Even if the database is distributed over multiple sites, query processing and distributed transaction management are enforced. The execution of distributed queries is based on the transparency principle, which dictates that the database is still viewed logically as one centralised unit, and the ACID properties are guaranteed via the two-phase commit protocol.

In IoT systems, the picture is dramatically different, with a massive and ever-growing number of data sources that include sensors, RFID tags, embedded systems, and mobile devices. Contrary to the occasional updates and queries submitted to traditional DBMSs, data streams constantly from a multitude of edge devices into the IoT data stores, and queries are more frequent and have more versatile needs. Hierarchical data reporting and aggregation may be required for scalability guarantees as well as to enable more prompt processing functionality. The strict relational database schema and relational normalisation practice may be relaxed in favour of more unstructured and flexible forms that adapt to the diverse data types and sophisticated queries. Although distributed DBMSs optimise queries based on communication considerations, …

**IoT data sources**

As the IoT gets involved in more domains and types of operations, countless data sources generate immense volumes of data. These sources can be roughly divided into three groups:

//Passive sources//

These are sensors that do not communicate actively and send the required information to the centralised management system only on demand. For instance, sensors that take atmospheric measurements produce data only when their API is invoked. That does not mean that an application using them is also passive; on the contrary, the data from passive sensors requires proper management and processing, and that is precisely what the application is there for.

//Active sources//

The main difference between passive and active sensors is that the latter transmit data continuously, …

//Dynamic sources//

These sources are the most sophisticated, …

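The three source types can be contrasted in a short sketch. The class and method names below are hypothetical, chosen only to make the pull (passive), push (active), and negotiated-push (dynamic) interaction patterns visible:

```python
# Illustrative sketch of the three IoT source types; not a standard API.

class PassiveSensor:
    """Produces data only when explicitly polled (on demand)."""
    def __init__(self, value):
        self._value = value

    def read(self):
        return self._value

class ActiveSensor:
    """Pushes every new reading to a subscriber as it is produced."""
    def __init__(self, subscriber):
        self._subscriber = subscriber

    def measure(self, value):
        self._subscriber(value)  # continuous, unsolicited reporting

class DynamicSensor(ActiveSensor):
    """Active sensor whose reporting interval can be renegotiated."""
    def __init__(self, subscriber, interval_s=60):
        super().__init__(subscriber)
        self.interval_s = interval_s

    def renegotiate(self, interval_s):
        # Two-way communication: the platform adjusts the reporting rate.
        self.interval_s = interval_s

received = []
active = ActiveSensor(received.append)
active.measure(21.5)                 # pushed without being asked

passive = PassiveSensor(13.2)
print(passive.read(), received)      # polled value, pushed values
```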
**Main IoT domains generating data**

The emergence of new information sources inevitably affects the data centre market, which has experienced structural changes in recent years. Indeed, information technology tends to go beyond processing data in traditional data centres and opts for cloud-centric ones. In just a few years, only 8% of overall workloads will be handled by old-school data centres.

The IoT is predicted to generate 403 ZB of data a year by 2018, up from 113.4 ZB in 2013. But according to Cisco, not all generated data will be sent to data centres. In three years' time, colocation sites should be hosting 8.6 ZB, up from 3.1 ZB in 2013. IDC has predicted that by 2020 one tenth of the world's data will be produced by machines. The organisation forecast that in five years' time the number of connected devices communicating over the internet will reach 32 billion and generate 10% of the world's data. CBR compiles a list of the top 10 key areas set to foster data growth resulting from IoT-connected solutions.

  - Air travel: arming planes with smart sensors to prevent failures is already a reality. These sensors produce several terabytes of data per flight; for example, Cisco said that a Boeing 787 aircraft could generate 40 TB per hour of flight. These IoT solutions in the air industry have several applications beyond preventing failure. They can also reduce fuel consumption, …
  - Mining: for the mining industry, the main benefit of using the IoT is safety. By automating machines (M2M), humans are not required to stay close to the vehicles and risk their lives. Cisco predicts that mining operations can generate up to 2.4 TB of data every minute.
  - Cars: a smart, IoT-connected vehicle is a fountain of data. It is constantly transmitting data to manufacturers, …
  - Utilities: the worldwide revenue opportunity presented by the IoT for the utilities industry is estimated to reach $201 billion by 2018. Smart meters are just one example. According to the UK Department of Energy & Climate Change, by the end of 2014 there were a total of 20.8 million gas meters and 25.3 million electricity meters operated by the larger energy suppliers in British domestic properties. Smart meters collect data on how much energy is being used every 30 minutes, around the clock, all year round, sending several terabytes of information to the cloud every year.
  - Cities: smart cities will be made of everything out there: street lamps talking to the grid, urban parks connecting to services, and rivers sending out alerts on pollution levels. All this data is generated on a daily basis and stored in the cloud. Millions of sensors deployed in every city will constantly produce huge amounts of information.
  - Wearables: it is estimated that by 2019 more than 578 million wearables will be in use around the world. These solutions constantly collect data on health, fitness and wellness. The amount of data produced by wearables varies according to the device being worn and the type of sensors it includes.
  - Sports: as sports adopt more wearables and intelligent clothing to improve performance, …
  - Logistics: until now, the transportation of goods ended once the supply chain had shipped the products. With the IoT the service will be extended further, and smart goods will constantly produce more data. Some logistics companies are already collecting data from their suppliers, and also from their suppliers' suppliers. Much of this data will come from RFID, giving logistics companies the ability to analyse it in real time and tackle any problems that might arise in the chain.
  - Healthcare: smart healthcare is already being adopted in several countries. Huge virtual platforms store patient data that can be accessed by health services anywhere else. The health sector will see huge benefits from the IoT, with sensors being deployed across all areas of a medical unit. Medical companies are using connectivity to prevent power surges in medical devices, including critical instruments used in surgeries. All this information is stored for future analysis.
  - Smart homes: smart homes are already a reality, and by 2020 consumers expect this ecosystem to be widely available. It is predicted that one smart connected home today can produce as much as 1 GB of information a week.

**Infrastructure and architectures for IoT data processing: Cloud, Fog, and Edge computing**

The IoT generates a vast amount of Big Data, and this in turn puts a huge strain on internet infrastructure, forcing companies to find solutions that minimise the pressure and solve the problem of transferring large amounts of data. Cloud computing has entered the mainstream of information technology, providing scalability in the delivery of enterprise applications and Software as a Service (SaaS). Companies are now migrating their information operations to the cloud. Many cloud providers allow data to be transferred either via a traditional internet connection or via a dedicated direct link. The benefit of a direct link into the cloud is that the data is uncontended, the traffic does not cross the public internet, and the quality of service can be controlled. As the IoT proliferates, …

Cloud computing and the IoT both serve to increase efficiency in everyday tasks, and the two have a complementary relationship.

The IoT owes its explosive growth to the connection of physical things and operational technologies to analytics and machine learning applications, …

Moreover, having every device connected to the cloud and sending raw data over the internet can have privacy, security and legal implications, …

The Fog/Edge layer is the perfect junction where there are enough compute, storage and networking resources to mimic cloud capabilities at the edge and support the local ingestion of data and the quick turnaround of results. The main benefits of Fog/Edge computing are the following:

  * Increased network capacity: Fog computing uses much less bandwidth, which means it does not cause bottlenecks and similar congestion. Less data movement on the network frees up network capacity, which can then be used for other things.
  * Real-time operation: Fog computing responds much faster than any other cloud computing architecture we know today. Since all data analysis is done on the spot, it represents a true real-time concept, which makes it a perfect match for the needs of the IoT.
  * Data security: collected data is more secure when it does not travel. This also makes data storage much simpler, because the data stays in its country of origin; sending data abroad might violate certain laws.

The current trend shows that Fog computing will continue to grow in usage and importance as the IoT expands and conquers new ground. With inexpensive, …

Fog/Edge computing has the potential to revolutionise the IoT in the next several years. It seems obvious that while the Cloud is a perfect match for the IoT in many scenarios, other IoT technologies demand low-latency ingestion and immediate processing of data, and there Fog computing is the answer. Fog/Edge computing improves efficiency and reduces the amount of data that needs to be sent to the cloud for processing. But it is here to complement the cloud, not replace it. The cloud will continue to have a pertinent role in the IoT cycle. In fact, with Fog computing shouldering the burden of short-term analytics at the edge, cloud resources will be freed to take on the heavier tasks, especially where the analysis of historical data and large datasets is concerned. Insights obtained by the cloud can help update and tweak policies and functionality at the fog layer. To sum up, it is the combination of Fog and Cloud computing that will accelerate the adoption of the IoT, especially for the enterprise.

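The bandwidth benefit above comes from summarising data before it leaves the edge. A minimal sketch, with made-up numbers and field names: an edge node reduces a window of raw samples to one compact summary, so a single record crosses the uplink instead of many, while an out-of-range spike remains visible in the summary.

```python
# Sketch of fog/edge preprocessing: summarise raw readings locally and
# forward only the summary to the cloud. Values are illustrative.

def edge_summarise(raw_readings):
    """Reduce a window of raw samples to one (min, max, mean, count) record."""
    return {
        "min": min(raw_readings),
        "max": max(raw_readings),
        "mean": sum(raw_readings) / len(raw_readings),
        "count": len(raw_readings),
    }

window = [20.1, 20.3, 19.9, 35.0, 20.2]   # one window of raw samples
summary = edge_summarise(window)

# One summary message is sent upstream instead of five raw messages,
# yet the anomalous spike is still detectable via "max".
print(summary)
```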
**IoT data storage models and frameworks**

The increasing volume of heterogeneous, unstructured IoT data has also led to the emergence of several solutions for storing these overwhelming datasets and supporting timely data management:
  * NoSQL databases are often used for storing IoT Big Data. This is a newer type of database that is becoming more and more popular among web companies today. Proponents of NoSQL solutions state that they provide simpler scalability and improved performance relative to traditional relational databases. These products excel at storing “unstructured data”, and the category includes open-source products such as Cassandra, MongoDB, and Redis.
  * In-memory databases keep data in main memory to make access to it faster. Representative examples are Redis and Memcached, both NoSQL databases served entirely from memory.

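What makes the NoSQL model a good fit for IoT data is that each record is self-describing, so heterogeneous devices can share one store without a fixed relational schema. A real deployment would use a system such as MongoDB or Redis; the plain dictionary below is only a sketch of the key-value/document data model, with hypothetical keys and fields.

```python
# Sketch of a schema-flexible (NoSQL-style) data model using a plain dict.
# Keys and field names are illustrative assumptions.

store = {}

def put(key, document):
    store[key] = document

# Two devices with completely different fields coexist in one collection;
# no table schema has to anticipate either of them.
put("thermo-1:2018-05-12T11:03", {"type": "temperature", "celsius": 21.4})
put("cam-7:2018-05-12T11:03", {"type": "camera", "frames": 24, "motion": True})

# A simple "query": select all temperature documents.
readings = [d for d in store.values() if d["type"] == "temperature"]
print(readings)
```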
**IoT data processing models and frameworks**

Processing frameworks and processing engines are responsible for computing over data in a data system. While there is no authoritative definition setting apart "engines" and "frameworks", it is sometimes useful to think of the engine as the component that actually operates on the data, and of the framework as the set of components built around it for the same purpose.

While the systems which handle this stage of the data lifecycle can be complex, the goals on a broad level are very similar: operate over data in order to increase understanding, …

//Batch Processing Systems//

Batch processing has a long history within the big data world. It involves operating over a large, static dataset and returning the result at a later time, when the computation is complete. The datasets in batch processing are typically:

  * bounded: batch datasets represent a finite collection of data;
  * persistent: data is almost always backed by some type of permanent storage;
  * large: batch operations are often the only option for processing extremely large sets of data.

Batch processing is well suited for calculations where access to a complete set of records is required. For instance, when calculating totals and averages, datasets must be treated holistically instead of as collections of individual records. These operations require that state be maintained for the duration of the calculation. Tasks that require very large volumes of data are often best handled by batch operations. Whether the datasets are processed directly from permanent storage or loaded into memory, batch systems are built with large quantities in mind and have the resources to handle them. Because batch processing excels at handling large volumes of persistent data, it is frequently used with historical data.

The trade-off for handling large quantities of data is longer computation time, so batch processing is not appropriate in situations where processing time is especially significant. Because the MapReduce methodology depends heavily on permanent storage, reading and writing multiple times per task, it tends to be fairly slow. On the other hand, since disk space is typically one of the most abundant server resources, MapReduce can handle enormous datasets. MapReduce has incredible scalability potential and has been used in production on tens of thousands of nodes.

Apache Hadoop: Apache Hadoop is a processing framework that exclusively provides batch processing, and it was the first big data framework to gain significant traction in the open-source community. Based on several papers and presentations by Google about how it was dealing with tremendous amounts of data at the time, Hadoop reimplemented the algorithms and component stack to make large-scale batch processing more accessible. Apache Hadoop and its MapReduce processing engine offer a well-tested batch processing model that is best suited for handling very large data sets where time is not a significant factor. The low cost of the components necessary for a well-functioning Hadoop cluster makes this processing inexpensive and effective for many use cases. Compatibility and integration with other frameworks and engines mean that Hadoop can often serve as the foundation for multiple processing workloads using diverse technology.

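The MapReduce model behind Hadoop can be illustrated with the classic word-count example. This single-process Python sketch only shows the shape of the computation (map, shuffle, reduce over a bounded batch); in Hadoop the same three phases run distributed across a cluster.

```python
# Minimal MapReduce-style batch computation (word count) in plain Python.
# Single-process and illustrative only; Hadoop distributes these phases.
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

batch = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(batch))))
# -> {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

Note how the whole (bounded, finite) dataset is available before the reduce step runs; that is precisely what distinguishes this model from stream processing.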
//Stream Processing Systems//

Stream processing systems compute over data as it enters the system. This requires a different processing model from the batch paradigm. Instead of defining operations to apply to an entire dataset, stream processors define operations that will be applied to each individual data item as it passes through the system. The datasets in stream processing are considered "unbounded", which has a few important implications:

  * The total dataset is only defined as the amount of data that has entered the system so far.
  * The working dataset is perhaps more relevant, and is limited to a single item at a time.
  * Processing is event-based and does not "end" until explicitly stopped; results are immediately available and are continually updated as new data arrives.

Stream processing systems can handle a nearly unlimited amount of data, but they only process one item (true stream processing) or very few items (micro-batch processing) at a time, with minimal state being maintained between records. While most systems provide methods of maintaining some state, stream processing is highly optimised for more functional processing with few side effects.

Functional operations focus on discrete steps that have limited state or side effects. Performing the same operation on the same piece of data will produce the same output independently of other factors. This kind of processing fits well with streams, because maintaining state between items is usually some combination of difficult, limited, and sometimes undesirable. So while some type of state management is usually possible, these frameworks are much simpler and more efficient in its absence.

This type of processing lends itself to certain kinds of workloads. Processing with near-real-time requirements is well served by the streaming model. Analytics, server or application error logging, and other time-based metrics are a natural fit, because reacting to changes in these areas can be critical to business functions. Stream processing is a good fit for data where you must respond to changes or spikes and where you are interested in trends over time. Representative stream processing frameworks include:

  * Apache Storm
  * Apache Samza

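The per-item model with minimal retained state can be sketched in a few lines. The rolling-average operator below is a hypothetical example of a stream operator: each reading is processed as it arrives, and the only state kept between items is a small fixed-size window.

```python
# Sketch of the streaming model: process each item on arrival, keeping
# only a small, bounded piece of state (a rolling window) between items.
from collections import deque

class RollingAverage:
    def __init__(self, window=3):
        self._items = deque(maxlen=window)   # the only state retained

    def process(self, value):
        # Called once per arriving item; the result is available immediately
        # and is continually updated as new data arrives.
        self._items.append(value)
        return sum(self._items) / len(self._items)

operator = RollingAverage(window=3)
for reading in [10.0, 20.0, 60.0, 30.0]:
    print(operator.process(reading))
```

Unlike the batch word count, the "dataset" here is never complete: the loop could run forever, and the operator would keep emitting up-to-date results.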
**Hybrid Processing Systems**

Some processing frameworks can handle both batch and stream workloads. These frameworks simplify diverse processing requirements by allowing the same or related components and APIs to be used for both types of data. The way this is achieved varies significantly between Spark and Flink, the two leading hybrid frameworks, and is largely a function of how the two processing paradigms are brought together and what assumptions are made about the relationship between fixed and unfixed datasets. While projects focused on one processing type may be a close fit for specific use cases, the hybrid frameworks attempt to offer a general solution for data processing. They not only provide methods for processing over data, but also have their own integrations, libraries, and tooling.

  * Apache Spark
  * Apache Flink

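One way hybrid engines bridge the two paradigms is micro-batching, the approach Spark Streaming is known for: the stream is buffered into tiny fixed-size batches, and an ordinary batch operation is applied to each. The sketch below only illustrates that idea in plain Python; it is not Spark's actual API.

```python
# Sketch of the micro-batch idea: turn an unbounded stream into a sequence
# of small, bounded batches and reuse a batch operation on each of them.

def micro_batches(stream, batch_size):
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch          # hand a full mini-batch to the batch engine
            batch = []
    if batch:
        yield batch              # flush the final partial batch

def batch_sum(batch):
    # Any batch operation works here; summing keeps the example short.
    return sum(batch)

stream = iter(range(1, 8))       # stands in for an unbounded source
sums = [batch_sum(b) for b in micro_batches(stream, 3)]
print(sums)
```

The trade-off is latency: results appear once per micro-batch rather than per item, which is why true per-item engines such as Flink take a different route.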
**IoT data semantics**

With some 25 billion devices expected to be connected to the Internet by 2015 and 50 billion by 2020, providing interoperability among the things on the IoT is one of the most fundamental requirements to support object addressing, tracking, and discovery, as well as information representation, …

The lack of an explicit and formal representation of IoT knowledge could cause ambiguity in terminology, …

In these circumstances, …

**Sensor Web Enablement and Semantic Sensor Networks**

The Sensor Web Enablement (SWE) standards enable developers to make all types of sensors, transducers and sensor data repositories discoverable, accessible and usable via the Web.

Semantic Web technologies have been proposed as a means to enable interoperability for sensors and sensing systems in the context of SWE. Semantic Web technologies can be used in isolation or to augment the SWE standards, in the form of the Semantic Sensor Web (SSW). Semantic technologies can assist in managing, querying, and combining sensors and observation data, allowing users to operate at abstraction levels above the technical details of format and integration, …

One of the main outcomes of the SSW research is the Semantic Sensor Network (SSN) ontology (by the W3C Semantic Sensor Network Incubator Group). This IoT ontology provides all the necessary semantics for the specification of IoT devices as well as the specification of the IoT solution (input, output, control logic) that is deployed using these devices. These semantics include terminology related to sensors and observations, …

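To give a feel for the semantic model, an SSN-style observation can be written as subject-predicate-object triples. The sketch below uses plain Python tuples rather than a real RDF library, and the shortened URIs (the `ex:` names in particular) are hypothetical; consult the W3C SSN ontology for the exact terms.

```python
# Illustrative sketch of an SSN-style observation as RDF-like triples.
# URIs are shortened and partly hypothetical; not a real RDF toolchain.

graph = {
    ("ex:obs1", "rdf:type",             "ssn:Observation"),
    ("ex:obs1", "ssn:observedBy",       "ex:thermometer1"),
    ("ex:obs1", "ssn:observedProperty", "ex:airTemperature"),
    ("ex:obs1", "ex:hasValue",          "21.4"),
}

def query(triples, predicate):
    """Return (subject, object) pairs matching one predicate."""
    return {(s, o) for s, p, o in triples if p == predicate}

# Which sensor produced this observation?
print(query(graph, "ssn:observedBy"))
```

Because every fact is an explicit, machine-readable triple, different IoT platforms can merge and query each other's observation data without agreeing on a shared database schema first.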
**IoT data visualisation**

One of the challenges for the IoT industry is data analysis and interpretation. The Big Data generated by IoT devices is of little practical use if it cannot be translated into something that is easy to understand, process and present visually. For this reason, data visualisation is becoming an integral part of the IoT. Data visualisation provides a way to display the avalanche of collected data in meaningful ways that clearly present the insights hidden within this mass of information, and it can assist us in making fast, informed decisions with more certainty and accuracy than ever before. It is thus vital for business professionals, …

**Machine learning and data science**

TODO?
