How to prepare data pipelines for anomaly detection

One of the major benefits of creating a robust, ML-enabled data pipeline is the ability to feed data into an anomaly detection algorithm. The effectiveness of the anomaly detection capability, however, is almost entirely dependent on the quality of the data pipelines feeding into it. Before we cover how this happens, let’s first look at the challenges of working with data pipelines in general.

A data pipeline consists of an origin, one or more processors, and a destination.

In this visual representation of a Peaxy data pipeline, several transformations are performed on source data before writing it into one of two destination databases.

Extract, transform, load (ETL) is a common type of data pipeline used for preparing structured and semi-structured data for analysis. In ETL, source data may come from telemetry, customer relationship management (CRM) systems, or enterprise resource planning (ERP) systems. The data is then transformed by matching and mapping fields across the sources, applying filters, converting values to expected units of measure, or cleaning the data to address inconsistencies and duplicates. Finally, the pipeline loads the processed data into a data warehouse.
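To make the ETL flow concrete, here is a minimal sketch in Python. The source rows, the Fahrenheit-to-Celsius conversion, and the SQLite destination are hypothetical stand-ins for whatever sources, transformations, and warehouse a real pipeline would connect.

```python
import sqlite3

# Hypothetical source rows, standing in for telemetry or CRM/ERP extracts.
def extract():
    return [
        {"asset_id": "turbine-01", "temp_f": 176.0, "ts": "2024-01-01T00:00:00Z"},
        {"asset_id": "turbine-01", "temp_f": 176.0, "ts": "2024-01-01T00:00:00Z"},  # duplicate
        {"asset_id": "turbine-02", "temp_f": None, "ts": "2024-01-01T00:00:00Z"},   # incomplete
    ]

def transform(rows):
    seen, cleaned = set(), []
    for row in rows:
        key = (row["asset_id"], row["ts"])
        if key in seen or row["temp_f"] is None:  # drop duplicates and incomplete rows
            continue
        seen.add(key)
        cleaned.append({
            "asset_id": row["asset_id"],
            "ts": row["ts"],
            "temp_c": round((row["temp_f"] - 32) * 5 / 9, 2),  # convert to the expected unit
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    # Stand-in destination: a real pipeline would write to a data warehouse.
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS readings (asset_id TEXT, ts TEXT, temp_c REAL)")
        conn.executemany("INSERT INTO readings VALUES (:asset_id, :ts, :temp_c)", rows)

load(transform(extract()))
```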

Even after a data pipeline is created, it requires active management, since systems are rarely static and all pipelines will at times require components to be updated. Developing ad hoc pipelines with homegrown tools can sometimes be expedient, but it is ultimately more cost-effective in the long term to work with a partner who understands pipeline architecture and whose experience you can rely on in deploying the most appropriate tools, such as Azure IoT Hub or AWS Greengrass.

Although pipelines involving edge devices must be designed to assume failures, there are many ways to minimize them. It starts with selecting robust or ruggedized hardware. Providing redundant network uplinks over wired Ethernet, Wi-Fi, 4G, or satellite can keep edge devices in communication even when some links are offline. Local storage can be intelligently leveraged to persist data while uplinks are down, and understanding time-series databases, and how data must be replayed after a period of outage, will minimize gaps in sensor data.
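As one way to picture the local-persistence idea, the sketch below buffers readings in an on-device SQLite file while the uplink is down and replays them in timestamp order once it returns. The uplink_is_up and publish functions are placeholders for whatever connectivity check and transport a real deployment would use.

```python
import sqlite3
import time

BUFFER_DB = "edge_buffer.db"

def _conn():
    conn = sqlite3.connect(BUFFER_DB)
    conn.execute("CREATE TABLE IF NOT EXISTS buffer (ts REAL, payload TEXT)")
    return conn

def uplink_is_up():
    # Placeholder: a real check might probe the broker or cloud endpoint.
    return False

def publish(payload):
    # Placeholder: a real implementation might publish over MQTT or HTTPS.
    print("published:", payload)

def record(payload):
    """Send immediately when possible, otherwise persist locally."""
    if uplink_is_up():
        publish(payload)
    else:
        with _conn() as conn:
            conn.execute("INSERT INTO buffer VALUES (?, ?)", (time.time(), payload))

def replay():
    """After an outage, forward buffered readings in timestamp order to avoid gaps."""
    with _conn() as conn:
        rows = conn.execute("SELECT rowid, ts, payload FROM buffer ORDER BY ts").fetchall()
        for rowid, ts, payload in rows:
            publish(payload)
            conn.execute("DELETE FROM buffer WHERE rowid = ?", (rowid,))
```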

Even products such as wind turbines from a single company evolve over the years and represent their data in different ways. An older industrial product may expose telemetry in a binary format accessed over CAN or Modbus, and attaching to such devices often requires hardware that is less common and more difficult to interface with than Ethernet, such as RS-232, RS-422, or RS-485. Newer devices may expose a more contemporary REST API but offer only a limited set of sensors. Next-generation industrial equipment can have hundreds of sensors, which can inform thousands of data points describing the status and health of the equipment, creating data volume challenges.
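To illustrate the format gap, the sketch below decodes a hypothetical fixed-layout binary frame (the kind an older controller might expose over CAN or Modbus) and a JSON payload from a newer REST API into the same common record shape. The frame layout, field names, and scaling are invented for illustration.

```python
import json
import struct

# Hypothetical legacy frame layout: uint16 sensor id, int16 temperature in
# tenths of a degree C, uint32 epoch seconds, all big-endian.
LEGACY_FRAME = struct.Struct(">HhI")

def from_legacy_frame(frame: bytes) -> dict:
    sensor_id, temp_tenths, epoch = LEGACY_FRAME.unpack(frame)
    return {"sensor_id": sensor_id, "temp_c": temp_tenths / 10.0, "ts": epoch}

def from_rest_payload(body: str) -> dict:
    # A newer device might return JSON such as
    # {"id": 7, "temperatureC": 41.2, "timestamp": 1700000000}
    doc = json.loads(body)
    return {"sensor_id": doc["id"], "temp_c": doc["temperatureC"], "ts": doc["timestamp"]}

# Both paths yield the same common record shape.
print(from_legacy_frame(struct.pack(">HhI", 7, 412, 1700000000)))
print(from_rest_payload('{"id": 7, "temperatureC": 41.2, "timestamp": 1700000000}'))
```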

If compute resources are sufficient at the edge, data can be normalized there; otherwise, the raw data is pushed to the cloud for processing. In either case, the data schema must be understood. Schemas can be stored and referenced through a schema registry, which ensures that the producer and consumer share a mutual understanding of what the data looks like.
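Here is a minimal sketch of that producer/consumer contract, assuming an in-memory registry and the jsonschema package in place of a full registry service; the subject name and schema fields are illustrative only.

```python
from jsonschema import validate  # pip install jsonschema

# Hypothetical registry: in production this would be a shared service,
# keyed by subject and schema version.
SCHEMA_REGISTRY = {
    ("turbine-telemetry", 1): {
        "type": "object",
        "properties": {
            "sensor_id": {"type": "integer"},
            "temp_c": {"type": "number"},
            "ts": {"type": "integer"},
        },
        "required": ["sensor_id", "temp_c", "ts"],
    }
}

def check_against_registry(record: dict, subject: str, version: int) -> dict:
    """Producer and consumer validate against the same registered schema."""
    schema = SCHEMA_REGISTRY[(subject, version)]
    validate(instance=record, schema=schema)  # raises ValidationError on mismatch
    return record

check_against_registry({"sensor_id": 7, "temp_c": 41.2, "ts": 1700000000}, "turbine-telemetry", 1)
```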

One of the major uses of a data pipeline is to route data through anomaly detection algorithms to alert operators of potential faults. In this case, the data pipeline should include a number of additional processes to perform the following functions:

- Predict the expected value of each monitored signal, typically using models trained on historical data.
- Compare incoming live values against the expected values using a configured threshold.
- Raise an event or notification when the deviation exceeds the threshold.

Incorporating the above three functions within a data pipeline may take multiple processors and executor steps. Where a live value deviates from the predicted value by more than the configured threshold, an event or notification is raised, alerting operators to the potential impending fault.
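The thresholding step itself can be small. The sketch below assumes a trained model object with a predict method and an invented notify helper, and raises an alert whenever a live reading deviates from the prediction by more than the configured threshold.

```python
from dataclasses import dataclass

@dataclass
class AnomalyEvent:
    asset_id: str
    observed: float
    predicted: float
    threshold: float

def notify(event: AnomalyEvent):
    # Placeholder: a real pipeline would push this into an alert-management system.
    print(f"ALERT {event.asset_id}: observed {event.observed}, expected {event.predicted}")

def check_reading(asset_id, observed, model, threshold=5.0, features=None):
    """Compare a live reading against the model's expected value and alert on large deviations."""
    predicted = model.predict(features)        # hypothetical trained model
    if abs(observed - predicted) > threshold:  # configured threshold
        notify(AnomalyEvent(asset_id, observed, predicted, threshold))
        return True
    return False

class ConstantModel:
    """Stand-in for a trained model that predicts a fixed expected value."""
    def predict(self, features):
        return 70.0

check_reading("turbine-01", observed=78.5, model=ConstantModel(), threshold=5.0)
```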

Peaxy Lifecycle Intelligence trains a basket of machine-learning algorithms on historical failure data to discover non-trivial discrepancies in live data streams that can signal incipient component failures. These anomalies are fed into PLI’s alert management system, where the user can fully analyze the data.

By choosing to pre-emptively repair or replace equipment, operators can avert catastrophic failures. Because each asset class has a unique use case, Peaxy’s analytics experts perform the initial tuning and tweaking of algorithms. The module’s accuracy improves over time through continuous data ingestion, the use of multiple competing algorithms, and manual feedback on false-positive alerts.
