Hearing the Voice – How Analog Devices’ Artificial Intelligence Dramatically Increases Equipment Uptime

Anyone who understands equipment maintenance knows how important the sounds and vibrations a machine makes are. Proper equipment health monitoring through sound and vibration can cut maintenance costs in half and double equipment lifespan. Acquiring and analyzing acoustic data in real time is another important approach to condition-based monitoring (CbM).


We can learn what a device normally sounds like. When the sound changes, we can tell that something is abnormal. We can then work out what the problem is and so associate the sound with that specific problem. Learning to identify anomalies can take minutes of training, but learning to connect sounds, vibrations, and their causes well enough to make a diagnosis can take a lifetime. Experienced mechanics and engineers may have this knowledge, but they are a scarce resource. Identifying a problem from the sound alone can be quite difficult, even with audio recordings, descriptive frameworks, or in-person training by an expert.

That’s why the Analog Devices team has spent the past 20 years working to understand how humans interpret sound and vibration. Our goal is to build a system that can learn sounds and vibrations from devices, decipher their meanings, detect abnormal behavior, and make diagnoses. This article details the architecture of OtoSense, a device health monitoring system that supports what we call computer hearing, allowing computers to understand the main indicators of device behavior: sound and vibration.

The system works on any device and can run in real time without a network connection. It has been used in industrial applications to build scalable and efficient equipment health monitoring systems.

This article discusses the principles that guided the development of OtoSense and the role human hearing played in its design. It then discusses how sound and vibration features are designed, how the meaning of those features can be understood, and how OtoSense keeps learning over time to perform increasingly complex diagnostics with increasing accuracy.

Guiding Principles

To be robust, agnostic, and efficient, the OtoSense design philosophy follows several guiding principles:

• Draw inspiration from human neurology. Humans can learn and make sense of any sound they hear in a very energy-efficient way.

• Be able to learn both stationary and transient sounds. This requires features adapted to both kinds of sound and continuous monitoring.

• Perform recognition at the terminal (the edge device), close to the sensor. Making a decision should not require a network connection to a remote server.

• Interact with experts and learn from them, while interfering with their daily work as little as possible and keeping the process as pleasant as possible.

The human auditory system and its translation to OtoSense

Hearing is the sense of survival. It is the holistic sense of distant, unseen events, and it matures before birth.

The process by which humans perceive sound can be described by four familiar steps: analog acquisition of sound, digital conversion, feature extraction, and interpretation. At each step, we compare the human ear to the OtoSense system.

• Analog acquisition and digitization. A membrane and levers in the middle ear capture sound and match impedance to transmit the vibrations into a fluid-filled cavity, where another membrane is selectively displaced according to the spectral components present in the signal. This in turn bends flexible cells, which emit digital signals reflecting the degree and strength of the bending. These individual signals are then transmitted to the primary auditory cortex via parallel nerves arranged by frequency.

  ◦ In OtoSense, this work is done by sensors, amplifiers, and codecs. Digitization uses a fixed sampling rate, adjustable between 250 Hz and 196 kHz; the waveform is encoded in 16 bits and stored in buffers of 128 to 4096 samples.

• Feature extraction occurs in the primary cortex. It covers frequency-domain features, such as dominant frequencies, harmonics, and spectral shape, and time-domain features, such as pulses and changes in intensity and in the dominant frequency components, within a time window of approximately 3 s.

  ◦ OtoSense uses a time window, which we call a “block”, that moves in fixed steps. The block size and step size range from 23 ms to 3 s, depending on the events to be identified and the sampling rate, and features are extracted at the terminal. The next section explains the features extracted by OtoSense in more detail.

• Interpretation occurs in the associative cortex, which integrates all perceptions and memories and gives meaning to sounds (for example, through language), playing a central role in shaping perception. The interpretation process organizes our descriptions of events beyond simply naming them. Naming an object, a sound, or an event allows us to give it a larger, more layered meaning. For experts, names and meanings provide a better understanding of their surroundings.

  ◦ That’s why OtoSense’s interaction with people begins with a visual, unsupervised sound mapping based on human neurology. OtoSense shows a graphical representation of all the sounds or vibrations heard, arranged by similarity, but does not attempt to create a fixed classification. This lets experts organize and name the groups displayed on the screen without artificially creating bounded categories. They can build semantic maps based on their knowledge, perceptions, and expectations of the final output of OtoSense. The same soundscape would be divided, organized, and labeled differently by an auto mechanic, an aerospace engineer, or a cold forging press specialist, or even by people working in the same field but at different companies. OtoSense uses the same bottom-up approach to creating meaning that shapes our use of language.
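The acquisition and windowing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OtoSense code: the function names `split_into_buffers` and `sliding_blocks`, the 8 kHz sampling rate, and the block and step sizes are all hypothetical. Only the buffer size range (128 to 4096 samples) and the idea of a block moving in fixed steps come from the text.

```python
from typing import Iterator, List, Sequence


def split_into_buffers(samples: Sequence[int], buffer_size: int) -> List[Sequence[int]]:
    """Cut a stream of 16-bit samples into fixed-size buffers (incomplete tail dropped)."""
    if not 128 <= buffer_size <= 4096:
        raise ValueError("buffer size is expected to be between 128 and 4096")
    n_full = len(samples) // buffer_size
    return [samples[i * buffer_size:(i + 1) * buffer_size] for i in range(n_full)]


def sliding_blocks(buffers: list, buffers_per_block: int, step: int) -> Iterator[list]:
    """Yield blocks of consecutive buffers, advancing by a fixed step."""
    for start in range(0, len(buffers) - buffers_per_block + 1, step):
        yield buffers[start:start + buffers_per_block]


# Hypothetical settings: 8 kHz sampling and 256-sample buffers (32 ms each).
sample_rate_hz = 8000
buffer_size = 256
signal = [0] * (sample_rate_hz * 2)  # two seconds of silence as placeholder data

buffers = split_into_buffers(signal, buffer_size)
blocks = list(sliding_blocks(buffers, buffers_per_block=8, step=4))

# Duration of one block in milliseconds (8 buffers of 32 ms each).
block_ms = 1000 * 8 * buffer_size / sample_rate_hz
```

With these assumed settings each block lasts 256 ms and overlaps its neighbor by half, which sits inside the 23 ms to 3 s range mentioned above.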

From sound and vibration to features

A feature is an individual number that describes a given property or quality of the sound or vibration over a period of time (a time window or block, as described earlier). The principles for selecting features for the OtoSense platform are as follows:

• Features should describe the environment as completely as possible and in as much detail as possible, in both the frequency and time domains. They must describe steady hums as well as clicks, clatters, squeaks, and any other momentarily changing sound.

• Features should form a set that is as orthogonal as possible. If one feature is defined as “average amplitude over the block”, there should be no other feature that is highly correlated with it, such as “total spectral energy over the block”. Perfect orthogonality may never be achieved, but no feature should be expressible as a combination of the others; each must contain a single piece of information.

• Features should minimize computation. Our brains only know addition, comparison, and reset to 0. Most OtoSense features are designed to be incremental, so that each new sample modifies the feature with a simple operation, without recomputing over a full buffer or, worse, over a block. Minimizing computation also means that standard physical units can be ignored. For example, there is no point in expressing intensity as a value in dBA. If dBA values are required, the conversion can be done at output time.
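The incremental principle in the last item can be illustrated with a small sketch. The class `IncrementalFeatures` and its three features are hypothetical and are not taken from OtoSense. In particular, using the sum of absolute sample-to-sample differences as a stand-in for the "linear length of the waveform" is an assumption. The point is only that each new sample updates every feature with additions and comparisons, and that a reset returns everything to 0.

```python
class IncrementalFeatures:
    """Running time-domain features updated one sample at a time (illustrative only)."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        """Return every accumulator to 0 at the start of a new block."""
        self.count = 0
        self.abs_sum = 0      # accumulates |x[n]| for the average amplitude
        self.max_abs = 0      # largest |x[n]| seen so far
        self.line_length = 0  # accumulates |x[n] - x[n-1]| (assumed complexity proxy)
        self._previous = None

    def update(self, sample: int) -> None:
        """Fold one new sample into all features without revisiting older samples."""
        magnitude = abs(sample)
        self.count += 1
        self.abs_sum += magnitude
        if magnitude > self.max_abs:
            self.max_abs = magnitude
        if self._previous is not None:
            self.line_length += abs(sample - self._previous)
        self._previous = sample

    def average_amplitude(self) -> float:
        """Unitless average amplitude; the only division happens at read-out."""
        return self.abs_sum / self.count if self.count else 0.0


features = IncrementalFeatures()
for sample in [0, 3, -4, 1]:
    features.update(sample)
```

Values stay in raw sample units throughout, in line with the idea that a conversion to physical units, if needed, belongs at output time.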

Of the 2 to 1024 features of the OtoSense platform, some describe the time domain. They are extracted either directly from the waveform or from the evolution of any other feature over the block. These features include average and maximum amplitude, complexity derived from the linear length of the waveform, amplitude variation, the presence or absence of pulses and their characteristics, stability measured as the similarity between the first and last buffers, a lightweight autocorrelation that avoids convolution, and changes in the dominant spectral peaks.

Features used in the frequency domain are extracted from the FFT. The FFT is computed on each buffer, producing outputs from 128 to 2048 individual frequencies. The process then creates a vector of the desired dimensionality that is much smaller than the FFT output but still describes the environment in detail. OtoSense initially uses an agnostic method to create equal-sized bins on the log spectrum. These bins then concentrate on spectral regions with high information density, either from an unsupervised perspective that maximizes entropy or from a semi-supervised perspective that uses labeled events as a guide, depending on the context and the events to be identified. This mimics the cellular structure of our inner ear, where sensory cells are denser in the frequency ranges where speech information is greatest.
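The initial, agnostic reduction step can be sketched as follows. This is one possible reading of "equal-sized bins on the log spectrum", namely bins of equal width on a logarithmic frequency axis, and it is an assumption rather than the OtoSense method. The function names are hypothetical, and a naive DFT stands in for a real FFT so that the example needs only the standard library. The later adaptation of bins toward high-information regions is not shown.

```python
import cmath
import math
from typing import List


def magnitude_spectrum(buffer: List[float]) -> List[float]:
    """Naive DFT magnitudes for bins 0..N/2-1 (a slow stand-in for a real FFT)."""
    n = len(buffer)
    spectrum = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(buffer))
        spectrum.append(abs(acc))
    return spectrum


def log_bin_edges(n_freqs: int, n_bins: int) -> List[int]:
    """Bin edges equally spaced on a log-frequency axis, skipping the DC bin."""
    edges = [round(math.exp(math.log(n_freqs) * i / n_bins)) for i in range(n_bins + 1)]
    # Force strictly increasing edges so that no bin is empty.
    for i in range(1, len(edges)):
        edges[i] = max(edges[i], edges[i - 1] + 1)
    if edges[-1] > n_freqs:
        raise ValueError("too many bins for this spectrum size")
    return edges


def reduce_spectrum(spectrum: List[float], n_bins: int) -> List[float]:
    """Sum the spectral energy inside each log-spaced bin."""
    edges = log_bin_edges(len(spectrum), n_bins)
    return [sum(m * m for m in spectrum[lo:hi]) for lo, hi in zip(edges, edges[1:])]


# A 128-sample buffer holding a pure tone centered on DFT bin 10.
n = 128
tone = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
vector = reduce_spectrum(magnitude_spectrum(tone), n_bins=8)
strongest_bin = max(range(len(vector)), key=vector.__getitem__)
```

Here 64 frequency values are reduced to an 8-element vector, with narrow bins at low frequencies and wide bins at high frequencies.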

Architecture: processing at the terminal, data kept local

OtoSense performs anomaly detection and event identification at the terminal, without using any remote equipment. This architecture ensures that the system is not affected by network failures and that raw data blocks do not all need to be sent out for analysis. Terminal devices running OtoSense are self-contained systems that describe, in real time, the behavior of the device they are listening to.

Figure 1. OtoSense system.

OtoSense servers running the AI and HMI are typically hosted locally. A cloud architecture makes sense for aggregating multiple meaningful data streams output by OtoSense devices. It makes much less sense to use cloud hosting for an AI that specializes in processing large amounts of data and interacting with hundreds of devices on a single site.

From features to anomaly detection

Normal/abnormal assessment does not require much interaction with experts. Experts only need to help establish a baseline that represents normal equipment sound and vibration. This baseline is then transformed into an anomaly model on the OtoSense server before being pushed to the device.

We then use two different strategies to assess whether incoming sound or vibration is normal:

• The first strategy relies on what we call “usualness”: for any new sound entering the feature space, the system examines its surroundings, its distance from baseline points and clusters, and the size of those clusters. The larger the distance and the smaller the clusters, the more unusual the new sound and the higher its outlier score. When this outlier score is above an expert-defined threshold, the corresponding block is marked as unusual and sent to the server for experts to review.

• The second strategy is very simple: any incoming block with a feature value above the maximum or below the minimum that the baseline defines for that feature is marked as “extreme” and sent to the server.

The combination of the unusual and extreme strategies provides good coverage of anomalous sounds or vibrations, and the two strategies also perform well at detecting both progressive wear and sudden, brutal events.
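The two strategies can be sketched as follows. This is a minimal illustration and not the OtoSense anomaly model: the function names, the representation of a cluster as a (centroid, size) pair, the scoring formula distance / log(1 + size), and the threshold value are all assumptions chosen so that the score grows with distance and shrinks with cluster size, as the text describes.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def outlier_score(point: Vector, clusters: List[Tuple[Vector, int]]) -> float:
    """Score that grows with distance to a baseline cluster and shrinks with its size.

    Each cluster is (centroid, number_of_baseline_points). The formula
    distance / log(1 + size) is an illustrative assumption.
    """
    return min(math.dist(point, centroid) / math.log(1 + size) for centroid, size in clusters)


def is_extreme(point: Vector, baseline: List[Vector]) -> bool:
    """True if any feature lies outside the min/max range seen in the baseline."""
    for i, value in enumerate(point):
        column = [row[i] for row in baseline]
        if value < min(column) or value > max(column):
            return True
    return False


def assess(point: Vector, baseline: List[Vector],
           clusters: List[Tuple[Vector, int]], threshold: float) -> Dict[str, bool]:
    """Apply both strategies to one incoming block's feature vector."""
    return {
        "unusual": outlier_score(point, clusters) > threshold,
        "extreme": is_extreme(point, baseline),
    }


# Hypothetical two-feature baseline: one tight cluster of four points and one lone point.
baseline = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
clusters = [((0.5, 0.5), 4), ((10.0, 10.0), 1)]
threshold = 2.0  # stands in for the expert-defined threshold

near_baseline = assess((0.6, 0.4), baseline, clusters, threshold)
between_clusters = assess((5.0, 5.0), baseline, clusters, threshold)
out_of_range = assess((0.5, 12.0), baseline, clusters, threshold)
```

In this toy data the point between the two clusters is unusual without being extreme, because every one of its feature values stays inside the baseline range. That is the kind of case the second strategy alone would miss.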

From feature to event recognition

Features belong to the realm of physics, and meaning belongs to human cognition. Connecting features to meaning requires interaction between the OtoSense AI and human experts. We spent a lot of time studying customer feedback to develop a Human Machine Interface (HMI) that allows engineers to interact efficiently with OtoSense and design event recognition models. This HMI allows exploring data, labeling data, creating anomaly models and event recognition models, and testing those models.

The OtoSense Sound Platter (also known as splatter) allows exploration and labelling of sounds with a complete overview of the dataset. Splatter selects the most interesting and representative sounds in the full dataset and displays them as a 2D similarity map that mixes labeled and unlabeled sounds.

Figure 2. The 2D splatter sound map in the OtoSense Sound Platter.

Any sound or vibration, including its environment, can be visualized in many different ways—for example, using the Sound Widget (also known as a Widget).

Figure 3. OtoSense sound widget (widget).

At any time, an anomaly model or an event recognition model can be created. The event recognition model is displayed as a circular confusion matrix that allows OtoSense users to explore which events are confused with one another.

Figure 4. An event recognition model can be created based on the desired events.

Anomalies can be inspected and flagged through an interface that displays all anomalous and extreme sounds.

Figure 5. Sound analysis over time in the OtoSense anomaly visualization interface.

Continuous learning process – from anomaly detection to increasingly complex diagnostics

OtoSense is designed to learn from multiple experts and, over time, make more and more complex diagnoses. A common process is a loop between OtoSense and an expert:

• Both the anomaly model and the event recognition model run at the terminal. These models output the probability of potential events occurring, along with their outlier scores.

• Unusual sounds or vibrations that exceed a defined threshold trigger an anomaly notification. Technicians and engineers using OtoSense can examine the sound together with what came before and after it.

• These experts then label the anomalous event.

• New recognition and anomaly models that include this new information are computed and pushed to the terminal devices.

Conclusion

OtoSense technology from Analog Devices is designed to make sound and vibration expertise continuously available on any device, without the need for a network connection to perform anomaly detection and event identification. The technology is increasingly being used for device health monitoring in aerospace, automotive, and industrial monitoring applications, where it has shown good performance in scenarios that once required human expertise and in scenarios involving embedded applications, especially on complex devices.


Sebastien Chistian, “How words create the world.” TEDxCambridge, 2014.

Sebastien Chistian