How to prepare data for Predictive Maintenance: feature engineering for industrial signals in the spirit of RCM and CBM
Predictive Maintenance starts with putting reliability in order – and only then moves to machine learning models. RCM (Reliability-Centered Maintenance) structures equipment functions, critical failure modes, and measurable symptoms. CBM (Condition-Based Maintenance) gives this an operational rhythm: condition monitoring, thresholds, trends, and decisions about inspection, planned shutdowns, or immediate intervention. In this setup, “feature engineering” is simply translating symptoms and CBM logic into a set of numerical indicators computed over time from OT data.
Industrial sensor signals are rarely convenient: different sampling rates, missing data, resets, automation artifacts, and changing operating modes. That is why, in practice, data preparation and feature design have one goal: to provide a stable answer to the question of whether the asset behaves normally under the current operating conditions – and if it deviates from normal, how fast the risk is increasing.
RCM as a map: function → failure mode → symptom → decision
RCM provides an excellent backbone for working with data. First comes the function and the required performance level, for example: a pump must maintain flow and pressure within a certain load range; a compressor must keep compressed-air network parameters within limits; a heat exchanger must transfer heat with the expected efficiency; a motor or gearbox must operate within acceptable vibration and temperature envelopes. Next come the failure modes that threaten each function. In RCM practice, these are typically documented through FMEA (Failure Mode and Effects Analysis), which maps each failure mode to its consequences and detection methods.
Each failure mode has symptoms, and these are the key for data. Bearing wear often shows up as increasing vibration energy and temperature (detectable through vibration analysis and thermography); cavitation as pressure and flow variability under specific suction conditions; fouling as a drop in ΔT relative to flow and inlet temperatures; filter clogging as an increase in ΔP. This sequence, from function through failure mode to symptom and decision, mirrors the logic of the P-F curve (Potential-to-Functional failure curve), where the goal is to detect degradation within the P-F interval, before functional failure occurs.
If symptoms are defined in this language, feature selection becomes straightforward: features should measure exactly those phenomena.
CBM as rhythm: condition, trend, episodes, and time windows
CBM operates in time windows that make sense for maintenance decisions – and these windows define the temporal resolution of feature extraction. For some assets, minutes matter; for others, hours or an entire shift. The time window should match degradation dynamics and the practical intervention cadence. In data terms, this means computing features over windows rather than individual samples: 15 minutes, 30 minutes, one hour, a shift – depending on the process.
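The windowing step can be sketched in a few lines. This is a minimal, dependency-free illustration, assuming samples arrive as (timestamp in seconds, value) pairs; the function name `bucket_by_window` and the sample readings are hypothetical, and in a real pipeline a time-series library would usually do this job.

```python
from collections import defaultdict


def bucket_by_window(samples, window_s):
    """Group (timestamp_s, value) pairs into fixed, non-overlapping windows.

    Returns {window_start_s: [values, ...]} ordered by window start.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        start = int(ts // window_s) * window_s  # floor to the window boundary
        buckets[start].append(value)
    return dict(sorted(buckets.items()))


# Hypothetical temperature readings: (seconds since start, degrees C)
readings = [(0, 61.0), (300, 61.4), (899, 61.2), (900, 63.0), (1500, 63.8)]

# 15-minute windows; features are then computed per bucket, not per sample
windows = bucket_by_window(readings, window_s=15 * 60)
print(windows)
```

The window length is the only tuning parameter here; it should follow the degradation dynamics and the intervention cadence of the asset.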
CBM is also state-based: normal, degraded, critical. These states derive from sensor signal level, variability, trend, and time spent outside an acceptable range – essentially a classification problem when translated into machine learning terms. Those are exactly the elements that should be reflected in features.
The most important technical step: operating-mode segmentation
In industrial signals, what looks like an anomaly is very often simply a change of operating mode. Start-up, shutdown, changeover, steady production, and different load ranges have different signatures in currents, temperatures, pressures, and vibrations. Without segmentation, the model “learns” operating modes and degradation disappears into the background.
Segmentation is typically based on PLC statuses, RUN/STOP signals, load levels from SCADA or IoT sensors, and sometimes recipe or product identification. In practice, stable-operation segments are extracted, while start-ups and shutdowns are treated as separate event categories. This split also makes sense for maintenance: symptoms should be evaluated under comparable conditions.
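As an illustration, stable-operation segments could be extracted from a RUN/STOP flag roughly as follows. This is a sketch under stated assumptions: the flag is already sampled on a regular grid, the `settle` and `min_len` parameters are hypothetical tuning knobs, and real segmentation would usually also consider load level and recipe.

```python
def stable_segments(run_flags, settle=2, min_len=3):
    """Return (start, end) index pairs (end exclusive) of steady operation.

    run_flags: list of booleans, one per sample (True = machine running).
    settle:    samples dropped after each start-up (assumed transient).
    min_len:   shortest segment still worth keeping.
    """
    segments = []
    start = None
    # A trailing False acts as a sentinel that closes a run still open at the end
    for i, running in enumerate(list(run_flags) + [False]):
        if running and start is None:
            start = i  # start-up detected
        elif not running and start is not None:
            seg_start = start + settle  # skip the start-up transient
            if i - seg_start >= min_len:
                segments.append((seg_start, i))
            start = None
    return segments


run = [False, True, True, True, True, True, True, False, True, True, True, False]
segments = stable_segments(run)
print(segments)  # the second run is too short once the transient is removed
```

Features for degradation are then computed only inside the returned segments, while the skipped start-up samples can be kept as a separate event category.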
Features that align well with RCM symptoms
The first group of features describes signal behavior within a window – what is sometimes called statistical feature extraction: mean and median, min/max, range, standard deviation, RMS. In many cases, degradation starts with increased variability and “nervousness” of the signal before the mean level shifts. That is why standard deviation and percentiles can be more valuable than the mean alone.
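A minimal sketch of these window statistics, using only the Python standard library. The helper names and the two sample windows are hypothetical; the percentile uses simple linear interpolation, which is one of several common conventions.

```python
import math
import statistics


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def window_stats(values):
    """Basic statistical features for one window of samples."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "std": statistics.pstdev(values),  # population std within the window
        "rms": math.sqrt(sum(v * v for v in values) / len(values)),
        "p95": percentile(values, 95),
    }


# Two hypothetical windows with the same mean but different variability
calm = window_stats([2.0, 2.0, 2.0, 2.0])
nervous = window_stats([1.0, 3.0, 1.0, 3.0])
print(calm["mean"], calm["std"])
print(nervous["mean"], nervous["std"])
```

The two windows have identical means, so only the standard deviation, range, and upper percentile separate them, which is the situation described above.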
The second group covers trend and rate of change: trend slope within a window, average absolute change between samples, number of sharp spikes, or time-to-rise to a given level. Maintenance teams often recognize issues precisely because something “started increasing faster” or “hits the limit more often.”
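These trend features could be computed along the following lines. This is a sketch, assuming samples within a window come with their timestamps; `spike_delta` is a hypothetical parameter defining how large a sample-to-sample jump counts as a spike.

```python
def trend_features(times, values, spike_delta):
    """Slope (least squares), mean absolute change, and spike count."""
    n = len(values)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    var = sum((t - t_mean) ** 2 for t in times)
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return {
        "slope": cov / var if var else 0.0,  # signal units per time unit
        "mean_abs_change": sum(diffs) / len(diffs) if diffs else 0.0,
        "spikes": sum(d > spike_delta for d in diffs),
    }


# Hypothetical vibration readings at one-minute spacing
feats = trend_features(
    times=[0, 1, 2, 3, 4],
    values=[10.0, 10.5, 11.0, 14.0, 12.0],
    spike_delta=1.5,
)
print(feats)
```

A least-squares slope is used rather than "last minus first" because it is less sensitive to a single noisy sample at either end of the window.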
The third group includes episode and exceedance features: time above a threshold, number of threshold crossings, length of the longest episode, time since the last exceedance. This is a direct translation of CBM practice into data – thresholds and time out of range drive many decisions.
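The episode features can be sketched as a single pass over the window. The assumptions here are a regular sampling interval `dt_s` and a fixed threshold; both values in the example are hypothetical.

```python
def exceedance_features(values, threshold, dt_s):
    """Threshold-episode features for regularly sampled values."""
    crossings = 0      # number of upward crossings (episodes started)
    longest = 0        # longest run of consecutive samples above threshold
    current = 0
    total_above = 0
    last_above = None  # index of the most recent sample above threshold
    prev = False
    for i, v in enumerate(values):
        above = v > threshold
        if above:
            if not prev:
                crossings += 1
            current += 1
            total_above += 1
            longest = max(longest, current)
            last_above = i
        else:
            current = 0
        prev = above
    return {
        "time_above_s": total_above * dt_s,
        "crossings": crossings,
        "longest_episode_s": longest * dt_s,
        # None means the threshold was never exceeded in this window
        "since_last_s": None if last_above is None
        else (len(values) - 1 - last_above) * dt_s,
    }


# Hypothetical differential-pressure readings, one per minute, limit 5.0
dp = [4.8, 5.2, 5.4, 4.9, 5.1, 5.3, 5.6, 4.7, 4.6]
episodes = exceedance_features(dp, threshold=5.0, dt_s=60)
print(episodes)
```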
The fourth group often delivers the highest ROI in industrial data: relationships between signals and deviation from normal behavior under the same operating conditions. Many faults manifest as efficiency changes rather than a shift in a single parameter. In pumps, a typical pattern is rising power draw with decreasing flow or changing ΔP. In heat exchangers, ΔT drops relative to flow. In compressors, energy per unit of delivered output increases. These dependencies are best captured by features such as flow/power, ΔP/flow, ΔT/flow, windowed correlations, and time lags between signals.
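Two of these relational features, a ratio and a windowed correlation, might look like this in code. It is a minimal sketch for a pump, assuming flow and power are already aligned sample by sample within a stable-operation window; the data is invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; None when either signal is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def relational_features(flow, power):
    """Efficiency-style features for one window (power assumed non-zero)."""
    return {
        "flow_per_power": (sum(flow) / len(flow)) / (sum(power) / len(power)),
        "flow_power_corr": pearson(flow, power),
    }


# Hypothetical windows: same pump, before and after degradation
healthy = relational_features(flow=[100, 110, 120, 130], power=[50, 55, 60, 65])
degraded = relational_features(flow=[95, 100, 102, 101], power=[55, 62, 68, 75])
print(healthy)
print(degraded)
```

In the degraded window the pump draws more power for less flow, so the ratio falls even though neither signal alone has necessarily crossed a limit.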
Residual features are especially practical. You build a simple “normal-operation” model (for instance a regression or a lightweight neural network) that predicts expected flow as a function of power and ΔP during stable periods, and then use the residual, measured value minus predicted value, as a feature. A residual that stays away from zero under comparable conditions indicates that the asset no longer behaves as it did when the model was fitted. In more advanced setups, autoencoders serve a similar purpose: they learn a compressed representation of normal behavior, and a high reconstruction error flags an anomaly.
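A residual feature can be sketched with a one-input linear baseline. This simplifies the setup described above: the hypothetical model predicts flow from power only (ΔP is left out to keep the example short), and the baseline data is invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    return slope, my - slope * mx


def residual(model, x, y):
    """Measured value minus the value expected under normal operation."""
    slope, intercept = model
    return y - (slope * x + intercept)


# Baseline fitted on stable periods known to be healthy (hypothetical data)
baseline_power = [50.0, 60.0, 70.0, 80.0]
baseline_flow = [100.0, 120.0, 140.0, 160.0]
model = fit_line(baseline_power, baseline_flow)

# Later observation: at 70 units of power the pump delivers only 126 of flow
r = residual(model, x=70.0, y=126.0)
print(r)  # negative: less flow than the baseline expects at this power
```

In practice the residual would be aggregated per window (mean, trend, time beyond a tolerance band) like any other signal.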
Data quality as part of reliability: consistent windows and explicit rules
RCM and CBM assume repeatability and explainability of decisions. In data, this translates into a consistent time axis and quality control. Signals come at different sampling frequencies; communication gaps and resets occur. Data should be aligned to a common rhythm – either by resampling onto a single grid or aggregating into windows – and every window should carry a quality label (e.g., missing-data percentage, number of suspicious points). Low-quality windows should be flagged and treated cautiously during training, rather than being “fixed” aggressively.
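A per-window quality label could be computed as in the sketch below. The assumptions are that missing samples appear as `None`, that the expected number of samples per window is known from the nominal sampling rate, and that a physically plausible range is available; the 20% limit is a hypothetical default, not a recommendation.

```python
def window_quality(values, expected_count, valid_range, max_missing_pct=20.0):
    """Quality label for one window; the data itself is left untouched."""
    present = [v for v in values if v is not None]
    missing_pct = 100.0 * (expected_count - len(present)) / expected_count
    lo, hi = valid_range
    # Points outside the plausible physical range are counted, not corrected
    suspicious = sum(not (lo <= v <= hi) for v in present)
    return {
        "missing_pct": missing_pct,
        "suspicious_points": suspicious,
        "usable": missing_pct <= max_missing_pct and suspicious == 0,
    }


# Hypothetical window: 6 samples expected, one gap, one dropped, one glitch
quality = window_quality(
    values=[20.1, None, 20.3, 999.0, 20.2],
    expected_count=6,
    valid_range=(-40.0, 150.0),
)
print(quality)
```

The label travels with the window so that training can down-weight or exclude it explicitly, instead of silently interpolating over the gap.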
Failure labels aligned with RCM: what to predict and when
Prediction makes sense only when the target is clearly defined. RCM naturally focuses attention on the moment when the asset begins losing its ability to perform its function, before functional failure occurs. In practice, two starting approaches work well: predicting failure within a time horizon (e.g., 7/14/30 days), which is a coarse form of remaining-useful-life estimation, and classifying a degraded state based on CBM thresholds and intervention history. Labels come from CMMS, downtime logs, and maintenance notes, and are then mapped onto the time axis. It is essential to keep temporal hygiene: features computed after an event must not leak into the “pre-event” dataset, and the train/test split must follow time rather than random sampling. Preventing data leakage across this temporal boundary is what separates reliable models from those that perform well in backtests but fail in production.
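Horizon labels and a time-ordered split can be sketched as follows. This is one possible convention, not the only one: times are in days, the function names are hypothetical, and the split additionally drops training windows whose label would depend on events after the cutoff.

```python
def label_windows(window_ends, failure_times, horizon):
    """1 if a failure occurs within `horizon` after the window end, else 0."""
    labels = []
    for end in window_ends:
        upcoming = [f for f in failure_times if f > end]
        labels.append(1 if upcoming and min(upcoming) - end <= horizon else 0)
    return labels


def time_split(window_ends, cutoff, horizon):
    """Indices for a time-ordered split without look-ahead.

    Training keeps only windows whose label is fully known at the cutoff;
    windows ending in the gap before the cutoff are dropped from both sets.
    """
    train = [i for i, end in enumerate(window_ends) if end + horizon <= cutoff]
    test = [i for i, end in enumerate(window_ends) if end > cutoff]
    return train, test


# Hypothetical timeline in days: one failure logged in the CMMS on day 22
window_ends = [1, 5, 10, 15, 20, 25]
labels = label_windows(window_ends, failure_times=[22], horizon=7)
train_idx, test_idx = time_split(window_ends, cutoff=12, horizon=7)
print(labels)
print(train_idx, test_idx)
```

The window ending on day 10 is excluded from training because its label depends on what happens up to day 17, which lies beyond the cutoff.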
How to implement this in Smart RDM – practically
In Smart RDM, the process typically starts with structuring assets and signals in the model, defining operating modes and comparable conditions, and then enabling window-based feature computation. On this foundation, you build risk scoring and CBM support mechanisms: alarms, priorities, checklists, and a learning loop through maintenance feedback. Over time, the “normal behavior” model is refined – adapting to model drift and evolving process conditions – and residual and relational features become central to diagnostics.
Summary
RCM and CBM simplify feature engineering because they impose a clear logic: critical failure modes, measurable symptoms, comparable operating modes, and operational decisions. In industrial data, the highest value comes from features based on variability, trend, exceedance episodes, and relationships between signals – especially deviation from normal behavior under the same conditions. This feature set yields machine learning models that are more stable, easier to explain, and closer to how maintenance teams actually diagnose problems.