MM466 Project - Multi-mode Fault Diagnosis Datasets of Gearbox Under Variable Working Conditions

Group Members: Wilisoni Marayawa (S11196753), Naitinteari Tekamwi (S11126433)

Gearbox Fault Diagnosis Using Machine Learning: Project Blog Summary

Our machine learning journey for gearbox fault diagnosis began in Week 6, but not without hurdles. Initially, we worked with a different dataset, which presented significant compatibility and quality issues. From Weeks 6 to 10, we spent a substantial portion of our time trying to clean, understand, and process this first dataset. However, due to persistent challenges and feedback from our supervisor, we switched at the end of Week 10 to a new dataset: the MCC5-THU gearbox fault diagnosis dataset.


Week 10 to 12: Familiarization Phase

Upon receiving the MCC5-THU dataset, we dedicated Weeks 10 to 12 to exploring and understanding its structure. This dataset comprised 240 .csv files, each recording 8 sensor channels (speed, torque, 3-axis motor vibration, and 3-axis gearbox vibration) under different fault conditions and loads. Given the large size and variability of the files (~768,000 rows each), initial attempts to load and merge all 240 files proved impractical within MATLAB due to memory constraints and execution time.
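
A rough back-of-the-envelope estimate illustrates why loading everything at once was impractical. The sketch below assumes that each file has exactly 768,000 rows and that every value is held as a MATLAB double (8 bytes); both are simplifications, and the variable names are ours for illustration only.

```matlab
% Rough memory estimate for holding all files in RAM at once.
% Assumptions (not measured): 768,000 rows per file, 8 channels,
% every value stored as a double (8 bytes).
nFiles        = 240;
rowsPerFile   = 768000;
nChannels     = 8;
bytesPerValue = 8;

totalBytes = nFiles * rowsPerFile * nChannels * bytesPerValue;
totalGB    = totalBytes / 1e9;

fprintf('Approximate raw size in memory: %.1f GB\n', totalGB);  % about 11.8 GB
```

This counts only the raw matrix; temporary copies made while merging would add to it.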

Decision Point:
We reduced our working set to 96 selected files covering all fault types, the available severity levels, and a range of speed and load conditions. This choice balanced dataset diversity against computational feasibility. The following files were chosen:

teeth_crack_L_speed_circulation_10Nm-2000rpm.csv

teeth_crack_L_torque_circulation_1000rpm_10Nm.csv

teeth_crack_L_torque_circulation_1000rpm_20Nm.csv

teeth_crack_M_speed_circulation_10Nm-1000rpm.csv

teeth_crack_M_speed_circulation_10Nm-2000rpm.csv

teeth_crack_M_torque_circulation_1000rpm_10Nm.csv

teeth_crack_M_torque_circulation_1000rpm_20Nm.csv

gear_pitting_H_speed_circulation_10Nm-1000rpm.csv

gear_pitting_H_speed_circulation_10Nm-2000rpm.csv

gear_pitting_H_torque_circulation_1000rpm_10Nm.csv

gear_pitting_H_torque_circulation_1000rpm_20Nm.csv

gear_pitting_L_speed_circulation_10Nm-1000rpm.csv

gear_pitting_L_speed_circulation_10Nm-2000rpm.csv

gear_pitting_L_torque_circulation_1000rpm_10Nm.csv

gear_pitting_L_torque_circulation_1000rpm_20Nm.csv

gear_pitting_M_speed_circulation_10Nm-1000rpm.csv

gear_pitting_M_speed_circulation_10Nm-2000rpm.csv

gear_pitting_M_torque_circulation_1000rpm_10Nm.csv

gear_pitting_M_torque_circulation_1000rpm_20Nm.csv

gear_wear_H_speed_circulation_10Nm-1000rpm.csv

gear_wear_H_speed_circulation_10Nm-2000rpm.csv

gear_wear_H_torque_circulation_1000rpm_10Nm.csv

gear_wear_H_torque_circulation_1000rpm_20Nm.csv

gear_wear_L_speed_circulation_10Nm-1000rpm.csv

gear_wear_L_speed_circulation_10Nm-2000rpm.csv

gear_wear_L_torque_circulation_1000rpm_10Nm.csv

gear_wear_L_torque_circulation_1000rpm_20Nm.csv

gear_wear_M_speed_circulation_10Nm-1000rpm.csv

gear_wear_M_speed_circulation_10Nm-2000rpm.csv

gear_wear_M_torque_circulation_1000rpm_10Nm.csv

gear_wear_M_torque_circulation_1000rpm_20Nm.csv

health_speed_circulation_10Nm-1000rpm.csv

health_speed_circulation_10Nm-2000rpm.csv

health_speed_circulation_10Nm-3000rpm.csv

health_speed_circulation_20Nm-1000rpm.csv

health_speed_circulation_20Nm-2000rpm.csv

health_speed_circulation_20Nm-3000rpm.csv

health_torque_circulation_1000rpm_10Nm.csv

health_torque_circulation_1000rpm_20Nm.csv

health_torque_circulation_2000rpm_10Nm.csv

health_torque_circulation_2000rpm_20Nm.csv

health_torque_circulation_3000rpm_10Nm.csv

health_torque_circulation_3000rpm_20Nm.csv

miss_teeth_speed_circulation_10Nm-1000rpm.csv

miss_teeth_speed_circulation_10Nm-2000rpm.csv

miss_teeth_speed_circulation_10Nm-3000rpm.csv

miss_teeth_speed_circulation_20Nm-1000rpm.csv

miss_teeth_speed_circulation_20Nm-2000rpm.csv

miss_teeth_speed_circulation_20Nm-3000rpm.csv

miss_teeth_torque_circulation_1000rpm_10Nm.csv

miss_teeth_torque_circulation_1000rpm_20Nm.csv

miss_teeth_torque_circulation_2000rpm_10Nm.csv

miss_teeth_torque_circulation_2000rpm_20Nm.csv

miss_teeth_torque_circulation_3000rpm_10Nm.csv

miss_teeth_torque_circulation_3000rpm_20Nm.csv

teeth_break_and_bearing_inner_H_speed_circulation_10Nm-1000rpm.csv

teeth_break_and_bearing_inner_H_speed_circulation_10Nm-2000rpm.csv

teeth_break_and_bearing_inner_H_torque_circulation_1000rpm_10Nm.csv

teeth_break_and_bearing_inner_H_torque_circulation_1000rpm_20Nm.csv

teeth_break_and_bearing_inner_L_speed_circulation_10Nm-1000rpm.csv

teeth_break_and_bearing_inner_L_speed_circulation_10Nm-2000rpm.csv

teeth_break_and_bearing_inner_L_torque_circulation_1000rpm_10Nm.csv

teeth_break_and_bearing_inner_L_torque_circulation_1000rpm_20Nm.csv

teeth_break_and_bearing_inner_M_speed_circulation_10Nm-1000rpm.csv

teeth_break_and_bearing_inner_M_speed_circulation_10Nm-2000rpm.csv

teeth_break_and_bearing_inner_M_torque_circulation_1000rpm_10Nm.csv

teeth_break_and_bearing_inner_M_torque_circulation_1000rpm_20Nm.csv

teeth_break_and_bearing_outer_H_speed_circulation_10Nm-1000rpm.csv

teeth_break_and_bearing_outer_H_speed_circulation_10Nm-2000rpm.csv

teeth_break_and_bearing_outer_H_torque_circulation_1000rpm_10Nm.csv

teeth_break_and_bearing_outer_H_torque_circulation_1000rpm_20Nm.csv

teeth_break_and_bearing_outer_L_speed_circulation_10Nm-1000rpm.csv

teeth_break_and_bearing_outer_L_speed_circulation_10Nm-2000rpm.csv

teeth_break_and_bearing_outer_L_torque_circulation_1000rpm_10Nm.csv

teeth_break_and_bearing_outer_L_torque_circulation_1000rpm_20Nm.csv

teeth_break_and_bearing_outer_M_speed_circulation_10Nm-1000rpm.csv

teeth_break_and_bearing_outer_M_speed_circulation_10Nm-2000rpm.csv

teeth_break_and_bearing_outer_M_torque_circulation_1000rpm_10Nm.csv

teeth_break_and_bearing_outer_M_torque_circulation_1000rpm_20Nm.csv

teeth_break_H_speed_circulation_10Nm-1000rpm.csv

teeth_break_H_speed_circulation_10Nm-2000rpm.csv

teeth_break_H_torque_circulation_1000rpm_10Nm.csv

teeth_break_H_torque_circulation_1000rpm_20Nm.csv

teeth_break_L_speed_circulation_10Nm-1000rpm.csv

teeth_break_L_speed_circulation_10Nm-2000rpm.csv

teeth_break_L_torque_circulation_1000rpm_10Nm.csv

teeth_break_L_torque_circulation_1000rpm_20Nm.csv

teeth_break_M_speed_circulation_10Nm-1000rpm.csv

teeth_break_M_speed_circulation_10Nm-2000rpm.csv

teeth_break_M_torque_circulation_1000rpm_10Nm.csv

teeth_break_M_torque_circulation_1000rpm_20Nm.csv

teeth_crack_H_speed_circulation_10Nm-1000rpm.csv

teeth_crack_H_speed_circulation_10Nm-2000rpm.csv

teeth_crack_H_torque_circulation_1000rpm_10Nm.csv

teeth_crack_H_torque_circulation_1000rpm_20Nm.csv

teeth_crack_L_speed_circulation_10Nm-1000rpm.csv 


Week 12 to 13: Loading and EDA

Our focus in Weeks 12 and 13 was to:

  • Load the selected files efficiently.

  • Develop a strategy to segment signals into 1-second windows with 50% overlap to increase the sample size while preserving fault-related patterns.

  • Perform exploratory data analysis (EDA) such as correlation heatmaps, class distribution plots, and basic time-domain plots for verification.

This overlapping window strategy was essential to maximize data utility from lengthy recordings and introduce variability in training samples.

% ===============================
% Feature Extraction with Overlapping Segments (50%)
% Gearbox Fault Dataset
% ===============================

clear; clc;

% === Configuration ===
dataDir = 'C:/Users/maray/OneDrive/Desktop/MM466/Project/MCC5-THU gearbox fault diagnosis datasets';
fileList = dir(fullfile(dataDir, '*.csv'));
Fs = 12800; % Sampling frequency in Hz
segment_duration = 1; % seconds
segment_samples = Fs * segment_duration;
overlap = 0.5;
step = segment_samples * (1 - overlap);

% === Output Initialization ===
allFeatures = [];
allLabels = {};

% === Loop Through Files ===
for i = 1:length(fileList)
    filename = fileList(i).name;
    filepath = fullfile(dataDir, filename);
    fprintf('Processing %s (%d of %d)\n', filename, i, length(fileList));

    % Load data (use 1 axis for now, e.g., motor vibration x-axis = col 3)
    data = readmatrix(filepath);
    signal = data(:, 3); % motor vibration x
    total_samples = length(signal);

    % Sliding window segmentation
    seg_starts = 1:step:(total_samples - segment_samples + 1);

    for idx = 1:length(seg_starts)
        seg_start = round(seg_starts(idx));
        seg_end = seg_start + segment_samples - 1;
        segment = signal(seg_start:seg_end);

        % === Time-domain Features ===
        rms_val = rms(segment);
        std_val = std(segment);
        kurt_val = kurtosis(segment);
        crest_val = max(abs(segment)) / rms_val;
        p2p_val = peak2peak(segment);
        skew_val = skewness(segment);

        % === Frequency-domain Features ===
        Y = abs(fft(segment));
        Y = Y(1:floor(end/2));
        f = linspace(0, Fs/2, length(Y));
        spectralCentroid = sum(f .* Y') / sum(Y);
        p = Y / sum(Y);
        spectralEntropy = -sum(p .* log2(p + eps));

        % Combine features
        features = [rms_val, std_val, kurt_val, crest_val, p2p_val, skew_val, spectralCentroid, spectralEntropy];
        allFeatures = [allFeatures; features];

        % === Assign Label ===
        % Compound-fault names are checked before 'teeth_break' so they are not mislabelled
        if contains(filename, 'health')
            label = 'healthy';
        elseif contains(filename, 'teeth_crack')
            label = 'teeth crack';
        elseif contains(filename, 'gear_pitting')
            label = 'gear pitting';
        elseif contains(filename, 'gear_wear')
            label = 'gear wear';
        elseif contains(filename, 'miss_teeth')
            label = 'miss teeth';
        elseif contains(filename, 'teeth_break_and_bearing_inner')
            label = 'compound inner';
        elseif contains(filename, 'teeth_break_and_bearing_outer')
            label = 'compound outer';
        elseif contains(filename, 'teeth_break')
            label = 'teeth break';
        else
            label = 'unknown';
        end

        allLabels{end+1,1} = label;
    end
end

% === Save Dataset ===
featureNames = {'RMS','STD','Kurtosis','CrestFactor','Peak2Peak','Skewness','SpectralCentroid','SpectralEntropy'};
T = array2table(allFeatures, 'VariableNames', featureNames);
T.Label = categorical(allLabels);

save('gearbox_feature_dataset_overlap.mat', 'T');
writetable(T, 'gearbox_feature_dataset_overlap.csv');

fprintf('\n✅ Feature extraction with 50%% overlap complete.\n');
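
To see how much the 50% overlap in the script above increases the sample count, the window arithmetic can be worked through on its own. This is a small sketch that assumes a file has exactly 768,000 samples (60 s at 12.8 kHz); real files may differ slightly.

```matlab
% Number of 1-second windows produced per file with 50% overlap.
% Assumption: a file has exactly 768,000 samples (60 s at 12.8 kHz).
Fs              = 12800;
total_samples   = 768000;
segment_samples = Fs * 1;
step            = segment_samples * (1 - 0.5);    % 6400 samples between window starts

seg_starts     = 1:step:(total_samples - segment_samples + 1);
nSegmentsFile  = numel(seg_starts);               % 119 windows per file
nSegmentsTotal = nSegmentsFile * 96;              % 11,424 windows across 96 files
```

Under the same assumption, non-overlapping windows would give only 60 segments per file, so the overlap roughly doubles the number of training samples.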

Below are some EDA plots that prior studies suggest are informative for vibration signals such as those in our dataset.

For each fault type and severity level, we explored the data using time-series vibration plots, FFT spectra, spectrograms, and feature-based signal plots:

✅ 1. Time-Domain Plot
What it shows:
This plot displays raw vibration signal amplitude (in g) over time (in seconds).

Purpose:
To visually assess how the vibration signal varies over time for different fault conditions. For instance:

Sudden spikes or irregular patterns may indicate gear impacts or missing teeth.

Healthy signals usually have consistent, smooth waveforms.

What to look for:

Consistent patterns → healthy operation.

High peaks or irregularities → mechanical faults.

✅ 2. FFT Plot (Frequency Domain)
What it shows:
The Fast Fourier Transform (FFT) converts the time signal into the frequency domain. This plot shows the amplitude of different frequencies present in the signal.

Purpose:
To identify dominant frequency components associated with specific fault types (e.g., gear mesh frequencies, bearing defect frequencies).

What to look for:

Presence of sharp peaks at characteristic fault frequencies.

Broader frequency content in faulty signals due to impacts and wear.
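
As an illustration of how such a plot can be produced, here is a minimal sketch of a single-sided amplitude spectrum. It uses a synthetic signal rather than project data; the 500 Hz tone and the noise level are arbitrary choices for demonstration.

```matlab
% Single-sided amplitude spectrum of a synthetic test signal.
% The 500 Hz tone and noise level are illustrative, not taken from the dataset.
rng(0);                           % reproducible noise
Fs = 12800;                       % sampling frequency used in this project (Hz)
N  = Fs;                          % one 1-second segment
t  = (0:N-1)' / Fs;
x  = sin(2*pi*500*t) + 0.1*randn(N, 1);

X  = fft(x);
P2 = abs(X) / N;                  % two-sided amplitude
P1 = P2(1:N/2+1);                 % keep DC up to Nyquist
P1(2:end-1) = 2 * P1(2:end-1);    % fold negative-frequency content onto the positive side
f  = Fs * (0:N/2)' / N;           % frequency axis in Hz (1 Hz resolution here)

[~, iPeak] = max(P1);
peakFreq = f(iPeak);              % location of the dominant peak

plot(f, P1);
xlabel('Frequency (Hz)'); ylabel('Amplitude');
```

With a 1-second segment the frequency resolution is 1 Hz, so the dominant peak lands on the bin of the injected tone.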

✅ 3. Spectrogram
What it shows:
Spectrograms provide a time-varying view of the signal’s frequency content (time vs frequency vs amplitude in color).

Purpose:
To analyze how frequency components evolve over time — useful for non-stationary signals like transient faults.

What to look for:

Color intensity represents signal strength at that frequency and time.

Faults often introduce time-localized frequency bursts or dense energy patches.
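
A short sketch of how a time-localized burst shows up in a spectrogram is given below. The signal is synthetic (background noise plus a 50 ms burst at 2 kHz), and the window length, overlap, and FFT size are illustrative choices, not the settings used for our plots. It needs the Signal Processing Toolbox.

```matlab
% Spectrogram of a synthetic signal containing a short 2 kHz burst.
% All signal and window parameters here are illustrative assumptions.
rng(0);
Fs = 12800;
t  = (0:Fs-1)' / Fs;
x  = 0.05 * randn(Fs, 1);                        % background noise
burst = (t >= 0.5 & t < 0.55);                   % 50 ms transient
x(burst) = x(burst) + sin(2*pi*2000*t(burst));

win      = hann(256);                            % analysis window
noverlap = 128;                                  % 50% overlap between windows
nfft     = 512;

[S, F, T] = spectrogram(x, win, noverlap, nfft, Fs);
P = abs(S).^2;                                   % power at each (frequency, time) cell

[~, iF] = min(abs(F - 2000));                    % row closest to 2 kHz
[~, iT] = max(P(iF, :));
burstTime = T(iT);                               % time at which the 2 kHz energy peaks

spectrogram(x, win, noverlap, nfft, Fs, 'yaxis'); % draw the plot
```

The burst appears as a bright patch near 2 kHz around 0.5 s, which is the kind of time-localized energy described above.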

✅ 4. Feature Signal Plots (RMS, STD, Kurtosis, etc.)
What they show:
Each feature is computed on a 1-second segment and plotted over segments (segment number on X-axis vs feature value).

Purpose:
To observe statistical trends in signal behavior — these features serve as inputs to the machine learning model.

Interpretation:

RMS: Measures overall signal energy; typically higher in faulty cases.

STD: Indicates variability; may increase with irregular vibrations.

Kurtosis: Sensitive to sharp spikes; high in cases with impacts (e.g., teeth break).

Crest Factor: Ratio of peak to RMS; good for detecting transient faults.

Peak-to-Peak: High in abrupt or severe faults.

Skewness: Indicates asymmetry of the amplitude distribution; useful when a fault makes the vibration stronger in one direction.

Week 13 to 14: Feature Extraction Strategy

From Week 13 through 14, we implemented a hybrid feature extraction approach, combining:

  • Time-domain features: RMS, STD, Kurtosis, Crest Factor, Skewness, Peak-to-Peak.

  • Frequency-domain features: Spectral Centroid, Spectral Entropy.

  • Wavelet-based energy features: MODWT using Symlet-4 wavelet over 5 levels.

These were applied to all 8 channels per file, giving us a rich, high-dimensional feature set. The multi-domain fusion aimed to capture both transient and frequency-based fault signatures.
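
The wavelet energy step can be sketched as follows for a single segment. This is a minimal example on random data, assuming one energy value per MODWT sub-band; it requires the Wavelet Toolbox and is not the exact code we ran. Under that assumption each channel yields 6 time-domain + 2 frequency-domain + 6 wavelet features = 14, and 14 × 8 channels = 112, which matches the feature count reported in the next section.

```matlab
% Wavelet energy features for one segment (requires Wavelet Toolbox).
% Assumption: one energy value per MODWT sub-band.
rng(0);
x = randn(12800, 1);                  % stand-in for one 1-second segment

nLevels = 5;
w = modwt(x, 'sym4', nLevels);        % (nLevels+1)-by-N: 5 detail rows + 1 scaling row

bandEnergy = sum(w.^2, 2);            % energy in each sub-band
relEnergy  = bandEnergy / sum(bandEnergy);   % normalised so the values sum to 1
```

Using relative energies makes the features less sensitive to overall signal amplitude; absolute energies could be kept instead if amplitude is itself informative.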

All extracted features were standardized with Z-score normalization (zero mean and unit variance per feature), so that features on different scales contribute comparably during model training.


Week 15 to 16: Dimensionality Reduction and Model Training

With over 100 features, dimensionality became a concern. We used Principal Component Analysis (PCA) to reduce dimensionality while preserving 95% of the variance, which cut the feature count from 112 to 28. This step improved training time and reduced noise and correlation among features.
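
A minimal sketch of Z-score normalization followed by PCA with a 95% variance threshold is shown below. It runs on synthetic correlated features rather than our dataset, and the variable names are illustrative; it needs the Statistics and Machine Learning Toolbox.

```matlab
% Z-score normalisation followed by PCA keeping 95% of the variance.
% Synthetic data: 10 correlated features driven by 3 underlying factors.
rng(0);
latent = randn(500, 3);
X = latent * randn(3, 10) + 0.05 * randn(500, 10);

[Xz, mu, sigma] = zscore(X);                  % zero mean, unit variance per feature
[coeff, score, ~, ~, explained] = pca(Xz);    % explained is in percent

k = find(cumsum(explained) >= 95, 1);         % smallest number of components reaching 95%
Xreduced = score(:, 1:k);

% New samples must reuse the same mu, sigma and coeff:
% XnewReduced = ((Xnew - mu) ./ sigma) * coeff(:, 1:k);
```

Fitting `mu`, `sigma`, and `coeff` on the training portion only and then applying them to validation and test data avoids leaking test statistics into the model.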

We then performed a stratified train-validation-test split (70/15/15) to preserve class balance.
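
One way to obtain a stratified 70/15/15 split is two successive hold-out partitions with `cvpartition`, which stratifies by class when given the labels. The sketch below uses made-up labels and illustrative variable names; it is not necessarily how the split was produced in the project.

```matlab
% Stratified 70/15/15 split using two hold-out partitions.
% The label vector is synthetic and only serves as an example.
rng(0);
labels = categorical([repmat("healthy", 100, 1); ...
                      repmat("gear wear", 60, 1); ...
                      repmat("teeth crack", 40, 1)]);

cv1      = cvpartition(labels, 'HoldOut', 0.30);   % 70% train, 30% held out
idxTrain = find(training(cv1));
idxHold  = find(test(cv1));

cv2     = cvpartition(labels(idxHold), 'HoldOut', 0.50);  % split held-out part in half
idxVal  = idxHold(training(cv2));
idxTest = idxHold(test(cv2));
```

Each of the three index sets keeps approximately the same class proportions as the full label vector.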

We trained a range of models using the Classification Learner app:

  • SVMs (linear, quadratic, cubic, Gaussian)

  • Ensemble models (Bagged Trees, Boosted Trees)

  • Decision Trees
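
For readers who prefer scripts to the app, a roughly comparable multiclass cubic SVM can be trained from the command line. The sketch below is an assumption about an approximately equivalent configuration, not code exported from our Classification Learner session, and it uses the `fisheriris` example data that ships with the Statistics and Machine Learning Toolbox instead of the gearbox features.

```matlab
% Multiclass cubic SVM trained programmatically (illustrative configuration).
load fisheriris                      % example data: meas (features), species (labels)
rng(0);

t = templateSVM('KernelFunction', 'polynomial', ...
                'PolynomialOrder', 3, ...        % cubic kernel
                'KernelScale', 'auto', ...
                'Standardize', true);

mdl   = fitcecoc(meas, species, 'Learners', t, 'Coding', 'onevsone');
cvMdl = crossval(mdl, 'KFold', 5);               % 5-fold cross-validation
valAccuracy = 1 - kfoldLoss(cvMdl);              % fraction classified correctly

fprintf('Cross-validated accuracy: %.3f\n', valAccuracy);
```

Replacing `meas` and `species` with the PCA-reduced features and fault labels would apply the same recipe to the gearbox data.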

Models were evaluated using accuracy, precision, recall, F1-score, and confusion matrices. The top-performing model achieved ~92.9% accuracy during training. Final evaluation was performed on an unseen test set, where the same model again performed best and reached 93.19% accuracy. The slightly higher test accuracy may be because the test set contained instances that were more distinguishable or more representative of the patterns learned during training. Additionally, the stratified sampling used during dataset splitting may have yielded a test set with less noise or lower intra-class variation, making those samples easier to classify.
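
The per-class metrics follow directly from a confusion matrix. The sketch below works through the arithmetic on a small three-class matrix; the counts are made up for illustration and are not project results.

```matlab
% Accuracy, precision, recall and F1 from a confusion matrix.
% Rows = true class, columns = predicted class (same convention as confusionmat).
% The counts are made up for illustration; they are not project results.
C = [50  2  0;
      3 40  5;
      0  4 46];

accuracy  = sum(diag(C)) / sum(C(:));            % correct predictions / all predictions
precision = diag(C) ./ sum(C, 1)';               % correct / everything predicted as that class
recall    = diag(C) ./ sum(C, 2);                % correct / everything truly in that class
f1        = 2 * (precision .* recall) ./ (precision + recall);

disp(table(precision, recall, f1));
```

A class with noticeably lower precision than recall is being over-predicted, which is a useful way to read per-class results like the ones listed below.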




Training
SVM Cubic Model (Best)

SVM Quadratic (2nd Best)

SVM Gaussian (3rd Best)

Test
Model 1
Accuracy: 93.19%
Class: compound inner | Precision: 0.94 | Recall: 0.96 | F1-Score: 0.95
Class: compound outer | Precision: 0.96 | Recall: 0.96 | F1-Score: 0.96
Class: gear pitting | Precision: 0.98 | Recall: 0.97 | F1-Score: 0.98
Class: gear wear | Precision: 0.82 | Recall: 0.90 | F1-Score: 0.85
Class: healthy | Precision: 0.95 | Recall: 0.93 | F1-Score: 0.94
Class: miss teeth | Precision: 0.99 | Recall: 0.96 | F1-Score: 0.97
Class: teeth break | Precision: 0.90 | Recall: 0.89 | F1-Score: 0.90
Class: teeth crack | Precision: 0.92 | Recall: 0.88 | F1-Score: 0.90

Model 2
Accuracy: 92.02%
Class: compound inner | Precision: 0.94 | Recall: 0.95 | F1-Score: 0.95
Class: compound outer | Precision: 0.97 | Recall: 0.95 | F1-Score: 0.96
Class: gear pitting | Precision: 0.95 | Recall: 0.96 | F1-Score: 0.96
Class: gear wear | Precision: 0.80 | Recall: 0.87 | F1-Score: 0.84
Class: healthy | Precision: 0.92 | Recall: 0.88 | F1-Score: 0.90
Class: miss teeth | Precision: 0.99 | Recall: 0.97 | F1-Score: 0.98
Class: teeth break | Precision: 0.89 | Recall: 0.89 | F1-Score: 0.89
Class: teeth crack | Precision: 0.92 | Recall: 0.89 | F1-Score: 0.90

Model 3
Accuracy: 87.62%
Class: compound inner | Precision: 0.96 | Recall: 0.89 | F1-Score: 0.92
Class: compound outer | Precision: 0.85 | Recall: 0.93 | F1-Score: 0.89
Class: gear pitting | Precision: 0.99 | Recall: 0.92 | F1-Score: 0.95
Class: gear wear | Precision: 0.82 | Recall: 0.80 | F1-Score: 0.81
Class: healthy | Precision: 0.95 | Recall: 0.85 | F1-Score: 0.90
Class: miss teeth | Precision: 0.72 | Recall: 1.00 | F1-Score: 0.84
Class: teeth break | Precision: 0.89 | Recall: 0.78 | F1-Score: 0.83
Class: teeth crack | Precision: 0.90 | Recall: 0.84 | F1-Score: 0.87

Justification of Our Approach

Each step in our pipeline was motivated by practical challenges and technical reasoning:

  • File reduction addressed resource limitations.

  • Overlapping windows enriched training data diversity.

  • Multi-domain feature extraction ensured a wide capture of fault characteristics.

  • PCA tackled high-dimensional noise and redundancy.

  • Stratified splitting guaranteed fair model evaluation.

Taken together, these steps produced a fault classification model that reached 93.19% accuracy on the unseen test set across eight classes.


🔭 Future Work

Looking ahead, several key directions can be taken to enhance the reliability and practicality of the fault diagnosis system:

  1. Utilizing the Full Dataset
    The current study focused on a reduced subset of the original dataset due to processing and memory constraints. Future work will involve leveraging the entire dataset, which includes all 240 files, to better reflect real-world variability and operating conditions. This will provide a more comprehensive training base for the models.

  2. Expanding Class Definitions Based on Fault Severity
    Rather than generalizing fault types into single classes, future iterations will include sub-categorization based on fault severity (e.g., low, medium, high damage levels). This multi-class expansion can improve granularity and allow the model to differentiate not just between fault types, but also their progression stages.

  3. Hyperparameter Tuning
    Further improvement in model performance is expected through systematic hyperparameter optimization. Techniques such as grid search or Bayesian optimization can be applied to fine-tune classifiers like SVMs and ensemble models for better generalization.

  4. Additional Directions
    Once these foundations are addressed, the project can be extended to explore real-time deployment, deep learning architectures (e.g., CNNs), and cross-condition generalization using transfer learning.
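
As a starting point for the severity-based classes in item 2, the severity letter can be read from the file names, which embed `_H_`, `_M_`, or `_L_` for graded faults. The sketch below is illustrative: the `'none'` label for files without a severity letter (such as the healthy and missing-tooth files) is our own placeholder, not part of the dataset.

```matlab
% Extract the fault severity letter (H, M or L) from dataset file names.
% 'none' is a placeholder we chose for files that carry no severity letter.
names = {'gear_wear_H_torque_circulation_1000rpm_20Nm.csv', ...
         'teeth_crack_L_speed_circulation_10Nm-2000rpm.csv', ...
         'health_speed_circulation_10Nm-1000rpm.csv'};

severity = cell(size(names));
for k = 1:numel(names)
    tok = regexp(names{k}, '_(H|M|L)_', 'tokens', 'once');  % letter between underscores
    if isempty(tok)
        severity{k} = 'none';
    else
        severity{k} = tok{1};
    end
end

disp(severity);
```

Combining this value with the existing fault-type label (for example `'gear wear (H)'`) would give one class per fault type and severity level.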


Walkthrough process of MATLAB
Video 1: 



Video 2:
