Software engineers collect execution traces (a lower-level form of logs) from deployed software systems to debug faulty behaviour. Automated failure reporting is one example of this practice: Firefox sends crash reports to central servers, and DB2 collects crashing and non-crashing traces to debug performance issues and crashes. Because the number of collected traces is overwhelming, it is useful to determine, before any debugging starts, whether a trace has captured faulty behaviour. In other words, we can train classification algorithms on historical passing (normal) and failing traces and use them to label newly collected traces. An interesting question then arises: which classification algorithm is best? To answer this question, I performed an experiment on software traces; the same procedure can be applied to other datasets as well.

 

To perform my experiment, I collected open source programs (UNIX utilities) and their faulty versions from http://sir.unl.edu. I then executed the accompanying test cases on both the faulty and the normal versions to collect failing and passing traces. For each program I collected two types of traces: user space (function call) traces and kernel space (Linux system call) traces. This resulted in two datasets per program, one for each trace type. The details of the programs are shown in Table 1.
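As an illustration of the collection step, kernel space traces can be gathered by running every test case under strace and labelling the resulting trace according to the test outcome. The sketch below (Python) is not the tooling used in the experiment; the directory layout, binaries, and test cases are made up, and pass/fail is judged by the exit code alone, whereas the SIR test suites compare program output against expected output.

```python
import subprocess
from pathlib import Path

TRACE_DIR = Path("traces")
TRACE_DIR.mkdir(exist_ok=True)

def run_test_under_strace(binary, args, trace_file):
    """Run one test case under strace, writing the raw system call trace to trace_file."""
    result = subprocess.run(
        ["strace", "-f", "-o", str(trace_file), binary, *args],  # -f follows child processes
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # Simplification: a non-zero exit code is treated as a failing run.
    return result.returncode == 0

# Hypothetical layout: one directory per program version and a couple of sample test cases.
for version in ["gzip-correct", "gzip-fault1"]:
    for i, args in enumerate([["-c", "input1.txt"], ["-d", "input2.gz"]]):
        trace_file = TRACE_DIR / f"{version}-test{i}.strace"
        passed = run_test_under_strace(f"./{version}/gzip", args, trace_file)
        print(trace_file, "pass" if passed else "fail")
```

User space (function call) traces can be collected analogously, for example by building the subject programs with instrumentation (such as gcc's -finstrument-functions) and logging every function entry.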

 

Table 1. Characteristics of the subject programs (UNIX utilities)
(LOC excludes blank lines and comments. Releases used: Flex 2.5.1; Grep 2.4; Gzip 1.1.2; Sed 4.0.7.)
Prog. LOC   # Functions  # Faults  # Test Cases  # Passing Traces  # Failed Traces
Flex  9724  167          20        567           566               545
Grep  9041  149          18        809           799               710
Gzip  4032  88           16        214           214               204
Sed   4735  115          6         370           366               166

 

To train machine learning algorithms on the trace data, I transformed each trace into a feature vector of function (or system call) names and their likelihood of occurrence in the trace, and labeled each vector as a passing or a failing trace according to whether the corresponding test case passed or failed. I then trained six classification algorithms on these vectors using Weka: the C4.5 decision tree, naïve Bayes classifier, multilayer perceptron (neural network), Bayesian belief network, support vector machine, and hidden Markov model. The algorithms were trained on user space and kernel space traces separately. The results are shown in Table 2 and Table 3, where TP is the true positive rate, FP is the false positive rate, and AUC is the area under the ROC curve.
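The experiment itself used Weka, but the pipeline can be sketched in Python with scikit-learn as a rough analogue. Note the assumptions: DecisionTreeClassifier stands in for C4.5 (J48 in Weka), the Bayesian belief network and hidden Markov model have no direct scikit-learn counterpart and are omitted, and the trace file naming and parsing below are made up.

```python
from collections import Counter
from pathlib import Path

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for Weka's C4.5 (J48)


def trace_to_features(trace_file):
    """Turn one trace into a {event name: relative frequency} dictionary."""
    counts = Counter()
    for line in open(trace_file):
        if "(" not in line:
            continue  # skip status lines such as strace's "+++ exited with 0 +++"
        head = line.split("(", 1)[0].split()
        if head:
            counts[head[-1]] += 1  # last token before '(' is the call name
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Assumed naming convention: the pass/fail label is encoded in the file name.
trace_files = sorted(Path("traces").glob("*.strace"))
X_dicts = [trace_to_features(f) for f in trace_files]
y = [1 if "fail" in f.name else 0 for f in trace_files]

X = DictVectorizer(sparse=False).fit_transform(X_dicts)

classifiers = {
    "decision tree (C4.5-like)": DecisionTreeClassifier(),
    "naive Bayes": GaussianNB(),
    "multilayer perceptron": MLPClassifier(max_iter=1000),
    "SVM": SVC(),
}

for name, clf in classifiers.items():
    # ten-fold cross-validated area under the ROC curve
    auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```

With the roc_auc scorer, ten-fold cross validation yields the kind of AUC values reported in Tables 2 and 3; TP and FP rates can be derived from the per-fold confusion matrices.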

 

Table 2. Results of the classification algorithms on user space (function call) traces of the subject programs (NB = Naïve Bayes; BBN = Bayesian Belief Network; MP = Multilayer Perceptron; SVM = Support Vector Machine; HMM = Hidden Markov Model)

Prog. C4.5                 NB                   BBN                  MP                   SVM                  HMM
      TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC
Flex  0.924  0.099  0.925  0.159  0.053  0.609  0.371  0.145  0.675  0.981  0.804  0.646  0.721  0.323  0.699  0.706  0      0.416
Grep  0.96   0.044  0.992  0.921  0.227  0.919  0.96   0.044  0.992  0.95   0.113  0.972  0.935  0.075  0.93   0.122  0      0.428
Gzip  0.967  0.02   0.974  0.698  0.594  0.663  0.939  0.035  0.989  0.528  0.144  0.819  0.962  0.015  0.974  0.727  0.258  0.471
Sed   0.849  0.693  0.714  0.321  0.151  0.630  0.61   0.331  0.694  0.819  0.711  0.72   0.808  0.759  0.524  0.964  0.917  0.651

 

 

Table 3. Results of the classification algorithms on kernel space traces of the subject programs (NB = Naïve Bayes; BBN = Bayesian Belief Network; MP = Multilayer Perceptron; SVM = Support Vector Machine; HMM = Hidden Markov Model)

Prog. C4.5                 NB                   BBN                  MP                   SVM                  HMM
      TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC
Flex  1.00   0.002  0.998  1.00   0.002  0.999  0.993  0.004  0.999  0.998  0.002  1.00   0.998  0.002  0.998  1.00   0.002  0.996
Grep  0.977  0.027  0.981  1.00   0.114  0.942  0.994  0.082  0.996  0.961  0.075  0.992  0.955  0.037  0.959  0.87   0.008  0.913
Gzip  0.953  0.034  0.953  0.682  0.338  0.717  0.883  0.059  0.950  0.963  0.049  0.972  0.963  0.039  0.962  0.256  0.079  0.528
Sed   0.918  0.337  0.813  0.492  0.133  0.727  0.91   0.313  0.872  0.943  0.277  0.863  0.929  0.289  0.82   0.424  0.303  0.548

 

It can be observed that some algorithms perform better than others. For example, C4.5 outperforms the others on user space traces, whereas the differences are much smaller on kernel space traces. To verify this statistically, a Wilcoxon signed rank test was performed between the AUC values obtained on kernel space traces and those obtained on user space traces. The test showed that kernel space traces yield significantly better classification than user space traces (p = 0.001). However, a similar pairwise test among the six classification algorithms, using the AUC values of both kernel space and user space traces, did not show a significant difference between any pair of classifiers: none of the p values in Table 4 below remains significant once the fifteen pairwise comparisons are corrected for.
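For reference, the kernel space versus user space comparison can be redone directly from the AUC values in Tables 2 and 3 with SciPy. The exact pairing and test options used in the original analysis are not given here, so this is only a sketch and its p value need not match the reported 0.001 exactly.

```python
from scipy.stats import wilcoxon

# AUC values taken from Table 2 (user space) and Table 3 (kernel space),
# ordered by classifier (C4.5, NB, BBN, MP, SVM, HMM) and, within each
# classifier, by program (Flex, Grep, Gzip, Sed).
user_auc = [
    0.925, 0.992, 0.974, 0.714,   # C4.5
    0.609, 0.919, 0.663, 0.630,   # NB
    0.675, 0.992, 0.989, 0.694,   # BBN
    0.646, 0.972, 0.819, 0.720,   # MP
    0.699, 0.930, 0.974, 0.524,   # SVM
    0.416, 0.428, 0.471, 0.651,   # HMM
]
kernel_auc = [
    0.998, 0.981, 0.953, 0.813,   # C4.5
    0.999, 0.942, 0.717, 0.727,   # NB
    0.999, 0.996, 0.950, 0.872,   # BBN
    1.000, 0.992, 0.972, 0.863,   # MP
    0.998, 0.959, 0.962, 0.820,   # SVM
    0.996, 0.913, 0.528, 0.548,   # HMM
]

# One-sided test: are the kernel space AUC values higher than the user space ones?
stat, p = wilcoxon(kernel_auc, user_auc, alternative="greater")
print(f"W = {stat}, p = {p:.4f}")
```

The pairwise comparison of classifiers in Table 4 can be computed in the same way, presumably by pairing, for each pair of classifiers, their eight AUC values (four programs times two trace types).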

 

Table 4. P values obtained using the Wilcoxon signed rank test for pairwise comparison of the classifiers (each cell compares the row classifier with the column classifier).

       HMM     SVM     MP      BBN     NB
C4.5   0.012   0.116   0.779   1.00    0.017
NB     0.025   0.123   0.012   0.018
BBN    0.012   0.093   0.401
MP     0.012   0.484
SVM    0.036

 

In short, there is no statistically significant difference among the classification algorithms on these software traces, so any of them can be selected; the differences visible in the tables are most likely due to chance.

More details about this experiment can be found in my research paper, "On the Comparison of User Space and Kernel Space Traces in Identification of Software Anomalies".