Software engineers collect execution traces (a lower-level form of logs) from deployed software systems to debug faulty behaviour. Automated failure reporting is one example of this practice: Firefox sends crash reports to central servers, and DB2 collects crashing and non-crashing traces to debug performance issues and crashes. Because the number of collected traces is overwhelming, it is useful to determine, before any debugging starts, whether a trace has captured faulty behaviour. In other words, we can train classification algorithms on historical passing (normal) and failing traces and use them to label newly collected traces. An interesting question then arises: which classification algorithm is best? To answer this question, I performed an experiment on software traces; the same procedure can be applied to other datasets as well.

 

To perform my experiment, I collected open source programs (UNIX utilities) and their faulty versions from http://sir.unl.edu. I then executed the accompanying test cases on both the faulty and the normal versions to collect failing and passing traces. For each program I collected two types of traces: user space (function call) traces and kernel space (Linux system call) traces. This resulted in two datasets per program, one for each trace type. The details of the programs are shown in Table 1.
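As an illustration of the collection step, kernel space traces can be gathered by running every test case under strace and labelling the resulting trace according to the test outcome. The sketch below (Python) is not the tooling used in the experiment; the directory layout, binaries, and test cases are made up, and pass/fail is judged by the exit code alone, whereas the SIR test suites compare program output against expected output.

```python
import subprocess
from pathlib import Path

TRACE_DIR = Path("traces")
TRACE_DIR.mkdir(exist_ok=True)

def run_test_under_strace(binary, args, trace_file):
    """Run one test case under strace, writing the raw system call trace to trace_file."""
    result = subprocess.run(
        ["strace", "-f", "-o", str(trace_file), binary, *args],  # -f follows child processes
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # Simplification: a non-zero exit code is treated as a failing run.
    return result.returncode == 0

# Hypothetical layout: one directory per program version and a couple of sample test cases.
for version in ["gzip-correct", "gzip-fault1"]:
    for i, args in enumerate([["-c", "input1.txt"], ["-d", "input2.gz"]]):
        trace_file = TRACE_DIR / f"{version}-test{i}.strace"
        passed = run_test_under_strace(f"./{version}/gzip", args, trace_file)
        print(trace_file, "pass" if passed else "fail")
```

User space (function call) traces can be collected analogously, for example by building the subject programs with instrumentation (such as gcc's -finstrument-functions) and logging every function entry.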

 

Table 1. Characteristics of the subject programs (UNIX utilities)
(LOC excludes blank lines and comments. Releases used: Flex 2.5.1; Grep 2.4; Gzip 1.1.2; Sed 4.0.7.)
Prog. LOC   # Functions  # Faults  # Test Cases  # Passing Traces  # Failed Traces
Flex  9724  167          20        567           566               545
Grep  9041  149          18        809           799               710
Gzip  4032  88           16        214           214               204
Sed   4735  115          6         370           366               166

 

To train machine learning algorithms on the trace data, I transformed each trace into a feature vector of function (or system call) names and their likelihood of occurrence in the trace, and labeled each vector as a passing or a failing trace according to whether the corresponding test case passed or failed. I then trained six classification algorithms on these vectors using Weka: the C4.5 decision tree, naïve Bayes classifier, multilayer perceptron (neural network), Bayesian belief network, support vector machine, and hidden Markov model. The algorithms were trained on user space and kernel space traces separately. The results are shown in Table 2 and Table 3, where TP is the true positive rate, FP is the false positive rate, and AUC is the area under the ROC curve.
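The experiment itself used Weka, but the pipeline can be sketched in Python with scikit-learn as a rough analogue. Note the assumptions: DecisionTreeClassifier stands in for C4.5 (J48 in Weka), the Bayesian belief network and hidden Markov model have no direct scikit-learn counterpart and are omitted, and the trace file naming and parsing below are made up.

```python
from collections import Counter
from pathlib import Path

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for Weka's C4.5 (J48)


def trace_to_features(trace_file):
    """Turn one trace into a {event name: relative frequency} dictionary."""
    counts = Counter()
    for line in open(trace_file):
        if "(" not in line:
            continue  # skip status lines such as strace's "+++ exited with 0 +++"
        head = line.split("(", 1)[0].split()
        if head:
            counts[head[-1]] += 1  # last token before '(' is the call name
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Assumed naming convention: the pass/fail label is encoded in the file name.
trace_files = sorted(Path("traces").glob("*.strace"))
X_dicts = [trace_to_features(f) for f in trace_files]
y = [1 if "fail" in f.name else 0 for f in trace_files]

X = DictVectorizer(sparse=False).fit_transform(X_dicts)

classifiers = {
    "decision tree (C4.5-like)": DecisionTreeClassifier(),
    "naive Bayes": GaussianNB(),
    "multilayer perceptron": MLPClassifier(max_iter=1000),
    "SVM": SVC(),
}

for name, clf in classifiers.items():
    # ten-fold cross-validated area under the ROC curve
    auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```

With the roc_auc scorer, ten-fold cross validation yields the kind of AUC values reported in Tables 2 and 3; TP and FP rates can be derived from the per-fold confusion matrices.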

 

Table 2. Results of the classification algorithms on user space (function call) traces of the subject programs (NB = Naïve Bayes; BBN = Bayesian Belief Network; MP = Multilayer Perceptron; SVM = Support Vector Machine; HMM = Hidden Markov Model)

Prog. C4.5                 NB                   BBN                  MP                   SVM                  HMM
      TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC
Flex  0.924  0.099  0.925  0.159  0.053  0.609  0.371  0.145  0.675  0.981  0.804  0.646  0.721  0.323  0.699  0.706  0      0.416
Grep  0.96   0.044  0.992  0.921  0.227  0.919  0.96   0.044  0.992  0.95   0.113  0.972  0.935  0.075  0.93   0.122  0      0.428
Gzip  0.967  0.02   0.974  0.698  0.594  0.663  0.939  0.035  0.989  0.528  0.144  0.819  0.962  0.015  0.974  0.727  0.258  0.471
Sed   0.849  0.693  0.714  0.321  0.151  0.630  0.61   0.331  0.694  0.819  0.711  0.72   0.808  0.759  0.524  0.964  0.917  0.651

 

 

Table 3. Results of the classification algorithms on kernel space traces of the subject programs (NB = Naïve Bayes; BBN = Bayesian Belief Network; MP = Multilayer Perceptron; SVM = Support Vector Machine; HMM = Hidden Markov Model)

Prog. C4.5                 NB                   BBN                  MP                   SVM                  HMM
      TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC    TP     FP     AUC
Flex  1.00   0.002  0.998  1.00   0.002  0.999  0.993  0.004  0.999  0.998  0.002  1.00   0.998  0.002  0.998  1.00   0.002  0.996
Grep  0.977  0.027  0.981  1.00   0.114  0.942  0.994  0.082  0.996  0.961  0.075  0.992  0.955  0.037  0.959  0.87   0.008  0.913
Gzip  0.953  0.034  0.953  0.682  0.338  0.717  0.883  0.059  0.950  0.963  0.049  0.972  0.963  0.039  0.962  0.256  0.079  0.528
Sed   0.918  0.337  0.813  0.492  0.133  0.727  0.91   0.313  0.872  0.943  0.277  0.863  0.929  0.289  0.82   0.424  0.303  0.548

 

It can be observed that some algorithms perform better than others. For example, C4.5 outperforms the others on user space traces, whereas the differences are much smaller on kernel space traces. To verify this statistically, a Wilcoxon signed rank test was performed between the AUC values obtained on kernel space traces and those obtained on user space traces. The test showed that kernel space traces yield significantly better classification than user space traces (p = 0.001). However, a similar pairwise test among the six classification algorithms, using the AUC values of both kernel space and user space traces, did not show a significant difference between any pair of classifiers: none of the p values in Table 4 below remains significant once the fifteen pairwise comparisons are corrected for.
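For reference, the kernel space versus user space comparison can be redone directly from the AUC values in Tables 2 and 3 with SciPy. The exact pairing and test options used in the original analysis are not given here, so this is only a sketch and its p value need not match the reported 0.001 exactly.

```python
from scipy.stats import wilcoxon

# AUC values taken from Table 2 (user space) and Table 3 (kernel space),
# ordered by classifier (C4.5, NB, BBN, MP, SVM, HMM) and, within each
# classifier, by program (Flex, Grep, Gzip, Sed).
user_auc = [
    0.925, 0.992, 0.974, 0.714,   # C4.5
    0.609, 0.919, 0.663, 0.630,   # NB
    0.675, 0.992, 0.989, 0.694,   # BBN
    0.646, 0.972, 0.819, 0.720,   # MP
    0.699, 0.930, 0.974, 0.524,   # SVM
    0.416, 0.428, 0.471, 0.651,   # HMM
]
kernel_auc = [
    0.998, 0.981, 0.953, 0.813,   # C4.5
    0.999, 0.942, 0.717, 0.727,   # NB
    0.999, 0.996, 0.950, 0.872,   # BBN
    1.000, 0.992, 0.972, 0.863,   # MP
    0.998, 0.959, 0.962, 0.820,   # SVM
    0.996, 0.913, 0.528, 0.548,   # HMM
]

# One-sided test: are the kernel space AUC values higher than the user space ones?
stat, p = wilcoxon(kernel_auc, user_auc, alternative="greater")
print(f"W = {stat}, p = {p:.4f}")
```

The pairwise comparison of classifiers in Table 4 can be computed in the same way, presumably by pairing, for each pair of classifiers, their eight AUC values (four programs times two trace types).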

 

Table 4. P values obtained using the Wilcoxon signed rank test for pairwise comparison of the classifiers (each cell compares the row classifier with the column classifier).

       HMM     SVM     MP      BBN     NB
C4.5   0.012   0.116   0.779   1.00    0.017
NB     0.025   0.123   0.012   0.018
BBN    0.012   0.093   0.401
MP     0.012   0.484
SVM    0.036

 

In short, there is no statistically significant difference among the classification algorithms on these software traces, so any of them can be selected; the differences visible in the tables are most likely due to chance.

More details about this experiment can be found in my research paper, "On the Comparison of User Space and Kernel Space Traces in Identification of Software Anomalies".