라떼군 이야기

Anomaly Detection based on Compression Data using Clustering Algorithm

With the growth of the Internet, the damage and losses caused by intrusions are increasing. Attacks have become more complicated and diversified, so more active and effective countermeasures are needed. Log data should be treated as important because they record key system information and the traces of an intrusion, but their volume makes them hard to maintain and manage. Intrusion detection technology is being studied actively. However, research that addresses the log data storage problem and the intrusion detection problem at the same time has been insufficient.

The proposed method detects anomalies as follows: normal data pass through a compression stage and a distance conversion stage, a clustering algorithm is applied to the result to set the range of the normal clusters, and data that fall outside the normal clusters are defined as anomalies. The compression stage uses a variation of the logpack compression algorithm. It is applied to each log entry independently and needs no separate normalization after compression, so it suits on-line environments in which data are added in real time. The distance conversion stage uses the difference data as the distance. Because the compressed result expresses the difference without redundancy, it is suitable for use as a distance.
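The idea of a normal cluster range can be sketched as follows. This is only a minimal illustration under stated assumptions, not the thesis implementation: the logpack-based compression is not reproduced, the hypothetical `field_difference` function (a count of differing fields) stands in for the difference-based distance, and the cluster range is assumed to be the largest distance from a medoid to any normal record assigned to it. All function names are made up for the example.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple[str, ...]
Distance = Callable[[Record, Record], int]


def field_difference(a: Record, b: Record) -> int:
    """Hypothetical stand-in distance: the number of fields that differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def nearest(record: Record, medoids: Sequence[Record], dist: Distance) -> int:
    """Index of the medoid closest to the record (first one wins on ties)."""
    return min(range(len(medoids)), key=lambda i: dist(record, medoids[i]))


def k_medoids(points: Sequence[Record], k: int, dist: Distance,
              iterations: int = 20) -> List[Record]:
    """Simple alternating K-medoids; starts from the first k points (assumption)."""
    medoids = list(points[:k])
    for _ in range(iterations):
        clusters: List[List[Record]] = [[] for _ in medoids]
        for p in points:
            clusters[nearest(p, medoids, dist)].append(p)
        updated = []
        for medoid, members in zip(medoids, clusters):
            if not members:
                updated.append(medoid)  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members,
                               key=lambda c: sum(dist(c, o) for o in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids


def fit_normal_range(normal: Sequence[Record], k: int,
                     dist: Distance) -> List[Tuple[Record, int]]:
    """Cluster normal data only; return (medoid, radius) for each cluster."""
    medoids = k_medoids(normal, k, dist)
    radii = [0] * len(medoids)
    for p in normal:
        i = nearest(p, medoids, dist)
        radii[i] = max(radii[i], dist(p, medoids[i]))
    return list(zip(medoids, radii))


def is_anomaly(record: Record, model: Sequence[Tuple[Record, int]],
               dist: Distance) -> bool:
    """A record is anomalous when it lies outside every normal cluster."""
    return all(dist(record, medoid) > radius for medoid, radius in model)
```

With a model fitted on normal records only, `is_anomaly(record, model, field_difference)` returns `True` for a record that lies farther from every medoid than that cluster's radius. A real system would replace `field_difference` with the distance derived from the compressed output.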

The experiments used the KDD’99 Data Set and the DARPA 1998 Data Set, with hierarchical, K-means, and K-medoids clustering algorithms. Because the result can vary with the number of clusters generated, the number of clusters was increased from 2 to 32 during the experiments. Evaluation data were used to calculate precision and recall, and performance was assessed through accuracy and F-measure. To estimate the optimal number of clusters using the learning data only, K-fold cross validation was used. Additionally, a genetic algorithm was used to distinguish informative from non-informative fields, and a strong compression algorithm was applied to the non-informative fields in an attempt to increase the total compression rate.
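The evaluation measures named above follow the standard definitions, which can be sketched as below. The function names, the boolean labeling (`True` meaning anomaly), and the contiguous fold layout are assumptions made for illustration; the thesis does not state how its folds were formed.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def evaluate(actual: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall, accuracy and F-measure; True marks an anomaly."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # anomalies detected
    fp = sum(1 for a, p in pairs if not a and p)      # false alarms
    fn = sum(1 for a, p in pairs if a and not p)      # anomalies missed
    tn = sum(1 for a, p in pairs if not a and not p)  # normal data passed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}


def k_fold_splits(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, validation indices) for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n))
        yield training, validation
        start += size
```

One way to use these together, again as an assumption rather than the documented procedure: for each candidate number of clusters from 2 to 32, fit on the training indices of each fold, score the validation indices, and keep the cluster count with the best average score.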

The experiments showed that the proposed method is a better informative-field extraction method for detecting anomalies than conventional methods, as it produced better results. We estimated the optimal number of clusters by applying K-fold cross validation to the learning data only. Additionally, the genetic algorithm made it possible to distinguish informative from non-informative fields, and the total compression rate improved.

January 1, 2011 ∙ thesis