Machine learning classification for advanced malware detection

Fabio Di Troia

Research output: Thesis › Doctoral thesis

Abstract

This introductory document discusses topics related to malware detection via the application of machine learning algorithms. It is intended as a supplement to the published work submitted (a complete list of which can be found in Table 1) and outlines the motivation behind the experiments. The document begins with the following sections:

  • Section 2 presents a preliminary discussion of the research methodology employed.
  • Section 3 presents the background analysis of malware detection in general, and of the use of machine learning.
  • Section 4 provides a brief introduction to the most common machine learning algorithms in current use.

The remaining sections present the main body of the experimental work, which leads to the conclusions in Section 10.

  • Section 5 analyzes different initialization strategies for machine learning models, with a view to ensuring that the most effective training and testing strategy is employed. Following this, a purely dynamic approach is proposed, which results in perfect classification of the samples against benign files, and therefore provides a baseline against which the performance of subsequent static approaches can be compared.
  • Section 6 introduces the static-based tests, beginning with the challenging problem of detecting zero-day samples, i.e. malware samples for which not enough data has yet been gathered to train the machine learning models.
  • Section 7 describes the testing of several different approaches to static malware detection. During these tests, the effectiveness of these algorithms is analyzed and compared with other means of classification.
  • Section 8 proposes and compares techniques to boost the detection accuracy by combining the scores obtained from other detection algorithms, with a view to improving static classification scores and thus reaching the perfect detection obtained with dynamic features.
  • Section 9 tests the effectiveness of generic malware models by assessing the detection performance of a single model trained on several different families. The experiments are intended to introduce a more realistic scenario where a single, comprehensive machine learning model is used to detect several families. This section shows the difficulty of building a single model to detect several malware families.

Original language: English
Qualification: Doctor of Philosophy (PhD)
Awarding Institution
  • Kingston University
Supervisors/Advisors
  • Tunnicliffe, Martin, Supervisor
Publication status: Accepted/In press - May 2020
Externally published: Yes

Bibliographical note

Physical Location: Online only.

Keywords

  • machine learning
  • malware detection
  • clustering
  • hidden Markov models
  • support vector machines
  • dynamic analysis
  • static analysis
  • Computer science and informatics

PhD type

  • Standard route
