Abstract
This introductory document discusses topics related to malware detection via the application
of machine learning algorithms. It is intended as a supplement to the published work
submitted (a complete list of which can be found in Table 1) and outlines the motivation
behind the experiments.
The document begins with the following sections:
• Section 2 presents a preliminary discussion of the research methodology employed.
• Section 3 presents the background analysis of malware detection in general, and the
use of machine learning.
• Section 4 provides a brief introduction to the most common machine learning
algorithms in current use.
The remaining sections present the main body of the experimental work, which leads to the
conclusions in Section 10.
• Section 5 analyzes different initialization strategies for machine learning models, with
a view to ensuring that the most effective training and testing strategy is employed.
Following this, a purely dynamic approach is proposed, which results in perfect
classification of the samples against benign files, and therefore provides a baseline
against which the performance of subsequent static approaches can be compared.
• Section 6 introduces the static-based tests, beginning with the challenging problem of
detecting zero-day samples, i.e., malware samples for which not enough data has yet been
gathered to train the machine learning models.
• Section 7 describes the testing of several different approaches to static malware
detection. During these tests, the effectiveness of these algorithms is analyzed and
compared with other means of classification.
• Section 8 proposes and compares techniques to boost detection accuracy by
combining the scores obtained from other detection algorithms, with a view to
improving static classification scores and thus reaching the perfect detection obtained
with dynamic features.
• Section 9 assesses the detection effectiveness of a generic malware model trained on
several different families. The experiments are intended to introduce a more realistic
scenario in which a single, comprehensive machine learning model is used to detect
several families. This section shows the difficulty of building a single model to detect
several malware families.
| Original language | English |
|---|---|
| Qualification | Doctor of Philosophy (PhD) |
| Awarding Institution | |
| Supervisors/Advisors | |
| Publication status | Accepted/In press - May 2020 |
| Externally published | Yes |
Bibliographical note
Physical Location: Online only.
Keywords
- machine learning
- malware detection
- clustering
- hidden Markov models
- support vector machines
- dynamic analysis
- static analysis
- Computer science and informatics
PhD type
- Standard route