Natural Language Processing with machine learning for anomaly detection on system call logs

Goosen, Christo

dc.contributor.advisor	Bradshaw, Karen B
dc.contributor.author	Goosen, Christo
dc.date.accessioned	2026-03-03T10:06:52Z
dc.date.issued	2023-10-13
dc.description.abstract	Host intrusion detection systems and machine learning have been studied for many years, especially on datasets such as KDD99. Current research and systems focus on problems with low training and processing complexity, such as system call returns, which lack the system call arguments and the potential traces of exploits run against a system. Malware and vulnerability detection still relies on signatures, and the potential for natural language processing of the resulting logs and system call traces needs further experimentation. This research applies natural language processing and both supervised and unsupervised machine learning techniques to unstructured raw system call traces from x86_64 GNU/Linux operating systems in order to identify current and unseen threats. It also explores whether these tools are within the skill set of information security professionals or require data science professionals. The research uses a modern academic system call dataset from Leipzig University and applies two machine learning models based on decision trees. Random Forest, the supervised algorithm, is compared with the unsupervised Isolation Forest algorithm, and each experiment is repeated after hyper-parameter tuning. The research finds conclusive evidence that the Isolation Forest algorithm, when paired with Principal Component Analysis, is effective in identifying anomalies in the modern Leipzig Intrusion Detection Data Set (LID-DS) combined with samples of executed malware from the VirusTotal Academic dataset. The default model parameters produce sub-optimal results, whereas hyper-parameter tuning raises accuracy to promising levels for anomaly detection and potential zero-day detection.
dc.description.degree	Master's thesis
dc.description.degree	MSc
dc.format.extent	124 pages
dc.format.mimetype	application/pdf
dc.identifier.other	http://hdl.handle.net/10962/424699
dc.identifier.uri	https://researchrepository.ru.ac.za/handle/20.500.14915/3571
dc.language	English
dc.publisher	Rhodes University, Faculty of Science, Department of Computer Science
dc.rights	Goosen, Christo
dc.subject	Natural language processing (Computer science)
dc.subject	Machine learning
dc.subject	Information security
dc.subject	Anomaly detection (Computer security)
dc.subject	Host-based intrusion detection system
dc.title	Natural Language Processing with machine learning for anomaly detection on system call logs
dc.type	Academic thesis

Files

Original bundle

Name: Natural_Language_Processing_with_machine_learning__vital_72176.pdf
Size: 1.7 MB
Format: Adobe Portable Document Format

Collections

Masters Degrees (Computer Science)