Christo Goosen



People call me the Goose, otherwise known as CrypticGoose. I am a CTO in fintech/insurtech by day, security researcher and defender by night. I am busy writing my master's thesis on the topic below, as part of a Master's in Computer Science (Information Security) at Rhodes. Organizer of OWASP Cape Town and BSides Cape Town.

Title: Natural Language Processing and Anomaly Detection in System Call Logs

Containers (lightweight application virtualization) provide further isolation for applications, but the container daemon and management systems add more attack surface. The research problem is that, despite segmentation and system call hardening, containers are still vulnerable, and the host and other containers can be affected.

In this paper, syscall (system call, a call into the kernel) logging on Linux x86_64 systems is investigated using Natural Language Processing. Logs are tokenized and hashed, then transformed into a sparse matrix encoding. The purpose of the method is to classify the documents and compare the accuracy of different classifiers, such as Random Forest and K Nearest Neighbor.
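The tokenize, hash, and sparse-encode steps can be sketched as follows. This is a minimal illustration and not the pipeline used in the research: the bucket count, the use of bigrams, the CRC32 hash, and the function names `tokenize` and `hash_features` are all assumptions made for the example.

```python
import zlib
from collections import Counter

N_BUCKETS = 1024  # assumed size of the hashed feature space


def tokenize(log_text):
    """Split a syscall log into lowercase whitespace-separated tokens."""
    return log_text.lower().split()


def hash_features(tokens, n_buckets=N_BUCKETS, ngram=2):
    """Hash tokens and token n-grams into buckets.

    Returns one sparse row as {column_index: count}; columns that
    never occur are simply absent from the dictionary.
    """
    grams = [" ".join(tokens[i:i + ngram])
             for i in range(len(tokens) - ngram + 1)]
    row = Counter()
    for term in tokens + grams:
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        column = zlib.crc32(term.encode("utf-8")) % n_buckets
        row[column] += 1
    return dict(row)


# Each log (document) becomes one sparse row of the matrix.
logs = ["openat read read close", "socket connect sendto close"]
sparse_matrix = [hash_features(tokenize(log)) for log in logs]
```

Hashing keeps the feature space at a fixed width regardless of how many distinct syscall sequences appear, at the cost of occasional collisions between unrelated terms.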

Baseline data, as well as labeled malicious events, are used to train a model to identify and classify anomalies within syscall data. Syscall data can reveal usage or abuse of files, system resources and the network, which makes it a good source of data for anomaly detection. Furthermore, containers, specifically Docker, are chosen because application isolation decreases the noise in logs from other daemons and systems running alongside the application. Docker and Google's gVisor (a sandboxed container runtime) apply seccomp (secure computing mode) rules in Linux, which decreases the attack surface and additionally provides an opportunity to conduct further research into detecting unknown and potential zero-day attacks.
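Training on baseline and labeled malicious data can be illustrated with a small K Nearest Neighbor classifier over sparse rows. This is only a sketch under stated assumptions: the rows in `TRAIN` are hand-written placeholders and not data from the research, and cosine similarity with k = 3 is an illustrative choice.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse rows ({column: count})."""
    dot = sum(value * b.get(col, 0) for col, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query, k=3):
    """Return the majority label among the k most similar training rows."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, hand-written training rows for illustration only.
TRAIN = [
    ({0: 5, 1: 4}, "baseline"),
    ({0: 4, 1: 5, 2: 1}, "baseline"),
    ({0: 6, 1: 3}, "baseline"),
    ({7: 5, 8: 4}, "malicious"),
    ({7: 3, 8: 6, 1: 1}, "malicious"),
    ({7: 4, 8: 4}, "malicious"),
]

print(knn_predict(TRAIN, {0: 3, 1: 3}))
print(knn_predict(TRAIN, {7: 2, 8: 2}))
```

A real comparison of classifiers would hold out part of the labeled data and measure accuracy on it, rather than predicting on single hand-picked rows.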

The data is derived from malware provided through VirusTotal's academic access and from Docker exploits, covering the period 2017 to 2019. Malware samples are run dynamically in a container environment, with syscall logs gathered through strace/sysdig (system call tracing utilities).
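As a rough illustration of turning a trace into tokens, the sketch below extracts syscall names from strace-style text lines. The sample lines are hand-written for the example and are not taken from the research data; the regular expression assumes strace's default line layout with an optional PID prefix and no timestamps, and the helper name `syscall_names` is hypothetical.

```python
import re

# Optional "[pid N] " or "N " prefix, then the syscall name before "(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscall_names(lines):
    """Return syscall names in order, skipping signal and exit lines."""
    names = []
    for line in lines:
        match = SYSCALL_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


# Hand-written sample in the style of strace output (illustrative only).
SAMPLE = [
    'execve("/bin/cat", ["cat", "/etc/hostname"], 0x7ffc3b2a1c40 /* 10 vars */) = 0',
    'openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3',
    '[pid  4242] read(3, "host\\n", 131072) = 5',
    '4242 close(3) = 0',
    '--- SIGCHLD {si_signo=SIGCHLD} ---',
    '+++ exited with 0 +++',
]

print(" ".join(syscall_names(SAMPLE)))
```

The resulting sequence of names is the kind of "document" that can then be tokenized and hashed; arguments and return values are dropped here for simplicity, although they may also carry useful signal.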

The methods discussed in this paper can be applied to other systems; however, the scope of this paper does not cover the differences between architectures, and the focus is primarily on x86_64 Linux system calls.