Ransomware Attack Detection based on Pertinent System Calls Using Machine Learning Techniques
Ahmed Dib, Sabri Ghazi and Mendjel Mohamed Said Mehdi
Networks and Systems laboratory – LRS, Department of Computer Science, Badji
Mokhtar Annaba University, Annaba, Algeria
Laboratoire de Gestion Electronique de Document – LabGED, Badji
Mokhtar Annaba University, Annaba, Algeria
ABSTRACT
In the last few years, the evolution of information technology has resulted in the development of several
interesting and sensitive fields such as the dark Web and cyber-criminality, especially using ransomware
attacks. This paper aims to identify only the critical features whose presence or absence in software
behaviour is sufficient to decide whether it is ransomware. Therefore, we propose a new solution for
ransomware detection based on machine learning algorithms and system calls. First, we introduce our
dataset of collected system calls from both ransomware and Benignware. Then, we apply extensive pre-processing to reduce data dimensionality efficiently. After that, we introduce a new technique to select pertinent features. Next, we bring out the critical system calls, their importance and their
contribution to the distinction between dataset elements. Finally, we present our model that achieves an
overall accuracy of 99.81% after K-Fold cross-validation.
KEYWORDS
Ransomware, System calls, Machine learning, Cyber security.
1. INTRODUCTION
Ransomware is a type of malicious software that aims to extort money from its victims by
encrypting their files using various techniques and robust algorithms to make their attacks more
efficient. The attackers provide instructions and a deadline to pay the ransom, which can be a few
hundred dollars on average. Ransomware attacks have been increasing due to the ease of creating
and generating them using various methods and tools. These attacks are facilitated by the
existence of anonymous and untraceable cryptocurrencies on the Internet. Statistics show that
more than 4,000 ransomware attacks are carried out every day [1].
Ransomware-as-a-Service (RAAS) [2] is one of the most commonly used ransomware
generators. It simplifies the process of creating and deploying new ransomware samples,
allowing individuals with little or no knowledge of cybersecurity to create advanced ransomware
variants. The end-user of RAAS specifies certain parameters, such as the ransom amount,
payment instructions, and deadline for payment. RAAS allows for the creation and deployment of
ransomware after certain conditions have been met. Some examples of RAAS instances that have
been discovered since early 2015 include Tox, Fakben, and Radamant [3]. Tox provides a simple
three-step ransomware generator for free, but a portion of the ransom is collected for the benefit
of the service owner.
International Journal of Computer Networks & Communications (IJCNC) Vol.15, No.4, July 2023
Therefore, a significant amount of effort and research has been conducted to provide reliable
solutions. Static and dynamic features have been defined to identify salient characteristics and
distinguish between benign and malicious applications [4]. Static features are directly extracted
from PE files without execution. The artefact is decompressed, unpacked, disassembled, and, if
necessary, loaded into memory to extract its dump. Several studies have been conducted based on
static feature analysis, including opcodes (operational codes), bytecodes, strings, or Executable
and Linkable Format (ELF) file headers for malware and ransomware detection, for both mobile
and computer systems, as shown in [5, 6, 7, 8, 9]. Conversely, dynamic feature analysis is useful
for overcoming the limitations associated with static features, such as the level and complexity of
artefact obfuscation. Dynamic features are extracted and collected while the ransomware is
running within a protected system, usually in a virtual environment. Numerous studies have
focused on the analysis of dynamic features, such as system/API calls in [10, 11, 12], network
traffic in [13, 14, 15], CPU events, load, and memory consumption in [16, 17], and I/O requests
in [18].
Furthermore, machine learning (ML) was widely used for ransomware detection. It is a method
of data analysis that provides a set of interesting algorithms used for learning from data, pattern
recognition, and decision making. Good performances were achieved as a result of involving ML
algorithms in ransomware detection. On one side, ML provides methods based on ensembles
namely bagging, boosting, and stacking. Bagging methods including Random Forest (RF) were
used in several ransomware detection studies. In [19], the author proposed a static analysis based
on the RF method that deals with features extracted from the artefact's raw bytes. In [20], the
authors extracted the best features from file system activities, Dynamic Linked Libraries (DLL)
references, and registry activities logs. Then, they performed a dynamic analysis using a set of
ML algorithms including bagging and RF to distinguish between ransomware and Benignware. In
[21], the authors proposed the analysis of API calls to detect various kinds of malware as well as
ransomware. They used tree-based ensemble models including Boosting and Bagging algorithms
such as AdaBoost, XGBoost, and RF. On the other side, several non-ensemble ML algorithms are
used to detect both ransomware and general types of attacks [22, 23]. Neural network-based
techniques are widely used, such as bi-directional Long Short-Term Memory (BiLSTM) in [24]
and self-attention-based convolutional neural network (SA-CNN) in [25]. Moreover, classical
supervised learning methods are also used in ransomware detection, such as Support Vector
Machines (SVM) in [26], and Bayesian Networks and other supervised learning algorithms in
[27].
On the other hand, malware analysis studies are usually carried out using collected malware.
Several Web repositories and services allow malicious samples to be downloaded for free after
registration, such as Run [28], VirusShare [29], VirusTotal [30], the Zoo [31], and Free
Automated Malware Analysis Service / Hybrid Analysis. Additionally, they allow a user to
submit suspicious files for scanning and get their analysis reports. This helps to identify new
malicious samples and breaks the spread process of malware. These repositories provide various
types of behavioural reports including PCAP files that store captured network traffic, Indicators
of Compromise (OpenIOC) that give forensic artefacts of an intrusion, Malware Attribute
Enumeration and Characterization (MAEC) [32], which is used for encoding and communicating
high-fidelity information about malware and attacks, and Malware Information Sharing Platform
and Threat Sharing (MISP) reports that are useful for sharing cyber security indicators and threats
within security communities. Therefore, researchers collected the provided samples and reports to
produce datasets for their specific works, such as the use of Hybrid Analysis in the study
[26], Kaggle in [33], and VirusTotal, VirusShare, and the Zoo in [34]. In [35], the authors downloaded
samples from VirusTotal and produced a new dataset of API calls that is publicly available on the
GitHub website [36].
However, the analysis process of malware produces a high dimensionality set of features. Thus,
data reduction techniques were used to decrease the amount of data involved in building ML models on the one
hand and to achieve good performance on the other hand. Data dimensionality reduction is an
important pre-processing step that removes incomplete, redundant, irrelevant, and ineffective
data. Moreover, it speeds up the computing process and enhances the accuracy of ML algorithms
that have a column-wise implementation. Most existing ransomware detection studies considered
data dimensionality reduction. They employed a variety of techniques such as Low Variance
Filter, High Correlation Filter, and Principal Component Analysis (PCA), in addition to the use of
some ML algorithms that implicitly perform feature selection such as Random Forests and J48
decision tree. In [14], the authors proposed the selection of the most relevant network packet
features for ransomware detection based on network traffic. They assigned a score to each feature
using the combination of six characteristic correlations namely: gain ratio, information gain,
correlation ranking, OneR, ReliefF ranking, and symmetrical uncertainty. Four classes of features
were defined according to their correlation score. The class having the highest score interval
contained the lowest number of features and gave the best performances. In [37], the authors
demonstrated that Random forest-based approaches select the most relevant features while
increasing the model performance within an intrusion detection system (IDS). In [33], the authors
used the PCA technique to reduce PE file features for malware detection using deep learning
techniques. In [17], PCA is also employed to reduce hardware performance counters features for
Hardware-Assisted Malware Detection based on ML algorithms. In [20] the authors proposed the
use of a sequential pattern mining technique, namely Mind the Gap: Frequent Sequence Mining
(MG-FSM), to detect the best features for ransomware and Benignware differentiation. They
extracted Maximal Sequential Patterns (MSPs) from three sets of system events namely file
system, DLL, and registry events. After removing outlier sequences, they selected the best three
from nine MSP types that give the best performance when creating their ML model.
Although machine learning can be effective for detecting ransomware, it may also raise ethical
concerns related to biases, privacy, and legal responsibilities. To address these concerns, we
considered the following measures:
Bias: the collected data is composed of system calls of the main Ransomware families and
Benignware categories to get a balanced and diversified dataset.
Privacy: the proposed technique collects and analyses the API calls provided by the
operating system for each process, focusing only on the type of executed operations, such as
memory allocation, data transmission, etc. without accessing the content or nature of the data.
Furthermore, our study’s significant contribution is addressing the danger of ransomware attacks
by employing system calls and machine learning capabilities. Consequently, we proceed to do the
following:
Introduce a new dataset built from scratch that includes various ransomware families and
Benignware categories, especially benign samples that share some capabilities with
Ransomware, such as file encryption and networking.
Analyze the impact of various data normalization techniques on the performance of the
different ML algorithms in the context of Ransomware detection.
Analyze the impact of various dimensionality reduction techniques on the rate of data
reduction and the performances of the different ML algorithms.
Select the pertinent features to describe the behaviour of both Ransomware and Benignware.
In this part, we propose a new technique inspired by TF-IDF to select important features
regarding their use by Ransomware.
Build ML models using 8 ML algorithms.
Quantify the contribution of each pertinent feature in the classification process.
The remainder of this paper is organized as follows: Section 2 presents related work. In Section
3, we describe the steps followed to clean up the dataset, the proposed technique, and the
development of the ML models. Section 4 covers the dataset construction phase, experimentation,
and the obtained results. Finally, Section 5 concludes this study.
2. RELATED WORK
Ransomware detection based on system calls has been the subject of many recent studies. In [43],
the analysis of API call frequencies is proposed to detect 14 strains of ransomware by identifying
their salient features. The API calls of several benign applications are collected and compared to
the API calls of ransomware using Fisher exact tests on a contingency table. This technique is
proposed to distinguish between the behaviour of 14 ransomware samples from different families
and the behaviour of some benign activities such as installing and running Word, Excel, Apache,
etc. However, we believe that there is a lack of ransomware samples on one side, and suspicious
behaviours should be included in the benign activities on the other side, such as file compression,
encryption, and network traffic exchange. This will enable the identification of salient
discriminative features and filter out the common ones.
In [34], the use of a reverse engineering framework is proposed for ransomware detection based
on machine learning algorithms. After converting binary files to hexadecimal, Cosine similarity
is used to extract the DLL level and expected API calls. The detection process is done based on
several machine learning algorithms such as Bayesian Network, Logistic Regression, and
Adaboost combined with Random Forest. However, the authors did not discuss the impact of
using anti-reverse engineering techniques on their proposed technique [38]. Anti-debugging and
anti-reverse engineering provide techniques such as code obfuscation and binary file packing to
encrypt ransomware payloads, and to disrupt and impede the process of reverse engineering.
In [39], a machine learning-based framework is proposed for ransomware detection. The authors
created their dataset using 83 ransomware samples from different families and 84 benignware
samples from various categories. They used API call flow graphs (CFG) to calculate the
frequency of consecutive API calls and built machine learning models, including RF, SVM,
Naïve Bayes, and Simple Logistic (SL). The highest accuracy achieved was 98.2% with SL built
on 3000 features. However, the authors did not provide information about the selected features or
their relationship with ransomware activities.
In [40], ransomware behaviour is modelled using a combination of static analysis, trap layer, and
dynamic analysis. From the static analysis, information is gathered from the PE header,
embedded resources, packers and cryptos, embedded strings, etc. The trap layer checks for the
modification of a set of special files, known as “honey files and directories,” which are not
expected to be modified during regular operations. Suspicious behaviour, such as Windows
cryptographic API usage, is reported. During dynamic analysis, I/O Request Packets (IRP) are
collected from the file system I/O manager. Only certain requests, such as file read and write
operations, are included in the feature vector. ML models are then built to classify ransomware
and benignware based on the collected features that describe their behaviours. The study was
conducted using 574 ransomware samples from different families and 442 benignware samples.
The achieved True Positive Rate was 98.25% using the Gradient Tree Boosting Algorithm.
However, the authors did not provide details on how they built the ML models or processed the
data.
In [10], a solution is proposed to distinguish ransomware from other types of malware and benign
applications. First, n-gram sets of API call sequences are generated for file manipulation
operations only. Then, feature vectors are produced using Class Frequency – Non-Class
Frequency (CF-NCF), which provides classification indicators. This technique is based on the
Term Frequency – Inverse Document Frequency (TF-IDF) intended to reflect the importance of a
word to a document in a corpus. Finally, machine learning models are built on a weighted n-gram
vector resulting from multiplying the n-gram data by the weight value obtained from CF-NCF.
Six machine learning classifiers are evaluated, including Random Forest, which achieves the
highest accuracy rate of 98.65%. However, the authors did not provide further details about the
dataset, which includes 1000 ransomware, 900 malware, and only 300 benign applications.
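To make the n-gram weighting idea concrete, the following minimal sketch builds n-grams from API call sequences and weights them with the standard TF-IDF formula. It is not the CF-NCF weighting of [10], whose exact definition is not given here; the traces, the function names `ngrams` and `tf_idf`, and the choice of n = 2 are assumptions made for illustration only.

```python
import math
from collections import Counter

def ngrams(calls, n):
    """Return the consecutive n-grams of one API call sequence."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

def tf_idf(traces, n=2):
    """Standard TF-IDF over n-grams; each trace plays the role of a document."""
    counts = [Counter(ngrams(t, n)) for t in traces]
    # Document frequency: in how many traces does each n-gram occur?
    df = Counter(g for c in counts for g in c)
    n_docs = len(traces)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({g: (k / total) * math.log(n_docs / df[g])
                        for g, k in c.items()})
    return weights

# Hypothetical traces (illustrative only, not taken from the dataset).
traces = [
    ["CreateFileW", "ReadFile", "WriteFile", "CloseHandle"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
]
weights = tf_idf(traces, n=2)
```

An n-gram that occurs in every trace receives a weight of zero, while one that is specific to a single trace receives a positive weight, which is the property that class-aware variants such as CF-NCF build upon.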
3. METHODOLOGY
This section describes the steps performed to create ML models for ransomware detection. First,
we describe the steps of data normalization, data reduction, and feature selection. Then, we
highlight the interpretability of the most significant features. Finally, we present and discuss our
proposed ML model.
3.1. Dataset normalization
Data normalization is a crucial step to improve the performance of machine learning models by
putting the features on a similar scale. We use one of the available methods for rescaling the
entire numeric data, depending on the implemented ML classification algorithms. These data
normalization techniques can be either linear or non-linear. Linear techniques such as Min-Max
and Clipping are sensitive to the presence of outliers and are well-supported by tree-based ML
algorithms. On the other hand, non-linear methods such as Quantile Transformer and Power
Scaler are more beneficial for ML classification algorithms like logistic regression and linear
SVM that perform well with regression problems. Non-linear data normalization methods can
handle outliers and put data under well-known distributions such as uniform and Gaussian-like.
Therefore, we first check if the dataset contains outliers and then conduct experiments to check
the impact of choosing one normalization technique over another. We find an important variation
in the number of system calls, and according to the experimentation shown in section 4.3.2, non-linear normalization techniques give the best performance. Thus, we normalize our dataset using
one of the non-linear data normalization techniques, which suit almost all ML classification
algorithms and support some data reduction and feature selection methods such as mutual
information, which performs well on data that follows a well-known distribution.
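The contrast between linear and non-linear normalization in the presence of outliers can be sketched as follows. This is an illustration under stated assumptions, not the experimental setup of section 4.3.2: the data is synthetic, and scikit-learn's `MinMaxScaler` and `QuantileTransformer` are assumed as stand-ins for the Min-Max and Quantile Transformer methods named above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

rng = np.random.default_rng(0)

# Hypothetical call-count matrix: 200 samples x 3 system calls, heavy-tailed.
X = rng.lognormal(mean=2.0, sigma=1.5, size=(200, 3))
X[0, 0] = 1e6  # one extreme outlier in the first feature

# Linear rescaling: the outlier compresses all other values towards 0.
X_minmax = MinMaxScaler().fit_transform(X)

# Non-linear rescaling: values are mapped to their quantiles (uniform output),
# so a single outlier has little effect on the rest of the distribution.
X_quantile = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform", random_state=0
).fit_transform(X)

print("median after Min-Max :", np.median(X_minmax[:, 0]))
print("median after Quantile:", np.median(X_quantile[:, 0]))
```

With Min-Max, the median of the affected feature collapses close to zero, whereas the quantile mapping keeps it near the middle of the output range.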
3.2. Dimensionality Reduction
This step aims to remove irrelevant and redundant data from the dataset and select the most
important features for Ransomware detection. We focus on reducing dimensionality using filter-based feature selection methods. In the case of system calls, redundant data is produced when a
set of API calls is invoked together. This usually occurs when opening and closing connections,
manipulating windows, exchanging data, and so on. Table 1 shows examples of highly correlated
API calls.
Table 1. Examples of highly correlated APIs
To reduce data dimensionality, we first remove redundant data by keeping only one item from
each set of highly correlated features. Next, we select the relevant features by retaining the ones
most correlated with the output vector. To determine the most appropriate dimensionality
reduction technique that yields the best performance with the lowest number of features, we
conduct experiments using various correlation methods. Specifically, we use Pearson, Spearman,
and Kendall correlation to measure linear and monotonic relationships, as well as Mutual
Information to measure the information score gained between features and the target. The best
results, as shown in section 4.4.1, are obtained by applying the Spearman correlation between
features and the Kendall correlation between the remaining features and the target. The best
performance achieved is 99.26%, with a dimensionality reduction of 98.71%, which corresponds to
keeping 67 out of 5194 features. The use of Kendall correlation allowed us
to detect other relevant features that cannot be detected using Spearman and Mutual Information.
Kendall correlation uses a more robust distance based on concordant and discordant pairs to
describe the relationship between the target and features.
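The two-stage filter described above can be sketched as follows: redundant features are dropped using the Spearman correlation between features, and the survivors are kept only if their Kendall correlation with the target is strong enough. The thresholds, the function name `correlation_filter`, and the toy data are assumptions for illustration; the actual thresholds are not stated in this section.

```python
import numpy as np
import pandas as pd
from scipy.stats import kendalltau

def correlation_filter(X, y, redundancy_thr=0.9, relevance_thr=0.3):
    """X: DataFrame of call counts; y: 0/1 labels. Thresholds are illustrative."""
    corr = X.corr(method="spearman").abs()
    kept = []
    for col in X.columns:
        # Keep one representative of each group of highly correlated features.
        if all(corr.loc[col, k] < redundancy_thr for k in kept):
            kept.append(col)
    selected = []
    for col in kept:
        # Relevance: Kendall correlation between the feature and the target.
        tau, _ = kendalltau(X[col], y)
        if abs(tau) >= relevance_thr:
            selected.append(col)
    return selected

# Hypothetical data: two redundant, relevant features and one noise feature.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
a = y * 10 + rng.normal(size=100)
X = pd.DataFrame({
    "connect": a,
    "closesocket": 2 * a + 1,               # monotonic in "connect": redundant
    "GetTickCount": rng.normal(size=100),   # unrelated to the label
})
selected = correlation_filter(X, y)

# Reduction rate reported in the text: 67 features kept out of 5194.
reduction_rate = 1 - 67 / 5194
```

On this toy input only one of the two redundant features survives, and the noise feature is discarded at the relevance stage.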
However, the combination of Spearman and MI provides good performance for almost all ML
classification algorithms. This combination makes use of the monotonic correlation between
features on one hand, and the gained information between features and target by calculating the
distance of their distribution on the other hand. This results in a good dimensionality reduction
rate (96.94%) because we exclude features with low MI-scores even if they are moderately
correlated with the target.
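The relevance stage based on Mutual Information can be illustrated in the same spirit. The sketch below assumes scikit-learn's `mutual_info_classif` as the MI estimator and an arbitrary cut-off of 0.1; neither is confirmed by the text, and the data is synthetic.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)

# Hypothetical features: one informative about the label, one pure noise.
X = np.column_stack([
    y + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# MI score between each feature and the target (higher means more informative).
mi = mutual_info_classif(X, y, random_state=0)

# Illustrative threshold: features below it are excluded, even if they are
# moderately correlated with the target.
keep = mi >= 0.1
```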
3.3. Feature selection
Feature selection is an important step to validate the inputs of ML models. It is ideal to reduce the
number of features while keeping the best performance to obtain the necessary set of features to
distinguish between Ransomware and Benignware. However, in our case, when we reduced
features by applying Spearman and Kendall correlations, we had to choose an extreme threshold
to achieve a high reduction rate. As a result, upon analysing the reduced features, we found that:
More than half of the features are related to graphical interface manipulation (24% Graphics
and gaming, 31% Windows application UI development).
All reduced features are captured from Benignware executions.
Thus, we should exclude, as far as possible, system calls related to graphical interface
manipulation on the one hand, and involve more features that typify Ransomware and describe its
behaviour on the other hand. It is essential to focus on exactly what happens with the file
system, network, and services regarding both Ransomware and Benignware. For this purpose, we
propose the combination of two methods to select the most important and ‘special’ features. The
word ‘special’ is used to indicate that the feature may not be very discriminative since it is not
selected in the data reduction step, but it has different information that we can exploit to
distinguish Ransomware behaviour, even if it does not meet correlation criteria. The two methods
that we combine are Permutation Feature Importance (PFI) and a new
technique inspired by TF-IDF. The latter is used in various domains, especially in Natural
Language Processing (NLP). It evaluates the importance of terms in the textual corpus by
calculating TF and IDF.
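Permutation Feature Importance measures how much a model's score drops when the values of one feature are shuffled. A minimal sketch, assuming scikit-learn's `permutation_importance` and a Random Forest on synthetic data (the classifier, its parameters, and the data are illustrative choices, not the configuration used in this study):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 150)

# Hypothetical features: column 0 separates the classes, column 1 is noise.
X = np.column_stack([
    y + 0.3 * rng.normal(size=300),
    rng.normal(size=300),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the mean
# decrease in accuracy; a larger decrease means a more important feature.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
importances = result.importances_mean
```

Computing the importance on held-out data rather than on the training set avoids rewarding features that the model has merely memorised.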
3.3.1. Feature Importance based on Call Frequency and References (FICFR)
We propose the FICFR technique to extract the most important features frequently called by
Ransomware. As seen in the previous section, the API calls resulting from the data reduction step
only describe Benignware behaviour. To address this issue, we propose a technique inspired by
TF-IDF to calculate a score for each feature that describes its importance concerning
Ransomware. A feature is considered important if Ransomware calls it more frequently than
Benignware does and if it is referenced by almost all Ransomware samples, in contrast to Benignware.
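The intuition behind such a score can be sketched in code. The formula below is a hypothetical TF-IDF-style stand-in, not the FICFR definition, which is not given at this point in the text: it multiplies a call-frequency term and a reference term computed on Ransomware by a rarity term computed on Benignware. The function name `ficfr_like_score`, the smoothing constants, and the sample counts are all assumptions.

```python
import numpy as np

def ficfr_like_score(counts, is_ransomware):
    """Hypothetical TF-IDF-style score; NOT the FICFR formula itself.

    counts: (n_samples, n_features) matrix of call counts.
    is_ransomware: boolean mask over samples.
    """
    counts = np.asarray(counts, dtype=float)
    mask = np.asarray(is_ransomware, dtype=bool)
    r, b = counts[mask], counts[~mask]
    # TF-like term: share of all Ransomware calls that go to each feature.
    call_freq = r.sum(axis=0) / r.sum()
    # Reference term: fraction of Ransomware samples that call the feature.
    ref_r = (r > 0).mean(axis=0)
    # IDF-like term: larger when few Benignware samples call the feature.
    rarity_b = np.log((1 + len(b)) / (1 + (b > 0).sum(axis=0)))
    return call_freq * ref_r * rarity_b

# Hypothetical counts for three API calls in two Ransomware and two
# Benignware samples (illustrative values only).
features = ["CryptEncrypt", "ReadFile", "CreateWindowExW"]
counts = [
    [9, 1, 0],  # ransomware
    [8, 2, 0],  # ransomware
    [0, 3, 5],  # benignware
    [0, 2, 6],  # benignware
]
scores = ficfr_like_score(counts, [True, True, False, False])
top_feature = features[int(np.argmax(scores))]
```

In this toy example the call used heavily by every Ransomware sample and by no Benignware sample obtains the highest score, while calls shared by all Benignware samples score zero.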