Ransomware Attack Detection based on Pertinent System Calls Using Machine Learning Techniques
Ahmed Dib, Sabri Ghazi and Mendjel Mohamed Said Mehdi
Networks and Systems laboratory – LRS, Department of Computer Science, Badji Mokhtar Annaba University, Annaba, Algeria
Laboratoire de Gestion Electronique de Document – LabGED, Badji Mokhtar Annaba University, Annaba, Algeria
In the last few years, the evolution of information technology has been accompanied by the growth of sensitive fields such as the dark Web and cybercrime, particularly ransomware attacks. This paper aims to isolate the critical features whose presence or absence in a program's behaviour is sufficient to decide whether it is ransomware. To that end, we propose a new solution for ransomware detection based on machine learning algorithms and system calls. First, we introduce our dataset of system calls collected from both ransomware and benignware. Then, we apply extensive pre-processing to reduce data dimensionality efficiently. After that, we introduce a new technique for selecting pertinent features. Next, we identify the critical system calls, their importance, and their contribution to distinguishing between dataset elements. Finally, we present our model, which achieves an overall accuracy of 99.81% after K-Fold cross-validation.
KEYWORDS
Ransomware, System calls, Machine learning, Cyber security.
Ransomware-as-a-Service (RAAS) [2] is one of the most commonly used ransomware generators. It simplifies the process of creating and deploying new ransomware samples, allowing individuals with little or no knowledge of cybersecurity to create advanced ransomware variants. The end-user of RAAS specifies certain parameters, such as the ransom amount, payment instructions, and deadline for payment. RAAS allows for the creation and deployment of ransomware after certain conditions have been met. Some examples of RAAS instances that have been discovered since early 2015 include Tox, Fakben, and Radamant [3]. Tox provides a simple three-step ransomware generator for free, but a portion of the ransom is collected for the benefit of the service owner.
Therefore, a significant amount of effort and research has been conducted to provide reliable solutions. Static and dynamic features have been defined to identify salient characteristics and distinguish between benign and malicious applications [4]. Static features are directly extracted from PE files without execution. The artefact is decompressed, unpacked, disassembled, and, if necessary, loaded into memory to extract its dump. Several studies have been conducted based on static feature analysis, including opcodes (operational codes), bytecodes, strings, or Executable and Linkable Format (ELF) file headers for malware and ransomware detection, for both mobile and computer systems, as shown in [5, 6, 7, 8, 9]. Conversely, dynamic feature analysis is useful for overcoming the limitations associated with static features, such as the level and complexity of artefact obfuscation. Dynamic features are extracted and collected while the ransomware is running within a protected system, usually in a virtual environment. Numerous studies have focused on the analysis of dynamic features, such as system/API calls in [10, 11, 12], network traffic in [13, 14, 15], CPU events, load, and memory consumption in [16, 17], and I/O requests in [18].
Furthermore, machine learning (ML) has been widely used for ransomware detection. It is a method of data analysis that provides a set of algorithms for learning from data, pattern recognition, and decision making, and its use has led to good detection performance. On the one hand, ML provides ensemble methods, namely bagging, boosting, and stacking. Bagging methods, including Random Forest (RF), were used in several ransomware detection studies. In [19], the author proposed a static analysis based on the RF method that deals with features extracted from the artefact's raw bytes. In [20], the authors extracted the best features from file system activities, Dynamic Link Library (DLL) references, and registry activity logs. Then, they performed a dynamic analysis using a set of ML algorithms, including bagging and RF, to distinguish between ransomware and benignware. In [21], the authors proposed the analysis of API calls to detect various kinds of malware, including ransomware. They used tree-based ensemble models, including boosting and bagging algorithms such as AdaBoost, XGBoost, and RF. On the other hand, several non-ensemble ML algorithms are used to detect both ransomware and more general types of attacks [22, 23]. Neural network-based techniques are widely used, such as bi-directional Long Short-Term Memory (BiLSTM) in [24] and the self-attention-based convolutional neural network (SA-CNN) in [25]. Moreover, classical supervised learning methods are also used in ransomware detection, such as Support Vector Machines (SVM) in [26] and Bayesian Networks and other supervised learning algorithms in [27].
Malware analysis studies, for their part, are usually carried out on collected malware samples. Several Web repositories and services allow malicious samples to be downloaded for free after registration, such as Run [28], VirusShare [29], VirusTotal [30], the Zoo [31], and Free Automated Malware Analysis Service / Hybrid Analysis. Additionally, they allow a user to submit suspicious files for scanning and obtain their analysis reports. This helps to identify new malicious samples and to interrupt the spread of malware. These repositories provide various types of behavioural reports, including PCAP files that store captured network traffic, Indicators of Compromise (OpenIOC) that give forensic artefacts of an intrusion, Malware Attribute Enumeration and Characterization (MAEC) [32], which is used for encoding and communicating high-fidelity information about malware and attacks, and Malware Information Sharing Platform and Threat Sharing (MISP) reports, which are useful for sharing cyber security indicators and threats within security communities. Researchers therefore collect the provided samples and reports to build datasets for their specific works: Hybrid Analysis is used in [26], Kaggle in [33], and VirusTotal, VirusShare, and the Zoo in [34]. In [35], the authors downloaded samples from VirusTotal and produced a new dataset of API calls that is publicly available on GitHub [36].
However, the malware analysis process produces a high-dimensional set of features. Thus, data reduction techniques are used to decrease the amount of data involved in building ML models on the one hand, and to achieve good performance on the other hand. Data dimensionality reduction is an important pre-processing step that removes incomplete, redundant, irrelevant, and ineffective data. Moreover, it speeds up computation and enhances the accuracy of ML algorithms that have a column-wise implementation. Most existing ransomware detection studies considered data dimensionality reduction. They employed a variety of techniques such as Low Variance Filter, High Correlation Filter, and Principal Component Analysis (PCA), in addition to ML algorithms that implicitly perform feature selection, such as Random Forests and the J48 decision tree. In [14], the authors proposed the selection of the most relevant network packet features for ransomware detection based on network traffic. They assigned a score to each feature using the combination of six characteristic correlations, namely gain ratio, information gain, correlation ranking, One R feature, Relief F ranking, and symmetrical. Four classes of features were defined according to their correlation score. The class with the highest score interval contained the lowest number of features and gave the best performance. In [37], the authors demonstrated that Random Forest-based approaches select the most relevant features while increasing the model performance within an intrusion detection system (IDS). In [33], the authors used the PCA technique to reduce PE file features for malware detection using deep learning techniques. In [17], PCA is also employed to reduce hardware performance counter features for hardware-assisted malware detection based on ML algorithms.
In [20] the authors proposed the use of a sequential pattern mining technique, namely Mind the Gap: Frequent Sequence Mining (MG-FSM), to detect the best features for ransomware and Benignware differentiation. They extracted Maximal Sequential Patterns (MSPs) from three sets of system events namely file system, DLL, and registry events. After removing outlier sequences, they selected the best three from nine MSP types that give the best performance when creating their ML model.
Although machine learning can be effective for detecting ransomware, it may also raise ethical concerns related to biases, privacy, and legal responsibilities. To address these concerns, we considered the following measures:
Furthermore, our study’s significant contribution is addressing the danger of ransomware attacks by employing system calls and machine learning capabilities. Consequently, we proceed to do the following:
The remainder of this paper is organized as follows: Section 2 presents related work. In Section 3, we describe the steps followed to clean up the dataset, the proposed technique, and the development of the ML models. Section 4 covers the dataset construction phase, experimentation, and the obtained results. Finally, Section 5 concludes this study.
2. Related Work
Ransomware detection based on system calls has been the subject of many recent studies. In [43], the analysis of API call frequencies is proposed to detect 14 strains of ransomware by identifying their salient features. The API calls of several benign applications are collected and compared to the API calls of ransomware using Fisher exact tests on a contingency table. This technique is proposed to distinguish between the behaviour of 14 ransomware samples from different families and the behaviour of some benign activities such as installing and running Word, Excel, Apache, etc. However, we believe that the number of ransomware samples is too small, and that the benign activities should also include suspicious-looking behaviours such as file compression, encryption, and network traffic exchange. This would allow the salient discriminative features to be identified and the common ones to be filtered out.
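To make the contingency-table comparison concrete, the following is a minimal sketch of a two-sided Fisher exact test for a single API call, written with the Python standard library only. It is not the procedure of [43]: the table layout (samples that invoke a call versus samples that do not), the counts, and the function name are assumptions chosen purely for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table that is at most as likely as the observed one
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

# Hypothetical counts (assumption): 12 of 14 ransomware samples invoke a given
# API call, while only 1 of 10 benign samples does.
p_value = fisher_exact_two_sided(12, 2, 1, 9)
print(f"p-value: {p_value:.6f}")
```

A small p-value suggests that the call is invoked at different rates in the two groups, which is the sense in which such a test can flag a call as a salient feature.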
In [34], the use of a reverse engineering framework is proposed for ransomware detection based on machine learning algorithms. After converting binary files to hexadecimal, cosine similarity is used to extract the DLL level and expected API calls. The detection process relies on several machine learning algorithms such as Bayesian Network, Logistic Regression, and AdaBoost combined with Random Forest. However, the authors did not discuss the impact of anti-reverse engineering techniques on their proposed technique [38]. Anti-debugging and anti-reverse engineering provide techniques such as code obfuscation and binary file packing to encrypt ransomware payloads, and to disrupt and impede the process of reverse engineering.
In [39], a machine learning-based framework is proposed for ransomware detection. The authors created their dataset using 83 ransomware samples from different families and 84 benignware samples from various categories. They used API call flows graph (CFG) to calculate the frequency of consecutive API calls and built machine learning models, including RF, SVM, Naïve Bayes, and Simple Logistics (SL). The highest accuracy achieved was 98.2% with SL built on 3000 features. However, the authors did not provide information about the selected features or their relationship with ransomware activities.
In [40], ransomware behaviour is modelled using a combination of static analysis, trap layer, and dynamic analysis. From the static analysis, information is gathered from the PE header, embedded resources, packers and cryptos, embedded strings, etc. The trap layer checks for the modification of a set of special files, known as “honey files and directories,” which are not expected to be modified during regular operations. Suspicious behaviour, such as Windows cryptographic API usage, is reported. During dynamic analysis, I/O Request Packets (IRP) are collected from the file system I/O manager. Only certain requests, such as file read and write operations, are included in the feature vector. ML models are then built to classify ransomware and benignware based on the collected features that describe their behaviours. The study was conducted using 574 ransomware samples from different families and 442 benignware samples. The achieved True Positive Rate was 98.25% using the Gradient Tree Boosting Algorithm. However, the authors did not provide details on how they built the ML models or processed the data.
In [10], a solution is proposed to distinguish ransomware from other types of malware and benign applications. First, n-gram sets of API call sequences are generated for file manipulation operations only. Then, feature vectors are produced using Class Frequency – Non-Class Frequency (CF-NCF), which provides classification indicators. This technique is based on the Term Frequency – Inverse Document Frequency (TF-IDF) intended to reflect the importance of a word to a document in a corpus. Finally, machine learning models are built on a weighted n-gram vector resulting from multiplying the n-gram data by the weight value obtained from CF-NCF. Six machine learning classifiers are evaluated, including Random Forest, which achieves the highest accuracy rate of 98.65%. However, the authors did not provide further details about the dataset, which includes 1000 ransomware, 900 malware, and only 300 benign applications.
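The n-gram weighting idea can be illustrated with a short sketch. Since the CF-NCF formula is not given here, the example below uses plain TF-IDF, on which CF-NCF is said to be based, so it should be read as a stand-in and not as the method of [10]. The traces and function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(calls, n):
    """Return the n-grams (as tuples) of an API call sequence."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

def tfidf_vectors(traces, n=2):
    """Weight every n-gram of every trace by term frequency times inverse document frequency."""
    counts = [Counter(ngrams(trace, n)) for trace in traces]
    n_docs = len(traces)
    # Number of traces in which each n-gram occurs at least once
    doc_freq = Counter(gram for c in counts for gram in c)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append({gram: (k / total) * math.log(n_docs / doc_freq[gram])
                        for gram, k in c.items()})
    return vectors

# Hypothetical file-manipulation traces (assumption, for illustration only)
traces = [
    ["CreateFile", "ReadFile", "WriteFile", "CloseHandle"],
    ["CreateFile", "ReadFile", "CloseHandle"],
    ["CreateFile", "ReadFile", "CryptEncrypt", "WriteFile"],
]
for vector in tfidf_vectors(traces):
    print(vector)
```

An n-gram that appears in every trace receives a weight of zero, while one that appears in a single trace is weighted up, which is the behaviour a class-oriented variant such as CF-NCF would refine.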
3. Methodology
This section describes the steps performed to create ML models for ransomware detection. First, we describe the steps of data normalization, data reduction, and feature selection. Then, we highlight the interpretability of the most significant features. Finally, we present and discuss our proposed ML model.
3.1. Dataset normalization
Data normalization is a crucial step that improves the performance of machine learning models by bringing the features onto a similar scale. We use one of the available methods to rescale all numeric data, depending on the implemented ML classification algorithms. These data normalization techniques can be either linear or non-linear. Linear techniques such as Min-Max and Clipping are sensitive to the presence of outliers and are well-supported by tree-based ML algorithms. On the other hand, non-linear methods such as Quantile Transformer and Power Scaler are more beneficial for ML classification algorithms like logistic regression and linear SVM that perform well with regression problems. Non-linear data normalization methods can handle outliers and map data onto well-known distributions such as uniform and Gaussian-like. Therefore, we first check whether the dataset contains outliers and then conduct experiments to assess the impact of choosing one normalization technique over another. We find an important variation in the number of system calls, and according to the experimentation shown in Section 4.3.2, non-linear normalization techniques give the best performance. Thus, we normalize our dataset using one of the non-linear data normalization techniques, which suits almost all ML classification algorithms and supports some data reduction and feature selection methods such as mutual information, which performs well on data following well-known distributions.
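The effect of outliers on linear versus non-linear rescaling can be seen in a small sketch. The code below uses only the Python standard library: `min_max` is the usual linear rescaling, and `quantile_uniform` is a simplified rank-based stand-in for a quantile transformation towards a uniform distribution. The call counts and function names are assumptions for illustration, not values from our dataset.

```python
from bisect import bisect_left, bisect_right

def min_max(values):
    """Linear rescaling to [0, 1]; one large outlier compresses all other values."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

def quantile_uniform(values):
    """Non-linear rescaling: map each value to its normalized rank in [0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        return [0.0 for _ in values]
    scaled = []
    for v in values:
        # Average position of v among the sorted values, so that ties share a rank
        rank = (bisect_left(ordered, v) + bisect_right(ordered, v) - 1) / 2
        scaled.append(rank / (n - 1))
    return scaled

# Hypothetical number of invocations of one system call in five samples (assumption)
counts = [3, 5, 8, 12, 9000]
print("min-max :", min_max(counts))
print("quantile:", quantile_uniform(counts))
```

With the linear method, the four small counts end up almost indistinguishable near zero, whereas the rank-based method spreads them evenly over the interval.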
3.2. Dimensionality Reduction
This step aims to remove irrelevant and redundant data from the dataset and select the most important features for ransomware detection. We focus on reducing dimensionality using filter-based feature selection methods. In the case of system calls, redundant data is produced when a set of API calls is invoked together. This usually occurs when opening and closing connections, manipulating windows, exchanging data, and so on. Table 1 shows examples of highly correlated API calls.
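One common filter for this kind of redundancy is a high-correlation filter. The sketch below is a minimal illustration using the Python standard library; the Pearson coefficient, the 0.95 threshold, the per-sample counts, and the function names are all assumptions made for the example and are not taken from our experiments.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)

def drop_correlated(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical per-sample call counts (assumption): connection-related calls
# tend to be invoked together, so their counts move in step.
features = {
    "socket":        [1, 2, 0, 4, 3],
    "closesocket":   [1, 2, 0, 4, 3],
    "connect":       [1, 2, 0, 5, 3],
    "RegOpenKeyExW": [7, 1, 3, 0, 2],
}
print(drop_correlated(features))
```

Keeping the first feature of each correlated group is an arbitrary choice in this sketch; any member of the group would carry essentially the same information.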
Table 1. Examples of highly correlated APIs