Laeeba Javed, Aasim Zafar
Department of Computer Science, Aligarh Muslim University, India
ABSTRACT
Cybersecurity language models are typically evaluated in fragmented ways, which impedes meaningful comparison and operational understanding. Existing cybersecurity-specific BERT models are often tested in isolation, with inconsistent preprocessing, tokenization, and evaluation procedures. This paper presents a unified cross-domain benchmarking study for systematically evaluating cybersecurity-adapted BERT models under identical experimental conditions. The evaluation spans CTI, phishing, logs, and CVE domains using CTI-BERT, SecureBERT, CySecBERT, and SecBERT. Results reveal strong performance convergence across models and highlight domain-driven failure modes rather than architectural superiority. To examine real-world resilience, the study extends this framework with zero-shot and few-shot cross-domain evaluations, revealing asymmetric transfer behavior and domain-dependent adaptation efficiency. A controlled training-strategy ablation is also performed, indicating that aggressive optimization does not always increase performance and can decrease stability in semantically rich domains. Stress filtering further exposes brittle reliance on lexical shortcuts and limited semantic grounding. These findings provide practical guidance for model–domain alignment and real-world cybersecurity deployment, and are especially relevant to network security and operational settings where cybersecurity models are deployed across heterogeneous data streams.
KEYWORDS
Cybersecurity, NLP, BERT, Threat Intelligence, Domain-Adaptive Pretraining, Network Security
1. INTRODUCTION
Contemporary cybersecurity analysis is increasingly being carried out as part of larger distributed networked systems rather than as standalone analytic components [1] [2]. Cyber threat intelligence pipelines gather a variety of textual streams from sensors, endpoints, intrusion detection systems, and external sources that are spread across complex network infrastructures [1] [3]. In such cases, the practical utility of language models is determined not only by task-specific performance metrics (e.g., F1-score, precision, recall) but also by predictability, robustness during domain shift, and stability across diverse data sources. As cybersecurity language models become more integrated into network operations, monitoring workflows, and automated response mechanisms, understanding their behavior across different domains becomes a system-wide problem [4] [5].
The primary challenge facing modern cybersecurity professionals is that cyber threats are becoming more advanced and evolving rapidly [1] [6]. This issue has intensified since 2019 and 2020, as accelerated digitization, remote work, and cloud-based services significantly expanded the attack surface, enabling adversaries to exploit vulnerabilities more rapidly and adaptively while leveraging technological advances to evade existing detection systems [1] [3].
Adding to this, the scale, speed, and diversity of digital data have become increasingly difficult to manage [4] [3]. Threat intelligence extraction is severely affected by the continuous generation of large-scale runtime logs and heterogeneous security data from sensors, endpoint agents, and highly unstructured external sources like social media platforms, blogs, threat feeds, dark web forums, and incident reports [1] [7] [2]. As a result, general-purpose language models (LMs) have proven ineffective for this domain due to the highly specialized nature of cybersecurity documentation and the prevalence of threats described in unstructured textual formats [4] [8]. Cybersecurity terms are frequently uncommon in general English (e.g., ransomware, API, keylogger) or function as homographs (words with multiple meanings) that have completely different connotations in a security context (e.g., honeypot, patch, virus) compared to standard English corpora [4] [8]. This situation has rendered traditional, static defense systems inadequate, necessitating the deployment of complex automated techniques like Deep Learning (DL) and Natural Language Processing (NLP) [1] [8] [9].
Despite the effectiveness of general-purpose language models in open-domain natural language processing, their direct application to cybersecurity text is fundamentally ineffective [4]. Technical terminology, acronyms, exploit identifiers, and malware family names are examples of highly specialized vocabulary used in cybersecurity documentation that is either rare or non-existent in regular English corpora [10] [11]. As a result, generic language models face a significant domain mismatch, resulting in frequent tokenization failures and insufficient lexical representations of essential security concepts [11]. This difficulty is exacerbated by semantic ambiguity, as several regularly used cybersecurity terms, such as virus, worm, payload, and security patch, have meanings that differ significantly from their general English usage [8]. Without domain understanding, general models often misinterpret such phrases, yielding inappropriate contextual associations [5]. Furthermore, cybersecurity text differs structurally from the clean, narrative-style data used for pretraining generic language models [12]. Threat reports and operational data are often lengthy, noisy, and poorly structured, combining unstructured prose with semi-structured system logs, indicators of compromise, and artifacts from online forums or social media [2]. These properties involve long-range dependencies, uneven formatting, and technical abstractions that ordinary models struggle to represent effectively [13] [14]. As a result, generic language models often fail to capture the full semantics of cybersecurity text, resulting in poor performance for downstream tasks like entity recognition, threat classification, and anomaly detection [13]. These constraints show that using general-purpose language models alone is insufficient for reliable cybersecurity analysis [10].
Building on these constraints, prior studies have increasingly used Domain-Adaptive Pretraining (DAPT) and proposed a number of cybersecurity-specific BERT models to better capture the specialized terminology and domain semantics that generic language models are unable to capture [15]. While these approaches show significant in-domain performance gains [10], they are typically designed for specific cybersecurity sub-domains and evaluated in isolation, with highly inconsistent experimental pipelines that include customized preprocessing, tokenization strategies, loss functions, and evaluation metrics [16] [17]. As a result, claimed improvements are task-specific and scientifically incomparable, and cross-domain transferability remains largely unknown, leaving practitioners without credible empirical guidance for model selection in real-world operational situations.
Figure 1 summarises how these approaches have evolved over time. To address this gap, our work introduces a unified cross-domain benchmarking study that evaluates multiple cybersecurity-specific BERT models using an identical fine-tuning and evaluation pipeline across multiple cybersecurity text domains. This work provides a reproducible foundation for comparative research by enabling controlled analysis of cross-domain behavior, model-domain alignment, and practical deployment implications. To the best of our knowledge, this is the first systematic evaluation of cybersecurity-focused BERT models conducted under fully consistent experimental conditions.
Figure 1 : Evolution of NLP models in cybersecurity from traditional machine learning approaches to domain-specific Transformer architectures.
2. RELATED WORK
Early cybersecurity defense systems focused on rule-based procedures and predefined threat signatures, which were only effective against known attacks and were insufficient for identifying novel or advanced persistent threats [18]. To automate tasks like malware detection [19], intrusion analysis [20], and vulnerability prediction [17], classical machine learning approaches using handcrafted features like TF-IDF [21] [1] and classifiers like Support Vector Machines [18], Naïve Bayes [22], and K-Nearest Neighbors [23] were introduced [24]. However, these count-based text representations failed to capture contextual dependencies and semantic relationships between events, frequently omitting significant correlations inherent in cybersecurity text streams [25] [7] [13]. To overcome these constraints, sequential deep learning models like Recurrent Neural Networks and Long Short-Term Memory networks were employed, as they could model event sequences more effectively in tasks like log anomaly detection and threat report entity recognition [26] [27] [25].
Even with notable advances over classical methods, RNN and LSTM-based models continued to struggle with long-range dependencies, unidirectional processing, and complex semantic variations inherent in cybersecurity narratives [28] [29] [30]. A major shift was enabled by the introduction of Transformer-based architectures, particularly BERT (Bidirectional Encoder Representations from Transformers), which capture both preceding and following context, resulting in richer semantic representations for cybersecurity text [9] [18]. BERT and its variants soon became the dominant backbone for modern cybersecurity NLP tasks, as they directly addressed the contextual limitations of earlier models when handling diverse attack descriptions [13].
Despite their success in open-domain NLP, generic BERT models were ineffective for cybersecurity literature due to significant domain mismatch, as they were pretrained on general corpora that lack the specialized vocabulary and structure of security narratives [4] [31] [32]. Key cybersecurity terms, such as malware families, exploit identifiers, protocol artifacts, and threat actor aliases, were often handled as out-of-vocabulary terms or poorly tokenised [4] [8]. At the same time, common words like virus, worm, and payload were misinterpreted due to domain-specific semantics, leading to incorrect contextual representations in threat intelligence text [8] [13]. These issues were further amplified by the voluminous, noisy, and weakly structured nature of cybersecurity documents, including CTI reports, vulnerability descriptions, and system logs, as well as continuous semantic drift driven by evolving attack strategies, which static general-domain models failed to detect [18] [24].
To overcome these constraints, Domain-Adaptive Pretraining (DAPT) was adopted, in which pretrained language models are further trained on extensive unlabeled cybersecurity corpora [4]. By aligning vocabulary, semantics, and contextual understanding with real-world security data, DAPT consistently improves downstream tasks such as threat classification, named entity recognition, and vulnerability analysis [29] [33].
Table 1 : Learning Paradigms in Domain-Adaptive Pretraining (DAPT)
However, DAPT resulted in a scattered ecosystem of cybersecurity-specific BERT variants, as shown in Table 2 and Figure 2, each tailored for specific subdomains like CTI, malware analysis, system logs, or vulnerability assessment, thereby limiting cross-domain generalization and comparability across cybersecurity NLP tasks [8] [13].
Table 2 : Cybersecurity-Specific BERT Models and Evaluation Scope
This fragmentation is intensified by inconsistent experimental pipelines, including customized preprocessing, tokenization, loss functions, and domain-specific evaluation metrics [9] [37] [14], as discussed in Table 3. As a result, observed performance gains are task-specific and scientifically incomparable across studies and cybersecurity domains [9] [4]. Most cybersecurity-adapted BERT models are assessed primarily within their designated domains, leaving cross-domain transferability largely unexplored [34] [30]. Furthermore, the literature lacks a systematic model-domain alignment study under consistent conditions, which limits practical guidance for real-world cybersecurity implementations [1] [37]. In contrast, this study emphasizes cross-domain benchmarking within a standardized experimental framework, explicitly examining model–domain alignment and failure modes through the application of stress-filtered inputs to more accurately simulate operational cybersecurity environments.
Figure 2 : Taxonomy of Cybersecurity BERT Architectures
Table 3 : Identified Research Gaps in Cybersecurity NLP
Figure 3 : Overview of the unified benchmarking study.
3. METHODOLOGY
This section describes the experimental methodology as a unified and controlled process designed to enable fair benchmarking across all evaluated models. Figure 3 provides the graphical representation of the study. The central idea guiding this methodology is that any observed performance differences should originate only from differences in model architecture and pretraining strategy, and not from variations in the experimental pipeline.
To accomplish this, all models were assessed under a single, uniform experimental framework. Every pipeline step was kept identical across models and domains. To maintain the original structure and semantics of the data, input texts were subjected to uniform, minimal preprocessing: only lowercasing, whitespace normalization, and the removal of non-textual artifacts such as malformed tokens. With the exception of the stress filtering technique, no language normalization, stemming, lemmatization, or domain-specific cleaning was performed. This ensured that models were exposed to the same raw information content, allowing their learned representations to be evaluated without external bias.
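The preprocessing described above can be illustrated with a short sketch. The function name `preprocess` is hypothetical, and the treatment of "non-textual artifacts" as Unicode control characters plus the replacement character U+FFFD is an assumption made for illustration; the exact rule used in the study is not specified.

```python
import re
import unicodedata


def _is_artifact(ch: str) -> bool:
    # Assumption: a "non-textual artifact" is approximated as a Unicode
    # control/format character (category C*) or the replacement character.
    if ch.isspace():
        return False  # whitespace is normalized later, not dropped here
    return ch == "\ufffd" or unicodedata.category(ch).startswith("C")


def preprocess(text: str) -> str:
    """Lowercase, replace artifact characters with spaces, collapse whitespace."""
    lowered = text.lower()
    cleaned = "".join(" " if _is_artifact(ch) else ch for ch in lowered)
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    print(preprocess("  Emotet\x00 Dropper\t\tDetected\n"))
```

No stemming, lemmatization, or vocabulary normalization appears in the sketch, which mirrors the constraint that models should see the same near-raw content.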
Each model employed its own native tokenizer, loaded from the corresponding pre-trained HuggingFace checkpoint. The evaluated models are CTI-BERT [4], SecureBERT [8], CySecBERT [13], and SecBERT [35], each cited with its original pretraining publication or related source. These models span several cybersecurity pretraining approaches, including general cybersecurity adaptation, continued domain-adaptive pretraining, and pretraining from scratch on domain-specific corpora. To maintain consistency with the pretraining assumptions, tokenizer behavior, including special token handling, was left unchanged. All inputs were standardized to a maximum sequence length of 256 tokens using truncation or padding.
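With HuggingFace tokenizers, a fixed length of this kind is usually obtained by passing `truncation=True`, `padding="max_length"`, and `max_length=256` to the tokenizer call. The plain-Python sketch below only illustrates the effect on a sequence of token ids. The helper name `pad_or_truncate` and the default `pad_id=0` are assumptions; the real padding id depends on each model's vocabulary, and real tokenizers also preserve the closing special token when truncating, which this sketch does not.

```python
from typing import List, Tuple

MAX_LEN = 256  # maximum sequence length stated in the text


def pad_or_truncate(
    ids: List[int], pad_id: int = 0, max_len: int = MAX_LEN
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both of length exactly max_len."""
    kept = ids[:max_len]                 # truncate long inputs
    n_pad = max_len - len(kept)          # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad  # right-pad short inputs
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


if __name__ == "__main__":
    ids, mask = pad_or_truncate([101, 7592, 102])
    print(len(ids), sum(mask))
```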
Every domain used the same HuggingFace Trainer configuration for fine-tuning and evaluation. The batch size, number of epochs, optimizer, learning rate, and random seed were all kept constant. No model-specific hyperparameter tuning or early stopping was introduced. Any stability-related adjustments were made solely to ensure successful execution and did not alter the optimization objective. Evaluating these architectures under identical downstream conditions enables controlled analysis of how the pretraining technique influences cross-domain behavior, independent of pipeline or optimization effects.
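One way to make the "single shared configuration" constraint concrete is to define one immutable configuration object and reuse it for every model and domain. The sketch below assumes this design; `SharedConfig` and `build_runs` are hypothetical names, and the numeric values passed in the usage example are placeholders, since the batch size, epoch count, learning rate, and seed are not reported in this section.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List

MODELS = ("CTI-BERT", "SecureBERT", "CySecBERT", "SecBERT")
DOMAINS = ("CTI", "phishing", "logs", "CVE")


@dataclass(frozen=True)  # frozen: no run can silently change a hyperparameter
class SharedConfig:
    batch_size: int
    num_epochs: int
    learning_rate: float
    seed: int
    max_seq_len: int = 256  # the only value stated in the text


def build_runs(config: SharedConfig) -> List[Dict[str, object]]:
    """Enumerate every (model, domain) run, all pointing at the same config."""
    return [
        {"model": m, "domain": d, "config": config}
        for m, d in product(MODELS, DOMAINS)
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only; not the study's settings.
    cfg = SharedConfig(batch_size=16, num_epochs=3, learning_rate=2e-5, seed=42)
    print(len(build_runs(cfg)))
```

Because the dataclass is frozen, any attempt to adjust a hyperparameter for one model raises an error, which makes accidental model-specific tuning harder.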
3.1. Domains, Datasets, and Strategy
In this study, each domain is treated as a distinct text classification task and assessed using a single experimental pipeline. The goal is to investigate how various cybersecurity-adapted language models behave under similar and controlled settings rather than to maximize task-specific performance.
The Cyber Threat Intelligence (CTI) domain is made up of curated threat intelligence text obtained from structured reports with ATT&CK style annotations [4]. Multi-class classification of CTI entities, such as ACTOR, MALWARE, TOOL, and TACTIC, is part of the process. After stress filtering, the dataset’s initial 15,000 samples are reduced to 2,306 samples. Although there is a moderate decline in TACTIC instances as a result of more semantic constraints, the overall class distribution is still largely stable. The corpus incorporates information from APTNotes threat reports, MITRE ATT&CK knowledge bases, and auxiliary vulnerability-related descriptions.
The phishing domain is based on the Phish No More dataset [38], a consolidated corpus that combines numerous publicly available email datasets. The aim is to distinguish between phishing and legitimate emails using binary classification. Stress filtering generates a compact challenge subset of 2,299 emails from 39,154 original samples, which is then stratified into training and test sets.
The system log domain is framed as binary anomaly detection on the HDFS log dataset [39]. Log messages are highly repetitive and template-driven, with a significant class imbalance favoring normal events. Stress filtering reduces the dataset from 30,000 to 28,160 samples by retaining logs with overlapping anomaly indicators while preserving the original task formulation.
In contrast, stress filtering is intentionally not applied in the CVE severity categorization task. CVE severity labels (LOW, MEDIUM, HIGH, and CRITICAL) are based on structured CVSS assessments rather than linguistic clues. Lexical constraints were found to affect label distributions and task validity. As a result, the CVE domain is retained without stress filtering and serves as a diagnostic case for analyzing model failure modes under a text-only formulation.
3.2. Design justification & Threats to Validity
Stress filtering is used specifically for linguistically grounded domains to remove lexically evident samples and create challenging evaluation subsets. A domain-specific lexical indicator set is defined for each applicable domain, and a sample is considered valid if it contains at least two different indicators. Formally, writing $\mathcal{I}_d$ for the indicator set of domain $d$, a sample $x$ is stress valid if
$$\left|\{\, i \in \mathcal{I}_d : i \text{ occurs in } x \,\}\right| \geq 2.$$
This threshold prevents shortcut learning, enforces contextual interpretation, and simulates analyst-level ambiguity. While stress filtering makes the evaluation more demanding, it also reduces dataset size and may change class distributions, as shown in Table 4. To reduce this effect, stratified sampling and consistent filtering thresholds are applied to all models in each domain. Furthermore, this study does not claim optimal task performance, but instead prioritizes controlled comparison over score maximization, which aligns with benchmarking purposes rather than state-of-the-art competition.
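The validity rule above (at least two different indicators per sample) can be sketched as follows. The indicator set in the usage example is hypothetical, since the actual per-domain indicator lists are not given here, and case-insensitive whole-word matching is an assumption about how an indicator is counted as present.

```python
import re
from typing import Iterable, List, Set


def matched_indicators(text: str, indicators: Iterable[str]) -> Set[str]:
    """Return the distinct indicators that occur in the text."""
    lowered = text.lower()
    return {
        ind
        for ind in indicators
        # Assumption: case-insensitive, whole-word match.
        if re.search(r"\b" + re.escape(ind.lower()) + r"\b", lowered)
    }


def is_stress_valid(text: str, indicators: Iterable[str], min_distinct: int = 2) -> bool:
    """A sample is stress valid if it has at least `min_distinct` different indicators."""
    return len(matched_indicators(text, indicators)) >= min_distinct


def stress_filter(samples: List[str], indicators: Iterable[str], min_distinct: int = 2) -> List[str]:
    indicators = list(indicators)  # allow any iterable to be reused across samples
    return [s for s in samples if is_stress_valid(s, indicators, min_distinct)]


if __name__ == "__main__":
    # Hypothetical phishing indicators, for illustration only.
    phishing_indicators = {"verify", "account", "urgent", "password"}
    emails = [
        "Please verify your account today",
        "Urgent urgent URGENT",
        "Meeting moved to 3pm",
    ]
    print(stress_filter(emails, phishing_indicators))
```

Repeating the same indicator does not satisfy the threshold, because the rule counts distinct indicators rather than total occurrences.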
3.3. Evaluation Metrics
The evaluation technique was developed to ensure that the reported results are meaningful and comparable at the class level across all domains and models. Because each domain has varied degrees of class imbalance, a unified set of metrics was used consistently to support fair benchmarking.
Table 4 : Dataset Statistics Before and After Stress Filtering
The key evaluation criterion is the Macro-F1 score, which gives equal weight to all classes regardless of frequency. This choice ensures that model performance is not driven by the majority classes and provides a fair assessment of predictive quality across labels. To support this analysis, per-class F1 scores are also reported, revealing label-specific strengths and failure patterns that may be hidden by aggregate metrics. Overall accuracy is reported as a secondary statistic for completeness, although it is not prioritized due to its susceptibility to class imbalance. Macro-F1 is computed as
$$\text{Macro-F1} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{F1}_c,$$
where $\mathrm{F1}_c$ is the F1 score of class $c$ and $C$ denotes the number of classes.
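As an illustration of why Macro-F1 is preferred over accuracy under class imbalance, the sketch below computes per-class F1 and their unweighted mean in plain Python. The function names are hypothetical, and the toy labels in the usage example are invented for illustration and are not taken from the study's data.

```python
from typing import Dict, Hashable, Sequence


def per_class_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """F1 per class, using F1 = 2TP / (2TP + FP + FN)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    scores = {}
    for c in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Toy imbalanced example: the classifier always predicts the majority class.
    y_true = ["normal"] * 8 + ["anomaly"] * 2
    y_pred = ["normal"] * 10
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy, macro_f1(y_true, y_pred))
```

In this toy case accuracy is 0.8 while Macro-F1 is 4/9 (about 0.44), because the minority class contributes an F1 of zero with the same weight as the majority class.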
3.4. Zero-Shot Cross-Domain Training Methodology
While the unified fine-tuning approach ensures controlled comparability across domains, real-world cybersecurity systems often operate in domain mismatch scenarios, in which labeled data from a target domain is absent or highly limited. In such cases, the practical utility of a cybersecurity language model is determined not by its peak in-domain accuracy, but by its capacity to transfer previously acquired domain knowledge to unseen security contexts.
To explicitly examine this, we add a zero-shot cross-domain training protocol to the unified benchmarking study, in which models are trained on a single source domain and then evaluated on a different target domain without any additional fine-tuning or adaptation. This setup isolates the underlying generalization capacity of each cybersecurity-specific BERT model and directly addresses the gap in cross-domain transfer analysis identified in previous work and in the motivation of this study.
3.4.1. Experimental Design
The zero-shot evaluation employs a train-once, test-elsewhere approach. For every experiment:
• A model is fine-tuned exclusively on a single source domain.
• The trained model is then tested unchanged on a new target domain.
• No target-domain samples, labels, or statistics are used during training.
This protocol ensures that any observed performance on the target domain is the result of semantic and representational overlap acquired during source-domain training, rather than task-specific memorization.
Table 5 lists the domains analyzed, which include Cyber Threat Intelligence (CTI), phishing emails, system logs, and CVE severity descriptions. Each domain keeps its original task formulation, label space, and assessment metrics, as described in the unified pipeline.
Table 5 : Zero-Shot Cross-Domain Training–Evaluation Matrix
3.4.2. Data Preparation and Training Protocol
All zero-shot trials employ the stress-filtered datasets from the core benchmarking pipeline, ensuring consistency with the main study and preventing dataset leakage. Stress filtering remains enabled for linguistically grounded domains (CTI, phishing, and logs) but disabled for CVE severity classification, in accordance with the methodological rationale presented in Section 3.1. This ensures that zero-shot findings are directly comparable to the fine-tuned results reported in the unified evaluation.
We use the same training configuration in zero-shot experiments as in the unified benchmarking pipeline. No domain-specific hyperparameter adjustment, class reweighting, or early stopping is introduced. Each model is trained individually on a single source domain and tested across all remaining target domains, resulting in a comprehensive cross-domain transfer matrix that includes both strong and weak transfer pathways.
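The train-once, test-elsewhere protocol can be enumerated with a short sketch. It assumes that every ordered source and target pair among the four domains is evaluated, as the description of "all remaining target domains" suggests. The function names are hypothetical, and the sketch only enumerates runs; it does not address how predictions are mapped between domains whose label spaces differ, which is not specified in this section.

```python
from itertools import permutations
from typing import Dict, List, Tuple

MODELS = ("CTI-BERT", "SecureBERT", "CySecBERT", "SecBERT")
DOMAINS = ("CTI", "phishing", "logs", "CVE")


def zero_shot_pairs(domains: Tuple[str, ...] = DOMAINS) -> List[Tuple[str, str]]:
    """All ordered (source, target) pairs with source != target."""
    return list(permutations(domains, 2))


def transfer_plan(
    models: Tuple[str, ...] = MODELS, domains: Tuple[str, ...] = DOMAINS
) -> Dict[Tuple[str, str], List[str]]:
    """Map each (model, source) fine-tuning run to the targets it is evaluated on."""
    return {
        (model, source): [t for t in domains if t != source]
        for model in models
        for source in domains
    }


if __name__ == "__main__":
    plan = transfer_plan()
    n_train = len(plan)                             # one fine-tuning run per key
    n_eval = sum(len(ts) for ts in plan.values())   # one evaluation per target
    print(len(zero_shot_pairs()), n_train, n_eval)
```

Under these assumptions, each fine-tuned checkpoint is reused unchanged for its three target evaluations, so the number of training runs is smaller than the number of cells in the transfer matrix.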
3.5. Few-Shot Cross-Domain Adaptation Methodology
The zero-shot experiments established an important baseline: cybersecurity language models exhibit high domain dependence and asymmetric transfer behavior, even when evaluated using a fully unified pipeline. However, zero-shot configurations do not accurately reflect real-world cybersecurity deployments, in which analysts often have access to a small number of labeled samples from a new domain rather than none at all.
In practice, labeled cybersecurity data is expensive, noisy, and difficult to gather in large quantities. The operational concern now moves from whether a model can generalize without adaptation to how effectively it can adapt when minimal target domain supervision is provided. To investigate this, we extend the zero-shot procedure to include a few-shot cross-domain adaptation setting, in which models are trained on a source domain before being exposed to limited labeled data from a target domain.
This investigation focuses on two transfer paths: CTI → phishing and CTI → logs, which show contrasting semantic behavior in the zero-shot condition. This enables us to investigate how domain structure affects adaptation efficiency under constrained supervision.
3.5.1. Experimental Design
In the few-shot approach, each model is fully fine-tuned on the CTI domain before being fine-tuned on small portions of the target domain. Target-domain supervision is gradually introduced at 10%, 25%, and 50% of the available training data, with evaluation always performed on a predetermined held-out test set. No architecture changes, hyperparameter tuning, or domain-specific optimizations are implemented.
As a result, any observed performance differences can be directly linked to the amount of target-domain supervision available, rather than procedural variance. This design preserves the unified pipeline’s experimental controls while allowing for clean comparisons across models and supervision levels.
3.5.2. Data Preparation and Training Protocol
All few-shot experiments reuse the previously defined stress-filtered datasets, ensuring compatibility with the zero-shot and fully fine-tuned evaluations. The whole stress-filtered CTI dataset is used for source-domain supervision. For the target domains, few-shot subsets are generated solely from the training split, whereas the test split remains constant across experiments. To preserve class imbalance characteristics, few-shot samples are stratified when used for phishing. The natural class imbalance in logs is purposely maintained to represent operational contexts. This approach ensures that the few-shot results are directly comparable to previous findings.
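A class-stratified few-shot subset at the 10%, 25%, and 50% levels could be drawn from the target training split along the following lines. This is a minimal sketch: the function name `stratified_subset`, the fixed `seed`, and the rule of keeping at least one sample per class are assumptions, and whether the study's subsets are nested across supervision levels is not stated.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence

FRACTIONS = (0.10, 0.25, 0.50)  # supervision levels stated in the text


def stratified_subset(labels: Sequence[Hashable], fraction: float, seed: int = 0) -> List[int]:
    """Return sorted training-split indices sampled per class at the given fraction."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)  # fixed seed so every model sees the same subset
    chosen: List[int] = []
    for y in sorted(by_class, key=str):  # deterministic class order
        idxs = by_class[y]
        k = max(1, round(fraction * len(idxs)))  # assumption: keep >= 1 per class
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)


if __name__ == "__main__":
    # Toy training labels for illustration; the test split is never touched.
    train_labels = ["phish"] * 40 + ["legit"] * 60
    for frac in FRACTIONS:
        subset = stratified_subset(train_labels, frac)
        print(frac, len(subset))
```

Sampling indices only from the training split, with a fixed seed, keeps the held-out test set constant and gives every model the same few-shot examples at each supervision level.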
3.6. Controlled Training Strategy Ablation Study
The unified benchmarking approach demonstrated that cybersecurity-adapted BERT models converge considerably in performance when tested under identical settings. However, this convergence does not explain whether performance similarities are caused by model architecture, domain characteristics, or the downstream training technique itself.
To disentangle these factors, we conduct a controlled ablation study that isolates the influence of the training approach while leaving all other components fixed. Unlike previous research, which frequently conflates architectural novelty with training procedures, our analysis focuses specifically on how different training configurations affect performance inside the same unified pipeline.
The purpose of this ablation is diagnostic rather than competitive: to discover whether more aggressive or relaxed training procedures reliably extract additional domain signal, or whether simpler and more stable configurations already capture the majority of what can be learned from the data.
3.6.1. Experimental Design and Training Protocol
Three controlled training variants (A, B, and C) were tested in all four cybersecurity domains. These variants are not new models, but rather different training settings applied uniformly within the same experimental context. Only the training technique is changed, so any observed differences can be attributed purely to training behavior rather than experimental artifacts.
Variant A represents the baseline configuration utilized in the primary benchmarking trials. It is purposely conservative, focusing on stability and reproducibility as a point of comparison.
Variant B introduces a small relaxation in training dynamics while maintaining the same objective and data exposure. This variant tests whether modestly stronger optimization consistently yields performance gains when domain semantics are already captured.
Variant C significantly relaxes training constraints, resulting in a more permissive learning environment. This variant investigates whether reduced constraint results in richer signal extraction or instead increases instability, particularly in domains with limited semantic grounding.