Mehmet Baris Yaman
Foundational Technology, Siemens A.S¸., Istanbul, Turkey
Abstract. The proliferation of Internet of Things (IoT) devices has created unprecedented security challenges, with traditional intrusion detection systems struggling to achieve the accuracy needed for operational deployment. This comprehensive study investigates three advanced Graph Neural Network (GNN) mechanisms for network intrusion detection, addressing the fundamental limitations of standard edge-level classification approaches. We present Prototype-GNN, which leverages distance-based classification with learnable prototype embeddings; Contrastive-GNN, which optimizes embedding geometry through supervised contrastive learning; and GSL-GNN, which adaptively learns optimal graph structure from node features. Through extensive experimentation on the TON-IoT dataset containing 1 million network connections, we demonstrate that these mechanisms achieve 94.24%, 94.71%, and 96.66% accuracy respectively, representing substantial improvements of +2.37, +2.84, and +4.79 percentage points over the baseline EdgeLevelGCN (91.87%). Our best-performing GSL-GNN architecture achieves 99.70% ROC-AUC with an exceptionally low 1.5% false positive rate, addressing the critical challenge of alert fatigue in security operations. This journal article extends our preliminary conference study [1] by providing comprehensive ablation studies, additional architectural variants, and deeper theoretical analysis.
Keywords: Graph Neural Networks · Network Intrusion Detection · IoT Security · Prototype Learning · Contrastive Learning · Graph Structure Learning · Deep Learning
1 Introduction
The Internet of Things revolution has created security challenges of fundamentally different character than traditional network efense. Modern IoT ecosystems consist of billions of diverse devices—ranging from industrial sensors and medical implants to smart home controllers and infrastructure monitors—all communicating via a wide variety of protocols with differing security characteristics. This massive scale and extreme heterogeneity overwhelm conventional security approaches. Individual devices lack the computational resources for sophisticated local protection, yet centralized monitoring systems struggle with the volume and diversity of network traffic. Network intrusion detection systems have emerged as essential defensive infrastructure, analyzing traffic patterns to identify malicious activities without requiring on-device computation. However, operational deployment demands extraordinary capabilities: systems must process enormous traffic volumes generated by diverse applications, distinguish sophisticated attack patterns from legitimate behaviors in heterogeneous environments, maintain minimal false alarm rates to preserve the effectiveness of human analysts, deliver real-time verdicts enabling automated threat response, and adapt continuously to evolving adversarial tactics. Satisfying this full set of requirements at the same time poses a fundamental challenge that existing approaches have difficulty addressing, thereby motivating the exploration of new detection architectures.
Statistical learning has revolutionized intrusion detection by enabling automated discovery of malicious patterns from network traffic data. Contemporary machine learning approaches—including ensemble tree methods, maximummargin classifiers, and proximity-based techniques—operate by learning decision boundaries in feature spaces constructed from network flow statistics. Security practitioners employ these methods by engineering numerical descriptors that capture traffic characteristics: aggregate volume metrics, protocol usage patterns, connection duration distributions, and port access frequencies. When trained on labeled examples, these classifiers successfully identify deviations from normal traffic behavior as reflected in statistical properties of individual flows. The fundamental architectural constraint lies in the independence assumption: each connection is treated as a standalone observation, with classifiers unable to reason about relationships between communicating entities or analyze coordinated activities spanning multiple network hops. A network is fundamentally a graph where devices (nodes) interact through connections (edges), and attacks often exhibit graph-level patterns—such as coordinated botnets, lateral movement, and hierarchical command-and-control structures—that cannot be captured by flat feature vectors [2].
Deep neural architectures have transformed intrusion detection by introducing the capability for autonomous hierarchical abstraction learning from unprocessed network observations. This eliminates the traditional requirement for security domain experts to manually engineer discriminative features, as the system automatically discovers multi-level representations. Convolutional frameworks process packet-level information by casting network flows into formats resembling sequential time-series or structured image-like data. Recurrent networks and LSTM mechanisms identify dependencies that propagate through temporal sequences of network activity. Autoencoders learn compressed representations for anomaly detection through reconstruction error [3]. The selflearning nature of these systems enables automatic discovery of temporal progression patterns, rotocol behavioral signatures, and payload-embedded characteristics without explicit specification. Nevertheless, a critical rchitectural constraint exists: these neural paradigms were fundamentally designed for data with regular geometric structure—pixel grids for images, ordered positions for sequences—creating a mismatch with the variable, interconnected topology of network communications. While CNNs excel at local feature extraction through convolutional filters and RNNs model sequential dependencies, neither naturally handles the irregular, non-Euclidean structure of network graphs where nodes have varying degrees and connections exhibit complex patterns [4].
Graph Neural Networks (GNNs) represent a breakthrough for learning on graph-structured data by generalizing neural networks to non-Euclidean domains [5]. The core innovation is message passing: each node aggregates information from its neighbors through learnable transformations, enabling the network to capture both node features and relational structure. Graph Convolutional Networks (GCNs) apply spectral graph theory to define convolutions on graphs [6], while Graph Attention Networks (GATs) learn importance weights for different neighbors [7]. GraphSAGE introduces sampling-based aggregation for scalability [11]. GNNs have achieved state-of-the-art results across diverse domains: social network analysis (fraud detection, recommendation), molecular chemistry (drug discovery, protein folding), knowledge graphs (link prediction, reasoning), and traffic forecasting [9]. For network intrusion detection, GNNs naturally model the network graph where devices are nodes and connections are edges. By propagating information through the topology, GNNs capture complex attack patterns such as distributed attacks, multi-hop propagation, and structural anomalies that flat ML models miss [10]. But substantial room remains for architectural innovations.
The theoretical understanding of edge-level GNN expressiveness has deepened considerably. [18] proved that edge-level message passing can distinguish graph structures that node-level methods cannot, establishing theoretical superiority for tasks where edge properties are critical. [19] analyzed the WeisfeilerLeman expressiveness of edge-level GNNs, showing they achieve higher-order graph isomorphism testing compared to standard GNNs. [20] provided generalization bounds for edge classification, proving that edge-augmented architectures achieve better sample complexity when edge features carry discriminative information—precisely the case in network intrusion detection.
This paper introduces three GNN architectures that substantially advance intrusion detection accuracy through distinct mechanisms:
We provide comprehensive experimental validation on 1 million network connections, architectural trade-off analysis, and show generalization potential to other edge classification domains. The rest of this paper is organized as follows: Sect. 3 presents our methodology including the three architectures, Sect. 4 describes experimental setup and results, Sect. 7 discusses findings and implications, and Sect. 8 concludes with future directions.
2 GNN Classification
2.1 Graph Level Classification
Traditional GNN approaches employ graph level classification, where the entire network snapshot receives a single label (malicious or normal) [12]:
where h(L)i are final node embeddings and READOUT(·) aggregates them (e.g., mean/max pooling). However, this approach suffers from the aggregation bottleneck: all device-specific and connection-specific information is compressed into a single graph-level representation, losing fine-grained attack signatures and making it impossible to identify which specific devices or connections are compromised.
2.2 Node Level Classification
An intermediate approach is node level classification, where individual devices (nodes) are classified as compromised or benign:
where h(L)i is the node embedding after L GNN layers and xi are node features (device characteristics). This approach offers better granularity than graph-level classification by identifying which specific devices exhibit malicious behavior. It is particularly suitable for detecting compromised IoT devices, botnets, or infected hosts. However, node-level classification still loses connection-specific information: a device may have both legitimate and malicious connections, but node-level methods cannot distinguish between them. Additionally, many attacks manifest primarily in connection patterns such as port scanning, data theft rather than device-level features, making node classification insufficient for comprehensive intrusion detection.
2.3 Edge Level Classification
The most detailed approach is edge level classification, which directly classifies individual connections:
where hi , hj are node embeddings capturing graph context and xij are edge features (connection statistics). This paradigm is fundamentally more suitable for intrusion detection because:
– Fine-grained predictions: Identifies which specific connections are malicious
– No information bottleneck: Attack signatures preserved in edge representations
– Connection-level patterns: Captures attacks manifested in traffic characteristics (bytes, packets, protocols, ports)
– Scalability: Computational cost linear in number of edges, not entire graph
– Real-time applicability: Can classify new connections as they arrive
Our work advances edge-level GNN architectures through three mechanisms explained in the remainder of this section.
3 Methodology
3.1 Problem Formulation
Let G = (V, E, XV , XE) represent a network traffic graph where:
– V = {v1, v2, . . . , vn} is the set of IP address nodes.
– E = {eij | vi , vj ∈ V } is the set of connections (edges).
– XV ∈ Rn×dv contains node features (e.g., IP metadata, behavioral statistics).
– XE ∈ R|E|×de contains edge features (e.g., bytes, packets, duration, protocols).
For network intrusion detection, we define the edge classification task as:
where yij = 0 indicates normal traffic and yij = 1 indicates an attack connection
3.2 Baseline: EdgeLevelGCN
We establish a strong baseline using standard Graph Convolutional Networks adapted for edge classification. Figure 1 shows the baseline EdgeLevelGCN architecture. The baseline EdgeLevelGCN represents a standard graph convolutional approach for edge-level classification in network intrusion detection. Node features propagate through three GCN layers with normalization and dropout regularization. Edge representations are constructed via concatenation of endpoint node embeddings and intrinsic edge features, followed by MLP-based classification.
Node Embedding Each node i is initialized with features xi ∈ Rd0 (device characteristics). GCN layers propagate information:
where N (i) are neighbors, di is degree, W(l) are learnable weights, and σ is ReLU activation. This computes:
where A˜ = A + I (adjacency with self-loops) and D˜ is degree matrix.
Edge Classification For each edge (i, j), we concatenate node embeddings with edge features:
where ∥ denotes concatenation and xij ∈ Rde are edge features (bytes, packets, duration, ports). A 3-layer MLP classifier predicts:
Training We use Focal Loss to handle class imbalance:
where pt is predicted probability, αt balances classes, and γ = 2 focuses on hard examples.
Computational Complexity GCN Layers: Each of L = 3 layers computes H(l+1) = σ(D˜ − 1/2 A˜D˜ − 1/2 H(l)W(l)). The sparse matrix multiplication AH˜ (l) costs O(|E| · d) where |E| is the number of edges and d is the hidden dimension. The dense transformation H(l)W(l) costs O(|V |·d2). Combined: O(L·(|E|·d+|V |·d2)).
Edge Classification: For |E| edges, concatenation and MLP forward pass
cost O(|E| · d2).
Total Complexity: O(L· |E| · d + L· |V | · d2 + |E| · d2). For typical network graphs where |E| ≫ |V | and L is small constant, this simplifies to O(|E| · d2).
3.3 Prototype-GNN: Distance-Based Classification
Prototype-GNN addresses the limitation that a single decision hyperplane cannot capture diverse attack patterns. Instead, we learn multiple prototypes—
representative embeddings for attack and normal patterns—and classify based
on distance. Figure 2 presents the Prototype-GNN architecture. Prototype-GNN
Prototype-GNN addresses the limitation that a single decision hyperplane cannot capture diverse attack patterns. Instead, we learn multiple prototypes—
representative embeddings for attack and normal patterns—and classify based
on distance. Figure 2 presents the Prototype-GNN architecture. Prototype-GNN
Architecture Node Embedding: Same as baseline (3 GCN layers), producing
final node embeddings h
(L)
i ∈ R
d where L = 3.
Edge Encoding: For each edge (i, j), concatenate node embeddings with
edge features:
Prototype Layer: We learn K prototypes per class (attack/normal):
where each pk ∈ R
2d+de
is a learnable parameter initialized via K-means
Clustering.
Distance Computation: For edge eij , compute distances to all prototypes:
Classification: Predict class based on minimum distance:
For training, convert distances to probabilities via Softmax:
Loss Function We employ a triple loss combining classification, cluster compactness, and class separation:
Classification Loss: Cross-entropy on distance-based probabilities:
Cluster Compactness: Encourages prototypes of same class to be diverse:
Class Separation: Pushes attack and normal prototypes apart:
with λ1 = 0.05, λ2 = 0.05, ϵ = 0.1 for numerical stability
Computational Complexity GCN Encoding: Same as baseline, O(L · |E| ·
d + L · |V | · d
2
) where L = 3.
Distance Computation: For each of |E| edges, compute distances to 2K
prototypes (K attack + K normal). Each distance computation ∥eij −pk∥
2
2
costs
O(demb) where demb = 2d + de is the edge embedding dimension. Total: O(|E| ·
K · demb).
Softmax Classification: Converting distances to probabilities costs O(|E|·
K).
Total Complexity: O(L · |E| · d + L · |V | · d
2 + |E| · K · d). Since K = 8 is a
small constant, this simplifies to O(|E| · d
2
), matching the baseline asymptotic
complexity with minimal overhead.
Interpretability Unlike blackbox classifiers, Prototype-GNN provides interpretability: each prototype represents a learned attack/normal pattern. Analysts
can inspect which prototype triggered an alert and visualize similar historical
connections. This aids in understanding attack variants and reducing false positives.
3.4 Contrastive-GNN: Optimizing Embedding Geometry
Standard cross-entropy loss only encourages correct predictions but does not
structure the embedding space. Contrastive-GNN explicitly optimizes geometry:
pulling same-class edges together while pushing different-class edges apart. Figure 3 presents the Contrastive-GNN architecture with a dual-head design. The
dual-head architecture optimizing both classification accuracy and embedding
geometry simultaneously. Following shared GCN encoding, the architecture bifurcates into parallel classification and projection heads. The projection head
generates L2-normalized embeddings for supervised contrastive learning, which explicitly maximizes inter-class separation and intra-class compactness through
temperature-scaled similarity metrics.
Architecture Node Embedding: Same as baseline (3 GCN layers), producing
final node embeddings h
(L)
i ∈ R
d where L = 3.
Edge Encoding: For each edge (i, j), concatenate node embeddings with
edge features:
Dual-Head Design:
– Classification Head: 3-layer MLP predicting attack/normal from eij
– Projection Head: 3-layer MLP mapping eij to 128-d normalized space
Supervised Contrastive Loss For each edge (i, j) in a batch, define:
– P(i, j): Positive set (edges with same label)
– A(i, j): All other edges in batch
The supervised contrastive loss:
where τ = 0.07 is temperature controlling concentration. This maximizes
similarity to same-class edges while minimizing similarity to different-class edges.
Combined Loss
with λ = 0.5 balancing classification and contrastive objectives.
Computational Complexity GCN Encoding: Same as baseline, O(L · |E| ·
d + L · |V | · d
2
).
Dual Heads: Both classification and projection heads process |E| edge embeddings through 3-layer MLPs, costing O(|E| · d
2
) each.
Contrastive Loss: For a batch of B edges, computing pairwise similarities
zij · zp requires O(B2
· dproj ) where dproj = 128 is the projection dimension.
Across all |E|/B batches: O(|E| · B · dproj ).
Total Complexity: O(L· |E| · d+L· |V | · d
2 +|E| · d
2 +|E| ·B · dproj ). With
typical batch size B ≪ |E|, this simplifies to O(|E| · d
2
) asymptotically, though
the contrastive term adds O(B · dproj ) overhead per edge in practice.
3.5 GSL-GNN: Adaptive Graph Structure Learning
The baseline uses the physical network topology for message passing. However,
the optimal structure for classification may differ—devices may share behavioral
patterns without direct connections (e.g., botnet members). GSL-GNN learns
this structure adaptively. Figure 4 illustrates the GSL-GNN architecture. GSLGNN addresses the limitation of fixed physical network topology through adaptive graph structure learning. The architecture comprises dual parallel paths: one
operating on the original adjacency matrix and another on a learned adjacency
matrix derived via bilinear attention mechanisms. Top-k sparsification ensures
computational tractability while preserving salient connections. Node embeddings from both paths undergo weighted fusion 0.5 before edge classification.
This adaptive topology learning captures latent behavioral patterns orthogonal
to physical connectivity,
Structure Learner Given node features H(0) = X ∈ R
n×d
, learn adjacency
matrix:
where Wattn ∈ R
d×d
learns feature interactions. This produces dense Alearned ∈
R
n×n.
Sparsification: To avoid O(n
2
) computation, keep only top- k neighbors per
node:
Numerical Stability: Critical implementation detail:
to prevent NaN from extreme values.
Dual-Path GCN Process features using both learned and original topologies:
Fuse via weighted combination:
with α = 0.5 balancing learned and physical structure.
Edge Classification Same as baseline: concatenate node embeddings with edge
features, feed to MLP.
End-to-End Training: The structure learner is differentiable, enabling joint
optimization:
where Lreg = ∥Alearned∥
2
F prevents trivial solutions.
Computational Complexity Structure Learning (Naive): Computing bilinear attention Alearned[i, j] = σ(h
T
i Wattnhj ) for all |V |
2 node pairs costs O(|V |
2
·
d
2
). This is prohibitive for large graphs.
Sparsification: Keeping only top-k neighbors per node via TopK(Alearned[i, :
], k = 15) reduces the effective adjacency to O(|V | · k) edges. This requires
O(|V |
2
log k) for k-selection across all nodes, but is computed once per forward
pass.
Dual-Path GCN: Processing both learned and original graphs costs 2·O(L·
|E| · d + L · |V | · d
2
) where the learned graph has O(|V | · k) edges. The fusion
operation αHlearned + (1 − α)Horiginal is O(|V | · d).
Edge Classification: Same as baseline, O(|E| · d
2
).
Total Complexity: O(|V |
2
·d
2+|V |
2
log k+L·|V |·k·d+L·|V |·d
2+|E|·d
2
).
The dominant term is structure learning: O(|V|
2
· d
2
) without approximations.
However, with top-k sparsification reducing effective computation, practical complexity approaches O(|V |·k ·d
2 +|E|·d
2
), achieving the noted 26× speedup over
naive O(|V |
2
) adjacency.
4 Experimental Setup
4.1 Dataset
In this paper, we use the TON IoT Network Intrusion Detection Dataset [13–17]
as it contains raw network flow data from IoT devices.
Given the severe class imbalance in the original dataset (96.4% attacks, 3.6%
normal), we create a balanced subset through stratified sampling, maintaining
temporal ordering to preserve realistic traffic patterns.
4.2 Temporal Graph Construction
Network connections are organized into 200 temporal graph snapshots:
– Snapshot size: 3,000 connections each
– Nodes: Unique IP addresses (avg 346 per snapshot)
– Edges: Directed connections between IPs
– Node features (12-dim): In/out degree, bytes, packets, avg connection duration, protocol distribution, port entropy
– Edge features (10-dim): Bytes sent/received, packet count, duration, protocol type (TCP/UDP/ICMP), source/destination ports, flags
Feature Engineering Node Features (per IP address):
Edge Features (per connection):
Normalization All features normalized to [0, 1] via min-max scaling:
computed on training set, applied to validation/test
Graph Construction Adjacency matrix A ∈ {0, 1}
n×n where A[i, j] = 1 if
connection exists from IP i to j.
4.3 Data Preprocessing
We create 200 temporal graph snapshots (3000 connections each) and split:
– Training: 140 snapshots (70%, 420K edges)
– Validation: 30 snapshots (15%, 90K edges)
– Test: 30 snapshots (15%, 90K edges)
Each snapshot preserves the original temporal ordering and maintains a balanced class distribution, with an attack ratio ranging from 40% to 60%, allowing for controlled variation.
5 Results
Table 1 presents a comprehensive comparison of all four models on the test set. Figure 5 presents a radar chart to visually compare the model results. Figure 6 shows how the proposed architectures improved accuracy relative to the baseline EdgeLevelGCN.
1.All architectures substantially outperform baseline:
Table 1. Model Performance on Test Set
![]()
![]()
Figure 5. Radar Chart for Algorithms
– Prototype-GNN: +2.37 pp (91.87% → 94.24%)
– Contrastive-GNN: +2.84 pp (91.87% → 94.71%)
– GSL-GNN: +4.79 pp (91.87% → 96.66%)
2. GSL-GNN achieves near-perfect discrimination: 99.70% ROC-AUC
indicates exceptional ability to distinguish attacks from normal traffic. Only
3,002 total errors out of 90,000 connections.
3. Trade-off between accuracy and efficiency:
– Best accuracy: GSL-GNN (96.66%, 35 min training)
– Best balance: Contrastive-GNN (94.71%, 18 min training)
– Most efficient: Prototype-GNN (94.24%, 12 min training)
4. Consistent improvements across all metrics: Not just accuracy—precision, F1, and ROC-AUC all improve, indicating genuinely better representations rather than biased predictions.![]()
Figure 6. Architecture Improvements
Table 2. Confusion Matrices (TN, FP, FN, TP)