IJNSA 03

Advanced Graph Neural Network Mechanisms for IoT Network Intrusion Detection: A Comprehensive Study of Prototype-Based, Contrastive, and Structure Learning Approaches

Mehmet Baris Yaman

Foundational Technology, Siemens A.S¸., Istanbul, Turkey

Abstract. The proliferation of Internet of Things (IoT) devices has created unprecedented security challenges, with traditional intrusion detection systems struggling to achieve the accuracy needed for operational deployment. This comprehensive study investigates three advanced Graph Neural Network (GNN) mechanisms for network intrusion detection, addressing the fundamental limitations of standard edge-level classification approaches. We present Prototype-GNN, which leverages distance-based classification with learnable prototype embeddings; Contrastive-GNN, which optimizes embedding geometry through supervised contrastive learning; and GSL-GNN, which adaptively learns optimal graph structure from node features. Through extensive experimentation on the TON-IoT dataset containing 1 million network connections, we demonstrate that these mechanisms achieve 94.24%, 94.71%, and 96.66% accuracy respectively, representing substantial improvements of +2.37, +2.84, and +4.79 percentage points over the baseline EdgeLevelGCN (91.87%). Our best-performing GSL-GNN architecture achieves 99.70% ROC-AUC with an exceptionally low 1.5% false positive rate, addressing the critical challenge of alert fatigue in security operations. This journal article extends our preliminary conference study [1] by providing comprehensive ablation studies, additional architectural variants, and deeper theoretical analysis.

Keywords: Graph Neural Networks · Network Intrusion Detection · IoT Security · Prototype Learning · Contrastive Learning · Graph Structure Learning · Deep Learning

1 Introduction

The Internet of Things revolution has created security challenges of fundamentally different character than traditional network efense. Modern IoT ecosystems consist of billions of diverse devices—ranging from industrial sensors and medical implants to smart home controllers and infrastructure monitors—all communicating via a wide variety of protocols with differing security characteristics. This massive scale and extreme heterogeneity overwhelm conventional security approaches. Individual devices lack the computational resources for sophisticated local protection, yet centralized monitoring systems struggle with the volume and diversity of network traffic. Network intrusion detection systems have emerged as essential defensive infrastructure, analyzing traffic patterns to identify malicious activities without requiring on-device computation. However, operational deployment demands extraordinary capabilities: systems must process enormous traffic volumes generated by diverse applications, distinguish sophisticated attack patterns from legitimate behaviors in heterogeneous environments, maintain minimal false alarm rates to preserve the effectiveness of human analysts, deliver real-time verdicts enabling automated threat response, and adapt continuously to evolving adversarial tactics. Satisfying this full set of requirements at the same time poses a fundamental challenge that existing approaches have difficulty addressing, thereby motivating the exploration of new detection architectures.

Statistical learning has revolutionized intrusion detection by enabling automated discovery of malicious patterns from network traffic data. Contemporary machine learning approaches—including ensemble tree methods, maximummargin classifiers, and proximity-based techniques—operate by learning decision boundaries in feature spaces constructed from network flow statistics. Security practitioners employ these methods by engineering numerical descriptors that capture traffic characteristics: aggregate volume metrics, protocol usage patterns, connection duration distributions, and port access frequencies. When trained on labeled examples, these classifiers successfully identify deviations from normal traffic behavior as reflected in statistical properties of individual flows. The fundamental architectural constraint lies in the independence assumption: each connection is treated as a standalone observation, with classifiers unable to reason about relationships between communicating entities or analyze coordinated activities spanning multiple network hops. A network is fundamentally a graph where devices (nodes) interact through connections (edges), and attacks often exhibit graph-level patterns—such as coordinated botnets, lateral movement, and hierarchical command-and-control structures—that cannot be captured by flat feature vectors [2].

Deep neural architectures have transformed intrusion detection by introducing the capability for autonomous hierarchical abstraction learning from unprocessed network observations. This eliminates the traditional requirement for security domain experts to manually engineer discriminative features, as the system automatically discovers multi-level representations. Convolutional frameworks process packet-level information by casting network flows into formats resembling sequential time-series or structured image-like data. Recurrent networks and LSTM mechanisms identify dependencies that propagate through temporal sequences of network activity. Autoencoders learn compressed representations for anomaly detection through reconstruction error [3]. The selflearning nature of these systems enables automatic discovery of temporal progression patterns, rotocol behavioral signatures, and payload-embedded characteristics without explicit specification. Nevertheless, a critical rchitectural constraint exists: these neural paradigms were fundamentally designed for data with regular geometric structure—pixel grids for images, ordered positions for sequences—creating a mismatch with the variable, interconnected topology of network communications. While CNNs excel at local feature extraction through convolutional filters and RNNs model sequential dependencies, neither naturally handles the irregular, non-Euclidean structure of network graphs where nodes have varying degrees and connections exhibit complex patterns [4].

Graph Neural Networks (GNNs) represent a breakthrough for learning on graph-structured data by generalizing neural networks to non-Euclidean domains [5]. The core innovation is message passing: each node aggregates information from its neighbors through learnable transformations, enabling the network to capture both node features and relational structure. Graph Convolutional Networks (GCNs) apply spectral graph theory to define convolutions on graphs [6], while Graph Attention Networks (GATs) learn importance weights for different neighbors [7]. GraphSAGE introduces sampling-based aggregation for scalability [11]. GNNs have achieved state-of-the-art results across diverse domains: social network analysis (fraud detection, recommendation), molecular chemistry (drug discovery, protein folding), knowledge graphs (link prediction, reasoning), and traffic forecasting [9]. For network intrusion detection, GNNs naturally model the network graph where devices are nodes and connections are edges. By propagating information through the topology, GNNs capture complex attack patterns such as distributed attacks, multi-hop propagation, and structural anomalies that flat ML models miss [10]. But substantial room remains for architectural innovations.

The theoretical understanding of edge-level GNN expressiveness has deepened considerably. [18] proved that edge-level message passing can distinguish graph structures that node-level methods cannot, establishing theoretical superiority for tasks where edge properties are critical. [19] analyzed the WeisfeilerLeman expressiveness of edge-level GNNs, showing they achieve higher-order graph isomorphism testing compared to standard GNNs. [20] provided generalization bounds for edge classification, proving that edge-augmented architectures achieve better sample complexity when edge features carry discriminative information—precisely the case in network intrusion detection.

This paper introduces three GNN architectures that substantially advance intrusion detection accuracy through distinct mechanisms:

Prototype-GNN: Learns key example patterns and classifies new connections by finding which example they match best
Contrastive-GNN: Organizes the model’s internal space by pushing attacks and normal traffic into different regions
GSL-GNN: Discovers the optimal way to connect devices for sharing information during classification

We provide comprehensive experimental validation on 1 million network connections, architectural trade-off analysis, and show generalization potential to other edge classification domains. The rest of this paper is organized as follows: Sect. 3 presents our methodology including the three architectures, Sect. 4 describes experimental setup and results, Sect. 7 discusses findings and implications, and Sect. 8 concludes with future directions.

2 GNN Classification

2.1 Graph Level Classification

Traditional GNN approaches employ graph level classification, where the entire network snapshot receives a single label (malicious or normal) [12]:

where h^(L)_i are final node embeddings and READOUT(·) aggregates them (e.g., mean/max pooling). However, this approach suffers from the aggregation bottleneck: all device-specific and connection-specific information is compressed into a single graph-level representation, losing fine-grained attack signatures and making it impossible to identify which specific devices or connections are compromised.

2.2 Node Level Classification

An intermediate approach is node level classification, where individual devices (nodes) are classified as compromised or benign:

where h^(L)_i is the node embedding after L GNN layers and x_i are node features (device characteristics). This approach offers better granularity than graph-level classification by identifying which specific devices exhibit malicious behavior. It is particularly suitable for detecting compromised IoT devices, botnets, or infected hosts. However, node-level classification still loses connection-specific information: a device may have both legitimate and malicious connections, but node-level methods cannot distinguish between them. Additionally, many attacks manifest primarily in connection patterns such as port scanning, data theft rather than device-level features, making node classification insufficient for comprehensive intrusion detection.

2.3 Edge Level Classification

The most detailed approach is edge level classification, which directly classifies individual connections:

where h_i , h_j are node embeddings capturing graph context and x_ij are edge features (connection statistics). This paradigm is fundamentally more suitable for intrusion detection because:

– Fine-grained predictions: Identifies which specific connections are malicious
– No information bottleneck: Attack signatures preserved in edge representations
– Connection-level patterns: Captures attacks manifested in traffic characteristics (bytes, packets, protocols, ports)
– Scalability: Computational cost linear in number of edges, not entire graph
– Real-time applicability: Can classify new connections as they arrive

Our work advances edge-level GNN architectures through three mechanisms explained in the remainder of this section.

3 Methodology

3.1 Problem Formulation

Let G = (V, E, X_V , X_E) represent a network traffic graph where:

– V = {v₁, v₂, . . . , v_n} is the set of IP address nodes.
– E = {e_ij | v_i , v_j ∈ V } is the set of connections (edges).
– X_V ∈ R^n×d_v contains node features (e.g., IP metadata, behavioral statistics).
– X_E ∈ R^|E|×d_e contains edge features (e.g., bytes, packets, duration, protocols).

For network intrusion detection, we define the edge classification task as:

where y_ij = 0 indicates normal traffic and y_ij = 1 indicates an attack connection

3.2 Baseline: EdgeLevelGCN

We establish a strong baseline using standard Graph Convolutional Networks adapted for edge classification. Figure 1 shows the baseline EdgeLevelGCN architecture. The baseline EdgeLevelGCN represents a standard graph convolutional approach for edge-level classification in network intrusion detection. Node features propagate through three GCN layers with normalization and dropout regularization. Edge representations are constructed via concatenation of endpoint node embeddings and intrinsic edge features, followed by MLP-based classification.

Node Embedding Each node i is initialized with features x_i ∈ R^d₀ (device characteristics). GCN layers propagate information:

where N (i) are neighbors, d_i is degree, W^(l) are learnable weights, and σ is ReLU activation. This computes:

where A^˜ = A + I (adjacency with self-loops) and D^˜ is degree matrix.

Edge Classification For each edge (i, j), we concatenate node embeddings with edge features:

where ∥ denotes concatenation and x_ij ∈ R^de are edge features (bytes, packets, duration, ports). A 3-layer MLP classifier predicts:

Training We use Focal Loss to handle class imbalance:

where p_t is predicted probability, α_t balances classes, and γ = 2 focuses on hard examples.

Computational Complexity GCN Layers: Each of L = 3 layers computes H^(l+1) = σ(D^˜ ^{− 1/2} A^˜D^{˜ − 1/2} H^(l)W^(l)). The sparse matrix multiplication AH^{˜ (l)} costs O(|E| · d) where |E| is the number of edges and d is the hidden dimension. The dense transformation H^(l)W^(l) costs O(|V |·d²). Combined: O(L·(|E|·d+|V |·d²)).

Edge Classification: For |E| edges, concatenation and MLP forward pass
cost O(|E| · d²).

Total Complexity: O(L· |E| · d + L· |V | · d² + |E| · d²). For typical network graphs where |E| ≫ |V | and L is small constant, this simplifies to O(|E| · d²).

3.3 Prototype-GNN: Distance-Based Classification

Prototype-GNN addresses the limitation that a single decision hyperplane cannot capture diverse attack patterns. Instead, we learn multiple prototypes— representative embeddings for attack and normal patterns—and classify based on distance. Figure 2 presents the Prototype-GNN architecture.

Figure 1. EdgeLevelGCN Baseline Architecture

Prototype-GNN introduces learnable prototype-based classification using distance metrics in the embedding space. The architecture maintains 16 trainable prototypes (8 per class) initialized through k-means clustering and refined end-to-end. Classification relies on L2 distance computation between edge embeddings and prototypes, with predictions derived from minimum distance assignment. The composite loss function incorporates cross-entropy, intra-class clustering, and inter-class separation objectives.

Architecture Node Embedding: Same as baseline (3 GCN layers), producing final node embeddings h^(L)_i ∈ R^d where L = 3.
Edge Encoding: For each edge (i, j), concatenate node embeddings with edge features:

Prototype Layer: We learn K prototypes per class (attack/normal):

where each p_k ∈ R^2d+d_e is a learnable parameter initialized via K-means Clustering.

Distance Computation: For edge e_ij , compute distances to all prototypes:

Classification: Predict class based on minimum distance:

For training, convert distances to probabilities via Softmax:

Loss Function We employ a triple loss combining classification, cluster compactness, and class separation:

Classification Loss: Cross-entropy on distance-based probabilities:

Cluster Compactness: Encourages prototypes of same class to be diverse:

Class Separation: Pushes attack and normal prototypes apart:

with λ1 = 0.05, λ2 = 0.05, ϵ = 0.1 for numerical stability

Computational Complexity GCN Encoding: Same as baseline, O(L · |E| ·d + L · |V | · d²) where L = 3.
Distance Computation: For each of |E| edges, compute distances to 2K prototypes (K attack + K normal). Each distance computation ∥eij −pk∥₂² costs O(d_emb) where d_emb = 2d + d_e is the edge embedding dimension. Total: O(|E| ·K · d_emb).
Softmax Classification: Converting distances to probabilities costs O(|E|·K).
Total Complexity: O(L · |E| · d + L · |V | · d² + |E| · K · d). Since K = 8 is a small constant, this simplifies to O(|E| · d²), matching the baseline asymptotic complexity with minimal overhead.

Figure 2. Prototype-GNN Architecture

Interpretability Unlike blackbox classifiers, Prototype-GNN provides interpretability: each prototype represents a learned attack/normal pattern. Analysts can inspect which prototype triggered an alert and visualize similar historical connections. This aids in understanding attack variants and reducing false positives.

3.4 Contrastive-GNN: Optimizing Embedding Geometry

Standard cross-entropy loss only encourages correct predictions but does not structure the embedding space. Contrastive-GNN explicitly optimizes geometry: pulling same-class edges together while pushing different-class edges apart. Figure 3 presents the Contrastive-GNN architecture with a dual-head design. The dual-head architecture optimizing both classification accuracy and embedding geometry simultaneously. Following shared GCN encoding, the architecture bifurcates into parallel classification and projection heads. The projection head generates L2-normalized embeddings for supervised contrastive learning, which explicitly maximizes inter-class separation and intra-class compactness through temperature-scaled similarity metrics.

Architecture Node Embedding: Same as baseline (3 GCN layers), producing final node embeddings h^(L)_i ∈ R^d where L = 3.

Edge Encoding: For each edge (i, j), concatenate node embeddings with edge features:

Dual-Head Design:
– Classification Head: 3-layer MLP predicting attack/normal from e_ij
– Projection Head: 3-layer MLP mapping e_ij to 128-d normalized space

Supervised Contrastive Loss For each edge (i, j) in a batch, define:

– P(i, j): Positive set (edges with same label)
– A(i, j): All other edges in batch

The supervised contrastive loss:

where τ = 0.07 is temperature controlling concentration. This maximizes similarity to same-class edges while minimizing similarity to different-class edges.

Combined Loss

with λ = 0.5 balancing classification and contrastive objectives.

Computational Complexity GCN Encoding: Same as baseline, O(L · |E| ·d + L · |V | · d²).
Dual Heads: Both classification and projection heads process |E| edge embeddings through 3-layer MLPs, costing O(|E| · d²) each.
Contrastive Loss: For a batch of B edges, computing pairwise similarities z_ij · z_p requires O(B²· d_proj ) where d_proj = 128 is the projection dimension. Across all |E|/B batches: O(|E| · B · dproj ).
Total Complexity: O(L· |E| · d+L· |V | · d² +|E| · d² +|E| ·B · d_proj ). With typical batch size B ≪ |E|, this simplifies to O(|E| · d²) asymptotically, though the contrastive term adds O(B · d_proj ) overhead per edge in practice.

Figure 3. Contrastive-GNN Architecture

3.5 GSL-GNN: Adaptive Graph Structure Learning

The baseline uses the physical network topology for message passing. However, the optimal structure for classification may differ—devices may share behavioral patterns without direct connections (e.g., botnet members). GSL-GNN learns this structure adaptively. Figure 4 illustrates the GSL-GNN architecture. GSLGNN addresses the limitation of fixed physical network topology through adaptive graph structure learning. The architecture comprises dual parallel paths: one operating on the original adjacency matrix and another on a learned adjacency matrix derived via bilinear attention mechanisms. Top-k sparsification ensures computational tractability while preserving salient connections. Node embeddings from both paths undergo weighted fusion 0.5 before edge classification. This adaptive topology learning captures latent behavioral patterns orthogona to physical connectivity,

Structure Learner Given node features H⁽⁰⁾ = X ∈ R^n×d, learn adjacency matrix:

where W_attn ∈ R^d×d learns feature interactions. This produces dense A_learned∈ R^n×n.
Sparsification: To avoid O(n²) computation, keep only top- k neighbors per node:

Numerical Stability: Critical implementation detail:

to prevent NaN from extreme values.

Dual-Path GCN Process features using both learned and original topologies:

Fuse via weighted combination:

with α = 0.5 balancing learned and physical structure.

Edge Classification Same as baseline: concatenate node embeddings with edge features, feed to MLP.
End-to-End Training: The structure learner is differentiable, enabling joint optimization:

where L_reg = ∥A_learned∥_F² prevents trivial solutions.

Computational Complexity Structure Learning (Naive): Computing bilinear attention A_learned[i, j] = σ(h^T_i W_attnh_j ) for all |V |² node pairs costs O(|V |²·d²). This is prohibitive for large graphs.
Sparsification: Keeping only top-k neighbors per node via TopK(A_learned[i, :], k = 15) reduces the effective adjacency to O(|V | · k) edges. This requires O(|V |²log k) for k-selection across all nodes, but is computed once per forward pass.
Dual-Path GCN: Processing both learned and original graphs costs 2·O(L·|E| · d + L · |V | · d²) where the learned graph has O(|V | · k) edges. The fusion operation αH_learned + (1 − α)H_original is O(|V | · d).

Figure 4. GSL-GNN Architecture

Edge Classification: Same as baseline, O(|E| · d²).
Total Complexity: O(|V |²·d²+|V |²log k+L·|V |·k·d+L·|V |·d²+|E|·d²). The dominant term is structure learning: O(|V|²· d²) without approximations. However, with top-k sparsification reducing effective computation, practical complexity approaches O(|V |·k·d²+|E|·d²), achieving the noted 26× speedup over naive O(|V |²) adjacency.

4 Experimental Setup

4.1 Dataset

In this paper, we use the TON IoT Network Intrusion Detection Dataset [13–17] as it contains raw network flow data from IoT devices.
Given the severe class imbalance in the original dataset (96.4% attacks, 3.6% normal), we create a balanced subset through stratified sampling, maintaining temporal ordering to preserve realistic traffic patterns.

4.2 Temporal Graph Construction

Network connections are organized into 200 temporal graph snapshots:

– Snapshot size: 3,000 connections each
– Nodes: Unique IP addresses (avg 346 per snapshot)
– Edges: Directed connections between IPs
– Node features (12-dim): In/out degree, bytes, packets, avg connection duration, protocol distribution, port entropy
– Edge features (10-dim): Bytes sent/received, packet count, duration, protocol type (TCP/UDP/ICMP), source/destination ports, lags

Feature Engineering Node Features (per IP address):

Edge Features (per connection):

Normalization All features normalized to [0, 1] via min-max scaling:

computed on training set, applied to validation/test.

Graph Construction Adjacency matrix A ∈ {0, 1}^n×n where A[i, j] = 1 if connection exists from IP i to j.

4.3 Data Preprocessing

We create 200 temporal graph snapshots (3000 connections each) and split:

– Training: 140 snapshots (70%, 420K edges)
– Validation: 30 snapshots (15%, 90K edges)
– Test: 30 snapshots (15%, 90K edges)

Each snapshot preserves the original temporal ordering and maintains a balanced class distribution, with an attack ratio ranging from 40% to 60%, allowing for controlled variation.

5 Results

Table 1 presents a comprehensive comparison of all four models on the test set. Figure 5 presents a radar chart to visually compare the model results. Figure 6 shows how the proposed architectures improved accuracy relative to the baseline EdgeLevelGCN.

Table 1. Model Performance on Test Set







Figure 5. Radar Chart for Algorithms




1.All architectures substantially outperform baseline:
– Prototype-GNN: +2.37 pp (91.87% → 94.24%)
– Contrastive-GNN: +2.84 pp (91.87% → 94.71%)
– GSL-GNN: +4.79 pp (91.87% → 96.66%)

2. GSL-GNN achieves near-perfect discrimination: 99.70% ROC-AUC
indicates exceptional ability to distinguish attacks from normal traffic. Only
3,002 total errors out of 90,000 connections.

3. Trade-off between accuracy and efficiency:
– Best accuracy: GSL-GNN (96.66%, 35 min training)
– Best balance: Contrastive-GNN (94.71%, 18 min training)
– Most efficient: Prototype-GNN (94.24%, 12 min training)

4. Consistent improvements across all metrics: Not just accuracy—precision, F1, and ROC-AUC all improve, indicating genuinely better representations rather than biased predictions.





Figure 6. Architecture Improvements



Table 2. Confusion Matrices (TN, FP, FN, TP)
Share this:

				Share on X (Opens in new window)
				X
			

				Share on Facebook (Opens in new window)
				Facebook
			
Like Loading...
									
	
	
	
		
		Leave a comment Cancel reply






			
			
Δ