Seeing Through the Black Box: Explainable AI for Cyber Attribution and Defense

  • Chathrie Upeka Wimalasooriya
  • published date: 2026-02-27 14:44:47

Modern cyber defense increasingly relies on AI systems that analyse vast amounts of threat intelligence and identify patterns beyond traditional analytical reach. As these models grow more powerful, they also become harder to understand, raising concerns about transparency and reliability. This post walks through the role of explainable AI in strengthening cyber attribution, outlining how GNN- and LLM-based techniques can clarify model reasoning and support operational decision-making, and highlighting what security teams need to consider when integrating XAI into real-world workflows.

Introduction

Artificial Intelligence is rapidly reshaping cyber defense. The growing sophistication of cyberattacks has pushed security teams to adopt AI as a core component of modern defense. Unlike traditional systems limited by static signatures or manual analysis, AI-driven models can process streams of data in real time, identifying irregularities and hidden patterns that may escape human detection. Just as importantly, AI-based systems can be continually trained and updated as new attack behaviours emerge, allowing them to adapt to evolving threats.

Yet this increased reliance on AI brings its own challenges. As models grow in scale and complexity, they become harder for humans to understand. In practice, cybersecurity analysts are often asked to trust decisions made by systems whose internal logic is difficult or sometimes impossible to interpret.

Analysts are frequently required to accept the outputs of black-box systems: a classifier may flag a network flow as malicious or suggest that a campaign resembles a known threat actor, yet provide no clarity on how or why that conclusion was reached. In high-stakes environments such as national security, critical infrastructure, and large-scale incident response, this lack of transparency is not merely inconvenient; it introduces significant operational and analytical risk.

This tension between high model performance and human-understandable reasoning sits at the center of current research in Explainable AI (XAI). AI-driven tools can detect subtle anomalies far beyond human capabilities, but their black-box nature raises fundamental questions: Why did the model flag this behaviour? What features drove the alert? Can this output be trusted—especially under adversarial conditions? As noted in recent literature, explainability is essential for justification, accountability, and trustworthiness in AI-assisted cyber defense. It is therefore not an optional add-on, but a foundational requirement for AI-driven security systems.

AI-Driven Cyber Attribution

A robust attribution process requires a multi-modal approach, rather than dependence on a single analytical framework. For example, initial attack mapping may use the Cyber Kill Chain, followed by MITRE ATT&CK for detailed behavioural analysis. Graph Neural Networks (GNNs) can then map relationships and uncover hidden connections between adversaries, their capabilities, and their infrastructure (e.g., IPs, domains).

Recent AI-driven attribution methods highlight the value of combining GNNs, Large Language Models (LLMs), and NLP [1,2]. These models support tasks ranging from simple threat-actor classification based on CTI reports [3] to multi-source, heterogeneous attribution involving topological features and relationships among indicators of compromise (IOCs).

Explainable Graph Neural Networks (XGNN)

Trustworthy GNNs incorporate robustness, explainability, privacy, fairness, accountability, and other trust-oriented properties [4]. Explainable GNNs (XGNNs) are increasingly important for cyber attribution because GNNs are now used to connect infrastructure, malware, TTPs, victims, and adversary profiles into a unified heterogeneous graph. While GNNs identify non-obvious links—such as shared C2 servers or recurring behavioural patterns—their predictions often appear as black-box outputs. Explainability is therefore essential to make these decisions traceable and defensible, especially when informing high-risk attribution assessments.

Forms of Explanations for GNNs

Explainability in GNNs can take several forms depending on the granularity and purpose of the explanation [5,6]:

  • Node-level: Which entities (e.g., IPs, domains, malware) influence the model’s decision

  • Edge-level: Which relationships (e.g., malware → C2, campaign → infrastructure) were most influential

  • Feature-level: Which attributes (e.g., malware family, ATT&CK techniques) mattered

  • Subgraph-level: Minimal structural rationale behind the prediction
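As a toy illustration, these four granularities can be seen as different views over the same small heterogeneous attribution graph. The sketch below is purely illustrative; every indicator (domain, IP, sample name) is hypothetical, and a real pipeline would build such a graph from parsed CTI data rather than hard-coded literals:

```python
# Toy heterogeneous attribution graph; every indicator below is hypothetical.
nodes = {
    "203.0.113.7":     {"type": "ip"},
    "evil-c2.example": {"type": "domain"},
    "sample_a":        {"type": "malware", "family": "hypothetical_rat"},
    "T1071":           {"type": "technique"},  # ATT&CK Application Layer Protocol
}
edges = [
    ("sample_a", "communicates_with", "evil-c2.example"),  # malware -> C2
    ("evil-c2.example", "resolves_to", "203.0.113.7"),     # domain -> IP
    ("sample_a", "uses_technique", "T1071"),               # malware -> TTP
]

# Suppose an explainer marks one edge as the decisive evidence.
explaining_edges = [("sample_a", "communicates_with", "evil-c2.example")]

# The four granularities are different views of that same evidence:
edge_view = explaining_edges                                # edge-level
node_view = sorted({n for (s, _, d) in explaining_edges
                    for n in (s, d)})                       # node-level
feature_view = [nodes[n] for n in node_view]                # feature-level
subgraph_view = (node_view, edge_view)                      # subgraph-level
```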

Methods for GNN Explainability

  • Gradient/Feature Attribution: looks at which nodes, edges, or features the model was most sensitive to.

  • Perturbation-Based Masking: removes parts of the graph, such as deleting a suspicious domain or IP, to see whether the prediction changes (counterfactual reasoning).

  • Surrogate Models: when internal model access is restricted, a simpler model (e.g., small decision tree or Bayesian network) approximates local behaviour to explain predictions.

  • Decomposition Methods: breaks down the final prediction into contributions from each node or connection, similar to tracing how much each piece of evidence “added” to the final score.

  • Generation-based Methods: use reinforcement learning or graph-generation techniques to create a simple example subgraph that strongly triggers the model’s decision—this serves as a human-readable pattern that explains what the GNN was looking for (e.g., XGNN).
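Perturbation-based masking is the easiest of these to sketch. In the toy example below, the “model” is a stand-in scoring function (a real attribution GNN would be far richer), and the indicators are hypothetical; the point is only the mechanic of removing each edge and measuring how the prediction moves:

```python
# Sketch of perturbation-based masking on a toy graph classifier.
# BAD_INFRA and the scoring rule are hypothetical stand-ins for a trained GNN.
BAD_INFRA = {"evil-c2.example", "203.0.113.7"}

def model_score(edges):
    """Toy surrogate for a GNN: fraction of edges touching bad infrastructure."""
    if not edges:
        return 0.0
    hits = sum(1 for (src, dst) in edges if src in BAD_INFRA or dst in BAD_INFRA)
    return hits / len(edges)

def edge_importance(edges):
    """Remove each edge in turn and record how much the score drops."""
    base = model_score(edges)
    importance = {}
    for e in edges:
        perturbed = [x for x in edges if x != e]
        importance[e] = base - model_score(perturbed)
    return importance

graph = [
    ("sample_a", "evil-c2.example"),  # malware -> suspected C2
    ("sample_a", "update.example"),   # malware -> benign-looking domain
]
scores = edge_importance(graph)
# Deleting the C2 edge lowers the score (positive importance); deleting the
# benign edge raises it (negative importance) -- counterfactual reasoning.
```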

Explainable Large Language Models (XAI-LLMs)

In cyber attribution workflows, LLMs and domain-adapted NLP models such as CTI-BERT [7] and CySecBERT [8] help analyse unstructured threat-intelligence text. Trained on cybersecurity corpora, they extract IOCs, map behaviours to MITRE ATT&CK, classify threat reports, and identify linguistic cues linked to known APT groups. This greatly reduces analyst effort.

However, despite their capabilities, LLMs remain largely opaque [9], providing outputs without revealing how they reached them. This lack of transparency is problematic for high-stakes decisions such as actor attribution or campaign correlation.

XAI techniques are still evolving, and integrating them into LLMs without degrading efficiency is a significant challenge [9]. Until LLMs can expose interpretable reasoning paths or justify their outputs with evidence, their role in critical cyber-defense processes will remain constrained.

XAI Techniques Relevant to LLMs

Although LLMs differ from traditional ML models, several established XAI approaches can be applied to CTI and attribution workflows [10,11].

  1. SHAP (Shapley Additive Explanations): SHAP assigns each input feature a contribution score (Shapley values) based on cooperative game theory, enabling analysts to understand how much each token, feature, or attribute influenced an output. In cybersecurity research, SHAP has been used to explain why IDS models flagged certain network patterns as malicious, or why a specific feature contributed to anomaly detection. 

    • Positive Shapley value → increased support for the output

    • Negative value → reduced support

For example, in LLM-based CTI analysis, SHAP can highlight:

  • which tokens (which parts of the texts) contributed to identifying an IOC
  • which phrases influenced the model to link a report to a specific APT group
  • which behavioural indicators drove a MITRE ATT&CK mapping

This provides an analyst-friendly breakdown of model reasoning.
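For a handful of features, Shapley values can even be computed exactly by enumerating coalitions, which makes the game-theoretic idea concrete. The sketch below uses a toy additive scoring function standing in for an LLM-based classifier; the cue words and their weights are hypothetical, and real text models require the sampling-based approximations the SHAP library provides:

```python
from itertools import combinations
from math import factorial

def shapley_values(tokens, score):
    """Exact Shapley values for a small token set.

    `score` maps a frozenset of present tokens to a model output;
    exact enumeration is only feasible for a handful of features.
    """
    n = len(tokens)
    values = {}
    for t in tokens:
        others = [x for x in tokens if x != t]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (score(s | {t}) - score(s))  # marginal gain
        values[t] = total
    return values

# Hypothetical cue-word weights standing in for an APT classifier's behaviour.
CUE_WEIGHTS = {"spearphishing": 0.5, "x-agent": 0.4, "weather": 0.0}

def toy_score(present):
    return sum(CUE_WEIGHTS[t] for t in present)

vals = shapley_values(list(CUE_WEIGHTS), toy_score)
# For an additive model, each token's Shapley value recovers its weight.
```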

  2. LIME (Local Interpretable Model-Agnostic Explanations): LIME complements SHAP by focusing on localized explanations for individual predictions. It creates slightly altered versions of the input by selectively modifying some parts (e.g., masking tokens, removing terms) and then observes how these changes affect the output.

    • If removing a token changes the prediction → that token was important

    • If a token does not affect the prediction → it was likely irrelevant

In attribution workflows, LIME can show:

  • which words influenced a model to classify a report (e.g., as APT28-related)
  • which words or sentences led the model to map an activity to a particular TTP
  • which textual elements informed inferred infrastructure relationships
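The perturbation step at the heart of LIME can be sketched in a few lines. Note this shows only the probing mechanic; full LIME additionally fits a local linear surrogate over many perturbed samples. The classifier and cue words below are hypothetical stand-ins, not a real model:

```python
# LIME-style local probe: drop one token at a time and watch the output.
def toy_classifier(tokens):
    """Hypothetical stand-in: 1 ('APT-related') if both cue words appear."""
    return 1 if {"spearphishing", "x-agent"} <= set(tokens) else 0

def token_influence(tokens):
    base = toy_classifier(tokens)
    influence = {}
    for i, t in enumerate(tokens):
        masked = tokens[:i] + tokens[i + 1:]        # input with one token removed
        influence[t] = base - toy_classifier(masked)  # did the prediction flip?
    return influence

report = ["spearphishing", "campaign", "used", "x-agent"]
influence = token_influence(report)
# {'spearphishing': 1, 'campaign': 0, 'used': 0, 'x-agent': 1}
```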


  3. Attention/Gradient-Based Attribution: reveals which tokens the model focused on

  4. Evidence-Grounded Reasoning (e.g., RAG): requires the model to cite retrieved CTI sources

  5. Self-Consistency Verification: reduces hallucination by comparing multiple reasoning paths

  6. Counterfactual Explanations: shows how slight changes in the input (e.g., removing a TTP description) alter the output (causal reasoning)
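Self-consistency verification, for instance, reduces to sampling several reasoning paths and keeping the majority answer, with the agreement rate doubling as a crude confidence score. In this sketch a seeded random sampler stands in for repeated LLM calls at non-zero temperature; the actor labels and vote weights are hypothetical:

```python
import random
from collections import Counter

def sample_attribution(rng):
    """Hypothetical noisy attributor standing in for one stochastic LLM call."""
    return rng.choices(["APT28", "APT29", "unknown"],
                       weights=[0.7, 0.2, 0.1])[0]

def self_consistent_answer(n_samples=25, seed=0):
    """Majority vote over sampled reasoning paths, plus agreement rate."""
    rng = random.Random(seed)
    votes = Counter(sample_attribution(rng) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    agreement = count / n_samples  # low agreement -> flag for human review
    return answer, agreement

answer, agreement = self_consistent_answer()
```

In an operational workflow the agreement threshold below which an attribution is escalated to a human analyst would itself be a tunable policy choice.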

Looking Ahead

  1. Bridging Research and Practice in XAI: Most XAI methods remain research prototypes. SOC teams need operationally usable explanations embedded in workflows (e.g., inline justification panels, evidence links, and explanation logging) with UI support (e.g., explanation panels in SOC consoles) so analysts can actually use them.

  2. Standardised Evaluation for XAI in Cybersecurity: Current XAI studies use inconsistent metrics. The next step is developing benchmarks and evaluation frameworks that measure stability, human usefulness, and robustness against adversarial manipulation.

  3. System-Level Explainability, Not just Model-Level: XAI must extend beyond single predictions to full pipelines, monitoring explanation drift (e.g., which features are driving positives this month vs last month) and tracking how reasoning evolves during long-running campaigns.
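One simple way to make explanation drift measurable is to compare aggregated feature-attribution profiles across time windows, e.g., with total variation distance. The feature names and importances below are hypothetical aggregates (imagine monthly sums of SHAP-style scores over all alerts):

```python
# Sketch of explanation-drift monitoring between two time windows.
def normalize(attr):
    """Scale attributions so they sum to 1, forming a comparable profile."""
    total = sum(attr.values())
    return {k: v / total for k, v in attr.items()}

def attribution_drift(prev, curr):
    """Total variation distance between two normalized attribution profiles."""
    p, q = normalize(prev), normalize(curr)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical aggregated feature importances for alerts in two months.
last_month = {"beaconing_interval": 6.0, "dns_entropy": 3.0, "ja3_hash": 1.0}
this_month = {"beaconing_interval": 2.0, "dns_entropy": 3.0, "ja3_hash": 5.0}

drift = attribution_drift(last_month, this_month)
# 0.0 means identical profiles; values near 1.0 flag a major shift in what
# is driving detections, worth surfacing to the SOC.
```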

  4. Toward Trustworthy AI for Cyber Defense: Future attribution systems will combine robustness, privacy, fairness, and explainability. This holistic “trustworthy AI” perspective will underpin next-generation cyber defense systems.

Edited By: Windhya Rankothge, PhD, Canadian Institute for Cybersecurity 

References

  1. Xiao, N., Lang, B., Wang, T., & Chen, Y. (2024).  APT-MMF: An advanced persistent threat actor attribution method based on multimodal and multilevel feature fusion.  https://www.sciencedirect.com/science/article/pii/S0167404824002657

  2. Hu, Y., Zou, F., Han, J., Sun, X., & Wang, Y. (2024). Llm-tikg: Threat intelligence knowledge graph construction utilizing large language model. https://www.sciencedirect.com/science/article/pii/S0167404824003043

  3. Irshad, E., & Siddiqui, A. B. (2023). Cyber threat attribution using unstructured reports in cyber threat intelligence. https://www.sciencedirect.com/science/article/pii/S111086652200069X

  4. Zhang, H., Wu, B., Yuan, X., Pan, S., Tong, H., & Pei, J. (2024). Trustworthy graph neural networks: Aspects, methods, and trends. https://ieeexplore.ieee.org/abstract/document/10477407

  5. Luo, D., Cheng, W., Xu, D., Yu, W., Zong, B., Chen, H., & Zhang, X. (2020). Parameterized explainer for graph neural network. 
    https://proceedings.neurips.cc/paper/2020/file/e37b08dd3015330dcbb5d6663667b8b8-Paper.pdf

  6. Vu, M., & Thai, M. T. (2020). Pgm-explainer: Probabilistic graphical model explanations for graph neural networks. 
    https://proceedings.neurips.cc/paper_files/paper/2020/file/8fb134f258b1f7865a6ab2d935a897c9-Paper.pdf

  7. Park, Y., & You, W. (2023, December). A pretrained language model for cyber threat intelligence.  https://aclanthology.org/2023.emnlp-industry.12/

  8. Bayer, M., Kuehn, P., Shanehsaz, R., & Reuter, C. (2024). Cysecbert: A domain-adapted language model for the cybersecurity domain. https://dl.acm.org/doi/full/10.1145/3652594

  9. Atlam, H. F. (2025). LLMs in Cyber Security: Bridging Practice and Education.  https://www.mdpi.com/2504-2289/9/7/184

  10. Ali, T., & Kostakos, P. (2023). Huntgpt: Integrating machine learning-based anomaly detection and explainable ai with large language models (llms). https://arxiv.org/abs/2309.16021

  11. Mohale, V. Z., & Obagbuwa, I. C. (2025). A systematic review on the integration of explainable artificial intelligence in intrusion detection systems to enhancing transparency and interpretability in cybersecurity. https://doi.org/10.3389/frai.2025.1526221

#Explainable AI #Cyber Attribution #Threat Intelligence #GNN #LLM