Disentangling Latent Risk Pathways via Bayesian Hypergraph Inference

TL;DR. BHPI reframes multi-disease EHR modeling as inferring a latent hypergraph: diseases group into overlapping pathways, risk factors act on pathways rather than single diseases, and structured Bayesian inference delivers calibrated uncertainty — with the largest gains on rare diseases.

Abstract

Electronic health records (EHR) pose large-scale multi-disease modeling problems in which many outcomes are rare and strongly influenced by shared risk factors. While modern approaches achieve strong predictive performance, they often treat diseases independently or rely on black-box architectures, offering limited insight into how risk factors organize disease risk and little principled uncertainty quantification.

We introduce a Bayesian hypergraph inference framework that reframes multi-disease modeling around latent, risk-factor–modulated disease pathways. Risk factors act on hyperedges — latent disease subsets with shared risk patterns — allowing diseases to participate in multiple distinct pathways and enabling interpretable, higher-order structure beyond pairwise associations. A repulsion prior encourages parsimonious and identifiable structure, while posterior inference provides calibrated uncertainty over both disease groupings and risk-factor influence.

To enable scalable inference on large EHR datasets, we develop a structured variational inference algorithm that preserves logical dependencies among hyperedge existence, disease membership, and pathway-level effects. Experiments on simulated data and the UK Biobank demonstrate stable and interpretable disease pathway structure, well-calibrated uncertainty, improved estimation for rare diseases, and competitive predictive performance.

How it compares

Approach	Borrows strength	Risk-factor–specific	Higher-order (> pairwise)	Uncertainty over structure
Independent logistic	✗	✓	✗	✗
Multi-task / black-box	✓	✗	partial	✗
Disease networks	partial	✗	✗	✗
BHPI (ours)	✓	✓	✓	✓

BHPI is the only approach that borrows strength across diseases, keeps effects risk-factor–specific, captures higher-order (beyond pairwise) structure, and quantifies uncertainty over that structure.

How it works

\( \boldsymbol{\beta}_{j,v} \;=\; d_v^{-1}\sum_{e=1}^{E}\, H_{v,e}\,\mu_{j,e} \)

The effect of risk factor \(j\) on disease \(v\) is composed from shared hyperedge effects: \(H_{v,e}\) says whether disease \(v\) belongs to pathway \(e\) (learned, with uncertainty), \(\mu_{j,e}\) is how risk factor \(j\) acts through pathway \(e\), and \(d_v\) keeps the effect scale stable as the number of pathways grows. A repulsion prior keeps pathways non-redundant and identifiable; structured variational inference preserves the existence → membership → effect logic for calibrated uncertainty.
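As a rough illustration of the composition rule, the NumPy sketch below builds disease-level effects from a small, hand-written membership matrix and pathway effects. It is not the authors' implementation: the function name `compose_effects` and the toy values are hypothetical, and it assumes that \(d_v\) is the number of hyperedges containing disease \(v\), which the text above does not state explicitly.

```python
import numpy as np

def compose_effects(H, mu):
    """Return beta with beta[j, v] = (1 / d[v]) * sum_e H[v, e] * mu[j, e].

    H  : (V, E) membership matrix, diseases x hyperedges
    mu : (J, E) pathway-level effects, risk factors x hyperedges
    """
    H = np.asarray(H, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Assumption: d[v] is the hyperedge degree of disease v (row sum of H).
    d = H.sum(axis=1)
    # A disease in no hyperedge gets a zero effect instead of a division by zero.
    safe_d = np.where(d > 0, d, 1.0)
    return (mu @ H.T) / safe_d  # shape (J, V)

# Toy example: 3 diseases, 2 hyperedges, 2 risk factors (values are made up).
H = np.array([[1, 0],   # disease 0 belongs to pathway 0 only
              [1, 1],   # disease 1 belongs to both pathways
              [0, 1]])  # disease 2 belongs to pathway 1 only
mu = np.array([[0.8, -0.2],
               [0.0,  0.6]])
beta = compose_effects(H, mu)
```

In this toy case disease 1 sits in both pathways, so each of its effects is the average of the two pathway effects, which is what keeps the scale comparable to diseases that belong to a single pathway.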

The BHPI workflow: data layer, latent hypergraph layer, and effect propagation. (a) Patient covariates and multiple disease outcomes. (b) A latent disease hypergraph models higher-order structure via hyperedges, with uncertain existence and membership. (c) Risk factors act on hyperedges to induce structured, disentangled effects across diseases; shaded regions denote posterior uncertainty.

Results on the UK Biobank

diseases jointly modeled

baseline risk factors

~277K UK Biobank patients

0.005 calibration error (ECE), well calibrated
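For readers unfamiliar with the metric, expected calibration error (ECE) compares predicted probabilities with observed event rates inside probability bins. The sketch below uses the common equal-width binning with a sample-weighted average of the per-bin gaps. The bin count, and whether the reported figure pools diseases or averages over them, are not given here, so `n_bins=10` and the function name are assumptions for illustration only.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width binned ECE: weighted mean of |observed rate - mean prediction|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges yields bin ids in 0 .. n_bins - 1,
    # so a prediction of exactly 1.0 falls in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the share of samples in bin b
    return ece

# Toy check: predictions of 0.15 for non-events and 0.85 for events.
ece_toy = expected_calibration_error([0, 0, 1, 1], [0.15, 0.15, 0.85, 0.85])
```

Lower is better; a value of 0 means that, within every bin, the average predicted probability matches the observed event rate.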

BHPI is most valuable where prediction is hardest: on rare diseases, where independent baselines degrade and show heavy negative tails in per-disease AUC. On the rarest diseases (<2% prevalence), it improves mean AUC by 1.2 points over optimally tuned logistic regression (p = 3×10⁻⁵). By borrowing strength across shared pathways, BHPI yields stable, interpretable estimates and competitive prediction.

Per-disease ΔAUC (change in AUROC) relative to tuned logistic regression, stratified by disease prevalence. Discriminative baselines (LightGBM, Classifier Chains) show high-variance failure modes, visible as long negative tails, on rare diseases, while BHPI stays robust across the prevalence spectrum.
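A per-disease ΔAUC comparison of this kind can be sketched as follows. This is a minimal illustration, not the evaluation code behind the figure: the function names, the toy scores, and the use of a single prevalence cutoff are assumptions, and the pairwise AUC below is quadratic in sample size, so a rank-based implementation would be needed at UK Biobank scale.

```python
import numpy as np

def auc(y_true, score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true).astype(bool)
    score = np.asarray(score, dtype=float)
    pos, neg = score[y_true], score[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        return np.nan  # AUC is undefined without both classes
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def delta_auc_by_prevalence(Y, S_model, S_base, cutoff=0.02):
    """Return per-disease AUC differences and a mask of diseases below the cutoff."""
    Y, S_model, S_base = (np.asarray(a) for a in (Y, S_model, S_base))
    delta = np.array([auc(Y[:, v], S_model[:, v]) - auc(Y[:, v], S_base[:, v])
                      for v in range(Y.shape[1])])
    rare = Y.mean(axis=0) < cutoff  # prevalence is the column mean of the labels
    return delta, rare

# Toy data: 4 patients, 2 diseases (made-up scores; cutoff raised to suit the toy size).
Y = np.array([[1, 1], [0, 1], [0, 0], [0, 0]])
S_model = np.array([[0.9, 0.8], [0.1, 0.7], [0.2, 0.2], [0.3, 0.1]])
S_base = np.array([[0.2, 0.8], [0.1, 0.1], [0.3, 0.2], [0.4, 0.7]])
delta, rare = delta_auc_by_prevalence(Y, S_model, S_base, cutoff=0.3)
```

Summarizing `delta[rare]` and `delta[~rare]` separately gives the kind of prevalence-stratified view described in the caption.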

Discovered latent pathways linking risk factors (top) and diseases (bottom) through hyperedges (center) in the UK Biobank.

Cite

@inproceedings{ding2026bhpi,
  title     = {Disentangling Latent Risk Pathways via Bayesian Hypergraph Inference},
  author    = {Ding, Shengxian and Gao, Haonan and Liu, Pangpang and
               Tian, Xinyuan and Zhao, Yize},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  year      = {2026},
  eprint    = {2606.07677},
  archivePrefix = {arXiv}
}