
Our prior survey of foundation models beyond text identified eight open questions [Prior Survey]. This report investigates three of the most promising avenues for closing those gaps:
Public benchmarks for event-sequence FMs are finally emerging. HORIZON (April 2026, 54M users) [1], EBES (KDD 2025) [2], and MBD (multimodal banking, 2024) [3] represent the first serious attempts, though none yet match the authority of GLUE or ImageNet. Blockchain data remains the most structurally promising source for permissionless, graph-structured event benchmarks, but no dedicated blockchain FM benchmark exists yet.
Blockchain/on-chain foundation models are a small but active field. BERT4ETH (WWW 2023) [5], BlockFound/BlockScan (2024–2025) [6], and the newly released EWE-1 (sistemalabs, 2026; 1.1B Ethereum transactions, open-weights) [7] demonstrate that the pre-train-then-adapt paradigm works on blockchain data. GetEmbed/ZKAI Labs has achieved commercial traction with wallet-level recommendation models deployed inside Coinbase Wallet, validating the use case at scale [8].
JEPA (Joint-Embedding Predictive Architecture) has expanded beyond vision into tabular data (T-JEPA, ICLR 2025) [14], graph-level representation (Graph-JEPA, 2024) [15], and trajectory similarity (T-JEPA for trajectories, 2024) [16]. These extensions suggest JEPA could address the relational signal gap exposed by PRAGMA's AML failure [Prior Survey], but no one has yet applied JEPA to financial event sequences or transaction networks. This is a clear research opportunity.
Part I: Public Benchmarks for Event-Sequence Foundation Models
The Gap
NLP has GLUE/SuperGLUE. Vision has ImageNet. Tabular data has OmniTabBench. Event-sequence foundation models have nothing comparable. Every industrial model evaluates on proprietary internal benchmarks, making cross-paper comparison impossible.
Emerging Benchmarks
Three recent efforts begin to address this:
HORIZON (April 2026)
Paper: "HORIZON: A Benchmark for In-the-wild User Behaviour Modeling" — arXiv:2604.17259 [1]
Scale: 54M users, 35M items, 486M interactions [1]
Data source: Cross-domain reformulation of Amazon Reviews [1]
Tasks: Temporal generalization, sequence-length variation, unseen-user modeling, cross-domain transfer [1]
Key innovation: Evaluates along three axes (dataset, task, evaluation) rather than single-domain next-item prediction. Explicitly tests temporal robustness and cross-domain generalization [1].
Limitation: Built from product review data, not financial transactions or rich event sequences. No graph structure.
EBES — Easy Benchmarking for Event Sequences (KDD 2025)
Paper: "EBES: Easy Benchmarking for Event Sequences" — arXiv:2410.03399 · GitHub [2]
Focus: Standardized benchmark for event-sequence classification with irregular sampling intervals and mixed categorical/numerical features [2].
Coverage: Healthcare, finance, and user interaction domains [2].
Key contribution: First standardized evaluation protocol for event-sequence models. Addresses the lack of comparable results across studies [2].
Limitation: Focused on classification tasks; doesn't cover generative or recommendation tasks [2].
MBD — Multimodal Banking Dataset (2024)
Paper: "Multimodal Banking Dataset: Understanding Client Needs through Event Sequences" — arXiv:2409.17587 · HuggingFace · GitHub [3]
Scale: 2M+ corporate clients with 950M bank transactions, 1B geo-position events, 5M dialogue embeddings [3]
Key innovation: First industrial-scale publicly available multimodal banking event-sequence dataset. Includes transaction sequences, geolocation data, and dialogue embeddings [3].
Tasks: Future purchase prediction, modality matching [3]
Status: Dataset publicly available on HuggingFace [3].
Limitation: Corporate banking (not consumer); anonymized to the point where some structural signals may be lost.
PyTorch-Lifestream (IJCAI 2025)
Paper: IJCAI 2025 proceedings [4]
Role: Open-source library providing standardized implementations of event-sequence pretraining methods with built-in support for multimodal event sequences including transactions, clickstreams, and geolocation data [4].
Datasets: Bundles several public event-sequence datasets for benchmarking, though none at PRAGMA scale [4].
Assessment
| Benchmark | Domain | Scale | Graph Structure? | Public? | FM-Ready? |
|---|---|---|---|---|---|
| HORIZON [1] | Product reviews | 54M users | ❌ | ✅ | ✅ (pretraining + eval) |
| EBES [2] | Multi-domain events | Various | ❌ | ✅ | Partial (classification only) |
| MBD [3] | Corporate banking | 2M clients, 950M txns | ❌ | ✅ | ✅ |
| PyTorch-Lifestream [4] | Multi-domain | Various small | ❌ | ✅ | ✅ (library) |
| Needed | Financial/blockchain | Billions of events | ✅ | ✅ | ✅ |
Key gap: None of these benchmarks include graph structure. The transaction graph — who transacts with whom — is precisely what single-user event-sequence FMs cannot capture and what AML/fraud detection requires [Prior Survey]. Blockchain data is the natural candidate to fill this gap.
Part II: Blockchain / On-Chain Foundation Models
Why Blockchain Data Matters for FM Research
Blockchain transaction data has unique properties that make it ideal for public event-sequence FM benchmarks:
Permissionless access — No GDPR, MiFID II, or HIPAA barriers. All transactions are public by design.
Inherent graph structure — Wallets, contracts, and token flows form a rich, directed graph.
Massive scale — Ethereum alone has billions of transactions since 2015.
Multi-modal — Transactions include amounts, contract calls, token types, timestamps, gas prices, and more.
Ground truth labels — Known fraud addresses, sanctions lists, and phishing datasets exist [9, 10, 11].
Existing Blockchain FMs
BERT4ETH (WWW 2023)
Paper: "BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection" — arXiv:2303.18138 · Code [5]
Architecture: BERT-style masked modeling on Ethereum transaction sequences per wallet (see the sketch after this block) [5].
Scale: Small by FM standards (hidden_size=64, 2 attention heads, max 50 positions) [5].
Tasks: Phishing detection, de-anonymization, illicit account detection [5].
Key contribution: First pre-trained transformer for Ethereum. Demonstrated that BERT-style pretraining transfers to multiple blockchain fraud tasks [5].
Limitation: Very small model; processes wallets in isolation (same limitation as single-user event-sequence FMs) [5].
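To make the pretraining recipe concrete, here is a minimal PyTorch sketch of BERT4ETH-style masked transaction modeling at the paper's reported scale. It assumes each wallet's transactions have already been tokenized into an id sequence; all class and variable names are ours, not the authors' code.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, HEADS, MAX_LEN, MASK_ID = 50_000, 64, 2, 50, 1  # per reported scale

class MaskedTxModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.pos = nn.Embedding(MAX_LEN, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, HEADS, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(HIDDEN, VOCAB)        # reconstruct masked tx ids

    def forward(self, tokens):                      # tokens: (batch, seq_len) tx ids
        pos = torch.arange(tokens.size(1), device=tokens.device)
        return self.head(self.encoder(self.embed(tokens) + self.pos(pos)))

def mlm_step(model, tokens, mask_prob=0.15):
    # BERT-style objective: hide random positions, reconstruct their exact ids.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    logits = model(tokens.masked_fill(mask, MASK_ID))
    return nn.functional.cross_entropy(logits[mask], tokens[mask])
```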
BlockFound / BlockScan (2024–2025)
Paper: "BlockScan: Detecting Anomalies in Blockchain Transactions" — arXiv:2410.04039 · OpenReview (ICLR 2025 submission) [6]
Architecture: Customized Transformer with modular tokenizer for blockchain's multi-modal data structure (blockchain tokens, text, numbers) [6].
Training: BERT-style MLM adapted for DeFi transaction patterns [6].
Key innovation: Handles the unique data structure of blockchain transactions (function calls, token amounts, addresses) with a specialized tokenizer (see the sketch after this block) [6].
Limitation: Focused on anomaly detection. A third-party analysis notes it neglects dynamic execution context (opcode-level transaction details).
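The modular-tokenizer idea can be illustrated in a few lines: route each transaction field to a field-appropriate sub-tokenizer and concatenate the token streams. The bucketing rules and names below are our own assumptions, not BlockScan's implementation.

```python
import math

def tokenize_address(addr: str) -> list[str]:
    # Addresses are opaque ids: one vocabulary token per address
    # (a production system would hash or bucket rare addresses).
    return [f"ADDR:{addr.lower()}"]

def tokenize_amount(wei: int) -> list[str]:
    # Raw numbers subword-tokenize poorly; log-bucket them instead.
    bucket = 0 if wei == 0 else int(math.log10(wei))
    return [f"AMT_E{bucket}"]

def tokenize_calldata(selector: str, args_text: str) -> list[str]:
    # The function selector is categorical; decoded args fall back to text tokens.
    return [f"FN:{selector}"] + args_text.split()

def tokenize_tx(tx: dict) -> list[str]:
    return (tokenize_address(tx["from"]) + tokenize_address(tx["to"])
            + tokenize_amount(tx["value"])
            + tokenize_calldata(tx["selector"], tx.get("args", "")))

print(tokenize_tx({"from": "0xAb5801a7", "to": "0xDeaDbeef",
                   "value": 2_500_000_000_000_000_000,
                   "selector": "0xa9059cbb", "args": "transfer recipient amount"}))
```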
EWE-1 (sistemalabs, 2026) — First Open-Weights Blockchain FM
Blog: sistemalabs.com/blog/introducing-ewe-1 · GitHub · HuggingFace [7]
Architecture: Causal (autoregressive) transformer trained with sequifier. Unlike BERT4ETH's masked reconstruction, EWE-1 is causal — it maximizes information about future behavior, discarding early transaction info if it doesn't help predict what's next [7]. (The sketch after this block contrasts the two objectives.)
Scale: 1.1 billion Ethereum transaction records from 2024–2025. Three model sizes: 35M (small), 110M (medium), 500M (large) parameters. 16 attention heads across all variants [7].
Input: 31 features per transaction across a 64-transaction lookback window. Features cover counterparty identifiers, transaction characteristics (cost, success, entropy), wallet state (age, frequency, session depth), and temporal features (hour, day, month) [7].
Training data: All Ethereum mainnet transactions from 2024–2025, excluding addresses with fewer than 4 transactions and whales with >100K transactions [7].
Validation: Within-wallet cosine similarity of 85–90% versus between-wallet similarity of 10–15%. Phishing detection from embeddings outperforms raw features, reducing error rates by 10–20% [7].
Open-source: ✅ Permissively licensed weights on HuggingFace + inference code on GitHub [7].
Significance: First open-weights large transaction model for blockchain data [7].
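For contrast with BERT4ETH's masked reconstruction, here is a minimal sketch of the causal objective over a 64-transaction window: each position attends only to the past and predicts the next transaction token. The fusion of EWE-1's 31 per-transaction features into a single token id is elided, and every name is illustrative, not sistemalabs code.

```python
import torch
import torch.nn as nn

class CausalTxModel(nn.Module):
    def __init__(self, n_tokens=100_000, d=512, heads=16, layers=8, ctx=64):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, d)
        self.pos = nn.Embedding(ctx, d)
        block = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(d, n_tokens)

    def forward(self, tokens):                      # tokens: (batch, <=ctx) ids
        T = tokens.size(1)
        causal = torch.triu(                        # position t attends to <= t only
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        h = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        return self.head(self.blocks(h, mask=causal))

def next_tx_loss(model, tokens):
    logits = model(tokens[:, :-1])                  # condition on the past...
    targets = tokens[:, 1:]                         # ...predict the next event
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```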
ZipZap (KDD 2024)
Referenced in EWE-1's blog as training on up to 110M transactions [7]. Combines sequence modeling with graph elements but does not publish model weights [7].
Public Blockchain Datasets
| Dataset | Data | Scale | Labels | Source |
|---|---|---|---|---|
| Elliptic Bitcoin | Bitcoin transaction graph | 203K txns, 234K edges | Licit/illicit | PyG Guide [9] |
| Elliptic2 | Bitcoin subgraph patterns | Extended Elliptic | Money laundering subgraphs | GitHub [10] |
| Elliptic++ | Bitcoin txns + wallet addresses | 203K txns, 822K addresses | Fraud + illicit actors | GitHub [11] |
| BERT4ETH phishing list | Ethereum addresses | Thousands of labeled addresses | Phishing accounts | GitHub [5] |
| EWE-1 training data | All Ethereum mainnet 2024–2025 | 1.1B transactions | None (self-supervised) | Reconstructible from public chain [7] |
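The Elliptic graph in the table above is one import away if PyTorch Geometric is installed. A hedged sketch, assuming a recent PyG release that ships EllipticBitcoinDataset:

```python
from torch_geometric.datasets import EllipticBitcoinDataset

dataset = EllipticBitcoinDataset(root="data/elliptic")
data = dataset[0]                        # a single transaction graph

print(data.num_nodes, "transactions")    # ~203K nodes
print(data.num_edges, "payment flows")   # ~234K directed edges
print(data.x.shape)                      # per-transaction feature matrix
print(data.y.unique())                   # licit / illicit / unknown label ids
```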
GetEmbed / ZKAI Labs — Commercial Blockchain AI
Company: ZKAI Labs. Product: embed (getembed.ai) [8].
What they do: ML-powered personalization for Web3 apps. Per their product page: "While LLMs predict the next word in a sentence, our recommendation models predict your user's next action" [8].
Architecture (per vendor): HRNN (Hierarchical Recurrent Neural Network) — described as capturing both short-term intent and long-term preferences from on-chain behavior. Sequential, time-aware models trained on each wallet's interaction history. Cold start from just 2 interactions [8]. Note: These claims come from marketing material; no academic validation exists.
Scale (per vendor): Trained on 130M+ on-chain interactions. Backtested daily. Served in 250ms [8].
Validation (per vendor): "Mask last 10 actions per user, predict what they do next. No data leakage, no inflated metrics." Metric: Mean Reciprocal Rank (made concrete in the sketch after this block) [8].
Coinbase integration: Powered the personalized social feed in the new Coinbase Wallet app (via a partnership with Base). Ranks posts, mini-apps, and videos using Farcaster and Zora social-graph data plus on-chain signals. The feed entered limited beta in May 2025 and rolled out gradually until its public launch in December 2025 [8].
No academic paper found. Architecture details come from product pages and blog posts only [8].
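The vendor's leave-last-10 protocol is easy to pin down in code. In this sketch, model.rank_items is a hypothetical interface (no public API exists), and unranked targets score zero:

```python
import numpy as np

def leave_last_k_mrr(model, user_histories: dict, k: int = 10) -> float:
    """Mean Reciprocal Rank under a leave-last-k evaluation protocol."""
    reciprocal_ranks = []
    for user, events in user_histories.items():
        if len(events) <= k:
            continue
        context, held_out = events[:-k], events[-k:]
        # Rank the catalogue from the unmasked prefix only: no leakage.
        ranking = model.rank_items(user, context)   # hypothetical: best-first list
        for target in held_out:
            reciprocal_ranks.append(
                1.0 / (ranking.index(target) + 1) if target in ranking else 0.0)
    return float(np.mean(reciprocal_ranks))
```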
Assessment: Blockchain as the Public Benchmark Opportunity
Blockchain data is arguably the most promising domain for public event-sequence FM benchmarks because it uniquely combines:
Billions of real bilateral financial transactions with full counterparty visibility, publicly available
Inherent graph structure (wallets, contracts, token flows)
No regulatory barrier preventing benchmark creation
Ground truth labels from sanctions lists and fraud datasets [9, 10, 11]
Multiple scales and chains (Bitcoin, Ethereum, L2s)
Note: Other public financial data sources exist (e.g., LOBSTER order book data, SEC EDGAR filings), but none offer the combination of bilateral counterparty identity, graph structure, and permissionless access at this scale.
Yet no dedicated blockchain FM benchmark exists. The ingredients are there (Elliptic datasets [9, 10, 11], EWE-1's open data pipeline [7], BERT4ETH's evaluation tasks [5]), but nobody has assembled them into a standardized, multi-task benchmark comparable to GLUE or OmniTabBench.
Part III: JEPA Architectures and Relational Event Modeling
The Problem JEPA Could Solve
PRAGMA reports a −47.1% F₀.₅ degradation on AML relative to its production baseline (arXiv:2604.08649, §3.4.5) [Prior Survey, 19]. The authors attribute this to a fundamental limitation: the model processes each user's event history in isolation. Cross-user relational signals (transaction networks, counterparty patterns) are invisible. The same limitation affects BERT4ETH [5] and most event-sequence FMs.
JEPA's core principle — predict in latent space rather than reconstructing raw inputs [12, 13] — could theoretically enable cross-entity reasoning: instead of predicting masked tokens or next events, a JEPA-style model could predict the latent representation of a neighboring entity from the latent representation of a context entity.
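To ground the discussion, here is the generic JEPA recipe as a minimal PyTorch sketch (our own naming, not any specific paper's code). For the cross-entity variant suggested above, context_x would hold a focal entity's events and target_x a neighboring entity's events.

```python
import copy
import torch
import torch.nn as nn

class JEPA(nn.Module):
    def __init__(self, encoder: nn.Module, d: int, ema: float = 0.996):
        super().__init__()
        self.context_enc = encoder
        self.target_enc = copy.deepcopy(encoder)      # never receives gradients
        for p in self.target_enc.parameters():
            p.requires_grad = False
        self.predictor = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.ema = ema

    def loss(self, context_x, target_x):
        z_ctx = self.context_enc(context_x)
        with torch.no_grad():
            z_tgt = self.target_enc(target_x)         # target lives in latent space
        return nn.functional.mse_loss(self.predictor(z_ctx), z_tgt)

    @torch.no_grad()
    def update_target(self):
        # Momentum update: the target encoder slowly trails the context encoder,
        # which is the standard guard against representational collapse.
        for p, q in zip(self.context_enc.parameters(), self.target_enc.parameters()):
            q.mul_(self.ema).add_(p, alpha=1 - self.ema)
```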
JEPA Family: Current State (April 2026)
| Model | Domain | Year | Venue | Key Innovation |
|---|---|---|---|---|
| I-JEPA | Images | 2023 | CVPR 2023 | Predict latent representations of masked image patches [12] |
| V-JEPA | Video | 2024 | Meta FAIR | Self-supervised video understanding via latent prediction [13] |
| V-JEPA 2 | Video + robotics | 2025 | Meta FAIR | 1.2B params; zero-shot robot control from internet video [13] |
| V-JEPA 2.1 | Video (dense features) | 2026 | Meta FAIR | Temporally consistent dense features [13] |
| A-JEPA | Audio | 2024 | — | JEPA for audio representation learning (secondary source only [18]) |
| Graph-JEPA | Graphs | 2024 | — | JEPA for graph-level representation learning [15] |
| T-JEPA (tabular) | Tabular data | 2025 | ICLR 2025 | Augmentation-free SSL for tabular data [14] |
| T-JEPA (trajectory) | Spatial trajectories | 2024 | — | Trajectory similarity computation [16] |
T-JEPA for Tabular Data (ICLR 2025)
Paper: "T-JEPA: Augmentation-Free Self-Supervised Learning for Tabular Data" — arXiv:2410.05016 · ICLR 2025 poster · OpenReview [14]
Key contribution: Applies JEPA to tabular data, solving the data augmentation problem that has plagued self-supervised learning for structured data. Traditional SSL requires generating different "views" of the same sample — hard for tabular data because columns have heterogeneous semantics. T-JEPA uses mask-and-predict in latent space instead, requiring no augmentations [14].
Architecture: Mask subsets of features, predict their latent representations from the unmasked context. Introduces "regularization tokens" — a novel regularization method critical for JEPA-based models on tabular data [14]. (A simplified sketch, omitting these tokens, follows this block.)
Results: Competitive with supervised baselines; enables pre-training without labels [14].
Relevance to financial event FMs: T-JEPA demonstrates that JEPA works for structured data. The natural next step is extending from single-row tabular data to multi-event sequences.
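A simplified sketch of that mask-and-predict loop for a single table row, under our own naming and omitting the paper's regularization tokens: embed each column as a token, hide a random subset, and regress the hidden columns' latents from the visible context.

```python
import torch
import torch.nn as nn

N_COLS, D = 12, 64                                    # illustrative sizes

col_embed = nn.ModuleList([nn.Linear(1, D) for _ in range(N_COLS)])
context_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, 4, batch_first=True), num_layers=2)
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, 4, batch_first=True), num_layers=2)
mask_token = nn.Parameter(torch.zeros(D))             # stands in for hidden columns

def tjepa_loss(rows, target_enc):                     # rows: (batch, N_COLS) floats
    # One latent token per column: heterogeneous columns need no augmentations.
    tokens = torch.stack([col_embed[j](rows[:, j:j+1]) for j in range(N_COLS)], dim=1)
    hide = torch.rand(N_COLS) < 0.3                   # columns to mask this step
    corrupted = tokens.clone()
    corrupted[:, hide] = mask_token                   # replace hidden columns
    z_pred = predictor(context_enc(corrupted))        # predict in latent space
    with torch.no_grad():
        z_tgt = target_enc(tokens)                    # stop-grad / EMA target encoder
    return nn.functional.mse_loss(z_pred[:, hide], z_tgt[:, hide])
```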
Graph-JEPA (2024)
Paper: "Graph-level Representation Learning with Joint-Embedding Predictive Architectures" — arXiv:2309.16014 · OpenReview [15]
Key contribution: First JEPA for the graph domain. Partitions graphs into patches using METIS, encodes them via GNNs + Transformer blocks, predicts hyperbola-parameterized targets with Smooth-L1 loss [15].
Results: Sets state-of-the-art as a pre-trained backbone on 5 of the evaluated benchmark datasets for graph classification, regression, and isomorphism [15].
Architecture: Employs masked subgraph modeling — mask a subgraph, predict its latent representation from the context subgraph. Predicts coordinates on the unit hyperbola to encode hierarchical relationships [15]. (A simplified sketch follows this entry.)
Relevance to transaction networks: Graph-JEPA operates on exactly the data structure that single-user event-sequence FMs cannot handle — graphs with subgraph patterns [15]. A transaction network is a graph. Applying Graph-JEPA principles to transaction graphs could capture the relational signals that single-user models miss for AML.
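At sketch level the pipeline looks as follows. PyG's ClusterData stands in for the paper's METIS patch extraction (this assumes a PyG installation with METIS support), and the hyperbola-coordinate head is simplified to plain Smooth-L1 latent regression; all module names are our own.

```python
import torch
import torch.nn as nn
from torch_geometric.loader import ClusterData
from torch_geometric.nn import GCNConv, global_mean_pool

class SubgraphEncoder(nn.Module):
    def __init__(self, in_dim: int, d: int):
        super().__init__()
        self.conv1, self.conv2 = GCNConv(in_dim, d), GCNConv(d, d)

    def forward(self, x, edge_index):
        h = self.conv2(self.conv1(x, edge_index).relu(), edge_index)
        batch = torch.zeros(h.size(0), dtype=torch.long)  # a single subgraph
        return global_mean_pool(h, batch)                 # (1, d) patch latent

def graph_jepa_step(data, encoder, target_encoder, predictor, num_parts=8):
    parts = ClusterData(data, num_parts=num_parts)        # METIS "patches"
    latents = [encoder(parts[i].x, parts[i].edge_index) for i in range(num_parts)]
    tgt_idx = int(torch.randint(num_parts, (1,)))
    context = torch.cat(
        [z for i, z in enumerate(latents) if i != tgt_idx]).mean(0, keepdim=True)
    with torch.no_grad():                                 # latent target, no gradients
        z_tgt = target_encoder(parts[tgt_idx].x, parts[tgt_idx].edge_index)
    return nn.functional.smooth_l1_loss(predictor(context), z_tgt)  # paper's loss
```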
T-JEPA for Trajectory Similarity (2024)
Paper: "T-JEPA: A Joint-Embedding Predictive Architecture for Trajectory Similarity Computation" — arXiv:2406.12913 [16]
Key contribution: Applies JEPA to spatial trajectories — sequences of GPS points over time. Learns trajectory representations by predicting latent features of masked trajectory segments [16].
Relevance: Trajectories are structurally similar to transaction sequences (ordered events with temporal and spatial features). Shows JEPA works on sequential data beyond images/video [16].
The Missing Piece: JEPA for Transaction Event Sequences
No paper has yet applied JEPA to financial event sequences or transaction networks. This is a clear gap and opportunity. The ingredients exist:
T-JEPA (tabular) shows JEPA works for structured data features [14].
Graph-JEPA shows JEPA works for graph-structured data [15].
T-JEPA (trajectory) shows JEPA works for temporal sequences [16].
PRAGMA provides the event-sequence architecture and the AML failure that motivates relational extensions [Prior Survey].
A hypothetical TX-JEPA (Transaction JEPA) could (see the sketch after this list):
Take an event-sequence encoder as the backbone
Add a cross-user JEPA objective: predict the latent representation of a counterparty's behavior from the focal user's transaction context
Capture relational signals (who transacts with whom, network patterns) without explicit graph construction
Train on blockchain data (public, graph-structured, billion-scale) as a proof of concept before deploying on private banking data
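A minimal sketch of the cross-user objective, with every name hypothetical (nothing here exists in the literature yet): seq_encoder is any per-user event-sequence encoder returning (batch, d) embeddings, and target_encoder its stop-gradient EMA copy, as in the generic JEPA sketch earlier in this part.

```python
import torch
import torch.nn as nn

def txjepa_auxiliary_loss(seq_encoder, target_encoder, predictor,
                          focal_events, counterparty_events):
    """Predict a counterparty's latent behavior from the focal user's context.

    focal_events / counterparty_events: paired (batch, seq_len, n_features)
    tensors where pair i actually transacted within the training window.
    """
    z_focal = seq_encoder(focal_events)               # (batch, d)
    with torch.no_grad():                             # stop-gradient target side
        z_peer = target_encoder(counterparty_events)  # (batch, d)
    return nn.functional.mse_loss(predictor(z_focal), z_peer)

# Composability: bolt the objective onto an existing FM loss, no redesign:
#   total_loss = task_loss + lam * txjepa_auxiliary_loss(...)
```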
Why JEPA Instead of Standard GNNs?
The GNN literature for financial fraud is large (see [17] for a comprehensive review). Why would JEPA add value over standard GNNs?
Pre-training at scale (hypothesized): GNNs are typically trained end-to-end on specific tasks. JEPA's self-supervised objective could in principle enable pre-training on massive unlabeled graphs — critical for financial data where labeled fraud cases are rare [17]. Caveat: Graph-JEPA [15] has only been validated on small standard benchmarks (PROTEINS, MUTAG, etc.), not billion-scale graphs. Scaling remains unproven.
Latent-space prediction avoids reconstruction: Masked modeling reconstructs exact tokens; JEPA instead predicts abstract latent representations [12, 14]. This avoids committing to surface-level reconstruction targets, which may help capture higher-level relational semantics — though this advantage for structured data is a theoretical argument, not yet empirically validated.
Composability: A JEPA-style cross-user objective could be added to existing event-sequence FMs as an auxiliary loss — no full architectural redesign needed.
Transfer across entities: JEPA learns representations that generalize across entities (wallets, users) without entity-specific features, potentially enabling cross-institution transfer.
Part IV: Synthesis — Connecting the Threads
The Opportunity Map
What we're currently researching
Blockchain Event-Sequence FM Benchmark
Assemble Elliptic (Bitcoin) [9], Elliptic2 [10], Elliptic++ [11], BERT4ETH phishing data [5], and EWE-1's Ethereum pipeline [7] into a standardized multi-task benchmark with:
Transaction-level tasks (fraud detection, anomaly classification)
Address-level tasks (phishing detection, wallet clustering)
Graph-level tasks (AML subgraph detection, Sybil identification)
Temporal generalization splits (train on 2023, test on 2024; a split sketch follows this list)
Multi-chain evaluation (Bitcoin plus EVM- and SVM-based blockchains)
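For the temporal split, a pandas sketch of the intended protocol; the column names (timestamp, address) are assumptions about the assembled benchmark's schema, not a published spec.

```python
import pandas as pd

def temporal_split(events: pd.DataFrame, train_year=2023, test_year=2024):
    # Assumes unix-second timestamps; adjust `unit` to the actual schema.
    ts = pd.to_datetime(events["timestamp"], unit="s")
    train = events[ts.dt.year == train_year]
    test = events[ts.dt.year == test_year]
    # Addresses first seen in the test year probe unseen-entity generalization.
    unseen = set(test["address"]) - set(train["address"])
    return train, test, unseen
```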
Graph-JEPA for Transaction Networks
Combine an event-sequence encoder with Graph-JEPA's subgraph prediction objective [15]:
Per-user event encoder produces user-level embeddings
Graph-JEPA objective predicts counterparty embeddings from focal-user context [15]
Train on the full EVM + SVM transaction graph (public, billion-scale)
Evaluate on AML-style tasks to test whether relational signals improve
Open Large Transaction Model for Blockchain
EWE-1 is a strong start (1.1B transactions, open weights) [7], but uses only a 64-transaction window and 31 features per transaction [7]. Possible extensions:
Multi-source events (DeFi interactions, NFT trades, governance votes, bridge transfers)
Longer context windows (thousands of events — would require efficient attention mechanisms)
Profile-state encoder for wallet-level attributes (age, balance, contract deployment history)
Multi-chain pretraining (Ethereum + L2s + Bitcoin + Solana)
Open Questions
Can JEPA's latent prediction close the relational gap? T-JEPA (tabular) [14] and Graph-JEPA [15] exist, but nobody has combined them for transaction event sequences. The hypothesis is promising but unvalidated.
Will blockchain benchmarks generalize to traditional finance? Blockchain transactions differ structurally from bank transactions (pseudonymous, token-based, smart-contract-mediated). Models that excel on blockchain data may not transfer to traditional banking.
What is the right tokenization for on-chain data? BERT4ETH uses small models (hidden_size=64) [5]. EWE-1 uses 31 hand-engineered features [7]. BlockFound uses modular tokenization [6]. No systematic comparison of tokenization strategies for blockchain data exists.
Can federated or synthetic approaches bridge the gap for private financial data? MBD (multimodal banking) is anonymized [3]. Blockchain is inherently public. But most financial event data remains inaccessible. Synthetic event-sequence generation is unexplored.
Prior Survey Reference
Claims about PRAGMA (including the −47% AML failure metric) originate from the team's prior survey document ("Foundation Models: Finance and Beyond Text", April 2026) and are marked with [Prior Survey] throughout.
Sources
Benchmarks
Goel et al., "HORIZON: A Benchmark for In-the-wild User Behaviour Modeling" (2026) — arXiv:2604.17259
Osin et al., "EBES: Easy Benchmarking for Event Sequences" (KDD 2025) — arXiv:2410.03399 · GitHub
"Multimodal Banking Dataset" (2024) — arXiv:2409.17587 · HuggingFace
Sakhno et al., PyTorch-Lifestream (IJCAI 2025) — Proceedings
Blockchain Foundation Models
Hu et al., "BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection" (WWW 2023) — arXiv:2303.18138
Yu et al., "BlockScan: Detecting Anomalies in Blockchain Transactions" (2024–2025) — arXiv:2410.04039 · OpenReview
sistemalabs, "Introducing EWE-1" (2026) — Blog · GitHub · HuggingFace
GetEmbed / ZKAI Labs — getembed.ai · Coinbase Wallet blog post · Product details
Blockchain Datasets
Elliptic Bitcoin transaction dataset — available via PyTorch Geometric
Elliptic2 money-laundering subgraph dataset — GitHub
Elliptic++ transactions and actors dataset — GitHub
JEPA Extensions
Assran et al., "Self-supervised Learning from Images with a Joint-Embedding Predictive Architecture" (I-JEPA, CVPR 2023) — arXiv:2301.08243
Bardes et al., "Revisiting Feature Prediction for Learning Visual Representations from Video" (V-JEPA, Meta FAIR, 2024); V-JEPA 2 and 2.1 (Meta FAIR, 2025–2026)
Thimonier et al., "T-JEPA: Augmentation-Free Self-Supervised Learning for Tabular Data" (ICLR 2025) — arXiv:2410.05016 · OpenReview
Skenderi et al., "Graph-level Representation Learning with Joint-Embedding Predictive Architectures" (Graph-JEPA, 2024) — arXiv:2309.16014 · OpenReview
"T-JEPA: A Joint-Embedding Predictive Architecture for Trajectory Similarity Computation" (2024) — arXiv:2406.12913
Surveys and Context
" Graph Neural Networks for Financial Fraud Detection: A Review" — arXiv:2411.05815
"14 JEPA Milestones as a Map of AI Progress" — Lifeboat News
PRAGMA Primary Source
Ostroukhov et al., "PRAGMA: Revolut Foundation Model" (2026) — arXiv:2604.08649
Prior Work (not numbered — referenced as [Prior Survey])
"Foundation Models for Finance" (April 2026) — Internal prior survey blogpost on getembed.ai/blog. Contains PRAGMA architecture details and benchmark analysis.