
Abstract
Foundation models — large-scale models pre-trained on broad data and adapted to downstream tasks — have expanded decisively beyond natural language. This survey maps the landscape as of April 2026 across five interconnected domains: (1) financial language models, (2) financial time-series models, (3) banking and transaction event-sequence models, (4) tabular foundation models, and (5) healthcare and behavioral event-sequence models. We anchor the survey on PRAGMA (Revolut, 2026), the first publicly documented production-scale banking foundation model, using its architecture, results, and related work as a lens into the broader field. We cover 40+ models and papers, identify seven cross-cutting themes, and surface the key open questions.
Key finding: The competitive frontier has shifted. For financial NLP, frontier general-purpose LLMs (o3, GPT-5.5) now outperform domain-fine-tuned models (BloombergGPT, FinGPT). But for structured, sequential, and event-level data — transactions, user actions, market microstructure — domain-specific foundation models show clear and growing advantages. The moat is in proprietary data, not algorithms.
Scope: Foundation models operating on financial data, user event sequences, transactions, tabular records, time series, and structured non-textual data.
Table of Contents
Financial Foundation Models: State of the Field
PRAGMA: A Case Study in Banking Foundation Models
Industrial Event Sequence Foundation Models
Financial Time-Series Foundation Models
Tabular Foundation Models
Behavioral, Healthcare, and Other Event-Sequence FMs
Cross-Cutting Themes
Open Questions and Gaps
Recommended Reading List
Sources
1. Financial Foundation Models: State of the Field
1.1 Taxonomy
A comprehensive survey by Tsinghua / E Fund Management / Hong Kong Polytechnic University ("Advancing Financial Engineering with Foundation Models," Engineering, 2025; DOI: 10.1016/j.eng.2025.11.029) categorizes financial FMs into three modalities:
Modality | Examples | Tasks |
|---|---|---|
Financial Language Models | BloombergGPT, FinGPT, ICE-INTENT, FinBERT | Sentiment analysis, compliance, report generation |
Financial Time-Series Models | Kronos, FinCast, MarketGPT, Time-LLM | Price forecasting, volatility prediction, synthetic data |
Financial Visual-Language Models | FinLLaVA, FinTral | Chart interpretation, FOMC projection parsing, table understanding |
To this taxonomy, the work surveyed here adds a fourth:
Modality | Examples | Tasks |
|---|---|---|
Banking Event-Sequence Models | PRAGMA, nuFormer, TransactionGPT, TREASURE | Credit scoring, fraud detection, churn, personalization |
1.2 Evolution
2019–2022: BERT-style encoder models (FinBERT)
2023–2024: Generative architectures (BloombergGPT: 50B params, 363B financial + 345B general tokens, ~$3M training cost; FinGPT: lightweight LoRA fine-tuning of open LLMs)
2025–2026: Reasoning-enhanced LLMs, domain-specific time-series FMs, and production banking FMs. Frontier general models surpass domain-fine-tuned models on NLP benchmarks.
1.3 Current Leaderboard (April 2026)
Per the Finance LLM Leaderboard 2026:
Observation | Detail |
|---|---|
SEC filing comprehension | o3 and GPT-5 lead on FinanceBench |
Multi-step calculations | Reasoning models (o3, DeepSeek-R2) pull ahead on QA and TAT-QA |
Domain fine-tuned models | BloombergGPT and FinGPT trail frontier general models on most tasks |
Conceptual knowledge | CFA-Bench separates genuine financial understanding from pattern matching |
Implication: Domain-specific language pretraining for finance may no longer be worth the cost. The action has moved to domain-specific pretraining for non-text financial data.
1.4 Benchmarks (2025–2026)
Benchmark | Focus | Source |
|---|---|---|
FinanceBench | SEC filing comprehension | General |
XFinBench | Complex financial problem solving | ACL 2025
IndFin-Bench | India-specific financial filings | |
CFA-Bench | Financial conceptual knowledge | General |
FinTSB | Time-series with realistic trading constraints | Survey |
FinTrace | LLM tool-calling for long-horizon financial tasks | |
FinMME | Visual QA with 11K+ pairs | Survey |
1.5 Industry Deployments (2026)
Bank of New York → GPT-5.5: 220+ internal use cases
Revolut → PRAGMA: Production banking FM for fraud, churn, personalization
Citi Wealth → "Citi Sky": AI-powered advisor (Google DeepMind), rolling out Summer 2026
AMCAP Global: Multi-model agentic AI framework for asset analysis
2. PRAGMA: A Case Study in Banking Foundation Models
2.1 Overview
Paper: "PRAGMA: Revolut Foundation Model" · Ostroukhov et al. (Revolut Research + NVIDIA)
Published: April 9, 2026 · arXiv:2604.08649
PRAGMA is a family of encoder-only Transformer foundation models (10M, 100M, 1B parameters) pre-trained on multi-source banking event sequences. It replaces siloed, task-specific models with a single shared backbone that transfers across credit scoring, fraud detection, lifetime value prediction, communication engagement, product recommendation, and more.
2.2 Why Not Just Use an LLM?
PRAGMA argues against text serialization of structured banking data:
Sequence inflation: Field names and delimiters inflate sequence lengths by 3–5×
Numerical destruction: Subword tokenization splits digits, losing magnitude and ordering
Heterogeneity: Banking events have variable-length records with mixed categorical, numerical, and free-text fields
2.3 Training Data
Dimension | Scale |
|---|---|
Users | 26 million |
Countries | 111 |
Events | 24 billion |
Tokens | 207 billion |
Time range | 25 months (2023–2025) |
Event Sources: Transactions (card payments, transfers), App (navigation, product usage), Trading (stock/crypto), Communication (push notifications, emails).
Profile State: Static contextual attributes (balance quantile, plan tier, service region) plus life-long events — timestamped milestones (e.g., first_topup) that survive truncation.
2.4 Architecture: Three-Encoder Design
Profile State Encoder: Processes static user attributes + life-long event timestamps via RoPE
Event Encoder: Processes each event independently; adds calendar features (hour/day/month via fixed-period sine/cosine)
History Encoder: Contextualizes the concatenation of [USR] + all [EVT] embeddings; uses RoPE on log-seconds-to-most-recent-event
Variant | Parameters | GPUs | Training Time |
|---|---|---|---|
PRAGMA-S | 10M | 16× H100 | ~2 days |
PRAGMA-M | 100M | 16× H100 | ~2 weeks |
PRAGMA-L | 1B | 32× H100 | ~2 weeks |
All variants: GELU, pre-norm LayerNorm, dropout 0.1, Muon + AdamW, bf16 mixed precision.
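To make the data flow concrete, here is a minimal PyTorch sketch of the three-encoder design. All dimensions and module choices are illustrative, not PRAGMA's actual code; the temporal RoPE, padding masks, and the paper's [EVT] summary token are simplified away (mean pooling stands in for the event summary).

```python
import torch
import torch.nn as nn

class ThreeEncoderBackbone(nn.Module):
    """Illustrative sketch of PRAGMA's three-encoder data flow (not the released code)."""
    def __init__(self, d=256, n_heads=4, n_layers=2):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(
                d, n_heads, 4 * d, dropout=0.1, activation="gelu",
                norm_first=True, batch_first=True)  # pre-norm + GELU, as in the paper
            return nn.TransformerEncoder(layer, n_layers)
        self.profile_encoder = encoder()  # static attributes + life-long events
        self.event_encoder = encoder()    # each event's tokens, encoded independently
        self.history_encoder = encoder()  # [USR] + per-event summaries, contextualized

    def forward(self, profile_tokens, event_tokens):
        # profile_tokens: (B, P, d); event_tokens: (B, N_events, T, d).
        # RoPE on timestamps and padding masks are omitted for brevity.
        B, N, T, d = event_tokens.shape
        usr = self.profile_encoder(profile_tokens)[:, 0]           # (B, d) user summary, "[USR]"
        evt = self.event_encoder(event_tokens.reshape(B * N, T, d))
        evt = evt.mean(dim=1).reshape(B, N, d)                     # mean pool stands in for "[EVT]"
        seq = torch.cat([usr.unsqueeze(1), evt], dim=1)            # [USR] + all event summaries
        return self.history_encoder(seq)                           # cross-event contextualization
```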
2.5 Tokenization: Key–Value–Time
Each data point decomposes into three components:
Component | Vocabulary | Encoding |
|---|---|---|
Keys (field names) | ~60 tokens | Single token per semantic type |
Values | ~28K tokens | Numerical → percentile buckets; Categorical → single token; Text → BPE subwords |
Time | Continuous | Log-seconds |
Token embedding: x = PosEmb(E(key) + E(value)), where positions index within a field, not across fields.
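A hedged sketch of how such key–value–time tokenization might look in practice. The bucket count, vocabulary sizes, and the hash-based categorical lookup are stand-ins for illustration, not PRAGMA's actual scheme:

```python
import math
import numpy as np

def fit_percentile_edges(train_values, n_buckets=100):
    """Fit percentile bucket edges for one numerical field (bucket count illustrative)."""
    return np.quantile(train_values, np.linspace(0, 1, n_buckets + 1)[1:-1])

def tokenize_pair(key, value, edges, key_vocab, value_offset):
    """Map one key-value pair to (key_token, value_token)."""
    key_tok = key_vocab[key]  # one token per semantic field type (~60 keys total)
    if isinstance(value, (int, float)):
        # numerical values -> percentile bucket index
        val_tok = value_offset + int(np.searchsorted(edges, value))
    else:
        # stand-in for the categorical / BPE-subword vocabulary lookup
        val_tok = value_offset + hash(str(value)) % 1000
    return key_tok, val_tok

def encode_time(event_ts, most_recent_ts):
    """Time never enters the discrete vocabulary: it stays continuous as log-seconds."""
    return math.log1p(most_recent_ts - event_ts)

# The token embedding then follows the formula above:
#   x = PosEmb(E(key_tok) + E(val_tok)), positions indexed within the field.
```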
2.6 Pre-training: Masked Modeling
BERT-style MLM adapted for structured events. For each masked token, the MLM head receives a 3d-dimensional input formed by concatenating three d-dimensional vectors:
Event Encoder output (local within-event context)
History Encoder output at the [EVT] position (cross-event context)
History Encoder output at the [USR] position (user-level context)
Masking strategy (three complementary sources):
Type | Rate | Purpose |
|---|---|---|
Token-level | 15% | Standard token reconstruction |
Event-level | 10% | Reconstruct entire events from history |
Key-level | 10% | Predict values given other keys + context |
A small fraction of masks is replaced with [UNK] (excluded from the loss) as a form of input dropout.
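The head and the combined masking logic might look roughly like this sketch; shapes, rates, and the per-key simplification are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ConcatMLMHead(nn.Module):
    """Sketch of the 3d-dimensional MLM head: local, cross-event, and user-level
    context are concatenated before predicting each masked token."""
    def __init__(self, d, vocab_size):
        super().__init__()
        self.proj = nn.Linear(3 * d, vocab_size)

    def forward(self, event_out, evt_ctx, usr_ctx):
        # event_out: (M, d) Event Encoder output at the M masked positions
        # evt_ctx:   (M, d) History Encoder output at the owning event's position
        # usr_ctx:   (M, d) History Encoder output at the [USR] position
        return self.proj(torch.cat([event_out, evt_ctx, usr_ctx], dim=-1))

def sample_masks(token_to_event, token_to_key, p_tok=0.15, p_evt=0.10, p_key=0.10):
    """Union of the three masking sources (rates from the table above; logic simplified).
    token_to_event / token_to_key: LongTensors mapping each token to its event / key id."""
    n = token_to_event.numel()
    tok_mask = torch.rand(n) < p_tok                               # token-level
    evt_mask = torch.rand(int(token_to_event.max()) + 1) < p_evt   # whole events
    key_mask = torch.rand(int(token_to_key.max()) + 1) < p_key     # all values of a key
    return tok_mask | evt_mask[token_to_event] | key_mask[token_to_key]
```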
2.7 Engineering
Sequence packing: FlashAttention varlen kernel eliminates padding overhead. 2–5× throughput improvement.
Dynamic batching: Records sharded by event count; greedy packing within a fixed GPU memory budget (see the sketch after this list).
Truncation: Max 24 tokens/event (0.01% affected), max 6,500 events/user (most recent retained).
Storage: LMDB user index + Parquet event shards partitioned by event count.
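A minimal sketch of greedy packing under an event budget; the budget unit and the largest-first heuristic are assumptions, not Revolut's implementation:

```python
def greedy_pack(records, max_events_per_batch):
    """Pack (user_id, n_events) records into batches under a fixed event budget,
    a rough proxy for packing within a fixed GPU memory budget."""
    batches, current, budget = [], [], max_events_per_batch
    for user_id, n_events in sorted(records, key=lambda r: -r[1]):  # largest first
        if n_events > budget and current:       # would overflow: close the batch
            batches.append(current)
            current, budget = [], max_events_per_batch
        current.append(user_id)
        budget -= n_events                      # truncation (max 6,500 events) bounds this
    if current:
        batches.append(current)
    return batches

# Example with a 10,000-event budget:
# greedy_pack([("u1", 6500), ("u2", 3000), ("u3", 400)], 10_000) -> [["u1", "u2", "u3"]]
```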
2.8 Results (Relative to Internal Production Baselines)
Caveat: Only relative improvements reported. Absolute metrics withheld for commercial sensitivity.
Main Results (PRAGMA-L + LoRA)
Task | Metric | Relative Change |
|---|---|---|
Credit Scoring | PR-AUC | +130.2% |
Credit Scoring | ROC-AUC | +12.4% |
Communication Engagement | PR-AUC | +79.4% |
Communication Engagement | ROC-AUC | +20.4% |
External Fraud | Precision | +64.7% |
External Fraud | Recall | +64.7% |
Uplift (AUUC) | — | +163.7% |
Scaling (PRAGMA-S → L, LoRA)
Task | Metric | S→M | S→L |
|---|---|---|---|
Credit Scoring | PR-AUC | +16.3% | +35.2% |
External Fraud | Recall | +24.8% | +23.5% |
Product Rec. | mAP | +18.9% | +27.0% |
LTV | PR-AUC | +1.5% | +3.0% |
Scaling gains are task-dependent: credit scoring benefits enormously; LTV saturates early.
Pre-training Effect (LoRA vs. Scratch, PRAGMA-M)
Task | PR-AUC Gain |
|---|---|
Comm. Engagement | +18.6% |
Credit Scoring | +13.0% |
Product Rec. (mAP) | +10.3% |
Profile State Effect (Full vs. Event-only)
Task | Impact |
|---|---|
Fraud Recall | +85.6% |
Credit Scoring PR-AUC | +31.8% |
Comm. Engagement PR-AUC | −3.0% (hurts!) |
Profile state is hugely important for fraud/credit; slightly harmful for communication engagement.
Optional: Pre-trained Text Encoder (Nemotron-1B-v2)
Credit Scoring PR-AUC: +16.1%. Product Rec. mAP: −6.4%. Trade-off: +18% training latency. Kept as opt-in.
2.9 Where PRAGMA Fails: Anti-Money Laundering
Task | Metric | PRAGMA vs. Baseline |
|---|---|---|
AML | F₀.₅ | −47.1% |
AML is inherently relational — requires cross-user network signals. PRAGMA processes users in isolation. This is a fundamental architectural limitation.
2.10 Critical Assessment
Strengths:
First production-scale, multi-source banking FM with public documentation
Principled tokenization avoiding text serialization pitfalls
Broadest task evaluation in the financial FM literature (6+ tasks)
Honest about failures (AML)
NVIDIA co-authorship; solid engineering details
Weaknesses:
No absolute metrics → baseline strength unknown
No public model or data → unreproducible externally
Internal benchmarks only → no cross-paper comparison possible
Relational blindness → fundamental gap for graph-structured tasks
No comparison to simpler baselines (e.g., GBDT + RNN)
3. Industrial Event Sequence Foundation Models
These models share PRAGMA's thesis: behavioral event sequences contain transferable representations that outperform hand-crafted features across multiple tasks. None are open-source.
3.1 Meta — HSTU / Generative Recommenders (ICML 2024)
Paper: "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations"
arXiv:2402.17152
The foundational paper for industrial event-sequence FMs:
Architecture: HSTU — pointwise aggregated attention + SiLU gating replacing softmax (sketched after this list). 5.3–15.2× faster than FlashAttention-2.
Scale: Up to 1.5 trillion parameters (including embeddings). Power-law scaling across 3 orders of magnitude.
Training: Autoregressive next-action prediction. Each sequence generates O(N) training signals.
Results: +12.4% topline improvement across Meta production surfaces.
Serving: M-FALCON enables 285× more complex models at same throughput.
Key insight: Actions encode richer intent than words.
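A single-head sketch of the pointwise-attention idea, under simplifying assumptions (one head, no relative position biases, plain length normalization); see the paper for the full layer:

```python
import torch
import torch.nn.functional as F

def hstu_style_attention(x, wq, wk, wv, wu):
    """Pointwise aggregated attention: an elementwise SiLU on the score matrix
    replaces the softmax, and the aggregate is gated by a SiLU-activated
    projection of the input. Single-head simplification of HSTU's layer."""
    q, k, v, u = x @ wq, x @ wk, x @ wv, x @ wu   # wv and wu must share an output dim
    n = x.shape[-2]
    scores = F.silu(q @ k.transpose(-2, -1)) / n  # no softmax: scores are not a distribution
    return (scores @ v) * F.silu(u)               # elementwise gating

# Shapes: x: (n, d); wq, wk: (d, d_qk); wv, wu: (d, d_v)
```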
3.2 Nubank — nuFormer (2025)
Paper: "Your Spending Needs Attention: Modeling Financial Habits with Transformers"
arXiv:2507.23267
Closest predecessor to PRAGMA:
Scale: O(100B) transactions, 100M+ members. 24M–330M parameters.
Architecture: Transformer with text-like tokenization + DCNv2 joint fusion.
Training: Next-token prediction.
Results: 1.25% relative AUC lift (3.3× typical launch impact). 4.4% churn reduction.
Limitation vs. PRAGMA: Single event source (transactions only), no explicit profile state.
3.3 Visa — TransactionGPT (2025) + TREASURE (2025)
TransactionGPT: arXiv:2511.08939
Architecture: 3D-Transformer with Virtual Token Layer (~10M params). Three coupled transformers for features, metadata, temporal dimensions.
Results: +22% on production anomaly detection. 92% fewer params and 300× faster inference than Llama2-7B on MCC prediction.
TREASURE: arXiv:2511.19693
Transformer FM for high-volume payment understanding. Extended with LLM-based sentence embeddings (arXiv:2601.05271).
3.4 Yandex Music — ARGUS (KDD 2026)
Paper: "Scaling Recommender Transformers to One Billion Parameters"
arXiv:2507.15994
Scale: 3.2M → 1B parameters. Power-law scaling confirmed independently from HSTU.
Training: Dual pre-training — Next Item Prediction + Feedback Prediction (RL-inspired).
Results: +2.26% listening time, +6.37% likes — largest DL improvement in platform history.
Key insight: Architecture matters less than training objective and scale. Standard Transformer matches HSTU with the right task.
3.5 Pinterest — TransAct V2 (2025)
Architecture: Tiny transformer (2 layers, d=64) in wide-and-deep CTR model. Pragmatic latency-first design for 500M+ users.
Innovation: SKUT Triton kernel (6.6× speedup). 103–338× end-to-end latency improvement.
Results: +6.35% repin, −12.80% hides online.
3.6 Mastercard — Large Tabular Model (2026)
Press release. No peer-reviewed paper. "Large tabular model" on billions of anonymized transactions.
3.7 Stripe — Payment Foundation Model (2025)
TechCrunch, May 2025. No peer-reviewed paper. Tens of billions of transactions. NVIDIA partnership.
3.8 Open Banking Foundational Model (2025)
arXiv:2511.12154. Academic work. Multimodal FM integrating structured transaction attributes with text descriptions via MLM.
Comparison Matrix
Model | Builder | Params | Events | Sources | Architecture | Objective | Tasks | Open? |
|---|---|---|---|---|---|---|---|---|
PRAGMA | Revolut | 10M–1B | 24B | Txns, app, trading, comms | Encoder, 3-branch | Masked modeling | 6+ | ❌ |
HSTU | Meta | ≤1.5T | Trillions | User actions | Decoder, HSTU | Next-action | Rec. | ❌ |
nuFormer | Nubank | 24M–330M | ~100B | Transactions | Encoder+DCNv2 | Next-token | Credit, churn | ❌ |
TransactionGPT | Visa | ~10M | Billions | Payments | 3D-Transformer | Next-txn+supervised | Anomaly, MCC | ❌ |
TREASURE | Visa | — | Billions | Payments | Transformer | Self-supervised | Fraud, personalization | ❌ |
ARGUS | Yandex | 3.2M–1B | Billions | Listening | Transformer | NIP+feedback | Rec. | ❌ |
TransAct V2 | Pinterest | Tiny | O(10⁴)/user | Actions | Transformer in CTR | Next Action Loss | CTR | ❌ |
Mastercard LTM | Mastercard | — | Billions | Payments | "LTM" | — | Fraud | ❌ |
Stripe PFM | Stripe | — | Tens of B | Payments | — | — | Routing | ❌ |
4. Financial Time-Series Foundation Models
4.1 Kronos (AAAI 2026)
Paper: "Kronos: A Foundation Model for the Language of Financial Markets"
arXiv:2508.02739 · Code · Open-source (MIT)
Architecture: Decoder-only Transformer with specialized OHLCV tokenizer that discretizes candlestick data into hierarchical tokens.
Scale: 12B+ K-line records from 45 global exchanges. Model family: 4.1M–499M params.
Results (zero-shot): +93% RankIC over the leading TSFM for price forecasting; 9% lower MAE on volatility forecasting; 22% better generative fidelity.
Significance: First model to show that the pre-training paradigm works for financial time series — previous TSFMs often underperformed non-pretrained architectures.
4.2 FinCast (CIKM 2025)
Paper: "FinCast: A Foundation Model for Financial Time-Series Forecasting"
arXiv:2508.19609 · Code · Open-source (Apache 2.0)
Architecture: Decoder-only Transformer + Mixture of Experts (MoE).
Scale: 20B+ financial time points across diverse domains and resolutions.
Innovation: PQ-Loss (joint point + probabilistic forecasting). MoE for domain specialization.
Results: Robust zero-shot across domains without fine-tuning.
4.3 Chronos (TMLR 2024)
Paper: "Chronos: Learning the Language of Time Series"
arXiv:2403.07815 · Code · Open-source (Apache 2.0)
Architecture: T5-based. Tokenizes continuous values via scaling + quantization into discrete bins (sketched after this list).
Scale: 20M–710M params. Trained on 27 public datasets + KernelSynth synthetic data (1M series).
Results: Competitive zero-shot performance on 42 benchmarks without any time-series-specific architectural changes.
Key insight: Standard language model architectures work directly on tokenized time series.
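A sketch of the scale-then-quantize idea; the bin count and clipping range here are placeholders, not Chronos's exact configuration:

```python
import numpy as np

def scale_quantize(series, n_bins=4096, low=-15.0, high=15.0):
    """Chronos-style tokenization sketch: mean-scale, clip, uniform-bin.
    Returns integer token ids a standard LM vocabulary can consume."""
    scale = np.mean(np.abs(series)) + 1e-8            # mean scaling
    scaled = np.clip(series / scale, low, high)
    edges = np.linspace(low, high, n_bins + 1)
    tokens = np.digitize(scaled, edges[1:-1])         # ids in [0, n_bins - 1]
    return tokens, scale

def dequantize(tokens, scale, n_bins=4096, low=-15.0, high=15.0):
    edges = np.linspace(low, high, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2            # token id -> bin center value
    return centers[tokens] * scale
```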
4.4 Time-LLM (ICLR 2024)
Paper: "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models"
arXiv:2310.01728 · Code · 1000+ citations
Approach: Repurposes frozen pre-trained LLMs (Llama-7B, GPT-2) via lightweight "reprogramming" layers.
Innovation: Cross-attention alignment between time-series patches and text prototypes (sketched below); Prompt-as-Prefix enriches input with domain knowledge.
Trade-off: Requires per-dataset fine-tuning of reprogramming layers (LLM stays frozen).
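A sketch of a reprogramming layer under two assumptions: prototypes are learned directly as free parameters (the paper derives them from the frozen LLM's word embeddings), and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ReprogrammingLayer(nn.Module):
    """Time-series patches attend over a small set of text prototypes so the
    patches land in the frozen LLM's input space (illustrative sketch)."""
    def __init__(self, d_patch, d_llm, n_prototypes=100, n_heads=4):
        super().__init__()
        # Learned directly here for simplicity; Time-LLM maps word embeddings
        # to prototypes instead.
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, d_llm) * 0.02)
        self.patch_proj = nn.Linear(d_patch, d_llm)
        self.attn = nn.MultiheadAttention(d_llm, n_heads, batch_first=True)

    def forward(self, patches):                      # patches: (B, n_patches, d_patch)
        q = self.patch_proj(patches)                 # queries come from the time series
        proto = self.prototypes.expand(patches.size(0), -1, -1)
        out, _ = self.attn(q, proto, proto)          # cross-attention onto prototypes
        return out                                   # fed to the frozen LLM as input embeddings
```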
Design Tension: Tokenize vs. Reprogram
Strategy | Models | Pros | Cons |
|---|---|---|---|
Tokenize into bins | Chronos, Kronos | Standard LM architectures; zero-shot capable | Precision loss; bucket boundaries are data-specific |
Reprogram frozen LLM | Time-LLM | Leverages pre-trained knowledge; cheap | Per-dataset fine-tuning; fragile alignment |
Type-specific encoding | PRAGMA | Preserves data types natively | Custom infrastructure; domain-specific |
5. Tabular Foundation Models
Tabular FMs address fixed-schema row data — related to but distinct from event sequences.
5.1 The TabPFN Lineage
Version | Year | Venue | Scale | Key Advance |
|---|---|---|---|---|
TabPFN | 2023 | ICLR | ≤1K samples, 100 features | In-context learning on synthetic data from structural causal model priors |
TabPFN v2 | 2025 | Nature | ≤10K samples, 500 features | Categorical features, missing values, regression |
TabPFN-2.5 | 2025 | arXiv | ≤50K samples, 2K features | 100% win rate vs XGBoost (small-medium); distillation engine |
Core idea: train a Transformer on synthetic data from causal priors, then do in-context learning at inference (no gradients); see the usage sketch below. All open-source.
TabPFN v2: Nature · HuggingFace
TabPFN-2.5: arXiv:2511.08667
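A minimal usage sketch, assuming the open-source tabpfn package and its scikit-learn-style interface; note that fit() mostly caches the training set, since prediction is in-context:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # assumes the open-source tabpfn package

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()           # weights pre-trained on synthetic data from causal priors
clf.fit(X_train, y_train)          # no gradient descent: the training set becomes the context
proba = clf.predict_proba(X_test)  # inference = one forward pass of in-context learning
print(proba[:3])
```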
5.2 Beyond TabPFN
Model | Year | Venue | Key Innovation | arXiv |
|---|---|---|---|---|
Mitra | 2025 | NeurIPS | Mixed synthetic priors (SCM + tree-based); 72M params; outperforms TabPFN v2 | |
CARTE | 2024 | ICML | Graph-based pretraining for cross-table transfer | |
TabICL v2 | 2026 | arXiv | Column-then-row attention; scales to ~500K samples; best open-source option | |
TabForestPFN | 2024 | arXiv | Forest-based synthetic data for more complex decision boundaries | |
KumoRFM-2 | 2026 | arXiv | FM for relational (multi-table) data via in-context learning |
Additional models: TabDPT, MotherNet (Microsoft), TabFlex (Microsoft), ContextTab (SAP), LimiX, Orion variants. See the Mindful Modeler overview (2026) listing 16+ tabular FMs.
5.3 Benchmarks and Surveys
OmniTabBench (arXiv:2604.06814): Comprehensive GBDTs vs. NNs vs. FMs comparison.
Survey: "Representation Learning for Tabular Data" (arXiv:2504.16109)
FMSD Workshop at ICML 2025 and 2026: 99 submissions, 500+ participants.
5.4 Cross-Modal Self-Supervised Learning
data2vec (Meta, ICML 2022; arXiv:2202.03555): First framework using the same self-supervised algorithm across speech, vision, and NLP. Predicts contextualized latent representations via self-distillation (teacher = EMA of student). Achieved SOTA on ImageNet (ViT-B/L). Open-source in fairseq.
Relevance: Demonstrates that masked prediction of rich contextualized targets (not raw inputs) can work across modalities. PRAGMA's MLM head, which concatenates local + cross-event + user-level context before prediction, echoes this principle.
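The mechanical core of that recipe is small. Below is a hedged sketch of the data2vec teacher update and target construction; the layer averaging and masking specifics are simplified:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, tau=0.999):
    """data2vec-style self-distillation: the teacher is an exponential moving
    average of the student and supplies contextualized regression targets."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1.0 - tau)

# Per training step (schematically):
#   targets = teacher(full_input) contextualized latents (top-K layer average in the paper)
#   preds   = student(masked_input)
#   loss    = smooth_l1(preds[masked_positions], targets[masked_positions])
#   loss.backward(); optimizer.step(); ema_update(teacher, student)
```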
6. Behavioral, Healthcare, and Other Event-Sequence FMs
6.1 Behavioral / Clickstream FMs
Paper | Year | Key Idea | Source |
|---|---|---|---|
BehaveGPT | 2025 | Transformer + DRO for user behavior prediction | |
Large Behavioral Models | 2026 | Next-event prediction across retail/payments (Unbox AI) | |
NEST | 2026 | Event streams as sequences of multisets; Masked Set Modeling | |
ClickstreamGPT | 2025 | GPT-style generation for e-commerce clickstream | |
TRACE | 2024 | Multi-session clickstream embeddings via multi-task learning |
6.2 Self-Supervised Learning for Event Sequences
An active sub-area exploring pre-training objectives:
Contrastive + Generative fusion: Yugay & Zaytsev (2024). arXiv:2408.09995
MLEM: Generative and contrastive as distinct modalities. arXiv:2401.15935
PyTorch-Lifestream (IJCAI 2025): Open-source library implementing CoLES, CPC, RTD, BERT-style pretraining (a CoLES-style loss is sketched after this list). Proceedings
Mixed-type Event Sequences (NeurIPS 2025): Heterogeneous events across medicine, finance, remote sensing. Poster
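To make the contrastive family concrete, here is a plain-PyTorch sketch of a CoLES-style loss (not the PyTorch-Lifestream API): subsequences sampled from the same user's history are positives, everything else in the batch is a negative.

```python
import torch
import torch.nn.functional as F

def coles_style_loss(embeddings, user_ids, temperature=0.1):
    """InfoNCE over subsequence embeddings; assumes each user contributes at
    least two subsequences per batch. embeddings: (N, d); user_ids: (N,)."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                          # cosine-similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (user_ids.unsqueeze(0) == user_ids.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))        # drop self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    return -log_prob[pos].mean()                           # pull same-user pairs together
```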
6.3 Healthcare Event FMs
Healthcare is the closest parallel to financial event FMs — similar challenges (heterogeneous events, irregular timing, long histories, privacy).
Model | Year | Scale | Architecture | Key Innovation | arXiv |
|---|---|---|---|---|---|
Apollo | 2026 | 25B events, 7.2M patients, 28 modalities | Multimodal temporal FM | 322 clinical tasks | |
EHRMamba | 2025 | — | Mamba (subquadratic) | Addresses O(n²) for long EHR | |
RAVEN | 2026 | >1M patients | Generative, next-visit | Recurrence-aware regularization | |
NEST | 2026 | — | Hierarchical multiset | Models co-occurring events |
Apollo mirrors PRAGMA's approach at similar scale: 25B medical events, multi-source, multi-task. Cross-pollination between financial and healthcare event FMs is an underexplored opportunity.
6.4 Recommendation FMs
Model | Year | Venue | Key Innovation | Source |
|---|---|---|---|---|
RecGPT | 2025 | EMNLP | Zero-shot cross-domain via unified item tokenization | |
RecBase | 2025 | EMNLP | Domain-agnostic FM for zero-shot rec. |
Surveys: arXiv:2504.16420 (2025), arXiv:2402.11143 (2024).
6.5 Other Domains
Graph FMs: Survey at arXiv:2505.15116 (2025)
IoT / CPS: arXiv:2503.12282 (2025)
Sensor-based HAR: Survey at arXiv:2604.02711 (2026)
7. Cross-Cutting Themes
7.1 The Tokenization Problem
The central unsolved problem for non-text FMs. No consensus exists.
Strategy | Used By | Trade-off |
|---|---|---|
Serialize as text | Naive baseline | Universal but inflated, numerical info destroyed |
Percentile bucketing | PRAGMA, Chronos | Compact, preserves magnitude; loses precision |
Learned per-field embeddings | TabTransformer, FT-Transformer | Native structure; requires fixed schema |
OHLCV hierarchical tokenizer | Kronos | Domain-optimized; not transferable |
Reprogramming via cross-attention | Time-LLM | Reuses LLM knowledge; per-dataset fine-tuning |
Key–value–time decomposition | PRAGMA | Type-aware heterogeneous encoding; custom infra |
Synthetic prior fitting | TabPFN, Mitra | No real data needed; limited to tabular (not sequences) |
7.2 Pre-training Objective: No Consensus
Objective | Used By | Analogy |
|---|---|---|
Masked modeling | PRAGMA, BERT4Rec, NEST, MLEM | BERT-style |
Next-event prediction | HSTU, ARGUS, nuFormer, RAVEN | GPT-style |
Contrastive learning | CoLES, MLEM, BYB | SimCLR/CLIP-style |
Dual (NIP + feedback) | ARGUS | RL-inspired |
Joint self-supervised + supervised | TransactionGPT | Multi-task |
Synthetic prior fitting | TabPFN, Mitra | Bayesian meta-learning |
The right objective depends on the downstream use case: encoder-only models naturally pair with masked modeling (discriminative tasks); decoder-only with next-event prediction (generative tasks).
7.3 Scaling Laws for Structured Data
Paper | Scaling Behavior |
|---|---|
HSTU (Meta) | Power-law across 3 orders of magnitude — comparable to GPT-3/LLaMA-2 |
ARGUS (Yandex) | Linear on log scale from 3.2M to 1B |
PRAGMA (Revolut) | Task-dependent: credit scoring benefits enormously; LTV saturates early |
TabPFN lineage | Scales primarily in data capacity (1K → 50K) rather than model size |
No emergent capabilities observed. Unlike LLMs with qualitative jumps at scale, structured data FMs show smooth, task-dependent scaling.
7.4 Encoder-Only vs. Decoder-Only
Architecture | Used By | Best For |
|---|---|---|
Encoder-only (bidirectional) | PRAGMA, BERT4Rec, NEST | Discriminative (classification, scoring) |
Decoder-only (autoregressive) | HSTU, Kronos, FinCast, Chronos | Generative (forecasting, next-event) |
Hybrid | TransactionGPT, nuFormer | Multi-objective |
PRAGMA's encoder-only choice is unusual in a field trending decoder-only. Justified: "Our primary goal is transferable representations for discriminative financial tasks, rather than open-ended generation."
7.5 The "None of This Is Open" Problem
Industrial event-sequence FMs: 0/9 are open-source.
Time-series FMs: 3/4 are open-source (Chronos, Kronos, FinCast).
Tabular FMs: Majority are open-source (TabPFN, Mitra, TabICL).
The industrial event-sequence community treats data as the moat. The academic community values reproducibility. This split makes cross-paper comparison of event-sequence FMs effectively impossible.
PyTorch-Lifestream (IJCAI 2025) is the main open-source effort to democratize event-sequence pretraining methods.
7.6 The Competitive Landscape in Financial Services
As of April 2026, at least six major companies (five financial institutions, plus Meta for its foundational role) have published or announced transaction/event FMs:
Company | Model | Peer-Reviewed Paper? |
|---|---|---|
Revolut | PRAGMA | ✅ |
Nubank | nuFormer | ✅ |
Visa | TransactionGPT + TREASURE | ✅ (both) |
Mastercard | LTM | ❌ (press only) |
Stripe | Payment FM | ❌ (press only) |
Meta | HSTU/GR | ✅ |
This convergence marks the emergence of behavioral representation as a new competitive layer in financial services.
7.7 Relational / Graph Structure: The Big Gap
PRAGMA's −47% AML result is symptomatic of a field-wide limitation. Single-user / single-row models cannot capture cross-entity relationships needed for:
Anti-money laundering (transaction networks)
Network fraud (coordinated attacks)
Syndicated lending risk
Social contagion effects
KumoRFM-2 (arXiv:2604.12596) addresses multi-table relational data but not cross-user graph structure. The Graph Foundation Models survey (arXiv:2505.15116) maps the broader space but doesn't specifically address financial transaction graphs.
8. Open Questions and Gaps
No public benchmark for event sequence FMs. NLP has GLUE; vision has ImageNet; tabular has OmniTabBench. Event sequences have nothing. Every paper evaluates on proprietary data.
Optimal pre-training objective unknown. No systematic ablation of masked modeling vs. next-event prediction vs. contrastive learning on the same data.
Cross-institution transfer untested. Can PRAGMA-style representations transfer to a different bank's data distribution? To a different country's financial system? No evidence either way.
Relational structure unaddressed. PRAGMA's AML failure is just one symptom. No event-sequence FM handles cross-user graph structure.
Regulatory barriers to replication. GDPR, MiFID II, HIPAA make public benchmarks from real data structurally impossible. Federated and synthetic approaches are immature.
No emergent capabilities. Scaling is smooth and task-dependent. Whether structured data FMs can exhibit LLM-like phase transitions remains open.
Efficiency Pareto frontier poorly characterized. TransAct V2 (2 layers, d=64) achieves strong results; HSTU uses 1.5T params. The right operating point for different latency/accuracy trade-offs is unclear.
Domain-specific vs. general language pretraining. For NLP tasks, frontier general LLMs now win. Is there a crossover point where domain-specific language pretraining becomes worthwhile again, or has that ship sailed permanently?
9. Recommended Reading List
Tier 1: Essential
Zhai et al., "Actions Speak Louder than Words" (ICML 2024) — arXiv:2402.17152
Ostroukhov et al., "PRAGMA: Revolut Foundation Model" (2026) — arXiv:2604.08649
Braithwaite et al., "nuFormer" (2025) — arXiv:2507.23267
Hollmann et al., "TabPFN v2" (Nature, 2025) — Nature
Tier 2: Important
Dou et al., "TransactionGPT" (2025) — arXiv:2511.08939
Khrylchenko et al., "ARGUS" (KDD 2026) — arXiv:2507.15994
Ansari et al., "Chronos" (TMLR 2024) — arXiv:2403.07815
Shi et al., "Kronos" (AAAI 2026) — arXiv:2508.02739
Yeh et al., "TREASURE" (2025) — arXiv:2511.19693
Zhang et al., "Mitra" (NeurIPS 2025) — arXiv:2510.21204
"Apollo" (2026) — arXiv:2604.18570
Tier 3: Surveys & Tools
"FM-Powered Recommender Systems" (2025) — arXiv:2504.16420
"Foundation Models for Recommender Systems" (2024) — arXiv:2402.11143
"Representation Learning for Tabular Data" (2025) — arXiv:2504.16109
"Graph Foundation Models" (2025) — arXiv:2505.15116
Chen et al., "Advancing Financial Engineering with Foundation Models" (Engineering, 2025) — DOI
PyTorch-Lifestream (IJCAI 2025) — Proceedings
OmniTabBench (2026) — arXiv:2604.06814
10. Sources
Primary Sources (paper read or abstract verified)
# | Paper | Identifier |
|---|---|---|
1 | Zhai et al., "Actions Speak Louder than Words" (ICML 2024) | |
2 | Ostroukhov et al., "PRAGMA" (2026) | |
3 | Braithwaite et al., "nuFormer" (2025) | |
4 | Dou et al., "TransactionGPT" (2025) | |
5 | Yeh et al., "TREASURE" (2025) | |
6 | Khrylchenko et al., "ARGUS" (KDD 2026) | |
7 | Xia et al., "TransAct V2" (2025) | |
8 | Ansari et al., "Chronos" (TMLR 2024) | |
9 | Jin et al., "Time-LLM" (ICLR 2024) | |
10 | Shi et al., "Kronos" (AAAI 2026) | |
11 | Zhu et al., "FinCast" (CIKM 2025) | |
12 | Hollmann et al., "TabPFN" (ICLR 2023) | |
13 | Hollmann et al., "TabPFN v2" (Nature, 2025) | |
14 | Grinsztajn et al., "TabPFN-2.5" (2025) | |
15 | Zhang et al., "Mitra" (NeurIPS 2025) | |
16 | Baevski et al., "data2vec" (ICML 2022) | |
17 | Kim et al., "CARTE" (ICML 2024) | |
18 | "TabICL v2" (2026) | |
19 | Fey et al., "KumoRFM-2" (2026) | |
20 | "BehaveGPT" (2025) | |
21 | "NEST" (2026) | |
22 | "Apollo" (2026) | |
23 | "EHRMamba" (ML4H 2025) | |
24 | "RAVEN" (2026) | |
25 | "RecGPT" (EMNLP 2025) | |
26 | "Open Banking FM" (2025) | |
27 | Polleti et al., "TREASURE + LLM embeddings" (2025) | |
28 | "OmniTabBench" (2026) | |
29 | Dong et al., "LLM Agents in Finance" (EMNLP 2025) | |
30 | Zhang et al., "XFinBench" (ACL 2025) | |
31 | Chen et al., "Advancing Financial Engineering with FMs" (Engineering, 2025) | |
32 | "FinTrace" (2026) |
Secondary Sources (press releases, blogs, overviews)
# | Source | URL |
|---|---|---|
33 | Finance LLM Leaderboard 2026 | |
34 | Mastercard LTM announcement | |
35 | Stripe Payment FM | |
36 | Large Behavioral Models (Unbox AI) | |
37 | State of Tabular FMs (2026) | |
38 | PRAGMA deep dive (Linas Beliūnas) | |
39 | FMSD Workshop, ICML 2025 | |
40 | FMSD Workshop, ICML 2026 |
Surveys
# | Survey | Source |
|---|---|---|
41 | FM-Powered Recommender Systems (2025) | |
42 | Foundation Models for Recommender Systems (2024) | |
43 | Representation Learning for Tabular Data (2025) | |
44 | Graph Foundation Models (2025) | |
45 | Sensor-based HAR FMs (2026) |