
Abstract
Foundation models — large-scale models pre-trained on broad data and adapted to downstream tasks — have expanded decisively beyond natural language. This survey maps the landscape as of April 2026 across five interconnected domains: (1) financial language models, (2) financial time-series models, (3) banking and transaction event-sequence models, (4) tabular foundation models, and (5) healthcare and behavioral event-sequence models. We anchor the survey on PRAGMA (Revolut, 2026), the first publicly documented production-scale banking foundation model, using its architecture, results, and related work as a lens into the broader field. We cover 40+ models and papers, identify seven cross-cutting themes, and surface the key open questions.
Key finding: The competitive frontier has shifted. For financial NLP, frontier general-purpose LLMs (o3, GPT-5.5) now outperform domain-fine-tuned models (BloombergGPT, FinGPT). But for structured, sequential, and event-level data — transactions, user actions, market microstructure — domain-specific foundation models show clear and growing advantages. The moat is in proprietary data, not algorithms.
Scope: Foundation models operating on financial data, user event sequences, transactions, tabular records, time series, and structured non-textual data.
Table of Contents
Financial Foundation Models: State of the Field
PRAGMA: A Case Study in Banking Foundation Models
Industrial Event Sequence Foundation Models
Financial Time-Series Foundation Models
Tabular Foundation Models
Behavioral, Healthcare, and Other Event-Sequence FMs
Cross-Cutting Themes
Open Questions and Gaps
Recommended Reading List
Sources
1. Financial Foundation Models: State of the Field
1.1 Taxonomy
A comprehensive survey by Tsinghua / E Fund Management / Hong Kong Polytechnic University ("Advancing Financial Engineering with Foundation Models," Engineering, 2025; DOI: 10.1016/j.eng.2025.11.029) categorizes financial FMs into three modalities:
Modality | Examples | Tasks |
|---|---|---|
Financial Language Models | BloombergGPT, FinGPT, ICE-INTENT, FinBERT | Sentiment analysis, compliance, report generation |
Financial Time-Series Models | Kronos, FinCast, MarketGPT, Time-LLM | Price forecasting, volatility prediction, synthetic data |
Financial Visual-Language Models | FinLLaVA, FinTral | Chart interpretation, FOMC projection parsing, table understanding |
To this taxonomy, the work surveyed here adds a fourth:
Modality | Examples | Tasks |
|---|---|---|
Banking Event-Sequence Models | PRAGMA, nuFormer, TransactionGPT, TREASURE | Credit scoring, fraud detection, churn, personalization |
1.2 Evolution
2019–2022: BERT-style encoder models (FinBERT)
2023–2024: Generative architectures (BloombergGPT: 50B params, 363B financial + 345B general tokens, ~$3M training cost; FinGPT: lightweight LoRA fine-tuning of open LLMs)
2025–2026: Reasoning-enhanced LLMs, domain-specific time-series FMs, and production banking FMs. Frontier general models surpass domain-fine-tuned models on NLP benchmarks.
1.3 Current Leaderboard (April 2026)
Per the Finance LLM Leaderboard 2026:
Observation | Detail |
|---|---|
SEC filing comprehension | o3 and GPT-5 lead on FinanceBench |
Multi-step calculations | Reasoning models (o3, DeepSeek-R2) pull ahead on QA and TAT-QA |
Domain fine-tuned models | BloombergGPT and FinGPT trail frontier general models on most tasks |
Conceptual knowledge | CFA-Bench separates genuine financial understanding from pattern matching |
Implication: Domain-specific language pretraining for finance may no longer be worth the cost. The action has moved to domain-specific pretraining for non-text financial data.
1.4 Benchmarks (2025–2026)
Benchmark | Focus | Source |
|---|---|---|
FinanceBench | SEC filing comprehension | General |
XFinBench | Complex financial problem solving | ACL 2025
IndFin-Bench | India-specific financial filings | |
CFA-Bench | Financial conceptual knowledge | General |
FinTSB | Time-series with realistic trading constraints | Survey |
FinTrace | LLM tool-calling for long-horizon financial tasks | |
FinMME | Visual QA with 11K+ pairs | Survey |
1.5 Industry Deployments (2026)
Bank of New York → GPT-5.5: 220+ internal use cases
Revolut → PRAGMA: Production banking FM for fraud, churn, personalization
Citi Wealth → "Citi Sky": AI-powered advisor (Google DeepMind), rolling out Summer 2026
AMCAP Global: Multi-model agentic AI framework for asset analysis
2. PRAGMA: A Case Study in Banking Foundation Models
2.1 Overview
Paper: "PRAGMA: Revolut Foundation Model" · Ostroukhov et al. (Revolut Research + NVIDIA)
Published: April 9, 2026 · arXiv:2604.08649
PRAGMA is a family of encoder-only Transformer foundation models (10M, 100M, 1B parameters) pre-trained on multi-source banking event sequences. It replaces siloed, task-specific models with a single shared backbone that transfers across credit scoring, fraud detection, lifetime value prediction, communication engagement, product recommendation, and more.
2.2 Why Not Just Use an LLM?
PRAGMA argues against text serialization of structured banking data:
Sequence inflation: Field names and delimiters inflate sequence lengths by 3–5×
Numerical destruction: Subword tokenization splits digits, losing magnitude and ordering
Heterogeneity: Banking events have variable-length records with mixed categorical, numerical, and free-text fields
2.3 Training Data
Dimension | Scale |
|---|---|
Users | 26 million |
Countries | 111 |
Events | 24 billion |
Tokens | 207 billion |
Time range | 25 months (2023–2025) |
Event Sources: Transactions (card payments, transfers), App (navigation, product usage), Trading (stock/crypto), Communication (push notifications, emails).
Profile State: Static contextual attributes (balance quantile, plan tier, service region) plus life-long events — timestamped milestones (e.g., first_topup) that survive truncation.
2.4 Architecture: Three-Encoder Design
Profile State Encoder: Processes static user attributes + life-long event timestamps via RoPE
Event Encoder: Processes each event independently; adds calendar features (hour/day/month via fixed-period sine/cosine)
History Encoder: Contextualizes the concatenation of [USR] + all [EVT] embeddings; uses RoPE on log-seconds-to-most-recent-event
Variant | Parameters | GPUs | Training Time |
|---|---|---|---|
PRAGMA-S | 10M | 16× H100 | ~2 days |
PRAGMA-M | 100M | 16× H100 | ~2 weeks |
PRAGMA-L | 1B | 32× H100 | ~2 weeks |
All variants: GELU, pre-norm LayerNorm, dropout 0.1, Muon + AdamW, bf16 mixed precision.
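To make the data flow concrete, here is a minimal PyTorch sketch of the three-encoder design. All dimensions and module choices are illustrative, not PRAGMA's actual code; the temporal RoPE, padding masks, and the paper's [EVT] summary token are simplified away (mean pooling stands in for the event summary).

```python
import torch
import torch.nn as nn

class ThreeEncoderBackbone(nn.Module):
    """Illustrative sketch of PRAGMA's three-encoder data flow (not the released code)."""
    def __init__(self, d=256, n_heads=4, n_layers=2):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(
                d, n_heads, 4 * d, dropout=0.1, activation="gelu",
                norm_first=True, batch_first=True)  # pre-norm + GELU, as in the paper
            return nn.TransformerEncoder(layer, n_layers)
        self.profile_encoder = encoder()  # static attributes + life-long events
        self.event_encoder = encoder()    # each event's tokens, encoded independently
        self.history_encoder = encoder()  # [USR] + per-event summaries, contextualized

    def forward(self, profile_tokens, event_tokens):
        # profile_tokens: (B, P, d); event_tokens: (B, N_events, T, d).
        # RoPE on timestamps and padding masks are omitted for brevity.
        B, N, T, d = event_tokens.shape
        usr = self.profile_encoder(profile_tokens)[:, 0]           # (B, d) user summary, "[USR]"
        evt = self.event_encoder(event_tokens.reshape(B * N, T, d))
        evt = evt.mean(dim=1).reshape(B, N, d)                     # mean pool stands in for "[EVT]"
        seq = torch.cat([usr.unsqueeze(1), evt], dim=1)            # [USR] + all event summaries
        return self.history_encoder(seq)                           # cross-event contextualization
```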
2.5 Tokenization: Key–Value–Time
Each data point decomposes into three components:
Component | Vocabulary | Encoding |
|---|---|---|
Keys (field names) | ~60 tokens | Single token per semantic type |
Values | ~28K tokens | Numerical → percentile buckets; Categorical → single token; Text → BPE subwords |
Time | Continuous | Log-seconds |
Token embedding: x = PosEmb(E(key) + E(value)), where positions index within a field, not across fields.
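A hedged sketch of how such key–value–time tokenization might look in practice. The bucket count, vocabulary sizes, and the hash-based categorical lookup are stand-ins for illustration, not PRAGMA's actual scheme:

```python
import math
import numpy as np

def fit_percentile_edges(train_values, n_buckets=100):
    """Fit percentile bucket edges for one numerical field (bucket count illustrative)."""
    return np.quantile(train_values, np.linspace(0, 1, n_buckets + 1)[1:-1])

def tokenize_pair(key, value, edges, key_vocab, value_offset):
    """Map one key-value pair to (key_token, value_token)."""
    key_tok = key_vocab[key]  # one token per semantic field type (~60 keys total)
    if isinstance(value, (int, float)):
        # numerical values -> percentile bucket index
        val_tok = value_offset + int(np.searchsorted(edges, value))
    else:
        # stand-in for the categorical / BPE-subword vocabulary lookup
        val_tok = value_offset + hash(str(value)) % 1000
    return key_tok, val_tok

def encode_time(event_ts, most_recent_ts):
    """Time never enters the discrete vocabulary: it stays continuous as log-seconds."""
    return math.log1p(most_recent_ts - event_ts)

# The token embedding then follows the formula above:
#   x = PosEmb(E(key_tok) + E(val_tok)), positions indexed within the field.
```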
2.6 Pre-training: Masked Modeling
BERT-style MLM adapted for structured events. For each masked token, the MLM head receives a 3d-dimensional input formed by concatenating three d-dimensional vectors:
Event Encoder output (local within-event context)
History Encoder output at the [EVT] position (cross-event context)
History Encoder output at the [USR] position (user-level context)
Masking strategy (three complementary sources):
Type | Rate | Purpose |
|---|---|---|
Token-level | 15% | Standard token reconstruction |
Event-level | 10% | Reconstruct entire events from history |
Key-level | 10% | Predict values given other keys + context |
A small fraction of masks is replaced with [UNK] (excluded from the loss) as a form of input dropout.
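The head and the combined masking logic might look roughly like this sketch; shapes, rates, and the per-key simplification are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ConcatMLMHead(nn.Module):
    """Sketch of the 3d-dimensional MLM head: local, cross-event, and user-level
    context are concatenated before predicting each masked token."""
    def __init__(self, d, vocab_size):
        super().__init__()
        self.proj = nn.Linear(3 * d, vocab_size)

    def forward(self, event_out, evt_ctx, usr_ctx):
        # event_out: (M, d) Event Encoder output at the M masked positions
        # evt_ctx:   (M, d) History Encoder output at the owning event's position
        # usr_ctx:   (M, d) History Encoder output at the [USR] position
        return self.proj(torch.cat([event_out, evt_ctx, usr_ctx], dim=-1))

def sample_masks(token_to_event, token_to_key, p_tok=0.15, p_evt=0.10, p_key=0.10):
    """Union of the three masking sources (rates from the table above; logic simplified).
    token_to_event / token_to_key: LongTensors mapping each token to its event / key id."""
    n = token_to_event.numel()
    tok_mask = torch.rand(n) < p_tok                               # token-level
    evt_mask = torch.rand(int(token_to_event.max()) + 1) < p_evt   # whole events
    key_mask = torch.rand(int(token_to_key.max()) + 1) < p_key     # all values of a key
    return tok_mask | evt_mask[token_to_event] | key_mask[token_to_key]
```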
2.7 Engineering
Sequence packing: FlashAttention varlen kernel eliminates padding overhead. 2–5× throughput improvement.
Dynamic batching: Records sharded by event count; greedy packing within a fixed GPU memory budget (see the sketch after this list).
Truncation: Max 24 tokens/event (0.01% affected), max 6,500 events/user (most recent retained).
Storage: LMDB user index + Parquet event shards partitioned by event count.
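A minimal sketch of greedy packing under an event budget; the budget unit and the largest-first heuristic are assumptions, not Revolut's implementation:

```python
def greedy_pack(records, max_events_per_batch):
    """Pack (user_id, n_events) records into batches under a fixed event budget,
    a rough proxy for packing within a fixed GPU memory budget."""
    batches, current, budget = [], [], max_events_per_batch
    for user_id, n_events in sorted(records, key=lambda r: -r[1]):  # largest first
        if n_events > budget and current:       # would overflow: close the batch
            batches.append(current)
            current, budget = [], max_events_per_batch
        current.append(user_id)
        budget -= n_events                      # truncation (max 6,500 events) bounds this
    if current:
        batches.append(current)
    return batches

# Example with a 10,000-event budget:
# greedy_pack([("u1", 6500), ("u2", 3000), ("u3", 400)], 10_000) -> [["u1", "u2", "u3"]]
```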
2.8 Results (Relative to Internal Production Baselines)
Caveat: Only relative improvements reported. Absolute metrics withheld for commercial sensitivity.
Main Results (PRAGMA-L + LoRA)
Task | Metric | Relative Change |
|---|---|---|
Credit Scoring | PR-AUC | +130.2% |
Credit Scoring | ROC-AUC | +12.4% |
Communication Engagement | PR-AUC | +79.4% |
Communication Engagement | ROC-AUC | +20.4% |
External Fraud | Precision | +64.7% |
External Fraud | Recall | +64.7% |
Uplift (AUUC) | — | +163.7% |
Scaling (PRAGMA-S → L, LoRA)
Task | Metric | S→M | S→L |
|---|---|---|---|
Credit Scoring | PR-AUC | +16.3% | +35.2% |
External Fraud | Recall | +24.8% | +23.5% |
Product Rec. | mAP | +18.9% | +27.0% |
LTV | PR-AUC | +1.5% | +3.0% |
Scaling gains are task-dependent: credit scoring benefits enormously; LTV saturates early.
Pre-training Effect (LoRA vs. Scratch, PRAGMA-M)
Task | PR-AUC Gain |
|---|---|
Comm. Engagement | +18.6% |
Credit Scoring | +13.0% |
Product Rec. (mAP) | +10.3% |
Profile State Effect (Full vs. Event-only)
Task | Impact |
|---|---|
Fraud Recall | +85.6% |
Credit Scoring PR-AUC | +31.8% |
Comm. Engagement PR-AUC | −3.0% (hurts!) |
Profile state is hugely important for fraud/credit; slightly harmful for communication engagement.
Optional: Pre-trained Text Encoder (Nemotron-1B-v2)
Credit Scoring PR-AUC: +16.1%. Product Rec. mAP: −6.4%. Trade-off: +18% training latency. Kept as opt-in.
2.9 Where PRAGMA Fails: Anti-Money Laundering
Task | Metric | PRAGMA vs. Baseline |
|---|---|---|
AML | F₀.₅ | −47.1% |
AML is inherently relational — requires cross-user network signals. PRAGMA processes users in isolation. This is a fundamental architectural limitation.
2.10 Critical Assessment
Strengths:
First production-scale, multi-source banking FM with public documentation
Principled tokenization avoiding text serialization pitfalls
Broadest task evaluation in the financial FM literature (6+ tasks)
Honest about failures (AML)
NVIDIA co-authorship; solid engineering details
Weaknesses:
No absolute metrics → baseline strength unknown
No public model or data → unreproducible externally
Internal benchmarks only → no cross-paper comparison possible
Relational blindness → fundamental gap for graph-structured tasks
No comparison to simpler baselines (e.g., GBDT + RNN)
3. Industrial Event Sequence Foundation Models
These models share PRAGMA's thesis: behavioral event sequences contain transferable representations that outperform hand-crafted features across multiple tasks. None are open-source.
3.1 Meta — HSTU / Generative Recommenders (ICML 2024)
Paper: "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations"
arXiv:2402.17152
The foundational paper for industrial event-sequence FMs:
Architecture: HSTU — pointwise aggregated attention + SiLU gating replacing softmax (sketched after this list). 5.3–15.2× faster than FlashAttention-2.
Scale: Up to 1.5 trillion parameters (including embeddings). Power-law scaling across 3 orders of magnitude.
Training: Autoregressive next-action prediction. Each sequence generates O(N) training signals.
Results: +12.4% topline improvement across Meta production surfaces.
Serving: M-FALCON enables 285× more complex models at same throughput.
Key insight: Actions encode richer intent than words.
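A single-head sketch of the pointwise-attention idea, under simplifying assumptions (one head, no relative position biases, plain length normalization); see the paper for the full layer:

```python
import torch
import torch.nn.functional as F

def hstu_style_attention(x, wq, wk, wv, wu):
    """Pointwise aggregated attention: an elementwise SiLU on the score matrix
    replaces the softmax, and the aggregate is gated by a SiLU-activated
    projection of the input. Single-head simplification of HSTU's layer."""
    q, k, v, u = x @ wq, x @ wk, x @ wv, x @ wu   # wv and wu must share an output dim
    n = x.shape[-2]
    scores = F.silu(q @ k.transpose(-2, -1)) / n  # no softmax: scores are not a distribution
    return (scores @ v) * F.silu(u)               # elementwise gating

# Shapes: x: (n, d); wq, wk: (d, d_qk); wv, wu: (d, d_v)
```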
3.2 Nubank — nuFormer (2025)
Paper: "Your Spending Needs Attention: Modeling Financial Habits with Transformers"
arXiv:2507.23267
Closest predecessor to PRAGMA:
Scale: O(100B) transactions, 100M+ members. 24M–330M parameters.
Architecture: Transformer with text-like tokenization + DCNv2 joint fusion.
Training: Next-token prediction.
Results: 1.25% relative AUC lift (3.3× typical launch impact). 4.4% churn reduction.
Limitation vs. PRAGMA: Single event source (transactions only), no explicit profile state.
3.3 Visa — TransactionGPT (2025) + TREASURE (2025)
TransactionGPT: arXiv:2511.08939
Architecture: 3D-Transformer with Virtual Token Layer (~10M params). Three coupled transformers for features, metadata, temporal dimensions.
Results: +22% on production anomaly detection. 92% fewer params and 300× faster inference than Llama2-7B on MCC prediction.
TREASURE: arXiv:2511.19693
Transformer FM for high-volume payment understanding. Extended with LLM-based sentence embeddings (arXiv:2601.05271).
3.4 Yandex Music — ARGUS (KDD 2026)
Paper: "Scaling Recommender Transformers to One Billion Parameters"
arXiv:2507.15994
Scale: 3.2M → 1B parameters. Power-law scaling confirmed independently from HSTU.
Training: Dual pre-training — Next Item Prediction + Feedback Prediction (RL-inspired).
Results: +2.26% listening time, +6.37% likes — largest DL improvement in platform history.
Key insight: Architecture matters less than training objective and scale. Standard Transformer matches HSTU with the right task.
3.5 Pinterest — TransAct V2 (2025)
Architecture: Tiny transformer (2 layers, d=64) in wide-and-deep CTR model. Pragmatic latency-first design for 500M+ users.
Innovation: SKUT Triton kernel (6.6× speedup). 103–338× end-to-end latency improvement.
Results: +6.35% repin, −12.80% hides online.
3.6 Mastercard — Large Tabular Model (2026)
Press release. No peer-reviewed paper. "Large tabular model" on billions of anonymized transactions.
3.7 Stripe — Payment Foundation Model (2025)
TechCrunch, May 2025. No peer-reviewed paper. Tens of billions of transactions. NVIDIA partnership.
3.8 Open Banking Foundational Model (2025)
arXiv:2511.12154. Academic work. Multimodal FM integrating structured transaction attributes with text descriptions via MLM.
Comparison Matrix
Model | Builder | Params | Events | Sources | Architecture | Objective | Tasks | Open? |
|---|---|---|---|---|---|---|---|---|
PRAGMA | Revolut | 10M–1B | 24B | Txns, app, trading, comms | Encoder, 3-branch | Masked modeling | 6+ | ❌ |
HSTU | Meta | ≤1.5T | Trillions | User actions | Decoder, HSTU | Next-action | Rec. | ❌ |
nuFormer | Nubank | 24M–330M | ~100B | Transactions | Encoder+DCNv2 | Next-token | Credit, churn | ❌ |
TransactionGPT | Visa | ~10M | Billions | Payments | 3D-Transformer | Next-txn+supervised | Anomaly, MCC | ❌ |
TREASURE | Visa | — | Billions | Payments | Transformer | Self-supervised | Fraud, personalization | ❌ |
ARGUS | Yandex | 3.2M–1B | Billions | Listening | Transformer | NIP+feedback | Rec. | ❌ |
TransAct V2 | Pinterest | Tiny | O(10⁴)/user | Actions | Transformer in CTR | Next Action Loss | CTR | ❌ |
Mastercard LTM | Mastercard | — | Billions | Payments | "LTM" | — | Fraud | ❌ |
Stripe PFM | Stripe | — | Tens of B | Payments | — | — | Routing | ❌ |
4. Financial Time-Series Foundation Models
4.1 Kronos (AAAI 2026)
Paper: "Kronos: A Foundation Model for the Language of Financial Markets"
arXiv:2508.02739 · Code · Open-source (MIT)
Architecture: Decoder-only Transformer with specialized OHLCV tokenizer that discretizes candlestick data into hierarchical tokens.
Scale: 12B+ K-line records from 45 global exchanges. Model family: 4.1M–499M params.
Results (zero-shot): +93% RankIC over the leading TSFM for price forecasting; 9% lower MAE on volatility forecasting; 22% better generative fidelity.
Significance: First model to show that the pre-training paradigm works for financial time series — previous TSFMs often underperformed non-pretrained architectures.
4.2 FinCast (CIKM 2025)
Paper: "FinCast: A Foundation Model for Financial Time-Series Forecasting"
arXiv:2508.19609 · Code · Open-source (Apache 2.0)
Architecture: Decoder-only Transformer + Mixture of Experts (MoE).
Scale: 20B+ financial time points across diverse domains and resolutions.
Innovation: PQ-Loss (joint point + probabilistic forecasting). MoE for domain specialization.
Results: Robust zero-shot across domains without fine-tuning.
4.3 Chronos (TMLR 2024)
Paper: "Chronos: Learning the Language of Time Series"
arXiv:2403.07815 · Code · Open-source (Apache 2.0)
Architecture: T5-based. Tokenizes continuous values via scaling + quantization into discrete bins (sketched after this list).
Scale: 20M–710M params. Trained on 27 public datasets + KernelSynth synthetic data (1M series).
Results: Competitive zero-shot performance on 42 benchmarks without any time-series-specific architectural changes.
Key insight: Standard language model architectures work directly on tokenized time series.
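A sketch of the scale-then-quantize idea; the bin count and clipping range here are placeholders, not Chronos's exact configuration:

```python
import numpy as np

def scale_quantize(series, n_bins=4096, low=-15.0, high=15.0):
    """Chronos-style tokenization sketch: mean-scale, clip, uniform-bin.
    Returns integer token ids a standard LM vocabulary can consume."""
    scale = np.mean(np.abs(series)) + 1e-8            # mean scaling
    scaled = np.clip(series / scale, low, high)
    edges = np.linspace(low, high, n_bins + 1)
    tokens = np.digitize(scaled, edges[1:-1])         # ids in [0, n_bins - 1]
    return tokens, scale

def dequantize(tokens, scale, n_bins=4096, low=-15.0, high=15.0):
    edges = np.linspace(low, high, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2            # token id -> bin center value
    return centers[tokens] * scale
```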
4.4 Time-LLM (ICLR 2024)
Paper: "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models"
arXiv:2310.01728 · Code · 1000+ citations
Approach: Repurposes frozen pre-trained LLMs (Llama-7B, GPT-2) via lightweight "reprogramming" layers.
Innovation: Cross-attention alignment between time-series patches and text prototypes (sketched below); Prompt-as-Prefix enriches input with domain knowledge.
Trade-off: Requires per-dataset fine-tuning of reprogramming layers (LLM stays frozen).
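A sketch of a reprogramming layer under two assumptions: prototypes are learned directly as free parameters (the paper derives them from the frozen LLM's word embeddings), and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ReprogrammingLayer(nn.Module):
    """Time-series patches attend over a small set of text prototypes so the
    patches land in the frozen LLM's input space (illustrative sketch)."""
    def __init__(self, d_patch, d_llm, n_prototypes=100, n_heads=4):
        super().__init__()
        # Learned directly here for simplicity; Time-LLM maps word embeddings
        # to prototypes instead.
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, d_llm) * 0.02)
        self.patch_proj = nn.Linear(d_patch, d_llm)
        self.attn = nn.MultiheadAttention(d_llm, n_heads, batch_first=True)

    def forward(self, patches):                      # patches: (B, n_patches, d_patch)
        q = self.patch_proj(patches)                 # queries come from the time series
        proto = self.prototypes.expand(patches.size(0), -1, -1)
        out, _ = self.attn(q, proto, proto)          # cross-attention onto prototypes
        return out                                   # fed to the frozen LLM as input embeddings
```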
Design Tension: Tokenize vs. Reprogram
Strategy | Models | Pros | Cons |
|---|---|---|---|
Tokenize into bins | Chronos, Kronos | Standard LM architectures; zero-shot capable | Precision loss; bucket boundaries are data-specific |
Reprogram frozen LLM | Time-LLM | Leverages pre-trained knowledge; cheap | Per-dataset fine-tuning; fragile alignment |
Type-specific encoding | PRAGMA | Preserves data types natively | Custom infrastructure; domain-specific |
5. Tabular Foundation Models
Tabular FMs address fixed-schema row data — related to but distinct from event sequences.
5.1 The TabPFN Lineage
Version | Year | Venue | Scale | Key Advance |
|---|---|---|---|---|
TabPFN | 2023 | ICLR | ≤1K samples, 100 features | In-context learning on synthetic data from structural causal model priors |
TabPFN v2 | 2025 | Nature | ≤10K samples, 500 features | Categorical features, missing values, regression |
TabPFN-2.5 | 2025 | arXiv | ≤50K samples, 2K features | 100% win rate vs XGBoost (small-medium); distillation engine |
Core idea: train a Transformer on synthetic data from causal priors, then do in-context learning at inference (no gradients); see the usage sketch below. All open-source.
TabPFN v2: Nature · HuggingFace
TabPFN-2.5: arXiv:2511.08667
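A minimal usage sketch, assuming the open-source tabpfn package and its scikit-learn-style interface; note that fit() mostly caches the training set, since prediction is in-context:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # assumes the open-source tabpfn package

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()           # weights pre-trained on synthetic data from causal priors
clf.fit(X_train, y_train)          # no gradient descent: the training set becomes the context
proba = clf.predict_proba(X_test)  # inference = one forward pass of in-context learning
print(proba[:3])
```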
5.2 Beyond TabPFN
Model | Year | Venue | Key Innovation | arXiv |
|---|---|---|---|---|
Mitra | 2025 | NeurIPS | Mixed synthetic priors (SCM + tree-based); 72M params; outperforms TabPFN v2 | |
CARTE | 2024 | ICML | Graph-based pretraining for cross-table transfer | |
TabICL v2 | 2026 | arXiv | Column-then-row attention; scales to ~500K samples; best open-source option | |
TabForestPFN | 2024 | arXiv | Forest-based synthetic data for more complex decision boundaries | |
KumoRFM-2 | 2026 | arXiv | FM for relational (multi-table) data via in-context learning |
Additional models: TabDPT, MotherNet (Microsoft), TabFlex (Microsoft), ContextTab (SAP), LimiX, Orion variants. See the Mindful Modeler overview (2026) listing 16+ tabular FMs.
5.3 Benchmarks and Surveys
OmniTabBench (arXiv:2604.06814): Comprehensive GBDTs vs. NNs vs. FMs comparison.
Survey: "Representation Learning for Tabular Data" (arXiv:2504.16109)
FMSD Workshop at ICML 2025 and 2026: 99 submissions, 500+ participants.
5.4 Cross-Modal Self-Supervised Learning
data2vec (Meta, ICML 2022; arXiv:2202.03555): First framework using the same self-supervised algorithm across speech, vision, and NLP. Predicts contextualized latent representations via self-distillation (teacher = EMA of student). Achieved SOTA on ImageNet (ViT-B/L). Open-source in fairseq.
Relevance: Demonstrates that masked prediction of rich contextualized targets (not raw inputs) can work across modalities. PRAGMA's MLM head, which concatenates local + cross-event + user-level context before prediction, echoes this principle.
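The mechanical core of that recipe is small. Below is a hedged sketch of the data2vec teacher update and target construction; the layer averaging and masking specifics are simplified:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, tau=0.999):
    """data2vec-style self-distillation: the teacher is an exponential moving
    average of the student and supplies contextualized regression targets."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1.0 - tau)

# Per training step (schematically):
#   targets = teacher(full_input) contextualized latents (top-K layer average in the paper)
#   preds   = student(masked_input)
#   loss    = smooth_l1(preds[masked_positions], targets[masked_positions])
#   loss.backward(); optimizer.step(); ema_update(teacher, student)
```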
6. Behavioral, Healthcare, and Other Event-Sequence FMs
6.1 Behavioral / Clickstream FMs
Paper | Year | Key Idea | Source |
|---|---|---|---|
BehaveGPT | 2025 | Transformer + DRO for user behavior prediction | |
Large Behavioral Models | 2026 | Next-event prediction across retail/payments (Unbox AI) | |
NEST | 2026 | Event streams as sequences of multisets; Masked Set Modeling | |
ClickstreamGPT | 2025 | GPT-style generation for e-commerce clickstream | |
TRACE | 2024 | Multi-session clickstream embeddings via multi-task learning |
6.2 Self-Supervised Learning for Event Sequences
An active sub-area exploring pre-training objectives:
Contrastive + Generative fusion: Yugay & Zaytsev (2024). arXiv:2408.09995
MLEM: Generative and contrastive as distinct modalities. arXiv:2401.15935
PyTorch-Lifestream (IJCAI 2025): Open-source library implementing CoLES, CPC, RTD, BERT-style pretraining (a CoLES-style loss is sketched after this list). Proceedings
Mixed-type Event Sequences (NeurIPS 2025): Heterogeneous events across medicine, finance, remote sensing. Poster
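To make the contrastive family concrete, here is a plain-PyTorch sketch of a CoLES-style loss (not the PyTorch-Lifestream API): subsequences sampled from the same user's history are positives, everything else in the batch is a negative.

```python
import torch
import torch.nn.functional as F

def coles_style_loss(embeddings, user_ids, temperature=0.1):
    """InfoNCE over subsequence embeddings; assumes each user contributes at
    least two subsequences per batch. embeddings: (N, d); user_ids: (N,)."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                          # cosine-similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (user_ids.unsqueeze(0) == user_ids.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))        # drop self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    return -log_prob[pos].mean()                           # pull same-user pairs together
```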
6.3 Healthcare Event FMs
Healthcare is the closest parallel to financial event FMs — similar challenges (heterogeneous events, irregular timing, long histories, privacy).
Model | Year | Scale | Architecture | Key Innovation | arXiv |
|---|---|---|---|---|---|
Apollo | 2026 | 25B events, 7.2M patients, 28 modalities | Multimodal temporal FM | 322 clinical tasks | |
EHRMamba | 2025 | — | Mamba (subquadratic) | Addresses O(n²) for long EHR | |
RAVEN | 2026 | >1M patients | Generative, next-visit | Recurrence-aware regularization | |
NEST | 2026 | — | Hierarchical multiset | Models co-occurring events |
Apollo mirrors PRAGMA's approach at similar scale: 25B medical events, multi-source, multi-task. Cross-pollination between financial and healthcare event FMs is an underexplored opportunity.
6.4 Recommendation FMs
Model | Year | Venue | Key Innovation | Source |
|---|---|---|---|---|
RecGPT | 2025 | EMNLP | Zero-shot cross-domain via unified item tokenization | |
RecBase | 2025 | EMNLP | Domain-agnostic FM for zero-shot rec. |
Surveys: arXiv:2504.16420 (2025), arXiv:2402.11143 (2024).
6.5 Other Domains
Graph FMs: Survey at arXiv:2505.15116 (2025)
IoT / CPS: arXiv:2503.12282 (2025)
Sensor-based HAR: Survey at arXiv:2604.02711 (2026)
7. Cross-Cutting Themes
7.1 The Tokenization Problem
The central unsolved problem for non-text FMs. No consensus exists.
Strategy | Used By | Trade-off |
|---|---|---|
Serialize as text | Naive baseline | Universal but inflated, numerical info destroyed |
Percentile bucketing | PRAGMA, Chronos | Compact, preserves magnitude; loses precision |
Learned per-field embeddings | TabTransformer, FT-Transformer | Native structure; requires fixed schema |
OHLCV hierarchical tokenizer | Kronos | Domain-optimized; not transferable |
Reprogramming via cross-attention | Time-LLM | Reuses LLM knowledge; per-dataset fine-tuning |
Key–value–time decomposition | PRAGMA | Type-aware heterogeneous encoding; custom infra |
Synthetic prior fitting | TabPFN, Mitra | No real data needed; limited to tabular (not sequences) |
7.2 Pre-training Objective: No Consensus
Objective | Used By | Analogy |
|---|---|---|
Masked modeling | PRAGMA, BERT4Rec, NEST, MLEM | BERT-style |
Next-event prediction | HSTU, ARGUS, nuFormer, RAVEN | GPT-style |
Contrastive learning | CoLES, MLEM, BYB | SimCLR/CLIP-style |
Dual (NIP + feedback) | ARGUS | RL-inspired |
Joint self-supervised + supervised | TransactionGPT | Multi-task |
Synthetic prior fitting | TabPFN, Mitra | Bayesian meta-learning |
The right objective depends on the downstream use case: encoder-only models naturally pair with masked modeling (discriminative tasks); decoder-only with next-event prediction (generative tasks).
7.3 Scaling Laws for Structured Data
Paper | Scaling Behavior |
|---|---|
HSTU (Meta) | Power-law across 3 orders of magnitude — comparable to GPT-3/LLaMA-2 |
ARGUS (Yandex) | Linear on log scale from 3.2M to 1B |
PRAGMA (Revolut) | Task-dependent: credit scoring benefits enormously; LTV saturates early |
TabPFN lineage | Scales primarily in data capacity (1K → 50K) rather than model size |
No emergent capabilities observed. Unlike LLMs with qualitative jumps at scale, structured data FMs show smooth, task-dependent scaling.
7.4 Encoder-Only vs. Decoder-Only
Architecture | Used By | Best For |
|---|---|---|
Encoder-only (bidirectional) | PRAGMA, BERT4Rec, NEST | Discriminative (classification, scoring) |
Decoder-only (autoregressive) | HSTU, Kronos, FinCast, Chronos | Generative (forecasting, next-event) |
Hybrid | TransactionGPT, nuFormer | Multi-objective |
PRAGMA's encoder-only choice is unusual in a field trending decoder-only. Justified: "Our primary goal is transferable representations for discriminative financial tasks, rather than open-ended generation."
7.5 The "None of This Is Open" Problem
Industrial event-sequence FMs: 0/9 are open-source.
Time-series FMs: 3/4 are open-source (Chronos, Kronos, FinCast).
Tabular FMs: Majority are open-source (TabPFN, Mitra, TabICL).
The industrial event-sequence community treats data as the moat. The academic community values reproducibility. This split makes cross-paper comparison of event-sequence FMs effectively impossible.
PyTorch-Lifestream (IJCAI 2025) is the main open-source effort to democratize event-sequence pretraining methods.
7.6 The Competitive Landscape in Financial Services
As of April 2026, at least six major companies (five financial institutions, plus Meta for its foundational role) have published or announced transaction/event FMs:
Company | Model | Peer-Reviewed Paper? |
|---|---|---|
Revolut | PRAGMA | ✅ |
Nubank | nuFormer | ✅ |
Visa | TransactionGPT + TREASURE | ✅ (both) |
Mastercard | LTM | ❌ (press only) |
Stripe | Payment FM | ❌ (press only) |
Meta | HSTU/GR | ✅ |
This convergence marks the emergence of behavioral representation as a new competitive layer in financial services.
7.7 Relational / Graph Structure: The Big Gap
PRAGMA's −47% AML result is symptomatic of a field-wide limitation. Single-user / single-row models cannot capture cross-entity relationships needed for:
Anti-money laundering (transaction networks)
Network fraud (coordinated attacks)
Syndicated lending risk
Social contagion effects
KumoRFM-2 (arXiv:2604.12596) addresses multi-table relational data but not cross-user graph structure. The Graph Foundation Models survey (arXiv:2505.15116) maps the broader space but doesn't specifically address financial transaction graphs.
8. Open Questions and Gaps
No public benchmark for event sequence FMs. NLP has GLUE; vision has ImageNet; tabular has OmniTabBench. Event sequences have nothing. Every paper evaluates on proprietary data.
Optimal pre-training objective unknown. No systematic ablation of masked modeling vs. next-event prediction vs. contrastive learning on the same data.
Cross-institution transfer untested. Can PRAGMA-style representations transfer to a different bank's data distribution? To a different country's financial system? No evidence either way.
Relational structure unaddressed. PRAGMA's AML failure is just one symptom. No event-sequence FM handles cross-user graph structure.
Regulatory barriers to replication. GDPR, MiFID II, HIPAA make public benchmarks from real data structurally impossible. Federated and synthetic approaches are immature.
No emergent capabilities. Scaling is smooth and task-dependent. Whether structured data FMs can exhibit LLM-like phase transitions remains open.
Efficiency Pareto frontier poorly characterized. TransAct V2 (2 layers, d=64) achieves strong results; HSTU uses 1.5T params. The right operating point for different latency/accuracy trade-offs is unclear.
Domain-specific vs. general language pretraining. For NLP tasks, frontier general LLMs now win. Is there a crossover point where domain-specific language pretraining becomes worthwhile again, or has that ship sailed permanently?
9. Recommended Reading List
Tier 1: Essential
Zhai et al., "Actions Speak Louder than Words" (ICML 2024) — arXiv:2402.17152
Ostroukhov et al., "PRAGMA: Revolut Foundation Model" (2026) — arXiv:2604.08649
Braithwaite et al., "nuFormer" (2025) — arXiv:2507.23267
Hollmann et al., "TabPFN v2" (Nature, 2025) — Nature
Tier 2: Important
Dou et al., "TransactionGPT" (2025) — arXiv:2511.08939
Khrylchenko et al., "ARGUS" (KDD 2026) — arXiv:2507.15994
Ansari et al., "Chronos" (TMLR 2024) — arXiv:2403.07815
Shi et al., "Kronos" (AAAI 2026) — arXiv:2508.02739
Yeh et al., "TREASURE" (2025) — arXiv:2511.19693
Zhang et al., "Mitra" (NeurIPS 2025) — arXiv:2510.21204
"Apollo" (2026) — arXiv:2604.18570
Tier 3: Surveys & Tools
"FM-Powered Recommender Systems" (2025) — arXiv:2504.16420
"Foundation Models for Recommender Systems" (2024) — arXiv:2402.11143
"Representation Learning for Tabular Data" (2025) — arXiv:2504.16109
"Graph Foundation Models" (2025) — arXiv:2505.15116
Chen et al., "Advancing Financial Engineering with Foundation Models" (Engineering, 2025) — DOI
PyTorch-Lifestream (IJCAI 2025) — Proceedings
OmniTabBench (2026) — arXiv:2604.06814
10. Sources
Primary Sources (paper read or abstract verified)
# | Paper | Identifier |
|---|---|---|
1 | Zhai et al., "Actions Speak Louder than Words" (ICML 2024) | |
2 | Ostroukhov et al., "PRAGMA" (2026) | |
3 | Braithwaite et al., "nuFormer" (2025) | |
4 | Dou et al., "TransactionGPT" (2025) | |
5 | Yeh et al., "TREASURE" (2025) | |
6 | Khrylchenko et al., "ARGUS" (KDD 2026) | |
7 | Xia et al., "TransAct V2" (2025) | |
8 | Ansari et al., "Chronos" (TMLR 2024) | |
9 | Jin et al., "Time-LLM" (ICLR 2024) | |
10 | Shi et al., "Kronos" (AAAI 2026) | |
11 | Zhu et al., "FinCast" (CIKM 2025) | |
12 | Hollmann et al., "TabPFN" (ICLR 2023) | |
13 | Hollmann et al., "TabPFN v2" (Nature, 2025) | |
14 | Grinsztajn et al., "TabPFN-2.5" (2025) | |
15 | Zhang et al., "Mitra" (NeurIPS 2025) | |
16 | Baevski et al., "data2vec" (ICML 2022) | |
17 | Kim et al., "CARTE" (ICML 2024) | |
18 | "TabICL v2" (2026) | |
19 | Fey et al., "KumoRFM-2" (2026) | |
20 | "BehaveGPT" (2025) | |
21 | "NEST" (2026) | |
22 | "Apollo" (2026) | |
23 | "EHRMamba" (ML4H 2025) | |
24 | "RAVEN" (2026) | |
25 | "RecGPT" (EMNLP 2025) | |
26 | "Open Banking FM" (2025) | |
27 | Polleti et al., "TREASURE + LLM embeddings" (2025) | |
28 | "OmniTabBench" (2026) | |
29 | Dong et al., "LLM Agents in Finance" (EMNLP 2025) | |
30 | Zhang et al., "XFinBench" (ACL 2025) | |
31 | Chen et al., "Advancing Financial Engineering with FMs" (Engineering, 2025) | |
32 | "FinTrace" (2026) |
Secondary Sources (press releases, blogs, overviews)
# | Source | URL |
|---|---|---|
33 | Finance LLM Leaderboard 2026 | |
34 | Mastercard LTM announcement | |
35 | Stripe Payment FM | |
36 | Large Behavioral Models (Unbox AI) | |
37 | State of Tabular FMs (2026) | |
38 | PRAGMA deep dive (Linas Beliūnas) | |
39 | FMSD Workshop, ICML 2025 | |
40 | FMSD Workshop, ICML 2026 |
Surveys
# | Survey | Source |
|---|---|---|
41 | FM-Powered Recommender Systems (2025) | |
42 | Foundation Models for Recommender Systems (2024) | |
43 | Representation Learning for Tabular Data (2025) | |
44 | Graph Foundation Models (2025) | |
45 | Sensor-based HAR FMs (2026) |