Foundation Models for Finance

Research

Apr 28, 2026

Financial Foundation Models - State of the Field

Abstract

Foundation models — large-scale models pre-trained on broad data and adapted to downstream tasks — have expanded decisively beyond natural language. This survey maps the landscape as of April 2026 across five interconnected domains: (1) financial language models, (2) financial time-series models, (3) banking and transaction event-sequence models, (4) tabular foundation models, and (5) healthcare and behavioral event-sequence models. We anchor the survey on PRAGMA (Revolut, 2026), the first publicly documented production-scale banking foundation model, using its architecture, results, and related work as a lens into the broader field. We cover 40+ models and papers, identify seven cross-cutting themes, and surface the key open questions.

Key finding: The competitive frontier has shifted. For financial NLP, frontier general-purpose LLMs (o3, GPT-5.5) now outperform domain-fine-tuned models (BloombergGPT, FinGPT). But for structured, sequential, and event-level data — transactions, user actions, market microstructure — domain-specific foundation models show clear and growing advantages. The moat is in proprietary data, not algorithms.

Scope: Foundation models operating on financial data, user event sequences, transactions, tabular records, time series, and structured non-textual data.

Table of Contents

  1. Financial Foundation Models: State of the Field

  2. PRAGMA: A Case Study in Banking Foundation Models

  3. Industrial Event Sequence Foundation Models

  4. Financial Time-Series Foundation Models

  5. Tabular Foundation Models

  6. Behavioral, Healthcare, and Other Event-Sequence FMs

  7. Cross-Cutting Themes

  8. Open Questions and Gaps

  9. Recommended Reading List

  10. Sources

1. Financial Foundation Models: State of the Field

1.1 Taxonomy

A comprehensive survey by Tsinghua / E Fund Management / Hong Kong Polytechnic University ("Advancing Financial Engineering with Foundation Models," Engineering, 2025; DOI: 10.1016/j.eng.2025.11.029) categorizes financial FMs into three modalities:

| Modality | Examples | Tasks |
|---|---|---|
| Financial Language Models | BloombergGPT, FinGPT, ICE-INTENT, FinBERT | Sentiment analysis, compliance, report generation |
| Financial Time-Series Models | Kronos, FinCast, MarketGPT, Time-LLM | Price forecasting, volatility prediction, synthetic data |
| Financial Visual-Language Models | FinLLaVA, FinTral | Chart interpretation, FOMC projection parsing, table understanding |

To this taxonomy, the work surveyed here adds a fourth:

| Modality | Examples | Tasks |
|---|---|---|
| Banking Event-Sequence Models | PRAGMA, nuFormer, TransactionGPT, TREASURE | Credit scoring, fraud detection, churn, personalization |

1.2 Evolution

  1. 2019–2022: BERT-style encoder models (FinBERT)

  2. 2023–2024: Generative architectures (BloombergGPT: 50B params, 363B financial + 345B general tokens, ~$3M training cost; FinGPT: lightweight LoRA fine-tuning of open LLMs)

  3. 2025–2026: Reasoning-enhanced LLMs, domain-specific time-series FMs, and production banking FMs. Frontier general models surpass domain-fine-tuned models on NLP benchmarks.

1.3 Current Leaderboard (April 2026)

Per the Finance LLM Leaderboard 2026:

| Observation | Detail |
|---|---|
| SEC filing comprehension | o3 and GPT-5 lead on FinanceBench |
| Multi-step calculations | Reasoning models (o3, DeepSeek-R2) pull ahead on QA and TAT-QA |
| Domain fine-tuned models | BloombergGPT and FinGPT trail frontier general models on most tasks |
| Conceptual knowledge | CFA-Bench separates genuine financial understanding from pattern matching |

Implication: Domain-specific language pretraining for finance may no longer be worth the cost. The action has moved to domain-specific pretraining for non-text financial data.

1.4 Benchmarks (2025–2026)

| Benchmark | Focus | Source |
|---|---|---|
| FinanceBench | SEC filing comprehension | General |
| XFinBench | Complex financial problem solving (ACL 2025) | ACL Anthology |
| IndFin-Bench | India-specific financial filings | CompoundingAI |
| CFA-Bench | Financial conceptual knowledge | General |
| FinTSB | Time-series with realistic trading constraints | Survey |
| FinTrace | LLM tool-calling for long-horizon financial tasks | arXiv:2604.10015 |
| FinMME | Visual QA with 11K+ pairs | Survey |

1.5 Industry Deployments (2026)

  • Bank of New York → GPT-5.5: 220+ internal use cases

  • Revolut → PRAGMA: Production banking FM for fraud, churn, personalization

  • Citi Wealth → "Citi Sky": AI-powered advisor (Google DeepMind), rolling out Summer 2026

  • AMCAP Global: Multi-model agentic AI framework for asset analysis

2. PRAGMA: A Case Study in Banking Foundation Models

2.1 Overview

Paper: "PRAGMA: Revolut Foundation Model" · Ostroukhov et al. (Revolut Research + NVIDIA)
Published: April 9, 2026 · arXiv:2604.08649

PRAGMA is a family of encoder-only Transformer foundation models (10M, 100M, 1B parameters) pre-trained on multi-source banking event sequences. It replaces siloed, task-specific models with a single shared backbone that transfers across credit scoring, fraud detection, lifetime value prediction, communication engagement, product recommendation, and more.

2.2 Why Not Just Use an LLM?

PRAGMA argues against text serialization of structured banking data:

  1. Sequence inflation: Field names and delimiters inflate sequence lengths by 3–5×

  2. Numerical destruction: Subword tokenization splits digits, losing magnitude and ordering

  3. Heterogeneity: Banking events have variable-length records with mixed categorical, numerical, and free-text fields

2.3 Training Data

Dimension

Scale

Users

26 million

Countries

111

Events

24 billion

Tokens

207 billion

Time range

25 months (2023–2025)

Event Sources: Transactions (card payments, transfers), App (navigation, product usage), Trading (stock/crypto), Communication (push notifications, emails).

Profile State: Static contextual attributes (balance quantile, plan tier, service region) plus life-long events — timestamped milestones (e.g., first_topup) that survive truncation.

2.4 Architecture: Three-Encoder Design


  • Profile State Encoder: Processes static user attributes + life-long event timestamps via RoPE

  • Event Encoder: Processes each event independently; adds calendar features (hour/day/month via fixed-period sine/cosine)

  • History Encoder: Contextualizes the concatenation of [USR] + all [EVT] embeddings; uses RoPE on log-seconds-to-most-recent-event (a toy sketch of the full three-encoder composition follows the variant table below)

| Variant | Parameters | GPUs | Training Time |
|---|---|---|---|
| PRAGMA-S | 10M | 16× H100 | ~2 days |
| PRAGMA-M | 100M | 16× H100 | ~2 weeks |
| PRAGMA-L | 1B | 32× H100 | ~2 weeks |

All variants: GELU, pre-norm LayerNorm, dropout 0.1, Muon + AdamW, bf16 mixed precision.
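The composition of the three encoders can be sketched in a few lines of PyTorch. This is a toy illustration only: layer counts and dimensions are assumptions, mean-pooling stands in for the [USR]/[EVT] summary positions, and RoPE and the time features are omitted.

```python
import torch
import torch.nn as nn


def encoder(d: int, n_heads: int, n_layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True, norm_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)


class ThreeEncoderBackbone(nn.Module):
    """Toy sketch of the profile / event / history composition (dimensions illustrative)."""

    def __init__(self, d: int = 256, n_heads: int = 4):
        super().__init__()
        self.profile_encoder = encoder(d, n_heads, 2)   # static attributes + life-long events
        self.event_encoder = encoder(d, n_heads, 2)     # each event processed independently
        self.history_encoder = encoder(d, n_heads, 4)   # contextualizes [USR] + all [EVT]

    def forward(self, profile_tokens: torch.Tensor, event_tokens: torch.Tensor) -> torch.Tensor:
        # profile_tokens: (1, n_profile_tokens, d); event_tokens: (n_events, tokens_per_event, d)
        usr = self.profile_encoder(profile_tokens).mean(dim=1)   # (1, d) user summary
        evt = self.event_encoder(event_tokens).mean(dim=1)       # (n_events, d) event summaries
        history = torch.cat([usr, evt], dim=0).unsqueeze(0)      # (1, 1 + n_events, d)
        return self.history_encoder(history)                     # contextualized sequence
```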

2.5 Tokenization: Key–Value–Time

Each data point decomposes into three components:

| Component | Vocabulary | Encoding |
|---|---|---|
| Keys (field names) | ~60 tokens | Single token per semantic type |
| Values | ~28K tokens | Numerical → percentile buckets; Categorical → single token; Text → BPE subwords |
| Time | Continuous | Log-seconds 8·ln(1 + t/8) + calendar sine/cosine |

Token embedding: x = PosEmb(E(key) + E(value)); positions index within a field, not across fields.
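A minimal sketch of the key–value–time decomposition described above. The field names, bucket counts, and vocabularies are illustrative assumptions, not constants from the paper:

```python
import math
from datetime import datetime

import numpy as np

# Illustrative vocabularies; PRAGMA's ~60-key / ~28K-value vocabularies are not public.
KEY_VOCAB = {"amount": 0, "currency": 1, "merchant_category": 2}
VALUE_VOCAB = {"GBP": 100, "EUR": 101, "groceries": 150}
NUM_BUCKET_OFFSET = 200

# Percentile edges for a numeric field, fit once on (here: synthetic) training amounts.
AMOUNT_EDGES = np.percentile(np.random.default_rng(0).lognormal(3.0, 1.5, 100_000),
                             np.arange(1, 100))


def bucketize(amount: float) -> int:
    """Map a numeric value to one of 100 percentile buckets in the value vocabulary."""
    return NUM_BUCKET_OFFSET + int(np.searchsorted(AMOUNT_EDGES, amount))


def time_features(seconds_to_most_recent: float, ts: datetime) -> dict:
    """Continuous time channel: compressed log-seconds plus calendar sine/cosine."""
    return {
        "log_seconds": 8.0 * math.log1p(seconds_to_most_recent / 8.0),  # 8*ln(1 + t/8)
        "hour_sin": math.sin(2 * math.pi * ts.hour / 24),
        "hour_cos": math.cos(2 * math.pi * ts.hour / 24),
    }


def tokenize_event(event: dict) -> list[tuple[int, int]]:
    """Decompose one event into (key_id, value_id) pairs; the embedding layer then
    computes x = PosEmb(E(key) + E(value)), with positions indexed within each field."""
    return [
        (KEY_VOCAB["amount"], bucketize(event["amount"])),
        (KEY_VOCAB["currency"], VALUE_VOCAB[event["currency"]]),
        (KEY_VOCAB["merchant_category"], VALUE_VOCAB[event["mcc_name"]]),
    ]


print(tokenize_event({"amount": 42.50, "currency": "GBP", "mcc_name": "groceries"}))
```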

2.6 Pre-training: Masked Modeling

BERT-style MLM adapted for structured events. The MLM head receives a 3d-dimensional concatenation per masked token:

  1. Event Encoder output (local within-event context)

  2. History Encoder output at the [EVT] position (cross-event context)

  3. History Encoder output at the [USR] position (user-level context)

Masking strategy (three complementary sources):

| Type | Rate | Purpose |
|---|---|---|
| Token-level | 15% | Standard token reconstruction |
| Event-level | 10% | Reconstruct entire events from history |
| Key-level | 10% | Predict values given other keys + context |

A small fraction of masked positions is replaced with [UNK] (excluded from the loss) as input dropout.
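A sketch of how the three masking sources might be combined for one event. The rates follow the paper; how the sources interact and the [UNK] input-dropout handling are assumptions:

```python
import numpy as np


def build_event_mask(event_tokens, rng, p_token=0.15, p_event=0.10, p_key=0.10):
    """Return a boolean mask over one event's (key_id, value_id) tokens.

    Combines: (1) token-level masking, (2) event-level masking of the whole event,
    (3) key-level masking of every value belonging to a sampled key.
    """
    keys = np.array([k for k, _ in event_tokens])
    mask = rng.random(len(event_tokens)) < p_token        # 1) standard token reconstruction
    if rng.random() < p_event:                            # 2) reconstruct the entire event
        mask[:] = True
    for key in np.unique(keys):
        if rng.random() < p_key:                          # 3) predict values given other keys
            mask |= (keys == key)
    return mask


# A small fraction of masked positions would additionally be swapped to [UNK] and excluded
# from the loss (input dropout). For each remaining masked token, the MLM head receives
# concat([event_encoder_out, history_out_at_EVT, history_out_at_USR])  -> 3*d dimensions.
mask = build_event_mask([(0, 201), (1, 100), (2, 150)], np.random.default_rng(0))
print(mask)
```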

2.7 Engineering

  • Sequence packing: FlashAttention varlen kernel eliminates padding overhead. 2–5× throughput improvement.

  • Dynamic batching: Records sharded by event count; greedy packing within a fixed GPU memory budget (see the sketch after this list).

  • Truncation: Max 24 tokens/event (0.01% affected), max 6,500 events/user (most recent retained).

  • Storage: LMDB user index + Parquet event shards partitioned by event count.
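A sketch of greedy packing under a fixed token budget, one plausible reading of the dynamic-batching description above; the budget and the longest-first ordering are assumptions:

```python
def pack_batches(records: list[tuple[str, int]], max_tokens_per_batch: int = 32_768) -> list[list[str]]:
    """Greedily pack (user_id, token_count) records into batches under a token budget.

    The packed batch is then consumed by FlashAttention's varlen kernel via cumulative
    sequence lengths (cu_seqlens), so no padding tokens are materialized.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    # Longest-first packing keeps each batch close to the budget.
    for user_id, n_tokens in sorted(records, key=lambda r: r[1], reverse=True):
        if current and used + n_tokens > max_tokens_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(user_id)
        used += n_tokens
    if current:
        batches.append(current)
    return batches


print(pack_batches([("u1", 20_000), ("u2", 18_000), ("u3", 9_000)], max_tokens_per_batch=30_000))
```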

2.8 Results (Relative to Internal Production Baselines)

Caveat: Only relative improvements reported. Absolute metrics withheld for commercial sensitivity.

Main Results (PRAGMA-L + LoRA)

| Task | Metric | Relative Change |
|---|---|---|
| Credit Scoring | PR-AUC | +130.2% |
| Credit Scoring | ROC-AUC | +12.4% |
| Communication Engagement | PR-AUC | +79.4% |
| Communication Engagement | ROC-AUC | +20.4% |
| External Fraud | Precision | +64.7% |
| External Fraud | Recall | +64.7% |
| Uplift | AUUC | +163.7% |

Scaling (PRAGMA-S → L, LoRA)

| Task | Metric | S→M | S→L |
|---|---|---|---|
| Credit Scoring | PR-AUC | +16.3% | +35.2% |
| External Fraud | Recall | +24.8% | +23.5% |
| Product Rec. | mAP | +18.9% | +27.0% |
| LTV | PR-AUC | +1.5% | +3.0% |

Scaling gains are task-dependent: credit scoring benefits enormously; LTV saturates early.

Pre-training Effect (LoRA vs. Scratch, PRAGMA-M)

| Task | PR-AUC Gain |
|---|---|
| Comm. Engagement | +18.6% |
| Credit Scoring | +13.0% |
| Product Rec. (mAP) | +10.3% |

Profile State Effect (Full vs. Event-only)

| Task | Impact |
|---|---|
| Fraud Recall | +85.6% |
| Credit Scoring PR-AUC | +31.8% |
| Comm. Engagement PR-AUC | −3.0% (hurts!) |

Profile state is hugely important for fraud/credit; slightly harmful for communication engagement.

Optional: Pre-trained Text Encoder (Nemotron-1B-v2)

Credit Scoring PR-AUC: +16.1%. Product Rec. mAP: −6.4%. Trade-off: +18% training latency. Kept as opt-in.

2.9 Where PRAGMA Fails: Anti-Money Laundering

Task

Metric

PRAGMA vs. Baseline

AML

F₀.₅

−47.1%

AML is inherently relational — requires cross-user network signals. PRAGMA processes users in isolation. This is a fundamental architectural limitation.

2.10 Critical Assessment

Strengths:

  • First production-scale, multi-source banking FM with public documentation

  • Principled tokenization avoiding text serialization pitfalls

  • Broadest task evaluation in the financial FM literature (6+ tasks)

  • Honest about failures (AML)

  • NVIDIA co-authorship; solid engineering details

Weaknesses:

  • No absolute metrics → baseline strength unknown

  • No public model or data → unreproducible externally

  • Internal benchmarks only → no cross-paper comparison possible

  • Relational blindness → fundamental gap for graph-structured tasks

  • No comparison to simpler baselines (e.g., GBDT + RNN)

3. Industrial Event Sequence Foundation Models

These models share PRAGMA's thesis: behavioral event sequences contain transferable representations that outperform hand-crafted features across multiple tasks. None are open-source.

3.1 Meta — HSTU / Generative Recommenders (ICML 2024)

Paper: "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations"
arXiv:2402.17152

The foundational paper for industrial event-sequence FMs:

  • Architecture: HSTU — pointwise aggregated attention + SiLU gating replacing softmax (a simplified sketch follows this list). 5.3–15.2× faster than FlashAttention-2.

  • Scale: Up to 1.5 trillion parameters (including embeddings). Power-law scaling across 3 orders of magnitude.

  • Training: Autoregressive next-action prediction. Each sequence generates O(N) training signals.

  • Results: +12.4% topline improvement across Meta production surfaces.

  • Serving: M-FALCON enables 285× more complex models at same throughput.

  • Key insight: Actions encode richer intent than words.
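A simplified, single-head sketch of the softmax-free pointwise attention with SiLU gating that the paper describes. The production HSTU layer adds relative-position biases, multiple heads, normalization, and a fused kernel, none of which are shown here:

```python
import torch
import torch.nn.functional as F


def hstu_style_attention(x: torch.Tensor, wq, wk, wv, wu) -> torch.Tensor:
    """Pointwise aggregated attention: SiLU replaces softmax, output is SiLU-gated.

    x: (seq_len, d); wq/wk/wv/wu: (d, d) projection matrices. Causal masking applied.
    """
    n, _ = x.shape
    q, k, v, u = (F.silu(x @ w) for w in (wq, wk, wv, wu))
    scores = F.silu(q @ k.T) / n                      # pointwise nonlinearity, no softmax
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    scores = scores.masked_fill(~causal, 0.0)         # attend only to past positions
    return u * (scores @ v)                           # elementwise SiLU gate on aggregated values
```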

3.2 Nubank — nuFormer (2025)

Paper: "Your Spending Needs Attention: Modeling Financial Habits with Transformers"
arXiv:2507.23267

Closest predecessor to PRAGMA:

  • Scale: O(100B) transactions, 100M+ members. 24M–330M parameters.

  • Architecture: Transformer with text-like tokenization + DCNv2 joint fusion.

  • Training: Next-token prediction.

  • Results: 1.25% relative AUC lift (3.3× typical launch impact). 4.4% churn reduction.

  • Limitation vs. PRAGMA: Single event source (transactions only), no explicit profile state.

3.3 Visa — TransactionGPT (2025) + TREASURE (2025)

TransactionGPT: arXiv:2511.08939

  • Architecture: 3D-Transformer with Virtual Token Layer (~10M params). Three coupled transformers for features, metadata, temporal dimensions.

  • Results: +22% on production anomaly detection. 92% fewer params and 300× faster inference than Llama2-7B on MCC prediction.

TREASURE: arXiv:2511.19693

  • Transformer FM for high-volume payment understanding. Extended with LLM-based sentence embeddings (arXiv:2601.05271).

3.4 Yandex Music — ARGUS (KDD 2026)

Paper: "Scaling Recommender Transformers to One Billion Parameters"
arXiv:2507.15994

  • Scale: 3.2M → 1B parameters. Power-law scaling confirmed independently from HSTU.

  • Training: Dual pre-training — Next Item Prediction + Feedback Prediction (RL-inspired).

  • Results: +2.26% listening time, +6.37% likes — largest DL improvement in platform history.

  • Key insight: Architecture matters less than training objective and scale. Standard Transformer matches HSTU with the right task.

3.5 Pinterest — TransAct V2 (2025)

arXiv:2506.02267

  • Architecture: Tiny transformer (2 layers, d=64) in wide-and-deep CTR model. Pragmatic latency-first design for 500M+ users.

  • Innovation: SKUT Triton kernel (6.6× speedup). 103–338× end-to-end latency improvement.

  • Results: +6.35% repin, −12.80% hides online.

3.6 Mastercard — Large Tabular Model (2026)

Press release. No peer-reviewed paper. "Large tabular model" on billions of anonymized transactions.

3.7 Stripe — Payment Foundation Model (2025)

TechCrunch, May 2025. No peer-reviewed paper. Tens of billions of transactions. NVIDIA partnership.

3.8 Open Banking Foundational Model (2025)

arXiv:2511.12154. Academic work. Multimodal FM integrating structured transaction attributes with text descriptions via MLM.

Comparison Matrix

| Model | Builder | Params | Events | Sources | Architecture | Objective | Tasks | Open? |
|---|---|---|---|---|---|---|---|---|
| PRAGMA | Revolut | 10M–1B | 24B | Txns, app, trading, comms | Encoder, 3-branch | Masked modeling | 6+ | ❌ |
| HSTU | Meta | ≤1.5T | Trillions | User actions | Decoder, HSTU | Next-action | Rec. | ❌ |
| nuFormer | Nubank | 24M–330M | ~100B | Transactions | Encoder+DCNv2 | Next-token | Credit, churn | ❌ |
| TransactionGPT | Visa | ~10M | Billions | Payments | 3D-Transformer | Next-txn+supervised | Anomaly, MCC | ❌ |
| TREASURE | Visa | — | Billions | Payments | Transformer | Self-supervised | Fraud, personalization | ❌ |
| ARGUS | Yandex | 3.2M–1B | Billions | Listening | Transformer | NIP+feedback | Rec. | ❌ |
| TransAct V2 | Pinterest | Tiny | O(10⁴)/user | Actions | Transformer in CTR | Next Action Loss | CTR | ❌ |
| Mastercard LTM | Mastercard | — | Billions | Payments | "LTM" | — | Fraud | ❌ |
| Stripe PFM | Stripe | — | Tens of B | Payments | — | — | Routing | ❌ |

4. Financial Time-Series Foundation Models

4.1 Kronos (AAAI 2026)

Paper: "Kronos: A Foundation Model for the Language of Financial Markets"
arXiv:2508.02739 · Code · Open-source (MIT)

  • Architecture: Decoder-only Transformer with specialized OHLCV tokenizer that discretizes candlestick data into hierarchical tokens.

  • Scale: 12B+ K-line records from 45 global exchanges. Model family: 4.1M–499M params.

  • Results (zero-shot): +93% RankIC over leading TSFM for price forecasting; 9% lower MAE in volatility; 22% better generative fidelity.

  • Significance: First model to show that the pre-training paradigm works for financial time series — previous TSFMs often underperformed non-pretrained architectures.

4.2 FinCast (CIKM 2025)

Paper: "FinCast: A Foundation Model for Financial Time-Series Forecasting"
arXiv:2508.19609 · Code · Open-source (Apache 2.0)

  • Architecture: Decoder-only Transformer + Mixture of Experts (MoE).

  • Scale: 20B+ financial time points across diverse domains and resolutions.

  • Innovation: PQ-Loss (joint point + probabilistic forecasting). MoE for domain specialization.

  • Results: Robust zero-shot across domains without fine-tuning.

4.3 Chronos (TMLR 2024)

Paper: "Chronos: Learning the Language of Time Series"
arXiv:2403.07815 · Code · Open-source (Apache 2.0)

  • Architecture: T5-based. Tokenizes continuous values via scaling + quantization into discrete bins (sketched after this list).

  • Scale: 20M–710M params. Trained on 27 public datasets + KernelSynth synthetic data (1M series).

  • Results: Competitive zero-shot performance on 42 benchmarks without any time-series-specific architectural changes.

  • Key insight: Standard language model architectures work directly on tokenized time series.
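A sketch of the scale-then-quantize tokenization Chronos describes, with an illustrative bin count and range (the released library handles these details itself):

```python
import numpy as np


def tokenize_series(series, n_bins: int = 4096, low: float = -15.0, high: float = 15.0):
    """Mean-scale a series, then quantize into uniform bins -> discrete token ids."""
    series = np.asarray(series, dtype=float)
    scale = np.mean(np.abs(series))
    scale = scale if scale > 0 else 1.0
    edges = np.linspace(low, high, n_bins - 1)         # interior bin edges
    return np.digitize(series / scale, edges), scale   # ids in [0, n_bins - 1]


def detokenize(token_ids, scale, n_bins: int = 4096, low: float = -15.0, high: float = 15.0):
    """Map token ids back to approximate bin centers and undo the scaling."""
    centers = np.linspace(low, high, n_bins)
    return centers[np.asarray(token_ids)] * scale


ids, s = tokenize_series([10.0, 12.0, 11.0, 13.0, 80.0])
print(ids, np.round(detokenize(ids, s), 1))
```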

4.4 Time-LLM (ICLR 2024)

Paper: "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models"
arXiv:2310.01728 · Code · 1000+ citations

  • Approach: Repurposes frozen pre-trained LLMs (Llama-7B, GPT-2) via lightweight "reprogramming" layers.

  • Innovation: Cross-attention alignment between time-series patches and text prototypes (sketched after this list); Prompt-as-Prefix enriches input with domain knowledge.

  • Trade-off: Requires per-dataset fine-tuning of reprogramming layers (LLM stays frozen).
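A sketch of the cross-attention "reprogramming" layer: time-series patch embeddings attend over a small set of text prototypes derived from the frozen LLM's word embeddings. The dimensions, prototype count, and single-head formulation are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn


class ReprogrammingLayer(nn.Module):
    """Cross-attention from patch embeddings (queries) to text prototypes (keys/values)."""

    def __init__(self, d_patch: int, d_llm: int, vocab_size: int, n_prototypes: int = 1000):
        super().__init__()
        self.to_q = nn.Linear(d_patch, d_llm)
        self.to_k = nn.Linear(d_llm, d_llm)
        self.to_v = nn.Linear(d_llm, d_llm)
        # Learned map from the full word-embedding matrix down to a few text prototypes.
        self.prototype_map = nn.Linear(vocab_size, n_prototypes, bias=False)

    def forward(self, patches: torch.Tensor, word_embeddings: torch.Tensor) -> torch.Tensor:
        # patches: (batch, n_patches, d_patch); word_embeddings: (vocab_size, d_llm), frozen.
        prototypes = self.prototype_map.weight @ word_embeddings      # (n_prototypes, d_llm)
        q = self.to_q(patches)                                        # (B, P, d_llm)
        k, v = self.to_k(prototypes), self.to_v(prototypes)           # (n_prototypes, d_llm)
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)    # (B, P, n_prototypes)
        return attn @ v   # reprogrammed patches, passed (with a text prefix) to the frozen LLM
```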

Design Tension: Tokenize vs. Reprogram

| Strategy | Models | Pros | Cons |
|---|---|---|---|
| Tokenize into bins | Chronos, Kronos | Standard LM architectures; zero-shot capable | Precision loss; bucket boundaries are data-specific |
| Reprogram frozen LLM | Time-LLM | Leverages pre-trained knowledge; cheap | Per-dataset fine-tuning; fragile alignment |
| Type-specific encoding | PRAGMA | Preserves data types natively | Custom infrastructure; domain-specific |

5. Tabular Foundation Models

Tabular FMs address fixed-schema row data — related to but distinct from event sequences.

5.1 The TabPFN Lineage

| Version | Year | Venue | Scale | Key Advance |
|---|---|---|---|---|
| TabPFN | 2023 | ICLR | ≤1K samples, 100 features | In-context learning on synthetic data from structural causal model priors |
| TabPFN v2 | 2025 | Nature | ≤10K samples, 500 features | Categorical features, missing values, regression |
| TabPFN-2.5 | 2025 | arXiv | ≤50K samples, 2K features | 100% win rate vs XGBoost (small-medium); distillation engine |

Core idea: train a Transformer on synthetic data from causal priors, then do in-context learning at inference (no gradients). All open-source.
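A minimal usage sketch, assuming the open-source `tabpfn` package and its scikit-learn-style interface (fit stores the context; prediction is a forward pass of the pre-trained Transformer, with no gradient updates):

```python
# pip install tabpfn scikit-learn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()      # weights pre-trained on synthetic data from causal priors
clf.fit(X_train, y_train)     # "fitting" stores the training set as in-context examples
print(clf.predict_proba(X_test)[:3])
```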

5.2 Beyond TabPFN

| Model | Year | Venue | Key Innovation | arXiv |
|---|---|---|---|---|
| Mitra | 2025 | NeurIPS | Mixed synthetic priors (SCM + tree-based); 72M params; outperforms TabPFN v2 | 2510.21204 |
| CARTE | 2024 | ICML | Graph-based pretraining for cross-table transfer | 2402.16785 |
| TabICL v2 | 2026 | arXiv | Column-then-row attention; scales to ~500K samples; best open-source option | 2602.11139 |
| TabForestPFN | 2024 | arXiv | Forest-based synthetic data for more complex decision boundaries | 2405.13396 |
| KumoRFM-2 | 2026 | arXiv | FM for relational (multi-table) data via in-context learning | 2604.12596 |

Additional models: TabDPT, MotherNet (Microsoft), TabFlex (Microsoft), ContextTab (SAP), LimiX, Orion variants. See the Mindful Modeler overview (2026) listing 16+ tabular FMs.

5.3 Benchmarks and Surveys

  • OmniTabBench (arXiv:2604.06814): Comprehensive GBDTs vs. NNs vs. FMs comparison.

  • Survey: "Representation Learning for Tabular Data" (arXiv:2504.16109)

  • FMSD Workshop at ICML 2025 and 2026: 99 submissions, 500+ participants.

5.4 Cross-Modal Self-Supervised Learning

data2vec (Meta, ICML 2022; arXiv:2202.03555): First framework using the same self-supervised algorithm across speech, vision, and NLP. Predicts contextualized latent representations via self-distillation (teacher = EMA of student). Achieved SOTA on ImageNet (ViT-B/L). Open-source in fairseq.

Relevance: Demonstrates that masked prediction of rich contextualized targets (not raw inputs) can work across modalities. PRAGMA's MLM head, which concatenates local + cross-event + user-level context before prediction, echoes this principle.

6. Behavioral, Healthcare, and Other Event-Sequence FMs

6.1 Behavioral / Clickstream FMs

| Paper | Year | Key Idea | Source |
|---|---|---|---|
| BehaveGPT | 2025 | Transformer + DRO for user behavior prediction | arXiv:2505.17631 |
| Large Behavioral Models | 2026 | Next-event prediction across retail/payments (Unbox AI) | Blog |
| NEST | 2026 | Event streams as sequences of multisets; Masked Set Modeling | arXiv:2602.00520 |
| ClickstreamGPT | 2025 | GPT-style generation for e-commerce clickstream | Springer |
| TRACE | 2024 | Multi-session clickstream embeddings via multi-task learning | arXiv:2409.12972 |

6.2 Self-Supervised Learning for Event Sequences

An active sub-area exploring pre-training objectives (a contrastive-learning sketch follows this list):

  • Contrastive + Generative fusion: Yugay & Zaytsev (2024). arXiv:2408.09995

  • MLEM: Generative and contrastive as distinct modalities. arXiv:2401.15935

  • PyTorch-Lifestream (IJCAI 2025): Open-source library implementing CoLES, CPC, RTD, BERT-style pretraining. Proceedings

  • Mixed-type Event Sequences (NeurIPS 2025): Heterogeneous events across medicine, finance, remote sensing. Poster
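A generic contrastive sketch in the spirit of CoLES (not the PyTorch-Lifestream API): two random crops of the same user's event sequence form a positive pair, crops from other users in the batch act as negatives, and an InfoNCE loss pulls the positives together. Crop length, temperature, and the encoder interface are assumptions:

```python
import torch
import torch.nn.functional as F


def coles_style_infonce(encoder, event_seqs, crop_len: int = 64, temperature: float = 0.1):
    """event_seqs: list of (T_i, d_in) tensors with T_i >= crop_len.
    encoder: maps (batch, crop_len, d_in) -> (batch, d_out) sequence embeddings."""

    def random_crop(seq: torch.Tensor) -> torch.Tensor:
        start = torch.randint(0, seq.shape[0] - crop_len + 1, (1,)).item()
        return seq[start:start + crop_len]

    view_a = torch.stack([random_crop(s) for s in event_seqs])   # one crop per user
    view_b = torch.stack([random_crop(s) for s in event_seqs])   # a second, independent crop
    z_a = F.normalize(encoder(view_a), dim=-1)
    z_b = F.normalize(encoder(view_b), dim=-1)
    logits = z_a @ z_b.T / temperature            # (B, B) cosine similarities
    targets = torch.arange(z_a.shape[0])          # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```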

6.3 Healthcare Event FMs

Healthcare is the closest parallel to financial event FMs — similar challenges (heterogeneous events, irregular timing, long histories, privacy).

| Model | Year | Scale | Architecture | Key Innovation | arXiv |
|---|---|---|---|---|---|
| Apollo | 2026 | 25B events, 7.2M patients, 28 modalities | Multimodal temporal FM | 322 clinical tasks | 2604.18570 |
| EHRMamba | 2025 | — | Mamba (subquadratic) | Addresses O(n²) for long EHR | 2405.14567 |
| RAVEN | 2026 | >1M patients | Generative, next-visit | Recurrence-aware regularization | 2603.24562 |
| NEST | 2026 | — | Hierarchical multiset | Models co-occurring events | 2602.00520 |

Apollo mirrors PRAGMA's approach at similar scale: 25B medical events, multi-source, multi-task. Cross-pollination between financial and healthcare event FMs is an underexplored opportunity.

6.4 Recommendation FMs

| Model | Year | Venue | Key Innovation | Source |
|---|---|---|---|---|
| RecGPT | 2025 | EMNLP | Zero-shot cross-domain via unified item tokenization | arXiv:2506.06270 |
| RecBase | 2025 | EMNLP | Domain-agnostic FM for zero-shot rec. | ACL Anthology |

Surveys: arXiv:2504.16420 (2025), arXiv:2402.11143 (2024).

6.5 Other Domains

Event-sequence foundation models are also emerging outside finance and healthcare; see, for example, the survey of sensor-based human activity recognition FMs (arXiv:2604.02711) listed in the Sources.

7. Cross-Cutting Themes

7.1 The Tokenization Problem

The central unsolved problem for non-text FMs. No consensus exists.

| Strategy | Used By | Trade-off |
|---|---|---|
| Serialize as text | Naive baseline | Universal but inflated, numerical info destroyed |
| Percentile bucketing | PRAGMA, Chronos | Compact, preserves magnitude; loses precision |
| Learned per-field embeddings | TabTransformer, FT-Transformer | Native structure; requires fixed schema |
| OHLCV hierarchical tokenizer | Kronos | Domain-optimized; not transferable |
| Reprogramming via cross-attention | Time-LLM | Reuses LLM knowledge; per-dataset fine-tuning |
| Key–value–time decomposition | PRAGMA | Type-aware heterogeneous encoding; custom infra |
| Synthetic prior fitting | TabPFN, Mitra | No real data needed; limited to tabular (not sequences) |

7.2 Pre-training Objective: No Consensus

| Objective | Used By | Analogy |
|---|---|---|
| Masked modeling | PRAGMA, BERT4Rec, NEST, MLEM | BERT-style |
| Next-event prediction | HSTU, ARGUS, nuFormer, RAVEN | GPT-style |
| Contrastive learning | CoLES, MLEM, BYB | SimCLR/CLIP-style |
| Dual (NIP + feedback) | ARGUS | RL-inspired |
| Joint self-supervised + supervised | TransactionGPT | Multi-task |
| Synthetic prior fitting | TabPFN, Mitra | Bayesian meta-learning |

The right objective depends on the downstream use case: encoder-only models naturally pair with masked modeling (discriminative tasks); decoder-only with next-event prediction (generative tasks).

7.3 Scaling Laws for Structured Data

| Paper | Scaling Behavior |
|---|---|
| HSTU (Meta) | Power-law across 3 orders of magnitude — comparable to GPT-3/LLaMA-2 |
| ARGUS (Yandex) | Linear on log scale from 3.2M to 1B |
| PRAGMA (Revolut) | Task-dependent: credit scoring benefits enormously; LTV saturates early |
| TabPFN lineage | Scales primarily in data capacity (1K → 50K) rather than model size |

No emergent capabilities observed. Unlike LLMs with qualitative jumps at scale, structured data FMs show smooth, task-dependent scaling.

7.4 Encoder-Only vs. Decoder-Only

| Architecture | Used By | Best For |
|---|---|---|
| Encoder-only (bidirectional) | PRAGMA, BERT4Rec, NEST | Discriminative (classification, scoring) |
| Decoder-only (autoregressive) | HSTU, Kronos, FinCast, Chronos | Generative (forecasting, next-event) |
| Hybrid | TransactionGPT, nuFormer | Multi-objective |

PRAGMA's encoder-only choice is unusual in a field trending decoder-only. Justified: "Our primary goal is transferable representations for discriminative financial tasks, rather than open-ended generation."

7.5 The "None of This Is Open" Problem

Industrial event-sequence FMs: 0/9 are open-source.
Time-series FMs: 3/4 are open-source (Chronos, Kronos, FinCast).
Tabular FMs: Majority are open-source (TabPFN, Mitra, TabICL).

The industrial event-sequence community treats data as the moat. The academic community values reproducibility. This split makes cross-paper comparison of event-sequence FMs effectively impossible.

PyTorch-Lifestream (IJCAI 2025) is the main open-source effort to democratize event-sequence pretraining methods.

7.6 The Competitive Landscape in Financial Services

As of April 2026, at least six major financial institutions have published or announced transaction/event FMs:

| Company | Model | Peer-Reviewed Paper? |
|---|---|---|
| Revolut | PRAGMA | arXiv preprint |
| Nubank | nuFormer | arXiv preprint |
| Visa | TransactionGPT + TREASURE | ✅ (both) |
| Mastercard | LTM | ❌ (press only) |
| Stripe | Payment FM | ❌ (press only) |
| Meta | HSTU/GR | ✅ (ICML 2024) |

This convergence marks the emergence of behavioral representation as a new competitive layer in financial services.

7.7 Relational / Graph Structure: The Big Gap

PRAGMA's −47% AML result is symptomatic of a field-wide limitation. Single-user / single-row models cannot capture cross-entity relationships needed for:

  • Anti-money laundering (transaction networks)

  • Network fraud (coordinated attacks)

  • Syndicated lending risk

  • Social contagion effects

KumoRFM-2 (arXiv:2604.12596) addresses multi-table relational data but not cross-user graph structure. The Graph Foundation Models survey (arXiv:2505.15116) maps the broader space but doesn't specifically address financial transaction graphs.

8. Open Questions and Gaps

  1. No public benchmark for event sequence FMs. NLP has GLUE; vision has ImageNet; tabular has OmniTabBench. Event sequences have nothing. Every paper evaluates on proprietary data.

  2. Optimal pre-training objective unknown. No systematic ablation of masked modeling vs. next-event prediction vs. contrastive learning on the same data.

  3. Cross-institution transfer untested. Can PRAGMA-style representations transfer to a different bank's data distribution? To a different country's financial system? No evidence either way.

  4. Relational structure unaddressed. PRAGMA's AML failure is just one symptom. No event-sequence FM handles cross-user graph structure.

  5. Regulatory barriers to replication. GDPR, MiFID II, HIPAA make public benchmarks from real data structurally impossible. Federated and synthetic approaches are immature.

  6. No emergent capabilities. Scaling is smooth and task-dependent. Whether structured data FMs can exhibit LLM-like phase transitions remains open.

  7. Efficiency Pareto frontier poorly characterized. TransAct V2 (2 layers, d=64) achieves strong results; HSTU uses 1.5T params. The right operating point for different latency/accuracy trade-offs is unclear.

  8. Domain-specific vs. general language pretraining. For NLP tasks, frontier general LLMs now win. Is there a crossover point where domain-specific language pretraining becomes worthwhile again, or has that ship sailed permanently?

9. Recommended Reading List

Tier 1: Essential

  1. Zhai et al., "Actions Speak Louder than Words" (ICML 2024) — arXiv:2402.17152

  2. Ostroukhov et al., "PRAGMA: Revolut Foundation Model" (2026) — arXiv:2604.08649

  3. Braithwaite et al., "nuFormer" (2025) — arXiv:2507.23267

  4. Hollmann et al., "TabPFN v2" (Nature, 2025) — Nature

Tier 2: Important

  1. Dou et al., "TransactionGPT" (2025) — arXiv:2511.08939

  2. Khrylchenko et al., "ARGUS" (KDD 2026) — arXiv:2507.15994

  3. Ansari et al., "Chronos" (TMLR 2024) — arXiv:2403.07815

  4. Shi et al., "Kronos" (AAAI 2026) — arXiv:2508.02739

  5. Yeh et al., "TREASURE" (2025) — arXiv:2511.19693

  6. Zhang et al., "Mitra" (NeurIPS 2025) — arXiv:2510.21204

  7. "Apollo" (2026) — arXiv:2604.18570

Tier 3: Surveys & Tools

  1. "FM-Powered Recommender Systems" (2025) — arXiv:2504.16420

  2. "Foundation Models for Recommender Systems" (2024) — arXiv:2402.11143

  3. "Representation Learning for Tabular Data" (2025) — arXiv:2504.16109

  4. "Graph Foundation Models" (2025) — arXiv:2505.15116

  5. Chen et al., "Advancing Financial Engineering with Foundation Models" (Engineering, 2025) — DOI

  6. PyTorch-Lifestream (IJCAI 2025) — Proceedings

  7. OmniTabBench (2026) — arXiv:2604.06814

10. Sources

Primary Sources (paper read or abstract verified)

| # | Paper | Identifier |
|---|---|---|
| 1 | Zhai et al., "Actions Speak Louder than Words" (ICML 2024) | arXiv:2402.17152 |
| 2 | Ostroukhov et al., "PRAGMA" (2026) | arXiv:2604.08649 |
| 3 | Braithwaite et al., "nuFormer" (2025) | arXiv:2507.23267 |
| 4 | Dou et al., "TransactionGPT" (2025) | arXiv:2511.08939 |
| 5 | Yeh et al., "TREASURE" (2025) | arXiv:2511.19693 |
| 6 | Khrylchenko et al., "ARGUS" (KDD 2026) | arXiv:2507.15994 |
| 7 | Xia et al., "TransAct V2" (2025) | arXiv:2506.02267 |
| 8 | Ansari et al., "Chronos" (TMLR 2024) | arXiv:2403.07815 |
| 9 | Jin et al., "Time-LLM" (ICLR 2024) | arXiv:2310.01728 |
| 10 | Shi et al., "Kronos" (AAAI 2026) | arXiv:2508.02739 |
| 11 | Zhu et al., "FinCast" (CIKM 2025) | arXiv:2508.19609 |
| 12 | Hollmann et al., "TabPFN" (ICLR 2023) | arXiv:2207.01848 |
| 13 | Hollmann et al., "TabPFN v2" (Nature, 2025) | Nature |
| 14 | Grinsztajn et al., "TabPFN-2.5" (2025) | arXiv:2511.08667 |
| 15 | Zhang et al., "Mitra" (NeurIPS 2025) | arXiv:2510.21204 |
| 16 | Baevski et al., "data2vec" (ICML 2022) | arXiv:2202.03555 |
| 17 | Kim et al., "CARTE" (ICML 2024) | arXiv:2402.16785 |
| 18 | "TabICL v2" (2026) | arXiv:2602.11139 |
| 19 | Fey et al., "KumoRFM-2" (2026) | arXiv:2604.12596 |
| 20 | "BehaveGPT" (2025) | arXiv:2505.17631 |
| 21 | "NEST" (2026) | arXiv:2602.00520 |
| 22 | "Apollo" (2026) | arXiv:2604.18570 |
| 23 | "EHRMamba" (ML4H 2025) | arXiv:2405.14567 |
| 24 | "RAVEN" (2026) | arXiv:2603.24562 |
| 25 | "RecGPT" (EMNLP 2025) | arXiv:2506.06270 |
| 26 | "Open Banking FM" (2025) | arXiv:2511.12154 |
| 27 | Polleti et al., "TREASURE + LLM embeddings" (2025) | arXiv:2601.05271 |
| 28 | "OmniTabBench" (2026) | arXiv:2604.06814 |
| 29 | Dong et al., "LLM Agents in Finance" (EMNLP 2025) | ACL Anthology |
| 30 | Zhang et al., "XFinBench" (ACL 2025) | ACL Anthology |
| 31 | Chen et al., "Advancing Financial Engineering with FMs" (Engineering, 2025) | DOI |
| 32 | "FinTrace" (2026) | arXiv:2604.10015 |

Secondary Sources (press releases, blogs, overviews)

| # | Source | URL |
|---|---|---|
| 33 | Finance LLM Leaderboard 2026 | awesomeagents.ai |
| 34 | Mastercard LTM announcement | mastercard.com |
| 35 | Stripe Payment FM | TechCrunch |
| 36 | Large Behavioral Models (Unbox AI) | Blog |
| 37 | State of Tabular FMs (2026) | Mindful Modeler |
| 38 | PRAGMA deep dive (Linas Beliūnas) | Substack |
| 39 | FMSD Workshop, ICML 2025 | ICML |
| 40 | FMSD Workshop, ICML 2026 | Website |

Surveys

| # | Survey | Source |
|---|---|---|
| 41 | FM-Powered Recommender Systems (2025) | arXiv:2504.16420 |
| 42 | Foundation Models for Recommender Systems (2024) | arXiv:2402.11143 |
| 43 | Representation Learning for Tabular Data (2025) | arXiv:2504.16109 |
| 44 | Graph Foundation Models (2025) | arXiv:2505.15116 |
| 45 | Sensor-based HAR FMs (2026) | arXiv:2604.02711 |

© 2026 ZKAI Labs. All rights reserved.