This model is released under the Reactive AI Model & Architecture License (RAML) v1.0. Reactive Transformer (patent pending, #P.453260) is available for free for non-commercial (research and education) usage. For commercial usage, please contact Reactive AI at [email protected].


RxT-Beta 3B (A190M & M176+/33M)

Preview repo - training in progress - planned release date: February 2026

Docs in progress

RxT-Beta is the world's first real-scale stateful Reactive Language Model (RxLM) with infinite memory & context, designed to confirm the new Reactive Transformer (RxT) scaling laws and to address the biggest problems of stateless LLMs. RxT models are natively conversational (and agentic) - instead of reprocessing the entire conversation history (chat template) like LLMs do, they process only single interactions in real time and move the context into a dedicated embedding-based memory that is updated asynchronously between interactions. It introduces unique features like:

  • infinite conversation & global context through Mixture-of-Memory (MoM)
  • live continual learning from interactions in real-time
  • true real-time processing with near-zero latency
  • linear conversation cost scaling
  • fixed computational cost and memory usage for each interaction
  • increasing quality of responses with subsequent steps of dialogue, without "long-term hallucinations"
  • natively encoded memory, impossible to read without the model
  • extreme pre-training efficiency
  • hybrid stateful reasoning

In the first small-scale experiments, RxT-Alpha models achieved about 50% higher accuracy and almost 2x lower perplexity than a stateless decoder-only baseline of the same size, trained on the same simple synthetic dataset (and additionally pre-trained on 5x more tokens). These results were then confirmed on a small 10B-token subset of real-world data with ~0.3B models (RxT-Beta Micro), where the RxT advantage was even bigger. These promising results, along with all the unique features, suggest that the Reactive Transformer is a revolutionary generational leap and a crucial milestone on the path to Artificial General Intelligence (AGI). Of course, this still has to be confirmed at scale, which is exactly what we plan to do with RxT-Beta.

The goal is to compete with ~1-3B parameter dense stateless LLMs pre-trained on trillions of tokens, using a model with only 190M active parameters and about 250B pre-training tokens, and to significantly outperform them on long multi-turn conversations.

Reactive Transformer - stateful event-driven path to AGI

How can we even talk about General Intelligence in the context of completely stateless models? Memory is a crucial part of intelligence: it enables learning new things from interactions with the environment and storing subjective experience. This fact is mostly ignored in the current generation of AI.
Reactive Transformer is a fundamental paradigm shift made to close this gap between stateless LLMs and real intelligence. With a natural, internal memory system, continual live learning and true stateful real-time processing, it marks the beginning of a new age of Event-Driven AI.

Mistakes of decoder-only Transformers

While the growing knowledge and abilities of LLMs are impressive, they work completely differently from natural intelligence. All current models are stateless and data-driven, which means they process the entire conversation at once to keep context between messages, instead of processing single messages with a real memory system. This approach leads to multiple problems:

  • extreme inefficiency:
    • for the first message, the model only has to process that message and generate an answer
    • but for the 20th message, it has to re-process all 19 previous queries and answers just to generate the answer to that last message
    • that makes very long context windows, like 1M+ tokens, economically prohibitive
    • let's take an extreme example - Llama 4 Scout with a 10M context window and $0.18/million tokens pricing (Novita.ai). With a mean interaction length (query + answer) of 5k tokens (T), it could theoretically fit 2k interactions (N) into the context window. For a full conversation with a stateless LLM the total is N * (N / 2) * T, so in this case 2000 * 1000 * 5000 = 10B tokens! The full cost is about $1800 - roughly the average monthly salary in Poland (see the cost sketch after this list)
  • increasing complexity and dropping quality:
    • in the first few messages, the model's objective is relatively easy - it has to find related information in a limited number of tokens. Quality is great
    • but in longer dialogues the objective becomes harder - the model has to search through more tokens at each step to identify meaningful information. Quality drops
    • some research reports that after ~8 interactions accuracy is ~30% lower; other reports state that after ~100k tokens the model gets noticeably dumber
    • from our experience - after 10-20 interactions, models mix up information from different steps and become practically useless. It's better to start a new conversation and provide all the important context again, because the model cannot learn after training
  • increasing delays, computational cost and memory usage:
    • since at each step the model has to re-process all previous messages, every next step has bigger delays and is slower, due to the quadratic complexity of attention
    • accumulated interactions, stored in the KV-cache for autoregressive generation, require significant memory - usage grows with every message
  • inability to learn after the training:
    • all model parameters are static and the model cannot learn at inference time after training - one/few-shot learning isn't real learning
    • each conversation starts from exactly the same state
    • new persistent knowledge requires re-training
    • external agentic memory is text-based and its expressiveness is limited - it's still just text added to the prompt and conversation history
  • weak privacy and safety:
    • in chatbots, all text-based conversations have to be stored in the provider's database
    • sending full conversations through APIs is vulnerable to leaks
  • against natural intelligence:
    • if human consciousness worked like a stateless LLM, we would have to analyze the history of our entire day (or even entire life) to know what we were doing 10 minutes ago
    • instead, we have multi-level memory that stores information between all our interactions
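Below is a minimal back-of-the-envelope sketch of the quadratic vs linear scaling difference described in the list above, reusing the example figures (2k interactions of 5k tokens at $0.18/million tokens). The function names and pricing are purely illustrative.

```python
# Rough token-cost comparison between stateless LLM conversations (quadratic)
# and RxT-style single-interaction processing (linear). Figures follow the
# example above; prices and lengths are illustrative, not benchmarks.

def stateless_tokens(n_interactions: int, tokens_per_interaction: int) -> int:
    # Each step re-processes all previous interactions plus the current one,
    # so total tokens grow roughly as N * (N / 2) * T.
    return sum(step * tokens_per_interaction for step in range(1, n_interactions + 1))

def reactive_tokens(n_interactions: int, tokens_per_interaction: int) -> int:
    # Each step processes only the current interaction: N * T.
    return n_interactions * tokens_per_interaction

N, T = 2_000, 5_000                      # 2k interactions, 5k tokens each
price_per_million = 0.18                 # $ / 1M tokens (example pricing)

for name, fn in [("stateless", stateless_tokens), ("reactive", reactive_tokens)]:
    total = fn(N, T)
    print(f"{name}: {total / 1e9:.2f}B tokens, ~${total / 1e6 * price_per_million:,.0f}")
```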

That list is probably not even complete, but with these problems identified, we can safely say that stateless processing is the biggest mistake of decoder-only Transformers.

Common misunderstandings

Some concepts in the LLM world seem to be misused or misunderstood:

  • real-time processing - people say a model works in real-time when it has low latency and is fast. Some even call the whole LLM inference process "real-time". But when a stateless model processes a full dialogue in which some messages could be from, say, the previous day, how can we call it real-time? True real-time processing requires only the actual, current data as input.
  • one/few-shot learning - popularized by the OpenAI paper "Language Models are Few-Shot Learners". As described before, that's not real learning. It happens only within the context of the prompt; each new conversation requires providing the same examples again, as GPT-like models are not able to modify their internal state. So the title is misleading - LLMs aren't really "any-shot" learners.
  • context window as short-term memory - even new architectures with long-term memory, like Google's TITANS, treat the context window as short-term memory. In our opinion this is completely wrong, as it leads to all the efficiency problems above. Real memory should be updated dynamically, not just accumulated - the context window is just the context window, not memory.

Architecture details

  • dim: 512
  • vocab: 65k (English + Polish)
  • max interaction length: 8192 tokens
  • max conversation length: infinite
  • embedding: shared (decoder/encoder) / not tied with head
  • memory:
    • type: Dynamic Mixture-of-Memory
    • layers: 21
    • working memory: 512 slots
    • short-term memory: 2560 slots (10 fragments * 256 slots)
    • long-term memory: initial 16k (64 fragments) slots, extendable
    • dynamic memory params: extendable 176M with 33M active per interaction
  • decoder:
    • layers: 25 (21 stateful MoE + 3 stateless MoE + 1 stateless dense)
    • self-attention: Gated Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads
    • memory cross-attention: Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads
    • feed forward: Sparse Mixture-of-Experts (MoE) with gated shared experts
      • routed experts: 384
      • active experts: 10
      • routed expert dim: 192
      • shared experts: 2 with softmax gating
      • shared expert dim: 384
      • activation: SwiGLU
    • dense layer: 1536 dim with SwiGLU activation
    • params: 2.85B with 190M activated per token
  • encoder:
    • layers: 21
    • self-attention: Gated Symmetric Sparse Query Attention (sSQA) with 8/16 query/key/value heads
    • feed forward: Dense MLP / 1536 dim with SwiGLU activation
    • params: 97M
  • memory attention:
    • type: Grouped Self/Interlayer Memory Attention
    • layers: 21
    • interlayer groups: 3 * 7 layers
    • attention layers: Symmetric Sparse Query Attention (sSQA) with 8/16 query/key/value heads
    • residual gates: elementwise / sigmoid
    • params: 22.2M
  • all params: ~2.93B with 190M activated per token + 120M per interaction
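For convenience, here is a hypothetical configuration snippet summarizing the hyperparameters listed above. The field names are illustrative only and do not reflect the actual RxLM config schema.

```python
# Hypothetical summary of the architecture hyperparameters listed above.
# Field names are illustrative and are NOT the real RxLM config schema.
RXT_BETA_CONFIG = {
    "dim": 512,
    "vocab_size": 65_000,
    "max_interaction_len": 8_192,
    "decoder": {
        "layers": 25,  # 21 stateful MoE + 3 stateless MoE + 1 stateless dense
        "self_attention": {"query_heads": 8, "kv_heads": 4, "total_heads": 16, "gated": True},
        "memory_cross_attention": {"query_heads": 8, "kv_heads": 4, "total_heads": 16},
        "moe": {"routed_experts": 384, "active_experts": 10, "routed_dim": 192,
                "shared_experts": 2, "shared_dim": 384, "activation": "SwiGLU"},
    },
    "encoder": {"layers": 21, "self_attention": {"query_heads": 8, "kv_heads": 8}},
    "memory": {"working_slots": 512, "stm_slots": 2_560, "ltm_initial_slots": 16_384},
}
```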

Interaction Template

Since RxT models process only single interactions - the current query and answer, without including all previous messages (which the model accesses via memory) - the complex Chat Templates known from LLMs have been replaced by a much simpler Interaction Template. In addition to the query and answer, in RxT-Beta it is also responsible for hybrid reasoning and agentic tool control, based on special tokens (a small formatting sketch follows the list):

  • [Q] - query block
  • [A] - answer block
  • [T] - thinking/reasoning block
  • [C] - agentic tool call
  • [U] - agentic tool use
  • [I] - internal instruction
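As a rough illustration of how these special tokens compose into interaction strings, here is a hypothetical helper. The token spellings follow the list above, but the function itself is illustrative and not part of the RxLM API.

```python
# Hypothetical helper that assembles RxT-Beta interaction strings from the
# special tokens listed above. Illustrative only; not part of the RxLM API.

def build_interaction(query: str, *, thinking: str | None = None,
                      answer: str | None = None, instruction: str | None = None,
                      tool_call: str | None = None) -> str:
    parts = []
    if instruction is not None:
        parts.append(f"[I]{instruction}")
    parts.append(f"[Q]{query}")
    if thinking is not None:
        parts.append(f"[T]{thinking}")
    if answer is not None:
        parts.append(f"[A]{answer}")
    if tool_call is not None:
        parts.append(f"[C]{tool_call}")
    return "".join(parts)

# Fast answer prompt (model continues after [A]):
print(build_interaction("What is RxT?", answer=""))    # [Q]What is RxT?[A]
# Extended thinking prompt (model continues after [T]):
print(build_interaction("What is RxT?", thinking=""))  # [Q]What is RxT?[T]
```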

Hybrid Reasoning

RxT-Beta hybrid reasoning is controlled by the presence of special tokens at the end of user's query:

  • [Q] User's query... [A] forces the model to generate a fast answer without reasoning
  • [Q] User's query... [T] activates extended thinking mode. When the reasoning is finished, the model adds the [A] token on its own and continues generating the answer

Finally, [Q] User's query [A] Model's answer is the full template for the fast answer mode, and [Q] User's query [T] Model's thinking [A] Model's answer is the one for extended thinking.

Automatic Mode Selection

RxT-Beta can also automatically decide which mode to use, with simple zero-overhead routing (a minimal sketch follows the list):

  • pass the query without [A] or [T] special tokens: [Q] User's query...
  • normally, this could accidentally lead to query completion, but there's a simple trick - take only the logits/probabilities for the [A] and [T] special tokens and select the one with higher probability
  • then the model can generate answer/thinking tokens autoregressively as usual
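A minimal sketch of this routing step, assuming a Hugging Face-style causal LM interface where `input_ids` holds the [Q]-prefixed query without a trailing [A]/[T]. The interface and token IDs are assumptions, not the actual RxLM inference API.

```python
# Zero-overhead mode routing sketch: compare next-token logits of [A] vs [T]
# after processing the plain "[Q] user query..." prefix.
import torch

@torch.no_grad()
def select_mode(model, input_ids, answer_token_id: int, thinking_token_id: int) -> int:
    # Logits for the next token position only.
    logits = model(input_ids).logits[:, -1, :]          # (batch, vocab)
    answer_score = logits[0, answer_token_id]
    thinking_score = logits[0, thinking_token_id]
    # Return whichever mode token is more probable; append it to input_ids
    # and continue normal autoregressive decoding from there.
    return answer_token_id if answer_score >= thinking_score else thinking_token_id
```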

Agentic Tools

The model's answer can also result in an agentic tool call, indicated with the [C] special token - it can be used in both fast answer and extended thinking modes:

  • [Q] User's query [A] Model's answer [C] Agentic tool call (json) in fast answer mode
  • [Q] User's query [T] Model's thinking [A] Model's answer [C] Agentic tool call (json) in extended thinking mode

Then, while the model is waiting for the tool result, it asynchronously updates the memory in the background - the tool result will be processed as a new, special interaction, with the standard [Q] token replaced by [U]. Depending on the tool or the user's settings, it can trigger additional thinking or just a fast answer:

  • [U] Tool results [A] Model's answer (tool call summary) is the template for tool usage in fast answer mode
  • [U] Tool results [T] Model's thinking about tool results [A] Model's answer (tool call summary) is the template for extended thinking

Tool calls can also be chained - the answer generated for an interaction initiated with tool usage ([U]) can itself result in another tool call, just like in a regular interaction - e.g. [U] Tool results [A] Model's answer (tool call summary) [C] Another tool call (json) in fast answer mode.

Internal Instructions

Additionally, model behaviour can be controlled with special internal instructions, equivalent to the system prompt in LLMs, but limited to a single interaction instead of the full conversation (the per-conversation system prompt is replaced by memory init). They can be used, for example, to force certain tool calls or a specific answer format.

Internal instructions should be placed before the query or tool use, e.g. [I] Use 'web_search' tool [Q] User's query... [A].

Architecture Innovations

Reactive Transformer (RxT) with additional stateless layers

Reactive Transformer (Adam Filipek, 2025) is our flagship innovation that redefines conversational and agentic AI by making it natively stateful. Unlike external agentic memory systems, it treats memory as an integral part of the model. The memory is not text added to the prompt, but a set of dynamic vector embeddings, accessed through the decoder's memory cross-attention layers and updated asynchronously after the answer is generated (by the encoder and memory attention). That makes it far more expressive and compact than any existing agentic memory.

While the RxT decoder is similar to the original encoder-decoder Transformer, the cross-attention inputs are not just the encoder hidden states - they are accumulated from all previous interactions, which is why we call it memory cross-attention. We also don't use positional encoding for memory cross-attention keys, because memory has no spatial relationships - instead, it has to implicitly learn a timestep-based encoding.

Since the RxT-Alpha models introduced in the paper, we have added initial and final stateless layers to the decoder. They use only self-attention with a feed-forward layer, without memory cross-attention. In the RxT-Beta decoder we have:

  • two initial stateless layers, designed to better resolve relations inside the current query (they don't have access to previous messages, as that would go against RxT's real-time processing ideas) and between the query and answer, before any past information is accessed from memory. This helps with better question understanding
  • the first initial stateless layer uses a dense MLP, which is a standard solution in modern Mixture-of-Experts architectures. All other layers use MoE
  • two final stateless layers, which summarize all the reasoning after current and past information have been combined in the stateful layers

Sparse Query Attention (SQA)

Sparse Query Attention (Adam Filipek, 2025) is our solution for computationally efficient attention, which is especially useful in RxT. Unlike common sparse attention patterns, like Sliding Window Attention (SWA), SQA is based on structural sparsity instead of spatial sparsity. By reducing the number of query heads, it uses partial information from all tokens and performs scaled dot-product attention in a lower dimensionality (reducing the number of matrix multiplications). SQA is optimized especially for compute-bound full-sequence processing scenarios, like the prompt phase or the encoder's bidirectional attention.

In RxT-Beta we use 50% of the query heads, so attention has a 2x smaller computational cost than the GQA baseline (16 query & 4 key/value heads), while the quality decrease is negligible. In the decoder, we keep the same number of key/value heads as the GQA baseline, so the memory-access cost in autoregressive generation stays at the same level. However, in RxT the KV-cache is limited to a single interaction, so it is no longer a bottleneck. Additionally, we have 3 new bidirectional attention layers for each transformer block (one in the encoder and two in memory attention), where we use the symmetric variant of SQA with query, key and value heads all reduced by 50%, which outperforms other solutions - it matches baseline GQA quality at 2x smaller computational cost (see the sketch below).
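Here is a minimal single-layer sketch of the SQA idea described above - fewer query heads than the baseline, with grouped key/value heads as in GQA. Dimensions follow the RxT-Beta decoder configuration, but the code is an illustration and not the RxLM implementation.

```python
# Minimal Sparse Query Attention (SQA) sketch: structural sparsity through a
# reduced number of query heads, with GQA-style grouped key/value heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseQueryAttention(nn.Module):
    def __init__(self, dim=512, head_dim=32, q_heads=8, kv_heads=4):
        super().__init__()
        self.q_heads, self.kv_heads, self.head_dim = q_heads, kv_heads, head_dim
        self.q_proj = nn.Linear(dim, q_heads * head_dim)   # fewer query heads than dim / head_dim
        self.k_proj = nn.Linear(dim, kv_heads * head_dim)
        self.v_proj = nn.Linear(dim, kv_heads * head_dim)
        self.o_proj = nn.Linear(q_heads * head_dim, dim)   # project reduced width back to model dim

    def forward(self, x, causal=False):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.kv_heads, self.head_dim).transpose(1, 2)
        # Grouped attention: each KV head serves q_heads // kv_heads query heads.
        k = k.repeat_interleave(self.q_heads // self.kv_heads, dim=1)
        v = v.repeat_interleave(self.q_heads // self.kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```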

Sparse Attention for RxT

Spatially sparse attention solutions are useful for very long context windows in stateless LLMs and for chat-history reprocessing. In RxT we achieved infinite context by... making the context window shorter. It may look counterintuitive, but when the context window is limited to a single query and answer, it simply doesn't need to be as long as in LLMs, where it has to fit the whole chat history. Full SQA attention is then fast enough, and token relations inside the current interaction are naturally the strongest. Furthermore, RxT has a native sliding window that is not limited to a fixed number of tokens but to the current interaction, which is just natural.

On the other hand, spatially sparse attention is designed for unidirectional/autoregressive attention in decoder-only models, so its compatibility with the bidirectional encoder and memory attention is rather weak, especially for memory, which has no spatial relations.

Linear Attention for RxT

We tested new Linear Attention solutions and a hybrid attention architecture for RxT-Beta self-attention, but for the short single-interaction sequences used for the MVP (1-8k tokens), training was about 2-3x slower than with the full SQA baseline, due to architectural complexity overhead. We believe it will become valuable in future generations, when we extend the interaction length to 32k+ tokens, and we plan to integrate intra-sequence recurrence (the Linear Attention state) with inter-sequence recurrence (the RxT memory) in our custom solution called Memory-driven Gated DeltaNet.

Gated Self-Attention

We follow the direction of Alibaba/Qwen Team research and added sigmoid gates to our SQA self-attention layers (in both the decoder and encoder). As in the Qwen Team solution, gate values are computed from the query and applied before the final output projection. The only difference is that in SQA the gate has reduced dimensionality, the same as the query and the attention computation (see the sketch below).
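A hedged sketch of that gate, reusing the dimensions from the SQA sketch above: the gate is computed from the layer input (the query side) in the reduced q_heads * head_dim width and applied before the output projection. This illustrates the idea, not the exact RxLM layer.

```python
# Query-conditioned sigmoid gate applied before the output projection, in the
# reduced SQA dimensionality. Illustration only.
import torch.nn as nn

class GatedSQAOutput(nn.Module):
    def __init__(self, dim=512, q_heads=8, head_dim=32):
        super().__init__()
        inner = q_heads * head_dim
        self.gate_proj = nn.Linear(dim, inner)   # gate computed from the input (query side)
        self.o_proj = nn.Linear(inner, dim)

    def forward(self, x, attn_out):
        # attn_out: (batch, seq, q_heads * head_dim) from the SQA core above
        gate = self.gate_proj(x).sigmoid()
        return self.o_proj(gate * attn_out)
```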

We also tested it in cross-attention, but results were much worse than the baseline without gates, probably because of the different input sources - the gate is based on the query, which comes from the processed sequence, while the attention results are based on values from memory. Memory attention also has different input sources, so it skips gates too. So finally, we use gates only in self-attention layers.

Sparse Mixture-of-Experts (MoE) with gated shared experts

Recent models, like Kimi K2 or Qwen3-Next, demonstrated the high effectiveness of architectures with a large number of smaller experts and high sparse activation rates per token. We follow the same direction in the RxT-Beta Mixture-of-Experts, with 10 out of 384 experts activated per token. We extend it with two bigger shared experts with a softmax gate, for even better expressiveness. Both shared experts are used for all tokens, but the gate decides which one is more important for each token - in later training stages we plan to introduce task-aware shared-expert load balancing, to specialize one expert in reasoning while the second one is dedicated to fast answers, to better balance hybrid reasoning abilities. Shared experts are 2x bigger than routed experts (see the sketch below).
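An illustrative sketch of the gated shared experts only (the routed experts and top-k router are omitted for brevity): both shared experts process every token, and a per-token softmax gate mixes their outputs. Dimensions follow the architecture list above; this is not the RxLM MoE implementation.

```python
# Gated shared experts: both always active, mixed by a per-token softmax gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    def __init__(self, dim=512, hidden=384):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden)
        self.w_up = nn.Linear(dim, hidden)
        self.w_down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class GatedSharedExperts(nn.Module):
    def __init__(self, dim=512, hidden=384, n_shared=2):
        super().__init__()
        self.experts = nn.ModuleList(SwiGLUExpert(dim, hidden) for _ in range(n_shared))
        self.gate = nn.Linear(dim, n_shared)

    def forward(self, x):
        # Both shared experts see every token; the softmax gate decides how
        # much each expert contributes per token.
        weights = F.softmax(self.gate(x), dim=-1)                  # (batch, seq, n_shared)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, seq, dim, n_shared)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)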

Bidirectional Masked Language Modeling (MLM) in decoder pre-training

In the unique RxT pre-training method, the decoder learns with both unidirectional autoregressive language modeling (self-attention) and bidirectional modeling (cross-attention). This boosts training effectiveness through a "super-convergence" effect, but it also makes training too easy, which leads to a quick loss plateau at an early training stage. To prevent this, we make the decoder's task harder by adding random noise to the encoder's outputs. In early experiments we used small noise levels like 0.15-0.2, but in RxT-Beta we increased it to 0.5 as a starting point. Additionally, we also apply random masking to the encoder outputs, to make predicting tokens at masked positions even harder. This adds another objective to the decoder's training, close to the masked language modeling used in encoder training.

To make training even more effective, we introduced a progressive increase of the noise level and masking probability - with this solution, even a loss plateau is "healthy", because with each step the objective becomes harder.

Even with high noise and masking rates, the decoder quickly reaches over 90% prediction accuracy; then, for about 99% of the training time, it learns to correctly predict the remaining 10% (responsible for the most important knowledge), finally reaching a 98-99% accuracy level. This is impossible to reach in classic decoder-only LLM training - we believe this is the main reason for RxT's extreme training efficiency. More details in the training process description.

Training

Event-driven stateful real-time processing in the Reactive Transformer requires a new, re-designed training methodology. The basic pipeline was introduced in the RxT-Alpha paper, but we significantly modified and extended it for the real-scale RxT-Beta to handle larger models, more complex datasets, Mixture-of-Memory, hybrid reasoning and agentic capabilities. We also replaced the final Reinforcement Learning stages (Memory Reinforcement Learning and Reactive RLHF) with our new Direct Memory and Preference Optimization (DMPO) algorithm, which extends DPO-like methods for memory-aware training.

Training stages & intermediate models

  1. Joint LM Pre-Training with "cheated context" teacher forcing on large text corpora (~250B tokens)
  2. Hybrid Reasoning Interaction SFT (with "cheated context" teacher forcing algorithm) on independent interactions (~20B tokens)
  3. Self-Supervised Memory Attention Pre-Training on multi-turn conversations (quick epoch on ~20k conversations)
  4. Supervised Memory Aware Training (SMAT) on multi-turn conversations
    • Short-Term Memory (STM) stages:
      • hybrid instruct/reasoning (200-300k conversations)
      • retrieval (50k conversations)
    • Mixture-of-Memory (MoM) adaptation:
      • long range hybrid instruct/reasoning (100k conversations)
      • long range retrieval (20k conversations)
      • cross-session memory training (?)
  5. Direct Memory and Preference Optimization (DMPO) on multi-turn preference pairs with correct/incorrect memory usage
    • inter-step short-range optimization
    • long-range optimization

Joint LM Pre-Training with "cheated context" teacher forcing

The Joint LM Pre-Training stage co-trains the Generator-Decoder and Memory Encoder to establish a shared semantic foundation, learn fundamental language representations, and align their vector spaces. Unlike RxT-Alpha, RxT-Beta pre-training introduces progressive curriculum learning with multiple rounds of increasing sequence length, noise level, and masking probability.

Training Algorithm

The training algorithm proceeds as follows for each input sequence $S$ (a condensed sketch of the perturbation and joint loss follows the list):

  1. Encoder Processing: The input sequence is randomly masked (15% tokens) to create $S_{mask}$. The Memory Encoder processes $S_{mask}$ and its final hidden states are passed to a dedicated Masked Language Modeling (MLM) head to compute $\mathcal{L}_{MLM}$.

  2. Encoder Output Processing: The hidden states from each layer of the encoder, $ED = \{ed_1, ed_2, ..., ed_L\}$, are detached from the computation graph. This crucial step prevents gradients from the decoder from flowing back into the encoder. Two perturbations are then applied:

    • Random Position Masking: A subset of positions are masked (set to zero) before noise addition. This forces the decoder to internalize knowledge rather than relying purely on the encoder's context.
    • Additive Gaussian Noise: Random noise is added: $ED' = ED + \epsilon \cdot \sigma$, where $\sigma$ is the noise level and $\epsilon \sim \mathcal{N}(0, I)$. Note that noise is additive, not interpolated: masked positions contain only noise.
  3. Decoder Processing: The noisy, masked encoder states $ED'$ serve as Key and Value inputs for the decoder's Memory Cross-Attention layers. The decoder processes the original, unmasked sequence $S$ autoregressively and computes $\mathcal{L}_{AR}$.

  4. Joint Loss: $\mathcal{L}_{Joint} = \alpha \mathcal{L}_{AR} + \beta \mathcal{L}_{MLM}$
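The following minimal sketch, written under the assumptions above (detached per-layer encoder states, zeroed masked positions, additive Gaussian noise, weighted joint loss), illustrates the perturbation step; it is not the actual RxLM training code.

```python
# "Cheated context" perturbation of detached encoder layer outputs before they
# feed the decoder's memory cross-attention, plus the weighted joint loss.
import torch

def perturb_encoder_states(encoder_states, noise_level=0.5, mask_prob=0.2):
    # encoder_states: (layers, batch, seq, dim), assumed already detached
    states = encoder_states.detach().clone()
    mask = torch.rand(states.shape[:-1], device=states.device) < mask_prob
    states[mask] = 0.0                                            # random position masking
    states = states + noise_level * torch.randn_like(states)     # additive Gaussian noise
    return states

def joint_loss(ar_loss, mlm_loss, alpha=1.0, beta=1.0):
    # Loss weights are assumptions; the paper only defines the weighted sum.
    return alpha * ar_loss + beta * mlm_loss
```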

Progressive Curriculum

RxT-Beta introduces a three-round progressive curriculum with increasing difficulty:

| Round | Sequence Length | Tokens | Noise Level | Masking Probability |
|-------|-----------------|--------|-------------|---------------------|
| R1 | 1,024 | ~100B | 0.5 β†’ 0.75 | 0.2 β†’ 0.4 |
| R2 | 2,048 | ~100B | 0.7 β†’ 0.85 | 0.4 β†’ 0.7 |
| R3 | 4,096 | ~50B | 0.8 β†’ 0.95 | 0.7 β†’ 0.8 |

Noise levels and masking probabilities increase linearly within each round, making the decoder's task progressively harder. This progressive difficulty curriculum forces the model to deeply internalize knowledge rather than relying on the encoder's "cheated" context.

Key Differences from Standard LLM Training

The "cheated context" teacher forcing approach provides fundamentally different training dynamics compared to standard autoregressive LLM training:

  1. Super-Convergence Effect: Unlike standard decoder-only training where loss decreases gradually over the entire training process, RxT decoder loss drops to very low levels (prediction accuracy >90%) within the first ~1% of training. The remaining 99% of training time is spent learning to correctly predict the hardest ~10% of tokensβ€”the ones requiring true knowledge internalization rather than contextual inference.

  2. Multiple Concurrent Objectives: The decoder simultaneously learns through multiple connected objectives:

    • Standard autoregressive generation
    • Context-conditioned generation (from encoder outputs)
    • Masked language modeling reconstruction (through cross-attention to masked positions)
    • Denoising (extracting signal from noisy encoder representations)

    This multi-objective training creates richer gradients and more robust representations than single-objective autoregressive training.

  3. Full-Sequence Information Flow: In standard autoregressive training, each position only receives information from previous positions. In RxT joint training, the decoder's cross-attention layers receive bidirectional information from the encoder's full sequence processing. This enables feed-forward layers to receive and process complete contextual information, potentially leading to more effective knowledge storage.

  4. Early High-Quality Data Efficiency: In standard LLM pre-training, the highest-quality data is typically reserved for the final 10-30% of training when the model has reached sufficient maturity to effectively absorb complex information. RxT's super-convergence effect means the model reaches this "high-efficiency absorption" state almost immediately, enabling effective learning from high-quality data from the start.

  5. MoE Load Balancing Efficiency: The router load balancing loss drops to near-optimal levels (~1.1) extremely quickly, ensuring all experts learn uniformly throughout the entire training process. This balanced expert utilization may be as effective as dense model training while maintaining MoE efficiency.

Encoder Training

The encoder continues to train with standard MLM loss throughout all rounds. Its layer outputs are detached before masking/noise addition and passed to decoder cross-attention. This ensures the encoder develops strong bidirectional representations while the decoder learns to extract useful information from increasingly degraded encoder signals.


Hybrid Reasoning Interaction SFT

The Interaction SFT stage adapts the jointly pre-trained encoder and decoder to the specific format of conversational interactions, including the new Interaction Template with hybrid reasoning and agentic tool capabilities.

Training Algorithm

The training algorithm follows the same "cheated context" teacher forcing approach as Joint Pre-Training, but with key differences:

  1. Data Format: Training data consists of structured interactions following the Interaction Template format, rather than raw text sequences. Each interaction may include query, thinking, answer, and tool call components.

  2. Loss Masking: Loss is computed only on model-generated tokens (thinking [T], answer [A], tool calls [C]). User-provided tokens (query [Q], tool results [U], internal instructions [I]) are masked from loss computation but still processed by the model.

  3. Extended Sequence Length: Maximum sequence/interaction length increases to 8,192 tokens to accommodate complex reasoning chains and tool interactions.

  4. Progressive Noise and Masking:

    • Noise: 0.7 β†’ 1.0 (approaching full noise at the end)
    • Masking: 0.7 β†’ 0.95 (approaching full masking at the end)

    These aggressive schedules (adjusted based on pre-training results) force the model to rely minimally on encoder context, preparing it for memory-dependent operation where "perfect" context is unavailable.

Interaction Template Structure

The SFT stage trains the model on the full RxT-Beta Interaction Template:

Fast Answer Mode:

[BOS][I]internal[Q]query[A]answer[C]tool_call[EOS]

Extended Thinking Mode:

[BOS][I]internal[Q]query[T]thinking[A]answer[C]tool_call[EOS]

Tool Usage Mode:

[BOS][I]internal[U]tool_result[T]thinking[A]answer[C]tool_call[EOS]

The model learns to:

  • Generate appropriate thinking chains when prompted with [T]
  • Provide direct answers when prompted with [A]
  • Produce well-formed tool call JSON when appropriate
  • Process tool results and provide summaries
  • Follow internal instructions for per-interaction behavior control

Hybrid Reasoning Control

The reasoning mode is controlled by the token at the end of the user's input:

  • [Q]query[A] β†’ Forces fast answer without reasoning
  • [Q]query[T] β†’ Activates extended thinking mode

The model learns this conditional behavior during SFT, enabling users to control reasoning depth at inference time.


Self-Supervised Memory Attention Pre-Training

The Memory Attention network requires pre-training before the full memory-dependent training can begin. The central challenge is that the target outputβ€”the updated memory state $STM_t$β€”is a high-dimensional tensor for which no human-generated labels exist.

The Cold Start Problem

Without pre-training, a randomly initialized Memory Attention network outputs vectors that are effectively noise. If this noisy output were fed to the decoder in subsequent training, it would act as a powerful distractor, corrupting the learning signal. The decoder would likely learn to ignore its memory cross-attention layers entirely, defeating the architecture's purpose. This "cold start" problem must be solved before memory-dependent training can succeed.

Self-Supervised Proxy Task

We employ a self-supervised proxy task to train the Memory Attention network to produce semantically coherent memory updates. The objective is to learn a plausible combination of the previous memory state and new information from the current interaction.

Algorithm (a condensed sketch follows the list):

  1. Initialize $STM_0$ with random noise (e.g., from a normal distribution)

  2. For each interaction $I_t = (X_t, Y_t)$ in a sequence:

    • Encode the interaction: $ED_t = \text{Encoder}(\text{concat}(X_t, Y_t))$
    • Generate pseudo-label via weighted average: $$STM_{target} = (1 - w_t) \cdot STM_{t-1} + w_t \cdot ED_t$$
    • Compute actual memory update: $STM_t = \text{MemAttn}(STM_{t-1}, ED_t)$
    • Optimize cosine similarity: $\mathcal{L}_{Mem} = -\text{cosine\_similarity}(STM_t, STM_{target})$
  3. Anneal $w_t$ across the sequence:

    • First interaction: $w_t$ is high (~0.9) to prioritize incorporating new information
    • Later interactions: $w_t$ decreases progressively, encouraging retention and integration
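A condensed sketch of this proxy objective, assuming the encoded interaction has already been pooled or projected to the same slot layout as the memory state; module names are placeholders for the RxLM components.

```python
# Self-supervised proxy loss for memory attention pre-training: the pseudo-label
# is a weighted average of the previous memory and the encoded interaction,
# and the loss is negative cosine similarity.
import torch
import torch.nn.functional as F

def memory_pretrain_loss(stm_prev, encoded, mem_attn, weight_new):
    # stm_prev, encoded: (batch, slots, dim); weight_new anneals from ~0.9 downward.
    target = (1.0 - weight_new) * stm_prev + weight_new * encoded
    stm_next = mem_attn(stm_prev, encoded)
    loss = -F.cosine_similarity(stm_next, target, dim=-1).mean()
    return loss, stm_next.detach()
```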

RxT-Beta Specifics

For RxT-Beta, this stage:

  • Uses conversations with hybrid reasoning and agentic tools following the Interaction Template
  • Trains on a subset of ~20,000 conversations with varying numbers of interaction steps
  • Applies standard self-supervised label weighting schedules
  • Completes relatively quickly as it only pre-conditions the memory network

The goal is not to create a perfect memory system, but to pre-condition the network to produce outputs that are semantically coherent and reside within the same vector space as other components, enabling subsequent training stages to succeed.


Supervised Memory Aware Training (SMAT)

The SMAT stage unifies all pre-trained components to train the model on its intended event-driven operational cycle. This is the first point at which the decoder learns to rely on a meaningful, accumulated memory state from genuinely past interactions, rather than the "cheated" context from joint training stages.

Core Algorithm

The algorithm uses a curriculum of multi-step dialogues $\{I_1, I_2, ..., I_N\}$ (a condensed sketch of the loop follows the notes below):

  1. Initialization: The memory state $STM_0$ is initialized with random noise. The model processes the first interaction $I_1 = (X_1, Y_1)$ with the decoder conditioned on $STM_0$. This explicitly trains the model to handle conversation beginnings from a blank state.

  2. Sequential Processing: For each step $t$ from 1 to $N$:

    • Process query $X_t$ conditioned on $STM_{t-1}$ and generate response $Y_t$
    • Compute autoregressive loss $\mathcal{L}_t$
    • Backward pass immediately to compute and accumulate gradients
    • Encode the completed interaction: $ED_t = \text{Encoder}(\text{concat}(X_t, Y_t))$
    • Update memory: $STM_t = \text{MemAttn}(STM_{t-1}, ED_t)$
  3. Optimizer Step:

    • After all $N$ interactions, perform a single optimizer step
    • Gradient accumulation with accumulation steps equal to conversation length ($N$)
    • This ensures consistent optimization across the full dialogue while keeping memory usage bounded

Important: Backward passes occur after each interaction step, not at the end of the entire conversation. Only the optimizer step is deferred until all interactions are processed. This is critical for memory efficiencyβ€”maintaining the computation graph across all steps would cause GPU out-of-memory errors.
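A condensed sketch of this loop under the constraints just described (per-step backward passes, a detached previous memory state to bound graph depth, and a single deferred optimizer step per conversation). `decoder`, `encoder`, `mem_attn` and `ar_loss_fn` are placeholders for the RxLM modules and loss; the real schedule (progressive unfreezing, loss masking, length grouping) is omitted.

```python
# SMAT inner loop sketch: backward after each interaction, one optimizer step
# per conversation. At most one memory-update step stays in the graph, so the
# next step's loss can still reach memory attention and the encoder.
import torch

def smat_conversation(decoder, encoder, mem_attn, optimizer, conversation, stm_init, ar_loss_fn):
    stm = stm_init                                   # random-noise STM_0, no gradient history
    optimizer.zero_grad()
    n_steps = len(conversation)
    for query_ids, answer_ids, loss_mask in conversation:
        # Decoder reads the previous memory state via cross-attention.
        logits = decoder(query_ids, answer_ids, memory=stm)
        loss = ar_loss_fn(logits, answer_ids, loss_mask) / n_steps
        loss.backward()                              # per-step backward, deferred optimizer step
        # Asynchronous memory update: encode the finished interaction and write
        # it into memory for the next step; detach the old state to bound depth.
        encoded = encoder(torch.cat([query_ids, answer_ids], dim=1))
        stm = mem_attn(stm.detach(), encoded)
    optimizer.step()                                 # single update per conversation
```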

RxT-Beta Two-Stage SMAT

Unlike RxT-Alpha (which used fixed 7-step conversations), RxT-Beta SMAT is divided into two major stages, each with sub-stages:

Stage A: Short-Term Memory (STM) Training

  1. Hybrid Instruct/Reasoning Stage (200-300k conversations):

    • Trains the model to maintain conversational thread coherence across multiple interactions
    • Includes both fast answer and extended thinking interactions
    • Variable conversation lengths with batches grouped by length
    • Focuses on context maintenance and response quality
  2. Retrieval Stage (50k conversations):

    • Trains explicit information retrieval from previous steps
    • "Conversational needle in a haystack" tasks
    • User queries explicitly reference information from earlier interactions
    • Develops the model's ability to locate and extract specific stored information

Stage B: Mixture-of-Memory (MoM) Adaptation

Before MoM adaptation, a mini training stage trains a lightweight binary classifier (1-2 layer MLP) to detect generic/continuation queries (e.g., "Tell me more", "What do you mean?"). This classifier enables bypassing fragment routing for continuation queries. The routing mechanism itself is non-parametric (cosine similarity-based), requiring only memory attention/cross-attention adaptation for MoM.

  1. Long-Range Hybrid Instruct/Reasoning Stage (100k conversations):

    • Very long conversations with multiple topic threads
    • Tests memory fragment routing across topic switches
    • Validates working memory maintains immediate context while dynamic fragments handle topic-specific information
  2. Long-Range Retrieval Stage (20k conversations):

    • Information retrieval across 50+ interaction steps
    • Tests the ability to access information stored in distant memory fragments
    • Validates MoM's infinite context capabilities
  3. Cross-Session Memory Stage (TBD):

    • Training on multiple simultaneous conversations
    • Can be implemented by sharing memory across batch positions
    • Validates shared fragment pools and cross-conversation memory

Key Training Details

  • Variable Conversation Lengths: Unlike the fixed 7-step curriculum in RxT-Alpha, RxT-Beta uses variable-length conversations. Batches are constructed by grouping conversations of equal length.

  • Gradient Accumulation: Accumulation steps match conversation length, ensuring no parameter updates occur mid-conversation. This prevents the memory system from receiving inconsistent gradients.

  • Advanced Progressive Unfreezing: SMAT uses a sophisticated unfreezing schedule designed to ensure the model learns to properly utilize memory, rather than simply improving feed-forward and self-attention layers to fit the training sequences:

    • Early epochs: Only memory attention and memory cross-attention layers are trainable. All other layers (self-attention, feed-forward/MoE, encoder self-attention) remain frozen. This forces the model to learn memory-dependent behavior.

    • Middle epochs: Gradually unfreeze encoder layers to improve memory encoding quality.

    • Final epochs: Unfreeze the full model for end-to-end optimization.

    This staged approach ensures that improvements in response quality genuinely result from better memory utilization, not from the model simply memorizing patterns in the training sequences through its feed-forward layers.


Direct Memory and Preference Optimization (DMPO)

DMPO is a novel training algorithm that replaces the originally planned Memory Reinforcement Learning (MRL) stage. MRL proved to be slow and unstable in practice due to the complexity of RL-based optimization in the memory-dependent setting. DMPO combines the stability of Direct Preference Optimization (DPO) with memory-aware training, providing a more effective and efficient approach.

Motivation

Standard DPO optimizes the policy to prefer accepted over rejected responses using the loss:

$$\mathcal{L}_{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta \left(\log \frac{\pi(y_w|x)}{\pi_{ref}(y_w|x)} - \log \frac{\pi(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right)\right]$$

where $y_w$ is the accepted response, $y_l$ is the rejected response, $\pi$ is the current policy, $\pi_{ref}$ is a frozen reference policy, and $\beta$ controls preference strength.

However, standard DPO doesn't account for RxT's memory system. DMPO extends DPO to:

  1. Condition preferences on accumulated memory states
  2. Update memory only based on accepted interactions
  3. Propagate preference learning through the memory update chain

DMPO Algorithm

For a conversation with preference pairs at each step (a condensed sketch follows the list):

  1. Initialize: Memory state $STM_0$ with random noise; create frozen reference decoder $\pi_{ref}$

  2. For each interaction step $t$:

    • Get query $Q_t$, accepted answer $A_w$, rejected answer $A_l$
    • Clone memory state for reference: $STM_{ref} = STM_{t-1}.\text{detach}()$
    • Compute policy log probabilities conditioned on current memory:
      • $\log p_w = \sum \log \pi(A_w | Q_t, STM_{t-1})$
      • $\log p_l = \sum \log \pi(A_l | Q_t, STM_{t-1})$
    • Compute reference log probabilities (frozen reference decoder):
      • $\log p_{w,ref} = \sum \log \pi_{ref}(A_w | Q_t, STM_{ref})$
      • $\log p_{l,ref} = \sum \log \pi_{ref}(A_l | Q_t, STM_{ref})$
    • Compute DPO loss: $\mathcal{L}_t = -\log \sigma(\beta \cdot ((\log p_w - \log p_l) - (\log p_{w,ref} - \log p_{l,ref})))$
    • Backward pass immediately to accumulate gradients
    • Update memory with ACCEPTED interaction only:
      • $ED_t = \text{Encoder}(\text{concat}(Q_t, A_w))$
      • $STM_t = \text{MemAttn}(STM_{t-1}, ED_t)$
  3. Optimizer step after all interactions complete (gradient accumulation = conversation length)
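A sketch of one DMPO step following the algorithm above: the DPO-style loss is conditioned on the current memory state, and memory is advanced with the accepted interaction only. `policy.answer_logprob` and the module names are placeholders, not the RxLM API.

```python
# One short-range DMPO step: memory-conditioned DPO loss, per-step backward,
# accepted-only (and detached) memory update.
import torch
import torch.nn.functional as F

def dmpo_step(policy, reference, encoder, mem_attn, stm_prev,
              query, accepted, rejected, beta=0.1):
    stm_ref = stm_prev.detach()
    # Sequence log-probabilities of the answers given the query and the current
    # memory state (answer tokens only; query tokens are masked).
    log_pw = policy.answer_logprob(query, accepted, memory=stm_prev)
    log_pl = policy.answer_logprob(query, rejected, memory=stm_prev)
    with torch.no_grad():
        log_pw_ref = reference.answer_logprob(query, accepted, memory=stm_ref)
        log_pl_ref = reference.answer_logprob(query, rejected, memory=stm_ref)
    margin = (log_pw - log_pl) - (log_pw_ref - log_pl_ref)
    loss = -F.logsigmoid(beta * margin).mean()
    loss.backward()                                   # gradients accumulated per step
    # Memory is updated with the ACCEPTED interaction only (short-range: detached).
    with torch.no_grad():
        encoded = encoder(torch.cat([query, accepted], dim=1))
        stm_next = mem_attn(stm_prev, encoded).detach()
    return loss.detach(), stm_next
```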

Key Design Decisions

  1. Accepted-Only Memory Updates: Memory is updated only with accepted interactions. This ensures the memory system learns to store high-quality information, not noise from rejected responses.

  2. Memory-Conditioned Preferences: Both policy and reference models are conditioned on the same memory state ($STM_{t-1}$), ensuring fair comparison of responses in the memory-dependent context.

  3. Short-Range vs Long-Range Gradient Flow:

    • Short-range (default): Memory updates are detached from the computation graph. Each step optimizes independently.
    • Long-range: Gradient checkpointing enables gradients to flow through memory updates across multiple steps: $$\frac{\partial \mathcal{L}_{t+k}}{\partial \theta_{enc}} = \frac{\partial \mathcal{L}_{t+k}}{\partial \text{logits}} \cdot \frac{\partial \text{logits}}{\partial STM_{t+k-1}} \cdot ... \cdot \frac{\partial STM_t}{\partial ED_t} \cdot \frac{\partial ED_t}{\partial \theta_{enc}}$$ This enables the encoder and memory attention to learn what information should be stored for better responses many steps later.
  4. Query Token Masking: Only answer tokens contribute to log probability computation; query tokens are masked.

Two-Stage DMPO Training

Stage 1: Short-Range Optimization

  • Standard per-step backward passes as described above
  • Each step's backward pass is independent (memory update uses .detach())
  • Optimizes immediate memory usage and response quality
  • Memory-efficient: no multi-step computation graphs

Stage 2: Long-Range Optimization

  • Maintains gradient graph across 2-10 interaction steps (particularly around topic changes)
  • Memory update at step $t$ influences response quality at step $t+k$ where $k \in [2, 10]$
  • Requires gradient checkpointing to manage GPU memory:
    • Intermediate activations are recomputed during backward pass rather than stored
    • Trades compute for memory to enable longer gradient chains
  • With MoM routing, this should generalize to even longer conversations without requiring full-conversation gradient graphs
  • Teaches long-range memory planning: which information to store now for use many steps later

Anchored DMPO (APO-style) [Future Work]

For enhanced stability, DMPO can be extended with a neutral reference point following the Anchored Preference Optimization (APO) approach:

  • Accepted: Strongly preferred response ($y_w$)
  • Neutral: Reference baseline ($y_n$) - neither good nor bad
  • Rejected: Dispreferred response ($y_l$)

The anchored loss adds regularization: $$\mathcal{L}_{APO} = \mathcal{L}_{DPO} + \lambda \cdot \mathbb{E}\left[\left|\log \frac{\pi(y_n|x)}{\pi_{ref}(y_n|x)}\right|\right]$$

This prevents reward hacking by maintaining reasonable behavior even under strong preference optimization. The neutral anchor provides a stable reference point, preventing the model from drifting too far from reasonable responses.

RxT-Beta will initially use standard DMPO without anchoring, with anchored DMPO reserved for future optimization if needed.

Why DMPO Over Reinforcement Learning?

| Aspect | Memory RL (MRL) | DMPO |
|--------|-----------------|------|
| Stability | Prone to instability, reward collapse | Stable optimization |
| Efficiency | Multiple rollouts per update | Single forward-backward pass |
| Implementation | Complex reward modeling, policy gradients | Simple loss computation |
| Memory Integration | Requires careful reward shaping for memory | Natural integration via accepted-only updates |
| Convergence | Slow, unpredictable | Fast, predictable |

DMPO provides all the benefits of preference learning while being specifically designed for RxT's memory architecture, making it the optimal choice for the final training stage.


Training Summary

The complete RxT-Beta training pipeline systematically builds capabilities through:

  1. Joint Pre-Training β†’ Shared representations, super-convergence, knowledge internalization
  2. Interaction SFT β†’ Conversational format, hybrid reasoning, agentic capabilities
  3. Memory Attention Pre-Training β†’ Cold start solution, semantic coherence
  4. SMAT β†’ Full memory-dependent operation, STM and MoM adaptation
  5. DMPO β†’ Memory-aware preference optimization, response quality refinement

This structured curriculum addresses each potential failure mode before proceeding to more complex stages, ensuring stable training of a fully functional stateful conversational model.
