NTT unveils decoding framework for clearer AI reasoning

Fri, 5th Jun 2026

NTT has introduced an AI inference framework called Rationale-Enhanced Decoding, designed to make reasoning in large vision-language models more explainable.

The framework addresses a weakness in chain-of-thought reasoning used by multimodal AI systems that process images and text. These models often generate intermediate rationales before producing an answer, but those rationales are not always reflected in the final output.

That gap has complicated efforts to make AI systems easier to interpret in sensitive applications. If a model reaches the same answer even when its stated reasoning is altered or irrelevant, the rationale cannot be treated as a reliable explanation of how the answer was reached.

NTT's approach changes the inference process by separating image-based inference from rationale-based inference, then combining them through weighted decoding. This is intended to ground the final response in both the visual input and the text rationale, rather than relying on a single combined sequence.

The method is designed to work without additional training. It can be applied at inference time and does not require new datasets or retraining, which are often costly in both data preparation and computing resources.

The problem

Large vision-language models combine language models with image encoders so they can work across visual and textual inputs. That has made them useful for tasks such as document understanding and video analysis, where text-only systems have limits.

Chain-of-thought reasoning has been adopted in these models much as it has in text-based language models. The technique asks the model to produce an intermediate rationale before generating a final answer, with the expectation that the rationale will improve performance and provide a clearer path for human inspection.

NTT said its research found that existing multimodal models can ignore their own rationales when generating final answers. In one example, a rationale about an unrelated sports car was paired with an image of a presentation slide, yet the model gave the same answer as it did with the correct rationale.

That suggests the final output was driven by the image alone rather than by the reasoning text. According to NTT, this shows that conventional multimodal chain-of-thought methods lack a causal structure that ensures rationales are actually used to produce answers.
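
The sports-car example describes a simple probe: replace the rationale with an unrelated one and see whether the answer changes. The sketch below illustrates that kind of check and is not NTT's evaluation code. The `answer_fn` callable, the `(image, question, rationale)` tuple layout, and the function name `rationale_swap_rate` are hypothetical stand-ins for whatever model interface is actually in use.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical model interface: (image, question, rationale) -> answer text.
AnswerFn = Callable[[str, str, str], str]


def rationale_swap_rate(
    answer_fn: AnswerFn,
    examples: Iterable[Tuple[str, str, str]],
    distractor: str,
) -> float:
    """Return the fraction of examples whose answer changes when the
    original rationale is replaced by an unrelated distractor rationale.

    A rate near 0.0 suggests the answers do not depend on the rationale.
    """
    items = list(examples)
    if not items:
        raise ValueError("at least one example is required")

    changed = 0
    for image, question, rationale in items:
        original = answer_fn(image, question, rationale)
        swapped = answer_fn(image, question, distractor)
        if original != swapped:
            changed += 1
    return changed / len(items)
```

A model that returns the same answers under both conditions scores 0.0 on this probe, which is the behaviour the article describes for conventional multimodal chain-of-thought.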

How it works

Rationale-Enhanced Decoding was developed as a decoding technique that can be inserted into the inference stage of existing models. By handling the image and the rationale independently, the framework is designed to make the influence of each source clearer in the final output.
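
The article does not state the exact combination rule, so the following is only a minimal sketch of one plausible reading of "weighted decoding". It assumes the model is run twice per step, once conditioned on the image and once on the rationale, and that the two next-token distributions are merged as a weighted sum of log-probabilities. The function names and the `weight` parameter are illustrative, not NTT's API.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def combine_next_token(
    image_logits: Sequence[float],
    rationale_logits: Sequence[float],
    weight: float = 1.0,
) -> List[float]:
    """Merge two next-token distributions over the same vocabulary.

    Assumed rule: log p_image + weight * log p_rationale, renormalized.
    weight = 0.0 falls back to the image-conditioned distribution alone.
    """
    if len(image_logits) != len(rationale_logits):
        raise ValueError("both logit vectors must cover the same vocabulary")
    image_lp = log_softmax(image_logits)
    rationale_lp = log_softmax(rationale_logits)
    scores = [a + weight * b for a, b in zip(image_lp, rationale_lp)]
    return log_softmax(scores)


def greedy_pick(log_probs: Sequence[float]) -> int:
    """Index of the most likely token under the combined distribution."""
    return max(range(len(log_probs)), key=lambda i: log_probs[i])
```

Under this assumed rule, a token has to be plausible given both the image and the rationale to score well, so an irrelevant rationale would shift the answer instead of being silently ignored.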

NTT said experiments showed improvements in reasoning performance, including answer accuracy, across a range of large vision-language models. Results improved further when the models were supplied with higher-quality rationales, including ones generated by OpenAI's GPT-4.

These findings point to a broader issue in explainable AI research. Many systems can generate plausible-sounding reasoning after the fact, but the harder task is showing that the reasoning materially affects the answer. NTT's work focuses on that narrower question: whether the rationale actually drives the output, rather than simply being readable to a human user.

Potential uses

NTT said the framework could support applications where reliability and interpretability matter more than raw output alone. It cited AI agent collaboration, medical image analysis, and decision-support conversational agents as examples.

In medical settings, for instance, the ability to inspect whether a model's conclusion reflects both the source image and its stated rationale could help address concerns about opaque decision-making. Similar issues arise in systems that support human decisions, where users may want stronger evidence that the explanation aligns with the model's actual process.

NTT linked the work to its broader effort to improve AI reliability. The research contributes to its vision of "AI Constellation," in which large numbers of AI systems work together.

NTT is a global technology company with operations in more than 70 countries and regions. It said it generates more than US$90 billion in revenue, employs about 340,000 people, and allocates 30% of its annual profits to research and development.

According to the company, experimental results showed that large vision-language models using the method can "faithfully interpret and utilize the content of rationales."
