Augmenting GPT-4 with Calculational Code

GPT-4 and other LLMs (Large Language Models) are driving a tidal wave of innovation in applied AI. However, used without augmentation, they have very limited calculational capabilities and make mistakes when calculating with numbers. In this project, we describe a simple, general technique to address this, apply it to some widely reported real-world failures of GPT-4-based AI, evaluate it, and discuss related issues.

These models also make mistakes when writing text that involves comparative reasoning about numbers. For example, GPT-4 quite happily writes sentences like this (emphasis added):

User

Give a numerical comparison of these two reports, deriving new interesting numbers as part of the comparison.

GPT-4

Both companies reported an increase in net sales compared to the same quarter last year, but Gap Inc. had a **much larger absolute and relative increase** ($4.04 billion, up 2%) than lululemon ($1.9 billion, up 28%).

Unfortunately, 2 is not greater than 28. The text as written is half-right, but it contains a serious mistake that makes it nonsensical.
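The error is mechanically checkable. As a minimal illustration, using only the figures quoted above, two lines of Python settle the comparison the model got wrong:

```python
# Relative increases as reported: Gap Inc. up 2%, lululemon up 28%.
gap_pct, lululemon_pct = 2, 28
print(gap_pct > lululemon_pct)  # False: Gap's relative increase is not larger
```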

Augmenting with Calculational Code

The hope lies in this pair of observations:

  1. LLMs are weak at numeric calculation, but good at writing numeric calculation code.
  2. Python and many other tools are perfect at evaluating numeric calculation code.

The answer is obvious: get GPT-4 to write the numeric calculation code relevant to the question, get Python or some other tool to evaluate it, and use GPT-4 for the rest. Our aim is to “equip” or “augment” GPT-4 with a numeric calculator. In practice the approach is simple and can be used to augment any LLM invocation. Further, the calculation code and results can be shown to the user as supplementary output, as well as separately checked, audited, explained, and kept for record keeping.

Without equipping, GPT-x models are often used to implement this function:

  1. LLM: Question –> Answer

With numeric calculation equipping:

  1. LLM: Question –> Numeric calculation code
  2. Evaluator: Numeric calculation code –> Numeric calculation answers
  3. LLM: Question + Numeric calculation code + Numeric calculation answers –> Answer

The overall function computed is still Question –> Answer.
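As a concrete illustration, here is a minimal sketch of this three-step pipeline in Python. The `complete` function is a stand-in for whatever LLM completion call you use, and the prompt wording, the code extraction, and the use of `exec` as the evaluator are all assumptions of the sketch, not a fixed implementation:

```python
import re

def complete(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its completion.
    (Assumed helper; wire this to whatever model API you use.)"""
    raise NotImplementedError

def extract_code(completion: str) -> str:
    """Pull the first fenced Python block out of a completion, if any."""
    match = re.search(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    return match.group(1) if match else completion

def answer(question: str) -> str:
    # Step 1. LLM: Question -> Numeric calculation code.
    code = extract_code(complete(
        "Write Python code that computes every number needed to answer "
        "the question below, printing each result with a label.\n\n"
        + question))

    # Step 2. Evaluator: the code runs in Python, so the arithmetic is
    # exact. (A real system would sandbox this `exec` call.)
    results: list[str] = []
    namespace = {"print": lambda *args, **_: results.append(" ".join(map(str, args)))}
    exec(code, namespace)

    # Step 3. LLM: Question + code + results -> Answer.
    return complete(
        f"Question: {question}\n\n"
        f"Calculation code:\n{code}\n\n"
        f"Calculation results:\n" + "\n".join(results) + "\n\n"
        "Answer the question using only the results above.")
```

Because the calculation code and its printed results pass through the pipeline as plain text, they can be logged alongside the final answer, which is what makes the supplementary checking and auditing described above possible.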

We developed some indicative problem sets, partly in order to explore the boundary of what this technique can handle. The results indicate dramatic improvements when using numeric calculation equipping, with the exception of date calculations, where the improvements were only marginal. This is discussed further in the evaluation.

Report

A GitHub Next technical report with the evaluation and implementation is available.
