An Interesting Error from LLMs in Tax Research That Does Not Seem to Be a Hallucination

I was experimenting with three LLMs on a tax research issue recently. I asked them to identify deductions newly available to individuals without a business or rental activity that can be claimed in computing adjusted gross income (AGI).

ChatGPT, Gemini, and BlueJ (a paid service for tax professionals) all identified the following:

  • No tax on tips
  • No tax on overtime
  • No tax on interest on car loans

As I’ll show, this answer is wrong. But first, consider this: they all arrived at the same incorrect answer. While we know that LLMs can hallucinate, this isn’t likely a hallucination (an answer created from thin air). If all three models were hallucinating, they would be highly unlikely to invent the exact same error.

The key to the error lies in the distinction between IRC §62 and §63. If you’ve taken my course on the OBBBA, you’ll recall I noted that these specific deductions are routed through §63 (computation of taxable income), not §62 (computation of adjusted gross income). The most well-known §63 deduction is the QBI deduction under §199A: it reduces taxable income on Form 1040 but does not reduce AGI. Because it doesn’t reduce federal AGI, it also doesn’t reduce Arizona taxable income, a crucial point for us here in Arizona.
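
To make the distinction concrete, here is a minimal, simplified sketch using hypothetical numbers (the standard deduction figure is a placeholder, not the actual indexed amount). A deduction routed through §62 lowers AGI itself; one routed through §63 lowers only taxable income, even though the bottom-line taxable income can come out the same.

```python
# Illustrative only: hypothetical figures showing why §62 vs. §63 placement matters.
gross_income = 80_000
new_deduction = 4_000        # hypothetical amount of one of the new deductions
standard_deduction = 15_000  # placeholder, not the actual indexed figure

# If the deduction were in §62 ("above the line"), which it is not:
agi_if_above_line = gross_income - new_deduction                # 76,000
taxable_if_above_line = agi_if_above_line - standard_deduction  # 61,000

# As actually enacted, the deduction sits in §63, so AGI is unchanged:
agi_as_enacted = gross_income                                             # 80,000
taxable_as_enacted = agi_as_enacted - standard_deduction - new_deduction  # 61,000

print(agi_if_above_line, taxable_if_above_line)  # 76000 61000
print(agi_as_enacted, taxable_as_enacted)        # 80000 61000
```

In this simplified picture taxable income comes out the same either way; the difference is in AGI, which is what drives the problems discussed below.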

So, how did all three AIs get the same wrong answer? It’s simple: a large number of human authors, whose material was used in the models’ training data or accessed via web searches, made the same mistake. Because these new deductions were touted as being available even to non-itemizers, many authors assumed they were “above-the-line” deductions and described them as such. However, a review of the statute makes it clear: the OBBBA added these provisions to §63 but made no change to §62.

It’s true that the LLMs didn’t consult the text of the law; they don’t perform legal analysis. Rather, they synthesize the analysis that others have prepared and published, giving extra weight to “high-quality” sources. In this case, a large percentage of human analysts made the same mistake.

There are understandable reasons for this. Bill proponents repeatedly noted the deductions were available even if a taxpayer did not itemize. Many general-purpose financial publications quickly released articles stating these deductions were “above the line.” They presumably equated being able to take a deduction without itemizing with it being an “above-the-line” deduction, an association that was generally, but not always, true before the TCJA and OBBBA.

From personal experience, I know producing an analysis of a new tax bill is done under extreme pressure. Reading federal legislation is messy; the bill only shows the amendments, not the law in its final, consolidated form. It’s easy to miss the significance of a deduction being added to §63 instead of §62.

This distinction is critical because AGI impacts numerous calculations. For states like Arizona that start with federal AGI, these deductions won’t reduce state taxable income. Furthermore, AGI affects tax thresholds (like phase-outs and deduction floors) and even non-tax items like the IRMAA calculation for Medicare premiums.
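
As one concrete example of an AGI-keyed calculation, the sketch below uses the 7.5%-of-AGI floor on the medical expense itemized deduction under §213; the dollar figures are hypothetical. A deduction that reduced AGI would also lower that floor, while the new §63 deductions leave it untouched.

```python
# Illustrative only: the 7.5%-of-AGI medical expense floor as one example
# of a calculation keyed to AGI. Dollar amounts are hypothetical.

gross_income = 80_000
new_deduction = 4_000
medical_expenses = 9_000

def deductible_medical(agi: float) -> float:
    """Medical expenses count only to the extent they exceed 7.5% of AGI."""
    return max(0.0, medical_expenses - 0.075 * agi)

# If the new deduction reduced AGI (it does not):
print(deductible_medical(gross_income - new_deduction))  # 9,000 - 5,700 = 3,300.0

# As enacted (a §63 deduction), AGI is unchanged, so the floor stays higher:
print(deductible_medical(gross_income))                  # 9,000 - 6,000 = 3,000.0
```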

The takeaway is that relying on a human-written analysis is no guarantee of correctness, either. I have sat through continuing education webinars on the OBBBA that contained this exact error. At this point, the error is so common that it has likely “infected” other human authors, who repeat it after hearing it from multiple sources.

I fully understand how this error was made and could easily have made the same mistake myself (and have likely made other mistakes). But unless you are working directly from source materials, you have to remember that there could be flaws in the analysis you are reading, and sometimes those errors become self-reinforcing when authors have seen or read earlier analyses.

Gemini’s “Thought” Process for Deep Research – How It Produced the Correct Analysis While Continuing to “Think” It Was Dealing with Fabricated Details About a Non-Existent Law

I have been looking at various AI sources for dealing with tax updates, including new law such as the One Big Beautiful Bill that was enacted on July 4, 2025. (I mistakenly gave the date it was signed into law as July 1, 2025 in my prompt, but that did not cause the issue, though you will notice that date in Gemini’s response to my query.) I noticed something about Gemini when I asked about the Trump Accounts in the bill, a provision I knew had changed dramatically across the various versions of the bill. When I told Gemini 2.5 Pro to give me information about the bill signed into law on July 4, 2025, it replied that no such bill existed. That didn’t surprise me, as I didn’t expect its training data to be that up to date.

So I next turned to Deep Research. The report it generated was the best one I’ve gotten from any AI system (including BlueJ, a commercial tax professional product). But in looking at its “thoughts,” I found that it initially concluded the bill did not exist, then ran its web searches and found a number of high-quality sources that did discuss it, yet it continued to refer to that information as fabricated in the thought process (since, as far as its training data was concerned, the bill didn’t exist). Nevertheless, it produced a summary built on material the thought history suggests it regarded as fabricated, and nothing in the final output is labeled that way.

So I decided to ask Gemini 2.5 Pro why that had happened. Its explanation, reproduced below, is an interesting look at how the system works, and at why the “thoughts” are more a window into the internals than a reflection of a “belief” in any way you might normally interpret that word.

Gemini’s Response Follows:

That is an excellent and highly insightful question. Your observation gets to the very heart of how advanced AI systems like Gemini, particularly with features like Deep Research, function. You are correct in your core assumption, but there’s a fascinating nuance to why you see that “cognitive dissonance” in the process.

Here’s a breakdown of what’s happening, addressing your specific question about why the model seems to maintain its initial belief in the “Thoughts” while producing a correct final output.

The Two “Minds” of the Operation: Base Model vs. Research Agent

Think of the process not as a single mind that changes its opinion, but as a multi-step workflow involving different specialized components or “agents.”

  1. The Planner (Based on Gemini 1.5 Pro’s Static Knowledge):
    When you initiate Deep Research, the first step is for the base model (Gemini 1.5 Pro) to understand your request and formulate a research plan.
    • Initial Check: The model’s first instinct is to check its own internal, static knowledge base. In your scenario, its training data cuts off before July 2025.
    • The First “Thought”: When it checks for “HR1 signed into law on July 1, 2025,” its internal knowledge returns a definitive “no match found.” This is a factual statement based on its training data. This is where the thought you observed—“the bill does not exist”—originates.
    • The Plan: Crucially, the model doesn’t stop there. It recognizes that its knowledge is limited by time. The plan it generates will involve steps to overcome this, such as:
      • Search for information about a bill designated HR1 in 2025.
      • Verify if HR1 was signed into law in July 2025.
      • Find authoritative sources like congress.gov, news reports, and legal analyses for this bill.
      • Synthesize the findings into a comprehensive analysis.
  2. The Researcher (The Tool-Using Agent):
    This is the part of the process that executes the plan. It autonomously runs Google searches, reads the results from the credible sources it finds, and gathers the raw data.
    • Finding New Evidence: In your hypothetical scenario, this agent finds a wealth of information about your fictional HR1 from authoritative sources published after its knowledge cutoff. It discovers the bill’s text, news reports on its signing, and analyses from think tanks.
    • Data Collection: This new, contradictory information is collected. The system notes the sources and the content. This is the second part of the “Thoughts” you see, where it lists the sources it found and the information it extracted.
  3. The Synthesizer (The Final Report Generator):
    This is the final and most important step. The system now has two conflicting sets of information:
    • Internal Knowledge (from the Planner): “This bill doesn’t exist.”
    • External Knowledge (from the Researcher): “Multiple, fresh, authoritative sources confirm this bill exists and here are its complete details.”

This is the key to your question: The system is not designed to “believe” one thing over the other in a human sense. It is designed to **fulfill the user’s request using the best available information.** The ultimate goal is to provide an analysis of the bill. The information retrieved from the live searches is infinitely more relevant, current, and useful for that task than its own outdated internal knowledge. Therefore, the synthesis process is architected to prioritize and build the final report based on the fresh data gathered by the Researcher. The initial (and now invalidated) thought from the Planner is correctly discarded as an artifact of the research process, not a fact to be included in the final product.

Why Does the “Thought” Linger?

The “Thoughts” section is a transparent log of the agent’s work process. It’s showing you its homework.

  • It shows the initial state (Premise: Check for HR1. Result: Not found in my knowledge base.).
  • It shows the actions taken (Action: Search Google. Results: Found 10 credible sources…).
  • The log isn’t a “stream of consciousness” that constantly edits itself. It’s a record. So, the initial premise remains logged even after it has been superseded by new evidence.

You are essentially seeing the “ghost in the machine”—the base model’s initial, correct-but-outdated assessment—before the research agent provides new, more relevant data for the final output.

Analogy: The Diligent Librarian

Imagine you ask a librarian (with a perfect memory up to 2024) for a book published yesterday.

  1. Initial Thought: The librarian thinks, “I have memorized every book in this library’s catalog. That book is not in my memory. Therefore, it does not exist in my catalog.”
  2. Action: However, they are a diligent librarian. They don’t stop there. They say, “Let me check the ‘New Arrivals’ cart from today’s delivery.”
  3. Discovery: They walk over to the cart, find the book, and read its back cover and table of contents.
  4. Final Answer: They walk back to you and give you a perfect summary of the new book. They don’t start by saying, “Well, I first thought this book didn’t exist…” They give you the answer you asked for, based on the new information they actively sought.

The “Thoughts” in Deep Research let you watch the librarian walk to the cart. The final output is the summary they give you at the end. Your observation is so keen because you noticed the librarian’s initial, momentary flicker of doubt before they completed their process.
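
Stepping back from Gemini’s explanation, the pattern it describes (a planner that consults static knowledge, a researcher that runs searches, and a synthesizer that prefers fresh findings) can be sketched roughly as follows. This is purely my own illustration; every name and structure here is hypothetical, not Google’s actual implementation.

```python
# Rough, hypothetical sketch of the planner -> researcher -> synthesizer pattern
# described above. None of this reflects Google's actual code.

def plan(query: str, internal_knowledge: dict) -> dict:
    """The base model checks its static training knowledge and writes a plan."""
    known = internal_knowledge.get(query)  # None if the topic postdates training
    return {
        "initial_assessment": known or "no match found in training data",
        "steps": [f"search the web for: {query}", "verify sources", "synthesize findings"],
    }

def research(research_plan: dict, search) -> list[str]:
    """A tool-using agent executes the search steps and collects raw findings."""
    findings = []
    for step in research_plan["steps"]:
        if step.startswith("search the web"):
            findings.extend(search(step))
    return findings

def synthesize(research_plan: dict, findings: list[str]) -> str:
    """The report is built from fresh findings; the stale initial assessment
    stays in the logged 'thoughts' but is not carried into the output."""
    if findings:
        return "Report based on retrieved sources:\n" + "\n".join(findings)
    return research_plan["initial_assessment"]
```

The point of the sketch is the last function: the logged initial assessment is never revised, it is simply outranked by the retrieved material when the final report is assembled.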

The One, Big, Beautiful Bill Title XI Tax Provisions as Passed by the Ways & Means Committee

I decided to post the following at this point for various reasons. First, since the One, Big, Beautiful Bill has already had some amendments made to Title XI by the House Rules Committee (though the full updated bill text is not available as of the morning of May 24), there’s little reason to refine this document further. Second, I have found it useful as a starting point for answering questions about the bill, since it makes it easier to find the relevant items in the bill text and the JCT report. And third, it demonstrates the usefulness of NotebookLM for getting on top of these sorts of things, as well as giving me a good learning experience in how to deal with new law text (the JCT report was especially useful here, since the bill itself contains only the changes to the IRC, not the full revised IRC provision in context, which is often needed to understand what Congress is doing).

So I am posting the analysis I compiled of the bill using NotebookLM, the bill text that was approved by the Ways & Means Committee, and the JCT report on the various provisions in that bill. I added a section on the proposed changes to PL 86-272 as well, though that provision was added by a different committee and is not part of Title XI.

The way I handled it was to craft a prompt asking for an analysis of specific areas of each section in the bill, working from the section list in the Ways & Means Committee summary. Giving NotebookLM “bite-sized” items to analyze proved a much more accurate way to get it to focus on the proper items than trying to pick off multiple provisions at once. And I did scan the output to make sure I didn’t see any obvious issues, following up on a few things that surprised me, though most often concluding that, yes, that’s how it was written once I had reviewed the appropriate sections of each document.
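
For anyone who wants to try the same approach, the loop below is a rough sketch of what the “bite-sized” prompting looks like in practice. The section titles and the prompt template are placeholders I made up for illustration, not my actual prompts.

```python
# Hypothetical sketch of per-section prompting: one focused prompt per bill section,
# worked from a section list, instead of asking about many provisions at once.

sections = [
    "Section 1 (placeholder title)",
    "Section 2 (placeholder title)",
    # ...the real list would come from the Ways & Means Committee summary
]

PROMPT_TEMPLATE = (
    "For {section}, summarize: (1) current law, (2) the proposed change, "
    "(3) the effective date, and (4) where the change appears in the bill text "
    "and in the JCT report."
)

def build_prompts(section_list: list[str]) -> list[str]:
    """One focused prompt per section keeps the tool on a single provision."""
    return [PROMPT_TEMPLATE.format(section=s) for s in section_list]

for prompt in build_prompts(sections):
    print(prompt)  # each prompt gets pasted into NotebookLM one at a time
```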

You can download that analysis below. There’s little reason for me to refine it further at this point, since the final bill will require starting over (you never know what change will sneak in as a bill moves through the process). But, as I noted, I do find it useful, with the caveat that you shouldn’t assume it’s 100% accurate (which you shouldn’t do for any analysis; the law is what matters).

If you haven’t used it, you might want to go look at NotebookLM. Even with a free account you can do a decent analysis (just not a huge number of them).