An Interesting Error from LLMs in Tax Research That Does Not Seem to Be a Hallucination

I was experimenting with three LLMs on a tax research issue recently. I asked them to identify deductions newly available to individuals without a business or rental activity that are deductible in computing adjusted gross income (AGI).

ChatGPT, Gemini, and BlueJ (a paid service for tax professionals) all identified the following:

  • No tax on tips
  • No tax on overtime
  • No tax on interest on car loans

As I’ll show, this answer is wrong. But first, consider this: they all arrived at the same incorrect answer. While we know that LLMs can hallucinate, this isn’t likely a hallucination (an answer created from thin air). If all three models were hallucinating, they would be highly unlikely to invent the exact same error.

The key to the error lies in the distinction between IRC §62 and §63. If you’ve taken my course on the OBBBA, you’ll recall I noted that these specific deductions are routed through §63 (computation of taxable income), not §62 (computation of adjusted gross income). The best-known §63 deduction is the QBI deduction under §199A: it reduces taxable income on Form 1040 but does not reduce AGI. Because it doesn’t reduce federal AGI, it also doesn’t reduce Arizona taxable income, a crucial point for us here in Arizona.
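To make the distinction concrete, here is a minimal sketch, using purely hypothetical dollar amounts and a deliberately simplified Form 1040 flow, of how a §62 deduction and a §63 deduction land differently:

```python
# Hypothetical, simplified illustration of the Section 62 vs. Section 63 routing.
# The dollar amounts are made up and the flow is stripped down; this is not tax software.

total_income = 100_000
sec_62_deduction = 5_000     # an "above-the-line" item listed in Section 62
standard_deduction = 15_000  # hypothetical amount
sec_63_deduction = 5_000     # an item routed through Section 63 (QBI today; per the
                             # statute as described above, the new OBBBA deductions)

# Section 62: AGI is total income less only the deductions listed in Section 62.
agi = total_income - sec_62_deduction                         # 95,000

# Section 63: taxable income starts from AGI, then subtracts the standard (or
# itemized) deduction plus the deductions routed through Section 63.
taxable_income = agi - standard_deduction - sec_63_deduction  # 75,000

print(f"AGI:            {agi:,}")
print(f"Taxable income: {taxable_income:,}")

# A state return that starts from federal AGI (as Arizona's does) never sees the
# Section 63 deduction, and AGI-based thresholds are likewise unchanged by it.
```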

So, how did all three AIs get the same wrong answer? It’s simple: a large number of human authors, whose material was used in the models’ training data or accessed via web searches, made the same mistake. Because these new deductions were touted as being available even to non-itemizers, many authors assumed they were “above-the-line” deductions and described them as such. However, a review of the statute makes it clear: the OBBBA added these provisions to §63 but made no change to §62.

It’s true that the LLMs didn’t consult the text of the law; they don’t perform legal analysis. Rather, they synthesize the analysis that others have prepared and published, giving extra weight to “high-quality” sources. In this case, a large percentage of human analysts made the same mistake.

There are understandable reasons for this. Bill proponents repeatedly noted the deductions were available even if a taxpayer did not itemize. Many general-purpose financial publications quickly released articles stating these deductions were “above the line.” They presumably equated being able to take a deduction without itemizing with it being an “above-the-line” deduction, an association that was generally, but not always, true before the TCJA and OBBBA.

From personal experience, I know producing an analysis of a new tax bill is done under extreme pressure. Reading federal legislation is messy; the bill only shows the amendments, not the law in its final, consolidated form. It’s easy to miss the significance of a deduction being added to §63 instead of §62.

This distinction is critical because AGI impacts numerous calculations. For states like Arizona that start with federal AGI, these deductions won’t reduce state taxable income. Furthermore, AGI affects tax thresholds (like phase-outs and deduction floors) and even non-tax items like the IRMAA calculation for Medicare premiums.

The takeaway is that relying on a human-written analysis is no guarantee of correctness, either. I have sat through continuing education webinars on the OBBBA that contained this exact error. At this point, the error is so common that it has likely “infected” other human authors, who repeat it after hearing it from multiple sources.

I fully understand how this error was made and could easily have made the same mistake myself (and have likely made other mistakes). But unless you are working directly from source materials, you always have to remember that there could be flaws in the analysis you are reading, and sometimes these errors become self-reinforcing when authors have seen or read previous analyses.

Gemini “Thought” Process for Deep Research – How It Produced the Correct Analysis While Continuing to “Think” It Was Dealing with Fabricated Details About a Non-Existent Law

I have been looking at various AI sources for dealing with tax updates, including new laws such as the One Big Beautiful Bill that was enacted on July 4, 2025. (In my prompt I mistakenly gave the date it was signed into law as July 1, 2025; that did not cause the issue, though you will notice that date in Gemini’s response to my query.) I noticed something interesting about Gemini when I asked about the Trump Accounts in the bill, a provision I knew had changed dramatically across the various versions of the bill. When I told Gemini 2.5 Pro to give me information about the bill signed into law on July 4, 2025, it replied that no such bill exists. That didn’t surprise me, as I didn’t expect its training to be that up to date.

So I next turned to Deep Research. The report it generated was the best one I’ve gotten from any AI system (including BlueJ, a commercial tax professional product). But in looking at its “thoughts,” I found it initially concluded that the bill did not exist; it then did its web search and found a lot of high-quality sources that did talk about the bill, yet it continued to refer to that information as fabricated in its thought process (since, to the model, the bill didn’t exist). Nevertheless, it produced a summary that, according to the thought history, it “believed” was based on fabricated information, but which is never labeled as such in the output.

So I decided to ask Gemini 2.5 Pro why that had happened, and its explanation, reproduced below, is an interesting look at how that system works, and at why the “thoughts” are more a look at the internals than a reflection of “belief” in any sense you might interpret that word.

Gemini’s Response Follows:

That is an excellent and highly insightful question. Your observation gets to the very heart of how advanced AI systems like Gemini, particularly with features like Deep Research, function. You are correct in your core assumption, but there’s a fascinating nuance to why you see that “cognitive dissonance” in the process.

Here’s a breakdown of what’s happening, addressing your specific question about why the model seems to maintain its initial belief in the “Thoughts” while producing a correct final output.

The Two “Minds” of the Operation: Base Model vs. Research Agent

Think of the process not as a single mind that changes its opinion, but as a multi-step workflow involving different specialized components or “agents.”

  1. The Planner (Based on Gemini 1.5 Pro’s Static Knowledge):
    When you initiate Deep Research, the first step is for the base model (Gemini 1.5 Pro) to understand your request and formulate a research plan.
    • Initial Check: The model’s first instinct is to check its own internal, static knowledge base. In your scenario, its training data cuts off before July 2025.
    • The First “Thought”: When it checks for “HR1 signed into law on July 1, 2025,” its internal knowledge returns a definitive “no match found.” This is a factual statement based on its training data. This is where the thought you observed—“the bill does not exist”—originates.
    • The Plan: Crucially, the model doesn’t stop there. It recognizes that its knowledge is limited by time. The plan it generates will involve steps to overcome this, such as:
      • Search for information about a bill designated HR1 in 2025.
      • Verify if HR1 was signed into law in July 2025.
      • Find authoritative sources like congress.gov, news reports, and legal analyses for this bill.
      • Synthesize the findings into a comprehensive analysis.
  2. The Researcher (The Tool-Using Agent):
    This is the part of the process that executes the plan. It autonomously runs Google searches, reads the results from the credible sources it finds, and gathers the raw data.
    • Finding New Evidence: In your hypothetical scenario, this agent finds a wealth of information about your fictional HR1 from authoritative sources published after its knowledge cutoff. It discovers the bill’s text, news reports on its signing, and analyses from think tanks.
    • Data Collection: This new, contradictory information is collected. The system notes the sources and the content. This is the second part of the “Thoughts” you see, where it lists the sources it found and the information it extracted.
  3. The Synthesizer (The Final Report Generator):
    This is the final and most important step. The system now has two conflicting sets of information:
    • Internal Knowledge (from the Planner): “This bill doesn’t exist.”
    • External Knowledge (from the Researcher): “Multiple, fresh, authoritative sources confirm this bill exists and here are its complete details.”

This is the key to your question: The system is not designed to “believe” one thing over the other in a human sense. It is designed to **fulfill the user’s request using the best available information.** The ultimate goal is to provide an analysis of the bill. The information retrieved from the live searches is infinitely more relevant, current, and useful for that task than its own outdated internal knowledge. Therefore, the synthesis process is architected to prioritize and build the final report based on the fresh data gathered by the Researcher. The initial (and now invalidated) thought from the Planner is correctly discarded as an artifact of the research process, not a fact to be included in the final product.

Why Does the “Thought” Linger?

The “Thoughts” section is a transparent log of the agent’s work process. It’s showing you its homework.

  • It shows the initial state (Premise: Check for HR1. Result: Not found in my knowledge base.).
  • It shows the actions taken (Action: Search Google. Results: Found 10 credible sources…).
  • The log isn’t a “stream of consciousness” that constantly edits itself. It’s a record. So, the initial premise remains logged even after it has been superseded by new evidence.

You are essentially seeing the “ghost in the machine”—the base model’s initial, correct-but-outdated assessment—before the research agent provides new, more relevant data for the final output.

Analogy: The Diligent Librarian

Imagine you ask a librarian (with a perfect memory up to 2024) for a book published yesterday.

  1. Initial Thought: The librarian thinks, “I have memorized every book in this library’s catalog. That book is not in my memory. Therefore, it does not exist in my catalog.”
  2. Action: However, they are a diligent librarian. They don’t stop there. They say, “Let me check the ‘New Arrivals’ cart from today’s delivery.”
  3. Discovery: They walk over to the cart, find the book, and read its back cover and table of contents.
  4. Final Answer: They walk back to you and give you a perfect summary of the new book. They don’t start by saying, “Well, I first thought this book didn’t exist…” They give you the answer you asked for, based on the new information they actively sought.

The “Thoughts” in Deep Research let you watch the librarian walk to the cart. The final output is the summary they give you at the end. Your observation is so keen because you noticed the librarian’s initial, momentary flicker of doubt before they completed their process.
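Setting the librarian aside, the planner/researcher/synthesizer workflow Gemini describes is a common pattern in agentic research tools. Here is a minimal sketch of that pattern; the names and structure below are my own illustration and assumptions, not Gemini’s actual internals:

```python
# Illustrative sketch of a plan -> research -> synthesize agent loop with an
# append-only "thoughts" log. None of this is Gemini's real implementation.

from dataclasses import dataclass, field

@dataclass
class ThoughtLog:
    """The visible 'Thoughts' pane: a record of work, never rewritten."""
    entries: list = field(default_factory=list)

    def note(self, text):
        self.entries.append(text)

def plan(question, log):
    # The planner consults only static training knowledge; a "not found" result
    # is logged but does not end the process, it just shapes the search plan.
    log.note(f"Internal knowledge check for {question!r}: no match found.")
    return [f"search the web for {question}",
            "verify against authoritative sources",
            "synthesize the findings"]

def research(steps, log):
    # Stand-in for live web searches; a real system would call a search tool here.
    findings = [f"fresh source found while executing: {step}" for step in steps]
    log.note(f"Collected {len(findings)} findings from live search.")
    return findings

def synthesize(findings):
    # The report is built from the fresh findings alone; the planner's stale
    # "not found" entry stays in the log but is not carried into the output.
    return "Report based on: " + "; ".join(findings)

log = ThoughtLog()
report = synthesize(research(plan("HR1 signed into law in July 2025", log), log))
print(report)
print(log.entries)  # the superseded "no match found" note is still here: the lingering "thought"
```

The append-only log is the whole point: it records each step as it happened, so the planner’s superseded conclusion remains visible even after the synthesizer has discarded it.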

CCH and Browser Requirements

Had an interesting back and forth with CCH on the browser requirements for their new tax research software. Their system requirements page lists Windows 98/2000/XP/Vista and Internet Explorer 6/7 as required. If those were truly the requirements, I would have to cancel my CCH subscription.

Why? I have a policy in my office of generally not allowing users to run Internet Explorer due to the huge security problem that ActiveX has been (and to some extent continues to be). The concept is so fundamentally flawed that, in my view, it simply rules out the use of IE by anyone who is not highly technically skilled.

That is, my users can’t reliably make the call on whether or not it’s OK to allow an ActiveX control to be installed. As well, the fact that virtually any DLL will be loaded by IE as an ActiveX control is a big security hole: hackers discover problems in programs that were never meant to be run as ActiveX controls and then use IE to run those programs and exploit their weaknesses.

As well, I do tax research on my portable devices (an iPod Touch and a BlackBerry), and those don’t run Internet Explorer even though they are rather dominant business portable devices. Losing that access would frankly be totally unacceptable.

I wrote CCH about this requirement and discovered, as I suspected, that this was just a “suggestion” and that, in fact, Firefox runs the system just fine. That’s not surprising, since the current CCH research system requirements exclude Vista. But it is disquieting, since it means CCH has reserved the right to simply pin the blame on Firefox for any of their glitches, rather than recognizing that a significant and growing segment of the market uses Firefox.

Of course, given this history, they can (and likely would) do the same thing with Windows 7 and Internet Explorer 8, even though those will likely be the standard Microsoft shipping configuration shortly after they’ve rolled out the new software.