Sometimes Memorization Just Ain’t Enough (With Apologies to Patty Smyth for Borrowing Her Title)

In a recent article published by Futurism, Jad Tarifi, the founder of Google’s inaugural generative AI team, cautioned against pursuing careers in law and medicine, asserting that these professions primarily entail memorizing information. That assertion prompted a critical examination of prevalent misconceptions about the true nature of knowledge and skill acquisition in these domains.

A compelling illustration of this misunderstanding is the consistently erroneous responses from three distinct AI large language models (LLMs) when queried about specific provisions within the One Big, Beautiful Bill Act (OBBBA) designed to reduce an individual’s adjusted gross income. As I documented in a prior blog entry, I, a certified public accountant (CPA) without additional legal training, and certainly not possessing comprehensive recall of the entire OBBBA, was immediately able to identify the flaw in the models’ conclusions. This occurred despite the theoretical premise that these models “knew” the entirety of the Internal Revenue Code (IRC), along with all pertinent binding guidance, and had unfettered access to the complete legislative text that amended that law.

Within tax law, rote memorization has never been, nor will it ever be, the sole method for ascertaining legal principles and their application. Instead, a meticulous reading of the law is imperative, necessitating careful attention to its enumerated items, cross-references, and the subsequent tracing of those references (which may lead to further cross-references) to construct a comprehensive understanding of the statute. Furthermore, the process involves distinguishing unambiguous statutory provisions, which dictate a singular interpretation, from those exhibiting ambiguity, which require interpretation. This interpretive process relies upon established canons of statutory construction developed over time.

Subsequent to this initial statutory analysis, the practitioner must then explore existing binding official interpretive guidance, assess whether the law has changed since such guidance was issued, and finally consult expert discussions from third-party sources and non-binding official guidance. This comprehensive approach yields potentially supportable interpretations of the matter at hand, alongside an assessment of the likelihood that any given interpretation would ultimately be accepted by an IRS agent, an appellate conferee, or the highest court to which the case might escalate.

The fundamental flaw in the analyses provided by ChatGPT, Gemini, and BlueJ in this instance was their failure to commence with an isolated examination of the law. Instead, they prematurely resorted to other sources, neglecting to analyze the memorized legal text and the legislative amendments in isolation. Moreover, it remains unclear whether such isolated analysis is truly feasible for these models given their operational paradigms.

As previously detailed, the OBBBA provisions concerning the exemption of tips, overtime pay, and car loan interest from taxation were appended as deductions to Section 63 of the IRC. This classification renders them deductions taken along with either the standard deduction or itemized deductions in computing taxable income, rather than deductions that reduce adjusted gross income. As elaborated upon in the earlier article, deductions utilized in the computation of adjusted gross income would be allocated to Section 62.
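
To make the placement concrete, here is a minimal sketch, with purely hypothetical amounts, of where the two kinds of deductions fall in the individual computation; it ignores the caps, phase-outs, and definitions in the actual provisions and is meant only to show the order of operations.

```python
# Minimal sketch (not tax advice): where Section 62 and Section 63 deductions
# fall in the individual computation. All amounts are hypothetical placeholders,
# and the real OBBBA provisions carry caps and phase-outs that are ignored here.

def adjusted_gross_income(gross_income, sec_62_adjustments):
    # Section 62 items ("above the line") reduce AGI.
    return gross_income - sum(sec_62_adjustments)

def taxable_income(agi, standard_or_itemized, sec_63_deductions):
    # Section 63 items (the new tip, overtime, and car loan interest deductions,
    # like the QBI deduction) are applied only AFTER AGI has been fixed.
    return max(agi - standard_or_itemized - sum(sec_63_deductions), 0)

agi = adjusted_gross_income(gross_income=80_000, sec_62_adjustments=[2_000])
ti = taxable_income(agi, standard_or_itemized=15_000,
                    sec_63_deductions=[4_000])  # hypothetical tip deduction

print(agi)  # 78000 -- unchanged by the Section 63 deduction
print(ti)   # 59000 -- the deduction shows up only here
```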

While the models had purportedly memorized the entire structure of the IRC, theoretically giving them this knowledge, they failed to undertake the crucial step that tax CPAs and attorneys should perform at this juncture: consulting the statutory text and following cross-references to gain an initial understanding of the IRC’s provisions, as amended, pertaining to this issue. Instead, they were diverted by other related information contained within their models and acquired from the web during data updates meant to cover developments after the end of the period covered by their training data. This data included articles quoting Congressional sources that correctly noted these items were deductible regardless of whether a taxpayer itemized deductions. The training data also encompassed years of articles referring to various earlier items, deductible irrespective of itemization, as “above the line” deductions, and indicating that “above the line” signified a deduction used in computing adjusted gross income. None of this memorized data was factually incorrect, though much of it proved irrelevant in this specific context.

Consequently, the models quickly identified a high correlation between the phrase “deductible even if the taxpayer does not itemize deductions” and a deduction being categorized as “above the line,” as well as a nearly one-to-one correlation between a deduction being referred to as “above the line” and its use in computing adjusted gross income. Again, all of these correlations, drawn from the memorized information, were accurate.

However, at this juncture, the analysis fell apart due to an incorrect inference. The models utilized all data indicating these deductions were available to both itemizing and non-itemizing taxpayers (which was entirely correct) to infer that these deductions were “above the line” and utilized in computing adjusted gross income (absolutely not correct). This inference was reinforced by the historical commentary present within their vast memorized data. Nevertheless, it was an inference that proved completely erroneous.
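
The shortcut can be shown with a toy example. Assume, purely hypothetically, that every “deductible without itemizing” item in the historical training data really was a Section 62 item; a rule learned from that correlation matches the past perfectly and still fails the moment Congress places such an item in Section 63 instead.

```python
# Toy illustration of the inference (all data hypothetical): historically,
# "deductible without itemizing" and "reduces AGI" moved together, so a rule
# built on that correlation looks perfect -- until the Section 63 items arrive.

historical_items = [
    # (description, deductible_without_itemizing, reduces_agi)
    ("educator expenses (Sec. 62)", True, True),
    ("student loan interest (Sec. 62)", True, True),
    ("IRA contribution (Sec. 62)", True, True),
]

def inferred_reduces_agi(deductible_without_itemizing):
    # The learned shortcut: available to non-itemizers, therefore "above the line."
    return deductible_without_itemizing

# The shortcut matches every historical example...
assert all(inferred_reduces_agi(d) == r for _, d, r in historical_items)

# ...but the new OBBBA items (and the QBI deduction before them) break it:
description, deductible, reduces_agi = ("tip deduction (Sec. 63)", True, False)
print(inferred_reduces_agi(deductible))  # True  -- the inference
print(reduces_agi)                       # False -- the statute
```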

The question then arises: why did the models’ access to more current sources not challenge their conclusions? This highlights the fallibility of many human authors who were expeditiously producing commentary on the bill. Numerous such articles, some originating from highly reputable sources, incorrectly labeled these provisions as “above the line” deductions.

While the precise reasons for these authors’ misclassification remain speculative, it appears evident that they did not base their initial articles on a direct analysis of the bill itself and the Internal Revenue Code for this issue. It is plausible that many made the same inference that the LLMs derived from their extensive memorized data—namely, that historically, Congress had enacted provisions for items deductible even without itemization as deductions used to compute adjusted gross income. Others, encountering previously published articles making the claim that these deductions reduced adjusted gross income, simply relied upon those sources. Furthermore, I suspect that some authors, operating under time constraints, posed the question to LLMs, which, based on their initial model assumptions, provided the erroneous result.

In any event, the articles discovered by the LLMs analyzing the law served to corroborate their models’ initial assumptions, which were rooted in their extensive memorized data, thereby creating a significant feedback loop.

My approach to the new legislation mirrored that of previous laws, with the notable addition of Google’s NotebookLM, an AI application primarily reliant on user-provided sources. The notebook I established contained exclusively the tax provisions of the statute. As I systematically reviewed each section of the bill, I utilized NotebookLM to generate summaries of the respective sections, while simultaneously conducting my own independent review to ensure comprehensive understanding and prevent oversight during rapid analysis of the law.

NotebookLM’s analysis was largely confined to the provided document, offering little beyond the observation that each provision modified Section 63, since that was all that was available to the application. Consequently, I cross-referenced Section 63 to ascertain the specific implications of such modifications. I knew Congress had, on rare occasions, introduced an additional “below the line” deduction by placing it in this section, one available regardless of whether a taxpayer itemized deductions; Congress had now decided to do this multiple times in the new bill.

A precedent for such a deduction can be found in the Qualified Business Income (QBI) Deduction under IRC §199A, enacted by the Tax Cuts and Jobs Act of 2017. This provision has remained in effect and its application was extended as part of the OBBBA.

Posing inquiries to the models regarding the placement of these deductions within Section 63, akin to the QBI deduction in 2017, frequently resolved the models’ prior limitations. This shift in focus directed their attention to articles detailing that specific deduction and highlighting the impact of its placement in Section 63 rather than Section 62. The models subsequently confirmed that the bill text indeed added these deductions to Section 63 and, likely drawing from the newly focused articles, concluded that these deductions do not reduce adjusted gross income.

Even though their approach remained flawed, they did finally arrive at the correct solution.

I am not an attorney and do not claim to be one. However, I would be astonished if similar problems didn’t frequently emerge when parties depend on LLMs for tasks typically handled by the law students who would become attorneys, the very students Mr. Tarifi suggests should discontinue their legal studies.

Mr. Tarifi undoubtedly possesses far greater knowledge of LLMs and AI than I ever will. Yet I question whether proponents of AI solutions truly grasp the issues they aim to resolve, or whether they merely presume to understand them. The immediate availability of the entire IRC and all supporting documentation within an LLM’s model is genuinely useful and will revolutionize our approach to tax research and analysis. Nevertheless, I hope I have demonstrated that rote memorization alone is insufficient to perform the job being done today, nor do I see signs that newer models are progressing beyond memorization to conduct a comprehensive analysis of new laws and developments.

Large Language Models (LLMs) represent a significant advancement that professionals in tax practice must comprehend to maintain relevance. However, just as the introduction of computers in the 1960s and 1970s, and the later machine learning enhancements that moved beyond the initial automation of routine preparation tasks, did not render tax professionals obsolete, LLMs are not anticipated to be the technology that does so.

An Interesting Error from LLMs in Tax Research That Does Not Seem to Be a Hallucination

I was experimenting with three LLMs on a tax research issue recently. I asked them to identify deductions newly available to individuals without a business or rental activity that are deductible in computing adjusted gross income (AGI).

ChatGPT, Gemini, and BlueJ (a paid service for tax professionals) all identified the following:

  • No tax on tips
  • No tax on overtime
  • No tax on interest on car loans

As I’ll show, this answer is wrong. But first, consider this: they all arrived at the same incorrect answer. While we know that LLMs can hallucinate, this isn’t likely a hallucination (an answer created from thin air). If all three models were hallucinating, they would be highly unlikely to invent the exact same error.

The key to the error lies in the distinction between IRC §62 and §63. If you’ve taken my course on the OBBBA, you’ll recall I noted that these specific deductions are routed through §63 (computation of taxable income), not §62 (computation of adjusted gross income). The most well-known §63 deduction is the QBI deduction under §199A: it reduces taxable income on Form 1040 but does not reduce AGI. Because it doesn’t reduce federal AGI, it also doesn’t reduce Arizona taxable income, a crucial point for us here in Arizona.

So, how did all three AIs get the same wrong answer? It’s simple: a large number of human authors, whose material was used in the models’ training data or accessed via web searches, made the same mistake. Because these new deductions were touted as being available even to non-itemizers, many authors assumed they were “above-the-line” deductions and described them as such. However, a review of the statute makes it clear: the OBBBA added these provisions to §63 but made no change to §62.

It’s true that the LLMs didn’t consult the text of the law; they don’t perform legal analysis. Rather, they synthesize the analysis that others have prepared and published, giving extra weight to “high-quality” sources. In this case, a large percentage of human analysts made the same mistake.

There are understandable reasons for this. Bill proponents repeatedly noted the deductions were available even if a taxpayer did not itemize. Many general-purpose financial publications quickly released articles stating these deductions were “above the line.” They presumably equated being able to take a deduction without itemizing with it being an “above-the-line” deduction, an association that was generally, but not always, true before the TCJA and OBBBA.

From personal experience, I know producing an analysis of a new tax bill is done under extreme pressure. Reading federal legislation is messy; the bill only shows the amendments, not the law in its final, consolidated form. It’s easy to miss the significance of a deduction being added to §63 instead of §62.

This distinction is critical because AGI impacts numerous calculations. For states like Arizona that start with federal AGI, these deductions won’t reduce state taxable income. Furthermore, AGI affects tax thresholds (like phase-outs and deduction floors) and even non-tax items like the IRMAA calculation for Medicare premiums.
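
A short sketch, again with hypothetical figures and a made-up phase-out threshold, of why the placement matters downstream: anything keyed to AGI is computed before the Section 63 deduction is ever applied, so the deduction never reaches those calculations.

```python
# Sketch of the downstream effect (hypothetical figures and thresholds).
# A Section 63 deduction lowers federal taxable income, but items driven by AGI --
# a state return that starts from federal AGI, or an AGI-based phase-out -- never see it.

federal_agi = 78_000        # fixed before any Section 63 deduction is applied
sec_63_deduction = 4_000    # e.g., a hypothetical tip deduction
standard_deduction = 15_000

# The deduction matters here, on the federal taxable income line...
federal_taxable_income = federal_agi - standard_deduction - sec_63_deduction

# ...but not in anything computed from AGI:
state_starting_point = federal_agi                   # a state that starts with federal AGI
phaseout_threshold = 75_000                          # a hypothetical AGI-based phase-out
benefit_phased_out = federal_agi > phaseout_threshold

print(federal_taxable_income)                        # 59000
print(state_starting_point, benefit_phased_out)      # 78000 True -- unaffected by the deduction
```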

The takeaway is that relying on a human-written analysis is no guarantee of correctness, either. I have sat through continuing education webinars on the OBBBA that contained this exact error. At this point, the error is so common that it has likely “infected” other human authors, who repeat it after hearing it from multiple sources.

I fully understand how this error was made and could easily have made the same mistake myself (and have likely made other mistakes). But unless you are working directly from the source materials, you always have to remember that there could be flaws in the analysis you are reading, and sometimes these errors become self-reinforcing when authors have seen or read previous analyses.

Gemini “Thought” Process for Deep Research – How It Produced the Correct Analysis While Continuing to “Think” It Was Dealing with Fabricated Detail for a Non-Existent Law

I have been looking at various AI sources for dealing with tax updates, including new law such as the One Big Beautiful Bill that was enacted on July 4, 2025. (I did mistakenly give the date it was signed into law as July 1, 2025 in my prompt, but that did not cause the issue, though you will notice that date in Gemini’s response to my query.) I noticed one thing about Gemini when I asked about the Trump Accounts in the bill, something I knew had changed dramatically across its various versions. When I told Gemini 2.5 Pro to give me information about the bill signed into law on July 4, 2025, it replied that no such bill exists. That didn’t surprise me, as I didn’t expect its training to be that up to date.

So I next turned to Deep Research. The report it generated was the best one I’ve gotten from any AI system (including BlueJ, a commercial tax professional product), but in looking at its “thoughts” I found it initially concluded that the bill did not exist, then did its web search and found a lot of high-quality sources that did discuss it, yet continued to refer to this information as fabricated in the thought process (since, to the model, the bill didn’t exist). Nevertheless, it produced a summary that the thought history suggests it “believed” was fabricated, but which is never labeled as such in the output.

So I decided to ask Gemini 2.5 Pro why that had happened, and its explanation, reproduced below, is an interesting look at how that system works, and why the “thoughts” are more a look at the internals than a reflection of “belief” in any way you might interpret that word. A short schematic sketch of the workflow it describes follows its response.

Gemini’s Response Follows:

That is an excellent and highly insightful question. Your observation gets to the very heart of how advanced AI systems like Gemini, particularly with features like Deep Research, function. You are correct in your core assumption, but there’s a fascinating nuance to why you see that “cognitive dissonance” in the process.

Here’s a breakdown of what’s happening, addressing your specific question about why the model seems to maintain its initial belief in the “Thoughts” while producing a correct final output.

The Two “Minds” of the Operation: Base Model vs. Research Agent

Think of the process not as a single mind that changes its opinion, but as a multi-step workflow involving different specialized components or “agents.”

  1. The Planner (Based on Gemini 1.5 Pro’s Static Knowledge):
    When you initiate Deep Research, the first step is for the base model (Gemini 1.5 Pro) to understand your request and formulate a research plan.
    • Initial Check: The model’s first instinct is to check its own internal, static knowledge base. In your scenario, its training data cuts off before July 2025.
    • The First “Thought”: When it checks for “HR1 signed into law on July 1, 2025,” its internal knowledge returns a definitive “no match found.” This is a factual statement based on its training data. This is where the thought you observed—“the bill does not exist”—originates.
    • The Plan: Crucially, the model doesn’t stop there. It recognizes that its knowledge is limited by time. The plan it generates will involve steps to overcome this, such as:
      • Search for information about a bill designated HR1 in 2025.
      • Verify if HR1 was signed into law in July 2025.
      • Find authoritative sources like congress.gov, news reports, and legal analyses for this bill.
      • Synthesize the findings into a comprehensive analysis.
  2. The Researcher (The Tool-Using Agent):
    This is the part of the process that executes the plan. It autonomously runs Google searches, reads the results from the credible sources it finds, and gathers the raw data.
    • Finding New Evidence: In your hypothetical scenario, this agent finds a wealth of information about your fictional HR1 from authoritative sources published after its knowledge cutoff. It discovers the bill’s text, news reports on its signing, and analyses from think tanks.
    • Data Collection: This new, contradictory information is collected. The system notes the sources and the content. This is the second part of the “Thoughts” you see, where it lists the sources it found and the information it extracted.
  3. The Synthesizer (The Final Report Generator):
    This is the final and most important step. The system now has two conflicting sets of information:
    • Internal Knowledge (from the Planner): “This bill doesn’t exist.”
    • External Knowledge (from the Researcher): “Multiple, fresh, authoritative sources confirm this bill exists and here are its complete details.”

This is the key to your question: The system is not designed to “believe” one thing over the other in a human sense. It is designed to **fulfill the user’s request using the best available information.** The ultimate goal is to provide an analysis of the bill. The information retrieved from the live searches is infinitely more relevant, current, and useful for that task than its own outdated internal knowledge. Therefore, the synthesis process is architected to prioritize and build the final report based on the fresh data gathered by the Researcher. The initial (and now invalidated) thought from the Planner is correctly discarded as an artifact of the research process, not a fact to be included in the final product.

Why Does the “Thought” Linger?

The “Thoughts” section is a transparent log of the agent’s work process. It’s showing you its homework.

  • It shows the initial state (Premise: Check for HR1. Result: Not found in my knowledge base.).
  • It shows the actions taken (Action: Search Google. Results: Found 10 credible sources…).
  • The log isn’t a “stream of consciousness” that constantly edits itself. It’s a record. So, the initial premise remains logged even after it has been superseded by new evidence.

You are essentially seeing the “ghost in the machine”—the base model’s initial, correct-but-outdated assessment—before the research agent provides new, more relevant data for the final output.

Analogy: The Diligent Librarian

Imagine you ask a librarian (with a perfect memory up to 2024) for a book published yesterday.

  1. Initial Thought: The librarian thinks, “I have memorized every book in this library’s catalog. That book is not in my memory. Therefore, it does not exist in my catalog.”
  2. Action: However, they are a diligent librarian. They don’t stop there. They say, “Let me check the ‘New Arrivals’ cart from today’s delivery.”
  3. Discovery: They walk over to the cart, find the book, and read its back cover and table of contents.
  4. Final Answer: They walk back to you and give you a perfect summary of the new book. They don’t start by saying, “Well, I first thought this book didn’t exist…” They give you the answer you asked for, based on the new information they actively sought.

The “Thoughts” in Deep Research let you watch the librarian walk to the cart. The final output is the summary they give you at the end. Your observation is so keen because you noticed the librarian’s initial, momentary flicker of doubt before they completed their process.
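
Gemini’s description of a Planner, a Researcher, and a Synthesizer maps onto a familiar agentic pattern. The sketch below is my own schematic rendering of that pattern, not Gemini’s actual code or API; the function arguments are placeholders standing in for the model’s knowledge check, the web search, and the report generator.

```python
# Schematic sketch (mine, not Gemini's implementation) of the planner /
# researcher / synthesizer workflow described above. `model_knows`,
# `web_search`, and `summarize` are placeholder callables supplied by the caller.

def deep_research(question, model_knows, web_search, summarize):
    thoughts = []

    # 1. Planner: check static knowledge first. The "not found" note is logged
    #    here and, as described above, is never revised later.
    if not model_knows(question):
        thoughts.append("Not found in internal knowledge; planning web research.")

    # 2. Researcher: gather fresh sources regardless of the planner's verdict.
    sources = web_search(question)
    thoughts.append(f"Collected {len(sources)} sources.")

    # 3. Synthesizer: build the report from the freshest evidence available;
    #    the earlier "does not exist" thought is simply left behind in the log.
    report = summarize(sources) if sources else "No reliable information found."
    return report, thoughts
```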

The One, Big, Beautiful Bill Title XI Tax Provisions as Passed by the Ways & Means Committee

I decided to post the following for various reasons at this point. First, since the One, Big, Beautiful Bill has already had some amendments made to Title XI by the House Rules Committee (though the full updated bill text is not available as of the morning of May 24), there’s little reason to refine this document further. Second, I have found it useful as a starting point for answering questions about the bill, since it helps me find relevant items in the bill text and JCT report. And, third, it demonstrates the usefulness of NotebookLM for getting on top of these sorts of things, as well as giving me a good learning experience in how to deal with new law text (the JCT report was really useful in handling the bill, since the bill text contains only the changes to the IRC, not the full revised IRC provisions in context, which are often needed to understand what Congress is doing).

So I am posting the analysis I compiled of the bill using NotebookLM, the bill text that was approved by the Ways & Means Committee, and the JCT report on the various provisions in that bill. I added a section on the proposed changes to PL 86-272 as well, though that provision was added by a different committee and is not part of Title XI.

The way I handled it was to craft a prompt asking for an analysis of specific areas of each section in the bill, working from the section list in the Ways & Means Committee summary. Giving NotebookLM “bite-sized” items to analyze proved a much more accurate way to get it to focus on the proper items than trying to pick off multiple provisions at once. And I did scan the output to make sure I didn’t see any obvious issues, following up on a few items that surprised me, though most often discovering that, yes, that’s how it was written, once I had reviewed the appropriate sections of each document.
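
For anyone wanting to try the same approach, here is a minimal sketch of that “bite-sized” prompting pattern; the section numbers, titles, and prompt wording are illustrative only, and I pasted the prompts into NotebookLM by hand rather than through any API.

```python
# Sketch of the "bite-sized" prompting approach (illustrative section list and
# wording only). Each bill section gets its own prompt, built from the committee
# summary's section list, and the prompts are pasted into NotebookLM one at a time.

sections = [
    ("Sec. 110101", "Extension of modified individual income tax rates"),
    ("Sec. 110102", "Extension of increased standard deduction"),
    # ...one entry per item in the Ways & Means Committee summary...
]

PROMPT_TEMPLATE = (
    "Using only the bill text and the JCT report, summarize {number} ({title}): "
    "the operative rules, the IRC sections it amends, and the effective date."
)

for number, title in sections:
    print(PROMPT_TEMPLATE.format(number=number, title=title))
    print("-" * 40)
```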

You can download that analysis below. There’s little reason for me to refine it more at this point, since the final bill will require a “start over” (you never know what change will sneak in as a bill moves through the process), but, as I noted, with the caveat that you shouldn’t assume it’s 100% accurate (which you shouldn’t do for any analysis; the law is what matters), I do find it useful.

If you haven’t used it, you might want to go look at NotebookLM. Even with a free account you can do a decent analysis (just not a huge number of them).