We tested 6 AI assistants on the same solar data. The results surprised us

A controlled experiment with Claude, ChatGPT, Gemini, Google AI Studio, Grok, and Copilot: same export, six wildly different answers, four prompt iterations, and what it teaches you about asking AI to read your data.

We are building an "Export for AI Analysis" feature for HelioPeak. The idea is simple: tap a button in the app, get a Markdown file with your solar production data plus detailed instructions for an AI assistant, paste it into the chatbot of your choice, and receive an analysis worth more than the sum of its parts. No HelioPeak servers in the loop, no recurring fees, no privacy theatre, just your data and your AI of choice.

Before writing a single line of Swift code for this feature, we wanted to validate the concept on real chatbots. So we built a Python prototype that generates the same export file in four successive versions, each with more refined instructions, and we tested each version on six AI assistants. The export contained two years of daily production data, three full years of yearly aggregates, system metadata, user notes, and a detailed prompt asking the AI to produce a structured 14-section analysis with answers to 39 specific questions.

What we found, frankly, embarrassed us. Not because the AI assistants are bad (some of them are remarkable), but because the gap between the best and the worst output is so wide that two users with the same solar system could come away with completely different conclusions depending on which chatbot they happened to use. Some assistants invented numbers that were not in the data. Others claimed the file was truncated when it was not. One promised a PDF report and never delivered it. Another delivered a PDF but stripped out every trace of design.

This article is the story of that test. It is partly a benchmark, partly a confession about how naively we wrote our first prompt, and partly, we hope, useful to anyone else who is trying to get reliable analysis out of an AI assistant on a non-trivial dataset.

The setup

The dataset under test was a synthetic-but-realistic Belgian 5.7 kWp installation with an east/west panel split and a 5 kW Fronius inverter, operating since April 2018. Daily, monthly, and yearly production records from January 2023 through 23 May 2026 were embedded as JSON blocks inside a Markdown file, along with consumption and grid import/export data, a few user notes, and a handful of Solar Moments achievements. The total file size was approximately 220 kB in the largest tier, roughly 55,000 tokens, well within the comfort zone of any modern frontier model.

The prompt itself was extensive. It asked the AI to produce thirteen analytical sections in a specific order, answer thirty-nine specific questions ranging from "what is the lifetime energy production" to "what would happen to the self-consumption ratio if the household added an EV charging 5 kWh per day", and optionally generate a branded PDF report at the end. The instructions specified the response language (Dutch in our tests), the currency, and explicit rules against fabricating values or extrapolating beyond the data.

We tested six AI assistants on this same file: Anthropic's Claude (via claude.ai), OpenAI's ChatGPT (Plus tier with Code Interpreter), Google's Gemini (Pro tier), Google AI Studio (with code execution enabled), xAI's Grok, and Microsoft's Copilot. In each case the user prompt was identical: a single sentence in Dutch asking the assistant to read the file and follow the instructions inside.

What follows is what each one did. We have organized them from worst to best, because the failure modes are more instructive than the successes.

Copilot: the fabricated error

Microsoft Copilot's response was, by any reasonable measure, a complete failure. But it failed in an interesting way that turned out to be the most useful single data point of the whole experiment.

When given the file, Copilot returned a long, polite paragraph explaining that the export was marked as IsTruncated="true" and that it could only see a small portion of the data. It listed which sections it could see and which it could not, helpfully offered to do a partial analysis with what was available, and asked the user to send the rest of the data in multiple parts.

The problem with this response is that none of it is true. The file is not marked truncated. There is no IsTruncated attribute anywhere in the export. The full file was provided, complete with the explicit ## End of export marker at the bottom. Copilot fabricated the limitation, then fabricated the truncation marker to support its fabrication, then offered a workflow to address the fabricated problem.

This is a textbook example of what researchers call confabulation: an AI generating a plausible-sounding excuse for its own inability to handle a task, dressing the excuse up in technical detail to make it seem authoritative. Copilot did not know how to digest a 220 kB Markdown file with embedded JSON, and rather than say so, it pretended the file was the problem.

What is dangerous about this failure mode is how convincing it sounds. A non-technical user reading Copilot's response would absolutely believe that the export was truncated. They would go back to HelioPeak looking for a setting to make the file smaller, or assume our feature was broken. They would not suspect their AI of inventing a problem that does not exist.

The fix on our side, which we baked into the next version of the prompt, was almost comically blunt. We added a line that says: "This file does NOT contain an IsTruncated attribute. If you write that it does, you are hallucinating." It is a strange thing to have to write in 2026, but here we are.
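The same claim can be checked mechanically before the file is ever pasted into a chatbot. The following is a minimal sketch, not the prototype's actual code: the function name and the sample text are hypothetical, and it assumes the end marker is the last non-empty line of the export.

```python
END_MARKER = "## End of export"  # end marker named in the article


def check_export_integrity(markdown_text: str) -> dict:
    """Report two facts a reader can verify without trusting any chatbot."""
    # Assumption: the end marker is the last non-empty line of the file.
    lines = [line.strip() for line in markdown_text.splitlines() if line.strip()]
    return {
        "has_end_marker": bool(lines) and lines[-1] == END_MARKER,
        "mentions_is_truncated": "IsTruncated" in markdown_text,
    }


# Hypothetical, heavily shortened stand-in for a real export.
sample = "# HelioPeak export\n\n(data blocks omitted)\n\n## End of export\n"
print(check_export_integrity(sample))
```

If the first flag is true and the second is false, any assistant that reports a truncated file is describing something that is not in the text it was given.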

Grok: the confident invention

xAI's Grok produced a competent-looking analysis of the right general shape: an executive summary, headline numbers, year-over-year deltas, seasonal patterns, the lot. The structure was correct. The numbers, in most places, were correct. The issue was in the details, and the details were where Grok started making things up.

In the "top 5 best days" section, Grok listed "2026-05-23: 31.80 kWh (recent record)" as the third-best day in the dataset. This entry does not exist. The actual top-5 best days, which we verified by reading the file ourselves, all fell in June 2024 or June 2025, and the entry for 23 May 2026 in the file showed something much lower than 31.80 kWh. Grok had invented a value that fit the narrative it was constructing.

Elsewhere, Grok claimed the summer/winter production ratio was 3.08, when three of the four other assistants that completed the analysis computed values between 3.79 and 5.08 from the same data and the same prompt. It claimed the self-consumption ratio was "~34-40% (depending on scope)", a range so wide it stops being useful. It told us the lifetime specific yield was 3,005 kWh/kWp, a number that comes from dividing the lifetime production by the system size in kWp. The arithmetic is correct but the concept is wrong: specific yield is a per-year metric, and the lifetime version of that number has no meaningful comparison reference.
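The specific-yield mix-up is easy to reproduce with the figures quoted in this article (5.7 kWp, 17,131 kWh over the export window, 5,091 kWh per full year on average). The variable names below are only illustrative.

```python
SYSTEM_KWP = 5.7            # installed capacity
LIFETIME_KWH = 17_131       # total production over the export window
AVG_FULL_YEAR_KWH = 5_091   # average production per full year

# Grok's figure: the whole multi-year total divided by capacity.
lifetime_per_kwp = LIFETIME_KWH / SYSTEM_KWP

# The conventional metric: one year of production divided by capacity.
annual_specific_yield = AVG_FULL_YEAR_KWH / SYSTEM_KWP

print(f"lifetime total per kWp: {lifetime_per_kwp:.1f} kWh/kWp")       # 3005.4
print(f"annual specific yield:  {annual_specific_yield:.1f} kWh/kWp")  # 893.2
```

Only the second number can be compared against per-year benchmarks; the first grows every year the system runs.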

Most of these errors would be invisible to a non-expert user. The numbers are in the right ballpark, the prose is fluent, and there is nothing flagging that anything is wrong. This is the most dangerous category of AI output: confident, fluent, and partly fabricated. We would much rather have a chatbot say "I can't compute this" than have it generate a plausible-but-wrong answer.

To address this in v0.3 of our prompt, we added what we called the honesty over compliance rule, which spells out in plain language that a partial-but-honest analysis is more useful than a complete-but-fabricated one. We will see in the next iteration whether Grok takes this on board.

ChatGPT: the rushed homework

ChatGPT Plus, with Code Interpreter enabled, did something that surprised us. It actually used Python. It actually parsed the JSON. It actually computed the metrics. It produced the right numbers for almost everything: 17,131 kWh lifetime, 5,091 kWh average per full year, 35.1% self-consumption, 57 clipping days. The financial section even included both perspectives we had asked for: the formula-strict view that produces a misleading negative number, and the "vs no-solar" baseline that shows the real-world benefit.

And then, when it came to actually writing the analysis, ChatGPT went into hurry-up mode. The answers to the 39 specific questions were one-line summaries like "Jaar-op-jaar wijziging berekend" ("year-on-year change calculated") without actually showing the calculated values, even though it had computed them moments earlier. It was as if a student had done the homework correctly but then handed in only a table of contents.

The PDF was worse. ChatGPT generated a four-page document in default Helvetica on white background, with section headers in plain black, no logo, no navy gradient cover, no orange accent color, no hero number tile, no footer styling. Every single one of the five HelioPeak signature design elements we had specified in the prompt was absent. The PDF looked like the output of reportlab's "hello world" example.

This is an interesting failure mode because the model clearly could have done better. The instructions were detailed. The brand colors were spelled out in a table. The logo was embedded as inline SVG. The styling guidance was explicit. ChatGPT simply chose not to invest the effort. Generating a generic PDF is cheaper, in computational terms, than implementing a custom multi-page branded layout with gradients and signature elements, and ChatGPT optimized for cheap.

We addressed this in v0.3 by adding a PDF delivery checklist: a 5-item list of brand-identity elements that must be present before delivery. If any item fails, the prompt instructs the AI to skip the PDF and say so honestly, rather than ship a generic one. Whether this works in practice depends on whether the AI is willing to do the harder work or just refuse the easier one.
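The checklist logic amounts to an all-or-nothing gate. Here is a small sketch of that gate. The five element names follow the signature elements listed in this article, while the function name and the input format are assumptions made for illustration.

```python
# The five signature elements named in the article; the identifiers are illustrative.
REQUIRED_ELEMENTS = (
    "navy_gradient_cover",
    "embedded_logo",
    "orange_accent",
    "hero_number_tile",
    "styled_footer",
)


def pdf_decision(checks: dict) -> str:
    """Return 'deliver PDF' only if every required element is confirmed present."""
    missing = [name for name in REQUIRED_ELEMENTS if not checks.get(name, False)]
    if missing:
        # An honest skip names what is missing instead of shipping a generic PDF.
        return "skip PDF: missing " + ", ".join(missing)
    return "deliver PDF"


generic_pdf = {name: False for name in REQUIRED_ELEMENTS}
branded_pdf = {name: True for name in REQUIRED_ELEMENTS}
print(pdf_decision(generic_pdf))
print(pdf_decision(branded_pdf))
```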

Gemini Pro: from honest skip to full delivery

Google's Gemini Pro produced what we would call a thoroughly competent analysis from the very first round. All thirteen sections present and substantive. All thirty-nine questions answered with concrete numbers where data supported them. The financial section was beautifully done, with both the formula-strict view and the vs-no-solar baseline laid out clearly and labelled with which one represented the homeowner's real benefit. The summer/winter ratio, the specific yield, the self-consumption percentage, all in the same ballpark as our manual reference computation.

The clipping analysis was particularly good. Gemini estimated 50–80 clipping days per year with a financial impact of €15–€25 annually, and added the caveat that an inverter upgrade would not be economically rational at current feed-in prices. That kind of contextual judgment, layered on top of the raw computation, is exactly what we wanted the AI to add to the user's understanding.

In the first three rounds, Gemini consistently and honestly skipped the PDF bonus. The "Capabilities" paragraph at the top would say "PDF generation is limited in this environment, so I am focusing on a full textual analysis and skipping the PDF bonus." That was already the right behaviour: better to deliver an excellent text analysis and honestly skip the PDF than to deliver a great analysis and a substandard PDF.

Then in the fourth round, with the single-source-of-truth instruction in place and the narrative restructure removing the Q&A pressure, something changed. Gemini delivered the PDF too: eleven pages, all five HelioPeak signature elements present (navy gradient cover, embedded logo with intact gradients, orange accent on headers and page numbers, footer on every page, hero number tile), every single number in the PDF matching the markdown analysis to the cent. It even handled the CO₂ subscript typography correctly (something Claude's earlier output had to apologise for). We had not changed anything in our PDF instructions between v0.3 and v0.4; the difference was probably that Gemini's code-execution backend had been updated to be more capable of saving files, or that our restructured prompt simply put less cognitive load on it. Either way, Gemini Pro promoted itself from "honest skip" to a strong second-place finisher.

Google AI Studio: the frustrating near-miss

If you ranked the six chatbots purely on the quality of the text analysis, Google AI Studio would arguably take first place. It was the most thorough by a small margin, with concrete numbers attached to every claim. The clipping estimate was a precise "38 days per year, ~45 kWh, €7.85 lost" rather than a vague range. The E/W string balance analysis (one of our newer questions) gave a specific "48% of peak times before 12:00, 52% after" reading that we had not seen any other model produce. The Solar Moments validation correctly verified that the "10,000 kWh lifetime" milestone on 12 April 2024 was consistent with cumulative production from 2018.

And then, at the end of its response, it wrote: "Ik genereer nu het PDF-bestand 'HelioPeak_Analysis_Report_20260523.pdf' met de navy-gold cover, het logo en alle bovenstaande KPI's. U kunt dit bestand binnen enkele seconden downloaden." Translated: "I am now generating the PDF file 'HelioPeak_Analysis_Report_20260523.pdf' with the navy-gold cover, the logo, and all the above KPIs. You can download this file in a few seconds."

No file appeared. Not in a few seconds, not in a few minutes, not at all. AI Studio cannot deliver file artifacts inside its chat UI (a limitation of the runtime, not the model), but instead of saying so up front in the "Capabilities" section, the model wrote out a description of what the PDF would contain and then nothing happened.

This is a different failure mode than ChatGPT's "cheap PDF" or Gemini's "honest skip". AI Studio promised a PDF and then ghosted. The user is left looking for a download link that does not exist, wondering whether the model is still generating, whether they need to click something, or whether something went wrong on their end. The brilliant analysis that came before is partly undermined by the broken promise that follows.

The fix in v0.3 was to extend the capabilities check from "can you produce PDF files" to "can you produce PDF files AND deliver them as downloadable artifacts in this chat". We will see whether AI Studio respects this distinction in the next round.

Claude: the data detective

Anthropic's Claude produced what we have come to regard as the gold standard for this task. The text analysis was thorough, precise, and well-structured. The PDF was beautifully branded, with the navy gradient cover, the embedded HelioPeak logo, the orange accent color used consistently for section headers and page numbers, the hero number tile in gold gradient. Every one of our five mandatory signature design elements was present.

But the most interesting part of Claude's response was not the analysis itself. It was what Claude did after the analysis. In a section labelled "Reflection as a test of your Tier 1 export", the AI gave us feedback on the export file itself, flagging issues in the data and suggesting improvements to our prompt design. Two of those suggestions led to v0.2 of the prompt: an explicit note about which data array to use for which kind of metric, and the dual financial-perspective requirement. We will come back to a third finding, one that turned out to be a false positive but was instructive anyway, later in this article.

This is the case for using a high-end frontier AI for this kind of work: not just better-formatted output, but actually a better collaborator that pushes back on your data when it has reason to.

The cross-model divergence problem

The single most uncomfortable finding from our first round was how much the numbers diverged across models, even on metrics that should be unambiguous. Here are five metrics computed by the five models that actually attempted the analysis (Copilot excluded since it did not):

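| Metric | Claude | ChatGPT | Gemini Pro | Google AI Studio | Grok |
|---|---|---|---|---|---|
| Summer/winter ratio | 3.08 | 3.79 | ~3.8 | 5.08 | 3.08 |
| Self-consumption % | 34.4 | 35.1 | 35.1 | 34.2 | 34.4 |
| Clipping days | 52 | 57 | 50–80 | 38 | 57 |
| CO₂ → km (petrol car) | 66,668 | 66,668 | 80,000 | 42,827 | 40,000 |
| Specific yield 2023 (kWh/kWp) | 890 | 889.5 | 889.5 | 889.53 | n/a |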

The specific yield numbers are tight because the formula was unambiguous and stated in our instructions: total annual kWh divided by installed kWp. Every model that bothered to compute it got the same answer to within rounding.

The self-consumption numbers are tight for the same reason: the formula was straightforward and the input data was unambiguous.

The other three diverged because we had been sloppy with the definitions. We told the AI to compute the "summer/winter ratio" without specifying whether that meant (sum of June + July + August across all years) divided by (sum of December + January + February across all years), or (average of summer-month totals) divided by (average of winter-month totals), or some other variant. Different models picked different interpretations, and the results varied by a factor of 1.65.
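To see why the wording matters, note that a window from January 2023 through May 2026 contains nine summer months but eleven winter months, so "ratio of sums" and "ratio of monthly averages" cannot agree. The sketch below uses a made-up monthly profile (the kWh values are hypothetical, not taken from the export) purely to show the two readings diverging.

```python
from statistics import mean

# Hypothetical monthly totals in kWh by calendar month; NOT values from the export.
PROFILE = {1: 150, 2: 150, 3: 400, 4: 400, 5: 400, 6: 700,
           7: 700, 8: 700, 9: 400, 10: 400, 11: 400, 12: 150}

# Same window as the article: January 2023 through May 2026.
months = [(y, m) for y in range(2023, 2027) for m in range(1, 13) if (y, m) <= (2026, 5)]
monthly = [{"year": y, "month": m, "kwh": PROFILE[m]} for y, m in months]

summer = [e["kwh"] for e in monthly if e["month"] in (6, 7, 8)]    # 9 entries
winter = [e["kwh"] for e in monthly if e["month"] in (12, 1, 2)]   # 11 entries

ratio_of_sums = sum(summer) / sum(winter)      # interpretation 1
ratio_of_means = mean(summer) / mean(winter)   # interpretation 2

print(len(summer), len(winter))
print(f"ratio of sums:  {ratio_of_sums:.2f}")   # 3.82
print(f"ratio of means: {ratio_of_means:.2f}")  # 4.67
```

Both numbers are defensible answers to "the summer/winter ratio" for identical input, which is the kind of gap an underspecified prompt leaves open.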

Same story for clipping days: we said "count the days where peak power approaches the inverter ceiling" without defining "approaches". Some models used 99% of ceiling, others 95%, others called every day with peak_w ≥ 4900 W a clipping day. Three different threshold choices, three different counts.
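The effect of the threshold choice can be illustrated with a handful of invented peak readings. In this sketch the 5 kW inverter size comes from the setup above, but the daily peak values are hypothetical and chosen only to make the three thresholds disagree.

```python
INVERTER_W = 5000  # 5 kW inverter from the test setup

# Hypothetical daily peak power readings in watts; NOT values from the export.
daily_peak_w = [4700, 4760, 4890, 4905, 4930, 4955, 4960, 4990, 5000, 5000]


def clipping_days(peaks, threshold_w):
    """Count days whose peak power reaches the given threshold."""
    return sum(1 for p in peaks if p >= threshold_w)


counts = {
    "99% of ceiling": clipping_days(daily_peak_w, 0.99 * INVERTER_W),
    "95% of ceiling": clipping_days(daily_peak_w, 0.95 * INVERTER_W),
    "fixed 4900 W": clipping_days(daily_peak_w, 4900),
}
print(counts)  # {'99% of ceiling': 5, '95% of ceiling': 9, 'fixed 4900 W': 7}
```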

And the CO₂ to kilometres conversion depends entirely on what value you use for the emissions of a typical petrol car. We did not specify a value. Models picked anywhere from 0.10 to 0.20 kg CO₂ per kilometre based on whatever they had in their training data, and the equivalence answer varied accordingly.
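A quick calculation shows how much the constant alone moves the answer. It assumes a total of roughly 8,000 kg of avoided CO₂, which is the figure the models converged on in our later runs. The three emission factors are the low end, the EU-average value, and the high end of the range described above.

```python
TOTAL_CO2_KG = 8000  # approximate avoided CO2 for this dataset

# kg CO2 per km for a petrol car: low end, EU-average assumption, high end.
km_by_factor = {factor: round(TOTAL_CO2_KG / factor) for factor in (0.10, 0.12, 0.20)}

for factor, km in km_by_factor.items():
    print(f"{factor:.2f} kg/km -> {km:,} km")
# 0.10 kg/km -> 80,000 km
# 0.12 kg/km -> 66,667 km
# 0.20 kg/km -> 40,000 km
```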

The lesson is harsh but useful: if you want consistent numbers across AI assistants, you cannot just tell them what to compute. You have to tell them exactly how to compute it, down to the constants. We added a section to our export called the "Computation Appendix" which spells out the formula and the constants for every metric where models had diverged. It is twelve formulas long. Six examples:

| Metric | Exact formula |
|---|---|
| Summer/winter ratio | SUM(monthly entries where month in [6,7,8]) / SUM(monthly entries where month in [12,1,2]) |
| Self-consumption % | (total_generated − total_exported) / total_generated × 100, both from yearly array |
| Clipping day count | count of days where peak_w ≥ 0.98 × system.inverterSizeW |
| Clipping lost energy | clipping_day_count × 1.5 kWh (midpoint of typical 1–2 kWh range) |
| CO₂ → km (petrol) | total_co2_kg / 0.120 (assumes 120 g/km, EU petrol average) |
| CO₂ → trees | total_co2_kg / 21 (21 kg/tree/year) |
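Written as code, the six formulas above leave no room for interpretation. The following is a sketch under assumptions: the field names peak_w, total_generated and total_exported come from the appendix, while the function signature, the "month" and "kwh" keys, and all sample values are hypothetical.

```python
def compute_metrics(monthly, yearly, daily, inverter_size_w, total_co2_kg):
    """Apply the six appendix formulas to already-parsed export arrays."""
    summer = sum(m["kwh"] for m in monthly if m["month"] in (6, 7, 8))
    winter = sum(m["kwh"] for m in monthly if m["month"] in (12, 1, 2))
    generated = sum(y["total_generated"] for y in yearly)
    exported = sum(y["total_exported"] for y in yearly)
    clipping_days = sum(1 for d in daily if d["peak_w"] >= 0.98 * inverter_size_w)
    return {
        "summer_winter_ratio": summer / winter,
        "self_consumption_pct": (generated - exported) / generated * 100,
        "clipping_day_count": clipping_days,
        "clipping_lost_kwh": clipping_days * 1.5,   # 1.5 kWh per clipping day
        "co2_km_petrol": total_co2_kg / 0.120,      # 120 g CO2 per km
        "co2_trees": total_co2_kg / 21,             # 21 kg per tree per year
    }


# Tiny hypothetical dataset, only to exercise the formulas.
metrics = compute_metrics(
    monthly=[{"month": 1, "kwh": 100}, {"month": 4, "kwh": 250},
             {"month": 6, "kwh": 300}, {"month": 7, "kwh": 320},
             {"month": 12, "kwh": 100}],
    yearly=[{"total_generated": 5000, "total_exported": 3250}],
    daily=[{"peak_w": 4850}, {"peak_w": 4910}, {"peak_w": 4950}, {"peak_w": 5000}],
    inverter_size_w=5000,
    total_co2_kg=8000,
)
print(metrics)
```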

The fix worked. And then it didn't.

We re-ran the test with the new Computation Appendix in place. The result, on the metrics that previously diverged, was dramatic:

| Metric | v0.2 spread | v0.3 spread | Status |
|---|---|---|---|
| Summer/winter ratio | 3.08 → 5.08 (1.65× variation) | all 3.08 | ✓ Fixed |
| Self-consumption % | 34.2 → 35.1 | all 35.1 | ✓ Fixed |
| CO₂ → km equivalent | 40k → 80k (2× variation) | all ~66,667 | ✓ Fixed |
| CO₂ total kg | divergent | 7999–8000 (rounding only) | ✓ Fixed |
| Clipping days | 38 → 80 (2.1× variation) | 42 / 52 / 60 (1.4× spread) | ⚠️ Partial |

Five of the six tested LLMs now agree, to within rounding, on four of the five contested metrics. The fifth (clipping days) still varies because different models round the threshold differently, but the spread shrank from 2.1× to 1.4×. We could fix that too with even more explicit formula instructions, but at some point the cost in prompt length stops being worth the marginal precision gain.
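The spread figures quoted here are simply the largest reported value divided by the smallest. A short check, using only numbers that appear in this article (the helper name is illustrative):

```python
def spread(values):
    """Ratio between the largest and smallest value reported across models."""
    return max(values) / min(values)


spreads = {
    "summer/winter ratio, before": spread([3.08, 3.79, 3.8, 5.08, 3.08]),
    "CO2 km, before": spread([40_000, 80_000]),
    "clipping days, before": spread([38, 80]),
    "clipping days, after": spread([42, 52, 60]),
}
for name, value in spreads.items():
    print(f"{name}: {value:.2f}x")
# summer/winter ratio, before: 1.65x
# CO2 km, before: 2.00x
# clipping days, before: 2.11x
# clipping days, after: 1.43x
```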

So the structural problem of cross-LLM divergence is essentially solved. But three new failure modes appeared that we had not anticipated, and one of them taught us something fundamental about how we had been thinking about LLM output.

The Q&A trap: even good output can feel like homework

Our prompt asked the AI to answer 39 specific questions. We thought of these as a quality control mechanism: they made sure the analysis covered all the ground we wanted covered, and they gave us something concrete to grade the output against. We did not really think about how the AI would present its answers.

What we got, across every model that handled the task well, was a long "Specific questions answered" section at the end of each analysis, formatted as a numbered Q&A list. Sometimes a hundred lines of "Q1: ... A1: ...". Even Claude, which produced the best analysis overall, gave us this structure.

Reading these reports back, we realised they felt like school examination answers, not like the analysis a consultant would write. The flowing executive summary at the top would smoothly transition into a year-on-year discussion, into seasonal patterns, into best and worst days, and then halt abruptly into a long block of one-line bullet answers. The first half read like an analysis; the second half read like a homework checklist being ticked off.

This was our fault, not the AI's. We had asked for both a 14-section narrative analysis AND a 39-question Q&A, and most AIs delivered exactly what we asked for, which was the wrong thing.

The fix was to integrate the questions directly into the 14 sections, as "topics to cover" lists embedded within each section's prose instructions. So instead of section 7 saying "Anomalies and underperforming periods" and then later question 11 asking "quantify the clipping loss", section 7 now reads:

7. Anomalies, data quality, and inverter behaviour. Prose section. Cover: presence and quantification of inverter clipping (count days where peak_w ≥ 0.98 × inverter_W, estimate energy lost using 1.5 kWh per clipping day, estimate financial impact); presence of any sudden multi-day dips with possible causes; verification that Solar Moments are consistent with the production data …

And the critical rules now explicitly forbid Q&A formatting:

Do NOT format the report as a numbered Q&A list. Do not present answers as "Q1: ... A1: ...", do not echo back the topic-cover lists verbatim, do not collect "skipped questions" at the end. The report must read as a flowing analytical memo.

This is a lesson with a wide application beyond solar data: the structure of your prompt becomes the structure of the output. If you give an AI a numbered checklist, you get a numbered-checklist response. If you give it a narrative brief, you get a narrative. The questions matter, but how you embed them determines whether the report feels like analysis or homework.

The PDF that lied about itself

Another failure mode revealed itself in the second test round, and this one was specific to ChatGPT.

ChatGPT now correctly used Python to compute its analysis. The markdown report it produced was accurate: €444.51 lifetime export revenue, €1,865.67 self-consumption savings, all consistent with our reference computation. And then it generated a PDF.

The PDF said €888.53 export revenue and €1,031 self-consumption savings.

Two different numbers for the same metric, from the same model, in the same chat session, both labelled as definitive. The user opens the markdown report and the PDF side by side and reads two different solar systems' financial pictures, each presented authoritatively. This is worse than no PDF at all. It actively destroys trust.

What happened, almost certainly, is that ChatGPT's code interpreter ran the analysis in one Python session and then started a fresh session to build the PDF, importing different default tariff assumptions. The model is not aware that "the markdown analysis used a €0.04 feed-in tariff and the PDF used €0.08"; both sessions saw a coherent computation, just with different inputs. The model has no memory of having committed to one set of numbers earlier.

We addressed this with an explicit single-source-of-truth instruction in the prompt:

The numbers in the PDF MUST come from the SAME computed values that back the text analysis. They must not be recomputed independently. After the "Compute" phase of your Method workflow, store EVERY metric in a single dict or namespace (e.g. metrics = {...}). The prose writer reads from metrics["lifetime_kwh"]. The PDF builder reads from metrics["lifetime_kwh"]. Never re-derive a number you have already computed.
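In code, the pattern the instruction describes looks roughly like the sketch below. It is an illustration rather than anything a chatbot actually ran. The key lifetime_kwh comes from the instruction, the other keys and both render functions are hypothetical, and the values are figures quoted earlier in this article.

```python
from types import MappingProxyType

# Computed once, then frozen so neither renderer can quietly change a value.
metrics = MappingProxyType({
    "lifetime_kwh": 17_131,
    "self_consumption_pct": 35.1,
    "clipping_days": 57,
})


def render_prose(m):
    """Text analysis: formats values, never recomputes them."""
    return (f"Lifetime production is {m['lifetime_kwh']:,} kWh, "
            f"self-consumption is {m['self_consumption_pct']:.1f}%, "
            f"and {m['clipping_days']} days show clipping.")


def render_pdf_fields(m):
    """PDF builder input: the same values from the same mapping."""
    return {
        "Lifetime production": f"{m['lifetime_kwh']:,} kWh",
        "Self-consumption": f"{m['self_consumption_pct']:.1f}%",
        "Clipping days": f"{m['clipping_days']}",
    }


prose = render_prose(metrics)
pdf_fields = render_pdf_fields(metrics)

# Cheap consistency check: every PDF value must also appear in the prose.
mismatches = [label for label, value in pdf_fields.items() if value not in prose]
print(mismatches)  # []
```

The final check is the useful part in practice: it turns a silent disagreement between the two deliverables into something visible.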

Whether this works on ChatGPT specifically remains to be seen. The instruction is essentially asking the model to manage its own state across two phases of a multi-step task, which is exactly the thing modern LLMs are weakest at. We may need to either accept that ChatGPT's PDFs are unreliable, or give up on the PDF feature for that specific model. Claude, by contrast, handled this perfectly the first time: its 11-page PDF had numbers identical to the markdown analysis throughout, because it actually does maintain a coherent computational state across a long task.

Claude's data detective work, revisited

One observation from Claude's second-round analysis is worth highlighting because it shows both the strength and the failure mode of using an AI as a data detective.

In its analysis, Claude flagged what it called three Solar Moments inconsistencies. The export contained five Solar Moments milestones: "10,000 kWh lifetime" on 12 April 2024, "5,000 kg CO₂ avoided" on 30 September 2024, "Best day of 2024" on 14 June, "7-year installation anniversary" on 15 April 2025, and "20,000 kWh lifetime" on 22 August 2025.

Claude correctly noted that three of these don't add up from the data in the export. The yearly array starts in January 2023, and by 12 April 2024 it shows only 6,467 kWh produced, not the 10,000 the milestone claims. By 30 September 2024, the export shows about 8,500 kWh produced, implying about 4,000 kg CO₂ avoided, not 5,000. By 22 August 2025, the export shows about 14,200 kWh, not 20,000.

This is sharp data forensics. Claude crunched the cumulative production from the yearly array and noticed the discrepancy. We were initially impressed enough to add it to our iOS bug tracker as a candidate defect.

But then we re-read the system profile, which says the installation date is 15 April 2018. The system has been producing solar energy for eight years; the export only contains roughly the last three. The "missing" kilowatt-hours that would make the milestones add up are production from 2018 through 2022, which simply isn't in this export. The Solar Moments are reading from a longer history (most likely PVOutput's own lifetime total) while the yearly array is the subset HelioPeak has cached.

This is not a bug. This is correct behaviour, just under-documented. The user sees a milestone "20,000 kWh lifetime" on their phone because their PVOutput system has indeed crossed 20,000 kWh; it just hasn't happened entirely within the period this export window covers.

The lesson here is double-edged. On the one hand, Claude's ability to find this kind of inconsistency in seconds is incredibly valuable. It's exactly the data-quality alarm an exporter should hear, even if in this case it turned out to be a false positive. On the other hand, Claude was confident in its diagnosis, and a less-careful developer (us, at first) would have spent hours hunting for a non-existent timezone bug in iOS code. AI is a brilliant data detective, but it doesn't know what it doesn't know: it cannot see beyond the data you give it. The user's install date was right there in the export header; Claude just didn't connect it to the conclusion.

We addressed this by adding an explicit note to the export header ("Solar Moments scope: milestones may reference cumulative production from before the yearly-array start date") and an explicit rule in the critical rules section ("only flag a milestone as inconsistent if the system was installed AFTER the yearly array begins"). Future Claude runs should skip this false alarm.
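The rule reduces to a date comparison that runs before any arithmetic. The sketch below is hypothetical code, not part of the export or the prompt. The install date, milestone, and cumulative figure come from this article, and the array start is taken to be 1 January 2023. Treating an install date that falls exactly on the array start as fully covered is an assumption.

```python
from datetime import date


def should_flag_milestone(install_date, array_start, claimed_kwh, cumulative_kwh_in_export):
    """Flag a lifetime milestone only when the export covers the system's whole life."""
    if install_date < array_start:
        # Production before the array start is invisible here, so a shortfall proves nothing.
        return False
    return cumulative_kwh_in_export < claimed_kwh


# The case from the article: installed 2018, array starts 2023, 6,467 kWh visible.
false_alarm = should_flag_milestone(date(2018, 4, 15), date(2023, 1, 1), 10_000, 6_467)

# Hypothetical counter-case: same numbers, but installed after the array begins.
real_issue = should_flag_milestone(date(2023, 3, 1), date(2023, 1, 1), 10_000, 6_467)

print(false_alarm, real_issue)  # False True
```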

Capability vs effort: the real spectrum

Before this experiment we naively divided AI assistants into "good ones" (presumed to do well on detailed tasks) and "bad ones" (presumed to do poorly). What we actually found was a two-dimensional spectrum: capability versus effort.

Some models have high capability but low willingness to invest effort. ChatGPT in our test was a clear example: the Code Interpreter is genuinely powerful, the model can clearly understand a complex prompt, and yet the actual delivered output felt rushed and incomplete. The model chose to do less work than it could.

Other models have lower raw capability but invest more effort relative to their ceiling. Gemini Pro felt like this: not always the cleverest, but consistently honest about what it could and could not do, and consistently willing to write out the full structure when asked.

A small number of models score high on both axes. Claude in our test felt like a willing collaborator that also happened to be sharp. The fact that it gave us unsolicited critique of the export file itself was a tell. That is what a skilled, engaged colleague does.

And then there are the bottom-left-quadrant cases: low capability, low effort, masking both with confident prose. Copilot and Grok in our test both fit this pattern, though they fail differently. Copilot fabricates external excuses ("the file is truncated"), while Grok fabricates internal substance (a top-day that does not exist).

Our current recommendation

If you are going to use HelioPeak's "Export for AI Analysis" feature when it ships, our ranking as of 23 May 2026, after four rounds of testing with the final v0.4 prompt, is:

  1. Claude (at claude.ai): clear best overall. Gold-standard text analysis, immaculate branded PDF, finds real data-quality issues in the export itself. If you only try one assistant, try this one.
  2. Gemini Pro: strong second. Solid narrative analysis and honest about its limits in earlier rounds; in the final round it delivered a fully branded PDF with consistent numbers. A real alternative to Claude.
  3. Google AI Studio: best textual depth on the data, but cannot deliver files inside the chat interface. Useful if you want to copy/paste the analysis but not if you need a PDF.
  4. ChatGPT Plus with Code Interpreter: correct numbers in the analysis but produces PDFs whose numbers do not always match the analysis. Workable if you only need the text.
  5. Grok: competent-looking output but verify the numbers yourself. We saw fabricated values in our tests.
  6. Copilot: not currently suitable for this task. It claims the file is truncated when it is not, and offers a fix for a problem that does not exist.

If you only have time to try one assistant, our concrete advice is to start with Claude at claude.ai. The combination of analytical depth, willingness to push back on data quality issues, and reliably branded PDF output makes it the clear leader in our tests. Gemini Pro is a credible alternative, especially after its improvement in the final round. The others have their strengths, but for this particular task (structured analysis of a moderately large dataset with a branded PDF deliverable), the gap between Claude and the rest is real.

Important caveat: this is a snapshot in time. AI assistants change weekly. The model behind ChatGPT today may be a different model in a month, with different strengths. Anthropic, OpenAI, Google, xAI and Microsoft all push major upgrades on cycles of weeks to quarters. By the time you read this, the ranking may have shifted, sometimes by a lot. If you care about getting the best output, it's worth retesting your favourite assistant every few months on a known reference task.

Our ranking is also specific to this particular task: structured analysis of a moderately large, well-formatted dataset with explicit instructions. For chat conversation, code generation, creative writing, or web research, the ordering would almost certainly be different.

What this means for prompt design

Four rounds of iteration changed our prompt substantially. Each change was driven by a specific failure mode we observed:

First, we made the prompt aggressive about capability checks. The very first thing we now ask the AI to do is verify that it received the complete file (by looking for our explicit end marker), confirm whether it has code execution, and confirm whether it can both generate AND deliver PDF files. This catches Copilot-style confabulation early and forces models like Google AI Studio to declare upfront that they cannot deliver files.
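
To make the end-marker idea concrete, here is a minimal Python sketch of how an export could be closed with a marker and how a pasted copy could be checked for completeness. The marker text and the function names are hypothetical, chosen for illustration; they are not the actual HelioPeak export format.

```python
# Hypothetical marker text; the real export's marker is not shown in this article.
END_MARKER = "<!-- END OF HELIOPEAK EXPORT -->"


def finalize_export(body: str) -> str:
    """Append the end marker as the very last line of the export."""
    return body.rstrip("\n") + "\n\n" + END_MARKER + "\n"


def is_complete(received: str) -> bool:
    """Return True only if the marker is the last non-blank line received."""
    lines = [line for line in received.splitlines() if line.strip()]
    return bool(lines) and lines[-1].strip() == END_MARKER


# Made-up sample body, just to exercise the check.
export = finalize_export("# Solar export\n\n| date | kWh |\n|---|---|\n| 2024-01-01 | 3.2 |")
truncated = export[: len(export) // 2]  # simulate a paste that was cut off
```

Here `is_complete(export)` holds and `is_complete(truncated)` does not. Asking the assistant to quote the marker back gives you the same check from the other side of the conversation: a model that claims truncation while the marker is visibly present is confabulating.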

Second, we made the prompt aggressive about computation, not estimation. The original prompt politely suggested that the AI "use Python if available". The current prompt explicitly says: never approximate or ballpark a number when you can calculate it. The data in this file is precise; your analysis should be too. We backed this up with explicit formulas in a Computation Appendix, so there is no ambiguity about how to compute things.

Third, we added an honesty rule: better a partial-but-honest analysis than a complete-but-fabricated one. This is the rule that, if it works, should suppress Grok-style invention and turn ChatGPT-style rushed PDFs into honest skips.

Fourth, we restructured the prompt so the 39 specific questions are embedded as topics within the 14 narrative sections, rather than a separate Q&A list at the end. This was the lesson with the broadest application: the structure of your prompt becomes the structure of the output. If you want a flowing analysis, you have to ask for a flowing analysis, not a checklist.

And fifth, we added a single-source-of-truth rule for the PDF: every number in the PDF must come from the same computed dict that backs the markdown analysis, never recomputed. This is essentially asking the AI to manage state across two phases of a complex task, which is genuinely difficult for current models but at least makes the failure mode visible when it happens.
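
The single-source-of-truth rule can be sketched in a few lines of Python. This is an illustration of the pattern we ask the assistant to follow, not the code any assistant actually ran; the metric names and sample values are made up.

```python
from statistics import mean


def compute_metrics(daily_kwh):
    """Compute every reported number exactly once and return them in one dict."""
    return {
        "total_kwh": round(sum(daily_kwh), 1),
        "mean_daily_kwh": round(mean(daily_kwh), 2),
        "best_day_kwh": max(daily_kwh),
    }


def render_markdown(metrics):
    """Markdown summary; reads values from the dict, never recomputes them."""
    return (
        f"- Total production: {metrics['total_kwh']} kWh\n"
        f"- Mean daily production: {metrics['mean_daily_kwh']} kWh\n"
        f"- Best day: {metrics['best_day_kwh']} kWh"
    )


def render_pdf_lines(metrics):
    """Text lines destined for the PDF; same dict, same formatted values."""
    return [
        f"Total production: {metrics['total_kwh']} kWh",
        f"Mean daily production: {metrics['mean_daily_kwh']} kWh",
        f"Best day: {metrics['best_day_kwh']} kWh",
    ]


metrics = compute_metrics([10.0, 12.5, 7.5])  # made-up sample values
markdown_report = render_markdown(metrics)
pdf_lines = render_pdf_lines(metrics)
```

Because both renderers only read from `metrics`, a mismatch between the text and the PDF can only come from a rendering bug, not from a second, slightly different calculation.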

None of this would have been visible to us without running the test four times, with the same dataset, the same six models, and progressively more careful instructions. The first version of our prompt was, by any reasonable internal review, a thorough and well-considered instruction set. It was only when we saw six different AI assistants produce six different outputs of wildly varying quality, then watched those outputs improve with each prompt iteration, that we understood how much of the prompt was implicit assumption rather than explicit instruction.

The broader lesson, for anyone using AI on real data

If you are not building an app and you just want to occasionally drop a CSV into ChatGPT and ask it to summarize, this article is probably not for you. The 39-topic, branded-PDF, multi-tier prompt we built for HelioPeak is overkill for casual analysis. But there are four principles from our test that we think generalize to any serious use of AI on real data.

Principle one: tell the AI exactly what to compute, including the constants. "What is the carbon impact of my solar production" is too open. "Compute total CO₂ avoided using a factor of 0.467 kg per kWh, then express in km-driven equivalent assuming 120 g CO₂ per km, in mature-tree equivalent assuming 21 kg absorbed per tree per year, and in household-year equivalent assuming 1635 kg CO₂ per EU household per year" will give you the same answer from any model.
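
Spelled out as code, the fully specified version of that question looks like the Python sketch below. The four constants are the ones quoted above; the function name, the dictionary keys, and the reading of the tree and household figures as "years of absorption or emissions" are our assumptions for illustration.

```python
# Constants quoted in the prompt text above.
CO2_KG_PER_KWH = 0.467             # kg CO2 avoided per kWh produced
CO2_G_PER_KM = 120                 # g CO2 emitted per km driven
CO2_KG_PER_TREE_YEAR = 21          # kg CO2 absorbed per mature tree per year
CO2_KG_PER_HOUSEHOLD_YEAR = 1635   # kg CO2 per EU household per year


def carbon_equivalents(total_kwh: float) -> dict:
    """Convert total production (kWh) into CO2 avoided and its equivalents."""
    co2_kg = total_kwh * CO2_KG_PER_KWH
    return {
        "co2_avoided_kg": co2_kg,
        "km_driven": co2_kg * 1000 / CO2_G_PER_KM,  # kg -> g before dividing
        "tree_years": co2_kg / CO2_KG_PER_TREE_YEAR,
        "household_years": co2_kg / CO2_KG_PER_HOUSEHOLD_YEAR,
    }


# Hypothetical production figure, not taken from the test dataset.
result = carbon_equivalents(10_000)
```

For the hypothetical 10,000 kWh, this yields 4,670 kg of CO₂ avoided. With the constants fixed, there is nothing left for a model to guess, which is why different assistants converge on the same figures.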

Principle two: force the AI to declare its capabilities up front. If you don't, models will often try to do tasks they can't actually deliver on, or refuse tasks they could handle. The pre-check sets expectations on both sides.

Principle three: build in an honesty rule. Tell the AI directly that you would rather have a partial honest answer than a complete fabricated one. It does not stop all hallucination, but in our tests it noticeably reduced the rate at which models invented values to fill gaps.

Principle four: the structure of your prompt becomes the structure of the output. If you ask for a 14-section narrative AND a 39-question Q&A, you will get both, and the Q&A will sit awkwardly at the bottom feeling like homework. If you want flowing analysis, embed your specific questions inside the narrative sections as "topics to cover" rather than as a separate list. The questions still drive the rigour; they just disappear into the prose where they belong.
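
One way to apply this when generating a prompt programmatically is to keep the questions grouped by section and emit them as topics under each section heading. The Python sketch below assumes a simple mapping from section title to questions; the section titles, the questions, and the function name are invented for illustration and are not the actual HelioPeak prompt.

```python
def build_sections_prompt(sections: dict) -> str:
    """Render each section with its questions embedded as topics to cover."""
    parts = []
    for number, (title, topics) in enumerate(sections.items(), start=1):
        parts.append(f"Section {number}: {title}")
        parts.append("Write this section as flowing prose. Topics to cover:")
        parts.extend(f"- {topic}" for topic in topics)
        parts.append("")  # blank line between sections
    return "\n".join(parts).rstrip()


# Hypothetical sections and questions.
sections = {
    "Production overview": [
        "What was the total production over the period?",
        "Which month produced the most?",
    ],
    "Environmental impact": [
        "How much CO2 was avoided?",
    ],
}
prompt = build_sections_prompt(sections)
```

The generated prompt contains no standalone question list, so the model has no checklist to reproduce at the end of its answer.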

None of these principles are revolutionary. They show up in academic papers on prompt engineering, in OpenAI's and Anthropic's own prompting guides, in blog posts from people who do this for a living. What our experiment added was the empirical evidence that yes, on a real task with real data, applying these principles materially changed the output quality from multiple models.

What's next

The "Export for AI Analysis" feature will ship in a future HelioPeak release, after the iOS code is written, tested, and reviewed by Apple. By the time you read this it may already be live; check the release notes page for the current status.

When it ships, it will produce the same self-contained Markdown file we have been testing here. You will be able to paste it into any AI assistant of your choice. We will publish an FAQ entry with our current model recommendations and a note that the recommendations may shift over time. We are not going to lock you into one assistant. We are just going to give you the most useful data export we know how to design, and let you choose where to send it.

We are also going to keep running this test. Every quarter or so, we will re-run the same export through the latest version of each major AI assistant, see how the rankings have changed, and update the article. AI is moving too fast for a single snapshot to remain useful for long. The methodology, the dataset, and the prompt are stable; the assistants are not. That is what makes the benchmark interesting.

If we have a single takeaway from four rounds of iteration over a single weekend, it is this: AI assistants are extraordinarily powerful tools that fail in extraordinarily specific ways. The way you address those failures is not by switching tools or giving up; it is by tightening your instructions until each failure mode disappears. The Computation Appendix solved the divergence between models' numbers. The capability check solved confabulation (mostly). The narrative restructure solved homework-format output. The single-source-of-truth rule started solving the PDF mismatch. Each fix is small, but together they turn an unreliable tool into a useful collaborator.

If you are building something similar and you would like to compare notes on prompt design for structured-data analysis, you know where to find us.