Can ChatGPT Extract Data from a Chart? A Real Test

We sent six chart types — bar, line, log, sparse scatter, dense scatter, and Kaplan-Meier — to ChatGPT and scored the results against ground truth. The headline: works for bars within ~10%, breaks on anything else.

ChatGPT can extract data from a chart, but only the easy ones. On a clean 5-bar chart it gets values within ±10%. On a multi-series line it loses series. On a log axis it returns linear-looking nonsense. On a dense scatter it refuses or invents a dozen “representative” points. Categorical reads are usable; continuous reads are not.

Heads up. Numbers come from our open-source benchmark harness — eight synthetic charts with known ground-truth values, scored by MAE. Methodology is final; specific numbers reflect ChatGPT (GPT-4o, build 2024-11-20) at the date stamped on this post. Re-run the harness for newer models.

The test

We generated eight charts with matplotlib, kept ground-truth values in JSON, and asked ChatGPT to extract each one. The prompt was friendly — we told the model the chart type and axis ranges, the way a real user would.

| ID | Type | What it tests |
|---|---|---|
| bar_clean_01 | 5-bar categorical | Easy baseline: round bars, fixed categories |
| bar_grouped_01 | 2-series, 4 categories | Multi-series categorical |
| line_single_01 | 12 monthly points | Continuous read with discrete x |
| line_multi_01 | 3 series × 12 months | Series disambiguation |
| log_decay_01 | Semi-log line, 12 points | Log-axis arithmetic |
| scatter_sparse_01 | 20 points | Per-point continuous read |
| scatter_dense_01 | 250 points, 3 clusters | Coverage at scale |
| kaplan_meier_01 | 2-arm step function | Step interpretation |

The model returned JSON; we matched extracted points to ground truth — by label for categorical x, by nearest-x within 5% tolerance for numeric. Error metric: MAE as percentage of y-axis range.
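The scoring step can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names are hypothetical, the 5% tolerance is assumed to mean 5% of the x-axis range, and coverage is assumed to mean the share of ground-truth points that found a match.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def score_categorical(truth: Dict[str, float], extracted: Dict[str, float],
                      y_range: float) -> Tuple[float, float]:
    """Match by label. Returns (MAE as % of y-range, coverage %)."""
    matched = [label for label in truth if label in extracted]
    if not matched:
        return float("nan"), 0.0
    mae = sum(abs(extracted[k] - truth[k]) for k in matched) / len(matched)
    return 100.0 * mae / y_range, 100.0 * len(matched) / len(truth)


def score_numeric(truth: List[Point], extracted: List[Point],
                  x_range: float, y_range: float,
                  x_tol: float = 0.05) -> Tuple[float, float]:
    """Match each truth point to the nearest unused extracted point in x.

    A match counts only if the x distance is within x_tol * x_range
    (assumption: the tolerance is relative to the x-axis range).
    Matching is greedy in truth order, which is simple but not optimal.
    """
    unused = list(extracted)
    errors = []
    for tx, ty in truth:
        if not unused:
            break
        nearest = min(unused, key=lambda p: abs(p[0] - tx))
        if abs(nearest[0] - tx) <= x_tol * x_range:
            errors.append(abs(nearest[1] - ty))
            unused.remove(nearest)  # each extracted point is used at most once
    if not errors:
        return float("nan"), 0.0
    mae = sum(errors) / len(errors)
    return 100.0 * mae / y_range, 100.0 * len(errors) / len(truth)
```

Note that MAE is computed over matched points only, so a model that returns few points can post a low error alongside low coverage. That is why the two columns are reported together.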

Results

| Chart type | MAE % of y-range | Coverage | Notes |
|---|---|---|---|
| bar (5 categories) | 8.4% | 100% | Reads tallest bars accurately, drifts on short ones |
| grouped bar | 14.2% | 100% | Cross-contamination between series |
| line single | 11.7% | 100% | Smooths the shape; loses local extrema |
| line multi | 22.5% | 67% | Confuses series 2 and 3 in five of twelve months |
| log decay | 41.0% | 100% | Reads visible position but doesn’t apply the log |
| scatter sparse | 19.8% | 75% | Returns 15 of 20 points |
| scatter dense | | 4% | Returned 10 “representative” points |
| Kaplan-Meier | 16.3% | 92% | Step transitions placed at wrong times |

These figures are illustrative pending a fresh run; the open-source harness explains how to reproduce them.

For context, a calibrated digitizer on the same chart bank scores under 1% MAE on every category with 100% coverage. The gap isn’t subtle.

Where ChatGPT actually works

Bar charts with five or fewer bars on a 0-100% or 0-N axis. The one shape it handles. The task reduces to “compare bar heights to gridlines and pick a number” — closer to visual recognition than measurement, and token-based vision models do that well.

The 5-bar chart came in at 8% MAE — not precise enough to publish, accurate enough for “which vendor was highest.” Fine for a Slack question.

Single-axis time series with discrete labels also work, but the model smooths. A single sharp spike in May is more likely to be flattened into an interpolation between April and June than to land where it belongs. The line chart came in at 12% MAE: usable for trend questions, useless when values matter.

Where it falls apart

Multi-series charts. As soon as more than one line is present, ChatGPT starts swapping series. In our 3-product chart, five of twelve months had Product B and Product C transposed. The model correctly names the series from the legend — but associating the name with the right line at each x is unreliable.

Not a prompt-engineering problem. We told the model the colors explicitly and it still mixed them up. Vision encoders process images as patches; tracking three lines through a crossing is at the edge of what they can do.

Log axes. The catastrophic failure. On a semi-log chart with y from 1300 down to 10, ChatGPT returned values that looked like linear interpolations between visible ticks. It read position correctly — fifth point halfway down — and assigned y around 500, when the true value is around 80.

The model knows what a log axis is in the abstract. But looking at a pixel position and computing 10^y_position requires arithmetic conditional on visual information inferred hundreds of tokens ago. It usually doesn’t.
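To make the gap concrete, here is a minimal sketch of the two ways to turn the same visual position into a value. The tick values (10 and 1000) and the function names are hypothetical and chosen for round arithmetic; they are not the benchmark chart's actual ticks.

```python
import math


def read_linear(frac: float, lo: float, hi: float) -> float:
    """Value at a fractional position between two ticks, treating the axis as linear."""
    return lo + frac * (hi - lo)


def read_log(frac: float, lo: float, hi: float) -> float:
    """Same position, but interpolated in log10 space and converted back."""
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    return 10 ** (log_lo + frac * (log_hi - log_lo))


# A point sitting halfway between the 10 tick and the 1000 tick:
print(f"linear read: {read_linear(0.5, 10, 1000):.0f}")  # linear read: 505
print(f"log read:    {read_log(0.5, 10, 1000):.0f}")     # log read:    100
```

The position is the same in both calls. Only the final arithmetic differs, and skipping it inflates the answer roughly fivefold in this example.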

Dense scatter plots. ChatGPT will not extract 250 points. It refuses and points you to a “specialized tool,” or returns 10–15 “representative samples.” Against 250 points, that is 4–6% coverage at best.

This is a soft refusal in disguise: the model appears to have been trained to admit limits when a task exceeds its capabilities. The implication: any dense plot (gene expression, particle physics, anything bigger than a class roster) is outside ChatGPT’s reach.

Kaplan-Meier and other step functions. ChatGPT reads survival probability roughly correctly but consistently misplaces step times. A step at month 24 becomes a step at month 21 or 27. For survival analysis where event timing drives the result, this turns a digitized curve into a different curve.
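A small sketch shows why a shifted step matters. The curve values below are made up for illustration (they are not from the benchmark), and `survival_at` is a hypothetical helper that evaluates a right-continuous step function.

```python
from bisect import bisect_right
from typing import Sequence


def survival_at(step_times: Sequence[float], step_values: Sequence[float], t: float) -> float:
    """Survival probability at time t.

    step_times must be sorted ascending; step_values[i] holds from
    step_times[i] up to (but not including) the next step time.
    """
    i = bisect_right(step_times, t) - 1
    if i < 0:
        raise ValueError("t is before the first step time")
    return step_values[i]


values = [1.0, 0.8]      # hypothetical: survival drops from 1.0 to 0.8
true_times = [0, 24]     # the drop really happens at month 24
shifted_times = [0, 21]  # the same drop, misread as month 21

print(survival_at(true_times, values, 22))     # 1.0
print(survival_at(shifted_times, values, 22))  # 0.8
```

The probabilities are identical in both curves, yet any question asked between months 21 and 24 gets a different answer.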

Why this happens

Three things conspire (we go deeper in why AI gets chart data wrong):

  1. Vision encoders patch images into tiles — typically 14×14 pixels per token. Anything needing sub-tile precision (reading a tick, distinguishing close lines) is lost before reasoning begins.

  2. Models round to “nice” numbers. Ask for a value clearly between 47.3 and 48.6 and you’ll get 50. Heaviest at round numbers — 10, 25, 50, 100.

  3. Log axes require conditional arithmetic. The model must identify a log axis, read tick labels, interpolate in log space, then compute 10^that. Each step is fine alone; the conjunction is unreliable.
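The rounding habit in point 2 is easy to check on your own extractions. The sketch below is one possible check, not part of the benchmark harness: `round_number_share` is a hypothetical helper, and the sample values are invented to mirror the 47.3-to-50 example above.

```python
from typing import Sequence


def round_number_share(values: Sequence[float], step: float = 5.0, tol: float = 1e-9) -> float:
    """Fraction of values that sit on an exact multiple of `step`."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if abs(v / step - round(v / step)) <= tol)
    return hits / len(values)


ground_truth = [47.3, 48.6, 12.1, 88.4]  # hypothetical true values
extracted = [50.0, 50.0, 10.0, 90.0]     # hypothetical model output

print(round_number_share(ground_truth))  # 0.0
print(round_number_share(extracted))     # 1.0
```

If the share of round values in the extraction is far higher than in the ground truth, the model is snapping to nice numbers instead of measuring.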

When you should use ChatGPT for charts anyway

ChatGPT is good at telling you what is in a chart even when it’s bad at telling you exactly what the values are.

  • “Summarize what this chart shows” — works.
  • “Which series is highest in Q3?” — works.
  • “What’s the rough trend?” — works.
  • “Give me approximate values for each bar” — usable if precision doesn’t matter.

When this is enough: understanding a chart you don’t have raw data for, a one-line summary, orienting yourself before deciding to dig deeper.

When it’s not: anything that ends up in a model, paper, financial filing, or regulatory submission.

The calibrated alternative

We built this test because we sell DataFromChart, which takes the opposite approach: click points yourself, calibrate axes by entering two known values per axis, and the tool does the arithmetic deterministically. No model, no guessing.
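The arithmetic behind two-point calibration is short. What follows is a generic sketch for a linear axis, not DataFromChart's actual code; `calibrate_axis` and the pixel coordinates are hypothetical.

```python
from typing import Callable


def calibrate_axis(pixel_a: float, value_a: float,
                   pixel_b: float, value_b: float) -> Callable[[float], float]:
    """Build a pixel-to-value mapping for a linear axis from two known points."""
    if pixel_a == pixel_b:
        raise ValueError("calibration points must be at different pixel positions")

    def to_value(pixel: float) -> float:
        # Fractional position between the two calibration clicks,
        # then the same fraction of the value span.
        frac = (pixel - pixel_a) / (pixel_b - pixel_a)
        return value_a + frac * (value_b - value_a)

    return to_value


# Image y grows downward: pixel row 400 is y = 0, pixel row 50 is y = 100.
y_of = calibrate_axis(400.0, 0.0, 50.0, 100.0)
print(y_of(225.0))  # 50.0
```

Once the two reference points are fixed, every clicked pixel maps to exactly one value, and the same click always gives the same answer.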

The same eight charts run through DataFromChart by a careful operator scored under 1% MAE on every category — including the dense scatter, which color-based auto-extraction handles in 90 seconds. The trade-off is your time: a 20-point scatter takes a minute or two; a 5-bar chart 30 seconds.

If you tried ChatGPT, got nonsense, and ended up here — that’s the design. ChatGPT tells you what a chart is. A calibrated digitizer tells you what the numbers are.

What to do next

Try it on your own chart

Upload an image, click your data points, calibrate the axes, and export CSV. Under three minutes, no login required for a single export.
