ChatGPT can extract data from charts, but only the easy ones. On a clean 5-bar chart it gets values within ±10%. On a multi-series line chart it loses or swaps series. On a log axis it returns linear-looking nonsense. On a dense scatter it refuses or invents a dozen “representative” points. Categorical reads are usable; continuous reads are not.
Heads up. Numbers come from our open-source benchmark harness — eight synthetic charts with known ground-truth values, scored by MAE. Methodology is final; specific numbers reflect ChatGPT (GPT-4o, build 2024-11-20) at the date stamped on this post. Re-run the harness for newer models.
The test
We generated eight charts with matplotlib, kept ground-truth values in JSON, and asked ChatGPT to extract each one. The prompt was friendly — we told the model the chart type and axis ranges, the way a real user would.
| ID | Type | What it tests |
|---|---|---|
| bar_clean_01 | 5-bar categorical | Easy baseline: round bars, fixed categories |
| bar_grouped_01 | 2-series, 4 categories | Multi-series categorical |
| line_single_01 | 12 monthly points | Continuous read with discrete x |
| line_multi_01 | 3 series × 12 months | Series disambiguation |
| log_decay_01 | Semi-log line, 12 points | Log-axis arithmetic |
| scatter_sparse_01 | 20 points | Per-point continuous read |
| scatter_dense_01 | 250 points, 3 clusters | Coverage at scale |
| kaplan_meier_01 | 2-arm step function | Step interpretation |
The model returned JSON; we matched extracted points to ground truth — by label for categorical x, by nearest-x within 5% tolerance for numeric. Error metric: MAE as percentage of y-axis range.
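To make the scoring rule concrete, here is a minimal sketch of how matching and MAE could be computed. It is not the harness's actual code: the function name `score_extraction`, the greedy nearest-x matching, and the reading of "5% tolerance" as 5% of the ground-truth x-span are all assumptions made for illustration.

```python
def score_extraction(truth, extracted, y_axis_min, y_axis_max, x_tol_frac=0.05):
    """Score extracted (x, y) pairs against ground truth.

    truth, extracted: lists of (x, y); x is a str label or a number.
    Returns (MAE as % of the y-axis range, coverage as a fraction).
    Hypothetical helper, not the benchmark harness's real API.
    """
    y_range = y_axis_max - y_axis_min
    errors = []
    if isinstance(truth[0][0], str):
        # Categorical x: match by exact label.
        by_label = dict(extracted)
        for label, y_true in truth:
            if label in by_label:
                errors.append(abs(by_label[label] - y_true))
    else:
        # Numeric x: greedy nearest-x match within a tolerance
        # (assumed here to be a fraction of the ground-truth x-span).
        xs = [x for x, _ in truth]
        tol = x_tol_frac * (max(xs) - min(xs))
        unused = list(extracted)
        for x_true, y_true in truth:
            if not unused:
                break
            nearest = min(unused, key=lambda p: abs(p[0] - x_true))
            if abs(nearest[0] - x_true) <= tol:
                errors.append(abs(nearest[1] - y_true))
                unused.remove(nearest)
    coverage = len(errors) / len(truth)
    mae_pct = 100 * sum(errors) / len(errors) / y_range if errors else None
    return mae_pct, coverage


# Made-up values, for illustration only.
bar_mae, bar_cov = score_extraction(
    [("A", 10.0), ("B", 50.0)], [("A", 12.0), ("B", 46.0)], 0, 100
)
line_mae, line_cov = score_extraction(
    [(0, 0.0), (10, 10.0), (20, 20.0)], [(0.5, 1.0), (19.8, 18.0)], 0, 20
)
```

Note that MAE is averaged over matched points only, so a model that returns few points can post a deceptively low error. That is why coverage is reported alongside it.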
Results
| Chart type | MAE % of y-range | Coverage | Notes |
|---|---|---|---|
| bar (5 categories) | 8.4% | 100% | Reads tallest bars accurately, drifts on short ones |
| grouped bar | 14.2% | 100% | Cross-contamination between series |
| line single | 11.7% | 100% | Smooths the shape; loses local extrema |
| line multi | 22.5% | 67% | Confuses series 2 and 3 in five of twelve months |
| log decay | 41.0% | 100% | Reads visible position but doesn’t apply the log |
| scatter sparse | 19.8% | 75% | Returns 15 of 20 points |
| scatter dense | — | 4% | Returned 10 “representative” points |
| Kaplan-Meier | 16.3% | 92% | Step transitions placed at wrong times |
These figures are illustrative pending a fresh run; the open-source harness documents how to reproduce them.
For context, a calibrated digitizer on the same chart bank scores under 1% MAE on every category with 100% coverage. The gap isn’t subtle.
Where ChatGPT actually works
Bar charts with five or fewer bars on a 0-100% or 0-N axis. This is the shape it handles best. The task reduces to “compare bar heights to gridlines and pick a number,” which is closer to visual recognition than measurement, and token-based vision models do that well.
The 5-bar chart came in at 8% MAE — not precise enough to publish, accurate enough for “which vendor was highest.” Fine for a Slack question.
Single-axis time series with discrete labels also work, but the model smooths. A single sharp spike in May is more likely to be flattened into an interpolation between April and June than to land where it belongs. The line chart came in at 12% MAE: usable for trend questions, useless when values matter.
Where it falls apart
Multi-series charts. As soon as more than one line is present, ChatGPT starts swapping series. In our 3-product chart, five of twelve months had Product B and Product C transposed. The model correctly names the series from the legend — but associating the name with the right line at each x is unreliable.
Not a prompt-engineering problem. We told the model the colors explicitly and it still mixed them up. Vision encoders process images as patches; tracking three lines through a crossing is at the edge of what they can do.
Log axes. The catastrophic failure. On a semi-log chart with y from 1300 down to 10, ChatGPT returned values that looked like linear interpolations between visible ticks. It read position correctly — fifth point halfway down — and assigned y around 500, when the true value is around 80.
The model knows what a log axis is in the abstract. But looking at a pixel position and computing 10^y_position requires arithmetic conditional on visual information inferred hundreds of tokens ago. It usually doesn’t.
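The arithmetic the model skips is short. The sketch below compares a linear read and a log read of the same vertical position, using the 1300-to-10 axis from our chart and a point exactly halfway down. The function names are illustrative, and the outputs will not match the values quoted above exactly, because "halfway" is approximate and the real tick positions are not reproduced here.

```python
import math


def read_linear(frac_down, y_top, y_bottom):
    """Value at a fractional position (0 = top, 1 = bottom) on a linear axis."""
    return y_top + frac_down * (y_bottom - y_top)


def read_log(frac_down, y_top, y_bottom):
    """Value at the same position on a log10 axis: interpolate the exponents."""
    log_top, log_bottom = math.log10(y_top), math.log10(y_bottom)
    return 10 ** (log_top + frac_down * (log_bottom - log_top))


linear_guess = read_linear(0.5, 1300, 10)  # arithmetic midpoint of the axis
log_value = read_log(0.5, 1300, 10)        # geometric midpoint of the axis

print(f"linear read: {linear_guess:.0f}, log read: {log_value:.0f}")
```

At the midpoint the linear read is the arithmetic mean of the axis limits and the log read is their geometric mean, so the two differ several-fold. That is the same kind of gap as the one between the model's answer and the ground truth.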
Dense scatter plots. ChatGPT will not extract 250 points. It refuses and points you to a “specialized tool,” or returns 10–15 “representative samples.” Coverage is 4–6% at best.
A soft refusal in disguise — the model has been trained to admit limits when the task exceeds capabilities. The implication: any dense plot (gene expression, particle physics, anything bigger than a class roster) is outside ChatGPT’s reach.
Kaplan-Meier and other step functions. ChatGPT reads survival probability roughly correctly but consistently misplaces step times. A step at month 24 becomes a step at month 21 or 27. For survival analysis where event timing drives the result, this turns a digitized curve into a different curve.
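To see why a shifted step matters, consider restricted mean survival time, which is the area under the survival curve up to a chosen horizon. The sketch below uses a made-up two-step curve, not our benchmark chart; `area_under_steps` is a hypothetical helper.

```python
def area_under_steps(step_times, survival_after, horizon):
    """Area under a step curve that starts at 1.0 at t=0.

    step_times: increasing times at which the curve drops.
    survival_after: survival probability right after each drop.
    """
    area, prev_t, level = 0.0, 0.0, 1.0
    for t, s in zip(step_times, survival_after):
        if t >= horizon:
            break
        area += level * (t - prev_t)  # flat segment before this drop
        prev_t, level = t, s
    area += level * (horizon - prev_t)  # final flat segment up to the horizon
    return area


# Hypothetical curve: drops to 0.8 at month 12 and to 0.6 at month 24.
true_rmst = area_under_steps([12, 24], [0.8, 0.6], horizon=36)
# Same survival levels, but the second step is misread as month 21.
shifted_rmst = area_under_steps([12, 21], [0.8, 0.6], horizon=36)

print(f"true: {true_rmst:.1f} months, misread: {shifted_rmst:.1f} months")
```

The survival levels are identical in both calls. Only the timing moved, and the summary statistic moved with it.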
Why this happens
Three things conspire (we go deeper in why AI gets chart data wrong):
- Vision encoders patch images into tiles, typically 14×14 pixels per token. Anything needing sub-tile precision (reading a tick, distinguishing close lines) is lost before reasoning begins.
- Models round to “nice” numbers. Ask for a value clearly between 47.3 and 48.6 and you’ll get 50. The pull is heaviest at round numbers: 10, 25, 50, 100.
- Log axes require conditional arithmetic. The model must identify a log axis, read tick labels, interpolate in log space, then compute 10^that. Each step is fine alone; the conjunction is unreliable.
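A back-of-envelope calculation shows what the patch-size point means in data units. This is a rough sketch only: the 448-pixel plot height is a hypothetical figure, real models resize images before encoding, and an encoder does not literally emit one value per patch.

```python
def patch_span_in_data_units(plot_height_px, y_min, y_max, patch_px=14):
    """Data units covered by one patch along a linear y-axis."""
    units_per_px = (y_max - y_min) / plot_height_px
    return patch_px * units_per_px


# Hypothetical: a 0-100 axis drawn 448 pixels tall.
span = patch_span_in_data_units(448, 0, 100)
print(f"one 14 px patch spans about {span:.1f} units of a 0-100 axis")
```

Under these assumptions a single patch covers roughly 3% of the axis, so two lines closer together than that fall in the same tile.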
When you should use ChatGPT for charts anyway
ChatGPT is good at telling you what is in a chart even when it’s bad at telling you exactly what the values are.
- “Summarize what this chart shows” — works.
- “Which series is highest in Q3?” — works.
- “What’s the rough trend?” — works.
- “Give me approximate values for each bar” — usable if precision doesn’t matter.
When this is enough: understanding a chart you don’t have raw data for, a one-line summary, orienting yourself before deciding to dig deeper.
When it’s not: anything that ends up in a model, paper, financial filing, or regulatory submission.
The calibrated alternative
We built this test because we sell DataFromChart, which takes the opposite approach: click points yourself, calibrate axes by entering two known values per axis, and the tool does the arithmetic deterministically. No model, no guessing.
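Two-point calibration is simple enough to sketch. The code below illustrates the general technique under stated assumptions; it is not DataFromChart's implementation, and `make_axis_mapper` and its parameters are hypothetical names.

```python
import math


def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, log_scale=False):
    """Build a pixel -> data-value function from two known points on an axis."""
    if pixel_a == pixel_b:
        raise ValueError("calibration points must be at different pixels")
    if log_scale:
        if value_a <= 0 or value_b <= 0:
            raise ValueError("log axes need positive calibration values")
        # Interpolate exponents rather than raw values.
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        v = value_a + slope * (pixel - pixel_a)
        return 10 ** v if log_scale else v

    return to_value


# Image rows grow downward: row 400 is y=0 and row 100 is y=100.
linear_y = make_axis_mapper(400, 0.0, 100, 100.0)
# Same pixel rows on a log axis running from 10 to 1000.
log_y = make_axis_mapper(400, 10.0, 100, 1000.0, log_scale=True)

print(linear_y(250), log_y(250))
```

The same clicked pixel yields about 50 on the linear axis and about 100 on the log axis, and it does so every time. The operator supplies the two anchor values; nothing is inferred from the image.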
The same eight charts run through DataFromChart by a careful operator scored under 1% MAE on every category — including the dense scatter, which color-based auto-extraction handles in 90 seconds. The trade-off is your time: a 20-point scatter takes a minute or two; a 5-bar chart 30 seconds.
If you tried ChatGPT, got nonsense, and ended up here — that’s the design. ChatGPT tells you what a chart is. A calibrated digitizer tells you what the numbers are.
What to do next
- Verify these results: the open-source benchmark harness reproduces them on your own charts.
- Head-to-head between ChatGPT, Claude, and Gemini: Claude vs GPT-4V vs Gemini for chart data extraction.
- Skip the AI debate and extract your chart: the four-step extraction guide.
Try it on your own chart
Upload an image, click your data points, calibrate the axes, and export CSV. Under three minutes, no login required for a single export.
Open the extractor