Can General-Purpose LLMs Perform Accurate Civil Estimates? Claude vs ChatGPT vs Civils.ai

Claude, ChatGPT, and Civils.ai go head-to-head in a blind civil quantity-takeoff test against a chartered QS's ground-truth Bill of Quantities.

May 16, 2026

Mary Janine L. Kamenić

Julianna Widlund P.E.

Stevan Lukic CEng

We ran a controlled, blind quantity-takeoff test on a live landscaping, utilities and road-surfacing scheme. Two general-purpose LLMs (Claude Sonnet 4.5 and GPT-4-class ChatGPT) and Civils.ai were each given the same Issued-for-Construction (IFC) drawing package and the same blank Bill of Quantities templates. A third-party chartered quantity surveyor produced the ground-truth quantities. This article reports every line item, every error band, and the priced-bid impact.

Key Facts

Topic	Key Finding
Overall bid accuracy — Civils.ai	97.0% accurate — priced bid within 3% of the QS ground truth
Overall bid accuracy — Claude	39% underestimate — £2.24M vs QS £3.67M
Overall bid accuracy — ChatGPT	43% underestimate — £2.09M vs QS £3.67M
Bulk earthwork volume	Both LLMs underestimated 38–45% — derived volume from plan area × assumed uniform depth instead of integrating cut/fill from grading contours; Civils.ai within 2%
Road surfacing (sub-base + base course)	LLMs underestimated 24–28% (read thinner build-up than the pavement spec); Civils.ai within 1.5%
Drainage & comms ducting	LLMs counted primary runs only and missed laterals, branches and multi-way duct banks, giving a 19–46% underestimate; Civils.ai within 1.5%
Kerbing, edging & channel	LLMs overestimated (+29% to +56%) via a full-perimeter / both-sides default; Civils.ai within 1.4%
Soft landscape	LLMs overestimated turf and topsoil (gross site area) by 41–55% but underestimated planting and irrigation (distributed plan counts) by 36–48%
Line items within ±5% — Claude	9 of 45 (20%)
Line items within ±5% — ChatGPT	8 of 45 (18%)
Line items within ±3% — Civils.ai	45 of 45 (100%)
Line items exceeding ±20%	Claude 32 of 45 (71%); ChatGPT 33 of 45 (73%); Civils.ai 0 of 45
Drawing package size	287 pages across civil, drainage and landscape sets
Largest single cost miss (Claude)	Bulk earthworks — ~£230K of the £1.43M total gap
API cost for a full LLM takeoff	~$4 (Claude) / ~$3 (ChatGPT)

Key Terms

IFC (Issued for Construction): The final, approved version of construction drawings released to contractors for tendering and building. IFC documents represent complete design intent and are the authoritative source for quantity takeoffs.

CESMM4 / SMM: The Civil Engineering Standard Method of Measurement (4th edition) and its building-works counterparts define how civil quantities are measured, described and itemised in a Bill of Quantities. They are the standard framework for road, drainage, earthwork and public-realm measurement.

Bill of Quantities (BoQ): An itemised, priced schedule of the materials, labour and works required for a project, structured by trade or division. It is the basis on which contractors tender.

CSI MasterFormat Divisions 31–33: The North American numbering convention used to organise site and infrastructure work — Division 31 (Earthwork), Division 32 (Exterior Improvements: paving, surfacing, planting, irrigation) and Division 33 (Utilities). We mapped the BoQ to both CESMM4 and CSI so the results are portable across markets.

Cut and Fill / Earthwork Balance: The calculation of excavation (cut) and placement (fill) volumes derived by comparing existing and proposed ground levels across a grading surface. Accurate earthwork volumes require integrating depth across the whole surface, not multiplying plan area by a single assumed depth.

Spon's Civil Engineering and Highway Works Price Book: A standard reference of unit rates for civil, highway, drainage and landscape works, used here to price all four quantity sets and benchmark the bid-level impact.

1. Approach

The experiment was designed as a controlled, blind accuracy test. Each estimator received only the IFC drawing package and two blank BoQ templates — no hints, no reference quantities, and no human guidance on what to count or how to measure it.

Source Materials

Document	Size	Contents
Civil Engineering Drawings (roads & earthworks)	112 pages, 68 MB PDF	Setting-out, grading plans, road long-sections and cross-sections, pavement build-up details, kerb and drainage details
Drainage & Utilities Drawings	94 pages, 156 MB PDF	Storm and foul layouts, pipe and manhole schedules, catchpit/gully layouts, duct routes, chamber details
Landscape Drawings	81 pages, 210 MB PDF	Planting plans, planting and tree schedules, hard-landscape layouts, irrigation routes, tree-pit details
Drawing Register (IFC).xlsx	—	Index of all drawing numbers and PDF page locations
Roads & Earthworks takeoff template	—	CESMM4 / CSI-formatted BoQ, quantities blank
Utilities & Landscape takeoff template	—	CESMM4 / CSI-formatted BoQ, quantities blank

Project: Northgate Link Road, Utilities & Public Realm Scheme — a ~1.1 km link road with associated storm drainage, wet and dry utilities, and a public-realm/landscape package (turf, planting, irrigation, hard landscape). Representative of a mid-scale civils and infrastructure project.

How the general-purpose LLMs worked — agentic file mode

Both Claude and ChatGPT ran in an agentic, sandboxed Linux environment with a Bash shell, file read/write and web search. Each followed broadly the same workflow:

Read the drawing register to build a page-number index.
Used pdftotext to extract machine-readable text from schedule pages (pipe schedules, manhole schedules, planting and tree schedules).
Used pdftoppm to render plan, section and elevation pages as 200 DPI PNG images for visual scale reading.
Wrote a Python script using openpyxl to populate both BoQ templates while preserving existing formulas and formatting.

Each ran in a single coherent session. Estimated API cost: ~$4 (Claude) and ~$3 (ChatGPT).
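The extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the scripts the models actually wrote: the file name, page numbers and helper names are hypothetical, and it assumes the poppler-utils command-line tools (pdftotext and pdftoppm) are installed. The helpers only build the command lines; the final comment shows how they would be executed. Populating the BoQ templates with openpyxl is not shown.

```python
import subprocess


def pdftotext_cmd(pdf_path: str, page: int, out_txt: str) -> list[str]:
    """Command to extract one page as text, keeping column layout for schedules."""
    return ["pdftotext", "-f", str(page), "-l", str(page), "-layout", pdf_path, out_txt]


def pdftoppm_cmd(pdf_path: str, page: int, out_prefix: str, dpi: int = 200) -> list[str]:
    """Command to render one page as a PNG at the given resolution."""
    return ["pdftoppm", "-f", str(page), "-l", str(page),
            "-r", str(dpi), "-png", pdf_path, out_prefix]


def run(cmd: list[str]) -> None:
    """Run a command and raise if the tool reports an error."""
    subprocess.run(cmd, check=True)


# Hypothetical file name and page numbers, for illustration only.
schedule_cmd = pdftotext_cmd("drainage.pdf", 41, "pipe_schedule.txt")
plan_cmd = pdftoppm_cmd("drainage.pdf", 12, "plan_p12")
print(" ".join(schedule_cmd))
print(" ".join(plan_cmd))
# To execute against a real file: run(schedule_cmd); run(plan_cmd)
```

In practice the page numbers would come from the drawing register rather than being hard-coded.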

How Civils.ai worked — purpose-built takeoff engine

Civils.ai ingested the same IFC package into its civil-specific takeoff engine. Rather than reading pages one at a time, it vectorised the plan and section sheets, auto-classified every drawing against CESMM4/CSI divisions, and — critically for this scope — reconstructed the earthwork surface from the grading contours and spot levels rather than assuming a uniform depth. It cross-referenced plan, section and schedule for every measured item (for example, reconciling pipe runs on the drainage layout against the pipe schedule and the manhole schedule), and resolved scope boundaries (single-side vs both-side kerb, net vs gross planting area) from the detail drawings.

2. Ground Truth — Chartered Quantity Surveyor

After all three estimators completed their takeoffs, an independent chartered quantity surveyor produced their own quantities from the same drawing package, working to CESMM4. Their figures were entered alongside the three machine sets, creating a direct line-by-line comparison across 45 non-lump-sum line items spanning earthwork, surfacing, landscape and utilities.

3. Results

3.1 Accuracy Distribution

Error Band	Claude	ChatGPT	Civils.ai
≤ ±5% error	9 items (20%)	8 items (18%)	45 items (100%)
±5–20% error	4 items (9%)	4 items (9%)	0 items
> ±20% error	32 items (71%)	33 items (73%)	0 items

The two general-purpose LLMs show a strongly bimodal distribution: schedule-driven counts cluster near zero error; derived volumes, distributed counts and scope-dependent linear items sit in the extreme tails. Civils.ai shows no items outside ±3%.
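As a check on the banding, the sketch below recomputes the error bands for the two LLMs from the quantities listed in Section 3.3. The function names are ours, and the band boundaries (an absolute error of 5% or less, above 5% up to 20%, above 20%) are an assumption about how the edges are treated; no item in this data set sits exactly on a boundary.

```python
# (QS, Claude, ChatGPT) quantities for line items 1-45, copied from Section 3.3.
ITEMS = [
    (18500, 18300, 17900), (4625, 3240, 3050), (12400, 7650, 6900),
    (9750, 6050, 5400), (12400, 7650, 6900), (5280, 4650, 4400),
    (3140, 1940, 1780), (14200, 10800, 10200), (3960, 3010, 2850),
    (2850, 2150, 2050), (1690, 1700, 1640), (1120, 1110, 1090),
    (8800, 8650, 8400), (3420, 2020, 1900), (1240, 970, 940),
    (96, 62, 58), (2180, 3240, 3400), (1560, 2010, 2140),
    (218, 292, 306), (6400, 4100, 3800), (240, 150, 132),
    (34, 27, 25), (42, 36, 38), (2760, 3900, 4100),
    (6200, 9150, 9600), (3100, 1780, 1650), (1840, 1180, 1060),
    (128, 128, 126), (24, 24, 24), (152, 120, 118),
    (340, 214, 196), (4200, 2380, 2200), (980, 1240, 1290),
    (1420, 960, 900), (860, 660, 620), (540, 440, 420),
    (88, 52, 48), (46, 43, 42), (6, 6, 6),
    (1180, 750, 700), (1240, 1180, 1120), (2860, 1700, 1560),
    (64, 40, 36), (18, 18, 18), (3980, 2510, 2350),
]


def pct_error(machine: float, truth: float) -> float:
    """Signed error in percent: positive = overestimate, negative = underestimate."""
    return (machine - truth) / truth * 100.0


def band_counts(errors: list[float]) -> tuple[int, int, int]:
    """Count items in the <=5%, >5-20% and >20% absolute-error bands."""
    within_5 = sum(1 for e in errors if abs(e) <= 5)
    within_20 = sum(1 for e in errors if 5 < abs(e) <= 20)
    over_20 = sum(1 for e in errors if abs(e) > 20)
    return within_5, within_20, over_20


claude_bands = band_counts([pct_error(c, qs) for qs, c, _ in ITEMS])
chatgpt_bands = band_counts([pct_error(g, qs) for qs, _, g in ITEMS])
print("Claude :", claude_bands)   # (9, 4, 32)
print("ChatGPT:", chatgpt_bands)  # (8, 4, 33)
```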

3.2 Per-Measurement-Unit Accuracy (mean absolute error)

Unit	Category	Claude	ChatGPT	Civils.ai
EA / nr	Discrete counts	14%	16%	0.8%
t	Asphalt tonnage (from build-up)	12%	14%	1.1%
m²	Areas	24%	27%	1.2%
m³	Volumes	30%	33%	1.4%
LM	Linear (pipe, kerb, duct)	31%	34%	1.3%

For the LLMs, accuracy tracks how the quantity is derived, not the trade: anything read directly from a table (asphalt binder and surface courses from the pavement build-up, tree and chamber counts from a schedule) is close, while anything requiring derivation from geometry (m³ volumes, m² surfaces) or aggregation across sheets (LM runs) drifts badly.

3.3 Full Line-Item Comparison

Each machine cell shows the quantity and its error against the QS ground truth. Positive = overestimate, negative = underestimate.

Division 31 — Earthwork

#	Item	Unit	QS (truth)	Claude	ChatGPT	Civils.ai
1	Site clearing & grubbing	m²	18,500	18,300 (−1.1%)	17,900 (−3.2%)	18,450 (−0.3%)
2	Topsoil strip & stockpile	m³	4,625	3,240 (−29.9%)	3,050 (−34.1%)	4,600 (−0.5%)
3	Bulk excavation (cut)	m³	12,400	7,650 (−38.3%)	6,900 (−44.4%)	12,180 (−1.8%)
4	Engineered / compacted fill	m³	9,750	6,050 (−37.9%)	5,400 (−44.6%)	9,900 (+1.5%)
5	Cut-to-fill haulage	m³	12,400	7,650 (−38.3%)	6,900 (−44.4%)	12,300 (−0.8%)
6	Pavement box excavation	m³	5,280	4,650 (−11.9%)	4,400 (−16.7%)	5,260 (−0.4%)
7	Utility trench excavation	m³	3,140	1,940 (−38.2%)	1,780 (−43.3%)	3,090 (−1.6%)
8	Geotextile separation layer	m²	14,200	10,800 (−23.9%)	10,200 (−28.2%)	14,050 (−1.1%)

Division 32 — Surfacing, Paving & Landscaping

#	Item	Unit	QS (truth)	Claude	ChatGPT	Civils.ai
9	Sub-base (Type 1 granular)	m³	3,960	3,010 (−24.0%)	2,850 (−28.0%)	3,920 (−1.0%)
10	Asphalt base course	t	2,850	2,150 (−24.6%)	2,050 (−28.1%)	2,890 (+1.4%)
11	Asphalt binder course	t	1,690	1,700 (+0.6%)	1,640 (−3.0%)	1,675 (−0.9%)
12	Asphalt surface course	t	1,120	1,110 (−0.9%)	1,090 (−2.7%)	1,130 (+0.9%)
13	Carriageway area	m²	8,800	8,650 (−1.7%)	8,400 (−4.5%)	8,780 (−0.2%)
14	Footway & cycleway paving	m²	3,420	2,020 (−40.9%)	1,900 (−44.4%)	3,380 (−1.2%)
15	Block paving (plazas)	m²	1,240	970 (−21.8%)	940 (−24.2%)	1,250 (+0.8%)
16	Tactile paving	m²	96	62 (−35.4%)	58 (−39.6%)	94 (−2.1%)
17	Precast concrete kerb	LM	2,180	3,240 (+48.6%)	3,400 (+56.0%)	2,210 (+1.4%)
18	Edging & channel	LM	1,560	2,010 (+28.8%)	2,140 (+37.2%)	1,540 (−1.3%)
19	Kerb bedding concrete	m³	218	292 (+33.9%)	306 (+40.4%)	221 (+1.4%)
20	Road line marking	LM	6,400	4,100 (−35.9%)	3,800 (−40.6%)	6,320 (−1.3%)
21	Road studs	EA	240	150 (−37.5%)	132 (−45.0%)	236 (−1.7%)
22	Traffic signs	EA	34	27 (−20.6%)	25 (−26.5%)	34 (0.0%)
23	Street lighting columns	EA	42	36 (−14.3%)	38 (−9.5%)	42 (0.0%)
24	Topsoil to soft landscape	m³	2,760	3,900 (+41.3%)	4,100 (+48.6%)	2,720 (−1.4%)
25	Turfing	m²	6,200	9,150 (+47.6%)	9,600 (+54.8%)	6,280 (+1.3%)
26	Hydroseeding	m²	3,100	1,780 (−42.6%)	1,650 (−46.8%)	3,060 (−1.3%)
27	Shrub & herbaceous planting	nr	1,840	1,180 (−35.9%)	1,060 (−42.4%)	1,820 (−1.1%)
28	Standard trees	nr	128	128 (0.0%)	126 (−1.6%)	128 (0.0%)
29	Semi-mature trees	nr	24	24 (0.0%)	24 (0.0%)	24 (0.0%)
30	Tree pits with root barrier	nr	152	120 (−21.1%)	118 (−22.4%)	150 (−1.3%)
31	Mulch / bark	m³	340	214 (−37.1%)	196 (−42.4%)	345 (+1.5%)
32	Irrigation drip line	LM	4,200	2,380 (−43.3%)	2,200 (−47.6%)	4,140 (−1.4%)
33	Planter bed edging	LM	980	1,240 (+26.5%)	1,290 (+31.6%)	990 (+1.0%)

Division 33 — Utilities

#	Item	Unit	QS (truth)	Claude	ChatGPT	Civils.ai
34	Storm drain pipe Ø300	LM	1,420	960 (−32.4%)	900 (−36.6%)	1,400 (−1.4%)
35	Storm drain pipe Ø450	LM	860	660 (−23.3%)	620 (−27.9%)	870 (+1.2%)
36	Storm drain pipe Ø600	LM	540	440 (−18.5%)	420 (−22.2%)	545 (+0.9%)
37	Catchpits & gullies	EA	88	52 (−40.9%)	48 (−45.5%)	87 (−1.1%)
38	Manholes	EA	46	43 (−6.5%)	42 (−8.7%)	46 (0.0%)
39	Headwalls & outfalls	EA	6	6 (0.0%)	6 (0.0%)	6 (0.0%)
40	Pipe bedding & surround	m³	1,180	750 (−36.4%)	700 (−40.7%)	1,160 (−1.7%)
41	Water main Ø150	LM	1,240	1,180 (−4.8%)	1,120 (−9.7%)	1,250 (+0.8%)
42	Comms ducting (multi-way)	LM	2,860	1,700 (−40.6%)	1,560 (−45.5%)	2,820 (−1.4%)
43	Draw pits / joint boxes	EA	64	40 (−37.5%)	36 (−43.8%)	63 (−1.6%)
44	Valve chambers	EA	18	18 (0.0%)	18 (0.0%)	18 (0.0%)
45	Utility trench reinstatement	m²	3,980	2,510 (−36.9%)	2,350 (−41.0%)	3,930 (−1.3%)

3.4 Cost Impact (Priced Bid)

Indicative mid-range unit rates (Spon's Civil Engineering and Highway Works Price Book, GBP) were applied to all four quantity sets, followed by a standard 30% addition covering contingency, overheads and profit, and levies.

Division	QS (truth)	Claude	ChatGPT	Civils.ai
Div 01 — General / Prelims (lump sum)	£320K	£315K (−1.6%)	£305K (−4.7%)	£318K (−0.6%)
Div 31 — Earthwork	£980K	£545K (−44.4%)	£505K (−48.5%)	£960K (−2.0%)
Div 32 — Surfacing / Paving / Landscape	£1,650K	£970K (−41.2%)	£900K (−45.5%)	£1,588K (−3.8%)
Div 33 — Utilities	£720K	£410K (−43.1%)	£375K (−47.9%)	£694K (−3.6%)
Total	£3,670K	£2,240K (−39.0%)	£2,085K (−43.2%)	£3,560K (−3.0%)

Civils.ai landed within 3% of the priced QS bid (97.0% accurate). Both general-purpose LLMs came in ≈40% low at the bid level — a gap of roughly £1.4M–£1.6M against the QS figure, which on a tender of this size is the difference between a credible bid and a non-compliant one.
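The pricing step (quantity times unit rate, then the 30% addition) can be sketched as below. The unit rates are hypothetical placeholders, not values from Spon's, and the two quantities are items 3 and 34 from Section 3.3, so the resulting totals illustrate the arithmetic only and do not reproduce the table above.

```python
UPLIFT = 0.30  # the 30% addition stated in the text


def priced_total(quantities: dict[str, float], rates: dict[str, float],
                 uplift: float = UPLIFT) -> float:
    """Sum quantity x unit rate over all items, then apply the percentage addition."""
    net = sum(qty * rates[item] for item, qty in quantities.items())
    return net * (1.0 + uplift)


# Hypothetical unit rates in GBP (NOT taken from Spon's).
rates = {"bulk_excavation_m3": 10.0, "storm_pipe_300_lm": 100.0}

# Quantities for items 3 and 34 from the line-item comparison.
qs = {"bulk_excavation_m3": 12400, "storm_pipe_300_lm": 1420}
claude = {"bulk_excavation_m3": 7650, "storm_pipe_300_lm": 960}

qs_total = priced_total(qs, rates)
claude_total = priced_total(claude, rates)
gap_pct = (claude_total - qs_total) / qs_total * 100.0
print(f"QS £{qs_total:,.0f}  Claude £{claude_total:,.0f}  gap {gap_pct:+.1f}%")
```

Because the same rates and the same uplift are applied to every quantity set, the percentage gap at bid level depends only on the quantities and on how heavily each line is weighted by its rate.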

3.5 Top Cost Misses — Claude

The five categories below account for £656K of under-measurement, or roughly £540K once the £115K of offsetting overestimates is netted off, out of Claude's £1.43M total gap; the remainder is distributed across the other 30-plus line items.

Rank	Category	QS	Claude	Gap	Cause
1	Bulk earthworks (cut / fill / haul / disposal)	£560K	£330K	−£230K	Volume-from-plan (uniform assumed depth)
2	Road surfacing (sub-base + base course)	£430K	£300K	−£130K	Under-read pavement build-up thickness
3	Storm drainage (pipe + bedding + gullies)	£310K	£185K	−£125K	Primary runs only; missed laterals
4	Comms ducting + trench reinstatement	£240K	£142K	−£98K	Missed multi-way duct banks
5	Soft-landscape planting + irrigation	£205K	£132K	−£73K	Distributed plan counts under-read
—	Offsetting overestimates (kerb/edging, turf/topsoil)	—	—	+£115K	Full-perimeter & gross-area defaults

4. Systematic Error Taxonomy

Every large LLM error in this test maps to one of six repeatable failure modes. Civils.ai's engine is built to neutralise all six.

Error type	Root cause	Affected items	Direction
Volume-from-plan	Plan area × single assumed depth instead of integrating cut/fill from contours and spot levels	Bulk cut, fill, haul, topsoil strip	Under
Layer-thickness under-read	Read a thinner pavement / sub-base than the specified build-up	Sub-base, asphalt base course	Under
Primary-run-only	Counted main pipe/duct runs; missed laterals, branches and connections	Storm laterals, comms ducting, gullies	Under
Distributed-count miss	Missed elements spread across many plan sheets	Shrubs, gullies, draw pits, signs, tree pits	Under
Full-perimeter default	Applied kerb/edging around the full perimeter and to both sides	Kerb, edging, channel, kerb bedding	Over
Gross-vs-net area	Used gross site area for soft landscape rather than net planting zones	Turf, topsoil placement	Over

The two over-estimating modes (kerb perimeter, gross area) partly mask the under-estimating modes at the total-cost level, which is why the LLM bids come out "only" about 40% low even though individual line items range from roughly 48% under to 56% over. This netting-out is itself a hazard: it makes a badly constructed takeoff appear more plausible than it is.
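The netting effect can be made concrete with the gap figures from the Section 3.5 table for Claude. The sketch below compares the signed (net) gap, which is what appears in the bid total, with the gross mis-measured value; the variable names are ours.

```python
# Claude's cost gaps in £K from the Section 3.5 table
# (negative = underestimate, positive = overestimate).
gaps_k = {
    "Bulk earthworks": -230,
    "Road surfacing": -130,
    "Storm drainage": -125,
    "Comms ducting + reinstatement": -98,
    "Planting + irrigation": -73,
    "Kerb/edging, turf/topsoil (overestimates)": 115,
}

net_gap = sum(gaps_k.values())                      # what shows up in the bid total
gross_error = sum(abs(g) for g in gaps_k.values())  # total value measured wrongly
masked = gross_error - abs(net_gap)                 # error hidden by offsetting signs

print(f"net gap: {net_gap}K, gross error: {gross_error}K, masked by netting: {masked}K")
# net gap: -541K, gross error: 771K, masked by netting: 230K
```

On these six rows the bid total hides £230K of mis-measurement that a line-by-line review would expose.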

5. Takeaways

Where general-purpose LLMs work well

Schedule-driven counts. Standard and semi-mature trees, valve chambers and headwalls, all read directly from a clean tabular schedule, came back exact or within 2%; manholes were 7–9% low.
Tonnage from an explicit build-up. Asphalt binder and surface courses, calculated from the stated pavement build-up, were within 3%.
Simple plan areas with a clear callout. Carriageway area and site clearing were close because they derive from a single dimensioned boundary.
Lump sums. General-conditions items default to 1-each and are always correct.
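The tonnage item in the list above rests on a simple product of area, layer thickness and compacted density, which also shows why an under-read thickness produces a proportional under-measure. The sketch below uses hypothetical values throughout (area, thicknesses and density are not taken from the pavement specification).

```python
def asphalt_tonnes(area_m2: float, thickness_mm: float, density_t_per_m3: float) -> float:
    """Tonnage of one pavement layer: plan area x thickness x compacted density."""
    return area_m2 * (thickness_mm / 1000.0) * density_t_per_m3


# Hypothetical inputs, for illustration only.
AREA_M2 = 1000.0
DENSITY = 2.4          # t/m3, an assumed compacted density
SPECIFIED_MM = 200.0   # thickness in the (hypothetical) build-up detail
UNDER_READ_MM = 150.0  # thickness a reader might pick up from the wrong detail

specified = asphalt_tonnes(AREA_M2, SPECIFIED_MM, DENSITY)
under_read = asphalt_tonnes(AREA_M2, UNDER_READ_MM, DENSITY)
error_pct = (under_read - specified) / specified * 100.0

print(f"specified {specified:.0f} t, under-read {under_read:.0f} t, error {error_pct:+.1f}%")
# specified 480 t, under-read 360 t, error -25.0%
```

When the thickness is read correctly the result is only as wrong as the area; when it is read 25% thin, the tonnage is 25% low regardless of how well the area was measured.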

Where general-purpose LLMs struggle today

Derived volumes. Bulk earthwork is the single worst category. Both LLMs multiplied a plan area by an assumed uniform depth; neither reconstructed cut and fill from the grading surface. Result: 38–45% underestimate on the highest-value items in the bill.
Networked linear work. Drainage and ducting were read as their primary runs only. Laterals, branch connections and multi-way duct banks, which are spread across multiple sheets, were missed, producing 19–46% underestimates.
Distributed counts. Shrubs, gullies, draw pits and signs are enumerated across many plan sheets rather than in one schedule; the LLMs consistently under-counted them.
Scope-boundary judgement. Kerb, edging and channel were pushed to a full-perimeter, both-sides default (+29% to +56%), while soft-landscape areas used gross site area rather than net planting zones (+41% to +55%).

Why Civils.ai reaches 97%

Civils.ai is purpose-built for civil takeoff, so the six failure modes above are handled by design rather than by luck:

Earthwork from levels, not plan area. It reconstructs the cut/fill surface from contours and spot levels — the root fix for the largest category of LLM error.
Plan + section + schedule reconciliation. Every measured item is cross-checked across all three view types, so pipe runs match the pipe schedule and chamber counts match the manhole schedule.
Network tracing. It follows drainage and duct networks through laterals and branches instead of counting only the trunk run.
CESMM/CSI-aware scope resolution. Kerb sides, net-vs-gross planting and pavement build-up thickness are resolved from the standard method of measurement and the detail drawings, not from a default.
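The reconciliation idea in the list above can be illustrated with a simple cross-check between lengths traced from a plan and totals from a schedule. This is a sketch of the general technique, not Civils.ai's implementation, which the article does not describe at code level. The function name and the 5% tolerance are assumptions, and the QS quantities from Section 3.3 stand in for schedule totals.

```python
def reconcile(measured: dict[str, float], scheduled: dict[str, float],
              tol_pct: float = 5.0) -> list[str]:
    """Return items whose plan-measured length differs from the schedule by more than tol_pct."""
    flagged = []
    for item, sched in scheduled.items():
        plan = measured.get(item, 0.0)
        diff_pct = (plan - sched) / sched * 100.0
        if abs(diff_pct) > tol_pct:
            flagged.append(f"{item}: plan {plan:.0f} vs schedule {sched:.0f} ({diff_pct:+.1f}%)")
    return flagged


# QS quantities used as a stand-in for schedule totals (an assumption for this example).
scheduled = {"Storm drain Ø300": 1420, "Storm drain Ø450": 860,
             "Storm drain Ø600": 540, "Water main Ø150": 1240}
# Claude's plan-traced lengths for the same items.
measured = {"Storm drain Ø300": 960, "Storm drain Ø450": 660,
            "Storm drain Ø600": 440, "Water main Ø150": 1180}

flagged = reconcile(measured, scheduled)
for line in flagged:
    print(line)
```

With these inputs the three storm-drain diameters are flagged and the water main passes, which is the kind of signal that would prompt a re-trace of the drainage layout before pricing.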

Economic Assessment

A general-purpose LLM takeoff costs ~$3–$4 in API usage and takes under an hour. That is genuinely useful as a first-pass index — it names every line item and flags where the risk sits. But at a 40% bid-level underestimate, it is not a tender-ready quantity set, and the errors are systematic rather than random, so they will not "average out."

A chartered QS billing at £80–£150/hour would spend 20–45+ hours on a takeoff of this size, roughly £1,600–£6,800 in labour. Civils.ai reproduced the QS bid to within 3% in a fraction of that time. The strongest workflow is for Civils.ai to produce the measured takeoff and for a QS to review and sign off the high-value earthwork, drainage and surfacing lines: a review measured in hours, not days.

Recommended workflow

Use a purpose-built civil engine for the measured takeoff. General LLMs are best kept to a sanity-check / line-item-discovery role.
Always human-review the three highest-risk categories: bulk earthworks, networked drainage/ducting, and pavement/surfacing build-ups.
Derive earthwork from levels, never from plan area × assumed depth.
Resolve scope boundaries explicitly — kerb sides, net vs gross planting area, pavement layer thicknesses — before pricing.

Frequently Asked Questions

Is Claude (or ChatGPT) accurate enough for real civil quantity takeoffs?

Not as a standalone tender basis. In this test both general-purpose LLMs came in about 40% low at the priced-bid level, and 71–73% of their line items exceeded ±20% error. They are reliable only on schedule-driven counts and on quantities read directly from an explicit build-up. For bulk earthworks, networked drainage and ducting, and soft-landscape areas they are systematically wrong. The appropriate role is a first-pass index that a professional, or a purpose-built engine, then corrects.

How is Civils.ai able to reach 97% when general LLMs reach ~60%?

Civils.ai is purpose-built for civil takeoff. It reconstructs earthwork volumes from grading contours and spot levels rather than multiplying plan area by an assumed depth; it traces drainage and duct networks through their laterals and branches; and it cross-references plan, section and schedule for every item while resolving CESMM/CSI scope boundaries. These capabilities map directly onto the failure modes the general LLMs exhibited, which is why the accuracy gap is largest on the highest-value lines in the bill.

Why did the LLMs underestimate bulk earthworks so heavily?

Both derived volume from a plan area multiplied by a single assumed depth, rather than integrating cut and fill across the grading surface. On this scheme, which has meaningful level changes, that method understated the true volume by 38–45%. It is a methodological error, not a page-reading error, which is why it is so consistent across both models.
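The difference between the two methods can be seen on a toy grid. The sketch below is a simplified grid-method calculation under stated assumptions: the levels, the 10 m cell size and the 0.2 m "assumed depth" are all hypothetical, and a production tool would more likely compare triangulated surfaces than a coarse grid.

```python
def cut_fill_volumes(existing: list[list[float]], proposed: list[list[float]],
                     cell_area_m2: float) -> tuple[float, float]:
    """Grid method: sum per-cell level differences times the cell plan area."""
    cut = fill = 0.0
    for row_e, row_p in zip(existing, proposed):
        for e, p in zip(row_e, row_p):
            diff = e - p  # positive where ground must be lowered
            if diff > 0:
                cut += diff * cell_area_m2
            else:
                fill += -diff * cell_area_m2
    return cut, fill


# Hypothetical existing levels (m) on a 3 x 3 grid of 10 m x 10 m cells.
existing = [[10.0, 10.5, 11.0],
            [10.2, 10.8, 11.4],
            [10.4, 11.0, 11.8]]
proposed = [[10.6] * 3 for _ in range(3)]  # flat formation level, also hypothetical
CELL_AREA = 100.0

cut, fill = cut_fill_volumes(existing, proposed, CELL_AREA)
uniform = 9 * CELL_AREA * 0.2  # plan area x a single assumed 0.2 m depth

print(f"cut {cut:.0f} m3, fill {fill:.0f} m3, uniform-depth estimate {uniform:.0f} m3")
# cut 300 m3, fill 130 m3, uniform-depth estimate 180 m3
```

On this toy surface the uniform-depth shortcut gives 180 m³ against a cut of 300 m³ and reports no fill at all, which mirrors the direction of the errors seen in the test.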

Why did the LLMs miss drainage and ducting?

They read the primary runs shown on the layout sheets and stopped there. Lateral connections to gullies, branch runs and multi-way duct banks are distributed across several sheets and enumerated in schedules the models did not fully reconcile. The result was a 19–46% underestimate on pipe length, gully count, duct length and draw-pit count.
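A small graph walk shows how much length is lost when only the trunk is counted. The network below is hypothetical (node names and pipe lengths are invented for illustration), it is assumed to be a tree with no loops, and the traversal is a generic breadth-first walk rather than any tool's actual method.

```python
from collections import deque

# Hypothetical drainage network: node -> list of (upstream node, pipe length in m).
# "MH" nodes are manholes on the trunk; "G" nodes are gullies on laterals.
NETWORK = {
    "outfall": [("MH1", 60.0)],
    "MH1": [("MH2", 80.0), ("G1", 12.0)],
    "MH2": [("MH3", 75.0), ("G2", 10.0), ("G3", 14.0)],
    "MH3": [("G4", 9.0)],
}


def total_length(network, start, include=lambda node: True):
    """Walk upstream from start, summing every pipe whose upstream node passes include()."""
    total, queue = 0.0, deque([start])
    while queue:
        node = queue.popleft()
        for upstream, length in network.get(node, []):
            if include(upstream):
                total += length
                queue.append(upstream)
    return total


full = total_length(NETWORK, "outfall")
primary = total_length(NETWORK, "outfall", include=lambda n: n.startswith("MH"))
shortfall_pct = (primary - full) / full * 100.0

print(f"full network {full:.0f} m, primary runs only {primary:.0f} m ({shortfall_pct:+.1f}%)")
# full network 260 m, primary runs only 215 m (-17.3%)
```

The shortfall grows with the number of laterals per manhole, so a gully-dense layout would lose a larger share than this toy network does.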

What did the general LLMs actually get right?

Discrete counts pulled straight from a clean schedule (standard and semi-mature trees, valve chambers, headwalls) were exact or within 2%, and manholes were 7–9% low. Asphalt binder and surface tonnage, calculated from the stated pavement build-up, were within 3%. Lump-sum general-conditions items were correct by default. The pattern is clear: LLMs are accurate where the answer is a number in a table, and inaccurate where the answer must be derived from geometry or aggregated across sheets.

How does the ~$4 API cost compare to professional estimating cost?

A chartered QS would spend 20–45+ hours on a comparable takeoff, representing roughly £1,600–£6,800 in labour. The ~$4 LLM cost is not a substitute for that expertise — at a 40% underestimate it is not tender-ready — but it compresses the data-gathering phase. The strongest economics come from a purpose-built engine like Civils.ai that reaches QS-grade accuracy, with human review reserved for the high-value earthwork, drainage and surfacing lines.

Would these results apply to other civil project types?

Largely, yes. The earthwork finding generalises to any scheme with meaningful level changes — the plan-area-times-assumed-depth error is inherent to how general LLMs approach volume. The drainage/ducting and distributed-count findings apply to any project with a real utility network. Errors shrink on simple, flat, rectilinear sites where plan area and true surface converge and networks are trivial; they grow on graded, layered or network-heavy schemes — which describes most real landscaping, utilities and road-surfacing work.
