We ran a controlled, blind quantity-takeoff test on a live landscaping, utilities and road-surfacing scheme. Two general-purpose LLMs (Claude Sonnet 4.5 and GPT-4-class ChatGPT) and Civils.ai were each given the same Issued-for-Construction (IFC) drawing package and the same blank Bill of Quantities templates. A third-party chartered quantity surveyor produced the ground-truth quantities. This article reports every line item, every error band, and the priced-bid impact.
Key Facts
| Topic | Key Finding |
|---|---|
| Overall bid accuracy — Civils.ai | 97.0% accurate — priced bid within 3% of the QS ground truth |
| Overall bid accuracy — Claude | 39% underestimate — £2.24M vs QS £3.67M |
| Overall bid accuracy — ChatGPT | 43% underestimate — £2.09M vs QS £3.67M |
| Bulk earthwork volume | Both LLMs underestimated 38–45% — derived volume from plan area × assumed uniform depth instead of integrating cut/fill from grading contours; Civils.ai within 2% |
| Road surfacing (sub-base + base course) | LLMs underestimated 24–28% (read thinner build-up than the pavement spec); Civils.ai within 1.5% |
| Drainage & comms ducting | LLMs counted primary runs only and missed laterals, branches and multi-way duct banks — 30–46% underestimate; Civils.ai within 1.5% |
| Kerbing, edging & channel | LLMs overestimated (+29% to +56%) via a full-perimeter / both-sides default; Civils.ai within 1.4% |
| Soft landscape | LLMs over-estimated turf and topsoil (gross site area) by 41–55% but under-estimated planting and irrigation (distributed plan counts) by 36–48% |
| Line items within ±5% — Claude | 9 of 45 (20%) |
| Line items within ±5% — ChatGPT | 8 of 45 (18%) |
| Line items within ±3% — Civils.ai | 45 of 45 (100%) |
| Line items exceeding ±20% — LLMs | Claude 32 of 45 (71%); ChatGPT 33 of 45 (73%); Civils.ai 0 of 45 |
| Drawing package size | 287 pages across civil, drainage and landscape sets |
| Largest single cost miss (Claude) | Bulk earthworks — ~£230K of the £1.43M total gap |
| API cost for a full LLM takeoff | ~$4 (Claude) / ~$3 (ChatGPT) |
1. Approach
The experiment was designed as a controlled, blind accuracy test. Each estimator received only the IFC drawing package and two blank BoQ templates — no hints, no reference quantities, and no human guidance on what to count or how to measure it.
Source Materials
| Document | Size | Contents |
|---|---|---|
| Civil Engineering Drawings (roads & earthworks) | 112 pages, 68 MB PDF | Setting-out, grading plans, road long-sections and cross-sections, pavement build-up details, kerb and drainage details |
| Drainage & Utilities Drawings | 94 pages, 156 MB PDF | Storm and foul layouts, pipe and manhole schedules, catchpit/gully layouts, duct routes, chamber details |
| Landscape Drawings | 81 pages, 210 MB PDF | Planting plans, planting and tree schedules, hard-landscape layouts, irrigation routes, tree-pit details |
| Drawing Register (IFC).xlsx | — | Index of all drawing numbers and PDF page locations |
| Roads & Earthworks takeoff template | — | CESMM4 / CSI-formatted BoQ, quantities blank |
| Utilities & Landscape takeoff template | — | CESMM4 / CSI-formatted BoQ, quantities blank |
Project: Northgate Link Road, Utilities & Public Realm Scheme — a ~1.1 km link road with associated storm drainage, wet and dry utilities, and a public-realm/landscape package (turf, planting, irrigation, hard landscape). Representative of a mid-scale civils and infrastructure project.
How the general-purpose LLMs worked — agentic file mode
Both Claude and ChatGPT ran in an agentic, sandboxed Linux environment with a Bash shell, file read/write and web search. Each followed broadly the same workflow:
- Read the drawing register to build a page-number index.
- Used `pdftotext` to extract machine-readable text from schedule pages (footing/pier-equivalent schedules, pipe schedules, manhole schedules, planting schedules).
- Used `pdftoppm` to render plan, section and elevation pages as 200 DPI PNG images for visual scale reading.
- Wrote a Python script using `openpyxl` to populate both BoQ templates while preserving existing formulas and formatting.
Each ran in a single coherent session. Estimated API cost: ~$4 (Claude) and ~$3 (ChatGPT).
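The workflow above can be sketched roughly as follows. This is an illustration rather than the script either model actually wrote: the helper names, file names and the spreadsheet column layout (item description in column B, quantity in column D) are assumptions. The `pdftotext` and `pdftoppm` flags are standard Poppler options.

```python
import shutil
import subprocess


def pdftotext_cmd(pdf: str, page: int, out_txt: str) -> list[str]:
    """Command that extracts one page of text, keeping table column layout."""
    return ["pdftotext", "-f", str(page), "-l", str(page), "-layout", pdf, out_txt]


def pdftoppm_cmd(pdf: str, page: int, out_prefix: str, dpi: int = 200) -> list[str]:
    """Command that renders one page as a PNG at the given resolution."""
    return ["pdftoppm", "-f", str(page), "-l", str(page),
            "-r", str(dpi), "-png", pdf, out_prefix]


def run_if_available(cmd: list[str]) -> bool:
    """Run the command only when the Poppler tool is installed."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


def fill_boq(template: str, out_path: str, quantities: dict[str, float]) -> None:
    """Write quantities into a copy of the BoQ template.

    Assumed layout (hypothetical): item description in column B, blank
    quantity in column D, header in row 1. Formulas are kept because the
    workbook is loaded without data_only=True.
    """
    from openpyxl import load_workbook  # third-party, imported lazily

    wb = load_workbook(template)
    ws = wb.active
    for r in range(2, ws.max_row + 1):
        item = ws.cell(row=r, column=2).value
        if item in quantities:
            ws.cell(row=r, column=4).value = quantities[item]
    wb.save(out_path)


# Example: the commands for page 12 of a hypothetical drawing set.
print(pdftotext_cmd("civil.pdf", 12, "p12.txt"))
print(pdftoppm_cmd("civil.pdf", 12, "p12"))
```

Building the commands separately from running them keeps the page index, taken from the drawing register, easy to inspect before any extraction happens.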
How Civils.ai worked — purpose-built takeoff engine
Civils.ai ingested the same IFC package into its civil-specific takeoff engine. Rather than reading pages one at a time, it vectorised the plan and section sheets, auto-classified every drawing against CESMM4/CSI divisions, and — critically for this scope — reconstructed the earthwork surface from the grading contours and spot levels rather than assuming a uniform depth. It cross-referenced plan, section and schedule for every measured item (for example, reconciling pipe runs on the drainage layout against the pipe schedule and the manhole schedule), and resolved scope boundaries (single-side vs both-side kerb, net vs gross planting area) from the detail drawings.
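To make the earthwork distinction concrete, here is a minimal sketch of a grid-based cut/fill calculation next to the plan-area shortcut. It is not Civils.ai's implementation, which the test did not expose; the function names, the 2 × 2 grid of levels and the assumed depth are all hypothetical.

```python
def cut_fill_volumes(existing, design, cell_area):
    """Return (cut, fill) in m3 from two equally shaped grids of levels in metres.

    Each grid cell is treated as a prism of plan area `cell_area`; cut and
    fill are accumulated separately so they cannot cancel each other out.
    """
    cut = fill = 0.0
    for row_existing, row_design in zip(existing, design):
        for z_existing, z_design in zip(row_existing, row_design):
            depth = z_existing - z_design  # positive means ground is removed
            if depth > 0:
                cut += depth * cell_area
            else:
                fill += -depth * cell_area
    return cut, fill


def uniform_depth_volume(plan_area, assumed_depth):
    """The shortcut: one plan area multiplied by one assumed depth."""
    return plan_area * assumed_depth


# Hypothetical levels on a 2 x 2 grid of 100 m2 cells.
existing = [[10.0, 10.5], [11.0, 9.0]]
design = [[10.0, 10.0], [10.0, 10.0]]

print(cut_fill_volumes(existing, design, cell_area=100.0))
print(uniform_depth_volume(plan_area=400.0, assumed_depth=0.25))
```

In this toy case the grid method reports 150 m³ of cut and 100 m³ of fill, while the uniform-depth shortcut returns a single 100 m³ figure that says nothing about fill or haulage.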
2. Ground Truth — Chartered Quantity Surveyor
After all three estimators completed their takeoffs, an independent chartered quantity surveyor produced their own quantities from the same drawing package, working to CESMM4. Their figures were entered alongside the three machine sets, creating a direct line-by-line comparison across 45 non-lump-sum line items spanning earthwork, surfacing, landscape and utilities.
3. Results
3.1 Accuracy Distribution
| Error Band | Claude | ChatGPT | Civils.ai |
|---|---|---|---|
| ≤ ±5% error | 9 items (20%) | 8 items (18%) | 45 items (100%) |
| ±5–20% error | 4 items (9%) | 4 items (9%) | 0 items |
| > ±20% error | 32 items (71%) | 33 items (73%) | 0 items |
The two general-purpose LLMs show a strongly bimodal distribution: schedule-driven counts cluster near zero error; derived volumes, distributed counts and scope-dependent linear items sit in the extreme tails. Civils.ai shows no items outside ±3%.
3.2 Per-Measurement-Unit Accuracy (mean absolute error)
| Unit | Category | Claude | ChatGPT | Civils.ai |
|---|---|---|---|---|
| EA / nr | Discrete counts | 14% | 16% | 0.8% |
| t | Asphalt tonnage (from build-up) | 12% | 14% | 1.1% |
| m² | Areas | 24% | 27% | 1.2% |
| m³ | Volumes | 30% | 33% | 1.4% |
| LM | Linear (pipe, kerb, duct) | 31% | 34% | 1.3% |
For the LLMs, accuracy tracks how the quantity is derived, not the trade: anything read directly from a table (asphalt binder and surface courses from the pavement build-up, chamber counts from a schedule) is close, while anything requiring derivation from geometry (m³ volumes, m² surfaces) or aggregation across sheets (LM runs) drifts badly.
3.3 Full Line-Item Comparison
Each machine cell shows the quantity and its error against the QS ground truth. Positive = overestimate, negative = underestimate.
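The error figures in the tables follow the usual signed-percentage convention. A small sketch of that calculation and of the three error bands used in section 3.1 is given here; the function names are illustrative, and the sample quantities are taken from the line-item tables.

```python
def signed_error_pct(measured: float, truth: float) -> float:
    """Signed error against ground truth, in percent, to one decimal place."""
    return round((measured - truth) / truth * 100, 1)


def error_band(error_pct: float) -> str:
    """Assign an error to one of the three bands used in the distribution table."""
    magnitude = abs(error_pct)
    if magnitude <= 5:
        return "within ±5%"
    if magnitude <= 20:
        return "±5-20%"
    return "over ±20%"


# (item, QS truth, Claude quantity) taken from the tables in this section.
samples = [
    ("Bulk excavation (cut)", 12400, 7650),
    ("Precast concrete kerb", 2180, 3240),
    ("Manholes", 46, 43),
]
for item, truth, measured in samples:
    err = signed_error_pct(measured, truth)
    print(f"{item}: {err:+.1f}% ({error_band(err)})")
```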
Division 31 — Earthwork
| # | Item | Unit | QS (truth) | Claude | ChatGPT | Civils.ai |
|---|---|---|---|---|---|---|
| 1 | Site clearing & grubbing | m² | 18,500 | 18,300 (−1.1%) | 17,900 (−3.2%) | 18,450 (−0.3%) |
| 2 | Topsoil strip & stockpile | m³ | 4,625 | 3,240 (−29.9%) | 3,050 (−34.1%) | 4,600 (−0.5%) |
| 3 | Bulk excavation (cut) | m³ | 12,400 | 7,650 (−38.3%) | 6,900 (−44.4%) | 12,180 (−1.8%) |
| 4 | Engineered / compacted fill | m³ | 9,750 | 6,050 (−37.9%) | 5,400 (−44.6%) | 9,900 (+1.5%) |
| 5 | Cut-to-fill haulage | m³ | 12,400 | 7,650 (−38.3%) | 6,900 (−44.4%) | 12,300 (−0.8%) |
| 6 | Pavement box excavation | m³ | 5,280 | 4,650 (−11.9%) | 4,400 (−16.7%) | 5,260 (−0.4%) |
| 7 | Utility trench excavation | m³ | 3,140 | 1,940 (−38.2%) | 1,780 (−43.3%) | 3,090 (−1.6%) |
| 8 | Geotextile separation layer | m² | 14,200 | 10,800 (−23.9%) | 10,200 (−28.2%) | 14,050 (−1.1%) |
Division 32 — Surfacing, Paving & Landscaping
| # | Item | Unit | QS (truth) | Claude | ChatGPT | Civils.ai |
|---|---|---|---|---|---|---|
| 9 | Sub-base (Type 1 granular) | m³ | 3,960 | 3,010 (−24.0%) | 2,850 (−28.0%) | 3,920 (−1.0%) |
| 10 | Asphalt base course | t | 2,850 | 2,150 (−24.6%) | 2,050 (−28.1%) | 2,890 (+1.4%) |
| 11 | Asphalt binder course | t | 1,690 | 1,700 (+0.6%) | 1,640 (−3.0%) | 1,675 (−0.9%) |
| 12 | Asphalt surface course | t | 1,120 | 1,110 (−0.9%) | 1,090 (−2.7%) | 1,130 (+0.9%) |
| 13 | Carriageway area | m² | 8,800 | 8,650 (−1.7%) | 8,400 (−4.5%) | 8,780 (−0.2%) |
| 14 | Footway & cycleway paving | m² | 3,420 | 2,020 (−40.9%) | 1,900 (−44.4%) | 3,380 (−1.2%) |
| 15 | Block paving (plazas) | m² | 1,240 | 970 (−21.8%) | 940 (−24.2%) | 1,250 (+0.8%) |
| 16 | Tactile paving | m² | 96 | 62 (−35.4%) | 58 (−39.6%) | 94 (−2.1%) |
| 17 | Precast concrete kerb | LM | 2,180 | 3,240 (+48.6%) | 3,400 (+56.0%) | 2,210 (+1.4%) |
| 18 | Edging & channel | LM | 1,560 | 2,010 (+28.8%) | 2,140 (+37.2%) | 1,540 (−1.3%) |
| 19 | Kerb bedding concrete | m³ | 218 | 292 (+33.9%) | 306 (+40.4%) | 221 (+1.4%) |
| 20 | Road line marking | LM | 6,400 | 4,100 (−35.9%) | 3,800 (−40.6%) | 6,320 (−1.3%) |
| 21 | Road studs | EA | 240 | 150 (−37.5%) | 132 (−45.0%) | 236 (−1.7%) |
| 22 | Traffic signs | EA | 34 | 27 (−20.6%) | 25 (−26.5%) | 34 (0.0%) |
| 23 | Street lighting columns | EA | 42 | 36 (−14.3%) | 38 (−9.5%) | 42 (0.0%) |
| 24 | Topsoil to soft landscape | m³ | 2,760 | 3,900 (+41.3%) | 4,100 (+48.6%) | 2,720 (−1.4%) |
| 25 | Turfing | m² | 6,200 | 9,150 (+47.6%) | 9,600 (+54.8%) | 6,280 (+1.3%) |
| 26 | Hydroseeding | m² | 3,100 | 1,780 (−42.6%) | 1,650 (−46.8%) | 3,060 (−1.3%) |
| 27 | Shrub & herbaceous planting | nr | 1,840 | 1,180 (−35.9%) | 1,060 (−42.4%) | 1,820 (−1.1%) |
| 28 | Standard trees | nr | 128 | 128 (0.0%) | 126 (−1.6%) | 128 (0.0%) |
| 29 | Semi-mature trees | nr | 24 | 24 (0.0%) | 24 (0.0%) | 24 (0.0%) |
| 30 | Tree pits with root barrier | nr | 152 | 120 (−21.1%) | 118 (−22.4%) | 150 (−1.3%) |
| 31 | Mulch / bark | m³ | 340 | 214 (−37.1%) | 196 (−42.4%) | 345 (+1.5%) |
| 32 | Irrigation drip line | LM | 4,200 | 2,380 (−43.3%) | 2,200 (−47.6%) | 4,140 (−1.4%) |
| 33 | Planter bed edging | LM | 980 | 1,240 (+26.5%) | 1,290 (+31.6%) | 990 (+1.0%) |
Division 33 — Utilities
| # | Item | Unit | QS (truth) | Claude | ChatGPT | Civils.ai |
|---|---|---|---|---|---|---|
| 34 | Storm drain pipe Ø300 | LM | 1,420 | 960 (−32.4%) | 900 (−36.6%) | 1,400 (−1.4%) |
| 35 | Storm drain pipe Ø450 | LM | 860 | 660 (−23.3%) | 620 (−27.9%) | 870 (+1.2%) |
| 36 | Storm drain pipe Ø600 | LM | 540 | 440 (−18.5%) | 420 (−22.2%) | 545 (+0.9%) |
| 37 | Catchpits & gullies | EA | 88 | 52 (−40.9%) | 48 (−45.5%) | 87 (−1.1%) |
| 38 | Manholes | EA | 46 | 43 (−6.5%) | 42 (−8.7%) | 46 (0.0%) |
| 39 | Headwalls & outfalls | EA | 6 | 6 (0.0%) | 6 (0.0%) | 6 (0.0%) |
| 40 | Pipe bedding & surround | m³ | 1,180 | 750 (−36.4%) | 700 (−40.7%) | 1,160 (−1.7%) |
| 41 | Water main Ø150 | LM | 1,240 | 1,180 (−4.8%) | 1,120 (−9.7%) | 1,250 (+0.8%) |
| 42 | Comms ducting (multi-way) | LM | 2,860 | 1,700 (−40.6%) | 1,560 (−45.5%) | 2,820 (−1.4%) |
| 43 | Draw pits / joint boxes | EA | 64 | 40 (−37.5%) | 36 (−43.8%) | 63 (−1.6%) |
| 44 | Valve chambers | EA | 18 | 18 (0.0%) | 18 (0.0%) | 18 (0.0%) |
| 45 | Utility trench reinstatement | m² | 3,980 | 2,510 (−36.9%) | 2,350 (−41.0%) | 3,930 (−1.3%) |
3.4 Cost Impact (Priced Bid)
Indicative mid-range unit rates (Spon's Civil Engineering and Highway Works Price Book, GBP) were applied to all four quantity sets, including a standard 30% addition for contingency, overheads and profit, and levies.
| Division | QS (truth) | Claude | ChatGPT | Civils.ai |
|---|---|---|---|---|
| Div 01 — General / Prelims (lump sum) | £320K | £315K (−1.6%) | £305K (−4.7%) | £318K (−0.6%) |
| Div 31 — Earthwork | £980K | £545K (−44.4%) | £505K (−48.5%) | £960K (−2.0%) |
| Div 32 — Surfacing / Paving / Landscape | £1,650K | £970K (−41.2%) | £900K (−45.5%) | £1,588K (−3.8%) |
| Div 33 — Utilities | £720K | £410K (−43.1%) | £375K (−47.9%) | £694K (−3.6%) |
| Total | £3,670K | £2,240K (−39.0%) | £2,085K (−43.2%) | £3,560K (−3.0%) |
Civils.ai landed within 3% of the priced QS bid (97.0% accurate). Both general-purpose LLMs came in ≈40% low at the bid level — a gap of roughly £1.4M–£1.6M against the QS figure, which on a tender of this size is the difference between a credible bid and a non-compliant one.
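The pricing step can be sketched as quantity times unit rate, summed, then uplifted by 30%. The article does not publish the unit rates, so the rates in the example below are placeholders, not Spon's figures. The bid-level deviations, by contrast, use the totals from the table above.

```python
def priced_total(quantities, rates, uplift=0.30):
    """Sum quantity x unit rate over all items, then apply the percentage addition."""
    base = sum(qty * rates[item] for item, qty in quantities.items())
    return base * (1 + uplift)


def bid_deviation_pct(bid, truth):
    """Signed deviation of a priced bid from the QS bid, in percent."""
    return round((bid - truth) / truth * 100, 1)


# Placeholder rates (GBP per unit), purely for illustration.
quantities = {"bulk_cut_m3": 12400, "kerb_lm": 2180}
rates = {"bulk_cut_m3": 8.0, "kerb_lm": 30.0}
print(round(priced_total(quantities, rates), 2))

# Totals in GBP thousands, from the cost table.
qs_total = 3670
for name, total in [("Claude", 2240), ("ChatGPT", 2085), ("Civils.ai", 3560)]:
    print(name, bid_deviation_pct(total, qs_total))
```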
3.5 Top Cost Misses — Claude
The five categories below account for roughly £656K of under-measurement. Net of about £115K in offsetting overestimates, that is roughly £540K of Claude's £1.43M total gap; the remainder is distributed across the other 30-plus line items.
| Rank | Category | QS | Claude | Gap | Cause |
|---|---|---|---|---|---|
| 1 | Bulk earthworks (cut / fill / haul / disposal) | £560K | £330K | −£230K | Volume-from-plan (uniform assumed depth) |
| 2 | Road surfacing (sub-base + base course) | £430K | £300K | −£130K | Under-read pavement build-up thickness |
| 3 | Storm drainage (pipe + bedding + gullies) | £310K | £185K | −£125K | Primary runs only; missed laterals |
| 4 | Comms ducting + trench reinstatement | £240K | £142K | −£98K | Missed multi-way duct banks |
| 5 | Soft-landscape planting + irrigation | £205K | £132K | −£73K | Distributed plan counts under-read |
| — | Offsetting overestimates (kerb/edging, turf/topsoil) | — | — | +£115K | Full-perimeter & gross-area defaults |
4. Systematic Error Taxonomy
Every large LLM error in this test maps to one of six repeatable failure modes. Civils.ai's engine is built to neutralise all six.
| Error type | Root cause | Affected items | Direction |
|---|---|---|---|
| Volume-from-plan | Plan area × single assumed depth instead of integrating cut/fill from contours and spot levels | Bulk cut, fill, haul, topsoil strip | Under |
| Layer-thickness under-read | Read a thinner pavement / sub-base than the specified build-up | Sub-base, asphalt base course | Under |
| Primary-run-only | Counted main pipe/duct runs; missed laterals, branches and connections | Storm laterals, comms ducting, gullies | Under |
| Distributed-count miss | Missed elements spread across many plan sheets | Shrubs, gullies, draw pits, signs, tree pits | Under |
| Full-perimeter default | Applied kerb/edging around the full perimeter and to both sides | Kerb, edging, channel, kerb bedding | Over |
| Gross-vs-net area | Used gross site area for soft landscape rather than net planting zones | Turf, topsoil placement | Over |
The two over-estimating modes (kerb perimeter, gross area) partly mask the under-estimating modes at the total-cost level, which is why the LLM bids can look "only" 40% low despite most individual line items being far more wrong in one direction or the other. This netting-out is itself a hazard: it makes a badly constructed takeoff appear more plausible than it is.
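The netting effect can be illustrated with the gap figures from the Top Cost Misses table. The sketch below compares the signed sum, which is what a bid total shows, with the sum of absolute gaps, which is closer to how wrong the takeoff actually is. The dictionary keys are shorthand labels, not BoQ item names.

```python
# Claude's cost gaps in GBP thousands, from the Top Cost Misses table.
gaps_k = {
    "bulk earthworks": -230,
    "road surfacing": -130,
    "storm drainage": -125,
    "comms ducting + reinstatement": -98,
    "planting + irrigation": -73,
    "kerb/edging and turf/topsoil overestimates": +115,
}

net_gap = sum(gaps_k.values())                    # what the bid total reveals
gross_gap = sum(abs(g) for g in gaps_k.values())  # total mis-measured value

print(f"net gap:   {net_gap:+}K")
print(f"gross gap: {gross_gap}K")
```

Across these six rows the signed total is −£541K, while the absolute total is £771K, so roughly £230K of error is hidden by overestimates cancelling underestimates.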
5. Takeaways
Where general-purpose LLMs work well
- Schedule-driven counts. Standard and semi-mature trees, manholes, valve chambers and headwalls — anything read directly from a clean tabular schedule — came back exact or near-exact.
- Tonnage from an explicit build-up. Asphalt binder and surface courses, priced from the stated pavement build-up, were within 3%.
- Simple plan areas with a clear callout. Carriageway area and site clearing were close because they derive from a single dimensioned boundary.
- Lump sums. General-conditions items default to 1-each and are always correct.
Where general-purpose LLMs struggle today
- Derived volumes. Bulk earthwork is the single worst category. Both LLMs multiplied a plan area by an assumed uniform depth; neither reconstructed cut and fill from the grading surface. Result: 38–45% underestimate on the highest-value items in the bill.
- Networked linear work. Drainage and ducting were read as their primary runs only. Laterals, branch connections and multi-way duct banks — spread across multiple sheets — were missed, producing 30–46% underestimates.
- Distributed counts. Shrubs, gullies, draw pits and signs are enumerated across many plan sheets rather than in one schedule; the LLMs consistently under-counted them.
- Scope-boundary judgement. Kerb, edging and channel were pushed to a full-perimeter, both-sides default (+29% to +56%), while soft-landscape areas used gross site area rather than net planting zones (+41% to +55%).
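A short sketch of the scope-boundary point: kerb length depends on how many sides of each stretch are actually kerbed, and soft-landscape area is the gross area less exclusions. All figures below are hypothetical and the function names are illustrative.

```python
def kerb_length(runs):
    """Total kerb length from (centreline_length_m, kerbed_sides) pairs."""
    return sum(length * sides for length, sides in runs)


def net_area(gross_area_m2, excluded_areas_m2):
    """Gross area minus hard surfaces, planting beds and other exclusions."""
    return gross_area_m2 - sum(excluded_areas_m2)


# Hypothetical road split into stretches with different kerb scope:
# kerbed both sides, kerbed one side, and unkerbed.
runs = [(400, 2), (500, 1), (200, 0)]
both_sides_default = sum(length * 2 for length, _ in runs)

print(kerb_length(runs), both_sides_default)
print(net_area(10_000, [2_500, 1_300]))
```

With these made-up figures the both-sides default gives 2,200 m against a resolved 1,300 m, the same direction of error as the kerb overestimates reported above.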
Why Civils.ai reaches 97%
Civils.ai is purpose-built for civil takeoff, so the six failure modes above are handled by design rather than by luck:
- Earthwork from levels, not plan area. It reconstructs the cut/fill surface from contours and spot levels — the root fix for the largest category of LLM error.
- Plan + section + schedule reconciliation. Every measured item is cross-checked across all three view types, so pipe runs match the pipe schedule and chamber counts match the manhole schedule.
- Network tracing. It follows drainage and duct networks through laterals and branches instead of counting only the trunk run.
- CESMM/CSI-aware scope resolution. Kerb sides, net-vs-gross planting and pavement build-up thickness are resolved from the standard method of measurement and the detail drawings, not from a default.
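Network tracing, in its simplest form, is a graph traversal that sums every connected segment rather than only the trunk. The sketch below is a generic breadth-first traversal under that assumption; it is not Civils.ai's algorithm, and the node names and lengths are invented.

```python
from collections import deque


def traced_length(segments, start):
    """Sum the length of every segment reachable from `start`.

    `segments` is a list of (node_a, node_b, length_m) tuples. Each segment
    is identified by its list index so parallel runs are counted separately.
    """
    adjacency = {}
    for index, (a, b, length) in enumerate(segments):
        adjacency.setdefault(a, []).append((b, length, index))
        adjacency.setdefault(b, []).append((a, length, index))

    seen_nodes, seen_segments = {start}, set()
    queue = deque([start])
    total = 0.0
    while queue:
        node = queue.popleft()
        for neighbour, length, index in adjacency.get(node, []):
            if index not in seen_segments:
                seen_segments.add(index)
                total += length
            if neighbour not in seen_nodes:
                seen_nodes.add(neighbour)
                queue.append(neighbour)
    return total


# Hypothetical storm network: two trunk segments plus three gully laterals.
trunk = [("OUTFALL", "MH1", 60.0), ("MH1", "MH2", 55.0)]
laterals = [("MH1", "G1", 8.0), ("MH2", "G2", 9.5), ("MH2", "G3", 7.5)]

print(traced_length(trunk, "OUTFALL"))             # primary runs only
print(traced_length(trunk + laterals, "OUTFALL"))  # trunk plus laterals
```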
Economic Assessment
A general-purpose LLM takeoff costs ~$3–$4 in API usage and takes under an hour. That is genuinely useful as a first-pass index — it names every line item and flags where the risk sits. But at a 40% bid-level underestimate, it is not a tender-ready quantity set, and the errors are systematic rather than random, so they will not "average out."
A chartered QS billing at £80–£150/hour would spend 20–45+ hours on a takeoff of this size, or roughly £1,600–£6,800 in labour. Civils.ai reproduced the QS bid to within 3% in a fraction of that time. The strongest workflow is for Civils.ai to produce the measured takeoff and for a QS to review and sign off the high-value earthwork, drainage and surfacing lines: a review measured in hours, not days.
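The labour range is simple arithmetic on the quoted hours and rates. As a worked check, with an illustrative helper name:

```python
def labour_cost_range(hours, rate):
    """Return (low, high) cost from (min, max) hours and (min, max) hourly rate."""
    return hours[0] * rate[0], hours[1] * rate[1]


low, high = labour_cost_range(hours=(20, 45), rate=(80, 150))
print(f"£{low:,} to £{high:,}")
```

The upper bound works out at £6,750, which the article rounds to £6,800; the "45+" in the hours estimate means the true ceiling may be higher.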
Recommended workflow
- Use a purpose-built civil engine for the measured takeoff. General LLMs are best kept to a sanity-check / line-item-discovery role.
- Always human-review the three highest-risk categories: bulk earthworks, networked drainage/ducting, and pavement/surfacing build-ups.
- Derive earthwork from levels, never from plan area × assumed depth.
- Resolve scope boundaries explicitly — kerb sides, net vs gross planting area, pavement layer thicknesses — before pricing.
Frequently Asked Questions
Is Claude (or ChatGPT) accurate enough for real civil quantity takeoffs?
Not as a standalone tender basis. In this test both general-purpose LLMs came in about 40% low at the priced-bid level, and 71% (Claude) to 73% (ChatGPT) of their line items exceeded ±20% error. They are reliable only on schedule-driven counts and on quantities read directly from an explicit build-up. For bulk earthworks, networked drainage and ducting, and soft-landscape areas they are systematically wrong. The appropriate role is a first-pass index that a professional, or a purpose-built engine, then corrects.
How is Civils.ai able to reach 97% when general LLMs reach ~60%?
Civils.ai is purpose-built for civil takeoff. It reconstructs earthwork volumes from grading contours and spot levels rather than multiplying plan area by an assumed depth; it traces drainage and duct networks through their laterals and branches; and it cross-references plan, section and schedule for every item while resolving CESMM/CSI scope boundaries. These four capabilities target the six failure modes in the error taxonomy, which is why the accuracy gap is largest on the highest-value lines in the bill.
Why did the LLMs underestimate bulk earthworks so heavily?
Both derived volume from a plan area multiplied by a single assumed depth, rather than integrating cut and fill across the grading surface. On this scheme, which has meaningful level changes, that method understated the true cut and fill volumes by 38–45%. It is a methodological error, not a page-reading error, which is why it is so consistent across both models.
Why did the LLMs miss drainage and ducting?
They read the primary runs shown on the layout sheets and stopped there. Lateral connections to gullies, branch runs and multi-way duct banks are distributed across several sheets and enumerated in schedules the models did not fully reconcile. The result was a 30–46% underestimate on pipe length, gully count, duct length and draw-pit count.
What did the general LLMs actually get right?
Discrete counts pulled straight from a clean schedule — standard and semi-mature trees, manholes, valve chambers, headwalls — were exact or near-exact. Asphalt binder and surface tonnage, priced from the stated pavement build-up, were within 3%. Lump-sum general-conditions items were correct by default. The pattern is clear: LLMs are accurate where the answer is a number in a table, and inaccurate where the answer must be derived from geometry or aggregated across sheets.
How does the ~$4 API cost compare to professional estimating cost?
A chartered QS would spend 20–45+ hours on a comparable takeoff, representing roughly £1,600–£6,800 in labour. The ~$4 LLM cost is not a substitute for that expertise — at a 40% underestimate it is not tender-ready — but it compresses the data-gathering phase. The strongest economics come from a purpose-built engine like Civils.ai that reaches QS-grade accuracy, with human review reserved for the high-value earthwork, drainage and surfacing lines.
Would these results apply to other civil project types?
Largely, yes. The earthwork finding generalises to any scheme with meaningful level changes — the plan-area-times-assumed-depth error is inherent to how general LLMs approach volume. The drainage/ducting and distributed-count findings apply to any project with a real utility network. Errors shrink on simple, flat, rectilinear sites where plan area and true surface converge and networks are trivial; they grow on graded, layered or network-heavy schemes — which describes most real landscaping, utilities and road-surfacing work.
Mary Janine L. Kamenić
Julianna Widlund P.E
Stevan Lukic CEng