Can General-Purpose LLMs Perform Accurate Civil Estimates? Claude vs ChatGPT vs Civils.ai

Claude, ChatGPT, and Civils.ai go head-to-head in a blind civil quantity-takeoff test against a chartered QS's ground-truth Bill of Quantities.
May 16, 2026
Mary Janine L. Kamenić
Julianna Widlund P.E
Stevan Lukic CEng

We ran a controlled, blind quantity-takeoff test on a live landscaping, utilities and road-surfacing scheme. Two general-purpose LLMs (Claude Sonnet 4.5 and GPT-4-class ChatGPT) and Civils.ai were each given the same Issued-for-Construction (IFC) drawing package and the same blank Bill of Quantities templates. A third-party chartered quantity surveyor produced the ground-truth quantities. This article reports every line item, every error band, and the priced-bid impact.

Key Facts

| Topic | Key Finding |
|---|---|
| Overall bid accuracy: Civils.ai | 97.0% accurate; priced bid within 3% of the QS ground truth |
| Overall bid accuracy: Claude | 39% underestimate: £2.24M vs QS £3.67M |
| Overall bid accuracy: ChatGPT | 43% underestimate: £2.09M vs QS £3.67M |
| Bulk earthwork volume | Both LLMs underestimated by 38–45% (volume derived from plan area × assumed uniform depth instead of integrating cut/fill from grading contours); Civils.ai within 2% |
| Road surfacing (sub-base + base course) | LLMs underestimated 24–28% (read thinner build-up than the pavement spec); Civils.ai within 1.5% |
| Drainage & comms ducting | LLMs counted primary runs only and missed laterals, branches and multi-way duct banks, a 30–46% underestimate; Civils.ai within 1.5% |
| Kerbing, edging & channel | LLMs overestimated (+29% to +56%) via a full-perimeter / both-sides default; Civils.ai within 1.4% |
| Soft landscape | LLMs over-estimated turf and topsoil (gross site area) by 41–55% but under-estimated planting and irrigation (distributed plan counts) by 36–48% |
| Line items within ±5%: Claude | 9 of 45 (20%) |
| Line items within ±5%: ChatGPT | 8 of 45 (18%) |
| Line items within ±3%: Civils.ai | 45 of 45 (100%) |
| Line items exceeding ±20% | Claude 32 of 45 (71%); ChatGPT 33 of 45 (73%); Civils.ai 0 of 45 |
| Drawing package size | 287 pages across civil, drainage and landscape sets |
| Largest single cost miss (Claude) | Bulk earthworks: ~£230K of the £1.43M total gap |
| API cost for a full LLM takeoff | ~$4 (Claude) / ~$3 (ChatGPT) |

Key Terms

IFC (Issued for Construction): The final, approved version of construction drawings released to contractors for tendering and building. IFC documents represent complete design intent and are the authoritative source for quantity takeoffs.
CESMM4 / SMM: The Civil Engineering Standard Method of Measurement (4th edition) and its building-works counterparts define how civil quantities are measured, described and itemised in a Bill of Quantities. They are the standard framework for road, drainage, earthwork and public-realm measurement.
Bill of Quantities (BoQ): An itemised, priced schedule of the materials, labour and works required for a project, structured by trade or division. It is the basis on which contractors tender.
CSI MasterFormat Divisions 31–33: The North American numbering convention used to organise site and infrastructure work — Division 31 (Earthwork), Division 32 (Exterior Improvements: paving, surfacing, planting, irrigation) and Division 33 (Utilities). We mapped the BoQ to both CESMM4 and CSI so the results are portable across markets.
Cut and Fill / Earthwork Balance: The calculation of excavation (cut) and placement (fill) volumes derived by comparing existing and proposed ground levels across a grading surface. Accurate earthwork volumes require integrating depth across the whole surface, not multiplying plan area by a single assumed depth.
Spon's Civil Engineering and Highway Works Price Book: A standard reference of unit rates for civil, highway, drainage and landscape works, used here to price both quantity sets and benchmark the bid-level impact.

1. Approach

The experiment was designed as a controlled, blind accuracy test. Each estimator received only the IFC drawing package and two blank BoQ templates — no hints, no reference quantities, and no human guidance on what to count or how to measure it.

Source Materials

| Document | Size | Contents |
|---|---|---|
| Civil Engineering Drawings (roads & earthworks) | 112 pages, 68 MB PDF | Setting-out, grading plans, road long-sections and cross-sections, pavement build-up details, kerb and drainage details |
| Drainage & Utilities Drawings | 94 pages, 156 MB PDF | Storm and foul layouts, pipe and manhole schedules, catchpit/gully layouts, duct routes, chamber details |
| Landscape Drawings | 81 pages, 210 MB PDF | Planting plans, planting and tree schedules, hard-landscape layouts, irrigation routes, tree-pit details |
| Drawing Register (IFC) | .xlsx | Index of all drawing numbers and PDF page locations |
| Roads & Earthworks takeoff template | | CESMM4 / CSI-formatted BoQ, quantities blank |
| Utilities & Landscape takeoff template | | CESMM4 / CSI-formatted BoQ, quantities blank |

Project: Northgate Link Road, Utilities & Public Realm Scheme — a ~1.1 km link road with associated storm drainage, wet and dry utilities, and a public-realm/landscape package (turf, planting, irrigation, hard landscape). Representative of a mid-scale civils and infrastructure project.

How the general-purpose LLMs worked — agentic file mode

Both Claude and ChatGPT ran in an agentic, sandboxed Linux environment with a Bash shell, file read/write and web search. Each followed broadly the same workflow:

  1. Read the drawing register to build a page-number index.
  2. Used pdftotext to extract machine-readable text from schedule pages (pipe schedules, manhole schedules, planting and tree schedules).
  3. Used pdftoppm to render plan, section and elevation pages as 200 DPI PNG images for visual scale reading.
  4. Wrote a Python script using openpyxl to populate both BoQ templates while preserving existing formulas and formatting.

Each ran in a single coherent session. Estimated API cost: ~$4 (Claude) and ~$3 (ChatGPT).
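Steps 2 and 3 of that workflow can be sketched as follows. This is a minimal illustration rather than the commands either model actually issued: the file names and page numbers are hypothetical, and the functions only build the Poppler command lines (`pdftotext` with `-layout`, `pdftoppm` at 200 DPI) without running them.

```python
from pathlib import Path


def text_extract_cmd(pdf: Path, page: int, out_txt: Path) -> list[str]:
    """Build a pdftotext call for one schedule page, keeping the column layout."""
    return ["pdftotext", "-f", str(page), "-l", str(page), "-layout",
            str(pdf), str(out_txt)]


def render_cmd(pdf: Path, page: int, out_prefix: Path, dpi: int = 200) -> list[str]:
    """Build a pdftoppm call that renders one page to PNG at the given DPI."""
    return ["pdftoppm", "-f", str(page), "-l", str(page), "-r", str(dpi),
            "-png", str(pdf), str(out_prefix)]


# Hypothetical file names and page numbers, for illustration only.
pdf = Path("drainage_utilities.pdf")
schedule_cmd = text_extract_cmd(pdf, 12, Path("pipe_schedule.txt"))
plan_cmd = render_cmd(pdf, 3, Path("storm_layout"))

print(" ".join(schedule_cmd))
print(" ".join(plan_cmd))
```

Where Poppler is installed, each list can be passed to `subprocess.run(cmd, check=True)`. Text extraction suits tabular schedule pages; plans and sections have to be rendered as images because their quantities live in geometry, not text.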

How Civils.ai worked — purpose-built takeoff engine

Civils.ai ingested the same IFC package into its civil-specific takeoff engine. Rather than reading pages one at a time, it vectorised the plan and section sheets, auto-classified every drawing against CESMM4/CSI divisions, and — critically for this scope — reconstructed the earthwork surface from the grading contours and spot levels rather than assuming a uniform depth. It cross-referenced plan, section and schedule for every measured item (for example, reconciling pipe runs on the drainage layout against the pipe schedule and the manhole schedule), and resolved scope boundaries (single-side vs both-side kerb, net vs gross planting area) from the detail drawings.

2. Ground Truth — Chartered Quantity Surveyor

After all three estimators completed their takeoffs, an independent chartered quantity surveyor produced their own quantities from the same drawing package, working to CESMM4. Their figures were entered alongside the three machine sets, creating a direct line-by-line comparison across 45 non-lump-sum line items spanning earthwork, surfacing, landscape and utilities.

3. Results

3.1 Accuracy Distribution

| Error Band | Claude | ChatGPT | Civils.ai |
|---|---|---|---|
| ≤ ±5% error | 9 items (20%) | 8 items (18%) | 45 items (100%) |
| ±5–20% error | 4 items (9%) | 4 items (9%) | 0 items |
| > ±20% error | 32 items (71%) | 33 items (73%) | 0 items |

The two general-purpose LLMs show a strongly bimodal distribution: schedule-driven counts cluster near zero error; derived volumes, distributed counts and scope-dependent linear items sit in the extreme tails. Civils.ai shows no items outside ±3%.
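The banding itself is simple to reproduce. The sketch below assumes the band edges are inclusive at 5% and 20% (the article does not state how boundary values are treated) and, to keep it short, classifies only Claude's twelve Division 33 errors from the line-item comparison.

```python
def band(error_pct: float) -> str:
    """Classify a signed percentage error into one of the three bands."""
    size = abs(error_pct)
    if size <= 5:
        return "<=5%"
    if size <= 20:
        return "5-20%"
    return ">20%"


def band_counts(errors: list[float]) -> dict[str, int]:
    """Count how many errors fall into each band."""
    counts = {"<=5%": 0, "5-20%": 0, ">20%": 0}
    for error in errors:
        counts[band(error)] += 1
    return counts


# Claude's signed errors (%) for items 34-45 (Division 33, Utilities).
claude_div33 = [-32.4, -23.3, -18.5, -40.9, -6.5, 0.0,
                -36.4, -4.8, -40.6, -37.5, 0.0, -36.9]

print(band_counts(claude_div33))  # {'<=5%': 3, '5-20%': 2, '>20%': 7}
```

Even within one division the split is visible: the three items in the lowest band are two schedule counts (headwalls, valve chambers) and the water main, while most pipe runs and distributed counts land above 20%.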

3.2 Per-Measurement-Unit Accuracy (mean absolute error)

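| Unit | Category | Claude | ChatGPT | Civils.ai |
|---|---|---|---|---|
| EA / nr | Discrete counts | 14% | 16% | 0.8% |
| t | Asphalt tonnage (from build-up) | 12% | 14% | 1.1% |
| m² | Areas | 24% | 27% | 1.2% |
| m³ | Volumes | 30% | 33% | 1.4% |
| LM | Linear (pipe, kerb, duct) | 31% | 34% | 1.3% |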

For the LLMs, accuracy tracks how the quantity is derived, not the trade: anything read directly from a table (asphalt courses from the pavement build-up, chamber counts from a schedule) is close, while anything requiring derivation from geometry (m³ volumes, m² surfaces) or aggregation across sheets (LM runs) drifts badly.

3.3 Full Line-Item Comparison

Each machine cell shows the quantity and its error against the QS ground truth. Positive = overestimate, negative = underestimate.
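For reference, the signed error in each cell follows the usual definition, (estimate − truth) / truth. A minimal sketch, checked against two rows of the tables in this section:

```python
def signed_error_pct(estimate: float, truth: float) -> float:
    """Signed error of an estimate against the ground truth, in percent."""
    return (estimate - truth) / truth * 100.0


# Item 3, bulk excavation (cut): QS 12,400 vs Claude 7,650.
print(round(signed_error_pct(7650, 12400), 1))  # -38.3

# Item 17, precast concrete kerb: QS 2,180 vs ChatGPT 3,400.
print(round(signed_error_pct(3400, 2180), 1))   # 56.0
```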

Division 31 — Earthwork

| # | Item | Unit | QS (truth) | Claude | ChatGPT | Civils.ai |
|---|---|---|---|---|---|---|
| 1 | Site clearing & grubbing | | 18,500 | 18,300 (−1.1%) | 17,900 (−3.2%) | 18,450 (−0.3%) |
| 2 | Topsoil strip & stockpile | | 4,625 | 3,240 (−29.9%) | 3,050 (−34.1%) | 4,600 (−0.5%) |
| 3 | Bulk excavation (cut) | | 12,400 | 7,650 (−38.3%) | 6,900 (−44.4%) | 12,180 (−1.8%) |
| 4 | Engineered / compacted fill | | 9,750 | 6,050 (−37.9%) | 5,400 (−44.6%) | 9,900 (+1.5%) |
| 5 | Cut-to-fill haulage | | 12,400 | 7,650 (−38.3%) | 6,900 (−44.4%) | 12,300 (−0.8%) |
| 6 | Pavement box excavation | | 5,280 | 4,650 (−11.9%) | 4,400 (−16.7%) | 5,260 (−0.4%) |
| 7 | Utility trench excavation | | 3,140 | 1,940 (−38.2%) | 1,780 (−43.3%) | 3,090 (−1.6%) |
| 8 | Geotextile separation layer | | 14,200 | 10,800 (−23.9%) | 10,200 (−28.2%) | 14,050 (−1.1%) |

Division 32 — Surfacing, Paving & Landscaping

| # | Item | Unit | QS (truth) | Claude | ChatGPT | Civils.ai |
|---|---|---|---|---|---|---|
| 9 | Sub-base (Type 1 granular) | | 3,960 | 3,010 (−24.0%) | 2,850 (−28.0%) | 3,920 (−1.0%) |
| 10 | Asphalt base course | t | 2,850 | 2,150 (−24.6%) | 2,050 (−28.1%) | 2,890 (+1.4%) |
| 11 | Asphalt binder course | t | 1,690 | 1,700 (+0.6%) | 1,640 (−3.0%) | 1,675 (−0.9%) |
| 12 | Asphalt surface course | t | 1,120 | 1,110 (−0.9%) | 1,090 (−2.7%) | 1,130 (+0.9%) |
| 13 | Carriageway area | | 8,800 | 8,650 (−1.7%) | 8,400 (−4.5%) | 8,780 (−0.2%) |
| 14 | Footway & cycleway paving | | 3,420 | 2,020 (−40.9%) | 1,900 (−44.4%) | 3,380 (−1.2%) |
| 15 | Block paving (plazas) | | 1,240 | 970 (−21.8%) | 940 (−24.2%) | 1,250 (+0.8%) |
| 16 | Tactile paving | | 96 | 62 (−35.4%) | 58 (−39.6%) | 94 (−2.1%) |
| 17 | Precast concrete kerb | LM | 2,180 | 3,240 (+48.6%) | 3,400 (+56.0%) | 2,210 (+1.4%) |
| 18 | Edging & channel | LM | 1,560 | 2,010 (+28.8%) | 2,140 (+37.2%) | 1,540 (−1.3%) |
| 19 | Kerb bedding concrete | | 218 | 292 (+33.9%) | 306 (+40.4%) | 221 (+1.4%) |
| 20 | Road line marking | LM | 6,400 | 4,100 (−35.9%) | 3,800 (−40.6%) | 6,320 (−1.3%) |
| 21 | Road studs | EA | 240 | 150 (−37.5%) | 132 (−45.0%) | 236 (−1.7%) |
| 22 | Traffic signs | EA | 34 | 27 (−20.6%) | 25 (−26.5%) | 34 (0.0%) |
| 23 | Street lighting columns | EA | 42 | 36 (−14.3%) | 38 (−9.5%) | 42 (0.0%) |
| 24 | Topsoil to soft landscape | | 2,760 | 3,900 (+41.3%) | 4,100 (+48.6%) | 2,720 (−1.4%) |
| 25 | Turfing | | 6,200 | 9,150 (+47.6%) | 9,600 (+54.8%) | 6,280 (+1.3%) |
| 26 | Hydroseeding | | 3,100 | 1,780 (−42.6%) | 1,650 (−46.8%) | 3,060 (−1.3%) |
| 27 | Shrub & herbaceous planting | nr | 1,840 | 1,180 (−35.9%) | 1,060 (−42.4%) | 1,820 (−1.1%) |
| 28 | Standard trees | nr | 128 | 128 (0.0%) | 126 (−1.6%) | 128 (0.0%) |
| 29 | Semi-mature trees | nr | 24 | 24 (0.0%) | 24 (0.0%) | 24 (0.0%) |
| 30 | Tree pits with root barrier | nr | 152 | 120 (−21.1%) | 118 (−22.4%) | 150 (−1.3%) |
| 31 | Mulch / bark | | 340 | 214 (−37.1%) | 196 (−42.4%) | 345 (+1.5%) |
| 32 | Irrigation drip line | LM | 4,200 | 2,380 (−43.3%) | 2,200 (−47.6%) | 4,140 (−1.4%) |
| 33 | Planter bed edging | LM | 980 | 1,240 (+26.5%) | 1,290 (+31.6%) | 990 (+1.0%) |

Division 33 — Utilities

| # | Item | Unit | QS (truth) | Claude | ChatGPT | Civils.ai |
|---|---|---|---|---|---|---|
| 34 | Storm drain pipe Ø300 | LM | 1,420 | 960 (−32.4%) | 900 (−36.6%) | 1,400 (−1.4%) |
| 35 | Storm drain pipe Ø450 | LM | 860 | 660 (−23.3%) | 620 (−27.9%) | 870 (+1.2%) |
| 36 | Storm drain pipe Ø600 | LM | 540 | 440 (−18.5%) | 420 (−22.2%) | 545 (+0.9%) |
| 37 | Catchpits & gullies | EA | 88 | 52 (−40.9%) | 48 (−45.5%) | 87 (−1.1%) |
| 38 | Manholes | EA | 46 | 43 (−6.5%) | 42 (−8.7%) | 46 (0.0%) |
| 39 | Headwalls & outfalls | EA | 6 | 6 (0.0%) | 6 (0.0%) | 6 (0.0%) |
| 40 | Pipe bedding & surround | | 1,180 | 750 (−36.4%) | 700 (−40.7%) | 1,160 (−1.7%) |
| 41 | Water main Ø150 | LM | 1,240 | 1,180 (−4.8%) | 1,120 (−9.7%) | 1,250 (+0.8%) |
| 42 | Comms ducting (multi-way) | LM | 2,860 | 1,700 (−40.6%) | 1,560 (−45.5%) | 2,820 (−1.4%) |
| 43 | Draw pits / joint boxes | EA | 64 | 40 (−37.5%) | 36 (−43.8%) | 63 (−1.6%) |
| 44 | Valve chambers | EA | 18 | 18 (0.0%) | 18 (0.0%) | 18 (0.0%) |
| 45 | Utility trench reinstatement | | 3,980 | 2,510 (−36.9%) | 2,350 (−41.0%) | 3,930 (−1.3%) |

3.4 Cost Impact (Priced Bid)

Indicative mid-range unit rates (Spon's Civil Engineering and Highway Works Price Book, GBP) were applied to all four quantity sets, including a standard 30% addition for contingency, overheads and profit, and levies.
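The pricing step amounts to quantity × unit rate, summed, with the 30% addition applied on top. The sketch below shows the mechanics only. The two unit rates are round placeholder numbers chosen for illustration, not values from Spon's, and the item keys are hypothetical; the quantities are items 3 and 34 from the line-item tables.

```python
UPLIFT = 0.30  # the 30% addition for contingency, overheads and profit, and levies


def priced_total(quantities: dict[str, float], rates: dict[str, float],
                 uplift: float = UPLIFT) -> float:
    """Sum quantity x unit rate over all items, then apply the uplift."""
    net = sum(qty * rates[item] for item, qty in quantities.items())
    return net * (1.0 + uplift)


# Placeholder rates in GBP per unit (NOT Spon's figures).
rates = {"bulk_excavation": 10.0, "storm_pipe_300": 100.0}

# Quantities for items 3 and 34: QS ground truth vs Claude.
qs = {"bulk_excavation": 12400, "storm_pipe_300": 1420}
claude = {"bulk_excavation": 7650, "storm_pipe_300": 960}

print(round(priced_total(qs, rates)))      # 345800
print(round(priced_total(claude, rates)))  # 224250
```

Because the same rates and uplift are applied to every quantity set, any difference in the priced totals comes from the quantities alone.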

| Division | QS (truth) | Claude | ChatGPT | Civils.ai |
|---|---|---|---|---|
| Div 01: General / Prelims (lump sum) | £320K | £315K (−1.6%) | £305K (−4.7%) | £318K (−0.6%) |
| Div 31: Earthwork | £980K | £545K (−44.4%) | £505K (−48.5%) | £960K (−2.0%) |
| Div 32: Surfacing / Paving / Landscape | £1,650K | £970K (−41.2%) | £900K (−45.5%) | £1,588K (−3.8%) |
| Div 33: Utilities | £720K | £410K (−43.1%) | £375K (−47.9%) | £694K (−3.6%) |
| Total | £3,670K | £2,240K (−39.0%) | £2,085K (−43.2%) | £3,560K (−3.0%) |

Civils.ai landed within 3% of the priced QS bid (97.0% accurate). Both general-purpose LLMs came in ≈40% low at the bid level — a gap of roughly £1.4M–£1.6M against the QS figure, which on a tender of this size is the difference between a credible bid and a non-compliant one.

3.5 Top Cost Misses — Claude

The five categories below account for roughly £656K of underestimate. Net of about £115K in offsetting overestimates, that leaves roughly £540K of Claude's £1.43M total gap; the remainder is distributed across the other 30-plus line items.

| Rank | Category | QS | Claude | Gap | Cause |
|---|---|---|---|---|---|
| 1 | Bulk earthworks (cut / fill / haul / disposal) | £560K | £330K | −£230K | Volume-from-plan (uniform assumed depth) |
| 2 | Road surfacing (sub-base + base course) | £430K | £300K | −£130K | Under-read pavement build-up thickness |
| 3 | Storm drainage (pipe + bedding + gullies) | £310K | £185K | −£125K | Primary runs only; missed laterals |
| 4 | Comms ducting + trench reinstatement | £240K | £142K | −£98K | Missed multi-way duct banks |
| 5 | Soft-landscape planting + irrigation | £205K | £132K | −£73K | Distributed plan counts under-read |
| | Offsetting overestimates (kerb/edging, turf/topsoil) | | | +£115K | Full-perimeter & gross-area defaults |

4. Systematic Error Taxonomy

Every large LLM error in this test maps to one of six repeatable failure modes. Civils.ai's engine is built to neutralise all six.

| Error type | Root cause | Affected items | Direction |
|---|---|---|---|
| Volume-from-plan | Plan area × single assumed depth instead of integrating cut/fill from contours and spot levels | Bulk cut, fill, haul, topsoil strip | Under |
| Layer-thickness under-read | Read a thinner pavement / sub-base than the specified build-up | Sub-base, asphalt base course | Under |
| Primary-run-only | Counted main pipe/duct runs; missed laterals, branches and connections | Storm laterals, comms ducting, gullies | Under |
| Distributed-count miss | Missed elements spread across many plan sheets | Shrubs, gullies, draw pits, signs, tree pits | Under |
| Full-perimeter default | Applied kerb/edging around the full perimeter and to both sides | Kerb, edging, channel, kerb bedding | Over |
| Gross-vs-net area | Used gross site area for soft landscape rather than net planting zones | Turf, topsoil placement | Over |

The two over-estimating modes (kerb perimeter, gross area) partly mask the under-estimating modes at the total-cost level, which is why the LLM bids can look "only" 40% low despite most individual line items being far more wrong in one direction or the other. This netting-out is itself a hazard: it makes a badly constructed takeoff appear more plausible than it is.
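The netting-out effect can be made concrete with Claude's category gaps from section 3.5. The sketch below compares the net gap (signed sum) with the gross error (sum of absolute values); the category labels are shortened for readability.

```python
# Claude's cost gaps by category in thousands of GBP, from section 3.5.
gaps = {
    "Bulk earthworks": -230,
    "Road surfacing": -130,
    "Storm drainage": -125,
    "Comms ducting + reinstatement": -98,
    "Planting + irrigation": -73,
    "Kerb/edging, turf/topsoil (overestimates)": +115,
}

net_gap = sum(gaps.values())                                   # signed sum
gross_error = sum(abs(g) for g in gaps.values())               # total mispricing
underestimates = sum(g for g in gaps.values() if g < 0)

print(underestimates, net_gap, gross_error)  # -656 -541 771
```

On these six categories the bid looks £541K low, yet £771K of work is mispriced in one direction or the other. A review that only checks the bottom line would miss the difference.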

5. Takeaways

Where general-purpose LLMs work well

  • Schedule-driven counts. Standard and semi-mature trees, valve chambers and headwalls, all read directly from a clean tabular schedule, came back exact or near-exact; manholes were 7–9% low.
  • Tonnage from an explicit build-up. Asphalt binder and surface courses, priced from the stated pavement build-up, were within 3%.
  • Simple plan areas with a clear callout. Carriageway area and site clearing were close because they derive from a single dimensioned boundary.
  • Lump sums. General-conditions items default to 1-each and are always correct.

Where general-purpose LLMs struggle today

  • Derived volumes. Bulk earthwork is the single worst category. Both LLMs multiplied a plan area by an assumed uniform depth; neither reconstructed cut and fill from the grading surface. Result: 38–45% underestimate on the highest-value items in the bill.
  • Networked linear work. Drainage and ducting were read as their primary runs only. Laterals, branch connections and multi-way duct banks — spread across multiple sheets — were missed, producing 30–46% underestimates.
  • Distributed counts. Shrubs, gullies, draw pits and signs are enumerated across many plan sheets rather than in one schedule; the LLMs consistently under-counted them.
  • Scope-boundary judgement. Kerb, edging and channel were pushed to a full-perimeter, both-sides default (+29% to +56%), while soft-landscape areas used gross site area rather than net planting zones (+41% to +55%).

Why Civils.ai reaches 97%

Civils.ai is purpose-built for civil takeoff, so the six failure modes above are handled by design rather than by luck:

  • Earthwork from levels, not plan area. It reconstructs the cut/fill surface from contours and spot levels — the root fix for the largest category of LLM error.
  • Plan + section + schedule reconciliation. Every measured item is cross-checked across all three view types, so pipe runs match the pipe schedule and chamber counts match the manhole schedule.
  • Network tracing. It follows drainage and duct networks through laterals and branches instead of counting only the trunk run.
  • CESMM/CSI-aware scope resolution. Kerb sides, net-vs-gross planting and pavement build-up thickness are resolved from the standard method of measurement and the detail drawings, not from a default.
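The article does not describe how Civils.ai implements these checks. As a generic illustration of plan-versus-schedule reconciliation, the sketch below flags any item where a plan measurement and a schedule figure disagree by more than a tolerance, or where one side is missing. The 2% tolerance, the item keys and the "measured" values are assumptions for illustration; the schedule figures reuse the QS storm-pipe lengths.

```python
def reconcile(measured: dict[str, float], scheduled: dict[str, float],
              tolerance_pct: float = 2.0) -> list[str]:
    """Flag items where plan and schedule disagree, or where one side is missing.

    Assumes every scheduled quantity is positive.
    """
    flags = []
    for item in sorted(set(measured) | set(scheduled)):
        if item not in measured:
            flags.append(f"{item}: in schedule but not measured on plan")
        elif item not in scheduled:
            flags.append(f"{item}: measured on plan but not in schedule")
        else:
            diff_pct = (measured[item] - scheduled[item]) / scheduled[item] * 100.0
            if abs(diff_pct) > tolerance_pct:
                flags.append(f"{item}: plan differs from schedule by {diff_pct:+.1f}%")
    return flags


# Schedule lengths in metres (QS figures for items 34-36) vs a hypothetical plan takeoff.
scheduled = {"storm_300": 1420, "storm_450": 860, "storm_600": 540}
measured = {"storm_300": 960, "storm_450": 870}

for flag in reconcile(measured, scheduled):
    print(flag)
```

Here the Ø300 run is flagged at −32.4%, the Ø450 run passes, and the Ø600 run is flagged as missing from the plan takeoff. A primary-run-only reading would trip a check like this immediately.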

Economic Assessment

A general-purpose LLM takeoff costs ~$3–$4 in API usage and takes under an hour. That is genuinely useful as a first-pass index — it names every line item and flags where the risk sits. But at a 40% bid-level underestimate, it is not a tender-ready quantity set, and the errors are systematic rather than random, so they will not "average out."

A chartered QS billing at £80–£150/hour would spend 20–45+ hours on a takeoff of this size — £1,600–£6,800 in labour. Civils.ai reproduced the QS bid to within 3% in a fraction of that time. The strongest workflow is: Civils.ai produces the measured takeoff, a QS reviews and signs off the high-value earthwork, drainage and surfacing lines — a review measured in hours, not days.

Recommended workflow

  1. Use a purpose-built civil engine for the measured takeoff. General LLMs are best kept to a sanity-check / line-item-discovery role.
  2. Always human-review the three highest-risk categories: bulk earthworks, networked drainage/ducting, and pavement/surfacing build-ups.
  3. Derive earthwork from levels, never from plan area × assumed depth.
  4. Resolve scope boundaries explicitly — kerb sides, net vs gross planting area, pavement layer thicknesses — before pricing.
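Step 4 can be made explicit in even a very small takeoff script. The sketch below is illustrative only: the segment lengths, side counts and excluded areas are hypothetical numbers, not scheme data. It contrasts a per-segment kerb measurement with the both-sides default, and a net planting area with the gross site area.

```python
def kerb_length(segments: list[tuple[float, int]]) -> float:
    """Total kerb length from (centreline length in m, number of kerbed sides)."""
    return sum(length * sides for length, sides in segments)


def net_soft_area(gross_area: float, excluded_areas: list[float]) -> float:
    """Soft-landscape area after removing paving, structures and other exclusions."""
    return gross_area - sum(excluded_areas)


# Hypothetical road segments: kerbed both sides, one side, and unkerbed.
segments = [(400.0, 2), (500.0, 1), (200.0, 0)]
measured_kerb = kerb_length(segments)
default_kerb = kerb_length([(length, 2) for length, _ in segments])
print(measured_kerb, default_kerb)  # 1300.0 2200.0

# Hypothetical areas in square metres.
print(net_soft_area(9000.0, [2000.0, 800.0]))  # 6200.0
```

Recording the side count and the exclusions as data, rather than leaving them implicit, gives a reviewer something concrete to check against the detail drawings.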

Frequently Asked Questions

Is Claude (or ChatGPT) accurate enough for real civil quantity takeoffs?

Not as a standalone tender basis. In this test both general-purpose LLMs came in about 40% low at the priced-bid level, and 71–73% of their line items exceeded ±20% error. They are reliable only on schedule-driven counts and on quantities read directly from an explicit build-up. For bulk earthworks, networked drainage and ducting, and soft-landscape areas they are systematically wrong. The appropriate role is a first-pass index that a professional, or a purpose-built engine, corrects.

How is Civils.ai able to reach 97% when general LLMs reach ~60%?

Civils.ai is purpose-built for civil takeoff. It reconstructs earthwork volumes from grading contours and spot levels rather than multiplying plan area by an assumed depth; it traces drainage and duct networks through their laterals and branches; and it cross-references plan, section and schedule for every item while resolving CESMM/CSI scope boundaries. These are exactly the four things the general LLMs got wrong, which is why the accuracy gap is largest on the highest-value lines in the bill.

Why did the LLMs underestimate bulk earthworks so heavily?

Both derived volume from a plan area multiplied by a single assumed depth, rather than integrating cut and fill across the grading surface. On this scheme, which has meaningful level changes, that method understated the bulk cut, fill and haul volumes by 38–45%. It is a methodological error, not a page-reading error, which is why it is so consistent across both models.
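The difference between the two methods shows up even on a toy example. The sketch below uses a simple grid method, treating each level as the average for its cell, which is a simplification of the surface-to-surface methods used in practice. The grid, the 10 m cell size and the 0.4 m assumed depth are all hypothetical.

```python
def cut_fill_from_levels(existing, proposed, cell_area):
    """Integrate cut and fill volumes over a regular grid of levels (metres)."""
    cut = fill = 0.0
    for row_existing, row_proposed in zip(existing, proposed):
        for e, p in zip(row_existing, row_proposed):
            diff = e - p  # positive: existing ground is above design level (cut)
            if diff > 0:
                cut += diff * cell_area
            else:
                fill += -diff * cell_area
    return cut, fill


def cut_from_uniform_depth(plan_area, assumed_depth):
    """The shortcut: plan area multiplied by one assumed depth."""
    return plan_area * assumed_depth


# Hypothetical 3 x 4 grid of 10 m x 10 m cells, graded to a flat platform at 52.0 m.
existing = [
    [52.0, 52.4, 53.1, 53.8],
    [51.8, 52.2, 52.9, 53.5],
    [51.5, 51.9, 52.5, 53.0],
]
proposed = [[52.0] * 4 for _ in range(3)]

cut, fill = cut_fill_from_levels(existing, proposed, cell_area=100.0)
shortcut = cut_from_uniform_depth(plan_area=1200.0, assumed_depth=0.4)
print(round(cut), round(fill), round(shortcut))  # 740 80 480
```

On this made-up surface the shortcut gives 480 m³ of cut against 740 m³ from levels, about 35% low, and it produces no fill figure at all. The size of the miss depends entirely on how well the assumed depth happens to match the real surface.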

Why did the LLMs miss drainage and ducting?

They read the primary runs shown on the layout sheets and stopped there. Lateral connections to gullies, branch runs and multi-way duct banks are distributed across several sheets and enumerated in schedules the models did not fully reconcile. The result was a 30–46% underestimate on pipe length, gully count, duct length and draw-pit count.
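A small graph traversal shows how much length sits off the trunk. The network below is hypothetical (node names, lengths and the trunk flag are invented for illustration); the point is only that a trunk-only reading and a full trace give different totals.

```python
from collections import deque

# Hypothetical storm network: node -> list of (upstream node, pipe length in m, is_trunk).
network = {
    "OUTFALL": [("MH1", 60.0, True)],
    "MH1": [("MH2", 80.0, True), ("G1", 12.0, False)],
    "MH2": [("MH3", 75.0, True), ("G2", 9.0, False), ("G3", 11.0, False)],
    "MH3": [("G4", 14.0, False)],
}


def total_length(network, root, trunk_only=False):
    """Breadth-first walk upstream from the root, summing pipe lengths."""
    total = 0.0
    queue = deque([root])
    seen = {root}
    while queue:
        node = queue.popleft()
        for upstream, length, is_trunk in network.get(node, []):
            if trunk_only and not is_trunk:
                continue  # a primary-run-only reading skips laterals
            total += length
            if upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return total


print(total_length(network, "OUTFALL", trunk_only=True))  # 215.0
print(total_length(network, "OUTFALL"))                   # 261.0
```

In this toy network the laterals add 46 m to a 215 m trunk, and the four gullies at their ends would also go uncounted. On a real scheme the laterals are spread across several sheets, which is where the trace has to follow them.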

What did the general LLMs actually get right?

Discrete counts pulled straight from a clean schedule (standard and semi-mature trees, valve chambers, headwalls) were exact or near-exact, and manholes were 7–9% low. Asphalt binder and surface tonnage, priced from the stated pavement build-up, were within 3%. Lump-sum general-conditions items were correct by default. The pattern is clear: LLMs are accurate where the answer is a number in a table, and inaccurate where the answer must be derived from geometry or aggregated across sheets.

How does the ~$4 API cost compare to professional estimating cost?

A chartered QS would spend 20–45+ hours on a comparable takeoff, representing roughly £1,600–£6,800 in labour. The ~$4 LLM cost is not a substitute for that expertise — at a 40% underestimate it is not tender-ready — but it compresses the data-gathering phase. The strongest economics come from a purpose-built engine like Civils.ai that reaches QS-grade accuracy, with human review reserved for the high-value earthwork, drainage and surfacing lines.

Would these results apply to other civil project types?

Largely, yes. The earthwork finding generalises to any scheme with meaningful level changes — the plan-area-times-assumed-depth error is inherent to how general LLMs approach volume. The drainage/ducting and distributed-count findings apply to any project with a real utility network. Errors shrink on simple, flat, rectilinear sites where plan area and true surface converge and networks are trivial; they grow on graded, layered or network-heavy schemes — which describes most real landscaping, utilities and road-surfacing work.

