DeepCAD dataset: the training data behind text-to-CAD

Most text-to-CAD models learn from the DeepCAD dataset: about 178,000 parametric CAD models. That's not a lot. Here's why that matters.

Quick answer

The DeepCAD dataset contains approximately 178,000 parametric CAD models represented as sequences of sketch-and-extrude operations, with ~660,000 text annotations added by the Text2CAD project. It's the primary training dataset for text-to-CAD research, but its limited size and geometric simplicity constrain what current models can generate.

Somewhere in the basement of every text-to-CAD demo is a training dataset, and most of the time it's DeepCAD. I first ran into it while tracing back the claims in a vendor's whitepaper. They kept talking about "trained on hundreds of thousands of parametric models." The number sounded impressive. Then I downloaded the actual dataset, opened a few samples in a viewer, and spent ten minutes looking at what was essentially a collection of geometry that a first-semester CAD student could build during a lunch break. Cylinders, boxes, plates with holes, simple extrusions on simple sketches. Valid parametric models, technically. Also the kind of thing I'd model in two minutes on a slow day.

That's not a complaint about the researchers who built it. Given what was available, DeepCAD was a genuine achievement. But understanding what's in this dataset, and more importantly what isn't, tells you a lot about why text-to-CAD tools behave the way they do.

What DeepCAD actually is

DeepCAD was introduced in a 2021 ICCV paper by Rundi Wu, Chang Xiao, and Changxi Zheng from Columbia University. The full name of the paper is "DeepCAD: A Deep Generative Network for Computer-Aided Design Models." The dataset was a byproduct of building a generative model for CAD, and it ended up becoming the most widely used training set in the field.

The dataset contains approximately 178,000 parametric CAD models sourced from ABC, a large-scale collection of CAD models from Onshape's public repository. The original ABC dataset has over a million models, but DeepCAD filtered it down to models that could be represented as sequences of sketch-and-extrude operations. That filtering is important. It means DeepCAD only includes models that were built by sketching a 2D profile and extruding it, possibly multiple times, to create a 3D solid. No sweeps. No lofts. No revolves. No sheet metal. No surfacing.

Each model in the dataset is stored not as a mesh or a B-Rep solid, but as a sequence of CAD commands: create a sketch on a plane, draw line segments, arcs, and circles to define a profile, extrude the profile by some distance. This command-sequence representation is what makes the dataset useful for training AI models. The model doesn't learn what a part looks like. It learns how to build a part, step by step, the way a CAD timeline records it.
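
To make that concrete, here's roughly what one of those command sequences looks like if you write it out as plain Python. The class and field names below are mine, not DeepCAD's actual schema, but the structure mirrors what the dataset records: a sketch on a plane, a closed profile of lines and circles, then an extrude.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative types only -- these names are not DeepCAD's actual schema,
# but the structure mirrors what the dataset encodes.

@dataclass
class Line:
    start: Tuple[float, float]  # (x, y) in sketch-plane coordinates
    end: Tuple[float, float]

@dataclass
class Circle:
    center: Tuple[float, float]
    radius: float

@dataclass
class Sketch:
    plane: str                  # e.g. "XY"
    curves: List[object]        # lines, arcs, circles forming closed profiles

@dataclass
class Extrude:
    distance: float             # extrusion depth in model units
    operation: str              # "new_body", "join", or "cut"

# A minimal "plate with a hole", recorded the way a CAD timeline would:
commands = [
    Sketch(plane="XY", curves=[
        Line((0, 0), (40, 0)), Line((40, 0), (40, 25)),
        Line((40, 25), (0, 25)), Line((0, 25), (0, 0)),
    ]),
    Extrude(distance=15.0, operation="new_body"),
    Sketch(plane="XY", curves=[Circle(center=(20, 12.5), radius=3.0)]),
    Extrude(distance=15.0, operation="cut"),
]
```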

Size matters, and 178,000 is small

I keep hearing people describe 178,000 models as "large-scale." In the CAD research world, it is. In the broader AI world, it's tiny.

For reference: Stable Diffusion was trained on about 2 billion image-text pairs. GPT-3 was trained on hundreds of billions of tokens. Even in specialized domains, datasets tend to be in the millions. DeepCAD has 178,000 models, each represented as a sequence averaging maybe 60-80 CAD operation tokens. The total amount of training data, measured in the way AI researchers measure it, is minuscule.
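
A quick back-of-the-envelope calculation shows the gap. The per-model token count is my rough 60-80 estimate from above and the GPT-3 figure is an order-of-magnitude placeholder, so treat the ratio as a sketch rather than a measurement.

```python
# Back-of-the-envelope only: the per-model token count is the rough
# 60-80 estimate from the text, the GPT-3 figure is an order of magnitude.
models = 178_000
tokens_per_model = 70                       # midpoint of the 60-80 estimate
deepcad_tokens = models * tokens_per_model  # ~12.5 million operation tokens

gpt3_text_tokens = 300e9                    # "hundreds of billions"
print(f"DeepCAD: ~{deepcad_tokens / 1e6:.1f}M tokens")
print(f"GPT-3-scale corpus is ~{gpt3_text_tokens / deepcad_tokens:,.0f}x larger")
```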

This matters because the diversity of the dataset directly constrains what a trained model can produce. If the training data is 178,000 simple prismatic parts, the model will generate simple prismatic parts. It won't spontaneously learn to create a gear, a turbine blade, or a complex housing with snap-fit features, because it never saw one. The training set is the ceiling.

CAD data is scarce for a reason. Most real CAD models are proprietary. Companies don't publish their part files. The models that do end up in public repositories like Onshape or GrabCAD tend to be simpler than what lives on corporate servers. The really interesting geometry, the assemblies with hundreds of parts, the injection-molded housings with draft angles and rib patterns, the sheet metal enclosures that fold flat, none of that is in DeepCAD. It can't be, because nobody shared it.

The Text2CAD annotation layer

The original DeepCAD dataset had geometry but no text. The models came with their CAD command sequences but no natural language descriptions. You couldn't train a text-to-CAD model on it because there was nothing connecting words to shapes.

The Text2CAD paper fixed this by annotating the dataset with approximately 660,000 text descriptions generated using Mistral and LLaVA-NeXT. Each model got multiple descriptions at different skill levels: beginner ("a box with a hole"), intermediate ("a rectangular block with a through-hole centered on the top face"), and expert ("sketch a 40mm by 25mm rectangle on the XY plane, extrude 15mm, then sketch a 6mm circle centered on the top face and cut-extrude through all").
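
One annotated sample ends up looking something like the record below. The field names are illustrative rather than the actual Text2CAD release format; the point is that a single command sequence carries several descriptions, each of which becomes its own training pair.

```python
# Hypothetical record shape -- field names are illustrative, not the
# actual Text2CAD release format.
sample = {
    "model_id": "00001234",
    "cad_sequence": "<sketch XY> <rect 40 25> <extrude 15> "
                    "<sketch top> <circle 6> <cut through_all>",
    "annotations": {
        "beginner": "a box with a hole",
        "intermediate": "a rectangular block with a through-hole "
                        "centered on the top face",
        "expert": "sketch a 40mm by 25mm rectangle on the XY plane, "
                  "extrude 15mm, then sketch a 6mm circle centered on "
                  "the top face and cut-extrude through all",
    },
}

# One model yields several (text, CAD) training pairs, one per skill level.
pairs = [(text, sample["cad_sequence"])
         for text in sample["annotations"].values()]
```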

The multi-level annotation was a smart decision. Real users describe parts at wildly different levels of specificity. A hobbyist says "a bracket." A mechanical engineer says "an L-bracket, 3mm 6061 aluminum, 40mm legs, two M4 clearance holes per leg on a 25mm pitch." The model needs to handle both, and the annotation pipeline gave it examples of each.

But the annotations are only as good as the models they describe. Annotating a simple cylinder with a beginner description and an expert description gives you two ways to say "cylinder." It doesn't give you a way to generate a cam, a spring clip, or a dovetail joint. The bottleneck isn't the text. It's the geometry.

What the models look like

I went through a random sample of about fifty DeepCAD models. Here's what I found.

Most are simple extrusions: a sketch profile extruded once or twice to create a 3D shape. A few are more complex, with multiple sketch planes and boolean operations (cutting one extrusion from another). The sketch profiles are made of lines, arcs, and circles. No splines. The geometry is clean but elementary.
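
If you want to run the same kind of spot-check, a rough sketch is below. I'm assuming each model is a JSON file whose top-level operation list carries a "type" field; the real key names in the release may differ, so adjust to whatever the files you download actually contain.

```python
import glob
import json
import random
from collections import Counter

# Spot-check a random sample of DeepCAD JSON files.
# NOTE: "sequence" and "type" are assumptions about the schema; check the
# key names in the release you downloaded and adjust accordingly.
files = glob.glob("deepcad_json/**/*.json", recursive=True)
paths = random.sample(files, min(50, len(files)))

op_types = Counter()
extrudes_per_model = []

for path in paths:
    with open(path) as f:
        model = json.load(f)
    ops = [step.get("type", "?") for step in model.get("sequence", [])]
    op_types.update(ops)
    extrudes_per_model.append(sum("Extrude" in t for t in ops))

print("operation types:", op_types.most_common())
print("avg extrudes per model:",
      sum(extrudes_per_model) / max(len(extrudes_per_model), 1))
```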

Typical examples: a rectangular plate with four corner holes. A cylinder with a bore. A step block. A T-shaped bracket. An L-shaped bracket. A flanged plate. A plate with a centered rectangular pocket. These are the building blocks of mechanical design, and they're perfectly valid parts. They're also the kind of parts that take about three minutes to model by hand in any CAD tool.

What you won't find: assemblies, parts with complex internal geometry, freeform surfaces, thin-walled injection-molded parts, sheet metal with bend reliefs, gears, cams, threaded features, helical geometry, or anything that requires operations beyond sketch-and-extrude. The dataset defines the vocabulary, and the vocabulary is deliberately limited.

Why this shapes every tool you use

When a text-to-CAD tool handles your "rectangular bracket with mounting holes" prompt beautifully and then falls apart on "helical gear with 20-degree pressure angle," the DeepCAD dataset is a big part of the reason. The model learned from simple parts. It generates simple parts. The training data is the boundary.

Commercial tools like Zoo.dev likely train on additional proprietary data beyond DeepCAD, and they have their own geometric kernels that may handle more complex operations. But the foundational research, the architecture, the proof of concept, that all came from training on DeepCAD. The field's understanding of what works and what doesn't was shaped by this dataset's contents.

This also explains the dimensional accuracy problem. The DeepCAD models have specific dimensions, but the text annotations describe them approximately. When you train a model on "a box about 40mm long" paired with a box that's exactly 41.3mm, the model learns to approximate. It doesn't learn to be precise, because precision wasn't reliably encoded in the training signal.
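
As a toy illustration with invented numbers, the pairing the model actually trains on looks like this: the description approximates, the command sequence is exact, and nothing forces the two to agree.

```python
# Invented values, purely to illustrate the fuzzy supervision:
# the text hedges, the geometry is precise.
training_pair = {
    "text": "a box about 40mm long",
    "cad_sequence": [
        {"op": "sketch", "plane": "XY", "profile": "rect 41.3 x 24.7"},
        {"op": "extrude", "distance": 14.9},
    ],
}
```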

The CAD data problem

DeepCAD is the most-used dataset in text-to-CAD research because there isn't much else. Other public CAD datasets exist (ABC has over a million models, Fusion 360 Gallery has about 20,000), but none of them combine the command-sequence representation with the scale that researchers need. And none of them have text annotations at the scale Text2CAD provided.

Building a better dataset is the obvious next step and also the hardest one. You need parametric CAD models stored as editable command sequences (not just meshes or B-Rep solids), covering a wide range of real engineering geometry, with accurate text descriptions at multiple levels of detail. Getting that data means either generating it synthetically (which risks the model learning to generate synthetic-looking parts), convincing companies to share proprietary models (good luck), or building an annotation pipeline that works on more complex geometry.

Until that dataset exists, text-to-CAD models will keep bumping into the same ceiling. They'll get better at generating the kinds of parts DeepCAD contains. They won't suddenly learn to generate the kinds of parts it doesn't.

The honest assessment

DeepCAD did exactly what it needed to do: it proved that representing CAD models as learnable sequences was viable and gave the research community a common training set. The Text2CAD paper added the language bridge. Together, they made text-to-CAD research possible.

But treating 178,000 simple models as sufficient for production text-to-CAD is like training a writing assistant on nothing but grocery lists and expecting it to draft contracts. The format is similar. The complexity is not. Every limitation I've hit with text-to-CAD tools, the narrow geometry range, the approximate dimensions, the inability to handle real engineering features, traces back, at least in part, to a training dataset that contains the CAD equivalent of "hello world" programs. The tools will get better when the data does. So far, the data hasn't.
