
How text-to-CAD actually works

The short version: an AI reads your prompt and tries to output real CAD geometry instead of a mesh blob. The longer version involves transformers, B-Rep kernels, and a lot of duct tape.

Quick answer

Text-to-CAD works by feeding a natural language prompt into a trained neural network that outputs a sequence of CAD operations (sketches, extrusions, fillets) rather than pixels or mesh triangles. The AI generates B-Rep geometry using learned patterns from datasets like DeepCAD's 170,000 parametric models.

Put another way: the model converts your prompt into a sequence of CAD modeling operations, sketch, extrude, fillet, chamfer, and a geometric kernel then executes that sequence into real B-Rep geometry. The AI doesn't draw a shape. It writes a recipe for building one.

I figured this out the hard way. I'd been using Zoo's text-to-CAD tool for a few weeks, getting results that ranged from surprisingly useful to quietly wrong, and I couldn't tell why the same kind of prompt would produce a clean bracket one day and a cursed lump the next. So I did what I always do when software annoys me enough: I went and read the research papers, sitting at my desk at ten o'clock on a Tuesday night with one dead monitor and a browser full of arXiv tabs. What I found was both more interesting and more fragile than I expected.

The pipeline, from English to geometry

The general architecture behind text-to-CAD follows a pattern that will look familiar if you've paid any attention to how large language models work, except the output isn't text. It's a sequence of parametric CAD commands.

Here's the basic flow. You type a prompt: "rectangular enclosure, 80mm by 50mm by 30mm, 2mm wall thickness, four M3 mounting holes on the corners." That prompt goes into a text encoder, typically a BERT-style transformer, which converts your words into a dense numerical representation that captures the meaning, dimensions, and spatial relationships you described. Think of it as translating English into math that the next stage can actually use.
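To make that encoding step concrete, here's a minimal sketch using Hugging Face's transformers library with an off-the-shelf BERT model. The model name and shapes are placeholders; commercial tools almost certainly use their own fine-tuned encoders, and nothing here is any vendor's actual code.

```python
# Sketch of the text-encoding step. "bert-base-uncased" is a stand-in, not
# what any particular text-to-CAD product actually uses.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

prompt = ("rectangular enclosure, 80mm by 50mm by 30mm, "
          "2mm wall thickness, four M3 mounting holes on the corners")

inputs = tokenizer(prompt, return_tensors="pt")
outputs = encoder(**inputs)

# One vector per token; this is the dense representation the decoder consumes.
text_embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, 768)
```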

That encoded representation then feeds into a decoder, usually an autoregressive transformer that generates a sequence of CAD operations one step at a time. Not triangles. Not voxels. Operations. "Create sketch on XY plane. Draw rectangle 80mm by 50mm. Extrude 30mm. Shell to 2mm wall. Place hole, M3 clearance, at position (5, 5). Repeat at corners." Each operation in the sequence is a token, and the model predicts the next token based on everything that came before, the same way a language model predicts the next word in a sentence.
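If you squint, the decoder's job looks something like the sketch below. Every name in it is invented for illustration; I'm only trying to show the shape of the data and the loop, not any real system's API.

```python
# Illustration only: a CAD operation sequence as data, plus a greedy
# autoregressive decoding loop. predict_next() is a hypothetical model call.
from dataclasses import dataclass

@dataclass
class CadOp:
    name: str      # "sketch", "rectangle", "extrude", "shell", "hole", ...
    params: dict   # plane, dimensions, positions, radii

target_sequence = [
    CadOp("sketch",    {"plane": "XY"}),
    CadOp("rectangle", {"width": 80.0, "height": 50.0, "centered": True}),
    CadOp("extrude",   {"distance": 30.0}),
    CadOp("shell",     {"wall": 2.0}),
    CadOp("hole",      {"diameter": 3.2, "at": (5.0, 5.0)}),
]

def decode(model, text_embeddings, max_ops=64):
    """Predict the next operation from everything generated so far."""
    ops = [CadOp("start", {})]
    for _ in range(max_ops):
        next_op = model.predict_next(text_embeddings, ops)  # hypothetical
        if next_op.name == "end":
            break
        ops.append(next_op)
    return ops[1:]
```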

The difference that matters: when GPT predicts the next word, a bad prediction gives you a weird sentence. When a CAD sequence decoder predicts the wrong operation, you get geometry that intersects itself, a fillet on an edge that doesn't exist, or an extrusion that collapses the model into something that would make a topology professor weep. CAD geometry has rules. Hard rules. Surfaces have to be watertight. Faces have to connect. Boolean operations have to produce valid solids. There's no "close enough" in B-Rep the way there is in mesh approximation.
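A toy way to picture that difference: the kernel side of the pipeline can refuse an operation outright instead of producing something approximately right. Everything below is hypothetical, real kernels run far deeper checks than this, but the failure mode is the point.

```python
# Hypothetical kernel-side guard. An invalid operation raises an error;
# it doesn't degrade gracefully the way a slightly-off mesh would.
def apply_op(solid, op):
    if op.name == "fillet" and op.params["edge"] not in solid.edges():  # hypothetical API
        raise ValueError("fillet references an edge that doesn't exist")
    candidate = solid.execute(op)                                       # hypothetical API
    if not candidate.is_watertight():                                   # hypothetical API
        raise ValueError(f"{op.name} produced a non-manifold body")
    return candidate
```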

What the Text2CAD paper actually showed

The most important academic work in this space is the Text2CAD paper that got a spotlight at NeurIPS 2024. I've seen it cited by every vendor in the space, usually with the inconvenient parts left out.

The researchers built the first end-to-end framework for generating parametric CAD models from natural language. They used the DeepCAD dataset, which contains roughly 170,000 parametric CAD models, and annotated it with about 660,000 text descriptions at varying levels of detail and skill. Some annotations read like an engineer's spec. Others read like a beginner describing a shape they saw once. That range was deliberate, because real users don't all talk like SolidWorks power users.

The architecture uses a BERT encoder for the text side and a transformer-based autoregressive network for the CAD sequence side. The model learns to map natural language descriptions to sequences of sketch and extrude operations that, when executed, produce the described geometry. It's trained end-to-end, meaning the text understanding and the CAD generation learn together rather than being bolted on separately.
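As a schematic, the decoder side looks roughly like a standard conditional transformer. This PyTorch sketch is my reading of the architecture described in the paper, not the authors' code; the vocabulary size, layer counts, and wiring are made up for illustration.

```python
# Schematic of a text-conditioned CAD sequence decoder in PyTorch.
# Not the Text2CAD authors' implementation; sizes are illustrative.
import torch
import torch.nn as nn

class CadSequenceDecoder(nn.Module):
    def __init__(self, vocab_size=512, d_model=768, nhead=8, num_layers=6):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.to_vocab = nn.Linear(d_model, vocab_size)

    def forward(self, cad_tokens, text_embeddings):
        # cad_tokens: (batch, seq) ids of the CAD ops generated so far
        # text_embeddings: (batch, prompt_len, d_model) from the BERT encoder
        x = self.token_embed(cad_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        x = self.decoder(x, memory=text_embeddings, tgt_mask=causal)
        return self.to_vocab(x)  # next-operation logits
```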

The results were genuinely impressive for what they are. The model could generate recognizable mechanical parts from text descriptions, with correct topology and editable feature history. But "recognizable" and "dimensionally accurate" are not the same thing, and "editable" and "production-ready" aren't either. The Text2CAD paper is an academic proof of concept, not a shipping product. Most of the commercial tools build on similar ideas but add their own layers of engineering on top, and none of them are particularly transparent about how much duct tape is involved.

B-Rep generation vs mesh generation

Most AI 3D tools (Meshy, Tripo, diffusion-model generators) produce meshes: bags of triangles that approximate a surface. A mesh can look like a bracket, but it doesn't know it's a bracket. You can't select a face. You can't fillet an edge. Import one into SolidWorks and the software treats it like a foreign object that wandered in from a game engine.

Text-to-CAD produces B-Rep geometry. Boundary Representation. Mathematically defined surfaces and edges, the same kind your CAD software creates when you sketch and extrude. Real faces, real edges, topology the software understands. You can measure, modify, and export a STEP file a machine shop will accept.

This is also why text-to-CAD is harder than text-to-3D. Mesh triangles just need to look right from a distance. B-Rep geometry means every operation has to produce a mathematically consistent solid. One bad boolean, one self-intersecting surface, one unclosed sketch, and the model fails. I've seen outputs that look perfect in the viewport and explode the moment you try to fillet an edge.
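A crude way to see what the two outputs even are. Both structures below are invented for clarity; a real mesh has thousands of triangles and a real B-Rep lives inside a kernel, not in a Python list.

```python
# Mesh output: coordinates and triangles, nothing else. Looks like a part,
# knows nothing about being one.
mesh_output = {
    "vertices":  [(0, 0, 0), (80, 0, 0), (80, 50, 0)],  # ...and thousands more
    "triangles": [(0, 1, 2)],                            # ...and thousands more
}

# B-Rep output, in spirit: a recipe the kernel can re-execute, measure, and edit.
brep_recipe = [
    ("sketch",  {"plane": "XY"}),
    ("rect",    {"w": 80, "h": 50}),
    ("extrude", {"d": 30}),
    ("fillet",  {"edges": "vertical", "r": 2}),
]
```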

The operation sequence is the key idea

What separates text-to-CAD from other AI 3D approaches is that the output is a sequence of operations, not a surface prediction.

A human in Fusion 360 builds a part step by step: sketch, dimension, extrude, add a hole, fillet edges, shell the body. The feature tree records this history. You can roll back, change a dimension, watch the rest update. A text-to-CAD model generates that same kind of sequence. "Sketch on XY. Rectangle, origin-centered, 80x50. Pad 30mm. Fillet edges, 2mm radius. Pocket, circular, 3.2mm, position (5, 5, 30)." The geometric kernel executes these in order, producing a solid with a real feature tree.
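Here's roughly that recipe as an executable script. I'm using CadQuery, an open-source Python wrapper around a B-Rep kernel; it's not one of the tools discussed in this post, just a convenient way to show what a feature recipe looks like when a kernel actually runs it.

```python
# Approximately the recipe described above, as a CadQuery script.
import cadquery as cq

part = (
    cq.Workplane("XY")                  # sketch on XY
    .box(80, 50, 30)                    # origin-centered rectangle, padded 30mm
    .edges("|Z").fillet(2)              # 2mm fillet on the vertical edges
    .faces(">Z").workplane()            # new sketch on the top face
    .pushPoints([(35, 20), (-35, 20), (35, -20), (-35, -20)])
    .hole(3.2)                          # M3 clearance holes, 5mm in from each corner
)

cq.exporters.export(part, "enclosure.step")  # real B-Rep out, as a STEP file
```

Change the 30 to a 40 and rerun it, and everything downstream updates. That's the editability a bag of triangles can't give you, and it's exactly what a generated operation sequence is supposed to preserve.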

This is why the kernel matters. Zoo uses KittyCAD, a GPU-native kernel built for AI-driven geometry generation. Other tools generate OpenSCAD code, Python scripts for FreeCAD, or commands that run inside Fusion 360 like CADAgent does. The kernel has to execute whatever the AI generates and produce valid geometry at the end. When the kernel and the AI disagree about what's geometrically possible, you get the silent failures that make this technology maddening to debug.

Why this is genuinely hard

I want to be clear about something, because the demos make this look easier than it is. Generating valid parametric CAD geometry from text is harder than generating images or meshes.

CAD geometry has constraints that images and meshes don't. Every sketch needs to be fully constrained or the extrusion is ambiguous. Every boolean operation (cut, join, intersect) needs to produce a valid solid, not a self-intersecting mess. Fillets and chamfers can only be applied to edges that actually exist in the current state of the model, and whether a fillet succeeds depends on the surrounding geometry in ways that are difficult to predict without actually trying it. I've been using SolidWorks for over a decade and I still get surprised by which fillets fail and which don't. Expecting a neural network to get this right every time is optimistic.
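That "try it and see" property shows up even when you script a kernel directly. A sketch of the failure mode, again using CadQuery purely as an illustration:

```python
# Whether this fillet succeeds depends on geometry the edge selector can't see.
# When it fails, you get a kernel exception, not an explanation.
import cadquery as cq

part = cq.Workplane("XY").box(80, 50, 30).faces(">Z").shell(-2)

try:
    part = part.edges("|Z").fillet(2)   # 2mm fillet against a 2mm wall: maybe, maybe not
except Exception as err:
    print(f"fillet rejected by the kernel: {err}")
```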

The training data problem is real too. The DeepCAD dataset has 170,000 models, which sounds like a lot until you compare it to the billions of images used to train Stable Diffusion or the trillions of text tokens used for GPT. CAD data is scarce because most of it is proprietary. Companies don't publish their parametric models. The models that do exist in public datasets tend toward simple mechanical parts. So the AI has seen a lot of brackets and boxes and housings, and not many gears, snap fits, sheet metal parts, or complex multi-body assemblies. It generates what it's been trained on, and the training data has holes you could drive a forklift through.

Then there's evaluation. How do you measure whether a generated CAD model is "good"? The Text2CAD paper uses metrics like coverage and constraint satisfaction. But those don't capture what an engineer cares about: Is the feature tree clean? Can I edit it without breaking everything? Would a machinist accept this without calling me? Those questions don't have neat mathematical answers, which makes it hard to train a model to optimize for them.

What this means for the output you actually get

When you use a text-to-CAD tool today, you're getting predictions from a neural network trained on a small dataset of simple CAD models, run through a geometric kernel that enforces validity but can't fix bad predictions. Good outputs happen when your prompt aligns with what the model has seen in training and when the operation sequence is geometrically valid. Bad outputs happen when any of that breaks down. And it breaks down quietly.

You don't get an error saying "the fillet failed because the adjacent face conflicts with the shell." You get a model that looks fine and turns out to have internal faces, zero-thickness walls, or dimensions 15% off from what you asked for. I've learned to measure every critical feature on every model I get from these tools, the same way I measure parts from a shop. Trust, but verify, except I don't trust yet.
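If you want to automate part of that verification, you can at least sanity-check the overall dimensions of whatever STEP file comes back. A minimal sketch with CadQuery, reusing the file name from the earlier example:

```python
# Re-import the generated STEP and check the bounding box against the prompt.
import cadquery as cq

model = cq.importers.importStep("enclosure.step")
bbox = model.val().BoundingBox()

expected = {"x": 80.0, "y": 50.0, "z": 30.0}
actual = {"x": bbox.xlen, "y": bbox.ylen, "z": bbox.zlen}

for axis, want in expected.items():
    if abs(actual[axis] - want) > 0.1:   # pick your own tolerance
        print(f"{axis}: expected {want}mm, got {actual[axis]:.2f}mm")
```

A bounding-box check won't catch internal faces or zero-thickness walls, but it does catch the dimensions that come back 15% off, which is the cheapest failure to screen for.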

Where this actually stands

The text-to-CAD guide covers the practical side of using these tools, and the "what is text-to-CAD" post explains the concept without assuming you know what B-Rep means. What I've tried to explain here is the machinery: why it works when it works, and why it breaks when it breaks.

The core technology is real. Transformers can learn to generate valid CAD operation sequences from text. The commercial tools have turned that into something usable, with varying reliability. But there's a real gap between "the AI can generate a plausible sequence of CAD operations" and "the AI can generate the right sequence for your part with correct dimensions and a feature tree you'd want to edit."

A CNC machine can cut any shape you tell it to, but the shape is only as good as the program. Text-to-CAD is the same deal. The kernel builds whatever the AI tells it to. The question is whether the AI is telling it the right thing. For simple parts with clear descriptions, often yes. For anything with real complexity, the answer is a polite version of "sort of, but check everything." The architecture is sound. The training data is growing. The kernels are improving. I'll be watching, importing STEP files, and keeping my calipers on the desk.
