The Text2CAD paper: what the NeurIPS research actually says
The NeurIPS 2024 Text2CAD paper introduced the first end-to-end framework for generating parametric CAD from natural language. Here's what it does, what it proved, and what it doesn't solve.
Quick answer
The Text2CAD paper (NeurIPS 2024 spotlight) presents a transformer-based framework that generates parametric CAD models from text, trained on the DeepCAD dataset (~170K models, ~660K text annotations). It pairs a BERT-based text encoder with an autoregressive transformer decoder to produce sketch-and-extrude operation sequences, not mesh geometry.
I've seen the Text2CAD paper cited by at least four different text-to-CAD vendors, always in the same way: vaguely, enthusiastically, and with the inconvenient parts left out. "Based on cutting-edge NeurIPS research" is a great thing to put on a landing page. It's less useful for understanding what the research actually showed, where it broke down, and what it means for the tools you might use on a Tuesday afternoon when a client needs a bracket by end of day.
So I read the paper. Then I read it again, because the first pass didn't stick and I was trying to understand the evaluation metrics while eating a sandwich at my desk, which turns out to be a bad combination. Here's what it actually says.
What the paper is
Text2CAD, published as a spotlight paper at NeurIPS 2024, is the first end-to-end framework for generating parametric CAD models from natural language text prompts. The authors are Mohammad Sadil Khan, Sankalp Sinha, Talha Uddin Sheikh, Didier Stricker, Sk Aziz Ali, and Muhammad Zeshan Afzal, primarily from DFKI and RPTU Kaiserslautern-Landau in Germany.
"First end-to-end" is an important qualifier. Earlier research had tackled pieces of this problem: generating CAD sequences from other representations, generating 3D geometry from text as mesh, annotating CAD models with descriptions. Text2CAD put the full pipeline together: text in, parametric CAD operations out.
The paper introduced two contributions that matter. First, a data annotation pipeline that generated multi-level text descriptions for the DeepCAD dataset. Second, the model architecture itself, which takes those text descriptions and generates valid sequences of sketch-and-extrude operations.
The data pipeline
The DeepCAD dataset contains about 178,000 parametric CAD models (the Text2CAD annotations cover roughly 170,000 of them), each represented as a sequence of CAD operations: sketch a profile, extrude it, sketch another profile on a different plane, extrude that. Each model is a recipe, not a mesh. That's what makes it useful for this kind of research. The models are simple, mostly prismatic mechanical parts, but they're stored as parametric operation sequences, the same kind of instructions a human would follow in a timeline-based CAD tool.
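To make "recipe, not a mesh" concrete, here's roughly what one of those operation sequences looks like. This is a schematic I wrote for illustration; the field names are mine, not the actual DeepCAD JSON schema:

```python
# A schematic sketch-and-extrude recipe, in the spirit of how DeepCAD
# stores models as operation sequences. Field names are illustrative,
# not the real DeepCAD schema.
model = [
    {
        "op": "sketch",
        "plane": "XY",
        "curves": [
            {"type": "circle", "center": [0.0, 0.0], "radius": 12.0},
            {"type": "circle", "center": [0.0, 0.0], "radius": 8.0},
        ],
    },
    {"op": "extrude", "distance": 10.0, "direction": [0.0, 0.0, 1.0]},
]
```

Two concentric circles, one extrusion: a ring. Every model in the dataset is some variation on that theme.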
The Text2CAD team annotated this dataset with approximately 660,000 text descriptions using Mistral and LLaVA-NeXT (a vision-language model). They generated descriptions at multiple skill levels, from beginner-style ("make two cylinders, one inside the other") to expert-style ("sketch a concentric circular profile on the XY plane with outer diameter 24mm and inner diameter 16mm, extrude 10mm along the Z axis"). This range was deliberate. Real users don't all describe geometry the same way, and the model needed to handle everything from casual to precise.
That annotation pipeline is itself a contribution. Before Text2CAD, the DeepCAD models existed without text labels. You had geometry but no natural language descriptions to train a text-to-CAD model on. The team essentially created the labeled dataset that made the whole thing possible.
The architecture
The model has two main components. A text encoder based on BERT (with trainable adaptive layers) converts the input prompt into a dense numerical representation. An autoregressive transformer decoder takes that representation and generates CAD operations one token at a time.
Each CAD operation is tokenized: the operation type (sketch, extrude), the parameters (coordinates, dimensions, angles), and the ordering. The decoder predicts the next token in the sequence, conditioned on the text encoding and everything it's generated so far. If you've seen how language models generate text word by word, this is the same principle applied to CAD construction sequences.
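Here's a minimal sketch of what that tokenization might look like. The vocabulary, the 8-bit quantization grid, and the token layout are my own illustrative assumptions, not the paper's exact scheme:

```python
# Minimal illustration of CAD-sequence tokenization: continuous
# parameters are quantized onto a fixed grid so the decoder can
# predict them as discrete tokens, the way a language model predicts
# words. Vocabulary and grid are assumptions for illustration.

N_BINS = 256  # quantization levels for continuous parameters
OPS = {"<sos>": 0, "<eos>": 1, "sketch": 2, "circle": 3, "extrude": 4}
PARAM_OFFSET = len(OPS)  # parameter tokens live after the op tokens

def quantize(value: float, lo: float = -100.0, hi: float = 100.0) -> int:
    """Map a continuous parameter to one of N_BINS discrete tokens."""
    t = (value - lo) / (hi - lo)
    return PARAM_OFFSET + min(N_BINS - 1, max(0, int(t * N_BINS)))

# "Circle at (0, 0) with radius 12, extruded 10" as one flat sequence.
tokens = [
    OPS["<sos>"],
    OPS["sketch"],
    OPS["circle"], quantize(0.0), quantize(0.0), quantize(12.0),
    OPS["extrude"], quantize(10.0),
    OPS["<eos>"],
]
print(tokens)  # [0, 2, 3, 133, 133, 148, 4, 145, 1]
```

The decoder's entire job is to predict the next integer in a list like that, conditioned on the text embedding and everything generated so far.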
The output isn't mesh. It's a sequence of sketch-and-extrude operations that can be executed by a CAD kernel to produce B-Rep geometry. That distinction matters enormously and is what separates this from text-to-3D research like DreamFusion or Point-E. The Text2CAD model doesn't predict what a surface looks like. It predicts how to build a solid, step by step, the way an engineer would.
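To see what "executed by a CAD kernel" means in practice, here's the ring from the expert-level prompt earlier, built with CadQuery, an open-source Python layer over the OCCT B-Rep kernel. CadQuery is my stand-in, not something the paper uses; it just shows what running a sketch-and-extrude sequence through a kernel looks like:

```python
import cadquery as cq

# Sketch two concentric circles on the XY plane and extrude 10 mm.
# The kernel treats the nested inner wire as a hole, so this yields
# a tube: outer diameter 24 mm, inner diameter 16 mm, height 10 mm.
ring = (
    cq.Workplane("XY")
    .circle(12.0)   # outer radius (24 mm diameter / 2)
    .circle(8.0)    # inner radius (16 mm diameter / 2)
    .extrude(10.0)  # along the workplane normal (+Z)
)

cq.exporters.export(ring, "ring.step")  # B-Rep solid out, not a mesh
```

A text-to-CAD model's output is morally equivalent to those chained calls: a construction sequence the kernel turns into a solid.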
What the results showed
The paper evaluates the model along several axes: visual quality (does it look like the described part), parametric precision (are the individual operations correct), and geometric accuracy (does the final solid match the intent).
For parametric precision, they report F1 scores for different CAD elements: lines, arcs, circles, and extrusions. The model is reasonably good at getting the basic operations right, especially for the simpler descriptions. For geometric accuracy, they use Chamfer Distance (a standard metric for comparing 3D shapes) and invalidity ratios (what fraction of generated sequences produce broken geometry).
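Chamfer Distance is worth a quick sketch, because it explains why "close but not exact" geometry still scores well. This is my own minimal version using numpy and scipy; conventions vary across papers (squared vs. unsquared distances, sum vs. mean), so treat it as one common form, not the paper's exact metric code:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point clouds a (N,3), b (M,3).

    Mean nearest-neighbor distance from a to b, plus the reverse.
    One common convention; papers differ on squaring and normalization.
    """
    d_ab, _ = cKDTree(b).query(a)  # for each point in a, nearest in b
    d_ba, _ = cKDTree(a).query(b)  # for each point in b, nearest in a
    return float(d_ab.mean() + d_ba.mean())
```

A part that's a millimeter off everywhere still produces a small Chamfer Distance, which is exactly why a good score on this metric doesn't guarantee manufacturing-grade dimensions.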
They also ran GPT-4V evaluations and human evaluations, which is an acknowledgment that metrics alone don't capture whether a generated part is actually useful.
The honest summary of the results: the model can generate recognizable mechanical parts from text descriptions, with valid topology and the correct general shape. It handles beginner-level prompts (simple descriptions) better than expert-level prompts (precise dimensional specifications). The dimensional accuracy is approximate, not precise. The range of geometry it can produce is limited to what exists in the training data, which is mostly simple prismatic parts.
What it doesn't solve
This is where the vendor citations conveniently trail off.
The model generates single parts only. No assemblies. No parts that reference other parts. No mating relationships or spatial context. You describe one object and get one object.
The dimensional accuracy is not sufficient for manufacturing without verification. The model generates approximate dimensions that are often close but not exact. If you ask for 80mm by 50mm, you might get 78.3mm by 51.1mm. That's impressive for a research prototype and useless for a machine shop.
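Which is why any serious workflow puts a verification step between the model and the shop. Even a check as dumb as this one, with hypothetical numbers mirroring the example above, catches the problem:

```python
# Hypothetical sanity check: compare requested dimensions against what
# the model actually generated. Numbers mirror the example in the text.
requested = {"width": 80.0, "depth": 50.0}   # mm, from the prompt
generated = {"width": 78.3, "depth": 51.1}   # mm, measured from output
tolerance = 0.1                              # mm, pick per your shop

for name, want in requested.items():
    got = generated[name]
    status = "OK" if abs(got - want) <= tolerance else "FAIL"
    print(f"{name}: wanted {want} mm, got {got} mm -> {status}")
```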
The geometry vocabulary is limited to sketch-and-extrude operations. No fillets, no chamfers, no shell, no pattern, no sweep, no loft. The DeepCAD dataset stores models as sketch-and-extrude sequences, so that's what the model learned. If your part needs a fillet, the model doesn't have a token for that. This is a significant limitation because real parts have fillets, chamfers, draft angles, and other features that make them manufacturable.
The training data is small. 178,000 models sounds like a lot until you compare it to the billions of images that image generation models train on. The model has seen a narrow slice of the CAD universe: simple mechanical parts, mostly boxes and cylinders and plates. Ask for a gear, a cam, a sheet metal bracket, or an ergonomic handle, and you're outside the training distribution.
The code is available (GitHub), but the license is CC BY-NC-SA 4.0: non-commercial use only. If you want to build a product on this, you need a different license arrangement or a different model.
What it means for the tools you actually use
Every commercial text-to-CAD tool operates on principles similar to what this paper describes: text encoding, sequence generation, kernel execution. Zoo.dev, AdamCAD, CADAgent: they all process text prompts and output CAD operations. The specific architectures differ. The training data differs. The kernels differ. But the fundamental pattern, language in, construction sequence out, is what Text2CAD formalized academically.
The paper is useful for calibrating expectations. When you see a text-to-CAD tool generate a clean bracket from a prompt, the research tells you roughly what's happening inside and why certain things work better than others. Simple prismatic parts match the training distribution. Complex geometry doesn't. Dimensional accuracy is approximate. Single-part generation is the current frontier. These aren't limitations specific to one tool. They're limitations of the approach, and the Text2CAD paper is honest about them in a way that marketing pages typically aren't.
The contribution that matters most
If I had to pick one thing the Text2CAD paper did that will have lasting impact, it's the annotated dataset. Before this work, the CAD research community had geometry without language labels. Text2CAD created the bridge between natural language and parametric CAD sequences at a scale that enables training. Every future text-to-CAD model, open or commercial, benefits from the existence of that annotated data or from the pipeline methodology used to create it.
The model itself will be surpassed. The architecture will be refined. But the problem of connecting human language to CAD operation sequences, and the dataset that first made that connection trainable, that's the foundation. The open-source text-to-CAD space is building on it.
The paper is worth reading if you use text-to-CAD tools and want to understand what you're actually interacting with. It's also worth reading if you're skeptical about these tools, because the limitations section is more honest than any product page I've seen. The research proves the concept works. It also proves the concept has boundaries, and those boundaries are exactly where the hard engineering begins.