CAD datasets for AI training: what's available and what's locked up
Training AI to generate CAD models requires CAD training data. Most of the good data is locked inside corporate vaults. What's publicly available is limited, biased, and often missing the metadata that matters most.
Quick answer
Key public CAD datasets for AI: ABC Dataset (~1M models), DeepCAD (~180K parametric sequences), Fusion 360 Gallery (~8,000 models with design history), and ShapeNet (~51K 3D models). Most are biased toward simple mechanical parts. Corporate CAD libraries with real-world complexity and manufacturing metadata remain proprietary. The training data gap limits text-to-CAD quality.
The key public CAD datasets for training AI are the ABC Dataset (about one million models), DeepCAD (about 178,000 parametric sequences), Fusion 360 Gallery (about 8,000 models with design history), ShapeNet (about 51,000 3D models), and Thingi10K (about 10,000 printable models). Most are biased toward simple mechanical parts. The real-world CAD data, the complex assemblies with tolerances and manufacturing metadata, is locked inside corporate PDM vaults where no researcher can touch it. I know this because I've tried to find better training data for a side project, and every promising lead ended at a firewall, an NDA, or a polite email explaining that sharing CAD models was "not aligned with our IP strategy."
That was three months ago. I was sitting at my desk with a Fusion 360 Gallery model open on one screen and a DeepCAD sample on the other. Both were extruded rectangles with holes. One had a feature tree. The other was a command sequence. Neither looked anything like the parts I design for actual clients. And it hit me: the reason text-to-CAD tools struggle with real engineering geometry is that they've never seen real engineering geometry. They've seen the public dataset equivalent of a first-semester homework assignment.
This post is about what's actually available, what's missing, and why the gap between public CAD data and corporate CAD data is the most important bottleneck in the entire text-to-CAD field.
ABC Dataset: the million-model foundation
The ABC Dataset, published in 2019 by Koch et al., is the largest publicly available collection of CAD models. It contains approximately one million models sourced from Onshape's public projects. The name stands for "A Big CAD Model Dataset for Geometric Deep Learning."
The models are stored as B-Rep geometry, which means they have proper faces, edges, and topology. You get STEP files and derived meshes. The geometric quality is generally good because the models come from a real CAD platform with a real geometric kernel.
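To make that concrete, here's a minimal sketch of reading one ABC STEP file and counting its B-Rep faces, using the open-source pythonocc-core bindings to Open CASCADE. The filename is a placeholder.

```python
# Minimal sketch: load one ABC STEP file and count its B-Rep faces.
from OCC.Core.STEPControl import STEPControl_Reader
from OCC.Core.IFSelect import IFSelect_RetDone
from OCC.Core.TopExp import TopExp_Explorer
from OCC.Core.TopAbs import TopAbs_FACE

reader = STEPControl_Reader()
if reader.ReadFile("abc_000123.step") != IFSelect_RetDone:  # placeholder path
    raise RuntimeError("STEP import failed")
reader.TransferRoots()
shape = reader.OneShape()

# In a B-Rep, every face and edge is explicit topology, which is
# exactly what mesh-only datasets never give you.
faces = 0
explorer = TopExp_Explorer(shape, TopAbs_FACE)
while explorer.More():
    faces += 1
    explorer.Next()
print(f"{faces} B-Rep faces")
```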
The problems: no text annotations. No manufacturing metadata. No design history. No parametric information beyond the raw geometry. You get shapes, not processes. You can train a model to recognize what parts look like but not how they were built, which is exactly the information text-to-CAD needs.
The distribution is also skewed. Onshape's public projects are dominated by hobbyists, students, and early-career users. The models tend to be simple: brackets, plates, basic housings, mechanical components that a single person would create as a public project. Complex assemblies, multi-body parts, and production-quality engineering models are rare because professionals don't usually share their work publicly. The dataset is large but shallow.
ABC is useful for training geometric understanding, shape classification, and surface analysis models. It's less useful for text-to-CAD specifically because there's no text to pair with the geometry.
DeepCAD: the dataset that made text-to-CAD research possible
I've written about the DeepCAD dataset in detail, but the summary matters here.
DeepCAD contains approximately 178,000 parametric CAD models represented as sequences of CAD commands: sketch a profile, extrude it, sketch another profile, cut-extrude through. Each model is a recipe, not just a shape. This command-sequence representation is what made it possible to train generative models that output CAD operations instead of raw geometry.
The dataset was derived from ABC by filtering to models that could be represented as sketch-and-extrude sequences. That filtering is important. It means DeepCAD excludes sweeps, lofts, revolves, shell features, sheet metal operations, surfacing, and any other modeling approach that doesn't fit the sketch-extrude pattern. The result is geometrically simple: plates, blocks, cylinders, brackets, and basic prismatic shapes.
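To make the representation concrete, here is a toy plate-with-a-hole recipe expressed as a command list. The field names are my own simplification for illustration, not DeepCAD's actual vocabulary or schema.

```python
# Toy illustration of a sketch-and-extrude command sequence.
# Field names are my own simplification, not DeepCAD's actual schema.
plate_with_hole = [
    {"cmd": "sketch", "plane": "XY",
     "loops": [{"type": "rectangle", "w": 80.0, "h": 40.0}]},
    {"cmd": "extrude", "distance": 6.0, "op": "new_body"},
    {"cmd": "sketch", "plane": "XY",
     "loops": [{"type": "circle", "center": (20.0, 20.0), "r": 4.0}]},
    {"cmd": "extrude", "distance": 6.0, "op": "cut"},
]

# A generative model over this representation predicts the next command
# token by token, much like a language model predicts words. Note what
# the vocabulary cannot say: there is no "revolve", "sweep", "loft", or
# "shell" command, so a model trained on it can never emit one.
```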
The Text2CAD paper added approximately 660,000 text annotations to DeepCAD, using Mistral and LLaVA-NeXT to generate descriptions at beginner, intermediate, and expert levels. This annotation layer transformed DeepCAD from a geometry-only dataset into a text-geometry paired dataset, enabling the first end-to-end text-to-CAD models.
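To give a feel for what those three levels mean, here are invented examples in the spirit the paper describes. These are not actual Text2CAD annotations, just an illustration of the idea.

```python
# Invented examples of multi-level annotations for the toy plate above.
# These are illustrative only, not actual Text2CAD strings.
annotations = {
    "beginner": "A flat rectangular plate with a round hole near one corner.",
    "intermediate": ("An 80 x 40 mm plate, 6 mm thick, with an 8 mm "
                     "diameter through-hole 20 mm from two edges."),
    "expert": ("Sketch an 80 x 40 rectangle on the XY plane and extrude "
               "6 mm; sketch an 8 mm circle at (20, 20) and cut-extrude "
               "through all."),
}
```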
DeepCAD is the most-cited dataset in text-to-CAD research. It defined the field's technical approach. It also defined the field's limitations. When a text-to-CAD tool generates a beautiful bracket but can't handle a gear, a swept tube, or a thin-walled injection-molded housing, the training data is a big part of the reason.
Fusion 360 Gallery: small but rich
The Fusion 360 Gallery Dataset, published by Autodesk Research in 2021 (Willis et al.), is much smaller than ABC or DeepCAD (about 8,625 models) but much richer in information.
Each model includes the complete design history: the sequence of modeling operations, the sketch geometry, the constraints, the parameters, and the feature tree. This is the only major public dataset that preserves full parametric design history as a human engineer would experience it in a real CAD tool. You don't just see what the part looks like. You see how it was built, step by step, decision by decision.
The models also include B-Rep geometry, mesh representations, segmented surfaces, and metadata about the design operations used. It's the most complete representation of the CAD design process available in any public dataset.
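As a sketch of why that history matters for training, here's what replaying a design record might look like. The JSON field names below are illustrative placeholders, not the dataset's actual schema; consult the dataset documentation for the real structure.

```python
# Sketch of replaying a design-history record step by step.
# The field names ("timeline", "operation", "parameters") are
# illustrative placeholders, not the dataset's actual schema.
import json

with open("f360_model.json") as f:  # placeholder filename
    history = json.load(f)

# History-rich data lets a model learn the process, not just the result:
for i, step in enumerate(history.get("timeline", [])):
    print(i, step.get("operation"), step.get("parameters"))
```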
The problems: size and source bias. 8,625 models is tiny by ML standards. And because the models come from Fusion 360 Gallery, they represent what people chose to share publicly, which skews toward demonstrations, tutorials, and personal projects rather than production engineering. You get interesting geometry but not necessarily representative geometry.
For researchers working on how text-to-CAD works at the architecture level, the Fusion 360 Gallery is invaluable because it provides ground truth for what a proper feature tree looks like. For training models that need to produce editable, parametric output, it's one of the few sources that shows what "right" looks like. It just doesn't show it at scale.
ShapeNet: the one from the graphics world
ShapeNet was published in 2015 by Chang et al.; its curated ShapeNetCore subset contains approximately 51,300 3D models organized into 55 categories drawn from the WordNet taxonomy. It's been enormously influential in 3D deep learning research, and you'll see it cited in almost every paper about 3D generation.
The catch for CAD: ShapeNet models are meshes, not B-Rep solids. They were collected from online 3D model repositories, not CAD tools. The geometry represents visual appearance, not engineering definition. You can't extract feature trees, sketch constraints, or parametric dimensions from ShapeNet models because that information was never there.
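A minimal sketch of what a mesh-only file actually contains, using the open-source trimesh library (the filename is a placeholder):

```python
# What a mesh-only dataset gives you: points and triangles, nothing else.
import trimesh

mesh = trimesh.load("shapenet_chair.obj", force="mesh")  # placeholder path
print(mesh.vertices.shape)  # (n, 3) vertex positions
print(mesh.faces.shape)     # (m, 3) triangle indices

# There is no feature tree, sketch constraint, or parametric dimension
# to query here; the engineering definition was never in the file.
```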
ShapeNet is useful for training models that need to understand 3D shape in general: classification, retrieval, reconstruction. It's less useful for text-to-CAD specifically because the output format doesn't match what CAD engineers need. A mesh chair from ShapeNet and a parametric bracket from DeepCAD occupy different universes in terms of engineering utility.
Some research projects have used ShapeNet as supplementary data for shape understanding while using DeepCAD for the actual CAD generation task. That's a reasonable approach, but it doesn't solve the fundamental problem: the CAD-specific data remains scarce.
Thingi10K: the 3D printing dataset
Thingi10K, published in 2016 by Zhou and Jacobson, contains 10,000 3D models from Thingiverse, the largest repository of user-submitted 3D printing files. The models are stored as meshes (STL and OBJ) and span a wide range of categories: mechanical parts, art, household items, toys, tools, cosplay props.
The value of Thingi10K is its diversity and its connection to real fabrication. These are models people actually printed. They include mechanical parts alongside decorative objects, giving a broader view of what users create in 3D modeling tools.
The limitations for AI training: mesh-only format (no parametric data), no design history, no manufacturing metadata beyond "someone printed this." The geometric quality varies enormously because the models come from users of all skill levels. Some are well-designed mechanical parts. Some are decorative meshes that would crash a CAD kernel.
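In practice, using Thingi10K starts with a triage pass. Here's a minimal sketch of the kind of checks involved, again using trimesh with a placeholder filename:

```python
# Rough mesh-quality triage of the kind Thingi10K models typically need.
import trimesh

mesh = trimesh.load("thing_12345.stl", force="mesh")  # placeholder path
checks = {
    "watertight": mesh.is_watertight,                  # closed, manifold surface
    "consistent_winding": mesh.is_winding_consistent,  # faces agree on "outside"
    "positive_volume": mesh.is_watertight and mesh.volume > 0,
}
print(checks)

# Many user-submitted meshes fail one or more of these and need repair
# (or rejection) before they can feed a training pipeline.
```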
For text-to-CAD research specifically, Thingi10K is marginal. For broader research on 3D shape understanding and generation, it's a useful supplementary dataset.
What each dataset includes and misses
To make this concrete, here's what a real engineering part needs and what each dataset provides:
Geometry (the shape itself): All datasets provide this, though in different formats. ABC and DeepCAD provide B-Rep. ShapeNet and Thingi10K provide meshes. Fusion 360 Gallery provides both.
Design history (how it was built): Only Fusion 360 Gallery. DeepCAD has command sequences, which are a partial form of it. Everything else is geometry-only.
Text descriptions: Only DeepCAD, through the Text2CAD annotation layer. Everything else has no text pairing.
Dimensional accuracy (precise measurements): DeepCAD and Fusion 360 Gallery preserve exact dimensions. ABC preserves B-Rep dimensions. ShapeNet and Thingi10K are approximate at best.
Tolerances: None. No public dataset includes tolerance information.
Material specifications: None in any meaningful way.
Manufacturing process data: None. No dataset records whether a part was machined, molded, printed, or cast, or what process parameters were used.
Assembly context: None. All major datasets contain individual parts, not assemblies with mating relationships.
Design intent (why features exist): None, beyond what can be inferred from the feature sequence.
The pattern is clear: public datasets provide shape. Real engineering requires shape plus context. That context (tolerances, materials, manufacturing processes, assembly relationships, design intent) is exactly what's missing, and it's exactly what would make text-to-CAD output useful for actual engineering work.
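One way to see the gap at a glance is to sketch the record a complete training example would need. Every field name here is my own invention; the point is that no public dataset populates the bottom half.

```python
# Hypothetical sketch of a "complete" training record. Field names are
# my own invention; no public dataset populates the bottom half.
from dataclasses import dataclass, field

@dataclass
class EngineeringPart:
    # What public datasets provide, at most:
    geometry: bytes              # B-Rep (ABC, DeepCAD) or mesh (ShapeNet, Thingi10K)
    design_history: list = None  # Fusion 360 Gallery only
    text_description: str = ""   # DeepCAD only, via Text2CAD

    # What none of them provide:
    tolerances: dict = field(default_factory=dict)       # per-feature specs
    material: str = ""           # e.g. "AL 6061-T6"
    process: str = ""            # machined / molded / printed / cast
    assembly_mates: list = field(default_factory=list)   # mating relationships
    design_intent: str = ""      # why each feature exists
```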
The proprietary data problem
The most interesting CAD data in the world sits inside corporate PDM systems, and it's not coming out.
A mid-size manufacturing company might have 50,000 to 500,000 parts in their vault. A large automotive or aerospace company has millions. These parts are complex. They have real tolerances, real material specifications, real manufacturing data associated with them. Many have revision histories going back years or decades. Some are linked to inspection reports, manufacturing defect records, and cost data.
This data, if it could be assembled, cleaned, and annotated, would be transformative for ML in CAD. Instead of training on 178,000 simple extrusions, you could train on millions of production parts spanning every manufacturing process and material. The models would learn what real engineering looks like because they'd see real engineering.
But companies don't share this data, for legitimate reasons. Part designs are proprietary. They contain trade secrets. They reveal product roadmaps, manufacturing capabilities, and competitive information. Even anonymized, a collection of automotive bracket designs from a specific company might reveal something about their upcoming vehicle platform. IP protection is real, and no amount of academic enthusiasm is going to override it.
Some companies have internal ML initiatives using their own data. Siemens, Autodesk, PTC, and Dassault all have access to customer data through their cloud platforms, subject to their terms of service and privacy policies. Whether and how they use this data for training is an active area of legal and ethical discussion that I'm watching with professional interest and personal skepticism.
The practical result: public research advances on limited data. Corporate initiatives advance on proprietary data that never gets published. And the gap between what public text-to-CAD tools can generate and what production engineering requires remains wide.
Dataset bias: the simple-parts problem
Every public CAD dataset is biased toward simple mechanical parts. This isn't an accident. It's a consequence of how the data was collected.
Public repositories attract hobbyists, students, and demonstrators. The geometry they share tends to be simple, self-contained, and single-part. The complex, multi-feature, multi-body, assembly-integrated parts that make up real products don't get shared because they're proprietary, because they require context to understand, and because sharing a single part from a 200-part assembly without the rest of the assembly is like sharing one chapter from the middle of a novel.
This bias has direct consequences for text-to-CAD. Models trained on simple parts generate simple parts. When I ask a text-to-CAD tool for "a bracket," it produces something reasonable because brackets are well-represented in the training data. When I ask for "a four-cavity injection mold base with guided ejection," the output is useless because the model has never seen one.
The bias extends to modeling operations too. DeepCAD only contains sketch-and-extrude operations. Parts built with sweeps, lofts, revolves, patterns, or surfacing techniques are excluded. This means the AI literally cannot produce geometry that requires these operations. It's not a quality problem. It's a vocabulary problem. The training data taught the model to speak in extrusions. Asking it to loft is like asking someone who only knows English to write in Japanese.
What metadata is missing and why it matters
The missing metadata is as important as the missing geometry, maybe more so.
Tolerances define what "close enough" means for each feature. Without them, a generated part has no specification, just a shape. Every hole is exactly its nominal size, which is not how manufacturing works. A 6 mm hole might need to be 6.000 +0.018/-0.000 for a bearing press fit, or 6.2 ±0.1 for a clearance hole. The number 6 alone is meaningless without the tolerance, and no training dataset includes this information.
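For illustration, here's a toy structure for the tolerance information that's missing, using the two hole examples above. The representation is hypothetical; nothing like it ships with any dataset.

```python
# Toy representation of per-feature tolerance data, using the two
# 6 mm hole examples from the text. The structure is hypothetical.
from dataclasses import dataclass

@dataclass
class ToleranceSpec:
    nominal: float  # mm
    upper: float    # upper deviation, mm
    lower: float    # lower deviation, mm

    def limits(self) -> tuple[float, float]:
        return (self.nominal + self.lower, self.nominal + self.upper)

press_fit_bore = ToleranceSpec(6.0, upper=0.018, lower=0.0)
clearance_hole = ToleranceSpec(6.2, upper=0.1, lower=-0.1)
print(press_fit_bore.limits())  # (6.0, 6.018)
print(clearance_hole.limits())  # (6.1, 6.3), up to float rounding
```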
Material specifications determine what's physically possible. A 0.5 mm wall in polycarbonate is fine. A 0.5 mm wall in aluminum is a problem on a mill. A 0.5 mm wall in cast iron doesn't exist outside of research papers. The AI doesn't know the material, so it can't know the limits.
Manufacturing intent, the reason a part looks the way it does, is the deepest kind of missing metadata. A fillet exists because the machinist needs a tool radius there, or because the molder needs draft, or because the stress analyst said the sharp corner would crack. Three identical fillets, three different reasons. The training data records the fillet. It doesn't record the reason. And the reason is what determines whether the fillet should be 1 mm or 3 mm.
How the data gap affects text-to-CAD quality
Every limitation I've described in text-to-CAD output quality traces back to the training data.
Dimensional inaccuracy: the models learned from annotations that say "about 40 mm" paired with geometry that's 41.3 mm. The model learned to approximate, not to be precise.
Limited geometry range: the models learned from simple extrusions and produce simple extrusions. Complex geometry is out of vocabulary.
No manufacturing awareness: the models never saw manufacturing context, so they can't produce manufacturing-aware output.
No tolerance generation: the models never saw tolerances, so they can't generate them.
No assembly understanding: the models never saw assemblies, so they can't reason about part relationships.
This is not a criticism of the researchers who built these datasets. Given the constraints, they've done remarkable work. DeepCAD enabled an entire field of research. The Fusion 360 Gallery is the gold standard for design history data. The ABC Dataset proved that large-scale CAD data collection was possible.
But the honest picture is this: the public data available for training CAD AI is like teaching someone woodworking using only photos of IKEA furniture. The shapes are there. The material is hinted at. The joints, the grain direction, the tooling marks, the assembly sequence, and the decades of craft knowledge that went into making the joints work: none of that is in the picture. And it shows in the output.
Where the data might come from
I see three plausible paths to better training data.
Synthetic data generation: using existing CAD tools to procedurally generate large numbers of parametric models with controlled properties (a toy sketch follows at the end of this list). This is already happening in some research labs. The risk is that synthetic data produces synthetic-looking output: models that are geometrically valid but don't reflect how real engineers design.
Federated or anonymized corporate data: companies contributing anonymized geometry and metadata to shared datasets without revealing proprietary designs. This requires solving real technical and legal problems around anonymization, but the incentive exists: better AI tools benefit the companies whose data trains them. Industry consortia or standards bodies might eventually broker this.
Annotation of existing public data: taking the models that already exist in ABC, DeepCAD, and other datasets and adding the missing metadata through expert annotation or inference. This is labor-intensive but feasible for specific metadata types. Estimating likely manufacturing processes from geometry, inferring material from typical dimensions and features, adding tolerance standards based on common practice.
None of these paths is fast. All of them require significant investment. And none will produce data as rich as what already sits inside corporate servers.
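To make the first path concrete, here's a toy generator built on the open-source CadQuery library. It emits valid STEP files with fully known parameters, and it also demonstrates the risk described above: every part it produces looks like homework.

```python
# Toy synthetic-data generator using CadQuery: random plates with a hole.
# A real pipeline would vary many more features and export the command
# sequence alongside the geometry as a training label.
import random
import cadquery as cq

def random_plate(seed: int) -> cq.Workplane:
    rng = random.Random(seed)
    w = rng.uniform(20.0, 100.0)  # plate width, mm
    h = rng.uniform(20.0, 100.0)  # plate height, mm
    t = rng.uniform(2.0, 10.0)    # thickness, mm
    hole_d = rng.uniform(2.0, min(w, h) / 4)
    return (
        cq.Workplane("XY")
        .box(w, h, t)
        .faces(">Z")
        .workplane()
        .hole(hole_d)  # through-hole at the plate center
    )

for i in range(3):
    cq.exporters.export(random_plate(i), f"synthetic_{i:05d}.step")
```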
The honest picture
The text-to-CAD field is limited by its training data more than by its models. The ML architectures are capable enough. The neural approaches to CAD generation are improving. The bottleneck is that nobody has the data to train these models on what real engineering looks like.
Public datasets gave us the proof of concept. Simple brackets and extruded plates from AI prompts are now a reality. That's a real achievement built on DeepCAD, ABC, Fusion 360 Gallery, and the researchers who assembled them.
The next step, generating geometry that's dimensionally precise, manufacturing-aware, properly toleranced, and representative of the full range of engineering design, requires data that doesn't exist in public and may not exist in any single location. Building that data is the boring, expensive, unglamorous work that determines whether text-to-CAD stays a prototyping novelty or becomes a real engineering tool.
My bet is that it stays a novelty for longer than the vendors will admit and becomes useful faster than the skeptics expect. The timeline depends entirely on the data. And the data, right now, is a collection of extruded rectangles in a research dataset, looking nothing like the parts I design for a living.