" I go online to save time — so it's frustrating to download and print a 200-page PDF when I need one datasheet." That was an R&D engineer describing a keyword search tool in 2006. Twenty years later, manufacturers are still having the same problem.
The drives are full. Product specifications, datasheets, installation manuals, compliance documents — thousands of them, technically accessible, practically impossible to act on. A customer asks which pump supports 220V with a stainless steel housing. A distributor needs to know which products qualify for food-grade environments. A sales rep is on a call and needs to compare two models to keep the client’s interest. None of these questions get a fast answer from a keyword search bar.
Teams often rely on tribal knowledge. Whoever has been around longest knows which products can be substituted, which models were discontinued, which configurations work best. When that person leaves (and in manufacturing, turnover has been merciless), the knowledge walks out with them. Staff then reach for AI assistance to recover what was lost, which is the right instinct. But what they usually discover is that the AI project turns into a data reckoning. The documentation exists, but it wasn't built to be machine-readable.
Chaotic data, inconsistent formats, and users who need the answer fast: these are the obstacles manufacturing companies face most often, and the ones I’ve seen repeatedly while working with clients. This article walks through how systems that hold up are designed, where they fail, and what it takes to make them work.
The Problem with Manufacturer Product Catalogs
The root cause is fragmented, inconsistent information that was built for print.
Product data is spread across dozens or hundreds of individual PDFs, each formatted differently by the team or era that produced it. A spec sheet from 2018 uses different column headers than the one published last year. A datasheet describes voltage in one unit; a manual uses another. There is no unified language connecting a stainless steel housing in one document to its equivalent attribute in another.
Traditional keyword search pours fuel on the fire. It finds documents that contain the words, not documents that answer the question. Search "high-pressure food-safe pump" and you get everything that mentions any of those three terms, ranked by signals that have nothing to do with what the customer needs. The reading, the comparing, the figuring-out — that's all still on them.
Why Traditional RAG Fails for Product Catalogs
The instinct when someone hears “AI over documents” is to reach for RAG (Retrieval-Augmented Generation). Chunk the PDFs, embed the chunks, store them in a vector database, and let the model retrieve and answer. It works in pitch decks, but it consistently comes undone in real-world environments over large, heterogeneous product catalogs.
The frustration is widely shared among peers in the field. Manufacturing and AI professionals repeatedly report that deploying generic AI tools against complex, mission-critical technical environments produces unreliable results, and that the gap between what gets demonstrated and what holds up at production volume is a primary reason AI projects launch with fanfare and then stall.
AI search for manufacturing is “hit and miss” precisely when it’s built on architectural assumptions that don’t match the reality of how manufacturer data is structured.
The fault in the blueprint is that single-pass RAG retrieves chunks and generates an answer in one shot. Everyday product questions don’t resolve that way. A request like “Which of your pumps supports food-grade environments and delivers in under two weeks?” requires the system to retrieve specifications, cross-reference availability data, and synthesize across multiple documents that were never designed to be read together. No single retrieval pass captures all of that.
We ran into this ceiling ourselves. When you build document intelligence systems with context windows limited to 8,000 tokens and enterprise collections running into millions of files, you learn fast that the limitation of single-pass RAG is structural. The same ceiling applies to any manufacturer with hundreds of thousands of product pages across thousands of SKUs.
There are several structural gaps that make single-pass RAG insufficient for product catalog use cases:
- Different document types require different extraction strategies. A pipeline that treats a scanned image the same as a structured PDF will produce inconsistent and unreliable results.
- Hybrid search, semantic plus keyword, significantly outperforms either approach alone, especially for technical attributes like model numbers, voltage specs, and material codes that don’t embed well semantically.
- Without iterative retrieval, the system has no mechanism to detect or correct its own gaps. If the first pass misses a relevant document, the answer is simply wrong.
- Without document classification upfront, the same extraction logic gets applied to structured PDFs, scanned images, and OCR artifacts, each with a completely different internal representation and failure mode.
Architecture of an AI Product Catalog
The system has four interconnected layers: ingestion and classification, hybrid search, iterative retrieval, and an agent layer that turns retrieved data into answers. Each depends on the one before it.
Document Ingestion and Classification
Before any extraction happens, every incoming document is classified. This architectural decision determines if everything downstream is trustworthy.
In a clinical document processing system we built for a client, we identified five distinct document types, each requiring its own extraction pipeline. The taxonomy maps directly to manufacturer product catalogs:
→ Plain images: photos of printed documents, sometimes taken at an angle or poorly lit. Pure OCR with noise tolerance built in.
→ Unstructured PDFs: machine-generated but with no predictable layout. Every manufacturer arranges the same attributes differently; extraction must be inferred rather than templated.
→ Structured PDFs: documents from known manufacturers with fixed, predictable element positions. A stored template maps directly to the layout; extraction is precise.
→ OCR PDFs: appear structured but were generated by an unknown external process. The internal representation is unreliable; we treat these as images wrapped in a PDF container.
→ Embedded tables: spec comparison matrices, part number tables, and attribute grids. The richest source of structured product data, but requiring specialized logic to preserve row and column relationships.
Misroute a document and the damage stays invisible until someone acts on it: values pulled from the wrong location, attributes mapped to mismatched fields, output that looks reliable and leads you off a cliff. In a product catalog, that means a bad recommendation delivered with a straight face.
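The routing decision can be sketched in a few lines of Python. This is a minimal illustration, not the production classifier: the `DocumentFeatures` signals, the `table_threshold` value, and the pipeline names are all assumptions introduced here, and a real system would likely derive these signals from a PDF inspection step or a trained model.

```python
from dataclasses import dataclass
from enum import Enum


class DocType(Enum):
    PLAIN_IMAGE = "plain_image"
    UNSTRUCTURED_PDF = "unstructured_pdf"
    STRUCTURED_PDF = "structured_pdf"
    OCR_PDF = "ocr_pdf"
    EMBEDDED_TABLE = "embedded_table"


@dataclass
class DocumentFeatures:
    """Hypothetical signals produced by an upstream inspection step."""
    is_pdf: bool
    has_text_layer: bool          # the PDF contains extractable text
    text_layer_trusted: bool      # text was not produced by an unknown OCR process
    matches_known_template: bool  # layout matches a stored manufacturer template
    table_area_ratio: float       # share of page area covered by tables, 0..1


# Hypothetical pipeline names, one per document type.
PIPELINES = {
    DocType.PLAIN_IMAGE: "ocr_with_noise_tolerance",
    DocType.UNSTRUCTURED_PDF: "layout_inference",
    DocType.STRUCTURED_PDF: "template_extraction",
    DocType.OCR_PDF: "treat_as_image",
    DocType.EMBEDDED_TABLE: "table_extraction",
}


def classify(f: DocumentFeatures, table_threshold: float = 0.5) -> DocType:
    """Assign exactly one document type; the order of the checks matters."""
    if not f.is_pdf:
        return DocType.PLAIN_IMAGE
    # Assumption: a PDF with no usable text layer is handled like an OCR PDF.
    if not f.has_text_layer or not f.text_layer_trusted:
        return DocType.OCR_PDF
    if f.table_area_ratio >= table_threshold:
        return DocType.EMBEDDED_TABLE
    if f.matches_known_template:
        return DocType.STRUCTURED_PDF
    return DocType.UNSTRUCTURED_PDF


def route(f: DocumentFeatures) -> str:
    """Return the name of the extraction pipeline for a document."""
    return PIPELINES[classify(f)]
```

The point of the explicit enum is that every document lands in exactly one pipeline, so a misroute shows up as a classification error that can be audited rather than as silently wrong values.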
Hybrid Search Layer
Once documents are ingested and extracted, the search layer connects user questions to the right information. A purely semantic search understands meaning but struggles with exact technical specifications: a model number, a voltage rating, a specific material designation. Keyword search catches those precisely but misses intent.
The real-world approach combines both, then applies reranking to surface the most relevant results and compression to fit them into the model’s context window without losing critical details. In our conversational analytics platform, this layer is built on top of PostgreSQL for structured data, Redis for caching, vector databases for semantic retrieval, and graph databases for connecting related entities — product families, compatible accessories, superseded models.
This design choice is well-supported by recent field data. IBM Research shows that combining vector search, sparse vector search, and full-text search yields measurably better recall than any single method alone. The practical reason is straightforward: vector embeddings capture meaning, but they cannot precisely represent exact-match queries. A model number like “XR-450” or a material code like “304 SS” may carry little semantic context in training data, making pure vector search unreliable for the most technically specific product queries. Keyword search dominates for those cases. Combining both, weighted appropriately and reranked, is the only approach that handles the full range of real-world product questions.
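One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below uses RRF as an illustrative stand-in; the article does not state which fusion method the platform uses, and the document IDs, the weights, and the conventional constant `k=60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores weight / (k + rank) per list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for a query such as "XR-450 food-grade pump".
keyword_hits = ["xr450_datasheet", "xr450_manual", "pump_overview"]
semantic_hits = ["food_grade_guide", "xr450_datasheet", "pump_overview"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

In this example the datasheet that both retrievers rank highly comes first. A reranker and a compression step would then operate on the fused list.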
Iterative Retrieval Loop
The retrieval loop is what separates this architecture from a standard RAG pipeline. After an initial retrieval pass, the system evaluates whether the results are sufficient to answer the question confidently. If they’re not, because a key attribute is missing, two documents conflict, or the question requires information that didn’t surface in the first pass, the system reformulates the query and retrieves additional context.
This loop continues until the system has sufficient evidence or determines that a confident answer isn’t possible, at which point it escalates to the clarification loop rather than guessing.
We had to engineer this as a core mechanism back when context windows were limited to 8,000 tokens and enterprise document collections couldn’t fit into any single pass. That constraint is still real for any manufacturer with hundreds of thousands of product pages across thousands of SKUs.
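The loop can be sketched as follows. This is a simplified illustration under stated assumptions: `search` is a hypothetical callable that returns attribute/value pairs, sufficiency is reduced to "all required attributes found with no conflicts", and query reformulation is just appending the names of the missing attributes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RetrievalResult:
    status: str                                   # "answered" or "needs_clarification"
    evidence: dict[str, str] = field(default_factory=dict)
    unresolved: list[str] = field(default_factory=list)
    passes: int = 0


def iterative_retrieve(
    question: str,
    required: list[str],
    search: Callable[[str], dict[str, str]],
    max_passes: int = 3,
) -> RetrievalResult:
    evidence: dict[str, str] = {}
    conflicts: set[str] = set()
    missing = list(required)
    query = question
    passes = 0
    while missing and passes < max_passes:
        passes += 1
        for attr, value in search(query).items():
            if attr not in required:
                continue
            if attr in evidence and evidence[attr] != value:
                conflicts.add(attr)       # two passes disagree: do not guess
            evidence.setdefault(attr, value)
        missing = [a for a in required if a not in evidence]
        # Naive reformulation: steer the next pass toward what is still missing.
        query = f"{question} {' '.join(missing)}"
    unresolved = missing + sorted(conflicts)
    status = "needs_clarification" if unresolved else "answered"
    return RetrievalResult(status, evidence, unresolved, passes)


def fake_search(query: str) -> dict[str, str]:
    """Stand-in retriever: lead time only surfaces when asked for explicitly."""
    hits = {"food_grade": "yes"}
    if "lead_time_days" in query:
        hits["lead_time_days"] = "10"
    return hits


result = iterative_retrieve(
    "food-grade pump delivered in under two weeks",
    ["food_grade", "lead_time_days"],
    fake_search,
)
print(result.status, result.passes)
```

The `max_passes` cap matters as much as the loop itself: when evidence stays incomplete or contradictory, the function returns `needs_clarification` instead of producing an answer.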
Agent Layer: How the Catalog Answers Questions
The retrieval architecture determines whether the system can find the right information. The agent layer determines whether it can do something useful with it. Four agents handle the journey from question to answer.
Query Understanding Agent
Before retrieving anything, the system needs to understand what the user is actually asking. A question like “Which pump fits my application?” is very different from “Compare the A200 and B300 on pressure rating.” The query understanding agent identifies the intent (lookup, comparison, compatibility check, or recommendation) and structures the retrieval plan accordingly.
This intent detection layer is what allows non-technical users, including sales engineers, distributors, and procurement managers, to interact with complex underlying data without knowing how to frame a structured query. The quality of intent detection determines the quality of everything downstream.
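To make the four intents concrete, here is a deliberately simple rule-based sketch. It is an assumption for illustration only: a production agent would more likely use a language model or a trained classifier, and the keyword patterns and the model-number regex below are hypothetical.

```python
import re
from dataclasses import dataclass

# Checked in order; the first matching pattern wins. "lookup" is the fallback.
INTENT_PATTERNS = [
    ("comparison", re.compile(r"\b(compare|versus|vs\.?|difference between)\b", re.I)),
    ("recommendation", re.compile(r"\b(which|recommend|best|suitable)\b", re.I)),
    ("compatibility", re.compile(r"\b(compatible|fits?|works? with)\b", re.I)),
]

# Hypothetical model-number shape: 1-3 capital letters, optional hyphen, 2-4 digits.
MODEL_PATTERN = re.compile(r"\b[A-Z]{1,3}-?\d{2,4}\b")


@dataclass
class QueryPlan:
    intent: str
    models: list[str]


def understand(question: str) -> QueryPlan:
    """Detect the intent and pull out any model numbers mentioned."""
    intent = "lookup"
    for name, pattern in INTENT_PATTERNS:
        if pattern.search(question):
            intent = name
            break
    return QueryPlan(intent, MODEL_PATTERN.findall(question))


print(understand("Compare the A200 and B300 on pressure rating."))
print(understand("Which pump fits my application?"))
```

The resulting `QueryPlan` is what the retrieval step would consume: a comparison plan with two models leads to pulling the same attributes from both datasheets, while a recommendation plan with no models leads to an attribute-based search.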
Retrieval Agent
Once intent is clear, the retrieval agent selects the relevant documents and pulls the structured attributes needed to answer the question. For a product lookup, that means targeting a specific datasheet. For a comparison, it means pulling the same attributes from multiple products so they can be evaluated side by side.
The retrieval agent works against the classified, extracted knowledge base. The quality of extraction at ingestion time determines what the retrieval agent has to work with. This is why the ingestion pipeline is not a pre-processing step that can be cut to save time. It is the foundation.
Comparison Agent
Product comparisons are one of the highest-value use cases in manufacturing, and one of the hardest to do well with standard search. The comparison agent takes structured attributes from multiple products and generates a structured output: a table, a ranked list, or a narrative comparison depending on what the question requires.
This only works if the underlying data is structured consistently. Two products described in different terms (one spec sheet says “stainless steel housing,” another says “304 SS enclosure”) can only be compared if the extraction layer has normalized those attributes to a shared vocabulary.
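A minimal sketch of that normalization step, followed by a side-by-side table, might look like this. The synonym table, the attribute names (`housing_material`, `voltage_v`), and the product data are assumptions for illustration; a real vocabulary would be far larger and might keep the steel grade as a separate attribute rather than folding it into the material.

```python
import re

# Hypothetical shared vocabulary: raw phrase -> (attribute, normalized value).
SYNONYMS = {
    "stainless steel housing": ("housing_material", "stainless steel"),
    "304 ss enclosure": ("housing_material", "stainless steel"),
}
VOLTAGE = re.compile(r"(\d+)\s*v")


def normalize(phrases):
    """Map raw spec-sheet phrases onto shared attribute names and values."""
    attributes = {}
    for phrase in phrases:
        key = phrase.strip().lower()
        if key in SYNONYMS:
            name, value = SYNONYMS[key]
        elif (match := VOLTAGE.fullmatch(key)):
            name, value = "voltage_v", match.group(1)
        else:
            continue  # unmapped phrases would go to a review queue in practice
        attributes[name] = value
    return attributes


def comparison_table(products):
    """Build rows of [attribute, value per product] from normalized data."""
    normalized = {name: normalize(phrases) for name, phrases in products.items()}
    names = list(normalized)
    all_attributes = sorted({a for attrs in normalized.values() for a in attrs})
    rows = [["attribute", *names]]
    for attribute in all_attributes:
        rows.append([attribute, *(normalized[n].get(attribute, "unknown") for n in names)])
    return rows


table = comparison_table({
    "A200": ["Stainless steel housing", "220V"],
    "B300": ["304 SS enclosure", "110 V"],
})
for row in table:
    print(row)
```

Once both phrases resolve to the same attribute, the comparison becomes a table lookup instead of a language problem.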
Clarification Loop
When a question is ambiguous, the system asks for clarification. This human-in-the-loop mechanism is a designed transition point.
A clarification question catches a misrouted query before it surfaces a wrong product recommendation. The cost of asking is seconds. The cost of a wrong answer is a sale, a return, or worse — a safety incident with a misspecified component.
From Static Catalog to Intelligent Product Advisor
The workflow that makes this possible follows a consistent pattern regardless of query type:
- The user’s question enters the system and passes through question routing.
- The retrieval agent selects relevant documents and pulls the attributes needed.
- For comparison queries, the comparison agent structures the attributes side by side.
- If information is missing or ambiguous, the clarification loop surfaces a targeted question.
- The answer is generated with retrieved context embedded, grounded in the actual documents.
- Every step is logged for traceability. We use Langfuse for this, so the full reasoning trace can be inspected if an answer needs to be verified.
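The steps above can be sketched as a single orchestration function with a trace recorded at each stage. This is an illustrative skeleton only: the `Trace` class is an in-memory stand-in and does not reproduce Langfuse's API, and the four stage callables (`understand`, `retrieve`, `compare`, `generate`) are hypothetical placeholders for the agents described earlier.

```python
import time
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Trace:
    """In-memory stand-in for a tracing backend (not Langfuse's API)."""
    steps: list[dict[str, Any]] = field(default_factory=list)

    def log(self, name: str, **data: Any) -> None:
        self.steps.append({"step": name, "at": time.time(), **data})


def answer_question(question, understand, retrieve, compare, generate, trace):
    """Run one question through the workflow, logging every stage."""
    plan = understand(question)
    trace.log("query_understanding", intent=plan["intent"])

    evidence = retrieve(plan)
    trace.log("retrieval", documents=sorted(evidence))

    if not evidence:
        # Nothing to ground an answer in: ask instead of guessing.
        trace.log("clarification", reason="no evidence retrieved")
        return "Could you tell me more about the product or application?"

    if plan["intent"] == "comparison":
        evidence = compare(evidence)
        trace.log("comparison", rows=len(evidence))

    answer = generate(question, evidence)
    trace.log("generation", characters=len(answer))
    return answer
```

Because every branch writes to the trace, a questionable answer can be walked back to the intent that was detected and the documents that were retrieved.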
The change from static catalog to intelligent advisor is the shift from “here are your documents” to “here is the answer.” Every employee who needs to understand a product line can get to a useful answer in seconds rather than searching through PDFs or relying on whoever happens to know.
Why This Works Better Than Keyword Search
Keyword search finds documents that contain the words. This system understands what the user is asking. That distinction matters at every step:
- Structured attribute extraction means product specifications exist as queryable data, not text buried in paragraphs.
- Iterative retrieval means the system catches and corrects its own gaps before returning an answer.
- Normalization at ingestion time means “stainless steel housing” and “304 SS enclosure” are the same attribute when the user is comparing products.
- Validation across retrieval passes means confident-looking but wrong answers are caught before they reach the user.
The deeper issue is that keyword search shifts the work to the user. They have to read the documents, make the comparison, and synthesize the answer themselves. This system shifts that work to the architecture, so the user gets the answer directly.
Cost and Scaling Considerations
Building this system is one hill to climb. Running it at scale without the cost structure undermining the business case is another, and it’s a blind spot that consistently gets underestimated.
The scale of this problem is growing. The RAG market reached $1.85 billion in 2024 and is expanding at roughly 49% annually, which implies a doubling roughly every 21 months; if adoption tracks market size, the number of organizations learning these cost lessons in real time is growing at a similar pace. Most of them are discovering the same thing: the economics that look fine at proof-of-concept change fundamentally in operation. A typical enterprise knowledge base of 10,000 documents can be embedded and indexed for under $100 at ingestion. The ongoing cost isn’t ingestion. It’s the per-query token cost at production volume, multiplied by every distributor, sales engineer, and support agent hitting the system daily.
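A back-of-the-envelope model makes the contrast between one-time ingestion and recurring query cost easier to see. Every number in the example call is a placeholder chosen for illustration, not a measured figure or a vendor price; substitute your own query volume, token counts, and rates.

```python
def monthly_query_cost(
    queries_per_day: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    days: int = 30,
) -> float:
    """Estimate monthly model spend from per-query token usage."""
    per_query = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000
    return queries_per_day * days * per_query


# Placeholder inputs: 2,000 queries/day, 6,000 prompt tokens (question plus
# retrieved context), 500 answer tokens, $3 and $15 per million tokens.
estimate = monthly_query_cost(2000, 6000, 500, 3.0, 15.0)
print(f"${estimate:,.2f} per month")
```

With these placeholder inputs the estimate comes to about $1,530 per month, an order of magnitude above a one-time ingestion cost of under $100, and it recurs. Most of it comes from input tokens, which is why context compression and caching are the first levers to pull.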
The infrastructure decision that compounds this problem is the one that feels small at the start. In one project, a file upload feature was built routing data through the backend instead of directly to cloud storage using presigned URLs. The direct approach would have taken an extra day or two to implement properly. The team chose the faster path. When the system moved to production, all uploads were routed through a VPN security layer, and the bottleneck that created cost a month of engineering time to resolve through chunking, compression, and configuration tuning, none of which fully fixed the problem.
Several strategies reduce operating cost without sacrificing accuracy:
- Semantic caching — similar questions that have already been answered don’t need a new API call. In a product catalog context, where many distributors ask the same questions about the same products, cache hit rates can be high enough to significantly reduce per-query cost.
- Model routing — simple lookup queries don’t need the most capable and most expensive model. Routing simpler questions to smaller models reduces cost per query without affecting answer quality for those cases.
- Prompt optimization — restructuring prompts to use fewer tokens without losing accuracy compounds over thousands of daily queries.
- Structured extraction at ingestion — extracting and storing product attributes at ingestion time means query-time retrieval is faster and requires less model processing.
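The first two strategies can be sketched together. This is a toy illustration under loud assumptions: token-overlap (Jaccard) similarity stands in for embedding similarity, the `0.8` threshold is arbitrary, and `"small-model"` and `"large-model"` are placeholder names rather than real model identifiers.

```python
import re


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Jaccard overlap of tokens; a stand-in for embedding cosine similarity."""
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self._entries = []  # list of (question, answer) pairs

    def get(self, question):
        best = max(self._entries, key=lambda e: similarity(question, e[0]), default=None)
        if best is not None and similarity(question, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, question, answer):
        self._entries.append((question, answer))


def pick_model(intent):
    """Route simple lookups to the cheaper model; everything else to the larger one."""
    return "small-model" if intent == "lookup" else "large-model"


def answer(question, intent, cache, call_model):
    """Return (answer, source), where source is 'cache' or the model used."""
    cached = cache.get(question)
    if cached is not None:
        return cached, "cache"
    model = pick_model(intent)
    result = call_model(model, question)
    cache.put(question, result)
    return result, model
```

A linear scan over cached questions is fine for a sketch; at production volume the lookup would go through the same vector index used for retrieval.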
The companies that manage this well are the ones who modeled the operating cost of their systems before they scaled them, and built efficiency into the architecture from the beginning rather than retrofitting it later. The “fast and cheap” path doesn’t save money. It moves the bill to a later date, with interest.
What a 4-Week Proof of Concept Looks Like
For manufacturers evaluating this approach, a four-week proof of concept provides a working system built over their own documents.
- Week 1 — Document Ingestion: Ingest a representative sample of the manufacturer’s product PDFs. Classify document types, build extraction pipelines for each, establish the knowledge base.
- Week 2 — RAG Baseline: Stand up the hybrid search layer and iterative retrieval loop. Establish accuracy benchmarks against a set of test questions drawn from real customer queries.
- Week 3 — Agent Layer: Build the query understanding, retrieval, and comparison agents. Implement the clarification loop. Test against more complex, multi-step questions.
- Week 4 — Interface and Testing: Connect the system to a usable interface. Test with actual users from the sales or support team. Document gaps and prioritize the production roadmap.
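The Week 2 accuracy benchmark can start as something very small. The sketch below is one possible shape, not the evaluation method used in the projects described here: `BenchmarkCase`, the substring-matching check, and the example questions are assumptions, and substring matching is a crude proxy that a real evaluation would replace with human review or a stricter grader.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkCase:
    question: str
    must_contain: list[str]  # facts a correct answer has to mention


def evaluate(system: Callable[[str], str], cases: list[BenchmarkCase]):
    """Return (accuracy, failed questions) for a question-answering callable."""
    failures = []
    for case in cases:
        answer = system(case.question).lower()
        if not all(fact.lower() in answer for fact in case.must_contain):
            failures.append(case.question)
    accuracy = 1 - len(failures) / len(cases) if cases else 0.0
    return accuracy, failures


# Hypothetical test questions drawn from real customer queries.
cases = [
    BenchmarkCase("Which pump supports 220V?", ["A200"]),
    BenchmarkCase("Is the B300 food-grade?", ["yes"]),
]
```

Running the same cases after each change to ingestion, search, or prompts turns "it seems better" into a number that can be tracked across the four weeks.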
The output is a working system over the client’s own data, a straightforward picture of accuracy and limitations, and an informed decision about whether and how to move to operation.
In practice, Week 1 is where most projects encounter their first problem: getting clean, complete document sets from the manufacturer turns out to be harder than expected. Catalogs are scattered across systems, some PDFs are password-protected, and older datasheets exist only as scans. This is a reason to start with a representative sample rather than waiting for a complete catalog, and to build the ingestion pipeline robust enough to handle what arrives later.
When Manufacturers Should Build AI Product Catalogs
Not every manufacturer needs this system today. The use case is strongest when several conditions are true:
- Large product range — hundreds or thousands of SKUs where no individual can hold the full catalog in their head.
- Complex specifications — products differentiated by technical attributes that require precise matching to customer requirements.
- High PDF volume — most product knowledge lives in documents rather than in a structured database.
- Sales engineering load — a significant portion of pre-sales time is spent answering “which product fits X” questions that could be automated.
- Distributor or partner network — external parties who need to answer product questions without direct access to internal expertise.
The break-even point is often a function of sales cycle speed. If answering a product question takes 24 hours through the current process and it could take 30 seconds, the value compounds quickly. Across a distributor network of any real size, the cumulative time recovered, and the deals that don’t slip because a question got answered at 10pm instead of the next morning, add up faster than most teams expect.
In our experience, companies that try to introduce AI without first asking “what specific problem will this solve and how will we know it worked” tend to build systems that close the meeting and open a support ticket. The right question before starting isn’t “do we want AI in our product catalog?” It’s “what does our sales team spend the most time on that this system could handle instead?”
The Catalog That Works Back
You already have the answer. It's sitting in a datasheet, a spec sheet, an installation manual nobody has time to open. The information exists. It just can't do anything from where it is locked away.
Building a system that changes that isn't a straight line: the shortcuts that look cheap upfront have a way of showing up as month-long problems once you go live. But when it works, the impact is instant. Your newest sales rep has the same product knowledge as your most experienced one. A distributor gets an answer without calling anyone. A customer gets a precise response in seconds instead of waiting on the right PDF.
The catalog stops being something people dig through. It becomes something that works for them.