AI Agent System for Polymer Injection Moulding Operations

1. Client Background

The client operates three production lines running 24/7 across two shifts, producing high-tolerance plastic components for the automotive and white goods sectors. The plant runs 18 injection moulding machines ranging from 80 to 650 tonne clamping force, processing approximately 40 active moulds across different material grades.

Their quality management team of eight engineers relied on a combination of a legacy SCADA system, paper-based job sheets, and a shared Excel workbook to track process deviations, material batch changes, and first-article inspection (FAI) outcomes. Shift handovers were verbal, and institutional knowledge about mould-specific behaviour lived almost entirely in the heads of senior setters.

The client was not asking for AI to replace their engineers. They wanted a system that would make a new setter as effective as an experienced one within their first year, and prevent recurring defects from being rediscovered every time a mould returned to the floor after maintenance.

2. Problem Statement

Four specific, measurable problems were identified during discovery workshops with production, quality, and maintenance stakeholders.

Slow deviation response. When a process parameter drifted outside tolerance — barrel temperature, back pressure, cycle time — the setter had to manually cross-reference the job sheet, historical logs, and their own experience. Mean time to corrective action was measured at 35-40 minutes.

Repeated defect cycles. Flash, short shots, and sink marks on specific moulds were being "solved" repeatedly with no institutional memory. The same corrective actions were being rediscovered three to five times per year per mould, each time at the cost of scrap material and machine downtime.

Material batch variation. When a new batch of PA66 or PP-GF30 arrived, setters had no structured process for adjusting parameters to account for MFI or moisture variation. Adjustments were made ad hoc and were not recorded systematically against the batch certificate.

Shift knowledge gaps. Night shift had no access to context from day shift beyond brief verbal handovers. Information about in-progress deviations, mould wear, and tooling changes was routinely lost at shift change.

Out of scope (explicitly defined). The project did not attempt to automate machine control, replace the process engineer's sign-off authority, or integrate with the client's ERP system. This was a decision-support and knowledge-surfacing system, not a closed-loop control system.

3. Technical Architecture

Stack Overview

Orchestration: LangGraph
Observability: Langfuse (self-hosted on factory server)
Primary reasoning model: Mistral 7B Instruct (Q4_K_M quantisation)
Tool-call / routing model: Phi-3 Mini 3.8B (Q4_K_M quantisation)
Inference server: llama.cpp (HTTP server mode)
Vector store: ChromaDB (on-premise)
Relational database: PostgreSQL 15
API gateway: FastAPI
Operator interface: React + Tailwind CSS
Data source (read-only): SCADA REST API

Hardware Constraint

The factory's IT policy prohibited cloud API calls for production data. The dedicated inference server was a single workstation: NVIDIA RTX 4090 (24 GB VRAM), 128 GB RAM, running Ubuntu 22.04.

This constrained model selection to models of roughly 8B parameters or fewer at 4-bit quantisation. We benchmarked Mistral 7B Instruct, Phi-3 Mini 3.8B, and Llama 3.1 8B on internal evaluation sets derived from 200 historical deviation reports and corrective action records. Mistral 7B was selected for its reasoning quality on process-related text; Phi-3 Mini was selected for tool-call parsing because of its lower latency and strong instruction-following at small size.

Both models were loaded simultaneously in llama.cpp server mode. Peak VRAM usage reached approximately 7.5 GB, leaving adequate headroom for burst loads during shift transitions.

Langfuse Deployment

Langfuse was deployed on a separate internal application server using Docker Compose. All agent traces — inputs, outputs, latencies, token counts, and human review actions — were written to Langfuse. This gave the QA lead and the engineering team a complete audit trail of every agent decision, which was a specific requirement from the client's quality management system (ISO 9001 compliance).

The Langfuse dashboard was used weekly to review agent accuracy, flag low-confidence outputs, and identify patterns in human overrides. No production data was ever sent outside the factory network.

4. Agent Design

The system was composed of four specialist agents and one supervisor/router agent, orchestrated as a LangGraph stateful graph. Each agent had a clearly bounded scope. We deliberately avoided building a single general-purpose agent — smaller, focused agents are easier to evaluate, debug, and explain to operators who need to trust the system.

Agent 1: Deviation Analyst

Model: Mistral 7B Instruct
Trigger: SCADA parameter snapshot outside defined tolerance band

When a deviation is detected, this agent retrieves the last 48 hours of process logs for the relevant machine and mould combination, identifies the pattern and direction of drift, and queries a ChromaDB vector store of historical corrective actions for semantically similar past events. It produces a structured output containing the deviation summary, likely contributing factors, and up to three suggested corrective actions drawn from the historical knowledge base, each with a similarity score.

The agent does not issue any instructions to the machine. Its output is presented to the setter on the operator interface as a recommendation card.
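
As a rough illustration, the structured output described above could be modelled along the following lines. The class names, field names, and sample values are hypothetical and are not taken from the deployed system; the only details drawn from the text are the three-action cap and the per-action similarity score.

```python
from dataclasses import dataclass, field


@dataclass
class SuggestedAction:
    description: str   # corrective action text from a historical report
    similarity: float  # retrieval similarity score for that report


@dataclass
class RecommendationCard:
    machine_id: str
    mould_id: str
    deviation_summary: str
    contributing_factors: list[str]
    actions: list[SuggestedAction] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Keep at most three actions, highest similarity first.
        ranked = sorted(self.actions, key=lambda a: a.similarity, reverse=True)
        self.actions = ranked[:3]


# Illustrative values only.
card = RecommendationCard(
    machine_id="M06",
    mould_id="mould-14",
    deviation_summary="Barrel temperature drifting upward",
    contributing_factors=["heater band cycling", "ambient temperature rise"],
    actions=[
        SuggestedAction("Purge barrel", 0.58),
        SuggestedAction("Reduce barrel setpoint", 0.81),
        SuggestedAction("Inspect heater band", 0.69),
        SuggestedAction("Check thermocouple seating", 0.74),
    ],
)
```

The card carries data only. Nothing in it can be sent to a machine, which mirrors the advisory-only role of the agent.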

Agent 2: Material Advisor

Model: Mistral 7B Instruct
Trigger: New material batch logged in the intake system

When a new batch of raw material is received and logged, the Material Advisor compares the batch certificate values (MFI, moisture content, filler percentage) against the baseline parameters held in the mould process card for every active job using that material grade. It produces plain-language parameter adjustment guidance — typically covering barrel temperature profile, back pressure, and drying time — that the setter should review before beginning production with the new batch.

The agent pulls from a small internal knowledge base of material science guidelines that were reviewed and approved by the client's process engineer during the pilot phase. Guidance outside this knowledge base is explicitly flagged as uncertain.
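
The certificate-versus-baseline comparison can be sketched as a simple pre-processing step that runs before any guidance is generated. This is a minimal sketch under stated assumptions: the property names, the 5% relative threshold, and the status labels are all hypothetical, and the real adjustment rules lived in the approved knowledge base rather than in code like this.

```python
from typing import Optional


def compare_batch(
    certificate: dict[str, float],
    baseline: dict[str, float],
    rel_threshold: float = 0.05,  # hypothetical threshold, not from the project
) -> dict[str, dict[str, Optional[object]]]:
    """Report how far each certificate value sits from the process-card baseline."""
    report: dict[str, dict[str, Optional[object]]] = {}
    for prop, base in baseline.items():
        value = certificate.get(prop)
        if value is None or base == 0:
            # Nothing reliable to compare against: flag rather than guess.
            report[prop] = {"status": "uncertain", "relative_change": None}
            continue
        change = (value - base) / base
        status = "review" if abs(change) > rel_threshold else "ok"
        report[prop] = {"status": status, "relative_change": round(change, 3)}
    return report


# Illustrative values only.
baseline = {"mfi_g_10min": 20.0, "moisture_pct": 0.10, "filler_pct": 30.0}
certificate = {"mfi_g_10min": 23.0, "filler_pct": 30.3}
report = compare_batch(certificate, baseline)
```

A missing certificate value is marked "uncertain" instead of being silently skipped, in the same spirit as the agent flagging guidance that falls outside its knowledge base.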

Agent 3: Shift Summariser

Model: Mistral 7B Instruct
Trigger: Scheduled job, 15 minutes before each shift end

This agent runs automatically 15 minutes before the end of each shift. It pulls all deviation events, corrective actions taken, material batch changes, and machine downtime incidents from the structured shift log in PostgreSQL. It generates a concise handover briefing for the incoming shift supervisor: a short narrative paragraph describing the state of each active production job, followed by a bullet list of flagged items that require attention or monitoring during the incoming shift.

The summary is delivered to a shared internal web interface and is also printed automatically to a small label printer in the supervisor's office. This was a specific request from the operations manager, who did not want the incoming supervisor to need to log in before reading the briefing.

Agent 4: Knowledge Query Agent

Model: Phi-3 Mini 3.8B (query parsing) + Mistral 7B (answer generation)
Trigger: Operator natural language query via the chat interface

This agent allows setters and engineers to ask questions in plain language through a simple chat interface on the operator terminals. Examples of real queries from the pilot phase:

"What was the last time we had flash on mould 14 and what did we do about it?"
"This is the first time we've run PA66-GF35. What should I watch for?"
"Machine 6 has been running 2 seconds long on cycle time since Monday. Any idea why?"

Phi-3 Mini handles query classification and generates a structured retrieval plan covering which database tables or vector store collections to query. Mistral 7B generates the final response grounded in the retrieved context. Source references are always shown alongside the answer so the operator knows whether the information came from a historical deviation report, a mould process card, or a material datasheet.
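
One way to keep a small routing model's retrieval plan safe is to have it emit JSON and check every step against an allow-list before any query runs. The sketch below assumes a JSON shape with a top-level "steps" list and uses hypothetical source names; neither is specified in this write-up.

```python
import json

# Hypothetical allow-list of retrieval targets.
ALLOWED_SOURCES = {
    "deviation_reports",
    "mould_process_cards",
    "material_datasheets",
}


def parse_retrieval_plan(raw: str) -> list[dict]:
    """Parse the router model's JSON plan and keep only allow-listed steps."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return []  # caller treats an empty plan as "could not classify"
    if not isinstance(plan, dict):
        return []
    steps = plan.get("steps", [])
    if not isinstance(steps, list):
        return []
    return [
        step
        for step in steps
        if isinstance(step, dict)
        and isinstance(step.get("source"), str)
        and step["source"] in ALLOWED_SOURCES
    ]


raw_plan = (
    '{"steps": [{"source": "deviation_reports", "filter": {"mould": 14}},'
    ' {"source": "erp_orders"}]}'
)
valid_steps = parse_retrieval_plan(raw_plan)
```

Because each surviving step names its source, the same field can drive the source references shown next to the answer.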

Supervisor / Router Agent

Model: Phi-3 Mini 3.8B
Role: Entry point and orchestration logic

All inputs — SCADA events, scheduled jobs, and operator queries — pass through the supervisor node in the LangGraph graph. The supervisor classifies the input, routes to the appropriate specialist agent, manages retry logic for low-confidence outputs, and decides whether a human review checkpoint should be inserted before the response is surfaced to the operator. The supervisor does not generate responses itself; it manages flow only.

5. LangGraph Workflow Design

The graph was structured with a mix of deterministic edges (a SCADA deviation always routes to the Deviation Analyst) and conditional edges (the supervisor node evaluates confidence scores and decides whether to proceed to output or hold for human review).
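
The two kinds of edge can be illustrated in plain Python without the LangGraph API. This is a stand-in sketch, not the project's graph definition: the input type names, node names, and the confidence threshold parameter are all hypothetical.

```python
# Deterministic edges: an input type always maps to the same specialist agent.
ROUTES = {
    "scada_deviation": "deviation_analyst",
    "material_batch": "material_advisor",
    "shift_schedule": "shift_summariser",
    "operator_query": "knowledge_query",
}


def route_input(input_type: str) -> str:
    """Deterministic edge: look up the specialist for a classified input."""
    return ROUTES.get(input_type, "unclassified")


def next_step(confidence: float, threshold: float) -> str:
    """Conditional edge: release the output or hold it for human review."""
    return "output" if confidence >= threshold else "human_review"
```

In a LangGraph graph the second function would play the role of the routing callable attached to a conditional edge, while the first corresponds to fixed edges out of the supervisor node.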

Retry and Fallback Logic

If the Deviation Analyst produced a response with average retrieval similarity below 0.55 — meaning the vector store had no close historical matches — the graph branched to a fallback path that surfaced a raw log summary to the operator with an explicit note: "No closely matching historical events found. Showing raw process data for manual review." We did not want the system generating speculative recommendations when it had no grounding.
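
The fallback rule reduces to a single comparison on the mean similarity of the retrieved matches. The sketch below uses the 0.55 threshold and the operator note quoted above; the function name, return shape, and the treatment of an empty result set are assumptions.

```python
from statistics import mean

SIMILARITY_THRESHOLD = 0.55  # value stated in the text
FALLBACK_NOTE = (
    "No closely matching historical events found. "
    "Showing raw process data for manual review."
)


def choose_path(similarities: list[float]) -> dict:
    """Pick the recommendation path or the raw-log fallback path."""
    # Assumption: an empty result set counts as having no grounding at all.
    average = mean(similarities) if similarities else 0.0
    if average < SIMILARITY_THRESHOLD:
        return {"path": "fallback", "note": FALLBACK_NOTE}
    return {"path": "recommend", "note": None}
```

For example, `choose_path([0.4, 0.5, 0.6])` averages 0.5 and takes the fallback path, while `choose_path([0.7, 0.8])` proceeds to a recommendation.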

6. Human-in-the-Loop Design

Human review was not an afterthought — it was a first-class part of the LangGraph graph design. We identified four specific checkpoints where operator or engineer input was required before the system could proceed or finalise an output.

Checkpoint 1: Corrective Action Confirmation (High-urgency deviations)

When the Deviation Analyst flagged a deviation classified as high severity — for example, barrel temperature more than 15°C outside tolerance or cycle time drift exceeding 8% — the suggested corrective action was presented to the setter with a mandatory confirmation step. The setter had to explicitly select "Apply recommendation" or "Override with manual action" before the event was logged as resolved. Silently applying recommendations was intentionally not possible.

The setter's choice and any manual notes were written back to the PostgreSQL log and used to update the ChromaDB vector store over time, improving the quality of future recommendations.
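
The severity rule and the mandatory confirmation step can be sketched as two small functions. The 15°C and 8% thresholds come from the example above; the function names, choice labels, and status strings are hypothetical.

```python
from typing import Optional

VALID_CHOICES = {"apply_recommendation", "override_manual"}


def is_high_severity(temp_excess_c: float, cycle_drift_pct: float) -> bool:
    """temp_excess_c is degrees C beyond the tolerance band (0 if inside it)."""
    return temp_excess_c > 15.0 or abs(cycle_drift_pct) > 8.0


def confirmation_status(setter_choice: Optional[str]) -> str:
    """A high-severity event stays open until the setter chooses explicitly."""
    if setter_choice in VALID_CHOICES:
        return "resolved"
    return "awaiting_confirmation"
```

There is deliberately no default choice: a missing or unrecognised value leaves the event open, which is what makes silent application impossible.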

Checkpoint 2: Material Advisor Sign-off

The Material Advisor output was never delivered directly to the setter as an instruction. It was routed first to the shift supervisor for review. The supervisor could approve, modify, or reject the guidance. Only after supervisor approval was the recommendation surfaced to the setter's terminal. This reflected the client's existing quality process — parameter changes on a new material batch had always required supervisor sign-off, and the AI system did not change that authority structure.

Checkpoint 3: Shift Summary Review

The Shift Summariser output was presented to the outgoing shift supervisor for a brief review before being shared with the incoming shift. The supervisor could add context, correct factual errors, or flag urgent items for escalation. The review window was deliberately kept short — the interface surfaced a diff view showing only items that differed from the previous shift's summary, reducing cognitive load.
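
A diff view of this kind can be produced by comparing the flagged items of two consecutive summaries. The sketch assumes each summary is reduced to a mapping from job identifier to a short status string, which is an illustrative simplification; the identifiers and statuses shown are invented.

```python
def summary_diff(
    previous: dict[str, str], current: dict[str, str]
) -> dict[str, list[str]]:
    """List the job IDs that were added, changed, or cleared since last shift."""
    added = sorted(job for job in current if job not in previous)
    cleared = sorted(job for job in previous if job not in current)
    changed = sorted(
        job for job in current
        if job in previous and current[job] != previous[job]
    )
    return {"added": added, "changed": changed, "cleared": cleared}


# Illustrative values only.
previous_shift = {
    "job-101": "flash on mould 14, monitoring",
    "job-102": "running to plan",
}
current_shift = {
    "job-101": "flash on mould 14, resolved",
    "job-103": "new PA66 batch loaded",
}
diff = summary_diff(previous_shift, current_shift)
```

Unchanged items are omitted from the result, so the supervisor reviews only what is new, different, or gone.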

Checkpoint 4: Low-Confidence Query Responses

When the Knowledge Query Agent produced a response with a low internal confidence estimate — typically because retrieved context was sparse or contradictory — the response was flagged visually on the operator terminal with a yellow banner reading "Low confidence — recommend verifying with process engineer." The operator could still read the response but was explicitly informed that it should not be relied upon without verification.

All checkpoint interactions were captured in Langfuse traces, allowing the QA lead to review human override rates week by week and identify areas where agent recommendations were systematically being rejected.

7. Pilot Phase and Rollout

Months 1–2: Data Preparation

The most time-consuming part of the project was not model work — it was data. The first six weeks were spent extracting and structuring historical records:

3 years of SCADA logs (exported to PostgreSQL)
847 historical deviation reports (digitised from paper and ingested into ChromaDB after cleaning)
120 mould process cards (converted from Word documents to structured JSON)
34 material datasheets (parsed and loaded into the knowledge base)

The client's QA lead was involved in every stage of data cleaning and was responsible for approving the accuracy of the ChromaDB knowledge base before any agent was built against it.

Months 3–4: Pilot on Two Machines

The system was deployed in read-only advisory mode on two machines running the highest-volume mould. No machine parameters were changed by the system. Operators were asked to rate the quality of agent recommendations after each event using a simple thumbs-up / thumbs-down interface.

After four weeks, recommendation acceptance rates were:

Deviation Analyst: 71% accepted without modification
Material Advisor: 83% accepted (most rejections were on edge cases with unusual filler grades)
Knowledge Query Agent: 68% rated as useful or very useful

The shift summaries required the most iteration. Early versions were too long and included information setters already knew. After three rounds of prompt revision based on supervisor feedback, summary length dropped from an average of 420 words to 180 words, and usefulness ratings improved significantly.

Months 5–7: Full Production Rollout and Tuning

The system was rolled out to all 18 machines. A two-day onboarding session was run for all setters and supervisors covering what the system could and could not do, how to interpret confidence indicators, and how to correct the system when it was wrong. Operator feedback at this stage was used to identify prompt improvements and UI changes before the system was considered stable.

The framing of the onboarding was deliberate: setters were positioned as domain experts training the system, not as users being replaced by it.

8. Observability with Langfuse

Every agent invocation produced a trace in Langfuse containing the full input context, intermediate retrieval results, model outputs, latency at each node, and the human review outcome. This gave the team several practical capabilities:

Debugging production failures. When a setter reported that a recommendation "made no sense," the engineering team could pull the exact trace, inspect which documents were retrieved from ChromaDB, and identify whether the issue was a retrieval failure or a reasoning failure. In practice, the majority of poor recommendations traced back to low-quality source documents in the knowledge base, not model reasoning errors.

Tracking human override patterns. Weekly review of override rates by checkpoint, agent, and machine allowed the QA lead to identify systematic gaps. For instance, the Material Advisor was consistently being overridden for PP-GF30 batches from one specific supplier. An investigation revealed that the process cards for that grade contained outdated parameter ranges.
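
The weekly override review amounts to a grouped ratio. The sketch below assumes review outcomes have already been exported as (agent, machine, overridden) tuples; that export format and the sample values are hypothetical, and the code does not use the Langfuse API.

```python
from collections import defaultdict
from typing import Iterable


def override_rates(
    events: Iterable[tuple[str, str, bool]]
) -> dict[tuple[str, str], float]:
    """Return the share of overridden recommendations per (agent, machine)."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    overrides: dict[tuple[str, str], int] = defaultdict(int)
    for agent, machine, overridden in events:
        key = (agent, machine)
        totals[key] += 1
        overrides[key] += int(overridden)
    return {key: overrides[key] / totals[key] for key in totals}


# Illustrative values only.
sample_events = [
    ("material_advisor", "M03", True),
    ("material_advisor", "M03", False),
    ("deviation_analyst", "M06", False),
]
rates = override_rates(sample_events)
```

Adding further grouping fields such as checkpoint, material grade, or supplier to the key is how a pattern like the PP-GF30 case would surface.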

Latency monitoring. The Phi-3 Mini routing step consistently ran in under 1.2 seconds. Mistral 7B reasoning steps ran between 8 and 22 seconds depending on context length. Operators were shown a progress indicator during longer operations — without this, early pilot users assumed the system had crashed when waiting for a response.

Resource tracking. Because inference was self-hosted, cost was measured in GPU time. Langfuse token counts revealed that the Shift Summariser was consuming a disproportionate share of GPU time due to long context windows at shift end. Context was refactored to pass structured summaries rather than raw logs, reducing token usage by approximately 60% without degrading output quality.
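
The context refactoring can be pictured as collapsing raw log rows into one row per machine and event type before the prompt is built. This is a minimal sketch assuming each raw event is a dict with "machine" and "type" keys; the actual log schema and the grouping used in the project are not described here.

```python
from collections import Counter


def condense_events(raw_events: list[dict]) -> list[dict]:
    """Collapse raw log rows into one count per (machine, event type)."""
    counts = Counter((event["machine"], event["type"]) for event in raw_events)
    return [
        {"machine": machine, "type": event_type, "count": count}
        for (machine, event_type), count in sorted(counts.items())
    ]


# Illustrative values only.
raw_log = [
    {"machine": "M06", "type": "cycle_time_drift", "ts": "06:10"},
    {"machine": "M06", "type": "cycle_time_drift", "ts": "06:40"},
    {"machine": "M02", "type": "material_change", "ts": "07:05"},
]
condensed = condense_events(raw_log)
```

Three raw rows become two structured rows here; on a full shift of logs the same idea shortens the prompt considerably while preserving what happened on each machine.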

9. Limitations and Honest Observations

The models were not always right. At 7B parameters, Mistral occasionally produced recommendations that were plausible-sounding but technically incorrect — particularly for less common defect types with few historical records. The human-in-the-loop checkpoints were essential for catching these cases. The system was not safe to deploy without them.

Data quality was the real bottleneck. The quality of agent outputs was almost entirely determined by the quality of the knowledge base. Poorly written historical deviation reports produced poor recommendations. The project spent more time on data cleaning and curation than on model or agent work.

Operator trust took time. Several experienced setters were sceptical of the system during the pilot phase and consistently overrode recommendations even when those recommendations were later shown to be correct. Trust improved gradually as setters saw that their override choices were respected and recorded, and as recommendation quality demonstrably improved over time. Forcing compliance would have been counterproductive.

Latency is noticeable. A 10–20 second response time for complex queries is acceptable in an office context but feels slow on a factory floor when a machine is running out of spec. The human review checkpoints added further delay. Managing operator expectations about response times was an ongoing communication challenge throughout the rollout.

10. Outcomes (6 Months Post-Deployment)

Mean time to corrective action on process deviations reduced from the 35-40 minute baseline measured during discovery to approximately 9 minutes, primarily because setters no longer needed to search historical records manually.

Repeat defect events — defined as the same defect type on the same mould within a 90-day window — dropped by 58% in the first six months. The knowledge base now captures corrective actions at the point they are taken, making institutional memory persistent rather than personal.
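
The repeat-defect definition can be made concrete as a counting rule over the event log. The sketch below assumes the 90-day window is measured from the previous occurrence of the same defect on the same mould and that the boundary is inclusive; the text does not specify either point, and the sample events are invented.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=90)  # window length stated in the text


def count_repeat_events(events: list[tuple[str, str, date]]) -> int:
    """Count events that recur within the window for the same mould and defect.

    Each event is (mould_id, defect_type, event_date).
    """
    last_seen: dict[tuple[str, str], date] = {}
    repeats = 0
    for mould, defect, day in sorted(events, key=lambda e: e[2]):
        key = (mould, defect)
        previous = last_seen.get(key)
        if previous is not None and day - previous <= WINDOW:
            repeats += 1
        last_seen[key] = day
    return repeats


# Illustrative values only.
sample = [
    ("mould-14", "flash", date(2024, 1, 10)),
    ("mould-14", "flash", date(2024, 3, 1)),   # 51 days later: a repeat
    ("mould-14", "sink", date(2024, 3, 5)),    # different defect type
    ("mould-07", "flash", date(2024, 3, 10)),  # different mould
    ("mould-14", "flash", date(2024, 8, 1)),   # 153 days later: not a repeat
]
repeat_count = count_repeat_events(sample)
```

Running the same function over equal-length periods before and after deployment gives the two counts from which a percentage reduction can be calculated.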

Shift handover time reduced from an average of 12 minutes of verbal briefing to approximately 4 minutes, with the incoming supervisor reviewing the AI-generated summary before the meeting and using the briefing to ask targeted questions rather than catch up on basics.

The Material Advisor flagged 11 batch-related parameter adjustments during the first six months. Seven were accepted and applied before production started. In each of those seven cases, first-off quality was within tolerance on the first attempt. This was a meaningful improvement over the previous process, where batch-related adjustments were typically made reactively after the first samples were inspected.

11. What We Would Do Differently

Use a structured output schema from the start. Early in the project, agent outputs were free-form text that the front end parsed with regular expressions. This was fragile and caused several rendering bugs. Moving to JSON-structured outputs with Pydantic validation from the beginning would have saved approximately two weeks of debugging.
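
To show what validating structured output at the boundary looks like, the sketch below parses a model response as JSON and rejects anything that does not match the expected shape. It uses only the standard library so that it runs without dependencies; in the project the same checks would more naturally be expressed as a Pydantic model. The field names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class AgentOutput:
    summary: str
    actions: list[str]
    confidence: float


def parse_agent_output(raw: str) -> AgentOutput:
    """Parse and validate model output; raise ValueError on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    summary = data.get("summary")
    actions = data.get("actions")
    confidence = data.get("confidence")
    if not isinstance(summary, str) or not summary:
        raise ValueError("summary must be a non-empty string")
    if not isinstance(actions, list) or not all(
        isinstance(action, str) for action in actions
    ):
        raise ValueError("actions must be a list of strings")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return AgentOutput(summary, actions, float(confidence))
```

The front end then renders typed fields instead of pattern-matching free text, and a malformed response fails at a single, loggable point.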

Instrument the vector store from day one. Retrieval logging was added to Langfuse partway through the pilot, which meant incomplete data for the first four weeks. Tracing retrieval results alongside model outputs from the start would have accelerated knowledge base quality improvements considerably.

Run a shorter proof-of-concept before committing to full data preparation. Seven months was the right total project length, but a tighter two-week proof-of-concept on a single mould with two operators before the main data preparation effort would have surfaced operator UX feedback earlier and allowed prompt design to begin with real interaction data rather than assumptions.