AI & Intelligence

AI Training Pipeline

Extract training data from conversations, upload knowledge entries, manage datasets, and monitor training quality.

01

Five Extractors

Training data is extracted from real conversations using five specialised extractors: chat Q&A pairs, intent classification examples, RAG retrieval pairs, knowledge base entries, and combined multi-format output. Each extractor produces JSONL output with anonymisation applied — PII is stripped before any data enters the training pipeline.

02

Dataset Management

View, filter, and manage training datasets from the admin UI. Each dataset shows its source, extraction type, entry count, and creation date. Download datasets as JSONL for external use or delete datasets that are no longer needed. Statistics show the total volume of training data across all dataset types.

03

R2 Storage

Extracted training data is stored on Cloudflare R2 as JSONL files. Each dataset has a unique key and metadata record in the database. The R2 storage model keeps training data separate from the production database, making it easy to manage lifecycle, access control, and compliance requirements independently.

04

Queue Integration

Training data extraction runs as async jobs via the queue system. Submit an extraction job from the admin UI, and it processes in the background without blocking the main application. The training.extract job type handles scheduling, progress tracking, and error handling for large extraction operations.

Ready to see it in action?

Get Started