Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling
Title: Enhancing Retail Product Classification for Consumer Price Indices: A Hybrid Rule-Based and Bag-of-Words Approach with Reliability-Weighted Human Oversight
Abstract:
The integration of alternative data streams—such as web-scraped information, transaction and receipt records, and scanner data—has become increasingly vital for measuring consumer prices. However, a persistent challenge arises from the nature of product descriptions in these sources: they are often brief, noisy, abbreviated, and lack standardized product codes. Consequently, before meaningful price comparisons can be conducted, each item must be mapped to a specific consumption classification, such as the UN COICOP framework. This study presents a general, reproducible methodology for executing this mapping.
The proposed pipeline operates in three stages. First, it normalizes and tokenizes the text to handle noisy item names. Second, it applies a rule-based pre-classifier built on a prefix tree (trie) of category-specific key-phrases and stop-phrases. Third, it applies a per-category binary confirmation model that verifies whether an item belongs to its tentatively assigned category. To manage labeling at scale, the system uses a human-in-the-loop protocol in which annotators give binary valid/reject judgments. These judgments are aggregated with dynamically updated reliability weights, and the same rules support continual fine-tuning of the model.
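The first two stages can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the names normalize, PhraseTrie and pre_classify, the example categories and phrases, and the rule that a stop-phrase vetoes its own category are all hypothetical choices made for this sketch.

```python
import re
from typing import Dict, Iterator, List, Optional, Tuple


def normalize(name: str) -> List[str]:
    """Lowercase, turn every non-alphanumeric run into a space, split into tokens."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).split()


class PhraseTrie:
    """Prefix tree over token sequences; a node stores the phrases ending there."""

    def __init__(self) -> None:
        self.children: Dict[str, "PhraseTrie"] = {}
        self.payloads: List[Tuple[str, bool]] = []  # (category, is_stop_phrase)

    def add(self, phrase: str, category: str, is_stop: bool = False) -> None:
        node = self
        for tok in normalize(phrase):
            node = node.children.setdefault(tok, PhraseTrie())
        node.payloads.append((category, is_stop))

    def matches(self, tokens: List[str]) -> Iterator[Tuple[str, bool]]:
        """Yield the payload of every stored phrase found contiguously in tokens."""
        for start in range(len(tokens)):
            node = self
            for tok in tokens[start:]:
                node = node.children.get(tok)
                if node is None:
                    break
                yield from node.payloads


def pre_classify(trie: PhraseTrie, name: str) -> Optional[str]:
    """Return the first key-phrase category not vetoed by one of its stop-phrases.

    Assumption of this sketch: a stop-phrase only blocks its own category.
    """
    hits: List[str] = []
    vetoed = set()
    for category, is_stop in trie.matches(normalize(name)):
        if is_stop:
            vetoed.add(category)
        else:
            hits.append(category)
    for category in hits:
        if category not in vetoed:
            return category
    return None


# Hypothetical rules for two made-up categories.
trie = PhraseTrie()
trie.add("milk", "dairy")
trie.add("milk chocolate", "dairy", is_stop=True)
trie.add("chocolate", "confectionery")
```

With these toy rules, pre_classify(trie, "MILK 2% 1L") yields the tentative category "dairy", while "Milk Chocolate Bar 100g" is vetoed for "dairy" by the stop-phrase and falls through to "confectionery". An item with no key-phrase match returns None and would bypass the confirmation stage.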
The empirical results argue against model complexity. In a controlled, leakage-free study (one category, real positives against hard negatives, five seeds), bag-of-words models essentially saturated the task, reaching an F1 score of approximately 0.99. A linear classifier performed as well as a multilayer perceptron, and explicit word-order features (n-grams) provided no additional benefit. The study further suggests that approximately 67 labeled examples are sufficient for effective performance.
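To make the idea of a linear bag-of-words confirmation model concrete, here is a small sketch. It uses a plain perceptron as a stand-in linear classifier, which is an assumption of this example and not necessarily the model used in the study. The four training names and the "fresh milk" target category are invented toy data, and train_perceptron and predict are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, List, Tuple

BIAS = "<bias>"  # pseudo-token that carries the intercept


def bag_of_words(name: str) -> Counter:
    """Token counts only: word order is deliberately discarded."""
    features = Counter(name.lower().split())
    features[BIAS] = 1
    return features


def score(weights: Dict[str, float], name: str) -> float:
    """Linear score: sum of token weights times token counts."""
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in bag_of_words(name).items())


def train_perceptron(data: List[Tuple[str, int]], epochs: int = 20) -> Dict[str, float]:
    """Labels are +1 (item belongs to the category) or -1 (hard negative)."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for name, label in data:
            if label * score(weights, name) <= 0:  # misclassified or on the boundary
                for tok, cnt in bag_of_words(name).items():
                    weights[tok] = weights.get(tok, 0.0) + label * cnt
    return weights


def predict(weights: Dict[str, float], name: str) -> bool:
    return score(weights, name) > 0


# Invented toy data: positives vs. hard negatives that share the token "milk".
TOY = [
    ("whole milk 1l", +1),
    ("milk chocolate bar", -1),
    ("skim milk 2l", +1),
    ("coconut milk can", -1),
]
weights = train_perceptron(TOY)
```

On this toy set the shared token "milk" ends with weight zero, so the decision rests on the surrounding tokens. It illustrates, on a very small scale, why unordered token counts can be enough to separate real positives from hard negatives.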
Regarding the labeling protocol, a Monte-Carlo analysis revealed that the reliability-weighted voting method offers only a marginal improvement over simple majority voting, as its additive weights tend to saturate. In contrast, the Dawid-Skene method was shown to recover labels significantly more effectively. The paper concludes with a discussion on price-level quality control and provides design recommendations for statistical offices considering the use of transaction data. All figures included are illustrative; no confidential data, code, or documentation is reproduced.
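The contrast between majority voting and reliability-weighted voting, and the way additive weights can saturate, can be illustrated with a short sketch. The update rule below (a fixed step up or down, clipped to a range) is only one possible additive scheme assumed for illustration; the function names, step size, and bounds are hypothetical and are not taken from the paper. Dawid-Skene is not implemented here.

```python
from typing import Dict


def majority_vote(votes: Dict[str, bool]) -> bool:
    """Accept when strictly more annotators say valid than reject."""
    return 2 * sum(votes.values()) > len(votes)


def weighted_vote(votes: Dict[str, bool], weights: Dict[str, float]) -> bool:
    """Accept when the reliability-weighted sum of +1/-1 votes is positive."""
    total = sum(weights[a] if v else -weights[a] for a, v in votes.items())
    return total > 0


def update_weights(
    votes: Dict[str, bool],
    consensus: bool,
    weights: Dict[str, float],
    step: float = 0.1,   # assumed additive step
    lo: float = 0.1,     # assumed lower bound
    hi: float = 1.0,     # assumed upper bound
) -> None:
    """Reward agreement with the consensus, penalize disagreement, clip in place."""
    for annotator, vote in votes.items():
        delta = step if vote == consensus else -step
        weights[annotator] = min(hi, max(lo, weights[annotator] + delta))


# Hypothetical annotators: one trusted, two less so.
weights = {"ann_a": 1.0, "ann_b": 0.3, "ann_c": 0.3}
votes = {"ann_a": True, "ann_b": False, "ann_c": False}

simple = majority_vote(votes)             # two rejects outvote one valid
weighted = weighted_vote(votes, weights)  # the trusted annotator prevails
update_weights(votes, weighted, weights)  # ann_a is already at the cap
```

In this example ann_a's weight is already at the upper bound, so further agreement cannot raise it. Once several annotators sit at the same bound, the weighted vote behaves much like a simple majority, which is consistent with the marginal gain reported above.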
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





