Lexicons and grammars for language processing: industrial or handcrafted products?
Title: Language Processing Resources: The Debate Between Manual Craftsmanship and Industrial Automation
Abstract
In recent years, the use of linguistic data in language processing has grown steadily, and such data are now widely referred to as language resources. Traditional resources consist primarily of text collections such as the Brown Corpus and the Penn Treebank, but electronic lexicons (including WordNet, FrameNet, VerbNet, ComLex, and Lexicon-Grammar) and formal grammars such as Tree-Adjoining Grammar (TAG) have recently seen a surge in development.
These resources contrast sharply in how they are built: corpus construction has long been highly automated, whereas lexicons and grammars are still built predominantly by hand. Language processing experts increasingly acknowledge that lexicons and grammars carry denser information than corpora and so enable more sophisticated processing techniques. The difference in construction method may be connected to this difference in informational content: careful handcrafting by linguists yields more informative data than automatic generation can produce.
Currently, this field is trending in two divergent directions. One path involves language technology specialists becoming accustomed to managing manually constructed resources, which offer greater complexity and depth. The other, which represents the dominant view, focuses on automating and industrializing the creation of lexicons and grammars. Both trajectories are already underway, creating a palpable tension between them.
The future relationship between linguists and computer scientists hinges on which direction prevails. The first approach necessitates the recruitment and training of a significant number of linguists, whereas the second relies primarily on technical solutions devised by computer engineers. This article examines practical examples of these language resources to evaluate whether manual creation, industrial generation, or a hybrid of both approaches offers the most realistic or effective outcomes.
Source: arXiv





