Marineli, F. (2025). Large Language Models as Control Planes for Industrial-Scale Web Data Extraction. In VLDB Ph.D. Workshop.
Large Language Models as Control Planes for Industrial-Scale Web Data Extraction
Felipe Marineli
2025-01-01
Abstract
Web-scale information extraction faces a fundamental trade-off: rule-based wrappers are brittle and vulnerable to drift, while end-to-end LLM extraction is accurate but costly and opaque. We introduce a pipeline that promotes the LLM to the control plane, leaving fast, transparent wrappers in the data plane and letting the model monitor drift and auto-repair them at scale. The architecture rests on three pillars to be developed during the PhD. (i) URL discovery: an agnostic module that exploits temporal link-graph signals to surface high-value pages without manual seed tuning. (ii) Structural templating: a formal grammar-based clustering that groups pages into stable templates and defines reusable wrapper scopes. (iii) LLM control plane: agentic LLMs that both supervise the pipeline and repair wrappers when drift is detected. By fusing URL discovery, theory-grounded templating, and LLM-based wrapper induction, the system aims to transform hand-tuned heuristics into a self-healing, economically sustainable, fully autonomous web data extraction pipeline, orchestrated by a dedicated control plane. The full system will be field-tested in the domain of editorial news, an incremental, high-drift environment where layout changes and semantic diversity make robust extraction especially challenging. While initially developed within the domain of media intelligence, the architecture is designed for generalization to other structured web verticals.
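The split between a cheap data plane and a supervising control plane can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: wrappers are modeled as regex rules, drift is detected with a simple field fill-rate threshold, and the names `extract`, `fill_rate`, `ControlPlane`, and `stub_repair` are hypothetical. The `repair` callback stands in for the LLM call, which a real system would prompt with sample pages.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Assumption: a wrapper is a map from field name to a regex with one capture group.
Wrapper = Dict[str, str]
Record = Dict[str, Optional[str]]


def extract(wrapper: Wrapper, html: str) -> Record:
    """Data plane: apply each rule to the page; None marks a field that failed."""
    record: Record = {}
    for field, pattern in wrapper.items():
        match = re.search(pattern, html, flags=re.DOTALL)
        record[field] = match.group(1).strip() if match else None
    return record


def fill_rate(records: List[Record]) -> float:
    """Fraction of fields extracted (non-None) across a batch of records."""
    values = [v for r in records for v in r.values()]
    return sum(v is not None for v in values) / len(values) if values else 1.0


@dataclass
class ControlPlane:
    """Control plane: watches extraction quality and swaps in a repaired wrapper."""
    wrapper: Wrapper
    repair: Callable[[Wrapper, List[str]], Wrapper]  # hypothetical stand-in for an LLM
    min_fill_rate: float = 0.8  # illustrative drift threshold
    repairs: int = 0

    def process(self, pages: List[str]) -> List[Record]:
        records = [extract(self.wrapper, p) for p in pages]
        if fill_rate(records) < self.min_fill_rate:
            # Drift suspected: ask for a new wrapper, then re-run the batch once.
            self.wrapper = self.repair(self.wrapper, pages)
            self.repairs += 1
            records = [extract(self.wrapper, p) for p in pages]
        return records


def stub_repair(old: Wrapper, pages: List[str]) -> Wrapper:
    """Hard-coded replacement rule; a real system would derive it from the pages."""
    return {"title": r'<h2 class="headline">(.*?)</h2>'}


plane = ControlPlane(wrapper={"title": r"<h1>(.*?)</h1>"}, repair=stub_repair)
print(plane.process(["<h1>First story</h1>", "<h1>Second story</h1>"]))
print(plane.process(['<h2 class="headline">Third story</h2>']))  # layout changed
print("repairs:", plane.repairs)
```

In this sketch the model is invoked only when the fill rate drops below the threshold, so the per-page cost stays that of a regex match; this mirrors the economic argument of the abstract but makes no claim about how the actual system detects drift or induces wrappers.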


