Marineli, F. (2025). Large Language Models as Control Planes for Industrial-Scale Web Data Extraction. In VLDB Ph.D. Workshop.

Large Language Models as Control Planes for Industrial-Scale Web Data Extraction

Felipe Marineli

2025-01-01

Abstract

Web-scale information extraction faces a fundamental trade-off: rule-based wrappers are brittle and vulnerable to drift, while end-to-end LLM extraction is accurate but costly and opaque. We introduce a pipeline that promotes the LLM to the control plane, leaving fast, transparent wrappers in the data plane and letting the model monitor drift and auto-repair them at scale. The architecture rests on three pillars to be developed during the PhD. (i) URL discovery: an agnostic module that exploits temporal link-graph signals to surface high-value pages without manual seed tuning. (ii) Structural templating: a formal grammar-based clustering that groups pages into stable templates and defines reusable wrapper scopes. (iii) LLM control plane: agentic LLMs that both supervise the pipeline and repair wrappers when drift is detected. By fusing URL discovery, theory-grounded templating, and LLM-based wrapper induction, the system aims to transform hand-tuned heuristics into a self-healing, economically sustainable, fully autonomous web data extraction pipeline, orchestrated by a dedicated control plane. The full system will be field-tested in the domain of editorial news, an incremental, high-drift environment where layout changes and semantic diversity make robust extraction especially challenging. While initially developed within the domain of media intelligence, the architecture is designed for generalization to other structured web verticals.
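The split between data plane and control plane described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the regex-based `Wrapper`, the missing-field drift signal, the 0.2 threshold, and `stub_repair` (a hand-written stand-in for an LLM call) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


@dataclass
class Wrapper:
    """Data-plane extractor: one regex with a single capture group per field."""
    patterns: Dict[str, str]

    def extract(self, html: str) -> Record:
        record: Record = {}
        for field, pattern in self.patterns.items():
            match = re.search(pattern, html, re.DOTALL)
            record[field] = match.group(1).strip() if match else None
        return record


def missing_rate(records: List[Record]) -> float:
    """Fraction of field values that came back empty across a batch."""
    total = sum(len(r) for r in records)
    missing = sum(1 for r in records for v in r.values() if v is None)
    return missing / total if total else 0.0


# Assumed signature for the repair hook; a real system would call an LLM here.
RepairFn = Callable[[Wrapper, List[str]], Wrapper]


def control_plane_step(
    wrapper: Wrapper, pages: List[str], repair: RepairFn, threshold: float = 0.2
) -> Tuple[Wrapper, List[Record]]:
    """Run the wrapper; if too many fields are missing, repair it and re-run."""
    records = [wrapper.extract(p) for p in pages]
    if missing_rate(records) > threshold:
        wrapper = repair(wrapper, pages)
        records = [wrapper.extract(p) for p in pages]
    return wrapper, records


def stub_repair(wrapper: Wrapper, pages: List[str]) -> Wrapper:
    """Hypothetical stand-in for LLM repair: returns a hand-written new title pattern."""
    return Wrapper({
        "title": r'<h2 class="headline">(.*?)</h2>',
        "author": wrapper.patterns["author"],
    })


# Toy pages: the site moved its headline from <h1> to <h2 class="headline">.
old_pages = ['<h1>Rates rise</h1><span class="by">A. Rossi</span>']
new_pages = ['<h2 class="headline">Rates fall</h2><span class="by">B. Bianchi</span>']

original = Wrapper({"title": r"<h1>(.*?)</h1>", "author": r'<span class="by">(.*?)</span>'})
wrapper, records = control_plane_step(original, new_pages, stub_repair)
print(records)
```

In this sketch the expensive component is invoked only when the drift signal crosses the threshold, so pages that still match the existing wrapper are processed by the cheap regex path alone.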

Publication year: 2025

Citation: Marineli, F. (2025). Large Language Models as Control Planes for Industrial-Scale Web Data Extraction. In VLDB Ph.D. Workshop.

Appears in types: 4.1 Contribution in conference proceedings

Files in this item:

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/526896
