Building a Complete Layout-Aware Document Parsing Pipeline with Docling Parse
Summary
This tutorial walks through a complete parsing pipeline that uses Docling Parse to extract text cells (words, characters, and lines) with page-level coordinates from a multi-element test PDF. It covers environment setup, generation of a PDF containing columns, tables, vector shapes, and an embedded image, and export of structured JSON and CSV outputs. The workflow also reconstructs a layout-aware reading order from word coordinates, renders cell overlays for inspection, and benchmarks threaded parallel parsing. The resulting pipeline can support document AI tasks such as layout analysis, table extraction, and data preparation for retrieval-augmented generation (RAG).
Key Takeaways
Sets up a Python environment in Colab with Docling Parse, Pillow, ReportLab, Pandas, and Matplotlib, including recovery logic for Pillow version conflicts.
Generates a custom two-page PDF in code, containing text, a two-column layout, a table-like structure, vector shapes, and an embedded bitmap image, for controlled testing.
Uses DoclingPdfParser to extract word, character, and line cells with bounding-box coordinates page by page, and renders overlay images for visual verification.
Exports the parsed results to JSON and CSV, flattens the records into a DataFrame, and reconstructs layout-aware text lines by grouping words on their y-coordinates and sorting each group by x-position.
Tests threaded parallel parsing with DoclingThreadedPdfParser, records a performance benchmark and saves the results, and checks whether the command-line tool is available.
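The reading-order reconstruction mentioned in the takeaways (group words by y-coordinate, then sort each group by x-position) can be sketched in plain Python. This is a minimal illustration, not the tutorial's own code: the word records are hypothetical dictionaries with `text`, `x`, and `y` keys rather than Docling Parse objects, the `reconstruct_lines` helper and its `y_tol` tolerance are assumptions made for this example, and a top-left coordinate origin (y increasing downward) is assumed.

```python
from typing import Dict, List, Union

# Hypothetical flattened word record: {"text": str, "x": float, "y": float}.
Word = Dict[str, Union[str, float]]


def reconstruct_lines(words: List[Word], y_tol: float = 3.0) -> List[str]:
    """Group words into text lines by y, then order each line by x.

    Assumes a top-left origin, so a smaller y means higher on the page.
    """
    # Visit words top to bottom, breaking ties left to right.
    ordered = sorted(words, key=lambda w: (w["y"], w["x"]))

    lines: List[List[Word]] = []
    for word in ordered:
        # A word joins the current line when its y is within y_tol
        # of the first word that opened that line.
        if lines and abs(word["y"] - lines[-1][0]["y"]) <= y_tol:
            lines[-1].append(word)
        else:
            lines.append([word])

    # Within each line, restore left-to-right order and join the text.
    return [
        " ".join(str(w["text"]) for w in sorted(line, key=lambda w: w["x"]))
        for line in lines
    ]


sample_words: List[Word] = [
    {"text": "world", "x": 60.0, "y": 100.5},
    {"text": "Hello", "x": 10.0, "y": 100.0},
    {"text": "line", "x": 50.0, "y": 120.0},
    {"text": "Second", "x": 10.0, "y": 119.2},
]
print(reconstruct_lines(sample_words))
```

On a two-column page, grouping by y alone would interleave words from both columns on the same line, so in practice the words would likely need to be split by column (for example, on an x threshold) before this step.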