GLM-OCR Hands-On Tutorial: Feeding GLM-OCR Recognition Results into Elasticsearch to Build a Searchable Document Library

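Ever run into this? You have a pile of scanned PDFs and image documents, and finding anything in them means opening files one at a time and paging through them, which is painfully slow. Or your team's shared document library keeps growing, and tracking down an old set of meeting minutes feels like looking for a needle in a haystack.

This tutorial tackles that pain point. We will use GLM-OCR, a multimodal OCR model, to read the text out of images and PDFs, then feed it into Elasticsearch, a dedicated search engine, to build your own quickly searchable document library.

I'll walk you through the whole process step by step, from environment setup to code to real-world use, so you can get it working by following along. Ready? Let's get started.

1. Tutorial Goals and Value

Before getting hands-on, let's be clear about what you will take away from this tutorial.

You will learn to:

- Deploy and call the GLM-OCR service: get the model running and read the text out of images from code.
- Set up an Elasticsearch environment: run a local search engine to store and retrieve your documents.
- Build a data processing pipeline: design an automated flow where images go in and their text lands in the search engine.
- Implement document search: find anything in your document library by keyword, much like using a web search engine.

What this solution gives you:

- Higher efficiency: no more manual browsing; locate what you need in seconds.
- Knowledge retention: scattered images and scans become structured, searchable knowledge assets.
- Room to extend: with small changes, the same framework can serve contract management, archive digitization, knowledge base construction, and similar scenarios.

2. Environment Preparation and Quick Deployment

Good work starts with good tools, so let's get the tools ready first.

2.1 Start the GLM-OCR Service

The GLM-OCR model is already preinstalled; we only need to start it. It provides a convenient web interface and an API endpoint.

Open a terminal and run the following commands:

```bash
# 1. Enter the project directory
cd /root/GLM-OCR

# 2. Start the service
./start_vllm.sh
```

The first start is slower because the model has to be loaded into memory; expect roughly 1-2 minutes. Once the terminal prints something like "Running on local URL: http://0.0.0.0:7860", the service has started successfully.

To verify that the service works, open your browser and visit http://<your-server-IP>:7860. You should see an image upload page. Upload any image that contains text, select the "Text Recognition:" task, and click the start button to try it out.

2.2 Install Elasticsearch

Next we install Elasticsearch. Docker is the simplest way to do it.

```bash
# 1. Pull the Elasticsearch image (8.11.0 is used here as a fairly stable release)
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.11.0

# 2. Create and run the Elasticsearch container
#    (security is disabled here for local experimentation only)
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e discovery.type=single-node \
  -e xpack.security.enabled=false \
  --restart=always \
  docker.elastic.co/elasticsearch/elasticsearch:8.11.0
```

Wait ten seconds or so after it starts, then visit http://<your-server-IP>:9200 in your browser. If you see a JSON document containing fields such as "cluster_name", Elasticsearch is up and running as well.

2.3 Install the Python Dependencies

The core logic is written in Python and needs a few libraries.

```bash
# Use the project's preset conda environment.
#   elasticsearch : official Elasticsearch Python client
#   gradio-client : used to call the GLM-OCR API
#   Pillow        : image handling
#   PyPDF2        : optional, only needed if you work with PDFs
/opt/miniconda3/envs/py310/bin/pip install \
  elasticsearch==8.11.0 \
  gradio-client==0.6.1 \
  Pillow==10.1.0 \
  PyPDF2==3.0.1
```

That's the whole environment. Now for the core part: writing the code.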
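Before moving on, it can help to confirm from Python that both services actually answer on the ports used above. The following is a minimal sketch using only the standard library; the helper name `service_is_up` is made up for this example, and the URLs assume both services run on the local machine as described in sections 2.1 and 2.2.

```python
import urllib.error
import urllib.request


def service_is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET to `url` receives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with a non-2xx status; it is still reachable.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar problems.
        return False


if __name__ == "__main__":
    # Assumed local addresses, taken from the setup steps above.
    services = {
        "GLM-OCR": "http://localhost:7860",
        "Elasticsearch": "http://localhost:9200",
    }
    for name, url in services.items():
        status = "OK" if service_is_up(url) else "NOT reachable"
        print(f"{name:<14} {url:<24} {status}")
```

If either line reports "NOT reachable", it is probably worth revisiting the corresponding startup step before running the scripts in the next section.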
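3. Core Code Implementation: Building the Bridge

We will now write three core Python scripts. Think of them as three workers on an assembly line, each with its own job, who together turn images into searchable documents.

3.1 Step 1: Call GLM-OCR to Extract Text

First comes the "reader" script, whose job is to call the GLM-OCR service and extract the text from an image.

Create a file named ocr_extractor.py:

```python
#!/usr/bin/env python3
# ocr_extractor.py - GLM-OCR text extraction module

import os

import PyPDF2  # only needed if you process PDFs
from gradio_client import Client


class GLMOCRExtractor:
    """GLM-OCR text extractor."""

    def __init__(self, server_url="http://localhost:7860"):
        """
        Initialize the OCR extractor.

        Args:
            server_url: address of the GLM-OCR service
        """
        print(f"[INFO] Connecting to GLM-OCR service: {server_url}")
        self.client = Client(server_url)
        print("[INFO] Connected to GLM-OCR service")

    def extract_from_image(self, image_path, task_type="Text Recognition:"):
        """
        Extract text from a single image.

        Args:
            image_path: path to the image file
            task_type: task type; defaults to text recognition

        Returns:
            str: the recognized text (empty string on failure)
        """
        try:
            # Call the GLM-OCR API. The parameter names and api_name depend on
            # the deployed Gradio app; self.client.view_api() lists them. If
            # your gradio_client version rejects keyword arguments, pass the
            # two values positionally instead.
            result = self.client.predict(
                image_path=image_path,
                prompt=task_type,
                api_name="/predict",
            )

            # The result is usually a string or a dict containing the text.
            # Adjust this to the actual return format of the API.
            if isinstance(result, dict) and "text" in result:
                return result["text"]
            elif isinstance(result, str):
                return result
            else:
                # Fall back to a string conversion
                return str(result)

        except Exception as e:
            print(f"[ERROR] Image recognition failed {image_path}: {e}")
            return ""

    def extract_from_pdf(self, pdf_path, output_image_dir="temp_images"):
        """
        Extract text from a PDF (intended flow: render each page to an image, then OCR it).

        Args:
            pdf_path: path to the PDF file
            output_image_dir: directory for the temporary page images

        Returns:
            dict: mapping of page number -> recognized text
        """
        os.makedirs(output_image_dir, exist_ok=True)
        text_by_page = {}

        try:
            # Open the PDF file
            with open(pdf_path, "rb") as file:
                pdf_reader = PyPDF2.PdfReader(file)
                total_pages = len(pdf_reader.pages)

                print(f"[INFO] Processing PDF: {pdf_path}, {total_pages} pages")

                # Simplified for the demo: a real implementation has to render
                # each page to an image (for example with the pdf2image
                # library). Here the PDF's embedded text layer stands in.
                for page_num in range(total_pages):
                    page = pdf_reader.pages[page_num]
                    page_text = page.extract_text()

                    # The embedded text layer may be inaccurate or missing, so
                    # this only simulates OCR. In practice, render the page to
                    # an image and call extract_from_image on it:
                    #   text = self.extract_from_image(temp_image_path)
                    print(f"[INFO] Processing page {page_num + 1} (simulated OCR)")

                    # Use the extracted text layer as a stand-in for now
                    text_by_page[page_num + 1] = (
                        page_text or f"Recognized content of page {page_num + 1}"
                    )

        except Exception as e:
            print(f"[ERROR] PDF processing failed {pdf_path}: {e}")

        return text_by_page


# Usage example
if __name__ == "__main__":
    # 1. Create the extractor
    extractor = GLMOCRExtractor()

    # 2. Test single-image recognition
    test_image = "/path/to/your/test_image.png"  # replace with your test image path
    if os.path.exists(test_image):
        print(f"\n[TEST] Recognizing image: {test_image}")
        text = extractor.extract_from_image(test_image)
        print(f"Result:\n{text[:500]}...")  # print only the first 500 characters
    else:
        print(f"[WARN] Test image does not exist: {test_image}")
```

The heart of this script is the GLMOCRExtractor class, which wraps the details of calling the GLM-OCR API. You can use it on single images, or extend it to handle PDFs (each PDF page has to be converted to an image first).

3.2 Step 2: Store the Text in Elasticsearch

With the text extracted, we need an "archivist" to file it away in an organized manner and index it so it can be found later. That is Elasticsearch's job.

Create a second file named es_indexer.py:

```python
#!/usr/bin/env python3
# es_indexer.py - Elasticsearch document indexing module

import hashlib
from datetime import datetime

from elasticsearch import Elasticsearch


class DocumentIndexer:
    """Document indexer: stores text in Elasticsearch."""

    def __init__(self, es_host="localhost", es_port=9200):
        """
        Initialize the Elasticsearch indexer.

        Args:
            es_host: Elasticsearch host
            es_port: Elasticsearch port
        """
        self.es = Elasticsearch([f"http://{es_host}:{es_port}"])

        # Test the connection
        if self.es.ping():
            print("[INFO] Connected to Elasticsearch")
        else:
            print("[ERROR] Cannot connect to Elasticsearch; check that the service is running")
            raise ConnectionError("Elasticsearch connection failed")

    def create_index(self, index_name="document_library"):
        """
        Create the document index if it does not exist yet.

        Args:
            index_name: name of the index
        """
        # The index mapping is comparable to a table schema in a database.
        # Note: ik_max_word / ik_smart come from the IK analysis plugin
        # (elasticsearch-analysis-ik), which is not bundled with the stock
        # Elasticsearch image. Install it first, or replace both analyzers
        # with the built-in "standard" analyzer.
        index_mapping = {
            "mappings": {
                "properties": {
                    "doc_id": {"type": "keyword"},  # unique document ID
                    "title": {  # title, with Chinese word segmentation
                        "type": "text",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_smart",
                    },
                    "content": {  # content, with Chinese word segmentation
                        "type": "text",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_smart",
                    },
                    "source_path": {"type": "keyword"},  # source file path
                    "file_type": {"type": "keyword"},  # file type: image, pdf, ...
                    "page_num": {"type": "integer"},  # page number (for PDFs)
                    "ocr_confidence": {"type": "float"},  # OCR confidence, if available
                    "created_time": {"type": "date"},  # creation time
                    "updated_time": {"type": "date"},  # last update time
                    "tags": {"type": "keyword"},  # tags, used for categorization
                }
            },
            "settings": {
                "number_of_shards": 1,  # number of shards
                "number_of_replicas": 0,  # replicas; 0 for a single-node setup
            },
        }

        # Check whether the index exists
        if not self.es.indices.exists(index=index_name):
            self.es.indices.create(index=index_name, body=index_mapping)
            print(f"[INFO] Index {index_name} created")
        else:
            print(f"[INFO] Index {index_name} already exists")

        self.index_name = index_name
        return index_name

    def generate_doc_id(self, source_path, page_num=None):
        """
        Generate a unique document ID.

        Args:
            source_path: source file path
            page_num: page number (optional)

        Returns:
            str: document ID
        """
        base_str = f"{source_path}_{page_num}" if page_num else source_path
        return hashlib.md5(base_str.encode()).hexdigest()

    def index_document(self, title, content, source_path,
                       file_type="image", page_num=None,
                       tags=None, ocr_confidence=1.0):
        """
        Index a single document.

        Args:
            title: document title
            content: document content
            source_path: source file path
            file_type: file type
            page_num: page number
            tags: list of tags
            ocr_confidence: OCR confidence

        Returns:
            dict: result of the index operation (None on failure)
        """
        # Generate the document ID
        doc_id = self.generate_doc_id(source_path, page_num)

        # Prepare the document data
        now = datetime.now().isoformat()
        document = {
            "doc_id": doc_id,
            "title": title,
            "content": content,
            "source_path": source_path,
            "file_type": file_type,
            "page_num": page_num,
            "ocr_confidence": ocr_confidence,
            "created_time": now,
            "updated_time": now,
            "tags": tags or [],
        }

        # Index the document (store it in Elasticsearch)
        try:
            response = self.es.index(index=self.index_name, id=doc_id, document=document)
            print(f"[INFO] Document indexed: {title} (ID: {doc_id})")
            return response
        except Exception as e:
            print(f"[ERROR] Document indexing failed {title}: {e}")
            return None

    def search_documents(self, query, field="content", size=10):
        """
        Search documents.

        Args:
            query: search keywords
            field: field to search; defaults to the content
            size: number of results to return

        Returns:
            list: list of search results
        """
        search_body = {
            "query": {
                "match": {
                    field: query
                }
            },
            "highlight": {
                "fields": {
                    "content": {}  # highlight the matching content
                }
            },
            "size": size,
        }

        try:
            response = self.es.search(index=self.index_name, body=search_body)
            results = []

            for hit in response["hits"]["hits"]:
                source = hit["_source"]
                highlight = hit.get("highlight", {})

                result = {
                    "score": hit["_score"],  # relevance score
                    "title": source["title"],
                    "content_preview": source["content"][:200] + "...",  # content preview
                    "source_path": source["source_path"],
                    "page_num": source.get("page_num"),
                    "highlight": highlight.get("content", []),  # highlighted snippets
                }
                results.append(result)

            print(f"[INFO] Search '{query}' found {len(results)} results")
            return results

        except Exception as e:
            print(f"[ERROR] Search failed: {e}")
            return []


# Usage example
if __name__ == "__main__":
    # 1. Create the indexer
    indexer = DocumentIndexer()

    # 2. Create the index
    index_name = indexer.create_index("my_document_library")

    # 3. Index a test document (the sample text stays in Chinese so that
    #    the Chinese analyzers have something to segment)
    test_doc = indexer.index_document(
        title="Test document title",
        content="这是一段测试内容，包含一些关键词，比如人工智能、机器学习、深度学习等。",
        source_path="/fake/path/test.txt",
        file_type="text",
        tags=["test", "tech"],
    )

    # 4. Search test
    if test_doc:
        print("\n[TEST] Searching for keyword '人工智能'")
        results = indexer.search_documents("人工智能")
        for i, result in enumerate(results, 1):
            print(f"{i}. {result['title']} (score: {result['score']:.2f})")
            if result["highlight"]:
                snippets = "...".join(result["highlight"][:2])
                print(f"   Highlight: {snippets}...")
```

The DocumentIndexer class does several important things:

- It connects to Elasticsearch.
- It creates a dedicated "warehouse" (an index) for documents and defines which fields each document has (title, content, source, and so on). The title and content fields rely on the IK analyzers for Chinese word segmentation, so the IK plugin has to be installed in Elasticsearch for index creation to succeed.
- It provides the index_document method for storing a document.
- It provides the search_documents method for finding documents by keyword, with the matching snippets highlighted.

3.3 Step 3: Assemble the Complete Pipeline

We now have a "reader" and an "archivist", and we need a "dispatcher" to direct their work. That dispatcher is our main program.

Create a third file named main_pipeline.py:

```python
#!/usr/bin/env python3
# main_pipeline.py - main processing pipeline

import glob
import os
import time

from es_indexer import DocumentIndexer
from ocr_extractor import GLMOCRExtractor


class DocumentProcessingPipeline:
    """Document processing pipeline."""

    def __init__(self, ocr_server_url="http://localhost:7860",
                 es_host="localhost", es_port=9200):
        """
        Initialize the pipeline.

        Args:
            ocr_server_url: GLM-OCR service address
            es_host: Elasticsearch host
            es_port: Elasticsearch port
        """
        print("=" * 50)
        print("Initializing document processing pipeline...")
        print("=" * 50)

        # Initialize the OCR extractor
        self.ocr_extractor = GLMOCRExtractor(ocr_server_url)

        # Initialize the ES indexer
        self.indexer = DocumentIndexer(es_host, es_port)

        # Create the index
        self.index_name = self.indexer.create_index("smart_document_library")

        print("[INFO] Pipeline initialized\n")

    def process_single_image(self, image_path, title=None, tags=None):
        """
        Process a single image.

        Args:
            image_path: image path
            title: custom title (defaults to the file name)
            tags: list of tags

        Returns:
            bool: whether processing succeeded
        """
        if not os.path.exists(image_path):
            print(f"[ERROR] File does not exist: {image_path}")
            return False

        # Use the file name as the default title
        if title is None:
            title = os.path.basename(image_path)

        print(f"[PROCESS] Start processing image: {title}")

        # Step 1: extract the text with OCR
        start_time = time.time()
        content = self.ocr_extractor.extract_from_image(image_path)
        ocr_time = time.time() - start_time

        # Simple check for whether any meaningful content was extracted
        if not content or len(content.strip()) < 10:
            print(f"[WARN] Extracted content may be empty or too short: {image_path}")
            # You could skip or log the file here; we continue indexing

        print(f"[OCR done] Characters extracted: {len(content)}, time: {ocr_time:.2f}s")

        # Step 2: index into Elasticsearch
        doc_result = self.indexer.index_document(
            title=title,
            content=content,
            source_path=image_path,
            file_type="image",
            tags=tags,
            ocr_confidence=0.9,  # placeholder; replace with a real confidence value
        )

        if doc_result:
            print(f"[OK] Image indexed: {title}\n")
            return True
        else:
            print(f"[FAIL] Image indexing failed: {title}\n")
            return False

    def process_directory(self, directory_path, file_pattern="*.png",
                          recursive=False, default_tags=None):
        """
        Process all matching files in a directory.

        Args:
            directory_path: directory path
            file_pattern: file pattern, e.g. *.png or *.jpg
            recursive: whether to descend into subdirectories
            default_tags: default tags

        Returns:
            tuple: (success count, failure count)
        """
        if not os.path.isdir(directory_path):
            print(f"[ERROR] Directory does not exist: {directory_path}")
            return 0, 0

        print(f"[BATCH] Start processing directory: {directory_path}")
        print(f"File pattern: {file_pattern}, recursive: {recursive}")

        # Collect the file list
        if recursive:
            pattern = os.path.join(directory_path, "**", file_pattern)
            files = glob.glob(pattern, recursive=True)
        else:
            pattern = os.path.join(directory_path, file_pattern)
            files = glob.glob(pattern)

        print(f"Found {len(files)} files to process")

        success_count = 0
        fail_count = 0

        # Process the files one by one
        for i, file_path in enumerate(files, 1):
            print(f"\n[{i}/{len(files)}] ", end="")

            # Generate a simple title
            rel_path = os.path.relpath(file_path, directory_path)
            title = f"Document_{i:03d}: {rel_path}"

            # Process the file
            if self.process_single_image(file_path, title=title, tags=default_tags):
                success_count += 1
            else:
                fail_count += 1

        print(f"\n[BATCH done] Succeeded: {success_count}, failed: {fail_count}")
        return success_count, fail_count

    def interactive_search(self):
        """Interactive search mode."""
        print("\n" + "=" * 50)
        print("Entering interactive search mode (type 'quit' to exit)")
        print("=" * 50)

        while True:
            query = input("\nEnter search keywords: ").strip()

            if query.lower() in ["quit", "exit", "q"]:
                print("Leaving search mode")
                break

            if not query:
                print("Please enter valid keywords")
                continue

            # Run the search
            results = self.indexer.search_documents(query, size=5)

            if not results:
                print(f"No documents containing '{query}' were found")
                continue

            print(f"\nFound {len(results)} related documents:")
            for i, result in enumerate(results, 1):
                print(f"\n{i}. {result['title']} (score: {result['score']:.2f})")
                print(f"   Source: {result['source_path']}")
                if result.get("page_num"):
                    print(f"   Page: {result['page_num']}")

                # Show the highlighted content
                if result["highlight"]:
                    print("   Matching snippets:")
                    for snippet in result["highlight"][:2]:  # show at most 2 snippets
                        print(f"     - {snippet}")
                else:
                    print(f"   Preview: {result['content_preview']}")


# Main function usage example
if __name__ == "__main__":
    # Create the pipeline instance
    pipeline = DocumentProcessingPipeline()

    # Example 1: process a single image
    # Replace the path below with the path to your own image
    test_image = "/root/your_test_image.png"  # change this
    if os.path.exists(test_image):
        pipeline.process_single_image(
            test_image,
            title="My first test document",
            tags=["test", "example"],
        )
    else:
        print(f"[NOTE] Test image not found, skipping single-file test: {test_image}")

    # Example 2: batch-process all PNG images in a directory
    test_dir = "/root/your_docs_directory"  # change this
    if os.path.isdir(test_dir):
        success, fail = pipeline.process_directory(
            test_dir,
            file_pattern="*.png",
            recursive=False,
            default_tags=["batch import", "document"],
        )
    else:
        print(f"[NOTE] Test directory not found, skipping batch test: {test_dir}")

    # Example 3: enter interactive search to try out retrieval
    print("\nYou can now try searching the indexed documents.")
    pipeline.interactive_search()
```

The DocumentProcessingPipeline class ties the two previous modules together:

- process_single_image handles a single image: it calls OCR to extract the text, then stores it in Elasticsearch.
- process_directory batch-processes every file in a folder that matches the given pattern (PNG images by default).
- interactive_search offers a simple command-line interface where you type keywords and search the document library in real time.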
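Since process_directory takes a single glob pattern, a folder that mixes PNG and JPEG scans would need several calls. The sketch below shows one possible way to gather several image types in one pass and hand them to the pipeline. The helper names `collect_image_files` and `process_all_images` and the extension set are assumptions made for this example, not part of the scripts above; the only pipeline method relied on is process_single_image.

```python
from pathlib import Path

# Extensions treated as OCR-able images. This set is an assumption;
# adjust it to whatever your GLM-OCR deployment accepts.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def collect_image_files(directory, recursive=False, extensions=IMAGE_EXTENSIONS):
    """Return a sorted list of image paths found under `directory`.

    The extension check is case-insensitive, so SCAN.PNG and scan.png
    are both picked up.
    """
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        str(p) for p in candidates
        if p.is_file() and p.suffix.lower() in extensions
    )


def process_all_images(pipeline, directory, recursive=False, tags=None):
    """Run pipeline.process_single_image on every collected image.

    `pipeline` is expected to behave like DocumentProcessingPipeline from
    main_pipeline.py. Returns (success_count, fail_count).
    """
    success = fail = 0
    for path in collect_image_files(directory, recursive):
        if pipeline.process_single_image(path, tags=tags):
            success += 1
        else:
            fail += 1
    return success, fail
```

With a pipeline instance at hand, a call such as `process_all_images(pipeline, "/root/your_docs_directory", recursive=True)` would then cover every supported image type in one run.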
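4. Run and Test: See How It Works

The code is written, so let's run it and look at the results.

4.1 Run the Main Pipeline

First make sure that both the GLM-OCR service (http://localhost:7860) and the Elasticsearch service (http://localhost:9200) are running.

Then run the main program in a terminal:

```bash
# Enter the directory where you saved the code
cd /path/to/your/code

# Run the main pipeline
/opt/miniconda3/envs/py310/bin/python main_pipeline.py
```

The program starts by connecting to both services. Remember to edit the test paths at the end of main_pipeline.py beforehand so that they point to your real image or directory; otherwise the single-file and batch examples are skipped.

4.2 Visualize with Kibana (Optional, Advanced)

If you would like a polished web interface for browsing search results, you can install Kibana, the data visualization tool for Elasticsearch.

```bash
# Pull and run Kibana
docker run -d \
  --name kibana \
  -p 5601:5601 \
  -e ELASTICSEARCH_HOSTS=http://elasticsearch:9200 \
  --link elasticsearch:elasticsearch \
  --restart=always \
  kibana:8.11.0
```

Once it is running, visit http://<your-server-IP>:5601. In Kibana you can:

- Open the "Discover" page to query and browse your indexed documents directly.
- Open the "Dashboard" page to create charts, for example a count of documents per tag.
- Configure more sophisticated search interfaces.

4.3 Build a Simple Web Search Interface (Optional)

If you prefer not to use Kibana, you can quickly write a small Flask page that provides search (Flask is not in the dependency list from section 2.3, so install it first).

Create a new file named web_search.py:

```python
from flask import Flask, render_template, request, jsonify

from es_indexer import DocumentIndexer

app = Flask(__name__)
indexer = DocumentIndexer()
# create_index also sets indexer.index_name, which search_documents relies on;
# it leaves an existing index untouched.
indexer.create_index("smart_document_library")


@app.route("/")
def home():
    return render_template("search.html")  # requires a simple HTML template


@app.route("/api/search")
def search_api():
    query = request.args.get("q", "")
    if not query:
        return jsonify({"results": []})

    results = indexer.search_documents(query, size=20)
    return jsonify({"results": results})


if __name__ == "__main__":
    # debug=True is for local development only; turn it off before
    # exposing the app to a network.
    app.run(host="0.0.0.0", port=5000, debug=True)
```

Then create templates/search.html:

```html
<!DOCTYPE html>
<html>
<head>
    <title>My Smart Document Library</title>
    <style>
        body { font-family: Arial; max-width: 800px; margin: 40px auto; }
        .search-box { width: 100%; padding: 10px; font-size: 16px; }
        .result { border: 1px solid #ddd; padding: 15px; margin: 10px 0; }
        /* Elasticsearch wraps highlighted terms in <em> by default */
        .highlight, .result em { background-color: yellow; }
    </style>
</head>
<body>
    <h1>Smart Document Library Search</h1>
    <input type="text" id="searchInput" class="search-box" placeholder="Type keywords to search documents...">
    <div id="results"></div>

    <script>
        document.getElementById('searchInput').addEventListener('input', function(e) {
            const query = e.target.value;
            if (query.length < 2) {
                document.getElementById('results').innerHTML = '';
                return;
            }

            fetch(`/api/search?q=${encodeURIComponent(query)}`)
                .then(r => r.json())
                .then(data => {
                    let html = '';
                    data.results.forEach(r => {
                        html += `<div class="result">
                            <h3>${r.title}</h3>
                            <p><small>Source: ${r.source_path}</small></p>
                            <p>${r.highlight && r.highlight.length ? r.highlight.join('...<br>') : r.content_preview}</p>
                        </div>`;
                    });
                    document.getElementById('results').innerHTML = html || '<p>No matching documents found</p>';
                });
        });
    </script>
</body>
</html>
```

Run python web_search.py and visit http://localhost:5000, and you have your own document search site.

5. Summary and Outlook

By following this tutorial, we have built a complete system that goes from image recognition to full-text search. Let's review the key steps and look at where it could go next.

5.1 Key Takeaways

- Well-matched technology choices: GLM-OCR handles recognition of complex document images, Elasticsearch provides full-text search, and the two complement each other.
- A clear, practical architecture: the pipeline has three layers (extract, index, search), and the code is modular, so it is easy to understand and extend.
- Immediate results: text locked inside images and scans becomes a searchable digital asset, and finding a document drops from paging through files to a keyword query.

5.2 How to Optimize Further

What you have now is a system that works. To make it pleasant to use, or even powerful, consider the following directions:

- Improve OCR accuracy and speed: GLM-OCR is strong out of the box, but for special cases such as ancient books or handwriting you could explore fine-tuning the model or adding post-processing rules.
- Enrich document metadata: beyond title and content, automatically extract entities such as dates, person names, and organization names to serve as additional search dimensions.
- Add incremental updates and deduplication: watch a folder and process new files automatically; deduplicate different versions of the same document.
- Support more file formats: round out support for PDF, Word, Excel, and other formats to build a truly unified document library.
- Add permissions and security: for team use, add user authentication and access control for documents.

5.3 Take Action

The best way to learn is by doing. I strongly recommend that you:

- Get hands-on right away: take some of your own document images, run the whole flow once, and enjoy the feeling of turning chaos into order.
- Experiment with changes: adapt the code to your needs, for example by changing the indexed fields or adding automatic tagging.
- Share feedback: if you hit problems or have ideas for a better implementation, let's discuss them.

The purpose of technology is always to solve problems and create value. I hope this GLM-OCR plus Elasticsearch solution helps you manage your knowledge and work more efficiently.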
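As a possible starting point for the incremental-update and deduplication idea from section 5.2: generate_doc_id in es_indexer.py hashes the file path, so the same scan saved under two names is indexed twice. The sketch below keys on file content instead. The function names `file_sha256` and `select_new_files` are hypothetical, and how the set of seen hashes is persisted between runs (a JSON file, a keyword field in the index, and so on) is left open.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def select_new_files(paths, seen_hashes):
    """Return the paths whose content has not been seen before.

    `seen_hashes` is a set of hex digests that is updated in place, so
    passing the same set on a later run skips everything already handled,
    including byte-identical copies stored under a different name.
    """
    fresh = []
    for path in paths:
        content_hash = file_sha256(path)
        if content_hash in seen_hashes:
            continue  # duplicate content, or already processed earlier
        seen_hashes.add(content_hash)
        fresh.append(path)
    return fresh
```

A watcher or scheduled job could call `select_new_files` on the folder listing and pass only the returned paths to process_single_image. Note that this only catches byte-identical files; detecting different versions of the same document would need a comparison of the recognized text instead.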
