| Graduate Student: | 吳亭頤 Wu, Ting-Yi |
|---|---|
| Thesis Title: | 智慧文件檢索:基於向量資料庫之導入實作研究 (Intelligent Document Retrieval: An Implementation Study Based on Vector Databases) |
| Advisors: | 侯廷偉 Hou, Ting-Wei; 鄧維光 Teng, Wei-Guang |
| Degree: | Master |
| Department: | College of Engineering, Department of Engineering Science |
| Year of Publication: | 2026 |
| Graduating Academic Year: | 114 |
| Language: | Chinese |
| Number of Pages: | 76 |
| Keywords (Chinese): | 智慧文件檢索、向量資料庫、跨語言檢索、多模態嵌入、暗數據、重排序機制 |
| Keywords (English): | Intelligent Document Retrieval, Vector Database, Cross-lingual Retrieval, Multimodal Embedding, Dark Data, Reranking Mechanism |
Amid the wave of digital transformation, traditional industries and software distributors face the difficulty of effectively exploiting large volumes of unstructured technical documents. These original equipment manuals and technical diagrams often turn into "dark data" inside the enterprise: constrained by language barriers and the rigidity of keyword search, retrieval is not only inefficient but also widens the skill gap between senior experts and newcomers.

To address these problems, this study designs and implements an "Intelligent Document Retrieval System" that integrates optical character recognition, layout analysis, and advanced vector embedding techniques. The system architecture adopts a vector-based two-stage semantic retrieval logic and builds a complete ETL data-processing pipeline. For model selection, two embedding models, the dual-encoder Jina-CLIP-v2 and the unified embedding architecture Jina-Embeddings-v4, are benchmarked in parallel, with the aim of providing timely, practical reference data for developers planning to adopt a modern retrieval system.

Experimental results show that the system reduces the average retrieval time for technical documents from 300 seconds of manual lookup to under 30 seconds, an efficiency gain of more than 90%. The data confirm that Jina-CLIP-v2 offers excellent inference speed, making it well suited to real-time applications, while Jina-Embeddings-v4 achieves higher accuracy on complex semantics and text-image alignment. In addition, senior and junior practitioners from the same domain were invited to compare their performance against the system; the results show that, with system assistance, novices can approach expert-level retrieval performance, narrowing the knowledge gap.
In the era of digital transformation, traditional industries struggle to effectively utilize massive unstructured technical documents, such as original equipment manuals and technical diagrams. These valuable assets often devolve into "Dark Data," where cross-lingual barriers and the limitations of rigid keyword matching inevitably lead to poor retrieval efficiency and a widening skill gap between experts and novices.
This study designs and implements an "Intelligent Document Retrieval System" by integrating Optical Character Recognition (OCR), layout analysis, and advanced vector embedding technologies. The system architecture employs a vector-based two-stage semantic retrieval logic—comprising retrieval and reranking—supported by a robust ETL pipeline. By systematically benchmarking state-of-the-art models, specifically the Dual-Encoder Jina-CLIP-v2 and the Unified Embedding Jina-Embeddings-v4, this research provides timely and practical implementation insights for developers deploying modern retrieval solutions.
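The retrieve-then-rerank flow described above can be sketched in a few lines of Python. This is a minimal toy illustration, not the thesis's implementation: the `embed` function below is a hypothetical bag-of-words stand-in for the stage-1 bi-encoder (Jina-CLIP-v2 in the thesis), and `rerank` uses simple token overlap in place of a real reranking model.

```python
from collections import defaultdict

import numpy as np

# Toy token -> stable index map (stand-in for a learned embedding model).
_vocab = defaultdict(lambda: len(_vocab))


def embed(texts, dim=64):
    """Hypothetical bag-of-words embedding, L2-normalized (toy stage 1)."""
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, _vocab[tok] % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)


def retrieve(query, docs, top_k=3):
    """Stage 1: fast recall by cosine similarity over the whole corpus."""
    q = embed([query])[0]
    d = embed(docs)
    scores = d @ q
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


def rerank(query, docs, candidates):
    """Stage 2: re-score only the shortlist with a finer (here: toy
    Jaccard-overlap) scorer; a real system would use a reranker model."""
    q_toks = set(query.lower().split())
    rescored = []
    for i, _ in candidates:
        d_toks = set(docs[i].lower().split())
        overlap = len(q_toks & d_toks) / max(len(q_toks | d_toks), 1)
        rescored.append((i, overlap))
    return sorted(rescored, key=lambda x: -x[1])


docs = [
    "pump maintenance manual torque settings",
    "network switch configuration guide",
    "pump impeller replacement procedure",
]
shortlist = retrieve("pump torque maintenance", docs, top_k=2)
ranked = rerank("pump torque maintenance", docs, shortlist)
print(ranked[0][0])  # index of the best-matching document
```

The design point the sketch illustrates is the division of labor: the cheap stage-1 scorer touches every document to build a small shortlist, and the more expensive stage-2 scorer runs only on that shortlist, which is what keeps latency low as the corpus grows.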
Experimental results demonstrate that the system reduces average retrieval time for technical documents from 300 seconds (manual search) to under 30 seconds, improving efficiency by over 90%. Benchmarking with both senior and junior practitioners in the same domain shows that, with system assistance, novices can approach expert-level retrieval performance, narrowing the knowledge gap. Jina-CLIP-v2 excels in real-time inference speed, while Jina-Embeddings-v4 achieves higher accuracy in complex semantics and text–image alignment.
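The headline efficiency figure follows directly from the reported timings; as a worked line (not an additional result):

$$
\text{efficiency gain} = \frac{300\,\mathrm{s} - 30\,\mathrm{s}}{300\,\mathrm{s}} = 0.90,
$$

and since the assisted time is reported as strictly under 30 seconds, the gain exceeds 90%.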