簡易檢索 / 詳目顯示

研究生: 林振祺
Lin, Chen-chi
論文名稱: 利用網頁連結路徑改善自然語言問句的搜尋
Improve Natural Language Question Search Using Page Link Path
指導教授: 盧文祥
Lu, Wen-Hsiang
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 資訊工程學系
Department of Computer Science and Information Engineering
論文出版年: 2008
畢業學年度: 96
語文別: 中文
論文頁數: 64
中文關鍵詞: 搜尋自然語言
外文關鍵詞: Search, Natural Language
相關次數: 點閱:56下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 目前廣為大部分人所使用的搜尋引擎皆是以單一網頁的內容、標題為索引的單位,這樣的索引方式會忽略了單一網站中,網頁和網頁之間可能的順序及階層關係。且對於自然語言問句的處理上,傳統的搜尋引擎目前也沒有有效的方法回傳使用者真正想要的搜尋結果。針對以上兩個問題,首先,本論文利用一個有效樹狀結構索引的演算法GRIPP,將每一個網站視為一棵樹,將網站階層結構所對應的樹的所有節點及連結結構(link structure)索引,降低搜尋整個樹狀結構的時間複雜度。而根據我們的觀察,問句中可能隱含有entity-feature的結構,所以我們對於自然語言問句的處理方式,會先將原問句做簡單的前處理,再根據前處理後的結果去找出可能相關的網頁,然而這些相關的網頁之間的連結似乎可構成和問題相似的結構,所以本論文提出Page Link Set Table Construction Algorithm找出這些相關網頁的網頁連結集合(page link set),另外也提出Top-down Based Page Link Path Search Algorithm及Bottom-Up Based Page Link Path Search Algorithm兩種不同的方法建構出所有可能的候選路徑(candidate path),而這些候選路徑就是和自然語言問句匹配且可能含有問句答案的搜尋結果,最後本論文利用問句隱含語意結構匹配模型,引入問句中可能的entity-feature結構,去計算每條候選路徑為問句的相關網頁的機率,而機率值愈高的候選路徑極有可能是我們想要的搜尋結果。而根據我們所實驗的結果,首先,本論文利用網站的多層及跨層連結結構找出和自然語言問句匹配的搜尋結果正確率,會比以單一網頁的title及anchor text為索引單位的方法還來得高。另外,和其他同樣處理自然語言問句隱含結構的方法比較,本論文所提出的方法不僅在速度上也不錯的改進,在正確率及其他評估方法都比其他一起比較的方法還要高出許多。

    Traditional search engines take single page as the unit to index its titles and contents. However, these indexing methods may loss the order and hierarchical structure relationship between web pages. Traditional search engines also have no effective solutions to correctly return the result that user really want in natural language question search. To solve these two problems, we employ an effective algorithm GRIPP to reduce the time complexity in searching the whole website. We regard each website as a tree and indexing the link structure between nodes of this tree. According to our observations, user’s questions may contain implicit semantic entity-feature relations. Therefore, we pre-process original natural language question and find relevant web pages according to the pre-processing results. The link structures between these web pages seem are similar to the structure of original natural language question. In this paper, We proposed Page Link Set Table Construction Algorithm to find page link set of these related pages, and then exploited Top-Down Based Page Link Path Search Algorithm and Bottom-Up Based Page Link Path Search Algorithm to find these candidate path which map to the implicit structure of natural language question and involve the answers. Finally, we proposed implicit structure matching model (ISMM) to calculate the probability of relevant web pages, lead-in the possible entity-feature relationships, of each candidate path. The bigger probability value maybe more closely to the result we want. According to our experiments, we can see that our proposed method which find those web pages based on multi-level link structure of web site has higher accuracy than traditional search engines which use single-page indexing. Besides, comparing with other methods which handle the implicit structure of natural language question, our method significantly outperforms these methods much in accuracy and other evaluation metrics.

    摘要 I Abstract II 誌謝 IV 章節目錄 V 圖目錄 VIII 表目錄 IX 第一章 序論 - 1 - 1.1前言 - 1 - 1.2研究動機 - 1 - 1.3解決方法 - 5 - 1.4 論文架構 - 6 - 第二章 文獻回顧與相關研究 - 7 - 2.1 網站樹狀結構 - 7 - 2.1.1 樹狀結構索引 - 7 - 2.1.2 Implicit link - 8 - 2.1.3 樹狀結構的比對 - 9 - 2.2 網頁結構分析和排序 - 9 - 2.3 自然語言搜尋系統 - 10 - 2.4 學習網站結構以改進自然語言搜尋 - 11 - 第三章 研究方法 - 12 - 3.1問題和挑戰 - 13 - 3.1.1遭遇的問題 - 13 - 3.1.2問題解決及挑戰 - 13 - 3.2 系統架構 - 14 - 3.3 擷取網站連結結構(Extract Website Link Structure) - 15 - 3.4建構網站連結結構索引表(Construct Website Link Structure index table) - 17 - 3.4.1索引網站連結結構(Index Website Link Structure) - 17 - 3.4.2 尋找兩節點間的路徑 - 19 - 3.4.3 計算兩節點間的路徑距離 - 23 - 3.5 建構網頁連結(Construct Page Links) - 24 - 3.5.1 問句前處理 - 24 - 3.5.2 尋找相關的網頁節點 - 24 - 3.5.3 建構網頁連結集合 - 25 - 3.6 找出候選的網頁連結路徑(Find Candidate PLP) - 27 - 3.7 排序候選的網頁連結路徑(Rank candidate PLP) - 30 - 3.7.1 問句隱含語意結構匹配模型 - 30 - 3.8 Bottom-Up Based Page Linked Path Search - 38 - 3.8.1 確認問句中的sub-feature - 38 - 3.8.2 找出候選的bottom-up based PLP - 39 - 第四章 實驗和評估 - 42 - 4.1實驗資料之取得與分析 - 42 - 4.2 評估方法 - 43 - 4.3問句隱含語意結構匹配模型參數調校 - 43 - 4.3.1網頁節點相關性的參數調整 - 44 - 4.3.2連結邊相關性的參數調整 - 46 - 4.4 實驗方法 - 47 - 4.4.1 Google Site Search(GSS) - 48 - 4.4.2 Cosine Similarity with Single Page Indexing(SPI) - 48 - 4.4.3.Tri-link Mapping Model(TMM) - 48 - 4.4.4問句隱含語意結構匹配模型 - 49 - 4.5實驗結果和錯誤分析 - 49 - 4.5.1單一網頁索引方法效能比較 - 50 - 4.5.2單一網頁和多層或跨層網頁效能比較 - 50 - 4.5.3 多層或跨層網頁效率評估 - 58 - 第五章 結論與未來研究方向 - 60 - 5.1 結論 - 60 - 5.2 未來研究工作及方向 - 60 - 參考文獻 - 62 -

    Andrew Nierman and H. V. Jagadish, Evaluating Structural Similarity in XML Documents, WebDB,2002.

    Antonio Badia, Mehmed Kantardzic, Graph Building as a Mining Activity:
    Finding Links in the Small, LinkKDD’05.

    David Fernandes, Edleno S. de Moura, Berthier Ribeiro-Neto, Altigran S. da Silva, Marcos André Gonçalves, Computing Block Importance for Searching on Web Sites, CIKM’07,2007.

    Dengyong Zhou, Jason Weston, Arthur Gretton,Olivier Bousquet, and
    Bernhard Sch¨olkopf, Ranking on Data Manifolds, NIPS 2003, Vancouver, Canada(2004).

    Dragomir R. Radev, Hong Qil, Zhiping Zhengl, Sasha Blair-Goldensohrt, Zhu
    Zhangl, Weiguo Fan, John Prager, Mining the Web for Answers to Natural
    Language Questions, CIKM’01,2001.

    Dragomir R.Radev, Hong Qi, Harris Wu, Weiguo Fan, EvaluatingWeb-based Question Answering Systems.

    G.-R. Xue, Q. Yang, H.-J. Zeng, Y. Yu and Z. Chen, Exploiting the Hierarchical Structure for Link Analysis, SIGIR’05,2005.

    Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma, Hong-Jiang Zhang, and Chao-Jun Lu, Implicit Link Analysis for Small Web Search, SIGIR’03, 2003.

    Gunter Neumann and Feiyu Xu, Mining Answers in German Web Pages, WI’03,2003.

    Gu Xu, Wei-Ying Ma, Building implicit links from content for forum search, SIGIR’06,2006.

    Haixun Wang, Hao He, Jun Yang, Philip S. Yu, Jeffrey Xu Yu, Dual Labeling: Answering Graph Reachability Queries in Constant Time, ICDE’06,2006.

    Jennifer Chu-Carroll, John Prager, Yael Ravin and Christian Cesar, A Hybrid Approach to Natural Language Web Search, Language Processing (EMNLP), Philadelphia, July 2002, pp. 180-187.

    Jin Zhou, Chen Ding, Dimitrios Androutsos, Improving web site search using web server logs, Proceedings of the 2006 conference of the Center for Advanced Studies on Collaborative research, October 16-19, 2006, Toronto, Ontario, Canada.

    Jure Leskovec, Susan Dumais, Eric Horvitz, Web Projections: Learning from Contextual Subgraphs of the Web, WWW’2007,2007.

    Marc Najork, Hugo Zaragoza, Michael Taylor, HITS on the Web: How does it
    Compare? SIGIR ’07,2007.

    Marius Pa¸sca, Lightweight Web-Based Fact Repositories for Textual
    Question Answering, CIKM’07,2007.

    Matthew W. Bilotti, Paul Ogilvie, Jamie Callan and Eric Nyberg, Structured Retrieval for Question Answering, SIGIR’07,2007.

    Monique V. Vieira, Bruno M. Fonseca, Rodrigo Damazio, Paulo B. Golgher, Davi de Castro Reis, Berthier Ribeiro-Neto, Efficient Search Ranking in Social Networks, CIKM’07,2007.

    Norbert Martínez-Bazan, Victor Muntés-Mulero, Sergio Gómez-Villamor, Jordi Nin, Mario-A. Sánchez-Martínez, Josep-L. Larriba-Pey, DEX: High-Performance Exploration on Large Graphs for Information Retrieval, CIKM’07,2007.

    Proceedings of the Conference on Empirical Methods in Natural
    Guang Feng, Tie-Yan Liu, Ying Wang, Ying Bao, Zhiming Ma, Xu-Dong Zhang, and Wei-Ying Ma, AggregateRank: Bringing Order to Web Sites, SIGIR’06,2006.

    Satoshi Morinaga, Hiroki Arimura, Takahiro Ikeda, Key Semantics Extraction by Dependency Tree Mining, KDD’05, 2005.

    Shivani Agarwal, Ranking on Graph Data, Proceedings of the 23 rd International Conference on Machine Learning, Pittsburgh, PA, 2006.

    Silke Trißl, Ulf Leser, Fast and Practical Indexing and Querying of Very
    Large Graphs, SIGMOD’07,2007.

    Srini Narayanan, Sanda Harabagiu, Question Answering Based on Semantic
    Structures, Proceedings of the 20th international conference on
    Computational Linguistics.

    Xian Zhang, Yu Hao, Xiaoyan Zhu, Ming Li, Information Distance from a
    Question to an Answer, KDD’07,2007.

    下載圖示 校內:立即公開
    校外:2008-08-19公開
    QR CODE