Author: Hsu, Chung-Hsiang (徐崇翔)
Thesis title: Optimization of E-commerce Search Engines Using Named Entity Recognition and Full-text Search (應用命名實體識別與全文搜索技術之電商搜索引擎最佳化)
Advisor: Wang, Hung-Kai (王宏鍇)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Manufacturing Information and Systems
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 93
Keywords (Chinese): 電子商務, 搜尋引擎最佳化, 命名實體識別, 全文搜索, Elasticsearch
Keywords (English): E-commerce, Search Engine Optimization, Named Entity Recognition, Full-Text Search, Elasticsearch
  • With the rapid growth of the internet, e-commerce has flourished, and site search has become an indispensable feature of any successful e-commerce business. As leading companies such as Shopee, PChome, and Amazon keep raising the bar for search technology, users' expectations of e-commerce site search have risen with them. Improving search so that it returns more accurate and more intelligent results has therefore become an important challenge for e-commerce companies.
    When users cannot find the products they want, their shopping experience suffers badly, which can lead to user attrition and lost business. This research therefore applies Named Entity Recognition (NER) to user queries to understand user intent, so that the products returned match what users need.
    The objective of this study is to build a search engine that combines full-text search with NER, using a product dataset from the Flipkart website. Full-text search is implemented with Elasticsearch. For NER, three models are compared: BiLSTM-CRF, BERT-CRF, and RoBERTa-CRF. RoBERTa-CRF performs best, so it is used both to generate a new dataset for indexing and to extract entities from user queries; the extracted entities are then used to build the index. Before indexing, the three-stage NER process designed in this study preprocesses user queries, which makes the indexing results more accurate and in turn improves search precision and user experience. The results show that the proposed method produces better search results than full-text search alone.
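    The idea of combining entity extraction with full-text search can be illustrated with a minimal sketch. It is not the thesis's implementation: the dictionary lookup below is a hypothetical stand-in for the RoBERTa-CRF tagger, and the entity labels (`brand`, `color`, `category`), the field name `product_name`, the boost value, and both function names are assumptions made only for illustration. The sketch builds an Elasticsearch Query DSL body in which the raw query must match the product name, while matches on entity-specific fields raise the score.

```python
from typing import Dict, List

# Hypothetical gazetteer standing in for a trained NER model (assumption).
TOY_GAZETTEER: Dict[str, str] = {
    "nike": "brand",
    "adidas": "brand",
    "red": "color",
    "black": "color",
    "shoes": "category",
}


def extract_entities(query: str) -> Dict[str, List[str]]:
    """Tag query tokens by dictionary lookup and group them by entity label."""
    entities: Dict[str, List[str]] = {}
    for token in query.lower().split():
        label = TOY_GAZETTEER.get(token)
        if label is not None:
            entities.setdefault(label, []).append(token)
    return entities


def build_search_body(query: str) -> dict:
    """Combine a full-text match on the raw query with boosted entity-field matches."""
    entities = extract_entities(query)
    # One optional clause per entity label; the field names are assumed to
    # mirror the entity labels in the index mapping.
    should = [
        {"match": {field: {"query": " ".join(values), "boost": 2.0}}}
        for field, values in entities.items()
    ]
    return {
        "query": {
            "bool": {
                # Full-text search on the raw query (hypothetical field name).
                "must": [{"match": {"product_name": query}}],
                # Entity matches only raise the relevance score.
                "should": should,
            }
        }
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_search_body("Nike red shoes"), indent=2))
```

    The resulting dictionary follows the shape of an Elasticsearch `bool` query and could be sent as the body of a search request. Placing the entity clauses under `should` rather than `must` is a design choice of this sketch: a tagging mistake then lowers ranking quality without filtering out results entirely.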

    Chinese Abstract
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Nomenclature
    Chapter 1. Introduction
        1.1 Research Background, Motivation and Importance
        1.2 Research Purposes
        1.3 Thesis Architecture
    Chapter 2. Literature Review
        2.1 Search Engine Optimization
        2.2 Characteristics of Search Queries
        2.3 Query Segmentation
        2.4 Full-Text Search using Elasticsearch
        2.5 Natural Language Processing
            2.5.1 Named Entity Recognition
            2.5.2 Attention and Transformer
            2.5.3 RoBERTa
            2.5.4 CRF
    Chapter 3. Research Method
        3.1 Research Framework
        3.2 Descriptions of the Dataset
            3.2.1 NER Data Annotation
            3.2.2 Annotation Tool
        3.3 Annotation Data Preprocessing
            3.3.1 Tokenization using SpaCy
            3.3.2 Sequence Labeling
            3.3.3 Method to Split the Annotated Data
        3.4 NER Experimental Model
        3.5 Data Processing and Column Expansion
        3.6 Evaluation for NER models
        3.7 Search system architecture
            3.7.1 Search Engine using Elasticsearch
            3.7.2 Search Frontend and Backend Application
        3.8 Ecommerce NER methodology
            3.8.1 Three types of Search Queries
            3.8.2 Evaluation for Relevance scores Based on BM25
            3.8.3 Evaluating the Quality of Search Result Rankings using NDCG
    Chapter 4. Experimental Results and Discussion
        4.1 Execution Environmental
        4.2 Methods and Techniques Experimental Procedures
        4.3 Experiment 1: Hyperparameter Setting on NER tasks
            4.3.1 Description of the Data for NER tasks
            4.3.2 Hyperparameter Setting
            4.3.3 Results of two NER tasks
        4.4 Experiment 2: NER Classification and Acquire Golden Data
            4.4.1 Performance of the Different Search Query Type
        4.5 Experiment 3: Query Search Results
            4.5.1 Three stages NER process
            4.5.2 Comparison of Search Results
            4.5.3 Search Engine Performance Test
    Chapter 5. Conclusion
        5.1 Conclusion
        5.2 Future Research
    References

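    Full-text availability: on campus, public from 2029-07-30; off campus, public from 2029-07-30.
    The electronic thesis has not yet been authorized for public release; for the print copy, consult the library catalog.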