National Cheng Kung University Electronic Theses and Dissertations System

Graduate student:	張恒瑞 Zhang, Heng-rui
Thesis title:	有效利用文件相似度之剽竊偵測方法 Exploiting Document Similarities for Plagiarism Detection
Advisor:	鄧維光 Teng, Wei-Guang
Degree:	Master
Department:	College of Engineering - Department of Engineering Science
Year of publication:	2007
Academic year of graduation:	95 (ROC calendar)
Language:	English
Number of pages:	45
Keywords (Chinese):	局部修改、文件相似度、剽竊偵測
Keywords (English):	plagiarism detection, document similarity, partial revision

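With the advance of information and networking technologies, users can easily obtain the information they need on the web, which makes sharing information and learning ever more convenient. However, if people do not respect the creativity and intellectual property of others, the problem of plagiarism becomes increasingly serious. In this thesis, we focus on extending document similarity comparison to plagiarism detection. Two issues receive particular attention. The first is to use a proper technique to segment a suspicious document into smaller pieces so that multiple suspected sources can be detected. The second is that, because a plagiarist may slightly modify the collected articles and reassemble them into a new plagiarized document, we propose a method that detects partial modifications of sentences; we also propose a method that effectively reduces redundant computation when comparing document similarities. Empirical studies show that our method correctly and efficiently identifies plagiarized documents and their plagiarists.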

As information and networking technologies advance, people can easily find what they need on the web, which facilitates learning and sharing. However, plagiarism also becomes more serious when people disregard the creativity and intellectual property of others. An effective way to reduce the impact of plagiarism lies in detection techniques. In this work, we focus on extending techniques for identifying document similarities to plagiarism detection. Specifically, two crucial issues are addressed in this thesis. The first is devising a proper technique to segment a suspicious document into smaller pieces so that subsequent steps can identify possibly multiple sources. The second is that, since a plagiarist may slightly revise the copied content when compiling it into the plagiarized document, a technique to identify partial changes in a text segment is needed. Moreover, our approach is carefully designed to reduce redundant computation when comparing document similarities. Empirical studies show that our approach identifies plagiarized documents, and thus the users who produced them, accurately and efficiently.
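As a rough illustration of how partial changes in a text segment might be measured, the following is a minimal sketch of the standard Levenshtein edit distance computed with a dynamic-programming table. It is not the thesis's implementation: the function names `edit_distance` and `similarity`, the unit cost for every edit operation, and the normalization by the longer string's length are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (unit-cost edits, assumed)."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the number of edits needed to turn a[:i] into b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalized score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest
```

Under these assumptions, a lightly revised sentence keeps a high `similarity` score against its source, so a detector could flag sentence pairs whose score exceeds a chosen threshold.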

Chapter 1  Introduction
1  Motivation and Overview of the Thesis
2  Contributions of the Thesis
Chapter 2  Literature Survey
1  Techniques for String Matching
1.1  Exact String Matching
1.2  Inexact String Matching
2  Approaches for Plagiarism Detection
2.1  Stylometry Approaches
2.2  Term-based Approaches
2.3  Sentence-based Approaches
Chapter 3  Finding Similarities among Documents
1  Identifying Similar Sentences from a Suspicious Document
1.1  Possible Editing Operations for Plagiarism
1.2  Edit Distance
2  Using Dynamic Programming for Calculating the Edit Distance
2.1  Recurrence Relation
2.2  Tabular Computation
2.3  Traceback
3  Similarity Queries
4  Proposed Approach
4.1  System Flows of Our Segment-based Approach
4.2  E-index to Filter Out Irrelevant Sentences
4.3  Sampling Techniques for Large Documents
Chapter 4  Empirical Studies
1  Testing Datasets
2  Finding Similar Sentences among News Articles
2.1  Impacts of the Similarity Threshold
2.2  Impacts of the Filtering Strategy
2.3  Extensive Discussions of the E-index
2.4  Performance of Filtering Method
2.5  Experimental Results
3  Experiments on a Real Dataset
Chapter 5  Conclusions and Future Works
Bibliography


Made public: 2009-07-27
