| Graduate student: | 張博淵 Chang, Bor-Yuan |
|---|---|
| Thesis title: | 應用非對稱性分類分析改進少數類別的分類正確率-以通聯紀錄為例 Use of Skewed Classification Analysis to Improve the Accuracy Ratio for Minority Classification: Exemplified by Call Detail Record |
| Advisor: | 焦惠津 Jiau, Hewijin Christine |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication: | 2008 |
| Academic year of graduation: | 96 |
| Language: | English |
| Number of pages: | 86 |
| Keywords (Chinese): | 邏輯斯迴歸、類神經網路、資料探勘、異常偵測、決策樹 |
| Keywords (English): | Fraud Detection, Data Mining, Logistic Regression, Decision Tree, Neural Network |
Data mining is widely applied across domains: wherever a data warehouse or database holds data worth analyzing, mining tools can be used for goal-directed analysis. A call detail record (CDR) is a record of the calls made between telecom subscribers. In criminal investigations, CDR analysis can help assess a suspect's social relationships, daily routine, areas of activity, and likelihood of involvement in a case. However, investigators need long practice before they master the relevant analysis techniques. This thesis therefore combines the rules of thumb accumulated by investigators with data mining techniques to build a set of anomalous (fraud) call analysis models. The models are intended to quickly pinpoint, within large volumes of complex call data, the key phone numbers used by a small number of important persons, and then to derive high-value anomalous call patterns from the records of those key numbers. Cross-analysis against these patterns can later be used to find the phone numbers of the few important targets quickly, to support investigators' case assessment, and to suggest which records to request. This would reduce CDR retrieval fees, avoid wasting public funds, and help speed up investigations.
To test the feasibility of the proposed approach, we use real call data from closed cases to construct a composite model and evaluate its prediction accuracy and stability. Many classification tools are available for data mining, each with its own strengths and weaknesses. After examining the characteristics of the data, we chose four: C5.0, CART, a back-propagation neural network, and logistic regression. Each tool was combined with random sampling at different proportions to build single classification models, and the models with the better predictive capability were selected to build the composite model, with the aim of raising prediction accuracy. After repeated testing and evaluation, the thesis proposes integrating the C5.0 decision tree and the back-propagation neural network into a composite model, which effectively improves the accuracy of predicting the few key phone numbers.
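The sampling-and-combination idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the sampling proportions used or how the outputs of the two selected models are merged, so the ratios (1, 3, 5), the "flag if either model flags" rule, and all function names below are assumptions made for illustration only, not the thesis's actual procedure.

```python
import random

def undersample(rows, labels, majority_per_minority, seed=0):
    """Keep every minority row (label 1) and a random subset of majority rows (label 0).

    majority_per_minority is a hypothetical parameter: how many majority rows
    to keep for each minority row.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(majority), majority_per_minority * len(minority))
    kept = minority + rng.sample(majority, n_keep)
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]

def combine_or(pred_a, pred_b):
    """Assumed combination rule: flag a number (1) if either single model flags it."""
    return [1 if a == 1 or b == 1 else 0 for a, b in zip(pred_a, pred_b)]

def minority_recall(y_true, y_pred):
    """Share of true minority cases that were flagged."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    total = sum(1 for t in y_true if t == 1)
    return hits / total if total else 0.0

# Toy data (invented): 5 key numbers among 100 phone numbers.
labels = [1] * 5 + [0] * 95
rows = [{"id": i} for i in range(100)]

for ratio in (1, 3, 5):
    sample_rows, sample_labels = undersample(rows, labels, ratio)
    print(ratio, len(sample_rows), sum(sample_labels))
```

Each sampled training set would then be passed to a single classifier, and the predictions of the two best models could be merged with `combine_or` and scored with `minority_recall`. An OR rule favors catching minority cases at the cost of more false alarms; a different rule may suit other goals.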