
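Graduate student: 李岳洪
Li, Yueh-Hung
Thesis title: 不平衡數據下機器學習之應用:以交通事故偵測為例
Application of Machine Learning under Imbalanced Data: An Example of Traffic Accident Detection
Advisor: 胡大瀛
Hu, Ta-Yin
Degree: Master
Department: College of Management - Department of Transportation and Communication Management Science
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar)
Language: English
Number of pages: 100
Keywords (Chinese): 交通事故 (traffic accidents), 不平衡數據 (imbalanced data), 事故偵測 (accident detection)
Keywords (English): Traffic Accidents, Imbalanced Data, Accident Detection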
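  • Traffic accidents usually cause non-recurrent congestion and unnecessary additional emissions. Detecting accidents effectively is the first step toward mitigating this problem, and accurate detection gives transportation managers reliable information for setting appropriate strategies. Many kinds of equipment can be used to detect accidents; this study focuses on extracting key data from vehicle detectors (VD) and using it to detect accidents.
    Real-world datasets suffer from severe class imbalance: accident records are far fewer than non-accident records, so the imbalance has to be handled. This study develops a procedure that merges accident records with VD data to automate data collection, and proposes a workflow for training models on imbalanced data. In addition to traditional over-sampling (SMOTE) and under-sampling (Cluster Centroid), it also uses a generative adversarial network (GAN) to generate accident data. Four machine learning models are then trained on the differently processed data, with cross-experiments and comparisons.
    The results show that every model is affected by data imbalance and performs poorly on unprocessed data. Models trained on under-sampled data have a higher detection rate but also a higher false alarm rate, while models trained on SMOTE or GAN data have a lower false alarm rate. According to AUC, the machine learning models are better suited to SMOTE-processed data, and the deep learning model is better suited to GAN-generated data.
    Finally, the trained models were applied to a full-day monitoring task. All of them detected the accidents accurately, with a false alarm rate of about 15%, so they might serve as a quick screening mechanism. Reducing the false alarm rate is a direction for future work. The models could also be connected to a real-time road information API for real-time road monitoring, giving traffic managers more useful travel information.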
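The SMOTE over-sampling named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the feature columns (speed, occupancy), the sample values, and the function name `smote_sketch` are all hypothetical. The code shows only the core idea of SMOTE, which is to place each synthetic minority sample on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by SMOTE-style interpolation.

    Each synthetic sample lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical accident records: [speed (km/h), occupancy (%)].
accidents = [[32.0, 41.0], [28.0, 47.0], [35.0, 38.0], [30.0, 44.0]]
new_points = smote_sketch(accidents, n_new=6)
print(len(new_points))
```

Each synthetic point lies between two observed accident records, so it stays inside the range of the observed values. In practice a library implementation, such as the `SMOTE` class in imbalanced-learn, would normally be used instead of a hand-written one.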

    Traffic accidents usually cause non-recurrent congestion and additional unnecessary emissions. Detecting traffic accidents is the first step toward mitigating this problem. Many devices are available for detecting accidents; this study focuses on capturing key data from vehicle detectors (VD) and using it to detect accidents.
    In practice, accident and non-accident records are far from equal in number. This study develops a program that integrates accident and VD data to automate data collection, and proposes a process for training models on imbalanced data. Besides the traditional over-sampling (SMOTE) and under-sampling (Cluster Centroid) methods, this study also generates accident data using Generative Adversarial Networks (GAN). Four machine learning models are trained on the differently processed data, and cross-experiments and comparisons are carried out.
    The results show that all models are affected by data imbalance. Models trained with under-sampled data have a higher detection rate but also a higher false alarm rate. Models trained with SMOTE and GAN data have a lower false alarm rate. According to AUC, the machine learning models perform better when trained with SMOTE data, and the deep learning model performs better when trained with GAN data.
    Finally, the trained models were applied to monitoring tasks, where they detected accidents accurately, although the false alarm rate was about 15%. They may therefore be useful as a quick screening mechanism. Reducing the false alarm rate is left for future work. In addition, the models could be connected to a traffic information API to perform real-time road monitoring and provide more useful travel information for traffic managers.
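The three indicators the abstract relies on (detection rate, false alarm rate, and AUC) can be computed as in the sketch below. The labels and scores are made-up values, not thesis results, and the definitions are assumptions: detection rate is taken as TP / (TP + FN) and false alarm rate as FP / (FP + TN), which may differ from the exact definitions used in the thesis. AUC is computed as the probability that a randomly chosen accident record receives a higher score than a randomly chosen non-accident record, with ties counted as one half.

```python
def detection_and_false_alarm_rate(y_true, y_pred):
    """Return (detection rate, false alarm rate) for binary labels (1 = accident)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced sample: 2 accident records among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1, 0.05, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

dr, far = detection_and_false_alarm_rate(y_true, y_pred)
area = auc(y_true, scores)
print(dr, far, area)  # 0.5 0.125 0.9375
```

The detection rate and false alarm rate depend on the chosen threshold, while AUC summarizes the ranking quality across all thresholds, which is why it is a convenient basis for comparing sampling methods.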

    ABSTRACT
    ABSTRACT (Chinese)
    TABLE OF CONTENTS
    LIST OF TABLES
    LIST OF FIGURES
    CHAPTER 1 INTRODUCTION
        1.1 Research Background and Motivation
        1.2 Research Objectives
        1.3 Research Flow Chart
    CHAPTER 2 LITERATURE REVIEW
        2.1 Accident Detection
        2.2 Description of Machine Learning
            2.2.1 Introduction to Commonly Used Models
            2.2.2 Application of Machine Learning in Traffic
        2.3 Processing of Imbalanced Data
            2.3.1 Under Sampling
            2.3.2 Over Sampling
            2.3.3 Application in Traffic
        2.4 The Development of GAN
            2.4.1 The Principle of GAN
            2.4.2 Application of GAN in Traffic
        2.5 Summary
    CHAPTER 3 RESEARCH METHODOLOGY
        3.1 Problem Statement
        3.2 Research Framework
        3.3 Imbalanced Data
            3.3.1 Under Sampling
            3.3.2 Over Sampling
            3.3.3 Generative Adversarial Network (GAN)
        3.4 Implementation of Machine Learning Models
            3.4.1 Support Vector Machine (SVM)
            3.4.2 Random Forest
            3.4.3 XGboost
            3.4.4 LSTM
        3.5 Grid Search with 10-Fold Cross Validation
        3.6 Evaluation Criteria of Building Models
            3.6.1 Confusion Matrix and Various Indicators
            3.6.2 Receiver Operating Characteristic (ROC) and Area under Curve (AUC)
        3.7 Summary
    CHAPTER 4 DATA DESCRIPTION AND PREPARATION
        4.1 Data Resource
        4.2 Data Preparation
            4.2.1 Data Conversion
            4.2.2 Accident Labeling
            4.2.3 Building of Dataset
        4.3 Dataset Description
    CHAPTER 5 EXPERIMENTS AND RESULTS
        5.1 Experiments Design
        5.2 Results of Sampling
            5.2.1 Cluster Centroid
            5.2.2 SMOTE
            5.2.3 GAN
            5.2.4 Comparison of various sampling methods
        5.3 Results of Accident Detection Models
            5.3.1 SVM
            5.3.2 Random Forest
            5.3.3 XGboost
            5.3.4 LSTM
        5.4 Discussion of Experiment
        5.5 Application of Real-time Traffic Accident Detection
            5.5.1 Real-time Data Describe
            5.5.2 Results of Detection and Discussion
        5.6 Summary
    CHAPTER 6 CONCLUSIONS AND SUGGESTIONS
        6.1 Conclusions
        6.2 Suggestions
    REFERENCE


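    On campus: public from 2026-07-27
    Off campus: public from 2026-07-27
    The electronic thesis has not yet been authorized for public release; for the print copy, check the library catalog.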