| Graduate student: | 李岳洪 Li, Yueh-Hung |
|---|---|
| Thesis title: | 不平衡數據下機器學習之應用:以交通事故偵測為例 Application of Machine Learning under Imbalanced Data: An Example of Traffic Accident Detection |
| Advisor: | 胡大瀛 Hu, Ta-Yin |
| Degree: | Master |
| Department: | College of Management - Department of Transportation and Communication Management Science |
| Year of publication: | 2021 |
| Academic year of graduation: | 109 |
| Language: | English |
| Number of pages: | 100 |
| Keywords (Chinese): | 交通事故, 不平衡數據, 事故偵測 |
| Keywords (English): | Traffic Accidents, Imbalanced Data, Accident Detection |
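Traffic accidents usually cause non-recurrent congestion and additional unnecessary emissions. Detecting accidents effectively is the first step toward alleviating this problem, and correct accident prediction gives transportation managers reliable information for formulating appropriate strategies. Many kinds of equipment can be used to detect accidents; this study examines extracting key VD data and using it to detect accidents.
In practice, real datasets suffer from severe data imbalance: the numbers of accident and non-accident records differ greatly, so the imbalance must be handled. In this study, we develop a model that merges accident and VD data to automate data collection, and propose a procedure for training models on imbalanced data. In addition to the traditional over-sampling (SMOTE) and under-sampling (Cluster Centroid) methods, we also try generating accident data with a generative adversarial network (GAN). Finally, four different machine learning models are trained under the different data-processing methods, and cross-experiments and comparisons are carried out.
The results show that all models are affected by data imbalance and none performs well on unprocessed data. Models trained on under-sampled data have a higher detection rate but also a higher false alarm rate, while models trained with SMOTE or GAN data have a lower false alarm rate. According to the AUC metric, the machine learning models are better suited to data processed with SMOTE, and the deep learning models are better suited to training on GAN data.
Finally, the trained models were applied to a full-day monitoring task. All of them detected the accidents accurately, with a false alarm rate of roughly 15%, so they might serve as a quick screening mechanism. Improving the false alarm rate is a direction for future work. Another option is to connect to an API for real-time road information to perform real-time road monitoring and give traffic managers more useful travel information.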
Traffic accidents usually cause non-recurrent congestion and unnecessary additional emissions. Detecting accidents effectively is the first step toward mitigating this problem. Many kinds of devices can be used for accident detection; this study focuses on extracting key data from vehicle detectors (VD) and using it to detect accidents.
Real-world datasets are severely imbalanced: accident records are far fewer than non-accident records. This study develops a program that integrates accident records with VD data to automate data collection, and proposes a procedure for training models on imbalanced data. Besides the traditional over-sampling (SMOTE) and under-sampling (Cluster Centroid) methods, the study also generates accident data with Generative Adversarial Networks (GAN). Four machine learning models are trained on the differently processed datasets and compared in cross-experiments.
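As a rough illustration of the over-sampling idea, the sketch below generates synthetic minority-class records by SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. It is a minimal sketch, not the thesis's pipeline: the function name `smote_sketch`, the two features (speed and occupancy), and the sample values are all hypothetical, and in practice a library implementation would normally be used instead.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only; not the thesis's implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        # The new point lies on the segment between base and neighbour.
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Assumed toy accident records: (speed in km/h, occupancy in %).
accidents = [(35.0, 42.0), (28.0, 55.0), (40.0, 38.0), (31.0, 47.0)]
new_points = smote_sketch(accidents, n_new=6)
```

Because every synthetic point lies between two existing accident records, over-sampling of this kind fills in the minority region rather than duplicating records exactly.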
The results show that all models are affected by data imbalance and perform poorly on unprocessed data. Models trained on under-sampled data achieve a higher detection rate but also a higher false alarm rate, whereas models trained on SMOTE or GAN data have a lower false alarm rate. According to AUC, the machine learning models perform better with SMOTE-processed data, while the deep learning models perform better with GAN-generated data.
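The three metrics named above can be sketched as follows. The abstract does not spell out its definitions, so this assumes the usual confusion-matrix forms: detection rate = TP / (TP + FN) and false alarm rate = FP / (FP + TN). AUC is computed as the probability that a randomly chosen accident record scores higher than a randomly chosen non-accident record, with ties counted as one half. The function names, labels, and scores are hypothetical.

```python
def detection_and_false_alarm_rate(y_true, y_pred):
    """Return (detection rate, false alarm rate) from binary labels (1 = accident)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Assumed toy data: 2 accident records among 10, with model scores.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1, 0.05, 0.2, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold of 0.5 is an assumption

dr, far = detection_and_false_alarm_rate(y_true, y_pred)
area = auc(y_true, scores)
print(dr, far, area)  # 0.5 0.125 0.9375
```

Lowering the threshold in this toy example would raise the detection rate and the false alarm rate together, which is the trade-off the results describe; AUC summarises performance across all thresholds.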
Finally, the trained models were applied to a full-day monitoring task, where they detected the accidents accurately but with a false alarm rate of about 15%. They might therefore serve as a quick screening mechanism. Reducing the false alarm rate is left for future work. The models could also be connected to a real-time traffic information API to perform real-time road monitoring and provide traffic managers with more useful travel information.
On-campus access: available from 2026-07-27