| Graduate student: | 凃欣宇 Tu, Shin-Yu |
|---|---|
| Thesis title: | 利用機器學習開發空氣污染排放設施未妥善控制而排放之預警模型 Developing prediction models for factories’ illegal air pollution emissions using the machine learning techniques |
| Advisor: | 陳必晟 Chen, Pi-Cheng |
| Degree: | Master |
| Department: | College of Engineering - Department of Environmental Engineering |
| Year of publication: | 2023 |
| Academic year of graduation: | 111 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 177 |
| Keywords (Chinese): | 稽查 (inspection), 環境犯罪 (environmental crime), 空氣污染 (air pollution), 機器學習 (machine learning), PU 學習 (PU learning) |
| Keywords (English): | inspection, air pollution, machine learning, PU learning |
When inspecting illegal air pollution emissions, improving inspection efficiency and the rate at which complaints lead to citations is a key challenge for inspection agencies. Machine learning can uncover hidden relationships in data and produce predictions on which control strategies can be based, and it has become a direction of development for environmental law enforcement in recent years.
This study therefore builds an early-warning model for the smart inspection of illegal air pollution emissions. The model predicts, on a daily basis, the emission of air pollutants from facilities whose pollution control equipment is not properly operated, so that enforcement agencies can receive advance warning and allocate inspection staff accordingly.
The study covers two groups of regulated stationary sources: factories that hold air pollution permits and factories equipped with a continuous emission monitoring system (CEMS). The predicted labels are penalty records under the Air Pollution Control Act, taken from the Environmental Inspection and Punishment Control System. The training features combine raw material and waste declarations, meteorological data, air quality data, IoT-based microsensor data, CEMS monitoring data, geographic information, and economic production volumes. The study also analyzes how different environmental factors are associated with the days on which factories commit violations. For model training, it compares feature selection methods and machine learning approaches, including supervised learning, anomaly detection, and Positive-Unlabeled (PU) learning, examines their effect on each dataset, evaluates model performance with the F1-score and precision, and identifies the best-fitting machine learning workflow.
The correlation analysis indicates that on days when a factory violation occurs, air quality in the vicinity of the factory is generally poorer, and that factories with more past violations are more likely to violate again.
The prediction results show that, regardless of the dataset, the best-performing pipeline applies no feature selection, relabels the data with PU learning, reduces dimensionality with PCA, balances the classes with SMOTE, and then trains a Multilayer Perceptron (MLP) classifier. The highest F1-score achieved is 0.508.
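The abstract does not give implementation details, so the following is only a minimal pure-Python sketch of two steps it names: SMOTE class balancing and evaluation with precision and the F1-score. The function names `smote` and `precision_recall_f1`, the neighbor count `k`, and all toy feature values and labels are assumptions made for illustration; they are not the author's code or data.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic minority samples (SMOTE-style interpolation).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbors of the base sample, excluding itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical two-feature samples for the rare "violation day" class.
violation_days = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(violation_days, n_new=6)

# Hypothetical true labels and predictions (1 = violation day).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(len(new_points), round(precision, 3), round(f1, 3))
```

In a pipeline like the one described, oversampling would normally be applied to the training split only, so that the F1-score is measured on real, unsynthesized samples.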