
Graduate Student: 林易叡 (Lin, Yi-Jui)
Thesis Title: 一個基於隨機傅立葉特徵演算法的可擴展核主成份分析方法
A Scalable Kernel Principal Component Analysis based on Random Fourier Features
Advisor: 謝錫堃 (Shieh, Ce-Kuen)
Co-Advisor: 張志標 (Chang, Jyh-Biau)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of Publication: 2022
Academic Year of Graduation: 110 (2021-2022)
Language: English
Number of Pages: 30
Chinese Keywords: 大數據分析、核主成份分析、隨機傅立葉特徵、特徵工程、機器學習
English Keywords: Big data analysis, KPCA, Random Fourier features, Feature engineering, Machine learning, Apache Spark
Chinese Abstract: With the rapid development of technology, the amount of data that can be collected and stored has grown exponentially, and big data analysis has become increasingly important. When analyzing massive data, dimensionality reduction is an indispensable step, especially when the number of features is very large, as in image processing: by removing redundant features, the cost of subsequent classification can be greatly reduced, the curse of dimensionality can be avoided, and classifier performance can be improved. Kernel principal component analysis (KPCA) is a dimensionality reduction algorithm that extracts the nonlinear principal components of a dataset, and its dimensionality-reducing transformation outperforms the popular principal component analysis (PCA). However, limited by the scalability problem of kernel methods, KPCA cannot be applied to the analysis of massive data. This work proposes the Scalable Kernel Principal Component Analysis (SKPCA) algorithm, which approximates KPCA via random Fourier features by mapping the dataset into a finite-dimensional reproducing kernel Hilbert space, thereby avoiding the direct computation and storage of the kernel matrix, whose size is quadratic in the number of data points. The space complexity of the algorithm is reduced from O(n^2) to O(n), and the time complexity from O(n^3) to O(n), yielding a scalable KPCA. In addition, this work provides a distributed version of SKPCA, so that the algorithm can process massive data in parallel across multiple nodes. Experimental results show that the dimensionality reduction quality of SKPCA is very close to that of KPCA, while its time and space costs are both significantly lower, successfully extending KPCA to big data.
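Both abstracts rely on the standard random Fourier feature approximation of a shift-invariant kernel (Rahimi and Recht). The LaTeX block below restates that identity as a reminder; the Gaussian/RBF kernel is used as an illustrative assumption and is not necessarily the kernel chosen in the thesis.

    % Bochner's theorem: a shift-invariant kernel is the Fourier transform of a probability measure p(omega).
    % Illustrative assumption: for the RBF kernel k(x,y) = exp(-||x-y||^2 / (2 sigma^2)), p(omega) = N(0, sigma^{-2} I).
    \[
      k(x,y) \;=\; \mathbb{E}_{\omega \sim p}\!\big[\cos\!\big(\omega^\top (x-y)\big)\big]
      \;\approx\; z(x)^\top z(y),
      \qquad
      z(x) \;=\; \sqrt{\tfrac{2}{D}}\,
      \big[\cos(\omega_1^\top x + b_1), \ldots, \cos(\omega_D^\top x + b_D)\big]^\top,
    \]
    \[
      \omega_j \sim p(\omega), \qquad b_j \sim \mathrm{Unif}[0, 2\pi].
    \]
    % The n-by-n kernel matrix K is thus approximated by Z Z^T with Z in R^{n x D},
    % so the eigen-analysis needed by KPCA can be carried out on the D-by-D matrix Z^T Z instead of on K.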

English Abstract: With the rapid development of technology, the amount of collected and generated data has been growing exponentially, and the importance of big data analysis has risen accordingly. When analyzing large-scale data, dimensionality reduction is an indispensable step, especially when the number of features is very large, as in image processing. By removing redundant features, the cost of classification is greatly reduced, the curse of dimensionality is avoided, and the performance of the classifier may be improved. Kernel principal component analysis (KPCA) is a dimensionality reduction algorithm that extracts nonlinear principal components from the dataset; its effect is better than that of principal component analysis (PCA), a popular unsupervised algorithm. However, due to the scalability problem of kernel methods, KPCA cannot be applied to big data analysis. In this paper, we propose the Scalable Kernel Principal Component Analysis (SKPCA) algorithm based on random Fourier features, which approximates KPCA by mapping the data into a finite-dimensional reproducing kernel Hilbert space (RKHS), thus avoiding the computation and storage of the kernel matrix, whose size is quadratic in the number of samples. The space complexity is reduced from O(n^2) to O(n), and the time complexity from O(n^3) to O(n), thereby achieving scalable KPCA. We also provide a distributed version of SKPCA, so the algorithm can process data in parallel across multiple nodes. The results show that SKPCA performs very close to KPCA in feature extraction, with a much lower time and space cost.
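As a concrete illustration of the pipeline described in the abstract, the following is a minimal single-machine sketch of RFF-approximated KPCA in Python/NumPy. It assumes an RBF kernel, and the function names rff_map and skpca_fit are hypothetical; this is not the thesis implementation, only a sketch of why the cost is linear in n: all eigen-decomposition happens on a D x D matrix rather than an n x n kernel matrix.

    # Hedged sketch, not the thesis code: rff_map / skpca_fit are illustrative names.
    import numpy as np

    def rff_map(X, D=500, sigma=1.0, seed=0):
        """Random Fourier features approximating an RBF kernel (assumed kernel choice)."""
        rng = np.random.default_rng(seed)
        n_features = X.shape[1]
        W = rng.normal(0.0, 1.0 / sigma, size=(n_features, D))   # omega_j ~ N(0, sigma^-2 I)
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)                 # b_j ~ Unif[0, 2*pi]
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)                # z(x), shape (n, D)

    def skpca_fit(X, k=2, D=500, sigma=1.0):
        """Approximate KPCA: PCA in the D-dimensional random feature space.
        Memory and time are linear in n because only a D x D covariance is decomposed."""
        Z = rff_map(X, D=D, sigma=sigma)
        Zc = Z - Z.mean(axis=0)                                    # centering in feature space
        cov = (Zc.T @ Zc) / Zc.shape[0]                            # D x D, independent of n
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1][:k]                      # top-k principal directions
        return Zc @ eigvecs[:, order]                              # projected data, shape (n, k)

    # Usage: project 10,000 random 64-dimensional points onto 2 approximate kernel PCs.
    X = np.random.rand(10000, 64)
    X_proj = skpca_fit(X, k=2, D=500, sigma=1.0)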

Table of Contents:
Chapter 1: Introduction 1
Chapter 2: Background & Related Works 3
  2.1 Background 3
    2.1.1 Kernel Principal Component Analysis 3
    2.1.2 Random Fourier Features 4
    2.1.3 Related Works 5
Chapter 3: Methodology 7
  3.1 Method Overview 7
  3.2 Approximate Centered Kernel Matrix & Its Eigenpairs 8
  3.3 Training Process of SKPCA 9
  3.4 Unseen Data Process of SKPCA 9
Chapter 4: Implementation 12
  4.1 Overview 12
  4.2 Single-Machine Implementation of SKPCA 12
  4.3 A Distributed Implementation of SKPCA using Apache Spark 14
Chapter 5: Experiments 16
  5.1 Approximation Error 16
  5.2 Visualization 17
  5.3 Reconstruction 19
  5.4 Performance Comparison 23
    5.4.1 Runtime Performance 23
    5.4.2 Classification Accuracy of SKPCA+SVM 24
  5.5 Runtime of Big Data Analysis 25
Chapter 6: Conclusion 27
Chapter 7: References 28
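The abstract and Chapter 4.3 above refer to a distributed implementation of SKPCA on Apache Spark. The sketch below is only a hedged guess at how the same idea could be expressed with Spark's RDD-based MLlib API (RowMatrix and computePrincipalComponents); it is an assumption about the approach, not the thesis code, and the helper name distributed_skpca is hypothetical.

    # Hedged sketch of a distributed SKPCA pipeline (PySpark, RDD-based MLlib API).
    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    def distributed_skpca(rdd, D=500, sigma=1.0, k=2, seed=0):
        """Map each row to random Fourier features, then let Spark compute PCA
        on the D-dimensional feature matrix. (MLlib centers via the covariance
        internally; row centering before projection is omitted in this sketch.)"""
        rng = np.random.default_rng(seed)
        d = len(rdd.first())
        W = rng.normal(0.0, 1.0 / sigma, size=(d, D))        # shared RFF projection
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        Wb, bb = rdd.context.broadcast(W), rdd.context.broadcast(b)

        def rff(x):                                            # per-row feature map z(x)
            z = np.sqrt(2.0 / D) * np.cos(np.asarray(x) @ Wb.value + bb.value)
            return Vectors.dense(z)

        features = RowMatrix(rdd.map(rff))
        pcs = features.computePrincipalComponents(k)           # D x k principal components
        return features.multiply(pcs)                          # projected rows, n x k

    # Usage: project 1,000 random 64-dimensional rows onto 2 approximate kernel PCs.
    spark = SparkSession.builder.appName("skpca-sketch").getOrCreate()
    data = spark.sparkContext.parallelize(np.random.rand(1000, 64).tolist())
    projected = distributed_skpca(data, D=200, k=2)
    print(projected.rows.take(3))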


Full-text access (on campus): public from 2025-09-30
Full-text access (off campus): public from 2025-09-30