
Author: Tsai, Yi-Chin (蔡宜瑾)
Title: Pose and Operation Estimation Using 2D+3D Deep Fusion Network with RGB-D Camera for Minimally Invasive Surgery
Advisor: Sun, Yung-Nien (孫永年)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 108
Language: English
Number of Pages: 71
Keywords (Chinese): 6D Pose Estimation, Point Cloud, Deep Learning, Trigger Finger, Image-Guided Minimally Invasive Surgery
Keywords (English): 6D Pose Estimation, Point Cloud, Deep Learning, Trigger Finger, Image-guided Training System
Abstract (Chinese):
In recent years, minimally invasive surgery has become the new mainstream of surgical practice worldwide and is widely applied in many kinds of operations. It replaces the single large incision of conventional surgery with several small incisions of about 0.5 to 1.5 cm, and therefore offers many advantages, including smaller wounds, less post-operative pain, a lower risk of bleeding and infection, shorter recovery time, and less scarring, making it a better choice for patients. Unlike general open surgery, in which the affected area can be observed directly, in minimally invasive surgery the surgeon can only operate by watching the endoscopic view on a screen, and the instruments are handled very differently from those of conventional surgery, so the technique is considerably harder to learn. Surgeons therefore usually rely on animal experiments or minimally invasive surgery simulation training systems for planning and practice.
Trigger finger is one of the most common finger disorders and is mostly caused by overuse of the hand. Inflammation thickens and narrows the tendon sheath in the palm, so the tendon can no longer glide freely inside the sheath and the finger cannot move freely. Minimally invasive trigger finger surgery is guided by ultrasound imaging throughout the procedure, so a trigger finger surgical training system must integrate a hand phantom with pre-operative ultrasound images to provide realistic image feedback during operation, allowing surgeons to carry out complete pre-operative planning and practice so that the actual operation goes more smoothly. One of the most important parts of such a training system is precise localization of the surgical instruments, which is required to simulate the real surgical process.
To meet these requirements, this thesis proposes an object 6D pose estimation method based on a 2D+3D deep fusion neural network, applied to surgical instrument detection and tracking in a minimally invasive surgery training system that provides realistic ultrasound probe image feedback. The system first captures a color image and a depth map of the scene with an RGB-D camera and feeds them into the 2D+3D deep fusion network to recognize and track all surgical instruments in the scene together with their 6D poses.
Our 6D pose estimation method adds an image segmentation branch; this auxiliary task helps the network learn appearance features, improves performance, and predicts the object regions in advance. Combining the attention-based segmentation with the MLP architecture lets the network learn to focus on important features and suppress unnecessary ones. We fuse a residual structure into PointNet++ so that the geometric features are easier to learn and converge. We also add a self-supervised confidence branch that extracts a representative point set from the input point cloud, which improves the keypoint clustering results and reduces computation time.
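For illustration, one common way to turn per-point keypoint votes into keypoint locations is MeanShift clustering of the voted positions, keeping the densest cluster for each keypoint. The sketch below is a generic example under the assumption that the network outputs, for every sampled scene point, an offset to each keypoint; the function name, bandwidth value, and data layout are illustrative and not the thesis implementation.

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_keypoint_votes(seed_xyz, pred_offsets, bandwidth=0.02):
    """Cluster per-point keypoint votes and return one center per keypoint.

    seed_xyz     : (N, 3) coordinates of the sampled scene points
    pred_offsets : (N, K, 3) predicted offsets from each point to the K keypoints
    Returns      : (K, 3) estimated keypoint locations (center of the densest cluster)
    """
    n_pts, n_kps, _ = pred_offsets.shape
    votes = seed_xyz[:, None, :] + pred_offsets          # (N, K, 3) voted positions
    centers = np.zeros((n_kps, 3))
    for k in range(n_kps):
        ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
        ms.fit(votes[:, k, :])
        # keep the center of the largest cluster as the keypoint estimate
        labels, counts = np.unique(ms.labels_, return_counts=True)
        centers[k] = ms.cluster_centers_[labels[np.argmax(counts)]]
    return centers
```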
Compared with the baseline, our method reduces the average error by about 51.3% on the render dataset and by about 43.6% on the real dataset, and the ablation experiments show that each of our improvements contributes to better object pose estimation. For system integration, we register and combine the hand phantom with the pre-operative ultrasound images, and use the accurate 6D pose estimates to recognize and track the ultrasound probe and surgical instruments, so that the system presents ultrasound images and feedback consistent with the real situation and helps surgeons train for the operation.

Abstract (English):
In recent years, minimally invasive surgery has become increasingly prevalent across many kinds of surgical procedures. It is widely replacing conventional open operations because, through a few tiny incisions, it offers smaller wounds, a lower chance of infection, and a shorter recovery period. Unlike general open surgery, where the affected area is observed directly, the surgeon can only operate with reference to the endoscopic view displayed on a screen, and the instruments used in minimally invasive surgery differ considerably from those of conventional procedures. To conquer this steep learning curve, surgeons usually practice on animals or with a simulation training system.
Trigger finger is one of the most common hand diseases and mostly results from excessive hand movement. It occurs when inflammation narrows the space within the tendon sheath and makes it difficult for the tendon to glide through. Minimally invasive trigger finger surgery requires ultrasound imaging to guide the entire surgical process. The training system must therefore integrate the hand phantom and pre-operative ultrasound images to provide realistic operating image feedback, so that surgeons can complete pre-operative planning. One of the most important requirements is the precise positioning of the surgical instruments needed to simulate the actual surgical process.
To satisfy these requirements, we propose a pose estimation method based on a 2D+3D deep fusion network with an RGB-D camera to detect and track the instruments used in a minimally invasive surgery training system that provides realistic operating image feedback. The system first feeds the color image and depth map of the scene, captured by the RGB-D camera, into the 2D+3D deep fusion network to identify and track all surgical instruments and their 6D poses in the scene.
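For context, the point cloud consumed by the 3D branch is typically obtained by back-projecting the depth map with the pinhole model of the RGB-D camera. The sketch below is a minimal example of this standard step, assuming calibrated intrinsics (fx, fy, cx, cy) and a depth scale that converts raw sensor units to meters; it is illustrative, not the thesis code.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project a depth map (H, W) in raw sensor units into an (N, 3) point cloud.

    fx, fy, cx, cy : pinhole intrinsics of the RGB-D camera (assumed calibrated)
    depth_scale    : factor converting raw depth units to meters (e.g. 0.001 for mm)
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))    # pixel coordinates
    z = depth.astype(np.float32) * depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                   # drop pixels with no depth reading
```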
We introduce an image segmentation branch into our 6D pose estimation method, which helps the network learn appearance features and improves performance. Integrating the segmentation and the multi-layer perceptron with an attention module lets the network focus on significant features and suppress unnecessary ones. We further fuse residual connections into PointNet++ so that the geometric features are easier to learn and converge. An additional self-supervised confidence branch integrated into the network extracts a representative point set from the input point cloud, which enhances the keypoint clustering result and reduces execution time.
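As a rough illustration of what fusing a residual structure into the shared MLPs of PointNet++ can look like, the following PyTorch sketch adds an identity shortcut around two 1x1 convolution layers operating on per-point features. The block name, layer sizes, and placement are assumptions made for illustration, not the network described in the thesis.

```python
import torch
import torch.nn as nn

class ResidualSharedMLP(nn.Module):
    """Shared MLP (1x1 convolutions) with a residual shortcut over per-point features."""

    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (B, C, N) per-point features
        return self.act(self.block(x) + x)       # identity shortcut eases optimization
```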
Compared with the baseline, the proposed method reduces the average error by about 51.3% on the render dataset and by about 43.6% on the real dataset. The ablation experiments also show that each proposed component contributes to performance and effectively improves the final accuracy of the 6D pose estimation. We apply our method to a vision-based surgical training system that includes surgical instrument detection, tracking, and pose estimation, pre-operative ultrasound 3D volume reconstruction, ultrasound image reconstruction of the hand phantom, and the surgical training scenario. By tracking the 6D poses of the surgical instruments, we can easily combine the real and virtual systems, allowing doctors to interact with the hand phantom while viewing pre-operative images that correspond to the real situation.
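For reference, once the keypoints estimated in the camera frame are matched with the corresponding keypoints defined on the instrument model, the 6D pose can be recovered in closed form with an SVD-based least-squares rigid fit. The sketch below shows this standard step in NumPy; it is a generic implementation, not the thesis code.

```python
import numpy as np

def fit_rigid_transform(model_kps, scene_kps):
    """Least-squares rigid transform (R, t) mapping model keypoints onto scene keypoints.

    model_kps, scene_kps : (K, 3) corresponded 3D keypoints
    Returns R (3, 3) and t (3,) such that scene ≈ R @ model + t.
    """
    mu_m = model_kps.mean(axis=0)
    mu_s = scene_kps.mean(axis=0)
    H = (model_kps - mu_m).T @ (scene_kps - mu_s)     # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                          # correct a possible reflection
        Vt[2, :] *= -1
        R = Vt.T @ U.T
    t = mu_s - R @ mu_m
    return R, t
```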

Table of Contents:
    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    CHAPTER 1 INTRODUCTION
        1.1 Background and Motivation
        1.2 Related Works
            1.2.1 Trigger finger release surgery
            1.2.2 Surgical training system
            1.2.3 Deep learning
            1.2.4 6D pose estimation method
        1.3 Contributions
    CHAPTER 2 EXPERIMENT MATERIALS
        2.1 RGB-D Camera and Point Cloud
        2.2 Pre-operative Data Acquisition
    CHAPTER 3 PROPOSED METHOD
        3.1 System Overview
        3.2 Color Feature Extraction for RGB Image
        3.3 Geometry Feature Extraction for Point Cloud
        3.4 Sampling Index
        3.5 Fusion Embeddings
        3.6 Instance Semantic Segmentation
        3.7 3D Keypoint Hough Voting
        3.8 Self-supervised Confidence
        3.9 6D Pose Estimation
    CHAPTER 4 DATA COLLECTION, LABEL, AND AUGMENTATION
        4.1 Surgical Instrument Model Design
        4.2 6D Pose Labeling
        4.3 Model Keypoints Selection
        4.4 Render Dataset
        4.5 Real Dataset
        4.6 Data Augmentation
    CHAPTER 5 EXPERIMENTAL RESULT AND DISCUSSION
        5.1 Evaluation Metrics
        5.2 Training Detail
        5.3 Ablation Experiment
        5.4 Effect of Using ResMLP in PointNet++
        5.5 Marker Design Comparison for Probe
        5.6 Plug Model Design Comparison
        5.7 Result of Full Training Process for Real Dataset
        5.8 Evaluation of Stability and Precision of Tracking System
        5.9 Performance Comparison with Previous Tracking System
        5.10 Demonstration of Surgical Training System
    CHAPTER 6 CONCLUSION AND FUTURE WORK
        6.1 Conclusion
        6.2 Future Work
    REFERENCES


Full-text availability:
    On campus: open access from 2025-08-31
    Off campus: not publicly available
    The electronic full text has not yet been authorized for public release; please consult the library catalog for the print copy.