Real-Time Signal Processing in Optical Coherence Tomography (OCT) at Fraunhofer IPT

Scientific Area

Tomographic imaging, ophthalmology

Short description

Optical coherence tomography (OCT) systems are developed at the Fraunhofer Institute for Production Technology (IPT) for medical and industrial applications. Optical coherence tomography is an imaging technique that relies on a signal processing chain. This chain used in Fraunhofer’s OCT systems covers, e.g., the acquisition of data from a line scan camera, image-related adjustments, a Fourier transformation, and the reduction of signal noise. Since these OCT systems are required to provide high spatial and temporal resolution in production setups, e.g., video frame rate (25 frames/s), the parallelization and tuning of the signal processing chain is crucial.

To achieve video frame rate, we have developed a Compute Unified Device Architecture version (CUDA) of Fraunhofer’s OCT signal processing chain that targets consumer graphics processing unit as used in customers’ production environments. We reformulated numerous components of the signal processing chain as matrix transformations to enable the usage of well-tuned libraries such as cuBLAS (Basic Linear Algebra Subprograms) and cuFFT (Fast Fourier Transformation). Furthermore, we focused on reducing data transfer times by overlapping data and compute operations. To this end, we wrap independent parts of the image – so-called B-scans – in CUDA streams and process them asynchronously.

For the performance validation of our CUDA implementation, we establish a performance model that predicts kernel runtimes and data transfer runtimes in such an asynchronous setup. Our performance model relies on existing approaches and interprets and combines them for our real-world code.

Results

Starting point of our performance evaluations is the original serial code version using the Microsoft Visual Compiler (v14.0). Due to the new algorithmic matrix formulations, we also created parallel CPU versions of the code with little additional effort: With the Microsoft Compiler, we used OpenBLAS (MSVS_BLAS) and, with the Intel compiler (v17.0), we used Math Kernel Library (MKL). Our new CUDA implementation was evaluated on two NVIDIA Pascal consumer GPUs (with similar results). To show the benefit of our optimizations with respect to asynchronous and overlapping operations, we present performance results of CUDA_SYNC and CUDA_ASYNC code versions.

Bar chart with the runtime of the CUDA-Sync and CUDA-Async versions
Line diagram with the runtime of the different versions

Overall, our CUDA version (CUDA_ASYNC) achieves ~50 frames/s even for the largest tested data set and, hence, fulfills our requirement for video frame rate. This means a speedup of 8-16 over the serial MSVS version, and a speedup of 4-5 over the fastest CPU parallel version. Our corresponding GPU performance model is in line with our runtime measurements (deviation below 15 %) and predicts that triple-sized images can still be processed in video frame rate with our CUDA_ASYNC version.

Code Details

Lines of code: ~6000 LOC for the used test setup
Programming languages & models: C++, CUDA, OpenMP, BLAS Libraries

Development & Cooperation

Tobias Schrödter, RWTH Aachen University
David Pallasch, Fraunhofer IPT, Aachen
Sandra Wienke, Christian Terboven, IT Center/ Chair for High Performance Computing, RWTH Aachen University

Reference

Schrödter T., Pallasch D., Wienke S., Schmitt R., Müller M.S. (2019) Modeling and Optimizing Data Transfer in GPU-Accelerated Optical Coherence Tomography. In: Mencagli G. et al. (eds) Euro-Par 2018: Parallel Processing Workshops. Euro-Par 2018. Lecture Notes in Computer Science, vol 11339. Springer, Cham