Swift Real-Time Image Processing for White Paper on White Desk

This is a report that GPT generated for me..

Somehow I doubt the code will actually work.

Detecting the boundaries of a white paper on a white desk in real time is a challenging computer vision task. The low contrast between paper and desk can “wreak havoc” on standard algorithms (Fast and Accurate Document Detection for Scanning | Hacker News), so careful choice of libraries and algorithms is crucial. We will explore suitable Swift-compatible libraries, effective background segmentation to isolate the paper, and edge detection methods optimized for mobile performance. Finally, we provide a Swift code example illustrating a recommended solution.

1. Swift-Compatible Libraries for Real-Time Vision

Several libraries and frameworks can handle real-time image processing in Swift:

  • OpenCV (Open Source Computer Vision Library): A popular, highly optimized CV library for real-time applications (GitHub - mikeroyal/CoreML-Guide: Core ML Guide). OpenCV offers many algorithms (e.g. Canny, thresholding, contour detection) and supports iOS via a framework. Pros: Extensive functionality and proven algorithms for edge detection and image segmentation. Cons: Written in C++, so you must bridge to Swift (using an Objective-C++ wrapper or a pre-built framework). Integration overhead aside, OpenCV is a powerful choice for implementing custom image processing pipelines.
  • Core ML: Apple’s machine learning framework for on-device models (GitHub - mikeroyal/CoreML-Guide: Core ML Guide). Core ML allows you to run pre-trained models (e.g. a U-Net segmentation model or a document detector) efficiently on the device’s CPU/GPU/Neural Engine. Pros: Leverages hardware acceleration for ML tasks and integrates easily with Swift. Cons: Requires a trained model; using Core ML means formulating the problem as a learning task (e.g. training a model to segment paper vs. desk). If a suitable model is available, Core ML can achieve real-time performance and robustness. Otherwise, using classical methods might be simpler.
  • Metal (and Accelerate/CoreImage): Metal is Apple’s low-level GPU programming API, excellent for custom real-time filters and computations. You can write a compute shader to perform operations like thresholding or convolution in parallel on the GPU, achieving very high frame rates (even 120 FPS as some have attempted) by offloading work from the CPU. Pros: Maximum performance, fine-grained control. Additionally, Apple’s Core Image framework provides GPU-accelerated image processing routines (e.g. blurs, color transforms), and Accelerate (vImage) provides SIMD-optimized routines on the CPU, both without custom shader coding. Cons: Requires expertise in graphics programming (or using existing image filters). For example, instead of writing a custom Canny in Metal, you might integrate OpenCV (simpler) (Swift and OpenCV: a teaser - Gary Bartos - Medium). Metal is recommended if OpenCV on CPU is too slow, but often OpenCV or the Vision framework suffices for mobile scanning apps.
  • Vision Framework (Apple’s Vision): Apple’s Vision framework provides high-level CV algorithms, including a rectangle detection API. VNDetectRectanglesRequest can “locate regions of an image with rectangular shape, like... documents” (VNDetectRectanglesRequest | Apple Developer Documentation). This is essentially the problem at hand: detecting a sheet of paper (a rectangle). Vision is optimized by Apple for its own hardware (its internals are not documented, so treat any claim about the exact technique as a guess), it can find document corners in real time, and it’s easy to use in Swift. Pros: Easiest implementation (just configure the request and read the results), optimized by Apple for real-time performance. Cons: Less control over the algorithm, and it may also return other rectangular objects in the scene. Nonetheless, many iOS scanning apps rely on Vision for robust paper detection.

Recommendation: For most cases, Apple’s Vision (rectangle detection) and OpenCV (custom pipeline) are the top choices. If you prefer a quick, high-level solution, Vision’s built-in rectangle detector is ideal. If you need custom tuning or want to experiment with algorithms (like adaptive threshold or custom edge filters), OpenCV integrated into Swift gives you flexibility. In cases where machine learning can offer more robustness (e.g. complex backgrounds), a Core ML model (trained for document segmentation) could be used, but that involves a training pipeline beyond this scope. Metal is an optimization route: you might not need it initially, but it’s there if GPU acceleration is required.

2. Background Segmentation: Separating Paper from Desk

To reliably detect the paper’s boundary, we must separate the white paper from the white desk, minimizing interference from other objects or the broader scene. This background segmentation is critical because if the paper and desk blend together, edge detection might fail (Background removal with OpenCV (AKA segmentation)). Here are strategies to achieve this:

  • Global/Binary Thresholding: One approach is to convert the image to grayscale and apply a threshold to create a binary image (paper vs. background). A global threshold (like Otsu’s method) picks an optimal intensity cutoff. In practice, if the paper is slightly brighter or lit differently than the desk, Otsu thresholding can turn the paper region white and the desk region black (or vice versa). For example, converting to grayscale, blurring to reduce noise, then using Otsu’s method will “turn all pixels to either black or white” (binarize) to distinguish the white document from the image background (How to detect document edges in OpenCV - Scanbot SDK). This binary mask can then isolate the paper contour. However, if paper and desk have nearly identical intensity, global thresholding may not find a separating value (both might be lumped together).
  • Adaptive Thresholding (Local Threshold): Adaptive thresholding computes thresholds based on local neighborhoods, making it effective when lighting is uneven or contrast is low. Instead of one global cutoff, each pixel is compared against a statistic of its own neighborhood. This technique can often pull out a low-contrast object from a similar background by exploiting small local differences. Indeed, practitioners note that “adaptive thresholding can be very helpful” in tough contrast scenarios (Fast and Accurate Document Detection for Scanning | Hacker News). Methods like Bradley or Gaussian adaptive threshold look at a window around each pixel to decide if it’s foreground or background. In our case, slight shadows or texture at the paper’s edges might cause local intensity differences that adaptive threshold can detect, even if a global threshold fails. In one discussion, a developer mentions using skimage.filters.threshold_adaptive (since renamed threshold_local in scikit-image) and finding it “sufficiently good” for the white-on-white document problem (Fast and Accurate Document Detection for Scanning | Hacker News). Note that with a small window, the evenly lit paper interior and the desk both come out as the same value, and only local transitions (the shadow line or texture along the paper’s edge) are marked; with a window that is large relative to the paper, the paper may instead appear as its own blob. Either way, we then pick the contour of the appropriate size (typically the largest) as the paper.
  • Background Modeling/Subtraction: If the environment allows, you can do a one-time background capture (the empty desk) and subtract it from subsequent frames to segment the new object (the paper). Essentially, “take a blank picture (just the desk), then take the picture with the paper, and take pixel differences” (Segmentation of white object on white background - Python - OpenCV). The regions that differ significantly correspond to the paper. On mobile, this could be done by keeping a background frame (if the camera and desk remain fixed) or a running average background model. However, this is fragile if the camera moves or lighting changes. A more general approach is to estimate the background from the image itself: for example, apply a very large blur or median filter to the frame, which smooths out small details, leaving an approximation of the background, then subtract the blurred image from the original (Segmentation of white object on white background - Python - OpenCV). This “high-pass” difference image will highlight the paper’s edges and contents. The OpenCV forum suggests exactly this: apply a median blur with a large kernel (to get the background), subtract it, then threshold to get a mask (Segmentation of white object on white background - Python - OpenCV). This technique is akin to a morphological top-hat filter, which extracts small bright features by subtracting a morphological opening of the image (a background estimate) from the image itself.
  • Morphological Operations: After an initial segmentation (via threshold or subtraction), you may get a mask that is noisy or has gaps (e.g. the paper’s contour might be broken). Morphological operations like dilation and erosion can help. For instance, if adaptive threshold produces a somewhat “speckled” mask on the paper, a closing (dilate then erode) will “fill in” small gaps and connect adjacent regions (Segmentation of white object on white background - Python - OpenCV). This results in a more solid shape corresponding to the paper. Conversely, small false positives in the background can be removed by eroding or by filtering connected components by size (discard small regions as noise).
  • Deep Learning Segmentation: A more advanced solution is training a semantic segmentation model to label each pixel as “paper” or “not paper.” With enough training images (including tough cases of white-on-white), a neural network could learn subtle cues (texture, shadow, etc.) to segment the paper. As one expert quipped, “throw deep learning at it – it’s good at these things.” (Segmentation of white object on white background - Python - OpenCV). This would involve training a model (e.g. a CNN like U-Net/DeepLab), converting it to the Core ML format (or to a TensorFlow Lite model), and then running it in real time with the matching runtime. The model approach can be very robust to variations, but requires a dataset and training effort. In many cases, classical methods (threshold + edges) suffice, but it’s an option if you need higher accuracy and have resources to train a model.
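To make the global thresholding option above concrete, here is a minimal pure-Swift sketch of Otsu's method on a flat array of 8-bit grayscale values. The function names (otsuThreshold, binarize) are illustrative and not part of any library; a real app would more likely call OpenCV's implementation. It also shows the failure mode mentioned above: on a perfectly uniform image there is no separating value, so every pixel lands in the same class.

```swift
/// Otsu's method on 8-bit grayscale pixels: returns the cutoff that
/// maximizes the between-class variance of the two resulting groups.
func otsuThreshold(_ pixels: [UInt8]) -> UInt8 {
    var histogram = [Int](repeating: 0, count: 256)
    for p in pixels { histogram[Int(p)] += 1 }

    let total = Double(pixels.count)
    var sumAll = 0.0
    for level in 0..<256 { sumAll += Double(level) * Double(histogram[level]) }

    var countBelow = 0.0      // number of pixels with value <= t
    var sumBelow = 0.0        // sum of those pixel values
    var bestVariance = -1.0
    var bestThreshold = 0
    for t in 0..<256 {
        countBelow += Double(histogram[t])
        sumBelow += Double(t) * Double(histogram[t])
        if countBelow == 0 { continue }        // no pixels at or below t yet
        let countAbove = total - countBelow
        if countAbove == 0 { break }           // every pixel is at or below t
        let meanBelow = sumBelow / countBelow
        let meanAbove = (sumAll - sumBelow) / countAbove
        let diff = meanBelow - meanAbove
        let variance = countBelow * countAbove * diff * diff
        if variance > bestVariance {
            bestVariance = variance
            bestThreshold = t
        }
    }
    return UInt8(bestThreshold)
}

/// Pixels strictly brighter than the threshold become 255, the rest 0.
func binarize(_ pixels: [UInt8], threshold: UInt8) -> [UInt8] {
    return pixels.map { $0 > threshold ? 255 : 0 }
}
```

For a frame where the desk reads around 200 and the paper around 230, the cutoff falls between the two groups; for a frame where both read 220, the mask is a single value and carries no boundary information.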

Analysis: For this scenario, a combination of thresholding and edge detection is often effective. Literature on document scanning frequently binarizes the image to separate the document from background (How to detect document edges in OpenCV - Scanbot SDK), then uses contour-finding. If the desk and paper are truly the same color with no shading, no purely vision algorithm can magically separate them – you’d need some external difference (e.g. the paper’s slight shadow or a learned model). In practice, papers do have slight shadows or the desk isn’t perfectly white, so adaptive threshold or background subtraction can succeed. The key is to produce a mask or edges that clearly outline the paper, suppressing everything else.
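As a rough illustration of why a slight shadow is enough for adaptive thresholding, the sketch below implements an adaptive mean threshold in pure Swift using an integral image. The function name and the clamped-window border handling are assumptions made for this example (OpenCV replicates border pixels instead), so results near the image border can differ from OpenCV's.

```swift
/// Adaptive mean threshold on a row-major 8-bit grayscale image.
/// A pixel becomes 255 when it is brighter than (local mean - c), else 0.
/// Windows are clamped at the image border (an assumption of this sketch).
func adaptiveMeanThreshold(_ pixels: [UInt8], width: Int, height: Int,
                           blockSize: Int = 11, c: Double = 2) -> [UInt8] {
    precondition(pixels.count == width * height, "pixel count must match dimensions")
    precondition(blockSize > 0 && blockSize % 2 == 1, "blockSize must be odd")

    // Integral image with one extra zero row and column, so any window sum
    // is four lookups regardless of the window size.
    let rowLen = width + 1
    var integral = [Int](repeating: 0, count: rowLen * (height + 1))
    for y in 0..<height {
        var rowSum = 0
        for x in 0..<width {
            rowSum += Int(pixels[y * width + x])
            integral[(y + 1) * rowLen + (x + 1)] = integral[y * rowLen + (x + 1)] + rowSum
        }
    }

    let radius = blockSize / 2
    var out = [UInt8](repeating: 0, count: pixels.count)
    for y in 0..<height {
        let y0 = max(0, y - radius), y1 = min(height - 1, y + radius)
        for x in 0..<width {
            let x0 = max(0, x - radius), x1 = min(width - 1, x + radius)
            let area = (x1 - x0 + 1) * (y1 - y0 + 1)
            let sum = integral[(y1 + 1) * rowLen + (x1 + 1)]
                - integral[y0 * rowLen + (x1 + 1)]
                - integral[(y1 + 1) * rowLen + x0]
                + integral[y0 * rowLen + x0]
            let mean = Double(sum) / Double(area)
            out[y * width + x] = Double(pixels[y * width + x]) > mean - c ? 255 : 0
        }
    }
    return out
}
```

On a uniformly bright image with a thin, slightly darker line (such as the shadow along a paper edge), only the line comes out black while both flat regions stay white. That is why the output is better treated as an edge map to be cleaned up than as a filled paper mask.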

(In summary, start by converting to grayscale and applying adaptive or Otsu threshold to get a tentative segmentation. Use morphological filtering to clean the mask. This isolates the paper region and simplifies the subsequent edge/boundary detection.)
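The mask clean-up step can be sketched as a 3x3 morphological closing (dilate, then erode) in pure Swift. The helper names are hypothetical, the structuring element is fixed at 3x3 for brevity, and pixels outside the image are simply ignored, which is a simplification compared with library implementations.

```swift
/// 3x3 dilation (takeMax = true) or erosion (takeMax = false) of a
/// row-major binary mask holding 0 or 255. Out-of-image neighbours are ignored.
func morph3x3(_ mask: [UInt8], width: Int, height: Int, takeMax: Bool) -> [UInt8] {
    precondition(mask.count == width * height, "mask size must match dimensions")
    var out = mask
    for y in 0..<height {
        for x in 0..<width {
            var value: UInt8 = takeMax ? 0 : 255
            for ny in max(0, y - 1)...min(height - 1, y + 1) {
                for nx in max(0, x - 1)...min(width - 1, x + 1) {
                    let v = mask[ny * width + nx]
                    value = takeMax ? max(value, v) : min(value, v)
                }
            }
            out[y * width + x] = value
        }
    }
    return out
}

/// Closing = dilation followed by erosion: fills holes and gaps smaller
/// than the structuring element while leaving large regions unchanged.
func close3x3(_ mask: [UInt8], width: Int, height: Int) -> [UInt8] {
    let dilated = morph3x3(mask, width: width, height: height, takeMax: true)
    return morph3x3(dilated, width: width, height: height, takeMax: false)
}
```

A single-pixel hole in a white region is filled, whereas a wide black band survives unchanged. Larger gaps need a larger structuring element or repeated passes.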

3. Edge Detection Techniques for Mobile

Once we have reduced interference from the background, we need to detect the paper’s boundaries (the contours). Several edge/boundary detection methods can be considered, each with pros and cons for mobile use:

  • Canny Edge Detection: Canny is a classic algorithm that excels at finding thin edges in an image. It applies a gradient filter (like Sobel), then non-maximum suppression and hysteresis thresholding to produce clean, one-pixel wide edges. Canny works well in many document scanning cases. The typical pipeline is: blur the image to reduce noise, apply Canny to the grayscale (or binary) image, then find contours. One guide describes applying Canny after an initial thresholding to emphasize the document edges (How to detect document edges in OpenCV - Scanbot SDK). In our case, we could run Canny on the binarized image, which often yields clearer edges of the paper (How to detect document edges in OpenCV - Scanbot SDK). Canny will trace the outline of the paper if there’s any gradient between paper and desk. On a white desk, the gradient might come from slight illumination differences or shadows at the paper’s edges. If Canny parameters (the high/low thresholds) are tuned low enough, it may pick up these faint edges. The advantage of Canny is its precision and the fact it can ignore gradual illumination changes (thanks to the thresholds). It’s also efficient in implementation; on mobile CPUs it’s usually fine for real-time at moderate resolutions (especially after blurring or downsampling the image). The downside is that if the contrast is extremely low, Canny might not find a continuous edge – you might get broken segments (only 3 corners detected instead of 4, as one Stack Overflow question noted). In such cases, post-processing (connecting line segments or using Hough transform to detect straight lines) can help fill the gaps (Fast and Accurate Document Detection for Scanning | Hacker News).
  • Adaptive Thresholding (as Edge): Interestingly, the binary mask from adaptive threshold can itself be used to extract the contour of the paper. Instead of explicitly looking for gradients, you treat the boundary between black and white regions as the edge. By finding contours on the thresholded image, you can often get the outline of the paper directly. The forum discussion we mentioned showed an adaptive threshold result where the object (white ducks on white background) appeared as separated regions, but the contours were fragmented (Segmentation of white object on white background - Python - OpenCV). The solution was to merge and fill those contours, which suggests using morphological closing to solidify the shape. Once that’s done, finding the outer contour of the white region gives the paper boundary. The advantage here is that adaptive threshold inherently segments by local contrast, so it might succeed where Canny fails (since Canny uses global gradient thresholds). The trade-off is that the edges you get from the contour of a binary mask may be more jagged than Canny’s thin, one-pixel-wide edges. However, we can approximate or smooth the contour afterward.
  • Sobel or Scharr Filter + Contours: A simpler edge detector is applying a Sobel operator to get the gradient magnitude image, then thresholding that to get edges. This is basically what Canny’s first stage does (without the thinning and hysteresis). The CodePasta blog attempted using Sobel to segment objects on a white backdrop (Background removal with OpenCV (AKA segmentation)). They then denoised and found contours. This can work and is a bit faster than full Canny, but it tends to produce thicker edges and more noise. The blog author noted that out-of-the-box Sobel/contour approaches struggled when the object and background colors were similar (Background removal with OpenCV (AKA segmentation)), which is exactly our scenario. So Sobel alone might not be enough unless combined with other steps.
  • Hough Line Transform: To ensure all four edges are found, one can use a Hough transform on the edge image to detect lines. This is useful if, say, one side of the paper has very low contrast but you assume the paper is a rectangle – the algorithm might still pick up a faint line along that edge. Hough can detect lines even if the edge is not continuous, as long as enough edge pixels lie on a line. However, Hough transforms are computationally heavier and might be overkill on mobile (and you must filter out spurious lines). Many document detection implementations skip Hough and rely on contour approximation to get the rectangle.
  • Color Space and Illumination: As a note, converting the image to a different color space can improve edge detection. One team found converting RGB to L*u*v (CIELUV) kept more useful contrast for document edges (Fast and Accurate Document Detection for Scanning | Hacker News). In grayscale conversion, if the paper and desk are both white, little difference remains; but in a perceptual color space, slight differences in reflectance or shadows might be enhanced. Another approach is to use image normalization or equalization to boost contrast along the edges. These are tweaks that can feed into any of the above edge methods for better results.
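To show how small the gradients in this scenario are, here is a minimal pure-Swift sketch of the Sobel gradient magnitude, the first stage that Canny builds on. The function name is illustrative; border pixels are left at zero for simplicity.

```swift
/// Sobel gradient magnitude of a row-major 8-bit grayscale image.
/// Border pixels are left at 0 because the 3x3 kernel does not fit there.
func sobelMagnitude(_ pixels: [UInt8], width: Int, height: Int) -> [Double] {
    precondition(pixels.count == width * height, "pixel count must match dimensions")
    var magnitude = [Double](repeating: 0, count: pixels.count)
    guard width >= 3, height >= 3 else { return magnitude }

    // Convenience accessor for the pixel at column x, row y.
    func p(_ x: Int, _ y: Int) -> Double { return Double(pixels[y * width + x]) }

    for y in 1..<(height - 1) {
        for x in 1..<(width - 1) {
            // Horizontal gradient: right column minus left column, weights 1-2-1.
            let right = p(x + 1, y - 1) + 2 * p(x + 1, y) + p(x + 1, y + 1)
            let left = p(x - 1, y - 1) + 2 * p(x - 1, y) + p(x - 1, y + 1)
            // Vertical gradient: bottom row minus top row, weights 1-2-1.
            let bottom = p(x - 1, y + 1) + 2 * p(x, y + 1) + p(x + 1, y + 1)
            let top = p(x - 1, y - 1) + 2 * p(x, y - 1) + p(x + 1, y - 1)
            let gx = right - left
            let gy = bottom - top
            magnitude[y * width + x] = (gx * gx + gy * gy).squareRoot()
        }
    }
    return magnitude
}
```

A step of only 10 gray levels between desk and paper produces a magnitude of 40 on either side of the boundary, so any threshold applied afterwards (including Canny's low/high pair) has to sit below that to keep the edge.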

Mobile Performance Considerations: All the above methods can run in real time on modern iPhones/iPads, especially if you downsample the camera feed. Canny and adaptive threshold are both O(N) operations per frame (linear in number of pixels) and can be optimized with vectorized code. OpenCV’s implementations are in C/C++ and quite fast. If processing full 1080p frames at 30fps, you may hit CPU limits. In that case, you can lower the resolution of the analysis buffer (many apps detect the document on a smaller preview frame, say 640x480, then apply the results to the full-res image when capturing). Using Metal to offload these operations to the GPU can also yield higher throughput. For example, a Metal shader could compute an edge map in parallel, and thresholding is essentially a per-pixel operation ideal for GPU. Apple’s Accelerate framework has vImage functions for convolution and morphological operations that use SIMD instructions on the CPU. These optimizations can help maintain a smooth UI while processing frames.
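The "detect on a small frame, apply to the full-resolution image" idea can be sketched as follows. This is a pure-Swift illustration with hypothetical helper names; it uses block averaging by an integer factor, whereas a real pipeline would normally request a smaller buffer from the camera or resize with a library call.

```swift
/// Downsamples a row-major 8-bit grayscale image by an integer factor,
/// averaging each factor x factor block. Leftover rows/columns are dropped.
func downsample(_ pixels: [UInt8], width: Int, height: Int,
                factor: Int) -> (pixels: [UInt8], width: Int, height: Int) {
    precondition(factor >= 1, "factor must be at least 1")
    precondition(pixels.count == width * height, "pixel count must match dimensions")
    let w = width / factor
    let h = height / factor
    var out = [UInt8](repeating: 0, count: w * h)
    for y in 0..<h {
        for x in 0..<w {
            var sum = 0
            for dy in 0..<factor {
                for dx in 0..<factor {
                    sum += Int(pixels[(y * factor + dy) * width + (x * factor + dx)])
                }
            }
            out[y * w + x] = UInt8(sum / (factor * factor))
        }
    }
    return (out, w, h)
}

/// Maps a coordinate found on the downsampled image back to the full-resolution
/// image, aligning the centre of a small pixel with the centre of its source block.
func toFullResolution(x: Double, y: Double, factor: Int) -> (x: Double, y: Double) {
    let f = Double(factor)
    return ((x + 0.5) * f - 0.5, (y + 0.5) * f - 0.5)
}
```

Since the threshold and edge steps are linear in the number of pixels, a factor of 2 in each dimension cuts their work to roughly a quarter, and the detected corners only need one multiplication each to be mapped back.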

In summary, Canny edge detection on a preprocessed image (blurred & possibly thresholded) is a reliable way to get the paper’s outline. If Canny is missing edges due to low contrast, consider adaptive thresholding and contour extraction as an alternative or complementary method. Many document scanners use a mix: threshold to isolate the document, then Canny or direct contour find, then polygon approximation. The largest closed contour in the scene is often the document (How to detect document edges in OpenCV - Scanbot SDK). Once edges are detected, we usually approximate the contour to a polygon with four points (using e.g. Douglas-Peucker algorithm) and verify it’s a quadrilateral (How to detect document edges in OpenCV - Scanbot SDK). This yields the corners of the paper.
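The polygon approximation step can be sketched in pure Swift with the Douglas-Peucker idea applied to a closed contour. This is a simplified illustration, not OpenCV's approxPolyDP: the type and function names are made up for the example, and the closed contour is handled by splitting it at the first point and the point farthest from it, then dropping either split point if it turns out to lie on a straight edge.

```swift
struct Point: Equatable {
    var x: Double
    var y: Double
}

/// Distance from p to the infinite line through a and b (or to a if a == b).
func perpendicularDistance(_ p: Point, _ a: Point, _ b: Point) -> Double {
    let dx = b.x - a.x, dy = b.y - a.y
    let length = (dx * dx + dy * dy).squareRoot()
    if length == 0 {
        return ((p.x - a.x) * (p.x - a.x) + (p.y - a.y) * (p.y - a.y)).squareRoot()
    }
    return abs(dy * (p.x - a.x) - dx * (p.y - a.y)) / length
}

/// Classic recursive Douglas-Peucker on an open polyline.
func simplifyOpen(_ pts: [Point], epsilon: Double) -> [Point] {
    guard pts.count > 2 else { return pts }
    var maxDist = 0.0
    var index = 0
    for i in 1..<(pts.count - 1) {
        let d = perpendicularDistance(pts[i], pts[0], pts[pts.count - 1])
        if d > maxDist { maxDist = d; index = i }
    }
    if maxDist <= epsilon { return [pts[0], pts[pts.count - 1]] }
    let left = simplifyOpen(Array(pts[0...index]), epsilon: epsilon)
    let right = simplifyOpen(Array(pts[index...]), epsilon: epsilon)
    return Array(left.dropLast()) + right
}

/// Simplifies a closed contour (last point connects back to the first).
func simplifyClosed(_ contour: [Point], epsilon: Double) -> [Point] {
    guard contour.count > 3 else { return contour }
    // Split the loop at point 0 and at the point farthest from it.
    var far = 0
    var farDist = 0.0
    for i in 1..<contour.count {
        let dx = contour[i].x - contour[0].x, dy = contour[i].y - contour[0].y
        let d = dx * dx + dy * dy
        if d > farDist { farDist = d; far = i }
    }
    guard far > 0 else { return contour }
    let first = simplifyOpen(Array(contour[0...far]), epsilon: epsilon)
    let second = simplifyOpen(Array(contour[far...]) + [contour[0]], epsilon: epsilon)
    var result = Array(first.dropLast()) + Array(second.dropLast())
    // The two split points were kept unconditionally; drop either one
    // if it lies on the straight line between its neighbours.
    for index in [first.count - 1, 0] where result.count > 3 {
        let prev = result[(index + result.count - 1) % result.count]
        let next = result[(index + 1) % result.count]
        if perpendicularDistance(result[index], prev, next) <= epsilon {
            result.remove(at: index)
        }
    }
    return result
}

/// Shoelace formula; used to prefer the largest quadrilateral candidate.
func polygonArea(_ pts: [Point]) -> Double {
    var sum = 0.0
    for i in 0..<pts.count {
        let a = pts[i], b = pts[(i + 1) % pts.count]
        sum += a.x * b.y - b.x * a.y
    }
    return abs(sum) / 2
}
```

With epsilon set to about 2% of the contour perimeter, as in the text, a densely sampled rectangular outline collapses to its four corners, and polygonArea gives the value used to keep the largest candidate.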

4. Swift Implementation Example

Bringing it all together, here’s a simplified Swift code example that detects a white paper’s boundary on a white desk in real time. This example uses OpenCV (via its iOS framework) for image processing. We’ll assume you have integrated OpenCV and can call its functions in Swift (often done by creating a bridging header or wrapper functions). The code will grab frames from the camera, preprocess them, detect edges, and find the paper’s contour:

import UIKit
import opencv2   // OpenCV 4.x iOS framework with the Objective-C/Swift bindings

// NOTE: The call names below follow the opencv2 Objective-C/Swift bindings as best
// understood; verify them against the headers of the framework version you link.
func processFrame(_ image: UIImage) -> UIImage {
    // 1. Convert UIImage to an OpenCV Mat (RGBA, 4 channels).
    let src = Mat(uiImage: image)

    // 2. Convert to grayscale into a separate Mat so the color image is kept for drawing.
    //    A Mat built from a UIImage is RGBA, not BGR.
    let gray = Mat()
    Imgproc.cvtColor(src: src, dst: gray, code: .COLOR_RGBA2GRAY)

    // 3. Blur to reduce noise.
    Imgproc.GaussianBlur(src: gray, dst: gray, ksize: Size2i(width: 5, height: 5), sigmaX: 0)

    // 4. Adaptive threshold to mark local transitions at the paper's border.
    let threshMat = Mat()
    Imgproc.adaptiveThreshold(src: gray, dst: threshMat, maxValue: 255,
                              adaptiveMethod: .ADAPTIVE_THRESH_MEAN_C,
                              thresholdType: .THRESH_BINARY,
                              blockSize: 11, C: 2)
    // Alternatively, use Otsu's global threshold. THRESH_BINARY is 0, so
    // .THRESH_OTSU alone is equivalent to THRESH_BINARY | THRESH_OTSU:
    // Imgproc.threshold(src: gray, dst: threshMat, thresh: 0, maxval: 255, type: .THRESH_OTSU)

    // 5. Edge detection (Canny) on the thresholded image.
    let edges = Mat()
    Imgproc.Canny(image: threshMat, edges: edges, threshold1: 50, threshold2: 150)

    // 6. Find contours of the edges. The binding fills an NSMutableArray
    //    with one array of Point2i per contour.
    let contours = NSMutableArray()
    Imgproc.findContours(image: edges, contours: contours, hierarchy: Mat(),
                         mode: .RETR_EXTERNAL, method: .CHAIN_APPROX_SIMPLE)

    // 7. Keep the largest contour that covers at least 5% of the image
    //    and simplifies to a quadrilateral.
    var documentQuad: [Point2i]? = nil
    var bestArea = 0.05 * Double(gray.rows() * gray.cols())
    for case let contour as [Point2i] in contours {
        let area = Imgproc.contourArea(contour: MatOfPoint2i(array: contour))
        if area < bestArea { continue }
        // Approximate the contour with a polygon (Douglas-Peucker).
        let curve = contour.map { Point2f(x: Float($0.x), y: Float($0.y)) }
        let peri = Imgproc.arcLength(curve: curve, closed: true)
        let approx = NSMutableArray()
        Imgproc.approxPolyDP(curve: curve, approxCurve: approx,
                             epsilon: 0.02 * peri, closed: true)
        // Check for a quadrilateral.
        if let quad = approx as? [Point2f], quad.count == 4 {
            documentQuad = quad.map { Point2i(x: Int32($0.x.rounded()), y: Int32($0.y.rounded())) }
            bestArea = area
        }
    }

    // 8. Draw the detected contour on the color image (for visualization).
    guard let quad = documentQuad else { return image }
    // The Mat is RGBA, so red is (255, 0, 0) with full alpha.
    Imgproc.drawContours(image: src, contours: [quad], contourIdx: -1,
                         color: Scalar(255, 0, 0, 255), thickness: 2)
    return src.toUIImage()  // back to UIImage, now including the drawn outline
}

How this works: We first convert the camera frame to grayscale and apply a slight blur. We then perform an adaptive mean threshold (with an 11×11 block) to get a binary image. This should mark the paper’s border, where shadows or slight illumination differences create local contrast against the white desk, while evenly lit areas of both paper and desk come out the same. Next, we apply Canny edge detection on this binary image. Using the binary image focuses the edge detector on the transitions that the threshold marked along the paper’s border (How to detect document edges in OpenCV - Scanbot SDK). We retrieve contours from the Canny output and filter them. We look for the largest contours and approximate each to a polygon; if a polygon has 4 vertices, we assume it’s our document (How to detect document edges in OpenCV - Scanbot SDK). Finally, we draw the detected quadrilateral on the image (in red) for demonstration. In a real app, instead of drawing, you might use those 4 points for cropping or perspective correction.

This approach combines background segmentation (thresholding) and edge detection (Canny + contour) to robustly find the paper. The adaptive threshold helps even when the desk and paper have similar color, by exploiting local differences (Fast and Accurate Document Detection for Scanning | Hacker News). Canny then finds a crisp outline of the high-contrast binary boundary (How to detect document edges in OpenCV - Scanbot SDK).

On an actual device, you would likely run this in an AVCaptureVideoDataOutput callback on a background thread. The above pipeline can be optimized further (e.g., skip Canny and use the binary mask’s contour directly, or use Vision’s rectangle detector). But even as-is, with moderate resolution frames, it should work in real time on modern iPhones. The result is that the white paper’s boundary is detected and can be highlighted or used for subsequent operations (like document scan crop).
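Before the four detected points can be used for the crop or perspective correction mentioned earlier, they usually need a consistent order. The sketch below is one common heuristic in pure Swift, with hypothetical names: in image coordinates (y pointing down), the top-left corner has the smallest x + y, the bottom-right the largest, the top-right the smallest y − x, and the bottom-left the largest. It assumes the paper is not rotated close to 45° in the frame; in that case the sums tie and the function may return nil.

```swift
struct Corner: Equatable {
    var x: Double
    var y: Double
}

/// Orders four corners as top-left, top-right, bottom-right, bottom-left,
/// assuming image coordinates with the origin at the top-left and y pointing down.
/// Returns nil when the input is not four points or the heuristic is ambiguous.
func orderCorners(_ corners: [Corner]) -> [Corner]? {
    guard corners.count == 4 else { return nil }
    let bySum = corners.sorted { $0.x + $0.y < $1.x + $1.y }
    let byDiff = corners.sorted { $0.y - $0.x < $1.y - $1.x }
    let ordered = [bySum[0], byDiff[0], bySum[3], byDiff[3]]
    // Reject the result if the same corner was picked for two positions.
    for i in 0..<4 {
        for j in (i + 1)..<4 where ordered[i] == ordered[j] {
            return nil
        }
    }
    return ordered
}
```

The ordered array can then be paired with the corners of the output rectangle when computing a perspective transform, for example with OpenCV's getPerspectiveTransform or Core Image's perspective correction filter.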

Conclusion

Detecting a white paper on a white desk in real time requires a combination of the right tools and algorithms. OpenCV (via Swift) or Apple’s Vision framework are recommended libraries to start with, given their real-time performance and available algorithms (GitHub - mikeroyal/CoreML-Guide: Core ML Guide) (VNDetectRectanglesRequest | Apple Developer Documentation). The core challenge is separating the paper from the background: techniques like adaptive thresholding and background subtraction can bring out contrast where none is obvious (Segmentation of white object on white background - Python - OpenCV). Once segmented, traditional edge detection (Canny) and contour finding can extract the paper’s boundaries (How to detect document edges in OpenCV - Scanbot SDK). On mobile devices, these methods can be optimized through hardware acceleration (Metal/CoreImage) if needed, but even CPU-based OpenCV or Vision algorithms often meet real-time requirements. The Swift code above illustrates one practical pipeline: local thresholding followed by edge detection and contour filtering to find the edges of a white sheet of paper on a white table. This enables features like document scanning or augmented reality interactions with documents, all running live on a mobile device. The keys are careful image preprocessing (to handle the low contrast) and robust edge/contour detection to pinpoint the paper’s boundary, provided the scene offers at least a slight shadow or shading difference to work with.

Sources:

  1. Kevin (Scanbot SDK) – Document edge detection (OpenCV) tutorial (How to detect document edges in OpenCV - Scanbot SDK)
  2. Dropbox Tech Blog via HackerNews – Challenges in document detection (white on white) (Fast and Accurate Document Detection for Scanning | Hacker News)
  3. OpenCV Forum – Discussion on segmenting white object on white background (Segmentation of white object on white background - Python - OpenCV)
  4. mikeroyal/CoreML-Guide (GitHub) – Describes OpenCV as optimized for real-time vision (GitHub - mikeroyal/CoreML-Guide: Core ML Guide)
  5. Apple Developer Documentation – Vision rectangle detection (VNDetectRectanglesRequest)