Image segmentation is a computer vision task that separates a digital image into multiple parts.
In an era where cameras and other devices increasingly need to see and interpret their surroundings, image segmentation has become an indispensable technique for teaching devices to understand the world around them.
But how does image segmentation work? And what can you build with it? What are the different approaches, what are its potential benefits and limitations, and how might you use it in your business?
In this guide, you’ll find answers to all of those questions and more. Whether you’re an experienced machine learning engineer considering implementation, a developer wanting to learn more, or a product manager looking to explore what’s possible with computer vision and segmentation, this guide is for you.
Here’s a look at what we’ll cover:
Image segmentation is a computer vision technique used to understand what is in a given image at a pixel level. It differs from image recognition, which assigns one or more labels to an entire image, and from object detection, which localizes objects within an image by drawing a bounding box around them. Image segmentation provides more fine-grained information about the contents of an image.
Consider a photo of a busy street you might take with your phone:
In the foreground: pavement, a bus, and a car. In the background: a building, a tree, the sky.
This photo is made up of millions of individual pixels, and the goal of image segmentation is to assign each of those pixels to the object to which it belongs. Segmenting an image allows us to separate the foreground from background, identify the precise location of a bus or a building, and clearly mark the boundaries that separate a tree from the sky.
While the term image segmentation refers to the general technique of partitioning an image into coherent parts, there are a few different ways this can work in practice, depending on your specific use case.
Image segmentation tasks can be broken down into two broad categories: semantic segmentation and instance segmentation.
Image segmentation allows us to decompose an image into meaningful parts and understand a scene at a more granular level. In this way, it differs from other computer vision tasks you might be familiar with. While image classification is helpful for seeing what’s in an image in general, and object detection allows us to locate and track the contents of an image, segmentation allows us to define and understand the shapes and boundaries of objects in an image.
We can readily see the power of image segmentation when we consider its flexibility in photo editing software. From automatically separating backgrounds and foregrounds, to cutting out segmented people and objects, to creating responsive portrait modes, image segmentation offers a wide range of capabilities for these kinds of creativity tools. We’ll explore this use case and a few others in more detail below.
This kind of granular image understanding also promises to transform various industries, including:
Of course, this isn’t an exhaustive list, but it includes some of the primary ways in which image segmentation is shaping our future.
Now that we know a bit about what image segmentation is and what it can be used for, let’s dive into how it actually works.
We can think about image segmentation as a function. The function takes an image as input and outputs a matrix, or mask, where each element tells us which class or instance that pixel belongs to.
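To make this function view concrete, here’s a toy sketch in NumPy. The thresholding rule and class ids are made up for illustration — a real model learns this mapping — but the input/output shapes mirror what a segmentation model produces:

```python
import numpy as np

# A toy 4x4 grayscale "image" and a segmentation function that returns
# a mask of the same height and width, where each element is a class id
# (here 0 = background, 1 = foreground). The intensity threshold is an
# illustrative stand-in for a learned model.

def segment(image: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Assign each pixel a class id based on a simple intensity rule."""
    return (image > threshold).astype(np.int64)

image = np.array([
    [0.1, 0.2, 0.9, 0.8],
    [0.1, 0.3, 0.9, 0.9],
    [0.2, 0.1, 0.8, 0.7],
    [0.1, 0.2, 0.2, 0.1],
])

mask = segment(image)
print(mask.shape)  # same spatial size as the input
print(mask)
```

Whatever the internals, the contract is the same: image in, per-pixel class (or instance) ids out.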
If we were to think about high-level image features that might be useful for segmentation, a number of heuristics would probably come to mind. Color, for example, is a common one. Green screens ensure the background of an image is a single color that can be programmatically detected and replaced in post-processing. Contrast is another useful feature. A dark green tree can be segmented from a bright blue sky by looking for pixel boundaries with high contrast values.
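As a sketch of the color heuristic, here is a minimal green-screen example in NumPy: mark pixels whose green channel dominates red and blue, then swap them for a new background. The dominance margin of 40 is an illustrative choice, not a standard value:

```python
import numpy as np

# Minimal green-screen heuristic: a pixel is "background" if its green
# channel exceeds both red and blue by a margin. The margin (40) is an
# arbitrary illustrative threshold.

def green_screen_mask(rgb: np.ndarray, margin: int = 40) -> np.ndarray:
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    return (g - r > margin) & (g - b > margin)

# 2x2 image: top row is chroma green, bottom row is a red subject.
frame = np.array([
    [[0, 255, 0], [10, 250, 20]],
    [[200, 30, 30], [210, 40, 35]],
], dtype=np.uint8)

background = np.zeros_like(frame)  # replacement: solid black
mask = green_screen_mask(frame)
composite = np.where(mask[..., None], background, frame)
print(mask)
```

Swapping the per-pixel rule — intensity thresholds, gradient magnitude for contrast — gives the family of classical heuristics described above.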
Heuristics like these form the basis for traditional segmentation algorithms based on image histograms, edges, and other clustering techniques. While these techniques can be simple, fast, and memory efficient, they typically require lots of use-case-specific tuning and have limited accuracy on complex scenes. More recently, machine and deep learning have emerged as powerful new tools, providing flexibility and high levels of accuracy.
Machine learning approaches to image segmentation train models to identify which features of an image are important, rather than designing bespoke heuristics by hand. Deep neural networks have proven particularly adept at performing segmentation tasks. For a gentle, general introduction to CNNs, check out this overview. The rest of this section will cover a number of neural network architectures and training techniques that have proven particularly successful for segmentation.
Though neural networks used for image segmentation may differ slightly in implementation, most have a similar basic structure:
There are too many specific neural network architectures to cover them all here, but we’ll highlight a few robust, reliable ones that make good places to start.
The U-Net model architecture derives its name from its U shape. The encoder and decoder work together to extract salient features of an image and then use those features to determine which label should be given to each pixel. While the features a model might look at to perform this task aren’t always interpretable by humans, you can think of them as statistical correlations between pixels. In the case of a U-Net model, the encoder is made up of blocks that downscale an image into narrower feature layers, while the decoder mirrors those blocks in the opposite direction, upscaling outputs to the original image size and ultimately predicting a label for each pixel. Skip connections cut across the U to improve performance.
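At a shape level, the U-Net’s down-and-up flow can be sketched in NumPy. This assumes 2x average pooling for encoder blocks and nearest-neighbor upsampling for decoder blocks, and it omits the learned convolutions a real U-Net applies at every stage — the point is only to show the U shape and the skip connection:

```python
import numpy as np

def downscale(x: np.ndarray) -> np.ndarray:
    """Encoder block stand-in: 2x2 average pooling."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upscale(x: np.ndarray) -> np.ndarray:
    """Decoder block stand-in: 2x nearest-neighbor upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

image = np.random.rand(8, 8)

# Left side of the U: encoder narrows the spatial resolution.
enc1 = downscale(image)   # 8x8 -> 4x4
enc2 = downscale(enc1)    # 4x4 -> 2x2 (bottom of the U)

# Right side of the U: decoder restores resolution, with a skip
# connection merging encoder features into the matching decoder stage
# (real U-Nets concatenate along channels; here we simply average).
dec1 = upscale(enc2)                 # 2x2 -> 4x4
merged = (dec1 + enc1) / 2           # skip connection across the U
dec0 = upscale(merged)               # 4x4 -> 8x8, back to input size

print(image.shape, enc2.shape, dec0.shape)
```

The final 8x8 output has one value per input pixel — exactly the shape a per-pixel label map needs.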
Deep learning models often struggle to segment objects that may appear at different spatial scales in an image. For example, a person may be small if walking down a long street, but large in a portrait. To provide good accuracy in all cases, pyramid structures use multiple encoders in parallel to process an image at different spatial scales.
One encoder branch takes a full resolution image as input, another takes an image scaled down by 50% as input, and so on. The outputs of each encoder are then combined and processed by the decoder to produce the segmentation mask. Sometimes, the model is trained by optimizing for accuracy at each scale individually, even if only the full resolution mask is used during inference.
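A rough sketch of this multi-scale idea, using a simple horizontal-gradient map as a stand-in for a learned encoder branch (a real pyramid model learns both the per-branch encoders and how their outputs are fused):

```python
import numpy as np

def edge_strength(x: np.ndarray) -> np.ndarray:
    """Stand-in encoder: horizontal gradient magnitude."""
    return np.abs(np.diff(x, axis=1, append=x[:, -1:]))

def downscale(x: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool by an integer factor (factor=1 is a no-op)."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upscale(x: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbor upsample back to full resolution."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

image = np.random.rand(8, 8)

# Three parallel branches: full, half, and quarter resolution.
branches = []
for factor in (1, 2, 4):
    scaled = downscale(image, factor)
    features = edge_strength(scaled)
    branches.append(upscale(features, factor))

# Fuse the branches before decoding (here: a simple average).
fused = np.mean(branches, axis=0)
print(fused.shape)
```

Each branch sees the same scene at a different scale, so small and large instances of the same object class both produce strong features somewhere in the pyramid.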
Mask R-CNN and DeepLab are popular architectures designed by Facebook and Google, respectively, that combine many of the structures covered above. They use encoders, decoders, and skip connections to produce high quality segmentation masks of many objects. Pre-trained, open-source implementations of these models can be found online and are a great place to start if speed is your priority.
If your use case requires that image segmentation work in real-time, without internet connectivity, or on private data, you might be considering running your image segmentation model directly on an edge device like a mobile phone or IoT board.
In those cases, you’ll need to choose specific model architectures to make sure everything runs smoothly on these lower power devices. Here are a few tips and tricks to ensure your models are ready for edge deployment:
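One widely used trick is post-training quantization: storing float32 weights as int8 values plus a scale factor, which cuts model size roughly 4x. Here’s a from-scratch sketch of symmetric quantization — not the API of any particular toolkit, just the underlying idea:

```python
import numpy as np

# Symmetric post-training quantization sketch: map float32 weights into
# int8 by dividing by a per-tensor scale, so the largest-magnitude
# weight lands at +/-127. Storage shrinks 4x; a small rounding error
# (at most half a quantization step) is introduced.

def quantize(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(256).astype(np.float32)
q, scale = quantize(weights)
restored = dequantize(q, scale)

print(q.nbytes, weights.nbytes)                    # 256 vs 1024 bytes
print(float(np.max(np.abs(restored - weights))))   # small rounding error
```

Real edge toolkits add per-channel scales, activation quantization, and calibration data, but the size/accuracy trade-off works the same way.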
In section 3, we’ll cover some real-world use cases for image segmentation. We’ve mentioned several of them in previous sections, but here we’ll dive a bit deeper and explore the impact this computer vision technique can have across industries.
We’ll examine how image segmentation is used in the following areas, moving from the most common and achievable use cases to advanced ones that are harder to implement but no less exciting:
While image segmentation greatly assists in powering transformative technologies like automated medical imaging/diagnostics, self-driving cars, and other robotics, it can also augment consumer-facing creativity tools – think photo and video editors, content creation platforms, and more.
The more we can teach our devices to understand and interpret images, the more we can manipulate them. Image segmentation represents one major technique for augmenting visual content in creative ways.
When we create pixel-level maps and masks of objects in an image, we isolate these regions and separate them out from one another. This creates opportunities to implement targeted image effects on specific areas of a given image:
These are a few possible ways image segmentation might assist in the development of unique and innovative creativity tools.
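As a small illustration of a targeted effect, here’s a sketch that keeps a masked subject in color while converting everything else to grayscale — the basis of many portrait-mode-style filters. The tiny image and mask are made up for illustration:

```python
import numpy as np

# Apply an effect only outside the subject mask: the background is
# replaced by its grayscale (channel-mean) version while masked pixels
# keep their original color.

def gray_background(rgb: np.ndarray, subject_mask: np.ndarray) -> np.ndarray:
    gray = rgb.mean(axis=-1, keepdims=True).astype(rgb.dtype)
    return np.where(subject_mask[..., None], rgb, gray)

image = np.array([
    [[200, 40, 40], [30, 30, 180]],
    [[50, 160, 60], [90, 90, 90]],
], dtype=np.uint8)
mask = np.array([[True, False],
                 [False, False]])  # subject = top-left pixel only

result = gray_background(image, mask)
print(result)
```

Swapping the grayscale step for a blur, a recolor, or a full background replacement gives the other effects listed above — the mask does the isolating either way.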
In order to more efficiently perform their tasks – whatever they might be – robots need to map and interpret the scenes in which they’re working. The pixel-level understanding that image segmentation provides can help robots of all kinds better navigate their workspaces.
A number of areas of robotics can benefit from image segmentation, from service and industrial robotics to agricultural robotics and beyond.
For more detailed information about the use of image segmentation in robotics, check out this paper:
Image segmentation can be a powerful technique in the initial steps of a diagnostic and treatment pipeline for many conditions that require medical images, such as CT or MRI scans.
Segmentation can effectively separate homogeneous regions that may contain diagnostically important pixels, such as organs or lesions. However, there are significant challenges, including low contrast (see part 2), noise, and various other imaging ambiguities.
As such, numerous segmentation techniques have been adapted specifically for medical image processing. While these techniques and the technologies underlying them are beyond the scope of this overview, here are a few valuable resources if you’d like to dive deeper into this use case:
I would hope you’re not driving a vehicle right now (if you are, bookmark this guide; it’ll be here when you’ve safely arrived at your destination), but let’s imagine you are. No matter where you’re driving, there’s a lot to pay attention to – the road, other vehicles, pedestrians, sidewalks, and (potentially) a plethora of other obstacles and safety hazards.
If you’ve been driving for a long time, noticing and reacting to this environment might seem automatic or like second nature. But if we consider a self-driving car, we quickly realize that this car needs to see, interpret, and respond to a scene in real-time. This means the camera system in this vehicle would need to create a pixel-level map of the world in order to safely and efficiently navigate it.
Even though the field of autonomous vehicles is much more complicated than performing image segmentation, this pixel-level understanding is a crucial component in making these vehicles a reality.
To help illustrate how this works and looks in action, here’s an implementation video.
We hope the above overview was helpful in understanding the basics of image segmentation and how it can be used in the real world. But as with all things, more answers lead to more questions.
This final section will provide a series of organized resources to help you take the next step in learning all there is to know about image segmentation.
In the interest of keeping this list relatively accessible, we’ve curated our top resources for each of the following areas:
The above datasets represent some of the most popular and widely-used datasets for image segmentation. For a closer look at a wider range of datasets, including some that are more task-specific, check out these detailed lists for:
The benefits of using image segmentation aren’t limited to applications that run on servers or in the cloud.
In fact, segmentation models can be made small and fast enough to run directly on mobile devices, opening up a wide range of possibilities: consumer try-on applications, powerful photo and video editors, content creation tools, augmented reality experiences, and more.
From brand loyalty, to user engagement and retention, and beyond, implementing image segmentation on-device has the potential to delight users in new and lasting ways, all while reducing cloud costs and keeping user data private.
Fritz AI is the machine learning platform that makes it easy to teach devices how to see, hear, sense, and think. To learn more about how Fritz AI can help you build green screen effects, portrait modes, and more, check out our Image Segmentation API.