Image Segmentation Guide

Almost everything you need to know about how image segmentation works


Image segmentation is a computer vision task that separates a digital image into multiple parts.

In an era where cameras and other devices increasingly need to see and interpret the world around them, image segmentation has become an indispensable technique for teaching devices how to understand what they see.

But how does image segmentation work? And what can you build with it? What are the different approaches, what are its potential benefits and limitations, and how might you use it in your business?

In this guide, you’ll find answers to all of those questions and more. Whether you’re an experienced machine learning engineer considering implementation, a developer wanting to learn more, or a product manager looking to explore what’s possible with computer vision and segmentation, this guide is for you.

Here’s a look at what we’ll cover:

Part 1: Image segmentation – the basics
  • What is image segmentation?
  • Modes and types of segmentation
  • Why is image segmentation important?
Part 2: How does image segmentation work?
  • Inputs and outputs
  • Basic structure
  • Model architecture overview
  • How image segmentation works on the edge
Part 3: Use cases and applications
  • Creativity tools
  • Robotics
  • Medical imaging and diagnostics
  • Autonomous vehicles
Part 4: Resources
  • Getting started
  • Tutorials
  • Literature review
  • Datasets available

Part 1: Image segmentation – the basics

What is image segmentation?

Image segmentation is a computer vision technique used to understand what is in a given image at a pixel level. It is different from image recognition, which assigns one or more labels to an entire image, and from object detection, which localizes objects within an image by drawing a bounding box around them. Image segmentation provides more fine-grained information about the contents of an image.

Consider a photo of a busy street you might take with your phone:

Image Segmentation Outdoor Scene

In the foreground: pavement, a bus, and a car. In the background: a building, a tree, the sky.

This photo is made up of millions of individual pixels, and the goal of image segmentation is to assign each of those pixels to the object to which it belongs. Segmenting an image allows us to separate the foreground from background, identify the precise location of a bus or a building, and clearly mark the boundaries that separate a tree from the sky.

While the term image segmentation refers to the general technique of partitioning an image into coherent parts, there are a few different ways this can work in practice, depending on your specific use case.

Modes and types of image segmentation

Image segmentation tasks can be broken down into two broad categories: semantic segmentation and instance segmentation.

  • In semantic segmentation, each pixel belongs to a particular class (think classification on a pixel level). In the image above, for example, those classes were bus, car, tree, building, etc. Any pixel belonging to any car is assigned to the same “car” class.
  • Instance segmentation goes one step further and separates distinct objects belonging to the same class. For example, if there were two cars in the image above, each car would be assigned the “car” label, but would be given a distinct color because they are different instances of the class.
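The difference is easy to see in array form. Here is a toy sketch (the class and instance ids are hypothetical) where a semantic mask labels every car pixel identically, while an instance mask tells the two cars apart:

```python
import numpy as np

# A toy 4x4 scene containing two cars (hypothetical ids: 0 = background, 1 = car).
semantic = np.array([
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])

# An instance mask keeps the same class boundary but gives each car its own id.
instance = np.array([
    [0, 1, 0, 2],
    [0, 1, 0, 2],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])

num_car_pixels = int((semantic == 1).sum())     # all car pixels share one label
num_instances = len(np.unique(instance)) - 1    # distinct cars (background excluded)
```

The semantic mask alone cannot tell you how many cars are present; the instance mask can.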

Why is image segmentation important?

Image segmentation allows us to decompose an image into meaningful parts and understand a scene at a more granular level. In this way, it differs from other computer vision tasks you might be familiar with. While image classification is helpful for seeing what’s in an image in general, and object detection allows us to locate and track the contents of an image, segmentation allows us to define and understand the shapes and boundaries of objects in an image.

We can readily see the power of image segmentation when we consider its flexibility in photo editing software. From automatically separating backgrounds and foregrounds, to cutting out segmented people and objects, to creating responsive portrait modes, image segmentation offers a wide range of capabilities for these kinds of creativity tools. We’ll explore this use case and a few others in more detail below.

This kind of granular image understanding also promises to transform various industries, including:

  • Photo/Video editing and creativity tools
  • Traffic control systems
  • Autonomous vehicles
  • Robotics

Of course, this isn’t an exhaustive list, but it includes some of the primary ways in which image segmentation is shaping our future.


Part 2: How does image segmentation work?

Now that we know a bit about what image segmentation is and what it can be used for, let’s dive into how it actually works.

We can think about image segmentation as a function. The function takes an image as input and outputs a matrix, or mask, where each element tells us which class or instance that pixel belongs to.
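As a minimal sketch of that function signature (the brightness rule below is a toy stand-in for a real trained model, and the class ids are hypothetical):

```python
import numpy as np

CLASSES = ["background", "car"]  # hypothetical label set

def segment(image: np.ndarray) -> np.ndarray:
    """Toy stand-in for a segmentation model: takes an HxWx3 image and
    returns an HxW mask where each element is a class id."""
    # A real model would predict per-pixel class probabilities; here we
    # simply label bright pixels as "car" and everything else "background".
    return np.where(image.mean(axis=-1) > 127, 1, 0)

image = np.zeros((4, 4, 3), dtype=np.uint8)
image[1:3, 1:3] = 255          # a bright 2x2 "object" in the middle
mask = segment(image)          # one class id per pixel, same height and width
```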

If we were to think about some high-level features of images that might be useful for segmenting images, a number of heuristics would probably come to mind. Color, for example, is a common one. Green screens ensure the background of an image is a single color that can be programmatically detected and replaced in post-processing. Contrast is another useful feature. A dark green tree can be segmented from a bright blue sky by looking for pixel boundaries with high contrast values.

Heuristics like these form the basis for traditional segmentation algorithms based on image histograms, edges, and other clustering techniques. While these techniques can be simple, fast, and memory efficient, they typically require lots of use-case-specific tuning and have limited accuracy on complex scenes. More recently, machine and deep learning have emerged as powerful new tools providing flexibility and high levels of accuracy.
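The color heuristic can be made concrete with a tiny chroma-key example in NumPy (the channel margin of 40 is an arbitrary assumption, not a standard value):

```python
import numpy as np

def green_screen_mask(image: np.ndarray) -> np.ndarray:
    """Classic color heuristic: mark a pixel as background when its
    green channel dominates both red and blue by a fixed margin."""
    r = image[..., 0].astype(int)
    g = image[..., 1].astype(int)
    b = image[..., 2].astype(int)
    return (g > r + 40) & (g > b + 40)  # margin of 40 is an assumption

frame = np.zeros((2, 2, 3), dtype=np.uint8)
frame[0, 0] = [30, 200, 40]    # green-screen pixel
frame[1, 1] = [200, 60, 50]    # foreground pixel
background = green_screen_mask(frame)  # boolean mask: True = background
```

This works only because the background color is controlled; scenes without a green screen are exactly where learned models outperform hand-built rules.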

Machine learning approaches to image segmentation train models to identify which features of an image are important, rather than designing bespoke heuristics by hand. Deep neural networks have proven particularly adept at performing segmentation tasks. For a gentle, general introduction to CNNs, check out this overview. The rest of this section will cover a number of neural network architectures and training techniques that have proven especially successful for segmentation.

Basic structure

Though neural networks used for image segmentation may differ slightly in implementation, most have a similar basic structure:

Image Segmentation Model Architecture

  • The encoder: A set of layers that extract features of an image through a sequence of progressively narrower and deeper filters. Oftentimes, the encoder is pre-trained on a different task (like image recognition), where it learns statistical correlations from many images and may transfer that knowledge for the purposes of segmentation. You can read more on transfer learning in general here.
  • The decoder: A set of layers that progressively grows the output of the encoder into a segmentation mask resembling the pixel resolution of the input image.
  • Skip connections: Long range connections in the neural network that allow the model to draw on features at varying spatial scales to improve model accuracy.
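The three pieces above can be sketched in plain NumPy, using mean pooling as a stand-in for the encoder’s downscaling, nearest-neighbor upsampling for the decoder, and channel concatenation for a skip connection (all shapes and operations here are illustrative, not a real trained network):

```python
import numpy as np

def encode(features: np.ndarray) -> np.ndarray:
    """Encoder block stand-in: halve spatial resolution via 2x2 mean pooling."""
    h, w, c = features.shape
    return features.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def decode(features: np.ndarray) -> np.ndarray:
    """Decoder block stand-in: double spatial resolution (nearest neighbor)."""
    return features.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.rand(8, 8, 3)                # input "image": 8x8, 3 channels
enc = encode(x)                            # 4x4 features (narrower; deeper in a real net)
dec = decode(enc)                          # decoded back to 8x8
skip = np.concatenate([dec, x], axis=-1)   # skip connection: fuse fine detail back in
```

In a real network the pooling and upsampling are interleaved with learned convolutions, but the shape bookkeeping is the same.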

Model architecture overview

There are too many specific neural network architectures to cover them all here, but we’ll highlight a few robust, reliable ones that make good places to start.

U-Net

The U-Net model architecture derives its name from its U shape. The encoder and decoder work together to extract salient features of an image and then use those features to determine which label should be given to each pixel. While the features a model might look at to perform this task aren’t always interpretable by humans, you can think of them as statistical correlations between pixels. In the case of a U-Net model, the encoder is made up of blocks that downscale an image into narrower feature layers while the decoder mirrors those blocks in the opposite direction, upscaling outputs to the original image size and ultimately predicting a label for each pixel. Skip connections cut across the U to improve performance.

Pyramid structures

Deep learning models often struggle to segment objects that may appear at different spatial scales in an image. For example, a person may be small if walking down a long street, but large in a portrait. To provide good accuracy in all cases, pyramid structures use multiple encoders in parallel to process an image at different spatial scales.

One encoder branch takes a full resolution image as input, another takes an image scaled down by 50% as input, and so on. The outputs of each encoder are then combined and processed by the decoder to produce the segmentation mask. Sometimes, the model is trained by optimizing for accuracy at each scale individually, even if only the full resolution mask is used during inference.
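The parallel-branch idea can be sketched as follows (the downscale factors and the averaging merge step are assumptions for illustration; real pyramid models learn how to fuse the scales):

```python
import numpy as np

def downscale(image: np.ndarray, factor: int) -> np.ndarray:
    """Shrink a grayscale image by mean pooling."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upscale(features: np.ndarray, factor: int) -> np.ndarray:
    """Grow features back to full resolution (nearest neighbor)."""
    return features.repeat(factor, axis=0).repeat(factor, axis=1)

def encoder_branch(image: np.ndarray) -> np.ndarray:
    """Stand-in for a learned encoder; here just the identity."""
    return image

image = np.random.rand(16, 16)
branches = []
for factor in (1, 2, 4):                   # full, 50%, and 25% resolution
    scaled = downscale(image, factor) if factor > 1 else image
    features = encoder_branch(scaled)
    branches.append(upscale(features, factor) if factor > 1 else features)

fused = np.stack(branches).mean(axis=0)    # combined input for the decoder
```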

Mask R-CNN and DeepLab

Mask R-CNN and DeepLab are popular architectures designed by Facebook and Google, respectively, that combine many of the structures covered above. They use encoders, decoders, and skip connections to produce high quality segmentation masks of many objects. Pre-trained, open-source implementations of these models can be found online and are a great place to start if speed is your priority.

How image segmentation works on the edge

If your use case requires that image segmentation work in real-time, without internet connectivity, or on private data, you might be considering running your image segmentation model directly on an edge device like a mobile phone or IoT board.

In those cases, you’ll need to choose specific model architectures to make sure everything runs smoothly on these lower power devices. Here are a few tips and tricks to ensure your models are ready for edge deployment:

  • Use MobileNet-based architectures for your encoder. This architecture makes use of layer types like depthwise separable convolutions that require fewer parameters and less computation while still providing solid accuracy.
  • Add a width multiplier to your model so you can adjust the number of parameters in your network to meet your computation and memory constraints. The number of filters in a convolution layer, for example, greatly impacts the overall size of your model. Many papers and open-source implementations will treat this number as a fixed constant, but most of these models were never intended for mobile use. Adding a parameter that multiplies the base number of filters by a constant fraction allows you to modulate the model architecture to fit the constraints of your device. For some tasks, you can create much, much smaller networks that perform just as well as large ones.
  • Shrink models with quantization, but beware of accuracy drops. Quantizing model weights can save a bunch of space, often reducing the size of a model by a factor of 4 or more. However, accuracy will suffer. Make sure you test quantized models rigorously to determine if they meet your needs.
  • Input and output sizes can be smaller than you think! If you’re designing a photo editing app, it’s tempting to think that your image segmentation model needs to be able to accept full resolution photos as an input. In most cases, edge devices won’t have nearly enough processing power to handle this. Instead, it’s common to train segmentation models to produce segmentation masks at modest resolutions, then upscale them with traditional image resizing techniques. For example, a segmentation mask with a VGA resolution (640x480) can produce very nice results even when scaled up to full HD resolution.
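A few of these tips can be made concrete with NumPy (the base filter counts, the multiplier value, and the quantization scheme below are illustrative assumptions, not any particular library’s API):

```python
import numpy as np

# Tip 2: a width multiplier scales every layer's base filter count.
base_filters = [32, 64, 128, 256]
alpha = 0.5  # hypothetical width multiplier
filters = [max(8, int(f * alpha)) for f in base_filters]

# Tip 3: quantizing float32 weights to uint8 cuts model size by ~4x.
weights = np.random.randn(1000).astype(np.float32)
zero_point = weights.min()
scale = (weights.max() - zero_point) / 255.0
quantized = np.round((weights - zero_point) / scale).astype(np.uint8)
size_ratio = weights.nbytes / quantized.nbytes

# Tip 4: predict a low-resolution mask, then upscale with nearest neighbor.
mask_vga = np.random.randint(0, 5, size=(480, 640))   # class ids at VGA
rows = (np.arange(1080) * 480 // 1080)
cols = (np.arange(1920) * 640 // 1920)
mask_hd = mask_vga[rows[:, None], cols[None, :]]      # full HD mask
```

Testing the quantized and upscaled outputs against your accuracy targets is still essential; these transformations trade fidelity for speed and size.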


Part 3: Use cases and applications

In this section, we’ll cover some real-world use cases for image segmentation. We’ve mentioned several of them in previous sections, but here we’ll dive a bit deeper and explore the impact this computer vision technique can have across industries.

We’ll examine how image segmentation is used in the following areas, moving from the most common and achievable use cases to advanced ones that are harder to implement but no less exciting:

  • Creativity tools
  • Robotics
  • Medical imaging and diagnostics
  • Autonomous vehicles

Creativity tools

While image segmentation greatly assists in powering transformative technologies like automated medical imaging/diagnostics, self-driving cars, and other robotics, it can also augment consumer-facing creativity tools – think photo and video editors, content creation platforms, and more.

The more we can teach our devices to understand and interpret images, the more we can manipulate them. Image segmentation represents one major technique for augmenting visual content in creative ways.

When we create pixel-level maps and masks of objects in an image, we isolate these regions and separate them out from one another. This creates opportunities to implement targeted image effects on specific areas of a given image:

  • Blur an image’s background to sharpen the focus on the foreground
  • Add an augmented reality object to a particular person or object in an image
  • Cut out specific regions of an image to create “stickers”
  • Create a green screen effect
  • Overlay artistic filters on a specific object
  • Develop “try-on” experiences that allow users to sample different products they might be interested in buying (e.g., cosmetics, shoes, clothes, etc.)

These are a few possible ways image segmentation might assist in the development of unique and innovative creativity tools.

Robotics

In order to more efficiently perform their tasks – whatever they might be – robots need to map and interpret the scenes in which they’re working. The pixel-level understanding that image segmentation provides can help robots of all kinds better navigate their workspaces.

A number of areas of robotics can benefit from image segmentation, from service robotics and industrial robotics, to agricultural robotics, and beyond.

For more detailed information about the use of image segmentation in robotics, check out this paper:

Visual Segmentation of “Simple” Objects for Robots (PDF)

Medical imaging and diagnostics

Image segmentation can be a powerful technique in the initial steps of a diagnostic and treatment pipeline for many conditions that require medical images, such as CT or MRI scans.

Essentially, segmentation can effectively separate homogeneous areas that may include particularly important pixels of organs, lesions, etc. However, there are significant challenges, including low contrast (see part 2), noise, and various other imaging ambiguities.

As such, numerous segmentation techniques have been adapted specifically for medical image processing. While these techniques and the technologies underlying them are beyond the scope of this overview, here are a few valuable resources if you’d like to dive deeper into this use case:

Autonomous vehicles

I would hope you’re not driving a vehicle right now (if you are, bookmark this guide — it’ll be here when you’ve safely arrived at your destination), but let’s imagine you are. No matter where you’re driving, there’s a lot to pay attention to – the road, other vehicles, pedestrians, sidewalks, and (potentially) a plethora of other obstacles and safety hazards.

If you’ve been driving for a long time, noticing and reacting to this environment might seem automatic or like second nature. But if we consider a self-driving car, we quickly realize that this car needs to see, interpret, and respond to a scene in real-time. This means the camera system in this vehicle would need to create a pixel-level map of the world in order to safely and efficiently navigate it.

Even though the field of autonomous vehicles is much more complicated than performing image segmentation, this pixel-level understanding is a crucial component in making these vehicles a reality.

To help illustrate how this works and looks in action, here’s an implementation video.


Part 4: Resources for image segmentation

We hope the above overview was helpful in understanding the basics of image segmentation and how it can be used in the real world. But as with all things, more answers lead to more questions.

This final section will provide a series of organized resources to help you take the next step in learning all there is to know about image segmentation.

In the interest of keeping this list relatively accessible, we’ve curated our top resources for each of the following areas:

  • Getting started
  • Tutorials
  • Literature review
  • Available datasets
Getting started

Tutorials

Literature review

Datasets available

The above datasets represent some of the most popular and widely-used datasets for image segmentation. For a closer look at a wider range of datasets, including some that are more task-specific, check out these detailed lists for:

Image segmentation on mobile

The benefits of using image segmentation aren’t limited to applications that run on servers or in the cloud.

In fact, segmentation models can be made small and fast enough to run directly on mobile devices, opening up a wide range of possibilities: consumer try-on applications, powerful photo and video editors, content creation tools, augmented reality experiences, and more.

From brand loyalty, to user engagement and retention, and beyond, implementing image segmentation on-device has the potential to delight users in new and lasting ways, all while reducing cloud costs and keeping user data private.

Fritz AI is the machine learning platform that makes it easy to teach devices how to see, hear, sense, and think. To learn more about how Fritz AI can help you build green screen effects, portrait modes, and more, check out our Image Segmentation API.

For more inspiration, check out our tutorials for building a portrait mode on iOS and for smart background replacement on Android.
