AutoFocusFormer: Image Segmentation off the Grid
Authors: Chen Ziwen, Kaushik Patnaik, Shuangfei Zhai, Alvin Wan, Zhile Ren, Alex Schwing, Alex Colburn, Li Fuxin
Real-world images often have highly imbalanced content density. Some areas are very uniform, e.g., large patches of blue sky, while others are scattered with many small objects. Yet the successive grid downsampling commonly used in convolutional deep networks treats all areas equally, so small objects end up represented at very few spatial locations, leading to worse results in tasks such as segmentation. Intuitively, retaining more of the pixels that represent small objects during downsampling helps preserve important information. To achieve this, we propose AutoFocusFormer (AFF), a local-attention transformer image recognition backbone that performs adaptive downsampling by learning to retain the pixels most important for the task. Since adaptive downsampling generates a set of pixels irregularly distributed on the image plane, we abandon the classic grid structure. Instead, we develop a novel point-based local attention block, facilitated by a balanced clustering module and a learnable neighborhood merging module, which yields representations for our point-based versions of state-of-the-art segmentation heads. Experiments show that AFF improves significantly over baseline models of similar sizes.
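To make the adaptive-downsampling idea concrete, here is a minimal PyTorch sketch of score-based token selection: a small scoring head ranks tokens and only the top fraction survives, leaving an irregular point set rather than a grid. The `AdaptiveDownsample` module, its linear score head, and the `keep_ratio` parameter are illustrative assumptions, not AFF's actual learnable neighborhood merging module.

```python
import torch
import torch.nn as nn

class AdaptiveDownsample(nn.Module):
    """Score-based token selection: keep the k highest-scoring tokens
    instead of a regular grid stride (illustrative sketch only)."""

    def __init__(self, dim, keep_ratio=0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(dim, 1)  # hypothetical importance head

    def forward(self, feats, pos):
        # feats: (B, N, C) token features; pos: (B, N, 2) 2-D pixel positions
        B, N, _ = feats.shape
        k = max(1, int(N * self.keep_ratio))
        s = self.score(feats).squeeze(-1)              # (B, N) importance scores
        idx = s.topk(k, dim=1).indices                 # retained token indices
        batch = torch.arange(B, device=feats.device).unsqueeze(1)
        # Gating by the (sigmoid) score keeps the selection differentiable
        # with respect to the scoring head.
        kept = feats[batch, idx] * torch.sigmoid(s[batch, idx]).unsqueeze(-1)
        return kept, pos[batch, idx]                   # an irregular point set

# Usage: downsample 196 tokens to the 49 most important ones.
feats = torch.randn(2, 196, 64)
pos = torch.rand(2, 196, 2)
ds = AdaptiveDownsample(dim=64)
kept_feats, kept_pos = ds(feats, pos)  # (2, 49, 64), (2, 49, 2)
```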
In 2022, we launched a new systemwide capability that allows users to automatically and instantly lift the subject from an image, or to isolate the subject by removing the background. This feature is integrated across iOS, macOS, and iPadOS, and is accessible in several apps such as Photos, Preview, Safari, and Keynote. Underlying this feature is an on-device deep neural network that performs real-time salient object segmentation, categorizing each pixel of an image as part of either the foreground or the background. Each pixel is assigned a score denoting how likely it is to be part of the foreground. While prior methods often restrict this process to a fixed set of semantic categories (such as people and pets), we designed our model to be unrestricted and to generalize to arbitrary classes of subjects (for example, furniture, apparel, and collectibles), including ones it hasn't encountered during training. While this is an active area of research in computer vision, many unique challenges arise when considering this problem within the constraints of a product ready to be used by consumers. This year, we are launching Live Stickers in iOS and iPadOS, as seen in Figure 1; both static and animated sticker creation are built on the technology discussed in this article. In the following sections, we'll explore some of these challenges and how we approached them.
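As a rough illustration of the per-pixel scoring described above, the sketch below thresholds a foreground score map into a binary mask and uses it as an alpha channel to cut the subject out. The `lift_subject` function, its `threshold` default, and the assumption that a network has already produced `scores` in [0, 1] are all hypothetical; this is not the production pipeline.

```python
import torch

def lift_subject(scores, image, threshold=0.5):
    """Cut out the subject given per-pixel foreground scores.
    scores: (H, W) in [0, 1]; image: (3, H, W). Returns a (4, H, W) RGBA
    tensor with the background made fully transparent. (Hypothetical helper.)"""
    mask = (scores >= threshold).to(image.dtype)     # binary foreground mask
    alpha = mask.unsqueeze(0)                        # (1, H, W) alpha channel
    return torch.cat([image * alpha, alpha], dim=0)  # premultiplied RGBA

# Usage with a placeholder score map standing in for a real network's output.
scores = torch.rand(128, 128)
image = torch.rand(3, 128, 128)
cutout = lift_subject(scores, image)  # (4, 128, 128)
```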
Most successful examples of neural nets today are trained with supervision. However, to achieve high accuracy, the training sets need to be large, diverse, and accurately annotated, which is costly. An alternative to labeling huge amounts of data is to use synthetic images from a simulator. This is cheap because there is no labeling cost, but the synthetic images may not be realistic enough, resulting in poor generalization on real test images. To help close this performance gap, we've developed a method for refining synthetic images to make them look more realistic. We show that training models on these refined images leads to significant improvements in accuracy on various machine learning tasks.
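The sketch below shows the overall training pattern under stated assumptions: a `refiner` network (a placeholder here; the text does not specify its architecture or how it is trained) maps simulator output toward the real-image distribution, and a downstream `task_model` is then trained on the refined images using the labels that come free with synthesis. All module names, shapes, and data are illustrative.

```python
import torch
import torch.nn as nn

# Placeholder refiner: maps synthetic images toward the real distribution.
# Its architecture and training procedure are assumptions, not the method
# described above.
refiner = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
task_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.Adam(task_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

synthetic = torch.rand(8, 3, 32, 32)   # simulator output (placeholder data)
labels = torch.randint(0, 10, (8,))    # annotations come free with synthesis

with torch.no_grad():                  # refine images before supervised training
    refined = refiner(synthetic)
loss = loss_fn(task_model(refined), labels)
opt.zero_grad()
loss.backward()
opt.step()
```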