AutoToon: Automatic Geometric Warping for Face Cartoon Generation — Supplementary Material

In this supplementary material, we provide additional details of the model architecture to extend Figure 2, further motivation for the warping field size discussed in Section 4.2, data pre-processing details and examples, examples of the obtained artist warps, and additional qualitative comparisons and results to extend Figures 4 and 10.

Table of Contents

  1. Data Pre-Processing and Model Architecture Details
  2. Additional Qualitative Results
  3. Ground Truth Artist Warps (Obtained from Gradient Descent Optimization)

1. Data Pre-Processing and Model Architecture Details

In this section, we discuss the details of our dataset and model architecture.

Data Pre-Processing

To pre-process the data, facial images were aligned and centered according to the FFHQ dataset alignment procedure, yielding images of 1000 x 1000 pixels. Images were padded using edge padding rather than zero-padding to provide smoother background content. The images were then cropped to the center 700 x 700 pixels for a tighter crop around the head, and resized to 256 x 256 for input to the network. Each image batch was normalized to the per-channel mean and variance of the training dataset. Random horizontal flipping (with probability 0.5) and color jittering (as detailed in Section 5.1) were also applied.
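For concreteness, the following is a minimal sketch of this pipeline using torchvision; the normalization statistics and jitter strengths are placeholders (the actual values are the training set statistics and the Section 5.1 settings), and the file name is hypothetical.

    import torchvision.transforms as T
    from PIL import Image

    # Placeholder statistics: substitute the training set's per-channel
    # mean and standard deviation.
    TRAIN_MEAN = (0.5, 0.5, 0.5)
    TRAIN_STD = (0.5, 0.5, 0.5)

    train_transform = T.Compose([
        T.CenterCrop(700),              # tighter crop around the head
        T.Resize(256),                  # network input resolution
        T.RandomHorizontalFlip(p=0.5),
        # Jitter strengths are illustrative; see Section 5.1 for actual values.
        T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
        T.ToTensor(),
        T.Normalize(TRAIN_MEAN, TRAIN_STD),
    ])

    # 'aligned_face.png' stands for a 1000 x 1000 FFHQ-aligned, edge-padded image.
    img = Image.open('aligned_face.png').convert('RGB')
    x = train_transform(img)  # (3, 256, 256) normalized tensor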

Model Architecture

Here, we extend Figure 2 to provide more details on the specific implementation of the model architecture, in particular the Perceiver Network. Below, we show the architecture of the Perceiver Network, a truncated Squeeze-and-Excitation network (SENet-50) [13], as detailed in Section 4.2.

[Figure: architecture of the Perceiver Network, a truncated SENet-50 [13]]

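As a rough sketch of the idea (not the paper's exact configuration), a classification backbone can be truncated before its global pooling stage so that a 256 x 256 input yields a 32 x 32 feature map, with a small head producing the two-channel warping field. Below, torchvision's ResNet-50 stands in for SENet-50; the omitted SE blocks, the cut point, and the head are all assumptions.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50  # stand-in; SE blocks omitted

    class PerceiverSketch(nn.Module):
        """Truncated backbone mapping a 256 x 256 image to a 32 x 32 x 2 warp.

        Illustrative only: the paper uses a truncated SENet-50 [13]; the
        cut point and 1 x 1 head here are assumptions.
        """
        def __init__(self):
            super().__init__()
            backbone = resnet50(weights=None)
            # conv1 through layer2 has a total stride of 8, so 256 -> 32.
            self.features = nn.Sequential(*list(backbone.children())[:6])
            self.head = nn.Conv2d(512, 2, kernel_size=1)  # (dx, dy) channels

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.head(self.features(x))  # (N, 2, 32, 32)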
We also provide the motivation for using 32 x 32 warping fields as opposed to 16 x 16 warping fields, as mentioned in Section 4.2: the 16 x 16 warping field produced less exaggerated cartoons than the 32 x 32 warping field, which was our final choice. Photos 1 and 2 by Pirátská strana and Jacob Seedenburg.


[Figure: validation-set input images with the corresponding 16 x 16 and 32 x 32 warping-field results]
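For reference, the following is a minimal sketch (assuming PyTorch; apply_warp is a hypothetical helper, not the paper's code) of how a coarse warping field of either size can be upsampled to image resolution and applied with differentiable grid sampling:

    import torch
    import torch.nn.functional as F

    def apply_warp(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        """Warp image (N, C, H, W) by a coarse flow field (N, 2, h, w).

        The coarse field (e.g., 16 x 16 or 32 x 32) is bilinearly upsampled
        to full resolution and added to an identity sampling grid. Flow
        values are assumed to be offsets in normalized [-1, 1] coordinates.
        """
        n, _, H, W = image.shape
        up = F.interpolate(flow, size=(H, W), mode='bilinear',
                           align_corners=True)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=image.device),
                                torch.linspace(-1, 1, W, device=image.device),
                                indexing='ij')
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, H, W, 2)
        grid = grid + up.permute(0, 2, 3, 1)  # add offsets to identity grid
        return F.grid_sample(image, grid, mode='bilinear',
                             padding_mode='border', align_corners=True)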


2. Additional Qualitative Results

In this section, we present additional qualitative results, extending Figure 4.

We compare our generated cartoons, stylized images, and warping fields to those of WarpGAN [28]. The first six images are from the validation set; all other images are from the test set. As discussed in Figure 4, our network successfully disentangles geometry and style. Photos 2-7 below are by Possible, Si1very, Frédéric de Villamil, Pirátská strana, Jacob Seedenburg, and Robby Schulze, respectively.

Comparison to WarpGAN [28]

[Figure: columns show the test-set input image; WarpGAN [28] warp only, style only, output, and warping field; ours (warp only); our warp + CartoonGAN [6]; and our warping field]

Overlaid Warping Fields

Below, we visualize our learned warping fields by superimposing them onto the generated warped cartoon images. We show the original image, followed by the cartoon with the warping field superimposed, and then the generated cartoon alone.
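One simple way to produce such an overlay (a sketch assuming matplotlib and a pixel-unit flow convention; overlay_warp is a hypothetical helper):

    import matplotlib.pyplot as plt
    import numpy as np

    def overlay_warp(cartoon: np.ndarray, flow: np.ndarray) -> None:
        """Superimpose a coarse flow field on the warped cartoon image.

        cartoon is (H, W, 3) in [0, 1]; flow is (h, w, 2) and is assumed
        here to hold per-cell displacements in image-pixel units.
        """
        H, W = cartoon.shape[:2]
        h, w = flow.shape[:2]
        # One arrow at the image location of each coarse grid cell.
        ys, xs = np.mgrid[0:h, 0:w]
        px = (xs + 0.5) * (W / w)
        py = (ys + 0.5) * (H / h)
        plt.imshow(cartoon)
        # angles='xy' draws arrows in data coordinates; with imshow the
        # y-axis points down, matching a downward-positive dy convention.
        plt.quiver(px, py, flow[..., 0], flow[..., 1],
                   color='cyan', angles='xy', scale_units='xy', scale=1)
        plt.axis('off')
        plt.show()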

3. Ground Truth Artist Warps (Obtained from Gradient Descent Optimization)

To our knowledge, we are the first to extract ground truth artist warps from artist cartoons to supervise training for cartoon generation; previous works did not use artist warps directly in training. In our method, these warps were used directly in a warping field loss during training, the importance of which can be seen in the ablation study in Figure 5.

The first image in each triplet is the original input image, followed by the ground truth artist warp, and finally by the result of applying the optimized 32 x 32 x 2 artist warp we obtained (as described in Section 4.1) to the original input image. The warping fields were obtained via the optimization specified in Section 4.1, using the differentiable warping module from Spatial Transformer Networks [17] for 50,000 iterations with a learning rate of 1. The large majority of the images obtained from the artist warps were almost indistinguishable from the ground truth images, and only a few suffered minor artifacts. For the artifacts that did create some distortion, we found that the effect was not very significant due to the other sources of supervision used in conjunction with this loss. All images below were in the Public Domain at the time of dataset collection.
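The following is a minimal sketch of this optimization (assuming PyTorch, whose grid_sample implements the Spatial Transformer bilinear sampler [17]; fit_artist_warp is a hypothetical helper, and the photometric L1 objective is an illustrative assumption standing in for the Section 4.1 objective):

    import torch
    import torch.nn.functional as F

    def fit_artist_warp(photo: torch.Tensor, artist: torch.Tensor,
                        iters: int = 50_000, lr: float = 1.0,
                        size: int = 32) -> torch.Tensor:
        """Fit a coarse (1, 2, size, size) flow so that warping photo
        approximates the artist cartoon; both inputs are (1, 3, H, W).
        """
        n, _, H, W = photo.shape
        flow = torch.zeros(n, 2, size, size, requires_grad=True)
        opt = torch.optim.SGD([flow], lr=lr)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing='ij')
        base = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # identity grid
        for _ in range(iters):
            up = F.interpolate(flow, size=(H, W), mode='bilinear',
                               align_corners=True).permute(0, 2, 3, 1)
            warped = F.grid_sample(photo, base + up, mode='bilinear',
                                   padding_mode='border', align_corners=True)
            loss = F.l1_loss(warped, artist)  # assumption: see Section 4.1
            opt.zero_grad()
            loss.backward()
            opt.step()
        return flow.detach()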

[Figure: triplets of input image, ground truth artist warp, and our fitted warp result]