In this supplementary material, we provide further details of the model architecture to extend Figure 2, additional motivation for the warping field size in Section 4.2, data preprocessing details and examples, examples of the obtained artist warps, and additional qualitative comparisons and results to extend Figures 4 and 10. This material is best viewed using Chrome or Safari on a larger screen.
In this section, we discuss the details of our dataset and model architecture.
To pre-process the data, facial images were aligned and centered according to the FFHQ dataset alignment procedure, resulting in images of 1000 x 1000 pixels. Images were padded using 'edge' padding rather than zero-padding to provide smoother background content. The images were then cropped to the center 700 x 700 pixels for a tighter crop around the head, and resized to 256 x 256 for input to the network. Each image batch was normalized using the per-channel mean and variance of the training dataset. Random horizontal flipping (probability 0.5) and color jittering (as detailed in Section 5.1) were also applied.
Here, we extend Figure 2 to provide more details on the specific implementation of the model architecture, and in particular the Perceiver Network. Below, we show the architecture of the Perceiver Network, which is the truncated Squeeze-and-Excitation Network [50] detailed in Section 4.2.
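For reference, the core building block of a Squeeze-and-Excitation Network can be sketched as below. This is the generic SE block from Hu et al., not the paper's exact truncated architecture; the reduction ratio of 16 is the standard default, assumed here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Generic Squeeze-and-Excitation block: a global-average-pool
    'squeeze', a two-layer bottleneck 'excitation', and channel-wise
    rescaling of the input feature map."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # rescale each channel by its learned gate
```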
We also provide the motivation for using 32 x 32 warps as opposed to 16 x 16 warps, as mentioned in Section 4.2. The 16 x 16 warping field produced less exaggerated cartoons than the 32 x 32 warping field, which was therefore our final choice. Photos 1 and 2 by Pirátská strana and Jacob Seedenburg.
In this section, we present additional qualitative results, extending Figure 4.
We compare our generated cartoons, stylized images, and warping fields to those of WarpGAN. The first six images are from the validation set; all other images are from the test set. As discussed in Figure 4, our network successfully disentangles geometry and style. Photos 2, 3, 4, 5, 6, 7 below by Possible, Si1very, Frédéric de Villamil, Pirátská strana, Jacob Seedenburg, and Robby Schulze, respectively.
Below, we visualize our learned warping fields by superimposing them onto the generated warped cartoon images. We show the original image, followed by this superimposed image, and then the generated cartoon.
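One simple way to superimpose a coarse warping field onto an image, sketched below under assumptions we make explicit: the field is taken to store normalized target coordinates in [-1, 1] (the Spatial Transformer convention) with shape (H_w, W_w, 2), and the warped grid positions are marked in red. The function name and the marker style are ours, not the paper's.

```python
import numpy as np

def overlay_warp_grid(image: np.ndarray, warp: np.ndarray) -> np.ndarray:
    """Mark the warped grid positions of a coarse warping field on an image.

    image: (H, W, 3) uint8 array.
    warp:  (H_w, W_w, 2) array of normalized (x, y) coordinates in [-1, 1]
           (an assumed layout, following the grid_sample convention).
    """
    out = image.copy()
    h, w = image.shape[:2]
    # Map normalized coordinates to pixel indices and clamp to the image.
    xs = ((warp[..., 0] + 1) * 0.5 * (w - 1)).astype(int).clip(0, w - 1)
    ys = ((warp[..., 1] + 1) * 0.5 * (h - 1)).astype(int).clip(0, h - 1)
    out[ys, xs] = [255, 0, 0]  # paint each warped grid point red
    return out
```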
To our knowledge, we are the first to extract ground truth artist warps from artist cartoons for supervision in training for cartoon generation. Previous works did not directly use artist warps in training. In our method, these warps were directly used for a warping field loss during training, the importance of which can be seen in the ablation study in Figure 5.
The first image in each triplet is the original input image, followed by the ground truth artist warp, and finally the visualization created by applying the optimized 32 x 32 x 2 artist warp (obtained as described in Section 4.1) to the original input image. The warping fields were obtained via the optimization specified in Section 4.1, using the differentiable warping module from Spatial Transformer Networks for 50,000 iterations with a learning rate of 1. A large majority of the images obtained from the artist warps were almost indistinguishable from the ground truth images, and only a few suffered minor artifacts. For the artifacts that did create some distortion, we found that the effect was not significant due to the other sources of supervision used in conjunction with this loss. All images below, at time of dataset collection, were in the Public Domain.
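The optimization of an artist warp can be sketched in PyTorch as follows. The iteration count and learning rate match the text above; the L2 photometric objective, the SGD optimizer, the bilinear upsampling of the coarse field, and the function name are all our assumptions rather than the paper's specification.

```python
import torch
import torch.nn.functional as F

def fit_artist_warp(photo: torch.Tensor, cartoon: torch.Tensor,
                    size: int = 32, iters: int = 50_000,
                    lr: float = 1.0) -> torch.Tensor:
    """Optimize a coarse warp field so that warping `photo` matches the
    artist's `cartoon` (both (1, 3, H, W) tensors in [0, 1]).

    Sketch of the Section 4.1 procedure: the differentiable warp is
    torch's grid_sample (the Spatial Transformer warping module); the
    photometric loss and optimizer choice are illustrative assumptions.
    """
    h, w = photo.shape[-2:]
    # Start from the identity warp on a size x size grid of normalized coords.
    base = F.affine_grid(torch.eye(2, 3).unsqueeze(0),
                         (1, 2, size, size), align_corners=False)
    offset = torch.zeros_like(base, requires_grad=True)
    opt = torch.optim.SGD([offset], lr=lr)
    for _ in range(iters):
        # Upsample the coarse field to full resolution, then warp the
        # photo differentiably through it.
        grid = F.interpolate((base + offset).permute(0, 3, 1, 2),
                             size=(h, w), mode="bilinear",
                             align_corners=False).permute(0, 2, 3, 1)
        warped = F.grid_sample(photo, grid, align_corners=False)
        loss = F.mse_loss(warped, cartoon)  # assumed L2 photometric loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return base + offset  # the size x size x 2 artist warp
```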