# Towards Metamerism via Foveated Style Transfer

Published as a conference paper at ICLR 2019

Arturo Deza¹,⁴, Aditya Jonnalagadda³, Miguel P. Eckstein¹,²,⁴
¹Dynamical Neuroscience, ²Psychological and Brain Sciences, ³Electrical and Computer Engineering, ⁴Institute for Collaborative Biotechnologies
UC Santa Barbara, CA, USA
deza@dyns.ucsb.edu, aditya_jonnalagada@ece.ucsb.edu, eckstein@psych.ucsb.edu

The problem of visual metamerism is defined as finding a family of perceptually indistinguishable, yet physically different images. In this paper, we propose our NeuroFovea metamer model, a foveated generative model based on a mixture of peripheral representations and style transfer forward-pass algorithms. Our gradient-descent free model is parametrized by a foveated VGG19 encoder-decoder, which allows us to encode images in a high-dimensional space and interpolate between content and texture information with adaptive instance normalization anywhere in the visual field. Our contributions include: 1) a framework for computing metamers that resembles a noisy communication system via a foveated feed-forward encoder-decoder network, in which we observe that metamerism arises as a byproduct of noisy perturbations that partially lie in the perceptual null space; 2) a perceptual optimization scheme as a solution to the hyperparametric nature of our metamer model, which requires tuning the image-texture tradeoff coefficients everywhere in the visual field as a consequence of internal noise; 3) an ABX psychophysical evaluation of our metamers, in which we also find that the rate of growth of the receptive fields in our model matches V1 for reference metamers and V2 between synthesized samples. Our model renders metamers in roughly a second, a 1000× speed-up over previous work, which allows for tractable data-driven metamer experiments.

## 1 Introduction

The history of metamers originally started through color matching theory, where two light sources were used to match a test light's wavelength until both light sources were indistinguishable from each other, producing what is called a color metamer. This leads to the definition of visual metamerism: when two physically different stimuli produce the same perceptual response (see Figure 1 for an example). Motivated by Balas et al. (2009)'s work on local texture matching in the periphery as a mechanism that explains visual crowding, Freeman & Simoncelli (2011) were the first to create such point-of-fixation driven metamers through local texture matching models that tile the entire visual field with log-polar pooling regions simulating the V1 and V2 receptive field sizes, while also matching global image statistics between the metamer and the original image. The essence of their algorithm is to use gradient descent to match the local texture (Portilla & Simoncelli (2000)) and image statistics of the original image throughout the visual field, given a point of fixation, until convergence, thus producing two images that are perceptually indistinguishable from each other. However, metamerism research currently faces two main limitations. The first is that metamer rendering has no unique solution.
Consider the potentially trivial examples of having an image I and its metamer M where all pixel values are identical except for one that is set to zero (making the difference unnoticeable), or the case where the metameric response arises from an imperceptible, equal perturbation across all pixels, as suggested in Johnson et al. (2016); Freeman & Simoncelli (2011). This is a concept similar to Just Noticeable Differences (Lubin (1997); Daly (1992)). However, like the work of Freeman & Simoncelli (2011); Keshvari & Rosenholtz (2016); Rosenholtz et al. (2012); Balas et al. (2009), we are interested in creating point-of-fixation driven metamers: images that preserve information in the fovea yet lose spatial information in the periphery, such that this loss is unnoticeable contingent on a point of fixation (Figure 1). The second issue is that the current state of the art for a full field-of-view rendering of a 512 × 512 px metamer takes 6 hours for a grayscale image and roughly a day for a color image. This computational constraint makes data-driven experiments intractable if they require thousands of metamers.

Figure 1: Two visual metamers are physically different images that, when fixated on the orange dot (center), should remain perceptually indistinguishable from each other for an observer. Colored circles highlight different distortions in the visual field that observers do not perceive in our model.

From a practical perspective, creating metamers that are quick to compute may lead to computational efficiency in the rendering of foveated VR displays and to novel neuroscience experiments that require metameric stimuli, such as gaze-contingent displays, or metameric videos for fMRI, EEG, or eye-tracking. We think there is a way to capitalize on metamer understanding and rendering given the developments made in the field of style transfer. We know that the original model of Freeman & Simoncelli (2011) consists of a local texture matching procedure over multiple pooling regions in the visual field, together with global image content matching. If we can find a way to perform localized style transfer with proper texture statistics for all the pooling regions in the visual field, and if the metamerism-via-texture-matching hypothesis is correct, then in theory we can successfully render a metamer. Within the context of style transfer, we would want a complete and flexible framework in which a single network can encode any style (or texture) without the need to re-train, and which performs style transfer in a single forward pass, thus enabling real-time applications. Furthermore, we would want such a framework to also control for spatial and scale factors (Gatys et al. (2017)) to enable foveated pooling (Akbas & Eckstein (2017); Deza & Eckstein (2016)), which is critical in metamer rendering. The very recent work of Huang & Belongie (2017) provides such a framework through adaptive instance normalization (AdaIN), where the content image is stylized by adjusting the mean and standard deviation of the channel activations of the encoded representation to match those of the style. They achieve results that rival those of Ulyanov et al. (2016); Johnson et al. (2016), with the added benefit of not being limited to a single texture in a feed-forward pipeline. In our model, we stack a peripheral architecture on top of a VGGNet (Simonyan & Zisserman (2015)) in its encoded feature space, to map an image into a perceptual space.
We then add internal noise in the encoded space of our model, reflecting the fact that perceptual systems are noisy, and we find that inverting such a modified image representation via a decoder results in a metamer. This casts our model as a foveated feed-forward auto style transfer network, in which the input image plays the role of both the content and the style, and internal network noise (stylized with the content statistics) serves as a proxy for intrinsic image texture. While our model uses AdaIN for style transfer and a VGGNet for texture statistics, our pipeline is extendible to other models that successfully execute style transfer and capture proper texture statistics (Ustyuzhaninov et al. (2017)).

## 2 Design of the NeuroFovea model

To construct our metamer we propose the following statement: a metamer M can be rendered by transferring k localized styles over a content image I, controlled by a set of style-to-content ratios α_i for every pooling region (i-th receptive field). More formally, our goal is to find a metamer function M(·): I → M, where an input image I ∈ ℝ^L (which is both the content and the style image) is fed through a VGG-Net encoder E(·): ℝ^L → ℝ^D to produce the content feature C = E(I) ∈ ℝ^D, as shown in Figure 2. Let L = C × H × W and D = C′ × H′ × W′, where {C, C′}, {H, H′}, {W, W′} are the image/layer channels, heights, and widths given the convolutional structure of the encoder (we drop the fully connected layers).

Figure 2: The NeuroFovea metamer generation schematic: an input image and a noise patch are fed through a VGG-Net encoder into a new feature space. Through spatial control we can produce, for each pooling region in that feature space, an interpolation between the stylized noise (texture) and the content (the input image). This is how we successfully impose both global image and local texture-like constraints in every pooling region. The metamer is the output of the pooled (and interpolated) feature vector through the Meta VGG-Net Decoder.

A noise patch N ∼ 𝒩(µ_I, σ²_I) ∈ ℝ^L, colored via ZCA (Bell & Sejnowski (1995)) to match the content image's mean and variance, is also fed through the same VGG-Net encoder, producing the noise feature N = E(N) ∈ ℝ^D. This is the internal perceptual noise of the system, which will later serve as a proxy for texture encoding. These vectors are masked through spatial control à la Gatys et al. (2017), and the noise is stylized with the content via S(·): ℝ^D → ℝ^D, which encodes the texture representation of the content in the feature space through Adaptive Instance Normalization (AdaIN). A target feature T_i ∈ ℝ^D is defined as an interpolation between the stylized noise S(N_i) and the content C_i, modulated by α in the feature space ℝ^D, for every i-th pooling region:

T_i(I|N; α) = (1 − α) C_i(I) + α S(N_i)   (1)

In other words, in our quest to probe for metamerism, we are finding an intermediate representation (the convex combination) between two vectors representing the image and its texturized version (the stylized noise) in ℝ^D, per pooling region, as seen in Figure 3. Within the framework of style transfer, we can think of this as a content-vs-style or structure-vs-texture tradeoff, since the style and the content image are the same.
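To make Eq. 1 concrete, here is a minimal sketch (our own, in NumPy; the tensor shapes and helper names are assumptions for illustration, not the released implementation) of AdaIN followed by the per-region convex combination:

```python
import numpy as np

def adain(noise_feat, content_feat, eps=1e-5):
    """AdaIN (Huang & Belongie, 2017): shift/scale the noise feature
    channel-wise so its statistics match those of the content feature."""
    n_mu = noise_feat.mean(axis=(1, 2), keepdims=True)
    n_sd = noise_feat.std(axis=(1, 2), keepdims=True)
    c_mu = content_feat.mean(axis=(1, 2), keepdims=True)
    c_sd = content_feat.std(axis=(1, 2), keepdims=True)
    return c_sd * (noise_feat - n_mu) / (n_sd + eps) + c_mu

def target_feature(content_feat, noise_feat, alpha):
    """Eq. 1: T_i = (1 - alpha) * C_i + alpha * S(N_i) for one pooling region."""
    return (1.0 - alpha) * content_feat + alpha * adain(noise_feat, content_feat)

# Toy usage with relu4_1-like shapes (C', H', W') = (512, 64, 64).
C = np.random.randn(512, 64, 64)   # encoded content C_i = E(I)
N = np.random.randn(512, 64, 64)   # encoded noise N_i = E(N)
T = target_feature(C, N, alpha=0.5)
```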
Similar interpolations have been explored in Hénaff & Simoncelli (2016) via a joint pixel- and network-space minimization. The final target feature vector T is the masked sum of every T_i with spatial control masks w_i, s.t. T = Σ_i w_i ⊙ T_i. The metamer is the output of the Meta VGG-Net decoder D(·) on T, where the decoder receives only one vector (T) and produces a global decoded output. Our Meta VGG-Net Decoder compensates for small artifacts by stacking a pix2pix (Isola et al. (2017)) U-Net refinement module, trained on the encoder-decoder outputs to map back to the original high-resolution image. Figure 2 fully describes our model, and the metamer transform is computed via:

M(I|N; α⃗) = D(E_P(I|N; α⃗)) = D( Σ_{i=1}^{k} w_i ⊙ [(1 − α_i) E_i(I) + α_i S(E_i(N))] )   (2)

where E_P is the foveated encoder, defined as the sum of encoder outputs over all the k pooling regions (our spatial control masks w_i) in the visual field. Note that the decoder was not trained to generate metamers, but rather to invert the encoded image and act as E⁻¹. It happens to be the case that perturbing the encoded representation in the direction of the stylized noise, by an amount specified by the size of the pooling regions, outputs a metamer. Additional specifications and training details of our model can be found in the Supplementary Material.

Figure 3: Interpolating between an image's intrinsic content and texture via a convex combination in the output of the VGG19 encoder E. Here the patch is treated as a single pooling region. In our model, this interpolation (Eq. 1) is done for every pooling region in the visual field.

### 2.1 Model Interpretability

Within the framework of metamerism, where distortions lie in the perceptual null space as proposed initially in color matching theory and also in Freeman & Simoncelli (2011) for images, we can think of our model as a direct transform that maximizes how much information to discard depending on the texture-like properties of the image and the size of the receptive fields. Consider the following: if our interpolation is projected from the encoded space to the perceptual space via P, from Eq. 1 we get P T_i = P(1 − α) C_i(I) + P α S(N_i), and it follows that for each receptive field:

P T_i [metamer] = P C_i [image] + P α (S∥(N_i) + S⊥(N_i)) [distortion]

by decomposing S(N_i) − C_i = S∥(N_i) + S⊥(N_i), where S∥ is the projection of the difference vector onto the perceptual space, and S⊥(N_i) is the orthogonal component perpendicular to that vector, which lies in the perceptual null space (P S⊥(N_i) = 0). The value of these components will change depending on the locations of C_i and S(N_i) and on the geometry of the encoded space. If ‖S∥(N_i)‖²₂ < ε (i.e., the image patch has strong texture-like properties), then α can vary above its critical value, given that S⊥(N_i) is in the null space of P, and the distortion term will still be small; but if ‖S∥(N_i)‖²₂ > ε, α cannot exceed its critical value for the metamerism condition (P T_i ≈ P C_i) to hold. Thus our interest is in computing the maximal average amount of distortion (driven by α), given human sensitivity, before observers can tell the difference. This is illustrated in Figure 4 via the blue circle around C_i in the perceptual space, which marks the metameric boundary for any distortion.

Figure 4: Perceptual projection from the encoded space to the perceptual space.
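To make the projection argument concrete, the following toy sketch (our own construction; a random linear map P stands in for the perceptual projection) splits the distortion vector S(N_i) − C_i into a component that survives the projection and a component in its null space:

```python
import numpy as np

rng = np.random.default_rng(0)
D, P_DIM = 128, 32                   # toy encoded / perceptual dimensions
P = rng.standard_normal((P_DIM, D))  # stand-in linear perceptual projection

C = rng.standard_normal(D)           # encoded content C_i
S_N = rng.standard_normal(D)         # stylized noise S(N_i)
d = S_N - C                          # distortion direction

# d = S_par + S_perp: S_par lies in the row space of P (perceptible),
# S_perp lies in the null space of P (imperceptible, P @ S_perp = 0).
S_par = np.linalg.pinv(P) @ (P @ d)
S_perp = d - S_par
assert np.allclose(P @ S_perp, 0.0, atol=1e-9)

# The perceived distortion at a given alpha depends only on S_par, so
# alpha can grow larger when ||S_par|| is small (texture-like patches).
alpha = 0.7
print(np.linalg.norm(P @ (alpha * d)) - alpha * np.linalg.norm(P @ S_par))  # ~0
```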
One can also see the resemblance of the model to a noisy communication system in the context of information theory: the information source is the image I; the transmitter and the receiver are the encoder and decoder (E, D), respectively; the noise source is the encoded noise patch E(N), imposing texture distortions in the visual field; and the destination is the metamer M. Highlighting this equivalence is important, as metamerism can also be explored within the context of image compression and rate-distortion theory, as in Ballé et al. (2017). Such approaches are beyond the scope of this paper, but they are worth exploring in future work, as most metamer models purely involve gradient-descent based texture and image analysis-synthesis matching paradigms.

## 3 Hyperparametric nature of our model

Like our model, the Freeman & Simoncelli (2011) model (hereafter abbreviated FS) requires a scale parameter s, which controls the rate of growth of the receptive fields as a function of eccentricity. This parameter should be maximized such that an upper bound for perceptual discrimination is found. Given that texture and image matching occurs in each one of the pooling regions, a high scaling factor will likely make the image rapidly distinguishable from the original, as distortions are more apparent in the periphery. Conversely, a low scaling factor might guarantee metamerism even if the texture statistics are not fully correct, given that smaller pooling regions simulate weak effects of crowding. Low scaling factors in that sense are potentially uninteresting; it is the value up to which humans cannot tell the difference that is critical (Lubin (1997)). FS set out to find this critical value via a psychophysical experiment in which they performed the following single-variable optimization to find such an upper bound:

s₀ = argmax_s E[d′(s|θ_obs)]   (4)

s.t. 0 < d′(s|θ_obs) < ε, where d′ = Φ⁻¹(HR) − Φ⁻¹(FA) is the index of detectability for each observer θ_obs, Φ is the cumulative distribution function of the Gaussian, and HR and FA are the hit and false alarm rates as defined in Green & Swets (1966). However, our model differs in that it has a set of hyperparameters α⃗ that we must estimate everywhere in the visual field, as summarized by the γ function, where we assume α to be tangentially isotropic:

α = γ(·; s)   (5)

where each α represents the maximum amount of distortion (Eq. 1) allowed for every receptive field in the visual periphery before an observer will notice. At first glance, it is not trivial to know whether α should be a function of scale, retinal eccentricity, receptive field size, image content, or potentially a combination of the aforementioned (hence the '·' in the γ function's argument). The motivation for α may thus seem uncertain, and perhaps unnecessary from the Occam's razor perspective of model simplicity. This raises the question: why does the FS model not require any additional hyperparameters, needing only a single scale parameter (s)?

Figure 5: Potential issues of psychophysical intractability for the joint estimation of s and γ(·) as described by our model. Running a psychophysical experiment that performs an exhaustive search for upper bounds of the scale and distortion parameters for every receptive field is intractable. The goal of Experiment 1 is to solve this intractability, posed formally in Eq. 6, via a simulated experiment.
The answer lies in the nature of their model, which is gradient-descent based and in which local texture statistics are matched for every pooling region in the visual field while global image structural information is preserved. Once this condition is reached, no further synthesis steps are required, as it is an equilibrium point. Indeed, the experiments of Wallis et al. (2016) have shown that images do not remain metameric if the structural information of a pooling region is discarded while purely retaining the texture statistics of Portilla & Simoncelli (2000). This motivates the purpose of α, with which we interpolate between structural and texture representations. Thus our goal is to find that equilibrium point in one shot, given that our model is purely feed-forward and requires no gradient descent (Eq. 2). At the expense of this artifice, we face a multi-variable optimization problem that risks being psychophysically intractable. Analogous to FS, we must solve:

s₀, α⃗₀ = argmax_{s, α⃗} E[d′(s, α⃗|θ_obs)]   (6)

s.t. 0 < d′(s, α⃗|θ_obs) < ε. Figure 5 shows the potential intractability: each observer would have to run multiple rounds of an ABX experiment for a collection of many scales and α values for each location in the visual field. Consider: (S scales) × (k pooling regions) × (α_m steps for each α) × (N images) × (w trials) = S·k·α_m·N·w trials per observer. We will show in Experiment 1 that one solution to Eq. 6 is to find a relationship between each set of α's and the scale, expressed via the γ function. This requires a two-stage process: 1) showing that such a γ exists; 2) estimating γ given s. If this is achieved, we can relax the multi-variable optimization into a single-variable optimization problem, where 0 < d′(s, γ(·; s)|θ_obs) < ε, and:

s₀ = argmax_s E[d′(s, γ(·; s)|θ_obs)]   (7)

## 4 Experiments

The goal of Experiment 1 is to estimate γ as a function of s via a computational simulation as a proxy for running human psychophysics. Once γ is computed, we have reduced our minimization to a tractable single-variable optimization problem. We then proceed to Experiment 2, where we perform an ABX experiment on human observers, varying the scale used to render visual metamers as originally proposed by FS. We use the images shown in Figure 6 for both experiments.

Figure 6: A color-coded collection of images used in our experiments.

### 4.1 Experiment 1: Estimation of model hyperparameters via perceptual optimization

Existence and shape of γ: Given some biological priors, we would like γ to satisfy these properties:

1. γ: Z → α s.t. Z ∈ [0, ∞), α ∈ [0, 1), where z ∈ Z is parametrized by the size (radius) of each receptive field (pooling region), which grows with eccentricity in humans.
2. γ is continuous and monotonically non-decreasing, since more information should not be gained given the larger crowding effects that come with increasing receptive field size in the periphery.
3. γ has a unique zero at γ(0) = 0. Under ideal assumptions there is no loss of information in the fovea, where the size of the receptive fields asymptotes to zero.

Indeed, we found that γ is sigmoidal, and is a function of z parametrized by s:

γ(z; s) = a + b / (c + exp(−dz)) = −1 + 2 / (1 + exp(−d(s)z))   (8)

Figure 7: Perceptual optimization.
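The properties above pin the sigmoid constants of Eq. 8 to a = −1, b = 2, c = 1. A minimal sketch of the function, together with a quick check of the three properties (our own illustration):

```python
import numpy as np

def gamma(z, d):
    """Eq. 8: maximal alpha-noise distortion as a function of the
    receptive-field radius z (d.v.a.), with slope parameter d = d(s)."""
    return -1.0 + 2.0 / (1.0 + np.exp(-d * z))

z = np.linspace(0.0, 3.0, 7)
alpha = gamma(z, d=1.281)   # ensemble slope later estimated in Experiment 1

assert np.isclose(gamma(0.0, 1.281), 0.0)   # property 3: gamma(0) = 0
assert np.all(np.diff(alpha) > 0)           # property 2: monotone increasing
assert np.all((alpha >= 0) & (alpha < 1))   # property 1: range [0, 1)
```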
Estimation of γ: To numerically estimate the amount of α-noise distortion for each receptive field in our metamer model, we need a way to simulate the perceptual loss incurred by a human observer when trying to discriminate between metamers and original images. We define a perceptual loss L whose goal is to match, via SSIM, the distortions of a gradient-descent based method such as the FS metamers and those of the NeuroFovea (NF) metamers with respect to their reference images, a strategy similar to that of Laparra et al. (2017) for perceptual rendering. We chose SSIM as it is a standard IQA metric that is monotonic with human judgements, although other metrics such as MS-SSIM and IW-SSIM show similar tuning properties for γ, as shown in the Supplementary Material. Indeed, the reference image for the NF metamer is limited by the autoencoder-like nature of the model, where the bottleneck usually prevents perfect reconstruction: I′ = D(E(I))|_{α=0}, where I′ ≈ I, and they are equal only if the encoder-decoder pair (E, D) allows for lossless compression. Since we cannot define a direct loss function L between the metamers, we need their reference images to define a convex surrogate loss function L_R. The goal of this function is to match the perceptual loss of both metamers for each receptive field k when compared to their reference images: the original image I for the FS model, and the decoded image I′ for the NF model:

L_R(α|k) = E(ΔSSIM)² = (1/N) Σ_{j=1}^{N} [ SSIM(M_FS^{(j,k)}, I^{(j,k)}) − SSIM(M_NF^{(j,k)}(γ_s), I′^{(j,k)}) ]²   (9)

and α_i should be minimized for each pooling region k via α₀ = argmin_α L_R(α|k) over the collection of N images. The intuition behind this procedure is shown in Figure 7. Note that if I′ = I, i.e., there is perfect lossless compression and reconstruction given the choice of encoder and decoder, then the optimization is performed with reference to the same original image. This is an important observation, as the reconstruction capacity of our decoder is limited despite E[MS-SSIM(I, I′)] = 0.86 ± 0.04. Using only the original image in the optimization yields poor local minima at α = 0. Despite this limitation, we show that reference metamers can still be achieved for our lossy compression model.

Figure 8: The per-ring SSIM scores (top) for Experiment 1 at a scale of s = 0.3, where we find the critical α for each receptive field ring by minimizing E(ΔSSIM)² (bottom). E(ΔSSIM)² is minimized by matching the perceptual distortion of the Freeman & Simoncelli (2011) (M_FS) and NeuroFovea (M_NF) metamers per Eq. 9. Each color represents a different 512 × 512 image trajectory; the black line (bottom) shows the average. Only the first 4 eccentricity-dependent receptive fields are shown.

Results: A collection of 10 images was used in our experiments. We computed the SSIM score for each FS and NF image paired with its reference image across each receptive field (R.F.) and averaged those belonging to the same retinal eccentricity. Figure 8 (top) shows these results, as well as the convex nature of the loss function (bottom). This procedure was repeated for all the eccentricity-dependent receptive fields for a collection of 5 scale values: {0.3, 0.4, 0.5, 0.6, 0.7}. A sigmoid estimating γ was then fitted via least squares to the per-R.F. α values, parametrized by scale.
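A sketch of this per-ring search and sigmoid fit, assuming the per-region SSIM scores have been precomputed by the actual rendering pipeline (the inputs ssim_fs, ssim_nf, radii, and alpha_grid are hypothetical stand-ins, not part of the released code):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(z, d):
    """Eq. 8 with a = -1, b = 2, c = 1."""
    return -1.0 + 2.0 / (1.0 + np.exp(-d * z))

def fit_gamma(ssim_fs, ssim_nf, radii, alpha_grid):
    """ssim_fs[k]: image-averaged SSIM(M_FS, I) for the k-th eccentricity ring.
    ssim_nf[k, j]: image-averaged SSIM(M_NF(alpha_grid[j]), I') for that ring.
    Returns the per-ring minimizers of Eq. 9 and the fitted slope d(s)."""
    loss = (ssim_nf - ssim_fs[:, None]) ** 2       # Eq. 9 per ring and alpha
    alpha_star = alpha_grid[np.argmin(loss, axis=1)]
    (d_hat,), _ = curve_fit(sigmoid, radii, alpha_star, p0=[1.0])
    return alpha_star, d_hat
```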
This gave us a collection of d values that control the slope of the sigmoid (Eq. 8): d = {1.240, 1.196, 1.363, 1.311, 1.355} respectively per scale, and d̄ = 1.281 for the ensemble of all scales. We then conducted a 10000-sample permutation test between the pairs of (z_s, α_s) points per scale and the ensemble of points across all scales ({z}, {α}), which verified that their variation is statistically non-significant (p > 0.05). Figure 9 illustrates the results of this procedure. We conclude that the parameters of γ do not vary as we vary scale. In other words, the α = γ(z) function is fixed, and the scale parameter itself, which controls receptive field size, implicitly modulates the maximum α-noise distortion through a unique γ function. If the scale factor is small, the maximum noise distortion in the far periphery will be small, and vice versa if the scale is large. We should point out that Figure 9 might suggest that the maximal noise distortion is contingent on image content, as the scores are not tangentially uniform for receptive fields lying on the same eccentricity ring. Indeed, we did simplify our model by computing an average and fitting the sigmoid. However, computing an average should approximate the maximal distortion for the receptive field size at that eccentricity in the perceptual space of a human observer, i.e., the metameric boundary. We elaborate on this idea in the discussion section.

Figure 9: Top: The average α-noise distortion over the entire visual field for our 10 images, without assuming tangential homogeneity; notice that, on average, α increases radially. Bottom: The γ(·) function (fitted slopes d = {1.240, 1.196, 1.363, 1.311, 1.355} for scales {0.3, 0.4, 0.5, 0.6, 0.7}), which completely defines the α-noise distortion for any receptive field as a function of its size (radius, in d.v.a.).

Figure 10: Metamer generation process for Experiment 2: (a) a scale-invariant γ(·); (b) rendering metamers by varying s. We modulate the distortion for each receptive field according to γ to perform an optimization as in Freeman & Simoncelli (2011).

### 4.2 Experiment 2: Psychophysical evaluation of metamerism with human observers

Given that we have estimated the value of α anywhere in the visual field via the γ function, we can now render our metamers as a function of the single scaling parameter (s), as the receptive field size z is also a function of s, as shown in Figure 10. The psychophysical optimization procedure is now tractable for human observers and has the following form, where 0 < d′(s, γ(z(s); s)|θ_obs) < ε:

s₀ = argmax_s E[d′(s, γ(z(s))|θ_obs)]   (10)

Inspired by the evaluations of Wallis et al. (2016), we tested our metamers on a group of observers performing two different ABX discrimination tasks in a roving design:

1. Discriminating between synthesized images (Synth vs Synth): this was done in the original study of Freeman & Simoncelli. While this test does not guarantee metamerism (Reference vs Synth), it has become a standard evaluation when probing for metamerism.
2. Discriminating between the synthesized and reference images (Synth vs Reference).
This metamerism test was not previously reported in Freeman & Simoncelli (2011) for their original images, and it is the most rigorous evaluation. Recently, Wallis et al. (2018) argued that any model that maps an image to white noise may guarantee metamerism under the Synth vs Synth condition but not against the original/reference image, and thus is not a metamer.

A group of 3 observers, agnostic to the peripheral distortions and purposes of the experiment, performed interleaved Synth vs Synth and Synth vs Reference experiments on NF metamers for the previous set of images (Fig. 6). An SR EyeLink 1000 desk mount was used to monitor their gaze for the center forced-fixation ABX task shown in Figure 11. In each trial, observers were shown 3 images, and their task was to match the third image to the 1st or the 2nd. Each observer saw each of the 10 images 30 times per scaling factor (5) per discriminability type (2), totalling 3000 trials per observer. Images were rendered at 512 × 512 px, and we fixed the monitor at 52 cm viewing distance and 800 × 600 px resolution so that the stimuli subtended 26 × 26 deg. The monitor was linearly calibrated with a maximum luminance of 115.83 ± 2.12 cd/m².

Figure 11: Experiment 2: the ABX metamer discrimination task performed by the observers. Observers must fixate at the center of the image (no eye movements) throughout the trial for it to be valid.

Figure 12: The results of the 3 observers ('ZQ', 'AL', 'AG') and the pooled observer (average; far right) for the Synth vs Reference and Synth vs Synth experiments on our metamers, as a function of the scaling factor s. The error bars denote the 68% confidence interval after bootstrapping the trials per observer.

We then estimated the critical scaling factor s₀ and absorbing factor β₀ of the roving ABX task to fit a psychometric function for proportion correct (PC), as in Freeman & Simoncelli (2011); Wallis et al. (2018), where the detectability is computed via d′²(s) = β₀ (1 − s₀²/s²) · 1_{s>s₀}, and

PC(s) = Φ(d′(s)/√2) Φ(d′(s)/√6) + Φ(−d′(s)/√2) Φ(−d′(s)/√6),  with d′(s) = √(d′²(s))   (11)

Results: Absorbing gain factors β₀ and critical scales s₀ per observer are shown in Figure 12, where the fits were made using a least-squares curve fitting model and bootstrap sampling (n = 10000) to produce the 68% confidence intervals. Lapse rates (λ) were also included for robustness of fit, as in Wichmann & Hill (2001). Analogous to Freeman & Simoncelli (2011), we find that the critical scaling factor is ~0.51 for the Synth vs Synth experiment, matching V2, a critical region in the brain that has been identified as responding to texture (Long et al. (2018); Ziemba et al. (2016)). This suggests that the parameters we use to capture and transfer texture statistics, which differ from the correlations of a steerable pyramid decomposition as proposed in Portilla & Simoncelli (2000), might match the perceptual discrimination rates of the FS metamers. This does not imply that the models are perceptually equivalent, but it aligns with the results of Ustyuzhaninov et al.
(2017), which show that even a basis of random filters can capture texture statistics; thus, different flavors of metamer models can be created with different statistics. In addition, we find that the critical scaling factor for the Synth vs Reference experiment is less than 0.5 (~0.25, matching V1) for the pooled observer, as validated recently by Wallis et al. (2018) for their CNN synthesis and the FS model under the Synth vs Reference condition.

## 5 Discussion

There has been a recent surge of interest in developing and testing new metamer models. The SideEye model developed by Fridman et al. (2017) uses a fully convolutional network (FCN), as in Long et al. (2015), and learns to map an input image into a Texture Tiling Model (TTM) mongrel (Rosenholtz et al. (2012)). Their end-to-end model is also feed-forward like ours, but no noise is incorporated in the generation pipeline, making their model fully deterministic. At first glance this seems an advantage rather than a limitation; however, it limits the biological plausibility of the metameric response, as the same input image should be able to create more than one metamer. Another recently proposed model is the CNN synthesis model developed by Wallis et al. (2018). The CNN synthesis model is gradient-descent based and is closest in flavor to the FS model, with the difference that its texture statistics are provided by a Gramian matrix of filter activations over multiple layers of a VGGNet, rather than those used in Portilla & Simoncelli (2000).

The question of whether the scaling parameter is the only parameter to be optimized for metamerism still seems open. This was questioned early on in Rosenholtz et al. (2012), and recently proposed and studied by Wallis et al. (2018), who suggest that metamers are driven by image content rather than Bouma's law (the scaling factor). Figure 9 suggests that, on average, α must increase in proportion to retinal eccentricity, but that this is conditioned on the image content of each receptive field. We believe that the hyperparametric nature of our model sheds some light on reconciling these two theories. Recall that in Figures 4 and 8 we found that certain images can be pushed more strongly in the direction of their texturized versions than others, given their location in the encoded space, the local geometry of the surface, and their projection onto the perceptual space. This suggests that the average maximal distortion one can apply is fixed contingent on the size of the receptive field, but that we are allowed to push further (increase α) for some images more than others, because the direction of the distortion lies closer to the perceptual null space (making the difference perceptually unnoticeable to the human observer). This is usually the case for periodic regions of images, such as skies or grass. Along the same lines, we elaborate in the Supplementary Material on how our model may potentially explain why synthesized samples are metameric to each other at the scales of V1 and V2, while only samples generated at the scale of V1 (s = 0.25) are metameric to the reference image. Our model also differs from others (FS, and recently Wallis et al. (2018)) in the role that noise plays in the computational pipeline.
The previously mentioned models use noise as an initial seed for a gradient-descent texture matching pipeline, while we use noise as a proxy for texture distortion that is directly associated with crowding in the visual field. One could argue that the same response is achieved via both approaches, but ours seems more biologically plausible at the algorithmic level. In our model, an image is fed through a non-linear hierarchical system (simulated through a deep net) and is corrupted by noise that matches the texture properties of the input image (via AdaIN). This perceptual representation is perturbed along the direction of the texture-matched patch for each receptive field, and inverting the perturbed representation results in a metamer. Figure 13 illustrates such metamer-producing perturbations, projected onto a 2D subspace via the locally linear embedding (LLE) algorithm (Roweis & Saul (2000)). Indeed, the 10 encoded images do not fully overlap and are quite distant from each other in the 2D projection. However, foveated representations, when perturbed with texture-like noise, seem to finely tile the perceptual space, and might act as a type of biological regularizer for human observers, who constantly make eye movements when processing visual information. This suggests that robust representations might be achieved in the human visual system given its foveated nature, as non-uniform high-resolution imagery does not map to the same point in perceptual space.

Figure 13: Image embeddings.

If this holds, perceptually invariant data-augmentation schemes driven by metamerism may be a useful enhancement for artificial systems that react oddly to adversarial perturbations exploiting coarse perceptual mappings (Goodfellow et al. (2015); Tabacof & Valle (2016); Berardino et al. (2017)).

Understanding the underlying representations of metamerism in the human visual system remains a challenge. In this paper we propose a model that emulates metameric responses via a foveated feed-forward style transfer network. We find that correctly calibrating such perturbations (a consequence of internal noise that matches the texture representation) in the perceptual space and inverting the encoded representation results in a metamer. Though our model is hyperparametric in nature, we propose a way to reduce the parametrization via a perceptual optimization scheme. Via a psychophysical experiment, we empirically find that the critical scaling factor matches the rate of growth of the receptive fields in V2 (s = 0.5), as in Freeman & Simoncelli (2011), when discriminating between synthesized metamers, and matches V1 (s ≈ 0.25) for reference metamers, similar to Wallis et al. (2018). Finally, while our choice of texture statistics and transfer is relu4_1 of a VGG19 and AdaIN respectively, our 1000-fold accelerated feed-forward metamer generation pipeline should be extendible to other models that correctly compute texture/style statistics and transfer. This opens the door to rapidly generating multiple flavors of visual metamers, with applications in neuroscience and computer vision.

## Acknowledgements

We would like to thank Xun Huang for sharing his code and valuable suggestions on AdaIN, Jeremy Freeman for making his metamer code available, Jamie Burkes for collecting original high-quality stimuli, N.C.
Puneeth for insightful conversations on texture and masking, Christian Bueno for informal lectures on homotopies, and Soorya Gopalakrishnan and Ekta Prashnani for insightful discussions. Lauren Welbourne, Mordechai Juni, Miguel Lago, and Craig Abbey were also helpful in editing the manuscript and giving positive feedback. We would also like to thank NVIDIA for donating a Titan X GPU. This work was supported by the Institute for Collaborative Biotechnologies through grant 2 W911NF-09-0001 from the U.S. Army Research Office.

## References

Emre Akbas and Miguel P Eckstein. Object detection through search with a foveated visual system. PLoS Computational Biology, 13(10):e1005743, 2017.

Benjamin Balas, Lisa Nakano, and Ruth Rosenholtz. A summary-statistic representation in peripheral vision explains visual crowding. Journal of Vision, 9(12):13, 2009.

Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. International Conference on Learning Representations (ICLR), 2017.

Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

Alexander Berardino, Valero Laparra, Johannes Ballé, and Eero Simoncelli. Eigen-distortions of hierarchical representations. In Advances in Neural Information Processing Systems, pp. 3530–3539, 2017.

Scott J Daly. Visible differences predictor: an algorithm for the assessment of image fidelity. In SPIE/IS&T 1992 Symposium on Electronic Imaging: Science and Technology, pp. 2–15. International Society for Optics and Photonics, 1992.

Arturo Deza and Miguel Eckstein. Can peripheral representations improve clutter metrics on complex scenes? In Advances in Neural Information Processing Systems, pp. 2847–2855, 2016.

Jeremy Freeman and Eero P Simoncelli. Metamers of the ventral stream. Nature Neuroscience, 14(9):1195–1201, 2011.

Lex Fridman, Benedikt Jenik, Shaiyan Keshvari, Bryan Reimer, Christoph Zetzsche, and Ruth Rosenholtz. SideEye: A generative neural network based simulator of human peripheral vision. arXiv preprint arXiv:1706.04568, 2017.

Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling perceptual factors in neural style transfer. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. International Conference on Learning Representations (ICLR), 2015.

DM Green and JA Swets. Signal Detection Theory and Psychophysics. New York, 1966.

Olivier J Hénaff. Testing a mechanism for temporal prediction in perceptual, neural, and machine representations. PhD thesis, Center for Neural Science, New York University, New York, NY, Sept 2018.

Olivier J Hénaff and Eero P Simoncelli. Geodesics of learned representations. International Conference on Learning Representations (ICLR), 2016.

Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. International Conference on Computer Vision (ICCV), 2017.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
Springer, 2016.

Shaiyan Keshvari and Ruth Rosenholtz. Pooling of continuous features provides a unifying account of crowding. Journal of Vision, 16(39), 2016.

V Laparra, A Berardino, J Ballé, and EP Simoncelli. Perceptually optimized image rendering. Journal of the Optical Society of America A, 34(9):1511, 2017.

Bria Long, Chen-Ping Yu, and Talia Konkle. Mid-level visual features underlie the high-level categorical organization of the ventral stream. Proceedings of the National Academy of Sciences, 2018. ISSN 0027-8424. doi: 10.1073/pnas.1719616115. URL http://www.pnas.org/content/early/2018/08/30/1719616115.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.

Jeffrey Lubin. A human vision system model for objective picture quality measurements. In International Broadcasting Convention, 1997, pp. 498–503. IET, 1997.

Javier Portilla and Eero P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1):49–70, 2000.

Ruth Rosenholtz, Jie Huang, Alvin Raj, Benjamin J Balas, and Livia Ilie. A summary statistic representation in peripheral vision explains visual search. Journal of Vision, 12(4):14, 2012.

Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.

Pedro Tabacof and Eduardo Valle. Exploring the space of adversarial images. In 2016 International Joint Conference on Neural Networks (IJCNN), pp. 426–433. IEEE, 2016.

Dmitry Ulyanov, Vadim Lebedev, Victor Lempitsky, et al. Texture networks: Feed-forward synthesis of textures and stylized images. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1349–1357, 2016.

Ivan Ustyuzhaninov, Wieland Brendel, Leon A Gatys, and Matthias Bethge. What does it take to generate natural textures? International Conference on Learning Representations (ICLR), 2017.

Thomas S. A. Wallis, Christina M Funke, Alexander S Ecker, Leon A. Gatys, Felix A. Wichmann, and Matthias Bethge. Image content is more important than Bouma's law for scene metamers. bioRxiv, 2018. doi: 10.1101/378521. URL https://www.biorxiv.org/content/early/2018/07/30/378521.

Thomas SA Wallis, Matthias Bethge, and Felix A Wichmann. Testing models of peripheral encoding using metamerism in an oddity paradigm. Journal of Vision, 16(2):4, 2016.

Z Wang and Q Li. Information content weighting for perceptual image quality assessment. IEEE Transactions on Image Processing, 20(5):1185–1198, 2011.

Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pp. 1398–1402. IEEE, 2003.

Felix A Wichmann and N Jeremy Hill. The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63(8):1293–1313, 2001.

Corey M Ziemba, Jeremy Freeman, J Anthony Movshon, and Eero P Simoncelli. Selectivity and tolerance for visual texture in macaque V2.
Proceedings of the National Academy of Sciences, 113(22):E3140–E3149, 2016.

## 6 Supplementary Material

Figure 14: Reference metamers at the scale of s = 0.25, at which they are indiscriminable to the human observer. The color coding scheme matches the data points of the optimization in Experiment 1 and the psychophysics of Experiment 2. All images used in the experiments were generated at 512 × 512 px, subtending 26 × 26 d.v.a. (degrees of visual angle).

### 6.1 Hyperparameter search algorithm

Algorithm 1 fully describes the outline of Experiment 1.

Algorithm 1: Pipeline for metamer hyperparameter γ(·) search

    1:  procedure Estimate hyperparameter γ(·)
    2:    Choose image dataset S_I.
    3:    Pick hyperparameter search step size α_step. Pick scale search step size s_step.
    4:    for each image I ∈ S_I do
    5:      for each scale s ∈ [s_init : s_step : s_final] do
    6:        Compute baseline metamer M_FS(I)
    7:        for each α ∈ [0 : α_step : 1] do
    8:          Compute metamer M_NF(I)
    9:        end for
    10:       Find the α for each receptive field that minimizes E(ΔSSIM)².
    11:       Fit the γ_s(·) function to the collection of α values.
    12:     end for
    13:   end for
    14:   Perform permutation test on γ_s for all s.
    15:   if γ_s is independent of s then
    16:     γ_s = γ
    17:   else
    18:     Perform regression of the parameters of γ_s as a function f of s.
    19:     γ_s = γ ∘ f(s)
    20:   end if
    21: end procedure

### 6.2 Model specifications and training

We use k = k_p + k_f spatial control windows: k_p pooling regions (θ_r receptive fields × θ_t eccentricity rings) and k_f = 1 fovea (at an approximate 3 deg radius; central pooling regions with an area smaller than 100 pixels are removed and grouped into the fovea). Computing the metamers for the scales {0.3, 0.4, 0.5, 0.6, 0.7} required {300, 186, 125, 102, 90} pooling regions, excluding the fovea, over which we applied local style transfer. Details regarding the decoder network architecture and training can be found in Huang & Belongie (2017). We used the publicly available decoder of Huang and Belongie, which was trained on ImageNet and a collection of publicly available paintings so that it learns to invert texture as well. In their training pipeline, the encoder is fixed and the decoder is trained to invert the structure of the content image and the texture of the style image; thus, when the content and style image are the same, the decoder approximates the inverse of the encoder (D ≈ E⁻¹). We also re-trained another decoder on a set of 100 images, all scenes (as a control to check for potential differences), and achieved outputs similar (by visual inspection) to the publicly available one of Huang & Belongie. The dimensionality of the input to the encoder is 1 × 512 × 512, and the dimensionality of the output (relu4_1) is 512 × 64 × 64; it is at this 64 × 64 resolution that we apply foveated pooling, using guidance channels downsampled from the initial 512 × 512 input.
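A sketch of this foveated pooling step (our own; the average-pool downsampling of the guidance masks from 512 × 512 to 64 × 64 is an assumption about how the masks are matched to the relu4_1 grid, not a detail confirmed by the released code):

```python
import numpy as np

def downsample_mask(mask512, factor=8):
    """Average-pool a 512x512 guidance mask to the 64x64 relu4_1 grid."""
    h, w = mask512.shape
    return mask512.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def foveated_target(C, S, masks512, alphas):
    """Inner sum of Eq. 2: per-region interpolation between the encoded
    content C and the stylized noise S, weighted by the masks w_i."""
    T = np.zeros_like(C)
    for mask512, alpha in zip(masks512, alphas):
        w = downsample_mask(mask512)[None, :, :]   # broadcast over channels
        T += w * ((1.0 - alpha) * C + alpha * S)
    return T

# Toy usage: two complementary half-field regions with different alphas.
C = np.random.randn(512, 64, 64)                   # E(I)
S = np.random.randn(512, 64, 64)                   # stands in for S(E(N))
left = np.zeros((512, 512)); left[:, :256] = 1.0
T = foveated_target(C, S, [left, 1.0 - left], alphas=[0.2, 0.8])
```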
Constructions of biologically-tuned peripheral representations are explained in detail in Freeman & Simoncelli (2011); Akbas & Eckstein (2017); Deza & Eckstein (2016), and are governed by the following equations:

f(x; t₀) =
cos²((π/2) · (x − (t₀−1)/2) / t₀),   for −(1+t₀)/2 < x ≤ (t₀−1)/2
1,   for (t₀−1)/2 < x ≤ (1−t₀)/2
1 − cos²((π/2) · (x − (1+t₀)/2) / t₀),   for (1−t₀)/2 < x ≤ (1+t₀)/2   (12)

h_n(θ) = f((θ − [w_θ n + w_θ(1−t₀)/2]) / w_θ);  w_θ = 2π/N_θ;  n = 0, …, N_θ − 1   (13)

g_n(e) = f((log(e) − [log(e₀) + w_e(n+1)]) / w_e);  w_e = (log(e_r) − log(e₀)) / N_e;  n = 0, …, N_e − 1   (14)

where f(x) is a cosine profiling function that smooths a regular step function, and h_n(θ) and g_n(e) are the averaging values of the pooling region w_i at a specific angle θ and radial eccentricity e in the visual field. In addition, we used the default values of visual radius e_r = 26 deg, e₀ = 0.25 deg, and t₀ = 1/2. The scale s defines the number of eccentricity rings N_e, as well as the number of polar pooling regions N_θ over [0, 2π]. We perform the foveated pooling operation on the output of the encoder. Since the encoder is fully convolutional with no fully connected layers, guidance channels can be used to do localized (foveated) style transfer. Our pix2pix U-Net refinement module took 3 days to train on a Titan X GPU, and was trained with 64 crops (256 × 256) per image on 100 images, including horizontally mirrored versions. We ran 200 training epochs of these 12800 images on the U-Net architecture proposed by Isola et al. (2017), which preserves local image structure given an adversarial and L2 loss.

### 6.3 Metamer Model Comparison

The following table summarizes the main similarities and differences across all current models:

| | FS (2011) | CNN-Synthesis (2018) | SideEye (2017) | NF (Ours) |
|---|---|---|---|---|
| Feed-forward | – | – | ✓ | ✓ |
| Input | Noise | Noise | Image | Image |
| Multi-resolution | ✓ | ✓ | – | – |
| Texture statistics | Steerable pyramid | VGG19 conv-{11,21,31,41,51} | Steerable pyramid | VGG19 relu41 |
| Style transfer | Portilla & Simoncelli | Gatys et al. | Rosenholtz et al. | Huang & Belongie |
| Foveated pooling | ✓ | ✓ | (implicit via FCN) | ✓ |
| Decoder (trained on) | – | – | metamers/mongrels | images |
| Moveable fovea | – | – | ✓ | ✓ |
| Use of noise | Initialization | Initialization | – | Perturbation |
| Non-deterministic | ✓ | ✓ | – | ✓ |
| Direct computable inverse | – | – | (implicit via FCN) | ✓ |
| Rendering time | hours | minutes | milliseconds | seconds |
| Image type | scenes | scenes/texture | scenes | scenes |
| Critical scaling (vs Synth) | 0.46 | {0.39/0.41} | not required | 0.5 |
| Critical scaling (vs Reference) | not available | {0.2/0.35} | not required | 0.24 |
| Experimental design | ABX | Oddball | – | ABX |
| Reference image in exp. | Metamer | Original | – | Compressed via decoder |
| Number of images tested | 4 | 400 | – | 10 |
| Trials per observer | 1000 | 1000 | – | 3000 |

Table 1: Metamer model comparison.

Figure 15: Algorithmic (top) and visual (bottom) comparisons between our metamers and a sample from Freeman & Simoncelli (2011) for a scaling factor of 0.3.

Each model has its own limitations: the FS model cannot directly compute an inverse of the encoded representation to generate a metamer, requiring an iterative gradient-descent procedure; our NF model is limited by the capacity of the encoder-decoder architecture, as it does not achieve lossless compression (perfect reconstruction).
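As a companion to the pooling-window equations (12)–(14) of Section 6.2, here is a minimal sketch (our own; e₀ = 0.25, e_r = 26, and t₀ = 1/2 follow the text, while N_e = 10 is an arbitrary choice for illustration):

```python
import numpy as np

def f(x, t0=0.5):
    """Eq. 12: piecewise-cosine profile smoothing a step on [-1/2, 1/2]."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    rise = (-(1 + t0) / 2 < x) & (x <= (t0 - 1) / 2)
    flat = ((t0 - 1) / 2 < x) & (x <= (1 - t0) / 2)
    fall = ((1 - t0) / 2 < x) & (x <= (1 + t0) / 2)
    out[rise] = np.cos(np.pi / 2 * (x[rise] - (t0 - 1) / 2) / t0) ** 2
    out[flat] = 1.0
    out[fall] = 1.0 - np.cos(np.pi / 2 * (x[fall] - (1 + t0) / 2) / t0) ** 2
    return out

def g_n(e, n, e0=0.25, er=26.0, Ne=10):
    """Eq. 14: n-th log-eccentricity window, n = 0, ..., Ne - 1."""
    we = (np.log(er) - np.log(e0)) / Ne
    return f((np.log(e) - (np.log(e0) + we * (n + 1))) / we)

# Adjacent windows overlap on their cosine flanks and tile eccentricity:
# away from the first and last windows, the windows sum to one.
e = np.geomspace(0.4, 20.0, 400)
tiling = sum(g_n(e, n) for n in range(10))
assert np.allclose(tiling, 1.0, atol=1e-6)
```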
### 6.4 Interpretability of V1 and V2 Metamers

In Figure 16, we illustrate the metamer generation process for two sample metamers, given different noise perturbations. Here we decompose Figure 4 into two separate diagrams, one per metamer and noise perturbation, and provide an additional visualization of the projection of the metamers in perceptual space, gaining theoretical insight into how and why metamerism arises for the Synth vs Synth condition in V2, and the Synth vs Reference condition in V1, as we demonstrated experimentally.

Figure 16: Decomposition and overview of the metamer generation process in the image space, the encoded space, and the perceptual space. The original image patch is coded in blue, the V1 metamers in purple, and the V2 metamers in pink. Dark brown represents the initial white noise that is later stylized via AdaIN through S(·). Note that these two points are far from each other in image space, but quite close in perceptual space, as they are also metameric to each other. They are not placed on the actual encoded manifold since they are not in the near vicinity of either C or S(N), having no scene-like structure. The interpolation for maximal distortion is done along the line between C and S(N); these are the points in blue and red in the encoded space, representing the extremes α = 0.0 and α = 1.0 respectively. The distance (in green) between a V1 metamer and the content image is the same as that between the two V2 metamers, potentially explaining how they are perceptually indistinguishable from each other at different scaling factors given the type of ABX task.

### 6.5 Pilot Experiments

In a preliminary psychophysical study, we ran an experiment with a collection of 50 images and 6 observers ('LR', 'SO', 'DS', 'JZ', 'OD', 'SW') on the FS metamers. Observers performed a single session of 200 trials of the FS metamers with the scale fixed at s = 0.5. We found the following: while the synthesized images were metameric to each other at the scaling factor of 0.5, the FS metamers were not metameric to their reference high-quality images at that scale. Only a sub-group of observers ('LR', 'SO', 'DS') scored well above chance in discriminating the images in the ABX task. These results are in sync with the evaluations done by Wallis et al. (2018), who varied scale and found the critical value to be less than 0.5 and rather closer to 0.25, within the range of V1.

Figure 17: Pilot data on FS metamers: proportion correct per pilot observer for the Synth vs Reference and Synth vs Synth ABX tasks.

### 6.6 Estimation of lapse rate (λ) per observer

The motivation behind estimating the lapse rate is to quantify how engaged the observer was in the experiment, as well as to provide a robust estimate of the parameters in the fit of the psychometric functions. Not accounting for the lapse rate may dramatically affect the estimation of these parameters, as suggested in Wichmann & Hill (2001). In general, lapse rates are computed by penalizing a psychometric function ψ(·) that ranges between some lower and upper bound, usually [0, 1].
To estimate the lapse rate λ, a new ψ′(·) is defined with the following form:

ψ′(·) = b + (1 − b − λ) ψ(·)   (15)

Recall that for us, the psychometric fitting function ψ(·) = PC_ABX(s) is defined by Equation 11 and parametrized by both the absorbing factor β₀ and the critical scaling factor s₀:

PC_ABX(s) = Φ(d′(s)/√2) Φ(d′(s)/√6) + Φ(−d′(s)/√2) Φ(−d′(s)/√6)   (16)

where we have:

d′²(s) = β₀ (1 − s₀²/s²) · 1_{s>s₀}   (17)

To compute the new ψ′(·), we first note that our ψ is bounded between [0.5, 1], and that the new ψ′ is a linear combination of a correct guess on a lapse trial and a correct decision on a non-lapse trial, from which we obtain:

PC(s) = λ + (1 − 2λ) PC_ABX(s)   (18)

as derived in Hénaff (2018), which includes lapse rates for an AXB task. When fitting the curves for each of the n = 10000 bootstrapped samples, we restricted the lapse rate to vary between λ ∈ [0.00, 0.06], as suggested in Wichmann & Hill (2001), and found the following lapse rates: Observer 1: λ^RS_ZQ = 0.0248 ± 0.0209, λ^SS_ZQ = 0.0430 ± 0.0228. Observer 2: λ^RS_AL = 0.0008 ± 0.0062, λ^SS_AL = 0.0166 ± 0.0215. Observer 3: λ^RS_AG = 0.0141 ± 0.0243, λ^SS_AG = 0.0218 ± 0.0236. We later averaged these lapse rates, as there is an equal probability of each type of trial appearing (Synth vs Synth, or Reference vs Synth), and refitted each curve with the new pooled lapse rate estimates. Indeed, each observer did both experiments in a roving paradigm, rather than one experiment after the other, so we should have only one lapse rate estimate per observer. It is worth mentioning that re-performing the fits with separate lapse rates did not significantly affect the estimates of the critical scaling values, though one might argue that higher lapse rates would significantly move the critical scaling factor estimates. This is not the case, as the absorbing factor β does not place the upper bound of the psychometric function at 1. Our pooled estimates of the lapse rates were: λ_ZQ = 0.0339, λ_AL = 0.0087, λ_AG = 0.0179, as shown in Figure 12. The estimates (critical scale s₀, absorbing factor β₀, and lapse rate λ₀) shown for the pooled observer were obtained by averaging the estimates over the 3 observers.

### 6.7 Robustness of estimation of the γ function

In this subsection we show that the perceptual optimization pipeline is robust to the choice of IQA metric, using MS-SSIM (multi-scale SSIM²) from Wang et al. (2003) and IW-SSIM (information content weighted SSIM) from Wang & Li (2011). Three key observations stem from these additional results:

1. The sigmoidal nature of the γ function is recovered again and is also scale independent, showing the broad applicability of our perceptual optimization scheme and how it extends to other IQA metrics that satisfy SSIM-like properties (upper bounded, symmetric, and with a unique maximum).

2. The tuning curves of MS-SSIM and IW-SSIM look almost identical, given that IW-SSIM is no more than a weighted version of MS-SSIM, where the weighting function is the mutual information between the encoded representations of the reference and distorted images across multiple resolutions. Differences are stronger in IW-SSIM when the region over which it is evaluated is quite large (i.e., an entire image); however, given that our pooling regions are quite small, the IW-SSIM score asymptotes to the MS-SSIM score. In addition, both scores converge to very similar values given that we average them over the images and over all the pooling regions that lie within the same eccentricity ring.
We found that 90% of the maximum α's had the same values given the 20-point sampling grid used in our optimization. Perhaps a different selection of IW hyperparameters (we used the default set), finer sampling schemes for the optimal value search, or averaging over more images might produce visible differences between the two metrics.

3. The sigmoidal slope is smaller for both IW-SSIM and MS-SSIM than for SSIM, which yields more conservative distortions (as α is smaller for each receptive field). This implies that the model can still create metamers at the estimated scaling factors of 0.21 and 0.50, although they may have different critical scaling factors for the Reference vs Synth and Synth vs Synth experiments. Future work should focus on psychophysically finding these critical scaling factors, and on whether they still lie within the range of the rates of growth of the receptive field sizes of V1 and V2.

Figure 18: A collection of scale-invariant γ(·)'s across multiple IQA metrics for the perceptual optimization scheme of Experiment 1. In this figure we superimpose the maximal α-noise distortions for all scales and find a single function that fits all the points, showing that γ is independent of scale.

² Scale in the context of SSIM refers to resolution (as in the scales of a Laplacian pyramid), and is not to be confused with the scaling factor s of our experiments, which encodes the rate of growth of the receptive fields.

Figure 19: Top: the maximum α-noise distortion computed per pooling region, collapsed over all images, for each IQA metric: (a) perceptual optimization with SSIM (fitted slopes d = {1.240, 1.196, 1.363, 1.311, 1.355} for scales {0.3, 0.4, 0.5, 0.6, 0.7}); (b) with MS-SSIM (d = {0.621, 0.649, 0.749, 0.757, 0.736}); (c) with IW-SSIM (d = {0.614, 0.640, 0.733, 0.758, 0.733}). Bottom: averaging across all the pooling regions at each retinal eccentricity, we find that the γ function is invariant to scale, as in our original experiment, suggesting that our perceptual optimization scheme is flexible across IQA metrics.
Figure 20: Difference distributions of the fitted sigmoid slopes from the permutation tests, per scale ({0.3, 0.4, 0.5, 0.6, 0.7}) and per IQA metric. A permutation test determined that each γ function is also scale independent under the 99% confidence interval (CI), where we widened the CI to account for false discovery rates (FDR). Indeed, when we perform the permutation tests with a 95% confidence interval (vertical cyan lines), all curves show no significant difference except MS-SSIM and IW-SSIM at the scaling factor of 0.3 (p ≤ 0.02, non-FDR-corrected), potentially due to small receptive field sizes, which bias the estimates. All other differences in the d parameter of the sigmoid, with respect to the average fitted sigmoid, are statistically insignificant.
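For completeness, a sketch of the permutation test used here and in Experiment 1 (our own minimal implementation; the test statistic is the difference between the sigmoid slope d fitted on a single scale's (z, α) points and the slope fitted on the ensemble):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_d(z, alpha):
    """Least-squares fit of the slope d in Eq. 8 to (z, alpha) points."""
    sigmoid = lambda zz, d: -1.0 + 2.0 / (1.0 + np.exp(-d * zz))
    (d_hat,), _ = curve_fit(sigmoid, z, alpha, p0=[1.0])
    return d_hat

def permutation_test(z_s, a_s, z_all, a_all, n_perm=10000, seed=0):
    """Shuffle the pooled (z, alpha) points between the per-scale set and
    the ensemble, refitting d each time; returns the permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = abs(fit_d(z_s, a_s) - fit_d(z_all, a_all))
    z_pool, a_pool = np.concatenate([z_s, z_all]), np.concatenate([a_s, a_all])
    n, diffs = len(z_s), np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.permutation(len(z_pool))
        diffs[i] = abs(fit_d(z_pool[idx[:n]], a_pool[idx[:n]])
                       - fit_d(z_pool[idx[n:]], a_pool[idx[n:]]))
    return float((diffs >= observed).mean())   # permutation p-value
```

A large p-value from this test corresponds to the scale-independence conclusion drawn in Figures 9 and 20.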