#  2010 Preference Judgments

##  GAST experiment

Figure 1a. Dell LCD 1905FP - Chevron-shaped RGB subpixels.
Figure 1b. Dell LCD 1907FPc - Stripes-shaped RGB subpixels.

###  Experiment description

In this experiment we use symmetric 5-tap filters [a b c b a] and look for the optimal filter. We vary a and b, set $c = 1 - (2a+2b)$, and require $|c| \le 1$. We examine the parameter space $-0.6 \le a,b \le 0.6$, sampled in steps of 0.1. This gives us 88 filters.
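The count of 88 can be checked by enumerating the grid. The sketch below is only an illustration (the function name `enumerate_filters` is hypothetical and not part of ctToolbox); it works in integer tenths so that the boundary cases $c = \pm 1$ are not lost to floating-point rounding.

```python
def enumerate_filters():
    """List (a, b, c) on the 0.1 grid with -0.6 <= a, b <= 0.6 and |c| <= 1."""
    filters = []
    for i in range(-6, 7):          # a = i / 10
        for j in range(-6, 7):      # b = j / 10
            # c = 1 - 2(a + b), so |c| <= 1 is the same as 0 <= a + b <= 1,
            # i.e. 0 <= i + j <= 10 when a and b are counted in tenths.
            if 0 <= i + j <= 10:
                c = (10 - 2 * (i + j)) / 10
                filters.append((i / 10, j / 10, c))
    return filters


filters = enumerate_filters()
print(len(filters))  # 88
```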

If we wanted to run a full-scale pairwise comparison experiment, comparing all possible pairs of filters in this space, we would need 3,828 comparisons. To get reliable results we would want each subject to vote on each comparison 10 times, which means 38,280 votes per subject(!). Obviously, this would not be a feasible task.
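The two numbers above follow directly from the number of unordered pairs among 88 filters. A short check (variable names are illustrative only):

```python
import math

n_filters = 88
repeats_per_pair = 10

n_pairs = math.comb(n_filters, 2)            # unordered pairs of distinct filters
votes_per_subject = n_pairs * repeats_per_pair

print(n_pairs, votes_per_subject)  # 3828 38280
```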

To make this experiment possible, we looked for an adaptive algorithm for pairwise comparison testing that finds "shortcuts" to the optimal filter without comparing all the pairs in the space. For this purpose we use the GAST (Gradient Ascent Subjective Testing) algorithm and code for parameter optimization in pairwise-comparison testing, developed by Stephen Voran and Andrew Catellier. Each GAST run starts from a random point in the (a,b) parameter space and "climbs", based on the subject's responses, until it reaches optimal a and b values. Voran and Catellier's study shows that the mean of the results of 35 runs of this algorithm gives 95-percent confidence in the result.
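To give a feel for the "climbing" idea, here is a toy hill climb driven by pairwise votes. It is not the Voran and Catellier algorithm or their code: the neighbourhood, the stopping rule and the simulated observer (who simply prefers filters whose c is closer to 0.6) are all assumptions made for illustration.

```python
import random


def is_valid(i, j):
    """Grid point (a, b) = (i/10, j/10) inside the sampled space with |c| <= 1."""
    return -6 <= i <= 6 and -6 <= j <= 6 and 0 <= i + j <= 10


def simulated_vote(p, q):
    """Hypothetical observer: True if q is strictly preferred over p.

    Preference is for c closer to 0.6; since c = 1 - 2(a + b), that means
    i + j closer to 2 when a and b are counted in tenths.
    """
    return abs(q[0] + q[1] - 2) < abs(p[0] + p[1] - 2)


def toy_climb(start):
    """Move to a preferred neighbour until no neighbour wins a comparison."""
    current = start
    while True:
        i, j = current
        neighbours = [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
        better = [n for n in neighbours if is_valid(*n) and simulated_vote(current, n)]
        if not better:
            return current
        current = better[0]


random.seed(0)
valid_points = [(i, j) for i in range(-6, 7) for j in range(-6, 7) if is_valid(i, j)]
ends = [toy_climb(random.choice(valid_points)) for _ in range(10)]
print([(i / 10, j / 10) for i, j in ends])
```

With this particular simulated observer every run stops on the line a + b = 0.2, at different (a,b) points. That is a property of the toy observer, not a result of the experiment.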

For each subject, our script performs 10 GAST runs for each of the letters 'g', 'v' and 's', and then plots the results for each letter separately and for all 30 runs together. We run it for Georgia, size 12, and then run the same set for Arial, size 11, for a total of 60 GAST runs per session. We had 3 subjects, each of whom completed 3 sessions. We repeated this entire procedure on two calibrated displays with different pixel structures - Dell LCD 1905FP (figure 1a) and Dell LCD 1907FPc (figure 1b).

Subjects are presented with pairs of differently filtered versions of the same character. They then choose one of 5 options: the left character is much better than the right one, the left character is better, the characters look the same, the right character is better, or the right character is much better than the left one.

You can browse the script of this experiment, and most of its sub-functions here. In order to run the experiment, you must check out this entire svn directory and its sub-directories:

https://white.stanford.edu/ee/pdcprojects/CleartypeMetrics/ctExp/filterPreference/Pairwise Comparison/

And run the script:

ctPrefGast

You must have ctToolbox, Psychtoolbox and ISET-4.0 on your path. Type 'help ctPrefGast' for further information.

GAST-all.png
GAST-stripes.png

###  Results

• Optimal filter values (mean):
a = 0.0313
b = 0.1697

This tells us that the optimal filter is approximately the 3-tap filter [0.2,0.6,0.2]. However, as figure 2 shows, the GAST paths ended in a diagonal distribution of optimal results: the variance of the c coefficient is very small, whereas a and b vary across almost the entire range. This tells us that filters along this line give quality equivalent to that of the optimal filter. In other words, as long as we keep the c coefficient of the filter within a certain range, we get good-quality filters, largely independent of the a and b coefficients (up to a certain extent).
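The center tap implied by the mean values follows from $c = 1 - (2a+2b)$. A quick check (the helper name `center_tap` is illustrative only):

```python
def center_tap(a, b):
    """Center coefficient of the symmetric filter [a b c b a] that sums to 1."""
    return 1 - (2 * a + 2 * b)


a_mean, b_mean = 0.0313, 0.1697
c_mean = center_tap(a_mean, b_mean)
print(round(c_mean, 3))  # 0.598
```

With a close to 0, b close to 0.2 and c close to 0.6, the mean filter is near [0.2,0.6,0.2].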

If we try to understand the meaning of the variance in a and b, we can see that the top-left region of the distribution (high a and negative b) creates W-shaped filters, whereas the bottom-right region (negative a and high b) creates M-shaped filters. The center of the distribution gives us Gaussian filters. Another look at figure 2 shows that the central area of this distribution is more populated than the rest, which is not surprising.

Distribution by Letter

Now we would like to take a closer look at the results, and analyze the differences between letters, font styles and subjects. Figure 3 shows plots color coded by these parameters for the Chevron display (the Stripes display had similar results):

 GAST-byletter.png GAST-byfontstyle.png GAST-bysubject.png

We find minor differences between the results for different letters on each of the displays (but these differences are not necessarily consistent between the displays). We will see another expression of these differences when we discuss the S-CIELAB predictions later on.

We also examined the results by font family and by subject, but did not find significant differences.

All data files of the experiment results can be found in the Data sub-directory of the experiment's svn directory (see checkout instructions above). The Data directory is divided into sub-directories by subject name.

Each file holds the results of one run of the GAST script described above (30 GAST runs for the letters of one font family). You can load each file, and then use the command:

ctAnalyzeGAST(TrialStructure)

to see the plots. You will get 4 plots: one for each letter, showing the starting and ending points of each GAST path, and one that shows the ending points of all 30 GAST runs in that session, together with their mean. The numerical values of the means will also appear in your Matlab command window.

If you would like to see more inclusive results for all experiments performed by a subject, or for the entire experiment, browse into the folder you want to analyze (they are all under "data", separated by name, and there's another one with all the results of everybody). When your current Matlab directory is the one you want to analyze, use the function:

ctAnalyzeGASTSubj(mode)

where "mode" can be one of {0,1,2,3} for categorizing the data by {uncategorized, font family, letter, subject} respectively. Here too you will get numerical values of the optimal filters (divided by the category of your choice) in your command window.

##  S-CIELAB predictions

S-SCIELAB.png
S-SCIELAB-GAST.png

After collecting the human preference data, our goal is to find a metric that predicts human preferences for a given display. For this purpose we use S-CIELAB, a spatial extension of the perceptual color metric CIELAB (see X. M. Zhang and B. A. Wandell, A spatial extension to CIELAB for digital color image reproduction, Proc. Society for Information Display Symposium, San Diego, CA., 1996). We use our CtToolBox to perform a grid search for the optimal display rendering filter. Each image rendered with a different filter is compared to a ClearType image rendered on a matched (= same resolution) monochrome display.

This is a full reference metric, meaning that we measure the quality of the rendered image by comparing it to a presumably ideal reference image. Since the reference image is rendered on a monochrome display, it has no color artifacts, and since we do not apply any filtering to the reference image, it has no blur. Therefore a comparison between our rendered test image and this reference image will predict the differences in terms of blur, color artifacts and other deviations from the ideal character.
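The full-reference idea can be sketched as a per-pixel color difference between the test image and the reference image. The sketch below is a simplification under stated assumptions: it uses the plain CIELAB Euclidean difference, it leaves out the spatial filtering in opponent-color space that S-CIELAB applies before this step, and pooling by the mean is an arbitrary choice. The function names are hypothetical, not CtToolBox functions.

```python
import math


def delta_e(lab1, lab2):
    """Euclidean colour difference between two (L*, a*, b*) triples."""
    return math.dist(lab1, lab2)


def mean_delta_e(test_image, reference_image):
    """Mean per-pixel Delta E between two equally sized Lab images.

    Images are lists of rows; each pixel is an (L*, a*, b*) tuple.
    """
    diffs = [
        delta_e(p, q)
        for test_row, ref_row in zip(test_image, reference_image)
        for p, q in zip(test_row, ref_row)
    ]
    return sum(diffs) / len(diffs)


# Two-pixel toy images (made-up values): the second test pixel has a colour error.
reference = [[(100.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
test = [[(100.0, 0.0, 0.0), (0.0, 3.0, 4.0)]]
print(mean_delta_e(test, reference))  # 2.5
```

Both images must be aligned pixel for pixel, which is one of the practical costs of a full-reference metric.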

###  Results

For each given letter, font style, viewing conditions and display, we get a contour map of the S-CIELAB ΔE values over the filter parameter space (a,b), and can find the predicted optimal filter. Figure 4a is one example of such a map.

When we compare these predictions to the human preference data that we collected in the GAST experiment for the same letter, font size, font family, display type and viewing conditions, we find that S-CIELAB predicts human preferences successfully. This can be seen in figure 4b: many GAST runs terminated in the region of filters with the lowest S-CIELAB ΔE values, and the mean of the GAST terminating points for this character falls very close to the predicted optimal value.

This correlation between the S-CIELAB predictions and the human preference data was found for other letters and displays as well. The meaning of these results is that, given a display and its properties, we have a reliable tool to predict the optimal filter for this display without having to perform human preference experiments. Furthermore, as can be seen in the S-CIELAB ΔE contour map, there is a region of optimal filters that produce the same results. We can choose a 3-tap filter from this region, which will be cheaper to compute and will still provide optimal results.

Variance Between Different Letters

When we look at the S-CIELAB ΔE contour maps in the (a,b) parameter space for different letters, we find that the S-CIELAB predictions vary from one letter to another. Figure 5 shows a comparison between the contour maps of three different letters (in addition to the contour map for the letter 's' shown in figure 4a). The region of optimal filters is different for each letter.

 D-georgia-CHEV.png E-georgia-CHEV.png N-georgia-CHEV.png

These are only three examples, but the variation extends across all letters. There is no intersection between the optimal filter regions of all letters. These results mean that different letters require different filters in order to produce optimal results. However, applying an adaptive filter to large bulks of text would require very expensive computation.

CostThreshold.png
CostVariety.png

###  Method for Finding Optimal Filter

In order to find one filter (or region of filters) that will produce optimal results for all letters, we developed the following method:

1. For each filter, sum ∆E across all letters.

2. The filter with the minimal sum gives the best results across all letters.

3. We can define a threshold, and choose all filters with: $sum \le min(sum(\Delta E)) + threshold$
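The three steps can be sketched as follows. This is an illustration under assumptions: the data layout (a dict from filter to per-letter ∆E), the function name `select_filters` and the ∆E values are all made up and are not taken from CtToolBox or from the experiment.

```python
def select_filters(delta_e_by_filter, threshold):
    """Return the minimal-cost filter and all filters within threshold of it.

    delta_e_by_filter maps a filter, e.g. (a, b), to a dict {letter: Delta E}.
    """
    # Step 1: for each filter, sum Delta E across all letters.
    cost = {f: sum(per_letter.values()) for f, per_letter in delta_e_by_filter.items()}
    # Step 2: the filter with the minimal sum.
    best = min(cost, key=cost.get)
    # Step 3: every filter whose sum is within the threshold of the minimum.
    selected = sorted(f for f, total in cost.items() if total <= cost[best] + threshold)
    return best, selected


# Made-up Delta E values for three filters and two letters (illustration only).
toy = {
    (0.0, 0.3): {"g": 2.0, "s": 3.0},
    (0.1, 0.2): {"g": 2.5, "s": 3.25},
    (0.3, 0.0): {"g": 4.0, "s": 5.0},
}
best, selected = select_filters(toy, threshold=1.0)
print(best, selected)  # (0.0, 0.3) [(0.0, 0.3), (0.1, 0.2)]
```

Returning the whole selected set, rather than only the minimum, is what lets a 3-tap filter (a = 0) be picked from it afterwards.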

We refer to this as a "cost function", that tells us the accumulated cost in quality of each filter across all letters. Since we know that no filter will produce optimal result for every single letter, we look for the filter that will give us the minimal deviation from the optimal quality.

The reason we use a threshold in step 3 and choose a group of filters, rather than only the optimal one, is that the single optimal filter might be a 5-tap filter, whereas from a group of filters we can select a 3-tap filter. If the threshold is low, we can be sure that the selected filter gives results that are almost as good as those of the optimal filter.

Figure 6 shows a threshold map of the filters for which $sum \le min(sum(\Delta E)) + 1$. This means that, across all letters, the difference between the worst filter in this region and the optimal filter is at most 1 ∆E. If we divide that by 26 letters, we find that on average such a filter has an error about 0.04 ∆E higher per letter than the optimal filter. This means that there is no visual difference between the optimal filter and the other filters that satisfy $sum \le min(sum(\Delta E)) + 1$. This is why we choose to provide a range of filters, from which we can then pick the filter that is easiest to work with, i.e. a 3-tap filter. For most displays we checked, a threshold of 1 produces a 3-tap filter, but even if we do not find a 3-tap filter for threshold = 1, we can set threshold = 2 and still get a filter that performs practically as well as the optimal one.

In the case of the Chevron display analysis, shown in figure 6, we can choose (0.3,0.4,0.3) as our ideal filter. The sum of ∆E across all letters for this filter is less than or equal to 65.5518 ∆E, which means that the average is 2.52 ∆E per letter.
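The two per-letter averages quoted above are simple divisions by the 26 letters; a quick check (variable names are illustrative):

```python
n_letters = 26
threshold = 1.0          # Delta E allowed above the minimal sum
total_delta_e = 65.5518  # summed Delta E of (0.3,0.4,0.3) on the Chevron display

extra_per_letter = threshold / n_letters
mean_per_letter = total_delta_e / n_letters

print(round(extra_per_letter, 2))  # 0.04
print(round(mean_per_letter, 2))   # 2.52
```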

To make sure that no letter gets a very high ∆E value, which would mean that this filter produces very bad results for some letters, we check the variation in ∆E across all letters for this filter. For each letter, we checked how much greater the ∆E of this filter is than the ∆E of the optimal filter for that letter. As can be seen in figure 7, the cost of the 3-tap filter (0.3,0.4,0.3) did not exceed 0.84 for any letter. For other display and letter combinations we found very few deviations that exceeded 1 ∆E, and even there the visual effect was not noticeable.
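The per-letter check described above can be sketched like this. The data layout, the name `excess_cost` and the ∆E values are assumptions for illustration, not experiment data.

```python
def excess_cost(delta_e_by_filter, chosen):
    """Per letter: Delta E of the chosen filter minus the best Delta E for that letter.

    delta_e_by_filter maps a filter to a dict {letter: Delta E}.
    """
    return {
        letter: value - min(per_letter[letter] for per_letter in delta_e_by_filter.values())
        for letter, value in delta_e_by_filter[chosen].items()
    }


# Made-up Delta E values (illustration only).
toy = {
    (0.0, 0.3): {"g": 2.0, "s": 3.5},
    (0.1, 0.2): {"g": 2.5, "s": 3.0},
}
excess = excess_cost(toy, (0.0, 0.3))
print(excess, max(excess.values()))  # {'g': 0.0, 's': 0.5} 0.5
```

The maximum of these values is the worst-case penalty that any single letter pays for sharing one filter.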

There are many advantages to this method:

• We can expand this method to a full display analysis, across different font families, other characters, different sizes, etc., and then come up with one filter that will provide the best results for that display and will be robust across different texts.
• This method is very cheap in computation - it requires a one-time analysis of the display, which provides one easy-to-deal-with filter.
• This can be provided as part of every display driver (can cover a variety of pixel structures).
• The computation can be adaptive to viewing distance, and can take into account viewers with vision impairments.

##  Next step - Reference Free Metrics

As we explained above, the S-CIELAB metric that we used to predict human preferences is a full reference metric. This means that it calculates the error by looking at the difference (or distance) between the test character and a reference image, which is supposed to represent the ideal character. The logic is that any deviation from the ideal character is undesirable, and is therefore an error.

However, there are several disadvantages and limits to such an approach:

• We are limited to a certain reference image. The choice of such image might be non-trivial and challenging. We chose to use as a reference an image of the same character rendered on a matched (=same resolution) monochrome display, without applying any filtering. However, there are other options, such as using a very high resolution monochrome display as a reference. One can argue that this represents the ideal character more authentically. So the choice is not obvious.
• The metric doesn't measure quality, but rather distance from the reference image. We use it as a way to measure quality based on the assumption mentioned above, that the reference image is ideal and therefore any deviation from it represents a reduction in quality. However, this is not necessarily always true. There could be situations where the differences between the test image and the reference image are in favor of the test image - it might have better contrast etc. The metric will still treat this as an error. This goes back to the previous point, the lack of confidence in the reference image, but here the problem arises from the test image's side.
• Comparison between a test image and a reference image requires perfect alignment between the images. Otherwise, many more false errors will be detected. This alignment presents challenges in computation and representation.

These problems motivated us to develop a reference free metric that measures the quality of the character by direct analysis of the character itself, rather than indirectly by comparing it to a reference image. We want our metric to measure the artifacts and elements in the image that reduce its quality.

###  Reference Free Chrominance Metric

The first artifact we want to detect is color artifacts. The sub-pixel rendering method used by the ClearType algorithm takes advantage of the sub-pixel level to achieve better resolution. However, using this level creates color artifacts in the character. These artifacts are clearly unwanted, and therefore we treat them as errors that we want our reference-free metric to detect.

To accurately measure the visible color artifacts of the image, we need to take the human visual system into account. To do that, we process the image and bring it to its S-CIELAB representation, as we described earlier. This takes into account the spatial resolution that is visible to the eye in each color channel. We then convert the image to L*a*b* representation and measure its $\sqrt{a^2+b^2}$. This enables us to measure only the color energy of the image, and not the luminance and spatial errors.
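A minimal sketch of the chroma measurement, assuming the image has already been spatially filtered and converted to L*a*b* (that S-CIELAB step is not shown). The function names and the pooling by the mean are assumptions for illustration, not the ctFontEval implementation.

```python
import math


def chroma_map(lab_image):
    """Per-pixel chroma sqrt(a*^2 + b*^2) for an image given as rows of (L*, a*, b*)."""
    return [[math.hypot(a, b) for (_lightness, a, b) in row] for row in lab_image]


def mean_chroma(lab_image):
    """Pool the chroma map into one number (mean pooling is an assumption)."""
    values = [c for row in chroma_map(lab_image) for c in row]
    return sum(values) / len(values)


# Two-pixel toy image (made-up values): one neutral pixel, one coloured pixel.
toy_image = [[(50.0, 0.0, 0.0), (60.0, 3.0, 4.0)]]
print(chroma_map(toy_image))   # [[0.0, 5.0]]
print(mean_chroma(toy_image))  # 2.5
```

Because L* is ignored, a change in luminance alone leaves this score unchanged.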

Below is an example of the reference-free chrominance metric's results for an optimal filter (figure 8a) and the worst filter (figure 8b) for a certain letter (the figures were generated by the CtToolBox plotting tools):

 OptFilter-chrom.png WorstFilter=chrom.png

The difference is very clear to the eye. In a sanity check on other filters, letters, font styles and displays, we found that in more than 90% of the cases the metric predicts color artifacts successfully.

Open Issues

• The error image shows errors not only at the edges of the character, but also inside the character itself. The reason might be that there is color energy there (in the a* and b* channels), but since there is no luminance it is invisible. Since these errors are invisible, we do not want our metric to count them. In the reference free chrominance code (in ctFontEval), we added an option to put a threshold on the L* channel and count chrominance error only in pixels that exceed this threshold. So far it has not seemed to help - but it is there to examine.
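The thresholding option can be pictured as a mask on L*. The sketch below is not the ctFontEval code; the function name, the argument name `l_threshold` and the pixel values are hypothetical.

```python
import math


def thresholded_chroma(lab_image, l_threshold):
    """Sum chroma over pixels whose L* exceeds l_threshold; darker pixels are ignored."""
    return sum(
        math.hypot(a, b)
        for row in lab_image
        for (lightness, a, b) in row
        if lightness > l_threshold
    )


# Made-up pixels: a dark pixel inside the stroke and a bright pixel at its edge.
toy_image = [[(5.0, 3.0, 4.0), (80.0, 6.0, 8.0)]]
print(thresholded_chroma(toy_image, l_threshold=10.0))  # 10.0
print(thresholded_chroma(toy_image, l_threshold=0.0))   # 15.0
```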

###  Reference Free Blur Metric

The second artifact we want to detect is blur. To correct the color artifacts generated by the sub-pixel rendering, the ClearType algorithm uses spatial filtering to smooth these artifacts. However, the trade-off of this solution is the creation of blur in the character. People prefer sharp characters over blurred ones. Thus, we treat blur as an error that we want our reference-free metric to detect.

Studies show that noticeable blur is a function of two properties: edge width and contrast. The wider the edge - the more noticeable the blur, and higher contrast will produce more noticeable blur than lower contrast for the same edge width. This is why we've created a metric that will measure the blur as a product of edge width and contrast.

In order to represent the image correctly, taking into consideration the human visual system, here too we're processing the image to its S-CIELAB representation, and we're measuring the width and the contrast in the S-CIELAB opponent-filter Luminance channel (brighter or darker). Since we're dealing now only with one-dimensional horizontal filters, we need to measure blur only on the horizontal dimension. Once we use two-dimensional filters, we can extend this method to two-dimensional edge measurements.

For each pixel we measure its derivative on the horizontal dimension, as a representation of its contrast, multiplied by the width of the edge this pixel is part of. We measure width by the number of consecutive pixels that change in the same direction (brighter or darker).
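A minimal sketch of this measurement on a single channel, under assumptions: the input is taken to be the already filtered luminance image as rows of numbers, the derivative is a simple difference between horizontal neighbours, the width of an edge is counted as the number of consecutive differences with the same sign, and the per-pixel values are pooled by summation. The function names are hypothetical.

```python
def row_blur(row):
    """Sum over one image row of |horizontal difference| x width of its edge.

    An edge is a run of consecutive differences with the same non-zero sign;
    its width is the number of differences in the run.
    """
    diffs = [right - left for left, right in zip(row, row[1:])]
    total = 0.0
    start = 0
    while start < len(diffs):
        sign = (diffs[start] > 0) - (diffs[start] < 0)
        end = start + 1
        while end < len(diffs) and ((diffs[end] > 0) - (diffs[end] < 0)) == sign:
            end += 1
        if sign != 0:  # flat runs contribute nothing
            width = end - start
            total += width * sum(abs(d) for d in diffs[start:end])
        start = end
    return total


def blur_score(luminance_image):
    """Pool the row scores of an image (summation is an assumption)."""
    return sum(row_blur(row) for row in luminance_image)


sharp = [0.0, 0.0, 0.0, 1.0, 1.0]      # full contrast reached in one step
blurred = [0.0, 0.25, 0.5, 0.75, 1.0]  # same contrast spread over four steps
print(row_blur(sharp), row_blur(blurred))  # 1.0 4.0
```

Both rows have the same total contrast, so the difference in score comes only from the edge width.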

Below is an example of the reference-free blur metric's results for an optimal filter (figure 9a) and the worst filter (figure 9b) for a certain letter (the figures were generated by the CtToolBox plotting tools):

 OptFilter-blur.png WorstFilter-blur.png

Here too the difference is very clear to the eye. In a sanity check on other filters, letters, font styles and displays, we found that in around 75% of the cases the metric predicts blur artifacts successfully.

Open Issues

• The metric fails for letters with closely spaced vertical strokes (such as 'm'), where neighboring edges interfere with each other under high blur. We need to think about a solution for these situations.

###  Future Work

• The reference free metrics suggested here are in an early stage of work. More work needs to be done in modifying the accuracy of these metrics using human perception data, as well as calibration to meaningful units.
• We developed reference free metrics for color artifacts and blur. However, there are other image quality attributes that these metrics do not measure, such as edge continuity. There is a need to develop reference free metrics that measure these as well.
• Finally, each one of these metrics measures one artifact that reduces image quality. There is a need to develop a full image quality reference free metric that takes all the artifacts into account and weighs them into one value. This metric will be a function of all the other metrics, and will quantify the weight given to each artifact in the overall image evaluation. One possibility is to use a linear combination of the values of the chrominance, blur and other artifact metrics. The correct weight for each metric in this linear combination will have to be found from human preference data, by choosing the coefficients that optimize the predictions.
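As one possible reading of the linear-combination idea, the sketch below fits two weights by ordinary least squares so that a weighted sum of chrominance and blur scores matches observed ratings. Least squares is only one assumed way to "optimize the predictions", the function name is hypothetical, and all numbers are made up.

```python
def fit_two_weights(chroma, blur, observed):
    """Least-squares weights (w_c, w_b) for observed ~ w_c * chroma + w_b * blur.

    Solves the 2x2 normal equations directly (no intercept term).
    """
    scc = sum(c * c for c in chroma)
    sbb = sum(b * b for b in blur)
    scb = sum(c * b for c, b in zip(chroma, blur))
    sco = sum(c * o for c, o in zip(chroma, observed))
    sbo = sum(b * o for b, o in zip(blur, observed))
    det = scc * sbb - scb * scb
    if det == 0:
        raise ValueError("chroma and blur scores are collinear; weights are not unique")
    w_c = (sco * sbb - sbo * scb) / det
    w_b = (sbo * scc - sco * scb) / det
    return w_c, w_b


# Made-up metric values for four stimuli; ratings built as 2*chroma + 3*blur.
chroma = [1, 2, 3, 4]
blur = [4, 1, 3, 2]
observed = [2 * c + 3 * b for c, b in zip(chroma, blur)]
print(fit_two_weights(chroma, blur, observed))  # (2.0, 3.0)
```

Because blurrier characters tend to have fewer color artifacts, real chroma and blur scores are likely to be correlated, which would make such weights less stable than in this toy example.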

#  Past Experiments

##  2009 experiment

• See papers
• Notes
• We varied a and b parameters and set c = 1 - (2a + 2b)
• At the time we ran the experiment, we did not realize that c ≤ 1.0 was enforced as a constraint. Later, we realized that in cases where c > 1.0, a and b were reset to 0 and c was set to 1.0. Hence, the letters created with a=b=0 and c=1.0 were over-represented in the data.
• Given this error, should we treat the 2009 experiment as a pilot experiment? This would be a shame since we compared two different displays (Chevron and vertical pixels shape). Of course, we could repeat the 2010 experiment (diagonal pairwise condition only) on the display with the Chevron pixels. This display is in the lab in the Packard Building.
• We need to change Figure 1 in Visual preference for ClearType technology, SID 2009
• Below are Figures 2009-1a and 2009-1b that have been appropriately adjusted
 Figure 2009-1. Space of possible filter parameters. Each point represents the a and b filter coefficients for a particular letter. Lines connect the filter values of letters that were presented in a given trial.

• Future Experiments
• we should have people compare ClearType characters to the case where there was no ClearType
• we could reduce the number of comparisons by testing the diagonal region in a b parameter space that was most preferred
• we could also use a finer granularity for a and b (in other words, vary a and b in steps of 0.1 or 0.05) - could get greater sensitivity
• we could compare results that we get when we use pairwise comparisons to the results we get when we present many alternatives and have subjects pick the rendition they like the best - could get greater sensitivity

##  Diagonal experiment - OLD

Note: At some point during the 2010 experiments, after the GAST procedure, we wanted to try to refine the optimal filter region by searching only in certain limited range. We then went on to perform the experiments described below. At the end, we didn't find the results of these experiments to be very helpful, so we continued our research in a different path. Anyway, we left the documentation here for reference.

As we said earlier, our experiment produced a range of filters, distributed along the a-b diagonal, with little variance in the c coefficient.

To complete this experiment and better define the actual optimal filter, we performed another set of experiments. This time we sampled the diagonals $c = 0.2$ and $c = 0.4$ in steps of 0.1 over the a and b range. We got 19 samples, which allows us to run a traditional full pairwise comparison experiment between all the pairs in this set (171 pairs). We repeated each comparison 5 times for each one of the letters 's', 'v' and 'g' (georgia 12). Altogether we got 171 pairs x 5 comparisons x 3 letters = 2565 votes per subject.
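The sample and vote counts can be reproduced by enumerating the two diagonals on the same 0.1 grid with $-0.6 \le a,b \le 0.6$. The helper name `diagonal_samples` is illustrative only.

```python
import math


def diagonal_samples(c_tenths):
    """Grid points (a, b), in tenths, on the diagonal with the given c (in tenths)."""
    # c = 1 - 2(a + b)  =>  a + b = (1 - c) / 2, here expressed in tenths.
    s = (10 - c_tenths) // 2
    return [(i, s - i) for i in range(-6, 7) if -6 <= s - i <= 6]


samples = diagonal_samples(2) + diagonal_samples(4)   # c = 0.2 and c = 0.4
n_pairs = math.comb(len(samples), 2)
votes_per_subject = n_pairs * 5 * 3                   # 5 repeats x 3 letters

print(len(samples), n_pairs, votes_per_subject)  # 19 171 2565
```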

For this experiment we had the same 3 subjects as in the GAST experiment. However, as opposed to the GAST experiment, this time the subjects had only 2 options - the right character is better or the left character is better.

You can browse the script of this experiment, and most of its sub-functions here. In order to run the experiment run the script:

ctPrefDiag

In the same svn directory of the GAST experiment. Here too you must have ctToolbox, Psychtoolbox and ISET-4.0 on your path. Type 'help ctPrefDiag' for further information.

Diag-all.png

###  Results

Figure DIAG-1 is a contour map of the results of the diagonal experiment. We can see that the bottom-right half of the range (Gaussian and M-shaped filters) was preferred by the subjects over the top-left area (W-shaped filters). It can also be seen that we have a peak at the point:

a = 0
b = 0.3

This tells us that we can use a 3-tap filter [0.3,0.4,0.3] as the optimal filter instead of a 5-tap filter. However, here too, if we look across the letters, we will find noticeable differences.

Results by letter

Figure DIAG-2 shows a comparison between the patterns of the contour maps of each letter:

 All-s.png All-v.png All-g.png

We can see that the tendency to prefer the right-bottom region of the spectrum existed only for the letters 'v' and 'g'. For 's' the results are much more ambiguous.

Looking into the numerical vote results, we found that the variance between the votes for the different filters of 's' across the filter space was very small. The filter that got the highest number of votes had 163 votes out of the 270 comparisons it participated in, which is approximately 60% of the votes (62%,61%,61% for subjects 1,2,3 respectively), whereas the lowest score was 105, which is ~39% of the votes (40%,38%,39% for subjects 1,2,3 respectively). In the global count, 17 out of 19 filters were preferred in 45-54% of their comparisons. In pairwise comparison tests, a situation like this is called a "confusion", meaning that there is no clear preference for one stimulus or a set of stimuli over the others. As we showed, this is consistent for all 3 subjects.
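The percentages above can be checked as follows. The decomposition of 270 into 18 opponents x 5 repeats x 3 subjects is an inference from the experiment design described earlier, not a figure stated in the data files.

```python
# Each of the 19 filters meets 18 others, 5 times, for each of 3 subjects (assumed).
comparisons_per_filter = 18 * 5 * 3

highest_share = 100 * 163 / comparisons_per_filter
lowest_share = 100 * 105 / comparisons_per_filter

print(comparisons_per_filter)   # 270
print(round(highest_share))     # 60
print(round(lowest_share))      # 39
```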

Since the subject data for 's' did not define a preferable region of filters, we treated the entire filter space for 's' as being of equal quality, and then applied its minimal value (34 votes per filter) as our threshold for the other letters' data. In other words, we wanted to test whether this threshold defines a specific region of values in the parameter space of 'g' and 'v', since the scores there were clearer. Using this threshold we compared the region of "good" filters for the letters 'g' and 'v' for each subject.

Figure DIAG-3 shows the threshold maps for v and g - the three lines represent the threshold of each subject.

 Threshold-v.png Threshold-g.png

As can be seen - for these two letters the results are very clear, and there's no ambiguity. They are consistent across subjects, and define a very clear region of preferred filter values.

Therefore, we concluded that we can use the overlapping region of each letter as the preferred set of filters for that letter across all subjects, and ignore the differences between the subjects.

All data files of the experiment results can be found in the 'Data' sub-directory of the experiment's svn directory (see checkout instructions above). The Data directory is divided into sub-directories by subject name.

If you would like to see these contour maps for each subject, and/or for each letter, browse into the folder you want to analyze (they are all under "data", separated by name, and there's another one with all the results of everybody). When your current Matlab directory is the one you want to analyze, use the function:

ctAnalyzeDiagSubj

#  Human optical and retinal sampling

This year we proposed:

• Group 1 Objectives:
• Paper on the metric (ISET) - Completed
• Paper on the subject preference data - Completed
• Write a review paper - in process
• Group 2 Objectives
• Model the visual encoding of the rendered font
• Develop metrics to characterize edge contrast and contours

TODO

• Joyce:
• write outline of the review paper
• read Pelli paper and make proposal for next project
• Brian:
• simulation of retinal irradiance image (done)
• metric for characterizing edge contrast in retinal irradiance
• something like the ISO 12233?
• method and metric for characterizing contours in retinal irradiance image
• contours derived from retinal irradiance image are the same for bad and good ClearType versions so perhaps this is not a good metric (JF)
• Comment There are two things that people identify as objectionable when the parameters of ClearType characters are sub-optimal -- 1) the visibility of color artifacts and 2) the perception of blur. SCIELAB does not separate the visibility of color artifacts from the perception of blur. Perhaps it would be useful to have metrics that identify and characterize these two features, independently. Separate metrics could be useful for other applications as well (e.g. demosaicking).
• simulation of human cone mosaic sampling
• metrics here?

#  Reference free metrics

Microsoft: no-reference metrics for blur and color artifacts

Kevin would like us to propose and test "various approaches to quantifying edge quality (blur or sharpness, contrast, continuity, etc.)"

In our 2009-2010 research proposal to Microsoft, we identified three areas for reference-free metrics

• 1. Visibility of color fringing
• Possible metric: S-CIELAB (a, b) values, say to assess the visibility of color in the rendered character
• 2. Visibility of sharpness and blur
• Possible metrics – these metrics need to be modified to reflect visual filtering – perhaps they are calculated on the SCIELAB luminance image
• Sharpness: slope across an edge
• Contrast: max/min
• 3. Contour continuity
• Note that edges that are not supposed to be connected may become connected if the character is blurred
• tradeoffs between edge-sharpness and color artifacts.
• Increasing blur will reduce color artifacts (a good thing) but decrease sharpness (a bad thing). How sharpness and color trade-off against one another could be evaluated by experimental measurements.
• impact of individual differences and the impact of display properties

#  Psychophysical Experiments 2010

• Pairwise presentation of two different versions of the same character
• Visibility of color artifacts: In one block of trials, indicate which of the two versions has more visible color artifacts
• Can we predict the results using the difference between the magnitude of the metric (e.g ) for the two different versions?
• Visibility of blur: In one block of trials, indicate which of the two versions appears to be more blurred
• Visibility of contour continuity: In one block of trials, indicate which of the two versions appears to have less contour continuity

Experiment 2

• Determine which type of artifact is more annoying
• Pairwise comparisons of the stimuli
• Which of two versions do people prefer?
• Can we predict results by separable functions for visibility of color artifacts, blur and/or continuity? Or is there an interaction?
• Compare to the predictions in which we use the reference metric based on S-CIELAB
• Judgments of visibility of color artifacts are not independent of visibility of blur because of the nature of the stimuli. Characters that have more color artifacts are less blurred, and vice versa.
• Note that this is also true about demosaicking –the more you blur, the less color artifacts you see

TODO