
Image Classification for Coding

I. Introduction

Compression concerns itself with the reduction of redundant or less perceivable information.  When compressing images that contain both graphics and text, the compression artifacts are often more obvious and irritating in the textual regions than in the graphics regions.  This is because graphics and text have different properties and convey different types of information to the viewer.  An image coder that takes advantage of these differences will realize efficiency gains by treating text differently than graphics.

We first examine the differences between text and graphics in section II.  In sections III and IV, we explore methods of distinguishing text from graphics in images.  Finally, in section V, we illustrate compound  coding with an example.

II. Comparing Properties of Text and Graphics

  We present Rate vs. Peak Signal-to-Noise Ratio (PSNR) curves for five text-only documents, five graphics-only documents, and five compound documents.  These curves show that for a given PSNR, text requires a higher bitrate than graphics.  We also show that the coding of compound documents can be improved simply by using different codebooks for text and images.

A. Methodology for Generating PSNR vs. Rate Curves

1. Coding

We transform the images with 8-by-8 DCT blocks. We then quantize each of the 64 coefficients of a block using the scaled de facto JPEG quantization matrix ([i], JPEG standards).

Figure 1: DCT quantization matrix with scale factor 1

2. Rate Calculation

To compute rates for a transform coder, we needed histograms of DCT coefficient frequencies for each combination of quantizer scale, class of image (graphics, text, or compound), and DCT coefficient number (1-64). For each class, we used 5 image samples to generate the histograms, 15 images in total.

3. PSNR Calculation

We compute PSNR in the usual fashion.
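For concreteness, a minimal sketch of that standard computation for 8-bit images (a 255 peak and the helper name `psnr` are our assumptions):

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10*log10(peak^2 / MSE)."""
    diff = np.asarray(original, float) - np.asarray(reconstructed, float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```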

4. Image Set

We used the following images:

a) Text images 1), 2), 3), 4), 5)

b) Graphic images 1), 2), 3), 4), 5)

c) Compound images 1), 2), 3), 4), 5).

B. PSNR vs. Rate Plots for Text and Graphics

Figure 2: Rate vs. PSNR for Text

Figure 3: Rate vs. PSNR for Graphics

Figures 2 and 3 show that for a given bit rate, the PSNR for graphics is much greater than for text. A bit rate of 1 bit/pixel yields a median graphics PSNR of 31 dB, while the median PSNR for text is only 19 dB.  Notice also that the slope of the curves for text is 8 dB per bit, compared to 4-6 dB per bit for graphics.  Moreover, at low bit rates the curves for graphics become much steeper, indicating a rapid incremental improvement.

C. PSNR vs. Rate for Compound Documents

Figure 4 shows the rate-PSNR curves for compound documents.  They are quite similar to the graphics rate-PSNR curves, but at a lower PSNR, as we would expect.  Next, in Figure 5 we show that by coding the text and graphics portions individually, we can increase coding efficiency by using codebooks tailored to text.  Even after the added cost of 1/64 bit per block to specify its type, the savings is still about 0.5 bits.  Note that to produce Figure 5, we manually segmented the text from the graphics.

Figure 4: Compound documents coded throughout with one set of codebooks

Figure 5: Compound image with text and graphical portions coded individually.

We have shown that text and graphics have different properties and should therefore be treated separately by coders.

III. Classification

Each of our algorithms computes an 'activity' score for blocks in an image.  We then apply a threshold to the score: blocks below the threshold are classified as graphics, and blocks above it as text.

A.  Classification Algorithms

1. Block Based Algorithms

The following algorithms [iii], [iv] rely only on the 8-by-8 image blocks, without needing any transform.

a) Range

This first block-based algorithm is based on the observation that text blocks are likely to have a much higher dynamic range than non-text blocks.  Here, the activity value of each block is simply the range of its pixel values, which is computationally trivial.

b) Variance

This algorithm scores based on pixel variance within a block, since text blocks are likely to have a higher variance than non-text blocks.

c) Absolute-Deviation

This algorithm uses the mean absolute deviation of each block as its activity.
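A minimal sketch of these three block-based activity measures (the function names are ours):

```python
import numpy as np

def activity_range(block):
    """Dynamic range of pixel values: cheap, text blocks score high."""
    return float(block.max() - block.min())

def activity_variance(block):
    """Pixel variance within the block."""
    return float(np.var(block))

def activity_abs_deviation(block):
    """Mean absolute deviation from the block mean."""
    return float(np.mean(np.abs(block - block.mean())))
```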

d) Sobel-Filter

Since text blocks are likely to have more edges than graphics blocks, this method measures edge content using the eight-neighbor Sobel edge detection filter: for each pixel of the block, we apply the filter, and we sum the filtered results over the block.

Here, H detects horizontal edges, and V = transpose(H) detects vertical edges. We first convolve the 3-by-3 neighborhood centered on each pixel with H and V, adjusting the result at the borders of the 8-by-8 block. This gives X and Y, the horizontal and vertical edge detection results. We then compute the Sobel gradient for the pixel, and sum the gradients over the block to obtain the score.
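A sketch of this activity measure; the H kernel below is the standard Sobel operator, and replicate-padding at the block borders is our guess at the 'adjustment' at the edges:

```python
import numpy as np

H = np.array([[-1, -2, -1],
              [ 0,  0,  0],
              [ 1,  2,  1]], float)    # detects horizontal edges
V = H.T                                # detects vertical edges

def sobel_activity(block):
    """Sum of Sobel gradient magnitudes over an 8-by-8 block."""
    p = np.pad(block.astype(float), 1, mode="edge")  # replicate borders
    total = 0.0
    for r in range(block.shape[0]):
        for c in range(block.shape[1]):
            win = p[r:r + 3, c:c + 3]
            x = np.sum(win * H)        # horizontal edge response
            y = np.sum(win * V)        # vertical edge response
            total += np.hypot(x, y)    # gradient magnitude
    return total
```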

2. DCT Based Algorithms

a) Introduction

Energy is distributed differently among DCT coefficients for text and non-text blocks.  Thus we segment by examining the set of DCT coefficients that best captures this difference between text and graphics, and then comparing the energy or absolute sum of these coefficients (the activity) to a threshold.
For instance, as in [v], the following graph, extracted from our codebook, shows the difference between the energy of text and graphics for various DCT coefficients in a run-length coding (quantization matrix with scale factor 0.5), especially for coefficients 29, 39, 51 and 60.

Similarly, the absolute sum also shows an important difference between graphics and text, so we do not necessarily have to use energy (the absolute sum is faster to compute).

The following chart shows that a much coarser quantization produces an even greater difference. (We use a scale of 5; the zero probabilities of the codebook are approximated by a very low value, as shown in the chart for graphics.)

Unless otherwise specified, the following schemes are all applied to a quantized matrix with a scale factor of 1.

b) DCT Energy

This scheme computes the energy of the quantized DCT block, i.e. the sum of the squares of its 64 coefficients, to get the activity of the block.

c) DCT Absolute-Sum

This scheme simply replaces the sum of squares from the energy calculation with the absolute sum, which is easier to compute.

d) DCT-18 Absolute-Sum

This scheme computes the same sum using only the 18 coefficients identified in [iv], instead of all 64, because they effectively capture the difference between text and non-text blocks. Counting coefficients from 1 in row-major order, the 18 coefficients are: [4 5 6 12 13 14 20 21 22 44 45 46 52 53 54 60 61 62].
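A sketch of these three DCT activities on a quantized 8-by-8 block (function names ours; the 18 indices are converted from the 1-based row-major list above):

```python
import numpy as np

# 1-based row-major coefficient numbers from [iv], converted to 0-based
COEFFS_18 = np.array([4, 5, 6, 12, 13, 14, 20, 21, 22,
                      44, 45, 46, 52, 53, 54, 60, 61, 62]) - 1

def dct_energy(qblock):
    """Sum of the squares of the 64 quantized coefficients."""
    return float(np.sum(np.asarray(qblock, float) ** 2))

def dct_abs_sum(qblock):
    """Cheaper variant: sum of absolute values."""
    return float(np.sum(np.abs(qblock)))

def dct18_abs_sum(qblock):
    """Absolute sum restricted to the 18 selected coefficients."""
    return float(np.abs(np.asarray(qblock)).ravel()[COEFFS_18].sum())
```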

e) DCT Bitrate

The paper by Konstantinides and Tretter [vi] examines the information contained in each coefficient instead of simply computing a sum.  It uses a function that assigns to every coefficient the approximate bit length it would occupy in a run-length coding of the DCT block: zero coefficients take 0 bits, and non-zero coefficients take approximately log2(|coefficient|)+4 bits. The activity is then the sum of these values over all coefficients.
This paper essentially exploits the fact, shown in Section II, that text blocks use many more bits than graphics blocks.
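A sketch of this bit-length activity (zero coefficients contribute 0 bits; non-zero coefficients approximately log2(|c|) + 4):

```python
import numpy as np

def dct_bitrate_activity(qblock):
    """Approximate run-length code size of a quantized DCT block."""
    c = np.abs(np.asarray(qblock, float)).ravel()
    nz = c[c > 0]                      # zero coefficients cost nothing
    return float(np.sum(np.log2(nz) + 4.0))
```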

3. An Efficient Innovation of This Project: Delta DCT Codebook Based Algorithms

a) Introduction

Having analyzed the existing schemes, we propose what we believe are truly innovative algorithms; our tests show that they indeed produce superior results.  We observe that the difference between text and graphics coefficients in DCT blocks is not simply an energy difference: rather, each coefficient's values are distributed differently in text and graphics blocks.  For instance, suppose that in sample codebooks for text and graphics, coefficient 2 takes the value 1 with probability 0.1 in text but only 0.001 in graphics. Then if a block of unknown type has the value 1 in coefficient 2, it is much more likely to be text than graphics.

b)  Delta Probabilities

Delta-Probabilities is the simplest codebook-based scheme. For each coefficient, we look up the probability of its value in the text codebook, P(text), and in the graphics codebook, P(graph), and take the difference delta = P(text) - P(graph). The activity score is the sum of these deltas over the block (the higher the score, the more likely the block is text). Where the DCT Bitrate scheme assigns each coefficient a fixed approximate code length, here we use the codebook probabilities for both classes and examine their difference.
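A sketch of the score; the codebooks here are hypothetical per-coefficient dictionaries mapping quantized values to probabilities, and the floor probability for unseen values is our assumption:

```python
import numpy as np

def delta_probabilities(qblock, text_cb, graph_cb, floor=1e-4):
    """Sum over coefficients of delta = P(text) - P(graph).

    text_cb/graph_cb: 64 dicts mapping a quantized value to its
    codebook probability (hypothetical stand-ins for trained codebooks).
    """
    score = 0.0
    for k, v in enumerate(np.ravel(qblock)):
        p_text = text_cb[k].get(int(v), floor)
        p_graph = graph_cb[k].get(int(v), floor)
        score += p_text - p_graph
    return score
```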

c) Delta Probabilities - Low Energy

Should we bias our scheme towards high-energy coefficients in an effort to better characterize text? On the contrary, our results indicate that low-energy coefficients are much more effective.  The big differences between text and graphics do not occur in the volatile high energies (concentrated in the top-left of the DCT matrix, around the DC value), so biasing towards these high energies is unproductive.
Thus, this scheme multiplies each delta by 1/(1+energy) before summing.

d) Delta Probabilities - High Probability

This scheme biases the score towards coefficients with higher probabilities by using the difference of the squares of the probabilities.

e)  Delta Probabilities - High Difference

This variant biases the score towards coefficients with the biggest differences by summing delta^3 instead of delta.

f)  Delta Probabilities - 18 Coefficients

This scheme sums the deltas for the 18 coefficients,  instead of all 64 coefficients.

g)  Delta Probabilities - Horizontal, Vertical and Horizontal&Vertical High Frequency; Horizontal Very High Frequency

These schemes favor high frequencies in the DCT block, recalling that text images have sharper high-frequency components. Thus, for the (i,j) coefficient of a DCT matrix we change the activity value from delta to:
j*delta (horizontal high frequency),
i*delta (vertical high frequency),
i*j*delta (horizontal & vertical high frequency).
Moreover, horizontal frequency is the most characteristic of text (consider the shape of the letters), so we also tested schemes measuring only the very high horizontal frequencies.
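These weightings amount to multiplying each delta by its position in the DCT matrix before summing; zero-based indices are our assumption here:

```python
import numpy as np

def weighted_delta_sum(delta, mode="horizontal"):
    """Sum of per-coefficient deltas weighted by DCT position (i, j)."""
    i, j = np.indices(delta.shape)     # i: vertical freq, j: horizontal freq
    if mode == "horizontal":
        w = j                          # favor horizontal high frequencies
    elif mode == "vertical":
        w = i
    else:
        w = i * j                      # horizontal & vertical
    return float(np.sum(w * delta))
```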

h) Delta Probabilities - Horizontal High Frequency and Low Energy

This scheme simply combines two variants: horizontal high frequency and low energy.

4. DCT Codebook Entropy Based Algorithms

a) Introduction

Some DCT coefficients have better differentiating power than others, so we assign a weight to each coefficient when calculating Delta Probabilities. First, we look for coefficients whose graphics values are very often zero or ±1 while their text values spread over several values. To capture this, we introduce a new measure: the entropy of the coefficient's probability distribution.

b) Entropy Ratio Weights

This scheme takes the ratio (text entropy)/(graph entropy) as a weight for the deltas, to single out the good coefficients: Entropy Ratio Weight = H_text(i) / H_graph(i).

c) Entropy Difference Weights

This scheme takes the difference (text entropy)-(graph entropy) as a weight for the deltas: Entropy Delta Weight = H_text(i) - H_graph(i).
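A sketch of the two weights for one coefficient, given its value distribution in each codebook (helper names are ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a probability distribution."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def entropy_ratio_weight(p_text, p_graph, eps=1e-12):
    return entropy(p_text) / max(entropy(p_graph), eps)

def entropy_diff_weight(p_text, p_graph):
    return entropy(p_text) - entropy(p_graph)
```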

5. DCT Codebook Bayesian Based Algorithms

a) Introduction

This scheme is based on the MAP (Maximum a Posteriori Probability) rule: we choose the class (text or graphics) whose codebook assigns the higher probability to the observed block, i.e. the larger sum of log-probabilities over its coefficients.

b) DCT Codebook MAP (Maximum a Posteriori Probability) Rule

This implementation follows the rule above directly.  We force each probability to be greater than a small value such as 0.001, to avoid taking the log of zero.
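A sketch of this decision under equal priors, with the 0.001 floor mentioned above; the codebooks are again hypothetical per-coefficient dictionaries:

```python
import numpy as np

def map_classify(qblock, text_cb, graph_cb, floor=1e-3):
    """Pick the class whose codebook gives the larger log-likelihood."""
    log_text = log_graph = 0.0
    for k, v in enumerate(np.ravel(qblock)):
        log_text += np.log(max(text_cb[k].get(int(v), 0.0), floor))
        log_graph += np.log(max(graph_cb[k].get(int(v), 0.0), floor))
    return "text" if log_text > log_graph else "graph"
```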

6. Wavelet-Based Algorithms


    The wavelet transform is an effective means of mapping an image from the space domain to the frequency domain. In a typical wavelet coding scheme, an image's frequency components are subdivided recursively, refining the lower subband (in the two-band decomposition case) at each step.

    The Idea

    We chose a classification method proposed by Li and Gray[i].  In a continuous-tone picture, one can model the higher frequency transform coefficients with a Laplacian distribution. Textual areas, however, are characterized by coefficient values clustered  around a handful of peak values.

    Example: cnntext.tif

    cnntext.tif, in GIF format

    cnntext.tif can be considered a typical case where textual coding is desired.

    This illustrates a typical histogram for the wavelet coefficients in a text region. Note that most of the coefficients are clustered around zero: in text, the background color remains dominant. Also note that the remaining coefficients are strongly clustered around only a few peaks.

    Example: cnnpic.tif

    cnnpic.tif, in GIF format

    Clearly, cnnpic.tif has different characteristics from the textual image above. By inspection, we can see a variety of both high-frequency and low-frequency components.

    This distribution is largely typical of picture regions in images. Although the histogram appears to have a sharper peak, the Laplacian remains a reasonable approximation.

    Example: newcnn512x512.tif

    newcnn512x512.tif, in GIF format

    newcnn512x512.tif  has characteristics of both text and picture regions, and cannot be clearly classified as one or the other.

    Again, the histogram agrees with the quick visual classification: although one can see some discrete peaks away from zero (showing textual characteristics), one can also see the gradual falloff from the peak centered at zero (showing the smooth picture characteristics). Thus the image is a combination of text and picture.

    Quantifying the Idea

    Unfortunately, the pattern-recognition skills that allow humans to visually classify regions quickly are non-trivial to implement, so we turn to something more suited to a digital computation.

    Metric 1: Chi^2 goodness of fit

    One metric for comparing a histogram with the Laplacian is the Chi^2 test. This test measures how well an experimental sample fits a hypothetical expectation.

    For more details about calculating, please consult [Li and Gray]

    Metric 2: How discrete are those peaks, really?

    A key feature of the textual coefficient distribution is that the coefficients cluster around a few discrete values. To measure this, we divide the histogram into zones, each containing one peak value, with the pmf on either side of the peak monotonically non-increasing with increasing distance from it. We then calculate the percentage of the values in each zone that are contained in the peak, giving L. In the textual case, nearly all of the values in a zone should lie within tight proximity of the peak.
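A simplified sketch of this metric: we split zones at local minima of the pmf and read 'the peak' as the bins within one bin of each zone's maximum (both choices are our assumptions):

```python
import numpy as np

def peak_concentration(pmf, radius=1):
    """Fraction of histogram mass lying within `radius` bins of a zone peak."""
    pmf = np.asarray(pmf, float)
    cuts = [0]                                  # zone boundaries
    for k in range(1, len(pmf) - 1):
        if pmf[k] < pmf[k - 1] and pmf[k] <= pmf[k + 1]:
            cuts.append(k)                      # strict local minimum
    cuts.append(len(pmf))
    near_peak = 0.0
    for a, b in zip(cuts[:-1], cuts[1:]):
        zone = pmf[a:b]
        if zone.sum() == 0:
            continue
        p = int(np.argmax(zone))
        lo, hi = max(0, p - radius), min(len(zone), p + radius + 1)
        near_peak += zone[lo:hi].sum()          # mass close to the peak
    return near_peak / pmf.sum()
```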

    Making sense of the numbers

    Each metric provides a numerical score, which is mapped linearly onto [0,1]. Extreme values saturate. 

    The theta values here are merely thresholds which determine when the mapping functions saturate. Experimentally, we found that Chi^2 was interesting over [10^4, 10^12] and L was interesting over [0.7, 1].

    The output classification value is a plain arithmetic mean of the two delta values. We experimented with the geometric mean, but did not find that the results were significantly better.
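A sketch of the saturating linear mapping and the final mean. With the thetas above, this appears to reproduce the delta values in the worked example given later for cmp5-text.tif:

```python
def delta_map(x, theta_lo, theta_hi):
    """Map x linearly onto [0, 1], saturating outside [theta_lo, theta_hi]."""
    if x <= theta_lo:
        return 0.0
    if x >= theta_hi:
        return 1.0
    return (x - theta_lo) / (theta_hi - theta_lo)

def classify_score(chi2, L):
    """Arithmetic mean of the two mapped metrics (1 = text, 0 = picture)."""
    return 0.5 * (delta_map(chi2, 1e4, 1e12) + delta_map(L, 0.7, 1.0))
```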


    Step 1: Single-level decomposition of the original image

    For speed and simplicity, we used the built-in MATLAB routines for wavelet decomposition. This provides a four-subband decomposition of the space-domain image into frequency-domain subband coefficients:
    LL HL
    LH HH
    For the purposes of classification, we are only concerned with the three quadrants containing high-frequency components.
    Haar wavelet impulse response:
    Low pass:   0.70710678118655   0.70710678118655
    High pass: -0.70710678118655   0.70710678118655
    We chose the Haar wavelet for its short FIR characteristic. Its short impulse response keeps the mapping from spatial to transform coefficients localized, allowing us to better map the wavelet coefficients to spatial coordinates, even though it suffers from a softer cut-off in frequency [xi].
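A sketch of the single-level decomposition as a direct NumPy stand-in for the MATLAB routines (even image dimensions and this sign convention for the high-pass filter are assumptions):

```python
import numpy as np

S = 0.70710678118655   # Haar filter coefficient, 1/sqrt(2)

def haar_dwt2(img):
    """One-level 2-D Haar decomposition into LL, HL, LH, HH subbands."""
    x = np.asarray(img, float)
    lo = (x[:, 0::2] + x[:, 1::2]) * S      # low-pass along rows
    hi = (x[:, 1::2] - x[:, 0::2]) * S      # high-pass along rows
    LL = (lo[0::2, :] + lo[1::2, :]) * S    # then along columns
    LH = (lo[1::2, :] - lo[0::2, :]) * S
    HL = (hi[0::2, :] + hi[1::2, :]) * S
    HH = (hi[1::2, :] - hi[0::2, :]) * S
    return LL, HL, LH, HH
```

For classification, only the HL, LH and HH outputs are examined.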

    Step 2: Classify the whole image

    First, find the coefficient variances. A small variance strongly suggests a background region. Background regions contain similar histogram properties to text images when using our metrics, so it is important to do this form of trivial reject before continuing the algorithm.

    Next we apply our two metrics, and average them to produce an overall classification. Values very close to 1 or 0 (within a certain error tolerance) indicate textual or picture regions, respectively.

    Of course, when we look at a whole compound region, we do not expect to see such extreme values. We thus subdivide the region into four quadrants and recursively classify those regions, always checking the variance before attempting the two classification metrics.

    This recursive method works well. Initially, we chose to iterate blockwise over blocks that corresponded to 8x8 spatial blocks in the original image, since we were interested in comparing performance against DCT-based classification. However, we found that the small sample size (4x4 blocks in three quadrants) often yielded poor classification results. With this in mind, our recursion stops when subdivision would yield blocks smaller than 4x4.

    Tuning the Classification Routines

    There are several issues concerning the wavelet classification routine.


    Consider the following image, cmp5-text.tif.

    cmp5-text.tif in gif format

    It has the following histogram.

    cmp5-text histogram.

    Note that the histogram does not present a clear classification of text or picture, as was readily done in the previous examples. In this case, the algorithm cannot conclusively classify the image as either text or picture. Subdivision does little good here, since the image seems to have similar properties across its whole area. The algorithm scores as follows:

    Metric    Value         Delta mapping (using thetas as above)
    chi2b     1.301097e+07  1.300097e-05
    L         8.614095e-01  5.380317e-01
    Overall                 2.690223e-01

    B. Evaluation Methodology

    1. The Error Score Metric

    The performance of a segmentation scheme is represented using the Error Score Metric:

    Error Score = w1*(False Negative rate) + w2*(False Positive rate)

    where w1 and w2 are weighting coefficients satisfying w1 + w2 = 1.
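A sketch of this metric; normalizing the false-negative count by the number of text blocks and the false-positive count by the number of graphics blocks is our reading of the rates:

```python
def error_score(false_neg, false_pos, n_text, n_graph, w1=0.5, w2=0.5):
    """Weighted classification error; w1 + w2 should equal 1."""
    return w1 * false_neg / n_text + w2 * false_pos / n_graph
```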

    False Negatives are text blocks mistakenly labeled graphics, and False Positives are graphics blocks wrongly labeled text. The weighting coefficients allow one to bias the benchmark against False Negatives, which can quickly ruin text.

    2. Human Generated Reference Segmentation

    We compared our classification output against hand-generated classification output.

    3. Evaluation

    We trained each algorithm by choosing the threshold value that minimized the Error Score over three images. Using this experimentally optimal threshold, we then scored each algorithm on a different set of three images.  Keeping the training and testing sets disjoint forces an algorithm to exhibit robust behavior with respect to its threshold in order to rate well: the winning algorithm could not be overly sensitive to the threshold setting for a particular image.

    C. Results

    From the bar graph above, we see that among the schemes from the literature, the DCT Bitrate scheme, the most recent, indeed performed best; the DCT-18 absolute sum, the second most recent, came in second. We believe this outcome validates our evaluation methodology.  The bar graph also shows that our new algorithms perform far better than the algorithms in the literature: in particular, our Horizontal VHF scheme scored 7.53%, compared to 10.15% for the best scheme in the literature.  Included below are graphs of Horizontal VHF performance vs. threshold value for three different compound images. The third chart illustrates the relatively flat error score to the right of the optimal threshold (dotted line).

    IV. Improving the Classifier Output

    A. Introduction to Smoothing

    We can enhance the performance of the schemes by post-processing their outputs. In general, it is good to smooth the output of a scheme because text does not come in isolated blocks: an isolated text block is probably an error, and smoothing fixes such errors. In a similar fashion, smoothing can reduce false negatives when a "graph" block is surrounded by "text" blocks.
    In this section we examine two smoothing methods: Rule-based nonlinear smoothing (section B), and convolution-based smoothing (section C).

    B. Rule-Based Smoothing

    The first and most successful smoothing scheme we tried was based on the set of rules [vii] shown in Figures 6 and 7, applied to all blocks except those on the edges.

    Figure 6: Isolated text becomes graph.

    Figure 7: Graph with three adjacent text neighbors becomes text.
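The two rules above can be sketched as a single pass over the binary block map; 4-connected neighborhoods and reading all neighbors from the unsmoothed map are our assumptions:

```python
import numpy as np

def rule_smooth(mask):
    """Apply the two rules to interior blocks of a 0/1 map (1 = text)."""
    out = mask.copy()
    for r in range(1, mask.shape[0] - 1):
        for c in range(1, mask.shape[1] - 1):
            n = mask[r-1, c] + mask[r+1, c] + mask[r, c-1] + mask[r, c+1]
            if mask[r, c] == 1 and n == 0:
                out[r, c] = 0          # Figure 6: isolated text -> graphics
            elif mask[r, c] == 0 and n >= 3:
                out[r, c] = 1          # Figure 7: graphics amid text -> text
    return out
```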

    1. A Rule-Based Smoothing Example

    Starting with a compound image, we apply the Konstantinides segmentation scheme. Next, we post-process using rule-based smoothing, and compare the transcoded outputs when segmentation is performed with and without smoothing.
    Figure 8: A compound document

    Figure 9: Text blocks (in red) as identified by the Konstantinides scheme.


    Figure 10: After Rule-based smoothing.   The isolated false positives in the photograph as well as the false negatives in the text portion are removed.

    Graphics were quantized coarsely at 10 times the JPEG quantization matrix, while text was quantized using the unscaled JPEG matrix.

    Figure 11: Transcoded image with no smoothing.  Note the  dropouts in the text because of false negative errors.



    Figure 12: Transcoded image with smoothing. 
    With smoothing, these false negative errors are reduced. Without smoothing, we also see false positives: some of the photograph is coded at the text resolution (look around the knot of the tie). These are reduced when smoothing is used.

    2. Evaluating Rule-Based Smoothing

    To evaluate rule-based smoothing, we simply utilize the same evaluation techniques discussed in III.B, applying the smoothing after the classification scheme. Ideally, we would test rule-based smoothing in combination with each of the fifteen-or-so classification schemes we used, but one scheme should be sufficient to observe the effects of smoothing. The figures below show the performance of the Konstantinides scheme with and without smoothing, for the three images in the testing set.

    Figure 13: Error score for the Konstantinides scheme with no smoothing, over three images.


    Figure 14: Error score for Konstantinides scheme with Rule-Based smoothing, over three images.

    Rule-based smoothing gives an average error score over the three images equal to that of Konstantinides without smoothing. However, smoothing reduces the maximum error score.

    C. Convolution Based Smoothing

    Convolution-based smoothing takes the continuous-valued outputs of the activity schemes and filters them two-dimensionally with a low-pass averaging filter. The result is then thresholded normally to give binary decisions for each block. For this project, we tried ten or so smoothing filters, one of which is shown below. For all of the filters we tried, linear smoothing resulted in worse error scores. This was true even when we weighted false negatives heavily compared to false positives, and whether we applied the smoothing to the algorithms' unthresholded output or after thresholding.  The reason for this failure is probably that linear smoothing with symmetric responses does not let us require that "text" neighbors be contiguously placed. Also, the filters' sharp low-pass boundaries introduce errors.
    Figure 15: Impulse responses of linear smoothing filters.

    V. An Example of Compound Coding Using our "delta_very_horizontal" scheme

    In this section we step through a compound coding example using the best segmentation scheme that we found, delta_very_horizontal. We do this to demonstrate the intended application of image segmentation, and also to illustrate a problem we call "highlighting".  In the example below, we assume that the end user would like high text quality and is not interested in the graphics. The images below show that using segmentation techniques, the image can be compressed with higher text quality at the same average word length.

    A. The Images

    1. Original Image 

    Figure 17: The original Image

    2. Results of segmentation and smoothing.


    3. The transcoded output of the fixed quantizer (Quantized at 1.5 times JPEG).

    4. The transcoded output of the variable quantizer (graphics at 20 times JPEG, text at 0.5 times JPEG).

    B. Results

    Quantizing the whole image at 0.5 times the JPEG matrix yields:
    Quantizing the whole image at 1.5 times the JPEG matrix (as in the upper transcoded image) yields:
    Quantizing text at 0.5 and the graphics at 20 times the JPEG matrix yields:
    Conclusion: Segmentation of the images allows much higher quality text for the same rate. Gains are also realized because text and graphics can have individual codebooks. And although here the graphics are destroyed, this approach could be extremely useful in Stanford-Online-style images, where we would only destroy the background.

    C. Highlighting Problem

    Because the text is quantized at a different scale than the graphics, the boundary between text blocks and graphics blocks often becomes visible as a shift in background color. In the images above, the "highlighting" effect is not very apparent; with different quantization scales and backgrounds, however, it can be. When using differential encoding of DC, the highlighting is much more apparent. Please see the figure.

    If highlighting is undesirable, we found that it can be remedied by quantizing the DC component of the graphics portion at the same resolution as the text, though this incurs a cost in entropy. It should be noted that we first thought we could instead quantize the DC component of the text portion at the graphics resolution. This does not work because the DC component is not the background color; this counter-intuitive result helped us better understand a fundamental topic. For the example above, quantizing the DC values of the graphics at text resolution leads to an average word length of 1.1 bits/pel.

    VI. Conclusion

    In this paper we have shown that text requires a much higher bit rate than graphics to achieve the same quality.  This motivated us to find methods of segmenting compound documents into text and graphics portions.  We explored algorithms from the literature, including space-domain, DCT, and wavelet methods.  We developed new algorithms, notably our Delta-Probabilities H-VHF scheme, that perform better than some of the newest algorithms in the literature: our H-VHF scheme scored 7.53% (average class identification error) compared to 10.15% for the best one from the literature.  We then introduced two types of smoothing as a means to further improve classification results.  Finally, we showed in an example application that segmented coding increased text PSNR by 6.2 dB at the same overall bit rate.  Such results can be important in applications where text is far more important than the background, as in video-conferencing and distance learning.

    Appendix A. Bibliography

    [i] J. Li and R. M. Gray, "Text and Picture Segmentation by the Distribution Analysis of Wavelet Coefficients," in Proceedings of the International Conference on Image Processing, October 1998.
    [ii] K. Konstantinides and D. Tretter, "A JPEG Variable Quantization Method for Compound Documents," IEEE Transactions on Image Processing, vol. 9, no. 7, July 2000, p. 1282.
    [iii] C. T. Chen, "Transform coding of digital images using variable block size DCT with adaptive thresholding and quantization," SPIE vol. 1349, 1990, pp. 43-54.
    [iv] N. Chaddha, R. Sharma, A. Agrawal, and A. Gupta, "Text segmentation in mixed-mode images," in Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1356-1361, 1995.
    [v] N. Chaddha, "Segmentation-Assisted Compression of Multimedia Documents," in Proceedings of the 29th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1452-1456, 1996.
    [vi] K. Konstantinides and D. Tretter, "A JPEG Variable Quantization Method for Compound Documents," IEEE Transactions on Image Processing, vol. 9, no. 7, July 2000, p. 1282.
    [vii] J. Minguillon, J. Pujol, and K. Zeger, "Progressive Classification Scheme for Document Layout Recognition,"
    [viii] N. Chaddha, R. Sharma, A. Agrawal, and A. Gupta, "Text segmentation in mixed-mode images," in Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1356-1361, 1995.
    [ix] N. Chaddha and A. Gupta, "Text segmentation using linear transforms," in Proceedings of the 29th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1447-1451, 1996.
    [x] J. Li and R. M. Gray, "Context-based multiscale classification of images," in Proceedings of the IEEE International Conference on Image Processing, 1998.
    [xi] B. Girod, F. Hartung, and U. Horn, "Subband Image Coding," in A. Akansu and M. J. T. Smith (eds.), Design and Applications of Subbands and Wavelets, Kluwer Academic Publishers, Norwell, MA, Oct. 1995.

    Appendix B. Work Distribution