Compression concerns itself with the reduction of redundant or less perceivable information. When compressing images that contain both graphics and text, compression artifacts are often more obvious and irritating in the textual regions than in the graphics regions. This is because graphics and text have different properties and convey different types of information to the viewer. An image coder that takes advantage of these differences will realize efficiency gains by treating text differently than graphics.
We first examine the differences between text and graphics in Section II. In Sections III and IV, we explore methods of distinguishing text from graphics in images. Finally, in Section V, we illustrate compound coding with an example.
Properties of Text and Graphics
We present Rate vs. Peak Signal-to-Noise Ratio (PSNR) curves for five text-only documents, five graphics-only documents, and five compound documents. These curves show that for a given PSNR, text requires a higher bit rate than graphics. We also show that the coding of compound documents can be improved simply by using different codebooks for text and graphics.
Method for Generating PSNR vs. Rate Curves
We transform the images with 8-by-8 DCT blocks, then quantize each of the 64 coefficients of a block using the de facto JPEG DCT quantization matrix with a scale factor of 1.
To do rate calculations for a transform coder, we need histograms of DCT coefficient frequencies for each combination of quantizer scale, class of image (graphics, text, or compound), and DCT coefficient number (1-64). For each class, we used five image samples to generate the histograms, 15 images in total.
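The histogram-gathering step can be sketched as follows (in Python for illustration; the project's scripts were written in Matlab). The standard JPEG luminance table stands in for the "de facto" matrix; function names are ours:

```python
import math

# Standard JPEG luminance quantization table (the "de facto" matrix).
JPEG_Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def dct2_8x8(block):
    """2-D DCT-II of an 8x8 block (JPEG-style orthonormal scaling)."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

def quantize(coeffs, scale=1.0):
    """Quantize DCT coefficients with the scaled JPEG matrix."""
    return [[int(round(coeffs[i][j] / (scale * JPEG_Q[i][j])))
             for j in range(8)] for i in range(8)]

def update_histograms(hists, qblock):
    """Accumulate per-coefficient value counts; hists is 64 dicts,
    one per coefficient, mapping quantized value -> count."""
    for i in range(8):
        for j in range(8):
            k = 8 * i + j  # coefficient number k+1 in 1-64 terms
            hists[k][qblock[i][j]] = hists[k].get(qblock[i][j], 0) + 1
```

For a flat block, only the DC coefficient survives quantization, so the AC histograms accumulate zeros.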
We compute PSNR in the standard way, from the mean squared error relative to a peak pixel value of 255. We used the following plots to compare the three classes.
PSNR vs. Rate Plots for Text and Graphics
Rate vs. PSNR for Text
Rate vs. PSNR for Graphics
Figures 2 and 3 show that, for a given bit rate, the PSNR for graphics is much greater than for text. A bit rate of 1 bit/pixel yields a median graphics PSNR of 31 dB, while the median PSNR for text is only 19 dB.
Notice also that the slope of the curves for text is 8 dB per bit, compared to 4-6 dB per bit for graphics. Moreover, at low bit rates the curves for graphics become much steeper, indicating a rapid incremental improvement.
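The PSNR figure of merit used throughout can be computed as below (a Python sketch; the project's scripts were Matlab). The 255 peak is the standard choice for 8-bit pixels:

```python
import math

def psnr(original, reconstructed, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists."""
    n = len(original)
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / n
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```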
PSNR vs. Rate for Compound Documents
Figure 4 shows that the rate-PSNR curves for compound documents are quite similar to the graphics rate-PSNR curves, but at a lower PSNR, as we would expect.
Next, in Figure 5 we show that by coding the text and graphics portions individually, we can increase coding efficiency by using codebooks tailored to text. Even after the added cost of 1/64 bit per block to specify its type, the savings is still about 0.5 bits.
Note that to produce Figure 5, we manually segmented the text from the graphics.
Compound documents coded throughout with one set of codebooks
Compound image with text and graphical portions coded individually.
These results show that text and graphics have different properties and should therefore be treated separately by coders.
Each of our algorithms computes an 'activity' score for blocks in an image. We then compare the score to a threshold: blocks below the threshold are classified as graphics, and blocks above it as text.
The following algorithms rely only on the 8-by-8 image blocks, without needing any transform.
This first block-based algorithm is based on the observation that text blocks are likely to have a much higher dynamic range than non-text blocks. Here, the activity value of each block is simply the range of its pixel values, which is computationally cheap.
This algorithm scores based on pixel variance within a block, since text blocks are likely to have a higher variance than non-text blocks.
This algorithm uses the mean absolute deviation of each block as its activity.
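The three space-domain activity measures above, together with the thresholding step, can be sketched as follows (a Python illustration; the project's scripts were Matlab):

```python
def block_range(block):
    """Dynamic-range activity: max minus min pixel value."""
    pixels = [p for row in block for p in row]
    return max(pixels) - min(pixels)

def block_variance(block):
    """Variance activity: mean squared deviation from the block mean."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)

def block_mad(block):
    """Mean absolute deviation from the block mean."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return sum(abs(p - mean) for p in pixels) / len(pixels)

def classify(block, activity, threshold):
    """'text' if the activity score exceeds the threshold, else 'graphics'."""
    return 'text' if activity(block) > threshold else 'graphics'
```

A checkerboard-like text block scores high under all three measures, while a flat background block scores zero.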
Since text blocks are likely to have more edges than graphics blocks, this method measures edge content using the eight-neighbor Sobel edge detection filter. For each pixel of the block, we convolve the 3-by-3 neighborhood centered on that pixel with H and with V, where H detects horizontal edges and V = transpose(H) detects vertical edges, adjusting the result at the borders of the 8-by-8 block. This gives X and Y, the horizontal and vertical edge detection results.
Then, we compute the Sobel gradient for the pixel, and sum the gradients over the block to obtain the score.
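A sketch of the Sobel-based score (Python for illustration). The report only says the result is "adjusted" at the block borders, so the edge replication used here is an assumption:

```python
import math

SOBEL_H = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # horizontal-edge kernel H

def sobel_activity(block):
    """Sum of Sobel gradient magnitudes over a block.

    Border pixels are handled by clamping coordinates (edge replication),
    which is an assumed border policy.
    """
    n = len(block)
    def px(i, j):  # clamp coordinates to the block
        return block[max(0, min(n - 1, i))][max(0, min(n - 1, j))]
    total = 0.0
    for i in range(n):
        for j in range(n):
            gx = gy = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    p = px(i + di, j + dj)
                    gx += SOBEL_H[di + 1][dj + 1] * p  # horizontal edges (H)
                    gy += SOBEL_H[dj + 1][di + 1] * p  # vertical edges (V = H^T)
            total += math.sqrt(gx * gx + gy * gy)
    return total
```

A flat block yields zero activity; any step edge yields a positive score.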
Energy is distributed differently among DCT coefficients for text and non-text blocks. Thus we segment by examining the set of DCT coefficients that best captures this difference between text and graphics, and then compare the energy or absolute sum of these coefficients (the activity) to a threshold.
For instance, as in [v], the following graph extracted from our codebook shows the difference between the energy of text and graphics for various DCT coefficients under run-length coding (quantization matrix with a scale factor of 0.5), especially for coefficients 29, 39, 51 and 60.
Similarly, the computed absolute sum also shows an important difference between graphics and text, so we do not necessarily have to use energy (the absolute sum is faster to compute).
The following chart shows that a much coarser quantization produces an even greater difference. (We use a scale of 5, and the zero probabilities of the codebook are approximated by a very small value, as shown in the chart for graphics.)
Unless otherwise specified, the following schemes are all applied to a quantization matrix with a scale factor of 1.
This scheme computes the energy of the quantized DCT block, i.e. the sum of the squares of the 64 coefficients, to get the activity of the block.
This scheme simply replaces the sum of squares in the energy calculation with the absolute sum, which is easier to compute.
This scheme computes the same sum over only the 18 coefficients identified in [iv], instead of all 64, because they effectively capture the difference between text and non-text blocks. Numbering the coefficients from 1, row by row, the 18 coefficients are:
Coefficients = [4 5 6 12 13 14 20 21 22 44 45 46 52 53 54 60 61
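The energy, absolute-sum, and coefficient-subset activities can be sketched as follows (Python for illustration; the coefficient list is passed in as a parameter, since the list above is cut off in our copy of the source):

```python
def energy_activity(qblock):
    """Sum of squared quantized DCT coefficients of an 8x8 block."""
    return sum(c * c for row in qblock for c in row)

def abs_sum_activity(qblock):
    """Absolute sum of the coefficients -- cheaper than the energy."""
    return sum(abs(c) for row in qblock for c in row)

def subset_abs_sum(qblock, coeff_numbers):
    """Absolute sum over a subset of coefficients, numbered 1-64 row by row."""
    flat = [c for row in qblock for c in row]
    return sum(abs(flat[k - 1]) for k in coeff_numbers)
```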
The paper by Konstantinides and Tretter [vi] examines the information contained in individual coefficients instead of simply computing a sum. It uses a function that assigns to every coefficient the approximate bit length it would take in a run-length coding of the DCT block: zero coefficients take 0 bits, and a non-zero coefficient takes approximately log2(|coefficient|) + 4 bits. The activity is then the sum of these values over all coefficients.
In effect, this scheme exploits the fact, shown in Section II, that text blocks require many more bits than graphics blocks.
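The bit-length activity just described reduces to a few lines (a Python sketch of the rule stated above):

```python
import math

def bitlength_activity(qblock):
    """Approximate run-length code length of a quantized DCT block.

    Zero coefficients cost 0 bits; a nonzero coefficient c costs about
    log2(|c|) + 4 bits, following Konstantinides and Tretter.
    """
    bits = 0.0
    for row in qblock:
        for c in row:
            if c != 0:
                bits += math.log2(abs(c)) + 4
    return bits
```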
Innovations of This Project: Delta DCT Codebook-Based Algorithms
Having analyzed the existing schemes, we propose what we believe are genuinely new algorithms, and our tests show that they indeed produce superior results. We observe that the difference between text and graphics coefficients in DCT blocks is not simply a difference in energy; rather, each coefficient's values are distributed differently in text and graphics blocks.
For instance, suppose that in sample codebooks for text and graphics, coefficient 2 takes the value 1 with probability 0.1 for text but only 0.001 for graphics. Then if we take a block of unknown type and coefficient 2 has the value 1, the block is much more likely to be text than graphics.
Delta-Probabilities is the simplest codebook-based scheme. We first find the probability of each coefficient being a text coefficient by looking up the coefficient's value in the text codebook. We do the same with the graphics codebook and take the difference, delta = P(text) - P(graph). The activity score is the sum of these deltas (higher means more likely to be text).
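The Delta-Probabilities score can be sketched as follows (Python for illustration; the codebook representation as value-to-probability dicts is our own, and the floor for unseen values is an assumption):

```python
def delta_probabilities(qblock, text_cb, graph_cb, floor=1e-6):
    """Sum over coefficients of P_text(value) - P_graph(value).

    text_cb / graph_cb are lists of 64 dicts, one per coefficient,
    mapping a quantized value to its codebook probability. Values not
    in the codebook get a small floor probability (an assumption).
    Higher scores mean the block is more likely to be text.
    """
    score = 0.0
    flat = [c for row in qblock for c in row]
    for k, value in enumerate(flat):
        p_text = text_cb[k].get(value, floor)
        p_graph = graph_cb[k].get(value, floor)
        score += p_text - p_graph
    return score
```

With the worked example above (coefficient 2: P(1) = 0.1 for text vs. 0.001 for graphics), a block whose only nonzero entry is that coefficient scores 0.099.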
As in the DCT Bitrate scheme, but instead of assigning a fixed-length code bitrate value to each coefficient, we assign the codebook-entropy value for the text and graphics codebooks and take the difference.
Delta-Probabilities - Low Energy
Should we bias our scheme towards high-energy coefficients in an effort to better characterize text? On the contrary, our results indicate that low-energy coefficients are much more effective. The big differences between text and graphics do not occur in the volatile high energies (concentrated in the top-left of the DCT matrix, around the DC value), so biasing towards these high energies is unproductive.
Thus, this scheme multiplies delta by 1/(1 + energy) for each coefficient.
Delta-Probabilities - High Probability
This scheme biases the score towards coefficients with higher probabilities by using the difference of squared probabilities.
Delta-Probabilities - High Difference
This variant biases the score towards coefficients with the biggest differences by summing delta^3 instead of delta.
Delta-Probabilities - 18 Coefficients
This scheme sums the deltas for the 18 coefficients of [iv], instead of all 64.
Delta-Probabilities - Horizontal, Vertical, and Horizontal & Vertical High Frequency; Horizontal Very High Frequency
These schemes favor high frequencies in the DCT block, following the idea that text images have sharper high-frequency components. Thus, we change the activity contribution of the (i,j) coefficient of a DCT matrix from delta to a frequency-weighted delta:
(horizontal high frequency),
(vertical high frequency)
(horizontal&vertical high frequency)
Moreover, since the horizontal frequencies are the most characteristic of text (consider the shapes of the letters), we also tested schemes measuring only the horizontal very high frequencies.
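The frequency-weighted variants can be sketched as below (Python for illustration). The report's exact weight formulas appeared graphically and are not reproduced here, so the weight functions are plausible stand-ins, not the original expressions:

```python
def weighted_delta_score(qblock, text_cb, graph_cb, weight, floor=1e-6):
    """Delta-probabilities score with a per-position weight(i, j).

    text_cb / graph_cb are lists of 64 value->probability dicts.
    The weights below are assumed forms, not the report's formulas.
    """
    score = 0.0
    for i in range(8):
        for j in range(8):
            k = 8 * i + j
            value = qblock[i][j]
            delta = text_cb[k].get(value, floor) - graph_cb[k].get(value, floor)
            score += weight(i, j) * delta
    return score

# Assumed weightings (column index j grows with horizontal frequency):
def horizontal_hf(i, j): return j                      # favor horizontal HF
def vertical_hf(i, j): return i                        # favor vertical HF
def both_hf(i, j): return i * j                        # favor both
def horizontal_vhf(i, j): return 1.0 if j >= 6 else 0.0  # only the very highest
```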
Delta-Probabilities - Horizontal High Frequency and Low Energy
This scheme simply combines two variants: horizontal high frequency and low energy.
Codebook Entropy Based Algorithms
Some DCT coefficients have better differentiating power than others, so we assign a weight to each coefficient while calculating the activity. First, we find those coefficients whose graphics values are very often zero or ±1, while the text values spread over several values. Then, we introduce a new measure: the entropy of the coefficient's probability distribution.
This scheme takes the ratio (text entropy)/(graph entropy) as a weight for the deltas, to single out the good coefficients:
Entropy Ratio Weight = (text entropy) / (graph entropy)
This scheme takes the difference (text entropy) - (graph entropy) as a weight for the deltas to distinguish the good coefficients.
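The two entropy-weight variants can be sketched as follows (Python for illustration; the floor guarding against a zero graphics entropy is an assumption):

```python
import math

def entropy(pmf, floor=1e-6):
    """Shannon entropy (bits) of a value -> probability dict."""
    return -sum(p * math.log2(max(p, floor)) for p in pmf.values() if p > 0)

def entropy_ratio_weight(text_pmf, graph_pmf, floor=1e-6):
    """Weight for a coefficient: (text entropy) / (graph entropy)."""
    return entropy(text_pmf) / max(entropy(graph_pmf), floor)

def entropy_diff_weight(text_pmf, graph_pmf):
    """Weight for a coefficient: (text entropy) - (graph entropy)."""
    return entropy(text_pmf) - entropy(graph_pmf)
```

A coefficient whose graphics values are nearly always zero (entropy near 0) but whose text values spread out (high entropy) receives a large weight under both variants.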
Codebook Bayesian Based Algorithms
This scheme is based on the MAP (Maximum a Posteriori Probability) rule. Thus, we want to choose 'text' whenever P(block | text) · P(text) > P(block | graph) · P(graph), and 'graph' otherwise.
Codebook MAP (Maximum a Posteriori Probability) Rule
This implementation is just the equation above. We force each probability to be greater than a small value, such as 0.001, to avoid taking the log of zero.
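The MAP rule can be sketched as follows (Python for illustration; treating the coefficients as independent and using equal priors are assumptions, while the 0.001 floor comes from the text above):

```python
import math

def map_classify(qblock, text_cb, graph_cb, p_text=0.5, floor=0.001):
    """MAP rule: pick the class with the larger posterior log-likelihood.

    Coefficients are treated as independent (an assumption), and all
    probabilities are floored at 0.001 to avoid the log of zero.
    text_cb / graph_cb are lists of 64 value->probability dicts.
    """
    flat = [c for row in qblock for c in row]
    log_text = math.log(p_text)
    log_graph = math.log(1.0 - p_text)
    for k, value in enumerate(flat):
        log_text += math.log(max(text_cb[k].get(value, 0.0), floor))
        log_graph += math.log(max(graph_cb[k].get(value, 0.0), floor))
    return 'text' if log_text > log_graph else 'graphics'
```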
The wavelet transform is an effective means of mapping an image from the
space domain to the frequency domain. In a typical wavelet coding scheme,
an image's frequency components are subdivided recursively, refining the
lower subband (in the two-band decomposition case) at each step.
We chose a classification method proposed by Li and Gray [i]. In a continuous-tone picture, one can model the higher-frequency transform coefficients with a Laplacian distribution. Textual areas, however, are characterized by coefficient values clustered around a handful of discrete values. cnntext.gif can be considered a typical case where textual coding applies.
This illustrates a typical histogram for the wavelet coefficients in
a text region. Note that most of the coefficients are clustered around
zero: in text, the background color remains dominant. Also note that the
remaining coefficients are strongly clustered around only a few peaks.
Clearly, cnnpic.gif has different characteristics than the textual image above. By inspection, we can see a variety of both high-frequency and low-frequency content. This distribution is largely typical of picture regions in images. Although the histogram appears to have a sharper peak, the Laplacian remains a reasonable model.
newcnn512x512.tif has characteristics of both text and picture
regions, and cannot be clearly classified as one or the other.
Again, the histogram concurs with the quick visual classification: although one can see some discrete peaks away from zero (showing textual characteristics), one can also see the gradual falloff from the peak centered at zero (showing the smooth picture characteristics). Thus the image is a combination of text and picture.
Quantifying the Idea
Unfortunately, the pattern-recognition skills that allow humans to visually
classify regions quickly are non-trivial to implement, so we turn to something
more suited to a digital computation.
Metric 1: Chi^2 goodness of fit
One metric for comparing a histogram with the Laplacian is the Chi^2 test. This test measures how well an experimental sample fits a hypothetical distribution. For details of the calculation, please consult Li and Gray [i].
Metric 2: How discrete are those peaks, really?
A key feature of the textual coefficient distribution is that the coefficients cluster around a few discrete values. To measure this, we divide the histogram into zones, where each zone has one peak value and the pmf on either side is monotonically non-increasing with increasing distance from the peak. Then we calculate what percentage of the values in the zone are contained in the peak, giving L. In the textual case, nearly all of the values in a zone should be within a tight proximity of the peak.
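The peakiness metric L can be sketched as below (Python for illustration). The report does not define "tight proximity" precisely, so taking the single maximum bin of each zone as the peak is an assumption:

```python
def peakiness(counts):
    """Fraction of histogram mass sitting in zone peaks (metric L).

    A zone is a maximal unimodal run of bins: counts rise to a single
    peak and then fall. The 'peak' is taken to be the maximum bin of
    each zone, an assumed reading of the report's 'tight proximity'.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    peak_mass = 0.0
    i = 0
    n = len(counts)
    while i < n:
        j = i
        while j + 1 < n and counts[j + 1] >= counts[j]:  # rising edge
            j += 1
        peak = counts[j]
        while j + 1 < n and counts[j + 1] <= counts[j]:  # falling edge
            j += 1
        peak_mass += peak
        i = j + 1
    return peak_mass / total
```

A text-like histogram of isolated spikes scores near 1; a smooth Laplacian-like falloff scores much lower.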
Making sense of the numbers
Each metric provides a numerical score, which is mapped linearly onto [0,1].
Extreme values saturate.
The theta values here are merely thresholds which determine when the
mapping functions saturate. Experimentally, we found that Chi^2 was interesting
over [10^4, 10^12] and L was interesting over [0.7, 1].
The output classification value is a plain arithmetic mean of the two
delta values. We experimented with the geometric mean, but did not find
that the results were significantly better.
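The mapping and combining steps above can be sketched as follows (Python for illustration), using the experimentally interesting ranges quoted above as the saturation thresholds:

```python
def saturating_map(x, lo, hi):
    """Linearly map x onto [0, 1], saturating outside [lo, hi]."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

def combine(chi2, l, chi2_range=(1e4, 1e12), l_range=(0.7, 1.0)):
    """Arithmetic mean of the two mapped metric values.

    The ranges default to the experimentally interesting intervals
    [10^4, 10^12] for Chi^2 and [0.7, 1] for L.
    """
    d1 = saturating_map(chi2, *chi2_range)
    d2 = saturating_map(l, *l_range)
    return (d1 + d2) / 2.0
```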
Step 1: Single-level decomposition of the original image
For speed and simplicity, we used the built-in Matlab routines for wavelet decomposition. This provides a decomposition of the space-domain image into four frequency-domain subbands. For the purposes of classification, we are only concerned with the three quadrants containing high-frequency components.
We chose the Haar wavelet for its short FIR characteristic. Its short impulse
response keeps the mapping from spatial to transform coefficients localized,
thus allowing us to better map the wavelet coefficients to the spatial
coordinates, even though it suffers from a softer cut-off in frequency [xi].
Haar wavelet impulse response
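Although the project used Matlab's built-in decomposition routines, a minimal single-level 2-D Haar step looks like the following (Python for illustration; the averaging normalization is one common convention, a choice rather than the project's exact scaling):

```python
def haar_decompose(img):
    """Single-level 2-D Haar decomposition into LL, LH, HL, HH subbands.

    img is a 2-D list with even dimensions. Each 2x2 spatial block
    yields one coefficient per subband (averaging normalization).
    """
    rows, cols = len(img), len(img[0])
    ll = [[0.0] * (cols // 2) for _ in range(rows // 2)]
    lh = [[0.0] * (cols // 2) for _ in range(rows // 2)]
    hl = [[0.0] * (cols // 2) for _ in range(rows // 2)]
    hh = [[0.0] * (cols // 2) for _ in range(rows // 2)]
    for i in range(0, rows, 2):
        for j in range(0, cols, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 4.0  # low-pass average
            lh[i // 2][j // 2] = (a - b + c - d) / 4.0  # horizontal detail
            hl[i // 2][j // 2] = (a + b - c - d) / 4.0  # vertical detail
            hh[i // 2][j // 2] = (a - b - c + d) / 4.0  # diagonal detail
    return ll, lh, hl, hh
```

The short two-tap support is what keeps each wavelet coefficient tied to a small spatial neighborhood, which is exactly the localization property discussed above.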
Step 2: Classify the whole image
First, find the coefficient variances. A small variance strongly suggests
a background region. Background regions contain similar histogram properties
to text images when using our metrics, so it is important to do this form
of trivial reject before continuing the algorithm.
Next we apply our two metrics, and average them to produce an overall
classification. Values very close to 1 or 0 (within a certain error tolerance)
indicate textual or picture regions, respectively.
For a whole compound region, however, we do not expect such extreme values. We thus subdivide the region into four quadrants and recursively classify those regions, always checking the variance before attempting the two classification metrics.
This recursive method works well. Initially, we iterated blockwise over blocks that represented 8x8 blocks spatially in the original image, since we were interested in comparing performance against DCT-based classification. However, we found that the small sample size (4x4 blocks in three quadrants) often yielded poor classification results. With this in mind, our recursion stops when subdivision would yield blocks too small to classify reliably.
Tuning the Classification Routines
There are several issues concerning the wavelet classification routine.
Wavelet choice: We chose Haar for its short impulse response.
Level of Decomposition: We chose a single level of decomposition
as suggested by the literature. The sharp, high frequency characteristics
that this algorithm uses to classify must be present in the high frequency
bands at the first level, and further decomposition would be of little
aid in classification.
Block choice: In producing the histogram, we choose the representative
block from each of the HL, LH, and HH quadrants. Alternatively, we can
perform the classification routines on the block from each quadrant and
merge the results creatively.
Classification Metrics: Our wavelet experimentation used the two
histogram-based metrics, but this might be combined with other spatially-based
schemes for further accuracy.
Mapping the classification metrics: We mapped the Chi^2 and "peakiness" values (L) linearly onto [0,1]. With the Chi^2 values varying as much as they do, a logarithmic mapping might be a better choice; however, there is little reason to believe that any particular mapping is the best one. Our threshold values were chosen experimentally, but there may be a more rigorous way to select them.
Combining the metrics: The arithmetic mean is quick, simple, and
easy to understand, but geometric, harmonic, or other nonlinear functions
may be more suitable.
Classification Tolerance: In deciding whether or not to continue subdividing regions, the algorithm compares the combined delta value with the extreme values of 0 and 1. If the value is within a certain tolerance of 0 or 1, the block is considered classified.
Consider the following image, cmp5-text.tif.
It has the following histogram.
Note that the histogram does not present a clear classification of text
or picture, as was readily done in the previous examples. In this case,
the algorithm cannot conclusively classify the image as either text or
picture. Subdivision, in this case, does little good, since the image seems
to contain similar properties across the whole area. Here, the algorithm
scores as follows:
Delta mapping (using thetas as above)
Error Score Metric
The performance of a segmentation scheme is represented using the Error Score metric:
Error Score = w1 · (fraction of False Negatives) + w2 · (fraction of False Positives),
where w1 and w2 are weighting coefficients satisfying w1 + w2 = 1.
False Negatives are text blocks mistakenly labeled graphics, and False Positives are graphics blocks wrongly labeled text. The weighting coefficients allow one to bias the benchmark against False Negatives, which can quickly ruin text.
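The Error Score can be sketched as follows (Python for illustration; equal default weights are a choice, and normalizing both error counts by the total block count is our assumed reading of the metric):

```python
def error_score(predicted, reference, w1=0.5, w2=0.5):
    """Weighted error score over per-block labels ('text' / 'graphics').

    False negatives: text blocks labeled graphics.
    False positives: graphics blocks labeled text.
    Both rates are normalized by the total block count (an assumption).
    """
    fn = fp = 0
    for p, r in zip(predicted, reference):
        if r == 'text' and p == 'graphics':
            fn += 1
        elif r == 'graphics' and p == 'text':
            fp += 1
    n = len(reference)
    return w1 * fn / n + w2 * fp / n
```

Raising w1 penalizes false negatives more, reflecting how dropped text blocks quickly ruin legibility.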
Hand-Generated Reference Segmentation
We compared our classification
output against hand-generated classification output.
We trained each algorithm by choosing the threshold value that minimized the Error Score over three images. Using this experimentally optimal threshold, we then scored each algorithm on a different set of three images.
Having disjoint training and testing sets forces an algorithm to exhibit robust behavior with respect to its threshold in order to rate well. A winning algorithm cannot be overly sensitive to the threshold setting for a particular image.
From the bar graph above, we see that among the schemes from the literature, the DCT Bitrate scheme, the most recent, indeed performed the best. The DCT-18 absolute sum, the second most recent, came in second. We believe that this outcome validates our evaluation methodology.
We show in the bar graph that our new algorithms perform far better than the algorithms in the literature. In particular, our Horizontal VHF scheme scored 7.53%, compared to 10.15% for the best scheme in the literature.
Included below are graphs of Horizontal VHF performance vs. threshold value for three different compound images. The third chart illustrates the relatively flat error score to the right of the optimal threshold.
IV. Improving the Schemes with Smoothing
We can enhance the performance of the schemes by post-processing their outputs. In general, it is good to smooth the output of a scheme because text does not come in isolated blocks; an isolated text block is probably an error, and smoothing fixes such errors. In a similar fashion, smoothing can reduce false negatives when a "graph" block is surrounded by "text" blocks.
In this section we examine two smoothing methods: Rule-based nonlinear
smoothing (section B), and convolution-based smoothing (section C).
The first and most successful smoothing scheme we tried was based on the set of rules [vii] shown in figures 1 and 2, applied to all blocks except those on the edges.
Isolated text becomes graph.
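One pass of this rule-based smoothing can be sketched as follows (Python for illustration). The report does not say which neighborhood "adjacent" refers to, so the 4-neighborhood here is an assumption:

```python
def rule_smooth(labels):
    """One pass of rule-based smoothing on a 2-D grid (True = text).

    Rules: isolated text becomes graphics; graphics with three or more
    text neighbors becomes text. A 4-neighborhood is assumed; edge
    blocks are left untouched, as in the report.
    """
    rows, cols = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            text_nbrs = sum([labels[i - 1][j], labels[i + 1][j],
                             labels[i][j - 1], labels[i][j + 1]])
            if labels[i][j] and text_nbrs == 0:
                out[i][j] = False   # isolated text -> graphics
            elif not labels[i][j] and text_nbrs >= 3:
                out[i][j] = True    # surrounded graphics -> text
    return out
```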
Rule-Based Smoothing Example
Starting with a compound image, we apply the Konstantinides segmentation scheme. Next, we post-process using rule-based smoothing, and compare the transcoded outputs when segmentation is performed with and without smoothing.
A compound document
Text blocks (in red) as identified by the Konstantinides scheme.
After Rule-based smoothing. The isolated false positives in
the photograph as well as the false negatives in the text portion are removed.
Graphics were quantized coarsely at 10 times the JPEG quantization matrix, while text was quantized using the unscaled JPEG matrix.
Transcoded image with no smoothing. Note the dropouts in the
text because of false negative errors.
Transcoded image with smoothing.
With smoothing, these false negative errors are reduced. Without smoothing, we also see false positives: some of the photograph is coded at the text resolution (look around the knot of the tie). These are reduced when smoothing is used.
To evaluate rule-based smoothing, we simply apply the same evaluation techniques discussed in III.B, performing the smoothing after the classification scheme. Ideally, we would test rule-based smoothing in combination with each of the fifteen-or-so classification schemes we used, but one scheme should be sufficient to observe the effects of the smoothing. The figures below show the performance of the Konstantinides scheme with and without smoothing, for the three images in the testing set.
Error score for the Konstantinides scheme with no smoothing, over the three test images.
Error score for the Konstantinides scheme with rule-based smoothing, over the three test images.
Rule-based smoothing gives an average error score over the three images that is equal to that of Konstantinides without smoothing. However, smoothing reduces the maximum error score.
Convolution-based smoothing takes the continuous-valued outputs of the activity schemes and filters them two-dimensionally with a low-pass averaging filter. The result is then thresholded normally to give binary decisions for each block.
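A minimal instance of this approach is below (Python for illustration). A 3x3 moving average stands in for the project's filters, whose exact impulse responses are not reproduced here, and leaving border scores untouched is an assumption:

```python
def smooth_and_threshold(scores, threshold):
    """3x3 moving-average filter on a 2-D activity map, then threshold.

    Border cells keep their raw score (an assumed border policy).
    Returns a 2-D grid of booleans (True = text).
    """
    rows, cols = len(scores), len(scores[0])
    out = [row[:] for row in scores]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = sum(scores[i + di][j + dj]
                            for di in (-1, 0, 1)
                            for dj in (-1, 0, 1)) / 9.0
    return [[v > threshold for v in row] for row in out]
```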
For this project, we tried about ten smoothing filters, one of which is shown below. For all of the filters we tried, linear smoothing resulted in worse error scores. This was true even when we weighted false negatives heavily compared to false positives. We tried linear smoothing directly on the algorithms' unthresholded output and also after thresholding.
The reason for its failure is probably that linear smoothing with symmetric responses does not allow us to specify that "text" neighbors be contiguously placed. Also, the filters' sharp low-pass boundaries introduce errors.
Impulse responses of the linear smoothing filters.
Example of Compound Coding Using our "delta_very_horizontal" scheme
In this section we step through a compound coding
example using the best segmentation scheme that we found - delta_very_horizontal.We
do this to demonstrate the presupposed application of image segmentation,
and also to illustrate a problem we call "highlighting".
In the example below, we assume that the end user would like high text quality and is not interested in the graphics. The images below show that, using segmentation techniques, the image can be compressed with higher text quality at the same average word length.
1. Original Image
The original image.
Output of segmentation and smoothing.
Transcoded output of the fixed quantizer (quantized at 1.5 times JPEG).
Transcoded output of the variable quantizer (graphics at 20 times JPEG, text at 0.5 times JPEG).
Quantizing the whole image at 0.5 times the JPEG matrix
- Average Word Length: 1.4103
Quantizing the whole image at 1.5 times the JPEG matrix (as in the upper transcoded image) yields:
Quantizing text at 0.5 and the graphics at 20 times
the JPEG matrix yields:
Average Word Length: 0.83
Text PSNR: 22.35 dB
Graph PSNR: 35.39 dB
Conclusion: Segmentation of the images allows much higher quality text for the same rate. Gains are also realized because text and graphics can have individual codebooks. And although here the graphics are destroyed, this could be extremely useful in Stanford-Online-type images, where we would only destroy the background.
Average Word Length: 0.81 bits/pel (including 1/64 bits/pel to send block type)
Text PSNR: 28.55 dB
Text Entropy: 4.2178 bits/pel
Graph PSNR: 27.75 dB
Graph Entropy: 0.20 bits/pel
Because the text is quantized at a different scale than the graphics, the boundary between text blocks and graphics blocks often becomes visible as a shift in background color. In the images above, this 'highlighting' effect is not very apparent; with different quantization scales and backgrounds, however, it can be. When using differential encoding of DC, the highlighting is much more apparent. Please see the figure.
If highlighting is undesirable, we found that it can be remedied by quantizing the DC component of the graphics portion at the same resolution as the text. This incurs a cost in entropy, however.
It should be noted that we first thought we could instead quantize the DC component of the text portion at the graphics resolution. This does not work, because the DC component is not the background color; this counter-intuitive result helped us better understand a fundamental property of the DCT.
For the example above, quantizing the DC values of the graphics at text resolution leads to an average word length of 1.1.
In this paper we have shown that text requires a much higher bit rate than graphics to achieve the same quality. This motivated us to find methods of segmenting compound documents into text and graphics portions. We explored algorithms from the literature, including space-domain, DCT, and wavelet methods. We developed new algorithms, notably our Delta-Probabilities H-VHF scheme, that perform better than some of the newest algorithms in the literature. Our H-VHF scheme scored 7.53% (average class identification error) compared to 10.15% for the best one from the literature. We then examined two types of smoothing as a means to further improve classification results. Finally, we showed in an example application that segmented coding resulted in an increased text PSNR of 6.2 dB for the same overall bit rate. Such results can be important in applications where text is far more important than the background, as in video-conferencing and distance learning.
Appendix A. Bibliography
Li and R. Gray, "Text and Picture Segmentation by the Distribution Analysis of Wavelet Coefficients," in Proceedings of the International Conference on Image Processing, October 1998.
K. Konstantinides and D. Tretter, "A JPEG Variable Quantization Method for Compound Documents," IEEE Transactions on Image Processing, vol. 9, no. 7, July 2000, p. 1282.
C.T. Chen, "Transform coding of digital images using variable block size DCT with adaptive thresholding and quantization," SPIE vol. 1349, 1990.
N. Chaddha, R. Sharma, A. Agrawal, and A. Gupta, "Text segmentation in
mixed-mode images," in Proceedings of the 28th Asilomar Conference on Signals,
Systems and Computers, vol. 2, pp. 1356--1361, 1995.
Compression of Multimedia Documents" in Proceedings of the 29th Asilomar
Conference on Signals, Systems and Computers, vol. 2, pp.1452--1456, 1996.
Konstantinides and D. Tretter, "A JPEG Variable Quantization Method for Compound Documents," IEEE Transactions on Image Processing, vol. 9, no. 7, July 2000, p. 1282.
[vii] J. Minguillon,
J. Pujol, and K. Zeger, "Progressive Classification Scheme for Document
Chaddha, R. Sharma, A. Agrawal, and A. Gupta, "Text segmentation in mixed-mode
images," in Proceedings of the 28th Asilomar Conference on Signals, Systems
and Computers, vol. 2, pp. 1356--1361, 1995.
Chaddha and A. Gupta, "Text segmentation using linear transforms," in Proceedings
of the 29th Asilomar Conference on Signals, Systems and Computers, vol.
2, pp.1447--1451, 1996.
Li and R. M. Gray, "Context-based multiscale classification of images,"
in Proceedings of the IEEE International Conference on Image Processing,
[xi] B. Girod, F. Hartung, and U. Horn, "Subband Image Coding," in A. Akansu and M. J. T. Smith (eds.), Design and Applications of Subbands and Wavelets, Kluwer Academic Publishers, Norwell, MA, Oct. 1995.
Appendix B. Work Distribution
Mark & Isaac: Worked mainly together; please see the log for details. Wrote all non-wavelet scripts, the comparison methodology, and the JPEG-style DCT coding.
Isaac: Innovative DCT segmentation schemes, except for the MAP rule scheme, which was done with Mark.
Dan: Wavelet-based schemes; worked mostly independently.
Mark, Isaac, Dan: Preparation of the report.