We hear the term JPEG all the time. What really is JPEG? How does it work?

I will try to give a brief overview of what JPEG is and how it works.

This is a beginner level article, so we will not dwell too much on the details.

First, JPEG is not a file format. It’s a compression method.

The term JPEG is an acronym for the Joint Photographic Experts Group, which created the standard.

Most of the time, when you talk about a file being "in JPEG format", you are actually referring to the JFIF (JPEG File Interchange Format) wrapper.

Let’s start from the beginning of an image’s life.

You take a picture with your camera.

The camera’s sensor is overlaid with a **color filter array (CFA)** , usually a Bayer filter, consisting of a mosaic of a 2×2 matrix of red, green, blue and (again) green filters. The green photo-sensors are luminance-sensitive elements and the red and blue ones are chrominance-sensitive elements. Bayer used twice as many green elements as red or blue so as to mimic the physiology of the human eye.

The raw image file thus obtained from the sensor is a **bitmap** (a 2D array of pixel values). It is huge.

So, we need to compress this file, and that’s where JPEG comes into play.

JPEG is a lossy compression technique, which means it uses approximations and partial data discarding to compress the content. Therefore it’s irreversible.

JPEG compression is based on the following 2 observations:

**Observation #1** : Human eyes don’t see color (chrominance) quite as well as they see brightness (luminance).

**Observation #2** : Human eyes can’t distinguish high frequency changes in image intensity.

### Step 1: Convert RGB to YCbCr color space

Each pixel in your image is stored as an additive combination of Red, Green and Blue values. Each of these values can be in the range of 0 to 255. This color model is called the RGB model. Consider a pixel that is khaki in color: it will be stored as (240, 230, 140).

Remember Observation #1 – Luminance is more important to the eventual perceptual quality of the image than color. So we convert from RGB color space to one where luminance is confined to a single channel. This color space is called **YCbCr** .

Here, **Y** is the luminance component and **Cb** , **Cr** are the chrominance components: the blue-difference and red-difference, respectively.

Their values will be in the range 0 to 255.

YCbCr values can be computed directly from RGB as follows: ^{[1]}

Y = 0.299 R + 0.587 G + 0.114 B

Cb = -0.1687 R - 0.3313 G + 0.5 B + 128

Cr = 0.5 R - 0.4187 G - 0.0813 B + 128
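These formulas can be checked with a few lines of Python (the `rgb_to_ycbcr` helper is my own, not from any JPEG library):

```python
def rgb_to_ycbcr(r, g, b):
    """JFIF-style RGB -> YCbCr conversion, full 0-255 range."""
    y  =  0.299  * r + 0.587  * g + 0.114  * b
    cb = -0.1687 * r - 0.3313 * g + 0.5    * b + 128
    cr =  0.5    * r - 0.4187 * g - 0.0813 * b + 128
    return round(y), round(cb), round(cr)

# The khaki pixel from earlier:
print(rgb_to_ycbcr(240, 230, 140))  # (223, 81, 140)
```

Note how most of the "energy" of the pixel lands in Y, while Cb and Cr hover nearer the middle of their range.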

### Step 2: Downsampling

Since chrominance is not very important, we can downsample and reduce the amount of color (CbCr components).

Generally, color is reduced by a factor of 2 in both directions (vertical & horizontal) – that is, Y is sampled at every pixel, whereas Cb and Cr are sampled once per 2×2 block of pixels.

Now for every 4 Y pixels, there will exist only 1 CbCr pixel.

You won’t notice much of a change in the image, but a good amount of file size is reduced.

In image editing software, you are usually asked what quality you want the image saved at. This is the software asking how aggressively to compress: the quality setting mainly scales the quantization tables we will meet in Step 4, and encoders typically combine it with chroma downsampling like this.
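Here is a minimal numpy sketch of that 2×2 reduction, assuming we average each block (one common choice; `downsample_420` is a made-up name):

```python
import numpy as np

def downsample_420(channel):
    """Average each 2x2 block of a chroma plane (height and width must be even)."""
    h, w = channel.shape
    # Split into 2x2 tiles, then average within each tile.
    return channel.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

cb = np.arange(16, dtype=float).reshape(4, 4)  # a toy 4x4 chroma plane
print(downsample_420(cb))                      # 2x2 result, one value per 2x2 block
```

A 4×4 chroma plane becomes 2×2, so the two chroma channels together shrink to a quarter of their original size.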

### Step 3: Use Discrete Cosine Transform (DCT)

Each of the three YCbCr components is compressed and encoded separately using the same method described here. For now, consider only one of these components; the other two are processed in exactly the same way.

**3 a. base images**

DCT is a method that expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies.

For compression, cosine functions are used rather than sine functions because fewer cosine functions are needed to approximate a typical signal.

See the following image ^{[2]}

These are 64 base images, that are built from cosine functions at different frequencies in the X and Y axes.

The first base image, `baseimg[0][0]`, is a constant block (full white). From `baseimg[0][1]` to `baseimg[0][7]`, you can see the frequency increasing along the x-axis, and from `baseimg[1][0]` to `baseimg[7][0]`, the frequency increasing along the y-axis. The last one, `baseimg[7][7]`, is totally checkered.
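If you want to generate these base images yourself, they come straight from the product-of-cosines definition. A sketch (the `basis` helper is my own naming, not from the article):

```python
import numpy as np

def basis(u, v, n=8):
    """The (u, v)-th 8x8 DCT base image: a product of cosines in x and y."""
    x = np.arange(n)
    col = np.cos((2 * x + 1) * u * np.pi / (2 * n))  # frequency u along one axis
    row = np.cos((2 * x + 1) * v * np.pi / (2 * n))  # frequency v along the other
    return np.outer(col, row)

print(basis(0, 0))  # constant block (the "full white" base image)
print(basis(7, 7))  # fastest oscillation in both directions (the checkered one)
```

Plotting all 64 of these as tiny grayscale tiles reproduces the 8×8 grid of base images referenced above.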

**3 b. sub-images**

The entire image we want to compress is divided into blocks of 8×8 pixels each. Let’s call each of them a sub-image.

This sub-image can be visualized as an 8×8 matrix.

We are going to compress the full image one sub-image at a time.

Consider an example. The values of the component under consideration are given in the following matrix:

```
 64  60  57  56  48  47  47  43
 61  58  53  52  48  49  52  53
 67  60  53  53  49  47  48  54
 68  61  63  63  62  65  65  64
 71  61  70  63  69  74  88  88
 83  92 102 105 107 111 110 115
 95 108 108 124 122 130 128 128
107 118 125 134 137 142 141 137
```

Since we are going to use DCT, and cosine waves go from 1 to -1, we are going to center our values around zero. This means we shift the range from [0..255] to [-128..127]. So we subtract 128 from every value.

Now our sub-image is shifted to:

```
-64 -68 -71 -72 -80 -81 -81 -85
-67 -70 -75 -76 -80 -79 -76 -75
-61 -68 -75 -75 -79 -81 -80 -74
-60 -67 -65 -65 -66 -63 -63 -64
-57 -67 -58 -65 -59 -54 -40 -40
-45 -36 -26 -23 -21 -17 -18 -13
-33 -20 -20  -4  -6   2   0   0
-21 -10  -3   6   9  14  13   9
```

Now, we have 2 things in our hand:

1. The 8×8 sub-image to be compressed
2. The 64 base images

Our task here is to transform the sub-image to a linear combination of these 64 base images.

The sub-image can be converted to this frequency-domain representation using a normalized, two-dimensional type-II Discrete Cosine Transform (DCT).

We can think of the sub-image as a weighted combination of these 64 base images layered on top of each other.

Therefore, **subimage = C1·f1 + C2·f2 + C3·f3 + … + C64·f64**

where the Ci are constant coefficients and the fi are the base images.

We can find each of these coefficients (Ci) using the DCT (type II). I am not going to explain here how the DCT works internally. That’s for you to find out.

I am going to use the following Python code to compute the DCT of my sub-image:

```python
import numpy as np
from scipy.fftpack import dct

def dct2D(x):
    tmp = dct(x, type=2, norm='ortho').transpose()
    return dct(tmp, type=2, norm='ortho').transpose()

print(dct2D(np.array([
    [-64., -68., -71., -72., -80., -81., -81., -85.],
    [-67., -70., -75., -76., -80., -79., -76., -75.],
    [-61., -68., -75., -75., -79., -81., -80., -74.],
    [-60., -67., -65., -65., -66., -63., -63., -64.],
    [-57., -67., -58., -65., -59., -54., -40., -40.],
    [-45., -36., -26., -23., -21., -17., -18., -13.],
    [-33., -20., -20.,  -4.,  -6.,   2.,   0.,   0.],
    [-21., -10.,  -3.,   6.,   9.,  14.,  13.,   9.],
])))
```

And, I get:

This is the 8×8 table of coefficients that represents the contribution of each base image to the sub-image.

### Step 4: Quantization

We will now quantize the coefficient table we obtained using DCT. This is the real lossy part of the process.

In the table of coefficients we got through the DCT, the top-left cells hold the low-frequency content and the bottom-right cells hold the high-frequency content. We know that the high-frequency content can be eliminated without much change in how the image looks. (Remember Observation #2)

So we now prepare an 8×8 quantization table. This table has very small values in the top-left part and very high values towards the bottom-right part.

Every value in the coefficient table is divided by the corresponding value in the quantization table and rounded to the nearest integer.

Now, because of the high divisors in the bottom-right part, the values there round to zero, thus eliminating the high-frequency data.

The quantization table is up to the encoder, and it is therefore stored in the image header so that the image can later be decoded.

Here’s a standard JPEG quantization table (the luminance table from Annex K of the spec):

```
16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99
```

And here’s our sub-image after quantization (each value in our coefficient table divided by the corresponding value in the quantization table and rounded):

```
-24 -23   0   0   0   0   0   0
 19   4   1   0   0   0   0   0
  5   0   1   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
```

Notice that in the quantized output, every value outside the top-left 3×3 block is zero. That is the high-frequency data we eliminated. JPEG’s claim to fame is that with just these 9 values we can get almost the same image back.
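The quantization step itself is just an element-wise divide-and-round. A sketch, assuming the standard Annex K luminance table (`quantize` and `dequantize` are my names, not part of any library):

```python
import numpy as np

# Standard JPEG luminance quantization table (Annex K of the spec)
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def quantize(coeffs):
    """Divide each DCT coefficient by its table entry, round to nearest int."""
    return np.round(coeffs / Q).astype(int)

def dequantize(quantized):
    """The decoder's approximate inverse: multiply back by the table."""
    return quantized * Q
```

`dequantize(quantize(c))` recovers only a rounded approximation of `c`; that rounding is exactly where the loss happens.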

### Step 5: Encoding

We now have the compressed output as a 2D array, and we know that most of its values are zeroes. So we will find a better way to store the sub-image than as a raw 2D array.

We will store the values in a zigzag order. So the data will be:

-24, -23, 19, 5, 4, 0, 0, 1, 0, 0, 0, 0, 1 followed by 51 zeroes.

Data with this pattern can be easily compressed by the Run-Length Encoding (RLE) algorithm. The final output is encoded using a combination of RLE and Huffman coding.
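The zigzag read-out and the run-length idea can be sketched as follows (real JPEG interleaves run lengths with Huffman codes; `zigzag` and `rle` here are simplified illustrations of my own):

```python
def zigzag(block):
    """Read an 8x8 block in zigzag order: anti-diagonals, alternating direction."""
    out = []
    for s in range(15):  # anti-diagonals 0..14
        diag = [(i, s - i) for i in range(8) if 0 <= s - i < 8]
        if s % 2 == 0:
            diag.reverse()  # even diagonals run bottom-left to top-right
        out.extend(block[i][j] for i, j in diag)
    return out

def rle(values):
    """Collapse runs of zeros into (0, run_length) pairs."""
    out, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            if run:
                out.append((0, run))
                run = 0
            out.append(v)
    if run:
        out.append((0, run))
    return out

data = [-24, -23, 19, 5, 4, 0, 0, 1, 0, 0, 0, 0, 1] + [0] * 51
print(rle(data))  # [-24, -23, 19, 5, 4, (0, 2), 1, (0, 4), 1, (0, 51)]
```

The long tail of zeroes collapses into a single pair, which is why reading in zigzag order (low frequencies first, high frequencies last) pays off.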

### Step 6: Add Header

Put whatever is required in the header according to the specification.

Your compressed file is ready!

To decompress, simply do the reverse.

Use the Discrete Cosine Transform-III (the inverse DCT) to reverse the DCT-II.
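That round trip can be sketched with the same scipy functions used in the encoder code earlier (a sketch only; a real decoder also undoes the entropy coding, dequantizes, and adds the 128 level shift back):

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2D(x):
    """Forward 2-D DCT-II, as used by the encoder."""
    return dct(dct(x, type=2, norm='ortho', axis=0), type=2, norm='ortho', axis=1)

def idct2D(x):
    """Inverse 2-D DCT (the type-III transform), as used by the decoder."""
    return idct(idct(x, type=2, norm='ortho', axis=0), type=2, norm='ortho', axis=1)

block = np.arange(64, dtype=float).reshape(8, 8) - 32  # any level-shifted 8x8 block
roundtrip = idct2D(dct2D(block))
print(np.allclose(roundtrip, block))  # True: DCT-III exactly undoes DCT-II
```

Without quantization the transform is perfectly reversible; the only irreversible part of JPEG is the rounding in Step 4.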

Next, in JPEG 102, we will write a JPEG encoder from scratch in C. But that’s for another day.

**Footnotes**

1. JPEG File Interchange Format – Eric Hamilton, W3C

2. Image credit: https://oku.edu.mie-u.ac.jp/~okumura