This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.
I’ve been working towards HDMI output on my TPU SOC, and this week I managed to get pixels (very large pixels!) output to the screen.
The plan was to map an area of memory to a VRAM block, which could be read and written from the TPU, and also read by the graphics subsystem that generates the video signals to be output.
The current ram used in TPU is a block ram primitive entity on Spartan6 – RAMB16BWER. This 16Kbit ram has two ports, which can be run at different clock rates. At the moment, we map this primitive into an ‘ebram’ component, which disables the second port, and services the block ram via the bus signals on TPU. I made a new component, handily named ‘ebram2port’ to expose the second port of the RAMB16BWER instance for read-only use.
    entity ebram2port is
      Port (
        -- existing 'ebram' TPU interface
        I_clk  : in  STD_LOGIC;
        I_cs   : in  STD_LOGIC;
        I_we   : in  STD_LOGIC;
        I_addr : in  STD_LOGIC_VECTOR (15 downto 0);
        I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        I_size : in  STD_LOGIC;
        O_data : out STD_LOGIC_VECTOR (15 downto 0);

        -- new read-only secondary port interface
        I_p2_clk  : in  STD_LOGIC;
        I_p2_cs   : in  STD_LOGIC;
        I_p2_addr : in  STD_LOGIC_VECTOR (15 downto 0);
        O_p2_data : out STD_LOGIC_VECTOR (15 downto 0)
      );
    end ebram2port;
An instance of ebram2port is created in my TPU top-level design. The chip select signals (I_cs and I_p2_cs) are driven via some conditionals which check for address ranges on the TPU output bus.
    CS_VRAM <= '1' when ((MEM_O_addr >= X"C000") and (MEM_O_addr < X"C200") and (O_int_ack = '0')) else '0';
    CS_ERAM <= '1' when ((MEM_O_addr < X"1000") and (O_int_ack = '0')) else '0';
CS_ERAM being the chip select for the actual embedded ram instance, with the TPU bootloader code.
The input to TPU from our data sources, such as RAM and peripherals, also needs to change.
    MEM_I_data <= INT_DATA         when O_int_ack = '1' else
                  ram_output_data  when CS_ERAM = '1' else
                  vram_output_data when CS_VRAM = '1' else
                  IO_DATA;
The INT_DATA and IO_DATA busses are controlled by other external processes, and thus don’t matter much here. This is the kind of code I’d like auto-generated from my emulator, temu: once it is duplicated to the extent I’ll need (tens of block rams integrated), human error comes into play. Everyone makes copy and paste errors. Everyone.
The last real item is the address, which is fed into I_addr. This must be modified from the 0xc000 – 0xc200 address that TPU sees to 0x0000 – 0x0200. This is done as you’d expect, by simply chopping off the high 4 bits.
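The decode and address translation described so far can be modelled in a few lines of Python. This is only a behavioural sketch of the chip-select conditionals and the "chop off the high 4 bits" step, not the VHDL itself; the function name `decode` and the constant names are made up for illustration.

```python
# Behavioural model of the TPU address decode (illustrative, not the VHDL).
VRAM_BASE = 0xC000  # first bus address that selects VRAM
VRAM_END = 0xC200   # one past the last VRAM address
ERAM_END = 0x1000   # one past the last embedded RAM address


def decode(addr, int_ack=0):
    """Return (cs_eram, cs_vram, local_addr) for a 16-bit bus address."""
    cs_vram = int(VRAM_BASE <= addr < VRAM_END and int_ack == 0)
    cs_eram = int(addr < ERAM_END and int_ack == 0)
    # Dropping the high 4 bits maps 0xC000..0xC1FF onto 0x0000..0x01FF.
    local_addr = addr & 0x0FFF
    return cs_eram, cs_vram, local_addr


if __name__ == "__main__":
    for a in (0x0010, 0xC000, 0xC1FE, 0xC200):
        print(hex(a), decode(a))
```

Note that both selects are forced low while an interrupt acknowledge is active, matching the `O_int_ack = '0'` term in the conditionals.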
Now we have a VRAM block integrated into the TPU top-level module, which TPU programs can read and write to via standard memory instructions to our mapped area of memory, but which also has a second port, which can read the same memory but at a different clock rate. The difference in clock rate is the important part in this.
- DVI is a subset of HDMI.
- The pixel signals and timing are essentially the same as VGA.
- The data is encoded as TMDS serial.
- The data is then sent along 3 differential signal pairs, with a 4th pair for a clock.
My code uses the DVID test project from Michael’s Hamsterworks Wiki. I’ve edited some areas of the code for my own requirements.
VGA timing generally works along the lines of a pixel clock, which is set specifically to allow for the number of pixels required for your resolution to be transmitted within tolerances, along with horizontal and vertical sync signals, and a blanking flag. The pixel data itself can be thought of as a sub-image of a larger set of data which is transmitted, origin in the top left hand corner. The area to the right and bottom which is not part of the original data is ‘blank’.
The timings and durations of these blanking periods all depend on set figures defined by standards. For example, for an 800×600, 60Hz image, the pixel clock is 40MHz. Essentially, each row can be thought of as having 1056 pixels, with the additional pixels accounting for blanking and sync periods, where the actual pixel value doesn’t matter – it exists only for timing. An example for the resolution above lays out the exact number of pixels in each area, along with time representations.
I have a VGA signal generator, which takes the pixel clocks and counts through the pixels, outputting pixel offsets, sync and blanking bits. Within this VHDL module, the constants for our 800×600 image are as follows:
    constant h_rez        : natural := 800;
    constant h_sync_start : natural := 800+40;
    constant h_sync_end   : natural := 800+40+128;
    constant h_max        : natural := 1056;

    constant v_rez        : natural := 600;
    constant v_sync_start : natural := 600+1;
    constant v_sync_end   : natural := 600+1+4;
    constant v_max        : natural := 628;
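As a sanity check on those numbers, here is a small Python model of the counters. It assumes the sync pulses are active from `*_sync_start` up to (but not including) `*_sync_end`; the real vga_gen module may place the edges one count differently, so treat the interval boundaries as an assumption.

```python
# Timing constants for 800x600 @ 60Hz, copied from the VHDL constants.
H_REZ, H_SYNC_START, H_SYNC_END, H_MAX = 800, 840, 968, 1056
V_REZ, V_SYNC_START, V_SYNC_END, V_MAX = 600, 601, 605, 628
PIXEL_CLOCK_HZ = 40_000_000


def refresh_rate_hz():
    """Frames per second implied by the pixel clock and total frame size."""
    return PIXEL_CLOCK_HZ / (H_MAX * V_MAX)


def signals(h, v):
    """Return (blank, hsync, vsync) for counter position (h, v)."""
    blank = int(h >= H_REZ or v >= V_REZ)
    hsync = int(H_SYNC_START <= h < H_SYNC_END)  # assumed half-open interval
    vsync = int(V_SYNC_START <= v < V_SYNC_END)  # assumed half-open interval
    return blank, hsync, vsync


if __name__ == "__main__":
    print(f"refresh rate: {refresh_rate_hz():.2f} Hz")
```

The refresh rate comes out slightly above 60Hz, because 1056 × 628 pixels per frame at exactly 40MHz does not divide to a round number.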
The VGA signal generator is exposed in my TPU design as the following entity.
    entity vga_gen is
      Port (
        pixel_clock : in  STD_LOGIC;
        pixel_h     : out STD_LOGIC_VECTOR(11 downto 0);
        pixel_v     : out STD_LOGIC_VECTOR(11 downto 0);
        blank       : out STD_LOGIC := '0';
        hsync       : out STD_LOGIC := '0';
        vsync       : out STD_LOGIC := '0'
      );
    end vga_gen;
The pixel_h and pixel_v offsets then combine to form an address which can be looked up in VRAM, which holds the pixel data.
Generating the TMDS data
The image data we’ll send over the HDMI cable is actually DVI. The way HDMI and DVI send image data can be pretty much the same. HDMI can carry more varied data, such as sound, but that’s really just hidden in the blanking periods of the communicated image.
TMDS (or Transition-minimized differential signalling if you want the full name!) is a method for transmitting serial data at high clock rates over varying length cables. It has methods for reducing the effects of electromagnetic interference. You can read more about it over at Wikipedia.
The main thing to understand is that it’s a form of 8b/10b encoding. 8 bits of data are encoded as 10 bits in such a way that the number of transitions is kept low and the number of 1s and 0s sent stays balanced over time. This keeps the DC voltage at a sustained average level, which has various benefits.
Michael has a few TMDS encoder modules available on his various projects, going from basic ones which match low-end 3-bit per pixel input to fixed outputs, to a real encoder capable of the full range of 8-bit per channel RGB. I use the full encoder without modifications. A simple flow of how it works is as follows (again, from Wikipedia):
A two-stage process converts an input of 8 bits into a 10 bit code.
- In the first stage, the first bit is transformed and each subsequent bit is either XOR or XNOR transformed against the previous bit.
The encoder chooses between XOR and XNOR by determining which will result in the fewest transitions. The ninth bit encodes which operation was used.
- In the second stage, the first eight bits are optionally inverted to even out the balance of ones and zeros and therefore the sustained average DC level; the tenth bit encodes whether this inversion took place.
With this encoder, we can get the 10 bits we then need to serialize across the cable to our monitor.
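To make the two stages concrete, here is a Python sketch of the encoder for active video data, written from the algorithm as it appears in the DVI 1.0 specification. It is not the Hamsterworks VHDL encoder, it does not cover the control symbols sent during blanking, and the names `tmds_encode` and `tmds_decode` are just for illustration. The `cnt` argument is the running disparity carried from one pixel to the next.

```python
def tmds_encode(d, cnt):
    """Encode 8-bit value d; return (10-bit word, updated running disparity)."""
    # Stage 1: XOR or XNOR chain, chosen to minimise transitions.
    ones = bin(d & 0xFF).count("1")
    use_xnor = ones > 4 or (ones == 4 and (d & 1) == 0)
    q_m = d & 1
    for i in range(1, 8):
        x = ((q_m >> (i - 1)) ^ (d >> i)) & 1
        if use_xnor:
            x ^= 1
        q_m |= x << i
    bit8 = 0 if use_xnor else 1  # ninth bit records which operation was used

    # Stage 2: optionally invert the data bits to balance ones and zeros.
    n1 = bin(q_m).count("1")
    n0 = 8 - n1
    if cnt == 0 or n1 == n0:
        invert = bit8 == 0
        cnt += (n0 - n1) if bit8 == 0 else (n1 - n0)
    elif (cnt > 0 and n1 > n0) or (cnt < 0 and n0 > n1):
        invert = True
        cnt += 2 * bit8 + (n0 - n1)
    else:
        invert = False
        cnt += -2 * (1 - bit8) + (n1 - n0)
    data = (q_m ^ 0xFF) if invert else q_m
    return (int(invert) << 9) | (bit8 << 8) | data, cnt


def tmds_decode(word):
    """Invert tmds_encode for a data word (no control symbols)."""
    q = word & 0xFF
    if word & 0x200:          # tenth bit set: data bits were inverted
        q ^= 0xFF
    use_xor = (word >> 8) & 1
    d = q & 1
    for i in range(1, 8):
        x = ((q >> i) ^ (q >> (i - 1))) & 1
        if not use_xor:
            x ^= 1
        d |= x << i
    return d


if __name__ == "__main__":
    cnt = 0
    for value in (0x00, 0x00, 0xFF, 0x55):
        word, cnt = tmds_encode(value, cnt)
        print(f"{value:02x} -> {word:010b} (disparity {cnt})")
```

Because of the running disparity, the same input byte can produce different 10-bit words on consecutive pixels, which is why the decoder only needs the two flag bits and never the disparity itself.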
Serializing the TMDS data
To serialize the TMDS data to our differential output pairs, we use double data rate output registers (ODDR2). These registers are instantiated as primitives in the VHDL. Using these DDR registers, we only need a serialization clock 5x that of the pixel clock, rather than 10x. There are ‘true’ serialization primitives available on Spartan6, which I may look at later (there is a SERDES example on Hamsterworks for those interested).
    ODDR2_red : ODDR2
      generic map (
        DDR_ALIGNMENT => "C0",
        INIT          => '0',
        SRTYPE        => "ASYNC"
      )
      port map (
        Q  => red_s,
        D0 => shift_red(0),
        D1 => shift_red(1),
        C0 => clk,
        C1 => clk_n,
        CE => '1',
        R  => '0',
        S  => '0'
      );
Each pixel clock, the 10-bit TMDS value for each pixel is latched. Each subsequent cycle of the 5x pixel clock, the TMDS value is shifted 2 bits to the right. The low 2 bits are then fed into the D0 and D1 inputs of our ODDR2 register. The clock inputs C0 and C1 are both 5x the pixel clock, so 200MHz, but the C1 clock input is 180 degrees out of phase.
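The bit order that results from this shifting can be sketched in Python. This models only the order in which bits leave the register (D0 first, then D1, least significant bits first), under the assumption that D0 is presented on the C0 edge and D1 half a fast-clock cycle later; `serialize` is a hypothetical helper name.

```python
def serialize(word10):
    """Return the 10 bits of a TMDS word in the order they hit the wire."""
    shift = word10 & 0x3FF
    out = []
    for _ in range(5):                # five cycles of the 5x clock per pixel
        out.append(shift & 1)         # D0, clocked out by C0
        out.append((shift >> 1) & 1)  # D1, clocked out by C1 (180 degrees later)
        shift >>= 2                   # shift right by two bits for the next cycle
    return out


if __name__ == "__main__":
    print(serialize(0b0100000000))
```

Five fast-clock cycles with two bits each gives the ten bits per pixel, which is where the 5x (rather than 10x) clock requirement comes from.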
The output of this register, red_s, is then fed into an OBUFDS primitive which drives the TMDS pair output, which is connected to the HDMI socket pins on the miniSpartan6+ board.
    OBUFDS_red : OBUFDS
      port map (
        O  => hdmi_out_p(2),
        OB => hdmi_out_n(2),
        I  => red_s
      );
The other 3 channels are handled similarly. They go in the order 0:Blue, 1:Green, 2:Red, 3:Clock.
At the moment my clocking system needs work, but it’s fixed just now to my needs for 800x600x60Hz. For this, the 50MHz miniSpartan6+ input clock is buffered, then input into a PLL which multiplies it by 16 to 800MHz, before dividing it by 20 to 40MHz for the pixel clock, and by 4 to 200MHz for the serial drivers. There is also a second 200MHz output, 180 degrees out of phase, used in the ODDR2 registers as clk_n.
    PLL_BASE_inst : PLL_BASE
      generic map (
        CLKFBOUT_MULT  => 16,          -- 800MHz
        CLKOUT0_DIVIDE => 20,          -- 40MHz
        CLKOUT0_PHASE  => 0.0,
        CLKOUT1_DIVIDE => 4,           -- 200MHz
        CLKOUT1_PHASE  => 0.0,
        CLKOUT2_DIVIDE => 4,           -- 200MHz
        CLKOUT2_PHASE  => 180.0,
        CLK_FEEDBACK   => "CLKFBOUT",  -- Clock source to drive CLKFBIN
        CLKIN_PERIOD   => 20.0,        -- IMPORTANT! 20.00 = 50MHz
        DIVCLK_DIVIDE  => 1            -- Division value for all output clocks (1-52)
      )
      port map (
        CLKFBOUT => clk_feedback,
        CLKOUT0  => clock_x1_unbuffered,
        CLKOUT1  => clock_x5_unbuffered,
        CLKOUT2  => clock_x5_180_unbuffered,
        CLKOUT3  => open,
        CLKOUT4  => open,
        CLKOUT5  => open,
        LOCKED   => pll_locked,
        CLKFBIN  => clk_feedback,
        CLKIN    => clk50_buffered,
        RST      => '0'
      );
As with the 50MHz input, the 3 clock outputs are buffered before being used in the various subsystems. For this the BUFG primitive is used.
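The clock arithmetic is easy to check using the generic values from the PLL_BASE instance. The Python below is just that arithmetic written out; the function name and output labels are made up for illustration.

```python
CLKIN_HZ = 50_000_000  # miniSpartan6+ input clock
CLKFBOUT_MULT = 16     # feedback multiplier
DIVCLK_DIVIDE = 1      # input divider
CLKOUT_DIVIDE = {"pixel": 20, "serial": 4, "serial_180": 4}


def pll_outputs():
    """Return (VCO frequency, {output name: frequency}) in Hz."""
    vco = CLKIN_HZ * CLKFBOUT_MULT // DIVCLK_DIVIDE
    return vco, {name: vco // div for name, div in CLKOUT_DIVIDE.items()}


if __name__ == "__main__":
    vco, outs = pll_outputs()
    print("VCO:", vco, outs)
    # Two bits per serial clock cycle (DDR) gives the per-channel bit rate.
    print("bits per second per channel:", outs["serial"] * 2)
```

The serial clock is exactly five times the pixel clock, and at two bits per cycle that gives ten bits per pixel on each channel.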
At the moment I have the second port on my ‘vram’ instance clocked at 200MHz. The first port, which TPU uses, is clocked at 50MHz. 200MHz is within the allowable operating range for the device I’m using, and it seems to work well. At the moment, I’m pretty sure that I am 1 pixel out of phase, but I can fix that later. The address that the VRAM sees is the following:
    -- generate the vram scan address, forcing reads at 2 byte boundaries
    vram_addr <= X"0" & "000" & pixel_v(8 downto 5) & pixel_h(8 downto 5) & '0';

    -- Only show 512x512 of the display with our expanded virtual pixels
    vram_data <= vram_output_2 when ((pixel_h(11 downto 9) = "000") and (pixel_v(11 downto 9) = "000")) else X"0000";
The ‘VRAM’ is currently set up to contain a 16×16 image. Tiny, but perfect for what I need just now. Each image pixel covers a 32×32 block of screen pixels, which is how 16×16 fills the 512×512 area. The 16-bit pixels are in 565 format, and I trivially expand that to 8 bits per channel for the TMDS encoders.
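The scan address and pixel expansion can be modelled like this in Python. The address function mirrors the bit slicing in the VHDL; the 565 expansion is an assumption, since the post does not show it: it takes red in the top 5 bits and pads the low bits of each channel with zeros. All function names are hypothetical.

```python
def vram_scan_addr(pixel_h, pixel_v):
    """Byte address of the 16-bit VRAM word for a screen position."""
    row = (pixel_v >> 5) & 0xF   # pixel_v(8 downto 5)
    col = (pixel_h >> 5) & 0xF   # pixel_h(8 downto 5)
    return (row << 5) | (col << 1)  # trailing '0' forces 2-byte alignment


def visible(pixel_h, pixel_v):
    """True inside the 512x512 window (top 3 of the 12 counter bits are zero)."""
    return ((pixel_h >> 9) & 0x7) == 0 and ((pixel_v >> 9) & 0x7) == 0


def expand_565(pixel):
    """Expand RGB565 to 8 bits per channel by zero-padding (assumed method)."""
    r5 = (pixel >> 11) & 0x1F
    g6 = (pixel >> 5) & 0x3F
    b5 = pixel & 0x1F
    return r5 << 3, g6 << 2, b5 << 3


if __name__ == "__main__":
    print(hex(vram_scan_addr(511, 511)), expand_565(0xFFFF))
```

With 16 rows of 16 two-byte pixels the image occupies 512 bytes, which matches the 0xC000 to 0xC200 window the TPU sees.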
Now we have an integrated graphics subsystem, albeit one that is very rigid (for now).
I currently need to have the following definition in my constraints file for the clock:
CLOCK_DEDICATED_ROUTE = FALSE;
Without it, the design doesn’t route. It builds and works fine (seemingly) with that included, but I’ve still to nail down exactly what it means, and how to fix it. Currently, my understanding is that when my design is built, the tools can’t generate a clock placement which satisfies all the rules set. It’s something I want to understand further. It could be as simple as missing out some buffers.
There is also a line of pixels to the far right of the displayed screen, which suggests I’m out of phase by one pixel between the memory read results and the VGA signals. This isn’t too bad, so I’ll look at fixing that when I increase the VRAM size for higher resolution.
This brings this part to a close. We have HDMI output which is the representation of a small VRAM that TPU controls. It’s pretty neat. I hope to increase the resolution of the image from its current 16×16 to something more manageable.
The emulator was very useful during this, as it validated my output for me. Ignore the 2 bright green pixels in the superimposed emulator output 😉
(The HDMI output to the left is actually 16×16; the combination of lighting and a bad camera seems to give the impression of 8×8.)