Vivado HLS

During my research, I battled a lot with Vivado HLS. What follows is some advice based on that experience.

Always debug the software implementation before synthesizing it.

Some of Vivado HLS's error messages are not very descriptive or point you in the wrong direction. I have spent hours debugging issues that I could easily have found by testing the software properly. One notorious example was an array that I was accessing out of bounds. The synthesis tool threw errors that did not point me in the right direction; I only found the bug after I updated my software testbench.

Always run the software implementation in a memory checker before synthesis.

This is also motivated by the previous issue. I have had good experience using Valgrind for this purpose.

Don’t use array_reshape and array_partition on the same array.

This will lead to the following warning:

WARNING: [XFORM 203-180] Applying partition directive (<location>) and reshape directive (<location>) on the same variable '<variable>' may lead to unexpected synthesis behaviors.

Reusing arrays can be bad for meeting timing.

The synthesis tool does not always realize that accessing different array elements can be done in parallel. Take for example the following function:

void Test(int X, int A[256], int B[256])
{
#pragma HLS ARRAY_PARTITION variable=A complete dim=1
#pragma HLS ARRAY_PARTITION variable=B complete dim=1
  for (int i = 0; i < 256; i++)
  {
#pragma HLS unroll
    if (B[i] == 1)
      A[i] *= X;
  }
}

This requires at least 256 cycles in Vivado HLS 2017.1 with a clock period of 2 ns. Analysis of the schedule reveals that the tool finds dependencies between uses of A in different iterations of the loop. If we separate A into two arrays, one for input and one for output, we achieve the same result in 7 cycles:

void Test(int X, int A[256], int B[256], int C[256])
{
#pragma HLS ARRAY_PARTITION variable=A complete dim=1
#pragma HLS ARRAY_PARTITION variable=B complete dim=1
#pragma HLS ARRAY_PARTITION variable=C complete dim=1
  for (int i = 0; i < 256; i++)
  {
#pragma HLS unroll
    if (B[i] == 1)
      C[i] = A[i] * X;
    else
      C[i] = A[i];
  }
}

Note that there are more solutions, such as using the HLS dependence pragma.
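
For illustration, a sketch of the original function with a dependence pragma is shown below. I have not verified that this particular pragma recovers the 7-cycle schedule, so treat it as a starting point rather than a drop-in solution.

void Test(int X, int A[256], int B[256])
{
#pragma HLS ARRAY_PARTITION variable=A complete dim=1
#pragma HLS ARRAY_PARTITION variable=B complete dim=1
  // Assert that iterations do not depend on each other through A.
#pragma HLS DEPENDENCE variable=A inter false
  for (int i = 0; i < 256; i++)
  {
#pragma HLS unroll
    if (B[i] == 1)
      A[i] *= X;
  }
}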

Conditional expressions can eliminate false dependencies.

Continuing with the previous example, we can also solve the false dependencies by using conditional expressions instead of if-statements.

void Test(int X, int A[256], int B[256])
{
#pragma HLS ARRAY_PARTITION variable=A complete dim=1
#pragma HLS ARRAY_PARTITION variable=B complete dim=1
  for (int i = 0; i < 256; i++)
  {
#pragma HLS unroll
    A[i] = B[i] == 1 ? A[i] * X : A[i];
  }
}

Unsigned integers are often preferred over signed integers.

It is tempting to declare integer variables as (signed) int, just as one would in software, but some operations, such as modulo and division, are more expensive on signed integers than on unsigned integers. When the divisor is a power of two, an unsigned division reduces to a simple truncation (a shift) and a modulo to a bitwise AND, whereas the signed versions need additional logic to handle negative operands.
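
As a small illustration (assuming a power-of-two divisor of 16), the unsigned operations below reduce to a shift and a bitwise AND, whereas the signed ones require extra correction logic for negative operands.

void Split(unsigned int U, int S,
           unsigned int & UQuot, unsigned int & URem,
           int & SQuot, int & SRem)
{
  UQuot = U / 16;  // same hardware as U >> 4
  URem  = U % 16;  // same hardware as U & 0xF
  SQuot = S / 16;  // needs extra logic to round towards zero for negative S
  SRem  = S % 16;  // needs extra logic to match the sign of S
}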

Make sure that all outputs are updated.

I have seen misleading messages from Vivado HLS 2017.1 such as the following error:

ERROR: [XFORM 203-123] Cannot stream  '<array>' (<location>): Some entries have both read and write access.

This error refers to an array of structures that was an output of the function. I had updated a number of elements of the array, but I had forgotten to update several fields in those structures. In that case, the fields are supposed to retain the value that they had upon invocation of the function, which means that the array was not only written but also read.

Not updating all elements in a ping pong buffer can yield correct implementations.

Vivado HLS 2017.1 will throw the following discouraging warning when not all elements in a ping pong buffer are written:

WARNING: [XFORM 203-713] All the elements of global array '<array>' (<file>) should be updated in process function '<function>' (<file>), otherwise it may not be synthesized correctly.

When you are working with data of variable length, you may want to size the buffer such that it can accommodate the maximum size and you may find yourself in situations where you do not want to fill the entire buffer. Despite the scary warning, such designs seem to synthesize just fine.

Not accessing all elements in an input or output array can yield correct implementations.

Similar to the last warning, Vivado HLS 2017.1 may also throw the following warning at you if you use input or output arrays of a function that are implemented with a streaming interface such as ap_fifo or ap_hs:

WARNING: [XFORM 203-124] Array  '<array>' (<file>): may have improper streaming access(es), possible reasons: (1) some entries are accessed more than once; (2) some entries are not used; (3) the entries are not accessed in sequential order.

In my experience, leaving a number of elements at the end of the array untouched synthesizes fine despite violating reason (2). In that case, however, I would suggest using an hls::stream object instead.
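
A hypothetical sketch of that alternative: with an hls::stream argument, only the elements that are actually produced are written, so the question of unused entries does not arise.

#include <hls_stream.h>

// Hypothetical example: the number of valid elements is data dependent, so
// only Count elements are pushed into the stream instead of partially writing
// a fixed-size output array.
void Produce(int In[256], int Count, hls::stream<int> & Out)
{
  for (int i = 0; i < 256; i++)
  {
#pragma HLS pipeline
    if (i < Count)
      Out.write(In[i]);
  }
}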

Not accessing all elements in an input or output FIFO can result in decreased performance.

Consider the following code:

void Function1(hls::stream<int> & Input, hls::stream<int> & Output)
{
  static int Count = 0;
  if (Count == 3)
    for (int i = 0; i < 256; i++)
    {
#pragma HLS pipeline
      Output.write(Input.read());
    }
  else
    for (int i = 0; i < 16; i++)
    {
#pragma HLS pipeline
      Output.write(Input.read());
    }
  Count = (Count + 1) % 4;
}
void Function2(hls::stream<int> & Input, hls::stream<int> & Output)
{
  static int Count = 0;
  if (Count == 0)
    for (int i = 0; i < 256; i++)
    {
#pragma HLS pipeline
      Output.write(Input.read());
    }
  else
    for (int i = 0; i < 16; i++)
    {
#pragma HLS pipeline
      Output.write(Input.read());
    }
  Count = (Count + 1) % 4;
}
void Test(hls::stream<int> & Input, hls::stream<int> & Output)
{
  static hls::stream<int> Temp;
#pragma HLS STREAM variable=Temp depth=1024
#pragma HLS DATAFLOW
  Function1(Input, Temp);
  Function2(Temp, Output);
}

We call the top-level function Test many times. The first 3 invocations put data into the Temp FIFO, but nothing is consumed by Function2. The data remains in the FIFO between invocations in a software implementation because we declared the FIFO as static. In the fourth invocation, Function2 is supposed to read 1024 elements, which equals all data received so far. In software simulation, this approach works fine.

In hardware, this implementation does not achieve maximum performance because there is a hidden FIFO (called something like start_for_Function2_U) between Function1 and Function2 that is too small. The FIFO stores tokens to start Function2. We need execution of Function1 and Function2 to overlap for optimal performance. Function1 should execute 4 times while Function2 executes once. Each execution of Function1 queues a token in the hidden FIFO, so 4 tokens will be queued while Function2 executes. That requires a FIFO capacity of 4 as no tokens are consumed after Function2 has started until it completes. Unfortunately, the hidden FIFO always has a fixed size of 2.

So to conclude, do not attempt to keep data in a FIFO between different invocations of the top-level function.

Integrate array initialization into existing loops.

In a certain design, I had a pipelined loop that operated on a fully partitioned line buffer. The initialization of the line buffer was performed in a separate, unrolled loop before the main loop. My design did not meet the clock constraint, so I investigated the situation. I found that many copies of each register existed and that many φ-multiplexers were inserted in the update paths of the line buffer. I believe that φ-multiplexers are inserted to select between the different copies of a variable associated with different basic blocks in the intermediate representation. To reduce the number of φ-multiplexers and register copies, I removed the initialization loop and performed the initialization on the fly in the main loop using conditional expressions. This resulted in considerably lower resource consumption (3876 LUTs versus 22956 LUTs before the change) and a shorter critical-path estimate (5.23 ns versus 6.74 ns). The latency in clock cycles was reduced as well. Note that the Vivado HLS user guides also recommend this practice.
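
I no longer have the original design at hand, but the following minimal sketch (with hypothetical names and sizes) illustrates the idea: entries of the line buffer that have not been written yet are substituted by zero with a conditional expression inside the main loop, instead of being cleared in a separate unrolled loop beforehand.

void MovingSum(const int In[1024], int Out[1024])
{
  int LineBuffer[16];
#pragma HLS ARRAY_PARTITION variable=LineBuffer complete dim=1
  for (int i = 0; i < 1024; i++)
  {
#pragma HLS pipeline
    int Sum = 0;
    // Shift the line buffer and accumulate in the same loop. Entries that have
    // not been written yet (i < j) contribute zero, so no separate
    // initialization loop is needed before the main loop.
    for (int j = 15; j > 0; j--)
    {
#pragma HLS unroll
      LineBuffer[j] = LineBuffer[j - 1];
      Sum += (i >= j) ? LineBuffer[j] : 0;
    }
    LineBuffer[0] = In[i];
    Sum += LineBuffer[0];
    Out[i] = Sum;
  }
}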

Accelerator interface generated by SDSoC has low maximum frequency.

Whereas Xilinx architectures often support frequencies of 500 MHz or more, the IP blocks in the accelerator are typically limited to 200 MHz or less. Xilinx documentation of the AXI DMA v7.1 core, for example, reports that the AXI4 version of the IP instantiated on the Artix-7 FPGA of the ZedBoard supports only 150 MHz. 

Scatter-Gather DMA supports higher frequencies than Simple DMA.

In one of my SDSoC designs, where an AXI DMA v7.1 core was instantiated to implement a Simple DMA connection, the core did not satisfy a 7-ns (143 MHz) clock constraint. Remember from the previous section that the maximum frequency according to Xilinx is 150 MHz. That maximum frequency was probably measured in a design with lower utilization than mine, so I did not expect the design to run at 150 MHz anyway. After some experimentation, I found out that by replacing the Simple DMA with Scatter-Gather DMA, I could make the design meet the timing constraint. I had not expected that, because Scatter-Gather DMA is much more complex than Simple DMA.

Static variables are shared by multiple instances of the same function.

You are probably aware that static variables inside functions retain their value between function calls. This can be useful for things such as storing the coefficients of a convolution layer in a convolutional neural network (CNN). Now, let’s say you want to generate multiple parallel instances of this function, so you call it from a loop that you unroll with an unroll pragma. You will notice that this does not work: all hardware instances of the function still share the same static variable. To work around this, you can convert the function into a template function with a template parameter that is assigned a different value for each hardware instance, as in the sketch below.
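
A minimal sketch of this workaround with hypothetical names (loading of the coefficients is omitted):

template <int INSTANCE>
void Convolve(const int In[64], int Out[64])
{
  // Every value of INSTANCE produces a separate function, and therefore a
  // separate copy of this static array, so each hardware instance gets its
  // own coefficient storage.
  static int Coefficients[64];
#pragma HLS ARRAY_PARTITION variable=Coefficients complete dim=1
  for (int i = 0; i < 64; i++)
  {
#pragma HLS unroll
    Out[i] = In[i] * Coefficients[i];
  }
}

void Top(const int In[4][64], int Out[4][64])
{
  // The template parameter must be a compile-time constant, so the instances
  // are written out explicitly rather than generated from an unrolled loop.
  Convolve<0>(In[0], Out[0]);
  Convolve<1>(In[1], Out[1]);
  Convolve<2>(In[2], Out[2]);
  Convolve<3>(In[3], Out[3]);
}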

Take advantage of compiler optimizations to avoid replicating code.

Imagine that you have a pipeline with 10 stages that are almost identical. There is no need to make 10 copies of the same code with small alterations. Instead, write a single function that contains the code for all instances. Surround code that should run in only some of the instances with an if-statement, controlled by a new function parameter; code that all instances share can run unconditionally. If a constant is passed for the new parameter at each call site, the if-statement always evaluates to the same Boolean value, so it is optimized away, and depending on the value, the code inside it as well. As a result, the hardware is indistinguishable from a dedicated, specialized implementation. Likewise, if the only difference between two functions is a constant, such as the number of iterations in a loop, the constant can be replaced by a parameter: when a constant is passed at the call site, the compiler propagates it to the loop and no excess hardware is generated.
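
A hypothetical sketch of this pattern: one function describes all stages, and the constants passed at each call site are propagated by the compiler, so every hardware instance is specialized as if it had been written by hand.

void Stage(const int In[64], int Out[64], bool ApplyOffset, int Iterations)
{
  // The loop bound and the branch condition come from the caller; when the
  // caller passes constants, the compiler specializes this code per instance.
  for (int i = 0; i < Iterations; i++)
  {
#pragma HLS pipeline
    int Value = 3 * In[i];   // code shared by all stages
    if (ApplyOffset)         // optimized away when ApplyOffset is constant
      Value += 42;
    Out[i] = Value;
  }
}

void Pipeline(const int In[64], int Out[64])
{
  int Mid[64];
  Stage(In, Mid, true, 64);    // instance with the offset
  Stage(Mid, Out, false, 64);  // instance without the offset
}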

Be aware of the two compilation phases of Vivado HLS.

Like virtually any compiler, Vivado HLS can roughly be divided into two parts: a front-end that performs transformations on the code, and a back-end that translates the code into hardware. This may not seem relevant to coding, but it shapes what we can expect from the compiler. The array_partition and unroll pragmas, for example, control code transformations that are performed in the front-end, whereas the dataflow and pipeline pragmas only take effect in the back-end. Because unrolling is applied before pipelining, unrolling and pipelining the same loop is not the same as replicating a single pipeline: unrolling merely widens the scope of the back-end from all operations in a single loop iteration to all operations in multiple successive iterations. The front-end exists mainly for convenience; the same results can be achieved by applying the transformations manually to the code. The same cannot be said about the back-end.

Do not unroll loops partially.

Partially unrolling a loop is functionally equivalent to a loop that is not unrolled surrounding a loop that is completely unrolled. Although the implementations should be identical, Vivado HLS often has more difficulty recognizing parallelism in a partially unrolled loop than in a fully unrolled one.
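
As an illustration with a hypothetical example, the two forms below should produce identical hardware, but in my experience the second one is handled better by the tool:

// Partially unrolling by a factor of 8:
void Increment(int A[256])
{
  for (int i = 0; i < 256; i++)
  {
#pragma HLS unroll factor=8
    A[i] += 1;
  }
}

// The equivalent nested form with a completely unrolled inner loop:
void IncrementNested(int A[256])
{
  for (int i = 0; i < 256; i += 8)
  {
    for (int j = 0; j < 8; j++)
    {
#pragma HLS unroll
      A[i + j] += 1;
    }
  }
}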

Use wide data types whenever possible.

Although a hardware implementation may be cleaner when you use smaller data types, the compile time of Vivado HLS increases with the number of variables in the application. A pattern that I frequently encounter is that data is loaded from an array, processed, and stored in another array. To increase parallelism, one could partition the arrays and unroll the loop. However, the more the arrays are partitioned, the more temporary variables must be created and scheduled, which takes time. To reduce the burden on the compiler, I usually decompose the data-processing code hierarchically. The innermost function processes a portion of the data; inside that function I need to access the data at a fine granularity, so I use arrays with small data types. Outside the function, I don’t need the same granularity, so I bundle the data into a wider data type before passing it out. If the compiler still needs too much time, I decompose the data in multiple steps. A drawback of this approach is that it requires more implementation time.
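
A hypothetical sketch of the bundling step: the inner function still accesses the data at a fine granularity, but the data enters and leaves it as a single 512-bit word, so the caller only has to deal with one variable instead of sixteen.

#include <ap_int.h>

void ProcessBlock(ap_uint<512> & Word)
{
  int Elements[16];
#pragma HLS ARRAY_PARTITION variable=Elements complete dim=1
  for (int i = 0; i < 16; i++)
  {
#pragma HLS unroll
    Elements[i] = Word.range(32 * i + 31, 32 * i);   // unpack
    Elements[i] += 1;                                // fine-grained processing
    Word.range(32 * i + 31, 32 * i) = Elements[i];   // pack again
  }
}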

Loops can reduce compile time.

While I was working on a sorting network, I found out that introducing a loop can reduce the compilation time. Let me illustrate this with an example:

void layer1(int x[128])
{
#pragma HLS inline
  func(x[0], x[1]);
  func(x[2], x[3]);
  func(x[4], x[5]);
  ...
}

void layer2(int x[128])
{
#pragma HLS inline
  func(x[0], x[64]);
  func(x[1], x[65]);
  func(x[2], x[66]);
  ...
}

...

void top(int x[128])
{
#pragma HLS array_partition variable=x complete dim=0
#pragma HLS allocation instances=func limit=32
  layer1(x);
  layer2(x);
  layer3(x);
  ...
}

In top, I am trying to process an array in 32 sequential layers. I will call top only once in a blue moon and I want to save resources, so I don’t want to pipeline top. The layers are built up from a generic function called func. To reduce resource consumption, I’d like to share func hardware instances between layers, and, if necessary to meet the resource constraint, within layers as well. When you try to compile code like this, you will see that it takes half an hour or so to compile. All layers are inlined, so top ends up with a large number (16384) of calls to func. Although we can easily see that the layers depend on each other and therefore need to be processed sequentially, the compiler does not take that constraint into account. It throws the func calls from all layers on one big pile and starts scheduling and binding. I found that I can make the compiler aware of the sequential nature of the code with the following construct, without sacrificing the sharing opportunities that I care about:

void top(int x[128])
{
#pragma HLS array_partition variable=x complete dim=0
#pragma HLS allocation instances=func limit=32
  for (int i = 0; i < 32; i++)
  {
    if (i == 0)
      layer1(x);
    else if (i == 1)
      layer2(x);
    else if (i == 2)
      ...
    else
      layer32(x);
  }
}

This code builds in a matter of minutes.

4 Replies to “Vivado HLS”

  1. First of all, thank you very much for your practical tips on using Vivado HLS; this list has helped me quite a bit.
    I have a question regarding your point “Integrate array initialization into existing loops”. Could you give a short code example of how you do this exactly? I tried it in HLS but could not get the results I hoped for.

    Thanks again and best regards from Munich
    Eyke

  2. I appreciate all of your notes on HLS. They have been very helpful to me.
    Have you ever had any issues calling the same function multiple times? I have a project that works on small problems, but when the problem gets larger, meaning it has to call the functions more often, it hangs.

  3. I too do daily battle with (now) Vitis HLS. I’ve used some of the optimizations you mention, but there are some new ones here that I’ll find a use for. Will check out your other posts.
