Tuesday, April 8, 2014

PCI Express advanced topics: Part 2

PCIe Enumeration 

Before we go into the details of enumeration, let's use the following nomenclature. 
PCIe port : physical group of transmitter(s)/receiver(s) on the same chip
PCIe lane : one pair of differential transmitter and receiver
PCIe link : collection of lane(s). 
Upstream device : the root complex
Downstream device : end point that implements at least one function
Upstream port : resides on the downstream device, facing the upstream device
Downstream port : resides on the upstream device, facing the downstream device

Figure 1 illustrates this nomenclature.

Figure 1
Now, let's take the example of a single-lane (lane 0) link, with one root complex connected to one PCIe device that implements one function, to illustrate the enumeration process.

Enumeration is the process where the root complex driver discovers the PCI(e) bus topology by traversing through the hierarchy from root complex (which is at the top of the hierarchy). 

The following prerequisites need to be satisfied:
a) The root complex is on bus 0 and its configuration space is accessible from the embedded CPU (via an API or another known mechanism)
b) The EP device that implements a function must have a Vendor ID that is not all 1s (0xFFFF)
c) Link training has completed, the link is established on both sides, and the data link layer is in the DL_Active state.

The steps below outline a simplified enumeration procedure.
1) The RC driver steps through the hierarchy to find all the connected downstream devices by looping over the bus numbers. E.g. read the vendor ID of the downstream device at bus 1, device 0, function 0 using the API to access the configuration space with the address set to 

CFG_ADDR = base_addr | 1<<20.

If a valid function is implemented on this downstream device, the read must return a valid vendor ID. This is the first step in discovering the device.
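
The address calculation in step 1 can be sketched in Python. This is a minimal sketch that assumes the memory-mapped (ECAM) configuration layout, which is consistent with the 1<<20 used for bus 1 above. The function names and the ECAM_BASE value are hypothetical and not taken from any particular root complex driver.

```python
def cfg_addr(base_addr, bus, device=0, function=0, offset=0):
    """Return the memory-mapped address of a configuration register.

    Assumed ECAM layout: bus in bits [27:20], device in [19:15],
    function in [14:12], register offset in [11:0].
    """
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    if not 0 <= offset < 4096:
        raise ValueError("register offset out of range")
    return base_addr | (bus << 20) | (device << 15) | (function << 12) | offset


def function_present(vendor_id):
    """A vendor ID of all 1s means no function responded."""
    return vendor_id != 0xFFFF


ECAM_BASE = 0xE0000000  # hypothetical base address, platform specific

# Vendor ID register (offset 0) of bus 1, device 0, function 0
print(hex(cfg_addr(ECAM_BASE, bus=1)))
```

A real driver would read 16 bits at this address and pass the result to function_present() before going any further with the device.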

2) If the device is found, it needs to be assigned a memory window so that it is allocated space in the system memory map. This is done by scanning the BARs, checking the size each BAR requests, and allocating accordingly. The BAR size of a downstream device can be found by writing all 1s to the respective BAR register and reading it back: the address bits that read back as 0 give the size. The root complex driver then only needs to write the base address of the allocated region into the BAR.

e.g. EP device BAR0 reading back 0xFF000000 after writing all 1s implies that the memory size requested by this EP device is 16MB and that it is 32-bit addressable. Therefore the RC driver can assign any 16MB-aligned base address in the 4GB space to the requested 16MB memory 
e.g. BAR0 = 0x1C000000 as the base address
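
The BAR sizing rule in step 2 can be sketched as follows. This is a minimal sketch for 32-bit memory BARs only; the function names are illustrative, and the readback value 0xFF000000 is the 16MB case from the example above.

```python
def decode_mem_bar(readback):
    """Decode a memory BAR value read back after writing 0xFFFFFFFF.

    Returns (size_in_bytes, is_32bit, prefetchable).
    """
    if readback & 0x1:
        raise ValueError("bit 0 set: this is an I/O BAR, not a memory BAR")
    is_32bit = ((readback >> 1) & 0x3) == 0   # type field: 0 = 32-bit, 2 = 64-bit
    prefetchable = bool(readback & 0x8)
    mask = readback & 0xFFFFFFF0              # drop the 4 type/flag bits
    size = (~mask + 1) & 0xFFFFFFFF           # two's complement of the mask
    return size, is_32bit, prefetchable


def is_valid_base(base, size):
    """A BAR base address must be aligned to the BAR size."""
    return size > 0 and base % size == 0


size, is_32bit, prefetchable = decode_mem_bar(0xFF000000)
print(size // (1024 * 1024), "MB, 32-bit:", is_32bit)
print("0x1C000000 usable as base:", is_valid_base(0x1C000000, size))
```

A readback of 0 yields a size of 0, which means the BAR is not implemented.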

3) After the base address is assigned, the EP device needs to be enabled as a bus master so that it can start sending memory TLP requests. 
Set the command register to enable bus master, memory space and I/O space.

CMD_REGISTER |= (1<<2) | (1<<1) | (1<<0);
The above steps are the basic functions required for any PCIe-based application. The end point now has to manage the memory allocated by the system as a resource for its application-specific implementation.
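
As a small illustration of step 3, the command register update is a read-modify-write that sets three bits and leaves the rest untouched. The constant and function names below are illustrative; the bit positions are the ones used in the expression above.

```python
CMD_IO_SPACE = 1 << 0     # respond to I/O space accesses
CMD_MEM_SPACE = 1 << 1    # respond to memory space accesses
CMD_BUS_MASTER = 1 << 2   # allow the device to issue memory requests


def enable_device(cmd_value):
    """Return the 16-bit command register value with the three enable bits set."""
    return (cmd_value | CMD_IO_SPACE | CMD_MEM_SPACE | CMD_BUS_MASTER) & 0xFFFF


print(hex(enable_device(0x0000)))
```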

Wednesday, September 25, 2013

PCI Express advanced topics: Part I

Having worked through several design phases of PCIe (physical layer design, integrating controllers and PHYs in SoC platforms, lab bring-up, and characterization in the high-speed lab), I attempt to share some of my knowledge in this post.

PCIe Throughput calculation

Take PCIe Gen1 as an example. The maximum physical bandwidth of a lane is 2.5Gbps. Because the data is 8b/10b encoded, the effective throughput is 2.0Gbps. 

Assuming a typical trade off for the band width utilization and the retry buffer size, ACK/NACK and flow control updates are scheduled to minimize the negative impact of retry.

e.g. Case 1: with a 128-byte payload, the TLP size will be 128 bytes (data) + header (12-16 bytes) + ECRC (4 bytes) + sequence number (2 bytes) + LCRC (4 bytes) + STP (1 byte) + END (1 byte) = 152 bytes with a 12-byte header. That is 24 bytes of overhead (18.75% of the payload); in other words we can only use 128/152 = 84.2% of the effective link bandwidth (2Gbps), which is 1.68Gbps.

Now consider the other overhead on the link: the transmission of flow control updates, ACK/NACK and SKIP ordered sets.

For instance, with 1 ACK + 1 FC per 4 data packets of 128 bytes payload, the link carries an additional 8 bytes (FC DLLP) and 8 bytes (ACK DLLP) per 4 x 152 = 608 bytes, which is roughly another 3% overhead. The effective bandwidth is then about 0.974*1.68 = 1.64Gbps.
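
The throughput estimate can be reproduced with a few lines of Python by dividing the payload bytes by all the bytes that go on the wire. This is a minimal sketch that assumes a 12-byte header, ECRC present, one ACK DLLP and one FC DLLP per group of TLPs, and no SKIP ordered sets; the function and parameter names are illustrative.

```python
def tlp_bytes(payload, header=12, ecrc=4, seq=2, lcrc=4, framing=2):
    """Bytes on the wire for one TLP (framing = STP + END)."""
    return payload + header + ecrc + seq + lcrc + framing


def effective_gbps(payload, link_gbps=2.0, tlps_per_group=4, dllp_bytes=8):
    """Payload throughput with one ACK and one FC DLLP per group of TLPs.

    link_gbps is the bandwidth left after 8b/10b encoding.
    """
    wire = tlps_per_group * tlp_bytes(payload) + 2 * dllp_bytes
    return link_gbps * tlps_per_group * payload / wire


for payload in (64, 128, 256):
    print(payload, "bytes payload:", round(effective_gbps(payload), 2), "Gbps")
```

Running it for several payload sizes shows how quickly the fixed per-packet overhead is amortized as the payload grows.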

Tuesday, June 26, 2012

Modeling delays in verilog

Verilog supports three kinds of delay modeling
  • Distributed delay modelling - As the name indicates, the module delay consists of the delays of its sub-components.
          Verilog code example:

           module AND_OR (Y, A, B, C, D);
              output Y;
              input A, B, C, D;
              and #1.0 I0(outA, A, B);
              //  #(1.0:1.1:1.2, 0.9:1.0:1.1) min:typ:max triplets for rise, fall delays
              nor #2.0 I1(Y, C, D, outA);
           endmodule

The total delay in the above example is 3 time units from pins A, B to output Y. Delays can be specified more precisely using min:typ:max triplets and separate values for rise and fall delays. For digital RTL simulations, the simulator takes these delays into account when the above module is compiled with the +delay_mode_distributed simulator option, or when a dummy Verilog module (the first module in the compile order) carries the `delay_mode_distributed Verilog directive.

  • Lumped delay - The difference from the distributed delay mode is that only the sub-component at the output of the module carries the total delay 
       module AND_OR (Y, A, B, C);
              output Y;
              input A, B, C;
              and I0(outA, A, B);
              or #3.0 I1(Y, C, outA);
       endmodule
The same Verilog directive/simulator compile switch as for the distributed delay mode is used in this case.

  • Pin-to-pin/path delays - This approach treats the inside of the module as a black box and uses the ports to specify the delay from the different input/inout ports to the output/inout ports. Verilog has a special construct, 'specify', for path delay modeling. SDF can be used for back-annotating the delays if the SDF constructs match the timing model described in the specify block. This allows back-annotated simulation of the post-layout netlist
 module AND_OR (Y, A, B, C);
              output Y;
              input A, B, C;
              and I0(outA, A, B);
              or I1(Y, C, outA);
  specify
    // delay parameters
    specparam
      tplhAY = 1.0,
      tphlAY = 1.0,
      tplhBY = 1.0,
      tphlBY = 1.0,
      tplhCY = 1.0,
      tphlCY = 1.0;

    // path delays ( using conditional delays for C to Y)
      (A *> Y) = (tplhAY, tphlAY);
      (B *> Y) = (tplhBY, tphlBY);
    if (A == 1'b1 && B == 1'b0 )
       (C *> Y) = (tplhCY, tphlCY);
    if (A == 1'b0 && B == 1'b1 )
       (C *> Y) = (tplhCY, tphlCY);
    if (A == 1'b0 && B == 1'b0 )
       (C *> Y) = (tplhCY, tphlCY);
  endspecify

endmodule
 Example SDF construct
(CELL
  (CELLTYPE "AND_OR")
  (INSTANCE top_tb/DUT/i_a/i_b)
  (DELAY
    (ABSOLUTE
    (IOPATH A Y (0.09168:0.09168:0.09168) (0.10759:0.10759:0.10759))
    (IOPATH B Y (0.10432:0.10432:0.10432) (0.15321:0.15321:0.15321))
    (COND A == 1'b1 && B == 1'b0 (IOPATH C Y (0.04601:0.04601:0.04601) (0.05479:0.05479:0.05479)))
    (COND A == 1'b0 && B == 1'b1 (IOPATH C Y (0.04602:0.04602:0.04602) (0.05478:0.05478:0.05478)))
    (COND A == 1'b0 && B == 1'b0 (IOPATH C Y (0.04610:0.04610:0.04610) (0.05423:0.05423:0.05423)))
    )
  )
)  


Friday, June 22, 2012

Modelsim TCL

Long simulation runs in the verification of data path engines (filters, FFT, IFFT etc.) are mostly due to
  • memory filling before the HW processing engine starts
  • on the fly preparing the new memory contents and writing to the memories
  • on the fly checking the memory contents for the expected HW behavior
Once the memory read/write behavior is fully verified, short-cutting the above tasks will reduce the simulation run time significantly. One way is to use back-door initialization from a text file and to save the memory contents to a text file for offline processing. Modelsim/NCSIM TCL can be used as a quick and efficient way to do this.

E.g. TCL script

# use the native "mem load" command to fill the contents of the memory from a file
proc mem_load_proc { file mem } {
    if { [file exists $file] } {
        mem load -i $file $mem
    }
}
# use the native "mem display" command to save the contents from the memory to a file
proc mem_save_proc { file mem } {
    set f [open $file w+]
    puts $f [mem display -format mti -dataradix hexadecimal -addressradix hexadecimal -wordsperline 8 $mem]
    flush $f
    close $f
}
# concurrent "when" statement triggered by the conditional check on the signal value
when "/tb_top/hier1/sig_a = '1'" {
    # save the current memory contents
    mem_save_proc $save_file $mem
    # generate new data, in this case using DataGen.pl
    exec DataGen.pl
    # load the new memory contents
    mem_load_proc $load_file $mem
}


Tuesday, June 19, 2012

ASIC ECO flow

ECO: Engineering change order is a commonly used flow to introduce logic changes that fix bugs found at a very late stage of the ASIC implementation flow, or that fix silicon bug(s) using a minimum number of metal layers to limit the cost. A few years back I presented a methodology flow and some proven methods to implement ECOs in a cost-effective way (a Synopsys SolvNet account is needed to access it!)
 http://www.synopsys.com/news/pubs/snug/singapore2008/06_UserPaper_Infineon.pdf


Asynchronous FIFO depth

Elastic FIFOs/asynchronous FIFOs are used to compensate for any frequency drift between two clock domains. Although the system design has to ensure by protocol that long-term wander is compensated, these elastic FIFOs are needed to handle the wander between the compensation periods. For instance, the PCIe protocol requires SKIP ordered sets to be transmitted at regular intervals to compensate for the clock frequency drift. Similarly, SGMII and 1000BASE-X use IDLE ordered sets to compensate.

Example of the elastic buffer depth calculation:
Assume the protocol allows a frequency difference of 600ppm.
frequency difference = 600/1e6 = .0006
1 UI slip occurs every 1/.0006 = 1666.67 bits
Assume the maximum packet length (in bytes) transmitted before the compensation starts = 5660 bytes
packet size (in bits) = 5660 * 8 = 45280 bits
total number of bit slips = 45280/1666.67 = 28 bits (rounded up) = 4 bytes (rounded up)

Thus an elastic buffer depth of +/-4 (8 in total) is required to compensate for the drift between the read and write pointers. However, it is good practice to have additional depth to cover the boundary conditions.
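
The same calculation as a small Python function, so that it can be repeated for other ppm limits and packet lengths. This is a minimal sketch; the function name is illustrative, and integer ceiling division is used to avoid floating-point rounding surprises.

```python
def elastic_buffer_bytes(ppm, max_packet_bytes):
    """Bytes of drift (one direction) accumulated over one maximum-length packet."""
    packet_bits = max_packet_bytes * 8
    # ceil(packet_bits * ppm / 1e6) using integer arithmetic
    slipped_bits = -(-packet_bits * ppm // 10**6)
    # round the slipped bits up to whole bytes
    return -(-slipped_bits // 8)


one_side = elastic_buffer_bytes(600, 5660)
print("+/-", one_side, "bytes, total depth", 2 * one_side)
```

Any extra margin for the boundary conditions would be added on top of the total depth this returns.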

ZOC, REXX, Python

ZOC is a powerful terminal emulator and I use it extensively for communicating with evaluation boards via the serial port (COM port). Its user-friendly interface and its support for macro programming (REXX) make it easy to use for silicon bring-up and automation.
For more information on the ZOC follow the webpage :
http://www.emtec.com/zoc/

REXX (Restructured Extended Executor) is a free form language with only one data type "character string (classic REXX)". It has very small instruction set.
For more details on the REXX programming, refer to the below web links
http://www.rexxinfo.org/

Combining ZOC and REXX with Python (PyVISA) makes it even more suitable for automation. Instruments (like power supplies, temperature chambers, oscilloscopes and other measurement equipment) that can be controlled over the GPIB interface can be driven from the Python VISA module.

E.g. Python module to control E3631A HP (Agilent) power supply via GPIB

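# -- gpib.py module ---
# Uses the legacy PyVISA API (visa.instrument); newer PyVISA releases
# replace it with pyvisa.ResourceManager().open_resource().
import visa

def gpib_init(ins="GPIB0::5"):
    return visa.instrument(ins)

def clear(ins):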
    ins.write("*CLS")

def reset(ins):
    ins.write("*RST")

def set6v(ins,voltage=1.000,current=1.000):
    ins.write("APPLY P6V, {voltage}, {current}".format(voltage=voltage,current=current))

def set25v(ins,voltage=1.000,current=1.000):
    ins.write("APPLY P25V, {voltage}, {current}".format(voltage=voltage,current=current))

def poweron(ins):
    ins.write("OUTPUT:STATE ON")

def poweroff(ins):
    ins.write("OUTPUT:STATE OFF")

#------ end


# E3631A_on.py
import gpib
import sys
ins = gpib.gpib_init(sys.argv[1]) # check the VISA address

## start the instrument in Power off, default mode
gpib.set25v(ins,sys.argv[3],1.000)
gpib.set6v(ins,sys.argv[2],1.000)
# Switch on the power supply
gpib.poweron(ins)

# E3631A_off.py

import sys
import gpib
## initialize the GPIB
ins = gpib.gpib_init(sys.argv[1]) # check the VISA address from Agilent IO Library
## Power off the instrument
gpib.clear(ins)
gpib.reset(ins)
gpib.poweroff(ins)


# Simple Automation example script in REXX + ZOC commands (test.zrx)

CALL ZocLogging 1
CALL ZocLogname  "Test.log"

/* Power up the device */
vdd6v     = 3.3000
vdd25v     = 12.000
gpib_ad = 'GPIB0::1'


/* Run test 10 times with power cycle */
DO 10
    CALL E3631A_off gpib_ad
    ZocDelay 5
    CALL E3631A_on gpib_ad, vdd6v, vdd25v
    ZocDelay 12
    CALL test
END
return

/* Actual test commands over the serial port */
test:
    ZocSend "command_1^M"
    ZocDelay 2
    ZocSend "command_2^M"
    ZocDelay 2
return

E3631A_on:
   gpib_ad = arg(1)
   v6      = arg(2)
   v25     = arg(3)
   CALL ZocExec "C:\Python27\python.exe C:\Python27\scripts\E3631A_on.py" gpib_ad v6 v25, 1
   ZocDelay 6
return

E3631A_off:
    gpib_ad = arg(1)
    CALL ZocExec "C:\Python27\python.exe C:\Python27\scripts\E3631A_off.py" gpib_ad, 1
    ZocDelay 2
return