I designed a custom motherboard that runs Linux for an optical telecommunication application. Board bring-up was successful and the design was verified. Once the DVT build started, however, there was a lot of fallout on the production line: many boards would not boot Linux. This post describes my debug methodology and the resolution I arrived at.
Before that, let us see how the boot process works on this system (refer to the simplified block diagram). In simplified terms, on any system that runs an operating system (OS), the first thing that happens after power-up is that the SoC/processor handshakes with the DRAM. The SoC then reads the OS image from flash memory or a hard disk, copies it into DRAM, and starts booting by executing the OS image from DRAM.
First, I checked that power delivery, voltage levels, and power sequencing to the SoC, DDR3, and flash memory were good.
Verified that the PLL reference clocks and reset lines to the SoC, DDR3, and flash memory were within the timing specifications in the SoC datasheet.
Confirmed that reads and writes to the flash memory worked and that the OS image was not corrupted.
Probed and verified signal integrity (SI), including setup time, rise time, and overshoot/undershoot, between the SoC, flash memory, and DDR3.
Reviewed the DDR3 PCB layout once again to verify that the DQS and DQ lines are trace-length matched per the datasheet recommendation.
Probed and verified the UART connection from the SoC to the monitor, because if this link is broken no boot logs are displayed.
Probed the DDR3 clock and data traces and looked at the eye diagram, which met the datasheet specification. However, some nets that were routed as stripline with back-drilled vias could not be probed.
Probing these clock and data nets was necessary even though the design had already been verified, because imperfections in PCBA manufacturing can cause reflections and impedance mismatches that degrade signal integrity.
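To give a feel for why an impedance mismatch matters, here is a minimal sketch that computes the voltage reflection coefficient at a discontinuity using the standard transmission-line relation, gamma = (Z_load - Z_line) / (Z_load + Z_line). The impedance values are hypothetical and chosen only for illustration; they are not measurements from this board.

```python
def reflection_coefficient(z_load_ohms: float, z_line_ohms: float) -> float:
    """Voltage reflection coefficient where a line of impedance z_line_ohms
    meets a segment (or load) of impedance z_load_ohms."""
    return (z_load_ohms - z_line_ohms) / (z_load_ohms + z_line_ohms)


# Hypothetical numbers: a trace designed for 40 ohms, with a segment whose
# impedance drifted during fabrication or assembly.
NOMINAL_OHMS = 40.0

for actual_ohms in (40.0, 44.0, 50.0):
    gamma = reflection_coefficient(actual_ohms, NOMINAL_OHMS)
    print(
        f"{actual_ohms:.0f} ohm segment on a {NOMINAL_OHMS:.0f} ohm line: "
        f"{abs(gamma) * 100:.1f}% of the incident voltage is reflected"
    )
```

With these assumed numbers, a segment that comes out at 50 ohms instead of 40 reflects roughly 11% of the incident voltage, which is the kind of degradation that eats into eye-diagram margin.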
Connected JTAG to the SoC and dumped all of its status registers after it failed to boot.
Decoded those register values against the SoC register reference manual using my own Python script (to automate the tedious decoding work).
Reviewing the results, it looked like the boot was failing at the first-stage boot loader (FSBL) stage.
I then stepped through the FSBL code over JTAG using the Xilinx SDK, loading it into the on-chip memory of the Zynq SoC.
It looked like the SoC failed to handshake with the DRAM because the DRAM read/write leveling training sequence failed. This could happen because of a bad DDR3 part or poor solder attachment of the DRAM.
So I suspected a PCBA manufacturing process issue and had the Zynq SoC and DDR3 X-rayed, but the images did not reveal much.
Since the SoC-to-DRAM read/write training was failing, either the SoC or the DDR3 (or both) might not be properly attached. Perhaps some BGA balls did not make good contact with the board.
When the DDR3 was re-balled and reattached, the problem went away and the SoC booted successfully.
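The register decoding step in the sequence above lends itself to automation. Below is a minimal sketch of what such a decoder could look like. The register names, bit-field layouts, dump format, and values are all hypothetical placeholders; they are not taken from the Zynq register reference manual or from the original script, so substitute the real definitions before using anything like this.

```python
from typing import Dict, List, NamedTuple


class Field(NamedTuple):
    name: str
    msb: int  # most significant bit of the field (inclusive)
    lsb: int  # least significant bit of the field (inclusive)


# HYPOTHETICAL register map: replace with fields from the real reference manual.
REGISTER_MAP: Dict[str, List[Field]] = {
    "BOOT_STATUS": [
        Field("error_code", 15, 0),
        Field("watchdog_reset", 16, 16),
        Field("power_on_reset", 22, 22),
    ],
    "DDR_STATUS": [
        Field("operating_mode", 2, 0),
        Field("training_done", 4, 4),
    ],
}


def extract(value: int, msb: int, lsb: int) -> int:
    """Return bits [msb:lsb] of value, right-aligned."""
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


def decode(reg_name: str, value: int) -> Dict[str, int]:
    """Split a raw register value into its named fields."""
    return {f.name: extract(value, f.msb, f.lsb) for f in REGISTER_MAP[reg_name]}


def parse_dump(text: str) -> Dict[str, int]:
    """Parse lines of the assumed form '<REGISTER_NAME> <hex value>'."""
    registers: Dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, raw = line.split()
        registers[name] = int(raw, 16)
    return registers


# Made-up dump contents, standing in for what a JTAG session might save.
EXAMPLE_DUMP = """
# name        value
BOOT_STATUS   0x0040200A
DDR_STATUS    0x00000001
"""

for reg_name, raw_value in parse_dump(EXAMPLE_DUMP).items():
    print(f"{reg_name} = 0x{raw_value:08X}")
    for field_name, field_value in decode(reg_name, raw_value).items():
        print(f"    {field_name:<16} = 0x{field_value:X}")
```

Keeping the field definitions in a table separate from the decoding logic means the same script can be reused across boards and registers by editing only the map.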
Manufacturing process issue - After PCBA assembly, the contract manufacturer did some rework on the motherboard near the DDR3 area. They heated that area so much that some of the DDR3 solder balls melted, leaving loose contacts.
The manufacturer improved the board rework process and put a screening step in place after rework.
Proper PCB manufacturing controls should have been followed; perhaps the design group could have performed an audit and witnessed the rework process.