Transposing matrices is essential in various fields, including signal processing and machine learning, making efficient hardware implementations for real-time processing highly desirable. The transposition of general matrices in a stream using memory is mathematically formalized. A novel algorithm is developed to compute the necessary sequence of memory addresses, with a memory depth that is parametrizable and reducible to its theoretical minimum. For the subclass of matrices in which one dimension is a power of two and the other is two, the recursions in this algorithm are shown to exhibit logarithmic time complexity when minimal memory is used. It is then shown how efficient transposition circuits can be built for these matrices, achieving minimal latency and full throughput. The proposed solution and recently published designs have been simulated for an ASIC implementation using the 45-nm NanGate Open Cell Library and compared for all suitable matrix dimensions of up to $2\times 512$. For the considered configurations, the results indicate a memory saving of up to 33.3% for the proposed approach over the state of the art. Overall area savings of up to 24.9% were achieved, making the new scheme very favorable for deployments with restricted area. Further advantages are an average frequency increase of 13.7% and a power consumption lowered by up to 29.6%.
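The streaming transposition the abstract refers to can be illustrated with the classic in-place "corner turner" scheme: successive $p\times q$ matrices flow through a single memory of $N = pq$ words, each incoming sample overwriting the slot just read, and the read-address sequence for pass $k$ follows $a_k(j) = j\,q^k \bmod (N-1)$ with $N-1$ as a fixed point. This is a hedged sketch of that well-known textbook scheme for intuition only; it is not the paper's algorithm, which additionally reduces the memory depth below $N$ toward its theoretical minimum.

```python
def stream_transpose(matrices, p, q):
    """Simulate in-place streaming transposition of successive p x q
    matrices through one memory of N = p*q words (classic corner-turner
    sketch; the paper's algorithm uses less memory than this)."""
    n = p * q
    mem = [None] * n
    outputs = []

    def addr(j, k):
        # Read/write address at step j of pass k: j*q^k mod (N-1),
        # with the last slot N-1 as a fixed point of the permutation.
        return n - 1 if j == n - 1 else (j * pow(q, k, n - 1)) % (n - 1)

    for k, mat in enumerate(matrices):
        flat = [x for row in mat for x in row]  # stream in row-major order
        block = []
        for j in range(n):
            a = addr(j, k)
            block.append(mem[a])  # previous matrix emerges transposed
            mem[a] = flat[j]      # incoming sample reuses the freed slot
        if k > 0:
            outputs.append(block)

    # One flush pass emits the transpose of the last matrix.
    k = len(matrices)
    outputs.append([mem[addr(j, k)] for j in range(n)])
    return outputs


# Usage: two consecutive 2x3 matrices share the same 6-word memory.
m0 = [[0, 1, 2], [3, 4, 5]]
m1 = [[10, 11, 12], [13, 14, 15]]
print(stream_transpose([m0, m1], 2, 3))
# → [[0, 3, 1, 4, 2, 5], [10, 13, 11, 14, 12, 15]]
```

The key point is that the address permutation composes with itself on every pass, so the same physical memory serves back-to-back matrices at full throughput; the paper's contribution is computing such address sequences for a memory depth parametrizable down to the theoretical minimum, rather than the full $N$ words used here.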
Published in: IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Volume 34, Issue 2, pp. 530-541