Education
                  
                 | 
               
              
                | 
                   Cornell University 
                   
                    M.S./Ph.D. in Electrical and Computer Engineering  
                    Advisor: Prof. Zhiru Zhang  
                     Aug 2021 — Present 
                   
                 | 
                
                   
                 | 
               
              
                | 
                   Sun Yat-sen University 
                   
                    B.Eng. in Telecommunication Engineering  
                    Advisor: Prof. Xiang Chen  
                    Outstanding thesis 
                     Aug 2016 — Jun 2020 
                   
                 | 
                
                   
                 | 
               
            
           
          
          
            
              
                
                  
                     
                    Work Experience
                  
                 | 
               
              
                | 
                   NVIDIA Research 
                   
                    PhD Research Intern  
                    Design Automation Research  
                    Mentor/Manager: Anthony Agnesina, Chenhui Deng, Mark Ren  
                     May 2024 — Aug 2024  
                     Jan 2025 — May 2025 
                   
                 | 
                
                   
                 | 
               
              
                | 
                   Advanced Micro Devices 
                   
                    Compiler Intern  
                    Advanced Compilers for Distribution and Computation (ACDC)  
                    Chief Technology Organization (CTO)  
                    Manager: Stephen Neuendorffer  
                     May 2023 — Aug 2023 
                   
                 | 
                
                   
                 | 
               
              
                | 
                   Intel Labs 
                   
                    Exempt Tech Employee  
                    Specification and Validation End-to-End (SAVE) Group, SCL/ADR/IL  
                    Manager: Jin Yang, Sunny Zhang  
                     Feb 2021 — Aug 2021 
                   
                 | 
                
                   
                 | 
               
              
                | 
                   Tsinghua University 
                   
                    Research Assistant  
                    Nanoscale Integrated Circuits and System Lab 
                      (NICS-EFC)   
                    Advisor: Prof. Yu Wang  
                     Nov 2019 — Aug 2021   
                   
                 | 
                
                 
                 | 
               
              
                | 
                   The University of Waterloo 
                   
                    MITACS
                    Research Intern  
                    WatCAG  
                    Advisor: Prof. Nachiket Kapre  
                     Jul 2019 — Oct 2019 
                   
                 | 
                
                 
                 | 
               
            
           
          
          
          
          
          
            
              
                
                  
                     
                    Research & Publication
                  
                  
                    My research spans hardware design automation and compute‑in‑SRAM accelerators, accelerator design and programming languages, and energy‑efficient machine‑learning systems.
                   
                 | 
               
            
           
          
            
              
                
                   
                 | 
                
                   
                     Characterizing and Optimizing Realistic Workloads on a Commercial Compute-in-SRAM Device 
                   
                  
                    Niansong Zhang, Wenbo Zhu, Courtney Golden, Dan Ilan, Hongzheng Chen, Christopher Batten, Zhiru Zhang
                   
                  MICRO 2025 | To appear
                  
                    This paper characterizes a commercial compute-in-SRAM device using realistic workloads, proposes key data management optimizations, and demonstrates that it can match GPU-level performance on retrieval-augmented generation tasks while achieving over 46× energy savings.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                     ASPEN: LLM-Guided E-Graph Rewriting for RTL Datapath Optimization 
                   
                  
                    Niansong Zhang, Chenhui Deng, Johannes Maximilian Kuehn, Chia-Tung Ho, Cunxi Yu, Zhiru Zhang, Haoxing Ren
                   
                  MLCAD 2025 | To appear
                  
                    Why choose between smart and sound? ASPEN uses LLM-guided e-graph rewriting with real PPA feedback for RTL optimization. With 16.51% area and 6.65% delay improvements over prior methods, ASPEN shows you can have both—and it's fully automated.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                     Cypress: VLSI-Inspired PCB Placement with GPU Acceleration 
                   
                  
                    Niansong Zhang, Anthony Agnesina, Noor Shbat, Yuval Leader, Zhiru Zhang, Haoxing Ren
                   
                  ISPD 2025 | paper | code
                  
                    🏆 Best Paper Award
                   
                  
                    We present Cypress, a GPU‑accelerated, VLSI‑inspired PCB placer that boosts routability by up to 5.9×, cuts track length by 19.7×, and runs up to 492× faster on new realistic benchmarks.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                     ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines 
                   
                  
                    Jinming Zhuang*, Shaojie Xiang*, Hongzheng Chen, Niansong Zhang, Zhuoping Yang, Tony Mao, Zhiru Zhang, Peipei Zhou
                   
                  
                    * Equal Contribution
                   
                  FPGA 2025 | paper | code
                  
                    🏅 Best Paper Nominee
                   
                  
                    We propose ARIES, a unified MLIR‑based compilation flow that abstracts task, tile, and instruction‑level parallelism across AMD AI Engine arrays (and optional FPGA fabric), boosting Versal VCK190 GEMM throughput by up to 1.6× over prior work.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                     Allo: A Programming Model for Composable Accelerator Design 
                   
                  
                    Hongzheng Chen*, Niansong Zhang*, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, Zhiru Zhang
                   
                  
                    * Equal Contribution
                   
                  PLDI 2024 | paper | code
                  
                    Specialized hardware accelerators are vital for performance improvements, but current design languages are inadequate for complex accelerators. Allo, a new composable programming model, decouples hardware customizations from algorithms and outperforms existing languages in performance and productivity.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                     Formal Verification of Source-to-Source Transformations for HLS 
                   
                  
                    Louis-Noël Pouchet, Emily Tucker, Niansong Zhang, Hongzheng Chen, Debjit Pal, Gabriel Rodríguez, Zhiru Zhang
                   
                  FPGA 2024 | paper | code
                  
                    🏆 Best Paper Award
                   
                  
                    We target the problem of efficiently checking the semantic equivalence of two programs written in C/C++ as a means of 
                    ensuring the correctness of the description provided to the HLS toolchain, by proving that an optimized code version fully 
                    preserves the semantics of the unoptimized one.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                     Understanding the Potential of FPGA-based Spatial Acceleration for Large Language Model Inference 
                   
                  
                    Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang
                   
                  ACM Transactions on Reconfigurable Technology and Systems (TRETS)  
                  Vol. 18, No. 1, Article 5.
                  
                    We design a spatial FPGA accelerator for LLM inference that assigns each operator its own hardware block and connects them with on‑chip dataflow to cut memory traffic and latency. An analytical model guides parallelization and scaling, showing when FPGAs can outpace GPUs.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                     Supporting a Virtual Vector Instruction Set on a Commercial Compute-in-SRAM Accelerator 
                   
                  
                    Courtney Golden, Dan Ilan, Caroline Huang, Niansong Zhang, Zhiru Zhang, Christopher Batten
                   
                  IEEE Computer Architecture Letters |
                  paper
                  
                    We implement a virtual vector instruction set on a commercial Compute-in-SRAM device, 
                    and perform detailed instruction microbenchmarking to identify performance benefits and overheads. 
                   
                 | 
               
              
                
                   
                 | 
                
                   
                     Serving Multi-DNN Workloads on FPGAs: a Coordinated Architecture, Scheduling, and Mapping Perspective
                   
                  
                    Shulin Zeng, Guohao Dai, Niansong Zhang, Xinhao Yang, Haoyu Zhang, Zhenhua Zhu, Huazhong Yang, Yu Wang
                   
                  IEEE Transactions on Computers |
                  paper
                  
                    🏅 Featured Paper in the May 2023 Issue
                   
                  
                    This paper proposes a Design Space Exploration framework to jointly optimize heterogeneous multi-core architecture, 
                    layer scheduling, and compiler mapping for serving DNN workloads on cloud FPGAs.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                    Accelerator Design with Decoupled Hardware Customizations: Benefits and Challenges
                   
                  
                    Debjit Pal, Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, Jeremy Casas, Pasquale Cocchini, Zhenkun Yang, Jin Yang, Louis-Noël Pouchet, Zhiru Zhang
                   
                  Invited Paper, DAC 2022, 
                  paper
                  
                  
                  
                    We show the advantages of the decoupled programming model and further discuss some of our recent efforts to enable a robust and viable verification solution in the future.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                    CodedVTR: Codebook-Based Sparse Voxel Transformer with Geometric Guidance
                   
                  
                    Tianchen Zhao, Niansong Zhang, Xuefei Ning, He Wang, Li Yi, Yu Wang
                   
                  CVPR 2022, 
                  paper |
                  
                  website |
                  slides |
                  poster |
                  video
                  
                    We propose a flexible 3D Transformer on sparse voxels to address the transformer's generalization issue.
                    CodedVTR (Codebook-based Voxel TRansformer)
                    decomposes the attention space into linear combinations of learnable prototypes to regularize attention
                    learning.
                    We also propose geometry-aware self-attention to guide training with geometric patterns and voxel
                    density.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                    HeteroFlow: An Accelerator Programming Model with Decoupled Data Placement for
                      Software-Defined FPGAs
                   
                  
                    Shaojie Xiang, Yi-Hsiang Lai, Yuan Zhou, Hongzheng Chen, Niansong Zhang, Debjit
                    Pal, Zhiru Zhang
                   
                  FPGA 2022, 
                  paper |
                  code
                  
                    We propose an FPGA accelerator programming model that decouples the algorithm specification from
                    optimizations related to
                    orchestrating the placement of data across a customized memory hierarchy.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                    RapidLayout: Fast Hard Block Placement of FPGA-optimized Systolic Arrays using
                      Evolutionary Algorithms
                   
                  
                    Niansong Zhang, Xiang Chen, Nachiket Kapre
                   
                  
                    🏆 Best Paper Award
                   
                  Invited Paper, ACM Transactions on Reconfigurable Technology and Systems (TRETS)  
                  Volume 15, Issue 4, Article No.: 38, pp 1–23
                  
                  
                    We extend the previous work on RapidLayout with cross-SLR routing, placement transfer learning, and
                    placement bootstrapping from a much
                    smaller device to improve runtime and design quality.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                    aw_nas: A Modularized and Extensible NAS Framework
                   
                  
                    Xuefei Ning, Changcheng Tang, Wenshuo Li, Songyi Yang, Tianchen Zhao, Niansong
                      Zhang, Tianyi Lu, Shuang Liang, Huazhong Yang, Yu Wang
                   
                  arXiv Preprint,
                  paper |
                  code
                  
                    We build an open-source Python framework implementing various NAS algorithms in a modularized and
                    extensible manner.
                   
                 | 
               
              
                
                   
                 | 
                
                   
                    RapidLayout: Fast Hard Block Placement of FPGA-optimized Systolic Arrays using
                      Evolutionary Algorithms
                   
                  
                    Niansong Zhang, Xiang Chen, Nachiket Kapre
                   
                  FPL 2020,
                  paper |
                  code
                  
                    🏅 Michal Servit Best Paper Award Nominee
                   
                  
                    We build a fast, high-performance evolutionary placer for FPGA-optimized hard block designs that
                    targets high clock frequencies of 650+ MHz.
                   
                 | 
               
            
           
          
          
            
              
                 
                Workshops & Talks
              
             
            
              
                 
               | 
              
                 
                  An MLIR-based Intermediate Representation for Accelerator Design with Decoupled Customizations
                 
                
                  Hongzheng Chen*, Niansong Zhang*, Shaojie Xiang, Zhiru Zhang
                 
                MLIR Open Design Meeting (08/11/2022)  |
                video |
                slides  |
                website
                 
                 
                CRISP Liaison Meeting (09/28/2022)  |
                news |
                slides |
                website
                 
                 
                
                  We decouple hardware customizations from the algorithm specifications at the IR level to:
                  (1) provide a general platform for high-level DSLs, (2) boost performance and productivity, and (3)
                  make customization verification scalable.
                 
               | 
             
            
              
                 
               | 
              
                 
                  Enabling Fast Deployment and Efficient Scheduling for Multi-Node and Multi-Tenant DNN
                    Accelerators in the Cloud
                 
                
                  Shulin Zeng, Guohao Dai, Niansong Zhang, Yu Wang
                 
                MICRO 2021 ASCMD Workshop,
                paper
                |
                video
                
                  We propose a multi-node and multi-core accelerator architecture and a decoupled compiler for
                  cloud-backed INFerence-as-a-Service (INFaaS).
                 
               | 
             
           
          
          
          
          
          
          
            
              
                
                   
                  Awards and Honors
                
                
                  Best Paper Award at ISPD 2025
                 
                
                  Best Paper Nomination at FPGA 2025
                 
                
                  Best Paper Award at FPGA 2024
                 
                
                  Best Paper Award for ACM TRETS in 2023
                 
                
                  Best Paper Nomination (Michal Servit Award) at FPL 2020
                 
                
                  DAC Young Fellow 2021 & 2023
                 
                
                  Outstanding Bachelor Thesis Award | Sun Yat-sen University
                 
                
                  Mitacs Globalink Research Internship Award | Mitacs, Canada
                 
                
                  First-class Merit Scholarship x2 | Sun Yat-sen University
                 
                
                  Lin and Liu Foundation Scholarship | SEIT, Sun Yat-sen University
                 
               | 
             
           
          
          
            
              
                
                   
                  Patents
                
                
                  Niansong Zhang, Haoxing Ren, Brucek Khailany,
                  "Printed Circuit Board Component Placement." US Patent 24-0963US2, filed in February 2025.
                 
                
                  Niansong Zhang, Songyi Yang, Shun Fu, Xiang Chen,
                  "Industry Profile Geometric Dimension Automatic Measuring Method Based on Computer Vision
                    Imaging." Chinese Patent CN201811539019.8A, filed on December 17, 2018, and issued on April 19, 2019.
                 
                
                  Niansong Zhang (at Novauto Technology), "A Pruning Method and Device of
                    Multi-task Neural Network Models." Chinese Patent 202010805327.1, filed on August 12, 2020.
                 
               | 
             
           
         |