Niansong Zhang

Hi there! I am an MS/PhD student at Cornell University, advised by Prof. Zhiru Zhang. I am interested in Electronic Design Automation, Domain-Specific Languages for hardware design and programming, and hardware accelerators.

Dept Electrical & Computer Engineering
Lab Computer Systems Laboratory
Office 471D Rhodes Hall
Email nz264 [at] cornell [dot] edu

News

Our paper on FPGA-based LLM inference won the TRETS 2026 Best Paper Award. 7/2026
Selected as LAD'25 Fellow at the International Conference on LLM-Aided Design. 12/2025
Our ISPD 2025 paper Cypress won the Best Paper Award. [Photo] 3/2025
ARIES nominated for Best Paper Award at FPGA 2025. 3/2025
Our FPGA 2024 paper HLS Formal Verification won the Best Paper Award. [News] 3/2024
Our ACM TRETS 2023 paper RapidLayout won the Best Paper Award. [Photo] 7/2023
Donated hair to Locks of Love to support children suffering from hair loss. [Certificate] 7/2023
Serving Multi-DNN Workloads on FPGAs selected as Featured Paper of IEEE Trans. on Computers. 5/2023
Patent CN109658402B granted. [Certificate] [Google Patents] 4/2023
Selected as DAC Young Fellow. [Certificate] 4/2023

Education

Cornell University

M.S./Ph.D. in Electrical and Computer Engineering

Advisor: Prof. Zhiru Zhang

2021 — Present

Sun Yat-sen University

B.Eng. in Telecommunication Engineering

Advisor: Prof. Xiang Chen · Outstanding thesis

2016 — 2020

Work Experience

NVIDIA Research

PhD Research Intern · Design Automation Research

Mentor: Anthony Agnesina, Chenhui Deng, Mark Ren

2024 — 2025

Advanced Micro Devices

Compiler Intern · ACDC, CTO

Manager: Stephen Neuendorffer

Summer 2023

Intel Labs

Exempt Tech Employee · SAVE Group

Manager: Jin Yang, Sunny Zhang

2021

Tsinghua University

Research Assistant · NICS-EFC

Advisor: Prof. Yu Wang

2019 — 2021

University of Waterloo

MITACS Research Intern · WatCAG

Advisor: Prof. Nachiket Kapre

2019

Research & Publications

I build compilers, programming models, and design automation tools that make it easier to design and program specialized hardware.

Cedar: Learning to Optimize RTL via Accumulating Verified Rewrite Rules

Niansong Zhang, Chenhui Deng, Johannes Maximilian Kuehn, Chia-Tung Ho, Cunxi Yu, Zhiru Zhang, Brucek Khailany

ICCAD 2026 · to appear

Abstract

Cedar extends equality saturation to RTL well beyond prior e-graph work through an LLM-driven optimization loop that accumulates formally verified rewrite rules across designs—rules discovered while optimizing one circuit are reused on future, unseen ones. Extended e-graph operators and direct semantic rewrites let Cedar handle complex RTL such as multiplexers, FSMs, decoders, and heterogeneous datapaths, while a power-aware extraction cost model enables joint power, delay, and area optimization, with formal equivalence preserved on every design.

From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR

Erwei Wang, Samuel Bayliss, Andra Bisca, Zachary Blair, Sangeeta Chowdhary, Kristof Denolf, Jeff Fifield, Brandon Freiberger, Erika Hunhoff, Phil James-Roxby, Jack Lo, Joseph Melber, Stephen Neuendorffer, Eddie Richter, André Rösti, Javier Setoain, Gagandeep Singh, Endri Taka, Pranathi Vasireddy, Zhewen Yu, Niansong Zhang, Jinming Zhuang

ACM TRETS · paper · code

Abstract

We introduce MLIR-AIR, an open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs, achieving up to 78.7% compute efficiency on matrix multiplication.

Characterizing and Optimizing Realistic Workloads on a Commercial Compute-in-SRAM Device

Niansong Zhang, Wenbo Zhu, Courtney Golden, Dan Ilan, Hongzheng Chen, Christopher Batten, Zhiru Zhang

MICRO 2025 · paper

Abstract

This paper characterizes a commercial compute-in-SRAM device using realistic workloads, proposes key data management optimizations, and demonstrates that it can match GPU-level performance on retrieval-augmented generation tasks while achieving over 46x energy savings.

ASPEN: LLM-Guided E-Graph Rewriting for RTL Datapath Optimization

Niansong Zhang, Chenhui Deng, Johannes Maximilian Kuehn, Chia-Tung Ho, Cunxi Yu, Zhiru Zhang, Haoxing Ren

MLCAD 2025 · paper

Abstract

ASPEN uses LLM-guided e-graph rewriting with real PPA feedback for RTL optimization. With 16.51% area and 6.65% delay improvements over prior methods, ASPEN shows you can have both smart and sound—and it's fully automated.

Cypress: VLSI-Inspired PCB Placement with GPU Acceleration

Niansong Zhang, Anthony Agnesina, Noor Shbat, Yuval Leader, Zhiru Zhang, Haoxing Ren

ISPD 2025 · paper · code

🏆 Best Paper Award

Abstract

We present Cypress, a GPU-accelerated, VLSI-inspired PCB placer that boosts routability by up to 5.9x, cuts track length by 19.7x, and runs up to 492x faster on new realistic benchmarks.

ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines

Jinming Zhuang*, Shaojie Xiang*, Hongzheng Chen, Niansong Zhang, Zhuoping Yang, Tony Mao, Zhiru Zhang, Peipei Zhou * Equal Contribution

FPGA 2025 · paper · code

🏅 Best Paper Nominee

Abstract

We propose ARIES, a unified MLIR-based compilation flow that abstracts task, tile, and instruction-level parallelism across AMD AI Engine arrays (and optional FPGA fabric), boosting Versal VCK190 GEMM throughput by up to 1.6x over prior work.

Allo: A Programming Model for Composable Accelerator Design

Hongzheng Chen*, Niansong Zhang*, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, Zhiru Zhang * Equal Contribution

PLDI 2024 · paper · code

Abstract

Allo, a new composable programming model, decouples hardware customizations from algorithms and outperforms existing languages in performance and productivity for specialized hardware accelerator design.

Formal Verification of Source-to-Source Transformations for HLS

Louis-Noël Pouchet, Emily Tucker, Niansong Zhang, Hongzheng Chen, Debjit Pal, Gaberiel Rodríguez, Zhiru Zhang

FPGA 2024 · paper · code

🏆 Best Paper Award

Abstract

We target the problem of efficiently checking the semantics equivalence between two programs written in C/C++ as a means to ensuring the correctness of the description provided to the HLS toolchain.

Understanding the Potential of FPGA-based Spatial Acceleration for Large Language Model Inference

Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang

ACM TRETS, Vol. 18, No 1, Article 5 (FCCM'24 Journal Track) · paper

🏆 TRETS 2026 Best Paper Award

Abstract

We design a spatial FPGA accelerator for LLM inference that assigns each operator its own hardware block and connects them with on-chip dataflow to cut memory traffic and latency.

Supporting a Virtual Vector Instruction Set on a Commercial Compute-in-SRAM Accelerator

Courtney Golden, Dan Ilan, Caroline Huang, Niansong Zhang, Zhiru Zhang, Christopher Batten

IEEE Computer Architecture Letters · paper

Abstract

We implement a virtual vector instruction set on a commercial Compute-in-SRAM device, and perform detailed instruction microbenchmarking to identify performance benefits and overheads.

Serving Multi-DNN Workloads on FPGAs: a Coordinated Architecture, Scheduling, and Mapping Perspective

Shulin Zeng, Guohao Dai, Niansong Zhang, Xinhao Yang, Haoyu Zhang, Zhenhua Zhu, Huazhong Yang, Yu Wang

IEEE Transactions on Computers · paper

🏅 Featured Paper in the May 2023 Issue

Abstract

This paper proposes a Design Space Exploration framework to jointly optimize heterogeneous multi-core architecture, layer scheduling, and compiler mapping for serving DNN workloads on cloud FPGAs.

Accelerator Design with Decoupled Hardware Customizations: Benefits and Challenges

Debjit Pal, Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, Jeremy Casas, Pasquale Cocchini, Zhenkun Yang, Jin Yang, Louis-Noël Pouchet, Zhiru Zhang

Invited Paper, DAC 2022 · paper

Abstract

We show the advantages of the decoupled programming model and further discuss some of our recent efforts to enable a robust and viable verification solution in the future.

CodedVTR: Codebook-Based Sparse Voxel Transformer with Geometric Guidance

Tianchen Zhao, Niansong Zhang, Xuefei Ning, He Wang, Li Yi, Yu Wang

CVPR 2022 · paper · website · slides · poster · video

Abstract

We propose CodedVTR, a flexible 3D Transformer on sparse voxels that decomposes attention space into linear combinations of learnable prototypes to regularize attention learning, with geometry-aware self-attention to guide training.

HeteroFlow: An Accelerator Programming Model with Decoupled Data Placement for Software-Defined FPGAs

Shaojie Xiang, Yi-Hsiang Lai, Yuan Zhou, Hongzheng Chen, Niansong Zhang, Debjit Pal, Zhiru Zhang

FPGA 2022 · paper · code

Abstract

We propose an FPGA accelerator programming model that decouples the algorithm specification from optimizations related to orchestrating the placement of data across a customized memory hierarchy.

RapidLayout: Fast Hard Block Placement of FPGA-optimized Systolic Arrays using Evolutionary Algorithms

Niansong Zhang, Xiang Chen, Nachiket Kapre

Invited Paper, ACM TRETS, Vol. 15, Issue 4, Article 38 · paper

🏆 Best Paper Award

Abstract

We extend the previous work on RapidLayout with cross-SLR routing, placement transfer learning, and placement bootstrapping from a much smaller device to improve runtime and design quality.

RapidLayout: Fast Hard Block Placement of FPGA-optimized Systolic Arrays using Evolutionary Algorithms

Niansong Zhang, Xiang Chen, Nachiket Kapre

FPL 2020 · paper · code

🏅 Michal Servit Best Paper Award Nominee

Abstract

We build a fast and high-performance evolutionary placer for FPGA-optimized hard block designs that targets high clock frequency such as 650+MHz.

Preprints

Dato: A Task-Based Programming Model for Dataflow Accelerators

Shihan Fang, Hongzheng Chen, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang

arXiv 2025 · paper

Abstract

We introduce Dato, a Python-based programming framework for dataflow accelerators that treats data communication as a first-class language feature. On AMD Ryzen AI, Dato achieves up to 84% hardware utilization for GEMM and a 2.81x speedup on attention kernels compared to state-of-the-art commercial frameworks.

aw_nas: A Modularized and Extensible NAS Framework

Xuefei Ning, Changcheng Tang, Wenshuo Li, Songyi Yang, Tianchen Zhao, Niansong Zhang, Tianyi Lu, Shuang Liang, Huazhong Yang, Yu Wang

arXiv 2020 · paper · code

Abstract

We build an open-source Python framework implementing various NAS algorithms in a modularized and extensible manner.

Workshops & Talks

From Pragmas to Partners: A Symbiotic Evolution of Agentic High-Level Synthesis

Niansong Zhang, Sunwoo Kim, Shreesha Srinath, Zhiru Zhang

LATTE 2026 · paper

Abstract

We argue that HLS remains essential in the agentic era—offering faster iteration, portability, and design permutability that make it a natural layer for agentic optimization—and propose a six-level autonomy taxonomy (L0 Manual DSE → L5 Silicon Partner) for the symbiotic evolution of agentic HLS.

ASPEN: LLM-Guided E-Graph Rewriting for RTL Datapath Optimization

Niansong Zhang, Chenhui Deng, Johannes Maximilian Kuehn, Chia-Tung Ho, Cunxi Yu, Zhiru Zhang, Haoxing Ren

NSF Workshop on Agents for Chip Design Automation, UCLA (03/13/2026) · website

Abstract

An MLIR-based Intermediate Representation for Accelerator Design with Decoupled Customizations

Hongzheng Chen*, Niansong Zhang*, Shaojie Xiang, Zhiru Zhang

MLIR Open Design Meeting (08/11/2022) · video · slides · website

CRISP Liaison Meeting (09/28/2022) · news · slides · website

Abstract

We decouple hardware customizations from the algorithm specifications at the IR level to: (1) provide a general platform for high-level DSLs, (2) boost performance and productivity, and (3) make customization verification scalable.

Enabling Fast Deployment and Efficient Scheduling for Multi-Node and Multi-Tenant DNN Accelerators in the Cloud

Shulin Zeng, Guohao Dai, Niansong Zhang, Yu Wang

MICRO 2021 ASCMD Workshop · paper · video

Abstract

We propose a multi-node and multi-core accelerator architecture and a decoupled compiler for cloud-backed INFerence-as-a-Service (INFaaS).

Professional Services

Session Chair: MLCAD 2025

Conference Reviewer: CVPR 2026, FCCM 2024, ICCAD 2022 & 2023

Journal Reviewer: IEEE TCAD, IEEE TC, IEEE TPDS, IEEE TCAS-II, ACM TRETS

Student Volunteer: FCCM'22

Teaching

[ECE 6775] High-Level Digital Design Automation
Head TA, Fall 2025

[ECE 2300] Digital Logic and Computer Organization
Head TA, Spring 2024

[ECE 5775] High-Level Digital Design Automation
Part-time TA, Fall 2022

Awards & Honors

Best Paper Award for ACM TRETS in 2026
Best Paper Award at ISPD 2025
Best Paper Nomination at FPGA 2025
Best Paper Award at FPGA 2024
Best Paper Award for ACM TRETS in 2023
Best Paper Nomination (Michal Servit Award) at FPL 2020
DAC Young Fellow 2021 & 2023
Outstanding Bachelor Thesis Award, Sun Yat-sen University
Mitacs Globalink Research Internship Award, Canada
First-class Merit Scholarship x2, Sun Yat-sen University
Lin and Liu Foundation Scholarship, SEIT, Sun Yat-sen University

Patents

Anthony Agnesina, Haoxing Ren, Niansong Zhang, "Printed Circuit Board Component Placement." US Patent App. US20260044660A1, filed Feb 2025, published Feb 12, 2026. [Google Patents]
Niansong Zhang, Songyi Yang, Shun Fu, Xiang Chen, "Industry Profile Geometric Dimension Automatic Measuring Method Based on Computer Vision Imaging." Chinese Patent CN201811539019.8A, filed Dec 17, 2018, issued Apr 19, 2019. [Google Patents]
Niansong Zhang, Tianyi Lu, Changcheng Tang, Jian Zhang, "A Pruning Method and Device of Multi-task Neural Network Models." Chinese Patent CN114492783A, filed Aug 12, 2020. [Google Patents]