Job Information

NVIDIA Deep Learning Kernel Software Performance Architect in Shanghai, China

NVIDIA is seeking Software Performance Architects to optimize GPU kernel performance for state-of-the-art data-center platforms. We build automated, data-driven workflows to detect, explain, and prevent performance regressions across key deep learning workloads, partnering closely with kernel developers, compiler teams, infrastructure, and architecture/performance groups.

What you'll be doing:

Performance analysis and debugging:

  • Validate and analyze performance of GPU-accelerated kernels and key deep learning building blocks.

  • Debug performance issues end-to-end: reproduce, isolate root causes, propose fixes or mitigation paths, and drive closure with the owning teams.

  • Build performance narratives using structured evidence: baselines, controlled comparisons, and regression attribution.

Automation and regression infrastructure (Python-heavy):

  • Develop and maintain Python-based automation for performance testing and analysis, using modern AI-assisted developer tools (e.g., Cursor, Claude Code, Copilot) to accelerate scripting while keeping code maintainable and reviewable.

  • Design and operate performance test workflows: coverage definition, test/workload generation, automated large-scale execution (CI/nightly/on-demand), rerun rules, and reproducibility standards.

  • Convert raw run outputs into actionable insight: statistics, noise control, post-processing, visualization, and large-scale result mining.

Cross-team collaboration and operating model:

  • Work with kernel developers and compiler/rotation teams to ensure performance checks are practical, scalable, and aligned to release needs.

  • Partner with SWQA and infrastructure teams for execution at scale and reliable pipelines/dashboards.

  • Contribute to clear ownership, triage, and routing rules so regressions close quickly and consistently.

  • Follow general software engineering best practices, including support for regression testing and CI/CD flows.

What we need to see:

  • Master's or PhD degree, or equivalent experience, in Computer Science, Computer Engineering, Applied Math, or a related field

  • Strong programming ability in Python plus C/C++ (performance-oriented code reading/debugging)

  • Solid fundamentals in computer architecture and performance reasoning (latency/throughput, memory hierarchy, parallelism).

  • Experience with performance analysis workflows: profiling, measurement methodology, reproducibility, and regression triage.

  • Comfortable working across teams and driving issues to decision/closure with clear communication

  • Demonstrated strong C++ programming and software design skills, including debugging, performance analysis, and test design

  • Experience with performance-oriented parallel programming, even if it is not on GPUs (e.g., with OpenMP or pthreads)

  • Solid understanding of computer architecture and some experience with assembly programming

  • Ability to identify bottlenecks, optimize resource utilization, and improve throughput

Ways to stand out from the crowd:

  • Experience with high-performance kernels or math libraries (e.g., GEMM/attention, CUTLASS-like concepts)

  • Experience building CI/nightly regression systems, dashboards, or large-scale performance analytics

  • GPU programming/perf experience (CUDA or equivalent parallel programming)

  • Strong ML/DL workload understanding (training/inference shapes, precision modes, perf bottlenecks)

  • Familiarity with simulators/analytical modeling or performance characterization methodology
