Who we are
Spectral Labs is a spatial intelligence company building reasoning models for engineering physical systems. Our model, SGS-1, is state-of-the-art for parametric geometry, and we are building the next generation of models to revolutionize how systems are designed and manufactured from the ground up. Our team is small and talent-dense. We have founded quantitative trading firms and built generative design at Autodesk. Our founding members have worked on the cutting edge of applied AI at Meta, Autodesk Research and Samsung Research.
Role: In person in SF
Comp: 350-600k+ TC
What we're looking for
Spectral is seeking a team member who will develop ML pipelines to fine-tune and run RL on our CAD foundation models. This person will own the infrastructure for making our models better.
Responsibilities
- Optimize distributed training & RL across our cluster of hundreds of H100 GPUs (FSDP, DeepSpeed, or custom parallelism strategies)
- Identify and correct bottlenecks in a complex, bespoke multi-modal training + RL setup
- Own our training + RL infrastructure
- Work closely with researchers to unblock training experiments and reduce iteration time
Qualifications
- Experience optimizing multi-node training at scale
- Deep understanding of profiler traces: can tell whether a system is I/O-, network-, CPU-, or GPU-bound
- Comfort with NCCL internals
- Experience with high-performance networking stacks (e.g., GCP TCPXO) is a plus
- Experience with different types of models and their unique training challenges (diffusion, AR models, etc.)
Benefits
- Compensation competitive with top opportunities, including meaningful ownership
- Health insurance with 100% of premiums covered