kaiko.ai

Machine Learning Engineer – GPU Acceleration & Distributed Training

Job description

What we build at Kaiko

At the core of Kaiko’s vision is the Kaiko data & compute platform (KDCP): a distributed information system that brings together hospitals and labs to provide ingestion, processing, analysis, modeling, reporting, and sharing of a multitude of complex structured and unstructured data sources, including genomics, imaging, and clinical data, delivered as a multi-tenant SaaS platform in the cloud.

We work in close collaboration with cancer hospitals and research institutes to integrate KDCP on their premises. Our vision is to explore new frontiers in supporting medical doctors in their decision-making, and to enable researchers to run complex machine learning pipelines and work with the resulting models.


About the Role

As a Machine Learning Engineer specializing in GPU acceleration and distributed training, you will focus on improving the efficiency of handling very long sequence lengths in Transformers, State Space Models (SSMs), and other architectures using CUDA, Triton, and Torch. Additionally, you will scale training processes across multi-node distributed systems to ensure robust and efficient model development. You will work closely with our ML Research teams to build and maintain high-performance training pipelines.


How you’ll contribute

  • Efficiency Optimization: Leverage CUDA, Triton and Torch to improve the efficiency of Transformers, SSMs and other architectures for very long sequence lengths.
  • Distributed Training: Scale custom machine learning training pipelines efficiently across multi-node GPU clusters.
  • Collaboration: Work with ML Researchers and Engineering teams to integrate optimized training solutions into the development lifecycle.


What you'll bring

  • Master's degree in Computer Science, Engineering, or a related field; a Ph.D. is a plus.
  • Proficient in Python, with extensive experience in PyTorch.
  • Deep expertise with CUDA and/or Triton for optimizing GPU performance, specifically for large-scale sequence processing.
  • Proven experience scaling machine learning training across multi-node distributed GPU environments.
  • Strong understanding of Transformers, State Space Models (SSMs), and other common architectures and their optimization.
  • Skilled in performance tuning and profiling for both software and hardware in machine learning contexts.
  • Ability to diagnose and resolve complex technical challenges related to GPU acceleration and distributed training.
  • Excellent communication skills and the ability to work effectively within a multidisciplinary team.
  • Capable of managing multiple projects simultaneously and adapting to evolving priorities in a fast-paced environment.


Nice to Have

  • Experience with containerization technologies, such as Docker or Kubernetes.
  • Experience with cloud computing platforms, such as Azure, AWS or GCP.


Additional Information

This position is full-time and requires residency in either the Netherlands or Switzerland, a valid work permit, and proximity to our offices in Amsterdam or Zürich. A Certificate of Conduct will be necessary upon finalizing the employment contract due to the handling of sensitive data.


Our culture

At Kaiko we strive for an open, creative, and non-hierarchical work atmosphere, offering flexibility (for instance, remote work) and direct impact in return for accountability and team spirit. You will prioritize, manage, and execute your own goals with ownership, keeping them aligned with those of the company. Sensitive data is at the core of our daily work, so data privacy awareness is a key skill for all our employees.

We give talented people a lot of room to explore new ideas, and we reward exceptional talent with an attractive package and opportunities for personal development.
