Site Reliability Engineer

Remote, Canada · Computer/Software

Senior Site Reliability Engineer

We are helping turn the tide on climate change with breakthrough technologies that accelerate electrification and sustainable operations for energy-intensive industries. We develop full-stack, integrated, open systems that support commercial and industrial electric vehicles, building operations, and agriculture to optimize how the world uses energy, so every watt is worthwhile for humanity. We’re looking for curious, intelligent, collaborative people from diverse backgrounds who want to make a real impact on the sustainability of our planet.

The Job

As a team, SRE builds tools and processes that deliver software quickly, confidently, and reliably across the whole organization. We design, build, and operate production infrastructure using best practices to deliver high levels of reliability and scalability.

As a Senior SRE you will

  • act as a technical leader: mentoring other SREs and collaborating with application developers to identify and meet SLOs
  • architect and build repeatable, highly scalable infrastructure from code
  • craft a cohesive CI/CD platform that enables maximum developer productivity
  • develop platform tools and services
  • design and implement strategies for ensuring our infra is observable and alerts on only the most important events
  • respond to production incidents, both as primary for infra and as as-needed support for the application development teams
  • drive and engage in blameless incident post-mortems
  • bring a desire to learn and a focus on solving problems through automation (“automate all the things”)
 

At this point, we hope you’re feeling excited about the job description you’re reading. Even if you don’t feel that you meet every single requirement, we still encourage you to apply. We’re eager to meet people who believe in our mission and can contribute to our team in a variety of ways, not just candidates who check all the boxes. We want people to feel comfortable expressing their true selves and to come, stay, and do their best work here.

The Requirements

  • 6+ years of work experience in software roles, with 4+ years in SRE or DevOps
  • 3+ years operating infrastructure in public clouds (Azure/AWS/GCP)
  • 2+ years operating Kubernetes clusters in production
  • Deep understanding of infrastructure-as-code (we use Pulumi, but Terraform, ARM templates, or CloudFormation are fine)
  • Experience implementing all parts of an observability stack (Datadog, Prometheus, ELK, Sentry, etc.)
  • Understanding of incident management processes (e.g., on-call, incident playbooks, and blameless post-mortems)
  • Deep understanding of CI/CD
  • Experience programming in Shell, Go, and Python, or willingness to learn
  • An “automate all the things” mindset 
  • Knowledgeable in distributed systems, APIs, cloud computing, and scalability
  • Excellent written & verbal communication skills
  • Degree in CS or understanding of Computer Science fundamentals

Bonus Points

  • Sped up product teams via GitLab CI/CD pipelines
  • Experience using Azure and AKS
  • Knowledgeable in compliance & cybersecurity best practices
  • Managed Cassandra, Kafka, and/or PostgreSQL
  • Architected IoT systems and monitored device fleets
  • Knowledge of Linux/UNIX administration & networking
