Site Reliability Engineer (SRE)

Remote
Full Time
Experienced

Excellent opportunity to work REMOTELY with a U.S.-based company. Candidates living in Mexico, Central America, or South America are welcome to apply.

About the Company

Bydrec, Inc. is a California-based company that connects top tech talent from Latin America with U.S. companies looking to expand their development teams. Learn more at bydrec.com.

Our client is a dynamic company that needs a proactive, self-assured engineer to help define and lead this project. The ideal candidate will keep the client's platform fast, resilient, and scalable, especially during high-traffic live events. This is a unique opportunity to contribute to the future of reliability at a company where uptime and user experience are paramount.

What You’ll Do 

  • Optimize Performance: Continuously monitor and analyze system performance, identify bottlenecks, and implement solutions to improve efficiency and scalability across our cloud-native infrastructure. 

  • Monitoring & Alerting: Design and manage robust observability systems using Prometheus, Grafana, ELK stack, and APM tools to ensure real-time visibility into platform health. 

  • Incident Management: Lead incident response efforts, perform root cause analysis, and drive post-mortem processes to prevent recurrence and improve system resilience. 

  • Cloud Infrastructure: Architect and maintain infrastructure across Azure and GCP, ensuring high availability, security, and cost-effectiveness. 

  • Automation & Tooling: Build and maintain automation scripts and playbooks using Python and Ansible to reduce manual effort and improve deployment consistency. 

  • Container Orchestration: Manage Kubernetes clusters to support dynamic scaling and seamless deployment of microservices. 

  • CI/CD & GitOps: Collaborate with development teams to enhance GitLab pipelines and promote GitOps practices for reliable and repeatable deployments. 

  • Cross-Team Collaboration: Work closely with Engineering, Development, and Technical Operations to align reliability goals with product and business objectives. 

Technical Requirements

  • 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles within a SaaS or cloud-native environment. 
  • Strong expertise in:  
    - Azure and GCP cloud platforms 
    - Kubernetes and container orchestration 
    - Monitoring tools: Prometheus, Grafana, ELK stack, APM solutions 
    - Automation: Python, Ansible 
    - CI/CD: GitLab 

  • Proven success in performance tuning, incident response, and system scalability. 

  • Excellent communication and collaboration skills across technical and non-technical teams. 

  • Initiative, confidence, and a builder’s mindset—ready to shape a nascent function and drive impact from day one. 
  • Sense of urgency during critical incidents, as the work focuses on maintaining high availability.
  • Advanced level of English.

Must-Have Skills

  • Experience using APM (Application Performance Monitoring) tools, also referred to as observability platforms.
  • Skill in leveraging logs for monitoring, alerting, and forensics.
  • Expertise working with modern cloud-native environments, with experience in both on-premises and cloud infrastructure (due to the ongoing migration).
