Site Reliability Engineer (SRE)

Remote

Full Time

Experienced

Excellent opportunity to work REMOTELY with a U.S.-based company. Candidates living in Mexico, Central America, or South America are welcome to apply.

About the Company

Bydrec, Inc. is a California-based company that connects top tech talent from Latin America with U.S. companies looking to expand their development teams. Learn more at bydrec.com.

Our client is a dynamic company seeking a proactive, self-assured engineer to help define and lead its reliability function. The ideal candidate will ensure the platform remains fast, resilient, and scalable, especially during high-traffic live events. This is a unique opportunity to shape the future of reliability at a company where uptime and user experience are paramount.

What You’ll Do

Optimize Performance: Continuously monitor and analyze system performance, identify bottlenecks, and implement solutions to improve efficiency and scalability across our cloud-native infrastructure.

Monitoring & Alerting: Design and manage robust observability systems using Prometheus, Grafana, ELK stack, and APM tools to ensure real-time visibility into platform health.

Incident Management: Lead incident response efforts, perform root cause analysis, and drive post-mortem processes to prevent recurrence and improve system resilience.

Cloud Infrastructure: Architect and maintain infrastructure across Azure and GCP, ensuring high availability, security, and cost-effectiveness.

Automation & Tooling: Build and maintain automation scripts and playbooks using Python and Ansible to reduce manual effort and improve deployment consistency.

Container Orchestration: Manage Kubernetes clusters to support dynamic scaling and seamless deployment of microservices.

CI/CD & GitOps: Collaborate with development teams to enhance GitLab pipelines and promote GitOps practices for reliable and repeatable deployments.

Cross-Team Collaboration: Work closely with Engineering, Development, and Technical Operations to align reliability goals with product and business objectives.

Technical Requirements

5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles within a SaaS or cloud-native environment.
Strong expertise in:
- Azure and GCP cloud platforms
- Kubernetes and container orchestration
- Monitoring tools: Prometheus, Grafana, ELK stack, APM solutions
- Automation: Python, Ansible
- CI/CD: GitLab
Proven success in performance tuning, incident response, and system scalability.
Excellent communication and collaboration skills across technical and non-technical teams.
Initiative, confidence, and a builder’s mindset—ready to shape a nascent function and drive impact from day one.
Sense of urgency during critical incidents, as the work focuses on maintaining high availability.
Advanced level of English.

Must-Have Skills

Experience using APM (Application Performance Monitoring) tools, also referred to as observability platforms.
Skill in leveraging logs for monitoring, alerting, and forensics.
Expertise working with modern cloud-native environments, with experience in both on-premises and cloud infrastructure (due to the ongoing migration).

Apply for this position


1. How many years of experience do you have working in Site Reliability Engineering, DevOps or Infrastructure roles within SaaS or cloud-native environments?*

2. How many years of experience do you have working with Microsoft Azure (implementation, maintenance, or administration of cloud services)?*

3. How many years of experience do you have working with Google Cloud Platform (GCP)?*

4. How many years of experience do you have managing Kubernetes clusters or similar container orchestration systems?*

5. How many years of experience do you have using monitoring and observability tools like Prometheus, Grafana, or ELK Stack?*

6. How many years of experience do you have working with APM (Application Performance Monitoring) tools?*

7. How many years of experience do you have developing scripts or automations with Python and Ansible?*

8. How many years of experience do you have building or maintaining CI/CD pipelines in GitLab?*

9. Do you have experience leveraging logs for monitoring, alerts, and forensic analysis?*

10. Do you have experience working with modern cloud-native environments, both on-premises and in the cloud?*

11. What is your expected salary (monthly salary in dollars)?*
