Site Reliability Engineer (SRE)
Excellent opportunity to work REMOTELY with a U.S.-based company. Candidates living in Mexico, Central America, or South America are welcome to apply.
About the Company
Bydrec, Inc. is a California-based company that connects top tech talent from Latin America with U.S. companies looking to expand their development teams. Learn more at bydrec.com.
Our client is a dynamic company seeking a proactive, self-assured engineer to help define and lead this project. The ideal candidate will ensure the client's platform remains fast, resilient, and scalable, especially during high-traffic live events. This is a unique opportunity to contribute to the future of reliability at a company where uptime and user experience are paramount.
What You’ll Do 
Optimize Performance: Continuously monitor and analyze system performance, identify bottlenecks, and implement solutions to improve efficiency and scalability across our cloud-native infrastructure.
Monitoring & Alerting: Design and manage robust observability systems using Prometheus, Grafana, ELK stack, and APM tools to ensure real-time visibility into platform health.
Incident Management: Lead incident response efforts, perform root cause analysis, and drive post-mortem processes to prevent recurrence and improve system resilience.
Cloud Infrastructure: Architect and maintain infrastructure across Azure and GCP, ensuring high availability, security, and cost-effectiveness.
Automation & Tooling: Build and maintain automation scripts and playbooks using Python and Ansible to reduce manual effort and improve deployment consistency.
Container Orchestration: Manage Kubernetes clusters to support dynamic scaling and seamless deployment of microservices.
CI/CD & GitOps: Collaborate with development teams to enhance GitLab pipelines and promote GitOps practices for reliable and repeatable deployments.
Cross-Team Collaboration: Work closely with Engineering, Development, and Technical Operations to align reliability goals with product and business objectives.
Technical Requirements:
- 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles within a SaaS or cloud-native environment.
- Strong expertise in:
  - Azure and GCP cloud platforms
  - Kubernetes and container orchestration
  - Monitoring tools: Prometheus, Grafana, ELK stack, APM solutions
  - Automation: Python, Ansible
  - CI/CD: GitLab
- Proven success in performance tuning, incident response, and system scalability.
- Excellent communication and collaboration skills across technical and non-technical teams.
- Initiative, confidence, and a builder’s mindset: ready to shape a nascent function and drive impact from day one.
- Sense of urgency during critical incidents, as the work focuses on maintaining high availability.
- Advanced level of English
 
Must-have skills:
- Experience using APM (Application Performance Monitoring) tools, also referred to as observability platforms.
- Skill in leveraging logs for monitoring, alerting, and forensics.
- Expertise working with modern cloud-native environments, with experience in both on-premises and cloud infrastructure (needed for the ongoing migration).