Senior Site Reliability Engineer (Remote)

Jobgether
Full-time
On-site
Remote

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer in Texas (USA).

In this role, you will play a key part in ensuring the performance, scalability, and resilience of large-scale, mission-critical systems. You will work closely with cross-functional engineering teams to implement observability, monitoring, and profiling solutions that empower developers to optimize system performance and reliability. This role offers the opportunity to define reliability standards, improve incident response processes, and contribute to a culture of continuous improvement. You will be hands-on with cloud infrastructure, distributed systems, and automation tools, helping to deliver a seamless experience for millions of users while shaping the future of operational excellence.

Accountabilities

  • Build, maintain, and enhance monitoring, tracing, and profiling systems to ensure visibility into system health and performance.
  • Define, implement, and optimize SLIs, SLOs, and SLAs that accurately reflect user experience.
  • Partner with engineering teams to identify and resolve performance bottlenecks and reduce operational toil.
  • Lead incident response efforts, conduct post-incident reviews, and implement learnings to improve system resilience.
  • Collaborate on platform improvements that enhance developer productivity and system reliability.
  • Continuously evaluate and recommend new tools and process improvements to optimize operational efficiency.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
  • 6+ years of experience in site reliability, DevOps, or software engineering roles.
  • Expertise with observability, monitoring, and alerting platforms (e.g., Prometheus, Grafana, Loki, OpenTelemetry).
  • Experience implementing tracing, logging, and profiling for distributed systems.
  • Strong knowledge of incident management, postmortem processes, and reliability metrics.
  • Familiarity with Linux, Kubernetes, Terraform, and cloud platforms (GCP preferred).
  • Proficiency in at least one scripting or backend language (e.g., Python, Go, Bash).
  • Excellent problem-solving, communication, and collaboration skills with a passion for continuous improvement.

Benefits

  • Competitive salary and equity opportunities with significant upside.
  • Flexible schedules and generous time-off policies, including a sabbatical after five years of service.
  • Comprehensive health, dental, and vision coverage.
  • Supportive work environment promoting work-life balance and personal growth.
  • Access to modern office amenities, collaborative spaces, and team-building activities.
  • Opportunities to work remotely with hybrid flexibility.


Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.
🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience, and achievements.
📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.
🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.
🧠 When necessary, our human team performs an additional manual review to ensure no strong profile is missed.

The process is transparent, skills-based, and free of bias — focusing solely on your fit for the role. Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or assessments) are made by their internal hiring team.

Thank you for your interest!
