Site Reliability Engineer

Alianza, Inc.

5 months ago

Full-time

On-site

London

This position can be hybrid or remote in the UK. You must currently have the right to work in the UK without requiring sponsorship, either now or in the future.

A Site Reliability Engineer (SRE) is responsible for ensuring the reliability, performance, and scalability of Alianza’s Cloud Platform systems and infrastructure.

Key Objectives include:

* Run the production environment by monitoring availability and taking a holistic view of system health.
* Improve reliability, quality, and time-to-market of software solutions.
* Balance feature development speed and reliability with well-defined service-level objectives.

Key Responsibilities:

1. Monitoring and Maintenance:
   1. Continuously monitor system health and performance, ensuring high availability and reliability of applications.
   2. Detect and automatically handle failures, and prepare disaster recovery plans.
2. Automation and Improvement:
   1. Build and maintain software and systems to manage platform infrastructure and applications.
   2. Implement automation to reduce manual intervention and improve system efficiency.
3. Performance Optimization:
   1. Measure and optimize system performance, pushing capabilities forward and innovating for continual improvement.
   2. Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding.
4. Collaboration and Consulting:
   1. Partner with development teams to improve services through rigorous testing and release procedures.
   2. Participate in system design consulting, platform management, and capacity planning.
5. Incident Management:
   1. Provide primary operational support and engineering for multiple large-scale distributed software applications.
   2. Participate in on-call rotations to respond to incidents and ensure system reliability.

Competencies & Attributes

Competencies:

1. Attention to Detail: The ability to perform tasks with thoroughness and accuracy, ensuring all aspects of the system are meticulously managed.
2. Problem-Solving Skills: The capability to analyze complex issues, identify root causes, and develop effective solutions to ensure system reliability and performance.
3. Technical Expertise: Proficiency in understanding and applying technical knowledge related to infrastructure, code, and tools, which can be enhanced through continuous learning and experience.
4. Automation Skills: The ability to design and implement automation processes to reduce manual intervention and improve system efficiency.
5. Communication Skills: The ability to clearly convey ideas, strategies, and updates to various stakeholders, ensuring alignment and transparency across the organization.

Attributes:

1. Meticulousness: An inherent tendency to be precise and conscientious, ensuring high standards are maintained in all aspects of work.
2. Resilience: The innate ability to remain calm and composed under pressure, effectively managing stressful situations and leading the team through challenges.
3. Curiosity: A natural inclination to explore and learn new technologies and methodologies, driving innovation and continuous improvement.
4. Empathy: An inherent quality of understanding and valuing the perspectives and needs of team members and stakeholders, fostering a supportive and inclusive environment.
5. Adaptability: The ability to naturally adjust to changing circumstances and environments, ensuring effective responses to new challenges and opportunities.

Desired Skills/Qualifications

1. Technical Proficiency:
   1. Understanding of high-level languages such as Python, Java, C/C++, Ruby, and JavaScript.
   2. Experience with distributed storage technologies and dynamic resource management frameworks.
   3. Experience with telco technology and Metaswitch software is a bonus.
2. Problem-Solving Skills:
   1. Strong analytical skills to diagnose and resolve complex technical issues.
3. Communication Skills:
   1. Excellent communication skills to collaborate effectively with cross-functional teams and convey technical concepts.
4. Experience with Cloud Platforms:
   1. Hands-on experience with cloud platforms like AWS, GCP, or Azure. Understanding cloud-native applications and services is vital for modern SRE roles.
5. Knowledge of Networking and Distributed Systems:
   1. Strong understanding of networking fundamentals and experience with distributed systems such as Kafka and Kubernetes, as well as stream-processing technologies. This helps in managing large-scale, complex systems.
