Site Reliability Engineering Leads

Java, Microservices, App Monitoring Tools (Splunk/Dynatrace/New Relic), AWS/GCP/Azure
Description

Being an effective Site Reliability Engineering (SRE) professional is as much about how you think as about your technical skills. The SRE role requires a mix of development and operations skills that combine software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems.

As a part of the SRE team, you will manage the complex challenges of scale that are unique to the client, while using your expertise in coding, systems, the complexity of operating systems, and large-scale system design. SRE's culture of diversity, intellectual curiosity, problem-solving, and openness is key to its success. We bring together people with a wide variety of backgrounds, experiences, and perspectives.

Who We Are

GSPANN has been in business for over a decade, has over 1800 employees worldwide, and serves some of the largest retail, high technology, and manufacturing clients in North America. We provide an environment that enables career growth while keeping you in direct contact with company leadership.

Visit Why GSPANN for more information.

Location: Hyderabad / Gurugram
Role Type: Full Time
Published On: 9 February 2022
Experience: 8 - 12 Years

Role and Responsibilities
  • Drive reliability throughout the platform using Observability tools, informed architectural improvements, and automation.
  • Collaborate closely with the teams to build cohesive service operation solutions into the overall service design.
  • Build and enhance the process, environment, and tool chains for high-service reliability and availability.
  • Exercise and optimize the service operation process end to end, together with all partner teams.
  • Mitigate live site incidents and recover from them efficiently.
  • Adapt flexibly to manage multiple tasks with changing priorities.
Skills and Experience
  • 2+ years of work experience in the IT industry, supporting large-scale applications/services on platforms like Azure/AWS/GCP.
  • 2+ years of experience in incident and problem management processes using tools like ServiceNow.
  • 3+ years of experience in automating business processes using Java.
  • Hands-on experience with Observability tools like Splunk, New Relic, Azure Monitor, or CloudWatch.
  • Expertise in monitoring and alerting concepts.
  • Expertise in supporting highly available, scalable systems, including debugging and troubleshooting live systems.
  • Good troubleshooting skills and a deep understanding of Metrics, Logs, and Traces.
  • Must adapt to working in a lean-scaled Agile delivery environment.
  • Exceptional written, verbal, and interpersonal communication skills with management, peers, and stakeholders.
