Technical Lead - SRE

Site Reliability Engineering (SRE), Java, REST, AWS Lambda, New Relic/Splunk/Datadog, Full Stack
Description

Being an effective Site Reliability Engineering (SRE) professional is as much about how you think as about your technical skills. The SRE role requires a mix of development and operations skills that combine software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. As part of the SRE team, you will manage the complex challenges of scale that are unique to the client while using your expertise in coding, systems, the complexity of operating systems, and large-scale system design. SRE's culture of diversity, intellectual curiosity, problem-solving, and openness is key to its success. We bring together people with a wide variety of backgrounds, experiences, and perspectives.

Who We Are

GSPANN has been in business for over a decade, employs more than 2,000 people worldwide, and serves some of the largest retail, high-technology, and manufacturing clients in North America. We provide an environment that enables career growth while still letting you interact with company leadership.

Visit Why GSPANN for more information.

Location: Hyderabad / Gurugram / Pune
Role Type: Full Time
Published On: 27 March 2023
Experience: 8+ Years
Role and Responsibilities
  • Use expertise in full-stack application support to provide meaningful health and performance metrics that continuously drive improvements in reliability and consumer experience.
  • Work across teams to proactively identify and troubleshoot incidents within the stack (consumer and non-consumer facing).
  • Identify, evaluate, and execute preventive measures to minimize or avoid impact on the consumer experience.
  • Enable and lead critical incident and problem management processes to restore service, manage root cause analysis, and recommend solutions for a long-term fix.
  • Conduct product launches and releases, prepare reports for management, identify risks/solutions, and make recommendations for continual improvement.
  • Work across teams to continuously analyze system performance in production, troubleshoot consumer-reported issues, and proactively identify areas in need of optimization.
  • Prepare SOPs and set up runbooks for routine issues.
  • Lead a technical team of support engineers through day-to-day operations and critical incidents.
Skills and Experience
  • 8+ years of relevant work experience, including 5-6 years as an SRE.
  • 3+ years of experience in building cloud-based enterprise systems, ideally on AWS.
  • Proficient in Java 8 or later versions.
  • Hands-on experience automating manual tasks or recurring incidents using Java/Spring Boot/Python/Node.js or similar.
  • Expertise in writing PL/SQL queries.
  • Basic or advanced AWS certification and hands-on experience with AWS services such as Amazon EC2 and Auto Scaling, Amazon S3, DynamoDB, CloudFormation, ECS, RDS, SQS, SNS, and ALB are desirable.
  • Good understanding of IT service management (incident, problem, change, and knowledge management).
  • Hands-on experience with at least one popular observability tool, such as New Relic, Splunk, Datadog, or SignalFx.
  • Capable of writing complex queries and creating dashboards in observability tools such as New Relic, Splunk, or Datadog.
  • Ability to effectively triage issues in production.
  • Practical experience managing and leading application reliability practices for consumer-facing web and mobile experiences.
  • Prior experience in developing and driving real-time monitoring solutions that provide visibility into site health and key performance indicators.
  • Good knowledge of agile methodologies, performance engineering, and automation tools.
  • Highly confident and capable of reporting and communicating high-value metrics to leadership.
  • Deep understanding of the business landscape and how site reliability influences our consumers.
