Cloud Reliability Engineer

Location: Kochi | Nov 13, 2023

Overview

Reports To: Delivery Manager

 

You will be responsible for designing, implementing, and maintaining robust observability and reliability solutions for our cloud infrastructure across platforms such as AWS, Azure, and GCP. Your expertise in cloud technologies and monitoring tools will be crucial in ensuring the availability, performance, and resilience of our systems.

 

RESPONSIBILITIES

 

1. Design and implement observability solutions, including monitoring, logging, and tracing, to gain insights into the performance and health of our cloud infrastructure.

2. Develop and maintain monitoring dashboards, alerts, and incident response playbooks to proactively identify and address performance and availability issues.

3. Collaborate with development and operations teams to implement logging and tracing mechanisms within applications to facilitate troubleshooting and root cause analysis.

4. Continuously evaluate and enhance observability tools and practices to ensure optimal performance and scalability.

5. Implement and maintain incident management processes, including incident response, escalation, and post-incident reviews.

6. Conduct capacity planning and performance optimization exercises to ensure the scalability and efficiency of our cloud infrastructure.

7. Implement automated infrastructure deployment and configuration management using tools like Terraform, Ansible, or equivalent.

8. Collaborate with cross-functional teams to design and implement disaster recovery and business continuity solutions for cloud-based applications.

9. Stay up to date with industry trends and emerging technologies in cloud observability and reliability engineering.

10. Provide guidance and support to development and operations teams on best practices for building resilient and observable cloud applications.

11. Document observability and reliability standards, procedures, and configurations for knowledge sharing and compliance purposes.

 

REQUIRED

 

  • 2+ years of experience
  • Bachelor’s/Master’s degree in Computer Science, IT, or a similar field

 

MUST HAVE

 

1. Strong experience in cloud observability and reliability engineering, with expertise in at least one major cloud platform: AWS, Azure, or GCP. Knowledge of multiple platforms is a plus.

2. In-depth understanding of cloud services and architectures, including networking, storage, and security.

3. Proficiency in observability tools and frameworks such as ELK stack, Grafana, Jaeger, or equivalent.

4. Experience with log management and analysis tools (e.g., Splunk, ELK stack, CloudWatch Logs).

5. Solid understanding of infrastructure-as-code (IaC) principles and experience with automation tools like Terraform, Ansible, or equivalent.

6. Strong scripting and programming skills (e.g., Python, Bash, PowerShell) for automation and tooling.

7. Understanding of capacity planning and performance optimization techniques for cloud infrastructure.

8. Strong problem-solving and troubleshooting skills with keen attention to detail.

9. Knowledge of networking concepts, including VPNs, VPCs, subnets, and firewalls.

10. Familiarity with DevOps practices and CI/CD pipelines.

 

NICE TO HAVE

 

    • Excellent understanding of the airline/aviation industry
    • Multi-cloud experience