Job Description
Key Responsibilities:
Qualifications
Minimum of 10 years of experience in a relevant area.
Team Leadership: Strong ability to mentor and manage teams using collaborative platforms like Jira, Teams, and Confluence. Excellent communication and collaboration skills.
System Design and Architecture: Expertise in designing scalable and reliable systems using tools like Kubernetes, Docker, and cloud services (AWS, Azure, GCP). Experience with Kafka, Cassandra, and other infrastructure tools. Familiarity with middleware technologies, APIs, and microservices architecture.
Incident Management: Proficiency in managing incidents using tools like PagerDuty and xMatters, and in conducting effective post-mortems.
Monitoring and Analytics: Experience with monitoring tools such as Splunk, AppDynamics, Grafana, and Prometheus for proactive issue detection.
Automation: Skilled in using automation tools like Terraform and Ansible, and scripting languages (Python, Bash) to streamline workflows.
Capacity Planning: Familiarity with performance analysis and forecasting tools to ensure infrastructure scalability.
SLA/SLO Management: Defining and tracking reliability goals using SRE best practices and tools like ServiceNow.
Continuous Improvement: Ability to assess system reliability with tools like ELK Stack (Elasticsearch, Logstash, Kibana) and implement enhancements.