Job Description:
Intangles Lab is looking for a hands-on Senior Site Reliability Engineer to manage large 24×7 Cloud Operations.
We are looking for a Site Reliability Engineer with 2+ years of hands-on experience in the following technologies and skills:
Required Skills:
1. AWS Cloud (Advanced): Certification is preferred.
2. Networking (Intermediate): Proficiency in networking concepts is necessary.
3. Ubuntu/Linux & OS (Advanced): Strong Linux and networking basics; prior working experience is preferred.
4. Database (Basic Knowledge):
a. Familiarity with SQL and NoSQL databases is required, having worked with at least one of them.
b. Database administration (MongoDB, PostgreSQL, Elasticsearch): hands-on experience with at least one is required.
5. Containerization Tools: Docker
6. Kubernetes (Advanced):
a. Knowledge of Amazon EKS is compulsory.
b. Working knowledge of StatefulSets is required.
c. Familiarity with Helm charts is necessary.
7. CI/CD (Advanced):
Proficiency in at least one CI/CD tool, such as CircleCI, Argo Project, GitHub Actions, or similar, is essential.
8. Programming:
a. Basic programming knowledge is required, with the ability to write code.
b. Scripting Language: Python, Shell
9. Monitoring Stack:
- Prometheus, Grafana, Alertmanager, Istio, Jaeger, Datadog, PagerDuty, Elastic APM (or similar).
Optional Skills:
- Intermediate to advanced application development experience in languages such as JavaScript, Python, and Java is a bonus.
- Understanding of N-tier Architectures
- Understanding of REST & gRPC API Frameworks
- Understanding of web servers in Node.js
Responsibilities:
- To work in a production environment with technologies such as Linux, AWS, Terraform, and Kubernetes, including administration of MongoDB, Elasticsearch, and PostgreSQL.
- To keep the production environment up and running and ensure its reliability.
- To troubleshoot, debug, and fix failures in the production and QA environments and provide technical solutions.
- To own on-call responsibilities as per the team's policy.
- To write and enhance automations as and when needed.
- To work closely with internal teams and customers to follow processes and meet uptime SLAs.
- To write, update, and enhance documentation, including runbooks and playbooks, and to prepare postmortem reports for production incidents.
- To be ready to work in a 24×7 environment when required, since the role is to ensure the platform's reliability.
Additional Requirements:
- Awareness of change, incident, problem, issue, and risk management, as well as escalation procedures.
- Flexibility to work rotational shifts and night hours (including weekends).
- Excellent analytical and problem-solving skills.