Site Reliability Engineer

Job Description:

Intangles Lab is looking for a hands-on Senior Site Reliability Engineer to manage large-scale 24×7 cloud operations.

We are looking for a Site Reliability Engineer with 2+ years of hands-on experience in the following technologies and skills:

Required Skills:

1. AWS Cloud (Advanced): Certification is preferred.

2. Networking (Intermediate): Proficiency in networking concepts is necessary.

3. Ubuntu/Linux & OS (Advanced): Strong Linux and networking basics; prior working experience is preferred.

4. Database (Basic Knowledge):

a. Familiarity with SQL and NoSQL databases is required, including working experience with at least one of them.

b. Database Administration (MongoDB, PostgreSQL, Elasticsearch): hands-on experience with at least one is required.

5. Containerization Tools: Docker

6. Kubernetes (Advanced)

a. Knowledge of Amazon EKS is compulsory.

b. Working knowledge of StatefulSets is required.

c. Familiarity with Helm charts is necessary.

CI/CD (Advanced):

Proficiency in at least one CI/CD tool, such as CircleCI, Argo Project, GitHub Actions, or similar, is essential.

Programming:

a. Basic programming knowledge is required, with the ability to write code.

b. Scripting languages: Python, Shell

Monitoring Stack:

Prometheus, Grafana, Alertmanager, Istio, Jaeger, Datadog, PagerDuty, Elastic APM (or similar).

Optional Skills:

Intermediate to advanced application development experience in languages such as JavaScript, Python, and Java is a bonus.
Understanding of N-tier architectures
Understanding of REST and gRPC API frameworks
Understanding of web servers in Node.js

Responsibilities:

To work in a production environment with technologies such as Linux, AWS, Terraform, and Kubernetes, including MongoDB, Elasticsearch, and PostgreSQL administration.
To keep the production environment up and running and ensure its reliability.
To troubleshoot, debug, and fix failures in the production and QA environments and provide technical solutions.
To own on-call responsibilities as per the team’s policy.
To write and enhance automation as needed.
To work closely with internal teams and customers to follow the agreed processes and meet uptime SLAs.
To write, update, and enhance documentation, including runbooks/playbooks, and prepare postmortem reports for production incidents.
To be ready to work in a 24×7 environment when required, since the role is to ensure the platform’s reliability.

Additional Requirements:

Awareness of change, incident, problem, issue, and risk management, as well as escalation processes.
Flexibility to work rotational shifts and night hours, including weekends.
Excellent thinking and problem-solving skills.
