DevOps Engineer focused on Availability

The ERISAT team is growing! We are looking for a DevOps Engineer focused on Availability to join us. For more information, see this article.

Main mission

Ensure the development, implementation, and operation of monitoring and operational maintenance (MCO) tools for the on-premise and cloud infrastructures that support THALOS application services and the solutions sold to customers, with the primary objective of guaranteeing their availability, reliability, and predictability.
Contribute to improving and optimizing cloud-oriented application development in conjunction with the IT and Application Development teams.
Be responsible for monitoring critical environments and ensuring the proper functioning of incident management procedures.

Responsibilities

  • Participate in improving system monitoring tools and processes (VMware, DataCore, Google Cloud Platform), ensuring they remain relevant to service availability and reliability objectives.
  • Develop the tools needed to alert in the event of a problem, define thresholds and actionable indicators focused on service impact (SLO), and implement response, escalation, and information processes. Participate in weekend on-call duty to cover additional shifts.
  • Analyze incidents and malfunctions through a structured post-incident review process to improve responsiveness and prevention, drawing on a solid command of key production infrastructure tools. Propose improvements to monitoring and supervision procedures.
  • Contribute to the improvement of information systems by working on cloud projects (Google Cloud Platform) and network infrastructure, with the aim of strengthening resilience, robustness, and service continuity.
  • Participate in drafting documentation and disaster recovery and business continuity plans (PRA/PCA), and in defining service objectives (availability, recovery time) for production environments.
  • In conjunction with the CIO, act as the on-site point of contact for the implementation of cybersecurity policies applicable to the equipment and methods used at ERISAT, as well as for the administration of servers and workstations.
  • In the context of the company’s growth, be the point of reference for architecture and service reliability, provide expertise and escalation during major incidents as part of on-call duties, and contribute to the structuring and skill development of a team that is set to grow.

Required technical skills

  • Solid knowledge of systems administration (Windows, Linux) and networks (TCP/IP, VPN, firewall, switch, router).
  • Proficiency in supervision and monitoring tools (use and development of Nagios plugins in Perl or Python).
  • Good knowledge of virtualization environments (VMware, Google Compute Engine, Docker) and storage virtualization (DataCore).

Behavioral skills

  • Rigorous and organized.
  • Responsive and able to manage stress.
  • Good written and oral communication skills.
  • Able to work independently while following established procedures.
  • Team player with a service-oriented attitude.

Required profile

  • IT training (BTS SIO with SISR option, DUT Networks and Telecoms, professional bachelor’s degree, or equivalent).
  • Previous experience in supervision or support of production systems and networks.
  • Availability to work on call at weekends (fixed or rotating shifts).

Working conditions

  • On-site position (with the possibility of remote work depending on the organization).
  • Collaboration with the systems, networks, and security teams of other entities within the group.