Monitoring manager (Support)

Patrianna LTD
1 week ago
November 12, 2024
Kyiv
Remote work
Full-time
Shift work

Dive into the pulse of cutting-edge solutions with Patrianna LTD! 

Are you ready to dive into the dynamic world of social gaming and be part of a rapidly expanding team? We’re on the lookout for a talented Monitoring Specialist (support) to join our Patrianna LTD team on a full-time basis.

What You Gain

Dynamic Environment: Step into the heart of a super fast-growing social gaming company, where innovation and creativity thrive.
Global Impact: Be at the forefront of crafting a global social entertainment platform, with a primary focus on captivating the North American market.
Limitless Growth: Take your career to new heights with opportunities for advancement and personal development. Join us in the exhilarating journey of continuous growth.
Massive Reach: Contribute to the development of client web and mobile apps that engage with up to 150 million customers worldwide.
Commitment to Excellence: We’re dedicated to delivering high-quality code, ensuring predictable behavior in production, seamless scaling, and automation every step of the way.

We are looking for a skilled Monitoring Specialist to join our 24/7 SRE team. The ideal candidate will work non-business hours aligned with European time to ensure seamless operations and system reliability. This role focuses on monitoring and diagnostics across a multi-site production environment, primarily for Java-based applications on Google Cloud Platform (GCP). Leveraging modern monitoring tools, the SRE will proactively identify, analyze, and resolve issues, maintaining high service performance and reliability.

Key Responsibilities:

  • Production Monitoring & Alerting
    • Oversee multi-site production environments using tools like Prometheus, Grafana, and Sentry to monitor application performance, database health, and event streams.
    • Continuously monitor performance metrics, setting up alerts to identify potential issues before they impact system availability.
  • Log Analysis & Diagnostics
    • Analyze logs across applications, databases, and event streaming services (Kafka) to detect irregularities and gain insights into root causes.
    • Use tools like ELK and GCP-native monitoring solutions to maintain visibility and optimize system behavior.
  • Database & Event Stream Monitoring
    • Monitor and tune performance for databases like PostgreSQL/AlloyDB and Spanner, focusing on query optimization, performance metrics, and troubleshooting.
    • Manage and monitor Kafka clusters, including consumer lag tracking and data pipeline health, to ensure continuous data processing.
  • Error Tracking & Troubleshooting
    • Use Sentry and similar tools to track, document, and resolve errors, escalating issues to the engineering team when necessary.
    • Follow troubleshooting protocols and assist in root cause analysis to resolve incidents in a structured and efficient manner.
  • Network & Security Insights
    • Use Cloudflare tools to monitor network performance and uphold security standards, with an emphasis on DDoS protection and latency optimization.
    • Work closely with the Engineering and DevOps teams to develop proactive monitoring and performance strategies.

Required Skills & Qualifications:

  • Cloud Platform Expertise: Advanced knowledge of the Google Cloud Platform and associated services.
  • Monitoring & APM Tools: Proficient with Prometheus, Grafana, Sentry, and ELK, plus familiarity with Kubernetes (K8s) and GCP-native monitoring solutions.
  • Database Systems: Strong knowledge of PostgreSQL/AlloyDB and Spanner, especially for performance tuning, query optimization, and diagnostics.
  • Event Streaming: Hands-on experience with Kafka, including the ability to monitor Kafka clusters, track consumer lag, and manage data pipeline reliability.
  • Networking & Security: Familiarity with Cloudflare, DDoS protection strategies, and network performance monitoring.
  • Problem-Solving Skills: Excellent analytical skills to troubleshoot complex, multi-layered cloud systems, perform root cause analysis, and address issues in a dynamic environment.

Nice-to-Have Skills:

  • Scripting: Experience with Python or Bash for automation and scripting tasks.

Schedule Requirements:

  • This role operates during non-business hours aligned with European time to provide continuous coverage and support for our production environments.

Вікторія
