
Unified IT Operations Platform — Monitoring, Logging & Inventory at Scale

Infrastructure · Monitoring · Docker · Python · NetOps

A consolidated operations platform — network monitoring, centralised logging, CMDB/IPAM and SIEM across five distributed enterprise sites, replacing fragmented tooling with a coherent, code-driven stack.

[Image: Network operations centre with monitoring dashboards on multiple screens]

Challenge

A manufacturing company with 600+ managed devices across five geographically distributed sites ran fragmented monitoring tools, had no centralised log collection, and maintained its CMDB by hand. Network devices, servers, VMs, printers, CCTV and VoIP were each managed in their own silo, which left alerting blind spots and no authoritative view of the network. Monitoring agents at remote sites could not reach the operations server over standard routing, and printer alert noise had rendered the team’s notification channel nearly useless. The platform had to be deployable and maintainable without a dedicated ops team and with no cloud dependency.

Solution

  • Deployed 8 production services — Zabbix, LibreNMS, Graylog, Grafana and NetBox, plus a SIEM/EDR layer, a reverse proxy and container management — as isolated Docker Compose stacks, each independently upgradable and recoverable with its own blast radius.
  • Made NetBox the single source of truth for the device and address inventory across all sites.
  • Automated inventory with Python scripts keeping NetBox in sync from several independent sources — the asset-management system, Zabbix, LibreNMS (including LLDP topology) and the firewall — removing manual upkeep.
  • Broad monitoring coverage: Zabbix across network, servers, printers, CCTV and VoIP (SIP) with Telegram alerting; LibreNMS for SNMP auto-discovery; a dedicated SIEM/EDR layer for endpoint security events.
  • Centralised logging & dashboards: Graylog (OpenSearch backend) aggregating firewall, NAS and PBX logs — processing 1M+ events/hour — with Grafana as a unified layer of 14 code-generated dashboards over Zabbix, Graylog and LibreNMS.
  • Network & security design: monitoring and management traffic isolated on a dedicated segment; policy routing so all five sites reach the server over symmetric return paths; every service UI restricted to internal networks only, fronted by a reverse proxy with per-service DNS.
  • One-day operational rollout: unified agent-installer scripts (PowerShell + Bash, covering Windows, Ubuntu, Debian and RHEL), served from a self-hosted server, deployed 18 remote agents across all sites in a single day. Printer alert flooding (~206 messages/month) was replaced by a smart suppression rule and a once-daily digest.
  • Documented for handoff: a version-controlled corpus of 13 operational documents, including README, architecture, runbook, disaster recovery, decisions, changelog, roadmap and audit.
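
To illustrate the inventory automation listed above, here is a minimal sketch of how records from several sources might be merged and compared against the CMDB before anything is written. It is not the project's actual script: the source priority order, the field names (`name`, `site`, `serial`, `primary_ip`) and both helper functions are assumptions made for the example, and the step that applies the plan to NetBox is left out.

```python
# Hypothetical source priority: earlier entries win when sources disagree.
SOURCE_PRIORITY = ["asset_mgmt", "zabbix", "librenms", "firewall"]


def merge_sources(records_by_source):
    """Merge per-source device records into one record per lower-cased name."""
    merged = {}
    # Walk from lowest to highest priority so higher-priority values overwrite.
    for source in reversed(SOURCE_PRIORITY):
        for rec in records_by_source.get(source, []):
            key = rec["name"].lower()
            target = merged.setdefault(key, {})
            # Skip empty values so a blank field never erases real data.
            target.update({k: v for k, v in rec.items() if v not in (None, "")})
            target["name"] = key
    return merged


def plan_changes(desired, current):
    """Return (to_create, to_update) by diffing merged records against the CMDB."""
    to_create, to_update = [], []
    for name, rec in sorted(desired.items()):
        existing = current.get(name)
        if existing is None:
            to_create.append(rec)
            continue
        changed = {k: v for k, v in rec.items() if existing.get(k) != v}
        if changed:
            to_update.append({"name": name, **changed})
    return to_create, to_update


# Illustrative data only; hostnames and addresses are made up.
sources = {
    "asset_mgmt": [{"name": "SW-01", "site": "HQ", "serial": "ABC123"}],
    "librenms": [
        {"name": "sw-01", "site": "", "primary_ip": "10.0.0.2"},
        {"name": "ap-07", "primary_ip": "10.0.3.7"},
    ],
}
cmdb = {
    "sw-01": {"name": "sw-01", "site": "HQ", "serial": "ABC123",
              "primary_ip": "10.0.0.9"},
}

to_create, to_update = plan_changes(merge_sources(sources), cmdb)
print("create:", to_create)
print("update:", to_update)
```

Computing a plan first and applying it afterwards keeps the sync idempotent: a second run against an up-to-date CMDB produces two empty lists, which also makes a dry-run mode straightforward.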

Result

The platform reached production in under two weeks and runs continuously, with each service isolated in its own failure domain. A single live CMDB, kept current automatically, tracks servers, video servers, workstations and network phones. Printer alert noise dropped from ~206 unsolicited notifications per month to zero, replaced by one actionable daily digest, and all 18 monitoring and SIEM agents were rolled out in a single day. The whole platform, from deployment and operations to disaster recovery, is documented in runbooks for lean, low-headcount operation.
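
The suppression-plus-digest idea behind the printer alert result can be sketched as follows. This is an assumption-laden illustration rather than the rule used in production: the alert fields (`host`, `group`, `problem`, `time`), the `DIGEST_ONLY` set and both functions are hypothetical, and delivery to the notification channel is omitted.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical low-urgency printer conditions that are held back for the digest.
DIGEST_ONLY = {"toner_low", "paper_low", "cover_open"}


def route_alert(alert):
    """Return 'digest' for low-urgency printer alerts, 'immediate' otherwise."""
    if alert["group"] == "printers" and alert["problem"] in DIGEST_ONLY:
        return "digest"
    return "immediate"


def build_digest(alerts, day):
    """Summarise one day's held-back alerts as a single message, or None."""
    held = [a for a in alerts
            if route_alert(a) == "digest" and a["time"].date() == day]
    if not held:
        return None
    counts = Counter((a["host"], a["problem"]) for a in held)
    lines = [f"Printer digest for {day.isoformat()}: {len(held)} event(s)"]
    for (host, problem), n in sorted(counts.items()):
        lines.append(f"- {host}: {problem} x{n}")
    return "\n".join(lines)


# Illustrative data only; hostnames and timestamps are made up.
alerts = [
    {"host": "prn-hq-01", "group": "printers", "problem": "toner_low",
     "time": datetime(2024, 5, 6, 8, 15)},
    {"host": "prn-hq-01", "group": "printers", "problem": "toner_low",
     "time": datetime(2024, 5, 6, 13, 40)},
    {"host": "prn-s2-03", "group": "printers", "problem": "paper_low",
     "time": datetime(2024, 5, 6, 9, 5)},
    {"host": "core-sw-01", "group": "network", "problem": "link_down",
     "time": datetime(2024, 5, 6, 9, 30)},
]

digest = build_digest(alerts, date(2024, 5, 6))
print(digest)
```

In this sketch the network alert is still routed immediately, while the three printer events collapse into one message with per-host counts.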