Direct-to-Chip Cooling

Chip-level heat extraction for ultra-dense AI and HPC, delivering higher performance, lower energy use, and consistent thermals at scale

Direct-to-Chip Cooling Fundamentals

Desert Dragon provides Direct-to-Chip (DTC) cooling as a managed service. Coolant is circulated through cold plates mounted directly on processors, removing heat at the source and unlocking densities and performance that air cooling cannot support. The offering is turnkey: integrated plant CDUs, approved fluids, leak detection, and precision thermal controls are built into the facility for rapid rack roll-out.

Operational integration & safeguards

  • Enables rack power densities well beyond air‑cooled limits
  • Facility‑level CDU capacity, redundant fluid loops, and leak containment
  • Qualified coolants, fluid lifecycle management, and handling SOPs
  • Preinstalled cold‑plate plumbing and OEM‑agnostic support for major platforms
  • Real‑time telemetry and alarmed safety systems
  • Lower silicon temperatures → higher sustained clocks and longer component life
  • Reduced energy per TFLOP and lower ongoing cooling costs


Our AI tool, FinBladeAI, continuously monitors workloads and thermal behavior, predicts demand shifts, and auto-tunes setpoints in a closed loop to maintain SLAs with minimum energy use. The result: stable training and inference throughput, lower energy per TFLOP, and reduced lifecycle OPEX through fewer thermal failures.
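The closed-loop idea can be pictured as a simple control step. This is a minimal sketch only: FinBladeAI's actual models are proprietary, and the field names, gain, and limits below are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class LoopState:
    plate_temp_c: float       # measured cold-plate temperature
    supply_setpoint_c: float  # current coolant supply setpoint

def tune_setpoint(state: LoopState, target_c: float,
                  kp: float = 0.4, min_sp: float = 18.0, max_sp: float = 32.0) -> float:
    """One proportional control step: nudge the coolant supply setpoint
    toward the value that holds the plate at target, clamped to guardrails."""
    error = state.plate_temp_c - target_c
    new_sp = state.supply_setpoint_c - kp * error  # hotter plate -> colder supply
    return max(min_sp, min(max_sp, new_sp))
```

A real controller would add integral action, rate limiting, and energy-aware setpoint relaxation when thermal headroom allows; the clamp illustrates why automated tuning stays within safe bounds.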

Performance & Efficiency Benefits

  • Higher sustained performance from lower silicon temperatures
  • Predictable throughput for training and inference
  • Extended hardware life and reduced failure rates

Capabilities & Features

Our DTC cooling ecosystem is purpose-built for modern AI and HPC hardware, enabling customers to deploy high-density, high-performance server racks rapidly, without custom cooling integration or additional engineering.

Density & Efficiency

Design-dependent CDU classes support 65–85 kW capacity, maintaining stable clocks under sustained load while reducing IT fan power consumption. This enables predictable rack-level heat removal, allowing high-power GPU nodes to operate efficiently and sustain extended AI training workloads.
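Rack-level heat removal can be sanity-checked with the standard steady-state heat balance Q = ṁ · cp · ΔT. The sketch below assumes a water-based coolant (cp ≈ 4.186 kJ/(kg·K)); actual fluids and design points vary per deployment.

```python
CP_WATER_KJ_PER_KG_K = 4.186  # specific heat of water, typical coolant basis

def required_flow_kg_s(heat_kw: float, delta_t_k: float) -> float:
    """Mass flow needed to carry `heat_kw` of IT load at a given loop
    temperature rise, from the heat balance Q = m_dot * cp * dT."""
    return heat_kw / (CP_WATER_KJ_PER_KG_K * delta_t_k)

# e.g. an 80 kW rack at a 10 K loop temperature rise needs roughly 1.9 kg/s
```

The same relation explains why tighter ΔT bands demand proportionally higher flow for the same heat load.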

Thermal Stability

Tight ΔT control at both cold-plate and loop levels helps extend component lifespan while maintaining consistent time-to-train performance. Our SLO-driven thermal envelopes are aligned with training throughput and inference latency targets to ensure reliable and predictable AI cluster performance.

Integrated Plant

Facility-integrated heat rejection options, including dry coolers, plate-and-frame exchangers, and campus loop integration, are engineered for redundancy and seasonal efficiency. Fluid health telemetry, leak detection, pressure and flow analytics, and automated alarm routing feed directly into NOC and SOC runbooks to ensure continuous monitoring and rapid operational response.

Operations & Reliability

Qualified installation and commissioning procedures are supported by a structured spare-parts strategy, incident response protocols, and periodic system validation. Concurrent maintenance practices ensure service continuity during component replacements, maintenance activities, and infrastructure upgrades.

AI-Managed Controls

Sensing and learning capabilities ingest loop metrics such as flow, pressure, and ΔT, along with rack sensors, server telemetry, and workload signals including queue depth and GPU utilization. The system continuously learns heat profiles for each topology, accounting for node type, rack composition, and workload patterns across time of day and seasonal conditions.

Adaptive Controls

The system automatically tunes pump speeds, valve positions, and CDU setpoints within defined guardrails to maintain thermals within SLO bands. It also schedules preventive actions such as filter changes and fluid checks, and recommends maintenance windows based on predicted operational risk.
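The "defined guardrails" amount to a hard clamp between the optimizer and the plant. A minimal sketch, with hypothetical actuator names and bands:

```python
# Hypothetical per-actuator (min, max) safety bands; real bands come
# from commissioning and OEM qualification, not from this example.
GUARDRAILS = {
    "pump_speed_pct": (30.0, 100.0),
    "valve_open_pct": (10.0, 100.0),
    "cdu_setpoint_c": (18.0, 32.0),
}

def apply_guardrails(commands: dict) -> dict:
    """Clamp each requested actuator command into its allowed band so
    automated tuning can never drive the plant outside safe limits."""
    clamped = {}
    for name, value in commands.items():
        lo, hi = GUARDRAILS[name]
        clamped[name] = min(max(value, lo), hi)
    return clamped
```

Higher-risk actions that would exceed a band are rejected here and escalated for human approval rather than silently applied.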

Post-Incident Response

The system correlates anomalies, such as micro-leaks or flow restrictions, with workload events and triggers guided operational runbooks for rapid response. It also captures lessons learned from incidents and continuously updates operational policies to reduce the likelihood of repeat events.
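One simple way such anomalies can be flagged, sketched here as an illustration rather than the production detector, is a baseline-deviation test on loop pressure: a sustained drop well below the recent baseline is a classic micro-leak signature.

```python
from statistics import mean, stdev

def pressure_anomaly(history_kpa: list[float], current_kpa: float,
                     z_threshold: float = 3.0) -> bool:
    """Flag a pressure drop when the current reading sits more than
    `z_threshold` standard deviations below the recent baseline."""
    mu, sigma = mean(history_kpa), stdev(history_kpa)
    if sigma == 0:
        return current_kpa < mu  # flat baseline: any drop is suspect
    return (mu - current_kpa) / sigma > z_threshold
```

A flagged reading would then be correlated with concurrent workload events before a runbook is triggered, to separate leaks from expected load-driven transients.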

Extended Service Scope and Optional Add-On Capabilities

Our service scope spans design, implementation, operations, and continuous optimization, including:

  • Design & Enablement: CDU sizing and hydraulic modeling, manifold layout, and cold-plate kit validation. Heat-rejection integration (dry coolers / plate-and-frame / campus loop) with defined redundancy targets

  • Implementation & Commissioning: Qualified procedures, flushing/charging, leak testing, and baseline capture. Acceptance testing with performance sign-off (ΔT, flow, and stability under defined load)

  • Run Operations (24/7): Real-time telemetry, alarms, trending, ticketing, and escalations via ITSM. FinBladeAI adaptive control and policy governance with human-in-the-loop approvals for higher-risk actions

  • Maintenance & Reliability: Preventive maintenance cadence, spares strategy, and concurrent service methods. Quarterly validation cycles (thermal drills, capacity headroom checks, and firmware interplay checks)

  • Reporting & Reviews: Executive dashboards with SLA/SLO attainment, MTTA/MTTR, ΔT stability, energy per rack, and capacity headroom. Monthly service reviews and a continuous-improvement backlog with prioritized remediation

Optional Add-ons:

  • AI Performance (Track): Adds model-aware steering (e.g., staging or pausing low-priority training jobs when thermal envelopes tighten) and “cost-per-experiment” analytics for client program managers

  • Mixed-Cooling Estate: Unified FinBladeAI control across DTC and immersion blocks, with shared reporting and common runbooks

  • Sovereign Analytics Pack: On-prem analytics warehousing (RBAC and audit trails), Arabic/English UI, and export-controlled data governance options

Service Level Objectives, Performance Targets, and Operational Transparency

The following examples illustrate typical Service Level Objectives (SLOs) and operational transparency measures applied to liquid cooling environments. Actual targets and reporting metrics are tailored to each client contract and infrastructure configuration.

  • Thermal SLO: ≥ 99.9% of intervals within the ΔT target band at the cold plate

  • Leak Response: P1 isolation ≤ 5 minutes; recovery to a safe state ≤ 30 minutes

  • Performance SLO: 95th percentile time-to-train variance ≤ the agreed threshold under steady load

  • Client-Visible Dashboards: ΔT and setpoint history, CDU utilization, energy per rack, incident timeline, and change log with approvals

  • Deployment Notes: Qualification of server SKUs and cold-plate kits is required prior to service adoption. Actual kW per rack or CDU depends on IT design, ambient design points, and heat-rejection selection. OEM warranty alignment for IT equipment remains under client policy; Desert Dragon supports approved kits and processes
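As an illustration of how the Thermal SLO above can be evaluated from dashboard data (a sketch; interval length and band values come from the client contract, and the sample figures are invented):

```python
def thermal_slo_attainment(delta_t_samples: list[float],
                           band: tuple[float, float]) -> float:
    """Fraction of measurement intervals whose cold-plate ΔT falls inside
    the target band; the Thermal SLO compares this fraction to 0.999."""
    lo, hi = band
    in_band = sum(1 for dt in delta_t_samples if lo <= dt <= hi)
    return in_band / len(delta_t_samples)

# attainment = thermal_slo_attainment(samples, (8.0, 12.0))
# SLO met when attainment >= 0.999 over the reporting period
```

The same interval-counting pattern applies to the other objectives, e.g. timing P1 isolation against the 5-minute target.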

Secure, Flexible Hyperscaling & Colocation Solutions

Get in touch with us to learn how our secure colocation environments and industry-leading interconnection services can support your growth and ensure operational continuity.