Problem. A small warehouse receives totes at inbound docks, routes some of them across conveyor belts, packs orders at stations, and sends shipments to outbound docks. An AGV (automated guided vehicle) moves on a grid to pick up and drop off totes. Your goal is to learn a policy that maximizes throughput and on-time deliveries while minimizing congestion and travel time.
State. \(s=(x,y,c,\phi,t)\) — AGV at cell \((x,y)\), carrying flag \(c\in\{0,1\}\), phase \(\phi\in\{\text{toPick},\text{toPack},\text{toDock},\text{idle}\}\), time-to-deadline \(t\).
Actions. \(a\in\{\uparrow,\downarrow,\leftarrow,\rightarrow,\text{wait}\}\). Belts move carts automatically in their direction.
Reward. \(-0.04\) per step; \(+1\) for a pickup; \(+3\) for completing a pack; \(+10\) for an on-time delivery, or \(+5-0.1\cdot\text{lateness}\) if late; \(-1\) for driving into a wall or obstacle.
Update. \(Q(s,a)\leftarrow Q(s,a)+\alpha\big[r+\gamma\max_{a'} Q(s',a') - Q(s,a)\big]\).
Auxiliary Rewards. When enabled, reward shaping adds distance-based guidance, exploration bonuses, and urgency rewards to speed up learning.
Algorithm. Enhanced Q-learning with experience replay, Q-value normalization, adaptive ε-greedy exploration, and reduced congestion sensitivity.
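To make the update and exploration scheme concrete, here is a minimal tabular sketch. The `env.reset()`/`env.step(action)` interface, the constants (learning rate, replay size, decay rate), and the shaping potential are illustrative assumptions, not the demo's actual code:

```python
import random
from collections import defaultdict, deque

ALPHA, GAMMA = 0.1, 0.95              # illustrative learning rate / discount
ACTIONS = ["up", "down", "left", "right", "wait"]

Q = defaultdict(float)                 # Q[(state, action)] -> estimated value
replay = deque(maxlen=10_000)          # experience replay buffer (size assumed)
epsilon = 0.15                         # starting epsilon, matching the dashboard

def choose_action(state):
    """Epsilon-greedy over the five grid actions."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def shaped_reward(r, s, s_next, potential):
    """Potential-based shaping, one common form of distance-based guidance:
    r' = r + gamma * phi(s') - phi(s). The potential function (e.g. negative
    Manhattan distance to the current goal) is an assumption."""
    return r + GAMMA * potential(s_next) - potential(s)

def td_update(s, a, r, s_next, done):
    """One tabular Q-learning step: Q(s,a) += alpha * (TD target - Q(s,a))."""
    target = r if done else r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def train_episode(env):
    """Run one episode against an assumed env.reset()/env.step() interface."""
    global epsilon
    s, done = env.reset(), False
    while not done:
        a = choose_action(s)
        s_next, r, done = env.step(a)          # assumed to return (s', r, done)
        replay.append((s, a, r, s_next, done))
        td_update(s, a, r, s_next, done)
        # Replay a few stored transitions to reuse past experience.
        for t in random.sample(list(replay), min(8, len(replay))):
            td_update(*t)
        s = s_next
    epsilon = max(0.01, epsilon * 0.995)       # simple adaptive decay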
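```

Potential-based shaping of this form is known not to change the optimal policy (Ng, Harada, and Russell, 1999), which makes it a safe default when the auxiliary rewards are toggled on.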
Stats & Plots. The interactive dashboard reports delivered totes, on-time percentage, and overall efficiency; an order-flow distribution chart (Express, Standard, Bulk); an area chart of average reward per step over episodes alongside a heatmap of congested areas; and live training diagnostics (congestion events, Q-value range, and the adaptive ε, starting at 0.15). Scenario buttons load different warehouse configurations, and a toggle switches the auxiliary shaping rewards on or off.