This paper focuses on the intersection of Multi-agent Deep Reinforcement Learning (Deep RL), Inverse RL and Dynamic Structural I/O. The last of these, Dynamic Structural I/O, is a method applied by economists that typically assumes the agents already know the optimal policy. The econometrician observes the agents' actions and infers their reward (utility) functions. The reward for each agent is typically presumed to be a linear combination of (current state, action, next state) features, which allows unique identification of the coefficients. Further, in multi-agent environments the space of policies grows exponentially with the number of agents, so economists search in a much smaller subspace of policies that satisfy game-theoretic concepts such as Nash equilibrium and sequential rationality. Ignoring the identification and game-theoretic constraints, this objective is similar to Inverse RL’s intermediate objective of inferring the expert’s reward function. Given Inverse RL’s limited ability to deal with nonlinear reward functions in high-dimensional state spaces, we limit our focus to Deep RL in this paper.
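
To make the linear reward assumption concrete, one way to write it is sketched below. The notation is ours and purely illustrative: $r_i$ is the reward of agent $i$, $\phi$ is an assumed $K$-dimensional feature map over (current state, action, next state), and $\theta_i$ is the coefficient vector the econometrician seeks to identify.

```latex
r_i(s, a, s') \;=\; \theta_i^{\top} \phi(s, a, s')
            \;=\; \sum_{k=1}^{K} \theta_{i,k}\, \phi_k(s, a, s'),
\qquad \theta_i \in \mathbb{R}^{K}.
```

Under this form, inferring the reward function reduces to estimating the finite vector $\theta_i$ for each agent.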

To make such environments tractable, economists typically make two significant assumptions: (1) The state space is small. This assumption is often reasonable: even in markets with many consumer segments (i.e. complex demand dynamics) and many firms (i.e. complex competitor dynamics), the agent, a human decision maker, intuitively chunks together similar states to arrive at a handful of tractable clustered states. (2) Agents are perfectly aware of the entire state space, the state transition function and the reward functions. Deep RL does not rely on either of these assumptions, so it potentially offers a solution to a wider set of RCC games. Specifically, we want to consider environments with a high-dimensional state space where agents/experts cannot reasonably be expected to know the optimal policy upfront; Deep RL could explain the learning dynamics in such settings. However, for such a methodology to be realistic, we must incorporate into the RL policy/model learning process the basic concepts that guide a rational decision maker (sequential rationality, Nash equilibrium and Pareto optimality), as suggested by the economic game-theoretic literature. In this paper we attempt to combine Deep RL with game-theoretic principles (Deep Nash Q Learning), applied to a standard competitive market, in order (a) to compare the converged policies against closed-form solutions for rational agents in simplified environments, and (b) to characterize the learning dynamics and converged policies in more sophisticated environments. The natural next step would be to extend this to an inverse Deep Nash RL; we leave that to be tackled beyond this paper.
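
To illustrate how an equilibrium concept can enter a Q-learning update, the following is a minimal tabular sketch for two agents; it is not the Deep Nash Q Learning method developed in this paper. Several simplifications are assumptions made only for illustration: the Q-functions are tables rather than neural networks, only pure-strategy equilibria of the next-state stage game are searched, the first equilibrium found is selected, and a maximin value is used as a fallback when no pure equilibrium exists. The function names `pure_nash_equilibria` and `nash_q_update` are hypothetical.

```python
import numpy as np


def pure_nash_equilibria(q1, q2):
    """Return all pure-strategy Nash equilibria of a two-player stage game.

    q1[a1, a2] and q2[a1, a2] are the payoffs of agents 1 and 2 when
    agent 1 plays a1 (row) and agent 2 plays a2 (column).
    """
    # Agent 1 best-responds within each column, agent 2 within each row.
    best1 = q1 >= q1.max(axis=0, keepdims=True)
    best2 = q2 >= q2.max(axis=1, keepdims=True)
    rows, cols = np.nonzero(best1 & best2)
    return [(int(a1), int(a2)) for a1, a2 in zip(rows, cols)]


def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha=0.1, gamma=0.95):
    """Apply one Nash-Q-style update in place.

    Q1 and Q2 have shape (n_states, n_actions_1, n_actions_2).
    """
    eqs = pure_nash_equilibria(Q1[s_next], Q2[s_next])
    if eqs:
        # Assumption: arbitrary selection rule, take the first equilibrium.
        e1, e2 = eqs[0]
        v1, v2 = Q1[s_next, e1, e2], Q2[s_next, e1, e2]
    else:
        # Assumption: fall back to each agent's pure maximin value instead
        # of solving for a mixed-strategy equilibrium.
        v1 = Q1[s_next].min(axis=1).max()
        v2 = Q2[s_next].min(axis=0).max()
    # Standard temporal-difference step toward reward plus discounted
    # equilibrium value of the next state.
    Q1[s, a1, a2] += alpha * (r1 + gamma * v1 - Q1[s, a1, a2])
    Q2[s, a1, a2] += alpha * (r2 + gamma * v2 - Q2[s, a1, a2])
    return v1, v2
```

The only difference from single-agent Q-learning is the bootstrap target: the maximum over the agent's own actions is replaced by that agent's payoff at an equilibrium of the stage game defined by both agents' Q-values in the next state.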