This paper surveys research on reinforcement learning (RL)-enhanced large language models (LLMs). We systematically review the literature, covering:

  • the basics of RL
  • popular RL-enhanced LLMs
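  • two reward-model-based RL techniques: reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF)
  • direct preference optimization (DPO), which bypasses the reward model to align LLM outputs directly with human expectations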