This paper surveys research on reinforcement learning (RL)-enhanced large language models (LLMs). We provide a systematic review of the literature, covering:
- the basics of RL
- popular RL-enhanced LLMs
- two reward model-based RL techniques: reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF)
- direct preference optimization (DPO), which bypasses the reward model to align LLM outputs directly with human preferences (see the sketch after this list)
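
As a minimal sketch of the last point, DPO's standard objective optimizes the policy directly on preference pairs rather than through a learned reward model; the notation below (policy $\pi_\theta$, frozen reference $\pi_{\mathrm{ref}}$, preferred/dispreferred responses $y_w$/$y_l$, temperature $\beta$) follows the common formulation and is illustrative, not specific to any one surveyed system:

```latex
% DPO objective: train \pi_\theta on preference pairs (x, y_w, y_l),
% where y_w is preferred over y_l, against a frozen reference policy
% \pi_{\mathrm{ref}}; \beta controls how far the policy may drift from it.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Intuitively, the difference of log-probability ratios plays the role of an implicit reward margin, which is why no separately trained reward model is needed.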