QDHF

Quality Diversity through Human Feedback
NeurIPS 2023: ALOE Workshop (Spotlight)

Li Ding (UMass Amherst)
Jenny Zhang (Univ. of British Columbia; Vector Institute)
Jeff Clune (Univ. of British Columbia; Vector Institute; Canada CIFAR AI Chair; Google DeepMind)
Lee Spector (Amherst College; UMass Amherst)
Joel Lehman (Stochastic Labs; past: CarperAI at Stability AI)

QDHF (right) improves the diversity in text-to-image generation results compared to best-of-N (left) using Stable Diffusion.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has shown potential in qualitative tasks where clear objectives are lacking. However, its effectiveness is not fully realized when it is conceptualized merely as a tool to optimize average human preferences, especially in generative tasks that demand diverse model responses. Meanwhile, Quality Diversity (QD) algorithms excel at identifying diverse and high-quality solutions but often rely on manually crafted diversity metrics. This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach integrating human feedback into the QD framework. QDHF infers diversity metrics from human judgments of similarity among solutions, thereby enhancing the applicability and effectiveness of QD algorithms. Our empirical studies show that QDHF significantly outperforms state-of-the-art methods in automatic diversity discovery and matches the efficacy of using manually crafted metrics for QD on standard benchmarks in robotics and reinforcement learning. Notably, in a latent space illumination task, QDHF substantially enhances the diversity in images generated by a diffusion model and was more favorably received in user studies. We conclude by analyzing QDHF’s scalability and the quality of its derived diversity metrics, emphasizing its potential to improve exploration and diversity in complex, open-ended optimization tasks. Source code is available on GitHub.

Method

The main idea is to derive distinct representations of what humans find interestingly different, and incorporate this procedure in QD algorithms.

Diversity Characterization: A latent projection maps solutions to diversity representations, and the dimensions of that latent space serve as the diversity measures.
Alignment: We use contrastive learning to align the diversity representations to human intuition.
Progressive Optimization: As novel solutions are found, more human feedback is collected to refine the diversity representation.
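
The Diversity Characterization and Alignment steps above can be sketched as a linear latent projection trained with a triplet loss. This is a minimal illustration, not the authors' implementation: the linear map `W`, the squared Euclidean distance, the margin value, and all function names are assumptions made for the example. Each training triplet encodes one human judgment of the form "the anchor is more similar to `positive` than to `negative`".

```python
def project(W, x):
    """Latent projection z = W x; the entries of z act as diversity measures."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sq_dist(a, b):
    """Squared Euclidean distance between two latent vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def triplet_loss(W, anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: zero once the positive is closer than the
    negative by at least `margin` in the latent space."""
    za, zp, zn = project(W, anchor), project(W, positive), project(W, negative)
    return max(0.0, margin + sq_dist(za, zp) - sq_dist(za, zn))


def triplet_grad(W, anchor, positive, negative, margin=1.0):
    """Gradient of the triplet loss with respect to W.

    For an active triplet the loss is margin + |W dp|^2 - |W dn|^2 with
    dp = anchor - positive and dn = anchor - negative, so the gradient is
    2 * (W dp dp^T - W dn dn^T). It is zero when the hinge is inactive.
    """
    rows, cols = len(W), len(W[0])
    if triplet_loss(W, anchor, positive, negative, margin) == 0.0:
        return [[0.0] * cols for _ in range(rows)]
    dp = [a - p for a, p in zip(anchor, positive)]
    dn = [a - n for a, n in zip(anchor, negative)]
    w_dp, w_dn = project(W, dp), project(W, dn)
    return [[2.0 * (w_dp[i] * dp[j] - w_dn[i] * dn[j]) for j in range(cols)]
            for i in range(rows)]


def sgd_step(W, anchor, positive, negative, lr=0.01, margin=1.0):
    """One gradient-descent update of W on a single human-labeled triplet."""
    g = triplet_grad(W, anchor, positive, negative, margin)
    return [[W[i][j] - lr * g[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```

Under this sketch, Progressive Optimization would amount to repeating the update: as the search finds new solutions, new triplets are labeled and further `sgd_step` calls refine `W`, after which the solutions are re-projected onto the updated measures.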

Robotic/RL Search Tasks

Maze Navigation

Robotic Arm


Each point on the heatmap is a solution with its objective value visualized in color. QDHF fills up the archives with more solutions than AURORA, and closely matches the search performance of QD using the ground truth diversity metrics.
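
To make the archive terminology concrete: in QD methods of the MAP-Elites family, an archive partitions the space of diversity measures into cells and keeps only the best solution found in each cell. The sketch below is a minimal grid archive under that assumption. The class and method names (`GridArchive`, `coverage`, `qd_score`) are hypothetical and are not taken from the code used for the experiments.

```python
class GridArchive:
    """Minimal MAP-Elites-style archive (hypothetical helper for illustration)."""

    def __init__(self, bins, lower, upper):
        self.bins = tuple(bins)    # number of cells along each measure
        self.lower = tuple(lower)  # lower bound of each measure
        self.upper = tuple(upper)  # upper bound of each measure
        self.cells = {}            # cell index -> (objective, solution)

    def cell_of(self, measures):
        """Map a vector of diversity measures to a grid cell index."""
        index = []
        for m, lo, hi, n in zip(measures, self.lower, self.upper, self.bins):
            i = int((m - lo) / (hi - lo) * n)
            index.append(min(max(i, 0), n - 1))  # clamp out-of-range values
        return tuple(index)

    def add(self, solution, objective, measures):
        """Insert if the cell is empty or the new objective is higher."""
        key = self.cell_of(measures)
        best = self.cells.get(key)
        if best is None or objective > best[0]:
            self.cells[key] = (objective, solution)
            return True
        return False

    def coverage(self):
        """Fraction of cells that hold a solution."""
        total = 1
        for n in self.bins:
            total *= n
        return len(self.cells) / total

    def qd_score(self):
        """Sum of objective values over all filled cells."""
        return sum(objective for objective, _ in self.cells.values())


archive = GridArchive(bins=(2, 2), lower=(0.0, 0.0), upper=(1.0, 1.0))
archive.add("a", 1.0, (0.1, 0.1))  # fills cell (0, 0)
archive.add("b", 0.5, (0.2, 0.2))  # same cell, lower objective: rejected
archive.add("c", 2.0, (0.9, 0.1))  # fills cell (1, 0)
```

In these terms, a heatmap with more filled points corresponds to higher coverage, and the color of each point is the objective stored in that cell.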

Notably, in the maze navigation task, both AURORA and QDHF learn a rotated version of the maze as their diversity measures (first column), but QDHF learns the scale of the maze more accurately, especially in the under-explored area.

Latent Space Illumination

a photo of an astronaut riding a horse on mars

an image of a bear in a national park

an image of a cat on the sofa

an image of a person playing guitar

an image of a dog in the park

an image of urban downtown


QDHF substantially increases the variation in images generated by a diffusion model. The results show visible trends of diversity and were more favorably received in user studies.
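
As a rough illustration of what illuminating a latent space means, the toy loop below runs a MAP-Elites-style search over latent vectors. Everything here is a stand-in chosen for the example: `toy_objective` replaces whatever scores how well a generated image matches the prompt, `toy_measures` replaces the learned diversity projection, and the grid size, mutation scale, and 90% mutation rate are arbitrary assumptions. No diffusion model is involved.

```python
import math
import random


def toy_objective(z):
    """Stand-in quality score: higher when the latent is near the origin."""
    return -sum(v * v for v in z)


def toy_measures(z):
    """Stand-in diversity measures in [-1, 1], from the first two coordinates."""
    return (math.tanh(z[0]), math.tanh(z[1]))


def illuminate(objective, measures, dim, bins=5, iterations=500,
               sigma=0.3, seed=0):
    """Fill a grid archive by mutating elite latents (MAP-Elites-style sketch)."""
    rng = random.Random(seed)
    archive = {}  # cell index -> (objective value, latent vector)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Mutate a randomly chosen elite with Gaussian noise.
            _, parent = rng.choice(list(archive.values()))
            z = [v + rng.gauss(0.0, sigma) for v in parent]
        else:
            # Otherwise sample a fresh latent from a standard normal.
            z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        f = objective(z)
        # Discretize each measure from [-1, 1] into `bins` cells.
        cell = tuple(min(bins - 1, max(0, int((m + 1.0) / 2.0 * bins)))
                     for m in measures(z))
        # Keep the latent only if its cell is empty or it scores higher.
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, z)
    return archive


elites = illuminate(toy_objective, toy_measures, dim=4)
print(f"filled {len(elites)} of 25 cells")
```

Each filled cell holds the highest-scoring latent found for one region of the measure space, so decoding the elites would give a set of outputs that are both high-scoring and spread across the measures.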

Citation

@misc{ding2023quality,
    title={Quality Diversity through Human Feedback}, 
    author={Li Ding and Jenny Zhang and Jeff Clune and Lee Spector and Joel Lehman},
    year={2023},
    eprint={2310.12103},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}

Acknowledgements

Li Ding and Lee Spector were supported by the National Science Foundation under Grant No. 2117377. Jeff Clune and Jenny Zhang were supported by the Vector Institute, a grant from Schmidt Futures, an NSERC Discovery Grant, and a generous donation from Rafael Cosman. The authors would like to thank members of the PUSH lab at Amherst College, University of Massachusetts Amherst, and Hampshire College for helpful discussions, and anonymous reviewers for their thoughtful feedback. This work was performed in part using the high performance computing equipment from Collaborative R&D Fund managed by Massachusetts Technology Collaborative.

The website template was borrowed from Jon Barron.
