
33 - RLHF Problems with Scott Emmons

1:41:24
 
Content provided by Daniel Filan. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Daniel Filan or their podcast platform partner. If you believe someone is using your copyrighted work without permission, you can follow the process outlined here: https://id.player.fm/legal.

Reinforcement Learning from Human Feedback, or RLHF, is one of the main techniques that developers of large language models use to make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting.
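For context on the setting the episode examines, here is a minimal, illustrative sketch of the reward-learning step inside RLHF: a reward model is fit to human pairwise preferences with the standard Bradley-Terry loss, and the learned reward is what later policy optimization targets. This is not code from the episode or from the paper discussed; the model, feature dimensions, and data below are placeholder assumptions.

```python
# Illustrative sketch only: reward-model fitting from pairwise human preferences
# (the Bradley-Terry loss commonly used in RLHF). All names and shapes are assumptions.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a (prompt + response) feature vector to a scalar reward estimate."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def preference_loss(rm: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: P(preferred beats rejected) = sigmoid(r_preferred - r_rejected).
    # Minimizing the negative log of that probability fits the reward to human choices.
    return -torch.nn.functional.logsigmoid(rm(preferred) - rm(rejected)).mean()


# Toy usage: random features stand in for embeddings of two candidate responses.
rm = RewardModel(dim=16)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
preferred, rejected = torch.randn(32, 16), torch.randn(32, 16)

opt.zero_grad()
loss = preference_loss(rm, preferred, rejected)
loss.backward()
opt.step()
```

The problems discussed in the episode (deceptive inflation, overjustification, bounded human rationality) concern what goes wrong when the human comparisons feeding this step are based on partial or mistaken observations of what the AI actually did.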

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html

Topics we discuss, and timestamps:

0:00:33 - Deceptive inflation

0:17:56 - Overjustification

0:32:48 - Bounded human rationality

0:50:46 - Avoiding these problems

1:14:13 - Dimensional analysis

1:23:32 - RLHF problems, in theory and practice

1:31:29 - Scott's research program

1:39:42 - Following Scott's research

Scott's website: https://www.scottemmons.com

Scott's X/twitter account: https://x.com/emmons_scott

When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747

Other works we discuss:

AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752

Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394

Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475

The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693

Episode art by Hamish Doodles: hamishdoodles.com
