18 Issues in Current Deep Reinforcement Learning from ZhiHu

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwinska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., Puigdomènech Badia, A., Hermann, K. M., Zwols, Y., Ostrovski, G., Cain, A., King, H., Summerfield, C., Blunsom, P., Kavukcuoglu, K., and Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476.

Tips: Before reading this article, you should be familiar with three pieces of background: DQN, Double DQN, and Prioritized Experience Replay.

As early as 1997, Tsitsiklis proved that when the function approximator is a nonlinear black box such as a neural network, its convergence and stability cannot be guaranteed.
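
As a reminder of the setting (standard notation, a sketch rather than anything quoted from the original post): the semi-gradient TD(0) update with a parametric approximator v̂(s, w) is

    w ← w + α [ r + γ v̂(s′, w) − v̂(s, w) ] ∇_w v̂(s, w)

With linear v̂ and on-policy sampling this converges; with a nonlinear approximator such as a neural network there is no such guarantee, which is exactly the gap Tsitsiklis' result points at.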

The existing solution is still Monte Carlo tree search; for details, see the implementation of the original AlphaGo 【Silver et al 2016a】.

prediction, policy evaluation

Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In the International Conference on Machine Learning (ICML).

reward function not available

Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685.

GTD 【Sutton 2009a, Sutton 2009b, Mahmood 2014】

Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., Kumaran, D., and Hadsell, R. (2017). Learning to navigate in complex environments. In the International Conference on Learning Representations (ICLR).

Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. (2017). An actor-critic algorithm for sequence prediction. In the International Conference on Learning Representations (ICLR).

Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., and Uszkoreit, J. (2017a). One Model To Learn Them All. ArXiv e-prints.

This article distills 18 key issues, covering topics such as search space, exploration vs. exploitation, policy evaluation, memory usage, network architecture design, and reward signals. It draws on 73 selected papers (27 from 2017 and 21 from 2016). For ease of reading, the full titles are placed at the end of the article and can be found via the citation index.

Duan, Y., Andrychowicz, M., Stadie, B. C., Ho, J., Schneider, J.,Sutskever, I., Abbeel, P., and Zaremba, W. (2017). One-Shot Imitation Learning. ArXiv e-prints.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In the International Conference on Autonomous Agents and Multiagent Systems (AAMAS).

Existing solutions revolve entirely around transfer learning 【Taylor and Stone 2009, Pan and Yang 2010, Weiss et al 2016】; learn invariant features to transfer skills 【Gupta et al 2017】.

TRPO(Trust Region Policy Optimization)【Schulman 2015】

learn to navigate with unsupervised auxiliary learning 【Mirowski et al 2017】

TODO list: The article is not yet fully fleshed out, but the list of papers is complete. Over the coming period I will gather links to all the papers, download them, package them up, and upload them to Baidu Cloud; this should take a couple of days. (2017/12/19)

Osband, I., Blundell, C., Pritzel, A., and Roy, B. V. (2016). Deep exploration via bootstrapped DQN. In the Annual Conference on Neural Information Processing Systems (NIPS).

Q-learning and Actor-Critic

Andrew Ng's inverse reinforcement learning 【Ng and Russell 2000】

Kulkarni, T. D., Narasimhan, K. R., Saeedi, A., and Tenenbaum, J. B. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In the Annual Conference on Neural Information Processing Systems (NIPS)

Existing solutions revolve around unsupervised learning.

learn from expert trajectories, including trajectories that may not come from experts 【Audiffren et al 2015】

Existing solutions basically revolve around imitation learning.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016c). Dueling network architectures for deep reinforcement learning. In the International Conference on Machine Learning (ICML).

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In the International Conference on Machine Learning (ICML)

Kaiser, Ł., Nachum, O., Roy, A., and Bengio, S. (2017b). Learning to Remember Rare Events. In the International Conference on Learning Representations (ICLR).

integrate temporal abstraction with intrinsic motivation 【Kulkarni et al 2016】

Finn, C., Christiano, P., Abbeel, P., and Levine, S. (2016a). A connection between GANs, inverse reinforcement learning, and energy-based models. In NIPS 2016 Workshop on Adversarial Training.

learn a flexible RNN model to handle a family of RL tasks 【Duan et al 2017, Wang et al 2016a】

deep exploration via bootstrapped DQN 【Osband et al 2016】

Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. (2017). Learning invariant feature spaces to transfer skills with reinforcement learning. In the International Conference on Learning Representations (ICLR).

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In the International Conference on Machine Learning (ICML).

Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In the International Conference on Learning Representations (ICLR).

The watershed paper on the Deep Q-Network 【Mnih et al 2013】 essentially admits: although our results look good, there is no theoretical justification (the original paper phrases this rather slyly, stating it the other way around).

A classic recommendation from Sutton's textbook: Dyna-Q 【Sutton 1990】.

Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, A., Riedmiller, M., and Silver, D. (2017). Emergence of Locomotion Behaviours in Rich Environments. ArXiv e-prints

adapt rapidly to new tasks

Sutton, R. S., Szepesvári, C., and Maei, H. R. (2009b). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S.,Lakshminarayanan, B., Hoyer, S., and Munos, R. (2017). The Cramer Distance as a Solution to Biased Wasserstein Gradients. ArXiv e-prints.

unify count-based exploration and intrinsic motivation 【Bellemare et al 2017】

Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., and Levine, S. (2017). Combining model-based and model-free updates for trajectory-centric reinforcement learning. In the International Conference on Machine Learning (ICML)

He, F. S., Liu, Y., Schwing, A. G., and Peng, J. (2017a). Learning to play in a day: Faster deep reinforcement learning by optimality tightening. In the International Conference on Learning Representations (ICLR)

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2016a). Learning to reinforcement learn. arXiv:1611.05763v1.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. ArXiv e-prints.

One fly in the ointment: TD learning is prone to the over-estimation problem, for the following reason:
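
In standard terms: the bootstrapped target takes a max over noisy value estimates, and since

    E[ max_a Q̂(s, a) ] ≥ max_a E[ Q̂(s, a) ],

even zero-mean estimation noise pushes the target upward, and bootstrapping then propagates the bias.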

The discussion below moves beyond the scope of DQN —

Li, K. and Malik, J. (2017). Learning to optimize. In the International Conference on Learning Representations (ICLR).

Existing solutions include:

Florensa, C., Duan, Y., and Abbeel, P. (2017). Stochastic neural networks for hierarchical reinforcement learning. In the International Conference on Learning Representations (ICLR)

focus on salient parts

Existing neural architecture search methods 【Baker et al 2017, Zoph and Le 2017】, among which Zoph's work carries particular weight.

An outstanding piece of work: unsupervised reinforcement and auxiliary learning 【Jaderberg et al 2017】

strategic attentive writer to learn macro-actions 【Vezhnevets et al 2016】

Dueling DQN 【Wang 2016c】 (ICML 2016 Best Paper)
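
A minimal illustrative sketch of the dueling aggregation (the value and advantage streams are assumed to come from a trained network elsewhere; this is not code from the paper):

import numpy as np

def dueling_q_values(value, advantages):
    """value: scalar V(s); advantages: A(s, a) for each action."""
    advantages = np.asarray(advantages, dtype=float)
    # Mean-subtraction keeps the V/A decomposition identifiable.
    return value + (advantages - advantages.mean())

# Example: V(s) = 1.0, A(s, .) = [0.5, -0.5, 0.0]  ->  Q(s, .) = [1.5, 0.5, 1.0]
print(dueling_q_values(1.0, [0.5, -0.5, 0.0]))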

variational information maximizing exploration 【Houthooft et al 2016】

The asynchronous algorithm A3C 【Mnih 2016】

Audiffren, J., Valko, M., Lazaric, A., and Ghavamzadeh, M. (2015). Maximum entropy semisupervised inverse reinforcement learning. In the International Joint Conference on Artificial Intelligence (IJCAI).

model-based learning

benefit from non-reward training signals in environments

Stadie, B. C., Abbeel, P., and Sutskever, I. (2017).Third person imitation learning. In the International Conference on Learning Representations (ICLR).

Existing solutions:

Baker, B., Gupta, O., Naik, N., and Raskar, R. (2017). Designing neural network architectures using reinforcement learning. In the International Conference on Learning Representations (ICLR).

model-free planning

The best-known solution is the Differentiable Neural Computer, which made a big splash in Nature 【Graves et al 2016】.

Current solutions fall into a couple of schools of thought; a picture is worth a thousand words:

Gruslys, A., Gheshlaghi Azar, M., Bellemare, M. G., and Munos, R. (2017). The Reactor: A Sample-Efficient Actor-Critic Architecture. ArXiv e-prints

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

Existing solutions include:

train perception and control jointly end-to-end

Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. In the Annual Conference on Neural Information Processing Systems (NIPS).

Emphatic-TD 【Sutton 2016】

imitation learning with GANs 【Ho and Ermon 2016, Stadie et al 2017】 (a TensorFlow implementation is available in the imitation repository)

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning,3(1):9–44.

Jaderberg, M., Mnih, V., Czarnecki, W., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. In the International Conference on Learning Representations (ICLR).

Actor-critic with experience replay 【Wang et al 2017b】

There are a couple of relatively new solutions:

Existing solution: hierarchical reinforcement learning 【Barto and Mahadevan 2003】

Van Hasselt is a formidable researcher who is particularly fond of tackling the over-estimation problem: he first unleashed Double Q-learning 【van Hasselt 2010】 at NIPS, and six years later produced the deep learning version, Double DQN 【van Hasselt 2016a】!
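
A hedged sketch of the difference between the two targets (q_online and q_target are assumed callables returning per-action values; this is illustrative, not code from either paper):

import numpy as np

def dqn_target(r, s_next, done, q_target, gamma=0.99):
    # The max over the target network both selects and evaluates the action,
    # which is the source of the upward (over-estimation) bias.
    return r + (0.0 if done else gamma * float(np.max(q_target(s_next))))

def double_dqn_target(r, s_next, done, q_online, q_target, gamma=0.99):
    # Double DQN decouples the two roles: the online network selects the
    # action, the frozen target network evaluates it.
    a_star = int(np.argmax(q_online(s_next)))
    return r + (0.0 if done else gamma * float(q_target(s_next)[a_star]))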

Moreover, a good line of thought is to draw inspiration from computer vision and natural language processing; for example, the unsupervised auxiliary learning approach mentioned below borrows many operations from RNNs and LSTMs.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2017). PGQ: Combining policy gradient and q-learning. In the International Conference on Learning Representations (ICLR).

Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345 – 1359.

Wang, S. I., Liang, P., and Manning, C. D. (2016b). Learning language games through interaction. In the Association for Computational Linguistics annual meeting (ACL)

Combining model-free and model-based methods 【Chebotar et al 2017】

Vezhnevets, A. S., Mnih, V., Agapiou, J., Osindero, S., Graves, A., Vinyals, O., and Kavukcuoglu, K. (2016). Strategic attentive writer for learning macro-actions. In the Annual Conference on Neural Information Processing Systems (NIPS).

exploration-exploitation tradeoff

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In the Annual Conference on Neural Information Processing Systems (NIPS), pages 2672–2680.

Lin, L. J. (1993). Reinforcement learning for robots using neural networks. PhD thesis, Carnegie Mellon University.

The existing solution is Guided Policy Search 【Levine et al 2016a】.

Over the past couple of days I read two formidable papers, A Brief Survey of Deep Reinforcement Learning and Deep Reinforcement Learning: An Overview, whose authors cite an overwhelming number of references and lay out future directions for reinforcement learning. The originals summarize the common scientific problems in deep reinforcement learning and list current solutions and related surveys; here I organize that material and extract the relevant papers.

Sutton, R. S. and Barto, A. G. (2017). Reinforcement Learning: An Introduction (2nd Edition, in preparation). MIT Press.

(neural network architecture design)

Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379.

New architectures include 【Kaiser et al 2017a, Silver et al 2016b, Tamar et al 2016, Vaswani et al 2017, Wang et al 2016c】.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016a). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489.

lifelong learning with hierarchical RL 【Tessler et al 2017】

Horde 【Sutton et al 2011】

control, finding optimal policy

Anschel, O., Baram, N., and Shimkin, N. (2017). Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In the International Conference on Machine Learning (ICML).

data/sample efficiency

Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835–846

one/few/zero-shot learning 【Duan et al 2017, Johnson et al 2016, Kaiser et al 2017b, Koch et al 2015, Lake et al 2015, Li and Malik 2017, Ravi and Larochelle 2017, Vinyals et al 2016】

Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. (2016). Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. ArXiv e-prints.

DQN's improvements rely mainly on two tricks: experience replay and a separate target network.
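
A minimal sketch of those two tricks, assuming a toy linear approximator (QNet here is a stand-in, not the original DQN code):

import random
from collections import deque

import numpy as np

class QNet:
    def __init__(self, n_features, n_actions):
        self.w = np.zeros((n_features, n_actions))

    def q_values(self, s):
        return s @ self.w                      # one value per action

    def copy_from(self, other):
        self.w = other.w.copy()                # hard target-network sync

buffer = deque(maxlen=10_000)                  # experience replay memory
online, target = QNet(4, 2), QNet(4, 2)
gamma, alpha, sync_every = 0.99, 0.1, 500

def train_step(step, batch_size=32):
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)  # break temporal correlations
    for s, a, r, s_next, done in batch:
        # Bootstrap from the *frozen* target network for stability.
        bootstrap = 0.0 if done else gamma * float(np.max(target.q_values(s_next)))
        td_error = (r + bootstrap) - online.q_values(s)[a]
        online.w[:, a] += alpha * td_error * s  # semi-gradient update
    if step % sync_every == 0:
        target.copy_from(online)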

Hester, T. and Stone, P. (2017). Intrinsically motivated model learning for developing curious robots. Artificial Intelligence, 247:170–86.

Here are a few brief pointers from CV and NLP: object detection 【Mnih 2014】, machine translation 【Bahdanau 2015】, image captioning 【Xu 2015】, replacing CNNs and RNNs with attention 【Vaswani 2017】, and so on.

PGQ,policy gradient and Q-learning 【O'Donoghue et al 2017】

Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17:1–29

Existing solutions basically amount to learning to learn.

Zhu, X. and Goldberg, A. B. (2009). Introduction to semi-supervised learning. Morgan & Claypool.

Schulman, J., Abbeel, P., and Chen, X. (2017). Equivalence Between Policy Gradients and Soft Q-Learning. ArXiv e-prints.

stochastic neural networks for hierarchical RL 【Florensa et al 2017】

Houthooft, R., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. (2016). Vime: Variational information maximizing exploration. In the Annual Conference on Neural Information Processing Systems (NIPS).

Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., and Mannor, S. (2017). A deep hierarchical approach to lifelong learning in minecraft. In the AAAI Conference on Artificial Intelligence (AAAI).

Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(9)

Ng, A. and Russell, S. (2000). Algorithms for inverse reinforcement learning. In the International Conference on Machine Learning (ICML).

Distributed Proximal Policy Optimization 【Heess 2017】

Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. In the International Conference on Learning Representations (ICLR)

The famous GANs 【Goodfellow et al 2014】

data storage over long time, separating from computation

van Hasselt, H. (2010). Double Q-learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. (2016b). The predictron: End-to-end learning and planning. In NIPS 2016 Deep Reinforcement Learning Workshop.

Existing solutions revolve entirely around semi-supervised learning 【Zhu and Goldberg 2009】.

Combining policy gradients with Q-learning 【O'Donoghue 2017, Nachum 2017, Gu 2017, Schulman 2017】

learning to learn 【Duan et al 2017, Wang et al 2016a, Lake et al 2015】

Q-Prop, policy gradient with off-policy critic 【Gu et al 2017】

Nachum, O., Norouzi, M., and Schuurmans, D. (2017). Improving policy gradient by exploring under-appreciated rewards. In the International Conference on Learning Representations (ICLR).

Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A.,Salakhutdinov, R., Zemel, R. S., and Bengio,Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In the International Conference on Machine Learning (ICML).

return-based off-policy control, Retrace 【Munos et al 2016】, Reactor 【Gruslys et al 2017】

Instability and divergence when combining off-policy learning, function approximation, and bootstrapping

learn from demonstration 【Hester et al 2017】

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. (2017). Q-Prop: Sample-efficient policy gradient with an off-policy critic. In the International Conference on Learning Representations (ICLR).

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

learn, plan, and represent knowledge with spatio-temporal abstraction at multiple levels

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In the International Conference on Learning Representations (ICLR).

Mahmood, A. R., van Hasselt, H., and Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In the International Conference on Machine Learning (ICML).

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In the International Conference on Machine Learning (ICML).

However much things change, temporal-difference (TD) methods remain the core philosophy of policy evaluation 【Sutton 1988】. TD's extensions are every bit as famous as TD itself: Q-learning in 1992 and DQN in 2015.
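
For reference (standard textbook form), the tabular TD(0) and Q-learning updates are:

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

DQN keeps exactly this target structure but estimates Q with a deep network.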

learn with MDPs both with and without reward functions 【Finn et al 2017】

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

The following papers are all on DQN-related topics:

learn knowledge from different domains

Where do the problems in deep reinforcement learning lie? Where is it headed next? In which areas can breakthroughs be made?

gigantic search space

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292

Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016a). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17:1–40.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).

benefit from both labelled and unlabelled data

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In the Annual Conference on Neural Information Processing Systems (NIPS).

van Hasselt, H., Guez, A., and Silver, D. (2016a). Deep reinforcement learning with double Q-learning. In the AAAI Conference on Artificial Intelligence (AAAI).

Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). Trust region policy optimization. In the International Conference on Machine Learning (ICML).

under-appreciated reward exploration 【Nachum et al 2017】

train dialogue policy jointly with reward model 【Su et al 2016b】

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. ArXiv e-prints.
