Post by sabbirislam258 on Feb 14, 2024 6:00:13 GMT
Reinforcement learning is a promising avenue for AI development, producing agents that can handle highly complex tasks. Reinforcement learning algorithms are used in mobile robotics systems and self-driving cars, among other applications. However, because of the way reinforcement learning agents are trained, through trial and error, they can exhibit strange and unpredictable behavior. These behaviors can be dangerous, and AI researchers call this the "safe exploration" problem: an agent that learns by trying things out may stumble into unsafe states along the way. Recently, Google's AI research lab DeepMind released a paper proposing a new way to tackle safe exploration and train agents more safely. DeepMind's method also corrects for reward hacking, where an agent exploits flaws in the reward specification.
DeepMind's new methodology uses two different systems to guide the AI's behavior in situations where unsafe actions might occur: a generative model and a forward dynamics model. Both models are trained on off-policy data, such as demonstrations by safety experts and completely random trajectories. A supervisor labels this data with specific reward values, and the agent looks for patterns of behavior that would let it collect the largest reward. Unsafe states are also labeled, and once the model can successfully predict rewards and unsafe states, it is deployed to execute targeted actions. The idea, the research team explains in the paper, is to generate hypothetical behaviors from scratch, have the supervisor label the desired ones, and make these hypothetical scenarios as informative as possible while staying directly relevant to the learning environment.
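To make the supervised reward-modeling step concrete, here is a minimal sketch in Python/PyTorch. This is not DeepMind's code; the state dimension, network shape, and data are all hypothetical placeholders, and it reduces the labeling scheme to simple scalar regression (unsafe states would just carry large negative labels in this toy setup).

```python
# Minimal sketch (hypothetical, not the paper's code) of fitting a reward
# model to supervisor-labeled states drawn from off-policy data such as
# demonstrations and random trajectories.
import torch
import torch.nn as nn

STATE_DIM = 8  # assumed size of a state vector

class RewardModel(nn.Module):
    """Predicts a scalar reward for each state."""
    def __init__(self, state_dim: int = STATE_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states).squeeze(-1)

# Placeholder data standing in for supervisor-labeled states; unsafe states
# would simply carry large negative reward labels here.
states = torch.randn(256, STATE_DIM)
reward_labels = torch.randn(256)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(500):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(states), reward_labels)
    loss.backward()
    optimizer.step()
```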
Crucially, this means unsafe behavior is evaluated hypothetically rather than executed directly. The DeepMind team calls the approach ReQueST, for reward query synthesis via trajectory optimization. ReQueST synthesizes four different types of query trajectories. The first seeks to maximize the uncertainty of the reward model. The second and third minimize and maximize the predicted reward, respectively: predicted reward is minimized to surface behaviors the model may be scoring incorrectly, while it is maximized so that the resulting labels are as informative as possible. Finally, the fourth type maximizes the novelty of the trajectory, encouraging exploration of unfamiliar behavior regardless of the predicted reward. Once the reward model reaches the desired level of accuracy, a planning agent is used to make decisions based on the learned rewards.
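The sketch below illustrates the four query objectives. It is hypothetical, not the paper's implementation: the paper optimizes the input of a learned generative/dynamics model so the synthesized trajectories stay realistic, whereas this sketch optimizes the state sequence directly, uses ensemble disagreement as the uncertainty signal, and measures novelty as distance from previously labeled states. `RewardModel` and `states` are assumed from the previous sketch.

```python
# Minimal sketch (hypothetical) of synthesizing a query trajectory by
# gradient ascent on one of ReQueST's four acquisition objectives.
import torch

def synthesize_query(ensemble, train_states, objective: str,
                     horizon: int = 20, steps: int = 200, lr: float = 0.05):
    # Optimize the state sequence directly; the paper instead optimizes the
    # latent input of a generative model to keep trajectories realistic.
    traj = torch.randn(horizon, train_states.shape[1], requires_grad=True)
    opt = torch.optim.Adam([traj], lr=lr)
    for _ in range(steps):
        if objective == "novelty":          # behavior 4: far from seen data
            score = torch.cdist(traj, train_states).min(dim=1).values.mean()
        else:
            preds = torch.stack([m(traj) for m in ensemble])  # (models, T)
            if objective == "uncertainty":  # behavior 1: disagreement
                score = preds.std(dim=0).mean()
            elif objective == "min_reward": # behavior 2: suspected errors
                score = -preds.mean()
            elif objective == "max_reward": # behavior 3: informative labels
                score = preds.mean()
            else:
                raise ValueError(objective)
        opt.zero_grad()
        (-score).backward()  # ascend the chosen acquisition score
        opt.step()
    return traj.detach()

# Example usage with an ensemble of models from the sketch above:
# ensemble = [RewardModel() for _ in range(3)]
# query = synthesize_query(ensemble, states, "uncertainty")
```

Each synthesized trajectory would then be shown to the supervisor for labeling, and the new labels fold back into reward-model training.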