
Mr 'Don't Do Evil' has patented deep Q-learning; it is like getting a patent on the brain

1. A method of reinforcement learning, the method comprising:

inputting training data relating to a subject system, the subject system having a plurality of states and, for each state, a set of actions to move from one of said states to a next said state;

wherein said training data is generated by operating on said system with a succession of said actions and comprises starting state data, action data and next state data defining, respectively for a plurality of said actions, a starting state, an action, and a next said state resulting from the action; and

training a second neural network using said training data and target values for said second neural network derived from a first neural network;

the method further comprising:

generating or updating said first neural network from said second neural network.
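As the title suggests, claim 1 is the core of deep Q-learning (DQN): a "second" network is trained on recorded transitions, with its regression targets supplied by a "first" network that is periodically regenerated from it. A minimal sketch of one such training step, assuming small fully connected networks, PyTorch, and made-up dimensions (none of which are fixed by the claims):

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the claims leave the architecture and dimensions open.
state_dim, n_actions, batch = 4, 2, 32

def make_net():
    # The "second" (trained) network and the "first" (target) network share
    # one architecture; only their weights differ over time.
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

q_net = make_net()                              # second network: trained, selects actions
target_net = make_net()                         # first network: supplies the targets
target_net.load_state_dict(q_net.state_dict())  # generated from the second network

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# A batch of (starting state, action, reward, next state) training data;
# random placeholders stand in for real transitions here.
s  = torch.randn(batch, state_dim)
a  = torch.randint(n_actions, (batch,))
r  = torch.randn(batch)
s2 = torch.randn(batch, state_dim)

with torch.no_grad():
    # Target values for the second network are derived from the first network.
    y = r + gamma * target_net(s2).max(dim=1).values

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_sa, y)
optimizer.zero_grad()
loss.backward()
loss_value = loss.item()  # the second network is adjusted toward the targets
optimizer.step()
```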

2 . A method as claimed in claim 1 further comprising selecting said actions using learnt action-value parameters from said second neural network, wherein said actions are selected responsive to an action-value parameter determined for each action of a set of actions available at a state of said system.

3 . A method as claimed in claim 2 wherein said training data comprises experience data derived from said selected actions, the method further comprising generating said experience data by storing data defining said actions selected by said second neural network in association with data defining respective said starting states and next states for the actions.

4 . A method as claimed in claim 3 further comprising generating said target values by providing said data defining said actions and said next states to said first neural network, and training said second neural network using said target values and said data defining said starting states.
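Claims 3 and 4 describe what the DQN papers call experience replay: every selected action is stored together with its starting and next state (and, from claim 9 onward, a reward), then sampled later to build training targets. A minimal sketch of such an experience store, with class and field names that are illustrative rather than taken from the patent:

```python
import random
from collections import deque, namedtuple

# One unit of experience data: starting state, action, reward, next state.
Transition = namedtuple("Transition", "state action reward next_state")

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest experience is discarded

    def push(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions before training.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```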

5. A method as claimed in claim 2, further comprising:

inputting state data defining a state of said system;

providing said second neural network with a representation of said state of said system;

retrieving from said second neural network a learnt said action-value parameter for each action of said set of actions available at said state; and

selecting an action to perform having a maximum or minimum said learnt action-value parameter from said second neural network.
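Claim 5 selects the action whose learnt action-value parameter is maximal (or minimal, where the values represent costs). With a network like the q_net sketched above, that is one forward pass and an argmax; the epsilon-greedy exploration used in the published DQN agent is not spelled out in the claim and is omitted here:

```python
import torch

def select_action(q_net, state, minimize=False):
    """Return the action whose learnt action-value is best for this state."""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    # argmax for reward-style values, argmin for cost-style values
    return int(q_values.argmin() if minimize else q_values.argmax())
```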

6 . A method as claimed in claim 5 further comprising storing experience data from said system, wherein said experience data is generated by operating on said system with said actions selected using said second neural network, and wherein said training data comprises said stored experience data.

7. A method as claimed in claim 6, further comprising:

selecting, from said experience data, starting state data, action data and next state data for one of said plurality of actions;

providing said first neural network with a representation of said next state from said next state data;

determining, from said first neural network, a maximum or minimum learnt action-value parameter for said next state;

determining a target value for training said second neural network from said maximum or minimum learnt action-value parameter for said next state.

8 . A method as claimed in claim 7 wherein said training of said second neural network comprises providing said second neural network with a representation of said starting state from said starting state data and adjusting weights of said neural network to bring a learnt action-value parameter for an action defined by said action data closer to said target value.

9 . A method as claimed in claim 7 wherein said experience data further comprises reward data defining a reward value or cost value of said system resulting from said action taken, and wherein said determining of said target value comprises adjusting said maximum or minimum learnt action-value parameter for said next state by said reward value or said cost value respectively.
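Taken together, claims 7 to 9 are the standard Q-learning target, computed with the first (earlier, quasi-stationary) network rather than the network being trained. Writing the first network's weights as θ⁻, a stored transition as (s, a, r, s'), and introducing a discount factor γ (γ and θ are notation from the published DQN papers, not from the claims), the target of claim 9 is

$$ y = r + \gamma \max_{a'} Q\left(s', a';\, \theta^{-}\right) $$

and, per claim 8, the second network's weights are adjusted to bring Q(s, a; θ) closer to y; for cost-valued systems the max becomes a min and the cost is subtracted rather than the reward added.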

10 . A method as claimed in claim 1 wherein a state of said system comprises a sequence of observations of said system over time representing a history of said system.
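Claim 10's state built from "a sequence of observations of said system over time" is, in the Atari experiments, the familiar frame stack: the last few observations are concatenated so the network sees enough history to act well. A small sketch, with the history length of 4 assumed from the DQN papers rather than from the claim:

```python
import numpy as np
from collections import deque

class ObservationHistory:
    """Keeps the last `length` observations and exposes them as one state."""
    def __init__(self, obs_shape, length=4):
        self.frames = deque([np.zeros(obs_shape, dtype=np.float32)] * length,
                            maxlen=length)

    def append(self, obs):
        self.frames.append(np.asarray(obs, dtype=np.float32))

    def state(self):
        # e.g. shape (4, 84, 84) when each observation is an 84x84 image frame
        return np.stack(self.frames, axis=0)
```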

11 . A method as claimed in claim 2 wherein said training of said second neural network alternates with said selecting of said actions and comprises incrementally updating a set of weights of said second neural network used for selecting said actions.

12 . A method as claimed in claim 1 wherein said generating or updating of said first neural network from said second neural network is performed at intervals after repeated said selecting of said actions using said second neural network and said training of said second neural network.

13 . A method as claimed in claim 12 wherein said generating or updating of said first neural network from said second neural network comprises copying a set of weights of said second neural network to said first neural network.
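Claims 12 and 13 are the periodic "hard" refresh of the first network: every so often the second network's full weight set is copied into it, and between refreshes the first network's weights stay put (claim 20 later calls them quasi-stationary). In PyTorch terms this is a one-line copy; the interval is a free hyperparameter, since the claims only say "at intervals":

```python
import torch.nn as nn

def maybe_refresh_target(step, q_net: nn.Module, target_net: nn.Module,
                         sync_every=10_000):
    """Copy the second network's weights into the first network every
    `sync_every` training steps, leaving them unchanged in between."""
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```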

15 . A method as claimed in claim 1 wherein a said state is defined by image data.

16 . A method as claimed in claim 1 wherein said first and second neural networks comprise deep neural networks with a convolutional neural network input stage.
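Claims 15 and 16 pin the setting down further: states defined by image data, and deep networks with a convolutional input stage. A sketch in the shape of the network from the published DQN work (layer sizes come from those papers, not from the claims), ending in one output neuron per available action as the later controller claims also describe:

```python
import torch.nn as nn

def dqn_network(n_actions, in_channels=4):
    # Convolutional "input stage" over stacked 84x84 frames, then fully
    # connected layers; the final layer emits one Q-value per action.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        nn.Linear(512, n_actions),
    )
```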

17 . A non-transitory data carrier carrying processor control code to implement the method of claim 1 .

18 . A method of Q-learning wherein Q values are determined by a neural network and used to select actions to be performed on a system to move the system between states, wherein a first neural network is used to generate a Q-value for a target for training a second neural network used to select said actions.

19 . A method as claimed in claim 18 wherein at intervals said first neural network is refreshed from said second neural network.

20 . A method as claimed in claim 19 wherein weights of said first neural network are quasi-stationary, remaining substantially unchanged during intervals between said refreshing.

21 . A method as claimed in claim 18 further comprising storing a record of said selected actions and states, and using said record to generate said Q-value for said target.

22 . A method as claimed in claim 18 wherein said first and second neural networks are deep neural networks including locally connected or sparsely connected front end neural network portions.

23 . A method as claimed in claim 18 wherein said Q-value comprises a value of an action-value function approximating an expected cost of or return from a strategy of actions including a defined next action.
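Claim 23's "action-value function approximating an expected cost of or return from a strategy of actions including a defined next action" is the usual definition of the Q-function. In the return (reward) formulation, for a strategy π and a discount factor γ:

$$ Q^{\pi}(s, a) \;=\; \mathbb{E}\!\left[\, \sum_{t \ge 0} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\; a_{0} = a,\; \pi \right] $$

i.e. the expected discounted return from taking action a in state s and following π thereafter; the cost formulation simply swaps rewards for costs and maximisation for minimisation.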

24 . A non-transitory data carrier carrying processor control code to implement the method of claim 18 .

25. A data processor configured to perform reinforcement learning, the data processor comprising:

an input to receive training data from a system having a plurality of states and, for each state, a set of actions to move from one of said states to a next said state;

wherein said training data is generated by operating on said system with a succession of said actions and comprises starting state data, action data and next state data defining, respectively for a plurality of said actions, a starting state, an action, and a next said state resulting from the action;

wherein said actions are selected responsive to an action-value parameter for each action of said set of actions available at each state;

wherein said actions are selected using learnt action-value parameters from a second neural network; and

a training module to train said second neural network using said training data and target values derived from a first neural network; and

a neural network generation module to generate or update said first neural network from said second neural network.

26 . A data processor as claimed in claim 25 further comprising an action selection module to select said actions responsive to an action-value parameter for each action of said set of actions available at a state of said system, wherein said action-value parameters are provided by said second neural network.

27 . A data processor as claimed in claim 25 wherein said neural network generation module is configured to copy a set of weights of said second neural network to said first neural network.

28. A data processor configured to perform Q-learning, wherein Q values are determined by a neural network and used to select actions to be performed on a system to move the system between states, the data processor comprising a processor coupled to working memory and to non-volatile program memory storing processor control code, wherein said processor control code is configured to control said processor to:

generate a Q-value for a target using a first neural network;

train a second neural network using said target; and

select actions to control said system using said second neural network.

29. An electronic controller trained by reinforcement learning to control a system having a plurality of states and, for each state, a set of actions to move from one of said states to a next said state; the electronic controller comprising:

an input to receive state data from said system;

a neural network having a set of input neurons coupled to said input, a plurality of hidden layers of neurons, and at least one output neuron, wherein said neural network is trained to provide, for each of said set of actions, an action quality value defining an expected cost of or reward from a strategy of actions beginning with the respective action to move to a next state;

an action selector configured to select an action from said set of actions responsive to the action quality values for said actions; and

an output to output data defining said selected action for controlling said system.

30 . An electronic controller as claimed in claim 29 wherein an input portion of said neural network comprises a convolutional neural network.

31 . An electronic controller as claimed in claim 29 wherein said neural network has a plurality of output neurons, each configured to provide a respective said action quality value for an action of said set of available actions.

32 . An electronic controller as claimed in claim 31 wherein said output neurons are each coupled to said action selector to provide said action quality values in parallel to said action selector.

33. A method of learning in a control system, the method comprising, for a succession of states of a subject system:

inputting current state data relating to a current state of a subject system;

providing a version of said current state data to a neural network;

determining, using said neural network, values for a set of action-value functions, one for each of a set of potential actions;

selecting a said action responsive to said values of said action-value functions;

outputting action data for said selected action to said subject system such that said subject system transitions from said current state to a subsequent state;

inputting subsequent state data relating to said subsequent state of said subject system and reward data relating to a reward or cost resulting from said transition from said current state to said subsequent state;

storing, in an experience memory, experience data representing said current state, said subsequent state, said selected action, and said reward or cost;

determining a target action-value function output for said neural network from said stored experience data; and

updating weights of said neural network using said target action-value function output, wherein said updating comprises incrementally modifying a previously determined set of weights of said neural network;

the method further comprising:

storing a set of weights of said neural network to create two versions of said neural network, one time-shifted with respect to the other,

wherein said determining of said values of said set of action-value functions for selecting said action is performed using a later version of said neural network versions, and

wherein said determining of said target action-value function is performed using an earlier version of said neural network versions.
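Claim 33 strings the earlier pieces into one loop: act with the later copy of the network, record each transition and its reward in the experience memory, build targets with the earlier (time-shifted) copy, update the weights incrementally, and store a snapshot of the weights now and then to refresh the earlier copy. The sketch below puts that together end to end; the toy environment, network sizes, exploration rate and intervals are all stand-ins, since the claim fixes none of them:

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99
BATCH, SYNC_EVERY, STEPS = 32, 200, 2_000

class ToyEnv:
    """Stand-in subject system: random next states, reward 1 for action 0."""
    def reset(self):
        return torch.randn(STATE_DIM)
    def step(self, action):
        return torch.randn(STATE_DIM), (1.0 if action == 0 else 0.0)

def make_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

online = make_net()                         # "later" version: selects actions
target = make_net()                         # "earlier", time-shifted version
target.load_state_dict(online.state_dict())
opt = torch.optim.Adam(online.parameters(), lr=1e-3)

memory = deque(maxlen=10_000)               # the experience memory
env = ToyEnv()
state = env.reset()

for step in range(STEPS):
    # Select an action with the later network (plus a little random exploration).
    if random.random() < 0.1:
        action = random.randrange(N_ACTIONS)
    else:
        with torch.no_grad():
            action = int(online(state.unsqueeze(0)).argmax())

    next_state, reward = env.step(action)
    memory.append((state, action, reward, next_state))   # store experience data
    state = next_state

    if len(memory) >= BATCH:
        batch = random.sample(memory, BATCH)
        s  = torch.stack([t[0] for t in batch])
        a  = torch.tensor([t[1] for t in batch])
        r  = torch.tensor([t[2] for t in batch])
        s2 = torch.stack([t[3] for t in batch])

        # Target action-value output from the earlier, time-shifted copy.
        with torch.no_grad():
            y = r + GAMMA * target(s2).max(dim=1).values

        q_sa = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, y)
        opt.zero_grad()
        loss.backward()
        opt.step()                          # incremental weight update

    if step % SYNC_EVERY == 0:
        # Store the current weights as the new "earlier" version.
        target.load_state_dict(online.state_dict())
```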

34 . A method as claimed in claim 33 wherein said state data comprises digitised image or waveform data.

35. A method as claimed in claim 33, wherein said target action-value function output is determined by reading, from said experience memory, data identifying a first state, an action, a subsequent state, and a reward or cost value; the method further comprising:

determining, using said neural network, a value of an action-value function for an action recommended by said neural network for said subsequent state; and

determining said target action-value function output from a combination of said value of said action-value function for said action recommended by said neural network for said subsequent state and said reward or cost value.

36. A control system, the system comprising:

a data input to receive sensor data;

a data output to provide action control data; and

a deep neural network having an input layer coupled to said data input and an output layer,

wherein said input layer of said deep neural network defines a sensor data field in one or more dimensions,

wherein said output layer of said deep neural network defines a value for an action-value function associated with each of a plurality of possible actions for said control system to control; and

an action selector, coupled to said output layer of said deep neural network and to said data output, to select a said action responsive to said action-value function and to provide corresponding action control data to said data output.
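Claim 36 (like claim 29) repackages the trained network as a controller: sensor data at the input layer, one action-value per possible action at the output layer, and an action selector that turns those values into action control data. A small illustrative wrapper, with hypothetical names, over any network built as above:

```python
import torch
import torch.nn as nn

class NeuralController:
    """Sensor data in -> deep network -> action selector -> action control data out."""
    def __init__(self, network: nn.Module):
        self.network = network

    def control(self, sensor_data):
        x = torch.as_tensor(sensor_data, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            q_values = self.network(x)     # one action-value per possible action
        return int(q_values.argmax())      # action selector: pick the best value
```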
