#### Markov Decision Processes

- Lecture 1.1
- Lecture 1.2
- Lecture 1.3
- Quiz 1.1
- Lecture 1.4
- Lecture 1.5
- Lecture 1.6

#### Dynamic Programming

- Lecture 2.1
- Lecture 2.2
- Lecture 2.3
- Lecture 2.4
- Lecture 2.5
- Lecture 2.6
- Lecture 2.7
- Lecture 2.8
- Lecture 2.9

#### Monte Carlo Methods

- Lecture 3.1
- Lecture 3.2
- Lecture 3.3
- Lecture 3.4
- Lecture 3.5
- Lecture 3.6
- Lecture 3.7
- Lecture 3.8
- Lecture 3.9

#### Model-Free Learning

- Lecture 4.1
- Lecture 4.2
- Lecture 4.3
- Lecture 4.4
- Quiz 4.1
- Lecture 4.5
- Lecture 4.6

#### RL in Continuous Spaces

- Lecture 5.1
- Lecture 5.2
- Lecture 5.3
- Lecture 5.4
- Lecture 5.5
- Lecture 5.6
- Quiz 5.1
- Lecture 5.7

#### Deep Reinforcement Learning

- Lecture 6.1
- Lecture 6.2
- Lecture 6.3
- Lecture 6.4
- Lecture 6.5
- Quiz 6.1
- Lecture 6.6
- Lecture 6.7

#### Policy Based Methods

- Lecture 7.1
- Lecture 7.2
- Lecture 7.3
- Quiz 7.1
- Lecture 7.4
- Lecture 7.5

#### Policy Gradient Methods

- Lecture 8.1
- Quiz 8.1
- Lecture 8.2
- Lecture 8.3
- Lecture 8.4
- Lecture 8.5
- Quiz 8.2
- Lecture 8.6
- Lecture 8.7

#### Actor Critic Methods

- Lecture 9.1
- Lecture 9.2
- Lecture 9.3
- Lecture 9.4
- Lecture 9.5
- Lecture 9.6
- Lecture 9.7
- Lecture 9.8
- Lecture 9.9
- Quiz 9.1

#### Multi-Agent RL

- Lecture 10.1
- Lecture 10.2
- Lecture 10.3
- Lecture 10.4
- Lecture 10.5
- Quiz 10.1
- Lecture 10.6
- Lecture 10.7



## 47 Comments

First paragraph, “For example, in chase game, the chess board configuration after a move is being made can be decided based on the current board configuration and the action being made now and we don’t need to worry about previous chess board configurations or past actions.” Should be “chess” game, not “chase”.

Feel free to delete comment after.

I noticed that as well. I think “chase” is a typo.

yeah, it is a typo

Yep. I confirm this.

The first diagram regarding the Roomba example is not really visible.

You should be able to magnify this diagram in your browser. For instance in Chrome, I use [CTRL] [+] and [CTRL] [-] to zoom in/out. Hope this helps!

It’s still blurry and I can’t really make out anything.

Yes, it’s not really visible but other similar examples are very much available.

I do not find the formulas really clear here. Maybe they could be explained even more.

Hey what happened to all the great links you had here for further reading?

Won’t we get a video of this lecture?

It is the written form of what we have already studied through the videos.

There was some reading assignment before this. I am not able to see that anymore.

Where did the links go?

https://www.lpalmieri.com/posts/rl-introduction-01/

https://towardsdatascience.com/reinforcement-learning-demystified-markov-decision-processes-part-1-bf00dda41690

These were a few of the links.

thanks man!!!

If there were an option to print the info as a PDF, it would be wonderful.

Why did the reading assignment (with several links to blog posts and David Silver’s course) disappear?

In your chess example you have misspell as chase make it correct.

Your grammar isn’t right, ”make it correct”.

Sure give me an edit option!

In my value function quiz, I got 5 of 6 correct and the system still shows that I failed. Please review it.

The quiz is not multiple choice, there may be several right answers to the questions.

That may or may not help you, but will help future quiz takers…

Same with me. Unfortunately, I used up all my attempts trying to mark 100% correct answers. As of now, my last score is 5/6 but it still shows as failed.

How the hell do you know the answers to the quiz questions? I find it SUPER confusing, as there are questions that haven’t even been mentioned in the lectures.

Yes, I get these as well!

Text needs some polishing. Some suggestions:

* The word Markov here refers to ~that~ *the* Markovian property which means (…)

* This means that *the* current state (…)

* I see a question mark (?) rather than a lambda in the definition of MDP as a 5 tuple.

* (…) is the one that ~optimizes to maximize~ maximizes*

Hi Siraj,

When can I reattempt my value function quiz?

Can you use bigger images? Or else a zoom-in feature would be helpful.

You can also zoom into the content with your browser. I used [CTRL] [+] and [CTRL] [-] to zoom in/out.

Yes, but the images get blurry and I am not able to make out the labels on the flow diagram.

I don’t know if the markdown on Wikipedia is supported, but it shows the following well-written definition for the Markov Decision Process:

SOURCE: https://en.wikipedia.org/wiki/Markov_decision_process

A Markov decision process is a 5-tuple (S, A, P_a, R_a, γ), where

- S is a finite set of states,
- A is a finite set of actions (alternatively, A_s is the finite set of actions available from state s),
- P_a(s, a, s′) = Pr(s_{t+1} = s′ | s_t = s, a_t = a) is the probability that action a in state s at time t will lead to state s′ at time t+1,
- R_a(s, a, s′) is the immediate reward (or expected immediate reward) received after transitioning from state s to state s′, due to action a,
- γ ∈ [0, 1] is the discount factor, which represents the difference in importance between future rewards and present rewards.

(Note: the theory of Markov decision processes does not state that S or A are finite, but the basic algorithms below assume that they are finite.)
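To make the 5-tuple concrete, here is a minimal sketch of a tiny MDP in Python; the two states, two actions, and all the probabilities and rewards are made up purely for illustration:

```python
# A tiny MDP (S, A, P, R, gamma); all numbers are invented for illustration.
S = ["s0", "s1"]          # finite set of states
A = ["stay", "move"]      # finite set of actions

# P[(s, a)] maps each next state s' to Pr(s_{t+1} = s' | s_t = s, a_t = a)
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a, s')] is the immediate reward for the transition s --a--> s'
R = {(s, a, s2): (1.0 if s2 == "s1" else 0.0) for s in S for a in A for s2 in S}

gamma = 0.9  # discount factor in [0, 1]

# Sanity check: each transition distribution sums to 1
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Representing P and R as plain dictionaries keeps the correspondence with the tuple definition visible; a real implementation would more likely use NumPy arrays indexed by state and action.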

I recommend a small clarification, based on the following passage by the author:

“For example, in chase game, the chess board configuration after a move is being made can be decided based on the current board configuration and the action being made now and we don’t need to worry about previous chess board configurations or past actions.”

The rules of chess often take previous configurations of the board into account:

1. Castling may only be done if both the king and rook involved have never moved:

https://en.wikipedia.org/wiki/Castling

2. En-passant requires that the pawn capture take place immediately after the opponent moves their pawn two squares:

https://en.wikipedia.org/wiki/En_passant

You can clarify by saying that chess relies on knowledge of previous board configurations, and the game does not have the Markovian property.
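One way to reconcile the two views: if the relevant history (castling rights, the en-passant square) is folded into the state itself, the augmented game becomes Markovian again. A minimal sketch of that idea, with field names invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical augmented chess state: the board alone is not Markovian,
# but board + castling rights + en-passant square is. The field names
# here are invented for illustration, not taken from any chess library.
@dataclass(frozen=True)
class ChessState:
    board: Tuple[str, ...]                   # piece placement, one entry per square
    white_can_castle: bool = True
    black_can_castle: bool = True
    en_passant_square: Optional[int] = None  # set only right after a two-square pawn push

# With this state, the legality of castling and en passant depends only
# on the current state, not on the full move history.
```

This is the standard trick for restoring the Markov property: enlarge the state until everything the dynamics depend on is contained in it.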

The numbers in the first diagram of the Roomba are not explained. Why those numbers? Is it just an example with arbitrary numbers?

A policy determines the set of actions that are taken by the agent to reach a goal

A policy is a set of actions that are taken by the agent to reach a goal

Please which of these two definitions is more accurate?
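In the standard formulation, a policy is a mapping from states to actions (or to distributions over actions), which is closer to the first phrasing; a fixed “set of actions” loses the dependence on the state. A minimal sketch, with invented state and action names:

```python
import random

# A deterministic policy maps each state to one action.
deterministic_policy = {"s0": "move", "s1": "stay"}

# A stochastic policy maps each state to a distribution over actions.
stochastic_policy = {
    "s0": {"move": 0.7, "stay": 0.3},
    "s1": {"move": 0.1, "stay": 0.9},
}

def sample_action(policy, state):
    """Draw an action from a stochastic policy at the given state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```

The key point is that the action can differ from state to state; the policy answers “what do I do *here*?”, not “which actions will I ever take?”.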

A typo:

>> ‘r’ − P(r | s, a) reward model

Should it be E(r | s, a)?

I too think it should be E instead of P there, as the reward can’t be a probability value.
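Both conventions appear in the literature: some texts specify a reward *distribution* P(r | s, a), others the *expected* reward E[r | s, a], and the expectation can always be recovered from the distribution. A small sketch of that relationship, with made-up numbers:

```python
# A reward distribution P(r | s, a) for one (state, action) pair:
# it maps possible reward values to their probabilities (made-up numbers).
reward_dist = {0.0: 0.5, 1.0: 0.5}

def expected_reward(dist):
    """Recover E[r | s, a] from the distribution P(r | s, a)."""
    return sum(r * p for r, p in dist.items())

print(expected_reward(reward_dist))  # -> 0.5
```

So the text is not necessarily wrong to write P(r | s, a), but if a single scalar reward per (s, a) is meant, E is the clearer symbol.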

Small typo in the second paragraph under ‘What is a Policy?’

Last sentence, “For an MDP…” should be For a MDP

Very minor :^)

check it out :

https://github.com/llSourcell/navigating_a_virtual_world_with_dynamic_programming

Content-wise it is OK, but the formatting needs to be polished a bit, e.g. using KaTeX for the math.

Would be better with better pictures. The pictures are not clear.

can anyone explain this part?

‘r’ − P(r | s, a) reward model that models what reward the agent will receive when it performs an action a when it is in state ‘s’.

So far so good!

TYPO: “‘T’ − P(s′ | s, a)” the minus sign must be an equal sign.

Feel free to erase this comment.

The same with “‘r’ − P(r | s, a)”.

Perfect text, congratulations.

Excellent post. I was checking this blog constantly and I’m impressed! Very useful information, specifically the last part. I care for such info a lot. I was seeking this particular info for a very long time. Thank you and best of luck.