Environment

The environment contains the information about the problem, handles the execution of the actions and provides the rewards.

source

SdPEnvironment

 SdPEnvironment (problem, sdp_solver, budget,
                 reward_criterion='bound_norm', layout_basis=None,
                 max_basis_size=None, initial_size=1, bound_tol=0.001)

Environment for constraint-space exploration.

The environment contains all the information of the problem at hand, and controls the way the agent can explore the state space.

Hence, we must provide the SdPEnvironment with an instance of our problem, such as a Hamiltonian, and an adequate solver that defines the objective, such as an SdPEnergySolver.

Then, we need to provide information about the reinforcement learning task, such as the computational budget we can afford and the reward function that we want to use. Optionally, we can provide a pre-defined layout basis, which conditions the kind of constraints that the agent can create. Otherwise, the environment generates its own basis directly from the problem instance and its underlying structure, e.g., the Hamiltonian.graph.

Create an environment

Let’s see an example. We will start by defining the problem parameters. In this case, we will try to find the ground state energy of a Heisenberg XX Hamiltonian on a one-dimensional chain. Furthermore, we will consider a moderate budget that only allows us to consider 3-body reduced density matrices to compute the bounds.

N = 5
chain = Chain1D(N)
B, J = 3, 1
H = XXHamiltonian(chain, B, J)
solver = SdPEnergySolver()
budget = 100

With this, we can create our environment.

env = SdPEnvironment(H, solver, budget)

The environment starts in the simplest state, which corresponds to a trivial relaxation of the problem. This is represented by a state of zeros, indicating that no compatibility constraints are active at this time.

env.state
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
assert (env.state == np.zeros(2*N)).all()

The state vector indicates which constraints in the environment’s basis are currently active. Although it effectively acts as a boolean mask, it is stored as integers because it is later processed by the agents.

In this case, the environment’s basis is automatically generated from the Hamiltonian.graph.edges and the budget.

env.layout_basis
array([array([0, 1]), array([1, 2]), array([2, 3]), array([3, 4]),
       array([0, 4]), array([0, 1, 2]), array([0, 1, 4]),
       array([1, 2, 3]), array([2, 3, 4]), array([0, 3, 4])], dtype=object)
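
For instance, here is a minimal sketch (assuming NumPy is imported as np) of how to read the active constraints off the state vector:

# The nonzero entries of env.state select the active constraints from the basis.
active_constraints = [env.layout_basis[i] for i in np.flatnonzero(env.state)]
active_constraints  # empty for now: no compatibility constraints are active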

The environment associates a unique layout to every state, which depends on the basis and the active constraints. In this case, the layout is made of single-body reduced density matrices.

env.layout
[array([0]), array([1]), array([2]), array([3]), array([4])]

The environment takes care of providing the solver with the information it needs about the problem. In this case, it feeds the SdPEnergySolver both the data from the Hamiltonian and the current layout. We can solve the semidefinite program (SdP) associated with the active constraints through the get_values method.

bound, cost, error = env.get_values()
bound, cost, error
(-24.99999999732747, 19, 0)
assert bound == solver.solve(H.to_sdp(), env.layout)
assert cost == solver.ojimetro(env.layout)

Every time we visit a new state, the environment keeps track of the obtained values. This way, it builds a reference of the best bounds that can be obtained. For now, given that we have not moved from the initial state, the best bound and layout are the only ones it has ever seen.

env.max_bound, env.best_layout
(-24.99999999732747,
 [array([0]), array([1]), array([2]), array([3]), array([4])])

In order to assess which is the best layout, the environment accounts for both the best bound and the lowest cost with which it has ever been achieved. This is stored in best, which contains the best bound together with the maximum and minimum costs with which it was ever achieved.

env.best
array([-25.,  19.,  19.])
assert bound == env.max_bound
assert env.best_layout == env.layout
assert (env.best == np.array([bound, cost, cost])).all()

Exploring the state space

We can move to new states by performing an action. Actions are, in essence, very simple: they are integers that indicate which bit we flip in the state vector. Doing so adds or removes constraints in the associated SdP.

In order to test this out, let’s add the first 3-body constraint, indexed by N in this case.

new_state, bound, cost, error = env.perform_action(N)
new_state, bound, cost
(array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0]), -20.999999995348936, 71)

We observe that now we have a higher energy bound. In this case, the higher the better, so it’s nice! However, the associated cost is also much larger than before.

Additionally, despite performing a single action, we observe that now there are three “ones” in our state vector. This is because the basis element in the N-th position contains two other elements of smaller size. We can see the state vector grouped by constraint size with the show_constraints method.

env.show_constraints()
2: [1 1 0 0 0]
3: [1 0 0 0 0]
print(f"{env.layout_basis[N]} contains {env.layout_basis[0]} and {env.layout_basis[1]}.")
[0 1 2] contains [0 1] and [1 2].

However, these two smaller ones are not present in the layout, as they would be redundant.

env.layout
[array([0, 1, 2]), array([3]), array([4])]

We can see this with the contained_constraints method.

env.contained_constraints()
array([ True,  True, False, False, False, False, False, False, False,
       False])

This way, we ensure a consistent exploration through state space. Now, if we removed the larger element, the two smaller ones would remain, ensuring that there are no big leaps in the exploration and limiting the actions that can be performed. The agent cannot attempt to add or remove elements contained in larger ones. We can see this with the action_mask method, which indicates the state vector positions that can be modified.

env.action_mask()
array([False, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True])

Notice that the action mask is one element longer than the state vector because it accounts for the action of “remaining still”.
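
As a small sketch (again assuming NumPy as np), we can list the indices of the currently allowed actions directly from the mask:

# Indices of the allowed actions; the positions of contained constraints
# (here the first two) are excluded from the mask.
allowed_actions = np.flatnonzero(env.action_mask())
allowed_actions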

Since we have moved to a state that provides a better bound for our problem, the references have changed.

env.max_bound, env.best_layout
(-20.999999995348936, [array([0, 1, 2]), array([3]), array([4])])
assert (new_state == env.state).all()
assert bound == env.max_bound
assert (bound, cost, 0) == env.get_values()
assert (env.best == np.array([bound, cost, cost])).all()
assert np.all([(l_best == l).all() for l_best, l in zip(env.best_layout, env.layout)])
assert (env.contained_constraints() == np.array([True, True] + [False]*(2*N-2))).all()
assert (env.action_mask() == np.array([False, False] + [True]*(2*N-1))).all()

Repeating the action, we remove the larger element and the two smaller ones remain.

new_state, bound, cost, _ = env.perform_action(N)
env.show_constraints()
2: [1 1 0 0 0]
3: [0 0 0 0 0]
env.layout
[array([0, 1]), array([1, 2]), array([3]), array([4])]

Doing so, we have reduced the cost associated with the state. However, the bound has remained the same.

bound, cost
(-20.99999999834299, 36)

This is one of the main motivations behind the work Certificates of quantum many-body properties assisted by machine learning. We can see this reflected in the environment’s references, which now show the minimum and maximum costs with which the best bound has ever been obtained.

env.best
array([-21.,  36.,  71.])

Exploration boundaries and errors

If we were to exceed the allowed computational budget with an action, we would obtain an error indicator. The error codes are:

- 0: all good!
- 1: no solution could be found for the associated SdP
- 2: the associated SdP cost exceeds the computational budget

env.perform_action(N+3)
new_state, bound, cost, err = env.perform_action(N+4)
env.show_constraints()
2: [1 1 1 1 1]
3: [0 0 0 1 1]
err, cost, budget
(2, 132, 100)
assert err == 2
assert cost > budget

Errors are the way through which we establish “walls” in the state space. This way, agents do not need to learn the boundaries of the feasible region nor keep track of them.

The environment prevents trespassing the state-space boundaries with the step method, which, unlike perform_action, takes a list of candidate actions rather than a single one.


source

SdPEnvironment.step

 SdPEnvironment.step (actions)

Receives a list of actions (priority ordered) and executes them until one succeeds.

Input:
- actions: collection of actions (iterable of ints).

Output:
- next_state: new state after performing the chosen action.
- action: action that was actually performed among the input ones.
- bound: bound associated to the new state.
- cost: cost associated to the new state.
- err: error code (should be zero).

The step method takes a list of priority-ordered actions and performs them in the given order until one succeeds, ensuring that we never land on an error-flagged state.

In order to test this out, let’s first reset the environment to the initial state.

env.reset()
env.show_constraints()
2: [0 0 0 0 0]
3: [0 0 0 0 0]
new_state, action, bound, cost, error = env.step([5, 6, 7, 0, 1])
print(f"The performed action was {action}.")
env.show_constraints()
The performed action was 5.
2: [1 1 0 0 0]
3: [1 0 0 0 0]
assert action == 5
assert error == 0

If we now try to add an additional large constraint, we will exceed the computational budget. However, the step method will perform the first suitable action.

actions_to_try = [8, 6, 3, 5]
new_state, action, bound, cost, error = env.step(actions_to_try)
print(f"The performed action was {action}.")
env.show_constraints()
The performed action was 3.
2: [1 1 0 1 0]
3: [1 0 0 0 0]
error
0
assert action == 3

The step method has skipped the first two actions in the priority list to avoid exceeding the budget. In the extreme case in which none of the actions provided can be executed, the function returns the results for the last action it tried, although the environment has not advanced. This should never happen if we provide all possible actions ranked and the state space is properly designed. Intuitively, there should never be rabbit holes the agent can’t escape from.
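
As a rough sketch of this guarantee (not taken from the library’s own examples), we can rank every action index covered by the action mask, here in a random order where an agent would instead rank them by its policy, and let step pick the first feasible one:

# Hedged sketch: assume the action space is as long as the action mask
# (all bit flips plus the "remain still" action) and rank it randomly.
all_actions = list(np.random.permutation(len(env.action_mask())))
new_state, action, bound, cost, error = env.step(all_actions)
assert error == 0  # with every action ranked, some valid move should exist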

Initial state

As we’ve seen before, the exploration starts at the zero state by default. However, in some cases it may be more convenient to start somewhere else. We can specify the initial constraint size to the SdPEnvironment such that the initial state comprises all the basis elements of such size.

Let’s see how this works in the current example.

env = SdPEnvironment(H, solver, budget, initial_size=2)
env.show_constraints()
2: [1 1 1 1 1]
3: [0 0 0 0 0]
assert (env.state[:N] == 1).all()
assert (env.state[N:] == 0).all()

Custom basis

We can have a major impact on the state space and its boundaries by specifying a layout basis of our choice. For instance, we may be interested in finding bounds to a problem using a certain kind of constraints.

Let’s take the previous example and set a specific basis.

layout_basis = np.array([np.array([0, 1]), np.array([2, 3]), np.array([3, 4, 5]), np.array([0, 1, 4, 5])], dtype=object)
env = SdPEnvironment(H, solver, budget, layout_basis=layout_basis)
env.layout_basis
array([array([0, 1]), array([2, 3]), array([3, 4, 5]),
       array([0, 1, 4, 5])], dtype=object)
assert np.all([(lb == le).all() for lb, le in zip(layout_basis, env.layout_basis)])

In this case, we have basis elements of different sizes with various overlaps. For instance, there are no 2-body elements contained in the 3-body one.

env.step([2])
env.show_constraints()
2: [0 0]
3: [1]
4: [0]

While sometimes we may want to use a fully customized basis, other times we simply care about truncating it. For instance, we may have a budget that allows us to reach up to four-body elements, but we wish to invest the resources into using exclusively as many three-body elements as possible. In these cases, we can specify a maximum size for the basis elements.

budget = 300
max_basis_size = 3
env_unrestricted = SdPEnvironment(H, solver, budget)
env_restricted = SdPEnvironment(H, solver, budget, max_basis_size=max_basis_size)
env_unrestricted.show_constraints()
2: [0 0 0 0 0]
3: [0 0 0 0 0]
4: [0 0 0 0 0]
env_restricted.show_constraints()
2: [0 0 0 0 0]
3: [0 0 0 0 0]
assert max([len(l) for l in env_restricted.layout_basis]) == max_basis_size

Rewards

Besides providing a consistent state space and exploration rules, the environment is in charge of providing feedback to the agent. When we initialize the environment, we choose the desired criterion for the reward. By default, we use the bound_norm_reward, with which we have obtained the best results so far and performed all the calculations in [1].

env.reward_fun.__name__
'bound_norm_reward'
assert env.reward_fun.__name__ == 'bound_norm_reward'

Given that, in general, we do not know the optimal relaxation, the environment relies on the references it has gathered during the exploration to provide the rewards. We can choose among:

- bound_reward: the reward is the obtained bound, regardless of the cost. It can take any arbitrary value depending on the problem.
- bound_norm_reward: the reward is a normalized function between zero and one which accounts for both the value of the bound and the associated cost.
- bound_improve_reward: the reward is the improvement of a bound with respect to the minimum one ever observed.

We select the criterion by providing the name of the function, excluding the '_reward' suffix.
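
For instance, here is a quick sketch of switching to the plain bound reward for the problem defined above:

# Passing 'bound' selects bound_reward, following the naming rule above.
env_bound = SdPEnvironment(H, solver, budget, reward_criterion='bound')
assert env_bound.reward_fun.__name__ == 'bound_reward'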

Memory

The environment implements a memory that stores the SdP solutions of all the visited states. This way, we do not need to instantiate the SdP and solve it every time we revisit a state, speeding up the whole process. We can save the memory with the save_memory method, and it will be automatically loaded when dealing with the same problem. The memory stores up to 1e6 solutions.

env.memory_path
Path('../memories/memory_Energy_xx_N5_Chain1D_terms_3.00_3.00_3.00_3.00_3.00_1.00_1.00_1.00_1.00_1.00.pkl')
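
For instance, once an exploration run is over, we can persist what has been gathered so far (a minimal sketch, assuming save_memory takes no required arguments):

# Store the solutions visited so far; they will be loaded automatically
# the next time we instantiate an environment for this same problem.
env.save_memory()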

References