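Spring 2023 Training

In this session we study a template Othello (reversi) program as groundwork for reinforcement learning.

1. Environment setup

reversi environment setup procedure

Installation contents and reference URLs

(1) Install Anaconda

Follow anacondaインストール.pdf to set up the Anaconda environment.

(2) Install reversi

Follow reversiインストール.pdf to set up the reversi runtime environment.

(3) Install Jupyter

Follow jupyterインストール.pdf to set up the notebook environment.

2. Verifying the setup

(1) Start the notebook

① Start Anaconda Prompt.

② Switch to the reversi environment created during setup.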

> activate reversi

③ Start the notebook.

> jupyter notebook
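
If the notebook starts but the later imports fail, the wrong environment is usually active. The following is a minimal sketch for checking this from Python, assuming the packages are importable under the names `reversi` and `notebook`; `is_installed` is a hypothetical helper name, not part of any library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names are assumptions based on the imports used in this handout
    for name in ("reversi", "notebook"):
        status = "found" if is_installed(name) else "NOT found"
        print(f"{name}: {status}")
```

If `reversi` is reported as not found, re-run `activate reversi` before starting the notebook.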

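(2) Run the sample

From Jupyter Notebook, run the sample 01_tkinter_app.ipynb and confirm that the reversi window appears and that you can play a game.

Sample

(3) Run the minimal form

Enter the following code and confirm that the reversi window appears.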

from reversi import Reversi
Reversi().start()

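Note: In the menu, only user1 can be selected for Black and only user2 for White (human-vs-human mode only).

(4) Add AI opponents (AIs built into the library)

Enter the following code and confirm that RANDOM and GREEDY are added to the reversi menu.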

from reversi import Reversi
from reversi.strategies import Random, Greedy
Reversi(
    {
        'RANDOM': Random(),
        'GREEDY': Greedy(),
    }
).start()

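Note: In the menu, Black and White can now be set to RANDOM or GREEDY in addition to user1/user2 above.

- Random: an AI that plays a random move

- Greedy: an AI that plays the move that captures the most discs

Reference: reversi-master\reversi\strategies\easy.py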

import random
from reversi.strategies.common import AbstractStrategy
# ------------------------------------------
# Random: an AI that plays a random move
# ------------------------------------------
class Random(AbstractStrategy):
    def next_move(self, color, board):
        # Get the positions where a disc can be placed (get_legal_moves)
        moves = board.get_legal_moves(color)
        # Pick one of those coordinates at random
        return random.choice(moves)
# ------------------------------------------
# Greedy: an AI that plays the move that captures the most discs
# ------------------------------------------
class Greedy(AbstractStrategy):
    def next_move(self, color, board):
        # Get the positions where a disc can be placed (get_legal_moves)
        legal_moves = board.get_legal_moves(color)
        max_count = max([len(board.get_flippable_discs(color, *move)) for move in legal_moves])
        moves = [move for move in legal_moves if len(board.get_flippable_discs(color, *move)) == max_count]
        # Break ties by picking at random among the best moves
        return random.choice(moves)
# ------------------------------------------
# Unselfish: an AI that plays the move that captures the fewest discs
# ------------------------------------------
class Unselfish(AbstractStrategy):
    def next_move(self, color, board):
        # Get the positions where a disc can be placed (get_legal_moves)
        legal_moves = board.get_legal_moves(color)
        min_count = min([len(board.get_flippable_discs(color, *move)) for move in legal_moves])
        moves = [move for move in legal_moves if len(board.get_flippable_discs(color, *move)) == min_count]
        # Break ties by picking at random among the candidate moves
        return random.choice(moves)
# ------------------------------------------
# SlowStarter: in the opening (while fewer than 15% of the squares hold a disc),
# plays the move that captures the fewest discs; after that, plays the move that captures the most
# ------------------------------------------
class SlowStarter(AbstractStrategy):
    def __init__(self):
        self.unselfish = Unselfish()
        self.greedy = Greedy()
    def next_move(self, color, board):
        squares = board.size**2
        blanks = sum([row.count(0) for row in board.get_board_info()])
        # Opening (fewer than 15% of the squares hold a disc)
        if (squares-blanks)/squares < 0.15:
            return self.unselfish.next_move(color, board)
        # Otherwise
        return self.greedy.next_move(color, board)
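
To make the 15% threshold in SlowStarter concrete: on an 8x8 board, 15% of 64 squares is 9.6, so the opening rule applies while 9 or fewer discs are on the board. The sketch below reproduces only that arithmetic on a plain list of lists. It assumes the board info encodes black as 1, white as -1 and empty as 0; `disc_ratio` and `is_opening` are hypothetical helper names.

```python
def disc_ratio(board_info):
    """Fraction of squares that hold a disc (0 marks an empty square)."""
    squares = sum(len(row) for row in board_info)
    blanks = sum(row.count(0) for row in board_info)
    return (squares - blanks) / squares


def is_opening(board_info, threshold=0.15):
    """True while fewer than `threshold` of the squares are occupied."""
    return disc_ratio(board_info) < threshold


# 8x8 starting position: 1 = black, -1 = white, 0 = empty (assumed encoding)
start = [[0] * 8 for _ in range(8)]
start[3][3], start[4][4] = -1, -1
start[3][4], start[4][3] = 1, 1

print(disc_ratio(start))   # 4 / 64 = 0.0625
print(is_opening(start))   # True
```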

(5) Add your own logic

① Enter the following code to create a custom class (it always takes a corner when one is available).

import random
from reversi import Reversi
from reversi.strategies import AbstractStrategy
class Corner(AbstractStrategy):
    def next_move(self, color, board):
        size = board.size
        # Get the positions where a disc can be placed (get_legal_moves)
        legal_moves = board.get_legal_moves(color)
        # If any of the four corners is a legal move, return it
        for corner in [(0, 0), (0, size-1), (size-1, 0), (size-1, size-1)]:
            if corner in legal_moves:
                return corner
        # Otherwise, pick a coordinate at random
        return random.choice(legal_moves)

② Enter the following code and confirm that CORNER is added to the reversi menu.

from reversi import Reversi
from reversi.strategies import Random, Greedy
Reversi(
    {
        'RANDOM': Random(),
        'GREEDY': Greedy(),
        'CORNER': Corner(),
    }
).start()

Note: In the menu, CORNER is now available for Black and White in addition to user1/user2/RANDOM/GREEDY above.

Description of get_legal_moves (from the GitHub README)
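
As background for what a "legal move" means, the sketch below computes legal moves and flippable discs on a plain list-of-lists board. It is an illustration of the reversi rule only, not the library's implementation; the function names, the `board[y][x]` layout and the 1 / -1 / 0 encoding are assumptions made for this example.

```python
# Eight directions as (dx, dy) steps
DIRECTIONS = [(-1, -1), (0, -1), (1, -1), (-1, 0), (1, 0), (-1, 1), (0, 1), (1, 1)]


def flippable_discs(board, player, x, y):
    """Return the opponent discs that would flip if `player` plays at (x, y)."""
    size = len(board)
    if board[y][x] != 0:
        return []
    flipped = []
    for dx, dy in DIRECTIONS:
        line = []
        cx, cy = x + dx, y + dy
        # Walk over a run of opponent discs
        while 0 <= cx < size and 0 <= cy < size and board[cy][cx] == -player:
            line.append((cx, cy))
            cx, cy = cx + dx, cy + dy
        # The run only counts if it is closed by one of our own discs
        if line and 0 <= cx < size and 0 <= cy < size and board[cy][cx] == player:
            flipped.extend(line)
    return flipped


def legal_moves(board, player):
    """Return every (x, y) where `player` would flip at least one disc."""
    size = len(board)
    return [(x, y) for y in range(size) for x in range(size)
            if flippable_discs(board, player, x, y)]


# 4x4 starting position: 1 = black, -1 = white, 0 = empty
board = [
    [0, 0, 0, 0],
    [0, -1, 1, 0],
    [0, 1, -1, 0],
    [0, 0, 0, 0],
]
print(legal_moves(board, 1))  # [(1, 0), (0, 1), (3, 2), (2, 3)]
```

Greedy and Unselfish above rank the legal moves by the length of the flippable-disc list for each move.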

(6) Simulate matches

The library can also simulate matches instead of playing them on screen. Here we pit Random against Greedy.

① Enter the following code.

import timeit
from reversi import Simulator, strategies

if __name__ == '__main__':
    simulator = Simulator(
        {
            'Random': strategies.Random(),
            'Greedy': strategies.Greedy(),
        },
        './simulator_setting.json',
    )

    elapsed_time = timeit.timeit('simulator.start()', globals=globals(), number=1)
    print(simulator, elapsed_time, '(s)')

    if simulator.processes == 1:
        keys = strategies.Measure.elp_time.keys()
        for key in keys:
            print()
            print(key)
            print(' min :', strategies.Measure.elp_time[key]['min'], '(s)')
            print(' max :', strategies.Measure.elp_time[key]['max'], '(s)')
            print(' ave :', strategies.Measure.elp_time[key]['ave'], '(s)')

② Place the following settings file (simulator_setting.json) in the folder you run the code from.

{
    "board_size": 8,
    "board_type": "bitboard",
    "matches": 50,
    "processes": 2,
    "parallel": "player",
    "random_opening": 0,
    "player_names": [
        "Random",
        "Greedy"
    ]
}
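
When you want to vary the number of matches or the players between runs, generating the settings file from Python avoids hand-editing typos. This is a minimal sketch using only the standard library; `write_simulator_settings` is a hypothetical helper, and the key is spelled `parallel` here on the assumption that this is the spelling the library reads, which is worth checking against the README.

```python
import json
from pathlib import Path


def write_simulator_settings(path, player_names, matches=50, processes=2):
    """Write a simulator settings file and return the dict that was written."""
    settings = {
        "board_size": 8,
        "board_type": "bitboard",
        "matches": matches,
        "processes": processes,
        "parallel": "player",   # assumed spelling of the key; check the README
        "random_opening": 0,
        "player_names": list(player_names),
    }
    Path(path).write_text(json.dumps(settings, indent=4), encoding="utf-8")
    return settings
```

For example, `write_simulator_settings('./simulator_setting.json', ['Random', 'Greedy'])` writes the same values as the file shown above. Note that the simulation code in ① prints the per-strategy timings only when `processes` is 1.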

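[Parameter descriptions]

③ Execution result

(7) Add your own AI logic

① Enter the following code to create a custom class (reinforcement-learning version, written by Kashiwagi).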

import random
import os
import numpy as np
import pickle
from reversi import Reversi, strategies

class Kashiwagi(strategies.common.AbstractStrategy):
    def __init__(self):
        self.size = None
        self.color = None
    
    def next_move(self, color, board):
        self.size = board.size
        self.color = color
        move = None
        legal_moves = board.get_legal_moves(self.color)
        line_info = board.get_board_line_info(self.color)

        q_table_path = './table/q_table_{}_{}.txt'.format(self.size, self.color)
        i_table_path = './table/i_table_{}_{}.txt'.format(self.size, self.color)
        action_table_path = './table/action_table_{}_{}.txt'.format(self.size, self.color)
        q_table = None
        i_table = None
        action_table = None
        q_index = None
        action_index = None
        
        
        
        # Check whether a Q-table for this board size and color already exists.
        if os.path.exists(q_table_path):
            # pickle is a convenient way to store these. See https://www.robotech-note.com/entry/2016/10/01/180840
            q_table = pickle.load(open(q_table_path, 'rb'))
            # q_table = np.load(q_table_path) # load it if it exists
            i_table = pickle.load(open(i_table_path, 'rb'))
            # print(q_table_path + " loaded")
        else:
            q_table = [] # start with an empty table
            i_table = []
            # print(q_table_path + " created empty")
            
            
        # Manipulate the board to enumerate positions that are effectively identical
        # (omitted)


        # Check whether this position already has an entry in the Q-table
        if not any(line_info == i[0] for i in i_table):
            # No entry to refer to, so add one
            i_table.append(self.make_i_record(line_info, len(q_table)))
            q_table.append(self.make_q_record(len(legal_moves)))

        # Look up the Q-table row for this position
        for i in i_table:
            if line_info == i[0]:
                q_index = int(i[1])
                break

        # Decide the action: the move with the largest Q-value
        action_index = np.argmax(q_table[q_index])
        move = legal_moves[action_index]

        # Save the tables to file
        pickle.dump(q_table, open(q_table_path, 'wb'))
        pickle.dump(i_table, open(i_table_path, 'wb'))

        # Save the action
        if os.path.exists(action_table_path):
            action_table = pickle.load(open(action_table_path, 'rb'))
        else:
            action_table = []
            
        action_table.append((q_index, action_index))
        pickle.dump(action_table, open(action_table_path, 'wb'))
            
#         print(legal_moves)
#         print(q_table[q_index])
#         print(np.argmax(q_table[q_index]))
#         print(legal_moves[np.argmax(q_table[q_index])])
#         print(move)
#         board.get_board_line_info(color)
        
#         print("board")
#         print(board)
#         print("legal_moves")
#         print(legal_moves)
        
#         print(board.get_board_line_info(color))

        return move

    def make_i_record(self, line, q_len):
        i_record = np.empty(0)
        i_record = np.append(i_record, line)
        i_record = np.append(i_record, q_len)
        return i_record

    def make_q_record(self, legal_moves_len):
        q_record = np.empty(0)
        for i in range(legal_moves_len):
            q_record = np.append(q_record, random.uniform(-1, 1))
        return q_record


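    # Processing at the end of a game
    # (result contains the following information)
    # result.winlose : match result (0 = black wins, 1 = white wins, 2 = draw)
    # result.black_name : name of the black AI
    # result.white_name : name of the white AI
    # result.black_num : number of black discs
    # result.white_num : number of white discs
    def get_result(self, result):
        # Determine the reward
        full_num = result.black_num + result.white_num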
    
        if result.winlose == 0:
            # Black wins
            if self.color == 'black':
                reward = self.win_reward_cal(result.black_num - result.white_num, full_num)
            elif self.color == 'white':
                reward = self.lose_reward_cal(result.white_num - result.black_num, full_num)

        elif result.winlose == 1:
            # White wins
            if self.color == 'black':
                reward = self.lose_reward_cal(result.black_num - result.white_num, full_num)
            elif self.color == 'white':
                reward = self.win_reward_cal(result.white_num - result.black_num, full_num)

        elif result.winlose == 2:
            # Draw
            reward = -5
            
        q_table_path = './table/q_table_{}_{}.txt'.format(self.size, self.color)
        action_table_path = './table/action_table_{}_{}.txt'.format(self.size, self.color)

        q_table = pickle.load(open(q_table_path, 'rb'))
        action_table = pickle.load(open(action_table_path, 'rb'))
        
        # Walk through the action table from the start and apply the reward
        for i, action in enumerate(action_table):
            if len(action_table) == i+1:
                break
            next_action = action_table[i + 1]
            q_table = self.update_q_table(q_table, action, reward, next_action)


        pickle.dump(q_table, open(q_table_path, 'wb'))
        action_table = []
        pickle.dump(action_table, open(action_table_path, 'wb'))


    def win_reward_cal(self, diff, full_num):
        max_move = (self.size * self.size)
        early_finish = max_move - full_num
        reward = diff * 10 + early_finish * 10
        return reward


    def lose_reward_cal(self, diff, full_num):
        max_move = (self.size * self.size)
        early_finish = max_move - full_num
        reward = diff * 10 - early_finish * 10
        return reward


    # Update the Q-table
    def update_q_table(self, q_table, action, reward, next_action):
        alpha = 0.2      # learning rate
        gamma = 0.99     # discount factor
        next_Max_Q = q_table[next_action[0]][np.argmax(q_table[next_action[0]])]
        q_table[action[0]][action[1]] = (1 - alpha) * q_table[action[0]][action[1]] +\
            alpha * (reward + gamma * next_Max_Q)

        return q_table
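
The update in update_q_table has the standard Q-learning form Q(s, a) ← (1 - α)·Q(s, a) + α·(r + γ·max Q(s', ·)). The sketch below works one update by hand on a tiny table so the numbers can be checked. It uses plain lists instead of NumPy arrays, and `q_update` is a hypothetical standalone function, not part of the class.

```python
ALPHA = 0.2   # learning rate, same value as in the class above
GAMMA = 0.99  # discount factor, same value as in the class above


def q_update(q_table, action, reward, next_action):
    """Apply one Q-learning update in place and return the new value."""
    state, move = action
    next_state = next_action[0]
    # Best value reachable from the next recorded state
    next_max_q = max(q_table[next_state])
    q_table[state][move] = (
        (1 - ALPHA) * q_table[state][move]
        + ALPHA * (reward + GAMMA * next_max_q)
    )
    return q_table[state][move]


# Two states with two moves each
q_table = [[0.0, 1.0], [0.5, 2.0]]
new_value = q_update(q_table, action=(0, 1), reward=10, next_action=(1, 0))
print(round(new_value, 3))  # 0.8 * 1.0 + 0.2 * (10 + 0.99 * 2.0) = 3.196
```

Two properties of get_result are worth keeping in mind when reading the class: the same end-of-game reward is applied to every recorded step, and the loop stops before the last recorded action, so the final move of a game is never updated.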

② Enter the following code and confirm that KASIWAGI is added to the reversi menu.

from reversi import Reversi
from reversi.strategies import Random, Greedy
Reversi(
    {
        'RANDOM': Random(),
        'GREEDY': Greedy(),
        'KASIWAGI': Kashiwagi(),
    }
).start()

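Note: In the menu, KASIWAGI is now available for Black and White in addition to user1/user2/RANDOM/GREEDY. (CORNER appears only if you also add it to the dictionary above.)

③ Unzip the following file and place it as the table folder at the same level as the source code.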

table.zip
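
The Kashiwagi class reads and writes its tables with pickle under ./table, opening the files without closing them, and `open(path, 'wb')` raises FileNotFoundError if the table folder does not exist. The following is a minimal sketch of load/save helpers that close the files and create the folder on demand; `load_table` and `save_table` are hypothetical names, not part of the class or the library.

```python
import os
import pickle


def load_table(path, default=None):
    """Load a pickled table, or return `default` (an empty list) if missing."""
    if not os.path.exists(path):
        return [] if default is None else default
    with open(path, "rb") as f:
        return pickle.load(f)


def save_table(table, path):
    """Pickle a table, creating the parent folder first if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(table, f)
```

Only unpickle files you trust, such as the provided table.zip or tables your own runs have written, because loading a pickle can execute arbitrary code.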

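④ [Supplement] The get_result method

If an AI implements a get_result method, the simulator passes it the result of each match as the simulation runs, so the AI can process that result in some way.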
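
To show how a result turns into a reward, the sketch below restates the win/lose formulas from win_reward_cal and lose_reward_cal as one standalone function. The `SimpleNamespace` object is a hypothetical stand-in for the result the simulator passes in, carrying only the fields listed in the comments of the class; `final_reward` is also a hypothetical name.

```python
from types import SimpleNamespace


def final_reward(result, my_color, size=8):
    """Reward from one finished game, following the win/lose formulas above."""
    if result.winlose == 2:            # draw
        return -5
    mine = result.black_num if my_color == 'black' else result.white_num
    theirs = result.white_num if my_color == 'black' else result.black_num
    # Squares left empty when the game ended
    early_finish = size * size - (result.black_num + result.white_num)
    # winlose: 0 = black wins, 1 = white wins
    won = (result.winlose == 0) == (my_color == 'black')
    diff = mine - theirs
    if won:
        return diff * 10 + early_finish * 10
    return diff * 10 - early_finish * 10


# Hypothetical stand-in for the result object the simulator passes in
result = SimpleNamespace(winlose=0, black_num=40, white_num=20)
print(final_reward(result, 'black'))  # (40 - 20) * 10 + (64 - 60) * 10 = 240
print(final_reward(result, 'white'))  # (20 - 40) * 10 - (64 - 60) * 10 = -240
```

With these formulas a draw (-5) scores below any win and above any loss, since a win has a disc margin of at least 1 and a loss a margin of at most -1.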

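(8) Create your own match logic

① Referring to the following materials, create your own match logic.

Author's GitHub README

Author's GitHub

reversi specification walkthrough (written by Kashiwagi)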
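
As one possible starting point, here is a sketch of a strategy that extends the Corner idea: prefer corners, then edges, then any legal move. It is written as a plain class with a stub board so it runs without the library; to plug it into Reversi it would need to inherit from AbstractStrategy like the Corner class in (5). `CornerThenEdge` and `StubBoard` are hypothetical names for this example.

```python
import random


class CornerThenEdge:
    """Prefer corners, then edges, otherwise any legal move."""

    def next_move(self, color, board):
        last = board.size - 1
        legal_moves = board.get_legal_moves(color)
        # A corner has both coordinates on the border
        corners = [m for m in legal_moves if m[0] in (0, last) and m[1] in (0, last)]
        if corners:
            return random.choice(corners)
        # An edge square has at least one coordinate on the border
        edges = [m for m in legal_moves if m[0] in (0, last) or m[1] in (0, last)]
        if edges:
            return random.choice(edges)
        return random.choice(legal_moves)


class StubBoard:
    """Hypothetical stand-in exposing only what the strategy needs."""

    def __init__(self, size, moves):
        self.size = size
        self._moves = moves

    def get_legal_moves(self, color):
        return list(self._moves)


board = StubBoard(8, [(2, 3), (0, 5), (7, 7)])
print(CornerThenEdge().next_move('black', board))  # (7, 7), the only corner
```

Running a strategy like this against Random and Greedy in the simulator from (6) is a quick way to see whether a change actually helps.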