はじめての機械学習をデプロイしてみた（後編）

投稿日 2020年7月7日
著者 aws-recipe-user
カテゴリー Amazon SageMaker
カテゴリー AWS

はじめに

前回の記事の Amazon sagemaker のチュートリアルの続きです。

次に、データを SageMaker インスタンスにダウンロードして、データフレームにロードする必要があります。
次のコードをコピーして実行します。

try:
  urllib.request.urlretrieve ("https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv", "bank_clean.csv")
  print('Success: downloaded bank_clean.csv.')
except Exception as e:
  print('Data load error: ',e)

try:
  model_data = pd.read_csv('./bank_clean.csv',index_col=0)
  print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

				
					
				1
2
3
4
5
6
7
8
9
10
11

						try:
  urllib.request.urlretrieve ("https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv", "bank_clean.csv")
  print('Success: downloaded bank_clean.csv.')
except Exception as e:
  print('Data load error: ',e)
 
try:
  model_data = pd.read_csv('./bank_clean.csv',index_col=0)
  print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

					

			

次に、データをシャッフルし、トレーニングデータとテストデータに分割します。

トレーニングデータ
（顧客の70％）がモデルのトレーニングループ中に使用されます。
勾配ベースの最適化を使用して、モデルパラメータを繰り返し調整します。
モデルエラーを最小化していきます。

テストデータ
（顧客の30％を、残りの）モデルの性能を評価し、目に見えないデータに、訓練されたモデルのデフォルト値を測定するために使用されます。

次のコードを新しいコードセルにコピーし、[run]を選択してデータをシャッフルおよび分割します。

train_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data))])
print(train_data.shape, test_data.shape)

				1
2

						train_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data))])
print(train_data.shape, test_data.shape)

ステップ4.データからモデルをトレーニングする

トレーニングデータセットを使用して機械学習モデルをトレーニングします。

ビルド済みXGBoostモデルを使用するには、トレーニングデータのヘッダーと最初の列を再フォーマットし、S3バケットからデータをロードする必要があります。

次のコードを新しいコードセルにコピーし、[Run]を選択してデータを再フォーマットしてロードします。

pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

				1
2
3

						pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

次に、Amazon SageMaker セッションをセットアップし、XGBoost モデルのインスタンス（推定器）を作成し、モデルのハイパーパラメーターを定義する必要があります。
次のコードを新しいコードセルにコピーし、[ 実行 ] を選択します。

sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(containers[my_region],role, train_instance_count=1, train_instance_type='ml.m4.xlarge',output_path='s3://{}/{}/output'.format(bucket_name, prefix),sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,eta=0.2,gamma=4,min_child_weight=6,subsample=0.8,silent=0,objective='binary:logistic',num_round=100)

				1
2
3

						sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(containers[my_region],role, train_instance_count=1, train_instance_type='ml.m4.xlarge',output_path='s3://{}/{}/output'.format(bucket_name, prefix),sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,eta=0.2,gamma=4,min_child_weight=6,subsample=0.8,silent=0,objective='binary:logistic',num_round=100)

データが読み込まれ、XGBoost 推定器が設定されたら、次のコードを次のコードセルにコピーして[実行]を選択することにより、ml.m4.xlarge インスタンスで勾配最適化を使用してモデルをトレーニングします。

数分後、生成されるトレーニングログの表示を開始する必要があります。

xgb.fit({'train': s3_input_train})

				1

						xgb.fit({'train': s3_input_train})

成功すればこのようにログが出てきます。

コンソール上でもこのように、トレーニングジョブが立ち上がっています。

ステップ5.モデルをデプロイする

このステップでは、トレーニング済みモデルをエンドポイントにデプロイし、CSVデータを再フォーマットしてロードしてから、モデルを実行して予測を作成します。

サーバーにモデルをデプロイし、アクセス可能なエンドポイントを作成するには、次のコードを次のコードセルにコピーして、[ 実行 ] を選択します。

xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

				1

						xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

銀行商品に登録されたテストデータの顧客がいるかどうかを予測するには、次のコードを次のコードセルにコピーして、[ Run ] を選択します。

test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values #load the data into an array
xgb_predictor.content_type = 'text/csv' # set the data type for an inference
xgb_predictor.serializer = csv_serializer # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array.shape)

				
					
				1
2
3
4
5
6

						test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values #load the data into an array
xgb_predictor.content_type = 'text/csv' # set the data type for an inference
xgb_predictor.serializer = csv_serializer # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array.shape)

					

			

ステップ6.モデルのパフォーマンスを評価する

このステップでは、機械学習モデルのパフォーマンスと精度を評価します。

以下のコードをコピーして貼り付け、[Run]を選択します。

混合行列と呼ばれるテーブルの実際の値と予測値を比較します。

cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
tn = cm.iloc[0,0]; fn = cm.iloc[1,0]; tp = cm.iloc[1,1]; fp = cm.iloc[0,1]; p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:&lt;20}{1:&lt;4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:&lt;15}{1:&lt;15}{2:&gt;8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:&lt;15}{1:&lt;2.0f}% ({2:&lt;}){3:&gt;6.0f}% ({4:&lt;})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:&lt;16}{1:&lt;1.0f}% ({2:&lt;}){3:&gt;7.0f}% ({4:&lt;}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))

				
					
				1
2
3
4
5
6
7

						cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
tn = cm.iloc[0,0]; fn = cm.iloc[1,0]; tp = cm.iloc[1,1]; fp = cm.iloc[0,1]; p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:&lt;20}{1:&lt;4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:&lt;15}{1:&lt;15}{2:&gt;8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:&lt;15}{1:&lt;2.0f}% ({2:&lt;}){3:&gt;6.0f}% ({4:&lt;})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:&lt;16}{1:&lt;1.0f}% ({2:&lt;}){3:&gt;7.0f}% ({4:&lt;}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))

					

			

予測に基づいて、テストデータの90％の顧客に対して、顧客が預金証書を正確に登録すると予測したと、結論付けることができます。
登録済みの精度は65％（278/429）で、90％（10,785 / 11,928）登録しませんでした。

ステップ7.リソースを終了する

このステップでは、Amazon SageMaker 関連のリソースを終了します。

重要※
アクティブに使用されていないリソースを終了すると、コストが削減されます。
リソースを終了しないと、料金が発生します。

Amazon SageMaker エンドポイントと S3 バケット内のオブジェクトを削除するには、次のコードをコピーして貼り付け、実行します。

sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()

				1
2
3

						sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()

感想

ただデータを用意するだけではなく、訓練したりテストしたり、様々なデータが機械学習には必要であることが分かったと思います。
さらにその精度を評価したり。本当に結果が正しいものかを推測する必要もあります。
今回はコンソール画面ではなく、Jupyter-notebook 上でほとんど完結する内容でした。
あまり使い慣れていない方でも気軽に試せるので、ぜひ実際にやってみてください。

公式リンク

Amazon SageMaker

SageMakerの導入ならナレコムにおまかせください。

日本のAPNコンサルティングパートナーとしては国内初である、Machine Learning コンピテンシー認定のナレコムが導入から活用方法までサポートします。お気軽にご相談ください。

ソリューションページはこちら

この記事を書いた人

aws-recipe-user

記事一覧