Automagicaを試してみる

2019-10-22

ToolsSelenium

Automagica, Python, RPA

参考
メモ
- 総合的感想
- Getting startedを試す
  - 環境構築
  - Example 1の実行

参考

メモ

総合的感想

用途ごとに個別に開発されているライブラリを統合して使えるのは便利。エントリポイントが統一化され、 from automagica import * ですべての機能が使えるのは画期的である。しかし以下の点は少し課題を感じた。

ユーザ目線で似たもの同士に感じる機能に関し、APIが異なる
- 具体的にはWord、Excel操作
- ただし内部的に用いられるライブラリが別々のものなので致し方ない。（うまくすれば、APIレベルでも統合可能かもしれない）
内部的に用いられているライブラリのAPIを直接利用したいケースがあるが少しわかりづらい
- Automagicaが提供しているAPIよりも、内部的に用いられている元ネタのライブラリのAPIの方が機能豊富（少なくとも docx ライブラリはそうだと思う）
- ということから、まずはAutomagicaのAPIを基本として使用しつつ、必要に応じて内部で用いられているライブラリのAPIを直接利用するのが良さそう。実際に OpenWordDocument メソッドの戻り値は docx.Document クラスのインスタンス。しかし1個目の課題と関連するが、その使い方を想起させるドキュメントやAPI仕様に見えない。 Word activities以外ではどうなのか、は、細かくは未確認。
ドキュメントにバグがちらほら見られるのと、ドキュメント化されていない機能がある。
- これは改善されるだろう

Getting startedを試す

環境構築

AutomagicaのGetting started を参考にしつつ、動かしてみる。 Anacondaにはパッケージが含まれていなかった（conda-forge含む）のでpipenvを使って環境づくり。

1
2
3

> pipenv.exe --python 3.7
> pipenv.exe install jupyter
> pipenv.exe install https://github.com/OakwoodAI/automagica/tarball/master

なお、Python3.8系では依存関係上の問題が生じたので、いったん切り分けも兼ねて3.7で環境構築している。 AutomagicaのGetting started では、Tesseract4のインストールもオプションとして示されていたが、いったん保留とする。

Example 1の実行

Example 1 に載っていた例を動かそうとしたが、一部エラーが生じたので手直しして実行した。

エラー生じた箇所1：

ドキュメント上は、以下のとおり。

1	lookup_terms.append(ExcelReadCell(excel_path, 2, col))

しかし、行と列の番号で指定するAPIは、 ExcelReadRowCol である。また、例に示されていたエクセルでは1行目に検索対象文字列が含まれているため、第2引数は 2 ではなく、 1 である。したがって、

1	lookup_terms.append(ExcelReadRowCol(excel_path, 1, col))

とした。なお、ExcelReadCellメソッド、ExcelReadRowColメソッドの実装は以下の通り。（2019/10/24現在）

automagica/activities.py:639

def ExcelReadCell(path, cell="A1", sheet=None):
    '''
    Read a cell from an Excel file and return its value.
    Make sure you enter a valid path e.g. "C:\\Users\\Bob\\Desktop\\RPA Examples\\data.xlsx".
    The cell you want to read needs to be defined by a cell name e.g. "A2". The third variable 
    is a string with the name of the sheet that needs to be read. If omitted, the 
    function reads the entered cell of the active sheet.
    '''
    workbook = load_workbook(path)
    if sheet:
        worksheet = workbook.get_sheet_by_name(sheet)
    else:
        worksheet = workbook.active

    return worksheet[cell].value

automagica/activities.py:656

def ExcelReadRowCol(path, r=1, c=1, sheet=None):
    '''
    Read a Cell from an Excel file and return its value.
    Make sure you enter a valid path e.g. "C:\\Users\\Bob\\Desktop\\RPA Examples\\data.xlsx".
    The cell you want to read needs to be row and a column. E.g. r = 2 and c = 3 refers to cell C3. 
    The third variable needs to be a string with the name of the sheet that needs to be read. 
    If omitted, the function reads the entered cell of the active sheet. First row is defined 
    row number 1 and first column is defined column number 1.
    '''
    workbook = load_workbook(path)
    if sheet:
        worksheet = workbook.get_sheet_by_name(sheet)
    else:
        worksheet = workbook.active

    return worksheet.cell(row=r, column=c).value

エラーが生じた箇所2：

ドキュメント上では、読み込んだ文字列のリストをそのまま使っているが、そのままではリスト後半に None が混ざることがある。そこで、リストをフィルタするようにした。

1	lookup_terms = list(filter(lambda x: x is not None, lookup_terms))

エラーが生じた箇所3：

ドキュメント上では、以下の通り。

1	ExcelWriteCell(excel_path, row=i+2, col=j+2, write_value=url)

これも読み込みの場合と同様にAPIが異なる。合わせて引数名が異なる。そこで、

1	ExcelWriteRowCol(excel_path, r=i+2, c=j+2, write_value=url)

とした。

Seleniumを試してみる

2019-10-19

ToolsSelenium

Python, Slenium

参考
メモ
- 環境構築
- 便利な情報源
  - Selenium関連
  - おまけ

参考

メモ

日常生活の簡単化のために使用。【超便利】PythonとSeleniumでブラウザを自動操作する方法まとめを試す。

環境構築

Condaを使って環境を作り、seleniumをインストールした。 WebDriverは自分の環境に合わせて、Chrome77向けのものを使用したかっただが、 WSLでChromeを使おうとすると、--no-sandboxをつけないと行けないようだ。詳しくは、ブラウザのヘッドレスモードでスクショ及び【WSL】にFirefox&Google-Chromeをインストールする方法！を参照。

geckodriver を使うことにした。

また、当初は【超便利】PythonとSeleniumでブラウザを自動操作する方法まとめを参考にしようとしたが、途中から PythonからSeleniumを使ってFirefoxを操作してみるを参考にした。

動作確認ログは、 selenium_getting_started を参照。

便利な情報源

Selenium関連

PythonからSeleniumを使ってFirefoxを操作してみる
- Firefoxを使ってみる例
geckodriver
- FirefoxのWebDriverを得る場所
【WSL】にFirefox&Google-Chromeをインストールする方法！
- WSLにブラウザを導入する例
- Chromeでの苦労についても触れられている
PythonでSeleniumを使ってスクレイピング (基礎)
- スクレイピングの面で使いそうな基本機能の説明
Seleniumで要素を選択する方法まとめ
- 要素選択の基本的な例
- 汎用性が高いXPathを使う例も記載
4. 要素を見つける
- 上記のXPathを使う例なども含め、網羅的に記載されている。
Seleniumで待機処理するお話
- waitする方法

おまけ

Pythonで取得したWebページのHTMLを解析するはじめの一歩
- おそらくBeautiful SoupによるHTML解析も伴うことが多いだろう、ということで。

Conferences at Dec. 2019

2019-10-14

参考
メモ

参考

Middleware 2019

LISA19

USENIX ATC'20

USENIX ATC20

IEEE Big Data 2019

IEEE Big Data 2019

メモ

気になるカンファレンスをメモ。

ACM/IFIP International Conference MIDDLEWARE 2019

2019/12/9 - 13

UC DAVIS, CA, US

悪くなさそう。

趣旨

The annual Middleware conference is a major forum for the discussion of innovations and recent scientific advances of middleware systems with a focus on the design, implementation, deployment, and evaluation of distributed systems, platforms and architectures for computing, storage, and communication. Highlights of the conference will include a high quality single-track technical program, invited speakers, an industrial track, panel discussions involving academic and industry leaders, poster and demonstration presentations, a doctoral symposium, tutorials and workshops.

ということから、ミドルウェアに関する技術要素についてのカンファレンスと理解。

キーノートスピーカー

Brian F. Cooper (Google)
- Title: Why can't the bits sit still? The never ending challenge of data migrations.
Monica Lam (STANFORD UNIVERSITY)
- Title: Building the Smartest and Open Virtual Assistant to Protect Privacy
Ion Stoica (BERKELEY)
- Title: To unify or to specialize?

採録論文

Middleware 2019 - Accepted Papers に一覧が載っている。

気になった論文は以下の通り。

Scalable Data-structures with Hierarchical, Distributed Delegatio
FfDL: A Flexible Multi-tenant Deep Learning Platform
Monitorless: Predicting Performance Degradation in Cloud Applications with Machine Learning
Automating Multi-level Performance Elastic Components for IBM Streams
Self-adaptive Executors for Big Data Processing
Differential Approximation and Sprinting for Multi-Priority Big Data Engines
Combining it all: Cost minimal and low-latency stream processing across distributed heterogenous infrastructures
ReLAQS: Reducing Latency for Multi-Tenant Approximate Queries via Scheduling
Medley: A Novel Distributed Failure Detector for IoT Network

LISA19

概要

LISA is the premier conference for operations professionals, where we share real-world knowledge about designing, building, securing, and maintaining the critical systems of our interconnected world.

とのことであり、開発や運用に関するカンファレンス。

Conference Program

LISA19 conference program

気になったセッションは以下の通り。

Pulling the Puppet Strings with Ansible
Linux Systems Performance
The Challenges of Managing Open Source Infrastructure at Bloomberg
Jupyter Notebooks for Ops
その他Kubernetes関連のセッションが複数

2020 USENIX Conference on Operational Machine Learning

USENIX OpML'20

May 1, 2020

Santa Clara, CA, United States Hyatt Regency Santa Clara

USENIX ATC '20

USENIX ATC20

JULY 1517, 2020 BOSTON, MA, USA

IEEE Big Data 2019

Los Angeles, CA, USA 9-12, Dec.

気にはなる。

参考： IEEE Big Data 2018の気になった論文リスト

2018年度の気になった論文によると、情報システムというよりは要素技術や分析手法などの方が主旨のようだ。

Scraping with session information using Python

2019-10-14

Open DataScraping

Open Data, Python, Scraping, Session

参考
メモ

参考

RequestsでSessionモード

メモ

あとでメモを追加する

Efficient and Robust Automated Machine Learning

2019-09-22

Machine LearningAutoML

AutoML, Machine Learning, Machine Learning Lifecycle/

参考
論文メモ
動作確認

参考

論文メモ

気になったところをメモする。

概要

いわゆるAutoMLの一種。特徴量処理、アルゴリズム選択、ハイパーパラメータチューニングを実施。さらに、メタ学習とアンサンブル構成も改善として対応。 scikit-learnベースのAutoML。

auto-sklearnと呼ぶ。

GitHub上にソースコードが公開されている。

同様のデータセットに対する過去の性能を考慮し、アンサンブルを構成する。

ChaLearn AutoMLチャレンジで勝利。

OpenMLのデータセットを用いて、汎用性の高さを確認した。

2019/11時点では、識別にしか対応していない？

1. Introduction

以下のあたりのがAutoML周りのサービスとして挙げられていた。

1 2	BigML.com, Wise.io, SkyTree.com, RapidMiner.com, Dato.com, Prediction.io, DataRobot.com, Microsoft’s Azure Machine Learning, Google’s Prediction API, and Amazon Machine Learning

Auto-wekaをベースとしたもの。

パラメトリックな機械学習フレームワークと、ベイジアン最適化を組み合わせて用いる。

特徴

新しいデータセットに強い。ベイジアン最適化をウォームスタートする。
自動的にモデルをアンサンブルする
高度にパラメタライズされたフレームワーク
数多くのデータセットで実験

2 AutoML as a CASH problem

ここで扱う課題をCombined Algorithm Selection and Hyperparameter optimization (CASH) と呼ぶ。

ポイントは、

ひとつの機械学習手法がすべてのデータセットに対して有効ではない
機械学習手法によってはハイパーパラメータチューニングが必要

ということ。それらを単一の最適化問題として数式化できる。

Auto-wekaで用いられたツリーベースのベイジアン最適化は、ガウシアンプロセスモデルを用いる。低次元の数値型のハイパパラメータには有効。しかし、ツリーベースのアルゴリズムは高次元で離散値にも対応可能。 Hyperopt-sklearnのAutoMLシステムで用いられている。

提案手法では、ランダムフォレストベースのSMACを採用。

3 New methods for increasing efﬁciency and robustness of AutoML

提案手法のポイント

ベイジアン最適化のウォームスタート
アンサンブル構築の自動化

まずウォームスタートについて。

メタ学習は素早いが粗い。ベイジアン最適化はスタートが遅いが精度が高い。これを組み合わせる。

既存手法でも同様の試みがあったが、複雑な機械学習モデルや多次元のパラメータには適用されていない？

OpenML を利用し、メタ学習。

つづいてアンサンブル構築について。

単一の結果を用いるよりも、アンサンブルを組んだほうが良い。アンサンブル構成の際には重みを用いる。重みの学習にはいくつかの手法を試した。 Caruana et al. のensemble selectionを利用。

5 Comparing AUTO-SKLEARN to AUTO-WEKA and HYPEROPT-SKLEARN

AUTO-SKLEARNをそれぞれ既存手法であるAUTO-WEKAとHYPEROPT-SKLEARNと比較。（エラーレート？で比較）

AUTO-WEKAに比べて概ね良好。一方HYPEROPT-SKLEARNは正常に動かないことがあった。

6 Evaluation of the proposed AutoML improvements

OpenMLレポジトリのデータを使って、パフォーマンスの一般性を確認。テキスト分類、数字や文字の認識、遺伝子やRNAの分類、広告、望遠鏡データの小片分析、組織からのがん細胞検知。

メタ学習あり・なし、アンサンブル構成あり・なしで動作確認。

7 Detailed analysis of AUTO-SKLEARN components

ひとつひとつの手法を解析するのは時間がかかりすぎた。

ランダムフォレスト系統、AdaBoost、Gradient boostingがもっともロバストだった。一方、SVMが特定のデータセットで性能高かった。

どのモデルも、すべてのケースで性能が高いわけではない。

個別の前処理に対して比較してみても、AUTO-SKLEARNは良かった。

動作確認

依存ライブラリのインストール

auto-sklearnのドキュメントを参考に動かしてみる。

予め、ビルドツールをインストールしておく。

1	$ sudo apt-get install build-essential swig

pipenv使って環境構築

pipenvを使って、おためし用の環境を作る。

1	$ pipenv --python 3.7

依存関係とパッケージをインストールしようとしたらエラーで試行錯誤

今回はpipenvを使うので、公式手順を少し修正して実行。（pip -> pipenvとした）

1	curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt \| xargs -n 1 -L 1 pipenv install

なお、requirementsの中身は以下の通り。

setuptools
nose
Cython

numpy>=1.9.0
scipy>=0.14.1

scikit-learn>=0.21.0,<0.22

lockfile
joblib
psutil
pyyaml
liac-arff
pandas

ConfigSpace>=0.4.0,<0.5
pynisher>=0.4.2
pyrfr>=0.7,<0.9
smac==0.8

論文の通り、SMACが使われるようだ。

つづいて、auto-sklearn自体をインストール。（最初から、これを実行するのではだめなのだろうか？勝手に依存関係を解決してくれるのでは？）

1	$ pipenv run pip install auto-sklearn

なお、最初に pipenv install auto-sklearn していたのだがエラーで失敗したので、上記のようにvirtualenv環境下でpipインストールすることにした。エラー内容は以下の通り。

(snip)

[pipenv.exceptions.ResolutionFailure]:       req_dir=requirements_dir
[pipenv.exceptions.ResolutionFailure]:   File "/home/dobachi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/pipenv/utils.py", line 726, in resolve_deps
[pipenv.exceptions.ResolutionFailure]:       req_dir=req_dir,
[pipenv.exceptions.ResolutionFailure]:   File "/home/dobachi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/pipenv/utils.py", line 480, in actually_resolve_deps
[pipenv.exceptions.ResolutionFailure]:       resolved_tree = resolver.resolve()
[pipenv.exceptions.ResolutionFailure]:   File "/home/dobachi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/pipenv/utils.py", line 395, in resolve
[pipenv.exceptions.ResolutionFailure]:       raise ResolutionFailure(message=str(e))
[pipenv.exceptions.ResolutionFailure]:       pipenv.exceptions.ResolutionFailure: ERROR: ERROR: Could not find a version that matches scikit-learn<0.20,<0.22,>=0.18.0,>=0.19,>=0.21.0

(snip)

作業効率を考え、Jupyterをインストールしておく。

1	$ pipenv install jupyter

つづいて公式ドキュメントの手順を実行。

以下を実行したらエラーが出た。

1	import autosklearn.classification

1	ModuleNotFoundError: No module named '_bz2'

scikit-learn で No module named '_bz2' というエラーがでる問題の通り、 bzip2に関する開発ライブラリが足りていない、ということらしい。パッケージをインストールし、Pythonを再インストール。pipenvのvirtualenv環境も削除。その後、pipenvで環境を再構築。

$ sudo apt install libbz2-dev
$ pyenv install 3.7.5
$ pipenv --rm
$ pipenv install

なお、sklearnのバージョンでやはり失敗する？よくみたら、pypiから2019/11/1時点でインストールできるライブラリは0.5.2であるが、公式ドキュメントは0.6.0がリリースされているように見えた。また対応するscikit-learnのバージョンが変わっている…。

依存関係のインストールとパッケージのインストール（2度め）

致し方ないので、依存関係のライブラリはpipenvのPipfileからインストールし、 auto-sklearnはGitHubからダウンロードしたmasterブランチ一式をインストールすることにした。

1 2	$ pipenv install $ pipenv install ~/Downloads.win/auto-sklearn-master.zip

改めてJupyter notebookを起動

1	$ pipenv run jupyter notebook

以下を実行したら、一応動いたけど…

1	import autosklearn.classification

少し警告

1	Could not import the lzma module. Your installed Python is incomplete.

sklearn関係もインポート成功。

1
2
3

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

学習データ等の準備

1
2
3

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)

学習実行

1 2	automl = autosklearn.classification.AutoSklearnClassifier() automl.fit(X_train, y_train)

実行したらPCのCPUが100%に張り付いた！

1 2	y_hat = automl.predict(X_test) print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))

参考）CentOS7ではSwig3をインストールすること

CentOSで依存関係をインストールするときには、 swig ではなく swig3 を対象とすること。

でないと、以下のようなエラーが生じる。

(snip)

----------------------------------------', 'ERROR: Command errored out with exit status 1: /home/centos/.local/share/virtualenvs/auto-sklearn-example-hzZc_yaE/bin/python3.7m -u -c \'import sys, setuptools, tokenize; sys.argv[0] = \'"\'"\'/tmp/pip-install-53cbwkis/pyrfr/setup.py\'"\'"\'; __file__=\'"\'"\'/tmp/pip-install-53cbwkis/pyrfr/setup.py\'"\'"\';f=getattr(tokenize, \'"\'"\'open\'"\'"\', open)(__file__);code=f.read().replace(\'"\'"\'\\r\\n\'"\'"\', \'"\'"\'\\n\'"\'"\');f.close();exec(compile(code, __file__, \'"\'"\'exec\'"\'"\'))\' install --record /tmp/pip-record-udozwz5v/install-record.txt --single-version-externally-managed --compile --install-headers /home/centos/.local/share/virtualenvs/auto-sklearn-example-hzZc_yaE/include/site/python3.7/pyrfr Check the logs for full command output.']

(snip)

メモリエラー

EC2インスタンス上で改めて実行したところ、以下のようなエラーを生じた。

(snip)

  File "/home/centos/.local/share/virtualenvs/auto-sklearn-example-hzZc_yaE/lib/python3.7/site-packages/autosklearn/smbo.py", line 352, in _calculate_metafeatures_encoded
  
(snip)

ImportError: /home/centos/.local/share/virtualenvs/auto-sklearn-example-hzZc_yaE/lib/python3.7/site-packages/sklearn/tree/_criterion.cpython-37m-x86_64-linux-gnu.so: failed to map segment from shared object: Cannot allocate memory

(snip)

これから切り分ける。

What is OpML

2019-09-02

Machine LearningOpML

Machine Learning, OpML

参考
メモ

参考

メモ

OpMLについて

USENIX OpML'19のテックジャイアントたちが話題にしている主要なトピックからスコープを推測。

ヘテロジーニアスアーキテクチャの活用（エッジ・クラウドパラダイム）
モデルサービング、レイテンシの改善
機械学習パイプライン
機械学習による運用改善（AIOps）
機械学習のデバッガビリティ改善

USENIX OpML'19 の定義

USENIX OpML'19 によると、 OpML = Operational Machine Learning の定義。

カンファレンスの定義は以下の通り。

The 2019 USENIX Conference on Operational Machine Learning (OpML '19) provides a forum for both researchers and industry practitioners to develop and bring impactful research advances and cutting edge solutions to the pervasive challenges of ML production lifecycle management. ML production lifecycle is a necessity for wide-scale adoption and deployment of machine learning and deep learning across industries and for businesses to benefit from the core ML algorithms and research advances.

上記では、機械学習の生産ライフサイクルの管理は、機械学習や深層学習が産業により広く用いられるため、またコアとなる機械学習アルゴリズムや研究からビジネス上のメリットを享受するために必要としている。

コミッティー

カンファレンスサイトの情報から集計すると、以下のような割合だった。

多様な企業、組織からコミッティを集めているが、 Googleなど一部企業からのコミッティが多い。

セッション

カンファレンスのセッションカテゴリは以下の通り。

カテゴリとしては、今回は実経験に基づく知見を披露するセッションが件数多めだった。

一方で、同じカテゴリでも話している内容がかなり異なる印象は否めない。もう数段階だけサブ・カテゴライズできるようにも見え、まだトピックが体系化されていないことを伺わせた。

Google、LinkedIn、Microsoft等のテックジャイアントの主要なセッションを見ていて気になったキーワードを並べる。

ヘテロジーニアスアーキテクチャの活用（エッジ・クラウドパラダイム）
モデルサービング、レイテンシの改善
機械学習パイプライン
機械学習による運用改善（AIOps）
機械学習のデバッガビリティ改善

その他アルゴリズムの提案も存在。

小さめの企業では、ML as a Serviceを目指した取り組みがいくつか見られた。

セッションをピックアップして紹介

実経験にもとづく知見のセッションを一部メモ。

Opportunities and Challenges Of Machine Learning Accelerators In Production

タイトル: Opportunities and Challenges Of Machine Learning Accelerators In Production
キーワード: TPU, Google, ヘテロジーニアスアーキテクチャ, 性能
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_ananthanarayanan.pdf

CPUでは近年の深層学習ワークロードの計算アルゴリズム、量において不十分（ボトルネックになりがち）だが、 TPU等のアーキテクチャにより改善。

しかしTPUが入ってくると「ヘテロジーニアスなアーキテクチャ」になる。これをうまく使うための工夫が必要。

加えて近年のTPUについて紹介。

Accelerating Large Scale Deep Learning Inference through DeepCPU at Microsoft

タイトル: Accelerating Large Scale Deep Learning Inference through DeepCPU at Microsoft
キーワード: モデルサービング, レイテンシの改善, Microsoft, 性能
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_zhang-minjia.pdf

Microsoft。 CPU上でRNNを高速にサーブするためのライブラリ: DeepCPU 目標としていたレイテンシを実現するためのもの。例えばTensorFlow Servingで105msかかっていた「質問に答える」という機能に関し、目標を超えるレイテンシ4.1msを実現した。スループットも向上。

A Distributed Machine Learning For Giant Hogweed Eradication

タイトル: A Distributed Machine Learning For Giant Hogweed Eradication
キーワード: Big Data基盤, モデルサービング, ドローン, 事例, NTTデータ
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19-nttd.pdf

NTTデータ。デンマークにおいて、ドローンで撮影した画像を利用し、危険外来種を見つける。 Big Data基盤と機械学習基盤の連係に関する考察。

AIOps: Challenges and Experiences in Azure

タイトル: AIOps: Challenges and Experiences in Azure
キーワード: Microsoft, 運用, 運用改善, AIOps
スライド: 非公開

スライド非公開のため、概要から以下の内容を推測。 Microsoft。 Azureにて、AIOps（AIによるオペレーション？）を実現したユースケースの紹介。課題と解決方法を例示。

Quasar: A High-Performance Scoring and Ranking Library

タイトル: Quasar: A High-Performance Scoring and Ranking Library
キーワード: LinkedIn, スコアリング, ランキング, 性能, フレームワーク
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_birkmanis.pdf

スコアリングやランキング生成のためにLinkedInで用いられているライブラリ。 Quasar.

AI from Labs to Production - Challenges and Learnings

タイトル: AI from Labs to Production - Challenges and Learnings
キーワード: AIの商用適用
スライド: 非公開

エンタープライズ向けにAIを使うのはB2CにAIを適用するときとは異なる。

MLOp Lifecycle Scheme for Vision-based Inspection Process in Manufacturing

タイトル: MLOp Lifecycle Scheme for Vision-based Inspection Process in Manufacturing
キーワード: Samsung Research, 運用, 深層学習, 製造業, 運用スキーム
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_lim.pdf

Samsung Research。製造業における画像検知？における運用スキームの提案。異なるステークホルダが多数存在するときのスキーム。学習、推論、それらをつなぐ管理機能を含めて提案。ボルトの検査など。

Deep Learning Vector Search Service

タイトル: Deep Learning Vector Search Service
キーワード: Microsoft, Bing, 検索, ベクトルサーチ, ANN(Approximae Nearest Neighbor), 深層学習
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_zhu.pdf

深層学習を用いてコンテンツをベクトル化する。ベクトル間の関係がコンテンツの関係性を表現。スケーラビリティを追求。

Signal Fabric An AI-assisted Platform for Knowledge Discovery in Dynamic System

タイトル: Signal FabricAn AI-assisted Platform for Knowledge Discovery in Dynamic System
キーワード: Microsoft, Azure, 運用改善のためのAI活用
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_aghajanyan.pdf

Microsoft。Azure。動的に変化するパブリッククラウドのサービスの振る舞いを把握するためのSignal Fabric。「AI」も含めて様々な手段を活用。

Relevance Debugging and Explaining at LinkedIn

タイトル: Relevance Debugging and Explaining at LinkedIn
キーワード: LinkedIn, AIのデバッグ, 機械学習基盤, デバッガビリティの改善
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_qiu.pdf

LinkedIn。分散処理のシステムは複雑。AIエンジニアたちはそれを把握するのが難しい。そこでそれを支援するツールを開発した。様々な情報をKafkaを通じて集める。それを可視化したり、クエリしたり。

Shooting the moving target: machine learning in cybersecurity

タイトル: Shooting the moving target: machine learning in cybersecurity
キーワード: セキュリティへの機械学習適用, 機械学習のモデル管理
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_arun.pdf

セキュリティへの機械学習適用。基盤はモジューライズされ？、拡張可能で、モデルを繰り返し更新できる必要がある。

Deep Learning Inference Service at Microsoft

タイトル: Deep Learning Inference Service at Microsoft
キーワード: 深層学習, 推論, Microsoft, ヘテロジーニアスアーキテクチャ
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_soifer.pdf

Microsoft。深層学習の推論に関する工夫。インフラ。ヘテロジーニアス・リソースの管理、リソース分離、ルーティングなど。数ミリ秒単位のレイテンシを目指す。モデルコンテナをサーブする仕組み。

ソリューション関係

KnowledgeNet: Disaggregated and Distributed Training and Serving of Deep Neural Networks

タイトル: KnowledgeNet: Disaggregated and Distributed Training and Serving of Deep Neural Networks
キーワード: エッジ・クラウドパラダイム, DNN, 分散処理, ヘテロジーニアスアーキテクチャ
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_biookaghazadeh.pdf

DNNは強化学習、物体検知、ビデオ処理、VR・ARなどに活用されている。一方でそれらの処理はクラウド中心のパラダイムから、エッジ・クラウドパラダイムに移りつつある。そこでDNNの学習とサービングをエッジコンピュータにまたぐ形で提供するようにする、 KnwledgeNetを提案する。ヘテロジーニアスアーキテクチャ上で学習、サービングする。

クラウドでTeacherモデルを学習し、エッジでStudentモデルを学習する。

Continuous Training for Production ML in the TensorFlow Extended (TFX) Platform

タイトル: Continuous Training for Production ML in the TensorFlow Extended (TFX) Platform
キーワード: Google, 継続的な機械学習, TFX（TensorFlow Extended）, 機械学習のパイプライン
スライド: https://www.usenix.org/system/files/opml19papers-baylor.pdf

大企業は、継続的な機械学習パイプラインに支えられている。それが止まりモデルが劣化すると、下流のシステムが影響を受ける。 TensorFlow Extendedを使ってパイプラインを構成する。

Reinforcement Learning Based Incremental Web Crawling

タイトル: Reinforcement Learning Based Incremental Web Crawling
キーワード: ウェブクロール, 機械学習？
スライド: 非公開

Katib: A Distributed General AutoML Platform on Kubernetes

タイトル: Katib: A Distributed General AutoML Platform on Kubernetes
キーワード: AutoML, Kubernetes, ハイパーパラメータサーチ, ニューラルアーキテクチャサーチ
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_zhou.pdf

Kubernetes上に、ハイパーパラメータサーチ、ニューラルアーキテクチャサーチのためのパイプラインをローンチするAutoMLの仕組み。

Machine Learning Models as a Service

タイトル: Machine Learning Models as a Service
キーワード: 中央集権的な機械学習基盤, 機械学習基盤のセキュリティ, Kafkaを使ったビーコン
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_wenzel.pdf

データサイエンティストがモデル設計や学習よりも、デプロイメントに時間をかけている状況を受け、中央集権的な機械学習基盤を開発。 PyPIに悪意のあるライブラリが混入した件などを考慮し、中央集権的にセキュリティを担保。 Kafkaにビーコンを流す。リアルタイムに異常検知するフロー、マイクロバッチでデータレイクに格納し、再学習をするフロー。その他、ロギング、トレーシング（トラッキング）、モニタリング。

Stratum: A Serverless Framework for the Lifecycle Management of Machine Learning-based Data Analytics Tasks

タイトル: Stratum: A Serverless Framework for the Lifecycle Management of Machine Learning-based Data Analytics Tasks
キーワード: 機械学習向けのサーバレスアーキテクチャ, Machine Learning as a Service, 推論, エッジ・クラウド
スライド: https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_bhattacharjee.pdf

Stratumというサーバレスの機械学基盤を提案。モデルのデプロイ、スケジュール、データ入力ツールの管理を提供。機械学習ベースの予測分析のアーキテクチャ。

その他のセッション傾向

キーワード:

スケーラビリティ
監視・観測、診断
最適化・チューニング

SYSML

SYSMLの定義

公式ウェブサイトには、以下のように記載されている。

The Conference on Systems and Machine Learning (SysML) targets research at the intersection of systems and machine learning. The conference aims to elicit new connections amongst these fields, including identifying best practices and design principles for learning systems, as well as developing novel learning methods and theory tailored to practical machine learning workflows.

以上から、「システムと機械学習の両方に関係するトピックを扱う」と解釈して良さそうである。

SYSML'19

コミッティなど

チェアたちの所属する組織の構成は以下の通り。

プログラムコミッティの所属する組織の構成は以下の通り。

チェア、コミッティともに、幅広い層組織からの参画が見受けられるが、大学関係者（教授等）が多い。僅かな偏りではあるが、Stanford Univ.、Carnegie Mellon Univ.、UCBあたりの多さが目立つ。コミッティに限れば、MITとGoogleが目立つ。

トピック

トピックは以下の通り。

Parallel & Distributed Learning
Security and Privacy
Video Analytics
Hardware
Hardware & Debbuging
Efficient Training & Inference
Programming Models

概ねセッション数に偏りは無いようだが、1個目のParallel & Distributed Learningだけは他と比べて2倍のセッション数だった。

TicTac: Accelerating Distributed Deep Learning with Communication Scheduling

タイトル: TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
キーワード: 深層学習, DNNの高速化, TensorFlow, PyTorch, 学習高速化, 推論高速化
スライド: https://www.sysml.cc/doc/2019/199.pdf

パラメータ交換を高速化。その前提となる計算モデルを考慮して、パラメータ交換の順番を制御する。 TensorFlowを改造。パラメータサーバを改造。モデル自体を変更することなく使用可能。推論で37.7%、学習で19.2%の改善。

ideavimrc

2019-08-26

ToolsIntellij

Intellij, vim

参考
メモ

参考

IntelliJ(Android Studio)のVimプラグイン「IdeaVim」の使い方と設定

メモ

IntelliJ(Android Studio)のVimプラグイン「IdeaVim」の使い方と設定に.ideavimrcの書き方の例が記載されている。ただし、リンクで書かれているGitHubレポジトリは存在しないようだた。　

上記ブログを参考にしつつ、とりあえず初期バージョンを作成。→ https://github.com/dobachi/ideavimrc このあと色々と付け足す予定。

sort by timestamp with NERDTree

2019-08-24

vim

NERDTree, vim

参考
メモ

参考

メモ

vimでNERDTreeを使うとき、直近ファイルを参照したいことが多いため、デフォルトのソート順を変えることにした。

Support sorting files and directories by modification time. #901 を参照し、タイムスタンプでソートする機能が取り込まれていることがわかったので、 NERDTree.txt を参照して設定した。具体的には、 NERDTreeSortOrder の節を参照。

vimrcには、

1	let NERDTreeSortOrder=['\/$', '*', '\.swp$', '\.bak$', '\~$', '[[-timestamp]]']

を追記した。デフォルトのオーダに対し、 [[-timestamp]] を追加し、タイムスタンプ降順でソートするようにした。

Spark Docker Image

2019-08-24

Spark

Docker, Spark

参考
メモ
- Spark公式のDockerfileを活用
- Anaconda等も利用するとして？

参考

メモ

Spark公式のDockerfileを活用

Spark公式ドキュメントの Running Spark on Kubernetes 内にある Docker Images によると、 SparkにはDockerfileが含まれている。

以下のパッケージで確認。

1
2
3

$ cat RELEASE
Spark 2.4.3 built for Hadoop 2.7.3
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Pflume -Psparkr -Pkafka-0-8 -Phadoop-2.7 -Phive -Phive-thriftserver -DzincPort=3036

Dockerfileは、パッケージ内の kubernetes/dockerfiles/spark/Dockerfile に存在する。また、これを利用した bin/docker-image-tool.sh というスクリプトが存在し、 Dockerイメージをビルドして、プッシュできる。今回はこのスクリプトを使ってみる。

ヘルプを見ると以下の通り。

$ ./bin/docker-image-tool.sh --help
Usage: ./bin/docker-image-tool.sh [options] [command]
Builds or pushes the built-in Spark Docker image.

Commands:
  build       Build image. Requires a repository address to be provided if the image will be
              pushed to a different registry.
  push        Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
  -f file               Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
  -p file               Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark.
  -R file               Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark.
  -r repo               Repository address.
  -t tag                Tag to apply to the built image, or to identify the image to be pushed.
  -m                    Use minikube's Docker daemon.
  -n                    Build docker image with --no-cache
  -b arg      Build arg to build or push the image. For multiple build args, this option needs to
              be used separately for each build arg.

Using minikube when building images will do so directly into minikube's Docker daemon.
There is no need to push the images into minikube in that case, they'll be automatically
available when running applications inside the minikube cluster.

Check the following documentation for more information on using the minikube Docker daemon:

  https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon

Examples:
  - Build image in minikube with tag "testing"
    ./bin/docker-image-tool.sh -m -t testing build

  - Build and push image with tag "v2.3.0" to docker.io/myrepo
    ./bin/docker-image-tool.sh -r docker.io/myrepo -t v2.3.0 build
    ./bin/docker-image-tool.sh -r docker.io/myrepo -t v2.3.0 push

つづいて最新版Sparkのバージョンをタグ付けて試す。

1 2	$ sudo ./bin/docker-image-tool.sh -r docker.io/dobachi -t v2.4.3 build $ sudo ./bin/docker-image-tool.sh -r docker.io/dobachi -t v2.4.3 push

これは成功。早速起動して試してみる。なお、Sparkのパッケージは、 /opt/spark 以下に含まれている。

$ sudo docker run --rm -it dobachi/spark:v2.4.3 /bin/bash

(snip)

bash-4.4# /opt/spark/bin/spark-shell  

(snip)

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212) 
Type in expressions to have them evaluated.
Type :help for more information.

scala>

データを軽く読み込んでみる。

scala> val path = "/opt/spark/data/mllib/kmeans_data.txt"
scala> val textFile = spark.read.textFile(path)
scala> textFile.first
res1: String = 0.0 0.0 0.0

scala> textFile.count
res2: Long = 6

scala> val csvLike = spark.read.format("csv").
         option("sep", " ").
         option("inferSchema", "true").
         option("header", "false").
         load(path)

scala> csvLike.show
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|0.0|0.0|0.0|
|0.1|0.1|0.1|
|0.2|0.2|0.2|
|9.0|9.0|9.0|
|9.1|9.1|9.1|
|9.2|9.2|9.2|
+---+---+---+

scala> csvLike.where($"_c0" > 5).show
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|9.0|9.0|9.0|
|9.1|9.1|9.1|
|9.2|9.2|9.2|
+---+---+---+

なお、PySparkが含まれているのは、違うレポジトリであり、 dobachi/spark-py を用いる必要がある。

$ sudo docker run --rm -it dobachi/spark-py:v2.4.3 /bin/bash

(snip)

bash-4.4# /opt/spark/bin/pyspark 
Python 2.7.16 (default, May  6 2019, 19:35:26) 
[GCC 8.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Could not open PYTHONSTARTUP
IOError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/shell.py'
>>>

この状態だと、OS標準付帯のPythonのみ使える状態に見えた。実用を考えると、もう少し分析用途のPythonライブラリを整えた方が良さそう。

Anaconda等も利用するとして？

上記の通り、Spark公式のDockerfileで作るDockerイメージには、Python関係の分析用のライブラリ、ツールが不足している。そこでこれを補うために、Anaconda等を利用しようとするとして、パッと見でほんの少し議論なのは、なにをベースにするかという点。

上記で紹介したSparkに含まれるDockerfileおよび dockerhubのopenjdk に記載の通り、当該DockerイメージはAlpine Linuxをベースとしている。ただし、OpenJDKのDockerイメージ自体には、 GitHub上のDebianベースのOpenJDK Dockerfile がある。
AnacondaのDockerイメージは、 DockerhubのAnacondaのDockerfile の通り、Debianをベースとしている。しかし、 GitHub上のAlpine版のAnaconda公式Dockerfile によるとAlpine Linux版もありそう。

どっちに合わせるか悩むところだが、Debianベースに置き換えでも良いか、という気持ちではある。 Debianベースのほうが慣れているという気持ちもあり。

また、Anacondaを合わせて導入するようにしたら、環境変数 PYSPARK_PYTHON も設定しておいたほうが良さそう。

Databricks AutoML Toolkit

2019-08-23

Machine LearningAutoML

AutoML, Machine Learning

参考
メモ
- ブログ？記事？
- GitHub
  - Readme（2019/08/23現在）
  - 実装を確認

参考

メモ

ブログ？記事？

Databricks launches AutoML Toolkit for model building and deployment の内容を確認する。

MLFlowなどをベースに開発された。ハイパーパラメタータチューニング、バッチ推論、モデルサーチを自動化する。

TensorFlowやSageMakerなどと連携。

データサイエンティストとエンジニアが連携可能という点が他のAutoMLと異なる要素。またコード実装能力が高い人、そうでない人混ざっているケースに対し、高い抽象度で実装なく動かせると同時に、細かなカスタマイズも可能にする。

DatabricksのAutoMLは、Azureの機械学習サービスとも連携する。（パートナーシップあり）

GitHub

GitHub databrickslabs/automl-toolkit を軽く確認する。

Readme（2019/08/23現在）

「non-supported end-to-end supervised learning solution for automating」と定義されている。以下の内容をautomationするようだ。

Feature clean-up（特徴選択？特徴の前処理？）
Feature vectorization（特徴のデータのベクトル化）
Model selection and training（モデル選択と学習）
Hyper parameter optimization and selection（ハイパーパラメータ最適化と選択）
Batch Prediction（バッチ処理での推論）
Logging of model results and training runs (using MLFlow)（MLFlowを使ってロギング、学習実行）

対応している学習モデルは以下の通り。 Sparkの機械学習ライブラリを利用しているようなので、それに準ずることになりそう。

Decision Trees (Regressor and Classifier)
Gradient Boosted Trees (Regressor and Classifier)
Random Forest (Regressor and Classifier)
Linear Regression
Logistic Regression
Multi-Layer Perceptron Classifier
Support Vector Machines
XGBoost (Regressor and Classifier)

開発言語はScala（100%）

DBFS APIを前提としているように見えることから、Databricks Cloud前提か？

ハイレベルAPI（FamilyRunner）について載っている例から、ポイントを抜粋する。

// パッケージを見る限り、Databricksのラボとしての活動のようだ？
import com.databricks.labs.automl.executor.config.ConfigurationGenerator
import com.databricks.labs.automl.executor.FamilyRunner

val data = spark.table("ben_demo.adult_data")


// MLFlowの設定値など、最低限の上書き設定を定義する
val overrides = Map("labelCol" -> "income",
"mlFlowExperimentName" -> "My First AutoML Run",

// Databrciks Cloud上で起動するDeltaの場所を指定
"mlFlowTrackingURI" -> "https://<my shard address>",
"mlFlowAPIToken" -> dbutils.notebook.getContext().apiToken.get,
"mlFlowModelSaveDirectory" -> "/ml/FirstAutoMLRun/",
"inferenceConfigSaveLocation" -> "ml/FirstAutoMLRun/inference"
)

// コンフィグを適切に？自動生成。合わせてどのモデルを使うかを指定している。
val randomForestConfig = ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", overrides)
val gbtConfig = ConfigurationGenerator.generateConfigFromMap("GBT", "classifier", overrides)
val logConfig = ConfigurationGenerator.generateConfigFromMap("LogisticRegression", "classifier", overrides)

// 生成されたコンフィグでランナーを生成し、実行
val runner = FamilyRunner(data, Array(randomForestConfig, gbtConfig, logConfig)).execute()

上記実装では、以下のようなことが裏で行われる。

前処理
特徴データのベクタライズ
3種類のモデルについてパラメータチューニング

実装を確認

エントリポイントとしては、 com.databricks.labs.automl.executor.FamilyRunner としてみる。

com.databricks.labs.automl.executor.FamilyRunner

executeメソッドが例として載っていたので確認。当該メソッド全体は以下の通り。

com/databricks/labs/automl/executor/FamilyRunner.scala:115

def execute(): FamilyFinalOutput = {

  val outputBuffer = ArrayBuffer[FamilyOutput]()

  configs.foreach { x =>
    val mainConfiguration = ConfigurationGenerator.generateMainConfig(x)

    val runner: AutomationRunner = new AutomationRunner(data)
      .setMainConfig(mainConfiguration)

    val preppedData = runner.prepData()

    val preppedDataOverride = preppedData.copy(modelType = x.predictionType)

    val output = runner.executeTuning(preppedDataOverride)

    outputBuffer += new FamilyOutput(x.modelFamily, output.mlFlowOutput) {
      override def modelReport: Array[GenericModelReturn] = output.modelReport
      override def generationReport: Array[GenerationalReport] =
        output.generationReport
      override def modelReportDataFrame: DataFrame =
        augmentDF(x.modelFamily, output.modelReportDataFrame)
      override def generationReportDataFrame: DataFrame =
        augmentDF(x.modelFamily, output.generationReportDataFrame)
    }
  }

  unifyFamilyOutput(outputBuffer.toArray)

}

また、再掲ではあるが、当該メソッドは以下のように呼び出されている。

1 2	// 生成されたコンフィグでランナーを生成し、実行 val runner = FamilyRunner(data, Array(randomForestConfig, gbtConfig, logConfig)).execute()

第1引数はデータ、第2引数は採用するアルゴリズムごとのコンフィグを含んだArray。

では、executeメソッドを少し細かく見る。

まずループとなっている箇所があるが、ここはアルゴリズムごとのコンフィグをイテレートしている。

com/databricks/labs/automl/executor/FamilyRunner.scala:119

1	configs.foreach { x =>

つづいて、渡されたコンフィグを引数に取りながら、パラメータを生成する。

com/databricks/labs/automl/executor/FamilyRunner.scala:120

1	val mainConfiguration = ConfigurationGenerator.generateMainConfig(x)

次は上記で生成されたパラメータを引数に取りながら、処理の自動実行用のAutomationRunnerインスタンスを生成する。

com/databricks/labs/automl/executor/FamilyRunner.scala:122

1 2	val runner: AutomationRunner = new AutomationRunner(data) .setMainConfig(mainConfiguration)

つづいてデータを前処理する。

com/databricks/labs/automl/executor/FamilyRunner.scala:125

1
2
3

val preppedData = runner.prepData()

val preppedDataOverride = preppedData.copy(modelType = x.predictionType)

つづいて生成されたAutomationRunnerインスタンスを利用し、パラメータチューニングを実施する。

com/databricks/labs/automl/executor/FamilyRunner.scala:129

1	val output = runner.executeTuning(preppedDataOverride)

その後、出力内容が整理される。レポートの類にmodelFamilyの情報を付与したり、など。

com/databricks/labs/automl/executor/FamilyRunner.scala:131

outputBuffer += new FamilyOutput(x.modelFamily, output.mlFlowOutput) {
  override def modelReport: Array[GenericModelReturn] = output.modelReport
  override def generationReport: Array[GenerationalReport] =
    output.generationReport
  override def modelReportDataFrame: DataFrame =
    augmentDF(x.modelFamily, output.modelReportDataFrame)
  override def generationReportDataFrame: DataFrame =
    augmentDF(x.modelFamily, output.generationReportDataFrame)
}

最後に、モデルごとのチューニングのループが完了したあとに、それにより得られた複数の出力内容をまとめ上げる。（Arrayをまとめたり、DataFrameをunionしたり、など）

com/databricks/labs/automl/executor/FamilyRunner.scala:142

1	unifyFamilyOutput(outputBuffer.toArray)

com.databricks.labs.automl.AutomationRunner#executeTuning

指定されたモデルごとにうチューニングを行う部分を確認する。

当該メソッド内では、modelFamilyごとの処理の仕方が定義されている。以下にRandomForestの場合を示す。

com/databricks/labs/automl/AutomationRunner.scala:1597

  protected[automl] def executeTuning(payload: DataGeneration): TunerOutput = {

    val genericResults = new ArrayBuffer[GenericModelReturn]

    val (resultArray, modelStats, modelSelection, dataframe) =
      _mainConfig.modelFamily match {
        case "RandomForest" =>
          val (results, stats, selection, data) = runRandomForest(payload)
          results.foreach { x =>
            genericResults += GenericModelReturn(
              hyperParams = extractPayload(x.modelHyperParams),
              model = x.model,
              score = x.score,
              metrics = x.evalMetrics,
              generation = x.generation
            )
          }
          (genericResults, stats, selection, data)


(snip)

なお、2019/8/25現在対応しているモデルは以下の通り。（caseになっていたのを抜粋）

case "RandomForest" =>
case "XGBoost" =>
case "GBT" =>
case "MLPC" =>
case "LinearRegression" =>
case "LogisticRegression" =>
case "SVM" =>
case "Trees" =>

さて、モデルごとのチューニングの様子を確認する。 RandomForestを例にすると、実態は以下の runRandomForest メソッドに見える。

com/databricks/labs/automl/AutomationRunner.scala:1604

1	val (results, stats, selection, data) = runRandomForest(payload)

runRandomForest メソッドは、以下のようにペイロードを引数にとり、その中でチューニングを実施する。

com/databricks/labs/automl/AutomationRunner.scala:46

  private def runRandomForest(
                               payload: DataGeneration
                             ): (Array[RandomForestModelsWithResults], DataFrame, String, DataFrame) = {

(snip)

特徴的なのは、非常にきめ細やかに初期設定している箇所。これらの設定を上書きしてチューニングできる、はず。

com/databricks/labs/automl/AutomationRunner.scala:58

    val initialize = new RandomForestTuner(cachedData, payload.modelType)
      .setLabelCol(_mainConfig.labelCol)
      .setFeaturesCol(_mainConfig.featuresCol)
      .setRandomForestNumericBoundaries(_mainConfig.numericBoundaries)
      .setRandomForestStringBoundaries(_mainConfig.stringBoundaries)

(snip)

実際にチューニングしているのは、以下の箇所のように見える。 evolveWithScoringDF 内では、 RandomForestTuner クラスのメソッドが呼び出され、ストラテジ（実装上は、バッチ方式か、continuous方式か）に従ったチューニングが実施される。

com/databricks/labs/automl/AutomationRunner.scala:169

1	val (modelResultsRaw, modelStatsRaw) = initialize.evolveWithScoringDF()