Writing OpenCL the easy way with PyOpenCL - 0xfeeb

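Writing OpenCL code means writing both host code and kernel code and then compiling them, which gets tedious fast. Kernel code is designed to be compiled at run time anyway, so I think it pairs very well with a scripting language. PyOpenCL is what makes that happen.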

PyOpenCL

On Mac and Linux, at least, setup is as simple as the following.

Mac OS X 10.6

Install MacPorts and run "sudo port -v selfupdate"
Run "port search opencl", which turns up "py26-pyopencl"; install it
pyopencl is now usable from python2.6
That's all

On a Mac, both the GPU and the CPU are OpenCL target devices, so most iMacs and MacBooks give you two devices to work with. My iMac reported the following device information.

NAME: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
MAX_COMPUTE_UNITS: 8

NAME: Radeon HD 4850
MAX_COMPUTE_UNITS: 10
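
For reference, a listing like the one above can be produced with a few lines of PyOpenCL. This is only a sketch of how I would do it, not the exact script that printed the output: describe_devices and main are names made up for this example, and the code assumes only that platforms come from cl.get_platforms() and that each device exposes name and max_compute_units.

```python
def describe_devices(platforms):
    """Return NAME / MAX_COMPUTE_UNITS lines for every device on every platform."""
    lines = []
    for platform in platforms:
        for device in platform.get_devices():
            lines.append("NAME: %s" % device.name)
            lines.append("MAX_COMPUTE_UNITS: %d" % device.max_compute_units)
            lines.append("")  # blank line between devices
    return lines


def main():
    # Import lazily so the helper above can be used without pyopencl installed.
    try:
        import pyopencl as cl
    except ImportError:
        print("pyopencl is not installed")
        return
    try:
        platforms = cl.get_platforms()
    except cl.Error as exc:
        # Raised, for example, when no OpenCL platform is available.
        print("could not query OpenCL platforms: %s" % exc)
        return
    print("\n".join(describe_devices(platforms)))


if __name__ == "__main__":
    main()
```

Keeping the formatting in a separate function means it can be tried with stand-in objects on a machine that has no OpenCL driver at all.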

Ubuntu 10.10

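From "Administration" > "Additional Drivers", install the proprietary NVIDIA driver (AMD probably works too)
From "Administration" > "Synaptic Package Manager", install "python-pyopencl"
pyopencl is now usable from the stock python
That's all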

On Ubuntu, the package puts sample scripts at the path below. They also ran fine when I copied them over to the Mac.

/usr/share/doc/python-pyopencl/examples

You may run into an exception like this one.

Traceback (most recent call last):
  File "demo.py", line 8, in <module>
    ctx = cl.create_some_context()
  File "/usr/lib/pymodules/python2.6/pyopencl/__init__.py", line 346, in create_some_context
    return Context(devices)
pyopencl.RuntimeError: Context failed: out of host memory

Changing the following setting may fix it.

Under "Preferences" > "Appearance" > "Visual Effects", select "None"

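A small optimization for benchmark-all.py

I made one of the scripts in examples a little faster. Looking at the kernel source, every statement reads from and writes to global memory, which is inefficient. I applied the patch below and measured the result.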

$ diff benchmark-all.py benchmark-all-private.py 
56,58c56,62
<                                 c[gid] = a[gid] + b[gid];
<                                 c[gid] = c[gid] * (a[gid] + b[gid]);
<                                 c[gid] = c[gid] * (a[gid] / 2.0);
---
>                                 float fa, fb, fc;
>                                 fa = a[gid];
>                                 fb = b[gid];
>                                 fc  = fa + fb;
>                                 fc = fc * (fa + fb);
>                                 fc = fc * (fa / 2.0);
>                                 c[gid] = fc;

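To make the point of the patch easier to see, here is a plain-Python sketch of what a single work-item does before and after the change. It is not PyOpenCL code and it measures nothing: CountingBuffer, kernel_before, kernel_after and run are hypothetical names for this illustration, and a Python list stands in for each global memory buffer so that accesses can be counted at the source level. A real OpenCL compiler may keep some of these values in registers on its own, so treat the counts as a picture of the source, not of the generated code.

```python
class CountingBuffer(object):
    """A list wrapper that counts every element read and write."""

    def __init__(self, values):
        self.values = list(values)
        self.accesses = 0

    def __getitem__(self, index):
        self.accesses += 1
        return self.values[index]

    def __setitem__(self, index, value):
        self.accesses += 1
        self.values[index] = value


def kernel_before(a, b, c, gid):
    # Mirrors the original kernel body: c[gid] is re-read and re-written each line.
    c[gid] = a[gid] + b[gid]
    c[gid] = c[gid] * (a[gid] + b[gid])
    c[gid] = c[gid] * (a[gid] / 2.0)


def kernel_after(a, b, c, gid):
    # Mirrors the patched body: load once, compute in locals, store once.
    fa = a[gid]
    fb = b[gid]
    fc = fa + fb
    fc = fc * (fa + fb)
    fc = fc * (fa / 2.0)
    c[gid] = fc


def run(kernel, n=4):
    """Run `kernel` for gid = 0..n-1; return (results, total buffer accesses)."""
    a = CountingBuffer([float(i) for i in range(n)])
    b = CountingBuffer([2.0 * i + 1.0 for i in range(n)])
    c = CountingBuffer([0.0] * n)
    for gid in range(n):
        kernel(a, b, c, gid)
    return c.values, a.accesses + b.accesses + c.accesses


if __name__ == "__main__":
    for kernel in (kernel_before, kernel_after):
        results, accesses = run(kernel)
        print("%s: %d buffer accesses, results %s" % (kernel.__name__, accesses, results))
```

Both versions compute the same values; as written, the original body touches the buffers ten times per work-item and the patched one three times.
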
[before]

Execution time of test without OpenCL: 11.3199682236 s
===============================================================
Platform name: Apple
Platform profile: FULL_PROFILE
Platform vendor: Apple
Platform version: OpenCL 1.0 (Dec 23 2010 17:30:26)
---------------------------------------------------------------
Device name: Radeon HD 4850
Device type: GPU
Device memory: 512 MB
Device max clock speed: 503 MHz
Device compute units: 10
Execution time of test: 0.00564408 s
Results OK
===============================================================
Platform name: Apple
Platform profile: FULL_PROFILE
Platform vendor: Apple
Platform version: OpenCL 1.0 (Dec 23 2010 17:30:26)
---------------------------------------------------------------
Device name: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
Device type: CPU
Device memory: 6144 MB
Device max clock speed: 2800 MHz
Device compute units: 8
Execution time of test: 0.0021682 s
Results OK

[after]

Execution time of test without OpenCL: 11.3338668346 s
===============================================================
Platform name: Apple
Platform profile: FULL_PROFILE
Platform vendor: Apple
Platform version: OpenCL 1.0 (Dec 23 2010 17:30:26)
---------------------------------------------------------------
Device name: Radeon HD 4850
Device type: GPU
Device memory: 512 MB
Device max clock speed: 503 MHz
Device compute units: 10
Execution time of test: 0.00388932 s
Results OK
===============================================================
Platform name: Apple
Platform profile: FULL_PROFILE
Platform vendor: Apple
Platform version: OpenCL 1.0 (Dec 23 2010 17:30:26)
---------------------------------------------------------------
Device name: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
Device type: CPU
Device memory: 6144 MB
Device max clock speed: 2800 MHz
Device compute units: 8
Execution time of test: 0.00147917 s
Results OK

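It's a tiny change, but execution time drops by about 30% on both devices. The awkward part is that the CPU is faster than the GPU... My guess is that MAX_COMPUTE_UNITS is 8 on the CPU against only 10 on the GPU, so it comes down to a head-to-head between individual compute units, and there the CPU wins. Or is there some way to optimize this further?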
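
To double-check the "about 30%" figure, the arithmetic on the [before] and [after] timings above can be written out as below. The numbers are copied from those outputs; timings and reduction_percent are just names I picked for this sketch.

```python
# (before, after) execution times in seconds, copied from the outputs above.
timings = {
    "Radeon HD 4850 (GPU)": (0.00564408, 0.00388932),
    "Core i7 860 (CPU)": (0.0021682, 0.00147917),
}


def reduction_percent(before, after):
    """Percentage by which execution time shrank going from before to after."""
    return (1.0 - after / before) * 100.0


for name, (before, after) in sorted(timings.items()):
    print("%s: %.1f%% less time, %.2fx speedup"
          % (name, reduction_percent(before, after), before / after))

# How far the GPU still trails the CPU after the patch.
gpu_after = timings["Radeon HD 4850 (GPU)"][1]
cpu_after = timings["Core i7 860 (CPU)"][1]
print("GPU time / CPU time after the patch: %.2f" % (gpu_after / cpu_after))
```

Both devices come out at roughly 31% less time, and the GPU still takes well over twice as long as the CPU on this test.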