What's the state of TOPI on server-class CPUs?


I want to deploy a Deep & Cross network with NNVM on a server-class CPU, and I'm trying to build the graph incrementally.

As far as I know, the quality of the kernels has a big impact on overall performance.

So, what's the state of TOPI on server-class CPUs? How does it compare with OpenBLAS and NNPACK?

And could anyone give some advice on implementing the cross part? Should I write a custom operator?
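For reference, the "cross" part of a Deep & Cross network is just the per-layer update x_{l+1} = x_0 * (x_l . w_l) + b_l + x_l, built from a dot product and elementwise multiply/add, so it may be expressible with existing NNVM ops rather than a custom operator. A minimal plain-Python sketch of one cross layer (the function name `cross_layer` and the toy weights are illustrative, not from any library):

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_layer(x0, xl, w, b):
    """One cross layer: x_{l+1} = x0 * (xl . w) + b + xl, elementwise."""
    s = dot(xl, w)  # scalar interaction term for this layer
    return [x0_i * s + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]

# First layer uses xl = x0.
x0 = [1.0, 2.0]
x1 = cross_layer(x0, x0, w=[0.5, 0.5], b=[0.0, 0.0])
# x0 . w = 1.5, so x1 = [1*1.5 + 0 + 1, 2*1.5 + 0 + 2] = [2.5, 5.0]
```

Since every step here is a standard tensor op, a graph built from existing elementwise and dense/dot ops should lower through NNVM without new operator registrations.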


Regarding performance on server-class CPUs: we made some optimizations for AVX2/AVX-512 recently. On an AWS EC2 c5.9xlarge instance, our solution (for ResNet-18/34/50/101/152, SSD, etc.) is about 1.7x faster than MXNet+MKL-DNN and 3x faster than MXNet+MKLML. Though we haven't tested against them, I believe it brings an even larger speedup over OpenBLAS/NNPACK.

We will contribute it to TOPI soon.

While our optimization mainly focuses on convolution and inference, it shows the potential of TVM on CPUs.


How about SSE4.2? Our deployment target's CPU (an Intel Atom) only supports SSE4.2, no AVX. When can we see the new implementation? Thanks. We ask because we want to base our implementation on a single version of NNVM/TVM.


I'm not sure about the Intel Atom CPU, but I guess the basic optimization idea should be similar, though the SIMD width and cache sizes differ, which implies different split block sizes.
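To illustrate what a split block size means here: a schedule typically splits a loop so the inner block matches the SIMD width and fits in cache, and that split factor is what would change between an AVX-512 and an SSE4.2 target. A plain-Python sketch of the loop transformation (the `factor` values are illustrative, not tuned numbers):

```python
def blocked_sum(data, factor):
    """Sum a list in fixed-size blocks, mimicking a loop split:
    an outer loop over blocks plus an inner loop of length `factor`
    that a vectorizing compiler could map onto SIMD lanes."""
    total = 0.0
    for outer in range(0, len(data), factor):
        for x in data[outer:outer + factor]:  # inner, vectorizable loop
            total += x
    return total

data = [float(i) for i in range(32)]
# e.g. factor=16 matches 512-bit AVX-512 fp32 lanes; factor=4 matches 128-bit SSE
assert blocked_sum(data, 16) == sum(data)
assert blocked_sum(data, 4) == sum(data)
```

The result is identical for any factor; only the blocking, and hence how well the inner loop maps onto a given CPU's vector units and caches, changes, which is why the same schedule structure can be retuned per target.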

We have made the necessary changes to NNVM and TVM, and we have to send the pull requests incrementally.

I hope I can PR our new schedules for ResNet this weekend.


How about the graph runtime implementation? Is it homebrew or the default graph_runtime?


Cross-ref: https://github.com/dmlc/tvm/pull/1143