Monday, 11 November 2019

dividiti (dv/dt) accelerate omni-benchmarking for MLPerf Inference

The MLPerf consortium has recently released over 500 validated inference benchmarking results from 14 organizations measuring how fast and how well a pre-trained computer system can classify images, detect objects, and translate sentences. Over 400 of these results were submitted by dividiti, a high-tech company based in Cambridge, UK.

“Our success in MLPerf Inference v0.5 is due to our unique open workflow automation technology called Collective Knowledge (CK)”, explains Dr Anton Lokhmotov, CEO and co-founder of dividiti. “We conducted literally hundreds of benchmarking experiments, followed by thousands of auditing experiments, with many combinations of machine learning models, libraries, frameworks and hardware platforms. Such experiments are notoriously hard to stage in an automated, portable and reproducible fashion, which explains why even well-resourced hardware vendors only submit a handful of results. In collaboration with Arm and the Polytechnical University of Milan, we staged experiments on systems ranging from Raspberry Pi class boards and Android phones to high-end workstations. Benchmarking anything anywhere is what we call omni-benchmarking.”

“MLPerf is being contributed to by many organizations, from tiny startups to giant corporations with up to 50 contributors per organization. It is simply astonishing that a small organization with only 3 MLPerf contributors has submitted nearly 3 times more results than all other organizations combined,” stated Dr Vijay Janapa Reddi, Associate Professor, Harvard University, and MLPerf Inference Co-chair. “Based on the success of the first submission round, we fully expect to receive thousands of results next year. Workflow automation will be critical not only for generating large volumes of high-quality results, but also for validating and finding the most optimal solutions in terms of performance, quality and cost.”

“Benchmarking modern day platforms with multiple software branches, libraries, toolchains, datasets, and test and device configurations may deliver a set of inconsistent results,” said Colin Osborne, director of engineering and distinguished engineer, Machine Learning Group, Arm. “Arm uses the Collective Knowledge (CK) framework to transform our multi-dimensional problem space into simplified building blocks and more manageable benchmark results.”

“We have been contributing to the MLPerf initiative since its official announcement in 2018Q2. Our automated, customizable and reproducible Collective Knowledge workflows for image classification and object detection were among the very first inference workloads included in MLPerf Inference. Eventually, we aim to automate all MLPerf workloads, and thus enable easy validation, interactive visualization and fair comparison of all submissions,” said Dr Anton Lokhmotov. “We are already working with several key customers, helping create highly competitive, credible and compliant submissions for the next MLPerf Inference v0.7 round. We believe that workflow automation will go far beyond benchmarking to accelerate time-to-market and slash development costs for innovative products in automotive, robotics, healthcare and smart infrastructure domains.”

About dividiti

dividiti is a UK-based high-tech company built upon decades of unique R&D experience of Dr Anton Lokhmotov (formerly manager of GPU Compute compilers at Arm with a PhD from the University of Cambridge) and Dr Grigori Fursin (formerly head of program optimization group at Intel’s Exascale Lab and senior tenured scientist at INRIA with a PhD from the University of Edinburgh).

Our pioneering techniques have enabled rigorous performance analysis and optimization for world-leading companies including Arm, Intel, and General Motors, and powered the world’s first machine-learning based compiler developed in the MILEPOST project with IBM. Our customizable workflow framework, Collective Knowledge (CK), is the only universal solution for continuous multi-objective performance analysis and optimization available under a permissive open-source license. By automating systematic and reproducible experimentation with ever evolving software and hardware, CK gives our partners a distinct competitive advantage, as confirmed by a growing number of users in industry and academia.

Monday, 23 July 2018

Enabling virtual environment for multiple LLVM versions on Linux, MacOS, Windows and Android

It's a painful process to continuously switch between different versions of LLVM (including the ones built locally), set up numerous environment variables, and fix cmake when benchmarking and optimizing different AI/ML/math libraries.

Eventually, we decided to automate this process by introducing "virtual environments" for LLVM and related tools in the CK framework similar to Python virtualenv while supporting Linux, MacOS, Windows and Android. It automatically sets all necessary environment variables for different versions of different tools natively installed on a user machine.

The CK virtual environment requires minimal dependencies (just python, pip and git), and you can try it on your machine as follows (check the CK installation guide if you have issues):

 $ (sudo) pip install ck
 $ ck pull repo:ck-env

 $ ck detect soft --tags=compiler,llvm

CK will search for all installed clang instances on your machine (including Microsoft Visual Studio dependency on Windows) and will ask you which one to register for the CK virtual environment. You can then repeat this process and register multiple versions you use for testing. You can then see all registered virtual environments as follows:

 $ ck show env

or

 $ ck show env --tags=compiler,llvm

Now you can start a specific virtual environment as follows:

 $ ck virtual env --tags=compiler,llvm

CK will set up PATH, LD_LIBRARY_PATH and other variables to point to a specific clang version, and will start bash on Linux/MacOS or a command shell on Windows. You can then use the environment variables specific to that Clang instance, all prefixed with CK_, in your own scripts:

 $ export | grep "CK_"
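The same check can be done from a script started inside the virtual environment. The following is a minimal Python sketch, not part of CK itself: the helper name ck_variables is hypothetical, and the only thing it assumes from the text above is that the relevant variables start with CK_.

```python
import os


def ck_variables(environ=None, prefix="CK_"):
    """Return sorted (name, value) pairs for variables whose name starts with prefix.

    ck_variables is a hypothetical helper, not a CK API. By default it reads
    the environment of the current process, which is what a script started
    from the CK virtual environment shell would inherit.
    """
    env = os.environ if environ is None else environ
    return sorted((name, value) for name, value in env.items()
                  if name.startswith(prefix))


if __name__ == "__main__":
    # Print every CK_ variable visible to this process, one per line.
    for name, value in ck_variables():
        print(f"{name}={value}")
```

Run outside a CK virtual environment, the script will most likely print nothing, since no CK_ variables are set there.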

We also added the possibility to install pre-built versions of LLVM on different platforms using CK packages and automatically register CK virtual environment:

 $ ck search package --tags=compiler,llvm
 $ ck install package --tags=compiler,llvm,v6.0.0
 $ ck show env --tags=compiler,llvm
 $ ck virtual env --tags=compiler,llvm,v6.0.0

  > ${CK_CC_FULL_PATH} --version 
  > %CK_CC_FULL_PATH% --version
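The two commands above are the Linux/MacOS and Windows forms of the same check. A script can avoid the shell-specific syntax by reading the variable directly. This is a minimal Python sketch under the assumption that it is launched from inside the CK virtual environment, so that CK_CC_FULL_PATH (the variable used above) is set; the function name compiler_version is hypothetical.

```python
import os
import subprocess


def compiler_version(environ=None):
    """Return the first line printed by "<compiler> --version", or None.

    The compiler path is taken from CK_CC_FULL_PATH. None is returned when
    the variable is missing or empty, i.e. when the script is not running
    inside a CK virtual environment.
    """
    env = os.environ if environ is None else environ
    compiler = env.get("CK_CC_FULL_PATH")
    if not compiler:
        return None
    result = subprocess.run([compiler, "--version"],
                            capture_output=True, text=True, check=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = compiler_version()
    if version is None:
        print("CK_CC_FULL_PATH is not set; start 'ck virtual env' first")
    else:
        print(version)
```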

Finally, you can also build LLVM from trunk and again automatically register it in the CK virtual environment:

 $ ck install package --tags=compiler,llvm,vtrunk

You can find other software which you can automatically detect and register in the CK virtual environment using CK plugins in this online list.

If you would like to know more about this and other CK functionality, please check the "CK getting started guide" or feel free to get in touch with the CK community!

Thursday, 8 February 2018

ACM ReQuEST: 1st open and reproducible tournament to co-design Pareto-efficient deep learning (speed, accuracy, energy, size, costs)

The first Reproducible Quality-Efficient Systems Tournament (ReQuEST) will debut at ASPLOS’18 (the ACM conference on Architectural Support for Programming Languages and Operating Systems, which is the premier forum for multidisciplinary systems research spanning computer architecture and hardware, programming languages and compilers, operating systems and networking).

Organized by a consortium of leading universities (Washington, Cornell, Toronto, Cambridge, EPFL) and the cTuning foundation, ReQuEST aims to provide an open-source tournament framework, a common experimental methodology and an open repository for continuous evaluation and multi-objective optimization of the quality vs. efficiency Pareto optimality of a wide range of real-world applications, models and libraries across the whole software/hardware stack.

ReQuEST will use the established artifact evaluation methodology together with the Collective Knowledge framework validated at leading ACM/IEEE conferences to reproduce results, display them on a live dashboard and share artifacts with the community. Distinguished entries will be presented at the associated workshop and published in the ACM Digital Library. To win, the results of an entry do not necessarily have to lie on the Pareto frontier, as an entry can also be praised for its originality, reproducibility, adaptability, scalability, portability, ease of use, etc.
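For readers unfamiliar with the term, an entry lies on the Pareto frontier when no other entry is at least as good on every objective and strictly better on at least one. The Python sketch below illustrates this for two objectives only (latency and accuracy). It is not the ReQuEST scoring code, and the entry names and numbers are made up purely for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    latency_ms: float  # lower is better
    accuracy: float    # higher is better


def dominates(a, b):
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.accuracy >= b.accuracy
    better = a.latency_ms < b.latency_ms or a.accuracy > b.accuracy
    return no_worse and better


def pareto_frontier(entries):
    """Keep the entries that no other entry dominates (input order preserved)."""
    return [e for e in entries
            if not any(dominates(other, e) for other in entries)]


if __name__ == "__main__":
    # Hypothetical tournament entries, not real results.
    entries = [
        Entry("A", 10.0, 0.70),
        Entry("B", 25.0, 0.76),
        Entry("C", 30.0, 0.74),  # slower and less accurate than B
        Entry("D", 60.0, 0.80),
    ]
    for entry in pareto_frontier(entries):
        print(entry.name)
```

In this toy data set, entry C is dominated by B, while A, B and D each represent a different speed vs. accuracy trade-off. The same dominance test extends to more objectives such as energy, size and cost.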

The first ReQuEST competition will focus on deep learning for image recognition with an ambitious long-term goal to build a public repository of portable and customizable “plug&play” AI/ML algorithms optimized across diverse data sets, models and platforms from IoT to supercomputers (see live demo). Future competitions will consider other emerging workloads, as suggested by our Industrial Advisory Board.

For more information, please visit

Monday, 4 September 2017

Video from ARM presenting our collaborative DNN co-design technology

ARM shared a video from Embedded Vision Summit'17 with a brief demonstration of our open-source technology to collaboratively optimize Deep Learning Applications (sw/hw/model co-design) across diverse hardware and software stack:
 It is an on-going project bringing industry, academia and end-users together to collaboratively co-design more efficient software and hardware for emerging workloads such as deep learning:

Wednesday, 31 May 2017

Difference between MILEPOST GCC (machine learning based self-tuning compiler) and Collective Knowledge Framework

I recently received several questions about the differences between the MILEPOST GCC compiler and the Collective Knowledge framework. This motivated me to write this slightly nostalgic post with the R&D history behind MILEPOST GCC and our CK framework.

MILEPOST GCC is an extended GCC which includes:

1) Interactive Compilation Interface aka ICI - a plugin based framework to expose or change various information and optimization decisions inside compilers at fine-grain level via external plugins. I originally developed it for Open64 and later collaborated with Zbigniew Chamski and colleagues from Google and Mozilla to make it a standard plugin framework for GCC.

2) Feature extractor developed by Mircea Namolaru from IBM as an ICI plugin to expose low-level program features at a function level (see available features here). It was also extended by Jeremy Singer (ft57–65).

However, to keep MILEPOST project complexity under control, I decided to separate MILEPOST GCC from an infrastructure to auto-tune workloads, build models and use them to predict optimizations. Therefore, I developed the first version of the cTuning framework to let users auto-tune GCC flags for shared benchmarks and data sets, use MILEPOST GCC to extract features for these benchmarks, build predictive models (possibly on the fly, i.e. via active learning), and then use them to predict optimizations for previously unseen programs (using ICI to change optimizations).

However, since it was still taking far too long to train models (my PhD students, Yuriy Kashnikov and Abdul Memon, spent 5 months preparing experiments in 2010 for our MILEPOST GCC paper), we decided to crowdsource autotuning via a common repository across diverse hardware provided by volunteers and thus dramatically speed up the training process. Accelerating the training process and improving the diversity of the training set are the main practical reasons why my autotuning frameworks use crowdtuning mode by default nowadays ;) …

The first cTuning framework turned out very heavy and difficult to install and port (David Del Vento and his interns from NCAR used it in 2010 to tune their workloads and provided lots of useful feedback — thanks guys!). This motivated me to develop a common research SDK (Collective Knowledge aka CK) to simplify, unify and automate general experiments in computer engineering.

The CK framework lets the community share their artifacts (benchmarks, data sets, tools, models, experimental results) as customizable and reusable Python components with a JSON API. So, you can take advantage of already shared components to quickly prototype your own research workflows such as benchmarking, multi-objective autotuning, machine-learning based optimization, run-time adaptation, etc. That is, rather than re-building numerous ad-hoc in-house tools or scripts for autotuning and machine-learning based optimization which rarely survive after PhD students are gone, you can now participate in collaborative and open research with the community, reproduce and improve collaborative experiments, and build upon them ;) … That’s why ACM is now considering using CK for unified artifact sharing (see CK on the ACM DL front page).

You can also take advantage of the integrated, cross-platform CK package manager, which can prepare your workflow and install missing dependencies on Linux, Windows, MacOS and Android.

For example, see the highest ranked artifact from CGO’17 shared as a customizable and portable CK workflow at GitHub.

To conclude my nostalgic overview of the MILEPOST project and CK ;) — MILEPOST GCC is now added to the CK as a unified workflow while taking advantage of a growing number of shared benchmarks, data sets, and optimization statistics (see CK GitHub repo).

I just didn’t have time to provide all the ML gluing, i.e. building models from all optimization statistics and features shared by the community at . But it should be quite straightforward, so I hope our community will eventually help implement it. We are now particularly interested to check the prediction accuracy from different models (SVM, KNN, DNN, etc) or to find extra features which improve optimization prediction.
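To give a feel for what such "ML gluing" could look like, here is a minimal Python sketch of the simplest of the models mentioned above, a 1-nearest-neighbour predictor. It is not the MILEPOST or CK implementation: the function name predict_flags, the two-element feature vectors and the flag choices in the training set are all hypothetical and made up for illustration.

```python
import math


def predict_flags(features, training_set):
    """Predict compiler flags for an unseen program with 1-nearest-neighbour.

    training_set is a list of (feature_vector, best_flags) pairs, where
    best_flags is the flag string found by autotuning that program.
    The prediction is simply the best flags of the closest known program,
    measured by Euclidean distance between feature vectors.
    """
    if not training_set:
        raise ValueError("training_set must not be empty")
    _, best_flags = min(training_set,
                        key=lambda item: math.dist(features, item[0]))
    return best_flags


if __name__ == "__main__":
    # Hypothetical data: (feature vector, flags that autotuning found best).
    training_set = [
        ((10.0, 2.0), "-O2"),
        ((200.0, 40.0), "-O3 -funroll-loops"),
    ]
    # Features of a previously unseen program (also made up).
    print(predict_flags((12.0, 3.0), training_set))
```

A real setup would need many more features and programs, feature normalization, and a proper evaluation of prediction accuracy against autotuned results, which is exactly the open question raised above.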

Friday, 10 March 2017

Enabling open and reproducible computer systems research: the good, the bad and the ugly

14 March 2017, CNRS webinar, Grenoble, France
Slides are now available here!

A decade ago my research nearly stalled. I was investigating how to crowdsource performance analysis and optimization of realistic workloads across diverse hardware provided by volunteers and combine it with machine learning [1]. Often, it was simply impossible to reproduce crowdsourced empirical results and build predictive models due to the continuously changing software and hardware stack. Worse still, lack of realistic workloads and representative data sets in our community severely limited the usefulness of such models.

All these problems motivated me to develop an open-source framework and repository to share, validate and reuse workloads, data sets, tools, experimental results and predictive models, while involving the community in this effort [2]. This experience, in turn, helped us initiate so-called Artifact Evaluation (AE) at the premier ACM conferences on parallel programming, architecture and code generation (CGO, PPoPP, PACT and SC). AE aims to independently validate experimental results reported in the publications, and to encourage code and data sharing.

I would like to invite you to my webinar “Enabling open and reproducible research at computer systems conferences: the good, the bad and the ugly” at CNRS Grenoble on 14 March 2017, 1:30PM (UTC+1). I will share our practical experience organizing Artifact Evaluation over the past three years, along with encountered problems and possible solutions. You can find further info at this GitHub page including links to the video stream and the pad for notes.

On the one hand, we have received incredible support from the research community, ACM, universities and companies. We have even received a record number of artifact submissions at the CGO/PPoPP'17 AE (27 vs 17 two years ago) sponsored by NVIDIA, dividiti and cTuning foundation. We have also introduced Artifact Appendices and co-authored the new ACM Result and Artifact Review and Badging policy now used at Supercomputing. 

On the other hand, the use of proprietary benchmarks, rare hardware platforms, and totally ad-hoc scripts to set up, run and process experiments all place a huge burden on evaluators. It is simply too difficult and time-consuming to customize and rebuild experimental setups, reuse artifacts and eventually build upon others’ efforts - the main pillars of open science!

I will then present Collective Knowledge (CK), our humble attempt to introduce a customizable workflow framework with a unified JSON API and a cross-platform package manager, which can automate experimentation and enable interactive articles, while automatically adapting to the ever evolving software and hardware [3]. I will also demonstrate a practical CK workflow for collaboratively optimizing deep learning engines (such as Caffe and TensorFlow) and models across different compilers, libraries, data sets and diverse platforms from constrained mobile devices to data centers (CK-Caffe on GitHub / Android app to crowdsource DNN optimization) [4].

Finally, I will describe our open research initiative to publicly evaluate artifacts and papers which we have successfully validated at CGO-PPoPP’17, and plan to keep building upon in the future [5]. 

I am looking forward to your participation and feedback! Please feel free to contact me if you have any questions or comments!

[3] “Collective Knowledge: towards R&D sustainability”, Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2016
[4] “Optimizing Convolutional Neural Networks on Embedded Platforms with OpenCL”, IWOCL'16, Vienna, Austria, 2016
[5] “Community-driven reviewing and validation of publications”, Proceedings of the 1st ACM SIGPLAN Workshop on Reproducible Research Methodologies and New Publication Models in Computer Engineering @ PLDI’14, Edinburgh, UK

Wednesday, 15 February 2017

Our CGO'07 paper on machine learning based workload optimization received the CGO "Test of Time" award!

I had a really nice surprise at the last International Symposium on Code Generation and Optimization (CGO) - our CGO'07 research paper on "rapidly selecting good compiler optimizations using performance counters" co-authored with my colleagues from INRIA and the University of Edinburgh has won the "test of time" award! This award recognises outstanding papers published at CGO one decade earlier, whose influence is still strong today!

When preparing that paper, I really suffered a lot from the continuously changing software and hardware stack when performing and processing huge amounts of experiments to build and train models which could predict optimizations. That experience eventually motivated me to continue my work on machine learning based optimization as a community effort [1,2] while sharing all my benchmarks, data sets, models, tools and scripts as customizable and reusable components. It also motivated me to develop an open-source framework and repository to crowdsource empirical experiments (such as multi-objective optimization of deep learning and other realistic workloads) across diverse hardware and input provided by volunteers which later became known as the Collective Knowledge (CK):
Therefore, I would really like to thank the community for such strong support of our open and reproducible research initiative during the past 10 years and for all the constructive feedback and help to develop common experimental infrastructure and methodology!

For example, CK now assists various Artifact Evaluation initiatives at the premier ACM conferences on parallel programming, architecture and code generation (CGO, PPoPP, PACT, SC), which aim to encourage sharing of code and data, and independently validate experimental results from published papers:
We also use CK to crowdsource benchmarking and optimizations of realistic workloads across embedded devices such as mobile phones and tablets, while publicly sharing all optimization statistics for further collaborative analysis and mining:
dividiti (a startup based in Cambridge, UK) and the cTuning foundation (a non-profit research organization) also use the above technology to lead interdisciplinary research with ARM, General Motors and other companies to build faster, smaller, more power efficient and more reliable software and hardware:
I hope you will also join our community effort to accelerate computer systems research and enable cheap and efficient computing from IoT devices to supercomputers!