9/30/2019

Robust Classification by Bertsimas, et al.

Dimitris Bertsimas, Jack Dunn, Colin Pawlowski, and Ying Daisy Zhuo, Robust Classification, INFORMS Journal on Optimization, Vol. 1, No. 1, Winter 2019, pp. 2–34.
Motivated by the fact that there may be inaccuracies in features and labels of training data, we apply robust optimization techniques to study in a principled way the uncertainty in data features and labels in classification problems and obtain robust formulations for the three most widely used classification methods: support vector machines, logistic regression, and decision trees. We show that adding robustness does not materially change the complexity of the problem and that all robust counterparts can be solved in practical computational times. We demonstrate the advantage of these robust formulations over regularized and nominal methods in synthetic data experiments, and we show that our robust classification methods offer improved out-of-sample accuracy. Furthermore, we run large-scale computational experiments across a sample of 75 data sets from the University of California Irvine Machine Learning Repository and show that adding robustness to any of the three nonregularized classification methods improves the accuracy in the majority of the data sets. We observe the most significant gains for robust classification methods on high-dimensional and difficult classification problems, with an average improvement in out-of-sample accuracy of robust versus nominal problems of 5.3% for support vector machines, 4.0% for logistic regression, and 1.3% for decision trees.
This paper complements the previous paper, Optimal classification trees; see Table 10, Solver Time for Selected University of California Irvine Data Sets in Seconds.

9/28/2019

Optimal classification trees

D. Bertsimas and J. Dunn, Optimal classification trees, Machine Learning, July 2017, Volume 106, Issue 7, pp 1039–1082.
State-of-the-art decision tree methods apply heuristics recursively to create each split in isolation, which may not capture well the underlying characteristics of the dataset. The optimal decision tree problem attempts to resolve this by creating the entire decision tree at once to achieve global optimality. In the last 25 years, algorithmic advances in integer optimization coupled with hardware improvements have resulted in an astonishing 800 billion factor speedup in mixed-integer optimization (MIO). Motivated by this speedup, we present optimal classification trees (1), a novel formulation of the decision tree problem using modern MIO techniques that yields the optimal decision tree for axes-aligned splits. We also show the richness of this MIO formulation by adapting it to give optimal classification trees with hyperplanes (2) that generates optimal decision trees with multivariate splits. Synthetic tests demonstrate that these methods recover the true decision tree more closely than heuristics, refuting the notion that optimal methods overfit the training data. We comprehensively benchmark these methods on a sample of 53 datasets from the UCI machine learning repository. We establish that these MIO methods are practically solvable on real-world datasets with sizes in the 1000s, and give average absolute improvements in out-of-sample accuracy over CART of 1–2 and 3–5% for the univariate and multivariate cases, respectively. Furthermore, we identify that optimal classification trees are likely to outperform CART by 1.2–1.3% in situations where the CART accuracy is high and we have sufficient training data, while the multivariate version outperforms CART by 4–7% when the CART accuracy or dimension of the dataset is low.

9/17/2019

The ML Test Score by Google

Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D. Sculley, The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction, Proceedings of IEEE Big Data, 2017.
Creating reliable, production-level machine learning systems brings on a host of concerns not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for ensuring the production-readiness of an ML system, and for reducing technical debt of ML systems. But it can be difficult to formulate specific tests, given that the actual prediction behavior of any given model is difficult to specify a priori. In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems to help quantify these issues and present an easy to follow road-map to improve production readiness and pay down ML technical debt.
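A follow-up to Hidden Technical Debt in Machine Learning Systems. The tests are grouped into feature tests, model tests, ML infrastructure tests, and production monitoring. The authors also surveyed 36 Google teams to see how far each team had implemented these four areas.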
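The rubric can be turned into a small scoring helper. The sketch below assumes the scoring rule as I recall it from the paper: half a point for a test that is run manually and documented, a full point for a test that is automated and run repeatedly, points summed per section, and the final score taken as the minimum over the four sections. The function names and the example team are hypothetical, and the example uses three tests per section only to keep it short.

```python
from typing import Dict, List

# Points per test. Assumption: this mirrors the paper's rubric as recalled here.
# "none" = not done, "manual" = run by hand and documented,
# "automated" = run automatically on a repeated basis.
POINTS = {"none": 0.0, "manual": 0.5, "automated": 1.0}

# The four areas named in the note above.
SECTIONS = (
    "feature tests",
    "model tests",
    "ML infrastructure tests",
    "production monitoring",
)


def section_score(statuses: List[str]) -> float:
    """Sum the points earned by the tests of one section."""
    return sum(POINTS[status] for status in statuses)


def ml_test_score(team: Dict[str, List[str]]) -> float:
    """Final score: the minimum of the four section scores.

    A section missing from `team` counts as zero points.
    """
    return min(section_score(team.get(name, [])) for name in SECTIONS)


# Hypothetical team, invented for illustration (not data from the paper).
example_team = {
    "feature tests": ["automated", "manual", "none"],
    "model tests": ["manual", "manual", "none"],
    "ML infrastructure tests": ["automated", "automated", "manual"],
    "production monitoring": ["automated", "none", "none"],
}

if __name__ == "__main__":
    for name in SECTIONS:
        print(f"{name}: {section_score(example_team[name])}")
    print("ML Test Score:", ml_test_score(example_team))
```

Taking the minimum rather than the sum means a team cannot compensate for a neglected area, for example monitoring, by doing well in the other three.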

9/13/2019

AI transforming the enterprise by KPMG

Steve Hill, AI transforming the enterprise, KPMG, 2019. (Big Four firm, PDF file)
We conducted the KPMG 2019 Enterprise AI Adoption Study to gain insight into the state of AI and automation deployment efforts at select large cap companies. This involved in-depth interviews with senior leaders at 30 of the world’s largest companies, as well as secondary research on job postings and media coverage. These 30 highly influential, Global 500 companies represent significant global economic value – collectively, they employ approximately 6.2 million people, with aggregate revenues of $3 trillion. Together, they also represent a significant component of the AI market.
The trends

  1. Rapid shift from experimental to applied technology
  2. Automation, AI, analytics and low-code platforms are converging
  3. Enterprise demand is growing
  4. New organizational capabilities are critical
  5. Internal governance emerging as key area
  6. The need to manage AI
  7. Rise of AI-as-a-service
  8. AI could shift the competitive landscape

9/11/2019

Google Colab, a handy tool for Python beginners

少數派, Recommended: a handy tool for Python beginners, Google Colab, 2019.03.20

The file used in class is lec08 MNIST-GPU.ipynb. Its original runtime environment was GPU: NVIDIA GTX 1070, RAM 24450 MB, Win10 64-bit. I uploaded it to Google Drive.

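Because the related image files were not uploaded, Image(filename='data/05-Chollet-MNIST-sample.jpg') and Image(filename='data/05-MNIST.png') cannot be run; for how to set this up, see the new file. The runtime can be set to TPU, which takes a little while to initialize. You can then compare the computation time of a local run against Google Cloud TPU.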
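Both points in the note above can be handled with a few lines of standard-library Python. This is a minimal sketch, not part of the original notebook: `missing_files` and `timed` are hypothetical helper names, and the workload being timed is a placeholder for whatever cell you want to compare between the local machine and Colab.

```python
import os
import time
from typing import Callable, Iterable, List, Tuple

# The two image paths mentioned in the note above.
IMAGE_PATHS = ["data/05-Chollet-MNIST-sample.jpg", "data/05-MNIST.png"]


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not exist in the current runtime."""
    return [path for path in paths if not os.path.exists(path)]


def timed(fn: Callable[[], object]) -> Tuple[object, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Skip the Image(...) cells whose files were never uploaded.
    for path in missing_files(IMAGE_PATHS):
        print(f"skip Image(filename={path!r}): file not uploaded")

    # Placeholder workload; in the notebook this would be the training cell.
    _, seconds = timed(lambda: sum(i * i for i in range(100_000)))
    print(f"elapsed: {seconds:.3f} s")
```

Running the same timed cell once locally and once on the Colab runtime gives two wall-clock numbers to compare. TPU initialization time should be measured separately so it does not distort the comparison.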

9/10/2019

Destined for War: Can America and China Escape Thucydides's Trap?

包淳亮, 注定一戰?中美能否避免修昔底德陷阱 (Chinese edition of Destined for War), 八旗文化, 2018
Graham Allison, Destined for War: Can America and China Escape Thucydides’s Trap?, Mariner Books, 2018
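◎ From ancient Greece to the US-Soviet Cold War: drawing on two thousand years of the history of human warfare to forecast the uneasy future of the US and China!
The fifth-century BC Greek historian Thucydides recorded the Peloponnesian War, which destroyed the entire Greek world, and summed up its cause as "the rise of Athens, and the lingering fear this instilled in Sparta, made war inevitable." The book's author, Graham Allison, calls the dilemma that Sparta and Athens faced the "Thucydides Trap": when an existing balance of power is about to shift, the established ruling power may move to discipline or strangle the rising challenger in order to defend its position, while the challenger may refuse to remain subordinate and try to change the rules of the game in a bid for supremacy. Over the past 500 years there have been 16 cases of a rising power challenging a ruling power, and 12 of them ended in war. Like a ghost, the "Thucydides Trap" has repeatedly pushed great powers toward the abyss of destruction. Bismarck challenged France, the hegemon of continental Europe, in the Franco-Prussian War; Kaiser Wilhelm II challenged the British navy in World War I; Japan, believing it deserved equal dignity, launched the Russo-Japanese War, and later attacked Pearl Harbor out of fear that the US economic blockade would strangle its development. All of these seemingly blind and irrational acts can be explained through the "Thucydides Trap."
◎ Conflict in the South China Sea, Taiwan independence, cyberattacks, the collapse of North Korea, trade war... which of these will ignite a US-China war, and how can it be avoided?
China and the US in the early 21st century have once again fallen into the pattern of the "Thucydides Trap," as if they cannot escape being "destined for war." China's rapid rise poses a serious challenge to the US-led international order established after World War II and to US military hegemony. After World War II the US accounted for 50% of the global economy; today that share has fallen to 16%. Over the same period, China's share soared from 2% in 1980 to 18% in 2016. To make matters worse, Xi Jinping, who champions the "Chinese Dream," and Trump, with his "America First," both vow to restore their nation's greatness and glory, and each sees the other as an obstacle to that goal. No other pair of leaders is more likely than Xi and Trump to lead the US and China into war.
Graham Allison is a world-renowned scholar of international relations whose penetrating study of the Cuban Missile Crisis in the 1970s established his unshakable standing as a master of the field. Through a concise analysis of wars across history, he lays the theoretical foundation of the "Thucydides Trap" and uses it to forecast the possible paths to a US-China conflict. The book lists 5 ways in which war could break out and 12 clues for peace that point toward avoiding disaster, and it offers earnest advice to the US government: on the one hand it urges the US to take seriously the fact of China's rise and its determination to restore national glory; on the other it exhorts US foreign-policy circles to recover the grand strategic thinking of the US-Soviet Cold War era in order to face an unprecedented security threat. Even before publication the book caused a sensation in political, academic, and journalistic circles worldwide, and 《注定一戰?》 (Destined for War) has become a talking point for everyone concerned about the future of US-China relations. Even Xi Jinping has personally said: "We should all strive to avoid falling into the Thucydides Trap!"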

9/06/2019

Globalization in transition: The future of trade and value chains

Susan Lund, James Manyika, Jonathan Woetzel, Jacques Bughin, Mekala Krishnan, Jeongmin Seong, and Mac Muir, Globalization in transition: The future of trade and value chains, McKinsey Global Institute, January 2019.
Although output and trade continue to increase in absolute terms, trade intensity (that is, the share of output that is traded) is declining within almost every goods-producing value chain. Flows of services and data now play a much bigger role in tying the global economy together. Not only is trade in services growing faster than trade in goods, but services are creating value far beyond what national accounts measure. Using alternative measures, we find that services already constitute more value in global trade than goods. In addition, all global value chains are becoming more knowledge-intensive. Low-skill labor is becoming less important as a factor of production. Contrary to popular perception, only about 18 percent of global goods trade is now driven by labor-cost arbitrage.

Learning Scheduling Algorithms for Data Processing Clusters

Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh, Learning Scheduling Algorithms for Data Processing Clusters, SIGCOMM '19 Proceedings of the ACM Special Interest Group on Data Communication, Pages 270-288. 
Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly-efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima improves the average job completion time over hand-tuned scheduling heuristics by at least 21%, achieving up to 2x improvement during periods of high cluster load.
Codes and more information. 
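To make the objective named in the abstract concrete, here is a toy calculation of average job completion time for two simple orderings. It is not Decima's method: it assumes a single executor, jobs that all arrive at time 0, and no dependency graphs, and the job durations are made up for illustration.

```python
from typing import Sequence


def average_jct(durations: Sequence[float]) -> float:
    """Average job completion time when jobs run back to back in the given order.

    Assumes one executor and that every job arrives at time 0.
    """
    if not durations:
        raise ValueError("need at least one job")
    clock = 0.0   # time at which the current job finishes
    total = 0.0   # sum of all completion times
    for duration in durations:
        clock += duration
        total += clock
    return total / len(durations)


# Hypothetical job durations in seconds, in arrival order.
jobs = [8.0, 1.0, 3.0]

fifo = average_jct(jobs)          # completions at 8, 9, 12
sjf = average_jct(sorted(jobs))   # shortest job first: completions at 1, 4, 12

if __name__ == "__main__":
    print(f"FIFO average JCT: {fifo:.2f} s")
    print(f"SJF  average JCT: {sjf:.2f} s")
```

Even in this tiny case the ordering changes the metric, which is the kind of gap a learned, workload-specific policy tries to exploit at cluster scale.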

9/03/2019

Food Discovery with Uber Eats

Ferras Hamad, Isaac Liu, and Xian Xing Zhang, Food Discovery with Uber Eats: Building a Query Understanding Engine, Uber, June 10, 2018
Choice is fundamental to the Uber Eats experience. At any given location, there could be thousands of restaurants and even more individual menu items for an eater to choose from. Many factors can influence their choice. For example, the time of day, their cuisine preference, and current mood can all play a role. At Uber Eats, we strive to help eaters find the exact food they want as effortlessly as possible....