Reading and Writing: Using Optimization Methods for Missing Data Imputation

D. Bertsimas, C. Pawlowski and Y. Zhuo, From Predictive Methods to Missing Data Imputation: An Optimization Approach, Journal of Machine Learning Research, 18 (2018), 1-39. (pdf)

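The first step in data science is analyzing the data, and a problem that comes up all the time is that some of the data is missing. In this paper, the authors use optimization to impute the missing values. This continues Prof. Bertsimas's earlier line of research; see also the papers cited in the course syllabus (Machine Learning via a Modern Optimization Lens).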

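Because the optimization problem is nonconvex, it is reformulated as an integer program. Solving the integer program takes too long, however, so the authors solve it quickly with a first-order method, coordinate descent. Compared with traditional methods (K-NN, SVM, trees), the approach gives better results, as the paper reports: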
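To make the coordinate descent idea concrete, here is a minimal Python sketch of K-NN-based imputation. It is an illustration only, not the authors' opt.impute code. The objective (the sum of squared distances from each row to its K nearest neighbors), the function names `assign_neighbors`, `knn_objective`, and `impute_knn_cd`, and the toy data are all assumptions made for this example. The sketch alternates two steps: with the imputed values fixed, pick each row's nearest neighbors; with the neighbors fixed, update one missing entry at a time to the value that minimizes the objective in that coordinate.

```python
import numpy as np

def pairwise_sq_dist(W):
    # Squared Euclidean distance between every pair of rows.
    return ((W[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)

def assign_neighbors(W, k):
    # Z[i, j] = 1 if row j is one of the k nearest neighbors of row i.
    dist = pairwise_sq_dist(W)
    np.fill_diagonal(dist, np.inf)  # a row is never its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]
    Z = np.zeros_like(dist)
    np.put_along_axis(Z, nbrs, 1.0, axis=1)
    return Z

def knn_objective(W, Z):
    # Assumed objective: sum of squared distances to the assigned neighbors.
    return float((Z * pairwise_sq_dist(W)).sum())

def impute_knn_cd(X, k=2, max_iter=100, tol=1e-9):
    # Assumes every column has at least one observed value.
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Warm start: fill each missing entry with its column mean.
    W = np.where(missing, np.nanmean(X, axis=0), X)
    Z = assign_neighbors(W, k)
    history = [knn_objective(W, Z)]
    for _ in range(max_iter):
        # A[i, j] counts the objective terms that couple rows i and j.
        A = Z + Z.T
        # Coordinate step: each missing entry moves to the weighted mean
        # of the same feature in the coupled rows, which is the exact
        # minimizer in that single coordinate while Z stays fixed.
        for i, d in zip(*np.nonzero(missing)):
            W[i, d] = A[i] @ W[:, d] / A[i].sum()
        # Assignment step: re-pick the neighbors for the updated values.
        Z = assign_neighbors(W, k)
        history.append(knn_objective(W, Z))
        if history[-2] - history[-1] < tol:
            break
    return W, history

# Toy data (hypothetical): two clusters, one missing value in each.
X = np.array([[1.0, 2.0],
              [1.1, np.nan],
              [5.0, 6.0],
              [np.nan, 6.2],
              [0.9, 2.1]])
W, history = impute_knn_cd(X, k=2)
print(W)
print(history)
```

Neither step can increase the objective, so the recorded values in `history` should be non-increasing. Like any method for a nonconvex problem, this only reaches a local solution, and the result depends on the warm start.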

For models trained using opt.impute single imputations with 50% data missing, the average out-of-sample R^2 is 0.339 in the regression tasks and the average out-of-sample accuracy is 86.1% in the classification tasks, compared to 0.315 and 84.4% for the best cross-validated benchmark method.

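Most of the references in the paper come from medicine. Only after looking up Y. Zhuo did I learn that Zhuo and D. Bertsimas had founded a startup, Interpretable AI. It is also a good window into how industry-academia collaboration works at U.S. universities, and how the latest research gets turned into products or software companies.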

The company's introduction lists the two former students first, which shows D. Bertsimas's generosity and how much he values respecting talent.

12/29/2018
