Posts labeled 機率統計 (Probability and Statistics).

9/05/2025

Preparing and Presenting Projects and Theses (tips for the final project and your thesis)

  • (at the bottom) Avoid common phenomena, final written report (for your ppt content)
  • (Undergraduates) Important tasks
    • Life's difficulties:
      • Tal Ben-Shahar, Happier: Learn the Secrets to Daily Joy and Lasting Fulfillment, McGraw Hill, 2007. (Chinese translation by 譚家瑜, 更快樂:哈佛最受歡迎的一堂課, 天下雜誌, 2012)

7/13/2025

Statistical Modeling: The Two Cultures

Cynthia Rudin, Leo Breiman, the Rashomon Effect, and the Occam Dilemma, arXiv:2507.03884, 2025.

In the famous “Two Cultures” paper, Leo Breiman provided a visionary perspective on the cultures of “data models” (modeling with consideration of data generation) versus “algorithmic models” (vanilla machine learning models). I provide a modern perspective on these two approaches. One of Breiman’s key arguments against data models is what he called the “Rashomon Effect,” which is the existence of many different-but-equally-good models. The Rashomon Effect implies that data modelers would not be able to determine which model generated the data. Conversely, one of his core advantages in favor of data models is simplicity, as he claimed there exists an “Occam Dilemma,” i.e., an accuracy-simplicity tradeoff, where algorithmic models must be complex in order to be accurate. After 25 years of more powerful computers, it has become clear that this claim is not generally true, in that algorithmic models do not need to be complex to be accurate; however, there are nuances that help explain Breiman’s logic, specifically, that by “simple,” he appears to consider only linear models or unoptimized decision trees. Interestingly, the Rashomon Effect is a key tool in proving the nullification of the Occam Dilemma. To his credit though, Breiman did not have the benefit of modern computers, with which my observations are much easier to make.
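The Rashomon Effect described above can be illustrated with a toy example. The sketch below is not from the paper: the synthetic data, the noise levels, and the helper name `fit_single_feature` are all assumptions made for illustration. It fits two one-variable linear models on different but highly correlated features and shows that they reach nearly the same error, so accuracy alone cannot tell which one "generated" the data.

```python
import numpy as np

def fit_single_feature(x, y):
    """Least-squares fit of y ~ a*x + b; returns slope, intercept, training MSE."""
    a, b = np.polyfit(x, y, deg=1)  # coefficients come highest degree first
    mse = float(np.mean((y - (a * x + b)) ** 2))
    return a, b, mse

# Synthetic data (assumption): x2 is almost a copy of x1, and y depends on x1.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
y = x1 + 0.5 * rng.normal(size=n)

# Two different models: one explains y with x1, the other with x2.
_, _, mse_1 = fit_single_feature(x1, y)
_, _, mse_2 = fit_single_feature(x2, y)
print(f"MSE using x1 only: {mse_1:.3f}")
print(f"MSE using x2 only: {mse_2:.3f}")
```

Both models tell a different story about which variable matters, yet their errors are close; with more correlated features the set of such near-equivalent models grows quickly.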

5/11/2024

Fluid approximations for stochastic optimization

When one encounters a stochastic optimization/control problem, one popular approach is to transform it into a deterministic problem by fluid approximation. The following highly-cited classic papers illustrate the applications of this approach:   
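As a small aside on the idea itself (this sketch is not taken from any of the papers): in a fluid approximation, random arrivals and services are replaced by their mean rates, so a queue length follows a deterministic flow balance. The example below assumes a single queue with a time-varying arrival rate and a constant service rate; the function name `fluid_queue` and the numbers are hypothetical.

```python
def fluid_queue(arrival_rates, service_rate, dt=1.0, q0=0.0):
    """Deterministic fluid trajectory of a single queue.

    The stochastic queue is replaced by the flow balance
    q(t + dt) = max(0, q(t) + (lambda(t) - mu) * dt),
    where lambda(t) is the mean arrival rate and mu the mean service rate.
    """
    q = q0
    path = [q]
    for lam in arrival_rates:
        q = max(0.0, q + (lam - service_rate) * dt)  # fluid level cannot go negative
        path.append(q)
    return path

# Hypothetical rates: a rush period (12 per period) exceeds capacity (10 per period).
rates = [8.0, 12.0, 12.0, 6.0, 6.0]
path = fluid_queue(rates, service_rate=10.0)
print(path)
```

The fluid path ignores randomness entirely, which is what makes the resulting optimization or control problem deterministic; how good the approximation is depends on the scale of the system.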

2/05/2024

The Four Levels of Learning Mathematics: (3) Applications in Many Industries

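The four levels of learning mathematics: (0) how to learn mathematics; (1) evidence of basic knowledge and ability; (2) logical reasoning and abstract thinking; (3) applications in many industries; (4) purely satisfying curiosity or the desire for knowledge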

First draft 2015/12/1; continuously updated.

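General remarks
  • Mathematics is the mother of science, and science is the foundation of industry, so mathematics, physics, and chemistry courses account for more than 1/3 of the total credits in a university's college of engineering. See also 如何選填大學志願 (How to Choose Your University Preferences).
  • It is applied in many fields (science, engineering, business, medicine, agriculture, education), for example financial engineering, computer design, production and distribution of goods, engineering practice, and using statistics to analyze learning outcomes.
  • Its abstract patterns and ways of thinking apply to both present and future applications. Take differentiation: in physics, the derivative of distance with respect to time is velocity; in economics, the derivative of cost is marginal cost; in electronics, the derivative of charge with respect to time is current. In other words, a problem to be solved can be expressed as a function, and the derivative of that function lets us study how it changes and where its extrema lie, for example when learning hyperparameters in machine learning.
  • The basic principles change little. Calculus, probability and statistics, and linear algebra have more than 200 years of history and support future self-study. Many people say that what is learned in school becomes outdated or useless right after graduation, which I find puzzling. University is only a foundational education; one must keep learning new things to adapt to changes in industry and job duties. The mathematical foundations of the recently popular big data and artificial intelligence are exactly these courses.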
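The differentiation examples in the list above can be written compactly. The symbols are notation chosen here for illustration: x(t) is distance, C(q) is the cost of producing quantity q, Q(t) is charge, and f is a generic objective function.

```latex
\[
v(t) = \frac{dx}{dt}, \qquad
MC(q) = \frac{dC}{dq}, \qquad
I(t) = \frac{dQ}{dt}
\]
% For a differentiable f, an interior extremum x^* satisfies the first-order condition
\[
f'(x^*) = 0 .
\]
```

The same operation, applied to three different functions, yields velocity, marginal cost, and current, which is the sense in which one abstract tool serves many fields.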

1/24/2024

Applications of Operations Research (作業研究) (including Optimization)

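To strengthen students' motivation to learn, the following information is provided to help them find a direction. It is also closely related to the algorithms used in decision support systems, which students may meet in summer internships and future jobs. Much of the content below belongs to master's and doctoral level courses, and may also motivate students to attend graduate school: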

  • Journals: 
    • INFORMS Journal on Applied Analytics
      • INFORMS is the leading international association for Operations Research & Analytics professionals.
      • The mission of INFORMS Journal on Applied Analytics is to publish manuscripts focusing on the practice of operations research and management science and the impact this practice has on organizations throughout the world
      • Good topics to be explored for the final project
    • Ramayya Krishnan and Pascal Van Hentenryck, editors, Advances in Integrating AI & O.R., INFORMS EC2021, Volume 16, April 19, 2021.

11/01/2023

Learning an Inventory Control Policy with General Inventory Arrival Dynamics

S Andaz, C Eisenach, D Madeka, K Torkkola, R Jia, D Foster, S Kakade, Learning an Inventory Control Policy with General Inventory Arrival Dynamics, 2023, arXiv preprint arXiv:2310.17168. (Amazon)

In this paper we address the problem of learning and backtesting inventory control policies in the presence of general arrival dynamics -- which we term as a quantity-over-time arrivals model (QOT). We also allow for order quantities to be modified as a post-processing step to meet vendor constraints such as order minimum and batch size constraints -- a common practice in real supply chains. To the best of our knowledge this is the first work to handle either arbitrary arrival dynamics or an arbitrary downstream post-processing of order quantities. Building upon recent work (Madeka et al., 2022) we similarly formulate the periodic review inventory control problem as an exogenous decision process, where most of the state is outside the control of the agent. Madeka et al. (2022) show how to construct a simulator that replays historic data to solve this class of problem. In our case, we incorporate a deep generative model for the arrivals process as part of the history replay. By formulating the problem as an exogenous decision process, we can apply results from Madeka et al. (2022) to obtain a reduction to supervised learning. Finally, we show via simulation studies that this approach yields statistically significant improvements in profitability over production baselines. Using data from an ongoing real-world A/B test, we show that Gen-QOT generalizes well to off-policy data.
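The post-processing step mentioned in the abstract (adjusting an order to satisfy a vendor's order minimum and batch size) can be sketched as follows. This is only one plausible rule, assumed here for illustration: the function name `post_process_order`, the round-up policy, and the use of integer quantities are hypothetical and are not taken from the paper.

```python
def post_process_order(quantity, min_order, batch_size):
    """Adjust a raw order quantity to meet vendor constraints (hypothetical rule).

    - A non-positive request stays at zero (no order is placed).
    - Otherwise the order is raised to the vendor minimum, then rounded up
      to the next multiple of the batch size.
    All quantities are assumed to be integers.
    """
    if quantity <= 0:
        return 0
    quantity = max(quantity, min_order)
    batches = -(-quantity // batch_size)  # ceiling division for positive integers
    return batches * batch_size

# Hypothetical vendor: minimum order of 10 units, shipped in batches of 4.
for raw in (0, 7, 12, 13):
    print(raw, "->", post_process_order(raw, min_order=10, batch_size=4))
```

Because the policy's output is changed after the fact, the quantity that actually arrives differs from what the policy requested, which is one reason such a step has to be accounted for when learning and backtesting.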

10/26/2023

Sparse PCA: A New Scalable Estimator Based On Integer Programming

Kayhan Behdin and Rahul Mazumder, Sparse PCA: A New Scalable Estimator Based On Integer Programming, arXiv:2109.11142v2, 2021. (Julia and Gurobi code)

We consider the Sparse Principal Component Analysis (SPCA) problem under the well-known spiked covariance model. Recent work has shown that the SPCA problem can be reformulated as a Mixed Integer Program (MIP) and can be solved to global optimality, leading to estimators that are known to enjoy optimal statistical properties. However, current MIP algorithms for SPCA are unable to scale beyond instances with a thousand features or so. In this paper, we propose a new estimator for SPCA which can be formulated as a MIP. Different from earlier work, we make use of the underlying spiked covariance model and properties of the multivariate Gaussian distribution to arrive at our estimator. We establish statistical guarantees for our proposed estimator in terms of estimation error and support recovery. We propose a custom algorithm to solve the MIP which is significantly more scalable than off-the-shelf solvers; and demonstrate that our approach can be much more computationally attractive compared to earlier exact MIP-based approaches for the SPCA problem. Our numerical experiments on synthetic and real datasets show that our algorithms can address problems with up to 20000 features in minutes; and generally result in favorable statistical properties compared to existing popular approaches for SPCA.
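To make the setting concrete, the sketch below samples data from a spiked covariance model, Σ = I + θ u uᵀ with a sparse unit vector u, and recovers u with diagonal thresholding, a simple classical baseline in the style of Johnstone and Lu. It is not the MIP estimator proposed in the paper; the dimensions, the spike strength θ, and the function names are assumptions chosen so that the toy problem is easy.

```python
import numpy as np

def sample_spiked(n, u, theta, rng):
    """Draw n samples with covariance I + theta * u u^T (u is a unit vector)."""
    g = rng.normal(size=(n, 1))            # latent factor, one value per sample
    noise = rng.normal(size=(n, u.size))   # isotropic Gaussian noise
    return np.sqrt(theta) * g * u[None, :] + noise

def diagonal_thresholding(X, k):
    """Baseline sparse PCA: keep the k highest-variance coordinates,
    then take the leading eigenvector of the restricted sample covariance."""
    S = np.cov(X, rowvar=False)
    support = np.sort(np.argsort(np.diag(S))[-k:])
    sub = S[np.ix_(support, support)]
    _, eigvecs = np.linalg.eigh(sub)       # eigenvalues in ascending order
    u_hat = np.zeros(X.shape[1])
    u_hat[support] = eigvecs[:, -1]        # eigenvector of the largest eigenvalue
    return support, u_hat

# Hypothetical instance: 50 features, 5 of them carry the spike.
rng = np.random.default_rng(0)
p, k, theta = 50, 5, 4.0
u = np.zeros(p)
u[:k] = 1.0 / np.sqrt(k)
X = sample_spiked(2000, u, theta, rng)

support, u_hat = diagonal_thresholding(X, k)
print("estimated support:", support.tolist())
print("|<u_hat, u>| =", round(abs(float(u_hat @ u)), 3))
```

The baseline assumes the sparsity level k is known and works here because the spike is strong relative to the noise; the harder regimes are where exact, MIP-based estimators become interesting.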

9/07/2023

Applications of Probability and Statistics

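Life is full of uncertainty, for example the waiting time in line for a meal, the yield of a production machine, and the analysis of poll statistics. The Department of Industrial Engineering also offers many related courses, such as quality control, data analysis, smart manufacturing, design of experiments, machine learning, and artificial intelligence, to solve problems in industry and commerce.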

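Many people find this course painful when they take it (1). Besides understanding its applications, reading some popular science books is recommended to increase motivation: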

8/02/2023

Queueing Theory: Classical and Modern Methods

Dimitris Bertsimas and David Gamarnik, Queueing Theory: Classical and Modern Methods, Dynamic Ideas, 2022.

STRUCTURE OF THE BOOK:

  • Part I describes single and multi-server queues.
  • Part II treats single and multiclass queueing networks (MQNETs).
  • Part III introduces asymptotic methods, including queueing networks in heavy traffic, large deviations, call centers, queues in space, and the supermarket model.
  • Part IV outlines the use of optimization in queueing networks.
  • Part V presents Markov chains and processes, Brownian motion, and weak convergence in the Appendix.

5/09/2023

Optimization in Online Content Recommendation Services

Omar Besbes, Yonatan Gur, Assaf Zeevi, Optimization in Online Content Recommendation Services: Beyond Click-Through Rates, Manufacturing & Service Operations Management, 18(1), pp. 15–33, Winter 2016.

A new class of online services allows Internet media sites to direct users from articles they are currently reading to other content they may be interested in. This process creates a “browsing path” along which there is potential for repeated interaction between the user and the provider, giving rise to a dynamic optimization problem. A key metric that often underlies this recommendation process is the click-through rate (CTR) of candidate articles. Whereas CTR is a measure of instantaneous click likelihood, we analyze the performance improvement that one may achieve by some lookahead that accounts for the potential future path of users. To that end, by using some data of user path history at major media sites, we introduce and derive a representation of content along two key dimensions: clickability, the likelihood to click to an article when it is recommended; and engageability, the likelihood to click from an article when it hosts a recommendation. We then propose a class of heuristics that leverage both clickability and engageability, and provide theoretical support for favoring such path-focused heuristics over myopic heuristics that focus only on clickability (no lookahead). We conduct a live pilot experiment that measures the performance of a practical proxy of our proposed class, when integrated into the operating system of a worldwide leading provider of content recommendations, allowing us to estimate the aggregate improvement in clicks per visit relative to the CTR-driven current practice. The documented improvement highlights the importance and the practicality of efficiently incorporating the future path of users in real time.
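The contrast between a myopic, CTR-driven choice and a one-step lookahead can be shown with a toy calculation. The scoring rule below is an assumption made for illustration and is not the heuristic class defined in the paper: it values an article by its clickability plus the chance of one further click, taken as engageability times a fixed hypothetical probability `next_click_prob` for the next recommendation.

```python
def myopic_choice(articles):
    """Pick the article with the highest clickability (no lookahead)."""
    return max(articles, key=lambda a: a["clickability"])["name"]

def lookahead_choice(articles, next_click_prob):
    """Pick the article with the highest expected clicks over two steps.

    Toy score (assumption): click now, then possibly click again from the
    chosen article, which happens with engageability * next_click_prob.
    """
    def score(a):
        return a["clickability"] * (1.0 + a["engageability"] * next_click_prob)
    return max(articles, key=score)["name"]

# Hypothetical candidates: A is slightly more clickable, B keeps readers browsing.
articles = [
    {"name": "A", "clickability": 0.30, "engageability": 0.10},
    {"name": "B", "clickability": 0.28, "engageability": 0.80},
]
print("myopic:", myopic_choice(articles))
print("lookahead:", lookahead_choice(articles, next_click_prob=0.30))
```

With these made-up numbers the two rules disagree: the myopic rule picks A, while the lookahead rule prefers B because the reader is more likely to continue along the browsing path.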

4/17/2023

A Practical End-to-End Inventory Management Model with Deep Learning

Meng Qi, Yuanyuan Shi, Yongzhi Qi, Chenxin Ma, Rong Yuan, Di Wu, Zuo-Jun (Max) Shen (2023) A Practical End-to-End Inventory Management Model with Deep Learning. Management Science 69(2):759-773. (Data and Python codes)

We investigate a data-driven multiperiod inventory replenishment problem with uncertain demand and vendor lead time (VLT) with accessibility to a large quantity of historical data. Different from the traditional two-step predict-then-optimize (PTO) solution framework, we propose a one-step end-to-end (E2E) framework that uses deep learning models to output the suggested replenishment amount directly from input features without any intermediate step. The E2E model is trained to capture the behavior of the optimal dynamic programming solution under historical observations without any prior assumptions on the distributions of the demand and the VLT. By conducting a series of thorough numerical experiments using real data from one of the leading e-commerce companies, we demonstrate the advantages of the proposed E2E model over conventional PTO frameworks. We also conduct a field experiment with JD.com, and the results show that our new algorithm reduces holding cost, stockout cost, total inventory cost, and turnover rate substantially compared with JD’s current practice. For the supply chain management industry, our E2E model shortens the decision process and provides an automatic inventory management solution with the possibility to generalize and scale. The concept of E2E, which uses the input information directly for the ultimate goal, can also be useful in practice for other supply chain management circumstances.
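The two-step predict-then-optimize (PTO) framework that the paper compares against can be sketched in its simplest form. The example below is a single-period newsvendor simplification, not the paper's multiperiod problem: step one fits a normal demand distribution to history (the normality is an assumption), and step two orders the critical-ratio quantile. The function names and cost values are hypothetical. The E2E model of the paper, which maps input features directly to a replenishment amount with deep learning, is not shown.

```python
from statistics import NormalDist, mean, stdev

def predict_demand(history):
    """Step 1 (predict): fit a normal demand distribution to past demand."""
    return NormalDist(mu=mean(history), sigma=stdev(history))

def optimize_order(demand_dist, holding_cost, stockout_cost):
    """Step 2 (optimize): newsvendor quantile at the critical ratio b / (b + h)."""
    critical_ratio = stockout_cost / (stockout_cost + holding_cost)
    return demand_dist.inv_cdf(critical_ratio)

# Hypothetical data: three periods of demand, stockouts cost 3x as much as holding.
history = [90, 100, 110]
order = optimize_order(predict_demand(history), holding_cost=1.0, stockout_cost=3.0)
print(f"suggested order quantity: {order:.2f}")
```

Any error in the first step (for example a wrong distributional assumption) carries into the second, which is the motivation the abstract gives for learning the decision in one step.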

3/15/2023

2022 Franz Edelman Award

 2022 Edelman Competition (videos)

Leonardo J. Basso et al., Analytics Saves Lives During the COVID-19 Crisis in Chile, INFORMS Journal on Applied Analytics, 2023, 53(1):9-31. (2022 Franz Edelman Award) (statistical analysis, integer programming, regression)

During the COVID-19 crisis, the Chilean Ministry of Health and the Ministry of Sciences, Technology, Knowledge and Innovation partnered with the Instituto Sistemas Complejos de Ingeniería (ISCI) and the telecommunications company ENTEL, to develop innovative methodologies and tools that placed operations research (OR) and analytics at the forefront of the battle against the pandemic. These innovations have been used in key decision aspects that helped shape a comprehensive strategy against the virus, including tools that (1) provided data on the actual effects of lockdowns in different municipalities and over time; (2) helped allocate limited intensive care unit (ICU) capacity; (3) significantly increased the testing capacity and provided on-the-ground strategies for active screening of asymptomatic cases; and (4) implemented a nationwide serology surveillance program that significantly influenced Chile’s decisions regarding vaccine booster doses and that also provided information of global relevance. Significant challenges during the execution of the project included the coordination of large teams of engineers, data scientists, and healthcare professionals in the field; the effective communication of information to the population; and the handling and use of sensitive data. The initiatives generated significant press coverage and, by providing scientific evidence supporting the decision making behind the Chilean strategy to address the pandemic, they helped provide transparency and objectivity to decision makers and the general population. According to highly conservative estimates, the number of lives saved by all the initiatives combined is close to 3,000, equivalent to more than 5% of the total death toll in Chile associated with the pandemic until January 2022. The saved resources associated with testing, ICU beds, and working days amount to more than 300 million USD.

2/02/2023

Learning Big Data Skills

Some tools, or pursuing a degree

See DS Examiner, Data Scientist Foundations: The Hard and Human Skills You Need, November 8, 2013

Alternatively, the Insight Data Science Fellows Program describes the tools you might use:
  1. Software Engineering Best Practices: Learn how to contribute to a large code-base and instrument a web application to collect data. Tools you may learn: Python, Git, LAMP web stack, Javascript, Flask.
  2. Storing and Retrieving Data: How to clean data, store it in the appropriate database or distributed data storage system and then run queries to retrieve the information needed for analysis. Tools you may learn: MySQL, Hadoop, Hive.
  3. Statistical Analysis & Machine Learning: Learn industry best practices for doing basic and advanced statistical analysis on large data sets. Tools you may learn: R, NumPy & SciPy, Mahout.
  4. Visualizing and Communicating Results: Learn how to effectively communicate your findings visually and verbally. Tools you may learn: D3 Javascript library, visualization and presentation best practices. 

2/01/2023

Generalized Synthetic Control for TestOps at ABI

Luis Costa, Vivek F. Farias, Patricio Foncea, Jingyuan (Donna) Gan, Ayush Garg, Ivo Rosa Montenegro, Kumarjit Pathak, Tianyi Peng, and Dusan Popovic, Generalized Synthetic Control for TestOps at ABI: Models, Algorithms, and Infrastructure, To appear in INFORMS Journal on Applied Analytics (Winner, Daniel H. Wagner Prize 2022)

We describe a novel optimization-based approach, Generalized Synthetic Control (GSC), to learning from experiments conducted in the world of physical retail. GSC solves a long-standing problem of learning from physical retail experiments when treatment effects are small, the environment is highly noisy and non-stationary, and interference and adherence problems are commonplace. The use of GSC has been shown to yield an approximately 100x increase in power relative to typical inferential methods and forms the basis of a new large-scale testing platform: ‘TestOps’. TestOps was developed and has been broadly implemented as part of a collaboration between Anheuser Busch Inbev (ABI) and an MIT team of operations researchers and data engineers. TestOps currently runs physical experiments impacting approximately 135M USD in revenue every month and routinely identifies innovations that result in a 1-2% increase in sales volume. The vast majority of these innovations would have remained unidentified absent our novel approach to inference: prior to our implementation, statistically significant conclusions could be drawn on only about 6% of all experiments, a fraction that has now increased by over an order of magnitude.
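For readers new to the underlying idea, the sketch below shows a simplified form of classical synthetic control, not GSC itself: a treated unit's pre-treatment outcomes are regressed on those of untreated donor units, and the fitted combination serves as the counterfactual after treatment. It uses unconstrained least squares, whereas classical synthetic control additionally restricts the weights to be nonnegative and sum to one. The data and the function name are hypothetical.

```python
import numpy as np

def synthetic_control_effect(treated, donors, n_pre):
    """Estimate an average treatment effect with a least-squares synthetic control.

    treated: outcomes of the treated unit, shape (T,)
    donors:  outcomes of untreated units, shape (T, J)
    n_pre:   number of pre-treatment periods
    """
    # Fit donor weights on the pre-treatment periods only.
    weights, *_ = np.linalg.lstsq(donors[:n_pre], treated[:n_pre], rcond=None)
    # Project the donors forward to get the counterfactual "no treatment" path.
    counterfactual = donors[n_pre:] @ weights
    effect = float(np.mean(treated[n_pre:] - counterfactual))
    return weights, effect

# Hypothetical sales data: 6 periods, 2 donor stores, treatment starts in period 5.
donors = np.array([[10.0, 20.0], [12.0, 18.0], [11.0, 21.0],
                   [13.0, 19.0], [12.0, 22.0], [14.0, 20.0]])
treated = 0.5 * donors[:, 0] + 0.5 * donors[:, 1]
treated[4:] += 2.0  # true lift of 2 units after treatment

weights, effect = synthetic_control_effect(treated, donors, n_pre=4)
print("weights:", np.round(weights, 3), "estimated lift:", round(effect, 3))
```

This toy data is noise-free, so the lift is recovered exactly; the abstract's point is that real retail data is noisy and non-stationary, with small effects, which is where a more careful method is needed.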