Taming the Tail: Adventures in Improving AI Economics

Martin Casado and Matt Bornstein, Taming the Tail: Adventures in Improving AI Economics, a16z.com, August 12, 2020.



Part I: Understanding the problem to be solved

Building vs. experimenting (or, software vs. AI)

As the CTO of one late-stage data startup put it, AI development often feels “closer to molecule discovery in pharma” than software engineering.

This is because AI development is a process of experimenting, much like chemistry or physics. The job of an AI developer is to fit a statistical model to a dataset, test how well the model performs on new data, and repeat. This is essentially an attempt to rein in the complexity of the real world.
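In code, that experimental loop might look something like the following minimal sketch (a toy setup of our own, not from the article): fit a model, score it on held-out data, then tweak a knob and repeat.

```python
# Fit / evaluate / repeat: each pass through the loop is one "experiment".
# The dataset and model here are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 0.1, 1.0, 10.0):  # vary one knob per experiment
    model = LogisticRegression(C=C, max_iter=1_000).fit(X_train, y_train)
    print(f"C={C}: held-out accuracy = {model.score(X_test, y_test):.3f}")
```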

Software development, on the other hand, is a process of building and engineering. Once the spec and overall architecture for an application have been defined, new features and functionality can be added incrementally – one line of code, library, or API call at a time – until the full vision takes shape. This process is largely under the control of the developer, and the complexity of the resulting system can often be reined in using standard computer science practices, such as modularization, instrumentation, virtualization, or choosing the right abstractions....

The long tail and machine learning

Many of the difficulties in building efficient AI companies arise when facing long-tailed distributions of data, which are well documented in many natural and computational systems.

While formal definitions of the concept can be pretty dense, the intuition behind it is relatively simple: If you choose a data point from a long-tailed distribution at random, it’s very likely (for the purpose of this post, let’s say at least 50% and possibly much higher) to be in the tail.

Take the example of internet search terms. Popular keywords in the “head” and “middle” of the distribution account for less than 30% of all terms. The remaining 70% of keywords lie in the “tail,” each seeing fewer than 100 searches per month. If you assume it takes the same amount of work to process a query regardless of where it sits in the distribution, then in a heavy-tailed system the majority of work will be in the tail – where the value per query is relatively low....
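To make the intuition concrete, here is a small simulation (our own construction, using an assumed Zipf exponent rather than real search data) that draws term frequencies from a heavy-tailed distribution and measures how much volume sits below the 100-searches-per-month threshold used above.

```python
# Illustrative simulation of a long-tailed query distribution (not real data).
import numpy as np

rng = np.random.default_rng(0)
freq = rng.zipf(a=1.3, size=100_000)  # monthly searches per term, heavy-tailed

tail = freq < 100                     # "tail" terms, per the example above
print(f"share of distinct terms in the tail: {tail.mean():.1%}")
print(f"share of total searches in the tail: {freq[tail].sum() / freq.sum():.1%}")
```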

These types of distributions are not necessarily bad. But – unlike the internet search example – current ML techniques are not well equipped to handle them. Supervised learning models tend to perform well on common inputs (i.e. the head of the distribution) but struggle where examples are sparse (the tail). Since the tail often makes up the majority of all inputs, ML developers end up in a loop – seemingly infinite, at times – collecting new data and retraining to account for edge cases. And ignoring the tail can be equally painful, resulting in missed customer opportunities, poor economics, and/or frustrated users....
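The head-versus-tail gap is easy to reproduce on synthetic data. The sketch below (a toy of our own, not from the article) trains a classifier on a long-tailed class distribution and compares its accuracy on frequent versus rare classes.

```python
# Toy demonstration: a model trained on a long-tailed class distribution
# tends to score well on head classes and poorly on tail classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, dim = 50, 32
centers = rng.normal(size=(n_classes, dim))
train_counts = (2_000 / np.arange(1, n_classes + 1) ** 1.5).astype(int) + 5

def sample(counts):
    X = np.vstack([centers[c] + rng.normal(size=(n, dim))
                   for c, n in enumerate(counts)])
    return X, np.repeat(np.arange(n_classes), counts)

X_train, y_train = sample(train_counts)
X_test, y_test = sample(np.full(n_classes, 100))   # balanced test set

pred = LogisticRegression(max_iter=2_000).fit(X_train, y_train).predict(X_test)
head, tail = y_test < 5, y_test >= 5               # head = 5 most frequent classes
print(f"head accuracy: {(pred == y_test)[head].mean():.2f}")
print(f"tail accuracy: {(pred == y_test)[tail].mean():.2f}")
```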

Impact on the economics of AI

The long tail – and the work it creates – turn out to be a major cause of the economic challenges of building AI businesses.

The most immediate impact is on the raw cost of data and compute resources. These costs are often far higher for ML than for traditional software, since so much data, so many experiments, and so many parameters are required to achieve accurate results. Anecdotally, development costs – and failure rates – for AI applications can be 3-5x higher than in typical software products....

Part II: Building better AI systems

Easy mode: Bounded problems

In the simplest case, understanding the problem means identifying whether you’re actually dealing with a long-tailed distribution. If not – for example, if the problem can be described reasonably well with linear or polynomial constraints – the message from the practitioners we spoke with was clear: don’t use machine learning! And especially don’t use deep learning....
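One cheap way to act on this advice (our sketch, not the article's) is to try a low-degree polynomial fit first and only reach for ML when the simple fit fails.

```python
# Sanity check before reaching for ML: does a low-degree polynomial already
# explain the data? (Synthetic data; the degree and threshold are assumptions.)
import numpy as np

def simple_fit_r2(x, y, degree=2):
    """R^2 of a least-squares polynomial fit of the given degree."""
    residuals = y - np.polyval(np.polyfit(x, y, degree), x)
    return 1 - residuals.var() / y.var()

x = np.linspace(0, 10, 200)
y = 3 * x + 2 + np.random.default_rng(0).normal(scale=0.5, size=x.size)
print(f"R^2 of degree-2 fit: {simple_fit_r2(x, y):.3f}")  # near 1.0: skip the ML
```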

Harder: Global long tail problems

If you are working on a long-tail problem – which includes most common NLP (natural language processing), computer vision, and other ML tasks – it’s critical to determine the degree of consistency across customers, regions, segments, and other user cohorts. If the overlap is high, it’s likely you can serve most of your users with a global model (or ensemble). This can have a huge, positive impact on gross margins and engineering efficiency.
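One rough way to measure that consistency (our suggestion; the article does not prescribe a method) is a train-on-one, test-on-another matrix across cohorts: high off-diagonal scores argue for a global model, low ones for local models.

```python
# Sketch: estimate cross-cohort consistency by training on each cohort and
# testing on every other. The cohorts here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cross_cohort_matrix(cohorts):
    """cohorts: list of (X, y). Returns accuracy[train_cohort][test_cohort]."""
    models = [LogisticRegression(max_iter=1_000).fit(X, y) for X, y in cohorts]
    return np.array([[m.score(X, y) for X, y in cohorts] for m in models])

rng = np.random.default_rng(0)
def make_cohort(shift):  # cohorts differ by a simple feature shift
    X = rng.normal(size=(500, 10)) + shift
    return X, (X.sum(axis=1) > 10 * shift).astype(int)

print(cross_cohort_matrix([make_cohort(s) for s in (0.0, 0.1, 1.0)]).round(2))
```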

We’ve seen this pattern most often in B2C tech companies that have access to large user datasets. The same advantages often hold for B2B vendors working on unconstrained tasks in relatively low-entropy environments like autonomous vehicles, fraud detection, or data entry – where the deployment setting has a fairly weak influence on user behavior.

In these situations, some local training (e.g. for major customers) is often still necessary. But you can minimize it by framing the problem in a global context and building proactively around the long tail. The standard advice to do this includes:

  • Optimize the model by adding more training data (including customer data), adjusting hyperparameters, or tweaking model architecture – which tends to be useful only until you hit the long tail
  • Narrow the problem by explicitly restricting what a user can enter into the system – which is most useful when the problem has a “fat head” (e.g. data vendors that focus on high-value contacts) or is susceptible to user error (e.g. LinkedIn supposedly had 17,000 entities related to IBM until they implemented auto-complete)
  • Convert the problem into a single-turn interface (e.g. content feeds, product suggestions, “people you may know,” etc.) or prompt for user input / design human failover to cover exceptional cases (e.g. teleoperations for autonomous vehicles)
For many real-world problems, however, these tactics may not be feasible. For those cases, experienced ML builders shared a more general pattern called componentizing....
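The excerpt trails off here, but as a rough illustration of the idea (our reading of the term, not code from the article), a componentized system might route each input to a specialized sub-model when one applies and fall back to a shared global model otherwise.

```python
# Hypothetical sketch of a componentized model; all names are illustrative.
from typing import Any, Callable

Predicate = Callable[[Any], bool]
Model = Callable[[Any], Any]

class ComponentizedModel:
    def __init__(self, global_model: Model):
        self.global_model = global_model  # covers the general case
        self.components: list[tuple[Predicate, Model]] = []

    def add_component(self, applies: Predicate, model: Model) -> None:
        """Register a sub-model plus a predicate saying when it applies."""
        self.components.append((applies, model))

    def predict(self, x: Any) -> Any:
        for applies, model in self.components:
            if applies(x):               # first matching sub-model wins
                return model(x)
        return self.global_model(x)      # fall back to the global model
```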

Really hard: Local long tail problems

Many problems do not show global consistency across customers or other user cohorts – nearly all ML teams we spoke with emphasized how common it is to see at least some local problem variation. Determining overlap is also nontrivial, since input data (especially in the enterprise) may be segregated for commercial or regulatory reasons. Frequently, the difference between a global problem and a local problem lies in the scope of available data.

Table stakes: Operations

  • Consolidate data pipelines.
  • Build an edge case engine (see the sketch after this list).
  • Own the infrastructure. 
  • Compress, compile, and optimize. 
  • Test, test, test. 
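As one concrete (and hypothetical) reading of the edge case engine bullet above: flag the inputs the model is least confident about at serving time, then queue them for labeling and the next retraining run.

```python
# Minimal sketch of an "edge case engine" (our interpretation, not the
# article's implementation): harvest low-confidence inputs for labeling.
class EdgeCaseEngine:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.flagged: list[tuple[float, object]] = []

    def observe(self, x, confidence: float) -> None:
        if confidence < self.threshold:  # model is unsure: likely a tail input
            self.flagged.append((confidence, x))

    def next_labeling_batch(self, n: int = 100):
        """Return the n lowest-confidence inputs for human labeling/retraining."""
        self.flagged.sort(key=lambda item: item[0])
        batch, self.flagged = self.flagged[:n], self.flagged[n:]
        return [x for _, x in batch]
```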
