Excerpt from the DeepLearning.ai Newsletter: Deep Learning Underperforms on Classes with Scarce Examples

Andrew Ng, co-founder of the online education platform Coursera, founder of Google Brain, and former chief scientist of Baidu, examines why deep learning can fail to work effectively even when supplied with seemingly abundant big data.

He explains that a model trained on a dataset that is large in aggregate, but contains very few examples of certain classes (edge cases that occur so rarely in the real world that they are hard to collect in volume), will underperform on those cases in deployment. The reason is that deep learning optimizes a cost function averaged over all training examples: a class with few examples contributes only a tiny share of that average, so the optimizer can effectively ignore it and never learn to handle it well. The sketch below makes this concrete.
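A minimal Python sketch, assuming a toy imbalanced dataset with made-up counts and random per-example losses, of how an averaged objective lets the optimizer neglect a rare class:

[code]
# Toy illustration (not from the newsletter): with 99,000 "common"
# examples and only 100 "rare" ones, the rare class contributes a
# negligible share of the averaged loss.
import numpy as np

rng = np.random.default_rng(0)

n_common, n_rare = 99_000, 100
per_example_loss_common = rng.uniform(0.0, 1.0, n_common)
per_example_loss_rare = rng.uniform(0.0, 1.0, n_rare)

total_loss = per_example_loss_common.sum() + per_example_loss_rare.sum()
avg_loss = total_loss / (n_common + n_rare)

# Fraction of the averaged objective contributed by the rare class:
rare_share = per_example_loss_rare.sum() / total_loss
print(f"average loss: {avg_loss:.3f}")
print(f"rare-class share of the objective: {rare_share:.2%}")  # ~0.1%
# The optimizer can ignore the rare class almost entirely and still
# drive the averaged loss close to its minimum.
[/code]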

Perhaps the classic example is the predicament currently facing self-driving car projects. We can take comfort in the long stretches of accident-free miles that autonomous vehicles have already logged, yet the safety and reliability of the technology can only be judged by how it responds to sudden situations that its training data never covered.

Ng calls this situation of missing edge-case data "small data" or "low data". He believes the difficulty of training accurate deep learning systems on such scarce examples can be tackled with the following approaches:

1) Transfer learning: we learn from a related task and transfer the knowledge over. This includes variations of self-supervised learning (and unsupervised learning), through which useful representations learned from cheap unlabeled data can "make up for" what the low-data task itself lacks (first sketch after this list).

2) One- or few-shot learning: we meta-learn from many related tasks, each with only a small amount of training data, in the hope of getting strong results on the problem we actually care about (second sketch below).

3) Relying on hand-coded knowledge, for example by designing more complex machine learning pipelines. The idea rests on the fact that an AI system has two major sources of knowledge: (i) data and (ii) prior knowledge hand-coded by the engineering team. When data is scarce, the development team can compensate by encoding more prior knowledge into the system, letting the two sources complement each other (third sketch below).

4) Data augmentation and data synthesis (fourth sketch below).
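First, a minimal transfer-learning sketch. The choice of PyTorch/torchvision and an ImageNet-pretrained ResNet-18 is an assumption for illustration; the newsletter names no framework. The pretrained backbone is frozen and only a new classification head is trained on the small target task:

[code]
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained backbone so its general-purpose features survive.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier with one sized for the small-data task
# (num_target_classes is a placeholder for your own task).
num_target_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch:
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_target_classes, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
[/code]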
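Second, a few-shot sketch in the style of a prototypical network (one common few-shot method; the newsletter does not prescribe one). The embed function is a placeholder for a network that would be meta-trained across many related tasks; a query is labeled by the nearest class-mean prototype:

[code]
import torch

def embed(x: torch.Tensor) -> torch.Tensor:
    # Placeholder embedding; a real system would use a meta-trained network.
    return x

def few_shot_predict(support_x, support_y, query_x, n_classes):
    """Classify query_x given a few labeled support examples per class."""
    z_support = embed(support_x)
    z_query = embed(query_x)
    # One prototype per class: the mean embedding of its support examples.
    prototypes = torch.stack(
        [z_support[support_y == c].mean(dim=0) for c in range(n_classes)]
    )
    # Assign each query to the class with the nearest prototype.
    dists = torch.cdist(z_query, prototypes)
    return dists.argmin(dim=1)

# 2-way 3-shot toy episode with 4-dimensional features:
support_x = torch.randn(6, 4)
support_y = torch.tensor([0, 0, 0, 1, 1, 1])
query_x = torch.randn(2, 4)
print(few_shot_predict(support_x, support_y, query_x, n_classes=2))
[/code]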
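Third, for hand-coded prior knowledge, a hypothetical scikit-learn pipeline in which a domain-motivated feature (here an invented example: the ratio of the first two raw measurements) is computed by hand and fed to a simple model alongside the raw inputs:

[code]
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def add_domain_features(X):
    # Hypothetical prior knowledge: the ratio of the first two raw
    # measurements is known (by the engineering team) to matter.
    ratio = X[:, [0]] / (X[:, [1]] + 1e-8)
    return np.hstack([X, ratio])

pipeline = make_pipeline(
    FunctionTransformer(add_domain_features),
    LogisticRegression(max_iter=1000),
)

# Tiny synthetic dataset, just to show the pipeline runs end to end:
X = np.random.default_rng(0).normal(size=(50, 3))
y = (X[:, 0] / (X[:, 1] + 1e-8) > 0).astype(int)
pipeline.fit(X, y)
print(pipeline.predict(X[:5]))
[/code]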
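Fourth, a data-augmentation sketch using torchvision transforms (again an assumption; any augmentation library would do). Random flips, crops, and color jitter synthesize plausible variants of each image, multiplying the effective sample count for a rare class:

[code]
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Apply to a dummy image tensor (C, H, W); each call yields a new variant.
image = torch.rand(3, 256, 256)
variants = [augment(image) for _ in range(4)]
print([tuple(v.shape) for v in variants])  # four augmented 3x224x224 tensors
[/code]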

[Quote] Deep learning has seen tremendous adoption in consumer internet companies with a huge number of users and thus big data, but for it to break into other industries where dataset sizes are smaller, we now need better techniques for small data.

In the manufacturing system described above, the absolute number of examples was small. But the problem of small data also arises when the dataset in aggregate is large, but the frequency of specific important classes is low.

Say you are building an X-ray diagnosis system trained on 100,000 total images. If there are few examples of hernia in the training set, then the algorithm can obtain high training- and test-set accuracy, but still do poorly on cases of hernia.

Small data (also called low data) problems are hard because most learning algorithms optimize a cost function that is an average over the training examples. As a result, the algorithm gives low aggregate weight to rare classes and under-performs on them. Giving 1,000 times higher weight to examples from very rare classes does not work, as it introduces excessive variance.

We see this in self-driving cars as well. We would like to detect pedestrians reliably even when their appearance (say, holding an umbrella while pushing a stroller) has low frequency in the training set. We have huge datasets for self-driving, but getting good performance on important but rare cases continues to be challenging.

How do we address small data? We are still in the early days of building small data algorithms, but some approaches include:

1) Transfer learning, in which we learn from a related task and transfer knowledge over. This includes variations on self-supervised learning, in which the related tasks can be “made up” from cheap unlabeled data.

2) One- or few-shot learning, in which we (meta-)learn from many related tasks with small training sets in the hope of doing well on the problem of interest. You can find an example of one-shot learning in the Deep Learning Specialization.

3) Relying on hand-coded knowledge, for example through designing more complex ML pipelines. An AI system has two major sources of knowledge: (i) data and (ii) prior knowledge encoded by the engineering team. If we have small data, then we may need to encode more prior knowledge.

4) Data augmentation and data synthesis.
[/Quote]

Reference Source:
deeplearning.ai "The Batch" newsletter
https://info.deeplearning.ai/the-batch-self-driving-cars-that-cant-see-pedestrians-evolutionary-algorithms-fish-recognition-fighting-fraud-
