Excerpt from the DeepLearning.ai Newsletter: Deep Learning Underperforms on Classes with Scarce Examples

Andrew Ng, co-founder of the online education platform Coursera, founder of Google Brain, and former chief scientist of Baidu, examines why deep learning can fail to work effectively even when supplied with seemingly abundant big data.

He explains that a model trained on a dataset that is large in aggregate, but contains very few examples of certain classes (edge cases that occur so rarely in the real world that they are hard to collect in volume), will underperform on those cases in deployment. The reason is that deep learning optimizes a cost function averaged over all training examples: a class with few examples contributes only a tiny share of that average, so the optimizer can effectively ignore it and never learn to handle it well. The sketch below makes this concrete.
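A minimal Python sketch, assuming a toy imbalanced dataset with made-up counts and random per-example losses, of how an averaged objective lets the optimizer neglect a rare class:

[code]
# Toy illustration (not from the newsletter): with 99,000 "common"
# examples and only 100 "rare" ones, the rare class contributes a
# negligible share of the averaged loss.
import numpy as np

rng = np.random.default_rng(0)

n_common, n_rare = 99_000, 100
per_example_loss_common = rng.uniform(0.0, 1.0, n_common)
per_example_loss_rare = rng.uniform(0.0, 1.0, n_rare)

total_loss = per_example_loss_common.sum() + per_example_loss_rare.sum()
avg_loss = total_loss / (n_common + n_rare)

# Fraction of the averaged objective contributed by the rare class:
rare_share = per_example_loss_rare.sum() / total_loss
print(f"average loss: {avg_loss:.3f}")
print(f"rare-class share of the objective: {rare_share:.2%}")  # ~0.1%
# The optimizer can ignore the rare class almost entirely and still
# drive the averaged loss close to its minimum.
[/code]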

Perhaps the classic example is the predicament currently facing self-driving car projects. We can take comfort in the long stretches of accident-free miles that autonomous vehicles have already logged, yet the safety and reliability of the technology can only be judged by how it responds to sudden situations that its training data never covered.

Ng calls this situation of missing edge-case data "small data" or "low data". He believes the difficulty of training accurate deep learning systems on such scarce examples can be tackled with the following approaches:

1) Transfer learning: we learn from a related task and transfer the knowledge over. This includes variations of self-supervised learning (and unsupervised learning), through which useful representations learned from cheap unlabeled data can "make up for" what the low-data task itself lacks (first sketch after this list).

2) One- or few-shot learning: we meta-learn from many related tasks, each with only a small amount of training data, in the hope of getting strong results on the problem we actually care about (second sketch below).

3) Relying on hand-coded knowledge, for example by designing more complex machine learning pipelines. The idea rests on the fact that an AI system has two major sources of knowledge: (i) data and (ii) prior knowledge hand-coded by the engineering team. When data is scarce, the development team can compensate by encoding more prior knowledge into the system, letting the two sources complement each other (third sketch below).

4) Data augmentation and data synthesis (fourth sketch below).
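First, a minimal transfer-learning sketch. The choice of PyTorch/torchvision and an ImageNet-pretrained ResNet-18 is an assumption for illustration; the newsletter names no framework. The pretrained backbone is frozen and only a new classification head is trained on the small target task:

[code]
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained backbone so its general-purpose features survive.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier with one sized for the small-data task
# (num_target_classes is a placeholder for your own task).
num_target_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch:
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_target_classes, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
[/code]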
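Second, a few-shot sketch in the style of a prototypical network (one common few-shot method; the newsletter does not prescribe one). The embed function is a placeholder for a network that would be meta-trained across many related tasks; a query is labeled by the nearest class-mean prototype:

[code]
import torch

def embed(x: torch.Tensor) -> torch.Tensor:
    # Placeholder embedding; a real system would use a meta-trained network.
    return x

def few_shot_predict(support_x, support_y, query_x, n_classes):
    """Classify query_x given a few labeled support examples per class."""
    z_support = embed(support_x)
    z_query = embed(query_x)
    # One prototype per class: the mean embedding of its support examples.
    prototypes = torch.stack(
        [z_support[support_y == c].mean(dim=0) for c in range(n_classes)]
    )
    # Assign each query to the class with the nearest prototype.
    dists = torch.cdist(z_query, prototypes)
    return dists.argmin(dim=1)

# 2-way 3-shot toy episode with 4-dimensional features:
support_x = torch.randn(6, 4)
support_y = torch.tensor([0, 0, 0, 1, 1, 1])
query_x = torch.randn(2, 4)
print(few_shot_predict(support_x, support_y, query_x, n_classes=2))
[/code]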
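Third, for hand-coded prior knowledge, a hypothetical scikit-learn pipeline in which a domain-motivated feature (here an invented example: the ratio of the first two raw measurements) is computed by hand and fed to a simple model alongside the raw inputs:

[code]
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def add_domain_features(X):
    # Hypothetical prior knowledge: the ratio of the first two raw
    # measurements is known (by the engineering team) to matter.
    ratio = X[:, [0]] / (X[:, [1]] + 1e-8)
    return np.hstack([X, ratio])

pipeline = make_pipeline(
    FunctionTransformer(add_domain_features),
    LogisticRegression(max_iter=1000),
)

# Tiny synthetic dataset, just to show the pipeline runs end to end:
X = np.random.default_rng(0).normal(size=(50, 3))
y = (X[:, 0] / (X[:, 1] + 1e-8) > 0).astype(int)
pipeline.fit(X, y)
print(pipeline.predict(X[:5]))
[/code]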
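Fourth, a data-augmentation sketch using torchvision transforms (again an assumption; any augmentation library would do). Random flips, crops, and color jitter synthesize plausible variants of each image, multiplying the effective sample count for a rare class:

[code]
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Apply to a dummy image tensor (C, H, W); each call yields a new variant.
image = torch.rand(3, 256, 256)
variants = [augment(image) for _ in range(4)]
print([tuple(v.shape) for v in variants])  # four augmented 3x224x224 tensors
[/code]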

[Quote] Deep learning has seen tremendous adoption in consumer internet companies with a huge number of users and thus big data, but for it to break into other industries where dataset sizes are smaller, we now need better techniques for small data.

In the manufacturing system described above, the absolute number of examples was small. But the problem of small data also arises when the dataset in aggregate is large, but the frequency of specific important classes is low.

Say you are building an X-ray diagnosis system trained on 100,000 total images. If there are few examples of hernia in the training set, then the algorithm can obtain high training- and test-set accuracy, but still do poorly on cases of hernia.

Small data (also called low data) problems are hard because most learning algorithms optimize a cost function that is an average over the training examples. As a result, the algorithm gives low aggregate weight to rare classes and under-performs on them. Giving 1,000 times higher weight to examples from very rare classes does not work, as it introduces excessive variance.

We see this in self-driving cars as well. We would like to detect pedestrians reliably even when their appearance (say, holding an umbrella while pushing a stroller) has low frequency in the training set. We have huge datasets for self-driving, but getting good performance on important but rare cases continues to be challenging.

How do we address small data? We are still in the early days of building small data algorithms, but some approaches include:

1) Transfer learning, in which we learn from a related task and transfer knowledge over. This includes variations on self-supervised learning, in which the related tasks can be “made up” from cheap unlabeled data.

2) One- or few-shot learning, in which we (meta-)learn from many related tasks with small training sets in the hope of doing well on the problem of interest. You can find an example of one-shot learning in the Deep Learning Specialization.

3) Relying on hand-coded knowledge, for example through designing more complex ML pipelines. An AI system has two major sources of knowledge: (i) data and (ii) prior knowledge encoded by the engineering team. If we have small data, then we may need to encode more prior knowledge.

4) Data augmentation and data synthesis.
[/Quote]

Reference Source:
deeplearning.ai "The Batch" newsletter
https://info.deeplearning.ai/the-batch-self-driving-cars-that-cant-see-pedestrians-evolutionary-algorithms-fish-recognition-fighting-fraud-
