Causal variables may hold the key to the next breakthrough in deep learning

Over the past few weeks I have been listening, again and again, to videos of Yoshua Bengio sharing his thoughts on artificial intelligence. For his research contributions to deep learning, Bengio received this March's Turing Award, often described as the Nobel Prize of computer science, together with Geoffrey Hinton and Yann LeCun. I have learned a great deal from these talks, for example:

1) Training an artificial neural network, and retraining it whenever the data changes, takes a long time because the model usually describes the real-world problem we want to solve through a huge number of very low-level numerical variables (a low-level abstraction); in image classification, for example, real-world objects are represented as pixels. According to Yoshua Bengio, when conditions or the environment change and the model has to be retrained on new data, it is as if every parameter has to take part in the retraining. Bengio's research team is therefore experimenting with modelling real-world problems in terms of causal variables, in the hope that one day, when only a single causal variable has changed, we will only need data about that change in order to generalize the mapping from inputs to outputs, rather than retraining the whole model on a large number of variables that may or may not be causes of the outcome, as today's neural network systems require.
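As a rough illustration of this idea, here is my own sketch in PyTorch (not Bengio's actual architecture; the FactoredModel class and the choice of which factor changed are assumptions made purely for illustration). The model is built from one small sub-network per assumed causal factor; if only one factor shifts in the new environment, everything else is frozen and only that module is adapted, so far fewer parameters, and far fewer samples, are involved:

```python
# Hypothetical sketch: a model with one module per assumed causal factor.
# When only one factor changes, adapt just that module instead of the whole net.
import torch
import torch.nn as nn

class FactoredModel(nn.Module):
    def __init__(self, n_factors=4, dim=16):
        super().__init__()
        # one small sub-network per assumed causal factor
        self.factor_modules = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_factors)
        )
        self.head = nn.Linear(n_factors * dim, 1)

    def forward(self, x_per_factor):
        # x_per_factor: one input tensor per factor
        h = torch.cat([m(x) for m, x in zip(self.factor_modules, x_per_factor)], dim=-1)
        return self.head(h)

model = FactoredModel()

# Suppose only factor 2 changed in the new environment: adapt just that module.
changed = 2
for i, module in enumerate(model.factor_modules):
    for p in module.parameters():
        p.requires_grad = (i == changed)
for p in model.head.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

# one illustrative adaptation step on a tiny batch of new-environment data
x_new = [torch.randn(8, 16) for _ in range(4)]
y_new = torch.randn(8, 1)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x_new), y_new)
loss.backward()
optimizer.step()
```

The contrast with today's monolithic networks is that there, in Bengio's words, every parameter "wants to participate in every job", so any change in the world pulls all the weights back into training.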

2) AI models built under "lab conditions", where a preselected set of variables supplies most of the training data, often fail to respond accurately to real-world scenarios. This is because the statistical distribution induced by the variables assumed in the lab does not match the data distribution of the real world that the model is supposed to capture.
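A toy illustration of that mismatch (my own example, not from the talks; the data-generating rule and the shift are invented): a linear classifier is fit on "lab" data drawn around one region of feature space and then scored on "deployment" data drawn from a shifted region governed by the same underlying rule. Accuracy typically drops on the shifted data:

```python
# Toy covariate-shift demo: train on "lab" data, evaluate on shifted "real" data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, mean):
    x = rng.normal(loc=mean, scale=1.0, size=(n, 2))
    y = (x[:, 0] ** 2 + x[:, 1] > 1.0).astype(int)  # the "real world" labelling rule
    return x, y

x_lab, y_lab = sample(5000, mean=[0.0, 0.0])    # lab / training distribution
x_real, y_real = sample(5000, mean=[3.0, 0.0])  # deployment distribution, shifted

clf = LogisticRegression().fit(x_lab, y_lab)
print("lab accuracy:       ", clf.score(x_lab, y_lab))
print("deployment accuracy:", clf.score(x_real, y_real))
```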

3) We cannot simply reuse an existing machine learning model for every related task we would like to tackle, because that would mean applying the distribution learned from the existing data to new tasks whose distributions are not quite the same. The goal of transfer learning is therefore to start from what an existing model has already learned about its distribution and, with a comparatively small amount of new training data, adapt it to model the related distribution.
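A minimal transfer-learning sketch in PyTorch (hypothetical; the backbone here stands in for a network already trained on plentiful source-task data): the shared feature extractor is frozen and only a small new head is fitted on a handful of examples from the related target distribution, instead of retraining everything from scratch:

```python
# Transfer-learning sketch: frozen pretrained backbone + small task-specific head.
import torch
import torch.nn as nn

# pretend this backbone was already trained on plentiful source-task data
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False  # freeze: keep the shared representation as-is

head = nn.Linear(64, 3)  # new classifier for the related target task
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# only a small labelled batch from the related target distribution
x_target = torch.randn(16, 32)
y_target = torch.randint(0, 3, (16,))

for _ in range(100):
    optimizer.zero_grad()
    logits = head(backbone(x_target))
    loss = nn.functional.cross_entropy(logits, y_target)
    loss.backward()
    optimizer.step()
```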

4) Many companies find that the machine learning or deep learning models they train cannot be used in every country where they operate: only after the system is rolled out to a new country do people discover, to their surprise, that the distribution is different. This also explains why companies end up retraining separate models to capture the distribution of business data in each country.
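One pragmatic, purely illustrative check before such a rollout is to compare feature distributions between the training country and the new country; a large divergence is a warning that a separately retrained or fine-tuned model may be needed. A sketch using SciPy's two-sample Kolmogorov-Smirnov test (the feature, parameters, and threshold below are all invented for illustration):

```python
# Compare one feature's distribution between the training country and a new market.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
income_country_a = rng.lognormal(mean=10.0, sigma=0.5, size=10_000)  # training country
income_country_b = rng.lognormal(mean=9.2, sigma=0.8, size=10_000)   # new market

stat, p_value = ks_2samp(income_country_a, income_country_b)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
if stat > 0.1:  # hypothetical threshold chosen only for illustration
    print("Feature distribution differs: consider retraining or fine-tuning per country.")
```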

5) Purely unsupervised (or self-supervised) learning has so far not produced abstract, high-level representations of the world as powerful as those obtained with supervised learning. Even though supervised learning annotates the data only with labels rather than full sentences, the hints these labels give the learning system are already very powerful.
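A tiny "linear probe" comparison (my own toy example with scikit-learn, not from the podcast) makes the point concrete: the digits data is reduced to the same low dimension either without labels (PCA) or with label supervision (LDA), and a linear classifier is then scored on held-out data. The label-guided representation usually probes better at equal dimension:

```python
# Compare an unsupervised (PCA) vs. label-supervised (LDA) low-dimensional
# representation by training the same linear probe on top of each.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def probe(train_feats, test_feats):
    clf = LogisticRegression(max_iter=2000).fit(train_feats, y_tr)
    return clf.score(test_feats, y_te)

pca = PCA(n_components=9).fit(X_tr)                               # no labels used
lda = LinearDiscriminantAnalysis(n_components=9).fit(X_tr, y_tr)  # labels as hints

print("PCA probe accuracy:", probe(pca.transform(X_tr), pca.transform(X_te)))
print("LDA probe accuracy:", probe(lda.transform(X_tr), lda.transform(X_te)))
```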

Reference Sources:

A) WSAI Americas 2019 - Yoshua Bengio - Moving beyond supervised deep learning
https://www.youtube.com/watch?v=0GsZ_LN9B24

Transcript:

1) [...] one of the things people have observed a long time ago is called catastrophic forgetting, and difficulties when neural nets are trained on one task and then another one and then another one in sequence. And they sort of forget the old ones and need to almost relearn from scratch the new ones, which apparently is not at all what humans are doing - we are able to reuse the past knowledge even though the tasks we are seeing, the examples we are seeing, come in sequence. And I believe that this is due to the fact that in current neural net architectures it's like if every parameter wants to participate in every job, in every part of the knowledge representation, and so when something changes in the world, like one of these modules changes in, like..., in a ground truth model of the world, all of the parameters, all of the weights of the neural net want to change, and because there are many of them it takes a lot of time, a lot of data, to adapt all of them to the change.

[..] Now I'm going to propose a hypothesis to help us deal with how distributions of the data change, and that hypothesis is inspired by work by Judea Pearl and collaborators, who wrote a great book recently on causality. That hypothesis is that the changes in distribution are small if you represent the information about the distribution in the right space, in the right way. So, like, in the space of pixels it may look like things change a lot; for example, if I shut my eyes, pixel-wise things have changed a lot, but really it's just one little thing in the world that changed when my eyes got closed. So we don't want to model the world in the space of pixels, we would like to model it in this space of causal variables. And the hypothesis is that in that space of representation of the structure, of the relationships between those variables, the changes will be small. In fact, they will be focused on maybe just one variable.

[...] If we have represented our knowledge in this space of causal variables, then this hypothesis of small change should mean that I can recover from that change, adapt to that change, with very little data, right? Because just a few things change in my representation of the world.

For example I'm representing those objects and only one of them changed - I only need to gather data about that change and I don't need a lot of examples. So that's what we call sample complexity in the jargon of machine learning.

3) What I'll be arguing today is that this question is related to one that seems very different, one we've been asking in machine learning, which is: how do we represent the knowledge the learner is acquiring from the data in a way that separates it into pieces that can be easily reused? So there's a notion of modularization: for a few years, researchers in deep learning have been asking how to separate the neural net into modules that can be dynamically recombined in new ways when we consider, for example, new settings, changes in distribution, or new tasks. So this is related to another important notion in machine learning called transfer learning, that is, how do we build learning systems that not only work well on the training data but can then adapt quickly on new data that comes from a related distribution, so that you don't need tons and tons of data for this new setting in order to do a good job. And we can have systems that are robust when the world changes, because there are non-stationarities; the world changes all the time due to interventions of agents, due to our incomplete knowledge, as we move around in the world.

4) [...] I think it's important to understand the motivations for this. Now, in classical machine learning, the learning theory is based on the idea that we are only thinking about one distribution - the training distribution - and we can assemble training data, we can assemble some test data from the same distribution. But once you start thinking about the scenario where things change in the world, the distribution changes, this whole theory is not sufficient anymore, and maybe that explains a lot of the issues companies have when deploying machine learning products: when, say, you trained on data from one country and then you test, you apply the system in a different country, or, for example, maybe on people with a different distribution of genders or races, as we've seen recently, and things don't work as well. So that sort of robustness is something that current learning theory doesn't really handle, because it's about how we generalize from data from the same distribution as the training data, but it doesn't tell us anything about how to generalize to new distributions [...]

B) AI Horizons Keynote - Yoshua Bengio
https://www.youtube.com/watch?v=s3AUUYUXsP8

Transcript:

2) [...] We know that for a lot of machine learning systems that we build in labs, when we bring them to the real world where the distribution is a bit different, there's a loss in performance. It's difficult to generalize out of distribution. In fact, all our theory breaks down. So one aspect of dealing with this, I think, is embracing the challenge by building machines that not only understand the particular data you give them, but somehow try to figure out the underlying causal structure and an understanding of how the world works which is behind that data.

C) Yoshua Bengio: Deep Learning | MIT Artificial Intelligence (AI) Podcast
https://www.youtube.com/watch?v=azOmzumh0vQ

Transcript:

5) [...] Instead of learning separately from images and videos on one hand, and from text on the other hand, we need to do a better job of jointly learning about language and about the world to which it refers so that both sides can help each other.

We need to have good world models in our neural nets for them to really understand sentences which talk about what's going on in the world. And I think we need language input to help providing clues about what high-level concepts like semantic concepts should be represented at the top levels of these neural nets.

[...] In fact there is evidence that the purely unsupervised learning of representations doesn't give rise to high level representations that are as powerful as the ones we are getting from supervised learning.

And so the clues (of supervised learning) we're getting just with the labels, not even sentences, are already very powerful.
