
Software Testing vs. Machine Learning Testing

Abstract: In this article, I will briefly discuss the high-level ideas behind two major AI world views: Connectionism (neural network based Machine Learning and Deep Learning techniques), which learns about the world from real-world examples, and Symbolism (Expert Systems), which is encoded with human-defined specifications in the form of symbolic representations and programming logic.
I also list the differences between conventional software testing, which has relatively stable and clear test oracles, and machine learning testing, which must deal with a constantly moving target whose bugs may exist not just in the algorithm but also in the training dataset.


There are two main schools of thought in the field of Artificial Intelligence, namely Connectionism (Artificial Neural Network) and Symbolism (Expert System).
Symbolism, which lets human experts encode prior knowledge into the IT system in the form of instructions or rule-based specifications, dominated the headlines and the funding for several decades from the mid-1950s onwards.
On the other hand, Artificial Neural Network techniques (e.g., Machine Learning and Deep Learning), which let the IT system grow its intelligence by assimilating training data, have finally regained momentum and now outperform Expert Systems on many business fronts, ever since the team of Geoffrey Hinton, the Godfather of Deep Learning, won the 2012 ImageNet contest (a yearly large-scale visual recognition challenge) by a wide margin with the convolutional neural network AlexNet.
In my opinion, the current forms of AI, whether Connectionism (Artificial Neural Networks) or Symbolism (Expert Systems), are the outcome of mapping human intelligence onto machine intelligence.
Classical Artificial Intelligence enables the IT system to gain its cognitive capability through human coding/programming (specifications hand-coded by programmers) based on subject matter experts' designs.
Artificial Neural Networks, more commonly known these days as Machine Learning and Deep Learning, learn to make sense of the world by processing large amounts of high-quality, domain-specific and context-driven data (real-world examples) carefully curated by semi-skilled workers and human experts.
Therefore, I believe that any quality assurance process, policy or standard we intend to apply in the development life cycle of an AI-based information system should be human-centred, value-driven and collaboration-oriented, so that we can help cross-functional teams maximise human intelligence throughput in the form of efficient data pipelines, clean and accurate data, quality code (algorithms), insightful A/B testing and so on.
For software testers who have spent many years testing traditional applications developed by programmers from rule-based specifications, much as Expert Systems are, now may be the right time to explore what we need to learn, or at least be aware of, in order to conduct meaningful testing of AI-based applications, which are very likely built on top of neural network based Machine Learning or Deep Learning techniques.
According to Zhang et al. (2020), traditional software testing and machine learning testing differ in many aspects, as follows:
1) Component to test (where the bug may exist): traditional software testing detects bugs in the code, while ML testing detects bugs in the data, the learning program and the framework, each of which plays an essential role in building an ML model. A small data-validation sketch after this list illustrates what checking the data component can look like.
2) Behaviours under test: the behaviours of traditional software code are usually fixed once the requirement is fixed, while the behaviours of an ML model may change frequently as the training data is updated. The behavioural regression sketch after this list shows one way to keep such changes in check.
3) Test input: the test inputs in traditional software testing are usually the input data when testing code; in ML testing, however, the test inputs may have more diverse forms. Note that we separate the definitions of 'test input' and 'test data'. In particular, we use 'test input' to refer to inputs in any form that can be adopted to conduct machine learning testing, while 'test data' specifically refers to the data used to validate ML model behaviour. Thus, test inputs in ML testing could be, but are not limited to, test data. When testing the learning program, a test case may be a single test instance from the test data or a toy training set; when testing the data, the test input could be a learning program.
4) Test oracle: traditional software testing usually assumes the presence of a test oracle: the output can be verified against expected values by the developer, so the oracle is usually determined beforehand. Machine learning, however, is used to generate answers for a set of input values after being deployed online, and the correctness of the large number of generated answers typically has to be confirmed manually. Currently, the identification of test oracles remains challenging, because many desired properties are difficult to specify formally. Even for a concrete, domain-specific problem, oracle identification is still time-consuming and labour-intensive, because domain-specific knowledge is often required. In current practice, companies usually rely on third-party data labelling companies to get manual labels, which can be expensive. Metamorphic relations are a type of pseudo oracle adopted to mitigate the oracle problem in machine learning testing automatically; a small metamorphic-relation sketch appears after this list.
5) Test adequacy criteria: test adequacy criteria provide a quantitative measure of the degree to which the target software has been tested. To date, many adequacy criteria have been proposed and widely adopted in industry, e.g., line coverage, branch coverage and dataflow coverage. However, due to fundamental differences in programming paradigm and logic representation between machine learning software and traditional software, new test adequacy criteria are required that take the characteristics of machine learning software into consideration. The neuron coverage sketch after this list illustrates one such criterion proposed in the research literature.
6) False positives in detected bugs: due to the difficulty in obtaining reliable oracles, ML testing tends to yield more false positives in the reported bugs.
7) Roles of testers: the bugs in ML testing may exist not only in the learning program, but also in the data or the algorithm, and thus data scientists or algorithm designers could also play the role of testers.
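
To make the "bugs in the data" point from item 1 concrete, here is a minimal Python sketch of a data-validation check, assuming a tabular training set loaded with pandas. The column names ("age", "label") and the valid ranges are purely illustrative assumptions, not something taken from the survey.

import pandas as pd

def validate_training_data(df: pd.DataFrame) -> list:
    """Return a list of data-quality issues found in the training set."""
    issues = []
    if df.isnull().any().any():
        issues.append("missing values present")
    if df.duplicated().any():
        issues.append("duplicate rows present")
    if not df["label"].isin([0, 1]).all():              # labels outside the expected set
        issues.append("unexpected label values")
    if ((df["age"] < 0) | (df["age"] > 120)).any():     # implausible feature range
        issues.append("age outside plausible range")
    return issues

# Example run on a deliberately broken toy dataset.
df = pd.DataFrame({"age": [25, 47, -3], "label": [0, 1, 1]})
print(validate_training_data(df))                       # ['age outside plausible range']

In practice such checks would run inside the data pipeline, so that a bad batch of training data is caught before it silently changes the model's behaviour.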
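
For item 2, a simple way to deal with a moving target is a behavioural regression check: after the training data is updated and the model retrained, its accuracy on a frozen evaluation set should not drop by more than an agreed tolerance. The sketch below uses scikit-learn with a synthetic dataset and a 0.05 tolerance, both of which are assumptions for illustration only.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# One fixed pool of labelled data; the 'update' is a new batch from the same source.
X, y = make_classification(n_samples=2500, random_state=0)
X_eval, y_eval = X[:500], y[:500]          # frozen evaluation set
X_old, y_old = X[500:2000], y[500:2000]    # data available at the last release
X_new, y_new = X[2000:], y[2000:]          # newly collected batch

old_model = LogisticRegression(max_iter=1000).fit(X_old, y_old)
baseline = accuracy_score(y_eval, old_model.predict(X_eval))

new_model = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))
updated = accuracy_score(y_eval, new_model.predict(X_eval))

# The expected behaviour is a tolerance band, not an exact output value.
assert updated >= baseline - 0.05, f"accuracy regressed: {baseline:.3f} -> {updated:.3f}"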
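
For item 4, a metamorphic relation acts as a pseudo oracle because it checks a relationship between two runs instead of comparing each output against a ground-truth label. One well-known relation for a k-nearest-neighbour classifier is invariance to the order of the input features, since Euclidean distance does not depend on feature order. The sketch below checks that relation on a synthetic dataset; the setup is my own illustration rather than an example from the survey.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train, X_test = X[:500], y[:500], X[500:]

perm = np.random.default_rng(0).permutation(X.shape[1])   # the metamorphic transformation

source = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)
follow = KNeighborsClassifier().fit(X_train[:, perm], y_train).predict(X_test[:, perm])

# No ground-truth labels are needed: only the relation between the two runs is checked.
assert np.array_equal(source, follow), "metamorphic relation violated"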
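
For item 5, one adequacy criterion proposed for neural networks in the research literature is neuron coverage: the fraction of neurons activated above a threshold by at least one test input. The toy numpy sketch below computes it for a tiny random network; the weights, threshold and network size are made-up stand-ins, not a real trained model.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 16)), rng.normal(size=16)   # hidden layer 1: 16 neurons
W2, b2 = rng.normal(size=(16, 4)), rng.normal(size=4)    # hidden layer 2: 4 neurons

def activations(x):
    h1 = np.maximum(0, x @ W1 + b1)      # ReLU activations of layer 1
    h2 = np.maximum(0, h1 @ W2 + b2)     # ReLU activations of layer 2
    return np.concatenate([h1, h2])

def neuron_coverage(test_inputs, threshold=0.5):
    covered = np.zeros(16 + 4, dtype=bool)
    for x in test_inputs:
        covered |= activations(x) > threshold   # a neuron counts once it fires for any input
    return covered.mean()

test_suite = rng.normal(size=(50, 8))            # 50 test inputs with 8 features each
print(f"neuron coverage: {neuron_coverage(test_suite):.2%}")

Higher coverage does not guarantee correctness, but, like line coverage in traditional testing, it gives a quantitative signal of how much of the model's internal behaviour the test suite has exercised.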

Reference
Zhang, J.M., Harman, M., Ma, L. and Liu, Y., 2020. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering. [Online]. Available at: https://arxiv.org/pdf/1906.10742.pdf (Accessed: 26 March 2020)
