Using machine learning in NLP

Machine learning and NLP: ready for industrial use!

Machine learning (ML) applications have recently made spectacular breakthroughs in a number of natural language processing (NLP) tasks. Much of this work focuses on developing ever better models, while the share of tasks in practical projects that are concerned not with modeling but with topics such as data provision and the evaluation, maintenance and deployment of models is often not given sufficient attention. As a result, companies that are not in a position to design their own platforms for the use of ML and NLP often lack suitable tools and best practices. It is becoming apparent that in the coming months an engineering perspective on ML and its use in companies, directed towards precisely these practical questions, will gain in importance.




Natural language processing (NLP) is currently developing more rapidly than it has in a long time. Comparing the titles of the papers accepted at the field's major conferences with those from ten years ago gives an idea of the profound upheaval that has been reshaping NLP for several years.

Figure 1

The proportion of submissions on the subject of deep learning at large NLP conferences has been rising sharply for years.[1]

While this trend naturally first reached the academic world, the corresponding approaches have for some time now increasingly also reached the domain of industrial applications and directly usable software. In addition to the research-oriented, scientific approach to NLP, this leads to the increased development of NLP as an engineering discipline with its own requirements and success criteria.

NLP in transition

An increase in the use of quantitative, statistical methods in many areas of NLP has been observable for more than 25 years, and such methods have long been the state of the art in tasks such as part-of-speech tagging, parsing or document categorization. With the advent of deep learning techniques[1], however, this trend has intensified massively once again, following a development that first took place in other fields such as image processing: the moment of intense public awareness for deep learning[2] is often associated with the success of a deep learning approach in the ILSVRC 2012 competition, in which a neural network left the entire field far behind in the task of classifying image content. This situation currently seems to be repeating itself in NLP: in autumn 2018, a deep-learning-based model called BERT[3] was published, which improved the state of the art - that is, the best results achieved on publicly available data sets - across a whole range of tasks. One of these tasks - NLI, natural language inference - consists in deciding, for a pair of statements, whether the first statement logically entails the second, contradicts it, or whether the two statements stand in neither relationship to one another.

Table 1

The SNLI Corpus contains over 500,000 sentence pairs and information on their logical relationship.

Statement A | Statement B | Logical relationship
A soccer game with multiple males playing | Some men are playing a sport | Entailment
A black race car starts up in front of a crowd of people. | A man is driving down a lonely road. | Contradiction

Recently, deep learning methods, trained on many thousands of such pairs, have achieved results here that equal or even exceed those of human readers.[4]
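To make the three-way NLI label scheme concrete, the following is a deliberately naive word-overlap baseline - a sketch for illustration only, with invented thresholds, and emphatically not how models like BERT work. Notably, it mislabels the entailment example from Table 1, because premise and hypothesis share almost no words; this is exactly why models that capture meaning rather than surface overlap were such a step forward.

```python
# Toy lexical-overlap heuristic for natural language inference (NLI).
# Thresholds are illustrative assumptions, not tuned values.

def word_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis words that also occur in the premise."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0

def naive_nli(premise: str, hypothesis: str) -> str:
    overlap = word_overlap(premise, hypothesis)
    if overlap >= 0.7:        # most hypothesis words are covered
        return "entailment"
    if overlap <= 0.2:        # almost no shared vocabulary
        return "neutral"
    return "contradiction"    # partial overlap: same scene, different claim

label = naive_nli("A soccer game with multiple males playing",
                  "Some men are playing a sport")
# The heuristic answers "contradiction" although the true label is entailment:
# only "a" and "playing" are shared, so surface overlap is misleading here.
```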

As with the breakthroughs in image processing, such successes require not only an excellent understanding of the mathematical and algorithmic foundations of the methods used, but above all massive computing power and massive amounts of training data.[5] Fortunately, these requirements only apply to training a model from scratch, while using and even fine-tuning a model on one's own data is often feasible with moderate hardware and time budgets. For everyday business use in tasks such as categorization, entity extraction or semantic search, the focus is therefore often less on designing new models than on using existing models and optimizing ("fine-tuning") them for the specific task at hand. This is made possible by the fact that models like BERT can abstract and store general properties of, in this case, natural language during training in such a way that they can also be applied to new tasks - an approach known as "transfer learning".[6] The principles of transfer learning have been widely used in NLP for some months now[7] and have led to the conclusion that in 2018 the "ImageNet moment" had finally arrived in NLP as well.[8] This alludes to the successes in transferring knowledge learned from large image data sets to new tasks without a fundamental redesign of the networks used: the BERT network mentioned above, for example, achieves its sensational results in various areas without specific adaptation to each changed task.
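The division of labor behind fine-tuning can be sketched in a few lines of plain Python: a "pretrained" component that stays frozen, plus a small task-specific head whose weights are the only thing updated. The character-bigram extractor below is a trivial stand-in for a real pretrained model, and the sentiment examples are invented; this illustrates the principle, not BERT itself.

```python
# Transfer-learning sketch: frozen feature extractor + trainable head.
from collections import Counter

def pretrained_features(text: str) -> Counter:
    """Frozen 'pretrained' extractor: character-bigram counts (a stand-in)."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def train_head(examples, epochs=20):
    """Perceptron head on top of the frozen features; only these weights
    are updated, the extractor itself is never touched."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for text, label in examples:          # label is +1 or -1
            feats = pretrained_features(text)
            score = bias + sum(weights[f] * v for f, v in feats.items())
            if label * score <= 0:            # misclassified: update head only
                for f, v in feats.items():
                    weights[f] += label * v
                bias += label
    return weights, bias

def predict(weights, bias, text):
    feats = pretrained_features(text)
    score = bias + sum(weights[f] * v for f, v in feats.items())
    return 1 if score > 0 else -1

# Fine-tuning on a handful of (invented) labeled examples:
train = [("great product", 1), ("awful service", -1),
         ("really great", 1), ("awful, just awful", -1)]
w, b = train_head(train)
```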

With the advances described above (the wide availability of powerful models that can be used for a wide range of tasks), another prerequisite for the successful use of deep learning-based NLP methods is gaining in importance: the acquisition of suitable training data.

Where is the data?

The fact that data is the new and decisive raw material of our time has become a much-cited commonplace. The NLP methods mentioned above, for instance, practically no longer use any explicitly coded linguistic knowledge in the form of rules, lexicons or symbolic analysis algorithms; the knowledge stored in these models is instead derived from the analysis of large amounts of data. The authors of BERT, for example, state that BERT was trained on data from Wikipedia and a "Book Corpus".[9] This type of data is abundant and readily available because it is unannotated language data: after downloading it from Wikipedia (and converting the text files to the required format), no time-consuming and expensive manual work such as marking certain text passages or categorizing documents needs to be carried out. While such data is available in large quantities, training data for specific business needs is usually not freely available. Often it is not available in the required quantity and quality even for corporate customers, but has to be laboriously created for each upcoming project. In a world in which the basic program libraries for NLP tasks are often freely available software[10] and even complex pre-trained models are released into the public domain, scenario-specific training data is often decisive for success or failure and accounts for a considerable part of the cost and duration of projects.

Emphasis on domain expertise

It can be seen as evidence of the growing maturity of the field that in many projects the emphasis is shifting away from implementing classic NLP processing steps such as tokenization, POS tagging and parsing and towards modeling the respective domain knowledge (e.g. medical or legal expertise). The former are usually available in usable quality and are often easy to obtain, install and use. The involvement of domain experts is therefore particularly important - made possible and simplified by suitable user environments with few technical prerequisites. The use of NLP software is thus increasingly decoupled from the need to implement and optimize the basic mathematical concepts of the methods used; nevertheless, the fullest possible understanding of these methods will remain an important condition for their successful use. In this way, complex methods such as LSTM layers[11] or architectures such as GANs[12], which internally perform comparatively complex computations, become easy-to-combine building blocks for constructing networks in the corresponding program libraries. The skills needed to sensibly select, combine and apply these available modules to a given problem, however, remain the domain of experienced specialists. And even this requirement can be further mitigated by using existing, pre-trained models.

Figure 2

The Kairntech Dataset Factory allows the rapid and ML-supported annotation of documents for the efficient creation of training corpora.

In the following sections we look at two exemplary tasks that apply the principles presented above: many NLP-relevant problems benefit from learning methods, and the software components they require are largely available as open source, allowing one to abstract away from many of the basics and concentrate on content instead.

Structuring and extraction of semi-structured documents

Public discussion about language-processing AI very quickly turns to visionary questions, such as when we will no longer be able to tell in conversation whether we are facing a person or a machine. As exciting as these questions are, in day-to-day business there are often more mundane but no less promising problems to solve - for example, how metadata can be generated reliably and with high accuracy from a large corpus of documents such as clinical studies, contracts or scientific papers. A document may contain a number of date expressions, but which one denotes, say, the entry into force of the contract (and not the date of signature or the end of the first project phase)? A clinical trial may mention a number of diseases - which of them is the actual subject of investigation (and not just a casually listed undesirable side effect)?
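The date-ambiguity problem can be made concrete with a small sketch. The cue phrases, the 60-character context window and the sample contract below are all invented for illustration; a production rule set would be far more elaborate, which is precisely the maintenance burden that motivates learning-based extraction.

```python
# Hypothetical sketch: pick the contract date whose context signals
# "entry into force", rather than the first date that appears.
import re

DATE = re.compile(r"\b\d{1,2} (January|February|March|April|May|June|July|"
                  r"August|September|October|November|December) \d{4}\b")
EFFECTIVE_CUES = ("enters into force", "effective as of", "takes effect")

def find_effective_date(text: str):
    """Return the first date whose preceding context (60 chars, an assumed
    window size) mentions an entry-into-force cue, or None otherwise."""
    for match in DATE.finditer(text):
        window = text[max(0, match.start() - 60):match.start()].lower()
        if any(cue in window for cue in EFFECTIVE_CUES):
            return match.group(0)
    return None

contract = ("Signed on 3 May 2019 by both parties. "
            "The agreement enters into force on 1 June 2019.")
# The signature date is skipped; only the cue-marked date is returned.
```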

Dealing with these questions has a long tradition under the name of information extraction. The solution often involves creating detailed, frequently very complex, manually written rules. Creating these rules is time-consuming and requires a high degree of familiarity with the formalism used - conditions that are often not met in companies. Experts know their documents and their implicit structure well, but not every specialist in a particular domain also knows a given rule language or has the time and leisure to learn it. Learning-based NLP methods instead make it possible to extract the desired information by providing a number of training examples. The behavior we want to reproduce here is that, in many cases, it is clear to human readers after a few examples where certain information can be found in a document.

The title page of a scientific publication shown above is intuitively understandable to the viewer: what the title is, who the authors are and their affiliations, when the paper was written, and so on. Here the authors are shown in capital letters and centered, and the title is set in a larger font - but that may be different in the next document. It is essential that the system learns to recognize these and many similar cases from a manageable number of examples. Obviously, in addition to the character string itself, formatting information such as font size, font changes, centering and position in the line or in the text is important.
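One way to operationalize this observation is to represent each token by a feature vector that combines the string with layout cues. The feature set below is an illustrative assumption, not the actual feature definition of Grobid or any other system.

```python
# Sketch of the feature vector a layout-aware tagger might consume for
# each token when labeling title, authors, affiliation, etc.

def token_features(token: str, font_size: float, body_size: float,
                   centered: bool, position_in_line: int) -> dict:
    """Combine the character string with formatting cues, since layout
    (font size, centering, position) often separates title from author."""
    return {
        "lower": token.lower(),
        "all_caps": token.isupper(),                # authors often capitalized
        "larger_than_body": font_size > body_size,  # titles use larger fonts
        "centered": centered,
        "line_start": position_in_line == 0,
        "has_digit": any(c.isdigit() for c in token),
    }

feats = token_features("SMITH", font_size=10, body_size=10,
                       centered=True, position_in_line=0)
```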

An example of a system that implements these criteria is Grobid.[13] Grobid has been in production for this type of task for some time at users such as NASA, CERN, ResearchGate and others.[14]

Generation of training data

A central success criterion in the academic evaluation of NLP methods is the quality measured on large amounts of test data. These data sets are often freely available or are created specifically for large-scale evaluation competitions. Progress in the field is rightly measured to a large extent by results on these test sets, and an approach that measurably leaves the competition behind can secure widespread attention in expert circles. Nevertheless, it must be noted that requirements in industry often differ considerably from those in academia. How performant a system is, how long it takes users to prepare it for a task, how complex the installation is, how extensive the hardware requirements are, or how easily errors can be located and corrected can often be just as important as the maximum achievable analysis quality.

This applies in particular to the question of training data. Peter Norvig even comes to the conclusion that a simpler model with a lot of training data is usually superior to even very complex models that have been trained on small amounts of data.[15]

While training corpora of relevant size are now available in academia, and the large digital companies often derive extensive data sets from user data, corporate customers often do not have enough labeled training data even in the age of big data. If, for example, a system for automatically processing damage reports is to be built for an insurance company, training a recognizer requires data in which the relevant information (such as the name and address of the injured party, the date of the damage, the amount of the damage, and details of the party responsible) is marked. If we assume that one such record can be annotated in three minutes and that a relevant corpus should contain at least 5,000 records[15], this already amounts to one and a half months of pure processing time - not counting breaks or the question of whether, for reasons of consistency, data should not always be annotated by several editors. In our experience, efforts of this magnitude represent a significant hurdle in projects, even with large industrial customers.
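The estimate can be checked with a quick back-of-the-envelope computation; the eight-hour working day used for the conversion is an assumption introduced here, not stated in the text.

```python
# Annotation-cost arithmetic for the insurance example above.
records = 5000
minutes_per_record = 3

hours = records * minutes_per_record / 60   # 250.0 hours of pure annotation
working_days = hours / 8                    # 31.25 eight-hour days,
                                            # i.e. roughly six working weeks
```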

Here, too, the focus is on requirements for the use of NLP in companies that do not exist in this form in research. Approaches that address these requirements include the environment described in[16] and the Kairntech Dataset Factory[17], in each of which a suitable organization of the learning and annotation process - in particular, selecting the next record to be annotated by the editor - can significantly reduce the number of required records compared to an uninformed, random ordering.[18] The developers state that, in view of the large number of available analysis components, such an environment is the "missing piece" needed to make NLP more usable in a project context.[19]
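The selection step - proposing the record the current model is least certain about - can be sketched as least-confidence sampling, one common active-learning criterion. The probability values below are stand-ins for a real model's output, and whether the tools cited above use exactly this criterion is not stated here; this is one plausible variant, not their documented implementation.

```python
# Least-confidence sampling: annotate next whatever the model is
# least sure about, instead of picking records at random.

def least_confident(predictions):
    """predictions: list of (record_id, class_probabilities).
    Return the id whose top predicted probability is lowest."""
    return min(predictions, key=lambda item: max(item[1]))[0]

preds = [
    ("doc-1", [0.95, 0.03, 0.02]),   # model is confident: low priority
    ("doc-2", [0.40, 0.35, 0.25]),   # model is unsure: annotate this next
    ("doc-3", [0.80, 0.15, 0.05]),
]
```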

NLP and machine learning in the corporate context

Emmanuel Ameisen[20] estimates for the currently rapidly developing field of NLP and machine learning that the effort in concrete projects is often only around 10 percent modeling, while the remaining 90 percent is spent on data collection, data processing and similar tasks. Interestingly, however, a good 90 percent of the projects on relevant open-source platforms such as GitHub[21] are concerned with modeling - and certainly not because the other questions have already been largely solved. Rather, best practices have generally been lacking so far, and modeling is simply more exciting and promises more prestige. Alongside approaches that are currently very popular in research, such as deep learning in particular, there are a number of older, well-understood and powerful methods that may not enjoy the same prestige at conferences but that may even be preferable for solving specific tasks in a corporate context. Decision trees[22], for example, achieve good results on many data sets and, compared with more modern approaches, are distinguished by properties that matter particularly outside academia, such as the traceability of their decisions or their comparatively modest data requirements during training. This underlines once again how requirements in academia differ from those in companies. "If the only tool you have is a hammer, every problem looks like a nail" is a popular saying on the subject. For companies as well as NLP/AI experts, this means that it certainly makes sense to expand one's own portfolio of skills to include deep learning in order to exploit its often superior performance when required, but that there is no need to throw established - and in specific cases sometimes even more suitable - methods overboard.
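The traceability advantage of decision trees can be illustrated with a toy, hand-written tree for the insurance scenario from the previous section; the features, thresholds and routing decisions are invented for illustration.

```python
# Toy decision tree for routing insurance claims: every prediction
# comes with the exact, human-readable path that produced it.

def classify_claim(amount: float, has_police_report: bool):
    """Return a routing decision plus the decision path - the kind of
    traceability that deep models do not provide out of the box."""
    path = []
    if amount > 10000:
        path.append("amount > 10000")
        if has_police_report:
            path.append("police report present")
            return "manual review", path
        path.append("no police report")
        return "fraud check", path
    path.append("amount <= 10000")
    return "automatic settlement", path

decision, path = classify_claim(15000, has_police_report=False)
# decision is "fraud check"; path records both branch tests taken.
```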

In summary, there is much to suggest that we will see exciting developments in precisely these areas in the coming months, and that progress in the basic architecture of models will be accompanied by, and built upon through, progress in the implementation and refinement of applications for concrete use in companies. This is particularly important for companies that cannot build their own platforms from the current open-source components with their own teams, but instead depend on emerging standards - best practices - for the training, monitoring, deployment, combination and maintenance of models.

Sources

Rokach, Lior; Maimon, O. (2008). Data Mining with Decision Trees: Theory and Applications. World Scientific Pub Co Inc. ISBN 978-9812771.

Published Online: 2019-05-15
Published in Print: 2019-05-08

© 2019 Walter de Gruyter GmbH, Berlin / Boston