Halyna Oliinyk, Ocean’s Junior Machine Learning Developer, explains natural language processing.
Intelligent personal assistants like Alexa, Siri and Cortana can feel like a godsend, ordering takeaway, changing the thermostat and updating us on our favorite sports teams. But how do they understand what it is we want them to do?
Siri understands us when we ask where the nearest ATM is because of a computer science field called natural language processing (NLP). That’s also what makes online translation services work, among a whole host of other things – NLP allows computers to understand and respond to human conversation. It is often assumed to be just a branch of machine learning, but NLP is a tricky area that requires a lot of human input.
For all data scientists, datasets with a large number of variables are difficult to work with. But NLP is particularly difficult for two reasons. Firstly, languages have their own distinct dialects, idioms, and nuances, where context is crucial to understanding. This means that equivalent datasets look very different across languages and their domains.
Secondly, the lexical, morphological, and syntactical structures of spoken languages vary hugely. This means that the techniques for processing human language have to be different depending on the language in question. For instance, German is an entirely different language type to Thai. English shares some similarities with both but isn’t structurally the same as either.
Because it’s so complex, solving NLP issues requires programmers to understand linguistics, statistics, machine learning, and the basics of calculus – which is quite a lot!
For example, Ocean.io uses NLP constantly to identify all the data that makes it so useful, your phone uses it to predict which word you want to type next, and Alexa uses it to understand what takeaway you want to order.
Want to know about the technical side of all this? The article continues below.
Image credit: Marcus Spiske
Can you explain more about the problems caused by multilingual datasets?
Most data scientists know the term ‘curse of dimensionality’: the anomalies that occur in a high-dimensional space, i.e. a dataset with a large number of variables. As I said above, in NLP this problem turns into a ‘curse of multilingualism’, because techniques depend heavily on the text of the language in question, which has many context-specific and domain-specific features, especially across different language families.
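To make this concrete, here is a minimal sketch (my own toy example, not Ocean.io’s pipeline) of why multilingual text inflates dimensionality: in a bag-of-words representation every distinct token becomes one dimension, so adding a second language roughly stacks a second vocabulary on top of the first.

```python
from collections import Counter


def bag_of_words(sentences):
    """Map each sentence to token counts over the corpus vocabulary."""
    vocab = sorted({tok for s in sentences for tok in s.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for s in sentences:
        vec = [0] * len(vocab)
        for tok in s.lower().split():
            vec[index[tok]] += 1
        vectors.append(vec)
    return vocab, vectors


english = ["the cat sat on the mat", "the dog sat on the rug"]
german = ["die katze sass auf der matte", "der hund sass auf dem teppich"]

vocab_en, _ = bag_of_words(english)
vocab_mixed, _ = bag_of_words(english + german)
print(len(vocab_en))     # 7 dimensions for English alone
print(len(vocab_mixed))  # 16 dimensions once German is added
```

The vocabularies barely overlap, so the feature space more than doubles – and real corpora, with inflected word forms and domain jargon, make this far worse.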
The best example of this issue is machine translation, which typically relies on text alignment algorithms to produce correct parallel representations of sentences in different languages.
Another case where the problem becomes very hard to solve is the use of unsupervised methodologies, such as unsupervised sentiment analysis. Here, understanding the syntactic structure of the data, the correct choice of language-specific seed words, and so on, are crucial.
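As a toy illustration of how seed words drive unsupervised sentiment analysis, here is a lexicon-based sketch. The seed lists and the negation rule are illustrative assumptions on my part, and both would need to be rebuilt per language – which is exactly the point.

```python
# Hypothetical English seed lexicons; every language needs its own.
POSITIVE_SEEDS = {"good", "great", "excellent", "love"}
NEGATIVE_SEEDS = {"bad", "terrible", "awful", "hate"}
NEGATORS = {"not", "never"}  # even this tiny rule is language-specific


def sentiment(text):
    """Crude polarity score: +1 per positive seed hit, -1 per negative,
    with an immediately preceding negator flipping the sign."""
    tokens = text.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE_SEEDS) - (tok in NEGATIVE_SEEDS)
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


print(sentiment("this was a great film"))    # 1
print(sentiment("this was not great"))       # -1
```

Notice how fragile the negation rule is: it only catches an adjacent negator, and word order, morphology, and idiom all vary by language.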
How does NLP compare with computer vision?
Almost all modern computer vision algorithms select features themselves by processing image pixels and feeding them through convolutional and pooling layers. The neural network learns for itself which features to extract and how to treat them mathematically.
However, correct vector representation of text is itself an enormous unsolved task, so extracting features from such a representation while preserving the necessary information (syntactic, morphological, lexical, hierarchical, etc.) remains difficult.
Many modern NLP algorithms are based on manual feature selection, which means the scientist needs to know which statistical properties of text can help define patterns in it. Also, different algorithms handle different types of features (boolean, discrete, real, etc.) in different ways. How they do so depends on the linear or non-linear transformations present in the algorithm, the type of its cost function, inconsistency in the data, and so on.
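A minimal sketch of what manual feature selection can look like in practice (my own illustrative features, assuming a naive whitespace tokenizer). Note how the features deliberately span the types mentioned above: boolean, discrete, and real-valued.

```python
def extract_features(text):
    """Hand-crafted statistical features over a piece of text."""
    tokens = text.split()
    return {
        "has_exclamation": "!" in text,  # boolean
        "num_tokens": len(tokens),       # discrete
        "avg_token_length":              # real-valued
            sum(map(len, tokens)) / max(len(tokens), 1),
        "uppercase_ratio":               # real-valued
            sum(c.isupper() for c in text) / max(len(text), 1),
    }


features = extract_features("NLP is hard!")
print(features["num_tokens"])       # 3
print(features["uppercase_ratio"])  # 0.25
```

A downstream classifier must then be chosen (or adapted) to treat these mixed feature types sensibly, which is part of what makes the scientist’s judgment so important here.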
Is there anything else that it’s important to understand?
There isn’t room here to describe the opportunities deep learning offers for NLP (another time!), but an important part of text analysis is proper data cleaning, segmentation, augmentation, and validation (for certain issues where outliers are present in the dataset). This stage sometimes takes longer than building a proper model and measuring its performance, because language itself is non-linear and often needs to be manually adapted for quality text processing, otherwise you run into problems such as overfitting, underfitting, and biased accuracy measurement.
In many cases, data preparation requires building advanced rule-based systems involving many if-else conditions, mappings, and regular expressions. The ability to do this is tightly connected to the scientist’s ability to understand the data, weigh up the majority of input cases the model will observe, and consider how the final algorithm should handle them. The larger and less consistent the training dataset, the harder this problem becomes.
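Here is a small sketch of the kind of rule-based preprocessing just described: regular expressions plus explicit mappings. The specific rules and the abbreviation table are illustrative assumptions, not a production pipeline.

```python
import re

# Hypothetical abbreviation mapping; a real system would have many entries.
ABBREVIATIONS = {"u.s.": "united states", "dr.": "doctor"}


def clean(text):
    """Normalize raw text with a handful of hand-written rules."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    for abbrev, expansion in ABBREVIATIONS.items():
        text = text.replace(abbrev, expansion)
    text = re.sub(r"[^a-z\s.]", " ", text)     # drop stray symbols
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text


print(clean("Dr. Smith visited the U.S. -- see https://example.com!"))
# -> "doctor smith visited the united states see"
```

Every one of these rules encodes a judgment call about the input data, and the order in which they run matters – which is why this stage demands so much human understanding of the dataset.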
The Ocean.io team will regularly be updating this blog with explainers of the work we do, so keep your eyes peeled for more!