Here at the BMS LAB, we often get requests from researchers who want to use machine learning to classify data collected during a study. The data can take various forms, such as transcribed interviews, text scraped from social media, physiological data, or video recordings. With all the hype about machine learning in the popular press, expectations are often unrealistically high, and in this blog I hope to point out some factors to consider before you decide to use machine learning in your research.
Although recent accomplishments in machine learning, such as self-driving cars and beating humans at video games, have received much attention in the press, the current state of the art represents a level of intelligence not much above that of insects. The expectation that machine learning will magically perform better than a human at making sense of your data is therefore unrealistic. It is also important to understand that machine learning is in essence a technique for recognising patterns, and the algorithm has to be trained, or given some criterion, in order to do so. This requirement implies several things:
- You will need lots of training data
- You will need lots of cheap labour to annotate your data
- You will need to identify and extract features that may be meaningful in your data before it is given to the algorithm
What do I mean by ‘lots’? We often get requests where the data consists of, say, twenty interviews. The first question to ask is whether a human, with his or her much greater intelligence, would be able to classify the data in the way you require, given the amount of data available. By human I do not mean someone who has been trained in the field to do this classification, but a complete novice who has only this small sample of data available to learn from. If the answer is no, you should reconsider using machine learning for this task and ask whether tried-and-tested statistical techniques may be a better option.
In machine learning, ‘lots’ usually means hundreds if not thousands of data samples. If you are fortunate enough to have such a large dataset, which you may typically get from physiological data or social media scraping, then the next challenge is to decide how to extract features of the data that can be used for classification or prediction. Here, well-known techniques in signal processing, statistical analysis, clustering, etc. can be used to pre-process the data and to explore ways to structure the analysis. Once that has been done and you have decided to opt for a supervised machine learning technique, some of the data will have to be set aside for training. This means that a subset of the data is selected and annotated with the desired outcome. Say you have collected images of a thousand faces showing various emotions. You would start by selecting a subset of these faces and presenting them to a number of humans (this is where the cheap labour comes in), who classify each face with a label such as happy, sad, angry, etc. This becomes your training set. Feature extraction on these faces could, for example, apply image processing techniques to extract lines or edges. The machine learning algorithm is then trained on this annotated, feature-extracted subset and will hopefully reach a level where it becomes, say, 90% accurate in its classification.
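The supervised workflow described above can be sketched in a few lines. This is only an illustration, assuming scikit-learn is available: the synthetic features below stand in for, say, edge descriptors extracted from the face images, and the three classes stand in for emotion labels like happy, sad, and angry.

```python
# Sketch of the supervised workflow: annotate data, hold some out,
# train a model, and measure accuracy on the unseen portion.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1000 synthetic "faces", 20 extracted features each, 3 emotion
# classes. Real feature extraction would replace this stand-in.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=3,
                           random_state=0)

# Split into the annotated training set and an unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train a simple classifier and evaluate on the held-out data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.0%}")
```

The held-out test set is essential: accuracy measured on the training data itself would flatter the model and tell you little about how it will behave on new samples.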
The data volume and cheap labour challenges can sometimes be overcome by using machine learning models that have already been trained on large datasets similar to yours. Many such datasets and models have been made publicly available in recent years, including for text analysis, human movement, and facial feature detection. Searching the internet for these will save you much time, and we can also help in this regard.
I hope that this blog has highlighted some of the issues to consider when using machine learning in your research. Remember that the old saying of rubbish in, rubbish out (RIRO) applies very much in the domain of machine learning. This TED talk by Janelle Shane points out some interesting results in this regard: https://www.ted.com/talks/janelle_shane_the_danger_of_ai_is_weirder_than_you_think/transcript?language=en