Training data: the milestone of machine learning

Lorenzo Viscanti
February 3, 2022

Machine learning is a type of AI that teaches machines how to learn, interpret and predict results based on a set of data. As the world — and internet — have grown exponentially in the past few years, machine learning processes have become common for organizations of all kinds. For example, companies in the healthcare sector use ML to detect and treat diseases better, while in the farming sector machine learning helps predict harvest yields.

ML involves computers finding insightful information without being told where to look. This sets it apart from traditional computing, in which algorithms are sets of explicitly programmed instructions. ML instead relies on algorithms that are trained on data, from which they learn in an iterative process in order to generate outputs and automate decision-making.

The three basic ingredients of machine learning

There are three basic functional ingredients of ML.

  1. Data: The dataset you want to use must be well-structured and accurate. The data can be labeled or unlabeled. Unlabeled data are raw sample items (e.g. photos, videos, news articles) that carry no additional explanation, while labeled data are augmented: each item is paired with a tag that explains what it is.
  2. Algorithm: There are different types of algorithms that can be used (e.g. linear regression, logistic regression). Choosing the right one depends on a combination of business need, specification, experimentation and time available.
  3. Model: A “model” is the output of a machine learning algorithm run on data. It represents the rules, numbers, and any other algorithm-specific data structures required to make predictions.
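
To make the three ingredients concrete, here is a minimal Python sketch rather than a production recipe. The numbers are made up for illustration, and the function names `fit_linear_regression` and `predict` are hypothetical. The algorithm is ordinary least squares for a single input, one simple form of the linear regression mentioned above, and the model it returns is just two numbers.

```python
from statistics import mean

def fit_linear_regression(xs, ys):
    """Algorithm: ordinary least squares for a single input feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    # Model: the numbers needed to make predictions later on.
    return {"slope": slope, "intercept": intercept}

def predict(model, x):
    """Use the learned numbers to produce an output for a new input."""
    return model["slope"] * x + model["intercept"]

# Data: hypothetical labeled examples that follow y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_linear_regression(xs, ys)
prediction = predict(model, 5.0)  # close to 11.0 for this data
```

Running the same algorithm on different data would yield a different model, even though the algorithm itself is unchanged.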

How is machine learning used

Successful machine learning algorithms can be used for a variety of purposes. Thomas W. Malone, founding director of the MIT Center for Collective Intelligence, wrote in recent research:

The function of a machine learning system can be descriptive, meaning that the system uses the data to explain what happened; predictive, meaning the system uses the data to predict what will happen; or prescriptive, meaning the system will use the data to make suggestions about what action to take.

What are training data

Training data is the initial data used to train machine learning models. Training datasets are fed to machine learning algorithms so that they can learn to make predictions, or perform a desired task. This type of data is key, because it helps machines achieve results and work in the right way, as shown in the graph below.

The innovative power of machine learning models lies in the fact that they learn and improve over time as they are exposed to relevant training data. Some data is held out from the training data to be used as “evaluation data”, which validates and tests how accurate the machine learning model is. This data makes up the validation and test datasets, which are discussed later.

The importance of training data

Training data is a key part of the machine learning process. Several aspects are in play when you build a training dataset. The prime consideration is the size of the dataset, which depends on the use made of ML: the more complicated the use, the bigger the dataset needs to be. In the case of unsupervised learning, the more patterns you want your model to identify, the more examples it will need. You also want a scalable learning algorithm, one that can deal with any amount of data.

The second thing to consider is the quality of the data. It is important to feed the system with carefully curated data. The higher the quality of your training data, the better your machine learning model will be, especially in the early stages.

Quality in the data you use means collecting real-world data, which closely mimics how an application will receive external inputs, and diverse data, which reduces the possibility of the biases discussed later.

To understand how important training data is, think of the vehicle manufacturers that are taking on the challenge of autonomous driving. The quality of the data is essential to ensuring autonomous vehicles operate safely and as expected. It isn’t enough for vehicles to perform well in simulated good-weather conditions, or on one type of road. They must perform flawlessly in all weather conditions and in every imaginable road scenario.

Keep in mind, too, that data quality comes from including the final user in your product or service. The most successful AI projects are those that integrate data collection into the product life-cycle. Collection must be built into the core of the product itself, so that every time a user engages with it, you collect data from that interaction. The main purpose is to use the constant data flow to improve your offer for the user. Think of Spotify, which uses an AI technique called “collaborative filtering” to create personalized “Discover Weekly” playlists that help fans find new music that appeals to them. The more users listen to and search for music they enjoy, the better the app knows what to recommend.

How machine learning can learn from data

Machine learning offers a number of different ways to learn from data:

  • Supervised learning: it can be regarded as a “hands-on” approach, since it uses labeled data. Humans must tag, label, or annotate the data according to their criteria, in order to train the model to predict the predetermined “correct” outputs.
  • Unsupervised learning: it can be construed as a “broad pattern-seeking” approach, since it uses unlabeled data. Instead of predicting the correct output, models are tasked with finding patterns, similarities and deviations that can then be applied to other data exhibiting similar behaviour.
  • Reinforcement learning: it uses unlabeled data and involves a feedback mechanism. When the model performs a task correctly, it receives positive feedback, which strengthens the connection between the target inputs and outputs. Likewise, it can receive negative feedback for incorrect solutions.
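
The difference between the first two approaches can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not a recommended method: the data is invented, the function names are hypothetical, a nearest-neighbour rule stands in for supervised learning, and splitting values at their widest gap stands in for unsupervised pattern-finding. Reinforcement learning is not covered here.

```python
def nearest_neighbor_predict(labeled_examples, x):
    """Supervised: copy the label of the closest labeled example."""
    _, closest_label = min(labeled_examples, key=lambda pair: abs(pair[0] - x))
    return closest_label

def split_at_largest_gap(values):
    """Unsupervised: group 1-D values into two clusters at the widest gap."""
    ordered = sorted(values)
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, cut = max(gaps)
    return ordered[:cut + 1], ordered[cut + 1:]

# Labeled data: a human has tagged every value in advance (hypothetical).
labeled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]
prediction = nearest_neighbor_predict(labeled, 2.0)  # "small"

# Unlabeled data: the same kind of values, but with no tags attached.
unlabeled = [1.0, 9.0, 1.5, 8.0, 2.0]
groups = split_at_largest_gap(unlabeled)  # ([1.0, 1.5, 2.0], [8.0, 9.0])
```

In the supervised case the labels tell the model what the “correct” answer is. In the unsupervised case the groups emerge from the data alone, and it is up to a person to decide what they mean.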

Validation and testing

Validation and testing begin with splitting your dataset. The “Valid-Test split” is a technique to evaluate the performance of your ML model. You need to split the data because you don’t want your model to over-learn from the training data (overfit) and then perform poorly on anything else. Most of all, you want to evaluate how well your model generalizes.

Hence, you hold back validation and testing subsets from the training dataset in order to assess your model in a meaningful way. A typical split ratio between training, validation and testing sets is around 50:25:25. A brief explanation of the role of each of these datasets is below.

  • Validation dataset: it is useful when it comes to model selection. The data in this set is used to find the optimal values for the parameters of the model under consideration. When you work with ML models, you typically need to test multiple models with different parameter values to find the ones that give the best possible performance. Therefore, in order to pick the best model, you must evaluate each of them.
  • Testing dataset: once you have tuned the model by optimising its parameters, you end up with the final model. The testing set is used to provide an unbiased evaluation of the performance of this model and to ensure that it generalises well to new, unseen data.
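
The workflow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the dataset is synthetic, the helper names are hypothetical, and a simple k-nearest-neighbours classifier is used only because it has one obvious value to tune, namely k.

```python
import random
from collections import Counter

def split_dataset(examples, ratios=(0.5, 0.25, 0.25), seed=0):
    """Shuffle the examples and cut them into training, validation and testing subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_valid = int(len(shuffled) * ratios[1])
    # Whatever remains after the first two cuts becomes the testing subset.
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def knn_predict(train, x, k):
    """Predict the majority label among the k training examples closest to x."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def accuracy(train, examples, k):
    """Share of examples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == label for x, label in examples)
    return correct / len(examples)

# Hypothetical labeled data: the label is True when x is at least 60.
examples = [(x, x >= 60) for x in range(100)]
train, valid, test = split_dataset(examples)

# Model selection: keep the candidate value that scores best on the validation set.
candidates = [1, 3, 5]
best_k = max(candidates, key=lambda k: accuracy(train, valid, k))

# Final check on data that played no part in training or in the choice of k.
test_accuracy = accuracy(train, test, best_k)
```

The testing subset is touched exactly once, at the very end. If it were also used to choose k, its score would no longer be an unbiased estimate of how the model behaves on unseen data.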

Bias in machine learning

Bias in machine learning is defined as the phenomenon of observing results that are systematically prejudiced due to faulty assumptions. It can be interpreted in terms of the accuracy of our predictions: a high bias will result in inaccurate predictions, so you need to know what bias is in order to prevent it. An inaccurate prediction can derive from

There are techniques to handle bias, and they are related to the quality of the training data. Above all, the data must be as diverse as possible, including as many relevant groups as possible. The more inclusive the dataset is, the less likely the model is to turn a blind eye to a given group. You must identify representative data.

In general, bias reduces the potential of AI for business and society by encouraging mistrust and producing distorted results. Any value delivered by machine learning systems in terms of efficiency or productivity will be wiped out if the algorithms discriminate against individuals.

What are the different types of bias in Machine Learning

  • Sample bias: if the sample data used to train models does not replicate a real-world scenario, models are exposed to only a part of the problem space. An example is facial recognition software trained primarily on images of white men.
  • Prejudicial bias: it occurs due to cultural stereotypes. Assumptions about social status and gender may slide into a model. The consequence is that results will be skewed against people of a particular group. When software used for hiring is fed mostly male resumes, it will learn that men are preferable to other profiles.
  • Algorithmic bias: it may occur when the algorithm used is inappropriate for the current application. It is an error that derives from a flawed approach. This bias can emerge from the wrong “architectural” design of the algorithm or from unintentional decisions about the way data is collected. It is quite difficult to address.
  • Exclusion bias: it happens when important data is excluded from the training dataset. For example, imagine you have a dataset of customer sales in America and Europe. 98% of the customers are from America, so you choose to delete the location data, thinking it is irrelevant. This means your model will not pick up on the fact that your European customers spend twice as much.
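
The exclusion bias example can be reproduced with a few lines of Python. The figures below are invented to match the scenario (98 American customers, 2 European customers who spend twice as much), and the field names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customer records: 98% from America, 2% from Europe.
customers = (
    [{"region": "America", "spend": 50.0}] * 98
    + [{"region": "Europe", "spend": 100.0}] * 2
)

# Without the location column, only one overall figure is visible.
overall_average = mean(c["spend"] for c in customers)  # 51.0

# With the location column kept, the difference between groups shows up.
spend_by_region = defaultdict(list)
for c in customers:
    spend_by_region[c["region"]].append(c["spend"])
average_by_region = {region: mean(values) for region, values in spend_by_region.items()}
# {"America": 50.0, "Europe": 100.0}
```

The overall average of 51 is almost identical to the American figure, so a model trained without the location data has no way to learn that the small European group behaves differently.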

How businesses are using machine learning

Companies in every sector are moving to use machine learning in their products and services in some way. It is almost as if ML is becoming an expected feature. We are using it to make human tasks easier, faster, and better than before.

An example of machine learning applied to content consumption is Netflix, which personalises movie recommendations in order to “learn to entertain the world”. The idea is that users who watch A are likely to watch B: Netflix uses the watching history of other users with similar tastes to recommend what you may be most interested in watching next.
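
The “users who watch A are likely to watch B” idea can be sketched as a toy recommender in Python. This is only an assumption-laden illustration of recommending from the histories of users with similar tastes. It is not how Netflix or any real service implements it, and the user names, titles and the `recommend` function are hypothetical.

```python
from collections import Counter

def recommend(histories, user, top_n=2):
    """Rank unseen titles, weighting each by how much its viewers overlap with the user."""
    seen = histories[user]
    scores = Counter()
    for other, titles in histories.items():
        if other == user:
            continue
        overlap = len(seen & titles)  # shared titles measure similarity of taste
        if overlap == 0:
            continue  # users with nothing in common contribute nothing
        for title in titles - seen:
            scores[title] += overlap
    return [title for title, _ in scores.most_common(top_n)]

# Hypothetical watching histories.
histories = {
    "you": {"A", "B"},
    "ana": {"A", "B", "C"},
    "ben": {"A", "B", "D"},
    "cleo": {"A", "D"},
    "dan": {"E"},
}

suggestions = recommend(histories, "you")  # ["D", "C"]
```

Title D comes first because it was watched by two users whose histories overlap with yours, while E is never suggested because its only viewer shares nothing with you.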

Product recommendation is one of the most successful applications of machine learning in business. Recommendation systems put in front of you the products you are most likely to buy, based on what you have previously bought and browsed. For example, Amazon uses a user’s browsing history to keep those products in the customer’s sight.

Machine learning is also used by advertisers, in so-called “machine learning based advertising”. Using ML in this field has become fundamental, especially because of the changes introduced by Apple’s iOS updates. Privacy is a key feature of these updates, and it has made optimizing campaign ROI even harder for marketers, since precise targeting has become difficult. As advertising gets more complex, you need to rely on the analytical and real-time optimisation capacities that an algorithm can provide.

Here at Mapendo, Jenga, our proprietary AI technology, collects tens of thousands of data points related to a given topic, finds patterns, predicts the possible outcome of a marketing campaign and finds the audience that is most likely to convert for a type of ad.

Our algorithm has been trained to optimize traffic according to the client’s KPIs, maximize user retention and generate post-install actions. Advertisers need to leverage technology to find meaningful insights, predict outcomes and maximise the efficiency of their investment by choosing the right channels and budget.


A basic understanding of machine learning is important for two reasons. The first is that ML can really improve our lives, finding applications in our daily routines. The second is that, with digitalisation disrupting every industry, sharing and delivering data has become a high priority.

As we have explained, building a well-trained model is fundamental. To guarantee quality “coaching”, you must provide the machine learning algorithm with accurate data, in the right amounts. The way you teach the algorithm, and how it learns, depends on how much care is put into constructing your dataset, choosing labeled or unlabeled data, and paying close attention not to feed the algorithm biased data. Data biases will lead to unreliable results, and if you use those, you will give the wrong answer to your problem. Biased datasets can jeopardize business processes and decisions.

The last fundamental step is the one that leads to the final results of the machine learning process. Validation and, in greater detail, testing will determine the overall model performance, making sure that the model can really work when you use it to give an answer to a real-world problem.