Another factor that could ease the dataset size problem in the next few years is an anticipated increase in unstructured data. Highly unstructured data, such as video collected by drones observing businesses and their customers, could sidestep language issues entirely: the footage itself is language-agnostic, and the resulting analysis could be captured once and rendered in whatever languages are needed.
Until the volume of high-quality training data for non-English languages grows substantially, something that may happen gradually as more unstructured, private, and language-agnostic data becomes available over the next few years, CIOs need to demand better answers from model vendors about the training data behind every non-English model.
Let’s say a global CIO is buying 118 models from an LLM vendor, covering a wide range of languages, and pays perhaps $2 billion for the package. The vendor doesn’t tell the CIO how little training went into those non-English models, let alone where that training data came from. If vendors were fully transparent on both points, CIOs would push back on pricing for everything other than the English model.
In response, the model makers would likely not charge CIOs less for the non-English models but instead ramp up their efforts to find more training data to improve the accuracy of those models.
Given the massive amount of money enterprises are spending on genAI, the carrot is obvious. The stick? Maybe CIOs need to get out of their comfort zone and start buying their non-English models from regional vendors in every language they need.
If that starts to happen on a large scale, the major model makers may suddenly see the value of data-training transparency.