For the last year I’ve been participating in Deep-ESP, an NLP community focused on developing tools for the Spanish language. This week, after several months of work by many members of the community, we released the best public models to date for Spanish text generation.
To train them we used 3 GB of text from Wikipedia and 6 GB of books (literature, essays, and popular science). Although the generations are not always perfect, these models work better for Spanish than the ones trained by OpenAI, since the tokenizer of the original models was trained on English text and creates a bottleneck for Spanish.
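To see the tokenizer bottleneck in action, here is a small illustration (not from the original post, assuming the Hugging Face `transformers` library and OpenAI's published `gpt2` vocabulary): the English byte-level BPE tokenizer fragments common Spanish words into many subword pieces, so each sentence costs far more tokens than its English equivalent.

```python
# Illustration (assumption: Hugging Face `transformers` is installed and
# the English "gpt2" tokenizer is used as the reference).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # OpenAI's English GPT-2 vocabulary

sentence = "Los modelos aprendieron las reglas gramaticales del español"
tokens = tokenizer.tokenize(sentence)

# An English tokenizer splits ordinary Spanish words into several pieces,
# so the model has fewer effective words per context window.
print(f"{len(sentence.split())} words -> {len(tokens)} tokens")
```

A tokenizer trained on Spanish text, like the one used for these models, keeps most common words as single tokens, which removes this bottleneck.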
The models learned the grammatical rules of Spanish remarkably well, and some generations are of such quality that they seem indistinguishable from human-written text. However, the generations sometimes lack semantic coherence.
The work was led by Jose Ortiz Fuentes and Alejandro Oñate Latorre; more details here.
For fine-tuning the model I developed a Colab notebook; there you can see a test that generates horoscope texts.
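Generating text with one of the released models is a one-liner with the `transformers` pipeline API. A minimal sketch, assuming the model was published on the Hugging Face Hub under the id `DeepESP/gpt2-spanish` (the post does not state the id, so adjust it if needed):

```python
# Minimal generation sketch (assumption: the released model is available
# on the Hugging Face Hub as "DeepESP/gpt2-spanish").
from transformers import pipeline

generator = pipeline("text-generation", model="DeepESP/gpt2-spanish")

# A horoscope-style prompt, like the test in the Colab notebook.
prompt = "Los nacidos bajo el signo de Aries"
out = generator(prompt, max_new_tokens=30, do_sample=True, top_k=50)
print(out[0]["generated_text"])
```

With `do_sample=True` each run produces a different continuation; lowering `top_k` or adding a `temperature` below 1.0 makes the output more conservative.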