OpenAI highlights the importance of a comprehensive training dataset in enabling artificial intelligence (AI) to deeply understand a wide range of industries, cultures, and languages. To that end, the company is calling on organizations and other interested parties to contribute large datasets that cover diverse aspects of human society and are not readily available online. The submitted data, whether text, image, audio, or video, will be used in open-source archives for public AI model training and in private datasets for training proprietary AI models.
OpenAI protects sensitive or personal information within the datasets and has tools for processing data in various formats, such as transcription and PDF digitization. The company collaborates with partners around the world to incorporate data from different countries and industries. Notable examples include working with the Icelandic Government and Miðeind ehf to improve GPT-4’s proficiency in Icelandic, and partnering with the non-profit Free Law Project to integrate its extensive collection of legal documents into AI training.