“Less is More”. Effective dataset curation requires samples that are diverse and cover a wide range of scenarios relevant to the downstream task. Models learn more efficiently from a smaller number of diversified samples than from large collections of correlated data. This approach directly improves training speed and reduces costs. Furthermore, diversity in training data helps models generalize better and perform more effectively during inference. The more representative your data is of the downstream task, the better your fine-tuned model will perform.
To build agents, each conversation represents a training sample. Conversations are turn based and should provide sufficient context and information to enable accurate responses. Correspondingly, responses by assistants should be unambiguous, concise, and directly address the question to maintain clarity and completeness.NOTE : Multimodal models use “ShareGPT” format.Multimodal dataset consists of two parts
Conversations
Images
Hence, multimodal dataset is a directory containing images directory and jsonl file. Every multimodal dataset should be compressed as zip before uploading it on the platform.Dataset File Structure:
Note: The dataset directory name and jsonl filename should be identical.Example:Sample data from Mathvision dataset (VQA).
Copy
[ { "question": "Which number should be written in place of the question mark? <image>", "answer": "60" }, { "question": "Which bike is most expensive? Options:[ 'A', 'B', 'C', 'D', 'E' ] <image>", "answer": "A" }, { "question": "Which kite has the longest string? Options:[ 'A', 'B', 'C', 'D', 'E' ] <image>", "answer": "C" }]
We currently only support conversations data in the jsonl format. Each sample should be formatted as follows.Impulse AI Format:jsonl
Conversations from Mathvision dataset in jsonl format.
Copy
{"conversations": [{"from": "system", "value": "You are a helpful assistant that answers questions based on the given image."}, {"from": "human", "value": "Which number should be written in place of the question mark?\n<image>"}, {"from": "gpt", "value": "60"}], "image": "images/1.jpg"}{"conversations": [{"from": "system", "value": "You are a helpful assistant that answers questions based on the given image."}, {"from": "human", "value": "Which bike is most expensive?\n Options: A, B, C, D, E\n<image>"}, {"from": "gpt", "value": "A"}], "image": "images/2.jpg"}{"conversations": [{"from": "system", "value": "You are a helpful assistant that answers questions based on the given image."}, {"from": "human", "value": "Which kite has the longest string?\n Options: A, B, C, D, E\n<image>"}, {"from": "gpt", "value": "C"}], "image": "images/3.jpg"}
Note:Support for different data formats will be added in the future.Mathvision dataset file structure
If you are fine-tuning a conversational model, your dataset should follow a specific format, typically consisting of a series of messages. Each message must include:
from: Identifies the sender (e.g., human or gpt or system)