“Less is More”. Effective dataset curation requires samples that are diverse and cover a wide range of scenarios relevant to the downstream task. Models learn more efficiently from a smaller number of diversified samples than from large collections of correlated data. This approach directly improves training speed and reduces costs. Furthermore, diversity in training data helps models generalize better and perform more effectively during inference. The more representative your data is of the downstream task, the better your fine-tuned model will perform.

Dataset Curation

To build agents, each conversation represents a training sample. Conversations are turn based and should provide sufficient context and information to enable accurate responses. Correspondingly, responses by assistants should be unambiguous, concise, and directly address the question to maintain clarity and completeness. NOTE : Multimodal models use “ShareGPT” format. Multimodal dataset consists of two parts
  1. Conversations
  2. Images
Hence, multimodal dataset is a directory containing images directory and jsonl file. Every multimodal dataset should be compressed as zip before uploading it on the platform. Dataset File Structure:
<dataset-name>/
├── images/
│   ├── <image-name>.jpg
│   └── <image-name>.jpg
│   ├── .....
│   └── <image-name>.jpg
└── <dataset-name>.jsonl
Note: The dataset directory name and jsonl filename should be identical. Example: Sample data from Mathvision dataset (VQA).
[
  {
    "question": "Which number should be written in place of the question mark? <image>",
    "answer": "60"
  },
  {
    "question": "Which bike is most expensive? Options:[ 'A', 'B', 'C', 'D', 'E' ] <image>",
    "answer": "A"
  },
  {
    "question": "Which kite has the longest string? Options:[ 'A', 'B', 'C', 'D', 'E' ] <image>",
    "answer": "C"
  }
]
We currently only support conversations data in the jsonl format. Each sample should be formatted as follows. Impulse AI Format: jsonl
{
	"conversations": [
		{
			"from": "system", 
			"value": "<system-prompt>"
		},
		{
			"from": "human",
	    "value": "<question>"
	  },
	  {
		  "from": "gpt",
		  "value": "<answer>"
		}
	],
	"image": "images/<image-name>.jpg"
}
Conversations from Mathvision dataset in jsonl format.
{"conversations": [{"from": "system", "value": "You are a helpful assistant that answers questions based on the given image."}, {"from": "human", "value": "Which number should be written in place of the question mark?\n<image>"}, {"from": "gpt", "value": "60"}], "image": "images/1.jpg"}
{"conversations": [{"from": "system", "value": "You are a helpful assistant that answers questions based on the given image."}, {"from": "human", "value": "Which bike is most expensive?\n Options: A, B, C, D, E\n<image>"}, {"from": "gpt", "value": "A"}], "image": "images/2.jpg"}
{"conversations": [{"from": "system", "value": "You are a helpful assistant that answers questions based on the given image."}, {"from": "human", "value": "Which kite has the longest string?\n Options: A, B, C, D, E\n<image>"}, {"from": "gpt", "value": "C"}], "image": "images/3.jpg"}
Note:Support for different data formats will be added in the future. Mathvision dataset file structure
mathvision/
├── images/
│   ├── 1.jpg
│   ├── 2.jpg
│   └── 3.jpg
└── mathvision.jsonl
“mathvision.zip” should be uploaded on the platform.

Upload Dataset

The curated dataset in zip format can be uploaded to Impulse platform using Impulse SDK or via Impulse Web App Method 1: Upload via Impulse SDK
import os
import asyncio
from impulse.api_sdk.sdk import ImpulseSDK
from impulse.api_sdk.models import (
    DatasetCreate,
)


async def main():
    async with ImpulseSDK(os.environ.get("IMPSDK_API_KEY")) as client:
        files = await client.dataset.list_datasets()
        await client.dataset.upload_dataset("<file-path>/<filename>.zip", DatasetCreate(name="<dataset name>", description="<dataset description>"))

        print(files)

asyncio.run(main())
Method 2: Upload via Web App
  1. Login to Impulse Dashboard.
  2. Navigate to Datasets tab in the left panel.
  3. Click on upload dataset.
  4. Enter the name of the dataset.
  5. Upload the file.
After the dataset is uploaded, it will be visible in Datasets page.

Structured Format for Conversations

If you are fine-tuning a conversational model, your dataset should follow a specific format, typically consisting of a series of messages. Each message must include:
  • from: Identifies the sender (e.g., human or gpt or system)
  • value: The actual text or message content