“Less is More”

Effective dataset curation requires samples that are diverse and cover a wide range of scenarios relevant to the downstream task. Models learn more efficiently from a small number of diverse samples than from a large collection of correlated data, and a smaller dataset also shortens training time and reduces cost. Diversity in the training data also helps the model generalize, so it performs better at inference. The more representative your data is of the downstream task, the better your fine-tuned model will perform.
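
One simple way to reduce correlated samples is to drop questions that are near-duplicates of ones already kept. The sketch below is only an illustration, not part of the Impulse tooling: the helper names, the word-level Jaccard similarity measure, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re


def tokenize(text: str) -> frozenset[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity of two token sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a question only if it is not too similar to one already kept.

    The threshold is an assumed value for illustration; tune it on your data.
    """
    kept: list[str] = []
    kept_tokens: list[frozenset[str]] = []
    for question in questions:
        tokens = tokenize(question)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(question)
            kept_tokens.append(tokens)
    return kept


questions = [
    "What is the capital of France?",
    "what is the capital of France",   # same words, differs only in case and punctuation
    "Who wrote the novel '1984'?",
]
print(drop_near_duplicates(questions))
```

This pairwise comparison is quadratic in the number of samples, which is fine for small curated sets; larger datasets would likely need an approximate method instead.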

Dataset Curation

When building agents, each question-answer pair serves as one training sample. Questions should provide enough context and information to enable an accurate response. Answers, in turn, should be unambiguous, concise, and directly address the question.
Example: Sample data from the TriviaQA dataset.
[
  {
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris."
  },
  {
    "question": "Who wrote the novel '1984'?",
    "answer": "The novel '1984' was written by George Orwell."
  },
  {
    "question": "What is the speed of light?",
    "answer": "The speed of light is approximately 299,792 kilometers per second."
  }
]
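Only the JSONL format is supported at the moment. Each sample should follow the structure below, which is expanded across multiple lines here for readability; in the file itself, each sample occupies a single line.
Impulse AI Format: jsonl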
{
  "messages": [
    {
      "role": "system",
      "content": "<system-prompt>"
    },
    {
      "role": "user",
      "content": "<question>"
    },
    {
      "role": "assistant",
      "content": "<answer>"
    }
  ]
}
The TriviaQA samples above in JSONL format:
{"messages": [{"role": "system", "content": "You are an assistant that accurately answers general knowledge questions."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}]}
{"messages": [{"role": "system", "content": "You are an assistant that accurately answers general knowledge questions."}, {"role": "user", "content": "Who wrote the novel '1984'?"}, {"role": "assistant", "content": "The novel '1984' was written by George Orwell."}]}
{"messages": [{"role": "system", "content": "You are an assistant that accurately answers general knowledge questions."}, {"role": "user", "content": "What is the speed of light?"}, {"role": "assistant", "content": "The speed of light is approximately 299,792 kilometers per second."}]}
Note: Support for additional data formats will be added in the future.

Note: This formatting tool can help convert JSON datasets to JSONL quickly: https://impulseai-json-to-jsonl.vercel.app/
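
The conversion from question-answer pairs to the messages format can also be scripted. The following is a minimal sketch that uses only Python's standard `json` module; the function names are hypothetical, and the system prompt is copied from the example above.

```python
import json

# System prompt taken from the example above; replace it with your own.
SYSTEM_PROMPT = "You are an assistant that accurately answers general knowledge questions."


def to_chat_sample(question: str, answer: str, system_prompt: str = SYSTEM_PROMPT) -> dict:
    """Wrap one question-answer pair in the messages structure."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


def qa_pairs_to_jsonl(pairs: list[dict]) -> str:
    """Serialize each pair as one JSON object per line (the JSONL layout)."""
    lines = [
        json.dumps(to_chat_sample(pair["question"], pair["answer"]), ensure_ascii=False)
        for pair in pairs
    ]
    return "\n".join(lines) + "\n"


pairs = [
    {"question": "What is the capital of France?", "answer": "The capital of France is Paris."},
    {"question": "Who wrote the novel '1984'?", "answer": "The novel '1984' was written by George Orwell."},
]
jsonl_text = qa_pairs_to_jsonl(pairs)
print(jsonl_text, end="")
```

The resulting string can then be written to a `.jsonl` file with the usual file APIs before uploading.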

Upload Dataset

The curated dataset in JSONL format can be uploaded to the Impulse platform using either the Impulse SDK or the Impulse Web App.
Method 1: Upload via Impulse SDK
import asyncio
import os

from impulse.api_sdk.models import DatasetCreate
from impulse.api_sdk.sdk import ImpulseSDK


async def main():
    # The API key is read from the IMPSDK_API_KEY environment variable.
    async with ImpulseSDK(os.environ.get("IMPSDK_API_KEY")) as client:
        # Upload the JSONL file together with its name and description.
        await client.dataset.upload_dataset(
            "<file-path>/<filename>.jsonl",
            DatasetCreate(name="<dataset name>", description="<dataset description>"),
        )

        # List datasets after the upload to confirm the new dataset appears.
        files = await client.dataset.list_datasets()
        print(files)


asyncio.run(main())
Method 2: Upload via Web App
  1. Log in to the Impulse Dashboard.
  2. Navigate to the Datasets tab in the left panel.
  3. Click on upload dataset.
  4. Enter the name of the dataset.
  5. Upload the file.
After the dataset is uploaded, it will be visible on the Datasets page.

Structured Format for Conversations

If you are fine-tuning a conversational model, your dataset should follow a specific format, typically consisting of a series of messages. Each message must include:
  • Role: Identifies the sender (e.g., user or assistant)
  • Content: The actual text or message content
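
Before uploading, it can help to check that every line of the file follows this structure. The sketch below is a hypothetical local check, not the platform's own validation: the function name is made up for illustration, and the set of allowed roles is assumed from the examples in this guide.

```python
import json

# Assumed from the examples in this guide; the platform may accept other roles.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_jsonl(text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        messages = sample.get("messages") if isinstance(sample, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: missing or empty 'messages' list")
            continue
        for index, message in enumerate(messages):
            if not isinstance(message, dict):
                problems.append(f"line {number}, message {index}: not an object")
                continue
            role = message.get("role")
            if role not in ALLOWED_ROLES:
                problems.append(f"line {number}, message {index}: unexpected role {role!r}")
            content = message.get("content")
            if not isinstance(content, str) or not content.strip():
                problems.append(f"line {number}, message {index}: 'content' must be a non-empty string")
    return problems


good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
bad = '{"messages": [{"role": "bot", "content": ""}]}'
print(validate_jsonl(good))
print(validate_jsonl(bad))
```

The first call reports no problems, while the second flags both the unknown role and the empty content.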