Dataset Curation
To build agents, each conversation represents a training sample. Conversations are turn based and should provide sufficient context and information to enable accurate responses. Correspondingly, responses by assistants should be unambiguous, concise, and directly address the question to maintain clarity and completeness. NOTE : Multimodal models use “ShareGPT” format. Multimodal dataset consists of two parts- Conversations
- Images
jsonl
Note:
Support for different data formats will be added in the future.
Mathvision dataset file structure
Upload Dataset
The curated dataset in zip format can be uploaded to Impulse platform using Impulse SDK or via Impulse Web App Method 1: Upload viaImpulse SDK
Web App
- Login to Impulse Dashboard.
- Navigate to Datasets tab in the left panel.
- Click on upload dataset.
- Enter the name of the dataset.
- Upload the file.
Structured Format for Conversations
If you are fine-tuning a conversational model, your dataset should follow a specific format, typically consisting of a series of messages. Each message must include:- from: Identifies the sender (e.g., human or gpt or system)
- value: The actual text or message content