Dataset Overview
Data Sources for Zentris
Zentris covers a wide range of Web3 ecosystem data, primarily including:
Solana blockchain data: Transaction history, smart contract calls, liquidity analysis, etc.
Social media data: Sentiment analysis of platforms such as Twitter and Reddit, capturing market sentiment fluctuations.
Decentralized Finance (DeFi) data: Extracting key metrics such as market performance and trading volume from DeFi protocols, providing in-depth insights into project ecosystems.
Data Extraction and Processing
Data Extraction:
Data was meticulously extracted from designated sources through careful manual curation supported by controlled, source-specific collection tooling. The dataset was compiled with a strong emphasis on data integrity and accuracy; indiscriminate automated scraping was deliberately avoided, as it can introduce biases and inaccuracies.
Data Cleaning:
Removal of HTML/Markdown: HTML tags, Markdown formatting, and other irrelevant elements were removed to ensure clean and consistent text.
Deduplication: Duplicate entries were identified and removed to prevent redundancy and ensure data quality.
Error Correction: Minor spelling and grammatical errors were corrected to improve readability and accuracy.
Standardization: Terminology was standardized across different sources to maintain consistency and improve data coherence.
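The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the regular expressions, the hashing-based deduplication, and the function names are assumptions about how such a pass might be implemented.

```python
import hashlib
import re

def clean_record(text: str) -> str:
    """Strip HTML tags and common Markdown markers, then normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"[*_`#>\[\]]+", " ", text)  # remove Markdown formatting
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

def deduplicate(records: list[str]) -> list[str]:
    """Drop exact duplicates while preserving first-seen order."""
    seen: set[str] = set()
    unique: list[str] = []
    for record in records:
        digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

raw = ["<p>Solana **TPS** update</p>", "<p>Solana **TPS** update</p>"]
print(deduplicate([clean_record(r) for r in raw]))  # ['Solana TPS update']
```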
Text Chunking:
The extracted text was divided into smaller, manageable chunks of 2,000 characters with an overlap of 200 characters. The overlap preserves context across chunk boundaries, ensuring that each chunk contains sufficient information for generating meaningful questions and answers.
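As a concrete illustration, the sliding-window chunking described above can be expressed as follows; the function name is an assumption, but the 2,000-character size and 200-character overlap match the figures stated here.

```python
CHUNK_SIZE = 2_000  # characters per chunk
OVERLAP = 200       # characters shared between consecutive chunks

def chunk_text(text: str) -> list[str]:
    """Split text into fixed-size chunks whose edges overlap by OVERLAP characters."""
    stride = CHUNK_SIZE - OVERLAP  # each window advances 1,800 characters
    return [text[i:i + CHUNK_SIZE]
            for i in range(0, max(len(text) - OVERLAP, 1), stride)]
```

For a 3,000-character document this yields two chunks, characters 0-1,999 and 1,800-2,999, so text that straddles a boundary appears intact in at least one chunk.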
Question-Answer Pair Generation:
For each chunk, three high-quality question-answer pairs were generated using a powerful language model (e.g., GPT-4). The model was instructed to:
Generate questions that are relevant to the provided text chunk.
Ensure that the questions are answerable based solely on the information within the chunk.
Generate concise and informative answers that accurately reflect the content of the chunk.
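A minimal sketch of this generation step, using the official openai Python client, is shown below. The prompt wording, the JSON-array response format, and the helper name are illustrative assumptions; only the three-pairs-per-chunk requirement and the GPT-4 model choice come from the description above.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "From the text below, write exactly 3 question-answer pairs. "
    "Each question must be answerable using only this text, and each "
    "answer must be concise and must accurately reflect the text. "
    'Respond with a JSON array of objects with "question" and "answer" keys.'
    "\n\nTEXT:\n{chunk}"
)

def generate_qa_pairs(chunk: str) -> list[dict]:
    """Ask the model for three QA pairs grounded solely in one chunk."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
    )
    pairs = json.loads(response.choices[0].message.content)
    return [{**pair, "chunk": chunk} for pair in pairs]
```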
Dataset Structure
Zentris’ dataset is structured as a JSONL file, where each line represents a single question-answer pair. Each line contains the following fields:
question: The question generated from the given text chunk.
answer: The corresponding answer to the generated question.
chunk: The original text chunk from which the question-answer pair was derived.
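For illustration, a single line of the JSONL file would therefore look like the following (the record shown is invented for this example, not taken from the dataset):

```json
{"question": "Which blockchain's on-chain data does Zentris cover?", "answer": "Zentris covers Solana blockchain data, including transaction history, smart contract calls, and liquidity analysis.", "chunk": "Zentris covers a wide range of Web3 ecosystem data, primarily including Solana blockchain data: transaction history, smart contract calls, liquidity analysis..."}
```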
This approach ensures that the data is extracted, cleaned, chunked, and converted into question-answer pairs efficiently, while maintaining a strong focus on accuracy and reliability.