Data Collection
1. Data Loading and Preprocessing
Loading the Dataset: The dataset is loaded using the Hugging Face datasets library, which helps in efficiently managing and processing large datasets.
Tokenization: Text data (questions and answers) is tokenized using the LLaMa 3.1 tokenizer, converting the text into numerical tokens that the model can understand.
Chat Template Application: The apply_chat_template() function from the transformers library formats the input data according to the LLaMa 3.1 chat template, structuring the data into roles such as "system" (system prompt), "user" (the question), and "assistant" (the answer).
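A minimal sketch of this step is shown below. The dataset file, system prompt, and model checkpoint names are illustrative assumptions, not Zentris's exact configuration.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative file and checkpoint names; the actual dataset and model are not specified here.
dataset = load_dataset("json", data_files="qa_pairs.json", split="train")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

def format_and_tokenize(example):
    # Structure each Q&A pair into the LLaMa 3.1 chat roles.
    messages = [
        {"role": "system", "content": "You are a helpful Web3 market analyst."},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    # apply_chat_template renders the messages with the model's chat template,
    # and the rendered string is then tokenized into input IDs.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    tokens = tokenizer(text, truncation=True, max_length=2048)
    return {"input_ids": tokens["input_ids"], "attention_mask": tokens["attention_mask"]}

tokenized_dataset = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)
```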
2. Data Splitting
The dataset is divided into three subsets:
Training Set: The majority of the data is used to train the model, allowing it to learn from a wide range of examples.
Validation Set: A smaller subset is used to monitor the model's performance during training and to adjust hyperparameters.
Test Set: This subset is reserved for final evaluation after training to test the model's performance on unseen data.
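One way to produce these three subsets is the datasets library's train_test_split; the 80/10/10 ratios below are illustrative assumptions rather than Zentris's documented values.

```python
# Illustrative 80/10/10 split of the tokenized dataset.
split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_set = split["train"]         # 80% - used to fit the model
validation_set = holdout["train"]  # 10% - used to monitor training and tune hyperparameters
test_set = holdout["test"]         # 10% - held out for final evaluation
```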
3. Data Collation
Collation Function: A custom function handles batching during training, padding the sequences in each batch to a consistent length so they can be stacked and processed efficiently.
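A common way to implement such a collator is to pad every sequence to the length of the longest example in the batch; the sketch below assumes the tokenized fields produced earlier and a causal-LM loss that ignores label -100.

```python
import torch

def collate_fn(batch, pad_token_id):
    # Pad all sequences in the batch to the length of the longest one,
    # so the batch can be stacked into rectangular tensors.
    max_len = max(len(example["input_ids"]) for example in batch)
    input_ids, attention_mask, labels = [], [], []
    for example in batch:
        ids = example["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        labels.append(ids + [-100] * pad)  # -100 positions are ignored by the loss
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```

Such a function is typically passed to a PyTorch DataLoader through its collate_fn argument.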
4. API Interaction and Question Generation
The following snippet shows how chunks of text are processed to generate questions and answers via the OpenAI API.
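The original snippet is not reproduced on this page; a minimal sketch of the step, assuming the official openai Python client and an illustrative prompt, model name, and pre-chunked text_chunks list, might look like:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa(chunk: str, model: str = "gpt-4o-mini") -> str:
    # Ask the API to turn a chunk of source text into a question/answer pair.
    # The prompt wording and model name are illustrative, not Zentris's exact choices.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Generate one question and its answer from the given text."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content

# text_chunks is a hypothetical list of document excerpts prepared upstream.
qa_pairs = [generate_qa(chunk) for chunk in text_chunks]
```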
5. Social Media Data Collection
Real-Time Scraping: Zentris collects relevant discussions and social media data from platforms such as Twitter via the Twitter API. This data is crucial for sentiment analysis and understanding market trends.
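The documentation does not name a specific client library; as one illustration, recent tweets could be pulled with the tweepy client, where the bearer token and search query below are placeholders.

```python
import tweepy

# Placeholder credentials and query; Zentris's actual filters are not specified.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Search recent tweets mentioning an ecosystem keyword, excluding retweets.
response = client.search_recent_tweets(
    query="solana lang:en -is:retweet",
    tweet_fields=["created_at", "public_metrics"],
    max_results=100,
)

tweets = [
    {"text": t.text, "created_at": t.created_at, "metrics": t.public_metrics}
    for t in (response.data or [])
]
```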
6. On-Chain Data Collection and Cleaning
Blockchain Data: Zentris uses efficient scraping tools to collect high-quality on-chain data from Solana and other Web3 blockchains. This data is essential for accurate analysis of token performance and ecosystem activities.
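One way to pull raw on-chain data is Solana's public JSON-RPC API; the sketch below uses the requests library, and the endpoint, account address, and chosen RPC method are illustrative assumptions.

```python
import requests

RPC_URL = "https://api.mainnet-beta.solana.com"  # public endpoint; production systems often use a dedicated RPC provider

def get_recent_signatures(address: str, limit: int = 25) -> list:
    # Fetch the most recent transaction signatures touching an account,
    # a common starting point for analyzing token and ecosystem activity.
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "getSignaturesForAddress",
        "params": [address, {"limit": limit}],
    }
    response = requests.post(RPC_URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json().get("result", [])

# Placeholder token mint address for illustration only.
signatures = get_recent_signatures("TokenMintAddressPlaceholder1111111111111111")
```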
7. Regular Update Mechanism
Data Freshness: Zentris regularly updates its dataset to ensure that predictions and analyses are based on the latest market data, maintaining relevance and accuracy.
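The refresh cadence and scheduler are not specified; a minimal periodic-refresh sketch, with a hypothetical one-hour interval and placeholder collection hooks, might look like:

```python
import time

REFRESH_INTERVAL_SECONDS = 60 * 60  # illustrative one-hour cadence

def refresh_datasets():
    # Placeholder hook; in practice this would invoke the social media,
    # on-chain, and Q&A collection routines described above.
    print("Refreshing social, on-chain, and Q&A datasets...")

while True:
    refresh_datasets()
    time.sleep(REFRESH_INTERVAL_SECONDS)
```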
This structured approach ensures proper preparation, processing, and utilization of external data sources to train and fine-tune the model effectively, while also incorporating real-time, blockchain-specific, and social media data for enhanced market insights.