Data Collection
1. Data Loading and Preprocessing
Loading the Dataset: The dataset is loaded using the Hugging Face datasets library, which helps in efficiently managing and processing large datasets.
Tokenization: Text data (questions and answers) is tokenized using the LLaMa 3.1 tokenizer, converting the text into numerical tokens that the model can understand.
Chat Template Application: The apply_chat_template() function from the transformers library formats the input data according to the LLaMa 3.1 chat template, structuring the data into roles such as "system" (system prompt), "user" (the question), and "assistant" (the answer).
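A minimal sketch of this step is shown below. The dataset file, system prompt, and model checkpoint names are illustrative assumptions, not Zentris's exact configuration.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative file and checkpoint names; the actual dataset and model are not specified here.
dataset = load_dataset("json", data_files="qa_pairs.json", split="train")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

def format_and_tokenize(example):
    # Structure each Q&A pair into the LLaMa 3.1 chat roles.
    messages = [
        {"role": "system", "content": "You are a helpful Web3 market analyst."},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    # apply_chat_template renders the messages with the model's chat template,
    # and the rendered string is then tokenized into input IDs.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    tokens = tokenizer(text, truncation=True, max_length=2048)
    return {"input_ids": tokens["input_ids"], "attention_mask": tokens["attention_mask"]}

tokenized_dataset = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)
```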
2. Data Splitting
The dataset is divided into three subsets:
Training Set: The majority of the data is used to train the model, allowing it to learn from a wide range of examples.
Validation Set: A smaller subset is used to monitor the model's performance during training and to adjust hyperparameters.
Test Set: This subset is reserved for final evaluation after training to test the model's performance on unseen data.
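One way to produce these three subsets is the datasets library's train_test_split; the 80/10/10 ratios below are illustrative assumptions rather than Zentris's documented values.

```python
# Illustrative 80/10/10 split of the tokenized dataset.
split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_set = split["train"]         # 80% - used to fit the model
validation_set = holdout["train"]  # 10% - used to monitor training and tune hyperparameters
test_set = holdout["test"]         # 10% - held out for final evaluation
```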
3. Data Collation
Collation Function: A custom function handles batching during training, padding the sequences in each batch to a consistent length so they can be stacked and processed efficiently.
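A common way to implement such a collator is to pad every sequence to the length of the longest example in the batch; the sketch below assumes the tokenized fields produced earlier and a causal-LM loss that ignores label -100.

```python
import torch

def collate_fn(batch, pad_token_id):
    # Pad all sequences in the batch to the length of the longest one,
    # so the batch can be stacked into rectangular tensors.
    max_len = max(len(example["input_ids"]) for example in batch)
    input_ids, attention_mask, labels = [], [], []
    for example in batch:
        ids = example["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        labels.append(ids + [-100] * pad)  # -100 positions are ignored by the loss
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```

Such a function is typically passed to a PyTorch DataLoader through its collate_fn argument.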
4. API Interaction and Question Generation
The following snippet shows how chunks of text are processed to generate questions and answers via the OpenAI API.
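The original snippet is not reproduced on this page; a minimal sketch of the step, assuming the official openai Python client and an illustrative prompt, model name, and pre-chunked text_chunks list, might look like:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa(chunk: str, model: str = "gpt-4o-mini") -> str:
    # Ask the API to turn a chunk of source text into a question/answer pair.
    # The prompt wording and model name are illustrative, not Zentris's exact choices.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Generate one question and its answer from the given text."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content

# text_chunks is a hypothetical list of document excerpts prepared upstream.
qa_pairs = [generate_qa(chunk) for chunk in text_chunks]
```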
5. Social Media Data Collection
Real-Time Scraping: Zentris collects relevant discussions and social media data from platforms such as Twitter via the Twitter API. This data is crucial for sentiment analysis and understanding market trends.
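The documentation does not name a specific client library; as one illustration, recent tweets could be pulled with the tweepy client, where the bearer token and search query below are placeholders.

```python
import tweepy

# Placeholder credentials and query; Zentris's actual filters are not specified.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Search recent tweets mentioning an ecosystem keyword, excluding retweets.
response = client.search_recent_tweets(
    query="solana lang:en -is:retweet",
    tweet_fields=["created_at", "public_metrics"],
    max_results=100,
)

tweets = [
    {"text": t.text, "created_at": t.created_at, "metrics": t.public_metrics}
    for t in (response.data or [])
]
```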
6. On-Chain Data Collection and Cleaning
Blockchain Data: Zentris uses efficient scraping tools to collect high-quality on-chain data from Solana and other Web3 blockchains. This data is essential for accurate analysis of token performance and ecosystem activities.
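One way to pull raw on-chain data is Solana's public JSON-RPC API; the sketch below uses the requests library, and the endpoint, account address, and chosen RPC method are illustrative assumptions.

```python
import requests

RPC_URL = "https://api.mainnet-beta.solana.com"  # public endpoint; production systems often use a dedicated RPC provider

def get_recent_signatures(address: str, limit: int = 25) -> list:
    # Fetch the most recent transaction signatures touching an account,
    # a common starting point for analyzing token and ecosystem activity.
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "getSignaturesForAddress",
        "params": [address, {"limit": limit}],
    }
    response = requests.post(RPC_URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json().get("result", [])

# Placeholder token mint address for illustration only.
signatures = get_recent_signatures("TokenMintAddressPlaceholder1111111111111111")
```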
7. Regular Update Mechanism
Data Freshness: Zentris regularly updates its dataset to ensure that predictions and analyses are based on the latest market data, maintaining relevance and accuracy.
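The refresh cadence and scheduler are not specified; a minimal periodic-refresh sketch, with a hypothetical one-hour interval and placeholder collection hooks, might look like:

```python
import time

REFRESH_INTERVAL_SECONDS = 60 * 60  # illustrative one-hour cadence

def refresh_datasets():
    # Placeholder hook; in practice this would invoke the social media,
    # on-chain, and Q&A collection routines described above.
    print("Refreshing social, on-chain, and Q&A datasets...")

while True:
    refresh_datasets()
    time.sleep(REFRESH_INTERVAL_SECONDS)
```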
This structured approach ensures proper preparation, processing, and utilization of external data sources to train and fine-tune the model effectively, while also incorporating real-time, blockchain-specific, and social media data for enhanced market insights.