Refined
Enhanced Dataset
Cerebromesh Labs provides high-quality, curated datasets to enhance your AI.

Built on Bittensor
Understanding Good Data
"Good data" is pivotal for training large-scale models, yet its quality can be ambiguous and context-specific, and human evaluation may miss subtle flaws. An effective approach is to train models on trusted datasets and score new data with metrics like perplexity, which gives a quantitative signal of data quality before it is used for training.
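To make the perplexity idea concrete, here is a minimal sketch using a smoothed unigram model trained on trusted text (real pipelines use full neural language models; the function names and smoothing choice here are illustrative only):

```python
import math
from collections import Counter

def train_unigram(trusted_texts):
    """Build a Laplace-smoothed unigram model from a trusted corpus."""
    counts = Counter()
    for text in trusted_texts:
        counts.update(text.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    # Smoothing gives unseen words a small nonzero probability.
    return lambda w: (counts[w] + 1) / (total + vocab + 1)

def perplexity(prob, text):
    """Perplexity of `text` under the model; lower means closer to trusted data."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    log_sum = sum(math.log(prob(t)) for t in tokens)
    return math.exp(-log_sum / len(tokens))
```

Candidate documents with unusually high perplexity relative to the trusted distribution can then be flagged or dropped.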
URL Collection
Gather a wide range of URLs from various sources.
Filtering
Apply filters to eliminate irrelevant or low-quality URLs.
Text Extraction
Extract textual content from the remaining URLs.
PII Removal
Identify and remove any personally identifiable information (PII).
The FineWeb Recipe
The data preparation algorithm involves several key steps to ensure the creation of a high-quality dataset.
Deduplication
Use techniques like MinHash to remove duplicate entries.
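MinHash approximates the Jaccard similarity between documents without comparing their full shingle sets pairwise. A self-contained sketch follows (the hash scheme, shingle size, and threshold are illustrative; production pipelines use optimized implementations with LSH bucketing to avoid the quadratic comparison loop shown here):

```python
import hashlib

def _shingles(text, k=3):
    """Character k-gram shingle set for a normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(text, num_perm=64):
    """For each simulated permutation, record the smallest hash over all shingles."""
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in _shingles(text)
        )
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(docs, threshold=0.8):
    """Keep only documents not near-duplicating any already-kept document."""
    kept, sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in sigs):
            kept.append(doc)
            sigs.append(sig)
    return kept
```

Near-duplicates (e.g., the same article with trivial punctuation changes) collapse to one copy, while genuinely distinct documents survive.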
Content Quality
Apply quality heuristics (e.g., document length and repetition checks) to retain only well-formed text.
Standard Filters
Implement commonly used filters for further refinement.
Language Filtering
Ensure that the extracted text is in the desired language(s).
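Production pipelines typically use a trained language classifier (e.g., fastText) with a confidence threshold. The idea can be illustrated with a much cruder stopword-ratio heuristic, shown here purely as a sketch (the word list and threshold are assumptions, not the pipeline's actual classifier):

```python
# Common English function words; a real pipeline would use a trained
# language classifier rather than this heuristic.
EN_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
                "that", "for", "on", "with", "as", "was", "are", "be"}

def english_score(text):
    """Fraction of tokens that are common English function words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in EN_STOPWORDS for t in tokens) / len(tokens)

def filter_language(docs, threshold=0.15):
    """Keep documents whose English score clears the threshold."""
    return [d for d in docs if english_score(d) >= threshold]
```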
Text Extraction Algorithm
Text extraction from web crawl data can start from raw HTML or from pre-extracted text-only versions. Open-source extraction libraries improve quality by removing boilerplate, producing a smaller but better dataset. While resource-intensive, extracting from raw HTML is ideal for high-quality results, though budget constraints may necessitate using the lower-quality text-only versions.
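Dedicated extractors such as trafilatura handle this robustly; the core idea of skipping boilerplate regions can be sketched with the standard library alone (the tag skip-list here is a simplification, not what production extractors actually use):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping common boilerplate containers."""
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.depth = 0  # nesting depth inside skipped tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```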
[Chart] Aggregate Score: attempting to further globally deduplicate worsened performance.
Base Filtering Algorithm:
Filtering refines datasets by removing harmful content and improving quality. Key steps include:
- URL Filtering: Block unwanted content (e.g., adult material).
- Language Classification: Retain text in the desired language with quality thresholds.
- Quality and Repetition Filters: Eliminate low-quality and repetitive content.
This process ensures a higher-quality dataset for model training.
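The filtering steps above can be chained into a single pass. The sketch below covers URL and repetition filtering (the blocklist, thresholds, and function names are illustrative assumptions; the language-classification step is elided for brevity):

```python
from urllib.parse import urlparse

# Illustrative blocklist; real pipelines use large curated URL blocklists.
BLOCKED_DOMAINS = {"adult-site.example", "spam.example"}

def url_ok(url):
    """URL filtering: drop documents hosted on blocked domains."""
    return urlparse(url).netloc not in BLOCKED_DOMAINS

def repetition_ok(text, max_dup_ratio=0.3):
    """Repetition filter: reject text where too many lines are duplicates."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    dup_ratio = 1 - len(set(lines)) / len(lines)
    return dup_ratio <= max_dup_ratio

def base_filter(docs):
    """docs: iterable of (url, text); yield documents passing all filters."""
    for url, text in docs:
        if url_ok(url) and repetition_ok(text) and len(text.split()) >= 5:
            yield url, text
```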
Additional Quality Filtering Algorithm:
To enhance performance beyond initial filtering, the approach included:
- Benchmark Analysis: Study characteristics and benchmarks of datasets like C4.
- New Filtering Steps: Investigate additional filters to improve quality.
- Refinement: Incorporate effective methods from benchmark datasets.
- Iterative Optimization: Continuously refine the filtering process to exceed benchmark performance.
This iterative approach aims to surpass existing dataset quality and performance.
Our Aim:
In this work, our approach involved training small models and evaluating them on a set of "early-signal" benchmark tasks. This served as a reasonable proxy for the quality of the data used to train those models. One caveat, however: small models can overfit to the evaluation benchmarks, so results should be read with care.
We used this algorithm specifically for benchmarking purposes, emphasizing its role in assessing the performance and quality of the trained models.
We provide
Training Datasets
Cerebromesh TTS Subnet Leaderboard
Cerebromesh TTS Subnet is a groundbreaking project that leverages the power of decentralized collaboration to advance the state of the art in open-source Text-to-Speech (TTS) technology. By harnessing the Bittensor blockchain and a unique incentive mechanism, we aim to create the most advanced and accessible TTS models. Drawing on Cerebromesh's user base of over one million individuals, we are committed to bringing cutting-edge technology to every end user. The Cerebromesh leaderboard shows miners' daily performance in building the dataset.