Refined

Enhanced Dataset

Cerebromesh Labs provides high-quality, curated datasets to enhance your AI.

Built on Bittensor

Feature

Understanding Good Data

"Good data" is pivotal for training large-scale models, yet what counts as good is often ambiguous and context-specific, and human evaluation may miss subtle flaws. An effective approach is to train a model on a trusted dataset and then score new data with a metric such as perplexity: text the model finds unsurprising is more likely to resemble the trusted data. This gives a repeatable check on data reliability before the data is used for training.
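
As a rough illustration of perplexity-based scoring, the sketch below swaps in a tiny add-one-smoothed unigram model for the trained language model a real pipeline would use. The function names and the toy corpus are assumptions made for this example, not part of any Cerebromesh tooling.

```python
import math
from collections import Counter

def train_unigram(trusted_texts):
    """Count word frequencies in the trusted corpus."""
    counts = Counter()
    for text in trusted_texts:
        counts.update(text.lower().split())
    return counts

def perplexity(text, counts):
    """Perplexity of `text` under an add-one-smoothed unigram model."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 is a single bucket for unseen words
    log_prob = 0.0
    for tok in tokens:
        log_prob += math.log((counts[tok] + 1) / (total + vocab))
    return math.exp(-log_prob / len(tokens))

# Toy trusted corpus (hypothetical data for illustration only).
trusted = ["the cat sat on the mat", "the dog sat on the rug"]
model = train_unigram(trusted)

familiar = perplexity("the cat sat on the rug", model)
unfamiliar = perplexity("zxqv blorp click here now", model)
print(familiar < unfamiliar)  # lower perplexity = closer to the trusted data
```

A unigram model ignores word order, so it only captures vocabulary overlap; the same scoring loop applies unchanged if the probabilities come from a stronger model.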

URL Collection

Gather a wide range of URLs from various sources.

Filtering

Apply filters to eliminate irrelevant or low-quality URLs.

Text Extraction

Extract textual content from the remaining URLs.

PII Removal

Identify and remove any personally identifiable information (PII).
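
One simple way to approach this step is pattern-based scrubbing. The sketch below masks email addresses and IPv4 addresses with placeholder tokens; the regular expressions, the `scrub_pii` name, and the `<EMAIL>` / `<IP>` placeholders are assumptions for illustration, and a production pipeline would likely cover more categories of personally identifiable information.

```python
import re

# Illustrative patterns only: they catch common forms, not every valid one.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b")

def scrub_pii(text):
    """Replace email and IPv4 addresses with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP>", text)
    return text

print(scrub_pii("Contact jane.doe@example.com from 192.168.0.1 today"))
```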

The FineWeb Recipe

The data preparation algorithm involves several key steps to ensure the creation of a high-quality dataset.

Deduplication

Use techniques like MinHash to remove duplicate entries.
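
To make the MinHash idea concrete, here is a minimal standard-library sketch. It builds word shingles, keeps the minimum of several seeded hashes as a signature, and treats the fraction of matching signature slots as an estimate of Jaccard similarity. The shingle size, number of hashes, and 0.8 threshold are illustrative assumptions, and the pairwise comparison is a simplification: large-scale pipelines typically bucket signatures with locality-sensitive hashing instead.

```python
import hashlib

def shingles(text, n=3):
    """Set of overlapping n-word shingles from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in sh
        ))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots, which approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_similarity(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank",
    "completely different text about training language models on web data",
]
unique_docs = deduplicate(docs)
print(len(unique_docs))
```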

Content Quality

Use advanced algorithms to maintain the quality of text.

Standard Filters

Implement commonly used filters for further refinement.

Language Filtering

Ensure that the extracted text is in the desired language(s).

Algorithm

Text Extraction Algorithm

Text extraction from web crawl data can start from the raw HTML or from pre-extracted text-only versions. Running open-source extraction libraries on the raw HTML removes boilerplate and yields a smaller, higher-quality dataset. This route is resource-intensive but ideal for high-quality results; budget constraints may necessitate falling back to the lower-quality text-only versions.
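
As a simplified picture of boilerplate removal, the sketch below uses Python's built-in `html.parser` to drop text inside tags that usually hold navigation, scripts, or styling. It is not the open-source library referred to above; the `SKIP_TAGS` set and class name are assumptions for illustration, and dedicated extraction libraries apply far richer heuristics.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an illustrative choice).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside any skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Return visible text with boilerplate sections removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav>Home | About</nav><p>Good data matters.</p>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_text(page))
```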

[Chart: Aggregate Score. Attempting further global deduplication worsened performance.]

Base Filtering Algorithm:

Filtering refines datasets by removing harmful content and improving quality. Key steps include:

  • URL Filtering: Block unwanted content (e.g., adult material).
  • Language Classification: Retain text in the desired language with quality thresholds.
  • Quality and Repetition Filters: Eliminate low-quality and repetitive content.

This process ensures a higher-quality dataset for model training.
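
The URL and repetition steps above can be sketched as follows. The blocklist entries, thresholds, and function names are hypothetical values chosen for the example; language classification is left out because it normally relies on a trained classifier rather than a few lines of code.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; a real one would be far larger and maintained separately.
BLOCKED_DOMAINS = {"adult.example", "spam.example"}

def url_allowed(url):
    """Reject URLs whose host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

def duplicate_line_fraction(text):
    """Share of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 1.0
    return 1.0 - len(set(lines)) / len(lines)

def passes_base_filters(url, text, min_words=5, max_dup_fraction=0.3):
    """Apply the URL filter, a minimum-length check, and a repetition check."""
    if not url_allowed(url):
        return False
    if len(text.split()) < min_words:
        return False
    return duplicate_line_fraction(text) <= max_dup_fraction

clean = "Clean data helps models learn.\nFilters remove the noisy pages."
spammy = "Buy now\nBuy now\nBuy now\nBuy now"
print(passes_base_filters("https://news.example.org/a", clean))
print(passes_base_filters("https://news.example.org/a", spammy))
```

Ordering the checks from cheapest to most expensive keeps the filter fast, since most rejected pages never reach the text-level checks.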

Additional Quality Filtering Algorithm:

To enhance performance beyond initial filtering, the approach included:

  1. Benchmark Analysis: Study characteristics and benchmarks of datasets like C4.
  2. New Filtering Steps: Investigate additional filters to improve quality.
  3. Refinement: Incorporate effective methods from benchmark datasets.
  4. Iterative Optimization: Continuously refine the filtering process to exceed benchmark performance.

This iterative approach aims to surpass existing dataset quality and performance.

Our Aim:

Our approach was to train small models and evaluate them on a set of "early-signal" benchmark tasks, which serves as a reasonable proxy for the quality of the data used to train them. One caveat to acknowledge is the risk of overfitting to the evaluation benchmarks.

We used this procedure specifically for benchmarking: its role is to assess the performance and quality of the trained models.

PRODUCTS

We provide

Training Datasets

BUILD
AI Training Datasets

Explore our comprehensive collection of datasets tailored for machine learning.

BUILD
Customizable Solutions

Tailored datasets to meet specific project needs and model requirements.

BUILD
Quality Assurance

Rigorous validation processes ensure high-quality data for accurate models.

LEADERBOARD

Cerebromesh TTS Subnet Leaderboard

Cerebromesh TTS Subnet is a groundbreaking project that uses decentralized collaboration to advance the state of the art in open-source Text-to-Speech (TTS) technology. By harnessing the Bittensor blockchain and a unique incentive mechanism, we aim to create the most advanced and accessible TTS models. With Cerebromesh's user base of over one million individuals, we are devoted to bringing cutting-edge technology to every end user. The Cerebromesh leaderboard shows miners' daily performance in building the dataset.