Newsletter

Universal Dataset to Test, Enhance and Benchmark AI Algorithms

This scientific research has three components. First, my most recent advances towards solving one of the most famous, centuries-old conjectures in number theory: one that elementary school kids can understand, yet that has proved incredibly hard to settle. At its core, it is about the spectacular quantum dynamics of the digit sum function.

Then, I present an infinite dataset containing all the patterns you or AI can imagine, and many more, ranging from obvious to undetectable. More specifically, it is an infinite number of infinite datasets, all in tabular format, with various degrees of auto- and cross-correlation (short and long range) to test, enhance and benchmark AI algorithms, including LLMs. It is based on the physics of the digit sum function and linked to the aforementioned conjecture. This one-of-a-kind synthetic data is useful in contexts such as fraud detection or cybersecurity.

Finally, it comes with very efficient Python code to generate the data, involving gigantic numbers and high precision arithmetic.

Summary

The universal dataset is an invaluable system to test, enhance or benchmark pattern detection algorithms for fraud detection, cybersecurity, and other applications. The methodology relies on string auto-convolutions to uncover deep insights about the digit sum function, offering a new perspective on a famous centuries-old conjecture: are the binary digits of e evenly distributed? In this paper, I discuss the results obtained so far, both empirical and formally proved, including several new ones. I also discuss the dataset, its relevance to modern AI as a fundamental testing system, the incredibly rich and diversified set of patterns that it boasts, as well as connections to large language models (LLMs), quantum dynamics, synthetic data, and cryptography. I also provide very efficient, fast Python code to produce the data, dealing with numbers as large as (2^n + 1)^(2^n), with n larger than 10^6.
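As a quick illustration of why such gigantic numbers appear (a minimal sketch of my own, not the paper's code): since (1 + 2^-n)^(2^n) converges to e, the leading binary digits of (2^n + 1)^(2^n) reproduce the binary digits of e, and Python's arbitrary-precision integers compute them exactly.

```python
# Sketch (my own illustration, not the paper's code). Since
# (1 + 2^-n)^(2^n) -> e, the leading binary digits of (2^n + 1)^(2^n)
# reproduce the binary digits of e (e = 10.1011011111... in base 2),
# to roughly n correct bits. Python big integers handle this exactly.

def leading_bits(n: int, k: int) -> str:
    """First k binary digits of (2^n + 1)^(2^n)."""
    x = (2**n + 1) ** (2**n)
    return bin(x)[2 : 2 + k]

print(leading_bits(10, 11))  # -> 10101101111
print(leading_bits(12, 11))  # -> 10101101111  (same leading digits: those of e)
```

Increasing n extends the agreement with e's binary expansion, which is what makes very large n (and hence very large numbers) interesting.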

The dataset

While the paper has a good chunk of very interesting material dedicated to the number theory conjecture in question, well beyond PhD level yet made accessible to first-year college students, here I summarize the dataset. In the end, this is what most of my readers are interested in, as it offers many practical applications.

One of the infinitely many patterns found in the data

Each table consists of a number of rows. Each row contains 2^n bits of information (n can be as large as you want). The patterns and correlations, trivial at the beginning, become harder and harder to detect further into the data. At row n, the first n bits match the binary digits of e or related transcendental numbers. Beyond that row, there are no more patterns. Some features:

  • Rows can be interpreted as strings, and there is also a time series component. These strings can be split into words, either short to emulate categorical features, or long for numerical features, to mimic enterprise datasets.
  • The structure in the dataset allows you to test clustering algorithms: the various strings can be clustered. The colors in the featured image represent such clusters. Towards the end, the number of clusters dramatically increases, with the structure becoming more and more fuzzy.
  • The dataset also allows you to test predictive algorithms. In particular, predicting the next strings based on historical data (the previous strings), with application to training large language models (LLMs).
  • It can be used as generic, very versatile type of synthetic data, or to create synthetic data. The digit sum function plays the role of a response, summary, or aggregate feature.
  • The iterated self-convolution used to create the data, or its inverse (the iterated square root of a string), is useful to design efficient, fast pseudo-random number generators (PRNGs) linked to pattern-free transcendental numbers with infinite period (such as e), and thus with much better randomness properties than classical congruential generators.
  • The connection to dynamical systems and quantum dynamics can be exploited for simulations, modeling purposes, and agent-based modeling.
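To make the construction concrete, here is a hypothetical sketch in the spirit of the bullet points above. The paper defines the exact scheme; the iterated-squaring rule and the column names below are my assumptions, chosen because squaring an integer is the string self-convolution (with carries) mentioned earlier.

```python
# Hypothetical sketch of the table construction (the paper defines the exact
# scheme; the iterated-squaring rule and column names here are my assumptions).
# Each squaring is a self-convolution of the binary string, with carries;
# the binary digit sum serves as an aggregate / response feature per row.

def make_rows(n: int):
    """Rows k = 0..n: binary string of (2^n + 1)^(2^k), plus its digit sum."""
    x = 2**n + 1
    rows = []
    for k in range(n + 1):
        bits = bin(x)[2:]
        rows.append({"row": k, "bits": bits, "digit_sum": bits.count("1")})
        x *= x  # self-convolution with carry = squaring
    return rows

table = make_rows(8)
# Row 0 is 2^8 + 1 = 100000001 in binary: digit sum 2, blatant pattern.
# Later rows have longer strings and increasingly hard-to-detect structure.
```

Splitting each `bits` string into fixed-length words then emulates the categorical or numerical features described above.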

Test your own pattern detection algorithm on the universal dataset, and see how far it can go. If it detects patterns beyond row n, these are false positives, and you need to address this issue on your side. It’s also a great sandbox to benchmark pattern detection systems.
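As a toy example of such a benchmark (my own illustration; the monobit-style frequency check and parameter choices below are not from the paper), a simple detector should flag the early, visibly structured rows while finding nothing in the later, random-looking ones:

```python
# Toy benchmark (my own illustration, not from the paper): a monobit-style
# frequency check. Early rows of the construction are visibly structured
# (almost all zeros); later rows look balanced, as a random string should.

def ones_fraction(x: int) -> float:
    """Fraction of ones in the binary expansion of x."""
    bits = bin(x)[2:]
    return bits.count("1") / len(bits)

n = 16
x = 2**n + 1                 # row 0: blatant pattern, only 2 ones in 17 bits
print(ones_fraction(x))      # tiny (2/17): easy to flag as non-random
for _ in range(n):
    x *= x                   # iterated self-convolution (squaring)
print(ones_fraction(x))      # close to 0.5: no obvious pattern left
```

A detector that still reports structure on strings like the final one, where none is known to exist, is producing false positives.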

For customization based on your enterprise needs, help with data generation, interpretation, sample size, simulations, feature generation, and any other questions about building your own enterprise version to address your priorities,
contact the author.

Python code, dataset, and technical paper

The PDF with many illustrations is available for free as paper 53, here. It also features fast Python code (with a link to GitHub) to deal with gigantic numbers. The underlying theory is explained in detail, with several modern references. The blue links in the PDF are clickable once you download the document from GitHub and view it in a browser, but they may not be clickable in GitHub’s “view mode”. I hope GitHub fixes this issue in the future!

Synthetic data in general suffers from several issues: biases, poor quality despite costly deep neural networks (for instance, failure to generate values outside the observation range, such as outliers, or poor representation of patterns spanning many dimensions), and lack of replicability. The universal dataset avoids all these problems.

To avoid missing future articles, subscribe to our newsletter.

Vincent Granville

Vincent Granville is a pioneering GenAI scientist, co-founder at BondingAI.io, the LLM 2.0 platform for hallucination-free, secure, in-house, lightning-fast Enterprise AI at scale with zero weights and no GPU. He is also an author (Elsevier, Wiley), publisher, and successful entrepreneur with a multi-million-dollar exit. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. He completed a post-doc in computational statistics at the University of Cambridge.


