This scientific research has three components. First, my most recent advances towards solving one of the most famous centuries-old conjectures in number theory: one that kids in elementary school can understand, yet that is incredibly hard to prove. At its core lies the spectacular quantum dynamics of the digit sum function.
Then, I present an infinite dataset that contains all the patterns you or AI can imagine, and many more, ranging from obvious to undetectable. More specifically, it is an infinite collection of infinite datasets, all in tabular format, with various degrees of short- and long-range auto- and cross-correlations, designed to test, enhance, and benchmark AI algorithms including LLMs. It is based on the physics of the digit sum function and linked to the aforementioned conjecture. This one-of-a-kind synthetic data is useful in contexts such as fraud detection or cybersecurity.
Finally, it comes with very efficient Python code to generate the data, involving gigantic numbers and high-precision arithmetic.
Summary
The universal dataset is an invaluable system to test, enhance, or benchmark pattern detection algorithms for fraud detection, cybersecurity, and other applications. The methodology relies on string auto-convolutions to uncover deep insights about the digit sum function, offering a new perspective on a famous centuries-old conjecture: are the binary digits of e evenly distributed? In this paper, I discuss the results obtained so far, both empirical and formally proved, including several new ones. I also discuss the dataset, its relevance to modern AI as a fundamental testing system, the incredibly rich and diversified set of patterns that it boasts, as well as connections to large language models (LLMs), quantum dynamics, synthetic data, and cryptography. I also provide very efficient, fast Python code to produce the data, dealing with numbers as large as (2^n + 1) raised to the power 2^n, with n larger than 10^6.
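To make this concrete, below is a minimal sketch (my own illustration, not the optimized code from the paper's GitHub repository), assuming plain Python big integers and the mpmath library: it builds row n as the binary digits of (2^n + 1)^(2^n) via iterated squaring, then counts how many leading bits agree with the binary expansion of e. Only exact integer arithmetic is needed for the row itself; mpmath is used solely to produce reference digits of e.

```python
from mpmath import mp

def row_bits(n: int) -> str:
    """Binary digits of (2^n + 1)^(2^n), computed via n iterated squarings."""
    x = (1 << n) + 1            # 2^n + 1, an exact Python integer
    for _ in range(n):          # squaring n times yields the power 2^n
        x *= x                  # the bit string roughly doubles at each step
    return bin(x)[2:]

def e_bits(k: int) -> str:
    """Binary digits of e: integer part '10' followed by k fractional bits."""
    mp.prec = k + 32                               # guard bits for accuracy
    return bin(int(mp.floor(mp.e * (1 << k))))[2:]

n = 12
row, target = row_bits(n), e_bits(64)
prefix = 0
while prefix < min(len(row), len(target)) and row[prefix] == target[prefix]:
    prefix += 1
print(f"row {n}: {len(row)} bits, first {prefix} agree with e")
```

Since (1 + 2^(-n))^(2^n) converges to e, roughly the first n bits of row n track e in base 2, in line with the construction described below.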
The dataset
While the paper devotes a good chunk of very interesting material to the number theory conjecture in question (well beyond PhD level, yet made accessible to first-year college students), here I summarize the dataset. In the end, this is what most of my readers are interested in, as it offers many practical applications.

Each table consists of a number of rows. Row n contains 2^n bits of information (n can be as large as you want). The patterns and correlations, trivial at the beginning, become harder and harder to detect later in the data. At row n, the first n bits match the binary digits of e or related transcendental numbers. Beyond that row, there are no more patterns. Some features:
- Rows can be interpreted as strings, and there is also a time series component. These strings can be split into words, either short to emulate categorical features, or long for numerical features, to mimic enterprise datasets (see the sketch after this list).
- The structure of the dataset lets you test clustering algorithms: the various strings naturally cluster. The colors in the featured image represent such clusters. Towards the end, the number of clusters increases dramatically, and the structure becomes more and more fuzzy.
- The dataset also allows you to test predictive algorithms, in particular predicting the next strings based on historical data (the previous strings), with applications to training large language models (LLMs).
- It can be used as a generic, very versatile type of synthetic data, or to create synthetic data. The digit sum function plays the role of a response, summary, or aggregate feature.
- The iterated self-convolution used to create the data, or its inverse (the iterated square root of a string), is useful for designing efficient, fast pseudo-random number generators (PRNGs) linked to pattern-free transcendental numbers with infinite period (such as e), and thus with much better randomness properties than classical congruential generators.
- The connection to dynamical systems and quantum dynamics can be exploited for simulations, modeling purposes, and agent-based modeling.
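To illustrate the word-splitting and digit sum features from the list above, here is a hedged sketch; the 8-bit word width and the pandas framing are illustrative choices of mine, not a format prescribed in the paper.

```python
import pandas as pd

# Rebuild a small row: the binary digits of (2^n + 1)^(2^n), as in the
# earlier sketch, via n iterated squarings.
n = 10
x = (1 << n) + 1
for _ in range(n):
    x *= x
bits = bin(x)[2:]

# Split into fixed-width words: short words mimic categorical codes, long
# words mimic numerical features; the digit sum serves as the response.
width = 8
words = [bits[i:i + width] for i in range(0, len(bits) - width + 1, width)]
df = pd.DataFrame({
    "word": words,                               # categorical view
    "value": [int(w, 2) for w in words],         # numerical view
    "digit_sum": [w.count("1") for w in words],  # response / aggregate feature
})
print(df.head())
```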
Test your own pattern detection algorithm on the universal dataset and see how far it can go. If it detects patterns beyond row n, these are false positives, and you need to address this issue on your side. It is also a great sandbox to benchmark pattern detection systems.
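As one possible sanity check (my choice of test, not one prescribed in the paper), you can run a NIST-style monobit frequency test on the bits of a row past its e-matching prefix, the region the construction describes as pattern-free. A p-value near zero flags bias; a detector reporting structure there is producing false positives.

```python
from math import erfc, sqrt

# Rebuild row n = binary digits of (2^n + 1)^(2^n), as in the earlier sketch.
n = 16
x = (1 << n) + 1
for _ in range(n):
    x *= x
bits = bin(x)[2:]

# NIST-style monobit frequency test on the bits past the e-matching prefix.
tail = bits[n:]
s = sum(1 if b == "1" else -1 for b in tail)    # count of 1s minus count of 0s
p_value = erfc(abs(s) / sqrt(2 * len(tail)))    # small p-value suggests bias
print(f"monobit p-value on {len(tail)} bits: {p_value:.3f}")
```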
For customization based on your enterprise needs, or for help with data generation, interpretation, sample size, simulations, feature generation, and any other questions about building your own enterprise version to address your priorities, contact the author.
Python code, dataset, and technical paper
The PDF with many illustrations is available for free as paper 53, here. It also features fast Python code (with a link to GitHub) to deal with gigantic numbers. The underlying theory is explained in detail, with several modern references. The blue links in the PDF are clickable once you download the document from GitHub and view it in any browser, but may not be clickable in the GitHub "view mode". I hope GitHub fixes this issue in the future!
Synthetic data in general suffers from several issues: biases, poor quality despite costly deep neural networks (for instance, failure to generate values outside the observation range such as outliers, or poor representation of patterns spanning many dimensions), and lack of replicability. The universal dataset avoids all these problems.
To avoid missing future articles, subscribe to our newsletter.