

The following glossary features the main concepts attached to LLM 2.0, with examples, rules of thumb, caveats, and best practices, contrasted against standard LLMs. For instance, OpenAI's models have billions of parameters, while xLLM, our proprietary LLM 2.0 system, has none. This is true if we define a parameter as a weight connecting neurons in a deep neural network: after all, the xLLM architecture does not rely on neural networks for the most part. Yet, xLLM has a few dozen intuitive parameters that are easy to fine-tune, if the word “parameter” is interpreted in its plain sense. The picture below is an extract from the Python code.

Extract from xLLM code
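
As a purely hypothetical illustration, a handful of intuitive, human-readable parameters might look like the following sketch (names and values are invented, not taken from the actual xLLM code):

```python
# Hypothetical illustration of intuitive, human-readable parameters
# (names and values are made up for this glossary, not from the real xLLM code).
FRONTEND_PARAMS = {
    "distill": True,              # remove redundant elements from the response
    "use_stem": True,             # apply stemming to prompt tokens
    "max_tokens_per_ngram": 4,    # longest multi-token considered
    "context_window": 8,          # max word distance when building c-tokens
    "weight_g_tokens": 2.0,       # boost for knowledge-graph tokens
    "weight_s_tokens": 0.5,       # discount for tokens retrieved via synonyms
    "max_cards": 10,              # number of cards shown in the response
}
```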

Agents — Action agents are home-made apps that perform tasks on retrieved data: tabular data synthetization, auto-tagging, auto-indexing, taxonomy creation or augmentation, text clustering (for instance for fraud detection), enhanced visualization including data videos, or cloud regression for predictive analytics, whether the data comes from PDFs, the Web, databases or data lakes. Tasks not included in this list (code generation, solving math problems) are performed by external agents. Search agents detect user intent in the prompt (“how to”, “what is”, “show examples” and so on) to retrieve chunks pre-tagged accordingly; alternatively, they can be selected from the UI, via an agent selection box.

Backend — Parameters, tables, and components linked to corpus processing. See also frontend.

Card — A clickable summary box featuring one element of the response; it corresponds to a chunk and displays its relevancy score, index, and contextual elements. Clicking on it reveals the detailed text, along with any tables or images.

Chunking — We use hierarchical chunking with two levels. A backend table maps each parent chunk to the list of daughter chunks attached to it (the key is a parent chunk); xLLM also builds the inverse table, where the key is a daughter chunk and the value is its parent.
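
A minimal sketch of the two chunking tables, assuming chunks are referenced by string IDs (identifiers and structure below are illustrative, not the actual xLLM schema):

```python
# Illustrative two-level chunking tables (hypothetical IDs, not actual xLLM code).
parent_to_daughters = {
    "doc1_parent01": ["doc1_parent01_a", "doc1_parent01_b"],
    "doc1_parent02": ["doc1_parent02_a"],
}

# Build the inverse table: daughter chunk -> parent chunk.
daughter_to_parent = {
    daughter: parent
    for parent, daughters in parent_to_daughters.items()
    for daughter in daughters
}

print(daughter_to_parent["doc1_parent01_b"])  # doc1_parent01
```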

Contextual elements — Categories, chunk titles, tags, time stamps, document titles in PDF repositories, and so on, added to chunks. Their names and content come from the corpus or are created post-crawling. Chunks have a variable number of contextual elements. In poorly structured corpuses, chunk titles may be generated based on relative font size or retrieved from a table of contents. Since these elements are linked to the entire knowledge graph, xLLM uses an infinite context window.

Context window — To compute multi-token correlations or generate c-tokens (multi-tokens whose words are not glued together but separated by other words), we look at words relatively close to each other within a chunk. This is the finite context window; its length can be fine-tuned. An efficient algorithm performs this task in time that grows linearly rather than quadratically with the size of the chunk (no nested loops over the full chunk).
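
A minimal sketch of the idea, assuming a chunk is given as a list of words and the window length is fixed (function and variable names are hypothetical):

```python
# Illustrative finite-context-window pairing (hypothetical, not actual xLLM code).
# For each word, only the next `window` words are examined, so the cost grows
# linearly with chunk size for a fixed window length.
def word_pairs(chunk_words, window=8):
    pairs = []
    for i, w in enumerate(chunk_words):
        for v in chunk_words[i + 1 : i + 1 + window]:
            pairs.append((w, v))  # candidate pair, e.g. the c-token "data^books"
    return pairs

chunk = "data science books for machine learning".split()
print(word_pairs(chunk, window=3)[:4])
```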

Crawling — Generic term for intelligently extracting text and other elements from the corpus to create the xLLM file. If the corpus is a repository of PDF documents, crawling should be interpreted as parsing. If the input data comes from a database, crawling means DB querying, extraction, and parsing. Some corpuses mix public and private Web, DB, file repositories (PDFs), and so on. In this case, crawling means intelligently extracting and combining all the information from the various sources, broken down into chunks. See also smart crawling and xLLM file.

Distillation — Not to be confused with deep neural network distillation (compressing a large model into a smaller one). Here, backend distillation is used to remove duplicate entries in the corpus. Frontend distillation removes duplicate or redundant elements in the response, acting as a response cleaning agent; it can be turned off or customized via frontend parameters.

Embeddings — Unlike in standard LLMs, embeddings are not needed to build the response. Instead, they are used to suggest multi-tokens related to the prompt, allowing the user to try a different prompt with just one click. We use variable-length embeddings, with PMI or E-PMI rather than cosine similarity to measure similarity. We do not use vector databases or vector search; instead, we rely on very fast hash lookups.
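
A minimal sketch of hash-based lookups in place of vector search, assuming an embeddings table keyed by multi-token; the names and scores below are invented for illustration:

```python
# Illustrative variable-length "embedding": each multi-token maps to the
# multi-tokens it is associated with, along with a similarity score
# (a placeholder for PMI / E-PMI). Hypothetical data, not actual xLLM code.
q_embeddings = {
    "data~science": {"machine~learning": 0.62, "statistics": 0.48, "python": 0.31},
    "nested~hash": {"hash~table": 0.71, "key~value": 0.44},
}

def related_tokens(token, top_n=3):
    # One hash lookup instead of a vector-database search.
    neighbors = q_embeddings.get(token, {})
    return sorted(neighbors.items(), key=lambda kv: -kv[1])[:top_n]

print(related_tokens("data~science"))
```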

Fine-tuning — Frontend fine-tuning is offered to power users in real time, with no latency. By contrast, backend fine-tuning re-creates all the backend tables. Fine-tuning can be global (for all sub-LLMs at once), local (one sub-LLM at a time) or hybrid (with both local and global parameters). Parameters are intuitive: you can predict the impact of increasing or decreasing their values, or of turning them on or off. The system is based on explainable AI. Fine-tuning allows the response to be customized to the user. See also parameters and self-tuning.

Frontend — Parameters, tables, and components linked to prompt processing. Local extracts of the backend tables, for instance q_dictionary and q_embeddings, are used in the frontend to process the prompt efficiently. The prefix “q_” indicates a local table containing all that is needed from the backend to process the prompt (also called “query”). See also backend.

Indexing — Mechanism to reference the exact location in the corpus corresponding to a response element or chunk in the xLLM data. We use a multi-index with entries sequentially generated to allow for corpus browsing; the multi-index contains references to the parent chunk and other elements such as font size.

Knowledge graph — A set of contextual elements organized as a graph, retrieved from the corpus via smart crawling, such as the taxonomy, breadcrumbs, related items, tags, links, tables, and so on. If the internal knowledge graph is poor, it can be enhanced by adding tags or categories via a labeling algorithm (or clustering), post-crawling. Typically, xLLM does not build a knowledge graph; instead, it uses the one found in the corpus. In some cases, it is augmented or blended with external knowledge graphs.

Multimodal — Tables and images are detected at the chunk level and stored separately. They are accessible from cards in the response, along with the other contextual elements (categories, tags, titles). In addition, special tags are pre-assigned post-crawling to chunks that contain tables or images, to indicate their presence. This applies to PDFs, but data coming from the Web is processed in the same way, as both are converted to JSON prior to being fed to the xLLM engine.

N-grams — We use sorted N-grams to match each multi-token combination detected in a sorted prompt with multi-tokens found in the corpus, in order to retrieve the correct word ordering. Example: “data science book” in a prompt is turned into book~data~science. In the sorted N-gram table, the row corresponding to book~data~science (if it exists) lists all the variations found in the corpus, say data~science~book and science~book~data, with their numbers of occurrences. This eliminates the combinatorial explosion generated by multi-tokens consisting of many words.
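
A minimal sketch of a sorted N-gram lookup, with invented counts and a hypothetical helper function:

```python
# Illustrative sorted N-gram table (hypothetical data, not actual xLLM code).
# Key: alphabetically sorted multi-token; value: orderings found in the corpus
# with their occurrence counts.
sorted_ngrams = {
    "book~data~science": {"data~science~book": 17, "science~book~data": 2},
}

def restore_ordering(prompt_words):
    key = "~".join(sorted(prompt_words))
    variations = sorted_ngrams.get(key, {})
    # Pick the most frequent ordering found in the corpus, if any.
    return max(variations, key=variations.get) if variations else None

print(restore_ordering(["data", "science", "book"]))  # data~science~book
```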

Nested hashes — The base xLLM architecture stores its databases as nested hashes for fast, real-time execution; this is also called an in-memory LLM. In a nested hash, a parent key can be a vector, such as a pair of multi-tokens, and the value can itself be a hash table. This structure is very efficient at dealing with sparsity: a keyword pair {A, B} is stored only in the rare instances where A and B are correlated. The code includes functions to update, delete, add, or retrieve entries in a nested hash, and to compute the inverse or transposed hash. The latter is the equivalent of transposing a matrix, or turning a row database into a column database. A nested hash entry is similar to a JSON element.
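
A minimal sketch of a nested hash and its transposed version, with invented keys and values:

```python
# Illustrative nested hash (hypothetical data, not actual xLLM code).
# Outer key: a pair of multi-tokens; value: a hash table of attributes.
nested = {
    ("data~science", "book"): {"count": 17, "pmi": 0.62},
    ("nested~hash", "json"): {"count": 5, "pmi": 0.44},
}

# "Transposed" hash: swap the roles of the two keys in each pair,
# analogous to transposing a sparse matrix.
transposed = {}
for (a, b), value in nested.items():
    transposed[(b, a)] = value

print(transposed[("book", "data~science")]["count"])  # 17
```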

Parameters — Not to be confused with standard LLM parameters, which are deep neural network weights. Backend parameters are used to build the backend tables linked to corpus processing; frontend parameters are linked to prompt processing and can be fine-tuned in real time, for instance to optimize the relevancy scores, frontend distillation, E-PMI (if offered as a customizable frontend component), and frontend stemming.

PMI — We use generalized pointwise mutual information (PMI) to measure the correlation between two multi-tokens A and B in the corpus. PMI takes into account the word counts of A and B in the corpus, as well as the number of times A and B occur jointly in the same text element or chunk. Enhanced PMI (E-PMI) takes into account additional factors such as the number of words in a multi-token, or the word intersection between multi-tokens A and B. The E-PMI metric has parameters that can be fine-tuned. Finally, it is standardized to take a value between 0 (no association) and 1 (strong association).
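
For intuition, here is the textbook PMI formula with a simple normalization to [0, 1]; it is not the proprietary E-PMI, and the counts are invented:

```python
# Illustrative PMI computation between two multi-tokens (parameter-free
# textbook version for intuition only, not the proprietary E-PMI).
import math

def pmi(count_a, count_b, count_ab, n_chunks):
    # count_a, count_b: chunks containing A, B; count_ab: chunks containing both.
    p_a, p_b, p_ab = count_a / n_chunks, count_b / n_chunks, count_ab / n_chunks
    if p_ab == 0:
        return 0.0
    value = math.log(p_ab / (p_a * p_b))
    # Normalize to [0, 1]: 0 = no association, 1 = perfect association.
    return max(0.0, value / -math.log(p_ab))

print(round(pmi(count_a=40, count_b=25, count_ab=18, n_chunks=1000), 3))  # ~0.72
```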

Preprocessing — Consists of turning the original corpus (Web pages, PDFs, databases, or a mix of all) into the xLLM format. It also involves building a stopwords list, stemming optimization, pre-tagging (assigning search agents, categories, titles, and tags to chunks), setting backend parameters, and building a synonyms/acronyms dictionary as needed.

Relevancy scores — Several scores are computed for each candidate card in the response, based on the multi-tokens attached to it and their relevancy to the prompt, typically one score per token type. Scores are normalized to stay within a fixed range: 0 for the worst match, 10 for the best match. The cards in the response are then ranked by each score separately. For each card, a global rank is computed as a weighted sum of the individual ranks attached to that card. The top 10 cards (by global rank) are shown in the response. Several parameters can be fine-tuned to modify the scores and the ranking system.
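
A minimal sketch of the rank aggregation idea, with invented scores, token types and weights (not the actual xLLM scoring code):

```python
# Illustrative rank aggregation across score types (hypothetical data and
# weights, not actual xLLM code).
cards = {
    "card1": {"standard": 8.5, "g_tokens": 9.1, "c_tokens": 4.0},
    "card2": {"standard": 7.9, "g_tokens": 6.2, "c_tokens": 8.8},
    "card3": {"standard": 5.1, "g_tokens": 8.4, "c_tokens": 7.5},
}
weights = {"standard": 1.0, "g_tokens": 2.0, "c_tokens": 0.5}

def global_ranks(cards, weights):
    # Rank cards per score type (rank 1 = best), then combine with weights.
    combined = {name: 0.0 for name in cards}
    for score_type, w in weights.items():
        ordered = sorted(cards, key=lambda c: -cards[c][score_type])
        for rank, name in enumerate(ordered, start=1):
            combined[name] += w * rank
    return sorted(combined.items(), key=lambda kv: kv[1])  # lower = better

print(global_ranks(cards, weights))
```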

Response — Results returned for the user prompt. It consists of concise information organized as structured output (such as cards or sections with bullet lists, including relevancy scores and precise links to the source) or long text typical of Perplexity. For long text generation, use specialized models and train from scratch using the minimum training set needed; if using pre-trained models, deep neural network distillation is needed. In the response, the user can choose whether or not to have their next prompt linked to the previous prompts. Responses to the most popular prompts are cached.

Search options — In the UI, the prompt box allows you to do exact or broad search, enter negative keywords, put a higher weight on the first multi-token in the prompt, search by recency, or specify agents, tags, and sub-LLMs. A debugging mode is available, offering a catch-all parameter set, useful to developers.

Self-tuning — Power users can fine-tune parameters in real time. Favorite parameters chosen by power users are combined to create sets of default parameters, regularly updated. This reinforcement learning technique is known as self-tuning.

Smart crawling — During crawling, or when parsing PDF documents, we retrieve the contextual elements embedded in the corpus, such as the taxonomy or words in large font, to add context (categories, tags, and so on) to the chunks. While the structure is similar across different corpus types (database, Web, and so on), there is no Python library that does the job without missing important elements. Our proprietary technique does the job but requires some minimal customization for each corpus. The task is facilitated by converting the corpus to JSON. Even bullet lists and relative font sizes can be correctly identified in PDFs. This works whether or not the corpus documents use elements labeled as headers.

Stemming — Can be turned off. Backend and frontend stemming may or may not use the same stem table. See also un-stemming and synonyms. Backend stemming should be turned on for frontend stemming to be effective.

Stopwords — List of words to ignore in the corpus when crawling to generate the xLLM file. It is specific to the corpus, and even to the sub-LLM. There is also a frontend list of stopwords to ignore in the prompt; both lists may or may not be the same. Keywords such as “how” and “what” may be in the prompt stopwords list (ignored for retrieval in chunks) yet treated separately, as they relate to specific search agents and indicate user intent.

Sub-LLM — Corporate corpuses can be broken down into sub-corpuses, each with its own sub-LLM. Access to specific sub-LLMs, and even to specific chunks within a sub-LLM, is granted to authorized users only. Thanks to an LLM router, results can be extracted from multiple sub-LLMs as needed to answer a prompt. When building sub-LLMs, we work with stakeholders and IT to make sure that we do not miss any source due to siloed information. This process also involves sound quality assurance, to make sure that all the important data sources are integrated into xLLM and that no important information is missing from the response.
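
A minimal sketch of routing a prompt to authorized sub-LLMs, using a naive keyword-based router with hypothetical names (the actual xLLM router is more sophisticated):

```python
# Hypothetical LLM router: dispatch a prompt to the relevant sub-LLMs,
# respecting user access rights. Illustrative only, not actual xLLM code.
sub_llms = {
    "hr_policies": {"keywords": {"vacation", "benefits", "payroll"}},
    "engineering_docs": {"keywords": {"api", "deployment", "architecture"}},
}
user_access = {"alice": {"hr_policies"}, "bob": {"hr_policies", "engineering_docs"}}

def route(prompt, user):
    words = set(prompt.lower().split())
    allowed = user_access.get(user, set())
    return [name for name, llm in sub_llms.items()
            if name in allowed and words & llm["keywords"]]

print(route("how is the deployment architecture documented", "bob"))
```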

Synonyms — Multi-tokens found in the prompt are matched against a synonyms and acronyms dictionary to increase exhaustivity in search results. Example: say “games” is found in the prompt but not in the corpus. First, “games” is stemmed to “game”. Then “game” is un-stemmed to identify related words: “games”, “gaming”, “gamers”, and so on. If “gaming” is in the corpus, it will be retrieved despite not being in the prompt. Other examples are more complex, for instance matching the multi-token “analysis~variance” to “anova”. The words retrieved via un-stemming, if found in the corpus, are called s-tokens. They may get a lower weight than standard multi-tokens, as they are not a perfect match.
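
A minimal sketch of producing s-tokens via stemming and un-stemming, with an invented stem table and corpus vocabulary:

```python
# Illustrative stem / un-stem lookup to produce s-tokens (hypothetical data,
# not actual xLLM code or its proprietary stemmer).
stem_table = {"games": "game", "gaming": "game", "gamers": "game"}

# Inverse (un-stem) table: stem -> words sharing that stem.
unstem_table = {}
for word, stem in stem_table.items():
    unstem_table.setdefault(stem, set()).add(word)

corpus_vocabulary = {"gaming", "analysis", "variance"}

def s_tokens(prompt_word):
    stem = stem_table.get(prompt_word, prompt_word)
    # Related words found in the corpus, retrieved despite not being in the prompt.
    return unstem_table.get(stem, set()) & corpus_vocabulary

print(s_tokens("games"))  # {'gaming'}
```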

Tags — Tags may come directly from the corpus and are integrated into chunks as contextual elements. We also generate tags that we assign to chunks post-crawling, to establish a link to specific agents, indicate the presence of tables or images, or mark the chunk as sensitive with access limited to authorized users only.

Token — In xLLM, tokens consist of one word (single tokens) or multiple words (multi-tokens). A term such as “real estate” is simultaneously a single token (real_estate), a multi-token (real~estate), and two single tokens (real, estate); they are all referred to as multi-tokens. In addition, c-tokens consist of words found close to each other in a chunk but not glued together (separated by other words). Example: data^books is a c-token extracted from “data science books”.
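
A minimal sketch of the different token forms for a short text span, using the notation from this glossary (the helper variables are hypothetical):

```python
# Illustrative token forms for the span "data science books"
# (hypothetical representations mirroring the glossary notation, not actual xLLM code).
words = ["data", "science", "books"]

single_tokens = words                        # data, science, books
multi_token = "~".join(words)                # data~science~books
sorted_form = "~".join(sorted(words))        # books~data~science (see N-grams)
c_token = f"{words[0]}^{words[2]}"           # data^books: close to each other, not adjacent
glued_token = "_".join(["real", "estate"])   # real_estate, a single token made of two words

print(single_tokens, multi_token, sorted_form, c_token, glued_token)
```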

Token type — Besides regular multi-tokens found in standard text in the corpus, xLLM uses knowledge graph multi-tokens referred to as g-tokens. They are found in the contextual elements attached to a chunk: category, sub-categories, title, tags, and so on. They get a higher weight in the relevancy scores. See also c-tokens in the “token” entry, and s-token in the “synonyms” entry.

Un-stemming — See synonyms. We use a proprietary algorithm to un-stem. When using public stemmers or lemmatizers, we need to break down some entries: “racing” and “race” cannot have the same stem. Look at the top 50 multi-tokens with the highest frequency in the corpus to fix the most problematic ones.

Weights — A positive value to boost or reduce the influence of certain multi-tokens in search results, such as c-tokens, s-tokens, g-tokens, or single-word tokens, compared to regular multi-tokens assigned a weight of 1. Not to be confused with weights in deep neural networks. Our weights can be fine-tuned on the frontend in real time.

xLLM file — Raw input from the corpus, cleaned and converted into a JSON-like text file regardless of origin (Web, PDF repository, DB). Each row corresponds to a chunk and includes its contextual elements and index ID within a single JSON entity. Images and tables are stored separately, with an index referencing the parent chunk they belong to; chunks in turn have indexes pointing to the images and tables that they contain.
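
A hypothetical example of a single row of the xLLM file, with invented field names and values (not the actual schema):

```python
# Illustrative xLLM file row for one chunk (hypothetical field names and values,
# not the actual xLLM schema).
import json

chunk_row = {
    "index": "doc12_chunk034",
    "parent": "doc12_chunk030",
    "text": "Nested hashes store sparse key-value data in memory ...",
    "category": "Architecture",
    "title": "In-memory tables",
    "tags": ["has_table", "agent:what_is"],
    "images": ["doc12_img07"],   # stored separately, referenced by index
    "tables": ["doc12_tbl02"],
    "timestamp": "2025-01-15",
}

print(json.dumps(chunk_row))  # one row of the JSON-like xLLM file
```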
