
Doing Better with Less: LLM 2.0 for Enterprise

Standard LLMs are trained to predict the next token or missing tokens. This requires deep neural networks (DNNs) trained on billions or even trillions of tokens, as highlighted by Jensen Huang, CEO of Nvidia, in his keynote talk at the GTC conference earlier this year. Yet 10 trillion tokens cover all possible string combinations; the vast majority of them are noise. After all, most people have a vocabulary of about 30k words. But this massive training is necessary to prevent DNNs from getting stuck in sub-optimal configurations due to vanishing gradients and other issues.

What if you could do with a million times less? With mere millions of tokens rather than trillions? After all, predicting the next token is a task only remotely related to what modern LLMs do. Its history is tied to text auto-filling, guessing missing words, autocorrect and so on, developed initially for tools such as BERT. Today, it is no different from training a plane to operate efficiently on the runway, but not to fly. It also entices LLM vendors to charge clients by token usage, with little regard for ROI.

Our approach is radically different. We do not use DNNs or GPUs. It is as different from standard AI as it is from classical NLP and machine learning. Its origins are similar to those of other tools that we built, including NoGAN, our alternative to GAN for tabular data synthetization. NoGAN, a fast technology with no DNN, runs much faster with much better results, even in real time. Output quality is assessed using our ground-breaking evaluation metric, which captures important defects missed by all other benchmarking tools.

In this article, I highlight unique components of xLLM, our new architecture for enterprise. In particular, I explain how it runs faster without DNNs or GPUs, yet delivers more accurate results at scale, without hallucinations, while minimizing the need for prompt engineering.

No training: a smart engine and other components instead

The xLLM architecture goes beyond traditional transformer-based LLMs; xLLM is not a pre-trained model. The core components that make it innovative are:

Figure 1: Key components of xLLM

1. Smart Engine

The foundation of xLLM is enterprise data (PDFs, web, systems, etc.). The base model does not require training. It retrieves the text “as is” from the corpus, along with contextual elements found in the corpus:

  • Domain, sub-domain
  • Tags, categories, parent category, sub-categories
  • Creation date (when document was created)
  • Chunk, sub-chunk, document title
  • Precise link to source
  • Other corpus-dependent context elements
  • Summary info

Tags and categories may be pre-assigned to chunks post-crawling, using a home-made algorithm, if they are absent from the corpus. The output is structured and displayed as 10 summary boxes called cards, selected out of possibly 50 or more based on relevancy score. To get full information about a card, the user clicks on it. In the UI, the user can also specify the category, sub-LLM, and other contextual elements.
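
To make this concrete, here is a minimal sketch of how a retrieved chunk and its contextual elements could be packaged into a card, and how the top 10 cards might be selected by relevancy score. The field names and the selection function are illustrative assumptions, not the actual xLLM implementation.

```python
from dataclasses import dataclass

@dataclass
class Card:
    # Contextual elements retrieved "as is" from the corpus, attached to a chunk
    domain: str
    sub_domain: str
    tags: list
    title: str
    creation_date: str
    source_link: str
    chunk_id: str
    summary: str
    relevancy: float = 0.0  # computed at query time

def top_cards(candidates, k=10):
    """Keep the 10 most relevant cards out of possibly 50 or more."""
    return sorted(candidates, key=lambda c: c.relevancy, reverse=True)[:k]
```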

As an advanced user, you can leverage the Smart Engine to validate the retrieval process (data, tags, and other contextual elements) and fine-tune intuitive parameters based on the E-PMI metric (enhanced pointwise mutual information, a flexible alternative to cosine similarity), to adapt relevancy scores, stemming, and so on. Embeddings displayed as a clickable graph allow you to try suggested, related prompts relevant to your query, based on corpus content.
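
E-PMI itself is proprietary, but the snippet below shows the underlying idea with standard pointwise mutual information (PMI) computed from chunk-level co-occurrence counts; treat it as a conceptual sketch, not the xLLM formula.

```python
import math

def pmi(count_xy, count_x, count_y, n_chunks):
    """Standard PMI between multi-tokens x and y, estimated from chunk co-occurrences:
    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )."""
    if count_xy == 0:
        return float("-inf")
    p_xy = count_xy / n_chunks
    p_x = count_x / n_chunks
    p_y = count_y / n_chunks
    return math.log(p_xy / (p_x * p_y))

# Example: two multi-tokens co-occur in 40 out of 1,000 chunks
print(pmi(count_xy=40, count_x=120, count_y=200, n_chunks=1000))  # positive: correlated
```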

2. Multimodal agents, such as synthetic data generation

In Figure 1, the featured agent handles tabular data synthetization. We developed NoGAN synthetic data generation to enrich xLLM. This component depends on the business use case; for example, a bank may want to enhance its fraud models using xLLM in two stages:

  • Create new variables (convert text into ML features; see the sketch after this list)
  • Synthetic data: improve the current fraud model and generate new data for new markets, with new patterns not found in the current (real) data yet mimicking existing data.
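
The sketch below illustrates the first stage, converting free text into numeric ML features for a fraud model. The keyword list and feature names are hypothetical; the actual xLLM agent is more sophisticated.

```python
import re

FRAUD_KEYWORDS = ["chargeback", "wire transfer", "dispute", "unverified"]  # hypothetical list

def text_to_features(note: str) -> dict:
    """Turn a free-text transaction note into numeric features usable by a fraud model."""
    text = note.lower()
    features = {f"kw_{kw.replace(' ', '_')}": text.count(kw) for kw in FRAUD_KEYWORDS}
    features["n_tokens"] = len(re.findall(r"[a-z']+", text))
    return features

print(text_to_features("Customer filed a dispute after an unverified wire transfer."))
```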

Other home-made agents that are part of xLLM:

  • Predictive analytics on detected, retrieved and blended tables
  • Unstructured text clustering
  • Taxonomy creation or augmentation

3. Response Generator offering user-customized contextual output

The user can choose prose for the response (like Perplexity.ai) as opposed to structured output (organized boxes or sections with bullet lists from the Smart Engine). In this case, training is needed, but not on the whole Internet: typical LLMs have very large token lists covering big chunks of English and other languages, with most tokens irrelevant to the business context.
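
As an illustration, a corpus-restricted token list can be much smaller than a general-purpose one. The sketch below builds one from the enterprise corpus only; the actual xLLM tokenizer works with multi-tokens and is proprietary, so this is a simplification.

```python
from collections import Counter
import re

def corpus_token_list(documents, max_tokens=20000):
    """Build a token list restricted to the enterprise corpus, not the whole language."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return [token for token, _ in counts.most_common(max_tokens)]

print(corpus_token_list(["Quarterly revenue grew in retail banking.",
                         "Retail banking fraud losses declined."]))
```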

How is xLLM different from RAG

The xLLM architecture is different from RAG. It shows structured output to the user (sections with bullet lists, or cards) that can be turned into continuous text if desired. The chat environment is replaced by selected cards displayed to the user, each with summary information, links, and relevancy scores. With one click (the equivalent of a second prompt in standard systems), you get the detailed information attached to a card. Also, alternate clickable prompts are suggested based on corpus content and relevant to your initial query. Top answers are cached. Of course, the user can still manually enter a new prompt to obtain deeper results related to the previous prompt (if there are any), or decide not to link the new prompt to the previous one.

A few key points:

  • xLLM is designed to maximize accuracy, relevancy, and exhaustivity, and to minimize prompt engineering and hallucinations.
  • xLLM also displays related, alternate, suggested clickable pre-made prompts based on the original prompt and what is in the corpus, using keyword correlations (enhanced PMI, or E-PMI) and variable-size embeddings based on E-PMI.
  • Tokens are actually multi-tokens and come in two types: contextual if found in contextual elements attached to a chunk, and standard if found in regular text. Chunking is hierarchical (2 levels) and chunks are indexed via a multi-index.

The user can do exact or broad search, search by recency, enter negative keywords, or put a higher weight on the first multi-token in the prompt. For a 10-word prompt, xLLM looks at all combinations of multi-tokens (each up to 5 words) and does so very efficiently with our own technology rather than a vector DB. It also looks for synonyms and acronyms to increase exhaustivity, and has a unique un-stemming algorithm.
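
To make the combinatorial part concrete, the sketch below enumerates all contiguous multi-tokens (up to 5 words each) of a 10-word prompt; the actual matching, weighting, and indexing logic in xLLM is proprietary.

```python
def multi_tokens(prompt: str, max_len: int = 5):
    """Enumerate all contiguous multi-tokens of length 1..max_len found in the prompt."""
    words = prompt.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

prompt = "quarterly revenue growth for the European retail banking division overall"
combos = multi_tokens(prompt)  # 10 words -> 10 + 9 + 8 + 7 + 6 = 40 multi-tokens
print(len(combos), combos[:3])
```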

Real-time fine-tuning: LoRA approach

LoRA is used to adapt the style of the output, but not the knowledge of the LLM. Fine-tuning is used both to adapt the response and to change selection criteria: to optimize output distillation, stemming, the E-PMI metric, relevancy scores, various thresholds, speed, and so on. The parameters most frequently favored by users lead to a default parameter set. This is the reinforcement learning part, leading to xLLM self-tuning.
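
A minimal sketch of the self-tuning idea, assuming the parameter values favored by users are logged per session (the parameter names are hypothetical): the most frequently kept value for each parameter becomes the new default.

```python
from collections import Counter

def self_tune(favored_runs):
    """Derive a default parameter set from the values users kept most often."""
    defaults = {}
    for key in {k for run in favored_runs for k in run}:
        values = [run[key] for run in favored_runs if key in run]
        defaults[key] = Counter(values).most_common(1)[0][0]
    return defaults

runs = [{"epmi_threshold": 0.3, "max_cards": 10},
        {"epmi_threshold": 0.3, "max_cards": 12},
        {"epmi_threshold": 0.4, "max_cards": 10}]
print(self_tune(runs))  # {'epmi_threshold': 0.3, 'max_cards': 10}
```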

Important to mention:

  • xLLM only shows what exists in the enterprise corpus.
  • Any response generated comes from the xLLM Smart Engine.

Leveraging Explainable AI  

The overall approach is to be an Open Box (the opposite of a Black Box), able to systematically explain, from a prompt perspective, everything that happens, as follows.

A prompt generates a clickable graph with nodes. The nodes are based on the E-PMI threshold and source. By clicking on a node, the user gets the domain, sub-domain, tags, chunk or sub-chunk ID, relevancy score, and content: text, images, and tables.
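
The sketch below shows how such a graph could be assembled with the open-source networkx library; the node attributes mirror the elements listed above, and the E-PMI values and threshold are placeholders, not output from xLLM.

```python
import networkx as nx

EPMI_THRESHOLD = 0.2  # placeholder threshold

G = nx.Graph()
G.add_node("chunk_17", domain="Sales", sub_domain="EMEA", tags=["revenue"], relevancy=0.91)
G.add_node("chunk_42", domain="Sales", sub_domain="Retail", tags=["growth"], relevancy=0.84)

epmi = 0.35  # placeholder E-PMI between the two chunks
if epmi > EPMI_THRESHOLD:
    G.add_edge("chunk_17", "chunk_42", epmi=epmi)

# "Clicking" a node amounts to looking up its attributes:
print(G.nodes["chunk_17"])
```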

Key Elements of XAI (explainable AI for xLLM):

  • Transparency – Users can see how the model works internally, in a graph view.
  • Interpretability – Outputs and predictions can be understood in human terms and explored in a graph view.
  • Justification – The AI provides reasons or data points supporting its decisions, mapping the source, data, model score, and so on.
  • Trustworthiness – Users trust xLLM more when they understand where results come from.
  • Compliance – xLLM meets legal, ethical, and regulatory requirements (GDPR, HIPAA), as our structure is sub-xLLM domain-based (following data mesh principles).

Dealing with figures, numbers, dates

Extracting tables, figures, or dates is the easy part. We have a proprietary algorithm to retrieve information (tables, bullet lists, components embedded into graphs) from PDFs. It performs the job more accurately than all the PDF processing libraries that we tested. To get the best of both worlds, we use efficient Python libraries combined with our algorithm and workarounds to avoid glitches coming from the libraries. Also, PDFs, like most other sources (web, databases), are converted and blended into a JSON-like format before being fed to the xLLM engine. This format is superior to the other formats tested. In particular, you can retrieve the relative size of a font, its face and color, and even its exact location (pixel coordinates) in the PDF. This in turn helps create contextual elements to add to the chunks.
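
Our extraction algorithm is proprietary, but the sketch below shows, with the open-source PyMuPDF library, the kind of font and position metadata that can be pulled from a PDF to build contextual elements. The heuristic in the usage comment (large font size suggesting a title) is an assumption, not the xLLM rule.

```python
import fitz  # PyMuPDF

def extract_spans(pdf_path):
    """Yield each text span with its font size, face, color, and pixel coordinates."""
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no "lines"
                for span in line["spans"]:
                    yield {"page": page_number,
                           "text": span["text"],
                           "font_size": span["size"],
                           "font": span["font"],
                           "color": span["color"],
                           "bbox": span["bbox"]}  # (x0, y0, x1, y1) coordinates

# Hypothetical usage: unusually large fonts often flag titles or section headers.
# for s in extract_spans("report.pdf"):
#     if s["font_size"] > 18:
#         print("Possible title:", s["text"])
```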

Another option is to convert PDF pages to images (very easy) and then use OCR, if the user wants to retrieve or leverage text or other information embedded in an image. This may be available in a future version.
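
Since this feature is not implemented yet, the following is only a sketch of what it could look like, assuming the open-source pdf2image and pytesseract libraries (which require poppler and Tesseract to be installed).

```python
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path):
    """Convert each PDF page to an image, then run OCR to recover embedded text."""
    pages = convert_from_path(pdf_path, dpi=300)  # requires poppler
    return [pytesseract.image_to_string(page) for page in pages]  # requires Tesseract

# texts = ocr_pdf("scanned_report.pdf")
```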

Addressing security issues

To ensure the security of sensitive data, our platform follows Data Mesh principles (domain-centric and security-focused). Our architecture enhances companies’ internal security protocols and ensures full compliance.

Figure 2: High-level architecture

Each business domain becomes a sub-xLLM, ensuring security and isolation of departmental data while complying with any security and privacy rules. For example:

Sales sub-LLM:

  • Sub-xLLM: Sales
  • Sensitive data (yes/no)
  • Access management

HR sub-LLM:

  • Sub-xLLM: HR
  • Sensitive data (yes/no)
  • Access management
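
Translated into configuration terms, the example above might look like the following sketch; the registry layout and field names are assumptions for illustration, not the actual xLLM configuration.

```python
# Hypothetical registry: each business domain maps to its own isolated sub-xLLM.
SUB_XLLMS = {
    "sales": {"sensitive_data": False, "authorized_groups": ["sales", "finance"]},
    "hr":    {"sensitive_data": True,  "authorized_groups": ["hr"]},
}

def route_prompt(domain, user_groups):
    """Route a prompt to the requested sub-xLLM only if the user is authorized."""
    config = SUB_XLLMS[domain]
    if not set(user_groups) & set(config["authorized_groups"]):
        raise PermissionError(f"User not authorized for the {domain} sub-xLLM")
    return f"sub-xLLM:{domain}"

print(route_prompt("sales", ["sales"]))
```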

Our guiding principles:

  • Access Control by Domain: Fine-grained access policies tailored to the domain’s data sensitivity.
  • Data Encryption: Both in transit and at rest; required across all domain data products.
  • Audit & Lineage: Transparent data movement and usage logs ensure compliance and traceability.
  • Data Privacy by Design: PII detection, masking, and consent enforcement baked into domain pipelines.
  • Zero Trust Model: Assume breach — verify every access request with strong authentication and authorization.

In addition, security levels and authorized users can be added to chunks post-crawling, in the same way that we add category tags as needed, so that only authorized users can see the content of specific chunks.
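
A minimal sketch of such chunk-level filtering, assuming each chunk carries a security level and an optional list of authorized users added post-crawling (field names are hypothetical):

```python
def visible_chunks(chunks, user, clearance):
    """Return only the chunks this user is allowed to see."""
    return [c for c in chunks
            if clearance >= c.get("security_level", 0)
            and (not c.get("authorized_users") or user in c["authorized_users"])]

chunks = [{"id": "hr_001", "security_level": 3, "authorized_users": ["alice"]},
          {"id": "sales_007", "security_level": 1, "authorized_users": []}]
print(visible_chunks(chunks, user="bob", clearance=2))  # only sales_007 is visible
```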

Additional references

We regularly post articles about new developments and technological updates on our website. To avoid missing these announcements, subscribe to our newsletter.


See also past PowerPoint presentation on the subject, here.


Vincent Granville

Vincent Granville is a pioneering GenAI scientist and co-founder at BondingAI.io, the LLM 2.0 platform for hallucination-free, secure, in-house, lightning-fast Enterprise AI at scale with zero weight and no GPU. He is also an author (Elsevier, Wiley), a publisher, and a successful entrepreneur with a multi-million-dollar exit. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. He completed a post-doc in computational statistics at the University of Cambridge.
