Preparing Your Data for AI – with Qlik Talend Cloud and TOON
Most companies face the same three questions when they want to use artificial intelligence in a meaningful way:
How do we access our data – from different systems?
How do we prepare that data for AI?
And how can we reduce token costs when using large language models (LLMs)?
In our latest use case, we walk through this exact process using a real Talend job – from data integration to efficient delivery for AI. The combination of Qlik Talend Cloud and the TOON format, developed by Leitart, offers a complete, streamlined solution.
Merging data from multiple sources – with Qlik Talend Cloud
In most organizations, critical data is scattered across different systems – like CSV files, CRM platforms such as Salesforce, or databases like Microsoft SQL Server.
With Qlik Talend Cloud, these sources can be easily connected and merged into a unified data flow.
In our example, we work with:
- a CSV file (users.csv)
- Salesforce as a CRM system
- a Microsoft SQL Server database
These sources are combined in a Talend job and passed through several data preparation steps.
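To make the flow concrete, here is a minimal Python sketch of the merge logic. The real job runs graphically in Qlik Talend Cloud; the records below and the user_id join key are hypothetical stand-ins for the three connectors:

```python
# Minimal sketch of the merge step in plain Python. The records and the
# "user_id" join key are illustrative assumptions, not the real schema.

csv_users = [            # rows read from users.csv
    {"user_id": 1, "name": "Alice"},
    {"user_id": 2, "name": "Bob"},
]
salesforce_accounts = [  # records queried from Salesforce
    {"user_id": 1, "segment": "Enterprise"},
]
sqlserver_orders = [     # rows selected from Microsoft SQL Server
    {"user_id": 2, "order_count": 7},
]

# Left-join everything onto the CSV base by user_id, like a tMap lookup.
merged = []
for user in csv_users:
    row = dict(user)
    for source in (salesforce_accounts, sqlserver_orders):
        for rec in source:
            if rec["user_id"] == user["user_id"]:
                row.update(rec)
    merged.append(row)

print(merged)
```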
Preparing data for AI
Before data can be processed by AI – for example, by large language models like GPT – it must be cleaned, structured, and prepared accordingly. In the Qlik Talend process, this is handled by components like:
- tMap – for merging data from multiple sources
- tUniqRow – for removing duplicates
- tSortRow – for sorting and structuring the dataset
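For readers who think in code, here is a rough Python equivalent of the tUniqRow and tSortRow steps. The email dedup key and name sort key are illustrative assumptions, not our actual schema:

```python
# Sketch of what tUniqRow and tSortRow do, expressed in plain Python.

rows = [
    {"name": "Bob",   "email": "bob@example.com"},
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob",   "email": "bob@example.com"},   # duplicate
]

# tUniqRow equivalent: keep only the first occurrence of each key.
seen, unique_rows = set(), []
for row in rows:
    if row["email"] not in seen:
        seen.add(row["email"])
        unique_rows.append(row)

# tSortRow equivalent: sort the deduplicated rows.
unique_rows.sort(key=lambda r: r["name"])
print(unique_rows)
```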
But this is exactly where many run into a hidden issue: the format.
⚠️ JSON is not optimized for AI
The widely used JSON format is human-readable and flexible – but when it comes to LLMs, it's highly inefficient.
Every record repeats the same field names over and over, which adds unnecessary token overhead.
And since most AI APIs (such as OpenAI's) bill by token count, that overhead makes processing both slower and more expensive.
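A small, hypothetical three-record extract shows the problem: the keys user_id, name, and email are repeated for every single record:

```json
[
  {"user_id": 1, "name": "Alice", "email": "alice@example.com"},
  {"user_id": 2, "name": "Bob",   "email": "bob@example.com"},
  {"user_id": 3, "name": "Carol", "email": "carol@example.com"}
]
```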
✅ The solution: TOON – Token-Oriented Object Notation
TOON is a compact, readable data format developed by Leitart, designed specifically for AI pipelines. It significantly reduces the number of tokens without sacrificing structure or clarity.
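Encoded along the lines of the public TOON specification, the same three records declare the field names once and then list plain rows. (This is a sketch of the idea; the exact output of our TOON Writer may differ in detail.)

```
users[3]{user_id,name,email}:
  1,Alice,alice@example.com
  2,Bob,bob@example.com
  3,Carol,carol@example.com
```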
A direct comparison makes the difference clear:
| Format | File Size | Tokens Required |
| --- | --- | --- |
| JSON | 7,000 bytes | 2,800 tokens |
| TOON | 1,800 bytes | 840 tokens |
That’s a reduction of about 70% in tokens (and roughly 74% in file size) – which translates directly into lower API costs and faster response times when working with LLMs.
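If you want to verify numbers like these yourself, a short sketch using OpenAI's open-source tiktoken tokenizer is enough. The file names are placeholders; cl100k_base is the encoding used by GPT-4-era models:

```python
# Sketch: count bytes and tokens for a JSON file and its TOON counterpart.
# The file names are hypothetical placeholders.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for path in ("users.json", "users.toon"):
    text = open(path, encoding="utf-8").read()
    print(f"{path}: {len(text.encode('utf-8'))} bytes, "
          f"{len(enc.encode(text))} tokens")
```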
Conclusion: Fewer tokens, more efficiency – thanks to TOON
The combination of Qlik Talend Cloud and the new TOON format opens up new possibilities for businesses – not just to analyze their data, but to make it truly AI-ready.
TOON is more than just another data format. It's a powerful efficiency booster in an era where AI is token-based.
👉 Want to integrate our TOON Writer into your own data pipeline?
Just let us know – we’re happy to share it with you, free of charge.
