Preparing Your Data for AI – with Qlik Talend Cloud and TOON
Most companies face the same three questions when they want to use artificial intelligence in a meaningful way:
How do we access our data – from different systems?
How do we prepare that data for AI?
And how can we reduce token costs when using large language models (LLMs)?
In our latest use case, we walk through this exact process using a real Talend job – from data integration to efficient delivery for AI. The combination of Qlik Talend Cloud and the TOON format, developed by Leitart, offers a complete, streamlined solution.
Merging data from multiple sources – with Qlik Talend Cloud
In most organizations, critical data is scattered across different systems – like CSV files, CRM platforms such as Salesforce, or databases like Microsoft SQL Server.
With Qlik Talend Cloud, these sources can be easily connected and merged into a unified data flow.
In our example, we work with:
- a CSV file (users.csv)
- Salesforce as a CRM system
- a Microsoft SQL Server database
These sources are combined in a Talend job and passed through several data preparation steps.
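To make the flow concrete, here is a minimal Python sketch of the merge logic. The real job runs graphically in Qlik Talend Cloud; the records below and the user_id join key are hypothetical stand-ins for the three connectors:

```python
# Minimal sketch of the merge step in plain Python. The records and the
# "user_id" join key are illustrative assumptions, not the real schema.

csv_users = [            # rows read from users.csv
    {"user_id": 1, "name": "Alice"},
    {"user_id": 2, "name": "Bob"},
]
salesforce_accounts = [  # records queried from Salesforce
    {"user_id": 1, "segment": "Enterprise"},
]
sqlserver_orders = [     # rows selected from Microsoft SQL Server
    {"user_id": 2, "order_count": 7},
]

# Left-join everything onto the CSV base by user_id, like a tMap lookup.
merged = []
for user in csv_users:
    row = dict(user)
    for source in (salesforce_accounts, sqlserver_orders):
        for rec in source:
            if rec["user_id"] == user["user_id"]:
                row.update(rec)
    merged.append(row)

print(merged)
```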
Preparing data for AI
Before data can be processed by AI – for example, by large language models like GPT – it must be cleaned, structured, and prepared accordingly. In the Qlik Talend process, this is handled by components like:
- tMap – for merging data from multiple sources
- tUniqRow – for removing duplicates
- tSortRow – for sorting and structuring the dataset
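For readers who think in code, here is a rough Python equivalent of the tUniqRow and tSortRow steps. The email dedup key and name sort key are illustrative assumptions, not our actual schema:

```python
# Sketch of what tUniqRow and tSortRow do, expressed in plain Python.

rows = [
    {"name": "Bob",   "email": "bob@example.com"},
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob",   "email": "bob@example.com"},   # duplicate
]

# tUniqRow equivalent: keep only the first occurrence of each key.
seen, unique_rows = set(), []
for row in rows:
    if row["email"] not in seen:
        seen.add(row["email"])
        unique_rows.append(row)

# tSortRow equivalent: sort the deduplicated rows.
unique_rows.sort(key=lambda r: r["name"])
print(unique_rows)
```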
But this is exactly where many run into a hidden issue: the format.
⚠️ JSON is not optimized for AI
The widely used JSON format is human-readable and flexible – but when it comes to LLMs, it's highly inefficient.
Every record repeats the same field names over and over, which adds unnecessary token overhead.
And since most AI APIs (such as OpenAI's) bill by token count, that overhead makes processing both slower and more expensive.
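A small, hypothetical three-record extract shows the problem: the keys user_id, name, and email are repeated for every single record:

```json
[
  {"user_id": 1, "name": "Alice", "email": "alice@example.com"},
  {"user_id": 2, "name": "Bob",   "email": "bob@example.com"},
  {"user_id": 3, "name": "Carol", "email": "carol@example.com"}
]
```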
✅ The solution: TOON – Token-Oriented Object Notation
TOON is a compact, readable data format developed by Leitart, designed specifically for AI pipelines. It significantly reduces the number of tokens without sacrificing structure or clarity.
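Encoded along the lines of the public TOON specification, the same three records declare the field names once and then list plain rows. (This is a sketch of the idea; the exact output of our TOON Writer may differ in detail.)

```
users[3]{user_id,name,email}:
  1,Alice,alice@example.com
  2,Bob,bob@example.com
  3,Carol,carol@example.com
```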
A direct comparison makes the difference clear:
| Format | File Size | Tokens Required |
| --- | --- | --- |
| JSON | 7,000 bytes | 2,800 tokens |
| TOON | 1,800 bytes | 840 tokens |
That’s a reduction of about 70% in tokens (and roughly 74% in file size) – which translates directly into lower API costs and faster response times when working with LLMs.
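If you want to verify numbers like these yourself, a short sketch using OpenAI's open-source tiktoken tokenizer is enough. The file names are placeholders; cl100k_base is the encoding used by GPT-4-era models:

```python
# Sketch: count bytes and tokens for a JSON file and its TOON counterpart.
# The file names are hypothetical placeholders.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for path in ("users.json", "users.toon"):
    text = open(path, encoding="utf-8").read()
    print(f"{path}: {len(text.encode('utf-8'))} bytes, "
          f"{len(enc.encode(text))} tokens")
```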
Conclusion: Fewer tokens, more efficiency – thanks to TOON
The combination of Qlik Talend Cloud and the new TOON format opens up new possibilities for businesses – not just to analyze their data, but to make it truly AI-ready.
TOON is more than just another data format. It's a powerful efficiency booster in an era where AI is token-based.
👉 Want to integrate our TOON Writer into your own data pipeline?
Just let us know – we’re happy to share it with you, free of charge.
