Automating XML Conversion at Scale with AI
How Zentrovia built an AI-powered XML conversion pipeline that has processed 67,000+ pages across 130 titles, reducing cost per title by 69% in five months.
- 67,000+ pages processed
- 130 titles delivered in 5 months
- 69% cost per title reduction
- 200+ target titles/month at scale
Overview
A major healthcare digital library needed to convert thousands of legacy medical textbooks and references into structured XML — at a pace and cost that traditional vendors couldn't match.
The R2 Digital Library is a leading medical and healthcare reference platform used by over 1,000 hospitals, universities, and healthcare institutions across the United States. Its catalog includes thousands of medical textbooks, clinical references, nursing guides, and allied health publications from dozens of publishers.
Each title needed to be converted from its source format (PDF, Word, InDesign, or publisher-specific XML) into the R2 platform's proprietary DocBook-based XML schema — a complex DTD with 69 target elements that must render without errors on the R2 platform.
The challenge: source files arrived in 290 to 347 publisher-specific input variations. Traditional conversion vendors were quoting $1 to $3 per page, so a single 500-page textbook could cost $500 to $1,500 to convert. At that rate, converting the full catalog was not economically viable.
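The per-title arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name is ours, and the 1,000-title catalog is a hypothetical round number standing in for "thousands of titles", not a figure from the project.

```python
def conversion_cost_range(pages, low_rate=1.0, high_rate=3.0):
    """Return the (low, high) vendor cost in dollars for `pages` pages,
    using the quoted $1 to $3 per-page range as defaults."""
    return pages * low_rate, pages * high_rate

# One 500-page textbook at the quoted vendor rates.
low, high = conversion_cost_range(500)
print(f"500-page title: ${low:,.0f} to ${high:,.0f}")

# Hypothetical catalog: 1,000 titles averaging 500 pages each (assumed numbers).
cat_low, cat_high = conversion_cost_range(1_000 * 500)
print(f"1,000 such titles: ${cat_low:,.0f} to ${cat_high:,.0f}")
```

Under those assumed catalog numbers the vendor bill lands between $500,000 and $1,500,000, which is the scale problem the rest of this case study addresses.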
The Challenge
Three problems that made traditional conversion impossible at scale.
Publisher Variation Complexity
Between 290 and 347 distinct publisher-specific input formats, each with its own XML schemas, CSS styles, and structural conventions. No two publishers format their content the same way.
Cost Prohibitive at Scale
Traditional XML conversion vendors charge $1 to $3 per page, so a 500-page medical textbook costs $500 to $1,500 to convert. Across thousands of titles, the total investment would exceed the entire project budget.
Quality Requirements
Every output XML must render with zero errors on the R2 platform. The 69 DocBook target elements must map correctly from hundreds of source variations — with no room for structural errors in medical content.
The Solution
A three-part AI-powered pipeline built specifically for this problem.
Custom DTD Conversion Pipeline
We built a custom conversion engine that maps 290–347 publisher-specific input variations to the R2 platform's 69 DocBook target elements. The engine handles the full complexity of medical publishing: cross-references, citations, figure captions, table structures, index entries, and multi-level heading hierarchies — producing zero-error XML output on the R2 platform.
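To make the idea of variation-to-target mapping concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not the production engine: the source tag names, the `PUBLISHER_A_MAP` table, and the `convert` helper are all hypothetical, and a real mapping would also have to handle attributes, nesting rules, and context-dependent elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping for one publisher: source tag -> DocBook target element.
# The actual 69-element mapping is not described in this case study.
PUBLISHER_A_MAP = {
    "chap": "chapter",
    "chap-title": "title",
    "p": "para",
    "fig": "figure",
}

def convert(element, tag_map, unmapped):
    """Recursively copy `element`, renaming tags via `tag_map`.

    Tags with no mapping are kept unchanged and recorded in `unmapped`
    so they can be reviewed rather than silently passed through.
    """
    new_tag = tag_map.get(element.tag)
    if new_tag is None:
        unmapped.add(element.tag)
        new_tag = element.tag
    out = ET.Element(new_tag, dict(element.attrib))
    out.text = element.text
    out.tail = element.tail
    for child in element:
        out.append(convert(child, tag_map, unmapped))
    return out

source = ET.fromstring(
    "<chap id='c1'><chap-title>Cardiology</chap-title>"
    "<p>Intro text.</p><sidebar>Note</sidebar></chap>"
)
unmapped = set()
result = convert(source, PUBLISHER_A_MAP, unmapped)
print(ET.tostring(result, encoding="unicode"))
print(sorted(unmapped))  # tags that still need a mapping decision
```

Recording unmapped tags instead of dropping them is a deliberate choice in this sketch: with hundreds of input variations, an unknown tag is more likely a new publisher convention than noise.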
BookLoader + Table of Contents Engine
Automated ingestion pipeline with table of contents generation, metadata tagging, and batch processing. This eliminated manual data entry and enabled high-volume processing: each title is automatically ingested, structured, and queued for conversion.
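Table of contents generation can be illustrated with a short sketch that walks a DocBook-style tree and collects section titles in document order. The `build_toc` function and the choice of `SECTION_TAGS` are assumptions for illustration; which elements the R2 schema actually treats as sections is not stated here.

```python
import xml.etree.ElementTree as ET

# Elements treated as TOC-worthy sections. These are standard DocBook names,
# but which ones the R2 schema uses is an assumption.
SECTION_TAGS = {"chapter", "sect1", "sect2"}

def build_toc(element, depth=0):
    """Return a flat list of (depth, id, title) entries in document order."""
    entries = []
    for child in element:
        if child.tag in SECTION_TAGS:
            title = child.findtext("title", default="(untitled)")
            entries.append((depth, child.get("id"), title.strip()))
            entries.extend(build_toc(child, depth + 1))
    return entries

book = ET.fromstring("""
<book>
  <chapter id="ch1"><title>Anatomy</title>
    <sect1 id="ch1s1"><title>Heart</title></sect1>
    <sect1 id="ch1s2"><title>Lungs</title></sect1>
  </chapter>
  <chapter id="ch2"><title>Physiology</title></chapter>
</book>
""")

toc = build_toc(book)
for depth, sec_id, title in toc:
    print("  " * depth + f"{title} [{sec_id}]")
```

A flat list of `(depth, id, title)` tuples is easy to batch, serialize, or render as nested markup later, which suits a high-volume ingestion step.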
Agentic AI Quality Assurance
The breakthrough: AI-powered agentic QA that runs parallel quality checks on every title — completing in minutes what previously took hours. The agent performs full structural XML review, cross-reference validation, DTD compliance checks, and self-corrects known error patterns — dramatically reducing QA overhead while maintaining the same quality standard.
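One of the checks named above, cross-reference validation with self-correction of a known error pattern, might look roughly like the following. This is a simplified, rule-based sketch rather than the agent itself: `check_xrefs` is a hypothetical helper, and the "stray leading #" pattern is an invented example of a mapped error.

```python
import xml.etree.ElementTree as ET

def check_xrefs(root):
    """Fix known-pattern xref errors in place; return (fixed, unresolved)."""
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    fixed, unresolved = [], []
    for xref in root.iter("xref"):
        target = xref.get("linkend")
        if target in ids:
            continue  # reference resolves, nothing to do
        # Hypothetical known error pattern: a stray leading '#' carried over
        # from an HTML-style href in the publisher source.
        candidate = target.lstrip("#") if target else None
        if candidate in ids:
            xref.set("linkend", candidate)
            fixed.append((target, candidate))
        else:
            unresolved.append(target)  # left for escalation
    return fixed, unresolved

doc = ET.fromstring(
    "<chapter id='ch1'><figure id='fig1'/>"
    "<para>See <xref linkend='fig1'/>, <xref linkend='#fig1'/>, "
    "and <xref linkend='fig9'/>.</para></chapter>"
)
fixed, unresolved = check_xrefs(doc)
print(fixed)       # [('#fig1', 'fig1')]
print(unresolved)  # ['fig9']
```

Only references that match a known pattern are rewritten; anything else is reported rather than guessed at, which mirrors the split between self-correction and human review described later in this case study.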
Results
Measurable impact at every stage.
- 69% cost per title reduction (vs. manual baseline)
- 80% faster turnaround (AI-powered vs. traditional vendors)
- 99.5% accuracy rate (zero XML errors on platform)
- 200+ target titles/month (scaling with AI automation)
Timeline
From first title to 200 titles/month in six months.
Dec 2025–Jan 2026 (delivered): 30 titles, ~15,000 pages
Pipeline development + first production batch. Proving the conversion engine handles publisher variations.
February 2026 (delivered): 40 titles, ~20,000 pages
Scaling production with BookLoader automation. Throughput increasing as publisher patterns are mapped.
March 2026 (delivered): 60 titles, ~30,000 pages
Full pipeline operating at capacity with manual QA processes.
April 2026 (in progress): 80 titles, ~40,000 pages
Agentic AI QA deployed. Significant cost and time reduction while maintaining quality.
May–June 2026 (scale plan): 160–200 titles, ~100,000 pages
Full agentic pipeline with AI handling the majority of QA. Scaling to 200+ titles/month.
Technical Architecture
What the AI pipeline automates.
Automated First
- Publisher XML variation mapping
- DTD compliance validation
- Common error pattern detection
- Metadata + TOC generation
AI Handles (Agentic)
- Full structural XML review
- Cross-reference validation
- Known publisher anomaly flags
- Self-correction on mapped errors
Stays Human
- New publisher onboarding
- Edge cases flagged by AI
- Final sign-off on complex titles
- Pipeline maintenance + updates
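The division of labor in the three lists above can be read as a routing rule: automated checks run first, the agent corrects what it can, and a title goes to a person only when something in the "Stays Human" list applies. The sketch below is one hypothetical way to express that; the `QAReport` fields, the `route` function, and the returned labels are illustrative names, not part of the actual pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class QAReport:
    """Hypothetical summary of automated and agentic QA for one title."""
    title_id: str
    new_publisher: bool = False   # first title from this publisher
    dtd_errors: int = 0           # errors remaining after self-correction
    auto_fixed: int = 0           # known-pattern errors the agent corrected
    flagged_edge_cases: list = field(default_factory=list)

def route(report, complex_title=False):
    """Decide whether a title needs a person or can be released automatically."""
    if report.new_publisher:
        return "human: publisher onboarding"
    if report.dtd_errors or report.flagged_edge_cases:
        return "human: review flagged issues"
    if complex_title:
        return "human: final sign-off"
    return "auto-release"

clean = QAReport("t-001", auto_fixed=3)
flagged = QAReport("t-002", flagged_edge_cases=["nested table in footnote"])
print(route(clean))                      # auto-release
print(route(flagged))                    # human: review flagged issues
print(route(clean, complex_title=True))  # human: final sign-off
```

Note that `auto_fixed` does not affect routing in this sketch: corrections to already-mapped error patterns are treated as resolved, and only unresolved or novel issues reach a reviewer.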
Need to convert content at scale?
Let's talk.
Book a free consultation and we'll analyze a sample of your content library — at no cost.