Why XML Matters for Modern Publishing
Structured XML is the foundation of modern publishing workflows. JATS (Journal Article Tag Suite) is the standard for scholarly journal articles, BITS is its counterpart for books, and the older NLM DTD survives in legacy archives, particularly in medical literature. Converting legacy content to structured XML unlocks multi-format output (HTML, ePub, and PDF from a single source), semantic search and discovery, PubMed Central submission compliance, accessibility (proper heading structure and reading order), and long-term preservation.
The cost of NOT converting is growing. Legacy PDFs can't feed into modern content platforms, AI systems, or multi-channel distribution pipelines. Every year of delay makes the backlog larger and more expensive to address.
XML Formats Explained: JATS vs NLM vs BITS vs DITA
JATS (Journal Article Tag Suite): The current standard for journal articles. Maintained by NISO. Three variants: Archiving (preservation), Publishing (publisher workflows), and Authoring (manuscript preparation). Used by PubMed Central, Crossref, and most major publishers.
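To make the shape of a JATS document concrete, here is a minimal sketch in Python that builds a bare-bones article tree with the standard library. The element names (article, front, article-meta, title-group, article-title, body, sec, title, p) are real JATS elements, but the tree is deliberately incomplete and is not expected to validate against any of the three tag sets. The function name and sample text are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_minimal_article(title: str, paragraphs: list[str]) -> ET.Element:
    """Build a bare-bones JATS-style <article> tree (illustrative, not schema-complete)."""
    article = ET.Element("article", {"article-type": "research-article"})

    # <front> carries the metadata; only the article title is filled in here.
    front = ET.SubElement(article, "front")
    article_meta = ET.SubElement(front, "article-meta")
    title_group = ET.SubElement(article_meta, "title-group")
    ET.SubElement(title_group, "article-title").text = title

    # <body> carries the content, organised into sections (<sec>).
    body = ET.SubElement(article, "body")
    sec = ET.SubElement(body, "sec")
    ET.SubElement(sec, "title").text = "Introduction"
    for text in paragraphs:
        ET.SubElement(sec, "p").text = text

    return article


article = build_minimal_article(
    "A Sample Title", ["First paragraph.", "Second paragraph."]
)
print(ET.tostring(article, encoding="unicode"))
```

A production file would also need journal metadata, contributors, identifiers, and a back section with references, all of which a real conversion has to populate.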
NLM DTD: The predecessor to JATS, still used by some legacy systems. It is being phased out in favor of JATS, but millions of existing documents still use this format.
BITS (Book Interchange Tag Suite): The book equivalent of JATS. Covers book-level metadata, chapters, parts, indexes, and cross-references.
DITA (Darwin Information Typing Architecture): Used primarily for technical documentation. Topic-based architecture with strong reuse capabilities. Common in software documentation and manufacturing.
The AI-Powered Conversion Pipeline
Traditional XML conversion relies heavily on manual tagging: human operators read each document and apply XML tags by hand. This is accurate but slow and expensive ($15-40 per page).
AI-powered conversion uses machine learning models trained on millions of already-tagged documents to automate the heavy lifting. Structure detection identifies headings, paragraphs, tables, figures, and references automatically. Entity recognition tags author names, institutions, citations, and identifiers. Quality validation checks the output against the DTD/schema and flags issues for human review.
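The quality validation step can be sketched as follows. This is a simplified stand-in, not a description of any vendor's pipeline: it checks only well-formedness and a few structural expectations using Python's standard library, whereas real validation against a JATS DTD or schema needs a validating parser. The rule list and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: paths a converted article is expected to contain.
REQUIRED_PATHS = [
    "front/article-meta/title-group/article-title",
    "body",
]


def review_flags(xml_text: str) -> list[str]:
    """Return human-readable issues; an empty list means nothing was flagged."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Output that is not well-formed cannot be checked any further.
        return [f"not well-formed: {exc}"]

    flags = []
    if root.tag != "article":
        flags.append(f"unexpected root element <{root.tag}>")

    for path in REQUIRED_PATHS:
        if root.find(path) is None:
            flags.append(f"missing element: {path}")

    # This sketch treats a section with no title text as worth a human look.
    for sec in root.iter("sec"):
        title = sec.find("title")
        if title is None or not "".join(title.itertext()).strip():
            flags.append("section without a usable <title>")

    return flags


sample = "<article><body><sec><p>Untitled text.</p></sec></body></article>"
for flag in review_flags(sample):
    print(flag)
```

Documents that return an empty list move on automatically; anything else is routed to a reviewer, which is the division of labor described above.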
The result: 80% faster conversion at 60-70% lower cost, with human reviewers focused on edge cases rather than routine tagging. Our pipeline at Zentrovia achieves 99.5% accuracy on structured journal content.
Cost Factors and Budgeting
XML conversion costs vary dramatically based on content complexity, source format quality, target schema, and volume.
Basic conversion (well-structured Word/PDF to JATS): $2-5 per page.
Standard conversion (mixed-quality sources, complex tables/math): $5-15 per page.
Complex conversion (legacy formats, poor-quality scans, heavy math): $15-40 per page.
Volume matters significantly. A batch of 10,000 pages gets much better unit economics than 100 pages. Most vendors offer tiered pricing with breakpoints at 1,000, 5,000, and 10,000+ pages.
AI-powered pipelines reduce costs by 40-60% compared to fully manual conversion, primarily by automating routine tagging and reducing the human hours per page.
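For a first-pass budget, the per-page ranges for the basic, standard, and complex tiers can be multiplied out across a backlog. The sketch below does only that arithmetic. It applies no volume discount or AI-related reduction, since those depend on the vendor, and the page counts in the example are hypothetical.

```python
# Per-page price ranges in USD for the three conversion tiers.
PRICE_RANGES = {
    "basic": (2, 5),
    "standard": (5, 15),
    "complex": (15, 40),
}


def estimate_budget(pages_by_tier: dict[str, int]) -> tuple[int, int]:
    """Return the (low, high) total cost in USD for a mix of pages per tier."""
    low = high = 0
    for tier, pages in pages_by_tier.items():
        if tier not in PRICE_RANGES:
            raise ValueError(f"unknown tier: {tier}")
        if pages < 0:
            raise ValueError(f"page count must be non-negative: {pages}")
        per_page_low, per_page_high = PRICE_RANGES[tier]
        low += pages * per_page_low
        high += pages * per_page_high
    return low, high


# Hypothetical backlog: 6,000 basic, 3,000 standard, and 1,000 complex pages.
low, high = estimate_budget({"basic": 6000, "standard": 3000, "complex": 1000})
print(f"Estimated range: ${low:,} to ${high:,}")
# prints "Estimated range: $42,000 to $115,000"
```

The width of that range is the point: classifying the backlog by tier before requesting quotes narrows the estimate far more than negotiating the per-page rate.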
Choosing a Conversion Partner
Look for: domain expertise in your content type (STM, humanities, legal), AI-powered automation capabilities, transparent quality metrics, sample conversion at no cost, flexible volume scaling, and integration with your existing workflow.
Red flags: no sample offered, no quality SLA, outsourced to undisclosed third parties, no automated QA pipeline, and inability to handle your specific XML schema.
Ask every vendor: What is your accuracy rate? How do you measure it? Can you convert 10 sample pages for free? What is your turnaround time at our expected volume? How do you handle complex elements (math, tables, figures)?