Parsing legacy documents such as HWP files and PDFs has long been a pain point for developers working with Korean public data. Kordoc is a specialized parser that converts these formats into structured Markdown and intermediate representation (IR) blocks. Built for AI agents like Claude Code, it extracts semantic content and tables, even from encrypted files, closing a significant gap in document processing workflows.

Decoding Complex Legacy Formats

Korean HWP files, widely used by government agencies, are hard to process because they use a proprietary OLE2-based binary format. Traditional parsers often fail to extract meaningful content from them. Kordoc provides a dedicated parser that turns these documents into a structured format suitable for AI analysis, addressing the 'document hell' many developers face.
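To make the format problem concrete, here is a minimal Python sketch of how a tool might tell the three container types apart before choosing a parser. It is not taken from Kordoc; the function names are hypothetical. The byte signatures themselves are the standard ones for OLE2 compound files, ZIP archives (HWPX is a ZIP-based package), and PDF.

```python
# Illustrative only: sniff_format and sniff_file are hypothetical helpers,
# not part of Kordoc. The magic bytes are the standard file signatures.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (binary HWP)
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (HWPX container)
PDF_MAGIC = b"%PDF-"                             # PDF header


def sniff_format(header: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        # Other OLE2 files (for example legacy .doc) share this signature,
        # so a real parser still has to inspect the streams inside.
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a file to classify it."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))
```

A signature match only identifies the container. Getting text out of a binary HWP file still means walking the OLE2 streams and decoding the records inside, which is the part a dedicated parser handles.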

Advanced Table Detection and More

One of Kordoc's standout features is table detection: it uses cluster-based algorithms to find tables in PDF files, including borderless ones. It also recognizes 'label-value' pairs in Korean forms, a common structure in government documents. This semantic extraction sets it apart from generic PDF libraries, which may return scattered or broken text.
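The general idea behind cluster-based detection of borderless tables can be sketched in a few lines of Python. This is not Kordoc's actual algorithm, which the article does not describe; it is a simple gap-based clustering of word coordinates. The Word type, the tolerance values, and the top-left coordinate origin are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word
    y: float  # vertical position; assumed to grow downward


def cluster_1d(values, tolerance):
    """Group sorted values, starting a new cluster whenever the gap to the
    previous value exceeds the tolerance. Returns the cluster centers."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tolerance:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [sum(c) / len(c) for c in clusters]


def nearest(centers, value):
    """Index of the center closest to value."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def words_to_grid(words, x_tol=10.0, y_tol=3.0):
    """Infer columns and rows from word positions and fill a 2-D grid."""
    cols = cluster_1d([w.x for w in words], x_tol)
    rows = cluster_1d([w.y for w in words], y_tol)
    grid = [["" for _ in cols] for _ in rows]
    for w in words:
        r, c = nearest(rows, w.y), nearest(cols, w.x)
        grid[r][c] = (grid[r][c] + " " + w.text).strip()
    return grid
```

Words whose left edges line up within the tolerance fall into the same column even when no ruling lines are drawn, which is why this family of techniques works on borderless tables. A production implementation would also need to handle merged cells, right-aligned numbers, and multi-line cells.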

Integration with AI Agents

Kordoc doesn’t just parse documents; it also exposes them to AI tools through its MCP (Model Context Protocol) server. This lets AI agents like Claude Code directly access and analyze HWP, HWPX, and PDF files, which helps them automate administrative tasks such as summarizing attendance tables into Markdown reports.
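As a rough illustration of the last step in that workflow, the Python sketch below renders already-extracted table rows as a Markdown table. It does not call Kordoc or its MCP server; to_markdown_table is a hypothetical helper, and the input is assumed to be a list of rows with the header first.

```python
def to_markdown_table(rows):
    """Render a list of rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def fmt(cells):
        # Pad short rows, and escape characters that would break the table.
        padded = list(cells) + [""] * (width - len(cells))
        escaped = [str(c).replace("|", "\\|").replace("\n", " ") for c in padded]
        return "| " + " | ".join(escaped) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(r) for r in rows[1:])
    return "\n".join(lines)
```

Passing [["Name", "Days"], ["Kim", "20"]] yields a three-line table with a header row, a separator row, and one data row, which an agent can drop straight into a report.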

Security Measures and Community Impact

Beyond parsing, Kordoc accounts for security with mechanisms like ZIP bomb protection and error message sanitization. The community's positive reaction points to its high utility, though it is a solo-developed project and long-term support remains a concern.
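For readers unfamiliar with these two safeguards, here is a minimal Python sketch of what they typically look like. It is not Kordoc's implementation: the limits, function names, and the path-matching pattern are assumptions chosen for illustration, and only the standard library is used.

```python
import io
import re
import zipfile

# Assumed limits for illustration; real values depend on the deployment.
MAX_ENTRIES = 1000
MAX_TOTAL_BYTES = 100 * 1024 * 1024
MAX_RATIO = 100


def check_zip_safety(data: bytes) -> None:
    """Reject archives whose declared contents look like a ZIP bomb."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        if len(infos) > MAX_ENTRIES:
            raise ValueError("archive has too many entries")
        total = sum(i.file_size for i in infos)
        if total > MAX_TOTAL_BYTES:
            raise ValueError("archive expands beyond the size limit")
        compressed = sum(i.compress_size for i in infos)
        if compressed > 0 and total / compressed > MAX_RATIO:
            raise ValueError("archive compression ratio is suspicious")
    # Declared sizes can be forged, so a stricter check would also count
    # bytes while streaming each entry and stop once a limit is exceeded.


# Matches absolute POSIX paths and Windows drive paths that start a token.
_PATH_RE = re.compile(r"(?<!\w)(?:[A-Za-z]:\\|/)[^\s'\"]+")


def sanitize_error(message: str) -> str:
    """Replace filesystem paths in an error message with a placeholder."""
    return _PATH_RE.sub("<path>", message)
```

The size check matters for HWPX in particular because it is a ZIP-based container, so a parser has to decompress untrusted input. Sanitizing error text keeps server paths out of messages that may be passed back to an AI agent or end user.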

Kordoc is a missing link for inclusive AI ecosystems, enabling effective document processing for Korean formats that have traditionally been overlooked. However, given its current status as a solo project, ensuring ongoing support and adaptation remains crucial.

Practical Takeaway

Here's what you can do with this today: integrate Kordoc with Claude Code to automate the analysis of HWP and PDF files. You can then run natural language prompts that extract and synthesize data, speeding up repetitive tasks involving legacy documents.