Show HN: Tile.run – Extract structured data from any document via API
Hey HN,
Today, we’re launching tile.run, an API that extracts structured data from unstructured documents (PDF, images, text) with support for custom schemas.
The Problem: Extracting data out of unstructured documents is surprisingly hard. We built tile.run while solving this for our product Kili (automation for invoicing/reconciliation). We found that getting to accuracy that is reliable enough for automation is challenging. Dense documents (e.g., lots of tables or line items) are even harder, and these are the most valuable to automate. After talking to other teams and developers, we found that many of them were after similar solutions.
Key Features:
- Multiple formats: PDF, JPEG, PNG, TIFF, plain text
- Custom schema support with nested objects/arrays
- Specialized in dense documents with tables
- Self-serve API - start extracting in minutes
Technical Details:
- REST API with simple JSON responses
- Robust error handling and validation
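To make the "custom schema with nested objects/arrays" and "validation" points concrete, here is a minimal sketch of what a nested schema and a client-side check of a JSON response could look like. Everything in it is an assumption for illustration: the JSON-Schema-like layout, the field names (`invoice_number`, `supplier`, `line_items`) and the `validate` helper are hypothetical and are not taken from the tile.run docs. It makes no network calls.

```python
from __future__ import annotations

from typing import Any

# Hypothetical invoice schema. The JSON-Schema-like layout and the field
# names are assumptions for illustration, not tile.run's documented format.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "supplier": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
            },
        },
    },
}

_PY_TYPES = {"string": str, "number": (int, float), "object": dict, "array": list}


def validate(value: Any, schema: dict, path: str = "$") -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    kind = schema["type"]
    # bool is a subclass of int in Python, so reject it explicitly for numbers.
    is_bool_as_number = kind == "number" and isinstance(value, bool)
    if not isinstance(value, _PY_TYPES[kind]) or is_bool_as_number:
        return [f"{path}: expected {kind}, got {type(value).__name__}"]
    errors: list[str] = []
    if kind == "object":
        # Every declared property is treated as required in this sketch.
        for key, sub_schema in schema.get("properties", {}).items():
            if key not in value:
                errors.append(f"{path}.{key}: missing")
            else:
                errors.extend(validate(value[key], sub_schema, f"{path}.{key}"))
    elif kind == "array":
        for i, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors


# A made-up extraction result with one deliberately wrong field type.
sample = {
    "invoice_number": "INV-001",
    "supplier": {"name": "Acme"},
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 9.5},
        {"description": "Gadget", "quantity": "3", "unit_price": 4.0},
    ],
}
problems = validate(sample, INVOICE_SCHEMA)
print(problems)  # -> ['$.line_items[1].quantity: expected number, got str']
```

Checking the response against the same schema you requested is one cheap way to catch type drift in dense line-item tables before the data reaches downstream automation.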
Coming Soon:
- Improved accuracy
- More file formats
- Self-hosting options
- Zero data retention mode
Links:
- Landing page: https://tile.run
- Documentation: https://tile.run/docs
I appreciate there have been a bunch of launches in this area recently, so I wanted to address that head-on as well:
- Clearly this problem is very valuable to solve but requires significant effort
- There are many ways to approach the same problem. For example, tile.run targets technical teams whereas other teams are solving this for business teams or specific functions (e.g. ETL).
We're excited to hear your feedback on the product.
> We found that getting to accuracy that is reliable enough for automation is challenging.
This is in the problem description of your pitch, and leads me to believe that tile.run has been solving this problem. Is that right?
> Coming Soon:
> - Improved accuracy
Can you expand more?
I have a large need for this sort of tooling, but accuracy is my primary concern.
Yes, we needed to solve the problem for our other product (https://kili.so). We spent a lot of time getting accuracy up for dense and multi-page invoices. Then we realised other teams have this need as well, so we decided to ship the API.
On the accuracy point, given our work so far we believe we are best in class in terms of accuracy for document extraction. We've also set up a system of internal evaluations that allows us to keep iterating and improving (hence us mentioning that we want to continue working on it).
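For anyone wondering what an extraction evaluation can look like in practice, here is a minimal sketch of one common approach: exact-match accuracy over leaf fields against a hand-labelled ground truth. This is an illustration only and not a description of tile.run's internal evaluation system; the `flatten` and `field_accuracy` helpers and the sample data are hypothetical.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into a {path: leaf_value} mapping."""
    out: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, item in value.items():
            out.update(flatten(item, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for i, item in enumerate(value):
            out.update(flatten(item, f"{prefix}[{i}]"))
    else:
        out[prefix] = value
    return out


def field_accuracy(predicted: Any, expected: Any) -> float:
    """Fraction of expected leaf fields that the prediction matches exactly."""
    expected_fields = flatten(expected)
    predicted_fields = flatten(predicted)
    if not expected_fields:
        return 1.0
    hits = sum(
        1
        for path, value in expected_fields.items()
        if path in predicted_fields and predicted_fields[path] == value
    )
    return hits / len(expected_fields)


# Made-up ground truth and prediction: one of three leaf fields is wrong.
expected = {"total": 100, "line_items": [{"qty": 2}, {"qty": 3}]}
predicted = {"total": 100, "line_items": [{"qty": 2}, {"qty": 4}]}
accuracy = field_accuracy(predicted, expected)
print(round(accuracy, 3))  # -> 0.667
```

Scoring per leaf field rather than per document makes errors in dense tables visible: one wrong quantity lowers the score without hiding the fields that were extracted correctly.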
Offtopic but I'm so confused, how and why are there so many players in this space? Who even are the customers?
Any company that works with invoices from a number of suppliers. We had to solve this problem at DigiBuild, where every supplier seemed to have a totally unique invoice format (and even the name of the same product could differ between two suppliers).
Not off topic at all!
I can only speak to our experience. Once you get under the hood, you find that this is a hard problem to solve.
There are also a lot of workflows that involve documents in every sector and every function. In other words, the opportunity is massive.
For our product, our customers are either internal engineering teams or folks building products that require document extraction but don’t want to invest time in it.
It's a fairly "natural" case for AI, and there are tons and tons of people who need to pull structured data out of PDFs for myriad reasons.