Data preprocessing for LLM applications.
## Unstructured MCP Server: Document Processing for AI

The **Unstructured MCP Server** integrates Unstructured's document processing capabilities into Google Antigravity. The platform extracts clean, structured data from PDFs, images, Office documents, and more, preparing them for RAG and LLM applications.

### Why Unstructured MCP?

Unstructured handles the document preprocessing step:

- **Universal Parsing**: Parse PDF, DOCX, PPTX, HTML, images, and other common formats through one interface
- **Clean Extraction**: Preserve structure and meaning
- **Table Handling**: Extract tables as distinct elements
- **OCR Built-in**: Process scanned documents
- **RAG Optimized**: Produce chunks ready for vector databases

### Key Features

#### 1. Document Processing

```python
from unstructured.partition.auto import partition

# Automatic format detection based on the file itself
elements = partition(filename="document.pdf")

# Each element carries a category (Title, NarrativeText, Table, ...) and its text
for element in elements:
    print(f"{element.category}: {element.text[:100]}")

# Supported formats: PDF, DOCX, PPTX, HTML, images, and more
```

#### 2. Chunking for RAG

```python
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.auto import partition

# Parse document
elements = partition(filename="report.pdf")

# Smart chunking for RAG: start a new chunk at each title,
# cap chunk size, and merge very short sections together
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200,
)

# Each chunk preserves document structure
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Metadata: {chunk.metadata}")
```

#### 3. API Access

```python
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

client = UnstructuredClient(api_key_auth="your-key")

# Process via API. The request shape below follows recent
# unstructured-client releases; older versions differ.
with open("document.pdf", "rb") as f:
    request = operations.PartitionRequest(
        partition_parameters=shared.PartitionParameters(
            files=shared.Files(content=f.read(), file_name="document.pdf"),
            strategy=shared.Strategy.HI_RES,  # Layout analysis plus OCR for complex documents
            languages=["eng"],
        ),
    )

response = client.general.partition(request=request)
elements = response.elements
```

### Configuration

```json
{
  "mcpServers": {
    "unstructured": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-unstructured"],
      "env": {
        "UNSTRUCTURED_API_KEY": "your-api-key",
        "UNSTRUCTURED_API_URL": "https://api.unstructured.io"
      }
    }
  }
}
```

### Use Cases

**RAG Pipelines**: Extract and chunk documents for retrieval-augmented generation systems.

**Document Digitization**: Convert scanned documents and images to searchable, structured text.

**Data Extraction**: Pull structured data from invoices, contracts, and forms.

The Unstructured MCP Server enables clean document processing for Antigravity RAG applications.
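
For the RAG pipeline use case, chunks usually need to be turned into plain records before they are embedded and written to a vector database. The following is a minimal sketch of that step, not part of Unstructured itself: `FakeChunk` and `chunks_to_records` are hypothetical names, and the stand-in class uses a plain dict for metadata so the example runs without the library installed. With real Unstructured chunks, the metadata object would likely need converting first (for example with `chunk.metadata.to_dict()`).

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a chunk: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunks_to_records(chunks, source):
    """Turn chunk-like objects into plain dicts ready for embedding and upsert."""
    records = []
    for index, chunk in enumerate(chunks):
        text = chunk.text.strip()
        if not text:
            continue  # Nothing to embed in an empty chunk
        # Deterministic ID: re-processing the same file yields the same IDs,
        # so an upsert overwrites old rows instead of duplicating them
        key = f"{source}:{index}:{text}".encode("utf-8")
        record_id = hashlib.sha256(key).hexdigest()[:16]
        records.append({
            "id": record_id,
            "text": text,
            "metadata": {**chunk.metadata, "source": source, "chunk_index": index},
        })
    return records


chunks = [
    FakeChunk("Quarterly revenue rose 12%.", {"page_number": 1}),
    FakeChunk("   "),  # Whitespace-only chunk, will be skipped
    FakeChunk("Operating costs were flat.", {"page_number": 2}),
]

records = chunks_to_records(chunks, source="report.pdf")
for record in records:
    print(record["id"], record["metadata"]["page_number"], record["text"])
```

Deriving the ID from the source name, position, and text is one possible design choice; it keeps re-ingestion idempotent at the cost of changing IDs whenever the chunking parameters change.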