Seventeen file processors, six categories, one priority system
Companion deep-dive for the NeuroLink blog, with architectural detail and code examples.
We designed NeuroLink’s file processing framework because inconsistent file handling was silently breaking our AI agent workflows. An agent could extract text from a .docx file uploaded by a user, but the same workflow would fail on an .rtf file containing the exact same information. A .zip archive created on macOS would fail to unpack where a Windows-generated one succeeded. These subtle inconsistencies meant that any AI agent relying on user-provided files, whether for RAG with a Claude model or for data analysis, was operating on a foundation of sand. We needed a unified, predictable system that could see a file path, identify its contents, and route it to a specialized processor that understood its format, from Microsoft Word documents to FFmpeg-compatible video streams.
The Core Lifecycle: BaseFileProcessor
Every file processor in NeuroLink inherits from a single abstract class: BaseFileProcessor. This class establishes a consistent, three-stage lifecycle for every file we handle, whether we’re processing a 10-line YAML file or a video that hits our size limits. The core logic lives in the processFile method, which orchestrates the entire flow.
This shared foundation ensures that every processor, regardless of the file format it handles, adheres to the same contract for downloading, validation, and processing. It’s an architecture that parallels how we think about provider integrations; just as every LLM provider has a common interface, every file format gets a common processing pattern. You can learn more about that philosophy in What You Actually Inherit When You Extend BaseProvider.
The lifecycle consists of three main steps managed by BaseFileProcessor:
Download: The downloadFile method retrieves the file from its source. It's wrapped in a downloadFileWithRetry helper that handles transient network errors, a critical feature for robustly handling large files from remote storage.
Validate: validateDownloadedFile runs a series of checks. It uses helpers like validateFileSize and isSupportedMimeType to ensure the file is something we can and should handle before committing more resources. This is our first line of defense against malformed or malicious inputs.
Build: The abstract buildProcessedResult method is where the specialized logic for each processor lives. This is the method that a subclass like WordProcessor or VideoProcessor must implement to perform its unique parsing and data extraction.
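The retry behavior in the Download step can be sketched roughly as follows. This is a minimal illustration, not the actual downloadFileWithRetry: the withRetry helper, its maxAttempts and baseDelayMs options, and the exponential backoff policy are all assumptions made for the example.

```typescript
// Hypothetical retry helper. The name, options, and backoff policy are
// assumptions for illustration, not NeuroLink's real implementation.
type RetryOptions = { maxAttempts: number; baseDelayMs: number };

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  operation: () => Promise<T>,
  { maxAttempts, baseDelayMs }: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Exponential backoff: baseDelayMs, then 2x, 4x, ...
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}
```

A real implementation would likely retry only on errors known to be transient (timeouts, connection resets) and fail fast on everything else.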
This common structure is what allows the system to be so extensible. Adding support for a new file type means creating a new class that extends BaseFileProcessor and implements a single method.
export abstract class BaseFileProcessor<T extends ProcessedFileBase> {
  // ... constructor and config ...

  async processFile(
    fileInfo: FileInfo,
    options?: FileProcessorOptions,
  ): Promise<Result<T, FileProcessingError>> {
    // 1. Download the file with retries
    const bufferResult = await this.downloadFileWithRetry(fileInfo, options);
    if (!bufferResult.isSuccess) {
      return err(bufferResult.error);
    }

    // 2. Validate the downloaded file
    const validationResult = await this.validateDownloadedFileWithResult(
      fileInfo,
      bufferResult.value,
    );
    if (!validationResult.isSuccess) {
      return err(validationResult.error);
    }

    // 3. Build the processed result using subclass-specific logic
    return this.buildProcessedResultWithResult(bufferResult.value, fileInfo, options);
  }

  protected abstract buildProcessedResult(
    buffer: Buffer,
    fileInfo: FileInfo,
    options?: FileProcessorOptions,
  ): Promise<Result<T, FileProcessingError>>;

  // ... other helper methods ...
}
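To make the "implement a single method" claim concrete, here is a self-contained sketch of what a new processor could look like. Everything in it is a simplified stand-in: SketchBaseProcessor skips the download step and reduces validation to a size check, and the Result shape, FileInfo fields, and LineCountProcessor are hypothetical names chosen for the example rather than NeuroLink's real types.

```typescript
// Simplified stand-ins for the real types; their shapes are assumptions.
type Result<T, E> =
  | { isSuccess: true; value: T }
  | { isSuccess: false; error: E };

function ok<T>(value: T): Result<T, never> {
  return { isSuccess: true, value };
}
function err<E>(error: E): Result<never, E> {
  return { isSuccess: false, error };
}

interface FileInfo { name: string; mimetype: string; bytes: Uint8Array }
interface ProcessedFileBase { filename: string }
interface FileProcessingError { message: string }

// Reduced base class: no download step, validation is only a size check.
abstract class SketchBaseProcessor<T extends ProcessedFileBase> {
  constructor(private readonly maxSizeBytes: number) {}

  async processFile(fileInfo: FileInfo): Promise<Result<T, FileProcessingError>> {
    if (fileInfo.bytes.byteLength > this.maxSizeBytes) {
      return err({ message: `File exceeds ${this.maxSizeBytes} bytes` });
    }
    return this.buildProcessedResult(fileInfo.bytes, fileInfo);
  }

  protected abstract buildProcessedResult(
    bytes: Uint8Array,
    fileInfo: FileInfo,
  ): Promise<Result<T, FileProcessingError>>;
}

interface LineCountResult extends ProcessedFileBase { lineCount: number }

// Hypothetical processor: the only method it supplies is buildProcessedResult.
class LineCountProcessor extends SketchBaseProcessor<LineCountResult> {
  protected async buildProcessedResult(
    bytes: Uint8Array,
    fileInfo: FileInfo,
  ): Promise<Result<LineCountResult, FileProcessingError>> {
    const text = new TextDecoder().decode(bytes);
    const lineCount = text.length === 0 ? 0 : text.split("\n").length;
    return ok({ filename: fileInfo.name, lineCount });
  }
}
```

The subclass never touches size limits or error plumbing; it only turns bytes into a typed result, which is the division of labor the lifecycle above describes.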
Document Processors: The Office Suite and Beyond
This is one of the most common categories of files users upload. Our goal is to extract clean text and metadata from a variety of proprietary and open document formats.
WordProcessor: Handles .docx files. It uses the mammoth library (loaded via loadMammoth) to convert Word documents into HTML and plain text, preserving headings and lists where possible. This is critical for maintaining document structure for RAG.
ExcelProcessor: Parses .xlsx files. It can extract data from specified sheets and rows, turning spreadsheets into structured data that an AI agent can analyze.
PptxProcessor: Extracts text content from every slide in a .pptx presentation file. This allows an agent to "read" a presentation deck.
RtfProcessor: Provides support for Rich Text Format (.rtf) files. The extractText method provides a best-effort conversion to plain text.
OpenDocumentProcessor: Handles the Open Document Format (ODF) files used by LibreOffice and other open-source suites, such as .odt for text and .ods for spreadsheets.
// Example check for the WordProcessor
import { DOCX_MIME_TYPE } from './constants';

export function isWordFile(mimetype: string, filename: string): boolean {
  return (
    mimetype === DOCX_MIME_TYPE ||
    filename.toLowerCase().endsWith('.docx')
  );
}
Data Processors: Structured Input
When the input isn’t a human-readable document but structured data for a system to read, these processors take over.
JsonProcessor: Parses .json files. It validates the JSON syntax and makes the data available as a JavaScript object.
YamlProcessor: Handles .yml and .yaml files, which are common for configuration and data serialization.
XmlProcessor: A processor for .xml files. Given the verbosity of XML, this processor focuses on extracting the core data content.
These processors are fundamental for any workflow involving configuration files, API responses, or data exports.
# A sample file for the YamlProcessor
user:
  id: 123
  name: "John Doe"
  roles:
    - admin
    - editor
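For the JSON case, "validate the syntax, then expose the data as an object" could be sketched as below. The parseJsonSafely name and the ParseResult shape are assumptions for illustration; the only real API used is the built-in JSON.parse, which throws on malformed input.

```typescript
// Hypothetical helper; the name and result shape are illustrative only.
type ParseResult =
  | { isSuccess: true; value: unknown }
  | { isSuccess: false; error: string };

function parseJsonSafely(text: string): ParseResult {
  try {
    // JSON.parse throws a SyntaxError on malformed input.
    return { isSuccess: true, value: JSON.parse(text) };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { isSuccess: false, error: `Invalid JSON: ${message}` };
  }
}
```

Returning a result object instead of letting the exception escape keeps a malformed upload from crashing the surrounding workflow; the caller decides what to do with the failure.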
Markup Processors: From HTML to Plain Text
This category handles files that are primarily text but include structural or presentational markup.
HtmlProcessor: Takes .html files and strips the markup to extract the raw text content, which is often the goal when feeding web content to an LLM.
SvgProcessor: While SVGs are images, they are also XML-based markup. The SvgProcessor can extract any text elements or metadata embedded within an .svg file.
MarkdownProcessor: Parses .md files, a common format for documentation and notes. It can provide either the raw Markdown source or a converted HTML representation.
TextProcessor: The most basic processor. It handles .txt files and serves as a fallback for any unrecognized text-based format. Its isFileSupported logic is broad.
// The TextProcessor is a general-purpose tool
export function isTextFile(mimetype: string, filename: string): boolean {
  if (mimetype.startsWith('text/')) {
    // Exclude specific text subtypes handled by other processors
    if (mimetype === 'text/csv' || mimetype === 'text/html' || mimetype === 'text/markdown') {
      return false;
    }
    return true;
  }
  // ... fallback logic for common text extensions
  return false;
}
Code Processors: Understanding Source and Configuration
We treat source code and configuration as first-class file types. This is essential for AI-powered developer tools like our internal code reviewer, Yama.
SourceCodeProcessor: This processor doesn't execute code. Its job is to identify the programming language using detectLanguageFromFilename and prepare it for analysis or display. It recognizes a wide array of file extensions from the internal LANGUAGE_MAP.
ConfigProcessor: This specialized processor handles common configuration files like .env or credentials.json. It has a critical security function: it uses isSensitiveKey and redactContent to automatically find and redact secrets like API keys or passwords before the content is passed to any other part of the system.
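Extension-based language detection of the kind described for SourceCodeProcessor can be sketched in a few lines. The function below borrows the names detectLanguageFromFilename and LANGUAGE_MAP for readability, but its body and the five map entries are assumptions; the real map's contents are not shown in this post.

```typescript
// Illustrative subset only; the real LANGUAGE_MAP is not shown in this post.
const LANGUAGE_MAP = new Map<string, string>([
  ["ts", "typescript"],
  ["js", "javascript"],
  ["py", "python"],
  ["rs", "rust"],
  ["go", "go"],
]);

function detectLanguageFromFilename(filename: string): string | null {
  const dot = filename.lastIndexOf(".");
  // No dot, a leading dot (dotfile), or a trailing dot means no usable extension.
  if (dot <= 0 || dot === filename.length - 1) return null;
  const extension = filename.slice(dot + 1).toLowerCase();
  return LANGUAGE_MAP.get(extension) ?? null;
}
```

A Map is used rather than a plain object so that an extension such as "constructor" cannot accidentally match an inherited property.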
// The ConfigProcessor redacts sensitive data before processing.
// A file like this:
// API_KEY=sk-1234567890abcdef
// DATABASE_URL=postgres://user:pass@host:5432/db
// Would be processed into something like this:
// API_KEY=[REDACTED]
// DATABASE_URL=postgres://user:[REDACTED]@host:5432/db
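The before-and-after above can be reproduced with a small sketch. The function names mirror isSensitiveKey and redactContent, but the key pattern, the line-based KEY=value parsing, and the URL password rule are assumptions written for this example, not the actual ConfigProcessor logic.

```typescript
// Assumed pattern for "sensitive" key names; the real list is not shown here.
const SENSITIVE_KEY_PATTERN = /(api[_-]?key|secret|token|password|passwd|credential)/i;

function isSensitiveKey(key: string): boolean {
  return SENSITIVE_KEY_PATTERN.test(key);
}

// Redacts KEY=value lines in .env-style content.
function redactContent(content: string): string {
  return content
    .split("\n")
    .map((line) => {
      const eq = line.indexOf("=");
      // Leave comments and lines without an assignment untouched.
      if (eq === -1 || line.trim().startsWith("#")) return line;

      const key = line.slice(0, eq).trim();
      const prefix = line.slice(0, eq + 1);
      const value = line.slice(eq + 1);

      // Sensitive key: drop the whole value.
      if (isSensitiveKey(key)) return `${prefix}[REDACTED]`;

      // Otherwise redact only a password embedded in a URL (scheme://user:password@host).
      return prefix + value.replace(/(:\/\/[^\s:@\/]+:)[^\s@\/]+(@)/, "$1[REDACTED]$2");
    })
    .join("\n");
}
```

Pattern-based redaction like this is best-effort: a secret stored under an innocuous key name would pass through, so it complements rather than replaces access controls on the files themselves.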
Media Processors: Video and Audio
Processing media files is by far the most computationally intensive task. These processors are designed to extract textual and summary information from binary media formats.
AudioProcessor: Handles audio files like .mp3 and .wav. Its primary function is to feed the audio data to a speech-to-text model to get a transcription.
VideoProcessor: This is one of our most complex processors. For video files, it uses FFmpeg (via loadFluentFfmpeg) to perform multiple operations. The extractKeyframes method generates representative images from the video, while extractSubtitles pulls out any embedded subtitle tracks. The probeVideo function reads metadata like duration and resolution. This allows an agent to understand the content of a video without having to "watch" the whole thing.
The complexity and resource usage of these processors mean they have stricter size limits and timeouts, which are validated early in the BaseFileProcessor lifecycle.
// The VideoProcessor can identify video files by MIME type or extension.
export function isVideoFile(mimetype: string, filename: string): boolean {
  if (mimetype.startsWith('video/')) {
    return true;
  }
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  return ['mp4', 'mov', 'avi', 'mkv', 'webm'].includes(ext);
}
Archive Processors: Unpacking the Containers
A single uploaded file can contain an entire project. The ArchiveProcessor is responsible for unpacking these containers.
ArchiveProcessor: This single processor handles multiple formats. It uses detectArchiveFormat, which checks magic bytes and file extensions, to identify .zip, .tar, .gz, and .tar.gz files. Once identified, it uses methods like extractZipEntries or extractTarEntries to list the contents. A key security feature is the hasPathTraversal check, which prevents "zip slip" vulnerabilities where a malicious archive tries to write files outside of its extraction directory. The processor can produce a text manifest of the archive's contents or use extractEntry to pull a specific file from it.
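A path traversal check of the kind hasPathTraversal performs can be sketched as follows. The function body is an assumption written for this post: it flags absolute paths and any entry whose ".." segments would climb above the extraction root.

```typescript
// Sketch of a "zip slip" guard; the logic is illustrative, not the real check.
function hasPathTraversal(entryPath: string): boolean {
  // Normalize Windows separators so "..\\evil" is treated like "../evil".
  const normalized = entryPath.replace(/\\/g, "/");

  // Absolute paths (POSIX root or a Windows drive letter) escape the target directory.
  if (normalized.startsWith("/") || /^[a-zA-Z]:/.test(normalized)) return true;

  // Track depth relative to the extraction root; going negative means escape.
  let depth = 0;
  for (const segment of normalized.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      depth--;
      if (depth < 0) return true;
    } else {
      depth++;
    }
  }
  return false;
}
```

This version tolerates a harmless entry such as "a/../b.txt". A stricter policy that rejects any entry containing ".." at all is also common and is simpler to reason about.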
// The ArchiveProcessor detects format from file extension as a fallback
private detectFormatFromExtension(filename: string): ArchiveFormat | null {
  const lowerFile = filename.toLowerCase();
  if (lowerFile.endsWith('.zip')) return 'zip';
  if (lowerFile.endsWith('.tar.gz') || lowerFile.endsWith('.tgz')) return 'targz';
  if (lowerFile.endsWith('.tar')) return 'tar';
  if (lowerFile.endsWith('.gz')) return 'gz';
  return null;
}
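The extension check is the fallback; the primary signal is the file's leading bytes. The sketch below shows what magic-byte detection could look like. The function names are hypothetical, while the signatures themselves are standard: "PK\x03\x04" for a ZIP local file header, 0x1f 0x8b for gzip, and "ustar" at byte offset 257 for POSIX tar.

```typescript
type ArchiveFormat = "zip" | "targz" | "tar" | "gz";

// True when `data` contains `signature` starting at `offset`.
function startsWithBytes(data: Uint8Array, signature: number[], offset = 0): boolean {
  if (data.length < offset + signature.length) return false;
  return signature.every((byte, i) => data[offset + i] === byte);
}

// Hypothetical helper; NeuroLink's detectArchiveFormat may differ.
function detectFormatFromMagicBytes(data: Uint8Array): ArchiveFormat | null {
  // ZIP local file header: "PK\x03\x04"
  if (startsWithBytes(data, [0x50, 0x4b, 0x03, 0x04])) return "zip";
  // gzip member header: 0x1f 0x8b
  if (startsWithBytes(data, [0x1f, 0x8b])) return "gz";
  // POSIX tar: the string "ustar" at byte offset 257
  if (startsWithBytes(data, [0x75, 0x73, 0x74, 0x61, 0x72], 257)) return "tar";
  return null;
}
```

Magic bytes alone cannot tell a .tar.gz from a plain .gz, since both begin with the gzip header; distinguishing them requires either the filename or decompressing the first block and looking for the tar signature inside. That is one reason to keep both detection paths.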
One Registry to Rule Them All
With seventeen different processors, the system needs a way to choose the right one for any given file. This is the job of the ProcessorRegistry. When a file is submitted, the central processFileWithRegistry function queries the registry to find a suitable handler.
The selection logic, exposed via getProcessorForFile, is not just a simple MIME type lookup. Some files can be ambiguous. For example, an .xml file could be generic data for the XmlProcessor or a vector graphic for the SvgProcessor.
To resolve this, we use a priority system. Each processor registers itself with a numeric priority. The ProcessorRegistry iterates through all processors and asks each one if it isFileSupported. It collects all “yes” votes and picks the one with the highest priority (lowest number). This ensures that the most specific processor always wins.
graph TD
subgraph NeuroLink File Processing
A[File Uploaded: report.docx] --> B{MIME Type: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'};
B --> C["ProcessorRegistry.findProcessor()"];
C --> D{Priority Check};
D -- Priority 10 --> E[TextProcessor? No];
D -- Priority 5 --> F[ArchiveProcessor? No];
D -- Priority 1 --> G["WordProcessor? Yes!"];
G --> H["WordProcessor.processFile()"];
H --> I[ProcessedResult: Text + Metadata];
end
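The selection rule (collect every processor that claims the file, then take the lowest priority number) can be sketched as below. SketchProcessorRegistry, the FileProcessorEntry shape, and the priority values are assumptions for illustration; only the getProcessorForFile and isFileSupported names come from the description above.

```typescript
interface FileProcessorEntry {
  name: string;
  priority: number; // lower number = higher priority
  isFileSupported(mimetype: string, filename: string): boolean;
}

// Hypothetical registry; the real ProcessorRegistry may be structured differently.
class SketchProcessorRegistry {
  private readonly processors: FileProcessorEntry[] = [];

  register(processor: FileProcessorEntry): void {
    this.processors.push(processor);
  }

  getProcessorForFile(mimetype: string, filename: string): FileProcessorEntry | null {
    // Collect every processor that claims the file...
    const candidates = this.processors.filter((p) => p.isFileSupported(mimetype, filename));
    if (candidates.length === 0) return null;
    // ...then pick the lowest priority number. Ties keep the earliest registration.
    return candidates.reduce((best, p) => (p.priority < best.priority ? p : best));
  }
}

const registry = new SketchProcessorRegistry();
registry.register({
  name: "XmlProcessor",
  priority: 5, // illustrative value
  isFileSupported: (mimetype, filename) =>
    mimetype.endsWith("xml") || filename.toLowerCase().endsWith(".xml"),
});
registry.register({
  name: "SvgProcessor",
  priority: 1, // illustrative value
  isFileSupported: (mimetype, filename) =>
    mimetype === "image/svg+xml" || filename.toLowerCase().endsWith(".svg"),
});

// Both processors claim an SVG (its MIME type ends in "xml"); the more specific one wins.
const chosen = registry.getProcessorForFile("image/svg+xml", "logo.svg");
console.log(chosen?.name); // "SvgProcessor"
```

Because each processor only answers "can I handle this?", adding a more specific handler later means registering it with a lower number, with no changes to the processors it overrides.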
The whole system follows one pattern: build small, specialized tools that each do one thing well, then compose them into a larger, robust whole. That pattern lets our platform handle the diverse data needs of modern AI applications, and it's rigorously tested as part of our commitment to quality. You can read more about that in our post, How We Test NeuroLink: 20 Continuous Test Suites and Counting. The structured output from these processors is a critical input for higher-level orchestration, such as Dynamic Model Selection: Routing AI Requests at Runtime.
