Hugging Face Integration: 100,000+ Open Models with NeuroLink

By the end of this guide, you’ll have access to 100,000+ open-source models from Hugging Face through NeuroLink, with intelligent tool-calling detection and the same unified API you use with every other provider.

You will set up the Hugging Face provider, understand which models support tool calling, use open models for code generation and conversational AI, and leverage the model recommendations API. NeuroLink automatically detects whether a model supports tools, so you never get errors from trying to use tool calling with an incompatible model.

How It Works

Under the hood, NeuroLink’s Hugging Face integration uses the same pattern as several other providers: it creates an OpenAI-compatible client via createOpenAI from @ai-sdk/openai, but with a custom base URL pointing to Hugging Face’s router.

The endpoint is https://router.huggingface.co/v1, which is Hugging Face’s unified inference router. This router implements the OpenAI-compatible API specification, meaning NeuroLink communicates with it using the same request and response format as OpenAI – chat completions, streaming, and tool calling all follow the same protocol.

Proxy support is included via createProxyFetch(), which is useful for corporate environments that require all outbound traffic to go through a proxy server.

This architecture means that any model available through the Hugging Face Inference API can be accessed through NeuroLink, provided it supports chat completion (which most instruction-tuned models do).
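
To make the wiring concrete, here is a minimal sketch of how such an OpenAI-compatible client configuration might look. The function and field names are illustrative, not NeuroLink's actual internals:

```typescript
// Hypothetical sketch: assembling an OpenAI-compatible client config for
// Hugging Face's router. Names are illustrative, not NeuroLink internals.
interface ClientConfig {
  baseURL: string;
  headers: Record<string, string>;
}

function makeHuggingFaceConfig(apiKey: string): ClientConfig {
  return {
    // Hugging Face's unified, OpenAI-compatible inference router
    baseURL: "https://router.huggingface.co/v1",
    headers: { Authorization: `Bearer ${apiKey}` },
  };
}
```

Any OpenAI-compatible client pointed at that base URL, with the token in the `Authorization` header, can then issue standard chat-completion requests.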

Supported Models and Tool Calling Detection

Default Model

The default model is meta-llama/Llama-3.1-8B-Instruct, a capable instruction-tuned model. You can override this via the HUGGINGFACE_MODEL environment variable.

Tip: For high-quality reasoning tasks, consider upgrading to meta-llama/Llama-3.1-70B-Instruct. For lightweight, fast function calling, try NousResearch/Hermes-3-Llama-3.2-3B.

Intelligent Tool-Calling Detection

NeuroLink’s supportsTools() method examines the model identifier against a curated list of model families known to support tool calling. This prevents the common pitfall of sending tool definitions to models that do not understand them.

Tool-Capable Models (tools enabled automatically):

| Model Pattern | Examples | Notes |
| --- | --- | --- |
| `llama-3.1-*-instruct` | Llama-3.1-8B, 70B, 405B-Instruct | Full tool calling support |
| `llama-3.1-nemotron-ultra` | nvidia/Llama-3.1-Nemotron-Ultra-253B-v1 | NVIDIA-optimized variant |
| `hermes-3-llama-3.2` | NousResearch/Hermes-3-Llama-3.2-3B | Excellent function calling |
| `hermes-2-pro` | Hermes 2 Pro series | Earlier function calling models |
| `codellama-*-instruct` | CodeLlama-34b, 13b-Instruct | Code-focused with tool support |
| `mistral-7b-instruct-v0.3` | mistralai/Mistral-7B-Instruct-v0.3 | Mistral open-weight model |
| `mixtral-8x7b-instruct` | Mixtral 8x7B | Mixture-of-experts with tools |
| `nous-hermes` | NousResearch series | Community function calling models |
| `openchat` | OpenChat models | Tool-capable chat models |
| `wizardcoder` | WizardCoder models | Code generation with tools |

Non-Tool Models (tools disabled automatically):

| Model Pattern | Reason |
| --- | --- |
| `microsoft/DialoGPT-*` | Conversational model, treats tools as text |
| `gpt2`, `bert`, `roberta` | Pre-2024 models without tool training |
| Most pre-2024 models | Lack structured function calling capability |

When tools are disabled for a model, NeuroLink gracefully degrades – your tool definitions are simply not sent to the model, preventing confusing error responses. The model will still answer your question, just without tool use.
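
As a rough sketch, this kind of detection boils down to matching the model identifier against known tool-capable families. The patterns below are illustrative; NeuroLink's actual curated list may differ:

```typescript
// Simplified sketch of pattern-based tool-support detection; NeuroLink's
// real curated list may differ from these illustrative patterns.
const TOOL_CAPABLE_PATTERNS: RegExp[] = [
  /llama-3\.1-.*-instruct/i,    // Llama 3.1 Instruct family
  /hermes-3-llama-3\.2/i,       // Hermes 3
  /codellama-.*-instruct/i,     // CodeLlama Instruct
  /mistral-7b-instruct-v0\.3/i, // Mistral 7B v0.3
];

function supportsTools(modelId: string): boolean {
  return TOOL_CAPABLE_PATTERNS.some((pattern) => pattern.test(modelId));
}
```

With these patterns, `supportsTools("meta-llama/Llama-3.1-8B-Instruct")` returns true, while `supportsTools("microsoft/DialoGPT-medium")` returns false.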

Quick Setup

Environment Variables

```bash
export HUGGINGFACE_API_KEY=hf_your_token_here

# Recommended: Set a capable default model
export HUGGINGFACE_MODEL=meta-llama/Llama-3.1-8B-Instruct
```

You can obtain a Hugging Face API token from huggingface.co/settings/tokens. Free tier tokens work, though they have rate limits. Pro and Enterprise tiers offer higher throughput.

Basic Streaming

```typescript
import { NeuroLink } from '@juspay/neurolink';

const neurolink = new NeuroLink();

const result = await neurolink.stream({
  input: { text: "Write a quicksort implementation in Python" },
  provider: "huggingface",
  model: "codellama/CodeLlama-34b-Instruct-hf",
});

for await (const chunk of result.stream) {
  if ("content" in chunk) process.stdout.write(chunk.content);
}
```

This streams a code generation request through CodeLlama 34B, one of the best open-source coding models available. Because NeuroLink detects that CodeLlama Instruct models support tool calling, you could also pass tools to this request if needed.

Switching Models Per-Request

One of the advantages of Hugging Face is the sheer variety of models available. You can switch models per-request without any configuration changes:

```typescript
// General conversation with Llama 3.1
const chatResult = await neurolink.stream({
  input: { text: "Explain quantum entanglement in simple terms" },
  provider: "huggingface",
  model: "meta-llama/Llama-3.1-70B-Instruct",
});

// Code generation with CodeLlama
const codeResult = await neurolink.stream({
  input: { text: "Write a REST API with Express.js" },
  provider: "huggingface",
  model: "codellama/CodeLlama-34b-Instruct-hf",
});

// Function calling with Hermes 3
const toolResult = await neurolink.stream({
  input: { text: "What time is it in London?" },
  provider: "huggingface",
  model: "NousResearch/Hermes-3-Llama-3.2-3B",
  tools: { /* ... */ },
});
```

Tool Calling with Open Models

NeuroLink enhances tool calling for Hugging Face models through several mechanisms:

Enhanced System Prompts

When tools are enabled, NeuroLink injects enhanced system prompt instructions via enhanceSystemPromptForTools(). This adds explicit guidance to the model about how to format tool calls, improving reliability with models that support tools but may not always use them optimally.

Tool Formatting

The formatToolsForHuggingFace() method passes tool definitions through for the OpenAI-compatible endpoint. Since Hugging Face’s router implements the OpenAI tool calling specification, standard Zod-based tool definitions work without modification.

Conditional Tool Enablement

The prepareStreamOptions() method checks supportsTools() before including tools in the request. For non-capable models, tools are disabled entirely – preventing confusing error responses and ensuring the model still generates useful text output.
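
The effect can be sketched like this, assuming a detection check along the lines described earlier. All names here are illustrative stand-ins for NeuroLink's internals:

```typescript
// Illustrative sketch: only include tools when the model is tool-capable.
interface StreamOptions {
  model: string;
  tools?: Record<string, unknown>;
}

function prepareStreamOptions(
  model: string,
  tools?: Record<string, unknown>
): StreamOptions {
  // Simplified stand-in for the real supportsTools() check
  const toolCapable = /instruct|hermes|codellama/i.test(model);
  // Drop tools silently for non-capable models; the request still succeeds
  return toolCapable && tools ? { model, tools } : { model };
}
```

A request against `gpt2` would simply go out without the `tools` field, while the same call against a Llama 3.1 Instruct model would include it.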

Complete Tool Calling Example

```typescript
import { z } from "zod";
import { tool } from "ai";
import { NeuroLink } from '@juspay/neurolink';

const neurolink = new NeuroLink();

const result = await neurolink.stream({
  input: { text: "What's the weather in Berlin?" },
  provider: "huggingface",
  model: "meta-llama/Llama-3.1-70B-Instruct",
  tools: {
    getWeather: tool({
      description: "Get the current weather for a city",
      parameters: z.object({
        city: z.string().describe("The city name"),
      }),
      execute: async ({ city }) => ({
        temperature: 18,
        conditions: "Partly cloudy",
        city,
      }),
    }),
  },
});

for await (const chunk of result.stream) {
  if ("content" in chunk) process.stdout.write(chunk.content);
}
```

Llama 3.1 Instruct models have strong native tool calling support, making them an excellent choice for function-calling workloads on open-source models.

Note: Tool calling quality varies by model. Llama 3.1 70B Instruct and Hermes 3 are the most reliable options. For critical tool-calling workflows, test thoroughly with your specific tools and schemas before deploying to production.

Model Recommendations API

NeuroLink provides a static method getToolCallingRecommendations() on the HuggingFaceProvider class that returns performance ratings for recommended models:

```typescript
const recs = HuggingFaceProvider.getToolCallingRecommendations();
console.log(recs.recommended);
// ["meta-llama/Llama-3.1-8B-Instruct", "meta-llama/Llama-3.1-70B-Instruct", ...]
```

Here are the detailed ratings (1-3 scale, 3 = best):

| Model | Speed | Quality | Cost | Recommended For |
| --- | --- | --- | --- | --- |
| meta-llama/Llama-3.1-8B-Instruct | 3 | 2 | 3 | Best overall balance |
| meta-llama/Llama-3.1-70B-Instruct | 2 | 3 | 2 | Highest quality, slower |
| nvidia/Llama-3.1-Nemotron-Ultra-253B-v1 | 2 | 3 | 1 | Maximum capability, resource-heavy |
| NousResearch/Hermes-3-Llama-3.2-3B | 3 | 2 | 3 | Lightweight, fast function calling |
| codellama/CodeLlama-34b-Instruct-hf | 2 | 3 | 2 | Best for code generation |

For most applications, Llama 3.1 8B Instruct is the recommended starting point. It offers the best balance of speed, quality, and cost. Scale up to the 70B variant when quality demands increase, or drop down to Hermes 3 (3B) when speed and cost are the top priorities.
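
If you want to automate that choice, the ratings above can be queried programmatically. This sketch hard-codes the table rather than calling the real recommendations API:

```typescript
// Illustrative: selecting a model from the ratings table by priority.
// The data below is copied from the table above, not fetched from NeuroLink.
type Rating = { model: string; speed: number; quality: number; cost: number };

const ratings: Rating[] = [
  { model: "meta-llama/Llama-3.1-8B-Instruct", speed: 3, quality: 2, cost: 3 },
  { model: "meta-llama/Llama-3.1-70B-Instruct", speed: 2, quality: 3, cost: 2 },
  { model: "NousResearch/Hermes-3-Llama-3.2-3B", speed: 3, quality: 2, cost: 3 },
];

function bestBy(key: "speed" | "quality" | "cost"): string {
  return ratings.reduce((a, b) => (b[key] > a[key] ? b : a)).model;
}
```

Here `bestBy("quality")` picks the 70B model, while `bestBy("speed")` keeps the 8B default.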

Error Handling

The Hugging Face provider implements enhanced error handling through handleProviderError(), including tool-calling-specific error guidance:

| Error Pattern | Classification | Guidance |
| --- | --- | --- |
| `API_TOKEN_INVALID` / Invalid token | Authentication error | Check `HUGGINGFACE_API_KEY` |
| rate limit | Rate limit error | Consider upgrading to Pro/Enterprise |
| model + not found | Model not found | Suggests tool-capable model alternatives |
| function / tool errors | Tool compatibility error | Suggests compatible models and schema checks |
| Other errors | Generic provider error | Includes full error message |

```typescript
try {
  const result = await neurolink.stream({
    input: { text: "test" },
    provider: "huggingface",
  });

  for await (const chunk of result.stream) {
    if ("content" in chunk) process.stdout.write(chunk.content);
  }
} catch (error) {
  if (error.message.includes("Invalid token")) {
    console.error("Check your HUGGINGFACE_API_KEY environment variable");
  } else if (error.message.includes("rate limit")) {
    console.error("Rate limited. Consider Hugging Face Pro for higher limits.");
  } else if (error.message.includes("tool")) {
    console.error("Tool calling error. Try a tool-capable model like Llama 3.1 Instruct.");
  } else {
    console.error("Hugging Face error:", error.message);
  }
}
```

Warning: Free-tier Hugging Face tokens have strict rate limits. If you are building a production application, upgrade to the Pro or Enterprise tier for reliable throughput.

Architecture

Here is the complete architecture of NeuroLink’s Hugging Face integration:

```mermaid
flowchart TB
    A[Your App] --> B[NeuroLink SDK]
    B --> C[HuggingFaceProvider]
    C --> D["createOpenAI(@ai-sdk/openai)<br/>baseURL: router.huggingface.co/v1"]
    D --> E[Hugging Face Router]

    subgraph "Tool Detection"
        F{supportsTools?}
        F -->|Llama 3.1, Hermes 3,<br/>CodeLlama| G[Tools Enabled]
        F -->|DialoGPT, GPT-2,<br/>BERT| H[Tools Disabled]
    end

    subgraph "Model Categories"
        I["Llama 3.1 - General"]
        J["CodeLlama - Code"]
        K["Hermes 3 - Function Calling"]
        L["Nemotron - Optimized"]
    end

    E --> I
    E --> J
    E --> K
    E --> L
```

The flow is: your app talks to NeuroLink, which delegates to HuggingFaceProvider. The provider creates an OpenAI-compatible client pointing at Hugging Face’s router endpoint. Before sending the request, it checks whether the selected model supports tool calling and adjusts the request accordingly. The response streams back through the same unified interface used by every NeuroLink provider.

Choosing the Right Open Model

With 100,000+ models available, choosing the right one can be overwhelming. Here is a practical guide:

By Use Case

| Use Case | Model | Why |
| --- | --- | --- |
| General chat | meta-llama/Llama-3.1-8B-Instruct | Fast, versatile, tool-capable |
| High-quality reasoning | meta-llama/Llama-3.1-70B-Instruct | Best open-source reasoning |
| Code generation | codellama/CodeLlama-34b-Instruct-hf | Purpose-built for code |
| Function calling | NousResearch/Hermes-3-Llama-3.2-3B | Lightweight, excellent tool use |
| Multilingual | mistralai/Mistral-7B-Instruct-v0.3 | Strong European language support |

Hugging Face vs Direct Provider

When should you use Hugging Face versus accessing a model’s provider directly?

Use Hugging Face when:

  • You want to experiment with many different open-source models
  • You need access to models not available through other providers (Hermes, CodeLlama, etc.)
  • You want free-tier access for prototyping
  • You are evaluating models before deploying them on your own infrastructure

Use a direct provider when:

  • You need the lowest latency (direct APIs skip the HF router)
  • You need guaranteed SLAs and support
  • You are in production with high throughput requirements
  • The model is available natively (e.g., use Mistral directly for Mistral models)

What’s Next

You now have Hugging Face working through NeuroLink’s unified interface. From here:

  • AWS SageMaker: Deploy your favorite Hugging Face models to your own AWS infrastructure for full control
  • Mistral AI Integration: Access Mistral models directly for lower latency in production
  • Provider Comparison Matrix: Compare Hugging Face against commercial providers for your use case

The open-source ecosystem on Hugging Face evolves rapidly. NeuroLink gives you a stable, type-safe bridge to that innovation – experiment freely, then deploy the best model through whichever provider fits your production needs.


This post is licensed under CC BY 4.0 by the author.