Ollama & Qwen3-8b: Blog Extraction Failure - Need Help!
Experiencing failures while running blog extraction tasks using the Ollama client with the Qwen3-8b model can be frustrating. This article dives into a specific case where this setup struggled to accurately extract information from a simple blog page. We'll analyze the debugging logs, pinpoint potential issues, and explore alternative open-source models that might offer improved performance. If you're encountering similar problems, you're in the right place. Let’s troubleshoot this together and find a reliable solution for your blog extraction needs.
The Challenge: Extracting News from a Blog Page
The core task involves extracting the titles and publication dates of press releases from a sample blog page (https://dummy-press-releases.surge.sh/news). The user employed a setup using:
- Stagehand: A tool for browser automation and content extraction.
- Ollama: A platform for running large language models (LLMs) locally.
- Qwen3-8b: An open-source LLM from Alibaba.
The code snippet provided demonstrates the implementation using these technologies. It initializes Stagehand with the Ollama client running Qwen3-8b, navigates to the target webpage, attempts to identify news listings, scrolls to the bottom of the page, and then uses Stagehand's extract function to retrieve the desired information. The goal is to obtain a structured JSON output containing an array of press release items, each with a title and publication date.
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";
import type { Page } from "playwright"; // Stagehand drives Playwright pages
import { AISdkClient } from "./external_clients/aisdk";
import { createOllama } from "ollama-ai-provider-v2";

// Point the provider at the local Ollama server's API endpoint
const ollama = createOllama({ baseURL: "http://localhost:11434/api" });

const stagehand = new Stagehand({
  env: "LOCAL",
  verbose: 2,
  cacheDir: "act-cache", // Specify a cache directory
  domSettleTimeout: 15000, // Longer wait for DOM stability
  llmClient: new AISdkClient({
    model: ollama("qwen3:8b"),
  }),
});

await stagehand.init();
const pages = stagehand.context.pages();
await news(stagehand, pages[0]); // note: the original call omitted the page argument

export async function news(stagehand: Stagehand, page: Page) {
  await page.goto("https://dummy-press-releases.surge.sh/news", {
    waitUntil: "domcontentloaded",
  });

  const elements = await stagehand.observe("find all news listings");
  console.log("Found elements:", elements.map((e) => e.description));

  // Give lazy-loaded content a chance to render before scrolling
  await new Promise((resolve) => setTimeout(resolve, 5000));
  await stagehand.act("scroll to the bottom of the page", { page: page });

  const rawResult = await stagehand.extract(
    "extract the title and corresponding publish date of EACH AND EVERY press releases on this page. DO NOT MISS ANY PRESS RELEASES.",
    z.object({
      items: z.array(
        z.object({
          title: z.string().describe("The title of the press release"),
          publish_date: z
            .string()
            .describe("The date the press release was published"),
        }),
      ),
    }),
  );
  console.log(rawResult);
}
Decoding the Debugging Logs: Where Did It Go Wrong?
The debugging logs paint a clear picture of the issues encountered during the extraction process. Let's break down the key areas of concern:
1. Inaccurate Element Observation:
The stagehand.observe function, tasked with finding news listings, returned unexpected elements. Instead of identifying the individual press releases, it picked up items like "Brad Lander for Comptroller," "Sign up for campaign updates," and other page elements not directly related to the news articles. This initial misstep indicates a problem with how the observation instruction is being interpreted by the model or how Stagehand is identifying elements on the page.
The element observation stage is crucial for guiding the extraction process. If the initial selection of elements is incorrect, the subsequent steps will likely fail. This suggests the LLM might need more specific instructions or a different approach to element identification, such as using CSS selectors or more detailed XPath queries.
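One way to take the guesswork out of this stage is a deterministic pre-pass over the raw page HTML, so you know how many listings the model *should* be finding before trusting its observations. The snippet below is a minimal sketch: the `press-release` container class and `<h3>` title markup are assumptions for illustration, not the real page's structure, which you would need to inspect first.

```typescript
// Minimal sketch: scan raw HTML for listing containers before trusting the LLM.
// The "press-release" class and <h3> title markup are hypothetical.
function countListings(html: string): { count: number; titles: string[] } {
  const blockRe = /<div class="press-release">([\s\S]*?)<\/div>/g;
  const titleRe = /<h3>(.*?)<\/h3>/;
  const titles: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = blockRe.exec(html)) !== null) {
    const title = titleRe.exec(match[1]);
    if (title) titles.push(title[1].trim());
  }
  return { count: titles.length, titles };
}

// Example with stand-in markup:
const sample = `
  <div class="press-release"><h3>Release One</h3><span>Jan 2, 2024</span></div>
  <div class="press-release"><h3>Release Two</h3><span>Feb 9, 2024</span></div>`;
console.log(countListings(sample)); // → count 2, titles ["Release One", "Release Two"]
```

If this deterministic count disagrees with what `stagehand.observe` reports, the problem is in element identification rather than extraction.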
2. Action Failures: Method GET Not Supported
The logs show attempts to perform actions on the element with the XPath /html[1]/body[1]/div[1]/div[3]/div[1]/main[1]/div[2]/div[1]/div[1]/div[1]/div[3]/div[1]/div[1]. These actions resulted in UnderstudyCommandException: Method GET not supported errors. This indicates that Stagehand is attempting to use a GET method on an element that doesn't support it, likely a form or a section of the page that isn't designed to handle GET requests. The issue here could stem from incorrect element selection or a misunderstanding of the page structure.
The repeated attempts and failures highlight a disconnect between the intended action and the element being targeted. This could be due to a misinterpretation of the page's HTML structure by the LLM or a bug in how Stagehand handles actions on certain element types.
3. Poor Extraction Results: Missing and Inaccurate Data
The final extraction result is the most telling. Instead of extracting all press releases (more than 20 on the page), it identified only five items. Of these, only one was a genuine press release title ("An Unassuming Liberal Makes a Rapid Ascent to Power Broker"), while the others were either fake titles or pointed to forms or other non-news content. This demonstrates a significant failure in the core extraction task. The model struggles to accurately identify and extract the desired information, indicating limitations in its understanding of the task or the page content.
- The presence of fake titles suggests the model might be hallucinating or filling in gaps in its understanding.
- The inclusion of form-related titles points to a misinterpretation of the page structure and the role of different elements.
- The low number of extracted items indicates the model is missing a significant portion of the target data.
Why Did Qwen3-8b Struggle? Potential Limitations of Open-Source Models
The observed failures raise questions about the suitability of Qwen3-8b for complex extraction tasks, especially in a zero-shot setting (without specific training data for this task). Several factors might contribute to its difficulties:
- Model Size and Capabilities: While Qwen3-8b is a respectable open-source LLM, its 8 billion parameters might not be sufficient to handle the nuances of web page structure and information extraction, compared to larger, proprietary models. Smaller models often struggle with complex reasoning and understanding intricate instructions.
- Training Data and Fine-tuning: The model's performance is heavily influenced by the data it was trained on. If the training data lacked sufficient examples of similar extraction tasks or had biases that hindered its ability to generalize, the model might underperform. Fine-tuning on a specific dataset of blog extraction examples could potentially improve results.
- Instruction Following: Accurately interpreting and executing complex instructions is a key capability for LLMs. Qwen3-8b might have limitations in its ability to translate the extraction instruction ("extract the title and corresponding publish date of EACH AND EVERY press releases on this page. DO NOT MISS ANY PRESS RELEASES.") into a precise extraction strategy. Clearer, more structured prompts, or a few-shot learning approach (providing examples in the prompt), might help improve instruction following.
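A low-cost way to test the instruction-following hypothesis is to replace the bare instruction with a few-shot prompt. The helper below is an illustrative sketch; the example titles and dates are placeholders, not real data from the page:

```typescript
// Build a few-shot extraction prompt: the examples show the model the exact
// output shape expected. Example items below are placeholders for illustration.
interface PressItem {
  title: string;
  publish_date: string;
}

function buildFewShotPrompt(examples: PressItem[]): string {
  const shots = examples.map((e) => JSON.stringify(e)).join("\n");
  return [
    "Extract the title and publish date of every press release on this page.",
    "Return a JSON array. Each element must look exactly like these examples:",
    shots,
    "Do not include navigation links, forms, or sign-up banners.",
  ].join("\n");
}

const prompt = buildFewShotPrompt([
  { title: "Example Press Release Title", publish_date: "2024-01-02" },
  { title: "Another Example Title", publish_date: "2024-02-09" },
]);
console.log(prompt);
```

Concrete examples of the target shape, plus an explicit negative instruction about navigation and forms, give a small model far less room to misread the task.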
Exploring Alternative Open-Source Models for Blog Extraction
Given the challenges encountered with Qwen3-8b, it's crucial to explore other open-source LLMs that might be better suited for blog extraction tasks. Here are a few promising alternatives:
- Llama 2: Developed by Meta, Llama 2 comes in various sizes (7B, 13B, and 70B parameters). The larger models, particularly the 70B version, have demonstrated impressive performance across a range of NLP tasks and might offer improved extraction capabilities. Llama 2's strong performance and open-source nature make it a compelling alternative.
- Mistral 7B: Mistral 7B is another open-source LLM known for its efficiency and strong performance. It has been shown to outperform Llama 2 13B on many benchmarks, making it a potential candidate for resource-constrained environments. Its compact size and competitive performance make it worth considering for extraction tasks.
- Zephyr 7B: Zephyr 7B is a fine-tuned version of Mistral 7B, specifically optimized for instruction following. This makes it particularly promising for extraction tasks where clear and precise instructions are crucial. Zephyr 7B's focus on instruction following could lead to better extraction accuracy.
When evaluating these alternatives, consider factors like model size, computational resources required, and the availability of fine-tuning datasets. Experimenting with different models and prompting strategies is essential to find the optimal solution for your specific blog extraction needs.
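When comparing runs across models, a simple coverage metric makes the evaluation concrete: of the press releases you have verified by hand, what fraction did each model recover? A minimal sketch (model names and titles below are hypothetical stand-ins):

```typescript
// Tiny evaluation helper for comparing model runs: what fraction of the
// hand-verified press releases did each model recover? All data is illustrative.
interface RunResult {
  model: string;
  extractedTitles: string[];
}

function coverage(run: RunResult, expectedTitles: string[]): number {
  const found = expectedTitles.filter((t) =>
    run.extractedTitles.some((e) => e.toLowerCase() === t.toLowerCase()),
  ).length;
  return found / expectedTitles.length;
}

// Hypothetical comparison:
const expected = ["Release A", "Release B", "Release C", "Release D"];
const runs: RunResult[] = [
  { model: "qwen3:8b", extractedTitles: ["Release A"] },
  { model: "mistral:7b", extractedTitles: ["Release A", "Release B", "Release C"] },
];
for (const r of runs) {
  console.log(r.model, coverage(r, expected)); // 0.25 and 0.75 respectively
}
```

Running each candidate model through the same extraction and scoring the results this way turns "which model is better?" into a measurable comparison.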
Optimizing the Approach: Prompt Engineering and Beyond
Beyond switching models, several techniques can be employed to improve the accuracy and reliability of blog extraction:
- Prompt Engineering: Crafting clear, concise, and unambiguous prompts is crucial for guiding the LLM. Break down the extraction task into smaller, more manageable steps. For instance, instead of a single instruction, use a series of instructions:
  - "Identify all elements on the page that appear to be press release listings."
  - "For each listing, extract the title."
  - "For each listing, extract the publication date."
  Using more structured prompts, potentially including examples (few-shot learning), can significantly improve results.
- Targeted Element Selection: Instead of relying solely on the LLM to identify elements, use CSS selectors or XPath queries to target specific HTML elements known to contain the desired information. This provides more control and reduces the ambiguity for the model. Precise element selection minimizes the chances of misinterpreting the page structure.
- Data Validation and Post-processing: Implement data validation steps to filter out incorrect or incomplete extractions. For instance, check if the extracted publication date is in a valid format or if the title contains keywords associated with press releases. Post-processing can help clean up the extracted data and ensure accuracy.
- Fine-tuning: If consistently extracting from similar blog structures, consider fine-tuning an open-source model on a dataset of correctly extracted press releases. This can significantly improve performance for the specific task. Fine-tuning tailors the model to the nuances of the target data.
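The validation and post-processing step can be sketched as a pure filter over the extracted items. This sketch assumes dates should parse with `Date.parse` and uses a blocklist built from the junk titles seen in the debugging logs ("Sign up for campaign updates", "Brad Lander for Comptroller"):

```typescript
// Post-process extracted items: drop entries with unparseable dates or titles
// that match known non-news page elements (blocklist taken from the failure logs).
interface PressItem {
  title: string;
  publish_date: string;
}

const BLOCKLIST = ["sign up", "for comptroller", "subscribe", "contact"];

function cleanItems(items: PressItem[]): PressItem[] {
  return items.filter((item) => {
    const validDate = !Number.isNaN(Date.parse(item.publish_date));
    const lower = item.title.toLowerCase();
    const blocked = BLOCKLIST.some((phrase) => lower.includes(phrase));
    return validDate && !blocked;
  });
}

const raw: PressItem[] = [
  { title: "An Unassuming Liberal Makes a Rapid Ascent to Power Broker", publish_date: "2021-05-12" },
  { title: "Sign up for campaign updates", publish_date: "2021-05-12" },
  { title: "Some hallucinated item", publish_date: "not a date" },
];
console.log(cleanItems(raw)); // keeps only the first item
```

A filter like this will not recover the press releases the model missed, but it does stop hallucinated or form-related entries from polluting the final output.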
Conclusion: Finding the Right Open-Source Solution for Blog Extraction
While the initial attempt to extract blog data using Ollama, Qwen3-8b, and Stagehand encountered significant challenges, this doesn't mean open-source models are inherently unsuitable for the task. By understanding the limitations of the model, analyzing the debugging logs, and exploring alternative models and techniques, it's possible to achieve accurate and reliable blog extraction. Experimenting with different models like Llama 2, Mistral 7B, and Zephyr 7B, along with prompt engineering and targeted element selection, will pave the way for a successful open-source solution. Remember, the key is to iterate, evaluate, and refine your approach until you achieve the desired results.
For more information on large language models and their applications, consider exploring resources like the Hugging Face Hub, a central platform for discovering and sharing models and datasets.