Fixing Kino Krakow: Race Condition Prevents Full Scraping

by Alex Johnson

Have you ever encountered a pesky bug that prevents a system from working as intended? In the world of web scraping, these issues can be particularly frustrating. This article dives deep into a recent challenge faced with the Kino Krakow scraper, a system designed to collect movie showtimes for Krakow, Poland. We'll explore how a race condition was identified and resolved, ensuring users get the complete picture of movie listings. Understanding these challenges and solutions can be invaluable for anyone involved in web scraping or software development.

Executive Summary: The Kino Krakow Scraper's 7-Day Challenge

The Kino Krakow scraper, while boasting a well-designed distributed job-based architecture, was plagued by a critical race condition. This issue prevented it from scraping the showtimes for all 7 days of the week. Although the infrastructure was in place for complete data collection, the jobs were being scheduled too closely together, leading to the overwriting of session states. This resulted in incomplete data, impacting users who relied on the scraper for comprehensive movie schedules. Identifying and resolving such issues is crucial for maintaining the reliability and effectiveness of web scrapers.

Status: 🔴 Critical Bug - Only getting 1 day of data instead of 7 days

Impact: Missing 6/7ths of movie showtimes for Krakow users

Fix Complexity: ✅ Simple one-line change


Architecture Overview: How the Scraper Works

To understand the problem, let's first break down the architecture of the Kino Krakow scraper. The scraper operates using a hierarchical job structure, where a central coordinator (SyncJob) initiates and manages several DayPageJob instances. Each DayPageJob is responsible for scraping showtimes for a specific day of the week. These jobs then schedule further tasks, such as fetching movie details (MovieDetailJob) and processing showtime information (ShowtimeProcessJob).

The job hierarchy can be visualized as follows:

SyncJob (Coordinator)
    ├─> DayPageJob (Day 0)
    ├─> DayPageJob (Day 1)
    ├─> DayPageJob (Day 2)
    ├─> DayPageJob (Day 3)
    ├─> DayPageJob (Day 4)
    ├─> DayPageJob (Day 5)
    └─> DayPageJob (Day 6)
            ├─> MovieDetailJob (unique movies)
            │       └─> TMDB API calls
            └─> ShowtimeProcessJob (all showtimes)
                    └─> EventProcessor → Database

This structure ensures a systematic approach to scraping, processing, and storing data. Understanding this job hierarchy is key to grasping how the race condition occurred and how it was resolved. The core of the issue lies in the interaction between the DayPageJob instances and their reliance on shared session state.

Key Components: The Building Blocks of the Scraper

The Kino Krakow scraper comprises several key components, each playing a specific role in the data collection process. These components interact with each other to ensure that movie showtimes are accurately scraped, processed, and stored. Understanding the purpose and function of each component is crucial for maintaining and optimizing the scraper's performance.

Component            Queue             Count        Purpose
SyncJob              :discovery        1            Coordinator: establishes session, schedules day jobs
DayPageJob           :scraper_index    7            Scrapes one day's showtimes, schedules movie/showtime jobs
MovieDetailJob       :scraper_detail   N (unique)   Fetches movie details, matches to TMDB
ShowtimeProcessJob   :scraper          M (all)      Processes individual showtimes into events

Where:

  • N = Unique movies across all 7 days (deduplicated)
  • M = Total showtimes across all 7 days

The Complete Data Flow: From Start to Finish

To truly appreciate the complexity of the scraper and the subtlety of the bug, let's trace the complete data flow, phase by phase. This will illustrate how each component interacts with the others and where the race condition manifested itself.

Phase 1: Session Establishment (SyncJob)

The initial phase involves the SyncJob establishing a session with the Kino Krakow website. This is crucial for authenticating subsequent requests and maintaining consistency throughout the scraping process. Without a properly established session, the scraper would be unable to access the necessary data.

File: lib/eventasaurus_discovery/sources/kino_krakow/jobs/sync_job.ex

1. HTTP GET → https://www.kino.krakow.pl/cinema_program/by_movie
   └─> Extract Set-Cookie headers
   └─> Extract CSRF token from <meta name="csrf-token">

2. Schedule 7 DayPageJobs (days 0-6)
   └─> Pass: cookies, csrf_token, source_id, day_offset
   └─> Stagger: delay_seconds = day_offset * 2 seconds ⚠️ TOO SHORT!

HTTP Requests: 1 GET

The SyncJob performs an HTTP GET request to retrieve the initial page, extracts necessary cookies and a CSRF token, and then schedules seven DayPageJob instances, one for each day of the week. The crucial detail here is the staggered scheduling with a delay of only 2 seconds between each job. This seemingly minor detail was the root cause of the race condition.
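
To make the timing concrete, here is a minimal sketch of what this scheduling step likely looks like, assuming standard Oban worker APIs. DayPageJob, Config.rate_limit/0, and the argument keys come from the descriptions above, while schedule_day_jobs/3 is an illustrative name rather than the project's actual function.

# Illustrative sketch of the scheduling loop in SyncJob (names are assumptions,
# not the actual project code). Each DayPageJob is staggered by
# day_offset * rate_limit, i.e. only 2 seconds apart -- the root cause of the race.
defp schedule_day_jobs(cookies, csrf_token, source_id) do
  Enum.each(0..6, fn day_offset ->
    %{
      "cookies" => cookies,
      "csrf_token" => csrf_token,
      "source_id" => source_id,
      "day_offset" => day_offset
    }
    |> DayPageJob.new(schedule_in: day_offset * Config.rate_limit())
    |> Oban.insert()
  end)
end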


Phase 2: Day Scraping (DayPageJob × 7) ⚠️ RACE CONDITION

This is where the race condition rears its head. Each DayPageJob is responsible for scraping the showtimes for a specific day. The jobs interact with the Kino Krakow website by making HTTP POST requests to set the day and then GET requests to retrieve the showtime data. The shared session state and the short stagger delay between jobs create a scenario where jobs interfere with each other.

File: lib/eventasaurus_discovery/sources/kino_krakow/jobs/day_page_job.ex

For EACH day (0-6):

1. HTTP POST → /settings/set_day/{day_offset}
   Headers:
     - X-CSRF-Token: {token}
     - Cookie: {cookies}
     - X-Requested-With: XMLHttpRequest

2. Sleep 2 seconds (rate limit)

3. HTTP GET → /cinema_program/by_movie
   Headers:
     - Cookie: {cookies}

4. Parse HTML:
   └─> Extract showtimes (movie_slug, cinema_slug, datetime)
   └─> Calculate date from day_offset
   └─> Generate external_id (once, at extraction time)

5. Schedule MovieDetailJobs:
   └─> Find unique movie_slugs
   └─> One job per unique movie (deduplicated)
   └─> Stagger by Config.rate_limit() (2s)

6. Schedule ShowtimeProcessJobs:
   └─> One job per showtime
   └─> Apply EventFreshnessChecker (skip recently seen)
   └─> Delay to allow MovieDetailJobs to complete first

HTTP Requests per day: 2 (POST + GET)

Total Phase 2 Requests: 7 × 2 = 14 requests

Each DayPageJob performs an HTTP POST request to set the day, sleeps for 2 seconds (to adhere to rate limits), and then performs an HTTP GET request to retrieve the showtime data for that day. The scraped data is then parsed, and MovieDetailJob and ShowtimeProcessJob instances are scheduled. This phase is critical for data collection, but the timing of the jobs was causing a significant issue.
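
The flow above can be condensed into a sketch of the job's perform/1 callback. This assumes an HTTPoison-style HTTP client and a ShowtimeExtractor.extract/2 helper; the real module uses the project's own client and extraction code, so treat this as an illustration of the sequence, not the actual implementation.

# Condensed sketch of the DayPageJob flow (helper names are assumptions).
# The key point: the POST mutates *shared* server-side session state, and the
# GET two seconds later trusts that the selected day has not changed.
def perform(%Oban.Job{args: %{"day_offset" => day_offset, "cookies" => cookies, "csrf_token" => token}}) do
  headers = [
    {"X-CSRF-Token", token},
    {"Cookie", cookies},
    {"X-Requested-With", "XMLHttpRequest"}
  ]

  # 1. Select the day on the server (session-scoped!)
  {:ok, _} = HTTPoison.post("https://www.kino.krakow.pl/settings/set_day/#{day_offset}", "", headers)

  # 2. Respect the rate limit
  Process.sleep(2_000)

  # 3. Fetch the program page -- returns whatever day the session currently points at
  {:ok, %{body: html}} = HTTPoison.get("https://www.kino.krakow.pl/cinema_program/by_movie", [{"Cookie", cookies}])

  showtimes = ShowtimeExtractor.extract(html, day_offset)
  schedule_follow_up_jobs(showtimes)
  :ok
end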


Phase 3a: Movie Matching (MovieDetailJob × N)

After the showtime data is scraped, the MovieDetailJob instances take over, focusing on enriching the data by fetching movie details. This involves making requests to the Kino Krakow website and the TMDB (The Movie Database) API to gather information such as original titles, director, cast, and genres. Matching movies to TMDB is crucial for providing comprehensive information to users.

File: lib/eventasaurus_discovery/sources/kino_krakow/jobs/movie_detail_job.ex

For EACH unique movie:

1. HTTP GET → /film/{movie_slug}.html

2. Extract metadata (MovieExtractor):
   - original_title (critical for TMDB matching)
   - polish_title
   - director, year, country, runtime, cast, genre

3. Match to TMDB (TmdbMatcher):
   - TMDB Search API call (with original_title + year)
   - Calculate confidence score
   - TMDB Details API call (if match found)

4. Confidence handling:
   ≥70%:   Auto-match (standard)
   60-69%: Auto-match (now_playing_fallback)
   50-59%: {:error, :needs_review} → Job fails
   <50%:   {:error, :low_confidence} → Job fails

5. If matched:
   - Create/update Movie in database
   - Store kino_krakow_slug in movie.metadata

HTTP Requests per movie: 1 Kino Krakow request + 2-3 TMDB API calls

Total Phase 3a Requests: N Kino Krakow requests + 2N-3N TMDB API calls

The MovieDetailJob fetches movie metadata, including the original title, which is critical for matching movies to TMDB. The matching process involves making calls to the TMDB API and calculating a confidence score. Based on this score, the scraper either auto-matches the movie, flags it for review, or skips it if the confidence is too low. This phase ensures that the movie data is as accurate and complete as possible.
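
The threshold logic lends itself to a small cond expression. The sketch below mirrors the confidence bands listed above; handle_confidence/2 is a hypothetical helper name, and confidences are expressed as floats rather than percentages.

# Sketch of the confidence thresholds described above (assumed helper name).
# Returning {:error, ...} makes the Oban job fail, which is what surfaces
# low-confidence matches on the dashboard.
defp handle_confidence(confidence, movie_data) do
  cond do
    confidence >= 0.70 -> {:ok, :standard_match, movie_data}
    confidence >= 0.60 -> {:ok, :now_playing_fallback, movie_data}
    confidence >= 0.50 -> {:error, :needs_review}
    true -> {:error, :low_confidence}
  end
end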


Phase 3b: Showtime Processing (ShowtimeProcessJob × M)

The final phase involves the ShowtimeProcessJob instances processing the individual showtimes and storing them in the database. This includes looking up movie details, extracting cinema data, and transforming the data into a standardized event format. This phase is crucial for making the scraped data accessible and usable.

File: lib/eventasaurus_discovery/sources/kino_krakow/jobs/showtime_process_job.ex

For EACH showtime:

1. Mark event as seen (EventFreshnessChecker)

2. Lookup movie from database:
   SELECT * FROM movies
   WHERE metadata->>'kino_krakow_slug' = ?

3. If movie not found:
   - Check MovieDetailJob status (Oban.Job table)
   - If completed without match → skip showtime
   - If pending/retrying → retry ShowtimeProcessJob

4. Extract cinema data (CinemaExtractor):
   - No HTTP request (formats from slug)
   - Note: No GPS coordinates from Kino Krakow
   - VenueProcessor will geocode later

5. Transform to event format (Transformer):
   - Build title: "{movie} at {cinema}"
   - Use external_id from DayPageJob (no regeneration)
   - Add venue_data, movie_data, metadata

6. Deduplication check (DedupHandler):
   Phase 1: Same-source dedup (external_id)
   Phase 2: Cross-source fuzzy match (higher priority sources)

7. Process via EventProcessor → Database

HTTP Requests: 0 (all data cached from previous phases)

The ShowtimeProcessJob marks events as seen, looks up movie details in the database, extracts cinema data, and transforms the data into a unified event format. It also performs deduplication checks to avoid storing duplicate events. This phase ensures that the data is clean, consistent, and ready for use.
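
The movie lookup in step 2 corresponds to a JSONB query on the movie's metadata. A minimal Ecto sketch, assuming a Movie schema and a Repo module, mirrors the SQL shown above:

# Sketch of the lookup by slug stored in JSONB metadata (step 2 above).
# Movie and Repo are assumed module names; the fragment mirrors the raw SQL.
import Ecto.Query

def find_movie_by_kino_slug(slug) do
  from(m in Movie,
    where: fragment("?->>'kino_krakow_slug' = ?", m.metadata, ^slug)
  )
  |> Repo.one()
end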


Total HTTP Request Analysis: Efficiency Matters

One of the hallmarks of a well-designed scraper is its efficiency in making HTTP requests. Minimizing the number of requests reduces the load on the target website and ensures that the scraping process is as fast and reliable as possible. Let's analyze the total HTTP requests made by the Kino Krakow scraper.

Requests to Kino Krakow: A Lean Approach

The Kino Krakow scraper is designed to be efficient, making a minimal number of requests to the Kino Krakow website. This is crucial for avoiding rate limits and ensuring that the scraping process is as smooth as possible.

Phase                Requests   Details
SyncJob              1          Session establishment
DayPageJob           14         7 days × (1 POST + 1 GET)
MovieDetailJob       N          1 per unique movie
ShowtimeProcessJob   0          Uses cached data
Total                15 + N     Very efficient!

External API Calls: Leveraging TMDB

The scraper also makes calls to external APIs, primarily the TMDB API, to enrich the movie data. These calls are essential for providing comprehensive information to users, but they also need to be managed carefully to avoid exceeding API limits.

Service        Calls   Details
TMDB Search    N       One per unique movie
TMDB Details   N       One per matched movie
Geocoding      V       One per unique venue (lazy, cached)

Efficiency Rating: ⭐⭐⭐⭐⭐ Excellent

  • Movies deduplicated across all 7 days
  • Showtimes require no additional HTTP requests
  • Minimal redundant fetching

🔴 Critical Bug: Race Condition in Day Selection

Now we arrive at the heart of the matter: the race condition. This bug was preventing the scraper from collecting data for all 7 days of the week, significantly impacting its usefulness. Understanding the race condition requires a close look at how the DayPageJob instances were interacting with the Kino Krakow website's session management.

The Problem: Jobs Overwriting Each Other

The race condition stemmed from the way the DayPageJob instances were scheduled and how they interacted with the Kino Krakow website's session management. The short stagger delay between jobs, combined with the shared session state, led to jobs overwriting each other's day selection. This meant that instead of scraping data for all 7 days, the scraper was often only collecting data for a single day.

Current scheduling (sync_job.ex:157):

delay_seconds = day_offset * Config.rate_limit()  # 2 seconds

Actual timeline:

T=0s:  Day 0 starts → POST /settings/set_day/0
T=1s:  Day 0 POST completes
T=2s:  Day 1 starts → POST /settings/set_day/1 ⚠️
T=3s:  Day 0 sleep ends → GET (but session now set to day=1!)
       Day 1 POST completes
T=4s:  Day 2 starts → POST /settings/set_day/2 ⚠️
       Day 1 sleep ends → GET (but session now set to day=2!)
T=6s:  Day 2 sleep ends → GET
       Day 3 starts → POST /settings/set_day/3 ⚠️

Root Cause: Shared Session State and Overlapping Execution

The root cause of the race condition can be broken down into several key factors:

  1. Shared session state: All 7 DayPageJob instances were using the same cookies, meaning they were sharing the same session with the Kino Krakow website.
  2. Server-side day selection: The Kino Krakow website used a POST request to /settings/set_day/{day} to modify the server session, effectively setting the day for which showtimes would be retrieved.
  3. Overlapping execution: The DayPageJob instances were running in parallel with a stagger delay of only 2 seconds, leading to significant overlap in their execution.
  4. Job execution time: Each DayPageJob took approximately 6 seconds to complete, including the POST request, sleep time, and GET request.

Result: Jobs overwrite each other's day selection, causing all jobs to get the same or overlapping day data.
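
The overlap can be checked with a few lines of Elixir: treat each job as occupying the session from its scheduled start until it finishes its GET, and test whether consecutive windows intersect. The 6-second duration is the estimate from this analysis, not a measured value.

# Minimal sketch: checks whether consecutive DayPageJobs can overlap, given a
# stagger between job starts and an approximate per-job duration.
defmodule StaggerCheck do
  @job_duration 6

  def overlapping?(stagger_seconds) do
    # Day N holds the session's day selection from N * stagger until roughly
    # N * stagger + @job_duration (POST + sleep + GET).
    windows = for day <- 0..6, do: {day * stagger_seconds, day * stagger_seconds + @job_duration}

    windows
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.any?(fn [{_s1, e1}, {s2, _e2}] -> s2 < e1 end)
  end
end

# StaggerCheck.overlapping?(2)   #=> true  (2-second stagger: jobs collide)
# StaggerCheck.overlapping?(10)  #=> false (10-second stagger: effectively sequential)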

Why This Matters: Impact on Data Completeness

The race condition had a significant impact on the completeness of the scraped data. Instead of providing showtimes for all 7 days of the week, the scraper was often only returning data for a single day, severely limiting its usefulness.

  • ❌ Users only see showtimes for 1 day (likely today)
  • ❌ Missing 6/7ths of available movie showtimes
  • ❌ Freshness checker might incorrectly skip valid future showtimes
  • ❌ Incomplete event calendar for Krakow users

✅ Recommended Solution: Sequential Scheduling

To resolve the race condition, a simple yet effective solution was implemented: sequential scheduling. This involved increasing the stagger delay between the DayPageJob instances, ensuring that each job had enough time to complete before the next one started. This eliminated the overlap in execution and prevented jobs from interfering with each other's session state.

Fix: Increase Stagger Delay

The fix involved modifying a single line of code in the sync_job.ex file. By increasing the stagger delay, we ensured that each DayPageJob had enough time to complete before the next one started, effectively eliminating the race condition.

Change (sync_job.ex:157):

# Before (BROKEN - race condition):
delay_seconds = day_offset * Config.rate_limit()  # 2 seconds

# After (FIXED - sequential execution):
delay_seconds = day_offset * 10  # 10 seconds

Why 10 Seconds? A Matter of Timing

The 10-second delay was chosen based on the estimated execution time of each DayPageJob. This included the time taken for the POST request, the rate limit sleep, the GET request, and a buffer for processing. By ensuring a sufficient delay, we minimized the risk of jobs overlapping and overwriting each other's session state.

Each DayPageJob needs:

  • POST request: ~1-2 seconds
  • Rate limit sleep: 2 seconds (in code)
  • GET request: ~1-2 seconds
  • Processing buffer: ~3-4 seconds
  • Total: ~9-10 seconds

Expected Timeline (Fixed): A Smoother Process

With the increased stagger delay, the execution timeline of the DayPageJob instances looked much smoother. Each job had enough time to complete before the next one started, ensuring that the session state was correctly maintained.

T=0s:   Day 0 starts
T=6s:   Day 0 completes
T=10s:  Day 1 starts
T=16s:  Day 1 completes
T=20s:  Day 2 starts
...
T=60s:  Day 6 starts
T=66s:  Day 6 completes

Total scraping time: ~70 seconds (vs current broken ~14 seconds)

Trade-offs: Balancing Speed and Reliability

While the sequential scheduling solution effectively resolved the race condition, it also introduced some trade-offs. The most notable was the slight increase in the total scraping time. However, the reliability gained by eliminating the race condition far outweighed this minor drawback.

Pros:

  • Guaranteed no race condition
  • Simple one-line fix
  • No architectural changes needed
  • Still reasonable performance (70s total)
  • High confidence solution

Cons:

  • Slightly slower than ideal parallel execution
  • Doesn't leverage full parallelism potential

Alternative Solutions: Future Optimization

While the sequential scheduling solution was effective, it wasn't the only option. Several alternative solutions were considered for future optimization, each with its own set of pros and cons. These options offer potential for further improving the scraper's performance and efficiency.

Option 2: Separate Session Per Day

This approach involves each DayPageJob establishing its own session with the Kino Krakow website. This would eliminate the shared session state and prevent jobs from interfering with each other. However, it would also increase the number of HTTP requests, potentially impacting performance.

Approach: Each DayPageJob establishes its own session

  • Move establish_session() from SyncJob into DayPageJob
  • Each job gets own cookies + CSRF token
  • No shared state = no race condition (sketched below)

Pros:

  • ✅ True parallelism (all 7 days run concurrently)
  • ✅ Faster execution (~14 seconds)

Cons:

  • ❌ 7× session overhead (7 extra HTTP requests)
  • ❌ More complex implementation
  • ❌ Higher server load on Kino Krakow
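
For reference, a rough sketch of the per-job session establishment this option implies, assuming an HTTPoison-style client and Floki for HTML parsing. establish_own_session/0 is a hypothetical name, not the project's establish_session/0, and each call is one of the seven extra GETs counted above.

# Hypothetical per-job session setup for Option 2: fetch the program page once,
# collect Set-Cookie headers, and read the CSRF token from the page's meta tag.
defp establish_own_session do
  {:ok, %HTTPoison.Response{body: body, headers: headers}} =
    HTTPoison.get("https://www.kino.krakow.pl/cinema_program/by_movie")

  cookies =
    headers
    |> Enum.filter(fn {name, _} -> String.downcase(name) == "set-cookie" end)
    |> Enum.map_join("; ", fn {_, value} -> value |> String.split(";") |> hd() end)

  {:ok, doc} = Floki.parse_document(body)
  [csrf_token] = doc |> Floki.find(~s(meta[name="csrf-token"])) |> Floki.attribute("content")

  {cookies, csrf_token}
end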

Option 3: URL Parameter for Day Selection

This approach involves checking if the Kino Krakow website supports selecting the day via a URL parameter. If so, the scraper could simply pass the day as a parameter in the URL, eliminating the need for the POST request to /settings/set_day/{day}.

Approach: Check if website supports day parameter

  • Try: /cinema_program/by_movie?day=0
  • Or: /cinema_program/by_movie/2025-01-15

Pros:

  • ✅ Perfect parallelism
  • ✅ No session state needed
  • ✅ Simplest solution

Cons:

  • ❌ Unknown if Kino Krakow supports this
  • ❌ Requires testing/investigation (see the test sketch below)
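
One low-effort way to investigate is to fetch the program page with a candidate day parameter for two different days and compare the extracted showtimes. The ?day= parameter and the ShowtimeExtractor.extract/2 call are assumptions to verify, not documented Kino Krakow behavior.

# Quick investigation sketch for Option 3. If the parameter is honored,
# day 0 and day 1 should yield different showtime sets.
defp day_param_supported? do
  showtimes_for = fn day ->
    {:ok, %{body: html}} =
      HTTPoison.get("https://www.kino.krakow.pl/cinema_program/by_movie?day=#{day}")

    html |> ShowtimeExtractor.extract(day) |> Enum.map(& &1.datetime) |> Enum.uniq()
  end

  showtimes_for.(0) != showtimes_for.(1)
end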

Option 4: Oban Unique Jobs

This approach involves leveraging Oban's unique job constraint feature to ensure that only one DayPageJob runs at a time. This would prevent the race condition without requiring a large stagger delay. However, it would also result in sequential execution, similar to the chosen solution.

Approach: Use Oban's unique constraint

  • Only one DayPageJob runs at a time
  • Others wait in queue

Pros:

  • ✅ No race condition
  • ✅ Uses Oban native features

Cons:

  • ❌ Sequential execution (slower)
  • ❌ More complex configuration

Additional Findings: Beyond the Race Condition

While addressing the race condition was the primary focus, the analysis also uncovered several other areas for potential improvement. These findings offer opportunities to further enhance the scraper's performance, data quality, and maintainability.

Cinema GPS Coordinates: Enhancing Location Data

The current implementation extracts cinema data from the slug, which doesn't include GPS coordinates. This means that the scraper relies on geocoding services to determine the location of the cinemas, adding external API calls and potentially introducing inaccuracies. Scraping cinema detail pages could provide a more direct and accurate source of GPS coordinates.

Current: CinemaExtractor just formats data from slug:

cinema_data = CinemaExtractor.extract("", showtime["cinema_slug"])

Impact:

  • No GPS coordinates fetched from Kino Krakow
  • VenueProcessor must geocode using Google Maps/Nominatim
  • Adds external API calls for geocoding
  • Potential for incorrect/missing location data

Recommendation (P2): Consider scraping cinema detail pages:

  • GET /cinema/{cinema_slug}/info
  • Extract GPS coordinates if available
  • Reduce geocoding API usage
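
If the detail pages do expose coordinates, extraction could look roughly like the sketch below. The URL pattern, the data-lat/data-lng attributes, and fetch_cinema_coordinates/1 are all guesses that would need to be checked against the real markup; Floki is assumed for parsing.

# Hypothetical sketch of pulling GPS coordinates from a cinema detail page,
# assuming the page embeds them (e.g. in a map widget's data attributes).
def fetch_cinema_coordinates(cinema_slug) do
  {:ok, %{body: html}} = HTTPoison.get("https://www.kino.krakow.pl/cinema/#{cinema_slug}/info")
  {:ok, doc} = Floki.parse_document(html)

  with [lat] <- Floki.attribute(doc, "[data-lat]", "data-lat"),
       [lng] <- Floki.attribute(doc, "[data-lng]", "data-lng") do
    {:ok, %{latitude: String.to_float(lat), longitude: String.to_float(lng)}}
  else
    _ -> {:error, :no_coordinates}
  end
end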

TMDB Matching Quality: Improving Data Enrichment

The process of matching movies to TMDB is crucial for enriching the scraped data with details such as original titles, cast, and genres. However, the confidence levels used for auto-matching can impact the accuracy of the data. Implementing a review queue for medium-confidence matches could improve the overall quality of the movie data.

Success Rates:

  • ≥70% confidence: Auto-matched ✅
  • 60-69% confidence: Auto-matched (fallback) ✅
  • 50-59% confidence: Needs review → Event skipped ⚠️
  • <50% confidence: No match → Event skipped ❌

Impact:

  • Medium/low confidence matches result in lost events
  • No manual review workflow currently exists
  • Visible in Oban dashboard but requires manual intervention

Recommendation (P2):

  • Implement review queue for 50-69% matches
  • Add admin UI for manual TMDB matching
  • Track matching success rate metrics

Freshness Checking: Ensuring Up-to-Date Data

The EventFreshnessChecker is designed to prevent re-processing the same showtimes on every scrape. However, the race condition was interfering with this process, potentially leading to future showtimes being skipped. Resolving the race condition also resolves this secondary issue, ensuring that the scraper processes all relevant showtimes.

Current: EventFreshnessChecker filters recent showtimes

  • Prevents re-processing same showtime on every scrape
  • Uses last_seen_at timestamp
  • Configurable threshold (likely 24h)

Impact with race condition:

  • If all days get same data (Day 0), Days 1-6 showtimes never process
  • Freshness checker sees them as "already processed"
  • Future showtimes never make it to database

Fix: Race condition fix will resolve this secondary issue


Metrics & Observability: Keeping an Eye on Performance

Monitoring the scraper's performance and health is crucial for ensuring its long-term reliability. Implementing metrics and alerts can help identify issues early on and prevent data loss. The current metrics provide a good starting point, but there are several areas for improvement.

Current Metrics (Oban Dashboard): What We Can See

The Oban dashboard provides valuable insights into the scraper's performance, including job counts per state, individual job failures, TMDB matching failures, and processing time per job type. This data is essential for identifying and diagnosing issues.

Visible:

  • Job counts per state (completed, failed, retrying)
  • Individual job failures with error details
  • TMDB matching failures per movie
  • Processing time per job type

Recommended Additions: Enhancing Visibility

To further improve observability, several additional metrics were recommended, including day-level success metrics, unique date counts in scraped showtimes, TMDB matching success rates, and scraping coverage. These metrics would provide a more comprehensive view of the scraper's health and performance.

Missing:

  • Day-level success metrics (are all 7 days scraping?)
  • Unique date count in scraped showtimes
  • TMDB matching success rate percentage
  • Scraping coverage (% of expected showtimes)

Suggested implementations:

  1. Day Coverage Metric:

    showtimes
    |> Enum.map(&DateTime.to_date(&1.datetime))
    |> Enum.uniq()
    |> length()  # Should be 7
    
  2. TMDB Success Rate:

    SELECT
      COUNT(*) FILTER (WHERE state = 'completed') as matched,
      COUNT(*) FILTER (WHERE state = 'discarded') as failed,
      COUNT(*) as total
    FROM oban_jobs
    WHERE worker LIKE '%MovieDetailJob'
    
  3. Scraping Health Alert:

  • Alert if unique dates < 7
  • Alert if TMDB success rate < 80%
  • Alert if showtime count drops significantly
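
A simple post-scrape check could implement the first two alerts directly, as in the hedged sketch below. ScrapeHealthCheck, the datetime field, and the success-rate input are assumptions; wiring the warnings to a real alerting channel is left out.

# Sketch of a post-scrape health check for the alerts above (assumed names).
defmodule ScrapeHealthCheck do
  require Logger

  def check(showtimes, tmdb_success_rate) do
    unique_days =
      showtimes
      |> Enum.map(&DateTime.to_date(&1.datetime))
      |> Enum.uniq()
      |> length()

    if unique_days < 7, do: Logger.warning("Kino Krakow scrape covered only #{unique_days}/7 days")
    if tmdb_success_rate < 0.80, do: Logger.warning("TMDB match rate below 80%: #{tmdb_success_rate}")

    :ok
  end
end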

Implementation Checklist: A Step-by-Step Guide

To ensure that the fixes and improvements were implemented correctly, an implementation checklist was created. This checklist outlined the steps required for each phase of the project, from the critical bug fix to the long-term improvements.

Phase 1: Critical Bug Fix (P0)

  • [ ] Update sync_job.ex:157 to use 10-second stagger
  • [ ] Deploy and test with sample scrape
  • [ ] Verify all 7 days return different data
  • [ ] Add comment explaining race condition fix
  • [ ] Monitor Oban dashboard for 7 DayPageJob completions

Phase 2: Verification (P1)

  • [ ] Add logging to show date range in DayPageJob results
  • [ ] Add metrics for unique dates scraped
  • [ ] Create admin query to verify 7-day coverage
  • [ ] Document expected behavior in code comments

Phase 3: Long-term Improvements (P2)

  • [ ] Investigate Option 3 (URL parameter for day selection)
  • [ ] Consider scraping cinema pages for GPS coordinates
  • [ ] Implement manual review workflow for medium-confidence TMDB matches
  • [ ] Add alerting for scraping health metrics
  • [ ] Research separate session approach (Option 2) if performance becomes issue

Architecture Strengths: What Works Well

Despite the race condition, the Kino Krakow scraper boasts several architectural strengths that contribute to its overall effectiveness. These strengths should be preserved and built upon in future improvements.

Well-designed patterns:

  1. Distributed job architecture with clear separation of concerns
  2. External ID generation at extraction time (BandsInTown A+ pattern)
  3. Movie deduplication across days (one MovieDetailJob per unique movie)
  4. Freshness checking to avoid duplicate processing
  5. Granular visibility into failures via Oban dashboard
  6. TMDB confidence scoring for matching quality
  7. Proper rate limiting between HTTP requests
  8. Error handling and retry logic at each level

Efficient HTTP usage:

  • Only 15 + N requests to Kino Krakow (N = unique movies)
  • Zero redundant showtime fetches
  • Smart caching of movie data in database

Code References: Key Files and Lines

To facilitate future maintenance and improvements, a list of key code references was compiled. This list highlights the files and lines of code that are most relevant to the scraper's functionality and the identified issues.

File                      Purpose            Lines of Interest
sync_job.ex               Coordinator        Line 157: Race condition fix needed
day_page_job.ex           Day scraping       Lines 90-136: Day selection HTTP flow
movie_detail_job.ex       Movie matching     Lines 68-95: TMDB confidence logic
showtime_process_job.ex   Event processing   Lines 80-107: Movie lookup & retry logic
config.ex                 Configuration      Line 15: rate_limit (2 seconds)
showtime_extractor.ex     HTML parsing       Lines 33-43: Showtime extraction
transformer.ex            Event formatting   Lines 23-85: Unified format transform

Questions for Further Investigation: Unanswered Questions

During the analysis, several questions arose that warrant further investigation. These questions could lead to additional improvements and a deeper understanding of the scraper's behavior.

  1. Day selection testing: Can we verify that POST /settings/set_day actually works correctly?
  2. URL parameters: Does Kino Krakow support day selection via URL parameters?
  3. Cinema GPS: Are GPS coordinates available on cinema detail pages?
  4. TMDB matching rate: What percentage of movies successfully match?
  5. Parallel optimization: Is the sequential fix "fast enough" or should we pursue Option 2?

Conclusion: A Well-Architected Scraper with Room for Improvement

The Kino Krakow scraper is a well-architected system that faced a critical but easily fixable bug. The race condition in day selection was preventing the scraper from collecting complete data, but a simple adjustment to the stagger delay resolved the issue. This experience highlights the importance of careful attention to timing and concurrency when designing web scrapers.

Recommended action: Implement the one-line fix (10-second stagger) immediately, then monitor metrics to verify all 7 days are scraping correctly.

The architecture is sound and should continue to work well once this timing issue is resolved. Future optimizations (separate sessions, URL parameters) can be considered if performance becomes a concern.


Created: 2025-01-18

Status: Analysis Complete, Awaiting Implementation

Priority: P0 - Critical Bug Fix Required

In conclusion, understanding and resolving issues like race conditions are crucial for maintaining the effectiveness of web scrapers. The Kino Krakow scraper serves as a great example of how a well-designed system can be improved through careful analysis and targeted solutions.

For more information on web scraping and race conditions, you can check out resources like the OWASP (Open Web Application Security Project) guide on concurrency issues.