Fixing Kino Krakow: Race Condition Prevents Full Scraping

by Alex Johnson

Have you ever encountered a pesky bug that prevents a system from working as intended? In the world of web scraping, these issues can be particularly frustrating. This article dives deep into a recent challenge faced with the Kino Krakow scraper, a system designed to collect movie showtimes for Krakow, Poland. We'll explore how a race condition was identified and resolved, ensuring users get the complete picture of movie listings. Understanding these challenges and solutions can be invaluable for anyone involved in web scraping or software development.

Executive Summary: The Kino Krakow Scraper's 7-Day Challenge

The Kino Krakow scraper, while boasting a well-designed distributed job-based architecture, was plagued by a critical race condition. This issue prevented it from scraping the showtimes for all 7 days of the week. Although the infrastructure was in place for complete data collection, the jobs were being scheduled too closely together, leading to the overwriting of session states. This resulted in incomplete data, impacting users who relied on the scraper for comprehensive movie schedules. Identifying and resolving such issues is crucial for maintaining the reliability and effectiveness of web scrapers.

Status: 🔴 Critical Bug - Only getting 1 day of data instead of 7 days

Impact: Missing 6/7ths of movie showtimes for Krakow users

Fix Complexity: ✅ Simple one-line change


Architecture Overview: How the Scraper Works

To understand the problem, let's first break down the architecture of the Kino Krakow scraper. The scraper operates using a hierarchical job structure, where a central coordinator (SyncJob) initiates and manages several DayPageJob instances. Each DayPageJob is responsible for scraping showtimes for a specific day of the week. These jobs then schedule further tasks, such as fetching movie details (MovieDetailJob) and processing showtime information (ShowtimeProcessJob).

The job hierarchy can be visualized as follows:

SyncJob (Coordinator)
    ├─> DayPageJob (Day 0)
    ├─> DayPageJob (Day 1)
    ├─> DayPageJob (Day 2)
    ├─> DayPageJob (Day 3)
    ├─> DayPageJob (Day 4)
    ├─> DayPageJob (Day 5)
    └─> DayPageJob (Day 6)
            ├─> MovieDetailJob (unique movies)
            │       └─> TMDB API calls
            └─> ShowtimeProcessJob (all showtimes)
                    └─> EventProcessor → Database

This structure ensures a systematic approach to scraping, processing, and storing data. Understanding this job hierarchy is key to grasping how the race condition occurred and how it was resolved. The core of the issue lies in the interaction between the DayPageJob instances and their reliance on shared session state.

Key Components: The Building Blocks of the Scraper

The Kino Krakow scraper comprises several key components, each playing a specific role in the data collection process. These components interact with each other to ensure that movie showtimes are accurately scraped, processed, and stored. Understanding the purpose and function of each component is crucial for maintaining and optimizing the scraper's performance.

Component            Queue             Count        Purpose
SyncJob              :discovery        1            Coordinator: establishes session, schedules day jobs
DayPageJob           :scraper_index    7            Scrapes one day's showtimes, schedules movie/showtime jobs
MovieDetailJob       :scraper_detail   N (unique)   Fetches movie details, matches to TMDB
ShowtimeProcessJob   :scraper          M (all)      Processes individual showtimes into events

Where:

  • N = Unique movies across all 7 days (deduplicated)
  • M = Total showtimes across all 7 days

The Complete Data Flow: From Start to Finish

To truly appreciate the complexity of the scraper and the subtlety of the bug, let's trace the complete data flow, phase by phase. This will illustrate how each component interacts with the others and where the race condition manifested itself.

Phase 1: Session Establishment (SyncJob)

The initial phase involves the SyncJob establishing a session with the Kino Krakow website. This is crucial for authenticating subsequent requests and maintaining consistency throughout the scraping process. Without a properly established session, the scraper would be unable to access the necessary data.

File: lib/eventasaurus_discovery/sources/kino_krakow/jobs/sync_job.ex

1. HTTP GET → https://www.kino.krakow.pl/cinema_program/by_movie
   └─> Extract Set-Cookie headers
   └─> Extract CSRF token from <meta name="csrf-token">

2. Schedule 7 DayPageJobs (days 0-6)
   └─> Pass: cookies, csrf_token, source_id, day_offset
   └─> Stagger: delay_seconds = day_offset * 2 seconds ⚠️ TOO SHORT!

HTTP Requests: 1 GET

The SyncJob performs an HTTP GET request to retrieve the initial page, extracts necessary cookies and a CSRF token, and then schedules seven DayPageJob instances, one for each day of the week. The crucial detail here is the staggered scheduling with a delay of only 2 seconds between each job. This seemingly minor detail was the root cause of the race condition.
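
To make the timing concrete, here is a minimal sketch of what this scheduling step likely looks like, assuming standard Oban worker APIs. DayPageJob, Config.rate_limit/0, and the argument keys come from the descriptions above, while schedule_day_jobs/3 is an illustrative name rather than the project's actual function.

# Illustrative sketch of the scheduling loop in SyncJob (names are assumptions,
# not the actual project code). Each DayPageJob is staggered by
# day_offset * rate_limit, i.e. only 2 seconds apart -- the root cause of the race.
defp schedule_day_jobs(cookies, csrf_token, source_id) do
  Enum.each(0..6, fn day_offset ->
    %{
      "cookies" => cookies,
      "csrf_token" => csrf_token,
      "source_id" => source_id,
      "day_offset" => day_offset
    }
    |> DayPageJob.new(schedule_in: day_offset * Config.rate_limit())
    |> Oban.insert()
  end)
end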


Phase 2: Day Scraping (DayPageJob × 7) ⚠️ RACE CONDITION

This is where the race condition rears its head. Each DayPageJob is responsible for scraping the showtimes for a specific day. The jobs interact with the Kino Krakow website by making HTTP POST requests to set the day and then GET requests to retrieve the showtime data. The shared session state and the short stagger delay between jobs create a scenario where jobs interfere with each other.

File: lib/eventasaurus_discovery/sources/kino_krakow/jobs/day_page_job.ex

For EACH day (0-6):

1. HTTP POST → /settings/set_day/{day_offset}
   Headers:
     - X-CSRF-Token: {token}
     - Cookie: {cookies}
     - X-Requested-With: XMLHttpRequest

2. Sleep 2 seconds (rate limit)

3. HTTP GET → /cinema_program/by_movie
   Headers:
     - Cookie: {cookies}

4. Parse HTML:
   └─> Extract showtimes (movie_slug, cinema_slug, datetime)
   └─> Calculate date from day_offset
   └─> Generate external_id (once, at extraction time)

5. Schedule MovieDetailJobs:
   └─> Find unique movie_slugs
   └─> One job per unique movie (deduplicated)
   └─> Stagger by Config.rate_limit() (2s)

6. Schedule ShowtimeProcessJobs:
   └─> One job per showtime
   └─> Apply EventFreshnessChecker (skip recently seen)
   └─> Delay to allow MovieDetailJobs to complete first

HTTP Requests per day: 2 (POST + GET)

Total Phase 2 Requests: 7 × 2 = 14 requests

Each DayPageJob performs an HTTP POST request to set the day, sleeps for 2 seconds (to adhere to rate limits), and then performs an HTTP GET request to retrieve the showtime data for that day. The scraped data is then parsed, and MovieDetailJob and ShowtimeProcessJob instances are scheduled. This phase is critical for data collection, but the timing of the jobs was causing a significant issue.
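
The flow above can be condensed into a sketch of the job's perform/1 callback. This assumes an HTTPoison-style HTTP client and a ShowtimeExtractor.extract/2 helper; the real module uses the project's own client and extraction code, so treat this as an illustration of the sequence, not the actual implementation.

# Condensed sketch of the DayPageJob flow (helper names are assumptions).
# The key point: the POST mutates *shared* server-side session state, and the
# GET two seconds later trusts that the selected day has not changed.
def perform(%Oban.Job{args: %{"day_offset" => day_offset, "cookies" => cookies, "csrf_token" => token}}) do
  headers = [
    {"X-CSRF-Token", token},
    {"Cookie", cookies},
    {"X-Requested-With", "XMLHttpRequest"}
  ]

  # 1. Select the day on the server (session-scoped!)
  {:ok, _} = HTTPoison.post("https://www.kino.krakow.pl/settings/set_day/#{day_offset}", "", headers)

  # 2. Respect the rate limit
  Process.sleep(2_000)

  # 3. Fetch the program page -- returns whatever day the session currently points at
  {:ok, %{body: html}} = HTTPoison.get("https://www.kino.krakow.pl/cinema_program/by_movie", [{"Cookie", cookies}])

  showtimes = ShowtimeExtractor.extract(html, day_offset)
  schedule_follow_up_jobs(showtimes)
  :ok
end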


Phase 3a: Movie Matching (MovieDetailJob × N)

After the showtime data is scraped, the MovieDetailJob instances take over, focusing on enriching the data by fetching movie details. This involves making requests to the Kino Krakow website and the TMDB (The Movie Database) API to gather information such as original titles, director, cast, and genres. Matching movies to TMDB is crucial for providing comprehensive information to users.

File: lib/eventasaurus_discovery/sources/kino_krakow/jobs/movie_detail_job.ex

For EACH unique movie:

1. HTTP GET → /film/{movie_slug}.html

2. Extract metadata (MovieExtractor):
   - original_title (critical for TMDB matching)
   - polish_title
   - director, year, country, runtime, cast, genre

3. Match to TMDB (TmdbMatcher):
   - TMDB Search API call (with original_title + year)
   - Calculate confidence score
   - TMDB Details API call (if match found)

4. Confidence handling:
   ≥70%:   Auto-match (standard)
   60-69%: Auto-match (now_playing_fallback)
   50-59%: {:error, :needs_review} → Job fails
   <50%:   {:error, :low_confidence} → Job fails

5. If matched:
   - Create/update Movie in database
   - Store kino_krakow_slug in movie.metadata

HTTP Requests per movie: 1 Kino Krakow request + 2-3 TMDB API calls

Total Phase 3a Requests: N Kino Krakow requests + 2N-3N TMDB API calls

The MovieDetailJob fetches movie metadata, including the original title, which is critical for matching movies to TMDB. The matching process involves making calls to the TMDB API and calculating a confidence score. Based on this score, the scraper either auto-matches the movie, flags it for review, or skips it if the confidence is too low. This phase ensures that the movie data is as accurate and complete as possible.
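
The threshold logic lends itself to a small cond expression. The sketch below mirrors the confidence bands listed above; handle_confidence/2 is a hypothetical helper name, and confidences are expressed as floats rather than percentages.

# Sketch of the confidence thresholds described above (assumed helper name).
# Returning {:error, ...} makes the Oban job fail, which is what surfaces
# low-confidence matches on the dashboard.
defp handle_confidence(confidence, movie_data) do
  cond do
    confidence >= 0.70 -> {:ok, :standard_match, movie_data}
    confidence >= 0.60 -> {:ok, :now_playing_fallback, movie_data}
    confidence >= 0.50 -> {:error, :needs_review}
    true -> {:error, :low_confidence}
  end
end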


Phase 3b: Showtime Processing (ShowtimeProcessJob × M)

The final phase involves the ShowtimeProcessJob instances processing the individual showtimes and storing them in the database. This includes looking up movie details, extracting cinema data, and transforming the data into a standardized event format. This phase is crucial for making the scraped data accessible and usable.

File: lib/eventasaurus_discovery/sources/kino_krakow/jobs/showtime_process_job.ex

For EACH showtime:

1. Mark event as seen (EventFreshnessChecker)

2. Lookup movie from database:
   SELECT * FROM movies
   WHERE metadata->>'kino_krakow_slug' = ?

3. If movie not found:
   - Check MovieDetailJob status (Oban.Job table)
   - If completed without match → skip showtime
   - If pending/retrying → retry ShowtimeProcessJob

4. Extract cinema data (CinemaExtractor):
   - No HTTP request (formats from slug)
   - Note: No GPS coordinates from Kino Krakow
   - VenueProcessor will geocode later

5. Transform to event format (Transformer):
   - Build title: "{movie} at {cinema}"
   - Use external_id from DayPageJob (no regeneration)
   - Add venue_data, movie_data, metadata

6. Deduplication check (DedupHandler):
   Phase 1: Same-source dedup (external_id)
   Phase 2: Cross-source fuzzy match (higher priority sources)

7. Process via EventProcessor → Database

HTTP Requests: 0 (all data cached from previous phases)

The ShowtimeProcessJob marks events as seen, looks up movie details in the database, extracts cinema data, and transforms the data into a unified event format. It also performs deduplication checks to avoid storing duplicate events. This phase ensures that the data is clean, consistent, and ready for use.
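
The movie lookup in step 2 corresponds to a JSONB query on the movie's metadata. A minimal Ecto sketch, assuming a Movie schema and a Repo module, mirrors the SQL shown above:

# Sketch of the lookup by slug stored in JSONB metadata (step 2 above).
# Movie and Repo are assumed module names; the fragment mirrors the raw SQL.
import Ecto.Query

def find_movie_by_kino_slug(slug) do
  from(m in Movie,
    where: fragment("?->>'kino_krakow_slug' = ?", m.metadata, ^slug)
  )
  |> Repo.one()
end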


Total HTTP Request Analysis: Efficiency Matters

One of the hallmarks of a well-designed scraper is its efficiency in making HTTP requests. Minimizing the number of requests reduces the load on the target website and ensures that the scraping process is as fast and reliable as possible. Let's analyze the total HTTP requests made by the Kino Krakow scraper.

Requests to Kino Krakow: A Lean Approach

The Kino Krakow scraper is designed to be efficient, making a minimal number of requests to the Kino Krakow website. This is crucial for avoiding rate limits and ensuring that the scraping process is as smooth as possible.

Phase                Requests   Details
SyncJob              1          Session establishment
DayPageJob           14         7 days × (1 POST + 1 GET)
MovieDetailJob       N          1 per unique movie
ShowtimeProcessJob   0          Uses cached data
Total                15 + N     Very efficient!

External API Calls: Leveraging TMDB

The scraper also makes calls to external APIs, primarily the TMDB API, to enrich the movie data. These calls are essential for providing comprehensive information to users, but they also need to be managed carefully to avoid exceeding API limits.

Service        Calls   Details
TMDB Search    N       One per unique movie
TMDB Details   N       One per matched movie
Geocoding      V       One per unique venue (lazy, cached)

Efficiency Rating: ⭐⭐⭐⭐⭐ Excellent

  • Movies deduplicated across all 7 days
  • Showtimes require no additional HTTP requests
  • Minimal redundant fetching

🔴 Critical Bug: Race Condition in Day Selection

Now we arrive at the heart of the matter: the race condition. This bug was preventing the scraper from collecting data for all 7 days of the week, significantly impacting its usefulness. Understanding the race condition requires a close look at how the DayPageJob instances were interacting with the Kino Krakow website's session management.

The Problem: Jobs Overwriting Each Other

The race condition stemmed from the way the DayPageJob instances were scheduled and how they interacted with the Kino Krakow website's session management. The short stagger delay between jobs, combined with the shared session state, led to jobs overwriting each other's day selection. This meant that instead of scraping data for all 7 days, the scraper was often only collecting data for a single day.

Current scheduling (sync_job.ex:157):

delay_seconds = day_offset * Config.rate_limit()  # 2 seconds

Actual timeline:

T=0s:  Day 0 starts → POST /settings/set_day/0
T=1s:  Day 0 POST completes
T=2s:  Day 1 starts → POST /settings/set_day/1 ⚠️
T=3s:  Day 0 sleep ends → GET (but session now set to day=1!)
       Day 1 POST completes
T=4s:  Day 2 starts → POST /settings/set_day/2 ⚠️
       Day 1 sleep ends → GET (but session now set to day=2!)
T=6s:  Day 2 sleep ends → GET
       Day 3 starts → POST /settings/set_day/3 ⚠️

Root Cause: Shared Session State and Overlapping Execution

The root cause of the race condition can be broken down into several key factors:

  1. Shared session state: All 7 DayPageJob instances were using the same cookies, meaning they were sharing the same session with the Kino Krakow website.
  2. Server-side day selection: The Kino Krakow website used a POST request to /settings/set_day/{day} to modify the server session, effectively setting the day for which showtimes would be retrieved.
  3. Overlapping execution: The DayPageJob instances were running in parallel with a stagger delay of only 2 seconds, leading to significant overlap in their execution.
  4. Job execution time: Each DayPageJob took approximately 6 seconds to complete, including the POST request, sleep time, and GET request.

Result: Jobs overwrite each other's day selection, causing all jobs to get the same or overlapping day data.
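
The overlap can be checked with a few lines of Elixir: treat each job as occupying the session from its scheduled start until it finishes its GET, and test whether consecutive windows intersect. The 6-second duration is the estimate from this analysis, not a measured value.

# Minimal sketch: checks whether consecutive DayPageJobs can overlap, given a
# stagger between job starts and an approximate per-job duration.
defmodule StaggerCheck do
  @job_duration 6

  def overlapping?(stagger_seconds) do
    # Day N holds the session's day selection from N * stagger until roughly
    # N * stagger + @job_duration (POST + sleep + GET).
    windows = for day <- 0..6, do: {day * stagger_seconds, day * stagger_seconds + @job_duration}

    windows
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.any?(fn [{_s1, e1}, {s2, _e2}] -> s2 < e1 end)
  end
end

# StaggerCheck.overlapping?(2)   #=> true  (2-second stagger: jobs collide)
# StaggerCheck.overlapping?(10)  #=> false (10-second stagger: effectively sequential)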

Why This Matters: Impact on Data Completeness

The race condition had a significant impact on the completeness of the scraped data. Instead of providing showtimes for all 7 days of the week, the scraper was often only returning data for a single day, severely limiting its usefulness.

  • ❌ Users only see showtimes for 1 day (likely today)
  • ❌ Missing 6/7ths of available movie showtimes
  • ❌ Freshness checker might incorrectly skip valid future showtimes
  • ❌ Incomplete event calendar for Krakow users

✅ Recommended Solution: Sequential Scheduling

To resolve the race condition, a simple yet effective solution was implemented: sequential scheduling. This involved increasing the stagger delay between the DayPageJob instances, ensuring that each job had enough time to complete before the next one started. This eliminated the overlap in execution and prevented jobs from interfering with each other's session state.

Fix: Increase Stagger Delay

The fix involved modifying a single line of code in the sync_job.ex file. By increasing the stagger delay, we ensured that each DayPageJob had enough time to complete before the next one started, effectively eliminating the race condition.

Change (sync_job.ex:157):

# Before (BROKEN - race condition):
delay_seconds = day_offset * Config.rate_limit()  # 2 seconds

# After (FIXED - sequential execution):
delay_seconds = day_offset * 10  # 10 seconds

Why 10 Seconds? A Matter of Timing

The 10-second delay was chosen based on the estimated execution time of each DayPageJob. This included the time taken for the POST request, the rate limit sleep, the GET request, and a buffer for processing. By ensuring a sufficient delay, we minimized the risk of jobs overlapping and overwriting each other's session state.

Each DayPageJob needs:

  • POST request: ~1-2 seconds
  • Rate limit sleep: 2 seconds (in code)
  • GET request: ~1-2 seconds
  • Processing buffer: ~3-4 seconds
  • Total: ~9-10 seconds

Expected Timeline (Fixed): A Smoother Process

With the increased stagger delay, the execution timeline of the DayPageJob instances looked much smoother. Each job had enough time to complete before the next one started, ensuring that the session state was correctly maintained.

T=0s:   Day 0 starts
T=6s:   Day 0 completes
T=10s:  Day 1 starts
T=16s:  Day 1 completes
T=20s:  Day 2 starts
...
T=60s:  Day 6 starts
T=66s:  Day 6 completes

Total scraping time: ~70 seconds (vs current broken ~14 seconds)

Trade-offs: Balancing Speed and Reliability

While the sequential scheduling solution effectively resolved the race condition, it also introduced some trade-offs. The most notable was the slight increase in the total scraping time. However, the reliability gained by eliminating the race condition far outweighed this minor drawback.

Pros:

  • Guaranteed no race condition
  • Simple one-line fix
  • No architectural changes needed
  • Still reasonable performance (70s total)
  • High confidence solution

Cons:

  • Slightly slower than ideal parallel execution
  • Doesn't leverage full parallelism potential

Alternative Solutions: Future Optimization

While the sequential scheduling solution was effective, it wasn't the only option. Several alternative solutions were considered for future optimization, each with its own set of pros and cons. These options offer potential for further improving the scraper's performance and efficiency.

Option 2: Separate Session Per Day

This approach involves each DayPageJob establishing its own session with the Kino Krakow website. This would eliminate the shared session state and prevent jobs from interfering with each other. However, it would also increase the number of HTTP requests, potentially impacting performance.

Approach: Each DayPageJob establishes its own session

  • Move establish_session() from SyncJob into DayPageJob
  • Each job gets own cookies + CSRF token
  • No shared state = no race condition (sketched below)

Pros:

  • ✅ True parallelism (all 7 days run concurrently)
  • ✅ Faster execution (~14 seconds)

Cons:

  • ❌ 7× session overhead (7 extra HTTP requests)
  • ❌ More complex implementation
  • ❌ Higher server load on Kino Krakow
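
For reference, a rough sketch of the per-job session establishment this option implies, assuming an HTTPoison-style client and Floki for HTML parsing. establish_own_session/0 is a hypothetical name, not the project's establish_session/0, and each call is one of the seven extra GETs counted above.

# Hypothetical per-job session setup for Option 2: fetch the program page once,
# collect Set-Cookie headers, and read the CSRF token from the page's meta tag.
defp establish_own_session do
  {:ok, %HTTPoison.Response{body: body, headers: headers}} =
    HTTPoison.get("https://www.kino.krakow.pl/cinema_program/by_movie")

  cookies =
    headers
    |> Enum.filter(fn {name, _} -> String.downcase(name) == "set-cookie" end)
    |> Enum.map_join("; ", fn {_, value} -> value |> String.split(";") |> hd() end)

  {:ok, doc} = Floki.parse_document(body)
  [csrf_token] = doc |> Floki.find(~s(meta[name="csrf-token"])) |> Floki.attribute("content")

  {cookies, csrf_token}
end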

Option 3: URL Parameter for Day Selection

This approach involves checking if the Kino Krakow website supports selecting the day via a URL parameter. If so, the scraper could simply pass the day as a parameter in the URL, eliminating the need for the POST request to /settings/set_day/{day}.

Approach: Check if website supports day parameter

  • Try: /cinema_program/by_movie?day=0
  • Or: /cinema_program/by_movie/2025-01-15

Pros:

  • ✅ Perfect parallelism
  • ✅ No session state needed
  • ✅ Simplest solution

Cons:

  • ❌ Unknown if Kino Krakow supports this
  • ❌ Requires testing/investigation (see the test sketch below)
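
One low-effort way to investigate is to fetch the program page with a candidate day parameter for two different days and compare the extracted showtimes. The ?day= parameter and the ShowtimeExtractor.extract/2 call are assumptions to verify, not documented Kino Krakow behavior.

# Quick investigation sketch for Option 3. If the parameter is honored,
# day 0 and day 1 should yield different showtime sets.
defp day_param_supported? do
  showtimes_for = fn day ->
    {:ok, %{body: html}} =
      HTTPoison.get("https://www.kino.krakow.pl/cinema_program/by_movie?day=#{day}")

    html |> ShowtimeExtractor.extract(day) |> Enum.map(& &1.datetime) |> Enum.uniq()
  end

  showtimes_for.(0) != showtimes_for.(1)
end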

Option 4: Oban Unique Jobs

This approach involves leveraging Oban's unique job constraint feature to ensure that only one DayPageJob runs at a time. This would prevent the race condition without requiring a large stagger delay. However, it would also result in sequential execution, similar to the chosen solution.

Approach: Use Oban's unique constraint

  • Only one DayPageJob runs at a time
  • Others wait in queue

Pros:

  • ✅ No race condition
  • ✅ Uses Oban native features

Cons:

  • ❌ Sequential execution (slower)
  • ❌ More complex configuration

Additional Findings: Beyond the Race Condition

While addressing the race condition was the primary focus, the analysis also uncovered several other areas for potential improvement. These findings offer opportunities to further enhance the scraper's performance, data quality, and maintainability.

Cinema GPS Coordinates: Enhancing Location Data

The current implementation extracts cinema data from the slug, which doesn't include GPS coordinates. This means that the scraper relies on geocoding services to determine the location of the cinemas, adding external API calls and potentially introducing inaccuracies. Scraping cinema detail pages could provide a more direct and accurate source of GPS coordinates.

Current: CinemaExtractor just formats data from slug:

cinema_data = CinemaExtractor.extract("", showtime["cinema_slug"])

Impact:

  • No GPS coordinates fetched from Kino Krakow
  • VenueProcessor must geocode using Google Maps/Nominatim
  • Adds external API calls for geocoding
  • Potential for incorrect/missing location data

Recommendation (P2): Consider scraping cinema detail pages:

  • GET /cinema/{cinema_slug}/info
  • Extract GPS coordinates if available
  • Reduce geocoding API usage
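
If the detail pages do expose coordinates, extraction could look roughly like the sketch below. The URL pattern, the data-lat/data-lng attributes, and fetch_cinema_coordinates/1 are all guesses that would need to be checked against the real markup; Floki is assumed for parsing.

# Hypothetical sketch of pulling GPS coordinates from a cinema detail page,
# assuming the page embeds them (e.g. in a map widget's data attributes).
def fetch_cinema_coordinates(cinema_slug) do
  {:ok, %{body: html}} = HTTPoison.get("https://www.kino.krakow.pl/cinema/#{cinema_slug}/info")
  {:ok, doc} = Floki.parse_document(html)

  with [lat] <- Floki.attribute(doc, "[data-lat]", "data-lat"),
       [lng] <- Floki.attribute(doc, "[data-lng]", "data-lng") do
    {:ok, %{latitude: String.to_float(lat), longitude: String.to_float(lng)}}
  else
    _ -> {:error, :no_coordinates}
  end
end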

TMDB Matching Quality: Improving Data Enrichment

The process of matching movies to TMDB is crucial for enriching the scraped data with details such as original titles, cast, and genres. However, the confidence levels used for auto-matching can impact the accuracy of the data. Implementing a review queue for medium-confidence matches could improve the overall quality of the movie data.

Success Rates:

  • ≥70% confidence: Auto-matched ✅
  • 60-69% confidence: Auto-matched (fallback) ✅
  • 50-59% confidence: Needs review → Event skipped ⚠️
  • <50% confidence: No match → Event skipped ❌

Impact:

  • Medium/low confidence matches result in lost events
  • No manual review workflow currently exists
  • Visible in Oban dashboard but requires manual intervention

Recommendation (P2):

  • Implement review queue for 50-69% matches
  • Add admin UI for manual TMDB matching
  • Track matching success rate metrics

Freshness Checking: Ensuring Up-to-Date Data

The EventFreshnessChecker is designed to prevent re-processing the same showtimes on every scrape. However, the race condition was interfering with this process, potentially leading to future showtimes being skipped. Resolving the race condition also resolves this secondary issue, ensuring that the scraper processes all relevant showtimes.

Current: EventFreshnessChecker filters recent showtimes

  • Prevents re-processing same showtime on every scrape
  • Uses last_seen_at timestamp
  • Configurable threshold (likely 24h)

Impact with race condition:

  • If all days get same data (Day 0), Days 1-6 showtimes never process
  • Freshness checker sees them as "already processed"
  • Future showtimes never make it to database

Fix: Race condition fix will resolve this secondary issue


Metrics & Observability: Keeping an Eye on Performance

Monitoring the scraper's performance and health is crucial for ensuring its long-term reliability. Implementing metrics and alerts can help identify issues early on and prevent data loss. The current metrics provide a good starting point, but there are several areas for improvement.

Current Metrics (Oban Dashboard): What We Can See

The Oban dashboard provides valuable insights into the scraper's performance, including job counts per state, individual job failures, TMDB matching failures, and processing time per job type. This data is essential for identifying and diagnosing issues.

Visible:

  • Job counts per state (completed, failed, retrying)
  • Individual job failures with error details
  • TMDB matching failures per movie
  • Processing time per job type

Recommended Additions: Enhancing Visibility

To further improve observability, several additional metrics were recommended, including day-level success metrics, unique date counts in scraped showtimes, TMDB matching success rates, and scraping coverage. These metrics would provide a more comprehensive view of the scraper's health and performance.

Missing:

  • Day-level success metrics (are all 7 days scraping?)
  • Unique date count in scraped showtimes
  • TMDB matching success rate percentage
  • Scraping coverage (% of expected showtimes)

Suggested implementations:

  1. Day Coverage Metric:

    showtimes
    |> Enum.map(&DateTime.to_date(&1.datetime))
    |> Enum.uniq()
    |> length()  # Should be 7
    
  2. TMDB Success Rate:

    SELECT
      COUNT(*) FILTER (WHERE state = 'completed') as matched,
      COUNT(*) FILTER (WHERE state = 'discarded') as failed,
      COUNT(*) as total
    FROM oban_jobs
    WHERE worker LIKE '%MovieDetailJob'
    
  3. Scraping Health Alert:

  • Alert if unique dates < 7
  • Alert if TMDB success rate < 80%
  • Alert if showtime count drops significantly
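
A simple post-scrape check could implement the first two alerts directly, as in the hedged sketch below. ScrapeHealthCheck, the datetime field, and the success-rate input are assumptions; wiring the warnings to a real alerting channel is left out.

# Sketch of a post-scrape health check for the alerts above (assumed names).
defmodule ScrapeHealthCheck do
  require Logger

  def check(showtimes, tmdb_success_rate) do
    unique_days =
      showtimes
      |> Enum.map(&DateTime.to_date(&1.datetime))
      |> Enum.uniq()
      |> length()

    if unique_days < 7, do: Logger.warning("Kino Krakow scrape covered only #{unique_days}/7 days")
    if tmdb_success_rate < 0.80, do: Logger.warning("TMDB match rate below 80%: #{tmdb_success_rate}")

    :ok
  end
end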

Implementation Checklist: A Step-by-Step Guide

To ensure that the fixes and improvements were implemented correctly, an implementation checklist was created. This checklist outlined the steps required for each phase of the project, from the critical bug fix to the long-term improvements.

Phase 1: Critical Bug Fix (P0)

  • [ ] Update sync_job.ex:157 to use 10-second stagger
  • [ ] Deploy and test with sample scrape
  • [ ] Verify all 7 days return different data
  • [ ] Add comment explaining race condition fix
  • [ ] Monitor Oban dashboard for 7 DayPageJob completions

Phase 2: Verification (P1)

  • [ ] Add logging to show date range in DayPageJob results
  • [ ] Add metrics for unique dates scraped
  • [ ] Create admin query to verify 7-day coverage
  • [ ] Document expected behavior in code comments

Phase 3: Long-term Improvements (P2)

  • [ ] Investigate Option 3 (URL parameter for day selection)
  • [ ] Consider scraping cinema pages for GPS coordinates
  • [ ] Implement manual review workflow for medium-confidence TMDB matches
  • [ ] Add alerting for scraping health metrics
  • [ ] Research separate session approach (Option 2) if performance becomes issue

Architecture Strengths: What Works Well

Despite the race condition, the Kino Krakow scraper boasts several architectural strengths that contribute to its overall effectiveness. These strengths should be preserved and built upon in future improvements.

Well-designed patterns:

  1. Distributed job architecture with clear separation of concerns
  2. External ID generation at extraction time (BandsInTown A+ pattern)
  3. Movie deduplication across days (one MovieDetailJob per unique movie)
  4. Freshness checking to avoid duplicate processing
  5. Granular visibility into failures via Oban dashboard
  6. TMDB confidence scoring for matching quality
  7. Proper rate limiting between HTTP requests
  8. Error handling and retry logic at each level

Efficient HTTP usage:

  • Only 15 + N requests to Kino Krakow (N = unique movies)
  • Zero redundant showtime fetches
  • Smart caching of movie data in database

Code References: Key Files and Lines

To facilitate future maintenance and improvements, a list of key code references was compiled. This list highlights the files and lines of code that are most relevant to the scraper's functionality and the identified issues.

File                      Purpose            Lines of Interest
sync_job.ex               Coordinator        Line 157: Race condition fix needed
day_page_job.ex           Day scraping       Lines 90-136: Day selection HTTP flow
movie_detail_job.ex       Movie matching     Lines 68-95: TMDB confidence logic
showtime_process_job.ex   Event processing   Lines 80-107: Movie lookup & retry logic
config.ex                 Configuration      Line 15: rate_limit (2 seconds)
showtime_extractor.ex     HTML parsing       Lines 33-43: Showtime extraction
transformer.ex            Event formatting   Lines 23-85: Unified format transform

Questions for Further Investigation: Unanswered Questions

During the analysis, several questions arose that warrant further investigation. These questions could lead to additional improvements and a deeper understanding of the scraper's behavior.

  1. Day selection testing: Can we verify that POST /settings/set_day actually works correctly?
  2. URL parameters: Does Kino Krakow support day selection via URL parameters?
  3. Cinema GPS: Are GPS coordinates available on cinema detail pages?
  4. TMDB matching rate: What percentage of movies successfully match?
  5. Parallel optimization: Is the sequential fix "fast enough" or should we pursue Option 2?

Conclusion: A Well-Architected Scraper with Room for Improvement

The Kino Krakow scraper is a well-architected system that faced a critical but easily fixable bug. The race condition in day selection was preventing the scraper from collecting complete data, but a simple adjustment to the stagger delay resolved the issue. This experience highlights the importance of careful attention to timing and concurrency when designing web scrapers.

Recommended action: Implement the one-line fix (10-second stagger) immediately, then monitor metrics to verify all 7 days are scraping correctly.

The architecture is sound and should continue to work well once this timing issue is resolved. Future optimizations (separate sessions, URL parameters) can be considered if performance becomes a concern.


Created: 2025-01-18

Status: Analysis Complete, Awaiting Implementation

Priority: P0 - Critical Bug Fix Required

In conclusion, understanding and resolving issues like race conditions are crucial for maintaining the effectiveness of web scrapers. The Kino Krakow scraper serves as a great example of how a well-designed system can be improved through careful analysis and targeted solutions.

For more information on web scraping and race conditions, you can check out resources like the OWASP (Open Web Application Security Project) guide on concurrency issues.