flickr-mail/AGENTS.md
Edward Betts 08f5128e8d Add interaction logging and tighten model NOT NULL constraints
Log searches (article/category) and message-generation events to a new
interaction_log table, capturing IP address and User-Agent.

Also apply NOT NULL constraints to Contribution, SentMessage, FlickrUpload,
and ThumbnailCache fields that are always populated, and remove stale
continue_token references from category.html.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-08 12:34:04 +00:00


Agent Guidelines for Flickr Mail

This document provides context for AI agents working on this codebase.

Project Overview

Flickr Mail is a Flask web application that helps users find photos on Flickr for Wikipedia articles and contact photographers to request Creative Commons licensing.

Architecture

  • main.py: Single-file Flask application containing all routes and logic
  • templates/: Jinja2 templates using Bootstrap 5 for styling
    • base.html: Base template with Bootstrap CSS/JS
    • combined.html: Main UI template for search, results, and message composition
    • message.jinja: Template for the permission request message body (with alternate text for non-free CC licenses)
    • category.html: Category search page with visited link styling
    • show_error.html: Error display template

Key Components

Flickr Search (search_flickr, parse_flickr_search_results)

Searches Flickr by scraping the search results page. The page embeds JSON data in a modelExport JavaScript variable which contains photo metadata.

  • Uses browser-like headers (BROWSER_HEADERS) to avoid blocks
  • Parses embedded JSON by counting braces (not regex) to handle nested structures
  • Accepts optional page parameter for pagination (25 photos per page)
  • Returns SearchResult dataclass containing photos and pagination metadata
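
The brace-counting approach mentioned above can be sketched as follows. This is a minimal illustration, not the code in main.py: the function name extract_json_object, the "modelExport:" marker, and the sample page fragment are assumptions made for the example.

```python
import json


def extract_json_object(text: str, start: int) -> dict:
    """Return the JSON object whose opening brace is at text[start]."""
    if text[start] != "{":
        raise ValueError("expected '{' at start index")
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Braces inside a string literal must not change the depth.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: parse exactly this slice.
                return json.loads(text[start : i + 1])
    raise ValueError("unbalanced braces")


# Hypothetical page fragment; the real Flickr page layout may differ.
page = 'var x = 1; modelExport: {"main": {"title": "a } b", "n": [1, 2]}}; more()'
brace_at = page.index("{", page.index("modelExport:"))
data = extract_json_object(page, brace_at)
```

Tracking string state matters because a photo title can itself contain a brace, which a naive counter (or a non-greedy regex) would treat as structure.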

SearchResult Dataclass

Contains search results with pagination info:

  • photos: List of FlickrPhoto instances
  • total_photos: Total number of matching photos
  • current_page: Current page number (1-indexed)
  • total_pages: Total number of pages (capped at 160 due to Flickr's 4000 result limit)
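
The 160-page cap follows from 4000 results at 25 photos per page. A small sketch of that arithmetic, with hypothetical constant and function names:

```python
import math

PHOTOS_PER_PAGE = 25   # page size stated above
MAX_RESULTS = 4000     # Flickr's result limit stated above
MAX_PAGES = MAX_RESULTS // PHOTOS_PER_PAGE  # 4000 / 25 = 160


def page_count(total_photos: int) -> int:
    """Number of reachable result pages for a given match count."""
    return min(math.ceil(total_photos / PHOTOS_PER_PAGE), MAX_PAGES)
```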

FlickrPhoto Dataclass

Represents a photo with:

  • id, title, path_alias, owner_nsid, username, realname
  • license (int): Flickr license code (0=ARR, 4=CC BY, 5=CC BY-SA, etc.)
  • thumb_url, medium_url: Static image URLs
  • flickr_url property: URL to photo page
  • license_name property: Human-readable license name

License Codes

Flickr uses numeric codes for licenses. Codes 1-6 are the CC 2.0 licenses; codes 11-16 are assumed to be their CC 4.0 equivalents (only 16 is confirmed so far).

Wikipedia-compatible (FREE_LICENSES): 4 (CC BY 2.0), 5 (CC BY-SA 2.0), 7 (No known copyright), 8 (US Government), 9 (CC0), 10 (Public Domain), 14 (CC BY 4.0), 15 (CC BY-SA 4.0).

Non-free CC (NONFREE_CC_LICENSES): 1 (CC BY-NC-SA 2.0), 2 (CC BY-NC 2.0), 3 (CC BY-NC-ND 2.0), 6 (CC BY-ND 2.0), 11-13 (4.0 NC variants), 16 (CC BY-ND 4.0).

Not compatible: 0 (All Rights Reserved).

For free licenses, the message page shows an UploadWizard link instead of a message. For non-free CC licenses, a tailored message explains which restrictions (NC/ND) prevent Wikipedia use.
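
As a rough sketch of how the three groups above drive the message page, the sets below copy the codes listed in this section. The classify_license helper and the RESTRICTIONS mapping are hypothetical, and restrictions are only filled in for codes whose license name is spelled out above.

```python
FREE_LICENSES = {4, 5, 7, 8, 9, 10, 14, 15}
NONFREE_CC_LICENSES = {1, 2, 3, 6, 11, 12, 13, 16}

# Restrictions for the codes named explicitly in this document.
# Codes 11-13 are omitted because their exact variants are not listed.
RESTRICTIONS = {
    1: ("NC",),       # CC BY-NC-SA 2.0
    2: ("NC",),       # CC BY-NC 2.0
    3: ("NC", "ND"),  # CC BY-NC-ND 2.0
    6: ("ND",),       # CC BY-ND 2.0
    16: ("ND",),      # CC BY-ND 4.0
}


def classify_license(code: int) -> str:
    """Return 'free', 'nonfree_cc', or 'not_compatible' for a license code."""
    if code in FREE_LICENSES:
        return "free"            # show the UploadWizard link
    if code in NONFREE_CC_LICENSES:
        return "nonfree_cc"      # render the tailored NC/ND message
    return "not_compatible"      # e.g. 0, All Rights Reserved
```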

URL Validation (is_valid_flickr_image_url)

Validates that image URLs passed via query params are from legitimate Flickr static image servers:

  • live.staticflickr.com
  • farm*.staticflickr.com
  • c1.staticflickr.com, c2.staticflickr.com
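
One way to express a host allowlist like this is shown below. It is a sketch under assumptions, not the body of is_valid_flickr_image_url: the helper name is hypothetical, farm* is assumed to mean "farm" followed by digits, and both http and https are accepted.

```python
import re
from urllib.parse import urlparse

# Anchored at both ends so "live.staticflickr.com.evil.example" cannot match.
_ALLOWED_HOST = re.compile(r"^(?:live|farm\d+|c[12])\.staticflickr\.com$")


def looks_like_flickr_image_url(url: str) -> bool:
    """True if the URL points at one of the listed static image hosts."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname  # already lowercased; None when there is no host
    return host is not None and _ALLOWED_HOST.match(host) is not None
```

Checking the parsed hostname rather than searching the raw string avoids accepting URLs that merely contain an allowed host somewhere in the path or query.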

NSID Lookup (flickr_usrename_to_nsid)

Converts a Flickr username/path alias to the NSID (internal user ID) needed for the Flickr mail URL. Scrapes the user's profile page for embedded params.

Commons Uploads Display

Shows recent Wikimedia Commons uploads on the home page, filtered to only those obtained via Flickr mail requests.

Database tables used by the app:

  • sent_messages: downloaded from Flickr sent mail, includes extracted Flickr URL and Wikipedia URL from message body
  • contributions: downloaded from Commons usercontribs
  • flickr_uploads: derived table built by update_flickr_uploads.py by matching Commons uploads to Flickr URLs
  • thumbnail_cache: cached Commons API thumbnail URLs (7-day TTL)
  • interaction_log: written by the web app to record searches and message generation events (see below)
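
The 7-day TTL on thumbnail_cache amounts to a simple age check. A minimal sketch, assuming the fetch time is stored as a Unix timestamp; the constant and function names are hypothetical:

```python
import time
from typing import Optional

THUMBNAIL_TTL_SECONDS = 7 * 24 * 60 * 60  # 7-day TTL from the table list above


def is_cache_entry_fresh(fetched_at: float, now: Optional[float] = None) -> bool:
    """True while a cached thumbnail URL is younger than the TTL."""
    if now is None:
        now = time.time()
    return now - fetched_at < THUMBNAIL_TTL_SECONDS
```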

Key functions:

  • get_recent_commons_uploads(): Loads uploads, filters by sent mail match, joins flickr_uploads with sent_messages, and fetches thumbnails from Commons API
  • normalize_flickr_url(): Normalizes URLs for matching (removes protocol, www, trailing slash)
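
The normalization described for normalize_flickr_url() could look roughly like this. The helper name normalize_url is hypothetical and the real function may handle more cases.

```python
import re


def normalize_url(url: str) -> str:
    """Drop the protocol, a leading 'www.', and any trailing slash."""
    url = re.sub(r"^https?://", "", url.strip(), flags=re.IGNORECASE)
    url = re.sub(r"^www\.", "", url, flags=re.IGNORECASE)
    return url.rstrip("/")
```

With this, "https://www.flickr.com/photos/someone/123/" and "http://flickr.com/photos/someone/123" compare equal, which is what the matching between sent mail and Commons uploads needs.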

CommonsUpload dataclass:

  • title, thumb_url, commons_url, flickr_url, creator, timestamp
  • wikipedia_url, creator_profile_url: Extracted from sent mail
  • is_wikidata_item property: Detects Q-number URLs
  • wiki_link_url, wiki_link_label: Handles Wikidata vs Wikipedia links

Maintenance script (update_flickr_uploads.py): Builds/updates flickr_uploads from contributions and links to sent_messages.

  • Scans file contributions containing UploadWizard in the comment
  • Supports both comment styles:
    • User created page with UploadWizard (older)
    • Uploaded a work by ... with UploadWizard (newer; often includes URL)
  • Extracts Flickr URL from contribution comment when present
  • Falls back to Commons extmetadata.Credit lookup when comment has no URL
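
Extracting a Flickr URL from a contribution comment might be sketched like this. The regular expression, the helper name, and the sample comments are assumptions for illustration; the script's actual pattern may differ, and the extmetadata.Credit fallback is not shown.

```python
import re
from typing import Optional

# Assumed shape of a Flickr photo page URL: /photos/<user>/<numeric id>/
FLICKR_PHOTO_URL = re.compile(r"https?://(?:www\.)?flickr\.com/photos/[^\s/]+/\d+/?")


def flickr_url_from_comment(comment: str) -> Optional[str]:
    """Return the Flickr photo URL in an UploadWizard comment, if any."""
    if "UploadWizard" not in comment:
        return None
    match = FLICKR_PHOTO_URL.search(comment)
    return match.group(0) if match else None


# Hypothetical comments in the two styles listed above.
newer = "Uploaded a work by Jane Doe from https://www.flickr.com/photos/janedoe/12345678/ with UploadWizard"
older = "User created page with UploadWizard"
```

A None result for an UploadWizard comment is the case where the script would fall back to the Commons metadata lookup.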

Interaction Logging (log_interaction)

The log_interaction() helper writes a row to interaction_log on each meaningful user action:

  • "search_article": user submits a Wikipedia article search (page 1 only, to avoid logging every pagination hit)
  • "search_category": user submits a Wikipedia category search
  • "generate_message": a non-free CC message is generated after clicking a photo

Each row captures: Unix timestamp, interaction_type, ip_address (prefers X-Forwarded-For for proxy setups), user_agent, query (article title or category name), and optionally flickr_url / wikipedia_url.

The table is created by init_db() (called via python3 -c "from flickr_mail.database import init_db; init_db()" or any of the maintenance scripts). The web app never calls init_db() itself.
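
The fields captured per row can be sketched with plain dictionaries. This is not the log_interaction() implementation: the helper names are hypothetical, the "timestamp" key is an assumed column name, and taking the first X-Forwarded-For entry is an assumption about how the proxy header is read.

```python
import time
from typing import Mapping, Optional


def client_ip(headers: Mapping[str, str], remote_addr: str) -> str:
    """Prefer the first X-Forwarded-For entry, else the socket address."""
    # A plain dict is case-sensitive; Flask's request.headers is not.
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    return first or remote_addr


def build_log_row(
    interaction_type: str,
    headers: Mapping[str, str],
    remote_addr: str,
    query: Optional[str] = None,
    flickr_url: Optional[str] = None,
    wikipedia_url: Optional[str] = None,
) -> dict:
    """Assemble the values for one interaction_log row."""
    return {
        "timestamp": int(time.time()),  # Unix timestamp (assumed column name)
        "interaction_type": interaction_type,
        "ip_address": client_ip(headers, remote_addr),
        "user_agent": headers.get("User-Agent"),
        "query": query,
        "flickr_url": flickr_url,
        "wikipedia_url": wikipedia_url,
    }
```

X-Forwarded-For is client-supplied unless a trusted proxy overwrites it, so the stored address is best treated as informational.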

Category Search (/category route)

Finds Wikipedia articles in a category that don't have images.

Key functions:

  • parse_category_input(): Accepts category name, Category: prefix, or full Wikipedia URL
  • get_articles_without_images(): Uses MediaWiki API with generator=categorymembers and prop=images for efficient batch queries
  • has_content_image(): Filters out non-content images (UI icons, logos) using NON_CONTENT_IMAGE_PATTERNS

The cat URL parameter is preserved through search results and message pages to allow back-navigation to the category.
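
A sketch of the input handling and the batch query described above. The helper names and the specific limit values are assumptions; the parameter names (generator, gcmtitle, gcmnamespace, gcmlimit, prop, imlimit) are standard MediaWiki API parameters, but which of them the app actually sets is not confirmed here.

```python
from urllib.parse import unquote, urlparse


def parse_category(text: str) -> str:
    """Accept a bare name, a 'Category:' prefixed name, or a full wiki URL."""
    text = text.strip()
    if text.startswith(("http://", "https://")):
        path = urlparse(text).path
        if path.startswith("/wiki/"):
            path = path[len("/wiki/"):]
        text = unquote(path)
    if text.lower().startswith("category:"):
        text = text[len("category:"):]
    return text.replace("_", " ").strip()


def category_query_params(category: str, limit: int = 50) -> dict:
    """Query parameters for listing category members with their images."""
    return {
        "action": "query",
        "format": "json",
        "generator": "categorymembers",
        "gcmtitle": f"Category:{category}",
        "gcmnamespace": "0",   # articles only
        "gcmlimit": str(limit),
        "prop": "images",
        "imlimit": "max",
    }
```

Using a generator means one request returns both the member pages and their image lists, instead of one image lookup per article.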

Previous Message Detection (get_previous_messages)

Checks the sent_messages database table for previous messages to a Flickr user. Matches by both display name and username (case-insensitive). Results are shown as an info alert on the message page.
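
The case-insensitive match on either name can be illustrated with an in-memory list. The real function queries the sent_messages table; the "recipient" field and the helper name here are hypothetical.

```python
from typing import Iterable, List, Optional


def find_previous_messages(
    messages: Iterable[dict], display_name: Optional[str], username: Optional[str]
) -> List[dict]:
    """Return messages whose recipient matches either name, ignoring case."""
    wanted = {name.casefold() for name in (display_name, username) if name}
    return [m for m in messages if m["recipient"].casefold() in wanted]


# Hypothetical rows standing in for sent_messages.
sent = [
    {"recipient": "Jane Doe", "subject": "Photo request"},
    {"recipient": "JANEDOE", "subject": "Follow-up"},
    {"recipient": "someone else", "subject": "Other"},
]
matches = find_previous_messages(sent, "jane doe", "janedoe")
```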

Request Flow

  1. User enters Wikipedia article title/URL → start() extracts article name. Alternatively, user searches by category via /category route.
  2. search_flickr() fetches and parses Flickr search results. Disambiguation suffixes like "(academic)" are removed for the search.
  3. Results are displayed as a clickable photo grid with license badges.
  4. User clicks photo → page reloads with flickr, img, license, and flickr_user params.
  5. If license is Wikipedia-compatible: show UploadWizard link.
  6. Otherwise: flickr_usrename_to_nsid() looks up the user's NSID, previous messages are checked, and the appropriate message template is rendered.
  7. User copies message and clicks link to Flickr's mail compose page.
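
Step 2 removes disambiguation suffixes before searching. A minimal sketch of that step, assuming only a trailing parenthetical is stripped; the helper name and regular expression are illustrative rather than taken from main.py.

```python
import re


def strip_disambiguation(title: str) -> str:
    """Remove a trailing parenthetical such as ' (academic)' from a title."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", title).strip()
```

Anchoring the pattern at the end of the string leaves titles alone when the parentheses are part of the name itself, as in "(I Can't Get No) Satisfaction".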

Testing Changes

Run the Flask app locally:

python3 main.py

Then visit http://localhost:5000/

Test search functionality:

from main import search_flickr
result = search_flickr("Big Ben", page=1)
print(f"{len(result.photos)} photos, {result.total_pages} pages")
print(result.photos[0].title, result.photos[0].license_name)

Data Sync Workflow

To refresh "recent Commons uploads obtained via Flickr mail", run scripts in this order:

  1. ./download_sent_mail.py
  2. ./download_commons_contributions.py
  3. ./update_flickr_uploads.py

Notes:

  • download_sent_mail.py reads Flickr auth cookies from download_sent_mail.local.json (cookies_str key). Copy download_sent_mail.example.json to create local config.
  • main.py does not populate flickr_uploads; it only reads from it.
  • download_commons_contributions.py intentionally stops after several consecutive fully-known API batches (overlap window) to avoid full-history scans while still catching shallow gaps.
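
The overlap-window behaviour in the last note can be sketched as a loop that stops after a run of batches with nothing new. The function name, the stop_after threshold, and the use of plain IDs are assumptions; the script's actual threshold is not stated here.

```python
from typing import Iterable, List, Set


def collect_new_items(
    batches: Iterable[List[int]], known_ids: Set[int], stop_after: int = 3
) -> List[int]:
    """Collect unseen items, stopping after consecutive fully-known batches."""
    new_items: List[int] = []
    consecutive_known = 0
    for batch in batches:
        fresh = [item for item in batch if item not in known_ids]
        if fresh:
            new_items.extend(fresh)
            consecutive_known = 0  # a gap was found, so reset the window
        else:
            consecutive_known += 1
            if consecutive_known >= stop_after:
                break  # assume everything older is already stored
    return new_items


# Newest-first batches; 6 is a shallow gap, 0 lies beyond the stop point.
batches = [[10, 9], [8, 7], [6, 5], [4, 3], [2, 1], [0]]
known = {9, 8, 7, 5, 4, 3, 2, 1}
found = collect_new_items(batches, known, stop_after=2)
```

Here the single known batch [8, 7] does not end the scan, so the gap at 6 is still picked up, while the two known batches that follow stop it before the full history is read.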

Potential Improvements

  • Cache search results to reduce Flickr requests
  • Add filtering by license type in search results
  • Handle Flickr rate limiting/blocks more gracefully
  • Add tests for the parsing logic
  • Add pagination for category search (continue token is already returned)
  • Confirm CC 4.0 license codes 11-15 (only 16 confirmed so far)