# Agent Guidelines for Flickr Mail

This document provides context for AI agents working on this codebase.

## Project Overview

Flickr Mail is a Flask web application that helps users find photos on Flickr for Wikipedia articles and contact photographers to request Creative Commons licensing.

## Architecture

- **main.py**: Single-file Flask application containing all routes and logic
- **templates/**: Jinja2 templates using Bootstrap 5 for styling
  - `base.html`: Base template with Bootstrap CSS/JS
  - `combined.html`: Main UI template for search, results, and message composition
  - `message.jinja`: Template for the permission request message body (with alternate text for non-free CC licenses)
  - `category.html`: Category search page with visited link styling
  - `show_error.html`: Error display template

## Key Components

### Flickr Search (`search_flickr`, `parse_flickr_search_results`)

Searches Flickr by scraping the search results page. The page embeds JSON data in a `modelExport` JavaScript variable which contains photo metadata.

- Uses browser-like headers (`BROWSER_HEADERS`) to avoid blocks
- Parses embedded JSON by counting braces (not regex) to handle nested structures
- Accepts optional `page` parameter for pagination (25 photos per page)
- Returns `SearchResult` dataclass containing photos and pagination metadata

### SearchResult Dataclass

Contains search results with pagination info:

- `photos`: List of `FlickrPhoto` instances
- `total_photos`: Total number of matching photos
- `current_page`: Current page number (1-indexed)
- `total_pages`: Total number of pages (capped at 160 due to Flickr's 4000 result limit)

### FlickrPhoto Dataclass

Represents a photo with:

- `id`, `title`, `path_alias`, `owner_nsid`, `username`, `realname`
- `license` (int): Flickr license code (0=ARR, 4=CC BY, 5=CC BY-SA, etc.)
- `thumb_url`, `medium_url`: Static image URLs
- `flickr_url` property: URL to photo page
- `license_name` property: Human-readable license name

### License Codes

Flickr uses numeric codes for licenses. Codes 1-6 are CC 2.0, codes 11-16 are CC 4.0 equivalents.

Wikipedia-compatible (`FREE_LICENSES`): 4 (CC BY 2.0), 5 (CC BY-SA 2.0), 7 (No known copyright), 8 (US Government), 9 (CC0), 10 (Public Domain), 14 (CC BY 4.0), 15 (CC BY-SA 4.0).

Non-free CC (`NONFREE_CC_LICENSES`): 1 (CC BY-NC-SA 2.0), 2 (CC BY-NC 2.0), 3 (CC BY-NC-ND 2.0), 6 (CC BY-ND 2.0), 11-13 (4.0 NC variants), 16 (CC BY-ND 4.0).

Not compatible: 0 (All Rights Reserved).

For free licenses, the message page shows an UploadWizard link instead of a message. For non-free CC licenses, a tailored message explains which restrictions (NC/ND) prevent Wikipedia use.

### URL Validation (`is_valid_flickr_image_url`)

Validates that image URLs passed via query params are from legitimate Flickr static image servers:

- `live.staticflickr.com`
- `farm*.staticflickr.com`
- `c1.staticflickr.com`, `c2.staticflickr.com`

### NSID Lookup (`flickr_usrename_to_nsid`)

Converts a Flickr username/path alias to the NSID (internal user ID) needed for the Flickr mail URL. Scrapes the user's profile page for embedded params.

### Commons Uploads Display

Shows recent Wikimedia Commons uploads on the home page, filtered to only those obtained via Flickr mail requests.
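
The filtering idea can be illustrated with a small sketch: normalize each Flickr URL (dropping the protocol, a leading `www.`, and any trailing slash) and keep only uploads whose normalized URL also appears in sent mail. This is not the app's actual code. The names `normalize_url_sketch` and `filter_uploads_sketch`, and the plain-dict upload records, are assumptions made for illustration.

```python
def normalize_url_sketch(url: str) -> str:
    """Strip the protocol, a leading 'www.', and trailing slashes.

    Hypothetical helper for illustration; not the app's real function.
    """
    url = url.strip()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    if url.startswith("www."):
        url = url[len("www."):]
    return url.rstrip("/")


def filter_uploads_sketch(uploads: list[dict], sent_urls: list[str]) -> list[dict]:
    """Keep only uploads whose Flickr URL matches a URL found in sent mail.

    `uploads` is assumed to be a list of dicts with a "flickr_url" key.
    """
    known = {normalize_url_sketch(u) for u in sent_urls}
    return [u for u in uploads if normalize_url_sketch(u["flickr_url"]) in known]
```

Comparing normalized strings means that `http://flickr.com/photos/example/123` and `https://www.flickr.com/photos/example/123/` are treated as the same photo page.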

**Database tables used by the app**:

- `sent_messages`: downloaded from Flickr sent mail, includes extracted Flickr URL and Wikipedia URL from message body
- `contributions`: downloaded from Commons `usercontribs`
- `flickr_uploads`: derived table built by `update_flickr_uploads.py` by matching Commons uploads to Flickr URLs
- `thumbnail_cache`: cached Commons API thumbnail URLs (7-day TTL)
- `interaction_log`: written by the web app to record searches and message generation events (see below)

**Key functions**:

- `get_recent_commons_uploads()`: Loads uploads, filters by sent mail match, joins `flickr_uploads` with `sent_messages`, and fetches thumbnails from Commons API
- `normalize_flickr_url()`: Normalizes URLs for matching (removes protocol, www, trailing slash)

**CommonsUpload dataclass**:

- `title`, `thumb_url`, `commons_url`, `flickr_url`, `creator`, `timestamp`
- `wikipedia_url`, `creator_profile_url`: Extracted from sent mail
- `is_wikidata_item` property: Detects Q-number URLs
- `wiki_link_url`, `wiki_link_label`: Handles Wikidata vs Wikipedia links

**Maintenance script** (`update_flickr_uploads.py`):

Builds/updates `flickr_uploads` from `contributions` and links to `sent_messages`.

- Scans file contributions containing `UploadWizard` in the comment
- Supports both comment styles:
  - `User created page with UploadWizard` (older)
  - `Uploaded a work by ... with UploadWizard` (newer; often includes URL)
- Extracts Flickr URL from contribution comment when present
- Falls back to Commons `extmetadata.Credit` lookup when comment has no URL

### Interaction Logging (`log_interaction`)

The `log_interaction()` helper writes a row to `interaction_log` on each meaningful user action:

- `"search_article"` – user submits a Wikipedia article search (page 1 only, to avoid logging every pagination hit)
- `"search_category"` – user submits a Wikipedia category search
- `"generate_message"` – a non-free CC message is generated after clicking a photo

Each row captures: Unix `timestamp`, `interaction_type`, `ip_address` (prefers `X-Forwarded-For` for proxy setups), `user_agent`, `query` (article title or category name), and optionally `flickr_url` / `wikipedia_url`.

The table is created by `init_db()` (called via `python3 -c "from flickr_mail.database import init_db; init_db()"` or any of the maintenance scripts). The web app never calls `init_db()` itself.

### Category Search (`/category` route)

Finds Wikipedia articles in a category that don't have images.

**Key functions**:

- `parse_category_input()`: Accepts category name, `Category:` prefix, or full Wikipedia URL
- `get_articles_without_images()`: Uses MediaWiki API with `generator=categorymembers` and `prop=images` for efficient batch queries
- `has_content_image()`: Filters out non-content images (UI icons, logos) using `NON_CONTENT_IMAGE_PATTERNS`

The `cat` URL parameter is preserved through search results and message pages to allow back-navigation to the category.

### Previous Message Detection (`get_previous_messages`)

Checks the `sent_messages` database table for previous messages to a Flickr user. Matches by both display name and username (case-insensitive). Results shown as an info alert on the message page.

## Request Flow

1. User enters Wikipedia article title/URL → `start()` extracts article name. Alternatively, user searches by category via `/category` route.
2. `search_flickr()` fetches and parses Flickr search results. Disambiguation suffixes like "(academic)" are removed for the search.
3. Results displayed as clickable photo grid with license badges.
4. User clicks photo → page reloads with `flickr`, `img`, `license`, and `flickr_user` params.
5. If license is Wikipedia-compatible: show UploadWizard link.
6. Otherwise: `flickr_usrename_to_nsid()` looks up the user's NSID, previous messages are checked, and the appropriate message template is rendered.
7. User copies message and clicks link to Flickr's mail compose page.

## Testing Changes

Run the Flask app locally:

```bash
python3 main.py
```

Then visit http://localhost:5000/

Test search functionality:

```python
from main import search_flickr
result = search_flickr("Big Ben", page=1)
print(f"{len(result.photos)} photos, {result.total_pages} pages")
print(result.photos[0].title, result.photos[0].license_name)
```

## Data Sync Workflow

To refresh "recent Commons uploads obtained via Flickr mail", run scripts in this order:

1. `./download_sent_mail.py`
2. `./download_commons_contributions.py`
3. `./update_flickr_uploads.py`

Notes:

- `download_sent_mail.py` reads Flickr auth cookies from `download_sent_mail.local.json` (`cookies_str` key). Copy `download_sent_mail.example.json` to create local config.
- `main.py` does not populate `flickr_uploads`; it only reads from it.
- `download_commons_contributions.py` intentionally stops after several consecutive fully-known API batches (overlap window) to avoid full-history scans while still catching shallow gaps.

## Potential Improvements

- Cache search results to reduce Flickr requests
- Add filtering by license type in search results
- Handle Flickr rate limiting/blocks more gracefully
- Add tests for the parsing logic
- Add pagination for category search (continue token is already returned)
- Confirm CC 4.0 license codes 11-15 (only 16 confirmed so far)
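
As a possible starting point for the "Add tests for the parsing logic" item, the brace-counting approach described under Flickr Search can be sketched in isolation. This is a minimal illustration, not the code in `main.py`. The function name `extract_json_object_sketch`, the assumption that the object starts at the first `{` after the `modelExport` marker, and the handling of braces inside string literals are all assumptions of this sketch.

```python
import json


def extract_json_object_sketch(html: str, marker: str = "modelExport") -> dict:
    """Return the first JSON object after `marker`, found by counting braces.

    Hypothetical helper: assumes the object begins at the first '{' that
    follows the marker. Braces inside JSON strings are ignored.
    """
    marker_pos = html.find(marker)
    if marker_pos == -1:
        raise ValueError("marker not found")
    start = html.find("{", marker_pos)
    if start == -1:
        raise ValueError("no opening brace after marker")

    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(html)):
        ch = html[i]
        if in_string:
            # Inside a string literal: only look for its closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: parse the balanced slice.
                return json.loads(html[start:i + 1])
    raise ValueError("unbalanced braces")
```

A test can then feed it a small hand-written page fragment with nested objects and a brace inside a string value, and assert on the parsed result without making any network requests.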