# Agent Guidelines for Flickr Mail

This document provides context for AI agents working on this codebase.

## Project Overview

Flickr Mail is a Flask web application that helps users find photos on Flickr for Wikipedia articles and contact photographers to request Creative Commons licensing.

## Architecture

- **main.py**: Single-file Flask application containing all routes and logic
- **templates/**: Jinja2 templates using Bootstrap 5 for styling
  - `base.html`: Base template with Bootstrap CSS/JS
  - `combined.html`: Main UI template for search, results, and message composition
  - `message.jinja`: Template for the permission request message body (with alternate text for non-free CC licenses)
  - `category.html`: Category search page with visited link styling
  - `show_error.html`: Error display template

## Key Components

### Flickr Search (`search_flickr`, `parse_flickr_search_results`)

Searches Flickr by scraping the search results page. The page embeds JSON data in a `modelExport` JavaScript variable which contains photo metadata.

- Uses browser-like headers (`BROWSER_HEADERS`) to avoid blocks
- Parses embedded JSON by counting braces (not regex) to handle nested structures
- Accepts optional `page` parameter for pagination (25 photos per page)
- Returns `SearchResult` dataclass containing photos and pagination metadata

### SearchResult Dataclass

Contains search results with pagination info:

- `photos`: List of `FlickrPhoto` instances
- `total_photos`: Total number of matching photos
- `current_page`: Current page number (1-indexed)
- `total_pages`: Total number of pages (capped at 160 due to Flickr's 4000 result limit)

### FlickrPhoto Dataclass

Represents a photo with:

- `id`, `title`, `path_alias`, `owner_nsid`, `username`, `realname`
- `license` (int): Flickr license code (0=ARR, 4=CC BY, 5=CC BY-SA, etc.)
- `thumb_url`, `medium_url`: Static image URLs
- `flickr_url` property: URL to photo page
- `license_name` property: Human-readable license name

### License Codes

Flickr uses numeric codes for licenses. Codes 1-6 are CC 2.0, codes 11-16 are CC 4.0 equivalents.

Wikipedia-compatible (`FREE_LICENSES`): 4 (CC BY 2.0), 5 (CC BY-SA 2.0), 7 (No known copyright), 8 (US Government), 9 (CC0), 10 (Public Domain), 14 (CC BY 4.0), 15 (CC BY-SA 4.0).

Non-free CC (`NONFREE_CC_LICENSES`): 1 (CC BY-NC-SA 2.0), 2 (CC BY-NC 2.0), 3 (CC BY-NC-ND 2.0), 6 (CC BY-ND 2.0), 11-13 (4.0 NC variants), 16 (CC BY-ND 4.0).

Not compatible: 0 (All Rights Reserved).

For free licenses, the message page shows an UploadWizard link instead of a message. For non-free CC licenses, a tailored message explains which restrictions (NC/ND) prevent Wikipedia use.

### URL Validation (`is_valid_flickr_image_url`)

Validates that image URLs passed via query params are from legitimate Flickr static image servers:

- `live.staticflickr.com`
- `farm*.staticflickr.com`
- `c1.staticflickr.com`, `c2.staticflickr.com`

### NSID Lookup (`flickr_usrename_to_nsid`)

Converts a Flickr username/path alias to the NSID (internal user ID) needed for the Flickr mail URL. Scrapes the user's profile page for embedded params.

### Commons Uploads Display

Shows recent Wikimedia Commons uploads on the home page, filtered to only those obtained via Flickr mail requests.

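
The filtering step can be pictured with a small sketch. It assumes each upload is a dict with a `flickr_url` key and that the sent mail index maps Flickr URLs to Wikipedia URLs. The helper names `normalize_url` and `filter_uploads` and the sample URLs are illustrative only; they are not the functions in `main.py`, whose behaviour may differ.

```python
import re


def normalize_url(url: str) -> str:
    """Drop the protocol, a leading www., and any trailing slash."""
    url = re.sub(r"^https?://", "", url.strip())
    url = re.sub(r"^www\.", "", url)
    return url.rstrip("/")


def filter_uploads(uploads: list[dict], sent_index: dict[str, str]) -> list[dict]:
    """Keep only uploads whose Flickr URL appears in the sent mail index."""
    # Normalize the index keys too, so both sides compare in the same form.
    normalized_index = {normalize_url(k): v for k, v in sent_index.items()}
    matched = []
    for upload in uploads:
        key = normalize_url(upload.get("flickr_url", ""))
        if key in normalized_index:
            # Attach the Wikipedia URL recorded in the matching sent message.
            matched.append({**upload, "wikipedia_url": normalized_index[key]})
    return matched


# Hypothetical sample data, made up for illustration.
uploads = [
    {"title": "File:A.jpg", "flickr_url": "https://www.flickr.com/photos/example/123/"},
    {"title": "File:B.jpg", "flickr_url": "https://www.flickr.com/photos/example/456/"},
]
sent_index = {"flickr.com/photos/example/123": "https://en.wikipedia.org/wiki/Example"}
recent = filter_uploads(uploads, sent_index)
```

Normalizing both sides before comparing means a URL copied from a message body with `http://` or a trailing slash still matches the form stored with the upload.
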
**Data files** (in `commons_contributions/`):

- `flickr_uploads.json`: List of Commons uploads from Flickr with metadata
- `thumbnail_cache.json`: Cached Commons API thumbnail URLs (7-day TTL)
- `sent_mail_index.json`: Index of sent mail messages (flickr_url → wikipedia_url)

**Key functions**:

- `build_sent_mail_index()`: Parses sent mail JSON files, extracts Flickr and Wikipedia URLs from message bodies, caches the index
- `get_recent_commons_uploads()`: Loads uploads, filters by sent mail match, fetches thumbnails from Commons API
- `normalize_flickr_url()`: Normalizes URLs for matching (removes protocol, www, trailing slash)

**CommonsUpload dataclass**:

- `title`, `thumb_url`, `commons_url`, `flickr_url`, `creator`, `timestamp`
- `wikipedia_url`, `creator_profile_url`: Extracted from sent mail
- `is_wikidata_item` property: Detects Q-number URLs
- `wiki_link_url`, `wiki_link_label`: Handles Wikidata vs Wikipedia links

**Maintenance script** (`update_flickr_uploads.py`):

Run to find Flickr uploads from UploadWizard contributions that don't have the Flickr URL in the edit comment. Queries Commons API for image metadata and checks the Credit field for Flickr URLs.

### Category Search (`/category` route)

Finds Wikipedia articles in a category that don't have images.

**Key functions**:

- `parse_category_input()`: Accepts category name, `Category:` prefix, or full Wikipedia URL
- `get_articles_without_images()`: Uses MediaWiki API with `generator=categorymembers` and `prop=images` for efficient batch queries
- `has_content_image()`: Filters out non-content images (UI icons, logos) using `NON_CONTENT_IMAGE_PATTERNS`

The `cat` URL parameter is preserved through search results and message pages to allow back-navigation to the category.

### Previous Message Detection (`get_previous_messages`)

Checks `sent_mail/messages_index.json` for previous messages to a Flickr user. Matches by both display name and username (case-insensitive).
Results are shown as an info alert on the message page.

## Request Flow

1. User enters Wikipedia article title/URL → `start()` extracts article name. Alternatively, user searches by category via the `/category` route.
2. `search_flickr()` fetches and parses Flickr search results. Disambiguation suffixes like "(academic)" are removed for the search.
3. Results are displayed as a clickable photo grid with license badges.
4. User clicks photo → page reloads with `flickr`, `img`, `license`, and `flickr_user` params.
5. If the license is Wikipedia-compatible: show UploadWizard link.
6. Otherwise: `flickr_usrename_to_nsid()` looks up the user's NSID, previous messages are checked, and the appropriate message template is rendered.
7. User copies the message and clicks the link to Flickr's mail compose page.

## Testing Changes

Run the Flask app locally:

```bash
python3 main.py
```

Then visit http://localhost:5000/

Test search functionality:

```python
from main import search_flickr

result = search_flickr("Big Ben", page=1)
print(f"{len(result.photos)} photos, {result.total_pages} pages")
# Guard against an empty result before indexing into the list.
if result.photos:
    print(result.photos[0].title, result.photos[0].license_name)
```

## Potential Improvements

- Cache search results to reduce Flickr requests
- Add filtering by license type in search results
- Handle Flickr rate limiting/blocks more gracefully
- Add tests for the parsing logic
- Add pagination for category search (continue token is already returned)
- Confirm CC 4.0 license codes 11-15 (only 16 confirmed so far)
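
As a possible starting point for the parsing tests listed above, the brace-counting approach described under Flickr Search can be sketched roughly as follows. This is not the code in `main.py`: the function name `extract_json_object` is hypothetical, and the sample HTML is invented and does not reflect the real shape of Flickr's `modelExport` data.

```python
import json


def extract_json_object(text: str, marker: str) -> dict:
    """Return the first JSON object that follows `marker` in `text`.

    Braces are counted to find the matching close; braces inside JSON
    strings are skipped so they do not affect the depth.
    """
    start = text.index("{", text.index(marker))
    depth = 0
    in_string = False
    escaped = False
    for pos in range(start, len(text)):
        char = text[pos]
        if in_string:
            if escaped:
                escaped = False  # this character was escaped; ignore it
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start : pos + 1])
    raise ValueError("unbalanced braces after marker")


# Invented sample: note the "}" inside the title string.
sample_html = (
    '<script>modelExport: {"main": {"photos": '
    '[{"id": "1", "title": "a } b"}]}};</script>'
)
data = extract_json_object(sample_html, "modelExport")
```

A test built on a fixture like this can check that nested objects and braces inside string values are handled without making any request to Flickr.
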