# Agent Guidelines for Flickr Mail
This document provides context for AI agents working on this codebase.
## Project Overview
Flickr Mail is a Flask web application that helps users find photos on Flickr
for Wikipedia articles and contact photographers to request Creative Commons
licensing.
## Architecture
- **main.py**: Single-file Flask application containing all routes and logic
- **templates/**: Jinja2 templates using Bootstrap 5 for styling
  - `base.html`: Base template with Bootstrap CSS/JS
  - `combined.html`: Main UI template for search, results, and message composition
  - `message.jinja`: Template for the permission request message body (with
    alternate text for non-free CC licenses)
  - `category.html`: Category search page with visited link styling
  - `show_error.html`: Error display template
## Key Components
### Flickr Search (`search_flickr`, `parse_flickr_search_results`)
Searches Flickr by scraping the search results page. The page embeds JSON data
in a `modelExport` JavaScript variable which contains photo metadata.
- Uses browser-like headers (`BROWSER_HEADERS`) to avoid being blocked by Flickr
- Parses embedded JSON by counting braces (not regex) to handle nested structures
- Accepts optional `page` parameter for pagination (25 photos per page)
- Returns `SearchResult` dataclass containing photos and pagination metadata
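
The brace-counting approach can be illustrated with a minimal sketch. It is not the code in `main.py`: the function name `extract_json_object` and the `modelExport:` marker format are assumptions made for illustration. The point is that a counter handles nested objects, and that braces inside string literals have to be skipped.

```python
import json


def extract_json_object(text: str, marker: str = "modelExport:") -> dict:
    """Return the first JSON object that follows `marker` in `text`.

    Hypothetical helper: the marker format is an assumption.
    """
    start = text.index("{", text.index(marker))
    depth = 0
    in_string = False
    escaped = False
    for pos in range(start, len(text)):
        ch = text[pos]
        if in_string:
            # Braces inside a string literal must not change the depth.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: parse the balanced span.
                return json.loads(text[start : pos + 1])
    raise ValueError("unbalanced braces after marker")
```

A non-greedy regex would stop at the first `}` and a greedy one would run to the last `}` on the page, which is why counting is the safer choice for nested data.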
### SearchResult Dataclass
Contains search results with pagination info:
- `photos`: List of `FlickrPhoto` instances
- `total_photos`: Total number of matching photos
- `current_page`: Current page number (1-indexed)
- `total_pages`: Total number of pages (capped at 160 due to Flickr's 4000 result limit)
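
The cap of 160 follows from the two numbers above: 4000 reachable results at 25 photos per page. A small sketch of that arithmetic, with the constant and function names chosen for illustration only:

```python
import math

PHOTOS_PER_PAGE = 25  # page size stated above
MAX_RESULTS = 4000    # Flickr's result limit stated above


def compute_total_pages(total_photos: int) -> int:
    """Number of pages that can actually be requested (hypothetical helper)."""
    reachable = min(total_photos, MAX_RESULTS)
    return math.ceil(reachable / PHOTOS_PER_PAGE)
```

For example, a search reporting one million matches still yields 160 pages, while 26 matches yield 2.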
### FlickrPhoto Dataclass
Represents a photo with:
- `id`, `title`, `path_alias`, `owner_nsid`, `username`, `realname`
- `license` (int): Flickr license code (0=ARR, 4=CC BY, 5=CC BY-SA, etc.)
- `thumb_url`, `medium_url`: Static image URLs
- `flickr_url` property: URL to photo page
- `license_name` property: Human-readable license name
### License Codes
Flickr uses numeric codes for licenses. Codes 1-6 are the CC 2.0 licenses;
codes 11-16 are their CC 4.0 equivalents.
Wikipedia-compatible (`FREE_LICENSES`): 4 (CC BY 2.0), 5 (CC BY-SA 2.0),
7 (No known copyright), 8 (US Government), 9 (CC0), 10 (Public Domain),
14 (CC BY 4.0), 15 (CC BY-SA 4.0).
Non-free CC (`NONFREE_CC_LICENSES`): 1 (CC BY-NC-SA 2.0), 2 (CC BY-NC 2.0),
3 (CC BY-NC-ND 2.0), 6 (CC BY-ND 2.0), 11-13 (4.0 NC variants),
16 (CC BY-ND 4.0).
Not compatible: 0 (All Rights Reserved).
For free licenses, the message page shows an UploadWizard link instead of a
message. For non-free CC licenses, a tailored message explains which
restrictions (NC/ND) prevent Wikipedia use.
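
The grouping above can be expressed as two sets and a lookup. The set names match the ones mentioned in this section and the codes are copied from it; the `classify_license` function and its return labels are hypothetical and only illustrate the three-way decision.

```python
# Codes copied from the lists above.
FREE_LICENSES = {4, 5, 7, 8, 9, 10, 14, 15}
NONFREE_CC_LICENSES = {1, 2, 3, 6, 11, 12, 13, 16}


def classify_license(code: int) -> str:
    """Map a Flickr license code to the action the message page takes.

    The return labels are assumptions made for this sketch.
    """
    if code in FREE_LICENSES:
        return "upload_wizard"       # already usable on Wikipedia
    if code in NONFREE_CC_LICENSES:
        return "nonfree_cc_message"  # explain the NC/ND restriction
    return "standard_message"        # e.g. 0, All Rights Reserved
```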
### URL Validation (`is_valid_flickr_image_url`)
Validates that image URLs passed via query params are from legitimate Flickr
static image servers:
- `live.staticflickr.com`
- `farm*.staticflickr.com`
- `c1.staticflickr.com`, `c2.staticflickr.com`
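
A host allowlist like the one above might be checked as follows. This is a sketch, not the implementation in `main.py`: it assumes `farm*` means `farm` followed by digits and that only `https` URLs are accepted.

```python
import re
from urllib.parse import urlparse

# Assumption: "farm*" means "farm" followed by one or more digits.
ALLOWED_HOST = re.compile(r"(live|farm\d+|c[12])\.staticflickr\.com")


def is_valid_flickr_image_url(url: str) -> bool:
    """Return True only for URLs on the allowed Flickr static hosts."""
    parsed = urlparse(url)
    if parsed.scheme != "https":  # assumption: https only
        return False
    # fullmatch rejects look-alikes such as live.staticflickr.com.evil.example
    return ALLOWED_HOST.fullmatch(parsed.hostname or "") is not None
```

Matching the parsed hostname in full, rather than searching the raw URL string, is what keeps a look-alike domain or a path containing `staticflickr.com` from passing.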
### NSID Lookup (`flickr_usrename_to_nsid`)
Converts a Flickr username/path alias to the NSID (internal user ID) needed
for the Flickr mail URL. Scrapes the user's profile page for embedded params.
### Commons Uploads Display
Shows recent Wikimedia Commons uploads on the home page, filtered to only
those obtained via Flickr mail requests.
**Database tables used by the app**:
- `sent_messages`: downloaded from Flickr sent mail, includes extracted Flickr
URL and Wikipedia URL from message body
- `contributions`: downloaded from Commons `usercontribs`
- `flickr_uploads`: derived table built by `update_flickr_uploads.py` by
matching Commons uploads to Flickr URLs
- `thumbnail_cache`: cached Commons API thumbnail URLs (7-day TTL)
**Key functions**:
- `get_recent_commons_uploads()`: Loads uploads, filters by sent mail match,
joins `flickr_uploads` with `sent_messages`, and fetches thumbnails from
Commons API
- `normalize_flickr_url()`: Normalizes URLs for matching (removes protocol, www, trailing slash)
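
The normalization described for `normalize_flickr_url()` could look roughly like this. It is a sketch of the three stated steps (protocol, `www`, trailing slash); the real function may handle more cases.

```python
def normalize_flickr_url(url: str) -> str:
    """Reduce a Flickr URL to a form suitable for equality matching."""
    url = url.strip()
    # Remove the protocol.
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    # Remove a leading "www.".
    if url.startswith("www."):
        url = url[len("www."):]
    # Remove any trailing slash.
    return url.rstrip("/")
```

With this, `https://www.flickr.com/photos/a/1/` and `http://flickr.com/photos/a/1` compare equal.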
**CommonsUpload dataclass**:
- `title`, `thumb_url`, `commons_url`, `flickr_url`, `creator`, `timestamp`
- `wikipedia_url`, `creator_profile_url`: Extracted from sent mail
- `is_wikidata_item` property: Detects Q-number URLs
- `wiki_link_url`, `wiki_link_label`: Handles Wikidata vs Wikipedia links
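
Detecting a Q-number URL can be done with a single pattern. The sketch below is a standalone function rather than the dataclass property, and the exact URL forms accepted by the real `is_wikidata_item` are an assumption.

```python
import re

# Assumption: Wikidata items are linked as https://www.wikidata.org/wiki/Q<digits>
WIKIDATA_ITEM_URL = re.compile(r"https?://(?:www\.)?wikidata\.org/wiki/Q\d+")


def is_wikidata_item(url: str) -> bool:
    """Return True if `url` points at a Wikidata item (Q-number) page."""
    return WIKIDATA_ITEM_URL.fullmatch(url.strip()) is not None
```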
**Maintenance script** (`update_flickr_uploads.py`):
Builds/updates `flickr_uploads` from `contributions` and links to
`sent_messages`.
- Scans file contributions containing `UploadWizard` in the comment
- Supports both comment styles:
  - `User created page with UploadWizard` (older)
  - `Uploaded a work by ... with UploadWizard` (newer; often includes URL)
- Extracts Flickr URL from contribution comment when present
- Falls back to Commons `extmetadata.Credit` lookup when comment has no URL
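
The comment-scanning step can be sketched as below. The function name and the URL pattern are assumptions for illustration; the `extmetadata.Credit` fallback is not shown, and a `None` return marks the case where it would be needed.

```python
import re
from typing import Optional

# Assumption: photo pages look like flickr.com/photos/<user>/<numeric id>
FLICKR_PHOTO_URL = re.compile(
    r"https?://(?:www\.)?flickr\.com/photos/[^/\s]+/\d+"
)


def flickr_url_from_comment(comment: str) -> Optional[str]:
    """Return the Flickr photo URL in an UploadWizard comment, if any."""
    if "UploadWizard" not in comment:
        return None  # not an UploadWizard contribution
    match = FLICKR_PHOTO_URL.search(comment)
    # None here means the caller should fall back to the Credit lookup.
    return match.group(0) if match else None
```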
### Category Search (`/category` route)
Finds Wikipedia articles in a category that don't have images.
**Key functions**:
- `parse_category_input()`: Accepts category name, `Category:` prefix, or full
Wikipedia URL
- `get_articles_without_images()`: Uses MediaWiki API with
`generator=categorymembers` and `prop=images` for efficient batch queries
- `has_content_image()`: Filters out non-content images (UI icons, logos) using
`NON_CONTENT_IMAGE_PATTERNS`
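
The three input forms accepted by `parse_category_input()` could be handled along these lines. This is a sketch: whether the real function returns the name with or without the `Category:` prefix is an assumption (here it is stripped).

```python
from urllib.parse import unquote, urlparse


def parse_category_input(value: str) -> str:
    """Return a bare category name from a name, prefixed name, or URL."""
    value = value.strip()
    if value.startswith(("http://", "https://")):
        # Full URL: take the page title after /wiki/ and decode %-escapes.
        path = urlparse(value).path
        value = unquote(path.rsplit("/wiki/", 1)[-1])
    if value.lower().startswith("category:"):
        value = value[len("category:"):]
    # Page titles in URLs use underscores in place of spaces.
    return value.replace("_", " ").strip()
```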
The `cat` URL parameter is preserved through search results and message pages
to allow back-navigation to the category.
### Previous Message Detection (`get_previous_messages`)
Checks the `sent_messages` database table for previous messages to a Flickr user.
Matches against both the display name and the username (case-insensitive).
Results are shown as an info alert on the message page.
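
A case-insensitive lookup of this kind might be written as follows. The schema is not documented here, so the column names (`recipient`, `subject`, `sent_at`) and the use of the `sqlite3` module are assumptions made purely for illustration.

```python
import sqlite3


def get_previous_messages(conn: sqlite3.Connection,
                          display_name: str,
                          username: str) -> list:
    """Return earlier messages sent to either name, oldest first.

    Column names are hypothetical; adapt them to the real schema.
    """
    wanted = sorted({n.lower() for n in (display_name, username) if n})
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = (
        "SELECT subject, sent_at FROM sent_messages "
        f"WHERE LOWER(recipient) IN ({placeholders}) "
        "ORDER BY sent_at"
    )
    return conn.execute(sql, wanted).fetchall()
```

Note that SQLite's built-in `LOWER()` only folds ASCII letters, so names with non-ASCII characters would need a different comparison.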
## Request Flow
1. User enters Wikipedia article title/URL → `start()` extracts article name.
Alternatively, user searches by category via `/category` route.
2. `search_flickr()` fetches and parses Flickr search results.
Disambiguation suffixes like "(academic)" are removed for the search.
3. Results displayed as clickable photo grid with license badges.
4. User clicks photo → page reloads with `flickr`, `img`, `license`, and
`flickr_user` params.
5. If license is Wikipedia-compatible: show UploadWizard link.
6. Otherwise: `flickr_usrename_to_nsid()` looks up the user's NSID, previous
messages are checked, and the appropriate message template is rendered.
7. User copies message and clicks link to Flickr's mail compose page.
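
The disambiguation handling in step 2 can be sketched as a single substitution. The function name is hypothetical, and the real code may treat more title forms than a trailing parenthetical.

```python
import re


def strip_disambiguation(title: str) -> str:
    """Drop a trailing parenthetical such as "(academic)" from a title."""
    stripped = re.sub(r"\s*\([^()]*\)$", "", title).strip()
    # Keep the original if the whole title was a parenthetical.
    return stripped or title
```

Only a parenthetical at the end of the title is removed, so a title that merely starts with one is left unchanged.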
## Testing Changes
Run the Flask app locally:
```bash
python3 main.py
```
Then visit http://localhost:5000/
Test search functionality:
```python
from main import search_flickr

result = search_flickr("Big Ben", page=1)
print(f"{len(result.photos)} photos, {result.total_pages} pages")
# Guard against an empty result set before indexing into it.
if result.photos:
    first = result.photos[0]
    print(first.title, first.license_name)
```
## Data Sync Workflow
To refresh "recent Commons uploads obtained via Flickr mail", run scripts in
this order:
1. `./download_sent_mail.py`
2. `./download_commons_contributions.py`
3. `./update_flickr_uploads.py`
Notes:
- `download_sent_mail.py` reads Flickr auth cookies from
`download_sent_mail.local.json` (`cookies_str` key). Copy
`download_sent_mail.example.json` to create local config.
- `main.py` does not populate `flickr_uploads`; it only reads from it.
- `download_commons_contributions.py` intentionally stops after several
consecutive fully-known API batches (overlap window) to avoid full-history
scans while still catching shallow gaps.
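
The overlap-window stop rule for `download_commons_contributions.py` can be sketched as a generator. The function name, the threshold of 3, and the use of `revid` as the identity key are assumptions; batches are taken to arrive newest first.

```python
from typing import Iterable, Iterator


def new_contributions(batches: Iterable[list],
                      known_ids: set,
                      overlap_batches: int = 3) -> Iterator[dict]:
    """Yield unseen contributions until several batches in a row are known.

    `overlap_batches` is an assumed value for "several".
    """
    consecutive_known = 0
    for batch in batches:
        fresh = [c for c in batch if c["revid"] not in known_ids]
        if fresh:
            # A gap was found, so restart the overlap count.
            consecutive_known = 0
            yield from fresh
        else:
            consecutive_known += 1
            if consecutive_known >= overlap_batches:
                return  # deep enough into known history; stop paging
```

Stopping at the first fully-known batch would miss a gap just behind it, while never stopping would rescan the full history on every run; requiring a few known batches in a row is the compromise.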
## Potential Improvements
- Cache search results to reduce Flickr requests
- Add filtering by license type in search results
- Handle Flickr rate limiting/blocks more gracefully
- Add tests for the parsing logic
- Add pagination for category search (continue token is already returned)
- Confirm CC 4.0 license codes 11-15 (only 16 confirmed so far)