# Agent Guidelines for Flickr Mail

This document provides context for AI agents working on this codebase.

## Project Overview

Flickr Mail is a Flask web application that helps users find photos on Flickr
for Wikipedia articles and contact photographers to request Creative Commons
licensing.

## Architecture

- **main.py**: Single-file Flask application containing all routes and logic
- **templates/**: Jinja2 templates using Bootstrap 5 for styling
  - `base.html`: Base template with Bootstrap CSS/JS
  - `combined.html`: Main UI template for search, results, and message composition
  - `message.jinja`: Template for the permission request message body (with
    alternate text for non-free CC licenses)
  - `category.html`: Category search page with visited link styling
  - `show_error.html`: Error display template

## Key Components

### Flickr Search (`search_flickr`, `parse_flickr_search_results`)

Searches Flickr by scraping the search results page. The page embeds JSON data
in a `modelExport` JavaScript variable which contains photo metadata.

- Uses browser-like headers (`BROWSER_HEADERS`) to avoid blocks
- Parses embedded JSON by counting braces (not regex) to handle nested structures
- Accepts optional `page` parameter for pagination (25 photos per page)
- Returns `SearchResult` dataclass containing photos and pagination metadata

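The brace-counting approach mentioned above can be sketched roughly as follows. This is an illustration, not the code in `main.py`: the helper name `extract_json_object`, the sample page fragment, and the handling of braces inside string literals are all assumptions.

```python
import json


def extract_json_object(text: str, start: int) -> str:
    """Return the balanced {...} substring of text that begins at index start.

    Hypothetical helper; the project's own parser may differ in detail.
    """
    if text[start] != "{":
        raise ValueError("start must point at an opening brace")
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a string literal, braces are data, not structure.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    raise ValueError("unbalanced braces")


# Made-up page fragment; the real Flickr markup may look different.
page = 'modelExport: {"main": {"photos": [{"id": "1", "title": "a } b"}]}}, auth: {}'
start = page.index("{", page.index("modelExport"))
data = json.loads(extract_json_object(page, start))
print(data["main"]["photos"][0]["title"])
```

Counting depth rather than matching with a regex means nested objects end at the correct closing brace, and tracking string state keeps a `}` inside a photo title from ending the object early.
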
### SearchResult Dataclass

Contains search results with pagination info:
- `photos`: List of `FlickrPhoto` instances
- `total_photos`: Total number of matching photos
- `current_page`: Current page number (1-indexed)
- `total_pages`: Total number of pages (capped at 160 due to Flickr's 4000 result limit)

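The 160-page cap follows from 4000 results at 25 photos per page. A minimal sketch of that arithmetic, assuming a hypothetical helper `compute_total_pages` (the constant names are also illustrative):

```python
PHOTOS_PER_PAGE = 25  # page size stated above
MAX_RESULTS = 4000    # Flickr's result limit stated above


def compute_total_pages(total_photos: int) -> int:
    """Hypothetical helper: number of pages, capped at the result limit."""
    pages = -(-total_photos // PHOTOS_PER_PAGE)  # ceiling division
    return min(pages, MAX_RESULTS // PHOTOS_PER_PAGE)


print(compute_total_pages(26), compute_total_pages(1_000_000))
```
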
### FlickrPhoto Dataclass

Represents a photo with:
- `id`, `title`, `path_alias`, `owner_nsid`, `username`, `realname`
- `license` (int): Flickr license code (0=ARR, 4=CC BY, 5=CC BY-SA, etc.)
- `thumb_url`, `medium_url`: Static image URLs
- `flickr_url` property: URL to photo page
- `license_name` property: Human-readable license name

### License Codes

Flickr uses numeric codes for licenses. Codes 1-6 are CC 2.0, codes 11-16 are
CC 4.0 equivalents.

Wikipedia-compatible (`FREE_LICENSES`): 4 (CC BY 2.0), 5 (CC BY-SA 2.0),
7 (No known copyright), 8 (US Government), 9 (CC0), 10 (Public Domain),
14 (CC BY 4.0), 15 (CC BY-SA 4.0).

Non-free CC (`NONFREE_CC_LICENSES`): 1 (CC BY-NC-SA 2.0), 2 (CC BY-NC 2.0),
3 (CC BY-NC-ND 2.0), 6 (CC BY-ND 2.0), 11-13 (4.0 NC variants),
16 (CC BY-ND 4.0).

Not compatible: 0 (All Rights Reserved).

For free licenses, the message page shows an UploadWizard link instead of a
message. For non-free CC licenses, a tailored message explains which
restrictions (NC/ND) prevent Wikipedia use.

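The grouping above maps naturally onto two sets and a lookup. The set contents below come from the lists in this section; the function `classify_license` and its return labels are hypothetical, and the actual container types in `main.py` may differ.

```python
FREE_LICENSES = {4, 5, 7, 8, 9, 10, 14, 15}
NONFREE_CC_LICENSES = {1, 2, 3, 6, 11, 12, 13, 16}


def classify_license(code: int) -> str:
    """Hypothetical helper: bucket a Flickr license code into one of three groups."""
    if code in FREE_LICENSES:
        return "free"  # message page shows an UploadWizard link
    if code in NONFREE_CC_LICENSES:
        return "nonfree_cc"  # tailored message about NC/ND restrictions
    return "not_compatible"  # e.g. 0, All Rights Reserved


for code in (4, 2, 0):
    print(code, classify_license(code))
```
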
### URL Validation (`is_valid_flickr_image_url`)

Validates that image URLs passed via query params are from legitimate Flickr
static image servers:
- `live.staticflickr.com`
- `farm*.staticflickr.com`
- `c1.staticflickr.com`, `c2.staticflickr.com`

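A host allowlist like the one above could be checked along these lines. This is a sketch rather than the project's implementation: it assumes `farm*` means `farm` followed by digits, and that only `http`/`https` URLs are accepted.

```python
import re
from urllib.parse import urlparse

# Assumption: "farm*" is "farm" plus one or more digits.
ALLOWED_HOST_RE = re.compile(r"(live|farm\d+|c[12])\.staticflickr\.com")


def is_valid_flickr_image_url(url: str) -> bool:
    """Sketch: accept only http(s) URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # fullmatch rejects look-alike hosts such as live.staticflickr.com.evil.example
    return ALLOWED_HOST_RE.fullmatch(parsed.hostname or "") is not None


print(is_valid_flickr_image_url("https://live.staticflickr.com/65535/1_abc.jpg"))
print(is_valid_flickr_image_url("https://live.staticflickr.com.evil.example/x.jpg"))
```

Matching on the parsed hostname, rather than searching the whole URL string, avoids accepting URLs that merely contain an allowed host name in their path or query.
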
### NSID Lookup (`flickr_usrename_to_nsid`)

Converts a Flickr username/path alias to the NSID (internal user ID) needed
for the Flickr mail URL. Scrapes the user's profile page for embedded params.

### Commons Uploads Display

Shows recent Wikimedia Commons uploads on the home page, filtered to only
those obtained via Flickr mail requests.

**Database tables used by the app**:
- `sent_messages`: downloaded from Flickr sent mail, includes extracted Flickr
  URL and Wikipedia URL from message body
- `contributions`: downloaded from Commons `usercontribs`
- `flickr_uploads`: derived table built by `update_flickr_uploads.py` by
  matching Commons uploads to Flickr URLs
- `thumbnail_cache`: cached Commons API thumbnail URLs (7-day TTL)
- `interaction_log`: written by the web app to record searches and message
  generation events (see below)

**Key functions**:
- `get_recent_commons_uploads()`: Loads uploads, filters by sent mail match,
  joins `flickr_uploads` with `sent_messages`, and fetches thumbnails from
  Commons API
- `normalize_flickr_url()`: Normalizes URLs for matching (removes protocol, www, trailing slash)

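The normalization described for `normalize_flickr_url()` might look roughly like this. It is a sketch of the three stated steps only; whether the real function also lowercases or strips query strings is not specified here, so this version does neither.

```python
def normalize_flickr_url(url: str) -> str:
    """Sketch: drop the protocol, a leading "www.", and trailing slashes."""
    url = url.strip()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    if url.startswith("www."):
        url = url[len("www."):]
    return url.rstrip("/")


# Both spellings reduce to the same key, so they match when joined.
a = normalize_flickr_url("https://www.flickr.com/photos/someone/123/")
b = normalize_flickr_url("http://flickr.com/photos/someone/123")
print(a == b, a)
```
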
**CommonsUpload dataclass**:
- `title`, `thumb_url`, `commons_url`, `flickr_url`, `creator`, `timestamp`
- `wikipedia_url`, `creator_profile_url`: Extracted from sent mail
- `is_wikidata_item` property: Detects Q-number URLs
- `wiki_link_url`, `wiki_link_label`: Handles Wikidata vs Wikipedia links

**Maintenance script** (`update_flickr_uploads.py`):
Builds/updates `flickr_uploads` from `contributions` and links to
`sent_messages`.
- Scans file contributions containing `UploadWizard` in the comment
- Supports both comment styles:
  - `User created page with UploadWizard` (older)
  - `Uploaded a work by ... with UploadWizard` (newer; often includes URL)
- Extracts Flickr URL from contribution comment when present
- Falls back to Commons `extmetadata.Credit` lookup when comment has no URL

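Pulling a Flickr URL out of a contribution comment can be sketched as below. The regular expression, the helper name `flickr_url_from_comment`, and the sample comments are assumptions for illustration; the script's actual pattern may be stricter or looser.

```python
import re
from typing import Optional

# Assumed shape of a Flickr photo page URL: /photos/<user>/<numeric id>
FLICKR_PHOTO_URL_RE = re.compile(
    r"https?://(?:www\.)?flickr\.com/photos/[^/\s]+/\d+"
)


def flickr_url_from_comment(comment: str) -> Optional[str]:
    """Hypothetical helper: return the Flickr URL in an UploadWizard comment."""
    if "UploadWizard" not in comment:
        return None
    match = FLICKR_PHOTO_URL_RE.search(comment)
    # None signals that the caller should fall back to extmetadata.Credit.
    return match.group(0) if match else None


# Illustrative comments, one per style.
newer = ("Uploaded a work by Jane Doe from "
         "https://www.flickr.com/photos/janedoe/12345678901/ with UploadWizard")
older = "User created page with UploadWizard"
print(flickr_url_from_comment(newer))
print(flickr_url_from_comment(older))
```
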
### Interaction Logging (`log_interaction`)

The `log_interaction()` helper writes a row to `interaction_log` on each
meaningful user action:

- `"search_article"` – user submits a Wikipedia article search (page 1 only,
  to avoid logging every pagination hit)
- `"search_category"` – user submits a Wikipedia category search
- `"generate_message"` – a non-free CC message is generated after clicking a photo

Each row captures: Unix `timestamp`, `interaction_type`, `ip_address`
(prefers `X-Forwarded-For` for proxy setups), `user_agent`, `query` (article
title or category name), and optionally `flickr_url` / `wikipedia_url`.

The table is created by `init_db()` (called via `python3 -c "from
flickr_mail.database import init_db; init_db()"` or any of the maintenance
scripts). The web app never calls `init_db()` itself.

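The row contents described above could be assembled along these lines. This sketch is framework-free so it runs standalone: `client_ip` and `build_log_row` are hypothetical names, headers are passed as a plain dict (Flask's `request.headers` is case-insensitive, a plain dict is not), and taking the first `X-Forwarded-For` entry is an assumption about how the preference is implemented.

```python
import time
from typing import Dict, Mapping, Optional


def client_ip(headers: Mapping[str, str], remote_addr: Optional[str]) -> Optional[str]:
    """Prefer the first X-Forwarded-For entry, else the socket address."""
    forwarded = headers.get("X-Forwarded-For", "")
    if forwarded:
        return forwarded.split(",")[0].strip()
    return remote_addr


def build_log_row(
    interaction_type: str,
    headers: Mapping[str, str],
    remote_addr: Optional[str],
    query: Optional[str] = None,
    flickr_url: Optional[str] = None,
    wikipedia_url: Optional[str] = None,
) -> Dict[str, object]:
    """Hypothetical helper: build a dict keyed by the column names listed above."""
    return {
        "timestamp": int(time.time()),  # Unix timestamp
        "interaction_type": interaction_type,
        "ip_address": client_ip(headers, remote_addr),
        "user_agent": headers.get("User-Agent"),
        "query": query,
        "flickr_url": flickr_url,
        "wikipedia_url": wikipedia_url,
    }


row = build_log_row(
    "search_article",
    {"X-Forwarded-For": "203.0.113.7, 10.0.0.1", "User-Agent": "ExampleBrowser/1.0"},
    "10.0.0.1",
    query="Big Ben",
)
print(row["ip_address"], row["interaction_type"])
```
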
### Category Search (`/category` route)

Finds Wikipedia articles in a category that don't have images.

**Key functions**:
- `parse_category_input()`: Accepts category name, `Category:` prefix, or full
  Wikipedia URL
- `get_articles_without_images()`: Uses MediaWiki API with
  `generator=categorymembers` and `prop=images` for efficient batch queries
- `has_content_image()`: Filters out non-content images (UI icons, logos) using
  `NON_CONTENT_IMAGE_PATTERNS`

The `cat` URL parameter is preserved through search results and message pages
to allow back-navigation to the category.

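The three input forms accepted by `parse_category_input()` could be handled as in the sketch below. The return convention (bare category name without the `Category:` prefix, `None` for unrecognized input) is an assumption; the real function may return something different.

```python
from typing import Optional
from urllib.parse import unquote, urlparse


def parse_category_input(text: str) -> Optional[str]:
    """Sketch: reduce a name, a Category:-prefixed name, or a wiki URL to a bare name."""
    text = text.strip()
    if text.startswith(("http://", "https://")):
        path = urlparse(text).path
        if not path.startswith("/wiki/"):
            return None  # not an article-style Wikipedia URL
        text = unquote(path[len("/wiki/"):])
    if text.startswith("Category:"):
        text = text[len("Category:"):]
    name = text.replace("_", " ").strip()
    return name or None


for raw in (
    "English novelists",
    "Category:English novelists",
    "https://en.wikipedia.org/wiki/Category:English_novelists",
):
    print(parse_category_input(raw))
```
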
### Previous Message Detection (`get_previous_messages`)

Checks the `sent_messages` database table for previous messages to a Flickr user.
Matches by both display name and username (case-insensitive). Results are shown
as an info alert on the message page.

## Request Flow

1. User enters Wikipedia article title/URL → `start()` extracts article name.
   Alternatively, user searches by category via `/category` route.
2. `search_flickr()` fetches and parses Flickr search results.
   Disambiguation suffixes like "(academic)" are removed for the search.
3. Results are displayed as a clickable photo grid with license badges.
4. User clicks photo → page reloads with `flickr`, `img`, `license`, and
   `flickr_user` params.
5. If license is Wikipedia-compatible: show UploadWizard link.
6. Otherwise: `flickr_usrename_to_nsid()` looks up the user's NSID, previous
   messages are checked, and the appropriate message template is rendered.
7. User copies message and clicks link to Flickr's mail compose page.

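Step 2's removal of disambiguation suffixes can be illustrated with a short sketch. The helper name `strip_disambiguation` and the exact pattern are assumptions; the app may handle edge cases differently.

```python
import re

# Assumption: a disambiguation suffix is a trailing "(...)" group.
DISAMBIGUATION_RE = re.compile(r"\s*\([^()]*\)\Z")


def strip_disambiguation(title: str) -> str:
    """Hypothetical helper: drop a trailing parenthesised qualifier from a title."""
    return DISAMBIGUATION_RE.sub("", title)


print(strip_disambiguation("John Smith (academic)"))
print(strip_disambiguation("Big Ben"))
```
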
## Testing Changes

Run the Flask app locally:
```bash
python3 main.py
```
Then visit http://localhost:5000/

Test search functionality:
```python
from main import search_flickr

result = search_flickr("Big Ben", page=1)
print(f"{len(result.photos)} photos, {result.total_pages} pages")
if result.photos:  # guard against an empty result set
    print(result.photos[0].title, result.photos[0].license_name)
```

## Data Sync Workflow

To refresh "recent Commons uploads obtained via Flickr mail", run scripts in
this order:

1. `./download_sent_mail.py`
2. `./download_commons_contributions.py`
3. `./update_flickr_uploads.py`

Notes:
- `download_sent_mail.py` reads Flickr auth cookies from
  `download_sent_mail.local.json` (`cookies_str` key). Copy
  `download_sent_mail.example.json` to create local config.
- `main.py` does not populate `flickr_uploads`; it only reads from it.
- `download_commons_contributions.py` intentionally stops after several
  consecutive fully-known API batches (overlap window) to avoid full-history
  scans while still catching shallow gaps.

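The overlap-window stop condition in the last note can be sketched in isolation. Everything here is illustrative: `collect_new_ids`, the integer IDs, and the threshold of 3 batches stand in for whatever the script actually uses.

```python
from typing import Iterable, List, Set


def collect_new_ids(
    batches: Iterable[List[int]], known: Set[int], overlap_batches: int = 3
) -> List[int]:
    """Sketch: gather unseen IDs, stopping after N consecutive fully-known batches."""
    new_ids: List[int] = []
    consecutive_known = 0
    for batch in batches:
        fresh = [item for item in batch if item not in known]
        if fresh:
            new_ids.extend(fresh)
            consecutive_known = 0  # a gap was found, so restart the window
        else:
            consecutive_known += 1
            if consecutive_known >= overlap_batches:
                break  # deep enough into known history
    return new_ids


known = {3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15}
batches = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16]]
print(collect_new_ids(batches, known))
```

In this example the shallow gap (item 7) is still picked up because only two fully-known batches precede it, while item 16 is never reached because three fully-known batches in a row end the scan first.
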
## Potential Improvements

- Cache search results to reduce Flickr requests
- Add filtering by license type in search results
- Handle Flickr rate limiting/blocks more gracefully
- Add tests for the parsing logic
- Add pagination for category search (continue token is already returned)
- Confirm CC 4.0 license codes 11-15 (only 16 confirmed so far)