Compare commits

8 commits

Author SHA1 Message Date
7b741e951f Add Flickr search term override field
Allow users to edit the Flickr search query without changing the
Wikipedia article. Shows a text field with the current search term
(including quotes for phrase search) that can be modified and
re-submitted. The search term persists across pagination and photo
selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 13:42:18 +00:00
57b2e474df Add pagination to category search for large categories
Large categories like "Living people" (900k+ articles) were impractical
because the code tried to download all members before displaying results.
Now stops after collecting ~200 articles and provides a "Next page" link.

Also fixes the MediaWiki API continuation protocol: passes the full
continue dict (not just gcmcontinue) so imcontinue responses are handled
properly, and reduces gcmlimit from "max" to 50 so each batch's images
fit in one API response.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 13:32:56 +00:00
ab012f9cf3 Change submit button to say Search and use btn-primary styling
Match the style of the category search page button.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-08 12:34:50 +00:00
08f5128e8d Add interaction logging and tighten model NOT NULL constraints
Log searches (article/category) and message-generation events to a new
interaction_log table, capturing IP address and User-Agent.

Also apply NOT NULL constraints to Contribution, SentMessage, FlickrUpload,
and ThumbnailCache fields that are always populated, and remove stale
continue_token references from category.html.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-08 12:34:04 +00:00
252a854e76 Move Flickr sent-mail cookies into local config file
2026-02-07 14:41:41 +00:00
2819652afd Handle modern UploadWizard comments when indexing Flickr uploads
2026-02-07 13:41:27 +00:00
4f67960fe1 Make commons contributions sync resilient to shallow gaps
2026-02-07 13:34:09 +00:00
e072279566 Stop fetching all pages when downloading sent mail
2026-02-07 13:24:09 +00:00
11 changed files with 435 additions and 188 deletions
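
Note on commit 57b2e474df: the continuation fix it describes (passing the whole `continue` dict back rather than only `gcmcontinue`) can be sketched roughly as below. This is a minimal illustration, not the code from `main.py`: `iter_batches` and `fake_fetch` are hypothetical names, and the canned responses are made-up stand-ins for real MediaWiki API output.

```python
from typing import Callable, Iterator


def iter_batches(fetch: Callable[[dict], dict], base_params: dict) -> Iterator[dict]:
    """Yield API responses, feeding the full `continue` dict into the next request."""
    continue_params: dict = {}
    while True:
        request_params = {**base_params, **continue_params}
        data = fetch(request_params)
        yield data
        api_continue = data.get("continue")
        if not api_continue:
            break
        # Send back every key the API returned (imcontinue, gcmcontinue,
        # continue), not just gcmcontinue.
        continue_params = api_continue


def fake_fetch(params: dict) -> dict:
    """Hypothetical stand-in for requests.get(...).json(); returns canned data."""
    if "imcontinue" in params:
        # Images for the current batch are complete; the generator moves on.
        return {
            "query": {"pages": {"1": {"title": "A"}}},
            "continue": {"gcmcontinue": "page|B", "continue": "gcmcontinue||"},
        }
    if "gcmcontinue" in params:
        # Last batch: no "continue" key at all.
        return {"query": {"pages": {"2": {"title": "B"}}}}
    # First request: the image list did not fit in one response.
    return {
        "query": {"pages": {"1": {"title": "A"}}},
        "continue": {"imcontinue": "1|A.jpg", "continue": "gcmcontinue||"},
    }


responses = list(iter_batches(fake_fetch, {"action": "query"}))
```

With the canned data above, the loop makes three requests: the initial one, one that resumes the image list, and one that advances the generator.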

1
.gitignore vendored

@@ -3,3 +3,4 @@ __pycache__
 commons_contributions/thumbnail_cache.json
 commons_contributions/sent_mail_index.json
 flickr_mail.db
+download_sent_mail.local.json

View file

@@ -85,16 +85,20 @@ for the Flickr mail URL. Scrapes the user's profile page for embedded params.
 Shows recent Wikimedia Commons uploads on the home page, filtered to only
 those obtained via Flickr mail requests.
-**Data files** (in `commons_contributions/`):
-- `flickr_uploads.json`: List of Commons uploads from Flickr with metadata
-- `thumbnail_cache.json`: Cached Commons API thumbnail URLs (7-day TTL)
-- `sent_mail_index.json`: Index of sent mail messages (flickr_url → wikipedia_url)
+**Database tables used by the app**:
+- `sent_messages`: downloaded from Flickr sent mail, includes extracted Flickr
+  URL and Wikipedia URL from message body
+- `contributions`: downloaded from Commons `usercontribs`
+- `flickr_uploads`: derived table built by `update_flickr_uploads.py` by
+  matching Commons uploads to Flickr URLs
+- `thumbnail_cache`: cached Commons API thumbnail URLs (7-day TTL)
+- `interaction_log`: written by the web app to record searches and message
+  generation events (see below)
 **Key functions**:
-- `build_sent_mail_index()`: Parses sent mail JSON files, extracts Flickr and
-  Wikipedia URLs from message bodies, caches the index
 - `get_recent_commons_uploads()`: Loads uploads, filters by sent mail match,
-  fetches thumbnails from Commons API
+  joins `flickr_uploads` with `sent_messages`, and fetches thumbnails from
+  Commons API
 - `normalize_flickr_url()`: Normalizes URLs for matching (removes protocol, www, trailing slash)
 **CommonsUpload dataclass**:
@@ -104,9 +108,32 @@ those obtained via Flickr mail requests.
 - `wiki_link_url`, `wiki_link_label`: Handles Wikidata vs Wikipedia links
 **Maintenance script** (`update_flickr_uploads.py`):
-Run to find Flickr uploads from UploadWizard contributions that don't have
-the Flickr URL in the edit comment. Queries Commons API for image metadata
-and checks the Credit field for Flickr URLs.
+Builds/updates `flickr_uploads` from `contributions` and links to
+`sent_messages`.
+- Scans file contributions containing `UploadWizard` in the comment
+- Supports both comment styles:
+  - `User created page with UploadWizard` (older)
+  - `Uploaded a work by ... with UploadWizard` (newer; often includes URL)
+- Extracts Flickr URL from contribution comment when present
+- Falls back to Commons `extmetadata.Credit` lookup when comment has no URL
+### Interaction Logging (`log_interaction`)
+The `log_interaction()` helper writes a row to `interaction_log` on each
+meaningful user action:
+- `"search_article"` user submits a Wikipedia article search (page 1 only,
+  to avoid logging every pagination hit)
+- `"search_category"` user submits a Wikipedia category search
+- `"generate_message"` a non-free CC message is generated after clicking a photo
+Each row captures: Unix `timestamp`, `interaction_type`, `ip_address`
+(prefers `X-Forwarded-For` for proxy setups), `user_agent`, `query` (article
+title or category name), and optionally `flickr_url` / `wikipedia_url`.
+The table is created by `init_db()` (called via `python3 -c "from
+flickr_mail.database import init_db; init_db()"` or any of the maintenance
+scripts). The web app never calls `init_db()` itself.
 ### Category Search (`/category` route)
@@ -125,7 +152,7 @@ to allow back-navigation to the category.
 ### Previous Message Detection (`get_previous_messages`)
-Checks `sent_mail/messages_index.json` for previous messages to a Flickr user.
+Checks the `sent_messages` database table for previous messages to a Flickr user.
 Matches by both display name and username (case-insensitive). Results shown as
 an info alert on the message page.
@ -159,6 +186,24 @@ print(f"{len(result.photos)} photos, {result.total_pages} pages")
print(result.photos[0].title, result.photos[0].license_name) print(result.photos[0].title, result.photos[0].license_name)
``` ```
+## Data Sync Workflow
+To refresh "recent Commons uploads obtained via Flickr mail", run scripts in
+this order:
+1. `./download_sent_mail.py`
+2. `./download_commons_contributions.py`
+3. `./update_flickr_uploads.py`
+Notes:
+- `download_sent_mail.py` reads Flickr auth cookies from
+  `download_sent_mail.local.json` (`cookies_str` key). Copy
+  `download_sent_mail.example.json` to create local config.
+- `main.py` does not populate `flickr_uploads`; it only reads from it.
+- `download_commons_contributions.py` intentionally stops after several
+  consecutive fully-known API batches (overlap window) to avoid full-history
+  scans while still catching shallow gaps.
 ## Potential Improvements
 - Cache search results to reduce Flickr requests
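
Note on the Interaction Logging section in the hunk above: the "prefers `X-Forwarded-For`" rule can be illustrated in isolation as below. This is a sketch only; `client_ip` is a hypothetical helper name, and the addresses are documentation examples rather than real log data.

```python
from __future__ import annotations


def client_ip(forwarded_for: str | None, remote_addr: str | None) -> str | None:
    """Return the first X-Forwarded-For entry if the header is set, else the socket address."""
    if forwarded_for:
        # The header may hold a comma-separated chain: "client, proxy1, proxy2".
        return forwarded_for.split(",")[0].strip()
    return remote_addr


behind_proxy = client_ip("203.0.113.7, 10.0.0.1", "127.0.0.1")
direct = client_ip(None, "198.51.100.4")
```

The header is supplied by the client unless a trusted reverse proxy overwrites it, so the logged value is best treated as a hint rather than a verified address.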

143
README.md

@@ -1,89 +1,100 @@
-# Flickr Photo Finder for Wikipedia Articles
+# Flickr Mail
 Tool lives here: <https://edwardbetts.com/flickr_mail/>
-This tool is designed to help you find photos on Flickr for Wikipedia articles
-and contact the photographer. It's a Python application that leverages the Flask
-framework for web development.
+Flickr Mail is a Flask app that helps find Flickr photos for Wikipedia articles
+and contact photographers to request Wikipedia-compatible licensing.
-## Table of Contents
-- [Introduction](#introduction)
-- [Usage](#usage)
-- [Error Handling](#error-handling)
-- [Running the Application](#running-the-application)
+## What It Does
-## Introduction
+- Searches Flickr from a Wikipedia article title/URL
+- Shows license status for each result (free vs non-free CC variants)
+- Builds a ready-to-send Flickr message for non-free licenses
+- Finds image-less articles in a Wikipedia category
+- Shows recent Commons uploads that came from Flickr mail outreach
-This tool is developed and maintained by Edward Betts (edward@4angle.com). Its
-primary purpose is to simplify the process of discovering and contacting
-photographers on Flickr whose photos can be used to enhance Wikipedia articles.
+## Project Layout
-### Key Features
-- **Integrated Flickr search**: Enter a Wikipedia article title and see Flickr
-  photos directly in the interface - no need to visit Flickr's search page.
-- **Photo grid with metadata**: Search results display as a grid of thumbnails
-  showing the user's name and license for each photo.
-- **License handling**: Photos with Wikipedia-compatible licenses (CC BY,
-  CC BY-SA, CC0, Public Domain) are highlighted with a green badge and link
-  directly to the Commons UploadWizard. Non-free CC licenses (NC/ND) show a
-  tailored message explaining Wikipedia's requirements. Supports both CC 2.0
-  and CC 4.0 license codes.
-- **One-click message composition**: Click any photo to compose a permission
-  request message with the photo displayed alongside, showing the user's Flickr
-  profile and current license.
-- **Previous message detection**: The message page checks sent mail history and
-  warns if you have previously contacted the user.
-- **Category search**: Find Wikipedia articles without images in a given
-  category, with links to search Flickr for each article.
-- **Pagination**: Browse through thousands of search results with page navigation.
-- **Recent uploads showcase**: The home page displays recent Wikimedia Commons
-  uploads that were obtained via Flickr mail requests, with links to the
-  Wikipedia article and user's Flickr profile.
-- Handle exceptions gracefully and provide detailed error information.
+- `main.py`: Flask app routes and core logic
+- `templates/`: UI templates
+- `download_sent_mail.py`: sync Flickr sent messages into DB
+- `download_commons_contributions.py`: sync Commons contributions into DB
+- `update_flickr_uploads.py`: derive `flickr_uploads` from contributions/sent mail
+- `flickr_mail.db`: SQLite database
-## Usage
+## Database Pipeline
-To use the tool, follow these steps:
+The recent uploads section depends on a 3-step pipeline:
-1. Start the tool by running the script.
-2. Access the tool through a web browser.
-3. Enter a Wikipedia article title or URL, or use "Find articles by category"
-   to discover articles that need images.
-4. Browse the Flickr search results displayed in the interface.
-5. Click on a photo to select it. If the license is Wikipedia-compatible, you'll
-   be linked to the Commons UploadWizard. Otherwise, a message is composed to
-   request a license change.
-6. Copy the subject and message, then click "Send message on Flickr" to contact
-   the user.
+1. `./download_sent_mail.py` updates `sent_messages`
+2. `./download_commons_contributions.py` updates `contributions`
+3. `./update_flickr_uploads.py` builds/updates `flickr_uploads`
-## Error Handling
+`main.py` only reads `flickr_uploads`; it does not populate it.
-The application includes error handling to ensure a smooth user experience. If
-an error occurs, it will display a detailed error message with traceback
-information. The error handling is designed to provide valuable insights into
-any issues that may arise during use.
+## UploadWizard Detection
-## Running the Application
+`update_flickr_uploads.py` supports both Commons UploadWizard comment styles:
-To run the application, ensure you have Python 3 installed on your system. You
-will also need to install the required Python modules mentioned in the script,
-including Flask, requests, and others.
+- `User created page with UploadWizard` (older)
+- `Uploaded a work by ... with UploadWizard` (newer)
-1. Clone this repository to your local machine.
-2. Navigate to the project directory.
-3. Run the following command to start the application:
+It first tries to extract a Flickr URL directly from the contribution comment.
+If absent, it falls back to Commons `extmetadata.Credit`.
+## Local Run
+Install dependencies (example):
+```bash
+pip install flask requests beautifulsoup4 sqlalchemy
+```
+Start the app:
 ```bash
 python3 main.py
 ```
-4. Access the application by opening a web browser and visiting the provided URL
-   (usually `http://localhost:5000/`).
+Then open:
-That's it! You can now use the Flickr Photo Finder tool to streamline the
-process of finding and contacting photographers for Wikipedia articles.
+- `http://localhost:5000/`
-If you encounter any issues or have questions, feel free to contact Edward Betts
-(edward@4angle.com).
+## Refresh Data
-Happy photo hunting!
+Run in this order:
+```bash
+./download_sent_mail.py
+./download_commons_contributions.py
+./update_flickr_uploads.py
+```
+Before running `./download_sent_mail.py`, create local auth config:
+```bash
+cp download_sent_mail.example.json download_sent_mail.local.json
+```
+Then edit `download_sent_mail.local.json` and set `cookies_str` to your full
+Flickr `Cookie` header value.
+## Interaction Logging
+The app logs searches and message generation to the `interaction_log` table:
+- `search_article`: when a user searches for a Wikipedia article title (page 1 only)
+- `search_category`: when a user searches a Wikipedia category
+- `generate_message`: when a non-free CC message is generated for a photo
+Each row records the timestamp, interaction type, client IP (from
+`X-Forwarded-For` if present), User-Agent, query, and (for message events)
+the Flickr and Wikipedia URLs.
+## Notes
+- `download_commons_contributions.py` uses an overlap window of known-only
+  batches before stopping to avoid full-history scans while still catching
+  shallow gaps.
+- If a known Commons upload is missing from `flickr_uploads`, re-run the full
+  3-step pipeline above.
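
Note on the UploadWizard Detection section in the README hunk above: the "comment first, `extmetadata.Credit` as fallback" decision might look roughly like the sketch below. The regular expression, the function name `classify_comment`, and the sample comments are all assumptions for illustration; the actual pattern used by `update_flickr_uploads.py` is not shown in this diff.

```python
from __future__ import annotations

import re

# Hypothetical pattern: any flickr.com/photos/... URL up to the next
# whitespace, closing bracket or pipe.
FLICKR_URL_RE = re.compile(r"https?://(?:www\.)?flickr\.com/photos/[^\s\]|]+")


def classify_comment(comment: str) -> tuple[bool, str | None]:
    """Return (is_uploadwizard, flickr_url_or_None) for a contribution comment."""
    if "UploadWizard" not in comment:
        return False, None
    match = FLICKR_URL_RE.search(comment)
    # A None URL signals that the caller should fall back to the Credit field.
    return True, match.group(0) if match else None


# Made-up sample comments in the two styles named above.
older = classify_comment("User created page with UploadWizard")
newer = classify_comment(
    "Uploaded a work by Jane Doe from "
    "https://www.flickr.com/photos/janedoe/12345/ with UploadWizard"
)
unrelated = classify_comment("Fixed description")
```

Only the `(True, None)` case would need the extra Commons API lookup, which keeps the number of metadata requests down for newer-style comments.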

download_commons_contributions.py

@@ -12,6 +12,7 @@ from flickr_mail.models import Contribution
 API_URL = "https://commons.wikimedia.org/w/api.php"
 USERNAME = "Edward"
+CONSECUTIVE_KNOWN_BATCHES_TO_STOP = 3
 # Identify ourselves properly to Wikimedia
 USER_AGENT = "CommonsContributionsDownloader/0.1 (edward@4angle.com)"
@@ -48,12 +49,8 @@ def fetch_contributions(
     return contributions, new_continue
-def upsert_contribution(session, c: dict) -> None:
-    """Insert or update a contribution by revid."""
-    existing = session.query(Contribution).filter_by(revid=c["revid"]).first()
-    if existing:
-        return  # Already have this revision
+def insert_contribution(session, c: dict) -> None:
+    """Insert a contribution row (caller must ensure revid is new)."""
     session.add(Contribution(
         userid=c.get("userid"),
         user=c.get("user"),
@@ -97,6 +94,7 @@ def main() -> None:
     batch_num = 0
     new_count = 0
     continue_token = None
+    consecutive_known_batches = 0
     while True:
         batch_num += 1
@@ -108,13 +106,24 @@ def main() -> None:
             print("no results")
             break
+        # One DB query per batch to identify already-known revisions.
+        revids = [c["revid"] for c in contributions if "revid" in c]
+        existing_revids = {
+            row[0]
+            for row in (
+                session.query(Contribution.revid)
+                .filter(Contribution.revid.in_(revids))
+                .all()
+            )
+        }
         batch_new = 0
         for c in contributions:
-            # Stop if we've reached contributions we already have
-            existing = session.query(Contribution).filter_by(revid=c["revid"]).first()
-            if existing:
+            revid = c.get("revid")
+            if revid in existing_revids:
                 continue
-            upsert_contribution(session, c)
+            insert_contribution(session, c)
             batch_new += 1
         new_count += batch_new
@@ -123,7 +132,18 @@ def main() -> None:
         session.commit()
         if batch_new == 0:
-            # All contributions in this batch already exist, we're caught up
+            consecutive_known_batches += 1
+            print(
+                " Batch fully known "
+                f"({consecutive_known_batches}/"
+                f"{CONSECUTIVE_KNOWN_BATCHES_TO_STOP})"
+            )
+        else:
+            consecutive_known_batches = 0
+        if consecutive_known_batches >= CONSECUTIVE_KNOWN_BATCHES_TO_STOP:
+            # Stop after a small overlap window of known-only batches.
+            # This catches recent historical gaps without full-history scans.
             print(" Caught up with existing data")
             break
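
Note on the stop rule in the hunk above: its effect is easier to see when pulled out of the download loop. The sketch below is a standalone simulation under assumed inputs; `batches_processed` is a hypothetical helper, and the list of per-batch new-row counts is invented.

```python
def batches_processed(new_counts: list[int], stop_after: int = 3) -> int:
    """Return how many batches are handled before the overlap-window rule stops the sync."""
    consecutive_known = 0
    for processed, batch_new in enumerate(new_counts, start=1):
        if batch_new == 0:
            consecutive_known += 1
        else:
            # Any batch with new rows resets the window.
            consecutive_known = 0
        if consecutive_known >= stop_after:
            return processed
    return len(new_counts)


# Two fully known batches, then a gap with 2 missing rows, then known data.
with_gap = batches_processed([5, 0, 0, 2, 0, 0, 0, 9])
already_synced = batches_processed([0, 0, 0, 4])
```

With a window of 3, the gap in the fourth batch is still picked up, whereas a stop-on-first-known-batch rule (as in the removed code) would have ended at the second batch.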

download_sent_mail.example.json

@@ -0,0 +1,3 @@
+{
+  "cookies_str": "paste your full Flickr Cookie header value here"
+}

download_sent_mail.py

@@ -1,7 +1,9 @@
 #!/usr/bin/env python3
 """Download sent FlickrMail messages for backup."""
+import json
 import time
+from pathlib import Path
 import requests
 from bs4 import BeautifulSoup
@@ -17,6 +19,9 @@ from flickr_mail.url_utils import (
 BASE_URL = "https://www.flickr.com"
 SENT_MAIL_URL = f"{BASE_URL}/mail/sent/page{{page}}"
 MESSAGE_URL = f"{BASE_URL}/mail/sent/{{message_id}}"
+MAX_SENT_MAIL_PAGES = 29  # Fallback upper bound if we need to backfill everything
+CONFIG_FILE = Path(__file__).with_name("download_sent_mail.local.json")
+EXAMPLE_CONFIG_FILE = Path(__file__).with_name("download_sent_mail.example.json")
 HEADERS = {
     "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:147.0) Gecko/20100101 Firefox/147.0",
@@ -33,7 +38,23 @@ HEADERS = {
     "Priority": "u=0, i",
 }
COOKIES_STR = """ccc=%7B%22needsConsent%22%3Atrue%2C%22managed%22%3A0%2C%22changed%22%3A0%2C%22info%22%3A%7B%22cookieBlock%22%3A%7B%22level%22%3A2%2C%22blockRan%22%3A1%7D%7D%7D; _sp_ses.df80=*; _sp_id.df80=968931de-089d-4576-b729-6662c2c13a65.1770187027.1.1770187129..adf2374b-b85c-4899-afb7-63c2203d0c44..9422de57-9cdf-49c9-ac54-183eaa1ec457.1770187027101.24; TAsessionID=7f373c97-e9f8-46cb-bc1a-cb4f164ce46b|NEW; notice_behavior=expressed,eu; usprivacy=1---; acstring=3~550.1942.3126.3005.3077.1329.196.1725.1092; euconsent-v2=CQfGXgAQfGXgAAvACDENCQFsAP_gAEPgAAAALktB9G5cSSFBYCJVYbtEYAQDwFhg4oAhAgABEwAATBoAoIwGBGAoIAiAICACAAAAIARAIAEECAAAQAAAIIABAAAMAEAAIAACIAAACAABAgAACEAIAAggWAAAAEBEAFQAgAAAQBIACFAAAgABAUABAAAAAACAAQAAACAgQAAAAAAAAAAAkAhAAAAAAAAAABAMAAABIAAAAAAAAAAAAAAAAAAABAAAAICBAAAAQAAAAAAAAAAAAAAAAAAAAgqY0H0blxJIUFgIFVhu0QgBBPAWADigCEAAAEDAABMGgCgjAIUYCAgSIAgIAAAAAAgBEAgAQAIAABAAAAAgAEAAAwAQAAgAAAAAAAAAAECAAAAQAgACCBYAAAAQEQAVACBAABAEgAIUAAAAAEBQAEAAAAAAIABAAAAICBAAAAAAAAAAACQCEAAAAAAAAAAEAwBAAEgAAAAAAAAAAAAAAAAAAAEABAAgIEAAABAA.YAAAAAAAAAAA.ILktB9G5cSSFBYCJVYbtEYAQTwFhg4oAhAgABEwAATBoAoIwGFGAoIEiAICACAAAAIARAIAEECAAAQAAAIIABAAAMAEAAIAACIAAACAABAgAACEAIAAggWAAAAEBEAFQAgQAAQBIACFAAAgABAUABAAAAAACAAQAAACAgQAAAAAAAAAAAkAhAAAAAAAAAABAMAQABIAAAAAAAAAAAAAAAAAAABAAQAICBAAAAQAAAAAAAAAAAAAAAAAAAAgA; notice_preferences=2:; notice_gdpr_prefs=0,1,2:; cmapi_gtm_bl=; cmapi_cookie_privacy=permit 1,2,3; AMCV_48E815355BFE96970A495CD0%40AdobeOrg=281789898%7CMCMID%7C44859851125632937290373504988866174366%7CMCOPTOUT-1770194232s%7CNONE%7CvVersion%7C4.1.0; AMCVS_48E815355BFE96970A495CD0%40AdobeOrg=1; xb=646693; localization=en-us%3Buk%3Bgb; flrbp=1770187037-cfbf3914859af9ef68992c8389162e65e81c86c4; flrbgrp=1770187037-8e700fa7d73b4f2d43550f40513e7c6f507fd20f; flrbgdrp=1770187037-9af21cc74000b5f3f0943243608b4284d5f60ffd; flrbgmrp=1770187037-53f7bfff110731954be6bdfb2f587d59a8305670; flrbrst=1770187037-440e42fcee9b4e8e81ba8bc3eb3d0fc8b62e7083; 
flrtags=1770187037-7b50035cb956b9216a2f3372f498f7008d8e26a8; flrbrp=1770187037-c0195dc99caa020d4e32b39556131add862f26a0; flrb=34; session_id=2693fb01-87a0-42b1-a426-74642807b534; cookie_session=834645%3A29f2a9722d8bac88553ea1baf7ea11b4; cookie_accid=834645; cookie_epass=29f2a9722d8bac88553ea1baf7ea11b4; sa=1775371036%3A79962317%40N00%3A8fb60f4760b4840f37af3ebc90a8cb57; vp=2075%2C1177%2C1%2C0; flrbfd=1770187037-88a4e436729c9c5551794483fbd9c80e9dac2354; flrbpap=1770187037-18adaacf3a389df4a7bdc05cd471e492c54ef841; liqpw=2075; liqph=672""" def load_cookie_string() -> str:
"""Load Flickr cookies string from local JSON config."""
+    if not CONFIG_FILE.exists():
+        raise RuntimeError(
+            f"Missing config file: {CONFIG_FILE}. "
+            f"Copy {EXAMPLE_CONFIG_FILE.name} to {CONFIG_FILE.name} and set cookies_str."
+        )
+    try:
+        data = json.loads(CONFIG_FILE.read_text())
+    except json.JSONDecodeError as exc:
+        raise RuntimeError(f"Invalid JSON in {CONFIG_FILE}: {exc}") from exc
+    cookie_str = data.get("cookies_str", "").strip()
+    if not cookie_str:
+        raise RuntimeError(f"{CONFIG_FILE} must contain a non-empty 'cookies_str' value")
+    return cookie_str
 def parse_cookies(cookie_str: str) -> dict[str, str]:
@@ -50,7 +71,7 @@ def create_session() -> requests.Session:
     """Create a requests session with authentication."""
     session = requests.Session()
     session.headers.update(HEADERS)
-    session.cookies.update(parse_cookies(COOKIES_STR))
+    session.cookies.update(parse_cookies(load_cookie_string()))
     return session
@@ -166,22 +187,41 @@ def main() -> None:
     http_session = create_session()
-    # Scrape all pages to find new messages
-    total_pages = 29
     new_messages: list[dict] = []
+    stop_fetching = False
-    print("Fetching message list from all pages...")
+    print("Fetching message list until we reach existing messages...")
-    for page in range(1, total_pages + 1):
+    for page in range(1, MAX_SENT_MAIL_PAGES + 1):
         url = SENT_MAIL_URL.format(page=page)
-        print(f" Fetching page {page}/{total_pages}...")
+        print(f" Fetching page {page}...")
         try:
             soup = fetch_page(http_session, url)
             page_messages = extract_messages_from_list_page(soup)
+            if not page_messages:
+                print(" No messages found on this page, stopping")
+                break
+            page_new_messages = 0
             for msg in page_messages:
-                if msg["message_id"] not in existing_ids:
-                    new_messages.append(msg)
+                msg_id = msg.get("message_id")
+                if not msg_id:
+                    continue
+                if msg_id in existing_ids:
+                    stop_fetching = True
+                    break
+                new_messages.append(msg)
+                page_new_messages += 1
+            if stop_fetching:
+                print(" Reached messages already in the database, stopping pagination")
+                break
+            if page_new_messages == 0:
+                print(" No new messages on this page, stopping pagination")
+                break
             time.sleep(1)  # Be polite to the server
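
Note on the pagination change in the hunk above: the early-stop logic can be exercised without touching Flickr by feeding it pre-parsed pages. This sketch assumes the sent-mail list is ordered newest first (which the stop-at-first-known rule relies on); `collect_new_messages` and the sample IDs are hypothetical.

```python
def collect_new_messages(pages: list[list[dict]], existing_ids: set[str]) -> list[dict]:
    """Walk pages in order and stop at the first message that is already stored."""
    new_messages: list[dict] = []
    for page_messages in pages:
        if not page_messages:
            break  # Empty page: nothing further to read.
        page_new = 0
        reached_known = False
        for msg in page_messages:
            msg_id = msg.get("message_id")
            if not msg_id:
                continue  # Skip rows that could not be parsed.
            if msg_id in existing_ids:
                reached_known = True
                break
            new_messages.append(msg)
            page_new += 1
        if reached_known or page_new == 0:
            break
    return new_messages


sample_pages = [
    [{"message_id": "9"}, {"message_id": "8"}],
    [{"message_id": "7"}, {"message_id": "6"}],
    [{"message_id": "5"}],
]
new_ids = [m["message_id"] for m in collect_new_messages(sample_pages, {"6", "5"})]
```

In this run the third page is never visited, which is the saving the commit is after; the trade-off is that a message missing from the middle of the stored history would not be backfilled by this rule alone.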

flickr_mail/models.py

@@ -12,20 +12,20 @@ class Contribution(Base):
     __tablename__ = "contributions"
     id: Mapped[int] = mapped_column(primary_key=True)
-    userid: Mapped[int | None]
-    user: Mapped[str | None]
-    pageid: Mapped[int | None]
-    revid: Mapped[int | None] = mapped_column(unique=True)
-    parentid: Mapped[int | None]
-    ns: Mapped[int | None]
-    title: Mapped[str | None]
-    timestamp: Mapped[str | None]
+    userid: Mapped[int]
+    user: Mapped[str]
+    pageid: Mapped[int]
+    revid: Mapped[int] = mapped_column(unique=True)
+    parentid: Mapped[int]
+    ns: Mapped[int]
+    title: Mapped[str]
+    timestamp: Mapped[str]
     minor: Mapped[str | None]
     top: Mapped[str | None]
-    comment: Mapped[str | None] = mapped_column(Text)
-    size: Mapped[int | None]
-    sizediff: Mapped[int | None]
-    tags: Mapped[str | None] = mapped_column(Text)  # JSON array stored as text
+    comment: Mapped[str] = mapped_column(Text)
+    size: Mapped[int]
+    sizediff: Mapped[int]
+    tags: Mapped[str] = mapped_column(Text)  # JSON array stored as text
     __table_args__ = (
         Index("ix_contributions_timestamp", "timestamp"),
@@ -37,16 +37,16 @@ class SentMessage(Base):
     __tablename__ = "sent_messages"
     message_id: Mapped[str] = mapped_column(primary_key=True)
-    subject: Mapped[str | None]
-    url: Mapped[str | None]
-    recipient: Mapped[str | None]
-    date: Mapped[str | None]
-    body: Mapped[str | None] = mapped_column(Text)
-    body_html: Mapped[str | None] = mapped_column(Text)
-    flickr_url: Mapped[str | None]
-    normalized_flickr_url: Mapped[str | None]
-    wikipedia_url: Mapped[str | None]
-    creator_profile_url: Mapped[str | None]
+    subject: Mapped[str]
+    url: Mapped[str]
+    recipient: Mapped[str]
+    date: Mapped[str]
+    body: Mapped[str] = mapped_column(Text)
+    body_html: Mapped[str] = mapped_column(Text)
+    flickr_url: Mapped[str]
+    normalized_flickr_url: Mapped[str]
+    wikipedia_url: Mapped[str]
+    creator_profile_url: Mapped[str]
     flickr_uploads: Mapped[list["FlickrUpload"]] = relationship(
         back_populates="sent_message"
@@ -62,15 +62,15 @@ class FlickrUpload(Base):
     __tablename__ = "flickr_uploads"
     id: Mapped[int] = mapped_column(primary_key=True)
-    pageid: Mapped[int | None]
-    revid: Mapped[int | None]
-    title: Mapped[str | None]
-    timestamp: Mapped[str | None]
-    flickr_url: Mapped[str | None]
-    normalized_flickr_url: Mapped[str | None]
+    pageid: Mapped[int]
+    revid: Mapped[int]
+    title: Mapped[str]
+    timestamp: Mapped[str]
+    flickr_url: Mapped[str]
+    normalized_flickr_url: Mapped[str]
     creator: Mapped[str | None]
-    wikipedia_url: Mapped[str | None]
-    creator_profile_url: Mapped[str | None]
+    wikipedia_url: Mapped[str]
+    creator_profile_url: Mapped[str]
     sent_message_id: Mapped[str | None] = mapped_column(
         ForeignKey("sent_messages.message_id")
     )
@@ -89,5 +89,23 @@ class ThumbnailCache(Base):
     __tablename__ = "thumbnail_cache"
     title: Mapped[str] = mapped_column(primary_key=True)
-    thumb_url: Mapped[str | None]
-    fetched_at: Mapped[int | None]  # Unix timestamp
+    thumb_url: Mapped[str]
+    fetched_at: Mapped[int]  # Unix timestamp
+class InteractionLog(Base):
+    __tablename__ = "interaction_log"
+    id: Mapped[int] = mapped_column(primary_key=True)
+    timestamp: Mapped[int]  # Unix timestamp
+    interaction_type: Mapped[str]  # "search_article", "search_category", "generate_message"
+    ip_address: Mapped[str | None]
+    user_agent: Mapped[str | None] = mapped_column(Text)
+    query: Mapped[str | None]  # search term or category name
+    flickr_url: Mapped[str | None]
+    wikipedia_url: Mapped[str | None]
+    __table_args__ = (
+        Index("ix_interaction_log_timestamp", "timestamp"),
+        Index("ix_interaction_log_type", "interaction_type"),
+    )
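
Note on the model changes above: in SQLAlchemy 2.0 annotated mappings, `Mapped[str]` produces a NOT NULL column while `Mapped[str | None]` stays nullable. The sketch below shows the practical effect for `interaction_log` using only the standard-library `sqlite3` module. The DDL is hand-written as an approximation of what the model would generate, not output captured from `init_db()`, and the inserted values are invented.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE interaction_log (
        id INTEGER PRIMARY KEY,
        timestamp INTEGER NOT NULL,
        interaction_type TEXT NOT NULL,
        ip_address TEXT,
        user_agent TEXT,
        "query" TEXT,
        flickr_url TEXT,
        wikipedia_url TEXT
    );
    CREATE INDEX ix_interaction_log_timestamp ON interaction_log (timestamp);
    CREATE INDEX ix_interaction_log_type ON interaction_log (interaction_type);
    """
)

# Nullable columns may simply be omitted.
conn.execute(
    'INSERT INTO interaction_log (timestamp, interaction_type, "query") VALUES (?, ?, ?)',
    (int(time.time()), "search_article", "Example article"),
)

# A NULL interaction_type violates the NOT NULL constraint.
try:
    conn.execute(
        "INSERT INTO interaction_log (timestamp, interaction_type) VALUES (?, ?)",
        (int(time.time()), None),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM interaction_log").fetchone()[0]
```

One consequence worth keeping in mind: tightening existing columns to NOT NULL only affects newly created tables, so a database created before this commit would need a migration or rebuild for the constraints to apply.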

159
main.py

@@ -18,7 +18,7 @@ from sqlalchemy import func
 from werkzeug.debug.tbtools import DebugTraceback
 from flickr_mail.database import get_session
-from flickr_mail.models import FlickrUpload, SentMessage, ThumbnailCache
+from flickr_mail.models import FlickrUpload, InteractionLog, SentMessage, ThumbnailCache
 from flickr_mail.url_utils import extract_urls_from_message, normalize_flickr_url
 import re
@@ -348,6 +348,14 @@ class ArticleWithoutImage:
         return f"/?enwp={quote(self.title)}"
+@dataclasses.dataclass
+class CategoryResult:
+    """Result of a paginated category search."""
+    articles: list[ArticleWithoutImage]
+    gcmcontinue: str | None
 # Common non-content images to ignore when checking if an article has images
 NON_CONTENT_IMAGE_PATTERNS = [
     "OOjs UI icon",
@@ -379,14 +387,15 @@ def has_content_image(images: list[dict]) -> bool:
 def get_articles_without_images(
-    category: str, limit: int = 100
-) -> tuple[list[ArticleWithoutImage], str | None]:
+    category: str,
+    limit: int = 200,
+    gcmcontinue: str | None = None,
+) -> CategoryResult:
     """Get articles in a category that don't have images.
     Uses generator=categorymembers with prop=images to efficiently check
-    multiple articles in a single API request.
+    multiple articles in a single API request, following continuation until
+    the limit is reached or all category members have been processed.
-    Returns a tuple of (articles_list, continue_token).
     """
     params = {
         "action": "query",
@@ -394,49 +403,73 @@ def get_articles_without_images(
         "gcmtitle": category,
         "gcmtype": "page",  # Only articles, not subcategories or files
         "gcmnamespace": "0",  # Main namespace only
-        "gcmlimit": str(limit),
+        "gcmlimit": "50",  # Small batches so images fit in one response
         "prop": "images",
-        "imlimit": "max",  # Need enough to check all pages in batch
+        "imlimit": "max",
         "format": "json",
     }
     headers = {"User-Agent": WIKIMEDIA_USER_AGENT}
-    try:
-        response = requests.get(
-            WIKIPEDIA_API, params=params, headers=headers, timeout=30
-        )
-        response.raise_for_status()
-        data = response.json()
-    except (requests.RequestException, json.JSONDecodeError) as e:
-        print(f"Wikipedia API error: {e}")
-        return [], None
     articles_without_images: list[ArticleWithoutImage] = []
+    seen_pageids: set[int] = set()
+    next_gcmcontinue: str | None = None
-    pages = data.get("query", {}).get("pages", {})
-    for page in pages.values():
-        images = page.get("images", [])
-        # Skip if page has content images (not just UI icons)
-        if has_content_image(images):
-            continue
-        title = page.get("title", "")
-        pageid = page.get("pageid", 0)
-        if title and pageid:
-            articles_without_images.append(
-                ArticleWithoutImage(title=title, pageid=pageid)
+    # Build initial continue params from the external pagination token
+    continue_params: dict[str, str] = {}
+    if gcmcontinue:
+        continue_params = {"gcmcontinue": gcmcontinue, "continue": "gcmcontinue||"}
+    while True:
+        request_params = params.copy()
+        request_params.update(continue_params)
+        try:
+            response = requests.get(
+                WIKIPEDIA_API, params=request_params, headers=headers, timeout=30
             )
+            response.raise_for_status()
+            data = response.json()
+        except (requests.RequestException, json.JSONDecodeError) as e:
+            print(f"Wikipedia API error: {e}")
+            break
+        pages = data.get("query", {}).get("pages", {})
+        for page in pages.values():
+            pageid = page.get("pageid", 0)
+            if not pageid or pageid in seen_pageids:
+                continue
+            seen_pageids.add(pageid)
+            images = page.get("images", [])
+            # Skip if page has content images (not just UI icons)
+            if has_content_image(images):
+                continue
+            title = page.get("title", "")
+            if title:
+                articles_without_images.append(
+                    ArticleWithoutImage(title=title, pageid=pageid)
+                )
+        api_continue = data.get("continue")
+        if not api_continue:
+            break
+        # Only stop at generator boundaries where we have a resumable token
+        gcmc = api_continue.get("gcmcontinue")
+        if gcmc and len(articles_without_images) >= limit:
+            next_gcmcontinue = gcmc
+            break
+        continue_params = api_continue
     # Sort by title for consistent display
     articles_without_images.sort(key=lambda a: a.title)
+    return CategoryResult(
+        articles=articles_without_images,
+        gcmcontinue=next_gcmcontinue,
+    )
-    # Get continue token if there are more results
-    continue_token = data.get("continue", {}).get("gcmcontinue")
-    return articles_without_images, continue_token
 def is_valid_flickr_image_url(url: str) -> bool:
@ -458,7 +491,7 @@ def is_valid_flickr_image_url(url: str) -> bool:
def search_flickr(search_term: str, page: int = 1) -> SearchResult: def search_flickr(search_term: str, page: int = 1) -> SearchResult:
"""Search Flickr for photos matching the search term.""" """Search Flickr for photos matching the search term."""
encoded_term = quote(f'"{search_term}"') encoded_term = quote(search_term)
url = f"https://flickr.com/search/?view_all=1&text={encoded_term}&page={page}" url = f"https://flickr.com/search/?view_all=1&text={encoded_term}&page={page}"
response = requests.get(url, headers=BROWSER_HEADERS) response = requests.get(url, headers=BROWSER_HEADERS)
@ -583,6 +616,33 @@ def parse_flickr_search_results(html: str, page: int = 1) -> SearchResult:
) )
def log_interaction(
interaction_type: str,
query: str | None = None,
flickr_url: str | None = None,
wikipedia_url: str | None = None,
) -> None:
"""Log a user interaction to the database."""
forwarded_for = flask.request.headers.get("X-Forwarded-For")
ip_address = forwarded_for.split(",")[0].strip() if forwarded_for else flask.request.remote_addr
user_agent = flask.request.headers.get("User-Agent")
session = get_session()
try:
entry = InteractionLog(
timestamp=int(time.time()),
interaction_type=interaction_type,
ip_address=ip_address,
user_agent=user_agent,
query=query,
flickr_url=flickr_url,
wikipedia_url=wikipedia_url,
)
session.add(entry)
session.commit()
finally:
session.close()
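
The client-address selection inside `log_interaction` can be looked at on its own. The sketch below pulls that one expression into a helper; the name `client_ip` is hypothetical and does not appear in the commit. It assumes, as the diff does, that the first entry of `X-Forwarded-For` is the original client, which only holds when a trusted reverse proxy sets or overwrites that header.

```python
from __future__ import annotations


def client_ip(forwarded_for: str | None, remote_addr: str | None) -> str | None:
    """Pick the address to log (hypothetical helper mirroring the diff's logic)."""
    if forwarded_for:
        # X-Forwarded-For is a comma-separated chain; the first entry is the
        # address the outermost proxy reported for the client.
        return forwarded_for.split(",")[0].strip()
    # No proxy header (or an empty one): fall back to the socket peer address.
    return remote_addr


print(client_ip("203.0.113.7, 10.0.0.1", "127.0.0.1"))
print(client_ip(None, "127.0.0.1"))
```

If the app is ever served without a proxy in front, the header is client-controlled, so the logged value should be treated as a hint rather than an authenticated address.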
@app.errorhandler(werkzeug.exceptions.InternalServerError) @app.errorhandler(werkzeug.exceptions.InternalServerError)
def exception_handler(e: werkzeug.exceptions.InternalServerError) -> tuple[str, int]: def exception_handler(e: werkzeug.exceptions.InternalServerError) -> tuple[str, int]:
"""Handle exception.""" """Handle exception."""
@ -651,18 +711,24 @@ def start() -> str:
# Get category param if coming from category search # Get category param if coming from category search
cat = flask.request.args.get("cat") cat = flask.request.args.get("cat")
# Allow overriding the Flickr search term (default includes quotes for phrase search)
flickr_search = flask.request.args.get("flickr_search") or f'"{name}"'
flickr_url = flask.request.args.get("flickr") flickr_url = flask.request.args.get("flickr")
if not flickr_url: if not flickr_url:
# Search Flickr for photos # Search Flickr for photos
page = flask.request.args.get("page", 1, type=int) page = flask.request.args.get("page", 1, type=int)
page = max(1, page) # Ensure page is at least 1 page = max(1, page) # Ensure page is at least 1
search_result = search_flickr(name, page) if page == 1:
log_interaction("search_article", query=flickr_search, wikipedia_url=wikipedia_url)
search_result = search_flickr(flickr_search, page)
return flask.render_template( return flask.render_template(
"combined.html", "combined.html",
name=name, name=name,
enwp=enwp, enwp=enwp,
search_result=search_result, search_result=search_result,
cat=cat, cat=cat,
flickr_search=flickr_search,
) )
if "/in/" in flickr_url: if "/in/" in flickr_url:
@ -716,8 +782,16 @@ def start() -> str:
flickr_user_url=flickr_user_url, flickr_user_url=flickr_user_url,
cat=cat, cat=cat,
previous_messages=previous_messages, previous_messages=previous_messages,
flickr_search=flickr_search,
) )
log_interaction(
"generate_message",
query=name,
flickr_url=flickr_url,
wikipedia_url=wikipedia_url,
)
msg = flask.render_template( msg = flask.render_template(
"message.jinja", "message.jinja",
flickr_url=flickr_url, flickr_url=flickr_url,
@ -749,6 +823,7 @@ def start() -> str:
flickr_user_url=flickr_user_url, flickr_user_url=flickr_user_url,
cat=cat, cat=cat,
previous_messages=previous_messages, previous_messages=previous_messages,
flickr_search=flickr_search,
) )
@ -768,7 +843,9 @@ def category_search() -> str:
cat=cat, cat=cat,
) )
articles, continue_token = get_articles_without_images(category) log_interaction("search_category", query=category)
gcmcontinue = flask.request.args.get("gcmcontinue") or None
result = get_articles_without_images(category, gcmcontinue=gcmcontinue)
# Get the display name (without Category: prefix) # Get the display name (without Category: prefix)
category_name = category.replace("Category:", "") category_name = category.replace("Category:", "")
@ -778,8 +855,8 @@ def category_search() -> str:
cat=cat, cat=cat,
category=category, category=category,
category_name=category_name, category_name=category_name,
articles=articles, articles=result.articles,
continue_token=continue_token, gcmcontinue=result.gcmcontinue,
) )
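
The continuation loop added to `get_articles_without_images` in this file can be sketched without any network access. In the sketch below, `collect_pages` and its `fetch` callable are hypothetical stand-ins for the real function and its `requests.get` call, and a page counts as "without images" when its `images` list is empty, which is a simplification of the `has_content_image` check in the diff. The point it illustrates is the same: pass the whole `continue` dict back to the API, and only stop once the response carries a `gcmcontinue` token that a later request can resume from.

```python
from __future__ import annotations

from typing import Any, Callable


def collect_pages(
    fetch: Callable[[dict[str, str]], dict[str, Any]],
    limit: int,
    gcmcontinue: str | None = None,
) -> tuple[list[str], str | None]:
    """Collect titles of image-less pages until `limit` is reached (sketch)."""
    titles: list[str] = []
    seen_pageids: set[int] = set()
    next_token: str | None = None

    # Resume from an external pagination token, as the diff does.
    continue_params: dict[str, str] = {}
    if gcmcontinue:
        continue_params = {"gcmcontinue": gcmcontinue, "continue": "gcmcontinue||"}

    while True:
        data = fetch(continue_params)
        for page in data.get("query", {}).get("pages", {}).values():
            pageid = page.get("pageid", 0)
            if not pageid or pageid in seen_pageids:
                continue
            seen_pageids.add(pageid)
            if page.get("images"):  # simplified stand-in for has_content_image()
                continue
            if page.get("title"):
                titles.append(page["title"])

        api_continue = data.get("continue")
        if not api_continue:
            break  # the category is exhausted
        gcmc = api_continue.get("gcmcontinue")
        if gcmc and len(titles) >= limit:
            next_token = gcmc  # resumable boundary: hand the token to the caller
            break
        continue_params = api_continue  # pass the full continue dict back

    return titles, next_token
```

A caller would render `titles` and, when the returned token is not `None`, put it in the "Next page" link so the following request starts from that point.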


@ -33,7 +33,7 @@
<h5>Articles without images in <a href="https://en.wikipedia.org/wiki/{{ category | replace(' ', '_') }}" target="_blank">{{ category_name }}</a></h5> <h5>Articles without images in <a href="https://en.wikipedia.org/wiki/{{ category | replace(' ', '_') }}" target="_blank">{{ category_name }}</a></h5>
{% if articles %} {% if articles %}
<p class="text-muted small">Found {{ articles | length }} article(s) without images{% if continue_token %} (more available){% endif %}</p> <p class="text-muted small">Found {{ articles | length }} article(s) without images</p>
<div class="list-group"> <div class="list-group">
{% for article in articles %} {% for article in articles %}
@ -44,8 +44,10 @@
{% endfor %} {% endfor %}
</div> </div>
{% if continue_token %} {% if gcmcontinue %}
<p class="text-muted small mt-3">Note: Only showing first batch of results. More articles may be available in this category.</p> <div class="mt-3">
<a href="{{ url_for('category_search', cat=cat, gcmcontinue=gcmcontinue) }}" class="btn btn-outline-primary">Next page &raquo;</a>
</div>
{% endif %} {% endif %}
{% else %} {% else %}


@ -12,7 +12,7 @@
<input type="text" class="form-control" id="enwp" name="enwp" value="{{ enwp }}" required> <input type="text" class="form-control" id="enwp" name="enwp" value="{{ enwp }}" required>
</div> </div>
<input type="submit" value="Submit"> <input type="submit" class="btn btn-primary" value="Search">
<a href="{{ url_for('category_search') }}" class="btn btn-outline-secondary ms-2">Find articles by category</a> <a href="{{ url_for('category_search') }}" class="btn btn-outline-secondary ms-2">Find articles by category</a>
</form> </form>
@ -63,13 +63,20 @@
<p><a href="{{ url_for('category_search', cat=cat) }}">&larr; Back to category</a></p> <p><a href="{{ url_for('category_search', cat=cat) }}">&larr; Back to category</a></p>
{% endif %} {% endif %}
<p>Wikipedia article: {{ name }}</p> <p>Wikipedia article: {{ name }}</p>
<form action="{{ url_for(request.endpoint) }}" class="mb-3 d-flex align-items-center gap-2">
<input type="hidden" name="enwp" value="{{ enwp }}">
{% if cat %}<input type="hidden" name="cat" value="{{ cat }}">{% endif %}
<label for="flickr_search" class="form-label mb-0 text-nowrap">Flickr search:</label>
<input type="text" class="form-control form-control-sm" id="flickr_search" name="flickr_search" value="{{ flickr_search }}" style="max-width: 300px;">
<button type="submit" class="btn btn-sm btn-primary">Search</button>
</form>
<p>Select a photo to compose a message ({{ search_result.total_photos | default(0) }} results):</p> <p>Select a photo to compose a message ({{ search_result.total_photos | default(0) }} results):</p>
<div class="row row-cols-2 row-cols-md-3 row-cols-lg-4 g-3 mb-3"> <div class="row row-cols-2 row-cols-md-3 row-cols-lg-4 g-3 mb-3">
{% for photo in search_result.photos %} {% for photo in search_result.photos %}
<div class="col"> <div class="col">
<div class="card h-100"> <div class="card h-100">
<a href="{{ url_for(request.endpoint, enwp=enwp, flickr=photo.flickr_url, img=photo.medium_url, license=photo.license, flickr_user=photo.realname or photo.username, cat=cat) }}"> <a href="{{ url_for(request.endpoint, enwp=enwp, flickr=photo.flickr_url, img=photo.medium_url, license=photo.license, flickr_user=photo.realname or photo.username, cat=cat, flickr_search=flickr_search) }}">
<img src="{{ photo.thumb_url }}" alt="{{ photo.title }}" class="card-img-top" style="aspect-ratio: 1; object-fit: cover;"> <img src="{{ photo.thumb_url }}" alt="{{ photo.title }}" class="card-img-top" style="aspect-ratio: 1; object-fit: cover;">
</a> </a>
<div class="card-body p-2"> <div class="card-body p-2">
@ -86,7 +93,7 @@
<ul class="pagination justify-content-center"> <ul class="pagination justify-content-center">
{% if search_result.current_page > 1 %} {% if search_result.current_page > 1 %}
<li class="page-item"> <li class="page-item">
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, page=search_result.current_page - 1, cat=cat) }}">Previous</a> <a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, flickr_search=flickr_search, page=search_result.current_page - 1, cat=cat) }}">Previous</a>
</li> </li>
{% else %} {% else %}
<li class="page-item disabled"> <li class="page-item disabled">
@ -99,7 +106,7 @@
{% if start_page > 1 %} {% if start_page > 1 %}
<li class="page-item"> <li class="page-item">
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, page=1, cat=cat) }}">1</a> <a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, flickr_search=flickr_search, page=1, cat=cat) }}">1</a>
</li> </li>
{% if start_page > 2 %} {% if start_page > 2 %}
<li class="page-item disabled"><span class="page-link">...</span></li> <li class="page-item disabled"><span class="page-link">...</span></li>
@ -108,7 +115,7 @@
{% for p in range(start_page, end_page + 1) %} {% for p in range(start_page, end_page + 1) %}
<li class="page-item {{ 'active' if p == search_result.current_page else '' }}"> <li class="page-item {{ 'active' if p == search_result.current_page else '' }}">
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, page=p, cat=cat) }}">{{ p }}</a> <a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, flickr_search=flickr_search, page=p, cat=cat) }}">{{ p }}</a>
</li> </li>
{% endfor %} {% endfor %}
@ -117,13 +124,13 @@
<li class="page-item disabled"><span class="page-link">...</span></li> <li class="page-item disabled"><span class="page-link">...</span></li>
{% endif %} {% endif %}
<li class="page-item"> <li class="page-item">
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, page=search_result.total_pages, cat=cat) }}">{{ search_result.total_pages }}</a> <a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, flickr_search=flickr_search, page=search_result.total_pages, cat=cat) }}">{{ search_result.total_pages }}</a>
</li> </li>
{% endif %} {% endif %}
{% if search_result.current_page < search_result.total_pages %} {% if search_result.current_page < search_result.total_pages %}
<li class="page-item"> <li class="page-item">
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, page=search_result.current_page + 1, cat=cat) }}">Next</a> <a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, flickr_search=flickr_search, page=search_result.current_page + 1, cat=cat) }}">Next</a>
</li> </li>
{% else %} {% else %}
<li class="page-item disabled"> <li class="page-item disabled">
@ -135,7 +142,7 @@
{% endif %} {% endif %}
<p class="text-muted small"> <p class="text-muted small">
<a href="https://flickr.com/search/?view_all=1&text={{ '"' + name + '"' | urlencode }}" target="_blank">View full search on Flickr</a> <a href="https://flickr.com/search/?view_all=1&text={{ flickr_search | urlencode }}" target="_blank">View full search on Flickr</a>
</p> </p>
{% elif name and not flickr_url %} {% elif name and not flickr_url %}
@ -144,8 +151,15 @@
<p><a href="{{ url_for('category_search', cat=cat) }}">&larr; Back to category</a></p> <p><a href="{{ url_for('category_search', cat=cat) }}">&larr; Back to category</a></p>
{% endif %} {% endif %}
<p>Wikipedia article: {{ name }}</p> <p>Wikipedia article: {{ name }}</p>
<form action="{{ url_for(request.endpoint) }}" class="mb-3 d-flex align-items-center gap-2">
<input type="hidden" name="enwp" value="{{ enwp }}">
{% if cat %}<input type="hidden" name="cat" value="{{ cat }}">{% endif %}
<label for="flickr_search_empty" class="form-label mb-0 text-nowrap">Flickr search:</label>
<input type="text" class="form-control form-control-sm" id="flickr_search_empty" name="flickr_search" value="{{ flickr_search }}" style="max-width: 300px;">
<button type="submit" class="btn btn-sm btn-primary">Search</button>
</form>
<div class="alert alert-warning">No photos found. Try a different search term.</div> <div class="alert alert-warning">No photos found. Try a different search term.</div>
<p><a href="https://flickr.com/search/?view_all=1&text={{ '"' + name + '"' | urlencode }}" target="_blank">Search on Flickr directly</a></p> <p><a href="https://flickr.com/search/?view_all=1&text={{ flickr_search | urlencode }}" target="_blank">Search on Flickr directly</a></p>
{% endif %} {% endif %}
@ -201,7 +215,7 @@
</div> </div>
{% endif %} {% endif %}
<p class="mt-3"> <p class="mt-3">
<a href="{{ url_for('start', enwp=enwp, cat=cat) if cat else url_for('start', enwp=enwp) }}">&larr; Back to search results</a> <a href="{{ url_for('start', enwp=enwp, cat=cat, flickr_search=flickr_search) }}">&larr; Back to search results</a>
</p> </p>
</div> </div>
</div> </div>


@ -2,8 +2,12 @@
""" """
Find UploadWizard contributions that are from Flickr and add them to the database. Find UploadWizard contributions that are from Flickr and add them to the database.
For contributions with comment 'User created page with UploadWizard', queries the Supports both UploadWizard comment styles:
Commons API to check if the image source is Flickr (by checking the Credit field). - "User created page with UploadWizard" (older)
- "Uploaded a work by ... with UploadWizard" (newer, often includes Flickr URL)
If a Flickr URL is not present in the contribution comment, queries Commons API
to check if the image source is Flickr (by checking the Credit field).
""" """
import json import json
@ -27,6 +31,13 @@ def extract_flickr_url_from_credit(credit: str) -> str | None:
return match.group(0) if match else None return match.group(0) if match else None
def extract_flickr_url_from_comment(comment: str) -> str | None:
"""Extract Flickr URL directly from a contribution comment."""
pattern = r'https?://(?:www\.)?flickr\.com/photos/[^/\s]+/\d+'
match = re.search(pattern, comment or "")
return match.group(0) if match else None
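
As a quick check of the comment-based extraction above, the sketch below runs the same pattern against two comments. Both comment strings are made up for illustration (including the account name and photo ID), and the exact wording UploadWizard produces may differ; the constant name `FLICKR_PHOTO_PATTERN` is also an assumption, since the diff keeps the pattern in a local variable.

```python
import re

# Same regular expression as in the diff, held in a module-level constant.
FLICKR_PHOTO_PATTERN = r'https?://(?:www\.)?flickr\.com/photos/[^/\s]+/\d+'


def extract_flickr_url_from_comment(comment):
    """Return the first Flickr photo URL found in the comment, or None."""
    match = re.search(FLICKR_PHOTO_PATTERN, comment or "")
    return match.group(0) if match else None


# Hypothetical newer-style comment that embeds the source URL.
new_style = (
    "Uploaded a work by Example Photographer from "
    "https://www.flickr.com/photos/example_user/12345678901/ with UploadWizard"
)
# Older-style comment with no URL; the caller would fall back to the Credit field.
old_style = "User created page with UploadWizard"

print(extract_flickr_url_from_comment(new_style))
# → https://www.flickr.com/photos/example_user/12345678901
print(extract_flickr_url_from_comment(old_style))
# → None
```

The match stops at the numeric photo ID, so a trailing slash or an `/in/...` suffix is not part of the returned URL.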
def get_image_metadata(titles: list[str]) -> dict[str, dict]: def get_image_metadata(titles: list[str]) -> dict[str, dict]:
"""Fetch image metadata from Commons API for multiple titles.""" """Fetch image metadata from Commons API for multiple titles."""
if not titles: if not titles:
@ -97,10 +108,12 @@ def main():
) )
url_to_message = {msg.normalized_flickr_url: msg for msg in sent_messages} url_to_message = {msg.normalized_flickr_url: msg for msg in sent_messages}
# Find UploadWizard contributions (page creations only) # Find UploadWizard file uploads.
# Old format: "User created page with UploadWizard"
# New format: "Uploaded a work by ... with UploadWizard"
upload_wizard = ( upload_wizard = (
session.query(Contribution) session.query(Contribution)
.filter(Contribution.comment == "User created page with UploadWizard") .filter(Contribution.comment.contains("UploadWizard"))
.filter(Contribution.title.startswith("File:")) .filter(Contribution.title.startswith("File:"))
.all() .all()
) )
@ -127,7 +140,10 @@ def main():
credit = meta.get("credit", "") credit = meta.get("credit", "")
artist = meta.get("artist", "") artist = meta.get("artist", "")
flickr_url = extract_flickr_url_from_credit(credit) # Prefer URL directly in comment; fall back to extmetadata Credit.
flickr_url = extract_flickr_url_from_comment(c.comment or "")
if not flickr_url:
flickr_url = extract_flickr_url_from_credit(credit)
if not flickr_url: if not flickr_url:
continue continue