Compare commits
8 commits
9f0fb01878...7b741e951f
| Author | SHA1 | Date |
|---|---|---|
| | 7b741e951f | |
| | 57b2e474df | |
| | ab012f9cf3 | |
| | 08f5128e8d | |
| | 252a854e76 | |
| | 2819652afd | |
| | 4f67960fe1 | |
| | e072279566 | |
11 changed files with 435 additions and 188 deletions
1
.gitignore
vendored
|
|
@ -3,3 +3,4 @@ __pycache__
|
||||||
commons_contributions/thumbnail_cache.json
|
commons_contributions/thumbnail_cache.json
|
||||||
commons_contributions/sent_mail_index.json
|
commons_contributions/sent_mail_index.json
|
||||||
flickr_mail.db
|
flickr_mail.db
|
||||||
|
download_sent_mail.local.json
|
||||||
|
|
|
||||||
67
AGENTS.md
|
|
@ -85,16 +85,20 @@ for the Flickr mail URL. Scrapes the user's profile page for embedded params.
|
||||||
Shows recent Wikimedia Commons uploads on the home page, filtered to only
|
Shows recent Wikimedia Commons uploads on the home page, filtered to only
|
||||||
those obtained via Flickr mail requests.
|
those obtained via Flickr mail requests.
|
||||||
|
|
||||||
**Data files** (in `commons_contributions/`):
|
**Database tables used by the app**:
|
||||||
- `flickr_uploads.json`: List of Commons uploads from Flickr with metadata
|
- `sent_messages`: downloaded from Flickr sent mail, includes extracted Flickr
|
||||||
- `thumbnail_cache.json`: Cached Commons API thumbnail URLs (7-day TTL)
|
URL and Wikipedia URL from message body
|
||||||
- `sent_mail_index.json`: Index of sent mail messages (flickr_url → wikipedia_url)
|
- `contributions`: downloaded from Commons `usercontribs`
|
||||||
|
- `flickr_uploads`: derived table built by `update_flickr_uploads.py` by
|
||||||
|
matching Commons uploads to Flickr URLs
|
||||||
|
- `thumbnail_cache`: cached Commons API thumbnail URLs (7-day TTL)
|
||||||
|
- `interaction_log`: written by the web app to record searches and message
|
||||||
|
generation events (see below)
|
||||||
|
|
||||||
**Key functions**:
|
**Key functions**:
|
||||||
- `build_sent_mail_index()`: Parses sent mail JSON files, extracts Flickr and
|
|
||||||
Wikipedia URLs from message bodies, caches the index
|
|
||||||
- `get_recent_commons_uploads()`: Loads uploads, filters by sent mail match,
|
- `get_recent_commons_uploads()`: Loads uploads, filters by sent mail match,
|
||||||
fetches thumbnails from Commons API
|
joins `flickr_uploads` with `sent_messages`, and fetches thumbnails from
|
||||||
|
Commons API
|
||||||
- `normalize_flickr_url()`: Normalizes URLs for matching (removes protocol, www, trailing slash)
|
- `normalize_flickr_url()`: Normalizes URLs for matching (removes protocol, www, trailing slash)
|
||||||
|
|
||||||
**CommonsUpload dataclass**:
|
**CommonsUpload dataclass**:
|
||||||
|
|
@ -104,9 +108,32 @@ those obtained via Flickr mail requests.
|
||||||
- `wiki_link_url`, `wiki_link_label`: Handles Wikidata vs Wikipedia links
|
- `wiki_link_url`, `wiki_link_label`: Handles Wikidata vs Wikipedia links
|
||||||
|
|
||||||
**Maintenance script** (`update_flickr_uploads.py`):
|
**Maintenance script** (`update_flickr_uploads.py`):
|
||||||
Run to find Flickr uploads from UploadWizard contributions that don't have
|
Builds/updates `flickr_uploads` from `contributions` and links to
|
||||||
the Flickr URL in the edit comment. Queries Commons API for image metadata
|
`sent_messages`.
|
||||||
and checks the Credit field for Flickr URLs.
|
- Scans file contributions containing `UploadWizard` in the comment
|
||||||
|
- Supports both comment styles:
|
||||||
|
- `User created page with UploadWizard` (older)
|
||||||
|
- `Uploaded a work by ... with UploadWizard` (newer; often includes URL)
|
||||||
|
- Extracts Flickr URL from contribution comment when present
|
||||||
|
- Falls back to Commons `extmetadata.Credit` lookup when comment has no URL
|
||||||
|
|
||||||
|
### Interaction Logging (`log_interaction`)
|
||||||
|
|
||||||
|
The `log_interaction()` helper writes a row to `interaction_log` on each
|
||||||
|
meaningful user action:
|
||||||
|
|
||||||
|
- `"search_article"` – user submits a Wikipedia article search (page 1 only,
|
||||||
|
to avoid logging every pagination hit)
|
||||||
|
- `"search_category"` – user submits a Wikipedia category search
|
||||||
|
- `"generate_message"` – a non-free CC message is generated after clicking a photo
|
||||||
|
|
||||||
|
Each row captures: Unix `timestamp`, `interaction_type`, `ip_address`
|
||||||
|
(prefers `X-Forwarded-For` for proxy setups), `user_agent`, `query` (article
|
||||||
|
title or category name), and optionally `flickr_url` / `wikipedia_url`.
|
||||||
|
|
||||||
|
The table is created by `init_db()` (called via `python3 -c "from
|
||||||
|
flickr_mail.database import init_db; init_db()"` or any of the maintenance
|
||||||
|
scripts). The web app never calls `init_db()` itself.
|
||||||
|
|
||||||
### Category Search (`/category` route)
|
### Category Search (`/category` route)
|
||||||
|
|
||||||
|
|
@ -125,7 +152,7 @@ to allow back-navigation to the category.
|
||||||
|
|
||||||
### Previous Message Detection (`get_previous_messages`)
|
### Previous Message Detection (`get_previous_messages`)
|
||||||
|
|
||||||
Checks `sent_mail/messages_index.json` for previous messages to a Flickr user.
|
Checks the `sent_messages` database table for previous messages to a Flickr user.
|
||||||
Matches by both display name and username (case-insensitive). Results shown as
|
Matches by both display name and username (case-insensitive). Results shown as
|
||||||
an info alert on the message page.
|
an info alert on the message page.
|
||||||
|
|
||||||
|
|
@ -159,6 +186,24 @@ print(f"{len(result.photos)} photos, {result.total_pages} pages")
|
||||||
print(result.photos[0].title, result.photos[0].license_name)
|
print(result.photos[0].title, result.photos[0].license_name)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Data Sync Workflow
|
||||||
|
|
||||||
|
To refresh "recent Commons uploads obtained via Flickr mail", run scripts in
|
||||||
|
this order:
|
||||||
|
|
||||||
|
1. `./download_sent_mail.py`
|
||||||
|
2. `./download_commons_contributions.py`
|
||||||
|
3. `./update_flickr_uploads.py`
|
||||||
|
|
||||||
|
Notes:
|
||||||
|
- `download_sent_mail.py` reads Flickr auth cookies from
|
||||||
|
`download_sent_mail.local.json` (`cookies_str` key). Copy
|
||||||
|
`download_sent_mail.example.json` to create local config.
|
||||||
|
- `main.py` does not populate `flickr_uploads`; it only reads from it.
|
||||||
|
- `download_commons_contributions.py` intentionally stops after several
|
||||||
|
consecutive fully-known API batches (overlap window) to avoid full-history
|
||||||
|
scans while still catching shallow gaps.
|
||||||
|
|
||||||
## Potential Improvements
|
## Potential Improvements
|
||||||
|
|
||||||
- Cache search results to reduce Flickr requests
|
- Cache search results to reduce Flickr requests
|
||||||
|
|
|
||||||
143
README.md
|
|
@ -1,89 +1,100 @@
|
||||||
# Flickr Photo Finder for Wikipedia Articles
|
# Flickr Mail
|
||||||
|
|
||||||
Tool lives here: <https://edwardbetts.com/flickr_mail/>
|
Tool lives here: <https://edwardbetts.com/flickr_mail/>
|
||||||
|
|
||||||
This tool is designed to help you find photos on Flickr for Wikipedia articles
|
Flickr Mail is a Flask app that helps find Flickr photos for Wikipedia articles
|
||||||
and contact the photographer. It's a Python application that leverages the Flask
|
and contact photographers to request Wikipedia-compatible licensing.
|
||||||
framework for web development.
|
|
||||||
|
|
||||||
## Table of Contents
|
## What It Does
|
||||||
- [Introduction](#introduction)
|
|
||||||
- [Usage](#usage)
|
|
||||||
- [Error Handling](#error-handling)
|
|
||||||
- [Running the Application](#running-the-application)
|
|
||||||
|
|
||||||
## Introduction
|
- Searches Flickr from a Wikipedia article title/URL
|
||||||
|
- Shows license status for each result (free vs non-free CC variants)
|
||||||
|
- Builds a ready-to-send Flickr message for non-free licenses
|
||||||
|
- Finds image-less articles in a Wikipedia category
|
||||||
|
- Shows recent Commons uploads that came from Flickr mail outreach
|
||||||
|
|
||||||
This tool is developed and maintained by Edward Betts (edward@4angle.com). Its
|
## Project Layout
|
||||||
primary purpose is to simplify the process of discovering and contacting
|
|
||||||
photographers on Flickr whose photos can be used to enhance Wikipedia articles.
|
|
||||||
|
|
||||||
### Key Features
|
- `main.py`: Flask app routes and core logic
|
||||||
- **Integrated Flickr search**: Enter a Wikipedia article title and see Flickr
|
- `templates/`: UI templates
|
||||||
photos directly in the interface - no need to visit Flickr's search page.
|
- `download_sent_mail.py`: sync Flickr sent messages into DB
|
||||||
- **Photo grid with metadata**: Search results display as a grid of thumbnails
|
- `download_commons_contributions.py`: sync Commons contributions into DB
|
||||||
showing the user's name and license for each photo.
|
- `update_flickr_uploads.py`: derive `flickr_uploads` from contributions/sent mail
|
||||||
- **License handling**: Photos with Wikipedia-compatible licenses (CC BY,
|
- `flickr_mail.db`: SQLite database
|
||||||
CC BY-SA, CC0, Public Domain) are highlighted with a green badge and link
|
|
||||||
directly to the Commons UploadWizard. Non-free CC licenses (NC/ND) show a
|
|
||||||
tailored message explaining Wikipedia's requirements. Supports both CC 2.0
|
|
||||||
and CC 4.0 license codes.
|
|
||||||
- **One-click message composition**: Click any photo to compose a permission
|
|
||||||
request message with the photo displayed alongside, showing the user's Flickr
|
|
||||||
profile and current license.
|
|
||||||
- **Previous message detection**: The message page checks sent mail history and
|
|
||||||
warns if you have previously contacted the user.
|
|
||||||
- **Category search**: Find Wikipedia articles without images in a given
|
|
||||||
category, with links to search Flickr for each article.
|
|
||||||
- **Pagination**: Browse through thousands of search results with page navigation.
|
|
||||||
- **Recent uploads showcase**: The home page displays recent Wikimedia Commons
|
|
||||||
uploads that were obtained via Flickr mail requests, with links to the
|
|
||||||
Wikipedia article and user's Flickr profile.
|
|
||||||
- Handle exceptions gracefully and provide detailed error information.
|
|
||||||
|
|
||||||
## Usage
|
## Database Pipeline
|
||||||
|
|
||||||
To use the tool, follow these steps:
|
The recent uploads section depends on a 3-step pipeline:
|
||||||
|
|
||||||
1. Start the tool by running the script.
|
1. `./download_sent_mail.py` updates `sent_messages`
|
||||||
2. Access the tool through a web browser.
|
2. `./download_commons_contributions.py` updates `contributions`
|
||||||
3. Enter a Wikipedia article title or URL, or use "Find articles by category"
|
3. `./update_flickr_uploads.py` builds/updates `flickr_uploads`
|
||||||
to discover articles that need images.
|
|
||||||
4. Browse the Flickr search results displayed in the interface.
|
|
||||||
5. Click on a photo to select it. If the license is Wikipedia-compatible, you'll
|
|
||||||
be linked to the Commons UploadWizard. Otherwise, a message is composed to
|
|
||||||
request a license change.
|
|
||||||
6. Copy the subject and message, then click "Send message on Flickr" to contact
|
|
||||||
the user.
|
|
||||||
|
|
||||||
## Error Handling
|
`main.py` only reads `flickr_uploads`; it does not populate it.
|
||||||
|
|
||||||
The application includes error handling to ensure a smooth user experience. If
|
## UploadWizard Detection
|
||||||
an error occurs, it will display a detailed error message with traceback
|
|
||||||
information. The error handling is designed to provide valuable insights into
|
|
||||||
any issues that may arise during use.
|
|
||||||
|
|
||||||
## Running the Application
|
`update_flickr_uploads.py` supports both Commons UploadWizard comment styles:
|
||||||
|
|
||||||
To run the application, ensure you have Python 3 installed on your system. You
|
- `User created page with UploadWizard` (older)
|
||||||
will also need to install the required Python modules mentioned in the script,
|
- `Uploaded a work by ... with UploadWizard` (newer)
|
||||||
including Flask, requests, and others.
|
|
||||||
|
|
||||||
1. Clone this repository to your local machine.
|
It first tries to extract a Flickr URL directly from the contribution comment.
|
||||||
2. Navigate to the project directory.
|
If absent, it falls back to Commons `extmetadata.Credit`.
|
||||||
3. Run the following command to start the application:
|
|
||||||
|
## Local Run
|
||||||
|
|
||||||
|
Install dependencies (example):
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install flask requests beautifulsoup4 sqlalchemy
|
||||||
|
```
|
||||||
|
|
||||||
|
Start the app:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 main.py
|
python3 main.py
|
||||||
```
|
```
|
||||||
|
|
||||||
4. Access the application by opening a web browser and visiting the provided URL
|
Then open:
|
||||||
(usually `http://localhost:5000/`).
|
|
||||||
|
|
||||||
That's it! You can now use the Flickr Photo Finder tool to streamline the
|
- `http://localhost:5000/`
|
||||||
process of finding and contacting photographers for Wikipedia articles.
|
|
||||||
|
|
||||||
If you encounter any issues or have questions, feel free to contact Edward Betts
|
## Refresh Data
|
||||||
(edward@4angle.com).
|
|
||||||
|
|
||||||
Happy photo hunting!
|
Run in this order:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./download_sent_mail.py
|
||||||
|
./download_commons_contributions.py
|
||||||
|
./update_flickr_uploads.py
|
||||||
|
```
|
||||||
|
|
||||||
|
Before running `./download_sent_mail.py`, create local auth config:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cp download_sent_mail.example.json download_sent_mail.local.json
|
||||||
|
```
|
||||||
|
|
||||||
|
Then edit `download_sent_mail.local.json` and set `cookies_str` to your full
|
||||||
|
Flickr `Cookie` header value.
|
||||||
|
|
||||||
|
## Interaction Logging
|
||||||
|
|
||||||
|
The app logs searches and message generation to the `interaction_log` table:
|
||||||
|
|
||||||
|
- `search_article`: when a user searches for a Wikipedia article title (page 1 only)
|
||||||
|
- `search_category`: when a user searches a Wikipedia category
|
||||||
|
- `generate_message`: when a non-free CC message is generated for a photo
|
||||||
|
|
||||||
|
Each row records the timestamp, interaction type, client IP (from
|
||||||
|
`X-Forwarded-For` if present), User-Agent, query, and (for message events)
|
||||||
|
the Flickr and Wikipedia URLs.
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
- `download_commons_contributions.py` uses an overlap window of known-only
|
||||||
|
batches before stopping to avoid full-history scans while still catching
|
||||||
|
shallow gaps.
|
||||||
|
- If a known Commons upload is missing from `flickr_uploads`, re-run the full
|
||||||
|
3-step pipeline above.
|
||||||
|
|
|
||||||
|
|
@ -12,6 +12,7 @@ from flickr_mail.models import Contribution
|
||||||
|
|
||||||
API_URL = "https://commons.wikimedia.org/w/api.php"
|
API_URL = "https://commons.wikimedia.org/w/api.php"
|
||||||
USERNAME = "Edward"
|
USERNAME = "Edward"
|
||||||
|
CONSECUTIVE_KNOWN_BATCHES_TO_STOP = 3
|
||||||
|
|
||||||
# Identify ourselves properly to Wikimedia
|
# Identify ourselves properly to Wikimedia
|
||||||
USER_AGENT = "CommonsContributionsDownloader/0.1 (edward@4angle.com)"
|
USER_AGENT = "CommonsContributionsDownloader/0.1 (edward@4angle.com)"
|
||||||
|
|
@ -48,12 +49,8 @@ def fetch_contributions(
|
||||||
return contributions, new_continue
|
return contributions, new_continue
|
||||||
|
|
||||||
|
|
||||||
def upsert_contribution(session, c: dict) -> None:
|
def insert_contribution(session, c: dict) -> None:
|
||||||
"""Insert or update a contribution by revid."""
|
"""Insert a contribution row (caller must ensure revid is new)."""
|
||||||
existing = session.query(Contribution).filter_by(revid=c["revid"]).first()
|
|
||||||
if existing:
|
|
||||||
return # Already have this revision
|
|
||||||
|
|
||||||
session.add(Contribution(
|
session.add(Contribution(
|
||||||
userid=c.get("userid"),
|
userid=c.get("userid"),
|
||||||
user=c.get("user"),
|
user=c.get("user"),
|
||||||
|
|
@ -97,6 +94,7 @@ def main() -> None:
|
||||||
batch_num = 0
|
batch_num = 0
|
||||||
new_count = 0
|
new_count = 0
|
||||||
continue_token = None
|
continue_token = None
|
||||||
|
consecutive_known_batches = 0
|
||||||
|
|
||||||
while True:
|
while True:
|
||||||
batch_num += 1
|
batch_num += 1
|
||||||
|
|
@ -108,13 +106,24 @@ def main() -> None:
|
||||||
print("no results")
|
print("no results")
|
||||||
break
|
break
|
||||||
|
|
||||||
|
# One DB query per batch to identify already-known revisions.
|
||||||
|
revids = [c["revid"] for c in contributions if "revid" in c]
|
||||||
|
existing_revids = {
|
||||||
|
row[0]
|
||||||
|
for row in (
|
||||||
|
session.query(Contribution.revid)
|
||||||
|
.filter(Contribution.revid.in_(revids))
|
||||||
|
.all()
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
batch_new = 0
|
batch_new = 0
|
||||||
for c in contributions:
|
for c in contributions:
|
||||||
# Stop if we've reached contributions we already have
|
revid = c.get("revid")
|
||||||
existing = session.query(Contribution).filter_by(revid=c["revid"]).first()
|
if revid in existing_revids:
|
||||||
if existing:
|
|
||||||
continue
|
continue
|
||||||
upsert_contribution(session, c)
|
|
||||||
|
insert_contribution(session, c)
|
||||||
batch_new += 1
|
batch_new += 1
|
||||||
|
|
||||||
new_count += batch_new
|
new_count += batch_new
|
||||||
|
|
@ -123,7 +132,18 @@ def main() -> None:
|
||||||
session.commit()
|
session.commit()
|
||||||
|
|
||||||
if batch_new == 0:
|
if batch_new == 0:
|
||||||
# All contributions in this batch already exist, we're caught up
|
consecutive_known_batches += 1
|
||||||
|
print(
|
||||||
|
" Batch fully known "
|
||||||
|
f"({consecutive_known_batches}/"
|
||||||
|
f"{CONSECUTIVE_KNOWN_BATCHES_TO_STOP})"
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
consecutive_known_batches = 0
|
||||||
|
|
||||||
|
if consecutive_known_batches >= CONSECUTIVE_KNOWN_BATCHES_TO_STOP:
|
||||||
|
# Stop after a small overlap window of known-only batches.
|
||||||
|
# This catches recent historical gaps without full-history scans.
|
||||||
print(" Caught up with existing data")
|
print(" Caught up with existing data")
|
||||||
break
|
break
|
||||||
|
|
||||||
|
|
|
||||||
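The overlap-window stop condition introduced in the diff above can be isolated into a small pure function to make its behaviour easier to see. This is only a sketch: `batches_to_scan` is a hypothetical name, and the list of per-batch new-row counts stands in for the real API batches.

```python
def batches_to_scan(batch_new_counts: list[int], stop_after: int = 3) -> int:
    """Return how many batches are processed before the loop stops.

    Mirrors the counter logic in the diff: a batch with no new rows
    increments the counter, any batch with new rows resets it, and the
    loop stops once `stop_after` fully-known batches occur in a row.
    """
    consecutive_known = 0
    scanned = 0
    for batch_new in batch_new_counts:
        scanned += 1
        if batch_new == 0:
            consecutive_known += 1
        else:
            consecutive_known = 0
        if consecutive_known >= stop_after:
            break
    return scanned
```

With `stop_after=3`, a gap of new rows hidden behind one or two fully-known batches is still reached, which is the "shallow gaps" case the notes mention; a gap behind three or more known batches would be missed.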
3
download_sent_mail.example.json
Normal file
|
|
@ -0,0 +1,3 @@
|
||||||
|
{
|
||||||
|
"cookies_str": "paste your full Flickr Cookie header value here"
|
||||||
|
}
|
||||||
|
|
@ -1,7 +1,9 @@
|
||||||
#!/usr/bin/env python3
|
#!/usr/bin/env python3
|
||||||
"""Download sent FlickrMail messages for backup."""
|
"""Download sent FlickrMail messages for backup."""
|
||||||
|
|
||||||
|
import json
|
||||||
import time
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
import requests
|
import requests
|
||||||
from bs4 import BeautifulSoup
|
from bs4 import BeautifulSoup
|
||||||
|
|
@ -17,6 +19,9 @@ from flickr_mail.url_utils import (
|
||||||
BASE_URL = "https://www.flickr.com"
|
BASE_URL = "https://www.flickr.com"
|
||||||
SENT_MAIL_URL = f"{BASE_URL}/mail/sent/page{{page}}"
|
SENT_MAIL_URL = f"{BASE_URL}/mail/sent/page{{page}}"
|
||||||
MESSAGE_URL = f"{BASE_URL}/mail/sent/{{message_id}}"
|
MESSAGE_URL = f"{BASE_URL}/mail/sent/{{message_id}}"
|
||||||
|
MAX_SENT_MAIL_PAGES = 29 # Fallback upper bound if we need to backfill everything
|
||||||
|
CONFIG_FILE = Path(__file__).with_name("download_sent_mail.local.json")
|
||||||
|
EXAMPLE_CONFIG_FILE = Path(__file__).with_name("download_sent_mail.example.json")
|
||||||
|
|
||||||
HEADERS = {
|
HEADERS = {
|
||||||
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:147.0) Gecko/20100101 Firefox/147.0",
|
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:147.0) Gecko/20100101 Firefox/147.0",
|
||||||
|
|
@ -33,7 +38,23 @@ HEADERS = {
|
||||||
"Priority": "u=0, i",
|
"Priority": "u=0, i",
|
||||||
}
|
}
|
||||||
|
|
||||||
COOKIES_STR = """ccc=%7B%22needsConsent%22%3Atrue%2C%22managed%22%3A0%2C%22changed%22%3A0%2C%22info%22%3A%7B%22cookieBlock%22%3A%7B%22level%22%3A2%2C%22blockRan%22%3A1%7D%7D%7D; _sp_ses.df80=*; _sp_id.df80=968931de-089d-4576-b729-6662c2c13a65.1770187027.1.1770187129..adf2374b-b85c-4899-afb7-63c2203d0c44..9422de57-9cdf-49c9-ac54-183eaa1ec457.1770187027101.24; TAsessionID=7f373c97-e9f8-46cb-bc1a-cb4f164ce46b|NEW; notice_behavior=expressed,eu; usprivacy=1---; acstring=3~550.1942.3126.3005.3077.1329.196.1725.1092; euconsent-v2=CQfGXgAQfGXgAAvACDENCQFsAP_gAEPgAAAALktB9G5cSSFBYCJVYbtEYAQDwFhg4oAhAgABEwAATBoAoIwGBGAoIAiAICACAAAAIARAIAEECAAAQAAAIIABAAAMAEAAIAACIAAACAABAgAACEAIAAggWAAAAEBEAFQAgAAAQBIACFAAAgABAUABAAAAAACAAQAAACAgQAAAAAAAAAAAkAhAAAAAAAAAABAMAAABIAAAAAAAAAAAAAAAAAAABAAAAICBAAAAQAAAAAAAAAAAAAAAAAAAAgqY0H0blxJIUFgIFVhu0QgBBPAWADigCEAAAEDAABMGgCgjAIUYCAgSIAgIAAAAAAgBEAgAQAIAABAAAAAgAEAAAwAQAAgAAAAAAAAAAECAAAAQAgACCBYAAAAQEQAVACBAABAEgAIUAAAAAEBQAEAAAAAAIABAAAAICBAAAAAAAAAAACQCEAAAAAAAAAAEAwBAAEgAAAAAAAAAAAAAAAAAAAEABAAgIEAAABAA.YAAAAAAAAAAA.ILktB9G5cSSFBYCJVYbtEYAQTwFhg4oAhAgABEwAATBoAoIwGFGAoIEiAICACAAAAIARAIAEECAAAQAAAIIABAAAMAEAAIAACIAAACAABAgAACEAIAAggWAAAAEBEAFQAgQAAQBIACFAAAgABAUABAAAAAACAAQAAACAgQAAAAAAAAAAAkAhAAAAAAAAAABAMAQABIAAAAAAAAAAAAAAAAAAABAAQAICBAAAAQAAAAAAAAAAAAAAAAAAAAgA; notice_preferences=2:; notice_gdpr_prefs=0,1,2:; cmapi_gtm_bl=; cmapi_cookie_privacy=permit 1,2,3; AMCV_48E815355BFE96970A495CD0%40AdobeOrg=281789898%7CMCMID%7C44859851125632937290373504988866174366%7CMCOPTOUT-1770194232s%7CNONE%7CvVersion%7C4.1.0; AMCVS_48E815355BFE96970A495CD0%40AdobeOrg=1; xb=646693; localization=en-us%3Buk%3Bgb; flrbp=1770187037-cfbf3914859af9ef68992c8389162e65e81c86c4; flrbgrp=1770187037-8e700fa7d73b4f2d43550f40513e7c6f507fd20f; flrbgdrp=1770187037-9af21cc74000b5f3f0943243608b4284d5f60ffd; flrbgmrp=1770187037-53f7bfff110731954be6bdfb2f587d59a8305670; flrbrst=1770187037-440e42fcee9b4e8e81ba8bc3eb3d0fc8b62e7083; 
flrtags=1770187037-7b50035cb956b9216a2f3372f498f7008d8e26a8; flrbrp=1770187037-c0195dc99caa020d4e32b39556131add862f26a0; flrb=34; session_id=2693fb01-87a0-42b1-a426-74642807b534; cookie_session=834645%3A29f2a9722d8bac88553ea1baf7ea11b4; cookie_accid=834645; cookie_epass=29f2a9722d8bac88553ea1baf7ea11b4; sa=1775371036%3A79962317%40N00%3A8fb60f4760b4840f37af3ebc90a8cb57; vp=2075%2C1177%2C1%2C0; flrbfd=1770187037-88a4e436729c9c5551794483fbd9c80e9dac2354; flrbpap=1770187037-18adaacf3a389df4a7bdc05cd471e492c54ef841; liqpw=2075; liqph=672"""
|
def load_cookie_string() -> str:
|
||||||
|
"""Load Flickr cookies string from local JSON config."""
|
||||||
|
if not CONFIG_FILE.exists():
|
||||||
|
raise RuntimeError(
|
||||||
|
f"Missing config file: {CONFIG_FILE}. "
|
||||||
|
f"Copy {EXAMPLE_CONFIG_FILE.name} to {CONFIG_FILE.name} and set cookies_str."
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
data = json.loads(CONFIG_FILE.read_text())
|
||||||
|
except json.JSONDecodeError as exc:
|
||||||
|
raise RuntimeError(f"Invalid JSON in {CONFIG_FILE}: {exc}") from exc
|
||||||
|
|
||||||
|
cookie_str = data.get("cookies_str", "").strip()
|
||||||
|
if not cookie_str:
|
||||||
|
raise RuntimeError(f"{CONFIG_FILE} must contain a non-empty 'cookies_str' value")
|
||||||
|
return cookie_str
|
||||||
|
|
||||||
|
|
||||||
def parse_cookies(cookie_str: str) -> dict[str, str]:
|
def parse_cookies(cookie_str: str) -> dict[str, str]:
|
||||||
|
|
@ -50,7 +71,7 @@ def create_session() -> requests.Session:
|
||||||
"""Create a requests session with authentication."""
|
"""Create a requests session with authentication."""
|
||||||
session = requests.Session()
|
session = requests.Session()
|
||||||
session.headers.update(HEADERS)
|
session.headers.update(HEADERS)
|
||||||
session.cookies.update(parse_cookies(COOKIES_STR))
|
session.cookies.update(parse_cookies(load_cookie_string()))
|
||||||
return session
|
return session
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -166,22 +187,41 @@ def main() -> None:
|
||||||
|
|
||||||
http_session = create_session()
|
http_session = create_session()
|
||||||
|
|
||||||
# Scrape all pages to find new messages
|
|
||||||
total_pages = 29
|
|
||||||
new_messages: list[dict] = []
|
new_messages: list[dict] = []
|
||||||
|
stop_fetching = False
|
||||||
|
|
||||||
print("Fetching message list from all pages...")
|
print("Fetching message list until we reach existing messages...")
|
||||||
for page in range(1, total_pages + 1):
|
for page in range(1, MAX_SENT_MAIL_PAGES + 1):
|
||||||
url = SENT_MAIL_URL.format(page=page)
|
url = SENT_MAIL_URL.format(page=page)
|
||||||
print(f" Fetching page {page}/{total_pages}...")
|
print(f" Fetching page {page}...")
|
||||||
|
|
||||||
try:
|
try:
|
||||||
soup = fetch_page(http_session, url)
|
soup = fetch_page(http_session, url)
|
||||||
page_messages = extract_messages_from_list_page(soup)
|
page_messages = extract_messages_from_list_page(soup)
|
||||||
|
|
||||||
|
if not page_messages:
|
||||||
|
print(" No messages found on this page, stopping")
|
||||||
|
break
|
||||||
|
|
||||||
|
page_new_messages = 0
|
||||||
for msg in page_messages:
|
for msg in page_messages:
|
||||||
if msg["message_id"] not in existing_ids:
|
msg_id = msg.get("message_id")
|
||||||
new_messages.append(msg)
|
if not msg_id:
|
||||||
|
continue
|
||||||
|
if msg_id in existing_ids:
|
||||||
|
stop_fetching = True
|
||||||
|
break
|
||||||
|
|
||||||
|
new_messages.append(msg)
|
||||||
|
page_new_messages += 1
|
||||||
|
|
||||||
|
if stop_fetching:
|
||||||
|
print(" Reached messages already in the database, stopping pagination")
|
||||||
|
break
|
||||||
|
|
||||||
|
if page_new_messages == 0:
|
||||||
|
print(" No new messages on this page, stopping pagination")
|
||||||
|
break
|
||||||
|
|
||||||
time.sleep(1) # Be polite to the server
|
time.sleep(1) # Be polite to the server
|
||||||
|
|
||||||
|
|
|
||||||
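The new pagination behaviour in `download_sent_mail.py` stops at the first message ID that is already stored. A minimal sketch of that control flow, with the HTTP fetching replaced by an in-memory list of pages, is shown below. The function name `collect_new_messages` is hypothetical, and the early stop only makes sense if the sent-mail list is ordered newest first, which is an assumption here.

```python
def collect_new_messages(
    pages: list[list[dict]], existing_ids: set[str]
) -> list[dict]:
    """Walk pages in order and stop once a known message ID is reached."""
    new_messages: list[dict] = []
    for page_messages in pages:
        if not page_messages:
            break  # Empty page: nothing further to fetch.
        page_new = 0
        reached_known = False
        for msg in page_messages:
            msg_id = msg.get("message_id")
            if not msg_id:
                continue  # Skip entries without an ID.
            if msg_id in existing_ids:
                reached_known = True
                break
            new_messages.append(msg)
            page_new += 1
        if reached_known or page_new == 0:
            break
    return new_messages
```

Compared with the previous version, which always scanned a fixed number of pages, this trades completeness for fewer requests: a message missing from the middle of the stored history would not be backfilled.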
|
|
@ -12,20 +12,20 @@ class Contribution(Base):
|
||||||
__tablename__ = "contributions"
|
__tablename__ = "contributions"
|
||||||
|
|
||||||
id: Mapped[int] = mapped_column(primary_key=True)
|
id: Mapped[int] = mapped_column(primary_key=True)
|
||||||
userid: Mapped[int | None]
|
userid: Mapped[int]
|
||||||
user: Mapped[str | None]
|
user: Mapped[str]
|
||||||
pageid: Mapped[int | None]
|
pageid: Mapped[int]
|
||||||
revid: Mapped[int | None] = mapped_column(unique=True)
|
revid: Mapped[int] = mapped_column(unique=True)
|
||||||
parentid: Mapped[int | None]
|
parentid: Mapped[int]
|
||||||
ns: Mapped[int | None]
|
ns: Mapped[int]
|
||||||
title: Mapped[str | None]
|
title: Mapped[str]
|
||||||
timestamp: Mapped[str | None]
|
timestamp: Mapped[str]
|
||||||
minor: Mapped[str | None]
|
minor: Mapped[str | None]
|
||||||
top: Mapped[str | None]
|
top: Mapped[str | None]
|
||||||
comment: Mapped[str | None] = mapped_column(Text)
|
comment: Mapped[str] = mapped_column(Text)
|
||||||
size: Mapped[int | None]
|
size: Mapped[int]
|
||||||
sizediff: Mapped[int | None]
|
sizediff: Mapped[int]
|
||||||
tags: Mapped[str | None] = mapped_column(Text) # JSON array stored as text
|
tags: Mapped[str] = mapped_column(Text) # JSON array stored as text
|
||||||
|
|
||||||
__table_args__ = (
|
__table_args__ = (
|
||||||
Index("ix_contributions_timestamp", "timestamp"),
|
Index("ix_contributions_timestamp", "timestamp"),
|
||||||
|
|
@ -37,16 +37,16 @@ class SentMessage(Base):
|
||||||
__tablename__ = "sent_messages"
|
__tablename__ = "sent_messages"
|
||||||
|
|
||||||
message_id: Mapped[str] = mapped_column(primary_key=True)
|
message_id: Mapped[str] = mapped_column(primary_key=True)
|
||||||
subject: Mapped[str | None]
|
subject: Mapped[str]
|
||||||
url: Mapped[str | None]
|
url: Mapped[str]
|
||||||
recipient: Mapped[str | None]
|
recipient: Mapped[str]
|
||||||
date: Mapped[str | None]
|
date: Mapped[str]
|
||||||
body: Mapped[str | None] = mapped_column(Text)
|
body: Mapped[str] = mapped_column(Text)
|
||||||
body_html: Mapped[str | None] = mapped_column(Text)
|
body_html: Mapped[str] = mapped_column(Text)
|
||||||
flickr_url: Mapped[str | None]
|
flickr_url: Mapped[str]
|
||||||
normalized_flickr_url: Mapped[str | None]
|
normalized_flickr_url: Mapped[str]
|
||||||
wikipedia_url: Mapped[str | None]
|
wikipedia_url: Mapped[str]
|
||||||
creator_profile_url: Mapped[str | None]
|
creator_profile_url: Mapped[str]
|
||||||
|
|
||||||
flickr_uploads: Mapped[list["FlickrUpload"]] = relationship(
|
flickr_uploads: Mapped[list["FlickrUpload"]] = relationship(
|
||||||
back_populates="sent_message"
|
back_populates="sent_message"
|
||||||
|
|
@ -62,15 +62,15 @@ class FlickrUpload(Base):
|
||||||
__tablename__ = "flickr_uploads"
|
__tablename__ = "flickr_uploads"
|
||||||
|
|
||||||
id: Mapped[int] = mapped_column(primary_key=True)
|
id: Mapped[int] = mapped_column(primary_key=True)
|
||||||
pageid: Mapped[int | None]
|
pageid: Mapped[int]
|
||||||
revid: Mapped[int | None]
|
revid: Mapped[int]
|
||||||
title: Mapped[str | None]
|
title: Mapped[str]
|
||||||
timestamp: Mapped[str | None]
|
timestamp: Mapped[str]
|
||||||
flickr_url: Mapped[str | None]
|
flickr_url: Mapped[str]
|
||||||
normalized_flickr_url: Mapped[str | None]
|
normalized_flickr_url: Mapped[str]
|
||||||
creator: Mapped[str | None]
|
creator: Mapped[str | None]
|
||||||
wikipedia_url: Mapped[str | None]
|
wikipedia_url: Mapped[str]
|
||||||
creator_profile_url: Mapped[str | None]
|
creator_profile_url: Mapped[str]
|
||||||
sent_message_id: Mapped[str | None] = mapped_column(
|
sent_message_id: Mapped[str | None] = mapped_column(
|
||||||
ForeignKey("sent_messages.message_id")
|
ForeignKey("sent_messages.message_id")
|
||||||
)
|
)
|
||||||
|
|
@ -89,5 +89,23 @@ class ThumbnailCache(Base):
|
||||||
__tablename__ = "thumbnail_cache"
|
__tablename__ = "thumbnail_cache"
|
||||||
|
|
||||||
title: Mapped[str] = mapped_column(primary_key=True)
|
title: Mapped[str] = mapped_column(primary_key=True)
|
||||||
thumb_url: Mapped[str | None]
|
thumb_url: Mapped[str]
|
||||||
fetched_at: Mapped[int | None] # Unix timestamp
|
fetched_at: Mapped[int] # Unix timestamp
|
||||||
|
|
||||||
|
|
||||||
|
class InteractionLog(Base):
|
||||||
|
__tablename__ = "interaction_log"
|
||||||
|
|
||||||
|
id: Mapped[int] = mapped_column(primary_key=True)
|
||||||
|
timestamp: Mapped[int] # Unix timestamp
|
||||||
|
interaction_type: Mapped[str] # "search_article", "search_category", "generate_message"
|
||||||
|
ip_address: Mapped[str | None]
|
||||||
|
user_agent: Mapped[str | None] = mapped_column(Text)
|
||||||
|
query: Mapped[str | None] # search term or category name
|
||||||
|
flickr_url: Mapped[str | None]
|
||||||
|
wikipedia_url: Mapped[str | None]
|
||||||
|
|
||||||
|
__table_args__ = (
|
||||||
|
Index("ix_interaction_log_timestamp", "timestamp"),
|
||||||
|
Index("ix_interaction_log_type", "interaction_type"),
|
||||||
|
)
|
||||||
|
|
|
||||||
159
main.py
|
|
@ -18,7 +18,7 @@ from sqlalchemy import func
|
||||||
from werkzeug.debug.tbtools import DebugTraceback
|
from werkzeug.debug.tbtools import DebugTraceback
|
||||||
|
|
||||||
from flickr_mail.database import get_session
|
from flickr_mail.database import get_session
|
||||||
from flickr_mail.models import FlickrUpload, SentMessage, ThumbnailCache
|
from flickr_mail.models import FlickrUpload, InteractionLog, SentMessage, ThumbnailCache
|
||||||
from flickr_mail.url_utils import extract_urls_from_message, normalize_flickr_url
|
from flickr_mail.url_utils import extract_urls_from_message, normalize_flickr_url
|
||||||
|
|
||||||
import re
|
import re
|
||||||
|
|
@ -348,6 +348,14 @@ class ArticleWithoutImage:
|
||||||
return f"/?enwp={quote(self.title)}"
|
return f"/?enwp={quote(self.title)}"
|
||||||
|
|
||||||
|
|
||||||
|
@dataclasses.dataclass
|
||||||
|
class CategoryResult:
|
||||||
|
"""Result of a paginated category search."""
|
||||||
|
|
||||||
|
articles: list[ArticleWithoutImage]
|
||||||
|
gcmcontinue: str | None
|
||||||
|
|
||||||
|
|
||||||
# Common non-content images to ignore when checking if an article has images
|
# Common non-content images to ignore when checking if an article has images
|
||||||
NON_CONTENT_IMAGE_PATTERNS = [
|
NON_CONTENT_IMAGE_PATTERNS = [
|
||||||
"OOjs UI icon",
|
"OOjs UI icon",
|
||||||
|
|
@ -379,14 +387,15 @@ def has_content_image(images: list[dict]) -> bool:
|
||||||
|
|
||||||
|
|
||||||
def get_articles_without_images(
|
def get_articles_without_images(
|
||||||
category: str, limit: int = 100
|
category: str,
|
||||||
) -> tuple[list[ArticleWithoutImage], str | None]:
|
limit: int = 200,
|
||||||
|
gcmcontinue: str | None = None,
|
||||||
|
) -> CategoryResult:
|
||||||
"""Get articles in a category that don't have images.
|
"""Get articles in a category that don't have images.
|
||||||
|
|
||||||
Uses generator=categorymembers with prop=images to efficiently check
|
Uses generator=categorymembers with prop=images to efficiently check
|
||||||
multiple articles in a single API request.
|
multiple articles in a single API request, following continuation until
|
||||||
|
the limit is reached or all category members have been processed.
|
||||||
Returns a tuple of (articles_list, continue_token).
|
|
||||||
"""
|
"""
|
||||||
params = {
|
params = {
|
||||||
"action": "query",
|
"action": "query",
|
||||||
|
|
@ -394,49 +403,73 @@ def get_articles_without_images(
|
||||||
"gcmtitle": category,
|
"gcmtitle": category,
|
||||||
"gcmtype": "page", # Only articles, not subcategories or files
|
"gcmtype": "page", # Only articles, not subcategories or files
|
||||||
"gcmnamespace": "0", # Main namespace only
|
"gcmnamespace": "0", # Main namespace only
|
||||||
"gcmlimit": str(limit),
|
"gcmlimit": "50", # Small batches so images fit in one response
|
||||||
"prop": "images",
|
"prop": "images",
|
||||||
"imlimit": "max", # Need enough to check all pages in batch
|
"imlimit": "max",
|
||||||
"format": "json",
|
"format": "json",
|
||||||
}
|
}
|
||||||
|
|
||||||
headers = {"User-Agent": WIKIMEDIA_USER_AGENT}
|
headers = {"User-Agent": WIKIMEDIA_USER_AGENT}
|
||||||
|
|
||||||
try:
|
|
||||||
response = requests.get(
|
|
||||||
WIKIPEDIA_API, params=params, headers=headers, timeout=30
|
|
||||||
)
|
|
||||||
response.raise_for_status()
|
|
||||||
data = response.json()
|
|
||||||
except (requests.RequestException, json.JSONDecodeError) as e:
|
|
||||||
print(f"Wikipedia API error: {e}")
|
|
||||||
return [], None
|
|
||||||
|
|
||||||
articles_without_images: list[ArticleWithoutImage] = []
|
articles_without_images: list[ArticleWithoutImage] = []
|
||||||
|
seen_pageids: set[int] = set()
|
||||||
|
next_gcmcontinue: str | None = None
|
||||||
|
|
||||||
pages = data.get("query", {}).get("pages", {})
|
# Build initial continue params from the external pagination token
|
||||||
for page in pages.values():
|
continue_params: dict[str, str] = {}
|
||||||
images = page.get("images", [])
|
if gcmcontinue:
|
||||||
|
continue_params = {"gcmcontinue": gcmcontinue, "continue": "gcmcontinue||"}
|
||||||
|
|
||||||
# Skip if page has content images (not just UI icons)
|
while True:
|
||||||
if has_content_image(images):
|
request_params = params.copy()
|
||||||
continue
|
request_params.update(continue_params)
|
||||||
|
|
||||||
title = page.get("title", "")
|
try:
|
||||||
pageid = page.get("pageid", 0)
|
response = requests.get(
|
||||||
|
WIKIPEDIA_API, params=request_params, headers=headers, timeout=30
|
||||||
if title and pageid:
|
|
||||||
articles_without_images.append(
|
|
||||||
ArticleWithoutImage(title=title, pageid=pageid)
|
|
||||||
)
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
data = response.json()
|
||||||
|
except (requests.RequestException, json.JSONDecodeError) as e:
|
||||||
|
print(f"Wikipedia API error: {e}")
|
||||||
|
break
|
||||||
|
|
||||||
|
pages = data.get("query", {}).get("pages", {})
|
||||||
|
for page in pages.values():
|
||||||
|
pageid = page.get("pageid", 0)
|
||||||
|
if not pageid or pageid in seen_pageids:
|
||||||
|
continue
|
||||||
|
seen_pageids.add(pageid)
|
||||||
|
|
||||||
|
images = page.get("images", [])
|
||||||
|
|
||||||
|
# Skip if page has content images (not just UI icons)
|
||||||
|
if has_content_image(images):
|
||||||
|
continue
|
||||||
|
|
||||||
|
title = page.get("title", "")
|
||||||
|
if title:
|
||||||
|
articles_without_images.append(
|
||||||
|
ArticleWithoutImage(title=title, pageid=pageid)
|
||||||
|
)
|
||||||
|
|
||||||
|
api_continue = data.get("continue")
|
||||||
|
if not api_continue:
|
||||||
|
break
|
||||||
|
|
||||||
|
# Only stop at generator boundaries where we have a resumable token
|
||||||
|
gcmc = api_continue.get("gcmcontinue")
|
||||||
|
if gcmc and len(articles_without_images) >= limit:
|
||||||
|
next_gcmcontinue = gcmc
|
||||||
|
break
|
||||||
|
|
||||||
|
continue_params = api_continue
|
||||||
|
|
||||||
# Sort by title for consistent display
|
# Sort by title for consistent display
|
||||||
articles_without_images.sort(key=lambda a: a.title)
|
articles_without_images.sort(key=lambda a: a.title)
|
||||||
|
return CategoryResult(
|
||||||
# Get continue token if there are more results
|
articles=articles_without_images,
|
||||||
continue_token = data.get("continue", {}).get("gcmcontinue")
|
gcmcontinue=next_gcmcontinue,
|
||||||
|
)
|
||||||
return articles_without_images, continue_token
|
|
||||||
|
|
||||||
|
|
||||||
def is_valid_flickr_image_url(url: str) -> bool:
|
def is_valid_flickr_image_url(url: str) -> bool:
|
||||||
|
|
@ -458,7 +491,7 @@ def is_valid_flickr_image_url(url: str) -> bool:
|
||||||
|
|
||||||
def search_flickr(search_term: str, page: int = 1) -> SearchResult:
|
def search_flickr(search_term: str, page: int = 1) -> SearchResult:
|
||||||
"""Search Flickr for photos matching the search term."""
|
"""Search Flickr for photos matching the search term."""
|
||||||
encoded_term = quote(f'"{search_term}"')
|
encoded_term = quote(search_term)
|
||||||
url = f"https://flickr.com/search/?view_all=1&text={encoded_term}&page={page}"
|
url = f"https://flickr.com/search/?view_all=1&text={encoded_term}&page={page}"
|
||||||
|
|
||||||
response = requests.get(url, headers=BROWSER_HEADERS)
|
response = requests.get(url, headers=BROWSER_HEADERS)
|
||||||
|
|
@ -583,6 +616,33 @@ def parse_flickr_search_results(html: str, page: int = 1) -> SearchResult:
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def log_interaction(
|
||||||
|
interaction_type: str,
|
||||||
|
query: str | None = None,
|
||||||
|
flickr_url: str | None = None,
|
||||||
|
wikipedia_url: str | None = None,
|
||||||
|
) -> None:
|
||||||
|
"""Log a user interaction to the database."""
|
||||||
|
forwarded_for = flask.request.headers.get("X-Forwarded-For")
|
||||||
|
ip_address = forwarded_for.split(",")[0].strip() if forwarded_for else flask.request.remote_addr
|
||||||
|
user_agent = flask.request.headers.get("User-Agent")
|
||||||
|
session = get_session()
|
||||||
|
try:
|
||||||
|
entry = InteractionLog(
|
||||||
|
timestamp=int(time.time()),
|
||||||
|
interaction_type=interaction_type,
|
||||||
|
ip_address=ip_address,
|
||||||
|
user_agent=user_agent,
|
||||||
|
query=query,
|
||||||
|
flickr_url=flickr_url,
|
||||||
|
wikipedia_url=wikipedia_url,
|
||||||
|
)
|
||||||
|
session.add(entry)
|
||||||
|
session.commit()
|
||||||
|
finally:
|
||||||
|
session.close()
|
||||||
|
|
||||||
|
|
||||||
@app.errorhandler(werkzeug.exceptions.InternalServerError)
|
@app.errorhandler(werkzeug.exceptions.InternalServerError)
|
||||||
def exception_handler(e: werkzeug.exceptions.InternalServerError) -> tuple[str, int]:
|
def exception_handler(e: werkzeug.exceptions.InternalServerError) -> tuple[str, int]:
|
||||||
"""Handle exception."""
|
"""Handle exception."""
|
||||||
|
|
@ -651,18 +711,24 @@ def start() -> str:
|
||||||
# Get category param if coming from category search
|
# Get category param if coming from category search
|
||||||
cat = flask.request.args.get("cat")
|
cat = flask.request.args.get("cat")
|
||||||
|
|
||||||
|
# Allow overriding the Flickr search term (default includes quotes for phrase search)
|
||||||
|
flickr_search = flask.request.args.get("flickr_search") or f'"{name}"'
|
||||||
|
|
||||||
flickr_url = flask.request.args.get("flickr")
|
flickr_url = flask.request.args.get("flickr")
|
||||||
if not flickr_url:
|
if not flickr_url:
|
||||||
# Search Flickr for photos
|
# Search Flickr for photos
|
||||||
page = flask.request.args.get("page", 1, type=int)
|
page = flask.request.args.get("page", 1, type=int)
|
||||||
page = max(1, page) # Ensure page is at least 1
|
page = max(1, page) # Ensure page is at least 1
|
||||||
search_result = search_flickr(name, page)
|
if page == 1:
|
||||||
|
log_interaction("search_article", query=flickr_search, wikipedia_url=wikipedia_url)
|
||||||
|
search_result = search_flickr(flickr_search, page)
|
||||||
return flask.render_template(
|
return flask.render_template(
|
||||||
"combined.html",
|
"combined.html",
|
||||||
name=name,
|
name=name,
|
||||||
enwp=enwp,
|
enwp=enwp,
|
||||||
search_result=search_result,
|
search_result=search_result,
|
||||||
cat=cat,
|
cat=cat,
|
||||||
|
flickr_search=flickr_search,
|
||||||
)
|
)
|
||||||
|
|
||||||
if "/in/" in flickr_url:
|
if "/in/" in flickr_url:
|
||||||
|
|
@ -716,8 +782,16 @@ def start() -> str:
|
||||||
flickr_user_url=flickr_user_url,
|
flickr_user_url=flickr_user_url,
|
||||||
cat=cat,
|
cat=cat,
|
||||||
previous_messages=previous_messages,
|
previous_messages=previous_messages,
|
||||||
|
flickr_search=flickr_search,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
log_interaction(
|
||||||
|
"generate_message",
|
||||||
|
query=name,
|
||||||
|
flickr_url=flickr_url,
|
||||||
|
wikipedia_url=wikipedia_url,
|
||||||
|
)
|
||||||
|
|
||||||
msg = flask.render_template(
|
msg = flask.render_template(
|
||||||
"message.jinja",
|
"message.jinja",
|
||||||
flickr_url=flickr_url,
|
flickr_url=flickr_url,
|
||||||
|
|
@ -749,6 +823,7 @@ def start() -> str:
|
||||||
flickr_user_url=flickr_user_url,
|
flickr_user_url=flickr_user_url,
|
||||||
cat=cat,
|
cat=cat,
|
||||||
previous_messages=previous_messages,
|
previous_messages=previous_messages,
|
||||||
|
flickr_search=flickr_search,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -768,7 +843,9 @@ def category_search() -> str:
|
||||||
cat=cat,
|
cat=cat,
|
||||||
)
|
)
|
||||||
|
|
||||||
articles, continue_token = get_articles_without_images(category)
|
log_interaction("search_category", query=category)
|
||||||
|
gcmcontinue = flask.request.args.get("gcmcontinue") or None
|
||||||
|
result = get_articles_without_images(category, gcmcontinue=gcmcontinue)
|
||||||
|
|
||||||
# Get the display name (without Category: prefix)
|
# Get the display name (without Category: prefix)
|
||||||
category_name = category.replace("Category:", "")
|
category_name = category.replace("Category:", "")
|
||||||
|
|
@ -778,8 +855,8 @@ def category_search() -> str:
|
||||||
cat=cat,
|
cat=cat,
|
||||||
category=category,
|
category=category,
|
||||||
category_name=category_name,
|
category_name=category_name,
|
||||||
articles=articles,
|
articles=result.articles,
|
||||||
continue_token=continue_token,
|
gcmcontinue=result.gcmcontinue,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -33,7 +33,7 @@
|
||||||
<h5>Articles without images in <a href="https://en.wikipedia.org/wiki/{{ category | replace(' ', '_') }}" target="_blank">{{ category_name }}</a></h5>
|
<h5>Articles without images in <a href="https://en.wikipedia.org/wiki/{{ category | replace(' ', '_') }}" target="_blank">{{ category_name }}</a></h5>
|
||||||
|
|
||||||
{% if articles %}
|
{% if articles %}
|
||||||
<p class="text-muted small">Found {{ articles | length }} article(s) without images{% if continue_token %} (more available){% endif %}</p>
|
<p class="text-muted small">Found {{ articles | length }} article(s) without images</p>
|
||||||
|
|
||||||
<div class="list-group">
|
<div class="list-group">
|
||||||
{% for article in articles %}
|
{% for article in articles %}
|
||||||
|
|
@ -44,8 +44,10 @@
|
||||||
{% endfor %}
|
{% endfor %}
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
{% if continue_token %}
|
{% if gcmcontinue %}
|
||||||
<p class="text-muted small mt-3">Note: Only showing first batch of results. More articles may be available in this category.</p>
|
<div class="mt-3">
|
||||||
|
<a href="{{ url_for('category_search', cat=cat, gcmcontinue=gcmcontinue) }}" class="btn btn-outline-primary">Next page »</a>
|
||||||
|
</div>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
|
|
||||||
{% else %}
|
{% else %}
|
||||||
|
|
|
||||||
|
|
@ -12,7 +12,7 @@
|
||||||
<input type="text" class="form-control" id="enwp" name="enwp" value="{{ enwp }}" required>
|
<input type="text" class="form-control" id="enwp" name="enwp" value="{{ enwp }}" required>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<input type="submit" value="Submit">
|
<input type="submit" class="btn btn-primary" value="Search">
|
||||||
<a href="{{ url_for('category_search') }}" class="btn btn-outline-secondary ms-2">Find articles by category</a>
|
<a href="{{ url_for('category_search') }}" class="btn btn-outline-secondary ms-2">Find articles by category</a>
|
||||||
</form>
|
</form>
|
||||||
|
|
||||||
|
|
@ -63,13 +63,20 @@
|
||||||
<p><a href="{{ url_for('category_search', cat=cat) }}">← Back to category</a></p>
|
<p><a href="{{ url_for('category_search', cat=cat) }}">← Back to category</a></p>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
<p>Wikipedia article: {{ name }}</p>
|
<p>Wikipedia article: {{ name }}</p>
|
||||||
|
<form action="{{ url_for(request.endpoint) }}" class="mb-3 d-flex align-items-center gap-2">
|
||||||
|
<input type="hidden" name="enwp" value="{{ enwp }}">
|
||||||
|
{% if cat %}<input type="hidden" name="cat" value="{{ cat }}">{% endif %}
|
||||||
|
<label for="flickr_search" class="form-label mb-0 text-nowrap">Flickr search:</label>
|
||||||
|
<input type="text" class="form-control form-control-sm" id="flickr_search" name="flickr_search" value="{{ flickr_search }}" style="max-width: 300px;">
|
||||||
|
<button type="submit" class="btn btn-sm btn-primary">Search</button>
|
||||||
|
</form>
|
||||||
<p>Select a photo to compose a message ({{ search_result.total_photos | default(0) }} results):</p>
|
<p>Select a photo to compose a message ({{ search_result.total_photos | default(0) }} results):</p>
|
||||||
|
|
||||||
<div class="row row-cols-2 row-cols-md-3 row-cols-lg-4 g-3 mb-3">
|
<div class="row row-cols-2 row-cols-md-3 row-cols-lg-4 g-3 mb-3">
|
||||||
{% for photo in search_result.photos %}
|
{% for photo in search_result.photos %}
|
||||||
<div class="col">
|
<div class="col">
|
||||||
<div class="card h-100">
|
<div class="card h-100">
|
||||||
<a href="{{ url_for(request.endpoint, enwp=enwp, flickr=photo.flickr_url, img=photo.medium_url, license=photo.license, flickr_user=photo.realname or photo.username, cat=cat) }}">
|
<a href="{{ url_for(request.endpoint, enwp=enwp, flickr=photo.flickr_url, img=photo.medium_url, license=photo.license, flickr_user=photo.realname or photo.username, cat=cat, flickr_search=flickr_search) }}">
|
||||||
<img src="{{ photo.thumb_url }}" alt="{{ photo.title }}" class="card-img-top" style="aspect-ratio: 1; object-fit: cover;">
|
<img src="{{ photo.thumb_url }}" alt="{{ photo.title }}" class="card-img-top" style="aspect-ratio: 1; object-fit: cover;">
|
||||||
</a>
|
</a>
|
||||||
<div class="card-body p-2">
|
<div class="card-body p-2">
|
||||||
|
|
@ -86,7 +93,7 @@
|
||||||
<ul class="pagination justify-content-center">
|
<ul class="pagination justify-content-center">
|
||||||
{% if search_result.current_page > 1 %}
|
{% if search_result.current_page > 1 %}
|
||||||
<li class="page-item">
|
<li class="page-item">
|
||||||
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, page=search_result.current_page - 1, cat=cat) }}">Previous</a>
|
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, flickr_search=flickr_search, page=search_result.current_page - 1, cat=cat) }}">Previous</a>
|
||||||
</li>
|
</li>
|
||||||
{% else %}
|
{% else %}
|
||||||
<li class="page-item disabled">
|
<li class="page-item disabled">
|
||||||
|
|
@ -99,7 +106,7 @@
|
||||||
|
|
||||||
{% if start_page > 1 %}
|
{% if start_page > 1 %}
|
||||||
<li class="page-item">
|
<li class="page-item">
|
||||||
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, page=1, cat=cat) }}">1</a>
|
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, flickr_search=flickr_search, page=1, cat=cat) }}">1</a>
|
||||||
</li>
|
</li>
|
||||||
{% if start_page > 2 %}
|
{% if start_page > 2 %}
|
||||||
<li class="page-item disabled"><span class="page-link">...</span></li>
|
<li class="page-item disabled"><span class="page-link">...</span></li>
|
||||||
|
|
@ -108,7 +115,7 @@
|
||||||
|
|
||||||
{% for p in range(start_page, end_page + 1) %}
|
{% for p in range(start_page, end_page + 1) %}
|
||||||
<li class="page-item {{ 'active' if p == search_result.current_page else '' }}">
|
<li class="page-item {{ 'active' if p == search_result.current_page else '' }}">
|
||||||
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, page=p, cat=cat) }}">{{ p }}</a>
|
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, flickr_search=flickr_search, page=p, cat=cat) }}">{{ p }}</a>
|
||||||
</li>
|
</li>
|
||||||
{% endfor %}
|
{% endfor %}
|
||||||
|
|
||||||
|
|
@ -117,13 +124,13 @@
|
||||||
<li class="page-item disabled"><span class="page-link">...</span></li>
|
<li class="page-item disabled"><span class="page-link">...</span></li>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
<li class="page-item">
|
<li class="page-item">
|
||||||
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, page=search_result.total_pages, cat=cat) }}">{{ search_result.total_pages }}</a>
|
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, flickr_search=flickr_search, page=search_result.total_pages, cat=cat) }}">{{ search_result.total_pages }}</a>
|
||||||
</li>
|
</li>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
|
|
||||||
{% if search_result.current_page < search_result.total_pages %}
|
{% if search_result.current_page < search_result.total_pages %}
|
||||||
<li class="page-item">
|
<li class="page-item">
|
||||||
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, page=search_result.current_page + 1, cat=cat) }}">Next</a>
|
<a class="page-link" href="{{ url_for(request.endpoint, enwp=enwp, flickr_search=flickr_search, page=search_result.current_page + 1, cat=cat) }}">Next</a>
|
||||||
</li>
|
</li>
|
||||||
{% else %}
|
{% else %}
|
||||||
<li class="page-item disabled">
|
<li class="page-item disabled">
|
||||||
|
|
@ -135,7 +142,7 @@
|
||||||
{% endif %}
|
{% endif %}
|
||||||
|
|
||||||
<p class="text-muted small">
|
<p class="text-muted small">
|
||||||
<a href="https://flickr.com/search/?view_all=1&text={{ '"' + name + '"' | urlencode }}" target="_blank">View full search on Flickr</a>
|
<a href="https://flickr.com/search/?view_all=1&text={{ flickr_search | urlencode }}" target="_blank">View full search on Flickr</a>
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
{% elif name and not flickr_url %}
|
{% elif name and not flickr_url %}
|
||||||
|
|
@ -144,8 +151,15 @@
|
||||||
<p><a href="{{ url_for('category_search', cat=cat) }}">← Back to category</a></p>
|
<p><a href="{{ url_for('category_search', cat=cat) }}">← Back to category</a></p>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
<p>Wikipedia article: {{ name }}</p>
|
<p>Wikipedia article: {{ name }}</p>
|
||||||
|
<form action="{{ url_for(request.endpoint) }}" class="mb-3 d-flex align-items-center gap-2">
|
||||||
|
<input type="hidden" name="enwp" value="{{ enwp }}">
|
||||||
|
{% if cat %}<input type="hidden" name="cat" value="{{ cat }}">{% endif %}
|
||||||
|
<label for="flickr_search_empty" class="form-label mb-0 text-nowrap">Flickr search:</label>
|
||||||
|
<input type="text" class="form-control form-control-sm" id="flickr_search_empty" name="flickr_search" value="{{ flickr_search }}" style="max-width: 300px;">
|
||||||
|
<button type="submit" class="btn btn-sm btn-primary">Search</button>
|
||||||
|
</form>
|
||||||
<div class="alert alert-warning">No photos found. Try a different search term.</div>
|
<div class="alert alert-warning">No photos found. Try a different search term.</div>
|
||||||
<p><a href="https://flickr.com/search/?view_all=1&text={{ '"' + name + '"' | urlencode }}" target="_blank">Search on Flickr directly</a></p>
|
<p><a href="https://flickr.com/search/?view_all=1&text={{ flickr_search | urlencode }}" target="_blank">Search on Flickr directly</a></p>
|
||||||
|
|
||||||
{% endif %}
|
{% endif %}
|
||||||
|
|
||||||
|
|
@ -201,7 +215,7 @@
|
||||||
</div>
|
</div>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
<p class="mt-3">
|
<p class="mt-3">
|
||||||
<a href="{{ url_for('start', enwp=enwp, cat=cat) if cat else url_for('start', enwp=enwp) }}">← Back to search results</a>
|
<a href="{{ url_for('start', enwp=enwp, cat=cat, flickr_search=flickr_search) }}">← Back to search results</a>
|
||||||
</p>
|
</p>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
|
||||||
|
|
@ -2,8 +2,12 @@
|
||||||
"""
|
"""
|
||||||
Find UploadWizard contributions that are from Flickr and add them to the database.
|
Find UploadWizard contributions that are from Flickr and add them to the database.
|
||||||
|
|
||||||
For contributions with comment 'User created page with UploadWizard', queries the
|
Supports both UploadWizard comment styles:
|
||||||
Commons API to check if the image source is Flickr (by checking the Credit field).
|
- "User created page with UploadWizard" (older)
|
||||||
|
- "Uploaded a work by ... with UploadWizard" (newer, often includes Flickr URL)
|
||||||
|
|
||||||
|
If a Flickr URL is not present in the contribution comment, queries Commons API
|
||||||
|
to check if the image source is Flickr (by checking the Credit field).
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import json
|
import json
|
||||||
|
|
@ -27,6 +31,13 @@ def extract_flickr_url_from_credit(credit: str) -> str | None:
|
||||||
return match.group(0) if match else None
|
return match.group(0) if match else None
|
||||||
|
|
||||||
|
|
||||||
|
def extract_flickr_url_from_comment(comment: str) -> str | None:
|
||||||
|
"""Extract Flickr URL directly from a contribution comment."""
|
||||||
|
pattern = r'https?://(?:www\.)?flickr\.com/photos/[^/\s]+/\d+'
|
||||||
|
match = re.search(pattern, comment or "")
|
||||||
|
return match.group(0) if match else None
|
||||||
|
|
||||||
|
|
||||||
def get_image_metadata(titles: list[str]) -> dict[str, dict]:
|
def get_image_metadata(titles: list[str]) -> dict[str, dict]:
|
||||||
"""Fetch image metadata from Commons API for multiple titles."""
|
"""Fetch image metadata from Commons API for multiple titles."""
|
||||||
if not titles:
|
if not titles:
|
||||||
|
|
@ -97,10 +108,12 @@ def main():
|
||||||
)
|
)
|
||||||
url_to_message = {msg.normalized_flickr_url: msg for msg in sent_messages}
|
url_to_message = {msg.normalized_flickr_url: msg for msg in sent_messages}
|
||||||
|
|
||||||
# Find UploadWizard contributions (page creations only)
|
# Find UploadWizard file uploads.
|
||||||
|
# Old format: "User created page with UploadWizard"
|
||||||
|
# New format: "Uploaded a work by ... with UploadWizard"
|
||||||
upload_wizard = (
|
upload_wizard = (
|
||||||
session.query(Contribution)
|
session.query(Contribution)
|
||||||
.filter(Contribution.comment == "User created page with UploadWizard")
|
.filter(Contribution.comment.contains("UploadWizard"))
|
||||||
.filter(Contribution.title.startswith("File:"))
|
.filter(Contribution.title.startswith("File:"))
|
||||||
.all()
|
.all()
|
||||||
)
|
)
|
||||||
|
|
@ -127,7 +140,10 @@ def main():
|
||||||
credit = meta.get("credit", "")
|
credit = meta.get("credit", "")
|
||||||
artist = meta.get("artist", "")
|
artist = meta.get("artist", "")
|
||||||
|
|
||||||
flickr_url = extract_flickr_url_from_credit(credit)
|
# Prefer URL directly in comment; fall back to extmetadata Credit.
|
||||||
|
flickr_url = extract_flickr_url_from_comment(c.comment or "")
|
||||||
|
if not flickr_url:
|
||||||
|
flickr_url = extract_flickr_url_from_credit(credit)
|
||||||
if not flickr_url:
|
if not flickr_url:
|
||||||
continue
|
continue
|
||||||
|
|
||||||
|
|
|
||||||