docs: add Apec ingestion plan

Antoine 2026-06-01 12:33:16 +02:00
parent ad36de0a3f
commit cfbd1943ec

@@ -0,0 +1,612 @@
# Apec Ingestion Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build one `fetch-apec` command that reads `data/candidate-profile.yaml`, derives deterministic Apec searches, fetches up to 50 public listings, stores raw HTML snapshots, and writes a normalized `listings.yaml` file plus run metadata.

**Architecture:** The implementation is a fetch-and-normalize pipeline that writes explicit per-run artifacts. Profile-driven query derivation feeds an Apec adapter; successful detail pages are persisted as raw snapshots; a small normalizer and a within-run deduper then write inspectable YAML outputs for later ranking work.

**Tech Stack:** Python 3.13, Typer, Playwright for Python, BeautifulSoup4, Pydantic v2, PyYAML, pytest

---

## File Map
- Modify: `pyproject.toml` — add Playwright and BeautifulSoup4 dependencies if missing
- Modify: `src/job_research/cli.py` — add `fetch-apec` command
- Modify: `src/job_research/models.py` — add normalized listing and run metadata models
- Modify: `src/job_research/storage.py` — add helpers for per-run artifact paths and YAML writes
- Create: `src/job_research/apec/__init__.py` — Apec package marker
- Create: `src/job_research/apec/query_derivation.py` — deterministic query derivation from candidate profile
- Create: `src/job_research/apec/adapter.py` — public Apec search and detail-page fetching
- Create: `src/job_research/apec/normalize.py` — normalize Apec detail pages into listing records
- Create: `src/job_research/apec/dedupe.py` — minimal within-run deduplication
- Create: `tests/apec/test_query_derivation.py` — profile-driven query tests
- Create: `tests/apec/test_normalize.py` — normalized listing extraction tests
- Create: `tests/apec/test_dedupe.py` — within-run dedupe tests
- Create: `tests/test_apec_cli.py` — CLI integration tests for `fetch-apec`
- Create: `tests/test_apec_storage.py` — run artifact persistence tests
## Task 0: Dependencies for Apec Ingestion
**Files:**
- Modify: `pyproject.toml`
- [ ] **Step 1: Write the failing import check**
Run: `uv run python -c "import playwright, bs4"`
Expected: FAIL with missing dependency errors
- [ ] **Step 2: Add the minimal dependencies for this slice**
```toml
# pyproject.toml
[project]
dependencies = [
    "beautifulsoup4>=4.12,<5",
    "playwright>=1.52,<2",
    "pydantic>=2.7,<3",
    "pypdf>=5.0,<6",
    "pyyaml>=6.0,<7",
    "typer>=0.12,<1",
]
```
- [ ] **Step 3: Sync and verify the imports work**
Run: `uv sync && uv run python -c "import playwright, bs4"`
Expected: PASS with no output
- [ ] **Step 4: Commit the dependency update**
```bash
git add pyproject.toml uv.lock
git commit -m "chore: add Apec ingestion dependencies"
```
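The import check in Steps 1 and 3 can also be scripted so that it reports every missing module at once instead of stopping at the first `ImportError`. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name and is not part of this plan's codebase.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be located, in input order."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_modules(["playwright", "bs4"])` should return an empty list once `uv sync` has run. Note that a successful `import playwright` only confirms the Python package is present; Playwright downloads its browser binaries separately via `playwright install`.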
## Task 1: Listing and Run Artifact Models
**Files:**
- Modify: `src/job_research/models.py`
- Create: `tests/test_apec_storage.py`
- [ ] **Step 1: Write the failing model serialization test**
```python
# tests/test_apec_storage.py
from job_research.models import ApecListing, ApecRunMeta, ListingWarning


def test_apec_models_serialize_expected_listing_shape() -> None:
    listing = ApecListing(
        source="apec",
        source_job_id="123",
        url="https://example.test/job/123",
        title="Data Engineer",
        company="Example",
        location="Paris",
        contract_type="CDI",
        description_text="Build pipelines",
        published_at="2026-06-01",
        fetched_at="2026-06-01T10:00:00Z",
        warnings=[ListingWarning(field="location", message="Location inferred from page text")],
    )
    run_meta = ApecRunMeta(
        derived_queries=["Data Engineer"],
        fetched_count=1,
        normalized_count=1,
        deduplicated_count=1,
        failed_count=0,
        listing_errors=[],
    )
    assert listing.model_dump()["source"] == "apec"
    assert run_meta.model_dump()["derived_queries"] == ["Data Engineer"]
```
- [ ] **Step 2: Run the model test to verify it fails**
Run: `uv run pytest tests/test_apec_storage.py::test_apec_models_serialize_expected_listing_shape -v`
Expected: FAIL with `ImportError` or `AttributeError` for missing Apec models
- [ ] **Step 3: Add normalized listing and run metadata models**
```python
# src/job_research/models.py
# Make sure these names are imported at the top of the module.
from pydantic import BaseModel, Field


class ListingWarning(BaseModel):
    """Non-fatal note about one field of a normalized listing."""

    field: str
    message: str


class ListingError(BaseModel):
    """Failure recorded for one listing URL during a run."""

    url: str
    stage: str
    message: str


class ApecListing(BaseModel):
    """Normalized listing; every extracted field is optional."""

    source: str
    source_job_id: str | None = None
    url: str
    title: str | None = None
    company: str | None = None
    location: str | None = None
    contract_type: str | None = None
    description_text: str | None = None
    published_at: str | None = None
    fetched_at: str
    warnings: list[ListingWarning] = Field(default_factory=list)


class ApecRunMeta(BaseModel):
    """Summary counts and per-listing errors for one run."""

    derived_queries: list[str] = Field(default_factory=list)
    fetched_count: int = 0
    normalized_count: int = 0
    deduplicated_count: int = 0
    failed_count: int = 0
    listing_errors: list[ListingError] = Field(default_factory=list)
```
- [ ] **Step 4: Run the model test to verify it passes**
Run: `uv run pytest tests/test_apec_storage.py::test_apec_models_serialize_expected_listing_shape -v`
Expected: PASS
- [ ] **Step 5: Commit the models**
```bash
git add src/job_research/models.py tests/test_apec_storage.py
git commit -m "feat: add Apec listing artifact models"
```
## Task 2: Run Artifact Storage Layout
**Files:**
- Modify: `src/job_research/storage.py`
- Modify: `tests/test_apec_storage.py`
- [ ] **Step 1: Write the failing run-path test**
```python
# tests/test_apec_storage.py
from pathlib import Path

from job_research.storage import apec_run_paths


def test_apec_run_paths_builds_expected_layout(tmp_path: Path) -> None:
    paths = apec_run_paths(tmp_path, run_id="2026-06-01T10-00-00Z")
    run_dir = tmp_path / "apec" / "runs" / "2026-06-01T10-00-00Z"
    assert paths["run_dir"] == run_dir
    assert paths["listings"] == run_dir / "listings.yaml"
    assert paths["run_meta"] == run_dir / "run-meta.yaml"
    assert paths["snapshots"] == run_dir / "snapshots"
```
- [ ] **Step 2: Run the run-path test to verify it fails**
Run: `uv run pytest tests/test_apec_storage.py::test_apec_run_paths_builds_expected_layout -v`
Expected: FAIL because `apec_run_paths` does not exist yet
- [ ] **Step 3: Implement run-path helpers and artifact writes**
```python
# src/job_research/storage.py
from pathlib import Path  # skip if the module already imports Path


def apec_run_paths(data_root: Path, run_id: str) -> dict[str, Path]:
    """Return the artifact paths for one run; nothing is created on disk."""
    run_dir = data_root / "apec" / "runs" / run_id
    return {
        "run_dir": run_dir,
        "listings": run_dir / "listings.yaml",
        "run_meta": run_dir / "run-meta.yaml",
        "snapshots": run_dir / "snapshots",
    }
```
- [ ] **Step 4: Run the run-path test to verify it passes**
Run: `uv run pytest tests/test_apec_storage.py::test_apec_run_paths_builds_expected_layout -v`
Expected: PASS
- [ ] **Step 5: Commit the storage layout helper**
```bash
git add src/job_research/storage.py tests/test_apec_storage.py
git commit -m "feat: add Apec run artifact paths"
```
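`apec_run_paths` only computes paths, so generating the run id and creating the directories is left to the caller. The sketch below shows one way those pieces could fit together, using the run id format from the test above. `make_run_id` and `prepare_run_dirs` are hypothetical names introduced for illustration, and the layout helper is repeated so the block is self-contained.

```python
from datetime import datetime, timezone
from pathlib import Path


def make_run_id(now: datetime) -> str:
    """Format an aware datetime as a filesystem-safe UTC run id (no colons)."""
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")


def apec_run_paths(data_root: Path, run_id: str) -> dict[str, Path]:
    # Same layout as the plan's helper, repeated here for self-containment.
    run_dir = data_root / "apec" / "runs" / run_id
    return {
        "run_dir": run_dir,
        "listings": run_dir / "listings.yaml",
        "run_meta": run_dir / "run-meta.yaml",
        "snapshots": run_dir / "snapshots",
    }


def prepare_run_dirs(data_root: Path, run_id: str) -> dict[str, Path]:
    """Create the run and snapshot directories, refusing to reuse an existing run."""
    paths = apec_run_paths(data_root, run_id)
    # exist_ok=False raises FileExistsError if this run id was already used.
    paths["snapshots"].mkdir(parents=True, exist_ok=False)
    return paths
```

Using `exist_ok=False` is a design choice in this sketch: two runs started within the same second would collide on the run id, and failing loudly may be preferable to mixing their artifacts. The plan's own command uses `exist_ok=True`.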
## Task 3: Deterministic Query Derivation
**Files:**
- Create: `src/job_research/apec/__init__.py`
- Create: `src/job_research/apec/query_derivation.py`
- Create: `tests/apec/test_query_derivation.py`
- [ ] **Step 1: Write the failing query derivation test**
```python
# tests/apec/test_query_derivation.py
from job_research.apec.query_derivation import derive_apec_queries
from job_research.models import CandidateProfileOutput


def test_derive_apec_queries_from_candidate_profile() -> None:
    profile = CandidateProfileOutput(
        target_roles=["Data Engineer", "Analytics Engineer"],
        strengths=["Python", "SQL"],
        skills_to_emphasize=["BigQuery", "GCP"],
        constraints=["CDI only", "France only"],
    )
    queries = derive_apec_queries(profile)
    assert "Data Engineer" in queries
    assert "Analytics Engineer" in queries
    assert len(queries) <= 5
```
- [ ] **Step 2: Run the query test to verify it fails**
Run: `uv run pytest tests/apec/test_query_derivation.py::test_derive_apec_queries_from_candidate_profile -v`
Expected: FAIL with missing module or function
- [ ] **Step 3: Implement deterministic query derivation**
```python
# src/job_research/apec/query_derivation.py
from job_research.models import CandidateProfileOutput


def derive_apec_queries(profile: CandidateProfileOutput) -> list[str]:
    """Return up to five unique target roles, in profile order."""
    queries: list[str] = []
    for title in profile.target_roles:
        # Preserve first-seen order so the same profile always yields the same queries.
        if title not in queries:
            queries.append(title)
    return queries[:5]
```
- [ ] **Step 4: Run the query test to verify it passes**
Run: `uv run pytest tests/apec/test_query_derivation.py::test_derive_apec_queries_from_candidate_profile -v`
Expected: PASS
- [ ] **Step 5: Commit query derivation**
```bash
git add src/job_research/apec/__init__.py src/job_research/apec/query_derivation.py tests/apec/test_query_derivation.py
git commit -m "feat: add deterministic Apec query derivation"
```
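The derivation above treats `"Data Engineer"` and `" data engineer "` as different queries. If profile files turn out to contain such near-duplicates, a slightly stricter variant could normalize whitespace and compare case-insensitively. This is a sketch under that assumption, not the plan's implementation; `derive_queries` is a hypothetical name and takes a plain list of role titles rather than a `CandidateProfileOutput` so that it runs standalone.

```python
def derive_queries(target_roles: list[str], limit: int = 5) -> list[str]:
    """Trim, drop blanks, dedupe case-insensitively, keep first-seen order, cap at limit."""
    seen: set[str] = set()
    queries: list[str] = []
    for role in target_roles:
        cleaned = " ".join(role.split())  # collapse runs of whitespace
        key = cleaned.casefold()  # case-insensitive comparison key
        if not cleaned or key in seen:
            continue
        seen.add(key)
        queries.append(cleaned)
    return queries[:limit]
```

The output stays deterministic because it depends only on the order of `target_roles`; no sorting or set iteration is involved.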
## Task 4: Listing Normalization and Within-Run Deduplication
**Files:**
- Create: `src/job_research/apec/normalize.py`
- Create: `src/job_research/apec/dedupe.py`
- Create: `tests/apec/test_normalize.py`
- Create: `tests/apec/test_dedupe.py`
- [ ] **Step 1: Write the failing normalization and dedupe tests**
```python
# tests/apec/test_normalize.py
from job_research.apec.normalize import normalize_apec_listing


def test_normalize_apec_listing_extracts_minimal_shape() -> None:
    html = """
    <html>
      <body>
        <h1>Data Engineer</h1>
        <div class="company">Example Corp</div>
        <div class="location">Paris</div>
        <div class="contract">CDI</div>
        <div class="description">Build pipelines</div>
      </body>
    </html>
    """
    listing = normalize_apec_listing(
        url="https://example.test/job/123",
        html=html,
        fetched_at="2026-06-01T10:00:00Z",
    )
    assert listing.title == "Data Engineer"
    assert listing.company == "Example Corp"
    assert listing.contract_type == "CDI"
```
```python
# tests/apec/test_dedupe.py
from job_research.apec.dedupe import dedupe_apec_listings
from job_research.models import ApecListing


def test_dedupe_apec_listings_by_url() -> None:
    listings = [
        ApecListing(source="apec", url="https://example.test/job/1", fetched_at="2026-06-01T10:00:00Z"),
        ApecListing(source="apec", url="https://example.test/job/1", fetched_at="2026-06-01T10:01:00Z"),
    ]
    deduped = dedupe_apec_listings(listings)
    assert len(deduped) == 1
```
- [ ] **Step 2: Run the normalization and dedupe tests to verify they fail**
Run: `uv run pytest tests/apec/test_normalize.py tests/apec/test_dedupe.py -v`
Expected: FAIL with missing modules/functions
- [ ] **Step 3: Implement minimal normalization and dedupe**
```python
# src/job_research/apec/normalize.py
from bs4 import BeautifulSoup

from job_research.models import ApecListing


def normalize_apec_listing(url: str, html: str, fetched_at: str) -> ApecListing:
    """Extract the minimal listing fields; any element that is missing becomes None."""
    soup = BeautifulSoup(html, "html.parser")
    # These selectors match the test fixture; adjust them to the real Apec markup.
    title = soup.find("h1")
    company = soup.select_one(".company")
    location = soup.select_one(".location")
    contract = soup.select_one(".contract")
    description = soup.select_one(".description")
    return ApecListing(
        source="apec",
        url=url,
        title=title.get_text(strip=True) if title else None,
        company=company.get_text(strip=True) if company else None,
        location=location.get_text(strip=True) if location else None,
        contract_type=contract.get_text(strip=True) if contract else None,
        description_text=description.get_text(" ", strip=True) if description else None,
        fetched_at=fetched_at,
    )
```
```python
# src/job_research/apec/dedupe.py
from job_research.models import ApecListing


def dedupe_apec_listings(listings: list[ApecListing]) -> list[ApecListing]:
    """Keep the first listing seen for each URL, preserving input order."""
    seen: set[str] = set()
    deduped: list[ApecListing] = []
    for listing in listings:
        if listing.url in seen:
            continue
        seen.add(listing.url)
        deduped.append(listing)
    return deduped
```
- [ ] **Step 4: Run the normalization and dedupe tests to verify they pass**
Run: `uv run pytest tests/apec/test_normalize.py tests/apec/test_dedupe.py -v`
Expected: PASS
- [ ] **Step 5: Commit normalization and dedupe**
```bash
git add src/job_research/apec/normalize.py src/job_research/apec/dedupe.py tests/apec/test_normalize.py tests/apec/test_dedupe.py
git commit -m "feat: add Apec normalization and dedupe"
```
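Exact URL matching misses duplicates that differ only by a trailing slash, a fragment, or host casing, and it ignores `source_job_id` even when one is known. The sketch below illustrates a slightly more tolerant key. It is an illustration under assumptions: `Listing` is a stand-in dataclass with only the two fields the key needs, `dedupe_key` and `dedupe` are hypothetical names, and the query string is kept intact because it is not known here whether Apec URLs carry the listing identity in it.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass(frozen=True)
class Listing:
    """Stand-in for ApecListing, reduced to the fields used for deduplication."""

    url: str
    source_job_id: str | None = None


def dedupe_key(listing: Listing) -> str:
    """Prefer the job id; otherwise use a lightly canonicalized URL."""
    if listing.source_job_id:
        return f"id:{listing.source_job_id}"
    parts = urlsplit(listing.url)
    # Lowercase the host, strip a trailing slash, drop the fragment, keep the query.
    canonical = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path.rstrip("/"), parts.query, "")
    )
    return f"url:{canonical}"


def dedupe(listings: list[Listing]) -> list[Listing]:
    """Keep the first listing per key, preserving input order."""
    seen: set[str] = set()
    kept: list[Listing] = []
    for listing in listings:
        key = dedupe_key(listing)
        if key in seen:
            continue
        seen.add(key)
        kept.append(listing)
    return kept
```

Keeping the first occurrence matches the behavior of `dedupe_apec_listings`, so swapping in a different key would not change which record survives for exact-URL duplicates.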
## Task 5: Public Apec Adapter and Snapshot Persistence
**Files:**
- Create: `src/job_research/apec/adapter.py`
- Modify: `tests/test_apec_storage.py`
- [ ] **Step 1: Write the snapshot persistence test**
```python
# tests/test_apec_storage.py
from pathlib import Path

from job_research.storage import apec_run_paths


def test_apec_run_artifacts_include_snapshot_and_meta(tmp_path: Path) -> None:
    paths = apec_run_paths(tmp_path, run_id="2026-06-01T10-00-00Z")
    paths["snapshots"].mkdir(parents=True, exist_ok=True)
    snapshot = paths["snapshots"] / "job-123.html"
    snapshot.write_text("<html>snapshot</html>", encoding="utf-8")
    assert snapshot.read_text(encoding="utf-8") == "<html>snapshot</html>"
    # Snapshots and run metadata live under the same run directory.
    assert paths["run_meta"].parent == paths["snapshots"].parent
```
- [ ] **Step 2: Run the snapshot test to confirm the layout supports snapshots**
Run: `uv run pytest tests/test_apec_storage.py::test_apec_run_artifacts_include_snapshot_and_meta -v`
Expected: PASS, because `apec_run_paths` already exists from Task 2; a failure here means path handling needs adjustment
- [ ] **Step 3: Implement the Apec adapter skeleton and snapshot write helpers**
```python
# src/job_research/apec/adapter.py
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ApecSearchResult:
    url: str
    source_job_id: str | None = None


class ApecAdapter:
    """Scaffold only: both methods are stubs until real fetching is added."""

    def __init__(self, max_listings: int = 50) -> None:
        self.max_listings = max_listings

    def search(self, queries: list[str]) -> list[ApecSearchResult]:
        # Stub: returns no results yet.
        return []

    def fetch_listing_html(self, url: str) -> str:
        # Stub: returns an empty page yet.
        return ""
```
- [ ] **Step 4: Run the snapshot test and any adapter-adjacent tests**
Run: `uv run pytest tests/test_apec_storage.py -v`
Expected: PASS
- [ ] **Step 5: Commit the adapter scaffold**
```bash
git add src/job_research/apec/adapter.py tests/test_apec_storage.py
git commit -m "feat: add Apec adapter scaffold"
```
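Snapshot files need names that are safe on disk and unique per listing, including for search results that have no `source_job_id`. The following is one possible helper, sketched under assumptions: `snapshot_filename` is a hypothetical name, the `job-` prefix follows the `job-123.html` example in the test above, and the URL-hash fallback is a choice made for this illustration.

```python
from __future__ import annotations

import hashlib
import re


def snapshot_filename(url: str, source_job_id: str | None = None) -> str:
    """Build a filesystem-safe snapshot name that is stable for a given listing."""
    if source_job_id:
        # Replace anything outside a conservative character set, then trim
        # leading/trailing dots and dashes so ids like "../" cannot escape.
        safe_id = re.sub(r"[^A-Za-z0-9._-]", "-", source_job_id).strip(".-")
        if safe_id:
            return f"job-{safe_id}.html"
    # No usable id: fall back to a short, stable digest of the URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"url-{digest}.html"
```

Because the fallback is derived from the URL rather than from a counter, re-fetching the same listing in a later run would produce the same file name, which may make snapshots easier to compare across runs.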
## Task 6: fetch-apec Command Orchestration
**Files:**
- Modify: `src/job_research/cli.py`
- Create: `tests/test_apec_cli.py`
- [ ] **Step 1: Write the failing CLI orchestration tests**
```python
# tests/test_apec_cli.py
from pathlib import Path

from typer.testing import CliRunner

from job_research.cli import app


def test_fetch_apec_reads_profile_and_writes_run_artifacts(monkeypatch, tmp_path: Path) -> None:
    data_dir = tmp_path / "data"
    data_dir.mkdir()
    (data_dir / "candidate-profile.yaml").write_text(
        "target_roles:\n - Data Engineer\nstrengths:\n - Python\nskills_to_emphasize:\n - BigQuery\nconstraints:\n - CDI only\n",
        encoding="utf-8",
    )
    result = CliRunner().invoke(app, ["fetch-apec", "--data-root", str(data_dir)])
    assert result.exit_code == 0
    assert "normalized listing count" in result.stdout.lower()
    # The run directory name is time-based, so match it with a glob.
    assert list((data_dir / "apec" / "runs").glob("*/run-meta.yaml"))
```
- [ ] **Step 2: Run the CLI orchestration test to verify it fails**
Run: `uv run pytest tests/test_apec_cli.py::test_fetch_apec_reads_profile_and_writes_run_artifacts -v`
Expected: FAIL because `fetch-apec` does not exist yet
- [ ] **Step 3: Implement fetch-apec command orchestration**
```python
# src/job_research/cli.py
from datetime import UTC, datetime
from pathlib import Path

import typer

from job_research.apec.adapter import ApecAdapter
from job_research.apec.dedupe import dedupe_apec_listings
from job_research.apec.normalize import normalize_apec_listing
from job_research.apec.query_derivation import derive_apec_queries
from job_research.models import ApecListing, ApecRunMeta, CandidateProfileOutput, ListingError
from job_research.storage import apec_run_paths, load_yaml, save_yaml

MAX_APEC_LISTINGS = 50


# `app` is the Typer application already defined in this module.
@app.command("fetch-apec")
def fetch_apec(
    data_root: Path = typer.Option(Path("data")),
) -> None:
    profile_payload = load_yaml(data_root / "candidate-profile.yaml")
    profile = CandidateProfileOutput.model_validate(profile_payload)
    queries = derive_apec_queries(profile)
    if not queries:
        raise typer.BadParameter("No usable Apec queries could be derived from candidate-profile.yaml")

    now = datetime.now(UTC)
    run_id = now.strftime("%Y-%m-%dT%H-%M-%SZ")  # filesystem-safe: no colons
    fetched_at = now.strftime("%Y-%m-%dT%H:%M:%SZ")  # ISO 8601 value stored on listings
    paths = apec_run_paths(data_root, run_id)
    paths["snapshots"].mkdir(parents=True, exist_ok=True)

    adapter = ApecAdapter(max_listings=MAX_APEC_LISTINGS)
    search_results = adapter.search(queries)[:MAX_APEC_LISTINGS]

    listings: list[ApecListing] = []
    errors: list[ListingError] = []
    for index, result in enumerate(search_results):
        try:
            html = adapter.fetch_listing_html(result.url)
            # Fall back to a per-result name so listings without an id do not overwrite each other.
            snapshot_name = (result.source_job_id or f"listing-{index}").replace("/", "-")
            (paths["snapshots"] / f"{snapshot_name}.html").write_text(html, encoding="utf-8")
            listing = normalize_apec_listing(url=result.url, html=html, fetched_at=fetched_at)
            # Carry the id from the search result onto the normalized record.
            listings.append(listing.model_copy(update={"source_job_id": result.source_job_id}))
        except Exception as exc:
            # One bad listing is recorded and skipped; the run continues.
            errors.append(ListingError(url=result.url, stage="fetch_or_normalize", message=str(exc)))

    deduped = dedupe_apec_listings(listings)
    run_meta = ApecRunMeta(
        derived_queries=queries,
        fetched_count=len(search_results),
        normalized_count=len(listings),
        deduplicated_count=len(deduped),
        failed_count=len(errors),
        listing_errors=errors,
    )
    save_yaml(paths["listings"], {"listings": [listing.model_dump(mode="json") for listing in deduped]})
    save_yaml(paths["run_meta"], run_meta.model_dump(mode="json"))

    typer.echo(f"Query count: {len(queries)}")
    typer.echo(f"Fetched listing count: {run_meta.fetched_count}")
    typer.echo(f"Normalized listing count: {run_meta.normalized_count}")
    typer.echo(f"Deduplicated count: {run_meta.deduplicated_count}")
    typer.echo(f"Failed listing count: {run_meta.failed_count}")
```
Implementation requirements:
- load `data/candidate-profile.yaml`
- validate into `CandidateProfileOutput`
- derive queries
- create a run id and run paths
- invoke adapter search/fetch flow
- persist snapshots, listings.yaml, run-meta.yaml
- print summary counts
- [ ] **Step 4: Run the CLI orchestration test to verify it passes**
Run: `uv run pytest tests/test_apec_cli.py::test_fetch_apec_reads_profile_and_writes_run_artifacts -v`
Expected: PASS
- [ ] **Step 5: Commit the fetch-apec command**
```bash
git add src/job_research/cli.py tests/test_apec_cli.py
git commit -m "feat: add fetch-apec command"
```
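The command records every per-listing failure under a single `fetch_or_normalize` stage. If it later becomes useful to tell fetch failures from normalization failures, the loop could be factored out and tested without Typer or a browser. The sketch below shows that idea with injected callables. `RunOutcome` and `process_listings` are hypothetical names, and plain dicts stand in for `ApecListing` and `ListingError`.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass, field


@dataclass
class RunOutcome:
    """Successful records and per-listing errors collected during one run."""

    normalized: list[dict] = field(default_factory=list)
    errors: list[dict] = field(default_factory=list)


def process_listings(
    urls: list[str],
    fetch: Callable[[str], str],
    normalize: Callable[[str, str], dict],
    max_listings: int = 50,
) -> RunOutcome:
    """Fetch and normalize each URL; record failures by stage and keep going."""
    outcome = RunOutcome()
    for url in urls[:max_listings]:
        try:
            html = fetch(url)
        except Exception as exc:
            outcome.errors.append({"url": url, "stage": "fetch", "message": str(exc)})
            continue
        try:
            outcome.normalized.append(normalize(url, html))
        except Exception as exc:
            outcome.errors.append({"url": url, "stage": "normalize", "message": str(exc)})
    return outcome
```

With this shape, a test can pass fake `fetch` and `normalize` functions that raise for chosen URLs and then assert on both lists, which exercises the partial-success path that the stub adapter never reaches.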
## Task 7: Full Regression and Manual Smoke Test
**Files:**
- Modify: none
- [ ] **Step 1: Run the full test suite**
Run: `uv run pytest tests -v`
Expected: PASS with all Apec-slice and profile-slice tests green
- [ ] **Step 2: Run a manual fetch-apec smoke test that stays off the network**
Run: `uv run job-research fetch-apec --help`
Expected: help output describes the `fetch-apec` command and lists its `--data-root` option
- [ ] **Step 3: Commit validated Apec ingestion slice**
```bash
git add pyproject.toml src/job_research tests
git commit -m "feat: complete Apec ingestion slice"
```
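Beyond the `--help` check, the counts written to `run-meta.yaml` can be sanity-checked after any run. In the command from Task 6, every capped search result ends up either normalized or recorded as an error, so the counts should satisfy a few simple relations. The sketch below checks them on an already-loaded mapping; `check_run_counts` is a hypothetical helper and the cap of 50 is taken from the plan's goal.

```python
def check_run_counts(meta: dict, max_listings: int = 50) -> list[str]:
    """Return a list of problems found in run-meta counts; empty means consistent."""
    fetched = int(meta["fetched_count"])
    normalized = int(meta["normalized_count"])
    deduplicated = int(meta["deduplicated_count"])
    failed = int(meta["failed_count"])

    problems: list[str] = []
    if fetched > max_listings:
        problems.append(f"fetched_count {fetched} exceeds cap {max_listings}")
    if normalized + failed != fetched:
        problems.append("normalized_count + failed_count != fetched_count")
    if deduplicated > normalized:
        problems.append("deduplicated_count exceeds normalized_count")
    return problems
```

An empty return value does not prove the listings are correct; it only confirms the summary is internally consistent.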
## Spec Coverage Check
- Explicit `fetch-apec` command: covered by Task 6
- Read `data/candidate-profile.yaml`: covered by Task 6
- Deterministic query derivation: covered by Task 3
- 50-listing cap and adapter behavior: covered by Task 5 and Task 6
- Raw HTML snapshot persistence: covered by Task 2, Task 5, and Task 6
- Normalized YAML listing output: covered by Task 1, Task 4, and Task 6
- Minimal within-run deduplication: covered by Task 4
- Partial-success metadata and run summary: covered by Task 1, Task 2, and Task 6