docs: add Apec ingestion plan

2026-06-01 12:33:16 +02:00 · 2026-06-01 12:33:16 +02:00 · cfbd1943ec
commit cfbd1943ec
parent ad36de0a3f
1 changed files with 612 additions and 0 deletions
--- a/docs/superpowers/plans/2026-06-01-apec-ingestion.md
+++ b/docs/superpowers/plans/2026-06-01-apec-ingestion.md
@@ -0,0 +1,612 @@
+# Apec Ingestion Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Build one `fetch-apec` command that reads `data/candidate-profile.yaml`, derives deterministic Apec searches, fetches up to 50 public listings, stores raw HTML snapshots, and writes a normalized `listings.yaml` file plus run metadata.
+
+**Architecture:** The implementation is a fetch-and-normalize pipeline that writes explicit artifacts for each run. Profile-driven query derivation feeds an Apec adapter, each successfully fetched detail page is persisted as a raw HTML snapshot, and a small normalizer plus within-run deduper writes inspectable YAML outputs for later ranking work.
+
+**Tech Stack:** Python 3.13, Typer, Playwright for Python, BeautifulSoup4, Pydantic v2, PyYAML, pytest
+
+---
+
+## File Map
+
+- Modify: `pyproject.toml` — add Playwright dependency if missing
+- Modify: `src/job_research/cli.py` — add `fetch-apec` command
+- Modify: `src/job_research/models.py` — add normalized listing and run metadata models
+- Modify: `src/job_research/storage.py` — add helpers for per-run artifact paths and YAML writes
+- Create: `src/job_research/apec/__init__.py` — Apec package marker
+- Create: `src/job_research/apec/query_derivation.py` — deterministic query derivation from candidate profile
+- Create: `src/job_research/apec/adapter.py` — public Apec search and detail-page fetching
+- Create: `src/job_research/apec/normalize.py` — normalize Apec detail pages into listing records
+- Create: `src/job_research/apec/dedupe.py` — minimal within-run deduplication
+- Create: `tests/apec/test_query_derivation.py` — profile-driven query tests
+- Create: `tests/apec/test_normalize.py` — normalized listing extraction tests
+- Create: `tests/apec/test_dedupe.py` — within-run dedupe tests
+- Create: `tests/test_apec_cli.py` — CLI integration tests for `fetch-apec`
+- Create: `tests/test_apec_storage.py` — run artifact persistence tests
+
+## Task 0: Dependencies for Apec Ingestion
+
+**Files:**
+- Modify: `pyproject.toml`
+
+- [ ] **Step 1: Write the failing import check**
+
+Run: `uv run python -c "import playwright, bs4"`
+Expected: FAIL with `ModuleNotFoundError` if either package is not installed yet
+
+- [ ] **Step 2: Add the minimal dependencies for this slice**
+
+```toml
+# pyproject.toml
+[project]
+dependencies = [
+  "beautifulsoup4>=4.12,<5",
+  "playwright>=1.52,<2",
+  "pydantic>=2.7,<3",
+  "pypdf>=5.0,<6",
+  "pyyaml>=6.0,<7",
+  "typer>=0.12,<1",
+]
+```
+
+- [ ] **Step 3: Sync and verify the imports work**
+
+Run: `uv sync && uv run python -c "import playwright, bs4"`
+Expected: PASS with no output
+
+- [ ] **Step 4: Commit the dependency update**
+
+```bash
+git add pyproject.toml uv.lock
+git commit -m "chore: add Apec ingestion dependencies"
+```
+
+## Task 1: Listing and Run Artifact Models
+
+**Files:**
+- Modify: `src/job_research/models.py`
+- Create: `tests/test_apec_storage.py`
+
+- [ ] **Step 1: Write the failing model serialization test**
+
+```python
+# tests/test_apec_storage.py
+from job_research.models import ApecListing, ApecRunMeta, ListingWarning
+
+
+def test_apec_models_serialize_expected_listing_shape() -> None:
+    listing = ApecListing(
+        source="apec",
+        source_job_id="123",
+        url="https://example.test/job/123",
+        title="Data Engineer",
+        company="Example",
+        location="Paris",
+        contract_type="CDI",
+        description_text="Build pipelines",
+        published_at="2026-06-01",
+        fetched_at="2026-06-01T10:00:00Z",
+        warnings=[ListingWarning(field="location", message="Location inferred from page text")],
+    )
+    run_meta = ApecRunMeta(
+        derived_queries=["Data Engineer"],
+        fetched_count=1,
+        normalized_count=1,
+        deduplicated_count=1,
+        failed_count=0,
+        listing_errors=[],
+    )
+
+    assert listing.model_dump()["source"] == "apec"
+    assert run_meta.model_dump()["derived_queries"] == ["Data Engineer"]
+```
+
+- [ ] **Step 2: Run the model test to verify it fails**
+
+Run: `uv run pytest tests/test_apec_storage.py::test_apec_models_serialize_expected_listing_shape -v`
+Expected: FAIL with `ImportError` because the Apec models do not exist yet
+
+- [ ] **Step 3: Add normalized listing and run metadata models**
+
+```python
+# src/job_research/models.py (append; requires `from pydantic import BaseModel, Field`)
+class ListingWarning(BaseModel):
+    field: str
+    message: str
+
+
+class ListingError(BaseModel):
+    url: str
+    stage: str
+    message: str
+
+
+class ApecListing(BaseModel):
+    source: str
+    source_job_id: str | None = None
+    url: str
+    title: str | None = None
+    company: str | None = None
+    location: str | None = None
+    contract_type: str | None = None
+    description_text: str | None = None
+    published_at: str | None = None
+    fetched_at: str
+    warnings: list[ListingWarning] = Field(default_factory=list)
+
+
+class ApecRunMeta(BaseModel):
+    derived_queries: list[str] = Field(default_factory=list)
+    fetched_count: int = 0
+    normalized_count: int = 0
+    deduplicated_count: int = 0
+    failed_count: int = 0
+    listing_errors: list[ListingError] = Field(default_factory=list)
+```
+
+- [ ] **Step 4: Run the model test to verify it passes**
+
+Run: `uv run pytest tests/test_apec_storage.py::test_apec_models_serialize_expected_listing_shape -v`
+Expected: PASS
+
+- [ ] **Step 5: Commit the models**
+
+```bash
+git add src/job_research/models.py tests/test_apec_storage.py
+git commit -m "feat: add Apec listing artifact models"
+```
+
+## Task 2: Run Artifact Storage Layout
+
+**Files:**
+- Modify: `src/job_research/storage.py`
+- Modify: `tests/test_apec_storage.py`
+
+- [ ] **Step 1: Write the failing run-path test**
+
+```python
+# tests/test_apec_storage.py
+from pathlib import Path
+
+from job_research.storage import apec_run_paths
+
+
+def test_apec_run_paths_builds_expected_layout(tmp_path: Path) -> None:
+    paths = apec_run_paths(tmp_path, run_id="2026-06-01T10-00-00Z")
+
+    assert paths["run_dir"] == tmp_path / "apec" / "runs" / "2026-06-01T10-00-00Z"
+    assert paths["listings"] == tmp_path / "apec" / "runs" / "2026-06-01T10-00-00Z" / "listings.yaml"
+    assert paths["run_meta"] == tmp_path / "apec" / "runs" / "2026-06-01T10-00-00Z" / "run-meta.yaml"
+    assert paths["snapshots"] == tmp_path / "apec" / "runs" / "2026-06-01T10-00-00Z" / "snapshots"
+```
+
+- [ ] **Step 2: Run the run-path test to verify it fails**
+
+Run: `uv run pytest tests/test_apec_storage.py::test_apec_run_paths_builds_expected_layout -v`
+Expected: FAIL because `apec_run_paths` does not exist yet
+
+- [ ] **Step 3: Implement run-path helpers and artifact writes**
+
+```python
+# src/job_research/storage.py (append; requires `from pathlib import Path`)
+def apec_run_paths(data_root: Path, run_id: str) -> dict[str, Path]:
+    run_dir = data_root / "apec" / "runs" / run_id
+    return {
+        "run_dir": run_dir,
+        "listings": run_dir / "listings.yaml",
+        "run_meta": run_dir / "run-meta.yaml",
+        "snapshots": run_dir / "snapshots",
+    }
+```
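The test above uses a run id such as `2026-06-01T10-00-00Z`, which avoids colons so the value is safe as a directory name on every platform. As a minimal sketch, assuming the run id and any stored timestamp should come from the same clock reading, the two formats could be produced as follows. The helper names `make_run_id` and `make_iso_timestamp` are hypothetical and not part of the plan.

```python
from datetime import datetime, timezone


def make_run_id(now: datetime) -> str:
    """Format a timezone-aware instant as a directory-safe run id (no colons)."""
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")


def make_iso_timestamp(now: datetime) -> str:
    """Format the same instant as an ISO 8601 UTC timestamp."""
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Passing one `datetime` to both helpers keeps the run directory name and the timestamps written inside it consistent with each other.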
+
+- [ ] **Step 4: Run the run-path test to verify it passes**
+
+Run: `uv run pytest tests/test_apec_storage.py::test_apec_run_paths_builds_expected_layout -v`
+Expected: PASS
+
+- [ ] **Step 5: Commit the storage layout helper**
+
+```bash
+git add src/job_research/storage.py tests/test_apec_storage.py
+git commit -m "feat: add Apec run artifact paths"
+```
+
+## Task 3: Deterministic Query Derivation
+
+**Files:**
+- Create: `src/job_research/apec/__init__.py`
+- Create: `src/job_research/apec/query_derivation.py`
+- Create: `tests/apec/test_query_derivation.py`
+
+- [ ] **Step 1: Write the failing query derivation test**
+
+```python
+# tests/apec/test_query_derivation.py
+from job_research.apec.query_derivation import derive_apec_queries
+from job_research.models import CandidateProfileOutput
+
+
+def test_derive_apec_queries_from_candidate_profile() -> None:
+    profile = CandidateProfileOutput(
+        target_roles=["Data Engineer", "Analytics Engineer"],
+        strengths=["Python", "SQL"],
+        skills_to_emphasize=["BigQuery", "GCP"],
+        constraints=["CDI only", "France only"],
+    )
+
+    queries = derive_apec_queries(profile)
+
+    assert "Data Engineer" in queries
+    assert "Analytics Engineer" in queries
+    assert len(queries) <= 5
+```
+
+- [ ] **Step 2: Run the query test to verify it fails**
+
+Run: `uv run pytest tests/apec/test_query_derivation.py::test_derive_apec_queries_from_candidate_profile -v`
+Expected: FAIL with missing module or function
+
+- [ ] **Step 3: Implement deterministic query derivation**
+
+```python
+# src/job_research/apec/query_derivation.py
+from job_research.models import CandidateProfileOutput
+
+
+def derive_apec_queries(profile: CandidateProfileOutput) -> list[str]:
+    queries: list[str] = []
+    for title in profile.target_roles:
+        if title not in queries:
+            queries.append(title)
+    return queries[:5]
+```
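The derivation above removes exact duplicate strings only. A slightly more defensive variant, sketched here on a plain list of strings rather than `CandidateProfileOutput`, could also trim whitespace, skip blank entries, and treat case variants as duplicates. The function name `derive_queries_from_roles` and the `max_queries` parameter are hypothetical.

```python
def derive_queries_from_roles(target_roles: list[str], max_queries: int = 5) -> list[str]:
    """Return up to max_queries role titles, order-preserving and case-insensitively unique."""
    queries: list[str] = []
    seen: set[str] = set()
    for raw_title in target_roles:
        # Collapse internal runs of whitespace and trim both ends.
        title = " ".join(raw_title.split())
        if not title:
            continue  # Skip blank entries.
        key = title.casefold()
        if key in seen:
            continue  # Same title in a different case was already kept.
        seen.add(key)
        queries.append(title)
    return queries[:max_queries]
```

Because the output depends only on the input order, the same profile always yields the same query list, which is the property the task title calls deterministic.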
+
+- [ ] **Step 4: Run the query test to verify it passes**
+
+Run: `uv run pytest tests/apec/test_query_derivation.py::test_derive_apec_queries_from_candidate_profile -v`
+Expected: PASS
+
+- [ ] **Step 5: Commit query derivation**
+
+```bash
+git add src/job_research/apec/__init__.py src/job_research/apec/query_derivation.py tests/apec/test_query_derivation.py
+git commit -m "feat: add deterministic Apec query derivation"
+```
+
+## Task 4: Listing Normalization and Within-Run Deduplication
+
+**Files:**
+- Create: `src/job_research/apec/normalize.py`
+- Create: `src/job_research/apec/dedupe.py`
+- Create: `tests/apec/test_normalize.py`
+- Create: `tests/apec/test_dedupe.py`
+
+- [ ] **Step 1: Write the failing normalization and dedupe tests**
+
+```python
+# tests/apec/test_normalize.py
+from job_research.apec.normalize import normalize_apec_listing
+
+
+def test_normalize_apec_listing_extracts_minimal_shape() -> None:
+    html = """
+    <html>
+      <body>
+        <h1>Data Engineer</h1>
+        <div class="company">Example Corp</div>
+        <div class="location">Paris</div>
+        <div class="contract">CDI</div>
+        <div class="description">Build pipelines</div>
+      </body>
+    </html>
+    """
+
+    listing = normalize_apec_listing(url="https://example.test/job/123", html=html, fetched_at="2026-06-01T10:00:00Z")
+
+    assert listing.title == "Data Engineer"
+    assert listing.company == "Example Corp"
+    assert listing.contract_type == "CDI"
+```
+
+```python
+# tests/apec/test_dedupe.py
+from job_research.apec.dedupe import dedupe_apec_listings
+from job_research.models import ApecListing
+
+
+def test_dedupe_apec_listings_by_url() -> None:
+    listings = [
+        ApecListing(source="apec", url="https://example.test/job/1", fetched_at="2026-06-01T10:00:00Z"),
+        ApecListing(source="apec", url="https://example.test/job/1", fetched_at="2026-06-01T10:01:00Z"),
+    ]
+
+    deduped = dedupe_apec_listings(listings)
+
+    assert len(deduped) == 1
+```
+
+- [ ] **Step 2: Run the normalization and dedupe tests to verify they fail**
+
+Run: `uv run pytest tests/apec/test_normalize.py tests/apec/test_dedupe.py -v`
+Expected: FAIL with missing modules/functions
+
+- [ ] **Step 3: Implement minimal normalization and dedupe**
+
+```python
+# src/job_research/apec/normalize.py
+from bs4 import BeautifulSoup
+
+from job_research.models import ApecListing
+
+
+def normalize_apec_listing(url: str, html: str, fetched_at: str) -> ApecListing:
+    soup = BeautifulSoup(html, "html.parser")
+    title = soup.find("h1")
+    company = soup.select_one(".company")
+    location = soup.select_one(".location")
+    contract = soup.select_one(".contract")
+    description = soup.select_one(".description")
+
+    return ApecListing(
+        source="apec",
+        url=url,
+        title=title.get_text(strip=True) if title else None,
+        company=company.get_text(strip=True) if company else None,
+        location=location.get_text(strip=True) if location else None,
+        contract_type=contract.get_text(strip=True) if contract else None,
+        description_text=description.get_text(" ", strip=True) if description else None,
+        fetched_at=fetched_at,
+    )
+```
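`ApecListing` carries a `warnings` list, but the minimal normalizer above never fills it. One possible way to record fields that could not be extracted is sketched below with plain dicts so it runs without Pydantic or BeautifulSoup. The helper name `missing_field_warnings`, the `EXPECTED_FIELDS` tuple, and the message wording are assumptions for illustration.

```python
from typing import Optional

# Fields the normalizer tries to extract; all are optional on the model.
EXPECTED_FIELDS = ("title", "company", "location", "contract_type", "description_text")


def missing_field_warnings(fields: dict[str, Optional[str]]) -> list[dict[str, str]]:
    """Build one warning per expected field that is absent or blank."""
    warnings: list[dict[str, str]] = []
    for name in EXPECTED_FIELDS:
        value = fields.get(name)
        if value is None or not value.strip():
            warnings.append({"field": name, "message": f"{name} not found in page"})
    return warnings
```

Each dict has the same `field` and `message` keys as `ListingWarning`, so it could be turned into a model instance with `ListingWarning(**warning)`.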
+
+```python
+# src/job_research/apec/dedupe.py
+from job_research.models import ApecListing
+
+
+def dedupe_apec_listings(listings: list[ApecListing]) -> list[ApecListing]:
+    seen: set[str] = set()
+    deduped: list[ApecListing] = []
+    for listing in listings:
+        if listing.url in seen:
+            continue
+        seen.add(listing.url)
+        deduped.append(listing)
+    return deduped
+```
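Deduplicating on the raw `url` string treats the same page reached with a different query string, fragment, or trailing slash as a distinct listing. The sketch below shows one way to compare on a canonical key instead, using only `urllib.parse`. Whether Apec URLs actually vary this way, and whether the query string is safe to drop, are assumptions; `canonical_listing_key` and `dedupe_urls` are hypothetical helpers.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_listing_key(url: str) -> str:
    """Reduce a URL to scheme, lowercased host, and path without a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Drop query string and fragment; they are assumed not to identify the listing.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def dedupe_urls(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each canonical key, preserving input order."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        key = canonical_listing_key(url)
        if key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept
```

The same key function could replace `listing.url` in the `seen` check of `dedupe_apec_listings` if the assumption about URL variants turns out to hold.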
+
+- [ ] **Step 4: Run the normalization and dedupe tests to verify they pass**
+
+Run: `uv run pytest tests/apec/test_normalize.py tests/apec/test_dedupe.py -v`
+Expected: PASS
+
+- [ ] **Step 5: Commit normalization and dedupe**
+
+```bash
+git add src/job_research/apec/normalize.py src/job_research/apec/dedupe.py tests/apec/test_normalize.py tests/apec/test_dedupe.py
+git commit -m "feat: add Apec normalization and dedupe"
+```
+
+## Task 5: Public Apec Adapter and Snapshot Persistence
+
+**Files:**
+- Create: `src/job_research/apec/adapter.py`
+- Modify: `tests/test_apec_storage.py`
+
+- [ ] **Step 1: Write the snapshot persistence test**
+
+```python
+# tests/test_apec_storage.py
+from pathlib import Path
+
+from job_research.storage import apec_run_paths
+
+
+def test_apec_run_artifacts_include_snapshot_and_meta(tmp_path: Path) -> None:
+    paths = apec_run_paths(tmp_path, run_id="2026-06-01T10-00-00Z")
+    paths["snapshots"].mkdir(parents=True, exist_ok=True)
+    snapshot = paths["snapshots"] / "job-123.html"
+    snapshot.write_text("<html>snapshot</html>", encoding="utf-8")
+
+    assert snapshot.exists()
+```
+
+- [ ] **Step 2: Run the snapshot test**
+
+Run: `uv run pytest tests/test_apec_storage.py::test_apec_run_artifacts_include_snapshot_and_meta -v`
+Expected: PASS, because the test relies only on `apec_run_paths` from Task 2; a failure here means the path handling needs adjustment
+
+- [ ] **Step 3: Implement the Apec adapter skeleton**
+
+```python
+# src/job_research/apec/adapter.py
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass
+class ApecSearchResult:
+    url: str
+    source_job_id: str | None = None
+
+
+class ApecAdapter:
+    def __init__(self, max_listings: int = 50) -> None:
+        self.max_listings = max_listings
+
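+    def search(self, queries: list[str]) -> list[ApecSearchResult]:
+        # Scaffold only: returns no results until real search is wired in.
+        return []
+
+    def fetch_listing_html(self, url: str) -> str:
+        # Scaffold only: returns empty HTML until real fetching is wired in.
+        return ""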
+```
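The scaffold's `search` returns an empty list, so the 50-listing cap is not yet exercised. The sketch below shows one way the cap could be enforced while merging results from several queries. `search_one_query` is a hypothetical callable standing in for the real per-query search, which the plan does not define, and `collect_capped_urls` is likewise an illustrative name.

```python
from collections.abc import Callable


def collect_capped_urls(
    queries: list[str],
    search_one_query: Callable[[str], list[str]],
    max_listings: int = 50,
) -> list[str]:
    """Merge per-query result URLs in query order, skipping repeats, up to max_listings."""
    collected: list[str] = []
    seen: set[str] = set()
    for query in queries:
        if len(collected) >= max_listings:
            break  # Cap reached: do not issue further searches.
        for url in search_one_query(query):
            if url in seen:
                continue
            seen.add(url)
            collected.append(url)
            if len(collected) >= max_listings:
                break
    return collected
```

Stopping before the next query once the cap is reached avoids issuing searches whose results would be discarded.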
+
+- [ ] **Step 4: Run the snapshot test and any adapter-adjacent tests**
+
+Run: `uv run pytest tests/test_apec_storage.py -v`
+Expected: PASS
+
+- [ ] **Step 5: Commit the adapter scaffold**
+
+```bash
+git add src/job_research/apec/adapter.py tests/test_apec_storage.py
+git commit -m "feat: add Apec adapter scaffold"
+```
+
+## Task 6: fetch-apec Command Orchestration
+
+**Files:**
+- Modify: `src/job_research/cli.py`
+- Create: `tests/test_apec_cli.py`
+
+- [ ] **Step 1: Write the failing CLI orchestration tests**
+
+```python
+# tests/test_apec_cli.py
+from pathlib import Path
+
+from typer.testing import CliRunner
+
+from job_research.cli import app
+
+
+def test_fetch_apec_reads_profile_and_writes_run_artifacts(monkeypatch, tmp_path: Path) -> None:
+    data_dir = tmp_path / "data"
+    data_dir.mkdir()
+    (data_dir / "candidate-profile.yaml").write_text(
+        "target_roles:\n  - Data Engineer\nstrengths:\n  - Python\nskills_to_emphasize:\n  - BigQuery\nconstraints:\n  - CDI only\n",
+        encoding="utf-8",
+    )
+
+    result = CliRunner().invoke(app, ["fetch-apec", "--data-root", str(data_dir)])
+
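+    assert result.exit_code == 0
+    assert "normalized listing count" in result.stdout.lower()
+    # The run directory name is time-based, so locate it instead of hard-coding it.
+    run_dirs = list((data_dir / "apec" / "runs").iterdir())
+    assert len(run_dirs) == 1
+    assert (run_dirs[0] / "listings.yaml").exists()
+    assert (run_dirs[0] / "run-meta.yaml").exists()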
+```
+
+- [ ] **Step 2: Run the CLI orchestration test to verify it fails**
+
+Run: `uv run pytest tests/test_apec_cli.py::test_fetch_apec_reads_profile_and_writes_run_artifacts -v`
+Expected: FAIL because `fetch-apec` does not exist yet
+
+- [ ] **Step 3: Implement fetch-apec command orchestration**
+
+```python
+# src/job_research/cli.py
+from datetime import UTC, datetime
+from pathlib import Path
+
+import typer
+
+from job_research.apec.adapter import ApecAdapter
+from job_research.apec.dedupe import dedupe_apec_listings
+from job_research.apec.normalize import normalize_apec_listing
+from job_research.apec.query_derivation import derive_apec_queries
+from job_research.models import ApecRunMeta, CandidateProfileOutput, ListingError
+from job_research.storage import apec_run_paths, load_yaml, save_yaml
+
+@app.command("fetch-apec")
+def fetch_apec(
+    data_root: Path = typer.Option(Path("data")),
+) -> None:
+    profile_payload = load_yaml(data_root / "candidate-profile.yaml")
+    profile = CandidateProfileOutput.model_validate(profile_payload)
+    queries = derive_apec_queries(profile)
+    if not queries:
+        raise typer.BadParameter("No usable Apec queries could be derived from candidate-profile.yaml")
+
+    started_at = datetime.now(UTC)
+    run_id = started_at.strftime("%Y-%m-%dT%H-%M-%SZ")  # filesystem-safe run directory name
+    fetched_at = started_at.strftime("%Y-%m-%dT%H:%M:%SZ")  # ISO 8601 timestamp stored on each listing
+    paths = apec_run_paths(data_root, run_id)
+    paths["snapshots"].mkdir(parents=True, exist_ok=True)
+
+    adapter = ApecAdapter(max_listings=50)
+    search_results = adapter.search(queries)[: adapter.max_listings]
+    listings = []
+    errors: list[ListingError] = []
+
+    for index, result in enumerate(search_results):
+        try:
+            html = adapter.fetch_listing_html(result.url)
+            # Fall back to a positional name so listings without an id do not overwrite each other.
+            snapshot_name = (result.source_job_id or f"listing-{index}").replace("/", "-")
+            snapshot_path = paths["snapshots"] / f"{snapshot_name}.html"
+            snapshot_path.write_text(html, encoding="utf-8")
+            listings.append(normalize_apec_listing(url=result.url, html=html, fetched_at=fetched_at))
+        except Exception as exc:
+            errors.append(ListingError(url=result.url, stage="fetch_or_normalize", message=str(exc)))
+
+    deduped = dedupe_apec_listings(listings)
+    run_meta = ApecRunMeta(
+        derived_queries=queries,
+        fetched_count=len(search_results),
+        normalized_count=len(listings),
+        deduplicated_count=len(deduped),
+        failed_count=len(errors),
+        listing_errors=errors,
+    )
+
+    save_yaml(paths["listings"], {"listings": [listing.model_dump(mode="json") for listing in deduped]})
+    save_yaml(paths["run_meta"], run_meta.model_dump(mode="json"))
+
+    typer.echo(f"Query count: {len(queries)}")
+    typer.echo(f"Fetched listing count: {run_meta.fetched_count}")
+    typer.echo(f"Normalized listing count: {run_meta.normalized_count}")
+    typer.echo(f"Deduplicated count: {run_meta.deduplicated_count}")
+    typer.echo(f"Failed listing count: {run_meta.failed_count}")
+```
+
+Implementation requirements:
+- load `data/candidate-profile.yaml`
+- validate into `CandidateProfileOutput`
+- derive queries
+- create a run id and run paths
+- invoke adapter search/fetch flow
+- persist snapshots, listings.yaml, run-meta.yaml
+- print summary counts
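The command's loop records per-listing failures instead of aborting the run. The stand-alone sketch below isolates that partial-success pattern using only the standard library. `fetch_html` is a hypothetical callable replacing `ApecAdapter.fetch_listing_html`, plain dicts stand in for `ListingError`, and the `listing-NNN.html` naming is an assumption rather than the plan's scheme.

```python
from collections.abc import Callable
from pathlib import Path


def snapshot_all(
    urls: list[str],
    fetch_html: Callable[[str], str],
    snapshots_dir: Path,
) -> tuple[list[Path], list[dict[str, str]]]:
    """Fetch each URL, write one snapshot per success, and collect an error per failure."""
    snapshots_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    errors: list[dict[str, str]] = []
    for index, url in enumerate(urls):
        try:
            html = fetch_html(url)
            # Positional names keep snapshots distinct even when no job id is known.
            path = snapshots_dir / f"listing-{index:03d}.html"
            path.write_text(html, encoding="utf-8")
            written.append(path)
        except Exception as exc:  # One bad listing must not abort the whole run.
            errors.append({"url": url, "stage": "fetch", "message": str(exc)})
    return written, errors
```

Returning successes and errors together lets the caller fill both `normalized_count` and `failed_count` from a single pass.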
+
+- [ ] **Step 4: Run the CLI orchestration test to verify it passes**
+
+Run: `uv run pytest tests/test_apec_cli.py::test_fetch_apec_reads_profile_and_writes_run_artifacts -v`
+Expected: PASS
+
+- [ ] **Step 5: Commit the fetch-apec command**
+
+```bash
+git add src/job_research/cli.py tests/test_apec_cli.py
+git commit -m "feat: add fetch-apec command"
+```
+
+## Task 7: Full Regression and Manual Smoke Test
+
+**Files:**
+- Modify: none
+
+- [ ] **Step 1: Run the full test suite**
+
+Run: `uv run pytest tests -v`
+Expected: PASS with all Apec-slice and profile-slice tests green
+
+- [ ] **Step 2: Run a manual smoke test that does not touch the network**
+
+Run: `uv run job-research fetch-apec --help`
+Expected: exit code 0 and help text for `fetch-apec` that lists the `--data-root` option
+
+- [ ] **Step 3: Commit any remaining changes from the validated Apec ingestion slice (skip if the working tree is clean)**
+
+```bash
+git add pyproject.toml src/job_research tests
+git commit -m "feat: complete Apec ingestion slice"
+```
+
+## Spec Coverage Check
+
+- Explicit `fetch-apec` command: covered by Task 6
+- Read `data/candidate-profile.yaml`: covered by Task 6
+- Deterministic query derivation: covered by Task 3
+- 50-listing cap and adapter behavior: covered by Task 5 and Task 6
+- Raw HTML snapshot persistence: covered by Task 2, Task 5, and Task 6
+- Normalized YAML listing output: covered by Task 1, Task 4, and Task 6
+- Minimal within-run deduplication: covered by Task 4
+- Partial-success metadata and run summary: covered by Task 1, Task 2, and Task 6