Attribute Mapping from Blueprints

Attribute mapping is the semantic normalization stage within the Automated Floor Plan Parsing & Vectorization collection: it binds raw geometric extraction to the structured, queryable records that a routing graph and a facilities database can consume. The stage takes polygonized room boundaries, detected architectural elements, and stray text annotations, then resolves coordinate drift, ambiguous drafting conventions, and fragmented metadata into deterministic feature attributes. No human sits in the assignment loop; manual review is reserved for records that fall below the confidence floor.

The Problem: Text That Doesn’t Sit Where the Geometry Is

The hard part of attribute mapping is not parsing text — it is deciding which room each label belongs to. Drafters almost never place a room name at the polygon centroid. They drop “CONFERENCE 2.14” near a doorway, run “CORRIDOR” along a circulation spine, and float a suite number in open-plan space that legally belongs to four different desks. A naive point-in-polygon test produces three failure symptoms you will see in production:

Orphaned rooms. A polygon receives no label because its text sits 0.4 m outside the boundary, so it ships as ROOM_A37 and breaks search.
Cross-assigned labels. A corridor label lands inside an adjacent office because the office polygon happens to overlap the text point.
Silent attribute loss. Sheet metadata (“REV C”, “SCALE 1:100”) is mapped as a room name, polluting the routing graph with junk nodes.
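The orphaned-room symptom is easy to reproduce. The sketch below is a minimal illustration, not the stage's resolver: it assumes axis-aligned rectangular rooms so it can run without a geometry library, and the helper names (`naive_match`, `tolerant_match`) and the 0.5 m tolerance are hypothetical.

```python
from typing import Dict, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y) in meters


def contains(rect: Rect, x: float, y: float) -> bool:
    """Strict point-in-rectangle test: the naive approach."""
    min_x, min_y, max_x, max_y = rect
    return min_x <= x <= max_x and min_y <= y <= max_y


def distance_to(rect: Rect, x: float, y: float) -> float:
    """Euclidean distance from a point to a rectangle (0.0 when inside)."""
    min_x, min_y, max_x, max_y = rect
    dx = max(min_x - x, 0.0, x - max_x)
    dy = max(min_y - y, 0.0, y - max_y)
    return (dx * dx + dy * dy) ** 0.5


def naive_match(rooms: Dict[str, Rect], x: float, y: float) -> Optional[str]:
    """Return the first room that strictly contains the label point."""
    for room_id, rect in rooms.items():
        if contains(rect, x, y):
            return room_id
    return None


def tolerant_match(
    rooms: Dict[str, Rect], x: float, y: float, tolerance_m: float = 0.5
) -> Optional[str]:
    """Return the nearest room, provided it lies within the tolerance."""
    best = min(rooms, key=lambda rid: distance_to(rooms[rid], x, y))
    return best if distance_to(rooms[best], x, y) <= tolerance_m else None


rooms = {"A37": (0.0, 0.0, 6.0, 4.0), "A38": (10.0, 0.0, 16.0, 4.0)}
label_xy = (6.4, 2.0)  # label sits 0.4 m outside A37's east wall

print(naive_match(rooms, *label_xy))     # no room contains the point
print(tolerant_match(rooms, *label_xy))  # A37 is within the 0.5 m tolerance
```

The strict test leaves A37 unlabeled, while the distance-based test recovers it. That is the behavior the buffered join in Step 2 generalizes to arbitrary polygons.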

This stage exists to make label-to-polygon assignment deterministic and auditable, then to refuse to emit any record that would corrupt downstream navigation. It runs after Wall & Door Detection Algorithms have produced clean room polygons and openings, and before records enter the JSON Schema Design for Indoor Maps contract that the delivery layer enforces.

Prerequisites & Dependencies

Before implementing the resolver, the upstream stages must guarantee a few invariants. Attribute mapping is stateless and idempotent, but only if its inputs are already coordinate-aligned.

Polygonized rooms — closed, valid shapely polygons keyed by a stable room_id, typically emitted by SVG/DWG Parsing Workflows with consistent floor-level tagging.
Extracted text entities — each carrying baseline (x, y), rotation, font size, and source layer from the CAD blocks or SVG <text> nodes.
A single metric frame — all geometry projected into one Indoor Coordinate Reference System before any spatial operation; mixing millimeter CAD origins with a metric routing graph guarantees misassignment.
Libraries — shapely>=2.0 for geometry, rtree for bounding-box indexing, pyproj for datum/CRS transforms, and pydantic>=2 for schema enforcement.

Blueprint units (millimeters, inches, architectural units) must be converted to a consistent metric reference before indexing, and text baselines must be projected into the same frame as the polygons. Misalignment here propagates straight into routing failures, so deterministic normalization is non-negotiable.
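As a worked example of that conversion, here is a small sketch that fails loudly on an unknown unit rather than guessing. The function name `to_meters` and the `paper_scale` parameter are assumptions made for illustration; `paper_scale` is taken to be the scale denominator (100 for a 1:100 sheet) when coordinates are measured in paper space, and 1.0 when they are already in model space.

```python
from typing import Tuple

# Meters per drawing unit.
METERS_PER_UNIT = {"mm": 0.001, "cm": 0.01, "m": 1.0, "in": 0.0254, "ft": 0.3048}


def to_meters(
    x: float, y: float, units: str, paper_scale: float = 1.0
) -> Tuple[float, float]:
    """Convert one drawing coordinate to meters, rejecting unknown units."""
    if units not in METERS_PER_UNIT:
        # Guessing a unit would misplace every label, so stop here instead.
        raise ValueError(f"Unknown drawing unit {units!r}")
    factor = METERS_PER_UNIT[units] * paper_scale
    return (x * factor, y * factor)


# Model-space millimeters: 12500 mm is 12.5 m.
print(to_meters(12500.0, 8400.0, "mm"))

# Paper-space millimeters on a 1:100 sheet: 125 mm on paper is 12.5 m on site.
print(to_meters(125.0, 84.0, "mm", paper_scale=100.0))
```

Raising on an unrecognized unit is a deliberate choice in this sketch: a silent default would produce coordinates that look plausible but sit in the wrong frame.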

How the Resolver Works

The stage runs a strict three-phase execution model. Coordinates are normalized and non-semantic text is filtered out; surviving labels are associated to polygons by buffered spatial join and confidence-weighted scoring; resolved records are type-checked against the output schema before handoff. Each phase has a single responsibility and a typed contract to the next, so a failure can be isolated to one phase rather than debugged across the whole pipeline.
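The failure-isolation idea can be sketched with a tiny phase runner. Everything here is illustrative: `PhaseError`, `run_phases`, and the three stand-in phase functions are hypothetical names that operate on plain strings, not the real implementations described in the steps that follow.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


class PhaseError(RuntimeError):
    """Wraps a failure with the name of the phase that raised it."""

    def __init__(self, phase: str, cause: Exception) -> None:
        super().__init__(f"{phase} failed: {cause}")
        self.phase = phase


def normalize(texts: List[str]) -> List[str]:
    """Phase 1 stand-in: trim labels and drop empty strings."""
    return [t.strip() for t in texts if t.strip()]


def associate(labels: List[str]) -> Dict[str, str]:
    """Phase 2 stand-in: give each label a placeholder room id."""
    return {f"R{i}": label for i, label in enumerate(labels)}


def validate(assignments: Dict[str, str]) -> Dict[str, str]:
    """Phase 3 stand-in: refuse to emit an empty batch."""
    if not assignments:
        raise ValueError("no rooms to emit")
    return assignments


PHASES: Sequence[Tuple[str, Callable[[Any], Any]]] = (
    ("normalize", normalize),
    ("associate", associate),
    ("validate", validate),
)


def run_phases(payload: Any, phases: Sequence[Tuple[str, Callable]] = PHASES) -> Any:
    """Feed each phase's output to the next, tagging any failure with its phase."""
    for name, phase in phases:
        try:
            payload = phase(payload)
        except Exception as exc:
            raise PhaseError(name, exc) from exc
    return payload


print(run_phases([" LOBBY ", ""]))
```

Because each exception carries the phase name, a log line or metric can attribute a failure to normalization, association, or validation without reading a stack trace.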

Step-by-Step Implementation

Step 1: Coordinate Normalization & Text Extraction

Blueprint text rarely aligns with room centroids, so the first step normalizes every coordinate, applies unit scaling, and filters non-semantic annotations using positional heuristics and regex patterns. Filtering early keeps sheet boilerplate out of the spatial index entirely.

import re
import logging
from typing import List, Tuple, Optional

from shapely.geometry import Point
from shapely.affinity import rotate
from pydantic import BaseModel, Field

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger("attribute_mapping")


class BlueprintText(BaseModel):
    raw_text: str
    x: float
    y: float
    rotation_deg: float = 0.0
    layer: str = ""
    font_size: float = 0.0
    confidence: float = Field(default=0.0, ge=0.0, le=1.0)


class NormalizedAnnotation(BaseModel):
    model_config = {"arbitrary_types_allowed": True}
    text: str
    geometry: Point
    layer: str
    font_size: float
    confidence: float


# Sheet boilerplate that must never become a room name.
# Keywords may carry one trailing value token ("REV C", "SCALE 1:100").
NON_SEMANTIC_PATTERNS = re.compile(
    r"^(?:(?:REV|SCALE|DATE|DRAWN BY|CHECKED|SHEET|DWG)(?:[\s.:#]+\S+)?|NORTH|"
    r"\d{1,3}[-/]\d{1,3}[-/]\d{2,4}|[A-Z]{2,4}-\d{3,5})$",
    re.IGNORECASE,
)


def normalize_units_to_meters(
    raw_coords: List[Tuple[float, float]],
    drawing_units: str = "mm",
    scale_factor: Optional[float] = None,
) -> List[Tuple[float, float]]:
    """Convert drawing coordinates to meters using standard architectural scales."""
    unit_multipliers = {"mm": 0.001, "in": 0.0254, "ft": 0.3048, "m": 1.0, "arch": 0.0254}
    multiplier = unit_multipliers.get(drawing_units, 0.001)
    if scale_factor is not None:
        multiplier *= scale_factor
    return [(x * multiplier, y * multiplier) for x, y in raw_coords]


def extract_and_filter_text(
    raw_texts: List[BlueprintText],
    drawing_units: str = "mm",
) -> List[NormalizedAnnotation]:
    """Normalize coordinates, drop non-semantic text, return structured annotations."""
    normalized: List[NormalizedAnnotation] = []
    for txt in raw_texts:
        candidate = txt.raw_text.strip()
        if NON_SEMANTIC_PATTERNS.match(candidate):
            logger.debug("Filtered non-semantic text: %s", candidate)
            continue

        try:
            x_m, y_m = normalize_units_to_meters([(txt.x, txt.y)], drawing_units)[0]
        except (TypeError, ValueError) as exc:
            logger.warning("Skipping unparseable coordinate %r: %s", candidate, exc)
            continue

        # A point rotated about its own insertion point does not move, so
        # rotation_deg needs no geometric correction for a baseline anchor.
        geom = Point(x_m, y_m)

        # Initial confidence from font-size consistency and source-layer classification.
        base_conf = min(1.0, txt.font_size / 12.0) if txt.font_size > 0 else 0.3
        if txt.layer.lower() in ("text", "annotations", "labels", "room_names"):
            base_conf = min(1.0, base_conf + 0.3)

        normalized.append(
            NormalizedAnnotation(
                text=candidate,
                geometry=geom,
                layer=txt.layer,
                font_size=txt.font_size,
                confidence=base_conf,
            )
        )
    logger.info("Retained %d of %d text entities", len(normalized), len(raw_texts))
    return normalized

Use pyproj when transforming between local CAD origins and a real-world CRS, particularly when integrating with GIS platforms; see the pyproj documentation for authoritative guidance on CRS transformations and datum shifts.
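One normalization detail worth sketching is axis orientation: SVG places its origin at the top-left with y increasing downward, while CAD drawings use a bottom-left origin with y increasing upward. The helper below is a minimal illustration; `flip_y` and `sheet_height` are hypothetical names, and `sheet_height` is assumed to be the SVG viewBox height in the same units as the coordinates.

```python
from typing import List, Tuple


def flip_y(
    points: List[Tuple[float, float]], sheet_height: float
) -> List[Tuple[float, float]]:
    """Mirror y about the sheet so SVG coordinates match a y-up CAD frame."""
    return [(x, sheet_height - y) for x, y in points]


svg_label_anchors = [(10.0, 5.0), (42.0, 30.0)]  # y measured down from the top
cad_frame = flip_y(svg_label_anchors, sheet_height=30.0)
print(cad_frame)
```

Apply the flip to labels and polygons alike, and do it before unit scaling, so both inputs reach the spatial index in one orientation.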

Step 2: Spatial Indexing & Label Association

A point-in-polygon test fails when labels sit outside boundaries, when rooms share open-plan space, or when drafting standards place text in corridors. The association engine indexes polygons in an R-tree, expands each label by a configurable buffer, and scores candidates by distance and confidence, with a nearest-centroid fallback so no label is ever silently dropped.

from typing import Dict, List

from rtree import index
from shapely.geometry import Polygon, box


class SpatialLabelResolver:
    def __init__(self, room_polygons: Dict[str, Polygon]) -> None:
        self.polygons = room_polygons
        self.idx = index.Index()
        # rtree needs integer ids, so keep a position -> room_id lookup alongside it.
        self._id_lookup: Dict[int, str] = {}
        for i, (room_id, poly) in enumerate(room_polygons.items()):
            self.idx.insert(i, poly.bounds)
            self._id_lookup[i] = room_id

    def associate_labels(
        self,
        annotations: List[NormalizedAnnotation],
        buffer_m: float = 0.5,
    ) -> Dict[str, List[NormalizedAnnotation]]:
        """Map annotations to rooms via buffered spatial joins and confidence scoring."""
        room_assignments: Dict[str, List[NormalizedAnnotation]] = {
            rid: [] for rid in self.polygons
        }
        if not self.polygons:
            logger.error("No room polygons supplied; cannot associate labels")
            return room_assignments

        for ann in annotations:
            search_box = box(
                ann.geometry.x - buffer_m, ann.geometry.y - buffer_m,
                ann.geometry.x + buffer_m, ann.geometry.y + buffer_m,
            )
            candidates = []
            for rtree_id in self.idx.intersection(search_box.bounds):
                room_id = self._id_lookup[rtree_id]
                poly = self.polygons[room_id]
                if not poly.is_valid:
                    continue
                dist = ann.geometry.distance(poly)
                if dist <= buffer_m:
                    # Confidence is fixed per label, so within one label the
                    # ranking is driven by distance; closer rooms score higher.
                    score = ann.confidence * (1.0 / (1.0 + dist))
                    candidates.append((room_id, score))

            if candidates:
                best_room = max(candidates, key=lambda c: c[1])[0]
                room_assignments[best_room].append(ann)
            else:
                # Fallback: nearest centroid keeps every label attached to a room.
                nearest = min(
                    self.polygons.items(),
                    key=lambda item: ann.geometry.distance(item[1].centroid),
                )
                logger.debug("Label %r fell back to centroid match %s", ann.text, nearest[0])
                room_assignments[nearest[0]].append(ann)

        return room_assignments

Tune buffer_m to drafting scale — typically 0.3–0.8 m for 1:100 or 1:50 plans. When the upstream parser exports block attributes and text with consistent floor-level and layer naming, the initial confidence scoring in Step 1 needs far less manual correction here.
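To see how the buffer and the score interact, the sketch below applies the same formula, confidence × 1 / (1 + distance), to precomputed label-to-room distances. The room ids, the distances, and the `best_room` helper are made up for illustration.

```python
from typing import Dict, Optional


def score(confidence: float, dist_m: float) -> float:
    """Same shape as the resolver's score: closer rooms score higher."""
    return confidence * (1.0 / (1.0 + dist_m))


def best_room(
    distances_m: Dict[str, float], confidence: float, buffer_m: float
) -> Optional[str]:
    """Pick the highest-scoring room among those within the buffer."""
    in_range = {rid: d for rid, d in distances_m.items() if d <= buffer_m}
    if not in_range:
        return None  # the resolver would use its centroid fallback here
    return max(in_range, key=lambda rid: score(confidence, in_range[rid]))


# Hypothetical distances from one label to three nearby rooms, in meters.
distances_m = {"OFFICE_2.13": 0.10, "CORRIDOR_2": 0.45, "OFFICE_2.15": 0.70}

print(best_room(distances_m, confidence=0.8, buffer_m=0.5))   # nearest in range
print(best_room(distances_m, confidence=0.8, buffer_m=0.05))  # nothing in range
```

A single label's confidence scales every candidate equally, so the buffer decides which rooms compete and distance decides the winner. Confidence matters later, when several labels land in the same room.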

Step 3: Schema Validation & Routing-Graph Handoff

Once labels are spatially resolved, the stage enforces a strict output schema. Routing engines and facilities databases require deterministic field types, mandatory identifiers, and topology-ready attributes — anything malformed must be rejected, not coerced.

from typing import Dict, List, Optional

from pydantic import BaseModel, field_validator, ValidationError


class MappedRoomAttribute(BaseModel):
    room_id: str
    name: str
    area_sqm: float
    occupancy_type: str
    floor_level: int
    door_count: int
    wall_material: Optional[str] = None
    label_confidence: float
    geometry: str  # WKT or GeoJSON string

    @field_validator("area_sqm", "label_confidence")
    @classmethod
    def validate_non_negative(cls, v: float) -> float:
        if v < 0:
            raise ValueError("Must be non-negative")
        return v

    @field_validator("occupancy_type")
    @classmethod
    def normalize_occupancy(cls, v: str) -> str:
        return v.strip().upper()


def validate_batch(
    assignments: Dict[str, List[NormalizedAnnotation]],
    polygon_areas: Dict[str, float],
    floor_level: int,
    polygon_wkt: Optional[Dict[str, str]] = None,
) -> List[MappedRoomAttribute]:
    """Convert spatial assignments into validated schema records.

    polygon_wkt optionally maps room_id to a WKT string; when it is omitted,
    the geometry field is left empty.
    """
    polygon_wkt = polygon_wkt or {}
    records: List[MappedRoomAttribute] = []
    for room_id, anns in assignments.items():
        name = max(anns, key=lambda a: a.confidence).text if anns else f"ROOM_{room_id}"
        avg_conf = sum(a.confidence for a in anns) / len(anns) if anns else 0.0

        try:
            records.append(
                MappedRoomAttribute(
                    room_id=room_id,
                    name=name,
                    area_sqm=polygon_areas.get(room_id, 0.0),
                    occupancy_type="GENERAL",   # Refined later from a POI taxonomy lookup.
                    floor_level=floor_level,
                    door_count=0,               # Populated by wall & door detection.
                    label_confidence=round(avg_conf, 3),
                    geometry=polygon_wkt.get(room_id, ""),
                )
            )
        except ValidationError as exc:
            logger.warning("Schema validation failed for %s: %s", room_id, exc)

    logger.info("Validated %d of %d candidate rooms", len(records), len(assignments))
    return records

The occupancy_type placeholder is refined by the POI Taxonomy & Classification lookup, and door_count is backfilled from Wall & Door Detection Algorithms once openings are resolved. The floor_level integer must agree with the Level Mapping & Z-Axis Logic convention used across the campus. For authoritative spatial data modeling, refer to the OGC IndoorGML specification.

Edge Cases & Gotchas

Symptom	Root cause	Resolution
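Labels assigned to wrong rooms	Buffer too small (labels fall through to the centroid fallback), or text placed in a corridor	Raise `buffer_m` toward `0.6–1.0` m; if labels then bleed into adjacent rooms, shrink it again; keep centroid fallback, verify layer filtering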
High validation failure rate	Missing mandatory fields or malformed WKT	Repair geometry with `buffer(0)` pre-validation, enforce schema defaults, log raw payloads
Coordinate drift across floor levels	Inconsistent drawing origins or missing CRS	Apply global affine registration from control points, enforce `pyproj` transforms
Duplicate room names	Multiple equal-confidence labels per room	Deduplicate by largest font size or closest-to-centroid label
Slow spatial joins	Unindexed or self-intersecting polygons	Bulk-load the R-tree, pre-merge overlaps, assert `poly.is_valid` before insert
Y-axis inversion	SVG origin top-left vs. CAD bottom-left	Flip `y` during normalization so labels and polygons share one orientation

Validation Output

The stage should emit a GeoJSON FeatureCollection whose properties match the established indoor envelope, so the result drops straight into the delivery contract. A correct single-feature output looks like this:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": { "type": "Polygon", "coordinates": [[[0,0],[6,0],[6,4],[0,4],[0,0]]] },
      "properties": {
        "room_id": "L1-2.14",
        "name": "CONFERENCE 2.14",
        "occupancy_type": "MEETING",
        "floor_level": 1,
        "area_sqm": 24.0,
        "door_count": 1,
        "label_confidence": 0.91
      }
    }
  ]
}

The incorrect counterpart is the tell-tale failure mode: "name": "REV C" with "label_confidence": 0.3 means sheet boilerplate slipped past the filter. Guard the batch with explicit assertions before handoff:

def assert_mapping_quality(records: List[MappedRoomAttribute]) -> None:
    """Fail fast if any record would corrupt the routing graph."""
    named = [r for r in records if not r.name.startswith("ROOM_")]
    low_conf = [r for r in records if r.label_confidence < 0.4]

    assert records, "Empty batch: upstream produced no polygons"
    assert len(named) / len(records) >= 0.95, "Too many orphaned rooms"
    if low_conf:
        logger.warning("%d rooms below confidence floor; queue for review", len(low_conf))

A green run keeps the named-room ratio at or above 0.95 and logs every record below the 0.4 confidence floor so it can be queued for human review.
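Producing the FeatureCollection shown above from validated records can be sketched with the standard library alone. This is an illustration under assumptions: records are taken to be plain dicts (for example the output of `model_dump()`), polygon rings are supplied separately as coordinate lists, and the helper names `to_feature` and `to_feature_collection` are hypothetical.

```python
import json
from typing import Any, Dict, List

Ring = List[List[float]]


def to_feature(record: Dict[str, Any], rings: List[Ring]) -> Dict[str, Any]:
    """Wrap one validated record as a GeoJSON Feature with Polygon geometry."""
    # Keep the string geometry field and unset optionals out of the properties.
    properties = {
        key: value
        for key, value in record.items()
        if key != "geometry" and value is not None
    }
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": rings},
        "properties": properties,
    }


def to_feature_collection(
    records: List[Dict[str, Any]], rings_by_room: Dict[str, List[Ring]]
) -> Dict[str, Any]:
    """Assemble one FeatureCollection; a missing ring raises KeyError."""
    features = [to_feature(r, rings_by_room[r["room_id"]]) for r in records]
    return {"type": "FeatureCollection", "features": features}


record = {
    "room_id": "L1-2.14", "name": "CONFERENCE 2.14", "area_sqm": 24.0,
    "occupancy_type": "MEETING", "floor_level": 1, "door_count": 1,
    "wall_material": None, "label_confidence": 0.91, "geometry": "",
}
rings_by_room = {"L1-2.14": [[[0, 0], [6, 0], [6, 4], [0, 4], [0, 0]]]}

collection = to_feature_collection([record], rings_by_room)
print(json.dumps(collection, indent=2))
```

Letting a missing ring raise instead of emitting a null geometry keeps the handoff strict, in line with the reject-rather-than-coerce rule from Step 3.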

Performance & Scale Notes

Attribute mapping must scale across multi-floor campuses and batch queues. The dominant cost is the R-tree intersection in Step 2: building the index is O(n log n) in polygon count, and each label query is O(log n + k) for k candidates in the buffer window, so a single floor level of a few hundred rooms maps in well under a second.

Idempotency keys. Hash the input polygon geometries and raw text payloads into a deterministic job id (a topology hash); re-running the same blueprint yields identical records with no duplication.
Chunk by floor level. Split portfolios into per-floor tiles to bound R-tree memory and avoid spikes during index construction on large buildings.
Stateless workers. Run mapping in Async Batch Processing Pipelines on Celery, RQ, or Lambda with ephemeral storage; persist only validated GeoJSON to object storage or PostGIS.
Index the output. Add GiST indexes on geometry and a B-tree index on room_id in PostGIS for sub-50 ms routing lookups.
Observability. Emit labels_filtered, spatial_misses, validation_failures, and processing_latency; the artifacts are gated by CI Gating for Map Updates, which should block a publish when validation_failures exceed 5% of batch size.
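The idempotency key from the first bullet can be sketched with `hashlib` and `json`. This assumes polygons are available as WKT strings keyed by room_id and text entities as plain dicts; the function name `topology_hash` is hypothetical. The hash compares serialized text, so two WKT strings that describe the same shape with a different vertex order or precision would produce different keys.

```python
import hashlib
import json
from typing import Any, Dict, List


def topology_hash(polygons_wkt: Dict[str, str], raw_texts: List[Dict[str, Any]]) -> str:
    """Derive a deterministic job id from the inputs, independent of ordering."""
    payload = {
        # Sort both collections so input order never changes the key.
        "polygons": sorted(polygons_wkt.items()),
        "texts": sorted(json.dumps(t, sort_keys=True) for t in raw_texts),
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


polygons = {
    "A37": "POLYGON ((0 0, 6 0, 6 4, 0 4, 0 0))",
    "A38": "POLYGON ((10 0, 16 0, 16 4, 10 4, 10 0))",
}
texts = [
    {"raw_text": "CONFERENCE 2.14", "x": 3000.0, "y": 2000.0},
    {"raw_text": "CORRIDOR", "x": 8000.0, "y": 2000.0},
]

job_id = topology_hash(polygons, texts)
print(job_id)
```

A worker can use the key to skip blueprints it has already mapped, or as the primary key of the persisted batch so that a re-run overwrites rather than duplicates.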

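from celery import Celery
from pydantic import ValidationError
from shapely import wkt

app = Celery("attribute_mapper", broker="redis://localhost:6379/0")


@app.task(bind=True, max_retries=3, default_retry_delay=60)
def process_floor_plan(
    self, blueprint_id: str, raw_polygons: dict, raw_texts: list, floor_level: int = 1
) -> dict:
    """Celery-compatible worker: normalize, associate, validate one floor level.

    Assumes JSON-serialized task arguments: raw_polygons maps room_id to a WKT
    string and raw_texts is a list of dicts shaped like BlueprintText.
    """
    try:
        polygons = {rid: wkt.loads(geom) for rid, geom in raw_polygons.items()}
        texts = [BlueprintText(**t) for t in raw_texts]
        normalized = extract_and_filter_text(texts)
        resolver = SpatialLabelResolver(polygons)
        assignments = resolver.associate_labels(normalized)
        areas = {rid: poly.area for rid, poly in polygons.items()}
        records = validate_batch(assignments, areas, floor_level=floor_level)
        return {"status": "success", "records": [r.model_dump() for r in records]}
    except (ValidationError, KeyError, TypeError) as exc:
        # Malformed input fails the same way on every attempt, so do not retry.
        logger.error("Mapping failed for %s: %s", blueprint_id, exc)
        raise
    except OSError as exc:
        # Transient infrastructure errors are worth another attempt.
        raise self.retry(exc=exc)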
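The publish gate described under observability reduces to a ratio check. The sketch below is one possible shape for it; the function name `should_block_publish` and the choice to block an empty batch are assumptions, while the 5% threshold comes from the text above.

```python
def should_block_publish(
    validation_failures: int, batch_size: int, max_failure_ratio: float = 0.05
) -> bool:
    """Return True when validation failures exceed the allowed share of the batch."""
    if batch_size <= 0:
        # Assumption: an empty batch has nothing trustworthy to publish.
        return True
    return validation_failures / batch_size > max_failure_ratio


print(should_block_publish(validation_failures=5, batch_size=100))  # exactly 5%
print(should_block_publish(validation_failures=6, batch_size=100))  # above 5%
```

The comparison is strict, so a batch sitting exactly at 5% still publishes, matching the "exceed 5%" wording.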

Frequently Asked Questions

How big should the spatial buffer be?

Tie buffer_m to drafting scale, not to a fixed default. For 1:100 or 1:50 plans, 0.3–0.8 m captures labels placed near doorways without bleeding into adjacent rooms. If you see cross-assignment, shrink the buffer and lean on the centroid fallback; if you see orphaned rooms, widen it and re-run the assertion check.

Why filter text before indexing instead of after?

Sheet boilerplate (“REV C”, “SCALE 1:100”, title-block codes) carries real coordinates and will win a spatial join against whatever polygon it overlaps. Dropping it in Step 1 keeps junk out of the R-tree entirely, so confidence scores in Step 2 reflect only genuine room labels.
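A filter like this is cheap to verify against sample strings before it goes near a spatial index. The pattern below is an illustrative variant written for this example, in the spirit of NON_SEMANTIC_PATTERNS from Step 1; the name `BOILERPLATE` and the sample strings are assumptions.

```python
import re

# Illustrative pattern: a title-block keyword with an optional single value
# token, a bare NORTH arrow label, a date, or a drawing code such as DWG-00421.
BOILERPLATE = re.compile(
    r"^(?:"
    r"(?:REV|SCALE|DATE|DRAWN BY|CHECKED|SHEET|DWG)(?:[\s.:#]+\S+)?"
    r"|NORTH"
    r"|\d{1,3}[-/]\d{1,3}[-/]\d{2,4}"
    r"|[A-Z]{2,4}-\d{3,5}"
    r")$",
    re.IGNORECASE,
)


def is_boilerplate(text: str) -> bool:
    """True when the whole trimmed string looks like sheet metadata."""
    return BOILERPLATE.match(text.strip()) is not None


samples = ["REV C", "SCALE 1:100", "12/03/2024", "DWG-00421",
           "CONFERENCE 2.14", "CORRIDOR", "NORTH LOBBY"]
for sample in samples:
    print(f"{sample!r}: {'drop' if is_boilerplate(sample) else 'keep'}")
```

Anchoring the pattern at both ends matters: "NORTH" alone is dropped as a compass label, while "NORTH LOBBY" survives as a room name.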

What happens to a room that gets no label at all?

It still ships, named ROOM_<id> with label_confidence of 0. The quality assertion counts it toward the orphaned-room ratio, and the room is queued for review rather than blocked: a missing name should not stall publication of a navigable floor. The batch fails only when orphaned rooms exceed 5% of its records.
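One way to build that review queue is sketched below. It assumes records are plain dicts and reuses the ROOM_ prefix and the 0.4 confidence floor from the quality assertion; the function name `review_queue` and the reason strings are hypothetical.

```python
from typing import Any, Dict, List


def review_queue(
    records: List[Dict[str, Any]], confidence_floor: float = 0.4
) -> List[Dict[str, str]]:
    """List the rooms a human should check, with the reason for each."""
    queue: List[Dict[str, str]] = []
    for record in records:
        if record["name"].startswith("ROOM_"):
            queue.append({"room_id": record["room_id"], "reason": "unlabeled"})
        elif record["label_confidence"] < confidence_floor:
            queue.append({"room_id": record["room_id"], "reason": "low_confidence"})
    return queue


records = [
    {"room_id": "L1-2.14", "name": "CONFERENCE 2.14", "label_confidence": 0.91},
    {"room_id": "A37", "name": "ROOM_A37", "label_confidence": 0.0},
    {"room_id": "A40", "name": "STORE", "label_confidence": 0.3},
]
print(review_queue(records))
```

The queue is advisory in this sketch: every record still appears in the published batch, and the list only tells reviewers where to look first.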

Can this stage run before wall and door detection?

No. Attribute mapping needs closed, valid room polygons and the opening count, both of which come out of detection. Run it after detection and before the JSON schema contract, so door_count and topology-ready geometry are already in place.

This page is part of the Automated Floor Plan Parsing & Vectorization section of the Indoor Mapping & Wayfinding Automation reference.
