
Beyond the Noise: Precision GPS Trajectory Reconstruction with Pyehicle

Your GPS data is lying to you. Pyehicle reconstructs vehicle trajectories with algorithms that turn noisy, scattered GPS points into accurate, road-aligned paths. If you're building autonomous systems, analyzing fleet data, or researching mobility patterns, this library will save you weeks of frustration.

See It Work

Before any of the theory, here’s what reconstruction actually means. Drag the noise slider to crank up multipath. Click through the four stages to watch the pipeline run on a noisy trajectory, or hit Auto-play to see all four in sequence.

Interactive demo of the pyehicle reconstruction pipeline

Raw GPS — points scattered around the actual road by multipath and atmospheric noise. The line through them ignores road geometry entirely.

What you’re looking at:

  1. Raw GPS — the chaos your logger actually emits, complete with multipath jumps. Notice how three points are flung well off the road; that’s a high-rise canyon or a passing truck.
  2. Map-matched — a Hidden Markov Model with the Viterbi recursion finds the most likely sequence of road segments. Points now sit on the centerlines, but corners and curves are still under-sampled.
  3. Refined — refine_trajectory detects the 90° turn between non-adjacent road segments and inserts an intersection corner (green), reconstructed on the road graph with iGraph shortest path.
  4. Final — curve_interpolation walks the trajectory looking for 20–80° bearing changes and inserts points along the road geometry (purple). The straight sections and the already-refined corner are left alone; only the arc gets the new points.

The road network here is synthetic, but every algorithm running in your browser is what ships in pyehicle. On real OSM data it does exactly this, just with millions of points instead of fourteen, and against an actual road graph instead of a four-segment toy.

The rest of this article is the problem analysis, the math behind the pipeline, the 25-line workflow that produces results like this on real data, and the evaluation metrics you’d use to know it worked.

Your GPS Data Is Costing You Money

Stand at 42nd Street and Broadway in Manhattan, fire up a GPS logger, and stay still. Within seconds you’ll see yourself jump 40 meters east, then 60 meters west, then back again, all without moving an inch. That’s multipath error: GNSS signals bouncing off skyscrapers before they reach the receiver, producing phantom positions that violate basic physics.

Put the same receiver in a delivery van crossing Midtown at 3 PM and the trajectory looks like a drunk spider’s path. Points scatter across parallel streets. Consecutive one-second readings imply 90 km/h sprints between samples. Straight lines cut through buildings where the van obviously turned. When the driver dips into an underpass, the trace vanishes for thirty seconds and reappears 200 meters away with no continuity at all.

None of this is theoretical. It’s what your data looks like right now.

What This Costs You

If you’re:

  • Validating autonomous vehicles → if raw GPS is anywhere in your ground-truth pipeline, you’re producing false negatives in safety validation
  • Optimizing fleet routes → your algorithms optimize phantom routes that don’t match reality
  • Analyzing urban mobility → your metrics (radius of gyration, entropy) come out significantly inflated, often by a factor of two or more depending on sampling rate and environment
  • Billing by distance → you’re either overcharging customers or losing revenue

You’ve probably been treating this as a data quality issue. It isn’t. The signal is fundamentally corrupted by physics, and no amount of filtering will fix that.

What you need is reconstruction, not cleaning.

What Pyehicle Does

Before (raw GPS):

  • Points scattered 20-50m off roads
  • Gaps from signal loss in tunnels and parking garages
  • Straight lines cutting through buildings
  • Trajectories that violate road network topology

After (Pyehicle):

  • Points sitting on road centerlines
  • Gaps filled with map-constrained interpolation
  • Paths that follow actual road network geometry
  • Intersection topology and turn constraints respected

All in seven steps and roughly 25 lines of Python.

Why Your First Three Attempts Will Fail

Attempt 1: Nearest-Road Snapping

# Naive approach: Snap each point to nearest road
for point in gps_trajectory:
    point.snap_to_nearest_road()

Failure mode: Snapping each point on its own produces topologically impossible paths. Point t lands on eastbound I-95, point t+1 on westbound I-95. The vehicle didn’t teleport across a Jersey barrier; your algorithm did.

Root cause: This is a sequence labeling problem and you’re solving it pointwise. The correct road at t depends on the road at t-1 and the transition probability between them.
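A toy sketch (plain Python, not the pyehicle API) makes the failure concrete: model the two carriageways as cross-track offsets and snap each noisy fix to the nearest one, independently.

```python
# Toy illustration (not pyehicle's API): two parallel carriageways 10 m apart.
# Snapping each noisy point on its own ignores the sequence, so multipath
# noise makes the assignment flip between directions of travel.
ROADS = {"eastbound": 0.0, "westbound": 10.0}  # cross-track offsets in metres

def snap_pointwise(y_offsets):
    """Assign each point independently to the nearest carriageway."""
    return [min(ROADS, key=lambda r: abs(y - ROADS[r])) for y in y_offsets]

# A vehicle driving eastbound, with ~6 m of multipath noise on the
# cross-track axis
labels = snap_pointwise([1.2, 7.8, 2.5, 8.9, 0.4, 6.1])
print(labels)
# The assignment oscillates between carriageways even though the vehicle
# never changed roads: a topologically impossible path.
```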

Attempt 2: Moving Average Smoothing

# Smooth with rolling window
smoothed_lat = trajectory['lat'].rolling(window=5).mean()

Failure mode: When multipath puts your points on both sides of a street, the mean lands in the median strip. A Kalman filter would reject those outliers as kinematically implausible; a moving average just averages them in.

Root cause: GPS noise isn’t Gaussian. Multipath produces bimodal distributions, atmospheric delays produce long tails, and a simple convolution assumes i.i.d. Gaussian noise. That assumption doesn’t hold here.
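A sketch of that effect with made-up numbers: alternate the cross-track error between the two sides of the street and watch the rolling mean settle in the middle, where no lane exists.

```python
# Toy illustration (not pyehicle's API): multipath puts the cross-track error
# on BOTH sides of the street, so the noise is bimodal, not zero-mean
# Gaussian around the true position.
def rolling_mean(values, window=5):
    """Trailing moving average, shaped like pandas' rolling(...).mean()."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)                      # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

# True cross-track position: -7 m (on the eastbound lane). Multipath
# reflections alternate the fix between -7 m and +7 m.
smoothed = rolling_mean([-7, 7, -7, 7, -7, 7, -7, 7], window=5)
print(smoothed)
# The averaged points land near 0 m: the median strip.
```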

Attempt 3: Linear Interpolation Through Gaps

# Fill gaps with straight lines
trajectory.interpolate(method='linear')

Failure mode: A vehicle that enters a tunnel at point A (lat₁, lon₁) and emerges at point B (lat₂, lon₂) didn’t travel the great-circle path between them. It followed roads. If there’s a river in between, your interpolation just drove through the water.

Root cause: You need map-constrained interpolation. The interpolated path has to lie on the road network graph, not in ℝ².
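In graph terms, map-constrained interpolation means the gap fill is a shortest path over road edges, so it can only go where roads go. A toy sketch (hypothetical node names, not pyehicle's internals):

```python
# Toy illustration: fill a GPS gap on the road graph instead of in straight-
# line (lat, lon) space. Edges are road segments with lengths in metres; the
# straight chord from A to D would cross a river the graph routes around.
import heapq

GRAPH = {  # node -> [(neighbour, segment_length_m)]
    "A": [("B", 300)],
    "B": [("A", 300), ("bridge", 150)],
    "bridge": [("B", 150), ("C", 150)],
    "C": [("bridge", 150), ("D", 300)],
    "D": [("C", 300)],
}

def shortest_path(graph, src, dst):
    """Dijkstra over the road graph: the gap fill must follow these edges."""
    heap, seen = [(0, src, [src])], set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path, cost
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph[node]:
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return None, float("inf")

path, metres = shortest_path(GRAPH, "A", "D")
print(path, metres)   # interpolated points get placed along this path
```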

The Mathematical Foundation: What Reconstruction Actually Means

At its core, GPS trajectory reconstruction is map-matching: aligning noisy GPS observations to the road network topology underneath them. The formulation below follows Newson & Krumm’s 2009 HMM map-matching paper, which is also the basis for Valhalla’s Meili.

Map-Matching via Hidden Markov Model

Given:

  • Observations: z₁, z₂, …, zₙ (GPS lat/lon measurements)
  • Hidden states: s₁, s₂, …, sₙ (true road segments from graph G)

Find: argmax P(s₁:ₙ | z₁:ₙ)

The posterior factors as a hidden Markov model. Since the goal is the single most likely sequence, the Viterbi recursion takes a max rather than a sum over previous states:

    δₜ(sₜ) ∝ P(zₜ | sₜ) · max_{sₜ₋₁} [ P(sₜ | sₜ₋₁) · δₜ₋₁(sₜ₋₁) ]

where:

  • P(zₜ | sₜ) is the emission probability (how well GPS point zₜ fits road state sₜ: perpendicular projection error, GPS-to-road distance)
  • P(sₜ | sₜ₋₁) is the transition probability (route distance between the two candidate roads, weighted by network connectivity; zero if the segments don’t connect in the graph)
  • δₜ₋₁(sₜ₋₁) is the best score for reaching state sₜ₋₁ at time t-1, computed recursively

The recurrence is solved by dynamic programming in O(n·m²), where m is the number of candidate roads per observation.

Why it works: transition probabilities encode the network’s topological constraints. If road segments A and B aren’t connected in the graph, P(sₜ=B | sₜ₋₁=A) = 0, which rules out the impossible jumps naive snapping happily accepts.
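The recurrence fits in a few lines. This is a toy Viterbi over made-up probabilities, not the pyehicle implementation, but the zero transition plays exactly the role just described:

```python
# Minimal Viterbi sketch over the recurrence above (toy numbers, not the
# pyehicle implementation). States are candidate road segments per fix;
# trans[a][b] == 0.0 encodes "a and b are not connected in the graph".
def viterbi(emission, trans):
    """emission[t][s] = P(z_t | s); trans[a][b] = P(s_t = b | s_{t-1} = a)."""
    n_states = len(emission[0])
    delta = list(emission[0])              # delta_1(s) ∝ P(z_1 | s)
    back = []                              # argmax pointers per step
    for t in range(1, len(emission)):
        new_delta, ptr = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: delta[p] * trans[p][s])
            new_delta.append(emission[t][s] * delta[best] * trans[best][s])
            ptr.append(best)
        delta = new_delta
        back.append(ptr)
    # Backtrack the most likely state sequence from the best final state
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Two candidate roads. Road 1 briefly "looks" closer at t=1 (multipath), but
# the two segments are disconnected, so the impossible hop is ruled out.
emission = [[0.9, 0.1], [0.3, 0.7], [0.8, 0.2]]
trans = [[1.0, 0.0], [0.0, 1.0]]           # disconnected segments
print(viterbi(emission, trans))            # stays on road 0 throughout
```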

After map-matching, pyehicle adds geometric refinement:

  • Trajectory refinement: inserts intersection corner points where the path transitions between road segments
  • Curve interpolation: adds points along curved roads based on bearing changes

Together these make sure the final trajectory follows real road geometry, not just snapped sample points.

The Solution: Pyehicle Does the Heavy Lifting

What you actually wanted was a library that takes messy GPS data and hands back clean, road-aligned trajectories. No PhD. No three-month detour into debugging your own HMM. Just results.

Pyehicle implements the framework above end to end:

  • Map-matching via Hidden Markov Models with the Viterbi algorithm, using either Leuven (offline, HMM with non-emitting states) or Valhalla’s Meili (over Docker)
  • Geometric refinement that inserts intersection corner points and curve interpolations along the actual road geometry
  • Geodesic math throughout — pyproj on the WGS84 ellipsoid, with local AEQD projections for metric accuracy
  • Offline-capable — Leuven works against downloaded OSM PBF files; Overpass queries are cached
  • Scales to millions of points via overlapping chunked processing
  • Optional Kalman filtering with EM parameter learning and an RTS smoother
  • Built-in evaluation metrics — F1, Precision, Recall, and RMF against ground-truth road segments, so you can measure how well a reconstruction actually matched reality

More importantly, it works on real-world data.

Architecture

Reconstruction Pipeline:

graph LR
    A["Raw GPS Data
(Noisy, Inaccurate)"] --> B["Preprocessing
(Map-Matching)"]
    B --> C["Reconstruction
(Combine, Refine, Curve Interp)"]
    C --> OP["Optional Post-processing
(Compress, Kalman Filter)"]
    OP --> D["Final Trajectory
(Accurate, Road-aligned)"]
    classDef rawNode fill:#f5f5f5,stroke:#757575,stroke-width:2px,color:#212121;
    classDef procNode fill:#e8dff5,stroke:#7e57c2,stroke-width:2px,color:#311b92;
    classDef finalNode fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#01579b;
    classDef optNode fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px,stroke-dasharray: 5 5,color:#4a148c;
    class A rawNode;
    class B,C procNode;
    class OP optNode;
    class D finalNode;

Key Design Decisions

Geodesic mathematics everywhere. Every distance goes through pyproj on the WGS84 ellipsoid. Plain Euclidean distance in (lat, lon) space breaks down at high latitudes, since 1° longitude only equals 1° latitude at the equator.
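A back-of-the-envelope check of why that matters (spherical Earth, so only approximate; pyehicle itself uses pyproj on the ellipsoid):

```python
# One degree of longitude shrinks with cos(latitude), so Euclidean distance
# in raw (lat, lon) degrees systematically overstates east-west distances
# away from the equator. Spherical sketch, R = 6371 km.
import math

def metres_per_degree_lon(lat_deg, radius_m=6_371_000):
    return math.radians(1) * radius_m * math.cos(math.radians(lat_deg))

print(round(metres_per_degree_lon(0)))    # equator: ~111 km per degree
print(round(metres_per_degree_lon(60)))   # 60°N: half that, ~55.6 km
```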

AEQD projection for local operations. Azimuthal Equidistant projection, centered on the trajectory centroid, preserves radial distances from the centroid exactly. For trajectories within roughly 1000 km of the center, edge-to-edge distortion stays under a meter, which is below the noise floor of consumer GPS.
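The radial-distance property is easy to see from the forward projection itself. A spherical sketch (pyehicle uses pyproj on the WGS84 ellipsoid, but the idea is the same): each point maps to distance-from-centre times the sine and cosine of its azimuth from the centre.

```python
# AEQD in one picture: project points so that distance FROM THE CENTRE is
# preserved exactly. Spherical forward projection, R = 6371 km.
import math

R = 6_371_000  # spherical Earth radius, metres

def aeqd_forward(lat0, lon0, lat, lon):
    """Project (lat, lon) to metres (x_east, y_north) about (lat0, lon0)."""
    phi0, lam0 = math.radians(lat0), math.radians(lon0)
    phi, lam = math.radians(lat), math.radians(lon)
    # Great-circle angular distance and forward azimuth from the centre
    c = math.acos(math.sin(phi0) * math.sin(phi)
                  + math.cos(phi0) * math.cos(phi) * math.cos(lam - lam0))
    az = math.atan2(math.sin(lam - lam0) * math.cos(phi),
                    math.cos(phi0) * math.sin(phi)
                    - math.sin(phi0) * math.cos(phi) * math.cos(lam - lam0))
    return R * c * math.sin(az), R * c * math.cos(az)

# Hypothetical centroid near Riga; a fix a couple of kilometres away
x, y = aeqd_forward(56.95, 24.1, 56.97, 24.15)
print(round(math.hypot(x, y), 1))
# hypot(x, y) == R*c by construction: the radial distance is exact.
```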

Offline-capable. No API calls are required (Overpass API is cached), so you can run this on classified networks or on a plane.

Chunked processing. Million-point trajectories are handled in overlapping chunks, with configurable chunk size and overlap to keep edge artifacts out.
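The chunking idea, sketched with made-up sizes (pyehicle's actual defaults and merge logic may differ): each window overlaps the previous one, so every point is map-matched with context, and the duplicated halo is dropped when chunks are merged.

```python
# Sketch of overlapping chunk windows over a long trajectory. The overlap
# region gives the HMM context at chunk boundaries, avoiding edge artifacts.
def overlapping_chunks(n_points, chunk_size=10_000, overlap=500):
    """Yield (start, end) index windows covering 0..n_points with overlap."""
    start = 0
    while start < n_points:
        end = min(start + chunk_size, n_points)
        yield start, end
        if end == n_points:
            break
        start = end - overlap   # step back so adjacent chunks share points

chunks = list(overlapping_chunks(25_000, chunk_size=10_000, overlap=500))
print(chunks)
# → [(0, 10000), (9500, 19500), (19000, 25000)]
```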

See It In Action: From Messy CSV to Production-Ready Trajectory in 7 Steps

Enough documentation. Let’s clean some actual data.

What follows is a real workflow that turns 24,573 noisy GPS points into a clean, road-aligned trajectory. Around 25 lines of code, about two minutes of runtime once the OSM cache is warm.

Step 0: Installation

pip install pyehicle

Step 1: Load Raw GPS Data

import pandas as pd
import pyehicle as pye

# Load CSV with columns: lat, lon, time
df = pd.read_csv('trajectory.csv')
print(f"Raw: {len(df)} points")
# Output: Raw: 24,573 points

Step 2: Preprocessing - Map-Matching

Align GPS points to road network centerlines using a Hidden Markov Model (HMM).

# Map-match using Meili (Valhalla API - requires Docker)
# Or use Leuven for offline HMM-based matching
matched = pye.preprocessing.meili(
    df,
    lat_col='lat',
    lon_col='lon',
    time_col='time',
)
print(f"Matched: {len(matched)} points on road network")

What happened:

  1. The GPS trajectory was sent to Valhalla’s Meili service (or Leuven, for offline use)
  2. HMM probabilistic reasoning picked the most likely road sequence
  3. Points were snapped to the matched road centerlines
  4. The output includes road network metadata (OSM IDs, geometries)

Alternative (Offline): Use pye.preprocessing.leuven() with downloaded OSM PBF files. No server required.

Step 3: Load Road Network

Load the OpenStreetMap road network for the reconstruction steps that follow.

# Load from PBF file with caching for faster subsequent loads
road_network, geometries, spatial_index = pye.utilities.load_road_network(
    pbf_file_path='map.osm.pbf',
    bbox=(56.9, 57.0, 24.0, 24.2),  # (north, south, east, west)
    save_path='network_cache.graphml'  # Cache for reuse
)
print(f"Loaded {len(road_network.vs)} nodes, {len(road_network.es)} edges")
# Output: Loaded 847,293 nodes, 1,203,847 edges

What happened:

  1. The OSM PBF file was parsed to extract the road network (first run only)
  2. An iGraph graph structure was built from nodes and edges
  3. An R-tree spatial index was created for fast nearest-neighbor queries
  4. The result was cached as GraphML, giving 50-100x faster loads afterwards

First run: ~180 seconds (parsing the PBF)
Subsequent runs: ~4 seconds (loading the GraphML cache)

Step 4: Reconstruction - Trajectory Refinement

Enforce road network continuity at intersections by interpolating corner points.

refined = pye.reconstructing.refine_trajectory(
    matched,
    road_network=road_network,
    max_node_distance=10,  # Snap within 10m of nodes
    time_col='time'
)
print(f"Refined: {len(refined)} points (added intersection corners)")
# Output: Refined: 6,247 points (added 158 intersection corners)

What happened: when consecutive GPS points map to non-adjacent road segments (say, point t on 5th Ave and point t+1 on E 34th St), the algorithm:

  1. Finds the nearest graph nodes for each point
  2. Computes the shortest path between those nodes through iGraph
  3. Interpolates corner points along the intersection geometry
  4. Distributes timestamps proportionally

Why it matters: without this step, the trajectory has discontinuities at every intersection. With it, the path follows the real road network through each turn.
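Step 4 of the list above in isolation (a hypothetical helper, not the pyehicle API): each inserted corner point gets a timestamp proportional to the cumulative distance travelled between the two original fixes.

```python
# Timestamp distribution for inserted points: linear in distance along the
# reconstructed path, anchored at the two original GPS fixes.
def distribute_timestamps(t_start, t_end, cum_dists):
    """cum_dists: cumulative metres at each inserted point, from 0 to total."""
    total = cum_dists[-1]
    return [t_start + (t_end - t_start) * d / total for d in cum_dists]

# Two fixes 10 s apart; corner points at 0 m, 40 m and 100 m along the path
print(distribute_timestamps(100.0, 110.0, [0.0, 40.0, 100.0]))
# → [100.0, 104.0, 110.0]
```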

Step 5: Reconstruction - Curve Interpolation

Add points along road curves based on bearing changes.

final = pye.reconstructing.curve_interpolation(
    refined,
    road_network=road_network,
    lower_threshold=20,   # Interpolate for bearing changes ≥20°
    upper_threshold=80,   # But not sharp turns ≥80° (already refined)
    max_node_distance=10,
    time_col='time'
)
print(f"Final: {len(final)} points")
# Output: Final: 6,891 points

What happened: the algorithm found 127 locations where bearing changed by 20-80° (highway curves, roundabouts, gradual turns) and interpolated points along the road geometry at each one so the curve reads smoothly in the output.
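The bearing-change test itself is simple geometry. A plain-Python sketch (spherical bearing formula, thresholds as above; the detection helper is hypothetical, not pyehicle's internals):

```python
# Flag "gentle" turns: heading changes in the [lower, upper) degree band,
# which is where curve interpolation adds points along the road geometry.
import math

def bearing(p, q):
    """Initial compass bearing from p to q, both (lat, lon) in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dlam = math.radians(q[1] - p[1])
    x = math.sin(dlam) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlam))
    return math.degrees(math.atan2(x, y)) % 360

def gentle_turns(points, lower=20, upper=80):
    """Indices of middle points whose heading change is in [lower, upper)."""
    hits = []
    for i in range(1, len(points) - 1):
        change = abs(bearing(points[i - 1], points[i])
                     - bearing(points[i], points[i + 1]))
        change = min(change, 360 - change)   # wrap to [0, 180]
        if lower <= change < upper:
            hits.append(i)
    return hits

# Due east, then a gentle bend to the north-east, then straight again
pts = [(57.00, 24.00), (57.00, 24.01), (57.00, 24.02),
       (57.01, 24.03), (57.02, 24.04)]
print(gentle_turns(pts))   # only the bend qualifies for interpolation
```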

Step 6: Optional Post-processing

Apply Kalman filtering and/or compression for a final pass of cleanup (optional).

# Optional: Apply Kalman filtering for smoothing
filtered = pye.preprocessing.kalman_aeqd_filter(
    final,
    lat_col='lat',
    lon_col='lon',
    time_col='time',
    em_iters=3
)

# Optional: Compress to reduce file size
final_trajectory = pye.preprocessing.spatio_temporal_compress(
    filtered,
    spatial_radius_km=0.01,
    time_threshold_s=30
)
print(f"Final trajectory: {len(final_trajectory)} points")

Why optional: these steps can refine the trajectory further, but whether you need them depends on your use case.

Step 7: Visualize and Save

# Save result
final.to_csv('fleet_vehicle_2024_03_15_reconstructed.csv', index=False)

# Validate with visualization
pye.utilities.visualization.single(
    final,
    name='Reconstructed Fleet Vehicle Trajectory',
    show_in_browser=True
)

Result: you now have a trajectory that:

  • Sits on the road network (no points stranded in buildings)
  • Respects topology (no impossible transitions)
  • Has kinematic consistency (smooth velocities, no teleportation)
  • Captures intersection and curve geometry

If you have a ground-truth trajectory (from a survey-grade receiver, manual labeling, or a higher-fidelity source), pyehicle ships the four standard map-matching evaluation metrics:

f1        = pye.utilities.f1(final, ground_truth)
precision = pye.utilities.precision(final, ground_truth)
recall    = pye.utilities.recall(final, ground_truth)
rmf       = pye.utilities.rmf(final, ground_truth)

print(f"F1: {f1:.3f}  P: {precision:.3f}  R: {recall:.3f}  RMF: {rmf:.3f}")

The metrics fetch the actual road segments from Overpass for a faithful comparison:

  • Precision — common matched length ÷ reconstructed length
  • Recall — common matched length ÷ ground-truth length
  • F1 — harmonic mean of the two
  • RMF (Route Mismatch Fraction) — proportion of the trajectory that does not match the ground truth
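Given the common, reconstructed, and ground-truth lengths, all four metrics reduce to a few lines of bookkeeping. A sketch with made-up lengths, using the Newson & Krumm definition of RMF (pyehicle's exact accounting against Overpass geometry may differ):

```python
# The four map-matching metrics from three length measurements (metres).
def mapmatch_metrics(common_m, reconstructed_m, truth_m):
    precision = common_m / reconstructed_m
    recall = common_m / truth_m
    f1 = 2 * precision * recall / (precision + recall)
    # Route Mismatch Fraction (Newson & Krumm 2009): erroneously added plus
    # erroneously missed length, as a fraction of the ground-truth length
    rmf = ((reconstructed_m - common_m) + (truth_m - common_m)) / truth_m
    return precision, recall, f1, rmf

# 9.2 km matched in common, 9.8 km reconstructed, 10.0 km ground truth
p, r, f1, rmf = mapmatch_metrics(9_200, 9_800, 10_000)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f} RMF={rmf:.3f}")
```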

A reasonable production target is F1 > 0.90 on urban-canyon data; F1 < 0.75 usually means the input is too sparse, the bounding box is too tight, or the map-matcher search radius needs widening.

When NOT to Use Pyehicle

Pyehicle is built for road-bound vehicle trajectories. It’s the wrong tool for:

  • Off-road movement — hiking, drone flights, ship tracks in open water. There’s no road network to match against.
  • Indoor positioning — Wi-Fi/Bluetooth/UWB fingerprints don’t live on OSM roads.
  • Extremely sparse trajectories — fewer than one point per few hundred meters. Leuven’s non-emitting states help, but at some point HMM map-matching simply runs out of signal.
  • Cellular/CDR data — antenna-level resolution is coarser than road geometry. Reach for scikit-mobility instead.
  • Pedestrian movement — sidewalk geometry isn’t reliably captured in OSM as a routable graph.

For everything else — vehicle GPS, phone GPS in a car, fleet tracker pings — it’s exactly what it’s designed for.

Should You Use Pyehicle?

Short answer: if your GPS data comes from real-world sensors (vehicles, phones, trackers), reach for pyehicle first. Then bring in other tools for analysis.

The Python mobility analytics ecosystem has specialized tools for different stages of the pipeline:

| Library | Primary Function | Input Assumption | Use It When… |
| --- | --- | --- | --- |
| pyehicle | Map-matching and trajectory reconstruction on road networks | Raw, noisy GPS with multipath, gaps, outliers | Your data comes from real-world GPS loggers (vehicles, phones, asset trackers) and has significant noise |
| MovingPandas | Trajectory visualization, segmentation, feature engineering | Clean trajectories with accurate positions | You’ve already cleaned the data (via pyehicle or differential GPS) and need to split by stops, compute speeds, or visualize |
| scikit-mobility | Human mobility metrics (entropy, radius of gyration), flow analysis | Clean trajectories from any source: GPS, check-ins, CDR | You’re analyzing mobility patterns at population scale and need privacy-preserving metrics or flow matrices |

The Power Move: Combining All Three Libraries

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from datetime import timedelta

import pyehicle as pye
import movingpandas as mpd
import skmob
from skmob.measures.individual import radius_of_gyration

# Stage 1: Reconstruction (pyehicle)
raw = pd.read_csv('gps_logger.csv')
clean = pye.preprocessing.spatio_temporal_compress(raw)
clean = pye.preprocessing.leuven(clean)
clean = pye.preprocessing.kalman_aeqd_filter(clean)

# Stage 2: Analysis (MovingPandas)
# Trajectory needs a GeoDataFrame with a geometry column
gdf = gpd.GeoDataFrame(
    clean,
    geometry=[Point(xy) for xy in zip(clean.lon, clean.lat)],
    crs='EPSG:4326',
).set_index('time')
traj = mpd.Trajectory(gdf, traj_id=1)
traj.add_speed()

# Stops are detected with TrajectoryStopDetector
stops = mpd.TrajectoryStopDetector(traj).get_stop_segments(
    min_duration=timedelta(seconds=60),
    max_diameter=50,
)

# Stage 3: Metrics (scikit-mobility)
skmob_traj = skmob.TrajDataFrame(
    clean, latitude='lat', longitude='lon', datetime='time'
)
rg_df = radius_of_gyration(skmob_traj)

The libraries are complementary, not competitive. Pyehicle handles the physics problem (noisy GNSS signals). MovingPandas handles the feature engineering problem (what is this trajectory doing?). Scikit-mobility handles the statistical problem (what does this population’s mobility look like?).

Get Started in 30 Seconds

Stop fighting with noisy GPS data. Install pyehicle and get back to building.

pip install pyehicle

Next Steps

  1. 📖 Read the docs: pyehicle.readthedocs.io — Full API reference with examples
  2. 💻 Check the code: github.com/Nick-Doulos/pyehicle — MIT licensed, contributions welcome
  3. 🐛 Report issues: github.com/Nick-Doulos/pyehicle/issues — Found a bug? Let me know
  4. 🗺️ Get OSM data: download.geofabrik.de — Free OpenStreetMap extracts by region

Using This in Research?

@software{pyehicle2025,
  title = {Pyehicle: A Python Library for GPS Trajectory Processing and Reconstruction},
  author = {Doulos, Nick},
  year = {2025},
  version = {0.1.0},
  url = {https://github.com/Nick-Doulos/pyehicle}
}

If pyehicle saved you time, consider starring the repo on GitHub. It helps others find it.

© nickdoulos.com by Nick Doulos is licensed under CC BY 4.0