See It Work
Before any of the theory, here’s what reconstruction actually means. Drag the noise slider to crank up multipath. Click through the four stages to watch the pipeline run on a noisy trajectory, or hit Auto-play to see all four in sequence.
Interactive demo of the pyehicle reconstruction pipeline
What you’re looking at:
- Raw GPS — the chaos your logger actually emits, complete with multipath jumps. Notice how three points are flung well off the road; that’s a high-rise canyon or a passing truck.
- Map-matched — a Hidden Markov Model with the Viterbi recursion finds the most likely sequence of road segments. Points now sit on the centerlines, but corners and curves are still under-sampled.
- Refined — refine_trajectory detects the 90° turn between non-adjacent road segments and inserts an intersection corner (green), reconstructed on the road graph with an iGraph shortest path.
- Final — curve_interpolation walks the trajectory looking for 20–80° bearing changes and inserts points along the road geometry (purple). The straight sections and the already-refined corner are left alone; only the arc gets the new points.
The road network here is synthetic, but every algorithm running in your browser is what ships in pyehicle. On real OSM data it does exactly this, just with millions of points instead of fourteen, and against an actual road graph instead of a four-segment toy.
The rest of this article is the problem analysis, the math behind the pipeline, the 25-line workflow that produces results like this on real data, and the evaluation metrics you’d use to know it worked.
Your GPS Data Is Costing You Money
Stand at 42nd Street and Broadway in Manhattan, fire up a GPS logger, and stay still. Within seconds you’ll see yourself jump 40 meters east, then 60 meters west, then back again, all without moving an inch. That’s multipath error: GNSS signals bouncing off skyscrapers before they reach the receiver, producing phantom positions that violate basic physics.
Put the same receiver in a delivery van crossing Midtown at 3 PM and the trajectory looks like a drunk spider’s path. Points scatter across parallel streets. Consecutive one-second readings imply 90 km/h sprints between samples. Straight lines cut through buildings where the van obviously turned. When the driver dips into an underpass, the trace vanishes for thirty seconds and reappears 200 meters away with no continuity at all.
None of this is theoretical. It’s what your data looks like right now.
What This Costs You
If you’re:
- Validating autonomous vehicles → if raw GPS is anywhere in your ground-truth pipeline, you’re producing false negatives in safety validation
- Optimizing fleet routes → your algorithms optimize phantom routes that don’t match reality
- Analyzing urban mobility → your metrics (radius of gyration, entropy) come out significantly inflated, often by a factor of two or more depending on sampling rate and environment
- Billing by distance → you’re either overcharging customers or losing revenue
You’ve probably been treating this as a data quality issue. It isn’t. The signal is fundamentally corrupted by physics, and no amount of filtering will fix that.
What you need is reconstruction, not cleaning.
What Pyehicle Does
Before (raw GPS):
- Points scattered 20-50m off roads
- Gaps from signal loss in tunnels and parking garages
- Straight lines cutting through buildings
- Trajectories that violate road network topology
After (Pyehicle):
- Points sitting on road centerlines
- Gaps filled with map-constrained interpolation
- Paths that follow actual road network geometry
- Intersection topology and turn constraints respected
All in roughly seven lines of Python.
Why Your First Three Attempts Will Fail
Attempt 1: Nearest-Road Snapping
# Naive approach: snap each point to the nearest road
for point in gps_trajectory:
    point.snap_to_nearest_road()
Failure mode: Snapping each point on its own produces topologically impossible paths. Point t lands on eastbound I-95, point t+1 on westbound I-95. The vehicle didn’t teleport across a Jersey barrier; your algorithm did.
Root cause: This is a sequence labeling problem and you’re solving it pointwise. The correct road at t depends on the road at t-1 and the transition probability between them.
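A toy sketch makes the failure concrete (the segment names and distances here are hypothetical, not pyehicle code): pointwise argmin happily picks a disconnected pair of segments, while a sequence-aware choice restricted to connected transitions cannot.

```python
# Toy illustration (hypothetical names and distances, not pyehicle code).
# Two consecutive GPS fixes near a divided highway; values are distances in
# metres from each fix to its two candidate segments.
candidates = [
    {"I95_east": 4.0, "I95_west": 6.0},   # fix at t
    {"I95_east": 7.0, "I95_west": 3.0},   # fix at t+1
]
# You cannot cross the Jersey barrier: only same-direction transitions connect.
connected = {("I95_east", "I95_east"), ("I95_west", "I95_west")}

# Pointwise snapping: pick the nearest segment for each fix independently.
pointwise = [min(c, key=c.get) for c in candidates]
impossible = tuple(pointwise) not in connected   # True: east -> west teleport

# Sequence-aware choice: minimise total distance over connected pairs only.
best = min(
    ((a, b) for a in candidates[0] for b in candidates[1] if (a, b) in connected),
    key=lambda p: candidates[0][p[0]] + candidates[1][p[1]],
)
# best == ('I95_west', 'I95_west'): slightly farther per point, but physically possible
```

The sequence-aware answer is worse pointwise (6 m + 3 m instead of 4 m + 3 m) and still correct, which is exactly why the problem cannot be solved one point at a time.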
Attempt 2: Moving Average Smoothing
# Smooth with rolling window
smoothed_lat = trajectory['lat'].rolling(window=5).mean()
Failure mode: When multipath puts your points on both sides of a street, the mean lands in the median strip. A Kalman filter would throw the outliers out based on velocity constraints. A moving average just averages them in.
Root cause: GPS noise isn’t Gaussian. Multipath produces bimodal distributions, atmospheric delays produce long tails, and a simple convolution assumes i.i.d. Gaussian noise. That assumption doesn’t hold here.
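A few lines with made-up offsets show the bimodal failure directly: fixes scattered to both sides of a divided street average out to the median strip.

```python
from statistics import mean

# Made-up multipath sample: the van is in a lane roughly 20 m to one side of
# the street's median strip, but reflections scatter the fixes across BOTH
# carriageways.
offsets_m = [-21.0, -20.0, -19.0, 19.0, 20.0, 21.0]

# The rolling-mean "fix": the window average lands at 0 m, i.e. in the
# median strip, a position the vehicle never occupied.
window_mean = mean(offsets_m)   # 0.0

# The distribution is bimodal, so no single central estimate is correct;
# only a model with motion constraints can reject one mode outright.
```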
Attempt 3: Linear Interpolation Through Gaps
# Fill gaps with straight lines
trajectory.interpolate(method='linear')
Failure mode: A vehicle that enters a tunnel at point A (lat₁, lon₁) and emerges at point B (lat₂, lon₂) didn’t travel the great-circle path between them. It followed roads. If there’s a river in between, your interpolation just drove through the water.
Root cause: You need map-constrained interpolation. The interpolated path has to lie on the road network graph, not in ℝ².
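To see what "interpolation on the graph" means, here is a minimal sketch with a toy four-node road graph and a plain Dijkstra shortest path (pyehicle does the equivalent against the full OSM graph via iGraph; node names and lengths here are invented):

```python
import heapq

# Toy road graph: nodes are intersections, edge weights are segment lengths
# in metres. A river separates n1 and n4 directly; the road detours via n2, n3.
graph = {
    "n1": {"n2": 120.0},
    "n2": {"n1": 120.0, "n3": 80.0},
    "n3": {"n2": 80.0, "n4": 150.0},
    "n4": {"n3": 150.0},
}

def shortest_path(graph, src, dst):
    """Dijkstra: return the node sequence of the cheapest road path."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

# The gap spans tunnel entry (n1) to exit (n4): interpolate along the road
# graph, not along the straight line through the river.
fill = shortest_path(graph, "n1", "n4")   # ['n1', 'n2', 'n3', 'n4']
```

New points then get placed along this node sequence and timestamped proportionally to distance, so the filled gap stays on asphalt.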
The Mathematical Foundation: What Reconstruction Actually Means
At its core, GPS trajectory reconstruction is map-matching: aligning noisy GPS observations to the road network topology underneath them. The formulation below follows Newson & Krumm’s 2009 HMM map-matching paper, which is also the basis for Valhalla’s Meili.
Map-Matching via Hidden Markov Model
Given:
- Observations: z₁, z₂, …, zₙ (GPS lat/lon measurements)
- Hidden states: s₁, s₂, …, sₙ (true road segments from graph G)
Find: argmax P(s₁:ₙ | z₁:ₙ)
This posterior factors as a Hidden Markov Model. Since the goal is the single most likely sequence, the Viterbi recursion takes a max rather than a sum over previous states:
δₜ(sₜ) ∝ P(zₜ | sₜ) × max_{sₜ₋₁} [ P(sₜ | sₜ₋₁) × δₜ₋₁(sₜ₋₁) ]
where:
- P(zₜ | sₜ) is the emission probability: how well GPS point zₜ fits road state sₜ (perpendicular projection error, GPS-to-road distance)
- P(sₜ | sₜ₋₁) is the transition probability: route distance between the two candidate roads, weighted by network connectivity; zero if the segments don’t connect in the graph
- δₜ₋₁(sₜ₋₁) is the best score for reaching state sₜ₋₁ at time t−1, computed recursively
The recurrence is solved by dynamic programming in O(n·m²), where m is the number of candidate roads per observation.
Why it works: transition probabilities encode the network’s topological constraints. If road segments A and B aren’t connected in the graph, P(sₜ=B | sₜ₋₁=A) = 0, which rules out the impossible jumps naive snapping happily accepts.
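The recursion fits in a few lines. A minimal Viterbi sketch with toy emission and transition numbers (illustrative, not pyehicle internals) shows a zero transition probability forcing the sequence to stay on connected roads even when one observation points elsewhere:

```python
# Minimal Viterbi sketch matching the recursion above (toy probabilities,
# not pyehicle internals). emission[t][s] plays P(z_t | s_t); transition[(a, b)]
# plays P(s_t = b | s_{t-1} = a), and is zero when a and b are not connected.
states = ["A", "B"]
emission = [
    {"A": 0.8, "B": 0.2},   # t=0: fix sits close to segment A
    {"A": 0.3, "B": 0.7},   # t=1: a multipath jump pulls the fix toward B
    {"A": 0.9, "B": 0.1},   # t=2: firmly back on A
]
transition = {("A", "A"): 0.9, ("A", "B"): 0.0,   # A and B are not connected
              ("B", "A"): 0.0, ("B", "B"): 0.9}

def viterbi(states, emission, transition):
    delta = {s: emission[0][s] for s in states}   # best score ending in s
    backpointers = []
    for t in range(1, len(emission)):
        back, new_delta = {}, {}
        for s in states:
            prev = max(states, key=lambda p: transition[(p, s)] * delta[p])
            new_delta[s] = emission[t][s] * transition[(prev, s)] * delta[prev]
            back[s] = prev
        backpointers.append(back)
        delta = new_delta
    # Backtrack from the best final state.
    path = [max(states, key=delta.get)]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    return path[::-1]

# Pointwise argmax would put t=1 on B; the zero A->B transition keeps the
# whole sequence on A instead.
best_path = viterbi(states, emission, transition)   # ['A', 'A', 'A']
```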
After map-matching, pyehicle adds geometric refinement:
- Trajectory refinement: inserts intersection corner points where the path transitions between road segments
- Curve interpolation: adds points along curved roads based on bearing changes
Together these make sure the final trajectory follows real road geometry, not just snapped sample points.
The Solution: Pyehicle Does the Heavy Lifting
What you actually wanted was a library that takes messy GPS data and hands back clean, road-aligned trajectories. No PhD. No three-month detour into debugging your own HMM. Just results.
Pyehicle implements the framework above end to end:
- Map-matching via Hidden Markov Models with the Viterbi algorithm, using either Leuven (offline, HMM with non-emitting states) or Valhalla’s Meili (over Docker)
- Geometric refinement that inserts intersection corner points and curve interpolations along the actual road geometry
- Geodesic math throughout — pyproj on the WGS84 ellipsoid, with local AEQD projections for metric accuracy
- Offline-capable — Leuven works against downloaded OSM PBF files; Overpass queries are cached
- Scales to millions of points via overlapping chunked processing
- Optional Kalman filtering with EM parameter learning and an RTS smoother
- Built-in evaluation metrics — F1, Precision, Recall, and RMF against ground-truth road segments, so you can measure how well a reconstruction actually matched reality
More importantly, it works on real-world data.
Architecture
Reconstruction Pipeline:
graph LR
A["Raw GPS Data
(Noisy, Inaccurate)"] --> B["Preprocessing
(Map-Matching)"]
B --> C["Reconstruction
(Combine, Refine, Curve Interp)"]
C --> OP["Optional Post-processing
(Compress, Kalman Filter)"]
OP --> D["Final Trajectory
(Accurate, Road-aligned)"]
classDef rawNode fill:#f5f5f5,stroke:#757575,stroke-width:2px,color:#212121;
classDef procNode fill:#e8dff5,stroke:#7e57c2,stroke-width:2px,color:#311b92;
classDef finalNode fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#01579b;
classDef optNode fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px,stroke-dasharray: 5 5,color:#4a148c;
class A rawNode;
class B,C procNode;
class OP optNode;
class D finalNode;
Key Design Decisions
Geodesic mathematics everywhere. Every distance goes through pyproj on the WGS84 ellipsoid. Plain Euclidean distance in (lat, lon) space breaks down at high latitudes, since 1° longitude only equals 1° latitude at the equator.
AEQD projection for local operations. Azimuthal Equidistant projection, centered on the trajectory centroid, preserves radial distances from the centroid exactly. For trajectories within roughly 1000 km of the center, edge-to-edge distortion stays under a meter, which is below the noise floor of consumer GPS.
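To see why radial distances survive the projection, here is a spherical AEQD forward projection from first principles. This is a sketch: pyehicle itself goes through pyproj on the WGS84 ellipsoid, and the sphere here is only an approximation that keeps the math visible.

```python
from math import radians, sin, cos, atan2, asin, sqrt

R = 6371008.8  # mean Earth radius in metres (spherical approximation)

def aeqd_forward(lat0, lon0, lat, lon):
    """Project (lat, lon) to local x/y metres, AEQD centred at (lat0, lon0)."""
    p0, p1 = radians(lat0), radians(lat)
    dlon = radians(lon - lon0)
    # Great-circle angular distance from the centre (haversine form).
    c = 2 * asin(sqrt(sin((p1 - p0) / 2) ** 2
                      + cos(p0) * cos(p1) * sin(dlon / 2) ** 2))
    # Initial azimuth (bearing) from the centre to the point.
    az = atan2(sin(dlon) * cos(p1),
               cos(p0) * sin(p1) - sin(p0) * cos(p1) * cos(dlon))
    return R * c * sin(az), R * c * cos(az)   # x east, y north

# Distance from the projection centre is preserved exactly, by construction:
# sqrt(x² + y²) is R·c, the great-circle distance itself.
x, y = aeqd_forward(40.75, -73.99, 40.76, -73.97)   # Midtown-ish centroid
radial_m = sqrt(x * x + y * y)   # ~2 km, the exact great-circle distance
```

Distortion only appears between two points that are both away from the centre, which is why the sub-metre error bound holds for trajectories that stay within a regional bubble around their centroid.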
Offline-capable. No API calls are required (Overpass responses are cached), so you can run this on classified networks or on a plane.
Chunked processing. Million-point trajectories are handled in overlapping chunks, with configurable chunk size and overlap to keep edge artifacts out.
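The overlap-and-stitch idea can be sketched in a few lines (parameter names here are illustrative, not pyehicle's API): each chunk shares an overlap with its neighbour so the matcher has context at the seam, and the duplicated points are dropped when results are stitched back together.

```python
# Illustrative sketch of overlap-and-stitch chunking (hypothetical parameter
# names, not pyehicle's API).
def chunked(points, chunk_size=1000, overlap=100):
    step = chunk_size - overlap
    return [points[i:i + chunk_size]
            for i in range(0, max(len(points) - overlap, 1), step)]

def stitch(chunks, overlap=100):
    out = list(chunks[0])
    for c in chunks[1:]:
        out.extend(c[overlap:])   # skip the seam the previous chunk owns
    return out

pts = list(range(2500))
chunks = chunked(pts)        # three chunks: 1000 + 1000 + 700 points
roundtrip = stitch(chunks)   # lossless: identical to pts
```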
See It In Action: From Messy CSV to Production-Ready Trajectory in 7 Steps
Enough documentation. Let’s clean some actual data.
What follows is a real workflow that turns 24,573 noisy GPS points into a clean, road-aligned trajectory. Around 25 lines of code, about two minutes of runtime once the OSM cache is warm.
Step 0: Installation
pip install pyehicle
Step 1: Load Raw GPS Data
import pandas as pd
import pyehicle as pye
# Load CSV with columns: lat, lon, time
df = pd.read_csv('trajectory.csv')
print(f"Raw: {len(df)} points")
# Output: Raw: 24,573 points
Step 2: Preprocessing - Map-Matching
Align GPS points to road network centerlines using a Hidden Markov Model (HMM).
# Map-match using Meili (Valhalla API - requires Docker)
# Or use Leuven for offline HMM-based matching
matched = pye.preprocessing.meili(
    df,
    lat_col='lat',
    lon_col='lon',
    time_col='time',
)
print(f"Matched: {len(matched)} points on road network")
What happened:
- The GPS trajectory was sent to Valhalla’s Meili service (or Leuven, for offline use)
- HMM probabilistic reasoning picked the most likely road sequence
- Points were snapped to the matched road centerlines
- The output includes road network metadata (OSM IDs, geometries)
Alternative (Offline): Use pye.preprocessing.leuven() with downloaded OSM PBF files. No server required.
Step 3: Load Road Network
Load the OpenStreetMap road network for the reconstruction steps that follow.
# Load from PBF file with caching for faster subsequent loads
road_network, geometries, spatial_index = pye.utilities.load_road_network(
    pbf_file_path='map.osm.pbf',
    bbox=(56.9, 57.0, 24.0, 24.2),  # (south, north, west, east)
    save_path='network_cache.graphml'  # Cache for reuse
)
print(f"Loaded {len(road_network.vs)} nodes, {len(road_network.es)} edges")
# Output: Loaded 847,293 nodes, 1,203,847 edges
What happened:
- The OSM PBF file was parsed to extract the road network (first run only)
- An iGraph graph structure was built from nodes and edges
- An R-tree spatial index was created for fast nearest-neighbor queries
- The result was cached as GraphML, giving 50-100x faster loads afterwards
First run: ~180 seconds (parsing the PBF)
Subsequent runs: ~4 seconds (loading the GraphML cache)
Step 4: Reconstruction - Trajectory Refinement
Enforce road network continuity at intersections by interpolating corner points.
refined = pye.reconstructing.refine_trajectory(
    matched,
    road_network=road_network,
    max_node_distance=10,  # Snap within 10m of nodes
    time_col='time'
)
print(f"Refined: {len(refined)} points (added intersection corners)")
# Output: Refined: 6,247 points (added 158 intersection corners)
What happened: when consecutive GPS points map to non-adjacent road segments (say, point t on 5th Ave and point t+1 on E 34th St), the algorithm:
- Finds the nearest graph nodes for each point
- Computes the shortest path between those nodes through iGraph
- Interpolates corner points along the intersection geometry
- Distributes timestamps proportionally
Why it matters: without this step, the trajectory has discontinuities at every intersection. With it, the path follows the real road network through each turn.
Step 5: Reconstruction - Curve Interpolation
Add points along road curves based on bearing changes.
final = pye.reconstructing.curve_interpolation(
    refined,
    road_network=road_network,
    lower_threshold=20,  # Interpolate for bearing changes ≥20°
    upper_threshold=80,  # But not sharp turns ≥80° (already refined)
    max_node_distance=10,
    time_col='time'
)
print(f"Final: {len(final)} points")
# Output: Final: 6,891 points
What happened: the algorithm found 127 locations where bearing changed by 20-80° (highway curves, roundabouts, gradual turns) and interpolated points along the road geometry at each one so the curve reads smoothly in the output.
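The bearing test itself is plain spherical trigonometry. A hedged sketch of the idea (toy coordinates and logic, not pyehicle's implementation): compute the initial great-circle azimuth between consecutive fixes, then flag any 20-80° change between successive bearings.

```python
from math import radians, degrees, sin, cos, atan2

# Toy sketch of the bearing test behind curve interpolation (illustrative,
# not pyehicle's implementation).
def bearing_deg(p, q):
    """Initial great-circle azimuth from point p to point q, in [0, 360)."""
    lat1, lon1 = radians(p[0]), radians(p[1])
    lat2, lon2 = radians(q[0]), radians(q[1])
    dlon = lon2 - lon1
    y = sin(dlon) * cos(lat2)
    x = cos(lat1) * sin(lat2) - sin(lat1) * cos(lat2) * cos(dlon)
    return degrees(atan2(y, x)) % 360.0

def curve_flags(points, lower=20.0, upper=80.0):
    """Flag each interior point whose bearing change falls in [lower, upper)."""
    bearings = [bearing_deg(a, b) for a, b in zip(points, points[1:])]
    flags = []
    for b0, b1 in zip(bearings, bearings[1:]):
        change = abs(b1 - b0)
        change = min(change, 360.0 - change)   # handle wrap-around at north
        flags.append(lower <= change < upper)
    return flags

# Two straight legs due north, then a ~45° bend toward the northeast:
pts = [(57.00, 24.100), (57.01, 24.100), (57.02, 24.100), (57.03, 24.118)]
flags = curve_flags(pts)   # only the bend is flagged: [False, True]
```

Changes at or above the upper threshold are left to the intersection-refinement step; changes below the lower threshold are treated as straight driving.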
Step 6: Optional Post-processing
Apply Kalman filtering and/or compression for a final pass of cleanup (optional).
# Optional: Apply Kalman filtering for smoothing
filtered = pye.preprocessing.kalman_aeqd_filter(
    final,
    lat_col='lat',
    lon_col='lon',
    time_col='time',
    em_iters=3
)
# Optional: Compress to reduce file size
final_trajectory = pye.preprocessing.spatio_temporal_compress(
    filtered,
    spatial_radius_km=0.01,
    time_threshold_s=30
)
print(f"Final trajectory: {len(final_trajectory)} points")
Why optional: these steps can refine the trajectory further, but whether you need them depends on your use case.
Step 7: Visualize and Save
# Save result
final.to_csv('fleet_vehicle_2024_03_15_reconstructed.csv', index=False)
# Validate with visualization
pye.utilities.visualization.single(
    final,
    name='Reconstructed Fleet Vehicle Trajectory',
    show_in_browser=True
)
Result: you now have a trajectory that:
- Sits on the road network (no points stranded in buildings)
- Respects topology (no impossible transitions)
- Has kinematic consistency (smooth velocities, no teleportation)
- Captures intersection and curve geometry
Step 8: Validate Against Ground Truth (optional but recommended)
If you have a ground-truth trajectory (from a survey-grade receiver, manual labeling, or a higher-fidelity source), pyehicle ships the four standard map-matching evaluation metrics:
f1 = pye.utilities.f1(final, ground_truth)
precision = pye.utilities.precision(final, ground_truth)
recall = pye.utilities.recall(final, ground_truth)
rmf = pye.utilities.rmf(final, ground_truth)
print(f"F1: {f1:.3f} P: {precision:.3f} R: {recall:.3f} RMF: {rmf:.3f}")
The metrics fetch the actual road segments from Overpass for a faithful comparison:
- Precision — common matched length ÷ reconstructed length
- Recall — common matched length ÷ ground-truth length
- F1 — harmonic mean of the two
- RMF (Route Mismatch Fraction) — proportion of the trajectory that does not match the ground truth
A reasonable production target is F1 > 0.90 on urban-canyon data; F1 < 0.75 usually means the input is too sparse, the bounding box is too tight, or the map-matcher search radius needs widening.
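For intuition, here is the arithmetic on made-up lengths. Note that the RMF line below uses Newson & Krumm's original formulation (erroneously added plus erroneously missed route length, over the ground-truth length); pyehicle's exact definition may differ from this sketch.

```python
# Worked example on made-up lengths (metres). The common length is the road
# distance the reconstructed and ground-truth routes share, segment by segment.
reconstructed_len = 1200.0
ground_truth_len = 1000.0
common_len = 900.0

precision = common_len / reconstructed_len            # 0.75
recall = common_len / ground_truth_len                # 0.9
f1 = 2 * precision * recall / (precision + recall)    # ~0.818

# Newson & Krumm's route mismatch fraction: erroneously added length plus
# erroneously missed length, over the ground-truth length.
rmf = ((reconstructed_len - common_len)
       + (ground_truth_len - common_len)) / ground_truth_len   # 0.4
```

Precision and recall pull in opposite directions (an over-long reconstruction inflates recall's numerator pool while hurting precision), which is why the F1 harmonic mean is the headline number.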
When NOT to Use Pyehicle
Pyehicle is built for road-bound vehicle trajectories. It’s the wrong tool for:
- Off-road movement — hiking, drone flights, ship tracks in open water. There’s no road network to match against.
- Indoor positioning — Wi-Fi/Bluetooth/UWB fingerprints don’t live on OSM roads.
- Extremely sparse trajectories — fewer than one point per few hundred meters. Leuven’s non-emitting states help, but at some point HMM map-matching simply runs out of signal.
- Cellular/CDR data — antenna-level resolution is coarser than road geometry. Reach for scikit-mobility instead.
- Pedestrian movement in dense pedestrian networks — sidewalk geometry isn’t reliably captured in OSM as a routable graph.
For everything else — vehicle GPS, phone GPS in a car, fleet tracker pings — it’s exactly what it’s designed for.
Should You Use Pyehicle?
Short answer: if your GPS data comes from real-world sensors (vehicles, phones, trackers), reach for pyehicle first. Then bring in other tools for analysis.
The Python mobility analytics ecosystem has specialized tools for different stages of the pipeline:
| Library | Primary Function | Input Assumption | Use pyehicle When… |
|---|---|---|---|
| pyehicle | Map-matching and trajectory reconstruction on road networks | Raw, noisy GPS with multipath, gaps, outliers | Your data comes from real-world GPS loggers (vehicles, phones, asset trackers) and has significant noise |
| MovingPandas | Trajectory visualization, segmentation, feature engineering | Clean trajectories with accurate positions | You’ve already cleaned the data (via pyehicle or differential GPS) and need to split by stops, compute speeds, or visualize |
| scikit-mobility | Human mobility metrics (entropy, radius of gyration), flow analysis | Clean trajectories from any source: GPS, check-ins, CDR | You’re analyzing mobility patterns at population scale and need privacy-preserving metrics or flow matrices |
The Power Move: Combining All Three Libraries
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from datetime import timedelta
import pyehicle as pye
import movingpandas as mpd
import skmob
from skmob.measures.individual import radius_of_gyration
# Stage 1: Reconstruction (pyehicle)
raw = pd.read_csv('gps_logger.csv')
clean = pye.preprocessing.spatio_temporal_compress(raw)
clean = pye.preprocessing.leuven(clean)
clean = pye.preprocessing.kalman_aeqd_filter(clean)
# Stage 2: Analysis (MovingPandas)
# MovingPandas needs a GeoDataFrame with a geometry column and a DatetimeIndex
clean['time'] = pd.to_datetime(clean['time'])
gdf = gpd.GeoDataFrame(
    clean,
    geometry=[Point(xy) for xy in zip(clean.lon, clean.lat)],
    crs='EPSG:4326',
).set_index('time')
traj = mpd.Trajectory(gdf, traj_id=1)
traj.add_speed()
# Stops are detected with TrajectoryStopDetector
stops = mpd.TrajectoryStopDetector(traj).get_stop_segments(
    min_duration=timedelta(seconds=60),
    max_diameter=50,
)
# Stage 3: Metrics (scikit-mobility)
skmob_traj = skmob.TrajDataFrame(
    clean, latitude='lat', longitude='lon', datetime='time'
)
rg_df = radius_of_gyration(skmob_traj)
The libraries are complementary, not competitive. Pyehicle handles the physics problem (noisy GNSS signals). MovingPandas handles the feature engineering problem (what is this trajectory doing?). Scikit-mobility handles the statistical problem (what does this population’s mobility look like?).
Get Started in 30 Seconds
Stop fighting with noisy GPS data. Install pyehicle and get back to building.
pip install pyehicle
Next Steps
- 📖 Read the docs: pyehicle.readthedocs.io — Full API reference with examples
- 💻 Check the code: github.com/Nick-Doulos/pyehicle — MIT licensed, contributions welcome
- 🐛 Report issues: github.com/Nick-Doulos/pyehicle/issues — Found a bug? Let me know
- 🗺️ Get OSM data: download.geofabrik.de — Free OpenStreetMap extracts by region
Using This in Research?
@software{pyehicle2025,
title = {Pyehicle: A Python Library for GPS Trajectory Processing and Reconstruction},
author = {Doulos, Nick},
year = {2025},
version = {0.1.0},
url = {https://github.com/Nick-Doulos/pyehicle}
}
If pyehicle saved you time, consider starring the repo on GitHub. It helps others find it.