See It Work
Before any of the theory, here’s what reconstruction actually means. Drag the noise slider to crank up multipath. Click through the four stages to watch the pipeline run on a noisy trajectory, or hit Auto-play to see all four in sequence.
Interactive demo of the pyehicle reconstruction pipeline
What you’re looking at:
- Raw GPS — the chaos your logger actually emits, complete with multipath jumps. Notice how three points are flung well off the road; that’s a high-rise canyon or a passing truck.
- Map-matched — a Hidden Markov Model with the Viterbi recursion finds the most likely sequence of road segments. Points now sit on the centerlines, but corners and curves are still under-sampled.
- Refined — refine_trajectory detects the 90° turn between non-adjacent road segments and inserts an intersection corner (green), reconstructed on the road graph with an iGraph shortest path.
- Final — curve_interpolation walks the trajectory looking for 20–80° bearing changes and inserts points along the road geometry (purple). The straight sections and the already-refined corner are left alone; only the arc gets the new points.
The road network here is synthetic, but every algorithm running in your browser is what ships in pyehicle. On real OSM data it does exactly this, just with millions of points instead of fourteen, and against an actual road graph instead of a four-segment toy.
The rest of this article is the problem analysis, the math behind the pipeline, the 25-line workflow that produces results like this on real data, and the evaluation metrics you’d use to know it worked.
Your GPS Data Is Costing You Money
Stand at 42nd Street and Broadway in Manhattan, fire up a GPS logger, and stay still. Within seconds you’ll see yourself jump 40 meters east, then 60 meters west, then back again, all without moving an inch. That’s multipath error: GNSS signals bouncing off skyscrapers before they reach the receiver, producing phantom positions that violate basic physics.
Put the same receiver in a delivery van crossing Midtown at 3 PM and the trajectory looks like a drunk spider’s path. Points scatter across parallel streets. Consecutive one-second readings imply 90 km/h sprints between samples. Straight lines cut through buildings where the van obviously turned. When the driver dips into an underpass, the trace vanishes for thirty seconds and reappears 200 meters away with no continuity at all.
None of this is theoretical. It’s what your data looks like right now.
What This Costs You
If you’re:
- Validating autonomous vehicles → if raw GPS is anywhere in your ground-truth pipeline, you’re producing false negatives in safety validation
- Optimizing fleet routes → your algorithms optimize phantom routes that don’t match reality
- Analyzing urban mobility → your metrics (radius of gyration, entropy) come out significantly inflated, often by a factor of two or more depending on sampling rate and environment
- Billing by distance → you’re either overcharging customers or losing revenue
You’ve probably been treating this as a data quality issue. It isn’t. The signal is fundamentally corrupted by physics, and no amount of filtering will fix that.
What you need is reconstruction, not cleaning.
What Pyehicle Does
Before (raw GPS):
- Points scattered 20-50m off roads
- Gaps from signal loss in tunnels and parking garages
- Straight lines cutting through buildings
- Trajectories that violate road network topology
After (Pyehicle):
- Points sitting on road centerlines
- Gaps filled with map-constrained interpolation
- Paths that follow actual road network geometry
- Intersection topology and turn constraints respected
All in roughly seven lines of Python.
Why Your First Three Attempts Will Fail
Attempt 1: Nearest-Road Snapping
# Naive approach: snap each point to the nearest road
for point in gps_trajectory:
    point.snap_to_nearest_road()
Failure mode: Snapping each point on its own produces topologically impossible paths. Point t lands on eastbound I-95, point t+1 on westbound I-95. The vehicle didn’t teleport across a Jersey barrier; your algorithm did.
Root cause: This is a sequence labeling problem and you’re solving it pointwise. The correct road at t depends on the road at t-1 and the transition probability between them.
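A toy sketch makes the failure concrete (the segment names and distances here are hypothetical, not pyehicle code): pointwise argmin happily picks a disconnected pair of segments, while a sequence-aware choice restricted to connected transitions cannot.

```python
# Toy illustration (hypothetical names and distances, not pyehicle code).
# Two consecutive GPS fixes near a divided highway; values are distances in
# metres from each fix to its two candidate segments.
candidates = [
    {"I95_east": 4.0, "I95_west": 6.0},   # fix at t
    {"I95_east": 7.0, "I95_west": 3.0},   # fix at t+1
]
# You cannot cross the Jersey barrier: only same-direction transitions connect.
connected = {("I95_east", "I95_east"), ("I95_west", "I95_west")}

# Pointwise snapping: pick the nearest segment for each fix independently.
pointwise = [min(c, key=c.get) for c in candidates]
impossible = tuple(pointwise) not in connected   # True: east -> west teleport

# Sequence-aware choice: minimise total distance over connected pairs only.
best = min(
    ((a, b) for a in candidates[0] for b in candidates[1] if (a, b) in connected),
    key=lambda p: candidates[0][p[0]] + candidates[1][p[1]],
)
# best == ('I95_west', 'I95_west'): slightly farther per point, but physically possible
```

The sequence-aware answer is worse pointwise (6 m + 3 m instead of 4 m + 3 m) and still correct, which is exactly why the problem cannot be solved one point at a time.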
Attempt 2: Moving Average Smoothing
# Smooth with rolling window
smoothed_lat = trajectory['lat'].rolling(window=5).mean()
Failure mode: When multipath puts your points on both sides of a street, the mean lands in the median strip. A Kalman filter would throw the outliers out based on velocity constraints. A moving average just averages them in.
Root cause: GPS noise isn’t Gaussian. Multipath produces bimodal distributions, atmospheric delays produce long tails, and a simple convolution assumes i.i.d. Gaussian noise. That assumption doesn’t hold here.
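A few lines with made-up offsets show the bimodal failure directly: fixes scattered to both sides of a divided street average out to the median strip.

```python
from statistics import mean

# Made-up multipath sample: the van is in a lane roughly 20 m to one side of
# the street's median strip, but reflections scatter the fixes across BOTH
# carriageways.
offsets_m = [-21.0, -20.0, -19.0, 19.0, 20.0, 21.0]

# The rolling-mean "fix": the window average lands at 0 m, i.e. in the
# median strip, a position the vehicle never occupied.
window_mean = mean(offsets_m)   # 0.0

# The distribution is bimodal, so no single central estimate is correct;
# only a model with motion constraints can reject one mode outright.
```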
Attempt 3: Linear Interpolation Through Gaps
# Fill gaps with straight lines
trajectory.interpolate(method='linear')
Failure mode: A vehicle that enters a tunnel at point A (lat₁, lon₁) and emerges at point B (lat₂, lon₂) didn’t travel the great-circle path between them. It followed roads. If there’s a river in between, your interpolation just drove through the water.
Root cause: You need map-constrained interpolation. The interpolated path has to lie on the road network graph, not in ℝ².
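To see what "interpolation on the graph" means, here is a minimal sketch with a toy four-node road graph and a plain Dijkstra shortest path (pyehicle does the equivalent against the full OSM graph via iGraph; node names and lengths here are invented):

```python
import heapq

# Toy road graph: nodes are intersections, edge weights are segment lengths
# in metres. A river separates n1 and n4 directly; the road detours via n2, n3.
graph = {
    "n1": {"n2": 120.0},
    "n2": {"n1": 120.0, "n3": 80.0},
    "n3": {"n2": 80.0, "n4": 150.0},
    "n4": {"n3": 150.0},
}

def shortest_path(graph, src, dst):
    """Dijkstra: return the node sequence of the cheapest road path."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

# The gap spans tunnel entry (n1) to exit (n4): interpolate along the road
# graph, not along the straight line through the river.
fill = shortest_path(graph, "n1", "n4")   # ['n1', 'n2', 'n3', 'n4']
```

New points then get placed along this node sequence and timestamped proportionally to distance, so the filled gap stays on asphalt.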
The Mathematical Foundation: What Reconstruction Actually Means
At its core, GPS trajectory reconstruction is map-matching: aligning noisy GPS observations to the road network topology underneath them. The formulation below follows Newson & Krumm’s 2009 HMM map-matching paper, which is also the basis for Valhalla’s Meili.
Map-Matching via Hidden Markov Model
Given:
- Observations: z₁, z₂, …, zₙ (GPS lat/lon measurements)
- Hidden states: s₁, s₂, …, sₙ (true road segments from graph G)
Find: argmax P(s₁:ₙ | z₁:ₙ)
This posterior factors as a Hidden Markov Model. Since the goal is the single most likely sequence, the Viterbi recursion takes a max rather than a sum over previous states:
δₜ(sₜ) ∝ P(zₜ | sₜ) × max_{sₜ₋₁} [ P(sₜ | sₜ₋₁) × δₜ₋₁(sₜ₋₁) ]
where:
- P(zₜ | sₜ) is the emission probability: how well GPS point zₜ fits road state sₜ (perpendicular projection error, GPS-to-road distance)
- P(sₜ | sₜ₋₁) is the transition probability: route distance between the two candidate roads, weighted by network connectivity; zero if the segments don’t connect in the graph
- δₜ₋₁(sₜ₋₁) is the best score for reaching state sₜ₋₁ at time t−1, computed recursively
The recurrence is solved by dynamic programming in O(n·m²), where m is the number of candidate roads per observation.
Why it works: transition probabilities encode the network’s topological constraints. If road segments A and B aren’t connected in the graph, P(sₜ=B | sₜ₋₁=A) = 0, which rules out the impossible jumps naive snapping happily accepts.
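The recursion fits in a few lines. A minimal Viterbi sketch with toy emission and transition numbers (illustrative, not pyehicle internals) shows a zero transition probability forcing the sequence to stay on connected roads even when one observation points elsewhere:

```python
# Minimal Viterbi sketch matching the recursion above (toy probabilities,
# not pyehicle internals). emission[t][s] plays P(z_t | s_t); transition[(a, b)]
# plays P(s_t = b | s_{t-1} = a), and is zero when a and b are not connected.
states = ["A", "B"]
emission = [
    {"A": 0.8, "B": 0.2},   # t=0: fix sits close to segment A
    {"A": 0.3, "B": 0.7},   # t=1: a multipath jump pulls the fix toward B
    {"A": 0.9, "B": 0.1},   # t=2: firmly back on A
]
transition = {("A", "A"): 0.9, ("A", "B"): 0.0,   # A and B are not connected
              ("B", "A"): 0.0, ("B", "B"): 0.9}

def viterbi(states, emission, transition):
    delta = {s: emission[0][s] for s in states}   # best score ending in s
    backpointers = []
    for t in range(1, len(emission)):
        back, new_delta = {}, {}
        for s in states:
            prev = max(states, key=lambda p: transition[(p, s)] * delta[p])
            new_delta[s] = emission[t][s] * transition[(prev, s)] * delta[prev]
            back[s] = prev
        backpointers.append(back)
        delta = new_delta
    # Backtrack from the best final state.
    path = [max(states, key=delta.get)]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    return path[::-1]

# Pointwise argmax would put t=1 on B; the zero A->B transition keeps the
# whole sequence on A instead.
best_path = viterbi(states, emission, transition)   # ['A', 'A', 'A']
```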
After map-matching, pyehicle adds geometric refinement:
- Trajectory refinement: inserts intersection corner points where the path transitions between road segments
- Curve interpolation: adds points along curved roads based on bearing changes
Together these make sure the final trajectory follows real road geometry, not just snapped sample points.
The Solution: Pyehicle Does the Heavy Lifting
What you actually wanted was a library that takes messy GPS data and hands back clean, road-aligned trajectories. No PhD. No three-month detour into debugging your own HMM. Just results.
Pyehicle implements the framework above end to end:
- Map-matching via Hidden Markov Models with the Viterbi algorithm, using either Leuven (offline, HMM with non-emitting states) or Valhalla’s Meili (over Docker)
- Geometric refinement that inserts intersection corner points and curve interpolations along the actual road geometry
- Geodesic math throughout — pyproj on the WGS84 ellipsoid, with local AEQD projections for metric accuracy
- Offline-capable — Leuven works against downloaded OSM PBF files; Overpass queries are cached
- Scales to millions of points via overlapping chunked processing
- Optional Kalman filtering with EM parameter learning and an RTS smoother
- Built-in evaluation metrics — F1, Precision, Recall, and RMF against ground-truth road segments, so you can measure how well a reconstruction actually matched reality
More importantly, it works on real-world data.
Architecture
Reconstruction Pipeline:
graph LR
A["Raw GPS Data
(Noisy, Inaccurate)"] --> B["Preprocessing
(Map-Matching)"]
B --> C["Reconstruction
(Combine, Refine, Curve Interp)"]
C --> OP["Optional Post-processing
(Compress, Kalman Filter)"]
OP --> D["Final Trajectory
(Accurate, Road-aligned)"]
classDef rawNode fill:#f5f5f5,stroke:#757575,stroke-width:2px,color:#212121;
classDef procNode fill:#e8dff5,stroke:#7e57c2,stroke-width:2px,color:#311b92;
classDef finalNode fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#01579b;
classDef optNode fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px,stroke-dasharray: 5 5,color:#4a148c;
class A rawNode;
class B,C procNode;
class OP optNode;
class D finalNode;
Key Design Decisions
Geodesic mathematics everywhere. Every distance goes through pyproj on the WGS84 ellipsoid. Plain Euclidean distance in (lat, lon) space breaks down at high latitudes, since 1° longitude only equals 1° latitude at the equator.
AEQD projection for local operations. Azimuthal Equidistant projection, centered on the trajectory centroid, preserves radial distances from the centroid exactly. For trajectories within roughly 1000 km of the center, edge-to-edge distortion stays under a meter, which is below the noise floor of consumer GPS.
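To see why radial distances survive the projection, here is a spherical AEQD forward projection from first principles. This is a sketch: pyehicle itself goes through pyproj on the WGS84 ellipsoid, and the sphere here is only an approximation that keeps the math visible.

```python
from math import radians, sin, cos, atan2, asin, sqrt

R = 6371008.8  # mean Earth radius in metres (spherical approximation)

def aeqd_forward(lat0, lon0, lat, lon):
    """Project (lat, lon) to local x/y metres, AEQD centred at (lat0, lon0)."""
    p0, p1 = radians(lat0), radians(lat)
    dlon = radians(lon - lon0)
    # Great-circle angular distance from the centre (haversine form).
    c = 2 * asin(sqrt(sin((p1 - p0) / 2) ** 2
                      + cos(p0) * cos(p1) * sin(dlon / 2) ** 2))
    # Initial azimuth (bearing) from the centre to the point.
    az = atan2(sin(dlon) * cos(p1),
               cos(p0) * sin(p1) - sin(p0) * cos(p1) * cos(dlon))
    return R * c * sin(az), R * c * cos(az)   # x east, y north

# Distance from the projection centre is preserved exactly, by construction:
# sqrt(x² + y²) is R·c, the great-circle distance itself.
x, y = aeqd_forward(40.75, -73.99, 40.76, -73.97)   # Midtown-ish centroid
radial_m = sqrt(x * x + y * y)   # ~2 km, the exact great-circle distance
```

Distortion only appears between two points that are both away from the centre, which is why the sub-metre error bound holds for trajectories that stay within a regional bubble around their centroid.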
Offline-capable. No API calls are required (Overpass responses are cached), so you can run this on classified networks or on a plane.
Chunked processing. Million-point trajectories are handled in overlapping chunks, with configurable chunk size and overlap to keep edge artifacts out.
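The overlap-and-stitch idea can be sketched in a few lines (parameter names here are illustrative, not pyehicle's API): each chunk shares an overlap with its neighbour so the matcher has context at the seam, and the duplicated points are dropped when results are stitched back together.

```python
# Illustrative sketch of overlap-and-stitch chunking (hypothetical parameter
# names, not pyehicle's API).
def chunked(points, chunk_size=1000, overlap=100):
    step = chunk_size - overlap
    return [points[i:i + chunk_size]
            for i in range(0, max(len(points) - overlap, 1), step)]

def stitch(chunks, overlap=100):
    out = list(chunks[0])
    for c in chunks[1:]:
        out.extend(c[overlap:])   # skip the seam the previous chunk owns
    return out

pts = list(range(2500))
chunks = chunked(pts)        # three chunks: 1000 + 1000 + 700 points
roundtrip = stitch(chunks)   # lossless: identical to pts
```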
See It In Action: From Messy CSV to Production-Ready Trajectory in 7 Steps
Enough documentation. Let’s clean some actual data.
What follows is a real workflow that turns 24,573 noisy GPS points into a clean, road-aligned trajectory. Around 25 lines of code, about two minutes of runtime once the OSM cache is warm.
Step 0: Installation
pip install pyehicle
Step 1: Load Raw GPS Data
import pandas as pd
import pyehicle as pye
# Load CSV with columns: lat, lon, time
df = pd.read_csv('trajectory.csv')
print(f"Raw: {len(df)} points")
# Output: Raw: 24,573 points
Step 2: Preprocessing - Map-Matching
Align GPS points to road network centerlines using a Hidden Markov Model (HMM).
# Map-match using Meili (Valhalla API - requires Docker)
# Or use Leuven for offline HMM-based matching
matched = pye.preprocessing.meili(
    df,
    lat_col='lat',
    lon_col='lon',
    time_col='time',
)
print(f"Matched: {len(matched)} points on road network")
What happened:
- The GPS trajectory was sent to Valhalla’s Meili service (or Leuven, for offline use)
- HMM probabilistic reasoning picked the most likely road sequence
- Points were snapped to the matched road centerlines
- The output includes road network metadata (OSM IDs, geometries)
Alternative (Offline): Use pye.preprocessing.leuven() with downloaded OSM PBF files. No server required.
Step 3: Load Road Network
Load the OpenStreetMap road network for the reconstruction steps that follow.
# Load from PBF file with caching for faster subsequent loads
road_network, geometries, spatial_index = pye.utilities.load_road_network(
    pbf_file_path='map.osm.pbf',
    bbox=(56.9, 57.0, 24.0, 24.2),  # (south, north, west, east)
    save_path='network_cache.graphml'  # Cache for reuse
)
print(f"Loaded {len(road_network.vs)} nodes, {len(road_network.es)} edges")
# Output: Loaded 847,293 nodes, 1,203,847 edges
What happened:
- The OSM PBF file was parsed to extract the road network (first run only)
- An iGraph graph structure was built from nodes and edges
- An R-tree spatial index was created for fast nearest-neighbor queries
- The result was cached as GraphML, giving 50-100x faster loads afterwards
First run: ~180 seconds (parsing the PBF)
Subsequent runs: ~4 seconds (loading the GraphML cache)
Step 4: Reconstruction - Trajectory Refinement
Enforce road network continuity at intersections by interpolating corner points.
refined = pye.reconstructing.refine_trajectory(
    matched,
    road_network=road_network,
    max_node_distance=10,  # Snap within 10m of nodes
    time_col='time'
)
print(f"Refined: {len(refined)} points (added intersection corners)")
# Output: Refined: 6,247 points (added 158 intersection corners)
What happened: when consecutive GPS points map to non-adjacent road segments (say, point t on 5th Ave and point t+1 on E 34th St), the algorithm:
- Finds the nearest graph nodes for each point
- Computes the shortest path between those nodes through iGraph
- Interpolates corner points along the intersection geometry
- Distributes timestamps proportionally
Why it matters: without this step, the trajectory has discontinuities at every intersection. With it, the path follows the real road network through each turn.
Step 5: Reconstruction - Curve Interpolation
Add points along road curves based on bearing changes.
final = pye.reconstructing.curve_interpolation(
    refined,
    road_network=road_network,
    lower_threshold=20,  # Interpolate for bearing changes ≥20°
    upper_threshold=80,  # But not sharp turns ≥80° (already refined)
    max_node_distance=10,
    time_col='time'
)
print(f"Final: {len(final)} points")
# Output: Final: 6,891 points
What happened: the algorithm found 127 locations where bearing changed by 20-80° (highway curves, roundabouts, gradual turns) and interpolated points along the road geometry at each one so the curve reads smoothly in the output.
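The bearing test itself is plain spherical trigonometry. A hedged sketch of the idea (toy coordinates and logic, not pyehicle's implementation): compute the initial great-circle azimuth between consecutive fixes, then flag any 20-80° change between successive bearings.

```python
from math import radians, degrees, sin, cos, atan2

# Toy sketch of the bearing test behind curve interpolation (illustrative,
# not pyehicle's implementation).
def bearing_deg(p, q):
    """Initial great-circle azimuth from point p to point q, in [0, 360)."""
    lat1, lon1 = radians(p[0]), radians(p[1])
    lat2, lon2 = radians(q[0]), radians(q[1])
    dlon = lon2 - lon1
    y = sin(dlon) * cos(lat2)
    x = cos(lat1) * sin(lat2) - sin(lat1) * cos(lat2) * cos(dlon)
    return degrees(atan2(y, x)) % 360.0

def curve_flags(points, lower=20.0, upper=80.0):
    """Flag each interior point whose bearing change falls in [lower, upper)."""
    bearings = [bearing_deg(a, b) for a, b in zip(points, points[1:])]
    flags = []
    for b0, b1 in zip(bearings, bearings[1:]):
        change = abs(b1 - b0)
        change = min(change, 360.0 - change)   # handle wrap-around at north
        flags.append(lower <= change < upper)
    return flags

# Two straight legs due north, then a ~45° bend toward the northeast:
pts = [(57.00, 24.100), (57.01, 24.100), (57.02, 24.100), (57.03, 24.118)]
flags = curve_flags(pts)   # only the bend is flagged: [False, True]
```

Changes at or above the upper threshold are left to the intersection-refinement step; changes below the lower threshold are treated as straight driving.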
Step 6: Optional Post-processing
Apply Kalman filtering and/or compression for a final pass of cleanup (optional).
# Optional: Apply Kalman filtering for smoothing
filtered = pye.preprocessing.kalman_aeqd_filter(
    final,
    lat_col='lat',
    lon_col='lon',
    time_col='time',
    em_iters=3
)
# Optional: Compress to reduce file size
final_trajectory = pye.preprocessing.spatio_temporal_compress(
    filtered,
    spatial_radius_km=0.01,
    time_threshold_s=30
)
print(f"Final trajectory: {len(final_trajectory)} points")
Why optional: these steps can refine the trajectory further, but whether you need them depends on your use case.
Step 7: Visualize and Save
# Save result
final.to_csv('fleet_vehicle_2024_03_15_reconstructed.csv', index=False)
# Validate with visualization
pye.utilities.visualization.single(
    final,
    name='Reconstructed Fleet Vehicle Trajectory',
    show_in_browser=True
)
Result: you now have a trajectory that:
- Sits on the road network (no points stranded in buildings)
- Respects topology (no impossible transitions)
- Has kinematic consistency (smooth velocities, no teleportation)
- Captures intersection and curve geometry
Step 8: Validate Against Ground Truth (optional but recommended)
If you have a ground-truth trajectory (from a survey-grade receiver, manual labeling, or a higher-fidelity source), pyehicle ships the four standard map-matching evaluation metrics:
f1 = pye.utilities.f1(final, ground_truth)
precision = pye.utilities.precision(final, ground_truth)
recall = pye.utilities.recall(final, ground_truth)
rmf = pye.utilities.rmf(final, ground_truth)
print(f"F1: {f1:.3f} P: {precision:.3f} R: {recall:.3f} RMF: {rmf:.3f}")
The metrics fetch the actual road segments from Overpass for a faithful comparison:
- Precision — common matched length ÷ reconstructed length
- Recall — common matched length ÷ ground-truth length
- F1 — harmonic mean of the two
- RMF (Route Mismatch Fraction) — proportion of the trajectory that does not match the ground truth
A reasonable production target is F1 > 0.90 on urban-canyon data; F1 < 0.75 usually means the input is too sparse, the bounding box is too tight, or the map-matcher search radius needs widening.
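For intuition, here is the arithmetic on made-up lengths. Note that the RMF line below uses Newson & Krumm's original formulation (erroneously added plus erroneously missed route length, over the ground-truth length); pyehicle's exact definition may differ from this sketch.

```python
# Worked example on made-up lengths (metres). The common length is the road
# distance the reconstructed and ground-truth routes share, segment by segment.
reconstructed_len = 1200.0
ground_truth_len = 1000.0
common_len = 900.0

precision = common_len / reconstructed_len            # 0.75
recall = common_len / ground_truth_len                # 0.9
f1 = 2 * precision * recall / (precision + recall)    # ~0.818

# Newson & Krumm's route mismatch fraction: erroneously added length plus
# erroneously missed length, over the ground-truth length.
rmf = ((reconstructed_len - common_len)
       + (ground_truth_len - common_len)) / ground_truth_len   # 0.4
```

Precision and recall pull in opposite directions (an over-long reconstruction inflates recall's numerator pool while hurting precision), which is why the F1 harmonic mean is the headline number.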
When NOT to Use Pyehicle
Pyehicle is built for road-bound vehicle trajectories. It’s the wrong tool for:
- Off-road movement — hiking, drone flights, ship tracks in open water. There’s no road network to match against.
- Indoor positioning — Wi-Fi/Bluetooth/UWB fingerprints don’t live on OSM roads.
- Extremely sparse trajectories — fewer than one point per few hundred meters. Leuven’s non-emitting states help, but at some point HMM map-matching simply runs out of signal.
- Cellular/CDR data — antenna-level resolution is coarser than road geometry. Reach for scikit-mobility instead.
- Pedestrian movement in dense pedestrian networks — sidewalk geometry isn’t reliably captured in OSM as a routable graph.
For everything else — vehicle GPS, phone GPS in a car, fleet tracker pings — it’s exactly what it’s designed for.
Should You Use Pyehicle?
Short answer: if your GPS data comes from real-world sensors (vehicles, phones, trackers), reach for pyehicle first. Then bring in other tools for analysis.
The Python mobility analytics ecosystem has specialized tools for different stages of the pipeline:
| Library | Primary Function | Input Assumption | Use pyehicle When… |
|---|---|---|---|
| pyehicle | Map-matching and trajectory reconstruction on road networks | Raw, noisy GPS with multipath, gaps, outliers | Your data comes from real-world GPS loggers (vehicles, phones, asset trackers) and has significant noise |
| MovingPandas | Trajectory visualization, segmentation, feature engineering | Clean trajectories with accurate positions | You’ve already cleaned the data (via pyehicle or differential GPS) and need to split by stops, compute speeds, or visualize |
| scikit-mobility | Human mobility metrics (entropy, radius of gyration), flow analysis | Clean trajectories from any source: GPS, check-ins, CDR | You’re analyzing mobility patterns at population scale and need privacy-preserving metrics or flow matrices |
The Power Move: Combining All Three Libraries
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from datetime import timedelta
import pyehicle as pye
import movingpandas as mpd
import skmob
from skmob.measures.individual import radius_of_gyration
# Stage 1: Reconstruction (pyehicle)
raw = pd.read_csv('gps_logger.csv')
clean = pye.preprocessing.spatio_temporal_compress(raw)
clean = pye.preprocessing.leuven(clean)
clean = pye.preprocessing.kalman_aeqd_filter(clean)
# Stage 2: Analysis (MovingPandas)
# MovingPandas needs a GeoDataFrame with a geometry column and a DatetimeIndex
clean['time'] = pd.to_datetime(clean['time'])
gdf = gpd.GeoDataFrame(
    clean,
    geometry=[Point(xy) for xy in zip(clean.lon, clean.lat)],
    crs='EPSG:4326',
).set_index('time')
traj = mpd.Trajectory(gdf, traj_id=1)
traj.add_speed()
# Stops are detected with TrajectoryStopDetector
stops = mpd.TrajectoryStopDetector(traj).get_stop_segments(
    min_duration=timedelta(seconds=60),
    max_diameter=50,
)
# Stage 3: Metrics (scikit-mobility)
skmob_traj = skmob.TrajDataFrame(
    clean, latitude='lat', longitude='lon', datetime='time'
)
rg_df = radius_of_gyration(skmob_traj)
The libraries are complementary, not competitive. Pyehicle handles the physics problem (noisy GNSS signals). MovingPandas handles the feature engineering problem (what is this trajectory doing?). Scikit-mobility handles the statistical problem (what does this population’s mobility look like?).
Get Started in 30 Seconds
Stop fighting with noisy GPS data. Install pyehicle and get back to building.
pip install pyehicle
Next Steps
- 📖 Read the docs: pyehicle.readthedocs.io — Full API reference with examples
- 💻 Check the code: github.com/Nick-Doulos/pyehicle — MIT licensed, contributions welcome
- 🐛 Report issues: github.com/Nick-Doulos/pyehicle/issues — Found a bug? Let me know
- 🗺️ Get OSM data: download.geofabrik.de — Free OpenStreetMap extracts by region
Using This in Research?
@software{pyehicle2025,
title = {Pyehicle: A Python Library for GPS Trajectory Processing and Reconstruction},
author = {Doulos, Nick},
year = {2025},
version = {0.1.0},
url = {https://github.com/Nick-Doulos/pyehicle}
}
If pyehicle saved you time, consider starring the repo on GitHub. It helps others find it.