Lin.ear th.inking

Hilbert Curve of order 4:

Hilbert Curve of order 6:

Code is in the JEQL script repo.

import jeql.std.function.HashFunction;

hilbertOrder = 6;
side = Val.toInt( Math.pow(2, hilbertOrder) );
count = side * side;

radius = 1;

t = select * from Generate.sequence( 0, count-2 );

t = select i, geom: Geom.buffer(hilbertEdge, 0.4)
let
hilbertPt1 = HashFunction.hilbertPoint(hilbertOrder, i),
hilbertPt2 = HashFunction.hilbertPoint(hilbertOrder, i+1),
hilbertEdge = Geom.createLineFromPoints( hilbertPt1, hilbertPt2 )
from t;

t1 = select *,
styleFill: clr, styleStroke: clr, styleStrokeWidth: 1
let
clr = Color.toRGBfromHSV(Val.toDouble(i) / count, 1, 1)
from t;
Mem t1;

The function hilbertPoint uses the efficient algorithm from http://threadlocalmutex.com/. Code is on Github.
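For readers without JEQL to hand, here is a minimal Python sketch of mapping a Hilbert index to a grid point. It uses the classic iterative rotate-and-flip construction, not the threadlocalmutex.com algorithm that hilbertPoint is based on, and the name hilbert_point is purely illustrative. The curve orientation may therefore differ from the images above.

```python
def hilbert_point(order, d):
    """Map index d in [0, 4**order) to an (x, y) cell on a 2**order grid.

    Illustrative only: classic rotate-and-flip construction, not the
    algorithm used by the JEQL/JTS hilbertPoint function.
    """
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # flip the current sub-square
                x = s - 1 - x
                y = s - 1 - y
            # transpose the current sub-square
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Points along an order-2 curve; consecutive points are always grid neighbours
curve = [hilbert_point(2, i) for i in range(16)]
```

Joining consecutive points with line segments, as the JEQL script does with Geom.createLineFromPoints, gives the curve edges.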

I just landed a JTS pull request for Hilbert and Morton (Z-order) codes and curves.

Hilbert Curve of level 3

Morton Curve of level 3

Apart from pretty pictures of fractals, the goal is to support experimenting with Packed Hilbert R-trees, as an alternative to the current Sort-Tile-Recursive packing strategy (implemented as STRtree in JTS). STRtrees are heavily used to speed up spatial algorithms inside JTS (and externally as per recent report). So if Hilbert curve-based packing provides better performance that would be a big win.
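To make the Morton (Z-order) code concrete, here is a small Python sketch of encoding by bit interleaving. The function names and the bit convention (x in the even bits, y in the odd bits) are assumptions for illustration and may not match the JTS implementation.

```python
def morton_encode(x, y, level):
    """Interleave the low `level` bits of x and y (x in even bit positions)."""
    code = 0
    for bit in range(level):
        code |= ((x >> bit) & 1) << (2 * bit)
        code |= ((y >> bit) & 1) << (2 * bit + 1)
    return code


def morton_decode(code, level):
    """Inverse of morton_encode: split the interleaved bits back into (x, y)."""
    x = y = 0
    for bit in range(level):
        x |= ((code >> (2 * bit)) & 1) << bit
        y |= ((code >> (2 * bit + 1)) & 1) << bit
    return x, y


# Sorting cells by code gives the curve order, which is the basic ingredient
# of curve-based index packing (sort items by code, then fill nodes in order).
cells = [(3, 1), (0, 0), (2, 2), (1, 0)]
ordered = sorted(cells, key=lambda c: morton_encode(c[0], c[1], 3))
```

A Hilbert-packed tree would follow the same pattern with a Hilbert code as the sort key.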

I'm happy to announce that I am taking a position with Crunchy Data as a Senior Geospatial Engineer. I'm working alongside fellow Victorian geospatial maven extraordinaire Paul Ramsey, as the core of a proposed Geospatial Data Centre of Excellence.

Our mission statement is simple: make PostGIS bigger, better, faster!

Bigger - more spatial algorithms and functions
Better - enhance existing functionality to make it easier to use, more powerful, and more robust
Faster - keep looking for algorithmic optimizations and ways to use the power of Postgres to make spatial processing faster

A lot of this work will involve enhancements to the core GEOS geometry library. Part of the goal is to keep JTS and GEOS aligned, so this should produce a nice boost to JTS as well.

Having been lurking in the background for many years now, I'm stoked to be (finally) able to work directly on PostGIS. And I'm excited to be part of the Crunchy team. They have some of the leading Postgres experts in-house, so I'm expecting that it will be a great learning experience. And their client list promises to expose us to some fascinating large-scale use cases for spatial data processing, which can only be good for the power and robustness of PostGIS.

I'm also looking forward to re-engaging with the geospatial open source community, and learning more about the (even bigger) open source Postgres community. Great things lie ahead!

A humble workhorse of geospatial processing is the ability to compute a point which is guaranteed to lie in the interior of a polygon. In the OGC Simple Features for SQL specification (and hence in PostGIS) this is known as ST_PointOnSurface. In JTS/GEOS it is called getInteriorPoint, for historical reasons [1].

Interior points for country boundaries

There are some important use cases for this capability:

Constructing a "proxy point" for a polygon to use in drill-down spatial query. This has all kinds of applications:

recovering attribution after polygon coverage generalization
determining parentage during polygon overlay
faster and more robust spatial join between polygonal datasets

Cartographic rendering including:

Creating leader lines
Placing labels for polygons (for which it is a quick solution but not necessarily a quality one. More on this in a later post)

There is a variety of ways that have been proposed to compute interior points: triangulation, sampling via random or grid points, medial axis transform, etc. These all involve trade-offs between location quality and performance [2]. JTS uses an approach which optimizes performance, by using a simple scan-line algorithm:

JTS Scan-Line Interior Point Algorithm

Determine a Y-ordinate which is distinct from every polygon vertex Y-ordinate and close to the centre of the vertical extent
Draw a scan line across the polygon and determine the segments of intersection
Choose the interior point as the midpoint of the widest intersection segment

Locating a polygon interior point along a scan-line

The current code has been in the JTS codebase since the release of the very first version back in 2001. It is elegantly simple, but is quite non-optimal, since it uses the overlay intersection algorithm. This is overkill for the computation of the scan-line intersection segments. It also has a couple of serious drawbacks: slow performance, and the requirement that the input polygon be valid. These are not just theoretical concerns. They have been noticed in the user community, and have caused client projects to have to resort to awkward workarounds. It's even documented as a known limitation in PostGIS.

Thanks to Crunchy Data recognizing the importance of geospatial, we've been able to look into fixing this. It turns out that a relatively simple change makes a big improvement. The scan-line intersections can be computed via a linear-time scan of the polygon edges. This works even for invalid input (except for a few pathological situations).
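The linear-time scan can be sketched in a few lines of Python. This is a simplified illustration for a single ring without holes, not the JTS code: the function name interior_point is hypothetical, and the input is assumed to have a non-zero vertical extent.

```python
def interior_point(ring):
    """Scan-line interior point for a polygon ring given as (x, y) tuples.

    Sketch only: handles one ring with no holes and non-zero height.
    """
    ys = sorted({y for _, y in ring})
    centre = (ys[0] + ys[-1]) / 2
    # Scan Y lies midway between the nearest vertex Ys on either side of the
    # centre, so it never coincides with a vertex Y
    lo = max(y for y in ys if y <= centre)
    hi = min(y for y in ys if y > centre)
    scan_y = (lo + hi) / 2

    # One pass over the edges, collecting the X of each scan-line crossing
    xs = []
    n = len(ring)
    for k in range(n):
        (x0, y0), (x1, y1) = ring[k], ring[(k + 1) % n]
        if (y0 < scan_y) != (y1 < scan_y):
            xs.append(x0 + (scan_y - y0) * (x1 - x0) / (y1 - y0))
    xs.sort()

    # Consecutive pairs of crossings bound interior sections; keep the widest
    x_start, x_end = max(zip(xs[0::2], xs[1::2]), key=lambda s: s[1] - s[0])
    return ((x_start + x_end) / 2, scan_y)


u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]
point = interior_point(u_shape)
```

No overlay operation is involved, which is why the approach is both faster and more tolerant of invalid input.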

Interior points of invalid polygons

(LH invalid polygon shows suboptimal point placement)

Best of all, it's much faster - providing performance comparable to the (less useful) centroid computation. The performance results speak for themselves:

Dataset	# polys	# points	Time	Prev time	Improvement
World countries	244	366,951	25 ms	686 ms	x 27
Land Cover	64,090	366,951	78 ms	6.35 s	x 81

This has been committed to JTS. It will be ported to GEOS soon, and from there should show up in PostGIS (and other downstream projects like QGIS, Shapely, GDAL, etc etc).

More Ideas

Some further improvements that could be investigated:

Use the centroid to provide the Y-ordinate. This is probably better in some situations, and worse in others. But perhaps there's a fast way to choose the best one?
Use multiple scan-lines (both vertical and horizontal)
Provide better handling of short/zero-width scan-line intersections
Support clipping the interior point to a rectangle. This would provide better results for cartographic labelling

[1] JTS was based on the original OGC SFS specification (version 1.1). The spec actually does include a method Surface.pointOnSurface. The reason for the different choice of name is lost to time, but possibly the reasoning was that the JTS function is more general, since it handles all types of geometry. (One of the design principles of JTS is Geometric Uniformity, AKA No Surprises. Wherever possible spatial operations are generalized to apply to all geometric types. There is almost always a sensible generalization that can be defined, and it's often quite useful.)

[1a] Also, the acronym IPA is much better than the one for PointOnSurface.

[2] Apparently Oracle has decided to provide blazingly fast performance by simply returning a point on the boundary of the polygon. Thus proving the maxim that for every problem there is a solution which is simple, fast, and dead wrong.

As a gentle introduction to GEOS development I took on the task of porting the recent improvements to the JTS InteriorPointArea algorithm. This is a fairly small chunk o'code, but it touches most aspects of the GEOS development process: build chain, debugging, unit tests, and infra (source control and build farm). It also has the advantage that the JTS code was still hot off the keyboard and (almost) unreleased, so it was a chance to see if any cross-fertilization would blossom from working on the two projects jointly.

Skipping lightly over the details of my (somewhat painful) GEOS learning curve, I'm delighted to say that the code has landed in master and is basking in the green glow from the build bot badges.

And now the whole point of the exercise: how much better is the new code in GEOS?

It exhibits the expected improvement in robustness, since a GEOS test which actually depended on a thrown TopologyException (due to the now-removed call to Geometry::intersection() ) had to be modified to handle a successful return.

Most importantly, there is a dramatic improvement in performance. Here are some numbers from running the GEOS InteriorPointArea performance test:

Data size	Time	Time OLD	Improvement	Time Centroid
100	.8 ms	86 ms	x 100	1 ms
1000	6 ms	144 ms	x 24	12 ms
10,000	55 ms	672 ms	x 12	107 ms
100,000	508 ms	6,714 ms	x 13	961 ms
1,000,000	5,143 ms	73,737 ms	x 14	11,162 ms

Some observations:

The performance test uses synthetic data (sine stars). Real-world datasets are likely to show significantly better times (80x better in some cases, based on JTS timings)
The largest improvement is for small geometries, which is nice since these are more common
InteriorPoint is now actually faster than the Centroid computation. This is also good news, since users were often tempted to try and use centroids instead of interior points, despite the known issues.

Future Work

Running the identical performance test in JTS is still faster, by roughly 5x. This may be due to the advantages of JIT compilation and memory management. It may also indicate there is room for improvement by making GEOS smarter about data handling.

And now for the final chapter in the saga of improving InteriorPoint / PointOnSurface. For those who missed the first two episodes, the series began with a new approach for the venerable JTS Geometry.interiorPoint() algorithm for polygons. Episode 2 travelled deep into the wilds of C++ with a port to GEOS. The series finale shows how this results in greatly improved performance of PostGIS ST_PointOnSurface.

The BC Voting Area dataset is a convenient test case, since it has lots of large polygons (shown here with interior points computed).

The query is about as simple as it gets:

select ST_PointOnSurface(geom) from ebc.voting_area;

Here's the query timings comparison, using the improved GEOS code and the previous implementation:

Data size	Time	Time OLD	Improvement	Time ST_Centroid
5,658 polygons (2,171,676 vertices)	341 ms	4,613 ms	x 13	369 ms

As expected, there is a dramatic improvement in performance. The improved ST_PointOnSurface runs 13 times faster than the old code. And it's now as fast as ST_Centroid. It's also more robust and tolerant of invalid input (although this test doesn't show it).

This should show up in PostGIS in the fall release (PostGIS 3 / GEOS 3.8).

On to the next improvement... (and also gotta update the docs and the tutorial!)

The second-most important criterion for a spatial algorithm is that it be fast. (The most important is that it's correct!) Many spatial algorithms have a simple implementation available, but with performance of O(n²) (or worse). This is unacceptably slow for production usage, since it results in long runtimes for data of any significant size. In JTS a lot of effort has gone into identifying O(n²) performance hotspots and engineering efficient replacements for them.

One long-standing hotspot is the algorithm for computing Euclidean distance between geometries. The obvious distance algorithm is a brute-force O(MxN) comparison between the vertices and edges (facets) of the input geometries. This is simple to implement, but very slow for large inputs. Surprisingly, there seems to be little in the computational geometry literature about more efficient distance algorithms. Perhaps because of this, many geometric libraries provide only the slow brute force algorithm - including JTS (until now).

Happily, it turns out there is a faster approach to distance computation. It uses data structures and algorithms which are already provided in JTS, so it's relatively easy to implement. The basic idea is to build a spatial index on each of the input geometries, and then use a Branch-and-Bound search algorithm to efficiently traverse the index trees to find the minimum distance between geometry facets. This is a generalization of the R-tree Nearest Neighbour algorithm described in the classic paper by Roussopoulos et al. [1].

JTS has the STRtree R-tree index implementation (a packed R-tree using the Sort-Tile-Recursive algorithm). This has recently been enhanced with several kinds of nearest-neighbour searches. In particular, it now supports a method to find the nearest neighbours between two different trees. The IndexedFacetDistance class uses this capability to implement fast distance searching on the facets of two geometries.
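To illustrate the branch-and-bound idea, here is a deliberately simplified Python sketch. It works on two sets of points rather than geometry facets, and it builds a plain binary bounding-box tree instead of an STRtree. The class and function names are hypothetical and this is not the IndexedFacetDistance code. Node pairs are explored in order of the distance between their envelopes, which is a lower bound on the distance between anything they contain.

```python
import heapq
import math
from itertools import count


class Node:
    """Binary bounding-box tree over a non-empty list of (x, y) points."""

    def __init__(self, pts):
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        self.env = (min(xs), min(ys), max(xs), max(ys))
        self.children = ()
        if len(pts) > 1:
            # Split at the median along the wider axis of the envelope
            wide_x = (self.env[2] - self.env[0]) >= (self.env[3] - self.env[1])
            pts = sorted(pts, key=lambda p: p[0 if wide_x else 1])
            mid = len(pts) // 2
            self.children = (Node(pts[:mid]), Node(pts[mid:]))


def env_distance(a, b):
    """Distance between two envelopes (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def indexed_distance(pts_a, pts_b):
    """Minimum distance between two point sets via best-first search."""
    tie = count()  # tie-breaker so the heap never compares Node objects
    root_a, root_b = Node(pts_a), Node(pts_b)
    heap = [(env_distance(root_a.env, root_b.env), next(tie), root_a, root_b)]
    while heap:
        dist, _, a, b = heapq.heappop(heap)
        if not a.children and not b.children:
            # A leaf pair: its envelope distance is the exact point distance,
            # and no remaining pair has a smaller lower bound
            return dist
        if a.children:
            pairs = [(child, b) for child in a.children]
        else:
            pairs = [(a, child) for child in b.children]
        for na, nb in pairs:
            heapq.heappush(heap, (env_distance(na.env, nb.env), next(tie), na, nb))
```

Most node pairs are never expanded, because their lower bound already exceeds the distance of the pair that ends the search.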

Another benefit of this approach is that it allows caching the index of one geometry. This further increases performance in the common case of repeated distance calculations against a fixed geometry.

The performance improvement is impressive. Here are the timings for computing the distance from Antarctica to other world countries:

Source Data size	Target Data size	Time Indexed	Time Brute-Force	Improvement
1 polygon (19,487 vertices)	244 polygons (366,951 vertices)	164 ms	136 s	x 830

Branch-and-bound search also speeds up isWithinDistance queries. Here's a within-distance selection query between another antipodean continent and a large set of small rectangles:

Source Data size	Target Data size	Time	Time Brute-Force	Improvement
1 polygon (7,316 vertices)	100,000 polygons (500,000 vertices)	53 ms	10.03 s	x 19

A small fly in the algorithmic ointment is that Indexed Distance is not always better than the brute-force approach. For small geometries (such as points or rectangles) a simple scan is actually faster, since it avoids the overhead of building indexes. It may be possible to determine a tuning parameter that allows automatically choosing the fastest option. Or the client can choose the faster approach, using knowledge of the use case.

Future Work

A few further ideas to build or investigate:

Implement a caching FastDistanceOp using IndexedFacetDistance and indexed Point-In-Polygon. This can be used to add a fast distance() method to PreparedGeometry
Investigate improving isWithinDistance by using the MINMAXDISTANCE metric for envelopes. This allows earlier detection of index nodes satisfying the distance constraint.
Investigate alternative R-Tree packing algorithms (such as Hilbert packing or sequence packing) to see if they improve performance

[1] Roussopoulos, Nick, Stephen Kelley, and Frédéric Vincent. "Nearest neighbor queries." ACM SIGMOD record. Vol. 24. No. 2. ACM, 1995.

My perspicacious colleague Paul Ramsey says "Postgres is having its Linux moment". There is certainly a buzz around PostgreSQL, evidenced by datapoints such as:

DB-Engines DBMS Of The Year, for the second year running
Stack Overflow's second Most-Loved DBMS (behind Redis, so really the top RDBMS - more on that here)
Hacker News trends shows it pulling way ahead of MySQL
Matt Asay on how Postgres is hip again and why this might be (TLDR: cost and features)
An analysis of why the time has finally come for Postgres

Reasons for this include:

the shift from proprietary to open source as the software model of choice
the rise of cloud-based DB platforms, where the flexibility, power and cost (free!) of Postgres makes it an obvious choice. All the major players now have Postgres cloud offerings, including Amazon, Microsoft and Google.

And happily riding along is PostGIS, bundled with most if not all major PostgreSQL distros. (Note how the Google blog post announcing cloud Postgres highlights a geospatial use case). So it's an exciting time to be able to work on PostGIS at Crunchy Data.

A popular SQL party trick is to generate the Mandelbrot set using Common Table Expressions (CTEs) to implement the required iteration. The usual demo outputs the image using ASCII art:

This is impressive in its own way... but really it's like, so 70's, man.

Here in the 21st century we have better tooling for describing graphics using text - namely, Scalable Vector Graphics (SVG). And best of all, it's built right into modern browsers (finally!). So here's the SQL Mandelbrot set brought up to date with SVG output.

A straightforward conversion of the query is relatively easy. Simply render each cell pixel as an SVG rect element of size 1, and use a grayscale colour scheme. But a couple of improvements produce a much better result:

A more varied colour palette produces a nicer image
Using one SVG element per cell results in a very large file, which is slow to render. There are a lot of repeated pixels in the raster, so a more compact representation is possible

The colour palette is easily improved with a bit of math (modulo cumbersome SQL syntax). Here I use a two-ramp palette, sweeping through shades from black to blue, and then through tints to white.

A simple way of reducing raster size is to use Run-Length Encoding (RLE). This works well with SVG because the rect element can simply be extended by increasing the width attribute. The tricky part is using SQL to merge the rows for contiguous same-value cells. As is often the case, a straightforward procedural algorithm requires some cleverness to accomplish in SQL. It had me stumped for a while. The solution seemed bound to involve window functions, but trying various combinations of the multitudinous options available didn't produce the desired result. Then I realized that the problem is isomorphic to that of merging contiguous date ranges. That is a high-value SQL use case, and there are numerous solutions available. The two that stand out are (as discussed here):

Start-of-Group - this approach uses a LAG function to flag where the group value changes, followed by a running SUM to compute a unique index value for each group (run, in this case). Group rows are then aggregated on the index
Tabibitosan - this is a clever and efficient approach, but is harder to understand and less general
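The three steps of Start-of-Group are easy to see in procedural form. The following Python sketch mirrors them for a single raster row; the function name rle_runs is just for illustration.

```python
from itertools import accumulate, groupby


def rle_runs(row):
    """Run-length encode one raster row as (start_index, end_index, value)."""
    # Step 1 (the LAG step): flag cells whose value differs from the previous cell
    flags = [1 if k == 0 or v != row[k - 1] else 0 for k, v in enumerate(row)]
    # Step 2 (the running SUM step): a running total of flags gives a run id
    run_ids = list(accumulate(flags))
    # Step 3 (the GROUP BY step): aggregate the cells sharing a run id
    runs = []
    cells = zip(run_ids, range(len(row)), row)
    for _, group in groupby(cells, key=lambda c: c[0]):
        group = list(group)
        runs.append((group[0][1], group[-1][1], group[0][2]))
    return runs


runs = rle_runs([5, 5, 5, 2, 2, 7, 5, 5])
```

Each run maps to one SVG rect with x set to the start index and width equal to end - start + 1.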

The solution presented uses Start-of-Group, for clarity. RLE reduces the number of SVG elements to about 12,000 from 160,000, and file size to 1 MB from 11 MB, and hence much faster loading and render time in a web browser.

Here's the output image, with the SQL query producing it below (also available here).

Here's how the query works:

x is a recursive query producing a sequence of integers from 0 to 400 using standard SQL
z is a recursive query creating the Mandelbrot set on a 400x400 grid. A scale and offset maps the grid cell ordinates into the complex plane, centred on the Mandelbrot set. The query computes successive values of the set equation for each cell. A cell is terminated when it is determined that the equation limit is unbounded.
itermax selects the maximum iterations for each cell. This result set contains the final result of the Mandelbrot computation
runstart finds and flags the start of each RLE "run" group for each row of the raster
runid computes an id for each run in each row
rungroup groups all the cells in each run and finds the start and end X index
plot assigns a colour to each run, based on the iteration limit i
the final SELECT outputs the SVG document, with rect elements for each run

WITH RECURSIVE
x(i) AS (
    VALUES(0)
UNION ALL
    SELECT i + 1 FROM x WHERE i <= 400
),
z(ix, iy, cx, cy, x, y, i) AS (
    SELECT ix, iy, x::FLOAT, y::FLOAT, x::FLOAT, y::FLOAT, 0
    FROM
        (SELECT -2.2 + 0.0074 * i, i FROM x) AS xgen(x, ix)
    CROSS JOIN
        (SELECT -1.5 + 0.0074 * i, i FROM x) AS ygen(y, iy)
    UNION ALL
    SELECT ix, iy, cx, cy,

      x*x - y*y + cx AS x, y*x*2 + cy, i + 1
    FROM z
    WHERE x*x + y*y < 16.0
    AND i < 27
),
itermax (ix, iy, i) AS (
    SELECT ix, iy, MAX(i) AS i
    FROM z
    GROUP BY iy, ix
),
runstart AS (
    SELECT iy, ix, I,
    CASE WHEN I = LAG(I) OVER (PARTITION BY iy ORDER By ix)
        THEN 0 ELSE 1 END AS runstart
    FROM itermax
),
runid AS (
    SELECT iy, ix, I,
        SUM(runstart) OVER (PARTITION BY iy ORDER By ix) AS run
    FROM runstart
),
rungroup AS (
    SELECT iy, MIN(ix) ix, MAX(ix) ixend, MIN(i) i
    FROM runid
    GROUP BY iy, run
),
plot(iy, ix, ixend, i, b, g) AS (
    SELECT iy, ix, ixend, i,
    CASE
        WHEN i < 18 THEN (255 * i / 18.0 )::integer
        WHEN i < 27 THEN 255
        ELSE 0 END AS b,
    CASE
        WHEN i < 18 THEN 0
        WHEN i < 27 THEN (255 * (i - 18) / (27 - 18 ))::integer
        ELSE 0 END AS g
    FROM rungroup
    ORDER BY iy, ix
)
SELECT '<svg viewBox="0 0 400 400"'

  || ' style="stroke-width:0" xmlns="http://www.w3.org/2000/svg">'

  || E'\n'
  || string_agg(
'<rect style="fill:rgb('

      || g || ',' || g || ',' || b || ');"'
      || ' x="' || ix || '" y="' || iy
      || '" width="' || ixend-ix+1 || '" height="1" />', E'\n' )
  || E'\n' || '</svg>' || E'\n' AS svg
FROM plot;
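For anyone who finds the recursive CTE hard to follow, the per-cell computation done by z and itermax can be mirrored procedurally. This Python sketch uses the same bailout value (16.0) and iteration cap (27) as the query; the function name escape_count is purely illustrative.

```python
def escape_count(cx, cy, max_iter=27, bailout=16.0):
    """Iterations reached for the point (cx, cy), as in the z and itermax steps."""
    x, y, i = cx, cy, 0
    # Same stopping conditions as the WHERE clause of the recursive term
    while x * x + y * y < bailout and i < max_iter:
        # Both new values are computed from the old x and y, as in the SQL row
        x, y = x * x - y * y + cx, 2 * x * y + cy
        i += 1
    return i


def cell_to_plane(ix, iy):
    """Map grid indices to the complex plane with the query's scale and offset."""
    return (-2.2 + 0.0074 * ix, -1.5 + 0.0074 * iy)


inside = escape_count(0.0, 0.0)   # never escapes, so it reaches the cap
outside = escape_count(2.0, 2.0)  # escapes after a single iteration
```

Cells that reach the cap of 27 fall through to the ELSE branches in plot and are drawn black.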

Since inception JTS has provided two client tools to aid in using the library. They are the TestBuilder and the TestRunner.

The TestBuilder is a GUI tool with many powerful capabilities for loading, editing and visualizing geometry. It also provides the ability to run numerous geometric functions which expose (and in some cases enhance) the JTS library functionality.
The TestRunner is a command-line tool which runs tests in the JTS XML test format.

But there's a gap which these two tools don't fill. It's often required to run JTS operations on geometry data for purposes of testing, debugging or timing operations. This can be done in the TestBuilder, but being a GUI it's a highly manual process, and tedious to repeat multiple times. It is (just) possible to use the TestRunner for this, but that introduces the awkwardness of wrapping the input data in XML. The only other option up until now was to write a Java program, which is overkill for quick tests, and not very accessible for some.

What's really needed is a JTS equivalent of the UNIX expr. It should have the ability to accept geometry inputs, run an operation on them, and output the results. (Another comparison might be to a very small subset of GDAL/OGR, focussed on geometry only - and of course running in Java).

The TestBuilder already provides a rich framework for most of this functionality, so it turned out to be simple to expose this as a command-line tool. Behold - the jtsop command!

jtsop has the following capabilities:

read geometries from files or command-line
input formats include WKT, WKB, GeoJSON, GML and SHP
execute any TestBuilder operation on the geometry input (which includes all JTS Geometry methods)
output the result as WKT, WKB, GeoJSON, GML or SVG
report metrics for geometry and execution times
dynamically load and run geometry functions provided in external Java classes

Examples

Compute the area of a WKT geometry and output it as text

jtsop -a some-geom.wkt -f txt area

Compute the unary union of a WKT geometry and output as WKB

jtsop -a some-geom.wkt -f wkb Overlay.unaryUnion

Compute the union of two geometries in WKT and WKB and output as WKT

jtsop -a some-geom.wkt -b some-other-geom.wkb -f wkt Overlay.union

Compute the buffer of distance 10 of a WKT geometry and output as GeoJSON

jtsop -a some-geom.wkt -f geojson Buffer.buffer 10

Compute the buffer of a literal geometry and output as WKT

jtsop -a "POINT (10 10)" -f wkt Buffer.buffer 10

Output a literal geometry as GeoJSON

jtsop -a "POINT (10 10)" -f geojson

Compute an operation on a geometry and output only geometry metrics and timing

jtsop -v -a some-geom.wkt Buffer.buffer 10

Uses

jtsop is already proving its worth in the JTS development process. Some other use cases come to mind:

Converting geometry between formats (e.g. WKT to GeoJSON)
Filtering and extracting subsets of geometry data (there are various selection functions available to do this)
Computing summary statistics of geometry data

Further Work

There are some interesting enhancements that could be added:

provide options to refine the input such as spatial filtering or geometry type coercion
allow chaining multiple operations together using a little DSL (this is possible via shell piping, but doing this internally would be more efficient). This could use the pipe operator a la Elixir.
include the Proj4J library to allow coordinate system conversions (this can be done now as a simple extension using the dynamic function loading, but it would be nice to have it built in)
output geometry as images (using the TestBuilder rendering pipeline)

The operation of buffering a geometry is a core geospatial concept. Standard buffers are computed using a fixed distance around the input geometry. JTS has provided a buffer implementation since its inception, and this is used in GEOS and all the other downstream projects as well.

One way to generalize the buffer concept is to allow the buffer distance to vary along the geometry. As is often the case with geospatial concepts this construction has a few different names, including tapered buffer, cone buffer, varying buffer, and variable-distance buffer. The classic use case for variable-distance buffers is generating polygons for the "cone of uncertainty" along predicted hurricane tracks:

Another use case is depicting rivers showing the width of reaches along the river course.

I took a crack at prototyping "variable width buffers" a few years ago, in the JTS Lab. Recently there were a couple of GIS StackExchange posts looking for this functionality in PostGIS and Shapely. They were good motivation to buff up the prototype and move it into JTS core. But I was dismayed to realize that the output had some serious deficiencies, due to an overly-simplistic algorithm. The idea in the Lab code was appealingly simple: compute buffer circles around each line vertex at the specified distance, and union (merge) them with trapezoids computed around the connecting line segment. But this produced ugly discontinuities for large deltas in buffer distances.

For such a seemingly simple concept there is surprisingly little prior art to be found. Even the GIS That Shall Not Be Named does not seem to provide a native variable buffer function (although recently I found a couple of user contribs which do, to some extent). There's an implementation in QGIS - but since it seems to be based on the original problematic JTS code that didn't really help. So I had the fun of coding it up from scratch.

The problem with the original code is that it should have used the outer tangent lines to the buffer circles at each vertex. Wikipedia has a good discussion of this construction. Even better, it provides an elegant mathematical algorithm, which worked perfectly when coded up.

The construction computes a single tangent segment. The opposite segment is simply the reflection in the line between the circle centres. The geometric math to handle this is now provided for reuse as LineSegment.reflect(). Finally, it is notoriously tricky to produce buffer curves with high quality in the fine details. In this case, generating the line segment buffer caps required care to avoid creating out-of-phase vertices, which would produce a "bumpy" result when merged.
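As a rough illustration of the geometry, here is a Python sketch of the outer tangent construction for two circles, with the opposite tangent obtained by reflection. It is not the VariableBuffer code: the function names are hypothetical, and reflect here is only a stand-in for what LineSegment.reflect() provides. The key fact is that the unit normal n of an outer tangent line makes an angle with the centre-to-centre direction whose cosine is (r1 - r2) / d, where d is the distance between centres.

```python
import math


def outer_tangent(c1, r1, c2, r2):
    """Endpoints of one outer tangent segment of two circles, or None.

    Returns None when one circle contains the other (no outer tangent).
    """
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    d = math.hypot(dx, dy)
    if d <= abs(r1 - r2):
        return None
    ux, uy = dx / d, dy / d          # unit vector from c1 to c2
    cos_a = (r1 - r2) / d
    sin_a = math.sqrt(1.0 - cos_a * cos_a)
    # Rotate the unit vector by the angle to get the tangent line's normal
    nx = ux * cos_a - uy * sin_a
    ny = ux * sin_a + uy * cos_a
    return ((c1[0] + r1 * nx, c1[1] + r1 * ny),
            (c2[0] + r2 * nx, c2[1] + r2 * ny))


def reflect(p, a, b):
    """Reflect point p in the line through a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    fx, fy = a[0] + t * dx, a[1] + t * dy   # foot of the perpendicular
    return (2 * fx - p[0], 2 * fy - p[1])


c1, c2 = (0.0, 0.0), (5.0, 0.0)
p1, p2 = outer_tangent(c1, 2.0, c2, 1.0)
# The opposite tangent segment is the mirror image in the line of centres
q1, q2 = reflect(p1, c1, c2), reflect(p2, c1, c2)
```

The two tangent segments, together with arcs of the vertex circles, outline the buffer of one line segment.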

This is now available in JTS as the VariableBuffer class. In addition to the general-purpose API which allows specifying a distance at each vertex, it provides a couple of simpler functions which accept a start and end distance; and a start, middle and end distance. The first is an easy way to produce the classic cone of uncertainty. The latter is useful for buffering rings, or just creating interesting shapes.

The next step is to port this to GEOS. Then it can be exposed in PostGIS and all the other downstream projects like Shapely and R-SF, and maybe even QGIS.

Future Work

There are some enhancements that could be made:

Some or all of the buffer end cap and join styles provided by the classic buffer operation could be supported, as well as the ability to set the quadrant segment count
Variable buffering of polygons and multigeometries is straightforward to add.
It might be possible to allow different distances for each side of the input line (although it may be tricky to get the joins right)
The approach of using union to merge the individual segment buffers works well. But it would improve performance to feed the entire generated variable buffer curve directly into the existing buffer generation code

More experimentally, variable buffering might provide a way to construct a decent approximation to a true geodetic buffer. The vertex distances would be computed using true geodetic distance. There might be some complications around long line segments, but perhaps these can be finessed (e.g. by densification).

The JTS TestBuilder is a great tool for creating geometry, processing it with JTS spatial functions, and visualizing the results. It has powerful capabilities for inspecting the fine details of geometry (such as the Reveal Topology mode). I've often thought it would be handy if there was a similar tool for PostGIS. Of course QGIS excels at visualizing the results of PostGIS queries. But it doesn't offer the same simplicity for creating geometry and passing it into PostGIS functions.

This is the motivation behind a recent enhancement to the TestBuilder to allow running external (system) commands that return geometry output. The output can be in any text format that TestBuilder recognizes (currently WKT, WKB and GeoJSON). It also provides the ability to encode the A and B TestBuilder input geometries as literal WKT or WKB values in the command. The net result is the ability to run external geometry functions just as if they were functions built into the TestBuilder.

Examples

Running PostGIS spatial functions

Combined with the versatile Postgres command-line interface psql, this allows running a SQL statement and loading the output as geometry. Here's an example of running a PostGIS spatial function. In this case a MultiPoint geometry has been digitized in the TestBuilder, and processed by the ST_VoronoiPolygons function. The SQL output geometry is displayed as the TestBuilder result.

The command run is:

/Applications/Postgres.app/Contents/Versions/latest/bin/psql -qtA -c "SELECT ST_VoronoiPolygons('#a#'::geometry);"

Things to note:

the full path to psql is needed because the TestBuilder processes the command using a plain sh shell. (It might be possible to improve this.)
The psql options -qtA suppress messages, table headers, and column alignment, so that only the plain WKB of the result is output
The variable #a# has the WKT of the A input geometry substituted when the command is run. This is converted into a PostGIS geometry via the ::geometry cast. (#awkb# can be used to supply WKB, if full numeric precision is needed)

Loading data from PostGIS

This also makes it easy to load data from PostGIS to make use of TestBuilder geometry analysis and visualization capabilities. The query can be any kind of SELECT statement, which makes it easy to control what data is loaded. For large datasets it can be useful to draw an Area of Interest in the TestBuilder and use that as a spatial filter for the query. The TestBuilder is able to load multiple textual geometries, so there is no need to collect the query result into a single geometry.

Loading data from the Web

Another use for commands is to load data from the Web, by using curl to retrieve a dataset. Many web spatial datasets are available in GeoJSON, which loads fine into the TestBuilder. Here's an example of loading a dataset provided by an OGC Features service (pygeoapi):

Command Panel User Interface

The Command panel provides a simple UI to make it easier to work with commands. Command text can be pasted and cleared. A history of commands run is recorded for the TestBuilder session. Recorded commands in the session can be recalled via the Previous and Next Command buttons.
Buttons are provided to insert substitution variable text.

To help debug incorrect command syntax, error output from commands is displayed.

It can happen that a command executes successfully, but returns output that cannot be parsed. This is indicated by an error in the Result panel. A common cause of this is that the command produces logging output as well as the actual geometry text, which interferes with parsing. To aid in debugging this situation the command window shows the first few hundred characters of command output. The good news is that many commands offer a "quiet mode" to provide data-only output.

Unparseable psql output due to presence of column headers. The psql -t option fixes this.

If you find an interesting use for the TestBuilder Command capability, post it in the comments!

There is often a need to find a point which is guaranteed to lie in the interior of a polygon. Uses include placing cartographic labels, and using the point as a proxy for polygon containment or overlap (such as in polygon overlay).

There are several ways to compute a "centre point" for a polygon. The simplest is the polygon centroid, which is the Center of Mass of the polygon area. This has a straightforward O(N) algorithm, but it has the significant downside of not always lying inside the polygon! (For instance, the centroid of a "U" shape lies outside the shape). This makes it unsuitable as an interior point algorithm.
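To make the "U" shape problem concrete, here is a minimal sketch in plain Python (not JTS code). The function names and the example coordinates are my own choices for illustration. It computes the area-weighted centroid with the usual O(N) formula and then checks it with a simple ray-crossing test.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon ring, in O(N)."""
    area2 = 0.0          # twice the signed area
    cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3.0 * area2), cy / (3.0 * area2))


def point_in_ring(pt, ring):
    """Even-odd ray-crossing test (boundary points are not treated specially)."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x_cross > x:
                inside = not inside
    return inside


# A "U" shape: 3 wide and 3 tall, with a 1-wide slot cut down from the top
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]
centroid = polygon_centroid(u_shape)
centroid_is_inside = point_in_ring(centroid, u_shape)
```

For this shape the centroid lands in the slot between the two arms, so `centroid_is_inside` is false.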

JTS provides the InteriorPoint algorithm, which is guaranteed to return a point in the interior of a polygon. This works as follows:

Determine a horizontal scan line on which the interior point will be located. To increase the chance of the scan line having non-zero-width intersection with the polygon the scan line Y ordinate is chosen to be near the centre of the polygon's Y extent but distinct from all vertex Y ordinates.
Compute the sections of the scan line which lie in the interior of the polygon.
Choose the widest interior section and take its midpoint as the interior point.
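The three steps above can be sketched in a few lines of Python. This is a simplified illustration and not the JTS implementation: it assumes a single ring with no holes and non-zero height, and the way the scan line Y is picked (midway between the nearest vertex Y values below and above the centre) is one reasonable reading of the description.

```python
def interior_point(ring):
    """Scan-line interior point for a single ring without holes (simplified)."""
    ys = sorted({y for _, y in ring})
    centre = (ys[0] + ys[-1]) / 2.0
    # Nearest vertex Y at or below the centre, and nearest strictly above it.
    lo = max(y for y in ys if y <= centre)
    hi = min(y for y in ys if y > centre)
    scan_y = (lo + hi) / 2.0     # lies strictly between vertex Y values

    # X ordinates where ring edges cross the scan line
    xs = []
    n = len(ring)
    for i in range(n):
        (x0, y0), (x1, y1) = ring[i], ring[(i + 1) % n]
        if (y0 > scan_y) != (y1 > scan_y):
            xs.append(x0 + (scan_y - y0) * (x1 - x0) / (y1 - y0))
    xs.sort()

    # Consecutive pairs of crossings bound the interior sections; keep the widest
    left, right = max(zip(xs[0::2], xs[1::2]), key=lambda s: s[1] - s[0])
    return ((left + right) / 2.0, scan_y)


u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]
label_pt = interior_point(u_shape)
```

On the "U" shape the scan line crosses both arms. The two interior sections have equal width, so the midpoint of the first one is returned.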

This works perfectly for finding proxy points, and usually produces reasonable results as a label point.

However, there are polygons for which the above algorithm finds points which lie too close to the boundary for a label point. A better choice would be the point that lies farthest from the edges of the polygon. The geometric term for this construction is the Maximum Inscribed Circle. The farthest point is the center of the circle.

Comparison of center points for a not-so-typical polygon

In the geographic domain this is romantically termed the Pole of Inaccessibility.

Pole Of Inaccessibility in Canada

This point occurs at a node of the medial axis of the polygon, so in theory all that is needed is to compute the medial axis and test the set of node points. However, medial axis algorithms are notoriously difficult to implement, and can be expensive to compute. So it's appealing to look for a simple and fast way to compute a good approximation to the Maximum Inscribed Circle center. There have been various approaches to this, including a geodetic grid-based approach by Garcia-Castellanos & Lombardo, and one by Martinez using random point distributions. Recently Mapbox released a clever implementation which uses successive refinement of a grid along with a branch-and-bound technique to reduce the amount of searching needed.

JTS now has a version of this algorithm, called MaximumInscribedCircle. It significantly improves performance by using spatial indexing techniques for both polygon interior testing and distance computation. This makes it very fast to find the MIC even for large, complex polygons. Performance is key for computing label points, since it is likely to be used for many polygons on a typical map.
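For readers who want to see the shape of the grid-refinement idea, here is a rough Python sketch. It is not the JTS or Mapbox code. It handles a single ring without holes, uses brute-force distance and containment tests where JTS uses spatial indexes, and all function names are made up for this example.

```python
import heapq
import math


def _seg_dist(p, a, b):
    """Distance from point p to segment ab."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def signed_boundary_distance(p, ring):
    """Distance from p to the ring boundary: positive inside, negative outside."""
    x, y = p
    inside = False
    dist = math.inf
    n = len(ring)
    for i in range(n):
        a, b = ring[i], ring[(i + 1) % n]
        if (a[1] > y) != (b[1] > y) and x < a[0] + (y - a[1]) * (b[0] - a[0]) / (b[1] - a[1]):
            inside = not inside
        dist = min(dist, _seg_dist(p, a, b))
    return dist if inside else -dist


def max_inscribed_circle(ring, tolerance):
    """Return (centre, radius) of an approximate maximum inscribed circle."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y

    def cell(cx, cy, half):
        d = signed_boundary_distance((cx, cy), ring)
        # No point in the cell can be farther inside than d + half * sqrt(2)
        return (-(d + half * math.sqrt(2)), d, cx, cy, half)

    # Start with one square cell covering the polygon envelope
    queue = [cell(min_x + width / 2.0, min_y + height / 2.0, max(width, height) / 2.0)]
    best_d, best_c = -math.inf, None
    while queue:
        neg_bound, d, cx, cy, half = heapq.heappop(queue)
        if d > best_d:
            best_d, best_c = d, (cx, cy)
        # Branch-and-bound: the most promising cell cannot beat the best by
        # more than the tolerance, so neither can any remaining cell
        if -neg_bound - best_d <= tolerance:
            break
        q = half / 2.0
        for sx in (-q, q):
            for sy in (-q, q):
                heapq.heappush(queue, cell(cx + sx, cy + sy, q))
    return best_c, best_d


square_centre, square_radius = max_inscribed_circle([(0, 0), (10, 0), (10, 10), (0, 10)], 0.01)
```

The priority queue always expands the cell with the largest possible distance first, which is what lets the search stop early. Note that the termination tolerance has to be supplied by the caller.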

Grid refinement to find Maximum Inscribed Circle

An interesting property of the MIC is that its radius is the distance at which the negative buffer of the polygon disappears (becomes empty). Thus the MIC radius length is a measure of the "narrowness" of a polygon. This is often useful for purposes of simplification or data cleaning, to remove narrow polygonal artifacts in data.

Sequence of negative buffers containing the Maximum Inscribed Circle center

And, as the infomercials say, that's not all! If you act today you also get a free implementation of the Largest Empty Circle! The Largest Empty Circle is defined for a set of geometric obstacles. It is the largest circle that can be constructed whose interior does not intersect any obstacle and whose center lies in the convex hull of the obstacles. The obstacles can be points, lines or polygons (although only the first two are currently implemented in JTS). Classic use cases for the Largest Empty Circle are in logistics to find a location for a new chain store in a set of store locations; or to find the largest roadless area in environmental planning.

It turns out that the LEC can be computed by essentially the same algorithm as the MIC, with a few small changes. And of course it also uses spatial indexing to provide excellent performance.

Largest Empty Circles for point and line obstacles

Maximum Inscribed Circle and Largest Empty Circle are now in JTS master, and will be released in the upcoming version 1.17.

Further Improvements

There are some useful enhancements that can be made:

For Maximum Inscribed Circle, allow a second polygonal constraint. This supports finding a label point within a view window rectangle.
For Largest Empty Circle, allow a client-defined boundary polygon. This allows restricting the circle to lie within a tighter bound than the convex hull
For both algorithms, it should be feasible to automatically determine a termination tolerance

In the JTS Topology Suite, Overlay is the general term used for the binary set-theoretic operations intersection, union, difference and symmetric difference. These operations accept two geometry inputs and construct a geometry representing the operation result. Along with spatial predicates and buffer they are the most important functions in the JTS API.

Intersection of MultiPolygons

Overlay operations are used in many kinds of spatial processes. Any system aspiring to provide full-featured geometry processing simply has to provide overlay operations. In fact, many geometry libraries exist solely to provide implementations of overlay. Notable libraries include the ESRI Java API, Clipper and wagyu. Some of these provide overlay only for polygons, which is the most difficult case to compute.

Overlay in JTS

The JTS overlay algorithm supports the full OGC SFS geometry model, allowing any combination of polygons, lines and points as input. In addition, JTS provides an explicit precision model, to allow constraining output to a desired precision. The overlay algorithm is also available in C++ in GEOS. There it provides overlay operations in numerous systems, including PostGIS, QGIS, Shapely, and r-sf. This codebase has had a long lifespan; it was developed back in 2001 for the very first release of JTS, and while there have been improvements over the years the core of the design has remained unchanged.

However, there are some long-standing issues with JTS overlay. The most serious one is that in spite of much valiant effort over the years, overlay is not fully robust. The constructive nature of overlay operations makes them particularly susceptible to the robustness issues which are notorious in geometric algorithms using floating-point numerics. It can happen that running an overlay operation on seemingly innocuous, valid inputs results in the dreaded TopologyException being thrown. There is a steady trickle of issue reports about this in JTS, and even more for GEOS (such as here, here and here...).

Another issue is that the codebase is complex, and thus hard to debug and modify. Partly this is because of the diversity of inputs and the explicit precision model. To support this the JTS overlay algorithm has a rich and detailed semantics. But some of the complexity is due to the original design of the code. This makes it difficult to incorporate new ideas for improvements in performance and robustness.

Next Generation Overlay

So for many years it's been on my mind that JTS overlay needs a thorough overhaul. I chipped away at the problem over time, but it was clear that it was going to be a major effort. Now, thanks to the support of my employer Crunchy Data, I've at last been able to focus on a complete rewrite of the JTS overlay module. It's called OverlayNG.

The basic algorithm remains the same:

Extract the input linework, and node it together
Build a topology graph from the noded linework
Compute a full topological labelling of the graph
Extract the resultant polygons, lines and points from the graph

This algorithm is time-tested and is able to handle the complexities of multiple geometry types and topology collapse. The new codebase benefits from 20 years of experience to become simpler and more modular, with increased testability, and potential for reuse.

OverlayNG has the following improvements:

A snap-rounding noder is available to make overlay fully robust. This eliminates the possibility of TopologyExceptions (when an appropriate precision model is used).

Intersection operation with Snap-rounding

and Topology Collapse removal

Snap-rounding allows full support for specifying the output precision model. The precision model can be specified independently for each overlay call, which is more flexible and easier to use. The use of snap-rounding also provides fully valid precision reduction for geometries. This makes it feasible for the first time to fully operate in a fixed-precision regime.

Precision Reduction turned all the way up to 11

Significant performance optimizations are included (notably, one which makes polygon intersection much faster in many cases)

Intersection of a MultiPolygon with a grid (7x faster with OverlayNG)

Pluggable noding allows providing different noding strategies. One use is to run OverlayNG with the original floating-point noder, which is faster than snap-rounding (but of course has the robustness issues noted above). Another is to use a special-purpose noder to provide very fast polygonal coverage union.

Union of a polygonal coverage (10x faster with OverlayNG)

A modular and cleaner codebase allows easier testability, maintenance, enhancement and reuse. A winged-edge graph model is used for the topology graph. This is simpler and less memory intensive.
The rebuild gives an opportunity to make some semantic improvements:

Empty results are returned as empty atomic geometries of appropriate type, rather than awkward-to-handle empty GeometryCollections
Linear output is merged node-to-node. This gives union a more natural and useful semantic

A benefit of the new codebase is that it is easier to enhance and extend. For example, it should be straightforward to finally provide a SplitPolygon function for JTS. Another potential extension is overlay for Polygonal Coverages.

Code that is so widely used needs to be thoroughly tested against real-world workloads. Initially OverlayNG will be released as a separate API in JTS. This allows it to be used along with the original overlay. It can be used as a fallback for cases which fail in the original overlay process. Once the new code has been proved out in real world use, it is likely to become the standard overlay code path. Also, the code will be ported to GEOS soon, where we're hoping it will provide significant benefits to the many systems that use GEOS.

I'll be posting more articles about aspects of OverlayNG soon. The code is almost ready to release, after some final testing. In the meantime, the pre-release code is available in a Git branch. It would be great to get as much beta-testing as possible before final release, so try it out and log some feedback!

In a previous post I unveiled the exciting new improvement in the JTS Topology Suite called OverlayNG. This new implementation provides significant improvements to the core function of spatial overlay. Overlay supports computing the set-theoretic boolean operations of intersection, union, difference, and symmetric difference over all geometric types.

One of the design goals of JTS is to create modular, reusable data structures and processes to implement spatial algorithms. This increases development velocity and testability, and makes algorithms easier to understand. In spatial algorithms it is not always obvious how to identify appropriate abstractions for reuse, so this is an on-going effort of design and refactoring.

After the implementation of spatial overlay in the very first release of JTS, it became clear that overlay can be split into the following phases:

Noding, in which a set of possibly-intersecting linestrings is converted to an arrangement in which linestrings touch only at endpoints
Topology Analysis, during which the topology graph of the noded arrangement is determined
Result Extraction, in which the geometric components of the desired result are extracted from the topology graph

It also became clear that the Noding phase is critical, since it determines the overall performance and robustness of the overlay computation. Moreover, tradeoffs between these two qualities can be made by using different noding strategies. For instance, the "classic" JTS noding approach is fast, but susceptible to robustness issues. Alternatively, noding using the well-known snap-rounding paradigm is slower, but can be made fully robust.

To encapsulate this concept, JTS introduced the Noder API. Since it post-dated the original overlay code, using it in overlay had to await a reworking of that codebase. The OverlayNG project provided this opportunity. OverlayNG allows supplying a specific Noder class to be used during overlay.

One of the main goals of the OverlayNG project was to develop a noder to provide fully robust noding. This would eliminate the notorious TopologyException errors which bedevil the use of overlay. The effort has paid off with the development of not one, but two new noders. The Snapping Noder has very good performance and (with the addition of some heuristics, and so far as is known) provides robust full-precision evaluation. And the Snap-Rounding noder provides guaranteed robustness as well as the ability to enforce a fixed-precision model for output.

So now OverlayNG can be run with the following suite of noders, depending on use case. The images show the result of intersection and union on the following geometries:

Fast Full-Precision Noder

The MCIndexNoder noding strategy has been available since the early days of JTS. It has very good performance due to the use of monotone chains and the STRtree spatial index. However, it is a relatively simple algorithm which due to numerical robustness issues does not always produce a valid noding. In overlay it is always used in conjunction with a noding validator, so that noding failure can be detected and an alternative strategy used to perform the operation successfully.

Intersection and Union with full-precision floating noding

Snapping Noder

The SnappingNoder is a refinement of the MCIndexNoder which snaps existing input and computed intersection vertices together, if they are closer than a snap distance tolerance. This dramatically improves the robustness of the noding, with only minor impact on performance.

Noding robustness issues are generally caused by nearly coincident line segments, or by very short line segments. Snapping mitigates both of these situations. The choice of snap tolerance is a heuristic one. Generally, a smaller snap distance has less chance of distorting the topology, but it needs to be large enough to resolve intersection computation imprecision. In practice, excellent robustness is provided by using a very small snap distance (e.g. a factor of 10^12 smaller than the geometry magnitude).
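As a small worked example of that rule of thumb, here is a hedged Python sketch. The helper names are invented, "geometry magnitude" is taken to mean the largest absolute ordinate value, and the snapping is a naive linear scan rather than whatever index the SnappingNoder uses internally.

```python
import math


def snap_tolerance(coords, factor=1e12):
    """Snap distance as a fixed fraction of the largest ordinate magnitude."""
    magnitude = max(max(abs(x), abs(y)) for x, y in coords)
    return magnitude / factor


def snap_vertex(pt, existing, tol):
    """Return an existing vertex within tol of pt if there is one, else pt."""
    for v in existing:
        if math.hypot(pt[0] - v[0], pt[1] - v[1]) <= tol:
            return v
    return pt


# Coordinates with magnitude around 10^6 (e.g. a projected CRS in metres)
vertices = [(500000.0, 250000.0), (1000000.0, 750000.0)]
tol = snap_tolerance(vertices)      # about 10^-6, i.e. a micron if units are metres

# A computed intersection point that differs from a vertex only by round-off
near = snap_vertex((500000.0000001, 250000.0), vertices, tol)
far = snap_vertex((500000.001, 250000.0), vertices, tol)
```

Here `near` is snapped onto the existing vertex, while `far` is a millimetre away and is left untouched.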

Snapping of course risks creating topology collapses, but OverlayNG is designed to handle these correctly. However, there are occasional situations where the snapped arrangement is too invalid to be handled. This can be detected, and with some simple heuristic adjustments (e.g. a more aggressive snap distance) the overlay can be rerun. This strategy has proven to be fully robust in all cases tried so far.

Intersection and Union with Snapping Noding (snap tolerance = 0.5)

Snap-Rounding Noder

The SnapRoundingNoder implements the well-known snap-rounding paradigm. It provides fully robust noding by rounding and snapping linework to a fixed-precision grid. This has the unavoidable effect of rounding every output vertex to the precision grid. This may or may not be desirable depending on the situation. A useful side effect is that it provides an effective means of reducing the precision of geometries in a topologically valid way.
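The rounding step itself is simple arithmetic. The sketch below (plain Python, with helper names of my own) shows how a precision scale factor maps ordinates onto the grid: a scale of 1 gives a grid of size 1, and a scale of 10 gives a grid of size 0.1. It only illustrates the vertex rounding, not the noding of hot pixels that makes snap-rounding robust.

```python
import math


def make_precise(value, scale):
    """Round an ordinate to the nearest multiple of 1/scale (halves round up)."""
    return math.floor(value * scale + 0.5) / scale


def round_coords(coords, scale):
    return [(make_precise(x, scale), make_precise(y, scale)) for x, y in coords]


line = [(0.4, 0.6), (1.26, 3.74), (2.5, -1.5)]
on_unit_grid = round_coords(line, 1)     # [(0.0, 1.0), (1.0, 4.0), (3.0, -1.0)]
on_tenth_grid = round_coords(line, 10)   # [(0.4, 0.6), (1.3, 3.7), (2.5, -1.5)]
```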

In the early stages of OverlayNG design and development I expected that snap-rounding would be required to ensure fully-robust overlay, in spite of the downside of fixed-precision output. But the development of the SnappingNoder and accompanying heuristics means that this noder need only be used when control over overlay output precision is desired.

Using an appropriate precision model is a highly worthwhile goal in spatial data management, since it reduces the amount of memory needed to represent data, and improves robustness and portability. This is unfortunately often neglected, mostly due to lack of tools available to enforce it. Hopefully this capability will encourage users to maintain a precision model which is better matched to the true precision of their data.

Intersection and Union with Snap-Rounding noding (precision scale = 1)

Segment Extracting Noder

This is a special-purpose noder which is really more of a "non-noder". It simply extracts every line segment in the input. It is used on geometry collections which form valid, fully-noded, non-overlapping polygonal coverages. When used with OverlayNG, this has the effect of dissolving the duplicate line segments and producing the union of the input coverage. By taking advantage of the structure inherent in the coverage model the SegmentExtractingNoder offers very fast performance. It can also operate on fully-noded linear networks.

Union of a Polygonal Coverage with SegmentExtractingNoder

The support for pluggable noding and the development of a suite of fast and/or robust noders constitutes the biggest advance of the OverlayNG code. It finally allows JTS to provide fully robust noding and true support for a fixed-precision model! This has been a dream of mine for more than a decade. It's good to think that the end of the era of TopologyException issues is in sight!

This is another in a series of posts about the new OverlayNG algorithm being developed for the JTS Topology Suite. (Previous ones are here and here). Overlay is a core spatial function which allows computing the set-theoretic boolean operations of intersection, union, difference, and symmetric difference over all geometry types. OverlayNG introduces significant improvements in performance, robustness and code design.

JTS has always provided the ability to specify a fixed-precision model for computing geometry constructions (including overlay). This ensures that output coordinates have a defined, limited precision. This can reduce the size of data transfers and storage, and generally leads to cleaner, simpler geometric output. The original overlay implementation had some issues with robustness, which were exacerbated by using fixed-precision. One of the biggest improvements in OverlayNG is that fixed-precision overlay is now guaranteed to be fully robust. This is achieved by using an implementation of the well-known snap-rounding noding paradigm.

Geometric algorithms which operate in a fixed-precision model can encounter situations called topology collapse. This happens when line segments and points become coincident due to vertices or intersection points being rounded. The OverlayNG algorithm detects occurrences of topology collapse and transforms them into valid topology in the overlay result.
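A tiny example shows how a collapse arises. In this Python sketch (illustrative only, with made-up helper names) a thin sliver triangle is rounded to a precision grid of size 1. All three vertices end up on the same horizontal line, so the ring encloses zero area and is no longer a valid polygon.

```python
import math


def round_ring(ring, scale):
    """Round every vertex to the nearest multiple of 1/scale."""
    return [(math.floor(x * scale + 0.5) / scale,
             math.floor(y * scale + 0.5) / scale) for x, y in ring]


def ring_area(ring):
    """Unsigned area via the shoelace formula."""
    n = len(ring)
    total = 0.0
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


sliver = [(0.0, 0.0), (10.0, 0.3), (20.0, 0.0)]
rounded = round_ring(sliver, 1)          # the apex drops from y = 0.3 to y = 0

area_before = ring_area(sliver)          # about 3.0
area_after = ring_area(rounded)          # 0.0: the triangle collapsed to a line
```

An overlay algorithm working in fixed precision has to notice situations like this and decide what to emit in their place.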

Topology collapse during overlay with a fixed precision model

As a bonus, handling topology collapse during the overlay process also allows it to be tolerated when present in the original input geometries. This means that some kinds of "mildly" invalid geometry (according to the OGC model) are acceptable as input. Invalid geometry is transformed to valid geometry during the overlay process.

Specifically, input geometry may contain the following situations, which are invalid in the OGC geometry model:

A ring which self-touches at discrete points (the so-called "inverted polygon" or "exverted hole")
A ring which self-touches in one or more line segments
Rings which touch other ones along one or more line segments

Note that this does not extend to handling polygons that overlap, rather than simply touch. These are "strongly invalid", and will trigger a TopologyException during overlay.

An interesting use for this capability is to process individual geometries. By simply computing the union of a single geometry the geometry is transformed into an OGC-valid geometry. In this way OverlayNG functions as a (partial) "MakeValid" operation.

A polygon which self-touches in a line transforms to a valid polygon with a hole

A polygon which self-touches in a point transforms to a valid polygon with a hole

A collection of polygons which touch in lines transforms to a valid polygon with a hole

Moreover, some spatial systems use geometry models which do not conform to the OGC semantics. Some systems (such as ArcGIS) actually specify the use of inverted polygons and exverted holes in their topology model. And in days of yore there were systems which were unable to model holes explicitly, and so used a "connected hole" topology hack (AKA "lollipop holes".) This represented holes as an inversion connected by a zero-width corridor to the polygon shell. Both of these models are accepted by OverlayNG. Thus it provides a convenient way to convert from these non-standard models into OGC-valid topology.

This is one more reason why overlay is the real workhorse of spatial data processing!

I'm happy to say that OverlayNG has landed on the JTS master branch! This is the culmination of over a year of work, and even more years of planning and design. It will appear in the upcoming JTS 1.18 release this fall.

As described in previous posts OverlayNG brings substantial improvements to overlay computation in JTS:

A completely new codebase provides greater clarity, maintainability, and extensibility
Pluggable noding supports various kinds of noding strategies including Fast Full-Precision Noding, Snapping and Snap-Rounding.
Optimizations are built-in, including new ones such as Ring Clipping and Line Limiting.
Additional functionality including Precision Reduction and Fast Coverage Union

All of these improvements are encapsulated in the new OverlayNGRobust class. It provides fully robust execution with excellent performance, via automated fallback through a series of increasingly robust noding strategies. This should solve a host of overlay issues reported over the years in various downstream projects such as GEOS, PostGIS, Shapely, R-sf and QGIS. (Many of these cases have been captured as XML tests for overlay robustness, to ensure that they are handled by the new code).
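The fallback idea is easy to picture as a loop over strategies. The Python sketch below is purely schematic: the exception class, the function names and the toy strategies are all hypothetical stand-ins, and the order shown is only meant to illustrate "fastest first, most robust last".

```python
class TopologyError(Exception):
    """Hypothetical stand-in for a noding or topology failure."""


def overlay_with_fallback(a, b, strategies):
    """Try each (name, function) pair in order; return (name, result) of the first success."""
    last_error = None
    for name, run in strategies:
        try:
            return name, run(a, b)
        except TopologyError as err:
            last_error = err     # remember the failure and try the next strategy
    raise last_error if last_error else ValueError("no strategies supplied")


# Toy strategies standing in for real noders
def fast_full_precision(a, b):
    raise TopologyError("noding failed validation")


def snapping(a, b):
    return "result of %s and %s" % (a, b)


def snap_rounding(a, b):
    return "rounded result"


used, result = overlay_with_fallback("A", "B", [
    ("full-precision", fast_full_precision),
    ("snapping", snapping),
    ("snap-rounding", snap_rounding),
])
```

In this toy run the first strategy fails, so the result comes from the second one and the third is never called.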

Initially the idea was to use OverlayNG as an opportunity to simplify and improve the semantics of overlay output, including:

Sewing linear output node-to-node (to provide a full union)
Ensuring output geometry is homogeneous (to allow easy chaining of overlay operations)
Omitting lines created by topology collapse in polygonal inputs

In the end we decided that this change would have too much impact on existing tests and downstream code, so the default semantics are the same as the previous overlay implementation. However, the simplified semantics are available as a "strict" mode for overlay operations.

At the moment OverlayNG is not wired in to the JTS Geometry overlay operations, but is provided as a separate API. The plan is to provide a runtime switch to allow choosing which overlay code is used. This will allow testing in-place while avoiding potential impact on production systems. GeometryPrecisionReducer has been changed to use OverlayNG with Snap-Rounding, to provide more effective precision reduction.

GEOS has been tracking the OverlayNG codebase closely for a while now, which has been valuable to finalize the overlay semantics, and for finding and fixing issues. Having the code in JTS master gives the green light for downstream projects to do their own testing as well. There have been a few issues reported:

A minor copy-and-paste variable name issue in HotPixel (which did not actually cause any failures, since it was masked by other logic)
A clarification about the new behaviour of GeometryPrecisionReducer, revealed by an NTS test
Most notably, a serious performance issue with Snap-Rounding of large geometries was identified and fixed by a member of the NTS project. This is quite interesting, so I'll discuss it in detail in another post.

After years of designing and developing improvements to overlay, it's great to see OverlayNG make its debut. Hopefully this will be the end of issues involving the dreaded TopologyException. And the new design will make it easier to build other kinds of overlay operations, including things like fast line clipping, split by line, coverage overlay... so stay tuned!

Now that OverlayNG has landed on JTS master, it is getting attention from downstream projects interested in porting it or using it. One of the JTS ports is the Net Topology Suite (NTS) project, and it is very proactive about tracking the JTS codebase. Soon after the OverlayNG commit an NTS developer noticed an issue while running the performance tests in JTS: the InteriorPointInAreaPerfTest was now causing a StackOverflowError. This turned out to be caused by the change to GeometryPrecisionReducer to use OverlayNG with Snap-Rounding to perform robust precision reduction of polygons. Further investigation revealed that failure was occurring while querying the KdTree used in the HotPixelIndex in the SnapRoundingNoder. Moreover, in addition to the outright failure, even when queries did succeed they were taking an excessively long time for large inputs.

The reason for using a K-D tree as the search structure for HotPixels is that it supports two kinds of queries with a single structure:

Lookup queries to find the HotPixel for a coordinate. These queries are performed while building the index incrementally. Hence a dynamic data structure like a K-D tree is required, rather than a static index such as STRtree.
Range queries to find all HotPixels within a given extent. These queries are run after the index is loaded.

The JTS KdTree supports both of these queries more efficiently than QuadTree, which is the other dynamic index in JTS. However, K-D trees are somewhat notorious for becoming unbalanced if inserted points are coherent (i.e. contain monotonic runs in either ordinate dimension). This is exactly the situation which occurs in large, relatively smooth polygons - which is the test data used in InteriorPointInAreaPerfTest. An unbalanced tree is deeper than it would be if perfectly balanced. In extreme cases this leads to trees of very large depth relative to their size. This slows down query performance and, because the KdTree query implementation uses recursion, can also lead to stack overflow.

A perfectly balanced K-D tree, with depth = logN. In the worst case an unbalanced tree has only one node at each level (depth = N).

I considered a few alternatives to overcome this major blocker. One was to use a QuadTree, but as mentioned this would reduce performance. There are schemes to load a K-D tree in a balanced way, but they seemed complex and potentially non-performant.

It's always great when people who file issues are also able to provide code to fix the problem. And happily the NTS developer submitted a pull request with a very nice solution. He observed that while the K-D tree was being built incrementally, in fact the inserted points were all available beforehand. He proposed randomizing them before insertion, using the elegantly simple Fisher-Yates shuffle. This worked extremely well, providing a huge performance boost and eliminating the stack overflow error. Here are the relative performance times:

Num Pts	Randomized	In Order
10K	126 ms	341 ms
20K	172 ms	1924 ms
50K	417 ms	12.3 s
100K	1030 ms	59 s
200K	1729 ms	240 s
500K	5354 ms	Overflow
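The effect is easy to reproduce with a toy K-D tree. The Python sketch below is not the JTS KdTree (it has no tolerance handling or queries, and the names are my own). It simply inserts points one at a time and records the depth reached, first in their original coherent order and then after a Fisher-Yates shuffle. Insertion is written iteratively so that the demo itself cannot overflow the stack.

```python
import random


class Node:
    __slots__ = ("pt", "left", "right")

    def __init__(self, pt):
        self.pt, self.left, self.right = pt, None, None


def insert(root, pt):
    """Insert pt into a 2-D K-D tree; return (root, depth of the new node)."""
    if root is None:
        return Node(pt), 1
    node, depth = root, 1
    while True:
        axis = (depth - 1) % 2            # alternate X and Y splits by level
        side = "left" if pt[axis] < node.pt[axis] else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(pt))
            return root, depth + 1
        node, depth = child, depth + 1


def tree_depth(points):
    root, max_depth = None, 0
    for p in points:
        root, d = insert(root, p)
        max_depth = max(max_depth, d)
    return max_depth


def fisher_yates_shuffle(items, rng):
    """Return a uniformly shuffled copy of items."""
    items = list(items)
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)             # inclusive bounds
        items[i], items[j] = items[j], items[i]
    return items


# Coherent input: points along a smooth curve, monotonic in both X and Y
pts = [(i, i * i) for i in range(1000)]
depth_in_order = tree_depth(pts)          # degenerates to one node per level
depth_shuffled = tree_depth(fisher_yates_shuffle(pts, random.Random(42)))
```

With the ordered input every point goes to the right of the previous one, so the depth equals the number of points. After shuffling, the depth is a small multiple of log N.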

Once the solution was merged into JTS, my colleague Paul Ramsey quickly ported it to GEOS, so that PostGIS and all the other downstream clients of GEOS would not encounter this issue.

It's surprising to me that this performance issue hasn't shown up in the other two JTS uses of KdTree: SnappingNoder and ConstrainedDelaunayTriangulator. More investigation required!

The OGC Simple Features specification implemented by JTS has strict rules about what constitutes a valid polygonal geometry. These include:

Polygon rings must be simple; i.e. they may not touch or cross themselves
MultiPolygon elements may not overlap or touch at more than a finite number of points (i.e. they may not intersect along an edge)

These rules were chosen for good reason. They ensure that OGC-valid polygonal geometry is the simplest possible representation of an enclosed area. This greatly simplifies the evaluation of most operations on polygonal geometry, which leads to improved performance. JTS operations generally require input which is valid according to OGC rules. And they always (with some rare exceptions) emit result geometry which is OGC-valid.

But data in the wild is often not this well-behaved. This creates the need to "clean" or "make valid" polygonal geometry in order to carry out operations on it. Shortly after JTS was first released we discovered a useful trick: constructing a zero-width buffer via geom.buffer(0) converts an invalid polygonal geometry into a valid one. It can also be used as a simple way of converting "inverted" polygon topology (ESRI-style) into valid OGC topology. The reason this works is that the buffer algorithm inherently has to handle overlaps and self-intersections since they often occur during the generation of raw buffer offset curves. The algorithm nodes self-intersections, merges overlaps, and creates new polygons or holes if necessary.

A polygon with many invalidities: overlap, self-touch in point and line, and a "bow-tie".

The polygon fixed by using buffer(0).

Note that the bow-tie portion on the right is considered to lie in the exterior of the polygon due to ring orientation, and thus is removed.

In the 20 years since the release of JTS (and its derivative GEOS) this trick has passed into the lore of open-source spatial data processing. It has become a recommended technique for fixing invalid polygonal geometry in numerous projects, such as PostGIS, Shapely, RGeo, GeoTools, R-sf and QGIS. It's also used internally in JTS, in algorithms such as DouglasPeuckerSimplifier, VWSimplifier, and Densifier which might otherwise produce invalid polygonal results.

BUT - there's a nasty little surprise lying in wait for users of buffer(0)! It doesn't always work. It turns out that the buffer algorithm has a serious flaw: in some situations involving invalid "bow-tie" topology it will discard a large part of the input geometry. This has been reported in quite a few issues (here and here in JTS, and also in GEOS and Shapely).

Result of running DouglasPeuckerSimplifier on a polygon with a bow-tie. (See issue)

Close-up of the result - clearly undesirable.

The problem occurs because the buffer algorithm computes the orientation of rings in order to build the buffer offset curve (in the case of a zero-width buffer this is just the original ring linework). Currently the Orientation.isCCW test is used to do this. This uses an efficient algorithm that determines ring orientation by checking the line segments incident on the uppermost vertex of the ring (see Wikipedia for an explanation of why this works.) For a valid ring (where the linework does not cross itself) this works perfectly. However, in an invalid self-crossing ring (sometimes called a "bow-tie" or "figure-8") a choice must be made about which lobe is assigned to be the "interior". The upper-vertex approach always picks the top lobe. If that happens to be very small, the larger part of the ring is considered "exterior" and hence is removed by buffering.

A bow-tie polygon where buffer makes the evidently wrong choice for interior.

The problem occurs for non-zero buffer distances as well.

This problem has limped along for many years now, never being quite enough of a pain point to motivate the effort needed to find a fix (or to fund one!). And to be honest, the buffer code is some of the most complicated and delicate in JTS, and I was concerned about wading into it to add what seemed poised to be a fiddly correction.

But recently there has been renewed interest in providing a Make-Valid capability for JTS. This inspired me to revisit the usage of buffer(0), and think more deeply about ring orientation and its role in determining valid polygonal topology. And this led to discovering a surprisingly simple solution for the buffer issue.

The fix is to use an orientation test which takes into account the entire ring. This is provided by the Signed-Area Orientation test, implemented in Orientation.isCCWArea using the Shoelace Formula. This effectively determines orientation based on the largest area enclosed by the ring. This corresponds more closely to user expectation based on visual assessment. It also minimizes the change in area and (usually) extent.
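The Shoelace Formula gives the signed area of a closed ring with vertices $(x_i, y_i)$ as $A = \frac{1}{2}\sum_i (x_i y_{i+1} - x_{i+1} y_i)$, which is positive for a counter-clockwise ring. The sketch below is an illustrative Python version, not the JTS Orientation.isCCWArea code; the function names and the bow-tie coordinates are assumptions made for this example.

```python
def signed_area(ring):
    """Signed area of a closed ring via the Shoelace Formula.

    `ring` is a closed list of (x, y) tuples (first point == last point).
    The result is positive for counter-clockwise rings, negative for clockwise.
    """
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_ccw_area(ring):
    """Orientation decided by the sign of the area of the whole ring."""
    return signed_area(ring) > 0


# Hypothetical bow-tie: lower lobe (area 50) traversed CCW,
# upper lobe (area 2) traversed CW, crossing at (5, 10)
bowtie = [(0, 0), (10, 0), (4, 12), (6, 12), (0, 0)]

print(signed_area(bowtie))   # 48.0
print(is_ccw_area(bowtie))   # True
```

The two lobes contribute areas of opposite sign (+50 and -2), so the sign of the total follows the larger lobe. That is why this test treats the large lower lobe as the interior.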

And indeed it works:

It fixes the simplification issue nicely:

The fix consists of about 4 lines of actual code. To paraphrase a great orator, never in the history of JTS has so much benefit been given to so many by so few lines of code. Now buffer(0) can be recommended unreservedly as an effective, performant way to fix polygonal geometry. And all those helpful documentation pages can drop any qualifications they might have.

As usual, this fix will soon show up in GEOS, and from there in PostGIS and other downstream projects.

This isn't the end of the story. There are times when the effect of buffer(0) is not what is desired for fixing polygon topology. This is discussed nicely in this blog post. The ongoing research into Make Valid will explore alternatives and how to provide an API for them.

The GEOS geometry API is used by many, many projects to do their heavy geometric lifting. But GEOS has always had a bit of a PR problem. Most of those projects provide a more accessible interface to perform GEOS operations. Some offer a high-level language like Python, R, or SQL (and these typically come with a REPL to make things even easier). Or there are GUIs like QGIS, or a command-line interface (CLI) like GDAL/OGR.

But you can't do much with GEOS on its own. It is a C/C++ library, and to use it you need to break out the compiler and start cutting code. It's essentially "headless". Even for GEOS developers, writing an entire C program just to try out a geometry operation on a dataset is painful, to say the least.

There is the GEOS XMLTester utility, of course. It processes carefully structured XML files, but that is hardly convenient. (And in case this brings to mind a snide comment like "2001 called and wants its file format back", XML actually works very well in JTS and GEOS as a portable and readable format for geometry tests. But I digress.)

JTS (on which GEOS is based) has the TestBuilder GUI, which works well for testing out and visualizing the results of JTS operations. JTS also has a CLI called JtsOp. Writing a GUI for GEOS would be a tall order. But a command-line interface (CLI) is much simpler to code, and has significant utility. In fact there is an interesting project called geos-cli that provides a simple CLI for GEOS. But it's ideal to have the CLI code as part of the GEOS project, since it ensures being up-to-date with the library code, and makes it easy to add operations to test new functionality.

This need has led to the development of geosop. It is a CLI for GEOS which performs a range of useful tasks:

Run GEOS operations to confirm their semantics
Test the behaviour of GEOS on specific geometric data
Time the performance of operation execution
Profile GEOS code to find hotspots
Check memory usage characteristics of GEOS code
Generate spatial data for use in visualization or testing
Convert datasets between WKT and WKB

geosop has the following capabilities:

Read WKT and WKB from files, standard input, or command-line literals
Execute GEOS operations on the list(s) of input geometries. Binary operations are executed on every pair of input geometries (i.e. the cross join, aka Cartesian product)
Output geometry results in WKT or WKB (or text, for non-geometric results)
Display the execution time of data input and operations
Display a full log of the command processing
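The cross-join behaviour for binary operations can be pictured with a small Python sketch. This is not geosop code: points stand in for geometries, math.dist stands in for a GEOS binary operation, run_binary_op is a hypothetical helper, and the order in which pairs are visited is an assumption.

```python
from itertools import product
from math import dist

# Stand-ins for the A and B geometry lists (points as (x, y) tuples)
a_geoms = [(0.0, 0.0), (3.0, 4.0)]
b_geoms = [(0.0, 0.0), (6.0, 8.0), (3.0, 0.0)]


def run_binary_op(op, a_list, b_list):
    """Apply a binary operation to every (A, B) pair, i.e. the Cartesian product."""
    return [op(a, b) for a, b in product(a_list, b_list)]


results = run_binary_op(dist, a_geoms, b_geoms)
print(len(results))   # 6 results: 2 A geometries x 3 B geometries
print(results)        # [0.0, 10.0, 3.0, 5.0, 5.0, 4.0]
```

The number of results is the product of the two input sizes, so large A and B inputs produce large outputs (and long run times).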

Here's a look at how it works.

geosop -h gives a list of the options and operations available:

geosop - GEOS v. 3.10.0dev

Executes GEOS geometry operations

Usage:

geosop [OPTION...] opName opArg

-a arg source for A geometries (WKT, WKB, file, stdin, stdin.wkb)

-b arg source for B geometries (WKT, WKB, file, stdin, stdin.wkb)

--alimit arg Limit number of A geometries read

-c, --collect Collect input into single geometry

-e, --explode Explode result

-f, --format arg Output format

-h, --help Print help

-p, --precision arg Sets number of decimal places in WKT output

-r, --repeat arg Repeat operation N times

-t, --time Print execution time

-v, --verbose Verbose output

Operations:

area A - computes area for geometry A

boundary A - computes boundary for geometry A

buffer A N - computes the buffer of geometry A

centroid A - computes centroid for geometry A

contains A B - tests if geometry A contains geometry B

containsPrep A B - tests if geometry A contains geometry B, using PreparedGeometry

containsProperlyPrep A B - tests if geometry A properly contains geometry B using PreparedGeometry

convexHull A - computes convexHull for geometry A

copy A - computes copy for geometry A

covers A B - tests if geometry A covers geometry B

coversPrep A B - tests if geometry A covers geometry B using PreparedGeometry

difference A B - computes difference of geometry A from B

differenceSR A B - computes difference of geometry A from B rounding to a precision scale factor

distance A B - computes distance between geometry A and B

distancePrep A B - computes distance between geometry A and B using PreparedGeometry

envelope A - computes envelope for geometry A

interiorPoint A - computes interiorPoint for geometry A

intersection A B - computes intersection of geometry A and B

intersectionSR A B - computes intersection of geometry A and B

intersects A B - tests if geometry A and B intersect

intersectsPrep A B - tests if geometry A intersects B using PreparedGeometry

isValid A - tests if geometry A is valid

length A - computes length for geometry A

makeValid A - computes makeValid for geometry A

nearestPoints A B - computes nearest points of geometry A and B

nearestPointsPrep A B - computes nearest points of geometry A and B using PreparedGeometry

polygonize A - computes polygonize for geometry A

reducePrecision A N - reduces precision of geometry to a precision scale factor

relate A B - computes DE-9IM matrix for geometry A and B

symDifference A B - computes symmetric difference of geometry A and B

symDifferenceSR A B - computes symmetric difference of geometry A and B

unaryUnion A - computes unaryUnion for geometry A

union A B - computes union of geometry A and B

unionSR A B - computes union of geometry A and B

Most GEOS operations are provided, and the list will be completed soon.

Some examples of using geosop are below.

Compute the interior point for each country in a world polygons dataset, and output them as WKT:

geosop -a world.wkt -f wkt interiorPoint

Determine the time required to compute buffers of distance 1 for each country in the world:

geosop -a world.wkt --time buffer 1

Compute the union of all countries in Europe:

geosop -a europe.wkb --collect -f wkb unaryUnion

The README gives many more examples of how to use the various command-line options. In a subsequent post I'll give some demonstrations of using geosop for various tasks including GEOS testing, performance tuning, and geoprocessing.

Future Work

There's potential to make geosop even more useful:

GeoJSON is a popular format for use in spatial toolchains. Adding GeoJSON reading and writing would allow geosop to be more widely used for geo-processing.
Adding SVG output would provide a way to visualize the results of GEOS operations.
Adding operations to generate various kinds of standard test datasets (such as point grids, polygon grids, and random point fields) would improve support for performance testing.

And of course, work will be ongoing to keep geosop up-to-date as new operations and functionality are added to GEOS.
