Lin.ear th.inking

Halton sequences, at last

A while ago I posted on generating random point sets, and Sean Gillies suggested Halton sequences as a way of generating nice-looking point distributions.  I finally got around to looking into this idea. Indeed he was correct - Halton sequences provide a very pleasing appearance for point distributions (much better than purely random distributions).
The above shows the first 1000 elements of the Halton sequence with bases (2,3).  Larger bases exhibit progressively more coherence, which may or may not be desirable depending on application. Here's H(7,9):
 The code to generate Halton sequences is pretty trivial - see the Wikipedia article for some pseudo-code.  (Although I have to say that the explanation of their derivation is pretty poor.  I've seen this sad trend on some other Wikipedia mathematical articles as well - in particular the one for Barnes surface interpolation.  I suppose the rejoinder would be to get in there and improve it!)
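For reference, here is a minimal sketch of the standard radical-inverse construction in Java (the class and method names are just illustrative):

// Minimal sketch of a 1D Halton sequence via the radical-inverse construction.
// A 2D quasi-random point is formed by pairing two coprime bases, e.g. (2,3).
public class HaltonExample {

  static double halton(int i, int base) {
    double result = 0.0;
    double f = 1.0 / base;
    while (i > 0) {
      result += f * (i % base);
      i /= base;
      f /= base;
    }
    return result;
  }

  public static void main(String[] args) {
    // the first 1000 points of H(2,3)
    for (int i = 1; i <= 1000; i++) {
      System.out.println(halton(i, 2) + "\t" + halton(i, 3));
    }
  }
}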

Keying off something else that Sean showed on his blog, I also played around with using Halton sequences to generate quasi-random sets of polygons.  Here's one example, using a slightly different approach to Sean's:

And the same for H(5,7), which looks slightly nicer I think:


Might come in handy someday for generating test data, or possibly as a visual texture.


State Of The Map US - 2012

Thanks to OpenGeo, last weekend I attended the OpenStreetMap State of the Map (US) conference in Portland, Oregon. It was a great conference to attend.  Below are some of my takeaways.

[Obligatory disclaimer: these are my opinions, not necessarily those of OpenGeo]

It was an interesting change from being at software conferences like FOSS4G.  The conference was mostly about software, but from a reverse perspective. Instead of new software ideas in search of applications and data, the perspective was of a huge, freely-available, rapidly growing dataset and how this encourages the development of innovative software to manage, display and analyze it.  One result of this was less of a "tribal" split among the attendees. (A tribal rift that might have appeared is choice of mapping engine, but just about everyone there was in the Mapnik tribe. I had to dig to find GeoServer or MapServer users.)

What was similar to FOSS4G was the notable excitement about transforming established business models - but with different players, organizations which use (and usually pay for) road and transit data, rather than established software vendors. Also similar was the feeling of community, the energy and passion of the people involved, and the sense of being involved in something that is a radically new and more empowering way of building something that is essential in people’s lives.
(See the presentation When Google Maps Gives you Lemons, make Lemonade for a great example of why and how this is happening).

Incidentally, the pre-conference party was at Geoloqi, who later that weekend announced their acquisition by ESRI. Apparently the big money hadn't started flowing their way yet, since they ran out of beer just after we got there.

Statistics
  • about 225 attendees
  • 3 tracks of presentations over 2 days
  • 3rd SOTM-US.  First one was 2 years ago, with only 40 attendees

The list of talks with links to some presentations.

Comments on Selected Talks

The other thing similar to FOSS4G is that there are many more good talks than one person can possibly take in. Here are notes from the ones I did attend:


  • Steve Coast (formerly of Cloudmade, now at Microsoft) gave his 1000th "Founder of OSM" keynote.  He said that the top thing OSM needs to continue to grow is addressing.  He singled out Google Mapmaker as being a clear and present danger to the growth of OSM.
  • Check the great video Address is Approximate, mentioned by Henk Hoff in his keynote
  • Dane Springmeyer and Artem Pavlenko from MapBox presented on new features of Mapnik.  
    • Ability to define raster compositing operations between map layers.  Showed using it to obtain tint bands, as well as some other examples which are maybe more interesting than essential.  
    • Vertex converters, similar to GeometryTransformations in GeoServer.  Uses include simplification (various simplification methods were tested - not clear which are available) and smoothing
    • Format support: CSV, GeoJSON, Python
  • Nathan Kelso and Michal Migurski from MapBox presented on how to prepare OSM and terrain data to make a purty base map.  A lot of work is required, and most of it is slow/complex and thus has to be done offline (the beauty of tiles...).  They have a clever simulated-annealing based labeller with a cool video.
  • AJ Ashton from MapBox talked about OSM data preparation for effective cartography.  He made the good point that good multi-scale cartography takes a lot of pre-rendering data prep, as well as data sources additional to OSM (such as NaturalEarth for placename priority).  
  • MapBox and the Knight Foundation announced their big grant ($575K) to “improve OSM tools”.  There was a palpable sense of concern in the community attending about what this would mean for the current developer community.  Apparently there are still hard feelings about past experiences with CloudMade and MapQuest.  
  • Tom Macwright from MapBox talked about OSM infrastructure.  I had no idea it was run on such a minimal infrastructure (3 servers in London).  The codebase (known as “RailsPort”) is pretty Rube(y) Goldbergian.  The bus factor is a bit too low for comfort.  The good news is that the data fits in a terabyte, so there are mirrors all over the place.
  • Steve Coast raised an interesting question wondering if it would improve the OSM codebase if it had reusability/portability as an explicit goal.  You might think it would be rare to find other people who need to run a massive crowd-sourced map of the world, but Eric Wolf of USGS says that they run a fork of OSM internally to support their mapping.  He did say that it was difficult to contribute back to the trunk.
  • Nathan Van Der Wilt presented his ArgyleTiles project to create a tiled base map of earth imagery.  A great idea, but one that’s been tried before (where have you gone, OpenAerialMap?).
  • David Turner from OpenPlans discussed using OpenTripPlanner with OSM.  They have highly-tuned route planning rules which work with OSM tags.  OpenTripPlanner is a very interesting and useful project.  They make a fair bit of use of JTS and spatial algorithms (such as linear referencing and concave hulls)
  • Portland TriMet presented on their use of OSM and OpenTripPlanner to provide extremely high-quality routing (much better than Google in many cases, particularly for multi-modal involving human-powered transport).  They did a lot of work to improve OSM data in Portland to make routing work better. They emphasized that this was not that expensive - 4 interns over a few months to improve and QA the entire area.  This is an ongoing project, and is done iteratively in conjunction with the OpenTrip Planner, to identify routes which show up as taking longer than expected. TriMet also presented on their extensive use of open data for mapping.
  • There were a couple of talks from TeleNav, a commercial traffic reporting and routing company. They have 30M clients globally reporting and using traffic data. Sounds like they are switching to use OSM data.  They are putting a lot of work into error detection and data cleaning.  They have created an open alternative to the TeleAtlas/NavTeq TMC spec (which is licensed) called TTL.  This creates a standard set of road segments which is the basis for traffic data collection and reporting.  
  • Martijn Van Exel is a longtime OSM user and advocate, now based in Salt Lake City.  He is doing a lot of work on error detection and reporting in OSM, notably creating the Remap-A-Tron.  He’s using GeoServer (yay!).  I told him about the new heatmap rendering, which might be of interest for visualization.
  • Jeff Meyer presented on spatio-temporal applications in education and the humanities.  He made a plea to model and capture temporal attributes in OSM, to preserve its value into the future.
  • Abe Usher had a hilarious presentation on Heatmaps for data visualization.  
  • Ben Standefer from Urban Airship talked about how location-based push to mobile devices is becoming big business.  They are using OSM for things like identifying neighbourhood polygons, and POI (points of interest) polygons (eg arenas).  They’re using the JTS STRtree spatial indexing (yay! Blows me away that it is suitable for hard production use) - see the sketch after this list.
  • Alex Barth from MapBox talked about Carmen, an open-source geocoder they are developing.  It seems more like a reverse geocoder - most of what he talked about was how to identify named locations from points.  Nothing was said about the hard parts of address geocoding (such as parsing, error handling, address models, and fuzzy matching)
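Regarding the STRtree usage mentioned in the Urban Airship item above, here's a minimal sketch (class and method names are purely illustrative) of how a POI polygon lookup might be structured using the JTS index:

import java.util.List;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.index.strtree.STRtree;

// Sketch: index POI/neighbourhood polygons once, then find the polygon
// containing an incoming device location by querying candidates by
// envelope and refining with an exact point-in-polygon test.
public class PoiIndex {
  private final STRtree index = new STRtree();

  public void add(Geometry poiPolygon) {
    index.insert(poiPolygon.getEnvelopeInternal(), poiPolygon);
  }

  public Geometry locate(Geometry point) {
    List candidates = index.query(point.getEnvelopeInternal());
    for (Object obj : candidates) {
      Geometry poly = (Geometry) obj;
      if (poly.contains(point)) return poly;
    }
    return null;
  }
}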


The Great Geometry Clipping Contest

Don Meltz initiated a fascinating flurry of performance evaluation with his post on Is QGIS a Viable Alternative to ArcGIS? and its followup ArcGIS vs QGIS Clipping Contest Rematch.  He looked at a spatial processing task involving clipping a large dataset of contour lines against a fairly simple polygon.  His conclusion was that QGIS was a lot faster than ArcGIS at performing this task.  His final testing produced a time of 6 min 27 sec for QGIS, versus 1 h 35 min for ArcGIS - which then failed with a topology error!  (Note:  subsequently ESRI reported that they have improved their algorithm to provide much better results for this case - and presumably to enable it to actually complete!)


This inspired a lot of other people to dive in and run the same test (since Don helpfully provided the test data here).  Systems tested include many commercial (ArcGIS, GlobalMapper, Manifold) and FOSS systems (QGIS, PostGIS, SpatiaLite, GRASS, OGR, uDig, OpenJUMP, etc).  There's a summary of some of the timing results here (and the many comments to these posts provide lots of different timings on various software and hardware configurations).

On the Java side, Andrea Antonello of jGrass provides an in-depth description of his optimized implementation using uDig, GeoTools and JTS here.  His best result was about 80 sec, using 4 cores and including data I/O. (He later tested on Amazon AWS using 32 cores, giving a 25 sec time).

The SpatiaLite wiki has a nice page showing how this problem is tackled in SQL here.  It also has an excellent analysis of what this contest actually demonstrates. The key conclusion is that almost all the open source systems are using JTS or GEOS, so what is really being measured is the effectiveness of the JTS overlay algorithm.

The one exception is GRASS, which uses a completely different topological algorithm.  From the results the performance of this seems similar to or only a bit slower than JTS/GEOS.  This is actually quite impressive since it sounds like the algorithm is computing full topology of the data.  But it's hard to know without a more detailed understanding of how it works.  And in any case, comparing the contest timings is difficult since many different hardware configurations were used.

One aspect of this task is that it is "pleasantly parallel", so implementations which can multi-thread over many cores should see near-linear performance improvement.  Only some of the systems tested were able to take advantage of this.  In particular, PostGIS and SpatiaLite execute single queries in a sequential fashion, which is a shame. Perhaps this will be rectified in future releases?

It's great to see JTS and GEOS used effectively in so many different applications, and that they hold their own against the well-funded competition (and in fact often doing much better).

And here's the kicker - it could run even faster! The JTS overlay algorithm was designed for the single geometry/geometry case.  It has not been optimized for iterated operations against a fixed query geometry.  By using a caching approach similar to the existing JTS PreparedGeometry API, it will be possible to avoid  recomputing polygon topology and thus provide a significant speedup.  Also, the overlay algorithm was designed to handle all geometry types, and in particular the polygon/polygon case, which is the most demanding situation.  It could be optimized to handle simpler cases (such as the polygon/line overlay of this task) in a faster way.  All it will take is some time, money, and some hard thinking...
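To illustrate the kind of caching that is already available today: the existing PreparedGeometry API caches the structures needed for fast intersects testing, so it can at least be used as a cheap pre-filter before running the (uncached) overlay. A minimal sketch, with illustrative names:

import java.util.ArrayList;
import java.util.List;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.prep.PreparedGeometry;
import com.vividsolutions.jts.geom.prep.PreparedGeometryFactory;

// Sketch: clip many lines against a fixed polygon, using a PreparedGeometry
// to cheaply skip features which do not interact with the clip polygon.
// The overlay itself still recomputes the polygon topology on every call.
public class ClipWithPrefilter {
  public static List<Geometry> clip(List<Geometry> lines, Geometry clipPoly) {
    PreparedGeometry prep = PreparedGeometryFactory.prepare(clipPoly);
    List<Geometry> result = new ArrayList<Geometry>();
    for (Geometry line : lines) {
      if (! prep.intersects(line)) continue;    // fast, cached test
      result.add(line.intersection(clipPoly));  // full overlay only when needed
    }
    return result;
  }
}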

YAUSEM (Yet Another US Election Map)

The usual US election map is a starkly simplistic collection of red and blue blobs.  This does reflect the esoteric (to a Canadian) Electoral College first-past-the-post arrangement.  But after seeing how close the actual vote counts were in most states, it seemed to me like this doesn't really reflect the actual political reality of the US.  Really it's pretty much a purely purple country.  John Nelson has a nice map that elegantly visualizes this reality at a county level (and Brian Timoney explains why it's the only map that isn't a lie).

As another attempt at map truthiness, I used JEQL to produce the following map.  It shows actual vote numbers at a state level, color themed along two dimensions:
  • The hue shows the relative proportion of Democrat VS Republican votes (using the now-canonical blue and red).  For reference, Florida is almost exactly 50-50.  This nicely shows that really the US is just varying shades of purple.
  • The saturation is proportional to the relative population of the state.  California is fully saturated, since it's the most populous state.  The inland Western and far Northeastern states are pretty pale, since they have fairly low populations.  This is roughly proportional to the weight of the state's Electoral College votes, although there are amusing anomalies.


I make no claim that this map represents any valid statistics - it's just a fun exercise in using JEQL to do spatial visualization.  For reference, the script is:

CSVReader t colSep: "\t" file: "us_vote_raw.txt";

t = select String.trim(col1) name,
    Val.toInt(String.keepChars(col2, "0123456789")) ecVote,
    Val.toDouble(String.keepChars(col4, "0123456789")) demvote,
    Val.toDouble(String.keepChars(col5, "0123456789")) repvote from t;

Print t;

maxVote = val(select max(demvote+repvote) from t);

ShapefileReader tus file: "us_state.shp";

tvote = select name, demvote, repvote, GEOMETRY,
        demvote / (demvote+repvote) demfrac,
        demvote+repvote totvote,        

       (demvote+repvote)/maxVote density
    from t
    join tus on t.name == String.trim(tus.STATE_NAME);

Mem tvote;
 

tplot = select GEOMETRY,
     styleFill,
    #ffffff styleStroke,
    1 styleStrokeWidth
    with {
        clr = Color.interpolate("ff0000", "0000ff", demfrac);
        h = Color.getH(clr);
        s = density;
        v = 1;
        styleFill = Color.toRGBfromHSV(h,s,v);    }   
    from tvote;

extent = BOX(-128 20, -65 50);

Plot    data: tplot
    extent: val(extent)
    width: 800
    height: 400
    background: "0000aa"
    file: "us_vote.png";


The raw data came from Wikipedia via simple cut-and-paste to a text file.


Convenience trumps all

The always-readable Stephen O'Grady has an insightful post titled Do Not Underestimate the Power of Convenience.  He proposes that an increasingly important factor driving the uptake of software is the developer's drive for convenience.  "Convenience" is an, er, convenient term for things like ease-of-use, power, portability and low barrier to entry.  In the case of software the lowest possible barrier is provided by free open source software, but the principle applies to infrastructure as well (reflected by the rapid uptake of things like cloud computing and BYOD).  As evidence he lists a set of technologies whose prevalence has been driven bottom-up by developers, rather than top-down by corporate fiat.  It reads like the bill-of-materials for IT infrastructure in a start-up: AWS, Linux, dynamic languages, Git, Eclipse, etc.

One reason he gives for this situation is the pleasing (to my ears) observation that "Developers are the new Kingmakers" (which he discusses in detail in another post).  I suspect that these are mutually-reinforcing phenomena.  Developers' drive for convenience has led to the rapid evolution of open source software (and not coincidentally the improvement of the tools which enable its development).  This has led to the current situation where in many cases OSS surpasses commercial offerings, or at least is "good enough" to be used in demanding production environments.  Since the developers are the ones making the software, or at least are the ones who understand it, they need to be involved in the decisions to adopt it (if they aren't spearheading the adoption themselves). (And especially since there are usually no marketing people in sight...)

In fact, I contend that the drive for convenience is the motivation for many of the advances in software and computation.  A classic example is the ongoing quest to increase the concision of computer languages, by making them more expressive and providing a richer computation model.  Another is the evolution of systems which can easily automate mundane tasks (with examples including the rich shell and command-line capabilities of *nix and the increasingly powerful array of build tools - the best of which of course are free and open source.)  Developers' preference for software which is portable and standard is also driven by convenience - everyone wants their current favourite set of software available wherever they happen to be working (and as expected, open source software typically exhibits the highest degree of portability).

As Larry Wall said, laziness and impatience are virtues for a programmer.

Lazy programmers avoid yak shaving

Taken to the extreme, the drive for convenience is simply another way of stating the ultimate goal of all computer science - to free computation from any limitations of space, time, money and power.  Put this way, O'Grady's thesis is almost a tautology.  But it's a valuable reminder of the constant need to push against the strong opposing forces of commercial interest and bureaucratic inertia.

JTS 1.13 Released

I'm pleased to announce that JTS 1.13 has been released.  It's available for download from SourceForge.

There is a long list of new features, enhancements and bug fixes in this release:

Functionality Improvements

  • Changed GeometryFactory.createGeometry() to make a deep copy of the argument Geometry, using the CoordinateSequenceFactory of the factory
  • Added ability to specify a dimension in CoordinateArraySequence
  • Changed Geometry.getEnvelopeInternal() to return a copy of the cached envelope, to prevent modification
  • Added GeometryEditor.CoordinateSequenceOperation to allow easy editing of constituent CoordinateSequences
  • Added GeometryFactory.createPolygon convenience methods which do not require holes to be specified (see the sketch after this list)
  • Geometry overlay methods now return empty results as atomic types of appropriate dimension
  • Added RectangleLineIntersector to provide efficient rectangle-line intersection testing
  • Added getOrdinate and setOrdinate to Coordinate
  • Quadtree is Serializable
  • STRtree is Serializable
  • Added max, average and wrap functions to MathUtil
  • Improved WKTReader parse error reporting to report input line of error
  • Improved WKBReader to repair structurally-invalid input
  • Made TopologyPreservingSimplifier thread-safe
  • Added AbstractSTRtree.isEmpty() method
  • Added QuadTree.isEmpty() method
  • Added KdTree.isEmpty() method
  • Added decimation and duplicate point removal to ShapeWriter.
  • ScaledNoder now preserves Z values of input
  • Added instance methods for all Triangle static methods
  • Added CGAlgorithmsDD containing high-precision versions of some basic CG algorithms
  • Added IntersectionMatrix.isTrue() method for testing IM pattern matches
  • Added getRawCoordinates methods to PackedCoordinateSequence concrete classes
  • Modified Geometry.isSimple() to explicitly check for simplicity for all types, and support GeometryCollections
  • Improved MCIndexSnapRounder to add nodes only where they are necessary
  • Added CoordinateArrays.removeNull() method
  • Enhanced GeometryEditor to handle null geometries returned from operation
  • Added WKBHExFileReader
  • Added Distance3D operation
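As a small illustration of one of these additions, the new no-holes createPolygon convenience method can be called directly with a shell coordinate array. A minimal sketch (the coordinates are just an example):

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Polygon;

// Sketch: creating a simple polygon without having to pass an explicit
// (empty) holes array, using the new GeometryFactory convenience method.
public class CreatePolygonExample {
  public static void main(String[] args) {
    GeometryFactory fact = new GeometryFactory();
    Polygon box = fact.createPolygon(new Coordinate[] {
      new Coordinate(0, 0), new Coordinate(10, 0),
      new Coordinate(10, 10), new Coordinate(0, 10),
      new Coordinate(0, 0)   // ring must be closed
    });
    System.out.println(box);
  }
}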

Performance Improvements

  • Simplified & improved performance of RectangleIntersects by using new RectangleLineIntersector
  • In RandomPointsInGridBuilder eliminated redundant ArrayList usage
  • In PreparedPolygonIntersects and PreparedLineStringIntersects added check to avoid creating segment index if all test inputs are points
  • In AbstractSTRtree switched to using indexed list access for better performance than using iterators
  • In AbstractSTRtree freed inserted item array after index is built
  • Improved performance of Polygonizer for cases with many potential holes
  • Improved performance for some DD methods by making them final
  • Added fast filter for CGAlgorithmsDD.orientationIndex, and switched to self-operations for DD determinant
  • Changed STRtree.createNode() to use a static class for nodes
  • Changed QuadTree Node to use scalar x and y variables rather than a Coordinate to reduce memory allocation
  • Fixed PreparedGeometry concrete classes to be thread-safe.
  • Fixed SortedPackedIntervalRTree so that it is thread-safe.

Robustness Improvements

  • Switched to using DD extended-precision arithmetic to compute orientation predicate
  • CGAlgorithms.distanceLineLine() improved to be more robust and performant
  • Fixed robustness issue causing Empty Stack failure in ConvexHull for some nearly collinear inputs
  • CGAlgorithms.signedArea() uses a more accurate algorithm

Bug Fixes

  • Fixed Geometry.equalsExact() to avoid NPE when comparing empty and non-empty Points
  • Fixed CascadedPolygonUnion to discard non-polygonal components created during unioning, to avoid failures and provide more desirable behaviour
  • Fixed CentralEndpointIntersector to initialize result correctly
  • Fixed DelaunayTriangulationBuilder.extractUniqueCoordinates(Geometry) to avoid mutating the vertex order of the input Geometry
  • Fixed ConformingDelaunayTriangulationBuilder to allow non-disjoint site and constraint vertex sets
  • Fixed RandomPointsInGridBuilder point generation to use circle constraint correctly
  • Fixed Linear Referencing API to handle MultiLineStrings consistently, by always using the lowest possible index value, and by trimming zero-length components from results
  • Fixed bug in LocationIndexedLine and LengthIndexedLine which was causing an assertion failure when the indexOfAfter() method was called with a constraint location which is at the end of the line
  • Fixed bug in STRtree.query(Envelope, ItemVisitor) causing an NPE when tree is empty
  • Fixed issue with creating zero-length edges during buffer topology building under fixed precision, by: adding filter to remove zero-length edges; using a better estimate of scale factor for reducing to fixed precision after initial failure.
  • Fixed TopologyPreservingSimplifier to return a valid result for closed LineStrings with large distance tolerances
  • Fixed TopologyPreservingSimplifier to return an empty result for an empty input
  • Fixed DouglasPeuckerSimplifier to return an empty result for an empty input
  • Fixed MinimumBoundingCircle to correctly compute circle for obtuse triangles.
  • Fixed GeometryPrecisionReducer to use input GeometryFactory when polygon topology is fixed
  • Fixed GeometryNoder bug that was failing to snap to end vertices of lines
  • Fixed Geometry.getCentroid() and Geometry.getInteriorPoint() to return POINT EMPTY for empty inputs
  • Fixed DelaunayTriangulationBuilder to correctly extract unique points
  • Fixed KdTree to correctly handle inserting duplicate points into an empty tree
  • Fixed LineSegment.projectionFactor() to handle zero-length lines (by returning Double.POSITIVE_INFINITY)
  • Fixed LocationIndexedLine to handle locations on zero-length lines
  • Fixed LengthIndexedLine and LocationIndexedLine to handle indexOfAfter() correctly
  • Fixed WKBReader to handle successive geometries with different endianness
  • Fixed GeometricShapeFactory to correctly handle setting the centre point
  • Fixed GeometryFactory.createMultiPoint(CoordinateSequence) to handle sequences of dimension > 3

API Changes

  • Changed visibility of TaggedLineStringSimplifier back to public due to user demand

Testing

  • Added Performance Testing framework (PerformanceTestRunner and PerformanceTestCase)
  • Added named predicate tests to all Relate test cases


JTS TestBuilder

Functionality Improvements

  • Added segment index visualization styling
  • Improved Geometry Inspector
  • Added stream digitizing for Polygon and LineString tools
  • Added output of Test Case XML with WKB
  • Added Extract Component tool
  • Added Delete Vertices Or Components tool
  • Added Geometry Edit Panel pop-up menu, with operations
  • Added Halton sequence functions
  • Added sorting functions
  • Added function for selection of first N components
  • Added CGAlgorithms functions
  • Added ability to paste and load multiple WKBHex geometries

Performance Improvements

  • Using decimation substantially improves rendering time for large geometries.

Bug Fixes

  • Fixed bug in saving XML test files

2012 Year in Review - Blog Roundup


A look back at 2012 from a software technology perspective by some of my favourite blogs:
  • Inspired By Actual Events - a wide-reaching roundup. I found the Java and friends links especially interesting, since the Java/JVM world is so big now it's hard to keep up with and distill the really significant events.
  • Interoperability Happens (Ted Neward) - As usual, opinionated and insightful commentary on enterprise software technology from a hard-core developer perspective.
  • Tim Anderson - A strong focus on Microsoft, but also a fairly even-handed assessment of the rest of the "A"-team (Apple, Android/Google, and Amazon).  (And a not-very-optimistic mention of the "B" team - BB/RIM). I always appreciate Tim Anderson's reading of the internal and external tea-leaves of MS technology.  It's always fascinating to see the elephant trying to jump, in a schadenfreudal sort of way.
  • Tim Bray - Not really a roundup, and not all that tech-focussed, but always a good read.

Lead, the criminal element

I've heard before about the postulated link between atmospheric lead levels (courtesy of the leaded gasoline used through the middle decades of the 20th century) and crime levels.  This Mother Jones article America's Real Criminal Element: Lead is the best explanation I've seen so far (and has links to the original papers).  It really sounds like this hypothesis is fully confirmed - and the best thing about this story is that it has a happy ending.  (Unless you're trying to get elected as mayor - or Prime Minister - on a tough-on-crime platform).


There is a nice geospatial connection here.  As with many epidemiological issues, spatial locality is an important aspect of the analyses that lead (ahem) to the conclusion.  The article is chock-full of references to the spatial nature of the problem, such as:
We now have studies at the international level, the national level, the state level, the city level, and even the individual level
and my favourite:
a good rule of thumb for categorizing epidemics: If it spreads along lines of communication, he says, the cause is information. Think Bieber Fever. If it travels along major transportation routes, the cause is microbial. Think influenza. If it spreads out like a fan, the cause is an insect. Think malaria. But if it's everywhere, all at once—as both the rise of crime in the '60s and '70s and the fall of crime in the '90s seemed to be—the cause is a molecule.





Functional Programming Whinging

Tim Bray thinks Uncle Bob Martin's post on Functional Programming Basics is the cat's pyjamas.

Meh. "Basics" is the key word in that title - the article is pretty light and fluffy.  Fine if you don't know squat about FP, but it's also accompanied by a whole lot of starry-eyed razzle-dazzle which isn't really justified by the content (and note that I'm not saying it's wrong, just not substantiated).

To be fair, TB does have a few gripes.  Here are a few more:

  • The example used to show how FP wonderfully avoids variables and side-effects is that hoary old one of computing squares of integers.  (I mean really hoary - this was the first program I ever wrote, in WATFIV.  And I at least had cool line printer output!)  How about using something that's a bit more representative of an actual computational problem?  Like say, red-black trees - with deletion!  
  • As TB points out, the people who really need to make algorithms run fast across 64 cores are a small percentage of current coders.  For everyone else, scale-out is a more mundane but pressing problem.  And it's not clear to me whether FP will make that easier.
  • As someone who spends his leisure hours trying to make spatial algorithms more performant, I'm suspicious of anything that promises to automagically make code go faster across multiple cores.  In the spatial domain most interesting problems are not "pleasantly parallel", and many of them are memory-bound as well as being compute-bound.  So advances in performance would seem to depend on better algorithms, not a different choice of language.
Back in the day I was pretty keen on FP languages - but I realized, after being exposed to Smalltalk and later Java, that a lot of their appeal was due to their (necessary) provision of automatic memory management (which was painfully lacking in the "mainstream" languages such as FORTRAN, Pascal, C - oh, and even C++).

But I'm not trying to prove a negative here.  Certainly the FP features of no side-effects and lazy evaluation would seem to offer a lot of benefit for the right class of problems.  And FP or FP-ish languages are more mainstream than ever before.  So perhaps they really will become the mainstream language paradigm.  I just hope I don't have to be coding using layers of inconveniently situated parentheses.

JTS Union VS ArcGIS Dissolve

Ragnvald Larsen has an interesting post on ways to mitigate the poor performance and stability of Dissolve computations in ArcGIS.  Dissolve is the Arc term for the geometric union of a collection of polygons (possibly grouped by attribute, although that capability was not used in this case).

Ragnvald's dataset consisted of a 15 MB shapefile containing about 7000 overlapping polygons.  Here's what the data looks like:


He found that using the ArcGIS Dissolve method took about 150 sec to process the dataset.  In an effort to reduce this time, he experimented with partitioning the dataset and doing the union in batches.  After a (presumably lengthy) series of experiments to find the optimal batch size, he was able to get the time down to 25 sec using a batch size of 110 features.

Improving union performance by partitioning the input is the basic idea behind the Cascaded Union function in JTS (which I blogged about back in 2007).  Cascaded Union uses a spatial index to automatically optimize the partitioning.  Ragnvald doesn't mention whether he used a spatial index, but I suspect this might be quite time-consuming to code in ArcPy.
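For reference, the same cascaded union is available directly in JTS. A minimal sketch (class name illustrative):

import java.util.Collection;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.operation.union.CascadedPolygonUnion;

// Sketch: union a collection of polygons using the cascaded (spatially
// partitioned) algorithm.  The partitioning is driven by an internal
// STR-tree, so no manual batch-size tuning is needed.
public class DissolveSketch {
  public static Geometry dissolve(Collection<Geometry> polygons) {
    return CascadedPolygonUnion.union(polygons);
  }
}

The more general grouped Dissolve simply amounts to splitting the input into one collection per attribute value and unioning each collection.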

I thought it would be interesting to compare the performance of the JTS algorithm to the ArcGIS one.  To do this I used JEQL, which provides an easy high-level way to read the data and invoke the JTS Cascaded Union.  The entire process can be expressed as a very simple JEQL script:

ShapefileReader t file: "agder/agder_buffer.shp";
t = select geomUnionMem(GEOMETRY) g from t;
ShapefileWriter t file: "result.shp";

geomUnionMem is a JEQL spatial aggregate function which is implemented using the JTS Cascaded Union algorithm.  (Although not needed in this case, note that the more general Dissolve use case of unioning groups of features by their attributes can easily be achieved by using the standard SQL GROUP BY clause.)

Running this on a (late-model) PC workstation produced a timing of about 1.5 sec!

Here's the output union:







The subversiveness of Open Source

It's no longer novel to observe that Open Source is, if not the dominant software paradigm of the era, at least one of the most significant innovations in the history of software practice.  Recently it struck me how downright bizarre the Open Source paradigm really is.  I can't think of another field of human endeavour where the fundamental paradigm mandates giving away the product of one's labour.  Consider a few sweepingly-generalized examples:
  • Business - Fugedaboudit!  It's all about the money.  Apart from the Diggers of 60's Haight-Ashbury notoriety there aren't too many examples of businesses whose model consists of giving away their stock.
  • Arts - Hah!  Obviously the big media companies are doing everything they can to squeeze money out of artistic endeavour.  But even among the less mercantile stakeholders the main discussion is about how artists can be compensated for their creations.  No-one seriously advocates that artists give away all their work for free. 
  • Sport -  Don't get me started on the gross discrepancy between compensation and value in professional sport.  And at the amateur level, sponsorship and funding organizations are recognized to be essential to promoting the continued generation of sporting "product".  (Wouldn't it be great if there was a similar system of sponsorship for software developers?)
  • Science - You might think this would be the exception that proves the rule.  After all, sharing research results is a revered principle of scientific progress.  The domain relies on publishing information openly to an even greater extent than in software development.  But in my (admittedly limited) experience many scientists are actually quite protective of their intellectual property, since their livelihood depends in a direct way on amassing it and monetizing how it is dispensed.  And it's well known that academic institutions pay very close attention to licensing the IP generated by them (or their employees).
Just to be clear, I am not suggesting that the open source paradigm is flawed or wrong.  In fact, I spend the major part of my professional life living and breathing Open Source geospatial software (JTS/GEOS, JEQL, Proj4J, GeoServer, PostGIS, etc). As a means of increasing the velocity and quality of software development it's by far the best model. And it's much more democratic and self-actualizing than the semi-feudal alternatives.  But it really is a subversive concept.  Marxist, even.  It's no wonder that it's taking so long for the suits to wrap their heads around how to deal with it.

Long live the anarcho-syndicalist commune of Open Source Software craftsmen!

10 Step Program for Developers

Andrew Oliver lays out the 10 Step Program for developers.  Here's his points:
  1. Blog
  2. Go open source
  3. Not six months, not 10 years
  4. Eye on the new stuff, hands on the practical
  5. Write your own documentation
  6. Brevity is the soul
  7. Wow the crowd
  8. Be realistic
  9. Solve the hard stuff, know the tools  (hmm.. isn't that two points?)
  10. Practice humility
This blog post is my practice of points 1 and 6.  And also one of my own:

11. Copy the work of other smart people

Beautiful cartography using OpenJUMP

An OpenJUMP user just posted some really nice cartographic maps made using a combination of OpenJUMP, Inkscape, GRASS, and GIMP.




He gives OJ the following glowing endorsement:
I find Open JUMP to be the most vector-friendly open source GIS software. The preparation of the datasets (rivers, lakes, sea, roads, borders) was really [a] piece of cake...
It's great to see the small but dedicated OpenJUMP community steadily adding new features and improving the software quality.  10 years after it was launched, OpenJUMP continues to be the "Little Open-Source GIS that Can".





Flight Paths in JEQL Redux

The intertubes are buzzing about a flight path visualization done by Michael Markieta.  This is based on the same OpenFlights dataset that I used a couple of years ago as a demonstration of JEQL processing and visualization capabilities.

Markieta's blog post outlines his workflow using ArcGIS.  It's a bit cumbersome - apart from having to jump through hoops to read the data from the original DAT files, apparently the dataset has to be split into six parts to be able to process it.  (For a measly 58K rows?!)

No details are provided about styling, which is the key part of the exercise.  The images apparently use alpha blending to show flight density.  Also, the coordinate system seems to be more curvaceous than the squaresville Plate Carree I used (so much more haute couture than saying Lat/Long). Both of these are easy to do in JEQL.  Here are some samples of the improved output, using the alpha channel and a Mollweide projection.

 Europe
North America

And here's the entire image, in glorious hi-res suitable for framing:









How to get OpenLayers WMSGetFeatureInfo to emit GeoServer CQL Filters for multiple layers

OpenLayers provides the useful WMSGetFeatureInfo control.  It's designed to work with the standard WMS GetFeatureInfo request.  As per the standard, the control supports querying multiple layers via setting the layers property.

It's often necessary to define client-side filters for WMS layers, to display only a subset of the layer data in the backing feature type. Usually the filters need to be defined dynamically, based on the application context.   When using GeoServer as the web mapping engine a convenient (but non-standard) way of doing this is to use the CQL_FILTER WMS parameter. (One might reasonably ask why there isn't an equally simple way to do this in the WMS standard itself, but that's another story).  In OpenLayers this parameter can be added dynamically to a layer via the mergeNewParams method:

lyr.mergeNewParams({'CQL_FILTER': "filter expression" });

Naturally it is necessary to have the GetFeatureInfo control respect the layer filters as well.  This is straightforward in the case of a single layer.  The GeoServer CQL_FILTER parameter can be supplied using the  vendorParams property on the WMSGetFeatureInfo control:

infoControl.vendorParams = { 'CQL_FILTER': 'filter expression'};

Since the CQL_FILTER parameter supports a list of filters, it's also straightforward to filter multiple layers as long as the list of layers queried is static:

infoControl.vendorParams = { 'CQL_FILTER': 'filt-1; filt-2; filt-3'};

But WMSGetFeatureInfo also provides the useful ability to query only visible layers (via the queryVisible property).  This makes things much trickier, since the list of filter expressions must match the list of layers provided in the QUERY_LAYERS parameter.  There's no built-in way of doing this in OpenLayers itself (not surprisingly, since the CQL_FILTER parameter syntax is specific to GeoServer only).

One way to do this is to build the CQL_FILTER parameter value dynamically using the CQL_FILTERs defined for the visible layers.  This can be done when the control is invoked, via hooking the beforegetfeatureinfo event.

Here's a code snippet to do this:

var infoControl;

function initInfoControl()
{
  infoControl = new OpenLayers.Control.WMSGetFeatureInfo({
    url: wms_url,
    title: 'Identify features by clicking',
    layers: [
      layers....
    ],
    queryVisible: true,
    maxFeatures: 3,
    infoFormat: 'application/vnd.ogc.gml'
  });
  infoControl.events.register(
    "beforegetfeatureinfo", null, onBeforeGetFeatureInfo);
  infoControl.events.register(
    "getfeatureinfo", null, onGetFeatureInfo);
  map.addControl(infoControl);
  infoControl.activate();
}

function onBeforeGetFeatureInfo(event)
{
  // build the CQL_FILTER param list from the CQL_FILTER params
  // of the layers which will actually be queried (i.e. the visible ones)
  var layers = infoControl.findLayers();
  var filter = "";
  for (var i = 0, len = layers.length; i < len; i++) {
    if (i > 0) filter += ";";
    var lyrCQL = layers[i].params.CQL_FILTER;
    if (lyrCQL != null) {
      filter += lyrCQL;
    }
  }
  infoControl.vendorParams = { 'CQL_FILTER': filter };
}



Although climbing up the OpenLayers learning curve often feels like a big struggle, it's important to recognize the very wide set of requirements that the library is trying to address.  Due to the nature of spatial data, user interfaces and protocols dealing with it are inherently complex.   The more I work with OpenLayers, the more appreciation I have for the fine balance between simplicity and flexibility the designers have achieved. (And if that sounds like I do not subscribe to the "Spatial is not special" canard, you're hearing me right!).


Maslow's Hierarchy in the 21st Century

Been a while since I posted, so posting some humour seems like a good start to getting back on track...


Even-distribution Random Points and Polygons in JTS

Recently I fixed the JTS KD-Tree implementation so that it works as advertised with a distance tolerance to provide point snapping.  This gives a fast way to produce random point fields with even distribution (i.e. no points too close together).

First, generate a batch of random points using RandomPointsBuilder.  As is well known, this produces a very "lumpy" distribution of points:


Then, put them in a KD-Tree using a snapping distance tolerance.  Querying all points in the final tree produces a nice even distribution of points:
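A minimal sketch of this workflow in JTS (class and variable names are just illustrative):

import java.util.List;
import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Envelope;
import com.vividsolutions.jts.index.kdtree.KdTree;
import com.vividsolutions.jts.shape.random.RandomPointsBuilder;

// Sketch: generate a lumpy random point field, then thin it by inserting
// the points into a KD-tree with a snapping distance tolerance.
public class EvenRandomPoints {
  public static List evenPoints(Envelope extent, int numPts, double tolerance) {
    RandomPointsBuilder builder = new RandomPointsBuilder();
    builder.setExtent(extent);
    builder.setNumPoints(numPts);
    Coordinate[] pts = builder.getGeometry().getCoordinates();

    KdTree tree = new KdTree(tolerance);
    for (Coordinate p : pts) {
      tree.insert(p);   // points within tolerance snap to an existing node
    }
    // querying the whole extent returns the thinned, evenly spread nodes
    return tree.query(extent);
  }
}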


Using the Concave Hull algorithm available here with the same distance tolerance produces a random polygon with a very pleasing appearance:
I suspect that these kinds of polygons might be useful for generating stress tests for geometric algorithms.

UPDATE: Adding a bit of Bezier Smoothing produces an even cooler-looking polygon:



SliceGraphs in JEQL

Variable-Width Buffers in JTS

Inspiration:  this post on the JSTS group (with an image - good job on requirements!)

JTS implementation: in the lab

Code:   Geometry geom = VariableWidthBuffer.buffer( line, 10, 80 );

Output:


The world needs a new flavour of SOSS!

Yes, you don't know what the acronym SOSS means, because I just made it up. SOSS stands for Standard Open Simple Spatial format.

It's crazy that in the 21st century the most common de facto standard spatial format is based on 30-year-old technology, is proprietary, and has silly limitations such as 11-character uppercase attribute names.

I'm talking, of course, about shapefiles.

Surely we can do better than this?!

Now there are actually a few things that shapefiles get right. For instance, the shapefile's simplistic tabular data model gets two full marks for being - simple and tabular! Hierarchical data models are very cool and highly expressive, but overkill and too complex for 80% of the use cases out there.

Another useful feature of shapefiles is that they store floating point data with full precision - i.e. in binary. Representing binary floating point numbers as textual decimal values is inherently lossy, and causes all kinds of subtle and annoying problems. (I'm always surprised that this doesn't crop up more often as a serious limitation of GML.)
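To make the lossiness concrete, here's a tiny Java sketch: a coordinate written with a typical fixed number of decimal digits does not round-trip, whereas the original binary value (or a rendering that keeps all significant digits) does.

// Sketch: fixed-precision decimal text loses information that the
// original binary double carried.
public class LossyDecimal {
  public static void main(String[] args) {
    double x = 1.0 / 3.0;

    String sixDigits = String.format("%.6f", x);               // "0.333333"
    System.out.println(x == Double.parseDouble(sixDigits));    // false: precision lost

    // Java's shortest round-trip rendering preserves the exact value,
    // but only because it keeps all the significant digits needed.
    System.out.println(x == Double.parseDouble(Double.toString(x)));  // true
  }
}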

So what are the current leading contenders for a SOSS format? Here's an opinionated list, with pros and cons


  • Shapefile - Pro: tabular, lossless numerics.  Con: proprietary, antiquated, limited.
  • GML - Pro: open, flexible.  Con: complex to model and parse, lossy numerics, poor schema handling.
  • KML - Pro: relatively simple, well documented.  Con: proprietary, lossy, limited attribute handling, designed for presentation.
  • GeoRSS - Con: not appropriate as a full-featured SOSS, lossy.
  • GeoJSON - Con: too tied to Javascript, lossy, no schema standard.
  • YAML - Con: needs a spatial profile.

Conspicuous by its absence on this list is XML. In fact XML is a meta-format, not a format. To utilize XML would require defining an appropriate profile (which would need to be highly restricted to meet the criteria of simple). The major drawback of XML is that specifying the profile almost inevitably drags one into the mind-bending hell of XML Schema. (There are other schema languages, such as RelaxNG, but they involve similar complexity and have even less traction).

There are also more esoteric formats such as NetCDF, but that one fails the simplicity test, and it's unclear how well it supports Geometry types.



