
Finding nearest neighbor with smaller attribute value in QGIS for a huge dataset: make expression more efficient

Geographic Information Systems Asked on June 13, 2021

The setting
I have a Geopackage point layer in QGIS 3.18 with over 350.000 features for an area of about 1700 km² (extent ca. 50*60 km), representing centroids of buildings. The points contain an attribute with the construction year of the building: from 1000 to 2020. A few statistical values, based on Basic statistics for fields, can be found below. CRS is EPSG:2056 (projected CRS for Switzerland, units=m).

What I want to do
The idea now is to find for each building the nearest building that is older and create an attribute nearest_older with the fid of this next older building – see the visualization at the bottom.

In a conceptual sense, it is similar to the concept of Topographic isolation: for a summit, find the minimum distance to a point of equal/higher elevation.

What I tried
I found an expression that works in principle; see it below. Some additional ideas on how to proceed are listed at the bottom as well.

The problem
The expression is not efficient for such a huge dataset (at least it does not finish in reasonable time). For testing, I reduced the computation time by A) letting it run only for the first 999 features (if (fid < 1000, ...)) and B) limiting the number of nearest features taken into consideration to 1000 (limit:=1000). Even with these limitations, the calculation runs for about 15 minutes.

The Question
How could the expression be improved for performance (calculation time)?
There will probably be no ideal solution, just an optimal one. So I'm interested in a solution that significantly reduces the calculation time in a way that makes it operational for my dataset.

The expression

if (
    fid < 1000,
with_variable (
    'cy',
    "construction_year" ,
    array_first (
        array_filter (  
            overlay_nearest( 
                @layer, 
                $id,  
                limit:=1000
            ),
            attribute (
                get_feature_by_id (
                    @layer, 
                    @element
                ), 
                'construction_year'
            ) < @cy
        )
    )
),
'')

Explanation how the expression works

  1. overlay_nearest() gets an ordered array of the ids of the x nearest neighbours (x here is set to 1000)

  2. array_filter(): filter this array so that only elements (@element) are retained whose attribute() named construction_year is smaller than the variable @cy – and this variable is defined as the construction year of the current feature. So for each feature, the filter returns an array of ids whose construction year is older than that of the current feature.

  3. array_first() returns the first id of this array. As overlay_nearest from step 1 orders the array by closeness to the current feature, this is the id of the nearest_older feature I am looking for.
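The three steps can be mimicked in plain Python on toy data (the ids and years below are hypothetical, loosely following the example in the visualization section; neighbor_ids stands in for the distance-ordered array that overlay_nearest() would return):

```python
# Mimic of the expression's logic in plain Python (sample data, not QGIS API).
# years maps feature id -> construction_year; neighbor_ids is assumed to be
# ordered by distance, nearest first, as overlay_nearest() returns it.
years = {1: 1958, 2: 1986, 3: 1969, 4: 1996, 5: 1960, 6: 1940}

def nearest_older(fid, neighbor_ids, years):
    cy = years[fid]                       # @cy: year of the current feature
    older = [nid for nid in neighbor_ids  # array_filter(): keep only older ones
             if years[nid] < cy]
    return older[0] if older else None    # array_first(): nearest of those

# Neighbors of feature 1, nearest first: four closer but newer buildings
# are skipped before the older one is found.
print(nearest_older(1, [2, 3, 4, 5, 6], years))  # -> 6 (year 1940)
```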

Further ideas how to improve performance

Here are a few ideas I had how to make the expression more efficient. All of these approaches have some shortcomings, however. And particularly, I’m stuck how to combine them in the most efficient way. And probably there are other approaches, not considered yet.

  1. Run it sequentially, like in the if-clause above: for features 1 to 999, then for 1000 to 1999 and so on. Not very efficient, however.

  2. limit:=1000, reducing the number of elements in the array created by overlay_nearest() to 1000, is inflexible: the newer a building is, the higher the chance that an older one can be found very close to it. Thus, for the majority of the buildings (which were constructed in the last few decades), you won't need a fixed number of 1000 nearest neighbors – 50 or 100 or so would be enough. So the fixed value 1000 could be replaced by a formula that returns a value inversely related to the construction year. But how to derive an "optimum" formula from the distribution of the values in my field construction_year? For this, compare the statistical values below.

  3. A "match" does not have to be found for every feature in one pass. For some features, no matching id of an older building will be found in the first round with the limits defined. These NULL cases could then be recalculated (based on a condition: if NULL) in a next iteration. So an iterative approach could be used – but how to set it up?
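One possible way to turn idea 2 into a formula, sketched in plain Python: if a fraction p of all buildings is older than the current one, a random neighbor is older with probability roughly p, so on average about 1/p neighbors must be scanned before the first older one appears. The safety factor and the caps below are arbitrary assumptions, not derived from the real data:

```python
import bisect

# Heuristic sketch: derive a per-feature neighbor limit from the year
# distribution instead of a fixed limit of 1000. sorted_years is the sorted
# list of all construction years in the layer.
def neighbor_limit(year, sorted_years, safety=5, lo=10, hi=1000):
    older = bisect.bisect_left(sorted_years, year)  # count of strictly older
    if older == 0:
        return hi            # oldest building(s): no older neighbor exists
    p = older / len(sorted_years)
    return max(lo, min(hi, int(safety / p)))        # ~safety/p, clamped

sorted_years = sorted([1900] * 50 + [1980] * 950)   # toy distribution
print(neighbor_limit(1990, sorted_years))  # everything older -> low limit
print(neighbor_limit(1950, sorted_years))  # only 5% older -> higher limit
```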

A few statistical values

These values are based on Basic statistics for fields for the field construction_year

Count: 336856
Unique values: 629
Minimum: 1000
Maximum: 2020
Range: 1020
Mean: 1954
Median: 1971
Standard deviation: 64.5
Coefficient of Variation:  0.033
Minority (rarest occurring value): 1004
Majority (most frequently occurring value): 1980
First quartile: 1935
Third quartile: 1994
Interquartile Range (IQR): 59

Histogram, created with DataPlotly plugin: distribution of construction years, with the highest peak at 1980/81, lower peaks at 1928-1931 etc.:

Visualization

This shows the principle: each point is labeled with its construction year and connected by a red arrow to the nearest point with an older construction year. The point labeled with 1958 at the very bottom is connected to a point with label 1940, even though it has four neighboring points at a nearer distance, but with newer construction date: 1986, 1969, 1996 and 1960 – so it goes on until the first (nearest) point is found with an older construction date:

3 Answers

Here is a more universal processing tool for this task with some advantages over my basic script:

  • It stays responsive while executing
  • Ability to cancel the execution
  • Usable in graphical modeler
  • It creates a copy of the layer instead of editing it
  • Simple GUI with progress bar
  • Adds ID and attribute of the near neighbor as well as the distance between feature and matching neighbor
  • Free choice of fields and operators
  • Setting for maximum nearest neighbors to compare (most likely increases performance a lot on large layers but most likely also leads to more non-matches!)
  • Setting for maximum search distance (can increase performance a lot on large layers but can also lead to more non-matches!)
  • Setting for a value where no comparison is done (can increase performance a lot on large layers, if chosen properly. Typically this is the minimum value of the available attributes when using < or <= operator and the maximum value of the available attributes when using the > or >= operator; only usable when comparing numerical values!)
# V1.3.2

from PyQt5.QtCore import QCoreApplication, QVariant
from qgis.core import (QgsSpatialIndex, QgsProcessingParameterFeatureSink, QgsFeatureSink, QgsField, QgsFields, QgsFeature, QgsGeometry, QgsPoint, QgsWkbTypes, 
                       QgsProcessingAlgorithm, QgsProcessingParameterField, QgsProcessingParameterBoolean, QgsProcessingParameterVectorLayer, QgsProcessingOutputVectorLayer, QgsProcessingParameterEnum, QgsProcessingParameterNumber)
import operator

class NearNeighborAttributeByAttributeComparison(QgsProcessingAlgorithm):
    SOURCE_LYR = 'SOURCE_LYR'
    SOURCE_FIELD = 'ID_FIELD'
    ATTRIBUTE_FIELD = 'ATTRIBUTE_FIELD'
    MAX_NEIGHBORS = 'MAX_NEIGHBORS'
    MAX_DISTANCE = 'MAX_DISTANCE'
    DONOT_COMPARE_VALUE = 'DONOT_COMPARE_VALUE'
    DONOT_COMPARE_BOOL = 'DONOT_COMPARE_BOOL'
    OPERATOR = 'OPERATOR'
    OUTPUT = 'OUTPUT'

    def initAlgorithm(self, config=None):
        
        self.addParameter(
            QgsProcessingParameterVectorLayer(
                self.SOURCE_LYR, self.tr('Source Layer'))) # Take any source layer
        self.addParameter(
            QgsProcessingParameterField(
                self.SOURCE_FIELD, self.tr('Attribute field containing unique IDs'),'id','SOURCE_LYR'))
        self.addParameter(
            QgsProcessingParameterField(
                self.ATTRIBUTE_FIELD, self.tr('Attribute field for comparison'),'year','SOURCE_LYR'))
        self.addParameter(
            QgsProcessingParameterNumber(
                self.MAX_NEIGHBORS, self.tr('Maximum number of nearest neighbors to compare (use -1 to compare all features of the layer)'),defaultValue=1000,minValue=-1))
        self.addParameter(
            QgsProcessingParameterNumber(
                self.MAX_DISTANCE, self.tr('Only take nearest neighbors within this maximum distance into account for comparison'),defaultValue=10000,minValue=0))
        self.addParameter(
            QgsProcessingParameterEnum(
                self.OPERATOR, self.tr('Operator to compare the attribute value (If attribute is of type string, only == and != do work)'),
                    ['<','<=','==','!=','>=','>'],defaultValue=0))
        self.addParameter(
            QgsProcessingParameterNumber(
                self.DONOT_COMPARE_VALUE, self.tr('Do not search for matches on features having a value (chosen operator) x \n Only works for numerical values and depends on the chosen operator. \n Typically this should be the max or min value available in the attributes, so there cannot be a match.'),defaultValue=0))
        self.addParameter(
            QgsProcessingParameterBoolean(
                self.DONOT_COMPARE_BOOL,self.tr('Check this Box to actually use the previous option ("Do not search for matches on ....")'),defaultValue=0))
        self.addParameter(
            QgsProcessingParameterFeatureSink(
                self.OUTPUT, self.tr('Near Neighbor Attributes'))) # Output

    def processAlgorithm(self, parameters, context, feedback):
        # Get Parameters and assign to variable to work with
        layer = self.parameterAsLayer(parameters, self.SOURCE_LYR, context)
        idfield = self.parameterAsString(parameters, self.SOURCE_FIELD, context)
        idfield_index = layer.fields().indexFromName(idfield) # get the fieldindex of the id field
        idfield_type = layer.fields()[idfield_index].type() # get the fieldtype of this field
        attrfield = self.parameterAsString(parameters, self.ATTRIBUTE_FIELD, context)
        attrfield_index = layer.fields().indexFromName(attrfield) # get the fieldindex of the attribute field
        attrfield_type = layer.fields()[attrfield_index].type() # get the fieldtype of this field
        maxneighbors = self.parameterAsDouble(parameters, self.MAX_NEIGHBORS, context)
        maxdistance = self.parameterAsDouble(parameters, self.MAX_DISTANCE, context)
        donotcomparevalue = self.parameterAsDouble(parameters, self.DONOT_COMPARE_VALUE, context)
        donotcomparebool = self.parameterAsBool(parameters, self.DONOT_COMPARE_BOOL, context)
        op = self.parameterAsEnum(parameters, self.OPERATOR, context) # get the index of the chosen operator
        #import operator
        ops = { # get the operator by this index
            0: operator.lt,
            1: operator.le,
            2: operator.eq,
            3: operator.ne,
            4: operator.ge,
            5: operator.gt
            }
        op_func = ops[op] # create the operator function
        
        total = 100.0 / layer.featureCount() if layer.featureCount() else 0 # Initialize progress for progressbar
        
        # if -1 has been chosen for maximum features to compare, use the amount of features of the layer, else use the given input
        if maxneighbors == -1:
            maxneighbors = layer.featureCount()
        
        fields = layer.fields() # get all fields of the inputlayer
        fields.append(QgsField("near_id", idfield_type)) # create new field with same type as the inputfield
        fields.append(QgsField("near_attr", attrfield_type)) # same here for the attribute field
        fields.append(QgsField("near_dist", QVariant.Double, len=20, prec=5)) # add a new field of type double
        
        idx = QgsSpatialIndex(layer.getFeatures()) # create a spatial index

        (sink, dest_id) = self.parameterAsSink(parameters, self.OUTPUT, context,
                                               fields, layer.wkbType(),
                                               layer.sourceCrs())
                                               
        for current, feat in enumerate(layer.getFeatures()): # iterate over source 
            new_feat = QgsFeature(fields) # copy source fields + appended
            attridx = 0 # reset attribute fieldindex
            for attr in feat.attributes(): # iterate over attributes of source layer for the current feature
                new_feat[attridx] = attr # copy attribute values over to the new layer
                attridx += 1 # go to the next field
            new_feat.setGeometry(feat.geometry()) # copy over the geometry of the source feature
            if ((not(op_func(feat[attrfield], donotcomparevalue))) or (not donotcomparebool)): # only search for matches if not being told not to do so
                nearestneighbors = idx.nearestNeighbor(feat.geometry(), neighbors=maxneighbors, maxDistance=maxdistance) # get the featureids of the maximum specified number of near neighbors within a maximum distance
                try:
                    nearestneighbors.remove(feat.id()) # remove the current feature from this list (otherwise the nearest feature by == operator would always be itself...)
                except ValueError:
                    pass # the current feature was not in the list
                for near in nearestneighbors: # for each feature iterate over the nearest ones (the index is already sorted by distance, so the first match will be the nearest match)
                    if op_func(layer.getFeature(near)[attrfield], feat[attrfield]): # if the current nearest attribute is (chosen operator here) than the current feature ones, then
                        new_feat['near_id'] = layer.getFeature(near)[idfield] # get the near matchs's id value and fill the current feature with its value
                        new_feat['near_attr'] = layer.getFeature(near)[attrfield] # also get the attribute value of this near feature
                        new_feat['near_dist'] = feat.geometry().distance(layer.getFeature(near).geometry()) # and finally calculate the distance between the current feature and the nearest matching feature
                        break # break the for loop of near features and continue with the next feat
            else: # do not search for near neighbor matches if given value is (operator here) than x
                pass # do nothing and continue adding the feature
                    
            sink.addFeature(new_feat, QgsFeatureSink.FastInsert) # add feature to the output
            feedback.setProgress(int(current * total)) # Set Progress in Progressbar
            
            if feedback.isCanceled(): # Cancel algorithm if button is pressed
                break

        return {self.OUTPUT: dest_id} # Return result of algorithm

    def tr(self, string):
        return QCoreApplication.translate('Processing', string)

    def createInstance(self):
        return NearNeighborAttributeByAttributeComparison()

    def name(self):
        return 'NearNeighborAttributeByAttributeComparison'

    def displayName(self):
        return self.tr('Add near neighbor attribute by comparing attribute values')

    def group(self):
        return self.tr('FROM GISSE')

    def groupId(self):
        return 'from_gisse'

    def shortHelpString(self):
        return self.tr(
        'This algorithm searches the \n'
        '- x nearest neighbors \n'
        '- within a given maximum distance \n'
        'of the current feature and compares a given attribute. \n'
        'If this comparison returns true, it adds the id and the attribute of this neighbor to the current feature, as well as the distance to this neighbor. \n\n'
        'Further explanations available on https://gis.stackexchange.com/a/396856/107424'
        )

To use it, just copy-paste the code without any changes and place the Python file in C:\Users\yourusername\AppData\Roaming\QGIS\QGIS3\profiles\default\processing\scripts. See the docs for more information. Once saved, you will find it in your processing toolbox under "Scripts" -> "FROM GISSE".
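The operator-lookup pattern the script relies on (the enum index chosen in the dialog selecting a comparison function from Python's operator module) works standalone; a minimal sketch:

```python
import operator

# Standalone version of the script's operator lookup: an integer index
# (as delivered by the enum parameter) selects a comparison function.
ops = {0: operator.lt, 1: operator.le, 2: operator.eq,
       3: operator.ne, 4: operator.ge, 5: operator.gt}

op_func = ops[0]                 # index 0 is '<', used for "strictly older"
print(op_func(1940, 1958))       # True: 1940 is older than 1958
print(ops[4](2020, 2020))        # index 4 is '>=', which matches equal years
```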


Only tested in QGIS 3.18.2. I can already tell this won't work in QGIS 3.2 (as intended), as that version has no option to limit the maximum search distance. I don't know when this feature was introduced, though. However, feel free to test it in your version; the worst that can happen is an error or a crash. In that case, upgrade your QGIS or edit the script accordingly.


Some further explanations on this script and the used method:

First, a spatial index is built on the layer. Then we iterate over the whole layer. For each "actual feature" (naming it this way to avoid confusion with my term "near feature"), we get the nearest neighbors of the current actual feature using the QgsSpatialIndex().nearestNeighbor(<current_point>, <maximum_neighbors>, <maximum_distance>) method. This method returns a list of the nearest feature ids, ordered ascending by distance. Then, to prevent the feature from finding itself when using the == operator, we remove the current actual feature's id from this list.

Now we iterate over this ordered list of near feature ids (unless the "Do not search for matches on..." option rules the feature out). First we get the very closest other feature by its id and check if it fulfills the given attribute comparison criterion. If not, we go to the second nearest feature, check again, go to the third nearest, and so on. Once we find the first one fulfilling the criterion, we grab the two attributes with a focused layer.getFeature(<near_matching_feat_id>)[<field>] request and calculate the distance using <current_actual_feat>.geometry().distance(layer.getFeature(<near_matching_feat_id>).geometry()). These two operations add negligible calculation time, and they are done only once per actual feature, as we do not iterate here. As soon as this is done, we stop the iteration over the ordered list of near features (break command) and go to the next actual feature.
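The core loop can be sketched in pure Python (toy coordinates and years, no QGIS; a brute-force distance sort plays the role of QgsSpatialIndex.nearestNeighbor, which likewise returns ids ordered ascending by distance):

```python
import math
import operator

# points: id -> (x, y, attribute value), e.g. construction year
points = {1: (0, 0, 1958), 2: (1, 0, 1986), 3: (0, 1, 1969),
          4: (1, 1, 1996), 5: (2, 0, 1960), 6: (5, 0, 1940)}

def nearest_match(fid, points, op_func):
    x, y, val = points[fid]
    # stand-in for the spatial index: all other ids, nearest first
    candidates = sorted((oid for oid in points if oid != fid),
                        key=lambda oid: math.hypot(points[oid][0] - x,
                                                   points[oid][1] - y))
    for oid in candidates:
        ox, oy, oval = points[oid]
        if op_func(oval, val):              # first match wins -> stop early
            return oid, oval, math.hypot(ox - x, oy - y)
    return None                             # no match among the candidates

print(nearest_match(1, points, operator.lt))  # -> (6, 1940, 5.0)
```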


Runtime and optimal settings:

If I am not completely mistaken, the calculation time in the best case should be t = time_needed_for((number_of_actual_features-1)*1) + time_needed_for(creating_index). The best case occurs if, for every actual feature, the very closest other (near) feature fulfills the attribute comparison criterion. In the worst case it should be t = time_needed_for((number_of_actual_features-1)^2) + time_needed_for(creating_index): for every actual feature, only the very farthest other (near) feature fulfills the criterion. We have three options to prevent the worst case. First, limit the maximum number of near features to compare: when limiting to max. 1000 comparisons, the worst case becomes t = time_needed_for((number_of_actual_features-1)*1000) + time_needed_for(creating_index), provided there are at least 1000 actual features. Second, less helpfully, limit the maximum distance of near features to compare – less helpful because within that distance, theoretically, all other near features could still be included. Third, limit the search by an attribute value: this makes sense with the available minimum attribute when using the < or <= operator, or the available maximum attribute when using the > or >= operator. It prevents searching where there cannot be a match anyway (e.g. with < and the minimum year available).

So the optimal settings strongly depend on your individual layer and your individual desired result. On the individual layer, because the runtime depends on the spatial and attributional distribution of your features. The algorithm searches for the nearest feature matching the attribute condition, ordered by distance. If the very nearest feature is always already a match, the runtime is short. If not, the more features need to be compared, the longer the runtime. Here your desired result comes into play: you can limit the number of features to compare and/or their maximum distance. Let's say you limit the number of features to compare to 100: if there is no attributional match within the 100 closest features, the loop is abandoned and the current actual feature simply gets no result – so no result, but increased runtime. But if there is a match at, say, the 51st closest feature, this setting has no effect at all, as the loop is exited after the 51st comparison anyway.
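This trade-off can be sketched with toy data (no QGIS; the years are assumed to be ordered nearest-first, as the index would deliver them): capping the scan can only cause misses for features whose first match lies beyond the cap, and it never changes a result found earlier.

```python
# Neighbors of a hypothetical feature, nearest first (toy values)
years_by_distance = [1986, 1969, 1996, 1960, 1940]

def first_older(own_year, ordered_years, max_neighbors):
    for year in ordered_years[:max_neighbors]:   # cap the scan
        if year < own_year:
            return year                          # first match wins
    return None                                  # no match within the limit

print(first_older(1958, years_by_distance, 3))   # limit too small -> None
print(first_older(1958, years_by_distance, 5))   # match at the 5th neighbor
print(first_older(2000, years_by_distance, 3))   # early match: cap irrelevant
```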

Here is an example I tested: 1000 randomly distributed points within a given extent of 10 km * 10 km and random years between 1850 and 2020, without attribute search limitations.

  • Max. features to compare: 10; max. search distance: 10000 m; runtime = 0.2 seconds; no match for 103 features
  • Max. features to compare: 100; max. search distance: 10000 m; runtime = 0.34 seconds; no match for 11 features
  • Max. features to compare: 1000; max. search distance: 10000 m; runtime = 1.13 seconds; no match for 7 features (all of them have the lowest year, 1850, so there can't be a match, and the algorithm definitely searched the whole layer for these 7 features: 7*1000 iterations at minimum; I don't know about the other features...)

The option newly implemented in V1.3 ("Do not search for matches on ...") can prevent a bad case like the last example setting: when there can't be a match, it avoids searching at all. The option can also be useful in other cases where one wants to limit the match searching.


Possible improvements:

So far, not being an expert in Python, I think performance could maybe be improved by using an attribute index, which could allow directly filtering out (non-)matches. However, my question whether such a thing exists is still unanswered, and I am not sure whether my "workaround" would increase performance much, or whether an attribute index improves performance here in combination with a spatial index. (Thinking about the whole thing again, the performance and result can be improved by adding another input where you manually set a year below which no search is done at all, e.g. the minimum year, where there can't be a result anyway. --> just implemented that in V1.3.)

Surely there are other things as well that a Python expert could optimize in this script. Processing tools run as background threads; that's why QGIS stays responsive while executing it.


Why not using expressions here?

Now let's come to the question of why expressions are a lot slower here. To my knowledge, the reasons are: they do not use a spatial index, and especially, the calculation is feature-based. So basically overlay_nearest() aggregates the whole layer as many times as the layer has features. The big difference in the Python script is that we build a spatial index only once and do not aggregate anything: we get the nearest neighbors by using this index on each iteration. As already said, I am not 100% sure here, so I welcome my statement being confirmed or disproved.

Correct answer by MrXsquared on June 13, 2021

If you are open to a PyQGIS solution you can use this script. It takes about two seconds on my test layer with 1000 features.

from PyQt5.QtCore import QVariant
from qgis.core import QgsField, QgsSpatialIndex, edit
from qgis.utils import iface

layer = iface.activeLayer() # get the layer
layer.dataProvider().addAttributes([QgsField('nearest_older_fid', QVariant.Int)]) # add new field
layer.updateFields() # update fields so we can work with the new one

idx = QgsSpatialIndex(layer.getFeatures()) # create a spatial index
nfeats = layer.featureCount() # count total features of the layer

with edit(layer): # edit the layer
    for feat in layer.getFeatures(): # iterate over the layer's features
        for near in idx.nearestNeighbor(feat.geometry(), nfeats): # for each feature iterate over the nearest ones
            if layer.getFeature(near)['year'] < feat['year']: # if the nearest candidate's year is smaller than the current feature's
                feat['nearest_older_fid'] = layer.getFeature(near)['id'] # take the matching feature's "id" attribute and fill the current feature with its value
                # Alternatively you could use:
                # feat['nearest_older_fid'] = near # this uses the feature id of the nearest feature directly instead of its "id" attribute
                layer.updateFeature(feat) # update the feature
                break # break the for loop of near features and continue with the next feat

You may need to adjust the field names.

The key elements are:

  • usage of a spatial index to find the nearest features
  • iterate over the nearest features
  • break this iteration as soon as a near feature's year is smaller than the current feature's year


Answered by MrXsquared on June 13, 2021

Testing-Results: best settings for the script by MrXsquared

Finally, I can present the results of my testing series, based on the Python script (accepted answer) by MrXsquared. To anticipate the result: a setting of max_distance 500 meters and max_neighbors 50 was the most performant one. The shortest way to find all results is an iterative approach, with relatively low settings for the first run and then increased values for the remaining features where no "nearest older neighbor" was found in the first run.

It took just over half an hour in three iterations to get a "nearest older" neighbor for all points in my dataset. Running the script did not block the use of Windows or even QGIS. While the script was running (with no other activity on the machine), CPU utilization was about 32%. Details below.

Influence of the computer used on performance

However: the influence of the machine you use should not be underestimated; it may affect performance even more than the settings used. I ran all tests on the same machine (8 CPUs, 2 GHz, 16 GB RAM). For comparison, I ran the "winning setting" on another (older) machine as well (4 CPUs, 3.33 GHz, 8 GB RAM): there it took 2114 seconds (instead of 1772, thus more than 19% longer) to get the same result (test no. 8 in the table below).

The basics

My layer consists of 336.856 point features. The idea was to identify the maximum possible number of nearest neighbors in a minimum of time: to make the calculation as efficient as possible. I tested different settings for maximum distance and maximum number of nearest neighbors to compare.

The tests

Below you see the table of results from running the script on the same set of 336.856 points with 22 different combinations of max_distance and max_neighbors. The columns show: the settings; the calculation time in seconds and in min:sec; the number of features for which a "nearest older" was found, as an absolute value and as a percentage of the whole dataset; the same for the number of features where no "nearest older neighbor" was found (with the settings used); and, in the last two columns, the mean number of items found per second and its inverse, the mean time in seconds to find 1000 items.

The following graphic (done in QGIS, by the way) shows the result for all but the last run. The number in the label corresponds to the no value (first column) of the table below, where you can see the details for each entry. The size of the symbols corresponds to the value of the max_distance setting (not linear!), the color corresponds to the number of max_neighbors: the darker, the higher the value for maximum neighbors to consider.


no | max_dist | max_neighbors | duration (s) | min:sec | found  | found % | not found | not found % | found/s | s/1000 found
 1 |    100   |   50          | 1619         | 26:59   | 304395 | 90.36   | 32461     | 9.64        | 188     | 5.32
 2 |    100   |  100          | 1676         | 27:56   | 306313 | 90.93   | 30543     | 9.07        | 182.8   | 5.47
 3 |    100   |  200          | 1727         | 28:47   | 306413 | 90.96   | 30443     | 9.04        | 177.4   | 5.64
 4 |    125   |   50          | 1722         | 28:42   | 311261 | 92.4    | 25595     | 7.6         | 180.8   | 5.53
 5 |    150   |   50          | 1734         | 28:54   | 314404 | 93.33   | 22452     | 6.67        | 181.3   | 5.52
 6 |    175   |   50          | 1752         | 29:12   | 316167 | 93.86   | 20689     | 6.14        | 180.5   | 5.54
 7 |    250   |   50          | 1753         | 29:13   | 319179 | 94.75   | 17677     | 5.25        | 182.1   | 5.49
 8 |    500   |   50          | 1772         | 29:32   | 322258 | 95.67   | 14598     | 4.33        | 181.9   | 5.5
 9 |   1000   |   50          | 1820         | 30:20   | 322906 | 95.86   | 13950     | 4.14        | 177.4   | 5.64
10 |    175   |  100          | 1929         | 32:09   | 322563 | 95.76   | 14293     | 4.24        | 167.2   | 5.98
11 |    200   |  100          | 1953         | 32:33   | 324417 | 96.31   | 12439     | 3.69        | 166.1   | 6.02
12 |    250   |  100          | 1986         | 33:06   | 326600 | 96.96   | 10256     | 3.04        | 164.5   | 6.08
13 |    500   |  100          | 2001         | 33:21   | 330164 | 98.01   |  6692     | 1.99        | 165     | 6.06
14 |   5000   |  100          | 2035         | 33:55   | 331107 | 98.29   |  5749     | 1.71        | 162.7   | 6.15
15 |   1000   |  100          | 2049         | 34:09   | 331055 | 98.3    |  5801     | 1.7         | 161.6   | 6.19
16 |   1000   |  200          | 2246         | 37:26   | 334165 | 99.2    |  2691     | 0.8         | 148.8   | 6.72
17 |   2000   |  200          | 2254         | 37:34   | 334261 | 99.23   |  2595     | 0.77        | 148.3   | 6.74
18 |   1000   |  500          | 2557         | 42:37   | 335686 | 99.65   |  1170     | 0.35        | 131.3   | 7.62
19 |   5000   |  500          | 2611         | 43:31   | 335888 | 99.71   |   968     | 0.29        | 128.6   | 7.77
20 |  10000   | 1000          | 2853         | 47:33   | 336379 | 99.86   |   477     | 0.14        | 117.9   | 8.48
21 |  50000   | 1000          | 2889         | 48:09   | 336379 | 99.86   |   477     | 0.14        | 116.4   | 8.59
22 |  10000   | 2000          | 3289         | 54:49   | 336615 | 99.93   |   241     | 0.07        | 102.3   | 9.77

Identifying the most efficient setting

So which is the most efficient setting? It should be placed as far to the upper left as possible (minimum time, maximum % of items found). Connecting points 1 to 10, you can see an S-shaped curve. Points 1 and 8 appear the most performant, as all the others lie to the right of the line connecting 1 and 8. Point 1 has the best overall found/s ratio, but setting 8 finds substantially more points (5 percentage points more) without much more calculation time: it increases by only two and a half minutes (9%).

Points 9 and 10 do not substantially improve the result; on the contrary, they just take longer. Points 12 to 21 (including 22, which is not on the graphic) describe a kind of asymptotic curve: the effort to find additional results increases disproportionately. So these points can't be considered efficient solutions.

The winner and further iteration

So the most efficient solution is the one numbered here with 8: it finds a nearest older neighbor for 95.67% of the 336.856 features in just under half an hour (29:32 min). So from all the tested settings, it has the best time/found-ratio.

  1. First run: see the settings for test no. 8 in the table above.

  2. Second run: for the remaining 14.598 "not found" features, I ran the script a second time with these settings: max. no. of nearest neighbors: 1000; max. distance: 100.000 (100 km). Duration: 194 sec. Result: nearest neighbor found for all but 21 of the 14.598 features.

  3. Third run: for the last 21 features, I ran the script a third time with the same settings as for the second run: max. no. of nearest neighbors: 1000; max. distance: 100.000 (100 km). Duration: 0.12 sec. Result: nearest neighbor found for all but 2 of the 21 features.

So after this third run, a match was found for every feature that can have one. The remaining two are the two oldest buildings of the dataset, so for them no "nearest older" exists.
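The three runs above follow a general multi-pass pattern, sketched here with hypothetical toy data (no QGIS): run with a cheap limit first, then retry only the unmatched features with a larger limit, so each pass is more expensive per feature but handles fewer features.

```python
# find_first stands in for one per-feature search with a neighbor limit;
# ordered_others is assumed to be sorted nearest-first.
def find_first(own, ordered_others, limit):
    return next((v for v in ordered_others[:limit] if v < own), None)

features = {1: 1950, 2: 1990, 3: 1000}          # id -> year (toy data)
neighbors = {1: [1990, 1000],                   # neighbor years, nearest first
             2: [1950, 1000],
             3: [1950, 1990]}

results, unmatched = {}, list(features)
for limit in (1, 2):                            # increasing limits per pass
    still = []
    for fid in unmatched:
        match = find_first(features[fid], neighbors[fid], limit)
        if match is None:
            still.append(fid)                   # retry in the next pass
        else:
            results[fid] = match
    unmatched = still
print(results, unmatched)  # feature 3 is the oldest: never matched
```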

Conclusion

Thus, the calculation for the whole dataset in three runs took 1772 + 194 + 0 = 1966 sec., or 32:46 min. – a bit more than half an hour. For the 336.856 original point features, that means on average 171.3 features calculated per second, or 0.0058 sec. per feature.

In my context, this is quite a good result. That's why I think the answer is well worth its bounty.

Answered by Babel on June 13, 2021
