Shortest path ray tracing on parallel GPU devices |
Algorithm 3 is the kernel Relaxation. It takes care of finding the smaller traveltime to each vertex up to the current iteration. This kernel is launched for each vertex, so in a similar way to the kernel function that precalculates the weights, it starts by obtaining the current vertex indices . The following step is to iterate each neighbor of vertex . During each iteration it calculates the following expression:
(1) |
Figure 3 shows schematically this calculation. It is the current smaller traveltime from source to neighbor vertex plus the traveltime from this vertex to vertex . This is a candidate smaller traveltime from source vertex to passing through . If this traveltime is smaller than the current smaller traveltime from source vertex to vertex , we have to actualize its value and record that this new smaller traveltime is reached through vertex
tt+w1
Figure 3. Traveltime compared during relaxation kernel. |
---|
It should be noted that during this operation each kernel thread, although reads information from many differente vertices, only modifies information of its own vertex and no other kernel thread modify this vertex. This is the main difference with the shortest path algorithm of Harish and Narayanan (2007), that at this point calculates the expression:
(2) |
i.e. the sum of the traveltime from source to vertex and the traveltime from this vertex to vertex . Figure 4 displays schematically this calculation. This value is a tentative smaller traveltime from source to vertex passing through . If this is small that the current smaller traveltime to vertex it modifies traveltime and predecessor values for vertex . The problem is that another kernel thread might be, concurrently, attempting to modify information of the same vertex (because vertices usually belong to many neighborhoods) and this can lead to race conditions.
tt+w2
Figure 4. Traveltime compared in Harish and Narayanan (2007). |
---|
The solution of Harish and Narayanan (2007) is to use an atomic function that compares and possibly stores in one single instruction the smaller traveltime in . This strategy avoids the race condition but has two drawbacks. First, atomic functions slow down kernel execution. And second, it only allows to write a single value in an atomic operation, but we need to write two values: traveltime and predecessor. To solve this last problem it is possible to implement and use mutexes to do all the actualizations avoiding the race conditions, but this only slows down even more the kernel execution, as mutexes are usually implemented using atomic functions.
Another detail to take into account is that during the traveltime and predecesor actualization we have been setting the auxiliary arrays and instead of the arrays and . The reason is that in the current kernel iteration other threads are still using the current values of and and we only want to use the new values in the next invocation of this kernel for consistency.
Shortest path ray tracing on parallel GPU devices |