Hi, I want to know more about the execution of the distributed query such that I can tell whether I write the gsql queries in the best way. The material I can find is only Distributed Query Mode, but I don’t fully understand these two sentences:
The query executes in parallel across all the machines which have source vertex data for a given hop in the query. That is, each SELECT statement defines a 1-hop traversal from a set of source vertices to a set of target vertices.
E.g., does 1-hop traversal
here include both 1-hop query
and 1-hop pattern matching
? For 1-hop patten matching a -(<link:e)- b
that looks for in-neighbors, is a
or b
considered as the soruce vertex
?
Here I want to understand the execution of two queries below, and I am not sure if my understandings (listed below) are correct:
- Subquery
triangle
is invoked insidebatch_triangle
, which is similar to a function is invoked inside another function in programming languages like C/C++, wherebatch_triangle
will finish the current invocation oftriangle
before it makes the next invocation. - When executing
batch_triangle
, due to the distributed keyword, TigerGraph launches one query instance ofbatch_triangle
on each machine (supposed replication factor is 1), each instance accesses the local partition of the graph GG, finds the target edges withe.batch_id == batch_id
and at the end invokestriangle
with each target edge. - For the first
select
intriangle
, the computation of the query instance will not be moved as the instance is running on the machine that stores the vertexa
. We get the neighbors ofa
in the firstselect
. - For the second
select
intriangle
, as it starts from the neighbors ofa
and these neighbors may be located in different machines, the query instance will create multiple instances where each instance is created on each of the machines who store the neighbors to process the secondselect
. The newly created instances run in parallel and return the result individually. The returnedS.size()
of the each new instance will be correctly accumulated by the original query instance.
use graph GG;
create query triangle(vertex<node> a, vertex<node> b) for graph GG returns (int) {
A = {a};
S = select s
from A:i -(link:e)-> node:s;
S = select s
from S:s -(link:e)-> node:i
where i.vid == b.vid;
return S.size();
}
create distributed query batch_triangle(int batch_id) for graph GG {
ListAccum<vertex<node>> @neighbors;
SumAccum<INT> @@n_matchings;
start = {node.*};
S = select s
from start:s -(link:e)-> node:n
where e.batch_id == batch_id
accum s.@neighbors += n
post-accum
foreach i in s.@neighbors do
@@n_matchings += triangle(s, i)
end;
print batch_id, @@n_matchings;
}