Edge induced selection speed

Jon_Herke · April 8, 2020, 5:29pm

Let say we want to get all pair of Vertex connected by a directed “knows” relationship? s->know:e->t

How fast is that kind of query ? Does tiger Graph scan all node in search of those who have that directed edge ? Is there some form of edge index ? I’m trying to get a sense of how TigerGraph handle edge traversal especially when the number of edges is important.

Vertex have index based on the primary key, but nothing is said about edges. Hence i’m curious as to how a edge induced selection works: s->know:e->t . Is parallelism involved here ?

Jon_Herke · April 8, 2020, 5:29pm

Hi Maatary,

Edge traversal in TigerGraph is done efficiently with parallelism and the average traversal speed for edges can up to tens of million edges per second.

To implement a directed “knows” relationship among all pair of Vertex connected in vertex set Seed of Person type, you need to do the following

//define a boolean flag to mark all vertices belong to the Seed set

OrAccum @visited = false;

//mark all elements in Seed using a boolean flag visited

X = Select src

  From Seed:src

  Accum src.@visited = true;

Y = select src

from Seed :src ->know:e-> Person :tgt

Where tgt.@visited == true

Accum … //do anything on these edges

The traversal is done for all “know” type edges starting from Seed and all invalid edges that are not among the Seed vertex set are filtered out by target boolean flag filters.

Please let me know if you have any questions.

Best Wishes,

Dan

Jon_Herke · April 8, 2020, 5:30pm

Dan,

thank you so much. I guess I have strong background in Sparql which is confusing me a little here. I know DataStacks and JanusGraph at intermediate level. Familiar with LSM-tree and Cassandra too. Hence, it is from both perspective that I come.

I did not think about writing the query that way. Thank you. I think this will help a lot moving forward.

This being said.

How do I return the connected pair, as i would do in sparql

Select X, Y

where {X :knows Y}

That is how do i project the pair ?

Can you elaborate on what exactly is being parallelized here ? Even better what part trigger the parallelization ?
Wether in parallel or not (as i assume each src is processed in parallel but not sure) does the statement, from Seed :src ->know:e-> Person :tgt , corresponds to some form of LSM-tree or whatever scan, at least partially ?

I hope you can get me to understand a bit what’s going on under the hood without having to reveal your secret sauce

At least i want to understand to get a sense of what set tiger graph apart from what you call second generation GraphDB from that stand point.

Full disclosure: My company is looking into evaluating TigerGraph. We deal with Lifescience Data.

Jon_Herke · April 8, 2020, 5:30pm

Hi Maatary,

return the connected pair
//define a boolean flag to mark all vertices belong to the Seed set
OrAccum @visited = false;
//mark all elements in Seed using a boolean flag visited
X = Select src
From Seed:src
Accum src.@visited = true;

Y = select src
from Seed :src ->know:e-> Person :tgt
Where tgt.@visited == true
Accum … //do anything on these edges

Vertex Set Y is actually the subset of Seed where there is at least one “know” edge connecting with another vertex in Seed set (the vertices of the connected pairs) and one can output the vertices of the Set Y by, i.e. “Print Y;”

If you want to output the edge list that connecting all vertices inside Seed, you can also use a ListAcuum to store all these edges, i.e.
//define a boolean flag to mark all vertices belong to the Seed set
OrAccum @visited = false;
ListAccum @@innerEdgeList;

//mark all elements in Seed using a boolean flag visited
X = Select src
From Seed:src
Accum src.@visited = true;

Y = select src
from Seed :src ->know:e-> Person :tgt
Where tgt.@visited == true
Accum @@innerEdgeList += e; //store all such edges into the global variable

Print @@innerEdgeList; //output the subgraph of connected vertices in Seed.

& 3) parallelization in TigerGraph: Edges stored in TigerGraph is batched by its source vertices and edge traversal parallelization is done by handling these batches in parallel. Conceptually, every edge in Seed :src ->know:e-> Person :tgt is executed in parallel and independently with each other so that you can only access the current edge info in the Accum block. However, global accumulators as well as local accumulators from previous steps can be used to record some information about other vertex or edge.

Best Wishes,
Dan

Jon_Herke · April 8, 2020, 5:30pm

Dan,

thanks.

A) Based on your late answer please find below my questions:

Y = select src

From Seed:src ->knows:e-> Person:tgt

Print Y

I don’t really understand why do I need the visited flag ? what does it add. The Edge induce vertext set should be enough. It supposed to find all entities that knows a person.

2&3)

What means " Edges stored in TigerGraph is batched by its source vertices" ? I don’t get the batch word here.
In a sense, I’m curious as to the retrieval process from disk, in case some part of the graph are in disk. For Vertex we have indexes with O(1), thanks to the primary key. But for Edges there is no info about an index for them. Hence that would require to scan all node in the graph to figure out those who have the edge type (assuming you are using adjancy list or something similar) ? I don’t if that’s what’s happening. This sounds very inefficiency to me.

Jon_Herke · April 8, 2020, 5:30pm

Hi Maatary,

A) Based on your late answer please find below my questions:

Y = select src

From Seed:src ->knows:e-> Person:tgt

Print Y

I don’t really understand why do I need the visited flag ? what does it add. The Edge induce vertext set should be enough. It supposed to find all entities that knows a person.

"From Seed:src ->knows:e-> Person:tgt" means traverse all edges starting from Seed set and with “knows” type edge, where it also includes edges connecting Seed to other Person vertices that does not belongs to the set Seed. I assume that you are interested in “knows” type edges among vertices of Seed but not the edges from vertices of Seed to another vertex that does not belong to set Seed.

2&3)

What means " Edges stored in TigerGraph is batched by its source vertices" ? I don’t get the batch word here.
In a sense, I’m curious as to the retrieval process from disk, in case some part of the graph are in disk. For Vertex we have indexes with O(1), thanks to the primary key. But for Edges there is no info about an index for them. Hence that would require to scan all node in the graph to figure out those who have the edge type (assuming you are using adjancy list or something similar) ? I don’t if that’s what’s happening. This sounds very inefficiency to me.

Batch means that we group vertices in batches and each batch will have all their outgoing edges (adjacent list) , in this way, we only need to know the source vertex set and edge type, then we could quickly find out all its outgoing edges for these source vertices. In tigergraph, by default, all data are stored in memory. However, we also have configuration to store partial graph data on the disk.

Best Wishes,

Dan

Jon_Herke · April 8, 2020, 5:31pm

Dan,

Based on the following statement taken from your e-book:
"
TigerGraph isn’t described as an in-memory database, because having data in memory is a preference but not a requirement. Users can set parameters that specify how much of the available memory may be used for holding the graph. If the full graph does not fit in memory, then the excess is stored on disk. Best performance is achieved when the full graph fits in memory, of course.
"

I m assuming that wether on disk or in memory some form of scanning must happen, and that you store data such that you can perform a scan on file, or in memory. My understanding being that in-memory database, would not incur the cost of transforming data in a format that they can be scanned on file, as the scan would only happen in memory, using some form of LRU like caching to evict, least used object when necessary that is when it does not fit in memory.

So as a follow up to my question:

A) How does the scan happen ? is it a parallel search ? is it an ordered list, a binary tree or something like ? I don’t need to know the exact thing, but maybe just an indication that you have a structured that makes searching quite fast, by pruning aggressively, rather than going over the all list. Also if it is done in parallel, would be nice to know.

B) Can that scan be done in memory as much as on disk ?

C) What happens when we are in a clustered mode for that particular query ? What would support the efficiency of such a search.

Sorry for the may question. Just making sure to pick the right DB, seeking for the best performance for our use case.

Jon_Herke · April 8, 2020, 5:31pm

For instances,

in triples store for which the model is very close, as in a bit simplified, some of them makes massive use of indexes (comes with its own drawbacks e.g. load time): https://franz.com/agraph/support/documentation/current/triple-index.html#header2-8

POSGI Index

“To find unknown subjects that have a specific predicate and object value, AllegroGraph uses the posgi index. All triples that have the same predicate are located together in that index, and are then sorted by object value. This makes it very easy to locate the correct predicate/object triples, or even * ranges* of such triples.”

Granted here we are not dealing with triples but a graph structure. Although this was just to give a sense of what I was asking for. This king of information, can help tune query and understand better the performance of certain queries when using TigerGraph.

>> in this way, we only need to know the source vertex set and edge type, then we could quickly find out all its outgoing edges for these source vertices.

It is the statement " we could quickly find out all its outgoing edges for these source vertices" that is not clear to me. Each vertex can have multiple edge type associated to it. Hence selecting all the appropriate edge can mean scanning the all list for each vertex, or the vertex and adjaceny list could be represented in a way that the search prune and goes quickly like a Red black tree .

I imagine that given the pervasive parallelism, then each vertex is searched in parallel, nonetheless a scan of some sort must happen for each vertex and their adjacency list.

I hope this additional comment makes the question clearer and makes it easier to answer the question with the depth you see fit given the proprietary nature of your product.

Kind regards.

Jon_Herke · April 8, 2020, 5:31pm

Maatary,

Since you are interested in choosing the best performant graph DB, I would recommend to take a look of this benchmark

And you can reproduce all of them by following

https://github.com/tigergraph/ecosys/tree/benchmark/benchmark

WR,

Mingxi