Is there a concept of index in TigerGraph?

Hello,

I have been reading the documentation and the GSQL 101 tutorial, and I was wondering how TigerGraph deals with queries such as:

SELECT * FROM person where age > 20

Is there any kind of indexing on the age property, or are all person nodes scanned?

If not, what is the best strategy to speed up this type of initial filtering, in particular for parameterized queries?
For instance, using the GSQL 101 code, we could imagine the following use case:

USE GRAPH social CREATE QUERY hello2 (INT minAge) FOR GRAPH social{
    OrAccum @visited = false;
    AvgAccum @@avgAge;
    Start = {person.* where age > minAge};  // <-- this line is invalid, but I can't find in the documentation a proper way to define such a subset
    FirstNeighbors = SELECT tgt
            FROM Start:s -(friendship:e)-> person:tgt
            ACCUM tgt.@visited += true, s.@visited += true;
    SecondNeighbors = SELECT tgt
            FROM FirstNeighbors -(:e)-> :tgt
            WHERE tgt.@visited == false
            POST_ACCUM @@avgAge += tgt.age;
PRINT SecondNeighbors;
PRINT @@avgAge;
}

Hi Thibault,

The command “SELECT * FROM person where age > 20” does not use any index to search the person vertices, but internally it runs in parallel to speed up the search.

In a GSQL query, you can use a vertex block with a WHERE condition to do the filtering, which also runs in parallel over all person vertices, i.e.

CREATE QUERY hello2 (INT minAge) FOR GRAPH social{ 
    OrAccum @visited = false; 
    AvgAccum @@avgAge; 
    //Start ={person.* where age > minAge};  <-- this line is invalid but from the documentation I can't find a proper way to perform such sub-set definition
    Start = {person.*};
    // Use a vertex block to filter the vertices; it runs in parallel.
    Start = SELECT src
            FROM Start:src
            WHERE src.age > minAge;

    FirstNeighbors = SELECT tgt 
            FROM Start:s -(friendship:e)-> person:tgt 
            ACCUM tgt.@visited += true, s.@visited += true; 
    SecondNeighbors = SELECT tgt 
            FROM FirstNeighbors -(:e)-> :tgt 
            WHERE tgt.@visited == false 
            POST_ACCUM @@avgAge += tgt.age; 
PRINT SecondNeighbors; 
PRINT @@avgAge; 
}

Our current VertexBlock is already fast and can process tens of millions of vertices per second, so you usually do not need an index to speed up the query.

Currently, if you really want index functionality in GSQL, you have to construct your own index vertices and edges. In your case, you could create a new vertex type called “age” that represents every possible age of a person, plus an additional edge type “person_age” connecting each “age” index vertex to the matching “person” vertices. The GSQL query can then start from all valid “age” vertices and use an edge block to find all valid persons. This technique performs better than the VertexBlock filtering above if you have billions of person vertices.
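
To make the difference between the two approaches concrete, here is a minimal sketch in plain Python (not GSQL, and not how TigerGraph is implemented). The person data and the function names are made up for illustration. A dictionary keyed by age stands in for the “age” index vertices, and each set of ids stands in for the “person_age” edges of one index vertex.

```python
from collections import defaultdict

# Hypothetical toy data: person id -> age.
persons = {"Tom": 40, "Dan": 34, "Jenny": 25, "Kevin": 28, "Amily": 22, "Nancy": 20}

def scan_filter(persons, min_age):
    """Vertex-block style: visit every person and test the condition."""
    return {pid for pid, age in persons.items() if age > min_age}

def build_age_index(persons):
    """One 'age' index vertex per distinct age; each set models its 'person_age' edges."""
    index = defaultdict(set)
    for pid, age in persons.items():
        index[age].add(pid)
    return index

def index_filter(age_index, min_age):
    """Edge-block style: filter the few 'age' vertices, then follow their edges."""
    result = set()
    for age, members in age_index.items():
        if age > min_age:
            result |= members
    return result

age_index = build_age_index(persons)
print(sorted(index_filter(age_index, 25)))  # ['Dan', 'Kevin', 'Tom']
```

The indexed lookup only tests the condition once per distinct age instead of once per person, which is where the gain comes from when there are far more persons than ages.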

Best Wishes,
Dan

Thank you for your quick and detailed answer Dan.

Regarding your solution using a dedicated node for a particular value, it may lead to a ‘dense node’ (one with very many edges), which in my experience triggers many issues in most graph databases. Given the optimizations you mention, I will definitely benchmark TigerGraph for this case. The performance looks very promising so far!

Hi Thibault,

Yes, I agree that a single index node connected to too many edges will degrade performance. A workaround is to create multiple index nodes per age instead of one. For example, the loading job could use a token function that draws a random number in the range 1 to 100 and appends it to the age to form the primary_id of the age index node. Each age then has up to 100 index vertices, such as “20-1”, “20-2”, … “20-100” for age 20, all with the attribute age = 20. A single index node is thus replaced by a group of vertices with the same age. The query first selects the index vertices with the given age and then traverses from them to the person vertices.
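
As a rough illustration of this sharding idea, here is a small plain-Python model (again not GSQL). The names `build_sharded_index` and `lookup` are hypothetical, the seeded random generator stands in for the token function of the loading job, and the data is synthetic.

```python
import random

NUM_SHARDS = 100  # number of index vertices allowed per age, as in "20-1" .. "20-100"

def build_sharded_index(persons, num_shards=NUM_SHARDS, seed=0):
    """Attach each person to one of num_shards index vertices for its age."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    index = {}
    for pid, age in persons.items():
        shard = rng.randint(1, num_shards)
        key = f"{age}-{shard}"  # primary_id of the index vertex
        node = index.setdefault(key, {"age": age, "members": set()})
        node["members"].add(pid)
    return index

def lookup(index, min_age):
    """Select index vertices by their age attribute, then follow their edges."""
    result = set()
    for node in index.values():
        if node["age"] > min_age:
            result |= node["members"]
    return result

# Synthetic data: 3000 persons spread over the ages 20, 21 and 22.
persons = {f"p{i}": 20 + i % 3 for i in range(3000)}
index = build_sharded_index(persons)
largest = max(len(node["members"]) for node in index.values())
print(len(index), largest)  # number of index vertices, largest fan-out
```

With a single index vertex per age, each one would carry 1000 edges in this data set; after sharding, the fan-out of any one index vertex is much smaller, at the cost of selecting more index vertices first.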

Best Wishes,

Dan

Hi Thibault,

I am curious what your benchmark shows so far. What’s your take on the efficiency of the filtering?

Dan,

This problem seems to be referred to as a global index in JanusGraph or DataStax. Your documentation at https://docs.tigergraph.com/ui/graphstudio/explore-graph mentions the notion of a vertex attribute filter.

  1. Are attribute filter and global index both talking about the same problem?

  2. Is a global index a solution? What direction are you taking to address this issue? And when on the roadmap can we expect it?

Hi Maatary,

  1. Are attribute filter and global index both talking about the same problem?
    No, an attribute filter just scans the given set of vertices and applies the filter condition to drop the vertices that do not match. Currently, TigerGraph does not provide a built-in index feature. However, one could instead build index vertices and corresponding edges to implement an equivalent indexing feature, using an additional graph traversal from index nodes to valid vertices.

  2. Is a global index a solution? What direction are you taking to address this issue? And when on the roadmap can we expect it?
    A TigerGraph index is definitely on the roadmap. It is designed to automatically speed up attribute filtering when the attribute is declared as indexed in the graph schema, and all graph updates to that attribute will also update the index. For feature release questions, our product manager …@Victor Lee can help you.

Best Wishes,
Dan

Thanks for the answer.

Currently, TigerGraph does not provide a built-in index feature. However, one could instead build index vertices and corresponding edges to implement an equivalent indexing feature, using an additional graph traversal from index nodes to valid vertices.

I think I understand that. It seems to be based on the built-in primary_id indexing of TigerGraph. However, I just want to be sure. Can you provide a simple illustration based on the example in the original question?

If you could provide the GSQL code, that would be great, simply to put it in writing and make sure that everyone is on the same page.

Hi Maatary,

Currently, TigerGraph does not provide a built-in index feature. However, one could instead build index vertices and corresponding edges to implement an equivalent indexing feature, using an additional graph traversal from index nodes to valid vertices.

For example, suppose you have a huge number of transaction vertices and you are only interested in the transactions on specific days.

One way to do it is to activate all transaction nodes and filter out the transactions that are not on the given days, which is inefficient and slow when the number of transaction vertices is huge. Instead, one could create an index vertex type called “date” and connect each “date” vertex to its “transaction” vertices with a “transaction_date” edge.

// DateSet is the set of date vertices whose transactions you want to find
CREATE QUERY searchTrans(SET<VERTEX<date>> DateSet) FOR GRAPH poc_graph {
    X = SELECT tgt
        FROM DateSet:src -(transaction_date:e)-> transaction:tgt;
    PRINT X;
}

In this way, you get all transactions on the given dates; this is an example of using an index vertex to speed up the search. You also need a loading job to keep the “transaction_date” edges updated for any new transaction vertex.
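
The point about keeping the edges updated can be sketched in plain Python as follows. This is only a model of the idea, not GSQL and not a TigerGraph loading job; the class `TransactionGraph` and its methods are hypothetical names, and the dates are made-up sample values.

```python
from collections import defaultdict

class TransactionGraph:
    """Toy model: transaction vertices plus 'date' index vertices."""

    def __init__(self):
        self.transactions = {}                     # transaction id -> date
        self.transaction_date = defaultdict(set)   # date -> ids, models the edges

    def load_transaction(self, txn_id, date):
        """Insert or update a transaction and keep its index edge in sync."""
        old_date = self.transactions.get(txn_id)
        if old_date is not None:
            # Remove the stale edge so the index never points at a wrong date.
            self.transaction_date[old_date].discard(txn_id)
        self.transactions[txn_id] = date
        self.transaction_date[date].add(txn_id)

    def search_trans(self, date_set):
        """Start from the given date vertices and follow their edges."""
        result = set()
        for date in date_set:
            result |= self.transaction_date.get(date, set())
        return result

graph = TransactionGraph()
graph.load_transaction("t1", "2019-01-01")
graph.load_transaction("t2", "2019-01-02")
graph.load_transaction("t3", "2019-01-01")
print(sorted(graph.search_trans({"2019-01-01"})))  # ['t1', 't3']
```

If a load step wrote the transaction without touching `transaction_date`, the search would silently miss it, which is why the index edges have to be maintained together with the vertices.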

Best Wishes,

Dan