NLP Workloads Using pyTigerGraph?

vamossagar12 · August 16, 2020, 3:15pm

We are using enterprise TG but had a few questions on migrating one of our NLP-based use cases onto TigerGraph.

We have some entity resolution algorithms which rely on NLP constructs like Lemmatization/edit distance etc. Now, from what I have understood, there are 2 ways to do Data Science on TG:

1) In-Graph ML: This I believe is the ideal way even for our use case. One of the problems I am having is how to plug the NLP constructs into GSQL. I think we can write UDFs, but I am not sure about the C++ libraries for NLP. I think stitching together all the steps into a GSQL query might still be challenging. Wanted to know if anyone has tried it here and can provide any pointers.
2) Using pyTigerGraph: This is something I discovered recently. It’s very easy for me to use this option because my entire code base is in Python and I just need to tweak the relevant bits to pull data from TG using the functions provided, and the rest would work as is. I have seen a few Graph Gurus videos where people are doing Data Science using this technique. I had a couple of questions in this regard though:

2.1) Can it be used for medium/large size graph workloads? I ask this because our graph can grow up to the order of billions of edges and hundreds of millions of nodes. Even though the data footprint on TG is on the order of GBs, I wanted to understand how it would pan out for larger graphs. An issue has been raised in the GitHub repo on this: "Ability to load medium/large graphs" (Issue #7, pyTigerGraph/pyTigerGraph).
2.2) Is there a way to do some kind of pagination when the results from the graph can’t be accommodated in memory? I think it hits restpp underneath, so does restpp provide such functionality?

Parker_Erickson · August 17, 2020, 3:39pm

Hey @vamossagar12! I am glad that you discovered pyTigerGraph. Full disclaimer - I originally created pyTigerGraph, so I might be a little biased in my answer.

Honestly, I think that UDFs are a pain. pyTigerGraph is nice, because as you mentioned, your workflow is already Python based.

2.1) I believe that the latest update may have fixed some of the issues for loading large graphs in the github issue you mentioned, although I have not tested this yet.

2.2) Unfortunately there is not any kind of pagination built in, as pyTigerGraph does rely on the restpp endpoint, and I do not believe that this is supported in the restpp service.

What is your data workflow like? From your description above, it sounds like there are batch jobs that perform the NLP and data ingestion into TigerGraph. You could create queries that extract the entities from the graph, perform your NLP, and then update those vertices in the graph relatively easily in pyTigerGraph, and that would allow you to 1) distribute the NLP across a cluster of machines and 2) remove the memory/pagination/efficiency issues that may exist by processing each entity individually or in a small batch. You could also consider using Kafka streams into TigerGraph and integrate your NLP pipeline in that streaming process (although I have never done this personally).
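
The extract / NLP / update loop described above can be sketched roughly as follows. This is only a minimal sketch: `fetch`, `transform` and `write` are hypothetical callables (not pyTigerGraph functions) standing in for "run a query that returns these entities", "apply the NLP step" and "upsert the results", and the batch size is an arbitrary assumption.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_in_batches(
    entity_ids: Iterable[str],
    fetch: Callable[[List[str]], List[Dict]],   # hypothetical: pull these entities from the graph
    transform: Callable[[Dict], Dict],          # hypothetical: the NLP step for one entity
    write: Callable[[List[Dict]], None],        # hypothetical: push updated entities back
    batch_size: int = 1000,
) -> int:
    """Process entities batch by batch so only one batch is in memory at a time."""
    processed = 0
    for ids in batched(entity_ids, batch_size):
        records = fetch(ids)
        updated = [transform(record) for record in records]
        write(updated)
        processed += len(updated)
    return processed
```

Because each batch is independent, the same loop could be run on several machines over disjoint slices of the ID list.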

Victor_Lee · August 17, 2020, 7:17pm

2.1) Using TigerGraph directly, graph size shouldn’t be an issue. We have customers running clusters with 100s of billions of edges. I guess there are some scale issues being resolved with pyTigerGraph.

vamossagar12 · August 18, 2020, 3:17am

Thanks @Parker_Erickson. I am aware that you are the creator of pyTigerGraph!
Yeah, UDFs for this kind of workload seem difficult for us to pull off, at least right now.

Coming to the NLP workflow, for this use case, I can extract entities in micro-batches and run the NLP functions on those batches. I wanted to know, in general: if I need to load a huge graph, do I pull everything or try to break it into chunks?

In memory on TG, we might see memory spikes when selecting a huge number of vertices/edges but that’s still within the context of the query. But with pyTigerGraph, there could potentially be a cost of delivering those json payloads to the client. That was the only reason I thought of asking this.

Anyway, I will give it a try and post back here. Thanks for the response!

vamossagar12 · August 18, 2020, 3:18am

Yes @Victor_Lee, my doubt was mainly around pyTG. We have been using TigerGraph with billions of edges directly via queries without much hassle.

Parker_Erickson · August 18, 2020, 4:45pm

Yeah, I have a feeling that there would be limitations on the JSON payload back to the client. I have done some larger extractions (~50,000 vertices) in the past but it is slow. I don’t know if @Victor_Lee or @Jon_Herke know if there would be a better/more direct way to get large amounts of data out of the graph and into python. I know that there is a Spark integration with TigerGraph, and wonder how that is implemented/what the bandwidth is like.

Eventually, I would like to have a CI/CD pipeline set up with pyTigerGraph that can run some unit tests, but also efficiency benchmarks, just to get a more defined feeling for what the package’s limitations are.

It is always cool to hear from people that are using pyTigerGraph! Always feel free to open issues or pull requests on the repo - it is meant to be a community tool.

Parker

vamossagar12 · August 19, 2020, 5:13am

Hey @Parker_Erickson, yeah, that has to be a problem, I believe. I was checking the JDBC connector for TG and it lists a limitation of RESTPP:

https://github.com/tigergraph/ecosys/tree/master/tools/etl/tg-jdbc-driver#limitation-of-resultset which is 2 GB. I mean, that’s a huge payload, but the point is that we will hit the bottleneck at some point.

vamossagar12 · August 19, 2020, 5:15am

BTW, @Parker_Erickson, I see that your library uses the requests library underneath. Did you try enabling the stream mode to see if it helps?
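
For reference, a minimal sketch of what stream mode looks like with requests, assuming the goal is to get the raw response onto disk without holding it all in memory. The function names, the URL and headers you would pass, the timeout and the chunk size are all illustrative assumptions; nothing here is part of pyTigerGraph. Note that streaming only helps with the transfer: calling `json.loads` on the saved body would still need the whole document in memory.

```python
from typing import Iterable, Optional


def save_chunks(chunks: Iterable[bytes], path: str) -> int:
    """Write byte chunks to `path` as they arrive; return total bytes written."""
    total = 0
    with open(path, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fh.write(chunk)
                total += len(chunk)
    return total


def download_to_file(
    url: str,
    path: str,
    headers: Optional[dict] = None,
    chunk_size: int = 1 << 20,  # 1 MiB per chunk, an arbitrary choice
) -> int:
    """Stream an HTTP response body to disk instead of buffering it in memory."""
    import requests  # imported lazily so save_chunks works without requests installed

    with requests.get(url, headers=headers, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        return save_chunks(resp.iter_content(chunk_size=chunk_size), path)
```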

Parker_Erickson · August 19, 2020, 1:35pm

We do use the requests library, but I did not realize that there was a stream parameter. Will definitely take a look at it. Thanks for the tip!

vamossagar12 · August 19, 2020, 3:02pm

Cool. I would be interested to know the findings…

vamossagar12 · October 20, 2020, 9:45am

Hey @Parker_Erickson, circling back on this thread… I had a couple of scenarios where the payload being returned was around a GB and parsing it kept failing with a JSON decode error.
I did try the stream option separately via requests and got the same behaviour.
Are there any workarounds that you are aware of for this?

One option is to write the output to file via the query and then download the file. Anything else that has been tried at your end?
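
If the file route is taken, the downloaded file can be read back lazily. A minimal sketch, assuming the query wrote its output as CSV with a header row (the file layout and the batch size are assumptions, not something confirmed in this thread):

```python
import csv
from itertools import islice
from typing import Dict, Iterator, List


def read_csv_batches(path: str, batch_size: int = 10_000) -> Iterator[List[Dict[str, str]]]:
    """Yield rows from a CSV file in lists of at most `batch_size` rows."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)  # first line is treated as the header
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                return
            yield batch
```

Only one batch of rows is held in memory at a time, so the file size no longer has to fit in the client's RAM.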

Parker_Erickson · October 20, 2020, 2:28pm

@vamossagar12 If you can replicate it via requests alone, then I have a feeling that this is an error in Python’s JSON library. I am not positive though.

vamossagar12 · October 20, 2020, 3:51pm

Yeah… Let me try to add some batching mechanism to the queries. Anyway, I don’t think it’s a good idea to send back JSON payloads that are more than 1 GB in size.
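
One way the batching could look from the client side is sketched below. It assumes the query itself has been written to accept an offset and a limit and to return results in a stable order; `run_query` is a hypothetical callable (for example a thin wrapper around an installed-query call), not an existing pyTigerGraph function.

```python
from typing import Callable, Iterator, List


def paginate(
    run_query: Callable[[int, int], List[dict]],  # hypothetical: (offset, limit) -> one page
    page_size: int = 50_000,
) -> Iterator[List[dict]]:
    """Yield pages of results until the query returns an empty or short page."""
    offset = 0
    while True:
        page = run_query(offset, page_size)
        if not page:
            return
        yield page
        if len(page) < page_size:  # a short page means there is nothing left
            return
        offset += len(page)
```

Each response stays small enough to parse, at the cost of running the query once per page.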