Storing ML Models in TigerGraph from the Spark-TigerGraph Integration

Maatdeamon · July 25, 2020, 7:37pm

In the Following Webinar and several others before that (Bad phone call detection use case), Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI - YouTube ( Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI), It is receptively mentioned that models can be stored back in Tigergraph for real time prediction.

Do you have example of how this is done ? An example of a model function or whatever, trained in Spark, then written in TigerGraph, with the configuration file for the parameters, as explained in the several webinar but not shown.

I am really curious to get a sense of how this is actually done.

Many thanks

Bruno · July 27, 2020, 8:56am

Hey @Maatdeamon,

there is a nice demo of in database ML in our Github ecosystem:

Also, you can find Graph Gurus 19 (deep learning) script here:

They are not based on Spark, but I think you will get the idea how to use ML with TigerGraph.

Best,
Bruno

Maatdeamon · July 27, 2020, 9:02am

@Bruno Thank you very much for the Links.

I have seen the demo with in graph ML and the issue here is indeed it does not address the specific of our infrastructure, which rely on Spark ML.

I have not seen the deep learning one.

But ultimately, what i am specifically after is: how as claimed in the spark tiger graph integration webinar, can you store the parameter of your model as configuration file, and then load it in a query to do real time prediction. I did not see it in the in Graph ML demo.

It is just that specific claim that i would like to see in action, or have an explanation of here.

Will check the deep learning one in the mean time.

Bruno · July 27, 2020, 9:32am

Another resource around TigerGraph and machine learning:
https://colab.research.google.com/drive/1-Dgl804R_4wEmH4efggCEkhcx6ATGHAU

I know that @Richard_Henderson is building ML with Python - maybe he can jump in here when he has time.

Bruno

Mingxi_Wu · July 29, 2020, 5:54am

@Maatdeamon

You can use LOADACCUM to load parameter files into a global accumulator. And the loaded global accumulator can invoke an expression function to do the scoring.

LOADACCUM doc is here https://docs.tigergraph.com/dev/gsql-ref/querying/declaration-and-assignment-statements#loadaccum-statement
User-defined expression function doc is here https://docs.tigergraph.com/dev/gsql-ref/querying/operators-functions-and-expressions#user-defined-functions

E.g. below is a query we used to do scoring. TigerGraph is used to collect graph feature, and we got the trained classifier parameter from external ML tools, and load them into global accumulators, and for a new phone, we collect its graph features, and called the user-define scoring function with the classifier parameters and the newly collected features.

CREATE QUERY scoringMethod(vertex phoneId, string inputPath = “/tmp/lgweight.configure”, string internalClusterFileName = “/tmp/kMeanWeight_internal.csv”, string externalClusterFileName = “/tmp/kMeanWeight_external.csv”, bool reloadWeightInfo = false, string configFileName = “/tmp/collectFeatures.config”, bool reloadConfig = false, int numOfTopFriends = 5, float abnormalThreshold = 0.5, float advertisementThreshold = 0.5, bool usePrint = true) for graph testGraph returns (ListAccum)
{
TYPEDEF tuple<float weight, float scale, float mean> ParamInfo;
ListAccum @@featureList;
ListAccum @@scoreList;
int isExternal = 0;
int fraudFlag = -1;
static ListAccum @@paramInfo0;
static ListAccum @@paramInfo1;
static ListAccum @@paramInfo2;
static ListAccum @@paramInfo3;
static ListAccum<ListAccum > @@InternalCluster;
static ListAccum<ListAccum > @@ExternalCluster;

if (@@paramInfo0.size() == 0 or reloadWeightInfo == true) {
@@paramInfo0.clear();
@@paramInfo0 = {loadAccum(inputPath, $0,$1,$2,",", false)};
@@paramInfo1.clear();
@@paramInfo1 = {loadAccum(inputPath, $3,$4,$5,",", false)};
@@paramInfo2.clear();
@@paramInfo2 = {loadAccum(inputPath, $6,$7,$8,",", false)};
@@paramInfo3.clear();
@@paramInfo3 = {loadAccum(inputPath, $9,$10,$11,",", false)};
}
if (@@InternalCluster.size() == 0 or reloadWeightInfo == true) {
@@InternalCluster.clear();
@@InternalCluster = loadClusterInfo(internalClusterFileName);
@@ExternalCluster.clear();
@@ExternalCluster = loadClusterInfo(externalClusterFileName);
}
//call another gsql query to collect features
@@featureList = collectFeaturesR(phoneId, configFileName, reloadConfig, numOfTopFriends, false);
//obtain the last element of featureList which indicates isExternal
isExternal = @@featureList.get(69);
//scoring by calling an user defined expression function.
score(phoneId,
@@featureList,
abnormalThreshold,
advertisementThreshold,
@@paramInfo0,
@@paramInfo1,
@@paramInfo2,
@@paramInfo3,
@@scoreList);
//set the initial scamCluster as -1, which means not a valid scam
@@scoreList += -1;
//Get the fraud flag out
fraudFlag = @@scoreList.get(0);
if (fraudFlag == 3) {
if (isExternal == 0) {
updateScamClusterFlag(@@featureList, @@InternalCluster, @@scoreList);
} else {
updateScamClusterFlag(@@featureList, @@ExternalCluster, @@scoreList);
}
}

if (usePrint) {
print @@scoreList;
}
return @@scoreList;
}

Richard_Henderson · July 30, 2020, 8:31am

It;s worth noting the technique here.

Static accumulators are used as a cache, and retain their contents across invocations.
The weights here are being held in a file, the scoring routine itself is expressed as a user-defined query.

The weights here are loaded as a file, but could just as easily be held in the database and loaded from there.

Ref my current ML work, I’m looking at how we can integrate catboost into TG and load trained models into that. This may take a week or two to get working.