ACCUM to find unique vertexes?

markmegerian · April 12, 2020, 3:33pm

I think this should be pretty easy, I am looking for the fastest solution.

Lets say you have millions of cell phone customers

Customer – hasnumber --> CellPhoneNum --inareacode–> AreaCode

and you have reverse edges on both “hasnumber” and “inareacode”. Many customers have multiple cell phones.
If I want to total up customers by area code, and I use:

SumAccum @customercount;
A1 = {AreaCode.*};
SELECT a FROM A1:a -(>:e1) - CellPhoneNum:cpn - (>:e2) - Customer:c
ACCUM a.@customercount += 1;

PRINT A1[A1.areaCodeVal, A1.@customercount];

I will get the wrong numbers because I would actually count customers multiple times if they have multiple cell phones. What’s the best way to just ACCUM unique customers for each area code?

Jon_Herke · April 13, 2020, 3:25pm

@markmegerian

You could mark if you’ve visited that vertex with a “True” & “False”

You would default to not visited.
Then mark it as visited.
Then not count it next time if it’s already been visited.

CREATE QUERY test(/* Parameters here */) FOR GRAPH MyGraph { 
  SumAccum @customercount;
  OrAccum @visited;

  A1 = {AreaCode.*};
  SELECT a FROM A1:a -(>:e1) - CellPhoneNum:cpn - (>:e2) - Customer:c
	WHERE c.@visted == false
    ACCUM a.@customercount += 1,
	      c.@visted = true;

  PRINT A1[A1.areaCodeVal, A1.@customercount];
}

markmegerian · April 13, 2020, 3:42pm

Hey Jon - Should the @visited accum be on the Customer?

You are showing it on the AreaCode vertex.

Jon_Herke · April 13, 2020, 4:39pm

@markmegerian you are correct (my apologies!)

markmegerian · April 13, 2020, 7:58pm

Jon - just tried this and it did not work as expected. It still totaled up multiple repeats. Is there an issue with timing of it? That is, can I not rely upon using selection in the WHERE clause that depends on the value being updated in the ACCUM?

My simple test seems to indicate that my totals are exactly what they were before, without the extra WHERE clause.

Jon_Herke · April 13, 2020, 11:29pm

@markmegerian You can use this. I think the marking False and True would be the “optimized” version and then run a count… but this way will give you the results you’re looking for.

CREATE QUERY multiHop() FOR GRAPH MyGraph SYNTAX v2{
    SetAccum<STRING> @areaCnt;
    A1 =  {AreaCode.*};

    S1 = SELECT a FROM A1:a -(>:e1) - CellPhoneNum:cpn - (>:e2) - Customer:c
	     ACCUM a.@areaCnt += c.id;

    PRINT S1[S1.@areaCnt.size() AS Customer_Count];
}

Jon_Herke · April 13, 2020, 11:29pm

Ran something similar to test
**

**

markmegerian · April 13, 2020, 11:50pm

@Jon_Herke thanks for the help. My concern about using a SetAccum is that I am literally accumulating millions of unique vertexes in some cases, and I want to avoid actually creating a set that large just so I can count them up. I feel like you were on the right track with the first suggestion but I couldn’t get it to work. Any other suggestions? Is there anything that we could do with a POST ACCUM to make the first suggestion work with the @visited flag?

Jon_Herke · April 14, 2020, 1:09am

Can you use this? I just wrote up a similar query in my graph:

CREATE QUERY useMap(/* Parameters here */) FOR GRAPH MyGraph SYNTAX v2{
	/***************************************
	* Grab a Unique Count of those Patinets 
    * located in South Korean Cities
	***************************************/
    MapAccum<VERTEX, INT> @@intMapAccum;
	OrAccum @visted;

    C1 = {City.*};
	S1 = SELECT v1 FROM C1:v1-(TRAVEL_EVENT_IN:e1)-TravelEvent:v2-(PATIENT_TRAVELED:e2)-Patient:v3
	               WHERE v3.@visted == false
	               ACCUM @@intMapAccum += (v1 -> 1)
	               POST-ACCUM v3.@visted = true;

  PRINT @@intMapAccum As Patient_Count_In_City; 
}

Victor_Lee · April 17, 2020, 6:23am

CREATE QUERY patientsByCityTravel( ) FOR GRAPH MyGraph SYNTAX v2 {
	/***************************************
	* Get the number of patients located in South Korean Cities
	* Returns either number of patients (default) or patient Ids (if getPatientIds = true)
	***************************************/
      MapAccum<VERTEX<City>, SetAccum<STRING>> @@patientsByCityMap;


     C1 = {City.*};
     S1 = SELECT v1 FROM C1:v1-(TRAVEL_EVENT_IN:e1)-TravelEvent:v2-(PATIENT_TRAVELED:e2)-Patient:v3
	               ACCUM @@patientsByCityMap += (v1 -> v3.patient_id);

	IF getPatientIds THEN
	  PRINT @@patientsByCityMap;
	ELSE
    FOREACH (city, patientSet) IN  @@patientsByCityMap DO
	    PRINT city, patientSet.size(); 
	  END;
	END;
}

Richard_Henderson · April 17, 2020, 9:00am

CREATE QUERY countCustomersByArea() FOR GRAPH MyGraph SYNTAX V2 { 
  
  SetAccum <VERTEX<Customer>> @uniqueCustomers;
  
  AllCustomers = {Customer.*};
  AllAreas = SELECT a FROM AllCustomers:c-()-CellPhoneNum-()-AreaCode:a
  ACCUM
    a.@uniqueCustomers += c;
       
  PRINT AllAreas[AllAreas.val, AllAreas.@uniqueCustomers.size()]; 
}

markmegerian · September 1, 2020, 3:13am

I have been encountering plenty of opportunities to use this type of logic. It works, but I am interested in other alternatives.

Let’s say we aren’t concerned which of the customers cell phones we use if they have more than one, we just want to be sure to only count each customer once

would this achieve the desired result?

AllAreas = SELECT a FROM AllCustomers:c-()-CellPhoneNum-()-AreaCode:a
 SAMPLE 1 EDGE WHEN c.outdegree() >= 1   
  ACCUM a.@customerCount += 1;

Richard_Henderson · September 3, 2020, 7:27am

Should work, yes, though I haven’t personally tried the sample clause in a compound path.
Let me know if it doesn’t work as expected.