Number of edges doesn't add up

I am working on a netflow event dataset, a CSV file with the following columns:
epoch_time,duration,src_device,dst_device,protocol,src_port,dst_port,src_packets,dst_packets,src_bytes,dst_bytes

I created a “device” vertex type for src_device/dst_device.
I then created a directed edge from device to device, with the rest of the columns as attributes.

My test csv has 10k rows and I know there are 24 duplicates there, which gives me 9976 unique records.

I then loaded the CSV file and got 1471 vertices, which makes sense because the dataset has 1471 unique devices. What is surprising is that I only got 1490 edges. I expected to see at least 9976 edges, one for each unique connection record between devices. Below is my GSQL:

create vertex device (PRIMARY_ID id String)
create directed edge netflow (
    from device, to device,
    epoch_time UINT, duration UINT, protocol UINT, src_port String, dst_port String,
    src_packets UINT, dst_packets UINT, src_bytes UINT, dst_bytes UINT)

create loading job load_devices for graph host_event_netflow {
    define filename in_file;
    # $2 = src_device, $3 = dst_device; the remaining columns become edge attributes
    load in_file
        to vertex device values ($2),
        to vertex device values ($3),
        to edge netflow values ($2, $3, $0, $1, $4, $5, $6, $7, $8, $9, $10)
        using HEADER="false", SEPARATOR=",";
}


What did I do wrong?

Are you expecting to get multiple edges between a single pair of device vertices? If so, that won’t work. We only support a single edge between any one pair of vertices.

You could create a list in the edge to hold all of the events, though that has its own challenges.

You could try creating a vertex “event” and doing it that way, though primary keys can be tricky there.

Since there are most likely many communications between two devices (at different times, using different ports), each later communication overwrites the previous values of the netflow edge between the two. That is why you end up with 1490 edges: one per distinct (src_device, dst_device) pair, not one per record.
You probably want to create a netflow vertex type and create multiple instances of it between devices, each representing one communication session (i.e. reify the netflow).
An alternative solution would be to add a LIST attribute to netflow (storing each netflow’s details, probably in a UDT, i.e. a user-defined tuple), but that would probably be less efficient and more difficult to work with.
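To make the reification idea concrete, here is a minimal schema sketch. The type names connection_event, flow_out and flow_in are illustrative assumptions, not names from the original schema, and the sketch assumes the old netflow edge type is dropped first and the existing device vertex type is kept. How the new types get attached to an existing graph depends on your setup.

```gsql
# Hypothetical names: connection_event, flow_out, flow_in.
# One connection_event vertex per netflow record; it carries the
# attributes that used to live on the netflow edge.
CREATE VERTEX connection_event (
    PRIMARY_ID id STRING,
    epoch_time UINT, duration UINT, protocol UINT,
    src_port STRING, dst_port STRING,
    src_packets UINT, dst_packets UINT, src_bytes UINT, dst_bytes UINT)

# Source device -> event, and event -> destination device.
# Each (device, event) pair is connected at most once, so the
# single-edge-per-pair restriction no longer loses records.
CREATE DIRECTED EDGE flow_out (FROM device, TO connection_event)
CREATE DIRECTED EDGE flow_in (FROM connection_event, TO device)
```

With this shape, two devices that talk to each other many times are linked through many connection_event vertices, and no record overwrites another.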

You beat me to this :slight_smile:

OK, I understand it now. This seems like a big deficiency in the tool, as I would think this is a common thing in a graph (for example, on a map you can go from point A to point B via lots of routes). Or am I just showing my newbie-ness in graph databases? :sweat_smile:

In this case, other than storing things in a list of tuple attributes as Szilard has suggested, you can also consider adding one extra vertex type called connection_event.

The connection_event vertex sits between two devices, and you store all your current edge attributes on it. This way, after loading you should have 9976 connection_event vertices.
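A loading job for this layout might look like the sketch below. It assumes a connection_event vertex type whose attributes follow the old edge's order, plus two hypothetical edge types, flow_out (device to connection_event) and flow_in (connection_event to device); none of these edge names come from the original schema. The vertex ID is built with the gsql_concat token function from every column of the row, which is one possible answer to the primary-key question: identical rows map to the same ID, so the 24 duplicate rows collapse and 9976 vertices remain.

```gsql
# Assumed schema (names are illustrative):
#   connection_event(PRIMARY_ID id STRING, epoch_time, duration, protocol,
#                    src_port, dst_port, src_packets, dst_packets,
#                    src_bytes, dst_bytes)
#   flow_out: device -> connection_event
#   flow_in:  connection_event -> device
CREATE LOADING JOB load_connection_events FOR GRAPH host_event_netflow {
    DEFINE FILENAME in_file;
    LOAD in_file
        TO VERTEX device VALUES ($2),
        TO VERTEX device VALUES ($3),
        # ID = all columns joined with "|", so each distinct row gets its own vertex
        TO VERTEX connection_event VALUES (
            gsql_concat($0, "|", $1, "|", $2, "|", $3, "|", $4, "|", $5, "|",
                        $6, "|", $7, "|", $8, "|", $9, "|", $10),
            $0, $1, $4, $5, $6, $7, $8, $9, $10),
        # src_device -> event
        TO EDGE flow_out VALUES (
            $2,
            gsql_concat($0, "|", $1, "|", $2, "|", $3, "|", $4, "|", $5, "|",
                        $6, "|", $7, "|", $8, "|", $9, "|", $10)),
        # event -> dst_device
        TO EDGE flow_in VALUES (
            gsql_concat($0, "|", $1, "|", $2, "|", $3, "|", $4, "|", $5, "|",
                        $6, "|", $7, "|", $8, "|", $9, "|", $10),
            $3)
        USING HEADER="false", SEPARATOR=",";
}
```

If duplicate rows should be kept as separate events instead, the ID would need something that differs per row, which the CSV as described does not provide.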


Thanks Xinyu,

What you suggested is what I implemented yesterday, following the answers from Szilard (@SzilardTG) and @rik.

It’s not intuitive, but it seems to provide what I need. Long term, though, are you going to support multiple edges between the same vertices?