Data Loading Issue

Hello, we are analyzing how the number of records in the load CSV compares to the number of records loaded into TG. There are instances where the file contains 300k records but TG loads only 280k. However, upon reloading the same file, it loads all 300k records. This happens with a lot of load jobs, though the number of missing records varies. Is this known behavior? What could be the cause? Can someone please help us look into the issue?

CREATE LOADING JOB load_test FOR GRAPH social {

      DEFINE HEADER h = "col1", "col2", "col3", "col4", "col5";
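      // NOTE: a header defined with DEFINE HEADER is only applied when a LOAD
      // statement references it with USER_DEFINED_HEADER="h"; the LOAD statements
      // below use HEADER="true", which reads column names from the file's first line.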
      DEFINE FILENAME f = "/dummy.csv";

      LOAD f TO VERTEX Vertex1 VALUES($"col1", $"col1", $"col3", $"col4") USING SEPARATOR=",", HEADER="true", EOL="\n", QUOTE="double";
      LOAD f TO VERTEX Vertex2 VALUES($"col5", $"col5") USING SEPARATOR=",", HEADER="true", EOL="\n", QUOTE="double";

      LOAD f TO EDGE Vertex1_X_Vertex1 VALUES($"col1", $"col2") USING SEPARATOR=",", HEADER="true", EOL="\n", QUOTE="double";
      LOAD f TO EDGE Vertex1_Y_Vertex2 VALUES($"col1", $"col5") USING SEPARATOR=",", HEADER="true", EOL="\n", QUOTE="double";
}

@dhruva can you confirm the file extension? Should it be /dummy.csv instead of dummy.txt? TigerGraph can ingest both .csv and .json files.

I have sometimes seen GraphStudio show an inaccurate count. Run a distributed GSQL count query to get the actual count.
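For reference, here is a minimal sketch of such a distributed count query, assuming the social graph and the Vertex1 type from the job above (the query name count_vertex1 is made up for illustration):

CREATE DISTRIBUTED QUERY count_vertex1() FOR GRAPH social {

      // global accumulator; partial counts from every node are summed
      SumAccum<INT> @@total;

      Start = {Vertex1.*};

      // each node counts the Vertex1 instances stored locally
      S = SELECT v FROM Start:v ACCUM @@total += 1;

      PRINT @@total;
}

Install it with INSTALL QUERY count_vertex1, then run it with RUN QUERY count_vertex1() and compare the result against the number GraphStudio reports.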

Sorry, updated the text. It’s .csv.

I use the stat_vertex_number function to get the count. Also, is that really a thing? Most of our queries are non-distributed, and it seems like that would affect them too.

I have experienced this in earlier versions, where the data was actually loaded but the counts in GraphStudio were wrong. GraphStudio and non-distributed queries use an estimation method to count vertices on remote nodes, and this method had issues in earlier versions. The true count was only available when we ran a count query in distributed mode, which collects the actual counts from every node and presents the true total to the user.
There was another issue with overflow in the internal Kafka, where data was pushed but lost before it was written to disk. But 300K should not be a problem. To debug, you might start with 100K records and check whether you get consistent results. Putting in my 2 cents here; hope it helps.

Ahh, I wasn’t aware of that issue, but what you said makes sense. However, we had a couple of examples that we made sure existed in the .csv, yet we were not able to find them in TG even with a SELECT statement. We also tried to look them up using Explore Graph, and it couldn’t locate those vertices.

Did you check the rejections in the loading logs?
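For reference, a quick way to see per-file loading statistics, including rejected lines, from the GSQL shell (a sketch; the exact output and job-ID format vary by version):

      # run the job and note the job ID printed in the output
      RUN LOADING JOB load_test
      # then check the statistics, including rejected lines, for recent jobs
      SHOW LOADING STATUS ALL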

For some reason the log file wasn’t there for that run 🙁. But upon reloading, it loaded all the records.

Hmm, okay, at least it worked. Let us know if you encounter the issue again, and I can help investigate.
