Estimating Jobs
If you are unsure which compute pool selector to use for your job, you can estimate the job using the procedures in the Neo4j_Graph_Analytics.estimate_experimental schema.
The schema mirrors the algorithm procedures, but instead of computing the results, it estimates how much memory the job would require and suggests a compute pool selector based on that memory requirement.
Estimating a Job
Let’s say you want to run the Weakly Connected Components algorithm on a graph stored in the EXAMPLE_DB.DATA_SCHEMA schema.
CALL Neo4j_Graph_Analytics.graph.wcc('CPU_X64_XS', {
    'project': {
        'nodeTables': ['EXAMPLE_DB.DATA_SCHEMA.NODES'],
        'relationshipTables': {
            'EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIPS': {
                'sourceTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'targetTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'orientation': 'NATURAL'
            }
        }
    },
    'compute': { 'consecutiveIds': true },
    'write': [{
        'nodeLabel': 'NODES',
        'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES_COMPONENTS'
    }]
});
The compute pool selector we use in the example is CPU_X64_XS, which is the smallest compute pool selector available.
Depending on the size of the graph, this selector might not be sufficient to run the job.
Unfortunately, this is not known until the job is executed, and while we might be able to project the graph, the computation might still fail due to insufficient memory.
To get a better idea of the memory requirements, we can use the estimate procedure instead:
CALL Neo4j_Graph_Analytics.estimate_experimental.wcc('CPU_X64_XS', {
    'project': {
        'nodeTables': ['EXAMPLE_DB.DATA_SCHEMA.NODES'],
        'relationshipTables': {
            'EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIPS': {
                'sourceTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'targetTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'orientation': 'NATURAL'
            }
        }
    },
    'compute': { 'consecutiveIds': true },
    'write': [{
        'nodeLabel': 'NODES',
        'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES_COMPONENTS'
    }]
});
As you can see, the procedure configuration is exactly the same as for the wcc procedure.
Note that we still need to provide a compute pool selector, because the estimation itself runs as a job.
However, that job will not project the graph or compute the results, so using the smallest pool selector is fine.
You might, however, consider increasing the number of compute nodes on that pool if multiple users are using this selector at the same time.
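If you expect that, one option is to scale the pool out with standard Snowflake compute pool commands. The sketch below uses a placeholder pool name; use SHOW COMPUTE POOLS to find the pool that actually backs your selector, and note that altering it requires the appropriate privileges:
-- List compute pools to find the one backing the CPU_X64_XS selector.
SHOW COMPUTE POOLS;

-- Hypothetical pool name: allow up to 4 nodes so concurrent
-- estimation jobs do not queue behind one another.
ALTER COMPUTE POOL MY_CPU_X64_XS_POOL SET MAX_NODES = 4;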
Running this procedure will return the same job result layout as the wcc procedure, but with a different JOB_RESULT.
JOB_ID | JOB_STATUS | JOB_START | JOB_END | JOB_RESULT |
---|---|---|---|---|
job_42 | SUCCESS | .. | .. | { "arguments": { "node_count": 10000000, "node_label_count": 1, "node_property_count": 1, "relationship_count": 500000000, "relationship_property_count": 1, "relationship_type_count": 1 }, "estimation": { "bytes_total": 6012198024 }, "recommendation": { "pool_selector": "CPU_X64_M" } } |
The JOB_RESULT contains the following information:

- arguments contains the counts we inferred from the node and relationship tables. This includes the row counts, but also the number of columns, from which we infer the property counts.
- estimation is the raw result of the estimation job, which contains the total number of bytes that would be required to run the job.
- recommendation contains the recommended compute pool selector based on the estimated memory requirements.
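If you want to consume the recommendation programmatically rather than reading it off the result, a sketch like the following should work directly after the estimate call. It assumes JOB_RESULT is returned as a JSON string; if it is already a VARIANT, drop the PARSE_JSON calls:
-- Read the result of the previous statement and pull out the
-- recommended pool selector and the estimated memory in bytes.
SELECT
    PARSE_JSON(JOB_RESULT):recommendation:pool_selector::STRING AS recommended_pool,
    PARSE_JSON(JOB_RESULT):estimation:bytes_total::NUMBER       AS estimated_bytes
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));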
The example shows that the job would require about 6 GB of memory to run and recommends the CPU_X64_M compute pool selector.
The CPU_X64_XS compute pool provides 6 GB of memory, but we also need to account for the memory used by the operating system and other processes on the compute node.
Of course, the job would still start on the smaller compute pool, but it would likely fail due to insufficient memory.
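As a quick sanity check on the numbers: 6012198024 bytes / 1024³ ≈ 5.6 GiB, which sits right at the 6 GB capacity of CPU_X64_XS before any operating system overhead is subtracted.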
Once the result is computed, you can change the schema from estimate_experimental to graph and run the job with the recommended compute pool selector:
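In our example, this is the same call as before, only with the graph schema and the recommended CPU_X64_M selector:
CALL Neo4j_Graph_Analytics.graph.wcc('CPU_X64_M', {
    'project': {
        'nodeTables': ['EXAMPLE_DB.DATA_SCHEMA.NODES'],
        'relationshipTables': {
            'EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIPS': {
                'sourceTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'targetTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'orientation': 'NATURAL'
            }
        }
    },
    'compute': { 'consecutiveIds': true },
    'write': [{
        'nodeLabel': 'NODES',
        'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES_COMPONENTS'
    }]
});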