Estimating Jobs

If you are unsure which compute pool selector to use for your job, you can estimate the job using the procedures in the Neo4j_Graph_Analytics.estimate_experimental schema. The procedures in this schema mirror the algorithm procedures, but instead of computing the results, they estimate how much memory the job would require and suggest a compute pool selector based on that memory requirement.

Estimating a Job

Let’s say you want to run the Weakly Connected Components algorithm on a graph stored in the EXAMPLE_DB.DATA_SCHEMA schema.

CALL Neo4j_Graph_Analytics.graph.wcc('CPU_X64_XS', {
    'project': {
        'nodeTables': ['EXAMPLE_DB.DATA_SCHEMA.NODES'],
        'relationshipTables': {
            'EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIPS': {
                'sourceTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'targetTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'orientation': 'NATURAL'
            }
        }
    },
    'compute': { 'consecutiveIds': true },
    'write': [{
        'nodeLabel': 'NODES',
        'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES_COMPONENTS'
    }]
});

The compute pool selector used in the example is CPU_X64_XS, the smallest compute pool selector available. Depending on the size of the graph, this selector might not be sufficient to run the job. Unfortunately, this is not known until the job is executed: even if the graph can be projected, the computation might still fail due to insufficient memory.

To get a better idea of the memory requirements, we can use the estimate procedure instead:

CALL Neo4j_Graph_Analytics.estimate_experimental.wcc('CPU_X64_XS', {
    'project': {
        'nodeTables': ['EXAMPLE_DB.DATA_SCHEMA.NODES'],
        'relationshipTables': {
            'EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIPS': {
                'sourceTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'targetTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'orientation': 'NATURAL'
            }
        }
    },
    'compute': { 'consecutiveIds': true },
    'write': [{
        'nodeLabel': 'NODES',
        'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES_COMPONENTS'
    }]
});

As you can see, the procedure configuration is exactly the same as for the wcc procedure. Note that we still need to provide a compute pool selector, because the estimation itself runs as a job. However, that job neither projects the graph nor computes the results, so using the smallest pool selector is fine. You might still consider increasing the number of compute nodes on that pool if several users run jobs with the same selector at the same time.
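
If you do want to scale the pool, one option is a standard Snowflake ALTER COMPUTE POOL statement. This is only a sketch: the pool name below is hypothetical, so use the compute pool that backs the CPU_X64_XS selector in your account.

-- Hypothetical pool name; replace it with the compute pool that backs
-- the CPU_X64_XS selector in your account.
ALTER COMPUTE POOL NEO4J_GA_CPU_X64_XS SET MAX_NODES = 4;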

Running this procedure will return the same job result layout as the wcc procedure, but with a different JOB_RESULT.

Table 1. Results

JOB_ID: job_42
JOB_STATUS: SUCCESS
JOB_START: ..
JOB_END: ..
JOB_RESULT:
{
  "arguments": {
    "node_count": 10000000,
    "node_label_count": 1,
    "node_property_count": 1,
    "relationship_count": 500000000,
    "relationship_property_count": 1,
    "relationship_type_count": 1
  },
  "estimation": {
    "bytes_total": 6012198024
  },
  "recommendation": {
    "pool_selector": "CPU_X64_M"
  }
}

The JOB_RESULT contains the following information:

  • arguments contains the counts inferred from the node and relationship tables. This includes the row counts as well as the number of columns, which is used to infer the property counts.

  • estimation is the raw result of the estimation job, which contains the total number of bytes that would be required to run the job.

  • recommendation contains the recommended compute pool selector based on the estimated memory requirements.

The example shows that the job would require about 6 GB of memory to run and recommends the CPU_X64_M compute pool selector. The CPU_X64_XS compute pool provides 6 GB of memory, but we also need to account for the memory used by the operating system and other processes on the compute node. The job would still start on the smaller compute pool, but it would likely fail due to insufficient memory.
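
If you want to pick up the recommendation programmatically, one possible approach is to read the CALL result with Snowflake's RESULT_SCAN. This is only a sketch and assumes JOB_RESULT is returned as a JSON string; if it is already a VARIANT, the PARSE_JSON call is unnecessary.

-- Sketch: read the recommended pool selector from the result of the
-- estimate call issued immediately before this query.
SELECT PARSE_JSON(JOB_RESULT):recommendation:pool_selector::STRING AS recommended_selector
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));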

Once the estimation result is available, you can change the schema from estimate_experimental back to graph and run the job again with the recommended compute pool selector.
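
For example, continuing from the configuration above, the call with the CPU_X64_M selector recommended by the estimation would look like this:

CALL Neo4j_Graph_Analytics.graph.wcc('CPU_X64_M', {
    'project': {
        'nodeTables': ['EXAMPLE_DB.DATA_SCHEMA.NODES'],
        'relationshipTables': {
            'EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIPS': {
                'sourceTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'targetTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES',
                'orientation': 'NATURAL'
            }
        }
    },
    'compute': { 'consecutiveIds': true },
    'write': [{
        'nodeLabel': 'NODES',
        'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.NODES_COMPONENTS'
    }]
});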

Estimation Input

Note that the recommendation is based on counting the number of rows in the node and relationship tables. If your tables are actually views that involve complex query execution, counting their rows might result in unexpected costs on the consumer side.
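
If that is a concern, one possible mitigation is to materialize an expensive view into a plain table and point the projection at that table instead. This is only a suggestion, not part of the estimation procedure; the view and table names below are hypothetical.

-- Hypothetical names; materialize the view once, then reference the table
-- in 'nodeTables' or 'relationshipTables' instead of the view.
CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.NODES_MATERIALIZED AS
SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.NODES_VIEW;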