One of the more powerful features of ClickHouse is its introspective capabilities. This can be easily leveraged to understand where load on our multi-tenant clickhouse servers is coming from.
Show me the queries:
The following query gives an at-a-glance overview of what is generating load on the cluster:
withtoDateTime(now()) as target_time,toIntervalDay(3) as interval,(SELECT sum(read_bytes)from clusterAllReplicas('posthog', system,query_log)where event_time >= target_time - interval and event_time <= target_time and type > 1 and is_initial_query) AS total_bytes_readselectmultiIf(http_user_agent = 'Apache-HttpClient/4.5.13 (Java/11.0.17)', 'Metabase?',http_user_agent = 'Go-http-client/1.1', 'Grafana?',http_user_agent = 'python-requests/2.28.1', 'infi_orm',http_user_agent != '', http_user_agent,query = 'SELECT version()', 'version_query',JSONExtractString(log_comment, 'kind') = 'request' and JSONExtractString(log_comment, 'route_id') IN ('api/event/?$', 'api/projects/(?P<parent_lookup_team_id>[^/.]+)/events/?$'), '/api/event',log_comment != '', JSONExtractString(log_comment, 'kind'),query_kind = 'Insert',query LIKE '%FORMAT JSON%' or (not is_initial_query and query like '%`elements_chain` FROM `posthog`.`sharded_events%') or (query like '%min(`_timestamp`) AS `min`, max(`_timestamp`)%'), 'historical-exports','unknown') as query_type,formatReadableSize(sum(read_bytes)) AS read_bytes_,sum(read_bytes) / total_bytes_read AS read_bytes_percentage,count(),sum(query_duration_ms),formatReadableSize(sum(result_bytes)) AS result_bytes_,sum(read_rows),arraySort(arrayDistinct(groupArray(getMacro('replica')))) as hostsfrom clusterAllReplicas('posthog', system, query_log)where event_time >= target_time - interval and event_time <= target_time and type > 1 and is_initial_querygroup by query_typeorder by sum(query_duration_ms) DESC
Advanced
To diagnose further, it's important to understand ClickHouse operations.
Useful dimensions to slice the data on:
query_duration_ms- How long the query tookformatReadableSize(read_bytes)- Total number of bytes read from all tables and table functions participated in query.formatReadableSize(memory_usage)- Memory usage of this query.formatReadableSize(result_bytes)- How big was the response for this query. Useful fornot is_initial_queryto determine if too much data streaming is going on.ProfileEvents['OSCPUVirtualTimeMicroseconds'])- How much time was spent by the CPUProfileEvents['ReadCompressedBytes'])- How much data was read from disk (compressed)ProfileEvents['NetworkSendElapsedMicroseconds'])- How much time was spent sending data over networkProfileEvents['NetworkReceiveBytes'])- How much time was spent reading data over networkProfileEvents['NetworkSendBytes'])- How much data was sent over the network
Full list of all valid measurements can be found in ClickHouse source code.
Other useful expressions:
is_initial_query- indicates whether this was a main query or pushed down from coordinator. Notelog_commentis also forwarded.any(log_comment)- shows the structure of the log commentgetMacro('replica')- What replica was this on?getMacro('shard')- What shard was this query made on?getMacro('hostClusterType')- What cluster was this on? Online or offline?
log_comment
We make use of the log_comment column quite extensively to add metadata to queries in order to make analysis simpler.
log_comment is set by specifying the log_comment setting when running a query. It then populates in the column with the same name on the query log table.
Some useful things you'll find in log_comment include:
JSONExtractString(log_comment, 'kind')- What is the query from? EitherceleryorrequestJSONExtractString(log_comment, 'query_type')- Type of query e.g.trends_total_volumeorfunnel_conversionJSONExtractString(log_comment, 'id')- Request path or task name (depending on kind)JSONExtractString(log_comment, 'route_id')- what geberic route id was responsible for the query (only set for kind=request)JSONExtractInt(log_comment, 'user_id')- user_idJSONExtractInt(log_comment, 'team_id')- team_idJSONExtractString(log_comment, 'access_method')- What access method was used? Blank indicates it's normal web traffic (only set for kind=request)JSONExtractString(log_comment, 'http_referer')- HTTP referer (only set for kind=request)JSONExtractString(log_comment, 'http_user_agent')- HTTP user agent (only set for kind=request)JSONExtractString(log_comment, 'container_hostname')- Kubernetes pod where the query was initiated fromJSONExtractString(log_comment, 'workload')- EitherWORKLOAD.ONLINEorWORKLOAD.OFFLINE. This determines which node on the shard we prefer to send the query toJSONExtractString(JSONExtractString(log_comment, 'query_settings'), '<setting_name>')- Any settings passed along with the query can be found within thequery_settingsobject. To get the value for a specific setting, you need to callJSONExtractXtwice.JSONExtractString(log_comment, 'filter')- TheFilterobject for an insight. You'll need to use array functions andJSONExtractfunctions in order to access deeply nested filter data