$ Impala

$ 端口

客户端 端口
impala-shell, ODBC driver 21000
JDBC, ODBC driver 21050

$ JDBC连接方式

  1. 使用Cloudera JDBC Connectorcom.cloudera.impala.jdbc41.Driver,连接串格式:

    jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;...
    
  2. 使用Hive JDBC Driverorg.apache.hive.jdbc.HiveDriver

    1. 简单连接串(注意21050端口号后面的“/”,以及“;auth=noSasl”,缺少会出错):

      jdbc:hive2://myhost.example.com:21050/;auth=noSasl
      
    2. kerberos连接串:

      jdbc:hive2://myhost.example.com:21050/;principal=impala/myhost.example.com@H2.EXAMPLE.COM
      
    3. LDAP连接串:

      jdbc:hive2://host:port/db_name;user=ldap_userid;password=ldap_password
      

参考:

$ 负载均衡

负载均衡主要适用于impalad进程。

负载均衡有下面的优势:

  1. 应用程序只需要连接到一个地址

  2. 实现高可用连接

  3. coordinator节点比其他节点需要更多CPU和内存。代理服务器可以调度查询从而使每个连接使用不同的coordinator节点。

选择负载均衡算法:

$ Leastconn

使用最少的连接将会话连接到coordinator节点,通常用于由许多独立的,短期运行的查询。

$ Source IP Persistence

来自相同IP地址的会话始终会到达相同的协调器。对于混合着查询和DDL语句的情况是一个不错的选择,例如CREATE TABLEALTER TABLE。由于DDL语句中的元数据更改需要花费时间才能在整个群集中传播,因此在这种情况下,最好使用Source IP Persistence算法。如果无法选择Source IP Persistence,请通过同一会话运行DDL和依赖于DDL结果的后续查询,例如通过运行impala-shell -f script_file 通过一个会话提交多个语句。

$ Round-robin

将连接分布到所有coordinator节点,通常不建议用于Impala。

你可能需要执行基准测试和负载测试,以确定哪个设置对你的用例是最优的。总是使用两种负载平衡算法:Hue使用Source IP Persistence,其他的场景使用Leastconn

$ 参考

Using Impala through a Proxy for High Availability (opens new window)

$ 账号授权

SHOW语句与权限相关语法:

SHOW ROLES
SHOW CURRENT ROLES
SHOW ROLE GRANT GROUP group_name
SHOW GRANT ROLE role_name

SHOW GRANT USER user_name
SHOW GRANT USER user_name ON SERVER
SHOW GRANT USER user_name ON DATABASE database_name
SHOW GRANT USER user_name ON TABLE table_name
SHOW GRANT USER user_name ON URI uri

GRANT语句语法:

GRANT ROLE role_name TO GROUP group_name

GRANT privilege ON object_type object_name
   TO [ROLE] roleName
   [WITH GRANT OPTION]

privilege ::= ALL | CREATE | INSERT | REFRESH | SELECT | SELECT(column_name)
object_type ::= SERVER | URI | DATABASE | TABLE

REVOKE语句语法:

REVOKE ROLE role_name FROM GROUP group_name

REVOKE [GRANT OPTION FOR] privilege ON object_type object_name
  FROM [ROLE] role_name

privilege ::= ALL | CREATE | INSERT | REFRESH | SELECT | SELECT(column_name)

object_type ::= SERVER | URI | DATABASE | TABLE

兼容性:

不需要HDFS权限

Kudu注意事项:

  1. 仅在Server上有ALL权限的用户可以创建外部Kudu表

  2. DELETE, UPDATE, 和 UPSERT 操作要求ALL权限

参考:

GRANT Statement (CDH 5.2 or higher only) (opens new window)

$ 数据类型

类型 范围 备注
TINYINT -128 .. 127
SMALLINT -32768 .. 32767
INT -2147483648 .. 2147483647
BIGINT -9223372036854775808 .. 9223372036854775807
FLOAT 正负1.40129846432481707e-45 .. 3.40282346638528860e+38
DOUBLE,REAL 4.94065645841246544e-324d .. 1.79769313486231570e+308
DECIMAL[(precision[, scale])] precision:between 1 and 38(default 9)scale:default 0
STRING 2 GB
CHAR
VARCHAR
TIMESTAMP 1400-01-01 to 9999-12-31 UTC时区
BOOLEAN

类型转换:

TINYINT SMALLINT INT BIGINT FLOAT DOUBLE DECIMAL STRING CHAR VARCHAR TIMESTAMP BOOLEAN
TINYINT - Auto Auto Auto Auto Auto Cast Cast
SMALLINT Cast - Auto Auto Auto Auto Cast Cast
INT Cast Cast - Auto Auto Auto Cast Cast
BIGINT Cast Cast - Auto Auto Cast Cast
FLOAT Cast Cast Cast Cast - Auto No Cast Cast Cast
DOUBLE Cast Cast Cast Cast Cast - No Cast Cast Cast
DECIMAL Cast Cast Cast Cast Auto Auto - Cast Cast Cast
STRING Cast Cast Cast Cast Cast Cast - Auto No
CHAR -
VARCHAR -
TIMESTAMP -
BOOLEAN Cast -

参考:

Impala Type Conversion Functions (opens new window)

$ Query Option

名称 适用角色 标签 默认值 候选值 说明
ABORT_JAVA_UDF_ON_EXCEPTION
ABORT_ON_DEFAULT_LIMIT_EXCEEDED Removed
ABORT_ON_ERROR false When this option is enabled, Impala cancels a query immediately when any of the nodes encounters an error, rather than continuing and possibly returning incomplete results.
ALLOW_ERASURE_CODED_FILES @since 3.1 false When the ALLOW_ERASURE_CODED_FILES query option is set to FALSE, Impala returns an error when a query requires scanning an erasure coded file.
ALLOW_UNSUPPORTED_FORMATS Removed
ANALYTIC_RANK_PUSHDOWN_THRESHOLD
APPX_COUNT_DISTINCT @since 2.0 false When the APPX_COUNT_DISTINCT query option is set to TRUE, Impala implicitly converts COUNT(DISTINCT) operations to the NDV() function calls.
ASYNC_CODEGEN
BATCH_SIZE @since 2.11 0(1024) 0-65536 Number of rows evaluated at a time by SQL operators.Using a large number improves responsiveness, especially for scan operations, at the cost of a higher memory footprint.
BROADCAST_BYTES_LIMIT 34359738368 (32 GB) 0:ignored The Impala planner may in rare cases make a bad choice to broadcast a large table or intermediate result and encounter performance problems due to high memory pressure. Setting this limit will make the planner pick a partition based hash join instead of broadcast and avoid such performance problems.
BROADCAST_TO_PARTITION_FACTOR
BUFFER_POOL_LIMIT the lower of 80% of the MEM_LIMIT setting, or the MEM_LIMIT setting minus 100 MB. -- Set an absolute value. set buffer_pool_limit=8GB; -- Set a relative value based on the MEM_LIMIT setting. set buffer_pool_limit=80%; Defines a limit on the amount of memory that a query can allocate from the internal buffer pool. The value for this limit applies to the memory on each host, not the aggregate memory across the cluster. Typically not changed by users, except during diagnosis of out-of-memory errors during queries.
CLIENT_IDENTIFIER
COMPRESSION_CODEC SNAPPY SET COMPRESSION_CODEC=codec_name; // Supported for all codecs. SET COMPRESSION_CODEC=codec_name:compression_level; // Only supported for ZSTD.The allowed values for this query option are SNAPPY (the default), GZIP, ZSTD, LZ4, and NONE. When Impala writes Parquet data files using the INSERT statement, the underlying compression is controlled by the COMPRESSION_CODEC query option.
COMPUTE_COLUMN_MINMAX_STATS
COMPUTE_STATS_MIN_SAMPLE_SIZE
CONVERT_LEGACY_HIVE_PARQUET_UTC_TIMESTAMPS
CPU_LIMIT_S
DEBUG_ACTION
DECIMAL_V2 A query option that changes behavior related to the DECIMAL data type. Set this option to FALSE for backward compatibility to Impala 2.x.
DEFAULT_FILE_FORMAT @since 3.2 TEXT (0)RC_FILE (1)SEQUENCE_FILE (2)AVRO (3)PARQUET (4)KUDU (5)ORC (6) Use the DEFAULT_FILE_FORMAT query option to set the default table file format.
DEFAULT_HINTS_INSERT_STATEMENT none SET DEFAULT_HINTS_INSERT_STATEMENT=CLUSTERED; SET DEFAULT_HINTS_INSERT_STATEMENT=SHUFFLE; SET DEFAULT_HINTS_INSERT_STATEMENT=NOCLUSTERED:NOSHUFFLE; The DEFAULT_HINTS_INSERT_STATEMENT query option sets the default hints for the INSERT statements with no optimizer hint specified.The default hints apply to the HDFS and Kudu table formats and are ignored for the HBase table format.
DEFAULT_JOIN_DISTRIBUTION_MODE BROADCAST (equivalent to 0) or SHUFFLE (equivalent to 1) This option determines the join distribution that Impala uses when any of the tables involved in a join query is missing statistics.
DEFAULT_NDV_SCALE
DEFAULT_ORDER_BY_LIMIT Removed
DEFAULT_SPILLABLE_BUFFER_SIZE @since 2.10.0 2097152 (2 MB) Specifies the default size for a memory buffer used when the spill-to-disk mechanism is activated.
DEFAULT_TRANSACTIONAL_TYPE @since 3.3 NONE NONE: The table will not be created as transactional. INSERT_ONLY: The table will be created as transactional. Use the DEFAULT_TRANSACTIONAL_TYPE query option to create insert-only transactional tables by default.If either table properties, transactional or transactional_properties, are set, this query option is ignored.
DELETE_STATS_IN_TRUNCATE @since 4.0 TRUE This query option DELETE_STATS_IN_TRUNCATE can be used to delete or retain table statistics.
DISABLE_CACHED_READS
DISABLE_CODEGEN false it's used to work around any issues with Impala's runtime code generation
DISABLE_CODEGEN_ROWS_THRESHOLD @since 2.10.0 50000 This setting controls the cutoff point (in terms of number of rows processed per Impala daemon) below which Impala disables native code generation for the whole query.If a query uses the complex data types STRUCT, ARRAY, or MAP, then codegen is never automatically disabled regardless of the DISABLE_CODEGEN_ROWS_THRESHOLD setting.
DISABLE_DATA_CACHE
DISABLE_HBASE_NUM_ROWS_ESTIMATE FALSE While generating a plan for an HBase query, the planner samples the underlying HBase tables to estimate their row count and row size, and the sampling can negatively impact the planning time. When the HBase table stats does not change much in short time, disable the sampling by setting the DISABLE_HBASE_NUM_ROWS_ESTIMATE query option to TRUE. And Impala planner will fall back to using Hive Metastore (HMS) table stats instead.
DISABLE_HDFS_NUM_ROWS_ESTIMATE
DISABLE_OUTERMOST_TOPN
DISABLE_ROW_RUNTIME_FILTERING
DISABLE_STREAMING_PREAGGREGATIONS false Turns off the "streaming preaggregation" optimization that is available in Impala 2.5 and higher.
DISABLE_UNSAFE_SPILLS false Enable this option if you prefer to have queries fail when they exceed the Impala memory limit, rather than write temporary data to disk.
ENABLE_ASYNC_DDL_EXECUTION
ENABLE_ASYNC_LOAD_DATA_EXECUTION
ENABLE_CNF_REWRITES
ENABLE_DISTINCT_SEMI_JOIN_OPTIMIZATION
ENABLE_EXPR_REWRITES true The ENABLE_EXPR_REWRITES query option controls whether to enable or disable the query compile time optimizations.
ENABLE_KUDU_TRANSACTION
ENABLE_OUTER_JOIN_TO_INNER_TRANSFORMATION
ENABLE_REPLAN
ENABLED_RUNTIME_FILTER_TYPES
EXEC_SINGLE_NODE_ROWS_THRESHOLD @since 2.10 100 This setting controls the cutoff point (in terms of number of rows scanned) below which Impala treats a query as a "small" query, turning off optimizations such as parallel execution and native code generation.This setting applies to queries where the number of rows processed can be accurately determined, either through table and column statistics, or by the presence of a LIMIT clause. If Impala cannot accurately estimate the number of rows, then this setting does not apply.For a query that is determined to be "small", all work is performed on the coordinator node.
EXEC_TIME_LIMIT_S @since 2.12 0 (no time limit ) The EXEC_TIME_LIMIT_S query option sets a time limit on query execution.The time limit only starts once the query is executing. Time spent planning the query, scheduling the query, or in admission control is not counted towards the execution time limit.
EXPAND_COMPLEX_TYPES
EXPLAIN_LEVEL 1 0 or MINIMAL, 1 or STANDARD, 2 or EXTENDED, 3 or VERBOSE Controls the amount of detail provided in the output of the EXPLAIN statement.
FETCH_ROWS_TIMEOUT_MS @since 3.4 10000 Use the FETCH_ROWS_TIMEOUT_MS query option to control how long Impala waits for query results when clients fetch rows.
HBASE_CACHE_BLOCKS false Setting this option is equivalent to calling the setCacheBlocks method of the class org.apache.hadoop.hbase.client.Scan, in an HBase Java application.
HBASE_CACHING 0 Setting this option is equivalent to calling the setCaching method of the class org.apache.hadoop.hbase.client.Scan, in an HBase Java application.
IDLE_SESSION_TIMEOUT @since 2.12 0 The IDLE_SESSION_TIMEOUT query option sets the time in seconds after which an idle session is cancelled. A session is idle when no activity is occurring for any of the queries in that session, and the session has not started any new queries. Once a session is expired, you cannot issue any new query requests to it. The session remains open, but the only operation you can perform is to close it.
JOIN_ROWS_PRODUCED_LIMIT @since 4.0 0 The JOIN_ROWS_PRODUCED_LIMIT query option limits the number of join rows produced by a join node and is used to prevent runaway join queries.
KUDU_READ_MODE @since 3.1 DEFAULT "DEFAULT""READ_LATEST""READ_AT_SNAPSHOT" The KUDU_READ_MODE query option allows you to set a desired consistency level for scans of Kudu tables.
KUDU_REPLICA_SELECTION
KUDU_SNAPSHOT_READ_TIMESTAMP_MICROS
LOCK_MAX_WAIT_TIME_S
LIVE_PROGRESS @since 2.3.0 TRUE When you run a query, the live progress bar appears in the output of a query. The bar shows roughly the percentage of completed processing. When the query finishes, the live progress bar disappears from the console output.Starting in Impala 3.1, the summary output also includes the queuing status consisting of whether the query was queued and what was the latest queuing reason.
LIVE_SUMMARY @since 2.3.0 FALSE Impala displays the same output as the SUMMARY command for queries submitted through the impala-shell command, with the measurements updated in real time as the query progresses. When the query finishes, the final SUMMARY output remains visible in the impala-shell console output.
MAX_CNF_EXPRS
MAX_ERRORS 0 (meaning 1000 errors) Maximum number of non-fatal errors for any particular query that are recorded in the Impala log file.
MAX_FS_WRITERS
MAX_IO_BUFFERS Removed
MAX_MEM_ESTIMATE_FOR_ADMISSION @since 3.1 Use the MAX_MEM_ESTIMATE_FOR_ADMISSION query option to set an upper limit on the memory estimates of a query as a workaround for over-estimates precluding a query from being admitted.
MAX_NUM_RUNTIME_FILTERS
MAX_RESULT_SPOOLING_MEM @since 3.4 100 MB Use the MAX_RESULT_SPOOLING_MEM query option to set the maximum amount of memory used when spooling query results.If the amount of memory exceeds this value when spooling query results, all memory will most likely be spilled to disk.The MAX_RESULT_SPOOLING_MEM query option is applicable only when query result spooling is enabled with the SPOOL_QUERY_RESULTS query option set to TRUE.
MAX_ROW_SIZE
MAX_SCAN_RANGE_LENGTH 0(indicates backend default, which is the same as the HDFS block size for each table.) The MAX_RESULT_SPOOLING_MEM query option is applicable only when query result spooling is enabled with the SPOOL_QUERY_RESULTS query option set to TRUE.
MAX_SPILLED_RESULT_SPOOLING_MEM @since 3.4 1 GB Use the MAX_SPILLED_RESULT_SPOOLING_MEM query option to set the maximum amount of memory that can be spilled when spooling query results.
MAX_STATEMENT_LENGTH_BYTES
MEM_LIMIT 0 (unlimited) The MEM_LIMIT query option defines the maximum amount of memory a query can allocate on each node.
MEM_LIMIT_EXECUTORS
MIN_SPILLABLE_BUFFER_SIZE @since 2.10.0 65536 (64 KB) Specifies the minimum size for a memory buffer used when the spill-to-disk mechanism is activated, for example for queries against a large table with no statistics, or large join operations.
MINMAX_FILTER_FAST_CODE_PATH
MINMAX_FILTER_PARTITION_COLUMNS
MINMAX_FILTER_SORTED_COLUMNS
MINMAX_FILTER_THRESHOLD
MINMAX_FILTERING_LEVEL
MT_DOP
NOW_STRING
NUM_NODES 0 0 (meaning all nodes) or 1 (meaning all work is done on the coordinator node) Limit the number of nodes that process a query, typically during debugging.
NUM_REMOTE_EXECUTOR_CANDIDATES
NUM_ROWS_PRODUCED_LIMIT
NUM_SCANNER_THREADS
OPTIMIZE_PARTITION_KEY_SCANS @since 2.5.0 false Enables a fast code path for queries that apply simple aggregate functions to partition key columns: MIN(key_column), MAX(key_column), or COUNT(DISTINCT key_column).
OPTIMIZE_SIMPLE_LIMIT
ORC_ASYNC_READ
ORC_READ_STATISTICS
ORC_SCHEMA_RESOLUTION
PARQUET_ANNOTATE_STRINGS_UTF8 @since 2.6.0 false Causes Impala INSERT and CREATE TABLE AS SELECT statements to write Parquet files that use the UTF-8 annotation for STRING columns.
PARQUET_ARRAY_RESOLUTION @since 2.9.0 THREE_LEVEL THREE_LEVEL
TWO_LEVEL
TWO_LEVEL_THEN_THREE_LEVEL
The PARQUET_ARRAY_RESOLUTION query option controls the behavior of the indexed-based resolution for nested arrays in Parquet.
PARQUET_BLOOM_FILTER_WRITE
PARQUET_BLOOM_FILTERING
PARQUET_DICTIONARY_FILTERING @since 2.9.0 true The PARQUET_DICTIONARY_FILTERING query option controls whether Impala uses dictionary filtering for Parquet files.
PARQUET_DICTIONARY_RUNTIME_FILTER_ENTRY_LIMIT
PARQUET_FALLBACK_SCHEMA_RESOLUTION @since 2.6.0 POSITION POSITION (0)
NAME (1)
The PARQUET_FALLBACK_SCHEMA_RESOLUTION query option allows Impala to look up columns within Parquet files by column name, rather than column order, when necessary.
PARQUET_FILE_SIZE 0 (produces files with a target size of 256 MB; files might be larger for very wide tables) the maximum value for this setting is 1 gigabyte (1g) Specifies the maximum size of each Parquet data file produced by Impala INSERT statements.
PARQUET_LATE_MATERIALIZATION_THRESHOLD
PARQUET_OBJECT_STORE_SPLIT_SIZE 256 MB The value must be greater than or equal to 1 MB. Use the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to control the Parquet format specific split sizes on non-block based stores,
PARQUET_PAGE_ROW_COUNT_LIMIT @since 3.3 set the maximum number of rows that can be written on a single Parquet data page.
PARQUET_READ_PAGE_INDEX to disable or enable using the Parquet page index during scans.
Of the types: Boolean, Integer, Decimal, String, Timestamp.
For simple predicates of the forms: <slot> <op> <constant> or <constant> <op> <slot>, where <op> is LT, LE, GE, GT, and EQ
PARQUET_READ_STATISTICS @since 2.9.0 true true (1)
false (0)
set the maximum number of rows that can be written on a single Parquet data page.
Of the numerical types for the old version of the statistics: Boolean, Integer, Float
Of the types for the new version of the statistics (starting in IMPALA 2.8): Boolean, Integer, Float, Decimal, String, Timestamp
For simple predicates of the forms: <slot> <op> <constant> or <constant> <op> <slot>, where <op> is LT, LE, GE, GT, and EQ
PARQUET_TIMESTAMP_TYPE
PARQUET_WRITE_PAGE_INDEX @since 3.3 TRUE true (1)
false (0)
to disable or enable the Parquet page index writing.
PLANNER_TESTCASE_MODE
PREAGG_BYTES_LIMIT
PREFETCH_MODE @since 2.6.0 1 (equivalent to HT_BUCKET) 0(NONE)
1(HT_BUCKET)
Determines whether the prefetching optimization is applied during join query processing.
QUERY_TIMEOUT_S @since 2.0.0 0 For queries, this means the query has results ready but is waiting for a client to fetch the data.
For sessions, this means that no query has been submitted for some period of time.
REFRESH_UPDATED_HMS_PARTITIONS @since 4.0.0 FALSE refresh any updated HMS partitions.
REPLICA_PREFERENCE @since 2.7.0 CACHE_LOCAL (0) CACHE_LOCAL (0), DISK_LOCAL (2), REMOTE (4) This allows Impala to schedule reads to suboptimal replicas (e.g. local in the presence of cached ones) in order to distribute the work across more executor nodes.
REPORT_SKEW_LIMIT
REQUEST_POOL empty The pool or queue name that queries should be submitted to. Only applies when you enable the Impala admission control feature.
RESERVATION_REQUEST_TIMEOUT Removed
RESOURCE_TRACE_RATIO @since 3.2 0 from 0 to 1 are allowed specifies the ratio of queries where the CPU usage info will be included in the profiles.
RETRY_FAILED_QUERIES @since 4.0 TRUE control whether or not queries are transparently retried on cluster membership changes.
This feature supports retrying the entire query and NOT the individual query fragments. INSERT and DDL queries will NOT be retried.
query retry will be skipped if the query has returned any results to the client.
RM_INITIAL_MEM Removed
RUNTIME_BLOOM_FILTER_SIZE @since 2.5.0 1048576 (1 MB) Maximum: 16 MB In Impala 2.6 and higher, this query option only applies as a fallback, when statistics are not available.
RUNTIME_FILTER_ERROR_RATE
RUNTIME_FILTER_MAX_SIZE @since 2.6.0 0 (meaning use the value from the corresponding impalad startup option) defines the maximum size for a filter.Filter sizes are rounded up to the nearest power of two.
RUNTIME_FILTER_MIN_SIZE @since 2.6.0 0 (meaning use the value from the corresponding impalad startup option) defines the minimum size for a filter.Filter sizes are rounded up to the nearest power of two.
RUNTIME_FILTER_MODE @since 2.5.0 2 numeric (0, 1, 2) or corresponding mnemonic strings (OFF, LOCAL, GLOBAL). It turns this feature on and off, and controls how extensively the filters are transmitted between hosts.
RUNTIME_FILTER_WAIT_TIME_MS @since 2.5.0 0 (meaning use the value from the corresponding impalad startup option) specifies a time in milliseconds that each scan node waits for runtime filters to be produced by other plan fragments.
RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT
S3_SKIP_INSERT_STAGING @since 2.6.0 true Speeds up INSERT operations on tables or partitions residing on the Amazon S3 filesystem.
SCAN_BYTES_LIMIT @since 3.1 0 (no limit) The SCAN_BYTES_LIMIT query option sets a limit on the bytes scanned by HDFS and HBase SCAN operations.
SCAN_NODE_CODEGEN_THRESHOLD Removed
SCHEDULE_RANDOM_REPLICA @since 2.5.0 false The SCHEDULE_RANDOM_REPLICA query option fine-tunes the scheduling algorithm for deciding which host processes each HDFS data block or Kudu tablet to reduce the chance of CPU hotspots.The SCHEDULE_RANDOM_REPLICA query option only applies to tables and partitions that are not enabled for the HDFS caching.
SCRATCH_LIMIT -1 (amount of spill space is unlimited) Specifies the maximum amount of disk storage, in bytes, that any Impala query can consume on any host using the "spill to disk" mechanism that handles queries that exceed the memory limit.
A value of zero turns off the spill to disk feature for queries in the current session, causing them to fail immediately if they exceed the memory limit.
SEQ_COMPRESSION_MODE Removed
SHOW_COLUMN_MINMAX_STATS
SHUFFLE_DISTINCT_EXPRS false controls the shuffling behavior when a query has both grouping and distinct expressions. Impala can optionally include the distinct expressions in the hash exchange to spread the data among more nodes. However, this plan requires one more hash exchange phase.
It is recommended that you turn off this option if the NDVs of the grouping expressions are high.
SORT_RUN_BYTES_LIMIT
SPOOL_ALL_RESULTS_FOR_RETRIES
SPOOL_QUERY_RESULTS @since 3.4 FALSE Query result spooling controls how rows are returned to the client.
STATEMENT_EXPRESSION_LIMIT
STRICT_MODE
SYNC_DDL false causes any DDL operation such as CREATE TABLE or ALTER TABLE to return only when the changes have been propagated to all other Impala nodes in the cluster by the Impala catalog service.
INSERT statements also delay their completion until all the underlying data and metadata changes are propagated to all Impala nodes and this option applies to all filesystem-based tables.
TARGETED_KUDU_SCAN_RANGE_LENGTH
TEST_REPLAN
THREAD_RESERVATION_AGGREGATE_LIMIT @since 3.1 0 (no limit) limits the number of reserved threads for a query across all nodes on which it is executing.Queries that have more threads than this threshold are rejected by Impala’s admission controller before they start executing.
THREAD_RESERVATION_LIMIT
TIMEZONE @since 3.1 The system time zone where the Coordinator Impalad runs a canonical code or a time zone name defined in IANA Time Zone Database (opens new window). defines the timezone used for conversions between UTC and the local time. If not set, Impala uses the system time zone where the Coordinator Impalad runs.
Impala takes the timezone into a consideration in the following cases:
When calling the NOW() function When converting between Unix time and timestamp if the use_local_tz_for_unix_timestamp_conversions flag is TRUE When reading Parquet timestamps written by Hive if the convert_legacy_hive_parquet_utc_timestamps flag is TRUE
TOPN_BYTES_LIMIT @since 3.1 536870912 (512 MB) places a limit on the amount of estimated memory that Impala can process for top-N queries.
Top-N queries don't spill to disk so they have to keep all rows they process in memory.If the Impala planner estimates that a top-N operator will process more bytes than the TOPN_BYTES_LIMIT value, it will replace the top-N operator with the sort operator. Switching to the sort operator allows Impala to spill to disk, thus requiring less memory than top-N, but potentially with performance penalties.
USE_DOP_FOR_COSTING
USE_LOCAL_TZ_FOR_UNIX_TIMESTAMP_CONVERSIONS
UTF8_MODE @since 4.1 FALSE allows string functions to recognize the UTF-8 characters, thus processing strings in a compatible way as other engines.
V_CPU_CORES Removed

$ 参考

Query Options for the SET Statement (opens new window)

ImpalaService.thrift TImpalaQueryOptions (opens new window)

$ DDL

将kudu表从一个impala集群托管到另一个impala集群:

-- 在A集群创建一个Kudu表
CREATE TABLE id_city (
   id INT NOT NULL,
   city STRING NULL,
   PRIMARY KEY (id) 
) PARTITION BY HASH (id) PARTITIONS 4 STORED AS KUDU;

-- 在B集群创建kudu外表
CREATE external TABLE id_city 
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='impala::default.id_city','kudu.master_addresses'='');

-- 将B集群的kudu表改为内表:
ALTER TABLE id_city SET TBLPROPERTIES('EXTERNAL'='false');

-- 将A集群的kudu表改为外表:
ALTER TABLE id_city SET TBLPROPERTIES('EXTERNAL'='true');

-- 删除A集群的kudu表
DROP TABLE id_city;

参考:

CREATE TABLE Statement (opens new window)

ALTER TABLE Statement (opens new window)

$ 问题

PreparedStatement不能使用UPSERT语句的问题

在Version 2.6.8中已被解决(参考 [IMPJ-419] )。

中文乱码

参考:https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_string.html#string (opens new window)

$ SQL Hints

In Impala 2.0 and higher, you can specify the hints inside comments that use either the /* */ or -- notation. Specify a + symbol immediately before the hint name. Recently added hints are only available using the /* */ and -- notation. For clarity, the /* */ and -- styles are used in the syntax and examples throughout this section. With the /* */ or -- notation for hints, specify a + symbol immediately before the first hint name. Multiple hints can be specified separated by commas, for example /* +clustered,shuffle */

UPSERT /* +NOCLUSTERED,NOSHUFFLE */ INTO ...

Starting from CDH 5.12 / Impala 2.9, the INSERT or UPSERT operations into Kudu tables automatically add an exchange and a sort node to the plan that partitions and sorts the rows according to the partitioning/primary key scheme of the target table (unless the number of rows to be inserted is small enough to trigger single node execution). Since Kudu partitions and sorts rows on write, pre-partitioning and sorting takes some of the load off of Kudu and helps large INSERT operations to complete without timing out. However, this default behavior may slow down the end-to-end performance of the INSERT or UPSERT operations. Starting from CDH 5.13 / Impala 2.10, you can use the /* +NOCLUSTERED / and / +NOSHUFFLE */ hints together to disable partitioning and sorting before the rows are sent to Kudu. Additionally, since sorting may consume a large amount of memory, consider setting the MEM_LIMIT query option for those queries.

参考:

https://impala.apache.org/docs/build/html/topics/impala_hints.html (opens new window)

https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/impala_upsert.html (opens new window)

https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_kudu.html (opens new window)

$ Impala flush mode

Use Kudu AUTO_FLUSH_BACKGROUND mode

https://issues.apache.org/jira/browse/IMPALA-4134 (opens new window)

$ Admission Control

$ 配置

对于使用名为default的单个资源池的简单配置 ,可以在命令行上指定配置选项:

配置项 类型 默认值 说明
--queue_wait_timeout_ms int64 60000 请求在队列中等待的超时时间(单位:毫秒)
--default_pool_max_requests int64 -1 在请求排队之前允许运行的最大并发数。是一个软限制,在负载较重时,并发查询的总数可能会略高。负值表示没有限制。
--default_pool_max_queued int64 无限制 在拒绝请求之前允许排队的最大请求数。是一个软限制,在负载较重时,排队查询的总数可能会略高。负值或 0 表示一旦达到最大并发请求,请求总是被拒绝。
--default_pool_mem_limit string "" 在对该资源池的新请求排队之前,该资源池中所有未完成的请求可以使用的最大内存量(跨整个集群)。以字节、兆字节或千兆字节为单位,由一个数字后跟大写或小写后缀b(可选)m、 或g指定。可以为兆字节和千兆字节指定浮点值。还可以通过指定%后缀将其指定为物理内存的百分比。0 或不设置表示无限制。是一个软限制,在负载较重的时候,并发查询使用的总内存可能会略高一些。
--disable_pool_max_requests Boolean false 解除针对每个资源池的最大运行请求的限制。
--disable_pool_mem_limits Boolean false 解除每个资源池的内存限制。

对于具有使用不同设置的多个资源池的高级配置,需要手动设置配置文件:

  • 通过--‑‑fair_scheduler_allocation_path=fair-scheduler.xml指定fair-scheduler.xml配置文件的路径

  • 通过--‑‑llama_site_path=llama-site.xml指定llama-site.xml配置文件的路径

fair-scheduler.xml示例:

<allocations>
    <queue name="root">
        <aclSubmitApps> </aclSubmitApps>
        <queue name="default">
            <maxResources>50000 mb, 0 vcores</maxResources>
            <aclSubmitApps>*</aclSubmitApps>
        </queue>
        <queue name="development">
            <maxResources>200000 mb, 0 vcores</maxResources>
            <aclSubmitApps>user1,user2 dev,ops,admin</aclSubmitApps>
        </queue>
        <queue name="production">
            <maxResources>1000000 mb, 0 vcores</maxResources>
            <aclSubmitApps> ops,admin</aclSubmitApps>
        </queue>
    </queue>
    <queuePlacementPolicy>
        <rule name="specified" create="false"/>
        <rule name="default" />
    </queuePlacementPolicy>
</allocations>

llama-site.xml示例:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>llama.am.throttling.maximum.placed.reservations.root.default</name>
    <value>10</value>
  </property>
  <property>
    <name>llama.am.throttling.maximum.queued.reservations.root.default</name>
    <value>50</value>
  </property>
  <property>
    <!--指定该资源池里运行查询的默认查询选项-->
    <name>impala.admission-control.pool-default-query-options.root.default</name>
    <value>mem_limit=128m,query_timeout_s=20,max_io_buffers=10</value>
  </property>
  <property>
    <!--队列排队的超时时间,单位:毫秒-->
    <name>impala.admission-control.pool-queue-timeout-ms.root.default</name>
    <value>30000</value>
  </property>
  <property>
    <name>impala.admission-control.max-query-mem-limit.root.default.regularPool</name>
    <value>1610612736</value><!--1.5GB-->
  </property>
  <property>
    <name>impala.admission-control.min-query-mem-limit.root.default.regularPool</name>
    <value>52428800</value><!--50MB-->
  </property>
  <property>
    <name>impala.admission-control.clamp-mem-limit-query-option.root.default.regularPool</name>
    <value>true</value>
  </property>
</configuration>

$ 参考

Configuring Admission Control (opens new window)

更新时间: 11/7/2022, 1:59:51 AM