$ Impala
$ 端口
客户端 | 端口 |
---|---|
impala-shell, ODBC driver | 21000 |
JDBC, ODBC driver | 21050 |
$ JDBC连接方式
使用Cloudera JDBC Connector
com.cloudera.impala.jdbc41.Driver
,连接串格式:jdbc:impala://Host:Port[/Schema];Property1=Value;Property2=Value;...
使用Hive JDBC Driver
org.apache.hive.jdbc.HiveDriver
:简单连接串(注意21050端口号后面的“/”,以及“;auth=noSasl”,缺少会出错):
jdbc:hive2://myhost.example.com:21050/;auth=noSasl
kerberos连接串:
jdbc:hive2://myhost.example.com:21050/;principal=impala/myhost.example.com@H2.EXAMPLE.COM
LDAP连接串:
jdbc:hive2://host:port/db_name;user=ldap_userid;password=ldap_password
参考:
$ 负载均衡
负载均衡主要适用于impalad
进程。
负载均衡有下面的优势:
应用程序只需要连接到一个地址
实现高可用连接
coordinator节点比其他节点需要更多CPU和内存。代理服务器可以调度查询从而使每个连接使用不同的coordinator节点。
选择负载均衡算法:
$ Leastconn
使用最少的连接将会话连接到coordinator节点,通常用于由许多独立的,短期运行的查询。
$ Source IP Persistence
来自相同IP地址的会话始终会到达相同的协调器。对于混合着查询和DDL语句的情况是一个不错的选择,例如CREATE TABLE
和 ALTER TABLE
。由于DDL语句中的元数据更改需要花费时间才能在整个群集中传播,因此在这种情况下,最好使用Source IP Persistence
算法。如果无法选择Source IP Persistence
,请通过同一会话运行DDL和依赖于DDL结果的后续查询,例如通过运行impala-shell -f script_file
通过一个会话提交多个语句。
$ Round-robin
将连接分布到所有coordinator节点,通常不建议用于Impala。
你可能需要执行基准测试和负载测试,以确定哪个设置对你的用例是最优的。总是使用两种负载平衡算法:Hue使用Source IP Persistence
,其他的场景使用Leastconn
。
$ 参考
Using Impala through a Proxy for High Availability (opens new window)
$ 账号授权
SHOW语句与权限相关语法:
SHOW ROLES
SHOW CURRENT ROLES
SHOW ROLE GRANT GROUP group_name
SHOW GRANT ROLE role_name
SHOW GRANT USER user_name
SHOW GRANT USER user_name ON SERVER
SHOW GRANT USER user_name ON DATABASE database_name
SHOW GRANT USER user_name ON TABLE table_name
SHOW GRANT USER user_name ON URI uri
GRANT语句语法:
GRANT ROLE role_name TO GROUP group_name
GRANT privilege ON object_type object_name
TO [ROLE] roleName
[WITH GRANT OPTION]
privilege ::= ALL | CREATE | INSERT | REFRESH | SELECT | SELECT(column_name)
object_type ::= SERVER | URI | DATABASE | TABLE
REVOKE语句语法:
REVOKE ROLE role_name FROM GROUP group_name
REVOKE [GRANT OPTION FOR] privilege ON object_type object_name
FROM [ROLE] role_name
privilege ::= ALL | CREATE | INSERT | REFRESH | SELECT | SELECT(column_name)
object_type ::= SERVER | URI | DATABASE | TABLE
兼容性:
不需要HDFS权限
Kudu注意事项:
仅在Server上有ALL权限的用户可以创建外部Kudu表
DELETE, UPDATE, 和 UPSERT 操作要求ALL权限
参考:
GRANT Statement (CDH 5.2 or higher only) (opens new window)
$ 数据类型
类型 | 范围 | 备注 |
---|---|---|
TINYINT | -128 .. 127 | |
SMALLINT | -32768 .. 32767 | |
INT | -2147483648 .. 2147483647 | |
BIGINT | -9223372036854775808 .. 9223372036854775807 | |
FLOAT | 正负1.40129846432481707e-45 .. 3.40282346638528860e+38 | |
DOUBLE,REAL | 4.94065645841246544e-324d .. 1.79769313486231570e+308 | |
DECIMAL[(precision[, scale])] | precision:between 1 and 38(default 9)scale:default 0 | |
STRING | 2 GB | |
CHAR | ||
VARCHAR | ||
TIMESTAMP | 1400-01-01 to 9999-12-31 | UTC时区 |
BOOLEAN |
类型转换:
TINYINT | SMALLINT | INT | BIGINT | FLOAT | DOUBLE | DECIMAL | STRING | CHAR | VARCHAR | TIMESTAMP | BOOLEAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
TINYINT | - | Auto | Auto | Auto | Auto | Auto | Cast | Cast | ||||
SMALLINT | Cast | - | Auto | Auto | Auto | Auto | Cast | Cast | ||||
INT | Cast | Cast | - | Auto | Auto | Auto | Cast | Cast | ||||
BIGINT | Cast | Cast | - | Auto | Auto | Cast | Cast | |||||
FLOAT | Cast | Cast | Cast | Cast | - | Auto | No | Cast | Cast | Cast | ||
DOUBLE | Cast | Cast | Cast | Cast | Cast | - | No | Cast | Cast | Cast | ||
DECIMAL | Cast | Cast | Cast | Cast | Auto | Auto | - | Cast | Cast | Cast | ||
STRING | Cast | Cast | Cast | Cast | Cast | Cast | - | Auto | No | |||
CHAR | - | |||||||||||
VARCHAR | - | |||||||||||
TIMESTAMP | - | |||||||||||
BOOLEAN | Cast | - |
参考:
Impala Type Conversion Functions (opens new window)
$ Query Option
名称 | 适用角色 | 标签 | 默认值 | 候选值 | 说明 |
---|---|---|---|---|---|
ABORT_JAVA_UDF_ON_EXCEPTION | |||||
ABORT_ON_DEFAULT_LIMIT_EXCEEDED | Removed | ||||
ABORT_ON_ERROR | false | When this option is enabled, Impala cancels a query immediately when any of the nodes encounters an error, rather than continuing and possibly returning incomplete results. | |||
ALLOW_ERASURE_CODED_FILES | @since 3.1 | false | When the ALLOW_ERASURE_CODED_FILES query option is set to FALSE, Impala returns an error when a query requires scanning an erasure coded file. | ||
ALLOW_UNSUPPORTED_FORMATS | Removed | ||||
ANALYTIC_RANK_PUSHDOWN_THRESHOLD | |||||
APPX_COUNT_DISTINCT | @since 2.0 | false | When the APPX_COUNT_DISTINCT query option is set to TRUE , Impala implicitly converts COUNT(DISTINCT) operations to the NDV() function calls. | ||
ASYNC_CODEGEN | |||||
BATCH_SIZE | @since 2.11 | 0(1024) | 0-65536 | Number of rows evaluated at a time by SQL operators.Using a large number improves responsiveness, especially for scan operations, at the cost of a higher memory footprint. | |
BROADCAST_BYTES_LIMIT | 34359738368 (32 GB) | 0:ignored | The Impala planner may in rare cases make a bad choice to broadcast a large table or intermediate result and encounter performance problems due to high memory pressure. Setting this limit will make the planner pick a partition based hash join instead of broadcast and avoid such performance problems. | ||
BROADCAST_TO_PARTITION_FACTOR | |||||
BUFFER_POOL_LIMIT | the lower of 80% of the MEM_LIMIT setting, or the MEM_LIMIT setting minus 100 MB. | -- Set an absolute value. set buffer_pool_limit=8GB; -- Set a relative value based on the MEM_LIMIT setting. set buffer_pool_limit=80%; | Defines a limit on the amount of memory that a query can allocate from the internal buffer pool. The value for this limit applies to the memory on each host, not the aggregate memory across the cluster. Typically not changed by users, except during diagnosis of out-of-memory errors during queries. | ||
CLIENT_IDENTIFIER | |||||
COMPRESSION_CODEC | SNAPPY | SET COMPRESSION_CODEC=codec_name; // Supported for all codecs. SET COMPRESSION_CODEC=codec_name:compression_level; // Only supported for ZSTD.The allowed values for this query option are SNAPPY (the default), GZIP , ZSTD , LZ4 , and NONE . | When Impala writes Parquet data files using the INSERT statement, the underlying compression is controlled by the COMPRESSION_CODEC query option. | ||
COMPUTE_COLUMN_MINMAX_STATS | |||||
COMPUTE_STATS_MIN_SAMPLE_SIZE | |||||
CONVERT_LEGACY_HIVE_PARQUET_UTC_TIMESTAMPS | |||||
CPU_LIMIT_S | |||||
DEBUG_ACTION | |||||
DECIMAL_V2 | A query option that changes behavior related to the DECIMAL data type. Set this option to FALSE for backward compatibility to Impala 2.x. | ||||
DEFAULT_FILE_FORMAT | @since 3.2 | TEXT (0)RC_FILE (1)SEQUENCE_FILE (2)AVRO (3)PARQUET (4)KUDU (5)ORC (6) | Use the DEFAULT_FILE_FORMAT query option to set the default table file format. | ||
DEFAULT_HINTS_INSERT_STATEMENT | none | SET DEFAULT_HINTS_INSERT_STATEMENT=CLUSTERED; SET DEFAULT_HINTS_INSERT_STATEMENT=SHUFFLE; SET DEFAULT_HINTS_INSERT_STATEMENT=NOCLUSTERED:NOSHUFFLE; | The DEFAULT_HINTS_INSERT_STATEMENT query option sets the default hints for the INSERT statements with no optimizer hint specified.The default hints apply to the HDFS and Kudu table formats and are ignored for the HBase table format. | ||
DEFAULT_JOIN_DISTRIBUTION_MODE | BROADCAST (equivalent to 0) or SHUFFLE (equivalent to 1) | This option determines the join distribution that Impala uses when any of the tables involved in a join query is missing statistics. | |||
DEFAULT_NDV_SCALE | |||||
DEFAULT_ORDER_BY_LIMIT | Removed | ||||
DEFAULT_SPILLABLE_BUFFER_SIZE | @since 2.10.0 | 2097152 (2 MB) | Specifies the default size for a memory buffer used when the spill-to-disk mechanism is activated. | ||
DEFAULT_TRANSACTIONAL_TYPE | @since 3.3 | NONE | NONE : The table will not be created as transactional. INSERT_ONLY : The table will be created as transactional. | Use the DEFAULT_TRANSACTIONAL_TYPE query option to create insert-only transactional tables by default.If either table properties, transactional or transactional_properties , are set, this query option is ignored. | |
DELETE_STATS_IN_TRUNCATE | @since 4.0 | TRUE | This query option DELETE_STATS_IN_TRUNCATE can be used to delete or retain table statistics. | ||
DISABLE_CACHED_READS | |||||
DISABLE_CODEGEN | false | it's used to work around any issues with Impala's runtime code generation | |||
DISABLE_CODEGEN_ROWS_THRESHOLD | @since 2.10.0 | 50000 | This setting controls the cutoff point (in terms of number of rows processed per Impala daemon) below which Impala disables native code generation for the whole query.If a query uses the complex data types STRUCT, ARRAY, or MAP, then codegen is never automatically disabled regardless of the DISABLE_CODEGEN_ROWS_THRESHOLD setting. | ||
DISABLE_DATA_CACHE | |||||
DISABLE_HBASE_NUM_ROWS_ESTIMATE | FALSE | While generating a plan for an HBase query, the planner samples the underlying HBase tables to estimate their row count and row size, and the sampling can negatively impact the planning time. When the HBase table stats does not change much in short time, disable the sampling by setting the DISABLE_HBASE_NUM_ROWS_ESTIMATE query option to TRUE. And Impala planner will fall back to using Hive Metastore (HMS) table stats instead. | |||
DISABLE_HDFS_NUM_ROWS_ESTIMATE | |||||
DISABLE_OUTERMOST_TOPN | |||||
DISABLE_ROW_RUNTIME_FILTERING | |||||
DISABLE_STREAMING_PREAGGREGATIONS | false | Turns off the "streaming preaggregation" optimization that is available in Impala 2.5 and higher. | |||
DISABLE_UNSAFE_SPILLS | false | Enable this option if you prefer to have queries fail when they exceed the Impala memory limit, rather than write temporary data to disk. | |||
ENABLE_ASYNC_DDL_EXECUTION | |||||
ENABLE_ASYNC_LOAD_DATA_EXECUTION | |||||
ENABLE_CNF_REWRITES | |||||
ENABLE_DISTINCT_SEMI_JOIN_OPTIMIZATION | |||||
ENABLE_EXPR_REWRITES | true | The ENABLE_EXPR_REWRITES query option controls whether to enable or disable the query compile time optimizations. | |||
ENABLE_KUDU_TRANSACTION | |||||
ENABLE_OUTER_JOIN_TO_INNER_TRANSFORMATION | |||||
ENABLE_REPLAN | |||||
ENABLED_RUNTIME_FILTER_TYPES | |||||
EXEC_SINGLE_NODE_ROWS_THRESHOLD | @since 2.10 | 100 | This setting controls the cutoff point (in terms of number of rows scanned) below which Impala treats a query as a "small" query, turning off optimizations such as parallel execution and native code generation.This setting applies to queries where the number of rows processed can be accurately determined, either through table and column statistics, or by the presence of a LIMIT clause. If Impala cannot accurately estimate the number of rows, then this setting does not apply.For a query that is determined to be "small", all work is performed on the coordinator node. | ||
EXEC_TIME_LIMIT_S | @since 2.12 | 0 (no time limit ) | The EXEC_TIME_LIMIT_S query option sets a time limit on query execution.The time limit only starts once the query is executing. Time spent planning the query, scheduling the query, or in admission control is not counted towards the execution time limit. | ||
EXPAND_COMPLEX_TYPES | |||||
EXPLAIN_LEVEL | 1 | 0 or MINIMAL, 1 or STANDARD, 2 or EXTENDED, 3 or VERBOSE | Controls the amount of detail provided in the output of the EXPLAIN statement. | ||
FETCH_ROWS_TIMEOUT_MS | @since 3.4 | 10000 | Use the FETCH_ROWS_TIMEOUT_MS query option to control how long Impala waits for query results when clients fetch rows. | ||
HBASE_CACHE_BLOCKS | false | Setting this option is equivalent to calling the setCacheBlocks method of the class org.apache.hadoop.hbase.client.Scan, in an HBase Java application. | |||
HBASE_CACHING | 0 | Setting this option is equivalent to calling the setCaching method of the class org.apache.hadoop.hbase.client.Scan, in an HBase Java application. | |||
IDLE_SESSION_TIMEOUT | @since 2.12 | 0 | The IDLE_SESSION_TIMEOUT query option sets the time in seconds after which an idle session is cancelled. A session is idle when no activity is occurring for any of the queries in that session, and the session has not started any new queries. Once a session is expired, you cannot issue any new query requests to it. The session remains open, but the only operation you can perform is to close it. | ||
JOIN_ROWS_PRODUCED_LIMIT | @since 4.0 | 0 | The JOIN_ROWS_PRODUCED_LIMIT query option limits the number of join rows produced by a join node and is used to prevent runaway join queries. | ||
KUDU_READ_MODE | @since 3.1 | DEFAULT | "DEFAULT""READ_LATEST""READ_AT_SNAPSHOT" | The KUDU_READ_MODE query option allows you to set a desired consistency level for scans of Kudu tables. | |
KUDU_REPLICA_SELECTION | |||||
KUDU_SNAPSHOT_READ_TIMESTAMP_MICROS | |||||
LOCK_MAX_WAIT_TIME_S | |||||
LIVE_PROGRESS | @since 2.3.0 | TRUE | When you run a query, the live progress bar appears in the output of a query. The bar shows roughly the percentage of completed processing. When the query finishes, the live progress bar disappears from the console output.Starting in Impala 3.1, the summary output also includes the queuing status consisting of whether the query was queued and what was the latest queuing reason. | ||
LIVE_SUMMARY | @since 2.3.0 | FALSE | Impala displays the same output as the SUMMARY command for queries submitted through the impala-shell command, with the measurements updated in real time as the query progresses. When the query finishes, the final SUMMARY output remains visible in the impala-shell console output. | ||
MAX_CNF_EXPRS | |||||
MAX_ERRORS | 0 (meaning 1000 errors) | Maximum number of non-fatal errors for any particular query that are recorded in the Impala log file. | |||
MAX_FS_WRITERS | |||||
MAX_IO_BUFFERS | Removed | ||||
MAX_MEM_ESTIMATE_FOR_ADMISSION | @since 3.1 | Use the MAX_MEM_ESTIMATE_FOR_ADMISSION query option to set an upper limit on the memory estimates of a query as a workaround for over-estimates precluding a query from being admitted. | |||
MAX_NUM_RUNTIME_FILTERS | |||||
MAX_RESULT_SPOOLING_MEM | @since 3.4 | 100 MB | Use the MAX_RESULT_SPOOLING_MEM query option to set the maximum amount of memory used when spooling query results.If the amount of memory exceeds this value when spooling query results, all memory will most likely be spilled to disk.The MAX_RESULT_SPOOLING_MEM query option is applicable only when query result spooling is enabled with the SPOOL_QUERY_RESULTS query option set to TRUE . | ||
MAX_ROW_SIZE | |||||
MAX_SCAN_RANGE_LENGTH | 0(indicates backend default, which is the same as the HDFS block size for each table.) | The MAX_RESULT_SPOOLING_MEM query option is applicable only when query result spooling is enabled with the SPOOL_QUERY_RESULTS query option set to TRUE . | |||
MAX_SPILLED_RESULT_SPOOLING_MEM | @since 3.4 | 1 GB | Use the MAX_SPILLED_RESULT_SPOOLING_MEM query option to set the maximum amount of memory that can be spilled when spooling query results. | ||
MAX_STATEMENT_LENGTH_BYTES | |||||
MEM_LIMIT | 0 (unlimited) | The MEM_LIMIT query option defines the maximum amount of memory a query can allocate on each node. | |||
MEM_LIMIT_EXECUTORS | |||||
MIN_SPILLABLE_BUFFER_SIZE | @since 2.10.0 | 65536 (64 KB) | Specifies the minimum size for a memory buffer used when the spill-to-disk mechanism is activated, for example for queries against a large table with no statistics, or large join operations. | ||
MINMAX_FILTER_FAST_CODE_PATH | |||||
MINMAX_FILTER_PARTITION_COLUMNS | |||||
MINMAX_FILTER_SORTED_COLUMNS | |||||
MINMAX_FILTER_THRESHOLD | |||||
MINMAX_FILTERING_LEVEL | |||||
MT_DOP | |||||
NOW_STRING | |||||
NUM_NODES | 0 | 0 (meaning all nodes) or 1 (meaning all work is done on the coordinator node) | Limit the number of nodes that process a query, typically during debugging. | ||
NUM_REMOTE_EXECUTOR_CANDIDATES | |||||
NUM_ROWS_PRODUCED_LIMIT | |||||
NUM_SCANNER_THREADS | |||||
OPTIMIZE_PARTITION_KEY_SCANS | @since 2.5.0 | false | Enables a fast code path for queries that apply simple aggregate functions to partition key columns: MIN(key_column) , MAX(key_column) , or COUNT(DISTINCT key_column) . | ||
OPTIMIZE_SIMPLE_LIMIT | |||||
ORC_ASYNC_READ | |||||
ORC_READ_STATISTICS | |||||
ORC_SCHEMA_RESOLUTION | |||||
PARQUET_ANNOTATE_STRINGS_UTF8 | @since 2.6.0 | false | Causes Impala INSERT and CREATE TABLE AS SELECT statements to write Parquet files that use the UTF-8 annotation for STRING columns. | ||
PARQUET_ARRAY_RESOLUTION | @since 2.9.0 | THREE_LEVEL | THREE_LEVEL TWO_LEVEL TWO_LEVEL_THEN_THREE_LEVEL | The PARQUET_ARRAY_RESOLUTION query option controls the behavior of the indexed-based resolution for nested arrays in Parquet. | |
PARQUET_BLOOM_FILTER_WRITE | |||||
PARQUET_BLOOM_FILTERING | |||||
PARQUET_DICTIONARY_FILTERING | @since 2.9.0 | true | The PARQUET_DICTIONARY_FILTERING query option controls whether Impala uses dictionary filtering for Parquet files. | ||
PARQUET_DICTIONARY_RUNTIME_FILTER_ENTRY_LIMIT | |||||
PARQUET_FALLBACK_SCHEMA_RESOLUTION | @since 2.6.0 | POSITION | POSITION (0) NAME (1) | The PARQUET_FALLBACK_SCHEMA_RESOLUTION query option allows Impala to look up columns within Parquet files by column name, rather than column order, when necessary. | |
PARQUET_FILE_SIZE | 0 (produces files with a target size of 256 MB; files might be larger for very wide tables) | the maximum value for this setting is 1 gigabyte (1g ) | Specifies the maximum size of each Parquet data file produced by Impala INSERT statements. | ||
PARQUET_LATE_MATERIALIZATION_THRESHOLD | |||||
PARQUET_OBJECT_STORE_SPLIT_SIZE | 256 MB | The value must be greater than or equal to 1 MB. | Use the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to control the Parquet format specific split sizes on non-block based stores, | ||
PARQUET_PAGE_ROW_COUNT_LIMIT | @since 3.3 | set the maximum number of rows that can be written on a single Parquet data page. | |||
PARQUET_READ_PAGE_INDEX | to disable or enable using the Parquet page index during scans. Of the types: Boolean, Integer, Decimal, String, Timestamp. For simple predicates of the forms: <slot> <op> <constant> or <constant> <op> <slot> , where <op> is LT, LE, GE, GT, and EQ | ||||
PARQUET_READ_STATISTICS | @since 2.9.0 | true | true (1 )false (0 ) | set the maximum number of rows that can be written on a single Parquet data page. Of the numerical types for the old version of the statistics: Boolean, Integer, Float Of the types for the new version of the statistics (starting in IMPALA 2.8): Boolean, Integer, Float, Decimal, String, Timestamp For simple predicates of the forms: <slot> <op> <constant> or <constant> <op> <slot> , where <op> is LT, LE, GE, GT, and EQ | |
PARQUET_TIMESTAMP_TYPE | |||||
PARQUET_WRITE_PAGE_INDEX | @since 3.3 | TRUE | true (1 )false (0 ) | to disable or enable the Parquet page index writing. | |
PLANNER_TESTCASE_MODE | |||||
PREAGG_BYTES_LIMIT | |||||
PREFETCH_MODE | @since 2.6.0 | 1 (equivalent to HT_BUCKET ) | 0(NONE )1(HT_BUCKET) | Determines whether the prefetching optimization is applied during join query processing. | |
QUERY_TIMEOUT_S | @since 2.0.0 | 0 | For queries, this means the query has results ready but is waiting for a client to fetch the data. For sessions, this means that no query has been submitted for some period of time. | ||
REFRESH_UPDATED_HMS_PARTITIONS | @since 4.0.0 | FALSE | refresh any updated HMS partitions. | ||
REPLICA_PREFERENCE | @since 2.7.0 | CACHE_LOCAL (0) | CACHE_LOCAL (0 ), DISK_LOCAL (2 ), REMOTE (4 ) | This allows Impala to schedule reads to suboptimal replicas (e.g. local in the presence of cached ones) in order to distribute the work across more executor nodes. | |
REPORT_SKEW_LIMIT | |||||
REQUEST_POOL | empty | The pool or queue name that queries should be submitted to. Only applies when you enable the Impala admission control feature. | |||
RESERVATION_REQUEST_TIMEOUT | Removed | ||||
RESOURCE_TRACE_RATIO | @since 3.2 | 0 | from 0 to 1 are allowed | specifies the ratio of queries where the CPU usage info will be included in the profiles. | |
RETRY_FAILED_QUERIES | @since 4.0 | TRUE | control whether or not queries are transparently retried on cluster membership changes. This feature supports retrying the entire query and NOT the individual query fragments. INSERT and DDL queries will NOT be retried. query retry will be skipped if the query has returned any results to the client. | ||
RM_INITIAL_MEM | Removed | ||||
RUNTIME_BLOOM_FILTER_SIZE | @since 2.5.0 | 1048576 (1 MB) | Maximum: 16 MB | In Impala 2.6 and higher, this query option only applies as a fallback, when statistics are not available. | |
RUNTIME_FILTER_ERROR_RATE | |||||
RUNTIME_FILTER_MAX_SIZE | @since 2.6.0 | 0 (meaning use the value from the corresponding impalad startup option) | defines the maximum size for a filter.Filter sizes are rounded up to the nearest power of two. | ||
RUNTIME_FILTER_MIN_SIZE | @since 2.6.0 | 0 (meaning use the value from the corresponding impalad startup option) | defines the minimum size for a filter.Filter sizes are rounded up to the nearest power of two. | ||
RUNTIME_FILTER_MODE | @since 2.5.0 | 2 | numeric (0, 1, 2) or corresponding mnemonic strings (OFF , LOCAL , GLOBAL ). | It turns this feature on and off, and controls how extensively the filters are transmitted between hosts. | |
RUNTIME_FILTER_WAIT_TIME_MS | @since 2.5.0 | 0 (meaning use the value from the corresponding impalad startup option) | specifies a time in milliseconds that each scan node waits for runtime filters to be produced by other plan fragments. | ||
RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT | |||||
S3_SKIP_INSERT_STAGING | @since 2.6.0 | true | Speeds up INSERT operations on tables or partitions residing on the Amazon S3 filesystem. | ||
SCAN_BYTES_LIMIT | @since 3.1 | 0 (no limit) | The SCAN_BYTES_LIMIT query option sets a limit on the bytes scanned by HDFS and HBase SCAN operations. | ||
SCAN_NODE_CODEGEN_THRESHOLD | Removed | ||||
SCHEDULE_RANDOM_REPLICA | @since 2.5.0 | false | The SCHEDULE_RANDOM_REPLICA query option fine-tunes the scheduling algorithm for deciding which host processes each HDFS data block or Kudu tablet to reduce the chance of CPU hotspots.The SCHEDULE_RANDOM_REPLICA query option only applies to tables and partitions that are not enabled for the HDFS caching. | ||
SCRATCH_LIMIT | -1 (amount of spill space is unlimited) | Specifies the maximum amount of disk storage, in bytes, that any Impala query can consume on any host using the "spill to disk" mechanism that handles queries that exceed the memory limit. A value of zero turns off the spill to disk feature for queries in the current session, causing them to fail immediately if they exceed the memory limit. | |||
SEQ_COMPRESSION_MODE | Removed | ||||
SHOW_COLUMN_MINMAX_STATS | |||||
SHUFFLE_DISTINCT_EXPRS | false | controls the shuffling behavior when a query has both grouping and distinct expressions. Impala can optionally include the distinct expressions in the hash exchange to spread the data among more nodes. However, this plan requires one more hash exchange phase. It is recommended that you turn off this option if the NDVs of the grouping expressions are high. | |||
SORT_RUN_BYTES_LIMIT | |||||
SPOOL_ALL_RESULTS_FOR_RETRIES | |||||
SPOOL_QUERY_RESULTS | @since 3.4 | FALSE | Query result spooling controls how rows are returned to the client. | ||
STATEMENT_EXPRESSION_LIMIT | |||||
STRICT_MODE | |||||
SYNC_DDL | false | causes any DDL operation such as CREATE TABLE or ALTER TABLE to return only when the changes have been propagated to all other Impala nodes in the cluster by the Impala catalog service.INSERT statements also delay their completion until all the underlying data and metadata changes are propagated to all Impala nodes and this option applies to all filesystem-based tables. | |||
TARGETED_KUDU_SCAN_RANGE_LENGTH | |||||
TEST_REPLAN | |||||
THREAD_RESERVATION_AGGREGATE_LIMIT | @since 3.1 | 0 (no limit) | limits the number of reserved threads for a query across all nodes on which it is executing.Queries that have more threads than this threshold are rejected by Impala’s admission controller before they start executing. | ||
THREAD_RESERVATION_LIMIT | |||||
TIMEZONE | @since 3.1 | The system time zone where the Coordinator Impalad runs | a canonical code or a time zone name defined in IANA Time Zone Database (opens new window). | defines the timezone used for conversions between UTC and the local time. If not set, Impala uses the system time zone where the Coordinator Impalad runs. Impala takes the timezone into a consideration in the following cases: When calling the NOW() function When converting between Unix time and timestamp if the use_local_tz_for_unix_timestamp_conversions flag is TRUE When reading Parquet timestamps written by Hive if the convert_legacy_hive_parquet_utc_timestamps flag is TRUE | |
TOPN_BYTES_LIMIT | @since 3.1 | 536870912 (512 MB) | places a limit on the amount of estimated memory that Impala can process for top-N queries. Top-N queries don't spill to disk so they have to keep all rows they process in memory.If the Impala planner estimates that a top-N operator will process more bytes than the TOPN_BYTES_LIMIT value, it will replace the top-N operator with the sort operator. Switching to the sort operator allows Impala to spill to disk, thus requiring less memory than top-N, but potentially with performance penalties. | ||
USE_DOP_FOR_COSTING | |||||
USE_LOCAL_TZ_FOR_UNIX_TIMESTAMP_CONVERSIONS | |||||
UTF8_MODE | @since 4.1 | FALSE | allows string functions to recognize the UTF-8 characters, thus processing strings in a compatible way as other engines. | ||
V_CPU_CORES | Removed |
$ 参考
Query Options for the SET Statement (opens new window)
ImpalaService.thrift TImpalaQueryOptions (opens new window)
$ DDL
将kudu表从一个impala集群托管到另一个impala集群:
-- 在A集群创建一个Kudu表
CREATE TABLE id_city (
id INT NOT NULL,
city STRING NULL,
PRIMARY KEY (id)
) PARTITION BY HASH (id) PARTITIONS 4 STORED AS KUDU;
-- 在B集群创建kudu外表
CREATE external TABLE id_city
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='impala::default.id_city','kudu.master_addresses'='');
-- 将B集群的kudu表改为内表:
ALTER TABLE id_city SET TBLPROPERTIES('EXTERNAL'='false');
-- 将A集群的kudu表改为外表:
ALTER TABLE id_city SET TBLPROPERTIES('EXTERNAL'='true');
-- 删除A集群的kudu表
DROP TABLE id_city;
参考:
CREATE TABLE Statement (opens new window)
ALTER TABLE Statement (opens new window)
$ 问题
PreparedStatement不能使用UPSERT语句的问题
在Version 2.6.8中已被解决(参考 [IMPJ-419] )。
中文乱码
$ SQL Hints
In Impala 2.0 and higher, you can specify the hints inside comments that use either the
/* */
or--
notation. Specify a+
symbol immediately before the hint name. Recently added hints are only available using the/* */
and--
notation. For clarity, the/* */
and--
styles are used in the syntax and examples throughout this section. With the/* */
or--
notation for hints, specify a+
symbol immediately before the first hint name. Multiple hints can be specified separated by commas, for example/* +clustered,shuffle */
UPSERT /* +NOCLUSTERED,NOSHUFFLE */ INTO ...
Starting from CDH 5.12 / Impala 2.9, the INSERT or UPSERT operations into Kudu tables automatically add an exchange and a sort node to the plan that partitions and sorts the rows according to the partitioning/primary key scheme of the target table (unless the number of rows to be inserted is small enough to trigger single node execution). Since Kudu partitions and sorts rows on write, pre-partitioning and sorting takes some of the load off of Kudu and helps large INSERT operations to complete without timing out. However, this default behavior may slow down the end-to-end performance of the INSERT or UPSERT operations. Starting from CDH 5.13 / Impala 2.10, you can use the /* +NOCLUSTERED / and / +NOSHUFFLE */ hints together to disable partitioning and sorting before the rows are sent to Kudu. Additionally, since sorting may consume a large amount of memory, consider setting the MEM_LIMIT query option for those queries.
参考:
https://impala.apache.org/docs/build/html/topics/impala_hints.html (opens new window)
https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_kudu.html (opens new window)
$ Impala flush mode
Use Kudu AUTO_FLUSH_BACKGROUND mode
https://issues.apache.org/jira/browse/IMPALA-4134 (opens new window)
$ Admission Control
$ 配置
对于使用名为default
的单个资源池的简单配置 ,可以在命令行上指定配置选项:
配置项 | 类型 | 默认值 | 说明 |
---|---|---|---|
--queue_wait_timeout_ms | int64 | 60000 | 请求在队列中等待的超时时间(单位:毫秒) |
--default_pool_max_requests | int64 | -1 | 在请求排队之前允许运行的最大并发数。是一个软限制,在负载较重时,并发查询的总数可能会略高。负值表示没有限制。 |
--default_pool_max_queued | int64 | 无限制 | 在拒绝请求之前允许排队的最大请求数。是一个软限制,在负载较重时,排队查询的总数可能会略高。负值或 0 表示一旦达到最大并发请求,请求总是被拒绝。 |
--default_pool_mem_limit | string | "" | 在对该资源池的新请求排队之前,该资源池中所有未完成的请求可以使用的最大内存量(跨整个集群)。以字节、兆字节或千兆字节为单位,由一个数字后跟大写或小写后缀b (可选)m 、 或g 指定。可以为兆字节和千兆字节指定浮点值。还可以通过指定% 后缀将其指定为物理内存的百分比。0 或不设置表示无限制。是一个软限制,在负载较重的时候,并发查询使用的总内存可能会略高一些。 |
--disable_pool_max_requests | Boolean | false | 解除针对每个资源池的最大运行请求的限制。 |
--disable_pool_mem_limits | Boolean | false | 解除每个资源池的内存限制。 |
对于具有使用不同设置的多个资源池的高级配置,需要手动设置配置文件:
通过
--‑‑fair_scheduler_allocation_path=fair-scheduler.xml
指定fair-scheduler.xml
配置文件的路径通过
--‑‑llama_site_path=llama-site.xml
指定llama-site.xml
配置文件的路径
fair-scheduler.xml示例:
<allocations>
<queue name="root">
<aclSubmitApps> </aclSubmitApps>
<queue name="default">
<maxResources>50000 mb, 0 vcores</maxResources>
<aclSubmitApps>*</aclSubmitApps>
</queue>
<queue name="development">
<maxResources>200000 mb, 0 vcores</maxResources>
<aclSubmitApps>user1,user2 dev,ops,admin</aclSubmitApps>
</queue>
<queue name="production">
<maxResources>1000000 mb, 0 vcores</maxResources>
<aclSubmitApps> ops,admin</aclSubmitApps>
</queue>
</queue>
<queuePlacementPolicy>
<rule name="specified" create="false"/>
<rule name="default" />
</queuePlacementPolicy>
</allocations>
llama-site.xml示例:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>llama.am.throttling.maximum.placed.reservations.root.default</name>
<value>10</value>
</property>
<property>
<name>llama.am.throttling.maximum.queued.reservations.root.default</name>
<value>50</value>
</property>
<property>
<!--指定该资源池里运行查询的默认查询选项-->
<name>impala.admission-control.pool-default-query-options.root.default</name>
<value>mem_limit=128m,query_timeout_s=20,max_io_buffers=10</value>
</property>
<property>
<!--队列排队的超时时间,单位:毫秒-->
<name>impala.admission-control.pool-queue-timeout-ms.root.default</name>
<value>30000</value>
</property>
<property>
<name>impala.admission-control.max-query-mem-limit.root.default.regularPool</name>
<value>1610612736</value><!--1.5GB-->
</property>
<property>
<name>impala.admission-control.min-query-mem-limit.root.default.regularPool</name>
<value>52428800</value><!--50MB-->
</property>
<property>
<name>impala.admission-control.clamp-mem-limit-query-option.root.default.regularPool</name>
<value>true</value>
</property>
</configuration>